What will happen to PDB format files once four-character PDB IDs have been consumed?
PDB format files will not be provided for PDB entries deposited after four-character PDB IDs have been consumed. Best-effort PDB bundle files will no longer be provided for entries issued with extended PDB IDs.
What is the format of extended PDB IDs?
The format of extended PDB IDs is prefix “pdb_” followed by eight alphanumeric characters, e.g., pdb_10021abc. This PDB ID format will enable text mining detection of PDB entries in the published literature and allow for more informative and transparent delivery of revised data files.
How does one derive the new PDB IDs from old IDs?
All existing four-character PDB IDs will be extended by prefixing “pdb_0000” to the ID, e.g., PDB ID “1abc” will be listed as “pdb_00001abc” in the _database_2.pdbx_database_accession data item.
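The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not wwPDB code: the function name extend_pdb_id is hypothetical, and lowercasing the input is an assumption made so that the result matches the lowercase form used in the examples.

```python
def extend_pdb_id(legacy_id: str) -> str:
    """Return the extended form of a four-character PDB ID.

    Assumption: input is normalized to lowercase before prefixing.
    """
    code = legacy_id.strip().lower()
    # Only length and character class are checked here; this sketch does
    # not verify that the ID was ever actually issued.
    if len(code) != 4 or not code.isascii() or not code.isalnum():
        raise ValueError(f"not a four-character PDB ID: {legacy_id!r}")
    return "pdb_0000" + code


print(extend_pdb_id("1abc"))  # pdb_00001abc
```

The function rejects anything that is not exactly four ASCII alphanumeric characters, so an already extended ID is not prefixed a second time.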
How should PDB IDs be cited in manuscripts or provided to journals?
As long as PDB entries are issued with 4-character PDB IDs, depositors should submit the 4-character PDB IDs to journals and cite them in manuscripts for accurate DOI linking.
When all 4-character PDB IDs have been exhausted, all new PDB entries, including IHM structures in PDB-IHM, will be issued only 12-character PDB IDs, and the corresponding PDB DOIs will be formatted as 10.2210/[extended_PDB_ID]/pdb for accurate cross-referencing to wwPDB DOI landing pages. For example, a PDB entry with ID “pdb_10001xyz” will have the DOI https://doi.org/10.2210/pdb_10001xyz/pdb.
At that point, depositors should submit 12-character PDB IDs to journals and cite extended PDB IDs in manuscripts.
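As a hedged illustration of the DOI pattern for entries issued with extended PDB IDs, the following Python sketch builds the DOI link from an ID. The function name doi_url_for is hypothetical; the validation pattern is the one given in the PDBx/mmCIF dictionary (pdb_[a-z0-9]{8}).

```python
import re

# Pattern for extended PDB IDs as defined in the PDBx/mmCIF dictionary.
EXTENDED_ID = re.compile(r"pdb_[a-z0-9]{8}")


def doi_url_for(extended_id: str) -> str:
    """Build the DOI link for an entry issued with an extended PDB ID."""
    if not EXTENDED_ID.fullmatch(extended_id):
        raise ValueError(f"not an extended PDB ID: {extended_id!r}")
    # DOI format from the text: 10.2210/[extended_PDB_ID]/pdb
    return f"https://doi.org/10.2210/{extended_id}/pdb"


print(doi_url_for("pdb_10001xyz"))  # https://doi.org/10.2210/pdb_10001xyz/pdb
```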
Which part of extended PDB IDs should be cited in manuscripts or provided to journals?
The entire PDB ID, e.g., “pdb_1001ba3c”, should be cited or provided to journals. Users should not omit the prefix or zeros. Users and journals should parse or recognize PDB IDs using the prefix “pdb_”.
How does OneDep assign extended PDB IDs?
The wwPDB OneDep tool will assign extended PDB IDs of the form pdb_1xxxxxxx, i.e., “pdb_1” followed by seven alphanumeric characters. When referencing PDB IDs in manuscripts that will be submitted to journals, authors should refrain from abbreviating prefix characters or omitting leading zeros in issued PDB IDs.
Where is the extended PDB ID stored in a PDB entry?
The extended PDB ID is currently stored in the mmCIF format file as the _database_2.pdbx_database_accession data item value. Once four-character PDB IDs have been consumed, extended PDB IDs will be stored as values for both _database_2.database_code and _database_2.pdbx_database_accession data items.
What will be the filenames in the PDB archive?
After the depletion of four-character PDB IDs, filenames for both existing and new entries will be based on extended PDB IDs. For example, the current filename for PDB entry 1abc, which is 1abc.cif, will become pdb_00001abc.cif.
How should journals manage PDB IDs within their workflows?
As long as PDB entries are issued with 4-character PDB IDs, journals should cite 4-character PDB IDs in manuscripts for proper PDB DOI linking. When PDB entries are issued only extended PDB IDs, without 4-character PDB IDs, journals should cite the 12-character extended PDB IDs, and the corresponding PDB DOIs will be formatted as 10.2210/[extended_PDB_ID]/pdb for linking to wwPDB landing pages.
How will data block names in PDBx/mmCIF files change?
Data blocks will be named by appending the extended PDB ID to the data_ token, e.g., data_pdb_10021abc.
How will data block names in structure factor files change?
Data block names in structure factor files will follow the pattern data_pdb_xxxxxxxx-sf. In the case of multiple data blocks, they will be labeled as data_pdb_xxxxxxxx-sf-A for the initial data block, data_pdb_xxxxxxxx-sf-B for the second data block, and so forth.
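The naming patterns for data blocks can be sketched as follows. This is an illustrative Python example with hypothetical function names. Two points are assumptions: a file with a single data block uses the unsuffixed -sf name, and files with more than 26 blocks are rejected because labels beyond Z are not described above.

```python
from string import ascii_uppercase


def model_block_name(extended_id: str) -> str:
    """Data block name for a PDBx/mmCIF model file."""
    return f"data_{extended_id}"


def sf_block_names(extended_id: str, n_blocks: int) -> list[str]:
    """Data block names for a structure factor file with n_blocks blocks."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    base = f"data_{extended_id}-sf"
    if n_blocks == 1:
        # Assumption: a single block carries no letter suffix.
        return [base]
    if n_blocks > len(ascii_uppercase):
        # Labels beyond Z are not specified in the text.
        raise ValueError("more than 26 data blocks is not covered here")
    return [f"{base}-{ascii_uppercase[i]}" for i in range(n_blocks)]


print(model_block_name("pdb_10021abc"))   # data_pdb_10021abc
print(sf_block_names("pdb_10021abc", 2))
```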
What is the regular expression for extended PDB IDs?
The regular expression defined in the PDBx/mmCIF dictionary is pdb_[a-z0-9]{8}.
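As a sketch of how this expression might be applied to text mining, the Python example below scans free text for extended PDB IDs. The word boundaries (\b) are an addition for scanning prose and are not part of the dictionary definition; the function name find_extended_ids is hypothetical.

```python
import re

# Dictionary pattern wrapped in word boundaries so that longer tokens
# (e.g. nine trailing characters) are not partially matched.
PDB_ID_IN_TEXT = re.compile(r"\bpdb_[a-z0-9]{8}\b")


def find_extended_ids(text: str) -> list[str]:
    """Return all extended PDB IDs found in text, in order of appearance."""
    return PDB_ID_IN_TEXT.findall(text)


sample = "Structures pdb_00001abc and pdb_10021abc were compared; pdb_123 is not an ID."
print(find_extended_ids(sample))  # ['pdb_00001abc', 'pdb_10021abc']
```

To validate a single field rather than scan text, re.fullmatch with the unmodified dictionary pattern is the simpler choice.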
Are there example files that can be accessed for software adoption?
Example files that contain extended PDB IDs can be downloaded at https://github.com/wwPDB/extended-wwPDB-identifier-examples.
Is there going to be a change in the file directory architecture?
All data files for a particular entry will be stored in a single directory, placed under a two-character hash directory derived from the two characters immediately preceding the last character of the PDB code. To aid users in adapting to this change, a PDB “beta” archive will be provided during the transition phase. This beta archive is expected in 2026. The directory structure will mirror the data organization of the PDB Versioned Archive, i.e.,
https://files-beta.org/pub/wwpdb/pdb/data/entries/<two-letter-hash>/<pdb_accession_code>/<entry_data_File_names>.
The two-character hash consists of the second and third characters counting back from the last character of the ID. For example, PDB entry pdb_1abc5678 will be under /67/.
This maintains consistency with the current PDB archive, where PDB entry 1abc is under /ab/.
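The hash rule can be expressed as a simple slice. The Python sketch below is illustrative only: the function names are hypothetical, and entry_dir returns a path relative to the entries directory, following the <two-letter-hash>/<pdb_accession_code> layout shown above.

```python
def archive_hash(pdb_id: str) -> str:
    """Return the two characters immediately preceding the last character."""
    if len(pdb_id) < 3:
        raise ValueError(f"ID too short to hash: {pdb_id!r}")
    return pdb_id[-3:-1]


def entry_dir(extended_id: str) -> str:
    """Relative directory for an entry: <hash>/<extended ID>."""
    return f"{archive_hash(extended_id)}/{extended_id}"


print(archive_hash("pdb_1abc5678"))  # 67
print(archive_hash("1abc"))          # ab
print(entry_dir("pdb_1abc5678"))     # 67/pdb_1abc5678
```

Because the slice counts from the end of the string, the same function gives the current hash for four-character IDs and the new hash for extended IDs.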
Will the already issued 4-character PDB IDs be kept in the files?
Yes. The 4-character PDB IDs of existing entries will still be available at _database_2.database_code in the mmCIF structure files.
Will the existing PDB format files be retained in the “beta” archive?
Yes. The existing legacy PDB format files will be retained and updated in the “beta” archive. However, a PDB entry may become incompatible with the PDB format after remediation (e.g., a new ligand with a 5-character CCD ID). In such cases, the existing PDB format file will be removed from the archive as a result of the remediation effort. Users should use PDBx/mmCIF formatted files.