wwPDB Remediation

A major focus of the wwPDB is maintaining consistency and accuracy across the archive. As the PDB grows, developments in structure determination methods and technologies can challenge how all structures are represented. The wwPDB addresses these challenges with regular reviews of data processing procedures and coordinates remediation efforts to improve data representation.

2023 Peptide Residues Chemical Component Dictionary Remediation

In October 2023, wwPDB will update and enrich Chemical Component Dictionary (CCD) data files with standardized atom naming and additional annotation of protein backbone and terminal atoms within peptide residues. Entries containing those updated CCDs will be updated accordingly. This will improve the Findability and Interoperability of the PDB data, as well as open up new opportunities to use the updated peptide residue annotation.

Click here to see details about the Peptide Residues Chemical Component Dictionary Remediation Project and obtain GitHub example files for testing and adoption.

2023 Remediation of Crystal Structures Deposited in Non-standard Coordinate Frames

A total of 268 structure entries deposited to PDB in a non-standard coordinate frame or space group setting have been re-versioned and re-released between December 2022 and September 2023. The updates have been made to enable improved model visualization and validation against deposited experimental data using modern graphics and refinement software.

Each re-versioned entry has atom x,y,z coordinates transformed into the standard crystallographic frame. Non-crystal symmetry matrices, if present, have also been transformed to operate on the updated coordinates.
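The relationship between the coordinate transformation and the updated non-crystallographic symmetry (NCS) matrices can be illustrated with a minimal sketch. It assumes the frame change is a proper rotation R plus a translation t, so that x' = R x + t, and that an NCS operator acts as y = M x + v. Under those assumptions the operator that acts on the transformed coordinates is M' = R M R^T and v' = R v + t - M' t. The function names are hypothetical and this is not the code used by wwPDB.

```python
# Illustrative sketch only: all names here are hypothetical and the approach
# assumes the frame change is a proper rotation (inverse == transpose).

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def to_standard_frame(xyz, rot, trans):
    """Apply x' = R x + t to one atom position."""
    rx = mat_vec(rot, xyz)
    return [rx[i] + trans[i] for i in range(3)]


def transform_ncs_operator(ncs_rot, ncs_trans, rot, trans):
    """Re-express the NCS operator y = M x + v for the transformed coordinates.

    Returns (M', v') with M' = R M R^T and v' = R v + t - M' t.
    """
    new_rot = mat_mul(mat_mul(rot, ncs_rot), transpose(rot))
    rv = mat_vec(rot, ncs_trans)
    mt = mat_vec(new_rot, trans)
    new_trans = [rv[i] + trans[i] - mt[i] for i in range(3)]
    return new_rot, new_trans
```

A quick consistency check is that applying the NCS operator and then the frame change gives the same position as applying the frame change and then the transformed NCS operator.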

Coordinate transformations were extracted either from REMARK or SCALE records, or from transformations published in the primary citation.

All transformed structure coordinates have been carefully checked for integrity of crystal packing and, where available, validated against the deposited structure factor data.

Original deposited coordinates/matrices remain accessible/downloadable as the previous major version in PDB's versioned archive.

2023 Remediation of NMR Restraints Data

wwPDB has unified NMR restraints and chemical shifts data in standard NMR-STAR and NEF formats in collaboration with the NMR community. The restraints data deposited to the PDB in the following formats have been standardized with best effort:

AMBER, BIOSYM, CHARMM, CNS, CYANA, DYNAMO/TALOS/PALES, GROMACS, ISD, ROSETTA, SYBYL, and XPLOR-NIH

Both assigned chemical shifts and restraints have been deposited to the PDB
A valid topology file or specific comments must be present to interpret restraint files in a specific format (i.e. AMBER, CHARMM, GROMACS)
All atoms described in NMR data (assigned chemical shifts and restraints) are consistent with the model’s atoms
Sequence alignment between the model and restraints matches, allowing terminal sequence extensions

The unified NMR data files have enabled NMR restraint validation in wwPDB validation reports for NMR entries having assigned chemical shifts and reasonable restraints.

NMR-STAR format is the master format used to handle a wide variety of restraints in this NMR restraint remediation project. Data items starting with the “Auth_asym_ID”, “Auth_seq_ID”, “Auth_comp_ID”, and “Auth_atom_ID” terms, e.g. “_Gen_dist_constraint.Auth_seq_ID_1”, point to the equivalent data items in mmCIF: “_atom_site.auth_asym_id”, “_atom_site.auth_seq_id”, “_atom_site.auth_comp_id”, and “_atom_site.auth_atom_id”, respectively. Complete atom name mapping information is incorporated using the “_Assembly” and “_Entity” categories. NMR constraints used by structure determination software but not covered by the NMR-STAR ontology, such as chemical planarity, equilibrium bond angle, and non-crystallographic symmetry, are stored as JSON data under the “_Other_constraint_list.text_data” tag to ensure no information is lost during remediation. NEF data files are provided as best-effort conversions from the master NMR-STAR data files.
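The tag correspondence described above can be sketched as a small lookup. This is an illustration only: the helper names are hypothetical, and the structure of the JSON stored in the text_data tag is not specified here, so the decoder simply returns whatever the payload contains.

```python
import json

# Correspondence stated in the text: NMR-STAR "Auth_*" term -> mmCIF item.
AUTH_TAG_TO_MMCIF = {
    "Auth_asym_ID": "_atom_site.auth_asym_id",
    "Auth_seq_ID": "_atom_site.auth_seq_id",
    "Auth_comp_ID": "_atom_site.auth_comp_id",
    "Auth_atom_ID": "_atom_site.auth_atom_id",
}


def mmcif_item_for(nmrstar_tag):
    """Return the equivalent mmCIF item for an NMR-STAR Auth_* tag, else None.

    Hypothetical helper: "_Gen_dist_constraint.Auth_seq_ID_1" is split at the
    first "." and the item name is matched against the Auth_* terms, allowing
    a numeric suffix such as "_1" or "_2".
    """
    _, _, item = nmrstar_tag.partition(".")
    for term, mmcif_item in AUTH_TAG_TO_MMCIF.items():
        if item == term or item.startswith(term + "_"):
            return mmcif_item
    return None


def parse_other_constraints(text_data):
    """Decode the JSON stored under _Other_constraint_list.text_data.

    The schema of the payload is an unknown here; no keys are assumed.
    """
    return json.loads(text_data)
```

For example, mmcif_item_for("_Gen_dist_constraint.Auth_seq_ID_1") yields "_atom_site.auth_seq_id", while a tag without an Auth_* term yields None.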

2021 Remediation of Mutation Annotation

wwPDB has standardized mutation annotation in the existing entries. Around 11000 entries were updated with "engineered mutation" annotation in the mmCIF data item _struct_ref_seq_dif.details, to conform to the current standard defined in the PDBx/mmCIF dictionary and used in the OneDep deposition-validation-biocuration system.

2020 Remediation of Carbohydrates

wwPDB has standardized atom nomenclature and provided uniform data representation and linear descriptors in collaboration with the glycoscience community to enable easy translation of PDB data to other representations commonly used by glycobiologists.

Click here to see details about the Carbohydrate Remediation Project.

The details of this project have been described in
Chenghua Shao, Zukang Feng, John D Westbrook, Ezra Peisach, John Berrisford, Yasuyo Ikegawa, Genji Kurisu, Sameer Velankar, Stephen K Burley and Jasmine Y Young. (2021) Modernized uniform representation of carbohydrate molecules in the Protein Data Bank. Glycobiology 1–15: doi: 10.1093/glycob/cwab039.

2017 Archival PDBx/mmCIF Files Update

In the 2017 release of the PDB archive, PDB structure entry files conform to V5.0 of the PDBx/mmCIF dictionary, the data standard that already supports the global wwPDB system for Deposition, Biocuration and Validation of PDB data (the wwPDB OneDep Deposition System).

PDB format files do not contain all of the remediated information, as PDB format is a legacy format. These files were provided previously for the community to review and test.

The updated PDBx/mmCIF and XML structure entry files for all experimental methods are provided via the current PDB FTP archive (https://files.wwpdb.org/pub/pdb/data/structures/).
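As a hedged illustration of locating a file under the archive URL above, the sketch below builds a download URL for a four-character PDB ID. The "divided/mmCIF/<middle two characters>/<id>.cif.gz" layout is an assumption not stated on this page, and the function name is hypothetical; check the archive listing before relying on it.

```python
# Base URL quoted in the text.
BASE_URL = "https://files.wwpdb.org/pub/pdb/data/structures/"


def mmcif_url(pdb_id):
    """Build a URL for the PDBx/mmCIF file of a four-character PDB ID.

    Assumption: files sit under divided/mmCIF/<middle two characters of the
    ID>/<id>.cif.gz. This layout is not described in the text above.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    return f"{BASE_URL}divided/mmCIF/{pdb_id[1:3]}/{pdb_id}.cif.gz"
```

The function only constructs a string; it performs no network access and does not confirm that the entry exists.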

The changes to V5.0 include:

Improved audit categories to capture change information down to the category level for entry revisions.
Better organized data content, and much more extensive metadata in model files for Electron Microscopy derived models.
Corrected source organism and sequence references for each sequence fragment for chimeric proteins.
Standardized data in several categories, including software name, detector name and detector type.

The complete list of changes is described here.

2016 Remediation of 3DEM Entries

The wwPDB and the EMDataBank/Unified Data Resource for 3DEM Project have collaborated to update the experimental methods descriptions of all electron microscopy and electron crystallography-derived structures in the PDB archive. With this work now completed, all 3DEM-derived entries have better-organized content and conform to the revised data model developed by the EMDataBank team for use within the wwPDB OneDep System. The OneDep System has supported deposition, annotation, and validation of 3DEM structures and fully integrates deposition of 3DEM maps and model coordinates since January 2016.

Examples of 3DEM model files (both remediated and from the OneDep system) are provided in a new wwPDB ftp directory (https://files.wwpdb.org/pub/pdb/test_data/EM/). A data-item level description of the changes made during remediation is provided here. 3DEM terms in the updated PDBx/mmCIF dictionary can be reviewed here.

Files in the current PDB ftp archive will be replaced with new files corresponding to the updated PDBx/mmCIF dictionary in 2017. Users are encouraged to review and test the example data files.

2014 Integration of Large Structures with the Main PDB Archive

Large structures (containing more than 62 chains and/or more than 99999 ATOM lines) represented as single files in both PDBx/mmCIF and PDBML formats were fully integrated into the main PDB FTP archive. Previously, large structures were represented in multiple "SPLIT" entries, which have now been removed (obsoleted). Users searching for ID codes of "SPLIT" entries at wwPDB member websites will be automatically redirected to the combined entry. The file large_split_mapping.tsv lists the single large structure IDs created during this remediation and the corresponding obsoleted SPLIT entries. Following this remediation, large structures will only be distributed in the main PDB FTP directory in PDBx/mmCIF and PDBML formats, including biological assembly files. In addition, a separate directory in the PDB FTP archive provides access to a TAR file containing a collection of minimal PDB format files that represent all large structures.
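The size criterion above can be expressed as a short check. This is a sketch under the assumption that both limits are exclusive upper bounds (a structure is large when it exceeds either one); the constant and function names are hypothetical.

```python
# Limits quoted in the text for a single legacy PDB format file.
MAX_PDB_CHAINS = 62
MAX_PDB_ATOM_LINES = 99999


def is_large_structure(n_chains, n_atom_lines):
    """Return True when a structure exceeds either legacy PDB format limit."""
    return n_chains > MAX_PDB_CHAINS or n_atom_lines > MAX_PDB_ATOM_LINES
```

A structure with exactly 62 chains and 99999 ATOM lines is not classed as large by this check, whereas exceeding either limit is enough.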

2013 Remediation

In 2012, the wwPDB reviewed structure factor files in the archive for format consistency and correspondence with coordinate data. As a result, around 43,800 structure factor files were updated to standardize the format and to incorporate data corrections. Changes made are described in the audit.details record of the structure factor file. These modified entries were released throughout 2013 in weekly batches of 1000.

2011 Remediation

The July 2011 release of Version 4.0 of the PDB archive involved remediating complex problems, including the representation of biological assemblies, residual B factors, peptide inhibitors and antibiotics, and entries in nonstandard crystal frames. A description of the review and resulting changes and corrections is available in a PDF document. Any changes made to the data are recorded in the PDBX_VERSION data category and in a revision log created for this release (XLS and CSV).

2008 Remediation

The March 2009 release of Version 3.2 of the PDB archive includes improvements and enhancements to the data, including details about the chemistry of the polymer and the ligands bound to it, biological assemblies, and binding sites of ligands and metal ions. An overview (PDF) is provided.

2007 Remediation

In August 2007, the wwPDB released remediated files for all PDB entries to ensure the uniformity of the data across the archive. As a part of this project, many entries in the archive were updated and made consistent, with a focus on sequences (references to databases and taxonomies, plus differences between chemical and macromolecular sequences); primary citations; and assembly and virus information. In the Chemical Component Dictionary, the chemistry and nomenclature in monomers and ligands were standardized. This release of remediated data greatly improved searching and reporting capabilities across the PDB archive.

The details of this project have been described in

K. Henrick, Z. Feng, W. Bluhm, D. Dimitropoulos, J.F. Doreleijers, S. Dutta, J.L. Flippen-Anderson, J. Ionides, C. Kamada, E. Krissinel, C.L. Lawson, J.L. Markley, H. Nakamura, R. Newman, Y. Shimizu, J. Swaminathan, S. Velankar, J. Ory, E.L. Ulrich, W. Vranken, J. Westbrook, R. Yamashita, H. Yang, J. Young, M. Yousufuddin, and H. Berman (2008) Remediation of the Protein Data Bank Archive. Nucleic Acids Res. 36(Database issue): D426-D433.

C.L. Lawson, S. Dutta, J.D. Westbrook, K. Henrick, and H.M. Berman (2008) Representation of viruses in the remediated PDB archive. Acta Cryst. D64: 874-882.

The full scope of the remediation project is also available in an Overview Document (PDF).