A new article by the wwPDB and the PDBx/mmCIF Working Group describes the community-driven data representation for structural biology data that is critical to the PDB archive. It describes file standards and governance, and summarizes software tools for data processing and checking.
PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology
John D. Westbrook, Jasmine Y. Young, Chenghua Shao, Zukang Feng, Vladimir Guranovic, Catherine L. Lawson, Brinda Vallat, Paul D. Adams, John M. Berrisford, Gerard Bricogne, Kay Diederichs, Robbie P. Joosten, Peter Keller, Nigel W. Moriarty, Oleg V. Sobolev, Sameer Velankar, Clemens Vonrhein, David G. Waterman, Genji Kurisu, Helen M. Berman, Stephen K. Burley, Ezra Peisach
(2022) Journal of Molecular Biology 434: 167599 doi: 10.1016/j.jmb.2022.167599
This article is dedicated to John D. Westbrook, whose work established the PDBx/mmCIF data dictionary and format as the foundation of the modern Protein Data Bank (PDB) archive (wwPDB.org).
Starting May 3, 2022, the PDB archive distributes assembly files in PDBx/mmCIF format, allowing direct access and visualization of the curated assemblies for all PDB entries (original announcement).
Previously, PDBx/mmCIF formatted assembly files were provided for structures that are non-PDB compliant; however, their coordinates used model numbers to differentiate alternate symmetry copies of PDB chain IDs. This method is neither ideal nor necessary for the current archive PDBx/mmCIF format and has led to limited use of these files in community software tools. In response to this issue and recommendations by the wwPDB advisory committee, we are implementing updated, standardized practices for generation of assembly files for all PDB entries.
These updated PDBx/mmCIF format assembly files have improved organization of assembly data to support usage by the community. These files will include all symmetry-generated copies of each chain within a single model, with distinct chain IDs (_atom_site.auth_asym_id and _atom_site.label_asym_id) assigned to each. Generation of distinct chain IDs in assembly files is based upon the following rules:
In addition, entity ID and chain ID mapping categories are provided: _pdbx_entity_remapping and _pdbx_chain_remapping.
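As a rough illustration of how the chain ID mapping category might be consumed, the sketch below reads a _pdbx_chain_remapping loop and maps each chain ID in the assembly file back to the chain it was generated from. The item names (label_asym_id, orig_label_asym_id, applied_operations and so on) and the sample chain IDs are assumptions made for this example and should be checked against the PDBx/mmCIF dictionary and the published sample files. The parser handles only simple, unquoted, single-line loop rows and is not a general mmCIF reader.

```python
# Illustrative fragment; item names and values are assumptions, not archive data.
SAMPLE = """\
data_example
#
loop_
_pdbx_chain_remapping.entity_id
_pdbx_chain_remapping.label_asym_id
_pdbx_chain_remapping.auth_asym_id
_pdbx_chain_remapping.orig_label_asym_id
_pdbx_chain_remapping.orig_auth_asym_id
_pdbx_chain_remapping.applied_operations
1 A   A   A A 1
1 A-2 A-2 A A 2
#
"""


def parse_loop(cif_text, category):
    """Return the rows of one simple loop_ as dicts keyed by item name.

    Only whitespace-separated, unquoted values on single lines are handled.
    """
    prefix = category + "."
    names, rows = [], []
    state = "seek"  # seek -> header -> data
    for raw in cif_text.splitlines():
        line = raw.strip()
        if state == "seek":
            if line == "loop_":
                state, names = "header", []
        elif state == "header":
            if line.startswith(prefix):
                names.append(line[len(prefix):])
            elif line.startswith("_"):
                state = "seek"  # this loop belongs to another category
            elif names and line and line != "#":
                state = "data"
                rows.append(dict(zip(names, line.split())))
            else:
                state = "seek"
        elif state == "data":
            if not line or line == "#" or line.startswith(("_", "loop_", "data_")):
                break  # end of the loop we were reading
            rows.append(dict(zip(names, line.split())))
    return rows


def chain_origin(rows):
    """Map each assembly chain ID to the original chain ID it derives from."""
    return {row["label_asym_id"]: row["orig_label_asym_id"] for row in rows}


if __name__ == "__main__":
    print(chain_origin(parse_loop(SAMPLE, "_pdbx_chain_remapping")))
```

For production use, an established mmCIF parsing library is a better choice than this hand-written reader, which ignores quoting and multi-line values.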
A new directory (ftp.wwpdb.org/pub/pdb/data/assemblies/mmCIF/) was created for the distribution of these updated assembly files. The directory containing the existing assembly mmCIF files for large entries has been removed (ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/).
wwPDB asks all PDB users and software developers to review code and address any limitations related to PDB assemblies. Sample files were made available for testing purposes and to support community adoption at GitHub.com/wwpdb/assembly-mmcif-examples.
If you plan to use these assembly files for graphical viewing, check whether your visualization software (e.g., PyMOL, ChimeraX, etc.) supports instantiation of assemblies directly from atomic coordinate files (_struct_assembly related categories), which offers improved efficiency.
For any further information please email email@example.com.
ModelCIF, an extension of PDBx/mmCIF for computed structure models, is now available. The PDBx/mmCIF data standard underpins the Protein Data Bank (PDB) Core Archive, which is jointly managed by the worldwide Protein Data Bank (wwPDB) consortium. A software library called python-modelcif has been developed to support ModelCIF and enables reading and writing mmCIF files compliant with ModelCIF.
ModelCIF serves as the data standard for representing structural models of macromolecules obtained using computational methods. These computed structure models may be derived from existing structure templates using homology or comparative modeling, or can be obtained from ab initio modeling methods. The ModelCIF data standard is being adopted by computational biologists as well as by major repositories of computed structure models, including ModelArchive, MODBASE, and the AlphaFold Protein Structure Database (AlphaFoldDB). Partial support for ModelCIF is also available in SWISS-MODEL projects and will soon be added to the SWISS-MODEL Repository.
ModelCIF is developed and maintained by the wwPDB ModelCIF Working Group (WG), consisting of representatives from the wwPDB and the computational structural biology community. The WG is focused on developing common data standards and software tools for archiving and visualization of computed structure models. The WG promotes adoption of ModelCIF within the computational modeling community, and is also involved in developing software tools that support ModelCIF. Research teams making computed structure models available from their own web portals are strongly encouraged to do so using the ModelCIF data standard and to integrate them into the 3D-Beacons network. Structural biologists are strongly encouraged to deposit computed structure models to the ModelArchive to ensure long-term preservation and public access. Guidelines on how to deposit computed structure models together with relevant metadata are also available on ELIXIR's RDMkit page for structural bioinformatics.
CASP (Critical Assessment of protein Structure Prediction) is in search of targets for the upcoming CASP15 modeling experiment (starting in May 2022). CASP community experiments aim to advance the state of the art in protein structure modeling. Every other year since 1994, CASP has collected information on soon-to-be released experimental structures, passed on sequence data to the structure modeling community, and collected blind predictions of structure for assessment. Typically, about 100 modeling groups from around the world participate. Results of CASP experiments are assessed by leaders in the field (Independent Assessors), and published in special issues of the journal PROTEINS.
Following the 2020 CASP14 experiment, it is hard to find a structural biologist who has not heard about the success of deep learning methods in modeling protein structures, particularly AlphaFold and, more recently, RoseTTAFold. As a result of these advances, computed protein structures are becoming much more widely used in a broadening range of applications. Since CASP14, the protein modeling community has intensified development of these methods and extended their application to include modeling of protein complexes and protein ensembles. CASP15 will provide definitive insight into how successful these new developments are.
CASP15’s success depends on the generosity of the experimental community in providing targets as ground truth against which to assess the computational methods. Over the years, more than 150 structure determination groups have provided over 1100 targets for CASP challenges. For CASP15, we are requesting submission of all types of experimental structures determined by X-ray crystallography, cryo-electron microscopy and NMR as potential targets, but are particularly interested in the following:
CASP also plans to include modeling assisted by sparse experimental data, in collaboration with experimental groups in NMR, SAXS, and crosslinking mass spectrometry. For that, protein material is needed (this is not expected for most targets, but if available, it would be much appreciated!).
So, if you have suitable targets in any of these areas, we would very much appreciate you getting in touch by replying to this email, writing to firstname.lastname@example.org, or suggesting your target directly through the CASP15 target entry page.
Note that CASP target providers are regularly invited to contribute to CASP special journal issue papers (e.g. Computational models in the service of X-ray and cryo-electron microscopy structure determination (2021) Proteins 89: 1633-1646; Target highlights in CASP14: Analysis of models by structure providers (2021) Proteins 89: 1647-1672), and we plan to continue this practice in the future.
As announced previously, deposition of half-maps for single-particle, single-particle-based helical, and sub-tomogram averaging reconstructions to the EM Data Bank (EMDB) is mandatory as of February 25, 2022. This change is in response to a long-standing community request to the wwPDB EMDB Core Archive and was also a recommendation from the 2020 wwPDB single-particle cryo-EM data-management workshop (white paper in preparation). Several recommendations from this workshop have already been implemented in the wwPDB OneDep system. These include improvements to wwPDB validation reports and enhancements for capturing metadata via the deposition interface.
Mandatory half-maps must be unfiltered, unmasked, unsharpened, and positioned in the same coordinate-space and orientation as the primary map such that they superimpose. The availability of half-maps will contribute to improved validation of EM structures as reflected in the wwPDB validation reports.
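The requirement that half-maps share the coordinate space of the primary map can be pre-checked before deposition by comparing basic map header values. The sketch below is a minimal illustration, not the check that OneDep performs: MapHeader and frame_mismatches are hypothetical names, and the grid, voxel size and origin values are assumed to have been read from each map file header by whatever software is in use. It does not test whether a map is unfiltered, unmasked or unsharpened.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapHeader:
    """Hypothetical container for values read from a map file header."""
    grid: tuple        # (nx, ny, nz) number of voxels along each axis
    voxel_size: tuple  # (x, y, z) voxel spacing in Angstrom
    origin: tuple      # (x, y, z) origin in Angstrom


def frame_mismatches(primary, half, tol=1e-3):
    """List the header fields in which a half-map differs from the primary map.

    An empty list means both maps use the same grid, sampling and origin,
    which is necessary (not sufficient) for them to superimpose.
    """
    problems = []
    if primary.grid != half.grid:
        problems.append("grid")
    for name in ("voxel_size", "origin"):
        a, b = getattr(primary, name), getattr(half, name)
        if not all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)):
            problems.append(name)
    return problems


if __name__ == "__main__":
    primary = MapHeader((256, 256, 256), (1.06, 1.06, 1.06), (0.0, 0.0, 0.0))
    shifted = MapHeader((256, 256, 256), (1.06, 1.06, 1.06), (0.0, 0.0, 5.3))
    print(frame_mismatches(primary, shifted))
```

An empty result for both half-maps indicates that the headers agree; a non-empty result names the fields to fix before upload.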
wwPDB strongly urges developers of cryo-EM processing software for the affected modalities to implement support for output of such half-maps (if this is not already available).
Any queries about this policy change can be directed to email@example.com.
Starting May 3, 2022, the PDB archive will distribute assembly files in PDBx/mmCIF format, allowing direct access and visualization of the curated assemblies for all PDB entries.
Currently, PDBx/mmCIF formatted assembly files are provided for structures that are non-PDB compliant; however, the coordinates use model numbers to differentiate alternate symmetry copies of PDB chain IDs. This method is neither ideal nor necessary for the current archive PDBx/mmCIF format and has led to limited use of these files in community software tools. In response to this issue and recommendations by the wwPDB advisory committee, we are implementing updated, standardized practices for generation of assembly files for all PDB entries.
These updated PDBx/mmCIF format assembly files will have improved organization of assembly data to support usage by the community. These files will include all symmetry-generated copies of each chain within a single model, with distinct chain IDs (_atom_site.auth_asym_id and _atom_site.label_asym_id) assigned to each. Generation of distinct chain IDs in assembly files is based upon the following rules:
In addition, entity ID and chain ID mapping categories will be provided: _pdbx_entity_remapping and _pdbx_chain_remapping.
A new directory (ftp.wwpdb.org/pub/pdb/data/assemblies/mmCIF/) will be created for the distribution of these updated assembly files. The directory containing the existing assembly mmCIF files for large entries will be removed (ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/).
wwPDB asks all PDB users and software developers to review code and address any limitations related to PDB assemblies. Sample files are made available for testing purposes and to support community adoption at GitHub.com/wwpdb/assembly-mmcif-examples.
If you plan to use these assembly files for graphical viewing, check whether your visualization software (e.g., PyMOL, ChimeraX, etc.) supports instantiation of assemblies directly from atomic coordinate files (_struct_assembly related categories); if it does, we recommend that you do so for improved efficiency.
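Instantiating an assembly from the atomic coordinate file amounts to applying each symmetry operator (a 3x3 rotation matrix plus a translation vector) to the listed chains and placing all copies in a single model under distinct chain IDs. The sketch below illustrates the idea with plain Python tuples. The chain naming used here (first copy keeps its ID, later copies get the operator ID appended after a hyphen) is an assumption made for illustration and is not a statement of the wwPDB rules, and the operator values are invented.

```python
def apply_operator(xyz, matrix, vector):
    """Return matrix * xyz + vector for one Cartesian coordinate."""
    x, y, z = xyz
    return tuple(
        matrix[i][0] * x + matrix[i][1] * y + matrix[i][2] * z + vector[i]
        for i in range(3)
    )


def build_assembly(atoms, operators):
    """Generate all symmetry copies in one model with distinct chain IDs.

    atoms:     list of (chain_id, (x, y, z))
    operators: list of (oper_id, 3x3 matrix, translation vector)
    The chain naming scheme is hypothetical, chosen only for this example.
    """
    assembly = []
    for index, (oper_id, matrix, vector) in enumerate(operators):
        for chain_id, xyz in atoms:
            new_chain = chain_id if index == 0 else f"{chain_id}-{oper_id}"
            assembly.append((new_chain, apply_operator(xyz, matrix, vector)))
    return assembly


# Invented example: identity plus a twofold rotation about z with a shift in x.
IDENTITY = ("1", [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])
TWOFOLD = ("2", [[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [10.0, 0.0, 0.0])

if __name__ == "__main__":
    for chain, xyz in build_assembly([("A", (1.0, 2.0, 3.0))], [IDENTITY, TWOFOLD]):
        print(chain, xyz)
```

Viewers that build assemblies this way need only the asymmetric unit coordinates and the operator list, which is why doing so is more efficient than loading a fully expanded assembly file.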
Extensions to the PDBx/mmCIF dictionary for reflection data with anisotropic diffraction limits, for unmerged reflection data, and for quality metrics of anomalous diffraction data are now supported in OneDep.
In October 2020, a subgroup of the wwPDB PDBx/mmCIF Working Group was convened to develop a richer description of experimental data and associated data quality metrics. Members of this Data Collection and Processing Subgroup are all actively engaged in development and support of diffraction data processing software. The Subgroup met virtually for several months discussing, reviewing, and finalizing a new set of dictionary content extensions that were incorporated into the PDBx/mmCIF dictionary on February 16, 2021. A reference implementation of the new content extensions has been developed by Global Phasing Ltd.
These extensions facilitate the deposition and archiving of a broader range of diffraction data, as well as new quality metrics pertaining to these data. These extensions cover three main areas:
The new mmCIF data extensions describing anisotropic diffraction now enable archiving of the results of Global Phasing’s STARANISO program. Developers of other software can make use of them or extend the present definitions to suit their applications. Example files created by autoPROC, BUSTER (version 20210224) and Gemmi that are compliant with the new dictionary extensions are provided in a GitHub repository.
These example files, and similarly compliant files produced by other data processing and/or refinement programs, are suitable for direct uploading to the wwPDB OneDep system. Automatic recognition of that compliance, implemented by means of explicit dictionary versioning using the new pdbx_audit_conform record, will avoid unnecessary pre-processing at the time of deposition. This improved OneDep support will ensure a lossless round trip between data processing/refinement in the lab and deposition at the PDB.
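Software that writes deposition files may want to confirm that the dictionary version is declared before upload. The sketch below shows one way this could look, assuming the record appears as non-looped _pdbx_audit_conform.dict_name and _pdbx_audit_conform.dict_version items and that versions compare as dotted integers. The sample values and the thresholds are illustrative only; the minimum version that contains the new extensions is not stated here and must be supplied by the caller. This is not the recognition logic used by OneDep.

```python
# Illustrative fragment; the version shown is a placeholder, not a requirement.
SAMPLE_CIF = """\
data_example
_pdbx_audit_conform.dict_name      mmcif_pdbx.dic
_pdbx_audit_conform.dict_version   5.339
"""


def read_items(cif_text, category):
    """Collect non-looped `_category.item value` pairs into a dict."""
    prefix = category + "."
    items = {}
    for raw in cif_text.splitlines():
        parts = raw.split(None, 1)
        if len(parts) == 2 and parts[0].startswith(prefix):
            items[parts[0][len(prefix):]] = parts[1].strip().strip("'\"")
    return items


def version_tuple(text):
    """Turn a dotted version string such as '5.339' into (5, 339)."""
    return tuple(int(part) for part in text.split("."))


def declares_at_least(cif_text, minimum):
    """True if the file declares a dictionary version >= `minimum`.

    `minimum` is chosen by the caller; no official threshold is implied.
    """
    version = read_items(cif_text, "_pdbx_audit_conform").get("dict_version")
    if version is None:
        return False  # no conformance record present
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    print(read_items(SAMPLE_CIF, "_pdbx_audit_conform"))
    print(declares_at_least(SAMPLE_CIF, "5.300"))
```

A file without the record returns False, which a writer could treat as a prompt to add explicit versioning rather than as an error.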
wwPDB strongly encourages structural biologists to always use the latest versions of structure determination software packages to produce data files for PDB deposition. wwPDB also encourages crystallographers wishing to deposit new structures together with their associated diffraction data to use software that guarantees consistency between data and final model. This consistency is difficult to achieve when separate diffraction data files and model coordinate files are pieced together a posteriori by ad hoc means.
wwPDB also encourages depositors to make their raw diffraction images available from one of the public repositories to allow direct access to the original diffraction image data.
A snapshot of the PDB Core archive (ftp://ftp.wwpdb.org) as of January 3, 2022 has been added to ftp://snapshots.wwpdb.org and ftp://snapshots.pdbj.org. Snapshots have been archived annually since 2005 to provide readily identifiable data sets for research on the PDB archive.
The directory 20220103 includes the 185541 experimentally-determined structures and experimental data available at that time. Atomic coordinates and related metadata are available in PDBx/mmCIF, PDB, and XML file formats. The date and time stamp of each file indicates the last time the file was modified. The snapshot of the PDB Core Archive is 923 GB.
A snapshot of the EMDB Core archive (ftp://ftp.ebi.ac.uk/pub/databases/emdb/) as of January 3, 2022 can be found in ftp://ftp.ebi.ac.uk/pub/databases/emdb_vault/20220103/ and ftp://snapshots.pdbj.org/20220103/. The snapshot of EMDB Core Archive contains map files and their metadata within XML files for both released and obsoleted entries (18059 and 254, respectively) and is 4.5 TB in size.