To better support the increasing complexity and size of data submitted to the PDB archive, the wwPDB Deposition & Biocuration system is based on the PDBx/mmCIF data dictionary and file format. The system accepts, processes and distributes PDBx/mmCIF data files.
This document describes available tools for generating PDBx/mmCIF format files and how to prepare "large structures" for deposition. A large structure is defined as having more than 99,999 atoms and/or more than 62 polymer chains, which are the restrictions of the traditional PDB format.
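The size limits above can be expressed as a small check. This is a minimal sketch; the function and constant names are illustrative and are not part of any wwPDB tool. The atom limit follows from the five-digit atom number field, and the chain limit from the 62 single characters (A-Z, a-z, 0-9) available as chain IDs.

```python
# Limits of the traditional PDB format, as stated in the text above.
MAX_PDB_ATOMS = 99_999   # largest value that fits a five-digit atom number
MAX_PDB_CHAINS = 62      # 26 uppercase + 26 lowercase + 10 digits


def is_large_structure(atom_count: int, polymer_chain_count: int) -> bool:
    """Return True if a model exceeds either traditional PDB format limit."""
    return atom_count > MAX_PDB_ATOMS or polymer_chain_count > MAX_PDB_CHAINS
```

A model for which this returns True cannot be written as a single standard PDB format file and is best deposited as PDBx/mmCIF.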
Depositors are encouraged to use the PDBx/mmCIF format for coordinate files whenever possible.
PDBx/mmCIF is the official working format of the wwPDB for coordinate files. It is flexible, extensible, and can accommodate structures of any size. PDBx/mmCIF files ready for deposition can be generated using the program pdb_extract or selected structure refinement programs. Additional information about the PDBx/mmCIF format can be found in this FAQ.
PDBx/mmCIF format is especially useful:
Recent versions of refinement packages Phenix (version 1.8.2+) and REFMAC (version 5.8+) generate PDBx/mmCIF files ready for deposition:
The pdb_extract program is available as an online interface and as a standalone command-line program. This powerful tool extracts and harvests data in PDBx/mmCIF format from structure determination programs.
The best format for depositing your structure is PDBx/mmCIF, particularly for large structures. While the traditional PDB file format may be relatively easy to manipulate manually, it cannot accommodate large structures that comprise more than 99,999 atoms and/or more than 62 chains.
The program pdb_extract can translate PDB format files with one- or two-letter polymer chain IDs into a single PDBx/mmCIF file if they meet the following requirements.
Each polymer chain in a coordinate file represents a biopolymer in the experimental sample. A protein chain with a stretch of unmodeled amino acids in the middle of its sequence is still a single chain, and both modeled portions should have the same chain ID (chain A, for example). A protein molecule that has been physically cut in half by proteolysis, however, should be represented as two chains (chain A and chain B, for example).
Each polymer chain must have a unique chain ID. Permissible characters are uppercase letters A-Z, lowercase letters a-z, and numbers 0-9. PDB format allows only single-character chain IDs, while PDBx/mmCIF can accommodate chain IDs of up to four characters. pdb_extract, which converts PDB to PDBx/mmCIF format, can read pseudo-PDB files containing two-character chain IDs.
Ligands, ions, and solvent molecules can be deposited with any chain ID, but will have their chain IDs automatically re-assigned during processing to match the chain ID of the nearest polymer chain.
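The chain ID requirements above (unique IDs, characters limited to A-Z, a-z and 0-9) can be checked before conversion. The following is a rough sketch, not the validation that pdb_extract itself performs; the function name and the `max_len` parameter are assumptions made for illustration. The default of two characters reflects the pseudo-PDB input that pdb_extract can read, and `max_len=4` would match the PDBx/mmCIF limit.

```python
import string
from collections import Counter

# Characters permitted in chain IDs according to the text above.
ALLOWED_CHARS = set(string.ascii_letters + string.digits)


def check_chain_ids(chain_ids, max_len=2):
    """Return a list of human-readable problems found in polymer chain IDs."""
    problems = []
    for chain_id, count in Counter(chain_ids).items():
        if count > 1:
            problems.append(f"{chain_id!r} is used by {count} chains")
        if not 1 <= len(chain_id) <= max_len:
            problems.append(f"{chain_id!r} is not 1 to {max_len} characters long")
        if not set(chain_id) <= ALLOWED_CHARS:
            problems.append(f"{chain_id!r} contains characters other than A-Z, a-z, 0-9")
    return problems
```

An empty list means that the IDs satisfy these rules; the check says nothing about whether each chain really corresponds to one biopolymer in the sample.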
To help convert multiple files into a single large structure file, pdb_extract accepts as input PDB files that bend the standard PDB format. An ATOM record in a properly formatted PDB file has a consecutive, right-justified atom number constrained to fit within column 2, which has a width of five digits (character positions 7-11), and a single-character chain ID in column 5 (character position 22), as in the following example:
ATOM  91563  OE1 GLU A 373       4.449  58.856  -2.941  1.00 85.83           O
pdb_extract will accept input in which the atom number, while still constrained by the 5-digit limit of column 2, can be arbitrary and/or non-consecutive. In addition, the chain ID can be either a single character as above or two characters, as shown below:
ATOM  00A4B  OE1 GLUAA 373       4.449  58.856  -2.941  1.00 85.83           O
pdb_extract will assign new atom numbers and separate residue labels and chain IDs from each other where they run together (as in the above example). This allows pdb_extract to accept input with an unlimited number of atoms and up to 3844 (62*62) chains.
While the PDBx/mmCIF format can accommodate chain IDs of up to four characters in length, it is advisable, due to current limitations of pdb_extract and some visualization tools, to limit chain IDs to one or two characters.
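To illustrate how such relaxed input can be read, the sketch below slices an ATOM line by character position and assigns fresh atom numbers. It is not pdb_extract's implementation. It assumes that a two-character chain ID occupies character positions 21-22, directly after the residue name, which is why the two run together; the function names and dictionary keys are invented for this example.

```python
import string

# Two-character IDs built from A-Z, a-z and 0-9: 62 * 62 combinations.
CHAIN_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits
MAX_TWO_CHAR_IDS = len(CHAIN_CHARS) ** 2


def parse_pseudo_pdb_atom(line):
    """Split an ATOM/HETATM line into fields using fixed character positions."""
    return {
        "original_serial": line[6:11].strip(),  # may be arbitrary, e.g. "00A4B"
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        # Position 21 is blank for one-character IDs (assumption stated above).
        "chain_id": line[20:22].strip(),
        "res_seq": int(line[22:26]),
    }


def renumber(lines):
    """Parse coordinate lines and give each atom a new consecutive number."""
    records = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            record = parse_pseudo_pdb_atom(line)
            record["new_serial"] = len(records) + 1  # not limited to five digits
            records.append(record)
    return records


SINGLE = "ATOM  91563  OE1 GLU A 373       4.449  58.856  -2.941  1.00 85.83           O"
DOUBLE = "ATOM  00A4B  OE1 GLUAA 373       4.449  58.856  -2.941  1.00 85.83           O"
records = renumber([SINGLE, DOUBLE])
```

Because the new atom numbers are plain integers rather than a five-character field, they can grow past 99,999 once the data are written out as PDBx/mmCIF.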
Observe all column restrictions. The PDB format has very rigid column-width and justification rules (see the example below; the PDB format specification gives the details). Neither the wwPDB deposition interface nor pdb_extract will correctly read a PDB format file that does not observe the format's column restrictions. There is one exception to this rule, detailed in the section "Two-letter chain ids rules" above.
ATOM      1  N   MET A   0      67.840  45.068  47.509  1.00 70.12           N
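One way to respect the column rules is to generate coordinate lines with fixed-width formatting rather than joining values with spaces. The sketch below follows the standard ATOM record layout (atom number in positions 7-11, chain ID in position 22, coordinates in positions 31-54); the function names are illustrative, and alternate-location, insertion-code and charge fields are left blank for brevity.

```python
def format_atom_record(serial, name, res_name, chain_id, res_seq,
                       x, y, z, occupancy, b_factor, element, record="ATOM"):
    """Build one fixed-column ATOM/HETATM line (78 characters)."""
    # Names of atoms with a one-letter element symbol start in position 14;
    # all other names start in position 13.
    if len(element) == 1 and len(name) < 4:
        name_field = f" {name:<3}"
    else:
        name_field = f"{name:<4}"
    return (
        f"{record:<6}{serial:>5} {name_field} {res_name:>3} {chain_id:1}"
        f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
        f"          {element:>2}"
    )


def parse_atom_record(line):
    """Read the same fields back by character position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


line = format_atom_record(1, "N", "MET", "A", 0,
                          67.840, 45.068, 47.509, 1.00, 70.12, "N")
fields = parse_atom_record(line)
```

Parsing by position rather than by splitting on whitespace is what makes run-together fields, such as large negative coordinates, readable at all, and it is also why a misaligned file cannot be read correctly.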
Insert TER cards only at the ends of polymer chains. In a PDB file, a line starting with or containing only "TER" signifies the termination of a polymer chain. A TER card is required at the end of a polymer chain. A TER card should not be placed in the middle of a polymer chain (regardless of the size of the gap in the sequence), nor should a TER card be preceded by any HETATM records (ligands, ions, solvent atoms) that share the same chain ID. For an example of correct TER card placement between polymer chains A and B:
ATOM   1563  OE1 GLU A 373       4.449  58.856  -2.941  1.00 85.83           O
ATOM   1564  OE2 GLU A 373       4.119  57.934  -4.918  1.00 95.09           O
ATOM   1565  OXT GLU A 373       8.013  55.105  -6.685  1.00 95.09           O
HETATM 3133 ZN    ZN A 401      -2.320  35.058  -4.024  1.00 70.61          ZN
ATOM   1567  N   ASN B 190     -28.191  85.252  -7.869  1.00 60.21           N
ATOM   1568  CA  ASN B 190     -27.762  84.082  -7.010  1.00 68.43           C
ATOM   1569  C   ASN B 190     -28.219  82.658  -7.477  1.00 72.07           C
Programs (including the wwPDB deposition interface) that read PDB files rely on proper placement of TER cards to parse coordinates correctly. Improper placement of TER cards can result in numerous problems, including (but not limited to) inclusion of ligands within a protein sequence or omission of entire polymer chains during parsing.
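The TER rules described above lend themselves to a simple pre-deposition check. The following sketch is only an illustration, not the logic used by the wwPDB deposition interface. It flags a chain change without a TER card, a TER card inside a chain, and a TER card directly preceded by a HETATM record of the same chain. It treats every HETATM record as a ligand, ion or solvent atom, so modified residues inside a polymer would need extra handling. The function name and the short sample lines are assumptions made for the example.

```python
def check_ter_placement(lines):
    """Return (line_number, message) pairs for suspicious TER placement."""
    problems = []
    prev_kind = None     # record type of the previous ATOM/HETATM/TER line
    prev_chain = None    # chain ID on the previous ATOM/HETATM line
    open_chain = None    # chain ID of the most recent ATOM record
    terminated = True    # True while no polymer chain is awaiting its TER
    for number, line in enumerate(lines, start=1):
        kind = line[:6].strip()
        if kind == "ATOM":
            chain = line[20:22].strip()
            if not terminated and chain != open_chain:
                problems.append((number, f"chain {open_chain} has no TER before chain {chain}"))
            if terminated and chain == open_chain:
                problems.append((number, f"TER placed inside chain {chain}"))
            open_chain, terminated = chain, False
            prev_kind, prev_chain = kind, chain
        elif kind == "HETATM":
            prev_kind, prev_chain = kind, line[20:22].strip()
        elif kind == "TER":
            if prev_kind == "HETATM" and prev_chain == open_chain:
                problems.append((number, f"TER preceded by HETATM of chain {open_chain}"))
            terminated = True
            prev_kind = kind
    if not terminated:
        problems.append((len(lines), f"chain {open_chain} has no TER"))
    return problems


# Illustrative, truncated lines; the TER placement shown is an assumption.
well_formed = [
    "ATOM   1565  OXT GLU A 373",
    "TER    1566      GLU A 373",
    "ATOM   1567  N   ASN B 190",
    "TER",
    "HETATM 3133 ZN    ZN A 401",
    "END",
]
issues = check_ter_placement(well_formed)
```

For the sample above the returned list is empty. Removing the first TER line, or moving the HETATM line directly in front of it, makes the function report a problem.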
Do not use MODEL (or ENDMDL) records. MODEL records and their accompanying ENDMDL records are designed for the representation of NMR ensembles (superimposed collections of structurally identical but conformationally diverse models) and should not be used in the representation of electron microscopy models (unless an NMR-style conformational ensemble is intended). Different polymer chains should have unique chain IDs and be terminated using TER cards, not bracketed between MODEL and ENDMDL records.
Remove header information from starting structures. If an existing PDB format file or files have been used as a starting point for fitting, remove any header information before starting, or any residual header information that might be left over after fitting. This includes everything from HEADER through SCALE3 inclusive, i.e., everything above the first ATOM record.
There can be only one END card at the end of the file.
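The last three rules can be applied mechanically to a file that has been read into a list of lines. The sketch below is one possible approach under simple assumptions: it drops everything above the first coordinate-section record, keeps a single END card at the end, and only warns about MODEL and ENDMDL records, since removing them safely may require new chain IDs. The function name is hypothetical.

```python
def tidy_pdb_lines(lines):
    """Return (cleaned_lines, warnings) for a PDB file given as a list of lines."""
    coordinate_records = ("ATOM", "HETATM", "MODEL")
    # Header information is everything above the first coordinate-section record.
    first = next(
        (i for i, line in enumerate(lines) if line.startswith(coordinate_records)),
        len(lines),
    )
    cleaned, warnings = [], []
    for number, line in enumerate(lines[first:], start=first + 1):
        kind = line[:6].strip()
        if kind == "END":
            continue  # dropped here; exactly one END is appended below
        if kind in ("MODEL", "ENDMDL"):
            warnings.append(f"line {number}: {kind} record should not be used")
        cleaned.append(line)
    cleaned.append("END")
    return cleaned, warnings


example = [
    "HEADER    EXAMPLE",
    "CRYST1",
    "SCALE3",
    "ATOM      1  N   MET A   0",
    "TER",
    "END",
    "END",
]
cleaned, warnings = tidy_pdb_lines(example)
```

Files that trigger a warning still need manual attention: each polymer chain should receive its own chain ID and TER card before the MODEL and ENDMDL lines are deleted.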