INTRODUCTION

The Protein Data Bank Exchange/macromolecular Crystallographic Information Framework (PDBx/mmCIF) provides the foundation for the deposition, annotation, and archiving of structural data from a range of experimental techniques.

PDBx/mmCIF uses data blocks to organize related information and data. A data block is a logical partition of a data file, opened by a data_ record. A data block may be named by appending a text string to the data_ record, and it is terminated by either the next data_ record or the end of the file.

An example of the data block identifier at the beginning of the model file for PDB entry 4HHB:

data_4HHB
#
_entry.id	4HHB
#
_audit_conform.dict_name	mmcif_pdbx.dic
_audit_conform.dict_version	5.367
_audit_conform.dict_location	http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic
#
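
The partitioning rule described above can be illustrated with a minimal Python sketch. The function name split_data_blocks and the second block name EXAMPLE2 are hypothetical, introduced only for illustration. The sketch assumes that no line inside a multi-line text value happens to start with data_, which a full parser would have to check.

```python
def split_data_blocks(text):
    """Map each data block name to the list of lines that belong to it."""
    blocks = {}
    current = None
    for line in text.splitlines():
        if line.lower().startswith("data_"):
            # The text appended to "data_" is the block name.
            current = line[len("data_"):].strip()
            blocks[current] = []
        elif current is not None:
            # Every other line belongs to the most recently opened block.
            blocks[current].append(line)
    return blocks


# EXAMPLE2 is a made-up second block used to show where 4HHB ends.
example = """data_4HHB
#
_entry.id 4HHB
#
data_EXAMPLE2
_entry.id EXAMPLE2
"""

blocks = split_data_blocks(example)
print(list(blocks))
```

Here the 4HHB block ends where the next data_ record begins, and the last block ends at the end of the input.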

The PDBx/mmCIF format uses the ASCII character set. Every data item is identified by a name that begins with the underscore character and consists of a category name followed by an attribute name. The category name is separated from the attribute name by a period.

An example of a PDBx/mmCIF data item name (_category.attribute):

_entity.id
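
As a small illustration of this naming rule, the following Python sketch splits an item name into its category and attribute parts. The helper name split_item_name is hypothetical, and the sketch only checks the underscore and period conventions described above, not whether the item exists in the dictionary.

```python
def split_item_name(name):
    """Split '_category.attribute' into ('category', 'attribute')."""
    if not name.startswith("_"):
        raise ValueError(f"item name must begin with an underscore: {name!r}")
    # The first period separates the category name from the attribute name.
    category, sep, attribute = name[1:].partition(".")
    if not sep or not category or not attribute:
        raise ValueError(f"expected _category.attribute, got {name!r}")
    return category, attribute


print(split_item_name("_entity.id"))
print(split_item_name("_cell.length_a"))
```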

Data items are presented in two styles: key-value and tabular.

An example of the key-value style, where each PDBx/mmCIF item name is followed directly by its value:

_cell.entry_id	4HHB
_cell.length_a	63.150
_cell.length_b	83.590
_cell.length_c	53.800
_cell.angle_alpha	90.00
_cell.angle_beta	99.34
_cell.angle_gamma	90.00
_cell.Z_PDB	4
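
A minimal Python sketch of reading the key-value style is shown here. The function name parse_key_values is hypothetical, and the sketch assumes that every value is a single unquoted token on the same line as its item name, as in the _cell example above. Values are kept as strings, and conversion to numbers is left to the caller.

```python
def parse_key_values(text):
    """Read 'name value' lines into a dict, skipping blanks and # lines."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of white space (tabs or spaces).
        name, value = line.split(None, 1)
        record[name] = value.strip()
    return record


cell_text = """
_cell.entry_id 4HHB
_cell.length_a 63.150
_cell.length_b 83.590
_cell.length_c 53.800
_cell.angle_alpha 90.00
_cell.angle_beta 99.34
_cell.angle_gamma 90.00
_cell.Z_PDB 4
"""

cell = parse_key_values(cell_text)
print(cell["_cell.entry_id"], float(cell["_cell.length_a"]))
```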

An example of the tabular style, used when there are multiple values for each item. In this style, a loop_ record is followed by a list of data item names, one per line, and then by rows of white-space-delimited data values:

loop_
_audit_author.name
_audit_author.pdbx_ordinal
_audit_author.identifier_ORCID
'Fermi, G.'	1	0000-000x-xxxx-xxxx
'Perutz, M.F'	2	0000-000x-xxxx-xxxx
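
The tabular style can be read with a short Python sketch such as the one below. The function name parse_loop is hypothetical. The sketch assumes that each row of values sits on a single line, and it uses the standard-library shlex.split to handle quoted values, which only approximates the mmCIF quoting rules (for example, shlex treats a backslash as an escape character).

```python
import shlex


def parse_loop(text):
    """Turn one loop_ section into a list of {item name: value} dicts."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue
        if line.startswith("_") and not rows:
            # Item names come first, one per line.
            names.append(line)
        else:
            # Each later line is one row; quotes keep 'Fermi, G.' together.
            values = shlex.split(line)
            rows.append(dict(zip(names, values)))
    return rows


loop_text = """
loop_
_audit_author.name
_audit_author.pdbx_ordinal
_audit_author.identifier_ORCID
'Fermi, G.' 1 0000-000x-xxxx-xxxx
'Perutz, M.F' 2 0000-000x-xxxx-xxxx
"""

authors = parse_loop(loop_text)
for row in authors:
    print(row["_audit_author.pdbx_ordinal"], row["_audit_author.name"])
```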

The hash symbol (#) is used to separate categories to improve readability, but is not strictly necessary. It is also used to indicate comments.

Numbers and single-word data values (i.e., those not containing white space) are listed by themselves:

_cell.length_a	63.150

A single value composed of multiple words separated by white space needs to be quoted:

_audit_author.name	'Fermi, G.'

A single value that spans multiple lines can be given as a text field delimited by a pair of semicolons, each of which is the first character of its line:

loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_target_identifier
1 'polypeptide (L)' no no
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD
LHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
;VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD
LHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
;
A,C ?
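
The three ways of writing a value (bare, quoted, and semicolon-delimited) can be handled by one tokenizer, sketched here in Python. The function name tokenize and the shortened sequence in the example are illustrative assumptions. The sketch ignores anything that follows the closing semicolon on its line and does not handle comments that trail a value.

```python
import re

# A quoted string that ends at a quote followed by white space or the
# end of the line, or otherwise a run of non-white-space characters.
TOKEN = re.compile(r"'(.*?)'(?=\s|$)" r'|"(.*?)"(?=\s|$)' r"|(\S+)")


def tokenize(text):
    """Return the data values in text as a flat list of strings."""
    tokens = []
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(";"):
            # A semicolon in the first column opens a multi-line value;
            # collect lines until the next line that starts with ";".
            parts = [line[1:]]
            for more in lines:
                if more.startswith(";"):
                    break
                parts.append(more)
            tokens.append("\n".join(parts))
            continue
        if line.lstrip().startswith("#"):
            continue
        for match in TOKEN.finditer(line):
            tokens.append(next(g for g in match.groups() if g is not None))
    return tokens


# The sequence is shortened to its first 20 residues for illustration.
example = """1 'polypeptide (L)' no no
;VLSPADKTNV
KAAWGKVGAH
;
A,C ?
"""

tokens = tokenize(example)
sequence = tokens[4].replace("\n", "")
print(tokens[1], sequence, tokens[5])
```

Line breaks inside the semicolon-delimited value are kept by the tokenizer, so the caller removes them when the value is a sequence.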

Two special characters serve as placeholders for mmCIF item values that cannot be explicitly assigned. The question mark (?) marks an item value as missing. The period (.) indicates that there is no appropriate value for the item or that a value has been intentionally omitted.
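
A reader of mmCIF data typically needs to tell these two placeholders apart from real values. The Python sketch below shows one way to do so. The function name interpret_value and the labels it returns are hypothetical choices, and the sketch assumes that the raw token was not quoted in the file.

```python
def interpret_value(raw):
    """Classify a raw mmCIF token as missing, not applicable, or a value."""
    if raw == "?":
        # The value exists in principle but was not provided.
        return ("missing", None)
    if raw == ".":
        # No appropriate value, or the value was intentionally omitted.
        return ("not_applicable", None)
    return ("value", raw)


for raw in ["14981.087", "?", "."]:
    print(raw, interpret_value(raw))
```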

The PDBx/mmCIF Dictionary supports primary data types (integers, real numbers, and text), defines boundary conditions and controlled vocabularies, and provides the ability to link data items together to express relationships (e.g., parent-child related data items).

For example, the entity identifier assigned to a molecule in the "parent" _entity data category (shown here) is referred to in the "child" categories (as shown in the subsequent example):

loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.pdbx_ec
_entity.pdbx_mutation
_entity.pdbx_fragment
_entity.details
1	polymer	nat	'Hemoglobin subunit alpha'	14981.087	1	?	?	?	?
2	polymer	nat	'Hemoglobin subunit beta'	16032.274	1	?	?	?	?
3	non-polymer	syn	'PROTOPORPHYRIN IX CONTAINING FE'	616.487	2	?	?	?	?
4	water	nat	water	18.015	93	?	?	?	?

Example of a "child" category describing the source information for each polymer entity listed in the above "parent" category:

loop_
_entity_src_nat.entity_id
_entity_src_nat.pdbx_src_id
_entity_src_nat.pdbx_alt_source_flag
_entity_src_nat.pdbx_beg_seq_num
_entity_src_nat.pdbx_end_seq_num
_entity_src_nat.common_name
_entity_src_nat.pdbx_organism_scientific
_entity_src_nat.pdbx_ncbi_taxonomy_id
1	1	sample	1	140	horse	'Equus caballus'	9796
2	1	sample	1	146	horse	'Equus caballus'	9796
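
The parent-child relationship shown in the two examples above can be checked programmatically. The Python sketch below uses the entity identifiers from those examples. The function name find_orphans is hypothetical, and the row with entity_id 9 is an invented bad row added only to show what a broken reference looks like.

```python
def find_orphans(parent_ids, child_rows, key):
    """Return child rows whose key value does not occur in the parent."""
    return [row for row in child_rows if row[key] not in parent_ids]


# Identifiers from the "parent" _entity category.
entity_ids = {"1", "2", "3", "4"}

# Rows from the "child" _entity_src_nat category (selected items only).
entity_src_nat = [
    {"entity_id": "1", "pdbx_organism_scientific": "Equus caballus"},
    {"entity_id": "2", "pdbx_organism_scientific": "Equus caballus"},
]

print(find_orphans(entity_ids, entity_src_nat, "entity_id"))

# Hypothetical row that refers to an entity that does not exist.
bad_rows = entity_src_nat + [{"entity_id": "9", "pdbx_organism_scientific": "?"}]
orphans = find_orphans(entity_ids, bad_rows, "entity_id")
print([row["entity_id"] for row in orphans])
```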

The primary PDBx/mmCIF resource, mmcif.wwpdb.org, contains all relevant data dictionaries and documentation, as well as a detailed description of the format's development and history. The sections below present some key PDBx/mmCIF categories with descriptions and examples to help users understand and adopt the PDBx/mmCIF format.