A dataset of alternately located segments in protein crystal structures

Protein Data Bank (PDB) files list the relative spatial location of atoms in a protein structure as the final output of the process of fitting and refining to experimentally determined electron density measurements. Where experimental evidence exists for multiple conformations, atoms are modelled in alternate locations. Programs reading PDB files commonly ignore these alternate conformations by default leaving users oblivious to the presence of alternate conformations in the structures they analyze. This has led to underappreciation of their prevalence, under characterisation of their features and limited the accessibility to this high-resolution data representing structural ensembles. We have trawled PDB files to extract structural features of residues with alternately located atoms. The output includes the distance between alternate conformations and identifies the location of these segments within the protein chain and in proximity of all other atoms within a defined radius. This dataset should be of use in efforts to predict multiple structures from a single sequence and support studies investigating protein flexibility and the association with protein function.


Background & Summary
The diffraction pattern produced from the interaction of X-rays with protein molecules can be used to calculate an electron density map into which a model of the structure can be built and refined.Since the signal from a single molecule is far too weak for detection, proteins are first coerced, through crystallization, into an ordered, repeating lattice of molecules so that the diffraction from each molecule will constructively intensify the signal.Flexible and variable regions of the protein chain will lead to destructive interference, the absence of definable electron density and regions of the protein structure that cannot be modelled, at the extreme, or smeared electron density and a low model certainty, as represented by a high temperature factor otherwise known as the B-factor 1 .Rigid regions of the protein, having little variability between molecules in the crystal lattice, are characterised by high model certainty and low B-factor.The molecules in the crystal can also adopt several well-defined conformations such that the electron density is best described by modelling atoms into two or more discrete locations, commonly termed alternate locations (altlocs).
Proteins are dynamic molecules, transitioning between ensembles of conformations (recently reviewed in 2 ), and structural biology has recently been calling for methods that will push beyond the one sequence, one structure framework towards description and prediction of ensembles of conformations 3 .Although the conformationally explored space in crystals is limited by the constraints of the crystal packing, a largely overlooked resource for experimentally observed protein flexibility are X-ray crystal structures modelled with alternate conformations.Gutermuth et al. recently argued this resource may have remained unnoticed since to date most modelling approaches either ignore altlocs altogether or resolve them with simple heuristics 4 .Indeed, whilst common structural visualization programs such Chimerax or Pymol display alternate side chain locations, only a single, default backbone conformation is shown unless the user specifically calls up the hidden alternate conformation.In some cases, even tools dedicated to question of protein flexibility appear to overlook alternate conformations.The PDBFlex database was developed to provide information of protein flexibility as revealed through comparison of different PDB structures for the same protein 5 .Looking into the database, last updated in 2020, we were unable to find reference to or use of alternate conformations in single crystal protein structures.
More recently a group used machine learning to predict same sequence structure ensembles, again guided by cases of different PDB structures displaying conformational differences and originating from the same sequence, overlooking single crystal protein structures 6 .Altogether it seems that single crystal alternate backbone conformations are underused in attempts to characterize protein ensembles perhaps because the abundance of stable alternate conformations remains underrecognized.
The last years have seen efforts to develop tools for uncovering unmodeled alternately located atoms [7][8][9][10] , particularly relevant for older PDB entries before the development of refinement software capable of automatically detecting altlocs.However, there are few resources that collate data on altlocs, from across the PDB.One that does is the Alternate Location Server, part of the OCA browser-database for protein structure/function, that maintains an up-to-date list of PDB structures having alternate locations 11 .Although this list highlights the abundance of altlocs, including in backbone atoms, it provides very minimal data: a single line entry for each structure, noting the total number of residues modelled as altocs and the minimum, maximum and average distance between any backbone altlocs.To provide a more detailed resource useful for surveying the alternate conformation landscape and analyzing their prevalence in greater detail, we have created a custom dataset of alternately modelled backbone segments.The dataset is available through Harvard Dataverse 12 .

Methods
Raw data collection.Protein structures are collected from the Protein Data Bank (PDB) through a structured query against the polymer entity data API 13,14 .We queried for all entities in structures meeting the following criteria: (i) Method: X-Ray Diffraction; (ii) X-Ray Resolution ≤ 3.5 Å; (iii) R free ≤ 0.33, (iv) Number of chains ≤ 20.
Each query result contains a list of PDB IDs with an entity number (e.g.1ABC:1) matching the query criteria, and each entity within a PDB structure corresponds to one or more identical polypeptide chains which exist in the structure.Note that a structure may have more than one unique entity (e.g., 1ABC:1 and 1ABC:2) in which case we would obtain both.For each unique entity ID, we then obtain its associated chains (e.g., 1ABC:A and 1ABC:C), and include all of them in the dataset.In cases where the chain ID in the PDB files (author chains) do not match the canonical chain IDs assigned by the PDB, we map between the author and PDB chains, such that our dataset will contain only the canonical IDs.altloc collection.We use BioPython 15 to parse the PDB structure files and extract the residues and atom locations from the collected chains.For each atom in the structure, we parse all available alternate locations (altlocs) from the file.The altlocs are usually labelled with capital letters starting from ' A' .In cases where a structure has a non-standard altloc labelling, we sort the labels lexicographically and relabel them starting from ' A' .In such cases, the dataset will denote these altlocs as e.g.' A(Z)' in the altloc name columns, meaning that the altloc with original label Z is denoted by A in the dataset's other columns.This relabeling helps keep the column names consistent across different structures.
Aligning to uniprot sequences.We align the amino acid sequence of each chain to the Uniprot 16 record sequence to provide a Uniprot index for each residue in the chain.We query the PDB's entry data API 14 and examine the metadata to construct a mapping from the specific chain to a list of Uniprot IDs.Whilst most chains map to a single Uniprot ID, there are cases of synthetic proteins which have no associated Uniprot ID, and other cases where a chain is chimeric i.e. contains sections from multiple different proteins.We discard such cases and keep only chains which map to a unique Uniprot ID.The alignment is performed using BioPython's default pairwise alignment algorithm.We used BLOSUM80 as the substitution matrix for the alignment, a gap-opening penalty of − 10 and a gap-extension penalty of −0.5.

Backbone locations and dihedral angles per altloc. For each PDB chain, we calculate the backbone angles ( , )
ϕ ψ , per altloc.To calculate a dihedral angle at altloc X, we take the X-altloc coordinate of all atoms participating in the calculation (from the current and previous/next residue).In case the atoms required for dihedral angle calculation from either the previous or next residue do not have the current altloc, we use the single set of coordinates modelled at that location in the calculation of the dihedral angles of all the current altlocs.The dataset also always includes the dihedral angles calculated with all atoms at their default positions, i.e. ignoring altlocs.For each of the backbone atoms in each residue, we also collect its XYZ coordinates under each of the altlocs which exist for it.Finally, we use DSSP to assign a secondary structure per residue.

B-factors, location standard deviations and distances between altlocs.
We calculate the b-factor per residue, by averaging the b-factors of the N, CA and C backbone atoms.This is performed using the default atom positions, and additionally for each altloc which is defined for all three atoms.
For the CA atom, we also calculate, per altloc, the standard deviation in its location and distance from other altlocs, in Angstroms.The standard deviation is obtained from the b-factor B X of altloc X by σ π = B /8 For each pair of altlocs X and Y, we then calculate the distance in Angstroms between the alpha-carbons, , where p CA X , is the location of the alpha-carbon under altloc X.We also calculate this distance in units of the standard deviation, which is given by . Finally, we calculate the peptide bond length between adjacent residues under each pair of altlocs of the current residue's carbon and the next residue's nitrogen.

contacts.
We calculate the contacts between all atoms of a residue, under all its altlocs, and all other atoms in the PDB structure, also under all possible altlocs.The per-atom contacts are then aggregated to the residue level for inclusion in the dataset.
First, we collect the set of locations of all atoms in the structure, under all altlocs.Next, we iterate over each residue in the chain, each atom within it, and each altloc defined for that atom.Given the location of this altloc atom as a source, we calculate the distance to each target atom in the set of all locations.Two atoms are defined as in contact when their distance is below a threshold of 5 Angstroms.Each detected contact is then classified into one of three types: regular AA contact, out-of-chain (OOC) contact or ligand contact, depending on the identity and chain of the target atom.Contacts from all atoms of the current residue are collected into one of three lists of contacts for that residue, based on this classification.The minimum distance is calculated across atoms belonging to the same source and target residue.Hydrogen atoms and water molecules are always excluded even if they are modeled in the structure.codon assignment.Since the exact genetic sequence of the protein is not annotated in the PDB we assigned codons from the native sequence following the procedure described in our prior work 17 .Given a PDB chain, we obtain its unique Uniprot ID from the previous step.We query Uniprot to obtain all cross-referenced IDs to the European Nucleotide Archive (ENA).From the ENA database, we obtain all available genetic sequences for the specific protein, translate each genetic sequence to an amino-acid sequence using the standard genetic code table, and perform pairwise sequence alignment between the PDB chain's amino-acid sequence and the translated genetic sequences.The alignment is performed using BioPython using the same options as in the previous section.
Following the pairwise alignment of the amino acid sequence to all translated genetic sequences, we obtain the aligned codons from each sequence and assign them to corresponding residues from the PDB chain.This process yields zero or more assigned codons per residue in the PDB chain.In cases where there is more than one codon (i.e., different genetic sequences contributed different codons), we choose the most common, and reflect this ambiguity by assigning a codon score which is the proportion of genetic sequences that contributed the assigned codon.
Removal of low-quality structures.We used the R-factor to remove structures with a potentially poor fit to the electron density.The intersection of two criteria was used to define a structure as admissible: (1) R work ≤ 0.98 R free ; and (2) R free ≤ min{0.3,max{0.2,resolution-dependent cut-off}}.The resolution-dependent cut-off was fitted as a monotone polynomial to the 90%-tile of R free estimated in 12 equiprobable resolution bins ranging from 0.5 to 3.5Å (Fig. 1).

Non-redundant cluster assignment.
To account for redundancy of the collected chains and proteins they originate from, we clustered the chains into non-redundant clusters using the amino acid sequence data.Clustering was performed using mmseq2 18 with minimum sequence identity threshold 0.5 and target coverage 0.8.Cluster identities were recorded in the metadata alongside with the chain identities.

Segmentation of contiguous altlocs.
For each of the collected chains containing altlocs, we grouped all altlocs with contiguous residue numbers into numerically numbered segments.Assignment as a segment requires the altlocs at every location within the segment.This means that in cases of altlocs broken by missing residues, this section will be counted as two segments (Fig. 1)

Data Records
The dataset is available through Harvard Dataverse 12 .The dataset contains amino acid-level data records and chain-level metadata records.entity_chains PDB chain identifiers of all chains in the structure which belong to the same polymer entity as the current chain.
entity_auth_chains As above but using the PDB deposition author's original chain identifiers instead of the PDB-assigned identifiers.
chain_entities Identifier (internal to the structure) of the polymer entity associated with this chain.
chain_to_auth_chain Deposition author-assigned name of this chain.
entity_sequence Canonical sequence of the protein in the standard one-letter code of amino acids.

num_altloc_segments
The number of segments of contiguous residues harboring altlocs within the chain.

cluster_id
The mmseq2 cluster identity of the current chain sequence.
Table 3. Description of the column in the metadata records.
Fig. 2 Quantiles of the R free parameter across the collected structures as a function of resolution.The cut-off above which low-quality structures were rejected is plotted in thick black.
Where altlocs exist for a particular residue, additional columns are populated within the same row and include information pertaining to each altloc as described in Table 2. Italic characters, X and Y, in the column names shown in Table 2 are placeholders for actual altloc label names.We limit the altlocs to four (A, B, C, D).

Metadata.
A single comma-separated value (csv) file containing PDB structure-and chain-level metadata can be accessed as altloc_metadata.csv.
The dataset entries (rows) each represent a single chain.Columns are described in Table 3.

Technical Validation
Since our data is derived from the PDB 13 , our starting point are records that have already been validated through the standardised procedures of this well-established global repository.The quality and reliability of crystallographic structures is commonly assessed by the R-factor that measure of the goodness of fit between the model and the experimental X-ray diffraction data for the refined data.A small percent of data is left out of refinement and an analogous R free is calculated to assess against the R factor for potential overfitting biases.Figure 2 shows the resolution-dependent R free thresholds we used to exclude models with a poor fit the experimental data.As a validation step, we verified that our collected data follows expected trends with relation to altloc abundance.Figure 3 shows the distribution of resolutions for structures deposited in PDB during different time brackets showing that structures of resolutions between 1.5A and 2.5A constituted a large and similar fraction of structures deposited during the period since 1995.Focusing on this resolution range, Fig. 4 shows how the fraction of structures modelled with an altloc, particularly short segments of 1-2 amino acids, progressively increased until 2010.This rise would be expected as modelling refinement techniques improved and became more automated 19,20 .Another clear and expected trend is the increase in altlocs with resolution as shown in Fig. 5.
Manual verification of the correspondence between collected contact distances and those observed in visualisation of the protein structure was carried out as shown in Fig. 6.We note here that our current dataset analyses    the contents of the asymmetric unit cell as these coordinates are explicitly available with the PDB file.A current limitation of this dataset is that it does not identify contacts made within the crystal lattice.

Fig. 1
Fig.1Example of broken altloc chain shown on the crystal structure 1VYO (Avidin at 1.48Å resolution).Altloc B (red) is not modelled between residues 37 and 41 and is therefore counted as two separate segments in our data set.

Fig. 3
Fig. 3 Absolute and relative histograms of the resolution of the retained structures by structure deposition date.

Fig. 4
Fig.4 Absolute and relative histograms of backbone altloc segment lengths in the resolution range 1.5-2.5Åby structure deposition date.Any refers to any of the collected chains regardless of altloc presence; Altloc refers to the presence of any altloc; Backbone ≥ n refers to the presence of a segment of length ≥ n containing alternate locations for CA atoms.We observe a steady and sharp increase in the fraction of structures with modelled altlocs in years preceding 2010 which we attribute to the improvement of experimental techniques and data analysis software.Note that the fraction of structures with long altloc segments (≥3) increases only slightly, probably since such segments are more readily modelled with older protocols.

Fig. 5
Fig.5 Absolute and relative histograms of backbone altloc segment lengths by resolution.Any refers to any of the collected chains regardless of altloc presence; Altloc refers to the presence of any altloc; Backbone ≥ n refers to the presence of a segment of length ≥ n containing alternate locations for CA atoms.Shown are distributions of individual chains (first row), and non-redundant clusters (second row).Note that the fraction of modelled altloc segments (especially the long ones) consistently increases with better resolution.We attribute this trend to the finer ability to discern between truly multi-modal distribution of the electron density (altlocs), and the unimodal high B-factor case.About 4% of segments of length 2 are peptide flips.

Table 1 .
Description of the column in the data records.
contact_count_XTotal number of X-contacts (as defined in the preceding sections) between the current residue's atoms and atoms of other residues in the structure.42 contact_types_X Type of contacts this residue's atoms make.Currently not implemented.Always 'proximal' .proximal contact_smax_X Maximum distance, in the amino acid sequence, of any of the current residue's X-contacts.24 contact_ooc_X Comma-delimited list of X-contacts between the current residue's atoms and atoms of a residue in another chain.Contacts specified as: 'chain_id:residue_ name:residue_index:target_altloc:distance' .Target altloc appears as ~ if target has only one modeled location.A:T:9:B:3.97,C:M:88:A:4.03 contact_non_aa_X Comma-delimited list of X-contacts between the current residue's atoms and atoms of ligand in the structure.Contact is specified as above, with ligand instead of residue name.A:GYS:66:~:4.95,A:EOH:242:B:3.98 contact_aas_X Comma-delimited list of X-contacts between the current residue's atoms and atoms of a residue in the same chain.Contacts specified as above.A:T:9:B:3.97,C:M:88:A:4.03 c_terminal_dist Distance in number of residues from the C-terminal.Note: if terminal regions contain unmodelled residues, the distance is under-estimated.10

Table 2 .
Description of the columns describing altlocs in the data records.For each row, columns, describing the residue, are available and are explained in Table1.List of the chemical component identifiers for all ligand interactions in the chain.ligandsList of the chemical component identifiers for all ligand interactions in the structure.
Data.A single comma-separated value (csv) file containing local altloc description for the entire dataset can be accessed as altloc_data.csv.The dataset entries (rows) each represent a single residue in a specific chain of a PDB structure.