Crystallographic model validation: from diagnosis to healing
Introduction
The general presumption behind crystallographic model validation is that in addition to explaining the experimental data, macromolecular coordinates must also be consistent with basic physics and chemistry and with prior experience of what holds true for molecular structures. If ailments are diagnosed, healing is called for.
Model validation as practiced since the early 1990s (reviewed in [1••]) has three primary components: geometry, conformation, and sterics. Traditionally, geometry includes covalent bond lengths and angles ([2] for proteins; [3] for nucleic acids), plus planarity and chirality where appropriate. Deviations from geometric ideality primarily reflect the weighting terms in refinement, so usually they affect atomic coordinates only slightly. However, WhatCheck utilizes overall directional bond-length deviations to diagnose errors in unit cell dimensions [4], and some combinations of bond angles are sensitive to local fitting errors, such as the Cβ deviation to backward-fit sidechains [5]. Conformation covers validity of dihedral angle values, assessed in multi-dimensional sets such as 2-D Ramachandran plots [6], up to 4-D protein sidechain rotamers [7], 2-D glycoMaps for carbohydrate linkages [8], and 7-D backbone conformers for RNA [9]. Sterics encompass non-covalent interactions: unsatisfied H-bonds [4], steric clashes as best measured with all hydrogens explicit [10], and preferred local environments around sidechains [11].
The above criteria are local in the model: each outlier is within a residue or between two residues. Global scores for an entire structure are obtained from the number of outliers, suitably normalized by total residues, parameters, or atoms. For criteria with well-behaved reference data distributions, such as covalent geometry, outliers are usually declared at 4σ, or Z-score = 4. Conformational outliers are traditionally defined by a boundary enclosing nearly all filtered reference data, such as outside the 99.95% or 1-in-2000 contour for Ramachandran outliers (equivalent to Z = 3.5) [1••]. Clashes are declared at overlap ≥0.4 Å, and overall ‘clashscore’ = clashes per 1000 atoms [12••]. Most global validation scores depend strongly on resolution, the single number most indicative of crystal structure quality. This relationship is sometimes made explicit by reporting percentile scores relative to PDB X-ray structures within a resolution cohort [12••].
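As a rough illustration of how the local criteria above roll up into global scores, the following minimal sketch uses the thresholds quoted in the text (overlap ≥0.4 Å for a clash, 4σ for covalent geometry). All function names are hypothetical, and the percentile helper assumes a simple "fraction of the cohort that scores worse" definition, which may differ from what any particular validation pipeline actually computes.

```python
import math

CLASH_CUTOFF = 0.4   # Å of van der Waals overlap that counts as a clash (from the text)
GEOM_Z_CUTOFF = 4.0  # covalent-geometry outliers are declared at 4 sigma (from the text)


def clashscore(overlaps, n_atoms):
    """Number of clashes (overlap >= 0.4 Å) per 1000 atoms."""
    n_clashes = sum(1 for overlap in overlaps if overlap >= CLASH_CUTOFF)
    return 1000.0 * n_clashes / n_atoms


def geometry_outliers(observed, ideal, sigma):
    """Indices of bonds or angles whose |Z| = |observed - ideal| / sigma reaches 4."""
    return [
        i
        for i, (x, mu, s) in enumerate(zip(observed, ideal, sigma))
        if abs(x - mu) / s >= GEOM_Z_CUTOFF
    ]


def two_sided_tail(z):
    """Fraction of a normal distribution lying beyond +/- z."""
    return math.erfc(z / math.sqrt(2.0))


def cohort_percentile(score, cohort_scores):
    """Percent of a resolution cohort with a worse (higher) score.

    Assumes lower-is-better scores such as clashscore; this definition
    is an illustrative assumption, not a documented formula.
    """
    worse = sum(1 for s in cohort_scores if s > score)
    return 100.0 * worse / len(cohort_scores)


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    print(clashscore([0.1, 0.45, 0.6, 0.39, 0.4], n_atoms=2000))
    print(geometry_outliers([1.53, 1.60], [1.525, 1.525], [0.02, 0.015]))
    print(1.0 / two_sided_tail(3.5))  # compare with the 1-in-2000 contour
    print(cohort_percentile(5.0, [2.0, 5.0, 8.0, 12.0, 20.0]))
```

The `two_sided_tail` helper is included only to show why a 1-in-2000 contour is described as roughly equivalent to Z = 3.5 under a normal approximation; Ramachandran distributions themselves are not Gaussian, so the contour is defined empirically from filtered reference data.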
Most validation outliers are errors, but a few are functionally significant anomalies where evolution has found it worthwhile to spend some net folding energy on stabilizing an unfavorable local conformation. Since active stabilization is required to hold a group in an unfavorable position, valid outliers should have clear electron density and be constrained by H-bonds or packing interactions [13] (Figure 1a).
Diagnosing problems is far more productive if anomalies can be distinguished from errors and a large fraction of the errors can be corrected, or ‘healed’. This reversed meaning of ‘molecular medicine’ is one motivation for many recent developments in model validation. The errors worth correcting in macromolecular models are qualitative misfittings into the wrong local minimum (Figure 1b). Rather than being random, these are nearly always systematic errors arising from a particular misinterpretation to which either people or programs are prone [13].
The earliest form of reliable, automated correction of model-fitting errors is for 180° ‘flips’ of sidechain amide or histidine ring orientations [14, 15, 16]. Flips are common because of indistinguishable electron density for N versus O or C — which also means they can be corrected without reference to diffraction data. Flip state is especially important for histidines, because it dominates the determination of protonation state [17]. Since 2002, the incidence of incorrect Asn/Gln/His flips in new PDB depositions worldwide has decreased by 45% [18•], a conspicuous win for the community.
A theoretical issue recognized by early contributors [19, 20] is that validation criteria are stronger if independent of the target functions used in refinement. This produces an inherent conflict between using all relevant information to produce the best attainable model and still being able to reliably evaluate the accuracy of the result. There are several partial answers to this dilemma. First, the argument applies to global measures, not to individual problems, and to complacency over good scores, not to distress over bad ones. Ideal covalent geometry does not reflect general model quality, but it is inexcusable to retain a 10σ bond-length outlier — those occur disturbingly often even at very high resolution if geometry terms are downweighted [18•]. The second point is that we should not (and probably cannot) prevent crystallographers from using all available information in order to succeed for difficult cases. Rather, validators should continue developing new criteria and methods, perhaps including full cross-validation enabled by the computational power to routinely perform parallel structure determinations with different information left out each time. The third, most important, point is that any one validation criterion can be ‘gamed’, but the entire set of model plus data criteria (at least for protein interiors, currently, at reasonable resolutions) provides cooperative constraints that cannot be satisfied simultaneously without having the right answer.
For the practitioner of model validation and healing, we suggest two guiding principles. The first, from the classic T-shirt of the 2010 Cold Spring Harbor crystallography course (Figure 1b), is ‘At least one person should look at the map!’ Even the best automated methods will occasionally make a silly mistake for some new case the programmer did not expect. The second principle is the ‘Zen of Model Anomalies’ [18•]:
- Consider each outlier and correct most.
- Treasure the meaningful, valid few.
- Live serenely with the small inscrutable remainder.
New access to concise and detailed quality metrics
The current era in crystallographic model validation and healing was initiated in 2008, when the worldwide Protein Data Bank (wwPDB) required deposition of structure factors and constituted the wwPDB X-ray Validation Task Force committee, the first of the wwPDB VTFs [21]. Reliable availability of diffraction data not only allows validation of the data itself, such as detection of twinning [22, 23], and of model-to-data match [24], but also enables better quality-filtering of reference data,
Future directions
We will soon see the appearance, and then the effects, of more complete, understandable, and obvious validation at the PDB. Similarly, we will see whether more dimensions are the right answer for validation, or only for design. High-productivity tools will deliver robust diagnosis and easy correction for previously neglected components — ligands, ions, waters, carbohydrates, nucleic acids, and alternate conformations. Model healing will be increasingly integrated into our automated protocols, to
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgement
Funding for this analysis is from NIH Research Grants R01-GM073930 and R01-GM073919.
References (70)
- et al. A new generation of crystallographic validation tools for the protein data bank. Structure (2011)
- et al. GlycoMapsDB: a database of the accessible conformational space of glycosidic linkages. Nucleic Acids Res (2007)
- et al. Quality control of protein models: directional atomic contact analysis. J Appl Crystallogr (1993)
- et al. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol (1999)
- et al. Doing molecular biophysics: Finding, naming, and picturing signal within complexity. Annu Rev Biophys (2013)
- et al. Crystallography Open Database: an open-access collection of crystal structures. J Appl Crystallogr (2009)
- et al. Advances, interactions, and future developments in the CNS, Phenix, and Rosetta structural biology software systems. Annu Rev Biophys (2013)
- et al. Inclusion of thermal motion in crystallographic structures by restrained molecular dynamics. Science (1990)
- et al. Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallogr (1991)
- et al. New parameters for the refinement of nucleic acid containing structures. Acta Crystallogr (1996)
- Errors in crystal structures. Nature
- Structure validation by Cα geometry: ω, χ and Cβ deviation. Proteins: Struct Funct Genet
- ProCheck: a program to check the stereochemical quality of protein structures. J Appl Crystallogr
- Rotamer libraries in the 21st century. Curr Opin Struct Biol
- RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium Contribution). RNA
- Visualizing and quantitating molecular goodness-of-fit: small-probe contact dots with explicit hydrogens. J Mol Biol
- MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr
- The penultimate rotamer library. Proteins: Struct Funct Genet
- Satisfying hydrogen bonding potential in proteins. J Mol Biol
- Positioning hydrogen atoms by optimizing hydrogen-bond networks in protein structures. Proteins: Struct Funct Genet
- Fitting tip#3: histidine flips and protonation. Comput Crystallogr Newsletter
- Stereochemical quality of protein structure coordinates. Proteins: Struct Funct Genet
- Validation of protein crystal structures. Acta Crystallogr
- The future of the Protein Data Bank. Biopolymers
- Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation. CCP4 Newsl
- Intensity statistics in twinned crystals with examples from the PDB. Acta Crystallogr
- phenix.model_vs_data: a high-level tool for the calculation of crystallographic model and data statistics. J Appl Crystallogr
- Between objectivity and subjectivity. Nature
- In defence of our science: validation now! Acta Crystallogr
- Retraction. Science
- Crystallography: crystallographic evidence for deviating C3b structure. Nature
- Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr
- Retraction: cocrystal structure of synaptobrevin-II bound to botulinum neurotoxin type B at 2.0 Å resolution. Nat Struct Mol Biol
- The Uppsala electron-density server. Acta Crystallogr
- Experimentally observed conformation-dependent geometry and hidden strain in proteins. Protein Sci