Crystallographic model validation: from diagnosis to healing

https://doi.org/10.1016/j.sbi.2013.06.004

Highlights

  • Robust evaluation: distinguishing errors from valid anomalies.

  • New validation at the PDB.

  • Updated infrastructure: reference data, parameters, and methods.

  • Diagnostics for more structure components.

  • Automated correction for protein sidechains and RNA backbone.

Model validation has evolved from a passive final gatekeeping step to an ongoing diagnosis and healing process that enables significant improvement of accuracy. A recent phase of active development was spurred by the worldwide Protein Data Bank requiring data deposition and establishing Validation Task Force committees, by strong growth in high-quality reference data, by new speed and ease of computations, and by an upswing of interest in large molecular machines and structural ensembles. Progress includes automated correction methods, concise and user-friendly validation reports for referees and on the PDB websites, extension of error correction to RNA and error diagnosis to ligands, carbohydrates, and membrane proteins, and a good start on better methods for low resolution and for multiple conformations.

Introduction

The general presumption behind crystallographic model validation is that in addition to explaining the experimental data, macromolecular coordinates must also be consistent with basic physics and chemistry and with prior experience of what holds true for molecular structures. If ailments are diagnosed, healing is called for.

Model validation as practiced since the early 1990s (reviewed in [1••]) has three primary components: geometry, conformation, and sterics. Traditionally, geometry includes covalent bond lengths and angles ([2] for proteins; [3] for nucleic acids), plus planarity and chirality where appropriate. Deviations from geometric ideality primarily reflect the weighting terms in refinement, so usually they affect atomic coordinates only slightly. However, WhatCheck utilizes overall directional bond-length deviations to diagnose errors in unit cell dimensions [4], and some combinations of bond angles are sensitive to local fitting errors, such as the Cβ deviation to backward-fit sidechains [5]. Conformation covers validity of dihedral angle values, assessed in multi-dimensional sets such as 2-D Ramachandran plots [6], up to 4-D protein sidechain rotamers [7], 2-D glycoMaps for carbohydrate linkages [8], and 7-D backbone conformers for RNA [9]. Sterics encompass non-covalent interactions: unsatisfied H-bonds [4], steric clashes as best measured with all hydrogens explicit [10], and preferred local environments around sidechains [11].
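Every conformational criterion named above (Ramachandran φ,ψ, sidechain χ rotamers, RNA backbone conformers) starts from dihedral angles computed from four consecutive atom positions. The following is a minimal sketch of that calculation using only the Python standard library; the function and helper names are illustrative assumptions and are not taken from any validation program cited here.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for atoms p0-p1-p2-p3.

    Convention: 0 = cis, 180 = trans, positive when the far bond is
    rotated clockwise from the near bond, looking along p1 -> p2.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane of p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = _dot(b2, _cross(n1, n2)) / b2_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For a protein residue, φ would use the atoms C(i-1), N, Cα, C and ψ would use N, Cα, C, N(i+1); the resulting angle pairs are what get compared against the reference-data contours.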

The above criteria are local in the model: each outlier is within a residue or between two residues. Global scores for an entire structure are obtained from the number of outliers, suitably normalized by total residues, parameters, or atoms. For criteria with well-behaved reference data distributions, such as covalent geometry, outliers are usually declared at 4σ, or Z-score = 4. Conformational outliers are traditionally defined by a boundary enclosing nearly all filtered reference data, such as outside the 99.95% or 1-in-2000 contour for Ramachandran outliers (equivalent to Z = 3.5) [1••]. Clashes are declared at overlap ≥0.4 Å, and overall ‘clashscore’ = clashes per 1000 atoms [12••]. Most global validation scores depend strongly on resolution, the single number most indicative of crystal structure quality. This relationship is sometimes made explicit by reporting percentile scores relative to PDB X-ray structures within a resolution cohort [12••].
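The thresholds quoted in this paragraph can be written out as a short sketch. The cutoffs (|Z| of 4 for covalent geometry, overlap of at least 0.4 Å for a clash, clashes per 1000 atoms) come from the text; the function names, the choice of "at or above" for the Z cutoff, and the simple percentile definition are assumptions made for illustration and do not reproduce any particular program's implementation.

```python
import math

CLASH_OVERLAP_CUTOFF = 0.4   # Å; overlap at or above this counts as a clash
GEOMETRY_Z_CUTOFF = 4.0      # |Z| at or above this flags a bond or angle

def z_score(observed, ideal, sigma):
    """Deviation from the ideal value in units of its standard deviation."""
    return (observed - ideal) / sigma

def is_geometry_outlier(observed, ideal, sigma, cutoff=GEOMETRY_Z_CUTOFF):
    return abs(z_score(observed, ideal, sigma)) >= cutoff

def count_clashes(overlaps, cutoff=CLASH_OVERLAP_CUTOFF):
    """Count atom-pair overlaps (in Å) that qualify as clashes."""
    return sum(1 for overlap in overlaps if overlap >= cutoff)

def clashscore(n_clashes, n_atoms):
    """Clashes per 1000 atoms."""
    return 1000.0 * n_clashes / n_atoms

def two_sided_tail(z):
    """Fraction of a normal distribution lying beyond +/- z."""
    return math.erfc(z / math.sqrt(2.0))

def percentile_in_cohort(score, cohort_scores):
    """Percent of cohort structures with a worse (higher) score.

    Assumes lower is better, as for clashscore; the cohort would be
    structures of similar resolution.
    """
    worse = sum(1 for s in cohort_scores if s > score)
    return 100.0 * worse / len(cohort_scores)
```

As a consistency check on the Ramachandran boundary, `two_sided_tail(3.5)` is a little under 1 in 2000, which is why the 99.95% contour is described as roughly equivalent to Z = 3.5.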

Most validation outliers are errors, but a few are functionally significant anomalies where evolution has found it worthwhile to spend some net folding energy on stabilizing an unfavorable local conformation. Since active stabilization is required to hold a group in an unfavorable position, valid outliers should have clear electron density and be constrained by H-bonds or packing interactions [13] (Figure 1a).

Diagnosing problems is a much more productive endeavor if anomalies can be distinguished from errors and some large fraction of the latter can be corrected, or ‘healed’. This reversed meaning of ‘molecular medicine’ is one motivation for many recent developments in model validation. The errors worth correcting in macromolecular models are qualitative misfittings into the wrong local minimum (Figure 1b). Rather than random, those are nearly always systematic errors arising from a particular misinterpretation to which either people or programs are prone [13].

The earliest form of reliable, automated correction of model-fitting errors is for 180° ‘flips’ of sidechain amide or histidine ring orientations [14, 15, 16]. Flips are common because of indistinguishable electron density for N versus O or C, which also means they can be corrected without reference to diffraction data. Flip state is especially important for histidines, because it dominates the determination of protonation state [17]. Since 2002, the incidence of incorrect Asn/Gln/His flips in new PDB depositions worldwide has decreased by 45% [18], a conspicuous win for the community.
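Because the decision rests on chemistry rather than density, the idea can be illustrated with a toy comparison of the two flip states of an Asn or Gln amide. The sketch below is not the method of [14, 15, 16]; it simply counts plausible hydrogen-bond partners within a distance cutoff for each orientation. The 3.5 Å cutoff, the function names, and the approximation of a 180° flip as a swap of the O and N positions are all assumptions for illustration, and a real tool would also weigh steric clashes with explicit hydrogens.

```python
import math

def flip_score(o_pos, n_pos, donors, acceptors, cutoff=3.5):
    """Count plausible H-bonds for one amide orientation.

    The amide O should sit near neighbouring donors and the amide N
    near neighbouring acceptors. Positions are (x, y, z) tuples in Å.
    """
    o_contacts = sum(1 for d in donors if math.dist(o_pos, d) <= cutoff)
    n_contacts = sum(1 for a in acceptors if math.dist(n_pos, a) <= cutoff)
    return o_contacts + n_contacts

def should_flip(o_pos, n_pos, donors, acceptors, cutoff=3.5):
    """True if swapping the O and N positions gives more H-bond partners."""
    current = flip_score(o_pos, n_pos, donors, acceptors, cutoff)
    flipped = flip_score(n_pos, o_pos, donors, acceptors, cutoff)
    return flipped > current
```

For example, an amide whose modeled O faces a carbonyl acceptor and whose modeled N faces a backbone N-H donor scores zero as built and two when flipped, so `should_flip` returns `True`.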

A theoretical issue recognized by early contributors [19, 20] is that validation criteria are stronger if independent of the target functions used in refinement. This produces an inherent conflict between using all relevant information to produce the best attainable model and still being able to reliably evaluate the accuracy of the result. There are several partial answers to this dilemma. First, the argument applies to global measures, not to individual problems, and to complacency over good scores, not to distress over bad ones. Ideal covalent geometry does not reflect general model quality, but it is inexcusable to retain a 10σ bond-length outlier; those occur disturbingly often even at very high resolution if geometry terms are downweighted [18]. The second point is that we should not (and probably cannot) prevent crystallographers from using all available information in order to succeed for difficult cases. Rather, validators should continue developing new criteria and methods, perhaps including full cross-validation enabled by the computational power to routinely perform parallel structure determinations with different information left out each time. The third, most important, point is that any one validation criterion can be ‘gamed’, but the entire set of model plus data criteria (at least for protein interiors, currently, at reasonable resolutions) provides cooperative constraints that cannot be satisfied simultaneously without having the right answer.

For the practitioner of model validation and healing, we suggest two guiding principles. The first, from the classic T-shirt of the 2010 Cold Spring Harbor crystallography course (Figure 1b), is ‘At least one person should look at the map!’ Even the best automated methods will occasionally make a silly mistake for some new case the programmer did not expect. The second principle is the ‘Zen of Model Anomalies’ [18]:

  • Consider each outlier and correct most.

  • Treasure the meaningful, valid few.

  • Live serenely with the small inscrutable remainder.

Section snippets

New access to concise and detailed quality metrics

The current era in crystallographic model validation and healing was initiated in 2008, when the worldwide Protein Data Bank (wwPDB) required deposition of structure factors and constituted the wwPDB X-ray Validation Task Force committee, the first of the wwPDB VTFs [21]. Reliable availability of diffraction data not only allows validation of the data itself, such as detection of twinning [22, 23], and of model-to-data match [24], but also enables better quality-filtering of reference data,

Future directions

We will soon see the appearance, and then the effects, of more complete, understandable, and obvious validation at the PDB. Similarly, we will see whether more dimensions are the right answer for validation, or only for design. High-productivity tools will deliver robust diagnosis and easy correction for previously neglected components: ligands, ions, waters, carbohydrates, nucleic acids, and alternate conformations. Model healing will be increasingly integrated into our automated protocols, to

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgement

Funding for this analysis is from NIH Research Grants R01-GM073930 and R01-GM073919.

References (70)

  • R.W.W. Hooft et al.

    Errors in crystal structures

    Nature

    (1996)
  • S.C. Lovell et al.

    Structure validation by Cα geometry: φ,ψ and Cβ deviation

    Proteins: Struct Funct Genet

    (2003)
  • R.A. Laskowski et al.

    ProCheck: a program to check the stereochemical quality of protein structures

    J Appl Crystallogr

    (1993)
  • R.L. Dunbrack

    Rotamer libraries in the 21st century

    Curr Opin Struct Biol

    (2002)
  • J.S. Richardson et al.

    RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium Contribution)

    RNA

    (2008)
  • J.M. Word et al.

    Visualizing and quantitating molecular goodness-of-fit: small-probe contact dots with explicit hydrogens

    J Mol Biol

    (1999)
  • V.B. Chen et al.

    MolProbity: all-atom structure validation for macromolecular crystallography

    Acta Crystallogr

    (2010)
  • S.C. Lovell et al.

    The penultimate rotamer library

    Proteins: Struct Funct Genet

    (2000)
  • I.K. McDonald et al.

    Satisfying hydrogen bonding potential in proteins

    J Mol Biol

    (1994)
  • R.W.W. Hooft et al.

    Positioning hydrogen atoms by optimizing hydrogen-bond networks in protein structures

    Proteins: Struct Funct Genet

    (1996)
  • J.S. Richardson et al.

    Fitting tip#3: histidine flips and protonation

    Comput Crystallogr Newsletter

    (2012)
  • A.L. Morris et al.

    Stereochemical quality of protein structure coordinates

    Proteins: Struct Funct Genet

    (1992)
  • G.J. Kleywegt

    Validation of protein crystal structures

    Acta Crystallogr

    (2000)
  • H.M. Berman et al.

    The future of the Protein Data Bank

    Biopolymers

    (2012)
  • P.H. Zwart et al.

    Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation

    CCP4 Newsl

    (2005)
  • A.A. Lebedev et al.

    Intensity statistics in twinned crystals with examples from the PDB

    Acta Crystallogr

    (2006)
  • P. Afonine et al.

    phenix.model_vs_data: a high-level tool for the calculation of crystallographic model and data statistics

    J Appl Crystallogr

    (2010)
  • C.-I. Branden et al.

    Between objectivity and subjectivity

    Nature

    (1990)
  • E.N. Baker et al.

    In defence of our science: validation now!

    Acta Crystallogr

    (2010)
  • G. Chang et al.

    Retraction

    Science

    (2006)
  • B.J.C. Janssen et al.

    Crystallography: crystallographic evidence for deviating C3b structure

    Nature

    (2007)
  • S. Gore et al.

    Implementing an X-ray validation pipeline for the Protein Data Bank

    Acta Crystallogr

    (2012)
  • M.A. Hanson et al.

    Retraction: cocrystal structure of synaptobrevin-II bound to botulinum neurotoxin type B at 2.0 Å resolution

    Nat Struct Mol Biol

    (2009)
  • G.J. Kleywegt et al.

    The Uppsala electron-density server

    Acta Crystallogr

    (2004)
  • P.A. Karplus

    Experimentally observed conformation-dependent geometry and hidden strain in proteins

    Protein Sci

    (1996)