Regular article
Ab initio phasing of high-symmetry macromolecular complexes: successful phasing of authentic poliovirus data to 3.0 Å resolution1

https://doi.org/10.1006/jmbi.2001.4485

Abstract

A genetic algorithm-based computational method for the ab initio phasing of diffraction data from crystals of symmetric macromolecular structures, such as icosahedral viruses, has been implemented and applied to authentic data from the P1/Mahoney strain of poliovirus. Using only single-wavelength native diffraction data, the method is shown to be able to generate correct phases, and thus electron density, to 3.0 Å resolution. Beginning with no advance knowledge of the shape of the virus and only approximate knowledge of its size, the method uses a genetic algorithm to determine coarse, low-resolution (here, 20.5 Å) models of the virus that obey the known non-crystallographic symmetry (NCS) constraints. The best scoring of these models are subjected to refinement and NCS-averaging, with subsequent phase extension to high resolution (3.0 Å). Initial difficulties in phase extension were overcome by measuring and including all low-resolution terms in the transform. With the low-resolution data included, the method was successful in generating essentially correct phases and electron density to 6.0 Å in every one of ten trials from different models identified by the genetic algorithm. Retrospective analysis revealed that these correct high-resolution solutions converged from a range of significantly different low-resolution phase sets (average differences of 59.7° below 24 Å). This method represents an efficient way to determine phases for icosahedral viruses, and has the advantage of producing phases free from model bias. It is expected that the method can be extended to other protein systems with high NCS.

Introduction

The phase problem remains a fundamental challenge in determining novel protein structures by X-ray crystallography. Methods for determining structures directly from amplitude information continue to be of great interest, and are beginning to be applied successfully to small proteins;1, 2 for a review, see Abrahams & De Graf.3 The case of icosahedral viruses offers a unique set of challenges and advantages in structure determination. Because of their large size (and correspondingly large unit cell), there are enormous numbers of reflections, making data collection laborious and some calculations large and time-consuming. The high degree of symmetry present in these systems, however, offers compensating advantages that are of particular use in attempting to solve the phase problem. The non-crystallographic symmetry (NCS) in the virus crystals provides powerful constraints that are routinely exploited to substantially improve initial phase estimates, and might be useful for ab initio structure determination as well. These possibilities were initially explored by Rossmann & Blow,4, 5 Main & Rossmann,6 and Main7 using a non-linear formalism that treated phases as explicit variables. Later, Crowther8 suggested a more powerful linear method involving the assignment of coefficients to symmetry-consistent “eigendensity” functions, thereby restricting the number of explicit variables. The feasibility of these earliest phasing methods was demonstrated in test cases involving one-dimensional functions and three-dimensional arrangements of a small number of atoms. With enormous improvements in computational power, attention has once again focused on the possibility of exploiting NCS in the ab initio phasing of symmetric macromolecular crystal structures. Current applications of NCS constraints are generally carried out by Bricogne’s method,9 involving symmetry averaging of electron density in real space.
This is formally equivalent to Crowther’s approach, but is much more efficient when applied to larger structures at higher resolutions. Further, given a sufficiently correct initial phase set, it is possible to extend a set of low-resolution phases to higher-resolution data by direct space averaging, as was demonstrated at low resolution by Argos et al.,10 at 3.2 Å resolution in the case of 6-fold NCS by Gaykema et al.,11 and later at high resolution in the structure determinations of icosahedral viruses by Rossmann et al.12 and Hogle et al.13 It has been suggested that phase extension from low (>20 Å) to high resolution is feasible and could be exploited in determining virus structures ab initio.14 If direct phase extension can be carried out reliably, only low-resolution phases need be generated for successful structure determination, in which case the problem of phase determination for icosahedral viruses could be reduced to a tractable size.

Several approaches to obtaining these low-resolution phases have been suggested. Electron micrograph reconstructions are a promising source of low-resolution structural information,15 which have yielded useful phase and envelope information (e.g. Speir et al.;16 and Grimes et al.17) and have recently provided a source of starting phases for determination of novel virus structures.18, 19 Solution scattering, analyzed with icosahedral harmonics, has long been known to provide low-resolution information,20 and it has been proposed that low-resolution models from solution scattering might be suitable starting points for phase extension.21

The most efficient methods for phase determination would involve the generation of low-resolution phases using only measured native crystallographic information at a single wavelength, which could be extended subsequently to high resolution. One such approach uses a spherical shell model of a virus with adjustable radii and density levels for protein and RNA to generate phases to 20 Å resolution, which are then subjected to NCS-based phase refinement and extension.22 While an initial attempt to solve parvovirus de novo via this method proved unsuccessful, the phase information generated by the procedure was ultimately used in the solution of the structure.23 A feasibility study demonstrated that the procedure might have been able to provide sufficiently accurate starting phases, provided that the particle position and orientation had been known accurately.14 It is believed that NCS averaging can break the centric symmetry of the shell-based phases only if 2-fold NCS axes of the virus do not lie parallel with 2-fold axes of the crystal.24 Nevertheless, in the recent determination of the low-resolution (20 Å) structure of rice dwarf virus,25 the centric symmetry was seemingly broken by repeated application of NCS averaging, in apparent violation of that principle.

Earlier,26 we described a computational method for generating the necessary low-resolution phases directly from a native set of single-wavelength X-ray diffraction data. The method uses a genetic algorithm (GA) to generate symmetry-consistent, low-resolution direct space models that are then Fourier-transformed to generate starting phase sets for phase extension. Originally, we demonstrated the feasibility of the approach using error-free synthetic data from poliovirus empty capsids, a system in which only two main density levels, protein and solvent, need to be considered (rather than the additional RNA density level in native virus). Here, we describe improvements in the procedure and application of the method to authentic data from the Mahoney strain of type 1 poliovirus, including successful phase extension of these data to 3 Å resolution. Collection of the lowest-resolution data proved to be essential for the success of the procedure. With these low-resolution data included, all ten of the best-scoring starting models created by the GA procedure and selected automatically by objective, amplitude-based numerical criteria led to essentially correct phase sets at 6 Å. At this resolution, these phase sets all agreed with one another, with the known atomic model, and with the results of phase extension from a 22 Å resolution cryo-electron micrograph reconstruction of poliovirions.27 The single highest-scoring phase set was extended to 3.0 Å, where the obvious correlation of the map with the known amino acid sequence of poliovirus (chemical information that was not used in the creation of the map) provided an objective criterion for the success of the ab initio phasing procedure. 
Finally, a retrospective analysis of the ten successful phase extensions showed (1) that omitting the low-resolution (infinite-50 Å) data would have caused many of the successful phase extensions to fail, and (2) that using randomly generated instead of GA-selected models would have significantly reduced the success rate. These results demonstrate the importance of the GA and of the low-resolution data, and provide a blueprint for the ab initio determination of unknown structures in the future.

Because details of the algorithm have been reported,26 only a general description of the method and modifications to the original procedure are given here. The method is a multi-step procedure, involving a search of coarse direct space models of the virus, refinement of the best-scoring models, and phase extension to high resolution via NCS averaging. At the start of the procedure, nothing is assumed to be known about the structure other than the position of the particle center, the orientation of the particle symmetry axes, and a rough estimate of its radius.

In the first step of the procedure, coarse low-resolution trial models were generated and evaluated using a genetic algorithm. Given the scope of the problem, it was essential to develop a parameterization of the virus that exploits the symmetry of the system, so as to reduce the size of solution space to be surveyed. Fortunately, it is necessary to model only the icosahedrally unique volume, in this case taken to be a pyramid that extends outward from the center of the virus particle to a radial distance of 190 Å (safely beyond the radius of the virus, estimated from crystal packing considerations, electron microscopy, or other physical measurements). The pyramid has a four-sided (kite-shaped) base whose vertices lie along neighboring 5-fold, 3-fold and two 2-fold axes of icosahedral symmetry (Figure 1). Since the spherical cutoff is considerably larger than the virus, the modeling process makes no a priori assumptions about the virus shape. This volume was sampled by 99 equally spaced lattice points, each of which represents a point scatterer of unit electron density and which sample the volume as uniformly as possible.

Thus, at the GA stage, any possible model could be created by turning some subset of the lattice points on and the remainder off. This two-value constraint on the electron density values is physically plausible at low resolution and provides a good observation-to-parameter ratio, thus accelerating convergence. Moreover, this binary parameterization allowed any model to be represented by a 99-bit string with each bit corresponding to a particular lattice point, a representation that contributes to the theoretical efficiency of the genetic algorithm.26 Using a combination of crystallographic and non-crystallographic operators, the icosahedrally unique volume, as specified by a trial bit string, is expanded 120-fold (30-fold NCS and 4-fold crystallographic symmetry) to generate the full unit cell contents and then Fourier-transformed to obtain model-based structure factors (∼Fcalc). For computational efficiency, these two steps are accomplished together, by multiplying the vector of 99 density values by a single complex-valued matrix.26 To evaluate the quality of each trial model, its Fcalc set is compared with the single-crystal X-ray diffraction data (∣Fobs∣) via a statistic, Qexp, that assesses many density models at once and penalizes those with relatively poor ∣Fcalc∣ values in any resolution shell (see below).
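The combined symmetry expansion and Fourier transform described above can be illustrated with a small sketch. It assumes point scatterers in fractional coordinates and uses a toy symmetry group (identity plus one 2-fold axis) in place of the 120 operators of the real problem; the function name `structure_factor_matrix`, the random lattice points, and the reflection list are all hypothetical and are not taken from the original software.

```python
import numpy as np

def structure_factor_matrix(points, rotations, translations, hkl):
    """Return A with A[h, j] = sum over symmetry copies of exp(2*pi*i * h.(R x_j + t))."""
    A = np.zeros((len(hkl), len(points)), dtype=complex)
    for R, t in zip(rotations, translations):
        copies = points @ R.T + t                 # symmetry copies of every lattice point
        A += np.exp(2j * np.pi * (hkl @ copies.T))
    return A

rng = np.random.default_rng(0)
points = rng.random((99, 3)) * 0.2                # hypothetical lattice points (fractional)

# Toy symmetry (assumption): identity and a 2-fold rotation about z, no translations.
rotations = [np.eye(3), np.diag([-1.0, -1.0, 1.0])]
translations = [np.zeros(3), np.zeros(3)]

# A small block of reflections (h, k, l), for illustration only.
hkl = np.array([[h, k, l]
                for h in range(-2, 3)
                for k in range(-2, 3)
                for l in range(0, 3)], dtype=float)

A = structure_factor_matrix(points, rotations, translations, hkl)

bits = rng.integers(0, 2, size=99)                # one trial model as a 99-bit string
f_calc = A @ bits                                 # expansion and transform in one product
```

Because the matrix depends only on the lattice and the symmetry operators, it can be computed once and reused for every trial bit string, which is presumably what makes evaluating tens of thousands of models affordable.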

Given the size and irregular topography of the solution space, simple optimization procedures would not have been effective at selecting a model from which to calculate the starting phase set. Instead, the initial survey of the space of coarse, direct space models was carried out using a genetic algorithm (specifically, the GAucsd software package28). Originally developed by Holland,29 GA is an optimization technique based on a rough analogy to Darwinian selection and is employed to solve optimization problems with multiple local minima and no known efficient solution. Here, 100 GA calculations were undertaken, with each evaluating tens of thousands of trial models. Each run of the GA started with a population of 500 randomly generated bit-strings and was run for approximately 300 generations or until the population was nearly converged to a single solution. The solution with the single best amplitude-based fitness score in the final generation was selected as the output from each run.

The ultimate success of the GA and subsequent refinement depend critically on the use of an appropriate fitness function. Initially, a simple scale-independent function was employed to evaluate overall agreement between observed and calculated structure factors†:

$$Q_{2F} = 1 - \frac{\langle |F_{obs}|\,|F_{calc}| \rangle^{2}}{\langle |F_{obs}|^{2} \rangle \, \langle |F_{calc}|^{2} \rangle}$$

When starting phase sets are refined by minimizing Q2F, by either the GA or the subsequent refinement step, the transform tends to segregate into shells by resolution. Phases within each shell tend to be consistent with a particular choice of sign and hand, but are frequently inconsistent with neighboring shells and thus lack overall agreement. While either a positive or negative (i.e. a “Babinet” solution) version of the correct image would be acceptable, the superposition of portions of more than one transform produces highly fragmented density that is uninterpretable.30 In general, the phase discrepancies in these incorrect solutions could not be repaired by NCS averaging, and the solutions were therefore inadequate for phase extension.
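As an illustration of why Q2F is scale-independent, the following sketch evaluates it on made-up amplitudes, reading the definition as one minus the squared mean of the amplitude products divided by the product of the mean squared amplitudes. The function name `q2f` and the sample values are assumptions for the example.

```python
import numpy as np

def q2f(f_obs, f_calc):
    """Q2F = 1 - <|Fobs||Fcalc|>^2 / (<|Fobs|^2> <|Fcalc|^2>); 0 means perfect agreement."""
    a = np.abs(f_obs)
    b = np.abs(f_calc)
    return 1.0 - np.mean(a * b) ** 2 / (np.mean(a ** 2) * np.mean(b ** 2))

f_obs = np.array([10.0, 4.0, 7.0, 1.0, 3.0])      # made-up observed amplitudes
f_good = 3.0 * f_obs                              # same pattern on a different scale
f_poor = np.array([1.0, 9.0, 2.0, 8.0, 5.0])      # unrelated pattern

q_good = q2f(f_obs, f_good)
q_poor = q2f(f_obs, f_poor)
```

Multiplying either amplitude set by a constant leaves the value unchanged, so no scale factor between observed and calculated data has to be fitted before models are compared.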

To address this problem, a new fitness function was designed to try to select for solutions with globally self-consistent phase sets. This function compares observed and calculated structure factors in narrow resolution bins and acts to penalize solutions with poor agreement in any given bin (i.e. where a change in sign or hand may be occurring) rather than to reward those with bins where there is a locally good amplitude agreement. This function, Qexp, has the form:

$$Q_{exp} = \frac{1}{N} \sum_{bins} \exp\left[ \frac{k\,(Q_{bin} - \mu_{bin})}{\sigma_{bin}} \right]$$

where Qbin is the Q2F function within a specific resolution range, k is an empirically determined constant, and μ and σ represent the mean and standard deviation, respectively, of all values of Qbin previously encountered in the current GA run, weighted to progressively decrease the influence of older individuals†. The use of μ and σ ensures that each bin-specific agreement score (Qbin) is compared with the range of values achievable at that resolution by other comparably detailed models. In particular, it prevents very good Qbin values at low resolution from dominating the calculation. The exponential has a different purpose: it severely penalizes any trial model with below-average agreement in any range, thus selecting against potential cross-over points.
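The effect of the exponential can be seen in a small numerical sketch. It assumes that Qexp is the average over resolution bins of exp[k(Qbin − μbin)/σbin], takes k = 1 (the empirical value is not given here), and uses fixed, made-up values of μ and σ rather than the weighted running statistics described in the text.

```python
import numpy as np

def q_exp(q_bin, mu_bin, sigma_bin, k=1.0):
    """Average over bins of exp(k * (Q_bin - mu_bin) / sigma_bin); lower is better."""
    z = (np.asarray(q_bin) - mu_bin) / sigma_bin
    return float(np.mean(np.exp(k * z)))

# Hypothetical population statistics for four resolution bins.
mu = np.full(4, 0.5)
sigma = np.full(4, 0.1)

# Two models with the same mean Q_bin: one is average everywhere, the other is
# better than average in three bins but clearly worse in the fourth.
even = np.array([0.5, 0.5, 0.5, 0.5])
uneven = np.array([0.4, 0.4, 0.4, 0.8])

score_even = q_exp(even, mu, sigma)
score_uneven = q_exp(uneven, mu, sigma)
```

Although both models have the same mean Qbin, the one with a single poor bin receives a much worse score, which is the behavior intended to select against cross-over points between resolution shells.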

As the GA calculation proceeds and the population improves, the distribution of Qexp scores (which are population-dependent) changes such that the score received by a particularly good bit-string gradually becomes worse (higher in value). Therefore, the best-scoring individual in each generation of the GA was automatically included, unchanged, in the following generation and re-evaluated. This “elitist” strategy ensured that the best scoring bit-string in the final generation was the best encountered by the GA. The Qexp function was found to be more effective than population-insensitive functions (such as Q2F) in rejecting solutions with phase inconsistencies30 and is therefore used in both the GA and refinement stages.
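A minimal sketch of a bit-string GA with the elitist strategy follows. It is not the GAucsd implementation: tournament selection, single-point crossover, the mutation rate, and the toy fitness (Hamming distance to a hidden target) are all assumptions chosen for brevity. Unlike Qexp, the toy fitness does not depend on the population, so the re-evaluation of the elite individual is trivial here.

```python
import numpy as np

def run_ga(fitness, n_bits=99, pop_size=500, generations=300, p_mut=0.01, seed=0):
    """Minimize `fitness` over bit strings; returns (best string, best score per generation)."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    history = []
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argmin(scores)].copy()
        history.append(scores.min())

        # Tournament selection (assumption): the better of two random individuals.
        pairs = rng.integers(0, pop_size, size=(pop_size, 2))
        winners = np.where(scores[pairs[:, 0]] <= scores[pairs[:, 1]],
                           pairs[:, 0], pairs[:, 1])
        parents = pop[winners]

        # Single-point crossover between each parent and its neighbor.
        mates = np.roll(parents, 1, axis=0)
        cut = rng.integers(1, n_bits, size=pop_size)
        take_parent = np.arange(n_bits) < cut[:, None]
        children = np.where(take_parent, parents, mates)

        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        children = np.where(flip, 1 - children, children)

        # Elitism: the best individual is carried over unchanged.
        children[0] = elite
        pop = children

    scores = np.array([fitness(ind) for ind in pop])
    history.append(scores.min())
    return pop[np.argmin(scores)], history

# Toy fitness (assumption): number of bits that differ from a hidden target.
target = np.random.default_rng(1).integers(0, 2, size=99)

def toy_fitness(bits):
    return int(np.sum(bits != target))

best, history = run_ga(toy_fitness, pop_size=100, generations=50)
```

With elitism and a population-independent score, the best score can never get worse from one generation to the next. With a population-dependent score such as Qexp, the carried-over individual has to be re-scored each generation, as described above.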

The GA was followed by a refinement of lattice point density values. The best-scoring individual solutions from each of 100 GA trials were collected to form a new population. This population was then refined in a process that relaxes the constraint limiting the lattice points to binary values and thus permits a better fit to the data, and potentially a more accurate modeling of the virus, while maintaining the NCS constraints. This refinement step was carried out by a steepest descent minimization of Qexp with respect to the electron density at each of the lattice points. In the absence of an F000 term in the reference data set, the mapping of the bit values “0” and “1” to specific positive or negative electron density levels implicitly sets the unsampled volume of the unit cell (i.e. points lying further than 190 Å from any virus particle center) to a particular level. In the GA, the unsampled volume was mapped to 0.5 (that is, a level mid-way between the density corresponding to a 0 bit and that of a 1 bit). In subsequent refinement, 0.0, 0.5, and 1.0 were all evaluated as alternative starting levels for the unsampled volume. For each bit-string, the evaluation that yielded the best Qexp score was selected. These strings were then evaluated as a population with the Qexp residual and ranked.
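The idea of relaxing binary values into continuous densities can be sketched with a simple steepest descent loop. For brevity the sketch minimizes Q2F rather than Qexp, uses a finite-difference gradient with a backtracking step, and replaces the symmetry-expanded transform with a small random complex matrix; all names and sizes are hypothetical.

```python
import numpy as np

def q2f(f_obs, f_calc):
    a, b = np.abs(f_obs), np.abs(f_calc)
    return 1.0 - np.mean(a * b) ** 2 / (np.mean(a ** 2) * np.mean(b ** 2))

def refine(rho0, A, f_obs, steps=50, eps=1e-6):
    """Steepest descent on density values; returns (refined densities, final score)."""
    def score(rho):
        return q2f(f_obs, A @ rho)

    rho = rho0.astype(float).copy()
    current = score(rho)
    for _ in range(steps):
        # Forward-difference gradient with respect to each lattice point density.
        grad = np.array([(score(rho + eps * e) - current) / eps
                         for e in np.eye(len(rho))])
        step = 1.0
        while step > 1e-8:
            trial = rho - step * grad
            trial_score = score(trial)
            if trial_score < current:             # accept only improving steps
                rho, current = trial, trial_score
                break
            step *= 0.5
        else:
            break                                 # no improving step found: stop
    return rho, current

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 12)) + 1j * rng.normal(size=(40, 12))   # stand-in transform
rho_true = np.linspace(0.1, 0.9, 12)              # made-up continuous densities
f_obs = np.abs(A @ rho_true)                      # synthetic amplitudes

rho_start = (rho_true > 0.5).astype(float)        # binary starting model
initial = q2f(f_obs, A @ rho_start)
rho_refined, final = refine(rho_start, A, f_obs)
```

Because only improving steps are accepted, the score can never increase during refinement. Whether it reaches the global minimum depends on the starting model, which is why the choice of GA-selected starting points matters.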

The ten best-scoring solutions from this refinement procedure were then used to generate phases from which phase extension could be initiated (trying both the direct and the mirror image of the icosahedrally unique volume). Iterative NCS averaging was carried out using a locally developed implementation of the method of Bricogne31 and cycled to convergence (typically 20 cycles) at the resolution of the GA. This implementation includes resolution-dependent bin scaling, non-linear interpolation of density values, and the omission of statistical outliers from density averages by including only the 33rd to 67th percentiles of contributing densities in each output average (to partially compensate for large series termination ripples). This averaging eliminated the coarseness of sampling in the GA lattice. Phase extension was then undertaken by repeatedly adding a new higher-resolution shell of reflections approximately one reciprocal lattice unit wide and then averaging to convergence. The extension procedure was monitored through the R1 statistic† and validated by the appearance of the map.
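The outlier-rejecting average can be sketched as follows, assuming that for every output grid point the densities at all NCS-related positions have already been interpolated into an array with one row per symmetry copy. The function name `trimmed_ncs_average`, the rounding at the cut-offs, and the test data are assumptions; the locally developed averaging program itself is not reproduced here.

```python
import numpy as np

def trimmed_ncs_average(samples, lo=0.33, hi=0.67):
    """Average only the central part of the sorted NCS-related densities at each point.

    samples has shape (n_copies, n_points): one row per symmetry-related copy.
    """
    ordered = np.sort(samples, axis=0)
    n = ordered.shape[0]
    start = int(np.floor(lo * n))
    stop = max(int(np.ceil(hi * n)), start + 1)   # always keep at least one value
    return ordered[start:stop].mean(axis=0)

# 30 NCS copies of 4 grid points, all with density 1.0 except one corrupted copy.
samples = np.ones((30, 4))
samples[0, :] = 100.0                             # e.g. a strong ripple artifact

plain = samples.mean(axis=0)
trimmed = trimmed_ncs_average(samples)
```

A plain mean is pulled upward by the single corrupted copy, whereas the trimmed average ignores it entirely, at the cost of discarding about two-thirds of the contributing values.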

Model-based calculations

In previous work,26 the GA-based procedure was run on perfect synthetic data from poliovirus empty capsids, and was shown to produce internally self-consistent phase sets. When all data (infinite-22 Å) were present, phase extension (to 12 Å) was successful30 and yielded calculated phases that agreed well with the atomic model-based phases, with a phase difference of 35.3° at 12 Å resolution (data not shown).

Initial problems with authentic data

X-ray diffraction data were collected from a single frozen crystal of mature poliovirus

Materials and methods

Purified Mahoney strain poliovirus (P1/Mahoney) was generously provided by Marie Chow (University of Arkansas). The virus was concentrated to approximately 12 mg ml−1 by pelleting into a sucrose cushion and resuspending in 1 M NaCl in PMC7 buffer (10 mM Pipes Na (pH 7.0), 5 mM MgCl2, 1 mM CaCl2). Crystals were produced by microdialysis against 50-70 mM NaCl in PMC7 buffer at 4°C. Crystals were washed in PMC7 buffer, transferred for one minute to a cryoprotectant solution of 30 % (w/v) ethylene

Acknowledgements

The authors thank J. Genova for help in collecting low-resolution data and F. Hughson and M. Munson for helpful comments. This work was supported, in part, by NIH grant AI20566 (to J.M.H.) and by an NSF grant for High Performance Computing and Communications (NSF grant MCB 9527181 to G. Wagner). S.T.M. was supported initially by Biophysics training grant TM#T32-GM08313 and subsequently by a Lewis Thomas Fellowship from the Department of Molecular Biology, Princeton University. The Harvard

References (51)

  • Z. Otwinowski et al. Processing of diffraction data collected in oscillation mode. Methods Enzymol (1997)
  • A.M. Deacon et al. The shake-and-bake structure determination of triclinic lysozyme. Proc. Natl Acad. Sci. USA (1998)
  • G.G. Prive et al. Packed protein bilayers in the 0.90 Å resolution structure of a designed alpha helical bundle. Protein Sci (1999)
  • M.G. Rossmann et al. Determination of phases by the conditions of non-crystallographic symmetry. Acta Crystallog (1963)
  • M.G. Rossmann et al. Solution of the phase equations representing non-crystallographic symmetry. Acta Crystallog (1964)
  • P. Main et al. Relationships among structure factors due to identical molecules in different crystallographic environments. Acta Crystallog (1966)
  • P. Main. Phase determination using non-crystallographic symmetry. Acta Crystallog (1967)
  • R.A. Crowther. A linear analysis of the non-crystallographic symmetry problem. Acta Crystallog (1967)
  • G. Bricogne. Geometric sources of redundancy in intensity data and their use for phase determination. Acta Crystallog. sect. A (1974)
  • P. Argos et al. An application of the molecular replacement technique in direct space to a known protein structure. Acta Crystallog. sect. A (1975)
  • M.G. Rossmann et al. Structure of a human common cold virus and functional relationship to other picornaviruses. Nature (1985)
  • J.M. Hogle et al. Three-dimensional structure of poliovirus at 2.9 Å resolution. Science (1985)
  • J. Tsao et al. Ab initio phase determination for viruses with high symmetry: a feasibility study. Acta Crystallog. sect. A (1992)
  • R. McKenna et al. Structure determination of the bacteriophage phiX174. Acta Crystallog. sect. B (1992)
  • J.M. Grimes et al. The atomic structure of the bluetongue virus core. Nature (1998)

1 Edited by I. A. Wilson