Fast and accurate conversion of atomic models into electron density maps

: New image processing methodologies and algorithms have greatly contributed to the signi(cid:28)cant progress in three-dimensional electron microscopy (3DEM) of biological complexes we have seen over the last decades. Naturally, the availability of accurate procedures for the objective testing of new algorithms is a crucial requirement for the further advancement of the (cid:28)eld. A good and accepted testing work(cid:29)ow involves the generation of realistic 3DEM-like maps of biological macromolecules from which some measure of (cid:16)ground truth(cid:17) can be derived, ideally because their 3D atomic structure is already known. In this work we propose a very accurate generation of maps using atomic form factors for electron scattering. We thoroughly review current approaches in the (cid:28)eld, quantitatively demonstrating the bene(cid:28)ts of the new methodology. Additionally, we study a concrete example of the use of this approach for hypothesis testing in 3D Electron Microscopy.


Introduction
Knowledge of the 3D structure of macromolecules and their complexes is crucial for understanding their biological function.Three-dimensional Electron Microscopy (3DEM) addresses the problem of determining the structure of macromolecular complexes from projection images recorded by transmission electron microscopy [9,29,15].
The resolution levels routinely achieved with this approach have improved signicantly over time, and have now reached the level of 3Å [47,11] (the corresponding spatial frequency is 1/3 Å −1 ).At the level of image processing, these continuous improvements in resolution are due, among other reasons, to: increases in the number of particles processed; improvements at the level of automation of the data-collection instrument [34]; the introduction of new electron detectors; and, in general, the design of new algorithms extracting more information from each available micrograph [9,29].One of the best ways of testing the strengths and limitations of such new algorithms is to rst simulate the computational experiments using articially-generated test data (also known as phantom data).Indeed, 3D model structures can be projected into 2D images and can then be used, for example, to generate articial micrographs.The accuracy of these simulations, which should closely resemble experimental situations, is becoming critical as resolution is improved.
3D model structures can consist of simple geometrical forms like spheres, cylinders and other geometrical features [2,31], but this may have the disadvantage of generating data which have an unrealistic amplitude spectrum distribution.It can thus be advantageous to use randomly generated noise with a precisely dened spectral behavior [12].In general, the most realistic model structures for single particle 3DEM are simulated biological complexes based on the atomic coordinates of known structures as deposited in the Protein Data Bank (PDB or wwPDB).
The Protein Data Bank (PDB, part of the world-wide: wwPDB) is the standard repository for the atomic structure of macromolecules [1].A PDB entry is formed of a list of atoms and their coordinates in space (see Fig. 1).Thus, a volume can be created simply by assigning dierent densities to the dierent kinds of atoms.
However, as we will show in this work, the atom shapes traditionally used may be too simple, leading to problems with the generated models.For example, using a simple 3D hard sphere to model the atoms of a 3D structure (see Fig. 1, the common choice of visualization programs.e.g., Chimera or Jmol) leads to a Fourier transform of the 3D model structure with articial zero-power rings.The most standard choice, a single Gaussian per atom [46,45], does not reproduce well atom shapes in Fourier space (as shown below).The more complex model based on a sum of Gaussians dened in Fourier space [44] does correlate well (as shown below) to EM structures, but its implementation in Fourier space is extremely slow.Considering the need to increase the resemblance between high-resolution cryoEM images and computer generated models, in this work we promote the use of PDB along with more elaborated functions describing individual atoms such as Electron Atomic Scattering Factors (EASF) [28], namely, the actual shape of the atoms seen by the electrons traveling down the microscope column.EASFs have been studied extensively in physics [28,23] and have been measured experimentally [32].A review of available experimental data, along with other physical properties related to X-ray and neutron scattering of the corresponding atoms, is presented in [4].This experimental information can be approximated analytically as a sum of a collection of Gaussian functions [24] and we here propose to use these sums to replace the simpler atomic models currently in use in 3DEM.Additionally, there are a number of applications, like tting [37,39,38] that may also need to convert atomic structures into density volumes.As we show in this article, EASFs are much better suited for this task than any other function currently available.EASFs have already been used in biological EM applications.In particular, Bsoft [14], TEM simulator [26,43] and RSref [5] have made use of EASFs.But, as we will show in the Results section, these implementations are either too slow, their accuracy can be improved, or they do not return the converted volume.
In material science, where samples are usually less sensitive to damage by the electron beam than biological samples, high resolution electron microscopy (HREM) studies can achieve the visualization of individual atoms.
Indeed, over the last decade EASFs have been used extensively in this eld to help interpret the (relatively) noise free images so obtained [16,17] and special software is now available for the specic needs of this eld [33].
However, the specicities of macromolecular samples (large number of atoms and not in a regular arrangement) makes these programs inapplicable to structural biology.
In this article we present a method for generating three-dimensional volumes from atomic coordinates obtained by X-ray crystallography, Nuclear Magnetic Resonance, or computational modeling, as well as their two-dimensional projections resembling those obtained in an electron microscope.Contrary to the standard approach in the eld (that works in Fourier space), the proposed method works in real space with very low memory

AIMS Biophysics
Volume 2, Issue 1, 8-20 requirements since the need of upsampling to have accurate results has been eliminated.Additionally, we present an example in which the design of new image processing algorithms would have beneted from this development by providing an accurate hypothesis testing tool.

Generation of volumes from atomic models
Most programs in 3DEM represent volumes as a collection of samples of some underlying continuous function.
Samples are taken with a xed spacing that is the sampling rate and is measured in Å/pixel.Sampling an atomic structure means sampling a continuous function with a support that has a diameter in the order of twice the atomic radius (the empirical atomic radius ranges from 0.25Å for the hydrogen atom, to 1.40Å for the iron atom).
The sampling frequency must be carefully chosen so that so small functions are appropriately sampled, avoiding aliasing.The resulting volume could, later on, be downsampled to the desired nal sampling rate.

Atomic model
The EASF of an atom is related to the scattering of electrons by matter [23].Electrons may interact elastically (deviated from their trajectory without any loss of energy) or inelastically (part of the energy is transferred to the interacting atom).Both kinds of interactions can be modeled to a good-extent using the complex optical potential [23].In our model, only the elastic component of the interaction is considered, since electrons that interact inelastically are supposed to be ltered out by electron optical components [9].It must be pointed out that the interaction of an electron with an atom is related to the atomic number more than to the atomic weight of that atom, since neutrons do not contribute to this interaction.
It has been shown that the elastic scattering of electrons can be accurately approximated by a sum of Gaussians up to a frequency of 6 Å −1 [24]: where f e (R) is the electron scattering factor of a given atom at the spatial frequency R (in Å −1 ).Parameters a i and b i are specic for each atom.Five Gaussians are employed by Peng and coworkers to faithfully represent the experimentally collected data, and the specic values of a i and b i for each atom are tabulated in [24].Having the expression of the EASF in Fourier space automatically allows us to derive the corresponding expression of the EASF in real space, by Fourier inversion in the continuous variable R: Fig. 2 shows the EASF of some of the most common atoms present in biological samples in Fourier space as well as in real space.This gure presents three interesting properties of the EASFs: rst, the shape of the EASF in real space is not so similar to the shape of a Gaussian (note the peak at the origin r = 0, although, by construction, it is a dierentiable peak); second, the dierences among the shape of the dierent atoms cannot be modeled by a simple multiplicative factor (EASFs are not parallel to each other in Fourier space), as is usually done when atoms are modeled by a single Gaussian [45]; third, an atom can interact elastically with an electron far beyond the atomic radius (for instance, the atomic radius of iron is 1.40Å while its EASF is signicantly dierent from zero up to a radius of 5Å).
EASFs can be compared in Fourier space to other models that have been used for the simulation of atoms in 3DEM: the step function (a solid sphere), the Gaussian, the modied Kaiser-Bessel window, and the blob.The formula for the step function is where u(r) is the Heaviside step function and r 0 is the atomic radius.The formula for the Gaussian is

AIMS Biophysics
Volume 2, Issue 1, 8-20 The expression for the modied Kaiser-Bessel window is ) ) I 0 (β) . ( Finally, the expression for the blob [19] is ) ) ) In the previous equations, I m is the modied Bessel function of the rst kind and order m (I 0 is the particular case in which m = 0) and β is a parameter controlling the atom shape.Fig. 3 shows the spectra of all these functions (they have been normalized in frequency so that all of them have the same amplitude at frequency R = 0).We presume the EASF to be a more reliable estimate of what actually happens within the microscope since it has been experimentally measured, while the other functions are some smooth functions, just convenient from a signal processing or implementation point of view.The modied Kaiser-Bessel and the blob models have been computed for β = 3.6 and m = 2 which are the standard values implemented in Xmipp.Remarkably, the most widely used function, the Gaussian, is a specially bad model both at low and high frequency.The zoomed plot of the dierent spectra corresponds to a nal sampling rate of 1.5Å/pixel (a rather common sampling rate in 3DEM).It can be observed, for the carbon atom, that all the models introduce unnecessary extra energy at high frequencies with respect to the EASF.

Sampling+Downsampling
Since EASFs are represented by a sum of Gaussians, their power spectra never vanish and, hence, EASFs are not band-limited signals.However, it can be easily checked that, at a frequency of 6Å −1 , most of the spectra of the EASFs of biologically relevant atoms are more than 60dB below their maximum value (achieved at frequency R = 0, see Fig. 2).Therefore, we can assume that at a sampling frequency of 12Å −1 , the aliasing is negligible.In any case, this is the maximum sampling frequency usable with our atomic model, since the EASF approximation provided in [24] is only valid up to 6Å −1 .
In conclusion, to sample the atomic model of a macromolecular structure it suces to substitute every atom in the list of atomic coordinates provided by PDB by the EASF corresponding to every type of atom and, then, sample this new function: where f model is the continuous function representing the atomic model of the macromolecule, r ∈ R 3 is the coordinate at which the volume is being evaluated, r i is the center of the i-th atom as provided by PDB, and f ei is the EASF of the i-th atom.
Once the model has been nely sampled, it can be safely resampled to any sampling rate that is a rational multiple of the ne sampling rate applying standard low-pass lters, together with the stretching/decimation operators [22].
A typical macromolecular complex has a diameter of around 200Å.Thus, the sampled volume needs to be stored in a volume of at least 2400x2400x2400 (2400=200x12) voxels.If double precision numbers are used (8 bytes/sample), the memory needed to store this volume is about 103 GB.As this amount of memory cannot be currently eciently handled, the standard approach to downsampling cannot be implemented.Alternatively, in the following section we propose to combine the sampling, low-pass ltration and decimation in one single step, so that the aliasing at the nal sampling rate is minimized and there is not such a formidable memory requirement.

Sampling directly at a low sampling rate
We propose a low-pass ltered version of the EASF, fe (LEASF), to represent each atom.This results in a new continuous function:

AIMS Biophysics
Volume 2, Issue 1, 8-20 Since the new function fe is eectively band-limited, it can be safely sampled at the nal sampling rate without any signicant aliasing.We discuss in this section how to generate such a low-pass ltered EASF.
The EASF can be convolved at the nely sampled space by the impulse response of a low-pass lter designed by windowing a sinc function with a Kaiser-Bessel window [22].Although the expression of the low-pass lter is widely known, we reproduce it here for completeness.If h LP F [n] is the impulse response of the sought-after low-pass lter, then ) ) where ω c is the discrete cut-o frequency of the lter (normalized so that the maximum frequency is π), the total length of the lter is 2M + 1 (M is calculated below), and β is a parameter that is calculated from the width of the transition band ∆ω as follows: rst, an intermediate parameter A is calculated where δ is the ripple allowed in the resulting low-pass lter, and then Thus, it can be observed that the design parameters for the low-pass lter are ω c , ∆ω and δ.
If the EASF is sampled at the ne sampling rate, it results in the discrete sequence f e [n].This sequence is convolved with the low-pass lter h LP F [n] to produce fe [n], i.e., the samples of the proposed continuous LEASF.The continuous function fe (r) is then interpolated using cubic B-splines [40, 41].
For every atom we search for the low-pass lter (i.e., for the parameters ω c , ∆ω and δ) that minimizes the quadratic error between the LEASF and the EASF within the frequency interval ] , being T the nal sampling rate of the volume.Fig. 4 shows the EASF and LEASF for the carbon atom.Note the close agreement between the carbon EASF and the corresponding LEASF in the range [0,0.33](this example was computed assuming a nal sampling rate of 1.5Å/pixel).Fig. 5 shows the EASF and LEASF of the carbon atom in real space.we have tackled the problem of the huge amount of memory (and time) needed to process that nely sampled volume.
Note that the parameter of a single Gaussian model could also have been optimized to maximally resemble the EASF prole.However, this is not the standard approach in the eld and we have not performed this exercise since the proposed LEASF approximation is a better approximation of the truly underlying electron atomic scattering factors.

Generation of volumes
To show the applicability of our atomic models to the generation of phantoms, we have chosen the chaperonin GroEL [3] (PDB accession code: 1GRL; see Fig. 1).However, similar results and conclusions have been derived from other structures (not shown in this paper for space constraints).We generated the corresponding voxel volume at a sampling rate of 1Å/pixel from the PDB by using the most standard software tools available: • Situs [46,45] (pdb2vol): this program projects the atomic structure on a cubic lattice by trilinear interpolation.Then, it convolves the lattice points with a Gaussian kernel whose width is internally calculated to achieve a given resolution (in our case 2Å).
• Spider [10] (cp from pdb): this program uses a similar algorithm to the previously described employing trilinear interpolation.
• EMAN [21] (pdb2mrc): this program uses a Gaussian in real-space that is sampled at the appropriate lattice locations.
• Bsoft [14] (bsf): this program follows the standard method in X-ray crystallography as programmed in the CCP4 [6] program sfall and reimplemented for EM in Bsoft [14].The program follows a dierent set of Gaussians [44] (dierent from the ones proposed in this paper) and it performs all calculations in Fourier space instead of real space (as proposed in this paper).
Fig. 6 shows the radial average of the 3D Power Spectrum Density of each one of the models.We see that Situs and Bsoft produced almost identical results (although Situs took a 12 seconds to produce this result, while Bsoft took more than 16 hours; the experiment was run on a single Intel Xeon 2.66 GHz with 2 Gb of memory).
Xmipp LEASFs was executed in 5 seconds (twice faster than Situs and more than 11,000 times faster than Bsoft; another speed comparison is reported below with a wider set of structures).Situs, Bsoft and LEASFs coincide up to a resolution about 20Å.Beyond this point LEASFs power density is much smaller compared to the other two programs.Spider produces a structure whose power spectrum is above the previous ones, and EMAN has the largest amount of energy of all.This may be pointing out that simple models may be overemphasizing the

AIMS Biophysics
Volume 2, Issue 1, 8-20 importance of high frequency components in atomic structures.Note that in EM, many times volumes are scaled at the end of the process to match a certain intensity prole in Fourier space.Although this practice would homogenize all packages regarding amplitudes, Fig. 6 highlights the fact that the underlying functions used by the dierent approaches are introducing dierent distortions in Fourier space that show up in amplitude (as shown in the gure) as well as in phase (not shown).To test whether the proposed simulation is more faithful to the EM volume than the standard Gaussian approach, we downloaded all volumes from the EMDB (Electron Microscopy Data Bank, http://www.emdatabank.org)[18] that were fully modeled with a single PDB le (Protein Data Bank, http://www.rcsb.org)[25], a total of 27 (on Feb. 15th, 2012).We measured the correlation of the EM reconstructed volume and the simulated atomic volume.We compared the correlation of the volumes simulated with the proposed method and with Gaussians+trilinear downsampling.The hypothesis that both methods were equally performing was rejected with a p-value of 0.019, that is, the correlation of the volumes generated with the new method was signicantly larger (in a statistical sense) than the correlation of the volumes generated by a widely used method (the correlation with the new method was 2.6% larger).The execution time of the new method was twice faster, and the memory requirements of the new method are much smaller since the volume needs not to be sampled rst at a high-resolution that is later downsampled.
We performed the same experiment with TEM Simulator 1.3.The correlation of the new method was on average 6.2% larger than the one of TEM Simulator.The dierence was signicant with a p-value of 0.048.Despite the larger dierence in correlation, the larger p-value is due to a larger variance of the TEM Simulator results.TEM Simulator produced an empty volume for 4 of the 27 structures without any error message.
Finally, we repeated the previous experiment comparing the newly proposed method to the results from the standard method in X-ray crystallography.bsf failed to produce an appropriate volume for 4 out of the 27 volumes (either it produced an empty volume or it produced a volume with strange strip patterns and no recognizable molecule).The correlation of the remaining volumes with the EMDB volume was 56% larger using the new method compared to the correlation using the standard crystallographic method (with a p-value of 0, i.e., the bsf correlation was systematically lower than the one of the newly proposed method).bsf took 8 days to produce the 27 volumes while the proposed method took only 3 minutes (about 4,000 times faster).bsf allows to dene any set of weights for describing the atoms.We exploited this option and we substituted the original weights of bsf with the weights proposed in this article.bsf still failed to produce some molecules (taking the AIMS Biophysics Volume 2, Issue 1, 8-20 same amount of time as the one reported in the previous experiment).However, for the remaining structures, the correlation of the proposed method and bsf with the EMDB structure was not, as expected, signicantly dierent (condence level of 95%).
Note that the main dierence between the proposed method and the standard crystallographic method, as implemented in bsf and sfall, is that the newly proposed method is implemented in real space without any downsampling, while the standard crystallographic method is implemented in Fourier space.There is a performance dierence of more than 3 orders of magnitude in speed.

Application to Bayesian models in single particle analysis
In this section we show how the model developed above can be used to check statistical distributions such as the one put forward in [27] in which Fourier coecients of the reconstructed volumes were supposed to be independent, identically distributed random variables with a Gaussian distribution where the real and imaginary parts are independent and have zero mean.For doing so, we analyzed the Fourier transform of 500 atomic structures randomly selected from the Protein Data Bank [1].Our simulation framework provides us with a very accurate way to empirically estimate the statistical distribution of 3DEM map properties, like their Fourier coecients.
We computed the Fourier transform of the 500 structures as the sum of the Fourier transforms of each of their atoms modeled with LEASFs.The Fourier transform of the LEASF is a continuous function that we sampled every 0.01Å −1 in the frequency range between 0 and 0.5Å −1 (maximum resolution 2Å).For each possible pair of Fourier coecients, V l1 and V l2 , we tried to t a linear regression model of the real part of V l1 (analogously, the imaginary part) on the real and imaginary parts of V l2 , i.e.
If the independence assumption is true, then the b 1 and b 2 coecients should not be signicantly dierent from 0. We tested these hypothesis using the standard condence intervals for linear regression coecients [8] with a 95% condence.At least one of the hypothesis (b 1 = 0 or b 2 = 0) was rejected in 83.13% of the cases, i.e., the vast majority of the macromolecule spectrum has a statistically signicant linear dependence with other Fourier coecients.
We also studied the statistical distribution of the real and imaginary parts of the Fourier coecients corresponding to the aforementioned 500 atomic structures.If each complex coecient follows a bivariate Gaussian distribution, then its marginal distributions (the ones of the real and imaginary parts) should be univariate Gaussians (we used Lilliefors' normality test, [20]).The hypothesis that either the real part of the Fourier coefcient or its imaginary part are univariate Gaussians was rejected in 86% of the cases with 95% of condence.
Consequently, Fourier coecients cannot be Gaussianly distributed.
We used a Wilcoxon signed rank test (non-parametric test, [7]) to test the hypothesis that the real and imaginary parts of the Fourier coecients were actually centered at 0. The hypothesis was rejected in 54% of the cases with a 95% of condence (interestingly, the hypothesis was rejected in 100% of the cases up to a resolution of 3.3Å).
Next, we tested the assumption that the real and imaginary parts of the Fourier coecients have the same variance.We did this by computing a bootstrap estimate (1000 resamplings) of the ratio between the variance of the real part over the variance of the imaginary part (normally this ratio is distributed as a Snedecor-F, but the lack of normality prevents the use of this distribution).The variance of the real part was 1.9 times larger than the variance of the imaginary part, and the condence interval with a level of condence of 95% was between 1.3 and 3.3.
Finally, we tested the assumption that the real part and the imaginary parts of the Fourier coecients are uncorrelated (this is implied by the zeros of the covariance matrix Σ l ).As we did before, we tested whether the real part had a signicant linear dependence on the imaginary part or vice-versa.The hypothesis that both parts are independent was rejected in 56% of the cases with a 95% condence (the hypothesis was rejected in 100% of the cases up to 3.7Å).

Discussion
In this paper we have introduced a new basis function (LEASF) for the conversion of atomic models into density volumes based on Electron Atomic Scattering Factors (EASF, an accurate representation of the electron behavior

AIMS Biophysics
Volume 2, Issue 1, 8-20 inside the electron microscope in the presence of atoms).The EASF in Fourier space was already known and partly used by the EM community.However, our real-space counterpart allows a much faster implementation.
We have compared the other models to the newly introduced basis.We have shown that previous methods tend to overemphasize high-frequency components with respect to low-frequency ones, this is specially true for EMAN and Spider.An exhaustive statistical analysis using both the PDB and EMDB databases has shown that our method outperforms currently used approaches in the eld.Additionally, based on our new approach, we have derived more accurate ways to compute very realistic projection images.
The fact that the volumes simulated with the newly proposed basis functions correlate better with experimental electron microscopy volumes is an important issue that opens the door to all tasks in which converting from atoms to voxels is needed, like tting [37,39,38] or normal mode analysis [36,35].In fact, recently [13] it has been proposed that one of the ways to validate EM volumes is to correlate the EM reconstructed volume with the simulated atomic structure of those domains that are known at atomic resolution.Additionally, we have used the very accurate EASF-based modeling of EM-like density maps for PDB les to check the consistency of some statistical assumptions underlying a recently introduced method in the eld, specically, a Maximum a posteriori method used for 3D classication.The goal was to see if by using the information already existing in structural databases along with the the new LEASF approach we could check the validity of a number of statistical assumptions.The possibility of accurately converting atomic models into density volumes allows to estimate the empirical distribution of Fourier coecients and their joint probability density functions without the need to use closed-form a priori distributions such as the Gaussian.

Conclusions
We have here presented a strategy for generating reliable models of macromolecules for 3DEM that can be used for generating very realistic two-dimensional projections as well as 3D models.The availability of good models is important for developing new algorithms and for testing image analysis procedures prior to applying them to real data as well as for molecular tting purposes.With the methodology described in this paper, excellent three-dimensional model structures can be generated based on the sampling of the low-pass electron form factors (LEASF) for electron scattering.We have introduced an ecient way of sampling these functions at any desired sampling rate such that aliasing eects are exploited in order to faithfully reproduce the original EASF up to half the sampling frequency (Nyquist frequency).All operations are very fast, ranging from 2 to 11,000 times faster than currently used approaches.Additionally, memory requirements are much smaller since there is no need to explicit downsampling.The developed methodologies are now available in the Xmipp software system (http://xmipp.cnb.csic.es)[30].

Figure 1 .
Figure 1.Example of the atomic structure of the GroEL chaperonin (PDB entry: 1GRL).For display purposes, most atomic visualizers substitute each atom by a sphere whose color depends on the kind of atom.

Volume 2 ,Figure 2 .
Figure 2. Top: EASFs in Fourier space for several atoms.Bottom: Corresponding EASFs in real space.

Volume 2 ,Figure 3 .
Figure 3. Fourier spectra of dierent atom models proposed in the literature to simulate atoms in electron microscopy.The EASF corresponds to that of the carbon atom.The lower plot is a zoomed version of the top plot up to a frequency of 0.33 Å −1 , corresponding to a standard sampling rate in 3DEM of 1.5Å/pixel.The region corresponding to the zoomed plot at the bottom is marked by a rectangle at the top gure.

Figure 4 . 20 -Figure 5 .
Figure 4. Fourier spectrum of the carbon EASF up to a frequency of 0.33Å −1 along with its corresponding LEASF.Note the perfect agreement up to a sampling frequency of 1.5 Å/pixel.

Figure 6 .
Figure 6.Radial average of the 3D Power Spectrum Density (in decibels) of dierent volume generation strategies.