Simulations on time-resolved structure determination of uncrystallized biomolecules in the presence of shot noise

Determination of fast structural changes of biomolecules is usually performed on crystalline samples in a time-resolved pump-probe experiment. Changes in the structure are found by the difference Fourier method using phases of a known reference structure. As we showed recently, such changes can also be determined from diffraction of uncrystallized molecules in random orientations. In this case, the difference in the angular correlations of the diffraction patterns is used to find structural changes. Similar to the difference Fourier method, there is no need for iterative phasing. We validated this approach previously with simulations in the absence of noise. In this paper, we show that the effects of noise can be adequately suppressed by averaging over a sufficiently large ensemble as they can be obtained using an X-ray free electron laser.


I. INTRODUCTION
Time-resolved crystallography (TRX) 1 can be used to determine the structure of short-lived (transient) reaction intermediates in biological macromolecules. 2 TRX is traditionally performed using synchrotron-based Laue diffraction experiments that are currently restricted by X-ray beam brilliance, uniform reaction initiation, and the time resolution of the probe x-ray pulse. The advent of X-ray free electron lasers (XFELs) such as the Linac Coherent Light Source (LCLS) 3,4 and the SPring-8 Angstrom Compact Free-Electron Laser (SACLA) 5 have opened the avenue for ultrafast time-resolved studies. 6,7 Recently, high-quality difference electron density maps were generated from serial femtosecond crystallography (SFX) 8 on nanocrystals of photoactive yellow protein (PYP), 9 which allowed determination of reaction intermediates up to 1.6 Å . The critical bottleneck of synchrotron and XFEL-based crystallography lies in growing crystals of sufficient size and quality, 10 which is particularly difficult for membrane proteins. At the same time, crystal lattice constraints might influence both structure and kinetics extracted from the time-resolved X-ray data. 11 Other successful methods for structure determination, like nuclear magnetic resonance (NMR) and cryogenic electron microscopy (cryo-EM), pose restrictions on the kind of systems that can be studied-NMR methods are limited to proteins of about 40 kDa in molecular weight, 12 while cryo-EM is limited to intermediate resolutions of [6][7][8][9][10] and cannot be used for time-resolved studies. It is therefore desirable to develop non-crystallographic techniques for structure determination and to perform timeresolved experiment on biomolecules in solution and at physiologically relevant temperatures, so that unconstrained molecular relaxations can occur.
It was pointed out by Kam 15 that diffraction patterns from an ensemble of molecules in random orientations and frozen in time by using intense, brief X-ray pulses contain structural a) Author to whom correspondence should be addressed. Electronic mail: kpande@uwm.edu 2329-7778/2015/2(2)/024103/13 V C Author(s) 2015 2, 024103-1 information that tend to get washed out in the case of conventional small-angle X-ray scattering (SAXS), 16 which is obtained by exposure of molecules to X-ray pulses on time-scales longer than their rotational diffusion times. This structural information can be extracted by measuring intensity angular correlations, which yield a quantity that can be related to the threedimensional diffraction volume of a single molecule. Several proof-of-principle experiments have validated Kam's basic theory. Saldin and coworkers have employed particle symmetry to reconstruct the 3D structure of icosahedral 17 and cylindrical virus 18 from angular correlations. Liu et al. 19,20 have demonstrated the feasibility of ab initio reconstruction of low-resolution structures by using angular correlations in a fluctuation scattering experiment. In a previous paper, 21 we proposed and demonstrated by computer simulations the possibility of performing a pump-probe experiment to determine the difference electron density of a biomolecule on photoexcitation of an ensemble of uncrystallized randomly oriented biomolecules. We showed that the difference in angular correlations of two closely related sets of molecules, such as the "dark" and photoexcited molecules of PYP, is linearly related to the change in the electron density provided the structural change is small.
The method of correlated X-ray scattering has tremendous appeal because it is applicable to solution-phase studies at room temperature and does not suffer from algorithmic issues associated with "hit-finding" and orientation determination. Despite various successful theoretical studies on the basic methodology of correlated X-ray scattering, there is still a significant lack of understanding of the underlying factors that limit the signal-to-noise ratio (SNR) for intensity correlation measurements, which makes the analysis of experimental data a challenge. An important source of error is the detector shot noise, which we have modeled by Poisson statistics due to the discrete nature of photon arrival at the detector pixels. When the expected number of photons/pixel is large ($10), Poisson distribution can be approximated by Gaussian distribution. While it is widely accepted that scattering from solution is the largest contributor towards shot noise, we have only considered scattering from biomolecules in the present study. A detailed treatment of scattering from solvent and the resultant contrast loss will be the focus of a future paper. Another source of error comes from XFEL beam jitter due to the stochastic nature of the emitted pulses, 22 resulting in shot-to-shot intensity variations.
The main goal of this paper is to assess the behavior of the computational approach suggested in Ref. 21 for the extraction of difference electron density from angular correlations in the presence of substantial noise. We performed simulations employing a large number of "noisy" diffraction patterns calculated from an ensemble of molecules in random orientations and show that meaningful difference electron density can be extracted even in the presence of substantial noise and from an ensemble of particles that contribute to individual diffraction patterns.

II. METHODS
We begin by defining an angular pair correlation function between the intensities at points in the detector plane specified by the polar coordinates (q; /) and ðq 0 ; / þ D/Þ, averaged over a set of measured XFEL diffraction patterns corresponding to several random particle orientations where hÁ Á Ái /;DP represents an average over all azimuthal angles and all diffraction patterns. If N particles contribute incoherently to a single diffraction pattern, the total intensity may be written as where x i is the particle's orientation, and the sum over intensities is justified even in the coherent case if the inter-particle separations are random. 21,23 Thus, Eq. (1) has a contribution from a term where the particle orientations are the same, namely, In Eq. (4), the arrow indicates the mathematical limit when the number of diffraction patterns ! 1. Note that the second term does not exist for scattering from single particles. However, since typical illumination areas in current applications of XFEL radiation tend to be much larger than a small biological molecule like PYP, often the subject of time-resolved studies in crystals, it is quite likely that a typical sample will contain multiple particles in a typical experiment, so we keep the discussion general for now. For multiple scattering particles, we can remove the second term from Eq. (1) by redefining the pair correlations as C 2 ðq; q 0 ; D/Þ ¼ hIðq; /ÞIðq 0 ; / þ D/Þi /;DP À hIðq; /Þi /;DP hIðq 0 ; /Þi /;DP : Intensities Iðq; /Þ on a polar grid are computed by cubic interpolation of simulated intensities I(q x , q y ) from an oversampled Cartesian grid (since XFEL detectors are usually arranged on a Cartesian grid in the detector plane). The angular pair correlation function C 2 ðq; q 0 ; D/Þ is related to a quantity B l ðq; q 0 Þ, 15 which depends on the angular momentum quantum number l. The quantity B l is obtained from the measured C 2 quantities by performing the integral j is the wavenumber of the incident radiation (defined by 2p/k, where k is the x-ray wavelength), and P l is a Legendre function of lth degree. As has been shown previously, 17 B l ðq; q 0 Þ can be expressed in terms of the spherical harmonic components I lm (q) of the 3D diffraction volume IðqÞ of a single particle by Consider small structural changes in the biomolecules upon photoexcitation. Under this assumption, we may take a variation on both sides of Eq. (8) to give In terms of the complex structure factor AðqÞ and the difference electron density dqðr j Þ between the dark and photoexcited structures, we can write dI lm (q) as where we have used the relation and dAðqÞ ¼ X Using Eq. (10) in Eq. (9), we get and where In Eq. (15), matrix M depends only on the complex structure factor AðqÞ of the dark structure, which is usually completely known in time-resolved experiments, and the quantity dB l ðq; q 0 Þ is determined directly from the experimentally measured intensities of the dark and excited structures.
Since the correlations C 2 ðq; q 0 ; D/Þ are symmetric in q, dB l ðq; q 0 Þ is also symmetric in q, i.e., dB l ðq; q 0 Þ ¼ dB l ðq 0 ; qÞ; ) M lqq 0 ;r j ¼ M lq 0 q;r j : Looking at the form of M in Eq. (15), we see that hence the elements of M are real. Similarly, the elements of the column vector B l ðq; q 0 Þ are also real. For a particle in an orientation x, the scattered intensity collected on a detector pixel of solid angle DX at a scattering vectorq is given as where F is the incident flux (photons/area), the differential cross-section of the particle is with r e being the classical electron radius and the solid angle subtended at the sample by a detector pixel at an oversampling ratio o ! 2 in the reciprocal space is given as where k is the wavelength of the incoming radiation and D is the diameter of the particle. Combining the above equations, the expression for the intensity detected by a single pixel is To study the effect of noise on the reconstruction of the difference density dqðrÞ, we introduce shot noise at each detector pixel. We define P(n, I) to be the probability of measuring a photon count n with an expected intensity of I. The probability P(n, I) can be written as Pðn; IÞ ¼ PðnjIÞPðIÞ; where PðnjIÞ is the conditional probability of measuring n photons when the expected intensity is known to be I and is given by the Poisson distribution

III. RESULTS AND DISCUSSION
The simulations reported in this paper were done for the PYP (protein data bank (PDB) entry 2PHY). Since we are working with single isolated particles, we did not assume any periodicity of the reconstructed real-space image and took the size of the real space volume that we reconstructed as a 50 Å cube, a little more than the size of a PYP molecule in a crystal. The simulations were done for a real space resolution res ¼ 5 Å , giving rise to a (21 Â 21 Â 21) grid in which the difference density was represented.
As per the following relations: the values for q max and l max at 5 Å resolution and a diameter of 50 Å were found to be $1.25 Å À1 and 30, respectively. Since the intensities follow Friedel's law, i.e., IðqÞ ¼ IðÀqÞ, only even values of l have non-zero I lm (q). Note that the width of the reciprocal space grid is taken in accordance with the Shannon spacing to ensure that the rows of matrix M lqq 0 ;r are linearly independent. In order to have an almost square matrix M, the number of resolution rings in the reciprocal space on a polar grid is given by Hence, the intensities have to be calculated for q 0 max higher than the q max , where The DPs were simulated on a Cartesian grid, oversampled eight times in a linear dimension, and then interpolated to a polar grid using cubic interpolation. We simulated DPs in 1 000 000 random orientations 24 and then selected a random subset of these DPs for our calculations. For simulations with noise, since the Poisson shot noise applies to each detector pixel, it was applied to intensities calculated on the Cartesian grid before cubic interpolation. Using F ¼ 8 Â 10 12 photons/lm 2 per pulse, r e ¼ 2.8179 Â 10 À15 m, and k ¼ 2 Å ($6 keV) in Eq. (22), and Wilson statistics for structure factor, we find that PYP scatters $3 photons/Shannon pixel at 5 Å resolution, as shown in Fig. 1. Simulations have also been done for higher photon counts of 15 and 30 photons/Shannon pixel. Typical 2D DPs for these photon counts are shown in Fig. 2.

A. Matrix inversion: Singular value decomposition (SVD)
The SVD of a matrix M (where M is a m Â n matrix, with m > n) is the factorization of M into the product of three matrices where U is an m Â m matrix, R is an m Â n matrix, and V is an n Â n matrix. U and V have orthogonal columns so that and and R has entries only along the diagonal. The diagonal elements of R are non-negative and appear in decreasing order such that We find that the matrix M is highly ill-conditioned (the condition number of M is the ratio r 1 /r n , which for the present case is $10 22 ), hence we use truncated SVD (TSVD) to compute M À1 to solve for dq. The solution dq to the system of equations is given as where f i are filter factors, and for the case of TSVD f i ¼ 0; 1. Here, u i and v i are the columns of matrices U and V, respectively. For a discrete ill-posed problem, such as the one under consideration in this paper, the coefficients ju T i dBj corresponding to the smaller singular values do not decay as fast as the singular values, but rather tend to level off as shown in Fig. 3. The solution is therefore dominated by the terms in the sum corresponding to the smallest r i and appears completely random. 25

B. Reconstruction of difference density
To quantify the quality of the difference electron density recovered with our algorithm, we use the Pearson correlation coefficient 26 CC r given by Here Dq 1 and Dq 2 are the two difference densities being compared and are limited to the pixels at which jDqj < 2r, where r is the variance of Dq.
In the rest of the paper, we will look at the reconstruction of difference density for (I) the cis-trans isomerization of the chromophore (HC4) and arginine (ARG52) and (II) the movement of widely separated residues HC4 and PHE121. Since the initial dark structure is assumed FIG. 2. Simulated 2D diffraction patterns at 5 Å resolution for 1 particle/DP, 10 particles/DP, and 25 particles/DP for photon count of 3, 15, and 30 photons/Shannon pixel. known, the matrix M and its SVD need to be computed only once. We first calculate the real-space correlation coefficient between difference density calculated directly from the PDB files of the two structures (dq 1 À dq 2 ¼ dq pdb ), and that using our suggested inversion method [Eq. (14)]. The dB l ðq; q 0 Þ are computed using the spherical harmonic decomposition of intensities calculated from the PDB files (dq 2 ¼ dq I lm ). and 4(f) look very similar, and result in highest real-space correlation coefficient CC r between dq pdb and dq I lm is $0.7 for I and 0.75 for II. It must be pointed out that while the truncation of singular values helps to regularize the solution to dq, it results in a loss of resolution. For low values of truncation parameter i, the density on ARG residues is not clearly resolved. It should also be noted that the difference features are clearer when the residues are widely separated as in structure II (lower panel), than when they are very close, as in structure I (top panel). Next, we consider the effect that the number of particles exposed on each diffraction pattern has on the resulting density reconstructions (dq corr ). In Fig. 5, we show results for CC r and dq corr for quantities obtained from simulated noise-free DPs for structural change I, which involves movement of residues HC4 and ARG52. Figs. 5(a), 5(b), and 5(c) show CC r plotted as a function of the number of singular values and the number of DPs over which the correlations are averaged for calculation of B l ðq; q 0 Þ for DPs with a single particle, 10 particles, and 25 particles per snapshot, respectively. The corresponding dqðrÞ are shown in panels (d), (e), and (f). In all cases, we find that CC r improves with increase in the number of DPs, and in all cases CC r is highest when the singular values are truncated at between i ¼ 3000 and 4000, dropping considerably for 5000 singular values. For DPs with a single particle per snapshot, a good convergence to the expected structure is obtained with as few as 30 000 DPs. On the other hand, for the DPs with 10 and 25 particles per snapshot, the CC r for 100 000 DPs is lower than that for 10 000 DPs when there is only one particle per snapshot.
We now consider the case when the 2D snapshots contain shot noise given by the Poisson distribution in Eq. (24). Noisy correlations C 2 ðq; q 0 ; D/Þ are calculated for snapshots according to our prescription that each detector pixel detects only a whole number of photons. Calculations have been done for scattering from one particle, 10 particles, and 25 particles per snapshot each for three different values of measured photon counts per molecule: (i) 3 photons/ Shannon pixel, which is the number of photons scattered by PYP at 5 Å resolution when the incoming flux F ¼ 8 Â 10 12 photons/lm 2 per pulse (expected at the European XFEL 27  The correlation coefficients CC r between ideal difference density and those recovered by our technique are shown in Fig. 6, for expected movements of the chromophore and nearby ARG 52 residue in a time-resolved pump-probe photo excitation experiment. The CC r is shown as a function of both the number of particles illuminated and the detected scattering signal expressed in terms of photons/Shannon pixel. The corresponding difference density reconstructed for the point with the highest CC r is shown in Fig. 7. For a signal of 3 photons/Shannon pixel and scattering from one particle per snapshot, the signal is too weak to reproduce any difference features when the correlations are averaged over 400 000 DPs, as seen in Fig. 7(a). When the scattering is from 10 and 25 particles per snapshot, CC r shows some improvement, which is manifested in the appearance of difference features on the chromophore (HC4 residue) of the molecule in Figs. 7(b) and 7(c), respectively. For a signal of 15 photons/Shannon pixel, only the difference density on the chromophore is recovered from simulations for both the single and multiple particle snapshots, as shown in Figs. 7(d)-7(f). A small signal on the ARG is seen for a signal of 30 photons/ Shannon pixel when the scattering is from only one particle/DP. However, when there are multiple particles/DP, the CC r is lower and the difference density features are weaker, and show similar trend as the CC r for the noise free case. In Ref. 23,Kirian et al. have also observed that the SNR for correlated fluctuation SAXS (CFSAXS) in the high flux regime rapidly approaches a limit independent of the number of particles contributing to the diffraction pattern. Fig. 8 shows the variation of CC r as a function of the scattering signal and also the number of particles in the 2D snapshots. It is clear that at low flux (signal ¼ 3 and 15 photons/Shannon pixel), the correlation coefficient improves when the scattering is from multiple particles per snapshot. While there is a slight increase in CC r for 15 photons/pixel when going from 10 to 25 particles/DP, it decreases for signal of 3 photons/Shannon pixel. This is because both these data points come from correlations averaged over 400 000 snapshots. When the signal is lower and the number of particle is higher, a larger number of diffraction patterns are required to recover the difference features. It is important to note that for a signal of 30 photons/Shannon pixel as well as for the noise free case, increasing the number of particles in the snapshot does not improve the CC r and the difference density. Similar conclusions have been reported in Ref. 23.

IV. CONCLUSIONS
The method of analyzing XFEL data of uncrystallized molecules using angular correlations can be exploited to reveal fast structural changes of biomolecules in a pump-probe experiment without the need for a phasing algorithm. Unlike algorithms for ab initio structure determination which reconstruct only a 3D diffraction volume from which the real-space structure needs to be deduced by means of an iterative phasing algorithm, it is possible 21 to derive an algorithm that can reveal directly the real-space difference structure in analogy to the difference Fourier method of time-resolved crystallography.
The present paper examines the feasibility of this method in the presence of noise. Kirian et al. 23 have pointed out that due to the fact that the method of angular correlations gives the signal required as a small perturbation of a much larger SAXS/wide-angle x-ray scattering (WAXS) signal from uncorrelated scattering from differently oriented particles, the noise on the SAXS/WAXS signal is actually comparable to the signal sought.
The same paper concluded that the signal-to-noise ratio actually improved with N 1=2 , where N is the number of diffraction patterns used in the analysis. The point is that the distorting effect is noise, while the sought effect is signal. Noise can be reduced by averaging over a large number of diffraction patterns, whereas such reductions do not occur in the signal. However, the signal we seek in such experiments is very small since it is associated with a difference structure. Consequently, even a small amount of noise can significantly distort it, and averaging may need to be performed over a large number of diffraction patterns. Exactly how many are needed can only be found by detailed simulations as performed in this paper, or even better by experiment. The results of this paper suggest that such experiments may be feasible.
It should be noted that at the pulse repletion rate of a current generation XFEL, like the LCLS, it is not unusual to measure a few million diffraction patterns during the course of an experiment. Figs. 4-6 all suggest that the results improve with the number of diffraction patterns measured. More importantly, Figs. 6 and 7 suggest that the results improve quite significantly with greater incident flux. Both of these improvements become possible at future XFEL sources, so prospects of this method for the future are promising.
In experiments on uncrystallized biomolecules, scattering from the solution containing the biomolecules will likely be the dominant source of background noise, especially at low photon counts. At low resolution, the effect of the solvent can be modeled using Babinet's principle, 28 which states that the scattering from a uniform object is the same as the scattering from a hole of the same shape. However, for subnanometer resolutions, correlated scattering from closely packed solvent must be taken into consideration. Kirian et al. have outlined an approximate treatment of solvent scatter in Ref. 23 for the case of incoherent scattering, where they have shown that the SNR in the presence of solvent varies as N N s of the SNR for scattering from biomolecules only. Here, N N s is the ratio of protein to solvent molecules. Although the water background was not considered in the computation of the diffraction patterns here, the methods outlined here can already be applied to experiments where the molecular ensemble is injected with as little water as possible, for example, by using an aerodynamic lens particle injector. 29 Work is in progress on a detailed study of the effect of scattering from solution on the use of angular correlations for recovery of static as well as dynamic structures and will be the subject of a future report.
It must be noted that the simulations reported in this work have been done for sufficiently dilute particle concentration under which condition it is reasonable to neglect inter-particle interference fringes. For the case where N particles scatter coherently, the total diffracted intensity on a snapshot can be written as where F n ðqÞ is the complex structure factor of n th particle, andr nm is the relative displacement between the particles. Since the particle positions and orientations are assumed random, it is expected that the interference term averages out when the correlations are calculated over a sufficiently large number of diffraction patterns. 30 Since PYP is a relatively small protein, only a few X-ray photons are scattered per molecule which is challenging for this method. With larger macromolecules, the number of scattered photons increases. 31 In this case, meaningful difference densities may be already calculated from fewer diffraction pattern. Archetypal photoreactive molecules are those large molecules involved in photosynthesis. In a recent study, structural changes were obtained from a bacterial photosynthetic reaction center using molecular dynamics simulations constrained by timeresolved difference SAXS/WAXS data. 32 With an adaptation of our method, we may be able to directly invert the pump-probe diffraction patterns to difference electron densities