In-plane rotation classiﬁcation for coherent X-ray imaging of single biomolecules

: We report a new classiﬁcation scheme with computation complexity well within the capacity of a PC for coherent X-ray imaging of single biomolecules. In contrast to current methods, which are based on data from large scattering angles, we propose to classify the orientations of the biomolecule using data from small angle scattering, where the signals are relatively strong. Further we integrate data to form radial and azimuthal distributions of the scattering pattern to reduce the variance caused by the shot noise. Classiﬁcation based on these two distributions are shown to successfully recognize not only the patterns from molecules of the same orientation but also those that differ by an in-plane rotation.


Introduction
With the advent of the X-ray free electron laser, it is possible to image a single biomolecule through coherent diffraction imaging [1][2][3][4], albeit at lower wavelengths. Many promising approaches have been suggested for accomplishing this goal [5][6][7][8][9][10][11][12][13][14][15][16][17][18]. In a typical coherent X-ray imaging experiment, single biomolecules are injected into a pulsed X-ray beam. The elastically scattered photons are recorded by a detector array placed downstream. Due to the ultrashort time scale of the pulse, the scattering pattern can be recorded before the molecule is destroyed [19]. The recorded scattering data are then sorted according to the orientations of the biomolecules. Signal strength can be improved by averaging shots of the biomolecules having the same orientation, which will benefit subsequent steps of phase retrieval and image reconstruction. Due to the small size of the biomolecule and small elastic scattering cross-section of the electron, we encounter two problems for the coherent X-ray diffraction imaging. One is that the detected signal will be mostly dominated by Poisson noise. This is especially the case when the scattering angle is large, as the scattering is mostly in the forward direction for coherent illumination. The other difficulty is that the biomolecule itself is intrinsically a 3D structure and its orientation is not controlled. Thus, orientation classification is the first crucial step of data processing in coherent X-ray imaging.
Early work on classification was based on large scattering angle data where the diffraction pattern is modeled as a speckle pattern [12,20]. In practice, this is primarily true for largeangle scattering where the detected photon level is very low with the current or near future X-ray photon budgets. Thus the classification schemes based on large-angle-scattering-data are not currently practical. Recently there are two interesting developments where Expectation-Maximization algorithms is utilized for classification [21,22]. The signal level for successful classifications is much reduced. Another imaging scheme is to simultaneously inject many randomly oriented biomolecules [23]. The scattered photons are much more and the orientation classification is unnecessary in this setup.
In this paper we focus on scattering by a single particle and propose a classification scheme based on patterns of low scattering angles where a substantial number of photons can be detected. Further the comparison among different rotations is not directly based on the pattern itself but on compressed signals which are the radial and azimuthal distributions of the scattering pattern. As both of these distributions are signals integrated over many pixels, the variance in the radial or azimuthal distribution is much smaller than that in the individual pixels. Consequentially a classification can be relatively immune to the shot noise and the results of the classification can be trustworthy. An example of utilizing those compressed signals is given in this paper for classification of in-plane rotations (i.e., relative orientations between biomolecules are within a plane parallel to the detector plane).

Weak elastic scattering process and the issue of shot noise
In a typical setup for coherent diffraction imaging of single biomolecules, the samples are injected into the waist of the focused coherent X-ray beam and a detector is placed downstream to record the scattering pattern. The scattered pattern recorded by the detector array, I(x, y), can be written as where r e is the free electron scattering length, Φ inc is the incident fluence and Ω pix is the solid angle of one detector pixel extended with respect to the biomolecule. The coordinates and atomic form factor of the n th atom in the biomolecule are described by (x n , y n , z n ) and f n , respectively. The pixel coordinates for the detector array are (x, y, z) and the wavelength of the X-ray is lambda. The center region of the detector array has a hole to pass the unscattered photons. In this paper we are interested in the small-angle scattering patterns, the variation of the atomic form factor over the scattering angles can thus be ignored and f n is assumed to be a real function. The DNA polymerase delta (PDB ID:3IAY, molecular weight: 113822; total number of atoms: ∼ 8.2K) is used as an example of a typical biomolecule to calculate the diffraction pattern and perform orientation classifications.
In this paper we also assume that the noise is mainly shot noise, the detected photon number in any pixel is a Poisson distributed random variable whose mean value can be calculated from Eq. (1). In Fig. 1 we show a typical scattering pattern by the single biomolecule when the incident photon number is infinite (noise free), 2 × 10 14 and 2 × 10 12 per pulse. The beam is assumed to be 100 nm and the X-ray wavelength is 1.5Å. The detector array is 150 mm away from the sample and the pixel size is 0.11 mm. From Fig. 1 we can see that for the "noise free" case, the scattering pattern looks like speckles in regions corresponding to large scattering angles. However, with a finite incident photon number such as 2 × 10 14 photons per pulse, the contrast of the speckles is diminished and some even disappear. For the 2 × 10 12 photons per pulse case, which is the current available level at LCLS, the number of photons scattered with scattering angles larger than 2 • is so little that basically all speckles disappear. It will be very difficult to distinguish any exposures from each other, not to mention conducting orientation classification with this incident photon level.
Consider an annular area of the scattering pattern with radius ρ min ≤ ρ ≤ ρ max . We divide this area into regions of radial or azimuthal segments as shown in Fig. 2 between these radial segments are rings with radii ρ 1 , ρ 2 , ..., ρ N , which can be defined as ρ n+1 = ρ n + Δρ n , n = 0, 1, 2, ...N − 1; (2) where ρ 0 = 3.72mm, Δρ 0 = 0.66mm. We see that the width of the outer ring is larger than the inner rings. This will help to compensate for the decreasing signal strength with increasing radius. The radial distribution, I ρ (n), can be written as, Note that I ρ is normalized with respect to the total scattered photons in the effective detection area. In this paper we choose N = 60, which is able to significantly reduce the variance in the radial distribution and still give us sufficient sensitivity to rotations as we will see in the following parts of this paper.
In the simulation study we choose δ θ = 1 • . As the scattering pattern is symmetric about the origin, we limit the value of θ to be within [0, 180 • ]. The azimuthal distribution can be written as: In order to study the effect of the noise, the mean value of the radial and azimuthal distributions are first calculated from the "noise free" diffraction patterns. The noisy versions of the radial or azimuthal distribution are calculated from the noisy diffractions when the incident photon number is 2 × 10 14 and 2 × 10 12 respectively. A typical radial and azimuthal distributions and their mean values are shown in Fig. 3(a) and 3(b) respectively. We see that when the incident photon number is 2 × 10 14 , the noisy version of the radial and azimuthal distribution is similar to its mean value. However when the incident photon number is 2 × 10 12 , the noisy version of the radial and azimuthal distribution deviates substantially from their corresponding mean values, which will affect the reliability of the classification adversely.
In Figs. 4 we show surface plots when the biomolecule rotates about the x, y, and z axes. We see that with rotations about x or y, both the radial and the azimuthal distributions change smoothly with rotation angles. For rotations about z, the radial distribution stays the same while the azimuthal distribution shifts continuously with the rotation angle. Thus rotation about z (also called an in-plane rotation) can be recognized by studying the evolution of the radial distributions, and the rotation angles can be identified by finding the amount of the shift of the azimuthal distribution. For rotations other than along the z-axis, both the radial and azimuthal distributions are going to change smoothly with the rotation angles. Thus a distance between two distributions, e.g. Euclidean distance, can be a measure describing the global changes during continuous rotation. A small distance between distributions means that the two distributions are quite similar to each other and the relative rotation angle between them is small. In a separate simulation we modified a large biomolecule (PDB ID: 3I55, size ∼ 200Å) by removing all atoms greater than 90Å from the centroid and showed that the same process can be easily applied to spherical molecules.

Euler angles and the possibility of in-plane rotation
In the previous sections we have devised a method to preprocess the scattering data to improve the data consistency for the classification of orientations. In this section, we will use introduce Euler angles, which have been widely used in the cryo-EM field, to represent arbitrary orientations [24][25][26].
An arbitrary rotation of an angle φ about an arbitrary axis can be described by three consecutive rotations, i.e., where R is the rotation matrix that depends on the rotation axis and rotation angle φ . The first rotation is about the intrinsic z-axis, the second rotation is about the intrinsic y-axis and the last one is again a rotation about the intrinsic z-axis.
As the low-angle scattering pattern is proportional to the square of the amplitude of the Fourier transform of the scattering potential, which is a real and even function, we can limit the ranges of α, β and γ to [0, π].
The last γ-rotation is the so-called in-plane rotation because the z-axis of the intrinsic coordinate and the z-axis of the laboratory coordinates are co-axial. If we can classify the in-plane rotation out of all possible rotations and know the angle of the in-plane rotation of a selected pattern with respect to a reference pattern, we can rotate that pattern to overlap with the reference pattern. This will give us an opportunity to increase the SNR by averaging over all the in-plane rotations. The extent of the SNR improvement depends on the number of in-plane rotations.
Consider the random orientation of a single biomolecule when injected into the passage of the X-ray beam, the probability of the orientation which can be described by the Euler angles (α, β , γ) and can be written as [27] Due to the symmetry of the scattering pattern, the effective ranges for Euler angles are within [0, π], thus we can write the effective probability density function in terms of α, cos β , γ as Thus the probability density function is uniform within a rectangular volume of 2π 2 with Euler rotations defined by α, γ and cos β . With a total number N total of random rotations, the number density for α and γ is N 1/3 total /pi and N 1/3 total /2 for cos β . Thus the mean number of inplane rotations is expected to be: where n is the number of classes for each Euler angles. Thus the SNR of averaging over all the in-plane rotations can be further improved by N 1/6 total √ n times compared to averaging over only same orientations.

In-plane rotation recognition through the radial distribution
We can also see from the surface plots shown in Fig. 4(e) that the radial distribution stays unchanged with continuous rotation about z. Thus, we may identify the in-plane rotation by studying the radial distribution.
Consider a distance defined as where I test and I re f are the scattering pattern before and after the in-plane rotation, respectively, I re f ρ and I test ρ are the radial distributions calculated according to Eq. (3). This "radial" distance should be zero if there is no shot noise, but with shot noise unavoidable, this distance should still be minimum if it is to be an effective classification measure. Thus we start from the biomolecule in an initial orientation, compute the scattering pattern according to Eq. (1), include the Poisson distributed noise and compute the corresponding radial distribution. This radial distribution is taken as the reference distribution I re f ρ . Then we rotate the biomolecule along y-and z-axis by angles from 0 to 180 • . The corresponding noisy scattering pattern I test and radial distribution I test ρ are computed in the same way as I re f and I re f ρ . Then the distance between I test ρ and the reference distribution I re f ρ is computed according to Eq. (9). Clearly this distance is a function of the rotation of the test pattern with respect to the reference pattern. We repeat the distance calculation 100 times independently and calculate the mean and variance of the distance from those results (Fig. 5). In order to understand the effect of the incident photon number on the distance calculation, we have used two incident photon number levels: 2 × 10 14 (Fig. 5(a)) and 2 × 10 12 (Fig. 5(b)). We see that with both cases, the mean value of the distance maintains the lowest level for the in-plane rotations and increases with the out-of-plane rotation (rotation about y-axis) in general. Zoomed-in versions of the distance curves around the reference position are plotted in Fig. 5(c) and 5(d). We can see that for the case of 2 × 10 14 incident photon number, the variance in the distance is so small that we can differentiate even a half degree of out-of-plane rotation from in-plane rotations with high certainty. This means that the sensitivity of the above classification is about 0.5 • . With less incident photon number, the scattering patterns are more noisy and have a larger variance (Fig. 5(d)). The sensitivity of the classification is about ±2.5 • in this case.

Finding the in-plane rotation angles through the azimuthal distribution
After having identified scattering patterns corresponding to in-plane rotations, we can find the angle of the in-plane rotation by studying the azimuthal distribution I θ . This angle information is needed if we hope to average over in-plane rotations. In this section we describe a simple method to find this angle.
If I re f θ is the azimuthal distribution of the reference pattern and I test θ is the azimuthal distribution of the test pattern with only an in-plane rotation γ with respect to the reference pattern, then these two distributions should be shifted with respect to each other. This can be clearly seen in the surface plot shown in Fig. 4(f). The amount of the shift should be the angle of the in-plane rotation of the biomolecule with respect to its original position.
Consider a new function, I θ , which is a circularly-shifted version of I test θ , i.e., where θ shi f t is the shift angle. When the shift angle equals the in-plane rotation angle γ, the shifted azimuthal distribution I θ shall overlap with the reference azimuthal distribution I re f θ . Thus we can define a distance between I θ and I re f θ as: where I re f and I test are the scattering patterns before and after the in-plane rotation, respectively. This distance is a function of the shift angle θ shi f t . When the shift angle is exactly the in-plane rotation angle, this distance should be zero if there is no shot noise. With shot noise being unavoidable, this distance should reach the minimum when θ shi f t = γ. For example, when the in-plane rotation is 90 degrees, we compute the distance according to Eq. (11) as a function of the shift angle θ shi f t . In order to study the effect of the incident photon number on the distance, we have again studied two cases, 2 × 10 14 and 2 × 10 12 (Fig. 6). We can see that the distance reaches the minimum when the shift angle equals the in-plane rotation angle for both incident photon number levels. Note the steepness of the curves around the in-plane rotation angle, indicating that the distance is quite sensitive to the shift angle. In this example, it is about 0.5 • , which is much smaller than the sensitivity for identifying the in-plane rotations from all possible rotations. Thus in the next few sections, we will use the sensitivity of the radial distance as the limit of the overall sensitivity of in-plane rotation classification. With in-plane rotations classified (section 5) and the angle of in-plane rotation determined as described in this section, we can rotate those in-plane rotated patterns to overlap the reference patterns, and average those patterns to improve the signal-to-noise ratio. In Fig. 7 we show a typical single shot of the scattering pattern and the averaged pattern from 19 shots with different in-plane rotations. Comparing these two images we can clearly see that the signal-to-noise ratio is improved after the averaging, especially in regions away from the center.

Accuracy of the classification vs photon budget
As we saw in the previous sections, the simple distances can be used to compare the radial or azimuthal distributions to distinguish between in-plane rotations and out-of-plane rotations, and to find the degree of in-plane rotation. In this section we are going to study the performance of the classification scheme for different incident photon numbers. To be specific, we want to know the ratio of the correct classification results from the results of the proposed classification method.
In order to visualize the success rate of the proposed algorithm, we have generated a number of random orientations and their corresponding diffraction patterns, then used the method described in Section 5 to find the in-plane rotations. The Euler angles for those orientations are limited to {0 o , 180 o }. Twenty random values for each α, cos β , γ are generated independently.  Thus a total number of orientations is 20 × 20 × 20 = 8000, ignoring the inconsequential fact that the ranges for those three random variables (α, cos β , γ) are not all equal to each other. The diffraction patterns for the protein with those orientations are generated using Eq. (1). Then we select an arbitrary pattern as a reference frame and assign the rest of the patterns to be the test patterns. The corresponding radial distribution for the reference frame and the test patterns are computed according to Eq. (3). Then the distance is calculated using Eq. (9) to compare the reference and test frames. Consider a total number N total of patterns from randomly oriented biomolecules, the total number of pairwise distances, N d , will be N total × (N total − 1) if each pattern is compared to other patterns in the data set. Note that the comparison between any two patterns are performed twice since each one has the opportunity to be a reference pattern. We also assume that the number of random values for each Euler angle is N 1/3 total . Thus the percentage for in-plane rotations among all possible rotations shall be N 1/3 total /N total . The total number of distances that describe in-plane rotations, N d in , is expected to be As the distances describing the in-plane rotations should be zero if there is no noise in the experiment, but with noise, we choose N d in shortest distances as arising from comparison of two patterns who differ by at most an in-plane rotation. Of course, the noise inherent in the frames will tend to mix the in-plane rotations and out-of-plane rotations, especially those patterns with small out-of-plane rotations.
As we have generated the patterns and know the orientations of the test frames, we can verify the results of the classification with the known information. Thus we can define the accuracy of the classification as: where N true and N classi f ied are the number of true and assigned in-plane rotations, respectively. This accuracy rate can provide a quantitative measure of the classification performance. For a certain classification scheme, it depends mainly on two factors, noise level and the size of the orientation classes. For patterns that were found to be in-plane rotations, their values for α and β should be identical in the ideal case, but with noise, we say the classification is accurate if α and β differ by an amount smaller than the size of the orientation class. In this section we will test the class size from 1 • to 5 • . In Table 1 we list the accuracy of these simulation results for different incident photon numbers and different orientation class sizes. a The values outside the parenthesis are for a complete data set and those inside the parenthesis are for the smaller data set described in the text.
We can see that even with the current incident level, the success of the classification can be as high as 79% with a class size of 1 • , which is required for atomic resolution. If we can relax the requirement on the sensitivity of the classification, we can achieve 92% accuracy for the class size of 3 • and 97% accuracy for 5 • . In this way we can break the data set into smaller groups and use more sophisticated algorithms such as described in [21] or [22] to find additional information. Research in this direction is underway. With higher photon levels, almost 99% accuracy can be achieved for class size of 1 • . The accuracy from a mere guess is expected to be N inplane /N total , which is only 0.25% in this case.
Note that with the limited number of generated random orientations due to available computing speed, the orientations which are close to each other in the R 3 space may not have been fully simulated in the above example. Thus we perform another round of simulations where the Euler angles are limited to α ∈ (0, 10 • ), cos β ∈ (0.9, 1). As the radial distributions are invariant to inplane rotations, we choose the limit on the in-plane rotation angle still to be (0, 180 • ). For each Euler angle we generate 10 random numbers resulting in 1000 patterns. With a much smaller range of Euler angles, the percentage for two orientations whose Euler angles (α i , β i , i = 1, 2) are within 1 • becomes much higher. In our simulation, the percentage is about 48%, i.e., almost half of the orientations have a similar orientation. The accuracy in this small data set can be lower compared to the entire data set. In Table 1 we list the accuracy (numbers inside the parenthesis) for different photon numbers and orientation class sizes.
We see that even in this smaller data set where almost half of the orientations have a nearby orientation, the success of the classification is still almost 50%. Note that with a smaller data set, the expected success rate by a mere guess is bigger. In the case we have described above, it is about 10/1000 = 1% and so the proposed classification scheme still gives a much better result.

Discussion and summary
In this paper we have shown a method to compress the 2-dimensional diffraction patterns into 1-dimensional distributions, namely, radial and azimuthal distributions. With the compressed signals, the variance is greatly reduced. Those signals can be utilized as a first step in processing the data from coherent X-ray experiments. As an example, we have used a simple distance measure computed from the radial and azimuthal distributions to compare the test frame and the reference frames. As the radial distributions are invariant to in-plane rotations, we can use the radial distribution to find the in-plane rotations. The data is divided into classes of (α, β ). With the proposed classification scheme we can achieve 79% accuracy even with current incident photon number. With a protein that is bigger than used in this paper or possesses additional symmetries, a good classification may be achieved with an even lower incident photon level. We have performed two simulations to test the performance of the proposed classification scheme. The performance for a real situation should be between these two cases.
With the degree of in-plane rotation identified, we can rotate one pattern to overlap with the corresponding reference pattern. This provides us an opportunity to be able to average over more patterns improving the signal-to-noise level further compared to averaging over only patterns with the same orientations. The method is non-iterative in nature, which makes it a good candidate for data preprocessing.
Another interesting aspect of the suggested compression method is that the radial distribution is invariant to in-plane rotations. This property could be used in conjunction with the manifold method described in [22]. A hyper-space constructed from the radial distribution shall exhibit a 2-manifold depending only on α and β . With fewer parameters to estimate, searching the manifold will be easier, which could potentially lower the required photon level even more. Further as the variance in the radial distributions is much smaller than those in the individual pixels, the searching of the 2-manifold will be more robust to noise, thus offering us a possibility to work with even fewer photons per pixel.
As far as atomic resolution is concerned, it is associated with large-angle scattering data, which lies on the Ewald sphere. As the signal-to-noise ratio is extremely low in these regions, averaging diffraction patterns within the same orientation class is necessary to obtain meaningful large-angle-scattering data. If the size of the orientation class is large, information will be lost during averaging. For a biomolecule of size a, the maximum displacement of the same atom within one orientation class of size δ θ can be estimated as |δ r| = a 2 δ θ .
In order to maintain atomic resolution, the displacement of atoms within one class should be smaller than an angstrom. Thus for a biomolecule of size about 10 nm, such as the one considered in this paper, the size of the orientation class should be around 1.2 • . From Table 1 we see that the classification scheme proposed in this paper can divide the data into orientation classes with size of 1 • , thus maintaining the atomic resolution desired for coherent X-ray imaging. In this paper, the biomolecules are assumed to be rigid during the injection and measuring time. Other than their orientations and relative positions in the X-ray beam, the biomolecules are also assumed to be identical, i.e. no shot-to-shot variation of the structure or conformation is considered. Different conformations of the molecules will result in different diffraction patterns. The small-angle scattering data and the resulting radial and azimuthal distributions will encode information on different confirmations [28]. The classification scheme proposed in this paper will hence distinguish different classes to different conformations. Whether a certain class belongs to an out-of-plane orientation of the reference molecule in the same conformation or an orientation class of the molecule in a different conformation can be addressed in the next step of the classification. One possible method for such purpose is the manifold method [22]. A general heterogeneous sample may not be solely classified by means of the low-angle scattering data, large-angle-scattering may have to be additionally included in the classification algorithm. This is a research direction we are currently pursuing.
A common technique for solving the orientation problem in cryo-EM is the common line approach [24][25][26]. Because of the extremely low signal-to-noise level, this method is not applicable in current x-ray diffraction experiments. Current classification work on X-ray diffraction data [21,22] is primarily focused on the direct 2D diffraction patterns. We propose to compress the diffraction data by radial and azimuthal integrals, so that the orientation information can