Supervised Classification Methods for Flash X-ray single particle diffraction Imaging

Current Flash X-ray single-particle diffraction Imaging (FXI) experiments, which operate on modern X-ray Free Electron Lasers (XFELs), can record millions of interpretable diffraction patterns from individual biomolecules per day. Due to the stochastic nature of the XFELs, those patterns will to a varying degree include scatterings from contaminated samples. Also, the heterogeneity of the sample biomolecules is unavoidable and complicates data processing. Reducing the data volumes and selecting high-quality single-molecule patterns are therefore critical steps in the experimental set-up. In this paper, we present two supervised template-based learning methods for classifying FXI patterns. Our Eigen-Image and Log-Likelihood classifier can find the best-matched template for a single-molecule pattern within a few milliseconds. It is also straightforward to parallelize them so as to fully match the XFEL repetition rate, thereby enabling processing at site.


Introduction
Modern X-ray Free Electron Laser (XFEL) technology has provided the opportunity for exploring biological structures from individual biological particles, rather than relying on crystallization-based technologies. It is therefore potentially possible to investigate biomolecules or biological processes that are intrinsically dynamic. XFELs produce X-ray pulses shorter than 50 femtosecond (fs), which are 10 9 times more brilliant than the radiation produced in conventional synchrotrons. The ultra-short and extremely bright X-ray pulses outrun the radiation damage and allow the recording of sufficiently strong and interpretable 2-dimensional (2D) diffraction patterns from single biological particles [6,19]. This principle is called diffract-and-destroy and has been shown to be successful for particles as large as small cells, and down to viruses smaller than 50 nanometers (nm) [8,11,22,25].
Another feature of XFELs is their high repetition rates. The Linac Coherent Light Source (LCLS) [4] operates at 120 Hz and can produce over 400,000 diffraction patterns per hour, i.e., more than 1.6 TB per hour or 38 TB per day. The massive volume of data makes manual classification of diffraction patterns impractical. The challenge is much more severe in the newest facility -the European XFEL [21], which operates at up to 27,000 Hz and can store more than 3 million images per hour. Ideally, all these images would originate from one single biomolecule per exposure. However, the detector also records diffracted signals from multiple scatterers such as particle clusters, buffer impurities, and contaminant materials as discussed in [8,11].
In order to assemble the 2D diffraction patterns into 3D structures, it is essential that data frames are classified and that diffraction patterns originating from contaminants and multiple molecules are sorted out. In 2014, a real-time rejection method [1] was proposed to select diffraction patterns by thresholding and using Time-of-Flight spectroscopy. Previous sorting algorithms was based on support vector machines and on spectral clustering techniques [3,13].
In this paper, we develop two template-based classification methods for particle selection -the Eigen-Image (EI) and the Log-Likelihood (LL) method. Both methods assess the similarity between template diffraction patterns and incoming patterns by analyzing eigenvector projections and log-likelihood function, respectively. In §2, we briefly describe a typical Flash X-ray single-particle diffraction Imaging (FXI) experiment. Next, we introduce the EI and the LL method for classification in §3. Following data descriptions in §4, we perform numerical experiments to evaluate the sharpness of our classification methods in §5. A concluding discussion is found in §6.

Flash X-ray single particle diffraction Imaging (FXI)
For a typical FXI experiment, the diffraction data acquired is depicted graphically in Figure 1. A stream of biological molecules is injected into the X-ray interaction region, where sample particles interact with incoming coherent X-ray pulses, resulting in a collection of diffraction patterns on the detector. This procedure is a stochastic process as the interactions between particles and X-ray pulses occur at random. Firstly, the number of particles at the interaction point is unobserved, i.e., we may obtain blank frames with only background noises, single-particle patterns, multiple-particles patterns, and frames with signals from contaminants. Secondly, the current FXI technology cannot monitor the orientations of particles, and therefore extra steps are necessary to recover the 3D structure from single-particle frames. Last but not least, the strengths of the diffraction signals vary a lot, mainly due to the stochastic nature of the XFELs and the different locations of particles in the interaction region, respectively. The relative strength of the diffraction signal is referred to as photon fluence, and we denote it by φ.
Typical FXI setups use discrete digital detectors, and therefore the captured frames are also discrete. Further, some pixel counts near the center are inaccessible or overflow as a result of physical limitations and arrangements of the detector.

Classification Methods
Template-based methods for classifying diffraction patterns allow identifying the class of an unlabeled diffraction pattern by searching for its best-matched template. For such methods, the collection of templates is referred to as the training dataset, an unlabeled pattern is called a testing image, and the classification procedure is referred to as the classifier. In this section, we discuss two classifiers -the Eigen-Image (EI) [5,14,18,20,24,26] and the Log-Likelihood (LL) classifier [2,7,17], to classify a testing diffraction pattern relying on the training dataset. were frames from multiple particles or with contaminants. (e) was a single-particle frame from an icosahedral-shape virus with a relatively strong signal. This is the most interesting pattern and can be used for assembling 3D structure in later steps. All diffraction patterns are displayed in logarithmic scale.
3.1. Eigen-Image (EI) Classifier. The EI method has two steps -the training and the classification step. In the training step, we train our EI classifier by projecting the training dataset to its eigenvectors. In the classification step, we label a testing image by minimizing the distance between the eigenvector projections of the testing image and the training dataset.
be the training dataset, consisting of M data frames. Since the detector is discrete, we denote the kth pattern by To train an EI classifier, we first transfer the training dataset T into the image space A by the shift whereT is the pixel average of the training dataset, Practically, the covariance matrix of A (AA T ) is too large to decompose into eigenvectors. Therefore we factorize the matrix A T A by instead, where Λ is the main diagonal matrix, whose diagonal elements are the corresponding eigenvalues, and V is the matrix of eigenvectors of A T A. We can now compute the eigenvectors of the covariance matrix AA T by and U is sometimes also referred to as eigenfaces [18,24]. The eigenvector projection matrix of the image space A is defined as follows: Using U and Ω, we can now classify a testing diffraction pattern P = (P i ) , by minimizing the euclidean distance between its eigenvector projection matrix W and Ω, where 3.2. Log-Likelihood (LL) Classifier. The LL Classifier attempts to classify a testing image by maximizing the log-likelihood function of a given probability density function. Since the photon counting procedure is assumed to obey the Poisson distribution, we can write the joint likelihood function as follows: where φ is the photon fluence (relative signal strength), and can be estimated by The joint log-likelihood function L for the LL classifier is therefore We can now classify the testing image P by simply maximizing the joint loglikelihood function in (10): For classifying multiple testing images, the EI method computes U and Ω only once, and hence less computations are needed compared to the LL method.

Data Description
In this section we describe our training and testing datasets. Since most viruses have either helical or icosahedral capsid structure [10,15], we used regular uniformdensity icosahedrons to generate diffraction patterns via Condor [12]. For our simulations, we used a setup similar to the beam profile of the FXI mimivirus experiment [23]. More specifically, we used 1 nm X-ray pulses, with a peak energy of 1 mJ. We also assumed that the X-ray pulses had a circular focus of 10 µm in diameter. Further, the distance between the detector and the interaction region was 0.74 meters, and the detector itself was 960 × 960 pixels, or 72 × 72 µm 2 . Finally, a circular missing-data area of 80 pixels in diameter was set to zero.
To assess our classifiers systematically, we gradually increased the complexity of the testing dataset. With five synthetic testing datasets, we mimicked diffraction patterns of particles with noise, different fluences, and of various sizes and shapes. We also evaluated our methods for the actual mimivirus FXI data [9]. Figure 2 illustrates two noisy icosahedral diffraction patterns at particle sizes 180 nm and 200 nm in the same particle orientation, and one spheroid diffraction pattern at size 180 nm. 4.1. Homogeneous Datasets. We first simulated diffraction patterns from a regular icosahedron of 180 nm in diameter. The training dataset T had 290 frames, and the Euclidean distances between two arbitrary patterns were larger or equal than 220. The first testing dataset D was a noiseless homogeneous dataset, which contained M data = 1000 noiseless icosahedral diffraction patterns. The first 290 frames were from the training dataset T and were used as benchmarks. The rest 810 frames were random-orientation patterns from the same icosahedron.
Since the photon counting procedure is assumed to follow the Poisson distribution, we added Poissonian noise to D for our noisy dataset P , By scaling P with different fluences, we obtained our last homogeneous testing dataset -the scaled noisy dataset F by where Φ k was uniformly and randomly chosen between 0.01 to 1.1, 4.2. Heterogeneous Particle Sizes. Considering the potential size variation of viruses, we generated our testing dataset S (M data = 2000) from uniform-density icosahedrons with randomly and uniformly chosen diameters between 150 nm and 210 nm (∼ U{150, 210}). Similar to F , all patterns in S were Poissonian with random fluences according to (14).

4.3.
Heterogeneous Particle Shapes. To mimic heterogeneous particle shapes, the synthetic testing dataset X contained diffraction patterns from both icosahedrons and spheroids. The diameters of the objects varied from 150nm to 210nm, with changing fluences Φ k ∼ U{0.01, 1.1}. Further, the shapes of the spheroids were also changing, as the aspect ratios of the spheroids (the ratio of the length of the minor axis to the length of the major one) were varying between 0.6 and 1. In total, the dataset X contained M data = 1200 frames -200 spheroidal patterns and 1000 icosahedral patterns randomly selected from S.

Mimivirus Dataset.
To be relevant to real FXI experiments, we also classified the mimivirus dataset [9,23]. To classify this dataset, we generated a new training dataset with the corresponding experimental beam profile [23], and the training dataset contained 1000 random-orientation frames of a 490 nm icosahedron.
To summarize, Table 1 lists the primary parameters of all datasets. Only synthetic regular icosahedral patterns were included in these datasets. c U is the uniform distribution. d X contained 1000 random icosahedral frames from S and 200 spheroidal patterns with aspect ratio ∼ U{0.6, 1}. e Patterns were as used in [9,23] and of icosahedral shape.

Experiments
We now perform numerical experiments to investigate the efficiency and accuracy of our EI and LL classifiers. For saving memory and execution time without losing much accuracy in the classification, only the central 480 × 480 pixels were used in computations, and they were binned into 120 × 120 pixels, i.e., every 4 × 4 pixels were averaged into one pixel.

Error Metrics.
To compare the classification results, we define the following error metrics.
be the best-matched pattern of Γ k from the training dataset. The classification error of Γ k with respect to R is now defined as follows: whereΦ k is the estimated fluence, Further, U k (R, s) is an interpolation (or extrapolation) method that resizes the pattern R s times and returns a scaled image at the same size as Γ k . Note that, U (R, s) = R for our homogeneous testing datasets D, P and F , andΦ k = 1 for the first two. Similarly, we define the fluence error by where Φ k is the true fluence used to generate Γ k .

5.2.
Homogeneous Patterns. We first tested our classifiers on the homogeneous synthetic datasets -the noiseless dataset D, the noisy dataset P , and the scaled-Poisson dataset F , as listed in Table 1.
Since the first 290 images of the three testing datasets were modifications of the training dataset, we used them as benchmarks, and compared their average classification error with the error from the whole dataset. As listed in Table 2, we observed that both classifiers matched all benchmark frames successfully with classification errors around 0, 0.03 and 0.04 on datasets D, P and F , respectively. However, for the whole datasets, the EI classifier performed slightly better than the LL classifier, obtaining about 1% less classification error. The fluence error, as defined in (17), of the dataset F from the EI classifier was 0.035, comparing with 0.049 from the LL classifier. Further, the EI classifier was more efficient, and took only 3.7 ms per image in a single-core Matlab implementation, nearly 15 times faster than the LL classifier. 5.3. Heterogeneous Particle Sizes. We next classified the dataset S, which contained patterns from icosahedrons of diameter 150 nm to 210 nm. Figure 3 illustrates the average classification and fluence errors for the EI and the LL classifiers. As expected, both classifiers obtained the smallest errors when the particle size of the testing pattern was similar to the template size (180 nm), and the LL classifier had slightly larger errors on average. Furthermore, the EI classifier was better at estimating particle sizes, as shown in Figure 3(c), and this also implied that the EI classifier was more accurate in searching for the best-matched template than the LL classifier was.   (15) and (17), respectively. Both classifiers obtained the smallest errors around the template size (180 nm). (c): The absolute errors in estimating sizes from the EI (blue triangle) and the LL (red star ) classifier. On average we obtained a minimum error of 1 nm around 180 nm, and a maximum error of 4 nm.
The size and the fluence estimation procedures together took around 80 ms for each image, i.e., around 1.5 times longer than the LL classifier or 22 times longer than the EI classifier. In other words, with size estimation, the EI classifier can handle 12 images per minute and the LL classifier can perform 8 images per minute, as using a single-core Matlab implementation. With 16 cores, it is therefore possible to speed up both classifiers to the LCLS repetition rate (120 Hz). Using a larger numbers of GPUs and/or CPU cores, we may also parallelize them to match the European XFEL rate (2,700 Hz).

Heterogeneous Shapes.
Since the EI classifier performed better than the LL classifier in the previous experiments, we investigated its performance for the dataset X, which contained particles with heterogeneous shapes and sizes. For identifying the spheroids in X, we added a 180 nm sphere diffraction pattern into the training dataset, and retrained the EI classifier. The new EI classifier distinguished the icosahedral and spheroidal diffraction patterns successfully, as listed in Table 3. All icosahedral diffraction patterns were classified as icosahedron with a smaller classification error (< 0.25). With a threshold of 0.5, the retrained classifier rejected 78 elongated spheroidal patterns, and identified 114 spheroidal frames as spheroids successfully. However, 8 (4%) frames were misclassified as icosahedron, and their classification errors were between 0.42 and 0.5, see Figure 4(b). Table 3. Classification results of dataset X. The threshold of the classification error for rejection was set to 0.5. We visually illustrate the classification results in Figure 4. With a classificationerror threshold of 0.25, the EI classifier identified all the icosahedral diffraction patterns and 86 roundish spheroidal patterns. With a threshold of 0.5, the classifier rejected 78 elongated spheroidal patterns, and most aspect ratios of these rejected patterns were less than 0.75, see Figure 4(b). As expected, we also observed that the classification error decreased with increasing aspect ratio. 5.5. Mimivirus diffraction patterns. We finally tested our EI classifier on the mimivirus FXI dataset, which has been used for 3D mimivirus reconstruction [9]. To compensate detector saturation at the image center and low signal-to-noise ratio at the edges of the patterns, we used the central part of the diffraction patterns for classification, see Figure 5. The training dataset for the mimivirus dataset contained 1000 randomly-oriented icosahedral patterns of 490 nm in diameter. Furthermore, we binned 4 × 4 pixels into one pixel in the calculations.
As expected, we obtained larger classification errors, see Figure 6, and this is due to the heterogeneity in size and shape of the mimiviruses. We used 0.5 as a threshold to detect irregular patterns. In total, 9.1% of patterns (18 patterns) were rejected. We also validated our classification results to the 3D Fourier intensity by looking at the correlation between the classification errors and the sum of the largest 0.035% rotational probabilities of each diffraction pattern in Figure 6(b). We assembled the 3D Fourier intensity by the Expansion Maximization Compression (EMC) method with an assumption of scaled Poisson noise [16]. As expected, the sum of the rotational probabilities increased with decreasing classification error. However, we did not get a linear correlation, most likely due to the fact that  All icosahedral patterns were located in the perfectly matched region, and all elongated spheroidal patterns (78 patterns) were rejected. For the rest 122 accepted spheroidal patterns, 114 were successfully classified as spheroids, and 8 frames, or 4% of the spheroidal patterns were misclassifed. (b): The relationship between the C-error and the aspect ratios for the spheroidal patterns. The red stars were misclassified patterns. The aspect ratio was the ratio of the length of the minor axis to the length of the main axis of the spheroidal particle. [(c)-(g)]: Five combination images, corresponding to the five data points (red circles) in (a). The left half of each image was from the testing dataset X and the right half was the best-matched patterns from the training dataset. The number in each figure was the classification error.  Figure 5. A mimivirus diffraction pattern (a) and its central region used for classification (b). The region shown in (b) was the region between two circles in (a).
the mimiviruses samples were not regular uniform-density icosahedrons, and had different particle sizes. For example, in [ Figure 6(c)- Figure 6(e)], the templates and the mimivirus patterns matched quite well, however, the particle in Figure 6(f) was slightly elongated, and the particle size of Figure 6(g) was 457 nm, which was 33 nm smaller than the template size (490 nm). Five combination images at the data points (red circles) in (a). The left part of each image was from the mimivirus dataset, and the right part was the corresponding template scaled by the recovered fluence. (f) was a slightly elongated pattern and the particle size of (g) was smaller than the template size.

Conclusions
The FXI technique holds the promise of obtaining biomolecule structures from single particles. It operates at a high repetition rate and records thousands of millions of diffraction data every day. The stochastic nature of XFELs and the heterogeneity of the sample molecules make the recorded dataset too complex and massive to classify manually. By using our knowledge of the sample molecules, such as sizes and shapes, we can use template-based methods to reduce the complexity of the classification problem.
To find the best-matched pattern, we have presented and tested two supervised learning methods -the Eigen-Image (EI) and the Log-Likelihood classifier. In our straight-forward Matlab implementations, both methods can classify a testing pattern in a few milliseconds, and they certainly can be accelerated to the XFEL repetition rates, albeit using considerable resources. We also observed that the rotational probabilities from the 3D assembling procedure, increased with decreasing classification error. This suggests that the selected patterns from our classifiers will fit better into a 3D Fourier intensity, resulting in a potentially high-resolution 3D electron density of the sample molecules.
Newer facilities, such as the European XFEL, operate at high repetition rates and will create massive volumes of FXI diffraction data with heterogeneities to varying degrees. With our methods, we can use most of our knowledge of the sample molecules to reduce data storage and automatically select homogeneous single-particle patterns. We also foresee that an on-site FXI analysis pipeline, which connects our classifier to the 3D reconstruction procedure, can solve the 3D structure with sub-nanometer resolution in the near future.