Femtosecond free-electron laser x-ray diffraction data sets for algorithm development

Abstract: We describe femtosecond X-ray diffraction data sets of viruses and nanoparticles collected at the Linac Coherent Light Source. The data establish the first large, freely available benchmark data sets for coherent diffraction methods, intended to bolster the development of the algorithms needed to turn this novel approach into a useful imaging technique. Applications include 2D reconstructions, orientation classification and, finally, 3D imaging by assembling 2D patterns into a 3D diffraction volume.


Introduction
X-ray free-electron lasers (XFELs) exceed the peak brilliance of conventional synchrotrons by almost a factor of 10 billion. It has been proposed that radiation damage, which limits the high-resolution imaging of soft condensed matter, can be "outrun" by using ultrafast and extremely intense X-ray pulses that pass the sample before the onset of significant radiation damage [1]. Thus, one of the most promising scientific applications of XFELs is subnanometer-resolution imaging of biological objects, including cells, viruses, macromolecular assemblies, and nanocrystals. The concept of "diffraction-before-destruction" has been demonstrated recently at the Linac Coherent Light Source (LCLS) [2], the first operational hard X-ray FEL, for protein micro- and nanocrystals [3] and single mimivirus particles [4]. Since the enabling technologies (the XFEL itself [2], sample injection [5,6], and fast-framing integrating X-ray detectors [7]) are recent developments, it is of paramount importance to understand their capacity and limitations in delivering the data sets required for reliable subnanometer three-dimensional bio-imaging. Algorithms that assemble 2D diffraction data of randomly oriented molecules into a 3D volume require a highly homogeneous data set and a detailed understanding of measurement errors. Thus, sorting and orientation/classification algorithms need to be further developed to handle current experimental conditions and identify suitable data sets. Because beam time at XFELs is scarce, the testing of algorithms has to date been largely restricted to simulated data.
Here, we report on femtosecond coherent diffraction imaging experiments on model systems that differ in size, symmetry, and complexity. Data were collected from Paramecium bursaria Chlorella virus (PBCV-1), bacteriophage T4, and nanorice, an ellipsoidal (~250 × 50 nm) iron oxide nanoparticle which serves as a strongly scattering morphological analogue of the T4 tail. The experiments were carried out at the LCLS at the Atomic, Molecular and Optical Science (AMO) beamline [8] in the CFEL-ASG Multi-purpose (CAMP) instrument [7], in a similar manner as described previously [4] and detailed below. Aerosolized particles were injected into the FEL interaction region using an aerodynamic lens stack [5] and diffraction patterns were collected on two pairs of pnCCD detectors [7] (Fig. 1A).
Fig. 1. (a) Aerosolized particles are injected into the FEL beam in random, unknown orientations using an aerodynamic lens stack [5]. The diffraction patterns are captured with a set of pnCCD detectors [7]. (b-g) Diffraction patterns of different samples and byproducts of the injection. In addition to single particles, solvent droplets, sample aggregates or multiple particles are also recorded. Shown are diffraction patterns of a large aggregate (b), a water droplet (c), single T4 phage particles (d, e), a nanorice grain (f), and two nanorice grains (g).
The diffraction data were deposited in the Coherent X-ray Imaging Data Bank (www.CXIDB.org). By making the data publicly available, we provide a rich resource for testing the performance of existing algorithms on real data from known samples, which should help drive the development of improved or new algorithms and identify the next steps toward fulfilling the potential of 3D bio-imaging at XFELs.

Samples
Nanorice (SiO₂-coated Fe₂O₃ ellipsoids) was purchased from Corpuscular, Inc. (Cold Spring Harbor, NY, USA). It was supplied as a 50% ethanol suspension at a concentration of 6.25 × 10¹² particles per ml. Transmission electron microscopy (TEM) (Fig. 2) demonstrated some variability in the size and shape of the individual nanorice grains. Prior to injection into the FEL beam, the sample was sonicated for 5 minutes to break down larger clusters of particles. T4 and PBCV-1 [9] were purified as described previously. T4 was dialyzed against 50 mM ammonium acetate pH 7.2 long before the measurements, whereas buffer exchange for PBCV-1 was done just prior to the experiments. Measurements of the virus preparations via NanoTracking Analysis with a Nanosight LM10-HS sizing system (Schaefer Technologie, Langen, Germany) demonstrated a largely monodisperse distribution of particles in solution, with approximately 90% having the expected sizes for both T4 (mean 147 nm, FWHM of 42.2 nm) and PBCV-1 (mean 194 nm, FWHM of 82.2 nm); about 10% of the particles had a larger size corresponding to a clear dimer peak. Samples were analyzed by TEM (Fig. 2) using a Zeiss EM912 microscope running at 120 kV and equipped with a 1024 × 1024 pixel GATAN CCD camera. PBCV-1 and T4 samples were negatively stained using 1-2% uranyl acetate.

Experimental setup/injection
The experiments were carried out at the LCLS at the AMO beamline [8] in the CAMP instrument [7] using 150 fs X-ray pulses with photon energies of 1.2 keV and 2 keV and pulse energies of 3.2 mJ and 2.7 mJ, respectively. Particles were injected as a continuous aerosol beam into the 10 µm² X-ray focus using an aerodynamic lens stack [5]. Diffraction patterns were collected on two pairs of pnCCD detectors [7] that were read out at the FEL repetition rate of 60 Hz. The front detector was initially placed at a distance of ~150 mm and, in the course of the beam time, moved to the furthest possible downstream position, assumed to be ~250 mm [7]; it was shielded by a 3 µm thick polyimide filter to prevent contamination. The back detector was placed 738 mm downstream of the interaction region. Front and back detectors were operated such that a 1 eV photon corresponds to 0.01031 analog-to-digital units (ADU).
Purified T4 and PBCV-1 samples (~5 × 10¹¹ particles per ml) were transferred into a volatile buffer (50 mM ammonium acetate pH 7.2) and the suspension was aerosolized with a gas dynamic virtual nozzle [6] or a commercial nebulizer (Burgener Mira Mist CE nebulizer, AHF Analysentechnik, Tübingen, Germany) using constant liquid flow rates of 10-20 µl/min (from a Shimadzu HPLC system) and a nitrogen gas pressure of 5 to 7 bar. In the latter case, the droplet distribution was polydisperse with an estimated size range from several hundred nanometers to several micrometers. During the injection, much of the surrounding volatile buffer evaporates, although this seemed somewhat sample dependent. The divergence of the particle beam exiting the aerodynamic lens stack was 0.57°, which results in a particle beam diameter of approximately 440 µm at the X-ray interaction region (22 mm from the lens stack exit). A "brim" of particles at a much lower concentration that formed a 2 mm diameter halo was also observed.
Previous measurements of spherical particles (98-190 nm diameter) indicate that particle speed inside the vacuum is on the order of 100 m/s [10]. Neglecting the halo, and assuming that only single particles form the aerosol and that the transmission through the system is 100%, the probability of having a particle in the interaction region can be approximated. With a concentration of 5 × 10¹¹ particles per ml, as for our virus samples, and a typical flow rate of 15 µl/min, 1.25 × 10⁸ particles are transferred into vacuum per second. At a speed of 100 m/s, the particles fly across the 3 µm FEL focus in 30 ns. Because the particle beam (440 µm diameter) is much wider than the FEL focus, only a fraction of the injected particles (0.7%) cross the focus. Multiplying these three factors gives a maximal achievable hit rate of approximately
$$1.25 \times 10^{8}\ \mathrm{s^{-1}} \times 30\ \mathrm{ns} \times 0.007 \approx 2.6\%\ \text{per pulse}.$$
Big droplets quite often contain more than one particle. During the evaporation process these particles agglomerate into clusters. Additionally, biological samples can already aggregate in solution, in particular at the high concentrations required for efficient data collection. Both of these effects reduce the probability of a good hit and lead to diffraction patterns of bigger clusters that have to be identified in the analysis process.
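The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sketch of the idealized model stated in the text (single particles, 100% transmission, no halo); the function name and its arguments are illustrative and do not come from any analysis package.

```python
def max_hit_probability(conc_per_ml, flow_ul_per_min, speed_m_per_s,
                        focus_um, particle_beam_um):
    """Upper bound on the chance that one X-ray pulse meets a particle."""
    # Particles delivered into vacuum per second (1 ul = 1e-3 ml).
    particles_per_s = conc_per_ml * flow_ul_per_min * 1e-3 / 60.0
    # Time one particle needs to fly across the X-ray focus.
    transit_s = focus_um * 1e-6 / speed_m_per_s
    # Fraction of the particle beam width that overlaps the focus.
    overlap = focus_um / particle_beam_um
    return particles_per_s * transit_s * overlap


# Values quoted in the text: 5e11 particles/ml, 15 ul/min, 100 m/s,
# 3 um focus, 440 um particle beam.
p = max_hit_probability(5e11, 15.0, 100.0, 3.0, 440.0)
print(f"{p:.3f}")  # prints 0.026
```

At the 60 Hz repetition rate this upper bound corresponds to roughly 1.5 hits per second; aggregation, incomplete transmission and the halo all push the real rate lower.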

Pre-processing of diffraction images
Images were pre-processed to remove the following artifacts that are a property of the experimental system (Barty unpublished):
(i) The pnCCD detectors have multiple readout channels in order to achieve in excess of 120 Hz readout speed, and the offsets on these readout channels drift slowly over time. Such fluctuations are corrected by periodically collecting data with no photons ('dark frames') and subtracting these from the raw data.
(ii) The parallel readout results in gain fluctuations between different portions of the CCD. These are corrected for by using flat-field measurements obtained using a ⁵⁵Fe source.
(iii) Incoherent scattering from beamline optics and carrier gas in the aerodynamic lens is estimated from non-hit images that occurred before and after each hit. Each time a non-hit is identified, it is written to a buffer, replacing the oldest frame stored in the buffer, and the background is recalculated. The recalculation of the estimated background accounts for fluctuations in background during the experiment. The median of the last 50 non-hits is subtracted from the signal in the hits as incoherent background. A median is used because it is less sensitive to accumulating signal from any extremely weak hits that fall below threshold. This approach additionally subtracts any remaining drifts in CCD readout offsets over time.
(iv) Bad pixels are identified on the fly as those pixels that remain above a threshold of 140 ADU in more than 80% of sequential frames, and are set to zero.
Two detector artifacts are not corrected in the pre-processing: (i) charge spill to neighboring pixels at high intensities, and (ii) independent shot-by-shot fluctuations of the offset of each readout channel (referred to as common-mode noise).
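The rolling median background of item (iii) can be sketched as follows. This is a minimal illustration using NumPy, not the pre-processing code that produced the deposited data; the class name `RunningMedianBackground` and its methods are hypothetical, and the median is recomputed from scratch on every update for clarity rather than speed.

```python
from collections import deque

import numpy as np


class RunningMedianBackground:
    """Median of the most recent non-hit frames, used as background."""

    def __init__(self, depth=50):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=depth)
        self.background = None

    def add_non_hit(self, frame):
        """Store a non-hit frame and recalculate the background."""
        self.frames.append(np.asarray(frame, dtype=float))
        self.background = np.median(np.stack(self.frames), axis=0)

    def subtract(self, frame):
        """Return the frame with the current background removed."""
        frame = np.asarray(frame, dtype=float)
        if self.background is None:
            return frame
        return frame - self.background
```

The pixel-wise median is what makes the estimate robust: a single bright frame that slipped below the hit threshold changes the median far less than it would change a mean.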

Results and discussion
Particles entering the interaction volume (defined as focal area of the FEL beam times the longitudinal diameter of the particle beam) are intercepted randomly by the LCLS pulses. The hit rate depends on sample concentration and the overlap between the particle and the X-ray beams. Of ~5 million data frames collected during the experiment, 0.7% were classified as 'hits'. Hits were identified based on scattering strength by counting the number of pixels containing values above a predetermined analog-to-digital unit (ADU) threshold applied after background subtraction (more than 500 pixels above 170 ADU, excluding bad pixels, were required to register as a 'hit'). This simple threshold approach is biased in favor of false positives, yet still rejects over 99% of all frames. 'Hits' comprise diffraction patterns of single particles, of multiple particle clusters, of water drops, buffer aggregates, other false positives, detector glitches and hits too weak to be of use (Fig. 1B). Sample carry-over introduced at the aerosol source or from accumulated material within the aerodynamic lens stack commonly produces contamination from one run to the next. For most studies, only single particle hits of a single species are of interest, although correlation analysis can be used on multiple particle clusters [11,12].
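The threshold criterion described above amounts to counting lit pixels. The sketch below assumes a background-subtracted frame in ADU and an optional boolean bad-pixel mask; the function name `is_hit` and its signature are illustrative only, while the default thresholds are the ones quoted in the text.

```python
import numpy as np


def is_hit(frame, bad_pixel_mask=None, adu_threshold=170.0, min_pixels=500):
    """Return True if more than `min_pixels` good pixels exceed the threshold."""
    lit = np.asarray(frame) > adu_threshold
    if bad_pixel_mask is not None:
        # Bad pixels never count towards the hit criterion.
        lit &= ~np.asarray(bad_pixel_mask, dtype=bool)
    return int(lit.sum()) > min_pixels
```

Because the test looks only at total scattering strength, it cannot tell a single particle from an aggregate or a water droplet; that distinction is left to the later sizing and classification steps.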
Particles were first sized based on dimensions of the particle autocorrelation, rejecting most of the outliers (Barty unpublished). Further distinction between samples such as nanorice, mimivirus and spherical water droplets was achieved by means of statistical learning methods operating on diffraction data with reduced dimensionality. Meaningful dimensions were selected by principal component analysis and an analysis of the diffraction patterns' rotational symmetry and speckle size. Details will be published elsewhere. In addition, diffraction patterns have been classified in an unsupervised manner [13]. Lists of individual frames belonging to single classes of samples have also been included in the CXIDB deposition. These lists are not guaranteed to be perfect; however data for T4 and nanorice have been examined and edited manually. The pre-processed data (see Material and Methods for details) were deposited in the CXIDB for use with minimal post-processing, allowing development and testing of algorithms involved at different stages of the sorting and 3D-structure determination process. Raw data files are available on request.
Expected size distributions were not observed for all samples analyzed. The size of a particle can be determined from the inverse Fourier transform of its diffraction pattern (resulting in the autocorrelation of the object) if the geometry of the experiment is known. The maximum spatial frequency that is recorded on the detector is determined by the wavelength $\lambda$ of the experiment and the maximum diffraction angle $\vartheta_{\max}$ as
$$q_{\max} = \frac{2}{\lambda}\,\sin\frac{\vartheta_{\max}}{2} \approx \frac{N\,p_\Delta}{2\,\lambda\,Z_d}$$
for small scattering angles, where $N$ is the number of pixels in the detector, $p_\Delta$ is the size of a pixel of the detector, and $Z_d$ is the distance between detector and particle. According to the Nyquist theorem, the corresponding real space sampling is given as
$$x_\Delta = \frac{1}{2\,q_{\max}} = \frac{\lambda\,Z_d}{N\,p_\Delta}.$$
A bounded object of diameter $D$ will have an autocorrelation that is bounded to a diameter $2D$. Thus, one can determine the size of a spherical particle from the radius of its autocorrelation and the real space sampling interval $x_\Delta$. Particles can be intercepted at different positions along the beam propagation axis $z$ in the interaction volume as well as at different locations in the approximately Gaussian-shaped intensity profile of the FEL focus. This will result in shot-to-shot variations of total diffracted intensity as well as a radial scaling of the diffracted intensities that is approximately linear with change in $z$-position. Due to the finite extent of the interaction volume along the direction of the X-ray beam, diffraction patterns will show a distribution of $Z_d$ values leading to an apparent distribution in particle size whose center corresponds to the actual size and whose width is comparable to the width of the anticipated $Z_d$ distribution. This effect explains the rather narrow size distribution observed e.g. in size-selected polystyrene latex spheres (Loh unpublished) and mimivirus, which was identified as a member of a single class by classification algorithms [13] and showed the expected size of the strongly scattering capsid [4,14].
We find this to be true for most samples except for small biological objects, such as enterobacteria phage T4. Figure 3 shows the size distribution of a subset of diffraction patterns from T4 phages determined by applying a threshold to the autocorrelation. The size distribution is broad and the average size is too large (mean 330 nm for T4; head diameter ~90 nm, tail length ~100 nm). The width of the distribution far exceeds the anticipated spread of < 0.1% in Z_d values. (A particle hit 1 cm away from the nominal interaction region would experience an apparent change in size of 14%.) This size increase may be caused by nonvolatile components such as protein fragments or salts, or by a residual shell of ammonium acetate buffer that has not completely evaporated during aerosolization and injection and seems to have concentrated during the evaporation process, surrounding small particles. The reduced contrast at 1.2 keV between the partially evaporated, concentrated buffer and the virus leads to apparent particle sizes that are too large on average, with the size distribution reflecting that of the initial aerosol rather than actual particle size. Aerosol generation and surface tension may lead to a preferred aerosol droplet size coinciding with single large particles such as mimivirus [4,14], a cluster of several smaller ones in the case of highly concentrated solutions, or a single smaller one in a solvent droplet. Thus, single smaller viruses are likely affected more by solvent shells and the consequential apparent change in size. In the future, the interplay of particle size, buffer composition and the aerosolizing process has to be adjusted carefully for each sample to work around this effect. Volatile buffers with a higher vapor pressure than ammonium acetate such as 4-methylmorpholine might be more suitable and are well tolerated by T4 and PBCV-1; 4-methylmorpholine also has the advantage of lowering the overall vapor pressure of the solution [15].
Fig. 3. Size distribution of a subset of recorded diffraction patterns of phage T4 determined by thresholding the autocorrelation. The apparent average size (~330 nm) is about 1.5-3 times larger than the actual particle size (~100 nm for the strongly diffracting head to ~200 nm for head and tail). The width of the distribution far exceeds the anticipated spread in sample-to-detector distance values (< 0.1%).
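The autocorrelation-based sizing discussed in this section can be sketched in a few lines. The sampling interval is taken as wavelength times detector distance divided by the detector width (number of pixels times pixel size), which is the usual small-angle relation. The 1024 pixels of 75 µm used in the example are assumptions about the pnCCD geometry that are not stated in this text, and the 5% relative threshold is an arbitrary illustrative choice. Real patterns also have a missing central region and noise, which this sketch ignores.

```python
import numpy as np

H_C_EV_NM = 1239.84  # photon energy (eV) times wavelength (nm)


def real_space_sampling_nm(photon_energy_ev, distance_mm, n_pixels, pixel_um):
    """Real-space sampling interval: lambda * Z_d / (N * pixel size)."""
    wavelength_nm = H_C_EV_NM / photon_energy_ev
    return wavelength_nm * (distance_mm * 1e-3) / (n_pixels * pixel_um * 1e-6)


def size_from_autocorrelation(pattern, dx_nm, rel_threshold=0.05):
    """Estimate the particle diameter from a centered intensity pattern."""
    # The inverse FT of the intensities is the object's autocorrelation.
    autocorr = np.abs(np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(pattern))))
    mask = autocorr > rel_threshold * autocorr.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    extent_px = max(rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1)
    # The autocorrelation is twice as wide as the object.
    return 0.5 * extent_px * dx_nm


# Back detector at 738 mm, 1.2 keV photons; pixel count and size are assumed.
dx = real_space_sampling_nm(1200.0, 738.0, 1024, 75.0)
print(f"{dx:.1f} nm")  # prints 9.9 nm
```

Under these assumed detector parameters the full-period resolution limit, twice the sampling interval, is about 20 nm.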
The inhomogeneity in the data collected on the T4 sample prevents a 3D reconstruction using the 2D patterns. Nevertheless, the quality of these single-shot diffraction patterns has proved sufficient for two-dimensional imaging via phase retrieval. Numerous diffraction patterns from this experiment have been successfully phased [14]. Some examples of reconstructed images of the T4 bacteriophage are shown in Fig. 4. Reconstructions were performed with the Relaxed-Alternating-Averaged-Reflections algorithm [16]. The only constraint in the object plane was the support constraint. Each reconstruction in Fig. 4 is an average of 10 reconstructions from random starting phases, each of 2000 iterations. In the region where data were not measured, the initial wave function for each reconstruction was set to the Fourier transform of the initial support, which is intended to improve the rate of convergence. The Shrinkwrap algorithm [17] was used to refine the support during the reconstruction. The full-period resolution of these reconstructed images lies in the range 20-40 nm. Phasing is more reliable when the effective size of the central hole relative to the imaged object is such that the number of missing speckles is small, facilitating the unambiguous reconstruction of the missing low-frequency data in the iterative process. Under our experimental conditions, this is true for objects up to 500 nm in diameter. The classification of diffraction patterns may identify those that can be phased with similar initial parameters, opening up the possibility of automated phasing, an important development for processing large volumes of FEL data.
Nanorice provided the first three-dimensional reconstruction of a non-crystalline object from randomly oriented, continuous diffraction patterns captured with an X-ray laser [18,19] using an expectation-maximization algorithm [20]. The study was based on 56 diffraction patterns obtained at the Free-electron Laser in Hamburg, FLASH, using a wavelength of 7 nm [18]. The reconstruction was challenging due to missing spatial frequencies caused by saturation effects of the detector and background scatter [19]. The nanorice data presented here do not suffer from these limitations (Fig. 5), sample the 3D diffraction space in a finer manner, and extend to higher resolution, which will allow different algorithms for 3D reconstructions to be tested. Around 1000 useful nanorice diffraction patterns have been found by combining hit identification, statistical learning and manual selection. A list is provided at CXIDB.
Fig. 5. A selection of nanorice diffraction patterns. They show individual nanorice particles in a variety of orientations (sideways at various angles and some more head-on), exposed to different X-ray fluences.
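For readers who want to experiment with the deposited 2D patterns, the support-constrained RAAR iteration mentioned above can be sketched as follows. This is a minimal textbook-style version, not the code used for Fig. 4: it keeps a fixed support (no Shrinkwrap refinement), starts from fully random phases rather than seeding the unmeasured region from the support, and does not average several runs. All function names are illustrative, and `amplitude` is assumed to be the square root of the measured intensity in unshifted FFT layout.

```python
import numpy as np


def project_modulus(x, amplitude, measured):
    """Impose measured Fourier amplitudes; keep current values elsewhere."""
    X = np.fft.fft2(x)
    phase = np.exp(1j * np.angle(X))
    return np.fft.ifft2(np.where(measured, amplitude * phase, X))


def raar_step(x, amplitude, support, measured, beta=0.9):
    """One RAAR update: beta * (x + P_S R_M x) + (1 - 2 beta) * P_M x."""
    pm = project_modulus(x, amplitude, measured)
    ps_rm = support * (2.0 * pm - x)  # support projection of the reflection
    return beta * (x + ps_rm) + (1.0 - 2.0 * beta) * pm


def raar(amplitude, support, measured=None, beta=0.9, n_iter=200, seed=0):
    """Run RAAR from random phases and return the support-limited estimate."""
    rng = np.random.default_rng(seed)
    if measured is None:
        measured = np.ones(amplitude.shape, dtype=bool)
    phases = rng.uniform(0.0, 2.0 * np.pi, amplitude.shape)
    x = np.fft.ifft2(amplitude * np.exp(1j * phases))
    for _ in range(n_iter):
        x = raar_step(x, amplitude, support, measured, beta)
    return support * project_modulus(x, amplitude, measured)
```

The `measured` mask marks the pixels where data exist; setting it to False over the central hole and detector gaps lets those Fourier values float freely, which is why a small number of missing speckles matters for convergence.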

Conclusion
We present the experimental details, some caveats, and the recording of large volumes of high quality diffraction data from femtosecond XFEL pulses together with the successful sorting of data and the phasing of single-shot diffraction patterns obtained from single-particle-imaging experiments at the LCLS using model systems. The data is deposited in the CXIDB database where it is available for download and further analysis. The data has already been used to test a spectral clustering based unsupervised method able to sort experimental snapshots without