cisTEM: User-friendly software for single-particle image processing

We have developed new open-source software called cisTEM (computational imaging system for transmission electron microscopy) for the processing of data for high-resolution electron cryo-microscopy and single-particle averaging. cisTEM features a graphical user interface that is used to submit jobs, monitor their progress, and display results. It implements a full processing pipeline including movie processing, image defocus determination, automatic particle picking, 2D classification, ab-initio 3D map generation from random parameters, 3D classification, and high-resolution refinement and reconstruction. Some of these steps implement newly-developed algorithms; others were adapted from previously published algorithms. The software is optimized to enable processing of typical datasets (2000 micrographs, 200k - 300k particles) on a high-end, CPU-based workstation in half a day or less, comparable to GPU-accelerated processing. Jobs can also be scheduled on large computer clusters using flexible run profiles that can be adapted for most computing environments. cisTEM is available for download from cistem.org.


Introduction 22
The three-dimensional (3D) visualization of biological macromolecules and their assemblies by 23 single-particle electron cryo-microscopy (cryo-EM) has become a prominent approach in the 24 study of molecular mechanisms (Cheng et al., 2015;Subramaniam et al., 2016). Recent advances 25 have been primarily due to the introduction of direct electron detectors (McMullan et al., 2016). 26 With the improved data quality, there is increasing demand for advanced computational 27 algorithms to extract signal from the noisy image data and reconstruct 3D density maps from 28 them at the highest possible resolution. The promise of near-atomic resolution (3 -4 Å), where 29 densities can be interpreted reliably with atomic models, has been realized by many software 30 tools and suites (Frank et al., 1996;Hohn et al., 2007;Lyumkis et al., 2013;Punjani et al., 2017;31 Scheres, 2012; Tang et al., 2007;van Heel et al., 1996). Many of these tools implement a 32 standard set of image processing steps that are now routinely performed in a single particle 33 project. These typically include movie frame alignment, contrast transfer function (CTF) 34 determination, particle picking, two-dimensional (2D) classification, 3D reconstruction, 35 refinement and classification, and sharpening of the final reconstructions. 36 We have written new software called cisTEM to implement a complete image processing 37 pipeline for single-particle cryo-EM, including all these steps, accessible through an easy-to-use 38 7 The background noise spectrum of the micrograph is estimated by computing the average 133 rotational power spectrum of 50 areas devoid of particles, and is then used to "whiten" the 134 background (shot + solvent) noise of the micrograph. Normalization, including CTF effects, and 135 matched filtering are then performed as described (Sigworth, 2004), except using a single 136 reference image and no principal components' decomposition. When particles are very densely 137 packed on micrographs, this approach can significantly over-estimate the background noise 138 power so that users may find they have to use lower thresholds for picking. It might also be 139 expected that under those circumstances, micrographs with much lower particle density will 140 suffer from a higher rate of false-positive picks. 141 One difficulty in estimating the background noise spectrum of the micrograph is to locate areas 142 devoid of particles without a priori knowledge of their locations. Our algorithm first computes a 143 map of the local variance and local mean in the micrograph (computed over the area defined by 144 the maximum radius given by the user (Roseman, 2004;van Heel, 1982)) and the distribution of 145 values of these mean and variance maps. The average radial power spectrum of the 50 areas of 146 the micrograph with the lowest local variance is then used as an estimate of the background noise 147 spectrum. Optionally, the user can set a different number of areas to be used for this estimate (for 148 example if the density of particles is very high or very low) or use areas with local variances 149 closest to the mode of the distribution of variances, which may also be expected to be devoid of 150 particles. 151 Matched-filter methods are susceptible to picking high-contrast features such as contaminating 152 ice crystals or carbon films. (Sigworth, 2004) suggests subtracting matched references from the 153 extracted boxes and examining the remainder in order to discriminate between real particles and 154 false positives. In the interest of performance, we decided instead to pick using a single artificial 155 8 reference (disk) and to forgo such subtraction approaches. To avoid picking these kinds of 156 artifacts, the user can choose to ignore areas with abnormal local variance or local mean. We find 157 that ignoring high-variance areas often helps avoid edges of problematic objects, e.g. ice crystals 158 or carbon foils, and that avoiding high-and low-mean areas helps avoid picking from areas 159 within them, e.g. the carbon foil itself or within an ice crystal (Figure 3). The thresholds used are 160 set to + 2 for the variance and ± 2 for the mean, where is the mode 161 (i.e. the most-commonly-occurring value) and the full width at half-maximum of the 162 distribution of the relevant statistic. For micrographs with additional phase plate phase shifts 163 between 0.1 and 0.9 π, where much higher contrast is expected, the variance threshold is 164 increased to + 8 . We have found that in favorable cases many erroneous picks can 165 be avoided. Remaining false-positive picks are removed later during 2D classification. 166 Because of our emphasis on performance, our algorithm can be run nearly instantaneously on a 167 typical ~4K image, using a single processor. In the Action panel, the user is presented with an 168 "Auto preview" mode to enable interactive adjustment of the picking parameters ( Figure 3). In 169 this mode, the micrograph is displayed with optional and adjustable low-pass and high-pass 170 filters, and the results of picking using the currently selected parameters are overlaid on top. 171 Changing one or more of the parameters leads to a fast re-picking of the displayed micrograph, 172 so that the parameters can be optimized in real-time. Once the three main parameters have been 173 adjusted appropriately, the full complement of input micrographs can be picked, usually in a few 174 seconds or minutes. 175 A possible disadvantage of using a single disk template exists when the particles to be picked are 176 non-uniform in size or shape (e.g. in the case of an elongated particle). In this case, it may be 177 expected that a single template would have difficulty in picking all the different types and views 9 of particles present, and that in this case using a number of different templates would lead to a 179 more accurate picking. In practice, we found that with careful optimization of the parameters, 180 elongated particles and particles with size variation (Figure 3) were picked adequately. 181 The underlying implementation of the algorithm supports multiple references as well as 182 reference rotation. These features may be exposed to the graphical user interface in future 183 versions, for example enabling the use of 2D class averages as picking templates (Scheres, 184 2015). 185 186 2D classification 187 2D classification is a relatively quick and robust way to assess the quality of a single-particle 188 dataset. cisTEM implements a maximum likelihood algorithm (Scheres et al., 2005) and 189 generates fully CTF-corrected class averages that typically display clear high-resolution detail, 190 such as secondary structure. Integration of the likelihood function is done by evaluating the 191 function at defined angular steps that are calculated according to 192

= /
(1) 193 where is the resolution limit of the data and is the diameter of the particle (twice the mask 194 radius that is applied to the iteratively-refined class averages). cisTEM runs a user-defined 195 number of iterations defaulting to 20. To speed up convergence, the resolution limit is adjusted 196 as a function of iteration cycle (0 ≤ < ): 197 where and ℎ are user-defined resolution limits at the first and last iteration, 199 defaulting to 40 Å and 8 Å, respectively. The user also sets , the number of classes to calculate. 200 Depending on this number and the number of particles in the dataset, only a percentage of 201 the particles are included in the calculation. These particles are randomly reselected for each 202 iteration and is typically small, for example 0.1, in the first 10 iterations ( 0−9 ), then increases 203 to 0.3 for iteration 10 to 14 ( 10−14 ) and finishes with five iterations including all data ( 15−19 ): 204 , ⁄ 300 ⁄ < 1 1, 300 ⁄ ≥ 1 205 For example, for a dataset containing = 100,000 particles, 0−9 = 0.15, i.e. 15% of the data 208 will be used to obtain = 50 classes. Apart from speeding up the calculation, the stepwise 209 increase of the resolution limit and the random selection of subsets of the data also reduce the 210 chance of overfitting (see also the calculation of ab-initio 3D reconstructions and 3D refinement 211 below) and, therefore, increase the convergence radius of the 2D classification algorithm. 212 For the calculation of the likelihood function, the particle images X are noise-whitened by 213 dividing their Fourier transforms ℱ{X } by the square root of the radially average noise power 214 spectrum, : 215 where is the 2D reciprocal space coordinate and = | | its magnitude. The noise power 217 spectrum is calculated from the boxed particle images using the area outside the circular mask 218 11 set by the user according to the expected particle size. To increase accuracy, it is further 219 averaged across 2000 randomly selected particles. The background (density outside the mask) is 220 further normalized by adding a constant to each particle that yields a background average of 221 zero. 222 Finally, at the beginning of each iteration, noise features in the class averages A are suppressed 223 by resetting negative values below a threshold to the threshold: 224 where runs over all pixels in average A . 226 227 3D refinement (FrealignX) 228 The refinement of 3D reconstructions in cisTEM uses a version of Frealign (Lyumkis et al.,229 2013) that was specifically designed to work with cisTEM. Most of Frealign's control 230 parameters are exposed to the user in the "Manual Refine" Action panel ( Figure 4). The "Auto 231 Refine" and "Ab-Initio" panels also use Frealign but manage many of the parameters 232 automatically (see below). Frealign's algorithm was described previously (Grigorieff, 2007;233 Lyumkis et al., 2013) and this section will mostly cover important differences, including a new 234 objective function used in the refinement, different particle weighting used in reconstructions, 235 optional likelihood-based blurring, as well as new masking options. 236 Matched filter To make Frealign compatible with cisTEM's GUI, the code was completely 237 rewritten in C++, and it will be referred to here as Frealign v10, or FrealignX. The new version 238 makes use of a matched filter (McDonough and Whalen, 1995) to maximize the signal in cross 12 correlation maps calculated between particle images and reference projections. This requires 240 whitening of the noise present in the images and resolution-dependent scaling of the reference 241 projections to match the signal in the noise-whitened images. Both can be achieved if the spectral 242 signal-to-noise ratio (SSNR) of the data is known. As part of a 3D reconstruction, Frealign 243 calculates the resolution-dependent , the radially averaged SSNR present in the particle 244 images before they are affected by the CTF (Sindelar and Grigorieff, 2012). Using and 245 the CTF determined for a particle, the SSNR in the particle image can be calculated as 246 (as before, is the 2D reciprocal space coordinate and = | |). Here, is defined as the 248 ratio of the variance of the signal and the noise. The Fourier transform ℱ{X } of the noise-249 whitened particle image X can then be calculated as 250 where ℱ{X } is the Fourier transform of the original image X , | • | is the absolute value, and 252 |ℱ{X }| 2 is the radially averaged spectrum of the squared 2D Fourier transform amplitudes of 253 image X . To implement Eq. (7), a particle image is first divided by its amplitude spectrum, 254 which includes power from both signal and noise, and then multiplied by a term that amplifies where is the standard deviation of the noise in the particle image and Θ represents a set of 277 model parameters including the average particle positions in a dataset ̅ and ̅, and the standard 278 14 deviations of the x,y positions from the average values, and , and is the number of pixels 279 in the mask applied to the particle before alignment. The complete objective function is therefore 280 The maximized values determined in a refinement are converted to particle scores by 282 multiplication with 100. 283 CTF refinement FrealignX can refine the defocus assigned to each particle. Given typical 284 imaging conditions with current instrumentation (300 kV, direct electron detector), this may be 285 useful when particles have a size of about 400 kDa or larger. Depending on the quality of the 286 sample and images, these particles may generate sufficient signal to yield per-particle defocus 287 values that are more accurate than the average defocus values determined for whole micrographs 288 by CTFFIND4 (see above). Refinement is achieved by a simple one-dimensional grid search of a 289 defocus offset applied to both defocus values determined in the 2D CTF fit obtained by 290 CTFFIND4. FrealignX applies this offset to the starting values in a refinement, typically 291 determined by CTFFIND4, and evaluates the objective function, Eq. (11), for each offset. The 292 offset yielding the maximum is then used to assign refined defocus values. In a typical 293 refinement, the defocus offset is searched in steps of 50 Å, in a range of ± 500 Å. In the case of 294 β-galactosidase (see below), a single round of defocus refinement changed the defocus on 295 average by 60 Å; the RMS change was 80 Å. For this refinement, the resolution for the signed 296 cross terms equaled the overall refinement resolution limit (3.1 Å), i.e. no unsigned cross terms 297 were used. The refinement produced a marginal improvement of 0.05 Å in the Fourier Shell 298 Correlation (FSC) threshold of 0.143, suggesting that the defocus values determined by 299 CTFFIND4 were already close to optimal. In a different dataset of rotavirus double-layer 300 particles, a single round of defocus refinement changed the defocus on average by 160 Å; the 301 RMS change was 220 Å. In this case, the refinement increased the resolution from ~3.0 Å to 302 ~2.8 Å. 303 Masking FrealignX has a 3D masking function to help in the refinement of structures that 304 contain significant disordered regions, such as micelles in detergent-solubilized membrane 305 proteins. To apply a 3D mask, the user supplies a 3D volume that contains positive and negative 306 values. cisTEM will binarize this volume by zeroing all voxels with values less than or equal to 307 zero, and setting all other voxels to 1, indicating the region of the volume that is inside the mask. 308 A soft cosine-shaped falloff of specified width (e.g. 10 Å) is then applied to soften the edge of 309 the masked region and avoid sharp edges when the mask is applied to a 3D reconstruction. The 310 region of the reconstruction outside the mask can be set to zero (simple multiplication of the 311 mask volume), or to a low-pass filtered version of the original density, optionally downweighted 312 by multiplication by a scaling factor set by the user. At the edge of the mask, the low-pass 313 filtered density is blended with the unfiltered density inside the mask to produce a smooth 314 transition. Figure 5 shows the result of masking the reconstruction of an ABC transporter 315 associated with antigen processing (TAP, (Oldham et al., 2016)). The mask was designed to 316 contain only density corresponding to protein and the outside density was low-pass filtered at 30 317 Å resolution and kept with a weight of 100% in the final masked reconstruction. The 318 combination of masking and low-pass filtering in this case keeps a low-pass filtered version of 319 the density outside the mask in the reconstruction, including the detergent micelle. Detergent 320 micelles can be a source of noise in the particle images because the density represents disordered 321 material. However, at low, 20 to 30 Å resolution, micelles generate features in the images that 322 can help in the alignment of the particles. In the case of TAP, this masking prevented noise 323 16 overfitting in the detergent micelle and helped obtain a reconstruction at 4 Å resolution (Oldham 324 et al., 2016). 325

3D reconstruction
where is the probability of particle belonging to class , is the standard deviation of the 329 noise in particle image , are its alignment parameters, the score-based weights (Eq. (14), 330 see below), the CTF of the particle image, ℛ( ,• ) the reconstruction operator merging 331 data into a 3D volume according to alignment parameters , the radially averaged 332 particle SSNR derived from the FSC between half-maps (Sindelar and Grigorieff, 2012), X 333 noise-whitened image , and ℱ −1 { • } the inverse Fourier transform. For the calculation of the 3D 334 reconstructions, as well as 3D classification (see below) the particle images are not whitened 335 according to Eq. (7). Instead, they are whitened using the radially-and particle-averaged power 336 spectrum of the background around the particles: 337 where (X ) is a masked version of image X with the area inside a circular mask centered on the 339 particle replaced with the average values at the edge of the mask, and scaled variance to produce 340 an average pixel variance of 1 in the whitened image X . Using the procedure in Eq. (13) has the 341 advantage that whitening does not depend on the knowledge of the SSNR of the data, and 342 reconstructions can therefore be calculated even when the SSNR is not known. 343 Score-based weighting In previous versions of Frealign, resolution-dependent weighting was 344 applied to the particle images during reconstruction (the Frealign parameter was called "PBC", 345 (Grigorieff, 2007)). The weighting function took the form of a B-factor dependent exponential 346 that attenuates the image data at higher resolution. FrealignX still uses B-factor weighting but the 347 weighting function is now derived from the particle scores (see above) as 348 (14) 349 converts the difference between a particle score and ̅̅̅̅̅̅̅, the score average, into a B-350 factor. Setting to zero will turn off score-based particle weighting. Scores typically vary by 351 about 10, and values for that produce reasonable discrimination between high-scoring and 352 low-scoring particles are between 2 and 10 Å 2 , resulting in B-factor differences between particles 353 of 20 to 100 Å 2 . 354 3D Classification FrealignX uses a maximum-likelihood approach for 3D classification 355 (Lyumkis et al., 2013). Assuming that all images were noise-whitened according to Eq. (13), 356 which scales the variance of each image such that the average standard deviation of the noise in a 357 pixel is 1, the probability density function (PDF) of observing image X , given alignment 358 parameters and reconstruction V , is calculated as (Lyumkis et al., 2013) 359 As before, are the alignment parameters (usually just Euler angles and x,y shifts) determined 361 for image with respect to class average , ℘ is the projection operator producing an aligned 2D 362 projection of reconstruction V according to parameters , ‖X − • ℘(V , )‖2 is the 363 sum of the squared pixel value differences between whitened image X and the reference 364 18 projection inside a circular mask defining the area of the particle with user-defined diameter, ̃ 365 is the number of pixels inside this mask, and ( |Θ ) is a hierarchical prior describing the 366 probability of observing alignment parameters given model parameters Θ (see Eq. (10)). 367 Eq. (15) does not include marginalization over alignment parameters. Marginalization could be 368 added to improve classification when particle alignments suffer from significant errors. 369 However, this is currently not implemented in cisTEM. Given the joint probability, Eq. (15), 370 determined in a refinement, the probability of particle belonging to class can be updated 371 where the summation in the denominator is taken over all classes and the average probabilities 374 for a particle to belong to class are given by the average values of determined in a prior 375 iteration, calculated for the entire dataset of particles: 376 An example of 3D classification is shown in Figure 6 for F1FO-ATPase, revealing different 378 conformational states of the γ subunit (Zhou et al., 2015). 379 Focused classification 3D classification can be improved by focusing on conformationally-or 380 compositionally-variable regions of the map. To achieve this, a mask is applied to the particle 381 images and reference projections, the area of which is defined as the projection of a sphere with 382 user-specified center (within the 3D reconstruction) and radius. This 2D mask is therefore 383 defined independently for each particle, as a function of its orientation. When using focused 384 classification, ̃ in Eq. (15) is adjusted to the number of pixels inside the projected mask and the 385 sum of the squared pixel value differences in Eq. (15) is limited to the area of the 2D mask. By 386 applying the same mask to image and reference, only variability inside the masked region is used 387 for 3D classification. Other regions of the map are ignored, leading to a "focusing" on the region 388 of interest. The focused mask also excludes noise contained in the particle images outside the 389 mask and therefore improves classification results that often depend on detecting small 390 differences between particles and references. where, in the case of FrealignX, only includes the x,y particle positions and in-plane 400 rotation angle , which are a subset of the alignment parameters , and V −1 is the 401 reconstruction from an earlier refinement iteration. As before, Γ(X | , V −1 ) is the probability 402 of observing image , given alignment parameters and reconstruction −1 . Integration over 403 these three parameters can be efficiently implemented and, therefore, does not produce a 404 significant additional computational burden. 405 Resolution assessment The resolution of reconstructions generated by FrealignX is assessed 406 using the FSC criterion (Harauz and van Heel, 1986) using the 0.143 threshold (Rosenthal and 407 Henderson, 2003). FSC curves in cisTEM are calculated using two reconstructions ("half-maps") 408 calculated either from the even-numbered and odd-numbered particles, or by dividing the dataset 409 into 100 equal subsets and using the even-and odd-numbered subsets to calculate the two 410 reconstructions (in the cisTEM GUI, the latter is always used). The latter method has the 411 advantage that accidental duplication of particles in a stack is less likely to affect the FSC 412 calculation. All particles are refined against a single reference and, therefore, the calculated FSC 413 values may be biased towards higher values (Grigorieff, 2000;Stewart and Grigorieff, 2004). 414 This bias extends slightly beyond the resolution limit imposed during refinement, by 415 approximately 2⁄ , where is the mask radius used to mask the reconstructions (see 416 above). During auto-refinement (see below), the resolution limit imposed during refinement is 417 carefully adjusted to stay well below the estimated resolution of the reconstruction and the 418 resolution estimate is therefore unbiased (Scheres and Chen, 2012). However, users have full 419 control over all parameters during manual refinement and will have to make sure that they do not 420 bias the resolution estimate by choosing a resolution limit that is close to, or higher than, the 421 estimated resolution of the final reconstruction. Calculated FSC curves are smoothed using a 422 Savitzky-Golay cubic polynomial that reduces the noise often affecting FSC curves at the high-423 resolution end. 424 The FSC calculated between two density maps is dependent on the amount of solvent included 425 inside the mask applied to the maps. A larger mask that includes more solvent background will 426 yield lower FSC values than a tighter mask. To obtain an accurate resolution estimate in the 427 region of the particle density, one possibility is to apply a tight mask that closely follows the 428 21 boundary of the particle. This approach bears the risk of generating artifacts because the particle 429 boundary is not always well defined, especially when the particle includes disordered domains 430 that generate weak density in the reconstruction. The approach in Frealign avoids tight masking 431 and instead calculates an FSC curve using generously masked density maps, corrected for the 432 solvent content inside the mask (Sindelar and Grigorieff, 2012 where is the ratio of mask volume to estimated particle volume. The particle volume can be 437 estimated from its molecular mass as (20) 445 Speed optimization FrealignX has been optimized for execution on multiple CPU cores. Apart 446 from using optimized library functions for FFT calculation and vector multiplication (Intel Math 447 Kernel Library), the processing speed is also increased by on-the-fly cropping in real and 448 reciprocal space of particle images and 3D reference maps. Real-space cropping reduces the 449 22 interpolation accuracy in reciprocal space and is therefore limited to global parameter searches 450 that do not require the highest accuracy in the calculation of search projections. Reciprocal-space 451 cropping is used whenever a resolution limit is specified by the user or in an automated 452 refinement (ab-initio 3D reconstruction and auto-refinement). For the calculation of in-plane 453 rotated references, reciprocal-space padding is used to increase the image size four-fold, 454 allowing fast nearest-neighbor resampling in real space with sufficient accuracy to produce 455 rotated images with high fidelity. 456 457

Ab-initio 3D reconstruction 458
Ab-initio reconstruction offers a convenient way to proceed from single particle images to a 3D 459 structure when a suitable reference is not available to initialize 3D reconstruction and refinement. The global alignment parameters are performed using the "general" FrealignX procedure with 485 the following changes. Firstly, the is not directly estimated from the FSC calculated at 486 each round. Instead, for the first 3 iterations, a default is calculated based on the 487 molecular weight. From the 4 th iteration onwards, the is calculated from the FSC, 488 however if the calculated is higher than the default , the default is taken 489 instead. This is done in order to avoid some of the overfitting that will occur during refinement. 490 Secondly, during a normal global search the top ℎ (where ℎ defaults to 20) results of the grid 491 search are locally refined, and the best locally refined result is taken. In the ab-initio procedure, 492 however, the result of the global search for a given particle image is taken randomly from all 493 24 results that have a score which lies in the top 15% of the difference between the worst score and 494 the best score. 495 During the reconstruction steps, the calculated for each particle is reset to 1 prior to 3D 496 reconstruction and score weighting is disabled. This is done because the and score values are 497 not meaningful until an approximately correct solution is obtained. 498 The reconstructions are automatically masked before each new refinement iteration to suppress 499 noise features that could otherwise be amplified in subsequent iterations. The same masking 500 procedure is also applied during auto-refinement (see below). It starts by calculating the density 501 average ̅ of the reconstruction and resetting all voxel values below ̅ to ̅ . This thresholded 502 reconstruction is then low-pass filtered at 50 Å resolution and turned into a binary mask by 503 setting densities equal or below a given threshold to zero and all others to 1. The threshold is 504 calculated as 505 where ̅ is the density average of the low-pass filtered map and ̅ max _500 is the average of 507 the 500 highest values in the filtered map. The largest contiguous volume in this binarized map is 508 then identified and used as a mask for the original thresholded reconstruction, such that all 509 voxels outside of this mask will be set to ̅ . Finally, a spherical mask, centered in the 510 reconstruction box, is applied by resetting all densities outside the mask to zero. 511 The user has the option to repeat the ab-initio procedure multiple times using the result from the 512 previous run as the starting map in each new run, to increase the convergence radius if necessary. 513 In the case of symmetric particles, the default behavior is to perform the first 2/3 rds of the 514 iterations without applying symmetry. The non-symmetrized map is then aligned to the expected 515 25 symmetry axes and the final 1/3 rd of the iterations are carried out with the symmetry applied. 516 This default behavior can be changed by the user such that symmetry is always applied, or is 517 never applied. 518 Alignment of the model to the symmetry axes is achieved using the following process. A brute 519 force grid search over rotations around the x, y and z axes is set up. At each position on the grid 520 the 3D map is rotated using the current x, y and z parameters, and then its projection along Euler 521 angle (0, 0 ,0) is calculated. All of the symmetry-related projections are then also calculated, and 522 for each one a cross-correlation map is calculated using the original projection as a reference, 523 and the peak within this map is found. The sum of all peaks from all symmetry related directions 524 is taken and the x,y,z rotation that most closely aligns the original 3D map along the symmetry 525 axes should provide the highest peak sum. To improve robustness, this process is repeated for 526 two additional angles (-45, -45, -45 and 15, 70, -15) that were chosen with the aim of including 527 different-looking areas when the map to be aligned is unusual in some way. The x,y,z rotation 528 that results in the largest sum of all peaks, over all three angles, is taken as the final rotation 529 result. Shifts for this rotation are then calculated based on the found 2D x,y shifts between the 530 initial and symmetry related projections, with the z shift being set to 0 for C symmetries. The 531 symmetry alignment is also included as a command-line program, which can be used to align a 532 volume to the symmetry axis when the ab-initio is carried out in C1 only, or when using a 533 reference obtained by some other means. 534 535 Automatic refinement 536 Like ab-initio 3D reconstruction, auto-refinement makes use of randomly selected subsets of the 537 data and of an increasing resolution limit as refinement proceeds. However, unlike the ab-initio 538 procedure, the percentage of particles and the resolution limit used in iteration depend on 539 the resolution of the reconstructions estimated on iteration − 1. When the estimated resolution 540 improved in the previous cycle, 541 where is the number of 3D classes to be calculated and the number of particles in the 545 dataset. As before, if the particle has symmetry, is replaced by where is the 546 number of asymmetric units present in one particle. If the calculated exceeds 1, it is reset to 1. 547 The resolution limit is estimated as 548 where 0.5 is the point at which the FSC, unadjusted for the solvent within the mask (see 550 above) crosses the 0.5 threshold and is the user-specified diameter of the spherical mask 551 applied to the 3D reference at the beginning of each iteration, and to the half-maps used to 552 calculate the FSC. The term 2⁄ accounts for correlations between the two half-maps due 553 to the masking (see above). When the resolution did not improve in the previous iteration, 554 = 1.5 −1 (26) 555 (reset to 1 if resulting in a value larger than 1). At least five refinement iterations are run and 556 refinement stops when reaches 1 (all particles are included) and there was no improvement in 557 the estimated resolution for the last three iterations. 558 If multiple classes are refined, the resolution limit in Eq. (25) is set independently for each class, 559 however the highest resolution used for classification is fixed at 8 Å. At least nine iterations are 560 run and refinement stops when reaches 1, the average change in the particle occupancy in the 561 last cycle was 1% or less, and there was no improvement in the estimated resolution for the last 562 three iterations. 563 In a similar manner to the ab-initio procedure, values for each particle are set to 1 and score 564 weighting is turned off. This is done until the refinement resolution is better than 7 Å, at which 565 point it is assumed the model is of a reasonable quality. 566 567 Map sharpening 568 Most single-particle reconstructions require some degree of sharpening that is usually achieved 569 by applying a negative B-factor to the map. cisTEM includes a map sharpening tool that allows 570 the application of an arbitrary B-factor. Additionally, maps can be sharpened by whitening the 571 power spectrum of the reconstruction beyond a user-specified resolution (the default is 8 Å). The 572 whitening amplifies terms at higher resolution similar to a negative B-factor but avoids the over-573 amplification at the high-resolution end of the spectrum that sometimes occurs with the B-factor 574 method due to its exponential behavior. Whitening is applied after masking of the map, either 575 with a hollow spherical mask of defined inner and outer radius, or with a user-defined mask 576 supplied as a separate 3D volume. The masking removes background noise and makes the 577 28 whitening of the particle density more accurate. Both methods can be combined in cisTEM, 578 together with a resolution limit imposed on the final reconstruction. The whitened and B-factor-579 sharpened map can optionally be filtered with a figure-of-merit curve calculated using the FSC 580 which are also the main repository of processing results, such as movie frame alignments, image 591 defocus measurements and particle alignment parameters. As processing strategies become more 592 complex and the number of users new to cryo-EM grows, the demands on the GUI increase in 593 the quest for obtaining the best possible results. Useful GUI functions include guided user input 594 (so-called wizards) that adjust to specific situations, graphical presentation of relevant results, 595 user interaction with running processes to allow early intervention and make adjustments, tools 596 to manipulate data (e.g. masking), implementation of higher-level procedures that combine more 597 primitive processing steps to achieve specific goals, and a global searchable database that keeps 598 track of all processing steps and result. While some of these functions can be or have been 599 implemented in wrapper GUIs, the lack of control of these GUIs over the data and processes 600 29 makes a reliable implementation more difficult. For example, keeping track of results from 601 multiple processing steps, some of them perhaps repeated with different parameters or run many 602 times during an iterative refinement, can become challenging if each step produces a separate 603 results file. Communicating with running processes via files can be slow and is sometimes 604 unreliable due to file system caching. Communication via files may complicate the 605 implementation of higher-level procedures, which rely on the parsing of results from the more 606 primitive processing steps. 607 The cisTEM GUI is more than a wrapper as it implements some of the new algorithms in the 608 processing pipeline directly, adjusting the input of running jobs as the refinement proceeds. It 609 enables more complex data processing strategies by tracking all results in a single searchable 610 database. All processing steps are run and controlled by the GUI, which communicates with 611 master and slave processes through TCP/IP. cisTEM uses an SQL database, similar to Appion 612 (Lander et al., 2009), to store all results (except image files), offers input functions that guide the 613 user or set appropriate defaults, and implements more complex procedures to automate 614 processing where possible. cisTEM's design is flexible to allow execution in many different 615 environments, including single workstations, multiple networked workstations and large 616 computer clusters. 617 User input and the display of results is organized into different panels that make up cisTEM's 618 GUI, each panel dedicated to specific processing steps (for examples, see Figure 1, 3, 4). This 619 design guides users through a standard workflow that most single particle projects follow: movie 620 alignment, CTF determination, particle picking, 2D classification, 3D reconstruction, refinement 621 and classification, and sharpening of the final reconstructions. Three types of panels exist, 622 dealing with Assets, Actions and Results. Assets are mostly data that can be used in processing 623 30 steps called Actions. They include Movies, Images, Particle Positions and Volumes. One type of 624 Asset, a Refinement Package, defines the data and parameters necessary to carry out refinement 625 of a 3D structure (or a set of structures if 3D classification is done), it contains a particle stack, as 626 well as information about the sample (e.g. particle size and molecular weight) along with 627 parameters for each particle (e.g. orientations and defocus values). Actions comprise the above 628 mentioned workflow steps, with additional options for ab-initio 3D reconstruction, as well as 629 automatic and manual 3D refinement to enable users to obtain the best possible results from their 630 data. The results of most of these Actions are stored in the database and can be viewed in the 631 related Results panels, which display relevant data necessary to evaluate the success of each 632 processing step. The option to sort and select results by a number of different metrics is available 633 in the movie alignment and CTF estimation Results panels. For example images can be sorted / 634 selected based on the CTF fit resolution (Rohou and Grigorieff, 2015). Movie alignment, 3D 635 refinement and reconstruction also produce new Image and Volume Assets, respectively. 636 Importing or generating new Assets is accomplished by wizards that solicit the necessary user 637 input and perform checks were possible to avoid nonsensical input. In the more complex case of 638 creating a new Refinement Package Asset, the wizard allows the specification of input data, for 639 example based on particle picking results or the selection of 2D and 3D classes. Once an Action 640 has been launched, results are displayed as soon as they become available, together with an 641 overall progress bar, giving users an estimate of how long a processing step will take and of 642 whether the results are as expected. If desired, an Action can be aborted and restarted with a 643 different set of parameters, or the Action can be run again after regular termination to test 644 different parameters. In the latter case, all prior results remain accessible and users can specify 645 those to be used for the next step in the workflow. This provides users with the flexibility to pick 646 31 and choose the best results in cases where different parts of a dataset require different settings to 647 yield optimal results. 648

649
Parallelization 650 cisTEM uses a home-grown scheme to accelerate processing in parallel environments. Image 651 processing of single-particle data is an embarrassingly parallel problem, i.e. the parallelization of 652 most tasks can be achieved simply by dividing the data to be processed into smaller chunks that 653 are each processed by a separate program thread, without the need for inter-process 654 communication. Only certain steps require merging of data, such as the calculation of a 3D 655 reconstruction from the entire dataset. cisTEM parallelizes processing steps by running multiple 656 instances of the same program, each dealing with a subset of the data, then directly 657 communicating with the launched processes over TCP/IP sockets. This enables the cisTEM GUI 658 to distribute jobs and receive results in real time. Communication is directed through a 659 "manager" process, which enables jobs to be run on a cluster, while the GUI itself can run on a 660 local workstation. 661 How each copy of the program is run is specified by the user by setting up a "Run Profile". This 662 profile is a user defined command, or script that will be run to launch the job, and is designed to 663 be flexible to enable the user to set up parallelization in many different environments. For 664 example, users can design profiles to run on multiple machines via SSH, or to submit to a cluster 665 (e.g. using qsub) etc., or even merge the two in a single profile. One disadvantage of this system 666 is that it may be difficult to create profiles for clusters that require many jobs to be submitted 667 using one command. 668 32 Another advantage of using a home-grown scheme over existing schemes (e.g. MPI) occurs 669 when jobs are run on a multi-node computing cluster. In this case, jobs will complete even if the 670 full number of requested processors is not available. For example, if a user requests 300 CPUs 671 for a processing step but only 100 are available, cisTEM launches 300 jobs of which 200 will 672 remain in the job scheduler's queue. Data processing starts immediately with the 100 jobs that 673 are allowed to run and will complete even if the remaining jobs never run. In such a scenario, an 674 MPI-based job could only run when 300 CPUs become available, potentially delaying execution. 675 In the few cases were a step requires merging of an entire dataset, for example in a 3D Pleasanton, CA) and stored locally as tif files using LZW compression (the conversion to tiff and 689 compression was performed by mrc2tif (Mastronarde and Held, 2017)). The pixel size of the 690 super-resolution frames was 0.3185 Å, and images were binned to a pixel size of 0.75 Å after 691 33 movie processing. For 2D classification and ab-initio 3D reconstruction, particles were boxed 692 using 384 x 384 pixel boxes. For auto-and manual refinement, the particles were re-boxed into 693 648 x 648 pixel boxes (boxing is part of the creation of Refinement Packages, see above). For all 694 processing steps, a Dell Precision T7910 workstation containing two E5-2699 v4 Xeon 695 processors with a total of 44 cores was used. Processing parameters were left on default settings 696 except for CTF determination, which was performed at 3.5 Å resolution using the movie 697 averages instead of the frames, and particle picking, which used optimized parameters based on 698 previewing a few selected images (Figure 3). The resolution limit during refinement (auto, 699 manual and CTF) never exceeded 3.1 Å. The data were stored on a local SSD Raid 0 disk for fast 700 access. The implementation of a complete image processing workflow in cisTEM offers users a uniform 709 experience and guarantees smooth transitions between processing steps. It also helps developers 710 maintain the software as all the tools and algorithms are developed in-house. 711 The main focus of cisTEM is on the processing of single-particle cryo-EM data and high-712 resolution 3D reconstruction. Future releases of cisTEM may include on-the-fly processing of 713 data as it is collected, particle-based movie alignment, support for helical particles, improved 3D 714 masking tools, more reliable resolution and quality indicators, as well as miscellaneous tools 715 such as the determination of the detective quantum efficiency of electron detectors. 716 Since cisTEM does not rely on third-party libraries, such as Python, MPI or CUDA, that usually 717 have to be installed and compiled separately on the target system, ready-to-run binaries can be 718 made available for download that are optimized for different architectures. Using the wxWidgets 719 library also means that cisTEM can be compiled for different operating systems, including 720 Linux, Windows and OSX. Using a configure script, different options for the fast Fourier 721 transforms (FFTs) can be specified, including the FFTW (http://www.fftw.org) and Intel MKL 722 background within that mask (Part_FSC, Eq. (19)). The resolution limit of 3.1 Å, which was not 815 exceeded during refinement, as well as the FSC = 0.143 threshold are indicated by lines. 816