Review: Deep Learning in Electron Microscopy

Deep learning is transforming most areas of science and technology, including electron microscopy. This review offers a practical perspective aimed at developers with limited familiarity with machine learning. For context, we review popular applications of deep learning in electron microscopy. We then discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. Next, we review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.

resampling. However, a variety of other strategies to minimize beam damage have also been proposed, including dose fractionation 232 and a variety of sparse data collection methods 233. Perhaps the most intensively investigated approach to the latter is sampling a random subset of pixels, followed by reconstruction using an inpainting algorithm 233–238. Random sampling of pixels is nearly optimal for reconstruction by compressed sensing algorithms 239. However, random sampling exceeds the design parameters of standard electron beam deflection systems, and can only be performed by collecting data slowly 240,241 or with the addition of a fast deflection or blanking system 236,242.
Sparse data collection methods that are more compatible with conventional beam deflection systems have also been investigated. For example, maintaining a linear fast scan deflection whilst using a widely spaced slow scan axis with some small random 'jitter' 234,240. However, even small jumps in electron beam position can lead to a significant difference between nominal and actual beam positions in a fast scan. Such jumps can be avoided by driving deflectors with functions that have continuous derivatives, such as those for spiral and Lissajous scan paths 192,236,241,243,244. Sang 241,244 considered a variety of scans, including Archimedes and Fermat spirals, and scans with constant angular or linear displacements, by driving electron beam deflectors with a field-programmable gate array 245 (FPGA) based system. Spirals with constant angular velocity place the least demand on electron beam deflectors. However, dwell times, and therefore electron doses, decrease with radius. Conversely, spirals created with constant spatial speeds are prone to systematic image distortions due to lags in deflector responses. In practice, fixed doses are preferable as they simplify visual inspection and limit the dose dependence of STEM noise 125.
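To make the scan path discussion concrete, the following minimal NumPy sketch generates probe positions for a constant angular velocity Archimedes spiral. The grid size, number of points and number of turns are illustrative choices, not parameters from the works cited above.

```python
import numpy as np

def archimedes_spiral_scan(n_points=10000, n_turns=32, size=512):
    """Probe positions for a constant angular velocity Archimedes
    spiral, r = a * theta, centred on a size x size grid."""
    theta = np.linspace(0, 2 * np.pi * n_turns, n_points)
    a = (size / 2 - 1) / theta[-1]   # scale the radius to fit the grid
    r = a * theta
    x = size / 2 + r * np.cos(theta)
    y = size / 2 + r * np.sin(theta)
    return x, y

x, y = archimedes_spiral_scan()
# With constant angular velocity, successive probe positions are spaced
# further apart at large radii, so dose per unit area decreases with radius.
```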
Deep learning can leverage an understanding of physics to infill images 246–248. Example applications include increasing SEM 174,249,250, STEM 193,251 and TEM 252 resolution, and infilling continuous sparse scans 192. Example applications of DNNs to complete sparse spiral and grid scans are shown in fig. 2. However, caution should be used when infilling large regions as ANNs may generate artefacts if a signal is unpredictable 192. A popular alternative to deep learning for infilling large regions is exemplar-based infilling 253–256. However, exemplar-based infilling often leaves artefacts 257 and is usually limited to leveraging information from single images. Smaller regions are often infilled by fast marching 258, Navier-Stokes infilling 259, or interpolation 227.

X-ray photoelectron spectroscopy 737,738 (XPS). Quantitative advantages of electron microscopes can include higher resolution and depth of field, and lower radiation damage than light microscopes 739. In addition, electron microscopes can record images, enabling visual interpretation of complex structures that may otherwise be intractable. This section will briefly introduce varieties of electron microscopes, simulation software, and how electron microscopes can interface with ANNs.

Figure 6. Numbers of results per year returned by Dimensions.ai abstract searches for SEM, TEM, STEM, STM and REM qualitate their popularities. The number of results for 2020 is extrapolated using the mean rate before 14th July 2020.

Microscopes
There are a variety of electron microscopes that use different illumination mechanisms. For example, reflection electron microscopy 740,741 (REM), scanning electron microscopy 742,743 (SEM), scanning transmission electron microscopy 744,745 (STEM), scanning tunneling microscopy 746,747 (STM), and transmission electron microscopy 748–750 (TEM). To roughly gauge popularities of electron microscope varieties, we performed abstract searches with Dimensions.ai 640,751–753 for their abbreviations followed by "electron microscopy", e.g. "REM electron microscopy". Numbers of results per year in fig. 6 qualitate that popularity increases in the order REM, STM, STEM, TEM, then SEM. It may be tempting to attribute the popularity of SEM over TEM to the lower cost of SEM 754, which increases accessibility. However, a range of considerations influence the procurement of electron microscopes 755, and hourly pricing at universities 756–760 is similar for SEM and TEM.
In SEM, material surfaces are scanned by sequential probing with a beam of electrons, which are typically accelerated to 0.2–40 keV. The SEM detects quanta emitted from where the beam interacts with the sample. Most SEM imaging uses low-energy secondary electrons. However, reflection electron microscopy 740,741 (REM) uses elastically backscattered electrons and is often complemented by a combination of reflection high-energy electron diffraction 761–763 (RHEED), reflection high-energy electron loss spectroscopy 764,765 (RHEELS) and spin-polarized low-energy electron microscopy 766–768 (SPLEEM). Some SEMs also detect Auger electrons 769,770. To enhance materials characterization, most SEMs also detect light. The most common light detectors are for cathodoluminescence and energy dispersive x-ray 771,772 (EDX) spectroscopy. Nonetheless, some SEMs also detect Bremsstrahlung radiation 773.
Alternatively, TEM and STEM can detect electrons transmitted through specimens. In conventional TEM, a single region is exposed to a broad electron beam. In contrast, STEM uses a fine electron beam to probe a series of discrete probing locations. Typically, electrons are accelerated across a potential difference to kinetic energies, E_k, of 80–300 keV. Electrons also have rest energy E_e = m_e c^2, where m_e is the electron rest mass and c is the speed of light. The total energy, E_t = E_e + E_k, of free electrons is related to their rest energy by a Lorentz factor, γ = E_t / E_e = (1 − v^2/c^2)^(−1/2), where v is the speed of electron propagation in the rest frame of an electron microscope. Electron kinetic energies in TEM and STEM are comparable to their rest energy, E_e = 511 keV 774, so relativistic phenomena 775,776 must be considered to accurately describe their dynamics.
Electrons exhibit wave-particle duality 338,339. Thus, in an ideal electron microscope, the maximum possible detection angle, θ, between two point sources separated by a distance, d, perpendicular to the electron propagation direction is diffraction-limited. The resolution limit for imaging can be quantified by Rayleigh's criterion 777–779, sin θ = 1.22λ/d, where resolution increases with decreasing wavelength, λ. Electron wavelength decreases with increasing accelerating voltage, as described by the relativistic de Broglie relation 780–782, λ = hc/(E_k(E_k + 2E_e))^(1/2), where h is Planck's constant 774. Electron wavelengths for typical acceleration voltages tabulated by JEOL are in picometres 783. In comparison, Cu K-α x-rays, which are often used for XRD, have wavelengths near 0.15 nm 784. In theory, electrons can therefore achieve over 100× higher resolution than x-rays. Electrons and x-rays are both ionizing; however, electrons often do less radiation damage to thin specimens than x-rays 739. Tangentially, TEM and STEM often achieve over 10 times higher resolution than SEM 785 as transmitted electrons in TEM and STEM are easier to resolve than electrons returned from material surfaces in SEM.
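As a worked example of the relativistic de Broglie relation above, this short Python sketch computes electron wavelengths for a few typical acceleration voltages. The physical constants are standard values, and the chosen energies are illustrative.

```python
import numpy as np

H = 6.62607015e-34            # Planck's constant (J s)
C = 2.99792458e8              # speed of light (m/s)
E_CHARGE = 1.602176634e-19    # elementary charge (C)
E_REST = 511e3 * E_CHARGE     # electron rest energy, E_e = m_e c^2 (J)

def electron_wavelength(kev):
    """Relativistic de Broglie wavelength (m) for kinetic energy in keV."""
    e_k = kev * 1e3 * E_CHARGE
    return H * C / np.sqrt(e_k * (e_k + 2 * E_REST))

for kev in (80, 200, 300):
    print(f"{kev} keV: {electron_wavelength(kev) * 1e12:.2f} pm")
# Prints wavelengths of a few picometres, far below the ~0.15 nm
# wavelength of Cu K-alpha x-rays used for XRD.
```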
In practice, TEM and STEM are also limited by incoherence 786–788 introduced by inelastic scattering, electron energy spread, and other mechanisms. TEM and STEM are related by an extension of Helmholtz reciprocity 789,790, where the source plane in a TEM corresponds to the detector plane in a STEM 791, as shown in fig. 5. Consequently, TEM coherence is limited by electron optics between the specimen and image, whereas STEM coherence is limited by the illumination system. For conventional TEM and STEM imaging, electrons are normally incident on a specimen 792. Advantages of STEM imaging can include higher contrast and resolution than TEM imaging, and lower radiation damage 793. Following, STEM is increasingly being favored over TEM for high-resolution studies. However, we caution that definitions of TEM and STEM resolution can be disparate 794.

Automation
Most modern electron microscopes support Gatan Microscopy Suite (GMS) Software 870. GMS enables electron microscopes to be programmed with DigitalMicrograph Scripting, a proprietary Gatan programming language akin to a simplified version of C++. A variety of DigitalMicrograph scripts, tutorials and related resources are available from Dave Mitchell's DigitalMicrograph Scripting Website 668,669, FELMI/ZFE's Script Database 871 and Gatan's Script Library 872. Some electron microscopists also provide DigitalMicrograph scripting resources on their webpages 873–875. However, DigitalMicrograph scripts are slow insofar as they are interpreted at runtime, and there is limited native functionality for parallel and distributed computing. As a result, extensions to DigitalMicrograph scripting are often developed in other programming languages that offer more functionality.
Historically, most extensions were developed in C++ 876. This was problematic as there is limited documentation, the standard approach used outdated C++ software development kits such as Visual Studio 2008, and the programming expertise required to build functions that interface with DigitalMicrograph scripts limits accessibility. To increase accessibility, recent versions of GMS now support python 877. This is convenient as it enables ANNs developed with python to readily interface with electron microscopes. For ANNs developed with C++, users have the option to create C++ bindings for either DigitalMicrograph script or python. Integrating ANNs developed in other programming languages is more complicated as DigitalMicrograph provides almost no support. However, that complexity can be avoided by exchanging files between DigitalMicrograph script and external libraries via a random access memory (RAM) disk 878 or secondary storage 879.
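As an illustrative sketch of such file-based exchange, the external Python process below polls a shared folder for arrays written by a DigitalMicrograph script, processes them, and writes results back. The directory and file names are hypothetical, and `process` is a placeholder for ANN inference; this is a minimal polling pattern, not Gatan's API.

```python
import time
from pathlib import Path

import numpy as np

# Hypothetical exchange directory; in practice this could be a RAM disk
# (e.g. mapped to R:\) or a folder on secondary storage.
EXCHANGE = Path("R:/dm_exchange")
INPUT_FILE = EXCHANGE / "micrograph.npy"   # written by DigitalMicrograph script
OUTPUT_FILE = EXCHANGE / "processed.npy"   # read back by DigitalMicrograph script

def process(image: np.ndarray) -> np.ndarray:
    """Placeholder for ANN inference, e.g. a denoising DNN."""
    return image  # replace with model(image)

while True:
    if INPUT_FILE.exists():
        image = np.load(INPUT_FILE)
        INPUT_FILE.unlink()                # consume the input
        np.save(OUTPUT_FILE, process(image))
    time.sleep(0.05)                       # poll at 20 Hz
```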
Increasing accessibility, there are collections of GMS plugins with GUIs for automation and analysis 873–875,880. In addition, various individual plugins are available 881–885. Some plugins are open source, so they can be adapted to interface with ANNs. However, many high-quality plugins are proprietary and closed source, limiting their use to automation of data collection and processing. Plugins can also be supplemented by a variety of libraries and interfaces for electron microscopy signal processing. For example, popular general-purpose software includes ImageJ 886, pycroscopy 515,516 and HyperSpy 517,518. In addition, there are directories for tens of general-purpose and specific electron microscopy programs 887–889.

Components
Most modern ANNs are configured from a variety of DLF components. To take advantage of hardware accelerators 58 , most ANNs are implemented as sequences of parallelizable layers of tensor operations 890 . Layers are often parallelized across data and may be parallelized across other dimensions 891 . This section introduces popular nonlinear activation functions, normalization layers, convolutional layers, and skip connections. To add insight, we provide comparative discussion and address some common causes of confusion.

Nonlinear Activation
DNNs need multiple layers to be universal approximators 35–43. Nonlinear activation functions 892,893 are therefore essential to DNNs, as successive linear layers can be contracted to a single layer. Activation functions separate artificial neurons, similar to biological neurons 894. To learn efficiently, most DNNs are tens or hundreds of layers deep 45,895–897. High depth increases representational capacity 45, which can help training by gradient descent as DNNs evolve as linear models 898 and nonlinearities can create suboptimal local minima where data cannot be fit by linear models 899. There are infinitely many possible activation functions. However, most activation functions have low polynomial order, similar to physical Hamiltonians 45.
Most ANNs developed for electron microscopy are for image processing, where the most popular nonlinearities are rectified linear units 900,901 (ReLUs). The ReLU activation, f(x), of an input, x, is f(x) = max(0, x), and its gradient, ∂_x f(x), is 1 for x > 0 and 0 otherwise. Popular variants of ReLUs include leaky ReLU 902, f(x) = max(αx, x), where α is a hyperparameter, parametric ReLU 20 (PReLU) where α is a learned parameter, dynamic ReLU where α is a learned function of inputs 903, and randomized leaky ReLU 904 (RReLU) where α is chosen randomly. Typically, learned PReLU α are higher the nearer a layer is to ANN inputs. Motivated by limited comparisons that do not show a clear performance difference between ReLU and leaky ReLU 905, some blogs 906 argue against using leaky ReLU due to its higher computational requirements and complexity. However, an in-depth comparison found that leaky ReLU variants consistently slightly outperform ReLU 904. In addition, the non-zero gradient of leaky ReLU for x ≤ 0 prevents saturating, or "dying", ReLUs 907–909, where the zero gradient of ReLUs stops learning.
There are a variety of other piecewise linear ReLU variants that can improve performance. For example, ReLUh activations are limited to a threshold 910, h, so that f(x) = min(max(0, x), h). Thresholds near h = 6 are often effective, so a popular choice is ReLU6. Another popular activation is concatenated ReLU 911 (CReLU), which is the concatenation of ReLU(x) and ReLU(−x). Other ReLU variants include adaptive convolutional 912, bipolar 913, elastic 914, and Lipschitz 915 ReLUs. However, most ReLU variants are uncommon as they are more complicated than ReLU and offer small, inconsistent, or unclear performance gains. Moreover, it follows from the universal approximator theorems 35–43 that disparity between ReLU and its variants approaches zero as network depth increases.
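A minimal NumPy sketch of the piecewise linear activations above; the default α and h values are common choices, not prescriptions.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x); gradient is 1 for x > 0 and 0 otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a gradient of alpha for x <= 0 avoids "dying" ReLUs."""
    return np.where(x > 0, x, alpha * x)

def relu_h(x, h=6.0):
    """Thresholded ReLU; h = 6 gives the popular ReLU6."""
    return np.minimum(np.maximum(0.0, x), h)

def crelu(x, axis=-1):
    """Concatenated ReLU doubles the number of feature channels."""
    return np.concatenate([relu(x), relu(-x)], axis=axis)
```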
In shallow networks, curved activation functions with non-zero Hessians often accelerate convergence and improve performance. A popular activation is the exponential linear unit 916 (ELU), f(x) = x for x > 0 and α(exp(x) − 1) otherwise, where α is a hyperparameter. Further, a scaled ELU 917 (SELU), f(x) = λx for x > 0 and λα(exp(x) − 1) otherwise, with fixed λ ≃ 1.0507 and α ≃ 1.6733, enables self-normalizing neural networks. Sigmoids can also be applied to limit the support of outputs. Unscaled, or "logistic", sigmoids, σ(x) = 1/(1 + exp(−x)), are related to tanh by tanh(x) = 2σ(2x) − 1. To avoid expensive exp(−x) in the computation of tanh, we recommend K-TanH 926, LeCun tanh 927 or piecewise linear approximation 928,929.
The activation functions introduced so far are scalar functions that can be efficiently computed in parallel for each input element. However, functions of vectors, x = {x_1, x_2, ...}, are also popular. For example, softmax activation 930, softmax(x)_i = exp(x_i)/Σ_j exp(x_j), is often applied before computing cross-entropy losses for classification networks. Similarly, L2 vector normalization, x/||x||_2, is often applied to n-dimensional vectors to ensure that they lie on a unit n-sphere 337. Finally, max pooling 931,932, which outputs the maxima of spatial windows of feature channels, is another popular multivariate activation function that is often used for downsampling. However, max pooling has fallen out of favor as it is often outperformed by strided convolutional layers 933. Other vector activation functions include squashing nonlinearities for dynamic routing by agreement in capsule networks 934 and cosine similarity 935.
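For example, softmax and L2 vector normalization might be implemented as follows; subtracting the maximum before exponentiation is a standard trick for numerical stability.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtracting max(x) leaves the
    result unchanged but avoids overflow in exp."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto a unit n-sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)
```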
There is a range of other activation functions that are not detailed here for brevity. Further, finding new activation functions is an active area of research 936,937. Notable variants include choosing activation functions from a set before training 938,939 and learning activation functions 938,940–943. Activation functions can also encode probability distributions 944–946 or include noise 929. Finally, there are a variety of other deterministic activation functions 937,947. In electron microscopy, most ANNs enable new or enhance existing applications. Consequently, we recommend using computationally efficient and established activation functions unless there is a compelling reason to use a specialized activation function.

Normalization
Normalization 948–950 standardizes signals, which can accelerate convergence by gradient descent and improve performance. Batch normalization 951–956 is the most popular normalization layer in image processing DNNs trained with minibatches of N examples. Technically, a "batch" is an entire training dataset and a "minibatch" is a subset; however, the "mini" is often omitted where meaning is clear from context. During training, batch normalization standardizes layer inputs, x = {x_1, ..., x_N}, with their batch mean, µ_B, and standard deviation, σ_B, then applies a learnable scale, γ, and shift, β, BN(x_i) = γ(x_i − µ_B)/(σ_B^2 + ε)^(1/2) + β, where ε is a small constant added for numerical stability. During inference, batch normalization applies the same transform with µ_B and σ_B replaced by expected moments, µ and σ, usually tracked by exponential moving averages during training. Increasing batch size stabilizes learning by averaging destabilizing loss spikes over batches 251. Batched learning also enables more efficient utilization of modern hardware accelerators. For example, larger batch sizes improve utilization of GPU memory bandwidth and throughput 379,957,958. Using large batches can also be more efficient than many small batches when distributing training across multiple CPU clusters or GPUs due to communication overheads. However, the performance benefits of large batch sizes can come at the cost of lower test accuracy as training with large batches tends to converge to sharper minima 959,960. As a result, it is often best not to use batch sizes higher than N ≈ 32 for image classification 961. However, learning rate scaling 529 and layer-wise adaptive learning rates 962 can increase accuracy of training with fixed larger batch sizes. Batch size can also be increased throughout training without compromising accuracy 963 to exploit effective learning rates being inversely proportional to batch size 529,963. Alternatively, accuracy can be improved by creating larger batches from replicated instances of training inputs with different data augmentations 964.
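A minimal NumPy sketch of the training and inference transforms above, assuming inputs are batches of feature vectors; the default ε is an illustrative value.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization for a batch of N vectors."""
    mu_b = x.mean(axis=0)                       # batch mean
    var_b = x.var(axis=0)                       # batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)   # standardize
    return gamma * x_hat + beta

def batch_norm_infer(x, gamma, beta, mu, var, eps=1e-3):
    """Inference-mode batch normalization with moments, mu and var,
    tracked by exponential moving averages during training."""
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```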
There are a few caveats to batch normalization. Originally, batch normalization was applied before activation 952. However, applying batch normalization after activation often slightly improves performance 965,966. In addition, training can be sensitive to the often-forgotten ε hyperparameter 967 in the batch normalization transform. Typically, performance decreases as ε is increased above ε ≈ 0.001; however, there is a sharp increase in performance around ε = 0.01 on ImageNet. Finally, it is often assumed that batches are representative of the training dataset. This is often approximated by shuffling training data to sample independent and identically distributed (i.i.d.) samples. However, performance can often be improved by prioritizing sampling 968,969. We observe that batch normalization is usually effective if batch moments, µ_B and σ_B, have similar values for every batch.
Batch normalization is less effective when training batch sizes are small, or do not consist of independent samples. To improve performance, batch moments can be renormalized 970 towards expected means, µ, and standard deviations, σ, by replacing the standardized inputs, x̂_i = (x_i − µ_B)/σ_B, with x̂_i r + d, where r = clip(σ_B/σ, 1/r_max, r_max) and d = clip((µ_B − µ)/σ, −d_max, d_max), and gradients are not backpropagated with respect to (w.r.t.) the renormalization parameters, r and d. Moments, µ and σ, are tracked by exponential moving averages, and clipping to r_max and d_max improves learning stability. Usually, clipping values are increased from starting values of r_max = 1 and d_max = 0, which correspond to batch normalization, as training progresses. Another approach is virtual batch normalization 971 (VBN), which estimates µ and σ from a reference batch of samples and does not require clipping. However, VBN is computationally expensive as it requires computing a second batch of statistics at every training iteration. Finally, online 972 and streaming 950 normalization enable training with small batch sizes by replacing µ_B and σ_B with their exponential moving averages.
There are alternatives to the L2 batch normalization described above that standardize to different norms. For example, L1 batch normalization 973 replaces σ_B with σ_B^L1 = (C_L1/N) Σ_i |x_i − µ_B|, where C_L1 = (π/2)^(1/2). Although the C_L1 factor could be learned by ANN parameters, its inclusion accelerates convergence of the original implementation of L1 batch normalization 973. Another alternative is L∞ batch normalization 973, which replaces σ_B with σ_B^L∞ = C_L∞ mean(top_k(|x − µ_B|)), where C_L∞ is a scale factor and top_k(x) returns the k highest elements of x. Hoffer et al 973 suggest k = 10. Some L1 batch normalization proponents claim that L1 batch normalization outperforms 951 or achieves similar performance 973 to L2 batch normalization. However, we found that L1 batch normalization often lowers performance in our experiments. Similarly, L∞ batch normalization often lowers performance 973. Overall, L1 and L∞ batch normalization do not appear to offer a substantial advantage over L2 batch normalization.

A variety of layers normalize samples independently, including layer, instance, and group normalization. They are compared with batch normalization in fig. 7. Layer normalization 974,975 is a transposition of batch normalization that is computed across feature channels for each training example, instead of across batches. Batch normalization is ineffective in RNNs; however, layer normalization of input activations often improves accuracy 974. Instance normalization 976 is an extreme version of layer normalization that standardizes each feature channel for each training example. Instance normalization was developed for style transfer and makes ANNs insensitive to input image contrast. Group normalization 977 is intermediate to instance and layer normalization insofar as it standardizes groups of channels for each training example.
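The difference between these layers is which axes statistics are computed over. A minimal NumPy sketch for a batch-height-width-channels tensor, where the axis choices follow the definitions above:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    sigma = x.std(axis=axes, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(8, 32, 32, 16)        # batch, height, width, channels
batch_normed = normalize(x, (0, 1, 2))     # per channel, across the batch
layer_normed = normalize(x, (1, 2, 3))     # per example, across channels
instance_normed = normalize(x, (1, 2))     # per example and channel
# Group normalization reshapes channels into groups, then normalizes
# each group like instance normalization.
```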
The advantages of a set of multiple different normalization layers, Ω, can be combined by switchable normalization 978,979, which standardizes to x̂ = (x − Σ_{z∈Ω} λ_µz µ_z)/(Σ_{z∈Ω} λ_σz σ_z^2 + ε)^(1/2), where µ_z and σ_z are means and standard deviations computed by normalization layer z. Their respective importance ratios, λ_µz and λ_σz, are trainable parameters that are softmax activated to sum to unity. Combining batch and instance normalization statistics 980 outperforms batch normalization for a range of computer vision tasks 980. However, most layers strongly weighted either batch or instance normalization, with most preferring batch normalization. Interestingly, combining batch, instance and layer normalization statistics 978,979 results in instance normalization being used more heavily in earlier layers, whereas layer normalization is preferred in the later layers, and batch normalization is preferred in the middle. Smaller batch sizes lead to a preference towards layer normalization and instance normalization. Limitingly, using multiple normalization layers increases computation. To limit expense, we therefore recommend either defaulting to batch normalization, or trying single instance, batch or layer normalization layers.
A significant limitation of batch normalization is that it is not effective in RNNs. This is a limited issue as most electron microscopists are developing CNNs for image processing. However, we anticipate that RNNs may become more popular in electron microscopy following the increasing popularity of reinforcement learning 981 . In addition to general-purpose alternatives to batch normalization that are effective in RNNs, such as layer normalization, there are a variety of dedicated normalization schemes. For example, recurrent batch normalization 982, 983 uses distinct normalization layers for each time step. Alternatively, batch normalized RNNs 984 only have normalization layers between their input and hidden states. Finally, online 972 and streaming 950 normalization are general-purpose solutions that improve the performance of batch normalization in RNNs by applying batch normalization based on a stream of past batch statistics.
Normalization can also standardize trainable weights, w. For example, weight normalization 985, w = gv/||v||_2, decouples the L2 norm, g, of a variable, v, from its direction. Similarly, weight standardization 986 subtracts means from variables and divides them by their standard deviations, similar to batch normalization. Weight normalization often outperforms batch normalization at small batch sizes. However, batch normalization consistently outperforms weight normalization at the larger batch sizes used in practice 987. Combining weight normalization with running mean-only batch normalization can accelerate convergence 985. However, similar final accuracy can be achieved without mean-only batch normalization at the cost of slower convergence, or with the use of zero-mean preserving activation functions 913,973. To achieve similar performance to batch normalization, norm-bounded weight normalization 973 can be applied to DNNs with scale-invariant activation functions, such as ReLU. Norm-bounded weight normalization fixes g at initialization to avoid learning instability 973,987, and scales outputs with the final DNN layer. Limitedly, weight normalization encourages the use of a small number of features to inform activations 988. To maximize feature utilization, spectral normalization 988 divides tensors by their spectral norms, σ(w). Further, spectral normalization limits Lipschitz constants 989, which often improves generative adversarial network 990–993 (GAN) training by bounding backpropagated discriminator gradients 988. The spectral norm of a variable, v, is the maximum value of the diagonal matrix, Σ, in the singular value decomposition 994–997, v = UΣV^T, where U and V are orthogonal matrices of orthonormal eigenvectors for vv^T and v^T v, respectively. To minimize computation, σ(w) is often approximated by the power iteration method 998,999, which alternates v ← w^T u/||w^T u||_2 and u ← wv/||wv||_2 so that σ(w) ≈ u^T wv, where one power iteration per training iteration is usually sufficient.
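A minimal NumPy sketch of spectral normalization by power iteration for a 2D weight matrix, reusing the vector u between calls as described above; the dimensions are illustrative.

```python
import numpy as np

def spectral_norm(w, u, n_iter=1):
    """Approximate the spectral norm of a 2D weight matrix, w, by
    power iteration, reusing u between training iterations."""
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v   # estimate of the largest singular value
    return sigma, u

w = np.random.randn(64, 128)
u = np.random.randn(64)
sigma, u = spectral_norm(w, u)
w_sn = w / sigma        # spectrally normalized weights
```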
Parameter normalization can complement or be combined with signal normalization. For example, scale normalization 1000 learns scales, g, for L2 normalized activations, gx/||x||_2, and is often combined with weight normalization 985,1001 in transformer networks. Similarly, cosine normalization 935 computes products of L2 normalized parameters and signals. Both scale and cosine normalization can outperform batch normalization.

Convolutional Layers
In general, the convolution of two functions, f(t) and g(t), is (f * g)(t) = ∫_Ω f(τ)g(t − τ) dτ, and their cross-correlation is (f ⋆ g)(t) = ∫_Ω f(τ)g(t + τ) dτ, where the integrals have unlimited support, Ω. In a CNN, convolutional layers sum convolutions of feature channels with trainable kernels, as shown in fig. 8. Thus, f(t) and g(t) are discrete functions, and the integrals can be replaced with limited summations. Since cross-correlation is equivalent to convolution if the kernel is flipped in every dimension, and CNN kernels are trainable, convolution and cross-correlation are often interchangeable in deep learning. For example, the TensorFlow function named "tf.nn.convolution" computes cross-correlations 1032. Nevertheless, the difference between convolution and cross-correlation could be a source of subtle errors if convolutional layers from a DLF are used in an image processing pipeline with static asymmetric kernels.
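The kernel-flip equivalence can be checked numerically, here with SciPy's ndimage filters for an odd-sized kernel (boundary handling left at its defaults):

```python
import numpy as np
from scipy.ndimage import convolve, correlate

image = np.random.randn(8, 8)
kernel = np.array([[1., 2., 3.],
                   [4., 5., 6.],
                   [7., 8., 9.]])

# Convolution equals cross-correlation with a kernel flipped in every
# dimension, so the two are interchangeable when kernels are trainable,
# but not for static asymmetric kernels.
a = convolve(image, kernel)
b = correlate(image, kernel[::-1, ::-1])
assert np.allclose(a, b)
```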
Alternatives to Gaussian kernels for image smoothing 1034 include mean, median and bilateral filters. Sobel kernels compute horizontal and vertical spatial gradients that can be used for edge detection 1035. For example, the 3×3 Sobel kernels are G_x = [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]] and G_y = [[−1, −2, −1], [0, 0, 0], [1, 2, 1]]. Alternatives to Sobel kernels offer similar utility, and include extended Sobel 1036, Scharr 1037,1038, Kayyali 1039, Roberts cross 1040 and Prewitt 1041 kernels. Two-dimensional Gaussian and Sobel kernels are examples of linearly separable, or "flattenable", kernels, which can be split into two one-dimensional kernels; for example, G_x is the outer product of [1, 2, 1]^T and [−1, 0, 1]. Kernel separation can decrease computation in convolutional layers by convolving separated kernels in series, and CNNs that only use separable convolutions are effective 1042–1044. However, serial convolutions decrease parallelization, and separable kernels have fewer degrees of freedom, decreasing representational capacity. Following, separated kernels are usually at least 5×5, and separated 3×3 kernels are unusual. Even-sized kernels, such as 2×2 and 4×4, are rare as symmetric padding is needed to avoid information erosion caused by spatial shifts of feature maps 1045.
A traditional 2D convolutional layer maps inputs, x^input, with height, H, width, W, and depth, D, to outputs x_k^output = b_k + Σ_{d=1}^{D} w_{k,d} * x_d^input, where K output channels are indexed by k ∈ [1, K], and each output channel is the sum of a bias, b, and convolutions of each input channel with M×N kernels with weights, w. For clarity, a traditional convolutional layer is visualized in fig. 8a. Convolutional layers for 1D, 3D and higher-dimensional kernels 1046 are similar. In contrast, fully connected, or "dense", layers connect every input element to every output element. Convolutional layers reduce computation by making local connections within receptive fields of convolutional kernels, and by convolving kernels rather than using different weights at each input position. Intermediately, fully connected layers can be regularized to learn local connections 1061. Fully connected layers are sometimes used at the middle of encoder-decoders 1062. However, such fully connected layers can often be replaced by multiscale atrous, or "holey", convolutions 931 in an atrous spatial pyramid pooling 294,295 (ASPP) module to decrease computation without a significant decrease in performance. Alternatively, weights in fully connected layers can be decomposed into multiple smaller tensors to decrease computation without significantly decreasing performance 1063,1064.

Convolutional layers can perform a variety of convolutional arithmetic 931. For example, strided convolutions 1065 usually skip computation of outputs that are not at multiples of an integer spatial stride. Most strided convolutional layers are applied throughout CNNs to sequentially decrease spatial extent, and thereby decrease computational requirements. In addition, strided convolutions are often applied at the start of CNNs 527,1050–1052, where most input features can be resolved at a lower resolution than the input. For simplicity and computational efficiency, stride is typically constant within a convolutional layer; however, increasing stride away from the centre of layers can improve performance 1066. To increase spatial resolution, convolutional layers often use reciprocals of integer strides 1067. Alternatively, spatial resolution can be increased by combining interpolative upsampling with an unstrided convolutional layer 1068,1069, which can help to minimize output artefacts.
Convolutional layers couple the computation of spatial and cross-channel convolutions. However, partial decoupling of spatial and cross-channel convolutions by distributing inputs across multiple convolutional layers and combining outputs can improve performance. Partial decoupling of convolutions is prevalent in many seminal DNN architectures, including FractalNet 1049, Inception 1050–1052 and NASNet 1053. Taking decoupling to an extreme, the depthwise separable convolutions 527,1070,1071 shown in fig. 8 compute depthwise convolutions of each input channel with its own spatial kernel, with weights, u, then compute pointwise 1×1 convolutions for D intermediate channels, where K output channels are indexed by k ∈ [1, K]. The depthwise layer is often followed by extra batch normalization before pointwise convolution to improve performance and accelerate convergence 1070. Increasing numbers of channels with pointwise convolutions can increase accuracy 1070, at the cost of increased computation. Pointwise convolutions are a special case of the traditional convolutional layers described above, with convolution kernel weights, v, and biases, b. Naively, depthwise separable convolutions require fewer weight multiplications than traditional convolutions 1072,1073 (see the sketch below). However, extra batch normalization and serialization of one convolutional layer into depthwise and pointwise convolutional layers mean that depthwise separable convolutions and traditional convolutions have similar computing times 527,1073.

Most DNNs developed for computer vision use fixed-size inputs. Although fixed input sizes are often regarded as an artificial constraint, this is similar to animalian vision, where there is an effectively constant number of retinal rods and cones 1074–1076. Typically, the most practical approach to handle arbitrary image shapes is to train a DNN with crops so that it can be tiled across images. In some cases, a combination of cropping, padding and interpolative resizing can also be used. To fully utilize unmodified variable size inputs, a simple approach is to train convolutional layers on variable size inputs. A pooling layer, such as global average pooling, can then be applied to fix output size before fully connected or other layers that might require fixed-size inputs. More involved approaches include spatial pyramid pooling 1077 or scale RNNs 1078. Typical electron micrographs are much larger than 300×300, which often makes it unfeasible for electron microscopists with a few GPUs to train high-performance DNNs on full-size images. For comparison, Xception was trained on 300×300 images with 60 K80 GPUs for over one month.
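A back-of-envelope comparison of naive weight counts for the traditional and depthwise separable layers discussed above; the layer sizes are illustrative.

```python
# Weight counts for a layer mapping D input channels to K output
# channels with M x N kernels, ignoring biases.
M, N, D, K = 3, 3, 64, 128

traditional = M * N * D * K     # coupled spatial and cross-channel kernels
depthwise = M * N * D           # one spatial kernel per input channel
pointwise = D * K               # 1x1 cross-channel convolutions
separable = depthwise + pointwise

print(traditional, separable)   # 73728 vs 8768 weights
```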
The Fourier transform 1079, f̂(k_1, ..., k_N), of a function, f(x_1, ..., x_N), at an N-dimensional Fourier space vector, {k_1, ..., k_N}, can be written f̂(k) = (|b|/(2π)^(1−a))^(N/2) ∫ f(x) exp(ib k·x) d^N x, where π = 3.141..., and i = (−1)^(1/2) is the imaginary number. The two parameters, a and b, parameterize popular conventions that relate the Fourier and inverse Fourier transforms. Mathematica documentation nominates conventions 1080 for general applications (a, b), pure mathematics (1, −1), classical physics (−1, 1), modern physics (0, 1), systems engineering (1, −1), and signal processing (0, 2π). We observe that most electron microscopists follow the modern physics convention of a = 0 and b = 1; however, the choice of convention is arbitrary and usually does not matter if it is consistent within a project. For discrete functions, Fourier integrals are replaced with summations that are limited to the support of the function. Discrete Fourier transforms of uniformly spaced inputs are often computed with a fast Fourier transform (FFT) algorithm, which can be parallelized for CPUs 1081 or GPUs 61,1082–1084. Typically, the speedup of FFTs on GPUs over CPUs is higher for larger signals 1085,1086. Most popular FFTs are based on the Cooley-Tukey algorithm 1087,1088, which recursively divides FFTs into smaller FFTs. We observe that some electron microscopists consider FFTs to be limited to radix-2 signals that can be recursively halved; however, FFTs can use any combination of factors for the sizes of recursively smaller FFTs. For example, clFFT 1089 FFT algorithms support signal sizes that are any product of powers of 2, 3, 5, 7, 11 and 13.
Convolution theorems can decrease computation by enabling convolution in the Fourier domain 1090. To ease notation, we denote the Fourier transform of a signal, I, by FT(I), and the inverse Fourier transform by FT^−1(I). Following, the convolution theorems for two signals, I_1 and I_2, are 1091 FT(I_1 * I_2) = FT(I_1) · FT(I_2) and FT(I_1 · I_2) = FT(I_1) * FT(I_2), where the signals can be feature channels and convolutional kernels. Fourier domain convolutions, I_1 * I_2 = FT^−1(FT(I_1) · FT(I_2)), are increasingly efficient, relative to signal domain convolutions, as kernel and image sizes increase 1090. Indeed, Fourier domain convolutions are exploited to enable faster training with large kernels in Fourier CNNs 1090,1092. However, Fourier CNNs are rare as most researchers use small 3×3 kernels, following University of Oxford Visual Geometry Group (VGG) CNNs 1093.
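A minimal NumPy sketch of Fourier domain convolution via the convolution theorem; note that zero-padding is omitted, so the result is a circular convolution.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Circular convolution via the convolution theorem:
    FT(I1 * I2) = FT(I1) . FT(I2)."""
    k = np.zeros_like(image)
    k[:kernel.shape[0], :kernel.shape[1]] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(k)))

image = np.random.randn(256, 256)
kernel = np.random.randn(31, 31)
out = fft_convolve2d(image, kernel)
# For large kernels this is much faster than direct convolution,
# which is exploited by Fourier CNNs.
```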

Skip Connections
Residual connections 1056 add a signal after skipping ANN layers, similar to cortical skip connections 1094,1095. Residuals improve DNN performance by preserving gradient norms during backpropagation 525,1096 and avoiding bad local minima 1097 by smoothing DNN loss landscapes 1098. In practice, residuals enable DNNs to behave like an ensemble of shallow networks 1099 that learn to iteratively estimate outputs 1100. Mathematically, a residual layer learns parameters, w_l, of a perturbative function, f_l(x_l, w_l), that maps a signal, x_l, at depth l to depth l + 1, x_{l+1} = x_l + f_l(x_l, w_l).

Figure 9. Residual blocks where a) one, b) two, and c) three convolutional layers are skipped. Typically, convolutional layers are followed by batch normalization then activation.
Residuals were developed for CNNs 1056, and examples of residual connections that skip one, two and three convolutional layers are shown in fig. 9. Nonetheless, residuals are also used in MLPs 1101 and RNNs 1102–1104. Representational capacity of perturbative functions increases as the number of skipped layers increases. As a result, most residuals skip two or three layers. Skipping one layer rarely improves performance due to its low representational capacity 1056.
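A minimal TensorFlow sketch of a residual block that skips two convolutional layers, following the convolution, batch normalization, then activation ordering in fig. 9; the filter counts and input shape are illustrative, and the shortcut assumes matching channel numbers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, skip=2):
    """Residual block skipping `skip` convolutional layers."""
    shortcut = x
    for i in range(skip):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        if i < skip - 1:
            x = layers.ReLU()(x)
    x = layers.Add()([shortcut, x])   # x_{l+1} = x_l + f_l(x_l, w_l)
    return layers.ReLU()(x)

inputs = tf.keras.Input((64, 64, 32))
outputs = residual_block(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
```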
There are a range of residual connection variants that can improve performance. For example, highway networks 1105,1106 apply a gating function to skip connections, and dense networks 1107–1109 use a high number of residual connections from multiple layers. Another example is applying a 1×1 convolutional layer to x_l before addition 527,1056 where f_l(x_l, w_l) spatially resizes or changes numbers of feature channels. However, resizing with norm-preserving convolutional layers 1096 before residual blocks can often improve performance. Finally, long additive 1110 residuals that connect DNN inputs to outputs are often applied to DNNs that learn perturbative functions.
A limitation of preserving signal information with residuals 1111,1112 is that residuals make DNNs learn perturbative functions, which can limit the accuracy of DNNs that learn non-perturbative functions if they do not have many layers. Feature channel concatenation is an alternative approach that is not perturbative, and that supports the combination of layers with different numbers of feature channels. In encoder-decoders, a typical example is concatenating features computed near the start with layers near the end to help resolve output features 294,295,297,305. Concatenation can also combine embeddings of different 1113,1114 or variants of 354 input features computed by multiple DNNs. Finally, peephole connections in RNNs can improve performance by using concatenation to combine cell state information with other cell inputs 1115,1116.

Architecture
There is a wide variety of ANN architectures 3–6 that are trained to minimize losses for a range of applications. Many of the most popular ANNs are also the simplest, and information about them is readily available. For example, encoder-decoder 294–297,490–492 or classifier 262 ANNs usually consist of single feedforward sequences of layers that map inputs to outputs. This section introduces more advanced ANNs used in electron microscopy, including actor-critics, GANs, RNNs, and VAEs. These ANNs share weights between layers or consist of multiple subnetworks. Other notable architectures include recursive CNNs 1054,1055, Network-in-Networks 1117 (NiNs), and transformers 1118,1119. Although they will not be detailed here, their references may be good starting points for research.

Figure 10. Actor-critic architecture. An actor outputs actions based on input states. A critic then evaluates action-state pairs to predict losses.

Actor-Critic
Most ANNs are trained by gradient descent using backpropagated gradients of a differentiable loss function (c.f. section 6.1). However, some losses are not differentiable. Examples include losses of actors directing their vision 1120,1121, and playing competitive 22 or score-based 1122,1123 computer games. To overcome this limitation, a critic 1124 can be trained to predict differentiable losses from action and state information, as shown in fig. 10. If the critic does not depend on states, it is a surrogate loss function 1125,1126. Surrogates are often fully trained before actor optimization, whereas critics that depend on action-state pairs are often trained alongside actors to minimize the impact of catastrophic forgetting 1127 by adapting to changing actor policies and experiences. Alternatively, critics can be trained with features output by intermediate layers of actors to generate synthetic gradients for backpropagation 1128.

Figure 11. Generative adversarial network architecture. A generator learns to produce outputs that look realistic to a discriminator, which learns to predict whether examples are real or generated.
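As a minimal sketch of the actor-critic scheme in fig. 10, the actor below is updated by backpropagating loss predictions through a critic. The network sizes and the state and action dimensions are hypothetical, and training of the critic on observed (non-differentiable) losses is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM, ACTION_DIM = 16, 4   # hypothetical dimensions

def mlp(in_dim, out_dim):
    return tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(in_dim,)),
        layers.Dense(out_dim),
    ])

actor = mlp(STATE_DIM, ACTION_DIM)
critic = mlp(STATE_DIM + ACTION_DIM, 1)  # predicts losses for action-state pairs
optimizer = tf.keras.optimizers.Adam(1e-4)

def actor_step(states):
    """Update the actor by backpropagating the critic's differentiable
    loss predictions; the critic itself is trained separately on
    losses observed from the environment."""
    with tf.GradientTape() as tape:
        actions = actor(states)
        predicted_loss = tf.reduce_mean(
            critic(tf.concat([states, actions], axis=-1)))
    grads = tape.gradient(predicted_loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```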

Generative Adversarial Network
Generative adversarial networks [990][991][992][993] (GANs) consist of discriminator and generator subnetworks that play an adversarial game, as shown in fig. 11. Generators learn to generate outputs that look realistic to a discriminator, whereas the discriminator learns to predict whether examples are real or generated. Most GANs are developed to generate visual media with realistic characteristics. For example, partial STEM images infilled with a GAN are less blurry than images infilled with a non-adversarial generator trained to minimize MSEs 192 . Alternatively, computationally inexpensive loss functions engineered by humans, such as SSIM 1129 and Sobel losses 220 , can improve generated output realism. However, it follows from the universal approximator theorems 35-43 that training with ANN discriminators can often yield more realistic outputs.
There are many popular GAN loss functions and regularization mechanisms 1130–1134. Originally, GANs were trained to minimize logarithmic discriminator, D, and generator, G, losses 1135, L_D = −log D(x) − log(1 − D(G(z))) and L_G = −log D(G(z)), where z are generator inputs, G(z) are generated outputs, and x are example outputs. Discriminators predict labels, D(x) and D(G(z)), where target labels are 0 and 1 for generated and real examples, respectively. Limitedly, logarithmic losses are numerically unstable for D(x) → 0 or D(G(z)) → 1, as the denominators of their gradients vanish. In addition, discriminators must be limited to D(x) > 0 and D(G(z)) < 1, so that logarithms are not complex. To avoid these issues, we recommend training discriminators with squared difference losses 1136,1137, L_D = (D(x) − 1)^2 + D(G(z))^2 and L_G = (D(G(z)) − 1)^2. Nevertheless, there are a variety of other alternatives to logarithmic loss functions that are also effective 1130,1131.
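A minimal TensorFlow sketch of the squared difference (least squares) GAN losses above; discriminator outputs are assumed to be unbounded real values.

```python
import tensorflow as tf

def discriminator_loss(real_preds, fake_preds):
    """Least squares discriminator loss: targets are 1 for real
    examples and 0 for generated examples."""
    return (tf.reduce_mean((real_preds - 1.0) ** 2)
            + tf.reduce_mean(fake_preds ** 2))

def generator_loss(fake_preds):
    """Least squares generator loss: fool the discriminator into
    predicting 1 for generated outputs."""
    return tf.reduce_mean((fake_preds - 1.0) ** 2)
```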
A variety of methods have been developed to improve GAN training 971,1138. The most common issues are catastrophic forgetting 1127 of previous learning, and mode collapse 1139, where generators only output examples for a subset of a target domain. Mode collapse often follows discriminators becoming Lipschitz discontinuous. Wasserstein GANs 1140 avoid mode collapse by clipping trainable variables, albeit often at the cost of 5–10 discriminator training iterations per generator training iteration. Alternatively, Lipschitz continuity can be imposed by adding a gradient penalty 1141 to GAN losses, such as the difference of the L2 norm of discriminator gradients from unity, L_GP = λ(||∇_x̂ D(x̂)||_2 − 1)^2, with x̂ = εx + (1 − ε)x̃, where ε ∈ [0, 1] is a uniform random variate, λ weights the gradient penalty, and x̃ is an attempt to generate x. However, using a gradient penalty introduces additional gradient backpropagation that increases discriminator training time. There are also a variety of computationally inexpensive tricks that can improve training, such as adding noise to labels 971,1051,1142 or balancing discriminator and generator learning rates 337. These tricks can help to avoid discontinuities in discriminator output distributions that can lead to mode collapse; however, we observe that these tricks do not reliably stabilize GAN training. Instead, we observe that spectral normalization 988 reliably stabilizes GAN discriminator training in our electron microscopy research 192,193,337. Spectral normalization controls Lipschitz constants of discriminators by fixing the spectral norms of their weights, as introduced in section 4.2. Advantages of spectral normalization include implementations based on the power iteration method 998,999 being computationally inexpensive, not adding a regularizing loss function that could detrimentally compete 1143,1144 with discrimination losses, and being effective with one discriminator training iteration per generator training iteration 988,1145. Spectral normalization is popular in GANs for high-resolution image synthesis, where it is also applied in generators to stabilize training 1146.
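A minimal TensorFlow sketch of such a gradient penalty for image-shaped inputs; the penalty weight and the assumption of batch-height-width-channels inputs are illustrative.

```python
import tensorflow as tf

def gradient_penalty(discriminator, x_real, x_fake, weight=10.0):
    """Penalize deviations of discriminator gradient norms from unity
    at random interpolates, x_hat = eps * x_real + (1 - eps) * x_fake."""
    eps = tf.random.uniform([tf.shape(x_real)[0], 1, 1, 1], 0.0, 1.0)
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = discriminator(x_hat)
    grads = tape.gradient(d_hat, x_hat)
    norms = tf.sqrt(tf.reduce_sum(grads ** 2, axis=[1, 2, 3]) + 1e-12)
    return weight * tf.reduce_mean((norms - 1.0) ** 2)
```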
There are a variety of GAN architectures 1147

Recurrent Neural Network
Recurrent neural networks 519–524 reuse an ANN cell to process each step of a sequence. Most RNNs learn to model long-term dependencies by gradient backpropagation through time 1156 (BPTT). Essentially, the ability of RNNs to utilize past experiences enables them to model partially observed and variable length Markov decision processes 1157. RNNs are often combined with CNNs, which can embed inputs 1165,1166 or process RNN outputs 1167,1168. RNNs can also be combined with MLPs 1120, or text embeddings 1169 such as BERT 1169,1170, continuous bag-of-words 1171–1173 (CBOW), doc2vec 1174,1175, GloVe 1176 and word2vec 1171,1177.
The most popular RNNs consist of long short-term memory 1178–1181 (LSTM) cells and gated recurrent units 1179,1182–1184 (GRUs). LSTMs and GRUs are popular as they solve the vanishing gradient problem 525,1185,1186 and have consistently high performance 1187–1192. Their architecture is shown in fig. 12. At step t, an LSTM outputs a hidden state, h_t, and cell state, C_t, given by

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),
i_t = σ(W_i · [h_{t−1}, x_t] + b_i),
o_t = σ(W_o · [h_{t−1}, x_t] + b_o),
C_t = f_t ∘ C_{t−1} + i_t ∘ tanh(W_C · [h_{t−1}, x_t] + b_C),
h_t = o_t ∘ tanh(C_t),

where C_{t−1} is the previous cell state, h_{t−1} is the previous hidden state, x_t is the step input, σ is a logistic sigmoid function of eqn. 10a, [x, y] is the concatenation of x and y channels, ∘ is the elementwise product, and (W_f, b_f), (W_i, b_i), (W_o, b_o) and (W_C, b_C) are pairs of weights and biases. A GRU performs fewer computations than an LSTM and does not have separate cell and hidden states,

z_t = σ(W_z · [h_{t−1}, x_t] + b_z),
r_t = σ(W_r · [h_{t−1}, x_t] + b_r),
h_t = (1 − z_t) ∘ h_{t−1} + z_t ∘ tanh(W_h · [r_t ∘ h_{t−1}, x_t] + b_h),

where (W_z, b_z), (W_r, b_r) and (W_h, b_h) are pairs of weights and biases. Minimal gated units (MGUs) can further reduce computation 1193. A large-scale analysis of RNN architectures for language translation found that LSTMs consistently outperform GRUs 1187. GRUs struggle with simple languages that are learnable by LSTMs as the combined hidden and cell states of GRUs make it more difficult for GRUs to perform unbounded counting 1191. However, further investigations found that GRUs can outperform LSTMs on tasks other than language translation 1188, and that GRUs can outperform LSTMs on some datasets 1189,1190,1194. Overall, LSTM performance is comparable to that of GRUs.
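A minimal NumPy sketch of one LSTM step, stacking the four gate preactivations into a single matrix multiplication as is common in practice; the sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenation [h_prev, x_t] to the
    stacked forget, input, output and candidate preactivations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates
    c_t = f * c_prev + i * np.tanh(g)              # new cell state
    h_t = o * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

hidden, inputs = 8, 4
W = np.random.randn(4 * hidden, hidden + inputs) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(np.random.randn(inputs), h, c, W, b)
```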
There are a variety of alternatives to LSTMs and GRUs. Examples include continuous time RNNs 1195–1199 (CTRNNs), Elman 1200 and Jordan 1201 networks, independently recurrent neural networks 1202 (IndRNNs), Hopfield networks 1203, and recurrent MLPs 1204 (RMLPs). However, none of the variants offer consistent performance benefits over LSTMs for general sequence modelling. Similarly, augmenting LSTMs with additional connections, such as peepholes 1115,1116 and projection layers 1205, does not consistently improve performance. For electron microscopy, we recommend defaulting to LSTMs as we observe that their performance is more consistently high than the performance of other RNNs. However, LSTM and GRU performance is often comparable, so GRUs are also a good choice to reduce computation.
There are a variety of architectures based on RNNs. Popular examples include deep RNNs 1206 that stack RNN cells to increase representational ability, bidirectional RNNs 1207–1210 that process sequences both forwards and in reverse to improve input utilization, and separate encoder and decoder subnetworks 1182,1211 that embed inputs and generate outputs. Hierarchical RNNs 1212–1216 are more complex models that stack RNNs to efficiently exploit hierarchical sequence information, and include multiple timescale RNNs 1217,1218 (MTRNNs) that operate at multiple sequence lengthscales. Finally, RNNs can be augmented with additional functionality to enable new capabilities. For example, attention mechanisms 1159,1219–1221 can enable more efficient input utilization. Further, creating a neural Turing machine (NTM) by augmenting an RNN with dynamic external memory 1222,1223 can make it easier for an agent to solve tasks such as traversing dynamic graphs.

Autoencoders
Autoencoders 1224–1226 (AEs) learn to efficiently encode inputs, I, without supervision. An AE consists of an encoder, E, and a decoder, D, as shown in fig. 13a. Most encoders and decoders are jointly trained 1227 to restore inputs from encodings, E(I), by gradient descent, typically to minimize an MSE loss such as L = MSE(D(E(I)), I). In practice, DNN encoders and decoders yield better compression 1225.

Figure 13. Architectures of autoencoders where an encoder maps an input to a latent space and a decoder learns to reconstruct the input from the latent space. a) An autoencoder encodes an input in a deterministic latent space, whereas a b) traditional variational autoencoder encodes an input as means, µ, and standard deviations, σ, of Gaussian multivariates, µ + σ · ε, where ε is a standard normal multivariate.
In general, semantics of AE outputs are pathological functions of encodings. To generate outputs with well-behaved semantics, traditional variational autoencoders 945,1237,1238 (VAEs) learn to encode means, µ, and standard deviations, σ, of Gaussian multivariates. Meanwhile, decoders learn to reconstruct inputs from sampled multivariates, µ + σ · ε, where ε is a standard normal multivariate. Traditional VAE architecture is shown in fig. 13b. Usually, VAE encodings are regularized by adding the Kullback-Leibler (KL) divergence of encodings from standard multinormals to the VAE loss function, L_KL = (λ_KL/2B) Σ_{i=1}^{B} Σ_{j=1}^{u} (µ_ij^2 + σ_ij^2 − log(σ_ij^2) − 1), where λ_KL weights the contribution of the KL divergence loss for a batch size of B, and a latent space with u elements. However, variants of Gaussian regularization can improve clustering 220, and sparse autoencoders 1239–1242 (SAEs) that regularize encoding sparsity can encode more meaningful features. To generate realistic outputs, a VAE can be combined with a GAN to create a VAE-GAN 1243–1245. Adding a loss to minimize differences between gradients of generated and target outputs is a computationally inexpensive alternative that can generate realistic outputs for some applications 220. A popular application of VAEs is data clustering. For example, VAEs can encode hash tables 1246–1250 for search engines, and we use VAEs as the basis of our electron micrograph search engines 220. Encoding clusters visualized by tSNE can be labelled to classify data 220, and encoding deviations from clusters can be used for anomaly detection 1251–1255. In addition, learning encodings with well-behaved semantics enables encodings to be used for semantic manipulation 1255,1256. Finally, VAEs can be used as generative models to create synthetic populations 1257,1258, develop new chemicals 1259–1262, and synthesize underrepresented data to reduce imbalanced learning 1263.
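A minimal TensorFlow sketch of the reparameterized sampling and KL divergence loss above, assuming the encoder outputs means and logarithms of variances.

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps drawn
    from a standard normal, keeps sampling differentiable."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def kl_loss(mu, log_var, weight=1.0):
    """KL divergence of diagonal Gaussian encodings from a standard
    multinormal, averaged over the batch."""
    kl = 0.5 * tf.reduce_sum(
        tf.square(mu) + tf.exp(log_var) - log_var - 1.0, axis=-1)
    return weight * tf.reduce_mean(kl)
```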

Optimization
Training, testing, deployment and maintenance of machine learning systems are often time-consuming and expensive 1264–1267. The first step is preparing training data and setting up data pipelines for ANN training and evaluation. Typically, ANN parameters are randomly initialized for optimization by gradient descent, possibly as part of an automatic machine learning algorithm. Reinforcement learning is a special optimization case where the loss is a discounted future reward. During training, ANN components are often regularized to stabilize training, accelerate convergence, or improve performance. Finally, trained models can be streamlined for efficient deployment. This section introduces each step. We find that electron microscopists can be apprehensive about the robustness and interpretability of ANNs, so we also provide subsections on model evaluation and interpretation.

Gradient Descent
Most ANNs are iteratively trained by gradient descent 453,1280–1284, as described by algorithm 1 and shown in fig. 14 1285,1286.

Algorithm 1: ANN training by gradient descent.
for training iteration t = 1, ..., T do
    Forward propagate a randomly sampled batch of inputs, x, through the model to compute outputs, y = f(x).
    Compute the loss, L_t, for the outputs.
    Use the differentiation chain rule 1269 to backpropagate gradients of the loss to trainable parameters, θ_{t−1}.
    Apply an optimizer to the gradients to update θ_{t−1} to θ_t.
end for

Figure 14. Gradient descent. a) Arrows depict steps across one dimension of a loss landscape as a model is optimized by gradient descent. In this example, the optimizer traverses a small local minimum; however, it then gets trapped in a larger sub-optimal local minimum, rather than reaching the global minimum. b) Experimental DNN loss surface for two random directions in parameter space showing many local minima 1098. The plot in part b) is reproduced with permission under an MIT license 1268.

Alternatively, forward pass computations can be split across multiple devices 1287. Optimization by gradient descent plausibly models learning in some biological systems 1288. However, gradient descent is not generally an accurate model of biological learning 1289–1291. There are many popular gradient descent optimizers for deep learning 1280–1282. Update rules for eight popular optimizers are summarized in fig. 15. Other optimizers include AdaBound 1292, AMSBound 1292, AMSGrad 1293, Lookahead 1294, NADAM 1295, Nostalgic Adam 1296, Power Gradient Descent 1297, RADAM 1298, and trainable optimizers 1299–1303. Gradient descent is effective in the high-dimensional optimization spaces of overparameterized ANNs 1304 as the probability of getting trapped in a suboptimal local minimum decreases as the number of dimensions increases. The simplest optimizer is "vanilla" stochastic gradient descent (SGD), where a trainable parameter perturbation, ∆θ_t = θ_t − θ_{t−1} = −η ∂_θ L_t, is the product of a learning rate, η, and the derivative of a loss, L_t, w.r.t. the trainable parameter, ∂_θ L_t. However, vanilla SGD convergence is often limited by unstable parameter oscillations as it is a low-order local optimization method 1305. Further, vanilla SGD has no mechanism to adapt to varying gradient sizes, which vary effective learning rates as ∆θ ∝ ∂_θ L_t.
To accelerate convergence, many optimizers introduce a momentum term that weights an average of gradients with past gradients 1273,1306,1307. Momentum-based optimizers in fig. 15 are momentum, Nesterov momentum 1273,1274, quasi-hyperbolic momentum 1276, AggMo 1277, ADAM 1279, and AdaMax 1279. To standardize effective learning rates for every layer, adaptive optimizers normalize updates based on an average of past gradient sizes. Adaptive optimizers in fig. 15 are RMSProp 1278, ADAM 1279, and AdaMax 1279, which usually result in faster convergence and higher accuracy than other optimizers 1308,1309. However, adaptive optimizers can be outperformed by vanilla SGD due to overfitting 1310, so some researchers transition from adaptive optimization to vanilla SGD as training progresses 1292. We recommend adaptive optimization with NADAM 1295, which combines ADAM with Nesterov momentum, as a comparative analysis of select gradient descent optimizers found that it often achieves higher performance than other popular optimizers 1311. Limitingly, most adaptive optimizers slowly adapt to changing gradient sizes, e.g. the recommended value for the ADAM second moment decay rate, β_2, is 0.999 1279. To prevent learning being destabilized by spikes in gradient sizes, adaptive optimizers can be combined with adaptive learning rate 251,1292 or gradient 1185,1312,1313 clipping.
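A minimal NumPy sketch of one ADAM update, showing how the first moment, m, provides momentum and the second moment, v, normalizes effective learning rates; the hyperparameter defaults follow common recommendations.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step for trainable parameters, theta, at iteration t."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (adaptivity)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```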
For non-adaptive optimizers, effective learning rates are likely to vary due to varying magnitudes of gradients w.r.t. trainable parameters. Similarly, learning by biological neurons varies as stimuli usually activate a subset of neurons 1314 , whereas all neuron outputs are usually computed for ANNs.

Figure 15. Update rules of various gradient descent optimizers for a trainable parameter, θ_t, at iteration t, gradients of losses w.r.t. the parameter, ∂_θ L_t, and learning rate, η. Hyperparameters are listed in square brackets.

Thus, not effectively using all weights to inform decisions is computationally inefficient. Further, inefficient weight updates can limit representation capacity, slow convergence, and decrease training stability. A typical example is effective learning rates varying between layers. Following the chain rule, gradients backpropagated to the ith layer of a DNN from its start are ∂_{x_i} L_t = (∏_{l=i}^{L−1} ∂x_{l+1}/∂x_l) ∂_{x_L} L_t for a DNN with L layers. Vanishing gradients 525, 1185, 1186 occur when many layers have ∂x_{l+1}/∂x_l ≪ 1. For example, DNNs with logistic sigmoid activations often exhibit vanishing gradients as their maximum gradient is 1/4 c.f. eqn. 10b. Similarly, exploding gradients 525, 1185, 1186 occur when many layers have ∂x_{l+1}/∂x_l ≫ 1. Adaptive optimizers alleviate vanishing and exploding gradients by dividing gradients by their expected sizes. Nevertheless, it is essential to combine adaptive optimizers with appropriate initialization and architecture to avoid numerical instability.
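As a toy illustration of the product above, consider layers with logistic sigmoid activations, whose derivative is at most 1/4; backpropagated gradient magnitudes then shrink at least geometrically with depth.

```python
# Toy illustration: upper bound on gradient magnitude after backpropagating
# through L sigmoid layers, each contributing a factor of at most 0.25.
for L in (5, 10, 20):
    print(L, 0.25 ** L)  # 5 -> ~1e-3, 10 -> ~1e-6, 20 -> ~1e-12
```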
Optimizers have a myriad of hyperparameters that are initialized and varied throughout training to optimize performance 1315 , c.f. fig. 15. For example, stepwise exponentially decayed learning rates are often theoretically optimal 1316 . There are also various heuristics that are often effective, such as using a DEMON decay schedule 1317 for the ADAM first moment decay rate, β_1, where β_t = β_init(1 − t/T) / ((1 − β_init) + β_init(1 − t/T)), β_init is the initial value of β_1, t is the iteration number, and T is the final iteration number. Developers often optimize ANN hyperparameters by experimenting with a range of heuristic values. Hyperparameter optimization algorithms 1318-1323 can automate optimizer hyperparameter selection. However, automatic hyperparameter optimizers may not yield sufficient performance improvements relative to well-established heuristics to justify their use, especially in initial stages of development.
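A minimal sketch of the DEMON schedule above; the function name and argument conventions are ours, not from the cited reference.

```python
def demon_beta1(t, T, beta_init=0.9):
    """DEMON: decay ADAM's first moment decay rate from beta_init toward 0.

    t is the current iteration and T the final iteration.
    """
    frac = 1.0 - t / T
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)
```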
In practice, many MDPs are partially observed or have non-differentiable losses that may make it difficult to learn a good policy from individual observations. However, RNNs can often learn a model of their environments from sequences of observations 1123 . Alternatively, FNNs can be trained with groups of observations that contain more information than individual observations 1122,1371 . If losses are not differentiable, a critic can learn to predict differentiable losses for actor training, c.f. section 5.1. Alternatively, actions can be sampled from a differentiable probability distribution 1120, 1374 as training losses given by products of losses and sampling probabilities are differentiable. There are also a variety of alternatives to gradient descent, introduced at the end of section 6.1, that do not require differentiable loss functions.
There are a variety of exploration strategies for RL 1375,1376 . Adding Ornstein-Uhlenbeck 1377 (OU) noise to actions is effective for continuous control tasks optimized by deep deterministic policy gradients 1122 (DDPG) or recurrent deterministic policy gradients 1123 (RDPG) RL algorithms. Adding Gaussian noise achieves similar performance for optimization by TD3 1378 or D4PG 1379 RL algorithms. Nevertheless, a comparison of OU and Gaussian noise across a variety of tasks 1380 found that OU noise usually achieves similar performance to or outperforms Gaussian noise. Similarly, exploration can be induced by adding noise to ANN parameters 1381,1382 . Other approaches to exploration include rewarding actors for increasing action entropy 1382-1384 and intrinsic motivation 1385-1387 , where ANNs are incentivized to explore actions that they are unsure about.
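As an illustration, a minimal Ornstein-Uhlenbeck noise process for action exploration is sketched below; the parameter values are typical defaults from the continuous control literature, not prescriptions.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(dim, mu)

    def sample(self):
        # Mean-reverting drift plus a scaled Brownian increment.
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state  # add to a deterministic action for exploration
```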
RL algorithms are often partitioned into online learning 1388,1389 , where training data is used as it is acquired, and offline learning 1390,1391 , where a static training dataset has already been acquired. However, many algorithms operate in an intermediate regime, where data collected with an online policy is stored in an experience replay 1392-1394 buffer for offline learning. Training data is often sampled at random from the replay buffer. However, prioritizing the replay of data with high losses 969 or data that results in high policy improvements 968 often improves actor performance. A default replay buffer size of around 10^6 examples is often used; however, training is sensitive to replay buffer size 1395 . If the replay buffer is too small, changes in actor policy may destabilize training; whereas if it is too large, convergence may be slowed by delays before the actor learns from policy changes.
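A minimal uniform-sampling experience replay buffer is sketched below; prioritized variants would replace the uniform choice with loss-weighted sampling. The class and method names are our own conventions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions collected online; sample batches for offline updates."""
    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)  # oldest data evicted when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; prioritized replay would weight by loss instead.
        return random.sample(self.buffer, batch_size)
```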

Automatic Machine Learning
There are a variety of automatic machine learning (AutoML) frameworks 1396-1399, 1412-1416 . AutoML is becoming increasingly popular as it can achieve higher performance than human developers 1053 and enables human developer time to be traded for potentially cheaper computer time. Nevertheless, AutoML is currently limited to established ANN architectures and learning policies. Accordingly, we recommend that researchers either focus on novel ANN architectures and learning policies or develop ANNs for novel applications.

Initialization
How ANN trainable parameters are initialized 525, 1417 is related to model capacity 1418 . Further, initializing parameters with values that are too small or large can cause slow learning or divergence 525 . Careful initialization can also prevent training by gradient descent being destabilized by vanishing or exploding gradients 525, 1185, 1186 , or by high variance of length scales across layers 525 . Finally, careful initialization can enable momentum to accelerate convergence and improve performance 1273 . For a layer that computes a weighted sum of n_in inputs, the variance of its outputs is Var(x_output) = n_in E(x_input)^2 Var(w) + n_in E(w)^2 Var(x_input) + n_in Var(w)Var(x_input), where E(x) and Var(x) denote the expected mean and variance of elements of x, respectively. For similar length scales across layers, Var(x_output) should be constant. Initially, similar variances can be achieved by normalizing ANN inputs to have zero mean, so that E(x_input) = 0, and initializing weights so that E(w) = 0 and Var(w) = 1/n_in. However, parameters can shift during training, destabilizing learning. To compensate for parameter shift, popular normalization layers like batch normalization often impose E(x_input) = 0 and Var(x_input) = 1, relaxing the need for E(x_input) = 0 or E(w) = 0 at initialization. Nevertheless, training will still be sensitive to the length scale of trainable parameters.
There are a variety of popular weight initializers that adapt weights to ANN architecture. One of the oldest methods is LeCun initialization 917,927 , where weights are initialized with variance Var(w) = 1/n_in, which is argued to produce outputs with similar length scales in the previous paragraph. However, a similar argument can be made for initializing with Var(w) = 1/n_out to produce similar gradients at each layer during the backwards pass 1417 . As a compromise, Xavier initialization 1419 computes an average, Var(w) = 2/(n_in + n_out). However, adjusting weights for n_out is not necessary for adaptive optimizers like ADAM, which divide gradients by their length scales, unless gradients will vanish or explode. Finally, He initialization 20 doubles the variance of weights to Var(w) = 2/n_in, and is often used in ReLU networks to compensate for activation functions halving variances of their outputs 20, 1417, 1420 . Most trainable parameters are initialized from either a zero-centred Gaussian or uniform distribution. For convenience, the limits of such a uniform distribution are ±(3Var(w))^(1/2). Uniform initialization can outperform Gaussian initialization in DNNs due to Gaussian outliers harming learning 1417 . However, issues can be avoided by truncating Gaussian initialization, often to two standard deviations, and rescaling to its original variance. Some initializers are mainly used for RNNs. For example, orthogonal initialization 1421 often improves RNN training 1422 by reducing susceptibility to vanishing and exploding gradients. Similarly, identity initialization 1423,1424 can help RNNs to learn long-term dependencies. In most ANNs, biases are initialized with zeros. However, the forget gates of LSTMs are often initialized with ones to decrease forgetting at the start of training 1188 . Finally, the start states of most RNNs are initialized with zeros or other constants. However, random multivariate or trainable variable start states can improve performance 1425 .
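The sketch below expresses these initializers in NumPy for a dense weight matrix; fan counts are taken as arguments, and the function names are our own conventions.

```python
import numpy as np

def lecun_init(n_in, n_out):
    """LeCun: Var(w) = 1/n_in, preserving output length scales."""
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

def xavier_uniform_init(n_in, n_out):
    """Xavier: Var(w) = 2/(n_in + n_out); uniform limits are ±(3 Var(w))^(1/2)."""
    limit = np.sqrt(3.0 * 2.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He: Var(w) = 2/n_in, compensating for ReLU halving output variance."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```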
There are a variety of alternatives to initialization from random multivariates. Weight normalized 985 ANNs are a popular example of data-dependent initialization, where randomly initialized weight magnitudes and biases are chosen to counteract variances and means of an initial batch of data. Similarly, layer-sequential unit-variance (LSUV) initialization 1426 consists of orthogonal initialization followed by adjusting the magnitudes of weights to counteract variances of an initial batch of data. Other approaches standardize the norms of backpropagated gradients. For example, random walk initialization 1427 (RWI) finds scales for weights to prevent vanishing or exploding gradients in deep FNNs, albeit with varied success 1426 . Alternatively, MetaInit 1428 scales the magnitudes of randomly initialized weights to minimize changes in backpropagated gradients per iteration of gradient descent.
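A sketch of LSUV-style data-dependent scaling for a single dense layer, assuming weights have already been orthogonally initialized; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def lsuv_scale(weights, batch, tol=0.05, max_iters=10):
    """Rescale weights so dense layer outputs have unit variance on a batch."""
    for _ in range(max_iters):
        var = np.var(batch @ weights)
        if abs(var - 1.0) < tol:
            break
        weights = weights / np.sqrt(var)  # shrink or grow toward unit output variance
    return weights
```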

Regularization
There are a variety of regularization mechanisms 1429-1432 that modify learning algorithms to improve ANN performance. One of the most popular is LX regularization, which decays weights by adding a loss proportional to |θ_i|^X, weighted by λ_X, for each trainable variable, θ_i. L2 regularization 1433-1435 is preferred 1436 for most DNN optimization as subtraction of its gradient, ∂_{θ_i} L_2 = λ_2 θ_i, is equivalent to computationally-efficient multiplicative weight decay. Nevertheless, L1 regularization is better at inducing model sparsity 1437 than L2 regularization, and L1 regularization achieves higher performance in some applications 1438 . Higher performance can also be achieved by adding both L1 and L2 regularization in elastic nets 1439 . LX regularization is most effective at the start of training and becomes less important near convergence 1433 . Finally, L1 and L2 regularization are closely related to lasso 1440 and ridge 1441 regularization, respectively, whereby trainable parameters are adjusted to limit L1 and L2 penalties.

Gradient clipping 1313, 1442-1444 accelerates learning by limiting large gradients, and is most commonly applied to RNNs. A simple approach is to clip gradient magnitudes to a threshold hyperparameter. However, it is more common to scale gradients, g_i, at layer i if their norm is above a threshold, u, so that g_i → u g_i / ||g_i||_n if ||g_i||_n > u 1185,1443 , where n = 2 is often chosen to minimize computation. Similarly, gradients can be clipped if they are above a global norm computed with gradients at all L layers. Scaling gradient norms is often preferable to clipping to a threshold as scaling is akin to adapting layer learning rates and does not affect the directions of gradients. Thresholds for gradient clipping are often set based on average norms of backpropagated gradients during preliminary training 1445 . However, thresholds can also be set automatically and adaptively 1312,1313 . In addition, adaptive gradient clipping algorithms can skip training iterations if gradient norms are anomalously high 1446 , which often indicates an imminent gradient explosion.

Dropout 1447-1451 often reduces overfitting by only using a fraction, p_i, of layer i outputs during training, and multiplying all outputs by p_i for inference; a minimal sketch follows this paragraph. However, dropout often increases training time, can be sensitive to p_i, and sometimes lowers performance 1452 . Improvements to dropout at the structural level, such as applying it to convolutional channels, paths, and layers, rather than random output elements, can improve performance 1453 . For example, DropBlock 1454 improves performance by dropping contiguous regions of feature maps to prevent dropout being trivially circumvented by using spatially correlated neighbouring outputs. Similarly, PatchUp 1455 swaps or mixes contiguous regions with regions for another sample. Dropout is often outperformed by Shakeout 1456,1457 , a modification of dropout that randomly enhances or reverses contributions of outputs to the next layer.
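A minimal NumPy sketch of the dropout convention described above, where training keeps a fraction p of outputs and inference scales all outputs by p; the function names are our own.

```python
import numpy as np

def dropout_train(x, p_keep=0.8):
    """Training: randomly zero a fraction (1 - p_keep) of layer outputs."""
    mask = np.random.rand(*x.shape) < p_keep
    return x * mask

def dropout_infer(x, p_keep=0.8):
    """Inference: scale all outputs by p_keep to match training expectations."""
    return x * p_keep
```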
Noise often enhances ANN training by decreasing susceptibility to spurious local minima 1458 . Adding noise to trainable parameters can improve generalization 1459,1460 , or exploration for RL 1381 . Parameter noise is usually additive as it does not change an objective function being learned, whereas multiplicative noise can change the objective 1461 . In addition, noise can be added to inputs 1230,1462 , hidden layers 1134, 1463 , generated outputs 1464 or target outputs 971,1465 . However, adding noise to signals does not always improve performance 1194 . Finally, augmenting the usual stochastic gradient noise 1466 by adding extra noise to gradients can improve performance 1467 . Typically, additive noise is annealed throughout training, so that final training is with a noiseless model that will be used for inference.
There are a variety of regularization mechanisms that exploit extra training data. A simple approach is to create extra training examples by data augmentation 1468-1470 . Extra training data can also be curated, or simulated for training by domain adaptation 1153-1155 . Alternatively, semi-supervised learning 1471-1476 can generate target outputs for a dataset of unpaired inputs to augment training with a dataset of paired inputs and target outputs. Finally, multitask learning 1477-1481 can improve performance by introducing additional loss functions; for instance, by adding an auxiliary classifier to predict image labels from features generated by intermediate DNN layers 1482-1485 . Losses are often manually balanced; however, their gradients can also be balanced automatically and adaptively 1143, 1144 .

Data Pipeline
A data pipeline prepares data to be input to an ANN. Most data pipelines are parallelized across multiple CPU cores 1486 . Small datasets can be stored in RAM to decrease data access times, whereas elements of large datasets are often loaded from files. Loaded data can then be preprocessed and augmented 1468,1469, 1487-1489 . For electron micrographs, preprocessing often includes replacing non-finite elements, such as NaN and inf, with finite values; linearly transforming intensities to a common range, such as [−1, 1], or to zero mean and unit variance; and performing a random combination of flips and 90° rotations to augment data by a factor of eight 66,192,193,220,337 . Preprocessed examples can then be combined into batches. Typically, multiple batches that are ready to be input are prefetched and stored in RAM to avoid delays due to fluctuating CPU performance.
To efficiently utilize data, training datasets are often reiterated over for multiple training epochs; usually, about 10^2 times. Increasing epochs can maximize utilization of potentially expensive training data; however, it can also lower performance due to overfitting 1490,1491 or be too computationally expensive 527 . Naively, batches of data can be randomly sampled with replacement during training by gradient descent. However, convergence can be accelerated by reinitializing a training dataset at the start of each training epoch and randomly sampling data without replacement 1492-1496 . Most modern DLFs, such as TensorFlow, provide efficient and easy-to-use functions to control data sampling 1497 .
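A minimal TensorFlow input pipeline sketching these steps is shown below; `images` is a placeholder for an array of micrographs, and the preprocessing constants are illustrative.

```python
import tensorflow as tf

def preprocess(image):
    """Replace non-finite values, scale to [-1, 1], and randomly augment."""
    image = tf.where(tf.math.is_finite(image), image, tf.zeros_like(image))
    image = 2.0 * (image - tf.reduce_min(image)) / (
        tf.reduce_max(image) - tf.reduce_min(image)) - 1.0
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, tf.int32))
    return tf.image.random_flip_left_right(image)

dataset = (tf.data.Dataset.from_tensor_slices(images)  # `images` is a placeholder
           .shuffle(buffer_size=1000)                  # resample without replacement each epoch
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                # keep batches ready in RAM
```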

Model Evaluation
There are a variety of methods for ANN performance evaluation 526 . However, most ANNs are evaluated by 1-fold validation, where a dataset is partitioned into training, validation, and test sets. After ANN optimization with a training set, ability to generalize is measured with a validation set. Multiple validations may be performed for training with early stopping 1490,1491 or for ANN learning policy and architecture selection, so final performance is often measured with a test set to avoid overfitting to the validation set. Most researchers favor using single training, validation, and test sets to simplify standardization of performance benchmarks 220 . However, multiple-fold validation 526 or multiple validation sets 1498 can improve performance characterization. Alternatively, models can be bootstrap aggregated 1499 (bagged) from multiple models trained on different subsets of training data. Bagging is usually applied to random forests 1500-1502 or other lightweight models, and enables model uncertainty to be gauged from the variance of model outputs.
For small datasets, model performance is often sensitive to the split of data between training and validation sets 1503 . Increasing training set size usually increases model accuracy, whereas increasing validation set size decreases performance uncertainty. Indeed, a scaling law can be used to estimate an optimal tradeoff 1504 between training and validation set sizes. However, most experimenters follow a Pareto 1505 splitting heuristic; for example, we often use a 75:15:10 training-validation-test split 220 . Heuristic splitting is justified for ANN training with large datasets insofar as sensitivity to splitting ratios decreases with increasing dataset size 2 .
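As a concrete illustration of this heuristic, a sketch that shuffles and partitions a NumPy dataset with the 75:15:10 ratios quoted above; the function name and seed are illustrative.

```python
import numpy as np

def split_dataset(data, ratios=(0.75, 0.15, 0.10), seed=0):
    """Shuffle and partition data into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_train = int(ratios[0] * len(data))
    n_val = int(ratios[1] * len(data))
    return (data[indices[:n_train]],
            data[indices[n_train:n_train + n_val]],
            data[indices[n_train + n_val:]])
```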

Deployment
If an ANN is deployed 1506-1508 on multiple different devices, such as various electron microscopes, a separate model can be trained for each device 391 . Alternatively, a single model can be trained and specialized for different devices to decrease training requirements 1509 . In addition, ANNs can remotely service requests from cloud containers 1510-1512 . Integration of multiple ANNs can be complicated by different servers for different DLFs supporting different backends; however, unified interfaces are available. For example, GraphPipe 1513 provides simple, efficient reference model servers for TensorFlow, Caffe2, and ONNX; a minimalist machine learning transport specification based on FlatBuffers 1514 ; and efficient client implementations in Go, Python, and Java. In 2020, most ANNs developed by researchers were not deployed. However, we anticipate that deployment will become a more prominent consideration as the role of deep learning in electron microscopy matures.
Most ANNs are optimized for inference by minimizing parameters and operations at design time, like MobileNets 1070 . However, less essential operations can also be pruned after training 1515,1516 . Another approach is quantization, where ANN bit depths are decreased, often to efficient integer instructions, to increase inference throughput 1517,1518 . Quantization often decreases performance; however, the amount of quantization can be adapted to ANN components to optimize performance-throughput tradeoffs 1519 . Alternatively, training can be modified to minimize the impact of quantization on performance 1520-1522 . Another approach is to specialize bit manipulation for deep learning. For example, signed brain floating point (bfloat16) often improves accuracy on TPUs by using an 8 bit exponent and 7 bit mantissa, rather than the usual 5 bit exponent and 10 bit mantissa of half-precision floats 1523 . Finally, ANNs can be adaptively selected from a set of ANNs based on available resources to balance tradeoffs of performance and inference time 1524 , similar to image optimization for web applications 1525,1526 .
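As an example of post-training quantization, a TensorFlow Lite conversion sketch is shown below; the saved-model path and output filename are placeholders.

```python
import tensorflow as tf

# Convert a trained SavedModel to a quantized TFLite model ("saved_model_dir" is a placeholder).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```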

Interpretation
We observe that some electron microscopists are apprehensive about working with ANNs due to a lack of interpretability, irrespective of rigorous ANN validation. We try to address uncertainty by providing loss visualizations in some of our electron microscopy papers 66,192,193 . However, there are a variety of popular approaches to explainable artificial intelligence [1528][1529][1530][1531][1532][1533][1534] (XAI). One of the most popular approaches to XAI is saliency, where gradients of outputs w.r.t. inputs correlate with their importance. Saliency is often computed by gradient backpropagation [1535][1536][1537] . For example, with Grad-CAM 1538 or its variants [1539][1540][1541][1542] . Alternatively, saliency can be predicted by ANNs 1030, 1543-1545 or a variety of methods inspired by Grad-CAM [1546][1547][1548] . Saliency maps can be exploited to select features 1549 .
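A minimal gradient-based saliency sketch in TensorFlow follows; `model` and `image` are placeholders for a trained classifier and an input batch.

```python
import tensorflow as tf

def saliency_map(model, image):
    """Gradient of the top class score w.r.t. input pixels as importance."""
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        predictions = model(image)
        top_class = tf.reduce_max(predictions, axis=-1)
    grads = tape.gradient(top_class, image)
    return tf.reduce_max(tf.abs(grads), axis=-1)  # collapse channels to a 2D map
```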

There are a variety of other approaches to XAI. For example, feature visualization via optimization 1527,[1550][1551][1552][1553] can find inputs that maximally activate neurons, as shown in fig. 16. Another approach is to cluster features, e.g. by tSNE 1554,1555 with the Barnes-Hut algorithm 1556,1557 , and examine corresponding clustering of inputs or outputs 220 . More simply, researchers can view raw features and gradients during forward and backward passes of gradient descent, respectively. For example, CNN explainer 1558,1559 is an interactive visualization tool designed for non-experts to learn and experiment with CNNs. Similarly, GAN Lab 1560 is an interactive visualization tool for GANs.

Discussion
We introduced a variety of electron microscopy applications in section 1 that have been enabled or enhanced by deep learning. However, the greatest benefit of deep learning in electron microscopy may be general-purpose tools that enable researchers to be more effective. Search engines based on deep learning are almost essential to navigate an ever-increasing number of scientific publications 685 . Further, machine learning can enhance communication by filtering spam and phishing attacks [1561][1562][1563] , and by summarizing [1564][1565][1566] and classifying 1031, 1567-1569 scientific documents. In addition, machine learning can be applied to education to automate and standardize scoring [1570][1571][1572][1573] , detect plagiarism [1574][1575][1576] , and identify at-risk students 1577 .
Creative applications of deep learning 1578,1579 include making new art by style transfer 1020,1149,1580,1581 , composing music 1582-1584 , and storytelling 1585,1586 . Similar DNNs can assist programmers 1587,1588 ; for example, by predictive source code completion 1589-1594 , and by generating source code to map inputs to target outputs 1595 or from labels describing desired source code 1596 . Text generating DNNs can also help write scientific papers; for example, by drafting scientific passages 1597 or drafting part of a paper from a list of references 1598 . Papers generated by early prototypes for automatic scientific paper generators, such as SciGen 1599 , are realistic insofar as they have been accepted by scientific venues.
An emerging application of deep learning is mining scientific resources to make new scientific discoveries 1600 . Artificial agents are able to effectively distil latent scientific knowledge as they can parallelize examination of huge amounts of data, whereas information access by humans 1601-1603 is limited by human cognition 1604 . High bandwidth bi-directional brain-machine interfaces are being developed to overcome limitations of human cognition 1605 ; however, they are in the early stages of development and we expect that they will depend on substantial advances in machine learning to enhance control of cognition. Eventually, we expect that ANNs will be used as scientific oracles, such that researchers who do not rely on their services will no longer be able to compete. For example, an ANN trained on a large corpus of scientific literature predicted multiple advances in materials science before they were reported 1606 . ANNs are already used for financial asset management 1607,1608 and recruiting 1609-1612 , so we anticipate that artificial scientific oracle consultation will become an important part of scientific grant 1613, 1614 reviews.
A limitation of deep learning is that it can introduce new issues. For example, DNNs are often susceptible to adversarial attacks 1615-1619 , where small perturbations to inputs cause large errors. Nevertheless, training can be modified to improve robustness to adversarial attacks 1620-1623 . Another potential issue is architecture-specific systematic errors. For example, CNNs often exhibit structured systematic error variation 66,192,193,1068,1069,1624 , including higher errors nearer output edges 66,192,193 . However, structured systematic error variation can be minimized by GANs incentivizing the generation of realistic outputs 192 . Finally, ANNs can be difficult to use as they often require downloading code with undocumented dependencies, downloading a pretrained model, and may require hardware accelerators. These issues can be avoided by serving ANNs from cloud containers. However, it may not be practical for academics to acquire funding to cover cloud service costs.
Perhaps the most important aspect of deep learning in electron microscopy is that it presents new challenges that can lead to advances in machine learning. Simple benchmarks like CIFAR-10 550, 551 and MNIST 552 have been solved. Following, more difficult benchmarks like Fashion-MNIST 1625 have been introduced. However, they only partially address issues with solved datasets as they do not present fundamentally new challenges. In contrast, we believe that new problems often invite new solutions. For example, we developed adaptive learning rate clipping 251 to stabilize training of DNNs for partial scanning transmission electron microscopy 192 . The challenge was that we wanted to train a large model for high-resolution images; however, training was unstable with the small batches needed to fit the model in GPU memory. Similar challenges abound and can lead to advances in both machine learning and electron microscopy.

20. He, K., Zhang, X., Ren