Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning

Compressed sensing is applied to scanning transmission electron microscopy to decrease electron dose and scan time. However, established methods use static sampling strategies that do not adapt to samples. We have extended recurrent deterministic policy gradients to train deep LSTMs and differentiable neural computers to adaptively sample scan path segments. Recurrent agents cooperate with a convolutional generator to complete partial scans. We show that our approach outperforms established algorithms based on spiral scans, and we expect our results to be generalizable to other scan systems. Source code, pretrained models and training data are available at https://github.com/Jeffrey-Ede/Adaptive-Partial-STEM.


Introduction
Most scan systems sample signals at sequences of discrete probing locations. Examples include atomic force microscopy 1 , computerized axial tomography 2, 3 , electron backscatter diffraction 4 , scanning electron microscopy 5 , scanning Raman spectroscopy 6 , scanning transmission electron microscopy 7 (STEM) and X-ray diffraction spectroscopy 8 . In STEM, the high current density of electron probes produces radiation damage in many materials, limiting the range and type of investigations that can be performed 9,10 . In addition, most STEM signals are oversampled 11 to ease inspection and decrease sub-Nyquist artefacts 12 . As a result, compressed sensing 13 algorithms have been developed to decrease STEM probing. In this paper, we introduce a new approach to STEM compressed sensing where a scan system learns to adapt partial scans 14 to samples by reinforcement learning 15 (RL).
Established compressed sensing strategies include random sampling [16][17][18] , uniformly spaced sampling 17,[19][20][21] , sampling based on a model of a sample 22,23 , partial scans with fixed paths 14 , dynamic sampling to minimize entropy [24][25][26][27] and dynamic sampling based on supervised learning 28 . Complete signals can be extrapolated from partial scans by an infilling algorithm, by estimating their fast Fourier transforms 29 , or by inference with an artificial neural network 14,21 (ANN). The best sampling strategy varies; for example, uniformly spaced sampling is often better than spiral paths for oversampled STEM images 14 . However, hand-crafted strategies have a limited ability to leverage a physical understanding to optimize sampling. As proposed 14 , we have therefore developed ANNs to adapt scan paths to signals. This is motivated by the universal approximation theorem 30 , which shows that ANNs can represent 31 the best sampling strategy to arbitrary accuracy.
Exploration of STEM images is a finite-horizon partially observed Markov decision process 32,33 (MDP) with sparse losses. A partial scan can be constructed from path segments sampled at each step and a loss is based on the accuracy of a completion generated from the partial scan. Most scan systems support custom scan paths or can be augmented with a field programmable gate array 34,35 (FPGA) to support custom scan paths. However, there is a delay before a scan system can execute or is ready to receive a new command. Total delay can be reduced by using fewer steps with larger path segments. Decreasing steps could also reduce distortions due to errors in probing positions 34 . In addition, command executions could be delayed by ANN inference. However, delay can be minimized by using a lightweight ANN or by inferring commands while previous commands are executing.
MDPs can be optimized by recurrent neural networks (RNNs) based on long short-term memory 36,37 (LSTM), gated recurrent unit 38 (GRU) or other cells. LSTMs and GRUs are popular as they solve the vanishing gradient problem 39 and have consistently high performance 40 . Small RNNs are computationally inexpensive and are often applied to MDPs as they can learn to extract and remember state information to inform future decisions. To solve dynamic graphs, an RNN can be augmented with dynamic external memory to create a differentiable neural computer 41 (DNC). A discounted future loss, $Q_t$, at step $t$ in an MDP with $T$ steps can be calculated from step losses, $L_t$, by Bellman's equation,
$$Q_t = \sum_{t'=t}^{T} \gamma^{t'-t} L_{t'},$$
where γ ∈ [0, 1) discounts future losses. RL equations are often presented in terms of rewards, $r_t = -L_t$; however, losses are an equivalent representation that avoids complicating our calculations with minus signs. Loss backpropagation through time 42 (BPTT) enables RNNs to be trained by gradient descent 43 . However, losses are not differentiable in some applications, such as ... 46 and playing score-based computer games 47,48 . Nevertheless, these losses can be backpropagated to agent parameters by sampling actions from a differentiable probability distribution 44,49 , or by introducing a differentiable surrogate 50 or critic 51 to predict losses that can be backpropagated. Alternatives to gradient descent, such as simulated annealing 52 and evolutionary 53 algorithms, can also optimize agents for non-differentiable loss functions. However, gradient descent typically achieves better results with less computation for large ANNs.
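As a minimal sketch of the discounting described by Bellman's equation, the following Python function accumulates step losses backwards in time. The function name and the example values are hypothetical and are not taken from our source code.

```python
def discounted_future_losses(step_losses, gamma):
    """Return the discounted sum of current and future losses for every step.

    The value at step t is sum over t' >= t of gamma**(t' - t) * L_t'.
    """
    future = []
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later steps.
    for loss in reversed(step_losses):
        running = loss + gamma * running
        future.append(running)
    return future[::-1]


# Sparse loss that only arrives at the final step (illustrative values).
print(discounted_future_losses([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With a sparse final loss, earlier steps only see that loss after it has been discounted once per intervening step.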

Training
In this section, we outline our training environment, ANN architecture and learning policy. Our ANNs were developed in Python with TensorFlow 54 . Detailed architecture and learning policy are in supplementary information. In addition, source code and pretrained models are available via GitHub 55 , and training data is available 11, 56 .

Figure 1. An abstract 8×8 partial scan with T = 5 straight path segments. Each segment has P = 3 probing positions separated by d = 2^{1/2} px and their starts are labelled by step numbers, t. Partial scans are selected from STEM images by sampling pixels nearest probing positions.

Environment
To create partial scans from STEM images, an actor, µ, infers L2 normalized vectors, $\mu(h_t)$, based on a history, $h_t = (o_1, a_1, ..., o_{t-1}, a_{t-1})$, of previous actions, a, and observations, o. To encourage exploration, $\mu(h_t)$ is rotated to $a_t$ by Ornstein-Uhlenbeck (O-U) exploration noise, $\epsilon_t$,
$$a_t = \begin{pmatrix} \cos\epsilon_t & -\sin\epsilon_t \\ \sin\epsilon_t & \cos\epsilon_t \end{pmatrix} \mu(h_t),$$
$$\epsilon_t = \epsilon_{t-1} + \theta(\epsilon_{avg} - \epsilon_{t-1}) + \sigma W,$$
where we chose θ = 0.1 to decay noise to $\epsilon_{avg} = 0$, σ = 0.2 to scale a standard normal distributed Wiener variate, W, and $\epsilon_0 = 0$. O-U noise is linearly decayed to zero throughout training. Correlated O-U exploration noise is recommended for continuous control tasks optimized by deep deterministic policy gradients 47 (DDPG) and recurrent deterministic policy gradients 48 (RDPG). Nonetheless, follow-up experiments with TD3 59 and D4PG 60 have found that uncorrelated Gaussian noise can produce similar results. An action, $a_t$, is the direction to move to observe a path segment, $o_t$, relative to the position at the end of the previous segment. Partial scans are constructed from complete histories of actions and observations, $h_T$. A simplified partial scan is shown in fig. 1. In our experiments, partial scans, s, are constructed from T = 20 straight path segments selected from 96×96 STEM images. Each segment has P = 20 probing positions separated by d = 2^{1/2} px and positions can be outside an image. The pixels in the image nearest each probing position are sampled, so a separation of d ≥ 2^{1/2} prevents probing positions in a segment from sampling the same pixel. A separation of d < 2^{1/2} would allow a pixel to be sampled more than once by moving diagonally, potentially incentivizing orthogonal scan motion to sample more pixels.
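The environment described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than our implementation: the O-U update uses the standard discretized form, probing positions are assumed to start one separation after the segment start, positions outside the image are assumed to be skipped, and all function names are hypothetical.

```python
import numpy as np


def ou_step(eps_prev, rng, theta=0.1, eps_avg=0.0, sigma=0.2):
    """One step of discretized O-U noise (standard form, assumed here)."""
    return eps_prev + theta * (eps_avg - eps_prev) + sigma * rng.standard_normal()


def rotate(unit_vector, angle):
    """Rotate a 2D vector anticlockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ np.asarray(unit_vector, dtype=float)


def sample_segment(image, start, direction, num_positions=20, separation=2 ** 0.5):
    """Sample the pixels nearest to probing positions on a straight segment.

    Returns pixel indices inside the image, their values, and the end position.
    Coordinates are (row, column); the direction is normalized before use.
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    steps = separation * np.arange(1, num_positions + 1)[:, None]
    positions = np.asarray(start, dtype=float) + steps * direction
    pixels = np.rint(positions).astype(int)
    # Assumption: probing positions outside the image yield no observation.
    inside = np.all((pixels >= 0) & (pixels < np.array(image.shape)), axis=1)
    rows, cols = pixels[inside].T
    return pixels[inside], image[rows, cols], positions[-1]


# Example: perturb an actor output with exploration noise, then observe a segment.
rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(96, 96))
eps = ou_step(0.0, rng)
action = rotate([1.0, 0.0], eps)
pixels, values, end = sample_segment(image, start=(48.0, 48.0), direction=action)
```

Because the separation is at least 2^{1/2} px, rounding to the nearest pixel cannot map two positions in the same segment to the same pixel.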
Selecting a subset of STEM image pixels to be partial scans to train ANNs for compressed sensing follows earlier work 14,21,61 . It is cheaper and more practical than preparing a large, carefully partitioned and representative dataset 62,63 containing partial scan and full image pairs, and selected pixels have realistic noise characteristics as they are from experimental images. Nevertheless, selecting a subset of pixels does not account for probing location errors varying with scan shape 34 . We use publicly available datasets containing 19769 32-bit 96×96 images cropped or downsampled from full images 11,56 . Cropped images were blurred by a symmetric 5×5 Gaussian kernel with a 2.5 px standard deviation to decrease any training loss variation due to varying noise characteristics. Finally, images, I, were linearly transformed to normalized images, $I_N$, with minimum and maximum values of −1 and 1, respectively. To test performance, images were split, without pre-shuffling, into training sets containing 15815 images and test sets containing 3954 images. Details of and scripts used to prepare datasets are available with both static and interactive dataset visualizations 11 .
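The linear transformation to normalized images can be illustrated as follows. This is a sketch with a hypothetical function name, and the handling of constant images is an assumption added to avoid division by zero.

```python
import numpy as np


def normalize_image(image):
    """Linearly map intensities so that the minimum is -1 and the maximum is 1."""
    image = np.asarray(image, dtype=np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Assumption: a constant image carries no contrast, so map it to zeros.
        return np.zeros_like(image)
    return 2.0 * (image - lo) / (hi - lo) - 1.0


normalized = normalize_image([[0.0, 5.0], [10.0, 20.0]])
```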

Architecture
Training configurations of actor, µ, target actor, µ′, critic, Q, target critic, Q′, and generator, G, networks are shown in fig. 2. Our actor and critic are computationally inexpensive deep LSTMs 64 or DNCs to minimize latency, and our generator is a convolutional neural network 65,66 (CNN). As shown in fig. 2a, a recurrent actor selects sequences of actions and path segments that are added to an experience replay 67 , R, containing 25000 sequences. Partial scans, s, are constructed from histories sampled from the replay to train a generator, shown in fig. 2b, to complete partial scans, $I_G^i = G(s^i)$. A new experience is added to the replay once every four training iterations. The actor and generator cooperate to minimize generator losses, $L_G$, and are the only networks needed for inference.
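A fixed-capacity experience replay of whole sequences could be sketched as below. The class name, first-in-first-out eviction and uniform sampling are assumptions made for illustration; only the capacity of 25000 sequences is stated above.

```python
import random
from collections import deque


class SequenceReplay:
    """Fixed-capacity store of complete action and observation histories."""

    def __init__(self, capacity=25000):
        # Assumption: the oldest history is discarded when the replay is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, history):
        self.buffer.append(history)

    def sample(self, batch_size, rng=random):
        # Assumption: histories are sampled uniformly without replacement.
        return rng.sample(list(self.buffer), batch_size)


replay = SequenceReplay(capacity=3)
for history in ["h0", "h1", "h2", "h3", "h4"]:
    replay.add(history)
batch = replay.sample(2)
```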
Generator losses are not differentiable w.r.t. actions used to render partial scans; ∂ L G /∂ a t = 0. Similar to RDPG 48 , we therefore introduce recurrent critics to predict losses that can be backpropagated to actors, as shown in fig. 2c. Actors and critics have the same architecture, except actors have two outputs for actions whereas critics have one output for losses. Target networks 47, 68 track live actor and critic networks to stabilize learning. In RDPG, live and target ANNs separately replay experiences. However, we propagate target ANN states to live ANNs as target states are more stable than live states, it models inference with a target actor, and it does not require additional computation.

Learning Policy
To train actors to cooperate with a generator to complete partial scans, we developed cooperative recurrent deterministic policy gradients (CRDPG) (algorithm 1). This is an extension of RDPG to an actor that cooperates with another ANN to minimize its loss. We train our networks by ADAM 69 optimized gradient descent for $M = 10^6$ iterations with a batch size, N, of 32. We use constant learning rates $\eta_\mu = 0.0007$ and $\eta_Q = 0.0010$ for the actor and critic, respectively. For the generator, we use an exponentially decayed cyclic 70 learning rate, where m ∈ [0, M] is the iteration number, c = M/9 is the cycle period, and x mod y is the remainder of the division of x by y.
Training takes one day with an i7-6700 CPU and a GTX 1080 Ti GPU. The generator learns to minimize mean squared errors (MSEs), $L_G$, between scan completions, $G(s')$, and normalized target images, $I_N'$. Similar to our earlier work 14, 21, 61, 71 , we apply a random combination of flips and 90° rotations, mapping $s \to s'$ and $I_N \to I_N'$, to augment training data by a factor of eight. Following Mnih et al 68 , we restrict loss support by clipping losses to 4, the maximum possible MSE for generated intensities in [−1, 1],
$$L_G = \min(\mathrm{MSE}(G(s'), I_N'), 4).$$
Generator losses decrease as performance improves, and they can change due to loss spikes 61 , learning rate oscillations 70 or other phenomena. Normalizing losses can improve RL 72 , so we divide generator losses for actor training by their running mean,
$$L_{avg} \rightarrow \beta_L L_{avg} + (1 - \beta_L) L_G,$$
where we chose $\beta_L = 0.997$ and $L_G \rightarrow L_G / L_{avg}$. Heuristically, an optimal policy does not go over image edges as there is no information there in our training environment. To accelerate convergence, we therefore added a small loss penalty, $E_t = 0.1$, at step $t$ if an action results in a probing position being over an image edge. The total loss at each step is
$$L_t = E_t + \delta(t - T) L_G,$$
where the Dirac delta function, δ, adds the sparse normalized generator loss at the final step, T.

Algorithm 1. Cooperative recurrent deterministic policy gradients (CRDPG).
    ...
        Append observation and previous action to history, $h_t$.
        Select action, $a_t$, by computing $\mu(h_t)$ and applying exploration noise, $\epsilon_t$.
    end for
    ...
    Compute step losses, $(L_1^i, ..., L_T^i)$, from generator losses, $L_G^i$, and over edge losses, $E_t^i$, where δ is the Dirac delta function.
    Compute target values, $(y_1^i, ..., y_T^i)$, using recurrent target networks, where α ∈ [0, 1] weights the contribution of supervised and reinforcement losses.
    Compute critic update (using BPTT).
    Compute actor update (using BPTT).
    Compute generator update.
    Update the actor, critic and generator by gradient descent.
    Update the target networks and average generator loss.
end for
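The loss shaping described above can be illustrated with a short sketch that clips generator MSEs, tracks their running mean, and builds per-step losses from edge penalties and the sparse final loss. Function and class names are hypothetical, and the initial running mean of 1.0 is an assumption that is not stated in the text.

```python
import numpy as np


def clipped_mse(completion, target, max_loss=4.0):
    """Mean squared error clipped to the largest MSE possible in [-1, 1]."""
    mse = float(np.mean((np.asarray(completion) - np.asarray(target)) ** 2))
    return min(mse, max_loss)


class RunningMean:
    """Exponential running mean used to normalize generator losses."""

    def __init__(self, beta=0.997, initial=1.0):  # `initial` is an assumption
        self.beta = beta
        self.value = initial

    def update(self, loss):
        self.value = self.beta * self.value + (1.0 - self.beta) * loss
        return self.value


def step_losses(generator_loss, loss_avg, over_edge, edge_penalty=0.1):
    """Edge penalty at each offending step, plus the normalized final loss."""
    losses = [edge_penalty if over else 0.0 for over in over_edge]
    losses[-1] += generator_loss / loss_avg
    return losses


# Illustrative values: three steps, with the second step going over an edge.
print(step_losses(0.5, 0.25, [False, True, False]))  # [0.0, 0.1, 2.0]
```

Dividing by the running mean keeps the final-step loss near unity as the generator improves, so the fixed edge penalty keeps a comparable relative weight throughout training.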
To estimate discounted future losses, $Q_t^{rl}$, for RL, we use a target actor and critic,
$$Q_t^{rl} = L_t + \gamma Q'(h_{t+1}, \mu'(h_{t+1})) \text{ for } t < T, \quad Q_T^{rl} = L_T,$$
where we chose γ = 0.97. Target networks stabilize learning and decrease policy oscillations [73][74][75] . Our target actor and critic have trainable parameters $\omega'$ and $\theta'$, respectively, that track live parameters, ω and θ, by soft updates 47 ,
$$\omega' \rightarrow \beta_\omega \omega' + (1 - \beta_\omega) \omega,$$
$$\theta' \rightarrow \beta_\theta \theta' + (1 - \beta_\theta) \theta,$$
where we chose $\beta_\omega = \beta_\theta = 0.9997$. We tried hard updates 68 , where target networks are periodic copies of live networks; however, we found that soft updates result in faster convergence and more stable training. Supervised losses, $Q_t^{super}$, can also be computed with Bellman's equation,
$$Q_t^{super} = \sum_{t'=t}^{T} \gamma^{t'-t} L_{t'}.$$
We found that minimizing $Q_t^{rl}$ results in lower final losses than $Q_t^{super}$. However, $Q_t^{super}$ resulted in faster convergence at the start of training, especially in early experiments before our learning policy was optimized. Model-free RL algorithms, such as Q-learning and its variants, often perform poorly in the early stages of training while critics unlearn biased estimates of state-action value functions 76 . As a result, we balance both reinforcement and supervised losses,
$$y_t = \alpha Q_t^{super} + (1 - \alpha) Q_t^{rl},$$
where $\alpha = \max(10^5 - m, 0)/10^5$ starts with supervised losses at m = 0 and linearly changes to reinforcement losses by $m = 10^5$.
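A minimal sketch of the soft target updates and of the linear schedule that balances supervised and reinforcement losses is given below. The helper names are hypothetical, and the mixing rule assumes that α weights supervised losses while 1 − α weights reinforcement losses, as implied by the schedule starting with supervised losses.

```python
def soft_update(target_params, live_params, beta=0.9997):
    """Move each target parameter a small fraction towards its live value."""
    return [beta * t + (1.0 - beta) * l for t, l in zip(target_params, live_params)]


def supervision_weight(iteration, decay_iterations=10 ** 5):
    """Weight that decays linearly from 1 at iteration 0 to 0, then stays at 0."""
    return max(decay_iterations - iteration, 0) / decay_iterations


def mixed_target(q_super, q_rl, alpha):
    """Blend supervised and reinforcement estimates of discounted future loss."""
    return alpha * q_super + (1.0 - alpha) * q_rl


# Halfway through the decay, both loss estimates contribute equally.
alpha = supervision_weight(50_000)
target = mixed_target(q_super=2.0, q_rl=4.0, alpha=alpha)
```

The same functions work elementwise if parameters or loss estimates are NumPy arrays rather than floats.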
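The critic learns to minimize mean squared differences, $L_Q$, between predicted and target losses and the actor learns to minimize losses, $L_\mu$, predicted by the critic:
$$L_Q = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_t^i - Q(h_t^i, a_t^i) \right)^2,$$
$$L_\mu = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} Q(h_t^i, \mu(h_t^i)).$$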

Experiments
In this section, we present examples of adaptive partial scans and select learning curves for architecture and learning policy experiments. Importantly, we show that adaptive scans outperform established methods that use static spiral scans. Additional sheets of examples for both adaptive and spiral scans, experiments, and test set errors for each experiment are in supplementary information.
Examples of 1/23.04 px coverage partial scans, target outputs and generator completions are shown in fig. 3 for 96×96 crops from test set STEM images. They show both adaptive and spiral scans after flips and rotations to augment data for the generator. The first action always selects a path segment from the middle of the image in the direction of a corner. Actors use the first observation, and following observations, to inform where to sample the remaining T − 1 = 19 path segments. Actors adapt scan paths to the environment. For example, if an image contains regular atoms, actors will cover a large area to see if there is a region where that changes. Alternatively, if an image contains continuous regions, actors explore near image edges and far away to find region boundaries.
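The 1/23.04 px coverage follows from the scan parameters given in the training environment: T = 20 segments with P = 20 probing positions each in a 96×96 image. The short check below uses a hypothetical function name, and the result is an upper bound on coverage because different segments can overlap or leave the image.

```python
def coverage_denominator(image_side=96, num_segments=20, positions_per_segment=20):
    """Return x such that at most 1/x of the image pixels are probed."""
    num_positions = num_segments * positions_per_segment  # 400 probing positions
    return image_side ** 2 / num_positions  # 9216 pixels / 400 positions


print(coverage_denominator())  # 23.04
```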
Learning curves for adaptive scans with an LSTM based actor and static spiral scans in fig. 4a show that adaptive scans outperform spirals. Spiral scans are an established method for compressed sensing and are a special case of adaptive scans. Spirals were created from the same straight path segments, starting from the centre of a STEM image, and are the largest spirals that fit in images. We also tried augmenting our LSTM with dynamic external memory to form a DNC. We thought that recording state information to external memory could reduce actor memory attenuation to improve navigation. However, we found that DNCs and LSTMs have similar performance in our experiments. Nevertheless, we expect that DNCs might outperform LSTMs on scans with more path segments.
Most STEM signals are imaged at several times their Nyquist rates 11 . To investigate adaptive STEM performance on signals imaged close to their Nyquist rates, we downsampled STEM images to 96×96. Learning curves in fig. 4b show that losses are lower for oversampled STEM crops. Next, we investigated if MSEs vary for training with different loss metrics by adding a Sobel loss, $L_S$, to generator losses. Our Sobel loss is computed from $S_x$ and $S_y$, which compute horizontal and vertical Sobel derivatives 77 , and we chose $\lambda_S = 0.2$ to weight its contribution to the total loss. Learning curves in fig. 4b show that adding a Sobel loss can decrease MSEs.

Loss normalization improves learning stability and decreases final errors in fig. 4c, similar to previous experiments where normalization improves learning 72 . Nevertheless, normalization does not guarantee stability. For example, losses for training with normalization increase near 2.5 × 10^5 iterations. We expect that training could be further improved by gradient clipping 39 , inputting remaining steps 80 and other refinements to architecture and learning policy.

To train actors by BPTT, we differentiate critic loss predictions w.r.t. actor parameters by the chain rule,
$$\frac{\partial L_\mu}{\partial \omega} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\partial Q(h_t^i, a_t^i)}{\partial \mu(h_t^i)} \frac{\partial \mu(h_t^i)}{\partial \omega}.$$
Differentiating w.r.t. actions computed during replays follows Spielberg's RDPG implementation 81 . However, $\partial Q(h_t^i, a_t^i)/\partial \mu(h_t^i)$ is replaced with a derivative w.r.t. replayed actions, $\partial Q(h_t^i, a_t^i)/\partial a_t^i$ ...

Discussion
To train adaptive scan systems to outperform established methods based on static spiral scans, we developed CRDPG. This is an extension of RDPG 48 , which is based on DDPG 47 . However, alternatives to DDPG, such as TD3 59 and D4PG 60 , arguably achieve higher performance, and we expect they could form the basis of a future training algorithm. In addition, we expect that architecture and learning policy could be improved by AdaNet 82 , Ludwig 83 , or other automatic machine learning 84 algorithms.
In particular, adaptive scan losses are decreasing at the end of our experiments, so we expect that performance could be improved by increasing the number of training iterations.
Our scan systems sample straight path segments that cannot go over image edges. Straight segments simplify development. Nevertheless, actors could learn to output additional parameters to describe curves, multiple successive path segments, or sequences of discontinuous probing positions. Actions could also be restricted, for example, to avoid actions that may cause high probing position errors. Training environments could be modified to allow actors to sample pixels over image edges by loading images larger than partial scan regions. This would model adaptive scans where the actor is allowed to sample pixels outside a scan region, which could improve performance. However, using larger images would increase data loading and processing time.
We expect the main limitation of experimental adaptive partial STEM to be distortions caused by probing position errors. Errors depend on scan shapes 34 and accumulate for each path segment. Non-linear scan distortions can be corrected by comparing pairs of orthogonal raster scans 85,86 , and we expect this method can be extended to partial scans. However, orthogonal scans complicate measurement by restricting scan paths to two half scans to avoid doubling electron dose on beam-sensitive materials. This is an unwanted restriction and iterative corrections based on image pairs are unsuitable for live applications. As a result, we propose that the generator should be trained to correct distortions. Another limitation is that our generators do not learn to remove STEM noise 87 . However, we expect that generators can learn to remove noise from single noisy examples 88 .
We propose that a cyclic generator 89 could learn to correct distortions by translating between partial scans and raster scans. A detailed method is provided in supplementary information. This may be the most practical approach as it uses unpaired raster and partial scans. Moreover, partial scans could be generated from raster scans by applying simulated distortion fields. Another approach is training an RNN to predict position errors based on an understanding of scan system dynamics. However, we believe this approach is less practical as it would be specific to a scan system and any errors in probing position error predictions would accumulate for each segment.
Not all scan systems support non-raster scan paths. However, most scan controllers can be augmented with an FPGA to perform custom scans 34,35 . Recent versions of Gatan Digital Micrograph support Python 90 , so our Python/TensorFlow based ANNs can be directly applied to scan systems. Alternatively, an actor could be synthesized on the scan controlling FPGA 91,92 to minimize latency. There could be hundreds of path segments in a partial scan, so lightweight and parallelizable actors are essential to minimize latency. As a result, we have developed actors based on computationally inexpensive RNNs, which can remember state information to inform decisions. Alternatively, a partial scan could be updated at each step for a CNN based actor to infer actions. However, a CNN is less practical than an RNN as most CNNs require more computation.

Conclusions
We have developed CRDPG to train actors to cooperate with generators to complete STEM images from adaptive scans. Our approach outperforms established methods based on static spiral scans. We expect adaptive scans to decrease scan time and enable new beam-sensitive applications. As a result, we have made our source code, pretrained models, training datasets, and details of experiments available to encourage further investigation. We expect our results to be generalizable to scan systems in all areas of science and technology.

Figure S1. Actor, critic and generator architecture. a) An actor outputs action vectors whereas a critic predicts losses. Dashed lines are for extra components in a DNC. b) A convolutional generator completes partial scans.
Fully Connected: A dense layer linearly connects inputs to outputs. Weights are initialized from a truncated normal distribution and there are no biases.
The actor and critic cooperate with a convolutional generator, shown in fig. S1b, to complete partial scans. Our generator is constructed from convolutional layers 94 and skip-3 residual blocks 95 .
Conv d, wxw, Stride, x: Convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise.
Trans Conv d, wxw, Stride, x: Transposed convolutional layer with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise.
+ : Circled plus signs indicate residual connections where incoming tensors are added together. Residuals help reduce signal attenuation and allow a network to learn perturbative transformations more easily.
Convolutional layers are followed by ReLU 96 activation then batch normalization 97 . Residual connections are added between activation and batch normalization. Convolutional weights are Xavier 98 initialized and biases are zero initialized. We apply L2 regularization 99 to decay generator parameters by a proportion, λ = 10^{−5}, at each training step.

S2 Additional Experiments
In this section, we present additional learning curves for some of our architecture and learning policy experiments in fig. S2. Learning curves show that cyclic generator learning rates decrease losses, performances for ranges of architecture and learning policy hyperparameters, and the effect of optimizing a generator to minimize maximum loss regions. Test set errors for these experiments, and experiments in the main article, are tabulated in table S1.
Learning curves for both exponentially decayed and exponentially decayed cyclic 70 generator learning rate schedules are in fig. S2a. They show that multiplying by cyclic decay envelopes accelerates convergence and decreases final losses. Cyclic learning rates often improve training; however, they can also produce oscillations in ANN losses 70 . We were concerned that oscillations would destabilize training as actors learn to predict generator losses. Nevertheless, losses steadily decay for training with normalized generator losses.

Augmenting reward functions with subgoal based heuristic rewards can accelerate RL by making problems more tractable 100 . As a result, we add small losses when actors sample probing positions over image edges. Heuristically, samples at image edges yield less information as they have fewer neighbours. Edge losses accelerated convergence in early experiments, before architecture and learning policy were optimized. However, their benefit is less clear in later experiments shown in fig. S2b as actors can learn that edge pixels are less valuable. We find that adding a small penalty, E ≤ 0.1, for sampling pixels at image edges decreases errors, whereas larger penalties destabilize learning.
Actors are controlled by a two-layer LSTM with n h = 128 hidden units in each cell. To accelerate convergence and decrease computation, LSTM units can be augmented by a linear projection layer with n p < 3n h /4 units 101 . Learning curves in fig. S2c show training with n p = 64, n p = 32 and no projections. Decreasing the number of projection units accelerates convergence; however, it also increases final losses. Further, training becomes increasingly prone to instability as n p increases. As a result, we do not use projection layers in our other experiments.
In the main article, we show that adding a Sobel loss can decrease MSEs. As a result, we also experimented with other loss functions, such as the maximum MSE of a 5×5 region. Learning curves in fig. S2d show that MSEs result in faster convergence than maximum region losses; however, both loss functions result in similar final MSEs. We expect that MSEs calculated with every output pixel result in faster convergence than maximum region errors as more pixels inform gradient calculations. We expect that a better approach to minimize maximum errors is to use a higher order loss function, such as absolute cubic differences. If training with a higher-order loss function is unstable, it could be stabilized by adaptive learning rate clipping 102 .
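To make the maximum region loss concrete, the following NumPy sketch computes the largest MSE over 5×5 regions of a completion. It assumes that regions are all overlapping 5×5 windows, which is not specified above, and the function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def max_region_mse(completion, target, width=5):
    """Return the largest mean squared error over all width x width windows."""
    squared_errors = (np.asarray(completion, float) - np.asarray(target, float)) ** 2
    # Assumption: regions are every overlapping window, not a fixed tiling.
    windows = sliding_window_view(squared_errors, (width, width))
    return float(windows.mean(axis=(-2, -1)).max())


# Errors concentrated in one 5x5 block dominate the maximum region loss,
# whereas they are averaged over all 64 pixels by a conventional MSE.
completion = np.zeros((8, 8))
target = np.zeros((8, 8))
target[1:6, 1:6] = 1.0
region_loss = max_region_mse(completion, target)
image_mse = float(np.mean((completion - target) ** 2))
```

Only the pixels in the worst window receive gradients from a loss of this kind, which is consistent with the slower convergence reported above.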
Calculating supervised future losses with Bellman's equation, rather than reinforcement losses with target networks, accelerated convergence, especially in early experiments before architecture and learning policy were optimized. Learning curves for full supervision, supervision linearly decayed to zero in the first 10^5 iterations, and no supervision are shown in fig. S2e. We find that supervised losses did less to accelerate convergence after we refined our architecture and learning policy. However, reinforcement learning based losses continue to result in lower final losses with lower variance c.f. table S1.
Although experience replay buffer sizes near 10^6 are popular, reinforcement learning can be sensitive to replay buffer size 67 . However, learning curves in fig. S2d do not show a clear relationship between final errors and the size of our replay buffer or the average number of times each history is replayed from it. We did find that increasing replay buffer size and decreasing average number of replays decrease small learning curve oscillations [73][74][75] with a period of about 2000 iterations. However, the size of oscillations does not appear to affect performance.
Generator learning rate optimization is shown in fig. S3. To find the best initial learning rate for ADAM optimization, we increased the learning rate until training became unstable, as shown in fig. S3a. We performed the learning rate sweep over 10^4 iterations to avoid results being complicated by losses rapidly decreasing in the first couple of thousand iterations. The best learning rate was then selected by training for 10^5 iterations with learning rates within a factor of 10 from a learning rate 10× lower than where training became unstable, as shown in fig. S3b. We performed initial learning rate sweeps in fig. S3a for both ADAM and stochastic gradient descent 43 (SGD) optimization. We chose ADAM as it is less sensitive to hyperparameter choices than SGD, and ADAM is recommended in the RDPG paper 48 .

S3 Test Set Errors
Test set errors for every graph in the main text and supplementary information are tabulated in table S1. However, they should be interpreted with caution as learning was unstable in some of our experiments.

Table S1. Test set errors, with columns for figure, label, mean and standard deviation. Experiments are in fig. 4 and fig. S2, and have the same labels in figure legends.

S4 Distortion Correction
We expect that experimental adaptive partial STEM will be limited by probing position errors. Nevertheless, we propose that cyclic generators 89 could be trained to correct position errors. To be clear, this section is intended to be a starting point for future research. It outlines a method to train cyclic generators that could be refined or improved upon. Let $I_{partial}$ and $I_{raster}$ be unpaired partial scans and raster scans, respectively. A binary mask, M, can be constructed to be 1 at nominal probing positions in $I_{partial}$ and 0 elsewhere. We introduce generators $G_{p \to r}(I_{partial})$ and $G_{r \to p}(I_{raster}, M)$ to map from partial scans to raster scans and from raster scans to partial scans, respectively. A mask should be input to the partial scan generator so that it can output an image with an accurate distortion field, as distortions depend on scan shapes 34 . Finally, we introduce discriminators $D_{partial}$ and $D_{raster}$ that are trained to distinguish between real and generated partial scans and raster scans, respectively, and predict losses that can be used to train generators to create realistic images. In short, partial scans could be mapped to raster scans by minimizing losses
$$L^{GAN}_{p \to r} = D_{raster}(G_{p \to r}(I_{partial})) \quad (S1)$$
...
where $L_{p \to r}$ and $L_{r \to p}$ are total losses to optimize $G_{p \to r}$ and $G_{r \to p}$, respectively. A scalar, b, balances adversarial losses and cycle-consistency losses.

S5 Additional Examples
Additional sheets of test set adaptive scans are shown in fig. S4 and fig. S5. In addition, a sheet of test set spiral scans is shown in fig. S6. Target outputs were low-pass filtered by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation to suppress high-frequency noise.
Figure S5. Test set 1/23.04 px coverage adaptive partial scans, target outputs and generated partial scan completions for 96×96 crops from STEM images.

Figure S6. Test set 1/23.04 px coverage spiral partial scans, target outputs and generated partial scan completions for 96×96 crops from STEM images.