Reconstructing the exit wave in high-resolution transmission electron microscopy using machine learning

Reconstruction of the exit wave function is an important route to interpreting high-resolution transmission electron microscopy (HRTEM) images. Here we demonstrate that convolutional neural networks can be used to reconstruct the exit wave from a short focal series of HRTEM images, with a fidelity comparable to conventional exit wave reconstruction. We use a fully convolutional neural network based on the U-Net architecture, and demonstrate that we can train it on simulated exit waves and simulated HRTEM images of graphene-supported molybdenum disulphide (an industrial desulfurization catalyst). We then apply the trained network to analyse experimentally obtained images from similar samples, and obtain exit waves that clearly show the atomically resolved structure of both the MoS$_2$ nanoparticles and the graphene support. We also show that it is possible to successfully train the neural networks to reconstruct exit waves for 3400 different two-dimensional materials taken from the Computational 2D Materials Database of known and proposed two-dimensional materials.


Introduction
Machine learning has become a powerful tool for analyzing images. In fact, machine learning is a nascent tool in electron microscopy that is envisioned to have a large potential for quantitative image analysis [1,2]. In electron microscopy, applications of machine learning have up to now included segmentation of medical images [3], grain and phase identification [4,5,6], noise filtering [7,8] and in-plane location of atoms [9,10,11]. Moreover, Ede et al. showed recently that the imaginary part of the exit wave function can be reconstructed from the real part using a convolutional neural network [12] and Meyer showed that off-axis holograms, where phase information is recorded directly into the image, can be reconstructed using neural networks [13]. In this work we suggest that neural networks could potentially solve the classical phase problem and thus retrieve the entire electron wave function exiting the specimen in a transmission electron microscopy experiment.
Aberration-corrected high-resolution transmission electron microscopy (HRTEM) is one of the important experimental techniques to study the structure of materials at the atomic scale. The maximal amount of information about the sample is present in the exit wave, i.e. the wave function of the electrons exiting the sample. As the image is formed, some of this information is lost, both due to aberrations in the lenses, and because the camera detects the intensity of the wave, not its phase.
* Corresponding author: schiotz@fysik.dtu.dk
It is well established that the full exit wave can be reconstructed from a focal series of images [14,15]. A series of typically around 20-50 images with varying defocus is used to numerically reconstruct the most likely wave function of the electron beam as it exits the sample. This can then be used to further reconstruct information about the chemical composition and 3D structure of the sample [16,17]. For beam-sensitive samples [18], exit wave reconstruction has the advantage of being averaging in nature, such that information from many images with very low signal-to-noise ratio is combined in a single exit wave image of superior signal-to-noise ratio [17]. Several numerical algorithms are available for reconstructing the exit wave [19,20,21].
Here we examine a Convolutional Neural Network (CNN) as an alternative way to reconstruct the exit wave. This reconstruction is possible from a low number of HRTEM images, and has the advantage that detailed knowledge of the aberration parameters of the microscope is not needed. We envision that this can be developed into a tool for on-the-fly exit wave reconstruction while taking data on the microscope, perhaps supplemented with more traditional exit wave reconstruction as post processing. In the present case, the images were convolved with the effects of defocus, first order astigmatism, coma, and blurring including focal spread. In this case a focal series of two to three simulated HRTEM images was sufficient to reconstruct the exit wave accurately enough to extract quantitative information about the sample. In principle, it should be straightforward to extend the present method to situations with low signal-to-noise ratio and more unknown aberrations, in which case it is likely that a larger focal series will be needed.
Recently, atomically thin two-dimensional (2D) materials have been an active topic of research, with applications ranging from electronics to energy storage and catalysis [22,23]. For example, molybdenum disulphide (MoS$_2$) is the preferred catalyst for removing sulphur from crude oil, and is one of the reasons that acid rain is no longer one of the most pressing environmental problems [24]. In this paper, we focus on exit wave reconstruction for the rapidly growing class of 2D materials, although the methods should be generally applicable. We show that neural networks can reconstruct the exit wave both when trained to a single material, and to a database of thousands of proposed 2D materials. The reconstruction is of sufficient quality to permit analysis of the image peaks associated with the atomic columns, e.g. by using Argand plots to identify the type and number of elements in the material [16].
We also show that it is possible to train the neural network purely on simulated data, and apply it successfully to experimental images of non-trivial complexity, in this case a model catalyst based on molybdenum disulphide.

Methods
The neural network architecture is a Unet [25] / Fusion-Net [3] architecture, very close to the one used by Madsen et al. [9], with the main modification that concatenation is used instead of elementwise addition for the skip connections. A linear activation function is applied in the output layer, as exit wave reconstruction is a regression problem rather than a classification/segmentation problem. Details of the architecture can be found in the Supplementary Online Information (SOI Sec. S1). The neural network is implemented and trained using the Keras interface [26] to Tensorflow version 2.5 [27]. We train using simulated images only. We computer-generate a training set and a corresponding validation set of atomic structures, using the Atomic Simulation Environment (ASE) [28].
Three data sets of increasing complexity were created. The first consists of nanoparticles (nanoflakes) of molybdenum disulphide (MoS$_2$). In this data set we ignore that the nanoparticles will typically be supported on another material in the microscope. Nevertheless, this data set will be relevant for e.g. edges of MoS$_2$ films on a TEM grid, where no support is visible in the region of interest.
The second dataset is MoS$_2$ supported on a graphene substrate. A nanoflake of graphene and one of MoS$_2$ are generated in the computer, and are placed at a random distance between 3.3 and 7.0 Å. One quarter of the cases are placed with the lattice vectors of the two layers in the same directions, another quarter with a rotation of 15°, one quarter with a rotation of 30°, and the rest with a random rotation. In both of these datasets 1000 samples are created for training, and 1000 for validation.
The third dataset consists of nanoflakes of materials from the Computational 2D-materials Database (C2DB) [29] in the latest version dated 2021/06/24. This version of the database contains 4056 known or proposed 2D materials, but a significant number of these have very complex structures where the quasi-2D material contains a large number of atomic layers. We filtered the database to keep only structures with at most eight atoms in the unit cell, which left us with 3393 materials. Two samples are created of each material. Materials are randomly assigned to the training or validation set in a ratio of 2:1, but in such a way that all materials containing the same set of elements are assigned to the same set.
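The grouped assignment described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names, the regular expression used to parse formulas, and the example formulas are all assumptions made for the sketch. Each distinct set of elements is drawn once, with probability 2/3 of going to the training set, and every material with that element set follows it.

```python
import random
import re


def element_set(formula):
    """Set of element symbols in a formula string such as 'MoS2'."""
    return frozenset(re.findall(r"[A-Z][a-z]?", formula))


def split_by_elements(formulas, train_fraction=2 / 3, seed=0):
    """Assign formulas to train/validation so that each element set stays together."""
    rng = random.Random(seed)
    group_of = {}  # element set -> "train" or "validation"
    train, validation = [], []
    for formula in formulas:
        key = element_set(formula)
        if key not in group_of:
            # One random draw per element set, not per material.
            group_of[key] = "train" if rng.random() < train_fraction else "validation"
        (train if group_of[key] == "train" else validation).append(formula)
    return train, validation


# Hypothetical formulas, for illustration only.
formulas = ["MoS2", "MoS", "WSe2", "WSe", "CoCl", "CoCl2", "AgCuTe2"]
train, validation = split_by_elements(formulas)
```

Splitting by element set rather than by material means the validation set only contains chemistries the network has not seen during training, at the price of a split ratio that is only approximately 2:1.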
For all three datasets, vacancies and holes are introduced in the systems. A vacancy is introduced by selecting a random atom and removing it; holes are made by selecting a random atom and then removing the entire atomic column. In the case of MoS$_2$, if a sulphur atom is selected then a vacancy would be removing just that atom, whereas creating a hole would be removing an S$_2$ dimer. If a molybdenum atom is selected there will be no difference. We select 5% of the atoms for vacancy creation, then 5% for hole creation. All atomic positions are then perturbed by adding Gaussian noise with a mean of 0 and a spread of 0.01 Å. Finally, all samples are tilted by a random angle of up to 10° in a random direction.
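A minimal NumPy sketch of the defect and perturbation steps is given below. It is an illustration under stated assumptions rather than the generation code of this work: the function name and the `column_ids` labelling of atomic columns are hypothetical, the 5% for holes is drawn from the original atom list (the text does not specify whether already removed atoms are excluded), and the final tilt is left out.

```python
import numpy as np


def add_defects(positions, column_ids, rng, fraction=0.05, sigma=0.01):
    """Remove vacancies and holes, then perturb the remaining atoms.

    positions  : (N, 3) array of atomic positions in Å
    column_ids : (N,) integer label of the atomic column of each atom (assumed input)
    Returns the perturbed positions and the boolean mask of surviving atoms.
    """
    n = len(positions)
    n_pick = int(round(fraction * n))
    keep = np.ones(n, dtype=bool)

    # Vacancies: remove the selected atoms only.
    vacancies = rng.choice(n, size=n_pick, replace=False)
    keep[vacancies] = False

    # Holes: remove every atom sharing a column with a selected atom.
    hole_atoms = rng.choice(n, size=n_pick, replace=False)
    keep[np.isin(column_ids, column_ids[hole_atoms])] = False

    # Gaussian perturbation with zero mean and spread sigma (in Å).
    noise = rng.normal(0.0, sigma, size=(int(keep.sum()), 3))
    return positions[keep] + noise, keep


# Toy system: 50 atomic columns with two atoms each (illustrative only).
rng = np.random.default_rng(0)
column_ids = np.repeat(np.arange(50), 2)
positions = rng.uniform(0.0, 20.0, size=(100, 3))
new_positions, keep = add_defects(positions, column_ids, rng)
```

For a column containing a single atom, such as Mo in MoS$_2$, the two branches remove the same atom, which matches the remark that vacancy and hole coincide in that case.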
Exit waves are then calculated using the multislice algorithm [30,31], using the abTEM software [32]. The lateral sampling of the wave function is 0.05 Å, and the slice thickness is 0.2 Å, see the SOI Sec. S2. As a simple model of atomic vibrations, the potential of the atoms is smeared by a Gaussian with a width given by the mean square displacement $\langle u^2 \rangle$, which depends on the atomic mass $m$ and the Debye temperature $\theta_D$ [33, supplementary online information]. As the same value must be used for all atoms, we use the atomic mass of sulphur. With $\theta_D = 580$ K for bulk MoS$_2$ [33], this gives a value of $\langle u^2 \rangle = 0.0030$ Å$^2$ at 300 K. Our own ab initio molecular dynamics simulations of MoS$_2$ give a somewhat larger value, which is expected as molecular dynamics ignores the quantization of the phonons, which is important below the Debye temperature. As an approximation, we also use this value of $\langle u^2 \rangle$ for the materials in the C2DB. If the reconstructed exit wave is to be used to gain information about the vibrational amplitudes of different kinds of atoms, as is done in Ref. [17], the phonons need to be modelled with a more sophisticated method, such as the frozen phonon method, at a significant cost in computational burden (up to two orders of magnitude).
After generating the exit waves, the abTEM software is used to generate typically three images of the sample by applying a Contrast Transfer Function (CTF), Poisson noise in the detector, and a Modulation Transfer Function (MTF) introducing correlations in the noise. This is described in detail elsewhere [9]. The parameters of the CTF and the MTF (collectively referred to as the "microscope parameters") are drawn from distributions given in Table 1. The three images have the same microscope parameters except that the defocus is changed by 5 ± 0.1 nm between the images. If a different number of images is used, the total variation in defocus remains at 10 nm.
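To illustrate why a focal series carries phase information that a single image does not, the following sketch applies a defocus-only transfer function to a toy exit wave and records the intensity. It is a simplified stand-in for the image simulation, not the abTEM pipeline: the function names, the sign convention of the aberration function, the 50 keV beam energy, the 0.1 Å sampling and the Gaussian phase object are all assumptions, and the other aberrations, the noise and the MTF are omitted.

```python
import numpy as np


def electron_wavelength(energy_ev):
    """Relativistic electron wavelength in Å for a beam energy in eV."""
    return 12.2643 / np.sqrt(energy_ev * (1.0 + 0.97845e-6 * energy_ev))


def simulate_image(exit_wave, sampling, defocus, wavelength):
    """Intensity image of an exit wave after a defocus-only transfer function.

    sampling, defocus and wavelength are all in Å. The sign of chi is a
    convention and is an assumption of this sketch.
    """
    ny, nx = exit_wave.shape
    kx = np.fft.fftfreq(nx, d=sampling)
    ky = np.fft.fftfreq(ny, d=sampling)
    k2 = kx[None, :] ** 2 + ky[:, None] ** 2
    chi = np.pi * wavelength * defocus * k2  # aberration function, defocus term only
    image_wave = np.fft.ifft2(np.fft.fft2(exit_wave) * np.exp(-1j * chi))
    return np.abs(image_wave) ** 2  # the camera records intensity, not phase


# Toy pure phase object: a single Gaussian phase bump (illustrative values).
sampling = 0.1  # Å per pixel
n = 64
x = (np.arange(n) - n / 2) * sampling
xx, yy = np.meshgrid(x, x)
phase = 0.3 * np.exp(-(xx**2 + yy**2) / (2 * 0.5**2))
exit_wave = np.exp(1j * phase)

wavelength = electron_wavelength(50e3)
in_focus = simulate_image(exit_wave, sampling, 0.0, wavelength)
defocused = simulate_image(exit_wave, sampling, 50.0, wavelength)  # 5 nm defocus
```

In this idealised case the in-focus image of a pure phase object is featureless, while the defocused image shows contrast. How the image changes with defocus is therefore what encodes the phase of the exit wave.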
The expensive part of the image simulation is the multislice algorithm calculating the interaction between the electron beam and the sample. The action of the CTF and the MTF is computationally cheap, and for that reason it is convenient to generate multiple images of the same sample with varying microscope parameters. Depending on the computational setup, it may be most convenient to generate images on-the-fly during training, such that the network sees different images of the same samples in each training epoch, or it is possible to pre-generate and store the images. In this work we pre-generated ten epochs of images for the training set, and one for the validation set. We then cycled through the pre-generated epochs for the actual training, which ran for up to 200 epochs (leading to each image being reused 20 times).
The neural network is trained using the mean square error (MSE) as the loss function, with the RMSprop training algorithm as implemented in Keras, and a learning rate of $5 \times 10^{-4}$. We also tried using the Adam algorithm [34], and saw similar but slightly less stable results, whereas Adam with the AMSgrad modification gave almost identical results to RMSprop. Increasing the learning rate above $1 \times 10^{-3}$ would make the training unstable, and decreasing it below $5 \times 10^{-4}$ was detrimental to the learning. Training curves showing the loss function of the training and validation set are shown in the SOI (Fig. S2). In spite of the reuse of pre-generated images, the training curves do not show signs of overfitting. We therefore did not use regularization in the neural network.
The sharp potential of the nucleus causes some amount of annular structure to appear in the exit wave, in spite of the application of Debye-Waller smearing. This fine structure contains little or no information of value when analysing the exit waves. However, the neural network will attempt to recover this structure, leading to an overall small degradation of its ability to recover more important information about the main peaks associated with the atomic columns. For simplicity, we have filtered the exit waves prior to training by convolving them with a Gaussian with a spread of 15 pm, see SOI Fig. S4. This leads to a significant improvement in the network performance, in particular when it comes to extracting quantitative information from the peak values.

Results and discussion
Figure 1 shows the simplest situation, where the network is trained and tested with unsupported MoS$_2$ nanoparticles. The figure shows the real and imaginary parts of the exit wave used to simulate the images (the "ground truth"), and the exit wave reconstructed by the neural network (the "prediction"). For thin samples, the interaction between the electron wave and the sample mainly results in a phase shift of the wave [17]; this is also the case for the data in the figure, where the main part of the signal is in the imaginary part.
The difference plot in Fig. 1 shows that the network clearly reconstructs the imaginary part of the exit wave both qualitatively and quantitatively. We see that all peaks are reconstructed correctly, and that the neural network reconstructs both the periodic lattice and the deviations from periodicity such as vacancies, including single sulphur vacancies where a single sulphur atom leaves a weaker peak than the usual two atoms. The system shown in Figure 1 was chosen as the median of the validation set: half the systems in the validation set perform worse, and half perform better. In the SOI Section S5 we show some of the worst systems in the validation set; even the five percentile sample is reconstructed quite well.
Figure 2 shows the more complex situation, where the network is trained on graphene-supported MoS$_2$ nanoparticles. The way the training set is constructed does not guarantee that the full MoS$_2$ nanoparticle is overlapping with the support, so in this case the network needs to learn to recognize both supported and unsupported MoS$_2$.
The network is able to reconstruct both the part of the wave function coming from the support and the part coming from the nanoparticle, in spite of the signal from the support being much weaker than that from the nanoparticle. The network is even able to correctly find the carbon vacancies that have been introduced in the support. It should be noted that if the network is trained for a shorter time (50 epochs instead of 200), it loses its ability to find the carbon atoms below the nanoparticle. The largest deviation in the reconstructed exit wave comes from a slight misplacement of the atoms in the MoS$_2$ layer; the maximal error in the placement of an atom is 9.7 pm, corresponding to a single pixel. This system is again chosen as the median of the validation set.
Finally, the method was tested on the C2DB database of 3393 proposed two-dimensional materials [29]. Again we show the median system, a nanoparticle of CoCl (Fig. 3). We see that all atoms are placed correctly, but the detailed shape of the peaks in the imaginary part of the wave function is not well reproduced; the network predicts somewhat smoother peaks. In addition, the network does not always identify positions where single atoms are missing, leaving only one atom in the atomic column. Each position in the apparent hexagonal lattice contains both a Co and a Cl atom, alternately oriented with the Co or Cl on top, and staggered in the z direction.
In order to obtain a more quantitative measure of the performance of the networks, we have created histograms of the root-mean-square error (RMSE) of all the images in the validation sets, see Fig. 4. In general, the network is better at reproducing the strong signal in the imaginary part of the exit wave than the weaker real part. It is seen that the performance of the network decreases somewhat as the complexity of the data set is increased, going from unsupported MoS$_2$ to supported MoS$_2$ to the C2DB dataset. It is not surprising that the network can be trained for better performance on the simpler datasets. As a "baseline", we also show the histogram produced from one of the datasets where the predictions are compared with randomly chosen other exit waves in the dataset (the Y-scramble method) rather than with the correct exit wave. This shows the performance of a hypothetical network learning the overall properties of exit waves but learning nothing about the specific systems, i.e. it acts as a "null hypothesis".
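The per-image RMSE and the Y-scramble baseline can be sketched as below. This is a minimal illustration with synthetic arrays standing in for one component (real or imaginary part) of the exit waves; the function names and the toy data are assumptions, not the evaluation code of this work.

```python
import numpy as np


def rmse(prediction, truth):
    """Root-mean-square error between two real-valued images."""
    return float(np.sqrt(np.mean((prediction - truth) ** 2)))


def y_scramble_rmse(predictions, truths, rng):
    """RMSE of each prediction against a randomly chosen *other* ground truth."""
    n = len(truths)
    errors = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        j = rng.choice(others)
        errors.append(rmse(predictions[i], truths[j]))
    return np.array(errors)


# Synthetic stand-ins: eight "exit waves" and slightly noisy "predictions".
rng = np.random.default_rng(1)
truths = rng.normal(size=(8, 16, 16))
predictions = truths + 0.05 * rng.normal(size=truths.shape)

matched = np.array([rmse(p, t) for p, t in zip(predictions, truths)])
scrambled = y_scramble_rmse(predictions, truths, rng)
```

A histogram of `matched` that lies well below the histogram of `scrambled` indicates that the network has learned something about the specific system, and not only the overall statistics of exit waves.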
It is also seen that the relative error is significantly larger for the real part of the exit wave. This is because its magnitude is 3-4 times smaller than that of the imaginary part (this can e.g. be seen from the position of the peaks in the Argand plots in Figure 5). It is only in the simplest case (unsupported MoS$_2$) that the network performs well on the real part.
We also test how networks trained on the C2DB dataset perform on the supported MoS$_2$ and vice versa. Unsurprisingly, the network trained on supported MoS$_2$ performs poorly on the C2DB dataset, as the latter contains a far richer variety of structures. On the other hand, the network trained on the C2DB generates a very broad distribution of results when applied to the database of supported MoS$_2$ structures (the red curve in Fig. 4). Our interpretation is that this is because the network correctly analyses the parts of the system where the MoS$_2$ and graphene overlap only a little, but performs badly where they overlap. While these systems are not inherently more complicated than those in the C2DB, they differ in a fundamental way, as there are two different lattices in the system (the lattice of graphene and the one of MoS$_2$), whereas all systems in the C2DB training set only contain a single (but often more complicated) crystal lattice. This illustrates the importance of training the network on systems that are similar to the final application.
Figure 2: … It is seen that the network locates the atoms both in the MoS$_2$ nanoparticle and in the substrate. The worst spot in the prediction is where an atom is misplaced by a single pixel. It is worth noticing that the graphene support is also reconstructed correctly, including the vacancies in the graphene. The RMSE is 0.0122 and the colorbar is the same as in Fig. 1.
The purpose of an exit wave reconstruction is usually to extract quantitative information about the atomic columns. This is often done in the form of an Argand plot, where the peak values of the wave function at the locations of the atoms are plotted in the complex plane [16]. It is therefore not enough that an exit wave reconstructed by neural networks visually and statistically resembles the actual exit wave; it should also permit analysis in an Argand plot. This is shown in Fig. 5, where we show Argand plots of both the unsupported and supported nanoparticles from Figs. 1 and 2. For the unsupported nanoparticle, the Argand plot is just able to distinguish between a single Mo atom (atomic number Z = 42) and a sulphur dimer (sum of atomic numbers Z = 32). The sulphur vacancies, where there is only a single sulphur atom in the atomic column (Z = 16), are clearly separated from the other types. It is, however, not possible to determine if the missing atom was above or below the plane of the Mo atoms, although that information was present in the original wave function (shown as "ground truth", where we see that the spots corresponding to single S atoms are split into two nearby spots with different real value, as would be expected from atoms with different z coordinate, see Chen et al. [16]).
For the case of supported MoS$_2$ [Fig. 5(d-f)], the picture is less clear. The Argand plot still clearly separates the sulphur vacancies from the other atomic columns, but there is a larger spread in the column values, and no longer a clear separation between columns containing two S atoms or a single Mo atom. However, if the same analysis is done on the ground truth exit waves [Fig. 5(f)], the situation is the same. This is most likely due to interference from the substrate.
In the Argand plot, the position along the imaginary axis is largely indicative of the total atomic number of the atomic column in the weak phase limit [17]. The positions of the Argand points are also affected by the z-height of the column relative to the plane of the exit wave. This is mainly due to the peak spreading out spatially as the wave propagates from the bottom of the atomic column to the plane where the exit wave is defined, leading to a decrease in the peak value for atomic columns at higher z [35]. This effect is clearly not reproduced by the neural network, as it cannot distinguish between single sulphur vacancies on the two sides of the nanoparticle [Fig. 5(b+c)]. It is possible that a network could be trained to distinguish these features by including training data where they are more prominent, i.e. a larger concentration of single sulphur vacancies and perhaps samples with higher tilt angles, producing height differences.
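The extraction of Argand points from a complex exit wave can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the peak positions are assumed to be known pixel coordinates, and the toy peak heights are chosen arbitrarily (scaled with the column atomic numbers 42, 32 and 16) rather than taken from any simulation.

```python
import numpy as np


def argand_points(exit_wave, peak_pixels):
    """Complex values of the exit wave at the given (row, column) peak positions."""
    rows, cols = np.asarray(peak_pixels).T
    return exit_wave[rows, cols]


# Toy exit wave: unit background with three peaks of different imaginary height.
exit_wave = np.ones((32, 32), dtype=complex)
peak_pixels = [(8, 8), (8, 20), (20, 8)]  # assumed column positions
toy_heights = [0.42, 0.32, 0.16]          # arbitrary values, scaled with Z = 42, 32, 16
for (row, col), height in zip(peak_pixels, toy_heights):
    exit_wave[row, col] += 1j * height

points = argand_points(exit_wave, peak_pixels)
# Plotting points.real against points.imag gives the Argand plot.
```

In practice the peak positions would first have to be located, for example from local maxima of the imaginary part, and columns are then classified by where their points cluster in the complex plane.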
As a significant amount of information about the exit wave is encoded in how the image changes with defocus, it is our working hypothesis that a number of images are necessary for a neural network to be able to reconstruct the exit wave. This is verified in Fig. 6, showing the performance of networks trained on the same C2DB training sets but with a different number of input images. It is seen that some information about the exit wave can be gained from even a single image, but a dramatic improvement is seen going to two input images. A small further improvement is seen when increasing the number of images to three or four, and we decided to use three images in the rest of this work. In the simulations with two, three or four images, the total range of defocus from the first to the last image was in each case 10 nm.
Figure 4: … Blue is the network trained and tested on unsupported MoS$_2$, orange is graphene-supported MoS$_2$, and green is the C2DB. It is seen that the performance of the network decreases somewhat as the complexity of the data set is increased. The brown line is a baseline: the performance obtained if the network does not recognize the structure at all, obtained using the Y-scramble method (see text). The red curve shows the supported MoS$_2$ validation set with the network trained on the more diverse C2DB. As the samples contain two separate lattices, they are outside the training set of the C2DB. Validating the C2DB test set with the network trained on MoS$_2$ also gives bad results (purple curve), as the C2DB contains structures too far from what is observed in MoS$_2$.

When the network fails
No neural network is perfect, and it is important to be aware of the kind of failures that can occur when analysing an image. We illustrate this with two kinds of errors observed in the C2DB database.
Figure 6: Test of how the number of input images affects the performance. A single input image (blue curve) clearly does not give a good reconstruction of the exit wave. Already with two images (orange curve), good performance is obtained, at least for the imaginary part. Three images (green), as used in the rest of this work, give an improvement, whereas four images (red) give only a marginal further improvement.
The first case is silver copper telluride (AgCuTe$_2$), shown in Fig. 7. On one hand, the method reliably finds all the vacancies in the structure, a task that would be very difficult by visual inspection of the three images. On the other hand, the network fails to discover a small spontaneous breaking of the symmetry in the structure: the Cu atoms are slightly displaced compared to the rectangular lattice formed by the Ag and Te atoms. This is a highly unusual configuration, and the neural network interprets it as the far more common symmetric configuration.
In the second case, the network locally inserts extra atoms into the structure, creating unphysical defects, see Fig. 8. This kind of error should be relatively easy for the scientist to spot.
The cases in Figs. 7 and 8 were chosen manually. In the SOI, we give examples of some of the worst and best results of the networks, selected solely from the RMSE of the prediction.
As the examples here show, it is difficult to train a single network to 3400 different materials, even if they are two-dimensional. The networks trained to a single material (MoS$_2$), with or without support, do not exhibit these failure modes. It is therefore recommended to train networks to smaller classes of materials matching the kinds of systems being studied experimentally. Furthermore, the kinds of errors shown here can be detected by training two or more different networks to similar data sets, and detecting when the networks differ in their prediction.
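The idea of comparing the predictions of two networks can be sketched as a simple pixel-wise disagreement map. This is a minimal illustration: the function name and the threshold value are hypothetical, and the two "predictions" are toy arrays rather than network outputs.

```python
import numpy as np


def disagreement_map(pred_a, pred_b, threshold):
    """Boolean map of pixels where two reconstructed exit waves differ by more than threshold."""
    return np.abs(pred_a - pred_b) > threshold


# Toy example: the second prediction contains one spurious "extra atom".
pred_a = np.zeros((16, 16))
pred_b = pred_a.copy()
pred_b[5, 7] = 0.4

flagged = disagreement_map(pred_a, pred_b, threshold=0.1)  # threshold is an arbitrary choice
```

Regions flagged in this way would be candidates for closer inspection. Agreement between networks does not prove correctness, since networks trained on similar data may share the same bias.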

Application to experimental data
We apply the method to experimental data, a focal series of a MoS$_2$ model catalyst recorded on the TEAM 0.5 transmission electron microscope at 50 keV beam energy. The data analysed here are similar to what was published recently by Chen et al. [17], and we refer to that publication for details regarding the experimental setup.
In their publication, Chen et al. used focal series of 20-44 images to reconstruct the wave functions. Here, we have selected three images from their focal series for analysis by the neural network.
As the resolution of this image series is significantly lower than what we have otherwise been using in this work (0.227 Å/pixel instead of 0.105 Å/pixel), we retrained a network for this resolution, based on the same data set of supported MoS$_2$, but resampled to resolutions in the interval from 0.215 to 0.235 Å/pixel. The lower resolution had only a small detrimental effect on the network performance when tested on the validation set. We then selected three experimental images with a difference in defocus of 50 Å, to match the defocus difference between the three images used to train the network. The images and the resulting exit wave are shown in Fig. 9. As can be seen, a clear exit wave is reconstructed, showing the honeycomb lattice of the supported MoS$_2$ nanoflake, and of the supporting graphite lattice. However, an Argand plot is not able to distinguish the lattice points of the Mo and S sublattice (not shown), consistent with what we saw in simulated images (see Figure 5, panels e and f). In both cases, the reason is the same. Some peaks in the wave function of the MoS$_2$ coincide with peaks from the graphite, some do not, and that leads to greater variation between peaks than the difference between a Mo atom and two S atoms.
Figure 9: … The network also provides a best guess on the positions of carbon atoms in the support, but as the support is graphite and the network was trained on single graphene layers as support, that cannot be considered reliable. The defocus values are as reported by MacTempas (overfocus), which uses the opposite sign convention from abTEM and this work. The colors are the same as in Fig. 1.
In their publication, Chen et al. [17] were able to distinguish between peaks from Mo and S atomic columns, but their analysis of the exit wave is also more elaborate. First, they had to Fourier filter their images, removing spatial frequencies coming from the graphene support from the exit wave. Second, even if a clear distinction between the peak imaginary values of the Mo and two S atomic columns were made, it is worth noticing that the chemical interpretation of the relative intensities calls for caution. As reported by Chen et al., the peak values can be severely reduced and the imaginary parts be broadened across a nanocrystal due to the heterogeneous vibrational response of the sample under illumination. Chen et al. offer a framework for an interpretation of the exit wave function. This interpretation is independent of the way in which the exit wave function is reconstructed, which is the prime objective of the present analysis.
With even just a few images, the network can thus already capture the main arrangement of the atomic columns based on an experimental focal series of low-dose HRTEM images. Further inclusion of images from the focal series might help to better account for the column intensities and the role of higher-order aberrations in the contrast blurring in the experimental image. For a full qualitative analysis of the experimental data, networks would have to be trained to specifically take into account a more realistic model for the vibrations of the atoms, as well as the more complicated multilayer support in the experimental data. In addition, the network should be trained to handle carbon contamination of the sample.

Comparison to traditional exit wave reconstruction
To be able to compare this method with more traditional methods for exit wave reconstruction, we have applied the algorithm of Gerchberg and Saxton [36], as implemented in MacTempas version 2.4.50, to three simulated image series of graphene-supported MoS$_2$. The systems were selected according to how well they had been reconstructed by the neural network: we chose the 25, the 50 and the 75 percentile images (Figures S10, 2 and S11, respectively).
The generated data sets contain eleven images with a 1 nm change in defocus between each of them, leading to a total defocus range of 10 nm, the same that was used for the neural networks. All eleven images are used for the Gerchberg-Saxton (GS) exit wave reconstruction, whereas only three (the first, middle and last) were used for reconstructions with the neural network.
The GS exit wave reconstruction algorithm was given the actual values of the defocus, the spherical aberration ($C_s$) and the focal spread, instead of determining them through an optimization process as is usually done. No coma or 2-fold astigmatism was assumed in the reconstruction process, although both coma and astigmatism were present in the images.
In contrast, the neural network does not require any of this information. It is trained to reconstruct the wave function a few Ångström below the lowest atom in the sample without knowledge of either the exact values of the defocus or the aberrations of the microscope, except that they are within the intervals used to train the neural network (Table 1).
A comparison between the neural network and the more traditional exit wave reconstruction is shown in Fig. 10. At first sight, it looks like the neural network strongly outperforms the traditional reconstruction: the difference between the reconstructed image and the original ground truth wave function is much smaller for the neural network reconstruction. However, this is mainly because the longest wavelengths in the exit wave have not been reconstructed by the Gerchberg-Saxton algorithm, leading to the phase of the wave function locally being averaged to zero. It is thus fairer to compare the reconstructed wave in Fig. 10(e) with a ground truth wave function where the longest wavelengths have been filtered out (panel f), using a Gaussian filter with a width of eight pixels (0.9 Å). In this case, visual inspection indicates that the errors of the two methods are of similar magnitude, although the neural network appears to be performing best. This is confirmed by calculating the Root Mean Square Error for the CNN reconstruction (i.e. for the difference between panels c and d) and for the Gerchberg-Saxton reconstruction (panels e and f). The RMSE is 0.013 and 0.061, respectively.
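One way to remove the longest wavelengths from a ground truth image before comparison is to subtract a Gaussian-blurred copy, as sketched below. This is an assumption about how such a filter could be realised, not a description of the filter actually used: the function names are hypothetical, the "width of eight pixels" is interpreted as the standard deviation of the Gaussian, periodic boundaries are implied by the FFT, and the filter is meant to be applied to the real and imaginary parts separately.

```python
import numpy as np


def gaussian_blur(image, sigma_px):
    """Blur a real-valued image with a Gaussian of standard deviation sigma_px (periodic)."""
    ny, nx = image.shape
    fy = np.fft.fftfreq(ny)[:, None]  # spatial frequencies in cycles per pixel
    fx = np.fft.fftfreq(nx)[None, :]
    kernel_ft = np.exp(-2.0 * np.pi**2 * sigma_px**2 * (fx**2 + fy**2))
    return np.fft.ifft2(np.fft.fft2(image) * kernel_ft).real


def remove_long_wavelengths(image, sigma_px=8.0):
    """High-pass filter: subtract the slowly varying part of the image."""
    return image - gaussian_blur(image, sigma_px)


# A constant background is removed entirely, while a sharp peak is almost unchanged.
flat = np.full((32, 32), 3.0)
peak = np.zeros((32, 32))
peak[16, 16] = 1.0

flat_filtered = remove_long_wavelengths(flat)
peak_filtered = remove_long_wavelengths(peak)
```

The RMSE between a reconstruction and the filtered ground truth then measures only the short-wavelength content that the traditional reconstruction can be expected to recover.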
The system shown in Figure 10 is the 25-percentile system. Similar plots for the 50- and 75-percentile systems are shown in the SOI (Figures S13 and S14). It should be noted that the Gerchberg-Saxton reconstruction of the 50-percentile image is of significantly lower quality than the two others, although the neural network did not have problems with this image series. This could be due to those images having both 2-fold astigmatism and coma in the upper end of the range shown in Table 1.
Inclusion of more aberrations than the ones in Table 1 might change these conclusions, and might require using more images for the neural network reconstruction to be reliable. It does, however, appear that a neural network is able to quickly give a reconstructed exit wave of a quality at least comparable to a traditional exit wave reconstruction from only a few images.

Conclusions
Convolutional Neural Networks are a promising alternative to traditional exit wave reconstruction, with the obvious advantages that they require only a few images instead of a long image sequence, that the data processing is fast enough to be done in real time at the microscope, and that detailed knowledge of the aberration parameters of the microscope is not needed. The approach does, however, require that the networks are optimized for the systems at hand.
As expected, the method works best for simpler systems, illustrated here with unsupported and graphene-supported MoS$_2$ nanoparticles, where the exit waves are reproduced with a fidelity that allows for both qualitative and quantitative analysis. For significantly more complicated structures, illustrated here with the relatively diverse C2DB dataset, the network overall performs well, but fails to reconstruct some details in some of the more complex materials. Nevertheless, even in the more complicated materials, the majority of the structure including the positions of point defects is recovered by the neural network.
One could hope that the neural network had learned to generally invert the Contrast Transfer Function (CTF) of the microscope. That is, however, not the case. The network utilises knowledge about "likely" structures based on the kind of structures it has seen in the training set, and must be trained on structures similar to the ones it will be used to analyse. On the other hand, this use of prior knowledge of the systems is probably what enables the network to reconstruct the exit wave based on only three input images, and without knowledge of the actual parameters of the CTF. It should be pointed out that including aberrations beyond the ones used in this work (Table 1) may require using more than three images as input to the neural network.
In summary, we have demonstrated that neural networks can be trained to reconstruct the exit wave function of a varied class of two-dimensional materials, with only three HRTEM images with different defocus as input to the network. We can train and validate the network on simulated data, and then apply it to analyse experimentally obtained data, demonstrated here with the case of MoS$_2$ supported on graphene.

Acknowledgements
We would like to thank Sophie K. Kaptain and Dr. Daniel Kelly for technical assistance in connection with the MacTempas exit wave reconstructions.

Funding
The authors acknowledge financial support from the Independent Research Fund Denmark (DFF-FTP) through grant no. 9041-00161B. L.P.H. was financially supported by The Danish Council for Technology and Innovation (08-044837). The Center for Visualizing Catalytic Processes is sponsored by the Danish National Research Foundation (DNRF146).

Availability of data and materials
The code is available on GitHub [37]. The trained networks, the scripts and data used to train the network and generate all figures except Fig. 9 are available from the DTU Data repository at doi:10.11583/DTU.15263655. The experimental data used in Fig. 9 belong to the authors of Ref. [17].

Figure S1: The architecture of the neural network (legend: convolutional block; convolution + ReLU; MaxPool; upsampling; concatenation). Information flows from left to right. The first part of the network, the "encoding path", consists of convolutional processing blocks alternated by downsampling layers using the MaxPool method. The second part, the "decoding path", consists of convolutional blocks alternated by bilinear upsampling layers. Long skip connections ensure that the original spatial information can be maintained. Adapted from Madsen et al. [9].
The neural network architecture is adapted from the one some of us previously used [9]. It consists of a downsampling (or "encoding") path, where convolutional blocks alternate with downsampling layers, and an upsampling (or "decoding") path, where the convolutional blocks alternate with upsampling layers (see Fig. S1). The convolutional blocks consist of five convolutional layers, with a short skip connection between the output of the first layer and the input of the fifth.
The downsampling is done with conventional MaxPool operations. Each time the resolution is cut in half in a MaxPool operation, the number of feature channels in the following convolutional block is doubled to maintain the information flow in the network. The upsampling is done using bilinear interpolation, and the following convolutional block has the number of channels cut by a factor of two. After each upsampling, information from the last layer with the same spatial resolution is added from the downsampling path. This is done by concatenating the channels, in contrast to Madsen et al. [9], where elementwise addition was used. The first layer in the convolutional blocks in both paths will therefore have a different number of input channels from what is stated in the figure.
Each convolutional layer uses a 3 × 3 convolutional kernel, followed by a Parametric Leaky Rectifying Linear Unit. During hyperparameter optimization, we found that increasing the size of the kernel to 5×5 or 7×7 did not improve the performance of the network, nor did increasing the number of channels over the numbers given in Fig. S1.
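The channel bookkeeping described above can be made concrete with a small sketch. The function name and the values of `base_channels` and `depth` are hypothetical; the actual channel numbers are those given in Fig. S1 and are not reproduced here. The three input channels correspond to the three HRTEM images used as input.

```python
def unet_block_channels(input_channels, base_channels, depth):
    """List (in_channels, out_channels) for every convolutional block.

    Encoder: the channel count doubles after each MaxPool.
    Decoder: after each bilinear upsampling the skip connection is
    concatenated, and the following block halves the channel count.
    base_channels and depth are illustrative, not the paper's values.
    """
    encoder = []
    in_ch, out_ch = input_channels, base_channels
    for _ in range(depth + 1):
        encoder.append((in_ch, out_ch))
        in_ch, out_ch = out_ch, out_ch * 2

    decoder = []
    current = encoder[-1][1]  # channels at the bottom of the "U"
    for level in range(depth - 1, -1, -1):
        skip = encoder[level][1]  # encoder output at the same resolution
        target = current // 2     # channel count cut by a factor of two
        decoder.append((current + skip, target))  # concatenation adds channels
        current = target
    return encoder, decoder

encoder_blocks, decoder_blocks = unet_block_channels(
    input_channels=3, base_channels=16, depth=2
)
```

With these illustrative values the decoder blocks receive 96 and 48 input channels but produce 32 and 16, which is why the first layer of each block sees a different number of input channels than the block's nominal channel count. With elementwise addition instead of concatenation, the input and skip channel counts would have to match.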

Appendix A.2. Convergence of the multislice algorithm
The multislice algorithm depends on a discretization of space, alternating between interaction of the electron wave with the matter within a slice of the material and propagation of the wave from one slice to the next. The finite slice thickness is an approximation. As the "signal" is the change in the wave function from the unperturbed value of 1, we define a measure of the relative change between wave functions $\Psi_d$ and $\Psi_0$ in Eq. (A.1). In Figure S2, we show the convergence of the calculated wave function with slice thickness by plotting the difference between the wave functions calculated with various slice thicknesses $d$ and the one calculated with $d = 0.01$ Å, which is taken as the correct $\Psi_0$.

Appendix A.3. Network training
Figure S4: Effect of Gaussian smearing on the wave function of the MoS$_2$ nanoparticle also shown in Fig. 2 of the main text. The left column shows the real and imaginary parts of the exit wave without smearing; the right column shows them after Gaussian smearing with a width of 20 pm. The ring-shaped structures are suppressed, allowing the network to describe the main peak better.
As mentioned in the main text, training is done using the Mean Square Error (MSE) loss function and the RMSprop algorithm. To check for overfitting, we artificially limited the size of the training set and calculated both training and validation losses as a function of training set size; this is shown in Fig. S3(a). As no increase in the validation loss is seen for the largest training sets, we conclude that overfitting is not an issue. Figure S3(b) shows the same losses as a function of epoch number. Again no overfitting is seen, but perhaps the network trained on MoS$_2$ supported by graphene could have been even better with a bit more training.
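As an illustration of the training procedure, the following toy sketch applies the standard RMSprop update rule to an MSE loss and records training and validation losses per epoch. The linear model, the synthetic data and the hyperparameters (`lr`, `rho`, `eps`) are assumptions made for the example; they are not the network or the settings used in this work.

```python
import numpy as np

def rmsprop_step(params, grad, cache, lr=1e-2, rho=0.9, eps=1e-8):
    """One RMSprop update; lr, rho and eps are illustrative values."""
    cache = rho * cache + (1.0 - rho) * grad**2
    params = params - lr * grad / (np.sqrt(cache) + eps)
    return params, cache

def mse_and_grad(w, x, y):
    """MSE loss of the linear model y = x @ w and its gradient w.r.t. w."""
    residual = x @ w - y
    loss = float(np.mean(residual**2))
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad

# Synthetic training and validation sets (toy data).
rng = np.random.default_rng(1)
x_train = rng.standard_normal((200, 3))
x_val = rng.standard_normal((50, 3))
w_true = np.array([0.5, -1.0, 2.0])
y_train, y_val = x_train @ w_true, x_val @ w_true

w, cache = np.zeros(3), np.zeros(3)
history = []  # (training loss, validation loss) per epoch
for epoch in range(500):
    train_loss, grad = mse_and_grad(w, x_train, y_train)
    val_loss, _ = mse_and_grad(w, x_val, y_val)
    history.append((train_loss, val_loss))
    w, cache = rmsprop_step(w, grad, cache)
```

Overfitting would show up in such a history as a validation loss that starts to rise while the training loss keeps falling.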
Appendix A.4. Gaussian smearing of the wave function.
The effect of Gaussian smearing of the wave function is shown in Fig. S4.
Appendix A.5. Examples of good and bad predictions.
In Figures S5 to S8, we show examples of the predictions of the network on the C2DB dataset, selected by ranking the results obtained on the validation set and showing the images with percentile ranks of 5%, 25%, 75% and 95%, meaning that 5% (etc.) of the images are worse than the image shown. The 50% images are shown in the main text.
In the same way, we show the same percentiles for the data set of supported MoS$_2$ (Figs. S9 to S12).
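The percentile selection can be sketched as follows, given one error score per validation image (larger meaning worse). The function name and the exact handling of ties and rounding are assumptions.

```python
def select_by_percentile(errors, percentiles=(5, 25, 50, 75, 95)):
    """Map each percentile p to the index of the image with p% of images worse.

    errors: one error score per image, where a larger value is worse.
    """
    # Sort image indices from worst (largest error) to best.
    worst_first = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    selected = {}
    for p in percentiles:
        # Number of images that are worse than the selected one.
        rank = min(len(errors) * p // 100, len(errors) - 1)
        selected[p] = worst_first[rank]
    return selected

# Toy example: 100 images whose error equals their index.
example_errors = [float(i) for i in range(100)]
example_selection = select_by_percentile(example_errors)
```

In the toy example the 5% image is the one with index 94, since exactly five images (indices 95 to 99) have a larger error.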
Figure S5: The 5% image from the C2DB dataset. Five percent of the images are this bad or worse. The material is SiF (Silicon fluoride), and the main error is a systematic overestimation of the intensity of the peaks. The worst placed peak is displaced by 11 pm (1 pixel).
Figure S6: The 25% image from the C2DB dataset. The material is GeO$_2$ (Germanium oxide), and the main error is an overestimation of the peaks relating to the Ge atoms. The worst placed peak is displaced by 14 pm (1.3 pixels).
The performance measure used to select the images is based on the mean square error (MSE) of the reconstructed wave function. However, the areas of the nanoparticles vary substantially, and the network finds it very easy to reproduce the value 1 in the vacuum, leading to an MSE that is just as much determined by the area of the nanoparticle as by the quality of the prediction. We therefore divide by the total signal in the exit wave, as that is proportional to the area, and select the images by the quantity defined in Eq. (A.1), where $\Psi_d$ is now the predicted wave function and $\Psi_0$ is the ground truth.
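The effect of this normalisation can be illustrated with a short sketch. Since Eq. (A.1) is not reproduced in this text, the exact form used below is an assumption: the summed squared error divided by the summed squared deviation of the ground truth from the vacuum value 1. The function names and the synthetic "particles" are hypothetical.

```python
import numpy as np

def normalised_error(psi_pred, psi_true):
    """Squared error divided by the total signal |psi_true - 1|^2.

    Assumption: error and signal are both sums of squared moduli; the
    precise definition in Eq. (A.1) may differ.
    """
    error = np.sum(np.abs(psi_pred - psi_true) ** 2)
    signal = np.sum(np.abs(psi_true - 1.0) ** 2)
    return float(error / signal)

def toy_wave(size, particle_px, deviation=0.0):
    """Vacuum (value 1) with a square 'particle' region; purely synthetic."""
    psi = np.ones((size, size), dtype=complex)
    psi[:particle_px, :particle_px] += 0.2j + deviation
    return psi

# Same per-pixel prediction error, but particles of different area.
small_true, small_pred = toy_wave(32, 8), toy_wave(32, 8, deviation=0.02)
large_true, large_pred = toy_wave(32, 16), toy_wave(32, 16, deviation=0.02)

mse_small = float(np.mean(np.abs(small_pred - small_true) ** 2))
mse_large = float(np.mean(np.abs(large_pred - large_true) ** 2))
score_small = normalised_error(small_pred, small_true)
score_large = normalised_error(large_pred, large_true)
```

In this toy case the plain MSE is four times larger for the larger particle although the prediction quality per pixel is identical, whereas the normalised score is the same for both.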
Figure S7: The 75% image from the C2DB dataset. The network slightly underestimates the intensity of the peaks in this NbBr$_2$ (Niobium bromide). The worst placed peak is displaced by 8 pm (0.8 pixels).
Figure S8: The 95% image from the C2DB dataset; only five percent of predictions are better than this. The system is CoBr$_2$ (Cobalt bromide), and the exit wave is reconstructed almost perfectly, with a slight misplacement of some peaks by up to 9 pm (0.8 pixels).
Figure S9: The 5% image from the MoS$_2$ dataset. In spite of the bad contrast in the images, the network mostly resolves the structure. It misses a few sulphur vacancies in the MoS$_2$ layer and a significant number of vacancies in the graphene substrate. The largest error in positioning Mo or S atoms is 16 pm (1.5 pixels).
Figure S10: The 25% image from the MoS$_2$ dataset. All defects and atoms are correctly localized; the only error the network produces is in the shape of the peaks. The largest error in positioning is 9.5 pm (below 1 pixel).
Figure S11: The 75% image from the MoS$_2$ dataset. Here the network mistakes a single sulphur vacancy that coincides with a carbon support atom for a sulphur dimer above a carbon vacancy.
Figure S12: The 95% image from the MoS$_2$ dataset. The exit wave of this unsupported particle is almost perfectly reproduced.
Appendix A.6. Comparison with Gerchberg-Saxton exit wave reconstruction.
Three systems were reconstructed using the Gerchberg-Saxton algorithm, as described in the main text. The reconstructed wave for the 25-percentile system in Figure S10 is shown in the main text. The reconstructed waves for the 50-percentile and 75-percentile systems (Figure 2 of the main text and Figure S11) are shown in Figures S13 and S14. We see that the 50-percentile system has been harder to reconstruct with the Gerchberg-Saxton algorithm than the other two, presumably because both the 2-fold astigmatism and the coma of the simulated images are near the upper limits of their respective distributions (main text, Table 1).

Appendix A.7. Computational resources
The computational resources necessary for a project like this are relatively modest. We here list the computational resources used for the three phases of the project: multislice simulations of the exit wave, image generation, and neural network training.
The computations were done on the Niflheim supercomputer cluster at DTU, but always on single servers within the cluster.
• The multislice simulations were done on servers with two 12-core Intel Broadwell processors (24 cores in total) running at 2.20 GHz, installed in 2017.
• The image generations were done on servers with two 20-core Intel Skylake processors (40 cores in total) running at 2.40 GHz, installed in 2019. The image generation could easily have been done on the abovementioned servers instead.
• The network training was done on a single Nvidia RTX 3090 GPU on a shared multi-GPU server.
The amount of time used for the three phases for the three different kinds of systems are shown in Table A