
Combining variational autoencoders and physical bias for improved microscopy data analysis


Published 9 October 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Arpan Biswas et al 2023 Mach. Learn.: Sci. Technol. 4 045004, DOI 10.1088/2632-2153/acf6a9


Abstract

Electron and scanning probe microscopy produce vast amounts of data in the form of images or hyperspectral data, such as electron energy loss spectroscopy or 4D scanning transmission electron microscopy, that contain information on a wide range of structural, physical, and chemical properties of materials. To extract valuable insights from these data, it is crucial to identify physically distinct regions in the data, such as phases, ferroic variants, and the boundaries between them. To obtain an easily interpretable feature analysis with well-defined boundaries in a principled and unsupervised manner, we present a physics-augmented machine learning method that combines the capability of variational autoencoders to disentangle factors of variability within the data with a physics-driven loss function that seeks to minimize the total length of the discontinuities in images corresponding to latent representations. Our method is applied to various materials, including NiO-LSMO, BiFeO3, and graphene. The results demonstrate the effectiveness of our approach in extracting meaningful information from large volumes of imaging data. The customized code with the functions and classes required to build phy-VAE is available at https://github.com/arpanbiswas52/phy-VAE.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Over the last two decades, electron and scanning probe microscopies have become the instruments of choice for the visualization of the atomic and mesoscale structures and functionality of materials [1–4]. In electron microscopy, high-resolution images of 2D materials [5, 6], catalysts [7], functional oxides [8], metals, and semiconductors have become routine [9]. Scanning probe microscopy has provided images of surface topography, functional properties and forces [10, 11], and ferroelectric [12] and magnetic domains [13, 14]. High-resolution imaging in liquids and ultra-high vacuum has provided spectacular images of self-assembled monolayers [15], polymer chains [16, 17], and electronic densities of states and order parameters [18–20].

The broad availability of imaging data across length scales has further necessitated the generation of physically relevant insights into materials microstructure. These analyses are closely tied to the nature of the materials system. For example, in high-resolution scanning transmission electron microscopy (STEM) images of oxide materials we seek to identify phases and ferroic variants and to delineate the grain boundaries between them. In scanning tunneling microscopy (STM) images of quantum materials we seek to identify individual defects and the characteristic wavelengths of symmetry-breaking distortions. Traditionally, these analyses were based on Fourier-transform techniques, recently combined with linear decomposition methods such as principal component analysis (PCA). Alternatively, a broad set of analysis methods can be introduced once special points such as atomic locations are identified and the atomic neighborhoods [18–20] or image patches centered at these points are analyzed via linear decomposition methods.

The broad availability of deep learning methods over the last five years has significantly broadened the opportunities for image analysis [21]. Supervised machine learning methods have been applied to atomically resolved STEM [22–24] and STM [25–28] images. However, by definition, supervised and semi-supervised methods rely on a human operator to provide labeled examples. As such, they can discover only known structures and behaviors. The second limitation of supervised methods is their vulnerability to out-of-distribution drift effects [29]: a network trained using examples acquired under one set of imaging conditions, such as resolution, sampling, or other microscope-specific parameters, will degrade dramatically in performance under different conditions.

These considerations necessitate the development of unsupervised machine learning methods for image analysis. For systems with a well-defined crystallographic lattice, methods based on linear decompositions and variational autoencoders (VAEs) [30] were shown to be highly efficient. The development of rotationally invariant autoencoders extended these approaches to general orientations and to the discovery of chemical transformation pathways in disordered systems [31–34]. Such rotationally invariant autoencoders have also been tuned with Bayesian optimization to maximize the discovery of features from complex microscopic data [35]. However, for the time being these methods have relied on descriptors centered at individual atomic units, whereas unsupervised discovery allowing for translation invariance has remained far more complex. Overall, all analyses to date have relied on the local properties of the system, whereas inferential biases associated with possible spatial structures and periodicities have been limited.

Here, we introduce an approach for the analysis of atomic and mesoscopic images that combines the capability of VAEs to discover the factors of variability in the data under the chosen invariances with physics-based global criteria, including the existence and minimization of the length of phase boundaries and the presence of periodic orderings. This is accomplished via a custom VAE approach combining the classical reconstruction and Kullback-Leibler (KL) [36, 37] losses with a physics-based loss imposing physical constraints on global characteristics of the latent images, such as interface length or periodicities. This approach was implemented for data sets of ferroelectric domains in BiFeO3 [38], the LSMO-NiO interface [39], and graphene [40].

2. Methodology

Here, we discuss the key components of this paper, namely the VAE and the shift- (or translation-) invariant variational autoencoder (sh-VAE) [41]. We then discuss the formulation of the customized physics-driven loss function and its integration into the sh-VAE to build the physics-driven VAE model (phy-VAE) for improved image analysis and discovery. It is worth mentioning that, although in this paper we frame the phy-VAE as a modification of the sh-VAE, it can be built in the same way on top of any other VAE model.

2.1. From VAE to shift-invariant variational autoencoder (sh-VAE)

A VAE [30] is a deep generative probabilistic model that belongs to the family of probabilistic graphical models and variational Bayesian methods. The purpose of a VAE is to infer (via latent space learning) the factors of variability of large-scale complex data. The two primary components of any VAE are the encoder (stochastic inference model) and the decoder (generative model). The encoder consists of convolutional neural network layers, ${f_e}$, with $e$ being the parameters (weights and biases) and an activation function at each layer, followed by flattening of the data. Thus, given an N-dimensional input in the real space, ${\boldsymbol{{X}}}$, the encoder projects the input into reduced n-dimensional (where $n < N$) latent variables, ${\boldsymbol{{Z}}}$, following a distribution $p({\boldsymbol{{Z}}}|{f_e}\left( {\boldsymbol{{X}}} \right))$. The decoder structure is the reverse of the encoder: another set of convolutional neural network layers, ${f_d}$, with $d$ being the parameters (weights and biases). From the latent maps, the data are de-flattened and fed into the neural network with an activation function, and finally the observed data are scored against a Bernoulli (Ber) likelihood parametrized by the decoder output in the reconstructed high-dimensional real space (similar to the input space). Here we model the decoder as outputting the parameters (probabilities) of a Bernoulli distribution, and the 'observation' (equivalent to the reconstructed output) is considered a sample from this Bernoulli distribution. This is in line with the underlying probabilistic model of the VAE and makes its probabilistic nature explicit. For more details, we refer to an introductory tutorial on VAE implementation as a deep probabilistic model [42]. Thus, given any sample in the low-dimensional latent space, the decoder reconstructs the n-dimensional latent variables into the respective estimated N-dimensional input, $\overline {\boldsymbol{{X}}} $, following a likelihood distribution $p(\overline {\boldsymbol{{X}}} |{f_d}\left( {\boldsymbol{{Z}}} \right))$. This encoding-decoding process needs to be optimized such that the model can best learn the training data while minimizing the loss of information. The loss function of a VAE, ${\mathcal{L}_{{\text{vae}}}}$, which we minimize, can be defined as the sum of the reconstruction error and the KL divergence, as in equation (1). The reconstruction error $\varphi $ can be chosen as, e.g., the mean squared error or the cross-entropy error; here, we take the reconstruction loss to be the negative log-likelihood. The KL divergence [36, 37], ${D_{{\text{KL}}}}(p(z|{f_e}\left( x \right))\,||\,p\left( z \right))$, is the distance between the prior distribution $p\left( z \right)$ (usually chosen as a standard Gaussian) and the posterior distribution $p(z|{f_e}\left( x \right))$ of the latent representation learned from the data. VAE models have been applied to materials systems for various tasks such as classification, feature or pattern recognition, and prediction through unsupervised, semi-supervised, or supervised learning [41, 43–46].

${\mathcal{L}_{{\text{vae}}}} = \varphi + {D_{{\text{KL}}}}\left( {p(z|{f_e}\left( x \right))\,||\,p\left( z \right)} \right)$   (1)

where $\varphi $ is the reconstruction loss and ${D_{{\text{KL}}}}(p( \cdot |x)\,||\,p( \cdot ))$ is the KL divergence.
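To make the loss in equation (1) concrete, the snippet below is a minimal sketch of a single VAE loss evaluation, assuming a PyTorch encoder that returns Gaussian latent parameters and a decoder that returns Bernoulli logits; the names `encoder` and `decoder` are placeholders rather than the pyroVED API.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Equation (1): reconstruction (negative Bernoulli log-likelihood) + KL divergence."""
    mu, log_var = encoder(x)                                      # parameters of q(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)      # reparameterization trick
    logits = decoder(z)                                           # Bernoulli logits for the reconstruction
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ) against a standard Gaussian prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```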

This general or vanilla VAE, as stated in equation (1), is not robust to variability (e.g. rotation, translation, scale) of the data during the training process. However, extensions of the vanilla VAE have been developed that enforce rotational, translational, or scale invariances, individually or together, while learning the continuous latent representations or jointly learning both continuous and discrete latent representations from the data, as implemented in the pyroVED package in Python [47]. Here, we consider the specific example of the shift-invariant variational autoencoder (sh-VAE), as demonstrated in figure 1. As stated, the sh-VAE is one of the extensions of the vanilla VAE in which shift or translational variability of the data is enforced during model training. The encoding-decoding architecture of the sh-VAE is very similar to that of the vanilla VAE described above, except that the latent maps are separated into conventional latent maps and special maps associated with the positional variance of the object of interest in the data. Here, the latent maps consist of latent vectors, sampled from a normal distribution, and a translational transformation matrix, where the model learns the transformation matrix in an unsupervised way such that the learned latent maps are invariant to it. To describe the training process in the sh-VAE: the first $d$ columns of the latent matrix, i.e. the special latent variables, are used to shift the coordinate grid of the input data by $k\Delta t$, where $0 < k \leqslant 1$ represents the belief in the degree of disorder in the system and $\Delta t$ is the matrix formed by the first $d$ columns. This enhancement is based on the Galilean principle that mechanical equations of motion are invariant under the $t' = t + \Delta t$ transformation, which is not accounted for by the classical VAE. The shifted coordinate grid is then augmented with the remaining columns of the latent variables, i.e. the conventional latent variables, before being passed to the decoder. The loss function of a beta sh-VAE, ${\mathcal{L}_{{\text{sh-vae}}}}$, can be written as equation (2)

${\mathcal{L}_{{\text{sh-vae}}}} = \varphi + \beta \left( i \right)\,{D_{{\text{KL}}}}\left( {p(z|{f_e}\left( x \right))\,||\,p\left( z \right)} \right)$   (2)

Figure 1. Shift-invariant variational autoencoder (sh-VAE) framework.

where $\beta \left( i \right)$ is the continuous scale factor of the KL divergence at the $i{\text{th}}$ training cycle of the sh-VAE. The scale factor encourages a better disentanglement of the data and thus provides better learning [48]. For more details on the development of the sh-VAE, readers are referred to [41]. Previously, the sh-VAE has been applied to the unsupervised discovery of features, patterns, and order parameters from complex experimental microscopic (STM, STEM) images, where the key features of several training images have positional variance [41]. However, the sh-VAE model (like any other VAE model) does not consider any physical characteristics of the data, which can reduce the accuracy of learning from such complex images and compromise the extraction of physically relevant knowledge.
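To illustrate how the special latent columns act on the coordinate grid before decoding, the sketch below shifts a normalized pixel grid by $k\Delta t$ and concatenates the shifted grid with the conventional latent variables; this is a simplified, hypothetical rendering of the procedure described above, not the pyroVED implementation.

```python
import torch

def shift_and_concat(z, grid, k=1.0, d=2):
    """z: (batch, latent_dim) latent samples; grid: (num_pixels, d) coordinates in [-1, 1].

    The first d latent columns are treated as the translation Delta_t; the grid is shifted
    by k * Delta_t and the remaining (conventional) latent columns are appended to every
    grid point before being passed to a coordinate-conditioned decoder.
    """
    delta_t = z[:, :d]                                            # special (translation) latents
    z_conv = z[:, d:]                                             # conventional latents
    shifted = grid.unsqueeze(0) + k * delta_t.unsqueeze(1)        # (batch, num_pixels, d)
    z_tiled = z_conv.unsqueeze(1).expand(-1, grid.shape[0], -1)   # (batch, num_pixels, latent_dim - d)
    return torch.cat([shifted, z_tiled], dim=-1)                  # decoder input
```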

Here, we enhance the performance of the VAE models by introducing inferential biases that allow known physical laws and constraints to be encoded. Thus, we propose a physics-driven VAE (phy-VAE) model to facilitate the training process for the analysis of atomic and mesoscopic images by combining the existing capability of VAEs to discover the factors of variability in the data under the stated invariances with a new capability of learning physics-based global criteria, refining the discovery of valuable information from such complex data. It is to be noted that, although in this paper we focus on the enhancement of the earlier-described sh-VAE into the phy-VAE, where we aim to inject known physical behavior of a system into the training process or latent space learning, this enhancement is well suited to any chosen VAE model, since no VAE model possesses prior physical knowledge of the system or guarantees that learning converges in accordance with the known physics. In the next part, we therefore focus on the extension of the sh-VAE to the phy-VAE, concentrating on the enhancement (the proposed contribution) part of the modelling. The other components, such as the encoder, latent map generation, and decoder, are the same as described earlier in this section.

2.2. Proposed work: phy-VAE, a customized sh-VAE

Figure 2 shows the general approach to integrating the physics-driven criteria into a VAE model to develop the phy-VAE. As an extension of the sh-VAE, the loss function can be rewritten as in equation (3) below.

${\mathcal{L}_{{\text{phy-vae}}}} = {\mathcal{L}_{{\text{sh-vae}}}} \times \left( {{\Psi }} + w \right)$   (3)

Figure 2. A general approach to physics-driven VAE.

where ${{\Psi }}$ is the physics-driven loss and $w$ is a slack (a small value), with $0 \leqslant w \leqslant 0.5$, added to the physics-driven loss to avoid numerical errors. In this paper, we derive two loss functions that serve the goal of better learning the phase boundaries by maximizing the smoothness of the learned spatial maps of the encoded latent variables. While the reconstruction and KL divergence losses learn the factors of variability of the data, the new loss component imposes the physical constraints of the data.
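In code, the multiplicative coupling of equation (3) amounts to scaling the conventional sh-VAE loss by the slack-shifted physics loss. A minimal sketch, assuming a callable `sh_vae_loss` for equation (2) and a physics loss `psi` (SL1 or SL2) acting on the latent maps; both names are illustrative.

```python
def phy_vae_loss(x, z_maps, sh_vae_loss, psi, w=0.1):
    """Physics-augmented loss of equation (3): conventional loss times (physics loss + slack)."""
    conventional = sh_vae_loss(x)   # reconstruction + beta-scaled KL, as in equation (2)
    physics = psi(z_maps)           # SL1 (edge magnitude) or SL2 (FFT peak-intensity ratio)
    return conventional * (physics + w)
```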

Here, the first smoothing loss for ${{\Psi }}$ is computed through a Scharr edge detection filter. As such, it imposes the condition that the total length of the edges within a latent image should be minimal, corresponding to the minimization of the interface length due to surface tension. The second smoothing loss is computed through the peak intensity of the Fourier transform of the spatial latent maps. According to Parseval's theorem, the total intensity of the Fourier image is equal to the roughness spectrum of the real-space image. The intensity of the Fourier image is partitioned between the peaks corresponding to the crystal lattice, the background corresponding to disordered features (which would be a ring if there were a characteristic length scale), and the zero-wavelength peak corresponding to the large-scale features in the image. Additionally, Moiré effects due to the beating between the sampling grid and the crystallographic lattice result in clear Fourier peaks. By reducing the intensity of the Fourier peaks in the latent images, we minimize these effects, leading to minimization of the smoothing loss. For simplicity, we call these two losses SL1 and SL2, and either of them enters equation (3) as ${{\Psi }}$ to be minimized. Notably, in equation (3) we multiply ${{\Psi }}$ with the conventional loss instead of adding it. This decision follows from our extensive analysis of the best way to augment the physics loss. Although minimizing the loss function produces similar results whether ${{\Psi }}$ is added or multiplied, maximizing the loss function when ${{\Psi }}$ is added does not work as expected: as it turned out, maximization did not reduce the smoothness of the latent space. For the optimization problem, the result should respond consistently to whether we minimize or maximize, which was the case when multiplying by ${{\Psi }}$. We clarify this further during the analysis in the results section. Algorithms 1 and 2 provide the detailed workflow to compute SL1 and SL2.

Algorithm 1. SL1: edge magnitude of latent images (considering latent dim = 2).
  • 1. Generate the latent maps ${\boldsymbol{{Z}}} = \left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$, given the current trained model ${\boldsymbol{\Lambda }}$ and data ${\boldsymbol{{d}}}$.
  • 2. Normalize ${\boldsymbol{{Z}}} = \left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$.
  • 3. Apply a denoising filter to the latent maps. Here we use a bilateral filter, tuned appropriately for the data; we choose this filter because it is an edge-preserving denoising filter. We use the function 'denoise_bilateral' from the scikit-image Python library. We denote the denoised latent maps output by the bilateral filter again as ${\boldsymbol{{Z}}} = \left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$.
  • 4. Given the denoised latent maps ${\boldsymbol{{Z}}} = \left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$, find the edge magnitude. Here we use the Scharr transform, implemented by the scikit-image function 'filters.scharr'. The function outputs the edge maps (arrays) of $\left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$ as ${{\boldsymbol{{e}}}_{{z_1}}},\,{{\boldsymbol{{e}}}_{{z_2}}}$, respectively.
  • 5. Calculate loss: compute the mean of the sum of the edge maps of the denoised latent images as per equation (4),
${\text{SL1}} = {\text{mean}}\left( {{{\boldsymbol{{e}}}_{{z_1}}} + {{\boldsymbol{{e}}}_{{z_2}}}} \right)$   (4)
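Algorithm 1 maps directly onto scikit-image primitives. The sketch below is one plausible implementation, assuming the two latent maps are available as 2D NumPy arrays; passing the denoising factor as `sigma_color` is an assumption about how the bilateral filter is tuned.

```python
import numpy as np
from skimage.restoration import denoise_bilateral
from skimage.filters import scharr

def sl1_edge_loss(z1, z2, denoise_factor=0.1):
    """SL1 (equation (4)): mean of the summed Scharr edge magnitudes of the denoised latent maps."""
    edges = []
    for z in (z1, z2):
        z = (z - z.min()) / (z.max() - z.min() + 1e-12)       # step 2: normalize to [0, 1]
        z = denoise_bilateral(z, sigma_color=denoise_factor)  # step 3: edge-preserving denoising
        edges.append(scharr(z))                               # step 4: edge-magnitude map
    return np.mean(edges[0] + edges[1])                       # step 5: loss
```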
Algorithm 2. SL2: peak intensity of Fourier transform of latent images (considering latent dim = 2).
  • 1. Generate the latent maps ${\boldsymbol{{Z}}} = \left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$, given the current trained model ${\boldsymbol{\Lambda }}$ and data ${\boldsymbol{{d}}}$.
  • 2. Fourier transform: conduct a 2D discrete Fourier transform of $\left( {{{\boldsymbol{{z}}}_1},\,{{\boldsymbol{{z}}}_2}} \right)$ and shift the zero-frequency component to the center of the spectrum. We denote the Fourier-transformed (FFT) latent maps as ${{\boldsymbol{\mathcal{F}}}_{\boldsymbol{{Z}}}} = \left( {{\mathbb{f}_{{z_1}}},{\mathbb{f}_{{z_2}}}} \right)$. Next, we apply a log conversion as $\log {\mathbb{f}_{{z_1}}} = \log \left( {\left| {{\mathbb{f}_{{z_1}}}} \right| + 1} \right)$ and $\log {\mathbb{f}_{{z_2}}} = \log \left( {\left| {{\mathbb{f}_{{z_2}}}} \right| + 1} \right)$.
  • 3. Calculate the mean total peak intensity as ${\text{p}}{{\text{i}}_{{\text{total}}}} = \frac{{\sum {\log {\mathbb{f}_{{z_1}}}} + \sum {\log {\mathbb{f}_{{z_2}}}} }}{2}$. Choose the central indices (grid) $\left[ {{x_{{\text{min}}}},\,{x_{{\text{max}}}}} \right],\,\left[ {{y_{{\text{min}}}},\,{y_{{\text{max}}}}} \right]$ of the FFT latent maps, with $0 < {x_{{\text{min}}}},\,{y_{{\text{min}}}} < \frac{i}{2}$ and $\frac{i}{2} < {x_{{\text{max}}}},\,{y_{{\text{max}}}} < i$, where $i$ is the side length of each (square) latent map. Here, to define the grid of central indices, ${x_{{\text{min}}}},\,{y_{{\text{min}}}}$ are calculated as $\frac{i}{2} - \alpha $ and ${x_{{\text{max}}}},\,{y_{{\text{max}}}}$ as $\frac{i}{2} + \alpha $, where $\alpha = 1$. To extend the central grid, the value of $\alpha $ can be increased as long as $\alpha < \frac{i}{2}$. Next, we calculate the total peak intensity of the FFT latent maps within the defined central grid as $\mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_1}}}$ and $\mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_2}}}$, i.e. the sums of the FFT log values (see step 2) at the indices falling within the chosen central grid.
  • 4. Calculate the total intensity outside the central peak as $\sum {\log {\mathbb{f}_{{z_1}}}} - \mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_1}}}$ and $\sum {\log {\mathbb{f}_{{z_2}}}} - \mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_2}}}$, i.e. the sums of the FFT log values (see step 2) at the indices falling outside the chosen central grid defined in step 3.
  • 5. Calculate loss: compute the ratio of the mean outside-central-peak intensity to the mean total peak intensity of the FFT latent images as per equation (5),
${\text{SL2}} = \frac{{{\text{p}}{{\text{i}}_{{\text{outside}}}}}}{{{\text{p}}{{\text{i}}_{{\text{total}}}}}},\quad {\text{p}}{{\text{i}}_{{\text{outside}}}} = \frac{{\left( {\sum {\log {\mathbb{f}_{{z_1}}}} - \mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_1}}}} \right) + \left( {\sum {\log {\mathbb{f}_{{z_2}}}} - \mathop \sum \nolimits_{{x_{{\text{min}}}},\,{y_{{\text{min}}}}}^{{x_{{\text{max}}}},\,{y_{{\text{max}}}}} \log {\mathbb{f}_{{z_2}}}} \right)}}{2}$   (5)
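Algorithm 2 can likewise be sketched with NumPy's FFT routines. The exact inclusive/exclusive bounds of the central grid are an assumption; the structure of the computation follows steps 2-5 above.

```python
import numpy as np

def sl2_fft_peak_loss(z1, z2, alpha=1):
    """SL2 (equation (5)): ratio of mean off-center FFT intensity to mean total FFT intensity."""
    outside, total = [], []
    for z in (z1, z2):
        f = np.fft.fftshift(np.fft.fft2(z))           # step 2: FFT with zero frequency at the center
        log_f = np.log(np.abs(f) + 1.0)               # step 2: log conversion
        i = z.shape[0]                                # side length of the square latent map
        lo, hi = i // 2 - alpha, i // 2 + alpha       # step 3: central grid indices
        central = log_f[lo:hi + 1, lo:hi + 1].sum()   # intensity inside the central grid
        total.append(log_f.sum())                     # step 3: total intensity
        outside.append(log_f.sum() - central)         # step 4: intensity outside the central peak
    return np.mean(outside) / np.mean(total)          # step 5: loss
```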

Here, as for the conventional VAE loss computation, these new losses are calculated at each training epoch in batches, using the same batch data $\left( {\boldsymbol{{d}}} \subset {\boldsymbol{{D}}} \right)$ as in the VAE loss calculation, where ${\boldsymbol{{D}}}$ is the full training data. Batches, instead of the full training data, are used to keep the memory usage and computational cost of model training reasonable for practical implementations. However, because we use batches (low-resolution latent maps), SL1 and SL2 are approximations of the exact values that would be obtained with the full training data (high-resolution latent maps). In addition to the loss functions, we have also optionally integrated a visualization function (refer to figure 2) to track the boundary segmentation of the latent images at the current state of phy-VAE training. We discuss the boundary segmentation techniques in the results section.

In algorithms 1 and 2, the process of generating the latent maps (step 1) differs depending on the specific VAE model (e.g. vanilla VAE, rVAE, sh-VAE) implemented for a given problem. It is to be noted that the phy-VAE is universal and can be integrated with any VAE model; for the demonstration, however, we implement it here with the sh-VAE. In this case, for the sh-VAE to encode the positional information of the input image patches, we store their original coordinates in the image space and use them to construct a 'map' in which the value of a latent variable z is assigned to each coordinate. Here the first two columns of the z matrix are used to shift the coordinate grid by $k\Delta t$, with $\Delta t = z\left[ {:,\,:2} \right]$ and $0 < k \leqslant 1$, and then the transformed grid is concatenated with the remaining columns of the latent variables before being passed to the decoder. The details of the latent map generation workflow in the sh-VAE can be found in [41].
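One way to rebuild such spatial latent maps from patch-level encodings is sketched below, assuming the patches were sampled row-major on a regular square grid of known shape; the function and variable names are illustrative rather than taken from the released code.

```python
import numpy as np

def latent_maps_from_patches(z, grid_shape, skip_cols=2):
    """Assemble spatial latent maps from per-patch latent vectors.

    z: (num_patches, latent_dim) encodings ordered row-major over the sampling grid.
    grid_shape: (rows, cols) of the patch sampling grid.
    skip_cols: number of leading special (translation) columns to discard.
    """
    z_conv = z[:, skip_cols:]     # keep only the conventional latent columns
    rows, cols = grid_shape
    # One 2D map per conventional latent dimension, e.g. z_1 and z_2 for a 2D conventional latent space
    return [z_conv[:, j].reshape(rows, cols) for j in range(z_conv.shape[1])]
```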

3. Results

To demonstrate the proposed phy-VAE workflow, we showcase different experimental image data generated from microscopes: STEM images of a NiO-LSMO heterostructure (figure 3(a)) and of BiFeO3 (figure 3(b)), and STM images of a graphene system (figure 3(c)). These datasets have been used in prior publications [29, 40, 49], which give the details of the nature of the systems and the imaging conditions. These example systems have complex, spatially varying patterns at atomic resolution. We considered these systems for the two-phase mixture of perovskite LSMO (lanthanum strontium manganite, LaxSr1-xMnO3) and rock-salt NiO in the NiO-LSMO composite, the two-phase domain structure of the SrTiO3 substrate and the ferroelectric with ferroic variants in BiFeO3, and the perturbation of the electronic structure of graphene by a large number of individual defects. The NiO-LSMO, BiFeO3, and graphene systems were chosen to ascertain the performance of the proposed physical VAE for systems with an increasing degree of complexity. In all three cases, we expect a priori known factors of variation (based on human perception of the images and prior domain expertise), as well as the potential presence of unknown (and invisible to the human eye) factors of variation. However, for the unknown factors of variation we expect that the spatial distribution of the associated fields will follow certain physical constraints (and in fact these are introduced via the physical loss in phy-VAE). The images are transformed into collections of patches sampled over a square grid. The window size and the grid sampling can be varied to match the specific physics of the image. Here we have considered a window size of 48 unit cells, with a grid step of 8 unit cells. Note that the grid spacing in this case is chosen randomly and does not contain any information on atomic positions. We also note that for the graphene data we found that each patch should be normalized; otherwise, the relative intensity changes become the primary factor of variation in the data. The full datasets for NiO-LSMO, BiFeO3, and graphene consist of 15 129, 233 289, and 11 881 images, respectively. Finally, the dimension of the dataset matrices is [N × 48 × 48], where N is the number of images in each case study, from which we aim to extract meaningful physical information from the learned latent space of the trained phy-VAE models in an unsupervised fashion.

Figure 3. Experimental images of (a) STEM of NiO-LSMO (Reproduced from [29]. CC BY 4.0), (b) STEM of Sm-doped BiFeO3 and (c) STM of Graphene Oxide.

We first train standard sh-VAE models with these data (see figure S1 in the supplementary material) to visualize their ability to develop physically relevant latent domain boundaries, as a reference for comparison with the respective latent maps after training with phy-VAE. For all case studies, in both the sh-VAE and phy-VAE model architectures, we considered [128, 128] hidden units in each layer of the encoder and decoder, a 'tanh' activation function for the inner layers of the encoders and decoders, a sigmoid activation function for the decoder output, and a Bernoulli distribution as the decoder sampler. Additionally, for the purpose of visualization we considered the latent dimension to be 2. It is to be noted that we could increase the latent dimension for potentially better performance; however, that analysis is left for future work. For the model training process, we considered the Adam optimizer with learning rate 0.3. To process the full dataset into batches, we considered a batch size of 100 for the BiFeO3 dataset and a batch size of 64 for the NiO-LSMO and graphene datasets.
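For reference, the settings quoted above translate into a configuration along the following lines; this is an illustrative summary of the stated values, not the authors' training script.

```python
# Illustrative summary of the model and training settings stated in the text
config = {
    "hidden_units": [128, 128],        # encoder and decoder hidden layers
    "hidden_activation": "tanh",       # inner layers of the encoders and decoders
    "output_activation": "sigmoid",    # decoder output
    "decoder_sampler": "bernoulli",    # likelihood used to score reconstructions
    "latent_dim": 2,                   # chosen for ease of visualization
    "optimizer": "adam",
    "learning_rate": 0.3,              # as stated in the text
    "batch_size": {"BiFeO3": 100, "NiO-LSMO": 64, "graphene": 64},
    "window_size": 48,                 # patch window size
    "grid_step": 8,                    # patch sampling step
}
```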

As mentioned earlier, the purpose of training the proposed phy-VAE is to extract the factors of variability of the complex experimental data in accordance with known physical behavior, and we aim to learn these from the reduced latent maps derived by training the models in an unsupervised fashion. As in standard PCA, where feature analysis is done in a reduced eigenspace, here we first fit the high-dimensional data into a low-dimensional latent space and focus on learning the physical factors of variability of the data. A VAE model is chosen because it provides a non-linear representation of the data, whereas PCA is limited to linear representations, and it is therefore likely to capture more information. In these systems, within the domain phases, the order parameter should change continuously on the length scale of the atomic lattice; in other words, the latent variables within a phase should change smoothly. Since we focus in this paper on improved feature learning in the latent space that aligns with the known physical behavior of the system, and since the conventional VAE does not guarantee that training follows that rule, we validate the performance of the phy-VAE models against the stated domain knowledge of these systems (data).

3.1. Analysis: STEM images of NiO-LSMO heterostructure

For NiO-LSMO, we considered a denoising factor of 0.1 for bilateral filtering during the SL1 computation. For model training with either SL1 or SL2, we trained for 50 epochs with a $\beta \left( i \right)$ trajectory increasing linearly from 0.05 to 1 as the model completes the 50 epochs. We used the hyperparameter settings from [41], with a good initial guess obtained through manual tweaking, for training both the standard sh-VAE and phy-VAE models, to make the comparison fair. It is to be noted that another approach to improved (physically meaningful) data analysis would be to focus on optimizing the hyperparameters, particularly the $\beta \left( i \right)$ trajectory. However, tuning this high-dimensional, iteration-dependent hyperparameter would be very time-consuming (given that the sh-VAE model is computationally expensive) and would need to be done prior to training. One approach we took earlier to tune such a trajectory hyperparameter is latent Bayesian optimization [35], but it would still take a significant amount of time. Here, bypassing the need for, and the challenges of, prior tuning of such high-dimensional hyperparameters (e.g. $\beta \left( i \right)$), the proposed approach makes more direct use of the known physical behavior of the system, with the VAE model training the latent space simultaneously in accordance with the applied physical constraints. In this case, we observed that minimizing the objectives (SL1 or SL2) on the latent map ${z_1}$ oversimplifies the learning such that the domain boundary edges disappear. Therefore, we only optimize the latent map ${z_2}$, following algorithms 1 and 2, as the phy-VAE model learns ${z_1}$ efficiently with the existing VAE loss function.
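The linear β(i) ramp mentioned above can be generated in one line; the sketch below assumes one β value per epoch.

```python
import numpy as np

# Linear beta(i) ramp from 0.05 to 1 over the 50 training epochs used for NiO-LSMO
beta_schedule = np.linspace(0.05, 1.0, num=50)
```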

After several investigations, we adopted the following approach to the visualization (ref. figure 2) of the image segmentation of the latent maps of the full training data, both as training progresses and for the final learned latent maps. Since we compute the losses with batch data, it is essential to track the performance on the whole data. First, we apply a threshold to the latent maps to smooth the full domain if there is a negligible difference between the maximum and minimum latent values; we set this threshold to 1 × 10−3. Once this threshold is passed, we normalize and then denoise the latent maps with a bilateral filter, setting the denoising factor to 0.1. We then apply the Otsu threshold and Chan-Vese methods, using the built-in functions in the Python scikit-image library [50, 51], to enhance the boundaries between foreground and background objects in the images and to perform image segmentation, respectively. These methods, and this sequence of steps, were selected to ensure properly defined edges of the latent maps.
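A sketch of this visualization pipeline with scikit-image is given below. How the Otsu and Chan-Vese outputs are combined is not specified in the text, so the sketch simply returns both masks; the parameter names are assumptions.

```python
import numpy as np
from skimage.restoration import denoise_bilateral
from skimage.filters import threshold_otsu
from skimage.segmentation import chan_vese

def segment_latent_map(z, range_threshold=1e-3, denoise_factor=0.1):
    """Flat-map check, normalize, bilateral denoising, then Otsu and Chan-Vese segmentation."""
    if z.max() - z.min() < range_threshold:
        flat = np.zeros_like(z, dtype=bool)               # treat a nearly constant map as one domain
        return flat, flat
    z = (z - z.min()) / (z.max() - z.min())               # normalize to [0, 1]
    z = denoise_bilateral(z, sigma_color=denoise_factor)  # edge-preserving denoising
    otsu_mask = z > threshold_otsu(z)                     # foreground/background boundary enhancement
    cv_mask = chan_vese(z)                                # Chan-Vese segmentation of the denoised map
    return otsu_mask, cv_mask
```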

Figure 4 compares the learned latent maps of the phy-VAE at different slack values associated with the SL1 loss. The same training with phy-VAE models using SL2 as the physics-driven loss is provided in figure S2 in the supplementary material. We can see that the sh-VAE yields highly satisfactory results (figure S1(a)) in segmenting the NiO and LSMO phases, and therefore the purpose of the phy-VAE here is to preserve the existing quality of shape reconstruction without degrading it by enforcing the physical constraints. Comparing the panels of figure 4, the phy-VAE attains similar phase boundaries in both latent maps, and although we optimize only the latent map ${z_2}$, it keeps the actual domain boundary shape and does not affect the boundary shape of ${z_1}$. Thus, we do not see any sign of distortion or discontinuity in the learned latent manifolds and latent distributions. We find similar performance for the phy-VAE models with the SL2 objective function (figure S2). In the NiO-LSMO system, the known factor of variation is the phase composition, i.e. the single-phase regions of rock-salt NiO and perovskite LSMO. Based on general physical considerations, we expect the boundary between the phases to be smooth (minimization of the interfacial free energy). The unknown factors of variation can include strains or variations in the imaging conditions due to mis-tilt. In both cases, we expect these to be smooth on the level of the unit cells. Hence, the interpretation of the latent space and the quality of the representation disentanglement are established based on whether these constraints are satisfied. We also explore the latent distributions: a bimodal distribution suggests that the system has only two components, whereas more complex distributions suggest either problems in the VAE or the presence of additional factors of variability.

Figure 4. Application to the NiO-LSMO system. In this case, we considered SL1 (algorithm 1) as the physics-driven loss. Figure (a) shows a few samples of the input images with window size 48 × 48 (representing the axes of the images) as used in training. As mentioned earlier, these image patches are generated from the experimental image in figure 3(a). Figures (b)–(d) compare phy-VAEs at different slack values, $w$, of the physics-driven loss. For each top subfigure, the right subplots are the learned (after the completed training of 50 epochs) latent maps ${z_1},\,{z_2}$ and the left subplots are the respective image segmentation plots, where we detect phase boundaries and signs of discontinuities within the domains (referring to the known physical behavior) of the adjacent learned latent maps through the visualization process described earlier in this section. The latent images show that the first latent variable, ${z_1}$, clearly separates the NiO and LSMO phases, with minimal sampling effects due to the beating between the sampling grid and the crystallographic lattice. Comparatively, the second latent variable, ${z_2}$, has strong sampling effects. The left and right bottom subfigures for each scenario, plotted with latent variables 1 and 2 as axes, are the learned 2D latent manifolds and the encoded latent distributions (denoted by the intensity of the red color), respectively. The blue dots over the encoded latent distribution plots are the training latent samples. The latent manifolds show the input image structures in a comprehensive way, with identification of the different factors of variability. The latent manifolds are one dimensional, suggesting that there is a single factor of variability in the system, namely the phase variation from rock salt to perovskite. Additional interpretation of the figure is provided in the supplementary materials.

3.2. Analysis: STEM images of BiFeO3

Moving on to the more complex BFO system (BiFeO3), during the computation of SL1 we took the denoising factor (for bilateral filtering) to be the standard deviation of the low-resolution latent images. For model training with either SL1 or SL2, we trained for 50 epochs with the $\beta \left( i \right)$ trajectory increasing linearly from 0.1 to 1 as the model completes the 50 epochs. For the visualization, we followed an approach similar to the previous case study, except with a denoising factor of 0.3 and a multi-Otsu threshold instead of binary Otsu and Chan-Vese for the multi-label domain segmentation. For BiFeO3, the situation is more complex. Here, we still have two phases, the ferroelectric BiFeO3 and the SrTiO3 substrate, which can be distinguished by the human eye. However, within BiFeO3 we additionally have ferroelectric domains. These are not detectable by the human eye and can be visualized via physics-based analysis (atom finding). The goal for the phy-VAE is to discover these based on the images only and to explore whether there are additional (unknown) factors of variability. Again, based on general physics knowledge, we postulate that the domain walls should be sharp, whereas other factors of variability should change slowly on the length scale of a unit cell. From figure 5, we can now see the limitation of sh-VAE performance: we observe several scattered edges and grains within a phase, which deviates from the physical expectation that order parameters in physical systems change smoothly on the length scale of the unit cell. It is to be noted that we followed the same visualization process for image segmentation of the sh-VAE learned latent maps as for the phy-VAE, which supports our earlier statement on the need to enforce physics-driven knowledge in the phy-VAE rather than relying on the image analysis procedure alone. Figure 6 shows the latent maps of the phy-VAE at different training epochs (figures 6(a)–(f)), considering SL1 with $w = 0.1$. Looking at the edge detection image of ${z_2}$ in figure 6(b) (iteration 5), we see the first indication of the SrTiO3 substrate phase (blue region), which becomes more refined as the model is trained for more iterations (figures 6(c)–(f)). The large right region of the edge detection image of ${z_2}$ in those figures is the ferroelectric phase. Similarly, we can also see the first indication of the domain boundary between the SrTiO3 substrate phase and the ferroelectric phase in the edge detection image of ${z_1}$ in figure 6(d), which is refined as training progresses (figures 6(e) and (f)). Within the ferroelectric phase (figure 6(c)), we see the formation of a curved boundary segmenting the ferroic variants, which is refined as training progresses (figures 6(d)–(f)) while, at the same time, the small edges and grains within the variants of the ferroelectric segments are minimized, following the expected physical behavior, once the model has been trained beyond iteration 50 (figure 6(f)). Figure 7 compares the learned latent maps of the phy-VAE at different slack values associated with the SL1 loss. The same training with phy-VAE models using SL2 as the physics-driven loss is provided in figure S3 in the supplementary material. We clearly see that the phy-VAE with both physics-driven losses outperformed the sh-VAE (figure 5) in restoring the physical behavior.
However, interestingly, we see in figure 7(a) that learned latent map 2 does not detect the ferroic variants, while the corresponding latent map 1 provides well-defined boundaries. This can happen because latent map 1 learns all the factors of variability of the training data in reconstructing the phase, as can be seen from its encoded latent distribution. Likewise, in all cases the learned latent manifolds and distributions do not show any distortion or discontinuities, unlike the case of training with the sh-VAE. To further clarify the reasoning for multiplying by ${{\Psi }}$ in equation (3), figure S4 in the supplementary material shows an example of the effect of the physics-driven loss functions on the trained latent maps when they are maximized, where we see increased boundary edges, distortion, and discontinuities over the latent space.

Figure 5. Application of sh-VAE to the BFO system. Image segmentation maps for edge or phase boundary detection of the learned (a) latent map 1, ${z_1}$, and (b) latent map 2, ${z_2}$. We can see multiple scattered small edges and grains throughout the ferroic variant domain phase. Similarly, we see (c) the distorted learned 2D latent manifolds and (d) discontinuity in the encoded latent distributions in the latent space (denoted by the intensity of the red color). The blue dots over the encoded latent distribution plots are the training latent samples.
Figure 6. Application to the BFO system. Training process of the phy-VAE with objective SL1 and $w = 0.1$. For each panel, the top and bottom right subplots are the learned latent maps ${z_1},\,{z_2}$ and the top and bottom left subplots show edge or phase boundary detection of the respective latent maps through image segmentation.
Figure 7. Application to the BFO system. In this case, we considered SL1 (algorithm 1) as the physics-driven loss. Figure (a) shows a few samples of the input images with window size 48 × 48 (representing the axes of the images) as used in training. As mentioned earlier, these image patches are generated from the experimental image in figure 3(b). Figures (b)–(d) compare phy-VAEs at different slack values, $w$, of the physics-driven loss. For each top subfigure, the right subplots are the learned (after the completed training of 50 epochs) latent maps ${z_1},\,{z_2}$ and the left subplots are the respective image segmentation plots, where we detect phase boundaries and signs of discontinuities within the domains (referring to the known physical behavior) of the adjacent learned latent maps through the visualization process described earlier in this section. Here, the latent images clearly depict the difference between the SrTiO3 substrate (left) and two ferroelectric domain orientations in the BiFeO3 that are not visible to the human eye. The left and right bottom subfigures for each scenario, plotted with latent variables 1 and 2 as axes, are the learned 2D latent manifolds and the encoded latent distributions (denoted by the intensity of the red color), respectively. The blue dots over the encoded latent distribution plots are the training latent samples. The latent manifolds show the input image structures in a comprehensive way, with identification of the different factors of variability. The corresponding latent distributions clearly show trimodal distributions. Additional interpretation of the figure is provided in the supplementary materials.

3.3. Analysis: STM images of graphene

From the application of the phy-VAE to the previous two material systems, we see that the model, on the one hand, preserves the quality of the phase extraction in the latent maps for both physics-driven losses and, on the other hand, is able to enforce the physical constraints so that the data are learned in terms of the relevant physical behavior of the system. Therefore, we apply the phy-VAE model to the graphene system, where the domain structure is more complex to interpret (refer to figure 3(c)). Here, during the computation of SL1, we used the same setup as for the BFO system. For model training with either SL1 or SL2, we trained for 200 epochs with the $\beta \left( i \right)$ trajectory increasing linearly from 0.01 to 0.5 as the model completes 150 epochs and then kept constant at 0.5 for the remaining 50 training epochs. The image segmentation process also remains the same as for the BFO system. Figure 8 shows the learned latent maps of the phy-VAE with the SL1 and SL2 losses. For graphene, we see complex patterns associated with long-range quantum interference effects due to defects (see figure 8(a)). In this case, we explore the phy-VAE behavior on an unknown system. We segment the images for edge or phase boundary detection considering 2, 3, and 4 phases. In the figure, the segmented phases are colored according to the provided colormap, where 'L' denotes low values and 'H' denotes high values. We see that the edges and corners of the latent maps contain domains colored with smaller values, which gradually increase with the distance of the latent variables from the edges of the latent maps, with possibly new domain regions appearing. Similarly, we find no distorted latent manifolds or discontinuous latent distributions. Finally, although in this paper we have focused on a meaningful understanding of the domain phases and their boundaries through image segmentation of the latent maps, more key information likely remains to be unearthed from the data through other visualization approaches for potential scientific discoveries, which will be considered in future work.

Figure 8. Application to the graphene system. Figure (a) shows a few samples of the input images with window size 48 × 48 (representing the axes of the images) as used in training. As mentioned earlier, these image patches are generated from the experimental image in figure 3(c). Comparison between phy-VAEs with (b) objective SL1 (algorithm 1) and $w = 0.1$, and (c) objective SL2 (algorithm 2) and $w = 0.05$ as the physics-driven loss. For each left subfigure, the left subplots show the learned (after the completed training of 200 epochs) latent map 1, ${z_1}$, and the respective image segmentation for edge or phase boundary detection considering 2, 3, and 4 phases (bottom to top). The segmented phases are colored according to the provided colormap, where 'L' denotes low values and 'H' denotes high values. The right subplots show the same for the learned latent map 2, ${z_2}$. The left and right bottom subfigures for each scenario, plotted with latent variables 1 and 2 as axes, are the learned 2D latent manifolds and the encoded latent distributions (denoted by the intensity of the red color), respectively. The blue dots over the encoded latent distribution plots are the training latent samples. The latent manifolds show the input image structures in a comprehensive way, with identification of the different factors of variability. The presence of a bimodal latent distribution suggests two possible states in graphene. Additional interpretation of the figure is provided in the supplementary materials.

4. Conclusion

We developed a custom VAE, augmenting the existing VAE training loss with the physical constraints or knowledge of a material system, to guide the model training towards improved (physically meaningful) feature learning for microscopy data analysis. The proposed phy-VAE provides an unsupervised approach for the discovery of order parameters and for image segmentation in the presence of other factors of variability. Here this approach is implemented on a shift-invariant variational autoencoder model (sh-VAE) and demonstrated on open data sets of ferroelectric domains in BiFeO3, the LSMO-NiO interface, and graphene. However, the approach of combining the classical VAE KL and reconstruction losses with physics-based losses representing known physical inferential biases is universal and can be extended to other custom-designed loss functions and other VAE models. These can include more involved sparsity conditions as well as non-local conditions and relationships between the latent variables. As such, we believe this approach to be universal and applicable to the unsupervised analysis of a broad variety of physical problems.

Acknowledgments

This work was supported by the US Department of Energy, Office of Science, Office of Basic Energy Sciences, MLExchange Project, Award Number 107514, and supported by University of Tennessee (Knoxville) start-up funding. The autoencoder research in PyroVED library was supported by the Center for Nanophase Materials Sciences (CNMS), which is a US Department of Energy, Office of Science User Facility at Oak Ridge National Laboratory. Dr Matthew Chisholm is gratefully acknowledged for the STEM data used in this work.

Data availability statement

The modified code reported here, for the purpose of tutorials and application to other data, is available at: https://github.com/arpanbiswas52/phy-VAE [52].

The notebooks for the analyses in the paper can be found at the links below:

https://colab.research.google.com/drive/1yGWpLKCFH4sSVDzGZXoATxz3bNIr274X [53]

https://colab.research.google.com/drive/11rtZut5HdRuupt3wrHm0TR1WJNMUiYIX [54]

https://colab.research.google.com/drive/1ODbAfBZwzISPwPrNuIpqUXIlgNOhF60z [55]

All data that support the findings of this study are included within the article (and any supplementary files).

Conflict of interest

The authors declare no conflict of interest.

Footnotes

  • ∗ 

    Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DEAC0500OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for the United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).


Supplementary data (1.0 MB PDF)
