Real-space analysis of scanning tunneling microscopy topography datasets using sparse modeling approach

A sparse modeling approach is proposed for analyzing scanning tunneling microscopy topography data, which contains numerous peaks corresponding to surface atoms. The method, based on the relevance vector machine with $\mathrm{L}_1$ regularization and $k$-means clustering, enables separation of the peaks and atomic center positioning with accuracy beyond the resolution of the measurement grid. The validity and efficiency of the proposed method are demonstrated using synthetic data in comparison to the conventional least-squares method. An application of the proposed method to experimental data of a metallic oxide thin film clearly indicates the existence of defects and corresponding local lattice deformations.


I. INTRODUCTION
Scanning tunneling microscopy (STM) is an experimental technique that enables observation of a material surface at atomic-scale resolution [1][2][3]. An electron-density topography map is obtained with STM by measuring the tunneling current between the surface to be observed and an atomic-scale conducting tip under an applied bias voltage. Since the invention of STM, various types of scanning probe microscopy, such as atomic force microscopy, have been developed and used for measuring the surface topography and physical properties of materials.
Several interesting phenomena on surfaces have been shown to be caused by local strain induced by impurities and/or defects. For example, the critical temperature of high-$T_c$ cuprate superconductors depends significantly on local strain [4][5][6][7]. Fourier transforms are often used for extracting certain properties of surface structures, such as the set of lattice vectors of a surface reconstruction [8]. For a clean crystalline surface, Fourier transforms can be used to accurately estimate the atomic positions and the associated local strain relative to the perfect lattice structure. However, thin films of metallic oxides generally do not have clean surface structures, and it is difficult to extract local structural information from their STM topography data. In fact, the desired local information for thin-film structures can be obscured by noise in the Fourier transform. Hence, a new methodology for performing real-space data analysis beyond the Fourier transform is highly desired.
In this study, we propose a data-analysis methodology for extracting the atomic arrangement from noisy STM topography. Our method is based on the fact that the STM topography data for a given surface can be represented by a superposition of suitable basis functions plus noise, each of which is spatially localized with a center corresponding to the location of an atom. Each basis function is characterized by a set of parameters, including its center position, amplitude, and shape.
Our strategy decomposes a given STM data set into the basis functions while determining the set of parameters in the data model. This strategy could, in principle, be accomplished using the least-squares method. In fact, when the number of peaks $N_{\mathrm{peak}}$ and the shape parameter of the basis function are known in advance, this simple strategy is effective. However, because the typical number of atoms considered here can exceed ten thousand and the associated number of data points can exceed one million, it is difficult to know $N_{\mathrm{peak}}$ beforehand. Moreover, the shape parameters of the basis function are generally unknown a priori.
To establish a methodology for analyzing STM topography with an unspecified number of atoms, we use a relevance vector machine (RVM) [9] as the data model and maximum a posteriori (MAP) estimation, based on the framework of Bayesian inference, to determine the model parameters. As the prior distribution in the MAP estimation, we introduce a Laplace prior, which is equivalent to the least absolute shrinkage and selection operator (LASSO) regression [10]. Depending on the measurement resolution, the number of data points is typically much larger than the number of atoms, so the variables that we extract from the data can be "sparse." Using LASSO permits model inference that emphasizes the sparsity of the data. Recently, sparse modeling has been applied to a wide range of problems dealing with high-dimensional data. Our proposed method can be regarded as sparse modeling for STM topography data analysis.
In this paper, we present the procedures used for extracting the atomic positions and peak amplitudes from STM topography images; we discuss not only the method for determining the model parameters but also the method for validating the models. First, we apply our method to synthetic data and examine the accuracy of the estimation. Then, we report the results of applying our model to actual STM topography data obtained from a metallic oxide thin film.

A. Data model
Typical topography data obtained by STM measurements of SrVO$_3$ is shown in Fig. 1. The STM topography picture typically shown in the literature is the top view on the left of Fig. 1. The surface of SrVO$_3$ is relatively clean and flat in comparison with the surfaces of other metallic oxide compounds. Nevertheless, the bird's-eye view (Fig. 1(b)) shows that the STM image has a rugged structure. This structure may be due to atomic-scale fluctuations or the condition of the STM tip. Each peak in the figure is considered to correspond to an atom, and dark spots often indicate the existence of atomic defects. Note that STM generally probes the electronic states near the surface rather than the atoms themselves. Our aim in this work is to decompose such STM topography data into peaks, each assumed to result from a single atom.
This process is formally similar to the peak decomposition of spectral data measured in various natural science experiments. Recently, a statistical analysis technique based on Bayesian inference was used to successfully extract a finite number of peaks from one-dimensional spectral data with noise [11,12]. In this sense, our problem might be considered as peak decomposition in a two-dimensional (2D) data spectrum. The number of peaks in this study, however, is much larger than that treated in the previous studies. Thus, a different numerical strategy is required for treating the large data set.
In this study, our framework is based on the RVM. The pixel data are denoted as $y = (y_1, \cdots, y_D)$, with $D$ being the total number of pixels, and the vector $x$ represents the weights of the STM source signals to be estimated. The weight $x_i$ is defined on an artificial array point, which is generally different from the original pixel array for $y$. The dimension of the vector $x$, i.e., the number of array points introduced, is denoted by $N$. The number $N$ can be chosen independently of $D$, depending on the desired resolution of the estimation. Assuming an explicit functional form of the measurement matrix, which we discuss later, our task is reduced to inferring the relevant components of the vector $x$ for a given vector $y$.
Using a $D \times N$ measurement matrix $\hat{A}$, our data model is expressed as

$$ y = \hat{A} x + \varepsilon, \qquad (1) $$

where $\varepsilon$ is a noise vector of dimension $D$ associated with the observation. For simplicity, each element of $\varepsilon$ is assumed to be an i.i.d. Gaussian random variable with zero mean and variance $\sigma_\varepsilon^2$:

$$ P(\varepsilon_d) = \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\left( -\frac{\varepsilon_d^2}{2\sigma_\varepsilon^2} \right). \qquad (2) $$

In other words, the noise properties are independent of the pixel position, and coherent noise, such as that induced by the STM tip or by the surface condition, is not considered. In our method, the vector $x$ is estimated from a posterior distribution $P(x|y)$ for given pixel data $y$. By Bayes' theorem, the posterior distribution is expressed as

$$ P(x|y) = \frac{P(y|x) P(x)}{P(y)}, \qquad (3) $$

where $P(y|x)$ and $P(x)$ are the likelihood function and the prior distribution, respectively. We employ the MAP estimation, in which the value of $x$ is chosen by maximizing the posterior distribution. The likelihood function follows from the noise distribution of Eq. (2) as

$$ P(y|x) \propto \exp\left( -\frac{\| y - \hat{A} x \|_2^2}{2\sigma_\varepsilon^2} \right), \qquad (4) $$

where $\| \cdots \|_2$ denotes the $\mathrm{L}_2$ norm. The prior distribution in Eq. (3) is taken to be a Laplace prior over $x$:

$$ P(x) \propto \exp\left( -\lambda \| x \|_1 \right), \qquad (5) $$

where $\lambda$ is a hyperparameter and $\| \cdots \|_1$ is the $\mathrm{L}_1$ norm. This prior distribution reduces the number of nonzero elements of the vector $x$. We assume sparsity of the vector $x$ based on the reasonable expectation that the number of signal sources, i.e., atoms, is significantly smaller than the number of pixel array points. This approach is called sparse modeling. We emphasize that our framework does not specify the number of peaks $N_{\mathrm{peak}}$ at this stage: in principle, every element of $x$ could be a peak center, and sparse modeling is used to obtain a sparse solution for $x$ with a small number of non-zero elements.
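As a concrete illustration, the MAP estimate under the Gaussian likelihood and Laplace prior described above amounts to minimizing a squared-residual term plus an $\mathrm{L}_1$ penalty. The following minimal numpy sketch (function name is ours, not from the paper) evaluates that objective, in units where the noise variance is absorbed:

```python
import numpy as np

def lasso_cost(x, y, A, lam):
    """MAP objective under a Gaussian likelihood and Laplace prior:
    0.5 * ||y - A x||_2^2 + lam * ||x||_1 (noise variance absorbed)."""
    residual = y - A @ x
    return 0.5 * residual @ residual + lam * np.abs(x).sum()
```

For `lam = 0` the objective reduces to the ordinary least-squares residual, while increasing `lam` penalizes non-zero weights and drives the solution toward sparsity.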

B. Measurement matrix and MAP estimate
Our data model of Eq. (1), a linear relation with additive noise, means that the observation vector $y$ is a superposition of the basis functions, with the relevance vector $x$ as the weight factors. In this work, we assume the kernel function is an isotropic 2D Gaussian, for which the elements of the measurement matrix in Eq. (1) are given by

$$ A_{di} = \exp\left( -\frac{r_{di}^2}{2\sigma^2} \right), \qquad (6) $$

where $A_{di}$ is an element of the measurement matrix $\hat{A}$, $\sigma^2$ represents the variance, and $r_{di}$ is the spatial distance between the position of measurement $y_d$ and that of signal source $x_i$. Note that there is no theoretical or physical basis for choosing the 2D Gaussian function. In Ref. 8, a 2D isotropic Gaussian function with covariance matrix $\Sigma = \sigma I$, where $I$ is the identity matrix, is used to fit peaks in STM topography data. In the field of optics, the width of the point spread function, which corresponds to our basis function, can be measured a priori in an independent experiment. In that case, an algorithm based on the maximum-likelihood method works well [13]. However, it is difficult to obtain the value of $\sigma$ from a calibration experiment in STM, because the target surface as well as the tip state is sensitive to the experimental conditions. Our problem is therefore harder than a peak decomposition problem with known $\sigma$, in the sense that the peaks and the value of $\sigma$ must be inferred simultaneously from the input data $y$.
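The Gaussian measurement matrix of Eq. (6) is straightforward to construct from the pixel and grid coordinates. A minimal numpy sketch (names are ours):

```python
import numpy as np

def gaussian_measurement_matrix(pixel_pos, grid_pos, sigma):
    """Build A with A[d, i] = exp(-r_di^2 / (2 sigma^2)), where r_di is the
    distance between pixel d and grid point i. Positions are (n, 2) arrays."""
    diff = pixel_pos[:, None, :] - grid_pos[None, :, :]   # pairwise displacements
    r2 = np.sum(diff ** 2, axis=-1)                       # squared distances r_di^2
    return np.exp(-r2 / (2.0 * sigma ** 2))
```

In this sketch the pixel array and the estimation grid may differ, reflecting that $N$ can be chosen independently of $D$.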
The MAP estimate of $x$ is equivalent to the minimization of the cost function

$$ E(x; y, \lambda, \mu) = \frac{1}{2} \| y - \hat{A} x \|_2^2 + \lambda \| x \|_1, \qquad (7) $$

where $\mu$ denotes the set of unknown parameters in the measurement matrix $\hat{A}$. The inference scheme with this prior distribution is known as the least absolute shrinkage and selection operator (LASSO) [10], and the hyperparameter $\lambda$ determines the strength of the sparsity. Our inference scheme is a vector machine with $\mathrm{L}_1$ regularization, equivalent to the so-called L1VM [14]. In our case, the parameter set $\mu$ includes the variance $\sigma^2$ in Eq. (6). Without loss of generality, the unit of the cost function is set to $\sigma_\varepsilon^{-2}$, so the cost function is written as a function of $x$, $\lambda$, and $\mu$ only. The result is a convex optimization problem. In this work, we minimize the cost function with the fast iterative shrinkage-thresholding algorithm (FISTA) [15], which is widely used for $\mathrm{L}_1$ optimization problems. For $\lambda = 0$, minimization of the cost function reduces to the least-squares method, while for a sufficiently large value of $\lambda$, the trivial solution $x = 0$ is obtained. Therefore, an appropriate value of $\lambda$ is expected to exist between these two extremes, determined by the competition between the data fit and the sparsity of $x$. Unfortunately, an appropriate value of $\lambda$ is not known a priori. Ideally, one would choose $\lambda$ to minimize the prediction error; however, the prediction error is difficult to estimate directly. Instead, a promising method for determining the hyperparameter $\lambda$, as well as the unknown parameters in the likelihood function, is cross validation (CV).
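For reference, a compact FISTA implementation for the LASSO cost function above can be written as follows. This is our own sketch, not the authors' code; it uses the standard Beck–Teboulle iteration with the soft-thresholding operator and a Lipschitz constant $L$ given by the squared spectral norm of $\hat{A}$:

```python
import numpy as np

def fista(A, y, lam, n_iter=500):
    """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by FISTA."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z = x.copy()                         # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        u = z - grad / L
        # soft-thresholding (proximal step); threshold lam/L matches eps_th
        x_new = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

Note that the soft-thresholding step zeroes any component whose magnitude falls below $\lambda / L$, which is the origin of the sparsity of the solution.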

C. Cross validation and hyperparameter selection
In K-fold CV, the data set $y$ is divided into $K$ subsets, denoted by

$$ y^{(k)} = \{ y_d \mid d \in \Lambda^{(k)} \} \quad (k = 1, \ldots, K), $$

where $\Lambda^{(k)}$ is the index set of the elements contained in the $k$-th subset. The subsets are chosen randomly from the original $y$, and each element of $y$ appears in exactly one subset. Using the data set $\bar{y}^{(k)} = y \setminus y^{(k)}$ as a training set, we obtain the optimal solution $x^{(k)}$ that minimizes the cost function $E(x; \bar{y}^{(k)}, \lambda, \mu)$. Then, for each test set $y^{(k)}$, we calculate the CV error as

$$ \epsilon^{(k)} = \frac{1}{|\Lambda^{(k)}|} \| y^{(k)} - \hat{A}^{(k)} x^{(k)} \|_2^2, $$

where the measurement matrix $\hat{A}^{(k)}$ for the partial data set $y^{(k)}$ is obtained by restricting the rows of $\hat{A}$ to the index set $\Lambda^{(k)}$. Averaging over the $K$ test sets, the averaged CV error is defined by

$$ L_K = \frac{1}{K} \sum_{k=1}^{K} \epsilon^{(k)}. $$

Regarding the CV error as an estimate of the prediction error, the hyperparameter and the unknown parameters are determined by minimizing the averaged CV error. The CV error $L_K$ is known to approach the true prediction error in the large-$K$ limit, so ideally, we should choose a sufficiently large $K$. In particular, the case $K = D$ corresponds to so-called leave-one-out cross validation (LOOCV), which requires $D$ minimization calculations for each pair of $\lambda$ and $\mu$. This CV becomes time-consuming as the data size $D$ increases.
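The K-fold procedure above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the authors' implementation; the inner LASSO solver here is a simple ISTA loop, and all names are ours:

```python
import numpy as np

def _lasso_solve(A, y, lam, n_iter=300):
    """Simple ISTA solver for 0.5*||y - A x||^2 + lam*||x||_1 (stand-in)."""
    L = np.linalg.norm(A, 2) ** 2 + 1e-12
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        u = x - A.T @ (A @ x - y) / L
        x = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
    return x

def kfold_cv_error(A, y, lam, K=5, seed=0):
    """Averaged K-fold CV error for a given lambda: train on K-1 folds,
    measure the mean squared residual on the held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for test in np.array_split(idx, K):
        train = np.setdiff1d(idx, test)
        x = _lasso_solve(A[train], y[train], lam)
        r = y[test] - A[test] @ x
        errs.append(np.mean(r ** 2))
    return float(np.mean(errs))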
Recently, Obuchi and Kabashima [16] proposed a simplified method for performing LOOCV. Once the minimization of the cost function has been computed for the total data set, the LOOCV error $L_{\mathrm{LOO}}$ can be estimated by an approximate formula that involves $N_0(\epsilon_{\mathrm{th}})$, the number of elements of $x$ below the threshold $\epsilon_{\mathrm{th}}$. The value of $\epsilon_{\mathrm{th}}$ may depend on the solver used for minimizing the cost function of Eq. (7). Using FISTA, $\epsilon_{\mathrm{th}}$ is unambiguously obtained as $\epsilon_{\mathrm{th}} = \lambda / L$, where $L$ is a Lipschitz constant of the cost function (see Appendix A).
We performed the K-fold CV procedure for typical STM data with varying $K$ and confirmed that $L_K(\lambda, \mu)$ is almost independent of $K$ for $K \geq 10$. Thus, in the following sections, we present the results of both 10-fold CV and the approximated LOOCV for comparison. In our data model, two parameters are to be determined by CV: the LASSO tuning parameter $\lambda$ and the variance of the Gaussian function, $\mu = \{\sigma\}$. We first determine $\lambda$ for a fixed value of $\sigma$ according to the one-standard-error rule [17] often used in LASSO analysis, that is,

$$ \lambda^* = \max \left\{ \lambda \mid L_K(\lambda) \leq L_K(\hat{\lambda}) + \mathrm{SE}(\hat{\lambda}) \right\}, $$

where $\hat{\lambda} = \arg\min_{\lambda} L_K(\lambda)$ and $\mathrm{SE}(\cdots)$ is the standard error of the K-fold CV error. After choosing $\lambda^*$ as a function of $\sigma$, we choose a suitable $\sigma$ as the minimizer of the CV error, that is,

$$ \sigma^* = \arg\min_{\sigma} L_K(\lambda^*(\sigma), \sigma). $$
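The one-standard-error rule can be implemented directly once the CV error and its standard error have been tabulated over a grid of $\lambda$ values. A minimal sketch (function name is ours):

```python
import numpy as np

def one_standard_error_lambda(lams, cv_mean, cv_se):
    """One-standard-error rule: return the largest (i.e., sparsest) lambda
    whose CV error is within one standard error of the minimum CV error."""
    lams = np.asarray(lams, float)
    cv_mean = np.asarray(cv_mean, float)
    k_min = np.argmin(cv_mean)
    threshold = cv_mean[k_min] + np.asarray(cv_se, float)[k_min]
    return float(np.max(lams[cv_mean <= threshold]))
```

Choosing the largest qualifying $\lambda$ favors the sparsest model whose CV error is statistically indistinguishable from the best one.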

D. Estimation of the peak position
Our goal is to determine the positions of the atoms with sufficient resolution to quantify any local distortion. The non-zero elements of the estimated vector $x$ indicate the central peak positions, but the resolution of each position is limited by the grid size of our data model. The non-zero elements of the optimized solution $x$ form localized groups that are well separated from each other. Therefore, we can extract the peak centers from $x$ with higher resolution than the L1VM grid size by using the k-means clustering method.
We suppose that the number of peaks $N_{\mathrm{peak}}$ for the k-means clustering can be counted directly from the estimated vector $x$. This assumption is based on the fact that the non-zero elements of $x$ are highly localized on the L1VM grid. The center of the $k$-th peak, $r_k$ with $k = 1, \ldots, N_{\mathrm{peak}}$, is initially chosen as a pixel $i$ at which the element $x_i$ takes the maximum value within a radius $R$ around pixel $i$. The value of $R$ is set appropriately as the mean distance between the localized groups of elements.
Then, an attributed variable $z_i$ is allocated to each pixel $i$ as

$$ z_i = \arg\min_{k} d(i, r_k), $$

where $i = (i_x, i_y)$ is the position vector of pixel $i$ on the L1VM grid and $d(i, r_k)$ denotes the Euclidean distance between pixel $i$ and the center $r_k$. Using the attributed variables, the center is redefined as the weighted average

$$ r_k = \frac{\sum_{i : z_i = k} \theta(x_i)\, x_i\, i}{\sum_{i : z_i = k} \theta(x_i)\, x_i}, $$

where $\theta(x)$ is the Heaviside step function, meaning that elements with negative values are not considered in this analysis. Here, $r_k$ is the weighted average of the pixel positions, with the amplitudes $x$ regarded as weights.
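The two steps above (seeding centers at local maxima within radius $R$, then refining each center as an amplitude-weighted centroid) can be sketched as follows. This is our illustrative reading of the procedure, with all names our own; negative elements are excluded, playing the role of the Heaviside factor:

```python
import numpy as np

def extract_peak_centers(x_grid, R=2.0, eps=1e-8):
    """Seed peaks at local maxima of x within radius R, assign each active
    pixel to its nearest seed, and return x-weighted centroid per cluster."""
    pos = np.argwhere(x_grid > eps).astype(float)   # active pixels (row, col)
    vals = x_grid[x_grid > eps]                     # same row-major order
    # seeds: active pixels that are maximal within radius R
    seeds = []
    for p, v in zip(pos, vals):
        d = np.linalg.norm(pos - p, axis=1)
        if v >= vals[d <= R].max():
            seeds.append(p)
    seeds = np.array(seeds)
    # attribute each active pixel to its nearest seed (the z_i assignment)
    labels = np.argmin(np.linalg.norm(pos[:, None] - seeds[None], axis=2), axis=1)
    # weighted centroid per cluster: sub-pixel center estimate
    return np.array([
        np.average(pos[labels == k], axis=0, weights=vals[labels == k])
        for k in range(len(seeds))
    ])
```

Because the centroid averages several active pixels with non-integer weights, the returned centers are not restricted to the grid, which is the origin of the sub-grid resolution.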

A. Synthetic data and typical examples of estimated data
First, we examine the validity and reliability of the proposed method using a synthetic data set, generated by the following procedure. For given primitive basis vectors $a_1$ and $a_2$ of the 2D lattice, the lattice vector $r_i$ of the $i$-th lattice point is defined as

$$ r_i = m\, a_1 + n\, a_2 + \xi_i, $$

where $m$ and $n$ are integers and $\xi_i$ is a uniformly random vector representing a local lattice distortion. All atoms are allocated at the lattice points in the region $[0 : \ell] \times [0 : \ell]$, where the length unit of the lattice data is set to 1px. Some lattice points are designated as vacancy sites, chosen randomly with probability $\rho_{\mathrm{vac}}$. The number of peaks is given by $N_{\mathrm{peak}} = (1 - \rho_{\mathrm{vac}}) N_{\mathrm{tot}}$, with $N_{\mathrm{tot}}$ being the number of lattice points in the region under consideration. Thus, the atom positions to be inferred from the imaging data are $\{\bar{r}_k\}$ $(k = 1, \ldots, N_{\mathrm{peak}})$. The amplitude $\{\bar{x}_k\}$ $(k = 1, \ldots, N_{\mathrm{peak}})$ of each peak is drawn as a Gaussian random variable with mean 1 and variance $\sigma_x$. Using the set of parameters $\{\bar{x}, \bar{r}\}$, the synthetic data $y(\bar{x}, \bar{r})$ is generated through the measurement matrix via Eq. (1). We fix the following parameters: $\sigma = \sigma_{\mathrm{true}} \equiv 2.25$, $\ell = 64$px, $\rho_{\mathrm{vac}} = 0.02$, $R_{\mathrm{center}} = 0.15$px, $\sigma_x = 0.01$, and $\sigma_\varepsilon = 5 \times 10^{-4}$. The synthetic data used in this section is shown in Fig. 2.

For this synthetic data, we first perform the optimization using the least-squares method ($\lambda = 0$) for fixed $\sigma = \sigma_{\mathrm{true}} (= 2.25)$. As shown in Fig. 3, the optimized vector $x$ contains both positive and negative values and fluctuates wildly, with an amplitude ($x_i \approx 100$) that is huge compared with the original signal amplitude ($y_d \approx 0.01$). This result demonstrates that the least-squares method overfits the data $y$, and a non-sparse solution for $x$ is obtained when the regularization term is absent. Negative values also remain in the finite-$\lambda$ estimates shown in Fig. 4(a), although the true values $\bar{x}$ have no negative values. Moreover, the variables are noisy in the higher-$\lambda$ regime, such as $\lambda = 10^{-3}$, in comparison with the $\lambda = 10^{-4}$ case.
Therefore, there must be a suitable value of λ between these two extremes, which is to be determined using the CV.
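The synthetic-data generation procedure described above can be sketched as follows. This is a minimal numpy version under our own parameter names (the distortion range `xi` and amplitude spread `amp_std` stand in for the paper's $\xi_i$ and $\sigma_x$):

```python
import numpy as np

def synthetic_lattice(a1, a2, n_cells, rho_vac=0.02, xi=0.15,
                      amp_mean=1.0, amp_std=0.1, seed=0):
    """Distorted 2D lattice: r = m*a1 + n*a2 + xi_i, with random vacancies
    (probability rho_vac) and Gaussian-distributed peak amplitudes."""
    rng = np.random.default_rng(seed)
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    pts = np.array([m * a1 + n * a2 + rng.uniform(-xi, xi, size=2)
                    for m in range(n_cells) for n in range(n_cells)])
    keep = rng.random(len(pts)) >= rho_vac        # drop vacancy sites
    amps = rng.normal(amp_mean, amp_std, size=keep.sum())
    return pts[keep], amps
```

Feeding the returned positions and amplitudes through a Gaussian measurement matrix, and adding Gaussian pixel noise, reproduces the forward model of Eq. (1) used to build the test image.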

B. Result of Cross Validation
For the synthetic data, we performed the 10-fold CV and LOOCV in order to determine the suitable parameters $\lambda^*$ and $\sigma^*$. The results of the 10-fold CV and LOOCV are shown in Figs. 5(a) and (b), respectively. There are no apparent quantitative differences between these results in the regime $10^{-6} < \lambda < 10^{-3}$, indicating that the approximated LOOCV error provides a good estimator of large-$K$ K-fold CV errors. We then choose the optimal $\lambda^*(\sigma)$ for each $\sigma$ in accordance with the one-standard-error rule. Next, we study the $\sigma$-dependence of the 10-fold CV error $L^{(10)}(\sigma, \lambda^*(\sigma))$ and the LOOCV error $L_{\mathrm{LOO}}(\sigma, \lambda^*(\sigma))$, shown in Fig. 6. Both CV errors take a minimum at around $\sigma^* = 2.225$. The error bars displayed in Fig. 6 represent the standard error of each CV error. The mean value $L$ in the parameter regime $\sigma = 2.225 \pm 0.050$ is within one standard error of the minimum at $\sigma^* = 2.225$. Hence, this result is consistent with the true value $\sigma_{\mathrm{true}} = 2.25$.

By choosing the (hyper)parameters with CV, the optimized amplitude $x^*(\sigma^*, \lambda^*)$ is obtained. The final solution $x^*(\sigma^*, \lambda^*)$ is sparse, and $x_i$ has a finite value only near the true peak positions, as shown in Fig. 8, where we plot the amplitudes of the true peaks $\bar{x}$ and the estimated L1VM amplitudes for a portion of the grid. For each true peak, there are several "active" pixels with non-zero elements $x_i$; the number of active pixels ranges between two and five, depending on the resolution of the L1VM grid. We then obtained the peak positions $r^*$ by applying the above-mentioned k-means clustering method to the optimized L1VM solution $x^*(\sigma^*, \lambda^*)$. In Fig. 9, we show the obtained positions $r^*$ together with the true positions $\bar{r}$, as well as the difference between the true positions and their corresponding estimates on the right side of Fig. 9. No significant differences are observed.
In fact, the accuracy of our estimation is within 1px, meaning that the positions of the peaks are extracted from the STM data with accuracy beyond the resolution of the input signal. This is our main claim in this paper.

D. Application to real experimental data
The presented results for the synthetic data demonstrate the validity of our method. Before applying our scheme to real experimental data sets, some issues must be addressed. For example, the choice of the basis function is one of the essential problems, because the basis function must depend on the surface material. Nevertheless, assuming a Gaussian basis function, we apply our scheme to experimental data from STM topography measurements of a SrVO$_3$ thin film. Fig. 10 presents the tentative results obtained by our scheme. Many defects are clearly observed on the square lattice, and the local lattice distortion is enhanced around the defects. Since our method is not based on Fourier transforms, it can directly detect real-space properties such as local distortion and/or strain. Details of the physical properties of the material are discussed in a separate paper.

IV. CONCLUDING REMARKS
In this study, we proposed an efficient data-analysis method for STM topography datasets that allows highly accurate extraction of peak centers. Technically, our main problem belongs to the class of 2D peak decomposition problems with a large, unspecified number of peaks. Examples of such problems include NMR spectral data and X-ray or neutron diffraction pattern data. Therefore, our scheme could be applicable to a wide range of datasets by changing the basis function.
First, we discuss the computational cost of our method. An elementary step of the $\mathrm{L}_1$ optimization consists of FISTA. For estimating an $N$-dimensional vector $x^*$, the computational cost of FISTA is $O(N^2)$ due to the matrix-vector product. The typical computation time required for convergence of the $x^*(\sigma, \lambda)$ estimation in the analysis of $64 \times 64$ pixel data is about 30 sec on a standard single-core laptop computer. In this case, the dimensions of the measurement matrix $\hat{A}$ are $4096 \times 4096$. The typical size of STM topography data is $512 \times 512$ pixels, for which the measurement matrix becomes tremendously large. However, a suitable cutoff length reduces the number of relevant elements in the measurement matrix when the basis function is spatially localized, such as the Gaussian basis function used in this study. We succeeded in a preliminary analysis of $512 \times 512$ pixels of real data using a set of cluster machines. In our method, most of the computational time is devoted to the hyperparameter estimation by cross validation. As shown in Figs. 5 and 6, our results indicate that the approximated LOOCV error proposed by Obuchi and Kabashima agrees well with the 10-fold CV error. Thus, using the approximated LOOCV, which requires roughly ten times less computational time than 10-fold CV, is computationally efficient.

When we estimate the center positions of the peaks from topography data, we simultaneously obtain the amplitudes $x^*$ of the L1VM variables. As shown in Fig. 8, however, our analysis provides a bundle of peaks for each true peak. The summed peak amplitude for each cluster is easily calculated from the optimized variables $x^*$ using the attributed variables $z_i$. In Fig. 11, we compare the accumulated amplitude for each peak with the true amplitude. The estimation of the peak amplitude is not as accurate as the estimation of the peak position, although it is a significant improvement over the naive least-squares result shown on the right side of Fig. 3.
The mean values and standard deviations of the peak amplitudes are shown for our estimate and the true values in Table I. The mean value of the estimated amplitude $x^*$ is comparable to that of the true value $\bar{x}$, but the standard deviation of $x^*$ is about three times larger than that of $\bar{x}$. This discrepancy may be due to the limited resolution of the L1VM. We expect that the accuracy of the amplitude estimation can be improved by increasing the dimension $N$ of $x$ so that $N$ is larger than the input dimension $D$. Another practical way to improve the accuracy might be to re-evaluate the peak amplitudes using the peak positions extracted by our scheme.
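The cutoff-length idea mentioned in the cost discussion above can be sketched as follows: entries of the Gaussian measurement matrix beyond a few multiples of $\sigma$ are negligibly small and can be dropped. This is a minimal dense numpy illustration (names and the default cutoff are ours); a production version would assemble the matrix directly in a sparse format rather than masking a dense one:

```python
import numpy as np

def truncated_gaussian_matrix(pixel_pos, grid_pos, sigma, cutoff=4.0):
    """Gaussian measurement matrix with entries beyond cutoff*sigma set to
    zero, so that only near-diagonal couplings remain for localized bases."""
    diff = pixel_pos[:, None, :] - grid_pos[None, :, :]
    r2 = np.sum(diff ** 2, axis=-1)
    A = np.exp(-r2 / (2.0 * sigma ** 2))
    A[r2 > (cutoff * sigma) ** 2] = 0.0    # drop negligible far-field entries
    return A
```

With a localized basis, the number of surviving entries scales linearly with the number of pixels rather than quadratically, which is what makes the $512 \times 512$ analysis feasible.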
Finally, the choice of the basis function remains an important problem in analyzing experimental datasets. In the preliminary results shown in Fig. 10, typical STM topography data of a SrVO$_3$ thin film is analyzed with a 2D isotropic Gaussian basis function. However, situations exist in which this choice of basis function is not suitable. To apply our scheme to more general cases, we plan to utilize machine learning techniques to estimate suitable basis functions from the measured datasets.