Novel spectral unmixing approach for electron energy-loss spectroscopy

Electron energy-loss spectroscopy is a well-established technique for characterizing low-Z elements in materials. A measured spectrum image typically contains contributions from several materials when the composition of the specimen is complex; therefore, decomposing the spatial distribution of each endmember is crucial to materials scientists. In this article, we combine multiple linear least-squares fitting and k-means clustering to resolve this issue. In addition, our method can extract nearly true endmembers from materials in an unsupervised manner. Simulated and experimental data were employed to evaluate the performance and feasibility of our method.


Introduction
A spectrum image acquired in a scanning transmission electron microscope (STEM) allows high-resolution structural and chemical information to be collected simultaneously. The two most common spectroscopic techniques in STEM are electron energy-loss spectroscopy (EELS) and energy-dispersive x-ray spectroscopy (EDS). Although EDS is more efficient at detecting high-Z elements, EELS is more sensitive to low-Z species [1][2][3][4]. A typical issue in EELS is that each measurement is a mixed spectrum; signal processing therefore becomes essential for gaining deeper insight into materials, especially for extracting the useful signal from the dataset [5][6][7][8][9][10].
Spectral unmixing is one of the tools for decomposing a mixed spectrum into a collection of distinct spectra, or endmembers: the absorption of individual elements or compounds in a specific energy range [11,12]. Several algorithms for EDS/EELS data processing, such as principal component analysis (PCA), independent component analysis (ICA), multiple linear least-squares (MLLS) fitting, and non-negative matrix factorization (NMF), have been applied to extract the endmembers from a spectrum image.
PCA and ICA are used to reduce the dimensionality of the dataset. However, it is challenging for PCA to extract physically meaningful endmembers because the individual elements in the spectrum cannot be identified. Recently, ICA has been applied to extract endmembers, but the ICA route requires denoising and is highly time-consuming [13][14][15][16][17].
MLLS fitting and NMF are both vector quantization techniques that aim to decompose the signal into a linear combination of principal components. It has been demonstrated that MLLS fitting can solve the edge-overlapping problem in EELS by manually assigning principal spectra as references [9,18,19]. In contrast, NMF can automatically extract meaningful components through positivity constraints on the component weightings. Although NMF has been widely used in image processing, text mining, and spectral unmixing, great effort is still required to enhance its efficiency [20][21][22][23][24].
In order to extract physically meaningful endmembers, cluster analysis, which groups the data by similarity, is one possible solution. Various clustering algorithms have been developed, such as agglomerative clustering, density-based spatial clustering of applications with noise, and k-means clustering [25][26][27]. K-means clustering is a well-known unsupervised routine that groups the data depending only on similarity and has been widely used for electron microscopy data processing. Most of its applications, however, focus on image processing and rarely on spectroscopy [27][28][29][30][31][32][33][34].
In EELS, the spectra are recorded at each position over multiple channels corresponding to different electron energies. Each spectrum can be regarded as a vector in a multi-dimensional energy space, and the measured spectra are superpositions of such vectors. Therefore, the similarity between two spectra can be defined as their Euclidean distance in this energy space [27]. The centroid of each cluster can then serve as a reference spectrum for the MLLS routine. However, in most cases, k-means clustering alone cannot extract the true endmembers.
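As a minimal illustration of this similarity metric (with synthetic arrays standing in for real EEL spectra), the Euclidean distance between two spectra is computed channel-wise:

```python
import numpy as np

# Two toy spectra recorded over the same five energy channels (synthetic values).
spectrum_a = np.array([0.1, 0.5, 2.0, 1.2, 0.3])
spectrum_b = np.array([0.2, 0.4, 1.8, 1.1, 0.4])

# Euclidean distance in the multi-dimensional energy space:
# a smaller distance means the two spectra are more similar.
distance = np.linalg.norm(spectrum_a - spectrum_b)
```

In k-means clustering, exactly this distance decides which centroid each pixel's spectrum is assigned to.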
In this article, we propose a novel spectral unmixing approach named kMLLS clustering. By combining the advantages of MLLS fitting and k-means clustering, the individual true endmembers can be extracted distinctly from a measured spectrum image.
The feasibility and robustness were examined by both simulated and experimental data.

Validation through simulation
The multiple linear least-squares (MLLS) fitting assumes that the nth spectrum in a spectrum image can be formulated as a linear combination of p reference spectra:

S_n(E_{c_m}) = Σ_{i=1}^{p} β_i R_i(E_{c_m}),

where S_n(E_{c_m}) represents the nth spectrum at the mth energy channel selected to acquire the spectrum image, β_i refers to the fitting coefficient of each reference spectrum, and R_i(E_{c_m}) is an endmember of the spectrum image. The main purpose of performing MLLS on a spectrum image is to solve for β. Writing the reference spectra as the columns of a matrix R, the least-squares solution can be expressed as

β = (R^T R)^{-1} R^T S_n.

The relative fractional abundance of the pth reference spectrum is represented by β_p, so we can obtain the spatial distribution of each reference spectrum [18,19]. Usually, the k-means clustering algorithm, which groups data by similarity defined through the Euclidean distance, is applied to extract the reference spectra, or endmembers, from a spectrum image. However, the EELS background decays exponentially, which may produce large differences in similarity if the entire energy range is considered during grouping. To avoid this situation, we cropped the energy region of interest before the analysis. After that, k cluster centroids were randomly selected from the dataset, and each data point was assigned to a group according to its distance to the nearest centroid. New centroids were then computed as the centers of mass of the data points in the individual groups. This iterative process was repeated until the centroids converged. To determine the proper value of k, one computes the sum of squared errors (SSE) over a range of k values and selects the k at the elbow of the SSE curve [25,30]. Physically, k-means clustering groups the electron energy-loss spectra at specific positions into k clusters, and each centroid represents the average spectrum of its cluster.
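The MLLS solution above can be sketched with NumPy's least-squares solver; this is a minimal illustration with synthetic reference spectra, not the authors' implementation:

```python
import numpy as np

# Two synthetic reference spectra (endmembers) over five energy channels,
# stored as the columns of R, shape (channels, p).
R = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [1.0, 3.0],
              [0.5, 2.0],
              [0.2, 0.5]])

# A measured spectrum built as a known mixture: 0.7 * R[:, 0] + 0.3 * R[:, 1].
beta_true = np.array([0.7, 0.3])
S_n = R @ beta_true

# Solve beta = (R^T R)^{-1} R^T S_n; lstsq does this in a numerically
# stable way via the SVD rather than forming the normal equations.
beta, *_ = np.linalg.lstsq(R, S_n, rcond=None)
```

Applied pixel-by-pixel, the recovered β gives the fractional abundance map of each reference spectrum.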
In most cases, the accurate endmember phases cannot be extracted by k-means clustering alone. We therefore combine k-means clustering with MLLS fitting to overcome this difficulty. The flow-chart of kMLLS is shown in figure 1. First, we determine a preliminary k-value for the system and perform k-means clustering to obtain candidate references for the MLLS step. Second, a brute-force algorithm enumerates all combinations of centroids and examines whether any centroid is composed of two or more of the others. If a centroid is a linear combination (with positive coefficients) of other centroids, it is removed from the reference spectra used in the subsequent MLLS calculation. Third, we perform the MLLS routine pixel-by-pixel to obtain the coefficient of each reference. Fourth, the coefficients are employed to calculate new references. Finally, we repeat this procedure until no coefficient of any reference changes beyond the tolerance; the endmembers are obtained once the references converge.
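The brute-force pruning step can be sketched as follows; the function name, tolerance, and toy centroids are our own illustrative choices, and a linear-combination test via least squares stands in for whatever enumeration the authors use:

```python
import numpy as np

def prune_mixed_centroids(centroids, tol=1e-6):
    """Drop any centroid that is (within tol) a positive linear
    combination of two or more of the remaining centroids."""
    keep = list(range(len(centroids)))
    for i in range(len(centroids)):
        if i not in keep:
            continue
        others = [j for j in keep if j != i]
        if len(others) < 2:
            continue
        A = centroids[others].T                      # (channels, n_others)
        coef, *_ = np.linalg.lstsq(A, centroids[i], rcond=None)
        residual = np.linalg.norm(A @ coef - centroids[i])
        if residual < tol and np.all(coef > 0):
            keep.remove(i)                           # centroid i is a mixture
    return centroids[keep]

# Toy centroids: two pure phases and one 50/50 mixture of them.
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.5, 0.5, 0.0]])
references = prune_mixed_centroids(centroids)        # the mixture is removed
```

Only the surviving references then enter the iterative MLLS refinement.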
Simulated spectra with different compositions of endmembers were generated with the EELS Advisor package in the GATAN DigitalMicrograph © software to demonstrate our kMLLS clustering algorithm [35]. In this study, the experimental parameters of a JEOL JEM-ARM200F were used to simulate the spectra (primary energy: 200 keV; convergence angle: 44.2 mrad; collection angle of the EEL spectra: 29.6 mrad). To be more realistic, 10% Poisson noise was imposed on each spectrum. Although neither background subtraction nor denoising was needed in advance, the combination of the endmembers in the mixed region was assumed to be linear. All the data processing in this paper was implemented with the NumPy, Scikit-learn, and HyperSpy packages in Python [36,37].
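One way to read "10% Poisson noise" is to scale each channel to roughly 100 expected counts, since the relative shot-noise level of a Poisson variable is 1/√N; this is our interpretation sketched in NumPy, not the EELS Advisor internals:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A smooth synthetic spectrum (arbitrary units) over 50 channels.
clean = np.linspace(1.0, 2.0, 50)

# Scale so the mean channel intensity is ~100 counts; for Poisson
# statistics the relative noise is then about 1/sqrt(100) = 10%.
scale = 100.0 / clean.mean()
noisy = rng.poisson(clean * scale).astype(float) / scale
```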
The first example is a two-endmember system. The pure BN and C phases are located at the two endpoints, and the abundance profiles are shown in figure 2(a). Figures 2(b) and (c) show the simulated noisy EELS spectra of BN and C, respectively. The elbow method was applied to find the optimal k-value for clustering. The result of the elbow route presented in figure 3(a) indicates that the number of clusters is three, which differs from the true number of endmembers. Notably, if we pick the mixed phase (figure 3(b)) as one of the reference spectra, k-means clustering is always trapped at the incorrect fractions shown in figures 4(a) and (e). By contrast, kMLLS clustering still converges to nearly pure phases. For example, if figure 3(c) (nearly pure BN) and figure 3(b) (a mixture of BN and C) are selected as the reference spectra, figure 3(b) converges under kMLLS clustering to the nearly pure C phase. A similar result was obtained when we picked figure 3(d) (nearly pure C) and figure 3(b) as the reference spectra; the results are shown in figures 4(b) and (f). This indicates that no matter which references are selected for kMLLS, the true endmembers can be extracted as long as the k-value is correctly determined. Notably, such k-values are typically larger than the number of endmembers and can be corrected by the brute-force algorithm. If the k-value is underestimated, some endmembers may acquire negative fractions; in such a case, we can manually determine the proper k-value for kMLLS clustering.

An overlapped multiple-edge system, TiN-Ti-TiO2, was designed to further demonstrate the feasibility of our method. As shown in figure 5, the O K edge is located between the Ti L3,2 and Ti L1 edges. The k-value determined by the SSE curve shown in figure 6(a) is three. The corresponding centroids of k-means clustering are shown in figures 6(b)-(d). Although the k-value is correctly determined, figures 6(c) and (d) are still mixed spectra.
After implementing kMLLS clustering, the weighting distribution and the endmembers were both correctly retrieved. The comparison of k-means and kMLLS clustering is shown in figure 7.

Demonstration of feasibility using experimental data

Finally, we examined the feasibility of kMLLS clustering on experimental data. A line-scanned spectrum image of an oxide-nitride-oxide (ONO) multilayer thin film was acquired using a JEOL JEM-ARM200F Cs-corrected STEM with a GATAN Quantum 965 EELS camera at 200 keV. Since the EELS signal decays exponentially, inaccurate Euclidean distances might result from poor background subtraction. In practice, background subtraction is usually problematic when multiple elements coexist in the material. Therefore, we selected only the absorption-edge region for kMLLS clustering. The advantage is that the spectral features then dominate the Euclidean distances (i.e. the similarity), alleviating the clustering bias caused by the background. Figure 8
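Restricting the analysis to the absorption-edge window amounts to slicing the channel axis before clustering; a minimal sketch with NumPy, where the energy axis, window limits, and array shapes are all illustrative assumptions:

```python
import numpy as np

# Synthetic energy axis (eV) and a stack of spectra, shape (positions, channels).
energy = np.arange(400.0, 700.0, 1.0)
spectra = np.ones((5, energy.size))

# Keep only the absorption-edge window, e.g. 450-600 eV, before clustering,
# so spectral features rather than the decaying background set the distances.
window = (energy >= 450.0) & (energy <= 600.0)
cropped = spectra[:, window]
```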

Conclusion
In this paper, we have demonstrated that a novel algorithm, kMLLS clustering, can successfully extract the endmembers of a spectrum image from both simulated and experimental data. In comparison with k-means clustering, no prior knowledge or pre-selection of references is needed; the endmember extraction can therefore be conducted unsupervisedly and leads to nearly true endmembers. We believe that kMLLS clustering has great potential for in-line investigations and provides significant insight into materials.