Deep neural network for X-ray photoelectron spectroscopy data analysis

In this work, we characterize the performance of a deep convolutional neural network designed to detect and quantify chemical elements in experimental X-ray photoelectron spectroscopy data. Given the lack of a reliable database in the literature, we trained the neural network on a large ($>$100 k) dataset of synthetic spectra, computed for randomly generated materials covered with a layer of adventitious carbon. The trained net performs as well as standard methods on a test set of $\approx$ 500 well-characterized experimental X-ray photoelectron spectra. Fine details about the net layout, the choice of the loss function and the quality assessment strategies are presented and discussed. Given the synthetic nature of the training set, this approach could be applied to the automation of any photoelectron spectroscopy system, without the need for experimental reference spectra and with a low computational effort.


I. INTRODUCTION
Deep neural networks (DNNs) are currently state-of-the-art in image recognition applications, and have already been tested for several scientific spectroscopy applications [1][2][3][4].
In fact, fast machine learning processing will be crucial for high-throughput data analysis, especially at large research facilities such as synchrotrons or free-electron lasers [5], where the sheer amount of data rules out standard manual processing. In addition to being much faster, machine learning methods can, for specific tasks, match or even outperform the accuracy of a human analysis. X-ray photoelectron spectroscopy (XPS) data represent an ideal application field for DNN classification methods. In an XPS experiment [6] the sample surface is hit by x-rays with a specific energy (hν) from a monochromatic source (schematics are given in Figure 1). If hν is larger than the binding energy (BE) of an electron in the solid, the electron is ejected with a kinetic energy KE = hν − BE − φ, where φ is the work function, ultimately related to the bulk/vacuum discontinuity at the sample surface. This simple relation allows one to collect XPS spectra by measuring the number and the kinetic energy of the photoemitted electrons.
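As a small worked example of the relation KE = hν − BE − φ, the sketch below converts a binding energy into a kinetic energy. The Al Kα photon energy (1486.6 eV) is the standard value; the work function of 4.5 eV and the function name are illustrative assumptions, not values taken from this work.

```python
AL_K_ALPHA_EV = 1486.6  # photon energy of the Al K-alpha line, in eV


def kinetic_energy(binding_energy_ev, photon_energy_ev=AL_K_ALPHA_EV,
                   work_function_ev=4.5):
    """Return KE = h*nu - BE - phi in eV.

    work_function_ev = 4.5 is a placeholder (assumption); the real value
    depends on the spectrometer.
    """
    ke = photon_energy_ev - binding_energy_ev - work_function_ev
    if ke <= 0:
        # The photon cannot eject an electron bound this strongly.
        raise ValueError("binding energy too large for this photon energy")
    return ke


# C 1s level at a binding energy of about 284.8 eV
ke_c1s = kinetic_energy(284.8)
```

With these assumed numbers a C 1s electron leaves the sample with roughly 1197 eV of kinetic energy.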
Each element displays characteristic core levels and Auger lines that are then used to identify the elements present in a sample. Elemental identification should be obtained with ease with DNNs, without the complexities required, for example, for image recognition tasks. Given the potential relevance of XPS for material physics and industrial research [7], this machine learning approach looks appealing for automated analysis.
However, major drawbacks prevent the application of DNNs to XPS data analysis. Training a neural network requires a large database of consistent spectra, which should cover all the possible XPS analysis outcomes in a well-distributed, random sampling of all chemical elements; up to now, no such body of data can be found in the literature. The lack of a proper spectra database is due to several characteristics of XPS analysis: the technical complexity (due to the ultra-high vacuum requirements), the variety of XPS setups (different photon sources, analyzers, experimental geometries), the often long spectra acquisition time, the details of sample preparation, and the differences in spectral range, resolution and noise level.
Apart from time and sample constraints, the collection of a universal XPS database suitable for DNNs is unfeasible, since each XPS machine would require a special dataset related to its specific technical details.
Moreover, the XPS quantification and identification process is strongly influenced by the large differences in photoemission cross sections among chemical elements. In fact, the actual detection threshold is different for each element, and also depends on the total electron count statistics; a longer acquisition time or a higher photon flux allows the detection of elements in a sample with a sensitivity down to 0.1% [8]. Moreover, the superposition of core levels of different elements significantly affects the XPS element detection capability [9]. The stoichiometry evaluation (i.e., the elemental quantification) is also affected by the spectra analysis routine and by the specific choice of sensitivity factors. The practical accuracy for relative elemental quantification is generally considered [10] to be 10%, although much better results can be obtained with a very accurate setup characterization and data treatment [11]. As a result, most of the quantification work is usually carried out on a single, specific core level for each element, for which a specific sensitivity factor is known [12]; these normalization factors are different for each XPS setup and must be supplied by the machine vendor or obtained by accurate investigations on calibration samples. An extensive review of quantitative XPS can be found in Ref. [13].
Finally, XPS spectra are often affected by the presence of a surface layer of adventitious carbon contamination, due to the high surface sensitivity of the technique. While in some cases this layer can be removed by UHV cleaning techniques, such as Ar+ sputtering or plasma cleaning, in many other cases the surface cannot be cleaned without degrading the sample. The carbon contamination leads to an overall lower XPS intensity [14], and the reduction is different for each core level because it depends on the photoelectron KE.
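The KE dependence of the contamination damping can be illustrated with the simple straight-line (exponential) attenuation law, I = I0 exp(−d / (λ cos θ)). This is a textbook approximation that neglects elastic scattering, not the transport model used for the simulations in this work; the mean free path values below are hypothetical and only chosen to show the trend.

```python
import math


def overlayer_attenuation(thickness_angstrom, imfp_angstrom,
                          emission_angle_deg=0.0):
    """Fraction of the signal surviving an overlayer (straight-line approximation).

    imfp_angstrom is the inelastic mean free path of the photoelectron inside
    the overlayer; it grows with kinetic energy, so slower electrons are
    damped more strongly.
    """
    path = thickness_angstrom / math.cos(math.radians(emission_angle_deg))
    return math.exp(-path / imfp_angstrom)


# Hypothetical 20 Angstrom contamination layer, normal emission
low_ke = overlayer_attenuation(20.0, imfp_angstrom=15.0)   # slow electrons
high_ke = overlayer_attenuation(20.0, imfp_angstrom=35.0)  # fast electrons
```

In this illustration the low-KE peak keeps about a quarter of its intensity, while the high-KE peak keeps more than half, which is why the contamination distorts relative peak intensities and not only the overall signal.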
In this work we show the application of a DNN to the identification and quantification of elements in XPS survey spectra. To overcome the lack of a large experimental dataset, we generated a synthetic training set, based on state-of-the-art theory for XPS [15]. Each training spectrum has been calculated for a randomly generated material, with a random contamination layer on top; every element in the periodic table, from Li to Bi, has been taken into account with identical probability. Each detail of real XPS spectra, such as peak position, width and intensity, inelastic loss backgrounds, chemical shifts, the analyzer transmission function, the signal-to-noise ratio etc., has been carefully simulated according to available XPS databases and theories, in order to produce random synthetic, yet realistic spectra. As an experimental reference, we used a set of 535 survey spectra, collected in similar experimental conditions. The DNN has been specifically designed to produce consistent results without the use of any experimental spectra during the training; for this task, an optimized net layout and a specific loss measurement have been introduced. The DNN has also been trained to ignore the adventitious carbon contribution, in order to quantify the stoichiometry of the pristine material. Due to its design, this approach could be applied to any XPS system, with any photon source, and without the need for a large experimental dataset for the DNN training.

A. The training set
We numerically generated a synthetic training set made of 100k survey spectra based on XPS parameter databases and electron scattering theory in the transport approximation [15].
The KE range of the spectra was 400-1486 eV on a 2000-point grid, which corresponds to the size of the input array of the DNN (i.e., the number of features); we restricted the analysis to an Al Kα source, although this method could be extended to any soft x-ray source, within the limits of database availability and model approximations.
Given that we only consider the task of overall identification and elemental quantification, we devised a relatively simple two-layer model for the spectra simulation, where a random bulk material is covered by an overlayer of the usual hydrocarbon contamination. In order to make the training set as general as possible, each material is composed of a random number (from 2 to 5) of elements, with variable stoichiometry ratios. We consider random combinations of elements in the [3, 83] atomic number interval (Li to Bi), without any bias towards a specific element or material; while this method generates spectra for several unphysical materials, it allows for a completely unbiased network training. We assigned to each synthetic material a density, evaluated on the basis of elemental densities. Although this is a rather approximate approach, it allows for a more accurate evaluation of peak intensity in very dense materials, such as low-Z elements diluted in high-Z compounds. The contamination layer density has been set to 1.56 g/cm^3, as an average of several similar organic compounds; the environmental contamination layer should thus be considered as an effective layer, whose thickness was randomly chosen in the [0, 40] Å interval.
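The random material selection described above can be sketched as follows. This is only an illustration: the elements run from Li to Bi (Z = 3 to 83, matching the 81 network outputs), while the Dirichlet draw for the stoichiometry, the uniform draw for the thickness and all names are assumptions, since the sampling distributions are not specified in the text.

```python
import numpy as np

Z_MIN, Z_MAX = 3, 83            # Li to Bi
N_ELEMENTS = Z_MAX - Z_MIN + 1  # 81 candidate elements


def random_material(rng):
    """Draw one synthetic material and its contamination thickness.

    Returns an 81-entry vector of relative elemental fractions (summing to 1)
    and an overlayer thickness in Angstrom.
    """
    n_species = rng.integers(2, 6)  # 2 to 5 elements, upper bound exclusive
    atomic_numbers = rng.choice(np.arange(Z_MIN, Z_MAX + 1),
                                size=n_species, replace=False)
    # Assumption: flat Dirichlet draw for the stoichiometric fractions.
    fractions = rng.dirichlet(np.ones(n_species))
    q = np.zeros(N_ELEMENTS)
    q[atomic_numbers - Z_MIN] = fractions
    # Assumption: thickness uniformly distributed in [0, 40] Angstrom.
    thickness = rng.uniform(0.0, 40.0)
    return q, thickness


rng = np.random.default_rng(seed=0)
q, thickness = random_material(rng)
```

Every element has the same probability of being picked, so the resulting training set carries no bias towards chemically plausible compounds.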
For each element we considered database entries for all the core-level [16,17] and Auger [18,19] structures, both for cross sections and native peak widths. Peak positions are also randomly shifted by considering the largest chemical shifts found in available databases [20]; this shift can reach up to 10 eV, for instance for sulfur core levels. In order to predict the actual peak intensity we performed full Monte Carlo simulations, as described by Werner [15], including both the electron inelastic and transport mean free paths (IMFP and TMFP). The IMFP has been calculated with the usual TPP-2M formula [21], while for the TMFP we used the interpolation method of Jablonski [22]. This method allows for the prediction of the peak intensity and of the peak inelastic background, which has been simulated on the basis of the Tougaard differential inverse inelastic mean free path (DIIMFP) formula [23]. Each peak has been simulated with a Voigt profile, and the background has been evaluated through successive convolutions with the DIIMFP function. The full details of the actual XPS setup have been considered, including the analyzer acceptance angle. Such an approach has been used previously to accurately characterize the transmission function of electron analyzers from survey spectra quantification [11].
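A heavily simplified sketch of the peak-plus-background construction is given below. It builds a Voigt profile by numerically convolving a Lorentzian with a Gaussian, then adds a single inelastic-scattering term using Tougaard's universal cross section K(T) = B T / (C + T^2)^2 with the commonly quoted parameters B = 2866 eV^2 and C = 1643 eV^2. Restricting the background to one scattering order, and all function names and peak parameters, are assumptions made for illustration; the simulations in this work use repeated convolutions and Monte Carlo weights.

```python
import numpy as np


def voigt_peak(energy, center, sigma, gamma):
    """Voigt profile on a uniform grid: Lorentzian (HWHM gamma) * Gaussian (sigma)."""
    step = energy[1] - energy[0]
    lorentz = (gamma / np.pi) / ((energy - center) ** 2 + gamma ** 2)
    half = int(np.ceil(5 * sigma / step))
    offsets = np.arange(-half, half + 1) * step
    gauss = np.exp(-0.5 * (offsets / sigma) ** 2)
    gauss /= gauss.sum()  # unit-area kernel keeps the peak area unchanged
    return np.convolve(lorentz, gauss, mode="same")


def tougaard_single_loss(energy, spectrum, B=2866.0, C=1643.0):
    """First-order inelastic background from the universal cross section.

    Electrons detected at kinetic energy E_i originate from intensity at
    higher energies E_j, having lost T = E_j - E_i > 0.
    """
    step = energy[1] - energy[0]
    loss = energy[None, :] - energy[:, None]  # loss[i, j] = E_j - E_i
    kernel = np.where(loss > 0, B * loss / (C + loss ** 2) ** 2, 0.0)
    return kernel @ spectrum * step


energy = np.linspace(400.0, 1486.0, 1000)  # kinetic energy grid in eV
peak = voigt_peak(energy, center=1000.0, sigma=0.5, gamma=0.4)
spectrum = peak + tougaard_single_loss(energy, peak)
```

The background appears only on the low-KE side of the peak, reproducing the characteristic step seen after each core level in a survey spectrum.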
For the contamination layer, based on experimental results on adventitious carbon contamination [14], we considered a carbon rich layer (5:1 carbon to oxygen ratio), with an additional 10% noise for the intensity and a random shift of maximum 0.5 eV for the peak position. Such shift should mimic the effect of small peak drifts due to charging or to different analyzer-to-sample work functions.
In order to allow a comparison with the raw experimental spectra, we performed several final refinement steps. Synthetic spectra are convolved with a Gaussian peak in order to reproduce the experimental resolution; we introduced the XPS satellites for the non-monochromatic Al Kα source (for core-level photoelectron features only); we multiplied the spectra by the proper analyzer transmission function, which has been carefully characterized for our XPS system [11]; finally, we added a small Gaussian relative noise (0.3%) and normalized the data in the [0, 1] intensity range. Some examples of the training set spectra are given in Figure 2-a.
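A minimal sketch of three of these refinement steps (Gaussian broadening, noise, and [0, 1] normalization) is shown below. The source satellites and the transmission function are omitted. Treating the 0.3% noise as multiplicative, the 1 eV resolution in the example and the function name are assumptions made for illustration.

```python
import numpy as np


def finalize_spectrum(spectrum, step_ev, fwhm_ev, rng, noise_level=0.003):
    """Broaden, add noise and rescale a synthetic spectrum to [0, 1]."""
    # Gaussian kernel with the requested full width at half maximum.
    sigma = fwhm_ev / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    half = int(np.ceil(4 * sigma / step_ev))
    offsets = np.arange(-half, half + 1) * step_ev
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    broadened = np.convolve(spectrum, kernel, mode="same")
    # Assumption: Gaussian noise proportional to the local intensity.
    noisy = broadened * (1.0 + noise_level * rng.standard_normal(broadened.shape))
    # Min-max normalization to the [0, 1] range expected by the network.
    noisy = noisy - noisy.min()
    return noisy / noisy.max()


rng = np.random.default_rng(seed=0)
energy = np.linspace(400.0, 1486.0, 2000)
raw = 1.0 + 50.0 * np.exp(-0.5 * ((energy - 1000.0) / 0.3) ** 2)  # toy peak
final = finalize_spectrum(raw, step_ev=energy[1] - energy[0], fwhm_ev=1.0, rng=rng)
```

The output has the same 2000-point length as the input, so it can be fed directly to a network with 2000 input features.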
The generation of each synthetic spectrum required an average of two minutes of computational time on a single core of a desktop machine. Consequently, the production of the training set was the most computationally intensive part of this work, and was carried out in parallel on several computers. The relatively long time required for a single spectrum calculation and the large number of parameters make it impractical to use this spectra prediction method as a direct data fitting procedure. In contrast, once the DNN has been trained, any subsequent spectrum evaluation is extremely fast and direct.
The [0, 1] intensity normalization constraint, which is typical for the input features of trainable DNNs, introduces some additional difficulties for the quantification task. The XPS total intensity, measured with a fixed X-ray flux and accumulation time, can vary by up to two orders of magnitude because of the element cross sections. Accordingly, the amount of time required to decrease the noise level of XPS spectra is also variable. By fixing the intensity range, we lose information which could in principle be useful for the quantification process. However, in standard XPS practice, the X-ray flux could be different in each experiment, as well as the accumulation time; hence the choice to normalize the training and experimental spectra to the same scale. For the same reason we also decided to fix the signal-to-noise ratio (SNR) to 0.5% of the [0, 1] intensity range (i.e., each spectrum shows exactly the same SNR), which corresponds to the lab practice of tuning the total accumulation time in order to reach a reasonably clean spectrum.
Two different sets of labels have been considered for the DNN output. The most straightforward choice is an array of 81 numbers $\bar{q}_i$, $i = 1, \dots, 81$, which directly represents the relative elemental quantification from Z = 3 (lithium) to Z = 83 (bismuth). However, this approach is suboptimal because the relative elemental quantification does not directly correspond to the relative contribution of each chemical species to the corresponding spectrum intensity, which confuses the DNN. More precisely, due to the different photoelectron cross sections, each element displays different XPS intensities for the same relative quantification.
For instance, in a compound made of Li and Cu in a 1:1 ratio, 99% of the XPS spectral weight is related to Cu, making the detection of lithium nearly infeasible without very large data statistics. In Figure 2-b, we show as a reference for the relative XPS intensity the calculated spectra maxima of pure elements without contamination (labeled $T_i$). Therefore, instead of the relative elemental quantification, we used as labels for the classification the normalized quantification intensity defined as

$$\bar{y}_i = \frac{\bar{q}_i T_i}{\sum_{j=1}^{81} \bar{q}_j T_j},$$

where now $\bar{y}_i$ is the contribution of the element $i$ to the total intensity spectrum. Here and throughout this paper variables with an overline denote true labels and variables without an overline denote the network outputs. In order to obtain the labels $\bar{q}_i$ of the experimental spectra we first performed a standard quantification by using the peak area of all the detectable elements. We then removed the contribution of surface carbon contamination from the experimental labels, whenever allowed by the additional information about each specimen. Such an evaluation can be problematic even for a human user, especially for heavily contaminated carbon-based materials. With this method, the obtained labels $\bar{q}_i$ have a relative error of ∼ 10% on the experimental dataset.
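The conversion between relative quantification and intensity-weighted labels can be sketched as below, assuming that the normalized intensity label is the quantification weighted by the reference intensity $T_i$ and renormalized, and that the network output is mapped back by the inverse weighting. Both function names and the two-element $T$ values are hypothetical, chosen only to reproduce the 99% Cu weight of the Li/Cu example.

```python
import numpy as np


def quantification_to_intensity(q, T):
    """Weight each relative fraction by its reference intensity, then renormalize."""
    weighted = q * T
    return weighted / weighted.sum()


def intensity_to_quantification(y, T):
    """Undo the intensity weighting to recover relative elemental fractions."""
    weighted = y / T
    return weighted / weighted.sum()


# Hypothetical two-element example: Li and Cu in a 1:1 ratio.
T = np.array([1.0, 99.0])   # assumed reference intensities (Li, Cu)
q = np.array([0.5, 0.5])
y = quantification_to_intensity(q, T)        # intensity labels
q_back = intensity_to_quantification(y, T)   # recovered quantification
```

With these assumed reference intensities the Li label drops to 0.01 while Cu takes 0.99, and the inverse mapping recovers the original 1:1 ratio.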

C. Deep neural network layout
We tested several geometries for our neural network, including purely fully connected and deep convolutional networks; the best results were obtained for the hybrid geometry shown in Figure 3, which also identifies the level of carbon contamination.
Our best network is composed of three stages. The first is made of convolutions and essentially plays the role of a noise filter and feature extraction. The second is used to identify the level of carbon contamination, and the third is used for normalized intensity quantification.
Concretely, the first stage is made of an initial multilevel convolution sub-net modeled after the inception module of incnet [24], with three parallel 1D convolutions with one [...]

The optimal net hyperparameters, such as the convolution kernel sizes, the number of convolutions, the fully connected layer sizes and the optimization algorithm, have been tuned with hyperparameter optimization routines. One of the major difficulties in the hyperparameter optimization was the choice of the proper activation functions and the associated loss function for the quantification stage (81 output neurons). From a machine learning perspective, we deal with a multi-label supervised learning task, which points towards using a sigmoid output on each of the output neurons together with a binary cross-entropy loss.
However, contrary to the standard definition of multi-label classification tasks [25], the labels are not independent, but must sum up to one as we consider relative concentrations of elements. Thus, one could use a soft-max activation with categorical cross-entropy. However, this produces unsatisfactory results from a physics perspective, because it cannot deal well with samples containing several elements in roughly equal amounts. For practical purposes, one would desire a network that has a very high accuracy for large relative concentrations, say ≥ 10%. In order to achieve this, we used the combination of sigmoid activation functions, a non-trainable normalization layer (as shown in Figure 3) and a custom loss function

$$L = \sum_{i=1}^{81} y_i^2 \left( y_i - \bar{y}_i \right)^2,$$

where $y_i$ is the network output and $\bar{y}_i$ the target values. This loss function, which multiplies the standard $L_2$ norm by the net results squared, is then larger for large output values; for this reason we termed it 'high-pass filter' loss. The combination of this net layout, the Adam optimizer and the tuned loss function allowed for a robust training.
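The output stage and loss described above can be sketched in plain NumPy as follows. The element-wise weighting of the squared error by the squared network output follows the verbal description; summing over the 81 outputs and averaging over the batch are assumptions, as are the function names.

```python
import numpy as np


def normalized_sigmoid(logits):
    """Sigmoid activation followed by a non-trainable normalization to unit sum."""
    s = 1.0 / (1.0 + np.exp(-logits))
    return s / s.sum(axis=-1, keepdims=True)


def high_pass_loss(y_pred, y_true):
    """Squared error weighted by the squared network output.

    Errors on large outputs are penalized more than the same absolute error
    on small outputs. Assumption: sum over outputs, mean over the batch.
    """
    per_sample = np.sum(y_pred ** 2 * (y_pred - y_true) ** 2, axis=-1)
    return float(np.mean(per_sample))


rng = np.random.default_rng(seed=0)
logits = rng.normal(size=(4, 81))  # batch of 4 spectra, 81 elements
y_pred = normalized_sigmoid(logits)
loss_same = high_pass_loss(y_pred, y_pred)  # zero when prediction equals target
```

Because every sigmoid output is strictly positive, the normalized outputs are never exactly zero, so small spurious concentrations are expected for absent elements; the weighting makes the training largely insensitive to them.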

III. RESULTS AND DISCUSSION
The evolution of the loss function for the DNN training is given in Figure 4. Without any regularization method, after a few epochs the training algorithm begins to overfit the synthetic data, leading to a minimum in the experimental data loss (blue arrow in Figure 4-a). It should be pointed out that overfitting is observed on the experimental data only, and not on subsets of synthetic spectra (not used during the training); the dataset size is thus adequately large, and has been computed with a sufficient amount of randomness in the initial parameters.
In order to avoid overfitting and to achieve a better parameter convergence we introduced a dropout layer before the final classification layers (cf. Figure 3): the training algorithm was then forced to ignore a random portion of the net connections at a specific location, with a probability p. With this method we routinely achieved a smooth convergence on both the training set and the experimental data (see Figure 4-b). As a side effect, the introduction of a dropout layer led to a slightly slower learning process; we estimated that a training of nearly 200 epochs is enough to achieve good and consistent quantification and identification performance. The optimal range for the dropout probability, estimated by the hyperparameter study, is 0.1 < p < 0.2; within this interval the dropout is large enough to avoid overfitting, and small enough to prevent excessive randomness in the quantification performance.
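For reference, the effect of a dropout layer can be illustrated with the common "inverted dropout" formulation, in which surviving activations are rescaled so that their expected value is unchanged. Whether the network in this work uses exactly this variant is not stated, so the sketch below is an assumption; p = 0.15 is simply a value inside the reported optimal range.

```python
import numpy as np


def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p during training."""
    if not training or p == 0.0:
        return activations  # inference: the layer is the identity
    keep = rng.random(activations.shape) >= p
    # Rescale the survivors so the expected activation stays the same.
    return activations * keep / (1.0 - p)


rng = np.random.default_rng(seed=0)
features = np.ones(10000)
dropped = dropout(features, p=0.15, rng=rng)
fraction_zeroed = float(np.mean(dropped == 0.0))
```

At inference time the layer is switched off, so predictions on experimental spectra are deterministic.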
The quantification results $q_i$ obtained using the post-processing Eq. (2) from the predicted labels $y_i$ of a well-trained DNN applied to the experimental data are given in Figure 5. Two examples of actual data quantification are given in Figure 6. For carbon nanotubes deposited on Si (Figure 6-a), the DNN correctly assigns the carbon peak to the actual material and not to the adventitious contamination (Figure 6-b), which is nearly negligible; in indium tin oxide (Figure 6-c,d), the DNN correctly identifies all the main elements and assigns all the carbon content to the adventitious contamination. It is also possible to use the DNN stoichiometry and contamination predictions to reproduce synthetic spectra (red traces, Figure 6-a,c), which are in good agreement with the experimental ones, within the 10% accuracy limit of the manual quantification methods.

[...] Figure 3), the predicted quantifications $q_i$ for all the elements are always non-zero.
For a well-working network, we expect most of the wrong predictions to be related to small outputs (either $q_i$ or $y_i$), while positive identifications should be related to high output values. With a perfect identification, the graph of Figure 7-a should then be a Heaviside-like step function, whose integral $I_R$ in the [0, 1] output interval would be exactly one. We found that a good elemental identification, regardless of the RMS on the quantification accuracy discussed above, is obtained when the integral of this function is above 0.90; with our DNN we routinely achieve $I_{exp}$ ≈ 0.93, such as for the red dotted curve shown in Figure 7-a. For the synthetic dataset (black line in Figure 7-a), the integral is $I_{synth}$ = 0.95, which is very close to $I_{exp}$.
The identification graph can now be used to estimate an accuracy threshold for the DNN results, indicated by black arrows in Figure 7-[...] is strongly dependent on the atomic number, with a detection threshold ranging from 1% to nearly 20%; accordingly, the absolute accuracy of quantification also varies from a base value of 1.5% to 9%. Note that for both identification and quantification, the lower limits are probably dependent on the amount of random noise added to the synthetic training set.

Figure 8. Full carbon experimental quantification (red) and corresponding results for the DNN (black), which has been trained to ignore carbon from adventitious contamination.
Finally, we address the capability of the DNN to discriminate the carbon contamination from actual carbon-based compounds. Since it was not possible to accurately compute the level of carbon contamination of the experimental spectra, we use a qualitative proxy measure. The red graph in Figure 8 shows the overall carbon content of the experimental dataset, evaluated from the total area of the C 1s peak with respect to the peaks of the other elements.
The black trace shows the network output for the carbon quantification which, as expected, is significantly different from the experimental one. In general, the DNN performs very well, reporting a high carbon content only for pristine organic materials, such as carbon nanotubes and other organic molecules, represented by the light gray shades beneath the DNN identification. Some weak carbon presence is also detected by the DNN in the first 50 spectra of the experimental dataset, which are mostly composed of Ga, Se and Ge; these elements show several Auger features which are superimposed on the C 1s core level, possibly confusing even a trained human XPS user. Moreover, a thick contamination layer can also be present on top of organic compounds, further complicating the quantification process; that is, for instance, the case for the experimental spectra around dataset index 100.

IV. CONCLUSIONS
In conclusion, we have shown the application of a neural network to the identification and quantification of elements in XPS data on the basis of a synthetic random training set. Results are encouraging, showing a detection capability and an accuracy comparable with those of standard XPS users, which supports both the training set generation algorithm and the DNN layout. This approach can easily be scaled to different photon energies, energy resolutions and data ranges; furthermore, the DNN could be trained to provide more output values, such as the actual chemical shifts for each element, extending the net sensitivity towards the classification of chemical bonds.