Emulator-based decomposition for structural sensitivity of core-level spectra

We explore the sensitivity of several core-level spectroscopic methods to the underlying atomistic structure by using the water molecule as our test system. We first define a metric that measures the magnitude of spectral change as a function of the structure, which allows for identifying structural regions with high spectral sensitivity. We then apply machine-learning-emulator-based decomposition of the structural parameter space for maximal explained spectral variance, first on overall spectral profile and then on chosen integrated regions of interest therein. The presented method recovers more spectral variance than partial least-squares fitting and the observed behaviour is well in line with the aforementioned metric for spectral sensitivity. The analysis method is able to independently identify spectroscopically dominant degrees of freedom, and to quantify their effect and significance.


Introduction
Owing to orbital localization, core-level spectroscopies are sensitive to structure in the neighbourhood of the excited atomic site. However, the dependence between the atomistic arrangement and the resulting spectra is not straightforward, which complicates the analysis of these spectra [1][2][3][4]. A satisfactory solution to this complexity calls for new methods, such as machine learning (ML), that may relieve the computational burden of repeated function evaluations [5]. Here, the inherent lightness of evaluation may, for example, help with problems involving predictions of statistical averages or prediction of spectra for new structures instead of their explicit simulation. Several ML approaches have recently been applied to spectroscopy [6][7][8][9][10][11], typically to emulate the relations between known molecular/atomic structures and corresponding spectra [8,9]. The possibility to predict structural variations in the crystals using extended X-ray absorption fine structure has also been demonstrated [7]. Moreover, prediction of X-ray absorption near-edge structure based on descriptors of the molecular structure has been recently shown with a high accuracy [10].
In this work, we turn to the question of how to apply an accurate ML emulator to the interpretation of core-level spectra in terms of the underlying atomistic structure. We develop an ML-based dimensionality reduction of the structural parameter space based on most covered spectral variance, and apply the method to simulations for three types: X-ray photoelectron spectra (XPS), X-ray emission spectra (XES), and X-ray absorption spectra (XAS). To interpret the findings, we present a metric to measure spectral sensitivity to structural change, and as a result we consistently identify regions of higher and lower spectroscopic structural sensitivity with the different methods.

Data and emulators
The number of electrons and the nuclear configuration, given by the set of all structural parameters p, define the electronic Hamiltonian and its spectra. We obtain transition energies and intensities for numerous structures p from electronic structure simulations. The transition intensities are approximated as squared lengths of the transition dipole vectors of the velocity form. To account for physical (lifetime, vibrational substructure) and instrumental lineshapes, the resulting 'stick spectrum' is convoluted. This procedure results in a continuous spectrum S( p), which on a predefined grid presents a vector. The procedure is repeated for a set of points p obtained from structural simulations. This work is based on applying ML to the simulated structure-spectrum pairs to create an emulator that approximates the function S( p) at any p.
As our data we use 10 000 snapshots from ab initio molecular dynamics (AIMD) trajectories for the H 2 O molecule, with initial kinetic energy equivalent to 10 000 K temperature and spectra simulated for these structures. The structural data and the related XAS spectra have been published previously [11]. For the calculation of XAS and XES spectra, we apply transition-potential density functional theory (TP-DFT). For evaluation of the XPS core-level binding energies, and for correction of the onset of XAS spectra, we carry out respective Δ-DFT calculations for the core-hole state energy with respect to the ground state. Here, we assume a high-enough photon energy to result in a constant O 1s ionization cross-section regardless of the structure. All spectra are convoluted with a 1.0 eV Gaussian and are presented on a 0.1-eV-spaced grid (100 points for all cases). The calculations are carried out using CP2K software [12]. For easier comparison with the experiment, the spectra are shifted by −6.0 eV, 2.25 eV, and 1.5 eV for XES, XAS, and XPS, respectively.
Our analysis relies on ML and the ability to predict spectra at new points in the configurational space, here defined by three degrees of freedom: H-O-H bond angle α, and the shorter and longer O-H bond lengths b s and b l , respectively. We select the ML spectroscopic emulators in a fashion similar to that of Niskanen et al. [11]. In brief, we examine polynomial models with the orders from 2 to 9, and multilayered perceptrons (MLP) with 2-5 hidden layers and 5-500 neurons in each layer, and use mean-squared error as a metric of the training quality for a set of 8000 data points. The scikitlearn [13] Python package is used. Based on cross-validation performance scores, we use an MLP emulator for XES, and polynomial emulators for XAS and XPS in the later stages of the analysis, carried out with a completely separate test set of 2000 samples. However, due to the wiggly behaviour of the MLP isosurfaces for XES spectra, we use the smoother-behaving polynomial emulators to produce all the plots in figure 1.

Spectral sensitivity metric
We measure structural sensitivity as the rate of change of spectrum S(p) at structural parameter point p.  Each channel in the spectrum S is defined by the structural parameters p. Thus, each row in the Jacobian gives the gradient of the particular energy channel with respect to structure. Spectral sensitivity with respect to a given structural parameter is given by the length of the according column vector. To classify points in the configuration space, we focus on the square norm of the whole Jacobian matrix. Since we compare different spectroscopies, normalization with the spectrum at the centre of the data p cen set is applied. An alternative metric is spectral deviation from that at the centre of the training set M diff ðpÞ : ¼ kSðpÞ À Sðp cen Þk 2 kSðp cen Þk 2 ð2:3Þ Numerical calculations on a grid rely on evaluation of the ML predictor.

Emulator-based component analysis
The algorithm carries out a step-wise component vector (CV) search for dimensionality reduction in the structural parameter space, with the criterion to maximize the explained spectral variance together with the components of the previous steps. For a set of N parameter points fp i g N i¼1 , this is achieved by  I  II III  I  II  I II III Figure 1. Spectra of the H 2 O molecule in the training dataset: (a-c) the mean spectrum is shown in black and the shaded area depicts ±1 s.d. from the mean; dashed lines indicate the regions of interest (ROIs I, II and III) for the coarsened spectra; digitized experimental spectra from [14][15][16][17] are shown for comparison; and simulated spectra have been shifted for the best match with the experiments. Structural sensitivity of these spectra: (d-f ) Cartesian distance difference M diff and (g-i) Jacobian norm M grad . Since polynomial approaches behave smoother, they were used also for the plots of XES. The ranges of the parameters shown are ±σ from the mean of the training set. For details, see text.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 220093 projection on CVs optimized for the purpose. For each step k (k = 1, 2, …), a unit vectorv k is searched so that generalized covered variance is maximized. Here, matrix A contains the true spectra of the original data points as its row vectors A i . The corresponding predicted spectra for projected data points are given as row vectors of matrix where function S ðpredÞ is an ML-based emulator capable of predicting spectra for previously unseen structures andÃ We apply the orthonormality constraintv k Áv j ¼ d kj to the CVs, and as the result of the procedure, a set of orthonormal projection vectors is obtained so that they always maximize the generalized covered spectral variance ρ up to the given order k. We apply an overall factor ±1 for the CVs to point towards increasing intensity.
The generalized covered variance ρ accounts for the goodness score in the spectrum space and is necessitated by the nonlinearity of spectrum prediction operation S ðpredÞ . When applied to a data matrix from a projection in the same linear space, the definition reduces to that of covered variance used, for example, in principal component analysis. Due to its definition, ρ may obtain negative values for notably bad predictions as the value zero corresponds to errors with the magnitude of the variance of the known data. We see no problem in alternatively using the remaining unexplained spectral variance 1 − ρ as an error metric in a minimization problem for vectorsv k .

Partial least-squares fits using SVD
We adapt an approach based on singular value decomposition (PLSSVD) [18] owing to its straightforward simplicity and to orthogonality of the CVs. Here, the partial least-squares fit is applied to data in matrices X and Y that contain standardized structural parameters and the corresponding standardized spectra in their row vectors. A linear fit is applied between the component scores of left and right eigenvectors for each order of the decomposition. As a result, an approximation of data is obtained. In the equation, U ( j ) and V ( j ) denote the left and right eigenvectors (column vectors) corresponding to the eigenvalue λ j ordered in descending fashion. As the data are standardized in each of their dimensions, the covariance matrix reads directly from which the matrices U,V and diag(λ 1 , …, λ k ) are obtained by singular value decomposition. The procedure thus gives basis vectors on which to project the data X and Y.
The coefficients c j are obtained from a linear least-squares fit between projected data points XU ( j ) and YV ( j ) for each order j = 1, 2, …. The constant term in the fits is negligible and the first-order coefficient is assigned c j . As an example, the results of the fits for the overall spectrum case are depicted in the electronic supplementary material. For comparison of the PLSSVD fit results, generalized explained variance metrics are evaluated for decompositions cumulatively incremented up to order k, as given by equation (2.7). An overall factor ±1 is applied for the PLSSVD structural space basis vectors to point towards increasing intensity.

Results and discussion
Although a static classical nuclei model is used, the appearance of the studied spectra of the H 2 O molecule in figure 1a-c are in agreement with the respective experiments [14][15][16][17]. The emulators royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 220093 trained on the sampled AIMD structures and corresponding spectra allow for easy and computationally light evaluation of the data on a mesh grid. We applied this capability to calculate the square norms of spectral deviation from that of the mean structure, as depicted in figure 1d-f. In addition, numerical differentiation of an emulator for the spectrum S(r) is a computationally light task on a mesh grid. Here, each partial derivative gives the rate of change for each channel in a spectrum S(r) at point r with respect to each structural parameter. The square norms of the Jacobian matrices [J S (r 0 )] ij = ∂S i /∂r j | r=r 0 presented in figure 1g-i indicate strongest spectral changes in specific directions for each method. Normalization by the spectrum at the mean structure r cen is applied in both cases to allow for a direct comparison.
The spectra show differing structural behaviour, with more variation in XES and XAS than XPS, also indicated by the channel-wise one standard deviation drawn together with the spectra. Figure 1e,h reveals that XAS is most sensitive to the symmetric stretch. This is seen as the largest isovalue surface being located at large b l and b s values, with little variation along the bond angle α. On the other hand, the XPS spectrum changes most at high bond angles, as seen in figure 1f,i: isosurfaces are oriented parallel to the b l -b s plane. From this view, XES is expected to be most sensitive to all structural parameters in the system, being least affected by the asymmetric stretch as seen in figure 1d,g. Here, the cartesian distance difference has a low-value isosurface region intersecting the plot of figure 1d, but the overall rate of change still has high isosurface values throughout the plot of figure 1g.
Spectroscopic data can be seen as two correlated datasets: one for structures and one for the corresponding spectra. One way to analyse the interdependencies in such data is provided by partial least-squares (PLS) fitting [19,20], and a variant of this family of methods has already been applied to binding energies in XPS in aqueous solution [1]. In PLS algorithms, latent variables connecting the two datasets are searched for using only existing data points. However, we show that the relation of structure and spectra may be investigated more deeply with the help of an ML-based emulator that is capable of making accurate and computationally light predictions of new data. Indeed, for a set of parameters defining the Hamiltonian, the spectra are defined as a function. We use the aforementioned capabilities of a good emulator and make a step-wise parameter-space decomposition, where the search for structural space CVs is guided by covering of maximal variance in the spectrum space. Because the search for each CV consists of an iterative solution of an optimization problem, the lightness of evaluation of the emulator is essential. Moreover, this emulator-based component analysis (ECA) routine relies on prediction of spectra on new data, i.e. projected data points in the standardized structural parameter space.
When compared with the results of PLS implemented on eigenvectors from singular value decomposition of the covariance matrix (PLSSVD) [18], the ECA algorithm is able to explain more spectral variance with a decomposition to a given order (table 1). Consequently, explained structural variance for ECA may be less than for the PLSSVD. We understand this by the design principle of ECA to search for directions that matter the most for spectra, with no emphasis on covered structural variance. Moreover, the nonlinearity of ECA allows for a tighter match with the data than linear methods. The first CVs of the methods agree in interplay of all structural parameters, in opposing directions for angle and bond lengths for XES. Likewise, the overall shape of XAS is agreed to be dominantly affected by the bond lengths, and the XPS is virtually completely explained by the H-O-H angle. The results are also depicted in figure 2 and these findings are consistent with the spectral sensitivity metrics presented in figure 1.
Interpretation of experimental core-level spectra is complicated by unavoidable inaccuracy of the spectrum simulations. As a solution to the problem, we have previously proposed an analysis of spectral regions of interest (ROI) that are identifiable in both experimentation and theory [2,3,11,21,22]. In such a line of thought, it is argued that the risk of overanalysis is reduced, as the procedure would naturally focus on confirmedly reproduced spectral features. An alternative approach to assess uncertainties in simulated X-ray spectra has been presented by Bergmann et al. [23]. By studying the spectral response to slight structural distortions, their method results in error bars for calculated spectra for more reliable interpretation of the experiment.
We analysed the behaviour of ROIs marked in figure 1a-c with two approaches: simultaneous and independent for each ROI. A joint treatment of ROIs revealed that some regions dominated the component analysis at the cost of the others. This occurred due to different overall variances in the ROI intensities seen in figure 1a-c. For example, the optimization of the first CV became dictated by XES ROI I, which resulted in highly sub-optimal description of ROI III intensity. Therefore, we conclude that interpretation of ROIs is best done by individual fitting, i.e. analysing each ROI separately.
The results of individual analyses for each ROI are presented in figure 3 and in table 2. When performed this way, the first CVs explain on average ð87 + 14Þ% of ROI intensity variance with the mean structural covered variance of ð38 + 7Þ%, as indicated by figure 3b-c. The first PLSSVD CVs royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 220093 Table 1. Analysis of the overall shape of spectra in increasing order of decomposition: cumulative fractional explained variance in spectral (s 2 spec ) and structural (s 2 stru ) space and the corresponding CVs in the standardized parameter space. show a weaker ð68 + 27Þ% performance for covered spectral variance but cover ð42 + 9Þ% of the structural variance. Standard deviations are given as the uncertainties above.
The CVs were oriented along the increase of corresponding ROI intensity. Whereas this is a trivial task for linear models, defining the positive direction is more complicated for ECA, because of nonlinear and possibly oscillatory behaviour of intensity along the component (see electronic supplementary material). Our analysis reports dominant dependence on the H-O-H angle of all ROIs in XES spectra: based on the first CVs intensity transfer to ROI II is expected with inward bending. The ROIs in XAS are mostly affected by the bond lengths, and, for example, ROI I intensity is found to be increased with further elongation of the longer bond. Last, the sensitivity of XPS to the H-O-H bond angle only is recovered, as intensity is shifted to lower binding energies with increasing bend angles.
In the H 2 O molecule that we use as the pilot system, there are only three nuclear degrees of freedom. It is therefore relevant to ask what would change if a problem with more degrees of freedom, such as a liquid, was to be studied. We turn to this question next.
All other things being equal, a more complicated system can be expected to require a more complicated emulator architecture. This naturally will require larger training (and test) datasets that should cover the whole region of prediction [11], i.e. accessible structural space. The field of ML provides measures how to evaluate the model and the number of required training points, by, for example, studying the learning curves. For the water molecule alone, a simple three-dimensional grid evaluation would have been feasible. However, for more complicated systems, the number of dimensions would prohibit such a raw approach. We see (AI)MD and Monte Carlo simulations as feasible ways to generate structures, as the achieved sampling cuts out a large portion of the inaccessible structural space by design. These considerations are complicated by the note that the complexity of an emulator architecture depends also on how well behaving a function the spectral response is. Last, it remains a case-dependent question of how much precision loss is tolerated in the process.  I II III  I II III  I II  I II III  I II III  I   royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 220093 7 Table 2. Component-wise ECA analysis of the ROI intensities: cumulative fractional explained variance in spectral (s 2 spec ) and structural (s 2 stru ) space and the corresponding CVs in the standardized parameter space. The CVs are oriented along increasing ROI intensity based on a linear fit on the predicted data for projection along the CV in question only.    The idea of using decomposition is to provide interpretation of spectroscopic data learned by an emulator. The aim is to identify dominant trends in a complicated structure-spectrum relation, with inherent loss of information. In this work, we used a linear transformation around a well-identifiable centre to identify relevant directions of spectral sensitivity. For more complicated data such as liquids, these centres may be numerous or a continuous valley of regions may appear-possibly with varying local spectral behaviour. As one potential way to solve the problem, a manifold approach might be used. In such an approach, locally linear variations would be studied together with additional parameters defining the local neighbourhood, e.g. particular molecular isomer. Such parametrizations could be made by energy criteria, by abundance of points in an MD trajectory, or by principal component or clustering analysis of the structural data. However, for spectral data that is severely wiggly or heavily scattered over the accessible structural space, it is hard to see any interpretation method to be able to draw correct universal trends from, as inverting the structure-spectrum function becomes impossible. It seems that a structural-information bottleneck can be reached in at least two ways: first, due to insensitivity of the probe to certain structural variation and, second, due to the back-and-forth wiggle of the spectra in the structural parameter space.

Conclusion
Spectroscopically relevant structural variability can be captured by decomposition techniques. Using ML-based emulators allows for decomposition of structural space based on explained spectral variance; this is an approach that outperforms partial least-squares fitting both in spectral coverage and structural selectivity. The presented ECA method relies on accurate and computationally light prediction of spectra for new structures enabled by ML emulators, the development of which is currently an active field of research. Application of this analysis on ROIs in the spectrum may provide a direct interpretation of experimentally observed and theoretically reproduced spectral change. Our results manifest X-ray spectra forming a bottleneck for structural information, some of which is not recoverable from them. Whereas high sensitivity might be beneficial for a detailed analysis of structure, sensitivity to only a few structural parameters may be used for identification of the related structural classes by their spectroscopic fingerprints. On the other hand, spectroscopic methods that are heavily sensitive to many parameters may require a statistical approach.