Supplemental Information: Using Gaussian Process Regression to Simulate the Vibrational Raman Spectra of Molecular Crystals

Vibrational properties of molecular crystals are constantly used as structural fingerprints, in order to identify both the chemical nature and the structural arrangement of molecules. The simulation of these properties is typically very costly, especially when dealing with response properties of materials to e.g. electric fields, which require a good description of the perturbed electronic density. In this work, we use Gaussian process regression (GPR) to predict the static polarizability and dielectric susceptibility of molecules and molecular crystals. We combine this framework with ab initio molecular dynamics to predict their anharmonic vibrational Raman spectra. We stress the importance of data representation, symmetry, and locality, by comparing the performance of different flavors of GPR. In particular, we show the advantages of using a recently developed symmetry-adapted version of GPR. As an examplary application, we choose Paracetamol as an isolated molecule and in different crystal forms. We obtain accurate vibrational Raman spectra in all cases with fewer than 1000 training points, and obtain improvements when using a GPR trained on the molecular monomer as a baseline for the crystal GPR models. Finally, we show that our methodology is transferable across polymorphic forms: we can train the model on data for one crystal structure, and still be able to accurately predict the spectrum for a second polymorph. This procedure provides an independent route to access electronic structure properties when performing force-evaluations on empirical force-fields or machine-learned potential energy surfaces.


Optimizing hyperparameters
In order to find the optimal hyperparameters σ and η used in GPR, we have used a cross-validation method. We picked a set of structures (here by farthest point sampling) from an NVT trajectory, which we then divided into a training and validation set. We then varied the hyperparameters and calculated the error (see Eq. 6) on the validation set in each case. In Fig. S1 we show an example of a contour plot of xx for Paracetamol I. Very similar profiles are obtained σ η Figure S1. Contour plot of xx, using GPR with a density grid representation, with 300 training points and 100 test points. Blue (red) indicates smaller (larger) errors. The contour actually shifts to the left when increasing the number of training points.
for all polarizability components. The values for σ and η given in the main text have been optimized using random subselections of 900 structures for the training and 300 for the validation set.
A similar procedure has been carried out to optimize the radial cutoff r cut and the Gaussian width σ entering the SOAP-kernel definitions. In this case, random subselections of 1500 configurations out of a total of 2000 has been chosen to train the model while the remaining 500 are selected as a validation set. Both for the monomer and the crystal case, we find that σ = 0.3Å and r cut =4.0Å yield the best learning performances. We multiplied the linear λ-SOAP kernel by a λ = 0 scalar kernel with the same parameters, so as to obtain a non-linear kernel that captures higherorder structural correlations without sacrificing the covariant behavior of the model [1].

Atomic coordinates
In the main text, Gaussian process regression was systematically used with a density-grid representation.
In this section we show the performance of a GPR approach based on unprocessed coordinates in Fig. S2. Note that representing the system through an atomic coordinates only entails using the cartesian components as the feature vector u(A), thus requiring an alignment to a reference structure. We observe that the learning curve when simply taking raw coordinates to train the model shares a similar behaviour with the atomic density representation, but is systematically a little lower, meaning that it needs more training points to reach the same level of accuracy. Therefore the simplest representation with unprocessed atomic coordinates was not considered in the rest of this article, where GPR is systematically employed in combination with atomic densities.

Fully converged Raman spectrum (molecule)
To make the parallel with Figs 10 and 11 in the main text, we show in Fig. S3 the Raman spectrum of the Paracetamol molecule, using SA-GPR trained on different training models, containing each 2000 structures.
We see that with this amount training points, SA-GPR reproduces exactly the ab initio spectrum of the Paracetamol molecule; At this level of accuracy, the error bars are barely visible, and low frequencies and high frequencies are equally well described.
Furthermore, comparison with Fig. 10 in particular highlights the difficulty in predicting a molecular crystal rather than one of its molecular constituents: many more training points are needed in the case of Paracetamol I to reach the same level of accuracy as that of the molecule.

Variations in polarizability predictions
In the main text, we showed results that were obtained each time with a unique subselection of the training set.
To show the dependence of our predicted polarizabilities on the choice of the training data, we chose a fixed validation set, but varied the selection of training data, keeping its size constant. Results are shown in Fig. S4. While some substantial modifications

Molecules in crystal
Let us have a look at the Raman spectrum of a single molecule as behaving inside the crystal, which we shall call "interacting molecule", and compare it to that of the isolated molecule, shown in Fig. S6. We see that the two spectra are very different. The very high-frequency peaks of the isolated molecules (O-H and N-H stretching motions) disappear in the case of the interacting molecule, and also the low-frequency range shows some important changes. Actually, the interacting-molecule spectrum looks much closer to the full crystal spectrum. It actually seems somewhat counter-intuitive that the molecular spectrum would differ that much at high frequency.
Indeed, the higherfrequency vibrational modes should be governed by intramolecular vibrations, and in this regard are expected to be similar within the crystal. Interestingly now, if we compute directly the autocorrelation function of α Σ , i.e., the sum of molecular contributions, and compare it to the full crystal spectrum, we find a very good agreement, as shown on Fig. S7, provided that we apply a posteriori a fixed scaling factor for each different frequency range. We see in particular that  Figure S7. Raman spectrum computed directly from the sum of molecular polarizabilities (obtained with GPR) as compared to the actual crystal spectrum. No machine learning is used here for representing the crystal; However, note that the spectrum obtained from α Σ has been arbitrarily scaled (different factor for each panel) so that both intensities are comparable for every frequency range.
the three main peaks of the low-frequency range are present in the molecular spectrum, although not with the right intensity. We note that for the latter figure, we used Gaussian process regression for predicting the individual molecular polarizabilities, but we did not use any machine-learning scheme to predict directly the crystal polarizabilities. All the information coming from the interaction between the different molecules is thus absent, and still needs to be grasped by a machinelearning approach. As illustrated in Fig. S8 for GPR, the prediction accuracy can be improved even further if one scales the molecular polarizability tensors so that their average matches that of the full crystal. The variations of α Σ are very close to that of the full polarizability of the crystal, but both its absolute values and derivatives are considerably smaller. The scaled version reduces these differences.

Harmonic vibrational Raman spectra
A more widely used possibility to simulate vibrational Raman spectra in an ab initio manner is through the harmonic approximation, in which the potential energy is Taylor-expanded up to second order; Within this approximation, harmonic Raman intensities I H (ω) are proportional to the derivatives of the polarizability tensor with respect to atomic displacements (see e.g., Refs. [2,3,4]): where (α ij ) p = (∂α ij /∂Q p ) 0 is the derivative of the ij component of the polarizability with respect to the displacement of normal mode Q p . In practice, we evaluate these derivatives by finite differences: we make 6n (forward and backward, n being the number of atoms per unit cell) small nuclear displacements in the unit cell around the equilibrium position, on top of which we compute the polarizability tensor with DFPT. Because in the harmonic approximation, every calculation takes place close to equilibrium, the standard deviation of the polarizability in that case will be very small. The challenge to predict the latter with machine learning is thus slightly different than in the previous case, where polarizabilities can experience large variations. Here the model will have to be sensitive to very small symmetric alterations in the structure.
We show the results for Paracetamol I in Fig. S9, where we train on an NVT trajectory, as described in the main text, and use Eq. 15. We see that the predicted harmonic spectrum reproduces nicely the ab initio one. We note that the predicted polarizabilities experience a small global offset, which does not impact the spectrum. Beyond this agreement, these results also signify that the computational cost of a harmonic Raman spectrum can be basically reduced to the cost of a simple geometry optimization.

Offset in predictions of Paracetamol II
In this section we show in Fig. S10 that the prediction of the individual susceptibility components that enter the Paracetamol II Raman spectrum presented in the main text (based on the GPR trained for form I) show different offsets with respect to the DFPT values..