Spectroscopy learning: A machine learning method for studying diatomic vibrational spectra including dissociation behavior

Molecular spectroscopy plays an important role in the study of physical and chemical phenomena at the atomic level. However, it is difficult to acquire accurate vibrational spectra directly from either theory or experiment, especially for the vibrational levels near the dissociation energy. In our previous study (Variational Algebraic Method), the dissociation energy and low-lying energy level data were employed to predict the ro-vibrational spectra of some diatomic systems. In this work, we did the following: 1) We extend the method to a more rigorous machine learning approach that combines model-driven and data-driven elements (Spectroscopy Learning Method). 2) Information extracted from a wide range of existing data, such as heat capacity, can be used in this approach. 3) Reliable vibrational spectra and dissociation energies can be predicted from heat capacity, and the reliability of the method is verified on the ground states of the CO and Br$_2$ systems.


Method details
For diatomic molecules, the spectral lines in the short-range region can be measured experimentally, while the lines lying in the mid- and long-range region and the dissociation energy are difficult to measure. The spectroscopic parameters of the CO molecule are in great demand in astrochemistry research [1], and there remains interest in the Br$_2$ molecule for industrial applications [2,3]. These two species are therefore considered promising candidates for the study of vibrational spectra. In recent years, machine learning has found ways to use data [4] to construct reliable higher-dimensional functions, and it has shown good performance in solving problems of quantum mechanics and statistical mechanics [5][6][7]. In this paper, model-driven and data-driven methods are combined to predict accurate vibrational spectra of diatomic molecules, including the dissociation energy, from limited experimental information such as energy levels and heat capacity.
The method has three main parts.
1) Turning the spectroscopy problem into a machine learning optimization problem. First, a reasonable parametric model that can describe all the details of the vibrational spectrum is constructed to address the underfitting problem (discussed in Section 1). Second, a machine learning strategy is used to address the overfitting problem (discussed in Section 2).
2) Introducing two main approaches to control overfitting. First, the shape and size of the parameter space are limited (discussed in Section 3). Second, testing data that include heterologous information are used to verify the predictive power of the selected model (discussed in Section 4).
3) Using a greedy algorithm to search for the optimal model in the parameter space (discussed in Section 5).

Model analysis from and beyond quantum models
According to the Born-Oppenheimer approximation (BOA), the electronic Schrödinger equation of a diatomic molecule is given as
$$\hat{H}\,\psi(r, R) = E\,\psi(r, R) \tag{1}$$
where $\hat{H}$ is the Hamiltonian of the $N$-electron system, $E$ is the electronic eigenvalue and $\psi(r, R)$ stands for the total wave function of the system.
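In atomic units, the Hamiltonian takes the form
$$\hat{H} = -\frac{1}{2}\sum_{i=1}^{N}\nabla_i^2 - \sum_{i=1}^{N}\frac{Z_\alpha}{r_{i\alpha}} - \sum_{i=1}^{N}\frac{Z_\beta}{r_{i\beta}} + \sum_{i<j}^{N}\frac{1}{r_{ij}} + \frac{Z_\alpha Z_\beta}{R} \tag{2}$$
in which $Z_\alpha$ and $Z_\beta$ represent the charge numbers of nuclei $\alpha$ and $\beta$, $r_{i\alpha}$ and $r_{i\beta}$ are the distances between electron $i$ and the two nuclei, $r_{ij}$ is the distance between electrons $i$ and $j$, and $R$ is the internuclear distance.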
One is able to solve the radial nuclear Schrödinger equation for the vibrational energies and wave functions
$$\left[-\frac{\hbar^2}{2\mu}\frac{d^2}{dr^2} + \frac{\hbar^2}{2\mu r^2}\left[J(J+1) - \Lambda^2\right] + V(r)\right]\psi_{vJ}(r) = E_{vJ}\,\psi_{vJ}(r) \tag{3}$$
where $\mu$ is the reduced mass of the molecule, and $J$, $\Lambda$ and $v$ represent the total angular momentum quantum number, the absolute value of the projection of the electronic orbital angular momentum on the internuclear axis (corresponding to the electronic state) and the vibrational quantum number, respectively. $E_{vJ}$ is the rovibrational energy. The potential energy function $V(r)$ can be expanded to order 8 about the equilibrium position
$$V(r) = \sum_{n=2}^{8}\frac{1}{n!}\,f_n x^n, \qquad x = r - r_e \tag{4}$$
where, through the selection of the origin of coordinates, the constant term and the first-order term are set to 0, and $f_n = V^{(n)}(r_e) = \left(\frac{d^n V}{dr^n}\right)_{r=r_e}$ is the $n$-th order force constant.
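Using second-order perturbation theory to regroup the Hamiltonian in Eq. (3), the Hamiltonian can be rewritten, and the vibrational energies can be obtained by neglecting the rotational part of the diatomic molecule:
$$E_v = \omega_0 + (\omega_e + \omega_{e0})\left(v + \tfrac{1}{2}\right) - \omega_e x_e\left(v + \tfrac{1}{2}\right)^2 + \omega_e y_e\left(v + \tfrac{1}{2}\right)^3 + \omega_e z_e\left(v + \tfrac{1}{2}\right)^4 + \cdots \tag{7}$$
where $\omega_0$, $\omega_{e0}$, $\omega_e x_e$, $\omega_e y_e$, $\omega_e z_e$, … are the spectroscopic coefficients.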
The dissociation energy $D_e^{\mathrm{cal}}$ is a function of only the last three vibrational energies [8], as expressed in Eq. (8). From Eq. (6) and Eq. (7), it can be seen that the form of the perturbation results is similar to the energy expansion formula of Herzberg [9] and to the Dunham formula [10] derived from WKB theory. By comparison with a Taylor expansion, $E(v)$ can be regarded as a series expansion about $v = -\frac{1}{2}$. Any complex physical effect is reflected in the expansion coefficients, also known as molecular constants [11], which are usually obtained by fitting the experimental data with the least-squares method. For the study of the energy spectra of diatomic molecules, a large number of expansion terms are needed in the long-range region, so more spectral constants and polynomial terms are required. The problem is that, as the number of polynomial terms increases, the flexibility of the fit increases rapidly and overwrites the physical effects from other factors, resulting in overfitting. However, if the relevant high-order constants are discarded together with the uncontrollable errors, the quality of the fit degrades and underfitting appears. Therefore, the least-squares method is not applicable here [8], and a new method is needed to study molecular vibrations in the long-range region.

A data-driven approach based on machine learning
Machine learning uses existing data (experience) to obtain a model (its parameters) in order to predict unknown data. Very complex functions can be introduced in machine learning, such as deep neural networks (DNN), recurrent neural networks (RNN) and convolutional neural networks (CNN), which serve general purposes such as image classification. Their working principle is similar to that of the neural network shown in Fig. 1.
In general, machine learning means using a data set carefully to determine the right mapping $X \to Y$. To avoid underfitting, the number of parameters should be large enough to cover the established relationship when building the model. To control overfitting, all data are divided into a training set and a test set. The training set is used to determine the parameters, while the test set is used to verify the learning results. Overfitting can be further controlled by regularization methods, which restrict the model search to a sufficient but limited parameter space [4].
Similarly, we found that Eq. (7) has few parameters and high flexibility. Moreover, information relevant to the diatomic vibrational behavior (such as low-lying levels, dissociation energy and heat capacity) can be used as the training set and testing set. As mentioned above, however, overfitting is still the core challenge.

Limiting the shape and range of the parameter space
Table 1 Dissociation energy of the ground electronic state for CO molecule [14].

For diatomic molecular systems, the following matrix form can be obtained based on Eq. (7)
$$AX = E \tag{10}$$
where the matrices $A$, $X$ and $E$ are defined in Eq. (11). $X$ is the spectral constant matrix, which is the parameter that needs to be determined in the spectral learning process. The "real" energy levels $E$ are split into two parts: $E_{v_1}$ stands for the experimentally measured value, and $\delta E$ [12] is a small variational term that offsets any possible experimental error. According to Eq. (10), the molecular constants $X$ can be solved as
$$X = A^{-1}E \tag{12}$$
and the range of the parameter space ($X$) is constrained by the experimental error, Eq. (13). Occam's razor is used to further confine the shape of $X$: the simpler model ($X$) with sufficient expressive power is preferred. Usually, the dimension of $X$ is set to 5 as a starting guess; if 5 is not enough (i.e., the details of the data cannot be represented), then 6 is used, and so on.
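The linear step in Eq. (10) can be illustrated with a minimal Python sketch. It assumes that each row of $A$ holds the powers $(v + 1/2)^k$ and that any signs are absorbed into the entries of $X$. The constants in `X_TRUE` are hypothetical, loosely CO-like values used only to generate synthetic levels, and the helper names (`design_row`, `solve_linear`, `energy`) are illustrative and not taken from the original work.

```python
def design_row(v, m):
    """Row of A for level v: (v + 1/2)^k for k = 0..m-1 (assumed layout)."""
    x = v + 0.5
    return [x ** k for k in range(m)]


def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("singular matrix: choose distinct levels")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def energy(v, X):
    """Vibrational energy as a polynomial in (v + 1/2) with coefficients X."""
    x = v + 0.5
    return sum(coef * x ** k for k, coef in enumerate(X))


# Hypothetical constants in cm^-1 (illustrative only, not fitted data).
X_TRUE = [0.0, 2170.0, -13.3, 0.01, 0.0]
levels = {v: energy(v, X_TRUE) for v in range(10)}  # synthetic "experimental" levels

chosen = [0, 2, 4, 6, 8]  # five levels determine five constants
A = [design_row(v, len(chosen)) for v in chosen]
E = [levels[v] for v in chosen]
X = solve_linear(A, E)

# The levels that were not selected act as a validation set.
validation_error = max(
    abs(energy(v, X) - levels[v]) for v in levels if v not in chosen
)
print(X, validation_error)
```

Because the synthetic levels are noise-free, the five recovered constants reproduce the unselected levels essentially exactly. With real data the validation error is what signals whether five constants are enough.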

Preparing the data sets
Three different types of data are used to build the dataset.
1) The experimental vibrational energies $E_{v,\mathrm{exp}}$. If the size of $X$ is five, then five energy levels are enough to solve for it according to Eq. (10). The rest of the experimental levels can be used as a validation set. For example, there are 42 experimental vibrational energies available for the ground electronic state of the CO molecule. Assuming that the expansion order in the vibrational energy term (see Eq. (7)) of this system is $m$ ($\geq 5$), there are $C_{42}^{m}$ ways to select, from the 42 known data, the subset used in the calculation for a given $m$. The levels are therefore used as one part of the training data.
2) The heat capacity [13] ($C_{\mathrm{mol}}$), which is introduced to enhance the training data set. The molar vibrational heat capacity ($C_{\mathrm{mol}}$) can be obtained experimentally and has a strong relation with the levels, given by Eq. (14).
3) The dissociation energy ($D_e$ in Eq. (8)). It is worth noting that this quantity may have a very large uncertainty or may lack experimental data altogether. From a probabilistic point of view, predicting it correctly greatly enhances the reliability of the forecasts. It is therefore used as test data.
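The link between the levels and the heat capacity can be sketched in a few lines of Python. Eq. (14) is not reproduced here, so this sketch assumes the standard canonical-ensemble fluctuation formula $C = R\,(\langle x^2\rangle - \langle x\rangle^2)$ with $x = hcE_v/(k_B T)$, which may differ in form from the expression used in this work. The function names and the harmonic test ladder are illustrative assumptions.

```python
import math

C2_CM_K = 1.438776877  # second radiation constant hc/k_B in cm*K
R_GAS = 8.314462618    # molar gas constant in J/(mol*K)


def vib_heat_capacity(levels_cm, temperature):
    """Molar vibrational heat capacity (J/(mol*K)) from levels given in cm^-1.

    Uses C = R * (<x^2> - <x>^2) with x = hc*E/(k_B*T) and Boltzmann weights.
    """
    e0 = min(levels_cm)  # shift the zero of energy; the variance is unaffected
    xs = [C2_CM_K * (e - e0) / temperature for e in levels_cm]
    weights = [math.exp(-x) for x in xs]
    q = sum(weights)  # partition function relative to the lowest level
    mean = sum(x * w for x, w in zip(xs, weights)) / q
    mean_sq = sum(x * x * w for x, w in zip(xs, weights)) / q
    return R_GAS * (mean_sq - mean * mean)


def einstein_heat_capacity(omega_cm, temperature):
    """Closed-form result for an infinite harmonic ladder, used as a check."""
    u = C2_CM_K * omega_cm / temperature
    return R_GAS * u * u * math.exp(u) / (math.exp(u) - 1.0) ** 2


# Hypothetical harmonic ladder with spacing 2170 cm^-1 (illustrative value).
omega = 2170.0
ladder = [omega * (v + 0.5) for v in range(60)]
c_levels = vib_heat_capacity(ladder, 500.0)
c_einstein = einstein_heat_capacity(omega, 500.0)
print(c_levels, c_einstein)
```

The sum runs over whatever levels are supplied, so changing the upper levels (and hence the implied dissociation limit) changes the computed heat capacity. This sensitivity is what makes heat capacity usable as a criterion for $D_e$.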

Learning by optimizing
Now the parameter $X$ is constrained by Eq. (13), but there are still many possibilities for its value. In the learning steps, the objectives to optimize are given by Eqs. (15)-(17), where $D_e^{\mathrm{cal}}$ and $C_{\mathrm{mol}}^{\mathrm{cal}}$ are determined by $X$ from Eq. (8) and Eq. (14), respectively. The distance is defined by Eq. (18) for the vibrational energies, and by Eq. (19) and the expression that follows it for the dissociation energy and the heat capacity, respectively. Eqs. (15)-(17) can be used together to obtain $X^*$; however, in order to predict the vibrational spectra without using the experimental dissociation energy, only Eq. (15) and Eq. (17) can be used. That means $D_e$ is taken as an unknown parameter, and the heat capacity is introduced as an additional physical criterion to determine $D_e$.
In the actual calculation, we use a greedy algorithm to adjust the parameters $X^*$ one by one. The calculation details are as follows:
1) For a given $D_e$, five low-order parameters are used as the initial attempt to determine the size of the parameter $X$.
2) From the $m$ existing experimental levels, 5 experimental levels are selected arbitrarily. The 5 parameters in step 1) can then be obtained according to Eq. (10). There are $C_m^5$ different possible selections in total; fortunately, in practice only a few of them need to be tried to find a satisfactory solution.
3) Verify whether the parameters obtained in step 2) satisfy the criteria $\Delta E < 0.5$ cm$^{-1}$ in Eq. (18) and $\Delta D_e < 10$ cm$^{-1}$ in Eq. (19). It is worth noting that these criteria also ensure that the final error of the parameter solutions found from different initial selections in step 2) is very small.
4) If the criteria in step 3) are met, the calculation ends. Otherwise, keep the number of parameters at 5 and add a small variational term $\delta E$ (usually 1 cm$^{-1}$) to the first level, which gives two new candidate levels; then repeat the validation in step 3) to see whether the situation has improved. If not, the variational term is halved, $\delta E \to \frac{1}{2}\delta E$. If the criteria are still not met, the variational term is halved again, until the upper limit of 10 halvings is reached or convergence is achieved (usually 0.001 cm$^{-1}$). Next, adjust the remaining four energy levels in the same way.
5) If the criteria in step 3) are still not satisfied, increase the number of parameters by 1 (to 6 this time) and repeat steps 2) to 4) until step 3) is satisfied.
6) When step 5) is completed, the heat capacity can be calculated from the spectrum by Eq. (14). The heat capacity error curve for different dissociation energies can be drawn by changing the $D_e$ used in step 1), so that the optimal $D_e$ (the first inflection point) can be determined.
7) Compare $D_e$ with the experiment to assess the quality of the prediction.
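Steps 2) to 4) can be sketched as a small coordinate-wise greedy search in Python. This is a simplified illustration under stated assumptions, not the implementation used in this work: only the level criterion ($\Delta E < 0.5$ cm$^{-1}$) is checked, the $D_e$ criterion and the parameter growth of step 5) are omitted, and $\delta E$ is halved on every pass rather than only after a failed attempt. The synthetic levels, the simulated 0.8 cm$^{-1}$ measurement error and all function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit(vs, energies):
    """Solve A X = E with rows (v + 1/2)^k (assumed layout of A)."""
    A = [[(v + 0.5) ** k for k in range(len(vs))] for v in vs]
    return solve_linear(A, energies)


def energy(v, X):
    return sum(coef * (v + 0.5) ** k for k, coef in enumerate(X))


def max_error(X, levels):
    """Largest deviation between predicted and tabulated levels."""
    return max(abs(energy(v, X) - e) for v, e in levels.items())


def greedy_refine(levels, chosen, tol=0.5, delta0=1.0, max_halvings=10):
    """Shift the chosen levels one at a time by +/- delta, keeping improvements."""
    offsets = [0.0] * len(chosen)

    def evaluate(offs):
        shifted = [levels[v] + d for v, d in zip(chosen, offs)]
        X = fit(chosen, shifted)
        return max_error(X, levels), X

    best_err, best_X = evaluate(offsets)
    for i in range(len(chosen)):  # adjust the selected levels one by one
        delta = delta0
        for _ in range(max_halvings):
            if best_err < tol:
                return best_X, best_err
            for trial in (delta, -delta):  # two new candidate levels
                candidate = offsets[:]
                candidate[i] += trial
                err, X = evaluate(candidate)
                if err < best_err:  # greedy: accept only improvements
                    offsets, best_err, best_X = candidate, err, X
            delta *= 0.5
    return best_X, best_err


# Synthetic levels from hypothetical constants, with one simulated measurement error.
X_TRUE = [0.0, 2170.0, -13.3, 0.01, 0.0]
levels = {v: energy(v, X_TRUE) for v in range(12)}
levels[2] += 0.8

chosen = [0, 1, 2, 3, 4]
initial_err = max_error(fit(chosen, [levels[v] for v in chosen]), levels)
X, final_err = greedy_refine(levels, chosen)
print(initial_err, final_err)
```

The example deliberately fits the five lowest levels, so the small error on one of them is strongly amplified when extrapolating to the highest levels. The greedy shifts reduce that amplified error, but the sketch makes no claim that the tolerance is always reached with five parameters.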

Method validation
We collected the experimental dissociation energy values of the ground electronic state of the CO molecule reported over the years, as shown in Table 1.
The full vibrational spectra can be predicted using each of these experimental dissociation energy values, as shown in Fig. 2. The choice of dissociation energy was found to have a great influence on the predicted vibrational energies, especially for the levels near the dissociation energy. Better agreement between the measurements and the present calculation is found when using the dissociation energy $D_e = 90679.1$ cm$^{-1}$ [15], as shown in Table 2.
We then take the dissociation energy as an unknown quantity and use the relative error between the calculated ($C_{\mathrm{mol}}^{\mathrm{cal}}$) and experimental ($C_{\mathrm{mol}}^{\mathrm{exp}}$) heat capacity as the criterion to search for the dissociation energy that best meets our requirements. As shown in Fig. 3(a), the more accurate the dissociation energy is, the more reliable the calculated vibrational energies will be. Again, the best choice for the dissociation energy is $D_e = 90679.1$ cm$^{-1}$ [15].
The relation between the dissociation energy and the heat capacity provides a way to obtain the dissociation energy and makes it a good criterion to verify the reliability of this method. As shown in Fig. 3(b) ($T = 500$ K), the relative error of the heat capacity decreases gradually with increasing dissociation energy, and the dissociation energy at the first inflection point is close to the latest experimental value (90,679.1 cm$^{-1}$). Since the heat capacity also has its own uncertainty, the details of the change after the first turning point ($D_e > 91179.1$ cm$^{-1}$) can be ignored, so that the value of 91,179.1 cm$^{-1}$ that we found can be used as an estimate with an absolute error within 500 cm$^{-1}$ (5.5‰), which is better than the second-best dissociation energy value in Table 1. As shown in Fig. 3(c), a second temperature ($T = 1200$ K) was also used to find the dissociation energy ($D_e = 91579.1$ cm$^{-1}$), which is slightly worse than the result at 500 K.
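The deviations quoted for CO can be checked directly against the latest experimental value of 90,679.1 cm$^{-1}$. The following is only a restatement of the arithmetic; the symbol $\Delta D_e(T)$ for the absolute deviation at temperature $T$ is a notation assumed here.

```latex
\begin{aligned}
\Delta D_e(500\,\mathrm{K})  &= 91179.1 - 90679.1 = 500.0\ \mathrm{cm^{-1}}, &
\frac{500.0}{90679.1} &\approx 5.5 \times 10^{-3}\ (5.5\text{\textperthousand}) \\
\Delta D_e(1200\,\mathrm{K}) &= 91579.1 - 90679.1 = 900.0\ \mathrm{cm^{-1}}, &
\frac{900.0}{90679.1} &\approx 9.9 \times 10^{-3}\ (9.9\text{\textperthousand})
\end{aligned}
```

The larger deviation at 1200 K is consistent with the statement that the second temperature gives a slightly worse estimate.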
To further verify the effectiveness and practicability of this method, a similar analysis is carried out for the ground electronic state of Br$_2$. Several candidate points were selected near the experimental dissociation energy (16,057 cm$^{-1}$ [18]), each of which yields a group of vibrational energies, as shown in Fig. 4. The dissociation energy is determined in the same way as for the CO molecule, using the vibrational molar heat capacity as the criterion, and is found to be 16,165 cm$^{-1}$. In addition, the result shown in Fig. 5(c) at the second temperature ($T = 2400$ K) is consistent with that shown in Fig. 5(b) ($T = 3800$ K), and the dissociation energy is again found to be 16,165 cm$^{-1}$.

Declaration of Competing Interests
The authors confirm that there are no conflicts of interest.