Extraction of information about the molecule structure directly from GC-MS data

Gas chromatography – mass spectrometry (GC-MS) is a very important method of chemical analysis. GC-MS can be used for non-target chemical analysis and preliminary screening of completely unknown compounds. Electron ionization mass spectrometry is commonly used in GC-MS. Some information can be extracted directly from GC-MS data using machine learning methods. In several previous works, machine learning models extract information about the presence or absence of given substructures in a molecule directly from the electron ionization mass spectrum. Rarely, additional data such as molecular weight and retention index are used together with the mass spectrum as input features of such models; however, no systematic comparison of how the use of such data increases the prediction accuracy has previously been conducted. In this work, gradient boosting was used for prediction of the presence or absence of given substructures in a molecule. The following substructures were considered: aromatic ring, 5-membered aromatic ring, 6-membered aromatic ring without heteroatoms (benzene ring), nitrogen-containing aromatic ring, primary, secondary, and tertiary amino groups, nitrile, hydroxyl, carbonyl, methoxy, methyl, and carboxyl groups. Three types of additional features were used: molecular weight and the neutral loss spectrum (the molecular weight also allows for computation of the neutral loss spectrum), the retention index for the non-polar stationary phase, and the retention index for the polar stationary phase. A total of 8 feature sets were considered. In most cases, the molecular weight and neutral loss spectrum considerably improve the accuracy. Retention indices allow for a further accuracy increase. For polar functional groups such as carbonyl and hydroxyl, the effect of using retention indices is maximal. The use of retention indices for two stationary phases allows for the achievement of the best accuracy.
The best accuracy of prediction was achieved for the benzene ring and aromatic ring, the worst (but still high) accuracy was observed for the secondary amino group. The achieved accuracy was compared with the previous results. In addition to the classification tasks, the regression tasks were considered. The gradient boosting models that predict the number of aromatic atoms, methyl groups, and benzene rings were developed. It was observed that the use of additional features considerably improves the accuracy in this case. Finally, it should be noted that the regression models underestimate the number of occurrences when the number is high.


Introduction
Gas chromatography – mass spectrometry (GC-MS) is a widely used analytical method for both targeted and non-targeted analysis of complex mixtures of volatile compounds. It is widely used in industry, pharmaceutics, environmental analysis [1,2], and in metabolomics studies [3,4]. The most widely used approach for non-target GC-MS analysis is the library search in MS databases [5], such as the NIST 17 database [6]. Unfortunately, the majority of organic molecules are absent in all MS databases, and standard samples also are not available. Qualitative screening of such molecules is a complex task [7,8]. Fortunately, there are software and machine learning approaches that facilitate this task. There are many tools for prediction of mass spectra from the structure of a molecule [8-10]. There are also tools for prediction of the presence or absence of substructures (fragments of a molecule) in a molecule based on the mass spectra [4,11-16]. There are many such tools [17,18]. Some of them predict the so-called "molecular fingerprint" (a long vector of bits; each bit corresponds to the presence or absence of a structural feature) [4,11,12]. Others predict the presence or absence of specific fragments and functional groups that are of most interest to the researcher.
The accuracy of prediction of the "molecular fingerprint", even by the most modern tools, is not very high. For example, in the work [11], for approximately half of the bits constituting the fingerprint, the accuracy is less than 0.9. Such a molecular fingerprint can be used for automated search in the database of possible candidates. Many works are devoted to prediction of the presence or absence of common substructures and functional groups. For example, in the work [13], the presence or absence of several structural features such as a benzene ring, a dimethylamine group, etc., was predicted. The observed classification accuracy lies in the range of 76-95 for all considered fragments (except trimethylsilyl groups). Another similar work is the well-known work by Varmuza et al. [14], in which several classifying models of common substructures were developed. The majority of such works [13,14], including the MOLGEN-MS software [18], use only the mass spectrum as a source of features (Figure 1A).
In the work [17], the model predicts the presence or absence of many common functional groups, such as -NH2, -OH, =O. In that work, unlike the majority of other works, the molecular weight (determined in a mass spectral experiment) is used for creation of features. In another work [15], the Golm Metabolome Database is used for training and validation, and the authors predict the presence or absence of many functional groups with relatively high probability. The authors use information about the retention index as an additional feature. The vast majority of these works provide only binary classifications, predicting the presence or absence of fragments, while determining the number of occurrences can be a valuable task.
The aims of this work are (I) to study how the additional use of information about retention (on different stationary phases) and molecular weight (Figure 1B) together with the mass spectrum affects the accuracy of prediction, and (II) to consider not only classification models but also regression models that predict the number of occurrences of a given structural feature in a molecule rather than just its presence or absence.

Methods
The NIST 17 database was used for training and validation of the developed models. Molecules containing elements other than H, B, C, N, O, F, P, S, Cl, Br, I were excluded from the data sets, as well as molecules with a molecular weight (MW) of more than 300 and molecules for which the retention index cannot be predicted using a 1D convolutional neural network (CNN) [19]. Spectra with peaks with m/z more than 300 and spectra without peaks with m/z less than 50 were excluded from the data set. The absence of peaks with m/z less than 50 highly likely means that the scanning range starts from high m/z values; such spectra are not suitable for the considered task. The data set contained 132489 spectra. The initial data set was split into training, validation, and test data sets containing 105871, 13367, and 13251 mass spectra, respectively.
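The filtering rules above can be sketched as follows (an illustrative sketch, not the authors' code; the function name and the {m/z: intensity} dict representation are assumptions):

```python
def spectrum_passes_filters(peaks, mw, max_mz=300, min_mz=50, max_mw=300):
    """Apply the data-set filters: MW <= 300, no peaks above m/z 300,
    and at least one peak below m/z 50 (the absence of such peaks
    suggests the scan range started at a high m/z)."""
    if mw > max_mw:
        return False
    if any(mz > max_mz for mz in peaks):
        return False
    if not any(mz < min_mz for mz in peaks):
        return False
    return True
```

The element-composition and RI-predictability filters would require a cheminformatics toolkit and the CNN model, so they are omitted here.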
Gradient boosting was used for prediction of the number and presence of substructures. The models were trained using the XGBoost [20] library (version 1.5.1); the hyperparameters used are listed in Table 1. For regression tasks, early stopping was used: if the accuracy for the validation set does not improve for 250 iterations, the training is stopped. The following features were used with the gradient boosting model: mass spectrum (peak intensities scaled to the range 0-1, for each integer m/z in the range 1-300), neutral loss spectrum (see below), MW (divided by 1000), retention indices (RI) for polar (RIpolar) and non-polar (RInon-polar) stationary phases (SP), as well as the difference between the RI for the polar and non-polar SP (RIpolar - RInon-polar).
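A minimal sketch of how such a feature vector could be assembled (illustrative only; the function name and input representation are assumptions, and the neutral loss spectrum described below would be concatenated in the same way):

```python
import numpy as np

def build_features(spectrum, mw, ri_polar, ri_nonpolar):
    """Assemble one feature vector: intensities for m/z 1-300 scaled
    to [0, 1], MW / 1000, both RIs / 1000, and their difference.
    `spectrum` maps integer m/z to intensity."""
    ms = np.zeros(300)
    for mz, intensity in spectrum.items():
        if 1 <= mz <= 300:
            ms[mz - 1] = intensity
    if ms.max() > 0:
        ms = ms / ms.max()  # scale peak intensities to 0-1
    extra = np.array([mw / 1000.0,
                      ri_polar / 1000.0,
                      ri_nonpolar / 1000.0,
                      (ri_polar - ri_nonpolar) / 1000.0])
    return np.concatenate([ms, extra])
```

The resulting vector can be fed directly to an XGBoost classifier or regressor.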
The RI values were divided by 1000. The neutral loss spectrum is interrelated with the mass spectrum by the following equation:

N_n = I_(M-n)   (1)

where I_n is the intensity of the mass spectrum at m/z = n, N_n is the intensity of the neutral loss spectrum corresponding to the loss of n mass units, and M is the MW of the molecule.
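The relation N_n = I_(M-n) maps directly to code; a minimal sketch (illustrative; assumes intensities are stored in a list indexed by integer m/z):

```python
def neutral_loss_spectrum(intensities, mw, max_mz=300):
    """Compute the neutral loss spectrum N_n = I_(M-n), where
    `intensities[n]` is the peak intensity at m/z = n and `mw` is
    the molecular weight M. Losses outside the recorded m/z range
    are left at zero."""
    nl = [0.0] * (max_mz + 1)
    for loss in range(1, max_mz + 1):
        mz = mw - loss
        if 0 < mz <= max_mz and mz < len(intensities):
            nl[loss] = intensities[mz]
    return nl
```

For example, for a molecule with MW 122, a peak at m/z 105 becomes a neutral loss of 17 mass units.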
Because NIST 17 contains information about RI only for a few molecules, RI values predicted using a 1D CNN were used as features. These RI values are close to the experimental ones [19]. For prediction of RI, a 1D CNN with the following hyperparameters was used: 2 CNN layers (kernel size 6) with 300 output channels; 2 fully connected layers with 600 and 1 output nodes; the rectified linear activation function was used for all layers except the linear output layer; early stopping using a validation set was applied. More information about the 1D CNN for retention index prediction is given in our previous work [19]. The mean and median absolute errors of prediction were 45.5 and 17.2, respectively, for the non-polar SP. For the polar SP, these values were 67.7 and 29.5, respectively. The error values are given for test sets. As the initial values of the trainable parameters (weights and biases) of the neural network for the polar SP, the parameters obtained for the non-polar SP were used.
The following accuracy measures were considered for binary classification tasks:

TPR (recall) = TP / (TP + FN)   (2)
precision = TP / (TP + FP)   (3)
FPR = FP / (FP + TN)   (4)
C = -(1/N) Σ_i [Y_i ln y_i + (1 - Y_i) ln(1 - y_i)]   (5)

where TN, FN, TP, FP are the numbers of true negative, false negative, true positive, and false positive predictions; y_i and Y_i are the prediction and the correct answer for the i-th sample; N is the number of samples; C is the binary cross-entropy (CE). For computation of TN, FN, TP, FP, the float prediction values were rounded to 0 or 1. ROC (Receiver Operating Characteristic) curves were also considered, and the area under such curves (ROC-AUC) was used as an additional measure. For regression tasks, the root mean square error (RMSE) was used; RMSE was also used as the early stopping criterion for regression. The validation set was used for early stopping; all accuracy measures given below were calculated for the test set.
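These measures can be computed as follows (a self-contained sketch; function and variable names are illustrative):

```python
import math

def classification_metrics(y_pred, y_true):
    """TPR (recall), FPR, precision, accuracy, and binary cross-entropy.
    `y_pred` holds probabilities in [0, 1]; the counts TP/FP/TN/FN use
    predictions rounded to 0 or 1."""
    tp = fp = tn = fn = 0
    for p, t in zip(y_pred, y_true):
        b = round(p)
        if b == 1 and t == 1:
            tp += 1
        elif b == 1 and t == 0:
            fp += 1
        elif b == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    n = len(y_true)
    eps = 1e-12  # guard against log(0)
    ce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
              for p, t in zip(y_pred, y_true)) / n
    return {"TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "precision": tp / (tp + fp),
            "accuracy": (tp + tn) / n,
            "CE": ce}
```

Note that TPR, FPR, and precision are undefined when the corresponding denominator is zero; real code would need to handle that case.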

Results and discussion
In order to study how the use of various features (neutral loss spectrum, RI) affects the accuracy of prediction of the presence or absence of a given substructure in a molecule based on the electron ionization mass spectrum, a series of computational experiments was conducted (see Table 2). For 13 substructures (see Table 3), the gradient boosting model was trained with 8 sets of features, and the accuracy was evaluated. In all 8 cases, the intensities corresponding to m/z 1-300 in the raw mass spectrum were used as features, with or without additional features. The area under the ROC curve [21] was considered as the primary accuracy measure. The XGBoost predictor outputs a float value in the range [0, 1] instead of a binary value; this value characterizes the probability of the presence of the given fragment. Accuracy measures such as TPR (recall), precision, and FPR (see equations (2)-(4)) depend on the threshold value above which the XGBoost prediction is considered positive. The ROC-AUC measure [21] does not depend on the threshold value and characterizes such a predictor well. The perfect ROC curve passes through the points (0,0), (0,1), (1,1). The closer the ROC curve passes to the point (0,1), the closer it is to the perfect one.
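The threshold-independence of ROC-AUC is easy to see from its probabilistic form: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch (illustrative; a real pipeline would typically use a library implementation):

```python
def roc_auc(scores, labels):
    """ROC-AUC via its probabilistic interpretation: the fraction of
    (positive, negative) pairs in which the positive sample is scored
    higher (ties count as 0.5). No threshold is involved."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is O(n_pos * n_neg); library implementations use a sort-based O(n log n) computation, but the result is the same.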
As an example, Figure 2 shows ROC curves for prediction of the presence or absence of nitrile and carbonyl groups. Note that for better readability, the axis range in Figure 2 is not [0, 1]. Figure 2 clearly shows that the use of the neutral loss spectrum and molecular weight as additional features considerably improves the prediction accuracy. The use of RInon-polar as an additional feature does not greatly improve the accuracy, but the use of information about RI for two SP further increases the accuracy. Table 3 demonstrates that for all substructures, the use of MW and the neutral loss spectrum considerably improves the accuracy, and the use of RI for two SP allows for the achievement of the best accuracy. In some cases, the use of RI for only one SP gives an accuracy gain comparable to the use of RI for two SP (for example, -CH3, -NH2, benzene ring); in other cases, the use of RI for two SP gives considerably better accuracy (for example, -OH, aromatic nitrogen). The complete data are given in Table 3. For accuracy measures other than ROC-AUC, the situation is similar. For example, Table 4 shows various accuracy measures for prediction of the presence or absence of aromatic atoms in a molecule using various feature sets. The precision is almost constant for various feature sets, while the recall increases; this means that the overall accuracy improves. Note that for all accuracy measures except CE, the value 1.0 corresponds to the perfect model, and for CE, the value 0.0 corresponds to the perfect model.
In addition to the classification task (the prediction of the presence or absence of a given substructure), the regression task was also considered: the prediction of how many times a substructure is present in a molecule, for 3 substructures (see Table 5). In this case, the effect of additional features is even larger than for the classification tasks. Figure 3 shows the actual vs predicted plots for prediction of the number of aromatic atoms and CH3 groups in a molecule. The prediction accuracy is quite high for a low number of occurrences; however, Figure 3 clearly shows that in cases when the correct answer is high, this value is considerably underestimated. As an additional example, the ROC curve for prediction of the presence or absence of a benzene ring (6-membered ring containing only carbons) and the actual vs predicted plot for the number of benzene rings are shown in Figure 4. Figures 3-4 show the data for the full set of features. We also tried to tune the XGBoost hyperparameters for each feature set and task separately and to make such comparisons using various sets of hyperparameters (other than those given in Table 1). However, the same patterns are observed for various sets of hyperparameters, and the same hyperparameters are nearly optimal for various sets of features and various tasks. Taking this fact into account, all comparisons were made using the same set of hyperparameters. The accuracy of classification was also compared with the accuracy given in the work of Stein et al.
[17]. Because such measures as recall and precision depend on the probability threshold (as the threshold increases, the precision increases and the recall decreases), the comparison was made at a fixed precision value of 0.9. Table 6 contains such a comparison. Since the work [17] does not use information about RI, we considered only feature sets 1 and 2 (see Table 2) in this comparison. It can be concluded that the considered classifiers in most cases have the same or better accuracy compared with those described in the work [17]. Worse accuracy was observed only for the worst predicted substructures: non-aromatic nitrogens. However, the work [17] uses a much older version of the NIST library, so this comparison is not completely correct; a fully correct comparison of machine learning methods should be made using the same data set. The models developed in this work were applied to the mass spectra and retention indices of the recently identified transformation products of unsymmetrical dimethylhydrazine (UDMH). These compounds are not available in mass spectral databases, and elucidating their structure using only chromatography and mass spectrometry is a very difficult task [22]. Table 7 shows the structures of three UDMH transformation products and the model predictions for them. The first compound contains two conjugated aromatic rings. Prior to the publication of work [22] by our team, similar transformation products of UDMH were not known. The models developed in this work receive the retention indices for two stationary phases and the mass spectrum as input and make it possible to obtain preliminary information about the structure and to limit the number of possible candidates. This is an excellent starting point for further elucidation of the structure [22], for which the observed experimental data are consistent with the results of predicting mass spectra and retention indices. The prediction models developed in this work will be implemented in our
previously published free and open-source software [22], which can be obtained at the following link: https://github.com/mtshn/svekla

Conclusions
A model that allows for the prediction of the presence and number of given substructures in a molecule based on GC-MS data was built using the XGBoost library. The use of additional data besides the electron ionization mass spectrum considerably improves the prediction accuracy. If the molecular weight is known in addition to the mass spectrum (for example, it can be determined using a chemical ionization ion source), the prediction accuracy considerably improves; in this case, the neutral loss spectrum can be used as an additional set of features. The retention index is less important, but its use considerably improves the prediction of the presence or absence of polar functional groups such as hydroxyl and carbonyl. The use of retention indices for two stationary phases improves the accuracy more than a single retention index. For regression tasks, when models predict the number of occurrences rather than the presence or absence, the use of additional features considerably increases the accuracy.

Fig. 1.
Fig. 1. Extraction of information about the structure from the electron ionization mass spectrum (A) and from all available GC-MS data (B)

Fig. 3 .
Fig. 3. Actual vs predicted plots for prediction of the number of aromatic atoms (A) and CH3 groups (B) in a molecule. The boxes show the range containing half of the predictions, and the whiskers demonstrate the ranges containing 90% of the predictions

Table 1 .
Hyperparameters of the XGBoost library used in this work

Table 2 .
Additional features that are used together with the raw mass spectrum

Table 5 .
Accuracy of prediction of the number of various substructures in a molecule from GC-MS data using different feature sets (see Table 2)

Table 6 .
Comparison of the classification accuracy achieved in this work with the accuracy from the work [17]. The values of recall at precision = 0.9 are given

Table 7 .
Application of models to UDMH transformation products