Deep Q-Learning-Based Molecular Graph Generation for Chemical Structure Prediction From Infrared Spectra

In this article, we present a novel approach to predicting chemical structures from their infrared (IR) spectra using deep Q-learning. IR spectral measurements are widely used in chemical analysis because they provide information on the types and characteristics of the chemical bonds present within compounds. However, there are currently no algorithms that predict the entire chemical structure of a broad range of compounds relying solely on IR spectra, unless there is an exact or closely matched spectrum in an existing reference spectra library. To address this, we apply double deep Q-learning to the automated prediction of the entire chemical structures of organic compounds based on IR spectra. Our method builds predicted structures by starting from a single carbon atom and adding an atom and bond step-by-step, ranking the rewards of each possible addition based on Q-values. To achieve our goal, we devised two new structural similarity metrics: atom-bond counts and substructure counts. Compared to the commonly used structural similarity score, the Jaccard index of extended-connectivity fingerprints, the devised metrics exhibit more suitable properties for Q-learning. The deep Q-model that uses the combination of our two proposed metrics gives the overall best performance and can generate structures similar to the actual structures in terms of their structural features and molecular weight in most tested cases.


I. INTRODUCTION
INFRARED (IR) spectroscopy is commonly used in optical methods for research and development on chemicals and materials [1] and in broader areas including forensics [2], environmental analysis [3], and astronomy [4]. One of the common ways of utilizing IR spectroscopy is the identification of unknown compounds, i.e., compounds not in the training set, including previously unreported compounds, by searching for a similar spectrum within an existing reference spectra library. On the other hand, determining the chemical structures of compounds using IR spectra alone, without a reference library search, remains a challenging problem. In IR spectroscopy, each compound produces a unique IR spectrum based on the vibrational properties of the chemical bonds present within the compound's structure [5]. This means that the spectrum of a compound is a product of its molecular structure. Therefore, IR spectra provide valuable information for chemical structure determination. However, except for cases where compounds are very small, the use of IR spectra is typically limited to determining whether certain functional groups (i.e., substructures) exist within the compounds [5] or whether two samples are chemically identical to each other [6]. This is partly due to the possibility of peak overlaps, variations in absorption cross-sections (i.e., how strongly each functional group absorbs IR light), and the lack of IR absorption by some types of chemical bonds. These factors pose significant challenges for IR spectra-based prediction of chemical structures, in contrast to the simulation of IR spectra from chemical structures [7], [8], [9], [10].
The recent development of machine learning methods provides new opportunities for big data processing, including IR spectra analysis. Earlier works explored machine learning methods' ability to predict the presence or absence of certain substructures (i.e., functional groups) within a compound based on its IR spectrum [11], [12], [13]. Recent progress in the machine learning field includes the development of support vector machines [14] and neural networks that take IR and mass spectrometric data [15] to improve substructure identification accuracy. However, in terms of IR spectra-based prediction of entire chemical structures, no advancement has been reported since the seminal 1995 work by Hemmer and Gasteiger [16], which employed IR spectral similarity search and subsequent spectral modeling. Meanwhile, as per [17], the similarity of IR spectra does not always correlate well with the structural similarity of compounds, which limits the use of spectral similarity search in the structure determination of unknown compounds. Therefore, it is of significant interest to develop computational methods for IR spectra-based prediction of an entire chemical structure without relying on spectral similarity search.
In this article, we report a new approach for IR spectra-based prediction of the entire chemical structure using deep Q-learning [18] to generate multiple structure predictions. We investigated the applicability of multiple structural similarity metrics as reward functions for deep Q-learning to train our structure prediction networks. Deep Q-learning was employed as it allows for heuristically searching the space of possible chemical structures. Our approach is to build molecular graphs of chemical structures step-by-step by estimating the value of possible intermediate structures based on the provided IR spectrum. We apply deep Q-learning to estimate the value of possible intermediate structures after the addition of an atom and bond to the current intermediate structures in each step. Finally, we evaluate our model based on: 1) the rate at which the actual structure was found in the top predictions; 2) similarity in terms of existing functional groups; and 3) similarity of molecular weights between our predictions and the actual structure.
The rest of the article is organized as follows. Section II provides a brief overview of the related background. Section III covers our proposed approach. In Section IV, we present our experimental setup and results, followed by a discussion on various aspects of our approach in Section V. Finally, Section VI concludes the article.

II. BACKGROUND

A. Deep Q-Learning
Double deep Q-learning is one of the commonly used methods of reinforcement learning [19], and it is a variation of Q-learning. In Q-learning, a Q-table is created that stores the learned value of each possible action given the current state. In this process, the current state has a finite number of possible actions that can be taken. Each of these actions leads to a new state. This state-action-state chain continues until some stopping condition is reached. The value of Q[s, a], where s is the current state and a is an action, is learned using the formula in (1).
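In its standard tabular form, consistent with the parameters described next, the update in (1) is

$$Q[s, a] \leftarrow Q[s, a] + \alpha \left( r + \gamma \max_{a'} Q[s', a'] - Q[s, a] \right) \tag{1}$$

where $r$ is the immediate reward and $s'$ is the state reached by taking action $a$ in state $s$.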
Equation (1) has two parameters, α and γ, which represent the learning rate and the discount factor for future rewards, respectively. A stochastic search method called greedy epsilon learning is used to explore the space by taking random actions and then slowly increasing the probability of using the Q-table to select the next action. The probability of using the Q-table over a random action is controlled by the value of ε, which slowly decays over the training process. The Q-learning process requires that the reward function fit a Markov decision process (MDP), which is a discrete-time stochastic control process. In an MDP, the value of any given state must be based purely on that state and not on the steps taken to reach it [20].
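As a minimal sketch of greedy epsilon learning, the following Python fragment selects actions and decays ε linearly; the helper names are illustrative, while the schedule (1 to 0.1 over 500 frames, then held) matches the values reported in Section III-C:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(frame, start=1.0, end=0.1, decay_frames=500):
    """Linearly decay epsilon from start to end over decay_frames, then hold."""
    if frame >= decay_frames:
        return end
    return start + (end - start) * frame / decay_frames
```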
In deep Q-learning, a deep neural network replaces the Q-table. This has the advantage of not requiring all possible states to be visited and all possible actions to be taken in order to learn the problem. The network takes the state as an input and outputs the predicted Q-value. When Huber loss, a commonly used loss function for deep Q-learning, is employed, the loss function can be represented using (2).
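A standard form of this loss, written with the temporal-difference error as its argument, is

$$L = \mathrm{Huber}\!\left( r + \gamma \max_{a'} Q[s', a'] - Q[s, a] \right), \qquad \mathrm{Huber}(x) = \begin{cases} \tfrac{1}{2}x^2 & |x| \le 1 \\ |x| - \tfrac{1}{2} & \text{otherwise} \end{cases} \tag{2}$$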
Double deep Q-learning is a variation of deep Q-learning that is more stable than standard deep Q-learning [19]. In general, errors in the model's estimations of the actual Q-table propagate through the Q-model's estimations of future rewards, resulting in suboptimal evaluations of actions. To address this issue, the double deep Q-learning method uses a second Q-model called Q-target. On a chosen interval, the weights and biases from the Q-model are copied to Q-target. The Q-model is trained while Q-target's weights remain constant. Q-target is then used to estimate future rewards. Using a second target model results in a much better performance of the final Q-model. The refined loss function can be represented by (3).
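With the target network $Q_t$ held fixed between syncs, a form of (3) consistent with this description is

$$L = \mathrm{Huber}\!\left( r + \gamma \max_{a'} Q_t[s', a'] - Q[s, a] \right) \tag{3}$$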

B. Molecular Graph Generation and Optimization
Molecular graphs have become an increasingly popular representation of chemical structures [21], [22]. The use of molecular graphs in reinforcement learning [23], [24], [25], [26] has been reported to show advantages over the use of SMILES string generators [27], [28], [29], [30] in terms of effectiveness at generating chemically valid structures. In the context of applying reinforcement learning to goal-directed molecular graph generation, You et al. [23] and De Cao et al. [26] recently reported the graph convolution policy network and the molecular generative adversarial network (MolGAN), respectively. These are reinforcement learning-based approaches with adversarial training. The reported networks can generate graph structures that maximize a set of predicted metrics including the octanol-water partition coefficient (logP), druglikeness (QED), and molecular weight. In 2019, Zhou et al. [24] also reported the molecule deep Q-network (MolDQN), which optimizes a molecular graph based on logP, QED, and structural similarity by using double deep Q-learning. Their models consist of a multilayer perceptron that takes an extended-connectivity fingerprint (ECFP) [31] of a chemical structure as an input and outputs Q-values for the various actions it can take. Işık et al. [25] reported a variation of MolDQN that implements graph convolution and graph attention networks, which use molecular graphs as inputs and return structures that are similar to the inputs with optimized logP and QED scores. Their approach to generating new molecular graphs involved the modification of an intermediate structure by taking actions such as adding new bonds, removing bonds, or adding atoms, followed by the evaluation of Q-values for all generated graphs. The final molecular graph of the optimized structure is the one that scored the highest Q-value within a predetermined number of steps.

III. APPROACH
The goal of our work is to predict the structure of an unknown compound based on its IR spectrum. To achieve this goal, we generate and evaluate a set of molecular graphs using a deep Q-model trained to predict the value of intermediate and final predicted structures. In Fig. 1(a), we show the process of generating a set of predicted structures, in the form of molecular graphs, from an IR spectrum. This process starts with a single carbon atom and adds one new atom and bond to the carbon atom. From this new set of possible intermediate structures, the structures with the highest Q-values are selected as the current top intermediate structures. A second measure, called the size ratio score (srs), predicts the size of the intermediate predictions relative to the actual structure, as detailed in Section III-C. If any intermediate structures contain atoms with an excessive number of bonds (for example, five or more for carbon, four or more for nitrogen, and three or more for oxygen), the module automatically excludes those structures from the intermediate structure set. Then, the process adds a new atom and bond at all possible locations on each of the current top intermediate structures and evaluates the new structures based on their Q-values and srs, as sketched below. This process continues until the top intermediate structure meets a few criteria, which are explained in Sections III-C and III-E.
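To make one frame of this search concrete, the following minimal sketch outlines the expand-score-prune step; `enumerate_additions`, `violates_valence`, and `q_model.predict` are hypothetical helper names, and the valence limits in the comment mirror those stated above:

```python
def generation_step(spectrum, top_structures, q_model, beam_size=5):
    """One frame: expand each retained structure by one atom-and-bond addition."""
    candidates = []
    for structure in top_structures:
        for new_structure in enumerate_additions(structure):  # all atom+bond placements
            if violates_valence(new_structure):  # e.g., >4 bonds on C, >3 on N, >2 on O
                continue
            q_value, srs = q_model.predict(spectrum, new_structure)
            candidates.append((q_value, srs, new_structure))
    # Keep the highest-Q candidates as the next set of top intermediate structures.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:beam_size]
```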
Finally, a set of intermediate structures is returned as the final set of predicted structures. As shown in Fig. 1(b), this series of predictions is then stored in a replay buffer, which is used to train the Q-model. In the following subsections, we further describe the training and evaluation of the deep Q-model.

A. Dataset
We used IR spectra from the NIST Chemistry WebBook [32] as the dataset. From this dataset, we selected a subset of 1700 IR spectra of organic compounds in the gas phase. In this work, we focused on predicting linear organic compounds that contain at least one carbon atom and that do not contain any cyclic structure. These compounds also contain multiple hydrogen atoms, and some contain one or more oxygen and/or nitrogen atoms. These restrictions were made to reduce computational costs. The samples were divided into a training set of 1358 samples and a testing dataset of 342 samples.

B. Deep Q-Model
Our neural network is trained using a hybrid learning process. Its two output nodes, for the Q-value and srs, are trained using deep Q-learning and regression, respectively. The topology of our model is illustrated in Fig. 2. The model takes two inputs: an IR spectrum and the molecular graph of an intermediate structure.
The IR spectrum was converted to a 3×559 matrix by the "Spectrum Conv Net" using the method previously described in [33]. Fig. 3 shows an example of a spectrum of a compound and its converted representation. The converted spectrum is provided as an input to a convolutional subnetwork. The structure is provided to a second linear subnetwork as an ECFP with a radius of 4 and a vector size of 2048. This is input into a single fully connected layer with 512 nodes, 0.5 dropout, and ReLU activation. The output of these two subnetworks is flattened and connected to a dense layer with 285 nodes, which uses the same batch normalization and leaky ReLU activation as the convolutional layers. The neural network is initialized using the default random weights provided in the TensorFlow library. Through the training process, it learns to estimate Q-values based on the encodings of intermediate structures, which are input into the network.
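As an illustration of this topology, a minimal TensorFlow sketch is shown below. The convolutional stack is a single placeholder block (the actual stack follows [33]); the filter count and kernel size are assumptions, while the layer widths, dropout, activations, and the two output heads follow the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_model():
    # Spectrum branch: 3x559 input; one illustrative conv block with batch
    # normalization and leaky ReLU, as described for the convolutional layers.
    spectrum_in = layers.Input(shape=(3, 559))
    x = layers.Permute((2, 1))(spectrum_in)  # -> (559, 3): steps, channels
    x = layers.Conv1D(32, 5, padding="same")(x)  # filters/kernel are assumptions
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Flatten()(x)

    # Structure branch: 2048-bit ECFP through one fully connected layer.
    ecfp_in = layers.Input(shape=(2048,))
    y = layers.Dense(512, activation="relu")(ecfp_in)
    y = layers.Dropout(0.5)(y)

    # Merge and map to the two heads: Q-value and size ratio score (srs).
    z = layers.Concatenate()([x, y])
    z = layers.Dense(285)(z)
    z = layers.BatchNormalization()(z)
    z = layers.LeakyReLU()(z)
    q_value = layers.Dense(1, name="q_value")(z)
    srs = layers.Dense(1, name="srs")(z)
    return tf.keras.Model([spectrum_in, ecfp_in], [q_value, srs])
```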

C. Training
The deep Q-model was trained to estimate a state's current and future rewards. Each state consists of the input IR spectrum and the molecular graph of the current intermediate structure.
While our process stores all intermediate structures as molecular graphs, we convert each molecular graph into an ECFP with radius 4, a binary vector of size 2048, before presenting the structures to the deep Q-model.
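For illustration, this conversion can be done with RDKit (a sketch assuming SMILES input, with the radius parameter set as stated above):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def structure_to_ecfp(smiles, radius=4, n_bits=2048):
    """Convert a molecular graph (given here as SMILES) to a binary ECFP vector."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
```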
We define an episode as a set of modifications to an intermediate structure that ends with a final structure prediction. Each modification to the intermediate structure is defined as a frame. During each episode, the IR spectrum remains constant. During training, the reward for each frame is calculated using a structural similarity metric that compares intermediate structures to the actual structure of the compound. We evaluated the intermediate structures using the Jaccard index of ECFPs and the two additional metrics described in Section III-D. Equation (4) represents the loss function used in this work. We modified the loss function of conventional double deep Q-learning because the action space for adding a new atom and bond is finite but nondiscrete. Therefore, we replaced s, a with s and s + 1, a + 1 with s + 1|s, where s + 1|s is a state reached from the current state s.
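Based on the description in the next paragraph, (4) can be written as

$$L = \mathrm{Huber}\!\left(Y^Q_s - Q[s]\right), \qquad Y^Q_s = \mathrm{reward}_s + \gamma \max_{s+1} Q_t[s+1 \mid s] \tag{4}$$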
In (4), the Q-value of entering state s is the sum of the reward at state s and the discounted future reward from the target model, γ max_{s+1} Q_t[s + 1|s]. The value of reward_s is calculated using the substructure similarity metrics in Section III-D. Ideally, the process should stop when no future modification can improve the value of the current intermediate state. However, (4) would give a lower Q-value to the optimal stopping point due to negative future rewards, causing the process to stop early and generate an undersized structure. The challenge of determining the optimal stopping point is known as the optimal stopping problem in reinforcement learning [20]. To address this issue, we further revised the loss function as per (5) to create a network that gives the maximum Q-value to the actual structure and lower values to all other structures. Equation (5) operates under the domain assumptions that rewards are in the range [0, 1], that intermediate states which are substructures of the actual structure produce an increase in reward values, that intermediate states which are not substructures of the actual structure do not increase reward values, and that the actual structure receives a reward value of 1. The use of future reward clipping causes Y^Q_s to include future rewards if and only if future rewards are estimated to be positive. Equation (5a) simply returns the current reward when future rewards are negative. This prevents the network from giving a low score to the actual structure due to negative future rewards, leading to better convergence with respect to the reward function.
The revised loss function in (5) acts as a piecewise function: the future rewards equal the maximum next state's Q-value minus the current intermediate state's Q-value, max_{s+1} Q_t[s + 1|s] − Q_t[s], if this gain in value between states is positive. Otherwise, future values are discarded for the current intermediate state. The revised loss function ensures that our revised deep Q-learning process still calculates the values of current actions using future rewards but ignores future rewards when they are declining.
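Concretely, a piecewise form of the target consistent with this description is

$$Y^Q_s = \begin{cases} \mathrm{reward}_s + \gamma \left( \max_{s+1} Q_t[s+1 \mid s] - Q_t[s] \right) & \text{if } \max_{s+1} Q_t[s+1 \mid s] - Q_t[s] > 0 \\ \mathrm{reward}_s & \text{otherwise} \quad (5\mathrm{a}) \end{cases} \tag{5}$$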
We further improved our method's performance by modifying the network to output a second value, the size ratio score (srs), which is shown in (6). The srs of an intermediate structure is defined as the number of nonhydrogen atoms in the intermediate structure divided by the number of nonhydrogen atoms in the actual structure. Without incorporation of the srs, the network tended to generate structures that are larger than the actual structures. The srs value measures how close the size of a given intermediate structure is to that of the final structure, making it less likely to generate oversized structures. The srs was used to determine when the generation process should stop adding an atom and bond. The same network produces both the Q-value and srs, but each value is trained using a different loss function, the Q-value being trained by (5) and the srs being learned through regression with Huber loss.
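Written out, the srs defined above is

$$\mathrm{srs}(P, S) = \frac{N_{\mathrm{heavy}}(P)}{N_{\mathrm{heavy}}(S)} \tag{6}$$

where $N_{\mathrm{heavy}}(\cdot)$ counts the nonhydrogen atoms of a structure, $P$ is the intermediate prediction, and $S$ is the actual structure.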
If any intermediate structure is given a srs of 1 or more, then the process does not add a new atom or bond to that intermediate structure. Therefore, the process returns the final prediction for each episode when the intermediate structure with the highest Q-value receives srs > 1 or no new intermediate structure scores a higher Q-value than the current intermediate structure. The training process lasts 9600 frames, where 135 samples from the training dataset have their intermediate structures modified during each frame. All possible intermediate structures are generated in each frame by adding one atom and bond to all existing atoms within the previous intermediate structure. For each of the 135 samples, an experience tuple (s, s + 1, reward_s, srs_s, spectrum) was stored in an experience replay buffer. The replay buffer contained a maximum of 420 000 experience tuples, where each compound in the training set could have at most 2048 experience tuples. Our model was updated every four frames, where 16 384 random experience tuples were selected from the replay buffer; experience tuples were chosen evenly across compounds in the training set. Each unique state s was paired with the highest future reward max_{s+1} Q_t[s + 1|s] from all (s, s + 1) pairs in the replay buffer. These experiences were then used to update the model in random mini-batches of size 1024. Our target model was synced every 32 frames. The training process also uses greedy epsilon learning, where the value of ε decays from 1 to 0.1 over the first 500 frames and a value of 0.1 is maintained for all remaining frames. We used a γ value of 0.99. Our model was optimized using the Adam optimizer with a learning rate of 0.0001 and a clip norm of 0.001.
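As a rough sketch of one such mini-batch update (hypothetical helper names; the replay sampling and target-network syncing described above are omitted, and the batch is assumed to carry the precomputed best future Q-gain from the target network):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=0.001)
huber = tf.keras.losses.Huber()
GAMMA = 0.99

def update_q_model(q_model, batch):
    """One mini-batch step on the Q-value head; `batch` holds spectra,
    state ECFPs, rewards, and max_{s+1} Q_t[s+1|s] - Q_t[s] per state."""
    spectra, states, rewards, future_gain = batch
    # Future rewards are clipped at zero, as in (5)/(5a).
    targets = rewards + GAMMA * tf.maximum(0.0, future_gain)
    with tf.GradientTape() as tape:
        q_pred, _ = q_model([spectra, states], training=True)
        loss = huber(targets, tf.squeeze(q_pred, axis=-1))
    grads = tape.gradient(loss, q_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_model.trainable_variables))
    return loss
```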

D. Structural Similarity Metrics
The Jaccard similarity coefficient (jcsc), also called the Tanimoto coefficient, of ECFPs [31], [34] is commonly used to calculate the structural similarity of two compounds [35]. The ECFP is a type of circular fingerprint in which information about the structure of the compound is hashed to a vector of binary values. Therefore, comparing two ECFPs with a similarity metric allows for estimating the similarity of two chemical structures.
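For reference, jcsc can be computed with RDKit in a few lines (a sketch assuming SMILES inputs and the fingerprint parameters used in this work):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def jcsc(smiles_a, smiles_b, radius=4, n_bits=2048):
    """Jaccard/Tanimoto similarity of two ECFPs."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s),
                                                 radius, nBits=n_bits)
           for s in (smiles_a, smiles_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])
```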
In addition to the Jaccard index of ECFPs, we created two additional substructure similarity scores. The first new score is an atom-bond similarity metric, atom-bond counts (abc), designed to compare the smallest structural features that contain both atoms and bonds. This metric counts and compares every pairing of an atom and a bond using the formula in (7). In this equation, S_ab is the number of occurrences of atom a connected to bond b (i.e., of substructure ab) in a given structure S. The metric compares the compound's actual structure S to the intermediate structure prediction P. The metric abc is equal to 1 minus the total difference in counts between both structures divided by the maximum possible difference. This score ranges between 0 and 1, where a score of 1 indicates that both structures contain equal counts for all ab.
$$\mathrm{abc}(S, P) = 1 - \frac{\sum_{a \in \mathrm{Atoms}} \sum_{b \in \mathrm{Bonds}} \left| S_{ab} - P_{ab} \right|}{\sum_{a \in \mathrm{Atoms}} \sum_{b \in \mathrm{Bonds}} \left( S_{ab} + P_{ab} \right)} \tag{7}$$

where Atoms are the atom types {C, N, O} and Bonds are the bond types {−, =, ≡}.
The second new metric, substructure counts (ssc), was designed to count the differences in substructure occurrences as per (8). We created a set of substructures by enumerating all possible valid noncyclic substructures containing up to five atoms, made of {C, N, O} atoms and {−, =, ≡} bonds. Substructures can contain one atom; two atoms and one bond; three atoms and two bonds; four atoms and three bonds; or five atoms and four bonds. We chose to use substructures that occurred in at least 10 samples and at most 1690 samples in the dataset. This resulted in 531 substructures that our metric utilizes to compare structures. On these 531 substructures, our dataset's 1700 samples have a mean Dice similarity coefficient of 0.3331 with a standard deviation of 0.1703. Equation (8) shows our second structural similarity metric, where S is the actual structure and P is the predicted structure. The value of |S_ss − P_ss| is the absolute difference between the number of occurrences of substructure ss in S and in P.
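Assuming the same normalization as in (7), (8) takes the form

$$\mathrm{ssc}(S, P) = 1 - \frac{\sum_{ss} \left| S_{ss} - P_{ss} \right|}{\sum_{ss} \left( S_{ss} + P_{ss} \right)} \tag{8}$$

where the sum runs over the 531 substructures described above.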

E. Evaluation Process
The evaluation process generates a list of structure predictions for the 342 samples in the testing dataset. In the evaluation phase, our process generates the predicted structures through the same process that was outlined for the training process (Fig. 2) except for the following three differences: 1) up to ten top structures are retained within the current top intermediate structures; 2) in each frame, possible intermediate structures are created from the current top five structures with srs values < 1; and 3) a maximum of five top structures with srs values ≥ 1 are stored. It is also important to note that the current top intermediate structures may include structures that were present in the previous set of top intermediate structures. This occurs when previous structures score a Q-value high enough to place them within the top five of the new structures. As illustrated in Fig. 5, multiple paths can lead to the actual structure. The final output of the process is a list of the five structures that scored the highest Q-values. In the remainder of this article, we call the structure prediction that scored the highest Q-value "top 1" and the five highest scoring predictions "top 5". The performance of each trained model was then evaluated by a series of metrics, including the rate of the top 1 prediction being the same as the actual structure, the rate of the top 5 containing the actual structure, and the percent deviation between the actual and predicted structures in terms of their molecular weights. All performance metrics reported in Tables I-V are averages from five-fold cross-validation. Results from Sections IV-C to V are from fold 1. Additionally, Fig. 6 shows the loss and average final reward curves during the training process for fold 1. In each fold, continued training produced only marginal improvement in results. Sample code can be found at [36].

IV. RESULTS

A. Single Metric-Based Models
In Table I, we present the performance of networks that use one of the structural similarity metrics as a reward metric. Their performance is provided in terms of the percentage of test samples for which these networks were able to correctly predict the compound's structure. The top 1 match score is the rate at which the network's highest ranked prediction matches the actual structure, and the top 5 match score is the rate of finding the actual structure within the five highest ranked predictions. From the results in Table I, we find that the abc model produced the lowest performance at 0.8772% and 5.263% for its top 1 and top 5 match scores, respectively. The jcsc model performed better with 2.047% and 8.187%. The ssc model showed the best performance with 4.678% for its top 1 match score and 15.79% for its top 5 match score.
In Table II, we show the percent deviation of the predicted structures from the actual structure in terms of their molecular weights. The jcsc model showed the largest deviation with 22.64% for its top 1 predictions. The abc and ssc models both showed lower deviations with 10.93% and 12.38%, respectively. For all three models, the mean molecular weights of their top 5 predictions deviated from the actual structure to a similar degree as the top 1 predictions. These results indicate that, although the models may not be capable of predicting exactly the actual structures, the abc and ssc models can generate structures that are reasonably close to the actual molecules in terms of their sizes.
Finally, in Table III, we show the average score of each model's performance in terms of the structural similarity between the predicted and actual structures. Each model showed the best performance among the three when evaluated by the respective reward metric used to train the network. For example, the abc model showed the lowest performance when evaluated in terms of the ssc and jcsc metrics. This is understandable since the abc metric does not contain information on the connectivity of atoms beyond their adjacent bonds, whereas the other metrics account for this aspect. The jcsc model was slightly superior to the abc model on the ssc metric, but it was inferior to the ssc model on the abc metric. Overall, the ssc model showed reasonably good performance even when evaluated against the other two metrics.

B. Combined Metric-Based Models
Each of the three individual metrics tested showed both strengths and weaknesses, such that none of them unambiguously outperformed the others. Each reward metric focuses on different aspects of structures. Therefore, we created new metrics by taking the mean of two or more of these metrics. Table IV shows the new models trained on combined metrics, named abc + ssc, abc + jcsc, ssc + jcsc, and abc + ssc + jcsc based on which sets of metrics they utilize. The abc + jcsc model was the worst performing of these models with 2.924% top 1 match and 7.602% top 5 match. The abc + ssc + jcsc model was next with 2.924% top 1 match and 10.82% top 5 match. The ssc + jcsc model was second best with 3.801% top 1 match and 12.28% top 5 match. The abc + ssc model performed the best with 6.140% top 1 match and 14.91% top 5 match, making it the best in terms of top 1 match and comparable to the ssc model in top 5 match performance. Table V shows the mean percent deviation of the predicted structure's molecular weight from the actual structure's for the combined metrics. The abc + ssc model showed the best performance in generating structures that are close to the actual structures in size. It is noteworthy that combining the ssc and abc metrics improved the model's performance. On the other hand, incorporating the jcsc metric into the reward function (abc + jcsc, ssc + jcsc, and abc + ssc + jcsc) resulted in larger deviations than for the models trained by either the ssc or abc metric alone.

C. Analyzing Performance on Different Structures
Since the abc + ssc model showed the highest performance of all the tested models, we further investigated the predicted structures generated by this model. Fig. 7 shows the structures of all compounds correctly predicted by our network as the top 1 match. As can be seen, there is diversity in the kinds of structures that can be predicted by our model. These structures include linear and branched hydrocarbons and alcohols (-OH). There are also structures of various sizes and structural complexities. While nitrogen-containing compounds make up approximately 20% of the dataset (Table VI), the model is comparatively less capable of accurately predicting the structures of nitrogen-containing compounds. A plausible explanation is that primary, secondary, and tertiary amines exhibit less pronounced spectral differences compared to oxygen-containing functional groups.
Fig. 8 shows the actual structures and top 5 predicted structures from our model for cases where the actual structure was present within the top 5 results but not as the top 1 result. Each row in the figure is a different structure and its predictions. The structure predictions, found on the right side of the vertical bar, each show the Q-value given to the prediction by the model and the actual similarity score between the prediction and the actual structure using the abc + ssc metric. The model's top prediction is the first prediction after the vertical bar. The predicted structure matching the actual structure of the compound, which has an abc + ssc score of 1.0, is found within the top 5 in these cases.
The top 5 predictions generated by our model tend to be similar to the actual structure in terms of the number of nonhydrogen atoms in the structure. This suggests that our model can predict the size of molecules based on the IR spectrum. In many cases, the structures of top 1 predictions were very similar to the actual structure and differed only in the location of functional groups and/or the inclusion of an additional functional group. These results show that the network can predict the functional groups present within the compounds to a reasonable degree. Overall, the model generally understands which functional groups exist within the structure. Furthermore, it is capable of generating a list of predictions that contain the correct or, at least, closely related structures.

Fig. 9 shows the distribution of abc + ssc similarity scores for the top 1 predictions by our model, along with examples of predicted structures from each score range. Fig. 9(b), (c), and (d) show examples of predicted structures that received abc + ssc similarity scores in the 0.9-1, 0.8-0.9, and 0.7-0.8 ranges, respectively. Overall, the predicted structures in these ranges are close to the actual structures in terms of the size of the molecules and the functional groups present within them. In most cases, the structures in the 0.9-1 range contained the functional groups present in the actual structures and the correct or a close composition of atoms. Typically, the errors present in these structures are limited to the predicted locations of branching points and/or functional groups. The structures in the 0.8-0.9 range often miss and/or include a second functional group that is present elsewhere within the structures. In other cases, the presence of all functional groups is correctly predicted, but the lengths of carbon chains are not. The structures in the 0.7-0.8 range tend to show a few minor errors in terms of the size of the molecules, the location of functional groups, and/or the presence of one or two functional groups. However, the predicted structures share reasonable degrees of similarity with the actual structures. Fig. 9(e) and (f) show predictions in the range of 0.7 and below. These structures tend to deviate in their compositions of atoms and functional groups, whereas the sizes of the molecules are often reasonably close to the actual structures. Overall, as shown in Fig. 9(a), the majority of top 1 predictions (77.49%) have an abc + ssc score of 0.7 or above. This demonstrates that our model can produce structures that are reasonably close to the actual structures based on IR spectra even when the exactly correct structures are not predicted.
The abc + ssc model does tend to underperform on nitrogen-containing structures. This could be due to the fact that the absorption cross-sections of C-N, C=N, and N-H bonds are relatively small in comparison to those of C=O, C-O, and O-H bonds. Furthermore, C-N and N-H peaks typically appear in the regions where C-O and O-H peaks appear. This overlap in their spectral characteristics makes it difficult for the model to learn the spectra-structure relationships.
To investigate whether the model's performance depends on the size of the actual molecules, we examined the model's performance in generating predictions for actual structures in different size ranges. Fig. 10 shows the results grouped by the number of nonhydrogen atoms in the actual structures. As shown in Fig. 10, the means of the Jaccard index of ECFPs between the predicted and actual structures were similar across all molecular size ranges, suggesting that our model performs at a comparable level regardless of the size of the molecules. There is a large deviation in the scores for relatively small molecules consisting of 2-5 nonhydrogen atoms. This can be attributed to the fact that smaller molecules have a greater chance of being accurately predicted, while even small variations in predicting these small molecules can cause a significant drop in this metric.
As described in the previous part, the size of the predicted structures is often very close to the size of the actual structures, even in cases where the structural characteristics of the compounds were not accurately predicted. We therefore investigated the correlation between the molecular weights of the predicted and actual structures. Fig. 11 shows the molecular weight of the predicted structures versus the molecular weight of the actual structures. The result shows that the molecular weight of the predicted structures strongly correlates with that of the actual structures, further supporting that our model can predict the molecular weight of compounds from their IR spectra. We also observed that the linear regression line of the data and the identity function are close to each other, indicating that our model shows minimal bias in terms of under- or overestimating the molecular weight of compounds. Meanwhile, the data points that deviate from the regression line to a relatively large degree tend to be underestimated. Some of these tend to be compounds that have one or more symmetries within their structures. This could be attributed to the fact that the IR absorption peaks of the groups on the two symmetric sides often overlap within the spectrum. As described earlier, the molecular weights of the top 1 predictions and the actual structures deviated by 10.46% on average. This was somewhat surprising to us since IR spectra are not a type of experimental data used in estimating the size of organic compounds. In general, larger molecules tend to contain greater variations of substructures within, resulting in their tendency to give a greater number of peaks in IR spectra. Hence, our result implies that the deep Q-model can learn the correlation between the complexity of IR spectra and the size of molecules. It is important to note that the compounds in our dataset contain 25 or fewer nonhydrogen atoms. It is reasonable to assume that learning this correlation above a certain threshold could be more challenging. However, our findings showcase the potential for this and similar machine learning techniques to expand our ability to perform structural analysis of compounds by IR spectroscopy, which a relatively inexpensive instrument can carry out.

V. DISCUSSION
In this work, we introduced a variation on double deep Q-learning that uses one of several different structural similarity metrics as a reward function. We showed that our method can be used to generate a set of structural predictions from a compound's IR spectrum. We then showed that our model can produce reasonably similar structure predictions depending on the structural similarity score the model was trained on.
Our results show that the model trained using jcsc as a reward metric underperformed compared to the other models, even though the Jaccard index of ECFPs is frequently used as a measure of chemical structure similarity. We speculate that this is due to a pair of behaviors. The first is that this metric does not always increase monotonically even when a correct pair of one atom and one bond is added. Second, the metric occasionally gives a higher score to an incorrect alternative over the correct intermediate structure. Fig. 12 shows the difference between a correct intermediate structure's value and the value of the highest valued incorrect alternative structure that can be created from the previous intermediate structure. The red circles are the reward scores for a series of intermediate structures that can be considered substructures of the actual structure. The blue squares are the highest value an incorrect intermediate structure would reach. An ideal metric should consistently give higher values to correct additions over incorrect alternatives. As can be seen, adding a correct pair of one atom and one bond to an intermediate state can decrease the jcsc score. Additionally, we observed that the jcsc metric sometimes gives a larger reward to incorrect additions to the intermediate structure than to correct additions. These behaviors could explain the poor performance of the model trained using the jcsc metric as a reward function for deep Q-learning.
The two metrics we introduced in this work were designed to mitigate the abovementioned problem. The abc metric has the property that it is monotonic for correct additions, meaning that the correct addition of a pair of one atom and one bond always leads to a higher value than that of the previous intermediate structure. This metric does exhibit a shortcoming in that an incorrect addition of an atom and bond can give the same score as the addition of a correct pair. It also gives the highest possible score, which is 1, to both the actual structure and multiple incorrect structures that are similar to the actual structure. Among the three individual metrics (abc, ssc, and jcsc), the ssc metric was overall the best performer. We attribute this to the behavior of the ssc metric, which is monotonic and also far less likely to give the highest score of 1 to incorrect final structures. This behavior makes the reward function less complex for deep Q-models to learn, although it is possible for this metric to give a higher score to incorrect alternatives than to the correct intermediate structure. However, the deep Q-learning model is capable of overcoming this shortcoming of the ssc metric, at least to some extent, by considering both immediate and future rewards. Therefore, our results suggest that the ssc metric, among the three individual structural similarity metrics presented in this article, exhibits the most suitable properties as a reward function.

In terms of the performance exhibited by the combined metrics, the abc + ssc model showed the best performance. The abc + ssc metric can outperform abc and ssc because the individual metrics have complementary advantages. The abc and ssc metrics count substructures of different sizes within two structures and compare these counts between the structures. This makes these metrics monotonically increasing functions over a series of correct additions. Meanwhile, each metric emphasizes different aspects of the chemical structures. The abc metric provides direct information on the number of various atoms and bonds in the structure. The primary drawback of the abc metric is its inability to distinguish certain structural features, such as the locations of branching points. On the other hand, the ssc metric incorporates more information on specific substructures. Overall, combining these metrics allowed for creating a smoother, more "learnable" reward function that outperformed the networks trained on the individual metrics.
Finally, we would like to note that all models presented in this work were trained using the gas-phase spectra of noncyclic compounds whose structures are comprised of only hydrogen, carbon, nitrogen, and oxygen. The reward metrics presented in this article can be applied to compounds containing cyclic structures, and the graph generation approach can be easily adapted to handle cyclic structures. To predict the structures of compounds in the liquid, solution, or solid phase, a new model would need to be trained with spectra measured in each respective phase. However, this can be accomplished by expanding the training data. We therefore believe that, with the incorporation of sufficient data, our method can be extended to predict a broader range of compounds.

VI. CONCLUSION
In this article, we presented an approach based on reinforcement learning for predicting the structures of organic compounds based on their IR spectra. To pursue this goal, we utilized double deep Q-learning to train a neural network model that takes two inputs, the IR spectrum and the current intermediate structure. The Q-model outputs a Q-value and a size ratio score, which are used to rank possible intermediate structures and to decide when to stop adding a new atom and bond to the intermediate structures. The Q-model was trained and tested on independent datasets to gauge the performance of our model on the IR spectra of unknown compounds. Our results showed that the abc + ssc metric we developed in this work functions as the best reward function among all tested structural similarity metrics, including the Jaccard index of ECFPs. Overall, using a compound's IR spectrum, our model was able to generate structure predictions that were shown to be structurally similar, by molecular weight and functional groups, to the compound's actual structure. Therefore, we believe that our work demonstrates the utility of reinforcement learning in IR spectra-based chemical structure prediction. We envision that further developments in this direction will lead to powerful analytical tools for broad application areas including, but not limited to, research and development of chemical products, forensic investigations, and environmental analysis.

Fig. 1. (a) Process of generating a set of predicted structures (molecular graphs) from an IR spectrum over multiple frames. (b) Deep Q-learning process where, for each sample, frames are stored in a replay buffer and used to periodically update the Q-model and Q-target networks.

Fig. 3. Sample from the training set (2-methylpentan-2-ol), its spectrum, and the representation of the spectrum that is provided to the network.

Fig. 5 illustrates the process of generating new intermediate predictions from the current top five structures. The figure shows steps 5-6 and the final top five structure predictions. The green lines show correct additions to intermediate structures leading to the actual structure. The black lines represent incorrect additions, which cannot lead to the actual structure. The blue lines represent cases where an intermediate structure is retained in the next step.

Fig. 5. Schematic illustration of the process to generate predicted structures.

Fig. 6. Average final reward (red) shows the average value given to the final prediction of each training sample during its last episode. Loss (blue) is calculated according to (5).

Fig. 8. Examples of top 5 matches by the abc + ssc model and predicted structures.

Fig. 9. (a) Histogram of top 1 predictions by abc + ssc score. (b) 3 samples with incorrect top 1 predictions that have abc + ssc scores in the 0.9-1 range. (c) 5 samples with top 1 predictions that have abc + ssc scores in the 0.8-0.9 range. (d) 5 samples with top 1 predictions that have abc + ssc scores in the 0.7-0.8 range. (e) 5 samples with top 1 predictions that have abc + ssc scores in the 0.5-0.7 range. (f) 5 samples with top 1 predictions that have abc + ssc scores in the 0-0.5 range.

Fig. 11. Predicted molecular weight versus actual molecular weight. The orange line is the linear regression line and the black line is the identity function.

Fig. 12. Reward values for correct and incorrect modifications to trimethyl citrate.

TABLE I RATE OF EXACT MATCH

TABLE IV RATE OF EXACT MATCH FOR COMBINED METRICS

TABLE V AVERAGE PERCENT DEVIATION FROM THE ACTUAL STRUCTURE'S MOLECULAR WEIGHT FOR COMBINED METRICS

TABLE VI SAMPLE TYPES BY DATASET SPLIT