Completion of partial chemical equations

Inferring missing molecules in chemical equations is an important task in chemistry and drug discovery. In particular, completing chemical equations with the necessary reagents improves existing datasets by detecting missing compounds, making them compatible with deep learning models that require complete information about reactants, products, and reagents for increased performance. Here, we present a deep learning model that predicts missing molecules using a multi-task approach, which can be viewed as a generalization of forward reaction prediction and retrosynthesis models, since both can be expressed in terms of incomplete chemical equations. We show that a single trained model, based on the transformer architecture and acting on reaction SMILES strings, can address the prediction of products (forward), precursors (retro), or any other molecule in arbitrary positions such as solvents, catalysts, or reagents (completion). Our aim is to assess whether a unified model trained simultaneously on different tasks can effectively leverage diverse knowledge from various prediction tasks within the chemical domain, compared to models trained individually on each application. The multi-task models demonstrate top-1 accuracies of 72.4%, 16.1%, and 30.5% for the forward, retro, and completion tasks, respectively. For the same model we computed a round-trip accuracy of 83.4%. The completion task, in particular, exhibits improvements due to the multi-task approach.


Introduction
The integration of Artificial Intelligence (AI) in organic chemistry is gaining significant prominence, primarily due to its potential to improve the pace and efficiency of drug discovery, materials design, and chemical processes [1,2]. AI techniques can be harnessed to generate novel molecules with desired properties [3][4][5], such as improved solubility [6,7] or reduced toxicity [8], leveraging existing molecular data. This capability has the profound potential to expedite materials design and discovery [9].
In recent years, the utilization of machine learning (ML) in organic chemistry has experienced substantial growth, with increasing interest in employing AI to address diverse challenges within the field. The advancements in knowledge extraction from documents have significantly expanded the availability of extensive molecular datasets [58,59], thereby enabling the training of a wide range of ML models for various applications.
Datasets play a central role in the ongoing data-driven transformation of organic chemistry. Although chemical reaction records may contain errors, these can be identified and rectified through machine learning strategies [57]. Nonetheless, such records often lack completeness, reflecting the common practice of not documenting the full set of chemicals required for a chemical experiment. Limited research has addressed the completion of chemical reactions [25,45,57,75], employing the Molecular Transformer [20] to suggest missing molecules or detect erroneous records [57].
In this work, we present a comprehensive evaluation of the reaction completion task. This task involves providing missing reactants, reagents, or products in a partially completed chemical reaction. Additionally, our primary focus is to investigate whether a unified model trained simultaneously on different sub-tasks can effectively leverage diverse knowledge from various prediction tasks within the chemical domain, compared to models trained individually on each application. Concurrently, we trained Molecular Transformer models for three sub-tasks: the prediction of products (forward sub-task), precursors (retro sub-task), or any other molecule in arbitrary positions such as solvents, catalysts, or reagents (completion sub-task); see figure 1. The usage of the term 'sub-task' throughout the manuscript aligns with the common usage in the field of machine learning, where all three author-defined 'sub-tasks' correspond to the same modality of a text-to-text translation task. Sometimes, for brevity, we will refer to these simply as tasks instead of sub-tasks to keep the text simpler. We provide a comparative analysis of the performance of models trained individually on each task versus a multi-task model. Similar to other works using Molecular Transformers, molecules and chemical reactions are represented with SMILES strings [76][77][78]. Using the data derived from the US patent office (USPTO) by Lowe [58], the overall better-performing multi-task model demonstrates top-1 accuracies of 72.4%, 16.1%, and 30.5% for the forward, retro, and completion tasks, respectively, with the completion task exhibiting notable improvements due to the multi-task approach.

Model, training and inference
We used a modified version of the Molecular Transformer [20]. Both the input and output of the model are tokenized versions of reaction SMILES strings [76,79]. We used the rule-based tokenizer proposed by Schwaller and coworkers [19]. This tokenizer uses a regex and treats text enclosed in square brackets as a single token. A complete reaction SMILES can be defined as a textual string containing the precursors, delimited by '.', and a product, with the precursors and the product separated by a '≫'. We classified reactants and reagents as precursors, combining all of them together without making any distinction. In our setup, the model takes as input a list of incomplete chemical equations and the output contains the missing compounds: precursors and/or products, depending on the specific task. When a complete input is given, i.e. a reaction SMILES containing all the precursors and the product, the corresponding output is an empty reaction SMILES, which we represent using a '≫' string. Each input for the multi-task models contains an additional prompt sub-task token which identifies the specific sub-task. The tags used are the following: '[forward]', '[retro]', '[rsc]', and '[unspecified]'. The prompt tokens '[forward]' and '[retro]' refer to the forward and single-step retro-synthesis prediction sub-tasks, respectively; the token '[rsc]' indicates the sub-task of predicting missing reagents, catalysts, and solvents in the precursors, which we compactly call the RSC task. Finally, the '[unspecified]' tag refers to any of the previous sub-tasks (forward, retrosynthesis, and RSC), without any additional specific information.
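As a minimal sketch of such a rule-based tokenizer: the regex below follows the published pattern of Schwaller and coworkers, keeping bracketed atoms and two-letter elements as single tokens, but treat the exact pattern as an assumption here. Note that in plain SMILES text the precursor-product separator is written '>>' (rendered '≫' in this manuscript) and is split into two '>' tokens.

```python
import re

# Regex-based SMILES tokenizer in the spirit of Schwaller et al. [19]:
# bracketed atoms ([Na+], [nH], ...) and multi-character elements (Cl, Br)
# are kept as single tokens; everything else is a one-character token.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> str:
    """Split a (reaction) SMILES string into space-separated tokens."""
    tokens = SMILES_REGEX.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert smiles == "".join(tokens), f"untokenizable characters in {smiles!r}"
    return " ".join(tokens)

print(tokenize("CCO.[Na+]>>CCO"))  # C C O . [Na+] > > C C O
```

For the multi-task inputs, a prompt token such as '[forward]' would simply be prepended to the tokenized string.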
The Transformer model is implemented using the OpenNMT-py library [80,81] and trained on a single GPU. We used 4 encoder and 4 decoder layers, setting rnn_size to 384, word_vec_size to 384, batch size to 6144, max_generator_batches to 32, accum_count to 4, learning_rate to 2, and label_smoothing to 0. We used an Adam optimizer with β1 and β2 equal to 0.9 and 0.998, respectively. We also trained a larger model with the same parameters reported above, except that rnn_size and word_vec_size were both set to 512.
The model quality is assessed through a comparison based on accuracy, defined as the ratio of correct predictions to total predictions. A prediction is deemed correct only when there is an exact match between the target and the top-1 prediction. Therefore, each inference is correct only if all the predicted molecules match those in the ground truth. We used this accuracy, computed on the validation set, to select the best performing model and to stop training. This metric differs from the accuracy computed on-the-fly during training by OpenNMT-py on the tokenized data, which considers the number of correctly predicted tokens divided by the total number of tokens. This latter accuracy is always higher, because an error on a single token does not invalidate the entire prediction.
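The difference between the two metrics can be sketched as follows (a minimal illustration with toy token strings; the actual OpenNMT-py computation may differ in details such as padding and batching):

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that match the target string exactly."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

def token_accuracy(predictions, targets):
    """Token-level metric: correctly predicted tokens / total target tokens."""
    correct = total = 0
    for p, t in zip(predictions, targets):
        p_tok, t_tok = p.split(), t.split()
        correct += sum(a == b for a, b in zip(p_tok, t_tok))
        total += len(t_tok)
    return correct / total

preds   = ["C C O >>", "C C >>"]
targets = ["C C O >>", "C C N >>"]
# One wrong token invalidates the whole second prediction for exact match...
print(exact_match_accuracy(preds, targets))  # 0.5
# ...but costs only part of the score for the token-level metric.
print(token_accuracy(preds, targets))        # 0.75
```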

Data preparation
For this work we trained and tested the models using reaction data derived from the US patent office by Lowe [58]. First, we preprocessed the data, canonicalizing each reaction SMILES and filtering out invalid SMILES. The filtered dataset contains 1 037 423 reaction SMILES, one per line. Second, we used the atom mapping described in [82] to map the atoms in the precursors to those in the products. Atoms that are not mapped correspond to reagents, catalysts, and solvents. These categories were used to create the single-task and multi-task datasets, as explained below, starting from the canonicalized reaction SMILES. We used the same train, validation, and test splits for all the tasks to avoid data contamination. We used 90% of the data for training and divided the remaining 10% into two equal parts for validation and test (5% each).
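The shared 90/5/5 split can be sketched as below; the shuffling seed and exact rounding are assumptions, only the proportions and the reuse of one split across all sub-tasks come from the text:

```python
import random

def split_reactions(reactions, seed=42):
    """Shuffle once and split 90/5/5 into train/validation/test.
    The same split is reused for all sub-tasks to avoid data contamination."""
    rng = random.Random(seed)  # fixed seed: a single shared shuffle
    shuffled = reactions[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.9 * n)
    n_valid = (n - n_train) // 2
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

rxns = [f"R{i}>>P{i}" for i in range(1000)]
train, valid, test = split_reactions(rxns)
print(len(train), len(valid), len(test))  # 900 50 50
```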
From the filtered reaction SMILES, we created six different datasets as described below.

Forward dataset
For each successfully preprocessed reaction SMILES, the complete set of precursors serves as the source, while the associated product is designated as the target. This dataset serves to train single sub-task models for forward predictions, in which the model is trained to predict the product when presented with a list of precursors, with no specific role assigned.

Retro dataset
The retro dataset is similar to the forward dataset, with source and target reversed. For this dataset, the single sub-task model is trained to predict the precursors given a desired target product.
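The construction of the forward and retro source/target pairs amounts to splitting a reaction SMILES at the separator; a minimal sketch (using the plain-text '>>' separator, rendered '≫' in this manuscript):

```python
def forward_pair(reaction_smiles: str):
    """Forward dataset entry: precursors as source, product as target."""
    precursors, product = reaction_smiles.split(">>")
    return precursors, product

def retro_pair(reaction_smiles: str):
    """Retro dataset entry: source and target of the forward pair, reversed."""
    src, tgt = forward_pair(reaction_smiles)
    return tgt, src

rxn = "CC(=O)Cl.OCC>>CC(=O)OCC"  # esterification, for illustration
print(forward_pair(rxn))  # ('CC(=O)Cl.OCC', 'CC(=O)OCC')
print(retro_pair(rxn))    # ('CC(=O)OCC', 'CC(=O)Cl.OCC')
```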

RSC dataset
This dataset is built using the knowledge of the role of each precursor. The source is a reaction SMILES including the reactant(s) and the corresponding product separated by '≫'. The target is the list of reagents, catalysts, and solvents removed from the original chemical reaction record, followed by a '≫'. The model trained on this dataset predicts the reagents, catalysts, and solvents.
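Assuming the precursor roles are already known from the atom mapping, an RSC source/target pair can be sketched as (the molecule names below are illustrative, not taken from the dataset):

```python
def rsc_pair(reactants, reagents, product):
    """RSC dataset entry: reactants>>product as source; the removed
    reagents/catalysts/solvents (followed by '>>') as target."""
    source = ".".join(reactants) + ">>" + product
    target = ".".join(reagents) + ">>"
    return source, target

# Hypothetical example: pyridine and dichloromethane as reagent/solvent.
src, tgt = rsc_pair(["CC(=O)Cl", "OCC"], ["c1ccncc1", "ClCCl"], "CC(=O)OCC")
print(src)  # CC(=O)Cl.OCC>>CC(=O)OCC
print(tgt)  # c1ccncc1.ClCCl>>
```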
The following three datasets are used to train and test multi-task models. We consider both cases where the sub-task is given implicitly or explicitly. The sub-tasks are the forward, retro, and RSC predictions described before.

All-tagged dataset
This dataset includes 13 data entries for each of the initial reaction SMILES, resulting in a total number of lines that is 13 times the original count; see table 1. Figure 2 illustrates the chemical structures of the source and target SMILES of table 1.
The source data incorporate an additional prompt-task token, '[forward]', '[retro]', '[rsc]', or '[unspecified]', serving as a tag for the input sub-task, which is prepended to the incomplete reaction SMILES. The output targets are unchanged, since the prompt-task token should not be part of the text string predicted by the model. For simplicity, in the remainder of this section we refer to the prompt-task token simply as 'token'. The first source data entry contains the forward dataset source with the '[forward]' token prepended. Similarly, the second and third source data entries contain the '[retro]' and '[rsc]' tokens, each followed by the corresponding source SMILES of the retro and RSC datasets, respectively. For the ten remaining source data points in table 1 we used the '[unspecified]' tag. In particular, the fourth input data entry is a complete reaction SMILES, containing all the precursors and the product separated by the string '≫', as in the original version. Because the input corresponds to a complete reaction, the model is trained to predict an empty string, containing only the separator string '≫' without any reaction SMILES. The subsequent three source data entries are similar to the first three, except that instead of the '[forward]', '[retro]', and '[rsc]' tokens, the records use the generic '[unspecified]' token; the output strings remain the same. The last six data entries of table 1 are always tagged with '[unspecified]'. Their sources contain reaction SMILES in which one or more precursors and/or the product are randomly removed. For each of the six sources, the corresponding target consists of the reaction SMILES containing the missing molecules, followed by the separator string '≫'. Two additional datasets were derived from the all-tagged dataset.
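The 3 + 1 + 3 + 6 structure of the 13 entries can be sketched as below. Only that structure and the tokens are taken from the text; the exact string formats, the placement of the '>>' separator in the mixed entries, and the random-removal choices are simplifying assumptions for illustration:

```python
import random

def all_tagged_entries(reactants, reagents, product, seed=0):
    """Build the 13 source/target pairs per reaction sketched in table 1."""
    rng = random.Random(seed)
    precursors = reactants + reagents
    pre = ".".join(precursors)
    entries = [
        ("[forward] " + pre, product),                     # 1: explicit forward
        ("[retro] " + product, pre),                       # 2: explicit retro
        ("[rsc] " + ".".join(reactants) + ">>" + product,  # 3: explicit RSC
         ".".join(reagents) + ">>"),
        ("[unspecified] " + pre + ">>" + product, ">>"),   # 4: complete reaction
    ]
    # Entries 5-7: same sources as 1-3, but with the generic token.
    for src, tgt in entries[:3]:
        entries.append(("[unspecified]" + src.split("]", 1)[1], tgt))
    # Entries 8-13: one or more molecules (precursors and/or product)
    # removed at random; the removed ones become the target.
    molecules = precursors + [product]
    for _ in range(6):
        removed = rng.sample(molecules, rng.randint(1, 2))
        kept = [m for m in molecules if m not in removed]
        entries.append(("[unspecified] " + ".".join(kept),
                        ".".join(removed) + ">>"))
    return entries

rows = all_tagged_entries(["CC(=O)Cl", "OCC"], ["c1ccncc1"], "CC(=O)OCC")
print(len(rows))  # 13
```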

Unspecified dataset
This dataset is a subset of the all-tagged dataset. It contains exclusively the data tagged with the '[unspecified]' token. For every reaction SMILES, it includes ten out of the 13 datapoints. The '[unspecified]' token was removed from the source input, since it is common to all lines. This dataset does not contain information on the sub-task to be carried out.

Three-tagged dataset
This dataset is a subset of the all-tagged dataset. It contains only the data tagged with the '[forward]', '[retro]', and '[rsc]' tokens. For every reaction SMILES, it includes three out of the 13 datapoints. This dataset contains the tokens carrying the information on which one of the three sub-tasks has to be carried out.

Results and discussion
We train and test different translation models on the single-task and multi-task datasets. We present the accuracy of the multi-task models and compare it to that of the single-task models, using this assessment to measure when multi-task models benefit from being trained jointly on different sub-tasks. To simplify the notation and reading, throughout the remainder of this section we will refer to the forward, retrosynthesis, and RSC sub-tasks simply as tasks, omitting the 'sub-' prefix.

Single-task models
We trained three single-task models. The number of training steps of each model was chosen to maximize the exact-match accuracy on the validation set, as described in the Method section. The numbers of training steps for the forward, retro, and RSC models are reported in table 2, which shows the three tasks trained with the smaller and larger models. Compared to the forward and RSC models, the retro model required a larger number of training steps to reach its maximum accuracy. For the forward model we obtain accuracies equal to 73.6% and 73.3% with the small and large model, respectively, showing that utilizing a larger model does not enhance performance on this task.
The single-step retrosynthesis and RSC prediction tasks perform marginally better with the larger model, which gives in both cases about a 7% improvement compared to the small model, as reported in table 2. The single-step retrosynthesis predictions show the lowest accuracy, in the range of 16.8%-18.0%.
In the retrosynthesis task, there can be slight variations between the model's prediction and the recorded precursors while the prediction remains chemically valid. For instance, if the model predicts a different solvent or catalyst, the exact-match metric might label the prediction as incorrect, even though it could still yield the desired target product using slightly different but chemically similar compounds. To address this, we also computed the round-trip accuracy for the single-step retrosynthesis task [27]. This metric is computed as follows: we use the precursors predicted by the retro model as input for a forward model; if the outcome of the forward prediction is the original product, we consider the round-trip prediction to be successful. For this retro model, we computed 83.9% and 83.3% round-trip accuracy for the small and the large model, respectively. The forward predictions for the round-trip accuracy were computed consistently using forward models of the same size as the retro models; see table 2.
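The round-trip computation described above can be sketched as follows; the two dictionary-backed stand-ins for trained retro and forward models are purely hypothetical:

```python
def round_trip_accuracy(products, retro_model, forward_model):
    """Round-trip metric [27]: a retro prediction counts as correct if the
    forward model maps the predicted precursors back to the original product."""
    hits = 0
    for product in products:
        precursors = retro_model(product)       # retro step
        hits += forward_model(precursors) == product  # forward check
    return hits / len(products)

# Toy stand-ins for trained models (hypothetical, for illustration only).
retro = {"CC(=O)OCC": "CC(=O)Cl.OCC"}.get
forward = {"CC(=O)Cl.OCC": "CC(=O)OCC",
           "CC(=O)O.OCC": "CC(=O)OCC"}.get

print(round_trip_accuracy(["CC(=O)OCC"], retro, forward))  # 1.0
```

Note that the retro prediction need not match the recorded precursors exactly: any precursor set the forward model maps back to the product is counted as a success.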
This approach aligns with a recent study by Jaume-Santero and coworkers [83] on transformer performance for chemical reactions, where round-trip accuracy has emerged as a valuable metric for evaluating the practical effectiveness of predictive models. Their study [83] shows the importance of round-trip accuracy as a superior evaluation metric compared to simpler metrics such as top-k accuracy. In their analysis, they asked experts to evaluate the predictions and found that most of the reactions that pass the round-trip test are likely to occur. Moreover, our accuracies for the 'retro' and 'RSC' sub-tasks align with the findings reported in table 1 of [83]: reagent prediction using SMILES yielded accuracies ranging from approximately 13.5% to 21.1% under different setups. We also investigated the types of errors leading to the low accuracy of single-step retro predictions, with a specific focus on the influence of solvents on the prediction errors. We used the list of solvents from [84] and grouped them into three categories: non-polar and aprotic; polar and protic; and polar and aprotic solvents. We used the single-step retro-synthesis single-task large model (size 512 in table 2), which gives the highest accuracy of 18.0%, as a prototypical example to investigate more deeply what produces the errors. We found that in 56.2% of cases there is at least one solvent in the correct precursors that is missing in the predictions and, vice versa, in 45.5% of predictions we find at least one solvent missing in the ground truth. In 14.5% of cases the lists of solvents in the predictions and the ground truth match exactly. In 67.5% of predictions there are differences in solvents. For those cases, we compared the types of solvents using the three groups listed above. We found that in 50.3% of these cases all the solvent types present in the correct precursors are also present in the predictions. We repeated the same analysis for the RSC case, using the single-task large model (size 512 in table 2), and found similar results.
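The solvent-type comparison can be sketched as follows. The mini-classification below is a hypothetical stand-in for the full solvent list of [84], with only a handful of illustrative entries:

```python
# Hypothetical mini-classification into the three polarity groups used in
# the analysis; the full list would come from [84].
SOLVENT_CLASS = {
    "CCCCCC": "nonpolar-aprotic",     # hexane
    "Cc1ccccc1": "nonpolar-aprotic",  # toluene
    "CO": "polar-protic",             # methanol
    "O": "polar-protic",              # water
    "CC(C)=O": "polar-aprotic",       # acetone
    "CS(C)=O": "polar-aprotic",       # DMSO
}

def solvent_classes(precursors):
    """Set of solvent polarity classes appearing in a precursor list."""
    return {SOLVENT_CLASS[m] for m in precursors if m in SOLVENT_CLASS}

def same_solvent_types(predicted, ground_truth):
    """True when every solvent class in the ground truth also occurs in the
    prediction, even if the individual solvents differ."""
    return solvent_classes(ground_truth) <= solvent_classes(predicted)

# Methanol predicted instead of water: different solvent, same class.
print(same_solvent_types(["CC(=O)Cl", "CO"], ["CC(=O)Cl", "O"]))  # True
```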
To provide further evidence of how accuracy depends on the quality and diversity of the dataset, we retrained two models for the 'forward' and 'retro' sub-tasks using a smaller dataset commonly employed in benchmarking: USPTO-50k [62]. We used the same architecture as the model in table 2 (size 384). This dataset comprises 50 000 reactions, organized into 50 classes, each containing 1000 reactions. Compared to the entire USPTO dataset, which contains two orders of magnitude more reactions, the USPTO-50k dataset exhibits less variability. We randomly shuffled the 50 000 reactions and split them into 90% for the training set and 5% each for the validation and test sets. For the 'forward' and 'retro' tasks, we obtained accuracies of 82.6% (82.6%) and 27.9% (27.8%), respectively, on the test (validation) set.

Multi-task models
Table 3 shows the number of training steps for each of the six multi-task models. The number of steps is about three to four times larger for the 384-size models compared to the 512-size ones.
Table 4 reports the results of the multi-task models and is structured to highlight two different cases. In the first case, the model task is given explicitly, with the input string containing one of the following three prompt-task tokens: '[forward]', '[retro]', or '[rsc]'. The corresponding results are grouped below the header 'explicit task-token'. The 'three-tagged' models support only one of these three input prompt-task tokens. In the second case, the task is not given explicitly: the input string contains the '[unspecified]' token, leaving the model to infer which sub-task (a forward, a retrosynthesis, or a RSC prediction) needs to be performed. The 'all-tagged' models support both input cases. The 'unspecified' models, instead, employ only the dataset featuring the '[unspecified]' token, with this common token removed from all input lines. The results related to an unspecified task are grouped below the header 'unspecified task-token'. The empty cells in the table refer to unsupported tasks; for example, the 'three-tagged' models do not support the '[unspecified]' token. We use the single-task models presented in the previous section (see table 2) as a baseline to compare the accuracy of the single- and multi-task models. The accuracy for the forward and retrosynthesis tasks is higher for the single-task models, each trained on its own dedicated dataset; for these tasks there is no advantage in leveraging cross-learning, and independently of the model size, none of the multi-task models gives higher accuracy. On the contrary, the accuracy of the RSC predictions is better for all the multi-task models. The highest accuracy is obtained with the 'three-tagged' models: 30.0% and 30.5% for the small and the large model, respectively. The 'three-tagged' models learn to fill in missing precursors by leveraging cross-learning across the other two sub-tasks: forward prediction and retrosynthesis.

Table 4. Results of multi-task models. The header of the table includes the model name, the top-1 accuracy on the test set [adim.], and the rnn_size of the model. The names of the tasks are 'forward' and 'retro' for the forward and single-step retro-synthesis prediction tasks, respectively; 'RSC' indicates the task of predicting missing reagents, catalysts, and solvents in the precursors. The header 'complete' refers to the '[unspecified]' token prepended to a complete reaction, and 'all' refers to the accuracy averaged over the entire test set; see Method section.
Table 4 indicates a consistent trend when comparing the sections of the table under the header 'explicit task-token'. The performance of the three-tagged models, which exclude the data with the '[unspecified]' token, is consistently higher than that of the all-tagged models. The three-tagged models are trained on only 3 out of the 13 records used by the all-tagged models. Including the 10 records with the '[unspecified]' token gives a small degradation in prediction performance, which is larger for the RSC predictions. Similarly, under the header 'unspecified task-token', we can compare the effect of removing the 3 out of 13 lines with the explicit task tokens '[forward]', '[retro]', and '[rsc]'. Excluding the 3 explicit tokens gives a 2%-5% degradation of the accuracy.
Table 4 also shows the accuracy of the prediction of complete reactions. This prediction is triggered via an unspecified task. In all cases, the 'all-tagged' and 'unspecified' models give high accuracy, always above 96%. The models prove able to identify complete reactions, in which no additional compound needs to be predicted. The correct prediction for this task is a string containing only the precursors-product separator: the '≫' string.
We utilized a 'three-tagged' model trained with the requirement of one of the three prompt-task tokens: '[forward]', '[retro]', or '[rsc]'. During inference, we substituted the correct task token with one of the other two tokens. For instance, if we fed the model input data meant for a forward prediction, the expected output would be a product; however, instead of employing the appropriate '[forward]' prompt, we utilized either '[retro]' or '[rsc]'. In the context of our evaluation, the diagonal values in table 5 represent scenarios where both the input data and the designated prompt-task align correctly. Conversely, off-diagonal values denote instances where the input data are incompatible with the assigned prompt-task; for example, the model may be tasked with predicting precursors from input text containing only precursors, for which the expected output would be the corresponding product. Our evaluation demonstrates a decrease in model performance when the model is prompted with an incorrect task. This experiment shows the sensitivity of the three-tagged model to the specified prompt-task. The single-step retro-synthesis prediction does not improve when training the models on multiple tasks simultaneously. However, it is important to note that round-trip accuracy offers a more practical measure of the model's performance, particularly in assessing its ability to provide meaningful predictions despite the challenges posed by diverse solvent options. As in the previous section, we computed the round-trip accuracy for the single-step retro-synthesis task and report it in table 6. Unlike the single-task case, here we can use a single unified model for both the retro and forward predictions, as it intrinsically supports both tasks, with the task specification being provided explicitly or implicitly. In both types of cases, we report high round-trip accuracy, always above 80%.

Table 6. Accuracy of the round-trip single retro-synthesis step predictions. The header of the table includes the model name, the top-1 accuracy averaged over the data with the explicit task tokens '[forward]', '[retro]', '[rsc]' (A explicit), the accuracy averaged over the unspecified tasks (A unspecified), and the rnn_size of the model. The names of the models are described in the Method section. Unless specified, the models are multi-task. The two entries 're/fw single-task' refer to the retro and forward single-task models of table 2.

Conclusions
In this manuscript, we explore the use of a deep-learning model based on the transformer architecture that infers the missing molecules in partial chemical equations. We focused on three sub-tasks: forward prediction, single-step retro prediction, and the prediction of missing compounds such as solvents, catalysts, and reagents. Although all three sub-tasks involve the same single-modality deep-learning text-to-text translation, they are perceived as distinct from a chemistry perspective, and all can be interpreted as processes aimed at identifying missing compounds in incomplete chemical equations. We investigated whether a single unified model trained simultaneously on different sub-tasks can exploit the diverse knowledge from the different types of predictions, compared to models trained individually on each single application.
All the models trained and tested in this work are based on the modified version of Molecular Transformer [20].
In this work, we trained and tested the single-task models to provide a reference for accuracy. We then trained multi-task models with two different approaches, distinguishing the possibility of explicitly giving the task as part of the input from the alternative of not giving the task, or giving an unspecified generic instruction. In both cases, the model is trained on different types of applications, but when the task is not given, the model also needs to infer the task.
We compared the accuracy of three types of multi-task models. The 'all-tagged' models were trained on input containing both explicit and unspecified tokens. The 'three-tagged' models were trained only on the subset of data in which the tasks were explicitly given. The 'unspecified' models, instead, used the complementary part of the dataset, the part without explicit tokens, and were trained without any information on the task to be performed.
We found that the larger 'three-tagged' model, trained on explicitly given tasks, is the one with the overall highest accuracy, 39.7%, averaged over the three tasks combined. The accuracy on the 'RSC' task is 30.5%, which is more than 12% higher than the 'RSC' accuracy of the best single-task model. The accuracy of the forward predictions is quite close between the single- and multi-task models. The single-step retro predictions are better predicted by the single-task model.
Both the 'all-tagged' and the 'unspecified' models show a similar trend compared to the three single-task models: in general, better 'RSC' predictions, worse 'retro' predictions, and relatively similar 'forward' predictions. Overall, the accuracies of the single- and multi-task models, averaged over the three tasks, are quite close (within 3%).
A significant finding arises from the versatility of the multi-task models trained without the explicit input task token. These models proved able to infer the task at hand and provide predictions with an accuracy that is often on par with or even surpasses that of single-task models, which are not burdened with the task inference process.
Datasets in reaction chemistry are compiled through text-mining techniques from documents. However, due to the challenges of NLP extraction, these datasets often contain errors or missing compounds. Implementing a system capable of filling in the gaps in reaction SMILES not only addresses these issues but also provides valuable support for data curation efforts. Since the specific nature of the missing molecules, whether they are reactants, solvents, or catalysts, is often unknown beforehand, having a model that can intelligently fill in these gaps enhances the overall data curation process.

Figure 1. Illustration of a prototypical chemical reaction showing the versatile capabilities of a single trained transformer model.

Figure 2. An example of source (left) and target (right). The sub-tasks are indicated above the blue arrow separating source and target. In the fourth entry of table 1, the input corresponds to a complete reaction, here indicated by the 'complete' label; the target contains only the separator string '≫'. The last six predictions refer to mixed tasks, in which molecules are randomly removed from source and target.

Table 2. Results of single-task models. The header of the table includes the model task (T), the top-1 accuracy of the test set (A) [adim.], the number of training steps (N), and the rnn_size of the model (S). The names of the tasks are 'forward' and 'retro' for the forward and single-step retro-synthesis prediction tasks, respectively; 'RSC' indicates the task of predicting missing reagents, catalysts, and solvents in the precursors; see Method section.

Table 3. Multi-task models: the header of the table includes the multi-task model name (T), the number of training steps (N) in millions, and the rnn_size of the model (S). The names of the multi-task models are 'all-tagged', 'unspecified', and 'three-tagged', as described in the Method section.

Table 5. Testing the inference of the 'three-tagged' model (size 384) with different prompt-task tokens. The header 'Correct Token' denotes the correct task corresponding to the input data type, while 'Given Token' represents the prompt-task token provided to the model.