Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center

Effective synthesis planning powered by deep learning (DL) can significantly accelerate the discovery of new drugs and materials. However, most DL-assisted synthesis planning methods offer either no or very limited capability to recommend suitable reaction conditions (RCs) for their reaction predictions. Currently, the prediction of RCs with a DL framework is hindered by several factors, including: (a) lack of a standardized dataset for benchmarking, (b) lack of a general prediction model with powerful representation, and (c) lack of interpretability. To address these issues, we first created 2 standardized RC datasets covering a broad range of reaction classes and then proposed a powerful and interpretable Transformer-based RC predictor named Parrot. Through careful design of the model architecture, pretraining method, and training strategy, Parrot improved the overall top-3 prediction accuracy on catalysts, solvents, and other reagents by as much as 13.44% compared to the best previous model on a newly curated dataset. Additionally, the mean absolute error of the predicted temperatures was reduced by about 4 °C. Furthermore, Parrot manifests strong generalization capacity with superior cross-chemical-space prediction accuracy. Attention analysis indicates that Parrot effectively captures crucial chemical information and exhibits a high level of interpretability in the prediction of RCs. The proposed model Parrot exemplifies how a modern neural network architecture, when appropriately pretrained, can be versatile in making reliable, generalizable, and interpretable recommendations for RCs even when the underlying training dataset may still be limited in diversity.


Introduction
As a cornerstone of modern science and technology, any advancement of our mastery in chemical synthesis may bear a profound impact on the development of downstream disciplines such as pharmacy, environmental science, the energy industry, and materials science. For decades, scientists have been attempting to build reliable and convenient computer-aided synthesis planning (CASP) tools. With the recent advancement of computing power, deep learning (DL) algorithms, and theoretical understanding of electronic structures and chemical reactions, some reliable CASP tools have been developed, and they could potentially enhance chemists' productivity in synthesis planning. For instance, contemporary CASP tools can achieve synthetic route planning performance similar to that of human experts for some complex natural products [1]. With the explosive growth of experimental chemical data in recent decades, it is anticipated that DL-assisted synthesis planning (DASP) tools will inevitably play a more crucial role in digitized chemistry discovery. The combination of DASP and robotic synthesis platforms promises to eventually automate the pipeline of molecular discovery and optimization, from in silico synthetic route planning to autonomous experimental synthesis [23] in a closed loop. Despite this encouraging progress, many existing DASP algorithms still face nontrivial challenges [24] that obstruct their wider application in the lab. In particular, it is difficult to automatically assess the quality of machine-proposed synthesis plans. A key factor that undermines the quality of DASP is that existing algorithms cannot reliably recommend comprehensive reaction conditions (RCs) for the broad array of reactions needed in organic syntheses.
The choices of a reasonable chemical environment (catalysts, reagents, and solvents) and other operating conditions (temperature, pressure, etc.) for reactions are crucial as they collectively determine what product molecules are to be expected along with reaction yields and rates. In the past, researchers would query the literature to learn how similar molecules were synthesized, and then they would apply similar reactions to obtain the target molecules. In this scenario, researchers tend to adopt the reported RCs for their synthesis plans instead of consulting a computational algorithm for recommendations. This practice restricts the choice of RCs and often results in suboptimal choices. With the continuing curation of valuable reaction data and development of DL, there have been attempts to develop algorithms that can recommend RCs, thus potentially overcoming the aforementioned limitations. As shown in Fig. 1A, RC prediction also has an increasing impact on the evaluation of synthesis pathways and the optimization of chemical RCs [25,26]. However, the development of a general DL-based RC predictor remains a complex challenge that has rarely been addressed. Most existing RC predictors only focus on predicting certain aspects of RCs (such as only solvents or reagents) or on modeling a specific type of reactions (such as Suzuki reactions or Negishi reactions). For example, Walker et al. [27] predicted the solvents for 4 types of reactions, and Shim et al. [28] predicted the RCs for Pd-catalyzed coupling reactions. One prominent factor that hinders the development of a general RC prediction model is the lack of high-quality, open-source, standardized RC datasets. Previous works [25,27-30] obtained data from commercial or private databases, such as Reaxys. Because the training and testing data in most of these earlier works have not been disclosed to the public, it is difficult for later practitioners to build new models and then compare against previously published models under a fair setting. Clearly, there is an urgent need for establishing a more standardized and open-source benchmark for RC predictions in order to stimulate and facilitate further algorithmic development on this front. However, while the existing open reaction database [31] is highly favorable to DASP, existing data sources such as United States Patent and Trademark Office (USPTO) reaction data for RC prediction tasks still require further data cleaning and standardization to produce a reasonably reliable chemical RC dataset.
Another intriguing issue for the development of RC prediction algorithms is to determine an effective combination of a DL model and an associated representation of chemical reactions in order to model the intrinsic correlations between different factors of RCs. Some works [27-29,32] have represented chemical reactions by molecular/reaction fingerprints as the input to feed-forward neural networks or traditional machine learning models, but these representation methods are not inherently compatible with more advanced DL algorithms such as the Transformer with its attention mechanisms, which may improve interpretability and prediction accuracy. Another work [30] used molecular graphs and a graph neural network (GNN) as the representation method and DL model, respectively, for predicting the RCs on a small-scale reaction dataset. This method, proposed by Maser et al., is suitable for RC predictions when the training set has a small sample size and a fixed number of molecular graphs are involved in the reaction samples (for example, the reactions always involve 2 reactants and 1 product), but it is not suitable for modeling more complex, large-scale, general RC data. For modeling multiple classes of RCs, the simplest way is to model each class of RCs separately [22], but this approach cannot render a model that truly learns the intrinsic correlation between RCs. Gao et al. [29] formulated RC recommendations as a sequential prediction task and used a method similar to a recurrent neural network in which the RCs predicted in the previous step are fed into the model as the input for the next-step prediction, which better considers the relationship between the predicted RCs; however, the deep neural network architecture adopted by Gao et al. still has room for improvement in terms of interpretability and accuracy. Finally, a standardized benchmark dataset for RC prediction is currently lacking, and only a limited number of attempts have been reported to try different machine learning models and representation methods for RC prediction. Thus, this field would greatly benefit if more advanced DL models were reported and served as strong baselines to inspire further algorithmic development. As already mentioned, without specifying appropriate RCs, all CASP predictions are impractical, especially for a futuristic self-driving laboratory.
In this study, we spent a considerable amount of time curating high-quality RC datasets for benchmark purposes and developed an end-to-end RC prediction model named Parrot based on the Transformer and a pretraining strategy. We regarded the problem of RC prediction as a causal sequence prediction problem and completed both the classification of multiple conditions and the regression of temperatures. The contributions of this work can be succinctly summarized as follows: 1. We curated a large open-source dataset named USPTO-Condition based on the original USPTO reaction dataset [33] for benchmarking RC recommendation models. In addition, according to a specific data extraction strategy, another general RC dataset, Reaxys-TotalSyn-Condition, was also extracted from Reaxys for a comprehensive model evaluation. The general procedure regarding the curation of the benchmark RC datasets is shown in Fig. 1B.

2. Leveraging the attention-based model architecture and training methodology specifically designed for enhanced reaction center learning, our method Parrot achieved overall top-3 prediction accuracies that are 2.64% and 13.44% higher than those of the RC recommender (RCR) proposed by Gao et al. [26] on the 2 large-scale general RC datasets mentioned above, respectively, and the temperature mean absolute error (MAE) was also reduced by about 4 °C. Figure 1C shows the model structure and strategy used by Parrot. 3. We also demonstrated that our method exhibits strong generalization ability, maintaining higher predictive accuracy and suffering less accuracy loss compared to RCR [26] when predicting across reaction spaces. 4. Finally, we utilized the attention mechanisms to illustrate the intrinsic correlation between the substructures in the reaction and the predicted catalysts and reagents. The model design of Parrot captures reaction centers and characteristic functional groups well, and the model interpretation provides additional scientific insights.

Overview of methods
We treat the condition prediction task in 2 parts. The first part is the prediction of the chemical context, which is treated as a causal multitask multiclassification problem involving a catalyst, 2 solvents, and 2 reagents. This treatment is similar to the work of Gao et al. [29]. The second part is the prediction of temperature, which is treated as a regression problem. Bert [34] is employed as the encoder to embed the reaction information directly from the simplified molecular input line entry system (SMILES) [35-38] (reactants >> products) into an abstract latent space, and then this machine-readable representation of reactions is used for the downstream tasks, i.e., the prediction of the chemical context conditions and the temperature.

Dataset
We curated 2 large datasets, USPTO-Condition (without temperature) and Reaxys-TotalSyn-Condition (with temperature), with data volumes of 680,741 and 180,129, respectively. Both datasets were split according to the ratio of train:validation:test = 8:1:1 in this study. All molecules, such as the reactants and products in each reaction entry, are recorded in canonical SMILES, and each data entry contains 1 reaction SMILES, a temperature (in Reaxys-TotalSyn-Condition), and 5 chemical context labels: catalyst, solvent1, solvent2, reagent1, and reagent2. For each class of chemical context condition (catalyst, solvent1, solvent2, reagent1, and reagent2), an additional "Null" category is added to represent that the reaction does not require this type of RC [29]. In the Reaxys-TotalSyn-Condition dataset, we retained the original details (RC name, concentration, etc.) of the chemical context conditions, implying that the model does not just predict the SMILES of chemical context condition molecules but needs to fully predict the conditions used in the original chemical reaction. The prediction task of the Reaxys-TotalSyn-Condition dataset is more difficult than that of USPTO-Condition due to the sparser RC labels. After completing the curation of these 2 datasets, we used a reaction classifier to classify the USPTO-Condition dataset and the Reaxys-TotalSyn-Condition dataset into 12 categories each. The composition of the reaction categories for both datasets is shown in Fig. 2B, and the details about the reaction classifier can be found in Section S2.4.
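As a sketch of how such an entry maps to a fixed-order condition label sequence, the snippet below builds the 5-slot target with "Null" filling unused slots (the field names and example SMILES are illustrative, not taken from the actual datasets):

```python
# Hypothetical sketch of one curated entry; field names and example values
# are illustrative only.
CONTEXT_SLOTS = ["catalyst", "solvent1", "solvent2", "reagent1", "reagent2"]

def make_label_sequence(entry):
    """Return the 5 chemical-context labels in a fixed slot order,
    substituting "Null" for any slot the reaction does not use."""
    return [entry.get(slot) or "Null" for slot in CONTEXT_SLOTS]

entry = {
    "reaction": "CC(=O)O.OCC>>CC(=O)OCC",  # reactants >> products SMILES
    "solvent1": "C1CCOC1",                  # THF
    "reagent1": "O=S(Cl)Cl",                # thionyl chloride
}
print(make_label_sequence(entry))
# → ['Null', 'C1CCOC1', 'Null', 'O=S(Cl)Cl', 'Null']
```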
Finally, we also designed an external validation experiment to verify the prediction ability of the model across the chemical reaction space, where we extracted 8,413 reaction records (named Reaxys-TotalSyn-Condition-Sampled) from Reaxys whose RC data labels were covered by USPTO-Condition but which were significantly different from the USPTO-Condition dataset. The details of the processing methods for the USPTO-Condition dataset, Reaxys-TotalSyn-Condition dataset, and Reaxys-TotalSyn-Condition-Sampled test set are provided in Section S1, the processing scripts can be found at https://github.com/wangxr0526/Parrot, and the approach to obtain the curated RC datasets can be found in Data Availability. We have also summarized the key information of the USPTO-Condition and Reaxys-TotalSyn-Condition datasets in Tables S1 and S2, respectively.

Model architecture
The working principle of DL models can be roughly conceptualized as a 2-stage process: autonomous feature learning followed by downstream task predictions (classification, regression, etc.).
Inspired by natural language processing tasks and the works reported by Schwaller et al. [39-42], we proposed an interpretable pretrained reaction condition Transformer (Parrot). This model uses a Bert-like encoder to extract the reaction features from SMILES and a Transformer decoder to generate the hidden-layer representation of reaction context conditions. Finally, a classifier is employed for the sequential prediction of reaction context conditions, and the tensor containing the 5 context condition predictions is combined with the reaction embedding tensor. This combined tensor is then passed through a regression layer, named the temperature decoder, to estimate the temperature. Our model architecture is summarized in Fig. 3.
We treat the prediction of the chemical context (i.e., catalyst, solvent1, solvent2, reagent1, and reagent2) as a sequence of 5 multiclass classification tasks, in which each condition prediction also takes into account the conditions that have already been predicted, with a fixed target length (length = 6). We use the information contained in the memory tensors from the encoder and the decoder output tensors for the 5 RCs to predict the temperature. Each of these tensors is transformed by a feed-forward neural network, and the concatenated outputs are fed into a third feed-forward neural network to compute a scalar (the temperature).
The loss function we use consists of 2 parts: a classification part and a regression part. As in general sequence-to-sequence generation tasks, for the classification part we use cross-entropy as the loss function for the optimization of the 5 conditions. For the regression part, we use the mean squared error as the loss function.
To balance the loss values between the regression and classification components, we introduced a coefficient α in the temperature regression loss. We tested various combinations of coefficients and ultimately determined that the optimal value for α is 0.001. The loss function is:

L = Σ_{i=1}^{I} CrossEntropy(c_i, ĉ_i) + α (t − t̂)²    (1)

where I is the number of chemical context conditions, c_i is the predicted label of the i-th condition, ĉ_i is the ground truth label of the i-th condition, t is the predicted temperature, and t̂ is the ground truth temperature. In our method, I = 6 (including 5 chemical context conditions and an end token). It is worth noting that when the temperature prediction function is not imposed on Parrot, the loss function only includes the classification loss (the first part).
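The combined objective can be sketched in a few lines of plain Python (an illustrative reimplementation of the assumed form, not the actual training code): summed cross-entropy over the condition slots plus the α-weighted squared temperature error.

```python
import math

ALPHA = 0.001  # weight balancing the temperature regression term (from the text)

def parrot_loss(cond_probs, cond_targets, t_pred, t_true, alpha=ALPHA):
    """Sketch of the combined loss: summed cross-entropy over the condition
    slots plus alpha-weighted squared error on temperature. `cond_probs`
    holds one softmax distribution per slot; `cond_targets` holds the
    ground-truth class indices."""
    ce = -sum(math.log(probs[target])
              for probs, target in zip(cond_probs, cond_targets))
    mse = (t_pred - t_true) ** 2
    return ce + alpha * mse
```

For a single-sample batch, `parrot_loss([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]], [0, 1], 30.0, 25.0)` combines the two cross-entropy terms with a 0.001-weighted 25-degree-squared temperature error.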

Model pretraining strategy
The prediction of downstream tasks is deeply dependent on the embedding and representation of the source data. Inspired by the successful experiments on reaction classification and reaction yield prediction reported by Schwaller et al. [41,42], we also adopted a pretraining strategy when designing Parrot for RC predictions. Well-curated RC data with reaction classes and reaction yields are relatively rare, but there is a large inventory of raw chemical reaction data. In our pipeline, the pretraining strategy allows Bert (the encoder) to better embed reaction SMILES into hidden tensors through unsupervised learning.
We tried 2 pretraining strategies, i.e., masked language modeling (Masked LM) and masked reaction center modeling (Masked RCM), the latter incorporating domain knowledge on chemical reactions. The reaction datasets we use in both pretraining strategies contain about 1.3 million reaction SMILES obtained by cleaning USPTO 1976-2016 (Sep) [33]. These data have been cleared of all RCs to include reactants and products only (reactants >> products) and keep the same format and content as the input of the RC prediction task. Considering the disparate distributions observed between the Reaxys-TotalSyn-Condition dataset and the USPTO-Condition dataset (as illustrated in Fig. 2B), alongside the relatively smaller size of the former, we developed the Masked RCM pretraining strategy to acquire domain knowledge related to reaction centers. With this appropriate inductive bias, Parrot pretrained with Masked RCM delivers superior performance when the training set is small. The schematic diagram of this strategy is shown in Fig. S1. In the Masked RCM strategy, in order to strengthen the model's understanding of reaction centers, we increased the mask probability of the reaction center tokens to 0.5 instead of 0.15. The reaction center tokens were labeled by performing substructure matching, which involved matching reactions with their corresponding reaction templates using the rdkit [43] library. See Table S4 for the hyperparameters used in the Bert Masked LM and Masked RCM pretraining.
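The Masked RCM masking step can be sketched as follows (an illustrative reconstruction; the actual tokenization and masking details may differ): reaction-center tokens are masked with probability 0.5, and all other tokens with the standard Masked LM probability of 0.15.

```python
import random

def mask_tokens(tokens, center_flags, p_center=0.5, p_other=0.15, seed=None):
    """Sketch of Masked RCM masking (assumed details): reaction-center tokens
    are masked with a higher probability than ordinary tokens.
    `center_flags[i]` is True when tokens[i] belongs to a reaction center."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok, is_center in zip(tokens, center_flags):
        p = p_center if is_center else p_other
        if rng.random() < p:
            masked.append("[MASK]")
            labels.append(tok)       # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)      # position not used in the loss
    return masked, labels
```

Setting `p_center=1.0, p_other=0.0` makes the behavior deterministic, which is convenient for checking that exactly the flagged positions are masked.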

Model performance
Unlike most of the previous works [27,28,30,32,44], our model is designed and trained for recommending RCs in generic scenarios. Since no specification of reaction types is required, Parrot can be directly embedded into existing synthesis planning algorithms to determine the optimality of a given synthesis path. Our model is thus more versatile than models trained on data for a single type of reaction. For a more comprehensive evaluation of our method, we cleaned 2 general RC datasets used for benchmarking, named USPTO-Condition and Reaxys-TotalSyn-Condition, extracted from USPTO and Reaxys, respectively.

USPTO-Condition results
For this dataset, we conducted 6 ablation experiments to investigate the influence of the presence or absence of pretraining, the type of pretraining, and the size of the decoder (number of layers and number of heads) on model performance. We also conducted an enhanced training experiment to further improve the prediction accuracy of the Parrot model. The evaluated variants of Parrot on this dataset include Parrot-D, Parrot-LM, Parrot-RCM, Parrot-LM-E, Parrot-RCM-E, and Parrot-LM-6L8H. Parrot-D employed the strategy of initializing weights with a uniform distribution. For Parrot-LM and Parrot-RCM, the encoder weights were initialized using pretrained Masked LM and Masked RCM Bert models trained on the USPTO reaction dataset, respectively. To enhance prediction accuracy, we fine-tuned the weights of Parrot-LM and Parrot-RCM on a 5× SMILES-augmented training set, created by 5-fold training data augmentation. These 2 enhanced models are referred to as Parrot-LM-E and Parrot-RCM-E, respectively. The performance of Parrot's 6 variants and the aforementioned baseline model RCR [29] on the entire test set is shown in Table 1. Because the label distributions for catalyst, solvent2, and reagent2 are sparse, we adopted a strategy with fewer candidate selections for these sparse condition categories; conversely, for the dense condition categories solvent1 and reagent1, which have a larger number of labels and a denser distribution, we used a strategy with more candidate selections. In our evaluation on the USPTO-Condition dataset, our output condition strategy was to predict the top-1 catalyst, top-3 solvent1, top-1 solvent2, top-5 reagent1, and top-1 reagent2. Finally, the 15 results (combinations of all the RC predictions) were sorted according to their overall scores (the product of the softmax probability scores of the RC tokens), and the top-k accuracy was calculated. In this experiment, in order to compare the accuracy of each model more rigorously, all top-k accuracies were calculated by imposing strict matching. According to the results in Table 1, the catalyst (c) top-1 accuracies of all 7 models including the baseline model exceed 90%, but the accuracy of Parrot based on pretraining is higher than that of the RCR model. When using the Masked LM pretraining strategy (Parrot-LM), the top-1 accuracy reaches 92.35%, and with further enhanced training (Parrot-LM-E), the accuracy increases to 92.50%. The overall top-3 accuracy of Parrot-LM-E is 2.64% higher than that of the RCR model, and the top-15 accuracy improvement increases to 3.14%. Furthermore, except for the top-1 accuracy for solvent2, Parrot-LM-E achieves the highest accuracy among all the model configurations we tried. In Parrot-D, Parrot-LM, Parrot-LM-E, Parrot-RCM, and Parrot-RCM-E, we employed 3 decoder layers with 4 attention heads per layer. We also tested a model configuration using 6 decoder layers and 8 attention heads per layer (named Parrot-LM-6L8H), and the impact on the results was very slight, indicating that this task is not sensitive to the decoder configuration. All of the subsequent experiments adopt the decoder architecture with 3 layers and 4 attention heads per layer. The model achieved the best accuracy when initialized with Masked LM parameters, as observed in Parrot-LM-E. Additionally, Parrot-RCM-E, initialized with Masked RCM, also achieved higher overall accuracy than RCR. Due to the larger size of the USPTO-Condition dataset, both pretraining methods had similar positive effects on downstream RC predictions, yielding excellent results. The difficulty of the RC predictions may vary among different reaction categories. We also examined the model performance by chemical reaction category in this dataset. The accuracy of the Parrot-LM model for each reaction category is shown in Fig. 2A. We can see from Fig.
2A that Parrot-LM shows a relatively weak performance in predicting the C-C bond formation reactions in this dataset. Although AR-GCN proposed by Maser et al. [30] achieved excellent prediction accuracy for reactions such as Suzuki, Negishi, C-N couplings, and Pauson-Khand, there exists a substantial performance gap compared to the general RC prediction models RCR and Parrot on this dataset. Specifically tailored for small-scale datasets comprising a single reaction type, AR-GCN exhibited an overall top-1 accuracy that was approximately 10% lower than the generic RC prediction models on the USPTO-Condition dataset. As a result, it may not be well suited for RC prediction tasks that are closer to real-world applications. As an example of GNN-based condition prediction models developed in the context of general synthesis planning, we also selected the work by Zhang et al. [22] (referred to as CIMG-Condition) for comparison. Their approach involves modeling the multiple RC classes separately, and we established CIMG-Condition prediction models for each condition category in the USPTO-Condition dataset. Employing the same inference strategy, we found that the overall top-1 accuracy of this method was approximately 7% lower compared to the models that consider the interdependencies between conditions (RCR and Parrot). Due to the significant differences among the modeling strategies used by their approach, our method, and the RCR model, we have included the performance of the CIMG-Condition models in the Supplementary Information for readers' reference.
For detailed accuracy information of the AR-GCN and CIMG-Condition models on the USPTO-Condition dataset, please refer to Table S10.
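The candidate-combination inference strategy described above (top-1 catalyst, top-3 solvent1, top-1 solvent2, top-5 reagent1, top-1 reagent2, ranked by the product of softmax probabilities) can be sketched as follows; the candidate labels and scores are hypothetical:

```python
from itertools import product

def combine_candidates(slot_candidates):
    """Sketch of the candidate-combination step: each condition slot supplies
    its top-k candidates as (label, softmax probability) pairs, and the full
    condition sets are ranked by the product of their per-slot probabilities."""
    combos = []
    for picks in product(*slot_candidates):
        labels = tuple(label for label, _ in picks)
        score = 1.0
        for _, p in picks:
            score *= p
        combos.append((labels, score))
    combos.sort(key=lambda x: x[1], reverse=True)
    return combos

# Hypothetical candidates: 1 * 3 * 1 * 5 * 1 = 15 ranked combinations.
slots = [
    [("Null", 0.95)],                                   # top-1 catalyst
    [("THF", 0.5), ("DCM", 0.3), ("DMF", 0.1)],         # top-3 solvent1
    [("Null", 0.9)],                                    # top-1 solvent2
    [("Et3N", 0.4), ("K2CO3", 0.3), ("NaH", 0.1),
     ("DIPEA", 0.08), ("Null", 0.05)],                  # top-5 reagent1
    [("Null", 0.85)],                                   # top-1 reagent2
]
ranked = combine_candidates(slots)
print(len(ranked))  # → 15
```

The top-ranked combination is then the model's top-1 full condition set, and top-k accuracy checks whether the ground truth appears among the first k combinations.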

Reaxys-TotalSyn-Condition results
Unlike USPTO-Condition, which only has the chemical context condition data, the Reaxys-TotalSyn-Condition dataset gives the operating temperature for each reaction. On this dataset, we conducted 2 experiments and compared the effect of the type of pretraining strategy on the Parrot model accuracy.
Regarding the evaluation of the chemical context condition predictions, we used the same method as for USPTO-Condition to calculate the top-k accuracy. In addition, we used the MAE as the evaluation metric for temperatures. For the prediction of temperatures, we used the decoder hidden tensor corresponding to the top-1 chemical context condition as part of the temperature decoder input. In this dataset, many reaction records do not use a catalyst, solvent2, or reagent2, and these items rarely show up. Especially for the catalyst, we found that 96% of all records in the test set have no catalyst. In order to evaluate the performance of the model more reasonably, we divided the test set into 2 parts, namely, the part containing the catalyst (denoted as the Alpha group) and the part not containing the catalyst (denoted as the Beta group). The model evaluation method is described in Experiment details. The performances of these models on this dataset are shown in Table S11.

Generalizable prediction capabilities across reaction space
To evaluate the Parrot model's ability to predict across reaction space, we created an external test set for the model trained on USPTO-Condition, called Reaxys-TotalSyn-Condition-Sampled. This external test set was derived from the Reaxys-TotalSyn-Condition dataset; we selected a portion of the dataset whose RC labels can be covered by USPTO-Condition as the external test set. The production process of this part of the dataset is introduced in Section S1.3. In order to quantify the distribution difference between the Reaxys-TotalSyn-Condition-Sampled dataset and the USPTO-Condition dataset in chemical space, we calculated the average similarity of each chemical reaction to its 5 most similar reactions within and between these 2 datasets, respectively. The similarity distribution histogram is visualized in Fig. 2C. The reaction data are represented using the reaction difference fingerprint (calculated from extended connectivity fingerprints [45]), and the similarity measure is the Tanimoto similarity. The blue distribution histogram shows a significant chemical space difference between the external test set Reaxys-TotalSyn-Condition-Sampled and the training set USPTO-Condition. Differences in the distribution between the training and test sets can significantly increase the difficulty of model prediction, so this external test set can be used to assess the ability of different models to predict across reaction spaces. For detailed information on the calculation method of the similarity analysis used in this section, please refer to Section S6.
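The Tanimoto similarity used in this analysis reduces to a set operation when a fingerprint is represented by its on-bit indices; a minimal sketch (with toy bit sets standing in for reaction difference fingerprints):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints represented
    as sets of on-bit indices: |A ∩ B| / |A ∪ B|. Returns 0.0 for two
    empty fingerprints by convention."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Toy on-bit sets standing in for reaction difference fingerprints.
print(tanimoto({1, 4, 9, 16}, {4, 9, 25}))  # → 0.4 (2 shared / 5 total bits)
```

In practice the fingerprints would come from a cheminformatics toolkit such as rdkit; the set-based formula above is the same regardless of how the bits are produced.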
In this experiment, we used the same test approach as for Reaxys-TotalSyn-Condition, calculating the accuracy based on whether the test data contain catalysts, with the Alpha part containing 7,428 test records without catalysts and the Beta part containing 202 test records with catalysts. Since the solvents and reagents in the RCs are quite substitutable, we also adopted a more relaxed metric for evaluation. The idea is that if the predicted results for solvent and reagent match the substitutable part of the ground truth (i.e., belong to the same category), then the predictions are considered correct for both types of chemical RCs. The classification method for solvents follows the solvent similarity index [46], and the classification method for reagents is based on key substructure fingerprints. The detailed classification methods are introduced in Sections S3.1 and S3.2. The test results of Parrot-LM-E and the baseline model RCR in the cross-chemical-space prediction experiment are shown in Table 3.
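The relaxed metric can be sketched as a simple category-aware match; the category tables below are hypothetical stand-ins for the solvent-similarity-index and key-substructure classifications described in Sections S3.1 and S3.2:

```python
# Hypothetical category tables; the real ones come from the solvent
# similarity index and key-substructure fingerprints.
SOLVENT_CLASS = {"THF": "ether", "Et2O": "ether", "DCM": "chlorinated"}
REAGENT_CLASS = {"Et3N": "amine base", "DIPEA": "amine base", "NaH": "hydride base"}

def relaxed_match(pred, truth, classes):
    """Relaxed evaluation sketch: a predicted solvent/reagent counts as
    correct if it is identical to the ground truth or falls in the same
    category as the ground truth."""
    if pred == truth:
        return True
    return classes.get(pred) is not None and classes.get(pred) == classes.get(truth)

print(relaxed_match("Et2O", "THF", SOLVENT_CLASS))  # → True (both ethers)
print(relaxed_match("DCM", "THF", SOLVENT_CLASS))   # → False
```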
In the Alpha part of the test results, when the test data were switched from the USPTO test set, which shares the distribution of the training data, to the Reaxys-TotalSyn-Condition-Sampled test set, whose data distribution differs significantly from USPTO, both the Parrot and RCR models experienced varying degrees of accuracy reduction. RCR suffered a more notable decrease, with the top-1 accuracy of s1r1 dropping from 38.14% to 29.77%, an 8.37% decrease. On the other hand, Parrot-LM-E demonstrated more robust performance, experiencing only a 4.35% decrease. Similar observations were made in the Beta part of the test results. Although there are significant differences in reaction similarity between the Reaxys-TotalSyn-Condition-Sampled test set and the USPTO-Condition training set, the Parrot model achieved better prediction accuracy than RCR. We further show the differences in the accuracy decrease of the various reaction types after transitioning the test set from USPTO-Condition to Reaxys-TotalSyn-Condition-Sampled in Section S8 of the Supplementary Information. These results indicate that the Parrot model exhibits significantly higher cross-chemical-space prediction capability compared to the RCR model. Parrot's stronger ability to predict across reaction spaces stems from the enhanced capability of its cross-attention mechanism to learn the relationship between reaction features and RCs. Additionally, the pretraining strategy further enhances the learning of reaction features. This allows the Parrot model to perform well and capture crucial information effectively even when there are significant differences in the chemical space of the datasets.

Interpretability results
In this section, we conducted analyses from 2 different perspectives to explore the information embedded in the attention mechanism of the Parrot model when predicting RCs. In the first analysis, we investigated the model's understanding of crucial reaction centers. In the second analysis, we delved deeper into the correlation between the predicted RCs and the functional groups present in the input reactions. Through these 2 analyses, we gain a more comprehensive understanding of the performance and information extraction capabilities of the Parrot model in predicting chemical RCs. Finally, we also visualized some reaction cases as demonstrations.

Analysis results of the Parrot's understanding of reaction centers
In this part of the analysis, we employed 3 strategies to investigate the attention mechanism of the Parrot model. These strategies involved examining the cross-attention mechanism, the self-attention mechanism of the encoder, and a combination of both attention mechanisms. We evaluated the model's understanding of reaction centers by comparing the overlap score (OS) between the active atoms selected by the cross-attention weights and self-attention weights and the ground truth reaction centers. The schematic diagram of the method pipeline is shown in Fig. 4. During the analysis process, we introduced certain parameters that were adjusted on the USPTO-Condition validation set, and the final results were obtained on the USPTO-Condition test set. Table 4 presents the OS, false positive rate (FPR), and reaction-center accuracy for the 3 attention information extraction strategies (cross-attention, Bert self-attention, and their combination). The accuracy of the reaction centers was assessed using 2 criteria. The first criterion, "half", considered active atoms overlapping with at least half of the reaction center atoms as hits, while the second criterion, "at least 2", required active atoms to overlap with at least 2 reaction center atoms to be classified as hits. Further details of the analysis can be found in Interpretability analysis.
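With active atoms and reaction-center atoms represented as sets of atom indices, the OS and FPR can be sketched as follows (the exact definitions used in the paper are assumed here):

```python
def overlap_metrics(active_atoms, center_atoms):
    """Sketch of the attention-analysis metrics (assumed definitions):
    overlap score (OS) = fraction of ground-truth reaction-center atoms
    covered by the attention-selected active atoms; false positive rate
    (FPR) = fraction of active atoms lying outside the reaction center."""
    hits = active_atoms & center_atoms
    os_score = len(hits) / len(center_atoms) if center_atoms else 0.0
    fpr = (len(active_atoms - center_atoms) / len(active_atoms)
           if active_atoms else 0.0)
    return os_score, fpr

# Toy atom-index sets: 3 of 4 center atoms recovered, 1 of 4 picks spurious.
print(overlap_metrics({0, 1, 2, 7}, {0, 1, 2, 5}))  # → (0.75, 0.25)
```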
According to the results in Table 4, the following observations can be made. The OS between the active atoms indicated by the cross-attention mechanism and the reaction centers is relatively low at 60.14%, with an FPR of 32.96%; the reaction-center accuracy is 70.57% (half) and 94.61% (at least 2). This suggests that the cross-attention weights not only focus on the reaction centers but also capture other information important for the RC prediction task. In the subsequent interpretability analysis, we show that this additional information captured by the cross-attention weights is closely associated with nonreactive characteristic functional groups. In contrast, the active atoms identified by the Bert self-attention mechanism exhibit a higher OS with the reaction centers, reaching 94.70%, with a low FPR of only 9.49%; the reaction-center accuracy is 95.10% (half) and 95.28% (at least 2). This indicates that the encoder of the Parrot model has a remarkable understanding of the reaction center information. In the third approach, combining the information from the cross-attention mechanism and the encoder's self-attention mechanism, the OS between the active atoms and the reaction centers further increases to 97.57%, and the reaction-center accuracy improves significantly, reaching 98.38% (half) and 99.75% (at least 2). This suggests that the 2 attention mechanisms of the Parrot model are complementary in their understanding of reaction centers. Furthermore, these 4 metrics are consistent across the test and validation sets, which to some extent validates the reliability of our analysis approach. However, attending only to the reaction centers is insufficient for better predicting RCs. To further explore the information beyond the reaction centers that the cross-attention mechanism focuses on, we designed an alternative perspective for analysis.
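The 2 hit criteria described above can be sketched in plain Python. This is a minimal illustration with hypothetical atom index sets, not the actual evaluation code:

```python
def is_hit(active_atoms, center_atoms, criterion="half"):
    """Decide whether a set of selected active atoms counts as a hit on the
    ground-truth reaction center, under the two criteria in the text."""
    active_atoms, center_atoms = set(active_atoms), set(center_atoms)
    overlap = len(active_atoms & center_atoms)
    if criterion == "half":
        # hit if the active atoms cover at least half of the center atoms
        return 2 * overlap >= len(center_atoms)
    if criterion == "at_least_2":
        # hit if the active atoms cover at least 2 center atoms
        return overlap >= 2
    raise ValueError(f"unknown criterion: {criterion}")
```

The "at least 2" criterion is the more permissive one for large reaction centers, which is consistent with its higher accuracy in Table 4.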

Analysis results of attention-based association between functional groups and RCs
In this analysis, we utilized cross-attention weights to investigate the relationship between reactions and predicted RCs at the level of functional groups. First, we selected a palladium-catalyzed alcohol deprotection reaction as an example, visualized in Fig. 5A to C. Figure 5B presents the heat map of attention weights, displaying the correspondence between key subsequences in the reaction SMILES and the predicted RCs (high-resolution images of this example can be found in Figs. S2 to S13). Figure 5C illustrates the attention weights of the palladium catalyst relative to the atoms in the reactants and products, clearly indicating higher attention on the atoms involved in the reaction center. However, there are still cases where the attention weight distribution is challenging to interpret. Hence, to comprehensively investigate how the cross-attention mechanism grasps group information, we employed a macroscopic approach to analyze the attention distribution at the group level across the entire USPTO-Condition test dataset. We employed the BRICS [47] algorithm to segment reactions into functional groups and calculated the average attention weights across multiple heads and layers. Detailed analysis methods can be found in Interpretability analysis. Through this analytical approach, we obtained attention-based correlation scores between different catalysts, solvents, and reagents and various chemical functional groups.
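The fragmentation step can be sketched with rdkit's BRICS module (a minimal illustration on an arbitrary ester; the exact BRICS settings used for the 103 selected substructures are not specified here):

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# Fragment a molecule into BRICS substructures. The analysis applies this
# to every reactant and product in the USPTO reaction records, counts
# fragment occurrences, and keeps the most frequent substructures.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1")  # illustrative molecule only
fragments = sorted(BRICS.BRICSDecompose(mol))
print(fragments)  # SMILES fragments with dummy-atom attachment points
```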
We refer to these correlation score matrices as attention score maps (ASMs). Figure 5D displays the ASM between chemical group substructures and the catalyst; similar ASMs were obtained for solvents and reagents. Due to space limitations, we cannot present the ASMs for all RCs in the main text; please refer to Data Availability for the tables containing the ASMs.
By sorting the ASM by column (condition), the molecular substructures most relevant to each RC can be identified. We further visualized the molecular substructures for each condition (catalyst, solvent, and reagent) sorted by attention scores. The results revealed that the high-attention-score substructures for catalysts and reagents correspond to characteristic molecular structures involved in the reactions.
Two typical examples are visualized in Fig. 6. Figure 6A displays tetrakis(methyldiphenylphosphine)palladium and its top 15 relevant molecular substructures. Among these 15 substructures, 11 are aromatic groups (highlighted by blue boxes) and the first 2 are halogenated groups (highlighted by green boxes). These substructures are characteristic groups for the Suzuki reaction catalyzed by this catalyst. Another example is shown in Fig. 6B, representing substructures related to a ruthenium metal catalyst. The most relevant groups in this case are terminal alkene structures (highlighted by blue boxes), and among the top 6 related substructures, there are also 3 important carbonyl structures (highlighted by orange boxes). Additionally, among the top 15 substructures, 7 contain aromatic structures with benzene rings (highlighted by green boxes). These substructures are characteristic groups for the Murai reaction catalyzed by the ruthenium catalyst. The reaction reagents also exhibit high correlation scores with characteristic reaction substructures. Two typical examples of reagents are visualized in Fig. 7. Figure 7A displays the top 15 molecular groups most relevant to sodium carbonate (base). The top 3 ranked groups are chlorinated, boronic acid, and brominated groups, and the remaining highly correlated molecular groups are aromatic groups, which aligns perfectly with the Suzuki reaction. Figure 7B displays the top 15 molecular groups most relevant to lithium aluminum hydride, which is commonly used as a reducing agent in organic reactions. The highly correlated groups shown in the figure are those that can be reduced by lithium aluminum hydride, including carboxyl, nitro, cyano, carbonyl, and halogenated groups. Since solvents do not directly participate in the reaction, their ASM representation does not exhibit the same level of prominence as that of catalysts and reagents. These examples demonstrate how the Parrot model automatically learns the relationship between reactions and predicted RCs (particularly catalysts and reagents) through attention mechanisms, thus exhibiting strong interpretability.

Fig. 4. Reaction center analysis schematic diagram. In the Parrot model, while predicting chemical RCs, the cross-attention weights and the encoder's self-attention weights are extracted to obtain reaction center information using 3 methods: ① Using only the cross-attention mechanism, potential active atoms that may be reaction centers are determined by setting a threshold using the validation set. ② Using only the encoder's self-attention mechanism, an atom mapping algorithm guided by self-attention weights labels the atom mapping between reactants and products, followed by extraction of reaction templates to identify potential active atoms. ③ Both the cross-attention mechanism and the encoder's self-attention mechanism are considered simultaneously to determine potential active atoms that may be reaction centers.

Interpretability case study
In this subsection, we visualized some cases of Parrot's predicted RCs. Figure 8 displays the reaction centers and typical functional groups that the model's attention mechanism focused on when predicting 3 types of reactions: the Grignard reaction, the Suzuki coupling reaction, and the alcohol deprotection reaction. Each reaction type is highlighted with a different color frame. Each visualization case consists of 2 parts: (a) identification of reaction centers guided by the self-attention weights of the Parrot model's encoder and (b) the functional group structures that the cross-attention weights between the encoder and the decoder focus on when predicting the catalyst or reagent. In these case studies, we employed the parameter configuration that performed best on the reaction center recognition task. In all 3 cases, the reaction centers identified by the self-attention weights of the Parrot model's encoder accurately matched the real reaction centers. In the Grignard reaction case presented in Fig. 8A, the Parrot model assigned high attention weights to the aldehyde group of reactant2 and the hydroxyl group of the product when predicting magnesium metal as a reagent. It also placed relatively high attention weights on the bromine substitution structure of reactant1. These functional groups are typical for Grignard reactions. In the Suzuki coupling reaction case depicted in Fig. 8B, the model's attention was not only focused on the boronic acid group and the bromo substituent when predicting the palladium catalyst tetrakis(triphenylphosphine)palladium(0), but it also exhibited significant attention weights toward the aromatic ring. This indicates that the model's attention is not limited to the reaction center alone. Additionally, the model assigned lower attention weights to ether bonds and nitrogen atoms on aromatic rings, which are less relevant to the Suzuki coupling reaction. A similar pattern was observed when the model predicted sodium carbonate as the reagent. In the alcohol deprotection reaction case illustrated in Fig. 8C, the model attended strongly to the protecting group on the reactant and the hydroxyl group on the product. These cases clearly demonstrate the interpretability of Parrot's 2 attention mechanisms when predicting RCs.

Limitation and outlook
Although 2 RC datasets have been curated in this work, the existing datasets still face some limitations:
1. The dataset USPTO-Condition is obtained from the US patent database, which exhibits certain limitations in terms of data quality. In particular, the recorded reaction temperatures are subject to notable errors; therefore, we did not include temperature information in USPTO-Condition.
2. Similar to the work of Gao et al. [29], this study also adopts the strategy of building a model to predict 5 categories of RCs and, following their workflow, omits the rare records with more than 5 RCs in these 2 datasets.
3. Although this work has demonstrated that Parrot has stronger cross-reaction-space prediction capabilities than other similar RC prediction models, Parrot, like other works that treat RC prediction as a classification task, relies heavily on the quality of the collected RC data. For novel RCs that are not present in the dataset, the model struggles to make accurate predictions. Decoding RCs step by step at the character level can overcome the limitations of RC labels, but it also introduces issues related to syntactic validity and the assignment of roles to RCs.
4. We did not take chemical yields into account when curating the data and training the model; in other words, the model considered all combinations of RCs present in the dataset to be of equal value. However, chemical reaction yields are strongly affected by the RCs.
5. Although this study made efforts to clean the RC data obtained from USPTO and Reaxys, the data size is still limited compared to large-scale commercial datasets. This limitation is also evident in the trained models' performance. Access to higher-quality and larger-scale RC data could further improve RC prediction performance. Presently, large language models such as GPT-4 [48,49] have excelled at literature summarization; employing similar techniques, it is feasible to automatically extract chemical RC data from the extensive chemical synthesis literature to enhance data quality.
As part of future research, we plan to integrate chemical reaction yields into the development of chemical RC prediction models. Our proposed model can also be seamlessly integrated into existing synthesis planning algorithms, thereby aiding the optimization of synthesis routes.

Discussion
In this study, we address the RC prediction task, which is essential for synthesis planning and RC optimization. In response to the lack of readily available open-source datasets for the RC prediction task, we curated 2 general RC datasets named USPTO-Condition and Reaxys-TotalSyn-Condition, which contain approximately 680,000 and 180,000 RC records, respectively. We also proposed a novel RC prediction model called Parrot, which achieved the best performance on both datasets by incorporating a pretraining strategy based on reaction domain knowledge and a well-designed training pipeline. Compared with the baseline model, Parrot improved the overall top-3 accuracy of reaction context condition prediction by 2.64% and 13.44% on the 2 datasets, respectively, and reduced the MAE of temperature prediction by about 4 °C. Our model can not only simultaneously predict multiple types of RCs for diverse chemical reactions but also provides good interpretability, using an attention mechanism to gain insight into the intrinsic relationship between the molecular substructures in reactions and the RCs. Additionally, Parrot can be seamlessly integrated into existing synthesis planning algorithms, offering synthesis chemists improved capabilities in designing reaction routes and optimizing RCs. Moreover, our open-source code includes a user-friendly web-based graphical user interface (GUI), providing convenient access for researchers to utilize Parrot's functionalities. Looking ahead, with the availability of larger volumes of high-quality RC data, we anticipate that Parrot and Parrot-inspired algorithms will become indispensable components of DASP tools, effectively guiding the development of self-driving laboratories.

Software and implementation
All code was implemented in Python; the rdkit [43] cheminformatics toolkit was used for data processing, and the model was built on the pytorch [50] library. The web-based GUI was implemented with the flask [51] library. The Parrot model was trained on a Dell Precision 7920 Tower (Intel Xeon Bronze 3204, NVIDIA Quadro RTX 8000 GPU, 512 GB RAM), and inference can be run on a consumer computer, a Dell OptiPlex 7090 (Intel Core i7-11700, 8 GB RAM), without a discrete GPU.

Experiment details
We chose RCR [29], which also predicts multiple chemical RCs simultaneously, as the baseline model for comparison with Parrot. In the later discussion, we report RCR's performance on the 2 datasets we prepared under the same dataset split. RCR uses reaction fingerprints as the input of a feed-forward neural network to predict RCs.
For the prediction of the reaction context conditions, we adopted the top-k accuracy evaluation scheme to compare models, and for temperature prediction, we used the MAE. During testing on the USPTO-Condition test set, we took the predictions for the first-ranked catalyst, the top-3 solvent1, the first-ranked solvent2, the top-5 reagent1, and the first-ranked reagent2. All predictions are then sorted by the product of the logistic scores to compute the top-k accuracy. Since roughly 96% of the reaction records in the Reaxys-TotalSyn-Condition dataset do not use a catalyst (i.e., the entry has a null value in that column), we divided the test set into 2 categories depending on whether a catalyst is present in the reaction record; this gives a more precise characterization of the models' performance on RC predictions. The first test set contains the data without catalyst, for which we predicted the top-3 solvent1 and the top-5 reagent1; the second test set contains the data with catalyst, for which we predicted the top-2 catalysts, the top-3 solvent1, and the top-3 reagent1. The number of top candidates predicted for each RC category is determined by the sparsity of the corresponding condition labels: sparse labels result in fewer candidate selections (smaller top-k values), as for catalyst, solvent2, and reagent2, whereas dense labels lead to more candidate selections (larger top-k values), as for solvent1 and reagent1. For an illustrative diagram of the top-k accuracy calculation process for RC prediction across all models, please refer to Fig. S15.
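The product-of-scores ranking described above can be sketched as follows. This is a minimal illustration with hypothetical candidate lists and scores, not the actual evaluation code:

```python
from itertools import product

def topk_hit(candidates, scores, truth, k):
    """Enumerate all combinations of per-condition candidates, rank them by
    the product of their scores, and check whether the ground-truth
    combination appears among the top k.

    candidates: dict cond -> list of candidate labels (already truncated to
                the per-condition top-n described in the text)
    scores:     dict cond -> list of matching probability scores
    truth:      dict cond -> ground-truth label
    """
    conds = list(candidates)
    combos = []
    for picks in product(*(range(len(candidates[c])) for c in conds)):
        labels = tuple(candidates[c][i] for c, i in zip(conds, picks))
        score = 1.0
        for c, i in zip(conds, picks):
            score *= scores[c][i]
        combos.append((score, labels))
    combos.sort(key=lambda x: -x[0])  # highest product first
    truth_tuple = tuple(truth[c] for c in conds)
    return truth_tuple in [labels for _, labels in combos[:k]]
```

For example, with 2 catalyst candidates and 3 solvent1 candidates, 6 combinations are ranked, and a prediction is a top-k hit if the recorded condition tuple falls within the k highest-scoring combinations.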

Interpretability analysis
Many previous works have demonstrated that attention mechanisms can capture key chemical reaction information [52-54], but interpretability in RC prediction is rarely studied. In this section, we use several attention-based methods to analyze and demonstrate how the Parrot RC prediction model captures the relationship between the details of chemical reactions and the predicted RCs. We analyzed the self-attention weights of the encoder (Bert) component as well as the cross-attention weights between the encoder and the RC decoder. These 2 sets of attention weights respectively reflect the model's understanding of the relationships among the atoms in the reaction and of the associations between the functional groups in the reaction and each RC.

Analysis of Parrot's understanding of reaction centers
We first analyzed the model's understanding of the reaction center, the most distinctive part of a chemical reaction. As shown in Fig. 4, we extracted the cross-attention weights (representing the relationship between reaction atoms and conditions) and the self-attention weights of Bert (representing the relationships between reactant and product atoms) when the Parrot model predicts the conditions. We then used the extracted cross-attention matrix and self-attention matrix separately and analyzed them on the USPTO-Condition validation set. We attempted 3 methods to establish correspondences between the information embedded in the attention weights and the reaction center: 1. Using only the reaction-condition cross-attention weights. First, we normalize the cross-attention weight score vector corresponding to each RC and select the active atoms using the mean and standard deviation:

ActiveAtomIdx = { i | Attn_i ≥ Mean(Attn) + k · Std(Attn) }

In this formula, ActiveAtomIdx represents the indices of atoms selected as active atoms. Attn is an attention weight vector representing the attention values of the chemical reaction SMILES sequence for a specific RC (catalyst, solvent1, solvent2, reagent1, or reagent2). Mean(·) denotes the average of the attention weights for a specific RC, and Std(·) the standard deviation. k is a constant controlling the number of active atoms: it is multiplied by the standard deviation and added to the mean to obtain the attention-weight threshold, and an atom whose attention weight is greater than or equal to this threshold is considered an active atom. Note that the cross-attention weights contain matrices for multiple layers and heads, so the above calculation is performed for each layer and head. The hyperparameters, Layer, Head, k, and the ConditionType (C), are optimized on the validation set.
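The thresholding rule can be sketched as follows (a minimal illustration; the per-layer, per-head loop and the normalization step are omitted):

```python
import numpy as np

def select_active_atoms(attn, k):
    """Return indices of active atoms whose attention weight is at least
    Mean(Attn) + k * Std(Attn), following the thresholding rule above."""
    attn = np.asarray(attn, dtype=float)
    threshold = attn.mean() + k * attn.std()
    return np.flatnonzero(attn >= threshold).tolist()
```

Larger k raises the threshold and selects fewer, more sharply attended atoms; k is one of the hyperparameters tuned on the validation set.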
The evaluation criterion consists of 2 components, the OS and the FPR; the optimal parameters are determined by maximizing the difference between OS and FPR. They are calculated as

OS = TP / (TP + FN), FPR = FP / (FP + TN)

Here, TP is the number of active atoms that match the ground-truth reaction center atoms, TP + FN is the total number of ground-truth reaction center atoms, FP is the number of active atoms falsely predicted as reaction center atoms, and FP + TN is the total number of selected active atoms. The ground-truth reaction center atoms are obtained by matching the reaction template subgraph corresponding to the reaction.
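A minimal sketch of the OS and FPR computation, using the definitions above (the index sets are illustrative):

```python
def overlap_metrics(active, center):
    """OS and FPR as defined in the text: OS = TP / (TP + FN), where
    TP + FN is the number of ground-truth center atoms, and
    FPR = FP / (FP + TN), where FP + TN is the number of active atoms."""
    active, center = set(active), set(center)
    tp = len(active & center)             # active atoms that hit the center
    os_score = tp / len(center) if center else 0.0
    fpr = (len(active) - tp) / len(active) if active else 0.0
    return os_score, fpr
```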

Using only Bert's reaction-reaction self-attention weights:
First, the self-attention weight matrices extracted from Bert's encoder are input into the atom mapping algorithm implemented in rxnmapper [52]. To determine the optimal layer and head configuration, a methodology similar to that of Schwaller et al. [52] is adopted, using a dataset of 996 instances derived from USPTO-50K. The atom mapping procedure is then applied to the USPTO-Condition validation set, with the self-attention weights of Parrot's encoder serving as the guiding mechanism. Reaction templates are extracted using rdchiral [55], and rdkit is used to perform subgraph matching on the reactions, identifying the indices of the active atoms. Finally, the indices of active atoms are compared against the ground-truth reaction center, from which the OS and FPR are computed.

Simultaneously using cross-attention weights and Bert's self-attention weights:
In this approach, we combine the cross-attention weights with the self-attention weights of the encoder to enhance the relevance between the attention mechanism and the reaction center, adopting a stricter treatment for the cross-attention weights. The first part of the calculation obtains the indices of active atoms from the cross-attention mechanism, denoted ActiveAtomIdx_cross. It consists of 2 components: the first is the same thresholding as in Method 1, and the second identifies the top-n atoms ranked by attention weight, where the value of n is optimized on the validation set.
The second part obtains the indices of active atoms from the self-attention mechanism, denoted ActiveAtomIdx_self. These indices are obtained through template matching, as in the second method, where the template is marked by the atom mapping algorithm guided by the self-attention weights of Parrot's encoder and extracted using rdchiral [55].
Finally, the indices of active atoms obtained from the cross-attention mechanism and the self-attention mechanism are merged. Seven parameters are optimized on the validation set: the constant k, the ConditionType (C), the Head and Layer for the cross-attention part, the Head and Layer for the self-attention part, and the constant n.
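Under the assumption that the index sets are merged by set union (the text does not state the merge operation explicitly), the combined strategy can be sketched as:

```python
import numpy as np

def combine_active_atoms(attn, k, n, self_attn_atoms):
    """Sketch of the combined strategy: the cross-attention atoms are the
    union of the mean + k*std threshold selection and the top-n atoms by
    attention weight; these are then merged (assumed union) with the
    template-derived self-attention atoms."""
    a = np.asarray(attn, dtype=float)
    threshold_idx = set(np.flatnonzero(a >= a.mean() + k * a.std()).tolist())
    topn_idx = set(np.argsort(-a)[:n].tolist())   # n highest-attention atoms
    cross_idx = threshold_idx | topn_idx          # ActiveAtomIdx_cross
    return sorted(cross_idx | set(self_attn_atoms))
```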
The parameter optimization space and the optimal parameters for the 3 strategies above can be found in Section S2.6. The attention weights used for calculating active atoms are normalized separately for reactants and products.

Analysis of attention-based association between functional groups and RCs
Furthermore, we explored the cross-attention weights between chemical reactions and RCs to analyze the additional information they contain beyond the reaction centers. In this analysis, we first fragmented all molecules (reactants and products) from the USPTO reaction records using BRICS [47], counted the occurrences of the fragments, and selected the 103 most representative and important substructures for analysis. Then, on the USPTO-Condition test set, we accumulated the attention weights calculated by the model for each substructure and computed the attention map according to the number of hits of the substructure:

A_ij = (1/E) Σ_{e=1..E} (1/n) Σ_{w∈G_i} a_e(w)

where A is the attention map matrix; i and j are the indices of substructure G (from reactants and products) and condition C, respectively; E is the number of times substructure G_i is matched in the dataset; a_e(w) is the average attention weight (over heads) connecting atom w to the chemical context condition C_j for the e-th hit; and n is the number of atoms in substructure G_i. The results of this analysis can be found in Results. The detailed calculation of the attention weight a_e can be found in Section S2.
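The ASM accumulation can be sketched as follows; the record format is hypothetical and stands in for the per-hit attention extraction described in Section S2:

```python
from collections import defaultdict

def attention_score_map(hits):
    """Compute ASM entries A[(substructure, condition)] as the average over
    hits of the per-hit mean atom attention weight, i.e.
    A_ij = (1/E) * sum_e (1/n) * sum_w a_e(w).

    hits: list of (substructure_id, condition_id, atom_weights) records,
    one per matched occurrence of the substructure in the dataset."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sub, cond, atom_weights in hits:
        sums[(sub, cond)] += sum(atom_weights) / len(atom_weights)
        counts[(sub, cond)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Sorting a column of the resulting map by score recovers the substructures most associated with a given catalyst, solvent, or reagent, as in Figs. 6 and 7.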

Fig. 1 .
Fig. 1. Overview. (A) Effects of RC prediction tasks on synthesis planning. (B) Schematic representation of the processing flow and structure of the RC dataset. (C) Parrot model structure and strategy design.

Fig. 2 .
Fig. 2. Visualization of the Parrot model prediction accuracy by reaction category. (A) Prediction performance by reaction category on the USPTO-Condition test set. (B) Reaction category composition of the USPTO-Condition and Reaxys-TotalSyn-Condition datasets. (C) Distribution of the similarity between the reactions within USPTO-Condition and Reaxys-TotalSyn-Condition-Sampled and between USPTO-Condition and Reaxys-TotalSyn-Condition-Sampled.

Fig. 3 .
Fig. 3. Parrot model architecture. The model decodes the 5 contextual RCs in the first 5 steps of prediction. In the sixth step, it combines the tensor from the encoder and the tensor containing the information of the 5 contextual RCs from the condition decoder to predict the temperature.

Fig. 5 .
Fig. 5. Example visualization of the reaction-condition attention weights. (A) Reaction example from the USPTO-Condition test set. (B) Brief visualization of the attention weights over the encoder's memory by the decoder's 3 layers of 4 attention heads, with each subgraph showing the reaction SMILES horizontally and the 5 contextual conditions vertically. (C) Attention weights for the palladium catalyst, averaged over the attention heads in each layer and visualized on the molecule. (D) Brief visualization of the Subgraph-Condition ASM (shown here is the Subgraph-Catalyst ASM). See Data Availability for the source data table.

Fig. 6 .
Fig. 6. Catalysts and their highly correlated molecular subgraphs in the ASM.

Fig. 7 .
Fig. 7. Reagents and their highly correlated molecular subgraphs in the ASM.

Fig. 8 .
Fig. 8. Visualizations of reaction information identified by Parrot's 2 attention mechanisms, showcased in case studies of (A) Grignard reactions, (B) Suzuki coupling reactions, and (C) alcohol deprotection reactions. Each case visualizes 2 parts of information: (a) the model's identification of reaction centers and (b) the reaction functional groups attended to by the model when predicting reaction conditions. In the visualization of the model's identification of the reaction center, the reaction center is highlighted in red, with the top row indicating the ground-truth reaction center and the second row the reaction center predicted by the Parrot model. In the visualization of attention weights related to reaction conditions, the catalyst and its associated groups are highlighted in green, with darker shades indicating higher weights. Reagent1 and its associated groups are highlighted in blue.
We employed various model weight initialization schemes to train the Parrot model. For the USPTO-Condition dataset, we trained Parrot-D with weights initialized from a uniform distribution, Parrot-LM with encoder parameters initialized from a Masked LM model pretrained on the USPTO reaction dataset, and Parrot-RCM with encoder parameters initialized from a Masked RCM model pretrained on the USPTO reaction dataset. To further improve the prediction accuracy of Parrot, we also used the enhanced training variants Parrot-LM-E and Parrot-RCM-E, which involved a 5× augmentation of the training set. This enhanced training included data augmentation through SMILES permutation and shuffling of multiple reactants (and products); the model parameters were initialized from the trained Parrot-LM or Parrot-RCM, followed by additional training for 2 epochs at a lower learning rate (1 × 10−6). For the Reaxys-TotalSyn-Condition dataset, we trained the Masked RCM-initialized model Parrot-RCM and the Masked LM-initialized model Parrot-LM. We also trained and tested the GNN-based RC prediction model AR-GCN proposed by Maser et al. This comparison aimed to assess the performance differences between a general RC prediction model and a model specifically designed for small-scale, specific types of RC datasets in large-scale prediction tasks. Furthermore, another graph-network-based generic RC prediction model, CIMG-Condition, is also included in the comparison. To ensure a fair comparison, all models were trained using the same dataset split, and the hyperparameters of each model were optimized during training to achieve the best performance. See Section S2.2 for training details and hyperparameter selection of RCR, AR-GCN, CIMG-Condition, and Parrot.

Table 1 .
Results of the Parrot model on the USPTO-Condition test set and comparison with the baseline model a . (Continued)

Table 2
c, s1, s2, r1, and r2 refer to catalyst, solvent 1, solvent 2, reagent 1, and reagent 2, respectively. This model is fine-tuned for 2 epochs with a small learning rate using a 5× data augmentation training set based on LM/RCM. The decoder of the default Parrot model has 3 layers and 4 heads per layer; 6 layers and 8 heads per layer are used in this experiment.
the RCR model, and the top-15 accuracy gap expanded to 13.70%. In the Beta part of the test, which contains less data, the Parrot-RCM model also achieved an overall (c1s1r1) top-1 accuracy 9.7% higher than that of the RCR model, with the top-15 gap expanding to 18.16%. The temperature MAE of both parts (Alpha and Beta) was reduced by about 4 °C compared to that of the RCR model. Comparing these 2 pretraining strategies,
a c, s1. b Overall: c, s1, s2, r1, and r2.

Table 1 .
(Continued) the RCM pretraining strategy performs better than LM on Reaxys-TotalSyn-Condition. The accuracy of AR-GCN on this general RC dataset remains significantly lower than those of RCR and Parrot. Due to the lack of consideration of the relationships between RCs, the prediction accuracy of the general RC prediction model CIMG-Condition is lower than those of RCR and Parrot; however, it is noticeably superior to AR-GCN, which is specifically designed for modeling small datasets. The specific accuracy values of the AR-GCN and CIMG-Condition models on Reaxys-TotalSyn-Condition are included in Table

Table 2 .
Results of the Parrot model on the Reaxys-TotalSyn-Condition test set and comparison with the baseline model a .

Table 3 .
Test results of Parrot and RCR when predicting across the chemical reaction space. This model is fine-tuned for 2 epochs with a small learning rate using a 5× data augmentation training set based on LM.
d Top-k accuracy is calculated using the closest match of solvents and reagents. e USPTO: USPTO-Condition dataset. f Sampled: Reaxys-TotalSyn-Condition-Sampled. g E:

Table 4 .
The OS, FPR, and reaction-center accuracy of the active atoms identified by the 3 attention information extraction strategies.