Extracting structured data from organic synthesis procedures using a fine-tuned large language model

The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented as unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), converting unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), and it is able to recognize compound-referencing tokens and infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.

1 Sequence length of USPTO reaction records
Fig. S1 Cumulative proportion of 1,339,260 USPTO reaction records as a function of the maximum number of tokens (sequence length limit). The shaded area denotes the 1,300,613 records within the sequence length limit.
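The filtering behind Fig. S1 can be sketched as below. This is a minimal illustration, not the paper's exact setup: the whitespace tokenizer and the default limit of 2048 tokens are stand-in assumptions, since the real count depends on the model's own tokenizer.

```python
import bisect

def token_count(text: str) -> int:
    # Stand-in for the model tokenizer: plain whitespace splitting.
    return len(text.split())

def within_limit(text: str, max_tokens: int = 2048) -> bool:
    """True if a record fits the sequence length limit."""
    return token_count(text) <= max_tokens

def cumulative_proportion(records, limits):
    """Fraction of records whose token count falls within each limit
    (the quantity plotted in Fig. S1)."""
    counts = sorted(token_count(r) for r in records)
    return [bisect.bisect_right(counts, m) / len(counts) for m in limits]
```

With a real tokenizer substituted for `token_count`, sweeping `limits` over candidate sequence lengths reproduces a curve of the Fig. S1 type.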

2 Results from the ChemRxnExtractor dataset
Table S1 shows the NER results for two fine-tuned LLaMA-7B models. The first row comes from the model fine-tuned on the training set of USPTO-ORD-100K (the main focus of this manuscript), tested on the entire uniproduct ChemRxnExtractor dataset. The second row describes the model fine-tuned on the training set from a random 9:1 train:test split of the ChemRxnExtractor dataset, tested on the test set from that split. We note that while the second fine-tuned model is able to produce valid ORD-formatted JSON, the test set (12 records) is too small to allow meaningful conclusions.
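The three NER outcome counts reported in Table S1 can be tallied with simple set operations; a sketch under the assumption that entity names are compared as exact strings:

```python
def ner_outcomes(truth: set[str], predicted: set[str]) -> dict[str, int]:
    """Tally NER outcomes as in Table S1: a ground-truth name is either
    captured ("Accurate") or missed ("Removal"); names produced by the
    model but absent from the ground truth count as "Addition".
    ("Alteration" is excluded, as noted in the table caption.)"""
    return {
        "Accurate": len(truth & predicted),
        "Removal": len(truth - predicted),
        "Addition": len(predicted - truth),
        "Total": len(truth),
    }
```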

3 Notes on yield extraction and evaluation
In the structured ORD records investigated in this study, a product can have two ProductMeasurement messages describing its yield: one for the reported yield, which can be found in the procedure text, and one for the calculated yield, which cannot. In our data pipeline, if the integer part of a yield value cannot be found in the procedure text, then that ProductMeasurement message is dropped from the record (main text section 2.2, Calculated yield). We choose to detect only the integer part of a yield value to avoid erroneous matching caused by different rounding methods and reporting conventions. This, however, can still leave yield values that are not reported in the procedure text in the ORD record when all yield values share the same integer part. In the following example (ord-1f43f796680147a3869d7928c02529ac),1 the yield is reported as "89%" in the procedure text.
However, in the structured ORD record, in addition to the percentage yield value of "89.0", a calculated percentage yield of "89.5" is also present.
Both yield values remain in the ORD record after applying our data pipeline. Table S2 shows the field-level evaluation results for reported and calculated yields: the fine-tuned model accurately extracts reported yields but tends to skip generating calculated yields.
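The integer-part check described above can be sketched as follows. The function name and the exact regex are illustrative assumptions, not the pipeline's actual code:

```python
import re

def yield_is_reported(procedure_text: str, yield_value: float) -> bool:
    """Keep a percentage-yield ProductMeasurement only if the integer part
    of its value appears as a standalone number in the procedure text.

    Matching only the integer part avoids false mismatches from rounding
    conventions -- which also means a calculated 89.5 survives when the
    reported 89% shares its integer part, as in the example above.
    """
    integer_part = str(int(yield_value))
    pattern = r"\b" + re.escape(integer_part) + r"(?=\D|$)"
    return re.search(pattern, procedure_text) is not None
```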
Table S1 NER results for two fine-tuned models. Note they are evaluated on two different datasets (see above). As the names of different chemical entities can be very similar, we excluded the case of "Alteration", so a name from the text is captured either successfully ("Accurate") or unsuccessfully ("Removal"). Columns: Fine-tuned by | Tested on | Accurate | Removal | Addition | Total.

4 Fine-tuning prompt template
The following shows the prompt template used in fine-tuning, where {procedure_text}, including the curly brackets, is to be replaced with the unstructured procedure text. Note that linebreaks are always explicitly denoted as \n.
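Applying the template is a single string substitution; a minimal sketch (the template string is copied from the text, while the constant and function names are illustrative):

```python
# Fine-tuning prompt template; \n sequences are real linebreaks here.
PROMPT_TEMPLATE = (
    "Below is a description of an organic reaction. "
    "Extract information from it to an ORD JSON record.\n\n"
    "### Procedure:\n{procedure_text}\n\n"
    "### ORD JSON:\n"
)

def build_prompt(procedure_text: str) -> str:
    """Fill the placeholder with an unstructured procedure text."""
    return PROMPT_TEMPLATE.format(procedure_text=procedure_text)
```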
Below is a description of an organic reaction. Extract information from it to an ORD JSON record.\n\n### Procedure:\n{procedure_text}\n\n### ORD JSON:\n

5 Chain-of-thought prompting
In this section we detail our implementation of chain-of-thought prompting for structured data extraction.2 We compose the prompt to have three parts (Figure S2). The first part summarizes the task at a high level, the second part describes the sequential NER/RE steps to construct a generic ORD JSON record, and the third part includes detailed procedures to extract ORD JSON from two example texts. Due to the complicated structures of ReactionWorkup and ReactionConditions, we excluded these messages from chain-of-thought prompting. This method is tested with 500 reaction procedure texts using OpenAI's gpt-3.5-turbo-0125, which was chosen for its low cost compared to contemporary GPT-4 models. The temperature is set to zero for consistent outputs. After repairing the JSON format, all 500 completions are JSON parsable, but almost half (249) of them do not comply with the ORD schema. Most of these schema violations originate from the misplacement of outcomes as a part of inputs and can be fixed programmatically. Other violations include unallowed values of enum fields; e.g., the allowed types of a Compound.identifier do not include "INDEX", which is nevertheless extracted in the completion. After further repairing the completions based on the ORD schema, 91 (18.2%) of them are still invalid ORD records. Evaluation results for the remaining 409 completions are shown in Table S3, from which a reasonable success rate (61.2%) for extracting Compound is observed, along with a poor success rate of 31.3% for ProductCompound, both using the more lenient routine. Similar results (Table S4) are obtained when JSON mode is turned on through the OpenAI API, which constrains the model to generate syntactically valid JSON strings. Note that only 351 out of 500 completions generated with JSON mode are valid ORD records. This prompting method is also limited by human-crafted instructions and the context window of the model, and, considering there are more than 600 different fields defined in the ORD schema, preparing examples and steps to extract a full Reaction record seems impractical. However, chain-of-thought prompting can still serve as a low-cost, less accurate handle for structured data extraction at the compound level when fine-tuning is not available.
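The most common schema violation noted above, outcomes nested inside inputs, admits a mechanical fix. A sketch, assuming the misplaced key sits directly under the top-level inputs map (the function name is illustrative; the key names follow the ORD JSON fields):

```python
def repair_misplaced_outcomes(record: dict) -> dict:
    """Move an 'outcomes' entry mistakenly generated inside 'inputs'
    back to the top level of the Reaction record."""
    inputs = record.get("inputs")
    if isinstance(inputs, dict) and "outcomes" in inputs \
            and "outcomes" not in record:
        record["outcomes"] = inputs.pop("outcomes")
    return record
```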

6 Numerical error in reaction temperature extraction
While it is possible to use a numerical error measure such as the mean squared error for numeric fields, such a measure is not used in this study, as we prefer the strict evaluation of exact-match accuracy for the information extraction task. A practical reason is that, for some fields, errors from addition/removal occur more frequently than alteration. For example, when extracting the temperature fields in ReactionConditions, errors from addition/removal account for 3.2%/2.4%, mostly due to misextracting a workup temperature as the reaction condition temperature (or the reverse), whereas errors from alteration account for only 1.2%. The percentages here are calculated using the method used to produce Table 3, but restricted to temperature values in ReactionConditions. The extracted values, disregarding the

Table S3 Evaluation results at the message level (Evaluation Metric 1) and the leaf field level (Evaluation Metric 2) for completions generated using chain-of-thought prompting on gpt-3.5-turbo-0125. The "Path" column denotes the path of the corresponding messages in a Reaction message. The success rates are calculated based on "Accurate" messages/leaf fields. The percentages were calculated using the total number of messages/leaf fields found in ground truth records. * These values were calculated using a more lenient routine detailed in the main text.

Table S4 Evaluation results for completions generated using chain-of-thought prompting on gpt-3.5-turbo-0125 with JSON mode turned on. See the caption of Table S3 for more details.

Step 1: Identify all the chemicals in the given `reaction_text`. A chemical identifier can be the name of a compound, for example, `methanol`.

An identifier can also be an index or a generic description, for example, `compound 6`, or `desired compound`.
Step 2: <CONTINUE TO DEFINE STEPS> Here is the first example. `reaction_text` is the text between the two delimiters ``` and ```. The exported ORD JSON record is the text between the two delimiters ### and ###.
`reaction_text` = ```<EXAMPLE REACTION TEXT>``` Here is the workflow to extract information from this `reaction_text` and export it to an ORD JSON record.

Fig. S2 An example prompt used for chain-of-thought prompting. The three text chunks correspond to the three semantic parts discussed in Section S5. Texts in angle brackets, including the brackets, are defined by the two examples and are omitted for clarity. The full prompt can be found at https://github.com/qai222/LLM_organic_synthesis/blob/main/workplace_cot/cot_prefix.txt.
Table S2 Comparison between reported yield and calculated value extractions.