Artificial intelligence driven design of catalysts and materials for ring opening polymerization using a domain-specific language

Advances in machine learning (ML) and automated experimentation are poised to vastly accelerate research in polymer science. Data representation is a critical aspect for enabling ML integration in research workflows, yet many data models impose significant rigidity making it difficult to accommodate a broad array of experiment and data types found in polymer science. This inflexibility presents a significant barrier for researchers to leverage their historical data in ML development. Here we show that a domain specific language, termed Chemical Markdown Language (CMDL), provides flexible, extensible, and consistent representation of disparate experiment types and polymer structures. CMDL enables seamless use of historical experimental data to fine-tune regression transformer (RT) models for generative molecular design tasks. We demonstrate the utility of this approach through the generation and the experimental validation of catalysts and polymers in the context of ring-opening polymerization—although we provide examples of how CMDL can be more broadly applied to other polymer classes. Critically, we show how the CMDL tuned model preserves key functional groups within the polymer structure, allowing for experimental validation. These results reveal the versatility of CMDL and how it facilitates translation of historical data into meaningful predictive and generative models to produce experimentally actionable output.

The monomers 2d and 2f were prepared according to literature procedures. 1,2 Below are adapted procedures from these references.

Synthesis of 2f. 2
A Schlenk tube (250 mL), equipped with a stir-bar and an addition funnel (250 mL), was charged with TsCl (4.4 g, 23.0 mmol), TMEDA (0.90 g, 7.50 mmol), Et3N (5.3 g, 23.0 mmol), and MeCN (80 mL). A solution of 2,2'-(4-methylphenylimino)diethanol (3.0 g, 15.0 mmol) in MeCN (80 mL) was added to the addition funnel. The reaction flask was cooled to 0 °C in an ice bath and the atmosphere was purged with CO2 for 5 min. Afterwards, the apparatus was sealed, and the diol solution was added dropwise to the reaction flask over 15 min. After 1.5 h, the reaction mixture was filtered to remove solids and the filtrate was concentrated with the aid of a rotary evaporator. The isolated crude material was dissolved in a minimal amount of CHCl3 and purified by filtration through a plug of silica gel, eluting with 25% EtOAc in hexanes. Recrystallization from EtOAc/hexanes afforded the title compound as a white crystalline solid (2.85 g, 85%).

The catalyst 4c was prepared according to a literature protocol; 3 below is an adapted procedure from this reference.

Synthesis of 4c. 3
To a 100 mL round-bottom flask equipped with a magnetic stir-bar, benzoyl chloride (1.55 g, 11.04 mmol), Et3N (1.17 g, 11.54 mmol), and THF (20.1 mL) were added. 3,5-Bis(trifluoromethyl)aniline (2.30 g, 10.04 mmol) was added dropwise over 5 min. After stirring at rt for 4 h, the reaction mixture was diluted with EtOAc, transferred to a separatory funnel, and washed three times with 1 M HCl and once with brine. The organic layer was dried over MgSO4, filtered, and concentrated with the aid of a rotary evaporator. The obtained solids were purified by flash column chromatography on silica gel (gradient heptane/EtOAc) to afford the title compound as a white solid (2.71 g, 81%). Characterization data for the title compound were in agreement with published data. 4

The catalyst 5a was prepared according to a literature protocol; 5 below is an adapted procedure from this reference.

Synthesis of 5a. 5
To a 100 mL round-bottom flask equipped with a stir-bar, 3,5-bis(trifluoromethyl)phenyl isothiocyanate (2.10 g, 9.67 mmol) and THF (19.3 mL) were added. 2-Aminopyridine (0.96 g, 10.15 mmol) was added and the reaction mixture was allowed to stir at rt for 15 h. The THF was removed with the aid of a rotary evaporator and the crude product was purified by precipitation from THF into methanol (three times). After isolating and drying the solids, the title compound was obtained as a white solid (2.62 g, 87%).

Synthesis of 5c.
To a 100 mL round-bottom flask, 2-chloro-1,3-dimethylimidazolinium chloride (3.00 g, 17.0 mmol), cyclohexylamine (1.76 g, 17.0 mmol), and sodium phosphate tribasic (4.45 g, 21.0 mmol) were dissolved in chloroform (45 mL) and heated to 50 °C for 20 h in an oil bath. After 20 h, the reaction was removed from the oil bath and allowed to cool before filtering. The filtrate was concentrated with the aid of a rotary evaporator. The isolated crude residue was dispersed in PhMe (40 mL), 2 M NaOH (40 mL) was added, and the mixture was stirred for 1 h. The mixture was then transferred to a separatory funnel with additional PhMe (80 mL) and 2 M NaOH (200 mL), and the aqueous and organic layers were separated. The organic layer was dried over Na2SO4, filtered, and concentrated with the aid of a rotary evaporator to afford the title compound as an oil (1.86 g, 56%).

Synthesis of 5d.
To a 100 mL round-bottom flask, 2-chloro-1,3-dimethylimidazolinium chloride (2.00 g, 11.0 mmol), benzylamine (2.50 g, 23 mmol), and sodium phosphate tribasic (3.00 g, 14 mmol) were added, and the mixture was heated to 50 °C for 20 h. After 20 h, the reaction was removed from the oil bath and allowed to cool before filtering. The filtrate was concentrated with the aid of a rotary evaporator. The isolated crude residue was dispersed in PhMe (40 mL), 2 M NaOH (40 mL) was added, and the mixture was allowed to stir for 1 h. The mixture was transferred to a separatory funnel and shaken vigorously. The organic layer was dried over Na2SO4, filtered, and concentrated with the aid of a rotary evaporator to afford the title compound as an oil (0.95 g, 42%). Characterization data for the title compound were in agreement with published data. 6

Synthesis of 6d.
To a 100 mL round-bottom flask equipped with a stir bar, propargyl bromide (80 wt. % in toluene) (4.90 g, 41.0 mmol), bis(hydroxy) propionic acid (5.00 g, 37.0 mmol), DIEA (4.77 g, 37.0 mmol), and MeCN (50 mL) were added. The flask was equipped with a condenser and heated to 60 °C in an oil bath for 20 h. The reaction was then cooled to room temperature and transferred to a 250 mL flask containing an additional 100 mL of MeCN. To this solution, CDI (19.9 g, 0.123 mol) was added over 5 min and, once it had dissolved, AcOH (18.8 mL, 0.328 mol) was added in increments over 5 min. The reaction was heated to 80 °C for 2.5 h. The reaction mixture was cooled, concentrated with the aid of a rotary evaporator, and dissolved in EtOAc (150 mL). The solution was transferred to a separatory funnel and washed 3 times with 1.5 M HCl followed by a wash with brine. The organic layer was dried over MgSO4, followed by the addition of 40 mL of PhMe and concentration using a rotary evaporator. Additional PhMe (40 mL) was added and the solution was concentrated again to remove remaining acetic acid. This process was repeated 3 times to yield the title compound as a clear oil that crystallized overnight (1.65 g, 22% over two steps). The monomer was used without further purification. Characterization data for the title compound were in agreement with published data. 7

Supplementary Discussion
CMDL is a domain-specific language developed to enable an extensible approach to experimental documentation while leveraging features of modern integrated development environments (IDEs) to assist in the documentation process. Note that the language and syntax features described below reflect the version used in the paper. For the most up-to-date documentation and examples, please see the GitHub repository and the associated documentation webpage. In-depth tutorials on CMDL and the IBM Materials Notebook can be found at the GitHub repository along with numerous example CMDL notebook documents. Below is an abbreviated introduction to CMDL within the IBM Materials Notebook environment.
The number and type of properties allowed within a group, and the allowable ranges of data for a particular property, are defined and enforced by the CMDL compiler. As noted in the manuscript, the term compiler is used loosely, as this portion of the program performs static analysis of the CMDL text. Groups come in several different types. Supplementary Figure 1 depicts a generic group describing the metadata for a particular experiment in a notebook document. Other group types include a named group, which can be used to define references in CMDL. References are groups that describe a particular entity within an experiment, such as a chemical, polymer, or continuous-flow reactor, and allow data for that entity to be used in multiple locations within the record without redefinition. Supplementary Figure 4 shows example references, including an example of a reference that was imported. These entities may be used in more than one place within a single experiment record. For instance, a solvent such as ethyl acetate may be used both in the reaction itself and as the organic phase in a biphasic extraction during workup. Using a reference and appending additional data is accomplished with the "@" prefix. Supplementary Fig. 5 shows an example of a named group (reaction) containing references to chemicals or polymers being used in the reaction. By using the different group types and their associated properties, we can define a highly extensible data model: new properties can easily be added to existing groups and reused across multiple different groups, all with the compiler enforcing type checks.
Once valid CMDL is written and checked by the compiler, it can be executed using the kernel that is part of the IBM Materials Notebook. During execution, the CMDL interpreter performs some basic calculations based on the group type and its associated model. For example, the interpreter computes basic stoichiometry and estimates concentrations during execution of a reaction group (Supplementary Fig. 6). The output is rendered as JSON by default; however, it can be displayed as a table using the custom notebook renderer.
Supplementary Fig. 6. Example reaction output. Screenshot of reaction group output after running the cell. The CMDL interpreter reads the valid CMDL syntax and performs a stoichiometry calculation for the reaction based on available data. The data for each of the chemicals, either defined using CMDL elsewhere or imported (Supplementary Fig. 4), are merged with the values defined for them within the reaction group for the calculation.
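The merge-then-calculate behaviour described for reaction groups can be illustrated with a short Python sketch. It is not the CMDL interpreter's implementation: the dictionary layout, the function names `merge_reference` and `compute_stoichiometry`, and the rule that concentrations are estimated from the solvent volume alone are assumptions made for illustration. The masses and solvent volume are those of the 4c preparation above, and the molecular weights are standard values.

```python
# Illustrative only: the data layout and function names are hypothetical,
# not the CMDL interpreter's API.

# Shared reference data, as might be defined once and reused across records.
REFERENCES = {
    "benzoyl chloride": {"molecular_weight": 140.57},  # g/mol
    "triethylamine": {"molecular_weight": 101.19},
    "3,5-bis(trifluoromethyl)aniline": {"molecular_weight": 229.12},
}

# Values given inside the reaction group for each referenced chemical.
REACTION = {
    "benzoyl chloride": {"mass_g": 1.55},
    "triethylamine": {"mass_g": 1.17},
    "3,5-bis(trifluoromethyl)aniline": {"mass_g": 2.30, "limiting": True},
}
SOLVENT_VOLUME_ML = 20.1  # THF


def merge_reference(reference, local):
    """Merge reference data with the values defined in the reaction group."""
    merged = dict(reference)
    merged.update(local)
    return merged


def compute_stoichiometry(references, reaction, solvent_volume_ml):
    """Return mmol, equivalents, and estimated molarity for each chemical."""
    merged = {name: merge_reference(references[name], local)
              for name, local in reaction.items()}
    mmol = {name: 1000.0 * c["mass_g"] / c["molecular_weight"]
            for name, c in merged.items()}
    limiting = next(name for name, c in merged.items() if c.get("limiting"))
    return {
        name: {
            "mmol": round(amount, 2),
            "equivalents": round(amount / mmol[limiting], 2),
            # Assumption: solute volumes are ignored, so mmol/mL equals mol/L.
            "molarity": round(amount / solvent_volume_ml, 3),
        }
        for name, amount in mmol.items()
    }


result = compute_stoichiometry(REFERENCES, REACTION, SOLVENT_VOLUME_ML)
for name, values in result.items():
    print(name, values)
```

Keeping molecular weights in the shared reference and masses in the reaction mirrors the idea that an entity is defined once and only the experiment-specific values are appended where it is used.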
CMDL Polymer Graphs. CMDL comes with built-in support for defining polymer graphs. The CMDL syntax for polymer graphs is composed of three elements: the top-level polymer graph group itself, containers, and connections. Additionally, discrete structural elements are defined separately as fragments and referenced within the polymer graph definition. This allows definition of polymer graphs using CMDL as a composite tree. Each container (including the polymer graph group) may define which discrete nodes exist within it. Additionally, each container or polymer graph defines connection properties (written with angle brackets), which represent edges in the polymer graph, only for nodes defined within it or within nested container groups. Fragments can potentially be referenced in multiple locations within a polymer graph, depending on the polymer structure.

Supplementary Fig. 7. Example fragments for use within a polymer graph definition. Screenshot of fragment groups for defining discrete structural elements within a polymer graph definition. The point groups defined on each fragment enable the CMDL compiler to recognize specific attachment points on each fragment for creating connection objects.
Supplementary Figure 8 depicts the CMDL syntax for a simple poly(trimethylene carbonate) homopolymer initiated from methanol. The fragments which define the methanol initiator (eg_MeO) and the trimethylene carbonate repeat unit (p_TMC) are defined separately (Supplementary Fig. 7) and referenced within the polymer graph (Supplementary Fig. 8). On each fragment a point group is defined for each distinct attachment point within the SMILES string, allowing these points to be referenced within connection properties (Supplementary Fig. 7). The polymer graph group itself contains a reference to the methanol fragment in the nodes property. The polymer graph group also has a connection property (defined with angle brackets) to define an edge connection between the methanol node and the trimethylene carbonate repeat unit.
Supplementary Fig. 8. Example polymer graph for a carbonate homopolymer. Screenshot of polymer graph for a poly(trimethylene carbonate) homopolymer.

Nested within the polymer graph group is the container group for the trimethylene carbonate block, which references the trimethylene carbonate fragment within the nodes property and defines the self-referencing connection for the trimethylene carbonate repeating connection (Supplementary Fig. 8). By convention, each polymer graph group or container group may define connections between its own nodes and between its nodes and those of nested containers. Repeating units, such as trimethylene carbonate in the case of Supplementary Figure 8, are typically separated into their own containers as it allows clear delineation of repeating structures and discrete end groups or other structural moieties. This is especially convenient in the case of multiblock architectures, statistical copolymers, grafted polymers, or more complex polymer architectures.
Supplementary Fig. 9. Example polymer graph for a block copolymer. Screenshot of polymer graph for a poly(valerolactone)-b-poly(L-lactide) block copolymer.
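The composite-tree structure described above (fragments with point groups, containers holding nodes, and connections acting as edges) can be mirrored in a small Python sketch. This is an illustration rather than CMDL's internal model: the class names, the "fragment.point" endpoint notation, the point names Q and R, the container name TMC_Block, and the SMILES strings with numbered attachment points are all assumptions. Only the fragment names eg_MeO and p_TMC come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    name: str
    smiles: str
    points: tuple  # names of the attachment points defined on the fragment


@dataclass
class Container:
    name: str
    nodes: list
    connections: list = field(default_factory=list)  # (source, target) pairs
    containers: list = field(default_factory=list)   # nested containers


def nodes_in_scope(container):
    """Nodes a container may connect: its own and those of nested containers."""
    names = set(container.nodes)
    for child in container.containers:
        names |= nodes_in_scope(child)
    return names


def collect_edges(container, fragments):
    """Walk the tree, check every connection endpoint, and return all edges."""
    scope = nodes_in_scope(container)
    edges = []
    for source, target in container.connections:
        for endpoint in (source, target):
            fragment_name, point = endpoint.split(".")
            if fragment_name not in scope:
                raise ValueError(f"{endpoint}: node not visible in {container.name}")
            if point not in fragments[fragment_name].points:
                raise ValueError(f"{endpoint}: point not defined on fragment")
        edges.append((source, target))
    for child in container.containers:
        edges.extend(collect_edges(child, fragments))
    return edges


# Hypothetical fragments; the SMILES and point names are placeholders.
fragments = {
    "eg_MeO": Fragment("eg_MeO", "CO[*:1]", ("R",)),
    "p_TMC": Fragment("p_TMC", "[*:1]C(=O)OCCCO[*:2]", ("Q", "R")),
}

# Repeat unit in its own container with a self-referencing connection.
tmc_block = Container("TMC_Block", nodes=["p_TMC"],
                      connections=[("p_TMC.R", "p_TMC.Q")])

# Top-level graph: the initiator node plus an edge into the nested container.
graph = Container("MeO_pTMC", nodes=["eg_MeO"],
                  connections=[("eg_MeO.R", "p_TMC.Q")],
                  containers=[tmc_block])

edges = collect_edges(graph, fragments)
print(edges)
```

The scope check reflects the convention stated in the text: a container may connect its own nodes and those of nested containers, but not nodes that live only in a parent.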
An AB block copolymer can be defined by adding a second nested container group and defining additional connection properties and node references. Supplementary Figure 9 shows an example poly(valerolactone)-b-poly(L-lactide) block copolymer initiated from pyrenebutanol.
Supplementary Fig. 10. Example polymer graph for a statistical copolymer. Screenshot of polymer graph for a carbonate statistical copolymer.
A statistical copolymer is readily defined when two (or more) repeat units are defined on the same container group. Written out explicitly, a connection property would have to be added for each connection between the different repeat units and for each repeat unit with itself. This is somewhat tedious, so instead we can use a syntactical shortcut with the pipe ("|") to express the distributed connections between repeat units in a statistical copolymer (Supplementary Fig. 10).
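What the pipe shortcut saves can be seen by writing out the explicit connections it stands for. The sketch below assumes that every listed source point connects to every listed target point; the function name `expand_distributed` and the unit and point labels are hypothetical and do not reproduce CMDL syntax.

```python
from itertools import product


def expand_distributed(sources, targets):
    """Expand a distributed connection into explicit (source, target) edges."""
    return list(product(sources, targets))


# Two repeat units in one container, each with an outgoing point R and an
# incoming point Q (labels are placeholders).
sources = ["unit_A.R", "unit_B.R"]
targets = ["unit_A.Q", "unit_B.Q"]

for edge in expand_distributed(sources, targets):
    print(edge)
```

For n repeat units this produces n × n explicit connections (four for two units, nine for three), which is why a single distributed expression is more convenient.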

Consuming polymer graphs. The definition of a polymer graph using CMDL syntax simply defines the base structural features and connectivity of a polymeric material. The graph definition can then be consumed within the definition of a polymer reference.
Supplementary Fig. 11. Example of a polymer reference definition. Screenshot of polymer reference definition (MeO-pTMC20 and MeO-pTMC200). The tree property references the polymer graph definition.
In Supplementary Figure 11, the polymer references for poly(trimethylene carbonate) polymers are defined, one for the known starting material (MeO-pTMC20) and one for the new, chain-extended product (MeO-pTMC200). The DPn is assigned to the poly(trimethylene carbonate) repeat unit in the starting material for the chain extension reaction. Following the experiment, the DPn value for the product polymer (MeO-pTMC200) can be assigned based on measured values in the sample group (Supplementary Fig. 12).
Components may also be grouped within a reactor group. Any component defined under a reactor group will be considered part of a single reactor and contribute to its total reactor volume. Reaction stoichiometry and estimated residence times will be computed for each reactor group defined within the reactor graph. Supplementary Fig. 14 shows the definition of a reactor group (PolyReactor) using CMDL syntax.
Stock solutions for each input node on the reactor graph are also defined in separate groups, similar to how reaction groups are defined (Supplementary Fig. 14). Upon execution of a cell containing a stock solution group, the stoichiometry is computed by the CMDL interpreter and displayed in a table in the cell output.
Supplementary Fig. 15. Example of stock solution group. Screenshot of stock solution group, its chemical components, and the output after running the cell.
Once the reactor graph and stock solutions have been defined, they may be referenced within the flow reaction group. The reactor is referenced in the reactor property on the flow reaction group, whereas the stock solutions are referenced as reference groups (Supplementary Fig. 16). Within each stock solution reference group, the input property is defined and references the input node on the reactor graph. The flow rate property defines the flow rate for a stock solution input.
Supplementary Fig. 16. Example of flow reaction group. Screenshot of flow reaction group, its stock solution components, and the polymer product (pMeBnO-pL-lac).

Upon execution of the cell with the defined flow reaction, the CMDL interpreter will use the referenced reactor graph and propagate the stock solutions through the graph. When stock solutions are mixed in a reactor, the dilution ratios and stoichiometry for the reaction will be computed along with the estimated residence time. These values will be displayed in the cell output (Supplementary Fig. 17).
Supplementary Fig. 17. Example of flow reaction output. Screenshot of flow reaction group output from running the cell in Supplementary Fig. 16.
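The quantities named above (dilution ratios, mixed concentrations, and residence time) follow from simple flow arithmetic, sketched here for a single reactor fed by several stock solutions. This is not the CMDL interpreter's code: the function name `mix_streams`, the dictionary keys, and all numerical values are assumptions, and the sketch ignores any volume change on mixing. Residence time is estimated as reactor volume divided by total volumetric flow rate.

```python
def mix_streams(streams, reactor_volume_ml):
    """Combine stock-solution streams entering one reactor.

    Each stream is diluted by its share of the total flow; the estimated
    residence time is the reactor volume divided by the total flow rate.
    """
    total_flow = sum(s["flow_rate_ml_min"] for s in streams)
    if total_flow <= 0:
        raise ValueError("total flow rate must be positive")
    molarity = {}
    dilution = {}
    for stream in streams:
        ratio = stream["flow_rate_ml_min"] / total_flow
        dilution[stream["name"]] = ratio
        for species, concentration in stream["molarity"].items():
            molarity[species] = molarity.get(species, 0.0) + concentration * ratio
    return {
        "total_flow_ml_min": total_flow,
        "dilution": dilution,
        "molarity": molarity,
        "residence_time_s": 60.0 * reactor_volume_ml / total_flow,
    }


# Hypothetical stock solutions and reactor volume, for illustration only.
streams = [
    {"name": "monomer_stock", "flow_rate_ml_min": 1.0,
     "molarity": {"monomer": 1.0}},
    {"name": "catalyst_stock", "flow_rate_ml_min": 1.0,
     "molarity": {"catalyst": 0.02, "initiator": 0.02}},
]

mixed = mix_streams(streams, reactor_volume_ml=0.5)
print(mixed)
print("monomer : initiator =",
      mixed["molarity"]["monomer"] / mixed["molarity"]["initiator"])
```

Propagating through a full reactor graph would repeat this step at each mixing point, feeding the outlet composition and combined flow rate of one reactor into the next.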
Compilation and export of experimental data. Once all the requisite data for a given experiment is recorded in the CMDL syntax and executed, saving the record will automatically create a JSON file using the default export schema (see the GitHub repository for examples). These JSON files may be loaded into a database or other AI pipeline as needed.
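As a minimal sketch of the downstream step, exported records could be gathered from a directory with the Python standard library before being passed to a database loader or ML pipeline. The directory layout and the function name `load_records` are assumptions, and no particular export schema is assumed here: each file is simply parsed and returned alongside its file name.

```python
import json
from pathlib import Path


def load_records(directory):
    """Parse every .json file in a directory into a list of records."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append({"source": path.name, "data": json.load(handle)})
    return records


if __name__ == "__main__":
    # "exports" is a placeholder path for wherever the records were saved.
    for record in load_records("exports"):
        print(record["source"], type(record["data"]).__name__)
```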
Inspection of generated polymer graphs. Data from the RT model for design of new polymers were inspected using the IBM Materials Notebook. This was accomplished by first cleaning the CSV output from the model and then serializing the data into CMDL syntax. The generated materials were then written to individual CMDL notebook files in groups of 50. It should be noted that during serialization a dummy molecular weight (123 g mol⁻¹) was given to new fragment groups; future work will aim to provide a more accurate molecular weight estimation for valid SMILES strings. While the CMDL compiler assisted in identifying erroneous syntax and missing components in the polymer graphs, future versions that assist in detection and/or correction of invalid SMILES strings will be important to better enable experimentalists to inspect AI predictions. Nonetheless, the use of the IBM Materials Notebook to inspect the generated polymer structures proved invaluable. Supplementary Figure 18 depicts example serialized data and its output upon execution of the cell.
Supplementary Fig. 18. Example of a generated polymer serialized into CMDL. Screenshot of generated polymer (Gen_Poly_33), its polymer graph (Base_33), and a simple rendering of the polymer graph upon running the cell.
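The cleaning and batching steps can be sketched as follows. This is an assumed reconstruction, not the script used in this work: the CSV column name `smiles`, the record layout, the sequential naming pattern, and the function names are hypothetical, and the final conversion of each record to CMDL text is deliberately left out. Only the batch size of 50 and the dummy molecular weight of 123 g mol⁻¹ come from the text.

```python
import csv
import io

DUMMY_MW = 123.0   # g/mol, placeholder assigned to new fragment groups
BATCH_SIZE = 50    # generated materials per notebook file


def clean_rows(csv_text):
    """Drop rows without a SMILES string and attach placeholder metadata."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        smiles = (row.get("smiles") or "").strip()
        if not smiles:
            continue
        records.append({
            "name": f"Gen_Poly_{len(records) + 1}",  # assumed naming pattern
            "smiles": smiles,
            "molecular_weight": DUMMY_MW,
        })
    return records


def batch(records, size=BATCH_SIZE):
    """Split records into consecutive groups of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


# Tiny illustrative input; the second data row has no SMILES and is dropped.
example = "smiles,score\nCO[*:1],0.9\n,0.1\n[*:1]C(=O)OCCCO[*:2],0.8\n"
for group in batch(clean_rows(example)):
    print(len(group), [record["name"] for record in group])
```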

Supplementary Figures
Supplementary Fig. 19. Examples of structures with symmetric and non-
Supplementary Fig. 22. Examples of a composite tree and graph representation for an end-capped polyurethane copolymer. a Example of end-capped polyurethane copolymer. An additional container was added to the composite tree defined in Supplementary Fig. 21 to better define the connectivity of the end-group with the rest of the material. b Graph representation of the polyurethane copolymer. Edge definitions are omitted for simplicity.

Supplementary Fig. 26. Example of conversion of graph representations to strings for use in RT models. a Screenshot of CMDL definition of a block copolymer. b Example of string representation output of the block copolymer polymer graph. Each node is enclosed with angle brackets wherein there is a generic label for a particular node, its SMILES string with variable attachment points, and edge definition(s). Each of these components is separated by a pipe ("|") character. If there are multiple edges originating from a node, each is also separated by a pipe character.
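The node format described in this caption can be sketched as a small serializer. Only the overall shape (angle brackets around each node, with pipe-separated label, SMILES, and edge fields) is taken from the caption. The edge notation, the labels, the SMILES strings, and the choice to concatenate nodes without a separator are assumptions, so the output below should not be read as the exact string used to train the RT models.

```python
def node_to_string(label, smiles, edges):
    """Serialize one node as <label|SMILES|edge|edge...>."""
    return "<" + "|".join([label, smiles, *edges]) + ">"


def graph_to_string(nodes):
    """Serialize all nodes in order (direct concatenation is an assumption)."""
    return "".join(
        node_to_string(node["label"], node["smiles"], node["edges"])
        for node in nodes
    )


# Hypothetical block copolymer: initiator A, first block B, second block C.
# Edge fields use a made-up "point:target.point" notation.
nodes = [
    {"label": "A", "smiles": "CO[*:1]", "edges": ["1:B.1"]},
    {"label": "B", "smiles": "[*:1]C(=O)CCCCO[*:2]", "edges": ["2:B.1", "2:C.1"]},
    {"label": "C", "smiles": "[*:1]C(=O)C(C)OC(=O)C(C)O[*:2]", "edges": ["2:C.1"]},
]

print(graph_to_string(nodes))
```

A node with several outgoing edges, such as B here, simply carries additional pipe-separated fields, in line with the caption.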