Substrate specific closed-loop optimization of carbohydrate protective group chemistry using Bayesian optimization and transfer learning

A new way of performing reaction optimization within carbohydrate chemistry is presented. This is done by performing closed-loop optimization of regioselective benzoylation of unprotected glycosides using Bayesian optimization. Both 6-O-monobenzoylations and 3,6-O-dibenzoylations of three different monosaccharides are optimized. A novel transfer learning approach, where data from previous optimizations of different substrates is used to speed up the optimizations, has also been developed. The optimal conditions found by the Bayesian optimization algorithm provide new insight into substrate specificity, as the conditions found are significantly different. In most cases, the optimal conditions include Et3N and benzoic anhydride, a new reagent combination for these reactions, discovered by the algorithm, demonstrating the power of this concept to widen the chemical space. Further, the developed procedures include ambient conditions and short reaction times.


General experimental methods and equipment
All chemicals have been used as purchased from the supplier without further purification. Solvents were HPLC grade and were used directly from the bottle without any attempt at further drying. Reactions were monitored by TLC on aluminum sheets coated with silica (Merck F254 plates). The spots were visualized using UV and staining with 10% H2SO4 in ethanol. Purification by flash chromatography was either performed manually with silica gel (40-63 µm) or on a Buchi Pure C-815 chromatography instrument.
HRMS values were obtained using MALDI; the spectra were recorded on a Bruker SolariX XR 7T ESI/MALDI-FT-ICR-MS instrument. 1H NMR and 13C NMR spectra were recorded at 500 MHz and 125 MHz, respectively, on a Bruker 500 MHz Ultra Shield Plus instrument with a cryoprobe.[2]

General experimental procedure for manual synthesis of suggested experiments
250 mg of the specific carbohydrate was dissolved in 255 µL DMF in a pear-shaped flask. While stirring, the solvent was added, followed by the base. After waiting for t1, the benzoylating reagent was added, either neat (benzoyl chloride) or as a 4 M solution in MeCN (benzoic anhydride). The reaction was stirred for 1 hour and then quenched with DMAPA. After stirring the quenched solution for 10 minutes, 10 mL of EtOAc was added, and the solution was washed thrice with 1 M HCl (3 x 10 mL), once with 1 M NaOH (10 mL), and once with brine (10 mL). The organic phase was concentrated in vacuo and purified by column chromatography (CDCl3 and MeOH).
Note: + equals true or included and ≠ equals false or not included.
* Compound has not been characterized using NMR, but melting point and elemental analysis have been reported.


A.2.2 Standard Curves
All standard curves were produced by running triplicates for each calibration point.

Benchmark study
The study has been performed utilizing reaction data for a direct arylation, obtained by Doyle and coworkers.[1] The reaction and the reaction space are depicted in Scheme S1.

Scheme S1. Reaction and reaction space used for the benchmark study. Dataset obtained by Doyle and co-workers.[1]

Setup
To carry out the benchmark study, the script illustrated below was constructed:

Algorithm 1: Benchmark script
    for optimization in range(30) do
        for experiment in range(100) do
            Query ProcessOptimizer for proposed reaction conditions
            Get yield for queried reaction conditions from Doyle dataset
            Give yield from queried reaction conditions to ProcessOptimizer
        end for
    end for

The script queries the optimizer for reaction conditions, gets the yield obtained from these conditions from the Doyle dataset, and tells the yield to the optimizer, which, based on the yield, proposes a new experiment, and so forth. The inner loop is run 100 times, corresponding to 100 experiments, and the outer loop repeats the optimization 30 times so that an average can be calculated. Each optimization run is initiated with 10 random sets of reaction conditions. This is done to average out statistical uncertainties and make sure the values compared have some statistical significance. If only one optimization was run for each method, one would not be able to say with certainty whether a method performs better because it is in fact superior or because it had a better starting point. Unless otherwise noted, the default values or conditions of the ProcessOptimizer have been used. These conditions can be found in Appendix A.2.4.
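The structure of the benchmark loop can be sketched in Python. ProcessOptimizer itself is not needed for the sketch: a random-search stub stands in for its ask/tell interface, and a toy lookup table stands in for the Doyle dataset, so only the shape of the loop is illustrated, not the actual optimizer or data.

```python
import random

# Minimal stand-in for an ask/tell optimizer. A real optimizer would fit a
# surrogate model to (Xi, yi) before proposing new conditions; this stub
# simply samples at random so the sketch runs without any dependencies.
class AskTellStub:
    def __init__(self, space, seed=None):
        self.space = space            # list of lists of allowed values
        self.rng = random.Random(seed)
        self.Xi, self.yi = [], []     # observed conditions and objective values

    def ask(self):
        return [self.rng.choice(dim) for dim in self.space]

    def tell(self, x, y):
        self.Xi.append(x)
        self.yi.append(y)

def lookup_yield(conditions, dataset):
    # Stand-in for the Doyle-dataset lookup: conditions -> yield.
    return dataset[tuple(conditions)]

# Toy reaction space and dataset (placeholder values, not real data).
space = [["ligand_A", "ligand_B"], [25, 60, 90]]
dataset = {(l, T): i + 10 * j
           for i, l in enumerate(space[0])
           for j, T in enumerate(space[1])}

best_per_run = []
for optimization in range(30):            # outer averaging loop
    opt = AskTellStub(space, seed=optimization)
    for experiment in range(100):         # inner optimization loop
        x = opt.ask()
        y = lookup_yield(x, dataset)
        opt.tell(x, 100 - y)              # minimize 100 - yield
    best_per_run.append(min(opt.yi))

average_best = sum(best_per_run) / len(best_per_run)
```

Averaging the best objective value over the 30 runs corresponds to the averaging described above, which smooths out the effect of lucky or unlucky starting points.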
As seen in Scheme S1, all the parameters in the reaction space are either categorical or have discrete values. Currently, the ProcessOptimizer can handle categorical, continuous, and integer parameters. Three different approaches to describing the discrete parameters were therefore proposed: 1. Treat them as categorical parameters.
2. Treat them as continuous parameters and return the discrete value closest to the value proposed by the optimizer.
3. Keep them as discrete parameters using integers, with each integer corresponding to a discrete value.
All three methods were tested; the results are shown in Figure S1. The optimizations were performed using an extra trees base estimator, sampling to minimize the acquisition function, and 10 initial points.
It is seen that treating all parameters as categorical parameters gives the worst performance, whereas the continuous and discrete approaches give similar performance. The approach using integers as discrete parameters was chosen for the following studies.
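Approaches 2 and 3 can be sketched in a few lines. The parameter values below are illustrative placeholders, not the actual reaction space: each discrete parameter is exposed to the optimizer either as a continuous value snapped to the nearest allowed value (approach 2) or as an integer index decoded back to the chemical value (approach 3).

```python
# Illustrative discrete parameter values (not the actual reaction space).
temperatures = [60, 75, 90, 105, 120]
concentrations = [0.057, 0.100, 0.153]

def nearest(value, allowed):
    """Approach 2: snap a continuous proposal to the closest allowed value."""
    return min(allowed, key=lambda v: abs(v - value))

def decode(proposal):
    """Approach 3: map the optimizer's integer indices to real values."""
    t_idx, c_idx = proposal
    return temperatures[t_idx], concentrations[c_idx]

def encode(temperature, concentration):
    """Approach 3, inverse: map real values back to integer indices."""
    return temperatures.index(temperature), concentrations.index(concentration)

T, c = decode((2, 0))
assert (T, c) == (90, 0.057)
assert encode(90, 0.057) == (2, 0)
assert nearest(82.0, temperatures) == 75
```

With approach 3, the optimizer only ever sees small integers, so neighbouring indices are treated as neighbouring values, which preserves the ordering information that a purely categorical encoding discards.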

Robot script with unfolded macros
Parameter tuning

Base estimator: The performance of different base estimators was also investigated. The acquisition function was set to "gp_hedge", which probabilistically chooses between the LCB, EI, and PI acquisition functions at each iteration. The results are displayed in Figure S2. Four different base estimators were tested: gradient boosted regression trees (GBRT), random forest (RF), extra trees (ET), and a Gaussian process (GP), the latter in two variants, one with sampling minimization and one using L-BFGS minimization. All methods initially show similar performance. After around 10 experiments, the GBRT base estimator seems to perform best, but its performance then plateaus. The methods using a GP base estimator and the RF base estimator seem to perform best overall, with the GP base estimator with L-BFGS minimization performing better than the GP base estimator with sampling minimization. L-BFGS minimization is not possible for the tree-based base estimators.

Acquisition function:
The effect of changing the acquisition function for different base estimators has also been investigated; see Figures S3 to S5. In general, there is not a big difference in the performance of the acquisition functions, though when using a GP base estimator and L-BFGS minimization, the gp_hedge method seems to converge at a higher value. In general, the PI acquisition function gives smaller fluctuations, which might suggest that it is less explorative under the default conditions.

Kernel:
The default kernel, when using a GP base estimator, is a Matérn kernel with length scale bounds (lsb) of 0.00001 to 100000 and ν = 1.5. The effect of changing the hyperparameters of the Matérn kernel is shown in Figure S6. No great differences are observed; however, length scale bounds of 0.1-10 seem to give the highest observed maximum yields.

Robot setup
Photo from inside the robot

In Figure 52, the inside of the robot is shown. The reagents are placed in different racks and the reaction vials are placed in the iSynth Reactor, which is capable of stirring and regulating the temperature. In the upper left corner is the Gripper tool, which opens and closes the reaction vials. In the upper right corner is the bottom of the needle head, which is responsible for all volumetric transfers. With the current setup, the robot is not capable of handling solids.
With the reaction and reaction space ready, the synthetic workflow was designed and the robot was programmed accordingly. An illustration of the workflow is depicted in Figure 53, and the script for the robot can be found in Appendix A.2.4. When the script is initiated, the robot fetches the first reaction conditions from the database and calculates the volumes it has to add from the fetched parameters. Next, the sugar-DMF solution is mixed with the solvent, and thereafter the base is added. After stirring for an amount of time determined by the t1 parameter, the benzoylating reagent is added. After one hour, the reaction is quenched with methanol and a chromatogram is obtained. From the area under the peak corresponding to the desired compound, a score is calculated using a predefined scoring function. This result is given to the algorithm, which then proposes a new experiment. The robot then carries out the next experiment in the database. All yields were calculated using standard curves, which can be found in Appendix A.

Figure 53: Flow diagram showing the experimental procedure performed by the robot.
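The loop above can be sketched as follows. All names (fetch_conditions, run_experiment, score_from_area) and all numbers are hypothetical placeholders, not the robot's actual API or standard-curve values; the real script is the one in Appendix A.2.4.

```python
# Hypothetical sketch of the closed-loop robot workflow. Stub functions
# stand in for the actual robot operations so the sketch is self-contained.

def fetch_conditions(database):
    # The robot would fetch the next proposed experiment from the database.
    return database.pop(0)

def run_experiment(conditions):
    # Stand-in for: mix sugar/DMF with solvent, add base, wait t1, add
    # benzoylating reagent, react 1 h, quench with methanol, run HPLC.
    return {"area": 42.0}                  # placeholder peak area

def score_from_area(area, slope=0.5):
    # Yield from a standard curve (area -> yield); slope is illustrative.
    return slope * area

database = [{"t1": 5, "base_eq": 1.2}]     # illustrative proposed conditions
scores = []
while database:
    conditions = fetch_conditions(database)
    chromatogram = run_experiment(conditions)
    scores.append(score_from_area(chromatogram["area"]))
# Each score would be passed back to the optimizer, which proposes the
# next experiment and appends it to the database, closing the loop.
```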

Principal Component Analysis of Bases
To convert the bases into meaningful input, a principal component analysis was performed. This was done using 10 different descriptors. The descriptors and their sources are displayed in Table 10. In some cases, a descriptor was not available at the specified source and was found elsewhere. Values, references, and detailed sources can be found in Table S2.

Detailed description of quenching and HPLC
The reactions were quenched with methanol until a total volume of 6 mL was reached. Initially, insufficient mixing after quenching was observed using only the built-in stirring plate when the 7 mL vials were filled to 6 mL. To obtain sufficient mixing of the quenched reaction mixture, the vial was stirred for 2 minutes at 400 rpm, after which the mixture was "manually" mixed by the robot. The mixing was carried out by first transferring 1 mL of the mixture from the top to the bottom of the vial twice, and thereafter transferring 1 mL from the bottom to the top twice. After the mixing, the mixtures were stirred for an additional 5 minutes at 300 rpm. The approach was validated by recording a chromatogram after the mixing was performed by the robot. Thereafter, the quenched vial was removed and shaken vigorously by hand, and a second chromatogram was recorded. No change in peak integrals occurred, indicating that full mixing was obtained during the automated mixing.
After quenching, 0.15 mL of the quenched solution was further diluted with methanol (1.3 mL) in a Waters HPLC vial. As it is not possible to stir the HPLC vial rack, an approach like the one described above was used to obtain a homogeneous mixture in the HPLC vials. Mixing by transferring 0.5 mL of the mixture from the bottom of the vial to the top twice was found to be sufficient, i.e., no change in integrals was observed after shaking the vials by hand.

Example of HPLC chromatogram
Base descriptor sources

Table S2. Detailed overview of the origin of the base descriptors.

Dependency Plots
Cl1 - Mono-gluco

Figure S9a. Dependency plot of the estimated objective function. Black dots correspond to experiments carried out. Red dots and red dotted lines indicate the optimal reaction conditions found.

Figure S9a shows a dependency plot, which shows how the different parameters influence the yield when all other parameters take the value they have in the best set of conditions. In this way, it is possible to get insight into what the optimizer thinks the objective (the true function) looks like. In the contour plots, yellow corresponds to minima of the objective (high yield) and dark blue to low yields. The red dot in the contour plots and the dotted red line in the 2D plots correspond to the observed optimal reaction conditions (experiment 63). It is seen that the influences of the concentration, the equivalents of benzoylating reagent, and t1 are very similar. The equivalents of base seem to be more important than the other continuous parameters. For the solvents, MeCN/THF seems to be the best, with pure MeCN second best. Many of the optimal conditions' reaction parameters are either at or near the ends of their defined ranges.
…conditions from Cl2 (Experiment 63). The same benzoylating reagent, the same base, similar equivalents of base, and a similar concentration are used. The fact that the Bayesian optimization algorithm can find near-optimal conditions, even with the uncertainties caused by the improper mixing, illustrates its robustness.

All experiments from Cl2 that were within the new reaction space were given to the optimizer as initial data points. The dotted red line shows the maximum yield from Cl2. Some experiments surpass the optimal yield observed in the previous closed-loop, though only small improvements are observed. The convergence plot shows that convergence is reached after 15 experiments.

Pie charts of compound distribution
The pie charts show the compound distribution in each experiment of Closed-loop 3. It was found that the major dibenzoylated product is the 2,6-dibenzoylated product, which is in accordance with Williams and Richardson's order of reactivity (Section 1.2).

In the following, a brief introduction to Bayesian optimization is provided. The focus will be on the different elements of the algorithm, without going into detail with the mathematics, as this is thoroughly described elsewhere.3 The descriptions in this section are based on Francesco Archetti and Antonio Candelieri's Bayesian Optimization and Data Science.3

Bayesian optimization
Bayesian optimization is based on Bayes' theorem, which was proposed by Thomas Bayes in 1764.4 Later, the idea behind Bayes' theorem was evolved into Bayesian optimization by Krige,5 Kushner,6 and Mockus.7 Bayesian optimization is used to solve problems of the form:

x* = arg min_{x in X} f(x)

where x* is the set of conditions within the space X that minimizes the function f(x). For Bayesian optimization to be a suitable optimization strategy, three key criteria have to be fulfilled:8
1. A closed-form expression for f(x) is not known, nor are its gradients.
2. f (x) is expensive to evaluate.
3. Evaluations of f (x) are noisy.
If we want to apply Bayesian optimization to chemical reactions, we have to argue that a chemical reaction fulfills these criteria. This can easily be done by looking at a toy example. Let us say we want to maximize the yield of the reaction A + B → C by varying the reaction conditions x (e.g. concentration, solvent, the equivalents of B). Since we want to maximize the yield of C (minimize 100 − yield%), the reaction optimization is by definition a minimization problem. Since we cannot write a closed-form mathematical expression for the yield (f(x)) as a function of the reaction conditions, f(x) is indeed a black-box function, which fulfills criterion 1. If we want to evaluate f(x), i.e. determine the yield obtained from a specific set of reaction conditions, we have to carry out the experiment, which is time-consuming, and hence expensive, fulfilling criterion 2. The last criterion states that evaluations of f(x) are noisy, which chemical reaction optimization also conforms to, as duplicating an experiment rarely gives the exact same yield. To sum up, chemical reaction optimization is well suited to Bayesian optimization.
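The minimization framing of the toy example can be made concrete in a few lines; the yields below are invented purely for illustration.

```python
# Toy illustration: reaction optimization as minimization. The optimizer
# minimizes 100 - yield%, so a higher yield gives a lower objective value.

def objective(yield_percent):
    return 100.0 - yield_percent

# Illustrative observed yields for two (hypothetical) condition sets.
observations = {"conditions_1": 35.0, "conditions_2": 82.0}

best = min(observations, key=lambda c: objective(observations[c]))
assert best == "conditions_2"      # highest yield = lowest objective
assert objective(82.0) == 18.0
```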

Bayesian optimization is in fact not very different from the human interpretation of the world. We base our beliefs about most things on prior observations; however, new observations might lead us to change, or update, our beliefs. The model that describes the algorithm's beliefs about the task is called the surrogate model. The surrogate model also contains information about how sure the algorithm is about its belief at a given point, and using the surrogate model it is possible to come up with a qualified guess of where the minimum is.
In the following, three key components of Bayesian optimization algorithms are discussed: the base estimator, the kernel, and the acquisition function.
Base Estimator: The base estimator is a probabilistic surrogate model. It is this model that incorporates prior knowledge and models the objective function and its uncertainties. The most widely used base estimator is a Gaussian process (GP), but different tree regression models are also commonly used.3,9 For a given set of data points, there is an infinite number of functions that will fit the data. Gaussian processes work by assigning a probability to each of these functions, providing a mean function with uncertainties. A Gaussian process is defined by a mean vector µ and a covariance matrix. The covariance matrix describes how each data point varies with respect to the other data points and specifies the distribution of the functions. The covariance matrix is calculated using a pre-defined covariance function, which is also termed the kernel.
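As a sketch, the covariance matrix can be built by evaluating the kernel on every pair of data points. A squared exponential kernel and illustrative inputs are used here; any kernel could be substituted.

```python
import math

# Sketch: building the covariance matrix of a Gaussian process from a
# kernel. The inputs are illustrative one-dimensional points.

def squared_exponential(x1, x2, length_scale=1.0):
    """k(x1, x2) = exp(-(x1 - x2)^2 / (2 * l^2))"""
    return math.exp(-((x1 - x2) ** 2) / (2 * length_scale ** 2))

xs = [0.0, 0.5, 2.0]                       # illustrative data point inputs
K = [[squared_exponential(a, b) for b in xs] for a in xs]

# The diagonal is 1 (each point is perfectly correlated with itself),
# and off-diagonal entries decay as the points move further apart.
assert all(K[i][i] == 1.0 for i in range(3))
assert K[0][1] > K[0][2]
```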

Kernel:
The kernel encodes assumptions about the function that is approximated. Many different kernels exist, e.g. the squared exponential kernel, the rational quadratic kernel, and the Matérn kernels. The Matérn kernels have two hyperparameters, ν and l, with l describing the length of the correlations, i.e. determining whether the data is only strongly correlated to nearby data points or also to data points far away. ν controls the smoothness of the function.
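For ν = 1.5 (the default noted in the benchmark section), the Matérn kernel has a simple closed form, which makes the role of the length scale l easy to see in a short sketch; the distances used are illustrative.

```python
import math

# Matérn kernel for nu = 1.5, as a function of the distance r between
# two points: k(r) = (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
# where l is the length scale.

def matern_15(r, length_scale=1.0):
    s = math.sqrt(3.0) * r / length_scale
    return (1.0 + s) * math.exp(-s)

assert matern_15(0.0) == 1.0                  # identical points
assert matern_15(0.1) > matern_15(1.0)        # correlation decays with distance
# A longer length scale keeps distant points more strongly correlated:
assert matern_15(1.0, length_scale=10.0) > matern_15(1.0, length_scale=0.1)
```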

Acquisition function:
The acquisition function guides the search for an optimum using the probabilistic surrogate model generated by the base estimator, by describing the potential gain of evaluating a given point. Again, many different acquisition functions are available, including the probability of improvement (PI),6 expected improvement (EI),10 and lower confidence bound (LCB).11 One of the major challenges for the acquisition function is to balance exploration and exploitation. An exploring acquisition function will query the points with high uncertainty and explore all parts of f(x), thereby not utilizing the previous data points in an efficient way. An exploiting acquisition function will stay close to the data point that is the current optimum, and hence it is likely to get stuck in a local optimum. The hyperparameter that controls the trade-off between exploration and exploitation is denoted κ (kappa). For instance, LCB is, in the case of minimization, given by:

LCB(x) = µ(x) − κσ(x), κ ≥ 0

where κ = 0 corresponds to pure exploitation. When an acquisition function is chosen, a method for minimizing the acquisition function also has to be determined. This is commonly done using either random sampling or a genetic algorithm. If the surrogate function is a Gaussian process, optimization algorithms, like the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, can also be used for minimizing the acquisition function.
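The exploration/exploitation trade-off can be illustrated with a short sketch of the LCB acquisition function for minimization; the surrogate mean and standard deviation values below are illustrative stand-ins for a fitted Gaussian process.

```python
# Sketch of the LCB acquisition function for minimization:
# a(x) = mu(x) - kappa * sigma(x). The candidate points are minimized
# over a(x), so a lower value means a more attractive candidate.

def lcb(mu, sigma, kappa=1.96):
    return mu - kappa * sigma

# Candidate A: low predicted objective, low uncertainty (exploit).
# Candidate B: mediocre prediction, high uncertainty (explore).
a = lcb(mu=20.0, sigma=1.0)   # 20 - 1.96*1 = 18.04
b = lcb(mu=30.0, sigma=8.0)   # 30 - 1.96*8 = 14.32
assert b < a                  # with kappa = 1.96 the uncertain point wins

# With kappa = 0 only the mean matters, i.e. pure exploitation:
assert lcb(30.0, 8.0, kappa=0) > lcb(20.0, 1.0, kappa=0)
```

Raising kappa therefore shifts the search toward high-uncertainty regions, while kappa = 0 always queries the point with the lowest predicted mean.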