Optimization of Formulations Using Robotic Experiments Driven by Machine Learning DoE

Summary
Formulated products are complex mixtures of ingredients whose time to market is difficult to shorten because general predictive physical models for the desired properties are lacking. Here, we report the coupling of a machine learning classification algorithm with the Thompson sampling efficient multiobjective optimization (TSEMO) algorithm for the simultaneous optimization of continuous and discrete outputs. The methodology is successfully applied to the design of a formulated liquid product of commercial interest for which no physical models are available. Experiments are carried out in a semiautomated fashion using robotic platforms triggered by the machine learning algorithms. The procedure finds nine suitable recipes meeting the customer-defined criteria within 15 working days, outperforming human intuition in the target performance of the formulations.


INTRODUCTION
Liquid formulated products have a large number of applications in chemical, energy, and environmental industries and include a wide range of products, such as pharmaceuticals, food, fuels, cosmetics and personal care products, and polymers. [1][2][3] They consist of mixtures of ingredients and are processed to meet specific target properties. 4 Although some ingredients might be completely miscible and form a homogeneous phase, many formulated products of industrial interest are, in fact, emulsions of immiscible phases, stabilized with additional ingredients, such as ionic surfactants, 2,5,6 and using specific process conditions. As a result, one of the main objectives in liquid formulated product design is to find a proper composition that yields a stable emulsion, with no phase separation, meeting specific functionalities or customer-defined properties, such as a certain viscosity, color, or fragrance. 7 Unfortunately, such a product can be very complex from a physicochemical point of view, and often, no direct correlations are available to predict the target properties given the composition of the formulation. As a consequence, the problem of the design of formulated products is of great interest to both academia and industry, and most relevant publications have highlighted the need for a general framework for their development. 8 Pioneering work in this area has been carried out by the group of Prof. Rafiqul Gani.
According to these sources, 4,[9][10][11][12][13] the general approaches to formulated product design consist of three main strategies: (1) a trial-and-error approach, based on the expertise of the operators and completely dependent on experimental data (this approach is highly resource and time demanding but is still widely adopted in most cases); (2) a model-based approach, in which a few options to be experimentally tested are generated using available models to predict the target properties (here, the resources employed are minimal, but accurate and reliable physical models are needed); and (3) integrated approaches combining model predictions and experimental data. Starting from the general framework proposed by Cussler and Moggridge, 14 Conte et al. 4 reported an integrated multistage approach. First, the main performance criteria and the properties determining them are taken into account, based on a priori knowledge, setting constraints and restricting the search space. Later on, available models can be used to design formulations with a specific target function, such as a fixed viscosity and stability, and the tradeoff between different candidates is evaluated. Experimental data might be used at different stages in the framework to verify the validity of the predictions or to reject some of the candidates. This approach still relies on the availability of predictive models.
Unfortunately, for complex commercial products with a large number of ingredients, multiple empirical or semi-empirical models might be available, and in some cases, none of them can successfully predict the experimental outcomes. As an example, predicting viscosity of multicomponent emulsions can be a rather challenging task when little information is available about the complex interactions of surfactants, especially for high concentrations of active ingredients. An overview of the different models to predict viscosity of formulated products can be found elsewhere, 3,15,16 highlighting the limitations of each model and the strong assumptions that often are invalid for industrially relevant formulations.
An alternative methodology for developing formulations is to employ an intelligent design of experiments (DoE) 17 methodology that would be capable of learning key features of a particular formulation during an experimental campaign. Statistical DoE methods, such as two-level full factorials, 18 fractional factorials, 19 Plackett-Burman method, 20 mixed-level fractional factorial design, 21 D-optimal design, 22 and others, 23 were applied in the field of formulated products development. The DoE methods are typically used to determine a statistical surrogate model by generating samples that cover the design space, 24 but these are usually applied to relatively simple models, focusing on main effects or a main effect plus identification of interactions with few factors. 25 Such DoE methods are inefficient when a complex model (in the case of a large number of ingredients and complex interactions) is required to achieve predefined design objectives. 26 More recently, machine learning algorithms have emerged as a potential solution for challenging tasks, such as the discovery of new chemical reactivity and prediction of reaction outcomes. 23 Among these methods, Bayesian optimization active learning provides an adaptive paradigm to sample the design space more efficiently for identifying the optimum. For example, Zhang et al. presented a Bayesian optimization framework for design of materials, where a latent variable approach was integrated with Gaussian process (GP) modeling in order to support a variety of materials design applications with mixed qualitative and quantitative design variables. 27 This method was efficient and effective in optimizing material constituents of a hybrid organic-inorganic perovskite.
Another common problem in formulated product design is finding tradeoffs between different conflicting objectives, which can be a combination of continuous and discrete variables. To solve this problem, the combination of surrogate models and multiobjective optimization has recently presented an exciting opportunity to maximize the amount of useful information gained per experiment. Fitzpatrick et al. applied a scalarization approach, where multiple objectives are combined into a single objective function with different weightings, in order to simultaneously optimize throughput, conversion, and consumption for a model reaction. 28 A scalarization approach is very useful and computationally efficient, but it is frequently difficult to define suitable weights objectively, especially without sufficient a priori knowledge, not to mention that even a minor change to the weightings can result in significant changes to the obtained solution. 29 On the other hand, evolutionary algorithms, such as non-dominated-sort genetic algorithms, are designed to converge on the Pareto front using a Pareto dominance ranking system. 30 By coupling GPs with genetic algorithms, several multiobjective optimization schemes have been proposed and applied in chemical reaction and process design.
The multiobjective active learner (MOAL) 31 was proposed for expensive-to-evaluate multiobjective optimization tasks and successfully applied to the discovery of emulsion polymerization recipes with 14 input variables and two desired objectives (namely, full monomer conversion and a specific mean particle diameter). 32 Similar machine learning methodologies, such as Pareto efficient global optimization (ParEGO) 33 and expected hypervolume improvement (EHI), 34 have also been proposed in the literature. In a previous study, some of the authors developed the Thompson sampling efficient multiobjective optimization (TSEMO) algorithm, which randomly samples from the GPs and uses the NSGA-II (non-dominated sorting genetic algorithm II) algorithm to identify the Pareto front of each random sample. 35 Although a number of recent studies have explored the optimization of chemical systems in the presence of continuous and discrete input variables, [36][37][38][39] there are no examples of multiobjective optimization of both discrete and continuous outputs.
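For a single objective, the Thompson-sampling idea at the heart of TSEMO can be illustrated as follows: draw one random function from the GP posterior over a grid of candidate inputs and propose the minimizer of that draw (TSEMO does this for every objective at once and runs NSGA-II on the draws). A minimal numpy sketch under simplifying assumptions — a fixed squared-exponential kernel, toy one-dimensional data, and noise-free observations, none of which come from the paper:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # squared-exponential kernel between two 1-D input vectors
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.4, 0.9])      # observed inputs (toy data)
y = np.sin(6 * X)                  # observed objective values
grid = np.linspace(0, 1, 200)      # candidate inputs

# GP posterior mean and covariance on the grid (noise-free, small jitter)
K = rbf(X, X) + 1e-8 * np.eye(len(X))
Ks = rbf(grid, X)
alpha = np.linalg.solve(K, y)
mu = Ks @ alpha
cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)

# Thompson sampling: draw one random function from the posterior
# and propose the grid point that minimizes that draw
f_sample = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))
x_next = grid[np.argmin(f_sample)]
```

Because each draw is random, repeated proposals naturally balance exploitation (low posterior mean) and exploration (high posterior variance).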
With the recent advances in laboratory automation, a large amount of highly reproducible data can be made available to train machine learning models that correlate multiple target outputs to complex high-dimensional inputs. 40,41 Such models can be integrated into DoE algorithms that are capable of suggesting suitable conditions to optimize the process with respect to the targets of interest and combined with robotic platforms for the generation and analysis of experimental samples. 42,43 In this paper, we present such a closed-loop optimization system for the multiobjective optimization of a commercial formulated product. The interaction between a classification algorithm and an efficient multiobjective optimization algorithm for continuous variables makes it possible to simultaneously meet discrete (namely, formulation stability) and continuous (namely, viscosity, turbidity, and price) targets. The proposed methodology finds suitable solutions within a relatively short time period (i.e., 15 working days) using little empirical prior knowledge about the physical system to define the constraints of the input variables. This makes the proposed pipeline particularly suitable for the early stages of formulated product design.

RESULTS AND DISCUSSION
Classification Algorithm for Stability Prediction
As explained in detail in Experimental Procedures, optimization of the formulated product under investigation was carried out with respect to different conflicting objectives, both discrete, i.e., stability, and continuous, i.e., turbidity, viscosity, and price. A classification algorithm was developed and trained to classify the suggestions of the coupled TSEMO algorithm according to their stability, avoiding the waste of resources and time caused by generating samples that exhibit the undesired phase separation. The smart sampling classifier was developed as described in Experimental Procedures. An initial dataset of 48 samples was generated using a Latin hypercube design as a space-filling technique. 44 The samples were generated by two complete rounds of the robotic platforms R1 and R2. The initial dataset was used to train the classifier, which then generated 12 new potential experiments to be performed at each iteration. The newly generated samples were added to the dataset, and the procedure was repeated for four complete cycles. The initial number of 48 experiments was chosen according to the full capacity of the robotic platform and the previously reported 10·d rule of thumb, with d representing the number of input variables under consideration. 45 As described in detail in Experimental Procedures, a naive Bayes classifier was chosen due to its simplicity, efficiency, and accuracy in classification problems. The performance was evaluated as the average of the prediction accuracies of the 5-fold cross-validation between the classes of stable and unstable formulations. 46 A bi-dimensional representation of the initial dataset is shown in Figure 1 (run 1), together with the first 12 suggested experiments.

(Figure 1 caption: Four consecutive iterations of the active learning classifier show the efficiency of the algorithm in suggesting new conditions (red crosses) at the boundary between the stable and unstable domains. The repulsion criterion described in Experimental Procedures results in suggestions that are not too similar to each other, for efficient exploration. Given the 5-dimensional space, LDA was used as a dimensionality reduction technique for visualization of the data. A direct comparison with the suggestions using random sampling is provided in Figure S6.)

Linear discriminant analysis (LDA) was found to be a suitable dimensionality reduction algorithm for better visualization of the dataset. The results of the iterative training of the classifier are shown in Figure 1. As one can see, the suggested experiments at each iteration are nicely distributed at the border between the two clusters, which represents the area with the highest uncertainties. Moreover, the repulsive criterion adopted for the batch sequential design provided a better exploration of the space. For the sake of comparison, the same procedure was repeated adopting a random sampling strategy; the full list of experiments suggested by random sampling is reported in Table S2.
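The Latin hypercube initialization used for the initial 48-sample dataset can be sketched in a few lines of numpy; the per-ingredient concentration bounds below are illustrative placeholders only, not the actual constraints of the study:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Space-filling design in [0, 1)^d: each dimension is split into n
    equal strata, one point is jittered inside each stratum, and the
    strata are shuffled independently per dimension, so every 1-D
    projection of the design is evenly covered."""
    pts = (np.arange(n)[:, None] + rng.random((n, d))) / n
    for j in range(d):
        pts[:, j] = pts[rng.permutation(n), j]
    return pts

rng = np.random.default_rng(1)
design = latin_hypercube(48, 5, rng)   # 48 initial samples, 5 ingredients

# scale to illustrative concentration bounds in g/L (hypothetical values)
lower = np.zeros(5)
upper = np.array([15.0, 15.0, 15.0, 2.0, 2.0])
samples = lower + design * (upper - lower)
```

Compared with purely random sampling, the stratification guarantees that no region of any single input's range is left unsampled in the initial dataset.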
The prediction accuracy was evaluated using a support vector machine (SVM) classifier. In fact, as the naive Bayes model was used in the active learning algorithm, we might have collected data tailored solely to the model built by the naive Bayes classifier. Figure 2 shows the prediction accuracy of the active learning algorithm and of random sampling at each iteration, demonstrating the superiority of the former.

Optimization Results
The same 96 data points collected to train the classification algorithm described in the previous section were used to initiate the TSEMO algorithm. Once initiated, 16 iterations of the optimization procedure were carried out, generating a total of 128 samples within 15 working days. As described in Experimental Procedures, the algorithm suggests conditions in order to find the best predictions of the actual Pareto front, minimizing uncertainties and finding a compromise between the minimization of the conflicting objective functions. The target properties for the specific product under consideration were (1) stability and low turbidity, (2) honey-like viscosity (target 3 Pa·s at a shear rate of 10 s⁻¹ and 25 °C), and (3) low price of the adopted ingredients. Interestingly, the percentage of suggested unstable formulations presenting phase separation significantly decreased over the iterative optimization loop, and the algorithm stopped suggesting unstable conditions from the 12th iteration, as shown in Figure 3. This can be ascribed to the fact that unstable samples often present a higher value of turbidity, which is one objective chosen for the optimization procedure. However, the integration of the stability classifier helped to avoid running experiments giving unstable products, saving time and reducing the waste of resources.

(Figure 2 caption: The prediction accuracy of the Bayesian classifier trained using active learning is superior to the one observed using random sampling. The so-trained classifier is embedded in the general algorithmic framework as described in Experimental Procedures. Prediction accuracy has been calculated as the mean value, i.e., the average correct rates of the five times 5-fold cross-validation, with the corresponding standard deviations, from the SVM classifier.)

Cell Reports Physical Science 2, 100295, January 20, 2021
The experimentally collected data were automatically analyzed to provide a full list of the non-dominated experimental solutions, which represent the experimental Pareto front of the dataset. Non-dominated solutions were defined as those for which an improvement in one objective would lead to a worsening in at least one other objective. The full list of 32 non-dominated solutions was identified, 11 of which were in the training dataset and 19 of which were in the suggested experiments (Table S3). In Table 1, we report only the non-dominated solutions meeting the viscosity criterion. The full dataset is reported in Tables S1, S2, and S4.
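The non-dominated filtering described above can be implemented with a simple pairwise dominance check. A sketch for minimized objectives follows; the numbers are toy values, not data from the study:

```python
import numpy as np

def non_dominated(F):
    """Return a boolean mask of the Pareto-optimal rows of F, where each
    row holds the objective values of one experiment (all minimized).
    A row is dominated if some other row is no worse in every objective
    and strictly better in at least one."""
    n = len(F)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # does any row dominate row i?
        dominates = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominates.any():
            keep[i] = False
    return keep

# toy objectives: (turbidity, squared distance to target viscosity, price)
F = np.array([[10.0, 1.0, 2.0],
              [12.0, 1.5, 2.5],   # dominated by the first row
              [ 5.0, 4.0, 1.0]])
mask = non_dominated(F)           # → [True, False, True]
```

This O(n²) scan is more than fast enough for the few hundred experiments generated in a campaign of this size.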
As one can see, although a good number of clear formulations were already present in the training set, the algorithm was able to explore the input space more efficiently, finding alternative solutions with a significant reduction in price. The obtained solutions were compared with the best solution from a dataset guided by experts' intuition. In a previous independent experimental campaign, formulation scientists from BASF used factorial design 47,48 for the initial screening of the design space and, based on the results, designed the final batch of experiments for testing and optimization (280 experiments in total). In that campaign, the closest solution to the target was found using the following recipe: S1 = 4.00 g/L, S2 = 5.00 g/L, S3 = 6.00 g/L, P1 = 2.00 g/L, T1 = 2.00 g/L. This gave a homogeneous formulation with a viscosity of 9,270 mPa·s, a turbidity value slightly higher than 200 nephelometric turbidity units (NTU), and a cost of 2.19 $/L, proving that the proposed automated methodology outperformed the expert-guided campaign both in closeness to the target viscosity and in cost.

In order to evaluate the predictive capability of the trained surrogate models, the predicted Pareto front was plotted together with the experimental optima. In Figure 4, we report the predicted non-dominated solutions surface, the data used for the initial training of TSEMO, all suggested experiments, and the non-dominated experimental optima. As shown, the predicted Pareto front gives a good approximation of the actual best solutions, which lie in the neighborhood of the calculated surface. A posteriori analyses of the shape of the Pareto front can also give human operators some physical insight about the system. From the data reported in Figure 4, it is clear that a large number of solutions close to the Pareto front have a low value of turbidity. Also, as a general trend, as viscosity approaches the target value, the price and turbidity of the system tend to increase.
This would suggest that, on average, the most expensive ingredients (S2, T1, and P1) would be the ones responsible for an increase in viscosity and turbidity of the samples, which was found to be the case for most of the samples in the dataset. Of course, these are only general semiquantitative indications, which do not reveal any information about more complex interactions that might occur between the different ingredients at different concentrations. However, it is worth noting that the presented methodology can also offer some guidelines to experts for further improvements and considerations about the actual physical role of the input variables on the desired properties of the product. Different angles of view of Figure 4 are provided in Figure S7.
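One quantitative handle on such input-output relevance is the set of GP lengthscales. Assuming a squared-exponential automatic relevance determination (ARD) kernel (the paper's exact kernel choice is described in Experimental Procedures), each input has its own lengthscale, and a short lengthscale means the objective varies quickly along that input. A small numpy illustration:

```python
import numpy as np

def ard_kernel(x1, x2, lengthscales):
    """Squared-exponential ARD kernel: the correlation between two
    formulations decays fastest along inputs with small lengthscales,
    so a small lengthscale flags an influential ingredient."""
    z = (np.asarray(x1, float) - np.asarray(x2, float)) / np.asarray(lengthscales, float)
    return float(np.exp(-0.5 * np.sum(z ** 2)))

# two inputs: index 0 "relevant" (short lengthscale), index 1 nearly inert
ls = [0.1, 10.0]
k_move_relevant = ard_kernel([0.0, 0.0], [0.2, 0.0], ls)  # correlation collapses
k_move_inert = ard_kernel([0.0, 0.0], [0.0, 0.2], ls)     # correlation stays near 1
```

Moving the same distance along the short-lengthscale input destroys most of the correlation, while the same move along the long-lengthscale input barely changes it; this is the mechanism behind the hyperparameter reading discussed next.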
In this regard, the values of the hyperparameters of the trained GPs can also provide information about the relevance of the input variables for each objective function. 49 For the adopted surrogate models, a lower value of the hyperparameter of an input variable indicates a greater contribution to the objective. The values of the hyperparameters are reported in Table 2. The analysis of the hyperparameters again suggests a stronger influence of T1 on the viscosity and turbidity; more complex interactions between S1 and T1 seem to be responsible for variations in the viscosity value, whereas S2 also seems to have a relevant effect on the turbidity of the samples. This kind of qualitative information may lay the foundation for integrated approaches for simultaneous black-box optimization and physical knowledge generation using robotic platforms, in combination with other recently published promising methodologies. [50][51][52]

In conclusion, time-saving and highly reproducible robotic experiments were coupled with machine learning algorithms for the efficient optimization of the recipe of a complex formulated product of industrial interest. The optimization procedure outperformed human experts' intuition and suggested more convenient, lower-priced solutions within 15 working days. The coupling of a naive Bayes classifier with the TSEMO algorithm allowed us to take binary discrete outputs into account in the optimization procedure while avoiding the waste of time and material resources. Despite the fact that the optimization was carried out in the absence of predictive physical models, a posteriori analysis of the Pareto front and of the hyperparameters of the surrogate statistical model gave some important qualitative information about the physics of the system.
Future research will focus on the connection between data-driven statistical models and the generation of physical models, which would give an insight into the physics of the system and lead to an optimal design of similar types of products. Moreover, future efforts will address the problem of optimization under unknown constraints and its integration in formulated product design to minimize the need for prior knowledge about the system.

Resource Availability

Lead Contact
The lead contact is A.L. (aal35@cam.ac.uk).

Materials Availability
This study did not generate new unique reagents. The described surfactants, polymer, and thickener were provided by BASF.

Data and Code Availability
The file repository used for this work can be found at the following GitHub page: https://github.com/sustainable-processes/centripeta. All data are provided in Tables S1, S2, and S4.

Case Study and Materials
The case study under consideration is a commercial formulation consisting of three different surfactants (S1, Texapon SB3; S2, Dehyton AB30; and S3, Plantacare 818), a polymer (P1, Dehyquart CC7), and a thickener (T1, Arlyon TT). The pH was adjusted using citric acid (ACS reagent, ≥99.5%) from Sigma-Aldrich, used as received. Turbidity standards (1, 2, 5, 10, 100, 500, and 1,000 NTU) were purchased from Sigma-Aldrich.

Optimization Procedure
A general scheme for the optimization procedure is given in Figure 5, in which continuous lines represent the materials flow and dashed lines the information flow. The formulation was simultaneously optimized with respect to viscosity, turbidity, stability, and price. At each iteration, a batch of eight different suggested samples is prepared using the robot R1; 15% of the batches were randomly run in triplicate to ensure repeatability. The so-prepared samples are then processed to generate the final product. The samples are successively transferred to the robot R2, which automatically performs pH, turbidity, and stability tests. The samples are finally analyzed offline to measure viscosity. Details on the robotic platforms R1 and R2 and the experiments are provided in the next section, Supplemental Experimental Procedures, and Figures S1-S5. The turbidity and viscosity values are used to train surrogate GP models for prediction of the target outputs. The price is calculated analytically as the sum of the unit prices of the ingredients multiplied by their relative amounts. Based on the predictions, the Bayesian TSEMO algorithm generates eight new conditions to be tested in order to find a compromise between exploitation (finding the best conditions to minimize the objectives) and exploration (reducing the uncertainties) of the input chemical space. The temporary suggestions generated in this way are then tested in silico using a classification algorithm to predict which samples would be stable.
The conditions that give unstable formulations according to the classification algorithm are discarded, and the TSEMO algorithm is reused to generate other suggestions, until an entire batch of eight stable conditions is available. The new suggested conditions are finally added to the dataset and used to trigger a new iteration. Details about the TSEMO algorithm and classification algorithm are provided in the following sections.
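The suggest-then-filter loop described above can be sketched as follows. Both `suggest_batch` and `predict_stable` are hypothetical stand-ins (random suggestions and a made-up stability rule) for the real TSEMO step and the trained naive Bayes classifier:

```python
import numpy as np

rng = np.random.default_rng(2)

def suggest_batch(n):
    """Stand-in for a TSEMO suggestion step (hypothetical): samples
    random recipes (S1, S2, S3, P1, T1) in g/L that respect the
    composition constraints used in this work."""
    s = rng.dirichlet(np.ones(3), size=n) * 15.0   # S1 + S2 + S3 = 15
    pt = rng.uniform(0.0, 2.0, size=(n, 2))        # P1, T1 <= 2
    return np.hstack([s, pt])

def predict_stable(batch):
    """Stand-in for the trained naive Bayes classifier (hypothetical
    rule: recipes with too much thickener are flagged unstable)."""
    return batch[:, 4] < 1.8

# keep requesting suggestions, discarding predicted-unstable recipes,
# until an entire batch of eight stable conditions is available
accepted = np.empty((0, 5))
while len(accepted) < 8:
    batch = suggest_batch(8)
    accepted = np.vstack([accepted, batch[predict_stable(batch)]])
accepted = accepted[:8]
```

In the real pipeline, the accepted batch is what gets dispensed by R1 and then fed back into the dataset for the next iteration.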

Robotic Experiments
Sample preparation and analyses were partially automated using two robotic platforms adapted from Salley et al. 53 An image of the platforms R1 and R2 is given in Figure 6. Schematic representations of the different parts and detailed descriptions can be found in Supplemental Experimental Procedures.
Briefly, both platforms consist of a laser-cut rotating wheel that can allocate up to 24 sample vials per batch. In R1, a 3D-printed element, equipped with a variable number of needles, is connected to automated syringe pumps (Tricontinent, Gardner Denver, C-Series) dispensing the five different ingredients. The three surfactants are pre-diluted in water to achieve a concentration of active matter of 20 g/L in the feeding bottles. The pH was pre-adjusted to the desired value of 5.5 using citric acid; this requirement on the pH value was fixed for the specific application of the product under consideration. The polymer (P1) and thickener (T1) were used as received. The system was automated and triggered by a Python script to generate eight samples at each iteration of the optimization procedure. Samples were always prepared according to the following constraints on the concentrations, defined using some semi-empirical prior knowledge about the system: S1 + S2 + S3 = 15 g active matter·L⁻¹; P1 ≤ 2 g/L; and T1 ≤ 2 g/L. The generated samples were then transferred to an incubator (Corning LSE 71L Shaking Incubator), where they were mixed for 2 h at 50 °C and 300 rpm. The obtained formulations were cooled down to room temperature before being transferred to R2. In this second robotic platform, we perform three kinds of analyses in an automated fashion. A pH probe (VWR pH electrode, semi-micro, Pellon junction 662-1767) confirms that no pH changes occur during the process; no significant deviation from the target pH of 5.5 was recorded at any time. A built-in turbidity sensor is used to measure the turbidity value in NTU; calibrations with turbidity standards were carried out every 3 days. Finally, an automated camera was used to take pictures of the samples and discriminate between stable homogeneous samples and unstable formulations presenting phase separation. In this work, phase separation was detected offline by a human operator; however, automated image analysis can be integrated in the future. The samples were tested offline to measure viscosity at a shear rate of 10 s⁻¹ and 25 °C using a rotational viscometer (ARES Rheometric Scientific, strain controlled, Couette configuration).

Classification Algorithm
Classification is a type of model that assigns labels to regions of the parameter space given only a few known labeled training data. 54 The algorithm used in this study was developed based on multiple active learning methodologies, which have been developed and successfully applied to image classification, 55 music annotation, 56 and text categorization. 57 The method was applied to our case study to distinguish between stable and unstable formulations faster than by randomly accumulating and searching experimental evidence. A naive Bayes classifier and an uncertainty-based sampling strategy were adopted. A flowchart for this framework is shown in Figure 7.
A small set of initial data is needed to first train the model and to generate possible experiments for the next step. The trained classifier can then predict the outcome of these experiments and select the most uncertain one (i.e., the one with the lowest confidence in the prediction). The selected experiment is then performed on the real system, and the result is added to the dataset, which is used to train a new classifier. The process is repeated until a given termination criterion is met. The final dataset collected in this way should be more informative than one built using a non-active acquisition method. In this work, a batch sequential design was used, suggesting 12 different experiments at each iteration. According to Bayes' theorem, the probability of a sample x belonging to class c is given by Equation 1, 58 with the assumption that all attributes are independent given the value of the class variable (Equation 2):

p(y = c | x) = p(x | y = c) p(c) / p(x), (Equation 1)

p(x | y = c) = ∏_i p(x_i | y = c). (Equation 2)

Because features S1, S2, and S3 sum to a constant, Equation 2 was modified to consider subsets of independent features, with each subset following a joint distribution.
The posterior distribution is then given by Equation 3:

p(y = c | x) ∝ p(c) ∏_j p(x_j | y = c), (Equation 3)

where j denotes each subset.
In this work, there are three subsets of features: (1) features S1, S2, and S3, which sum to a constant value and follow a Dirichlet distribution with class-conditional parameters α_c (this subset is denoted x_0); and (2, 3) the remaining features, each modeled by a distribution with class-conditional parameters μ_c and σ_c. The parameters were estimated by maximum likelihood estimation:

max_θ log p(y = c | x), (Equation 8)

where θ = (α_c, μ_c, σ_c). μ_c and σ_c have analytical solutions, obtained by setting the derivative of the log likelihood equal to 0; α_c is found by using the mean precision algorithm. 59 Given the nature of the classification algorithm, each batch at a given iteration would likely be made of similar experiments. Therefore, the algorithm assigns a score representing the importance of each sample in terms of local uncertainty and global exploration. From the score, we assign a probability to each sample according to a pair of predefined probabilities for local uncertainty and global exploration.
The local uncertainty is measured by the Shannon entropy 60:

H(x_i) = −Σ_{k=1}^{c} p̂_{i,k} log p̂_{i,k},

where k indexes the classes, p̂_{i,k} is the predicted probability of sample i belonging to class k, and c is the number of classes. The score for global exploration rewards candidate experiments that are distant from the samples already selected, following the repulsion criterion described above.
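The entropy-based selection of the most uncertain candidate can be sketched as follows; the class probabilities below are made-up numbers, not data from the study:

```python
import math

def shannon_entropy(probs):
    """H = -sum_k p_k log p_k: the local-uncertainty score, which is
    highest when the classifier is least sure of the label."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# predicted (P(stable), P(unstable)) for four candidate formulations
candidates = [(0.95, 0.05), (0.60, 0.40), (0.50, 0.50), (0.80, 0.20)]
scores = [shannon_entropy(p) for p in candidates]
most_uncertain = scores.index(max(scores))  # → 2, the 50/50 candidate
```

In the batch setting used here, this local score would be combined with the global-exploration score before converting scores into selection probabilities.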

The optimization procedure can be stopped when the maximum number of evaluations is reached or when the operator is satisfied with the obtained results. This can be automated by terminating the algorithm when the objective functions fall below a given tolerance ε.
For this specific case study, the number of suggested experiments at each iteration was set to 8. The input variables were chosen as the concentrations of the ingredients, and the three targets to minimize were the turbidity value in NTU, the squared distance between the measured viscosity and the target viscosity of 3 Pa·s, and the cost of the adopted ingredients ($/L). The latter was calculated as the sum of the unit prices of the ingredients multiplied by the amounts adopted in each sample. As this target is an explicit function of the input variables, the code was modified not to train a GP for this specific target and to use the directly calculated values instead. The targets were chosen according to the indications of the product supplier: an ideal product is a stable, homogeneous, clear formulation with a viscosity as close to 3.00 Pa·s as possible at the lowest possible cost.
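For concreteness, the three minimized targets can be written out directly; the unit prices below are hypothetical placeholders, since the real values are not given in the text:

```python
TARGET_VISCOSITY = 3.0  # Pa·s, at 10 1/s and 25 °C

# hypothetical unit prices in $ per gram of ingredient per liter of product
UNIT_PRICE = {"S1": 0.10, "S2": 0.20, "S3": 0.08, "P1": 0.25, "T1": 0.30}

def objectives(turbidity_ntu, viscosity_pa_s, amounts_g_per_l):
    """Return the three minimized targets: turbidity (NTU), squared
    distance to the target viscosity, and ingredient cost ($/L).
    The cost needs no surrogate model: it is an explicit function
    of the recipe."""
    price = sum(UNIT_PRICE[k] * v for k, v in amounts_g_per_l.items())
    return (turbidity_ntu,
            (viscosity_pa_s - TARGET_VISCOSITY) ** 2,
            price)

obj = objectives(50.0, 2.5, {"S1": 4, "S2": 5, "S3": 6, "P1": 2, "T1": 2})
# → (50.0, 0.25, 2.98) with these placeholder prices
```

Keeping the price analytic, as described above, avoids fitting a GP to a quantity that can be computed exactly.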