Predictive Synthesis of Copper Selenides Using a Multidimensional Phase Map Constructed with a Data-Driven Classifier

Copper selenides are an important family of materials with applications in catalysis, plasmonics, photovoltaics, and thermoelectrics. Despite being a binary material system, the Cu–Se phase diagram is complex and contains multiple crystal structures, in addition to several metastable structures that do not appear on the thermodynamic phase diagram. Consequently, synthetically navigating this complex phase space poses a significant challenge. We demonstrate that data-driven learning can successfully map this phase space in a minimal number of experiments. We combine soft chemistry (chimie douce) synthetic methods with multivariate analyses via classification techniques to enable predictive phase determination. A surrogate model was constructed with experimental data derived from a design matrix of four experimental variables: C–Se bond strength of the selenium precursor, time, temperature, and solvent composition. The reactions in the surrogate model resulted in 11 distinct phase combinations of copper selenide. These data were used to train a classification model that predicts the phase with 95.7% accuracy. The resulting decision tree enabled conclusions to be drawn about how the experimental variables affect the phase and provided prescriptive synthetic conditions for isolating specific phases. This guided the accelerated targeting, in a minimal number of experiments, of klockmannite CuSe, a phase that could not be isolated in any of the reactions used to construct the surrogate model. The reaction conditions that the model predicted for synthesizing klockmannite CuSe were experimentally validated, highlighting the utility of this approach.


Phases of Binary Cu–Se
The surrogate model data plotted in Figure 3 in the main text are visualized differently in Figure S1. The MATLAB function scatteredInterpolant was used to perform interpolation on the 3-D dataset of scattered surrogate-model data. This returns an interpolant F for the experimental dataset, which can be evaluated at a set of query points (xq, yq, zq) in 3-D to produce predicted interpolated values vq = F(xq, yq, zq).

Figure S1. Each reaction in the surrogate model with its respective coded variable values for (a,c) the Ph2Se2 precursor and (b,d) the Bn2Se2 precursor. (c,d) Phase maps given in Figure 3 after a 180° rotation. The resulting phase combination for each reaction is indicated by color, with the key given to the right, corresponding to Table S5.
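The interpolation step can be sketched in Python with SciPy's LinearNDInterpolator, an analog of MATLAB's scatteredInterpolant. The data here are randomly generated stand-ins for the coded experimental variables, not the surrogate-model measurements:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)

# Hypothetical scattered 3-D sample points standing in for coded
# experimental variables (e.g., time, temperature, solvent composition).
pts = rng.uniform(0.0, 1.0, size=(80, 3))
# Stand-in response values; a linear function keeps the example checkable.
vals = pts.sum(axis=1)

# Build the interpolant F from the scattered data ...
F = LinearNDInterpolator(pts, vals)
# ... then evaluate it at a query point (xq, yq, zq): vq = F(xq, yq, zq).
vq = F(0.5, 0.5, 0.5)
print(float(vq))  # linear interpolation reproduces the linear function, ~1.5
```

Inside the convex hull of the sample points, piecewise-linear interpolation reproduces a linear response exactly, which makes the sketch easy to verify.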

Classification
The classification model was chosen by performing a Bayesian optimization of the hyperparameters on the surrogate model data via the fitcauto function in MATLAB. After 120 iterations, the best observed and estimated learner was an ensemble using the bag method over 254 learning cycles, with a minimum leaf size of 21 and a maximum number of splits of 70. Specifically, bootstrap aggregation (bagging) with random predictor selection at each split (random forest) was used, as it was predicted to perform best given the multiclass nature of the data. The bagged classification ensemble was trained with these optimized hyperparameters (as specified by the Bayesian optimization) and evaluated with leave-one-out cross-validation. This evaluation method was chosen because it is well suited to smaller datasets, providing a much less biased measure of test error than a single test set, since a model is repeatedly fit to a dataset containing n − 1 observations.7 The classification accuracy was 95.7%, with only 3 reactions misclassified (Figure S4). The resubstitution loss, which equates to the misclassification rate, was 0.0380. The closer the model predictions are to the observations, the smaller the misclassification error; an error ≤ 0.05 is generally considered acceptable, indicating that the model performs well (Table S5).
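As a rough illustration (not the authors' MATLAB pipeline), the same kind of bagged-ensemble training with leave-one-out cross-validation can be sketched in Python with scikit-learn. The dataset is synthetic, and the hyperparameter mapping (min_samples_leaf for minimum leaf size, max_leaf_nodes ≈ maximum splits + 1) is an assumed translation of the MATLAB settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the 80-reaction surrogate model: 4 coded
# experimental variables and 3 phase-combination classes.
X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# Random forest = bagging with random predictor selection at each split;
# 254 trees mirrors the 254 learning cycles reported in the text.
clf = RandomForestClassifier(n_estimators=254, min_samples_leaf=21,
                             max_leaf_nodes=71, random_state=0)

# Leave-one-out CV: fit 80 models, each on n - 1 = 79 observations.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {scores.mean():.3f}")
```

Each leave-one-out fold scores a single held-out observation, so the mean of the 80 fold scores is the cross-validated accuracy.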
Other train/test splits were tested, as illustrated in Figures S5–S7. Because some phases were observed only once or twice (and klockmannite CuSe not at all), train/test splits that hold out even five reactions can omit a specific phase combination from the training set entirely, making predictive synthesis of such phases essentially impossible unless further sampling is appended to the initial dataset. For this reason, models trained on the 80 observations in the surrogate model with a holdout larger than 5% (i.e., beyond a 95/5 split) almost always led to errors in training. A 97/3 train/test split (leaving out 2 observations) resulted in an accuracy of 94.9% with a resubstitution loss of 0.0385 (Figure S5). Unsurprisingly, the additional sampling during the optimization throughout the regions where specific phase combinations were scarce in the original dataset enabled the model to maintain a high prediction accuracy over a range of training/testing ratios. A 90/10 train/test split resulted in a prediction accuracy of 92.1% and a resubstitution loss of 0.0442 (Figure S6). Similarly, an 80/20 train/test split resulted in a prediction accuracy of 91.9% and a resubstitution loss of 0.050 (Figure S7).
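The rare-phase pitfall described above can be illustrated with a toy calculation (the class counts are invented; the point is combinatorial): when a phase combination appears only twice among 80 reactions, a random 80/20 holdout drops it from the training set roughly (16/80)·(15/79) ≈ 3.8% of the time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 toy observations; class 2 plays the role of a phase seen only twice.
y = np.array([0] * 70 + [1] * 8 + [2] * 2)
X = np.arange(80 * 4, dtype=float).reshape(80, 4)

# Count how often unstratified random 80/20 splits lose the rare class
# from training (i.e., both of its observations land in the test set).
missing = 0
n_trials = 1000
for seed in range(n_trials):
    _, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    if 2 not in y_tr:
        missing += 1

print(f"rare class absent from training in {missing}/{n_trials} splits")
```

With a 97/3 split, by contrast, at most two observations are held out, so the chance of losing both copies of a twice-observed phase is far smaller, consistent with the text's preference for small holdouts on this dataset.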
Univariate feature ranking for classification (fscchi2 in MATLAB) ranks features (predictors) using chi-squared tests. The predictor variables and the response variable (phase) from the training data were provided to the function, which returns the indices of the predictors ordered by importance; the first predictor returned is the most important. The feature rankings are illustrated in Figure 2 in the main text.
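A minimal sketch of chi-squared feature ranking in Python, using scikit-learn's chi2 on invented non-negative predictors (one predictor is deliberately made phase-dependent so the ranking is predictable):

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Toy stand-in: 80 observations, 4 non-negative coded predictors, 3 phases.
y = rng.integers(0, 3, size=80)
X = rng.uniform(0.0, 1.0, size=(80, 4))
X[:, 0] += y  # make predictor 0 strongly phase-dependent

scores, pvalues = chi2(X, y)            # chi-squared statistic per predictor
ranking = np.argsort(scores)[::-1]      # indices ordered by importance
print(ranking)                          # the first index is the top predictor
```

Like the MATLAB function, this returns one score per predictor; sorting the scores in descending order gives the importance-ordered indices described in the text.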
The trained model could then be visualized, producing a phase map in the form of a decision tree. Unsurprisingly, the diselenide precursor was the first node on the full decision tree, so for simplicity the tree is separated by precursor type in Figures S3 and S4.
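This kind of tree visualization can be sketched in Python with scikit-learn, here on the bundled iris dataset rather than the Cu–Se reactions (which are not reproduced in this section):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rules stay readable.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target)

# export_text renders the fitted tree as nested if/else rules; in the
# paper's setting each leaf would name a phase combination instead.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printed rules from the root downward recovers prescriptive conditions for each leaf, which is how the decision tree in the text yields synthetic conditions for a target phase.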