Leveraging Legacy Data to Accelerate Materials Design via Preference Learning

Machine learning applications in materials science are often hampered by a shortage of experimental data. Integrating legacy data from past experiments is a viable way to mitigate the problem, but complex calibration is often necessary to use data obtained under different conditions. In this paper, we present a novel calibration-free strategy that enhances the performance of Bayesian optimization with preference learning. The entire learning process is based solely on pairwise comparisons of quantities (i.e., higher or lower) within the same dataset, so experimental design can be done without comparing quantities across different datasets. We demonstrate that Bayesian optimization is significantly enhanced by the addition of legacy data for organic molecules and inorganic solid-state materials.


Introduction
A substantial amount of materials data has been accumulated in public databases [1][2][3], and machine-learning-based design of materials has become increasingly common in recent years [4,5]. The problem of materials design is mathematically formulated as a black-box optimization problem, where a large number of candidates are available and the goal is to find the candidate with the best target property via a minimum number of observations. In Bayesian optimization [6], one of the most prominent methods of black-box optimization, the next candidate to observe is chosen using a Bayesian surrogate model trained with the observed candidates. The Gaussian process [7] is one of the most frequently used surrogate models, providing predictions together with uncertainty quantification. The next candidate is chosen such that the chance of exceeding the current best candidate is maximized.
Despite progress in materials informatics, machine learning often yields poor results due to a shortage of experimental data [8]. The problem may be solved by augmenting the current dataset with a legacy dataset from public databases or private repositories. However, even if a dataset about a similar experiment is found, direct mixing often leads to poor results, because the past experiment was done with different instruments under different conditions. Calibration of quantities is hard precisely because data are scarce. To make things worse, the conditions of past experiments are often poorly documented or completely unknown. Also, in materials design, the difficulty of making the most of legacy data depends on the overlap between the candidate materials and the examples in the legacy data (Figure 1). If all the candidate materials are included among the legacy examples, the legacy data would provide plenty of information for experimental design (Figure 1, right). With small overlap, it may be difficult to accelerate the search (Figure 1, left).
In this paper, we propose a new calibration-free strategy of data integration that never compares quantities across different datasets. Figure 2 illustrates our basic idea. First, each dataset is described as a set of pairwise relationships: pairwise comparison is done for every pair of target values within the dataset, and the outcome is summarized as a set of 'larger-than' relationships. Then, a Bayesian surrogate model is learned from the two sets of pairwise relationships only. As a result, the learned model has a value range completely different from the original datasets, but it can still be used to select candidates with Bayesian optimization. One can use any preference learning method, but in this paper we employ the Gaussian-process-based method by Chu and Ghahramani [9].
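The conversion step described above can be sketched in a few lines of code. The following is an illustrative sketch, not the paper's implementation; the function name `make_preferences` and the example values are ours. Each dataset is converted to 'larger-than' pairs separately, so the merged preference set never compares quantities across datasets.

```python
# Sketch: convert each dataset to within-dataset 'larger-than' preferences.
from itertools import combinations

def make_preferences(ids, values):
    """Return (winner, loser) index pairs where winner's value > loser's."""
    prefs = []
    for (i, yi), (j, yj) in combinations(zip(ids, values), 2):
        if yi > yj:
            prefs.append((i, j))
        elif yj > yi:
            prefs.append((j, i))
        # ties produce no preference
    return prefs

# Each dataset is converted on its own scale; merging the preference sets
# never compares an experimental value with a computational one.
experimental = make_preferences([0, 1, 2], [520.0, 610.0, 480.0])  # nm
legacy       = make_preferences([3, 4],    [1.8, 2.4])             # eV
merged = experimental + legacy
```

Because only the orderings survive the conversion, the two datasets can live on entirely different scales (here, hypothetical nanometers vs. electron-volts) without any calibration.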
In benchmarking our method, we consider two types of materials search problems. First, we search for organic molecules with longer absorption wavelengths [4]. Bayesian optimization is applied to 14 candidate compounds whose absorption wavelengths are known experimentally. As the legacy data, we employed TD-DFT to compute the wavelengths of 90 compounds including the 14 candidates. We performed several computational experiments with different degrees of overlap and found significant search acceleration in all cases, including the no-overlap case. Second, the oxide with the largest bandgap is sought from 194 candidates [3]. Similarly successful results were obtained with a legacy dataset of 2142 examples. Overall, preference learning was effective in exploiting the information in legacy data and may serve as a new tool for data integration in a wide range of materials science problems.
Figure 2.
Before merging the datasets, each one is converted to preferences. If y_i > y_j, we denote x_i ≻ x_j, i.e., x_i is preferred to x_j. A Gaussian process is trained from the merged preference set, and subsequently used to rank the remaining candidates for the next observation. Note that no comparison is made across the two datasets.

Gaussian process preference learning
In this section, we briefly review the preference learning method by Chu and Ghahramani [9]. For notational simplicity, all descriptor vectors in the current and legacy datasets are redefined as X = {x_i}_{i=1,...,n}. Let P = {u_k ≻ v_k}_{k=1,...,m} denote the merged preference set, where u_k and v_k are the preferred and non-preferred vectors of the k-th preference, respectively. After learning from P, the Gaussian process can assign a latent value f(x) to any vector x ∈ R^d. In addition, the variance of a latent value can be inferred. Bayesian optimization is performed based on these latent values.
The prior probability of the latent values f = [f(x_1), ..., f(x_n)]^T is defined as p(f) = N(f; 0, Σ), where Σ is the covariance matrix defined by a radial basis function kernel [7]. Using Gaussian noise variables δ ~ N(δ; 0, σ²), the probability of the preference u_k ≻ v_k is described as

P(u_k ≻ v_k | f) = Φ(z_k),  z_k = (f(u_k) − f(v_k)) / (√2 σ),

where Φ denotes the cumulative distribution function of the standard normal distribution. The probability of generating the merged preference set P is then defined as

P(P | f) = ∏_{k=1}^{m} P(u_k ≻ v_k | f).

By Bayes' theorem, we arrive at the posterior probability

p(f | P) ∝ p(f) P(P | f).

The maximum a posteriori (MAP) estimate of the latent values is defined as f_MAP = arg max_f p(f | P).
Taking the negative logarithm of the posterior probability, the MAP solution is obtained by minimizing

S(f) = −∑_{k=1}^{m} log Φ(z_k) + (1/2) f^T Σ^{-1} f.

To make a prediction at a new sample point x*, we infer the probability distribution of its latent value as a Gaussian whose mean is μ* = k*^T Σ^{-1} f_MAP, where k* is the vector of kernel values between x* and the training points; its variance σ*² follows from the Laplace approximation around f_MAP [9].
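The MAP step can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes an RBF kernel with fixed hyperparameters and uses a generic optimizer instead of the Newton iterations of [9]; all names (`map_latent_values`, `rbf_kernel`, etc.) are ours.

```python
# Minimal sketch of MAP estimation for Gaussian process preference learning:
# minimize -sum_k log Phi(z_k) + 0.5 f^T K^{-1} f over the latent values f.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def rbf_kernel(X, gamma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def map_latent_values(X, prefs, sigma=1.0, gamma=1.0, jitter=1e-6):
    """X: (n, d) descriptors; prefs: list of (winner, loser) index pairs."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma) + jitter * np.eye(n)  # prior covariance
    K_inv = np.linalg.inv(K)
    win = np.array([p[0] for p in prefs])
    lose = np.array([p[1] for p in prefs])

    def neg_log_posterior(f):
        z = (f[win] - f[lose]) / (np.sqrt(2.0) * sigma)
        return -norm.logcdf(z).sum() + 0.5 * f @ K_inv @ f

    res = minimize(neg_log_posterior, np.zeros(n), method="L-BFGS-B")
    return res.x  # latent values, on an arbitrary scale

X = np.array([[0.0], [0.5], [1.0]])
f_map = map_latent_values(X, prefs=[(2, 1), (1, 0)])
# The latent values respect the preference ordering: f(x2) > f(x1) > f(x0).
```

Note that the returned latent values are only meaningful up to their ordering, which is exactly why the method is calibration-free.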

Bayesian optimization based on preference learning
In Bayesian optimization, the mean latent value μ* and standard deviation σ* are computed for all remaining candidates. Let μ_max denote the maximum latent value observed so far. The expected improvement of a candidate x* is described as follows:

EI(x*) = (μ* − μ_max) Φ(z*) + σ* φ(z*),  z* = (μ* − μ_max) / σ*,

where Φ and φ represent the cumulative distribution function and the probability density function of the standard normal distribution, respectively. The candidate with the maximum expected improvement is chosen for the next observation.
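The selection rule above is a few lines of code. The following sketch assumes the posterior means and standard deviations have already been computed; the numbers are made up for illustration.

```python
# Expected improvement over the current best latent value, used to pick the
# next candidate to observe.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sd, mu_best):
    mu, sd = np.asarray(mu, float), np.asarray(sd, float)
    z = (mu - mu_best) / sd
    return (mu - mu_best) * norm.cdf(z) + sd * norm.pdf(z)

mu = np.array([0.2, 0.9, 1.1])   # posterior latent means of candidates
sd = np.array([0.5, 0.1, 0.4])   # posterior standard deviations
next_idx = int(np.argmax(expected_improvement(mu, sd, mu_best=1.0)))
```

Note that the candidate with the highest mean is not always chosen: a candidate with a lower mean but large uncertainty can win, which is how the criterion balances exploitation and exploration.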

Absorption wavelength of molecules
Most large-scale public databases provide materials properties obtained from first-principles calculations, not experimental ones [1,2]. It is thus interesting to see whether computational data can help the search for the best materials. We created our own small database of 90 organic molecules with their absorption spectra computed via TD-DFT; the set of molecules is denoted as A. See [4] for computational details. In our first benchmark, we examine how much this database can accelerate the experimental search for molecules with the longest absorption wavelength.
The experimental dataset C contains N = 14 molecules from our previous publication [4]. We synthesized these molecules and measured their absorption wavelengths with UV spectroscopy. They are all included in our database, C ⊂ A, but there is a considerably large gap between the experimental and computational absorption wavelengths (Supplementary Table 1). We created five types of 'legacy' datasets, each consisting of 50 molecules. For q = 0, 25, 50, 75 and 100, the q%-overlap dataset consists of ⌊Nq/100⌋ molecules in C, 50 − ⌊Nq/100⌋ molecules in A − C, and their computational wavelengths.
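The construction of a q%-overlap legacy set can be sketched as follows. This is an illustrative sketch only: the function name `make_legacy_indices`, the fixed random seed, and the assumption that the overlap count is ⌊Nq/100⌋ with N = 14 are ours.

```python
# Sketch: sample a 50-molecule legacy set whose overlap with the candidate
# set C is controlled by q (percent).
import random

def make_legacy_indices(C, A_minus_C, q, size=50, rng=None):
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    n_overlap = (len(C) * q) // 100        # floor(N * q / 100)
    return rng.sample(C, n_overlap) + rng.sample(A_minus_C, size - n_overlap)

C = list(range(14))              # indices of the 14 experimental molecules
A_minus_C = list(range(14, 90))  # remaining molecules in the database
legacy = make_legacy_indices(C, A_minus_C, q=50)
```

With q = 50, ⌊14 × 50/100⌋ = 7 of the 50 legacy molecules also appear among the candidates; with q = 0 the two sets are disjoint.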
To see how the Gaussian process model is enhanced by a legacy dataset, we evaluated it with ranking accuracy. First, the molecules in C are divided into an 80% training set and a 20% test set. A Gaussian process model is trained with preferences derived from the training set and a legacy dataset. As descriptors, 200-dimensional features were obtained using the RDKit descriptor calculators [10,11]. The trained model is used to compute latent values for the test examples. For the test set, the difference between the two rankings, one by experimental wavelengths and one by latent values, is measured with an accuracy measure called NDCG [12]. If the rankings are completely identical, NDCG is one; a smaller value indicates a larger difference between rankings. Figure 3(a) shows the ranking accuracy without any legacy dataset (i.e., single dataset) and with each type of legacy dataset. Each violin plot is created from 50 different training/test splits. The accuracy improved as the degree of overlap increased, and it is almost perfect for 100% overlap. This result matches our intuition that legacy data are more valuable when the overlap is larger (Figure 1). The accuracy is enhanced at 0% overlap as well, indicating that legacy data without overlap can sometimes be of help.
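An NDCG evaluation can be sketched as follows. This is one common convention (raw target values as gains, logarithmic discount); the exact gain function used in [12] and in the paper may differ, and the function name `ndcg` is ours.

```python
# Sketch of NDCG for comparing the ranking induced by predicted latent
# values against the ranking by true (experimental) values.
import numpy as np

def ndcg(y_true, y_score):
    y_true = np.asarray(y_true, float)
    order = np.argsort(y_score)[::-1]              # predicted ranking
    discounts = 1.0 / np.log2(np.arange(2, len(y_true) + 2))
    dcg = (y_true[order] * discounts).sum()        # achieved gain
    idcg = (np.sort(y_true)[::-1] * discounts).sum()  # ideal gain
    return dcg / idcg
```

When the predicted scores sort the examples exactly as the true values do, the achieved DCG equals the ideal DCG and the measure is one; any disagreement pushes it below one.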
Next, we performed a materials design benchmark using Bayesian optimization. First, two molecules are chosen at random, and selection with Bayesian optimization is applied from the third molecule onward. For each degree of overlap, we performed 50 runs of Bayesian optimization, where the initial two molecules and the legacy dataset were resampled in every run. The success rate at iteration t is defined as the fraction of runs in which the best molecule was found within t selections. Figure 3(b) shows the results without a legacy set (i.e., single dataset) and with each type of legacy set. Since our experimental dataset C was very small, the performance for the single dataset was poor. Improvement with legacy data was observed in all cases, including 0% overlap, indicating that preference learning can retrieve useful information from legacy data without explicit calibration.
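The success-rate metric is straightforward to compute once each run records the iteration at which the best molecule was first observed. The function name and the example iteration counts below are hypothetical.

```python
# Sketch of the success-rate metric: the fraction of runs in which the best
# candidate was found within t selections.
def success_rate(found_at, t):
    """found_at[r] = iteration at which run r first selected the best candidate."""
    return sum(1 for it in found_at if it <= t) / len(found_at)

found_at = [3, 7, 5, 12, 4]          # hypothetical results over 5 runs
rate_at_5 = success_rate(found_at, 5)  # 3 of 5 runs succeeded by iteration 5
```

Plotting this rate against t for each overlap level yields curves like those in Figure 3(b); a faster-rising curve means faster discovery of the best molecule.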

Bandgap of inorganic materials
The same series of benchmarking experiments is applied to another subject. The online materials database of the National Renewable Energy Laboratory (NREL, https://materials.nrel.gov) provides bandgaps of 2142 oxides calculated with the Perdew-Burke-Ernzerhof (PBE) method [13]. Among these oxides, 194 have bandgap data from the many-body GW calculation method [3]. GW calculation predicts bandgaps more accurately but is far more computationally expensive than PBE [14,15]. We define a search problem of finding the oxide with the largest GW bandgap. The candidate set C is defined as the 194 oxides with GW bandgaps, and the total set A corresponds to the 2142 oxides. Legacy datasets of size 200 are created at different degrees of overlap. 132-dimensional descriptors are obtained using the ElementProperty featurizer of Matminer [16].
Ranking accuracy and Bayesian optimization performance are shown in Figures 4(a) and (b), respectively. With a legacy dataset without overlap, the ranking accuracy was worse than that of the single dataset. Nevertheless, Bayesian optimization was accelerated in comparison to the single-dataset case. As in the first benchmark, a larger overlap resulted in higher accuracy and better acceleration.

Discussion and conclusion
We have reported that preference-learning-based data integration works excellently on two kinds of materials datasets. This result is surprising and encouraging at the same time, because the conversion of numerical data to preferences incurs information loss in exchange for calibration-free integration. Our method extends easily to three or more datasets. In current materials science, data sharing is not common due to the difficulty of integration. Our method may promote cooperation among researchers to save the cost of expensive and time-consuming experiments.
In materials science, there is a widespread misunderstanding that machine learning always requires a large amount of data. One favorable aspect of our results is that our method worked in small-data scenarios (i.e., fewer than several hundred data points). When users want to use larger datasets, the current implementation of our algorithm may not be very scalable, because the computational complexity is O(m³) [9], where m is the number of preference relations. Recent developments in Gaussian processes and preference learning [17,18] may be beneficial in improving scalability.

Figure 1.
In materials design with a legacy dataset, we search for the best material from a set of candidates (red), using the information from a set of examples in the legacy dataset (blue). If these two sets have a large overlap (right), we can make the most of the legacy data to accelerate the search, whereas it would be difficult with no overlap (left).