Efficient recommendation tool of materials by an executable file based on machine learning

To accelerate the discovery of novel materials, an easy-to-use materials informatics tool is essential. We have developed materials informatics applications that can be executed on a Windows computer without any special settings. Our applications efficiently perform Bayesian optimization to optimize materials properties and uncertainty sampling to complete a new phase diagram. We introduce the usage of these applications and show the sampling results for a ternary phase diagram.


Introduction
Materials informatics (MI) research has received tremendous attention. [1][2][3][4][5][6] These studies should accelerate materials discovery from both academic and business perspectives. With the aid of MI techniques, it is now possible to develop novel materials with the desired properties using a small number of syntheses or simulations. [7][8][9][10][11][12] In such cases, MI techniques efficiently recommend candidate material compositions and process parameters for discovering materials with the desired properties, even when only a small amount of materials data is available. Furthermore, other MI techniques can propose the next candidate points to efficiently construct phase diagrams. [13,14] A large number of machine learning tools that can be adopted for MI research have been developed and released. However, many of these tools are distributed as source code in a specific programming language such as Python. Thus, a proper environment for that language must be prepared before these tools can be run on a computer. This can be burdensome, especially for non-programmers. To spread machine learning techniques in materials science, MI tools that are easy to adopt and use are essential. We developed applications for MI that can be run on Windows computers without special settings. We targeted Windows computers because of their ubiquity and because they control many experimental devices.
This paper explains how to use our developed applications (see Fig. 1), reports the computing time needed to obtain the next candidate, and presents the sampling results for a ternary phase diagram. Our applications consist of two executable files: COMBO.exe and PDC.exe. These are available from our project page (https://tsudalab.org/en/projects/mitools). Both files can be executed on a Windows computer. We checked their operation on Windows 7, 8.1, and 10 (64-bit versions). For ease of use, these applications require no external parameters.
COMBO.exe is a Bayesian optimization tool based on the COMmon Bayesian Optimization library (COMBO). [7] Using machine learning, it selects candidate materials that may have the desired properties for subsequent syntheses or simulations. In COMBO, the Gaussian process is approximated by a Bayesian linear model with a random feature map [15] to overcome the computational bottleneck of the Bayesian optimization framework. Furthermore, hyperparameters are automatically determined by maximizing the type-II likelihood, [16] while the regularization term avoids overfitting. In COMBO.exe, Thompson sampling is adopted to select the next candidate from the posterior distribution. [17] On the other hand, PDC.exe selects a candidate point for efficiently completing a phase diagram, based on uncertainty sampling for phase diagram construction (PDC). [14] In uncertainty sampling, the point with the highest uncertainty score, as evaluated by a machine-learning-based classification model, is chosen as the most informative sample. Typically, the points with the highest uncertainty are located near phase boundaries. This approach is useful for constructing complicated phase diagrams from scratch. In PDC.exe, the label propagation method [18] is used for phase estimation, and the uncertainty score is defined by the least confident method. [19]
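The selection step described above for COMBO.exe can be illustrated with a minimal sketch of Thompson sampling on a Bayesian linear model with random Fourier features. This is not the COMBO implementation: the function names, the fixed hyperparameters (`length_scale`, `noise_var`, `prior_var`), and the toy data are assumptions made for illustration. In particular, the hyperparameters are set by hand here, whereas COMBO determines them by maximizing the type-II likelihood.

```python
import numpy as np


def random_features(X, W, b):
    """Random Fourier features that approximate an RBF (Gaussian) kernel."""
    n_features = W.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def thompson_select(X_all, known_idx, y_known, n_features=200,
                    length_scale=0.3, noise_var=1e-2, prior_var=1.0, seed=0):
    """Pick the unknown candidate that maximizes one posterior sample."""
    rng = np.random.default_rng(seed)
    known_idx = np.asarray(known_idx)
    y = np.asarray(y_known, dtype=float)
    y_mean = y.mean()  # center the objective so a zero-mean prior is reasonable

    # Hypothetical, hand-fixed feature map (not tuned by type-II likelihood).
    W = rng.normal(0.0, 1.0 / length_scale, size=(X_all.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Phi_all = random_features(X_all, W, b)
    Phi = Phi_all[known_idx]

    # Gaussian posterior over the weights: precision A, mean A^{-1} Phi^T y / noise.
    A = Phi.T @ Phi / noise_var + np.eye(n_features) / prior_var
    mean = np.linalg.solve(A, Phi.T @ (y - y_mean) / noise_var)

    # Draw one weight sample w ~ N(mean, A^{-1}) using the Cholesky factor of A.
    L = np.linalg.cholesky(A)
    w = mean + np.linalg.solve(L.T, rng.standard_normal(n_features))

    # Evaluate the sampled function on all candidates and skip the known points.
    scores = Phi_all @ w + y_mean
    scores[known_idx] = -np.inf
    return int(np.argmax(scores))


# Toy one-dimensional search space with three known points (illustrative values).
X_all = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
known_idx = [0, 10, 40]
y_known = [0.1, 0.8, 0.3]
next_idx = thompson_select(X_all, known_idx, y_known)
```

Because only a finite number of random features is used, the cost of each recommendation grows with the number of features rather than cubically with the number of known points, which is the bottleneck that the random feature map is meant to avoid.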

Application framework
First, the data file to be read by our applications should be prepared. Let {x_i} (i = 1, ..., N) be a set of N prepared candidate points, which corresponds to the search space. Each d-dimensional vector x_i includes, for example, materials compositions, materials descriptors, or process parameters. Here, it is assumed that the values of y(x_i), which is called the objective variable, are known at only some of the prepared candidate points, and the number of known points is denoted as M. In this problem setting, our applications recommend one unknown point from the remaining N-M candidate points to be investigated in the next step. The procedure to create the data file (see Fig. 1) is as follows:
(1) The data file, called "data.csv", should be prepared in the csv format (e.g. using Microsoft Excel).
(2) The first row should contain a label for each column: one for the objective variable, y(x_i), and one for each explanatory variable in x_i.
The real constant value α is prepared by the experimenter.
Aim4: Materials with large a_i, b_i, and c_i should be obtained. Here, w_a, w_b, and w_c are positive artificial weights, which express the priorities of the properties to be optimized. These values are prepared by the experimenter.
Aim5: Materials with a_i = α and b_i = β should be obtained. Here, w_a and w_b are positive artificial weights, which express the priorities of the properties to be optimized. The values of α, β, w_a, and w_b are prepared by the experimenter.
On the other hand, for PDC.exe, y(x_i) should take integer values, which express the index of the phases. This application selects the next candidate with the highest uncertainty in the phase diagram.
(4) From the second column onward, the explanatory variables x_i are entered for all prepared candidate points. Blanks are prohibited in this part. Note that the selection time increases as the number of candidate points increases.
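The file layout described in the steps above can also be generated programmatically. The following is a minimal sketch under two assumptions that are not confirmed by the text above: the objective variable occupies the first column, and its cell is left blank for candidate points whose value is not yet known. The column labels, the helper names `build_rows` and `write_data_csv`, and all numerical values are hypothetical placeholders, not measured data.

```python
import csv


def build_rows(header, candidates, known):
    """Build the csv rows: header first, then one row per candidate point.

    `known` maps a candidate index to its objective value; candidates that
    are not in `known` get an empty first cell (assumed blank convention).
    """
    rows = [list(header)]
    for i, x in enumerate(candidates):
        y = known.get(i, "")
        rows.append([y, *x])  # explanatory variables must never be blank
    return rows


def write_data_csv(path, rows):
    """Write the rows to a csv file such as data.csv."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


# Hypothetical two-dimensional search space on a coarse triangular grid.
n = 4
candidates = [(a / n, b / n) for a in range(n + 1) for b in range(n + 1 - a)]

# Placeholder objective values at two known points (M = 2).
known = {0: 1.5, 7: 0.2}
rows = build_rows(["y", "x1", "x2"], candidates, known)

if __name__ == "__main__":
    write_data_csv("data.csv", rows)
```

For PDC.exe, the same layout would apply with integer phase indices in place of the real-valued objective.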
The explanatory variables x_i may have any number of dimensions. Once the data file has been prepared, Bayesian optimization or uncertainty sampling for the PDC is performed on a Windows computer by the following procedure:
(1) Place the prepared data.csv in the same folder as COMBO.exe or PDC.exe.

Computation time
We now address the computing time needed to obtain the next candidate recommended by COMBO.exe and PDC.exe. The computing time depends on the number of candidate points N and the number of known points (training data) M. Figure 2 shows the N and M dependence of the computing time, measured as real (elapsed) time on a Windows 10 computer with an Intel Core i7-8650U CPU.
For COMBO.exe, the computing time increases steadily with N and M. Within a few minutes, COMBO.exe can process an enormous number of candidate points (N ∼ 200 000). This computing time is much shorter than the timescale of syntheses or simulations. On the other hand, PDC.exe takes more time than COMBO.exe. Interestingly, when N is fixed, the computing time decreases as the number of known points M increases. We attribute this to the label propagation, which is the phase estimation algorithm in PDC.exe, converging more quickly when many points are known. For practical use, the number of candidate points should be kept within N = 25 000.

Sampling demonstrations
We demonstrate sampling on a ternary phase diagram by COMBO.exe and PDC.exe. The target is the NaF-KF-LiF system, which has a simple calculated phase diagram. [20] No intermediate compounds appear in this system. Only three phases, characterized by NaF, KF, and LiF, exist. The left side of Fig. 3 shows the contour plot of the melting temperature, while the right side shows the phase diagram of this system.
As a demonstration of COMBO.exe, we searched for the composition with the minimum melting temperature in the NaF-KF-LiF system. That is, the objective variable is given by Eq. (2) with a_i taken as the melting temperature. The left side of Fig. 3 shows the sampling results by COMBO.exe when the parameter space is discretized into 352 candidate points. The top figure indicates the positions of the five initial points, which are randomly generated. From the second row onward, the sampled points are plotted after every four additional points recommended by COMBO.exe. Note that, in practice, COMBO.exe recommends one point at a time. As the iterations continue, points with low melting temperatures are sampled. In addition, points far from the already sampled points are also chosen because of their high uncertainty.
The right side of Fig. 3 shows the sampling results by PDC.exe for efficiently completing the ternary phase diagram. The five initial points are the same as those used for COMBO.exe, and the results are plotted after every eight additional points recommended by PDC.exe. Points around the phase boundaries are actively sampled. Among the initial points, the KF phase is not detected. Consequently, the candidate points around the phase boundary between NaF and LiF are intensively sampled at the beginning. Once the KF phase is found, PDC.exe immediately begins to recommend points near the other phase boundaries. Note that, in general, PDC.exe requires many more sampling points than COMBO.exe because the two applications serve different purposes. In both cases, COMBO.exe and PDC.exe perform efficient sampling that is well suited to the purpose. This means that our applications should reduce the trial and error in materials discovery.
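The tendency of uncertainty sampling to pick points near phase boundaries can be reproduced with a minimal sketch that combines label propagation with the least confident score. This is an illustration under stated assumptions, not the PDC.exe implementation: it uses one common variant of label propagation (an RBF affinity matrix with known labels clamped at each iteration), and the function names, the `gamma` value, and the toy one-dimensional data are hypothetical.

```python
import numpy as np


def label_propagation(X, labels, n_classes, gamma=20.0, n_iter=1000, tol=1e-6):
    """Estimate phase probabilities; `labels` holds -1 for unknown points."""
    # Row-normalized RBF affinities act as a transition matrix between points.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * sq_dist)
    T = W / W.sum(axis=1, keepdims=True)

    known = np.flatnonzero(labels >= 0)
    Y0 = np.zeros((len(X), n_classes))
    Y0[known, labels[known]] = 1.0

    # Start from uniform probabilities and clamp the known labels.
    F = np.full((len(X), n_classes), 1.0 / n_classes)
    F[known] = Y0[known]
    for _ in range(n_iter):
        F_new = T @ F
        F_new[known] = Y0[known]
        converged = np.abs(F_new - F).max() < tol
        F = F_new
        if converged:
            break
    return F


def least_confident_index(F, labels):
    """Return the unknown point whose most probable phase is least certain."""
    uncertainty = 1.0 - F.max(axis=1)
    uncertainty[labels >= 0] = -np.inf  # never re-select known points
    return int(np.argmax(uncertainty))


# Toy example: 11 points on a line, with one known phase at each end.
X = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
labels = np.full(11, -1)
labels[0], labels[-1] = 0, 1
F = label_propagation(X, labels, n_classes=2)
next_idx = least_confident_index(F, labels)
```

In this toy case the two phases are equally likely at the midpoint between the known points, so the least confident score is largest there, which mirrors the concentration of sampled points around phase boundaries.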

Conclusion
We have explained how to use our developed MI applications, which can be executed on Windows computers. Our applications are based on machine learning and perform Bayesian optimization and uncertainty sampling for phase diagram construction. The sampling results on the ternary phase diagram confirm that our applications can sample efficiently for the intended purpose.
The important points are that neither additional settings on a Windows computer nor external parameters are required. Thus, we strongly believe that materials scientists can obtain the benefits of machine learning without stress via our applications. Furthermore, since many experimental devices in materials science are controlled by Windows computers, our applications have a high affinity for such equipment. Therefore, we expect that our developed applications can accelerate materials discovery with the aid of machine learning.