Ensemble learning reveals dissimilarity between rare-earth transition-metal binary alloys with respect to the Curie temperature

We propose a data-driven method to extract dissimilarity between materials with respect to a given target physical property. The technique is based on an ensemble method with kernel ridge regression as the predicting model; multiple random subset samplings of the materials are performed to generate prediction models, and the contributions of the reference training materials to each model are examined in detail. The distribution of the predicted values for each material can be approximated by a Gaussian mixture model. Reference training materials that contribute to a prediction model that accurately predicts the physical property value of a specific material are considered similar to that material; those that do not contribute are considered dissimilar. Evaluations using synthesized data demonstrate that the proposed method can effectively measure the dissimilarity between data instances. An application of the method to Curie temperature ($T_C$) data for binary alloys of 3d transition metals and 4f rare-earth metals also reveals meaningful relations between the materials. The proposed method can be considered a potential tool for obtaining a deeper understanding of the structure of data with respect to a target property.


Introduction
Situations in which observational data are generated by different mechanisms in different contexts appear frequently in scientific phenomena. For this reason, supervised machine learning models for predicting the physical properties of materials are still frequently outmatched by empirical models created by human researchers. Finding a model that qualitatively determines the mixture effect is a constantly demanded task in both theoretical and experimental model construction. Unsupervised learning techniques, which can screen predefined correlations at different data scales, are a promising approach [1]. Conventional unsupervised techniques for unveiling mixture models in the descriptive space are implemented using clustering methods [2] such as the Gaussian mixture model, hierarchical clustering, and K-means clustering. In contrast, unveiling mixture models using supervised information, which treats the functions connecting the descriptive space and the target space as the central objects, has not gained considerable attention from the machine-learning community. One well-known method in this research direction is the mixture of experts model [3,4], which learns gating functions that partition the descriptive space to identify the components of mixture models. Further, linear regression-based clustering was recently developed [5,6,7] without partitioning the descriptive space. However, these models are sensitive to parameter settings, including the number of clusters and the complexity of the learners (linear models).
In this study, we propose a data-driven method to unveil the mixture of information on the mechanism of physical properties of materials by using nonlinear supervised learning techniques. We focus on extracting dissimilarity between materials with respect to a given target physical property. The method is based on an ensemble method with kernel ridge regression as the predicting model. We apply a bagging algorithm to carry out random subset samplings of the materials for generating multiple prediction models. The distribution of the predicted values for each material is then approximated by a Gaussian mixture model. Further, the contributions of the reference training materials to each of the corresponding models are investigated in detail. Reference training materials that are avoided, i.e., that do not contribute to a predictive model that accurately predicts the physical property of a particular material, are considered dissimilar to that material. This paper is organized as follows: The components of the algorithm are described in Section 2, including the base learning model in Section 2.1, the bagging algorithm in Section 2.2, and the dissimilarity voting machine in Section 2.3. Data preprocessing is described in Section 3, and the results and discussions are presented in Section 4.

Methodology
We consider a dataset $D$ of $p$ materials. Assume that a material with index $i$ is described by an $m$-dimensional predictor variable vector, $x_i = (x_i^1, x_i^2, \ldots, x_i^m) \in \mathbb{R}^m$. The dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$ is then represented using a $(p \times (m + 1))$ matrix. The target physical property values of all materials in the dataset are stored as a $p$-dimensional target vector $y = (y_1, y_2, \ldots, y_p) \in \mathbb{R}^p$.

Kernel ridge regression
To learn a regression function $\hat{f}$ for predicting the target variable, we utilize the kernel ridge regression (KRR) technique [8], which has recently been applied successfully in several materials science studies [9,10,11]. We use KRR with the Laplacian kernel function:
$$k(x_i, x_j) = \exp\left(-\frac{|x_i - x_j|}{\sigma}\right), \quad (1)$$
where $|x_i - x_j| = \sum_{a=1}^{m} |x_i^a - x_j^a|$ and $\sigma$ is the tuning variance parameter of the Laplacian kernel. For a given new material $x_*$, the predicted property $\hat{f}(x_*)$ is expressed as the weighted sum of the kernel functions:
$$\hat{f}(x_*) = \sum_{i=1}^{N} c_i \, k(x_*, x_i), \quad (2)$$
where $N$ is the number of training materials. The weighting coefficients $c_i$ for the corresponding materials $x_i$ are determined by minimizing
$$\sum_{i=1}^{N} \left[\hat{f}(x_i) - y_i\right]^2 + \lambda \sum_{i=1}^{N} \lVert c_i \rVert_2^2.$$
The regularization parameter $\lambda$ and the hyperparameter $\sigma$ are selected by cross-validation [12,13], i.e., by excluding some of the materials during the training process and maximizing the prediction accuracy for those excluded materials. We consider the component $c_i \, k(x_*, x_i)$ in Eq. 2 as the contribution of the training material $x_i$ to the prediction model $\hat{f}$; if $x_i$ is not included in the training set, $c_i$ is set to zero.
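The per-material contributions described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard closed-form KRR solution $c = (K + \lambda I)^{-1} y$, uses fixed values of $\sigma$ and $\lambda$ instead of cross-validation, and runs on random toy data. The function names `laplacian_kernel`, `fit_krr`, and `contributions` are hypothetical.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-|a - b|_1 / sigma) for every pair of rows of A and B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)

def fit_krr(X, y, sigma, lam):
    """Coefficients c solving (K + lam * I) c = y (assumed standard KRR form)."""
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def contributions(x_new, X, c, sigma):
    """Per-material terms c_i * k(x_new, x_i); their sum is the prediction."""
    k = laplacian_kernel(x_new[None, :], X, sigma)[0]
    return c * k

# Toy data (assumption): 30 "materials" with 2 descriptors each.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

sigma, lam = 1.0, 1e-3
c = fit_krr(X, y, sigma, lam)
contrib = contributions(X[0], X, c, sigma)
prediction = contrib.sum()
```

Each entry of `contrib` corresponds to one training material, so a material left out of the training set simply has no entry (equivalently, a zero coefficient).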

Testing for homogeneity of dataset with ensemble learning
Ensemble learning [14,15] is a machine learning approach in which multiple learners are trained to solve the same problem; it was initially developed to avoid overfitting of the base learner [16]. Various strategies, such as bagging, boosting, and stacking, serve different learning purposes.
In this study, we intend to unveil the mixture of information in the prediction model space, which consists of linear combinations of kernel functions constructed from the training materials. Applying the bagging algorithm [17], we carry out random subset samplings of the materials dataset to generate multiple prediction models. For each sampling, we prepare two datasets: a bagging dataset, $D_{bagg}$, and a testing dataset, $D_{test}$, which satisfy $D_{bagg} \cap D_{test} = \emptyset$ and $D_{bagg} \cup D_{test} = D$. For each such pair, we generate a prediction model by regressing the bagging dataset $D_{bagg}$ using a cross-validation technique [12,13]. For each obtained prediction model, we collect the predicted values of the target property for all the materials in the corresponding testing dataset $D_{test}$. The canonical size of $D_{bagg}$ is set to 66% of the total number of data instances. By repeating the bagging process, each material $x_i$ has an equal chance of appearing in the test set $D_{test}$. Finally, we obtain a distribution of the predicted values of the target property, $p(\hat{y}(x_i))$, for every material.
Here, the null hypothesis is that the dataset is homogeneous in the kernel space, i.e., that a single regression function exists. If the null hypothesis is true, the distribution $p(\hat{y}(x_i))$ should be Gaussian for every material $x_i$; otherwise, for particular materials $x_i$, the distribution $p(\hat{y}(x_i))$ is significantly better approximated by a mixture of Gaussian distributions. By examining the distribution $p(\hat{y}(x_i))$, we can test the hypothesis of homogeneity of our dataset.
The approximation of the distribution $p(\hat{y}(x_i))$ by a mixture model of $K$ Gaussian distributions [18] is as follows:
$$p(\hat{y}(x_i)) = \sum_{k=1}^{K} \pi_i^k \, \mathcal{N}\left(\hat{y}(x_i) \mid \mu_i^k, \sigma_i^k\right),$$
where $\pi_i^k$, $\mu_i^k$, and $\sigma_i^k$ are the weights, centers, and covariance matrices of the constituent Gaussian components. For a given number of mixture components, the parameters are estimated using an expectation-maximization algorithm, which is explained in detail in [18].
To determine the number of mixture components, we maximize the Bayesian information criterion [19], repeating the fit over several trials with randomized initial states.
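The sampling loop described above can be sketched as follows. This is an illustrative simplification under stated assumptions: $\sigma$ and $\lambda$ are fixed rather than chosen by cross-validation in each round, the closed-form solution $(K + \lambda I) c = y$ is assumed for the base learner, and the data are synthetic. The name `bagging_predictions` is hypothetical.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-|a - b|_1 / sigma) for every pair of rows of A and B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)

def bagging_predictions(X, y, n_rounds, bag_fraction, sigma, lam, seed=0):
    """Collect, for each material, the predictions made when it was in D_test."""
    rng = np.random.default_rng(seed)
    p = len(X)
    n_bag = int(round(bag_fraction * p))
    preds = [[] for _ in range(p)]
    for _ in range(n_rounds):
        perm = rng.permutation(p)
        bag, test = perm[:n_bag], perm[n_bag:]   # disjoint, union is D
        K = laplacian_kernel(X[bag], X[bag], sigma)
        c = np.linalg.solve(K + lam * np.eye(n_bag), y[bag])
        y_hat = laplacian_kernel(X[test], X[bag], sigma) @ c
        for i, value in zip(test, y_hat):
            preds[i].append(float(value))
    return preds

# Toy data (assumption): 40 instances with one descriptor.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0])
preds = bagging_predictions(X, y, n_rounds=200, bag_fraction=0.66,
                            sigma=0.5, lam=1e-2)
```

Each list `preds[i]` is an empirical sample from $p(\hat{y}(x_i))$, which is the object examined for multimodality in the following paragraphs.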
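As a rough illustration of the model-selection step, the sketch below fits one-dimensional Gaussian mixtures by expectation-maximization and keeps the number of components with the largest criterion value. Several choices here are assumptions, not taken from the text: the criterion is written as $\log L - \tfrac{1}{2} n_{params} \log n$ (a sign convention under which larger is better), a variance floor guards against collapsed components, and the input is synthetic bimodal data. The names `fit_gmm_1d` and `select_k` are hypothetical.

```python
import numpy as np

def _log_joint(v, pi, mu, var):
    """log(pi_k * N(v_n | mu_k, var_k)) for every sample n and component k."""
    return (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (v[:, None] - mu) ** 2 / var)

def fit_gmm_1d(v, K, n_iter=200, seed=0):
    """EM for a K-component 1D mixture; returns (pi, mu, var, log-likelihood)."""
    rng = np.random.default_rng(seed)
    n = len(v)
    mu = rng.choice(v, size=K, replace=False)   # random initial centers
    var = np.full(K, v.var())
    pi = np.full(K, 1.0 / K)
    floor = 1e-3 * v.var()                      # assumed variance floor
    for _ in range(n_iter):
        log_p = _log_joint(v, pi, mu, var)      # E-step
        r = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])
        nk = r.sum(axis=0) + 1e-12              # M-step
        pi = nk / n
        mu = (r * v[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (v[:, None] - mu) ** 2).sum(axis=0) / nk, floor)
    loglik = np.logaddexp.reduce(_log_joint(v, pi, mu, var), axis=1).sum()
    return pi, mu, var, loglik

def select_k(v, k_max=4, n_trials=5):
    """Pick K maximizing log L - 0.5 * (3K - 1) * log(n) over random restarts."""
    best = None
    for K in range(1, k_max + 1):
        for trial in range(n_trials):
            pi, mu, var, ll = fit_gmm_1d(v, K, seed=trial)
            bic = ll - 0.5 * (3 * K - 1) * np.log(len(v))
            if best is None or bic > best[0]:
                best = (bic, K, pi, mu, var)
    return best

# Synthetic bimodal "predicted values" (assumption), in kelvin.
rng = np.random.default_rng(0)
y_hat = np.concatenate([rng.normal(700, 10, 200), rng.normal(900, 10, 200)])
bic, k_best, weights, means, variances = select_k(y_hat)
```

A selected `k_best` of 1 would be consistent with the homogeneity hypothesis for that material, whereas a larger value suggests a mixture.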

Dissimilarity voting machine
In this section, we use the information from the bagging experiment to vote for the dissimilarity among materials. To perform the dissimilarity voting procedure, first, under a predefined tolerance in prediction error, $\delta_{thres}$, all prediction models $\hat{f}$ learned from a dataset $D_{bagg}$ (Eq. 2) that satisfy $|\hat{f}(x_i) - y_i| < \delta_{thres}$ are collected. Then, under a predefined neighborhood condition in the description space, $k_{thres}$, for a given $x_j$ in $D_{test}$ and all $x_i$ in $D - \{x_j\}$, a vote $ds(x_i, x_j)$ for the dissimilarity between $x_i$ and $x_j$ is defined: a neighboring material $x_i$ that makes no contribution to a collected model is voted dissimilar to $x_j$. In this voting machine, for each material, we pay more attention to its relationship with the neighboring materials (in the dataset) in the description space. If the neighboring materials are avoided and make no contribution to the predictive models that accurately predict the physical property value of the material concerned, those neighboring materials should be considered dissimilar to that material. Finally, the bagging-based dissimilarity voting algorithm is summarized as follows:

Algorithm 1: Bagging-based dissimilarity voting algorithm
Data: Base learner: $\hat{f}$; number of base learners: $T$; parameters: $k_{thres}$, $\delta_{thres}$
Result: Dissimilarity matrix, $dS$
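One possible reading of the voting procedure is sketched below; it is an interpretation, not the authors' algorithm. The assumptions are: a model counts as accurate for a test material $x_j$ when $|\hat{f}(x_j) - y_j| < \delta_{thres}$; $x_i$ is a neighbor of $x_j$ when $k(x_i, x_j) > k_{thres}$; and a material "makes no contribution" when it was left out of $D_{bagg}$, so that its coefficient is zero. Hyperparameters are fixed, the closed-form KRR solution is assumed, and `dissimilarity_votes` is a hypothetical name.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-|a - b|_1 / sigma) for every pair of rows of A and B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)

def dissimilarity_votes(X, y, T, bag_fraction, sigma, lam,
                        k_thres, delta_thres, seed=0):
    """dS[i, j] counts accurate models for x_j that avoided neighbor x_i."""
    rng = np.random.default_rng(seed)
    p = len(X)
    n_bag = int(round(bag_fraction * p))
    K_full = laplacian_kernel(X, X, sigma)
    dS = np.zeros((p, p))
    for _ in range(T):
        perm = rng.permutation(p)
        bag, test = perm[:n_bag], perm[n_bag:]
        c = np.linalg.solve(K_full[np.ix_(bag, bag)] + lam * np.eye(n_bag),
                            y[bag])
        y_hat = K_full[np.ix_(test, bag)] @ c
        in_bag = np.zeros(p, dtype=bool)
        in_bag[bag] = True
        for j, pred in zip(test, y_hat):
            if abs(pred - y[j]) >= delta_thres:
                continue                      # model is not accurate for x_j
            # Neighbors of x_j that were left out of the bag (c_i = 0).
            avoided = (~in_bag) & (K_full[:, j] > k_thres)
            avoided[j] = False
            dS[avoided, j] += 1
    return dS

# Toy data (assumption): 30 instances with one descriptor.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-1, 1, size=(30, 1)), axis=0)
y = np.sin(3 * X[:, 0])
dS = dissimilarity_votes(X, y, T=300, bag_fraction=0.66, sigma=0.5,
                         lam=1e-2, k_thres=0.5, delta_thres=0.1)
```

The raw counts in `dS` are not symmetric; averaging `dS` with its transpose is one way to obtain a symmetric dissimilarity matrix for clustering.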

Prototype model
We simulate a dataset containing 70 data instances with a one-dimensional descriptive variable, $x$, and a target variable, $y$, as shown in Figure 1a. To illustrate the capability of the bagging prediction model in unveiling a mixture of nonlinear models, the dataset is designed as a mixture of three main functions. For $x$ less than -0.4, the function $y = f(x)$ is monotonic and centered at 0.1. For $x$ greater than -0.4, the function $f$ bifurcates into two branches: one fluctuates while increasing from 0.1 to 0.3, and the other varies while decreasing from 0.1 to -0.15.
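A dataset with the qualitative shape described above can be generated as in the sketch below. The exact functional forms used by the authors are not given, so the flat segment, the linear branches, the noise level, and the random branch assignment are all assumptions made for illustration; `make_prototype` is a hypothetical name.

```python
import numpy as np

def make_prototype(n=70, seed=0):
    """Toy bifurcated dataset: one function for x <= -0.4, two branches after."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(-1.0, 1.0, size=n))
    y = np.full(n, 0.1)                        # single function near 0.1
    right = x > -0.4
    t = (x[right] + 0.4) / 1.4                 # 0 at the split, 1 at x = 1
    upper = rng.random(right.sum()) < 0.5      # random branch assignment
    # Upper branch rises from 0.1 to 0.3; lower branch falls from 0.1 to -0.15.
    y[right] = np.where(upper, 0.1 + 0.2 * t, 0.1 - 0.25 * t)
    y += rng.normal(0.0, 0.005, size=n)        # small noise (assumed)
    branch = np.zeros(n, dtype=int)            # 0: single, 1: upper, 2: lower
    branch[right] = np.where(upper, 1, 2)
    return x, y, branch

x, y, branch = make_prototype()
```

The returned `branch` labels are not used by the method itself; they are kept only so that a clustering result can be compared against the designed structure.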

Curie temperature data
We collected experimental data on 101 binary materials, consisting of transition metals and rare-earth metals, from the AtomWork database of NIMS [20,21], including the crystal structure of the materials and their observed $T_C$ values. Our task was to develop a model for estimating the $T_C$ of a new material based on the training data of known materials. For this, one of the crucial steps is the selection of an appropriate data representation that reflects the application domain, i.e., a model of the underlying physics. Conversely, a data representation that yields a good estimation model may imply the discovery of the underlying physics. To represent the structural and physical properties of each binary material, we designed 21 descriptive variables. We divided the 21 variables into three groups; the first and second groups contain descriptive variables that describe the atomic properties of the transition-metal ($T$) and rare-earth ($R$) constituents. The properties are as follows: (1, 2) atomic number ($Z_R$, $Z_T$), (3, 4) covalent radius ($r_{covR}$, $r_{covT}$), (5, 6) first ionization potential ($IP_R$, $IP_T$), and (7, 8) electronegativity ($\chi_R$, $\chi_T$). In addition, descriptive variables related to the magnetic properties are included, as follows: (9, 10) total spin quantum number ($S_{3d}$, $S_{4f}$), (11, 12) total orbital angular momentum quantum number ($L_{3d}$, $L_{4f}$), and (13, 14) total angular momentum ($J_{3d}$, $J_{4f}$). The selection of these variables originates from the physical consideration that the intrinsic electronic and magnetic properties determine the 3d orbital splitting at the transition-metal sites.
To better capture the effect of the 4f electrons, the strong spin-orbit coupling effect in particular, we included two additional variables describing the properties of the constituent rare-earth metal ions: (15) the projection of the total magnetic moment onto the total angular momentum ($J_{4f} g_j$), and (16) the projection of the spin magnetic moment onto the total angular momentum ($J_{4f}(1 - g_j)$) of the 4f electrons. The selection of these variables was based on the physical consideration that the magnitude of the magnetic moment determines $T_C$. It is well established that information related to the crystal structure is highly important for understanding the physics of binary materials with transition metals and rare-earth metals. Therefore, we designed a third group of structural variables that approximately represent the structural information at the transition-metal and rare-earth metal sites, including (17) the concentration of the transition metal ($C_T$) and (18) the concentration of the rare-earth metal ($C_R$). Note that, if the atomic percentage is used for the concentration, the two quantities are not independent. Therefore, in this study, we measured the concentrations in units of atoms/Å$^3$. This unit is more informative than the atomic percentage because it contains information on the constituent atomic size. As a consequence, $C_T$ and $C_R$ are not completely dependent on each other. Other structural variables were also included: the mean radius of the unit cell (19) between two rare-earth elements, $r_{RR}$, (20) between two transition-metal elements, $r_{TT}$, and (21) between transition-metal and rare-earth elements, $r_{TR}$. We set the experimentally observed $T_C$ as the target variable.
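The two rare-earth descriptors (15) and (16) follow directly from the quantum numbers $S$, $L$, and $J$ through the Landé g-factor, $g_j = 1 + [J(J+1) + S(S+1) - L(L+1)] / [2J(J+1)]$. The sketch below evaluates them for a few ground-state terms; the choice of trivalent ions and the convention of returning zero when $J = 0$ are assumptions made for this illustration, and the function names are hypothetical.

```python
def lande_g(S, L, J):
    """Lande g-factor; returns 0.0 when J = 0, where it is undefined (assumed)."""
    if J == 0:
        return 0.0
    return 1.0 + (J * (J + 1) + S * (S + 1) - L * (L + 1)) / (2.0 * J * (J + 1))

def rare_earth_descriptors(S, L, J):
    """Descriptors (15) J*g_j and (16) J*(1 - g_j) for the 4f electrons."""
    g = lande_g(S, L, J)
    return {"J_g": J * g, "J_1mg": J * (1.0 - g)}

# Hund's-rule ground states (S, L, J) of trivalent ions (assumed valence).
ions = {
    "Ce3+": (0.5, 3, 2.5),
    "Nd3+": (1.5, 6, 4.5),
    "Gd3+": (3.5, 0, 3.5),
}
descriptors = {name: rare_earth_descriptors(*slj) for name, slj in ions.items()}
```

For Gd3+, where $L = 0$, the g-factor is exactly 2, so the two descriptors evaluate to 7 and -3.5.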

Prototype model
To demonstrate how the bagging prediction model provides insight into the structure of a dataset, we present the results of applying the model to two-dimensional simulation data. The dataset contains 70 instances with a one-dimensional descriptive variable, $x$, and a target variable, $y$, as depicted in Figure 1a and described in Section 3.1. The bagging prediction model includes one million random samplings, with a sampling size of 35% of the total number of data instances. Details of the parameter settings are given in Table 1 in the Supplemental Materials. Figure 1a shows the distributions of the predicted values, $\hat{y}$, obtained using the bagging model. For $x$ values less than -0.4, each $\hat{y}$ distribution consists of a single component centered around 0.1. On the other hand, for $x$ values greater than -0.4, almost all the $\hat{y}$ distributions can be considered a mixture of two main Gaussians, whose means are approximately 0.2 and -0.1, respectively. These components of the predicted-value distributions reflect the actual shape of the designed data, shown by colored points. Figure 1b shows the (70 × 70) dissimilarity matrix obtained by our dissimilarity voting machine. In the matrix, dark blue cells represent dissimilar pairs of data instances. Zero-valued cells indicate no dissimilarity information between the corresponding pairs. For convenience, the data instances in the matrix are sorted by their $x$ values. Details of how the voting machine detects dissimilarity are illustrated by example zero-contribution profiles in the Supplemental Materials.
Two points are noticeable in this figure. First, the upper left of the matrix contains a large bright region, i.e., a region of non-dissimilarity among instances. This is consistent with the smooth, monotonic variation of the data for $x$ values less than -0.4. Second, for $x$ values larger than -0.4, every data instance is dissimilar to its two closest neighbors and similar to the next ones. Once again, this extracted information is consistent with the actual distribution of the designed dataset.
By transferring the information from the dissimilarity matrix to a hierarchical clustering model [22], we obtain the clustering results shown in Figure 1c. The dissimilarity information helps us to identify a mixture of two main groups, which are labeled in green and red. These two groups successfully reconstruct the shape of the two branches in the bifurcated region of the dataset. To conclude, the prototype model result shows that the dissimilarity voting machine based on the bagging algorithm can identify mixture phenomena with respect to a specific target property.
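The step from a dissimilarity matrix to groups can be illustrated with a small agglomerative clustering routine. The linkage criterion used with [22] is not stated in the text, so average linkage is an assumption here, as is the toy block-structured matrix; `average_linkage` is a hypothetical name and the implementation favors clarity over speed.

```python
import numpy as np

def average_linkage(D, n_clusters):
    """Agglomerative clustering on a symmetric dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise dissimilarity between the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(D), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy matrix (assumption): two groups with no votes inside, 5 votes across.
D = np.full((6, 6), 5.0)
D[:3, :3] = 0.0
D[3:, 3:] = 0.0
labels = average_linkage(D, n_clusters=2)
```

A vote matrix that is not symmetric would need to be symmetrized first, for example by averaging it with its transpose.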

Curie temperature analysis
For the designed descriptive variables, the maximum prediction ability, with an $R^2$ score of 0.967 ± 0.004, was achieved by a model derived from the variable combination {$\chi_R$, $\chi_T$, $J_{4f}(1 - g_j)$, $Z_T$, $r_{covT}$, $IP_T$, $S_{3d}$, $L_{3d}$, $J_{3d}$, $C_R$}. Details of the parameter settings are given in Table 1 in the Supplemental Materials. The model selection and the relations among variables are discussed in [23,24]. The high prediction accuracy of this model shows that it is possible to accurately predict the $T_C$ values of rare-earth transition-metal binary alloys with the designed variables. With the bagging size set to 90% of the data instances, the mean absolute error in cross-validation is approximately 40 K. The dependence of the cross-validation mean absolute error on the bagging size is summarized in Figure 2 in the Supplemental Materials, and the training and testing prediction errors for all materials are shown in Figure 1 in the Supplemental Materials. This evidence indicates that, under the description given by this variable combination, the regression function for predicting $T_C$ is approximately a single function accompanied by a number of probable unknown anomalies. Our dissimilarity voting machine can help to address this problem. Next, we discuss the distribution of the predicted $T_C$. For almost all materials, the predicted values follow a single Gaussian distribution, whereas for some materials the distribution is a mixture of Gaussians. For example, Figure 2a shows the distribution of predicted $T_C$ values for Co$_5$La (with an observed $T_C$ of 838 K) for bagging sizes varying from 50 to 95 percent of the total dataset. The form of the distribution is consistent for all sizes of the subset sampling. As the size changes, the corresponding peaks at 686 K, 739 K, 925 K, and 980 K remain.
Figure 2b displays an enlarged view of the distribution for a bagging size of 65 percent of the total dataset. Applying the Gaussian mixture model (Section 2.2) shows that this distribution is a mixture of seven Gaussian distributions, whose means are at 580.34 K, 686.81 K, 739.52 K, 881.59 K, 925.18 K, 980.15 K, and 1101.72 K, respectively. This indicates the presence of a mixture of nonlinear functions in the dataset. Further investigation will reveal the significance of these functions. Figure 2c shows in detail the contribution of each training material in the dataset to the target material, Co$_5$La. The color bar shows the color encoding of the contributions, in Eqn. 3, of all materials in the dataset to Co$_5$La. The color encoding is symmetric about zero, with white corresponding to materials that are not included in the training set. The vertical axis represents the prediction models, sorted by the predicted $T_C$ value, i.e., the sum of all the contribution values as in Eqn. 3. The horizontal axis represents the materials in ascending order of their $L_1$ distance to the target material. The five materials closest to Co$_5$La are Co$_5$Ce (662 K), Co$_{13}$La (1298 K), Co$_{17}$Ce$_2$ (1090 K), Co$_5$Pr (931 K), and Co$_7$Ce$_2$ (50 K). The figure also shows that the models with the closest predicted value of $T_C$ (purple distribution with a mean of 881.59 K) are constructed from combinations of instances, $D_{bagg}$, with no contribution from Co$_5$Ce, Co$_{13}$La, Co$_{17}$Ce$_2$, and Co$_7$Ce$_2$. Only Co$_5$Pr makes a significant contribution to Co$_5$La. Details of the zero-contribution counting profiles are shown in Figure 4 in the Supplemental Materials.
For comparison, we considered the two nearest-neighbor models of the purple model, namely, the green model with a mean of 739 K and the blue model with a mean of 925.18 K. In the blue model, all five nearest neighbors act as contributors, whereas in the green model, the contribution of Co$_{13}$La is removed. The training-data contributions shown above allow certain conclusions about the underlying physics. The three materials Co$_5$Ce, Co$_{13}$La, and Co$_{17}$Ce$_2$ should not be considered highly "similar" to Co$_5$La with respect to $T_C$, even though they have the same constituent $T$ metal and either the same constituent $R$ metal or $R$ metals that are adjacent in the periodic table ($Z_{La} = 57$ and $Z_{Ce} = 58$). In other words, the distance between these materials arising from the difference in rare-earth element should not be small when measured by the three $R$ predictive variables, $\chi_R$, $J_{4f}(1 - g_j)$, and $C_R$. As the concentration of the $R$ metal efficiently indicates the change in $T_C$ among materials sharing the same $R$ and $T$ elements, e.g., Co$_5$La vs. Co$_{13}$La, these dissimilarity results show that the other two $R$ variables do not capture the real mechanism of $T_C$.
The entire dataset can be divided into four main groups based on the transition metal: cobalt-based, iron-based, manganese-based, and nickel-based materials. Within our defined $k_{thres}$ neighbor regions, the number of dissimilarity values is not identical for all the groups of materials. Here, we analyze the cobalt-based and iron-based groups. In the cobalt-based group, Co$_5$Ce does not receive contributions from the other materials of the Co$_5$R group. The dissimilarity between Co$_5$Ce and Co$_5$La was shown in the previous analysis; here, it can be observed more distinctly. Co$_5$Ce has a $T_C$ of 662 K, which is considerably lower than those of the other Co$_5$R materials: Co$_5$La at 838 K, Co$_5$Pr at 931 K, Co$_5$Nd at 910 K, and Co$_5$Sm at 1016 K. In this family, except for Co$_5$Ce, an increase in the atomic number of the rare-earth element correlates with an increase in $T_C$. Figure 3a shows the hierarchical clustering result obtained using the information from the dissimilarity voting machine. Co$_5$Ce is clearly isolated from the other Co-based materials. We can also confirm the anomalous character of Co$_5$Ce by comparing it with the other Co-based materials (Fig. 3b, the compounds surrounded by the blue line). This result confirms the significance of our method of dissimilarity measurement.
In the group of iron-based materials, Figure 3a shows that Fe$_5$Gd has a large number of dissimilarity values compared to other Fe-based materials, especially those of the Fe$_x$Gd$_y$ group, indicating that Fe$_5$Gd is out of trend with its nearest neighbors. Figure 3b (the compounds surrounded by the blue line) shows that, in order of increasing concentration of the rare-earth element, $C_R$, the $T_C$ values of Fe$_{17}$Gd$_2$, Fe$_5$Gd, Fe$_{23}$Gd$_6$, Fe$_3$Gd, and Fe$_2$Gd are 479 K, 465 K, 659 K, 725 K, and 814 K, respectively. Clearly, Fe$_5$Gd does not follow the general trend of the Fe$_x$Gd$_y$ group. Thus, it is again demonstrated that the information extracted by dissimilarity voting using the bagging algorithm can serve as a useful means of detecting anomalies.

Figure 3: a) Hierarchical clustering model obtained using information from the dissimilarity matrix. b) Relationship between $T_C$ and the concentration of the rare-earth element, $C_R$. Red and blue lines mark the detected anomalies, which are discussed in detail in Section 4.2.
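The out-of-trend behavior of Fe$_5$Gd can be checked directly from the $T_C$ values quoted in the text. The sketch below is only a simple monotonicity check, not the dissimilarity voting method, and it orders the compounds by rare-earth atomic fraction as a stand-in for the concentration in atoms/Å$^3$ used in the paper (an assumption that happens to give the same ordering as stated in the text).

```python
# (name, number of Fe atoms, number of Gd atoms, T_C in K) as quoted in the text.
series = [
    ("Fe17Gd2", 17, 2, 479.0),
    ("Fe5Gd", 5, 1, 465.0),
    ("Fe23Gd6", 23, 6, 659.0),
    ("Fe3Gd", 3, 1, 725.0),
    ("Fe2Gd", 2, 1, 814.0),
]

def rare_earth_fraction(n_t, n_r):
    """Atomic fraction of the rare-earth element (stand-in for C_R)."""
    return n_r / (n_t + n_r)

# Sort by increasing rare-earth content, then flag any drop in T_C.
ordered = sorted(series, key=lambda s: rare_earth_fraction(s[1], s[2]))
out_of_trend = [cur[0] for prev, cur in zip(ordered, ordered[1:])
                if cur[3] < prev[3]]
print(out_of_trend)  # ['Fe5Gd']
```

Only Fe$_5$Gd breaks the otherwise increasing sequence, in agreement with the material singled out by the voting machine.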

Conclusion
In this study, we have proposed a method to extract dissimilarity between materials with respect to a given target property. The technique is based on an ensemble method with kernel ridge regression as the predicting model; multiple random subset samplings of the materials are performed to generate prediction models, and the contributions of the reference training materials to each model are examined in detail. The mixture distribution of predicted values was unveiled using a Gaussian mixture model. Reference training materials that contribute to a prediction model that accurately predicts the physical property value of a specific material are considered similar to that material; those that do not contribute are considered dissimilar. Evaluations using synthesized data demonstrate that the proposed method can effectively measure the dissimilarity between data instances. The algorithm was then applied to the prediction of the Curie temperature ($T_C$) of binary alloys of 3d transition metals and 4f rare-earth metals, where it yielded satisfactory results. The proposed method can be considered a potential tool for obtaining a deeper understanding of the structure of data with respect to a target property.