Guided patchwork kriging to develop highly transferable thermal conductivity prediction models

Machine learning models developed on a dataset comprising a particular class of materials show poor transferability across different classes. This problem can be partially solved by increasing the variability of the dataset, but at the cost of prediction accuracy. To develop a model on a highly variable database, we propose a localized-regression-based patchwork kriging approach for capturing most of the complex details in the data. In this approach, the data is partitioned into smaller regions that share patches of a few datapoints across neighboring boundaries. Local regression functions are developed in each partition with the constraint that they give similar predictions at the boundary. Out of 17 different properties tried for partitioning the data, decomposition with respect to the target output κl gave local models with unprecedented accuracies. Partitioning with respect to κl, however, requires an estimate of κl for any unknown compound beforehand. To address this, we developed a global model for the entire database. The global model accurately predicts the order of magnitude of κl for the compounds in the dataset and hence directs them to a particular partition for more accurate prediction. We call this stepwise approach guided patchwork kriging; it can be applied to develop highly accurate, transferable prediction models.


Introduction
Fueled by developments in algorithms and advances in computational resources, data-driven machine learning (ML) methods are revolutionizing the materials science domain. Using these ML-based approaches, there is now a rich profusion of prediction models for resource-intensive materials properties [1][2][3][4][5][6][7][8][9][10][11]. Among these properties, the lattice thermal conductivity κl is of great relevance in a variety of applications such as thermoelectrics, barrier coatings, and electronic devices [12][13][14][15][16]. Significant efforts and notable contributions have been made by the scientific community toward efficient and accelerated prediction of κl using data-assisted methods. For example, high-throughput modeling was employed on a dataset of half-Heuslers from the AFLOW repository, and three low-κl materials were proposed for experimental verification [9]. The Automatic Gibbs Library has been implemented within the AFLOW and Materials Project frameworks for high-throughput screening of desired κl materials, calculated under the quasiharmonic Slack model [17]. Starting from 101 compounds of the rocksalt, zincblende, and wurtzite types, 221 low-κl candidates were proposed using Bayesian optimization [10]. By combining finite-temperature phonon calculations with high-throughput screening, κl was estimated for oxide and fluoride perovskites; moreover, physical insights, such as a variation of κl slower than the usual T⁻¹ dependence for cubic perovskites, were unraveled [18]. Semiempirical models for isotropic and anisotropic κl have been proposed for accelerated high-throughput screening [19][20][21]. Recently, a machine learning model was developed using four high-throughput-assisted descriptors directly related to the physics of κl, which outperformed the Slack model [22]. Undoubtedly, these studies over the past few years present a significant advance in expediting the search for desired κl materials and provide several new physical insights.
However, even for a single target property such as κl, the reported prediction models differ considerably, mainly with respect to the class of compounds and the type of descriptors in their datasets.
ML-based predictions depend solely on the type of data employed for training. Hence, models focused on a particular class of materials show poor transferability across different classes. Increasing the variability of the data, however, results in high prediction errors compared to focused models, because a single model cannot adapt to every complex detail in the dataset. So far, no approach has been discussed in the literature that addresses the generalizability of prediction models for a given target property of interest. In this work, for the target output κl, we propose the use of a localized-regression-based patchwork kriging approach on a class-independent dataset. The dataset is compiled to cover a diverse chemical and structural space: it contains compounds with elements from every group of the periodic table, covers all 7 crystal systems, and spans 4 orders of magnitude in the target output κl. As expected, the model developed on this data using elemental and structural descriptors gives a high train/test root mean square error (RMSE) of 0.24/0.25 for the log-scaled κl. To improve the prediction accuracy, we employ patchwork kriging, where the dataset is partitioned into smaller local subsets with respect to a property [23,24]. Neighboring local regions are patched together to share some datapoints at the boundary, and the local models on either side of the boundary are constrained to produce the same response at the boundary points. Out of 17 different properties tested for creating partitions, the target property κl turns out to be the best in terms of providing unprecedented accuracies. Since the data is decomposed with respect to the target output, an estimate of κl is required for unknown compounds.
For that, a global model developed on the entire dataset is used; it predicts the correct order of magnitude of κl and can thus direct datapoints to specific partitions for improved predictions by the local models. The resulting approach is therefore guided patchwork kriging: the global model directs datapoints to specific partitions for more accurate prediction of the target output.

Methodology
The electronic structure calculations were performed using density functional theory [25,26], as implemented in the VASP package [27,28]. The generalized gradient approximation (GGA) in the Perdew–Burke–Ernzerhof (PBE) form was employed to account for electronic exchange and correlation, and projector augmented wave (PAW) potentials were used to represent the ion–electron interactions [29,30]. The energy cutoff for the plane waves was taken to be 1.3 times the default value specified in the PAW pseudopotentials, and the criterion for energy convergence was set to 10⁻⁸ eV. Dynamical stability was assessed from phonon dispersions calculated using the finite difference method, as implemented in PHONOPY [31]. The lattice thermal conductivity was calculated by solving the phonon Boltzmann transport equation, as implemented in PHONO3PY [32,33].
To build a generalized model capturing the wide variation of κl, the patchwork kriging method is employed [23,24]. Patchwork kriging regression belongs to the family of well-established kriging/Gaussian process regression (GPR) methods [34][35][36]. Instead of fitting a particular parametric function to the input data, Gaussian processes (GP) provide a non-parametric approach for constructing a regressor function from the input data based on Bayesian inference. In a GP, a random variable f(x) is assigned to each input vector x and is assumed to follow a Gaussian distribution defined by a mean function m(x) and a covariance function k(x, x′). For a finite set of inputs X = (x₁, x₂, ..., x_N), the joint prior distribution is also Gaussian,

p(f | X) = N(m(X), K),

where m(X) = (m(x₁), ..., m(x_N)) and K_ij = k(x_i, x_j). The mean function m(x) is generally taken to be 0. The covariance k is a positive definite symmetric kernel function, which decides the shape/smoothness of the functions. The GP prior combined with the observations y gives the GP posterior distribution p(f | X, y). The posterior distribution is used to obtain the posterior predictive distribution of f at an arbitrary new input x*, denoted by f*. For observation noise of variance σ², the joint distribution of the observations y and the predictions f* is

(y, f*)ᵀ ~ N(0, [[K_y, K_*], [K_*ᵀ, K_**]]),

where K_y = k(X, X) + σ²I, K_* = k(X, X_*), and K_** = k(X_*, X_*). Using the Gaussian conditioning rules, the predictive posterior distribution of f* is given by

f* | X, y, X_* ~ N(K_*ᵀ K_y⁻¹ y, K_** − K_*ᵀ K_y⁻¹ K_*).

In patchwork kriging, the input domain of f(x) is partitioned into P local regions {Ω_p : p = 1, 2, ..., P} and inferences are made in each partition under additional continuity constraints. f_p(x) is defined as the local approximation of f(x) in the p-th partition and follows a Gaussian prior distribution. The continuity conditions are imposed by having neighboring partitions share some datapoints, so that the predictions of the local models do not show discontinuities at the boundary.
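The Gaussian conditioning step above can be sketched in a few lines of NumPy. This is a minimal illustration with an isotropic squared exponential kernel on toy data, not the patchwork implementation of refs [23,24]; the function names and hyperparameter values are ours for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    # squared exponential covariance: k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 l^2))
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq / length**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    # K_y = k(X, X) + sigma^2 I;  K_* = k(X, X_*);  K_** = k(X_*, X_*)
    K_y = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    alpha = np.linalg.solve(K_y, y)
    mean = K_s.T @ alpha                              # K_*^T K_y^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(K_y, K_s)    # K_** - K_*^T K_y^{-1} K_*
    return mean, cov

# toy usage: recover a smooth 1D function from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=30)
mu, cov = gp_posterior(X, y, np.array([[0.0], [1.0]]))
```

The posterior mean `mu` approximates sin(x) at the query points, and the diagonal of `cov` gives the predictive uncertainty.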
To develop the prediction models, the input domain X (the descriptors) and the covariance function k(x_i, x_j) need to be selected; the performance of the model depends crucially on both choices. The most relevant subset of features was filtered using the least absolute shrinkage and selection operator (LASSO) [34,35], which minimizes the sum of squared errors subject to a constraint on the sum of the absolute values of the model parameters (L1 norm). Assuming a linear relationship between the predicted variable vector y and the input feature matrix X, the LASSO estimate is defined by [34,35]

β̂ = argmin_β ( ‖y − Xβ‖² + λ‖β‖₁ ),

where λ ≥ 0 is the shrinkage parameter. Depending on the value of λ, some of the coefficients β become zero, thereby providing a reduced set of relevant features. Out of the different possibilities for the covariance function, the automatic relevance determination (ARD) squared exponential covariance function gave the best accuracy. For data points x_i and x_j and hyperparameter vector θ, it is given by

k(x_i, x_j | θ) = σ_f² exp( −(1/2) Σ_{m=1}^{d} (x_{i,m} − x_{j,m})² / σ_m² ),

where σ_f, d, and σ_m are the signal standard deviation, the number of predictors in the data, and the length scale for predictor m, respectively. These statistical parameters, the standard deviation and the length scales, are known as the hyperparameters θ [36]. The ARD covariance function has the advantage of providing a separate length scale for each predictor, which improves the flexibility of the model: σ_f controls the overall scale of the covariance, while each length scale captures the impact of the corresponding predictor. The hyperparameters of the covariance function were optimized using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm [37]. The performance of the developed models was assessed by statistical regression metrics, namely the root mean square error (RMSE) and the coefficient of determination (R²), which quantifies the proportion of variability in the target output captured by the input descriptors.
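The ARD covariance above can be written directly from its formula; the sketch below, with made-up inputs, shows how a large length scale on one predictor effectively switches that predictor off, which is what makes ARD useful for gauging descriptor relevance.

```python
import numpy as np

def ard_sq_exp(xi, xj, sigma_f, lengths):
    # ARD squared exponential:
    # k(xi, xj) = sigma_f^2 * exp(-0.5 * sum_m ((xi_m - xj_m) / sigma_m)^2)
    d = (xi - xj) / lengths
    return float(sigma_f**2 * np.exp(-0.5 * np.dot(d, d)))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([1.0, 2.0, 5.0])
# equal length scales reduce ARD to the isotropic squared exponential
k_iso = ard_sq_exp(xi, xj, 1.0, np.array([1.0, 1.0, 1.0]))
# a very large length scale on the third predictor suppresses its influence
k_ard = ard_sq_exp(xi, xj, 1.0, np.array([1.0, 1.0, 100.0]))
```

Here `k_iso` = exp(−2) ≈ 0.135, while `k_ard` ≈ 1 because the only differing predictor has been assigned a near-infinite length scale.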

Results and discussion
With the objective of developing a generalized prediction model for κl, we utilized the original dataset of 2162 compounds [22] along with 676 additional compounds from the Materials Project [38,39]. These additional compounds further diversify the spanned structural and chemical space. The dataset is independent of any particular class of materials and covers wide variability in various aspects. For instance, it consists of compounds with elements belonging to different groups of the periodic table (alkali, alkaline earth, transition metals, lanthanides, actinides, anions), covering all 7 crystal systems with different possible space groups. After high-throughput screening for fully optimized, non-metallic, and dynamically stable compounds, a total of 185 candidates were filtered. The screened compounds cover all 7 crystal systems, as seen in figure 1: there are 20 monoclinic, 8 triclinic, 23 orthorhombic, 15 tetragonal, 14 trigonal, 14 hexagonal, and 91 cubic compounds in the screened dataset. Additionally, the response variable κl at room temperature covers four orders of magnitude for these screened compounds, further enriching the diversity of the dataset. The phonon dispersions and κl for the dynamically stable additional compounds are shown in supporting information figures S1–S2 and S3–S4, respectively, available at stacks.iop.org/JPMATER/3/024006/mmedia. To check the prediction accuracy for this dataset, we developed an ML model for the log-scaled κl at room temperature. Developing an ML prediction model requires a set of descriptors X for the constituent compounds, which are mapped to the target property of interest y. These descriptors can either be directly related to the physics of the target output [22] or take the form of elemental and structural properties that represent the compounds.
For κl, Seko et al [6] have proposed a combination of elemental and structural descriptors, resulting in ML models with unprecedented accuracies for the log-scaled κl. The elemental descriptors include the mean and standard deviation of various physical properties of the constituent elements, such as boiling point (B.P.), melting point (M.P.), specific heat (c_p^g), molar specific heat (c_p), molar volume (V_m), heat of fusion (H_f), heat of vaporization (H_v), Pauling electronegativity (χ_p), first ionization energy (I.E.₁), group and period in the periodic table, elemental thermal conductivity (κ_ele), atomic mass (M), covalent radius (r_cov), van der Waals radius (r_vdw), and density (ρ). The structural features include the generalized radial distribution function (GRDF) and the bond order parameter (BOP). The large variance among descriptors is removed by standardizing them to zero mean and unit variance. The relevant set of descriptors was selected via ten-fold cross-validated LASSO. Using this descriptor set, a GPR model was developed for predicting the log-scaled κl; out of various covariance functions, the ARD squared exponential covariance function turned out to be the best performing. The model is trained on 90% of the data, and the remaining 10% is used for testing. The splitting into training and test sets is done over 2000 random trials, and the hyperparameters of the covariance function are tuned for each trial. The performance of the developed models is estimated by evaluating the RMSE and R² for the training and test sets. For the best optimized model, figure 2(a) presents the regression plot of true versus predicted log-scaled κl, where the triangles and empty squares correspond to the training set and the test set, respectively. The differences between the true and predicted log-scaled κl, known as residuals, are shown for the whole dataset in figure 2(b) and are large.
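The standardization, repeated 90/10 splitting, and RMSE/R² evaluation described above can be sketched as follows; this is a schematic of the assessment protocol with our own helper names, not the authors' code.

```python
import numpy as np

def standardize(X):
    # zero mean and unit variance per descriptor, as applied before LASSO/GPR
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd == 0, 1.0, sd)

def rmse_r2(y_true, y_pred):
    # the two regression metrics used throughout the paper
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return rmse, 1.0 - ss_res / ss_tot

def random_split(n, rng, test_frac=0.1):
    # one of the 2000 random 90/10 splits used for model assessment
    idx = rng.permutation(n)
    n_test = max(1, int(round(test_frac * n)))
    return idx[n_test:], idx[:n_test]  # train indices, test indices

tr, te = random_split(185, np.random.default_rng(1))
```

In the paper this split is repeated 2000 times, refitting the kernel hyperparameters each time and averaging the resulting metrics.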
Since structural descriptors such as the GRDF and BOP are relatively complex to calculate, we searched for alternative, simpler descriptors that provide similar information about the structure without compromising the performance of the prediction model. The GRDF is calculated by defining a pairwise function g_n(r_ij) of the distance r_ij between atom i and a neighboring atom j [6]. It represents the variation in density as a function of distance from a reference atom, which could instead be captured by quantifying bonding and anti-bonding strength. We calculated the average bond strength by integrating the crystal orbital Hamilton population (COHP) [40,41]. The bond strength has been shown to correlate directly with κl for the zincblende, rock salt, and CsCl structure classes [42], where increasing bond strength results in high κl. The other structural descriptor, the BOP, is evaluated from the averages of the spherical harmonics Q_lm [6]. It represents the local environment around a reference atom due to its surrounding atoms, which could alternatively be described by the average coordination number and average bond distances. While bond distances are inversely related to κl [42], the coordination number has been shown to be a potential descriptor for capturing anharmonicity [19]. Another simple parameter, the cell volume, which correlates inversely with κl [22], was also included as a descriptor. Hence, we replaced the complex structural descriptors GRDF and BOP with the simpler descriptors average bond distance (B.D.), average bond strength (B.S.), average coordination number (C.N.), and cell volume (V_cell), all of which correlate strongly with κl. The GPR model developed with these new descriptors gives improved accuracy compared to the previous model using the GRDF and BOP, implying that the new descriptors provide a better structural representation of the compounds without additional computational complexity. In this case, the best optimized model has train/test RMSE and R² of 0.24/0.25 and 0.99/0.99, respectively. Figure 2(c) presents the regression plot of true versus predicted log-scaled κl for this model, and figure 2(d) shows the corresponding residuals plot.
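Correlations like those cited above (bond strength direct, bond distance and volume inverse) are easy to check numerically; the sketch below uses the Pearson coefficient on synthetic, hypothetical data, purely to illustrate the sign conventions, not the paper's actual descriptor values.

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation coefficient between a descriptor and log kl
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# hypothetical illustration of the reported trends: kl rises with bond
# strength, while bond distance (roughly inverse to bond strength) falls
rng = np.random.default_rng(0)
bond_strength = rng.uniform(1.0, 5.0, 50)
log_kl = 0.8 * bond_strength + 0.1 * rng.normal(size=50)
r_bs = pearson(bond_strength, log_kl)            # strongly positive
r_bd = pearson(1.0 / bond_strength, log_kl)      # negative
```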
As expected, the model developed on the entire dataset gives high prediction errors compared to the state-of-the-art models reported for κl [6,22]. To develop a generalized model with improved prediction accuracy, we propose a localized-regression-based approach. The underlying notion is that in a dataset with wide variability, local details of the input space may have a significant effect on the prediction of the target output. Several local approximations have been developed in the literature for capturing large variability in the input space of big datasets: for example, sparse approximations such as the sparse Gaussian process (SGP), where the full dataset is replaced by a smaller subset of relevant inputs [43][44][45], and local approximations such as mixtures of experts [46][47][48], where the input data is divided into smaller subspaces to build local models. The major advantage of local approaches is that they take every possible detail of the data into consideration; the disadvantage is the discontinuity in the predictions of the local models at the boundaries between local regions. There are several ways to smooth out this discontinuity [47,[49][50][51]. One approach with high numerical stability is patchwork kriging [23,24], where neighboring local regions are patched together to share some datapoints at the boundary and the local models on either side of the boundary are expected to produce the same response at the boundary points. Inspired by the success of patchwork kriging in capturing the complexities of big datasets [23,24], we applied this algorithm to our dataset.
Efficient identification of the property with respect to which the data is partitioned into local models is one of the key factors deciding the prediction accuracy. Here, we tried several properties for partitioning the data: the elemental descriptors B.P., M.P., c_p^g, c_p, H_f, H_v, χ_p, I.E.₁, κ_ele, M, r_cov, r_vdw, and ρ, and the structural descriptors B.S., B.D., and V_cell. Partitioning with respect to the target output κl was also considered. Since the dataset in the present case is not very large, we created only two partitions for each of these properties; a larger number of partitions would leave too few datapoints in each partition and could lead to overfitted local models. A schematic representation of patchwork kriging is illustrated in figure 3. The two partitions for each of the 17 properties are created in ascending order of the magnitude of the property, with some common points (CP) at the shared boundary. For example, in the case of the partitions corresponding to κl, partitions 1 and 2 contain compounds with log-scaled κl ranging from −3.16 to 1.96 and from 1.18 to 6.81, respectively. At the shared boundary of these two partitions, 13 datapoints with log-scaled κl between 1.18 and 1.91 are common. The range defining the common boundary, and the corresponding number of common datapoints across partitions, was chosen so as to have nearly equal and sufficient data in each partition. For all the properties, the selected ranges and corresponding numbers of common datapoints are given in supporting information table S1.
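The two-partition construction with shared common points can be sketched as below; the cut values mirror the κl example above, but the data and function name are ours for illustration.

```python
import numpy as np

def two_partitions(log_kl, low_hi, high_lo):
    # Partition 1 takes log kl <= low_hi; partition 2 takes log kl >= high_lo.
    # Points with high_lo <= log kl <= low_hi belong to BOTH partitions:
    # these are the common points (CP) patched across the shared boundary.
    p1 = np.where(log_kl <= low_hi)[0]
    p2 = np.where(log_kl >= high_lo)[0]
    cp = np.where((log_kl >= high_lo) & (log_kl <= low_hi))[0]
    return p1, p2, cp

# toy data spanning roughly the paper's log-scaled kl range
rng = np.random.default_rng(0)
log_kl = rng.uniform(-3.16, 6.81, 200)
p1, p2, cp = two_partitions(log_kl, low_hi=1.96, high_lo=1.18)
```

Because `high_lo < low_hi`, every point lands in at least one partition, and the overlap `cp` is exactly the set the two local models must agree on.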
To develop the local prediction models in each partition corresponding to all the considered properties, a set of suitable descriptors has to be selected. We used ten-fold cross validated LASSO to select the relevant subset of elemental and structural descriptors for all the partitions. The descriptor set for both partitions corresponding to different properties is given in the supporting information table S2.
The descriptor sets differ across the partitions created with respect to the 17 properties, which may be attributed to each partition containing different local information. Using the descriptor set for each partition, prediction models for the log-scaled κl are developed with GPR. In all cases, the ARD squared exponential covariance function turns out to be the best performing. In each case, the model is trained on 90% of the data and tested on the remaining 10%; the splitting into training and test sets is done over 2000 random trials, and the hyperparameters of the covariance function are tuned for each trial. The performance of the developed models, estimated by the RMSE and R² for the training and test sets, is given for all cases in table 1.
The notable differences in prediction accuracy between partitions corresponding to different properties indicate the effect of the local structure of the data on model performance. For most properties, the model performs well in only one of the partitions; however, partitioning with respect to B.P., c_p^g, r_cov, and κl gives local models with high accuracy in both partitions. Among these, decomposition with respect to the target output κl leads to local models with the highest accuracy: partition 1 has train/test RMSE and R² of 0.13/0.13 and 0.99/0.99, while partition 2 has train/test RMSE and R² of 0.11/0.12 and 0.99/0.99. Figure 4(a) presents the regression plot of true versus predicted log-scaled κl for partition 1, where the triangles and empty squares correspond to the training set and test set, respectively. The high value of R² clearly indicates the good predictive capability of our model across wide variability. The residuals for the whole partition-1 dataset are shown in figure 4(b). Figure 4(c) presents the regression plot of true versus predicted log-scaled κl for partition 2, and figure 4(d) shows the corresponding residual plot. The prediction accuracies achieved for this highly variable data are comparable to those of models developed on a particular class and type of materials. For example, using the mean and standard deviation of elemental descriptors and the BOP, the best model developed on 110 compounds gives an RMSE of 0.096 for the log-scaled κl [6]. In another model developed on 93 compounds, the difference between experimental and predicted values of κl was reported to lie within a factor of 1.5–2 [52].
Using descriptors related directly to the physics of κl, a prediction model developed on 120 compounds gives an RMSE of 0.21 [22], and for a model developed on 100 compounds with experimentally available κl, a train/test RMSE and R² of 0.17/0.21 and 0.93/0.93, respectively, has been reported [53].
Since the most accurate models are obtained by partitioning the data with respect to the target output κl, applying this patchwork kriging approach to an unknown compound requires an estimate of its κl in order to decide which partition should make the prediction. For that, we used the model developed on the entire dataset (figure 2(c)), hereafter called the global model, to assign compounds to partitions. The global model predicts the correct order of magnitude of κl for all the compounds in the dataset and can therefore be used to direct/guide unknown data to the appropriate partition for accurate prediction of κl. The global model was tested on 10 independent datapoints that were not present in the initial training set, and it guided each datapoint to the correct partition, as presented in supporting information table S4. Hence, we propose a two-step guided patchwork kriging approach, in which the global model guides datapoints to a particular partition for subsequent accurate prediction. The pseudo-code for guided patchwork kriging is summarized in table 2.
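The two-step routine can be sketched as follows. The threshold, the stand-in models, and the linear toy relation are all hypothetical placeholders: in the paper the global and local models are the GPR fits described above, and the partition boundary comes from table S1.

```python
def guided_predict(x, global_model, local_models, threshold):
    # Step 1: the global model gives an order-of-magnitude estimate of log kl.
    rough = global_model(x)
    # Step 2: route the compound to the matching partition's local model.
    partition = 0 if rough <= threshold else 1
    return local_models[partition](x), partition

# hypothetical stand-ins: true log kl = 2*x; the global fit is coarse but
# gets the order of magnitude right, each local fit is sharp in its range
global_model = lambda x: 2.0 * x + 0.3
local_models = [lambda x: 2.0 * x + 0.02, lambda x: 2.0 * x - 0.02]
pred, part = guided_predict(1.5, global_model, local_models, threshold=1.55)
```

Even though the global estimate is off by 0.3, it still selects the correct partition, after which the local model supplies the accurate value.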
To ensure that the developed models are not overfitted, we checked learning curves obtained by plotting the prediction accuracy as a function of increasing training set size. The learning curves for the best global, partition 1, and partition 2 models are shown in supporting information figures S5(a), (b), and (c), respectively. The convergence of the train and test RMSE at around 90% of the training data size indicates no overfitting. Furthermore, we checked the average performance of all three best models, as shown in supporting information figure S6; the average is taken over 2000 different train–test splits. Figures S6(a), (b), and (c) show the true versus predicted κl for the average global, partition 1, and partition 2 models, respectively. The average test RMSE over the 2000 models is 0.81, 0.56, and 0.41 for the global, partition 1, and partition 2 models, respectively; the average performance of the local models is again better than that of the global model. The corresponding learning curves are shown in figures S6(d)–(f).
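A learning curve of the kind used here can be produced generically as sketched below. Ordinary least squares stands in for the GPR fit purely to keep the example self-contained; the overfitting check is the same either way: train and test RMSE should converge as the training fraction grows.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def learning_curve(X, y, fractions, rng):
    # Hold out a fixed ~10% test set, then fit on growing slices of the
    # remaining pool; converging train/test RMSE at the largest size is
    # the "no overfitting" signature checked in figures S5/S6.
    n = len(y)
    idx = rng.permutation(n)
    test, pool = idx[: n // 10], idx[n // 10:]
    curve = []
    for f in fractions:
        tr = pool[: max(2, int(f * len(pool)))]
        # ordinary least squares as a cheap stand-in for the GPR fit
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        curve.append((f, rmse(X[tr] @ w, y[tr]), rmse(X[test] @ w, y[test])))
    return curve

# toy linear data with small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=200)
curve = learning_curve(X, y, [0.2, 0.5, 1.0], rng)
```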
Analysis of the machine learning outcomes gives very useful physical insights and a correct description of the underlying physics of κl. For all the partitions corresponding to the 17 different properties, the average bond distance came out as one of the most relevant descriptors. Additionally, the average bond strength, average coordination number, and cell volume were found to be relevant descriptors in most of the partitions. To gain intuitive insight, we plotted these descriptors against each other, as shown in figure 5, where the magnitude of κl is represented by the color bar. Smaller B.S. and larger B.D. give low κl and vice versa, as shown in figure 5(a). This is because large bond distances in a structure imply that the atoms are not tightly packed, and hence the bonding between the atoms is weak. The phonon group velocity is governed by the interatomic force constants [54]; weaker bond strengths lead to weaker force constants and hence small phonon group velocities, which result in low κl. The model reveals this physical picture accurately, and a similar trend has been reported for zincblende, rock salt, and CsCl structure-based materials [42]. Furthermore, figures 5(b) and (c) show the inverse correlation of V_cell with κl: weak B.S. or large B.D. together with large V_cell gives low κl. The inverse dependence of κl on volume has also been reported for a large class of materials [22]. Figures 5(d)–(f) show, respectively, the behavior of B.S., B.D., and V_cell as a function of C.N. High C.N. and weak B.S. result in low κl and vice versa, as shown in figure 5(d). The opposite trend is observed for C.N. versus B.D. (figure 5(e)), as B.D. and B.S. are inversely related. The observed behavior of κl reveals that C.N. can shed some light on the anharmonicity of a system, which signifies the deviation from perfectly harmonic behavior.
Systems in which the atoms are weakly bonded usually show more anharmonicity than strongly bonded, closely packed systems. Hence, for soft anharmonic lattices, B.S. will be weak, B.D. will be large, and C.N. will be high, so C.N. could be a good descriptor for anharmonicity. Figure 5(f) shows regions of high κl for small V_cell and low C.N.; this is because a smaller volume corresponds to a tightly packed, stiff lattice, which results in large phonon group velocities and hence high κl. The effect of coordination number on κl has also been illustrated for group I–V–VI₂ compounds, where octahedrally coordinated compounds have a κl lower by a factor of 4 than tetrahedrally bonded compounds [55]. These descriptors thereby provide a simpler route for screening low- or high-κl compounds. Hence, the developed ML model captures the relationship between the bonding chemistry of structures and κl.

Conclusion
In summary, we proposed a patchwork kriging approach for developing prediction models on a dataset with huge variability, in which several local models are developed over partitioned domains with shared patches of datapoints across neighboring boundaries. Partitioning of the data with respect to 17 properties was carried out, and the best local models were obtained when the data was partitioned based on the magnitude of κl. Since involving the target output in creating the partitions requires its estimate beforehand, we used a global model developed on the entire dataset, which directs datapoints to a specific partition for accurate prediction of κl. When applied to 10 unseen datapoints, the global model guided all of them to the correct partitions. We call this stepwise property prediction approach guided patchwork kriging. Additionally, analysis of the machine learning models and descriptors correctly captures the expected correlations between bonding chemistry and κl. Our approach can be extended to develop highly transferable models for predicting any physical or chemical property from class-independent, highly variable, comprehensive datasets.