Machine Learning Stability and Bandgaps of Lead‐Free Perovskites for Photovoltaics

Compositional engineering of perovskites has enabled the precise control of material properties required for their envisioned applications in photovoltaics. However, challenges remain to address efficiency, stability, and toxicity simultaneously. Mixed lead‐free and inorganic perovskites have recently demonstrated potential for resolving such issues but their composition space is gigantic, making it difficult to discover promising candidates even using high‐throughput methods. A machine learning approach employing a generalized element‐agnostic fingerprint is shown to rapidly and accurately predict key properties using a new database of 344 perovskites generated with density functional theory. Bandgap, formation energy, and convex hull distance are predicted using validation subsets to within 146 meV, 15 meV per atom, and 11 meV per atom, respectively. The resulting model is used to predict trends in entirely different chemical spaces, and perform rapid composition and configuration space sampling without the need for expensive ab initio simulations.

Features are constructed from the final configurations of the relaxed structures using a new material fingerprint extending the partial radial distribution function (PRDF) [27] called the property density distribution function (PDDF) depicted in the abstract. The PDDF computes the density for an atomic property p in a shell of radius r and width dr centered around atom i by summing the values j,p of that property for each atom j where the pair-wise distance d i , j satisfies r ≤ d i , j < r + dr using the Heaviside step functions Θ. We then take an average of this density over the N atoms in the unit cell, repeating the calculation at each atomic center i and summing over the N atoms in the local atomic neighborhood defined by r max . Finally, we normalize by the shell volume V r to yield for a given property and radial distance. As in the PRDF case, the PDDF is globally valid and has the advantage that it can be applied to supercells of arbitrary size and complexity.
The PDDF values were calculated with dr = 0.2 Å up to r max = 15 Å for the Pauling's electronegativity, first and second ionization energies, the s-and p-orbital radii, empirical atomic covalent radius, lowest unoccupied atomic orbital, highest occupied atomic orbital, Fermi level, ionic Shannon radii, and the valence s-and p-orbital electron occupations for the isolated neutral elements. The values used in these calculations can be found in Table S3, Supporting Information. Property densities are concatenated into a feature vector ⃗ x and then the entire set of observations are preprocessed by scaling each feature to the interval [0, 1], applying a variance selection threshold th , and finally down-selecting the remaining features using an ElasticNet routine with a Select From Model (SFM) weight threshold SFM th . This feature-selection approach is conceptually similar to the one applied in ref. [20] on perovskites, as well as the one successfully applied in ref. [28]. "The dr and r max values are also varied such that a given "feature selection" model M can be denoted by its parameters as M = ( 2 th , N cv , dr, SFM th , r max ) T , where N cv is the number of cross-validation folds to use during training of the actual KRR predictor, which is fed the features from the selection routine.
360 different selection models, each exploring a range of Gaussian kernel and regularization hyper-parameters for the KRR predictor in a cross-validation procedure, were surveyed to identify the best model for prediction of the direct fundamental bandgap E g . Model errors tend to be mainly introduced through uncertainty in the final crystal structure governed by both the DFT simulation parameters and dr, which effectively limits the resolution of configuration information seen by the algorithm. These results are addressed in more depth in Section S2, Supporting Information, where trends in marginalized model parameters indicate a relationship between information and performance.

www.advancedsciencenews.com www.advtheorysimul.com
An average root mean square error (RMSE) of E g,RMSE = 146 ± 19 meV was achieved for M best = (0.001, 8, 0.2, 0.1, 15.0) T when applying it to over 100 random test-train splits of the observation data of 10% and 90%, respectively. An example of one such split is shown in Figure 1a, the prediction KRR model was parametrized by cross-validation on the training set and tested against a set-aside test set. Prediction performance was then measured for M best as a function of dataset size by measuring the R 2 score value for eightfold cross-validated N subset -sized subsets of the data to produce the learning curve in Figure 1b. Our fit indicates that bias and variance are well balanced and that further expansion of the training dataset will likely help improve performance since the gap between the training and testing curves continues to decrease.
In addition to the bandgap, we also included predictions of stability metrics by calculation of the formation energy and convex hull distance Here, E tot is the DFT-computed total energy for the supercell, n i is the number of atoms in the cell of chemical species i, and i is the corrected chemical potential of the element computed from the 0 K reference state. ΔH f ,OQMD is the convex hull energy computed using the open quantum materials database (OQMD). [29,30] More details on these calculations can be found in Section S3, Supporting Information.
Using individually selected optimal models M form = (0.001, 8, 0.2, 0.01, 15.0) T for ΔH f and M hull = (0.001, 8, 0.2, 0.0001, 15.0) T for ΔE hull , and averaging over 100 test-train splits, test errors as low as ΔH f ,RMSE = 15 ± 3 meV per atom, and ΔE hull,RMSE = 11 ± 1 meV were achieved for these stability metrics. Examples for ΔH f and ΔE hull predictions are shown in Figure 1d,c, respectively. Though relative error calculations do not make sense for these interval scales, the dispersion indicates that formation www.advancedsciencenews.com www.advtheorysimul.com energy is generally a simpler quantity to predict than both the bandgap and hull distance.
Remarkably, similar predictions can be made for new chemical spaces. For example, we removed the 229 compounds containing Cs from the dataset and used the remaining 115 for training a KRR model using features from M = M best . Over 25 trials, we found that the average E g,RMSE of the Cs-containing compounds was 220 ±3 meV , while the average ΔH f ,RMSE with model M = M form was 120 ± 3 meV per atom.
However, performance was poorer when we used the full dataset to train an model, then predict the bandgaps of eight Pb-containing compounds taken from the Materials Project detailed in Section S4, Supporting Information, as expected when applying a model to a test set from a different distribution than the training data. While we could roughly capture the correct bandgap trends, the large bias for these compounds indicates that a more chemically diverse training set, and also perhaps more accurate DFT methods to address known inaccuracies in bandgap calculations specifically for Pb, will be needed to fully extend the algorithm to these compounds (for spin-orbit coupling in leadcontaining perovskites see ref. [31]). The highlighted outliers that break from the dotted trend line represent crystals with B-site vacancies. These crystals are likely in a more distant part of configuration space that the model cannot easily generalize to since it has been trained on only nearly cubic configurations. These results are summarized in Figure 2.
Despite the excellent performance of this algorithm, a major drawback is the requirement for an input geometry to construct features-typically the most computationally intensive part when doing DFT calculations is relaxing a given composition to an energetic minimum. However, with rapid prediction speeds for bandgaps and an optimized feature construction process, there is essentially no cost for simply trying an input geometry. We therefore created large sets of trial geometries-so-called "dummy" crystals-in an attempt to discover chemical trends by dense sampling of both compositional and configuration spaces.
As an example, we explored the bandgap and stability trends of two sets of trial geometries by randomly varying the atomic positions and lattice constants uniformly around a cubic geometry initialized with l a = l c = 11.7 Å, and l b = 5.4 Å. To simulate a lack of knowledge about the true DFT-computed geometry, each atomic position in a crystal was allowed to vary up to ±3% and each lattice vector up to ±10%. Twenty variations of atomic position for each of 20 lattice distortion variations were computed for every possible configuration of mixing for a Sn-Ge dataset of the form A 4 B x B ′ 4−x X 12 with x ∈ {0, 2, 4} and a I-Cl-Br halide mixing dataset of the form A 4 B 4 X 12 X ′ x−12 with x ∈ {0, 6, 12}, resulting in two datasets of 19 200 and 28, 800 crystals, respectively. In only a couple of hours on a standard laptop, this set could be generated and fed to the M best and M hull models to predict the trends shown in Figure 3.
Beyond different A-site cation "classes" [2] we can identify a rise in the bandgap with decreasing period number of the halide anions (Figure 3b, in accordance with calculations of ref. [6]), with 50/50 mixed compounds occupying a middle ground between the one-anion only ones. As all structures where of the same phase this variation is in accordance with experiments. [32][33][34][35] For different mixing ratios of tin and germanium we do not know of any experimental work, as ref. [6] is only considering ratios far from 1:1, but theoretic work on tin/lead indicates similar constant variation [36] (when keeping A-site cation and anions fixed). However a precise prediction of these properties might necessitate the consideration of spin-orbit coupling in the calculation. [31] The stability trends in Figure 3c,d indicate that the stability decreases with the period for halides (mixing), which is in line with other calculations. [37] The Sn/Ge data is only showing a very slight variation, which could be another subject of further study. For Asite cations, the overall trend of Rb/Cs perovskites being more stable than K/Na based is well documented in the literature. [1] The main issue with relating this data to experiments is that stability is typically impacted by a wide array of factors (atmosphere, illumination) beyond the static crystal structure-also methods vary in which dissociation paths are considered. Thus the results should be seen as providing a guideline of likely trends for informing experimental research.
Interestingly, the plots can also help delineate the relative effect of targeting A-, B-, or X-site ions for property tuning. In Sn 2+ /Ge 2+ mixing, for example, it is clear that choice of A-site and X-site ions drastically offsets the absolute values of bandgap in a largely predictable fashion, while the range affected by B-site substitution is relatively minimal.
While it is encouraging that a fast-screening dummy crystal approach can uncover non-trivial trends and reproduce work from the literature, the 400 random configurations explored here at each composition barely scratch the surface of the possibilities available to the perovskites. A more comprehensive set of training geometries and elemental compositions will ultimately be required to gain higher confidence in trend and property prediction. This could be achieved, for example, via full integration of large existing datasets such as the OQMD, [30] utilizing force-field approximations to produce more realistic initial geometries, [38,39] or bootstrapping of cheaper DFT calculations using co-kriging methods. [40] Additionally, major gains are expected from a more advanced feature construction process. One can imagine tuning r max and dr, perhaps non-uniformly, to control feature sparsity, and even adding angular resolution to encode vector fields to capture effects such as van der waals interactions. Additional elemental properties and their combinations should also be explored in detail beyond those presented here to find the best descriptors for key material properties.
Nevertheless, our successful application of this algorithm in predicting E g , ΔH f , and ΔE hull shows that the method can be an efficient and accurate new tool for rapid exploration of several important properties for photovoltaics in this composition space. By transforming our descriptor space from atomic labels to a set of shared elemental properties, we can easily generalize to new chemical spaces and crystal configurations while simultaneously connecting feature construction to physically meaningful atomic properties. We are therefore able to control how information enters the machine and identify experimentally accessible factors relevant for the rational design of new perovskite materials through a careful selection of the descriptors.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.