A Deep Generative Modeling Architecture for Designing Lattice-Constrained Perovskite Materials

In modern materials discovery, machine learning (ML) techniques are now used to efficiently screen materials with target-specific properties for various engineering applications. However, a major challenge that persists with deep generative ML approaches is lattice reconstruction at the decoding phase, which leads to the generation of materials with low symmetry, unfeasible atomic coordination, and triclinic behavioral properties in the crystal lattice. To address this concern, the present research proposes a Lattice-Constrained Materials Generative Model (LCMGM) for designing new and polymorphic perovskite materials with crystal conformities that are consistent with predefined geometrical and thermodynamic stability constraints at the encoding phase. A comparison with baseline models such as the Physics Guided Crystal Generative Model (PGCGM) and Fourier-Transformed Crystal Property (FTCP) confirms the potential of the LCMGM for improved training stability, better chemical learning effect, and higher geometrical conformity. The new materials emerging from this research are validated with Density Functional Theory (DFT) and made openly available in the Mendeley data repository: https://doi.org/10.17632/m262xxpgn2.1.

The property mesh consists of one-hot encoded features that characterize the thermochemistry behavior of each atom belonging to a distinct site occupancy in the stoichiometry of the unit cell. Hence, for $ABX_3$ and $A_2BB'X_6$ stoichiometries, there are three and four encoded thermochemistry columns, respectively, which are stacked together. As illustrated in Fig. S1, there are twelve thermochemistry properties in total, which are concatenated along the row axis. Table S1 lists the numerical range of each property, in addition to the bin size used in the one-hot encoding process. To match the overall matrix size of the other meshes, the property mesh is zero-padded to produce a 132×8 two-dimensional array and is reshaped to form a 32×32 square matrix. Finally, the X-ray diffraction (XRD) mesh fingerprints the scaled XRD computed peaks/intensities onto a 32×32 matrix array. The XRD pattern is simulated using a CuKα wavelength (~1.5 Å) probe for 2θ diffraction angles ranging from 0° to 180°. Assembling all meshes together produces a 32×32×3 invertible RGB image. Figure S2 displays the mean squared error (MSE) distribution in pixel attributes between decoded and actual/calculated XRD patterns for over 70,000 crystals newly generated by the LCMGM, as it relates to crystal system type. The source code for calculating and projecting XRD peaks uses the XRDCalculator in the pymatgen.analysis.diffraction.xrd submodule [1,2]. This research makes available the codes and other

Table S1. Thermochemistry properties used in developing the property mesh array of the invertible mesh-grid descriptor. Each property is discretized using one-hot encoded features based on the bin size and numerical range of real values.
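The one-hot discretization and zero-padding described for the property mesh can be sketched as follows. This is a minimal illustration only: the property name, its numerical range, the bin size, and the site values are hypothetical placeholders (the actual ranges and bin sizes are those listed in Table S1), and the flatten-then-pad layout is an assumption rather than the authors' exact stacking arrangement.

```python
def one_hot_bin(value, low, high, bin_size):
    """One-hot encode `value` over equal-width bins covering [low, high]."""
    n_bins = int(round((high - low) / bin_size))
    # Clamp so that value == high falls in the last bin instead of out of range.
    index = min(max(int((value - low) // bin_size), 0), n_bins - 1)
    return [1 if i == index else 0 for i in range(n_bins)]


def pad_and_reshape(flat, side=32):
    """Zero-pad a flat 0/1 vector to side*side entries and fold it row by row."""
    if len(flat) > side * side:
        raise ValueError("feature vector does not fit in the target mesh")
    padded = flat + [0] * (side * side - len(flat))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Hypothetical property on the range [0.5, 4.0] with bin size 0.5, encoded
# for three site occupancies (A, B, X). All values are illustrative only.
site_values = {"A": 0.95, "B": 1.54, "X": 3.44}
columns = [one_hot_bin(v, low=0.5, high=4.0, bin_size=0.5)
           for v in site_values.values()]

# Assumed layout: concatenate the per-site columns, then pad to a 32x32 mesh.
flat = [bit for column in columns for bit in column]
mesh = pad_and_reshape(flat)
```

Each site contributes exactly one active bit per property, so the number of nonzero entries in the mesh equals the number of (site, property) pairs that were encoded.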

LCMGM neural network architecture and hyperparameters
For addressing the inverse design scheme, the Lattice-Constrained Materials Generative Model

Bayesian Optimization (BO) pseudocode
The BO algorithm is used to find the optimized lattice configuration $x$ (i.e. edge vectors and inter-axial angles) that best minimizes the total DFT energy ($E$). The BO algorithm as implemented in this study was developed using Scikit-Optimize (skopt) [8], which is a library for sequential model-based optimization. As demonstrated in the main article, the pseudocode can be described as follows:
1. For a newly generated perovskite $\hat{x}$, initialize the surrogate model with a dataset $D$ of initial points $x_i \subseteq \hat{x} \in \mathbb{R}^{32 \times 32 \times 3}$ and their corresponding true solutions $E_i$.

3. Return the best point found in $D$.
In the present research, the applied surrogate model is based on Gaussian Processes and the acquisition function is expected improvement. The surrogate model is initialized using 40 points, and the number of calls is observed for $T = 200$ iterations. It should be noted, moreover, that the mesh-grid descriptor concept is modified to include the lattice configuration features in order to effect the dynamic changes within the search space for finding the optimized configuration. For each considered crystal system, the search space dimensionality for guiding the acquisition function is outlined in Table 7 of the main article.
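The sequential loop described above (Gaussian-process surrogate, expected-improvement acquisition, initial design followed by repeated evaluation) can be illustrated with a small, dependency-free sketch. It is not the study's skopt-based implementation: the one-dimensional objective `toy_energy`, the kernel length scale, the candidate grid, and the reduced budget (5 initial points, 20 calls) are all hypothetical choices made so the example runs quickly. With skopt itself, the corresponding settings would presumably be passed to `gp_minimize` through `acq_func="EI"`, `n_initial_points=40`, and `n_calls=200`; whether the authors used exactly that entry point is an assumption.

```python
import math
import random


def rbf(x1, x2, length=0.3):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Forward substitution: solve low @ z = b."""
    z = []
    for i, row in enumerate(low):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def solve_upper(low, b):
    """Back substitution: solve low.T @ z = b."""
    n = len(low)
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (b[i] - sum(low[k][i] * z[k] for k in range(i + 1, n))) / low[i][i]
    return z


def fit_gp(xs, ys, jitter=1e-6):
    """Factorize the kernel matrix and precompute K^-1 y for a zero-mean GP."""
    gram = [[rbf(p, q) + (jitter if i == j else 0.0) for j, q in enumerate(xs)]
            for i, p in enumerate(xs)]
    low = cholesky(gram)
    return low, solve_upper(low, solve_lower(low, ys))


def predict_gp(xs, low, alpha, x_new):
    """Posterior mean and standard deviation at x_new."""
    k_star = [rbf(x_new, p) for p in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    return mean, math.sqrt(max(1.0 - sum(t * t for t in v), 1e-12))


def expected_improvement(mean, std, best):
    """EI for minimization: E[max(best - f, 0)] with f ~ N(mean, std^2)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def toy_energy(a):
    """Hypothetical stand-in for the DFT total energy versus edge length a."""
    return (a - 3.9) ** 2 - 5.0


def bayes_opt(objective, bounds, n_initial=5, n_calls=20, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    grid = [lo + (hi - lo) * i / 200 for i in range(201)]
    xs = [rng.uniform(lo, hi) for _ in range(n_initial)]  # step 1: initial dataset D
    ys = [objective(x) for x in xs]
    for _ in range(n_calls - n_initial):                  # step 2: sequential loop
        offset = sum(ys) / len(ys)
        centred = [y - offset for y in ys]
        low, alpha = fit_gp(xs, centred)                  # 2a: fit the surrogate
        best = min(centred)
        scores = [expected_improvement(*predict_gp(xs, low, alpha, g), best)
                  for g in grid]
        x_next = grid[scores.index(max(scores))]          # 2b: maximize acquisition
        xs.append(x_next)                                 # 2c-2d: evaluate, augment D
        ys.append(objective(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best]                         # step 3: best point in D


a_best, e_best = bayes_opt(toy_energy, bounds=(3.0, 5.0))
```

In the actual workflow the objective would be an expensive DFT total-energy evaluation over a multi-dimensional lattice search space, which is the setting where a surrogate-guided search is expected to pay off relative to exhaustive sampling.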
To further demonstrate the benefit of the BO algorithm in performing pre-DFT relaxation, Fig. S3(A) visualizes the variation in optimized versus non-optimized (i.e. generated) candidates with respect to their final DFT-relaxed forms, shown for the lattice a-edge vector (Å) only. Fig. S3(B) displays the average absolute error in matching lattice parameters for optimized and non-optimized candidates. Except for the b-edge vector, performing pre-relaxation with the BO algorithm yields lattice configurations that are closer to their final DFT-relaxed states. Further information on implementing the BO algorithm is available at www.github.com/chenebuah/LCMGM.

Prototypical geometries for querying the A-GAN model
For generating new candidates with higher geometrical conformity, the fully trained A-GAN model is queried using proven perovskite prototypes as reference geometries. In selecting prototypical structures, the present research prioritizes Inorganic Crystal Structure Database (ICSD) [9] perovskites because they are more thoroughly validated by physical experiments. However, for special instances where no suitable ICSD prototype exists (e.g. cubic $ABX_3$ stoichiometry), hypothetically discovered compounds that have been extensively investigated via ab initio computations are used. Table S4 outlines the prototypes used in this study, in addition to their entry IDs from the Materials Project (MP) database [10].

Newly designed LCMGM perovskites
Altogether, 124 perovskite materials were newly designed in this research. Among them, 72 are suggested to be new chemistries (i.e. unique and novel), as they cannot be found in established materials databases such as the Materials Project (MP) [10], the Open Quantum Materials Database (OQMD) [11], and Novel Materials Discovery (NOMAD) [12]. Table S5 provides a full breakdown of the new materials with respect to their determined lattice features. In addition, Table S6

Figure S1. Stacking arrangement for organizing discrete features with respect to the label mesh and property mesh arrays.

Figure S2. Mean squared error (MSE) distribution in pixel attributes between decoded and actual/calculated XRD patterns for over 70,000 crystals newly generated by the LCMGM.

(LCMGM) comprises three modeling phases, namely: (1) a semi-supervised variational autoencoder (SS-VAE); (2) an auxiliary generative adversarial network (A-GAN); and (3) geometrical optimization using Bayesian optimization (BO) and density functional theory (DFT). Both the SS-VAE and A-GAN are target-learning generative models built from deep neural networks. Tables S2 and S3 detail the modeling architectures and hyperparameters for designing the SS-VAE and A-GAN, respectively. The SS-VAE comprises four modeling networks (i.e. encoder, regressor, classifier, and decoder) that are graphically tied together in backpropagation. The A-GAN, on the other hand, consists of two typical neural networks (i.e. generator and discriminator) that compete against each other. Each neural network contains sub-networks that reflect the auxiliary generative approach for learning the geometrical lattice constraints. Given a perovskite unit cell, the geometrical constraints are the atomic coordinates ($(x, y, z) \in \mathbb{R}^{40 \times 3}$) and lattice parameters ($[a, b, c; \alpha, \beta, \gamma] \in \mathbb{R}^{2 \times 3}$). The modeling weights are continuously updated, as the predictor sub-model of the A-GAN's discriminator regressively predicts the unknown geometrical constraints and evaluates the error in prediction. Batch normalization is typically used between interconnecting layers for standardizing and stabilizing the networks. All neural network designs were scripted in Python using the Keras API [6] with the TensorFlow backend [7]. Further information on coding the LCMGM is available at www.github.com/chenebuah/LCMGM.
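To make the two geometrical constraint arrays concrete, the sketch below packs a unit cell into a 2×3 lattice-parameter array and a 40×3 coordinate array, and converts the lattice parameters into Cartesian cell vectors using the standard crystallographic convention. The zero-padding of unused atom rows is an assumption about how a fixed 40×3 shape could be obtained, and the example cell (an ideal cubic perovskite with a 3.905 Å edge) is illustrative only; none of the helper names come from the LCMGM code base.

```python
import math


def lattice_matrix(a, b, c, alpha, beta, gamma):
    """Cell vectors as rows, with a along x and b in the xy-plane (degrees)."""
    al, be, ga = (math.radians(t) for t in (alpha, beta, gamma))
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return [[a, 0.0, 0.0],
            [b * math.cos(ga), b * math.sin(ga), 0.0],
            [cx, cy, cz]]


def cell_volume(m):
    """Absolute determinant (scalar triple product) of the three cell vectors."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = m
    return abs(ax * (by * cz - bz * cy)
               - ay * (bx * cz - bz * cx)
               + az * (bx * cy - by * cx))


def pad_coordinates(frac_coords, max_atoms=40):
    """Zero-pad fractional coordinates to a fixed (max_atoms, 3) array."""
    if len(frac_coords) > max_atoms:
        raise ValueError("unit cell has more atoms than the fixed array allows")
    padding = [[0.0, 0.0, 0.0] for _ in range(max_atoms - len(frac_coords))]
    return [list(p) for p in frac_coords] + padding


# Illustrative ideal cubic perovskite cell: A at the corner, B at the body
# centre, X at the three face centres (fractional coordinates).
lattice_parameters = [[3.905, 3.905, 3.905],   # a, b, c in angstroms
                      [90.0, 90.0, 90.0]]      # alpha, beta, gamma in degrees
frac_coords = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5),
               (0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)]

coordinates = pad_coordinates(frac_coords)      # 40 rows of 3 values
matrix = lattice_matrix(*lattice_parameters[0], *lattice_parameters[1])
volume = cell_volume(matrix)
```

Keeping both targets at a fixed shape is what allows a regression head to predict them for unit cells with different atom counts; how the authors distinguish padded rows from real atoms is not stated here.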
2. For $t = 1, 2, \ldots, T$:
   a. Fit the surrogate model to the current perovskite dataset $D$.
   b. Choose the next point $x_t$ by optimizing the acquisition function over the surrogate model.
   c. Evaluate the objective function $E_t$, which is a function of $x_t$.
   d. Augment the dataset $D$ with the new point $\{x_{|D|+1}, E_{|D|+1}(x_{|D|+1})\}$.

Figure S3. Effect of the BO algorithm for finding the optimized three-dimensional lattice configuration. (A) Optimized versus non-optimized (i.e. generated) candidates with respect to their final DFT-relaxed forms for lattice a-edge vectors; (B) average absolute difference in DFT-relaxed parameters for optimized and non-optimized candidates.

Table S2. Modeling architecture and hyperparameters for designing the SS-VAE model of the LCMGM.

Table S3. Modeling architecture and hyperparameters for designing the A-GAN model of the LCMGM.

Table S4. Examples of prototypical perovskite structures from the Materials Project (MP) database used in querying the A-GAN model of the fully trained LCMGM. The number of atoms in the conventional unit cell is also listed.

Table S5. Lattice features of new materials designed by the LCMGM. For the listed materials, DFT calculations are strictly reported using GGA-PBE functionals.
reports the calculated thermodynamic and bandgap properties of the new materials. Unless otherwise stated, the materials are reported based on DFT evaluation using GGA-PBE functionals. The new materials dataset can be accessed from the Mendeley data repository at https://doi.org/10.17632/m262xxpgn2.1. The raw Crystallographic Information Files (CIF) and Quantum ESPRESSO output files can also be downloaded from https://github.com/chenebuah/LCMGM.

Table S6. Thermodynamic stability and bandgap properties of new materials designed by the LCMGM. For the listed materials, DFT calculations are strictly reported using GGA-PBE functionals.