Crystal structure prediction by combining graph network and Bayesian optimization

We developed a density functional theory (DFT)-free approach for crystal structure prediction, in which a graph network (GN) is adopted to establish a correlation model between the crystal structure and formation enthalpies, and Bayesian optimization (BO) is used to accelerate the search for crystal structure with optimal formation enthalpy. The approach of combining GN and BO for crystal structure searching (GN-BOSS) can predict crystal structures at given chemical compositions with and without additional constraints on cell shapes and lattice symmetries. The applicability and efficiency of the GN-BOSS approach is then verified by solving the classical Ph-vV challenge. The approach can accurately predict the crystal structures with a computational cost that is three orders of magnitude less than that required for DFT-based approaches. The GN-BOSS approach may open new avenues for data-driven crystal structural predictions without using expensive DFT calculations.

Predicting crystal structure at a given chemical composition prior to experimental synthesis has attracted a long-lasting interest in condensed matter science. Earlier attempts based on empirical rules provide qualitative description of structures, for example, Pauling's five rules for ionic crystals 1 , Goldschmidt's tolerance factor for perovskite formability 2 , and dimensional descriptors to classify the zinc-blend (ZB)/wurtzite (WZ) and rock-salt (RS) structures for binary semiconductor compounds 3,4 .
Owing to reliable energy calculation via density functional theory (DFT), the current state-of-the-art approaches for crystal structure prediction (CSP) mainly combine DFT calculations with structural searching algorithms such as (quasi-) random search 5,6 , simulated annealing 7 , genetic algorithm 8-11 , particle-swarm optimization (PSO) 12,13 , and differential evolutionary process 14 . These approaches extensively explore the structural candidates via searching algorithms, adopting DFT-calculated energy as a stability metric. The necessary DFT calculations involve the evaluation of numerous structural candidates in the process of structure searching and are thus time-consuming.
Advancement of machine learning (ML) in materials science has recently focused on its applications in predicting materials properties such as formation enthalpies (ΔH) 15,16 , Gibbs free energies 17 , bandgaps 18,19 , even wave function and electron density 20 , X-ray absorption spectra 21 , and phase transitions 22 . The accuracy of this approach is close to that of quantum mechanics calculations, but the computational costs are orders of magnitude lower. In addition to the influence of compositional atoms, the influence of their spatial arrangement, i.e., crystal structure, on materials properties has recently been analyzed via structural characterizing approaches such as the Wyckoff-species matrixbased method 23 , Voronoi tessellation method 24 , and graph network 18,19 . A crystal G can be represented by a vector ( ) ii N ii N = = CRL , where {Ci} and {Ri} are elemental features and coordinates of the ith atom, N is the total number of atoms in a periodic cell, and L is the vector (a, b, c, α, β, γ) defining the cell shape. In these approaches, crystal structures are transformed to a physically meaningful and algorithm-readable data format, such as a symmetry-invariant matrix 23 , bond configurations 24 , or crystal graphs 18 , enabling the establishment of a correlation model between a crystal and its formation enthalpy as follows: In principle, CSP can be efficiently performed using equation (1)  Despite this potential advantage, the practical approach of DFT-free CSP still has challenges 25 .
First, the ML model should have a sensitive response to the crystal structure; therefore, the fixedstructure model 15,26 and symmetry-invariant model 23 31 . Data cleaning was performed to exclude data with incomplete information and restrictions: i) the number of atoms in the unit cell (less than 50), ii) PBE as exchange-correlation functional, and iii) kinetic energy cutoff (520 eV), making data as reliable and comparable as possible.
Accordingly, we obtained more than 320,000 data points, including ~40,000 experimentally known ones and ~280,000 hypothetical ones, covering 85 elements, 7 lattice systems, and 167 space groups.
The best model was selected to minimize the errors between the GN-predicted and DFT-calculated ΔH.
Previous studies 18,19 have shown that the network architecture and update functions have a significant impact on the model performance. Here, we incorporated the MEGNet architecture 19 (Fig. S1) using the atomic features 32 and structural features of Voronoi polyhedrons 33 . The MEGNet architecture was adopted because it is a universal framework for crystals and molecules 19 , facilitating the future extension of the current approach to molecular structure prediction. In particular, our GN models can reach a MAE of 26.9 meV/atom when the portion of training data is 50% (Fig. 2), which is less than that of previous models such as CGCNN (39 meV, MP database) 18 , iCGCNN (30.5 meV, OQMD database) 33 , and original MEGNet (28 meV, MP database) 19 . We observed a significant decrease in MAE as the amount of training data increased, and the values rapidly converged below 50 meV when the portion of training data reached as low as 25%, showing the generality of the current GN model.
In the framework of GN, CSP is an optimization problem to identify the crystal graph is successfully applied in DFT-based CSP [12], we also combined GN with PSO for a comparative study.
First, for a given number of atoms in a cell, n initial structures R L of random cell shape and atomic positions were generated with or without crystal symmetry constraints (among 230 space groups excluding P1), and their corresponding structural features were obtained by conducting a Voronoi analysis 24 . Accordingly, ΔH's were predicted using our GN model (equation [1]). Next, the TPE-based Gaussian mixture model and acquisition functions were used to suggest a potential global minimum at structural spaces. Subsequently, its true formation energy was obtained using our GN model and iteratively fed back into the model to refine the BO model to suggest the (n+2)th structure.
For practical implementation, we added an additional constraint to avoid the generation of unreasonable structures with extremely small or large volumes. In particular, the surrogate function in the BO model is a type of approximation to the black-box function in equation (1), based on the limited dataset of (G, ΔH).  Fig. 3(b). Next, the formation enthalpy decreased rapidly, and the reasonable structures without wrong bonds appeared as in Fig. 3(c). After 3000 steps, the four-coordination structure [ Fig. 3(d)] was found, and after 5000 steps, the ground-state RS structure [ Fig. 3(e)] was successfully discovered. Because the accuracy of the current GN model is mildly inferior to state-of-the-art DFT calculations (Fig. S2), the final structure in Fig. 3(e) still has some distortions. Additional BO steps may help to minimize these distortions (Fig. S3). However, predicting a perfect RS structure remains a current challenge because the GN model is still inferior to accurately perceive small coordinate perturbation to total energies. We further used DFT calculations to relax the GN-BO predicted structures and obtain the precise crystal structures. To verify that the ground state crystal structures were derived using GN-BO rather than DFT relaxations, we adopted 18 prototype structures of AB compounds, which frequently appeared in the Inorganic Crystal Structure Database 39 as initial structures of CaS, and only three of them were relaxed to RS structure (Fig. S4). These results further confirm that the ground state structure is derived using the GN-BO approach, not by DFT relaxation. This indicates that DFT relaxations help to find the precise location of a local minimum in structural space ( ) 1,4 { },  (Fig. 4). Finally, we adopted GN-BO-SYM and DFT-PSO-SYM to predict C (8 atoms), Si In conclusion, we combined graph network and Bayesian optimization as a GN-BOSS approach for crystal structure prediction without performing DFT calculations. GN-BOSS was then applied to successfully predict the crystal structures of typical I-VII (I=Li, Na, K, Rb, Cs; VII=F, Cl) and II-VI Accordingly, we obtained more than 320,000 data points has been obtained, including ~40,000 experimentally known ones and ~280,000 hypothetical ones, covering 85 elements, 7 lattice systems, and 167 space groups. All data were divided into three parts: training set (50%), validation set (12.5%), and test set (37.5%). Fig. 5, a crystal structure is considered as a crystal graph G, numerically represented using atomic and structural features [18,19]. A crystal graph G can be represented by a vector ({Ci }i=1..N, {Ri} i=1..N, L), where {Ci} and {Ri} are elemental features and coordinates of the ith atom, N is the total number of atoms in a periodic cell, and L is the vector (a, b, c, α, β, γ) 6). The MEGNet architecture was adopted because it is a universal framework for crystals and molecules, facilitating the future extension of the current approach to molecular structure prediction.

Features. As shown in
Bayesian optimization. Here, we applied BO via a Gaussian mixture model based on tree of Parzen estimators (TPE) [34] to explore the structural space. BO is a selection-type algorithm that is compatible with the black-box ML model, demonstrating a great capability to identify the global minimum [35,36], and has been recently combined with DFT calculations for CSP in fixed crystal systems [37]. Compared to the normal BO algorithm based on the Gaussian process, which performs better in low-dimensional space (number of features less than 20), the TPE-based Gaussian mixture model demonstrated higher efficiency in high-dimensional space [34].

Code availability
The code for GN-BOSS approach is available on http://www.comates.group/links?software=gn_boss All GN-based results reported in this work can be reproduced by this code. The DFT-based results are produced by CALYPSO code.