Machine Learning Modeling of Materials with a Group-Subgroup Structure

A cornerstone of materials science is Landau's theory of continuous phase transitions. Crystal structures connected by Landau-type transitions are mathematically related through group-subgroup relationships. We introduce "group-subgroup learning" and show that training on small-unit-cell phases of materials decreases out-of-sample errors when modeling larger phases. The proposed approach is generic: it is independent of the ML formalism, descriptors, or datasets, and is extendable to other symmetry abstractions such as spin, valency, or charge order. Since available materials datasets are heterogeneous, with too few examples for realizing the group-subgroup structure, we present the "FriezeRMQ1D" dataset of 8393 Q1D organometallic materials uniformly distributed across seven frieze groups and provide a proof of concept. For these materials, we report < 3% error with 25% training using the Faber-Christensen-Huang-Lilienfeld (FCHL) descriptor, and compare its performance with a fingerprint representation that encodes materials composition as well as crystallographic Wyckoff positions.


I. INTRODUCTION
Machine learning (ML) modeling has surged in computational chemistry and materials science because ML models, once trained sufficiently well on computed properties, can deliver accurate new predictions at a cost lower than that of the reference method by orders of magnitude [1,2]. Developments of such novel, data-driven methods are fueled by pioneering efforts in designing and generating big data using high-throughput computation [3][4][5][6][7][8][9]. While for molecules such campaigns aim at complete coverage of synthetically feasible chemical compound space [10], for materials, the coverage is mostly inspired by experimentally known structures [11]. Symmetry adaptation in ML has its merits for potential energy surface modeling [12]. However, across stoichiometries, most of the information about the total electronic energy is encoded in atom-in-molecule-based fragments [13] that seldom show any symmetry relationships with the molecule.
In this study, we propose the group-subgroup learning approach for ML modeling on materials datasets containing multiple crystal structures related through a phase transition. Landau applied group theory to understand the continuous phase transition of a material from a phase of high-symmetry space group, G_h, to a phase of low-symmetry space group, G_l, which is a subgroup, G_l ⊂ G_h [14].
Landau theory states that symmetry breaking occurs through a collective variable that transforms according to a single irreducible representation of G_h [15]. A systematic development of structural relationships between different crystal structures was carried out by Bärnighausen using crystallographic group-subgroup relations [16]. These relationships are represented in a graphical tree (Bärnighausen tree, or B-tree) with the most symmetric space group placed at the top. A B-tree not only encodes group-subgroup relations but also correlates the Wyckoff positions during symmetry reduction [16]. The so-called translationengleiche subgroups (t-subgroups) are related to their supergroup by a decrease in the point group symmetry of the lattice and conservation of the translation symmetry. A group can also have klassengleiche subgroups (k-subgroups), which retain the point group symmetry while the translational symmetry is compromised [16,17]. A material's cohesive energy, which largely arises from intra-unit-cell atomic interactions in a group, should be similar in its t-subgroups and t-supergroups. This similarity should be greatest when going from a group to its maximal subgroup or minimal supergroup.
For a target accuracy, ML modeling of materials belonging to a single space group requires a smaller training set than training on materials across space groups [18]. When modeling a space group with a large unit cell, the proposed group-subgroup learning approach facilitates both a reduction in the training set size and instantaneous generation of the descriptor for a query. The key features of group-subgroup learning are: 1) For modeling a class of materials belonging to one or more groups with large unit cells, the training set includes phases of the same materials with smaller unit cells. For conventional unit cells, moving from a group to its t-subgroup decreases the unit cell size by a factor, n, which equals the ratio between the orders of the group and its subgroup. 2) A query on a large-unit-cell material uses a prototype descriptor made of equilibrium geometries of smaller unit cells. For this, one applies the splitting pattern of the Wyckoff positions (listed in the B-tree) during symmetry lowering from G_h to its subgroup G_l. Generating such commensurate prototype unit cells also ensures a homogeneous descriptor size, establishing a faithful correlation with the cohesive energy per atom, a size-intensive quantity.

FIG. 1. a) B-tree for the 7 frieze groups. The Wyckoff position occupied by the unit cell of a frieze group is shown in the box next to its name. Klassengleiche subgroups of the frieze groups are linked with an arrow marked k2, and translationengleiche subgroups are marked t2. b) Definitions of the 6 degrees of freedom relaxed during the constrained optimization of FriezeRMQ1D materials: θ1 and θ2 are the angles through which the ring is rotated around the x and y axes, respectively; tx, ty, tz are the components of the translation vector of the Cu atom; c is the lattice constant. c) The seven crystallographic frieze groups and their lattice arrangements. Atoms in the solid red box comprise the unit cell. Common names of the frieze groups are stated below each unit cell. Mirror planes are shown as solid lines and glide planes as dotted lines; a red oval symbol signifies a 2-fold rotation axis. For all frieze groups, the corresponding space group names/numbers and crystal systems are also given.

II. RING-METAL Q1D MATERIALS IN SEVEN CRYSTALLOGRAPHIC FRIEZE GROUPS
Our dataset consists of Q1D materials comprising a ring-metal pair as the formula unit. In a previous study, we reported geometries, electronic, and phonon properties of these materials with 11 monovalent metals (Na-Cs, Cu-Au, Al-Tl) and 109 heterocyclic rings generated by combinatorial substitution of C atoms in the cyclopentadienyl ring with B, N, or S atoms of all possible valencies [19]. In this work, we extend the dataset by optimizing the lattice geometries according to the 7 frieze group symmetries. In a frieze pattern, commonly encountered in architecture or textile design, a two-dimensional motif repeats in one direction.
A frieze group is the set of all elements that define the symmetries of a frieze pattern. For every frieze group, there is an isomorphic crystallographic space group. Since the lateral packing of the FriezeRMQ1D materials is less relevant for their electronic properties, we restrict our attention to only the 7 frieze groups rather than the 230 space groups. The B-tree of the frieze groups showing the group-subgroup interrelationships is presented in Figure 1a.
RMQ1D materials belonging to the smallest frieze group symmetry, p1, contain repeated ring-metal units.
While in the earlier study we performed a full relaxation of all 1199 FriezeRMQ1D materials in the p1 phase, here, to conserve computational effort, we have defined the most relevant internal coordinates involving translations, rotations, and reflections of the formula unit, as shown in Figure 1b.

TABLE I. Symmetry operations for generating the atomic coordinates of the formula units in a unit cell of polymeric multidecker sandwich complexes for the seven frieze group symmetries: Tj(ξ) denotes translation along the j-axis through a distance ξ, Rx(π) denotes a two-fold rotation about the x-axis, σij denotes mirror planes, and {r^J_k} denotes the set of atomic coordinates belonging to the J-th unit. Each formula unit contains a five-membered cycle and a metal center. Also given are the numbers of degrees of freedom for the crystal structure relaxation (a). Columns: group, N, Wyckoff positions of formula units. (a) In all cases, the six internal degrees of freedom within a unit are defined in Figure 1b.

In principle, stable phases with a larger unit cell can be realized by beginning with more than one formula unit in the unit cell. However, with a full relaxation, it is likely that the final crystal structure will be of lower symmetry. For finding minimum-energy configurations corresponding to the 6 larger frieze groups, constrained geometry relaxation is inevitable [20]. Such constrained optimizations can be performed using atomic coordinates and the Wyckoff positions of the frieze groups. Groups p211, p1m1, p11m, and p11g contain two Wyckoff positions, while the larger groups p2mm and p2mg contain four Wyckoff positions. The symmetry of these larger groups necessitates additional degrees of freedom to be relaxed, as summarized in Table I. The structures of the seven frieze phases resulting from an application of the transformations presented in Figure 1b and Table I are shown in Figure 1c.
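The symmetry operations of Table I can be illustrated with a short sketch (not the authors' code): starting from the coordinates of one formula unit, a prototype two-unit cell is built by applying a mirror followed by a half-cell translation, as a p1 → larger-group construction might require. The operation names (T, Rx_pi, sigma_xy) follow the table's notation; the specific combination used in `prototype_two_unit_cell` is an illustrative assumption.

```python
import numpy as np

def T(coords, axis, xi):
    """Translation T_j(xi): shift all atoms along the given axis by xi."""
    out = coords.copy()
    out[:, axis] += xi
    return out

def Rx_pi(coords):
    """Two-fold rotation about the x-axis: (x, y, z) -> (x, -y, -z)."""
    return coords * np.array([1.0, -1.0, -1.0])

def sigma_xy(coords):
    """Mirror through the xy plane: (x, y, z) -> (x, y, -z)."""
    return coords * np.array([1.0, 1.0, -1.0])

def prototype_two_unit_cell(unit_coords, c):
    """Build a two-formula-unit prototype cell: the second unit is the
    mirror image of the first, translated by half the lattice constant c.
    (Hypothetical combination; the actual operations per group are in Table I.)"""
    second = T(sigma_xy(unit_coords), axis=0, xi=c / 2.0)
    return np.vstack([unit_coords, second])
```

Applying the full set of per-group operations from Table I in this way yields the commensurate prototype cells used for descriptor generation.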
It must be noted that the number of atoms in a material's unit cell is independent of the order of the crystallographic space group [21][22][23]. Consider structures in the widely studied perovskite family ABX3. The most symmetric structure with the smallest unit cell is cubic, in the space group Pm-3m. A simple mechanism of tilting the rigid octahedral units results in structures belonging to 15 subgroups with larger unit cells [24]. A counterexample is the Peierls transition in a 1D chain of H atoms. Here, the uniformly distributed chain belongs to the P1 group with one atom in the unit cell. Following the Peierls distortion, the unit cell contains an H2 molecule and the system is still in the P1 space group. Hence, the high-symmetry phase contains fewer atoms in the unit cell only when the phase is more compact (smaller lattice constants) than the low-symmetry phase.
Symmetry-constrained geometry relaxations were performed with the all-electron, numeric atom-centered orbital code FHI-aims [25] with the PBE [26] functional. Since the goal of the present study is to explore the capabilities of the ML approach, rather than to present a robust database for benchmarking first-principles methods against experiment, geometry optimizations were performed only for the 763 Cu-systems (109 rings combined with 7 frieze groups). Energies of the materials with other metals were calculated in a single-point fashion. In all calculations, we used 1×1×64 k-grids and the tight/tier-1 basis set. The search for a stationary point on the potential energy surface was performed with the BFGS minimization procedure using the PBE total energy. The geometries were relaxed until the change in energy between successive iterations fell below 10^-5 eV. The impact of the frieze group arrangement on the internal structure of the ring-metal formula unit is illustrated in Figure 2.

FIG. 2. Structure-property diversity in the FriezeRMQ1D dataset: (a) Preference for hapticity (η) for intra-formula-unit metal-ring bonding in FriezeRMQ1D materials, shown through the joint variation of the ring-metal distance, r (in Å), and the ring-slip angle, θ (in degrees). (b) Distribution of the thermodynamically most stable phase for each stoichiometry across the 7 frieze groups, shown for 1199 Cu-based materials.
A preference for co-planar arrangement of the ring-metal pair coincides with an increase in the ring-metal distance, r. Single point PBE0 [27] calculations were performed at relaxed geometries for accurate estimation of energies and band gaps.
The total computational costs for the PBE-level symmetry-constrained optimizations and the PBE0-level single-point energy evaluations were 14 CPU months and 45 CPU days, respectively.

III. KERNEL-RIDGE REGRESSION
We use kernel-ridge regression (KRR) [28,29] for modeling atomization energies, E, of the FriezeRMQ1D materials. The energy of a query material, q, is estimated as a linear combination of kernel functions, each centered on a training material, t. The kernel functions take as argument the similarity between the query and the training materials, quantified through a descriptor, d:

E_est(d_q) = Σ_t c_t k(d_q, d_t).

Here, we use a Laplacian kernel, k(d_q, d_t) = exp(−|d_q − d_t|/σ), where σ defines the length scale of the kernel function. The fit coefficients, c_t, are obtained by solving the linear system

(K + λI) c = E_ref.

In all ML calculations, we used a fixed value for the regularization strength, λ = 10^−6, as a preconditioning measure. For determining a suitable kernel width, we followed the "single-kernel recipe" proposed in Ref. 30,

σ = d_ij^max / log 2,

where d_ij^max is the largest descriptor difference among training entries. Use of the fixed hyperparameters λ and σ enables rapid training, facilitating adequate shuffling of the training set to prevent any bias. Since the goal of this study is to test the validity of modeling a superstructure phase by training on substructures, we limit our explorations to one of the best structure-based representations: FCHL [31,32]. This descriptor has a tensor structure, and a direct evaluation of the kernel matrix elements without storing the descriptor is the preferred approach, as implemented in the program QML [33]. The scope of the problem presented here is not limited by the representation; hence, one can also apply other descriptors such as the row-sorted Coulomb matrix [29], bag-of-bonds (BoB) [34], SLATM [35], MBTR [36], or SOAP [37] to exploit the group-subgroup structure in ML modeling of materials.
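A minimal NumPy sketch of the KRR procedure described above, assuming vector-valued descriptors with L1 (Laplacian) distances; the function names and the flat descriptor layout are illustrative, not the FCHL/QML implementation:

```python
import numpy as np

def train_krr(d_train, E_train, lam=1e-6):
    """Fit KRR coefficients c by solving (K + lam*I) c = E.
    d_train: (n, m) array of descriptors; E_train: (n,) target energies."""
    # pairwise L1 distances between training descriptors
    D = np.abs(d_train[:, None, :] - d_train[None, :, :]).sum(axis=2)
    # 'single-kernel recipe': kernel width from the largest descriptor difference
    sigma = D.max() / np.log(2.0)
    K = np.exp(-D / sigma)                 # Laplacian kernel matrix
    c = np.linalg.solve(K + lam * np.eye(len(E_train)), E_train)
    return c, sigma

def predict_krr(d_query, d_train, c, sigma):
    """Estimate energies as a kernel-weighted sum over training materials."""
    D = np.abs(d_query[:, None, :] - d_train[None, :, :]).sum(axis=2)
    return np.exp(-D / sigma) @ c
```

With λ = 10^-6, predictions at the training points reproduce the reference energies nearly exactly, which is the preconditioning role of the small fixed regularizer.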
The FriezeRMQ1D dataset is dominated more by compositional diversity than by structural variations, compared to larger datasets such as OQMD [6]. Hence, we also explore a fingerprint representation that encodes the stoichiometry of the materials and the Wyckoff positions of the 7 frieze groups. For a similar problem of modeling the energetics of 2 million elpasolites, a fingerprint descriptor delivered more accurate ML learning curves than one of the best structure-based formalisms, FCHL [18]. Further, while modeling on the ICSD [11] subset of OQMD, which contains only a few cases of multiple structures for the same chemical formula, augmenting a composition-based representation with structural information did not improve the prediction accuracy [38]. The problem can be rectified if a ∆-ML strategy [39,40] is adopted, where an inexpensive theory serves as a baseline for property estimation and for generating a structural descriptor. Due to the lack of rapid and reliable baseline models for the materials problem, studies have usually depended on fingerprint- [41-45], atom-level- [37], or composition-based representations [38] that do not require structural information for out-of-sample predictions. Our fingerprint representation encodes the materials' composition along with the crystallographic Wyckoff positions (Fig. 3). Making the fingerprint vector commensurate across frieze groups requires a minimum of 8 formula units to be considered to apply the group operations. The size of the vector per formula unit is 19 bits, of which the first 15 are reserved for the 5 ring-atom sites (each ring site requires 3 bits to store the 8 possible atomic combinations: C, CH, B, BH, N, NH, S, and SH), while the last 4 bits store the 11 metal types.
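The 19-bit-per-formula-unit layout can be sketched as follows. This is our reading of the text, not the authors' code: the ordering of site types and metals is an assumption, and the sketch omits the Wyckoff-position part of the full fingerprint.

```python
import numpy as np

SITE_TYPES = ["C", "CH", "B", "BH", "N", "NH", "S", "SH"]          # 8 types -> 3 bits
METALS = ["Na", "K", "Rb", "Cs", "Cu", "Ag", "Au",
          "Al", "Ga", "In", "Tl"]                                   # 11 types -> 4 bits

def encode_formula_unit(ring_sites, metal):
    """19 bits per formula unit: 3 bits for each of 5 ring sites, 4 for the metal."""
    bits = []
    for site in ring_sites:                        # 5 ring sites, 15 bits total
        idx = SITE_TYPES.index(site)
        bits += [(idx >> b) & 1 for b in (2, 1, 0)]
    m = METALS.index(metal)
    bits += [(m >> b) & 1 for b in (3, 2, 1, 0)]   # 4 bits for the metal
    return np.array(bits, dtype=np.int8)

def fingerprint(ring_sites, metal, n_units=8):
    """Repeat the formula-unit code over a fixed 8 units so the vector
    length is commensurate across all frieze groups."""
    return np.tile(encode_formula_unit(ring_sites, metal), n_units)
```

A full implementation would additionally place the repeated units according to the Wyckoff positions of the frieze group, as shown in Fig. 3.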

IV. GROUP-SUBGROUP MACHINE LEARNING
To understand the advantages of group-subgroup learning over conventional ML, we begin by presenting the learning curve for modeling the atomization energy of 1199 FriezeRMQ1D materials in the p2mm phase (see Figure 4). We present results for ML models using the fingerprint representation (Figure 4a) and the FCHL formalism (Figure 4b). When training directly on p2mm with the fingerprint representation, the offset of the learning curve is over 1 eV/atom. The offsets drop for all cases of group-subgroup learning to 0.4-0.5 eV/atom, indicating that the materials in p2mm have chemical similarity with those in all the subgroups. The performance of p11g, which is not a t-subgroup of p2mm, is only on par with the smallest subgroup, p1, implying that the lattice arrangement in a k-subgroup has less relevance to that in the supergroup. The only information common to the structures in both these subgroups is the intra-formula-unit features. Learning with p2 and p11m shows better rates than with p1 for smaller training set sizes. It is important to note that with increasing training set size, the learning curves of subgroup learning must coincide with that of direct learning. The better the subgroup, the lower the out-of-sample errors, approaching the direct-learning error asymptotically.
Learning with the best subgroup, p1m1, shows consistently lower errors, reaching an MAE of 0.07 eV/atom for 50% training.
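The learning curves discussed here can be generated along the following lines: shuffle the dataset, fit the KRR model of Sec. III on growing fractions, and average the held-out MAE over shuffles. This is a self-contained sketch with hypothetical helper names, not the authors' pipeline.

```python
import numpy as np

def krr_mae(d, E, train_idx, test_idx, lam=1e-6):
    """Fit a Laplacian-kernel KRR on the training split; return out-of-sample MAE."""
    D = np.abs(d[:, None, :] - d[None, :, :]).sum(axis=2)        # L1 distances
    sigma = D[np.ix_(train_idx, train_idx)].max() / np.log(2.0)  # single-kernel recipe
    K = np.exp(-D / sigma)
    c = np.linalg.solve(K[np.ix_(train_idx, train_idx)]
                        + lam * np.eye(len(train_idx)), E[train_idx])
    pred = K[np.ix_(test_idx, train_idx)] @ c
    return np.abs(pred - E[test_idx]).mean()

def learning_curve(d, E, fractions, n_shuffles=10, seed=0):
    """MAE averaged over shuffled train/test splits for each training fraction."""
    rng = np.random.default_rng(seed)
    n = len(E)
    out = {}
    for f in fractions:
        maes = []
        for _ in range(n_shuffles):
            idx = rng.permutation(n)          # shuffle to prevent any selection bias
            n_train = max(2, int(f * n))
            maes.append(krr_mae(d, E, idx[:n_train], idx[n_train:]))
        out[f] = float(np.mean(maes))
    return out
```

Plotting the resulting MAE against training-set size on log-log axes gives learning curves of the kind shown in Figure 4.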
Learning curves for modeling p2mm energies with FCHL are provided in Figure 4b.
The offset of direct p2mm modeling for 0.1% training (the number of training entries is 1) is over 3.1 eV/atom. While not evident from Figure 4, learning curves for structure-based representations such as FCHL typically exhibit continuity when approaching the limit of zero training-set size, where the offset should be the mean of the absolute target property values. This is in agreement with the actual offset, 4.1 eV, calculated over all 1199 p2mm energies. With FCHL, the offsets for all subgroup-learning curves are noticeably smaller than those with the fingerprint vector. The subgroup delivering the least error is consistent across the representations; in both cases the MAE with 50% training is 0.07 eV/atom. We compare the performance of subgroup learning with ∆-ML [39], for which energies from the subgroups are used as baselines. While the atomization energies of the subgroup structures better approximate the supergroup ones, their performance as a baseline for ∆-ML was found to be very poor (see Figure 5). For both the fingerprint and FCHL descriptors, the learning curves converge poorly towards zero, indicating weak learning. Rather than a good rank correlation between the baseline and target quantities, it is the structure-property correlation between the representation and the '∆' that determines the prediction accuracy in ∆-ML. For the FriezeRMQ1D materials, the difference between the p2mm and p1m1 energies amounts to the packing interaction between the formula units, while intra-formula-unit interactions, such as the ring-metal attraction, cancel. On the other hand, with global descriptors such as those used in this work, a major part of the descriptor difference arises from short-ranged ring-metal interactions.
While the FCHL learning curve shows a slower learning rate when comparing the offset with the error at 75% training, the fingerprint representation indicates faster convergence because the descriptor vector encodes intra-formula-unit and inter-formula-unit features explicitly. Based on these findings, one can anticipate that a material descriptor that performs well for modeling total electronic energies may perform poorly for second-order interactions such as packing energies in crystals.
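For completeness, the ∆-ML comparison can be sketched as follows: instead of learning the p2mm energy directly, one learns the difference between the p2mm target and a subgroup (e.g. p1m1) baseline energy, adding the baseline back at prediction time. Variable names are illustrative, not from the authors' code.

```python
import numpy as np

def delta_ml_predict(d_train, E_target_train, E_base_train,
                     d_query, E_base_query, sigma, lam=1e-6):
    """Delta-ML: KRR on the target-minus-baseline energy difference."""
    D_tt = np.abs(d_train[:, None, :] - d_train[None, :, :]).sum(axis=2)
    delta = E_target_train - E_base_train          # the '∆' being modeled
    c = np.linalg.solve(np.exp(-D_tt / sigma) + lam * np.eye(len(delta)), delta)
    D_qt = np.abs(d_query[:, None, :] - d_train[None, :, :]).sum(axis=2)
    # prediction = baseline energy + learned correction
    return E_base_query + np.exp(-D_qt / sigma) @ c
```

As discussed above, this scheme succeeds only when the descriptor correlates with the '∆'; for FriezeRMQ1D, the '∆' is dominated by packing interactions that the global descriptors barely resolve, which is why the ∆-ML curves in Figure 5 converge poorly.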
Percentage errors for various group-subgroup combinations are collected in Table II. For both high-symmetry phases, p2mm and p2mg, better performances are seen when using a maximal subgroup rather than p1. For p2mg, all three maximal subgroups have resulted in lower prediction errors compared to using p1 symmetry, even though one can expect the internal degrees of freedom in this phase to be similar to those of p1.
As stated in Section II, for materials such as perovskites, the low-symmetry phases contain larger unit cells. Hence, the target phases of interest correspond to subgroups.
The ML formalism presented here can also be applied to such cases, where the roles of groups and subgroups are reversed compared to the situation in the FriezeRMQ1D materials. To validate this point, we present results for all possible group-subgroup modelings in Fig. 7. For the FriezeRMQ1D materials, the computationally most demanding phases are p2mm and p2mg; including these in the training set to model materials in subgroups such as p1 also yields favorable results compared to direct modeling on p1.
Generalizing the results discussed above, the FCHL approach, in combination with p1 data, facilitates modeling of atomization energies in p2mm with a prediction error of < 3% using 10% reference data (see Figure 6). Although the FriezeRMQ1D materials are model systems, similar savings can be expected for ternary materials such as perovskites or transition-metal-based pnictides/chalcogenides. It must be noted that the cost of generating data for many smaller unit-cell materials is often negligible compared to that of the large unit-cell ones. For instance, the ratio between the times taken for single-point energy calculations in the p1 and p11m phases of the cyclopentadienyl-Cu system is about 0.015. Modeling the energies of a few thousand ternary materials thus demands a data-generation cost for only a few hundred examples in the complex phase; compared to this, the cost of performing DFT calculations for the entire set in a compact phase should be negligible.

V. CONCLUSIONS
We have presented the group-subgroup learning formalism for efficient modeling of materials properties in complex phases with multiple formula units in the unit cell.
The approach exploits transformation relationships between crystallographic space groups and is applicable to materials datasets where more than one crystal structure is available per stoichiometry. To provide a proof of concept, we have generated a new Q1D materials dataset, "FriezeRMQ1D", with chemical compositions spanning 11 metals, 109 rings, and seven frieze group symmetries. To facilitate further symmetry-based explorations, all data generated for this study are made publicly available. For the resulting 8393 materials, minimum-energy geometries of the desired symmetries were calculated using constrained lattice relaxations. The target property of interest is the atomization energy calculated at the hybrid-DFT level. The dataset is one-hot encodable; hence, we designed a fingerprint vector representation containing information about the cyclic structure of the ring, its stoichiometry, the metal bonded to the ring, and the frieze group symmetry of the lattice. In addition, we have tested the performance of the FCHL formalism, which captures structural and alchemical similarities. We rely on the percentage error as a more reliable error metric than absolute errors, because other studies have shown the prediction errors of ML models for materials to depend strongly on the dataset [31].
We have analyzed the performance of group-subgroup learning through all possible combinations of frieze groups. Irrespective of the materials representation, we find the maximal subgroup of a group to deliver the lowest prediction errors. Furthermore, we have shown the proposed formalism to be orthogonal to the ∆-ML approach. Application of the ∆-ML idea to this dataset, using the energies of the p1m1 phase as a baseline to model p2mm energies, exposes poor structure-property correlations between global descriptors and the '∆' energies. The most attractive feature of group-subgroup learning is that it alleviates the explicit dependence of materials descriptors on DFT-level minimum-energy geometries. While structural descriptors such as FCHL, MBTR, or SOAP by design deliver the best prediction accuracies when using minimum-energy geometries to predict ground-state total energies, such a dependency tremendously restricts the application domain of the models. Hence, constructing structural descriptors for materials phases with a complex unit-cell arrangement from the geometries of compact phases, using the knowledge of Wyckoff-position splitting, widens the application domain of ML for materials modeling.
The general conclusions drawn from our results are independent of the ML approach, datasets, or materials representation. It will be interesting to see whether this strategy can be adapted to modeling economically important materials such as perovskites [24] or Fe-based pnictides [46], which have received much attention as superconductor candidates.
Similarly, the chemistries common to the non-magnetic and spin-collinear phases of materials can be exploited using group-subgroup relations for the distribution of local magnetic moments in materials.

VI. ACKNOWLEDGMENTS
We acknowledge support of the Department of Atomic Energy, Government of India, under Project Identification No. RTI 4007. All calculations have been performed using the Helios computer cluster, which is an integral part of the MolDis Big Data facility, TIFR Hyderabad (http://moldis.tifrh.res.in).