Genarris 2.0: A Random Structure Generator for Molecular Crystals

Genarris is an open source Python package for generating random molecular crystal structures with physical constraints for seeding crystal structure prediction algorithms and training machine learning models. Here we present a new version of the code, containing several major improvements. A MPI-based parallelization scheme has been implemented, which facilitates the seamless sequential execution of user-deﬁned workﬂows. A new method for estimating the unit cell volume based on the single molecule structure has been developed using a machine-learned model trained on experimental structures. A new algorithm has been implemented for generating crystal structures with molecules occupying special Wyck-oﬀ positions. A new hierarchical structure check procedure has been developed to detect unphysical close contacts eﬃciently and accurately. New intermolecular distance settings have been implemented for strong hydrogen bonds. To demonstrate these new features, we study two speciﬁc cases: benzene and glycine. For all polymorphs, the ﬁnal pool either contained the experimental structure, or structures with similar lattice parameters, symmetry, and packing motifs.


Introduction
The properties of molecular crystals depend not only on their constituents but also the relative arrangement of the molecules inside the unit cell.Properties such as the stability [1][2][3], electronic conductivity [4][5][6][7][8], solubility and bioavailability [9,10], have all been observed to vary as a function of the molecular crystal solid state form.The molecules comprising these crystals are held together by weak intermolecular interactions [11,12] and thus can commonly be experimentally synthesized in multiple forms [13,14].This phenomena, known as polymorphism, has been of great importance to pharmaceutical research and for the design of high performance organic electronics [5,15,16].
The field of crystal structure prediction (CSP) is devoted to the prediction of the solid state forms of a molecule [17][18][19][20][21][22][23].CSP requires algorithms that can efficiently generate new structures in order to sample the high dimensional configuration space associated with molecular crystals [23,24].Random, and quasi-random, sampling of the configuration space has been established as a critical component of CSP workflows within the Cambridge Crystallographic Data Centre (CCDC) CSP blind test [17][18][19][20][21][22].Most of the groups that participated in the sixth CSP blind test used a random crystal structure generation method [25][26][27][28][29]. Random crystal structure generation methods identified four of the five, chemically diverse target systems in the sixth blind test, demonstrating their importance for CSP [22].
Random crystal structure generation methods for CSP follow a similar procedure.First a space group [30] is chosen for the new structure.Second, random unit cell parameters commensurate with the space group's crystal system are generated.Third, the molecule positions and orientation of each independent molecule are randomly sampled within the asymmetric unit.Finally, the symmetry operations of the space group are applied to the asymmetric unit generating all molecules in the unit cell.The generated structures are subsequently relaxed using either a system specific force field [31][32][33] or a fully ab initio approach [26,34,35].The success of random structure generation stems comes from unbiased and diverse sampling covering the potential energy surface, followed by a structural relaxation to the nearest local minima, hopefully converging to all experimentally observed polymorphs [25][26][27].
Despite their overall similarity,, structure generation methods from the sixth blind test differ in subtle ways.Structure parameters may be sampled using either a uniformly random number generator [26,34], or quasi-random, low discrepancy sequences [25,27,28].Structure generation may be performed over all space groups [34], or using only the most common space groups [24,27] observed in the Cambridge Structural Database (CSD) [36,37].A critical component of the generation procedure is approximating the volume of the molecular crystal before generation.Several methods have been proposed, such as adding up atomic volumes [26], using the morphology of the molecule [25,38], or relaxing a few handmade structures [27,29].It has been demonstrated that random CSP methods may be sensitive to the choice of unit cell volume [25].Therefore, it's important to use an accurate volume estimation method.Additionally, structures with reasonable densities are typically closer to their respective local minima making structure relaxations more efficient.Lastly, most random crystal structure generation packages are only capable of generating struc-tures in general Wyckoff positions and rely on the serendipitous generation of structures with molecules occupying special positions.However, analysis of the CSD has shown that molecules with internal symmetry often occupy special Wyckoff positions [39].
Here we present a new version of Genarris [34], an open source Python package that performs random structure generation for homomolecular molecular crystals of semi-rigid molecules with no bond rotational degrees of freedom using general and special Wyckoff positions.Genarris 2.0 offers several improvements over the previous version.The parallelization model has been changed from Python multiprocessing to MPI for Python (mpi4py) [40] to enable more efficient utilization of many cores and seamless sequential execution of user-defined workflows.A new machine learning method for volume estimation, based on a topological molecular descriptor, provides accurate volume predictions across a chemically diverse dataset from the CSD.The speed of structure generation has been significantly increased by developing a new hierarchical scheme for intermolecular distance checks.New settings have been implemented to improve structure generation for systems with strong hydrogen bonds.The performance of new the features in Genarris 2.0 is demonstrated for glycine, which contains relatively strong intermolecular hydrogen bonds, and benzene, a symmetric molecule occupying special Wykckoff position with an inversion center.

Code description
Genarris 2.0 is written in Python 3, with the exception of the new structure generation function, Pygenarris, which is written in C and automatically compiled and installed into Genarris 2.0 as a Python library.Genarris only requires standard Python libraries to install on any machine (i.e.numpy, sci-kit learn, mpi4py, spglib, pymatgen,and ASE, PyTorch).Genarris 2.0 is parallelized with MPI, using mpi4py (Sec.2.1).Pygenarris has additional built in OpenMP support.For energy evaluations and geometry relaxations, Genarris currently interfaces with the electronic structure package FHI-aims [41].It may be adapted to interface with any other electronic structure, force field, or machine learning package that accepts an MPI communicator as an argument.
The workflow of Genarris 2.0 is depicted in Figure 1.It begins by estimating the crystal unit cell volume.Given the desired number of molecules per unit cell (Z), the estimate is obtained by relaxing the single molecule geometry and applying a machinelearned model trained on a dataset of experimental structures from the Cambridge Structural Database (CSD) (Sec.2.2).Crystal structure generation begins by determining all compatible space groups given the molecular geometry within a user-defined symmetrization tolerance (Sec.2.3).Genarris moves sequentially through this list of space groups, generating a user-defined number of structures per allowed space group and checking them to ensure that no two molecules are unphysically close to each other (Sec.2.4).If the userdefined maximum number of consecutive failed generation attempts for a space group is reached, Genarris will proceed to the next space group on the list.
Once a "raw" pool of physically reasonable, random structures is generated, a userdefined sequence of energy evaluation, clustering, and selection steps may be performed to produce a smaller curated pool of structures, which can be used, e.g., as an initial population for a genetic algorithm [42,43].For clustering, Genarris uses the affinity propagation (AP) machine learning algorithm [44].Two types of feature vectors are available in Genarris 2.0, the relative coordinate descriptor (RCD) [34] and a radial distribution function (RDF), implemented in PyTorch, similar to that described by Ref. [45].Three workflows for down selection have been proposed previously [34].Here, a new "Robust" workflow is proposed (Sec.2.5).Lastly, full geometry relaxation may be performed for the final pool of structures.
Genarris 2.0 automatically executes all the procedures in the user-defined procedure list in the order specified.A single input file contains the user defined settings for all desired procedures.This includes the number of cores to be used for each procedure, as different procedures scale differently (see Sec. 2.1).Genarris can infer some parameters from previous sections of the workflow.For example, the output file containing the relaxed geometry of the single molecule becomes the default molecule path of subsequent sections if it exists.The user may reorder the procedures as long as the dependencies are satisfied (e.g., feature vector calculation must be performed prior to clustering).
If Genarris is restarted, it determines which step in the procedure list was not completed and resumes from that point.Restarts for generation are implemented by parsing the geometry output file for the last generated space group, and continuing from there.
Finished RCD calculations are output as files which permits easy restarts.Restarts for RDF calculations are unnecessary because they take less than a minute for several thousand structures.Genarris supports restarts of FHI-aims jobs by first determining which of them are incomplete and then automatically relaunching them from the last relaxation step.

Parallelization
Genarris 2.0 is parallelized using the message passing interface (MPI) paradigm via the mpi4py package.MPI enables immediate cross-platform portability without code changes.The structure generation function in Genarris 2.0 determines the number of allowed space groups for the given molecule, n, and accepts as input the number of structures to generate for each of these space groups.Hence, structure generation and subsequent structure checks (Sec.2.3) are embarrassingly parallelized over the total number of structures desired, N , with the problem size (maximum number of usable cores) for the generation and structure check procedures equal to N/n.
For clustering (see Sec. 2.5), both the RCD and RDF feature vector calculations are embarrassingly parallelized with problem size N .The number of clusters produced by the affinity propagation (AP) algorithm [44] is nearly-directly correlated with its preference hyperparameter value, but its functional form is not known a priori.Therefore, a parallelized version of the standard binary search algorithm has been implemented to output a specified number of clusters C within a tolerance tol.The preference range is initially wide ([−1000, 1000]).This range is evenly partitioned into R preference values, where R is the number of total MPI ranks available.Each rank executes AP with its assigned value of preference and reports the number of clusters obtained to the root rank.The root rank sets the preference range upper (lower) bound to the preference that returned the lowest (highest) number of clusters above (below) the target number of clusters.The root then partitions the updated preference range and assigns each rank its new preference value.The procedure is repeated until a preference value is found, which yields C ± tol clusters.The communication required in each iteration is less expensive than an AP call by orders of magnitude, therefore it does not significantly affect the scaling.Because preference and the number of clusters are only nearly-directly correlated, fail-safes have been implemented.For example, if the current preference range fails to yield a number of clusters within C ± tol, then the preference range is widened by a random amount.In addition, the user may have the program output the closest number of clusters to C within a desired number of iterations.The memory usage is kept manageable by writing and accessing the affinity and distance matrices via memory maps so that each rank does not make a redundant copy.
Genarris currently interfaces with FHI-aims for energy evaluations and geometry relaxations.It may be adapted to interface with any other electronic structure, force field, or machine learning package that accepts an MPI communicator as an argument.FHI-aims is compiled as a shared library and made an importable Python library through the standard f2py function.The Python communicator is then converted into a Fortran communicator via the py2f method.Thus, Genarris 2.0 allows all ranks to be utilized by FHI-aims.In addition, the world communicator may be split into a user-defined number of sub-communicators, each of which performs an FHI-aims calculation concurrently.This enables massive parallelization of the steps involving energy evaluations and geometry relaxations.The master-slave paradigm is used to allow efficient and automated management of these tasks.The world root (master) rank is not passed into FHI-aims.Rather, it keeps track of which structures have completed and assigns uninitiated structures to root ranks of sub-communicators, which proceed to run FHI-aims.

Unit cell volume estimation
The solid form volume of a molecule is defined as the volume of the unit cell divided by Z, the number of molecules contained in the unit cell.Accurate prediction of the solid form volume of an input molecule is critical for generating structures with reasonable unit cell volumes.To this end, a machine learned model using a Monte Carlo volume estimation scheme and a topological molecular fingerprint constructed based on atomic neighborhoods was developed.The model was trained on a dataset obtained from the CSD using the Conquest program [46].A chemically diverse dataset was compiled, containing molecules with 5 to 260 atoms comprising the organic elements, H, C, N, O, all the halogens, F, Cl, Br, and I, as well as B, P, S, Si, Te, and Se.The accuracy of the machine learned model is within the range of polymorph density differences as identified from 2,173 unique, homomolecular polymorph pairs from the CSD.

Dataset construction
The dataset used for training the volume estimation model was obtained from the CSD using the Conquest program [46].The search was performed over entries of the 2017 version of the CSD for structures of homomolecular organic crystals, characterized at room temperature, under standard pressure, and containing the text phrase 'polymorph'.As described elsewhere [47][48][49], all polymorphic compounds in the CSD are flagged with the tag 'polymorph'.All duplicate structures were identified using the COMPACK program [50] and removed to prevent bootstrapping of the underlying distribution.This yielded 3,768 individual entries in the dataset and 2,173 unique polymorph pairs, which is similar in size to previous statistical studies of homomolecular polymorphs [48,49].
The expected variance of the percent difference in the solid form volume of a molecule due to polymorphism was calculated using this dataset.All unique pairs of polymorphs were identified and the percent difference between each polymorph density was calculated.The percent difference of densities is equivalent to that of the solid form molecular volume because the molecular weight remains constant for these systems.The distribution of percent differences is plotted in Figure 2. The distribution has a standard deviation of 2.95% with respect to the solid form volume of the molecule, consistent with numerous previous reports of molecular crystal density estimation [51][52][53][54].This indicates that polymorphs which can exist under the same temperature and pressure conditions could posses significant volume differences owing to the complex nature of the relatively weak intermolecular interactions that govern the lattice energy of homomolecular crystals.Thus, the distribution presented in Figure 2 places a lower bound on the expected accuracy of estimated solid form molecular volumes.

Monte Carlo volume estimation
Volume estimation is performed by placing a sphere with a van der Waals (vdW) radius [55] at the position of each atom in the molecule [56,57].A Monte Carlo method is then used evaluate the volume occupied by the spheres.First, a three-dimensional box encompassing the molecule is defined.Points within the box are sampled randomly and determined if they fall within at least one of the atomic vdW spheres.The ratio between the number of sampled points and the number of points found within a sphere multiplied by the volume of the three-dimensional box is the estimated volume of the molecule.The Monte Carlo volume estimation is deemed to be converged when the estimated volume changes by less than 10 −3 after 10 6 new points are samples.
The ratio between the experimental molecular solid form volume and the Monte Carlo volume estimate for the polymorph dataset was found to be 1.47, indicating that the Monte Carlo method systematically underestimates the true solid form volume.Using this linear relationship to predict the solid form volume of the molecule achieves a standard deviation of 4.72% error with respect to the dataset (Figure 3).To improve the accuracy of the volume estimation model, specific information about the chemical environment of the atoms in the molecule must be included.To this end, a molecular topological fragment representation has been developed.

Molecular topological fragment model
We present a topological molecular fingerprint representation for predicting solid form molecular volume within the accuracy of polymorph density differences.The representation is based on molecular fragments determined through analysis of the CSD dataset.The fact that the fragments are not predefined enables an unbiased choice of fragments such that they can represent any structural class.The complexity of the model increases with the size and chemical diversity of the dataset making this representation amenable to large datasets as well as datasets comprising a restricted chemical space.Moreover, representation is invariant to permutations of the atom indexing.The molecular topological fragment representation can be used to predict any molecular property of interest with linear and nonlinear regression or classification models and can also be used to compute chemical similarity between molecules using metrics such as the Tanimoto coefficient [58].The Genarris 2.0 source code includes a model construction Python class, enabling users to quickly build topological fingerprints for a training dataset, regularize the model, evaluate the accuracy on a target dataset, and output graphs of predicted values versus target values to asses the performance of the model.
The molecular topological fragment representation is built by constructing a unique string representation for every atom in the molecule.Given an atomic environment, the string is deterministic and does not coincide with another distinct atomic neighborhood.The bonding of the molecule is calculated and used to construct a graph comprised of nodes and edges corresponding to atoms and bonds.Then, the atomic neighborhood strings are constructed for every atom in the molecule.The atom's bonded neighbors are identified and concatenated into a string.The string is sorted first by terminal elements in alphabetical order, then by the atom itself if it is non-terminal, and finally the elements the atom is bonded to are given in alphabetical order.All unique neighborhoods across the dataset are collected and sorted in alphabetical order.This ordering is used to index the vector representation of the molecules.For a given molecule, the value of the vector at each index corresponds to the number of each type of fragments present in the molecule.An example of vector representations for glycine and benzene is seen in Table 1.The representation described here is similar to other fragment based representations seen in chemical informatics [59,60].
Table 1: Example of vector representations constructed for a dataset containing benzene and glycine using the molecular topological fragment model.

Fragment HC HCCC HHCCN HHHNC HN OC OOCC
To construct a predictive model for solid form molecular volumes, the volume predicted by the Monte Carlo method was concatenated to the topological fragment representation vector of each molecule.The coefficients for a linear model were then calculated using Bayesian ridge regression as implemented in scikit-learn [61].The regularization parameter was optimized using a grid search method and five-fold cross validation.The number of features contained in the model was constrained by removing features that did not occur at least thirty times in the dataset.Thirty was identified as the optimal number using a five-fold cross validation scheme.This left 64 unique molecular fragments in the model.
The distribution of errors obtained using the topological fragment model is displayed in Figure 3.It is shown that the fragment based model significantly reduces the error in the predicted solid state molecular volumes compared to the Monte Carlo volume estimation.Furthermore, the fragment based model achieves an error of similar magnitude to the volume differences between polymorphs found in the CSD.Thus, the topological fragment model developed here achieves an accuracy within the error one could expect from polymorph density differences.

Structure generation
The generation process begins by identifying all space groups compatible with the number of molecules in the unit cell.This task is easy for general Wyckoff positions whose multiplicity must equal the desired number of molecules per cell.To determine whether a molecule can occupy special Wyckoff positions, its symmetry must be considered.Genarris 2.0 detects compatible space groups automatically within a given numerical tolerance.Once the compatible space groups are found, Genarris 2.0 attempts generation of crystal structures sequentially, starting from the lowest space group number.If Genarris is unable to generate a structure within the maximum attempt limit specified by the user, then it proceeds to the next space group.
A random volume is drawn from a Gaussian distribution whose mean and standard deviation are the predicted volume and three times the prediction error of our volume estimation method (see Sec. 2.2).The volume is redrawn after a successful generation or after a user-specified number of failed attempts.Subsequently, using this volume, a unit cell of the desired lattice system is constructed randomly as shown in Figure 4.If the attempted position is a general Wyckoff position, then the molecule's orientation is sampled randomly and placed randomly inside the unit cell.The space group symmetry operations are then applied to generate the remaining molecules in the unit cell.Special positions, with the exception of inversion centers, require alignment of the molecule and their treatment is described in Sec.2.3.1.The attempted structures that pass the intermolecular distance checks, as described in Sec.2.4, are added to the raw pool.

Generation in special positions of space groups
Special Wyckoff positions are left invariant under at least one symmetry operation of the space group in addition to the identity operation.For each space group, the International Table of Crystallography [62] lists the special positions whose multiplicity is lower than that of the general position.Only molecules with appropriate symmetry can occupy a special position.Since most molecules do not have higher order symmetries, molecular crystals with molecules on special positions are infrequent.According to an analysis of the CSD [39], in 70.1% of the molecular crystals, molecules occupy general positions, and in the remaining structures molecules occupy special positions.Among the special positions, two-fold rotation (2), mirror planes (m), and inversion centers ( 1) are the most frequent.
Genarris 2.0 generates molecular crystals with molecules on special positions by checking all possible orientations of the molecule with respect to the symmetry directions of the crystal system [62], as shown in the flowchart in Figure 4.At the start of generation, the program finds all possible molecular axes that may be associated with a symmetry element.For this purpose, first the center of mass of the molecule is shifted to the origin.Then, all atoms of the same element that are farthest from the center of mass are selected.The possible symmetry elements of the molecule would map any of these atoms onto itself or onto another.The axes corresponding to these symmetry elements are obtained by calculating the averages and cross products of the position vectors of the selected atoms.A list of potential molecular axes is constructed.To keep the length of the list minimal, parallel vectors are deleted.
Once the potential molecular axes are identified, the code proceeds to check the compatibility of molecule placement at a special position with the specified number of molecules in a unit cell.The molecule's center of mass is placed in a special position such that one of the molecular axes is oriented along one of the viewing directions of the crystal system.Then, the symmetry operations of the space group are applied.If the number of overlapping molecules generated is equal to the order of the site, the special Wyckoff position is regarded as compatible.If not, different molecular axes and viewing directions are considered.All combinations of molecular symmetry axes and viewing directions are examined and compatible ones are stored for subsequent generation attempts.Once a molecule is placed in a compatible special position, its geometry is slightly adjusted (within a user-defined tolerance) by averaging over the atomic positions of all the overlapping molecules occupying the same site.Depending on the site symmetry of the special position, the allowed degrees of freedom are randomized.For example, on a general position or an inversion center, the molecule can be freely rotated about its center of mass.A molecule placed on a two-fold axis can be freely rotated about the axis.

Structure checks
Attempted structures are checked to avoid unphysically close intermolecular contacts.Checking the distance between every atom of a molecule and every atom of all neighboring molecules, including its own periodic replicas, has a scaling of O(N 2 ), where N is the number of atoms in the unit cell.This is found to be the bottleneck of structure generation.To improve the efficiency, Genarris 2.0 performs a series of three hierarchical structure checks.Failed structures are discarded at each stage, such that fewer structures undergo the more rigorous and computationally expensive checks.
The threshold for allowed close contacts between two atoms is called the cutoff distance and is defined based on a specific radius fraction, s r , of the sum of atomic van der Waals radii [34].The crystal structure is deemed unphysical and rejected if the distance, d, between two atoms belonging to different molecules is such that where r A and r B are the van der Waals radii of atom A and atom B, respectively.This ensures the quality of the generated structures.The default value of s r is 0.85.Based on statistical analysis of structures extracted from the CSD, this is a reasonable setting for all but the strong hydrogen bonds.For these cases, special settings have been implemented in Genarris 2.0, as described in Sec.2.4.4.For this value of s r and the target unit cell volume determined as described in Sec.2.2, random generation of crystal structures may require a large number of attempts (a few thousand to millions) before it passes all three stages of structure checks and is accepted into the pool.Therefore, the new hierarchical structure check procedure is a significant efficiency improvement in Genarris 2.0.The details of each stage are explained below.

Stage I: Fast screening without periodic boundary conditions
For preliminary screening, periodic boundary conditions are completely ignored and only intermolecular distances in the unit cell are evaluated.Because distances are computed using the Euclidean norm, this stage is the fastest.If the centers of mass of a pair of molecules in a cell are much farther than twice the molecule length, defined as the maximum distance between two atoms of a molecule, then those pairs are ignored as these molecules cannot overlap.We find that most of the unphysical structures generated are rejected at this stage.The structures that pass this screening proceed to the second stage of structure checks.

Stage II: Distance checks with periodic replicas
In this step, the distances of a molecule from other molecules in the unit cell as well as its own periodic images are checked against the cutoff distance.An approximate minimum The Stage II algorithm is illustrated in Figure 5. First, the atom positions are expressed in fractional coordinates.Then, the distance between two atoms is computed in fractional space and translated to the "origin cube", spanned by the vectors [1, 0, 0], [0, 1, 0], [0, 0, 1].Finally, the minimum Cartesian distance of this point from the corners of the origin cube is calculated.For orthogonal cells, the closest point in fractional space necessarily corresponds to the closest point in real Cartesian space.However, for an oblique triclinic lattice a different lattice point may be closer to this point.Therefore, if a non-orthogonal structure passes Stage II, it proceeds to Stage III for a more rigorous check.

Stage III: Rigorous checks for non-orthogonal cells
Complete structure checks require exact evaluation of distances under minimum image convention.For non-orthogonal cells, this problem is a three-dimensional case of the wellstudied closest vector problem [63].If the lattice is translated such that one of the two points coincides with the origin, we need to find the distance d to the nearest lattice point of the position vector x of the second point.That is, where n = [n x , n y , n z ]; n x , n y , n z ∈ Z 3 ; and L = [a, b, c] T .This problem is encountered in communication theory, where the received signal over a communication line is decoded by finding the nearest lattice point [64].One popular approach is the Finck and Pohst sphere decoder [65,66] method, where the closest lattice point is found using a tree search and the depth of the tree corresponds to the dimension of the problem.Genarris 2.0 uses a version of the sphere decoder to compute the exact distance under minimum image convention for non-orthogonal cells.The distance estimate obtained from Stage II is used as the initial sphere radius for the sphere decoder algorithm.This step is the slowest, but only few nonorthogonal structures that pass Stage I and Stage II reach Stage III.Hence, the overall efficiency is not compromised.

Intermolecular cutoff distances
Choosing appropriate intermolecular cutoff distances is critical for generating physically reasonable structures.In Genarris 2.0, cutoff distances are a function of the elements participating in the intermolecular interaction.For vdW interactions, cutoff distances are implemented using an s r of 0.85.An s r of 0.85 was determined to be a physically reasonable value based on our statistical analysis of intermolecular contacts in a data extracted from CSD and presented in Figure 6 as well as an earlier analysis [67].However, for hydrogen bonds, the intermolecular distance may be considerably shorter than the s r value used for weaker intermolecular interactions [67,68].Hence, new settings for the allowed interatomic distances for hydrogen bonds have been implemented in Genarris 2.0.
Hydrogen bonds among the most important intermolecular interactions in both naturally occurring and artificially engineered molecular crystals [68].Intermolecular hydrogen bonds are denoted as XH• • •Y where X is the donor, which is covalently bonded to the hydrogen, and Y is the acceptor, which belongs to a different molecule than X.The cutoff distance between H and Y implemented in Genarris 2.0 depends on the identity of atoms X and Y.However, these cutoff distances are applicable to any functional group pair that would participate in an intermolecular hydrogen bond for homomolecular crystals.2 displays the newly implemented contact distances for hydrogen bonds, in which oxygen or nitrogen are the donor and acceptor.These values were determined based on the existing literature [69], as well as statistical searches of the CSD using the IsoStar program [70].The IsoStar program provides distributions of nonbonded, intermolecular distances between pairs of functional groups.The central and contact functional groups were chosen across the available pK a range [71] for each type of hydrogen bond in order to develop general three body cutoff distances for all relevant hydrogen bonds.The results of the IsoStar searches are shown in Figure 6.For hydrogen bonds involving oxygen and nitrogen as the donor and acceptor, the sum of the vdW radii multiplied by the default s r value of 0.85 (red dashed lines) exceeds a large number of non-bonded interaction distances, illustrated by the left-most peak of the bimodal distributions.Using the default s r value, structures with strong hydrogen bonds, such as glycine, would be deemed unphysical and discarded.With the new settings listed in Table 2, they would be considered physically reasonable.For hydrogen bonds involving halogens [72], or those with carbon as the donor atom [73], the default s r value of 0.85 is still appropriate.

Clustering and down-selection
Once a "raw" pool of physically reasonable, random structures is generated, Genarris 2.0 offers the option of performing a user-defined sequence of clustering, energy evaluation, and down selection steps in order to form a smaller curated pool of structures.Here, we use the Robust workflow, as shown in Figure 7.For the purpose of clustering via the affinity propagation (AP) machine learning algorithm [44], a feature vector describing the molecular packing is calculated for each structure.The relative coordinate descriptor (RCD) [34] and a radial distribution function (RDF) [45,74,75] descriptor are implemented in Genarris 2.0.The preference hyperparameter of AP is automatically tuned by Genarris 2.0 to produce the desired number of clusters within a user-defined tolerance, as described in Sec.2.1.The default number of clusters for this step is 10% of the the number of structures in the raw pool.For the exemplar of each cluster, a single point energy (SPE) evaluation is performed using FHI-aims with the settings described in Sec. 3 below.Then, AP clustering is performed again with the target number of clusters set to 10% of the reduced pool and the lowest energy structure is selected out of each cluster.Finally, the remaining structures are fully relaxed with FHI-aims as described in Sec. 3.This constitutes the final pool of structures output by Genarris 2.0 using the Robust workflow.

DFT settings
Genarris 2.0 interfaces with the FHI-aims electronic structure code [41] for geometry relaxation of the single molecule and of the structures in the final pool, as well as for single point energy (SPE) evaluations within the Robust workflow used here.All invocations of FHI-aims in this work used the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation [76] and the Tkatchenko-Scheffler (TS) pairwise dispersion correction [77] with lower-level numerical settings, which correspond to the light/tier1 settings of FHI-aims.SPE calculations for crystals were done self-consistently with a 1 × 1 × 1 k-point grid for fast screening.Geometry relaxations of the final pool were performed using a 3 × 3 × 3 k-point grid.Additionally, no constraints were placed on the lattice.All relaxations were done under ambient conditions except for the case of the high-pressure Z = 2 polymorph of benzene, where the pressure was set to 25 kbar to reflect the experimental conditions.

Benzene
With the chemical formula of C 6 H 6 , benzene is one of the simplest aromatic hydrocarbons.It is a highly symmetric molecule with a 6/mmm point group which allows special positions with 20 different site symmetries.Two known polymorphs of benzene are [78] P 2 1 /c (14) under high pressure.In both structures, benzene occupies a special position with an inversion center ( 1).Figures 8 and 9 show the volume histograms obtained at each step of the Robust workflow for benzene with Z = 2 and Z = 4, respectively.The experimental volume and the predicted volume are indicated by solid green line and dashed orange line, respectively.Raw pools of about 6000 structures were generated for both Z = 2 and Z = 4 with predicted volumes of 243 Å3 and 487 Å3 , and volume standard deviations of 18 Å3 and 37 Å3 , respectively.The volume of the experimental structures were 206 Å3 and 471 Å3 , respectively.Our prediction for Z = 4 is much closer to the experimental volume than for Z = 2 because the latter forms under pressure of 25 kbar whereas the volume estimation model was trained on structures obtained under ambient pressure.Nevertheless, Figure 8 shows noteworthy density about the experimental volume throughout the workflow progression.The resulting volume distributions in Figures 8 and 9 are approximately Gaussian until the relaxation step.For Z = 2, relaxation under pressure resulted in volume contraction, whereas some Z = 4 structures expanded beyond the initial volume range.
Figures 10 and 11 show the space group distributions obtained at each step of the Robust    Allowed space groups workflow for benzene with Z = 2 and Z = 4, respectively.Space groups with general Wyckoff positions are colored in blue and space groups with special Wyckoff positions are colored in orange.Genarris 2.0 attempts to generate a uniform space group distribution.We find that the generated space group distributions are approximately uniform with significant number of structures in the experimental space group for both Z = 2 and Z = 4.Some space groups may be very difficult or impossible to generate within the given physical constraints.For example, for Z = 4, space groups like P 2/m (10), P mm2 (25), and P mmm (47) which have mirror planes are harder to generate as molecules that touch the planes overlap with their own mirror image [39].In contrast, space groups with glide planes and screws axes are easier to generate because symmetry-equivalent molecules are translated in space.Some structures can have a higher site symmetry on a special position than we attempted to generate, resulting in overpopulation of some space groups.For example, for Z = 2, space group P 6/mmm (191) has a relatively large occupation as shown in panel (a) of Figure 10.Many of these structures were discarded in the subsequent selection steps.Figures 12 and 13 show the lattice parameter distributions obtained at each step of the Robust workflow for benzene with Z = 2 and Z = 4, respectively.For the energy-based selection and final relaxation steps, the color scale corresponds to relative energies with respect to the lowest energy structure in the final relaxed pool.The lattice parameter distribution  of the raw pools resembles the shape of the surface |a||b||c| = constant (an approximate relation given that benzene is able to assume many lattice types), indicating approximately uniform sampling of the lattice parameter space.Down-selection based on energy tends to filter out very elongated structures whose c parameter is significantly longer than a and b, indicating that these are relatively unstable for benzene.In fact, the experimental unit cells are not elongated.Panels (f) and (g) of Figure 12 shows that relaxation under pressure resulted in a distribution characterized by a few clusters, suggesting that pressure may have restricted the physically feasible regions.For Z = 2, the experimental structure, indicated by a green X, is found in the final relaxed pools for both the RCD and RDF runs.For Z = 4, the final pool was more diverse in lattice parameter space as seen in panels (f) and (g) of Figure 13 and the experimental structure was found only in the RDF run.It should be noted that the goal of Robust workflow is not necessarily to find the experimental structure but to adequately sample the configuration space.

Glycine
Glycine is the simplest proteinogenic amino acid.It is achiral and forms a zwitterion in the solid state.Therefore, the zwitterionic form of glycine was used in the single molecule relaxation step.Under ambient conditions, glycine has three common polymorphs: a) α- glycine with Z = 4 and space group P 2 1 /n (14), b) β-glycine with Z = 2 and space group P 2 1 (4), and c) γ-glycine with Z = 3 and space group P 3 1 (144)/ P 3 2 (145) [79].The structures belonging to the two space groups are enantiomorphic forms of the chiral γ-glycine crystal.Experimentally, it has been found that the relative thermodynamic stability of the polymorphs at room temperature follows γ > α > β with Gibbs free energy difference (∆G) of 0.16 kJ/mol between γ-glycine and α-glycine [80].At temperatures higher than 440 K, αglycine becomes more stable than γ-glycine.The crystal structure and relative stabilities of the glycine polymorphs have been studied extensively, using different computational methods [81][82][83][84][85][86][87].
Glycine is known for its ability to form strong intermolecular hydrogen bonds, owing to which it crystallizes in a relatively dense molecular solid.Figures 14,15  Z = 4, the mean of the distribution is larger than the predicted volume.This is because the Z = 3 is much easier to generate as two out of the three space groups that are allowed have screw axes.The new settings for hydrogen-bonded systems helped generate much dense structures that are close to the predicted volume.Panels (d) and (e) in Figures 14, 15, and 16 show that energy-based selection and the final relaxation favor structures near the experimental volume.Figures 17,18,and 19 show the space group distribution for each step of the Robust workflow for glycine with Z = 2, Z = 3, and Z = 4, respectively.The raw pools for all cases show almost uniform space group distribution.For Z = 4, space groups P 2/m (10) and P mm2 (25) are missing because they contain mirror planes that are hard to generate [39].There are a significant number of structures in the experimental space group in the raw pool and subsequently selected pools for all the cases.Relaxation of the final pool may break existing symmetries or create new ones as there are no constraints imposed.This resulted in additional space groups with a different Z or Z .For example, space group Cmc2 1 (36) and space group P 1 were created after geometry optimization for glycine with Z = 2 as shown in panels (f) and (g) of Figure 17.Allowed space groups selection and final relaxation steps, the color scale corresponds to relative energies with respect to the lowest energy structure in the final relaxed pool.For Z = 2 and Z = 4, the lattice parameter space is well-sampled and diverse regions are obtained upon downselection.For Z = 3, the generated structures are concentrated in distinct regions of the lattice parameter space because there are only three compatible space groups, all of which are in the hexagonal crystal family.For the Z = 2 case, the experimental structure of β-glycine was found in the final relaxed pool for both RDF and RCD runs.Similarly, γglycine was found in both runs with Z = 3.Although α-glycine was not found for both the runs, a few structures with similar packing motif, same space group, or similar lattice parameters of experimental structure were found in the final pool.We note that the goal of the Robust workflow is not necessarily to find the experimental structure but to produce an initial population for a structure search algorithm, such as a genetic algorithm.

Conclusion
In summary, we have presented a new version of the molecular crystal random structure generator, Genarris, with several new features and demonstrated its application to benzene and glycine.The new MPI parallelization scheme has made Genarris 2.0 significantly faster  than the previous version, more portable, and able to scale better on high performance computing architectures.The new machine learning method for volume estimation has been demonstrated to reliably predict the volumes of the polymorphs of benzene and glycine.The somewhat larger deviation from the experimental volume for the high-pressure polymorph of benzene was expected, considering that the model was trained on crystal structures obtained at ambient pressure.
For all polymorphs of benzene and glycine, the new structure generation function has successfully generated structures in the target volume range with approximately uniform space group distributions and has adequately sampled the possible range of lattice parameters.The new capability to generate structures with molecules occupying special Wyckoff positions has proven to be instrumental for benzene.The updated structure check settings for strong hydrogen bonds have been particularly useful for glycine.Thus, Genarris 2.0 is expected to deliver a significantly better performance than the previous version for symmetric molecules and for molecules capable of forming strong hydrogen bonds.
A new Robust workflow has been implemented for clustering and down-selection of the raw pool of random structures to form a small curated population of low-energy structures with diverse crystal packing motifs.Although this workflow is intended for producing an initial population for other structure search algorithms (such as genetic algorithms), not as a structure prediction method, the experimental structures of both polymorphs of benzene and of the beta and gamma forms of glycine were found in the final relaxed pools.For alpha glycine, the final pools contained several structures in the same space group, with similar lattice parameters, or with similar packing motifs, which should be sufficient for a genetic algorithm to generate the experimental structure.
Genarris 2.0 offers the user full flexibility to design and easily implement new workflows by sequentially executing a user-defined list of procedures.For example, to generate datasets for training machine learning models, the user may wish to perform energy evaluations for a larger number of structures from the raw pool.To perform crystal structure prediction, the user may wish to fully relax a larger number of structures.Thus, Genarris 2.0 is a useful random structure generator for homomolecular crystals of semi-rigid molecules with no rotatable bonds, which can be applied to generate initial populations for structure search algorithms or to generate datasets for machine learning or as a standalone crystal structure prediction method.

Figure 2 :
Figure 2: Histogram of the percent difference of polymorph density for 2,173 unique pairs of polymorphs in a dataset obtained from the CSD.The standard deviation of the distribution is displayed in the top left corner.

Figure 3 :
Figure 3: Percent error of the predicted solid form volume for the described dataset from the CSD for a linear model using Monte Carlo volumes and a linear model using Monte Carlo volumes in green and the topological fragment representation of the molecules in orange.

Figure 4 :
Figure 4: Flowchart of crystal structure generation in Genarris 2.0.Molecules are placed in general Wyckoff positions with a random orientation.In contrast, special positions require specific orientations of the molecule to be compatible with the site symmetry

Figure 5 :
Figure 5: A two dimensional representation of Stage II approximate distance evaluation under periodic boundary conditions.a) An oblique lattice in Cartesian coordinates.The star denotes the point whose distance to the nearest lattice point we need to find.b) Once the lattice is converted from real space into the fractional basis, it is easy to find the box that bounds the point.c) The nearest lattice point is likely to be one of the real space points that map to the green points.
image convention is implemented for non-orthogonal cells.To accelerate the distance checks, non-orthogonal cells undergo a lattice reduction.Let a = [a x , 0, 0], b = [b x , b y , 0], and c = [c x , c y , c z ] be the lattice vectors in a Cartesian coordinate system.It is possible to choose a less oblique lattice which satisfies: a x , b y , c z > 0; |b x |, |c x | ≤ a x /2; and |c y | ≤ b y /2.

Figure 6 :
Figure 6: Plots of the number of observations as a function of distance for intermolecular contact distances mined from the CSD and gathered from the IsoStar program.Each histogram is labeled with its hydrogen bond and the number of interactions obtained from IsoStar are labeled N. Drawn on the histograms are vertical lines at the distance corresponding to the sum of the vdW radii, a vdW cutoff of 0.85, and the new hydrogen bond cutoff distances.

Figure 7 :
Figure 7: The clustering and down-selection steps of the "Robust" workflow of Genarris 2.0.

Figure 8 :Figure 9 :
Figure 8: Unit cell volume histograms obtained at each step of the Robust workflow for benzene with Z = 2.The solid green line denotes the unit cell volume of the experimental structure and the dashed orange line shows the volume predicted by our model.

Figure 10 :
Figure 10: Space group distribution histograms obtained at each step of the Robust workflow for benzene with Z = 2.The green arrow points to the space group of the experimental structure, P 2 1 /c (14)

Figure 11 :
Figure 11: Space group distribution histograms obtained at each step of the Robust workflow for benzene with Z = 4.The green arrow points to the space group of the experimental structure, P bca (61).

Figure 12 :
Figure 12: Lattice parameter distributions obtained at each step of the Robust workflow for benzene with Z = 2.The green cross denotes the experimental structure.

Figure 13 :
Figure 13: Lattice parameter distributions obtained at each step of the Robust workflow for benzene with Z = 4.The green cross denotes the experimental structure.

Figure 14 :Figure 15 :
Figure 14: Unit cell volume distributions obtained at each step of the Robust workflow for glycine with Z = 2.The green line denotes the unit cell volume of the experimental structure and the dashed orange line shows the volume predicted by our model.

Figure 16 :
Figure 16: Unit cell volume distributions obtained at each step of the Robust workflow for glycine with Z = 4.The green line denotes the unit cell volume of the experimental structure and the dashed orange line shows the volume predicted by our model.

Figure 17 :
Figure 17: Space group distribution histograms obtained at each step of the Robust workflow histogram for glycine with Z = 2.The green arrow points to the space group of the experimental structure, P 2 1 (4).

Figures 20 ,Figure 18 :
Figure 18: Space group distribution histograms obtained at each step of the Robust workflow for glycine with Z = 3.The green arrow points to the space group of the experimental structure, P 3 1 (144).

Figure 19 :
Figure 19: Space group distribution histograms obtained at each step of the Robust workflow for glycine with Z = 4.The green arrow points to the space group of the experimental structure, P 2 1 /n (14).

Figure 20 :
Figure 20: Lattice parameter distributions obtained at each step of the Robust workflow for glycine with Z = 2.The green cross denotes the experimental structure.

Figure 21 :
Figure 21: Lattice parameter distributions obtained at each step of the Robust workflow for glycine with Z = 3.The green cross denotes the experimental structure.

Figure 22 :
Figure 22: Lattice parameter distributions obtained at each step of the Robust workflow for glycine with Z = 4.

Table 2 :
New cutoff distances implemented in Genarris for intermolecular hydrogen bonds.The cutoff distances are compared to the sum of the van der Waals radii for the intermolecular interactions using the specific radius (s r ) fraction defined in Sec.2.4.