Efficient Handling of Molecular Flexibility in Ab Initio Generation of Crystal Structures

A key step in many approaches to crystal structure prediction (CSP) is the initial generation of large numbers of candidate crystal structures via the exploration of the lattice energy surface. By using a relatively simple lattice energy approximation, this global search step aims to identify, in a computationally tractable manner, a limited number of likely candidate structures for further refinement using more detailed models. This paper presents an effective and efficient approach to modeling the effects of molecular flexibility during this initial global search. Local approximate models (LAMs), constructed via quantum mechanical (QM) calculations, are used to model the conformational energy, molecular geometry, and atomic charge distributions as functions of a subset of the conformational degrees of freedom (e.g., flexible torsion angles). The effectiveness of the new algorithm is demonstrated via its application to the recently studied 5-methyl-2-[(2-nitrophenyl)amino]-3-thiophenecarbonitrile (ROY) molecule and to two molecules, β-D-glucose and 1-(4-benzoylpiperazin-1-yl)-2-(4,7-dimethoxy-1H-pyrrolo[2,3-c]pyridin-3yl)ethane-1,2-dione, a Bristol Myers Squibb molecule referenced as BMS-488043. All three molecules present significant challenges due to their high degree of flexibility.


INTRODUCTION
The crystal structure adopted by a molecule influences many solid state properties that significantly affect its utility, including solubility, dissolution rate, bioavailability, hygroscopicity, compactibility, tabletability, shock-sensitivity, and color. 1 This represents both a challenge and an opportunity: the emergence of unanticipated crystal forms, such as polymorphs or solvates, is a significant risk in the fine chemical and pharmaceutical industries; on the other hand, the ability to devise new crystal forms, such as cocrystals or salts, with improved properties (e.g., bioavailability), is highly desirable.
Computational crystal structure prediction (CSP) aims to predict the crystal structures that a given molecule or set of molecules can form, starting from the corresponding molecular diagram(s) only. CSP has made considerable advances in the past decade, 2−4 as evidenced, for example, by the Cambridge Crystallographic Data Centre's "blind tests". In the two most recent (fourth and fifth) blind tests 5,6 correct predictions were made for every molecular system, with the exception of a hydrate that was predicted correctly, with the exception of hydrogen atomic positions. These and other studies have demonstrated that predictions for small molecules are now becoming almost routine. Furthermore, improvements in prediction capabilities have also been demonstrated for larger molecules of pharmaceutical size, 7−10 and for hydrates, 11 cocrystals, 12−15 and salts. 16,17 Based on thermodynamic considerations, the crystal forms that are most likely to appear in nature are those corresponding to low-energy minima in the free-energy surface with respect to all variables determining the structure of an infinite periodic crystal lattice. These comprise the unit cell geometry and the positions, orientations, and conformations of all chemical entities (e.g., molecules or ions) within the unit cell. The stable crystal form is the one corresponding to the global free-energy minimum, whereas metastable forms typically correspond to local minima within a few kJ mol −1 of the global minimum. 18 However, in view of the theoretical and computational difficulties of determining free energies, most current CSP methods employ lattice energy or enthalpy as surrogate measures for the minimization.
1.1. Two-Stage CSP Methodologies. The complexity of CSP mainly arises from the fact that the lattice energy surface exhibits a very large number of local minima, the correct identification of which tends to be very sensitive to the accuracy of the lattice energy computations. Accordingly, practical CSP methods involve two distinct steps: (1) a global search stage which aims to identify the low-energy minima by performing an extensive search of the decision space using moderately accurate and relatively inexpensive lattice energy computations; and (2) a structure refinement stage during which the structures of lower energy identified by the global search are further refined using a more accurate, but also more computationally expensive, lattice energy model.
The success of the above two-stage approach is predicated on ensuring that no physically relevant structure is missed either because it is not identified by the global search, or because it is determined by the latter to have such a high lattice energy that it is not selected for further refinement. Thus, the more accurate the lattice energy model being used during stage 1, the fewer the structures that need to be refined during stage 2. On the other hand, using too expensive a lattice energy model at stage 1 may render the whole algorithm impracticable given that, typically, hundreds of thousands or millions of structures need to be generated and evaluated. Accordingly, this paper is specifically concerned with techniques for improving the effectiveness and accuracy of the global search stage in identifying all low-energy structures, while staying within acceptable computational limits.
It is worth noting that the two-stage approach was applied by all four groups who achieved successes in the most recent (fifth) blind test of crystal structure prediction, 6 using one of three different codes for initial structure generation. The program GRACE 19 uses a force field that is customparametrized for each molecule, taking into account a full range of variation in bonds, angles and torsions, as well as charges and repulsion/dispersion parameters to model intermolecular interactions; the starting points for local minimization are generated via a parallel-tempering approach that seeks out low-energy regions of the decision variable space. The program UPACK 20−23 allows for several types of potential energy function and generates starting points across the whole decision variable space, using either random numbers or a grid of points. Finally, two of the four successful groups used the program CrystalPredictor 24,25 which is described in more detail in section 1.2.
There are other approaches in addition to those above that can be used for the generation of crystal structures. Some workers have applied a genetic algorithm, combined with local minimizations, to home in on low-energy structures. 26 Other researchers have dispensed with a potential energy function altogether (cf. the use of close packing/volume minimization to generate structures in MOLPAK 27 ) or have supplemented energy calculations with knowledge of common interactions or packing motifs (cf. the program Promet 28 or the work 29 of Thakur et al.). Metadynamics has also been used 30 to explore the potential energy surface of a molecule in the crystal phase without a variable-space spanning generation of points. However, none of these alternate approaches has yet demonstrated successful predictions for a wide range of molecules, especially under blind test conditions. 1.2. CrystalPredictor Algorithm. This paper aims to present a significantly improved version of the CrystalPredictor algorithm 24,25 for the global search stage of CSP methodologies (cf. section 1.1). This section presents a brief outline of the original algorithm to the extent that it is necessary to understand the current paper. This is followed by an analysis of some aspects that are the focus of the proposed improvements.
CrystalPredictor performs an extensive exploration of the lattice energy surface by using low-discrepancy Sobol sequences. 31 These are sequences of points chosen deterministically to ensure the best coverage of the space of decision variables that is achievable using any given number of points. In particular, they result in more uniform coverage than pseudorandom number generation, and they are better than a uniform grid in that each generated point corresponds to a unique value for each and every decision variable. Furthermore, a Sobol sequence can be restarted from any point, taking into account all previously generated structures; this makes it possible, if necessary, to extend the scope of an already performed global search without having to repeat any calculations. The structures generated are used as initial points for lattice energy minimization using a sequential quadratic programming algorithm. The minimized structures of lowest energy are selected for the refinement stage of the CSP algorithm.
In common with several other CSP algorithms, Crystal-Predictor expresses lattice energy, E latt , as the sum of intramolecular and intermolecular (repulsion/dispersion and electrostatic) energy terms. The repulsion/dispersion interaction contribution is described as an exponential-r −6 term, typically parametrized by atom type using standard values obtained by fitting to experimental data, for example, the "FIT" potential 32−34 or the "Williams 01" potential. 35 This shorterrange potential is calculated for each atom pair out to a userdefined cutoff (e.g., 15 Å). Electrostatic interactions are represented by distributed point charges situated on the atoms and, optionally, other "satellite" locations. 36 The computation of intramolecular lattice energy contributions in the original CrystalPredictor is closely related to the latter's model of molecular flexibility. 24,25 In particular, a flexible molecule is considered to comprise two or more rigid fragments connected via flexible torsion angles (FTAs). Thus, the intramolecular energy is a function of the values of the FTAs and can be computed via an isolated-molecule QM calculation which minimizes the conformational energy for chosen values of the FTAs. In practice, these QM calculations are performed a priori to determine the values of the intramolecular energy and its partial derivatives with respect to the FTAs on a regular grid in the space of the FTAs. This allows the construction of restricted multidimensional cubic Hermite interpolants that are used for the efficient computation of intramolecular energy contributions during the global search. The rigid fragments are assumed to retain their in vacuo conformation throughout the search.
The CrystalPredictor algorithm outlined above has been implemented for, and deployed on, machines comprising multiple processors within a distributed computing architecture. Once the grid of intramolecular energy has been derived, the code allows the minimization of a single structure in a few seconds, making it feasible to minimize hundreds of thousands of structures within reasonable CPU times. In recent years, it has been applied successfully by several research groups to a relatively wide variety of systems including single-compound crystals, 9,37−45 cocrystals, 47−49 including chiral cocrystals, 49−51 hydrates and solvates. 11,52 It has also provided the basis of several successful predictions in the third, 53 fourth, 5 and fifth 6 blind tests, including the only two successful predictions for "molecule XX", 7 one of the largest and most flexible molecules considered in a blind test to date.
1.3. Need for an Improved Global Search Algorithm. The application of CrystalPredictor to a wide range of systems has helped identify several areas that would need to be addressed in order to extend the algorithm's applicability to even more challenging systems.
One important concern is the size of the interpolation grid needed to provide a sufficiently accurate representation of the intramolecular energy contributions. In particular, the use of a regular grid spanning the FTA ranges of interest may pose a severe computational barrier given the cost of the QM calculation at each grid point and the fact that the number of grid points increases exponentially with the number of FTAs. Moreover, the number of points in each dimension of the grid depends on the accuracy of the interpolating function being employed, namely, an n-dimensional Hermite interpolant where n is the number of FTAs. In principle, such an interpolant provides a cubic approximation of the function being interpolated on the grid and can be constructed using the values of the function and its partial derivatives up to order n at the grid points. However, in the case of CrystalPredictor, in order to avoid the need for any partial derivatives of the configurational energy above first order, the n-dimensional Hermite interpolant is "restricted" (i.e., some of its coefficients are set to zero). 25 As a result, a cubic quality of approximation is actually obtained only for the case n = 1; as n increases, the proportion of interpolating coefficients that are set to zero increases, and consequently, the quality of approximation deteriorates. The practical implication of this is that an increasingly finer grid needs to be used to achieve the required accuracy of configurational energy representation as more FTAs are taken into account.
A second difficulty arises from treating the molecular fragments connected via the FTAs as completely rigid during the global search, which is sometimes physically unrealistic. In practice, the primary deformation of the FTAs caused by intermolecular interactions within the crystalline environment may result in non-negligible changes in the other molecular conformational parameters (e.g., torsion angles that are not treated as "flexible" and potentially some of the bond angles). These secondary deformations will, in turn, affect interatomic distances and therefore intermolecular energies.
Moreover, the effects of molecular deformation on the atomic charges used for computing electrostatic contributions to the lattice energy may also need to be considered. In principle, these charges may be described as functions of the FTAs using interpolants similar to those used for the intramolecular energy. However, in practice, this approach is limited by the difficulty in extracting the values of the partial derivatives of the atomic charges with respect to the FTAs from QM calculations. Therefore, most of the applications of CrystalPredictor listed above have made use of conformationinvariant atomic charges.
Finally, the task of dividing the molecule(s) of interest into rigid fragments introduces a degree of subjective judgment that is undesirable in a general CSP methodology. Although the choice of such fragments is often obvious (especially given the limited number of FTAs that can be handled, as previously mentioned), this is not always the case. For example, in a recent study 45 on 5-methyl-2-[(2-nitrophenyl)amino]-3-thiophenecarbonitrile (ROY, so-called because of its Red, Orange, and Yellow polymorphs), the existence of an intramolecular hydrogen bond was a key factor that had to be taken into account both in the definition of an appropriate set of rigid fragments and in the choice of FTAs.
1.4. Structure of This Paper. The work reported in this paper aims to extend the applicability of CrystalPredictor by addressing the above issues. Section 2 describes how this can be achieved via an alternative description of molecular flexibility and the use of Local Approximate Models (LAMs) for intramolecular energy and atomic charges. Section 3 of the paper presents CSP studies for the three molecules shown in Figure 1. The first example revisits a recently reported CSP study 45 on ROY and demonstrates the ability of the new algorithm to perform the global search at comparable computational cost and with improved accuracy in the ranking of the generated candidate structures. In particular, the rankings of the seven polymorphs with known crystal structures at the end of the global search stage are much improved. The second example is β-D-glucose, a relatively small molecule that nonetheless has a very high degree of flexibility arising from the closely coupled torsions of its five hydroxylic hydrogen atoms. CSP has previously been carried out on this molecule 20,54,55 by van Eijck et al., but the method was conducted only by effectively ignoring the hydrogen atom positions and then reintroducing them at a later stage in a manner that is restrictive and that may result in a failure to identify low-energy structures for molecules whose crystalphase behavior is not already understood. In the present study, all five torsions are treated as variables without restriction, and the experimental crystal structure is successfully generated. The final example is the pharmaceutical molecule BMS-488043 (molecular weight 422 g mol −1 ), which also has five flexible torsions that must be varied during the global search. Although the ranges of torsion angle variations that need to be considered are narrower than in the case of glucose, the size of the molecule makes this another benchmark study for CSP.
Finally, Section 4 summarizes the key points of the paper and draws some general conclusions.

CRYSTALPREDICTOR II ALGORITHM
The new CrystalPredictor algorithm seeks to address the issues identified in section 1.3 by adopting a different model of molecular flexibility and also a more efficient approach for the accurate computation of intramolecular energy as a function of the molecular conformation during the global search. It also makes use of conformation-dependent atomic charges.
In the rest of this paper, we shall refer to the original and new algorithms as "CrystalPredictor I" and "CrystalPredictor II" respectively. We will consider crystal structures that, in principle, contain any number of chemical entities in the asymmetric unit and belong to any space group. Although, in the interest of brevity, we shall refer to "molecules", the proposed techniques also handle ionic species in a straightforward manner.
2.1. Description of Molecular Flexibility. CrystalPredictor I essentially categorized the conformational degrees of freedom (CDFs) of the molecule(s) under consideration as either "flexible" or "rigid", the latter being fixed at their values in the in vacuo conformation. Furthermore, it was only possible to treat torsional angles as flexible. In contrast, CrystalPredictor II allows all CDFs, including torsional angles, bond angles, and bond lengths, to vary during the global search. CDFs are further distinguished into independent and dependent CDFs. The former, denoted as θ, are affected directly by intermolecular interactions in the crystalline environment; they are usually CDFs that have a relatively small effect on the magnitude of conformational energy and can therefore undergo substantial changes from their in vacuo values. On the other hand, the dependent CDFs, denoted as θ̅ , have a strong influence on the conformational energy and are therefore not expected to undergo large deviations within the crystal; however, they do assume values that minimize the conformational energy for given values of θ and are, therefore, a well-defined function of θ.
The partitioning between θ and θ̅ provides a useful mechanism for taking account of the trade-offs between physical realism and computational cost. The latter is generally an increasing function of the size of set θ as this affects both the size of the decision space being searched (and therefore the number of candidate structures that need to be generated to ensure sufficient coverage of this space) and the cost of the local minimization of each candidate structure. At the two extremes, an empty set θ corresponds to a rigid molecule search, while an empty set θ̅ represents a fully atomistic computation.
In general, the decision variables that determine a crystal structure comprise the unit cell lattice lengths and angles, collectively denoted by X, the positions of the centers of mass and the orientation of the molecules within the unit cell, collectively denoted by β, and the independent CDFs, θ, of these molecules. If the search is carried out within a given space group (cf. section 2.3), the decision variables other than θ may be constrained by space group symmetry, for example, for structures in the monoclinic space group P2 1 /c, only one lattice angle is an independent decision variable, the other two being fixed at 90°.
In any case, given the values of θ, we can compute the remaining (i.e., dependent) CDFs θ̅ (see section 2.2 below), at which point we have the complete molecular conformation(s). The latter, together with the values of the decision variables β, allow us to compute the positions of all atoms in the unit cell. These, together with the lattice parameters X, now fully determine all interatomic distances in the crystal, which allows us to compute the intermolecular contributions to the lattice energy, provided we have values of the atomic charges that are appropriate for the current molecular conformation(s). Finally, we can also compute the intramolecular energy contribution as a function of θ, thereby determining the lattice energy corresponding to the given values of the decision variables.
2.2. Local Approximate Models for QM Calculations. In principle, the functional dependence of the dependent CDFs θ̅ on the independent CDFs θ can be computed by performing an isolated-molecule constrained-geometry QM optimization. More specifically, the intramolecular energy contribution U intra to the lattice energy is given by where E intra (θ̅ ,θ) is the molecular energy for given values θ̅ ,θ of the CDFs, and E vac is a constant representing the energy of the in vacuo conformation of the molecule, usually the globally minimal value of E intra . The solution of the above optimization problem also provides the values of θ̅ and the electrostatic field from which atomic charges may be computed.
CrystalPredictor achieves the efficiency required in the context of global search while retaining the accuracy afforded by the above computations by making use of QM results computed prior to the global search. However, instead of the multidimensional cubic Hermite interpolants used by Crystal-Predictor I, CrystalPredictor II makes use of Local Approximate Models (LAMs). 56 These are quadratic approximations of U intra (θ) of the form: and linear approximations of θ̅ of the form:  56 In the limit of the dependent CDF set θ̅ being empty (i.e., a completely atomistic description of the molecule), the above reduces to the second-order Taylor series expansions used by van Eijck and co-workers. 23 However, the approach proposed here is generally applicable to any partitioning of the CDFs, including those in which only a relatively small subset θ is treated as independent, a crucial consideration in the context of efficient global search. Expressions 2 and 3 match the results of the energy minimization 1 exactly at the reference values θ ref and provide reasonably accurate approximations to U intra b(θ) and θ̅ (θ) within a vicinity of θ ref , typically defined as a hyper-rectangle of the form: where Δθ is a given vector defining the range of validity of the LAM. The intramolecular energy LAMs given by eq 2 retain their quadratic accuracy irrespective of the number of independent CDFs θ being considered. The restricted Hermite interpolants employed by CrystalPredictor I exhibit higher (cubic) accuracy for the case of a single independent variable, but their accuracy deteriorates as the grid dimensionality increases. Thus, LAMs have a distinct advantage for systems with higher degrees of configurational flexibility, such as those of interest to this paper.
A sufficient number of LAMs is constructed to cover the entire conformational space of practical interest, usually defined as the set of values of θ for which U intra (θ) (as computed by eq 1) is below a given threshold. In our work, the latter is typically taken to be +20−30 kJ mol −1 relative to the global minimum in vacuo energy as it is generally unlikely for the reduction in intermolecular energy caused by the deformation of the molecules in the tightly packed crystalline environment to be able to compensate for intramolecular energy increases above this range. 57 In some cases, larger variations of the intramolecular energy have been reported 58 within the crystalline environment and the algorithm is implemented in such a way that the user may choose to explore a wider range of flexibility. The approximate size and shape of this space can be established by a conformational scan (i.e., varying each independent CDF θ around its value in the in vacuo conformation of the molecule while performing an isolated-molecule energy minimization with respect to all other CDFs at each θ-point considered). One may also be guided by information on the values of torsions in analogous molecules, such as those stored in the Cambridge Structural Database (CSD). 59 Ultimately, this exercise determines one or more domains of the form [θ 1 min , ×··· that need to be considered during the global search. The approximate value of the LAM validity range Δθ may also be established by constructing LAMs at some of the points considered during the scan and comparing their predictions of intramolecular energy at neighboring points against the corresponding values computed by the QM calculations.
Once the relevant θ space(s) and the range of LAM validity are obtained in the manner described above, a sufficient number of LAMs can be constructed to cover the entire region of interest within the required accuracy. In the interest of simplicity, in the current implementation of CrystalPredictor II, we place the LAM reference points θ ref on a regular grid, although this is not strictly required. The results of the QM calculations at each point θ ref are also used for computing a corresponding set of atomic charges. Our current implementation employs a zeroth-order approximate model for atomic charges (i.e., the latter are considered to be constant within the range of validity of each LAM) but differ from one LAM to another. All QM computations are done using Gaussian09, 60,60 with atomic charges being fitted using the CHELPG algorithm. 61 We note that the construction of LAMs takes place as a preprocessing step prior to performing the global search. In contrast, the CrystalOptimizer algorithm 56 for structure refinement constructs LAMs "on-the-fly" as and when required, an efficient strategy which in most cases leads to relatively few LAMs actually being generated. However, by design, the global search in CrystalPredictor thoroughly explores the entire space of decision variables, which makes it necessary to compute the intramolecular energy over the entire space of conformational flexibility; therefore, it is better to generate the LAMs a priori in a systematic fashion.
Finally, an important benefit of using LAMs in the global search stage is that the LAM database generated can be reused in the refinement stage, provided that the same level of theory and basis set are used. It is not necessary to use the same set of independent CDFs in both stages to permit reuse of the database. 4, 56 2.3. Structure Generation Procedure and Local Lattice Energy Minimization. As mentioned in section 1.2, CrystalPredictor II employs a search procedure that systematically explores the space of decision variables within userspecified bounds. The search is carried out within a set of crystallographic space groups selected by the user, taking account of the symmetry constraints of each space group in order to improve efficiency. The algorithm can in principle account for all 230 space groups. In the current implementation, 63 space groups are handled; they are listed in Appendix A and correspond to those that appear with the highest frequency in the CSD. The user typically specifies the number of candidate structures to be generated and the way these should be allocated among the selected space groups (e.g., in equal numbers, or in proportion to the space group's relative frequency in the CSD).
Any structure generated is subjected to a set of preliminary checks to determine whether it is physically realistic. This leads to the elimination of structures in which two or more molecules overlap in space, or where the lattice unit cell is unrealistic in terms of shape and density (as specified by user), or where intermolecular and/or intramolecular energies are excessively large.
Structures that pass the above tests are used as starting points of local minimizations of the lattice energy with respect to the decision variables subject to constraints on unit cell shape and volume. The efficient and accurate computation of the various contributions to the lattice energy and their partial derivatives with respect to the decision variables has been described elsewhere. 24,25 The minimization is carried out using a sequential quadratic programming algorithm, implemented as either E04UFF 62 or NLPQLP 63,64 with default optimality/ termination tolerances of 10 −6 .
Local minimizations starting from different candidate structures are independent of each other and can be run in parallel, a fact that is already exploited by the CrystalPredictor software architecture (cf. section 1.2). The achievable speedup factor is a near-linear function of the number of processors, and therefore, there is no need to make use of the parallelised computation capabilities of solvers such as NLPQLP.
The overall structure of the CrystalPredictor II code is summarized in Figure 2. As indicated, the structures obtained are subjected to a final clustering step in which duplicate structures that may have been generated from different starting points are removed, based on a comparison of energy, density, and root-mean-square deviation of the atomic distances in the 15-molecule coordination sphere.

CASE STUDIES
We now consider the application of CrystalPredictor II to three molecules (cf. Figure 1) with significant degrees of flexibility. The independent CDFs for each molecule are marked in Figure 3, with the corresponding ranges searched being reported in Table 1. For ROY and BMS-488043, the ranges shown are consistent with those used in previous studies. 45,65 In order to demonstrate the contribution of CrystalPredictor to the overall CSP methodology, we also report results for the subsequent structure refinement stage performed using the CrystalOptimizer 56 code for two of the three molecules considered, namely, β-D-glucose and BMS-488043. Structure refinement calculations have already been reported elsewhere 45 for ROY, albeit for a different set of starting structures, and are not repeated here.
3.1. ROY. The recent study 45 on ROY using CrystalPredictor I modeled the flexibility of torsions θ 1 and θ 2 , indicated by the solid arrows in Figure 3a, using an elaborate system of rigid fragments aiming to maintain geometrical relationships between nonconnected parts of the molecule. In particular, the hydrogen atom in the secondary amine group had to be included within a single rigid fragment together with the (2-nitrophenyl)amino group (cf. left part of Figure 1a). This was necessary in order to take account of the intramolecular hydrogen bond between the amine hydrogen and one of the oxygen atoms of the nitro group. Albeit effective for this specific example, the need for such system-dependent judgment is not a desirable attribute of what aims to be a widely applicable CSP procedure. As we shall demonstrate in this section, the use of a more natural representation of molecular conformation in CrystalPredictor II allows us to address this issue.
The original study made use of a uniform interpolation grid of 90 points over the independent torsion angle space of [100°, 260°] × [0°, 170°]. Each grid point required a QM calculation performed using Gaussian09 60,60 at the B3LYP level of theory with a 6-31G(d,p) basis set. In the present study, the same set of flexible torsions was investigated to make it easier to compare the performance of the two CrystalPredictor algorithms. Fifty six LAMs were generated at intervals of 20°( i.e., Δθ = 10°), using the same level of theory and basis set to maintain consistency. Both studies made use of the FIT exp-6 repulsion/dispersion potential, 33,34,66 expanded to include sulfur, 67 and generated 400 000 crystal structures. The numbers of unique structures identified within 15 kJ mol −1 of the global minimum were 1584 for CrystalPredictor I and 3375 for CrystalPredictor II. The 1097 structures identified by the present study within 10 kJ mol −1 of the global minimum are shown in Figure 4.
In both cases, the generated structures include all seven polymorphs of ROY for which crystal structures have been characterized experimentally. However, there were significant differences in the ranking of these experimental structures, as illustrated in Table 2, which lists the seven structures in order of decreasing experimentally determined stability. 68−70 Overall, the rankings of the computed structures providing a good match of the experimental polymorphs determined by CrystalPredictor II are much improved. In particular, the  worst-ranked experimental polymorph is at place 930 (R) with CrystalPredictor I but at place 120 (ON) using CrystalPredictor II. As can be seen in Table 2, the relative rankings obtained with CrystalPredictor II for the experimentally determined polymorphs are in much better agreement with those of CrystalOptimizer than the relative rankings derived from CrystalPredictor I. This indicates that the LAM-based lattice energies are more accurate than those computed with the restricted Hermite interpolants. Moreover, the similarity of the generated structures to the corresponding experimental structures, measured through the root-mean-square difference in atomic positions between a cluster of 15 molecules from each structure (RMSD 15 ) is similar to CrystalPredictor I results or much superior. This improvement is achieved with the same number of grid points in CrystalPredictor II as in CrystalPredictor I. This indicates that a significant improvement in the accuracy of the energy function has been achieved by introducing LAMs and incorporating a description of the effect of conformational flexibility on dependent CDFs and charges.
Irrespective of which version of CrystalPredictor is used for the global search, the low-ranking structures identified by the latter need to be refined further using CrystalOptimizer, 56 which employs a more accurate description of electrostatic interactions in terms of distributed multipole expansions and also takes account of a much wider range of independent CDFs; a total of 22 independent CDFs, including 10 bond angles and 12 torsion angles, were considered in the previous study of this molecule. It is worth noting that, in view of the rankings reported in Table 2, using CrystalPredictor II for the global search would lead to much fewer structures having to be refined in order to ensure identification of all experimentally known polymorphs. Given the high computational cost of the refinement stage, this has a significant impact on the reliability and the overall cost of CSP. The improved reliability of the crystal structure landscape is a major benefit of the revision of CrystalPredictor. A further benefit arises from the fact that the LAMs constructed as part of a CrystalPredictor run can be reused at the refinement stage, making the overall process more efficient.
3.2. β-D-Glucose. Because of their relatively low effect on conformational energy, all five of the hydroxylic hydrogens in β-D-glucose could potentially take on any orientation within the crystalline environment. Therefore, the corresponding torsion angles have to be treated as independent CDFs with search   a Lattice energies and rankings from generated by CrystalOptimizer in a previous study 45 are also shown. Polymorphs are listed in decreasing order of experimentally determined thermodynamic stability. 68−70 Where 15 molecules could not be overlaid between the generated structure and the corresponding experimental structure, the number of molecules that could be overlaid is listed in square brackets instead of an RMSD 15 .
ranges set at [0, 360°]. In contrast, conformational scans indicate that the torsion angle shown by the broken thin arrow in Figure 3b is much less flexible, with relatively small deviations from its in vacuo value of 60°leading to significant increases in conformational energy. It is therefore treated as a dependent CDF for the purposes of the global search. There exists a second low-energy minimum at 285°(see S4.1 in the Supporting Information) that would need to be sampled for a comprehensive search, but in this work we treat only the global minimum region in order to illustrate the method.
Handling the flexibility of the five independent CDFs using CrystalPredictor I would have been problematic due to the need for relatively small intervals between grid points arising from the relatively large number of CDFs (e.g., with 30°i ntervals, nearly 250 000 QM calculations would be required). In CrystalPredictor II, LAMs were generated at intervals of 60°a t the HF-SCF level of theory with a 6-31G (d,p) basis set in Gaussian09. The result was a large but computationally manageable set of 7776 = (360/60) 5 LAMs. The global search generated 500 000 points, yielding only 14 unique structures within 20 kJ mol −1 of the global minimum (see Figure 5a). This is due to the presence of an isolated global minimum, at a comparatively low energy value. Given this unusual landscape, structures within 40 kJ mol −1 of the global minimum were considered in order to include over 1000 structures in the refinement step. The FIT exp-6 repulsion/ dispersion potential was used. 33,34 A structure closely matching the experimentally known crystal structure (CSD refcode: GLUCSE02) was found ranked 489th at 36.4 kJ mol −1 above the global minimum with RMSD 15 = 0.206 Å. Subsequent refinement and reranking was carried out using CrystalOptimizer. The torsion angle shown by the broken thin arrow in Figure 3b was now also treated as an independent CDF. The intramolecular energy was calculated at the HF-SCF level of theory using a 6-31G (d,p) basis set, while the intermolecular energy was calculated using electrostatic multipoles derived using GDMA2 71 from wave functions calculated at the PBE0 level of theory using a 6-31G (d,p) basis set and the FIT exp-6 repulsion/dispersion potential. After this refinement, the experimentally known structure was ranked 137th.
The adjacent alcohol groups in β-D-glucose lead to strong intramolecular OH···O interactions, although the molecular geometry does not permit the formation of full intramolecular hydrogen bonds. On the other hand, intermolecular hydrogen bonding does take place and this, in turn, may result in significant electrostatic induction effects between neighboring molecules. The consequent changes in electron density are not captured by isolated-molecule QM calculations and this may have important consequences on the relative ranking of putative crystal structures. 72 To compensate for this polarizability effect, the lattice energies of the final crystal structures were recalculated using electrostatic multipoles and intramolecular energies calculated in a polarizable continuum model (PCM) 73 with dielectric constant of ε = 3, which has been shown 74 to approximate the polarizing effect of a crystalline environment.
The computational cost of the entire CSP exercise is shown in the corresponding columns of Table 3. The structure refinement cost includes the final reranking calculations mentioned above.
The final, refined set of crystal structures is shown in Figure 5b and contains a match to the experimental crystal structure with RMSD 15 = 0.159 Å in sixth place, 4.1 kJ mol −1 above the global minimum. The details of the predicted structure and a comparison with experimental measurements are shown in Table 4 below.
3.3. BMS-488043. There are two experimentally known polymorphs of BMS-488043. 65 Form I is recorded in the CSD as SUTTOR. 75 Five torsion angles ( Figure 3c) were determined to be sufficiently important in terms of their effect on the value of the intramolecular energy to warrant variation during the global search. The conformational energy scan on angles θ 3 and θ 4 found that each one of them exhibits two disjoint low-energy ranges of potential interest to CSP. In such cases, CrystalPredictor can be used to generate structures across the full set of values of both angles, simply allowing structures in the intermediate, high-energy ranges to be discarded because of violations of the preliminary screening rules described in section 2.3. However, in this case, we chose to perform four separate searches, each corresponding to a different combination of the low energy ranges of θ 3 and θ 4 , as shown in the last four rows of Table 1.
Overall, the relatively small range of rotation in these five torsions allows us to model the flexibility in all four conformational regions using a total of only 180 LAMs. These were calculated at the HF-SCF level of theory with a 6-31G (d,p) basis set in Gaussian09. The global search generated 500 000 points in each conformational region, giving a total of two million initial crystal structures. The FIT exp-6 repulsion/ dispersion potential was used.
Overall, 871 unique structures were identified within 20 kJ mol −1 of the global minimum and are shown in Figure 6a. A structure corresponding to form I was found ranked sixth, and a structure corresponding to form II was found ranked 767th. Subsequent refinement and reranking of all 871 structures was carried out using CrystalOptimizer. The torsion angles indicated by the broken thin arrows in Figure 3c were also treated as independent CDFs at this stage. The intramolecular energy was calculated at the PBE0 level of theory using a 6-31G (d,p) basis set, while the intermolecular energy was calculated using electrostatic multipoles derived using GDMA2 71 from wave functions calculated at the PBE0 level of theory using a 6-31G (d,p) basis set, and the FIT exp-6 repulsion/dispersion potential. 33,34 In the final set of crystal structures (see Figure 6b), the structure corresponding to form I (with an RMSD 15 of 0.249 Å) is ranked fifth, while that corresponding to form II (with an RMSD 15 of 0.278 Å) is ranked 51st. A comparison of predicted and experimental structures for the two forms is shown in Table 5.
By the standards of contemporary crystal structure prediction, BMS-488043 is a relatively large molecule. Successful generation and refinement of structures for a molecule of this size within a fairly standardized workflow and at relatively modest computational cost (see relevant columns in Table 3) represents a notable step forward for crystal structure prediction as a field.

CONCLUDING REMARKS
The CrystalPredictor II algorithm and code described in this paper allow a significant degree of molecular flexibility to be taken into account during the global search stage of CSP methodologies. Like its predecessor, the CrystalPredictor I algorithm, 24,25 it is generally applicable to systems with one or more chemical entities in the asymmetric unit, forming crystal structures belonging to any space group. It derives its effectiveness in identifying low-energy structures from a systematic and extensive search of the lattice energy surface implemented on distributed computing architectures.
By allowing the user to categorize CDFs into "independent" and "dependent" ones, the new algorithm introduces a natural description of molecular flexibility that can easily be tailored to achieve the best trade-off between physical realism and computational tractability without the need for artificial constructs such as "rigid fragments" which has sometimes proven problematic in the past. Moreover, the use of LAMs provides an efficient and accurate description of the variation of intramolecular energy, dependent CDFs and atomic charges during the global search.
As illustrated by the ROY example (cf. section 3.1), the improved accuracy of CrystalPredictor II may lead to an improved ranking of the candidate structures over that achieved by its predecessor. In practical terms, this can significantly reduce the number of structures that need to undergo further refinement in order to ensure identification of all polymorphs     Table 1 for the definition of the conformational regions.
of practical interest. Conversely, it can increase the probability of these polymorphs being identified following the refinement of a given number of structures. The developments reported in this paper render CrystalPredictor II more consistent with its companion CrystalOptimizer 56 algorithm for structure refinement, as both algorithms now employ the independent/dependent CDF categorization and the LAMs based on it. The higher accuracy required at the structure refinement stage in the computation of lattice energy is achieved by CrystalOptimizer in three different ways, namely by a. using distributed multipole expansions instead of atomic charges for the description of the intermolecular electrostatic interactions; b. treating more CDFs (e.g., more torsion angles and some bond angles) as independent; c. potentially employing a more accurate level of theory/ basis set for the QM calculations. The theoretical consistency between the two algorithms has an important practical implication: if the same level of theory/ basis set is used for both CrystalPredictor II and CrystalOptimizer, then any LAMs generated for the former may be stored and reused by the latter, thereby reducing the number of QM calculations that need to be performed during the structure refinement. This is a mathematical property of LAMs that holds 56,65 even if some of the CDFs treated as dependent by CrystalPredictor II are treated as independent ones by CrystalOptimizer.
The introduction of LAMs in the global search stage of Crystal Structure Prediction is an important step forward in the application of CSP techniques to larger and/or highly flexible molecules because it allows the generation of an improved crystal energy landscape within reasonable computational cost. This increases the reliability of the entire process since a more realistic set of low-energy structures is identified at the end of the global search. It can also reduce significantly the overall computational cost of CSP studies as fewer putative structures need to be refined and LAM databases from the global search can be reused in the refinement step. Section S1 provides a discussion of the choice of the conformational search space for BMS-488043. Section S2 provides information on the reuse of LAM databases in refinement. Section S3 provides a breakdown of the space groups sampled for each molecule. This material is available free of charge via the Internet at http://pubs.acs.org/