A Low-Order Permutationally Invariant Polynomial Approach to Learning Potential Energy Surfaces Using the Bond-Order Charge-Density Matrix: Application to Cn Clusters for n = 3–10, 20

A representation for learning potential energy surfaces (PESs) in terms of permutationally invariant polynomials (PIPs) using the Hartree–Fock expression for electronic energy is proposed. Our approach is based on the one-electron core Hamiltonian weighted by the configuration-dependent elements of the bond-order charge density matrix (CDM). While the previously reported model used an s-function Gaussian basis for the CDM, the present formulation is expanded with p-functions, which are crucial for describing chemical bonding. Detailed results are demonstrated on linear and cyclic Cn clusters (n = 3–10) trained on extensive B3LYP/aug-cc-pVTZ data. The described method facilitates PES learning by reducing the root mean squared error (RMSE) by a factor of 5 relative to the s-function formulation and by a factor of 20 relative to the conventional PIP approach. This is equivalent to using CDM and an sp basis with a PIP of order M to achieve the same RMSE as with the conventional method with a PIP of order M + 2. Implications for large-scale problems are discussed using the case of the PES of the C20 fullerene in full permutational symmetry.


S.1 A symmetrization scheme for generating covariant polynomials
Equations 6 take as input symmetrization phase factors  ,, =± 1 that use nucleus indices 1 ... a ... b ... N to choose the phase for each monomial j such that a permutation of a pair of identical nuclei also permutes the two PIPs, which are labeled by i.In other words, PIP i describing nucleus a will transform into the corresponding PIP i describing an identical nucleus b upon the a-b permutation.This guarantees that the PES described by Eq. 1 is invariant under the said permutation and by extension under any permutation.This has been tested numerically on the systems considered in the present work.
The presently implemented scheme for choosing the symmetrizing phases is given below: I. determine the number of non-zero powers in a monomial: M_NZ{monomial} This section provides analytic expressions for the Gaussian integrals seen in Eqs. 12 of the main text.Here the integrals involving s functions will be given explicitly, with the corresponding p function integrals calculated as derivatives of the former.The normalized s function on center a with coordinate   is where  is electron's coordinate.The Hamiltonian matrix element used in Eq. 2 of main text is where the kinetic energy integral is and the overlap integral is with   ≡     (    ).The distance between two centers a and b is   = |  -  |.The one-electron Coulomb integral involving two basis functions centered at a and b nuclear centers coupled over the nuclear center c with charge   is where the ab centroid position vector is  1 = (        ) (    ) and  1 = | 1 -  |.For a = b, the above integrals reduce to and for a = b = c the Coulomb integral reduces to Matrix elements involving Cartesian p functions are derived by differentiation of the above integrals with respect to basis function centers, i.e. nuclear position   = {  ,  ,  }, for the two and three center integrals: and on for all other X, Y, Z combinations.After carrying out the differentiation, the special cases of coinciding nuclei, a = b, a = c , b = c and a = b = c, are readily derived.

S.3 Definition of a body-fixed coordinate system for density matrix parameterization
In this section we demonstrate the use of an inertia matrix in constructing a uniquely defined body fixed coordinate system.To begin, a 3x3 matrix is constructed using the nuclear positions   and nuclear weights   , where the index k runs over the N atoms with , = ,,, and where the origin is defined in the usual way as with  = ∑  =1   .We presently set   = 1 which amounts to treating the nuclei as indistinguishable (Euclidean) points, although using nuclear charges has been seen as an equally valid approach in the calculations.On the other hand, using nuclear masses for   leads to an unphysical dependence of the basis and therefore the density matrix on isotopic substitutions and is therefore undesirable.
Diagonalization of I produces an orthogonal body-fixed (BF) basis  1 ,  2 ,  3 ordered by the increasing eigenvalue.It is spatially unique for a given geometrical configuration but with the directions of the basis vectors determined only up to a sign.This ambiguity is problematic when density elements of the sx, xy, etc sort need to be calculated.A sign flip in   leads to a flip, i.e. a discontinuity, in the corresponding density matrix element resulting in a discontinuous PES and its derivatives.
To address this issue, we adopt a convention by which a unique phase for the BF vectors is determined.By analogy with a dipole moment, a vector  is constructed relative to the above defined origin and using configuration dependent nuclear "charges" Examination of various models in a course of preliminary calculations suggested that   =  |  - 0 | 2 is one of the simplest and most effective choices which guarantees a unique and nonzero vector D for a generic configuration and with permutational symmetry manifestly enforced, the latter property being easily checked by inspection.(It needs to be mentioned that in very special and extremely rare high-symmetry cases where D = 0 we assume   = | 1, | 2 for  = ,,.)We also note that unlike a standard dipole moment vector, D is constructed to be invariant under translation which is a very important property in present derivation.Having constructed D, the phases of the BF axes are determined by the following convention: Thus defined BF coordinate system is guaranteed to preserve invariance of the energy with respect to translations, rigid-body rotations and permutations of identical nuclei.Numerous tests of these cases have been done on a variety of molecules to verify this proposition numerically.
The fitting procedure involves the following steps: (1) the training set coordinates (with corresponding energies) are taken from a trajectory propagation, Monte Carlo simulation or any other structure generator and transformed into the BF frame; (2) the transformed set is used to construct the matrix equation for linear regression analysis and recovery of fitting coefficients  BF  ; (3) to recover the energy of an arbitrary input geometry, the coefficients  BF  are used to express the density matrix P in the BF coordinate system which requires a prior rotation of the input geometry into the BF frame.
(4) The rotational covariance of the energy gradient is also properly described by the above procedure.Namely, the gradient in the input (space fixed (SF)) frame is first calculated analytically (or numerically) in the BF frame as ∇ BF and then transformed back into the SF frame using for atom i, a protocol that one would employ in a molecular dynamics simulation.Similar transformations also work for higher order energy derivatives.

S.4 C 3 training and testing data
Following our previous approach of optimizing the linear parameters separately from the nonlinear ones, here we perform a linear regression calculation for a set of the three exponents chosen on a grid: 1-dimensional for PIP and three-dimensional for CHM_s and CHM_sp models.There is a small D 2h barrier, ~90 cm -1 , to C 2v -C 2v pseudoinversion of the cyclic triplet C 10 , calculated at B3LYP/aug-cc-pVTZ(spd) level, which completely disappears upon including the ZPE.Calculations using the cc-pVQZ basis set yield the triplet C 10 cyclic D 2h barrier even closer in energy to the C 2v one, within ~30 cm -1 , therefore we presume that the vibrationally averaged triplet C 10 configuration is of the D 2h geometry.
We also examine isomerization pathways connecting cyclic and linear configurations for both singlet and triplet electronic states of C 10 .Previous DFT studies at the PBE/Def2TZVPP level of theory (see main text for details) reported only linear-cyclic isomerization for the singlet state.
The transition state for the cyclic-linear isomerization, i.e., the ring opening process, of the C 10 singlet state is much higher than that for the triplet state.The cyclic singlet C 10 is lower in energy than the cyclic triplet C 10 state, while the linear singlet C 10 is higher in energy than the linear triplet C 10 .

Figure S3 .S26Figure S4 .S27Figure S5 .Figure S6 .
Figure S3.Normal mode cuts of the C 20 potential for modes n = 1 -18 (ordered left to right, top to bottom) along the reduced coordinate where  is the corresponding harmonic frequency.The

Table S1 .
C 3 training and testing data for the linear singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (1726 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S2 .
C 4 training and testing data for the linear triplet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (2848 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S3 .
C 5 training and testing data for the linear singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (4176 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S4A .
C 6 training and testing data for the cyclic singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (6006 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S4B .
C 6 training and testing data for the linear triplet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (5121 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S5 .
C 7 training and testing data for the linear singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (6433 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S6 .
C 8 training and testing data for the linear triplet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (7326 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S7 .
C 9 training and testing data for the linear singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (8559 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

Table S8A .
C 10 training and testing data for cyclic singlet.The training set TR contains 5001 trajectory points 2 fs apart propagated at the ZPE (10499 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of

Table S8B .
C 10 training and testing data for linear triplet.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE (9395 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory.TEST-G is the gradient data at the points of TEST.

12 Electronic structure and configuration data for C 6 and C 10Table S9A :
B3LYP/aug-cc-pVTZ(spd) Mulliken population fractions in % at the linear and cyclic geometries of C 6 for the singlet (M1) and triplet electronic states (M3).

Table S12 .
C 20 training and testing data.The training set TR contains 5001 trajectory points 2 fs apart propagated at ZPE/16 (1579 cm -1 ).TEST contains different 5000 trajectory points 2 fs apart from the same trajectory, and TEST-G contains the gradient data at the points of TEST.