Parameterization of Coarse-grained Molecular Interactions through Potential of Mean Force Calculations and Cluster Expansions Techniques

We present a systematic coarse-graining (CG) strategy for many particle molecular systems based on cluster expansion techniques. We construct a hierarchy of coarse-grained Hamiltonians with interaction potentials consisting of two, three and higher body interactions. The accuracy of the derived cluster expansion based on interatomic potentials is examined over a range of various temperatures and densities and compared to direct computation of pair potential of mean force. The comparison of the coarse-grained simulations is done on the basis of the structural properties, against the detailed all-atom data. We give specific examples for methane and ethane molecules in which the coarse-grained variable is the center of mass of the molecule. We investigate different temperature and density regimes, and we examine differences between the methane and ethane systems. Results show that the cluster expansion formalism can be used in order to provide accurate effective pair and three-body CG potentials at high $T$ and low $\rho$ regimes. In the liquid regime the three-body effective CG potentials give a small improvement, over the typical pair CG ones; however in order to get significantly better results one needs to consider even higher order terms.


I. INTRODUCTION
The theoretical study of complex molecular systems is a very intense research area due to both basic scientific questions and technological applications. 1 A main challenge in this field is to provide a direct quantitative link between chemical structure at the molecular level and measurable macroscopic quantities over a broad range of length and time scales. Such knowledge would be especially important for the tailored design of materials with the desired properties, over an enormous range of possible applications in nano-, bio-technology, food science, drug industry, cosmetics etc.
A common characteristic of all complex fluids is that they exhibit multiple length and time scales. Therefore, simulation methods across scales are required in order to study such systems. On the all-atom level description, classical atomistic models have successfully been used in order to quantitatively predict the properties of molecular systems over a considerable range of length and time scales. [1][2][3][4] However, due to the broad spectrum of characteristic lengths and times involved in complex molecular systems it is desirable to reduce the required computational cost by describing the system through a small number of degrees of freedom. Thus, coarse-grained (CG) models have been used in order to increase the length and time scales accessible by simulations. 1,3,[5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22] From a mathematical point of view, coarse-graining is a sub-field of dimensionality reduction; 4 there are several statistical methods for the reduction of the degrees of freedom under consideration in a deterministic or stochastic model, such as principal component analysis, polynomial chaos and diffusion maps. 20 Here we focus our discussion on CG methods based on a combination of recent computational methods and old theoretical tools from statistical mechanics. Such CG models, which are developed by lumping groups of atoms into CG particles and deriving the effective CG interaction potentials directly from more detailed (microscopic) simulations, are capable of predicting quantitatively the properties of specific molecular systems (see for example refs. [5][6][7][8][9][11][12][13]15,[17][18][19][23][24][25] and references therein).
The most important part in all systematic CG models, based on detailed atomistic data, is to develop rigorous all-atom to CG methodologies that allow, as accurate as possible, estimation of the CG effective interaction. With such approaches the combination of atomistic and hierarchical CG models could allow the study of a very broad range of length and time scales of specific molecular systems without adjustable parameters, and by that become truly predictive. 11,14,15 There exists a variety of methods that construct a reduced CG model that approximates the properties of molecular systems based on statistical mechanics. For example: (a) In structural, or correlation-based, methods the main goal is to find effective CG potentials that reproduce the pair radial distribution function g(r), and the distribution functions of bonded degrees of freedom (e.g. bonds, angles, dihedrals) for CG systems with intramolecular interaction potential. 6,7,9,10,21,22 The CG effective interactions in such methods are obtained using the direct Boltzmann inversion, or reversible work, method 10,[26][27][28] or iterative techniques, such as the iterative Boltzmann inversion, IBI 7,29 , and the inverse Monte Carlo, IMC, (or inverse Newton) 22,30 approach. (b) Force matching (FM) or multi-scale CG (MSCG) methods 5,14,16,31-33 is a mean least squares problem that considers as observable function the total force acting on a coarse bead. (c) The relative entropy (RE) 8,18,34 method employs the minimization of the relative entropy, or Kullback-Leibler divergence, between the microscopic Gibbs measure µ and µ θ , representing approximations to the exact coarse space Gibbs measure. In this case, the microscopic probability distribution can be thought as the observable. The minimization of the relative entropy is performed through Newton-Raphson approaches and/or stochastic optimization techniques. 19,35 In practice, all above numerical methods are employed to approximate a many body potential of mean force (PMF), U PMF , describing the equilibrium distribution of CG particles observed in simulations of atomically detailed models. Besides the above numerical parametrization schemes, more analytical approaches have also been developed for the approximation of the CG effective interaction, based on traditional liquid state theory and on pair correlation functions. [36][37][38][39][40][41][42] Here we discuss an approach for estimating U PMF , and the corresponding effective CG non-bonded potential, based on cluster expansion methods. Such methods originate from the works of Mayer and collaborators 43 in the 40's. In the 60's numerous approximate expansions have been further developed 44,45 for the study of the liquid state. Later, with the advancement of powerful computational machines, the main focus has been directed on improving the computational methods such as Monte Carlo and molecular dynamics. However, the latter are mostly bulk calculations and they get quite slow for large systems. Reducing the degrees of freedom by coarse-graining has been a key strategy to construct more efficient methods, but with many open questions with respect to error estimation, transferability and adaptivity of the suggested methods. Based on recent developments of the mathematical theory of expansion methods in the canonical ensemble 46 , our purpose is to combine the two approaches and obtain powerful computational methods, whose error compared to the target atomistic calculations can be quantified via rigorous estimates. In principle, the validity of these methods is limited to the gas regime. Here we examine the accuracy of these methods in different state points. This attempt consists of the following: a priori error estimation of the approximate schemes depending on the different regimes, a posteriori error validation of the method from the coarsegrained data and design of related adaptive methods.
In previous years, we have carried out this program for the case of lattice systems, obtaining higher order schemes and a posteriori error estimates 47 , for both short and long range interactions 48 and designing adaptive methods 49 and investigating possible strategies for reconstruction of the atomistic information. 50 This is very much in the spirit of the polymer science literature 10,11,51 and in this paper we get closer by considering off-lattice models. The proposed approach is based on typical schemes that are based on isolated molecules. 26,27,52 Here we extend such approaches using cluster expansion tools for deriving CG effective potentials. We start from typical 2-body (pair) effective interaction, but some results can be extended to many-body interactions as well. We also present a detailed theoretical investigation about the effect of higher order terms in obtaining CG effective interaction potentials for realistic molecular systems. Then, we show some first results from the implementation of three-body terms on the effective CG potential; a more detailed work on the higher order terms will be given in a forthcoming work. 53 The structure of the paper is as follows: In Section II, we introduce the atomistic molecular system and its coarsegraining via the definition of the CG map, the n-body distribution function and the corresponding n-body potential of mean force. The cluster expansion based formulation of the CG effective interaction is presented in Section III. Details about the model systems (methane and ethane) and the simulation considered here are discussed in Section IV. Results are presented in Section V. Finally, we close with Section VII summarizing the results of this work.

II. MOLECULAR MODELS
A. Atomistic and "exact" coarse-grained (CG) description Here we give a short description of the molecular model in the microscopic (all-atom) and mesoscopic (coarsegrained) scale. Assume a system of N (classical) atoms (or molecules) in a box Λ( ) := (− 2 , 2 ] d ⊂ R d (for some > 0), at temperature T . We will also denote the box by Λ when we do not need to explicit the dependence on . We consider a configuration q ≡ {q 1 , . . . , q N } of N atoms, where q i is the position of the i th atom. The particles interact via a pair potential V : R d → R ∪ {∞}, which is stable and tempered. Stability means that there exists a constant B ≥ 0 such that: for all N and all q 1 , ..., q N . Moreover, temperedness requires that where β = 1 k B T and k B is Boltzmann's constant. The canonical partition function of the system is given by where H Λ is the Hamiltonian (total energy) of the system confined in a domain Λ: By U (q) we denote the total potential energy of the system: where for simplicity we assume periodic boundary conditions on Λ. Integrating over the momenta in (3), we get: where λ := ( 2mπ β ) d/2 . In the sequel, for simplicity we will consider λ = 1 and identify Z β,Λ,N ≡ Z U β,Λ,N . Fixing the positions q 1 and q 2 of two particles, we define the two-point correlation function : It is easy to see that in the thermodynamic limit the leading order is ρ 2 , where ρ = N |Λ| and |Λ| is the volume of the box Λ. Thus, it is common to define the following order one quantity g(r) := 1 ρ 2 ρ (2),at N,Λ (q 1 , q 2 ), for r = |q 1 − q 2 |. More generally, for n ≤ N , we define the n-body version and from that the order n potential of mean force (PMF), U PMF (q 1 , . . . , q n ), 54,55 given by We define the coarse-graining map T : (R d ) N → (R d ) M on the microscopic state space, given by T : q → T (q) ≡ (T 1 (q), . . . , T M (q)) ∈ R M , which determines the M (M < N ) CG degrees of freedom as a function of the atomic configuration q. We call "CG particles" the elements of the coarse space with positions r ≡ {r 1 , . . . , r M }. The effective CG potential energy is defined by where the integral is over all atomistic configurations that correspond to a specific CG one using the coarse-graining map. In the example we will deal with later, the configuration r will represent the centers of mass of groups of atomistic particles. This coarse graining gives rise to a series of multi-body effective potentials of one, two, up to M -body interactions, which are unknown functions of the CG configuration. Note also that by the construction of the CG potential in (10) the partition function is the same: The main purpose of this article is to give a systematic way (via the cluster expansion method) of constructing controlled approximations of U eff that can be efficiently computed and at the same time we have a quantification of the corresponding error for both "structural" and "thermodynamic" quantities. By structural we refer to g(r), while by thermodynamic to the pressure and the free energy. Note that both depend on the partition function, but they can also be related 55 to each other as follows: at least for the case of pair-interaction potentials.

B. Coarse-grained approximations
As mentioned above there are several methods in the literature that give approximations to the effective (CG) interaction potential U eff as defined in (10). Below we list some of them without claim of being exhaustive: (a) The 'correlation-based (eg. DBI, IBI and IMC) methods that use the pair radial distribution function g(r), related to the two-body potential of mean force for the intermolecular interaction potential, as well as distribution functions of bonded degrees of freedom (e.g. bonds, angles, dihedrals) for CG systems with intramolecular interaction potential. 6,7,9,10,21,22 These methods will be further discussed below. (b) Force matching (FM) methods 5,16,31 in which the observable function is the average force acting on a CG particle. The CG potential is then determined from atomistic force information through a least-square minimization principle, to variationally project the force corresponding to the potential of mean force onto a force that is defined by the form of the approximate potential.
(c) Relative entropy (RE) 8,18,19 type methods that produce optimal CG potential parameters by minimizing the relative entropy, Kullback-Leibler divergence between the atomistic and the CG Gibbs measures sampled by the atomistic model.
In addition to the above numerical methods, analytical works for the estimation of the effective CG interaction, based on integral equation theory, have also been developed 56 . A brief review and categorization of parametrization methods at equilibrium is given in references 17,57 .
Given the atomistic "target" g(r) from a free (i.e., without constraints) atomistic run, by inverting (13) and neglecting the higher order terms of γ(r) one can obtain a first candidate for a pair coarse-grained potential u(r). Then, one calculates the g(r) that corresponds to the first candidate and by iterating this procedure eventually obtains the desired two-body coarse-grained potential. This iteration should in principle converge since there exists a pair interaction that can be reconstructed from a given correlation function. However, this is only an approximation (accounting for the neglected terms of order ρ and higher in the expansion of γ(r)) since we know that the "true" CG interaction potential should be multi-body, as a result of integrating atomistic degrees of freedom. Hence, having agreement on g(r) does not secure proper thermodynamic behaviour and several methods have been employed towards this direction, see for example refs 7,40,58 and the references within.
In order to maintain the correct thermodynamic properties, our approach in this paper is based on cluster expanding (10) with respect to some small but finite parameter depending on the regime we are interested in. For technical reasons we will focus on low density -high temperature regime. As it will be explained in detail in the next section, the resulting cluster expansion provides us with a hierarchy of terms: together with the corresponding error estimates.
The above terms can in principle be calculated independently via fast atomistic simulations of 2, 3, etc. molecules, in the spirit of the conditional reversible work CRW method. In more detail, the effective non-bonded (two-body) CG potential can be computed as follows: (a) One method is by fixing the distance r 1,2 := r 1 − r 2 between two molecules and perform molecular dynamics with such forces that maintain the fixed distance r 1,2 . In this way we sample over the constrained phase space and obtain the conditioned partition function as in (10). Then, by integration of the constrained force the two-body effective potential can be obtained. (b) Alternatively, by inverting g(r) in (13) for two isolated molecules, the two-body effective potential can be directly obtained, since for such a system γ(r) = 1.
Here we examine both methods, see Figure 3. Note also that the validity of cluster expansion provides rigorous expansions for g(r), the pressure and the other relevant quantities. Hence, with this approach we can have a priori estimation of the errors made in (13). Another benefit of the cluster expansion is that the error terms can be written in terms of the coarse-grained quantities allowing for a posteriori error estimates and the design of adaptive methods 49 ; see also discussion in Section VII.

III. CLUSTER EXPANSION
The cluster expansion method originates from the work of Mayer and collaborators, see ref. 43 for an early review, and consists of expanding the logarithm of the partition function in an absolutely convergent series of an appropriately chosen small but finite parameter. Here we will adapt this method to obtain an expansion of the conditioned partition function (10).
For the purpose of this article we assume that the CG map T is a product T = ⊗ M i=1 T i creating M groups of l 1 , . . . , l M particles each. We index the particles in the i th group of the coarse-grained variable r i by k i 1 , . . . , k i li . We also denote them by q i := (q k i 1 , . . . , q k i l i ), for i = 1, . . . , M . Then (10) can be written as: where, for simplicity, we have introduced the normalized conditional measure: and by λ i we denote the measure 1 To perform a cluster expansion in the second term of (15) we rewrite the interaction potential as follows: Then, we have where for V ⊂ {1, . . . , N }, we denote by C V the set of connected graphs on the set of vertices with labels in V . Furthermore, for g ∈ C V , we denote by E(g) the set of its edges. Since µ in (16) is a normalized measure, from (15) we obtain: where is a function over the atomistic details of the system. Note that the above expression involves a sum over all possible pairs, triplets etc. which is a convergent series for values of the density ρ = N |Λ| and of the inverse temperature β such that ρC(β) < c 0 , where C(β) is defined in (2) and c 0 is a known small positive constant. 46 If we simplify the sum in (19) one can obtain 46 expansion (14) where and Recall also the definition of f i,j in (18).

A. Full calculation of the PMF
Notice that the potentials W (2) and W (3) in (20) and (21), respectively, have been expressed via the Mayer functions f i,j . However, the full effective interaction potential between two CG particles can be directly defined as the (conditional) two-body PMF given by By adding and subtracting 1, we can relate it to (20): Higher order terms in the above equation are expected to be less/more important in high/low temperature. Similarly, for three CG degrees of freedom r 1 , r 2 , r 3 , the full PMF is given by By adding and subtracting 1 we can relate it to (20) and (21) (in the following we simplify notation by not explicitly showing the dependence on the atomistic configuration and neglecting the normalized conditional measure): which implies that In principle, we can rewrite (14) with respect to W (2),full and W (3),full . Note however, that both of these terms contain the coarse-grained two-body interactions, hence in order to avoid double-counting, when we use both, we have to appropriately subtract the two-body contributions. For some related results, see also the discussion about Figure 11.

B. Thermodynamic consistency
As already mentioned, several coarse-graining strategies lack of thermodynamic consistency, see also the discussion by Louis 59 and Guenza 40 . On the other hand, by construction, the cluster expansion approach gives quantified approximations to the correct thermodynamic behaviour. Hence, from (14), by considering only the two-body contribution, for the finite volume free energy we have that where the error is uniform in N and |Λ| and negligible in the limit. Thus, the approximation U (2) of the CG Hamiltonian implies a good approximation of the free energy. Similarly, for the pressure as a function of the activity z, we have: Both quantities have limits given by absolutely convergent series with respect to ρ = N/|Λ| for the first and z or ρ for the second. As a side remark, let us mention that in order to compute them we have two options: the first is to use (27) and calculate the integral dr 1 . . . dr M e −βU (2) using molecular dynamics. Alternatively, we can use the corresponding expansions -e.g. for the free energy we would obtain 60 -and compute the coefficients β Λ . The latter are not bulk computations as they involve 2, 3, etc particles so they are rather efficient, at least up to some order.

C. Pair correlation function
Recalling the coarse-grained map T from the previous section, we fix two centers of mass r 1 and r 2 and integrate over all atomistic configurations so that the first two groups q 1 and q 2 of atomistic configurations have the above fixed centers of mass. Partitioning the N particles into M groups of l 1 , . . . , l M particles and choosing two of them (indexed by 1 and 2) to be the fixed ones, we define the "projected" correlation function at the coarse-grained scale as follows: Hence, using (14) we can construct coarse-grained approximations for the correlation functions as well. Alternatively, as a corollary of the cluster expansion, we can write (7) as a convergent power series with respect to the density. These are old results 45 for which the convergence has also been proved recently in the context of the canonical ensemble. 61 In the limit N → ∞, Λ → R d such that N |Λ| = ρ, we obtain: where and (33) Note that this formula could also be used at the coarse-grained level with the pair coarse-grained potential W (2) , giving an alternative way to compute it.

A. The model
A main goal of this work, as mentioned before, is to examine the parameterization of a coarse-grained model using the cluster expansion formalism described above for simple realistic molecular systems; in this work we study liquid methane and ethane. In more detail, we consider N molecules of CH 4 and we denote asq ≡ {q 1 , . . . ,q N } to be the positions of the N many carbons and q i ≡ {q i,1 , . . . , q i,4 } be the positions of the 4 hydrogens that correspond to the i th carbon. We have two types of interactions, namely the bonded with (many body) interaction potential V b and the non-bonded with pair interaction potential V nb . The latter are of Lennard-Jones type between all possibilities: C − C, C − H and H − H (with different coefficients), i.e., V nb = V CC + V CH + V HH . In the model used here the non-bonded interactions within the same CH 4 molecule are excluded.
The microscopic canonical partition function is given by where U nb is a pair potential of all possible pairs amongq, q 1 , . . . , q N , all of L-J type (eventually with different parameters). Note also that since only the 4 particles of H are indistinguishable, we have introduced the factor 1/4! for each molecule.
We are interested in computing the effective Hamiltonian when only the centers of mass of the N many molecules are prescribed. Hence, let us introduce a map T : Λ 5 → Λ which gives the center of mass of a molecule consisting of an atom of C together with the prescribed 4 atoms of H which are linked to C by the bonded interactions, i.e., by denotingq i ≡ (q i , q i ) we have: We introduce the variables r 1 , . . . , r N for the centers of mass. Our goal is to find the effective potential U eff (r 1 , . . . , r N ). We define the "bonded" (normalized) prior measure by Note that here we could have also included possible non-bonded interactions between atoms of the same molecule. This would be important for the case of coarse-graining a molecule with intra-molecular non-bonded interactions; for the methane molecule studied here such interactions do not exist. Then, from (34) we obtain: The effective free energy is defined by: for which we can construct approximations following formula (19). A similar analysis holds for ethane as well.
The total (atomistic) potential energy V (q), for both methane and ethane, is defined by where V bond (q), V angle (q) are quadratic intramolecular potential functions of the bonds and angles respectively. V LJ (q) is the non-bonded potential as defined in the previous subsection. The parameters values of CH 4 are summarized in Table I. The more simple, non-spherically symmetric ethane molecule consists of one rigid bond connecting two united atom CH 3 beads.

B. Simulations
The simplest system to simulate is the one with only two interacting methane, or ethane, molecules in vacuum. This is a reference system for which the many-body PMF is equal to the two-body one. In addition we have also simulated the corresponding liquid systems. The atomistic and CG model methane systems were studied through molecular dynamics and Langevin dynamics (LD) simulations. All simulations were conducted in the NVT ensemble. For the MD simulations the Nose-Hoover thermostat was used. Langevin dynamics models a Hamiltonian system which is coupled with a thermostat. 64 The thermostat serves as a reservoir of energy. The densities of both liquid methane and ethane systems were chosen as the average values of NPT runs at atmospheric pressure. NVT equilibration and production runs of few ns followed and the size of the systems were 512 CH 4 and 500 CH 3 − CH 3 molecules. We note here that the BBK integrator used for Langevin dynamics exhibits pressure fluctuations of the order of ±40 atm in the liquid phase, whereas temperature fluctuations have small variance and the system is driven to the target temperature a lot faster than with conventional MD.
In order to compute the effective non-bonded coarse-grained potential, different simulation runs have been used which are discussed below.

Constrained runs
The first method which we use in order to estimate the effective CG potential is by constraining the intermolecular distance between two molecules, r = r 1,2 , in order to compute the constrained partition function (10). We call it "constrained run" of two methane, or ethane, molecules and special care had to be taken in order to avoid long sampling of the low probability short distances. This method is very similar to the typical conditional reversible work methods in which CG degrees of freedom are constrained at a fixed values for deriving CG potentials, as well as in free energy calculations. Technically, we pin the centres of mass (COM) of each CG particle in space and, on every step throughout the Langevin dynamics trajectory, we subtract the total force acting on each COM. Hence, we allow the atoms to move, resulting in rotations but not translations of the CG degrees of freedom (CH 4 , COM). During these runs the constraint forces are recorded. The mean value f r12=r is calculated in the same manner and we get W (2),full, f (r), from f = −∇W . Both W (2),full, f (r) and W (2),full, u (r) are based on the same trajectory. Then, the effective potential is calculated by numerical integration of the constraint force f r12=r from r min up to r max . The constrained run technique described above, accelerates the sampling for short distances but there is a caveat; the ensemble average at very short distances (left part of the potential well) is strongly affected by the non-bonded forces on specific atoms between the two molecules. For example, the two CH 4 molecules are oriented according to the highly repulsive forces and rotate around the axis connecting the two COM's. Due to this specific reason, we utilized stochastic (Langevin) dynamics in order to better explore the subspace of the phase space, as a random kick breaks this alignment. We determine the minimum amount of steps needed for the ensemble average to converge, in a semi-empirical manner upon inspection of the error-bars.

Geometric direct computation of PMF
In order to further accelerate the sampling and alleviate the noise problems at high energy regions, that might become catastrophic in the case of the non-symmetric CH 3 − CH 3 model, we have also calculated the two-body PMF (constraint partition function) directly, through "full sampling" of all possible configurations using a geometrical method proper for rigid bodies. In more detail, the geometric averaged constrained two-body effective potential W (2),geom (r), is obtained by rotating the two CH 4 molecules around their COM's, through their Eulerian angles and taking account of all the possible (up to a degree of angle discretization) orientations. The main idea is to cover every possible (discretized) orientation and associate it with a corresponding weight. The Euler angles proved to be the easiest way to implement this; each possible orientation is calculated via a rotation matrix using three (Euler) angles in spherical coordinates.
The above way of sampling is more accurate (less noisy) than constrained MD and considerably faster. In addition, the nature of the computations allows massive parallelization of the procedure. We used a ZYZ rotation with dφ = dψ = dθ = π/20 for CH 4 and simple spherical coordinate sampling with dφ = π/20, dθ = π/45 for CH 3 − CH 3 (as it is diagonally symmetric in the united atom description). Note however, that in this case the molecules are treated as rigid bodies; i.e., bond lengths and bond angles are kept fixed, essentially it is assumed that intra-molecular degrees of freedom do not affect the intermolecular (non-bonded potential) ones. The advantage of this method is that we avoid long (and more expensive) molecular simulations of the canonical ensemble, which might also get trapped in local minima and inadequately sample the phase space. We should also state that this method is very similar to the one used by McCoy and Curro in order to develop a CH 4 united-atom model from all-atom configurations. 52 All atomistic and coarse-grained simulations have been performed using a home-made simulation package, whereas all analysis has been executed through home-made codes in Matlab and Python.

V. RESULTS
A. Calculation of the effective two-body CG potential First, we present data related to the calculation of the two-body potential of mean force for the ideal system of two (isolated) molecules. For such a system the conditional M-body CG PMF is a 2-body one, i.e., the pair approximation  Figure 4. Relation of the PMF through cluster expansions and energy averaging at high temperatures, i.e., W (2) (r1, r2) and W (2),full (r1, r2) through expansion over β for CH4 at T = 300K (left panel) and CH3 − CH3 at T = 650K (right panel). As expected from the analytic form and the relation between the two formulas, W (2) and W (2),full tend to converge to the same effective potential.
in the effective CG interaction is exact. In more detail, in Figures 3a and 3b we provide data for the CG effective interaction between two methane and ethane molecules, through the following methods: (a) A direct calculation of the PMF, W (2),geom , using a geometrical approach as described in Section IV B 2 that involves the direct calculation of the constraint partition function, treating the two molecules as rigid bodies. Note that in this case in the all-atom description bond lengths and bond angles are kept fixed.
(b) A calculation of the PMF using the constraint force approach, W (2),full, f , as described in section IV B 1. In this case the constraint force required to keep two methane molecules fixed at a specific distance is computed. Then through a numerical integration the effective potential between the two molecules (CG particles), U PMF CF , is computed. This is a method that has been extensively used in the literature to estimate effective pair CG interaction between two molecules, as well as differences in the free energy between two states. Alternatively, through the same set of atomistic configurations the two-body PMF, W (2),full, u , can be directly calculated through Eq (22).
(c) DBI method: The CG effective potential, W (2),g(r) , is obtained by inverting the pair (radial) correlation function, g(r), computed through a stochastic LD run with only two methane (or ethane) molecules in the simulation box. The pair correlation function, g(r), of the two methane molecules is also shown in Figure 3a.
The first two of the above methods refer to the direct calculation of the constrained partition function (10) with constrained forces and canonical sampling, while the third uses the "Direct Boltzmann Inversion" approach. All above data correspond to temperatures in which both methane and ethane are liquid at atmospheric pressure (values of −k B T are also shown in Figure 3).
First, for the case of the two methane molecules (Figure 3a) we see very good agreement between the different methods. As expected, slightly more noisy is the W (2),full, u (r 12 ) curve as fluctuations in the e −βu term for a given r 12 distance in equation (22), are difficult to cancel out. The small probability configurations in high potential energy u regimes having a large impact in the average containing the exponent, hence the corresponding plot is not as smooth as the others are. In addition, as previously mentioned, W (2),full, f comes from the same trajectory (run) but the integration of the f r12 from r cutoff up to r 12 washes out any non-smoothness. Note, that for the same system recently CG effective potentials based on IBI, force matching and relative entropy methods have been derived and compared against each other. 57 Second, for the case of the two ethane molecules (Figure 3b) we see a good, but not perfect, agreement between the different sets of data, especially in the regions of high potential energy (short distances). This is not surprising if we consider that high energy data from any simulation technique that samples the canonical ensemble, exhibit large error bars, due to difficulties in sampling. The latter is more important for ethane compared to methane case due to its molecular structure; indeed the atomistic structure of methane approximates much better the spherical structure of CG particles than ethane. The only method that provides a "full", within the numerical discretization, sampling at any distance is the geometric one; however as discussed before (see Section IV) such a method neglects the bond lengths and bond angle fluctuations.
Next, we also examine an alternative method for the computation of the effective CG potential, by calculating the approximate terms from the cluster expansion approach. For the latter we use the data from the constraint runs of two methane molecules integrated over all atomistic degrees of freedom, as given in formula (20). In Figures 4a and b we demonstrate the Potential of Mean Force through cluster expansions and the effect of higher order terms as shown in equation (23), of the two isolated molecules, for CH 4 and CH 3 − CH 3 respectively. As discussed in the Section III cluster expansion is expected to be more accurate at high temperatures and/or lower densities. For this, we examine both systems at higher temperatures, than of the data shown in Figure 3; Values of −k B T are shown with full lines. Both systems show the same behavior. First, it is clear that the agreement between W (2) and the (more accurate) W (2),full is very good only to long distances, whereas there are strong discrepancies in the regions where the potential is minimum as well as in the high energy regions (short distances). Second, it is evident that adding terms up to the second order with respect to β, we obtain a better approximation of W (2),full .

Effect of temperature-density
Next, we further examine the dependence of the PMF, for the two isolated methanes, on the temperature, by studying the molecules at T = 80K, 120K, 300K and 900K. In more detail, in Figures 5a and b we compare the difference between W (2) and W (2),full at different temperatures. As discussed in Section III, the cluster expansion method is valid only in the high temperature regime. This is directly observed in Figure 5a; at high temperatures, W (2) is very close to W (2),full , which is exact for the system consisting of two molecules. Note the small differences at short distances, which, as also discussed in the previous subsection, are even smaller if higher order terms are included in the calculation of W (2) ; see also Figure 4.
On the contrary, at low temperatures there is a strong discrepancy around the potential well as shown in Figure 5b. In fact, for values of r close to the potential well and for rather high values of β the contribution to the integral (2) is large and the latter can exceed one, rendering the expansion in (23) not valid. In Figure 5b we see that the term (20) is not small so the expansion (23) is not valid. The case for ethane is qualitatively similar.
For completeness, we also plot the potential of mean force at different temperatures for the system of two CH 4 molecules, see Figure 6. In principle, equation (20) is a calculation of free energy, hence it incorporates the temperature of the system and thus both approximations to the exact two-body PMF, W (2) and W (2),full , are not transferable. Indeed, we observe slight differences in the CG effective interactions (free energies) for the various temperatures, which become larger for the highest temperature.

B. Bulk CG CH4 runs using a pair potential
In the next stage, we examine quantitatively the accuracy of the effective CG interaction potential (approximation of the two-body PMF), in the liquid state based on structural properties like g(r). Here we use the different CG models (approximated pair CG interaction potentials) derived above, to predict the properties of the bulk CG methane and ethane liquids. In all cases we compare with structural data obtained from the reference all-atom bulk system, projected on the CG description.
In Figures 7a and b we assess the discrepancy between the CG (projected) pair distribution function, g(r), taken from an atomistic run, and the one obtained from the corresponding CG run based on W (2),full as already seen in  Figure 7. RDF from atomistic and CG using pair potential, W (2) , for CH4 system at T = 80K (left panel) and CH3 − CH3 at T = 150(right panel). Spherical CG approximation to the non-symmetric ethane molecule induces discrepancy implies there is more room for improvement. Figure 3 of methane and ethane respectively. Note that g(r) is directly related to the effective CG potentials (N = 2 in Eq (8)).
It is clear that for methane ( Figure 7a) the CG model with the W (2),full potential gives a g(r) very close to the one derived from the analysis of the all-atom data. This is not surprising if we consider that for most molecular systems small differences in the interaction potential lead to even smaller differences in the obtained pair correlation function. Interestingly the CG model with the W (2) is also in good agreement to the reference one, despite the small differences in the CG interaction potential discussed above (see Figures 4 and 8). As expected, the difference comes from the missing higher order terms of eq (14).
The fact that the CG effective potential, which is derived from two isolated methane molecules, give a very good agreement for the methane structure in the liquid state is not surprising if we consider the geometrical structure of methane, which is rather close to the spherical one, and the typical van der Waals type of interactions between methane molecules. On the contrary, for the case for ethane (Figure 7b) predictions of g(r) using pair CG potential are much different compared to the atomistic one, especially for the short distances. Even larger differences would be expected for more complex systems with long-range interactions, such as water. 57 Similar is the case also for the other temperatures (T = 80K) studied here (data not shown here).   Figure 8. RDF from atomistic data, and CG models using pair potential at different temperatures: (a) T=300K, (b) T=900K. In both cases the density is 0.3799 gr cm 3 .

Effect of temperature-density
We further study the structural behavior of the CG systems at different state points; i.e., temperature/density conditions, compared to the atomistic ones. First, we examine the temperature effect by simulating the systems discussed above (see Figure 7) at higher temperatures; however keeping the same density. In Figures 8a,b we present the RDF of methane from atomistic and CG runs using pair potential at T = 300K, and T = 900K respectively.
It is clear that the analysis of the CG runs using the W (2),full potential gives a pair distribution function g(r) close to the atomistic one for both (high) temperatures, similar to the case of the T = 100K shown above. In addition, the CG model with the W (2) potential is in very good agreement with the atomistic data at high temperature ( Figure  8b), whereas there are small discrepancies at lower temperatures (Figure 8a), in particular at the maximum of g(r). This is shown in the inset of Figures 8a,b. Note also that in this high temperature the incorporation of the higher order terms in W (2) leads to very similar potential as the W (2),full (see also Fig. 4), and consequently to very accurate structural g(r) data as well.
Next, we examine the structural behavior of the CG systems at different densities. In Figure 9a we present the g(r) from atomistic and CG runs using pair potential at different densities (ρ 1 = 0.3799 gr cm 3 and ρ 2 = 0.0395 gr cm 3 and T = 300K, and T = 900K). There is apparent discrepancy from the reference (atomistic) system in both densities in agreement to the data discussed above in Figure 8a.
For the case of higher temperature data (T = 900K) and the same densities, as shown in Figure 9b, the pair distribution function, g(r), obtained from the CG model with the W (2) effective interaction is very close to the data derived from the W (2),full one, and in very good agreement to the reference, all-atom, data. This is not surprising since, as discussed before, at high temperatures the cluster expansion is expected to be more accurate, since cluster expansions hold for high T and low ρ. From Figure 9a we deduce that despite the different potentials W (2) , W (2),full (Figure 4), we obtain the same g(r) for the liquid case, as a result of the close packing and frequent collisions.
Overall, the higher the temperature the better the agreement in the g(r) derived from the CG models using any of W (2) and W (2),full . These data are in better agreement with the atomistic data as well.

VI. EFFECTIVE THREE-BODY POTENTIAL
In the last part of this work we briefly discuss the direct computation of the three-body effective CG potential and its implementation in a (stochastic) dynamic simulation. More results about the three-body terms will be presented in a future work. 53 A. Calculation of the effective three-body potential In the following we present data for the 3-body potential of mean force estimated from simulation runs and geometric computations involving three isolated molecules. We have two suggestions for the 3-body PMF: (a) Formula (21) derived from cluster expansion formalism, which is valid for rather high temperatures and (b) another one based on the McCoy-Curro scheme given in formula (24).
Similarly to the two-body potential, the corresponding calculations can be performed by running constrained molecular dynamics (or any other method that performs canonical sampling). For this one needs to calculate the derivative of the three-body potential with respect to some distance. However, as previously stated, deterministic MD simulations of a constrained system might easily get trapped in local energy minima, so we utilized stochastic dynamics for the three-body case. In addition, rare events (high energy, low probability configurations) induce noise to the data, despite long equilibration (burn-in) periods or stronger heat-bath coupling in the simulations. Although smoothing could in principle have been applied, it would wash-out important information needed upon derivation with respect to positions (f = −∇ q W (3) ). Therefore, we choose here to present results from the "direct" geometric averaging approach. The total calculations are one order of magnitude more than the two-body ones (all possible orientations of the two molecules for one of the third one), so special care was given to spatial symmetries.
The new effective three-body potential, W (3),full (r 12 , r 13 , r 23 ), incorporates three intermolecular distances: r 12 , r 13 , r 23 . The discretization of the COM's in space is on top of the angular discretization mentioned in Section IV B 2 and relates to the above three distances. In more detail, in In each case the sum of the corresponding two-body terms is also shown. At smaller distances, the potential of the triplet deviates from the sum of the three pairwise potentials and this is where improvement in accuracy can be obtained. As shown in Figure 10 improvement is needed for close distances around the (3 dimensional) well. We used a 3-dimensional cubic polynomial to fit the potential data (conjugate gradient method) which means that 20 constants should be determined. A lower order polynomial cannot capture the curvature of the forces upon differentiation. The benefit of this fitting methodology (over partial derivatives for instance) is the analytical solution of the forces with respect to any of r 12 , r 13 , r 23 in contrast to tabulated data that induce some small error.
Overall, there are clear differences between the 3-body PMF, W (3),full , and the sum of three two-body interactions, W (2),full , at short r 12 , r 13 and r 23 distances. On the contrary, for larger distances the sum of two-body interactions seems to represent the full three-body PMF very accurately. This is a clear indication of the rather short range of the three-body terms. Based on the above data, the range of the 3-body terms for this system (methane at T = 80K) is: r 12 ∈ [3.8 : 4.1]Å, r 13 ∈ [3.8 : 4.1]Å and r 23 ∈ [3.8 : 5]Å; hence, the maximum distance for which three-body terms were considered, is r cut-off,3 =5Å. In practice we need to identify all possible triplets within r cut-off,3 . Naturally, by including higher-order terms the computational cost has increased as well. More information about the numerical implementation of the three-body CG effective potential and its computational efficiency will be given elsewhere. 53 We should state here that in order to keep constant the temperature (in the BBK algorithm) due to the extra three-body terms in the CG force field a larger coupling constant value for the heat bath was required.
B. CG Runs with the effective three-body potential Next we examine the effect of the 3-body term on the CG model by performing bulk CG stochastic dynamics simulations using the new CG model with the 3-body terms described above. Results about the pair distribution function, g(r), for bulk (liquid) methane at T = 80K are shown in in Figure 11. In this graph data from the atomistic MD runs (projected in the CG description), the CG model involving only pair CG potentials, and the new CG model that also involves 3-body terms are shown. First, it is clear that g(r) data derived from the CG model that involves only pair CG potentials show clear deviations, compared to the reference all-atom data. Note, that these differences are slightly larger than the ones discussed before (see Figure 7), in which data at a higher temperature are presented. Second, the incorporation of the three-body terms in the effective CG potential slightly improves the prediction of the g(r), mainly in the first maximum regime.

VII. DISCUSSION AND CONCLUSIONS
In recent years we have experienced an enormous increase of computational power due to both hardware improvements and clever CPU-architecture. However, atomistic simulations of large complex molecular systems are still out of reach in particular when long computational times are desirable. A generic strategy in order to improve efficiency of the computational methods is to reduce the dimensionality (degrees of freedom) by considering systematic coarsegrained models. There have been many suggestions on how to compute the relevant CG effective interactions in such models; a main issue here is that even if in the microscopic (atomistic) level there are only pair interactions, after coarse-graining a multi-body effective potential (many-body PMF) is derived, which for realistic molecular complex systems cannot be calculated. Therefore, a common trend has been to approximate them by an "effective" pair potential by comparing the pair correlation function g(r). This seems reasonable since given the correlation function one can solve the "inverse problem" 65 and find an interaction to which it corresponds. But, this is an uncontrolled approximation without thermodynamic consistency.
Instead, here we suggest to explicitly compute the constrained configuration integral over all atomistic configurations that correspond to a given coarse-grained state and from that suggest approximations with a quantifiable error. This is similar to the virial expansion where one needs to integrate over all positions of particles that correspond to a fixed   Figure 11. RDF from atomistic and CG using pair, W (2),full , and three-body, W (3),full , potential for CH4 (T = 80K). Three dimensional cubic polynomial was used for the fitting. "Clustered" means that one particular CG atom belongs to one triplet in comparison to all possible triplets. density and it is based on the recent development of establishing the cluster expansion in the canonical ensemble. 46 ; see also Ref. 55,61 for the corresponding (in the canonical ensemble) expansions for the correlation functions and the Ornstein-Zernike equation. The main drawback that limits the applicability of these expansions is that they are rigorously valid only in the gas phase. To extend them to the liquid state is an outstanding problem and even several successful closures like the Percus-Yevick are not rigorously justified. Therefore, there is need of further developing these methods and relate them to computational strategies.
In this paper we extend the above methods by presenting an approach based on cluster expansion techniques and numerical computations of isolated molecules. As a first test we presented a detailed investigation of the proposed methodology to derive CG potentials for methane and ethane molecular systems. Each CG variable corresponds to the center-of-mass for each molecule. Below, we summarize our main findings: (a) The hierarchy of the cluster expansion formalism allowed us to systematically define the CG effective interaction as a sum of pair, triplets, etc. interactions. Then, CG effective potentials can be computed as they arise from the cluster expansion.
(b) The two-body coarse-grained potentials can be efficiently computed via the cluster expansion giving comparable results with the existing methods, such as the conditional reversible work. In addition we present a more efficient direct geometric computation of the constrained partition function.
(c) The obtained pair CG potentials were used to model the corresponding liquid systems and the derived g(r) data were compared against the all-atom ones. Clear differences between methane and ethane systems were observed; For the (almost spherical) methane, pair CG potentials seems to be a very good approximation, whereas much larger differences between CG and atomistic distribution functions were observed for ethane.
(d) We further investigated different temperature and density regimes, and in particular cases where the two-body approximations are not good enough compared to the atomistic simulations. In the latter case, we considered the next term in the cluster expansion, namely the three-body effective potentials and we found that they give a small improvement over the pair ones.
Overall, we conjecture that the cluster expansion formalism can be used in order to provide accurate effective pair and three-body CG potentials at high T and low ρ regimes. In order to get significantly better results in the liquid regime one needs to consider even higher order terms, which are in general more expensive to be computed and more difficult to be treated. A more detailed analysis of the higher-order terms will be a part of a future work. 53 Finally, another future goal is to extend this investigation in larger molecules (e.g. polymeric chains) that involve intra-molecular CG effective interactions as well, and to systems with long range (e.g. Coulombic) interactions.