Denoise Pretraining on Nonequilibrium Molecules for Accurate and Transferable Neural Potentials

Recent advances in equivariant graph neural networks (GNNs) have made deep learning amenable to developing fast surrogate models for expensive ab initio quantum mechanics (QM) approaches for molecular potential predictions. However, building accurate and transferable potential models using GNNs remains challenging, as the quantity and diversity of data are greatly limited by the expensive computational cost and level of theory of QM methods, especially for large and complex molecular systems. In this work, we propose denoise pretraining on nonequilibrium molecular conformations to achieve more accurate and transferable GNN potential predictions. Specifically, the atomic coordinates of sampled nonequilibrium conformations are perturbed by random noise, and GNNs are pretrained to denoise the perturbed conformations and recover the original coordinates. Rigorous experiments on multiple benchmarks reveal that pretraining significantly improves the accuracy of neural potentials. Furthermore, we show that the proposed pretraining approach is model-agnostic, as it improves the performance of different invariant and equivariant GNNs. Notably, our models pretrained on small molecules demonstrate remarkable transferability, improving performance when fine-tuned on diverse molecular systems, including different elements, charged molecules, biomolecules, and larger systems. These results highlight the potential of leveraging denoise pretraining to build more generalizable neural potentials for complex molecular systems.

Invariant features in these models, however, fail to encode directional information. In neural potentials, we are interested in equivariance with respect to three-dimensional (3D) rigid-body transformations, such that the neural network commutes with any 3D rotations, translations, and/or reflections. The first category of equivariant GNNs is based on irreducible representations in group theory, including Tensor Field Network (TFN), 40 Cormorant, 41 and SE(3)-Transformer. 42 A second category develops linear operations on scalar and vectorial features to pass information between layers while keeping the equivariance. TorchMD-Net extends such an architecture with the multi-head self-attention mechanism. 50 In our work, we focus on equivariant GNNs due to their superior performance in neural potential benchmarks. 49

To improve the performance of GNNs on molecular property predictions, self-supervised learning (SSL) approaches 51,52 have been investigated. GNNs are first pretrained via SSL to learn expressive molecular representations and then fine-tuned on downstream prediction tasks. Predictive SSL methods rely on recovering the original instance from partially observed molecular structures. It is noted that most SSL works neglect 3D information. Even though a few SSL approaches incorporate 3D information, they are not built upon equivariant GNNs, which greatly limits their application to accurate neural potential predictions. Recently, a few works have proposed denoising as an SSL approach to pretrain GNNs. 66-68 By predicting the noise added to atomic coordinates at equilibrium states, GNNs are trained to learn a particular pseudo-force field. Such a method has demonstrated effectiveness on QM property prediction benchmarks like QM9. 69 Nevertheless, it only leverages molecules at equilibrium states, which is far from sufficient for accurate neural potentials, since potential predictions require the evaluation of nonequilibrium molecular structures in simulations. Overall, SSL has not been well studied for neural potential predictions.
Although neural potentials have been extensively investigated, their accuracy and transferability are still limited due to the reliance on expensive QM calculations to obtain training data. In particular, for large and complex molecular systems, collecting accurate and sufficient QM data can be extremely challenging or even infeasible. This raises two questions: 1) can we use SSL to improve the accuracy of neural potentials with currently available data; and 2) can we use SSL to develop more transferable neural potentials that leverage relatively rich molecular potential data (e.g., small molecules) and apply pretrained models to large and complex molecular systems with limited data? To address these challenges, we propose an SSL pretraining strategy on nonequilibrium molecules to improve the accuracy and transferability of neural potentials (Figure 1(a)). This involves adding random noise to the atomic coordinates of molecular systems at nonequilibrium states and training GNNs to predict the artificial noise in an SSL manner (Figure 1(b)). We then fine-tune the pretrained models on multiple challenging molecular potential prediction benchmarks (Figure 1(c)).

Equivariant Graph Neural Networks
In this work, we investigate GNNs for molecular potential predictions, which require the 3D positional information of atoms. To this end, equivariant GNNs are introduced, which extend the message-passing functions to respect the physical symmetry of molecular systems. Let T_g : X → X define a set of transformations of the abstract group g ∈ G in the input space X, and let φ : X → Y be a function that maps the input to the output space Y. A function φ is equivariant to g if there exists a transformation S_g : Y → Y such that φ(T_g(x)) = S_g(φ(x)), ∀g ∈ G and x ∈ X. E(n)-equivariance denotes equivariance with respect to the Euclidean group E(n), which comprises all translations, rotations, and reflections in n-dimensional Euclidean space, while SE(n)-equivariance only satisfies translation and rotation equivariance. 70 Equivariant GNNs introduce equivariance as an inductive bias for molecular modeling and demonstrate superior performance in many energy-related property predictions.
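As a minimal numerical illustration of these definitions, consider a sketch in PyTorch with a hypothetical squared-distance energy (a toy stand-in, not one of the models used in this work): the energy is E(3)-invariant, and its negative gradient (the forces) is equivariant with S_g given by the rotation.

```python
# Numerical check of invariance/equivariance for a toy distance-based energy.
import torch

def toy_energy(x):                        # x: (N, 3) atomic coordinates
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum()              # pairwise squared distances: E(3)-invariant

x = torch.randn(5, 3, requires_grad=True)
Q, _ = torch.linalg.qr(torch.randn(3, 3))   # random orthogonal matrix (rotation/reflection)
t = torch.randn(3)                          # random translation

# Invariance: phi(T_g(x)) = phi(x)
assert torch.allclose(toy_energy(x @ Q.T + t), toy_energy(x), atol=1e-4)

# Equivariance of forces: S_g is the rotation Q (translations leave forces unchanged)
f = -torch.autograd.grad(toy_energy(x), x)[0]
x2 = (x.detach() @ Q.T + t).requires_grad_(True)
f2 = -torch.autograd.grad(toy_energy(x2), x2)[0]
assert torch.allclose(f2, f @ Q.T, atol=1e-4)
```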
One strategy for designing equivariant message-passing operations is based on irreducible representations in group theory. 41,42 TFN 40 introduces Clebsch-Gordan coefficients, radial neural networks, and spherical harmonics as building blocks for SE(3)-equivariant message passing, as given in Eqs. 3 and 4 (we ignore the layer superscript to simplify the notation).
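As a sketch, the standard TFN formulation (following ref 40; the index convention is our assumption rather than the paper's exact notation) combines radial neural networks φ_J^{lk}, spherical harmonics Y_{Jm}, and Clebsch-Gordan coefficients Q_{Jm}^{lk}:

m_i^k = Σ_{j∈N_i} Σ_l W^{lk}(r_ij) h_j^l,    (3)

W^{lk}(r_ij) = Σ_{J=|k−l|}^{k+l} φ_J^{lk}(‖r_ij‖) Σ_{m=−J}^{J} Y_{Jm}(r_ij/‖r_ij‖) Q_{Jm}^{lk},    (4)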
where r_ij = x_i − x_j is the directional vector between the Cartesian coordinates x_i and x_j of two atoms i and j, h_j^l is a type-l feature of node j, and W^{lk} is the weight kernel that maps type-l features to type-k features. Another method to build equivariance is to apply only linear operations (i.e., scaling, linear combination, dot product, and vector product) to vectorial features in message passing. 45,48 Following this insight, a straightforward way to build E(3)-equivariant operations is to keep track of a vectorial feature v_i ∈ R^{3×F} besides the scalar feature h_i for each node.
Thus, the message passing function and update function for vectorial features are shown in Eq. 5 and 6, respectively.
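A form consistent with the description below (a sketch; the paper's exact parameterization of f_m and f_v may differ) is

m_ij = f_m(h_i^t, h_j^t, ‖r_ij‖),    (5)

v_i^{t+1} = v_i^t + Σ_{j∈N_i} f_v(m_ij) r_ij,    (6)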
where f_v maps the message m_ij to a scalar, and the vectorial feature of each atom is updated as a linear combination of the directional vectors r_ij in each layer t. It should be noted that only linear operations can be applied to vectorial features in the message function f_m to keep the equivariance. Usually, atoms within a cutoff distance d_cut of atom i are included in the neighbor list N_i. Such a strategy paves a more flexible way to equivariant GNNs compared with irreducible representations.
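The strategy above can be made concrete with a short sketch in PyTorch (the layer sizes, activation choices, and single vector channel are illustrative assumptions, not the exact architectures used in this work):

```python
# Minimal E(3)-equivariant layer in the spirit of Eqs. 5 and 6: scalar messages
# are built from invariant distances; vectorial features are updated only via
# linear combinations of directional vectors, which preserves equivariance.
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.f_m = nn.Sequential(nn.Linear(2 * hidden_dim + 1, hidden_dim), nn.SiLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.f_v = nn.Linear(hidden_dim, 1)   # maps each message to a scalar weight
        self.f_h = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.SiLU())

    def forward(self, h, v, x, edge_index):
        # h: (N, F) invariant features; v: (N, 3) vectorial features; x: (N, 3) coords
        i, j = edge_index                       # edges within the cutoff distance d_cut
        r_ij = x[i] - x[j]                      # directional vectors
        d_ij = r_ij.norm(dim=-1, keepdim=True)  # invariant distances
        m_ij = self.f_m(torch.cat([h[i], h[j], d_ij], dim=-1))        # Eq. 5
        # Eq. 6: linear combination of directional vectors keeps E(3) equivariance
        v = v + torch.zeros_like(v).index_add_(0, i, self.f_v(m_ij) * r_ij)
        h = h + torch.zeros_like(h).index_add_(0, i, self.f_h(m_ij))  # invariant update
        return h, v
```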
After T equivariant message-passing layers, global pooling over all node features can be applied to obtain the representation of the whole molecule. In our work, we apply summation over all node features as the pooling. 71 The pooled representation is then fed into an MLP to predict the molecular potential as Ê = MLP(Σ_i h_i^T). Besides, invariance indicates that transformations (e.g., rotations and translations) of the input do not change the output, such that φ(T_g(x)) = φ(x), ∀g ∈ G and x ∈ X.
Invariance is a special case of equivariance where S_g is the identity mapping for all g ∈ G. Building an invariant GNN is more straightforward than building an equivariant one. By simply replacing Eq. 1 with Eq. 5, one can obtain an E(3)-invariant GNN, since the geometric information is only embedded as the distances between atoms in message passing.
In this study, we implement one invariant GNN (i.e., SchNet 35 ) and three equivariant GNNs for molecular potential predictions.

Denoising on Nonequilibrium Molecules
The pretraining strategy is based on predicting artificial noise added to sampled conformations of molecules. A conformation of a molecule is denoted by V and X, where V encodes the atomic information and X ∈ R^{N×3} contains the Cartesian coordinates of all N atoms in the molecule. The interatomic interactions A can be directly derived from X for GNN models. Random noise E ∈ R^{N×3} sampled from the Gaussian N(0, σI) is added to the positions of all atoms to perturb the molecular conformation. The perturbed conformation is denoted as X̃ = X + E. During pretraining, a GNN model is trained to predict the added noise, with the objective function shown in Eq. 7.
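A standard denoising objective consistent with this description (reconstructed here; the exact weighting in the paper's Eq. 7 is an assumption) is

min_θ E_{p(X̃;V)} ‖φ_θ(X̃; V) − E‖²,    (7)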
where φ_θ denotes an invariant/equivariant GNN parameterized by θ and p(X̃; V) is the probability distribution of perturbed molecular conformations given the atomic information.
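A minimal pretraining step under this objective might look as follows (a sketch; `model` stands for any of the invariant/equivariant GNNs, and its interface is a hypothetical placeholder rather than the paper's code):

```python
# One denoise-pretraining step: perturb coordinates, predict the noise, regress.
import torch

sigma = 0.2  # noise scale in Angstroms (see the noise-selection discussion)

def pretrain_step(model, optimizer, x, atom_types):
    noise = sigma * torch.randn_like(x)          # per-atom Gaussian noise E
    x_perturbed = x + noise                      # perturbed conformation X~ = X + E
    pred = model(x_perturbed, atom_types)        # (N, 3) predicted noise
    loss = ((pred - noise) ** 2).sum(-1).mean()  # MSE denoising objective (Eq. 7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```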
Such a denoising strategy in a self-supervised manner is related to learning a pseudo-force field at the perturbed states. 66,72,73 In this work, we extend this concept from equilibrium molecular conformations to nonequilibrium ones, which is pivotal for molecular simulations. For a given molecule with N atoms encoded as V, the probability of a molecular conformation X is p(X; V) ∼ exp(−E(X; V)) following the Boltzmann distribution, where E(X; V) is the potential energy. The force field of each atom in the conformation is therefore F(X; V) = −∇_X E(X; V) = ∇_X log p(X; V). The molecular conformation X can be sampled via molecular dynamics simulation, normal mode sampling, torsional sampling, etc. These methods provide diverse, physically and chemically feasible nonequilibrium conformations around the equilibrium states, which help understand the energetics of molecular systems. 30 It should be noted that in our case, we refer to nonequilibrium states as molecular conformations that are not at energy minima, which differs from the terminology in statistical mechanics. By adding random noise to each atomic coordinate, an unrealistic conformation X̃ can be obtained which has higher energy than the sampled conformation X. Driven by this, we assume p(X̃; V) can be approximated by a Gaussian distribution q(X̃; X, V) = N(X, σI_{3N}) centered at X. Following this assumption, the force field is proportional to the perturbation noise for a given variance σ, as shown in Eq. 8.
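Taking the score of the Gaussian q gives (our reconstruction of Eq. 8 under the stated assumption, with σ treated as the variance scale)

F(X̃; V) ≈ ∇_X̃ log q(X̃; X, V) = −(X̃ − X)/σ = −E/σ.    (8)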
Therefore, training a GNN to match the perturbation noise is equivalent to learning a pseudo-force field, under the assumption that the probability of atomic positions around a sampled conformation follows a Gaussian distribution. In our case, the temperature T in the Boltzmann distribution is treated as a fictitious term in the Gaussian approximation, meaning that T is not explicitly included in the denoise pretraining. This simplification makes it possible to use data sampled from normal mode or torsional sampling that does not involve a temperature. It is also worth noting that the temperature T in the Boltzmann distribution of a molecular system can be related to the selection of the optimal standard deviation σ of the Gaussian noise in pretraining. Investigating data and pretraining strategies that connect σ with T is out of the scope of this work but could be an interesting direction to explore.
It should be noted that though a Gaussian distribution can be a good approximation of the distribution of states around a local minimum, it may fail for states with high energies. This assumption relates to the pretraining dataset and the selection of the standard deviation σ of the Gaussian noise added to atomic positions. The pretraining molecular conformations should be physically reasonable and not overly distorted. Consequently, adding noise to these conformations leads to higher-energy states, ensuring that denoise pretraining enables GNNs to learn a score function that points toward lower-energy conformations. Details regarding the pretraining dataset can be found in section 3.1. Moreover, the selection of the standard deviation σ of the Gaussian noise affects the pretraining. When excessive noise is added, it distorts the conformation to such an extent that the denoising model struggles to accurately recover the original, lower-energy conformation. On the other hand, if the perturbation is too small, the change in molecular energies can be trivial and it is hard for GNNs to learn meaningful representations in pretraining. A detailed investigation of how σ affects the performance of neural potentials can be found in section 3.6.

Prediction of Noise
To predict the noise given the perturbed molecular conformation, the GNN models are equipped with an output head that maps the learned atomic representations to a per-atom 3D vector matching the added noise E ∈ R^{N×3}.
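For the equivariant models, one possible form of such a head (an assumption for illustration, not the paper's exact architecture) predicts per-atom vectors as linear combinations of the equivariant vectorial features, so the output rotates with the input conformation:

```python
# Sketch of an equivariant noise-prediction head: a bias-free linear map over
# vector channels keeps the per-atom 3D output equivariant.
import torch.nn as nn

class NoiseHead(nn.Module):
    def __init__(self, n_vector_channels=8):
        super().__init__()
        self.mix = nn.Linear(n_vector_channels, 1, bias=False)  # channel mixing only

    def forward(self, v):
        # v: (N, 3, C) equivariant vectorial features from the last layer
        return self.mix(v).squeeze(-1)   # (N, 3) predicted per-atom noise
```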

Datasets
To evaluate the performance of the denoise pretraining strategy on molecular potential predictions, five datasets containing various nonequilibrium molecular conformations with DFT-calculated energies are investigated, as listed in Table 1. In this study, we combine ANI-1 and ANI-1x as the pretraining dataset since they include various small organic molecules with different conformations. In pretraining, all conformations of each molecule are split into train and validation sets by a ratio of 95%/5%. All datasets, including ANI-1 and ANI-1x, are benchmarked in fine-tuning for potential predictions. By this means, we investigate whether invariant or equivariant GNNs pretrained on small molecules generalize to various other molecular systems. During fine-tuning, we split each dataset based on the conformations of each molecule by a ratio of 80%/10%/10% into train, validation, and test sets for ANI-1, ANI-1x, MD22, and SPICE. For ISO17, we follow the splitting strategy reported in the original literature. 75
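The conformation-wise splits can be sketched as follows (a hypothetical helper for illustration; the paper's actual data pipeline is not shown):

```python
# Split the conformations of one molecule into disjoint subsets by fraction.
import numpy as np

def split_conformations(n_conf, fractions, seed=0):
    idx = np.random.default_rng(seed).permutation(n_conf)
    bounds = np.cumsum([int(f * n_conf) for f in fractions[:-1]])
    return np.split(idx, bounds)

# pretraining: 95%/5% train/validation per molecule
train_idx, val_idx = split_conformations(1000, [0.95, 0.05])
# fine-tuning: 80%/10%/10% train/validation/test per molecule
tr, va, te = split_conformations(1000, [0.8, 0.1, 0.1])
```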

Experimental Settings
During pretraining, each invariant or equivariant GNN is trained for 5 epochs with a maximal learning rate of 2 × 10^−4 and zero weight decay. All models are pretrained on the combination of ANI-1 and ANI-1x and fine-tuned on each dataset separately. In pretraining, we employ the AdamW optimizer 78 with a batch size of 256, and a linear learning rate warmup with cosine decay 79 is applied. During fine-tuning, the models are trained for 10 epochs on ANI-1 and ANI-1x and for 50 epochs on SPICE and ISO17. We apply different experimental settings for each molecule in MD22 since the molecules vary greatly in the number of atoms. For comparison with the pretrained models, we also train GNNs from scratch on each dataset following the same settings as their denoise-pretrained counterparts. Detailed fine-tuning settings can be found in Supplementary Information S2. In fine-tuning, the parameters of the pretrained message-passing layers are transferred. Atomic features from the message-passing layers are first summed and fed into a randomly initialized MLP to predict the energy of each molecule.
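The optimization setup described above can be sketched as follows (the warmup length and total step count are illustrative assumptions):

```python
# AdamW with linear warmup followed by cosine decay of the learning rate.
import math
import torch

model = torch.nn.Linear(8, 1)   # placeholder; stands in for any GNN above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.0)

def lr_lambda(step, warmup=1000, total=100000):
    if step < warmup:
        return step / warmup                           # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() after each optimizer.step() during training
```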

Neural Potential Predictions
To investigate the performance of denoise pretraining, we pretrain the invariant and equivariant GNN models on the combination of ANI-1 and ANI-1x and fine-tune the models on each dataset separately.

Transferability
The results presented in the previous section demonstrate that denoise pretraining is highly effective for improving the accuracy of neural potentials when the downstream tasks and pretraining data cover similar chemical space (i.e., small molecules). However, real-world simulations often involve much larger and more complex molecular systems, making it challenging to obtain the energies of such systems using expensive ab initio methods. Therefore, the ability to transfer GNN models pretrained on small molecules (with sufficient training data) to larger and more complex molecular systems (with limited training data) would be highly beneficial. To this end, we fine-tune the four pretrained GNNs, either invariant or equivariant, on three other molecular potential benchmarks (i.e., ISO17, 75 SPICE, 76 and MD22 77 ), which cover molecular systems different from the pretraining data. Overall, our experiments suggest that the proposed pretraining method can improve the accuracy and transferability of GNN models across diverse molecular systems, making it a promising method for predicting the energies of complex molecular systems with limited training data.

Data Efficiency
To further evaluate the benefits of denoise pretraining for molecular potential predictions, we train GNNs with different dataset sizes. As shown in Figure 4, we compare the performance of pretrained and non-pretrained EGNN on ANI-1x as well as on MD22 systems including AT-AT and Ac-Ala3-NHMe.

Selection of Noise
Fig. 5 illustrates the performance of pretrained TorchMD-Net on ANI-1x with different noise scales. TorchMD-Net without pretraining is included as σ = 0.0 Å for comparison. As shown, pretrained models achieve the best performance with noise σ = 0.2 Å on both RMSE and MAE.

Conclusions
To summarize, our proposed denoise pretraining method for invariant and equivariant graph neural networks (GNNs) on nonequilibrium molecular conformations enables more accurate and transferable neural potential predictions. Our rigorous experiments across multiple benchmarks demonstrate that pretraining significantly improves the accuracy of neural potentials. Furthermore, GNNs pretrained on small molecules through denoising exhibit superior transferability and data efficiency on diverse molecular systems, including different elements, polar molecules, biomolecules, and larger systems. This transferability is particularly valuable for building neural potential models of larger and more complex systems where sufficient data is often challenging to obtain. Notably, the model-agnostic nature of our approach means it can be applied to improve different invariant and equivariant GNN architectures.