Learning in Continuous Action Space for Determination of High Dimensional Potential Energy Surfaces

Reinforcement learning (RL) approaches that combine a tree search with deep learning have found remarkable success in searching exorbitantly large, albeit discrete action spaces, as demonstrated recently in board games like chess, Shogi and Go. Many real-world materials discovery and design applications, however, involve multi-dimensional search problems and learning domains that have continuous action spaces. Exploring high-dimensional potential energy surfaces (PES) of materials to represent inter-and intra-molecular interactions, for example, involves a continuous action search to ﬁnd optimal potential parameters or coeﬃcients. Traditionally, these searches are time consuming (often several years for a single system) and have been driven by human intuition and/or expertise and more recently by global/local optimization searches that have issues with convergence and/or do not scale well with the search dimensionality. Here, in a departure from discrete action and other gradient-based approaches, we introduce a RL strategy based on decision trees that incorporates modiﬁed rewards for improved exploration, eﬃcient sampling during playouts and a “window scaling scheme” for enhanced exploitation, to enable eﬃcient and scalable search for continuous action space problems. Using high-dimensional artiﬁcial landscapes and control RL problems, we successfully benchmark our approach against popular global optimization schemes and state of the art policy gradient methods, respectively. We further demonstrate its eﬃcacy to perform high-throughput PES search for 54 diﬀerent elemental systems across the Periodic table, including alkali, alkaline-earth, transition metals, metalloids, as well as non-metals. Using a well-sampled ( ∼ 165,000 conﬁgurations) ﬁrst-principles derived training and test dataset, we demonstrate that the new class of RL trained bond-order potentials capture the size-dependent energetic landscape from few atom clusters to bulk (energy errors << 200 meV/atom over a 3-6 eV sampled range) as well as their dynamics (force errors << 0.5 eV/˚A over a 50-100 eV/˚A range). We analyze the error trends across diﬀerent elements in the latent space and trace their origin to elemental structural diversity and the smoothness of the element energy surface. Finally, we run molecular dynamics using these RL trained potentials and perform a comprehensive test of dynamic stability of more than 40,000 clusters sampled for diﬀerent elements across the Periodic table. Our newly developed high-quality potentials will enable accelerated nanoscale materials design and discovery. Broadly, our RL strategy will be applicable to many other physical science problems involving search over continuous action spaces.

Reinforcement learning (RL) and decision tree (e.g., Monte Carlo tree search) based RL algorithms are emerging as powerful machine learning approaches, allowing a model to directly interact with and learn from the environment [1]. RL has achieved impressive capabilities with tremendous success in solving problems with intractable search space, for example in game playing (such as chess, Shigo, and Go) [2,3], chemical synthesis planning [4,5] or drug discovery [6]. However, these methods have been limited to discrete action space-e.g., "move pawn to e4", "add acetone reagent" or "remove chemical group -COOH" [7]. Many realworld problems including several grand challenges in materials discovery and design involve decision making and search via continuous action space [8]. These include, for instance, the problem of searching optimal model parameters/weights, exploring low energy material phases or inverse design, optimizing experimental parameters or synthesized material properties [5,9]. While it is highly desirable to translate the merits of RL methods to solve search problems in materials design, a challenge lies in its continuous, complex and multi-dimensional nature and is further complicated by an exorbitantly large number of degenerate and/or sub-optimal solutions.
One of the more successful versions of RL involves use of Monte Carlo tree search (MCTS) [10], which utilizes playouts to select the best possible action (with maximum reward) from the current state ( Figure  1). Here, playouts refer to random actions from the model that allow it to learn by interacting with the environment. The larger the number of playouts, the better the model estimate of reward, and the more promising is the model action selection. Notably, the MCTS performs this search in a tree-structure, continuously growing either those leafs of the tree that result in maximum reward (exploitation) or those which have not been adequately sampled (exploration) [11]. It is important to understand that when the action space is discrete, a parent leaf will exhibit limited possible child leaves, all (or some) of which can be assessed for their prominence. When the action space is continuous, the number of possible child leaves are infinite, irrespective of the depth of the parent leaf, making the use of MCTS seemingly impossible in a continuous action space.
In a significant departure from traditional MCTS approaches [12], we introduce three concepts to tackle continuous action space problems: (1) a uniqueness function to avoid degeneracy, (2) correlating tree depth to action space, and (3) implementing an adaptive sampling of playouts. The first ensures that only unique leaves are being explored during MCTS. This avoids the common issue of convergence of two initially separate MCTS branches to the same region of the continuous search space. More importantly, this resolves the problem of multiple representations of the same (degenerate) solution as often encountered in several physical problems (for example, a phase structure can be represented using varying unit cell definitions). The second concept provides a meaningful structure to the algorithm, with the child leaf searching within a narrower region of that of the parent node. Lastly, to improve the quality of the playouts, especially in the case of high dimensional search space, the random simulations were biased to sample those regions that were closer to the parent leaf.
We deploy our approach to a representative high-dimensional and continuous parameter search that involves navigating through high-dimensional potential energy surfaces [13] of elemental nanoclusters and bulk systems. Historically, this has represented a major challenge for molecular modeling and has been accomplished using human intuition and expertise, requiring years of painstaking efforts. Recently, a variety of global/local optimization methods have emerged for this task [14,15], but they either have convergence issues, do not scale well with the search dimensionality, or cannot incorporate important gradient-free knowledge (e.g., dynamic stability). Over several decades, these approaches have been used to develop a large number of multi-parameter physics-based models, mainly for bulk systems and their static/dynamical properties. Their extrapolation to capture nanoscale properties and dynamics, however, shows strong deviation from the ground truth (estimated using high-fidelity first-principles models, such as density functional theory). Here, we demonstrate the efficacy of our continuous action MCTS (c-MCTS) by developing a hybrid bond-order potential (an 18 dimensional parameter space) for 54 elements chosen across the Periodic table, capturing a variety of bonding environments and demonstrating the generality, efficiency and robustness of our approach. For each element, we train by fitting energies of thousands of carefully sampled (see Methods) nanoclusters of varying sizes, which are particularly known for their complex chemistry [16,17] and are difficult to train using traditional optimization strategies. Our ML trained bond-order potentials show significant performance improvement over current physics-based potential models in terms of energies, atomic forces and dynamic stability, and generalize well for dynamical properties that were not included during model training.  Figure 1: Schematic of the continuous action MCTS algorithm applied for exploration of highdimensional potential energy surfaces a) Top: Simplistic representation of an objective landscape for a 2-parameter search problem. In-plane axes corresponds to (two) independent model parameters. Outof-plane axis corresponds to objective values, which is defined as the weighted sum of the error in model predicted energies of clusters with respect to target energies. This objective is minimized by our c-MCTS algorithm. The spheres represent candidates of different model parameters within a MCTS run, where differences in their vertical positions indicate differences in their objective values. Bottom: Slightly tilted view of the above with the surface represented as a contour map below the spheres. The numbering on the spheres corresponds to their node positions in the MCTS tree shown in (b). These numbers roughly correspond to the order that the candidates are explored. b) Schematic showing the root, parent, child nodes and their relationship within a MCTS tree structure. A typical MCTS search involves node selection, expansion, simulation (playout), and back-propagation. Different coloring of the nodes indicates different depth in the MCTS tree. The algorithm balances between exploration (lateral expansion of nodes) and exploitation (depth expansion of nodes). As shown in (a), the objective value of a MCTS run is expected to decrease quickly along the depth of the tree. c) Search space of a traditional MCTS algorithm, e.g., game board, is discrete. In the context of parameter optimization, for two discrete parameters each of 19 possible values, the search space consists of a finite 361 search positions. d) The problem of parameter search, such as the objective surface illustrated in (a), generally involves parameters that are continuous, which corresponds to infinite possible search positions. We handle this challenge by applying a range-funneling technique to the MCTS algorithm where the search neighborhood at each tree depth becomes smaller and smaller such that the algorithm can converges to the optimal solutions.

Learning in Continuous Action Space
MCTS is a powerful algorithm for planning, optimization and learning tasks owing to its generality, simplicity, low computational requirements, and a theoretical bound on exploration-vs-exploitation trade-off [18,10,12]. As illustrated in Figure 1b, it utilizes a tree-structure for the parameter search consisting of four key stages: (1) selection: based on a tree policy select the leaf node that has the highest current score; (2) expansion: add a child node to the selected leaf after taking a possible (unexplored) action; (3) simulation: from the selected node, perform Monte Carlo trials of possible actions using a playout policy to estimate the associated expected reward; (4) back-propagation: pass the rewards generated by the simulated episodes to update the scores of all the parent leafs encountered while moving up the tree. A popular tree policy to use is the upper confidence bound for parameters (UCP) [19]: where θ j represents the node j in the MCTS structure, r denotes the reward for a given playout, c(> 0) is the exploration constant, n i is the number of playout samples taken by node θ j and all of its child nodes, and N i is the same value as n i except for the parent node of θ j . (f (θ j ) is the uniqueness function specifically introduced in this work and is equal to 1 in traditional MCTS settings.) This policy tries to balance the search between those nodes which have either returned the maximum reward (left term) or have not been explored enough (right term). In contrast, the playout policy selects random actions (from a node) until the simulated episode is over. Several issues deter the use of the traditional MCTS formulation for continuous action space. First, from any parent node, the number of possible child nodes is infinite as the action space is continuous. Further, for problems involving parameter search or optimization, the same set of actions are possible irrespective of the current depth of the tree, thus, rendering the tree-structure meaningless. Second, the search space of several physical problems display inherent symmetries, which are not properly accounted for in the traditional MCTS. This leads to poor search efficiency, especially in high-dimensional problems, or sometimes unwanted convergence of different MCTS branches to the same solution. Lastly, performing random actions during the playout policy may be useful to learn expected returns for game playing, but do not translate well in highdimensional continuous action space due to the well-known "curse of dimensionality" [20] (see Supplementary Information Section 1).
We introduce three concepts to enable MCTS operations for continuous actions space, namely, the uniqueness function, a tree-depth based adaptive action space, and a dimensionality-dependent playout policy. First, the uniqueness function f (θ j ) for a node θ j returns a value between 0 and 1, with a higher value indicating the uniqueness of this node with respect to all the other nodes in the tree. This not only helps avoid degenerate actions, but also promotes explorations of those actions that are dissimilar to the previous ones, thereby efficiently sampling a larger search region. This is among the most essential feature of c-MCTS that enables high search efficiency. The definition of f (θ j ) used in this work (see Section 1.1 in Supplementary Information) and its performance for high-dimensional continuous space is demonstrated in Extended Figure  1 and Table 1.
Second, an adaptive action space that depends on the tree-depth was included to provide better childparent correlation within the MCTS tree-structure. As illustrated in Figure 1d, in this new scheme the range of possible actions from a node continually decreases as we go down the tree-depth. This ensures that the child node is based on actions that are within the search scope of its parent node. Further, this allows the MCTS algorithm to incrementally refine its search space; larger scans are made in the initial phase of the search, followed by more focused search in the identified interesting regions. This also restores a meaningful correlation between the parent and the child node, and allows the algorithm to converge to a reasonable optimal solution as one moves down the tree. More details on the appropriate choice of the window scaling parameter with tree depth is provided in Supplementary Information Section 1.
Third, we introduce an efficient MCTS playout policy for high-dimensional continuous action space. Traditional MCTS relies on playout policy with random moves, but when the search space is intractable such a policy can make the learning process extremely difficult and inefficient. [21,22]. For example, even with the tree-depth based scaling of action space, a random playout (assuming uniform probability distribution) will have a high change of action selection which is far from the parent node; see Supplementary Information b.

a.
(meV/Å) Figure 2: HyBOP model performance after high-throughput exploration of the potential energy surface of nanoclusters using continuous action MCTS. A bond order model with Tersoff-type formalism for 54 different elements across the Periodic table is parameterized using an extensive training dataset generated from density functional theory (DFT) computations. The training set includes nanoclusters of varying sizes and shapes (∼1500 configurations and their energies) for each of the different elements. Mean absolute error (MAE) in (a) energy and (b) force predictions of the bond order force-fields relative to the reference DFT model on a test set of ∼70,000 configurations is shown using the color bar in meV/atom and meV/Å units, respectively.
(section 1.2) for a discussion on how uniform or multivariate Gaussian probability distribution based playout policies systematically select actions that have large displacements from the parent node. To avoid this issue, we used a biased playout policy that selects action based on the probability distribution r −(d−1) where r is the distance from the parent node, and d is the dimensionality of the search space. Based on our experiments, we found this playout policy to provide significant improvement in the performance owing to better action selections.
We first demonstrate the efficacy of our approach by comparing its performance with other state-of-theart evolutionary and gradient-based approaches [23] on 25 well-known trial functions or artificial landscapes, containing deceitful local minima and a high-dimensional search space, (see Extended Figure 1 and Table 1 and Supplementary Information (Section 2.1). Our c-MCTS algorithm outperforms other approaches both in terms of the solution quality (maximum reward) and search efficiency (low computational cost). Especially for trial functions with higher dimensionality (e.g., > 50), we note that the performance of c-MCTS is clearly superior to other representative and popular optimizers, illustrating the appeal of c-MCTS for multidimensional materials design and discovery problems. We also applied c-MCTS to classic reinforcement problems with a continuous action space. As shown in Supporting Information Figure S6, c-MCTS performs on par with state-of-the-art policy gradient methods.

High-throughput Navigation of High-dimensional PES
As a representative impactful materials application, we next demonstrate the ability of our c-MCTS approach to successfully navigate through an 18-dimensional potential energy landscape (see Methods section and Supporting Information for a description of the hybrid bond-order formalism) for 54 different elements that include alkali element, alkaline-earth elements, transition element, rare-earth, metalloids and non-metals. We assess the performance of the optimized hybrid bond-order potentials (HyBOP) [24] with respect to the ground truth, i.e., density functional theory (DFT) computed energies and forces of ∼165,000 total configurations (including both low-energy near equilibrium (0.5-1.5 eV range) and high-energy non-equilibrium configurations (1.0-8.0 eV range) in our training (∼95,000) and test (∼70,000) datasets. As shown in Figures  2a and b, ML optimized HyBOP displays remarkable performance for all the elements across the Periodic table. MAE in cluster energies of Group IV-IB transition metals is ∼90 meV/atom and ∼88 meV/atom for the training and test set, respectively. Corresponding values for the post-transition metals from Group IIB (e.g. Zn, Cd) are slightly smaller at ∼60 meV/atom and ∼90 meV/atom, while elements in Group IVA (e.g. Pb) and Group VA (e.g. Bi) display MAE of ∼65 meV/atom for both the training and test. Non-metals that include C, P, S and Se show relatively higher average MAE of ∼138 meV/atom on the training and test datasets, while metalloids such as, B, Si, Ge, As, Sb and Te, have an average MAE of ∼100 meV/atom. Group IA alkali elements, including Li, Na, K, Rb and Cs, have an average MAE of ∼77 meV/atom on both the training and test datasets. Similarly, Group IIA elements from Be to Ba show an average MAE of ∼51 meV/atom. Note that the MAE, as expected, are slightly higher for highly non-equilibrium clusters compared to the near-equilibrium configurations. Overall, our newly trained HyBOP models display remarkable performance in capturing vast regions of the energy landscape, with energies sampled from near-equilibrium to highly non-equilibrium cluster configurations and with sizes from dimers to bulk like cluster configurations (>50 atoms).
We next assess the performance of our c-MCTS trained HyBOP models by comparing the forces experienced by atoms for the various sampled configurations with that computed from DFT. Note that forces were not included in the training. The high-quality of our c-MCTS trained HyBOP models is evident from the strong correlation across a large number of non-equilibrium nanoclusters (total ∼145,000 that have a wide range ∼100 eV/Å for forces) in our test data set (see supporting information). As seen in Figure 2b, the trend in MAE prediction errors on forces is similar to that seen in energies. The Group IA alkali and Group IIA alkali earth metals such as Li, Na, K, Rb, Cs, Mg, Ca, Sr and Ba have force MAE of ∼187 meV/Å. Transition metals as Cu, Ag, Au, Pd have marginally higher MAE ∼300 meV/Å compared to the alkali and alkali earth elements. Magnetic elements such as Fe, Co, Mn, Ni from transition block element have a MAE ∼450 meV/Å in force prediction. Metalloids (B, Si, Ge, As, Sb, Te) from Group IIIA and Group VIA, respectively have a MAE ∼800 meV/Å in force prediction (and MAE ∼99 meV/atom in energy prediction). The block of transition metal elements containing 25 different elements with d electrons have average MAE ∼720 meV/atom in force predictions (MAE ∼90 meV/atom in energies). Non-metals such as C, B, P, S from Group IVA and Group VA display a strong correlation with DFT forces over a wide range (∼200 eV/Å) but have a relatively higher MAE prediction error of 1300 meV/Å compared to the rest.
Notably, we find both the energy and force predictions of c-MCTS optimized HyBOP models to be significantly better than many other commonly available potential models for several of these elements. While not exhaustive, we compare the model performance for 38 different elements covering a range of existing potential types [25], such as EAM, MEAM, Tersoff, AIREBO, ADP and SW, in the Supporting Information. In comparison to the heat map shown in Figure 2, Extended Figure 2 depicts the corresponding MAE generated from some of these best performing force fields available in literature. We note that the errors in energies and forces predicted by the best existing potential are >> 220 meV/atom and >> 2252 meV/Å, respectively for the different elements shown in Extended Figure 2. This is much higher than those for our ML trained models (see Figure 2). This dramatic improvement in the performance can be attributed to two factors, first, the complex chemistry of the nanoclusters that is distinct from bulk-like behavior to which the existing potentials are primarily fit to, and second, the ability of the c-MCTS to find accurate fitting parameters as compared to human intuition. Further details on the MAE obtained for different systems using existing potentials are provided in the Supplementary Information Section 5. In addition to the energies and forces, we also evaluate bulk properties (lattice constants, cohesive energies and elastic constants) predicted by our HyBOP model for ∼ 150 different polymorphs of these 54 elements. We note that the inclusion of larger bulk-like cluster configurations (> 50 atom clusters) in the training set and longer range interactions in HyBOP ensured that bulk properties are reasonably captured by our models (see Extended Table 3-5 Figure 3: Trends in model prediction errors with structural diversity of the element. (a) SOAP fingerprint representation of clusters across different elements as projected along the first two principal components (PCs). The area covered by an element in the PC space is proportional to its structural diversity, and thus, the difficulty to train an accurate potential model. Also, the structural diversity of an element is closely related to its position in the Periodic table. Regions of higher errors mostly occur at the boundary of the PC space which are expected to belong to under-represented configurations. (b) Trend in the prediction errors for different elements when grouped across different columns or rows of the Periodic Table. A similar trend can be seen for the elemental PC area, suggesting that elemental structural diversity correlates well with the model errors. The dotted lines capture the approximate trend within a group of elements and guide the eye. (c) Correlation between the prediction error and the inverse of the smoothness of the element energy surface. (d) Elements with high prediction errors display either high structural diversity or exhibit highly corrugated energy surface. Elements with mean absolute error (MAE) >100 meV/atom have been marked in (c) and (d).

Trends in Elemental Errors and Their Origins
We next aim to understand the relationship between the chemistry and the HyBOP model performance for any underlying element. We project the complete cluster test dataset in a 2D principal component (PC) space as shown in Figure 3a. A popular fingerprinting scheme, SOAP or smooth overlap of atomic positions [26], that transforms structural arrangement of a cluster to a unique vector representation was utilized (see Methods). A larger span in the PC space by an element is characteristic of its higher structural diversity; for example, some of the well-known elements with most diverse chemical bonding, such as B, C and S, can be seen to have a much larger PC span. We find a strong correlation of this PC area with the element position in the Periodic table; moving down the columns of alkali, alkaline-earth, Group IIIA, Group IVA, Group VA and Group VIA elements, we note that the configurational diversity systematically decreases. Similarly, moving across the 3d, 4d and 5d transition rows, the spanned PC area first increases and then decreases in accordance with the number of valence d-electrons. A good match between the expected chemical diversity and the spanned PC area of an element is indicative of a well sampled and comprehensive (training and test) cluster dataset. Another notable aspect from Figure 3a is that many of the clusters with high prediction errors lie at the boundary of the PC space which are expected to belong to under-represented regions of the potential energy surface. An improvement in the model performance is thus expected if more clusters from such regions are included in the training dataset. When grouped according to the position in the Periodic table, the errors in the model energy prediction roughly evolve as per the element chemistry (see Figure  3b). Moving down the columns of alkali, alkaline-earth, Group IIIA, Group IVA, Group VA and Group VIA elements the model prediction errors systematically decrease (within certain approximation), while moving across the 3d, 4d and 5d transition rows, such errors first increase and then decrease. These trends are consistent with the above discussion on configurational diversity, as captured by the spanned PC area. To further quantify this correlation, the area in the 2D PC space was computed using a convex hull construction, and can be seen to match well with the errors in model predictions ( Figure 3b)-with the notable exception of Mo, which had a relatively high model error than expected from the respective configurational diversity value. We argue that developing potentials for elements that show large configurational diversity, or large PC area, should be more difficult. This is because finding a unique set of the HyBOP parameters that capture all such high energy regions of the energy surface is non-trivial.
Another aspect that can make potential learning more difficult is the smoothness of the energy surface. This implies that small perturbations in an element's configuration leads to energy changes with high variance. Developing a potential for such a system would be difficult as the underlying energy surface would be highly corrugated with strong variations in energy depending on the direction of the perturbation, or could contain multiple local minima. Based on the principle of a variogram [27], the variance in energy changes for an element was computed by measuring the corresponding SOAP fingerprint distances between different, but similar, configurations. As shown in Figure 3c, elements with large prediction errors also displayed high variance in the energy changes, and vice versa. Combining both the above aspects in Figure 3d, it can be seen that models with high prediction errors belong to cases with either high structural diversity or highly corrugated energy surface; almost all elements with MAE >100 meV/atom lie in the upper right corner of Figure 3d. Therefore, this analysis not only explains the observed trends in the model prediction errors across the Periodic table, but also provides confidence that the developed c-MCTS search successfully found highly optimal HyBOP parameters in all cases.

Dynamic Stability of Clusters to bulk
Finally, we perform a rigorous test of our c-MCTS-optimized HyBOP potentials by evaluating the dynamic stability of the 54 elemental nanoclusters with different topologies and over a broad size range (5 to 50 atoms) at different temperatures. We perform MD simulations and analyze the dynamical stability for most clusters using the mean square deviation (MSD) of the atoms during a 1 ns trajectory. A cluster is considered to be dynamically unstable if the MSD is greater than 2Å; for a few special configurations, such as rings, visual inspection of energy or configuration trajectory was used to evaluate dynamical stability as system rotations lead to artificially higher values of MSD. We find that the c-MCTS optimized models capture the dynamical stability across > 40,000 clusters tested, with a few representative trajectories shown in Figure 4. Detailed trajectories for each element are included in the Supporting Information. This rigorous validation provides confidence for the future use of the developed potential models as well as the c-MCTS framework for exploring various structural and dynamical properties of nanoscale systems.
In summary, we build on the powerful ideas of reinforcement learning and decision trees to develop an effective search algorithm in high-dimensional continuous action space (c-MCTS). Our algorithm extends Monte Carlo tree search to continuous action spaces with three novel concepts (uniqueness criteria, window scaling and adaptive sampling) to accelerate the search. c-MCTS broadly outperforms state-of-the-art metaheuristic and other optimization methods. We applied this method to develop accurate bond-order potentials (with 18 dimensional search space) for 54 elements across the Periodic table, a rather non-trivial feat which would have taken years of efforts with the traditional approach. On the one hand, the developed potentials will be of use to the materials simulation community owing to their accuracy in capturing energy and atomic forces across large configuration space, making them attractive in the field of catalysis, especially for problems involving single-atom-catalysts [28,29]that form localized active sites. On the other hand, the developed c-MCTS will be beneficial in solving grand challenges in materials discovery that often involve search in a continuous space.

Code Availability
The c-MCTS algorithm is implemented in the BLAST framework developed by the authors. The codes, scripts and the framework are available from the authors upon reasonable request.

Data Availability
The data that support the findings of this study are available from the authors upon reasonable request.

Author Contribution
SM, RB and SKRS conceived and designed the project. TDL, and RB contributed equally. TDL developed the RL framework for continuous action space problems with input from SM, HC and SKRS. TDL evaluated the performance of the high-dimensional trial functions. SM deployed the RL c-MCTS algorithm and trained the bond order potential functions for the various elements across the periodic Table. All the first-principles datasets were generated by SM. Validation of the force-fields were carried out by SM and SB. RB developed an ML framework to carry out the analysis of the error trends in the potential functions across the different elemental systems. SB performed the high-throughput simulations to evaluate the dynamical stability of all the different elemental clusters. HC integrated the c-MCTS algorithm into the BLAST multi-fidelity framework. TDL carried out validation of the c-MCTS on training high-dimensional neural networks. All the authors contributed to the data analysis and to the preparation of the manuscript. SM, RB and SKRS wrote the manuscript with input from all the co-authors. SKRS supervised and directed the overall project.

Generation of Diverse Cluster Configurations
The accuracy and transferability of a potential model or force field depends on the quality of training and test data sets. In this study, our training and test dataset consist of ∼ 95,000 and ∼70,000 clusters configurations and their energies for 54 different elements across the Periodic table, respectively. The sampled configurations contain (a) the ground state atomic structures (b) atomic structures near equilibrium and (c) structures far from equilibrium. A total of ∼ 19,000 ground state cluster configurations were evenly included amongst the training and test dataset. As a result, the DFT computed energies of sampled configuration generates a broad spectrum of energy window (ranging from ∼ 1 eV/atom, for alkali and alkali earth metals to ∼ 14 eV/atom, for non-metals) in the training and test data set. A continuous energy window from low to high for the sampled configurations ensures that the energetics and dynamics of clusters configurations starting from equilibrium to far-from-equilibrium are adequately represented. Ground state clusters configurations are mined from different literature [30] as well as from our own calculations using standard atomic structure prediction method (such as genetic algorithm [31,32]). We used Boltzmann-based Metropolis sampling and a nested ensemble-based [33,34,35] approach to generate the cluster configurations that have energies near equilibrium and far from equilibrium. The combination of these two techniques generates cluster configurations of continuous range of energies from high to low. From the size perspective, the cluster configurations sampled consist of dimers to bulk-like configurations i.e., clusters containing atoms more than 50 atoms. Details on the various sampling approaches to generate cluster configurations are provided in Section 5 of Supplementary Information.

Reference Energy and Force Computations for Cluster Configurations
The energies of all the cluster configurations were computed using density functional theory (DFT) as implemented in the Vienna Ab initio Simulation Package (VASP) [36]. The low energy cluster configurations are further relaxed using conjugate gradient approach [37] whereas in the case of non-equilibrium cluster we perform static DFT calculation to compute their energies and forces. To avoid interactions between two periodic images we assign the box length such that the distance between two periodic images is at least 15 A or larger. The DFT calculations were performed using the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) exchange correlation functional, and the projected-augmented wave (PAW) pseudo-potentials [38]. The details of pseudo-potentials used in our calculations are provided in the Table 2 in Supplementary Information Section 4. Spin polarized calculations with the Brillioun zone sampled only at the Γ-point were performed. To handle errors that may arise during the structural relaxation or static DFT calculations, our high throughput workflow used an in-house python wrapper [39] around VASP along with a robust set of error handling tools.

Learning of Trial Functions
We performed a rigorous test using a series of trial functions to demonstrate the suitability of our c-MCTS approach. We use several popular test functions [40] which are designed to represent a wide variety of different continuous action surfaces. These surfaces include many local minima (Ackley, Buckley, Ragstrigin, etc.), misleading or non-existant gradients (Hyper Plane, Dejong Step Function), or other functions that are designed to mimic common optimization traps that one may encounter. An advantage of these functions is that they are computationally quick to evaluate and have known solutions. In addition, several of them can be scaled up to an arbitrary number of dimensions. As such, these trial functions provide a series of comprehensive first order tests to evaluate a search algorithm. These functions have implementations in a wide variety of programming languages such as Matlab, Python, R to name a few. We compare the performance of c-MCTS with several other global/local optimizers including evolutionary algorithms such as particle swarm, bayesian optimization and random sampling. For the particle swarm the PySwarm optimization package [41] was used with 50 particles per trial and the hyper parameters (c1, c2, w) were adjusted on a per trial basis starting from a default of (0.5, 0.3, 0.9). For the Bayesian based optimization the HyperOpt Python package [42] with the 'TPE. Suggest' option was used as the trial selection method.
The performance of the various optimizers on several different trial functions is listed in Extended Table 1 and the convergence to solution for representative cases is shown in Extended Figure 1.

Model Selection -Hybrid Bond Order Potential
We navigate a 18-dimensional potential energy surface represented by a hybrid bond order potential (HyBOP) expressed as which utilizes a Tersoff formalism [43] for describing short range directional interaction (V SR ) and a scaled Lennard-Jones (LJ) function for long range interaction (V LR ). The short range pair potential function V pair is described by where f C (r ij ), f R (r ij ), and f A (r ij ) are the cutoff, repulsive, and attractive pair interactions, respectively, between bead i and j separated by a distance r ij , and b ij is a bond-order parameter which modifies the pair interaction strength between bead i and j depending on their local chemical environment. The long-range interactions V LR uses a scaled Lennard-Jones (LJ) function which is given by, where, ij and σ ij are LJ parameters for a pair of atoms i and j that are a distance r ij apart. f s (M i ) is a scaling function that describes the dependence of LR contribution from a given atomic pair i − j on the number of atoms within a prescribed radial distance R LR c from the atom i. Details of the potential model are provided in the Supplementary Information Section 6. Owing to its flexibility in describing various bonding characteristics as well as to facilitate comparisons, we retain the HyBOP formalism across all the elements in the Periodic Table. Learning of High-dimensional Potential Energy Surface To navigate the high-dimensional continuous parameter space, we developed a workflow that interfaces the MCTS algorithm with a molecular simulator, large-scale atomic/molecular massively parallel simulator (LAMMPS) [44]. While MCTS is used to navigate the high-dimensional HyBOP parameters space, LAMMPS is used to evaluate the performance of a given input set of HyBOP parameters by computing the mean absolute error (MAE) of energies for the clusters in the training data set. Each node of the MCTS search represents a particular set of HyBOP parameters, with the reward r in Eq. 1 evaluated as the weighted sum of MAE error in the cluster energies as discussed in Supporting Information Section 7.1. MCTS is a decision tree-based approach (comprising of selection, expansion, simulation and back-propagation) that builds a shallow tree of nodes where each node represents a point in the search space and downstream pathways are generated by a playout procedure. The algorithm simultaneously explores potentially better pathways to reach the optimal point in a search space and exploits a subset of pathways that have the greatest estimate values of the search function. This combination of exploration vs exploitation and an appropriate trade-off mechanism between them are found to be the most efficient strategy of identifying optimal point for a given function. The details of the c-MCTS approach are provided in the Supporting Information.

Fingerprinting and Principal Component Analysis
To analyze the structural diversity of the cluster dataset across the different elements, we utilized the smooth overlap of atomic positions (SOAP) fingerprinting method [26] as implemented in the DScribe python library [45]. SOAP encodes the atomic neighbourhood around a spatial point (or an atom) using the local expansion of a Gaussian smeared atomic density with orthonormal basis functions composed of spherical harmonics and radial basis functions. The advantage of using the SOAP fingerprint is that it is invariant to translation, rotation and permutations of alike atoms of a configuration, and forms the basis for development of several successful ML-based inter-atomic potentials. The following parameter settings were used for the SOAP fingerprinting: rcut= 6Å, nmax= 6 and lmax= 4, where rcut is the cut-off radius for the atomic neighborhood around the concerned atom, namx is the number of radial basis functions (spherical Gaussian type orbitals) and lmax is the maximum degree of spherical harmonics. This resulted in a SOAP fingerprint vector for each atom, which was averaged using the "inner" averaging scheme (average over atomic sites before summing up the magnetic quantum numbers) to obtain a 105-dimensional configuration fingerprint for each cluster. Principal component analysis (PCA) was performed using the (standard) normalized SOAP fingerprint representation of the entire cluster dataset. The variance in energy changes as a function of SOAP fingerprint distances (cosine) was computed using the open-source Scikit-GStat Python library [27].

Extended Data
Extended Table 1: Performance comparison among c-MCTS, Bayesian, particle swarm, and random sampling methods to find optimal solutions in 25 different trial functions. Two performance metrics were chosen: the number of iterations to reach the target solution and the quality of the best solution obtained from the optimizer. Results up to a maximum of 30000 iterations are provided. D denotes the dimension of the trial function, while the tolerance indicates the accepted difference from the exact optimal solution of the trial function. Bold font highlight the winning solution in each row.

Ackley (50D)
Objective Score Extended Figure 2: Performance of the best available force fields collected from literature on predicting energies and forces of different elemental cluster configurations. For better error comparison with Figure 2, consistent error range and color scheme are used. Elements for which comparison studies were omitted are colored grey. More rigorous performance evaluations (for both energies and forces) covering a wider range of popular force fields can be found in Section 5 in the Supplemental Information.
Extended Table 2: c-MCTS optimized parameters for all 54 elements based on the Hybrid Bond-Order Potential formalism. While all 19 parameters are included here, we note that the parameter m was kept constant during the c-MCTS.