Machine learning applied to simulations of collisions between rotating, differentiated planets

In the late stages of terrestrial planet formation, pairwise collisions between planetary-sized bodies act as the fundamental agent of planet growth. These collisions can lead to either growth or disruption of the bodies involved and are largely responsible for shaping the final characteristics of the planets. Despite their critical role in planet formation, an accurate treatment of collisions has yet to be realized. While semi-analytic methods have been proposed, they remain limited to a narrow set of post-impact properties and have only achieved relatively low accuracies. However, the rise of machine learning and access to increased computing power have enabled novel data-driven approaches. In this work, we show that data-driven emulation techniques are capable of predicting the outcome of collisions with high accuracy and are generalizable to any quantifiable post-impact quantity. In particular, we focus on the dataset requirements, training pipeline, and regression performance for four distinct data-driven techniques from machine learning (ensemble methods and neural networks) and uncertainty quantification (Gaussian processes and polynomial chaos expansion). We compare these methods to existing analytic and semi-analytic methods. Such data-driven emulators are poised to replace the methods currently used in N-body simulations. This work is based on a new set of 10,700 SPH simulations of pairwise collisions between rotating, differentiated bodies at all possible mutual orientations.


Introduction
Pairwise collisions between planetary-size bodies are the primary agent of planet growth during the late stages of planet formation [1]. These collisions-often called "giant impacts"-are violent events that result in either growth or disruption of the colliding bodies [2,3]. Collisions shape nearly every aspect of a planet's final characteristics, including its composition, thermal budget, rotation rate, and obliquity. Collisions can also determine whether a planet will retain an atmosphere, form satellites, or ultimately be hospitable to life. In addition to their role in planet formation, giant impacts have been suggested as explanations for a number of persisting mysteries in our own solar system, including the origin of Earth's Moon [4,5], Mercury's large core [6,7], Uranus' sideways tilt [8], the martian hemispheric dichotomy [9], the ice giant dichotomy [10], Jupiter's fuzzy core [11], and the Pluto-Charon system [12].
Collisions play a central role in N-body studies of planet formation. Since the first N-body simulations were performed in the 1960s [13], the underlying numerical schemes have improved in leaps and bounds. Collisional N-body codes now routinely include 10 3 -10 5 massive particles, [1] as well as general relativistic effects, gas dynamics [15,16], and the effect of external perturbations [17]. However, despite these advances, the methodology for handling collisions between bodies has remained frustratingly primitive. Within Nbody codes, a range of techniques for handling collisions can be employed. In the simplest, physically selfconsistent case, collisions can be treated as perfectly inelastic mergers (PIM), whereby mass and momentum are conserved, but no fragmentation is possible. While efficient and easy to implement, the downside of PIM is that the outcomes are unphysical for all but a narrow subset of low-energy collisions. Despite its shortcomings, this is the technique that has been em- [1] Collisionless N-body codes are capable of simulating orders-of-magnitude larger numbers of particles, having recently reached 2 × 10 12 particles [14] arXiv:2001.09542v1 [astro-ph.EP] 27 Jan 2020 ployed in the vast majority of N-body simulations to date.
At the other end of the spectrum, an ideal approach would be to simulate every collision using an accurate, high-resolution hydrodynamics code [18]. Unfortunately, such a hybrid approach is computationally prohibitive and adds significant complexity to the simulation. Moreover, because collisions must be evaluated sequentially in order to preserve self-consistency, the N-body integrator must remain idle while each collision is evaluated. This substantially increases the time required to complete a single N-body simulation. The problem is compounded by the fact that, during a typical simulation of late-stage planet formation, the number of collisions can easily reach tens of thousands. This is a problem that will only grow more intractable as N-body codes improve and computing power increases, enabling ever larger numbers of bodies-and thus collisions-within N-body simulations.
In between these two extremes, a number of semianalytic models have been developed in an effort to improve how collisions are handled within N-body simulations. These semi-analytic models are derived from collision simulation datasets of varying size and complexity [19,2,20]. The current state-of-the-art approach is the model known as EDACM [2], which is a set of analytic relations derived from simulations of pairwise collisions between gravitational aggregates (i.e., rubble piles) [19]. Whereas PIM is only able to predict limited properties of the largest (and only) remnant, EDACM allows for fragmentation (outcomes with more than one remnant) and is therefore able to predict the properties of a second post-impact remnant and debris. Since its inception, EDACM has been implemented into the N-body codes Mercury [21,22] and pkdgrav [23,24] and used in several notable studies of terrestrial planet formation [25,26]. A simpler, but more recent semi-analytic approach is the impacterosion model (IEM) for gravity-dominated planetesimals [20]. IEM predicts only the debris mass and the mass of a single remnant. These models are a marked improvement, but the downside of such semi-analytic methods is that they are difficult to generalize beyond a narrow set of parameters and have in practice been able to achieve only modest accuracies, in some cases performing worse than PIM (see Table 5).
In recent years, the rise of machine learning and access to increasing computing power have enabled new data-driven approaches. Now, with sufficiently large datasets, surrogate models known as emulators can be trained to predict the outcome of collisions "onthe-fly" (i.e., within N-body simulations) [27]. These emulators are lightweight enough to be integrated directly into existing N-body codes [28]. In this paper, we show that they can far outperform existing analytic and semi-analytic methods. Nascent efforts to emulate collision outcomes have explored artificial neural networks (ANN) [27,29]. These studies have shown that simple ANNs can achieve high accuracies on relatively small datasets (N = 800).
Machine learning techniques generally rely on the availability of large and well-sampled training datasets. Until recently, simulating such large collision datasets was computationally infeasible. However, computational fluid dynamics (CFD) algorithms and computing resources have advanced to the point where these datasets are now realizable. At the same time, recent improvements in CFD have opened the door to new dimensions in the collision parameter space. Collisions can now be simulated between differentiated bodies, rotating bodies, and bodies with arbitrary mutual orientations. In order to effectively sample these additional dimensions, even larger datasets are needed.
In this work we introduce a new dataset of 10,700 simulations of pairwise collisions between differentiated, rotating bodies. This dataset is larger and better sampled than any previous dataset and includes effects not accounted for in similar studies, including the effects of pre-impact rotation and variable core mass fractions. These simulations were evaluated for an unprecedented number of post-impact parameters; in this work we investigate a subset of those parameters that are relevant to N-body studies of terrestrial planet formation.
In order to determine which numerical strategies are best suited to emulating collisions, we developed a flexible and robust machine-learning pipeline to train, optimize, and validate models from different data-driven methodologies, including techniques from the field of uncertainty quantification (UQ) and machine learning (ML). In addition, the techniques were tested on a range of training dataset sizes, in order to provide constraints on dataset requirements for future studies.
The need to improve collision handling in N-body studies has often been dismissed in the literature, motivated by studies which have shown that the final number, masses, and orbital elements are barely affected by the collision method [30]. However, a number of more recent studies have overturned those conclusions. Indeed, studies with accurate collision handling have obtained profoundly different planetary system architectures, with a wider range of planetary masses and enhanced compositional diversity [28]. Moreover, N-body simulations allowing for fragmentation have shown that roughly half of collisions occurring during planet formation are disruptive [30] and, even within the non-disruptive regime, the effect of erosive collisions on planet growth has likely been underestimated or neglected [31,32]. Studies have also shown that the growth timescale of planets depends strongly on the collision model, in some cases increasing the growth timescale of the planets by a factor of two [26]. This has massive implications for the internal and atmospheric evolution of planets [33], their subsequent habitability, the formation of satellites [34], and even the likelihood of detecting giant impacts around other stars [35].
We begin in §2 by describing the collision datasets that we generated and how each collision was set up, simulated, and analyzed. In §3, we give an overview of the emulation strategies used in this work and how they were evaluated. In §4 we report on the performance of the emulators, their dependence on dataset size, and the associated sensitivity metrics. Finally, in §5, we discuss which techniques are best suited to emulating planetary-scale collisions, their relative ease (or complexity) of implementation, and where future work remains to be done.

Methods
In order to train, test, and compare emulation strategies, a large number of collision simulations was required. We therefore simulated 10,000 collisions to serve as a training dataset (12D LHS10K), 500 collisions as an independent test dataset (12D LHS500), and 200 collisions to study the convergence of the postimpact parameters (12D LHS200). Every collision in these datasets is uniquely defined by 12 pre-impact parameters ( §2.1.1). The large number of dimensions in the parameter space necessitated an efficient sampling strategy, for which we employed Latin hypercube sampling (LHS) and the adaptive response surface method (ARSM) ( §2.1.2). 21,400 unique planet models had to be generated to serve as either a target or projectile in the collisions ( §2.1.3). These models were spun-up to their pre-impact rotation rates using a novel approach that we developed for this work ( §2.1.4). Collisions were simulated using smoothed-particle hydrodynamics (SPH) ( §2.1.5) and were subsequently evaluated for more than a hundred post-impact parameters ( §2.1.6). These post-impact parameters were tested for convergence ( §2.1.7) and a subset of these parameters was chosen to be investigated in this work on account of their relevance to N-body studies of terrestrial planet formation (Table 3).

Pre-impact conditions
Each collision is uniquely defined by 12 pre-impact parameters (Table 1). Together, these parameters define the geometry of the impact and the physical and rotational characteristics of the bodies involved in the collision. This set of parameters allows us to investigate the role of collisions in terrestrial planet formation, critically including the role of core mass fraction, rotation, and mutual orientation. The ranges of these parameters were chosen with two constraints in mind. First, the datasets should be focused on terrestrial planet formation. Second, and foremost for this work, the datasets should allow for a fair and robust comparison between distinct emulation strategies. Table 1 Pre-impact parameters. Each collision in the dataset is uniquely defined by a set of 12 parameters. These parameters define the geometry of the collision and the physical characteristics, rotations, and orientations of the bodies involved in the collision. The subscripts ∞, targ, and proj refer to the asymptotic, target, and projectile values, respectively.

Parameter Range
Unit Description In order to satisfy the first constraint, we simulated collisions with total masses (M tot ) between 0.1 − 2 Earth masses. The ratio of projectile mass to target mass (γ) was allowed to range from 0.1 up to equalmass collisions (γ = 1). The resulting models range in mass from roughly a lunar mass up to nearly twice that of Earth.
The bodies involved in the collisions-referred to in this work as the target and projectile-are fully differentiated planets composed of an iron core and granite mantle. The mass fraction of the core relative to the body's total mass is defined by F core body , where the body subscript can refer to the target, projectile, largest post-impact remnant (LR), or second largest post-impact remnant (SLR). The core mass fractions of the target and projectile range from 0.1 − 0.9 (i.e., iron cores ranging from 10 − 90% by mass).
The target and projectile in the collisions are rotating. The rotation rates range from non-rotating to rotation at the theoretical breakup rate (Ω crit ). The theoretical breakup rate is calculated according to Maclaurin's formula for a self-gravitating fluid body of uniform density, where G is the gravitational constant and ρ is the bulk density of the body [36]. Here, we calculate the bulk density of the body by using the mass and radius of the non-rotating model. Because the Maclaurin formula assumes a uniform density, the theoretical breakup rate is more accurate for lower mass bodies. For highmass bodies, where the density profile strongly deviates from uniformity, the theoretical breakup rate will be a lower bound. While the Maclaurin formula is a somewhat blunt approximation, it serves as a good estimate of the permissible rotation rates. The orientations of the target and projectile are uniquely defined by the obliquity (θ) and azimuth (φ) of their angular momentum vectors (i.e., rotation axes). These angles are allowed to vary between 0−180 • and 0−360 • , respectively, where the obliquity is measured relative to the unit vector normal to the collision plane (ẑ) and the azimuth relative to a predefined reference direction (ŷ) in the collision plane. This allows for every possible mutual orientation between the target and projectile prior to impact.
In defining the pre-impact geometry of the collision, we depart from previous work by specifying the asymptotic impact parameter (b ∞ ) and asymptotic relative velocity (v ∞ ). In contrast, previous studies have generally used the associated quantities at the moment of impact (b imp and v imp , respectively). However, this latter parameterization can result in unphysical initial conditions. Indeed, prior to impact, the mutual gravitational interaction between the target and projectile can alter their shapes, rotation rates, and relative orientations. This also alters the pre-impact trajectory and subsequent collision. This is due to the fact that both the target and projectile act as reservoirs of energy, whereby some fraction of the orbital energy in the pre-impact trajectory is transferred into the tidal deformation and rotational energy of the bodies. The simulations in this work therefore begin with the target and projectile separated by 10 critical radii, where the critical radius is given by R crit = R targ + R proj . Note that we use the non-rotating radii of the target and projectile in calculating the critical radius. Rapidly rotating bodies can take on significantly oblate shapes, increasing their radii and making a clear definition of the critical radius problematic when the orientations are taken into account.
The parameter space investigated in this work is larger and better sampled than any extant collision dataset known to the authors at the time of writing. Nonetheless, the parameter space is limited by computational resources and the sampling requirements. It therefore does not include the full range of collisions relevant to planet formation, but does serve as a good training, test, and validation space for the emulators in this work.

Sampling strategy
In order to make a robust comparison between different emulation strategies, the underlying datasets must be well-sampled and well-behaved. However, generating a well-sampled training dataset in a 12-dimensional parameter space is not a trivial task. The large number of dimensions quickly renders many approaches computationally infeasible. Indeed, a uniform grid sample would require n d simulations, where d is the number of dimensions and n is the desired number of samples in each dimension. A low resolution 12-dimensional dataset with 10 samples in each dimension would then require 10 12 simulations, which is roughly eight orders of magnitude beyond current practical computational limits.
In order to overcome this problem while maintaining flexibility in the dataset requirements, we used a Latin hypercube sample (LHS) based version of the adaptive response surface method (LHS-ARSM) in order to sample a series of Latin hypercube samples [37]. Latin hypercube sampling is a statistical method for generating a near-random sample of parameter values from a d-dimensional distribution [38]. LHS works on a function of d parameters by dividing each parameter into n equally probable intervals. The samples generated in this fashion are then distributed such that there is only one sample in each axis-aligned hyperplane. The advantage of this scheme is that it does not require additional samples for additional dimensions. LHS techniques have been used to considerable success in other high-dimensional astrophysical applications [39].
In this study, the training dataset sizes required to reach optimal accuracies were not known a priori. Therefore, a procedure was needed to expand an existing dataset while maintaining certain properties, such as Latin hypercube, space-filling, and stratification properties. LHS-ARSM achieves this by sequentially generating sample points while preserving these distributional properties as the sample size grows. Unlike LHS, LHS-ARSM generates a series of smaller sub-sets that exhibit the following properties: the first subset is a Latin hypercube, the progressive union of subsets remains a LHS (and achieves maximum stratification in any one-dimensional projection), and the entire sample set at any time is a Latin hypercube. Benchmarking tests show that LHS-ARSM leads to improved efficiency of sampling-based analyses over older versions of ARSM [37].
For the training dataset (12D LHS10K), we generated an initial LHS of 1,000 collisions using the standard maximin distance criterion in order to guarantee space-filling properties. We then used LHS-ARSM to progressively enrich the sample by blocks of 1,000 collisions until we reached a total sample size of 10,000. We separately generated a LHS sample of 500 collisions to serve as an independent test dataset, designated 12D LHS500. An LHS of 200 collisions was also generated, designated 12D LHS200, which we used to study the convergence of the post-impact parameters. All datasets were generated using identical parameter ranges ( Table 1). The datasets used in this work are summarized in Table 2. Table 2 Summary of the collision datasets in this work. Each simulation requires two unique models for the target and projectile. In this work, the 12D LHS10K dataset was used for training, 12D LHS500 dataset as an independent validation set, and 12D LHS200 to study the convergence of the post-impact parameters. Note that the sample type of the 12D LHS10K dataset is an LHS-based ARSM.

Generating planet models
The collisions in this work are pairwise collisions between a target and projectile, where the target is the more massive of the two bodies. In order to simulate collisions between these bodies using a particle-based method such as SPH, we had to first create suitable particle representations (i.e., models) of each body. We used ballic [40] to generate non-rotating, low-noise particle representations of each body. The ballic code solves the equilibrium internal structure equations using the Tillotson equation of state (EOS) and can generate models with distinct compositional layers. In this work we investigated fully differentiated two-layer bodies with iron cores and granite mantles.

Pre-impact rotation
In order to facilitate collisions between rotating planets, we developed a method to induce rotation in the non-rotating models generated by ballic. The planets were first generated as non-rotating spherical models, after which a linearly increasing centrifugal force was applied to the particles in the rotating frame. The maximum centrifugal force applied to each particle is that which is required to achieve the desired rotation rate, F c = m p r xy Ω 2 , where m p is the particle mass and r xy is the particle's distance from the rotational axis. Once the maximum centrifugal force has been reached, F c is held constant and the model is allowed to relax to a low-noise state. The particles are then transformed into the non-rotating frame and allowed to relax again. This method can spin-up a body up to its critical rotation rate (and beyond if not careful) and therefore allows us to probe collisions between rotating planets at any mutual orientation. An example of a model before and after the spin-up procedure is shown in Figure 1. This represents a significant improvement over previous work, which has generally only considered collisions between non-rotating bodies.

Simulating collisions
The collisions in the datasets reported here have been simulated with Gasoline [41], a massively-parallel SPH code. The version of Gasoline used in this work has been modified specifically to handle planetary collisions and has been used in previous work to study the origin of the Moon, Mercury's large core [7], and the ice giant dichotomy [10]. These modifications are described in detail in previous papers [40,10]. Gasoline uses the Tillotson EOS [42,43], which allows us to simulate collisions between differentiated planets with iron cores and granite mantles. The simulations used in this work were simulated at the Swiss National Supercomputing Center (CSCS) and are publicly available in the Dryad repository: https://doi.org/10.5061/dryad.j6q573n94.

Post-impact analysis
Every collision was evaluated for more than a hundred post-impact properties. We focus on a subset of these properties that are likely to prove important for N-body studies of terrestrial planet formation. These properties are listed in Table 3. In particular, we focus on the properties of the LR, SLR, and the debris field. Collisions were simulated for 100τ . The collision timescale τ is equivalent to the crossing time of the encounter and is given by, where v imp is the velocity at impact (see Appendix A) and we reiterate that R crit depends on the nonrotating radii of the colliding bodies. In order to identify the post-impact LR, SLR, and debris field we used the SKID group finder [23]. SKID identifies coherent, gravitationally bound clumps of material. It does this by identifying regions which are bounded by a critical surface in the density gradient (akin to identifying watershed regions). Then it removes the most unbound particles one-by-one from the resulting structure until all particles are self-bound. This usually produces a much larger number of clumps than just the first and second largest remnant. For this reason we also need to see if these clumps are further bound to either of the first or second largest remnant, if not, they are identified as part of the debris field of the collision.
A number of the post-impact properties investigated here do not have obvious definitions and require explanation. We define or provide explanations for the postimpact parameters in Appendix A. In addition, we investigate the normalized masses to determine whether or not such normalization leads to improved regression performance for the data-driven techniques.

Convergence of post-impact parameters
We evaluated the convergence of all post-impact properties considered in this work (Table 3) using the 12D LHS200 dataset. [2] Convergence was measured relative to the post-impact quantity's value at 100τ (the value used to train the emulators). In order to quantify the convergence, we calculated the absolute relative error E at uniformly sampled intervals of τ , where y(τ ) is the value of the post-impact parameter at τ and y 100 is the value used in the training dataset. For a single post-impact quantity, this yields 200 measurements of E at each evaluated step of τ . The median of these relative errors is plotted as a function of τ in Figure 3. Most post-impact parameters have converged to within 1% of their training value by 50τ , however the radii (R), rotation rates (Ω), and debris angular momentum (J deb ) are still converging at 100τ . [2] We utilized a smaller dataset for this purpose because generating time series for each post-impact property is computationally expensive and therefore impractical for the larger datasets. However, we still wanted to ensure that we tested the convergence for the full range of collisions. Figure 2 Diversity of collision outcomes. The images above show the outcomes for a subset of the collisions in the 12D LHS200 dataset. The images are ordered by their impact geometry. From left to right, the impact parameter (b ∞ ) increases. From bottom to top, the relative velocity increases (v ∞ ). Thus, collisions near the top left are high-velocity, head-on impacts, whereas the collisions near the lower right are low-velocity, grazing collisions. Emulators must be able to accurately predict post-impact properties for a wide range of collision outcomes. The color scale here indicates log-density.
We note that the non-convergence of the radii and rotation rates is not a numerical issue in this case, but rather the result of ongoing physical processes post-impact (e.g., differentiation, thermal equilibra-tion, etc.). While we emulate these properties in the work that follows, they are currently not suitable for use within N-body simulations. Longer simulations are Table 3 Post-impact parameters. In this work we consider the following subset of post-impact parameters, focusing on the LR, SLR, and debris field. These parameters were chosen for their relevance to N-body studies of terrestrial planet formation. Detailed definitions of the post-impact parameters and how they are evaluated can be found in Appendix A.

Parameter Constraints Unit Description
Stddev azimuth required to determine the convergence timescale of these properties. In order to track the rotation of planets within Nbody simulations, a substitute for the rotation rate is needed. We investigated the convergence of the rotational angular momenta of the remnants and found that they converge quickly following the impact. Indeed, following an impact, the angular momentum is quickly partitioned between the surviving bodies and has largely converged within a few tens of τ . While the debris angular momentum does not show the same convergence, it can instead be calculated implicitly  Table 3 for the 12D LHS200 dataset. The median of the relative errors for each parameter are shown for uniformly spaced intervals of τ . Note that the radii, rotation rates, and angular momentum of the debris have not converged by 100τ and are therefore require further investigation before being used as emulated quantities in N-body simulations.
The sawtooth pattern apparent for many of the parameters arises because the parameters are oscillating around their training value.
from the angular momenta of the remnants and initial total angular momentum. We therefore suggest that N-body studies should utilize the angular momenta of the remnants to track rotation rates, rather than the rotation rates themselves.

Emulation strategies
In order to overcome the limitations of analytic and semi-analytic approaches, techniques from the field of ML have proved promising [27]. Techniques from UQ have also achieved considerable success in other areas of astrophysics investigating high-dimensional emulation [39] (hereafter we refer to ML and UQ as "datadriven" techniques). These techniques can provide accurate and efficient strategies for emulating collisions. Data-driven methods have the major advantage of being generalizable to any quantifiable post-impact property. In order to identify the emulation methods best suited to the problem at hand, we have evaluated and compared the ability of several distinct data-driven techniques to accurately predict the post-impact properties of planetary-scale collisions. These techniques are polynomial chaos expansion (PCE), Gaussian processes (GP), eXtreme Gradient Boosting (XGB), and multi-layer perceptrons (MLP). We compare these data-driven techniques to the classic analytic model, perfectly inelastic merging (PIM), and two state-of-the-art semi-analytic techniques, the impact-erosion model (IEM) [20] and EDACM [2].
In discussing the training and validation of the datadriven emulators, we adopt the terminology used in ML literature to describe the models, their parameters, and their associated input and output. In particular, we refer to the pre-impact parameters as features and the process of selecting these features as feature selection. The meta-parameters that define the architectures and numerical behavior of the models are referred to as hyperparameters, and the process of selecting an optimal set of hyperparameters is known as hyperparameter optimization (HPO). The post-impact quantities that we are attempting to predict would usually be referred to as targets in this terminology. However, in order to avoid confusion with the target body involved in the collision, we simply refer to them as post-impact quantities/properties.
The approach to collision emulation introduced here produces a set of single-target regressors (STR), each optimized for a specific post-impact property. With this strategy, the models are simpler and we can achieve optimal accuracies for each individual postimpact property. However, the drawback of decoupling the post-impact quantities from one another is that the resulting emulators will be agnostic to the underlying physical relationships and constraints between the quantities (e.g., mass conservation). It's therefore not guaranteed that the emulator predictions will be physically self-consistent. In this paper, we focus on comparing regression strategies and in a forthcoming paper, we introduce a method for imposing physical constraints and self-consistency on the emulators.

Perfectly Inelastic Merging (PIM)
PIM is an analytic method in which all collisions are treated as perfectly inelastic mergers. [3] In a perfectly inelastic merger, the masses and momenta of the colliding bodies are conserved in a single post-impact remnant. There is no net conversion of kinetic energy into other forms such as heat, noise, or potential energy during the impact. This is the simplest possible model for emulating the outcome of a pairwise collision while maintaining physical self-consistency (but not accuracy). The outcome of a perfectly inelastic merger is always a single remnant, which we refer to here as the LR for consistency. There are no additional remnants or debris. PIM can predict the mass and core mass fraction of the LR, and can additionally make naïve predictions of certain rotational parameters for the LR. PIM has been employed in the vast majority of N-body simulations to date. Details of our implementation of PIM can be found in Appendix B.

Genda et al. 2017 (IEM)
The impact-erosion model (IEM) is a semi-analytic model for gravity-dominated planetesimals [20]. IEM predicts the normalized mass of the debris (M norm deb ) as a function of the specific impact energy (Q R ) scaled to the catastrophic disruption threshold (Q RD ). The normalized mass of the debris M norm deb is expressed as, where φ = Q R /Q RD . IEM assumes that only a single remnant is produced by the collision (referred to as the LR for consistency) and therefore M norm LR can be determined via a straightforward relation, M norm For consistency, we use the same values of Q R and Q RD in the calculations of IEM and EDACM. Details of the calculation of Q R and Q RD used here and in EDACM can be found in Appendix C.

Leinhardt & Stewart 2012 (EDACM)
EDACM is a set of analytic relations that predict the masses of the LR, SLR, and debris, as well as the core [3] PIM is sometimes referred to as "perfect accretion" or "perfect merging". mass fraction of the LR [2] via a mantle stripping law [44]. In order to evaluate and compare the performance of EDACM to the other emulators developed in this work, we implemented EDACM as prescribed in Leinhardt & Stewart (2012). EDACM is generally considered the state-of-the-art approach to collision emulation and has been used in numerous N-body studies of planet formation [25,45]. Most notably, EDACM allows for collision outcomes with more than one remnant (referred to as fragmentation) and is thus capable of predicting a larger set of post-impact parameters than either PIM or IEM. We give a brief overview of EDACM in Appendix C and explain where our implementation differs from that used in previous studies.

Data-driven methods
The analytic and semi-analytic models presented in the preceding section express an explicit relationship, based on physical principles, between the pre-and post-impact parameters. IEM, for example, expresses a non-linear relationship between the normalized mass of the debris M norm deb and φ. In contrast, using the datadriven techniques that follow, we construct a mapping between the pre-impact parameters and individual post-impact quantities. These non-linear mappings are derived purely from a training dataset.

Polynomial chaos expansion (PCE)
PCE is a popular technique in the field of UQ, where it is typically used to replace a computable-butexpensive computational model with an inexpensiveto-evaluate polynomial function [46]. In this work, we use a PCE based on tensor products of Legendre polynomials [47]. Recent work has demonstrated that data-driven PCE models can yield point-wise predictions with accuracies comparable to that of other machine learning regression models (e.g., neural networks) [48]. In this work, we use UQLab [49] to train and evaluate all PCE models. The documentation for UQLab is freely available at https://www.uqlab.com/ documentation. An overview PCE as used in this work is provided in Appendix D.

Gaussian processes (GP)
GPs are a generic supervised learning method designed to solve regression and probabilistic classification problems [50]. They are a non-parametric method that finds a distribution over the possible functions f (x) that are consistent with the observed data. ML algorithms that involve a GP use a measure of the similarity between points (the kernel function) to predict a value for an unseen point from training data. The Gaussian radial basis function (RBF) kernel is commonly used, however in this work we test multiple kernels, including the constant, Matérn (ν = 3/2), rational quadratic, and RBF kernels (see Table 4).
A potential downside of GPs is that they are not sparse (i.e., they use all of the sample and features information to perform the prediction) and they lose efficiency in high dimensional spaces [50]. While our 12-dimensional space is relatively small for GPs, the number of training examples is much larger than that for which GPs are generally employed. More advanced algorithms have been suggested to improve the scaling of GPs, such as bagging and enforced sparsity, but we have not attempted to implement these here. A brief mathematical introduction to GPs is provided in Appendix E.

eXtreme Gradient Boosting (XGB)
XGBoost (XGB) is an open-source decision-tree-based ensemble ML algorithm that uses a gradient boosting framework [51]. It has become one of the most popular ML techniques in the previous years and is well documented. Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an additive expansion of simple parameterized functions h (typically called weak or base learners) [52]. These base learners are usually simple classification and regression trees (CART). In gradient boosting, the base learners are generated sequentially in such a way that the present base learner is always more effective than the previous one. Thus, the overall model improves sequentially with each iteration. A detailed overview of the XGB models used here is available in Appendix F.

Multi-Layer Perceptron (MLP)
MLPs are a class of feed-forward deep neural network that consist of multiple, fully-connected (i.e., dense) hidden layers. In MLPs, the mapping f between the pre-and post-impact parameters is defined by a composition of functions g 1 , g 2 , ...g n (n being the number of layers in the network), yielding, where each function is parameterized by a weights matrix (w i ), a bias vector (b i ), and an activation function (h i (·)). The weights matrix and bias vector are the parameters of the network that are tuned by minimizing a loss function which measures how well the mapping f performs on a given dataset. In this work, the MLPs are implemented with Python's Tensorflow library and models consist of an input layer with 12 nodes, one to three hidden layers with up to 24 nodes each, and an output layer with a single node (i.e., a scalar output). All activation functions in the resulting network are the Rectified Linear Unit (ReLU). A detailed overview of the MLPs used in this work is provided in Appendix G.

Data pre-processing
Prior to training the regression models, a number of transformations are applied to the pre-impact parameters (Table 1). These transformations ensure that the training data is well-defined (i.e., no undefined values), and generally improve training efficiency and regression performance. We describe these transformations here. The first step in our data pre-processing pipeline is to remove entries wherein the post-impact quantity of interest is undefined. Undefined entries occur when an LR or SLR was not produced by a collision. This is often the case in head-on, high-velocity impacts, after which only debris is present, and in the case of mergers, in which no SLR is created. In the case of the training data, the outcomes of the collisions are known a priori and we simply remove collisions where the targeted post-impact quantity is undefined. However, when emulating collisions within N-body simulations, the outcome is not known a priori and therefore a pre-classification step is required to predict and handle undefined quantities. We remark on this preclassification step in §3.5.
We apply standardization to the resulting welldefined quantities. The procedure for standardizing the input data differs between PCE and the other methods. In the case of PCE, the input parameters are linearly mapped into a hypercube [−1, 1] 12 , within which the distribution of the transformed features is still uniform. For the other methods, the pre-and postimpact parameters are scaled using the standard scaling method. Standardization is a general requirement for many ML algorithms. The only family of algorithms that are scale-invariant are tree-based methods (e.g., XGB). However, since we are comparing several different ML algorithms here, some of which depend strongly on standardization, we standardize the input and output features for all techniques (except as noted above for PCE). The result of standardization (a.k.a. Z-score normalization) is that the features will be rescaled such that they evince the properties of a standard normal distribution, µ = 0 and σ = 1, where µ and σ are the mean and standard deviation of the distribution, respectively. The z-values are then calculated as, The regression performances reported in Table 5 are for emulators trained on the full 12D LHS10K dataset. However, for the purpose of investigating regression performance as a function of dataset size, we have sub-sampled the 12D LHS10K training dataset to create a series of smaller datasets. These subsets were generated by drawing random samples from the full 12D LHS10K dataset.
We created training subsets with set sizes increasing in steps of 100 up to 1000 and from thereon in steps of 1000 up to 10000. Note that there is a difference between the training set size (TSS) and the effective training set size (ETSS) on which the regression models are actually trained. Because we remove undefined values in the pre-processing step, the ETSS is dependent on the post-impact property in question. The ETSS is therefore generally lower than the TSS for LR quantities and even lower for SLR quantities. To reiterate, this is because the number of remnants depends on the initial conditions of the collision. Outcomes with an LR are more common than outcomes with both an LR and SLR. This also affects the 12D LHS500 validation dataset. This is important because the effective validation set size (EVSS) determines the expected variance of the performance measures, This suggests that larger training and validation datasets are likely needed for SLR quantities to achieve the same accuracy as LR quantities. In our datasets, the post-impact properties related to the debris do not suffer from this issue, because all collisions in our datasets generate at least some debris. This is because the collisions in our dataset begin with asymptotic relative velocities above the escape speed of the system. However, in other datasets, this will not necessarily be the case.

Hyperparameter optimization (HPO)
Once the data has been pre-processed, we perform HPO in order to identify the optimal set of hyperparameters for each model. The HPO procedure for PCE-which is implemented with MATLAB/UQLab-is different from that of the methods implemented in Python (i.e., GP, MLP, and XGB). In the case of the latter methods, we used the hyperopt library to identify the optimal hyperparameters for each model and post-impact parameter pair. The hyperopt package is a Python library designed to optimize hyperparameters over awkward search spaces with real-valued, discrete, and conditional dimensions, which makes it ideal for iterating machine learning meta-parameters. We employed hyperopt's Bayesian sequential model-based optimization (SMBO) with a Tree-structured Parzen Estimator (TPE), which we found converged on optimal architectures more quickly than purely random or grid-based strategies. The Python-based HPO procedure identifies an optimal architecture over 100 iterations. Each step in the HPO procedure employs a 5-fold cross-validation on the training dataset, using 80% of the data for training and the remaining 20% for validation. The negative average r 2 -score across all five folds was used as the objective loss function during HPO.
The PCEs used here have two distinct groups of hyperparameters. The HPO procedure for PCE searches over only one of these groups. The first group contains the maximal polynomial order, p, of the PCE and qnorm. A grid of these parameters is searched for the best configuration using a greedy algorithm (in that the optimal values for p and q-norm are only approximated). The second group of parameters consists of the maximum interaction, r, and the feature importance threshold. These parameters were optimized by trial and error. It is common to set r to very low values (∼ 2-3) following the sparsity-of-effects principle [53]. Here, we use a larger value of r = 4, which results in more expensive training of the PCEs. We found that this value leads to the best performance, whereas higher values of r render the training even more expensive and does not substantially increase the performance (and in some cases leads to worse performance). The feature importance threshold was not varied, but rather set to 1% as it has been noticed that this is a conservative cut that still reduces the computation cost of PCE noticeably.
Each of the four data-driven methods requires a unique set of hyperparameters. The hyperparameter spaces searched for each emulation method are summarized in Table 4.
Because we do not enforce sparsity in the GPs used in this work, they require prohibitively long training times as dataset sizes increase. Therefore, for the GP models, we only carry out HPO up to training set sizes of N = 2000. Beyond this training set size, we do not attempt HPO for GP models, but instead recycle the optimal hyperparameters identified for the GP models at N=2000 for each post-impact property.

Performance evaluation
Once an optimal architecture was identified by the HPO procedure, the best performing architecture was re-trained on 100% of the training dataset. The resulting model was then evaluated on the independent valididation dataset (12D LHS500). Evaluating the performance of an emulator requires a carefully chosen metric appropriate to the problem. There are several commonly employed metrics that are not suitable for collision emulation due to the range of the post-impact properties. For example, mean squared error (MSE) is not scale invariant and relative error metrics are illsuited to the many parameters that can take on null values. For this reason, we use the coefficient of determination, known as the r 2 -score, to measure the quality of the regressors, where SS res = i (y i −ŷ i ) 2 is the residual sum of squares and SS tot = i (y i −ȳ) 2 is the total sum of squares. Here, y i is the ith expected value,ȳ is the mean of the expected distribution, andŷ i is the ith predicted value. The r 2 -score has been used as the performance metric in similar work [27] and is therefore a prudent choice in order to make comparisons to other studies.
In order to evaluate the performance of the regression techniques, we test their predictive ability on an independent validation set (12D LHS500). However, in order to make predictions on this test set, we must first identify the cases in the validation set where the postimpact quantity is undefined. The post-impact quantity will be undefined when the emulator is attempting to predict a quantity for a body that does not exist (i.e., was not produced by the collision).
A classification step is therefore required prior to making predictions with a trained regressor for either LR or SLR quantities. This classifier must predict the existence or non-existence of the LR/SLR such that the regressor is not attempting to predict an undefined quantity. In this work, we assume that a hypothetical perfect classifier exists for this purpose. In practice, this will of course not be the case, but it's a necessary assumption to make here so that we can remain focused on comparing the regression performances. If we were to train and use imperfect LR and SLR classifiers, they would propagate errors inherent to the classification step, which would strongly bias regression performance and therefore render their comparison unhelpful. In a forthcoming paper, we report the performances of such classifiers and how they are implemented within the emulation pipeline.

Feature importance
The data-driven techniques that we consider in this work allow us to evaluate and compare feature importances for each post-impact property. Importance metrics are powerful methods for quantifying relationships between pre-and post-impact parameters. In this work, we report Sobol' indices derived from PCE.
Sobol' indices [54,55] measure how sensitive a given post-impact parameter is to each of the individual preimpact parameters, as well as to any of their interactions. The indices quantify the relative contribution of variance explained by one variable-or group of variables-to the total variance, where S i1...is is the Sobol' index of order s. The first order Sobol' indices are the values S i which characterize the variance explained by the variable x i . The higher order Sobol' indices (second order S ij with i = j etc.) quantify how much variance is explained not by single variables but rather by their interactions. The Sobol' indices are a particularly useful sensitivity measurement tool in the context of PCE because a Sobol' decomposition can be computed directly from a PCE by employing a simple reordering of terms. Hence the computation of Sobol' indices from a PCE is analytic and exact. For a more thorough introduction to Sobol' sensitivity analysis we refer to the following references [56,55].

Regression performance
The following sections describe the regression performance of the emulators with respect to the subset of post-impact properties investigated in this work ( Table  3). The performances of the emulators on each postimpact property are quantified by r 2 -scores, which are tabulated in Table 5. A few general results are apparent from the regression performances: • In all cases, the data-driven models far outperform the analytic and semi-analytic methods. • To within the expected variance, the data-driven models achieve equivalent accuracies for a given post-impact parameter. • There is no apparent benefit to predicting the normalized masses for any of the data-driven techniques. • Predictions of the debris mixing ratio and mean altitude of the debris field proved the most difficult to regress, however these are outliers relative to the otherwise good performances.
The emulator predictions for each post-impact parameter are plotted relative to their expected (i.e., simulated) values in Figure 6 for LR properties, Figure 7 for SLR properties, and Figure 8 for debris properties.

Analytic & semi-analytic methods
The analytic and semi-analytic methods investigated in this work achieved relatively poor r 2 -scores relative to the data-driven methods. While limited to a narrow set of parameters, IEM is the most accurate of these methods, while PIM and EDACM generally perform poorly. However, a number of results are surprising.
The semi-analytic methods' regression performances on M LR are significantly below that of the data-driven methods, achieving r 2 -scores of 0.7659 and 0.6897 for IEM and EDACM, respectively. Their relative performance is somewhat surprising, as EDACM uses an explicit relationship to predict M LR , whereas IEM only predicts M deb and provides no explicit relation for M LR . PIM does poorly when predicting M LR . This latter result is perhaps not surprising, as PIM assumes all collisions result in perfect accretion and studies have shown that this is not the case in about half of collisions [26].
Only EDACM is capable of making predictions for M SLR . The resulting r 2 -score, −0.4343, is much worse than the associated score for its prediction of M LR . EDACM's significantly worse performance when predicting the mass of the SLR as opposed to the LR is likely influenced by two important aspects of the EDACM algorithm. First, EDACM delineates collisions into multiple regimes (e.g., perfect merging, hitand-run), in which different analytic relations are used. Second, the calculation of M SLR uses M LR as an input (via M norm LR ; see Eq.24 in Appendix C). Thus, any Table 5 Coefficients of determination (r 2 -scores) for the analytic, semi-analytic, and data-driven methods investigated in this work. The data-driven models were trained on the 12D LHS10K dataset and all models were evaluated on the 12D LHS500 dataset. The r 2 -scores quantify the correlation between the predicted and "true" values of the postimpact parameters, where the true values are obtained from SPH simulations. Entries listed as n/a indicate the method was not designed to make a prediction for the parameter in question. error in the prediction of M LR will propagate to the prediction of M SLR . Moreover, the regime in which an SLR is produced (i.e., the hit-and-run regime) and for which a mass is subsequently predicted is a relatively small subset of the overall collision space. If the prediction of M LR is on average worse in this regime, then this bias will be propagated.
In the case of the debris, only IEM explicitly predicts M deb . IEM predicts M norm deb , from which M norm LR is subsequently derived. IEM's prediction of the debris mass is surprisingly good with an r 2 -score of 0.8692, but still approximately 10% lower than that of the data-driven methods. This reverse approach taken by IEM, first predicting the M deb , allows it to make an accurate prediction of M LR .
We also tested the ability of the analytic and semianalytic methods to predict the normalized mass quantities. In the case of the LR, this resulted in worse performance for these methods. Indeed, in the case of PIM, the resulting r 2 -score for M norm LR was −0.7128, versus −0.1731 for M LR . Similarly for IEM and EDACM, the r 2 -scores are significantly lower for the normalized quantity. The poor performance of the analytic and semi-analytic methods on the normalized quantities is expected as a side-effect of how the r 2 -score is calculated. Because the normalized quantities are scaled by the total mass of the collision (M tot , which is different for each collision), the distribution of M tot skews the predicted distribution of M LR . Thus, the normalized quantities are only of interest to the data-driven methods, which predict the normalized masses directly and therefore don't suffer from this issue.
The core mass fraction of the LR (F core LR ) is predicted by both PIM (implicitly) and EDACM (via a mantle stripping formula [44]). Here, PIM performs unexpectedly well, yielding an r 2 -score of 0.4649. PIM's unexpected performance on F core LR provides physical insight into the processes that determine F core LR , suggesting that the cores of pre-impact bodies often merge. In contrast, EDACM yields an objectively poor r 2 -score of −0.2368 for F core LR , despite utilizing a more complicated formulation.
For both F core LR and M SLR , a large factor in EDACM's poor performance are the collisions that comprise the super-catastrophic disruption (SCD) regime [2] (see Appendix C). In Figure 6, it's clear that M LR is systematically under-predicted for a subset of collisions, which correspond to the SCD regime. The poor predictions in this subset of collisions is propagated to the calculations of both F core LR and M SLR , causing the former to be systematically over-predicted and the latter to be under-predicted.
In addition to the data-driven methods, only PIM makes any prediction of rotational properties. These predictions were not expected to be very accurate, given the assumptions of the model (see Appendix B). Indeed, the resulting regression performances are exceptionally poor. Figure 6 illustrates that PIM greatly overestimates the angular momentum budget of the LR (J LR ), which results in similar overestimates of its rotation rate (Ω LR ). This has the opposite effect on θ LR , which is systematically underpredicted. The obliquities are predicted to be low because the angular momentum delivered by the impact is deposited in the plane of the impact.
The method for handling debris in the N-body implementation of EDACM [22] performs poorly relative to the data-driven methods as well. This is unsurprising given the simplifying assumptions of the debris model (see Appendix C). This would suggest that more accurate models for handling debris within N-body simulations is sorely needed.

Data-driven methods
For all post-impact parameters, the data-driven methods achieve high accuracy. Of the LR, SLR, and debris properties, the r 2 -scores are generally > 0.9, with values as high as 0.9929 for M deb . The mixing ratio of the debris (δ mix deb ) and mean altitude of the debris field (θ deb ) proved most difficult to regress with maximum r 2 -scores of 0.7872 and 0.5438, respectively.
For a given post-impact parameter, the data-driven techniques achieve similar performances. Indeed, the differences in performance are generally small and fall within the expected variance of the 12D LHS500 test dataset (Eq. 7). This demonstrates that, despite fundamentally different underlying methodologies, all of the data-driven methods are capable of achieving roughly the same performance given a sufficiently large dataset.
However, between post-impact parameters, the best achieved accuracy can differ significantly. Given that the different data-driven techniques are able to achieve the same accuracies, this suggests that the difficultly in reaching higher accuracies lies not with the emulation methodology, but rather with the data or the underlying physical processes that determine the post-impact quantity. In the former case, this may be due to insufficient fidelity of the simulations, insufficient resolution of the training dataset, or ill-defined parameterizations of the post-impact properties.
A known source of uncertainty in the post-impact quantities is the post-impact group finding step. In subsequent steps, the group finding algorithm can assign particles to a group to which they were previously not a part of. While the number of these particles is almost always small (on the order of a few), this can have a large effect on the calculation of post-impact quantities, especially for remnants or debris fields composed of a small number of particles.
Parameters whose accuracies are likely affected by the underlying physical process are, for example, the obliquities. In this case, the limitation on performance may be a result of the obliquity (via the angular momentum vector) being highly variable at low rotation rates. Another set of parameters affected in this way are likely those related to the debris field spatial distribution (e.g.,θ deb andφ deb ). It may be that these quantities are inherently noisy as a result of being sensitive to small changes in the impact geometry.
In many cases, the performance of the GP models is below that of the other data-driven models. The lower r 2 -scores for GPs are likely, at least in part, a result of the limitations on HPO for GPs. Recall that HPO is only carried out for GP models on training datasets with sizes of N ≤ 2000. Due to these limitations, the GP models are not fully optimized on the full 12D LHS10K dataset, while the other data-driven methods are.

Dependence on training set size
In the preceding sections we have discussed the performance of the emulators as trained on the full 12D LHS10K dataset. Here, we discuss their performance on smaller training datasets.
The regression performances of the emulators see their most dramatic improvement on training dataset sizes of less than a thousand (Figure 4). On dataset sizes above roughly a thousand, the r 2 -scores continue to improve slowly until a few thousand, after which only marginal gains are achieved. For many post-impact properties, near-optimal performances are achieved quickly. However, some post-impact properties continue to see improvement with increasing training set sizes. This suggests that, while the masses and several other properties only require relatively small training datasets, other properties relevant to terrestrial planet formation will require datasets even larger than those considered here. This is especially true of properties related to the SLR, for which the ETSS is generally about half that of the TSS.
On the smallest subsets (N < 1000), the GP, PCE, and XGB models outperform MLPs. However, MLPs catch up to the other methods once the dataset sizes have reached a few thousand.

Feature importance
The Sobol' indices in Figure 5 suggest that, for most post-impact properties, the geometry and energy of the impact-determined by γ, b ∞ , and v ∞ -are the strongest factors in deciding the outcome of a collision. However, for some post-impact properties, other preimpact parameters are important. This is true for the obliquities and core mass fractions, which are generally dependent on the pre-impact values of the associated body-i.e., the target for the LR and projectile for the SLR. Pointedly, the Sobol' analysis also shows that the azimuthal orientation (φ) of the pre-impact bodies is unimportant to the outcome of the collisions.

Feature importance
The data-driven models investigated here provide insight into the physical relationships between pre-and post-impact quantities. While ML methods are often criticized for being so-called "black boxes", advances in model interpretability have made data-driven methods powerful tools for understanding complex relationships. The Sobol' indices shown in Figure 5 and PCE feature selections reported in Table 6 illustrate clearly the relationships between the pre-and post-impact parameters. In general, the most important pre-impact parameters are those related to the geometry and energy of the impact. These parameters are the mass ratio (γ), asymptotic relative velocity (v ∞ ), and asymptotic impact parameter (b ∞ ). For rotational quantities (J, Ω, and θ), the pre-impact rotational state of the associated body-target for the LR and projectile for the SLR-are also important.
The Sobol' analysis, along with the results of the PCE feature selection (Table 6), also explain why the analytic PIM method does so well at predicting F core LR . In addition to the impact geometry, the PCE feature selection shows that the core mass fractions of the target (F core targ ) and projectile (F core proj ) are crucial in determining the F core LR . This would add further weight to the idea that, with the exception of hit-and-run collisions, the cores of the target and projectile tend to merge.
The Sobol' analysis and associated PCE feature selection pointedly show that the pre-impact azimuthal orientations (φ targ , φ proj ) are unimportant in determining the outcome the post-impact quantities. This would suggest that these parameters can be ignored in future studies. However, it would be prudent to first assess their contributions to post-impact properties not Sobol' Index (S1) Figure 5 Sobol' indices. The Sobol' index is a sensitivity metric that quantifies the contribution of each pre-impact parameter in determining the value of a given post-impact quantity. For all post-impact properties, the Sobol' analysis indicates that the geometry of the impact is important in determining the outcome. Additionally, for parameters related to the post-impact rotation, the pre-impact rotational states of the target and projectile are also important. The fact that the data-driven methods achieve approximately the same accuracies (to within the expected variance) for each of the post-impact parameters strongly suggests that further increases in the accuracies are limited by the underlying data or physical processes and not the regression methodology.
The obliquity of the SLR (θ SLR ) evinces a relatively low performance (r 2 ∼ 82%) among the data-driven methods. The associated Sobol' indices don't reveal a clear determinant feature or set of features, whereas the indices for θ LR clearly show a relationship dominated by the target obliquity (θ targ ). Furthermore, for θ SLR , the PCE feature selection appears unable to confidently eliminate any features with the exception of M tot , φ targ , φ targ , and-surprisingly-Ω proj . This would suggest that the PCE (and presumably the other data-driven methods) are still attempting to leverage as much information as possible from the training data for θ SLR and would therefore likely benefit from even larger training datasets.
The datasets used to train and validate the datadriven models here include at least six additional dimensions to any previous study of its kind, as well as more expansive ranges in each of its dimensions. We have sampled asymptotic relative velocities of up to 10 times the escape velocity. Previous studies have considered much lower asymptotic relative velocitiesindeed, they sampled lower impact velocities-than we have in this work. The high velocities considered in this work might seem excessive, but such velocities are needed to capture the low-probability collisions that can occur during planet formation. Indeed, recent studies have shown that it's possible for planetary-sized objects to be exchanged between stars in a crowded stellar environment, leaving those objects on highly-eccentric orbits that could result in a collision [17]. These velocities would be extremely fast and the ensuing collisions catastrophic.

Ease of implementation
The data-driven models developed and evaluated in this work operate by fundamentally distinct underlying methodologies, both from a mathematical and algorithmic point of view. Therefore, an important consideration of these models going forward is their complexity and relative ease of implementation into existing or future N-body codes. There are a number of considerations that need to be taken into account regarding practical development and use of the models. First, what are the dataset requirements? Second, what are the computational resources required to train and validate the models? And third, what are the limitations when integrating the model into an existing N-body integrator, both in terms of speed and complexity?
Most of the improvement in performance relative to training set size is achieved up to sizes of roughly a thousand, with marginal increases thereafter ( Figure  4). The results would therefore suggest that datasets of approximately a few thousand simulations would be suitable for most post-impact properties, such as masses or core mass fractions. For other, more difficult-to-emulate post-impact properties, such as θ SLR , larger dataset sizes are advisable. The datasets should additionally be large enough to allow for robust training and validation practices, such as the HPO with k-fold cross-validation used in this work.
While the dataset requirements are similar for the data-driven models, the computational resources needed to train, optimize, and validate them are not. We have avoided an explicit comparison between training times and memory requirements, on one hand because the models only have to be trained once and, on the other hand, because not all models were trained on the same hardware, rendering a fair comparison problematic. However, the qualitative differences between methods is worth mentioning. As training set sizes increase, the time required to train, optimize, and validate the models increases. The times required to train and optimize the PCE and XGB models are negligible for the datasets investigated here, whereas the times required for the MLP and GP models grow quickly. The MLP models remain tractable up to N = 10, 000, however we were unable to perform HPO on GP models above N = 2, 000. Therefore, as dataset sizes continue to grow, GPs are not likely to remain competitive with the other data-driven methods.
In terms of accessibility, neural networks (such as MLPs) and XGBoost are both extremely popular ML methods and as a result many implementations from Python into other languages are readily available. Likewise, PCEs have already been used in other astrophysical applications to great success [39]. In order to utilize these models in an N-body integrator, a way to store their architecture, hyperparameters, and coefficients, weights, and/or biases is required. These parameters must be readily accessible by the integrator, and therefore speed and memory requirements must be considered. For example, while the matrices containing the weights and biases of neural networks can grow very large, the MLPs investigated here are relatively small networks, with no more than three hidden layers with up to at most 24 neurons each. Therefore, the associated weights and biases matrices are negligibly small and can be used without issue in existing N-body codes. Given the excellent performance of the MLP models here, it is unlikely that the number of layers or neurons per layer will grow significantly in the future.
We provide all of the models reported in this study as either HDF5 or serialized pickle files at https:// github.com/mtimpe/aegis-emulator.

Future work
The data-driven emulation strategies explored here have proven to be extremely flexible and robust. This suggests that the greatest benefit to collision models and subsequent emulation-based N-body simulations will come from improvements to the datasets used to train the models. The most obvious improvements are needed in the underlying simulation methods (e.g., smoothed-particle hydrodynamics). Higher resolution simulations, improvements to the underlying CFD algorithms, as well as improved and additional equations of state are the obvious improvements in this respect.
An important caveat that bears repeating in all machine learning applications is that data-driven methods will faithfully emulate the data they are given. Therefore, the accuracy of the underlying numerical methods and distributions of the input features are critical considerations. Unfortunately, there is as of yet no comprehensive study for planetary collisions comparing the results of different CFD methods (e.g., AMR, SPH) or implementations of those methods in the literature. Therefore, while data-driven techniques may achieve excellent accuracies, their performance does not give any information as to the accuracy of the underlying simulations. Thus, a comprehensive code comparison for planetary collision codes would be of great benefit to the community.
We have not attempted to impose any physical limitations on our data-driven models in this work. Thus, while the predictions of the models may be accurate, they may not be physically self-consistent. In the context of N-body studies, the conservation of mass and momentum is of particular importance and therefore a robust method is needed to ensure the physical selfconsistency of the models. In a forthcoming paper, we explore multi-target regression (MTR) for physically conserved quantities, such as mass and angular momentum. MTR may prove useful for imposing physical self-consistency on the models, which at present must be achieved entirely ex post.
In addition, ML and UQ are rapidly advancing fields and are used in a wide range of applications. More advanced techniques (e.g., ensemble learning) are therefore likely to prove useful in the future. Such techniques were beyond the scope of this paper, but the models investigated here may benefit from them significantly.

Conclusions
Using a new set of 10,700 SPH simulations of collisions between differentiated, rotating planets, we have demonstrated that data-driven methods from machine learning (eXtreme Gradient Boosting and multi-layer perceptrons) and uncertainty quantification (Gaussian processes and polynomial chaos expansion) can accurately predict the outcome of a wide range of postimpact properties. We additionally showed that extant analytic (perfect merging) and semi-analytic methods (IEM and EDACM) perform poorly compared to data-driven methods when effects such as variable core mass fractions and pre-impact rotation are included. In terms of dataset requirements, the best performances are reached around a few thousand collisions, however some parameters continue to show improvement, suggesting that larger training datasets will be useful in the future. We summarize the most notable conclusions here: • Data-driven methods can achieve high accuracies for a wide range of post-impact properties. • Extant analytic and semi-analytic methods are limited to a narrow range of post-impact properties and cannot compete with data-driven methods in terms of accuracy. • Data-driven methods require training dataset sizes of at least a few thousand collisions for optimal regression performance. • The data-driven methods investigated in this work achieve roughly the same performance, to within the expected variance, for a given postimpact property. • The core mass fractions and pre-impact rotation of the target and projectile play a significant role in determining collision outcomes.

Figures Appendix A: Definitions
Pre-impact trajectory In this work we use the asymptotic relative velocity (v∞) and asymptotic impact parameter (b∞) to specify the initial trajectory of the projectile in the target's frame of reference. Most previous studies have used the associated quantities at the moment of impact-bimp and vimp, respectively. Therefore, we provide formulae for converting quickly between the two. These conversions can be derived from the conservation of energy and angular momentum. We first calculate vimp from v∞, where G is the gravitational constant, Mtarg is the mass of the target, and Rcrit = Rtarg + Rproj (using the non-rotating radii of the bodies). The impact parameter (bimp) can then be obtained via, bimp = b∞ v∞ vimp (11) Note that this conversion assumes that the target and projectile are perfectly rigid bodies, which is not the case in either reality or in CFD simulations. Therefore, the conversion is an approximation, because the shapes, rotation rates, and orientations of the target and projectile, as well as their pre-impact trajectories, will be altered by gravitational interactions prior to impact.
Iron content The core mass fraction (F core body ) is a measure of the iron in either the target, projectile, LR, or SLR, relative to the body's total mass. In the simulations investigated here, the SPH particles that comprise the pre-impact bodies are either iron or granite. Thus, it is straightforward to calculate the iron (i.e., core) mass fraction, where Ngran and Niron are the number of granite and iron particles, respectively. Similarly, while the debris doesn't have a core, it's iron mass fraction (F F e deb ) is calculated in the same manner.
Melt fraction The melt fraction (F melt body ) is the fraction of the post-impact material that is in a non-condensed state, as defined by the Tillotson EOS. This is useful for estimating the depth of the post-impact magma ocean. Note that the Tillotson EOS doesn't allow for mixed states, so this quantity should be be used with caution and only as a rough estimate of the post-impact melt fraction. Our motivation for including it here was to show that data-driven emulation can be extended to parameters which have not been considered before. Improvements to the EOS in future datasets will improve the usefulness of quantities such as this.
Mixing ratio The mixing ratio (δ mix body ) in this study is defined as the fraction of "foreign" material present in the LR, SLR, or debris. While this gives no information about the source of the foreign material (i.e., whether foreign refers to the target or projectile), it is easier to regress because it does not suffer from the non-negligible number of hit-and-run collisions in which the projectile becomes the LR and the target the SLR. These cases create a significant discontinuity in the response surface, which makes it difficult to regress. However, coupled with a classifier that identifies the dominant material source, the mixing ratio is a powerful tool for studying compositional exchange during collisions.
Debris field spatial distribution The mean and standard deviations of the debris altitude (θ) and azimuth (φ) are a way to quantify the direction and spread of the post-impact debris field. The altitude of the debris particles are measured relative to the initial collision plane and the azimuths are measure relative to an arbitrary reference direction within the collision plane. Here, the azimuths are measured relative to the initial velocity vector of the projectile in the reference frame of the target.

Appendix B: Perfectly Inelastic Merging (PIM)
Perfectly inelastic merging (PIM) assumes perfect conservation of mass and momentum, allowing a set of simple analytic formulae to be derived. The formulae predict the mass and core mass fraction of the largest (and only) remnant (referred to as the LR for consistency). During the collision, there is no net conversion of kinetic energy to other forms such as heat, noise, or thermal energy. Mass is conserved in the only remant, such that where Mtarg and Mproj are the masses of the target and projectile, respectively. We can similarly calculate the core mass fraction of the LR by noting that, in a perfect merger, the cores of the target and projectile will be incorporated in their entirety into the LR, where F core targ and F core proj are the core mass fractions of the target and projectile, respectively. PIM can also predict the rotational angular momentum, rotation rate, and obliquity of the LR. The rotation model assumes perfect angular momentum conservation and assumes that the orbital angular momentum of the collision remains with the post-impact remnant. The angular momentum in the system is determined by the rotational angular momenta of the target and projectile and the orbital angular momentum of the pre-impact trajectory, where J orb = Mproj b∞v∞ is the orbital angular momentum delivered by the impact. The obliquity of the remnant (θ LR ) is subsequently measured relative to the unit vector normal to the collision plane (ẑ = [0, 0, 1]). The rotation rate of the remnant can be calculated from the magnitude of the angular momentum vector, where I LR is the moment of inertia of the LR. Because the bodies themselves are not physically resolved in PIM, the moment of inertia of the LR must be analytically approximated (and in turn the radius), where ρ LR = ρgran (1 − F core LR ) + ρironF core LR . The density of iron is ρiron = 7.86 g/cm 3 and ρgran = 2.7 g/cm 3 is the density of granite. Here, bimp and vimp are the impact parameter and velocity at the moment of impact, Q R is the specific impact energy, and Q RD is the catastrophic disruption threshold. We have followed the implementation of EDACM as provided in LS12 for the LR and SLR properties, and its subsequent N-body implementation [22] for the debris properties. LS12 provides a step-by-step procedure for calculating Q RD , the projectile's interacting mass Minteract, and the velocities for the onset of erosion verosion and super-catastrophic disruption (SCD) v scd , which are used below. These calculations are beyond the scope of this appendix, but we direct the reader to Appendix A of LS12 as a reference. Here, we provide a brief overview of EDACM and point out where our implementation differs.
Perfect merging In EDACM, The mutual escape velocity is calculated using the interacting mass in the collision, where M = Mtarg + Minteract and Minteract is the interacting mass of the projectile. ρ1 = 1 g/cm 3 is an assumed bulk density (see Table 7) of the bodies. This bulk density is low for planetary-scale bodies, but we use it here for consistency with previous implementations [2,22]. If the impact velocity is less than the escape velocity (vimp < v esc ), then the outcome is assumed to be a perfect merger and EDACM is therefore equivalent to PIM in this regime, Disruption and accretion regimes For impact velocities exceeding the escape velocity (vimp ≥ v esc ), collisions are further broken up into grazing (bimp > bcrit) and non-grazing (bimp < bcrit), where Rtarg and Rproj are the radii of the target and projectile, respectively. The radii are determined via the bulk densities, Here, we differ from LS12 in that we are using differentiated bodies, and therefore we calculate the bulk density of our bodies as, where the density of iron is ρiron = 7.86 g/cm 3 and ρgran = 2.7 g/cm 3 is the density of granite. For non-grazing impacts, where v esc < vimp < v scd , the impact is in either the disruption or partial accretion regime. In these regimes, a universal law for M norm LR applies, Hit & run regime Grazing collisions (bimp > bcrit) where v esc < vimp < verosion are defined as hit & run collisions. In this regime, M LR is again calculated by the universal law (Eq. 23). If, in the resulting prediction, M LR < Mtarg, then the outcome is a single large remnant (i.e., the LR) and debris. However, if M LR ≥ Mtarg, then the LR is assumed to be the original target (M LR = Mtarg) and the SLR is calculated assuming the "reverse collision" scenario. This scenario is described in detail in LS12, and the resulting relation used to predict M norm SLR is, where η = −1.5.
Debris Following the EDACM implementation for the N-body integrator Mercury [22], the mass not allocated to the LR (in the case of non-hit-and-run collisions) is split into one or more equal-mass fragments, where the masses are as close as possible to, but always more massive than, M f rag = 4.7 × 10 −3 M⊕. This limit was set by the computational limits of the Mercury integrator at the time of the study. With the LR acting as the center of mass, the trajectories of the resulting fragments are arranged at uniform intervals around a circle lying in the collision plane. This results in the a mean altitude of the debris fragmentsθ deb of 0 degrees with a standard deviation θ stdev deb of 0 degrees. The mean azimuth of the fragmentsφ deb is 180 degrees. The standard deviation of the debris fragments φ stdev deb is that of a uniform distribution from 0 − 360, which is 103.9 degrees in this case.
Mantle stripping EDACM predicts the core mass fractions of its remnants by using a mantle-stripping prescription introduced in earlier work [44]. This prescription is based on simulations of collisions in which the colliding bodies have chondritic compositions (i.e., F core targ = F core proj = 0.33).

Appendix D: Polynomial Chaos Expansion (PCE)
PCE is a probabilistic method whereby the model output is projected on a basis of orthogonal stochastic polynomials in the random inputs. The stochastic projection provides a compact and convenient representation of the model output variability with regards to the inputs. In this work, PCEs are used to represent the relationships between the pre-and post-impact parameters of the collisions. The PCE coefficients are obtained from a non-intrusive regression based method. PCE represents the post-impact parameters by a series expansion, whereŷ is the predicted post-impact value, yα are the coefficients to be calculated and Ψα are the multivariate orthonormal basis functions. Orthonormality for PCE basis functions is always defined with respect to a weighting function given by the joint probability distribution f X ( x) of the sampled input features, where D X is the full input space and d is its dimensionality and δnm is the Kronecker delta. In our case, this input distribution is chosen to be uniform in all d = 12 dimensions (classic LHS; see §2.1.2) as we do not want to impose any non-trivial priors on the collisional input parameters. Following [57], in this work all the basis functions hence need to be based on Legendre polynomials, where n is the polynomial order and the norm of the n-th Legendre polynomial is, with which we can define the normalized Legendre polynomials, In order to construct the multivariate basis functions from the univariate Legendre polynomials, we calculate the tensor product, The Legendre polynomials are further defined over the interval [−1, 1]. This is why all input features need to be linearly mapped into a 12D unit hypercube before they can be passed into the individual Legendre polynomials.
Truncation of the polynomial basis The most straightforward way of truncating a PCE is via a maximal polynomial order. Note that this means that the total polynomial order may not exceed this maximum. The subscript α is a multi-index specifying uniquely how a basis function of order n is composed by individual Legendre polynomials: The first entry in the multi-index is given by the order of the first factor in 33, the second index refers to the order of the second factor and so on. The sum of all entries in the multi-index may thus never be larger than the maximum polynomial order.
Expansion coefficients The goal of PCE regression is to determine the coefficients yα of the expansion, truncated at some polynomial order, given a training data. In PCE the underlying model is assumed to take a random variable as input and, as a consequence, the output of the model has to be treated as a random variable as well. In fact, PCE maps probability distributions of input features to probability distributions of output. Because PCE belongs to the class of spectral decomposition methods, its expansion coefficients decrease polynomially, leading to favorable convergence properties. As it turns out, sometimes the prediction performance can be improved if only carefully chosen terms remain in the expansion while others are left out. There are two more hyperparameters in this approach that further reduce the number of terms kept in the expansion. The expansion coefficients, moreover, contain information about the global output uncertainty given the uncertain input features. This latter property of PCE allows us to quantify feature importance via the Sobol' indices. The OLS algorithm is used to compute the coefficients in the polynomial chaos expansion.
In this work The PCE regression models in this work are constructed as follows: first, for any given target, a computationally cheap version of PCE based on an ordinary least squares (OLS) loss function is computed. This allows us to quantify which features are relevant for the current target via Sobol' analysis (see §3.6). We only retain those features with a total Sobol' index larger than 1% (as otherwise the next step would be computationally too demanding). Based on this reduced set of features the PCE is then computed a second time. This time the PCE is obtained by minimization of a least squares loss function which is augmented by a penalty term through which a sparse representation of the final emulator is enforced. The loss function is minimized with the least-angle regression (LAR) algorithm [58].
For an in-depth introduction to PCE, we refer the reader to [39] and references therein.

Appendix E: Gaussian Processes (GP)
GPs are a non-parametric method that finds a distribution over the possible functions f (x) that are consistent with the observed data [50]. They are stochastic processes, such that every finite collection of its random variables has a multivariate normal distribution. The distribution of a GP is the joint distribution of all of its random variables. The function to be modeled is therefore represented as a stochastic process f (i.e., a collection of random variables indexed by some variable x ∈ X ), where we approximate f with a GP. GPs define a distribution over the function's values at a finite, but arbitrary, set of points (x1, ..., x N ), assuming that p(f (x1), ..., f (x N )) is jointly Gaussian, with a mean µ(x) and covariance σ(x) given by σij = k(xi, xj ), where k is a positive definite kernel function. The key idea is that if xi and xj are deemed by the kernel to be similar, then it expects the output of the function at those points to be similar too.
In regression problems, we are interested in predicting the value yi of f (x) at a specific points xi. In the general case, observations are noisy, which means that we observe, where ε is assumed to be independent and identically distributed Gaussian noise with variance σ 2 n . The prior on the noisy observation becomes cov(yi, yj ) = k(xi, xj ) + σ 2 n δij , where k(xi, xj ) is the kernel and δij is the Kronecker delta function. Typically, the value of the prediction for some input xi is given by the mean of f at xi.
Kernel function Machine learning algorithms that involve a GP use kernel functions to measure similarity between points and predict the value of an unseen point from training data. The prediction is an estimate for the unseen point based on the kernel function. The Gaussian radial basis function (RBF) kernel is commonly used, however in this work we test multiple kernels, including the constant, Matérn (ν = 3/2), rational quadratic, and RBF kernels (see Table 4).
In this work, we use scikit-learn's open-source implementation of GPs. The hyperparameters of the kernel are optimized during fitting of the GP by maximizing the log-marginal-likelihood (LML) based on the chosen optimizer (we use scikit-learn's default optimizer). As the LML may have multiple local optima, the optimizer is started repeatedly by specifying the number of restarts. The noise level in the targets is specified by α and can be helpful for dealing with numerical issues during fitting. We test models without noise and with α = 10 −2 .
Appendix F: eXtreme Gradient Boosting (XGB) XGBoost (XGB) is a scalable, open source machine learning algorithm for tree boosting [51]. For a given dataset with n examples and d features, a tree ensemble model uses K additive functions to predict the output, whereŷi is the predicted output value for a given set of input features xi, F is the function space of all possible classification and regression trees (CART). Each f k corresponds to an independent tree structure q with leaf weights w. To learn the set of functions used in the model, XGB minimizes the following regularized objective function, where l is a differentiable convex loss function that measures the difference between the predictionŷi and the target yi. XGB's default loss function for regression, which we use in this work, is the squared error, l = (ŷi − yi) 2 . The second term Ω is a regularization term that penalizes the complexity of the model, which helps to avoid over-fitting. XGB is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework [51]. Gradient tree boosting considers a function h(x; am), which is a small regression tree, where the parameters am are the splitting variables (i.e., on which input feature does the node make the split), split locations (i.e., in what location or value of the input variable to make the split) and number of terminal nodes, which we fix to be L. In this work, the splitting variables are the pre-impact parameters in Table 1.
During training, at each iteration m, a regression tree partitions the x−(input) space into L−disjoint regions {R l,m } L l=1 and predicts a separate constant value in each one. For some input x, the output of the weak learner can be written as whereȳ lm is the value predicted in region R lm . The model f ( x) is updated, at each iteration m, as where the coefficients βm and the parameters am are jointly obtained by minimizing (βm, am) = arg min where the residuals are given bỹ yi = − ∂ ∂fm−1(xi) Φ(yi, fm−1( xi)) , i = 1, N and an arbitrary, differentiable loss function Φ(y, f ( x)). This loss function could be, for example, mean squared error loss, or Huber loss. A more efficient algorithm is presented in [59], in which the search for best split is not achieved through an exact greedy algorithm (which requires to search for all possible splits on all features), but rather by an approximate algorithm, which proposes candidate splitting points according to percentiles of feature distribution.
In the XGB models used in this work, we use squared error as the loss function, a learning rate of ν = 0.1, and a L1 regularization term on the weights of α = 10.

Appendix G: Multi-Layer Perceptrons (MLP)
Multi-layer perceptrons (MLP) are a type of deep, feed-forward, artificial neural network that consist of three or more layers [60]. These layers include an input layer, output layer, and one or more hidden layers. Each of these layers is composed of a variable number of nodes (also called neurons). The layers in a MLP are fully connected, such that each node in one layer connects-with a certain weight, wij -to every node in the following layer. With the exception of the input layer, the nodes are wrapped in non-linear functions known as activation functions to regularize their output. The resulting network is a supervised learning algorithm that learns a function f (·) : R d → R o by training on a dataset, where d is the number of input dimensions and o is the number of output dimensions. Given a set of features x = x1, x2, . . . , x d and a corresponding target y (in the case of single-target models), it can learn a non-linear function approximator for either classification or regression. In this work, we train MLPs to learn a mapping from a 12-dimensional input space (the pre-impact parameters in Table 1) to a scalar output space (i.e., one of the post-impact parameters in Table 3) The resulting regression models are then non-linear functions that map f ( x) : R 12 → R 1 . While the input nodes provide the inputs, the hidden layers are the computational workhorse of the network. The output of a node in a hidden layer can be represented as, where ψ is the activation function and wi and bi are the weights and biases of the ith layer, respectively. MLPs learn by changing these weights and biases with each new piece of data they see. The magnitude and direction of the changes are based on the difference between the output value and expected result. In order to quantify the degree of error in the output node, a loss function L is defined, where y is the expected (i.e., training) value andŷ is the value predicted by the network. This particular loss function is the mean squared error (MSE). Note that the MSE is the loss function used to determine the weights and biases of the network, but the validation metric used to evaluate the performance of the trained model is the r 2 -score (see §3.5). Finding the minimum of the loss function, which is itself a composition of many non-linear functions, is generally impossible analytically. Thus, in order to find the minimum of the loss function, we use a stochastic gradient descent algorithm [61]. The MLPs used in this work consist of an input layer with 12 nodes, one to three hidden layers with up to 24 nodes each, and an output layer with a single node (i.e., a scalar output). All activation functions in the resulting network are the Rectified Linear Unit (ReLU). The ReLU activation function is linear for all positive values, and zero for all negative values, such that y = max(0, x). For an in-depth introduction to MLPs and the algorithms used here, we direct the reader to the following general comprehensive introduction of neural networks [62].