Building high accuracy emulators for scientific simulations with deep neural architecture search

Computer simulations are invaluable tools for scientific discovery. However, accurate simulations are often slow to execute, which limits their applicability to extensive parameter exploration, large-scale data analysis, and uncertainty quantification. A promising route to accelerate simulations by building fast emulators with machine learning requires large training datasets, which can be prohibitively expensive to obtain with slow simulations. Here we present a method based on neural architecture search to build accurate emulators even with a limited number of training data. The method successfully accelerates simulations by up to 2 billion times in 10 scientific cases including astrophysics, climate science, biogeochemistry, high energy density physics, fusion energy, and seismology, using the same super-architecture, algorithm, and hyperparameters. Our approach also inherently provides emulator uncertainty estimation, adding further confidence in their use. We anticipate this work will accelerate research involving expensive simulations, allow more extensive parameters exploration, and enable new, previously unfeasible computational discovery.


I. INTRODUCTION
Finding a general approach to speed up a large class of simulations would enable tasks that are otherwise prohibitively expensive and accelerate scientific research. For example, fast and accurate simulations promise to speed up new materials and drug discovery 1 by allowing rapid screening and ideas testing. Accelerated simulations also open up novel possibilities for online diagnostics for cases like x-ray scattering in plasma physics experiments 2 and to monitor edge-localized modes in magnetic confinement fusion, 3 enabling real-time predictionbased experimental control and optimization. However, for such applications to be successful the simulations need not only be fast but also accurate; achieving both to the level required for advanced applications remains an active objective of current research.
One popular approach to speeding up simulations is to train machine learning models to emulate slow simulations 4-7 and use the emulators instead. The main challenge in constructing emulators with machine learning models is in their need for large amounts of training data to achieve the required accuracy in replicating the outputs of the simulations. This training data could be prohibitively expensive to generate with slow simulations.
To construct high fidelity emulators with limited training data, the machine learning models need to have a good prior on the simulation models. Most work to date in building emulators, using random forests, 4 Gaussian Processes, 5 or other machine learning models, 6,7 do not fully capture the correlation among the output points, limiting their accuracy in emulating simulations with one, two, or three-dimensional output signals. On the other hand, convolutional neural networks (CNN) have shown to have a good prior on natural signals, 8 making them suitable for processing natural n-dimensional signals. However, as the CNN priors inherently rely on their architectures, 8 one has to find an architecture that gives the suitable prior for a given problem. Manually searching for the right architecture can take a significant amount of time and domain-specific expertise, and often produces sub-optimal results.
Here we propose to address this problem by employing efficient neural architecture search 9,10 to simultaneously find the neural network architecture that is well-suited for a given case and train it. With the efficient neural architecture search and a novel super-architecture presented in this work, the algorithm can find and train fast emulators for a wide range of applications while offering major improvements in terms of accuracy compared with other techniques, even when the training data is limited. We call the presented method Deep Emulator Network SEarch (DENSE).
In DENSE, we start by defining the search space of neural network architectures in a form of superarchitecture. A super-architecture consists of multiple nodes where the first node represents the simulation inputs and the last node the predicted simulation outputs. Each pair of nodes is connected by multiple groups of operations. Each group consists of a set of operations, such as 1×1 convolution, 3×3 convolution, or similar. Most of the operations, such as convolution, contain sets of train- able values that are commonly known as weights. In one forward calculation of the neural network (i.e. predicting a set of outputs given some input), only one operation per group is chosen according to its assigned probability. The probability of an operation being chosen is determined by a trainable value associated with the operation, which we call the network variable.
The super-architecture used in this work is shown in Figure 1. In every group in the super-architecture, there are convolutional layers with different kernel sizes and a zero layer that multiplies the input with zero. The option of having a zero layer and multiple convolutional layers enables the algorithm to choose an appropriate architecture complexity for a given problem. The superarchitecture also contains skip connections 11 (i.e. identity layers) to make it easier to train.
Training the neural network involves two update steps. In the first step, an operation for each group is chosen according to its probability, forming one possible architecture. The weights, w, of the selected operations are then updated to minimize the expected value of a defined loss function, L, between the predicted simulation output and the actual simulation output, where α 1 is the update step size, X t and y t are the input and output from the training dataset, a is an architecture sampled from the super-architecture A(b) according to the network variables, b. The loss function in this paper is defined as the Huber loss function 12 to minimize the effect of outlier data and increase robustness. The second update step involves evaluating the performance of various sampled architectures on the validation dataset, which is different from the training dataset employed in the first step. The performance of an architecture can be evaluated based on the loss function, inference time, power consumption, or some other combination of relevant criteria. The architectures are then ranked based on their performance and they are given rewards according to their rank. The network variables, b, are updated to increase the probability of the highranked architectures and decrease the probability of the low-ranked architectures. Formally, the update can be written as, 13 where α 2 is the update step size, R a is the reward value given to the architecture a based on its rank, and the function π(a|b) is the likelihood of the architecture a being chosen given the network variables b.
In this case, we ranked the architectures based on the Huber loss 12 on the validation dataset and gave the rewards to follow the zero-mean ranking function in CMA-ES. 14 The use of a zero-mean ranking function reduces the update variance and makes the update step scaleinvariant, increasing the robustness of the algorithm.

II. RESULTS
The combined update steps from equations (1) and (2), and the use of a ranking function in assigning rewards, make DENSE a robust algorithm to simultaneously learn the weights and find the right architecture for a given problem. To illustrate this, we apply the method to ten distinct scientific simulation cases: inelastic x-ray Thomson scattering (XRTS) in high-energydensity physics, 2,15 optical Thomson scattering (OTS) in laboratory astrophysics, 16 tokamak edge-localised modes diagnostics (ELMs) in fusion energy science, 3 x-ray emission spectroscopy (XES) in plasmas, 17,18 galaxy halo occupation distribution modelling (Halo) in astrophysics, 19 seismic tomography of the Shatsky Rise oceanic plateau (SeisTomo), 20 global aerosol-climate modelling using a general circulation model (GCM) in climate science, 21 oceanic pelagic stoichiometry modelling (MOPS) in biogeochemistry, 22 and neutron imaging (ICF JAG) and scalar measurements (ICF JAG Scalars) in inertial confinement fusion experiments. 23 The tested simulations have ranging numbers of scalar input parameters from 3 to 14, and span outputs from 0D (scalars) to multiple 2D signals (images). Datasets for simulations that run in less than 1 CPU-hour were   generated by running them 14,000 times with random sets of inputs. For more expensive simulations, the number of generated dataset is limited by the time and resources available (see table I for more detailed information). Each dataset is divided into three parts: 50% is used as the training dataset, 21% for validation, and 29% as the test dataset. The test dataset was used only to present the results in this paper, never to build the models. The hyperparameters were obtained by optimizing the result for OTS with CMA-ES, 14,24 then used for other cases without further tuning.

A. Emulator results
The example outputs of the trained emulators with DENSE are shown in Figure 2. We see that the output of the emulators generally matches closely the output of the actual simulations, even in MOPS and GCM where only 410 and 39 data points are available. When only a limited number of data is available, the choice of model architectures that give the right priors become important. Complex model architectures with bad priors could still fit the sampled training data, but are likely to overfit, giving bad accuracy on the out-of-samples data.
With DENSE, the model architecture that gives a good prior on the problem is automatically searched for by preferring models that can fit well the out-of-samples data (i.e. the validation dataset). Moreover, randomly choosing an operation in every layer acts as a regularizer in updating the weights during the training to minimize overfitting. These two advantages make DENSE suitable for learning to emulate a wide range of simulations including expensive ones where only a limited number of datasets can be generated.
While the simulations presented typically run in min-utes to days, the DENSE emulators can process multiple sets of input parameters in milliseconds to a few seconds with one CPU core, or even faster when using a GPU card. For the GCM simulation which takes about 1150 CPU-hours to run, the emulator speedup is a factor of 110 million on a like-for-like basis, and over 2 billion with a GPU card. The speed up achieved by DENSE emulators for each test case is shown in Figure 3(b). Compared with other non-deep learning techniques usually employed in building emulators, 25 the models found and trained by DENSE achieved the best results in all tested cases, and in most cases by a significant margin. The DENSE model also performs better than an emulator designed specifically for ICF JAG simulation 26 (i.e. CycleGAN in Figure 1 of the supplementary material). As seen in Figure 3(a), the emulators built by DENSE achieved a loss function up to 14 times lower than the best performing non-deep learning model.
We also compared DENSE with a manually-designed deep neural network model by an architecture from the super-architecture in Figure 1, where all the convolutional layers have size 3. The use of kernel size 3 and skip connections follows the idea of ResNet. 11 As with DENSE, the hyperparameters for manuallydesigned deep neural network were tuned to optimize its result for OTS. Shown in Figure 3(a) is the comparison between DENSE emulators and manually-designed deep neural network. Although in most cases their performances are similar, in some cases DENSE emulators can give a considerable improvement in terms of loss function, as can be seen in Halo and MOPS. This illustrates the robustness of DENSE emulators in a wide range of cases.

B. Emulators for inverse problems
The high fidelity emulators built by DENSE are sufficiently accurate to allow us to substitute simulations even for more advanced tasks such as solving the inverse problem. 27 To illustrate this, we took a simulated output signal randomly from the test dataset where the actual parameters are known. A small noise (∼ 1%) was added to the chosen signal to closely mimic a real observed signal from an experiment. Using this signal, we use an optimization algorithm 28 to retrieve the input parameters by minimizing the error between the sample signal and the output of the emulators, which we call it the retrieval error.
The results of the parameter retrieval using the emulators are compared with the retrieval using the simulations in Figures 4, where we plot the relative retrieval error histograms for two cases. We observe that the relative retrieval errors from the emulators are very similar to those from the simulations.
Without much loss in accuracy, the parameter retrieval with the emulators only takes about 800 ms with a single GPU card. This is to be compared with using the actual simulations which could take up to 2 days (XES) even when using 32 CPU cores. As the parameters can be retrieved in less than one second rather than in hours or days, one can envisage employing this technique for online diagnostics, real-time data interpretation, or even on-the-fly intelligent automation with an accuracy comparable to high-fidelity simulations that are by far too computationally expensive to be used directly. The use of DENSE emulators also enables parameter retrieval with resource-intensive simulations, such as MOPS and GCM, that were too expensive before.
In addition to interpreting signals and parameter retrieval, the emulators can also be used to significantly speed up Bayesian uncertainty quantification. 27 Bayesian uncertainty quantification is usually done by constructing Bayesian posterior distribution using Markov Chain Monte Carlo (MCMC) algorithms. However, the cost of running MCMC to collect sufficient samples from the Bayesian posterior distribution is typically much larger than the cost for parameter retrieval, and is often intractable in practice. Here we perform the Bayesian posterior sampling using an ensemble MCMC algorithm 29 with the same conditions as in ref. 27 In short, we collect all parameters sets that produce spectra that lie in a certain band around a central spectrum. Figure 5 compares the results of sampling the posterior distribution using simulations and emulators in two cases to interpret scattering and spectroscopy data. The posterior distribution sampled by the emulators are very similar to those by actual simulations, and we see that the emulators are well-suited to capture the correlations between parameters. However, note that while collecting 200,000 XES samples via simulations takes over 22 days, the sampling process with the emulators was completed in just a few seconds. Interestingly, building the emulator for XES from scratch only needs some 14,000 samples plus 8 hours for training, so the whole pipeline to build the emulator and use it for MCMC is still considerably faster than directly collecting 200,000 samples using the original simulation.

C. Prediction uncertainty
A final important advantage of building emulators with DENSE is the availability of an intrinsic estimator of the emulator's prediction uncertainty. The randomization of network architectures from the super-architecture can be seen as a special case of dropout. 30 Thus, by adapting the theory of prediction uncertainty with Monte Carlo (MC) dropout by ref., 31 we can show that DENSE emulators can produce the uncertainty of their outputs. The expected value and variance of a DENSE emulator prediction can be obtained by where a is the architecture sampled from the superarchitecture A based on the final values of the network parameters, b. Figure 6 shows the prediction uncertainty of the DENSE emulators, illustrating regions where they are either uncertain or confident in their predictions. As also observed in MC dropout, 32 the prediction uncertainty in this case can be smaller than the difference between the predicted and simulated output, indicating an overconfident prediction. This problem of overconfidence can be resolved by stopping the training of network variables early. The investigation of prediction uncertainty tuning will be the subject of future work.

A. Limitations
Although DENSE has the capability of emulating a wide range of simulations, it is still limited to simulations with a few scalar inputs. The DENSE algorithm has not been tested on building emulators with one, two, or three-dimensional direct inputs. One way to fit a simulation with multi-dimensional inputs to DENSE is by parameterizing the inputs using several scalar parameters (as done in ELMs) or employing a dimensionality reduction techniques. 33 Another limitation observed in DENSE is that it does not learn very well in regions where output variability is high, i.e. where changing the input parameters slightly changes the outputs significantly. This limitation is also observed in other cases of deep learning. 34 Due to the difficulty in learning, regions with high variability tend to require more samples than regions with low variability. This problem can thus be overcome by sampling the parameter space intelligently. 35

B. Applications
Our DENSE approach opens up numerous applications that require fast calculations. One of the main applications of DENSE emulators is real-time diagnostics of complex systems. For research in some fields, such as in plasma physics, diagnostics are usually done by solving an inverse problem using simulations, which involves running the simulations hundreds of times or more. By using the fast emulators instead of simulations, solving the inverse problem could be significantly accelerated without sacrificing the quality of the solutions, as shown in section II B. Real-time diagnostics are the key in automating operations of some machines with complex systems, such as tokamaks 36 and particle accelerators. 37 Another application is the optimization of a very expensive simulations where the simulations can only be executed a few times. The emulators can be used to make high-quality guesses about the optimum parameters which can then be tested using the expensive simu-lations. This idea of utilizing a cheap model for expensive simulation optimization has been used in surrogatemodel optimization 38 and Bayesian optimization. 39 By using a more accurate cheap model, such as DENSE emulators, the number of simulations to be executed can be much lower than in previous approaches. This is a potential avenue for future works.

IV. CONCLUSIONS
We have shown that Deep Emulator Network SEarch (DENSE), a method based on neural architecture search, can be used to robustly build fast and accurate emulators for various types of scientific simulations even with limited training data. The capability of DENSE to accurately emulate simulations with limited data makes the acceleration of very expensive simulations possible. With the achieved acceleration of up to 2 billion times, DENSE emulators enable tasks that were impossible before, such as real-time simulation-based diagnostics, uncertainty quantification, and extensive parameters exploration. This large acceleration in solving inverse problems removes the barriers of using high fidelity simulations in real-time measurements, opening up new types of online diagnostics in the future. The wide range of successful test cases presented here shows the generality of the method in speeding up simulations, enabling rapid idea testing and accelerating new discovery across the sciences and engineering.

A. Test cases
Here we provide a description of the test cases employed in the paper. A summary of the test case parameters is given in Table I. All 1D simulation outputs are sampled to 250 points for convenience.
X-ray Thomson scattering (XRTS): XRTS is a technique widely used in high-energy-density physics to extract plasma temperatures and densities by measuring the spectrum of an inelastically scattered x-ray pulse. 2, 40 The spectrum of the scattered light can be calculated from a set of plasma conditions and the scattering geometry; 15 this forms the simulation on which our emulator is based.
In this paper we consider the specific experimental case presented in ref. 2 where three parameters (temperature, ionization, and density) are to be retrieved from a spectrum of x-rays scattered at a 90-degree angle from a shock-compressed Beryllium plasma. The high-speed emulator for XRTS enables fast solutions to the inverse problem and access to statistical information on the intrinsic uncertainty of the experiment, allowing better control of the experimental optimization and information extraction.  Optical Thomson scattering (OTS): OTS is conceptually similar to XRTS except that it uses optical light instead of x-rays. Optical Thomson scattering is used in measuring electrons and ions temperatures and densities, as well as the flow speed of the plasma using the Doppler shift. 16 Here we considered retrieving five physical parameters (electron and ion temperatures, electron density, ionization, and flow speed) from a normalized scattered spectrum. The impact of building an emulator for OTS is similar to XRTS as it enables access to real-time data interpretation and to uncertainty quantification.
X-ray emission spectroscopy (XES): X-ray emission spectroscopy is a general technique to probe a system by measuring the emitted spectrum and matching it with simulations or theoretical models. In this paper, we consider the diagnostic case of a laser-driven implosion experiment at the National Ignition Facility, 17 using the spectroscopic model based on the CRETIN atomic kinetics code described in detail in ref. 18 Edge-Localized Modes (ELMs) diagnostics: Edge-localized modes are magnetohydrodynamic instabilities that occur in magnetically confined fusion plasmas with high confinement. 41 ELMs are explosive events and cause detrimental heat and particle loads on the plasma facing components of a tokamak. Various diagnostics are implemented to track ELMs. 42 Here, we compare the emulator to the predictive model 43 for the temporal evolution of the electron density profile using the transport code ASTRA. 44 The 14 input parameters in this case describe the diffusion, convective velocity, and particle source profiles 45 as a function of toroidal radial position and time. The output observable is the time-dependent electron density as a function of toroidal radial position.
Galaxy halo occupation distribution modelling (Halo): Here we considered simulations of the angularscale correlation of a galaxy population. The simulation software Halomod 46 was used to calculate the correlation function, angular scales, redshifts and the cosmological model as described in ref. 19 The input parameters used in this simulation follows the parameters described in ref. 47 Figure 6. The orange lines in 1D simulations resemble the distribution of the emulators outputs while the green lines are the outputs from the simulations.

FIG. 8. Examples of emulator uncertainties for test cases not shown in
Fast parameter retrieval is of particular interest here as often researchers are interested in extracting parameters for multiple different galaxy populations.
JAG model for Inertial Confinement Fusion (ICF JAG): JAG simulates the observables from an inertial confinement fusion experiment. 23 There are 5 input parameters in the case considered here. One simulation with 5 input parameters produces four two-dimensional images and 15 scalar values. Constructing fast and accurate emulators of the model allows for a more efficient exploration of the parameters space, and to obtain optimum sets of parameters more efficiently.
Shatsky Rise seismic tomography (SeisTomo): The case considered here is the seismic tomographic inversion problem of the Shatsky Rise oceanic plateau. 20 Given the input parameters that describe the initial velocity profile and regularization in the optimization, the software solves for the velocity structure and the crustal thicknesses as a function of position in the Shatsky Rise that matches the seismic reflection data. Performing uncertainty quantification of the tomographic inversion would require the execution of the software hundreds of thousand times which is very expensive without a fast emulator model. it's absorptivity) between 0.2 and 0.8. All of these factors contribute to the different absorption aerosol optical depths we emulate.
The cost of running the model for one year (including three months of spin-up) is about 1150 CPU-hours which is prohibitively expensive when generating thousands of training data points. However, we have shown that an accurate emulator over three parameters can be built with as few as 39 data points.
The Model of Oceanic Pelagic Stoichiometry (MOPS): The Model of Oceanic Pelagic Stoichiometry (MOPS) is a global ocean biogeochemical model 48 that simulates the cycling of nutrients (i.e., phosphorus, nitrogen), phytoplankton, zooplankton, dissolved oxygen and dissolved inorganic carbon. MOPS is coupled to the Transport Matrix Method (TMM), a computational framework for efficient advective-diffusive transport of ocean biogeochemical tracers. 22,49 In this study we use monthly mean transport matrices derived from a configuration of MITgcm 50 with a horizontal resolution of 2.8 • and 15 vertical levels. There are 6 MOPS input parameters considered in this case, whose definitions and ranges are described in an optimization study by ref. 51 Each simulation involves integrating the model for 3000 years to a steady state starting from a uniform spatial distribution of tracers. Annual mean 3D fields of oxygen, phosphorus, and nitrate at the end of the simulation are used for training. All code and relevant data used for the simulations are freely available. 49

B. Super-architecture
The super-architecture employed for most cases is shown in Figure 1. It consists of two fully connected layers at the beginning, followed by combinations of different types of convolutional layers. Rectified Linear Units are used as the nonlinearity. Most of the nodes contain 64 channels with a growing size of the signal from 4 to 250 at the end. For ICF JAG that has output signal of size 64 × 64, the size written in Figure 1 is capped at 64.
After the first two fully connected layers, each pair of adjacent nodes are connected by an identity layer and a selection of multiple convolutional layers. The identity layers serve as the skip connection for each layer which is always present in every sampled architecture. The member j in group i of layers is assigned a network parameters, b ij , and the probability of member j being selected among the group is determined by the softmax function, Each group of layers consist of one zero layer which multiplies all the inputs to zero. This provides an option for DENSE to remove the layer and make the neural network shallower.
For two nodes with different sizes, we added modified transposed convolutional layers in the group. In the modified transposed convolutional layers, the signal is expanded just as in the normal transposed convolutional layer, but the "holes" are filled with a trainable constant instead of zeros. The convolutional layers and identity layers between two nodes of different sizes operate by upsampling the previous node to match the size of the target node using the nearest neighbor.

C. Hyperparameters
The list of hyperparameters used in the algorithm are given in Table II. The hyperparameters were chosen to minimize the validation loss function for OTS using CMA-ES. 14,24 The same hyperparameters were used for all cases.

D. Other emulator builder methods
In training the emulators using non-deep learning methods, we employed the scikit-learn library 25 in Python. We use the default parameters suggested in the library to build the emulators for all cases.
For models that can only predict a single output, an ensemble of models are trained to predict different outputs in one simulation. For CycleGAN in the ICF JAG and ICF JAG Scalars cases, the model was trained and obtained according to ref. 23,26 E. Solving inverse problems with emulators The parameter retrieval processes for XRTS and Halo to produce Figure 4 were done using the CMA-ES 14 algorithm with population size 32 and 1200 maximum function evaluations. Default parameters suggested in ref. 14 were used. To give a fair results comparison, we used the same algorithm parameters and conditions in parameter retrievals via simulations and emulators.
For the Bayesian posterior sampling process, we employed the affine-invariant ensemble MCMC algorithm 29 with 256 walkers to collect 200,000 samples for XRTS and XES cases. The likelihood is uniform when the generated spectrum lies in a given band and it is zero when it lies outside the band. The band is 0.035 J/keV/ster in XES and 3.5% in XRTS as used in ref. 27 The justification of this form of likelihood is also provided in the supplementary materials of ref. 27

F. Derivation of prediction uncertainty
The randomization of the network architecture can be seen as a special case of dropout. Therefore, the derivation of the prediction uncertainty follows the derivation in Monte Carlo (MC) dropout very closely. 31 Going through the validation dataset this many times in one epoch Denote the input and output from the training dataset as X and Y respectively and write ω as the active weights in the neural network. Given the training data, X and Y, the posterior distribution of the weights in the neural network can be written as P(ω|X, Y) = P(Y|X, ω)P(ω) Z where Z is the normalization constant, P(Y|X, ω) is the likelihood of observing Y with weights ω, and P(ω) is the prior distribution of the weights. The posterior distribution of the weights are intractable, so we need to use variational inference in making the approximation to the posterior distribution. Let the variational distribution takes the form of Q(ω) where ω ij = w ij z ij , and (6) where w ij is the weights of layers j in group i, p ij is the probability of being selected as a function of the network variable, b ij , as written in equation 4. In order to get the best approximation of the posterior distribution P(ω|X, Y) with Q(ω), the Kullback-Leibler (KL) divergence should be minimized. The KL divergence to be minimized can be expressed as KL = − Q(ω) log [P(Y|X, ω)] dω + KL [Q(ω)||P(ω)] .

(8)
The integral on the first term on the right hand side can be approximated by drawing samples from ω n ∼ Q(ω) and performing the Monte Carlo integration on the negative log-likelihood, − log [P(Y|X, ω n )]. The second term on the right hand side is approximated to be ij pij l 2 ||w ij || 2 where l is the prior assumption of the length scale of the distribution. We can take the prior assumption of small length scale to be able to capture high variability region better and therefore making the second term small. With various approximation above, the KL divergence in equation 8 to be minimized can be expressed as By defining the negative log likelihood as the Huber loss function, we obtain that minimizing the KL divergence in the equation 9 is equivalent to minimizing the loss function in equations 1 and 2. Therefore, the optimized parameters after the training can be used to approximate the posterior distribution of the weights in the form of Q(ω).