Accurate Machine Learning Atmospheric Retrieval via a Neural Network Surrogate Model for Radiative Transfer

Atmospheric retrieval determines the properties of an atmosphere based on its measured spectrum. The low signal-to-noise ratio of exoplanet observations require a Bayesian approach to determine posterior probability distributions of each model parameter, given observed spectra. This inference is computationally expensive, as it requires many executions of a costly radiative transfer (RT) simulation for each set of sampled model parameters. Machine learning (ML) has recently been shown to provide a significant reduction in runtime for retrievals, mainly by training inverse ML models that predict parameter distributions, given observed spectra, albeit with reduced posterior accuracy. Here we present a novel approach to retrieval by training a forward ML surrogate model that predicts spectra given model parameters, providing a fast approximate RT simulation that can be used in a conventional Bayesian retrieval framework without significant loss of accuracy. We demonstrate our method on the emission spectrum of HD 189733 b and find good agreement with a traditional retrieval from the Bayesian Atmospheric Radiative Transfer (BART) code (Bhattacharyya coefficients of 0.9843--0.9972, with a mean of 0.9925, between 1D marginalized posteriors). This accuracy comes while still offering significant speed enhancements over traditional RT, albeit not as much as ML methods with lower posterior accuracy. Our method is ~9x faster per parallel chain than BART when run on an AMD EPYC 7402P central processing unit (CPU). Neural-network computation using an NVIDIA Titan Xp graphics processing unit is 90--180x faster per chain than BART on that CPU.


INTRODUCTION
Over the past decades, exoplanet studies have expanded from their detection to include characterization of their atmospheres via retrieval (see reviews by high signal-to-noise ratios (e.g., Koskinen et al. 2016), retrieval on noisy exoplanet spectra require Bayesian methods to provide a distribution of models that can explain the observed data. The posterior distribution resulting from a Bayesian retrieval places limits on each model parameter (within some range, an upper or lower limit, or equally probable for all values considered), informing the statistical significance of the result.
Bayesian retrieval methods involve evaluating thousands to millions of spectra, integrating over the observational bandpasses, and comparing to observations. Depending on model complexity, this requires hundreds to thousands of parallelizable compute hours, resulting in hours to days of runtime. Calculating the model spectra by solving the radiative transfer (RT) equation takes the vast majority of compute time.
Machine learning (ML) encompasses algorithms that learn representations of and uncover relationships within a collection of data samples. Deep learning (Goodfellow et al. 2016) is a subfield of ML that is based on neural networks, which are highly flexible differentiable functions that can be fit to data. Neural networks can classify images (e.g., Krizhevsky et al. 2012;Simonyan & Zisserman 2015;Szegedy et al. 2015;He et al. 2016;Huang et al. 2017), recognize speech (e.g., Chorowski et al. 2014;Amodei et al. 2016;Chan et al. 2016;Xiong et al. 2016), and translate between languages (e.g., Bahdanau et al. 2015;Ranzato et al. 2016;Sennrich et al. 2016;Wu et al. 2016). Neural networks consist of a hierarchy of layers that contain nodes performing weighted (non)linear transformations of their inputs, through a series of hidden layers, to the desired output. For example, for a retrieval, one might have the input layer receive the observed spectrum, hidden layers extract features, and the output layer predict the underlying atmospheric parameters. Neural-network training conventionally uses gradient-based optimization, iteratively adjusting the weights of the connections between nodes to minimize the error between the neural network's prediction and the desired output (Rumelhart et al. 1986).
Recent applications of ML to atmospheric retrieval reduced compute time from hundreds of hours to minutes or less. Márquez-Neila et al. (2018) presented a random forest of regression trees to build predictive distributions comparable to the posterior distributions of traditional Bayesian retrievals. Zingales & Waldmann (2018) utilized a generative adversarial network (GAN; Goodfellow et al. 2014) to retrieve distributions for model parameters. Waldmann & Griffith (2019) used a convolutional neural network (CNN) to map spatial and spectral features across Saturn. In Cobb et al. (2019), we introduced plan-net, an ensemble of Bayesian neural networks that uses parameter correlations to inform the uncertainty on retrieved parameters. Hayes et al. (2019) demonstrated a new approach to ML retrieval by applying k -means clustering to a principal component analysis of the observed spectrum to inform a standard Bayesian retrieval. Johnsen & Marley (2019) showed that a dense neural network can provide quick estimations of atmospheric properties.
While these approaches are promising, all except Hayes et al. (2019) suffer from a common deficiency: the reduction in computational time is accompanied by a reduction in posterior accuracy because they make significant approximations when performing Bayesian inference. For ML to become an integral part of atmospheric retrieval, the accuracy of the posterior approximation must be preserved.
The solution lies in simulation-based inference methods (Cranmer et al. 2019). While directly using a simulator (e.g., RT code) requires a consistent amount of compute time for each new inference (e.g., retrieval), surrogate models that emulate the simulator (e.g, neural networks) allow new data to be quickly evaluated after an upfront computational cost to train the surrogate (Kasim et al. 2021;Munk et al. 2019). ML-and simulation-based inference approaches have been successfully applied to a variety of tasks ranging from quantum chemistry (Gilmer et al. 2017) to particle physics (Brehmer et al. 2018;Baydin et al. 2019), resulting in significant reductions in compute cost with minimal loss in accuracy. Similar approaches have been used by the Earth science community to reduce the computational burden of forward modeling of spectra, retrieval of surface conditions, and atmospheric correction (e.g., Atzberger 2004;Garcia-Cuesta et al. 2009;Rivera et al. 2015;Verrelst et al. 2015;Gómez-Dans et al. 2016;Verrelst et al. 2016;Verrelst et al. 2017;Chernetskiy et al. 2018;Yin et al. 2018;Vicent et al. 2018;Bue et al. 2019).
Here we present a novel application of this approach to retrieval, which uses a neural-network model of RT within a Bayesian framework, apply it to the emission spectrum of HD 189733 b, and compare the results to a classical retrieval using the RT code that trained the surrogate model. Our general method is to (1) generate a data set over some parameter space, (2) train a surrogate forward model on the generated data, and (3) infer the inverse process via a Bayesian sampler ( Figure  1). Our approach circumvents the existing limitations of ML retrieval methods, which seek to directly learn the inverse process, by learning the forward, deterministic process (RT) and using the simulator surrogate in a standard inference pipeline. This approach preserves  Figure 1. Schematic diagram of our inverse modeling method, color-coded based on the scope of our software packages. MARGE (Section 2.3.1) generates a data set based on a deterministic, forward process (e.g., RT) and trains a surrogate model to approximate that process. Using the trained surrogate, HOMER (Section 2.3.2) infers the inverse process (e.g., atmospheric retrieval) by simulating many forward models and comparing them to the target data (e.g., an observed spectrum) in a Bayesian framework.
the accuracy of the Bayesian inference and, while slower than direct ML retrieval, is still much faster than computing RT.
In Section 2, we describe our approach in detail as well as introduce the software packages that implement the method. Section 3 discusses the results. Finally, Section 4 presents conclusions.

Model Training
To train a neural network for our approach (Figure 1), we generate a data set of spectra using the Bayesian Atmospheric Radiative Transfer (BART) code Cubillos et al. 2022;Blecic et al. 2022).
The atmospheric models consist of 100 log-uniform layers spanning pressures from 10 −8 to 100 bar, and we assume that the planet radius corresponds to a pressure of 0.1 bar. We use the five-parameter temperaturepressure profile, T (p), parameterization of Line et al. (2013): κ, the Planck mean infrared opacity; γ 1 and γ 2 , the ratios of the Planck mean visible and infrared opacities for each of two streams; α, which controls the contribution of the two streams; and β, which represents albedo, emissivity, and energy recirculation. We allow the radius (R p ), mass (M p ), and semimajor axis (a, adjusts the temperature at the top of the atmosphere due to stellar irradiation) of the planet to vary to encompass a range of hot Jupiters. We also include a free param- We allow a wide range of values without regard for physical plausibility, except by enforcing that (1) the H 2 /He ratio remains constant, (2) the total relative abundances of molecules in the atmosphere equals 1, and (3) the T (p) profile does not exceed the temperature range of the line lists. For example, this could lead to models with H 2 O at conditions where it dissociates (Arcangeli et al. 2018), though in the case of HD 189733 b, such models would be rejected with a high probability due to a poor fit. We note that these are not fundamental constraints of our approach; other constraints (e.g., enforcing thermochemical equilibrium, keeping elemental ratios within some range) may be used when generating the data set to train the surrogate model.
For opacities, we use HITEMP for H 2 O, CO, and CO 2 (Goorvitch 1994;Tashkun et al. 2003;Barber et al. 2006;Rothman et al. 2010), HITRAN for CH 4 (Niederer et al. 2008;Boudon et al. 2010;Nikitin et al. 2010Nikitin et al. , 2011Brown et al. 2013;Campargue et al. 2013;Daumont et al. 2013;Niederer et al. 2013;Nikitin et al. 2013;Rothman et al. 2013), and collision-induced absorptions of H 2 -H 2 and H 2 -He (Borysow et al. 2001;Borysow 2002;Abel et al. 2012;Richard et al. 2012). While there are newer line lists available with a greater number of lines (e.g., Hargreaves et al. 2020), these tests are meant to demonstrate consistency between neural-network-based and non-ML retrievals; we therefore use the setup described in Harrington et al. (2022), which uses this set of line lists to compare with previous studies. As our approach learns RT from a data set of spectra, it is not tied to any specific line lists.
To train our neural-network surrogate model, we generate 3,458,432 spectra, which are subdivided into 2,446,784 spectra (∼70%) for training, 689,536 spectra (∼20%) for validation, and 322,112 specra (∼10%) for testing (for considerations about data set size, see Appendix B). Model parameters come from the uniform distribution bound by the limits listed in Table 1. Each spectrum spans 280-7100 cm −1 at a resolution of 1.0 cm −1 and corresponds to the planet's emitted flux in erg s −1 cm −1 .
When processing the BART inputs/outputs for our neural network, we simplify the neural-network inputs by transforming the planet mass into the surface gravity, because this is a factor in the integration to calculate the spectrum. We assume a host star of radius 0.756 R with a temperature of 5000 K to calculate the T (p) profiles; because β acts as a scaling factor on the related term (Eq. 15 of Line et al. 2013), it can compensate for different stellar fluxes.
We normalize the input and output data by (1) taking the logarithm of the output spectra, (2) standardizing the inputs and (log) outputs by subtracting the training mean and dividing by the training standard deviation, and (3) scaling the standardized inputs and outputs to be in the range [-1, 1]. The neural network's input layer corresponds to the 12 inputs described above, with surface gravity replacing planetary mass. The hidden layers consist of Conv1d(256)L(0.05) -Dense(4096)L(0.05) -Dense(4096)L(0.05) -Dense(4096)L(0.05) -Dense(4096)L(0.05). Conv1d(n) indicates a 1D convolutional layer with n feature maps and a kernel size of 3. L(m) indicates a leaky rectified linear unit (ReLU) activation function with slope m for x < 0. The dense output layer has 6821 nodes, corresponding to the emitted spectrum over the defined wavenumber grid, with a ReLU activation function. For details on our model selection process, see Appendix A.
We train with a batch size of 64 using a mean-squarederror loss function, the Adam optimizer, and early stopping with a patience of 30 epochs based on the validation loss. We employ a cyclical learning rate that increases from 8 × 10 −6 to 5 × 10 −3 over 4 epochs, then decreases over the same window. After each complete cycle (8 epochs), the maximum learning rate decays by half the difference between the maximum and minimum learning rates (triangular2 policy, Smith 2015). The boundaries were chosen according to the method described in Smith (2015), except that we consider the loss instead of accuracy (see Appendix A for details). To evaluate the model's performance, we compute the rootmean-squared error (RMSE; comparable to the standard deviation of the differences between the predicted and true values) and the coefficient of determination (R 2 ; measures the linear correlation between the predicted and true values) between the data and the predictions, both for the full high-resolution output and the bandintegrated spectra corresponding to the observations of HD 189733 b.

Retrieval
Following the setup of Harrington et al. (2022), we perform a retrieval of the dayside atmosphere of HD 189733 b based on the measurements by the Hubble Space Telescope Near Infrared Camera MultiObject Spectrograph (Swain et al. 2009); Spitzer Space Telescope Infrared Spectrograph (IRS Grillmair et al. 2008); Spitzer InfraRed Array Camera (IRAC) channels 1 and 2 values of 0.1533 ± 0.0029% and 0.1886 ± 0.0071% (M. Line, priv. comm.); IRAC channel 3, IRS 16 µm photometry, and Multiband Imaging Photometer for Spitzer ; and IRAC channel 4 (Agol et al. 2010). We use a K2 solar-abundance Kurucz stellar model for the host star's emission (Castelli & Kurucz 2003). Using the differential evolution Markov chain with snooker updating algorithm of ter Braak & Vrugt (2008), 2,500,000 iterations are spread across 10 parallel chains, with a burn-in of 50,000 iterations per chain. When retrieving, we fix the semimajor axis to 0.031 au and the planetary radius and gravity at 0.1 bar to 1.138 R J and 2187.762 cm s −2 , respectively. The remaining neural-network input parameters are allowed to freely vary over the entire training space.
We compute the Bhattacharyya coefficient (Bhattacharyya 1943; Aherne et al. 1998) to compare the similarity of 1D marginalized posteriors, where a value of 0 indicates no overlap and a value of 1 indicates identical distributions. We choose this metric over others, such as the Kullback--Leibler divergence, because it is both intuitive to understand and defined for all distributions, even those that do not overlap.
For this investigation, we focus on a neural network as a faster replacement for an RT code for retrieval; we therefore only compare the results of BART and the neural-network approximation. For a discussion of these results in the context of previous retrievals of HD 189733 b's dayside atmosphere, see Harrington et al. (2022).

Software
We have developed two Python packages for this investigation. Both are open-source software, with full documentation, under the Reproducible Research Software License 1 . We encourage users to contribute to the code via pull requests on Github.

MARGE
The Machine learning Algorithm for Radiative transfer of Generated Exoplanets 2 (MARGE, Figure 1) (1) generates a data set based on a user-supplied function, (2) processes the generated data using a user-supplied function, and (3) trains, validates, and tests a userspecified neural-network architecture on a data set. The software package allows independent execution of any of the three modes, enabling a wide range of applications beyond exoplanet retrieval.
MARGE's design allows it to be applied to any deterministic model. For 1D data (such as spectra), MARGE's desired format is NumPy binary (.npy) files of 2D arrays, where each row corresponds to a single case. Each row is a data vector of the input parameters followed by the output data (e.g., spectrum). MARGE currently includes data-generation and -processing functions for BART as well as a data-processing function for the pypsg 3 Python interface (Soboczenski et al. 2018) for the NASA Planetary Spectrum Generator (Villanueva et al. 2018). We encourage users to contribute code via pull request to handle the processing of the inputs/outputs of other software packages.
We implement neural-network model training in Keras (version 2.2.4, Chollet et al. 2015), using a Tensorflow (version 1.13.1, Abadi et al. 2016) backend. MARGE enables early stopping by default to prevent overfitting, and the user can halt or resume training. MARGE allows for cyclical learning rates for more efficient training (Smith 2015, see also Appendix A). Users specify the model architecture details and the data location, and the software handles the data normalization, training, validation, and testing. MARGE preprocesses the data into Tensorflow's TFRecords format for efficient handling. Users have multiple options when preprocessing the data, which include taking the logarithm of the inputs and/or outputs, standardizing the data according to its mean and standard deviation, and/or scaling the data to be within a specified range. The mean and standard deviation of the data set are computed using Welford's method (Welford 1962) to avoid the need to load the entire data set into memory at once. MARGE computes the RMSE and R 2 for predictions on the validation and test sets to evaluate model performance; these metrics can optionally be calculated over integrated filter bandpasses. Finally, users may specify cases from the test set to plot the predicted and true spectra, with residuals (e.g., Figure 2). 2 MARGE is available at https://github.com/exosports/marge 3 https://gitlab.com/frontierdevelopmentlab/astrobiology/pypsg

HOMER
The Helper Of My Eternal Retrievals 4 (HOMER) utilizes a MARGE-trained model to infer the underlying inputs corresponding to some observed outputs (Figure 1). For its Bayesian framework, HOMER uses a Python wrapper for Markov chain and nested-sampling algorithms. The user specifies data, uncertainties, observational filters, a parameter space, and a few related inputs, which are passed to the Bayesian sampler to perform the inference. If available, a graphics processing unit (GPU) calculates neural-network predictions, though the central processing unit (CPU) can do this at the cost of increased runtime. For each iteration of the Bayesian inference, the trained neural network predicts on the proposed input parameters, which are modified as necessary (descale, denormalize, divide by the stellar spectrum, unit conversions, and/or integrated over bandpasses).
HOMER produces plots of the best-fit spectrum, 1D marginalized posteriors, 2D pairwise posteriors, and parameter history traces. The best-fit spectrum plot contains the data (with observational bandpasses indicated by uncertainties in x) and, if the DataSketches 5 library is installed, the 1σ, 2σ, and 3σ spectra. We use the streaming quantiles method of Karnin et al. (2016) as implemented in DataSketches to compute the 1-2-3σ spectra. This approach avoids needing to load all of the evaluated models at once, which could exceed system memory.
HOMER calculates the steps per effective independent sample (SPEIS) and effective sample size (ESS) as described in Harrington et al. (2022). Markov chains make small, correlated steps; while a chain may perform 100,000 iterations, if it takes 5000 steps to materialize a completely independent sample (steps per effective independent sample, SPEIS), then there have only been 20 effective samples. SPEIS is calculated from the autocorrelation function of each parameter for each chain; as a conservative estimate, we use the highest SPEIS value when calculating the ESS of the Bayesian inference to ensure we do not underestimate credible region uncertainties. By rearranging Equation 1 of Harrington et al. (2022), an uncertainty sĈ can be calculated on a given credible regionĈ based on the ESS: For example, if the ESS is 20, then the determined 68.27% credible region is actually the 68.27 ± 10% cred- ible region; running the inference for more iterations would increase the ESS and accordingly decrease the uncertainty on that credible region.
For easy comparison with other retrieval results, HOMER can overplot the 1D and 2D posteriors for multiple retrievals (e.g., Figure 3) and compute the Bhattacharyya coefficients between the 1D posteriors.

RESULTS & DISCUSSION
The normalized RMSE, normalized R 2 , and denormalized R 2 metrics for the MARGE-trained model on the test set for the high-resolution and band-integrated spectra are detailed in Tables 2 and 3, respectively. The normalized RMSE 1 and R 2 ∼1 indicate an accurate model for RT over the parameter space. Rather than waiting for early stopping to engage, we manually stopped training at 130 epochs because there was an insignificant improvement in the loss for dozens of epochs. For considerations on how this affects model performance, see Appendix B. Figure 2 shows example comparisons between the spectra predicted by MARGE and true spectra calculated by BART. While residuals tend to be around a few percent, they generally fluctuate around 0; when band-integrated over the observational filters, these errors usually cancel, as shown by the lower normalized RMSE and higher denormalized R 2 metrics (Tables 2  and 3). We observe that in some cases, there are regions where the spectrum is consistently over-or underestimated by a few percent (e.g., the top-left panel of Figure 2 around 4250 cm −1 ), thereby introducing error in the band-integrated value. However, the small deviations appear to have only a minor effect on this retrieval's result; see Section 3.1 for considerations when retrieving at high spectral resolutions or in cases where a traditional retrieval result is not available for comparison.
When applying HOMER to the emission spectrum of HD 189733 b, the results are consistent with BART. The retrieved T (p) profiles (bottom-left panel Figure 3) agree in the regions probed by the observations (<1 bar, bottom-right panel Figure 3) and only begin to deviate deeper in the atmosphere, where little to no signal is measured according to the contribution functions. By nature, HOMER cannot calculate contribution functions, as the MARGE model does not solve RT. While they could be included for each case in the training set, this would require significantly more compute resources. Computing the contribution functions for the single best-fit case using the RT code that trained MARGE more efficiently uses compute resources. Table 4 compares HOMER's retrieved 68.27% ("1σ"), 95.45% ("2σ"), and 99.73% ("3σ") credible regions with BART's retrieved credible regions. All regions closely agree, with differences attributable to a combination of uncertainty from a finite ESS and the neural network's imperfect nature (Figure 3, top-right panel). For CO, both BART and HOMER favor large abundances, though BART finds a greater probability for log abundances ≥ −2 (Figure 3, top-right panel). Despite this, the credible regions agree (Table 4). Similarly, HOMER favors lower values for γ 1 and α, though the resulting thermal profiles agree (Figure 3, bottom-left panel). Table 5 compares the SPEIS, ESS values, and associated uncertainties in the 1σ, 2σ, and 3σ credible regions for HOMER and BART. HOMER yields an SPEIS that is less than BART's, attributable to the conservative estimate of SPEIS as being the greatest among all chains and parameters. The highest SPEIS values fluctuate between runs, though the median SPEIS remains relatively constant. HOMER's median SPEIS of 627 and BART'S 615 better reflect the close agreement between the two retrievals. The Bhattacharyya coefficients between the 1D marginalized posteriors of HOMER and BART indicate agreement in the range 0.9843-0.9972, with a mean of 0.9925 (Table 6).

Limitations
HOMER's accuracy is, by nature, bound by the accuracy of the neural-network model. Model inaccuracies may slightly bias the results, as seen in the minor differ- Figure 2. Four comparisons of planetary emission spectra predicted by MARGE and calculated by BART. The smoothed curves use a Savitzky-Golay filter with a third-order polynomial across a window of 101 elements (100 cm −1 ). The purple color arises due to a detailed match between the red and blue spectra at high resolution. For the residuals, a black line is plotted at 0 to show regions where the neural network consistently over-or underpredicts the spectrum. A histogram of the high-resolution residuals appears to the right of the residual scatter plot, where the x-axis shows the probability density function (PDF) for the range of residual percentages. Top left: Case with T (p) profile that increases in temperature with altitude, with H2O and CO2 emission lines. Top right: Case with T (p) profile that decreases in temperature with altitude, with absorption primarily due to CH4 and H2O. Bottom left: Cases with T (p) profile that has an inversion around 0.1 bar, with CH4, CO, and CO2 absorption and emission features. Bottom right: Case with T (p) profile that is nearly isothermal at the pressures with sensitivity. ences between the posteriors of HOMER and BART. In our application, this discrepancy does not significantly affect the scientific conclusions at the spectral resolution of these observations for the current neural-network accuracy. However, this does not necessarily hold for all cases. It is possible that at higher resolutions this neural network's minor inaccuracies can drive the Bayesian sampler to radically different results. While in theory MARGE works for any spectral resolution, users will need to carefully select the model architecture to ensure that it can accurately model the spectra over the desired phase space. In situations lacking a physics-based retrieval to compare with, we advise testing to ensure that forward models are reasonably accurate over the retrieval's phase space, as some regions may not be sufficiently sampled for accurate predictions.

Compute Cost
The performance differences between HOMER and BART highlight HOMER's computational benefits. For a single Markov chain iteration, BART requires around 1.8 seconds per parallel chain on an AMD EPYC 7402P CPU, and multiple chains parallelize linearly across CPUs. By comparison, a single iteration with HOMER on the same CPU -which includes preprocessing (e.g., input normalization), prediction, post-processing (e.g., output denormalization, scaling according to the stellar spectrum), and band-integration -requires just ∼ 0.2 s for any number of chains fewer than 32. For a single chain, this is thus a ∼ 9× speedup. In our setup, we considered 10 parallel chains, translating to a ∼ 90× speedup for the function evaluated at each step of the Markov chain. Using an NVIDIA Titan Xp for predic- tions, the model evaluations at each Markov chain step require 0.01-0.02 seconds, a 10-20× speedup over predictions with the aforementioned CPU and, when using 10 parallel chains, a 900-1800× speedup over the same function evaluation in BART. We note that if BART were capable of utilizing a GPU, this speedup factor would be much less. Further investigation is necessary to determine whether HOMER offers speed improvements over a GPU-accelerated RT code. Nevertheless, the CPU results emphasize the speed improvements of our approach; the significant reduction in compute time enables retrievals to be executed on an average laptop. We note that the memory footprints for both approaches were comparable, though the parameters of each approach can strongly affect the required memory.
Here, the upfront compute cost to generate a data set and train a MARGE model is greater than the time to execute a single BART retrieval. In our example, we generated around 1.5-2× the number of spectra typically computed during a BART retrieval with small credible region uncertainties, plus a few dozen hours to train the neural network. However, additional retrievals within the trained parameter space execute in around 30 minutes on our GPU (less when neglecting to compute spectral quantiles). Thus, when carrying out even two retrievals within some shared phase space, the compute cost of MARGE+HOMER is less than two classical retrievals. In certain circumstances, such as where the radius and mass of the planet do not need to be var- ied (e.g., retrievals on different data sets of the same exoplanet), the number of spectra required to approximate the phase space accurately would be less than in our example, which may lead to MARGE+HOMER requiring less compute time than a single BART retrieval. Beyond the scope of retrieval, this approach could also provide a benefit to situations where it is advantageous to trade one set of computing resources for another. For example, spaceflight missions may be limited by thermal, power, and/or on-board computational resources; it may be advantageous to increase the total compute time, if it can decrease the power, thermal, and/or onboard computing required for the calculation. Another benefit of our approach is that the computecost scaling is less than linear: increasing from 10 to 256 chains results in just a ∼ 12.5×increase in compute cost per iteration when using a GPU, compared to 25.6× as much for BART. Additional chains enable faster exploration of the parameter space, and, if executed for the same number of iterations per chain, increases the ESS, which reduces the uncertainty in the bounds of credible regions . Thus, the combina-tion of MARGE and HOMER saves valuable compute resources when performing retrievals and reduces total runtime when performing multiple retrievals.

CONCLUSIONS
This paper presents a novel technique for ML retrieval that uses a neural-network model of RT within a Bayesian framework to reduce the runtime of a retrieval. Our open-source codes, MARGE and HOMER, provide the community with an easy-to-use implementation of this approach, and they are readily applicable to any forward model and its inversion -not strictly BART or even RT. They are available on Github with full user documentation.
Our method enables fast retrievals that are consistent with algorithms that solve the RT equation. The approach circumvents limitations of current ML retrieval models by using an RT surrogate in place of the RT code found in classical retrieval algorithms, thereby preserving the accuracy of the Bayesian inference. Like BART, MARGE and HOMER work at both the low resolutions of Spitzer and the high spectral resolutions of advanced ground-based spectrographs.
On our hardware, HOMER reduces the runtime of each MCMC iteration by ∼ 9× per parallel chain using a CPU and 90-180× per chain using a GPU, compared to BART. For the case of HD 189733 b, the Bhattacharyya coefficients of the 1D marginalized posteriors of BART and HOMER are >0.984, indicating a close match. This reduction in compute time enables using more realistic (and computationally expensive) RT models, such as those including scattering and condensates. Additionally, 3D retrievals with ∼200 cells could be completed in a matter of days.
Our approach is particularly well suited to planning studies for future observations, telescopes, and instruments, like the James Webb Space Telescope and the Large UltraViolet Optical InfraRed Surveyor (e.g., Rocchetto et al. 2016;Feng et al. 2018). Using a single MARGE model trained over the desired parameter space, HOMER can perform dozens to hundreds of retrievals in the time it takes to run a single retrieval with an RT solver.
More generally, our technique and tools can be applied to problems beyond the scope of this investigation.
MARGE provides a generalized method to train a neural network to model any deterministic process, while HOMER uses a MARGE-trained model to infer the inverse process. MARGE models could be trained for cloud/haze formation or photochemistry within general circulation models, for example. MARGE and HOMER could also be used to map gravitationally lensed galaxies (e.g., Perreault Levasseur et al. 2017).
With the plethora of ML retrieval algorithms that have emerged in recent years, standard data sets should be created and used for benchmarking. Ideally, such a data set would cover a wide range of wavelengths at high resolution and include all available opacity sources, scattering, clouds/hazes, and, in the case of terrestrial planets, surface properties. This would allow easy comparisons among current and future ML retrieval codes.
The Reproducible Research Compendium for this work is available for download 6 . It includes all of the code, configuration files, data, and plots used in support of this work.

ACKNOWLEDGMENTS
We gratefully acknowledge Jon Malkin, Lee Rhodes, and Edo Liberty for valuable contributions to the streaming quantiles method used in this work. We thank James Mang and Nicholas Susemiehl for useful feedback on the software developed for this work. We thank Michael Lund and the NASA Exoplanet Archive for preparing and hosting the online RRC. We thank Jennifer Adams for helpful discussions on radiative transfer emulation in Earth science. We also thank contributors to the Datasketches library, NumPy, SciPy, Matplotlib, Tensorflow, Keras, the Python Programming Language, the free and open-source community, and the NASA Astrophysics Data System for software and services. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This research was supported by the NASA Fellowship Activity under NASA Grant 80NSSC20K0682 and NASA Exoplanets Research Program grant NNX17AB62G. We thank FDL (http://www.frontierdevelopmentlab.org/) and SETI (https://www.seti.org) for making this collaboration possible.

A. DETERMINING MODEL ARCHITECTURE
To select an ideal neural-network architecture, a grid search must be carried out. This includes varying the types of hidden layers, number of hidden layers, number of nodes per layer, activation functions for each hidden layer, the parameter(s) for each activation function, and the learning rate.
We carried out a grid search by training each model on a subset of the total data set (171,456 training, 66,432 validation) for 20 epochs using a batch size of 64. We considered 3-5 dense and convolutional+pooling hidden layers; 64-4096 nodes; rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), hyperbolic tangent (tanh), and sigmoid activation functions. The convolutional layers use a kernel size of 3, and pooling layers use a size of 2. We consider four learning rate policies: (1) a cyclical rate ranging from 8 ×10 −6 -5 ×10 −3 where the maximum is reduced by half of the difference with the minimum every 8 epochs, (2) as before but ranging from 10 −5 -10 −3 , (3) a constant learning rate of 10 −5 , and (4) a constant 10 −3 . Policies 3 and 4 are only considered for models that do not include tanh or sigmoid activations. Table A.1 presents the minimum validation loss for each architecture considered. There is some randomness to the minimum validation loss due to the shuffling of the training data, so models with comparable minimum validation losses can be considered equivalent in performance. We chose to perform a more exhaustive grid search than is typical to emphasize certain points that can guide future investigations.
In general, we find that models with 4+ hidden layers with ReLU, leaky ReLU, and ELU activations achieve the lowest validation loss for this problem. The best-performing models all have a 1D convolutional layer as the first hidden layer, while the worst-performing models use tanh or sigmoid activations. Additional layers generally lead to reductions in the loss. Cases where this does not occur can be attributed to the learning rate policy (e.g., models 25-27, LR1 vs. LR2), highlighting the importance of properly selecting the policy (described below). Minor variations (e.g., models 28-30 LR1) can be attributed to training randomness. ReLU and leaky ReLU activations tend to have similar performance; leaky ReLU with a small parameter tends to perform equivalently or better than ReLU (e.g., models 32 and 33). While these results point to deep architectures as optimal configurations for this application of ML RT, tests varying the spectral resolution, wavelength range, etc. are necessary to definitively confirm if such variations change the optimal architecture(s). A future investigation should consider this in more detail.
Based on this grid search, we selected model 40. While a similar architecture with ELU activations performed equivalently (model 37), it took longer to train per epoch. Additionally, we found that the retrieval accuracy did not significantly change below some threshold validation loss (see Appendix B), so training time is a more important consideration than minor differences in minimum validation loss.
Our results show that, when the learning rate range is properly chosen, cyclical learning rates outperform constant learning rates, confirming the findings of Smith (2015) for this particular problem. Select models do not follow this trend (e.g., models 31, 35, 38), which is likely attributable to the small number of epochs considered in this grid search.
In Section 2, we note that we make the learning rate policy selection as described in Smith (2015), except based on the loss instead of the accuracy. Our selection process is to perform a 'range test' by training the model over a few epochs using a learning rate policy that constantly increases from a very small rate (e.g., 10 −7 ) to a large rate (e.g., 10 −1 ). Looking at a plot of loss vs. learning rate (e.g., Figure A.1), the learning rate range can be deduced based on when the loss begins decreasing (minimum learning rate) and when the loss begins increasing (maximum learning rate). In practice, we find more efficient training using a range that is slightly interior to the extrema determined via the plot. This is analogous to the method described in Smith (2015), except that it is more straightforward to determine the learning rate boundaries.    Models trained for 20 epochs in batches of 64. * Triangular2 learning rate policy ranging from 8 × 10 −6 -5 × 10 −3 with a complete cycle spanning 8 epochs. † Like LR1, but ranging from 10 −5 -10 −3 . ‡ Constant learning rate of 10 −5 . § Constant learning rate of 10 −3 . a Dense layer with n nodes, D(n) b ReLU activation c Sigmoid activation d tanh activation e ELU activation E(α), with scaling parameter α Min. Val. Loss (×10 5 ) LR1 * LR2 † LR3 ‡ LR4 § f Leaky ReLU activation L(m), with a slope of m for x < 0 g Convolution1D layer with a kernel size of 3 and n nodes, C(n) h MaxPooling1D layer M(s), with a pooling size s Figure A.1. Example of a range test. The learning rate begins at a value too small to make noticeable changes to the weights of the model. At a learning rate of ∼ 4 × 10 −4 , the loss begins to decrease, indicating that the model has begun learning. However, at a learning rate ∼ 5 × 10 −3 , the loss begins to behave erratically, and it becomes very large at a learning rate of 10 −2 . From this, a learning rate policy varying between 6 × 10 −4 and 4 × 10 −3 would likely perform well for this architecture.

B. DATA SET SIZE CONSIDERATIONS
To briefly investigate the effect of data set size for our problem, we consider three models in addition to that presented in Section 2 ("Main"). The additional models are trained on around 25% of the total data set (614,208 training, 171,584 validation, 78,848 testing). Two models were trained according to the same learning rate policy as described in Section 2, for 187 and 500 epochs ("Sub1a" and "Sub1b", respectively). The third model was trained according to the LR2 learning rate policy described in Appendix A for 500 epochs ("Sub2"). All other setup parameters (e.g., data normalization) match those described in Section 2. Table B.1 compares the normalized RMSE and denormalized R 2 test-set metrics over the high-resolution spectra, as well as the Bhattacharyya coefficients for the retrieved 1D marginalized posteriors. Based on the differences between models Sub1a and Sub1b (which only differ in the number of epochs trained), it can be concluded that manually stopping training once the loss begins to negligibly change does not have a major effect on the model performance. Models Sub1b and Sub2 (which only differ in learning rate policies) illustrate the importance of selecting the learning rate policy. However, both of these effects are negligible compared to those of the data set size: the differences among models Sub1a, Sub1b, and Sub2 are smaller than the differences between model Main and Sub2 (the bestperforming Sub model). While Main and Sub2 underwent similar numbers of total training steps, Main outperforms Sub2. These results motivate the generation of large, comprehensive data sets of spectra to train surrogate RT models, though further research into how data set size, number of inputs/outputs, and architecture complexity influence model performance is needed to inform optimal the data set sizes for future investigations.