DIGS: Deep Inference of Galaxy Spectra with Neural Posterior Estimation

With the advent of billion-galaxy surveys with complex data, the need of the hour is to efficiently model galaxy spectral energy distributions (SEDs) with robust uncertainty quantification. The combination of Simulation-Based inference (SBI) and amortized Neural Posterior Estimation (NPE) has been successfully used to analyse simulated and real galaxy photometry both precisely and efficiently. In this work, we utilise this combination and build on existing literature to analyse simulated noisy galaxy spectra. Here, we demonstrate a proof-of-concept study of spectra that is a) an efficient analysis of galaxy SEDs and inference of galaxy parameters with physically interpretable uncertainties; and b) amortized calculations of posterior distributions of said galaxy parameters at the modest cost of a few galaxy fits with MCMC methods. We utilise the SED generator and inference framework Prospector to generate simulated spectra, and train a dataset of 2$\times$10$^6$ spectra (corresponding to a 5-parameter SED model) with NPE. We show that SBI -- with its combination of fast and amortized posterior estimations -- is capable of inferring accurate galaxy stellar masses and metallicities. Our uncertainty constraints are comparable to or moderately weaker than traditional inverse-modeling with Bayesian MCMC methods (e.g., 0.17 and 0.26 dex in stellar mass and metallicity for a given galaxy, respectively). We also find that our inference framework conducts rapid SED inference (0.9-1.2$\times$10$^5$ galaxy spectra via SBI/SNPE at the cost of 1 MCMC-based fit). With this work, we set the stage for further work that focuses of SED fitting of galaxy spectra with SBI, in the era of JWST galaxy survey programs and the wide-field Roman Space Telescope spectroscopic surveys.


Introduction
Understanding the mass assembly of galaxies across cosmic time is a major goal of modern extragalactic astrophysics; solving this question sheds light onto a galaxy's underlying formation and evolution mechanism. Galaxies are well-characterized by features like stellar mass, chemical composition, dust attenuation, current star formation rate, and the star formation history. These parameters can be accurately inferred from a galaxy's spectral energy distribution (SED).
Within the last two decades, photometry-based SED fitting has become a pivotal method to measure the above properties. Ground-based telescopes have been used extensively for large multi-wavelength galaxy surveys -e.g., Sloan Digital Sky Survey (SDSS, Ahumada et al. 2020), Dark Energy Survey (DES, Abbott et al. 2018), and DESI Legacy Imaging Surveys (Dey et al., 2019) -producing large high-quality complex datasets. However, SED studies relying on photometry alone are subject to many challenges, such as the age-metallicity-dust degeneracy (Worthey, 1994;Ferreras et al., 1999).
SED fitting using spectra mitigate this challenge significantly, especially to constrain galaxy star formation rates, formation timescales, and metallicities, with measurement of absorption line indices and emission line strengths (e.g., Worthey 1994;Leja et al. 2019b and references therein.) There are several cutting-edge SED-fitting pipelines with Bayesian frameworks that use Markov Chain Monte Carlo (MCMC) methods to infer galaxy properties -e.g., CIGALE, MAGPHYS, and Prospector (Leja et al., 2017a;Carnall et al., 2019;Leja et al., 2019a;Johnson et al., 2021). However, the computational time needed by the fitting algorithms in these frameworks -e.g., MCMC or nested sampling has been recently is a major bottleneck. With the next generation of telescopes, like the Vera Rubin Observatory/ Legacy Survey of Space and Time (VRO-LSST; Ivezić et al. 2019) and the Dark Energy Spectroscopic Instrument (DESI; DESI Collaboration et al. 2016), tens of millions of optical and infrared galaxy photometry and spectra will be measured. Highly-resolved spectra have higher information content and require more complex and flexible models for fitting. Finally, datasets with spaxels (or spatial and spectral pixels) are increasing in number, e.g., data units in integral-fieldunit (IFU) spectrograph observations, with JWST IFU spectroscopy of a gravitationally lensed galaxy (Khullar et al., 2021), or the SDSS-MaNGA survey observations of star forming galaxies.
Fitting a large number of free parameters is computationally expensive. A 5-parameter spectral model within a typical SED fitting code -stellar mass, dust attenuation, metallicity, age -converges to a 10 4 4 × 10 3 6 × 10 3 Wavelength (Å) 10 0 10 1 Relative Flux (maggies) Figure 1: Five Galaxy SEDs randomly sampled from the training set used in this work, with 5 different values of the parameter vector θ. The SEDs are normalized to their median: this respects the fact that the shape of the SED is primarily dictated by the stellar metallicity, age and dust properties, while stellar mass is usually correlated with the amplitude of the SED flux. The SEDs marked by solid lines have nebular emission features (such as the Hβ, [OIII]5007Å and Hα), while dotted lines represent SEDs of galaxies with absorption features and continuum breaks, such as 4000Å . best-fit model solution in 2-10 CPU hours. Moreover, each galaxy spectra requires its own separate inference chains. With the advent of spectroscopic surveys that may potentially observe tens of millions of galaxy spectra, the cost of modeling spectra quickly becomes prohibitive. Thus, the need of the hour is to quickly and reliably deduce the physical parameters of galaxies in large surveys, as well as to rapidly analyse thousands of spectra from within one galaxy.
Deep learning applied to galaxy SED fitting allows, in principle, a mapping between an observed SED and the target galaxy's star formation history, with several studies in the last few years alone demonstrating the efficacy of new methodologies (Lovell et al., 2019;Leung and Bovy, 2019;Hahn and Melchior, 2022). These methods allow regression of galaxy parameters, albeit without any uncertainty quantification.
Recent developments in deep learning methods have focused on uncertainty quantification. For example, Deep Ensembles involve retraining a network many times with different initializations to enable uncertainty quantification of the model outputs. (Ganaie et al., 2021). Furthermore, with Bayesian Neural Networks (BNNs), deterministic weights of the model are replaced by probability distributions, which allows the model to provide uncertainties of  (Kacprzak et al., 2018;Alsing et al., 2019;Zhang et al., 2021;Zhao et al., 2022;Huppenkothen and Bachetti, 2022).
SBI of galaxy spectra is an exciting opportunity because these models a) do not require the expression of an explicit likelihood, and b) can calculate approximate posterior distributions of galaxy parameters efficiently, allowing for robust uncertainty quantification. Recent work has shown that photometric SED data can be used with SBI for fast inference, e.g., Hahn andMelchior 2022, and(Robeyns et al. (2022)).
In this work, we demonstrate a proof-of-concept SBI framework to analyse galaxy spectra and recover posteriors efficiently for a 5-parameter SED model. We expect to scale this work to upcoming galaxy spectroscopic surveys -like the Dark Energy Spectroscopy Instrument DESI Collaboration et al. (2016) and the Roman Space Telescope High-Latitude Survey Wang et al. (2022a) -and for SED models with more complex and flexible descriptions of galaxy properties.
In Section 2, we describe the simulated data used in this study.
In Section 3, we describe the SBI network architecture and analysis, and in Section 4, we share our results and next steps. The fiducial cosmology model used for all distance measurements as well as other cosmological values assumes a standard flat cold dark matter universe with a cosmological constant (ΛCDM), corresponding to WMAP9 observations (Hinshaw et al., 2013).

Data
We use Prospector (Johnson et al., 2021) to generate simulated SEDs of galaxies. Prospector relies on Markov Chain Monte Carlo (MCMC) sampling for stellar population synthesis (SPS) and parameter inference. It is based on the Python-FSPS framework, with the MILES stellar spectral library and the MIST set of isochrones (Conroy and Gunn, 2010;Leja et al., 2017b;Foreman-Mackey et al., 2013;Falcón-Barroso et al., 2011;Choi et al., 2016).
We generate a training set of 10000 restframe SEDs using a 5-parameter model, with a delayed, exponentially declining (i.e., delayed-tau) star formation history. The SFH -star formation rate as a function of time -is: where SFR is the star formation rate, t is the epoch at which the star formation history is being evaluated, and τ corresponds to the e-folding time in the delayed-τ SFH model. This model incorporates physical priors used in survey studies of galaxy mass assembly (e.g., see Belli et al. 2019). We sample the total stellar mass (M * ), the stellar metallicity log(Z/Z sol ), a delayed and exponentially declining SFH with age t age , the e-folding time τ age (τ from here on) , and dust attenuation (τ λ,2 , corresponding to the optical depth of diffuse dust at 5500Å; from here on referred to as dust.) Each parameter vector θ comprises these five parameters. Our SED model assumes a Kroupa Initial Mass Function (IMF) (Kroupa, 2001). Nebular continuum and line emission are also present.
See Table 1 for a description of the prior range for each model parameter. See Figure 1 for 5 example SEDs/spectra from our training set, highlighting the emission and absorption features depending on the type of galaxy (young vs old stellar populations, respectively).
We smooth and resample the simulated SEDs to resemble a medium-resolution spectroscopic survey using Prospector's internal resampling utility. We use a velocity smoothing parameter (σ v (km s −1 ), to account for the contribution of Doppler broadening by stellar velocities, and resolution of the model libraries), and fix it at 350 kms −1 ; this smoothing corresponds to R∼100, similar to a deep galaxy survey conducted with the R∼100 JWST/NIRSpec prism (Zackrisson et al., 2017). This results in a training set with each galaxy SED sampling rest-frame 3750-9500Å, with 138 flux  Figure 2: The architecture used in this work to infer galaxy SED properties with spectroscopic data. We use a 5-parameter model and a training set with realistic spectra, that is trained by an NPE to generate approximate posteriors. elements for each SED, which is the data vector x corresponding to each θ.
To map our training set to observations, we add stochasticity to the training set in the form of Gaussian noise, to the level of 5% of the flux at a given wavelength, representative of real data at signal-tonoise ratio SNR∼20. We conduct data augmentation to scale 10000 noise-less spectra in our base training set to 2×10 6 spectra with Gaussian noise used in our SBI framework (see Section 3).
We also create an additional test of pectra conduct posterior diagnostics. We generate a set of 1000 spectra with noise, from the same parameter prior range.

Simulation-Based Inference
Our objective is to calculate posterior distributions p(θ|x) of the galaxy parameters derived from a typical SED analysis, where θ is the set of galaxy properties, and x represents the galaxy spectra. We do this by training our SBI model on the large stochastically-sampled training set of SEDs described in Section 2. We utilise Neural Posterior Estimation (NPE) (Papamakarios and Murray, 2016;Greenberg et al., 2019)) which relies on neural networks to train on simulated SEDs with realistic noise, and allow us to estimate "amortized" posterior distributions. SBI/NPE requires computational time in advance of the actual inference, and evaluates the posterior for different observations without having to re-run inference (this is known as amortization). This "amortized" calculation of posteriors then allows us to infer the posteriors of a "real" galaxy with computational time < 1s. For more details and examples of amortized neural network-based posterior estimation, see Greenberg et al. (2019) or Section 2 of Hahn and Melchior (2022). We provide a short summary below. NPE uses "normalizing flows" (Tabak and Turner, 2013) as a density estimator, which employs an invertible bijective transformation to map a complex distribution (i.e., the true posterior distribution in SED model parameter space) to a simpler and faster-to-calculate distribution (often Gaussian, or a combination of Gaussians). This results in the calculation of approximate posterior distributions, that are assumed to be a good approximation of the underlying posterior distributions of parameters. In particular, we use Masked Autoregressive Flow analyses. We find that SBI and MCMC recover SED parameters θ accurately, while SBI constraints for each parameter (for marginalized posteriors) is similar or moderately weaker than MCMC constraints. (Bottom) Best-fit SEDs from SBI/NPE (black) and MCMC (blue) analyses, and percentage residuals. Note that the difference in observed spectrum (orange) and NPE best-fit spectrum (black) is < 3%, smaller than the Gaussian noise applied to each simulated spectrum.

This work
See Figure 2 for a depiction of the SBI architecture used in this work. We use a supervised learning pipeline within an SBI framework via the Macke Lab sbi toolbox (Tejero-Cantero et al., 2020). To demonstrate a proof-of-concept, we train on 2×10 6 simulations (where each simulation is a noise-added version of an SED in our training set) in an NPE framework. We use 25 hidden units and 10 transform layers without an embedding network in this framework. Our model trains on features in the raw simulated data; this model converges after 87 epochs and takes ∼ 14 hours to train. This analysis generates the set of approximate posterior distributions for our 5 parameter SED model. We also test on other combinations of hidden units and transform layers, and choose the above as the fiducial choice with more robust results.

Posterior Diagnostics
We evaluate the results using a variety of statistical and diagnostic mechanisms. First, to test the precision of our framework, we compare the recovered SED parameter values with the true values θ of the parameters from our test set. Secondly, to test the health of the calculated posteriors, we also perform posterior predictive checks (PPCs) and simulationbased calibration (SBC) checks (Talts et al., 2018).
PPCs validate that the model SEDs corresponding to the distribution of θ values in a given posterior fall within the allowed range. We do this by crosschecking whether our best-fit model-based spectrum looks similar to observed data x (see Figure 3). SBCs provide a quantitative insight into whether the posterior uncertainties are balanced. In this test, we sample θ i values from the priors, and simulate observations (using our simulator) from these parameters. Following this, we perform inference given each of these observations, which generates SBC posteriors of their own. For a healthy posterior, the SBC ranks of ground truth parameters under the inferred posteriors should follow a uniform distribution (rank plots aid in visually confirming this; see Figure  5).
Finally, we compare our results to an MCMC analysis of representative SEDs from the test set.

MCMC Analysis
We fit the same 5-parameter SED model to representative galaxy SEDs in the test set using the inference framework Prospector, which calculate Markov-Chain Monte Carlo (MCMC)-based posterior distributions.
We assume the following likelihood function: where (x, σ i ) are n independent spectral flux elements assumed to be drawn from a Gaussian distribution, and m(θ) corresponds to the model spectrum for a given parameter set θ. We use the same θ prior range and shape as the SBI/NPE analysis, in order to calculate the posterior p(θ|x). We use emcee Foreman- Mackey et al. (2013) to conduct the MCMC posterior sampling with 128 walkers, 128 iterations and a burn-in with the step set [4096,4096,2048,512]. Note that non-Gaussian or correlated uncertainties are seen in spectral datasets (e.g., magnitude upper limits in the case of non-detections), which are not accurately captured by the above likelihood, making a "likelihood-free inference" like SBI the ideal choice for this analysis. The results from the SBI analysis, posterior diagnostics, and MCMC comparison are shown in Section 4.

Computing Resources
For our SBI analysis, we use the Python 3 Google Compute Engine backend (with the CPU processor AMD EPYC 7B12), which for our network architecture takes ∼14 CPU hours to train 2×10 6 simulated spectra with noise. For every subsequent posterior estimation, this setup takes ∼0.3s.
For our serialized MCMC calculations, we utilise Prospector runtime on a 2.7 GHz Quad-Core Intel Core i7 processor, which takes ∼14 CPU hours to converge.

Results
In this proof-of-concept analysis, we test out to set whether an SBI framework can train on realistic noisy galaxy spectra to estimate amortized posteriors robustly. To test the efficacy of our SBI framework, we use a test set of 1000 randomly sampled SEDs from our prior range (see Table 1 and Section 2).
One such result is shown in the top panel of Figure 3, for a galaxy with logM tot = 10.51 (M ), log(Z/Z ) = -0.41, age = 1.63 Gyr, τ = 0.11 Gyr, and dust = 0.65 (a metal-poor dusty galaxy). We plot pairwise posterior distributions estimated from both the MCMC (in blue) and SBI (or neural posterior estimation; NPE, in orange) in order to compare constraints across the 5-parameter SED model. The truth values are plotted with black dotted lines. In the bottom panel of Figure 3, we show the maximum a posteriori (MAP) SED models and model residuals from the SBI/NPE (black) and MCMC (blue) analyses. Also overplotted are 1000 randomly sampled SEDs from the posteriors (in grey).
We find excellent agreement between the median values of parameters across MCMC and SBI/NPE posteriors (when marginalized over other parameters); these values are also accurate relative to the true parameter values θ. We also observe that the age and metallicity constraints are similar in both analyses for this test galaxy, while MCMC mass and dust estimation is more precise relative to SBI/NPE. This is the first demonstration that the proof-of-concept analysis presented here is effective at recovering galaxy SED parameters with spectroscopic observations.
On running inference on a sample of 1000 test galaxies sampled from the 16th-84th percentile range of priors in this study, we find accurate recovery of SED parameters. See Fig. 4 for a comparison between true and recovered values of each parameter, as well as χ plots to show goodnessof-fit across the simulated spectroscopic dataset.
The recovered values here are the 50th percentile parameter values, and the uncertainties correspond to confidence intervals between 16th-84th percentile in the posteriors. Here, we demonstrate accurate and precise parameter recovery across the majority of the prior range, specifically for stellar mass (logM), while the constraints on metallicity and age are seen to be wider and less precise. We also note that in our analysis, the recovered parameters are the most biased at the edges of the prior ranges, which indicates that the underlying posterior distribution is not being captured in these parameter ranges. For example, the 16th, 50th and 84th percentile of parameter values for a given galaxy are not accurate descriptors of the underlying posteriors near the edges of the prior range. This can be potentially solved by training on a spectroscopic dataset sampled from a prior range marginally wider than the target spectroscopic survey. We also find no substantial difference in the quality of parameter recovery for star forming galaxies (emission line galaxies; median values of the age parameter < 0.75 Gyr), or quiescent systems (galaxies with spectra containing strong absorption line indices; median values of the age parameter > 2.5 Gyr).
We also run extensive PPC and SBC checks to test the accuracy and precision of our SED parameter values, where we find that the posterior distributions in this analysis are well converged. See Figure 5 for rank distributions for each parametera healthy posterior follows uniform distribution (nonuniform distributions indicates a poorly calibrated posterior; Talts et al. 2018). This demonstration of well-calibrated uncertainties in our SBI analysis is confirmed in Figure 6, where we plot the rank cumulative distribution function (CDF) on the left panel. In both figures, the grey region corresponds to the 95% confidence interval of a uniform distribution, which our parameter rank distributions follow.
The right panel of Figure 6 shows the probability coverage curve of each SED parameter θ.
The principle behind a probability coverage plot is as follows: a well-calibrated posterior estimator will produce -for an ensemble of SEDs in the test setparameter uncertainties that accurately reflect the true underlying uncertainty in the ensemble, e.g. a 68% posterior volume will contain 68% of the true SED parameter values of the test ensemble. Generalizing from this, by plotting the fraction of SED parameters in the test set which fall within the posterior volume as a function of the posterior volume, we obtain curves such as those seen in the right panel of Figure 6. A wellcalibrated estimator will produce probability coverage curves that closely trace the diagonal dashed line, while posterior estimates that are under-confident (i.e. overestimate uncertainties) will produce curves that fall on the upper-left part of the plot.
For our 5-parameter model, we see that the NPE has fairly well-calibrated uncertainty predictions for the age and τ parameters. On the other hand, it tends to over-predict the posterior uncertainties for three parameters -stellar mass, dust and metallicity. This is consistent with the results in Fig. 4 -in the goodness-of-fits plots, the scatter in the differences between the predicted and true parameters values across the test set is smaller than what their error bars would suggest. However, we are encouraged by the fact that the NPE analysis a) accurately predicts the median best-fit values across the test set for those 3 parameters, and b) in most scientific applications, overpredicting the uncertainties in an analysis is preferable than the alternative. We aim to continue to further improve the posterior uncertainty calibrations in future work.
We also demonstrate here a significant improvement in inference speeds compared with MCMC inference. As mentioned above, the SBI/NPE model uses 25 hidden units and 10 transform layers: this computation takes ∼14 hours to train on a CPU, and ∼0.3s per posterior estimation thereafter (MCMC calculation for a single galaxy takes ∼24 hours in our setup). Accounting for the cost of training, we can effectively infer accurate posteriors of 0.9-1.2×10 5 galaxy spectra via SBI/SNPE at the cost of 1 MCMC-based fit. This demonstrates that the amortization of posterior estimation in SBI with accurate recovery of SED parameters is the biggest advantage of our proof-of-concept analysis.

Caveats
This work presents SNPE analyses on spectra generated using a) empirical templates within FSPS (MILES spectral libraries, and MIST isochrones; Falcón- Barroso et al. 2011;Choi et al. 2016), and b) with an assumption of uniform (or nearly uniform) SNR and Gaussian noise across wavelength and for each galaxy spectrum.
These assumptions have impact on both SNPE and MCMC inference of galaxy parameters.
For example, the age-metallicity-dust degeneracy is a systematic that is limited by the information content of the templates, and the wavelength range sampled by the spectra that may or may not contain discriminating information. This affects both MCMC and SNPE analyses, and is expected to be mitigated with upcoming spectroscopic surveys of statistical samples of galaxies (DESI, or the Prime Focus Spectrograph; Greene et al. 2022) as well as simulation studies such as UniverseMachine Behroozi et al. (2013) and FIRE Ma et al. (2018).
In addition, SED inference using spectra is only weakly impacted by small (< 10 − 20%) variations in SNRs across wavelengths (especially in the stellar continuum portions of a given spectrum. Several studies analysing quiescent galaxy spectra Carnall et al. (2019); Khullar et al. (2022); Tacchella et al. (2022) as well as photometric SED fitting and parameter recovery C. et al. (2015); Leja et al. (2017a) seem to point towards this trend. Star-forming galaxies (with strong emission lines) have wavelength regions with peaked SNRs, that improves constraints on parameters such as instantaneous star formation rates (SFRs) and ages. This biases our inferences in favor of young stellar populations with emission line spectra, albeit weakly. In Figure 4, we observe that both older and younger stellar populations are recovered with similar precision. Finally, we expect spectroscopic surveys like DESI, PFS and the Roman High-Latitude Survey to improve this systematic uncertainty, as they attempt to reach nearly uniform SNRs across wavelength.

Conclusion and Next Steps
In this work, we demonstrate a proof-of-concept for amortized neural posterior estimation with an SBI framework, that utilizes simulated low/mediumresolution galaxy spectra. This is the first-of-itskind demonstration of this technique on spectra, that will enable precise and rapid estimation of galaxy parameter posteriors for billion-galaxy surveys. We also show here a significant improvement in inference speeds, while maintaining accuracy in the recovery of parameters, with precision comparable or moderately weaker than MCMC constraints.
While this work focuses on using an SBI framework to train on galaxy spectra directly (without any summary statistics or embedded nets), we wish to scale this analysis with GPU-processing on suitable summary statistics (Khullar et al. in prep, 2022). Moreover, the combination of highly complex SED models (Leja et al., 2019a;Suess et al., 2022b) and high-resolution spectroscopy will enable precise constraints on star formation histories of galaxies. This is especially true in the era of JWST (Nanayakkara et al., 2022;Leethochawalit et al., 2022;Labbe et al., 2022;Suess et al., 2022a) and Roman Space Telescope (High Latitude Survey; Wang et al. 2022b), where SED analysis of systematic spectroscopic surveys will be bolstered with an SBI framework.
Finally, when using simulation-trained SBI models on future survey data, it is important to consider possible performance issues that will arise from small differences between simulated and real data (due to approximations, unknown physics or computational constraints, imperfect simulation of noise and other observational effects).
The drop in performance of simulation-trained models that are applied to real data is a known issue that affects all deep learning models. Mitigation of these problems is an active area of research, which already led to the development of a broad group of methods called Domain Adaptation (Csurka, 2017;Wang and Deng, 2018). These methods allow deep learning model to learn the features shared between simulated and real data and use only these features for inference. This leads to better alignment of the two data distributions in the latent space of the deep learning model, which leads to improved performance (Ćiprijanović et al., 2021, 2022). In future work, we will include domain adaptation methods in our SBI frameworks.