pop-cosmos: A comprehensive picture of the galaxy population from COSMOS data

We present pop-cosmos: a comprehensive model characterizing the galaxy population, calibrated to $140,938$ ($r<25$ selected) galaxies from the Cosmic Evolution Survey (COSMOS) with photometry in $26$ bands from the ultra-violet to the infra-red. We construct a detailed forward model for the COSMOS data, comprising: a population model describing the joint distribution of galaxy characteristics and its evolution (parameterized by a flexible score-based diffusion model); a state-of-the-art stellar population synthesis (SPS) model connecting galaxies' instrinsic properties to their photometry; and a data-model for the observation, calibration and selection processes. By minimizing the optimal transport distance between synthetic and real data we are able to jointly fit the population- and data-models, leading to robustly calibrated population-level inferences that account for parameter degeneracies, photometric noise and calibration, and selection. We present a number of key predictions from our model of interest for cosmology and galaxy evolution, including the mass function and redshift distribution; the mass-metallicity-redshift and fundamental metallicity relations; the star-forming sequence; the relation between dust attenuation and stellar mass, star formation rate and attenuation-law index; and the relation between gas-ionization and star formation. Our model encodes a comprehensive picture of galaxy evolution that faithfully predicts galaxy colors across a broad redshift ($z<4$) and wavelength range.


INTRODUCTION
As galaxies evolve, their macroscopic (astro)physical characteristics -stellar mass, metallicity, dust, gas and active galactic nuclei (AGN) content -will evolve accordingly.While the detailed physics of galaxyevolution and merger histories is not directly observable for individual galaxies, these processes determine the joint distribution of physical characteristics in the galaxy population, and how that distribution evolves over cosmic time.Constraining the joint distribution of galaxy properties in the Universe is therefore one of the main ways we can learn about galaxy evolution (see e.g.Madau & Dickinson 2014 for a broad review).
As well as enabling galaxy evolution science, detailed characterization of the galaxy demographics over cosmic history is critical for cosmological probes that rely on observations of galaxies.Large-scale galaxy imaging surveys, which probe cosmological structure formation via galaxy clustering and weak gravitational lensing, require accurate determination of galaxy redshifts from their broad-band photometry.The physical characteristics of galaxies uniquely determine their spectral en-ergy distributions (SEDs; e.g.Conroy 2013), providing the link between prediction and observation in inferring redshifts from photometric data.The joint distribution of galaxy properties implicitly provides the prior over galaxy SEDs, which is critical for accurate photometric redshift estimation (especially from broad-band data: Arnouts et al. 1999;Benitez 2000;Ilbert et al. 2006;Brammer et al. 2008;Tanaka 2015).In Alsing et al. (2023) we recently showed that the redshift distributions of ensembles of galaxies in photometric surveys can be accurately derived via forward modeling, i.e., explicit modeling of the galaxy population, observational processes, and selection, provided those elements can be modeled with sufficiently high fidelity.Accurate estimation of redshift distributions is essential for obtaining robust and accurate cosmological constraints, and currently represents one of the main systematic challenges for both current (Stage III) and imminent (Stage IV) surveys, such as the Dark Energy Survey (DES; Flaugher 2005), the Kilo Degree Survey (KiDS; De Jong et al. 2015) and Hyper Suprime-Cam (HSC; Aihara et al. 2018), the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST; Abell et al. 2009), and Euclid (Laureijs et al. 2011).
In addition, in order to leverage the cosmological information from small-scale galaxy-clustering, we require an understanding of how galaxies with different properties trace the underlying dark matter field (ie., galaxy bias, Sheth & Tormen 1999;Tinker et al. 2010).Detailed characterization of the galaxy population is hence a key component in the exploration and exploitation of the galaxy-halo connection (see e.g.Wechsler & Tinker 2018 for a recent review).
In order to be useful in a cosmological inference context -and to provide a more complete picture of the demographics of the galaxy population in general -it is desirable to obtain comprehensive constraints on the joint density P (φ) of galaxy characteristics φ, from a large and deep sample of galaxies, with as simple selection criteria as possible1 , and with any selection cuts properly accounted and corrected for.In this work we fit a flexible, non-parametric model for the joint density of galaxy characteristics to a large, deep (r < 25), flux-limited sample of galaxies from the Cosmic Evolution Survey (COSMOS; Scoville et al. 2007;Weaver et al. 2022), assuming a state-of-the-art stellar population synthesis (SPS) model connecting galaxy properties with their SEDs.We construct a detailed forward model for the COSMOS data, accounting for photometric noise and calibration, selection cuts, and modeling biases at the level of the SPS predictions.We then use simulation-based inference (SBI) to jointly fit the population model along with photometric noise and calibration parameters (and modeling errors), while properly accounting and correcting for selection.The result is a robustly calibrated galaxy population model that characterizes the complex web of dependencies between galaxy characteristics, and how they evolve over cosmic time.
Our calibrated population model, which we denote pop-cosmos, is useful for both cosmological applications and galaxy evolution studies, and is the first joint inference of the full set of dependencies between galaxy characteristics (rather than individual scaling-relations), while properly accounting for degeneracies between parameters, calibration and selection in a principled fashion.
This represents a milestone in our ongoing effort to achieve accurate galaxy-population modeling under SPS models.In Alsing et al. (2020) we developed neural emulation of SPS, achieving the (10000×) speed-up required to deploy them at scale.In Alsing et al. (2023) we developed the forward modeling framework, demonstrating that existing state-of-the-art population modeling can already deliver (for example) redshift distributions accurate enough for Stage III surveys.In Leistedt et al. (2023) we developed the necessary photometricand model-calibration elements, demonstrating state-of-the-art photo-z performance.Now in the present work, we combine these advances with a flexible populationmodel parameterization and simulation-based inference to deliver comprehensive constraints on the galaxy population from a large, deep galaxy sample with broad wavelength coverage.
This paper is structured as follows.In §2 we describe the COSMOS galaxy sample.In §3 we describe the forward model, comprising the population model ( §3.2), SPS model ( §3.3), calibration and noise model ( §3.4), and selection.The optimal-transport based simulationbased inference technique is described in §4.Results are presented in §5, with discussion and conclusions in §6 and §7.

DATA
The Cosmic Evolution Survey (COSMOS), comprises deep imaging and photometry (in 44 bands) of 1.7 million objects across 2 deg 2 in the COSMOS field (Weaver et al. 2022).We use profile-fitting based photometry from the Farmer (COSMOS2020) catalog (Weaver et al. 2022(Weaver et al. , 2023a) ) in 26 of the available bands 2 , chosen to ensure well-calibrated photometry and relatively homogeneous depth across wavelengths (following Brammer et al. 2008;Leistedt et al. 2023; see our Figure 1).This selection excludes the Subaru Suprime-Cam broad bands (which are shallower than other filters at similar wavelengths), and the GALEX bands (which are shallow and also have broad PSFs; Weaver et al. 2022).
We apply the conservative combined mask, which retains the deepest regions with the greatest number of available bands while removing areas corrupted by bright stars and other artifacts (Weaver et al. 2022).The catalog is prepared using the code released with the COSMOS2020 data3 , which applies the relevant flux corrections (including Galactic extinction) and unit conversions.We use the same star-galaxy separation criterion as Weaver et al. (2022), which is based on the χ 2 estimated for star and galaxy templates in LePhare (Arnouts et al. 1999;Ilbert et al. 2006) as well as morphology information from the COSMOS HST/ACS mosaics4 .
To construct a clean and complete analysis sample, we impose a hard magnitude cut in the r-band of r < 25, two magnitudes shallower than the 3σ magnitude limit5 .This results in a flux-limited sample of 140, 938 galaxies, without significant additional selection effects.

MODEL
Our generative model for photometric galaxy survey data comprises a sequence of steps for simulating mock galaxy catalogs, which can then be compared against the observed data in a simulation-based inference setting for estimating the population-level parameters of interest (as described in §4).Notation is summarized in Table 1, the overall model structure and key components are outlined in §3.1, while the detailed assumptions about each model component are given in §3.2-3.4.The logical flow of our forward model is also summarized in Figure 2 (left panel).

Generative model structure
Our generative model proceeds in the following sequence of steps: where d L (z) the luminosity distance for redshift z, τ (z, λ) is the optical depth of the inter-galactic medium, and W b (λ) are the band-passes for each band b.We assume a state-of-the-art 16parameter SPS model, detailed in §3.3; 3. Calibrate photometry: Measured photometry is subject to calibration biases.We apply zeropoint corrections α ZP per band.The SPS model, too, will be subject to small biases due to approximations and missing model components.For example, emission-line predictions are often only accurate at the O(10%) level or less (Leistedt et al. 2023), with variation arising from both the SPS treatment used (e.g.Byler et al. 2017), and the scheme used to compute line intensities (quantum mechanical vs. semi-classical, etc.; see Guzmán et al. 2017;Ferland et al. 2017).In this step, we apply the zero-point α ZP and emission-line β EM corrections to the SPS model photometry from step 2: where f EM b (φ, z) is the vector of emission-line contributions to the photometry for band b.We include emission-line corrections to the 44 strongest emission-lines, following Leistedt et al. (2023)   Construction of the uncertainty model is detailed in §3.4.
We model an additional source of photometric uncertainty arising from variability in the emission-line contributions to each galaxy (relative to CLOUDY predictions); these depend on the detailed micro-structure of the galaxy and are not captured in the SPS parameterization.To this end, we construct emission-line contributions to the photometric uncertainties in each band, parameterized as where γ EM represent the (fractional) variance in the emission-line contributions for each of the 44 lines included (Table 3).The total photometric uncertainty for each galaxy is then given by the quadrature sum of measurement and emission-line contributions σ 2 = σ 2 P + σ 2 EM ; 5. Add noise: We add noise to the calibrated model photometry from step 3, given the photometric uncertainties from step 4, assuming independent Student's-t errors on each band (with two degreesof-freedom), where d is the vector of noisy (calibrated) fluxes, ⊙ denotes element-wise multiplication, and f denotes the vector of model fluxes (i.e.all the f b (φ, z) computed in step 3); 6. Apply selection: Galaxies are selected into the sample following the same photometric cuts that were applied to the data ( §2).
This generative process represents a complete description of our model assumptions, or equivalently, our simulation pipeline for generating mock galaxy catalog data.Simulated catalogs generated in this way can be compared to the data in a simulation-based inference setting in order to estimate the population-level parameters (see §4).
In the following sections, we give more details of the population model ( §3.2), SPS model ( §3.3), and uncertainty model ( §3.4) assumptions.The simulation-based fitting procedure is then described in §4.

Population model
The population model P (φ, z|ψ) is the main target of our analyses.We require a flexible parameterization for this high-dimensional density, which is capable of capturing the complex web of inter-dependencies between galaxies' properties that arise from galaxy formation and evolution physics.Advances in generative machinelearning models, such as normalizing flows (Rippel & Adams 2013;Germain et al. 2015;Dinh et al. 2016;Papamakarios et al. 2017;Grathwohl et al. 2018;Chen et al. 2018;Kingma & Dhariwal 2018;Durkan et al. 2019;Papamakarios et al. 2021) and diffusion models (Sohl-Dickstein et al. 2015;Ho et al. 2020;Song & Ermon 2019;Song et al. 2020b,a;Song & Ermon 2020;Kingma et al. 2021;Luo 2022) have provided a step change in our ability to parameterize and learn complex and high-dimensional probability distributions from data.
We parameterize P (φ, z|ψ) using a score-based diffusion model (Song & Ermon 2019;Song et al. 2020a,b), where the population-model parameters ψ are the weights and biases of the score-network (outlined below).Diffusion models have been shown to be effective flexible approximators for unknown probability distributions, are relatively inexpensive to train, and scale well to high-dimensional problems, making them ideally suited to this use-case (see Luo 2022 for a review).
In diffusion models, as with normalizing flows, we aim to find an invertible transform that maps from some simple base density (eg., a unit normal) to our target p(x), such that we can generate samples from the target by simply transforming draws from the base-density, ie., where J = ∂f −1 (x)/∂x is the Jacobian, and the transform f must be invertible.In both normalizing flows and diffusion models, the goal is to parameterize the invertible transform f by a neural network.In a score-based diffusion model, we begin by constructing a diffusion process {x(t)} t=T t=1 (indexed by a continuous time-variable t) such that x(t = 0) is distributed according to our target distribution, and x(t = T ) has a normal distribution.This diffusion process can be described by a stochastic differential equation (SDE), which maps samples from our target distribution at t = 0 to random noise at t = T : where w is standard Brownian motion (Wiener process).In order to generate samples from our target then, we can take samples from the base normal distribution x(t = T ) and reverse the process back to t = 0.The reverse of a diffusion process defined by Equation ( 6) is simply another diffusion process, defined by the reversetime SDE (Anderson 1982;Song et al. 2020b): where w is reverse-time Brownian motion, p t (x) are the marginal distributions of the diffusion process defined by Equation ( 6), and dt is an infitesimal step backwards in time.Hence, once the score ∇ x p t (x) of the marginals of the forward diffusion process is known as a function of time, then the reverse-process in Equation ( 7) can be evaluated to transform samples from the base density x(t = T ) ∼ N (0, 1) to the target x(t = 0).The transform from the base-density to the target is hence completely characterized by the score of the marginals: in a score-based diffusion model, the score s(x, t) = ∇ x p t (x) is parameterized as a (dense) neural network, and fit by denoising score-matching (Hyvärinen 2005;Song et al. 2020aSong et al. ,b, 2021)), or otherwise.So far, this reverse-time diffusion process provides a means to stochastically transform from the base density to the target, via Equation (7).However, in order to be able to evaluate the Jacobian and hence log probability, we require a deterministic (invertible) transform between the base and the target.Fortunately, for any reverse-time SDE of the form given in Equation ( 7), there exists a deterministic ordinary differential equation (ODE) that has the same marginal distributions as the SDE (Maoutsa et al. 2020;Song et al. 2020b): Integrating this ODE from t = T to t = 0 hence provides a deterministic, invertible transform from the base density x(t = T ) to the target p(x), which is completely characterized by the score-function s(x, t) = ∇ x p t (x), and whose Jacobian can be computed.Interpreting the diffusion model as an ODE transform in this way elicits an equivalence between continuous-time normalizing flows (Grathwohl et al. 2018;Chen et al. 2018) and scorebased diffusion models (Song et al. 2020b).
We parameterize the score s(x, t) as a dense network with two layers of 128 hidden units and tanh activation functions.We take the so-called variance-exploding SDE (Song et al. 2020b) to define the forward diffusion process, where we choose σ 0 = 0.01, σ T = 10 and T = 1, and implicitly in the variance-exploding SDE the drift-term f (x, t) is set to zero (c.f.Equation 6).

Stellar population synthesis (SPS) model
Stellar population synthesis provides the theoretical framework linking the stellar, gas and dust content of galaxies, and their SEDs (see Conroy 2013 for a review).We assume a state-of-the-art 16-parameter SPS model, based on the milestone Prospector-α model (Leja et al. 2017(Leja et al. , 2018(Leja et al. , 2019a,b),b), but including the gas-phase ionization parameter as an additional free parameter; we found that this additional parameter was required to give reasonable inferences about the gas-phase physics.For completeness, the physical assumptions and parameters are described below.
The star formation history (SFH) is modeled as piecewise constant, with seven bins in time (see Leja et al. 2019a).The first two bins are fixed at [0, 30] Myr and [30,100] Myr respectively, to capture recent star formation.The oldest bin is fixed at [0.85, 1]t age (z), where t age (z) is the age of the universe at the lookback time of the galaxy.The remaining four bins are equallyspaced (logarithmically) in time between 100 Myr and 0.85t age (z).The ratios of the log star formation rate (SFR) between adjacent SFH bins are then the free model parameters describing the SFH.This flexible sixparameter model is able to capture a rich diversity of SFH phenomenology, including both smooth and bursty star-formation histories.
Dust is modeled as separate diffuse and birth cloud dust screens, where the latter only impacts stars younger than 10 Myr (Charlot & Fall 2000).The optical depths τ 1 (birth cloud) and τ 2 (diffuse), as well as the powerlaw index of the Calzetti et al. (2000) attenuation curves, constitute the free parameters describing the dust model.Dust emission is powered by energybalance.
The stellar metallicity for all stars in the galaxy is assumed to take a single value, ie., the model does not explicitly account for time-varying metallicity in the stellar population.This is generally a good approximation, although some studies suggest that metallicity evolution can improve SED modeling at the level of (typically) a few percent (Robotham et al. 2020;Bellstedt et al. 2020).
Gas is characterized by the gas-phase metallicity and ionization state (treated as separate independent variables).Nebular line and continuum emission is generated using CLOUDY (Ferland et al. 2013(Ferland et al. , 2017) ) model grids from Byler et al. (2017).We assume that the gasphase metallicity must be greater than or equal to the stellar metallicity (since the latter captures the lightweighted average over the stellar population, which includes older stars).
Active galactic nucleus (AGN) activity is modeled as described in Leja et al. (2018), where the fraction of the bolometric luminosity from the AGN f AGN and optical depth of the AGN torus τ AGN are both included as free parameters.
Together with stellar mass and redshift, this amounts to a total of 16 parameters characterizing each galaxy.The list of parameters and their prior limits are given in Table 1.
The SPS model is implemented in the public code Flexible Stellar Population Synthesis (FSPS; Conroy et al. 2009Conroy et al. , 2010;;Conroy & Gunn 2010a,b), accessed through the python-fsps binding (Foreman-Mackey et al. 2014).We then use speculator (Alsing et al. 2020) to accelerate the SPS computation, achieving a factor of 10 4 speed-up over FSPS per band, while maintaining sub-percent accuracy on the predicted fluxes6 .

Uncertainty model
The uncertainty model describes the distribution of photometric measurement uncertainties in the survey, conditional on the true source flux, P (σ P |f ; χ).Following Alsing et al. (2023), we model P (σ P |f ; χ) as a mixture density network (MDN; Bishop 2006).Here we use an MDN with one Gaussian component, i.e., a neural network parameterizing the mean µ(f ) and standard deviation Σ(f ) of the distribution of photometric uncertainties, conditioned on flux.From this, photometric uncertainties can be drawn for a simulated galaxy with flux f by drawing: We parameterize the MDN with a single dense network with two hidden layers, with 128 units each and tanh activation functions.
By keeping the uncertainty model parameters χ free in the fitting process, we are able to fully self-calibrate the uncertainty model from the data, eliminating any reliance on the (approximate) estimated flux uncertainties in the Farmer catalog.

INFERENCE
Inferring the population-level parameters ψ, η from the hierarchical model defined in §3 is difficult for a number of reasons.Firstly, flexible (neural network) parameterizations of the population and photometricuncertainty models mean that the number of hyperparameters of interest is large7 .Secondly, there is a vast number of individual-galaxy level parameters {φ, z} 1:N that would need to be inferred and then marginalized over in a typical Bayesian analysis (using e.g.Markov chain Monte Carlo methods).This provides a technical challenge due to the complexity and diversity of individual galaxy SPS-parameter likelihoods (degeneracies and multimodality are commonplace), and a computational bottleneck due to the large number of SPS model calls required.Thirdly, the selection cuts introduce a high-dimensional integral into the likelihood, making it effectively intractable (see Alsing et al. 2023

for details).
Even though the likelihood is intractable, the model described in §3 defines a straightforward recipe for simulating mock catalogs, given some assumptions about the population-level parameters.This means that we may instead compare simulated catalogs to the data in a simulation-based inference setting, for example, by min-imizing a suitable distance metric between model generated and real data.
Minimizing the divergence between the predictive (model) and true data distributions is well-motivated: maximum-likelihood estimation is asymptotically equivalent to minimizing the Kulback-Leibler (KL; Kullback & Leibler 1951) divergence between model and data distributions.However, the KL divergence requires evaluating the predictive distribution for the data from our model, in this case the predicted distribution of galaxy photometry for galaxies in the survey (given hyper-parameters ψ, η).As discussed above, this distribution is not tractable so we seek an alternative distance metric with properties suitable for robust and efficient parameter estimation.
We estimate the population-level parameters ψ, η by minimizing the optimal transport (OT) distance between model generated data (catalog) D = d 1:N , and the COSMOS data D = d1:N .The OT distance (also known as the Wasserstein distance or Kantorovich-Rubinstein metric, after Kantorovich & Rubinstein 1958;Vaserstein 1969) measures the divergence between two distributions from which we have samples, by computing the minimum distance required to transport one set of points onto the other, given some local metric to define distances in data space8 (for a review on OT and its implementation, see Peyré & Cuturi 2019).Optimal transport has been widely used for parameter estimation in settings where the KL divergence is intractable (Peyré & Cuturi 2019), providing efficient and consistent estimators, which are asymptotically equivalent to maximum-likelihood estimation in the large-dataset limit in most situations9 .
While exact calculation of the optimal transport distance is computationally complex (ON 3 ln N ; Pele & Werman 2009) and difficult to scale, the Sinkhorn divergence (Cuturi 2013)   distribution, calibrated model photometry is calculated and then uncertainties and noise are drawn and added, followed by application of selection cuts.In order to be able to use gradient-based optimization to minimize the OT distance between simulated and real data, we need to be able to take gradients through our simulator.To achieve this, we use a variant of the reparameterization trick (Kingma & Welling 2013), where we re-write our forward model as a sequence of deterministic steps applied to some fixed draws from a base density (which are kept fixed for the purpose of estimating gradients).
In this sense, our forward model can be written as the following sequence of steps: 1. Draw base random variates for the populationmodel, uncertainty-model, and noise-model: s 1:M ∼ N (0, 1), [uncertainty-model base draws] where T 2 is the standard-t distribution with two degrees-of-freedom (Equation 4), and M is the number of mock galaxies to generate (which should be larger than the target (selected) catalog size N ); The objective function for minimization is then given by: where W 2 denotes the OT distance (assuming a local Euclidean metric), D(u, s, n|ψ, η) is the simulated catalog and D the COSMOS catalog.By keeping the base random drawn from step 1 fixed, the simulated catalog (and hence OT distance) are deterministic functions of the parameters ψ, η, so that we can take gradients and perform gradient-based optimization.This scheme is summarized in Figure 2.

Initialization and training
The calibration model parameters (characterizing the zero-points and emission-line corrections) are initialized following the Bayesian hierarchical calibration approach presented in Leistedt et al. (2023): cross-matching with data from zCosmos-bright (Lilly et al. 2007), DEIMOS (Hasinger et al. 2018)], and C3R2 (Masters et al. 2017(Masters et al. , 2019;;Stanford et al. 2021) yields 12,473 objects with spectroscopic redshifts available, in the range 0 < z < 2. This lifts degeneracies between SPS parameters and makes the calibration model parameters very well constrained by the data.Simultaneous optimization of all parameters converges easily, with the SPS parameter uncertainties having negligible effect on the result (see Leistedt et al. 2023 for more details).
To initialize the population model, we perform an initial maximum aposteriori (MAP) estimation of the SPS parameters for each galaxy in the COSMOS sample, and pre-train the diffusion model on that ensemble of MAP estimates via denoising score-matching.This provides a reasonable initialization for the population model to avoid a long burn-in phase based on the more expensive optimal transport objective.
The uncertainty model network is initialized as follows.The initial MAP estimates for the SPS parameters (and initialized calibration-model parameters) provide estimates of the true (denoised) photometry for each galaxy in the COSMOS sample.This provides a catalog of uncertainties and associated (denoised) photometry {σ P , f }, on which we can train our conditional estimator for P (σ P |f ; χ) by minimizing the negative log-likelihood loss: This provides a reasonable initialization for the uncertainty model, after which χ is kept free in the final fitting procedure.
OT optimization is then performed with Adam (Kingma & Ba 2014) with a learning rate of 10 −4 , until the distance ceases to improve for twenty epochs.All of the population-level hyperparameters are kept free in the fitting process, including the zero-points and emission line corrections.
We compute the OT objective between both the synthetic and real magnitudes, and separately between the synthetic and real colors10 , and sum them.We find that this improves the ability of the fitted model to reproduce both the colors and magnitudes faithfully.

Discussion
The model fitting scheme described above has a number of advantages over existing methods.
Firstly, we target the hyper-parameters (describing population-and data-model) directly, completely bypassing the need to infer the properties of each individual galaxy in the sample (in contrast to eg., MCMCbased approaches).This provides a significant advantage in computational cost and scalability when population-level inference is the main goal.
Secondly, by jointly inferring the population-and data-model parameters together in a self-consistent fashion, we are able to use the data to "self-calibrate" any unknown nuisance parameters (e.g., calibration and noise-model parameters, etc).This will result in more robust inferences compared to the traditional approach of estimating and fixing nuisance parameter values prior to the main analysis.
While our fit to COSMOS data necessarily includes the r < 25 selection cut, our inference pipeline explicitly includes (and corrects for) that selection: the target population model is therefore a description of the underlying galaxy population that is not subject to selection effects.The resulting population model can therefore be straightforwardly utilized for prediction (and like-for-like comparison) for different surveys, simply by applying the noise characteristics and selection appropriate for that survey in a forward modeling context.This point is critical for application to cosmological inference from broad-band galaxy surveys, where we require a well-calibrated population model that is able to make faithful predictions for those deep, broadband data.Note that while our method properly corrects for selection, it is not designed to extrapolate more than a few noise standard deviations below the fluxlimit (where there is no data to constrain the population model).Therefore, application of the calibrated population model should be limited to surveys with similar or shallower depths.
Our forward modeling-based approach is also well suited to principled validation on the basis of predictive performance.This is in contrast to typical galaxy evolution and cosmology analyses, where population-level inferences are drawn, but little assessment of prediction quality (in data-space) is done.In a companion paper (Thorp et al. 2024b), we present a model validation approach for the simulation-based inference setting, based on quantile-quantile (QQ) and probability-probability (PP) plots (Wilk & Gnanadesikan 1968; see Eadie et al. 2023 for an astronomy example).A further complication is the question of comparing models.Again, the most popular approaches in astrophysics and cosmology are typically applied at the level of the parameter posterior (i.e. via the Bayesian evidence; although use of posterior predictive scores is growing, see e.g.Feeney et al. 2019;Abbott et al. 2019;Rogers & Peiris 2021;Setzer et al. 2023;McGill et al. 2023;Welbanks et al. 2023;Nixon et al. 2024).In our simulation-based approach, we can readily interrogate competing models based on their ability to reproduce observed data.
The simulation-based inference scheme described above currently provides a point estimate for the population-level parameters.Statistical uncertainties on the estimated parameters could be obtained by bootstrapping, or training ensembles of models with different initializations (e.g., Li et al. 2024).However, we expect uncertainties to be dominated by systematic rather than statistical errors (due to e.g., photometric calibration; see § 5.1).

RESULTS
In this section we present the key results from our fitted forward model.Our model predictions in dataspace are validated against the COSMOS sample in §5.1, and the fitted values of the calibration-model parameters (zero-points and emission-line corrections) are given in Tables 2 and 3.While most of the emission-line corrections are at the percent level or less, ten of the included bands get ≳ 10% or more (and up to 50% in some cases).We report that emission-line calibration was essential to obtain physically reasonable population-model constraints on the fundamental-metallicity relation (gasmetallicity vs. SFR; Figure 11), and AGN (Figure 3).
The 1-and 2-d marginals for the fitted populationmodel are summarized in Figure 3. Since we assumed a flexible parameterization for the population model, it is designed to capture the complete web of complex dependencies between galaxy characteristics and how those evolve over cosmic time.While some of this structure is already visible in Figure 3, we present our model predictions in light of commonly studied relationships and quantities in §5.2-5.8: the redshift distribution §5.2; mass-function §5.3; mass-metallicity and fundamental relations §5.5-5.6;dust versus mass and SFR §5.7; and gas ionization versus SFR §5.8.Constraints on AGN are briefly discussed in §5.9.Note that while our model corrects for selection, our flexible population-model parameterization is not designed to extrapolate far below the flux-limit of the sample (where the data has no constraining power): this leads to the apparent turnover at low masses (high redshifts) in Figure 3.
Direct quantitative comparison with previous work for the relations presented in §5.2-5.8 is not always straightforward.This work represents the first time it has been possible to jointly infer the full set of galaxy parameter dependencies, while accounting for SPS-parameter degeneracies, self-calibrating the data-and calibrationmodel, and correcting for selection.It is therefore nontrivial to present like-for-like comparisons with previous studies with different SED or data-modeling assumptions, different assumed priors or specific parametric forms for scaling relations, and differing selection effects.For these reasons, in §5.2-5.8 we focus primarily on presenting a broad physical interpretation of our results, with qualitative comparison to the (most-comparable) literature where appropriate, and to template-based parameter estimates in some cases as a sanity check.We leave detailed comparison with the literature and implications for galaxy evolution to future work.

Data-space comparisons
Comparisons of our fitted model predictions to the COSMOS data in magnitude-and color-space are shown in Figures 4-6.To ensure a like-for-like comparison, Figures 4-6 compare our model predicted distributions for noisy, calibrated (r < 25 selected) photometry against the equivalent COSMOS data.We focus on a subset of key bands and colors, spanning the full wavelength range and key color-space features, following Weaver et al. (2022).
Our model achieves excellent agreement in the magnitude marginals (Figure 4); this is not unexpected, since the magnitude marginals are mostly dominated by the shape of the mass function and volume effects, which should be easily-captured by the model.
The predictive distribution of galaxy colors on the other hand is a rich probe of galaxy evolution physics.The ability of our model to faithfully reproduce the color-color distribution underpins our confidence in the model predictions, and accurate characterization of galaxy colors as a function of redshift is crucial for predicting redshift distributions for cosmological analyses (e.g.Alsing et al. 2023).In Figures 5 and 6 we see that our model reliably reproduces the color-color distributions of COSMOS galaxies, including fine structure (e.g.related to star-forming and quiescent concentrations).
The largest discrepancies (at the level of 0.05 − 0.1 magnitude color offsets) are seen in (K s − irac1) and (g − r).These small biases are likely explained by residual (un-modeled) systematics in the COSMOS data.Weaver et al. (2022) performed a detailed comparison of the Farmer and Classic versions of the COSMOS catalogs, with different approaches to the photometric extraction.They reported the largest unexplained systematic differences between the two catalogs' photometry in the irac1, g and u bands (figure 8 of Weaver et al. 2022), with (K s − irac1), (g − r) and (z − J) being the most affected colors (figure 9 of Weaver et al. 2022).Discrepancies between our model predictions and the Farmer data are less than the systematic differences between Farmer and Classic in all bands and colors.It is therefore likely that any modest differences between model and data seen in Figures 5 and 6 are dominated by residual systematics in the COSMOS photometry.This makes a strong case for pursuing further improvements to the photometric data-modeling and extraction for COSMOS data in future 11 .
11 Conversely, the fact that our flexible population-and SPSmodels are able to avoid simply "overfitting" to residual unmodeled systematics in the photometry is encouraging.This is In a companion paper (Thorp et al. 2024b), we will present further validation of our calibrated model in data-space, based on quantile-quantile (QQ) and probability-probability (PP) plotting.

Redshift distribution
Accurate prediction of the redshift distributions for ensembles of (photometrically-selected) galaxies is of critical importance in constraining cosmology from weak lensing surveys (see e.g., Newman & Gruen 2022 for a review).Redshift distributions are a key ingredient in predicting lensing and clustering statistics from data, and have significant degeneracies with key cosmological parameters of interest.This leads to very stringent requirements on their accuracy; for imminent Stage IV surveys such as LSST (Abell et al. 2009), biases on inferred redshift distributions must not exceed 0.001(1+z) (in the mean redshift; Mandelbaum et al. 2018).
Forward modeling has emerged as a promising avenue for obtaining accurate redshift distributions for deep broad-band imaging surveys, where sufficient spectroscopic calibration data are not available (Alsing et al. 2023).These approaches rely on accurate modeling of the galaxy population, with calibration to deep fluxlimited samples such as COSMOS (as in this work) expected to provide key baseline constraints.
In Figure 7 we show our predicted galaxy redshift distribution n(z) (given the photometric cuts described in §2), and compare to photometrically-derived redshift estimates from LePhare (Weaver et al. 2022).Cosmic variance is estimated following the recipe in Moster et al. (2011) 12 .The predicted n(z) is broadly in good agreement with the LePhare redshift estimates, with two notable discrepancies.Firstly, the LePhare redshifts exhibit an unphysical build-up of low or zero redshift galaxies.This is a commonly observed feature in template-based photo-z estimation, where some fits get driven to the prior boundary at z = 0, while the assumed redshift prior does not go to zero at the boundary to penalize them appropriately (Hildebrandt et al. 2012) 13 .Second, the LePhare redshift histogram exhibits clustering above z > 1 over-and-above the expected clustering due because the model is physics-guided, and helps build confidence in the model predictions.
12 The cosmic variance estimation is performed using redshift bins of ∆z = 0.05 13 The common practice of using redshift priors that do not go to zero at z = 0 was introduced in Hildebrandt et al. (2012) as an ad hoc modification that was observed to reduce the bias in template-based redshift estimates at low redshift.However, it comes at a cost of (un-physically) allowing some template fits to be driven up against the prior boundary at z = 0. Comparison is shown in the observed data-space, i.e., for r < 25 selected galaxies and with photometric noise and calibration included (to ensure like-for-like comparison with the data).We show a subset of the 26 bands, spanning the full wavelength range.Comparison is shown in the observed data-space, i.e., for r < 25 selected galaxies and with photometric noise and calibration included (to ensure like-for-like comparison with the data).We show a subset of key colors following Weaver et al. (2022).
to cosmic variance.This behavior is commonplace in template-based photo-z methods, where redshift point estimates have a tendency to cluster around specific values (owing to the limited fidelity with which finite or interpolated template-sets can describe real galaxy SEDs).In contrast, our predicted n(z) has the physically correct behavior of going to zero at z = 0, and does not exhibit spurious structure above z > 1. Conversely, since our generative model does not include galaxy clustering (present in the COSMOS sample at z ≲ 1), and is only calibrated to galaxy colors (which are expected to be very weakly sensitive to clustering), our model implicitly learns the underlying (mean) n(z), as desired.
We expect the calibrated pop-cosmos model to provide an improved population-model for predicting redshift distributions for cosmological surveys (Alsing et al. 2023).We will present pop-cosmos enabled redshift distribution estimation for KiDS data in a companion paper (Loureiro et. al., in prep.).
While we do not present individual galaxy redshift estimates here, our calibrated population model can also provide an improved prior for SPS-based photo-z estimation.We are investigating the utility of pop-cosmos   for redshift estimation for individual galaxies in a companion paper (Thorp et al. 2024a).

Galaxy stellar mass-function
Galaxies build up stellar mass through a combination of in-situ star formation and mergers.Modeling how galaxies grow is a major ongoing challenge, involving processes that span a wide range of scales (from stellar to cosmological; see e.g., Somerville & Davé 2015 for a review).Observations of the stellar mass function and its redshift evolution hence provide an important constraint on models of galaxy formation and evo-lution (Marchesini et al. 2009;Ilbert et al. 2013;Muzzin et al. 2013;Moustakas et al. 2013;Tomczak et al. 2014;Grazian et al. 2015;Song et al. 2016;Davidzon et al. 2017;Wright et al. 2018;Leja et al. 2020;Weaver et al. 2023b).In the context of photometric redshift estimation, accurate characterization of the mass function is also essential for obtaining accurate redshifts.
The stellar mass function derived from our model is shown in Figure 8 14 .The closest study for comparison is Weaver et al. (2023b), who estimate the stellar mass function from COSMOS2020 data based on the LePhare mass estimates.To simplify the comparison (eliminating any differences in data and modeling assumptions) in Figure 8 we compare directly to the LePhare masses on which the Weaver et al. (2023b) measurement is based.
We achieve good agreement with the LePhare masses over the entire redshift range, and predict a number of key features in the mass function.We find a steepening of the low-mass slope with redshift, a buildup of galaxies around 10 11 M ⊙ below z < 1.2 (leading to the observed "bump" in the mass function at low and intermediate redshifts), and little or no redshift dependence of the location of the knee of the mass-function.We also note relatively little evolution in the shape of the mass function at z ≲ 1.5.These features are in good agreement with previous observations (including previous COSMOS analyses: Ilbert et al. 2013;Davidzon et al. 2017;Weaver et al. 2023b).
Comparison to other recent measurements such as Leja et al. (2020) are non-trivial due to differing modeling assumptions; we leave broader comparisons to future work.
Figure 9 shows the inferred relationship between SFR, stellar mass, and redshift, from our population model.We compare to the measured SFS from Leja et al. (2022), which is based on COSMOS-2015 and 3D-HST photometry.This comparison is chosen because it is the most similar to our analysis; they model the SFS for star-forming and quiescent galaxies together (as in our work, rather than selecting star-forming galaxies only), the datasets used have some commonality, and the SPS modeling assumptions are similar.
Our recovered SFS is in good agreement with the measurement from Leja et al. (2022).We find a similar slope of the SFS at both low and high masses, flattening of the SFS at higher masses, steepening of the high-mass slope as a function of redshift, and a negative skewness at the high-mass end owing to the increasing presence of massive quiescent galaxies at higher masses.The small offset in normalization at low masses is mostly due to the broken power-law from Leja et al. (2022) modeling the log of the mean SFR, whereas for our model we show the median log SFR.We would also expect some modest quantitative differences due to the differing galaxy samples used, treatment of selection effects, and modeling assumptions.We note that our inferred SFS extrapolates sensibly into the regime where the COSMOS data are incomplete or lacking (Figure 9, grey bands).
The majority of observational studies have focused on characterizing the SFS for star-forming galaxies only, since those are the galaxies which are actively forming mass.However, more recently it has been shown that the method of identifying star-forming galaxies leads to systematic differences in the inferred SFS (of up to 0.5 and 0.2 dex in normalization and width respectively; Leja et al. 2022), owing largely to the fact that the galaxy population cannot be cleanly split into "starforming" and "quiescent" samples based on SFR (ie., the distribution of SFR is not strongly bimodal at most masses and redshifts: see e.g., Leja et al. 2022).We emphasize that, in the spirit of Leja et al. (2022), the SFS prediction from our model presented in Figure 9 includes all galaxies in the flux-limited COSMOS sample (not only star-forming galaxies).

Mass-metallicity-redshift
The chemical enrichment of galaxies is driven by two main processes: successive generations of massive stars produce metals via nucleosynthesis and return them to the interstellar medium at the end of their lives; at the same time, outflows driven by starburst winds or AGN feedback result in ejection of metal-enriched gas into the intergalactic medium, while inflows can bring metal-poor gas in.The interplay of these processes results in a relationship between stellar mass, stellar-and gas-phase metallicities, and star-formation rate.The observed mass-metallicity and fundamental metallicity (mass-gas metallicity-SFR) relations hence provide key observational probes of galaxy evolution (Tremonti et al. 2004;Maiolino et al. 2008;Mannucci et al. 2009;Lara-López et al. 2010;Yates et al. 2012;Lara-López et al. 2013;Andrews & Martini 2013;Nakajima & Ouchi 2014;Yabe et al. 2015;Salim et al. 2014Salim et al. , 2015;;Kashino et al. 2016;Cresci et al. 2019;Cullen et al. 2021;Curti et al. 2020;Bellstedt et al. 2021;Sanders et al. 2021;Thorne et al. 2022).
In Figure 10 (upper panels) we show the predicted mass-stellar metallicity relation from our population model, averaged over redshift (left panel), and as a function of redshift (right panel).The shape of the massmetallicity relation is in excellent agreement with local measurements from SDSS at low redshift (e.g., Gallazzi et al. 2005) 15 .We find that the slope of the massmetallicity relation at lower masses steepens by a factor of ≃ 2 between z = 0 and z = 3.5, with the trend decreasing as the mass-metallicity relation flattens off at higher masses.
In the lower panels of Figure 10 we show our population model predictions for the mass-gas metallicity relation averaged with respect to redshift (left), and as a function of redshift (right).The shape and redshift evolution is in good qualitative agreement with recent measurements (e.g., Bellstedt et al. 2021;Thorne et al. 2022).We find the slope of the mass-gas metallicity relation at lower masses steepens by a factor of ≃ 2 between z = 0 and z = 3.5, with the median gas metallicity at 10 10 M ⊙ decreasing by around 0.4 dex over the same redshift range.This is roughly consistent with recent measurements from GAMA data (Bellstedt et al. 2020), which shows around 0.6 dex decrease in the median gasmetallicity over a similar redshift range.The normalization of the recovered mass-gas metallicity relation (bottom row of Figure 10) is higher than for the mass-stellar metallicity relation (top row of Figure 10).This is expected, since under our assumed SPS model the gas metallicity represents the present-day metallicity of the ISM, while the stellar-metallicity parameter is a proxy for the light-weighted average metallicity among the stellar population (which includes older stars).
Note that for both stellar and gas metallicity, in the regime where the COSMOS data is lacking (grey bands in Figure 10) the extrapolation of our model predictions show a flattening of the mass-metallicity relations, while from observations it is expected to continue downwards (e.g., Kirby et al. 2013).Our model is not designed to extrapolate very far into the regime where the data is lacking; additional constraints may be needed to improve the extrapolation of the mass-metallicity relations at low masses, if desired.

Fundamental metallicity relation
The interplay between star formation and the chemical enrichment of the ISM is expected to result in a relationship between mass, gas-phase metallicity, and star-formation rate -the so-called fundamental metallicity relation (FMR; Mannucci et al. 2010;Dayal et al. 2013).
In Figure 11 we show the dependence of the mass-gas metallicity relation with SFR; the second component of the fundamental metallicity relation.We find a clear and smooth negative trend between gas metallicity and SFR for masses up to around 10 11.5 M ⊙ , with a 0.2 − 0.3 dex evolution in the median gas metallicity across the full dynamic range of SFR, across most stellar masses.This is qualitatively consistent with other measurements in the literature (Mannucci et al. 2010;Dayal et al. 2013;Salim et al. 2014;Zahid et al. 2014;Curti et al. 2020 3.2 < z < 3.6 Figure 9.The star-forming sequence predicted by our forward model.The red line shows the median of our predictions of log 10 (SFR), with the red shaded regions showing the 68 and 95% intervals of the conditional distribution at a given mass (estimated in a rolling mass window).The gray shaded region shows the estimated mass completeness limit in each redshift bin (see footnote 14).All subsequent figures follow the same plotting scheme unless noted.The blue dashed lines show the inferred SFS from Leja et al. (2022).Note that the Leja et al. (2022) predictions are for the logarithm of the mean SFR.Thorne et al. 2022), and sits roughly in the middle in terms of the magnitude of the trend compared to recent measurements (e.g., Curti et al. 2020 find a trend of up to 0.5 dex, while Thorne et al. 2022 report an overall variation of only 0.13 dex with SFR).Whether or not there exists a dependence of the massgas metallicity relation with SFR at all (and hence the existence of the FMR as a fundamental plane) is still under debate, with some studies finding a negative trend between gas-metallicity and SFR (Mannucci et al. 2010;Dayal et al. 2013;Salim et al. 2014;Zahid et al. 2014;Curti et al. 2020;Thorne et al. 2022), while others report no significant correlation (Sánchez et al. 2013(Sánchez et al. , 2017(Sánchez et al. , 2019) ) or even a positive trend (Lara-López et al. 2013).Nevertheless, our measurement of the FMR is qualitatively consistent with the most recent measurements (Curti et al. 2020;Thorne et al. 2022), and with the physical expectation of a negative trend between gasmetallicity and SFR (Mannucci et al. 2010;Dayal et al. 2013).
We report that inclusion of the gas ionization parameter in our SPS model was essential to recover reasonable inferences about gas-metallicity: without log U gas , our population-model was unable to recover physically sound predictions for mass-gas metallicity relation and FMR.

Dust attenuation
The microscopic properties of dust grains (e.g.size, material, etc.) govern their interaction with light, and the direct impact this has on a galaxy's SED (see e.g., Calzetti 2001 or Draine 2003 for a review).Dust grains also impact SEDs through their key role in galaxy star formation, as their surfaces act as favorable media for the formation of molecular hydrogen (Gould & Salpeter 1963;Hollenbach & Salpeter 1971).Dust also serves as a key component in regulating heating and cooling, further affecting the star formation cycle (Yamasawa et al. 2011).Observations of how dust properties relate to other galaxy characteristics are important in constraining models of galaxy evolution, with key observational targets including the degeneracy between attenuation slope and optical depth, star-dust geometry, and correlations between dust properties with mass and star formation rate (e.g.Burgarella et al. 2005;Noll et al. 2009;Garn & Best 2010;Buat et al. 2012;Zahid et al. 2013;     Kriek & Conroy 2013;Chevallard et al. 2013;Reddy et al. 2015;Salim et al. 2016;Salmon et al. 2016;Leja et al. 2017;Salim et al. 2018;Salim & Boquien 2019;Nagaraj et al. 2022;Lower et al. 2022Lower et al. , 2024)).
Figure 12 (left panel) shows our inferred relationship between (diffuse) dust attenuation and SFR.We see that quiescent galaxies have a tendency toward little or no dust attenuation, although a tail out to non-negligible dust contributions for quiescent galaxies is present.For log 10 (SFR) ≳ 0 the typical level of dust attenuation increases and the spread broadens.Studies of SDSS star-forming galaxies (using H-α emission or the Balmer decrement to measure attenuation, e.g.Garn & Best 2010;Zahid et al. 2013) show very similar behavior, with dust attenuation picking up around log 10 (SFR) ≳ 0.More recently, a photometric analysis by Nagaraj et al. (2022) using 3D-HST data and the Prospector SPS model also found a strong increase in optical depth at log 10 (SFR) ≳ 0, very similar to our results.
The relationship between dust attenuation and stellar mass (Figure 12, middle panel) shows a broadening and increase in dust attenuation for galaxies ≳ 10 10 M ⊙ .Nagaraj et al. (2022) also find a similar relationship between dust attenuation and stellar mass, where galaxies ≳ 10 10 M ⊙ have higher dust attenuation on average (by around a factor of two) compared to those ≲ 10 10 M ⊙ .Similar results are found by Salim et al. (2018), who identify a tendency for higher attenuation values (and a larger scatter) to be seen for more massive galaxies.Our result is also consistent with previous studies of SN Ia host galaxies, where the distribution of extinction values is typically observed to be broader (longer tailed) in galaxies ≳ 10 10 M ⊙ (e.g.Sullivan et al. 2010;Childress et al. 2013;Thorp et al. 2021;Meldorf et al. 2023;Grayling et al. 2024).
In Figure 12 (right panel), we show our inferred relation between dust law index and dust attenuation (for the diffuse dust component).We see a trend towards higher n (shallower attenuation law) for galaxies with higher levels of attenuation, with substantial dispersion (∼ 0.3) of the dust index n for any given attenuation A V .This is qualitatively consistent with recent literature (Buat et al. 2012;Kriek & Conroy 2013;Reddy et al. 2015;Salim et al. 2018;Álvarez-Márquez et al. 2019;Battisti et al. 2020;Nagaraj et al. 2022), and with expectations from radiative transfer calculations (e.g.Witt & Gordon 2000;Chevallard et al. 2013).We leave an extended quantitative comparison with previous literature to future work.

Gas physics
While the detailed connection between gas dynamics and star formation is non-trivial, one clear expectation is that gas ionization will increase with increased star formation activity, with massive young stars contributing heavily to the ionizing photon budget.Our population model predicts a clear increasing trend in gas ionization with specific star formation rate (Figure 13), qualitatively consistent with previous studies (eg., Kaasinen et al. 2018) and in line with physical expectations.The slope and normalization from Kaasinen et al. (2018) differ somewhat from our model.We expect relatively weak constraints expected on gas ionization from photometry alone, with emission-lines typically contributing a few percent at most to broad-band fluxes; the level of agreement with Kaasinen et al. (2018) is very reasonable given the limitations of photometric observations.It is also possible that some differences are due to selection effects in the Kaasinen et al. (2018) sample.

Active galaxies
Figure 3 shows some structure in our calibrated population model between AGN and other SPS parameters; most notably, a tendency for the brightest AGN (higher f AGN ) to be redder (higher τ AGN ), in line with physical expectations (see also Figure 14).We note that the sharp peak at low values of f AGN (and the corresponding spike at intermediate values of τ AGN ) is an artefact of how we perform the population-model fits, and the information content of the data.For the portion of the galaxy population with little or no AGN contribution, there are no AGN constraints from the data and hence nothing to prefer a sharp peak at some negligible value of f AGN over any other distribution over very low values of f AGN : neither have any discernible impact on the model predictions.Similarly there are no meaningful constraints on τ AGN for galaxies with no AGN contribution; the fact that our model gives all the galaxies with no AGN intermediate values of τ AGN has no impact on our model predictions.While inclusion of AGN is important for population-modeling, drawing detailed inferences about AGN physics likely requires a more sophisticated parameterization of the AGN contribution to the galaxy SEDs.

Impact of emission-line calibration on population-level inference
A key feature of our model is the ability to selfcalibrate emission-line corrections together with the population-model, photometric calibration and uncertainty model.It is informative to explore which population-level inferences are most affected by the emission-line calibrations, and whether those correc-tions are physically reasonable.Of the all the relations studied in this paper, we find that only the FMR, gas ionization-sSFR relation, and AGN parameters receive any appreciable corrections due to emission-line calibration.This is expected, since the gas metallicity, gas ionization and AGN parameters are expected to be most sensitive to the details of emission-line modeling.Key emission-lines and line-ratios relevant for metallicity and gas-physics (e.g., Hα, Hβ, [OIII] / Hβ and [NII] / Hα) receive considerable (∼30-60%) corrections in our model fit (see Table 3).
Figure 14 (Appendix A) shows model fits with and without allowing the emission-line calibration parameters to vary, for the FMR (top row), gas ionization-sSFR relation (middle row), and AGN luminosity versus optical depth (bottom row).Emission-line calibration induces a ∼ 0.1dex shift in the normalization of the FMR, with the shape remaining largely unchanged.For the gas ionization-sSFR relation, the model without emissionline calibration barely recovers any correlation between gas ionization and SFR, while the model with emissionline corrections elicits a clearer positive trend between gas ionization and SFR (in line with physical expectations), with around 40% reduced scatter.For the AGN sector, without emission-line calibration no appreciable constraints on the AGN parameters are recovered, while the model with emission-line corrections recovers a clear (positive) correlation between AGN luminosity and optical depth, in line with physical expectations.
The fact that inclusion of emission-line calibration leads to physically reasonable corrections to the gasphysics and AGN sectors supports the importance of emission-line calibration for obtaining accurate population-level inferences under SPS models.We leave a detailed study of the impact of specific line-and lineratio corrections and their relation to metallicity, gasand AGN-physics results to future work.

DISCUSSION
The pop-cosmos population model presented in §5 is calibrated down to an r-band magnitude of r < 25.Since selection is corrected for, we expect the model predictions to be valid somewhat deeper than r < 25, becoming less reliable into the fainter regime where the data is lacking.One of the primary use-cases of our population model in a cosmological inference context is for predicting galaxy redshift distributions from deep, broad-band data from Stage IV surveys such as LSST (Alsing et al. 2023).The gold sample for LSST is expected to have a limiting magnitude r < 25.3, only 0.3 magnitudes deeper than the COSMOS sample used in this work.Care will need to be taken in examin- ing the extent to which pop-cosmos can extrapolate to r < 25.3; we leave this to future work.As a steppingstone to Stage IV cosmological survey applications, we will present pop-cosmos enabled redshift distribution estimation for Stage III (KiDS) data in a companion paper (Loureiro et. al., in prep.), as well as the utility of pop-cosmos as an improved model and prior for individual galaxy photo-z estimation (Thorp et al. 2024a).
In the context of calibrating an accurate galaxy population model for improving cosmological analyses (e.g., Alsing et al. 2023;Moser et al. 2024), the galaxyevolution results presented in §5 are a essentially sideeffect of pursuing accurate redshifts.Nonetheless, the advancement in methodology means that many of the inferred scaling relations may be better measured by our new framework.
Measurement of the FMR, mass-metallicity-redshift and sSFR-gas ionization relations has generally been considered challenging or impossible from photometric data alone.Nonetheless, in §5.5-5.8 we present measurements of these relations that are qualitatively consistent with previous results (from spectroscopic data).This opens up an exciting new approach to measuring these quantities, which merits further investigation16 .Even where our predictions about gas-phase physics do not agree in detail with spectroscopic measurements (likely due to the limitations of photometric data), we do not expect this to have a significant effect on the photometric redshift program: corrections to broad-band photometry should be tiny17 .
In §5.1 we saw that our model predictions for galaxy colors are accurate to within observed calibration biases between different photometric extraction methods (e.g., the Farmer vs. Classic versions of the COSMOS photometry; Weaver et al. 2023a).There is hence no evidence that additional complexity in the SPS model is justified at this stage to improve the populationmodel predictions.Nevertheless, the scalability of our simulation-based inference approach opens up the possibility of including further extensions (and parameters) within the SPS model, while remaining computationally feasible.
We established that self-calibration of emissionline corrections was important for drawing reasonable population-level predictions of the gas-and AGN-sector parameters, and was shown to improve photometric redshift inferences in the COSMOS data (see, e.g., Alarcon et al. 2021;Leistedt et al. 2023).While parameterizing the mean bias in the most important emission-lines captures the leading order correction, in practice emission line strengths will be a strong function of the SPS parameters.It would be straightforward to incorporate parameter-dependent emission-line corrections in our calibration model; we leave this to future work.
Recently, another forward modeling-based approach to inferring the population distribution of galaxy parameters has been presented(popsed; Li et al. 2024), also making use of the OT distance as an objective function in their inference procedure.They use normalizing flows as their population distribution over SPS parameters, with a 12-parameter SPS model following Hahn et al. (2023), and an SPS emulation scheme similar to Alsing et al. (2020).They demonstrate the method on broadband SDSS ugriz photometry for a sample of galaxies from the Galaxies and Mass Assembly survey (GAMA; Driver et al. 2011;Baldry et al. 2018), with a depth of r < 19.8 and at relatively low redshift z ≲ 0.45, showing in particular that they can recover the star-forming main sequence (c.f.our §5.4 and Figure 9).cluding intermediate and narrow bands) also allows us to constrain a comprehensive range of galaxy evolution physics, for a diverse galaxy sample.As a result our calibrated population-model can faithfully predict galaxy colors over a wide range of wavelengths and redshifts, with direct utility in a cosmological inference context for Stage III and IV surveys.
Another forward modeling-based approach has been developed by Moser et al. (2024) and applied to imagelevel data from HSC (Aihara et al. 2022).Their population model is parametric, with galaxy spectra built up from Kcorrect templates (Blanton et al. 2017), and source detection within their simulated images being handled by SExtractor (Bertin & Arnouts 1996).Their inference is carried out using an approximate Bayesian computation (ABC) scheme (also employed by Tortorelli et al. 2020Tortorelli et al. , 2021)).Our work differs from theirs in utilizing a continuous SPS model (rather than templates) for galaxy SEDs, and a flexible (diffusionmodel) parameterization of the population-model, while jointly calibrating the population-and data-model simultaneously.OT optimization is also expected to scale favourably to high-dimensional problems on large datasets, where ABC quickly becomes computationally infeasible due to the high number of simulations required (see e.g., Alsing et al. 2019).Nevertheless, forward modeling at the level of images represents an important advance, and may be necessary for including and correcting for image-based selection cuts in future analyses.

CONCLUSIONS
We have presented pop-cosmos: a comprehensive population model fit to a large, deep, flux-limited sample of galaxies from COSMOS.We constructed a detailed forward model for the COSMOS data, including a flexible diffusion-model parameterzation of the populationdistribution of galaxy characteristics, a state-of-the-art (16-parameter) SPS model, and a detailed data-model describing the observation, calibration and selection processes.By comparing synthetic and real data in a simulation-based inference setting, we were able to jointly fit the population-model while self-calibrating the data-and calibration-model parameters in a selfconsistent fashion.As a result, we obtained a robustly calibrated population model describing galaxies down to r < 25 and out to redshift z ≃ 3.5.
Our population model is able to faithfully reproduce galaxy colors (to within the limitations of the photometric calibration of the COSMOS data), and encodes a comprehensive and compelling picture of galaxy evolution processes.This represents the first time that it has been possible to jointly infer the full, complex web of dependencies between galaxy characteristics, together with the photometric noise, data-and modelcalibration, and principled correction of selection: a key milestone in the analysis of large galaxy surveys.
Accurate galaxy population models calibrated to large, deep, narrow band (or spectroscopic) data are of key importance in drawing robust cosmological measurements from galaxy surveys.We expect the pop-cosmos model and its successors to open up new capabilities in accurate redshift estimation from photometric data, eliminating systematics in transient cosmology due to correlations between host galaxy properties and supernovae, and in modeling and inferring the galaxy-halo connection.

Figure 1 .
Figure 1.The subset of 26 COSMOS bands use in this work.Broadband transmission curves are rescaled to have peak transmission of 1.0, intermediate bands ("IA/B. . .") are scaled to peak at 1/3, and narrow bands ("NB. . .") to peak at 0.2.

2.
Pass base samples u 1:M to the population-model (score-based diffusion model) to generate draws of galaxy parameters φ 1:M , by solving the ODE in Equation (8) (given the current values of the population-model parameters ψ); 3. Compute calibrated photometry f 1:M for each galaxy assuming the SPS and calibration mod-els (Equation 2, given the current values of the data-model parameters η); 4. Pass base samples s 1:M and the model fluxes f 1:M through the uncertainty model (Equation 10) to generate photometric noise variances for each galaxy σ P,1:M (given the current values of the uncertainty-model parameters χ); 5. Compute additional uncertainty contributions σ EM,1:M due to emission-lines (Equation 3), and add in quadrature to get total uncertainties σ base samples n 1:M and the model photometry to the noise model (Equation 4) to generate noisy mock photometry, d 1:M = f 1:M + σ 1:M ⊙ n 1:M ; 7. Apply selection cuts (and trim the number of retained objects to N if necessary) to give a mock catalog D(u, s, n|ψ, η) = {d 1:N , S 1:N = 1} of the desired size, N .

Figure 4 .
Figure4.Comparison of the marginal distributions for the magnitudes predicted by our model (red), versus the COSMOS data (blue).Comparison is shown in the observed data-space, i.e., for r < 25 selected galaxies and with photometric noise and calibration included (to ensure like-for-like comparison with the data).We show a subset of the 26 bands, spanning the full wavelength range.

Figure 5 .
Figure 5.Comparison of the marginal color distributions predicted by our model (red), versus the COSMOS data (blue).Comparison is shown in the observed data-space, i.e., for r < 25 selected galaxies and with photometric noise and calibration included (to ensure like-for-like comparison with the data).We show a subset of key colors followingWeaver et al. (2022).

Figure 6 .
Figure6.Color-color diagrams comparing the COSMOS data (blue) to predictions from our trained forward model (red), in the observed data-space (i.e., with photometric noise and calibration included in the model predictions to ensure like-for-like comparison with the data).Contours show 68 and 95 per cent levels.We show a subset of key colors followingWeaver et al. (2022).

Figure 7 .
Figure 7. Photometric redshift distribution predicted by our forward model (red curve), with estimated uncertainties due to cosmic variance (red envelope).The blue histogram shows redshift estimates for COSMOS using LePhare (from Weaver et al. 2022).

Figure 8 .
Figure 8. Predicted galaxy stellar mass functions in redshift bins of widths ∆z = 0.4.Our forward model predictions are shown in red (dotted in the incomplete region), with the LePhare mass distributions shown as blue histograms.To ensure a like-for-like comparison with LePhare, we show the mass function for galaxies with an r-band magnitude cut r < 25.Gray shaded regions indicate the region where the selected sample plotted begins to become incomplete (see footnote 14).

Figure 10 .
Figure 10.Upper left panel: predicted stellar metallicity vs. mass relation.Upper right panel: median predicted massmetallicity relation in redshift bins of width ∆z = 0.4.Lower left and right panels are the same, but for gas metallicity.Grey bands indicate where the COSMOS sample becomes incomplete (see footnote 14).

Figure 11 .
Figure 11.Fundamental metallicity relation showing median predicted gas metallicity conditional on stellar mass in bins of log 10 (sSFR).

Figure 12 .
Figure 12.Left panel: predicted diffuse dust attenuation as a function of SFR (left) and stellar mass (middle), and the index of the dust attenuation law as a function of dust attenuation (right).

Figure 13 .
Figure 13.Predicted dependence of gas ionization on sSFR.The blue line shows the relation from Kaasinen et al. (2018).
Our work goes further than Li et al. (2024) by constructing and fitting a comprehensive forward model for the data, including: a flexible population-model; stateof-the-art (16-parameter) SED model; self-calibration of the data-modeling (noise and calibration); and explicit treatment of selection.By utilizing a larger, deeper galaxy sample, we cover the depth (r < 25) and redshift range (z ≲ 3.5) required for modeling Stage III and IV wide-deep galaxy surveys.The broad wavelength range covered by the 26-bands used here (inlevel distributions and correlations with other parameters being even less significant.

Figure 14 .
Figure 14.Comparison of pop-cosmos fits with (left column) and without (right column) emission-line calibration, for three population-level quantities most impacted by emission-line calibration: the fundamental-metallicity relation (FMR; top row), gas ionization-sSFR relation (middle row), and the relationship between the AGN luminosity and optical depth (bottom row).
Compute photometry: Rest-frame spectral energy distributions (SEDs) l(λ; φ) are calculated for each galaxy, given its SPS parameters φ and the assumed SPS model.The photometry f b in each band b, for each galaxy, is then obtained by: 1. Draw galaxy parameters: SPS parameters φ and redshifts z are drawn for each galaxy from the population-model P (φ, z|ψ).Inference of the population model parameters ψ is the main target of our analysis.We parameterize P (φ, z|ψ) as a score-based diffusion model (Song & Ermon 2019; Song et al. 2020a,b; see §3.2 for details); 2.

Table 1 .
Summary of key notation and SPS model parameters.