EFTofLSS meets simulation-based inference: σ 8 from biased tracers

Cosmological inferences typically rely on explicit expressions for the likelihood and covariance of the data vector, which normally consists of a set of summary statistics. However, in the case of nonlinear large-scale structure, exact expressions for either likelihood or covariance are unknown, and even approximate expressions can become very cumbersome, depending on the scales and summary statistics considered. Simulation-based inference (SBI), in contrast, does not require an explicit form for the likelihood but only a prior and a simulator, thereby naturally circumventing these issues. In this paper, we explore how this technique can be used to infer σ 8 from a Lagrangian effective field theory (EFT) based forward model for biased tracers. The power spectrum and bispectrum are used as summary statistics to obtain the posterior of the cosmological, bias and noise parameters via neural density estimation. We compare full simulation-based inference with cases where the data vector is drawn from a Gaussian likelihood with sample and analytical covariances. We conclude that, for k max = 0.1hMpc-1 and 0.2hMpc-1, the form of the covariance is more important than the non-Gaussianity of the likelihood, although this conclusion is expected to depend on the cosmological parameter inferred, the summary statistics considered and range of scales probed.


Introduction
The current standard cosmological inference procedure consists in providing an analytical likelihood for the data vector considered, which together with parameter priors and sampling methods such as Monte-Carlo Markov Chain (MCMC) yields the posteriors of the parameters of interest given the observed data [1,2].Focusing on the case of a galaxy catalog from a certain cosmological survey, it is standard to compress the information contained in this catalog and infer the cosmological parameters using summary statistics as the data vector, such as the galaxy power spectrum [3,4].The first natural question then is how to perform the galaxy field compression, in order to maximize the amount of information extracted.Many alternatives have been proposed in the literature to go beyond two-point statistics, such as higher n-point functions (bispectrum, trispectrum, etc), Minkowski functionals, wavelets, k-nearest neighbor cumulative distribution functions, peak counts, void statistics, among many others (see, e.g., [5][6][7]).Of course, a given cosmological parameter, such as the amplitude of fluctuations A s or dark energy equation of state w, affects different observables differently, so that the optimal compression choice is highly dependent on the parameter to be inferred.Recently, exciting developments are being made towards field-level inference, where the information from the entire galaxy density field is used to sample both the cosmological parameters and the initial phases with Hamiltonian Monte Carlo (HMC) due to the high-dimensionality of the parameter space [8][9][10][11][12][13][14][15][16].
In this paper, we defer the difficult question of optimal compression, and instead focus on the lowest order n-point functions, namely the galaxy power spectrum and bispectrum.These summaries are well motivated on quasilinear scales by perturbative approaches to large-scale structure (LSS).Having fixed the definition of the data vector, the next problem consists in predicting the distribution of each of the summary statistics.For essentially all relevant LSS observables, only approximate expressions for these distributions are known.Even if an exact likelihood expression could be written down and used for inference, there would remain additional issues with this standard explicit-likelihood approach which we discuss in detail below; in particular, covariance estimation and binning effects.
Simulation-based inference (SBI) [17,18], also known as likelihood-free inference (LFI) or implicit-likelihood inference, arises as an alternative to this approach.SBI makes use of a simulator model that specifies a process to generate "mock" data; that is, for a given set of parameters θ, the model generates (or forward-models) independent samples of p(x|θ), where x is the data vector (of summary statistics, in our case).The inference makes use of the simulated dataset to obtain the posterior and, since it does not require a direct evaluation of the conditional density, is therefore in contrast to density models, or explicit-likelihood models, where p(x|θ) has to be specified to obtain the posterior via Bayes' rule.This latter class of models is currently the standard in cosmology inference.
Simulator models, by virtue of their construction, operate as dynamic representations of the actual mechanisms at work based on our well-established scientific knowledge, providing a direct and intuitive link between theoretical models and observable phenomena.They are therefore more natural and interpretable than density models, where the density function specification depends on a more abstract understanding of real-world phenomena and often requires analytical approximations [19].
Although we work with n-point functions in this paper, it is important to stress that the SBI approach generalizes straightforwardly to any general data compression scheme, as long as the simulator still provides accurate predictions and the data vector is not extremely large.One can therefore study different summary statistics and their combinations with no concerns on finding an analytical expression for their distribution.This facilitates the use of more informative summary statistics, in particular those learned by neural networks [20,21], which can be flexibly optimized according to each particular inference problem.In the context of galaxy clustering, recent progress has been made by using graphs networks to capture the map information from the galaxy distribution [22,23].
Nonetheless, no work has been done in the specific context of the Effective Field Theory of the Large Scale Structure (EFTofLSS) and the bias expansion [49], which together provide a rigorous framework for modelling galaxy clustering as described in Section 2.1.Recently, a significant effort has been made towards constraining cosmological parameters from galaxy clustering data using the usual likelihood-based analysis [3,4,[50][51][52][53][54][55][56].Although this has been a successful program for cosmological inference, it is faced with significant practical and conceptual challenges, and we discuss below some of the main advantages of using SBI for galaxy clustering analysis instead.
Why simulation-based inference for galaxy clustering?Likelihood approximation.Focusing on the cosmological inference procedure for galaxy clustering, the use of an analytical approximation for the likelihood becomes already inadequate for the galaxy power spectrum, the "simplest" summary statistic.It is well known that the Gaussian approximation for cosmological two-point functions fails at low wavenumbers k, since they are estimated as a sum of field amplitudes squared, and cosmic variance breaks the central limit theorem due to the small number of modes [57,58].This issue is investigated in [58] for weak lensing power spectra, in [59] for the 3D galaxy power spectrum, and in [60] for a log-normal density field.The Gaussian likelihood approximation particularly affects the posterior of parameters that are sensitive to low-k modes, such as in the case of f NL [61], which expresses deviations from Gaussianity in the perturbations generated during the inflation epoch.In addition, by coupling different Fourier modes, nonlinear evolution also breaks the Gaussian assumption for the likelihood on small scales, not only for the power spectrum but also for many other summary statistics.
Covariance estimation.Even if the Gaussian likelihood assumption is accurate, we need to determine the covariance of the data vector.The usual procedure is to either assume an analytical approximation, or to estimate the covariance from simulated mock catalogs with fixed cosmology.Analytical approximations are not always sufficient, and estimation from mocks might be cumbersome and unpractical depending on the summary statistics chosen.It is worth noting in this context that existing EFT-based analyses of galaxy clustering make use of covariances estimated from mocks, while the mean of the data vector is based on the EFT, which is strictly speaking inconsistent.In addition, there is the issue of at which (fixed or varying) point in parameter space the covariance of the data vector should be computed.SBI circumvents these issues and provides a means to use a consistent prediction of the entire distribution of the data vector and its dependence on the parameters θ, including correlations between different elements of the data vector and higher-order moments.All of this is of great importance for scientific reasoning and the assurance that the errors are not underestimated via a poor covariance estimation; however, specific testing methods have to be applied to the SBI posterior for an evaluation of whether it itself is underestimating the errors [62].
Binning effects.In standard explicit-likelihood approaches, it is necessary to bin-average the theory prediction for the data vector in accordance with the binning used for the data.This is know to be important in the bispectrum case, see for example [63].For the bispectrum, there is also the distinction between "open" and "closed" triangles, which arises from the fact that "open" triangle bins (i.e., those which do not satisfy the closed triangle condition) contain individual "closed" triangle modes [64].These concerns are not present in the SBI approach, as the data vector can be freely chosen, as long as the same procedure is applied to observed and simulated data.
Forward modeling.As already mentioned, SBI makes use of simulator models to obtain the simulated dataset and therefore is an application of forward modeling, where a prescription of the generative process for the data vector as a function of the model parameters is given.Although not particular to SBI, forward modeling presents several advantages over other methods, such as the straightforward inclusion of observational effects, including window functions, redshift-space distortions and systematic effects [46,57,[65][66][67][68].In the context of the EFT forward model employed here, forward modeling at the field-level allows us to reach essentially arbitrary perturbative (loop) order [69].
To sum up, galaxy clustering for cosmological precision analysis is inevitably moving towards the need of higher n-point functions, combining different summary statistics and expanding the range of scales probed.It is evident how the covariance computation complexity scales with higher n-point functions, and how the need of an analytical approximation for the likelihood limits the summary statistics which can be used for analysis.Given a trustable simulator model, SBI therefore provides a flexible and rigorous framework for statistical inference in the context of galaxy clustering, provided that the following points are addressed: model misspecification, or how accurate is the simulator, convergence, or how many simulations are needed for a reliable posterior estimation, and diagnostics, or how reliable is the estimated posterior.In particular, we are interested in determining the calibration of the posterior [62,70] to check whether the uncertainties of the posterior are at least not underestimated.
In this paper, we take advantage of our understanding of galaxy clustering on large scales to test the SBI performance based on LEFTfield, a rigorous simulator based on the EFTofLSS.Using this tool as a first step is of crucial importance, since we tackle model misspecification by keeping all modes under control, cutting small scales where we should not trust our simulator and using the bias expansion on the scales where it is known to be robust.The simulator uses Lagrangian Perturbation Theory (LPT), which is fast to evaluate and therefore allows a detailed study of convergence with respect to the number of simulations and also careful posterior diagnostics.
Using the bias expansion constructed in LEFTfield, we employ the combination of the galaxy power spectrum and bispectrum to break the degeneracy between the bias parameters and the amplitude of fluctuations σ 8 .Our main goal is to assess the potential impact on cosmological inference when we consider the entire data vector distribution through SBI rather than relying on the assumption of a Gaussian likelihood.We also provide a comprehensive description of how simulation-based inference works in practice, as this is one of the first applications of this technique to galaxy clustering.In the context of SBI, this is the first time that the complete second-order bias expansion is used.Our work is thus complementary to recent work [45,46,48], which uses halo occupation distribution (HOD) as a bias model, and includes the power spectrum on smaller scales.Note that our forward model is numerically much less costly than the one used in [45,46,48], allowing us to generate many more independent simulated data realizations.
The paper is structured as follows.In Section 2.1, we review perturbation theory for galaxy clustering, while the details on the forward model can be found in Section 2.2.The power-spectrum and bispectrum estimators used are described in Section 2.3, and we explain in detail how simulation-based inference works in Sections 3.1 and 3.2.The results are shown in Section 4.1 (linear forward model), and in Section 4.2 (LPT-based forward model).We conclude in Section 5.

Perturbation Theory
In order to study the dynamics of matter, we can write the geodesic equation for the Eulerian comoving position x(τ ) of a Newtonian pressureless fluid element and relate it to the gravitational potential Φ(x, τ ).If we denote the Lagrangian position of this fluid at τ = 0 as q, we obtain x(q, τ ) = q + s(q, τ ), where s(q, τ ) is the Lagrangian displacement along the fluid trajectory, which is our dynamic variable in the Lagrangian picture.We can then write the geodesic equation for the displacement and relate it to the gravitational potential Φ(x, τ ), which in turn is connected to the matter density contrast δ(x, τ ) via the Poisson equation.The continuity equation for matter combined with Eq. (2.1) further implies that where we defined the Lagrangian deformation tensor M ij ≡ ∂s j /∂q i , allowing us to write the equation of motion of the displacement in terms of q only [71].Lagrangian Perturbation Theory (LPT) then proceeds by expanding the displacement as [72] and equivalently for the deformation tensor to obtain the LPT recursive relations [71,73,74].
It is also convenient to separate the displacement into a longitudinal σ ≡ ∇•s and a transverse component t ≡ ∇ × s, where the resulting evolution equations show that the transverse contribution t only becomes relevant at third order [69,75].The time component can be factorized out, which allows for writing evolution equations for σ and t order by order as a function of their initial conditions, and also for the construction of a closed bias expansion (see below).Although this factorization is mostly used in the context of an Einstein deSitter (EdS) Universe, it is also possible for a general expansion history [76].
The equivalence principle guarantees that the leading observable for non-relativistic observers is the tidal field ∂ i ∂ j Φ and its time derivatives along the fluid trajectory.In the EFTofLSS context [77][78][79][80][81], we aim to describe the number density of galaxies by including all possible dependencies on this quantity at a given fixed order in perturbation theory.By doing so, we refrain ourselves from modeling all unknown small-scale physics, as they can be absorbed into EFT parameters to be determined from data.The ultraviolet (or small-scale) cutoff Λ should be seen as a computational tool that guarantees loop integrals remain finite, as any observable has to be independent of its choice.In the following, we will not explicitly indicate the cutoff Λ for clarity, but re-instate it in the next section.
Since at leading order in spatial derivatives all the allowed terms by EFTofLSS are contained in M (q, τ ) and its time derivatives [82], the deterministic part of the galaxy overdensity field1 in the Lagrangian picture can be written as where F g is a nonlocal in time functional.By expanding F g in M , we obtain all the rotational invariants of M , and the time integral can be performed thanks to the aforementioned time factorization.We can construct the Lagrangian bias expansion as where b O L denote the Lagrangian bias parameters that encode all complex and non-linear physics of galaxy formation from F g .The Lagrangian operators O L are determined by taking all scalar invariants of the symmetric part of M (n) and their independent products.It is not necessary to include tr[M (n) ] for n > 1 nor the antisymmetric contributions, since they can be expressed in terms of lower-order symmetric ones via the equations of motion [71,74].The local basis of the deterministic Lagrangian operators up to second order is then given by [82] (2.7) So far, we have assumed that the galaxy formation process is perfectly local in space.Going beyond this approximation requires the introduction of higher-(spatial-)derivative contributions, the leading of which is ∇ 2 δ.By dimensional analysis, these contributions are associated with a spatial length scale R * .Supposing that R * is of the same order as the nonlinearity scale, then within the perturbative regime it is sufficient to keep the leading order high-derivative term b ∇ 2 δ ∇ 2 δ(x, τ ), as the higher-order ones are suppressed by more powers of (R * k) 2 .The higher-derivative bias are naturally introduced when imposing that the observables should not depend on the EFT cutoff [49], where it is important to stress that the nonlocality scale R * differs from the cutoff scale Λ −1 .In particular, all the counterterms needed to renormalize the operators of Eq. (2.7) are the ones already listed there, together with their associated higher-order derivatives terms [82].
The observed galaxy density field is in Eulerian space, while our expansion so far has been treated in Lagrangian space.We use Eq.(2.1) to advect the Lagrangian operators to Eulerian space.The continuity equation for galaxies gives the relation between the galaxy overdensity in Eulerian and Lagrangian coordinates, i.e., By expanding the displacement as in Eq. (2.3) and the exponential in Eq. (2.8), it is possible to obtain the relation between the Lagrangian and Eulerian basis order by order [83].It is worth emphasizing that Lagrangian and Eulerian descriptions correspond merely to different coordinate choices, that is, one does not assume a conserved galaxy number.In the Eulerian case, we perform a perturbative expansion from Eq. (2.2) in terms of the matter overdensity instead of the Lagrangian displacement.
So far we have discussed the deterministic operators, as in the form of Eq. (2.6), but the complete basis in the bias expansion should also account for the stochastic effect of small scales on galaxy formation.Since the stochastic fields are uncorrelated with large-scale perturbations, they are completely described by their n-point functions.For the two-and three point functions, we have that, at leading order and at a fixed time slice, where the prime denotes that the momentum-conserving Dirac delta function is dropped, b 1 is the Eulerian bias associated to δ(x, τ ) and P m is the non-linear matter power spectrum.The noise parameters P ε , B ε and P εε δ are to be determined together with the bias parameters, where the posterior of the cosmological parameters can be obtained by marginalizing over them.
Just like for large-scale perturbations, galaxy formation also depends nonlocally on smallscale perturbations inside the region characterized by R * , such that we also have to take into account higher-derivative contributions for the stochastic fields.In Fourier space, these contributions are suppressed by k 2 R 2 * on large scales [49], and we neglect them here.In our EFT forward model we construct all operators starting from the linear, sharp-kfiltered density field contrast δ Λ .This is related to the expansion of M by expanding Eq. (2.2) to first order, yielding σ Λ ≡ tr[M Λ . (2.10) The higher-order LPT basis terms are then constructed out of σ Λ via the LPT recursion relations.We stress the difference between our finite, explicit cutoff Λ and the formal cutoff used in semi-analytical loop calculations, which is usually sent to ∞ [78,84,85].

LEFTfield
The forward model we use to get from the initial conditions, cosmological, noise and bias parameters to the final data is LEFTfield, which evolves a given initial (linear) density field up to n-th order in LPT [69].This represents a powerful tool for cosmological inference using SBI.As already discussed, statistical inference in SBI is dependent on the quality of the simulator and the simulation budget.Using the EFT-based approach, we keep all the modes under control up to a certain scale Λ, such that SBI only learns the features we trust from our simulator.We therefore combine two big advantages of EFTofLSS for SBI: (i) theoretically robust treatment of bias; and (ii) fast evaluation thanks to the fact that we only need to follow the evolution of modes up to Λ, making it a perfectly suitable tool for analysing convergence and coverage of SBI algorithms.For example, using a Subhalo Abundance Matching (SHAM) [86] approach would be less theoretically robust, while being much more costly in comparison.The price to pay of course is that our forward model is limited to scales where perturbation theory is valid.
Our model differs from those commonly used in the EFTofLSS literature where the basis for the n-point functions are generated by methods such as CLASS-PT [87] and PyBird [88], since we directly measure the n-point functions on the simulated galaxy density field.By doing so, we are also simulating the full (non-Gaussian) distribution of our data vector, as generated by the LPT-based simulator, and SBI can provide us with the correct distribution for the summary statistics considered.Each simulation realization is run with a different seed for the density and stochastic fields, so that our model correctly accounts for cosmic variance.Note that SBI models the probability density of the data vector, and is therefore also in contrast to emulators [89], where the expectation value of the summary statistics is learned.Also, our simulator naturally incorporates IR resummation by way of the Lagrangian forward model.
LEFTfield, which is based on [69], works as follows.We start from an initial power spectrum P L which is scaled by α 2 , where we define α ≡ σ 8 /σ fid 8 and set σ fid 8 = 0.85, while all other cosmological parameters are fixed to Ω m = 0.3, Ω Λ = 0.7, h = 0.7 and n s = 0.967.A Gaussian random field δ g is then generated from this scaled power spectrum as where ŝ(k) is a Gaussian random field of zero mean and unit variance, z is the redshift and the grid is smoothed at scale Λ with a sharp-k filter W Λ (k), guaranteeing that we are only treating perturbative modes in the forward model [12].We emphasize that the inferred bias and stochastic parameters will depend on ("run with") Λ [85], while inferred cosmological parameters such as α should be independent of Λ.We then use the relation from Eq. (2.10) to set our initial conditions for σ Λ , while M and t are constructed in the same grid using the evolution equations.The displacement s(k) is then evaluated from Eq. (2.4) up to the desired order in LPT.
We then use the displacement to advect the Lagrangian operators to Eulerian space.As previously discussed, in perturbation theory this is done by both expanding the exponential and the displacement in Eq. (2.8).With the aim of preventing noise generation on largescales, instead of expanding the exponential we use a cloud-in-cell (CIC) scheme, where the "mass of the particles" corresponds to the value of the Lagrangian operators at each grid cell, i.e., M O CIC (q) = O L (q), while the extra 1 + δ term from the Jacobian is absorbed into the bias parameters.This approach is similar to what has been previously done in the literature; however, while [90] uses the Zel'dovich approximation (1LPT), we expand the displacement up to n-th order and [91,92] estimate the full displacement from N-body simulations.In this paper, we use second-order LPT (2LPT) throughout.
After the displacement s(k) is inversely Fourier transformed to configuration space and copied to a larger grid of dimension N 3 CIC , we use Eq.(2.1) to advect the deterministic Lagrangian operators O L listed in Eq. (2.7), with the exception of tr[M (1) ].Since δ and M are nonlinearly related through Eq. (2.2), the usual bias term b 1 δ(q) contributes to several terms of Eq. (2.7), so we instead displace a field with mass M δ CIC = 1 to obtain the Eulerian density directly.Although displacing tr[M (1) ] would equivalently lead to a complete basis of operators, this approach allows us to work with the linear bias b 1 , facilitating the interpretation of our results.The deterministic bias expansion in Eulerian space then reads where σ Λ ]) 2 , and we have omitted the subtraction of the mean of the secondorder operators for clarity.The stochastic contributions are added directly in Eulerian space, such that the final galaxy overdensity field reads where we also omitted the mean subtraction for ε 2 .This expression exhibits some differences from the usual stochastic contributions in the bias expansion [49].Since the simulated stochastic field ε is sampled from a Gaussian distribution with zero mean and variance P ε , non-Gaussianity of the noise, specifically the noise bispectrum B ε is generated via the ε 2 term in the model.Precisely, we obtain which thus captures both the B ε and P εε δ terms in Eq. (2.9) via c 2 ε and c εδ , as derived in Appendix A. In the Poisson limit, i.e., when the stochasticity of galaxies perfectly follows Poisson statistics, determined by their mean comoving density ng , we expect that c ε 2 = 1/6 and c εδ = b 1 /2 (Appendix A).For a flowchart representation of how our forward model is constructed, we refer the reader to [15,69].

Summary statistics
The usual procedure to extract information from the galaxy density field is through its n-point functions.Since its one-point function (the mean density) is trivial, most of the information can be captured by its two-and three-point functions or, equivalently, their Fourier transforms, namely the power spectrum and bispectrum.Under the homogeneity and isotropy conditions, the bispectrum is described by three parameters which characterize the shape and the scale of the triangles, thus encoding important cosmological information beyond the two-point function which only depends on scale [93].Especially in the context of primordial non-Gaussianities, it is convenient to separate the triangle shapes into e.g."squeezed" and "equilateral" configurations, but we do not make such distinctions as we work with all triangle configurations.
The tree-level galaxy power spectrum displays a degeneracy between the amplitude of density fluctuations, parametrized by α in our forward model, and the linear bias b 1 , while the bispectrum has the power of breaking this degeneracy [93][94][95][96].The intuition is that, at leading order in perturbations, the galaxy bispectrum will depend on a sum of different powers of the bias parameters and the linear power spectrum, allowing for a determination of the linear and second-order bias parameters which is independent of the power spectrum normalization.We refer the reader to [97] and [49] (Sec.4.1.1)for an illustration of the different shape contributions of the matter and galaxy bispectrum, respectively.
To measure such n-point functions on the grids generated by LEFTfield, we use the bispectrum estimator introduced by [98].We start by defining the quantities [99] where all p modes within a k-shell of width ∆k centered on k are integrated.The expressions for I k (x) and J k (x) therefore represent the inverse Fourier transform over a k-shell of the galaxy overdensity field, which is calculated following the procedure of Sec.2.2, and a unit field, respectively.The power spectrum estimator is defined as  where the superscript D refers to the discretized version of the fields, k 1 is the Fourier bin, N is the grid size and L is the box size, while the bispectrum estimator in turn is . (2.17) It is now evident that the power spectrum comes essentially for free when estimating the bispectrum.As already discussed in the introduction, we do not suffer from binning effects, since bin averaging is automatically taken into account in the SBI framework.This allows us to use both "open" and "closed" triangle configurations, as opposed to likelihood-based approaches where the binning effect is important for the open triangles [64,100]."Open" triangles are those for which the k-bins do not respect the momentum conservation constraint although there are individual triples of modes associated to the bin which do satisfy the relation.
Considering the most costly model in this work, i.e., the Euclid configuration described in Section 4.2.2, using LEFTfield to generate N sim = 10 5 density grids with 2LPT, construct a second-order bias expansion and compute their corresponding power spectra and bispectra with data vector size of D = 49 (6 bins for the power spectrum and 43 triangle bins for the bispectrum) takes roughly 19500 core hours when using 2GHz Intel Xeon Gold 6138 CPUs.
We show in Figure 1 the comparison of our estimators to their respective theoretical predictions.We use here a linear forward model instead of 2LPT, which is given directly from Eq. (2.11) for α = 1, where we construct the final density field in a box with size choosing the redshift z = 0.The second-order bias term generates a nonzero bispectrum.The prediction for the power spectrum estimator is then (2. 19) and P L has support up to Λ.For the bispectrum, its leading-order theory prediction in this case reads We have also tested the LPT predictions for the power spectrum and bispectrum estimator (not shown), where in the latter case we compare the closed triangle bins only.

Simulation-based inference
We describe in detail how simulation-based algorithms work in this section.Its fundamental requirement is to have a simulator model, such as LEFTfield.Let θ be the vector of parameters of interest of dimension N θ (e.g., cosmological, bias and noise parameters) and x be the data vector of dimension D (e.g., power-spectrum bins).Assuming that we have a proposal distribution over the parameters p(θ) (not necessarily the prior) and a simulator that generates x given θ, it is possible to generate a simulated dataset of N sim samples {(θ n , x n )} N sim n=1 , where (θ n , x n ) is a joint sample from p(θ, x) = p(x|θ)p(θ).We describe below how the methods which will be used throughout this paper estimate the parameters posterior from this dataset given the observed data x o .

Approximate Bayesian Computation
The first approach to perform SBI goes back to the idea of Approximate Bayesian Computation (ABC).In its simplest form, rejection ABC [101], one defines a distance metric ρ(x, x o ) between the simulated samples x and the observed data x o , and it is possible to obtain samples from an approximate posterior p(θ|ρ(x, x o ) < ϵ) by accepting the parameters θ n for which the corresponding metric ρ(x n , x o ) lies below a given threshold ϵ.In other words, the posterior is obtained by the histogram of parameters for which the simulated samples most closely resembles the observed data.In the limit ϵ → 0 and under suitable regularity conditions, the approximated posterior recovers the true posterior [102].
In practice, however, the value of ϵ controls the sample acceptance rate and the required simulation budget N sim , so due to computational efficiency reasons it can't be chosen to be arbitrarily small.This is in tension with the aim for the smallest possible threshold to obtain a better approximation of the posterior.Extensions of this method were proposed to improve its efficiency such as Population Monte Carlo (PMC) [103], where a sequential approach is used and the approximated posterior is used as proposal distribution p(θ) for the next round until it has converged.These methods however still discard all the simulations which were rejected.Below, we discuss more efficient algorithms in which the entire dataset is used for the posterior estimation.

Neural density estimators
One of the biggest problems related to rejection-based algortihms such as ABC and PMC concerns the fact that these inherently have to throw away all the simulations which lie below a certain threshold.Neural Density Estimators (NDEs) come as a rescue, since they use all available simulations to estimate the target distribution.The dataset of parameters and simulated summaries {(θ n , x n )} N sim n=1 is used to find a certain probability distribution.Neural Posterior Estimation (NPE) [104] and Neural Likelihood Estimation (NLE) [105] can use normalizing flows [106] to fit the posterior or the likelihood, respectively.There are also methods to learn the likelihood-to-evidence ratio which use classifiers instead, such as Neural Ratio Estimation (NRE) [107][108][109][110][111], but we do not discuss them here as they are not used for our analysis.We give a detailed explanation of normalizing flows in Appendix C.
Neural Likelihood Estimator.Suppose that we aim to approximate the unknown target distribution p(x|θ) as the conditional of the data x on the parameters θ.The idea is to fit a flow-based model q ϕ (x|θ) parametrized by ϕ to the dataset {(θ n , x n )} N sim n=1 by imposing that the model approximates the target distribution, where p(θ, x) = p(x|θ)p(θ).Evidently we do not know the target, but we can proceed by using the maximum likelihood estimation method which minimizes the forward KL divergence in the support of the proposals, where on the third line we identified the part which is independent of ϕ as a constant and on the last line we approximated the expectation over p(θ, x) with Monte Carlo.The resulting Monte Carlo estimate of the KL divergence we wish to minimize is therefore independent of the explicit form of the target distribution and equivalent to the sum over the negative log-likelihood of the simulated datasets batches under the flow-based model.The estimation of conditionals is a natural extension of Masked Autorregressive Flows (MAF) [112], where only the conditionals corresponding to x are modelled by augmenting the set of input variables with θ, i.e., z t = f t (z t−1 , θ) (see Appendix C).The only requirement is that x appears before θ for any order used, and no connections have to be masked out from θ to the rest of the network.Therefore, the Neural Likelihood Estimator (NLE) builds the flow-based model with MAF as [105] where the network parameters are trained with the loss given by Eq. (3.1).
Note that here the model is trained for any given data x, where for inference we are interested in a specific observed data x o .The procedure is then to use the likelihood as the estimated conditional evaluated at the observed data, and then use MCMC (which is very cheap since only the learned model has to be evaluated) together with a prior p(θ) to obtain the estimated posterior p(θ|x o ) ∝ q ϕ (x o |θ)p(θ).
When simulating the data is expensive, it is often desirable to use sequential approaches such as in Sequential Neural Likelihood Estimator (SNLE) [105], where the density estimation is done over multiple rounds focusing on a given observation x.The first proposal p(θ) is often chosen to be the prior, while the subsequent proposals are set as the estimated posterior p(θ|x) from the previous round.This of course limits the density estimation to a single observation and is therefore non-amortized, but has the advantage of requiring less simulations to obtain an accurate posterior estimation focused on x o , as the regions where the proposal density is high tend to be better approximated by the model.However, the cost of calibrating nonamortized methods is huge, as discussed below.

Neural Posterior
Estimator.An advantage of the SNLE method is that the prior can be changed and tested, but it has the MCMC as an extra computational step.This can be avoided by estimating the posterior directly, as in the case of Sequential Neural Posterior Estimator (SNPE) [104,113,114].The estimated posterior has the form where p(x) = θ p(θ)p(x|θ).Note that the proposal posterior p(θ|x) is equal to the target posterior if the proposal p(θ) equals the prior p(θ).
In its first implementation [113], the undesired dependence on the proposal p(θ) was adjusted by multiplying the estimated posterior p(θ|x) by p(θ)/p(θ).This restricts q ϕ (θ|x) to be a Mixture Gaussian Density [115] and p(θ) to be a Gaussian, so that the division can be done analytically; additionally, p(θ) can only be Gaussian or uniform.Besides these strong requirements, a drawback of this model is the fact that the division can lead to non-positive covariance matrices if p(θ) has smaller variance than any of the components of q ϕ (θ|x).
A further extension to this method was presented in [114], where the aforementioned requirements are no longer needed and the method no longer yields negative covariances.This is achieved by weighting the samples with w n = p(θ n )/p(θ n ), such that the loss is modified to − n w n log q ϕ (x n |θ n ).However, the weights can have high variance, leading to instability during the training process and inaccurate predictions in some cases [104].The algorithm developed in [104] circumvents all these issues with the idea of "atomic" proposals, that essentially replaces integrals by sums and allows for the usage of arbitrary density estimators, especially those based on normalizing flows such as MAFs.
We have tested some of the NDE algorithms implemented in the publicly available SBI package [116], and we chose to work with NPE from SBI as our baseline.There are several reasons for our choice; first, this method has been empirically observed to perform better for the inference problems considered in this work, especially for a limited simulation budget and large dimensions of the data parameter vectors.Second, it does not require an extra MCMC step as in NLE and NRE, or a retraining of the model for each observation such as for non-amortized algorithms, allowing for faster posterior diagnostics.Lastly, amortized posteriors such as the ones obtained by NPE tend to be more conservative [62].
We use the SNPE method of [104] with 10 atoms for atomic proposals and MAF with 5 autoregressive layers (i.e., stacked MADEs), each constructed using two fully-connected tanh layers with 50 hidden units.We train the models by stochastically minimizing the loss using the Adam optimizer [117] with learning rate of 5 × 10 −4 and batch size of 50, where 10% of the samples are used for validation and the training is stopped if the validation set loss did not improve after 20 consecutive epochs.
Training the most complex case considered here (Section 4.2.2, with data vector size of D = 49, N θ = 8 parameters and simulation budget of N sim = 10 5 ) takes roughly one hour on a single 2GHz Intel Xeon Gold 6138 CPU, and no significant improvement was observed 2), which constructs a model for the galaxy overdensity field δ g through nLPT (2LPT in this work).We then measure the summary statistics x, which in our case is the galaxy power-spectrum concatenated with the bispectrum (Sec.2.3).We repeat this process N sim times, and use the resulting pairs {(θ n , x n )} Nsim n=1 as input to the SBI neural density estimation (Sec.3.2), which then gives us the posterior p(θ|x) that can in turn be evaluated at a given observed data x o .
when using a GPU.Generating 10 5 samples from the estimated posterior takes less than one second, allowing for efficient calibration analysis since the NPE posterior is amortized.

Results
In this section, we present our results regarding two cases: first, a simple, linear forward model, described in Sec.4.1, where we compare ABC and NPE.In Sec.4.2, we then use a 2LPT forward model and estimate the posterior with NPE, as summarized in Figure 2.

Linear forward model
We start reporting our results with a very simple model, where we aim to infer the linear bias b 1 and the noise amplitude P ε from the galaxy power spectrum.For that, we consider a linear forward model, where instead of using LPT we use Eq.(2.11) with α = 1 to construct the galaxy overdensity field as where ε(k) is a Gaussian random field of zero mean and variance P ε .We use a box of size L = 2000 h −1 Mpc, and choose redshift z = 0, without any loss of generality in this case.We define the observed data x o as a fiducial power spectrum P (k) measured from a particular realization of the model evaluated at the ground-truth parameters b1 = 1.5 and Pε = 10 3 h −3 Mpc 3 .We then generate N sim simulated datasets, where θ = {b 2  1 , P ε } are sampled from a Gaussian prior as θ ∼ N ( θ, 10 −1 × θ) and x = {P (k)} = {P (k 1 ), P (k 2 ), . . ., P (k D )}, where D is the number of power-spectrum bins N bin , are the power spectra of the simulated datasets.It is important to stress that, for each simulation, the realization of both δ (1) (k) and ε(k) are also varied.We infer b 2 1 instead of b 1 since its posterior can be evaluated analytically assuming a Gaussian likelihood and prior.Note that the SBI posterior for b 2  1 does not necessarily converge to the analytical one, since we relax the Gaussian likelihood assumption in this case.
The ABC metric we consider is where the sum runs over the summary statistics bins (k-bins for the power spectrum).This metric is inspired by the reduced χ 2 , and is expected to be close to optimal in the present case, in addition to being straightforward to interpret.Here, D = N bin = 4, N θ = 2 and Cov[ P (k)] = 2⟨ P (k)⟩ 2 /N k , where ⟨ P (k)⟩ is the mean of the fiducial power spectrum over initial conditions evaluated at fiducial parameters θ and N k is the total number of modes which lie in the respective k-bin.As explained in Sec.3.1, the proposed parameters θ are then accepted or rejected according to this metric evaluated at the simulated data vector x, and the ABC posterior should converge to the true one in the ϵ → 0 limit.However, due to computational efforts, one can never truly reach this limit.
We take the approach of [18], where instead of defining a specific threshold ϵ we accept the parameters corresponding to the lowest quantile f of metric values.We show in Figure 3 the SBI posterior standard deviation for b 2  1 normalized by the analytical prediction, where we compare ABC to NPE.We use a minimum of 10 2 samples for estimating the posterior variance, leading to a different number of points for different choices of f .The corresponding threshold value ϵ for each percentage fraction f = {10 −1 , 10 −2 , 10 −3 , 10 −4 } is ϵ = {16.2,3.6, 1.1, 0.4}.
First, we can see that the NPE posterior converges much faster in terms of N sim than the ABC one, as expected since the effective number of samples used in ABC is f × N sim , although the total simulation cost is N sim , while NPE uses all simulations at hand for posterior estimation.Further, we find that one has to be careful when choosing the value for the treshold ϵ; values that are too large will lead to significantly overestimated error bars, regardless of the number of simulations used.If the summary statistics used is approximately Gaussiandistributed with approximately known covariance, such as in our case, defining the ABC metric via the reduced χ 2 makes the former interpretable, with convergence expected for values of ϵ of order unity, which is indeed confirmed by Figure 3 given the values of ϵ listed above.In the simple case we considered here, we can see that the posterior standard deviation converges to the Gaussian prediction, and we can conclude that for this particular case the non-Gaussianity of the power spectrum at low-k does not significantly impact the constraints.
Figure 3 serves as a useful cross-check of the different SBI approaches.In the following, due to the difficulty in achieving converged posteriors via ABC, we will use NPE throughout.

2LPT forward model
We now move to a forward model consisting of second-order LPT and second-order bias expansion as described in Section 2.2, where the galaxy overdensity field is given by Eq. (2.13).Here, x is the concatenated vector of power spectrum and bispectrum bins of dimension D = N bin + N tr , where N tr denotes the number of triangle bins for the bispectrum.We use the configuration of the Euclid satellite [118], which will cover a volume of 63 h −3 Gpc 3 with mean number density of galaxies ng = 5.2 × 10 −4 h −3 Mpc 3 at mean redshift z = 1.4.
We consider two cutoffs for generating the datasets.For Λ = 0.1 hMpc −1 , the grid size is N c = 128 3 and the dimension of the data vector is D = 33, while for Λ = 0.
where the fiducial noise parameters are set to their corresponding Poisson expectation (see Appendix A), inspired by galaxy number densities of current and upcoming spectroscopic surveys, and we choose to sample 10 −3 P ε instead of P ε in order to have all parameters to be of order unity, which is desirable for the neural density estimation.The fiducial Eulerian linear bias is set to b1 = 1.5, while btr[M (1) M (1) ] and bσσ are determined from b1 through the bias relations (see Appendix B).
Regarding b∇ 2 δ , we expect its value to be close to zero, but with variance between two and three times the Lagrangian radius of an average Euclid galaxy.To set its fiducial value, we first find the minimum mass M min which satisfies the integral for ⟨b 1 (z, M min )⟩ = b1 = 1.5 using the Tinker mass function for n(z, M ) [119] and the Tinker linear bias predictions for b 1 (z, M ) [120].We then use Eq.(4.4) for the higher-derivative bias evaluated at the obtained M min , where b ∇ 2 δ (z, M ) is determined by −R 2 L (M )/2.5 [121], to set its fiducial value as the calculated mean, i.e., b∇ 2 δ = ⟨b ∇ 2 δ (z, M min )⟩.
We compare our results to a Fisher forecast, which approximates the posterior with a Gaussian via the ensemble average of its curvature of around the maximum.The Fisher information matrix for the likelihood is defined as which for a Gaussian likelihood reduces to where the mean of the derivatives as well as the covariance of the data vector x over initial conditions realizations are evaluated at the fiducial parameters values θ.We also include the prior Fisher matrix , where δ αβ is the Kronecker delta function and σ α is the standard deviation considered for the parameter θ α in a Gaussian prior.The total Fisher information matrix for the posterior is then F = F (L) + F (p) .The marginal posterior in any subspace of the parameter space is then controlled by the corresponding restriction of F −1 .Specifically, the 1-sigma uncertainty on θ α is approximated by (F −1 ) αα .As previously discussed, the power spectrum and bispectrum are not exactly Gaussian distributed, and one of the main goals of this paper is to explore the final parameter posterior when this assumption is relaxed.However, the Fisher prediction provides a guideline for interpreting our results.
Since we aim to use NPE instead of sequential methods for amortabilizity and since SBI requires many simulations around high density posterior regions, a careful choice of priors (i.e.neither too wide nor too narrow) is crucial.We hence choose the mean and variance of our priors in the following way.First, we run SNPE with batches of 10 3 simulations starting from the prior where we choose the centers of α and b 1 in such a way that their multiplication equals 1.5 = ᾱ × b1 .After SNPE convergence, we sample from the final posterior to choose the prior for NPE to be a Gaussian centered on the sample mean with variance given by at least two times the SNPE sample variance.For all inference cases, we overplot the prior with the posterior to guarantee that the latter is not bounded by the prior ranges.We refer the reader to Appendix E for examples comparing SNPE and NPE.

Fixed cosmology: impact of the likelihood form
We start with the case where α is fixed to unity, while all the first and second order bias and noise parameters are sampled.We name as full SBI results from the forward model where the measured data vector from LEFTfield are used directly for posterior estimation.We compare these results to a Gaussian-likelihood model, where we generate samples from the data vector as x ∼ N (⟨x⟩, Cov[x]), where ⟨x⟩ denotes the mean of the data vector over initial conditions realizations calculated at the proposed parameter θ and the covariance is calculated for the data vector at the fiducial parameters x.While the name Gaussian likelihood refers to the fact that the data is Gaussian distributed by construction, we emphasize that the posterior is also obtained with NPE as in the full-SBI case.We of course expect to obtain the same result when performing a Monte Carlo sampling based on the same Gaussian likelihood for the data vector.
Regarding the covariance Cov[x], we consider two distinct cases.First, we consider an analytical, diagonal covariance for the data vector, namely Cov[ P (k)] = 2⟨ P (k)⟩ 2 /N k , where ⟨ P (k)⟩ is the mean of the fiducial power spectrum over initial conditions evaluated at fiducial parameters θ and N k is the total number of modes which lie in the respective kbin, and Cov[ B(k 1 , k 2 , k 3 )] = s B ⟨ P (k 1 )⟩⟨ P (k 2 )⟩⟨ P (k 3 )⟩/k 3 f N t , where s B is the triangle shape symmetry factor (6, 2 and 1 for equilateral, isosceles and scalene triangles, respectively), k f = 2πL −1 is the fundamental frequency and N t is the number of triangle configurations inside each triangle bin.Second, we measure the sample covariance from N cov simulations with parameters fixed at their fiducial values.To summarize, in both Gaussian-likelihood cases, the likelihood of the data vector is characterized completely by its first and second moments (namely the mean and covariance), where the covariance is fixed to the fiducial point in parameter space.Compared to the analytical covariance, the sample covariance case adds more realism, since cross-correlations between different elements of the data vector (i.e.off-diagonal elements of the covariance) are included.In contrast, the full SBI forward model captures the full distribution of the data vector, including its higher-order moments, and as function of the position in parameter space.
We show in Figure 4 the estimated posteriors for the Gaussian likelihood model at k max = Λ = 0.1hMpc −1 , which closely resembles the Fisher prediction.For all parameter posteriors considered in this work, we always sample 10 5 posterior samples for plotting.The Fisher derivatives are taken numerically for the full case, with steps given by the standard deviation of the NPE posterior, while for the Gaussian-likelihood case it is evaluated analytically at the fiducial values from the mean cross-spectra of the bias expansion basis.The small deviations come from the fact that, although the likelihood and the prior are Gaussian, the posterior is not necessarily Gaussian, as the parameters enter nonlinearly in the prediction for the expectation value of the data vector.Note that, although the Fisher prediction is evaluated at the fiducial parameters points, we shift the mean of the Fisher contours to the maximum a posteriori posterior values for better comparison.This results shows that NPE can successfully recover the expected posterior in this idealized case, showing that this method is therefore robust for the data vector size and number of simulations used.
Figure 5 compares the Gaussian-likelihood posteriors with the full-SBI case.First of all, we can notice that the Gaussian likelihood cases are very similar, which indicates that offdiagonal terms do not play an important role on the scales k ≤ Λ = 0.1 hMpc −1 considered here.The difference between these and the full-SBI posteriors is slightly larger, but still small, which also indicates that the non-Gaussianity of the data vector also does not affect the constraints.Figure 6 shows the corresponding result when going to smaller scales, k ≤ Λ = 0.2 hMpc −1 .The posterior differences are somewhat larger than on smaller scales, as expected, though still not dramatic.
We show the simulation-based calibration (SBC) tests of the posterior in Figure 7 (see Appendix D).As previously discussed, a healthy posterior should lead to uniformly distributed rank statistics for all parameters, and we show two different visualizations which can help us in identifying possible problems with the estimated posterior.First, if the SBI underestimates the true posterior variance for some parameter, one expects a "U-shaped" rank distribution, or equivalently a CDF lying below the grey shaded area of the 95% confidence interval of a uniform distribution.Conversely, if the posterior overestimates the errors, one would get a centrally peaked distribution, or a CDF above the grey area.Since our ranks are uniformly distributed and all CDFs lie inside the grey region, we conclude that our posterior passed the calibration test, although we re-emphasize that this is only a necessary and not a sufficient condition.
We show convergence of the obtained posterior with respect to the number of simulations for the full case in Figure 8.The results indicate that adding more simulations than N sim = 10 5 , the simulation budget used in the previous figures, does not change the full-SBI posteriors.The conclusions hold equivalently for the Gaussian likelihood cases.Note that the sudden upturns for a low simulation budget only indicates that the budget was not sufficient for convergence, and that running SBI again on the same dataset could lead to different posteriors.Parameter posterior for the case where the cosmological parameter α is sampled together with the bias and noise parameters.The contour colors, simulation budget and data vector size are the same as in Fig. 5.For all cases, the method NPE was used from a simulation budget of N sim = 10 5 , scale cut of k max = Λ = 0.1hMpc −1 and data vector dimension D = 33.

Inferring cosmology
We now turn to constraining the bias and noise parameters together with the cosmological parameter α, where we recall that α ≡ σ 8 /σ fid 8 and set its fiducial value to ᾱ = 1.We show in Figure 9 how we are able to infer these parameters independently with the help of the bispectrum at k max = 0.1hMpc −1 , and that SBI is successful in this context even considering the dimensionality of the problem and the nontrivial parameter degeneracies.
We can notice that the analytical covariance underestimates the errors in this case, a result that has already been discussed in the galaxy clustering literature [122,123].We however find no significant difference between the Gaussian likelihood case with sample covariance and the full-SBI one, a result which is in agreement with the current intuition that adding offdiagonal terms to the covariance would be more important than considering a non-Gaussian likelihood.Our conclusion is that, for this particular case, the non-Gaussianity of the data vector (especially at low-k modes) does not lead to any considerable effect on the posterior densities.The conclusion still holds for k max = 0.2hMpc −1 , as displayed in Figure 10, although differences between full-SBI and Gaussian-likelihood results become more noticeable.The 1-σ errobars for the full-SBI, Gaussian-likelihood with sample covariance, Gaussianlikelihood with analytical covariance and the Fisher forecast are 0.173, 0.173, 0.169 and 0.21 k max = 0.1hMpc −1 , respectively, while 0.131, 0.111, 0.111 and 0.100 for k max = 0.2hMpc −1 .
We emphasize here that this conclusion is highly problem-dependent, and it does not necessarily extend to other cosmological parameters, scale ranges or summary statistics.For example, since most of the constraining power on f NL comes from low-k modes, one could expect it to be more sensitive to the Gaussian likelihood assumption.At the other extreme, on highly nonlinear small scales, one can also expect deviations from Gaussianity, while some summary statistics might not even be well described by an analytical form in any scale range.
In general, relaxing the Gaussian assumption in the likelihood is a non-trivial analytical task, but SBI methods provide us a straightforward route.The investigation of whether relaxing the Gaussian likelihood assumption leads to an increase of the error bars has been previously discussed in the literature; for example, [59] reported no significant difference when using the power spectrum only.In contrast, Figure 3 of [38] show that when using the twopoint correlation function, the posterior obtained by SBI displays larger error bars than when considering a Gaussian likelihood in the context of lognormal fields.
The SBC results are shown in Figure 11, which indicate that our posterior estimates pass the calibration test also for this case.The convergence is analyzed in Figure 12.We can notice a slower convergence with respect to the number of simulations when compared to the case where α is fixed, which might be related to the fact that the two-dimensional contours of α and the other parameters are more non-Gaussian in this case, due to the increased degeneracies relative to the α-fixed case, and therefore the normalizing flow needs more simulations to fit the posterior.
We show in Appendix E an analysis of how our results can be affected by data vector normalization and the number of sample covariance estimates for the Gaussian-likelihood case, besides providing a comparison of bispectrum with power-spectrum-only constraints.We also compare the results with SBI density estimators other than NPE (namely SNPE and NLE), as well as the impact of a more complex network architeture or using an ensemble of networks instead of a single realization for posterior estimation.

Conclusions
We have explored for the first time how SBI performs within the context of a forward model based on the EFTofLSS and the bias expansion with the goal of constraining cosmological, noise and bias parameters from the galaxy power spectrum and bispectrum.First, we have demonstrated that SBI can successfully recover the expected posterior in controlled cases where the data vector is sampled from a Gaussian likelihood.Considering an Euclid-like mock tracer sample, we conclude that the non-Gaussianity of the lowest order n-point functions on large scales does not impact the constraints on σ 8 for the scale ranges considered in this work, namely k max = 0.1hMpc −1 and 0.2hMpc −1 .In these cases, adding off-diagonal terms to the covariance has a larger impact than relaxing the Gaussian likelihood assumption.Our estimated posteriors passed the SBC test and, within this specific context, we estimated the number of simulations needed for convergence to be of the order of 10 5 simulations.
It is important to emphasize that, while SBI serves as a versatile and powerful tool for cosmological inference, our utilization of machine-learning is strictly limited to the statistical aspect of our inference process, specifically density estimation.This entails essentially a fitting of the posterior distribution or likelihood function derived from a set of simulations, allowing us to overcome the limitations of standard inference techniques that rely on analytical approximations for the selected summary statistics.Most importantly, this work is performed within the framework of a forward model rigorously constructed to ensure accuracy on large scales.In this context, our understanding of the underlying physics benefits from years of dedicated study within the EFTofLSS community, providing us with a high level of control and intuitive comprehension, especially when employing the power spectrum and bispectrum as summary statistics.
It is paramount to validate and establish the reliability of SBI in well-understood regimes, such as those under consideration in this study.This validation process represents a crucial first step before delving into more complicated scenarios, where intuitive comprehension may be lacking.In addition, knowledge of an explicit likelihood that is at least valid for some elements of the data vector (for example on very large scales) allows for the possibility for explicit combinations of analytical likelihoods and SBI [42].
As the next direct step of this work, we will test our inference pipeline on dark-matter halos from N-body simulations and test different compression schemes other than n-point functions.It would also be interesting to investigate the impact of the details of the forward model that differ in our approach and current EFT-based analysis techniques based from codes such as CLASS-PT or PyBird (see the discussion in [85]).We also aim to use SBI to investigate whether the low-k non-Gaussianities of the lowest order n-point functions impact the inference of cosmological parameters related to primordial non-Gaussianities.
In future work, we plan to extend our analysis to redshift space, using the LEFTfield forward model extension presented in [68], and to sample more cosmological parameters and to improve the realism of the forward model by incorporating systematic effects and masks in LEFTfield.We are particularly interested in comparing the results from field-level cosmological inference, which explicitly marginalizes over the initial conditions using HMC, to SBI inference, both using n-point functions as summary statistics and other "field-level" compression mechanisms which learn the summary statistics directly from the galaxy density field.
Regarding our SBI pipeline, we are planning to explore prior truncation schemes [30,109], using techniques such as the Sobol' sequence [124] and active learning [24].We would also like to improve our density estimation methods by hyperparameter tuning, besides trying other SBI algorithms such as Neural Ratio Estimation [107][108][109][110][111]. In particular, it would be interesting to test Truncated Marginal Ratio Estimation (TMNRE) [35,109] for marginalizing over the bias parameters.Lastly, it would be interesting to also explore other SBI diagnostics besides SBC [125][126][127].

B Bias relations
The bias expansion is constructed upon a sum of operators at given order in perturbation theory at a certain given time, where one can always relate a given basis of operators one to another.From the fiducial value of the linear bias b 1 , which is related to the Eulerian basis we wish to determine a physically-motivated fiducial value for the second-order bias parameters associated to the Lagrangian operators O L = {tr[M (1) ], tr[M (1) ] 2 , tr[M (1) M (1) ]}.At leading order, these operators can be related as and assuming a local Lagrangian bias model with vanishing Lagrangian tidal bias, we can use the co-evolution relations [49] to obtain that the conversion between both sets of bias parameters reads

C Normalizing flows
Statistical inference has gained power with the advent of normalizing flows, a class of generative models that allows one to model complex probabilistic distributions.The key idea behind this method is to learn a map that transforms a simple, well-understood density such as a Gaussian to the target probability distribution through successive transformations.Let x be a D-dimensional real vector and p(x) its joint distribution we wish to estimate.We proceed by defining a transformation f from a base distribution π(u) such that We further require that f is a diffeomorphism, i.e., that f invertible and that f and f −1 are differentiable, which restricts u to be D-dimensional as well [130].The distribution over x can therefore be obtained by the change of variables [131,132] p The composable property of diffeomorphisms, i.e., the fact that for any two diffeomorphisms f 1 and f 2 the composition f 1 • f 2 is also diffeomorphic, guarantees that Eq. (C.2) x, which follow a more complex distribution p(x).In the lower panel, we illustrate the conditioner that "masks out" the connections between z i and h <i , as well as the affine functions applied to the vector components.
is still valid if we build a complex transformation from a series of simple transformations where f t transforms z t−1 into z t , and setting z 0 = u and z T = x.Flow therefore refers to these successive transformations applied on the samples from π(u) to progressively deform them into the ones of p(x), while normalizing refers to the inverse flow 1 that maps the samples from the distribution p(x) into the ones of π(u), which is often chosen to be a multivariate normal [106].
The fact that normalizing flows provide explicit densities (via Eq.C.2) in addition to sampling (via Eq.C.1) is in contrast to other generative models like Variational Autoencoders (VAEs) [133] and Generative Adversarial Networks (GANs) [134], where usually only sampling is possible.In practice, f (or f −1 ) is implemented as a neural network parametrized by ϕ (e.g., its weights and biases), which is optimized to learn the mapping between the distribution parameters and a given simulated dataset.The model should have tractable inverse and Jacobian determinant; in other words, it is often required that the inverse is efficient to calculate and that the Jacobian determinant time cost should be at most O(D).

Masked Autoregressive Flows.
A simple way to achieve the desirable properties of the model is to use autoregressive flows, which restrict f t to be of the form [135] z Here, the output z ′ i is the i-th component of a vector z t ∈ R D for which z t = f t (z t−1 ), while z i is the i-th component of z t−1 ∈ R D .The transformer τ is required to be a strictly monotonic function of z i , so that the model is invertible.Since τ is parametrized by h i , the output z ′ i does not depend on z ≥i due to the autoregressive structure of the i-th conditioner c i .As a consequence, the Jacobian is triangular and its determinant can be calculated as the product of its diagonal elements, leading to an O(D) evaluation of the logarithm of the determinant [106]: The implementation of an autoregressive flow then follows by choosing an appropriate transformer τ and conditioner c i .A common choice for the transformer is to use affine functions, i.e., τ (z i ; where we choose to parametrize the scale parameter of this location-scale transformation as exp α i to guarantee its invertibility.The determinant of the Jacobian is then simply We focus on Masked Autoregressive Flows (MAFs) [112] (see the flowchart Fig. 13), a particular case of autoregressive flows which will be used throughout this paper for density estimation, but other alternatives exist such as Inverse Autoregressive Flow (IAF) [136] and Real NVP [137].MAFs are a special case of autoregressive models [138], which use the product rule of probability to decompose densities into 1-dimensional ones as In MAFs, the conditionals are parametrized by Gaussians, which can be equivalently written as an affine function (Eq.C.5) where the base distribution is normal.Given that the model is easily invertible, with u i = (x i − µ i ) exp(−α i ), and it has a triangular Jacobian of the form of Eq. (C.6), autoregressive models can also be interpreted as a normalizing flow [136], as Eqs.(C.1) and (C.2) can be evaluated.
The MAF conditioner c i in turn is implemented with masked conditioners as in Masked Autoencoder for Distribution Estimation (MADE) [139], where a single feedforward network outputs h in one pass given the input z.To preserve the autoregressive property of the conditioner, each weight matrix element is multiplied by a binary mask, where the connections between z i and h <i are "masked out" by essentially multiplying their corresponding weights by zero.
The interpretation of autoregressive models as normalizing flows enables MAF to stack multiple layers of MADEs with Gaussian conditionals into a deeper flow, where the output of each layer, α i and µ i for all i, are used as input for the next one.In the language of the previous section, we will have that a MAF transforms the basis function π(u) = N (0, I) to the target density p(x) through the composition of transformations f = f T • ... • f 1 , where each f t is implemented by a MADE for z 0 = u, z t = f t (z t−1 ) and z T = x.
As a result of the stacking, the model is still tractable, but more expressive and flexible.For example, it has been shown that a MAF with 5 autoregressive unimodal conditional layers is able to approximate multimodal posteriors [112].Another subtlety is that the estimated densities can depend on the order of the inputs, so MAF avoids this issue by using different input orders for each layer.We discuss below how such models are trained for density estimation.

D Simulation-based calibration
When using approximated methods for Bayesian inference, a lot of effort has to be made to analyse whether the obtained posterior is correct.For example, for MCMC one has to check whether the chains have converged, and to this date there exists no method to assure convergence with complete confidence [62].Regarding SBI algorithms, an important diagnostic is to check whether the posterior is calibrated, i.e., not over-or under-estimating the parameter uncertainties.For scientific reasoning, we are mostly concerned with assuring that the posterior is not superconfident.Simulation-based calibration (SBC) [70] arises as a useful tool for this analysis, where it is important to stress that passing this test is only a sufficient condition; that is, if it fails, it is an indication that the training was not successful (for example, if not enough simulations were used), while if it passes there is no guarantee that the posterior is correct.Providing reliable tests for SBI algorithms is an active area of research [62,[125][126][127].
Here, we use SBC to analyse the estimated posteriors, and its results can tell us if the posterior is not well-calibrated and indicate possible systematic biases on the inference.The idea behind SBC is that, if we draw samples θ ∼ p(θ) from the prior and then generate a data vector x ∼ p(x|θ) with the simulator, by drawing samples θ ∼ p( θ|x) from the estimated posterior approximated by the model q ϕ ( θ|x), we obtain joint distribution µ(θ, x, θ) = p(θ)p(x|θ)p( θ|x).The marginal distribution of θ, Analysing the distribution of θ can therefore be used as a posterior calibration test.In practice, we generate a set of "observed" data x i o , where x i o ∼ p(x|θ i ) is simulated given a particular θ i ∼ p(θ) drawn from the prior, and then estimate the posterior p(θ|x i o ) for each of these observations.We then draw a set of posterior samples { θ} i for each of the estimated posteriors and compute the rank of the observed data under this set by counting how many of the posterior samples { θ} i fall below the corresponding observed data x i o .If the rank statistics is not uniformly distributed, then the estimated posterior is not well calibrated.

E Tests of the inference
In this section, we analyse some details of our pipeline focusing on a particular case, namely the one where the cosmological parameter α is fixed and k max = Λ = 0.1hMpc −1 .We will be referring to the full case using NPE and N sim = 10 5 unless stated otherwise; note that the following conclusions hold qualitatively the same for the other cases (using a simulation budget N sim that guarantees convergence).Data vector normalization.It is well known that usually machine learning techniques perform better with normalized values.Since our parameters θ are already of order one, instead of using the data vector x as the raw power spectrum and bispectrum for density estimation, we also test normalizing them as P (k)/P L (k) and B(k 1 , k 2 , k 3 )/[P L (k 1 )P L (k 2 ) + 2 perm.].However, as we can see in the left panel of Figure 14, the normalization does not lead to faster convergence in this case.

SNPE.
The right hand side of Figure 14 shows the results of SNPE, the sequential version of NPE.As we can see, the standard deviations converge to the ones corresponding to the final NPE posterior, what further confirms our posterior convergence.As aforementioned, although this case uses less simulations for convergence (e.g., after 15 rounds, where convergence seems to be safely reached, one would have used a total of only N sim ∼ 10 4 simulations), performing SBC tests on such non-amortized posteriors is a heavy computational task.then sample 10 4 posterior samples for each of the 10 estimated posteriors.This can give us an estimate of the error associated with the posterior estimation itself, and as we can see the errors are indeed larger; however, the trends are very similar and further confirm our previous findings.
Figure 17.Same as Fig. 9, but using an ensemble of networks instead of a single posterior estimation.

Figure 1 .
Figure 1.Comparison of power spectrum and bispectrum estimators to theory for the linear forward model in Eq. (2.18), where we use Λ = 0.3hMpc −1 , k min = 2πL −1 hMpc −1 and k max = 0.2hMpc −1 for both cases.The blue lines show the n-point function estimator from LEFTfield as described in this section, which agree with the theory predictions indicated by the red curves.Left: we set b 1 = b 2 = 1 and use 50 linear bins for the power spectrum.The linear power spectrum P L is displayed in black for reference.Right: we set b 1 = 1 and b 2 = 0.1, in order to suppress next-to-leading-order corrections, and use 150 triangle bins for a better visualization of the bispectrum, which are constructed from 10 linear bins in k.The lowest triangle bin indexes contain the smaller k-modes, hence their large variance, where k 1 ≤ k 2 ≤ k 3 , where k 1 corresponds to the outermost loop index, while k 3 is the innermost index.

Figure 2 .
Figure2.Flowchart of our pipeline for SBI, assuming an nLPT forward model.The cosmological, bias and noise parameters θ are sampled from the prior and used as input for LEFTfield (Sec.2.2), which constructs a model for the galaxy overdensity field δ g through nLPT (2LPT in this work).We then measure the summary statistics x, which in our case is the galaxy power-spectrum concatenated with the bispectrum (Sec.2.3).We repeat this process N sim times, and use the resulting pairs {(θ n , x n )} Nsim n=1 as input to the SBI neural density estimation (Sec.3.2), which then gives us the posterior p(θ|x) that can in turn be evaluated at a given observed data x o .

Figure 3 .
Figure 3. Posterior standard deviation of the linear bias parameter squared b 2 1 obtained by SBI divided by its analytical prediction as a function of the total simulation budget N sim .Continuous lines show the results obtained from ABC, while the dashed line corresponds to NPE.The value of f denotes the fractional percentage of samples which were accepted in ABC, where we use f × N sim for posterior estimation.The error bars correspond to 10 different realizations of the observational mock data x o .

Figure 4 .
Figure 4.Parameter posterior for the case where the cosmological parameter α is fixed.The left and right contour plots correspond to the models where the data vector is sampled from a Gaussian with analytical and sample covariance, respectively.For all cases, the neural density estimator method NPE was used with a simulation budget of N sim = 10 5 .

Figure 5 .
Figure 5. Parameter posterior for the case where the cosmological parameter α is fixed.Blue and purple contours correspond to the models where the data vector is sampled from a Gaussian likelihood with analytical and sample covariance, respectively, where N cov = 10 4 .The red contours show the full SBI results, with no Gaussian assumption.Dotted lines indicate the Fisher prediction with sample covariance for reference.For all cases, the method NPE was used from a simulation budget of N sim = 10 5 , scale cut of k max = Λ = 0.1hMpc −1 and data vector dimension D = 33.

Figure 7 .
Figure 7. SBC (simulation-based calibration) tests for the full-SBI case where the cosmological parameter α is fixed with N sim = 10 5 .The upper panels show the CDF of the ranks distributions for each parameter, where the grey area show the 95% confidence interval of a uniform distribution, while the lower panels show the rank distribution, where the grey areas denote the 99% confidence interval of a uniform distribution.

Figure 8 .
Figure 8. Convergence of the full-SBI parameter posterior with the number of simulations N sim for the case where the cosmological parameter α is fixed.The standard deviation of the posterior samples are normalized by their respective Fisher prediction for better comparison.Note that it should not necessarily converge to one.The errors indicate posterior evaluation at 10 different data observations, with same fiducial parameters but different initial conditions realizations.

Figure 9 .
Figure 9. Parameter posterior for the case where the cosmological parameter α is sampled together with the bias and noise parameters.The contour colors, simulation budget and data vector size are the same as in Fig.5.For all cases, the method NPE was used from a simulation budget of N sim = 10 5 , scale cut of k max = Λ = 0.1hMpc −1 and data vector dimension D = 33.

Figure 11 .Figure 12 .
Figure 11.Same as Fig. 7 but for the case where the cosmological parameter α is sampled as well, with N sim = 10 5 .

Figure 13 .
Figure 13.Diagram of how normalizing flows work, with the specific example of Masked Autoregressive Flows.The samples from the vector z 0 = u, sampled from the simple distribution π(u), are deformed through the sequence of transformations f= f T • • • • • f 1 into those of z T =x, which follow a more complex distribution p(x).In the lower panel, we illustrate the conditioner that "masks out" the connections between z i and h <i , as well as the affine functions applied to the vector components.

2 Figure 14 .
Figure 14.Left: Convergence of the standard deviation of the posterior of each parameter with respect to the simulation budget N sim normalized by their Fisher prediction.Solid lines denote the case where the data vector corresponds to the raw spectra, while dashed lines display the results using the normalized spectra.Right: Standard deviation of the parameters from SNPE starting from the prior of Eq. (4.7) as a function of its rounds, where 10 3 simulations are sampled at each round, divided by the NPE standard deviation of each parameter obtained from N sim = 10 5 .

3 Figure 16 .
Figure 16.Left: impact of the number of simulations for covariance estimation on the final posterior for the Gaussian-likelihood case with sample covariance.Right: impact of doubling the number of hidden units and number of transforms.