Mapping Lyman-alpha forest three-dimensional large scale structure in real and redshift space

This work presents a new physically-motivated supervised machine learning method, Hydro-BAM, to reproduce the three-dimensional Lyman-$\alpha$ forest field in real and in redshift space learning from a reference hydrodynamic simulation, thereby saving about 7 orders of magnitude in computing time. We show that our method is accurate up to $k\sim1\,h\,\rm{Mpc}^{-1}$ in the one- (PDF), two- (power-spectra) and three-point (bi-spectra) statistics of the reconstructed fields. When compared to the reference simulation including redshift space distortions, our method achieves deviations of $\lesssim2\%$ up to $k=0.6\,h\,\rm{Mpc}^{-1}$ in the monopole, $\lesssim5\%$ up to $k=0.9\,h\,\rm{Mpc}^{-1}$ in the quadrupole. The bi-spectrum is well reproduced for triangle configurations with sides up to $k=0.8\,h\,\rm{Mpc}^{-1}$. In contrast, the commonly-adopted Fluctuating Gunn-Peterson approximation shows significant deviations already neglecting peculiar motions at configurations with sides of $k=0.2-0.4\,h\,\rm{Mpc}^{-1}$ in the bi-spectrum, being also significantly less accurate in the power-spectrum (within 5$\%$ up to $k=0.7\,h\,\rm{Mpc}^{-1}$). We conclude that an accurate analysis of the Lyman-$\alpha$ forest requires considering the complex baryonic thermodynamical large-scale structure relations. Our hierarchical domain specific machine learning method can efficiently exploit this and is ready to generate accurate Lyman-$\alpha$ forest mock catalogues covering large volumes required by surveys such as DESI and WEAVE.


INTRODUCTION
One of the most outstanding discoveries in modern Cosmology has been the acceleration of the expansion of the Universe (Riess et al. 1998;Perlmutter et al. 1999). This finding supported the so-called Λ cold dark matter (ΛCDM) model and revived the existence of an accelerating force as opposed to gravity, called dark energy (see e.g. Peebles & In particular, the Lyman-α (Lyα hereafter) absorption features (dubbed Lyα forest), originated in quasar spectra by interactions of rest-frame UV photons with HI atoms in gaseous clouds along the sightline, have been increasingly employed to investigate the high-redshift LSS. Differently from galaxies, which identify discrete points in redshift space, the Lyα forest is a valuable proxy for the continuous matter density field along the line-of-sight at 2 z 4, allowing to use multiple quasar sightlines to reconstruct the three-dimensional matter density field (Nusser & Haehnelt 1999;Pichon et al. 2001;Gallerani et al. 2011;Kitaura et al. 2012b;Horowitz et al. 2019;Huang et al. 2020;Porqueres et al. 2020). Detailed cosmological simulations (e.g. Cen et al. 1994;Hernquist et al. 1996;Miralda-Escudé et al. 1996;Zhang et al. 1998;Gnedin & Hui 1998;Davé et al. 1999;Machacek et al. 1999Machacek et al. , 2000Regan et al. 2007;Norman et al. 2009;Lukić et al. 2015;Nagamine et al. 2021) indicate that the forest traces preferentially the underdense and mildly overdense (δ 1 − 10) regions, whereas the absorption lines rapidly saturate in corresponding high-density gaseous clouds, probing therefore a complementary density regime to that traced by galaxies.
The Lyα forest has become a valuable cosmological probe, exploiting the Baryon Oscillation Spectroscopic Survey (BOSS, Eisenstein et al. 2011;Dawson et al. 2013) ∼ 160, 000 quasars dataset to measure the baryon acoustic oscillations (BAO) (Slosar et al. 2011;Busca et al. 2013;Slosar et al. 2013;Kirkby et al. 2013;Font-Ribera et al. 2013Delubac et al. 2015;Bautista et al. 2017;du Mas des Bourboux et al. 2017 and recently even proposed to constrain the growth rate (Cuceu et al. 2021). As surveys, such as the Dark Energy Spectroscopic Instrument (DESI) survey (Levi et al. 2019), get ready to improve on cosmological measurements, the modelling of the Lyα forest needs to improve as well.
While the Lyα forest is a linear tracer of the underlying density field on large cosmological scales (k < 0.1 h Mpc −1 ), on smaller scales non-linear corrections become unavoidably necessary (e.g. Chen et al. 2021), such that modelling the absorption field at high-resolution turns out to be a highly non-trivial task. Additional effects such as HI self-shielding (Font-Ribera & Miralda-Escudé 2012), redshift space distorsions (RSD hereafter, see Kaiser (1987); Hamilton (1998) and in particular for the Lyα forest Seljak (2012)), and thermal broadening (Cieplak & Slosar 2016), among others, complicate even further the overall analytical framework. The large degree of complexity accounted for by these features traditionally requires resorting to full hydrodynamic simulations, at the expense of huge time and computational resources, making it practically unfeasible to produce the large amount of independent cosmic volumes necessary to perform cosmological analysis and exploit in full power the unprecedent incoming datasets from future surveys.
To overcome this problem, a customary approach has been to generate low-resolution Lyα forest cosmological volumes using scaling relations (Fluctuating Gunn-Peterson Approximation, FGPA hereafter, Rauch 1998;Croft et al. 1998;Weinberg et al. 1999) applied to independent dark matter (DM) fields, obtained from DM-only N-body simulations (Meiksin & White 2001;Viel et al. 2002;Slosar et al. 2009;White et al. 2010;Rorai et al. 2013;Lee et al. 2014), approximated gravity solvers (Horowitz et al. 2019) (e.g. FastPM (Feng et al. 2016), COLA (Tassev et al. 2013)), Gaussian random fields (Le Goff et al. 2011;Bautista et al. 2015;Farr et al. 2020) and rank-ordering transformations from linear theory to non-linear distributions obtained from a hydrodynamic simulation (Viel et al. 2002;Greig et al. 2011). Although the FGPA succeeds in roughly accounting for the spatial correlations on large scales, in the case of N -body simulations it is found to lose accuracy at k > 0.1 h Mpc −1 (as will be reported in more detail in §2 and 3).
Approaches relying on Gaussian random fields need to be complemented with an input power-spectrum to impose a priori the two-point target summary statistics. However, they do not guarantee to accurately reproduce the realistic threedimensional cosmic-web matter distribution, as would be obtained from a simulation, or a detailed analytical structure formation model. In particular,  assumes a lognormal model for the Lyα forest optical depth τ distribution, obtained applying a non-linear transform to the Gaussian random field and fitting the free parameters to match the mean transmitted flux to observations. This prescription holds in first approximation, although it does not ensure that the flux distribution accurately matches the observed one. A similar approach adopted by Le Goff et al. (2011), combining the lognormal model with the FGPA, features substantial deviations in the 1D power spectrum at k 0.4 h Mpc −1 (Le Goff et al. 2011).
The accuracy of the Lyα forest modelling methods is expected to improve when using (fast) N -body codes rather than approximated gravity solvers, which fail to model the evolution of cosmic structures beyond the mildly non-linear regime. However, given the high redshift of the data and when the studies are restricted to large scales of say, above ∼ 5 h −1 Mpc, fast methods correcting for shell-crossing may be accurate enough (Kitaura & Hess 2013;Tosone et al. 2021).
A number of efforts have been dedicated to develop alternative techniques based on matching just the Lyα forest probability distribution function (PDF, hereafter) (e.g. the LyMAS code, Peirani et al. 2014) and/or the power-spectrum (as e.g. in the Iteratively Matched Statistics, IMS, Sorini et al. 2016).
However, in such cases the achieved accuracy in the one-and two-point summary statistics (IMS: 4% mean deviations in the PDF and 5% − 7% in the power-spectrum) goes in detriment of the bias relation and bi-spectrum (for a similar situation modelling the halo distribution, see Kitaura et al. 2015). An accurate modelling of the two-and three-point statistics is mandatory to obtain proper covariance matrices that leads to unbiased likelihood analysis of the data (see e.g., Baumgarten & Chuang 2018).
Two recent companion papers have proposed to use convolutional neural networks to synthetize hydrodynamic fields, including the Lyman-α forest, starting from the dark matter field, using a deterministic  or a probabilistic (HyPhy, Horowitz et al. 2021, and references therein) deep learning approach. Such methods outperform previous efforts, although they still feature significant biases already at the level of reproducing the Lyα forest PDF.
In this work we present the latest application of the Bias Assignment Method (Balaguera-Antolínez et al. 2018, BAM hereafter) to map the Lyα forest field in three dimensions from a reference cosmological hydrodynamic simulation in order to establish a pipeline and yield an accurate (and fast) alternative to the aforementioned methods for the generations of hydro-simulations.
In particular, we show that accurate reconstructions of the Lyα forest cannot be obtained directly from the dark matter field in our framework, but require using a hierarchical domain specific machine learning approach which involves the reconstruction of ionized (HII) and neutral (HI) gas density fields first, as proposed in Sinigaglia et al. (2021) (HydroBAM-I hereafter, equivalent to BAM-V). Subsequently, we use such fields to model for the first time the non-trivial mapping of Lyα fluxes, thereby including the dependence on baryon effects, such as cooling and feedback. We introduce here also a novel framework for the treatment of redshift-space distortion, not included in HydroBAM-I. In summary, while HydroBAM-I introduces the general hierarchical bias showing reconstructions of HII, HI and T fields in real space, in this work we extend our framework to the complex treatment of Lyα forest, both in real and in redshift space. In particular, we start by considering the problem of predicting the Lyα forest field from the dark matter field and argue that a direct mapping between these two quantities is not sufficient to achieve the desired accuracy in the power spectrum and higher-order summary statistics. Then, we show how the bias hierarchy introduced in HydroBAM-I helps in accomplishing our goal and provide physical insights to explain this performance. Furthermore, we extend the whole framework to redshift space, devising a suitable model for redshift space distortions.
The performance of our method is assessed by means of the deviation of relevant summary statistics (power-and bi-spectrum in real and redshift space) with respect to the reference simulation. The results suggest that the implementation of our method with hydrodynamic simulations on cosmological volumes will be in position to produce a set of accurate Lyα forest mock realizations in a fast accurate way, with direct applications to upcoming surveys such as DESI and WEAVE (Dalton et al. 2012).
The paper is organized as follows. In §2 we briefly introduce the BAM method, describe the reference cosmological hydrodynamic simulation and summarize the FGPA. In §3 we present our pipeline, which is applied to Lyα forest reconstructions in real and redshift space in §4 and §5, respectively. Possible limitations of our method are discussed in §6. We end with conclusions in §7.

HYDRO-BAM
In this section, we recap first the machine learning method we use in our framework. Secondly, we describe the training data set used to learn the modelling of the Lyα forest, in particular a large hydrodynamic simulation.

The machine learning method
We build on the BAM algorithm (Balaguera-Antolínez et al. 2018, 2019, BAM-I andBAM-II hereafter). As stated in HydroBAM-I, this is a special domain specific machine learning approach which learns the values of a binned isotropic kernel in Fourier-space, minimizing the cost function defined by the Mahalanobis distance of the reconstructed power-spectrum and the ground truth given by the reference simulation. The minimization is achieved through an iterative rejection Metropolis-Hastings algorithm. The kernel preserves the dimensionality of the problem, meaning that it is equivalent to a quadratic matrix in configuration space. Hence, non-local dependencies need to be explicitly provided within the BAM algorithm, otherwise they will only be accounted for isotropically through the kernel and the anisotropic contributions will be lost 1 . In this sense, BAM represents a physically motivated supervised learning algorithm. The advantage is that knowing the non-local bias contributions and hierarchical bias relations, we can speed up the learning process using only one (sufficiently large) or a few reference simulations with same primordial power spectrum, setup, and cosmological parameters, e.g. fixed-amplitude paired simulations (Angulo & Pontzen 2016, which have shown that just two fixed-amplitude paired simulations can suppress cosmic variance and achieve the same accuracy in the power spectrum as tens of traditional simulations), thereby increasing the accuracy of the reconstruction, as we will show below. In the latter case, using different simu-lations is mainly meant at increasing the effective volume covered by the training set, hence making the bias relations less prone to be affected by cosmic variance. In this sense, it is crucial that such different simulations share the same numerical setup, parameters and parametrizations, as mentioned above, not to introduce incoherencies in the training set.
The resulting kernels and bias relations (the latter understood as the probability P(η|Θ) ∂V 2 of having a given number of dark matter tracers η in a cell of volume ∂V conditional to a set Θ of properties of the dark matter density field, DMDF hereafter in this section, in the same cell can be used to generate independent realizations of the tracer spatial distribution, when mapped over independent DMDF (evolved from a set of initial conditions generated with the cosmological model and parameters used for the reference simulation). BAM has been shown to be able to generate ensembles of mock catalogs with an accuracy of 1 − 10% in the two-and three-point statistics of dark matter halos (see e.g. BAM-II), with higher accuracy that previous bias-mapping methods (e.g. Pellejero-Ibañez et al. 2020, BAM-III) and currently aims at generating the realistic set of galaxy mock catalogs by including intrinsic properties of tracers such as halo masses, velocity concentrations, spins (Balaguera-Antolínez & Kitaura, in preparation).
Our classification of BAM as a machine learning algorithm relies on the fact that it attempts to solve the problem of assessing the optimal link (bias relation) between the dark matter density field and some tracer field. By optimal we mean the relation yielding the most accurate predictions of oneand two-point statistics of the target tracers field when mapping it from the dark matter field. In particular, BAM learns how to modify, through a kernel, the bias initially measured from a reference simulation acting as training set, applying it, in a second stage, to an independent and approximated DMDF, in order to predict mock catalogs of the desired target tracers field. As anticipated, in BAM the cost function is defined as the Mahalanobis distance between the reconstructed power-spectrum and the ground truth given by the reference simulation. In this picture, the Metropolis-Hastings algorithm employed in the iterative determination of the kernel, is the optimization step usually embedded in the training phase of a machine learning algorithm, as e.g. gradient descent in a regression problem. However, since we are in the presence of a stochastic mapping between the dark matter field and its tracers field, we cannot rely on a deterministic minimization algorithm based e.g. on gradient computation. Instead, we need to employ an alternative algorithm (the Metropolis-2 We will omit the symbol ∂V hereafter. Hastings in our case) which performs optimization by means of Monte-Carlo methods.
In HydroBAM-I, the approach of BAM was extended to the context of hydrodynamic simulations 3 , exploring the link between gas properties and the underlying DMDF, as well as the scaling relations between different gas properties. According to our findings, the spatial distribution of HII represents an intermediate stage between the dark matter density field and other baryon quantities, tracing both the large-scale matter distribution, as well as small-scale baryonic processes and the underlying thermodynamic relations.
The BAM procedure can be briefly summarized as follows: • Start with a DMDF δ dm ( r) interpolated on a mesh with resolution ∂V .
• For a given dark matter tracer η, obtain the interpolation on the same mesh as the DMDF.
• Measure the bias P(η|{Θ}), where {Θ} denotes a set of properties of the underlying dark matter density field δ dm ( r) retrieved from the reference simulation. This bias is obtained from the assessment of the joint distribution P(η, {Θ}), which is represented by a multi-dimensional histogram consisting on N η bins of the target tracer property and N Θ properties of the DMDF.
• Sample a new variableη at each spatial cell (with DM properties {Θ}) as • Measure the power-spectrum of this quantity and compare with the same statistics from the reference.
In general, the first sampling of the tracer field performed according to the bias relation learnt from the simulation results in biased reconstructions of the power-spectrum Pη(k) of the variableη with respect to the reference simulation statistics P η (k). Such deviation is accounted for and corrected iteratively by means of a convolution of the dark matter density field with an isotropic kernel K(k = | k|), selected with a Metropolis-Hastings algorithm and updated throughout the iterations. Briefly, the process to obtain the kernel starts with the definition of the ratio T 0 (k) ≡ P i=0 η (k)/P η (k), where P i=0 η (k) is the power-spectrum obtained from the sampling described by Eq. (1). At each iteration i (producing a new estimate of P ĩ η (k)), a value of the ratio T i (k) is computed. At each wavenumber k, it is subject to a Metropolis-Hasting (MH) rejection algorithm 4 . The outcome of the MH sampling process defines a sets of weights w i (k) = T i (k) if the current value of T i (k) is accepted, or w i (k) = 1 otherwise. The BAM kernel is then defined as K(k) ≡ j=i j=0 w j (k) such that, under successive convolutions with the DMDF, the limit lim n→∞ P (n) The aim of this operation is to transform the underlying DMDF such that, once the sampling of the tracer-bias over the DMDF is performed, we obtain a tracer field with the reference power-spectrum. This is achieved in BAM with typically 1% deviations 5 after ∼ 200 iterations. The convergence is nevertheless sensitive to i) the cross-correlation between the DMDF and the tracer field, ii) the properties of the dark matter density field included in the bias model, and iii) the way these properties are explored (i.e., binning strategy). A weak fulfillment of any of these conditions may potentially undermine the success of the whole procedure (see also HydroBAM-I).
Furthermore, it is desirable that K(k) < 1 ∀k, since otherwise it is equivalent to a deconvolution in a similar way to correcting for aliasing introduced by a mass assignment scheme (Jing 2005). It is known that deconvolving a signal in the presence of noise can enhance the noise contribution 6 (see e.g. Kitaura & Enßlin 2008, and references therein). In our case, the bias relations are subject to the particular realisation and include stochastic components, which are also affected by the mathematical representation such as the binning, hence, introducing a noise component. In the absence of a prior optimal non-linear transformation which regularises the kernel to values K < 1, we allow the kernel to go beyond 1, which is theoretically sub-optimal as might in principle enhance the noise contribution. As we show in §4, however, from a practical point of view it yields already very good results. A further improvement on this is left for future work when smaller scales will be investigated in more detail.
Since BAM is based on a direct measurement of the bias relation (described as a multi-dimensional histogram 7 ), de-4 This is done by computing a transition probability min(1, exp( the (Gaussian) variance associated to the reference power-spectrum. 5 Throughout this work, deviations (sometimes referred to as residuals) between two quantities A(x) and B(x) are computed as where the sum is done over all probed (bins of) values of the variable x and Nx is the number of x values. 6 Let us suppose that we aim at recovering a signal s from some simple data model expressed as: d = Rs+ , where R is a response function and some noise contribution. If R is modelled as a convolution, then reconstructing the signal with srec = R −1 d = s + R −1 will yield reasonable results as long as the additive noise term remains small. 7 See BAM-II for the details.
vising an efficient way to represent the different properties is a key ingredient in our analysis. This is done by means of a suitable scaling and binning, aiming at efficiently extracting the information encoded in the joint probability distribution describing the correlations between physical quantities. The total number of bins used to model the properties in {Θ} is N bins,tot = n k=1 θ k , where θ k is the number of bins used to model property k ∈ {Θ}, and therefore rapidly grows with n. For this reason, we need a sufficiently large volume or several simulations to avoid over-fitting.
We refer the reader to §4.4 of HydroBAM-I for a preliminary assessment of overfitting in the context of Hydro-BAM when learning from small data sets. We further discuss this subject in forthcoming sections.
In comparison to other machine learning approaches, adopting a deep learning neutral networks perspective (e.g., Harrington et al. 2021;Horowitz et al. 2021), Hydro-BAM relies on a simpler architecture with no hidden layers. As mentioned above in this Section, the crucial physical dependencies to map the desired dark matter tracer field need to be explicitly specified in the bias model implemented in the code. The kernel, indeed, modifies iteratively the dark matter field in order to try to compensate the lack of fundamental physical information in the bias formulation. In particular, previous applications (BAM-I to BAM-V) have clearly highlighted that including non-local anisotropic contributions in the bias accounts for significant improvement of the tracer mapping.

The training data set
The training data set is determined by a reference cosmological hydrodynamic simulation. This simulation has been obtained using the cosmological smoothed-particle hydrodynamics (SPH) code GADGET3-OSAKA (Aoyama et al. 2018;Shimizu et al. 2019;Nagamine et al. 2021) 8 , with comoving volume V = (100 h −1 Mpc) 3 , N = 2 × 512 3 particles of mass m DM = 5.38 × 10 8 h −1 M for DM particles and m gas = 1.0 × 10 8 h −1 M for gas particles, gravitational softening length g = 7.8 h −1 kpc (comoving) and allowing the baryonic smoothing length to become as small as 0.1 g . As a result, the minimum baryonic smoothing at z = 2 is about physical 260 h −1 pc, which is sufficient to resolve the HI distribution in the circumgalactic medium. Some basic statistics of Lyα forest have already been discussed in Nagamine et al. (2021). The simulation includes detailed models for supernova and stellar feedback (Shimizu et al. 2019), photo-heating and photo-ionization under a uniform UV background (Haardt & Madau 2012), radiative cooling, chemical evolution of atomic and molecular species by For the purpose of this work, we consider the output at z = 2 of the simulation, interpolating dark matter density, HII density, HI number density and HI optical depth onto a 128 3 cubic mesh using a CIC mass assignment scheme (MAS hereafter). In this setup, the physical cell-volume corresponds to ∂V ∼ (0.78 h −1 Mpc) 3 and the Nyquist frequency is k N = 4.02 h Mpc −1 .
The measurements of power-and bi-spectra are performed within spherical averages in Fourier space of width equal to the fundamental mode, with proper corrections for the MAS implemented (through its Fourier transform, see e.g. Hockney & Eastwood 1981) and for their (Poisson) shot-noise.
We refer the reader to §2 of HydroBAM-I (and references therein, e.g., Nagamine et al. 2021) for a detailed description of the reference simulation and the properties of the dark matter and baryon fields taken into consideration.
A major novel aspect of this work with respect to HydroBAM-I is the treatment of HI optical depth τ and the corresponding Lyα forest field 9 F = exp(−τ ). In this context, we do not adopt the FGPA, but rather obtain the HI opacity by means of a line-of-sight integration. The optical depth is computed as (Nagamine et al. 2021): where e, m e , c, n HI , f , x j , ∆l denote respectively the electron charge, electron mass, speed of light in vacuum, HI number density, the absorption oscillator strength, the lineof-sight coordinate of the j-th cell and the physical cell size. The Voigt-line profile φ(x) in Eq. (2) is provided by the fitting formula of Tasitsiomi (2006). Where necessary, relevant quantities (e.g. HI number density) are previously interpolated on the mesh according to the SPH kernel of the simulation. Coordinates x j of cells along the line-of-sight refer to the outcome of interpolation of particles based either on their positions in real space r j or redshift-space s j = r j + v los j /aH, where v los j , a and H are the j-th particle velocity component along the line-of-sight, the scale factor and the Hubble parameter at z = 2, respectively.

Fluctuating Gunn-Peterson Approximation
Hydrodynamic simulations (see e.g., Lukić et al. 2015) show that a considerable fraction (> 90% in volume and > 50% in mass) of the gas probed by the Lyα forest, found in regions with mildly-linear (δ dm < 10) regime, is in a diffuse state not subject to shock processes. In these conditions, the gas density ρ gas and temperature T gas are found to display a tight scaling relation characterized by a power-law relation T gas = T 0 (ρ gas /ρ gas ) α (withρ gas being the average gas density), whose parameters (amplitude and slope) depend on the reionization history and on the spectral slope of the UV background. These vary commonly within the ranges 4000 K T 0 10 3 K and 0.3 α 0.6 (see e.g., Hui & Gnedin 1997). Assuming photo-ionization equilibrium, τ ∝ n HI and n HI ∝ ρ 2 T −0.7 gas /Γ HI , where n HI and Γ HI are the HI number density photo-ionization rate and τ denotes the optical depth, respectively, it is possible to express the latter as which represents the FGPA. In this expression, β = 2 − 0.7, α ∼ 1.6, and A is a constant which depends on redshift and on the details of the hydrodynamics (see e.g. Weinberg et al. 1999). Provided that dark matter and gas densities are highly correlated in the cool low-density regions (where the Lyα absorption mostly takes place) δ gas can be replaced with δ dm in Eq. (3). Furthermore, in some cases an additional Gaussian smoothing is applied to transform the dark matter field into a pseudo-baryon density field, accounting for the fact that the gas density field tends to be smoother than dark matter density in correspondence of low-density dark matter peaks (Gnedin & Hui 1998;Bryan et al. 1999).
In this work we compute the FGPA based on our reference simulation and compare it against the results obtained with BAM. To this end, we use the dark matter field at our fiducial resolution (0.78 h −1 Mpc with a mesh of 128 3 cells), comparable to that used for cosmological studies, but coarser than typical smoothing lengths employed in detailed studies based on hydrodynamic simulations (e.g., 228 kpc in Sorini et al. 2016). We do not apply a Gaussian smoothing. Following relevant works (e.g., Weinberg et al. 1999;Viel et al. 2002;Seljak 2012;Rorai et al. 2013;Lukić et al. 2015;Cieplak & Slosar 2016;Horowitz et al. 2019), we use the value β = 1.6, and A ∼ 0.44, the latter obtained from a linear fit in the log 10 (τ ) − log 10 (δ dm ) plane. This ensures that the resulting scaling relation is consistent with that observed in the reference simulation. 2021). Instead, we establish a hierarchy of bias relations starting from a dark matter field, continuing over the ionised and the neutral gas, to the flux directly. We circumvent hereby the problem of translating optical depths into fluxes with coarse resolutions. These operations are known to introduce biases, especially taking redshift space distortions into account. In particular, the commutation of non-linear operations (as are down-sampling from the native resolution of the simulation to the resolution adopted in this work, and the mapping from optical depth to Lyα forest fluxes F = exp(−τ )) are known to introduce velocity bias contributions in redshift-space Lyα forest representations on a mesh (Seljak 2012). We have explicitly checked with numerical experiments that at the volume and resolution we are working at, this bias is negligible and hence commuting the two aforementioned operations does not imply a problem. This fact offers us the opportunity of applying the F = exp(−τ ) transform at 128 3 resolution, thereby yielding directly a reconstruction F of the final target quantity F and ensuring that our procedure does not introduce further biases. This may have not been the case if we had had a reconstruction τ of the optical depth τ as final step of our hierarchical method, and then computed fluxes as F = exp(−τ ).
In fact, in that case our procedure would not ensure that the summary statistics of F reproduces the reference one within the required few percent accuracy. We construct first a proxy of the gas distribution which is fast to compute. To this end we take the dark matter field from the reference simulation down-sampled to the cho-sen resolution. A large number of fast gravity solvers are now available achieving that accuracy at those high redshifts (z = 2) on Mpc scales (Kitaura & Hess 2013;Tassev et al. 2013;Feng et al. 2016;Tosone et al. 2021). Then we make a suitable non-linear transform of that field which improves the bias mapping relation to the targeted tracer distribution. In particular, we use log 10 (1 + X) (see Neyrinck et al. 2009;Falck et al. 2012;Kitaura & Angulo 2012;Kitaura et al. 2012a, for a motivation), with X being δ dm , δ HII and δ HI (although other kind of transforms can be considered, Kitaura et al. 2020;Sinigaglia et al. 2021).
The key aspect of starting from the dark matter field is that gravitational evolution has shifted the matter perturbations forming a cosmic web which correlates with the gas distribution. In fact, as we showed in HydroBAM-I the correlation between the ionised gas (HII) and the dark matter field is very high. Despite of this, the bias relation is highly non-linear and non-local. Hence, in a second step, we apply our machine learning algorithm to extract the corresponding kernel and multidimensional bias relation.
We must stress that the BAM algorithm learns the statistical relations between two fields which can be applied to any spatial configuration. One could, in principle, learn the different relations between the dark matter and the baryonic quantities, such as HII or HI, independently. Although, the statistical relations would be individually correct, the thermodynamical equilibrium relations among the various baryonic quantities would not be satisfied, as they would have been obtained ignoring each other. For this reason, the step in which the first baryonic quantity is sampled fixes the choice of the overall gas distribution. The hierarchy of bias relations continues with sampling HI from HII. We could continue with the temperature of the gas, but our studies have shown that it is not necessary to obtain the Lyα forest fluxes. In particular, introducing the dependence on the temperature does not yield substantial improvements in the flux power spectrum calibration, with respect to our fiducial model based just on HII and HI. Therefore, we believe the temperature does not account for crucial missing physical dependencies at our mesh resolution. We will consider potential extensions to the temperature of our model in future works, investigating the accuracy of Lyα forest reconstruction towards smaller scales than the ones probed in this paper.
The first part of our procedure is entirely performed in real space, whereas the Lyα forest mocks are obtained in redshift space. To accomplish this goal we apply BAM in real space to obtain HII and HI densities, and then map these recon-structed quantities in redshift space. This last step requires an adequate modelling of RSD, which will be discussed in §5.1. The full calibration strategy can be summarized in the following five steps: whereX denotes the reconstruction of the property X and S denotes the mapping from real to redshift space. That is, in steps 1, 2 and 5 the fieldX on the left-hand side is randomly-sampled following the stochastic bias relation B on the right-hand side, expressing the probability of having a certain value of X in a cell conditional to the set of properties Θ, after convolving with the kernel K the field upon which Θ is built. As anticipated, steps 1-2, presented in details in HydroBAM-I, are carried out in real space, whereas step 5 is performed in redshift space. r and s stand for Eulerian coordinates in real and redshift space, respectively. In steps 3-4, S denotes the mapping from real to redshift space, which will be presented in details in §5.1. In summary, HII and HI densities are obtained in real space starting from the dark matter field in steps 1-2, are moved to redshift space in steps 3-4, and the resulting redshift-space HII and HI fields are used to sample fluxes in step 5. The output of the calibration consists therefore in the sets of multi-dimensional bias relations {B i ( x = r, s)} and kernels {K i }, i = 1, 2, 3. These are then applied to independent dark matter fields to obtain the corresponding Lyα forest mock realization, following all the steps depicted above.
In the following sections, we explore the Lyα forest reconstruction in real and redshift space based on baryon density fields and present the results of dedicated BAM runs.

RECONSTRUCTION OF THE LYMAN-α FOREST: REAL SPACE
In this section we show the performance of the Hydro-BAM approach in real space.

The bias model
We start analysing the reconstruction procedure in real space, i.e., excluding steps 3 and 4 in the hierarchical sampling process introduced in §3. This enables us to assess the accuracy of the method and compare it to the FGPA before additional physical effects (i.e. RSD) are taken into account.
The top panels of Fig. 1 show the phase diagrams (or bias relations) linking the transmitted flux F to other gas properties as well as to the dark matter density in the reference simulation. From this figure it is evident how the relation between the flux and dark matter density shows to be less concentrated (large scatter around the mean relation), in contrast to the scaling relations between the transmitted flux and the HII and HI density fields. This is an indication that the bias model P(F | δ dm ⊗ K) can be sub-optimal for our reconstruction procedure. Indeed, when calibrating the bias relation, the resulting power-spectrum of the reconstructed flux field displays > 10% deviations throughout all the probed Fourier modes. This result suggests that an arbitrarily complex (i.e, non-linear) δ dm − F scaling relation is not yet sufficient to capture the full physical correlation between these quantities. Non-local contributions have to be accounted for, which however cannot rely solely on the dark matter field.
The reference hydrodynamic simulation includes a series of baryonic processes (e.g. cooling and feedback by supernovae and active galactic nuclei) relevant to determine the spatial distribution of the gas, especially of its cold neutral phase.
In the case in which HII and HI densities are extracted from the reference simulation, the aforementioned bias model generates reconstructions with percent accuracy in the two-point statistics of the Lyα forest, as will be shown in forthcoming sections. The situation in which this bias model uses the HII and HI fields reconstructed with BAM is more subtle. In the latter scenario, we obtain 5% − 10% deviations in the flux power-spectrum on small and intermediate scales (k ∼ 1.0−2.0 h Mpc −1 ). To reduce these deviations, we add the dependence on the dark matter density with the model {Θ } = {δ HII ⊗ K, δ HI , δ dm }. This does not further improve the BAM calibration with respect to {Θ} = {δ HII ⊗ K, δ HI } when using baryons from the simulation. However, in the case of reconstructed HII and HI the explicit dependency with the dark matter density helps to ensure the large-scale cluster- ing signal, allowing for an overall improvement of the reconstruction. We therefore use {Θ} as our fiducial model when HII and HI are read from the reference simulation, while we use {Θ } when these fields are reconstructed within BAM.
Adopting two different fiducial models in the two situations poses no conflict if one considers that {Θ } is just an extension of {Θ}, alleviating the slight loss of accuracy discussed above and/or the effect of random noise in the spatial assignment of tracers for reconstructed fields. In the ideal case where HII and HI are extracted from the simulation, we regard {Θ} and {Θ } to carry an equivalent amount of information of Lyα forest clustering, as demonstrated by the numerical tests we mentioned.

Results
A first visual assessment of the performance of our method in real space is achieved by comparing slices through the simulation box from the reference, the reconstruction by BAM and the FGPA Lyα forest fields. This is shown in the top row of Fig. 2, from where we observe that BAM replicates the flux structures, both on large and small scales, in good agreement with the reference. The FGPA, on the other hand, reproduces the large-scale spatial distribution, whereas it tends to suppress small-scale structures and to overestimate the high density regions. This can be also observed in Fig. 3, showing transmitted flux skewers along the line-of-sight extracted at two random positions from the reference simulation, the BAM reconstructions using baryon fields and the FGPA. The origin of the sizable differences between these three cases can be due to the fact that the FGPA accounts only for the F − δ dm mean relation and for local dependencies on the dark matter field, whereas BAM captures the full non-linear non-local stochastic nature of the tracer-bias.
The quality of the HI and HII reconstructions performed by BAM can be firstly assessed through the distribution function of each property in these two scenarios. This is shown in panels (a) and (b) of Fig. 4, where we see that BAM reproduces the bi-modality presented in the HI distribution. On the other hand, the qualitative differences observed in Fig. 2 between FGPA and BAM are represented in Fig. 4 panel (c) by an excess of abundance in the low-density (F ∼ 1, i.e., full transmission) regime of the flux distribution function, with a correspondent drop of abundance in high-density regions (F ∼ 0, full absorption). Conversely, BAM reproduces with very high accuracy the flux PDF (panel (d)), as it is designed to match the tracer PDF by construction learning from the reference simulation.
The panel (a) of Fig. 5 and the top set of plots in Fig. 6 show the BAM-reconstructed (green dashed based on reference HII and HI, yellow dash-dotted based on reconstructed HII and HI) and the FGPA (purple dash-dotted) powerspectrum and bi-spectrum 10 , compared to the corresponding reference flux field summary statistics (red solid). As can be seen from panel (c) of Fig. 5, reconstructed power-spectrum based on reconstructed baryon densities (reference baryon densities) average and maximum deviations are respectively < 2% and ∼ 3% up to k = 2.0 h Mpc −1 (up to the Nyquist frequency) with respect to the reference power-spectrum. The FGPA power-spectrum oscillates around the reference Our results can be understood in more detail by comparing panels (g)-(h) and (l)-(m) in Fig. 1, which show respectively the relations P(δ HII | δ dm ) and P(δ HI | δ HII ) as seen from the simulation and as reconstructed by BAM. Although BAM reproduces the reference simulation remarkably well and manages to replicate the two-phases (hot and cool) thermodynamic structure of the gas over nearly 12 orders of magnitude, the resulting scaling relations turn out to have slightly more scatter than the reference ones, implying that they pose a weaker constraint. This translates into broader bias relations also shown in panels (d) and (f), which need therefore to be complemented with additional information supplied by the dark matter spatial distribution.
We should bare in mind that the reference hydrodynamic simulation has been run at a considerably higher resolution (512 3 vs 128 3 ). Hence, we cannot expect these relations to be equal, but to be the optimal ones reproducing the summary statistics of the various quantities of interest at a given (lower) resolution (than the reference one). The computing time gain at this resolution is already of 7 orders of magnitude as compared to the hydrodynamic simulation. This difference will, however, dramatically increase when going to larger volumes and/or lower mesh resolutions than the ones studied here. In particular, assuming to run Hydro-BAM with fixed number of cores and bias model, the code speed scales approximately as O(N c ), with N c the number of cells used in the grid. Therefore, passing from 128 3 to 64 3 mesh resolution would imply a gain in computing time of a factor ∼ 8 in the case of the reference simulation used in this paper. Similarly, adopting a simulation with comoving volume V = (500 h −1 Mpc) 3 , N p = 2048 3 particles, interpolated on a N c = 128 3 cells regular mesh (corresponding to approximately l ∼ 3.9 h −1 Mpc comoving cell size) and run with a TreePM N-body code scaling as O(N p log(N p )), Hydro-BAM would keep its performance unchanged whereas the simulation would take approximately ∼ 78 times the CPU time required by our reference simulation to run up to z = 2. Lastly, BAM performance does not depend on redshift, whereas it takes a much longer CPU time   Figure 6: Top plots set: real space Lyα forest reference bi-spectrum (red solid), bi-spectrum of BAM reconstruction obtained with density fields from the reference simulation (green dashed), FGPA bi-spectrum (purple dash-dotted). Bottom plots set: redshift space Lyα forest reference bi-spectrum (red solid), bi-spectrum of BAM reconstruction obtained with density fields from the reference simulation mapped in redshift space (blue dashed) and bi-spectrum of BAM reconstruction obtained with BAM-reconstructed density fields (brown dash-dotted). BAM well-reproduces the reference bi-spectrum up to k ∼ 0.8 h Mpc −1 both in real and in redshift space, whereas the FGPA considerably underestimate the reference bi-spectrum at k > 0.1 h Mpc −1 .
to evolve the initial conditions to lower redshift, e.g. up to z ∼ 1. In general, there are hence no drawbacks in using larger volume simulations, as long as a reasonable trade-off between number of particles and mesh resolution is guaranteed. In some unfavourable cases, there may not be any further gain in computing time with respect to the reference case investigated in this paper, being anyway left with the 7orders-of-magnitude computing time gain quoted above, i.e. HydroBAM is not penalized. When the HII and HI fields are available from the simulation, the bias is modelled with just two conditional properties (HII and HI), so that N bins,tot N cell (or, equivalently, the average number of cells per Θ-bin is N Θ = N cell /N bins,tot 1) and the results are therefore not likely to be affected by overfitting. On the other hand, the case in which BAM learns from reconstructed HII and HI fields, and includes also information on dark matter in the bias model, is more delicate and deserves a more careful look. Because the majority of the bins used to model {Θ} in the bias histogram are actually devoid of cells, following §4.4 of HydroBAM-I we account only for the bins which are filled with cells instead of considering all the bins used to describe Θ, and define the effective average number of cells per Θ-bin as N eff Θ = N cell /N bins,filled . The purpose of this is to shift the focus only onto the Θ-bins which are filled with cells and for which is therefore meaningful to define a non-trivial one-dimensional tracer probability distribution, according to which the subsequent random sampling is performed. In other words, considering also the Θ-bins which are devoid of cells would be misleading, as no random assignment of tracers is performed in such cases. 11 This computation yields N eff Θ ∼ 40 − 50 depending on the considered iteration within the calibration procedure, which is reasonably high number to regard that we do not incur into overfitting.
From a perturbation theory perspective (see e.g. McDonald et al. 2000;McDonald 2003;Seljak 2012), this would correspond to a Taylor expansion of the flux field at resolution ∆l, which to linear order is expressed as more detailed discussion on the connection between BAM and perturbation theory). The novel aspects to the standard perturbative framework are that our bias treatment (i) is explicitly based on baryon rather than on dark matter densities, (ii) that the dependencies are formulated in a multi-variate fashion, and (iii) that the relations are iteratively learnt from a reference simulation. Separating the HII and HI components, which could be in principle unified into one single baryon density field term, allows to treat them independently, give them different weights, and leave the freedom to build distinct higher order local and non-local terms, to model different baryonic phenomena down to highly non-linear scales.

RECONSTRUCTION OF THE LYMAN−α FOREST: REDSHIFT SPACE
In this section we show the performance of the Hydro-BAM approach including redshift space distortions.

RSD modelling
As anticipated in §3, the second part of the calibration pipeline requires the transition from real to redshift space. Given that the HII and HI density fields are reconstructed in real space (as in HydroBAM-I), an explicit modelling of RSD for such quantities is required at this stage. In order to do so, we devise a procedure to apply RSD to density fields already interpolated on a mesh. The procedure we adopt can be summarized in the following steps: 1. Assign to each cell i characterized by a local density ρ i , a number N = 16 pseudo-particles 12 with mass M = ρ i V i /N , and positions r j coincident to the center of the cell; 2. Consider the dark matter velocity field v sim dm,i from the simulation, interpolated on a mesh of equal size and resolution as the density fields. Apply RSD and transform real-space r j to redshift space s j coordinates by means of the mapping (see e.g. Kaiser 1987;Hamilton 1998) where is the coherent component of the velocity field, while v disp dm,j accounts for the small-scale quasi-virialized motions. This last term is originally smoothed-out from the distribution of gas by the interpolation of particle velocities on the mesh at physical cell-resolution. Therefore, we restore it assuming that its components follow a Gaussian distribution of zero mean and variance A(1 + δ i ) α with A and α as free parameters (Heß et al. 2013;Kitaura et al. 2014). The term b v in Eq. (5) is a velocity bias factor, while a is the scale factor, H is the Hubble parameter andr = r/| r|. We have adopted the distant observer approximation and translated to redshift space along the z-component.
3. Interpolate again the redshift-space pseudo-particles on a grid using a CIC MAS, with M j as weight.
This description of RSD implies a set of three free parameters {b v , A, α}, which are determined by comparing the signal of the multipole decomposition of the 3D flux power-spectrum P (k), in particular the quadrupole ( = 2) and hexadecapole ( = 4) 13 in Fourier space, leading to {b v , A, α} ∼ {0.93, 1, 1.03}. In general, the parameters depend on the grid resolution, and hence these figures might not be the optimal choice in other cases. We will further address this aspect in future works, in which we will apply our method to higherresolution mesh representations of the same reference simulation.

Reconstruction
The final phase of our pipeline consists of reconstructing the Lyα forest field in redshift space, with the bias characterized by the set of underlying properties {Θ} = {δ HII ( s) ⊗ K, δ HI ( s)}.
We first apply BAM to the reference HII and HI density fields (mapped in redshift space through S), to address the performance of our RSD model in the bias. Subsequently, we repeat the same using BAM-reconstructed HII and HI fields, to mimic a realistic mock production context.
Similarly to the case of real space, BAM is found to yield accurate results. Dedicated experiments reveal that in the latter the outcome suffers larger deviations than expected. This is due to the progressive loss of accuracy at k > 1.0 h Mpc −1 affecting the reconstructed HII and HI fields. Therefore, we re-introduce the dependence in the bias on the dark matter field ({Θ ( s)} = {δ z HII ( s) ⊗ K, δ HI ( s), δ dm ( s)}, as previously discussed in §3), conveniently mapped in redshift space as well.
The preliminary qualitative assessment based on the visual inspection of slices through the simulation box (Fig. 2, bottom row) highlights that, the same as in real space, BAM (central panel) succeeds in reproducing the small-and large-scale structure of the reference slice (left panel) with a high level of agreement.
A more quantitative assessment can be performed through the analysis of the reconstructed summary statistics. Panel (d) in Fig. 4 shows that both the Lyα forest field reconstructions built on the simulation-based (blue dashed) and BAMbased (brown dash-dotted) HII and HI fields, are found to have extremely accurate PDF if compared to the reference (red solid).
Panel (a) in Fig. 5 shows the power-spectrum reconstruction of the flux field performed by BAM, based on HII and HI density fields from the simulation (blue dashed) and from previous BAM runs (brown dash-dotted) respectively, compared to the reference flux field power-spectrum (red solid). The panel (b) of the same figure shows the ratio P mock (k)/P ref (k), indicating that in this setup BAM is able to learn the bias relation from the simulation reproducing the reference power-spectrum with average deviations 1.7% (< 1%) and maximum deviation ∼ 3% (∼ 2%) up to the Nyquist frequency (up to k = 1.0 h Mpc −1 ). Learning the bias relation from baryon fields previously reconstructed by BAM yields a moderate loss of accuracy at k > 1.0 h Mpc −1 , albeit preserving the accuracy on large-scales, accounting for average and maximum deviations 2% and ∼ 3.5% respectively at k 1.0 h Mpc −1 . As mentioned, the average mild lack of power trend at k 1.0 h Mpc −1 , dominated by superimposed random noise oscillations, is due to the progressive loss of accuracy in HII and HI reconstructions, which unavoidably propagates to the Lyα forest as well. In fact, modelling the power spectrum at k > 1.0 h Mpc −1 (where AGN and supernova feedback have a significant effect) in redshift space is a highly non-trivial task. Hence, to some extent it is not surprising that a loss of accuracy in BAM reconstructions of HII and HI has a negative impact also on the subsequent reconstruction of the Lyα forest, based on the previous two (see also §6).
To further assess the accuracy of our reconstruction and the performance of RSD modelling, we compute the multipole expansions of the power-spectrum. In particular, the quadrupole and the hexadecapole of Lyα forest reconstructions are shown in panels (a) and (b) of Fig. 7, respectively. While the power-spectrum is constrained in BAM through the iterative procedure, the quadrupole and hexadecapole are not explicitly constrained. Therefore, the accuracy in the reproduction of such multipoles is primarily a consequence of the approach we have adopted for RSD. The reference quadrupole is shown to be recovered with < 5% maximum deviation at k 1.0 h Mpc −1 , for reconstructions based both on the reference simulation (green dashed) and on BAM baryon density fields (purple dash-dotted). Average deviations of the hexadecapole with respect to the reference (red solid) are ∼ 10% and ∼ 15% at k 0.5 h Mpc −1 when learning from the simulation gas (blue dashed) and from the BAM gas (brown dash-dotted), respectively. hexadecapole P4(k) measured from the reference simulation, BAM reconstruction using density fields from the reference simulation and the BAM reconstruction using BAM-reconstructed density fields. The bottom panels (c) and (d) show the ratios between the multipoles form the two BAM reconstructions and the reference simulation. In these panels, gray shaded area in indicate the 1%, 2% and 5% deviations from the reference simulation.
The bottom plots set in Fig. 6 shows the reconstructed flux bi-spectrum at different triangular configurations in Fourier space. BAM credibly reproduces the reference bi-spectrum (red solid) at k 0.8 h Mpc −1 when building on the HII and HI fields both from the simulation (blue dashed) and from reconstructions (brown dash-dotted), starting to lose accuracy in θ 12 ≈ [0, 1.3] only at k > 0.8 h Mpc −1 .
As in the case of real space, an assessment of overfitting is needed only in the case of reconstruction based on HII and HI fields obtained by BAM, in which the dark matter field is included as well in the bias. Again, N eff Θ ∼ 40 − 50 and we can rule out the possibility that overfitting is affecting our results.
We consider the RSD modelling acceptable at this stage. A number of modifications can be introduced to improve the accuracy in the multipoles at the expense of either introducing more parameters, or another machine learning step.
For the first strategy, we could consider separate b v velocity bias for HII, HI, and the DM. Also the corresponding dispersion terms could adopt different A and α values. These improvements may allow to account for intrinsic velocity bias differences among these fields. We do not expect these refinements to introduce dramatic changes in the RSD free parameters we have estimated in this work. However, leaving the algorithm the freedom of choosing different val-ues for the parameters related to different fields might improve even further the RSD modelling, with a consequent potential projected improvement in the accuracy of our reconstructions.
In a second strategy, we could introduce BAM-like additional machine learning steps to extract the degree of anisotropy as a function of scale and density required to minimize a cost function determined by, e.g., the quadrupole or hexadecapole. In particular, the parameters of our RSD model would be automatically determined in a machine learning fashion by explicitly requiring that deviations of higher order multipoles from the reference statistics are minimized in the cost function, currently not implemented in our algorithm. Concretely, in this case the new machine learning step would consist in adding a further optimization stage in the BAM flowchart sketched in §3, repeating iteratively steps 3, 4, and 5, to determine the values of b v , A and α which optimize the quadrupole and hexadecapole. Extracting such parameters as a function of scale and density might allow to unveil non-obvious dependencies and, thus, improve the model.
We leave these series of studies for later work.

LIMITATIONS OF THE METHOD
In what follows, we summarize the main potential drawbacks and limitations of our method.
• Availability of a suitable reference simulation: since BAM is in principle able to self-consistently learn the kernel from just one, or few, reference simulations, the main limitation to its application is the availability of such a suitable simulation. In machine learning language, such limitation translates into the availability of a suitable, sufficiently large, training dataset. While this condition is unavoidable by construction and represents a cause of likely failure if not fulfilled, as already mentioned Hydro-BAM alleviates the need for a rather large training set (i.e. tens of simulations), commonly shared by other machine learning algorithms through an explicit modelling of the physical dependencies underlying the bias relation. In this sense, as already mentioned in §2.1, one or few simulations spanning a sufficiently large total volume, are enough to guarantee that the learning processes is successful and not prone to cosmic variance effects. A volume in our context is sufficiently large when: 2. the number of cells exceeds the number of parameters of the bias model (i.e., the total number of bins) to ensure good enough statistics for each bias dependency.
• Need for an effective and feasible prescription for the bias: while the architecture of Hydro-BAM -relying on an explicit modelling of the key physical dependencies in the bias relation -has been proven to be a major strength of our method, it can become a drawback if one has little clue of which dependencies are the leading ones. This is particularly true for non-local bias dependencies. While a common ML method is able to explore any kind of non-local dependence without understanding the underlying physics, the Hydro-BAM approach needs some explicit information such as the tidal field tensor. An extension of this problem is that Hydro-BAM needs in principle all relevant physical relations given to reach a certain accuracy. This drawback could become especially problematic when going to smaller scales, when we need to specify not only the tidal field tensor, but also the density fluctuation curvature, or even the velocity shear. A ML algorithm can potentially automatically explore all de-pendencies and even blindly handle them with multiple layers. For the considered scales in this work, we consider nonetheless that Hydro-BAM is very competitive and could inspire ML algorithms. Even though all the applications shown thus far in BAM papers are supported by a long-standing theoretical discussion in the literature, offering valuable insights into the topic, this might not be the case for other applications not yet explored. Moreover, a similar issue arises if the bias relation requires the modelling of quantities beyond the dark matter density field, which cannot be reliably obtained within BAM. In this case, one can still extract the needed fields from the reference simulations to carry out a purely academic study, but will not be able to exploit such information to generate mock catalogs. E.g., in this work, we make use of the BAM-reconstructed HII and HI fields to map the Lyα forest. If we had found other gas properties to play a crucial role in the Lyα forest mapping, we would have had to go beyond the hierarchical approach presented in HydroBAM-I and devise new strategies.
• Loss of accuracy in the various BAM hierarchical reconstruction steps: we have shown in this paper that the application of the whole Hydro-BAM pipeline results in a very accurate reconstruction in redshift space of the power spectrum up to k ∼ 1.0 h −1 Mpc, and of the bi-spectrum up to k ∼ 0.8 h −1 Mpc. The loss of accuracy at k 1.0 h −1 Mpc is a natural consequence of the fact that baryon processes have an important impact in determining the Lyα forest power spectrum at such frequencies. Therefore, achieving high accuracy in such a highly-nonlinear regime is very nontrivial, and a loss of accuracy in the various reconstruction steps progressively forward-propagate until the last step.
• Dependence on cosmological and astrophysical models of the reference simulation: as BAM extracts the relations linking different quantities from the reference simulations, where such quantities have been obtained through the full N -body/SPH calculation, the resulting BAM mock catalogs will forcefully represent cosmic realizations bound to the input parameters and models adopted to run the reference training simulation. This implies that, to perform predictions with different cosmological models (e.g. modified gravity), or different baryon feedback implementations, it will be necessary to dispose of a suite of different training simulations, one for each different sets of models and parameters to be investigated.

CONCLUSIONS
In this paper, we have presented a hierarchical domain specific machine learning approach to efficiently map the threedimensional Lyα forest field starting from the dark matter density field. We have shown that a direct relation cannot be accurate if we ignore the HII and HI distributions. Therefore, the machine learning approach has to be first applied to obtain the HII field, which shows the highest correlation with the dark matter field, and then in turn that baryonic field has to be used to get the HI field, as described in Sinigaglia et al. (2021). These fields are combined in a non-linear, non-local multivariate fashion to obtain the Lyα forest absorption flux field. In this way, a few percent accuracy is achieved in the summary statistics.
Our results, first obtained in real space, show maximum and average deviations in the power-spectrum up to the Nyquist frequency of ∼ 3% and < 1.7%, respectively; and good agreement in the bi-spectra probing scales down to k ∼ 1.0 h Mpc −1 . In contrast, the commonly-adopted FGPA reproduces the Lyα forest power-spectrum with oscillations of ∼ 5% around the reference up to k ∼ 1.0 h Mpc −1 , failing to match the one-point PDF and is found to progressively lose accuracy in the bi-spectrum at configurations already with k > 0.1 h Mpc −1 .
In order to represent the Lyα forest in redshift space (as probed by galaxy surveys), we map the HII and HI fields from real to redshift space through an appropriate modelling of RSD, not included in HydroBAM-I. In particular, we explicitly account for both large-scale coherent flows and smallscale quasi-virialized motions contributions.
Having these consistent anisotropic fields we apply the same machine learning scheme as in real space to reconstruct the Lyα forest field in redshift space.
The reconstructed Lyα forest power-spectrum presents average deviations of 1.7% ( 2%) and maximum deviations of ∼ 2% (∼ 3%) up to k ∼ 1.0 h Mpc −1 (up to the Nyquist frequency) when the reconstruction of fluxes is based on HII and HI density fields previously reconstructed with BAM following HydroBAM-I (from the simulation). The reconstructed quadrupole deviates < 5% with respect to that from the reference at k < 0.9 h Mpc −1 when using BAMreconstructed baryon densities and up to k ∼ 2.0 h Mpc −1 when relying on baryon densities from the reference simulation. The hexadecapole reproduces the reference with reasonable accuracy considering that it is highly affected by cosmic variance. Finally, the bi-spectrum is well-reproduced for Fourier triangular configurations with side lenghts up to k ∼ 0.8 h Mpc −1 .
We stress that our hierarchical bias mapping is applied for the first time in this paper also using in the calibration HII and HI density fields self-consistently reconstructed within BAM, whereas in HydroBAM-I we showcased only the ideal situation in which such fields can be read from the simulation.
In forthcoming publications we will extend the technique presented in this work to a larger volume simulation, aiming at producing mock Lyα forest realizations to cover the cosmic volumes probed by on-going and future observational campaigns. In parallel, we will address the bias modelling towards smaller scales, to improve the accuracy of reconstructed quantities in Hydro-BAM on scales where relevant astrophysical phenomena, such as feedback and cooling, plays a major role in determining the spatial distribution of baryons. As mentioned, in this context we also plan to extend our study to higher-resolution mesh representation of the same reference simulations, to get Hydro-BAM ready to explore new avenues and study phenomena which can impact the Lyα forest summary statistics on very small scales (k ∼ 5 − 10 h Mpc −1 ), such as the mass of warm dark matter particles (Viel et al. 2005(Viel et al. , 2013, the mass of neutrinos (Palanque-Delabrouille et al. 2015) and the thermal state of the intergalactic medium (Garzilli et al. 2017, and references therein), among others.
Furthermore, our method can be applied in a similar fashion as in this paper to obtain independent HI density fields in redshift space which, if properly complemented with instrumental effects, can pave the way to the generation of accurate and detailed 21cm-line mock catalogs for HI intensity mapping.
We stress that BAM as a particularly efficient machine learning technique has still plenty of potential applications to be explored in the fields of large-scale structure and galaxy formation and evolution, limited a priori just by the availability of a suitable reference simulation.
tronomical Observatory of Japan (NAOJ), the XC40 and the Yukawa-21 at YITP in Kyoto University, OCTOPUS at the Cybermedia Center, Osaka University, and Oakforest-PACS at the University of Tokyo as part of the HPCI system Research Project (hp200041, hp210090). This work is supported in part by the JSPS KAKENHI Grant Number JP17H01111, 19H05810, 20H00180. K.N. acknowledges the travel support from the Kavli IPMU, World Premier Research Center Initiative (WPI), where part of this work was conducted. MA thanks for the support of the Kavli IPMU fellowship and was supported by JSPS KAKENHI Grant Number JP21K13911. We acknowledge Cheng Zhao for making available his code to measure the bi-spectrum.