Constraining the Reionization History using Bayesian Normalizing Flows

The next generation 21 cm surveys open a new window onto the early stages of cosmic structure formation and provide new insights about the Epoch of Reionization (EoR). However, the non-Gaussian nature of the 21 cm signal along with the huge amount of data generated from these surveys will require more advanced techniques capable to efficiently extract the necessary information to constrain the Reionization History of the Universe. In this paper we present the use of Bayesian Neural Networks (BNNs) to predict the posterior distribution for four astrophysical and cosmological parameters. Besides achieving state-of-the-art prediction performances, the proposed methods provide accurate estimation of parameters uncertainties and infer correlations among them. Additionally, we demonstrate the advantages of Normalizing Flows (NF) combined with BNNs, being able to model more complex output distributions and thus capture key information as non-Gaussianities in the parameter conditional density distribution for astrophysical and cosmological dataset. Finally, we propose novel calibration methods employing Normalizing Flows after training, to produce reliable predictions, and we demonstrate the advantages of this approach both in terms of computational cost and prediction performances.


Introduction
In the past decade, Cosmology has entered into a new precision era due to the considerable number of experiments performed to obtain information both from early stages of the Universe through the Cosmic Microwave Background (CMB) and late times via deep redshift surveys of large-scale structures. These measurements have yielded precise estimates for the parameters in the standard cosmological model, establishing the current understanding of the Universe. However, the intermediate time known as Epoch of Reionization (EoR), when the first stars and galaxies ionized the InterGalactic Medium (IGM), remains vastly unexplored. This period is relevant to understand the properties of the first structures of our Universe and provide complementary information related to fundamental Cosmology, inflationary models, and neutrino constraints, among others, e.g., [1]. It is known that the EoR can be studied indirectly through its imprint in the IGM, using the redshifted 21cm line [2]. This line results from the hyperfine splitting of the ground state of the hydrogen atom due to the coupled magnetic moments between the proton and the electron, emitting radiation with a 21cm wavelength, then redshifted by the expansion of the Universe [1]. Future experiments such as Hydrogen Epoch of Reionization Array (HERA) 1 and the Square Kilometre Array (SKA) 2 are intended to measure this 21cm signal over a wide range of redshifts providing 3D maps of the first hundreds millions years of the Universe. These instruments are expected to generate a huge amount of spectra, encouraging the development of automated methods capable of reliably estimating physical parameters with great accuracy. Recently, Deep Neural Networks (DNNs) have been applied in several fields of Astronomy because of their ability to extract complex information from data, which makes it benefit for analysing non-Gaussian signatures. In particular, the application of DNNs to the 21 cm signal has received considerable attention due to the success of classifying reionization models [3] or estimating physical parameters [4]. For example, in [4] 2D images corresponding to slices along the line-of-sight axis of the light-cones were used for training Convolutional Neural Networks (CNN) in order to estimate some astrophysical parameters. More recently [5] and [6] generalized the previous findings by incorporating contamination from simulated SKA-like noise. However, DNNs are prone to overfitting due to the high number of parameters to be adjusted, and do not provide a measure of the uncertainty for the estimated parameters, see for instance [7][8][9]. These limitations can be addressed by following a Bayesian approach, both intrinsically providing an effective regularization during training and allowing to quantify the uncertainty in the predicted parameters at inference time [10].
In this paper, we generalize these preliminary works related to the application of DNNs on the 21 cm data by implementing Bayesian Neural Network (BNNs), in order to obtain the posterior probability estimates of the physical parameters and their correlations. We discuss methods for calibrating uncertainties in Bayesian Networks and we propose approaches toward the efficient extraction of information in non-Gaussian signals via diffeomorphic transformations of the output distribution.
This work is organized as follows. In Sec. 2 we introduce the variational inference formalism which allows to compute the aleatoric and epistemic uncertainty of BNNs, and we also comment on the use of bijectors for improving inference tasks. In Sec. 3 we describe the Reionization model used in this work and the generation of the synthetic dataset. In Sec. 4 we describe the network architecture, and in Sec. 5 we show the results related to the potential application of BNNs to obtain approximate posteriors over the parameter space. Moreover, we also present the most relevant findings about implementing Normalizing Flows in inference task and their advantages in estimating the parameters relevant for the EoR. Finally, conclusions and future works are shown in Sec. 6.

Variational Inference
BNNs provide the adequate groundwork to output reliable estimations for many machine learning tasks. Let us consider a training dataset D = {(x i , y i )} D i=1 formed by D couples of images x i ∈ R M and their respective targets y i ∈ R N . By setting a prior distribution p(w) on the model parameters w, the posterior distribution can be obtained from Bayes' law as p(w|D) ∼ p(D|w)p(w). Unfortunately, the posterior usually cannot be obtained analytically and thus approximate methods are commonly used to perform the inference task. The Variational Inference approach approximates the exact posterior p(w|D) by a parametric distribution q(w|θ) depending on a set of variational parameters θ [7]. These parameters are adjusted to minimize a certain loss function, usually given by the KullBack-Leibler divergence KL(q(w|θ)||p(w|D)). It has been shown that minimizing the KL divergence is equivalent to minimizing the following objective function [7] F V I (D, θ) = KL(q(w|θ)||p(w)) − (x,y)∈D Ω q(w|θ) ln p(y|x, w)dw . (2.1) To infer the correlations between the parameters uncertainties [8,11,12], we need to predict the full covariance matrix. This requires to produce in output of the last layer of the network a mean vector µ ∈ R N and a covariance matrix Σ ∈ R N ×N representing the aleatoric uncertainty, for instance parameterized through its Cholesky decomposition Σ = LL . These outputs determine the Negative Log-Likelihood (NLL) when p(y|x, w) is a Multivariate Gaussian distribution [9,11,13] ln p(y|x, w) ∼ Letθ be the value of θ after training, corresponding to a minimum of F V I (D, θ). The approximate predictive distribution qθ of y * for a new input x * can be rewritten as [12] qθ(y * |x * ) = Ω p(y * |x * , w)q(w|θ)dw . (2.3) Moreover [14] proposed an unbiased Monte-Carlo estimator for Eq. 2.3 where K is the number of samples. In Bayesian deep learning [15] two main uncertainties are of interest: the aleatoric, capturing the inherent noise in the input data, and the epistemic, capturing the uncertainty in the model, typically due to the lack of data points during training which are similar to the current observation. To obtain both uncertainties [8], we can invoke the total covariance law for a fixed x * the images are forward passed through the network T times, obtaining a set of mean vectors µ t and a covariance matrices Σ t . Then, an estimator for the total covariance of the trained model can be written as In this setting, BNNs can be used to learn the correlations between the targets and to produce estimates of their uncertainties.

Normalizing Flows in inference
Transforming probability distributions has become a powerful tool in deep learning. The main idea of a Normalizing Flow is to use a diffeomorphism (a differentiable and bijective mapping) to transform the sample space of a distribution. It can be demonstrated that this is equivalent to transforming the probability distribution and thus allowing for a more expressive output of the model [16]. For a more comprehensive introduction about flows we remind the reader to [17,18].

Basics concepts of Normalizing Flows
Let us consider u ∈ U ⊂ R D and x ∈ X ⊂ R D two D dimensional vectors in some subsets of R D . We can define probability distributions for u and x, in the associated sample spaces U and X, respectively. Let f be a diffeomorphism between the two sample spaces f : U → X. Knowing the probability distributions q u , we can define q x as where x = f (u). We can construct more complex densities by applying successively the bijector (2.7), thus transforming an initial random variable u with distribution q u (= q 0 ) through a series of transformations f n as where we defined x 0 = u for convenience of notation. Any expectation E qn [h(x n )] can be written as an expectation under q u as Through a suitable choice of the diffeomorphisms f n , we can start from simple factorized distributions such as a mean-field Gaussian and apply normalizing flows to obtain complex and multi-modal distributions [17][18][19].
Ideally a flow is both expressive and requires a reduced additional cost. Several transformations have been proposed in the literature [17,18] to compute efficiently both (2.8) and (2.9). An example are Masked Autoregressive Flows [20] (MAF) and its inverse, the Inverse Autoregressive Flows [21] (IAF) which allow to compute efficiently Eq. (2.9) and Eq. (2.8) respectively. Real valued non-volume preserving transfromations [22] (NVP) are a special case of both MAF and IAF, in which d pass-through units are selected and the transformations on the other units are function of these pass-through units. A more recent improvement is represented by neural ODE [23,24] in which the flow is represented by a continuous transformation specified through an ordinary differential equation. In this paper we will use: NVP, MAF and IAF as a proof of concept to show how to provide a more flexible and scalable distribution in the output of the network with the purpose of extracting complex features in the data such as a non-Gaussianities and provide well calibrated BNN model.

Dataset generation
We generated 21cm simulations through the semi-numerical code 21cmFast [25], producing realizations of halo distributions and ionization maps at high redshifts. Through approximate methods, the code generates full 3D realizations of the density, ionization, velocity, spin temperature, and 21-cm brightness temperature fields. The latter is computed as [26] where T S and T γ (z) are the gas spin and the CMB temperatures at redshift z respectively, δ m is the density contrast of baryons, and x HI denotes the neutral fraction of hydrogen. At z ≈ 6−20, the scattering of photons from the first galaxies and black holes warms up the IGM enhancing the ionization fraction [27]. Once the gas is ionized at some percent level (∼ 25%, see [28]), the spin temperature becomes greater than CMB temperature, T S >> T γ , and thus, the dependence on T S may be neglected in Eq. 3.1. The contrast density which depends on cosmological parameters is found by using the same techniques employed in numerical cosmological simulations [26,28], while the HI ionized field parametrized by x HI and dependent on different astrophysical parameters, is obtained from the excursion-set approach [25]. At the end, the combination of the numerical solutions for the contrast density and x HI , allows to compute the 21 cm brightness temperature field via Eq. 3.1. In order to generate the synthetic dataset for training the network, we follow the ideas started in [10] where we have varied four parameters. Two parameters corresponding to the cosmological context: the matter density parameter Ω m ∈ [0.2, 0.4], and the amplitude of mass fluctuations on 8h −1 Mpc, The other two parameters corresponding to the astrophysical context: the ionizing efficiency of high-z galaxies ζ ∈ [10, 100] and the minimum virial temperature of star-forming haloes T F vir ∈ [3.98, 39.80] × 10 4 K (hereafter represented in log10 units). For each set of parameters we produce 20 images at different redshifts in the range z ∈ [6,12], and we stack these redshift-images into a single multi-channel tensor. This scheme brings two main advantages, first the network can extract effectively the information encoded over images as it was reported in [11], and secondly, it represents adequately the signals for the next-generation interferometers and provides advantages when we need to include effects of foreground contamination [5]. As a final result we have obtained 6, 000 images each with size of (128, 128, 20) and resolution 1.5 Mpc. We used a 70-10-20 split for training, validation and test, respectively.

Architecture and network calibration
All the networks are implemented using TensorFlow 3 and TensorFlow-Probability 4 . We used a modified version of the VGG architecture with 5 VGG blocks (each made by two Conv2D layers and one max pooling) and channels size [32,32,32,32,64]. Kernel size is fixed to 3×3 and activation function used is LeakyReLU (α = −0.3). Each convolutional layer in the network is followed by a batch renormalization layer. The last layer is dense with output corresponding to the mean of predictions µ and a lower triangular matrix L, cf. [9,11,13], yielding a multivariate Gaussian distribution with mean µ and covariance Σ = LL to guarantee positive definiteness. To model the distribution over the weights we used two methods, Dropout [14] and Flipout [29]. For the Dropout method, after each Conv2D we included a dropout layer which is always turned on, during both training and test. In [14] the authors found that this method is equivalent to work with BNNs i.e., the weights are sampled from a (kind of) Bernoulli distribution and during inference allows to obtain uncertainties of the output's network. On the other hand, we used the Flipout method, assuming in Eq. 2.1 a Gaussian distribution over the weights for both prior and posterior, and providing an efficient way to draw pseudo-independent weights for different elements in a single batch [29]. We trained the networks for 180 epochs with batches of 32 samples, using 10 samples from the approximate posterior for the estimation of Eq. 2.4 (experiments with 1 sample are also reported in Appendix for comparison). After training, to obtain the prediction distributions and the related uncertainties, we feed each input image from the test set 2,500 times to each network.

Calibration in BNNs
Predicting reliable uncertainties is crucial for classification and regression models in many applications. However, it is known that DNNs trained with NLL may produce poor uncertainty estimates [30]. Weight decay, Batch Normalization and the choice of specific divergences in VI have been shown to be important factors influencing the calibration [30,31]. One way to observe this miscalibration is computing the coverage probability in the test set. For doing this, we have binned the samples drawn from inference and computed their mode [32]. With this value, and assuming an unimodal posterior, we estimated the intervals that include the 68, 95, and 99% of the samples.
To solve the miscalibration of the network, we can proceed in two different ways. The first option is to calibrate the network during training by tuning hyper-parameters such as dropout rate in Dropout [11,32], the regularization for the scale of the variational distribution in Flipout [11], or the α-value in the alpha-divergence energy [31]. The second option involves post-processing techniques applied after training. For the latter option, we retrain the last layer of the network as it was proposed in [11], minimizing the NLL Eq. 2.2 distorted by Normalizing Flows. Other methods have been reported in literature for calibration [11,30,33]. However, for practical reasons only a selection of calibration methods which from preliminary experiments we found working best will be reported.

Results
We quantify the performance of the networks by the coefficient of determination and the accuracy of the uncertainties (well calibrated networks). The coefficient of determination is defined as whereμ(x i ) (see Eq. 2.6) is the prediction of the trained Bayesian network ,ȳ is the average of the true parameters and the summations are performed over the entire test set. R 2 ranges from 0 to 1, where 1 represents perfect inference. Regarding uncertainties accuracy, for calibrated networks y i should fall in a β% confidence interval of the conditional density estimation (2.4) approximately β% of the time, where β = {68.3, 95.5, 99.7} corresponding to 1, 2, and 3σ confidence levels of a normal distribution.

Comparison among BNNs methods
For Dropout we tested several dropout rates in the range [0.01, 0.1] keeping L2 regularization fixed to 1e −5 , while for Flipout we tested several L2 regularizations in the range [1e −5 , 1e −7 ]. A detailed summary of the experiments is reported in Fig. 5 and Tables 7-8 in Appendix. Table 1 reports the best configuration of the network: Flipout with L2 regularizer 1e −7 and Dropout with dropout rate 1e −2 . We report in the same table the coefficient of determination and average confidence intervals for Flipout and Dropout after calibration [11]. Flipout obtains the best estimations even though tends to overestimate its uncertainties.  present paper is considerably smaller, furthermore as discussed in this section our approach has the advantage to be able to provide uncertainty estimation for the predicted parameters. In Fig. 1 we report the confidence intervals 5 for a single example in the test set and in Table 2 we present the parameters predictions at the 95% confidence level. Notice that Flipout yields more accurate inferences and provides tighter constraints contours, see for example T F vir -ζ. Moreover, the correlations extracted from EoR such as σ 8 -Ω m (see Fig. 1) provide significant information for breaking parameter degeneracies and thus, be able to  Table 2: Limits at the 95% confidence level of the credible interval of predicted parameters.
improve the existing measurements on cosmological parameters [1,35,36]. In the following sections we will focus on the methods used for producing reliable uncertainties. Since Flipout achieves better performances than Dropout, from now on we will use Flipout to determine the performance of the subsequent calibration experiments.

Normalizing Flows during Training
A good predictive distribution depends on how well the parametric distribution matches the exact posterior. For the case presented above, the variational distribution provides a simple Gaussian approximation for the conditional density p(y|x, w).  Normalizing Flows map an initial probability distribution through a series of transformations to produce a richer, and even a multi-modal distribution [16].
We consider different kinds of Normalizing Flows acting on the output distribution of a BNN: the inverse autoregressive flow (IAF), Masked Autoregressive Flow (MAF) and nonvolume preserving flows (NVP). The results of these experiments are reported in Table 3. We observe that the R 2 are comparable for all methods, but the NLL is higher for the IAF, which means that this method tends to recover better accuracy in the uncertainties. This is consistent with the findings showed in the last three rows in Table 3, where the coverage probability in the test set is closer to the confidence intervals for IAF.  Table 4: Limits at the 95% confidence level of the credible interval of predicted parameters using Normalizing Flows.
Normalizing Flows lead to more expressive output distributions that focus on obtaining calibrated probabilities rather than enhance the precision in the target value (quantified by R 2 ) as we compare with the ones reported in Table 1. In Fig. 2 we compare the best models i.e., IAF and MAF. We can notice how smaller are the confidence regions predicted by MAF with respect to IAF, although both methods predict roughly the same orientations as expected. Finally, the values of the parameters with confidence levels at 95% are reported in Table 4. Comparing these results with Table 2 we can notice how the predicted parameters uncertainties are tighter and also more asymmetric, which is a consequence of the increased flexibility of the output posterior.

Normalizing Flows in the post-process calibration
In this part we will focus on post-process methods for network calibration using Flows. We apply Normalizing Flows on the output distribution of the Flipout experiment reported in Sec. 5.1, trained with a vanilla Multivariate Gaussian in output. We retrain the last layer as it was suggested in [11], plus we train the parameters of the flow. This method has the advantage that does not require to retrain the entire network, thus demonstrating to be very cost efficient. Results after 120 epochs of the proposed recalibration are reported in Table 5. Notably, the post-processing calibration outperforms all other experiments done so far, in terms of both the expected deviation from the target value, R 2 , and the NLL. Additionally, the improvement   of the NLL leads to better calibrated networks, validated thought the coverage probabilities in Table 5. What is perhaps even more interesting is that using any Normalizing Flow method in the post-process period, outperforms the performances obtained by that same flow during training, which leads to a powerful method for obtaining models with correct interpretation of its uncertainty estimates. Calibrating with a NF results to be an easier optimization, converging to better results. In order to compare the methods used so far, we chose an example from the test set,  Table 6: Limits at the 95% confidence level of the credible interval of predicted parameters using Normalizing Flows with and without calibration.  and produced the posterior and marginalized distributions of the parameters, obtaining the results displayed in Fig. 3. We can observe that after calibration, the contours produced by MAF becomes wider solving the underestimation found during training. Moreover, the contours of MAF and IAF applied in calibration overlap, and they are smaller compared with the base experiment, while Flows applied during training produce better results only for IAF. The credible intervals at 95% are shown in Table 6. There we can see the effect of the flows on the performance of the network, allowing for a reduction of the uncertainty intervals. Finally, in Fig. 4 we plot the predicted and true values of the cosmological and astrophysical parameters using the model calibrated with IAF and MAF Flows. The error bars displayed in the plots correspond to both aleatoric and epistemic uncertainties. Here we observe that σ 8 parameter contains larger errors which means this parameter is the most difficult to predict accurately, this could be a consequence of the 21cm signal being less sensitive to the effects of the density field than the IGM properties [4,6]. The ionizing efficiency, ζ, presents instead accurate predictions at low values, which are getting progressively less accurate and less precise at larger values. This fact could be explained due to the limited information at high redshifts (which can be reduced by assuming not spin temperature saturation) and also, the small variability of the brightness temperature maps at lower redshift with respect to large values of ζ.

Conclusion and discussion
We presented the first study using Bayesian Neural Network and Normalizing Flows to obtain credible estimates for astrophysical and cosmological parameters from 21cm signals. These methods offer alternative ways different from MCMC to make inference and recover the information in the 21 cm observations. Firstly, we show that Flipout outperforms Dropout and is able both to better estimate parameters correlations and to obtain a better coefficient of determination. Comparing with existing literature, we obtain comparable performances, while using a relatively smaller network [4,6], furthermore by using a BNN, based on Variational Inference, we can estimate the confidence intervals for the predictions and the parameters uncertainties correlations. The 21 cm signal is highly non-Gaussian due to the complex physics involved during the EoR. Normalising Flows provide a flexible likelihood model capable to better capture complex information encoded in the dataset. This improves the performance of the network, training with flows achieves better NLL values than experiments without Flows (Tables 1 and 3). Additionally we propose novel calibration methods employing flows after training, showing how this method provide accurate uncertainty estimates and high prediction of the parameters regardless of the flow used (Table 6). Fine tuning the last layer, in combination with NFs leads to a simple, fast and effective technique for calibrating BNNs.
As future perspective, we plan on evaluating the performances of different network architectures (also in particular residual networks) and estimate the cosmological and astrophysical parameters in the presence of realistic noise from instruments of the future 21 cm surveys and also in other astrophysical dataset.   . In each experiment, we sample once and ten times during training. explained in Sec. 4.1. The contour regions for the best results are also reported in Fig. 5. The R 2 for Flipout is reported in Table 8.