Artificial Neural Networks for Galaxy Clustering: learning from the two-point correlation function of BOSS galaxies

The increasingly large amount of cosmological data coming from ground-based and space-borne telescopes requires highly efficient and sufficiently fast data analysis techniques to maximise the scientific exploitation. In this work, we explore the capability of supervised machine learning algorithms to learn the properties of the large-scale structure of the Universe, aiming at constraining the matter density parameter, Ω_m. We implement a new Artificial Neural Network for regression data analysis, and train it on a large set of galaxy two-point correlation functions in standard cosmologies with different values of Ω_m. The training set is constructed from log-normal mock catalogues which reproduce the clustering of the Baryon Oscillation Spectroscopic Survey (BOSS) galaxies. The presented statistical method requires no specific analytical model to construct the likelihood function, and runs with negligible computational cost after training. We test this new Artificial Neural Network on real BOSS data, finding Ω_m = 0.309 ± 0.008, which is remarkably consistent with standard analysis results.


Introduction
One of the biggest challenges of modern cosmology is to accurately estimate the standard cosmological model parameters and, possibly, to discriminate among alternative cosmological frameworks. Fast and accurate statistical methods are required to maximise the scientific exploitation of the cosmological probes of the large-scale structure of the Universe. During the last decades, increasingly large surveys have been conducted with both ground-based and space-borne telescopes, and a huge amount of data is expected from next-generation projects, such as Euclid (Laureijs et al., 2011; Blanchard et al., 2020) and the Vera C. Rubin Observatory (LSST Dark Energy Science Collaboration, 2012). This scenario, in which the amount of available data is expected to keep growing at an exponential rate, suggests that machine learning techniques will play a key role in cosmological data analysis, since their reliability and precision strongly depend on the quantity and variety of the inputs they are given.
According to the standard cosmological scenario, the evolution of density perturbations started from an almost Gaussian distribution, primarily described by its variance. In configuration space, the variance of the field is the two-point correlation function (2PCF), which is a function of the magnitude of the comoving distance between objects, r. At large enough scales, the matter distribution can still be approximated as Gaussian, and thus the 2PCF contains most of the information of the density field.
The 2PCF and its Fourier-space analogue, the power spectrum, have been the focus of several cosmological analyses of observed catalogues of extra-galactic sources (see e.g. Totsuji and Kihara, 1969; Peebles, 1974; Hawkins et al., 2003; Parkinson et al., 2012; Bel et al., 2014; Alam et al., 2016; Pezzotta et al., 2017; Mohammad et al., 2018; Gil-Marín et al., 2020; Marulli et al., 2021, and references therein). The standard way to infer constraints on cosmological parameters from the measured 2PCF is by comparison with a physically motivated model through an analytical likelihood function. The latter should account for all possible observational effects, including both statistical and systematic uncertainties. In this work, we investigate an alternative data analysis method, based on a supervised machine learning technique, which does not require a customized likelihood function to model the 2PCF. Specifically, the 2PCF shape will be modelled with an Artificial Neural Network (NN), trained on a sufficiently large set of measurements from mock data sets generated at different cosmologies.
Supervised machine learning methods can be considered as a complementary approach to standard data analysis techniques for the investigation of the large-scale structure of the Universe. There are two main issues in applying machine learning algorithms in this context. Firstly, the simulated galaxy maps used to train the NNs have to be sufficiently accurate at all scales of interest, including all the relevant observational effects characterising the surveys to be analysed. Indeed, the NN outputs rely only on the examples provided during the training phase, not on any specific instructions (Samuel, 1959; Goodfellow et al., 2016). The accuracy of the output depends on the reliability of the training set, while the precision increases with the size of the training set. Secondly, significant computational resources are required to construct the mock data sets in a sufficiently large number of test cosmologies.
On the other hand, both the construction of the mock data sets and the NN training and validation have to be done only once. After that, the NN can produce outputs with negligible computational cost, and can exploit all physical scales on which the mock data sets are reliable, that is possibly beyond the domain in which an analytic likelihood is available. These represent the key advantages of this novel, complementary data analysis technique.
Machine learning based methods should provide an effective tool in cosmological investigations based on increasingly large amounts of cosmological data from extra-galactic surveys (e.g. Cole et al., 2005; Parkinson et al., 2012; Anderson et al., 2014; Kern et al., 2017; Ntampaka et al., 2019; Ishida, 2019; Hassan et al., 2020; Villaescusa-Navarro, 2021). In fact, these techniques have already been exploited for the analysis of the large-scale structure of the Universe (see e.g. Aragon-Calvo, 2019; Tsizh et al., 2020). In some cases, machine learning models have been trained and tested on simulated mock catalogues to recover the cosmological parameters those simulations had been constructed with (e.g. Ravanbakhsh et al., 2016; Pan et al., 2020). These works demonstrated that the machine learning approach is powerful enough to infer tight constraints on cosmological model parameters when the training set is the distribution of matter in a three-dimensional grid.
The method we are presenting in this work is different in this respect, as our input training set consists of 2PCF measurements estimated from mock galaxy catalogues. The rationale of this choice is to help the network to efficiently learn the mapping between the cosmological model and the corresponding galaxy catalogue by exploiting the information compression provided by the second-order statistics of the density field. The goal of this work is to implement, train, validate and test a new NN of this kind, and to investigate its capability in providing constraints on the matter density parameter, Ω m , from real galaxy clustering data sets.
The construction of proper training samples would require running N-body or hydrodynamic simulations in a sufficiently large set of cosmological scenarios. As already noted, this task demands substantial computational resources and dedicated work, which is beyond the scope of the current analysis. A forthcoming paper will be dedicated to this fundamental task. Here, instead, we rely on log-normal mock catalogues, which can reproduce the monopole of the redshift-space 2PCF of real galaxy catalogues with reasonable accuracy and in a minimal amount of time. This will allow us to test our NN on training data sets with the desired format, in order to be ready for forthcoming analyses with more reliable data sets. Moreover, future developments of the presented method will involve estimating a higher number of cosmological parameters at the same time. The implemented NN is provided through a public Google Colab notebook¹.
The paper is organised as follows. In Section 2 we give a general overview of the data analysis method we present in this work. In Section 3 we describe in detail the characteristics of the catalogues used to train, validate and test the NN. The specifics of the NN itself, together with its results on the test set of mock catalogues, are described in Section 4. The application to the BOSS galaxy catalogue and the results it leads to are presented in Section 5. In Section 6 we draw our conclusions and discuss possible future improvements. Finally, Appendix A provides details on the algorithms used to construct the log-normal mock catalogues.

The data analysis method
The data analysis method considered in this work exploits a supervised machine learning approach. Our general goal is to extract cosmological constraints from extra-galactic redshift surveys with properly trained NNs. As a first application of this method, we focus the current analysis on the 2PCF of BOSS galaxies, which is exploited to extract constraints on Ω m .
The method consists of a few steps. Firstly, a fast enough process is needed to construct the data sets with which to train, validate and test the NN. In this work we train the NN with a set of 2PCF mock measurements obtained from log-normal catalogues. The construction of these input data sets is described in detail in Section 3. Specifically, we create several sets of mock BOSS-like catalogues, assuming for the cosmological parameters the values inferred from the Planck Cosmic Microwave Background observations, except for Ω_m, which assumes different values in different mock catalogues, and Ω_Λ, which is adjusted every time so that Ω_tot = Ω_m + Ω_Λ + Ω_rad = 1, where Ω_Λ and Ω_rad are the dark energy and radiation density parameters, respectively. Specifically, we fix the main parameters of the Λ-cold dark matter (ΛCDM) model to the following values: Ω_b h² = 0.02237, Ω_c h² = 0.1200, 100 Θ_MC = 1.04092, τ = 0.0544, ln(10¹⁰ A_s) = 3.044, and n_s = 0.9649 (Aghanim et al., 2018, Table 2, TT,TE+lowE+lensing). Here h indicates the dimensionless Hubble constant, h ≡ H_0 / (100 km s⁻¹ Mpc⁻¹).
The implemented NN performs a regression analysis, that is, its output can assume any value within the range the algorithm has been trained for, which in our case is 0.24 < Ω_m < 0.38. In particular, the algorithm takes as input a 2PCF monopole measurement, and provides as output a Gaussian probability distribution on Ω_m, from which we can extract the mean and standard deviation (see Section 3.4 and Appendix A). Specifically, the data sets used for training, validating and testing the NN consist of different sets of 2PCF monopole measurements, labelled by the value of Ω_m assumed to construct the mock catalogues these measurements have been obtained from, and assumed also in the measurement itself.
After the training and validation phases, we test the NN with a set of input data sets constructed with random values of Ω_m, that is, values different from the ones considered in the training and validation sets. The structure of the NN, the characteristics of the different sets of data it has been fed with, and how the training process has been carried out are described in Section 4.
Finally, once the NN has proven to perform correctly on a test set of 2PCF measurements from the BOSS log-normal mock catalogues, we exploit it on the real 2PCF of the BOSS catalogue.
To measure the latter sample statistic, a cosmological model has to be assumed, which leads to geometric distortions when the assumed cosmology differs from the true one. To test the impact of these distortions on our final outcomes, we repeat the analysis, measuring the BOSS 2PCF assuming different values of Ω_m. We find that our NN provides statistically consistent results independently of the assumed cosmology. The results of this analysis are described in detail in Section 5.
We provide below a summary of the main steps of the data analysis method investigated in this work:
1. Assume a cosmological model. In the current analysis, we assume a value of Ω_m, and fix all the other main parameters of the ΛCDM cosmological model to the Planck values. A more extended analysis is planned for a forthcoming work.
2. Measure the 2PCF of the real catalogue to be analysed and use it to estimate the large-scale galaxy bias. This is required to construct mock galaxy catalogues with the same clustering amplitude of the real data. Here we estimate the bias from the redshift-space monopole of the 2PCF at large scales (see Section 3.3).
3. Construct a sufficiently large set of mock catalogues. In this work we consider log-normal mock catalogues, that can be obtained with fast enough algorithms. The measured 2PCFs of these catalogues will be used for the training, the validation and the test of the NN.
4. Repeat all the above steps assuming different cosmological models. Different cosmological models, characterised by different values of Ω m , are assumed to create different classes of mock catalogues and measure the 2PCF of BOSS galaxies.

5. Train and validate the NN.
6. Test the NN. This has to be done with data sets constructed considering cosmological models not used for the training and validation, to check whether the model can make reliable predictions also on previously unseen examples.
7. Exploit the trained NN on real 2PCF measurements. This is done by feeding the trained machine learning model with the several measurements of the 2PCF of the real catalogue, obtained assuming different cosmological models. The reason for this is to check whether the output of the NN is affected by geometric distortions.
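The steps above can be sketched as a single driver loop. All function names below are hypothetical placeholders returning random stand-in data, not the actual pipeline code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for the pipeline stages described above;
# each returns random stand-in data instead of real measurements.
def measure_2pcf(omega_m):
    # Step 2: measure the 2PCF of the real catalogue assuming omega_m.
    return rng.normal(size=30)

def estimate_bias(xi):
    # Step 2: fit the large-scale galaxy bias from the measured 2PCF.
    return 2.0

def make_lognormal_mocks(omega_m, bias, n_mocks=50):
    # Step 3: construct log-normal mocks and return their measured 2PCFs.
    return [rng.normal(size=30) for _ in range(n_mocks)]

# Step 4: repeat for a grid of cosmological models (here 29 values of Omega_m).
training_set = []
for omega_m in np.linspace(0.24, 0.38, 29):
    xi_data = measure_2pcf(omega_m)
    bias = estimate_bias(xi_data)
    for xi_mock in make_lognormal_mocks(omega_m, bias):
        training_set.append((xi_mock, omega_m))  # labelled training example

# The labelled set then feeds steps 5-6 (training, validation and test).
print(len(training_set))
```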

The BOSS data set
The mock catalogues used for the training, validation and test of our NN are constructed to reproduce the clustering of the BOSS galaxies. BOSS is part of the Sloan Digital Sky Survey (SDSS), which is an imaging and spectroscopic redshift survey that used a 2.5m modified Ritchey-Chrétien altitude-azimuth optical telescope located at the Apache Point Observatory in New Mexico (Gunn et al., 2006). The data we have worked on are from the Data Release 12 (DR12), that is, the final data release of the third phase of the survey (SDSS-III) (Alam et al., 2015). The observations were performed from fall 2009 to spring 2014, with a 1000-fiber spectrograph at a resolution R ≈ 2000. The wavelength range goes from 360 nm to 1000 nm, and the coverage of the survey is 9329 square degrees.
The catalogue from BOSS DR12, used in this work, contains the positions in observed coordinates (RA, Dec and redshift) of 1 198 004 galaxies up to redshift z = 0.8.
Both the data and the random catalogues are created considering the survey footprint, veto masks and systematics of the survey, as e.g. fiber collisions and redshift failures. The method used to construct the random catalogue from the BOSS spectroscopic observations is detailed in Reid et al. (2016).

Two-point correlation function estimation
We estimate the 2PCF monopole of the BOSS and mock galaxy catalogues with the Landy and Szalay (1993) estimator:

ξ(s) = [ DD(s)/N_DD − 2 DR(s)/N_DR + RR(s)/N_RR ] / [ RR(s)/N_RR ] ,

where DD(s), RR(s) and DR(s) are the number of galaxy-galaxy, random-random and galaxy-random pairs, within given comoving separation bins (i.e. in s ± ∆s), respectively, while

N_DD = N_D(N_D − 1)/2 ,   N_RR = N_R(N_R − 1)/2 ,   N_DR = N_D N_R

are the corresponding total numbers of galaxy-galaxy, random-random and galaxy-random pairs, N_D and N_R being the numbers of objects in the real and random catalogues. The random catalogue is constructed with the same angular and redshift selection functions as the BOSS catalogue, but with ten times as many objects as the observed one, to minimise the impact of Poisson errors in random pair counts.
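As an illustration, the estimator can be written compactly as a function of the raw pair counts. This is a minimal numpy sketch (the function name is ours), not the pair-counting code actually used in the analysis:

```python
import numpy as np

def landy_szalay(dd, dr, rr, n_d, n_r):
    """Landy & Szalay (1993) 2PCF estimator from raw pair counts per bin.

    dd, dr, rr : galaxy-galaxy, galaxy-random and random-random pair counts
                 in each separation bin.
    n_d, n_r   : numbers of objects in the data and random catalogues.
    """
    n_dd = n_d * (n_d - 1) / 2.0   # total number of distinct data pairs
    n_rr = n_r * (n_r - 1) / 2.0   # total number of distinct random pairs
    n_dr = n_d * n_r               # total number of data-random pairs
    dd_n = np.asarray(dd) / n_dd   # normalised pair counts
    dr_n = np.asarray(dr) / n_dr
    rr_n = np.asarray(rr) / n_rr
    return (dd_n - 2.0 * dr_n + rr_n) / rr_n
```

When the normalised data-data, data-random and random-random counts coincide in a bin, the estimator correctly returns ξ = 0 (no clustering excess over random).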

Galaxy bias
As introduced in the previous Sections, we train our NN to learn the mapping between the 2PCF monopole shape, ξ, and the matter density parameter, Ω_m. Thus, we construct log-normal mock catalogues assuming different values of Ω_m. A detailed description of the algorithm used to construct these log-normal mocks is provided in Section 3.4.
Firstly, we need to estimate the bias of the objects in the sample, b. We consider a linear bias model, whose bias value is estimated from the data. Specifically, when a new set of mock catalogues (characterised by Ω_m = Ω_m,i) is constructed, a new linear galaxy bias has to be estimated. The galaxy bias is assessed by modelling the 2PCF of BOSS galaxies in the scale range 30 < s [h⁻¹ Mpc] < 50, where s is used here, instead of r, to indicate separations in redshift space. We consider a Gaussian likelihood function, assuming a Poissonian covariance matrix, which is sufficiently accurate for the purposes of this work. We model the shape of the redshift-space 2PCF at large scales as follows (Kaiser, 1987):

ξ(s) = [ (bσ_8)² + (2/3)(bσ_8)(fσ_8) + (1/5)(fσ_8)² ] ξ_m(r) / σ_8² ,

where the matter 2PCF, ξ_m(r), is obtained by Fourier transforming the matter power spectrum modelled with the Code for Anisotropies in the Microwave Background (CAMB) (Lewis et al., 2000). The product of the linear growth rate and the matter power spectrum normalisation parameter, fσ_8, is set by the cosmological model assumed in the measurement of the 2PCF and in the construction of the theoretical model. The values of all redshift-dependent parameters have been calculated at the mean redshift of the data catalogue, z = 0.481. We assume a uniform prior on bσ_8 between 0 and 2.7. The posterior has been sampled with a Markov Chain Monte Carlo algorithm with 10 000 steps and a burn-in period of 100 steps. The linear bias, b, is then derived by dividing the posterior median by σ_8. The latter is estimated from the matter power spectrum computed assuming the Planck cosmology. We note that this method implies that only the shape of the 2PCF is actually used to constrain Ω_m.
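A minimal sketch of this large-scale Kaiser model, with bσ_8 as the free parameter and fσ_8 and σ_8 fixed by the assumed cosmology (the function name is ours):

```python
import numpy as np

def kaiser_monopole(xi_m, b_s8, f_s8, sigma8):
    """Redshift-space 2PCF monopole in the Kaiser (1987) limit.

    xi_m   : real-space matter 2PCF, xi_m(r), evaluated in the fitted bins
    b_s8   : b * sigma8, the free parameter of the bias fit
    f_s8   : f * sigma8, fixed by the assumed cosmology
    sigma8 : matter power spectrum normalisation of the assumed cosmology
    """
    amp = (b_s8**2 + (2.0 / 3.0) * b_s8 * f_s8 + (1.0 / 5.0) * f_s8**2) / sigma8**2
    return amp * np.asarray(xi_m)
```

In the limit fσ_8 → 0 the amplitude reduces to b², i.e. the familiar real-space linear-bias scaling ξ(s) = b² ξ_m(r).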

Mock galaxy catalogues
As described in Section 2, the data sets used for the training, validation and test of the NN implemented in this work consist of 2PCF mock measurements estimated from log-normal galaxy catalogues. Log-normal mock catalogues are generally used to create the data set necessary for the covariance matrix estimate, in particular in anisotropic clustering analyses (see e.g. Lippich et al., 2019). In fact, this technique allows us to construct density fields and, thus, galaxy catalogues with the required characteristics in an extremely fast way, especially if compared to N-body simulations, though the latter are more reliable at higher-order statistics and in the fully nonlinear regime.
To construct the log-normal mock catalogues, we adopted the same strategy followed by Beutler et al. (2011). We provide here a brief explanation of the method, while a more detailed description is given in Appendix A. As already commented in Section 1, log-normal catalogues do not provide the optimal training sets for this machine learning method, but are used here only to test the algorithms with input data with the desired format.
The algorithm used in this work to generate these mock catalogues for the machine learning training process takes as input one data catalogue and one corresponding random catalogue. These are used to define a grid with the same geometric structure of the data catalogue and a visibility mask function which is constructed from the pixel density of the random catalogue. A cosmological model has to be assumed to compute the object comoving distances from the observed redshifts and to model the matter power spectrum. The latter is used to estimate the galaxy power spectrum, which is the logarithm of the variance of the log-normal random field.
The density field is then sampled from this random field. Specifically, once the algorithm has associated their density values to all the grid cells, we can extract from each cell a number of points that depends on the density of the catalogue and on the visibility function. These points represent the galaxies of the output mock catalogue.
Once the mocks are created, the same cosmological model is assumed also to measure the 2PCF. We repeated the same process for all the test values of Ω m . That is, for each Ω m we created a set of log-normal mock catalogues and measured the 2PCF. This is different with respect to standard analyses, where a cosmology is assumed only once for the clustering measurements, and geometric distortions are included in the likelihood.
As an illustrative example, Figure 1 shows different measurements of the 2PCF obtained in the scale range 8 < s [h⁻¹ Mpc] < 50, in 30 logarithmic bins of s, assuming the lowest and the highest values of Ω_m considered in this work, that is, Ω_m = 0.24 and Ω_m = 0.38. The black and red dots show the measurements obtained from the corresponding two sets of 50 log-normal mock galaxy catalogues, while the dashed lines are the theoretical 2PCF models assumed to construct them. As expected, the average 2PCFs of the two sets of log-normal catalogues are fully consistent with the corresponding theoretical predictions. Indeed, a mock log-normal catalogue characterised by Ω_m = Ω_m,i provides a training example for the NN describing how galaxies would be distributed if Ω_m,i were the true value. Finally, the blue and green squares show the 2PCFs of the real BOSS catalogue obtained by assuming Ω_m = 0.24 and Ω_m = 0.38, respectively, when converting galaxy redshifts into comoving coordinates. The differences between the latter two measurements are caused by geometric distortions. As can be seen, neither of these two data sets is consistent with the corresponding 2PCF theoretical model, that is, both Ω_m = 0.24 and Ω_m = 0.38 appear to be bad guesses for the real value of Ω_m. As we will show in Section 5, the NN presented in this work is not significantly affected by geometric distortions.

Architecture
Regression machine learning models take as input a set of variables and provide as output a continuous value. In our case, the input data set is a 2PCF measurement, while the output is the predicted Gaussian probability distribution of Ω_m. Specifically, the regression model we are about to describe has been trained with 2PCF measurements in the scale range 8−50 h⁻¹ Mpc, in 30 logarithmic scale bins.
The architecture of the implemented NN is schematically represented in Figure 2. It consists of the following layers:
• Input layer, which feeds the 30 values of the measured 2PCF to the model;
• Dense layer² with 2 units. Every unit of this layer performs the following linear operation on the input elements:

y = Σ_i w_i x_i + b ,

where y is the output of the unit, x_i is the i-th element of the input, w_i is its corresponding weight, and b is the intercept of the linear operation. No nonlinear activation function has been used in this layer. This choice has proven convenient a posteriori, providing excellent performance of the NN during the test phase (see Section 4.3). We thus decided not to introduce any activation function, to keep the architecture of the regression model as simple as possible;
• Distribution layer, which uses the two outputs of the dense layer as mean and standard deviation to parameterise a Gaussian distribution, which is given as output.
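The forward pass of this architecture can be sketched in a few lines of numpy. The weights below are random placeholders for the learned ones, and the exponential used to keep the standard deviation positive is our assumption, as the text does not specify how positivity is enforced:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder weights: in the real model these are learned during training.
W = rng.normal(scale=0.1, size=(30, 2))  # 30 2PCF bins -> 2 dense units
b = np.zeros(2)                          # intercepts of the linear units

def forward(xi):
    """Map a 30-bin 2PCF measurement to (mu, sigma) of the output Gaussian."""
    y = np.asarray(xi) @ W + b   # dense layer, no activation function
    mu = y[0]                    # first unit: mean of the distribution
    sigma = np.exp(y[1])         # second unit: std dev (exp enforces sigma > 0;
                                 # this positivity choice is an assumption)
    return mu, sigma
```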
This architecture has been chosen because it is the simplest one we tested which is able to provide accurate cosmological constraints. Deeper models have been tried out, but no significant differences were spotted in the output predictions.

Training and validation
The training and validation sets are constructed separately, and consist of 2 000 and 800 examples, respectively. Regression machine learning models work better if the range of possible outputs is well represented in the training and validation sets. We construct mock catalogues with 40 different values of Ω_m: 29 from Ω_m = 0.24 to Ω_m = 0.38 with ∆Ω_m = 0.005, and 11 from Ω_m = 0.2825 to Ω_m = 0.3325, also separated by ∆Ω_m = 0.005. The latter are added to improve the density of inputs in the region where the predictions proved more likely during the first attempts of the NN training. The training and validation sets consist of 50 and 20 2PCF measurements for each value of Ω_m, respectively. All the mock catalogues used for the training and validation have the same dimension as the BOSS 2PCF measurements.

² A layer is called dense if all its units are connected to all the units of the previous layer.
The loss function used for the training process is the negative log-likelihood of the labels under the predicted Gaussian distributions:

L = −(1/N) Σ_{i=1}^{N} ln [ (1/(σ_i √(2π))) exp( −(l_i − µ_i)² / (2σ_i²) ) ] ,

where N is the number of examples, l_i is the label corresponding to the i-th 2PCF used for the training or the validation (i.e. the true value of Ω_m that has been used to create the i-th mock catalogue and to measure its 2PCF), while σ_i and µ_i are the standard deviation and the mean of the Normal probability distribution function that is given as output for the same i-th 2PCF example.
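A direct numpy transcription of this mean negative log-likelihood might read:

```python
import numpy as np

def gaussian_nll(labels, mu, sigma):
    """Mean negative log-likelihood of the labels under the predicted Gaussians."""
    labels, mu, sigma = map(np.asarray, (labels, mu, sigma))
    return np.mean((labels - mu) ** 2 / (2.0 * sigma**2)
                   + np.log(sigma * np.sqrt(2.0 * np.pi)))
```

Minimising this quantity simultaneously pulls each µ_i towards its label and shrinks σ_i, as discussed below for the learning-rate schedule.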
During the training process, we apply the Adam optimisation (see Kingma and Ba, 2014, for a detailed description of its functioning and parameters) in the following three steps:
• 750 epochs with η = 0.002,
• 150 epochs with η = 0.001,
• 100 epochs with η = 0.0005,
where η indicates the learning rate. During the three steps, the values of the other two parameters of this optimisation algorithm are kept fixed to β_1 = 0.9 and β_2 = 0.999, and the training set is not divided into batches. Variations in the optimisation parameters did not lead to significantly different outputs.
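The schedule amounts to a piecewise-constant learning rate over the 1 000 training epochs, e.g.:

```python
def learning_rate(epoch):
    """Piecewise-constant Adam learning-rate schedule used for the training."""
    if epoch < 750:
        return 0.002    # first step: 750 epochs
    if epoch < 900:
        return 0.001    # second step: 150 epochs (750 + 150 = 900)
    return 0.0005       # final step: 100 epochs (up to epoch 1000)
```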
Gradually reducing the learning rate during the training helps the model find the correct global minimum of the loss function (Ntampaka et al., 2019). The first epochs, having a higher learning rate, lead the model towards the area surrounding the global minimum, while the last ones, having a lower learning rate and therefore taking smaller and more precise steps in the parameter space, have the task of leading the model towards the bottom of that minimum. In this model, minimising the loss function means both driving the mean values of the output Gaussian distributions towards the values of the labels (i.e. the values of Ω_m corresponding to the inputs), and reducing as much as possible the standard deviation of those distributions.
During the training process, the interval of the labels, 0.24 ≤ Ω_m ≤ 0.38, has been mapped into [0, 1] through the following linear operation:

L = (l − 0.24) / 0.14 ,

where l is the original label and L is its counterpart in the [0, 1] interval. Once the mean, µ_L, and the standard deviation, σ_L, of the predictions are obtained in output, we convert them back to the original interval of the labels:

µ = 0.14 µ_L + 0.24 .

The standard deviation is then computed as follows:

σ = 0.14 σ_L ,

where σ is the standard deviation of µ, so that the uncertainty on µ is given by 0.14 σ_L.
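The mapping and its inverse are simple affine transformations; as a sketch (function names are ours):

```python
def scale_label(l):
    """Map a label from the original interval [0.24, 0.38] to [0, 1]."""
    return (l - 0.24) / 0.14

def unscale(mu_L, sigma_L):
    """Convert a (mu_L, sigma_L) prediction back to the Omega_m interval.

    The mean is shifted and rescaled; the standard deviation only rescales,
    since the additive offset does not affect the spread.
    """
    return 0.14 * mu_L + 0.24, 0.14 * sigma_L
```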

Test
The test set consists of 5 different 2PCF measurements obtained from log-normal mock catalogues for each of 16 different randomly generated values of Ω_m, for a total of 80 measurements. The Ω_m predictions and uncertainties are estimated as the mean and standard deviation of the Gaussian distributions obtained in output (Matthies, 2007; Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017; Russell and Reale, 2019). Our model is thus able to associate to every point of the input space an uncertainty on the output that depends on the intrinsic scatter of the 2PCF measurements. Figure 3 shows the predictions on the test set, compared to the true values of Ω_m the mocks in the test were constructed with. As can be seen, the implemented NN is able to provide reliable predictions when fed with measurements from mock catalogues characterised by Ω_m values that were not used during the training and validation phases, as confirmed by fitting this data set with a linear model of the form Ω_m^pred = α Ω_m^true + β, analogous to the one used for the real data in Section 5.

Application to BOSS data
After training, validating and testing the NN, we can finally apply it to the real 2PCF of the BOSS galaxy catalogue. To test the impact of geometric distortions caused by the assumption of a particular cosmology during the measurement, we apply the NN to 40 measurements of the 2PCF of BOSS galaxies obtained with different assumptions on the value of Ω_m when converting from observed to comoving coordinates. Figure 4 shows the results of this analysis, compared with the Ω_m constraints provided by Alam et al. (2017). The predictions of the NN have been fitted with a linear model:

Ω_m^pred = α Ω_m^ass + β ,

where Ω_m^pred is the prediction of the regression model, while Ω_m^ass is the value assumed to measure the 2PCF. The best-fit values of the parameters we get are α = −0.001 ± 0.025 and β = 0.309 ± 0.008. In particular, the slope is consistent with zero, that is, all the Ω_m predictions are consistent, within the uncertainties, independently of the value assumed in the measurement. This demonstrates that the NN is indeed able to make robust predictions on observed data, without being biased by geometric distortions.
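The fit itself is an ordinary least-squares line through the 40 (Ω_m^ass, Ω_m^pred) pairs. The sketch below reproduces it on synthetic stand-in predictions scattering around a constant, mimicking the distortion-free behaviour described above (the real NN outputs are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 40 NN predictions: geometric-distortion-free
# behaviour means the predictions scatter around a constant level.
omega_assumed = np.linspace(0.24, 0.38, 40)
omega_pred = 0.309 + rng.normal(scale=0.002, size=40)

# Least-squares linear fit: slope alpha and intercept beta.
alpha, beta = np.polyfit(omega_assumed, omega_pred, 1)
print(alpha, beta)
```

A slope consistent with zero recovers the constant level in the intercept, which is then quoted as the Ω_m constraint.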
Our final Ω_m constraint is thus estimated from the best-fit normalisation, β, that is

Ω_m = 0.309 ± 0.008 .

This result is consistent with the one obtained by Alam et al. (2017), which is Ω_m = 0.311 ± 0.006.

Conclusions
In this work we investigated a supervised machine learning data analysis method aimed at constraining Ω_m from observed galaxy catalogues. Specifically, we implemented a regression NN that has been trained and validated with mock measurements of the 2PCF of BOSS galaxies. Such measurements are used as a convenient summary statistic of the large-scale structure of the Universe. The goal of this work was to infer cosmological constraints without relying on any analytic 2PCF model to construct the likelihood function.
To train and validate our NN, we use 2 800 2PCF examples, constructed with 40 different values of Ω m , in 0.24 ≤ Ω m ≤ 0.38. The trained NN has been finally applied to the real 2PCF monopole of the BOSS galaxy catalogue. We get Ω m = 0.309 ± 0.008, which is in good agreement with the value found by Alam et al. (2017).
This work confirms that NNs can be powerful tools also for cosmological inference analyses, complementing the more standard analyses that make use of analytical likelihoods. One obvious improvement of the presented work would be to consider more accurate mock catalogues than the log-normal ones for the training, validation and test phases. In particular, N-body or hydrodynamic simulations would be required to exploit higher-order statistics, such as the three-point correlation function, as features to feed the model with.
The higher the number of reliable features, the more accurate the predictions of the NN will be. Furthermore, multi-labelled regression models can be used to make predictions on multiple cosmological parameters at the same time. To do that, however, bigger data sets are required, in order to have a proper mapping of the input space, characterised by the different values each label can have. The analysis presented in this work should also be extended to larger scales, though in this case more reliable mock catalogues are required in order not to introduce biases in the training, in particular at the Baryon Acoustic Oscillation scales. Finally, an analysis similar to the one performed in this work could be done using the density map of the catalogue, or directly the observed coordinates of the galaxies. This approach would be less affected by the adopted data compression. On the other hand, it would have a significantly larger computational cost for the training, validation and test phases, which should be estimated with a dedicated feasibility study. All the possible improvements described above will be investigated in forthcoming papers.
The PDF of the primordial matter density contrast at a specific position, δ(x), can be approximated as a Gaussian distribution, with null mean and the correlation function as variance. The same definitions can be used in Fourier space as well; in this case the variance is the Fourier transform of the 2PCF, that is, the power spectrum, P(k). In more general cases, the Gaussian random field is only an approximation of the real random field, which may present features such as significant skewness and heavy tails (Xavier et al., 2016). Coles and Barrow (1987) showed how to construct non-Gaussian fields through nonlinear transformations of a Gaussian field. One example is the log-normal random field (Coles and Jones, 1991), which can be obtained through the following transformation of a Gaussian field N(x):

δ(x) = exp[N(x)] − 1 .

The log-normal transformation results in the following one-point PDF:

P(δ) = 1 / [(1 + δ) √(2πσ²)] exp{ −[ln(1 + δ) − µ]² / (2σ²) } ,   for δ > −1 ,

where µ and σ² are the mean and the variance of the underlying Gaussian field N, respectively. The multivariate version for the log-normal random field is defined as follows:

P(δ_1, …, δ_n) = 1 / [(2π)^{n/2} |M|^{1/2} Π_i (1 + δ_i)] exp{ −(1/2) Σ_{ij} [ln(1 + δ_i) − µ_i] (M⁻¹)_{ij} [ln(1 + δ_j) − µ_j] } ,

where M is the covariance matrix of the N-values.
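The defining property of the transformation, namely that ln(1 + δ) recovers the underlying Gaussian field, can be checked numerically on samples (the values of µ and σ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = -0.5, 1.0
n_field = rng.normal(mu, sigma, size=200_000)  # underlying Gaussian field N
delta = np.exp(n_field) - 1.0                   # log-normal density contrast

# delta is bounded below by -1, as a density contrast must be...
assert delta.min() > -1.0

# ...and ln(1 + delta) recovers the Gaussian moments of N.
print(np.mean(np.log1p(delta)), np.std(np.log1p(delta)))
```

This bounded-below, positively skewed behaviour is what makes the log-normal field a better description of the evolved density contrast than a plain Gaussian.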
To construct the log-normal mock catalogues used to train, validate and test our NN, we start from a power spectrum template, assuming a value for the bias. After creating a grid according to this power spectrum, a Fourier transform is performed to obtain the 2PCF, which is used to calculate the function log[1 + ξ(r)]. We then transform back to Fourier space, obtaining a modified power spectrum, P_ln(k), and assign to each point of the grid a value of the Fourier amplitude, δ(k), sampled from a Normal distribution that has P_ln(k) as standard deviation. A new Fourier transform is performed to obtain a density field, δ(x), from which we sample the galaxy distribution. In our calculations we used a grid size of 10 h⁻¹ Mpc, while the minimum and maximum values of k are 0.0001 h Mpc⁻¹ and 100 h Mpc⁻¹, respectively.