Inpainting hydrodynamical maps with deep learning

From 1,000 hydrodynamic simulations of the CAMELS project, each with a different value of the cosmological and astrophysical parameters, we generate 15,000 gas temperature maps. We use a state-of-the-art deep convolutional neural network to recover missing data from those maps. We mimic the missing data by applying regular and irregular binary masks that cover either $15\%$ or $30\%$ of the area of each map. We quantify the reliability of our results using two summary statistics: 1) the distance between the probability density functions (pdf), estimated using the Kolmogorov-Smirnov (KS) test, and 2) the 2D power spectrum. We find excellent agreement between the model prediction and the unmasked maps when using the power spectrum: better than $1\%$ for $k<20\,h/$Mpc for any irregular mask. For regular masks, we observe a systematic offset of $\sim5\%$ when covering $15\%$ of the maps, while the results become unreliable when $30\%$ of the data is missing. The observed KS-test p-values favor the null hypothesis that the reconstructed and the ground-truth maps are drawn from the same underlying distribution when irregular masks are used. For regular-shaped masks, on the other hand, we find strong evidence that the two distributions do not match each other. Finally, we use the model, trained on gas temperature maps, to perform inpainting on maps from completely different fields such as gas mass, gas pressure, and electron density, and also on gas temperature maps from simulations run with other codes. We find that, visually, our model is able to reconstruct the missing pixels from the maps of those fields with great accuracy, although its performance as measured by summary statistics depends strongly on the considered field.


INTRODUCTION
Cosmology is in a transformative stage. Nowadays, we know the value of the main cosmological parameters with a relatively high precision. This has allowed us to claim, with high confidence, the existence of a substance that is responsible for the accelerated expansion of the Universe: dark energy. The nature and properties of dark energy remain the biggest mysteries in modern physics. In order to shed light on these and other questions, cosmologists rely on the data collected by cosmological surveys. The traditional method used to transform the data from cosmological surveys into constraints is the following: 1) the data is compressed into a lower dimension summary statistic, 2) theoretical predictions for that summary statistic are provided as a function of the value of the cosmological parameters, and 3) a likelihood function is evaluated to find the parameter constraints. Currently, there is a large debate on what summary statistics should be employed to extract the maximum information from these surveys (e.g. Villaescusa-Navarro et al. 2020; Samushia et al. 2021; Gualdi et al. 2021a; Kuruvilla & Aghanim 2021; Bayer et al. 2021; Banerjee et al. 2020; Hahn et al. 2020; Uhlemann et al. 2020; Friedrich et al. 2020; Massara et al. 2021; Dai et al. 2020; Allys et al. 2020; Banerjee & Abel 2021a,b; Gualdi et al. 2021b; Giri & Smith 2020; de la Bella et al. 2020; Hahn & Villaescusa-Navarro 2021; Valogiannis & Dvorkin 2021).
Unfortunately, the data from the cosmic surveys is affected by numerous issues, such as instrument noise. Among these problems, there are some effects that can induce spatial discontinuities in the data. For instance, in the case of galaxy redshift surveys, the presence of stars, fibre collisions, and bad observations will create masks in the survey geometry (Ross et al. 2012; de la Torre et al. 2013; Bianchi & Verde 2020; Mohammad et al. 2020). Another example is when such masks are created to avoid contamination by systematic effects; e.g. Cosmic Microwave Background (CMB) and 21cm observations may be masked near the galactic plane to avoid the bright foregrounds.
In general, the complicated geometry induced by these masked regions represents a challenge for both the theoretical predictions and the computation of the (optimal) summary statistic. This problem may also get worse when working at the field level with machine learning methods, as one needs to make sure that no information from the mask itself is used by the network.
One potential solution to this problem would be to reconstruct the missing data within the masked region. In most cases this is, however, a very difficult task, as the clustering properties of the considered field (e.g. galaxy redshift surveys or 21cm surveys) are not well understood theoretically (see the discussion about summary statistics above). On the other hand, the statistical properties of the considered field can be learned by neural networks and used to reconstruct the masked region. This idea has been developed in the machine learning community to inpaint the missing pixels of images (Pathak et al. 2016; Yang et al. 2016; Demir & Unal 2018; Yan et al. 2018; Yu et al. 2018; Liu et al. 2018; Nazeri et al. 2019; Yu et al. 2019; Zhu et al. 2021).
The use of image inpainting techniques based on deep learning has recently gained increasing interest in the cosmological community. Several works have successfully used deep convolutional neural networks to reconstruct missing data in 2D maps of the cosmic microwave background (Raghunathan et al. 2019; Yi et al. 2020; Vafaei Sadr & Farsian 2021; Montefalcone et al. 2021) and in the galactic foreground intensity and polarization maps (Puglisi & Bai 2020).
In this work we use these techniques to investigate whether we can reconstruct masked regions from 2D images generated from state-of-the-art magneto-hydrodynamic simulations. For this, we make use of data from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS, Villaescusa-Navarro et al. 2021) Multifield Dataset (CMD, Villaescusa-Navarro et al. 2021), a collection of hundreds of thousands of 2D maps and 3D grids containing 13 different fields from thousands of different cosmological and astrophysical models. To our knowledge, this is the first time that such a study has been carried out with data from state-of-the-art hydrodynamic simulations over a vast range of cosmological and astrophysical models.
This paper is organised as follows. In Section 2 we describe the data we use in this work. We outline the architecture and training procedure in Sec. 3. The main results of this work are shown in Sec. 4. Finally, we summarise and discuss the findings of this paper in Sec. 5.

DATA
In this work we make use of 2D maps from the CAMELS Multifield Dataset (CMD), a collection of hundreds of thousands of 2D maps showing different properties of the gas, dark matter, and stars at $z = 0$ from 2,000 state-of-the-art (magneto-)hydrodynamic simulations of the CAMELS project. All simulations follow the evolution of $256^3$ dark matter particles and $256^3$ fluid elements from $z = 127$ down to $z = 0$ in a periodic comoving volume of $(25\,h^{-1}\,{\rm Mpc})^3$. Half of the hydrodynamic simulations have been run with the AREPO code (Weinberger et al. 2019) and employ the same subgrid model as the IllustrisTNG simulations (Pillepich et al. 2018; Weinberger et al. 2017), while the other half have been run with the GIZMO code (Hopkins 2015) and utilize the subgrid model of the SIMBA simulation (Davé et al. 2019).
All simulations share the value of these cosmological parameters: baryon density $\Omega_{\rm b} = 0.049$, Hubble parameter $h = 0.67$, spectral index $n_s = 0.96$, sum of the neutrino masses $\sum m_\nu = 0$ eV, and the equation-of-state parameter for dark energy $w = -1$. On the other hand, each simulation has a different value of the total matter density parameter $\Omega_{\rm m}$ and of $\sigma_8$, the amplitude of the linear power spectrum on scales of $8\,h^{-1}$ Mpc. The simulations also differ in the value of four astrophysical parameters that characterize the efficiency of supernova and Active Galactic Nuclei (AGN) feedback. CMD contains maps for 13 different fields: 1) gas density, 2) gas velocity, 3) gas temperature, 4) gas pressure, 5) gas metallicity, 6) neutral hydrogen density, 7) electron number density, 8) magnetic fields, 9) magnesium-to-iron ratio, 10) dark matter density, 11) dark matter velocity, 12) stellar mass density, 13) total matter density. Each 2D map covers an area of $25\times25\,(h^{-1}\,{\rm Mpc})^2$, contains $256\times256$ pixels, and has a specific value of the cosmological and astrophysical parameters. For each field CMD provides 15,000 maps. We refer the reader to the CMD paper (Villaescusa-Navarro et al. 2021) for further details on CMD.
In this work we focus our attention on the gas temperature maps, which represent the mass-weighted temperature field of the gas particles in the different simulations.

TECHNIQUE
In this section we describe the method used to evaluate the performance of the inpainting model. We start by describing, in Sec. 3.1, the construction of the binary masks that we later apply to the CMD maps to mimic the missing data. In Sec. 3.2 we present the architecture of the deep convolutional neural network used to inpaint the masked regions in the data. In Sec. 3.3 we discuss the loss function used to train the neural network, while the training process is described in Sec. 3.4.

Binary Masks
We generate two types of masks: 1) regular masks that have either a rectangular or circular shape and cover a continuous portion of the field of view; and 2) irregular masks that consist of a set of segments of different width and length randomly placed over the field. For each of these two types, we build masks that cover different fractions of the total area. In particular, we use masks, both regular and irregular, that cover 15% and 30% of the total area. These are realistic numbers that one may encounter in galaxy redshift surveys. In particular, in the Dark Energy Survey (DES) photometric sample for cosmology (Sevilla-Noarbe et al. 2021) the masked regions amount to roughly 10% of the total survey area. In spectroscopic surveys such as the Baryon Oscillation Spectroscopic Survey (BOSS) LOWZ and CMASS samples (Dawson et al. 2013), ∼7% of the total area is lost due to the veto masks. In the extended Baryon Oscillation Spectroscopic Survey (eBOSS), ∼17% of the area covered by the Luminous Red Galaxy (LRG) and quasar (QSO) catalogues (Ross et al. 2020) was obscured by different types of veto masks. The choice for the sizes of regular masks is straightforward given the desired fraction of the area to be masked. Each irregular mask, on the other hand, is built by successively adding segments of randomly chosen width and length until the number of pixels they cover reaches the target fraction of the total area. In this paper we will refer to the pixels covered by the mask as the 'hole pixels' and to the un-masked pixels as 'valid pixels'.
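The mask construction described above can be illustrated with a minimal NumPy sketch. It is not the code used in this work: the function names `irregular_mask` and `square_mask`, the straight-segment geometry, and the ranges of segment width and length are all assumptions made for illustration. The sketch only reproduces the stopping rule (segments are added until the target fraction is reached) and the sizing of a square regular mask from the desired area fraction.

```python
import numpy as np

def irregular_mask(size=256, target_fraction=0.15, max_width=8,
                   max_length=60, rng=None):
    """Return a boolean array that is True on hole pixels.

    Straight segments are added one at a time until the covered fraction
    reaches target_fraction. Width and length ranges are illustrative
    assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    hole = np.zeros((size, size), dtype=bool)
    yy, xx = np.mgrid[0:size, 0:size]
    n_target = int(round(target_fraction * size * size))
    while hole.sum() < n_target:
        # Random start point, direction, length and half-width of one segment.
        x0, y0 = rng.uniform(0, size, 2)
        theta = rng.uniform(0, np.pi)
        length = rng.uniform(10, max_length)
        half_width = 0.5 * rng.uniform(1, max_width)
        dx, dy = np.cos(theta), np.sin(theta)
        # Pixel coordinates along and across the segment axis.
        along = (xx - x0) * dx + (yy - y0) * dy
        across = -(xx - x0) * dy + (yy - y0) * dx
        hole |= (along >= 0) & (along <= length) & (np.abs(across) <= half_width)
    return hole

def square_mask(size=256, target_fraction=0.15, rng=None):
    """Return a boolean array with one randomly placed square hole."""
    rng = np.random.default_rng() if rng is None else rng
    # The side follows directly from the requested area fraction.
    side = int(round(np.sqrt(target_fraction) * size))
    i0, j0 = rng.integers(0, size - side + 1, 2)
    hole = np.zeros((size, size), dtype=bool)
    hole[i0:i0 + side, j0:j0 + side] = True
    return hole
```

Because the last segment is added whole, the irregular mask in this sketch can slightly overshoot the target fraction; a production version might trim the final segment instead.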

Architecture
We use the network architecture presented in Zhu et al. (2021), based on the so-called 'Mask-Aware Dynamic Filtering' (MADF) module. This is a deep convolutional neural network consisting of three main stages: the encoder, the recovery decoder and the refinement decoder. The architecture is similar in nature to a U-shaped encoder-decoder network that encodes the semantic information from the valid pixels of the masked image into multi-level feature maps, which are later decoded into the low-level pixel values.
The encoder provides the high-level feature maps using the information from the input damaged image and the corresponding binary mask. In particular, rather than using fixed kernels, it uses the Mask-Aware Dynamic Filtering (MADF) module to dynamically generate kernels for each convolutional window based on the features of the corresponding position on the mask. The decoder step is further divided into two stages. The recovery decoder performs a rough filling of the holes in the feature maps and produces the first output. A set of refinement decoders is run in parallel to the recovery decoder to refine the decoded feature maps. Another distinct feature of this novel network architecture is the use of the so-called 'Point-wise Normalisation' (PN) in place of the typical 'Batch Normalisation' (BN) in the refinement decoding steps to avoid the 'covariate shift' problem arising from the difference between the statistical properties of the features of the hole and valid pixels.
We refer the reader to Zhu et al. (2021) for a detailed discussion of the advantages of this approach.
Although the architecture proposed in Zhu et al. (2021) is flexible in terms of the model complexity, tuning its hyperparameters would require many tests that are computationally expensive and time-consuming. We thus use the same setup proposed in Zhu et al. (2021), which yielded excellent results on the benchmark datasets typically used to assess the performance of image inpainting models. In particular, each of the encoder, recovery decoder and refinement decoders consists of 7 levels, with the kernel size and strides of each convolutional operation set empirically. Also, the number of refinement decoders is set to 2 as a compromise between model performance and efficiency.

Loss Function
We use the 'inpainting loss' adopted in Liu et al. (2018) and Zhu et al. (2021) as the optimisation objective. The total loss function consists of multiple terms that depend on the output of each decoder and are incrementally added. Different loss terms compare different properties of the predicted and the true maps (ground truth).
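The first-order comparison is performed using the so-called 'per-pixel reconstruction loss', which is split into two terms, one evaluated over the valid pixels ($L_{\rm valid}$) and one over the hole pixels ($L_{\rm hole}$),

$$L_{\rm valid} = \frac{1}{N_{I_{\rm gt}}} \left\| M \odot \left(I_{\rm out} - I_{\rm gt}\right) \right\|_1 , \qquad (1)$$

$$L_{\rm hole} = \frac{1}{N_{I_{\rm gt}}} \left\| (1 - M) \odot \left(I_{\rm out} - I_{\rm gt}\right) \right\|_1 . \qquad (2)$$

In Eqs. (1) and (2), $N_{I_{\rm gt}}$ indicates the number of elements in the ground truth map, $M$ is the binary mask (taken here, as in Liu et al. 2018, to be 1 on valid pixels and 0 on hole pixels), $I_{\rm out}$ is the model output, $I_{\rm gt}$ is the ground truth image and $\odot$ denotes the element-wise product. The perceptual loss $L_{\rm perc}$, introduced by Gatys et al. (2015), forces the network to output semantically meaningful predictions as encoded by the feature maps $\Psi_p$ extracted using the pool1, pool2 and pool3 layers of the VGG16 network pretrained on ImageNet (Simonyan & Zisserman 2014),

$$L_{\rm perc} = \sum_p \frac{\left\| \Psi_p^{I_{\rm out}} - \Psi_p^{I_{\rm gt}} \right\|_1}{N_{\Psi_p^{I_{\rm gt}}}} + \sum_p \frac{\left\| \Psi_p^{I_{\rm comp}} - \Psi_p^{I_{\rm gt}} \right\|_1}{N_{\Psi_p^{I_{\rm gt}}}} , \qquad (3)$$

where $N_{\Psi_p^{I_{\rm gt}}}$ denotes the number of elements in the feature map extracted from the VGG16 layer $p$ and $I_{\rm comp}$ results from the model output with the valid pixels set to their ground truth values.

The style loss $L_{\rm style}$ uses the same feature maps extracted from the VGG16 network as those used for $L_{\rm perc}$ but computes the L1 loss over their auto-correlation given by the Gram matrix,

$$L_{\rm style} = \sum_p \frac{1}{C_p C_p} \left\| K_p \left[ \left(\Psi_p^{I_{\rm out}}\right)^{\rm T} \Psi_p^{I_{\rm out}} - \left(\Psi_p^{I_{\rm gt}}\right)^{\rm T} \Psi_p^{I_{\rm gt}} \right] \right\|_1 + \sum_p \frac{1}{C_p C_p} \left\| K_p \left[ \left(\Psi_p^{I_{\rm comp}}\right)^{\rm T} \Psi_p^{I_{\rm comp}} - \left(\Psi_p^{I_{\rm gt}}\right)^{\rm T} \Psi_p^{I_{\rm gt}} \right] \right\|_1 , \qquad (4)$$

where $K_p = 1/(C_p H_p W_p)$ is the normalisation factor, with $(C_p H_p W_p)$ being the size of the feature vector extracted from layer $p$. The style loss $L_{\rm style}$ helps to constrain the texture of the predicted maps to match that of the ground truth. Finally, the total variation loss $L_{\rm tv}$ is used to enforce spatial smoothness in the output map,

$$L_{\rm tv} = \sum_{(i,j) \in R,\, (i,j+1) \in R} \frac{\left\| I_{\rm comp}^{i,j+1} - I_{\rm comp}^{i,j} \right\|_1}{N_{I_{\rm comp}}} + \sum_{(i,j) \in R,\, (i+1,j) \in R} \frac{\left\| I_{\rm comp}^{i+1,j} - I_{\rm comp}^{i,j} \right\|_1}{N_{I_{\rm comp}}} , \qquad (5)$$

where $R$ represents the 1-pixel dilation of the hole region and $N_{I_{\rm comp}}$ is the number of elements in $I_{\rm comp}$. The loss terms described above are weighted by the corresponding weights $\lambda$ and combined to provide the total loss function $L_{\rm tot}$,

$$L_{\rm tot} = \lambda_{\rm valid} L_{\rm valid} + \lambda_{\rm hole} L_{\rm hole} + \lambda_{\rm perc} L_{\rm perc} + \lambda_{\rm style} L_{\rm style} + \lambda_{\rm tv} L_{\rm tv} . \qquad (6)$$

The weights associated with each term in Eq. (6) are identical to those set by Liu et al. (2018), who found them by empirical calibration.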
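To make the structure of the loss more concrete, the following NumPy sketch evaluates the per-pixel and total-variation terms and combines them into a weighted total for a single map. It is an illustration under stated assumptions, not the training code: the mask is assumed to be 1 on valid pixels and 0 on hole pixels, the 1-pixel dilation is assumed to use 4-connectivity, and the perceptual and style terms are passed in as precomputed numbers because they require a pretrained VGG16 network. The default weights (1, 6, 0.05, 120, 0.1) are those reported by Liu et al. (2018); all function names are hypothetical.

```python
import numpy as np

def per_pixel_losses(i_out, i_gt, mask):
    """L1 losses on valid (mask == 1) and hole (mask == 0) pixels,
    both normalised by the total number of map elements."""
    n = i_gt.size
    diff = np.abs(i_out - i_gt)
    l_valid = np.sum(mask * diff) / n
    l_hole = np.sum((1 - mask) * diff) / n
    return l_valid, l_hole

def dilate_hole(mask):
    """1-pixel dilation of the hole region (assumed 4-connectivity)."""
    hole = mask == 0
    region = hole.copy()
    region[1:, :] |= hole[:-1, :]
    region[:-1, :] |= hole[1:, :]
    region[:, 1:] |= hole[:, :-1]
    region[:, :-1] |= hole[:, 1:]
    return region

def tv_loss(i_comp, mask):
    """Total variation of i_comp restricted to the dilated hole region."""
    region = dilate_hole(mask)
    n = i_comp.size
    # Neighbouring pixel pairs that both lie inside the region R.
    horiz = region[:, :-1] & region[:, 1:]
    vert = region[:-1, :] & region[1:, :]
    l_h = np.sum(np.abs(i_comp[:, 1:] - i_comp[:, :-1])[horiz])
    l_v = np.sum(np.abs(i_comp[1:, :] - i_comp[:-1, :])[vert])
    return (l_h + l_v) / n

def total_loss(i_out, i_gt, mask, l_perc=0.0, l_style=0.0,
               weights=(1.0, 6.0, 0.05, 120.0, 0.1)):
    """Weighted sum of the loss terms; l_perc and l_style are supplied
    externally since they need VGG16 feature maps."""
    # Composite map: model output in the holes, ground truth elsewhere.
    i_comp = mask * i_gt + (1 - mask) * i_out
    l_valid, l_hole = per_pixel_losses(i_out, i_gt, mask)
    l_tv = tv_loss(i_comp, mask)
    w_valid, w_hole, w_perc, w_style, w_tv = weights
    return (w_valid * l_valid + w_hole * l_hole + w_perc * l_perc
            + w_style * l_style + w_tv * l_tv)
```

In the actual network the total loss is accumulated over the outputs of the different decoders; the sketch covers a single output only.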

Training
In order to train the model we first split the 15,000 IllustrisTNG-based CMD gas temperature maps into the train, validation and test sets. We assign 10,000 maps to the train set, 2,000 to the validation set and 3,000 to the test set.
We train the network using 4 NVIDIA P100 GPUs for 130 epochs. Each epoch consists of multiple iterations, with a single iteration using a batch of 16 maps. At a given iteration each map is coupled with a randomly selected binary mask from a pool of 12,000 masks for data augmentation purposes. We apply the $\log_{10}$ transformation to the input maps to reduce the dynamic range of the temperature values and then normalise the training set to zero mean and unit variance using the mean $\mu_{\rm train}$ and standard deviation $\sigma_{\rm train}$ of the train set. The same parameters ($\mu_{\rm train}$, $\sigma_{\rm train}$) are then used to normalise the $\log_{10}$-transformed validation and test sets. We use the Adam optimizer and set the initial learning rate to 0.0002. We use PyTorch's ReduceLROnPlateau scheduler to implement the update policy that decays the learning rate by a factor of 10 if no decrease in the training loss $L_{\rm tot}$ is observed for 5 consecutive epochs. The training process is completed in approximately 24 hours. After each epoch the model is evaluated on the validation set to monitor any over-fitting to the training set.
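The normalisation step described above can be sketched as follows. This is a minimal illustration, assuming the mean and standard deviation are computed over all pixels of all log-transformed training maps; the function names are hypothetical and the sketch does not cover the optimizer or the learning-rate schedule.

```python
import numpy as np

def fit_normalisation(train_maps):
    """Mean and standard deviation of the log10-transformed training maps."""
    log_maps = np.log10(train_maps)
    return log_maps.mean(), log_maps.std()

def normalise(maps, mu_train, sigma_train):
    """Apply log10 and standardise with the training-set statistics."""
    return (np.log10(maps) - mu_train) / sigma_train

def denormalise(x, mu_train, sigma_train):
    """Invert normalise(), returning values in the original units."""
    return 10.0 ** (x * sigma_train + mu_train)
```

The same `(mu_train, sigma_train)` pair would be passed to `normalise` for the validation and test maps, so that no statistics from those sets enter the preprocessing.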

RESULTS
We evaluate the model predictions using the holdout test set of 3,000 gas temperature maps and binary masks. None of these maps and masks is exposed to the model during training in order to check how well the results generalise to new data. We first show a visual comparison of the ground truth and predicted maps in Section 4.1. We then quantitatively assess the reliability of the inpainted maps using the probability density function in Sec. 4.2 and the 2D power spectrum in Section 4.3. In Sec. 4.4 we also evaluate the performance of the model in recovering missing data in physical fields different to the one exposed during training.

Visual Comparison
In Fig. 1 we show four temperature maps from the CMD test set. Rather than showing the raw maps we plot the $\log_{10}$ of the temperature values to facilitate a visual inspection. From top to bottom, the first row shows the ground truth maps, the second row displays the output of the reconstruction, while the last row shows a pixel-by-pixel comparison between the ground truth and the predicted maps. Different columns show the results for different types (regular or irregular) and extent (fraction of the total area covered) of the binary masks. The left two columns contain results using irregular-shaped masks, for which a visual comparison between the ground truth and the network prediction can barely spot any difference. In the case of regular masks, in the right two columns of Fig. 1, there are some clear differences between the target and the predicted map even for the masks with the lower coverage (15%). This naturally arises from the fact that whole structures are wiped out in the masking process, and the inpainting model aims at recovering the correct style (or statistical properties) in the reconstructed map rather than matching pixel-by-pixel the output and the ground truth maps. This effect is much more pronounced for the regular masks that cover 30% of the total pixels. Indeed, large structures in the reference map are replaced by an ensemble of smaller structures. This result is not surprising, since the lost information cannot be retrieved from the valid pixels given the size of the mask relative to that of the cosmological structures it covers and the size of the whole map.
We also highlight the near perfect match between the model output and the ground truth maps for the unmasked pixels. This can be attributed to the 'skip connection' between the input map and the final stage of the recovery decoder (see Figure 4 in Zhu et al. (2021)).
Finally, the use of $L_{\rm tv}$ in the total loss $L_{\rm tot}$ ensures continuity and a smooth transition between the hole and valid pixels. Indeed, in none of the cases tested in this work do we find any artefact at the edges of the binary masks.

Probability Density Function (PDF)
In order to quantify how closely the predicted maps match the ground truth we compare the probability density functions of their temperature values. We use the p-value of the Kolmogorov-Smirnov test (KS test hereafter), which quantifies how compatible the pixel temperature values in the reconstructed and the ground truth maps are with being drawn from the same underlying distribution. In particular, for each map we estimate the p-value of the KS test by comparing the reconstructed and ground truth map in the masked region, and repeat the exercise for all 3,000 maps in the test set. Figure 2 shows the histograms of the corresponding 3,000 p-values for different choices of binary masks.
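As an illustration of this per-map test, the sketch below computes the two-sample KS statistic and an asymptotic p-value for the hole pixels of one map. It is not the implementation used in this work: the function names are hypothetical, the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction, and the boolean array `hole` is assumed to be True on masked pixels. The same quantities are available from `scipy.stats.ks_2samp`.

```python
import numpy as np

def ks_2samp_asymptotic(a, b):
    """Two-sample KS statistic and asymptotic p-value."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    n, m = a.size, b.size
    # Empirical CDFs of both samples evaluated at every pooled value.
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / n
    cdf_b = np.searchsorted(b, pooled, side="right") / m
    d = float(np.max(np.abs(cdf_a - cdf_b)))
    n_eff = n * m / (n + m)
    lam = (np.sqrt(n_eff) + 0.12 + 0.11 / np.sqrt(n_eff)) * d
    if lam < 0.2:
        # The Kolmogorov survival function is 1 to machine precision here,
        # and the alternating series below converges too slowly.
        return d, 1.0
    k = np.arange(1, 101)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * lam) ** 2))
    return d, float(np.clip(p, 0.0, 1.0))

def ks_in_hole(i_out, i_gt, hole):
    """KS test restricted to the hole pixels of a single map."""
    return ks_2samp_asymptotic(i_out[hole], i_gt[hole])
```

Applying `ks_in_hole` to each of the test maps and collecting the p-values would give the kind of histogram discussed in this section.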
Under the null hypothesis, i.e. the temperature values in the reconstructed and the ground-truth maps are drawn from the same underlying distribution, we expect a uniform distribution of the KS-test p-values. However, we notice that for irregular masks the distribution peaks at p-value = 1 with a near exponential drop at lower p-values, indicating an even stronger agreement than that expected between two samples randomly drawn from the same distribution. In order to observe a distribution such as those seen in the top panel of Fig. 2, the temperature values in a non-negligible fraction of the reconstructed pixels must match very closely their counterparts in the ground-truth map. On the other hand, the observed p-values in the case of regular masks in the lower panel of Fig. 2 provide strong evidence against the null hypothesis, especially when 30% of the pixels are masked. Although, for regular masks covering 15% of the data, some of the maps exhibit p-values larger than the threshold of 0.05, typically used to reject (lower p-values) or accept (larger p-values) the null hypothesis, these form only a relatively small fraction of the 3,000 test maps. In order to understand the trend observed in Fig. 2 we notice from Fig. 1 (last row) that there is a near perfect match in the temperature values between the reconstructed and the ground-truth map near the edges of the mask. This is not surprising given the use of the 'per-pixel' loss and the 'total-variation' loss to train the network, which together ensure continuity in the reconstructed map between the hole and valid pixels. On the other hand, the neural network struggles to provide accurate reconstruction in the innermost part of larger masked patches. For irregular masks, a smaller fraction of the area being masked results in a lower probability of different segments that form the mask being joined together to create a single large patch.
This increases the fraction of the hole pixels that are close to the mask boundaries, where the reconstructed temperature field closely matches its ground-truth values. This explains why the blue histogram (irregular masks covering 15% of the data) in the top panel of Fig. 2 is more skewed towards 1 than the red one (irregular masks erasing 30% of the data). For regular masks, along with the aforementioned cause, another effect that contributes to the bad performance seen in Fig. 2 (and later on in Fig. 3 and Fig. 4) is the unique nature of the structures being removed. In particular, the structures that are erased by the regular masks are unique, and the network is unable to retrieve the semantic features of the missing data from the valid pixels of the map. The latter effect is field dependent, and we expect a much better performance in terms of recovering both an accurate probability density function and the power spectrum for a field that is more homogeneous on the scales of the $(25\,{\rm Mpc}/h)^2$ maps, such as the temperature fluctuations seen in the CMB.
We also investigate whether the p-values of the KS test correlate with any of the 6 cosmological or astrophysical parameters used to run the simulations. The Pearson correlation coefficients reported in Table 1 show that there is no significant correlation between the KS-test p-values and any of the simulation parameters. This indicates that the model performance mainly depends on the properties and extent of the mask and not that much on the particular cosmological and astrophysical model employed.
Figure 3. Mean difference between the probability density functions of the min-max scaled and $\log_{10}$-transformed temperature maps, estimated from the model output and the corresponding ground truth map, averaged over 3,000 maps in the test set, in units of the standard error on the mean. Continuous lines show results when data are masked using irregular masks while dashed lines correspond to the cases when regular-shaped masks are employed. Blue and red lines correspond to masks covering 15% and 30% of the total area, respectively. Horizontal shaded bands delimit 1-σ and 2-σ intervals.

While the KS test quantifies the statistical differences between the probability density functions of the ground-truth and the predicted map, it does not indicate where these differences originate from. To investigate if the model's bad performance occurs in specific regimes of the temperature values we compare the corresponding probability density functions estimated from the ground-truth and the predicted maps. In particular, for each map in the test set, we apply the $\log_{10}$ transformation and the min-max scaling to both the ground-truth and the model output before estimating the probability density functions. We show the results in Fig. 3, where the y-axis shows the difference between the two distributions averaged over 3,000 maps from the test set, in units of the standard error on the mean, as a function of the min-max scaled logarithmic temperature. We note that the disagreement between the model prediction and the ground truth is i) stronger in the low pixel-value regime, improving in pixels with higher intensity of the field; ii) as expected, worse for regular masks compared to the irregular ones; and iii) higher for regular masks covering a larger extent of the total area.
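The comparison of probability density functions described above can be sketched as follows. This is a minimal illustration with assumptions that go beyond the text: each map is min-max scaled using its own extremes, the histograms use a fixed number of equal-width bins on [0, 1], and all pixel values are positive so that the logarithm is defined. The function names and the default of 20 bins are hypothetical.

```python
import numpy as np

def minmax_log_pdf(temp_map, bins):
    """Histogram-based PDF of the min-max scaled log10 map."""
    x = np.log10(temp_map)
    x = (x - x.min()) / (x.max() - x.min())
    pdf, _ = np.histogram(x, bins=bins, density=True)
    return pdf

def pdf_difference_in_sem(maps_out, maps_gt, n_bins=20):
    """Mean PDF difference (output minus ground truth) over all maps,
    in units of the standard error on the mean, per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    diffs = np.array([minmax_log_pdf(o, bins) - minmax_log_pdf(g, bins)
                      for o, g in zip(maps_out, maps_gt)])
    mean = diffs.mean(axis=0)
    sem = diffs.std(axis=0, ddof=1) / np.sqrt(len(diffs))
    # Bins where every map gives the same difference have sem == 0;
    # report 0 there instead of dividing by zero.
    return np.divide(mean, sem, out=np.zeros_like(mean), where=sem > 0)
```

Values outside the interval [-2, 2] in the returned array would correspond to bins where the output and ground-truth distributions differ by more than 2σ on average.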

Power Spectrum
Besides the probability density function, another widely used statistic in cosmology is the power spectrum, $P(k)$, defined in this case as

$$\langle F(\vec{k})\, F^*(\vec{k}') \rangle = (2\pi)^2\, P(k)\, \delta_D(\vec{k} - \vec{k}') ,$$

where $F(\vec{k})$ is the Fourier transform of the considered field $F(\vec{x})$, and $\delta_D$ is the Dirac delta. Note that the fields we consider are statistically homogeneous and isotropic, so the power spectrum only depends on the magnitude of the wavenumber, $k$. We use the publicly available Pylians3 library to compute the power spectra of the maps. In this section we use the power spectrum as a summary statistic to quantify the agreement between the reconstructed maps and their unmasked versions.
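For readers unfamiliar with this estimator, the sketch below measures an azimuthally averaged 2D power spectrum with plain NumPy FFTs. It is a stand-in for illustration and not the Pylians3 routine: the function name, the binning in annuli one fundamental mode wide, and the normalisation convention (squared Fourier amplitude divided by the map area) are assumptions, and no correction for the pixel window is applied.

```python
import numpy as np

def power_spectrum_2d(field, box_size=25.0):
    """Azimuthally averaged power spectrum of a square periodic map.

    box_size is the side of the map in Mpc/h; returns (k, P(k)) with k in
    h/Mpc, binned in annuli one fundamental mode wide up to Nyquist.
    """
    n = field.shape[0]
    # Discrete approximation of the continuous Fourier transform.
    cell_area = (box_size / n) ** 2
    fk = np.fft.fftn(field) * cell_area
    power = np.abs(fk) ** 2 / box_size ** 2
    # Magnitude of the wavevector on the FFT grid.
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky = np.meshgrid(k1d, k1d, indexing="ij")
    kmag = np.sqrt(kx ** 2 + ky ** 2)
    k_fund = 2.0 * np.pi / box_size
    k_nyq = np.pi * n / box_size
    edges = np.arange(0.5 * k_fund, k_nyq + k_fund, k_fund)
    n_bins = len(edges) - 1
    # digitize returns 0 below the first edge (this drops the k = 0 mode)
    # and len(edges) above the last edge; keep bins 1..n_bins only.
    idx = np.digitize(kmag.ravel(), edges)
    keep = slice(1, n_bins + 1)
    counts = np.bincount(idx, minlength=n_bins + 2)[keep]
    p_sum = np.bincount(idx, weights=power.ravel(), minlength=n_bins + 2)[keep]
    k_sum = np.bincount(idx, weights=kmag.ravel(), minlength=n_bins + 2)[keep]
    return k_sum / counts, p_sum / counts
```

For a 256×256 map of side 25 Mpc/h the Nyquist wavenumber in this convention is π·256/25 ≈ 32 h/Mpc. The ratio of the spectra returned for a reconstructed map and for its ground truth gives the kind of quantity compared in this section.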
The results are shown in Fig. 4 for the data from the test set, masked using irregular and regular masks. The top panels show the power spectra measured from the masked data (red thick and blue dashed lines), from the reconstructed maps (blue and red dots with corresponding statistical errors) as well as from the ground truth maps (black thick lines), averaged over the 3,000 maps from the test set. The error bars (on red and blue dots) and shaded bands (around the black thick lines) show the errors on the mean of the 3,000 estimates (i.e. the standard deviation divided by $\sqrt{3000}$). The differences between the power spectra from the reconstructed and the ground truth maps are barely visible in the top panels. We thus show the ratio between these two quantities in the bottom panels of Fig. 4. As in the top panels, shaded bands in the bottom panels of Fig. 4 correspond to the error on the mean of 3,000 estimates.
For irregular-shaped masks we find that the power spectra of the reconstructed maps agree very well with the reference ones. In particular, for masks covering 15% of the total area the power spectra from the reconstructed maps show a systematic bias with respect to the reference ones of less than $\sim 1\%$ up to a wavenumber of $k \sim 20\,h/$Mpc (blue dots with error bars in the top left panel, blue line with shaded band in the bottom left panel of Fig. 4). When extending the analysis to irregular masks that cover 30% of the input maps, the accuracy degrades only marginally for wavenumbers below $k \sim 20\,h/$Mpc (red points in the top left panel and red line in the bottom left panel). For larger wavenumbers (up to the Nyquist wavenumber of $k_{\rm Nyq} \sim 30\,h/$Mpc) the power spectra estimated from the reconstructed maps stay accurate within $\sim 5\%$ ($\sim 10\%$) for irregular masks covering 15% (30%) of the area.
For regular masks that cover a continuous area of the maps, the reconstruction is less accurate than that for the irregular-shaped masks. As already discussed in Sec. 4.2, this is due to the fact that regular-shaped masks erase entire structures in a single large patch. Furthermore, the learning process is also complicated by the fact that we have only a very limited number (10) of maps for each set of simulation parameters to train the model, far below the standard size of datasets used to train deep convolutional neural networks. Nevertheless, the network does an excellent job in reconstructing maps where the mask covers 15% of the total area, with the recovered power spectra matching the reference ones within $\sim 5\%$ up to the Nyquist wavenumber of $k_{\rm Nyq} \sim 30\,h/$Mpc. For regular masks that erase 30% of the data the agreement degrades drastically and becomes strongly scale-dependent.
This analysis, combined with the results in Sec. 4.2, shows that the neural network breaks down when large portions of the data are erased in a single patch while results are reliable for the measured power spectrum in other cases explored in this work. One natural way to improve the performance is to train the neural network either on a larger number of simulations for each set of cosmological and astrophysical parameters or on data over an area much larger than the homogeneity scale of the field.

Performance on auxiliary data
So far we have used the CMD IllustrisTNG-based gas temperature maps, split into the train, validation and test sets, to both train the model and test its performance on unseen data. In this section we use this model and try to reconstruct the missing data in a number of different fields that are not used during the model training. In particular, we test the performance of the model to recover missing data in maps from other fields such as: 1) SIMBA-based gas temperature maps ($T_{\rm SIMBA}$), 2) gas density ($M_{\rm gas}$), 3) total matter density ($M_{\rm tot}$), 4) gas pressure ($P$), 5) electron density ($n_e$), and 6) the magnesium-to-iron ratio (Mg/Fe). Here we limit the analysis to irregular masks that cover 15% of the total area.
Figure 4. Top panels: Power spectra measured from the ground truth gas temperature maps averaged over 3,000 maps in the test set (black thick line with shaded band), and from the input maps masked using irregular-shaped masks (left panels) or regular-shaped masks (right panels) that cover 15% (blue dashed line) and 30% (red thick line) of the total area. Results after the reconstruction is applied are shown as blue dots with error bars for masks covering 15% and as red dots with error bars for those covering 30% of the area. Bottom panels: ratio between the power spectra measured from the reconstructed maps, $P_{\rm out}$, and from the ground-truth maps, $P_{\rm ref}$, when masks cover 15% (blue line with shaded band) and 30% (red line with shaded band) of the area. All errors shown as shaded bands or error bars refer to the error on the mean of 3,000 estimates.

The scales of the pixel intensities in these auxiliary fields are significantly different from those of the gas temperature field used to train the model. In order to feed the neural network with pixel values that cover a range similar to that of the training set we first rescale each single map to the min-max range of the gas temperature maps in the training set and then normalise it using the ($\mu_{\rm train}$, $\sigma_{\rm train}$) values used in Sec. 3.4.
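A minimal sketch of this rescaling is given below. The text does not state whether the rescaling is applied before or after the logarithm; the sketch assumes it is done in $\log_{10}$ space, since $\mu_{\rm train}$ and $\sigma_{\rm train}$ refer to the log-transformed temperature maps. The function name and the arguments `log_t_min` and `log_t_max` (the extremes of the log-transformed training maps) are hypothetical.

```python
import numpy as np

def rescale_auxiliary(aux_map, log_t_min, log_t_max, mu_train, sigma_train):
    """Map one auxiliary-field map onto the log-temperature range of the
    training set, then standardise with the training statistics."""
    x = np.log10(aux_map)
    # Min-max scale this single map to [0, 1] ...
    unit = (x - x.min()) / (x.max() - x.min())
    # ... stretch it to the range spanned by the training maps ...
    rescaled = log_t_min + unit * (log_t_max - log_t_min)
    # ... and apply the same standardisation as for the temperature maps.
    return (rescaled - mu_train) / sigma_train
```

Under this assumption every auxiliary map spans the full training range after rescaling, regardless of its own dynamic range, which may contribute to the field-dependent performance discussed below.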
The visual comparison between the ground truth and the model output is shown in Fig. 5, where each row contains results from a different field. Except for the gas pressure ($P$) maps, the differences between the model output and the ground truth are very subtle and can be noticed only through a direct comparison as shown in the rightmost column in Fig. 5. For the gas pressure ($P$) map the model completely fails to recover reliable estimates of the field in specific regions where the model predicts negative pressure. We note that this mainly occurs in the areas with low pixel values in the ground truth maps, in agreement with the results shown in Fig. 3, i.e. the model struggles to provide accurate estimates of the field in the low intensity areas. While this effect is unnoticeable in other fields, it is exacerbated for the gas pressure map. It is very interesting to see that even for fields that have a very different morphology, e.g. Mg/Fe, our model is still able to inpaint features with great success.
Figure: [...], gas density (M_gas), total matter density (M_tot), gas pressure (P), electron number density (n_e), and the magnesium-to-iron ratio (Mg/Fe). Column-wise from left to right: ground truth (log10(I_gt)), model output (log10(I_out)) and the difference between the model output and the ground truth maps (log10(I_out/I_gt)). For a fixed row, the left two columns share the same color coding shown in the color-bars on the left, while the range of the color map in the rightmost column is adapted to highlight the differences between the model output and the ground truth map. The color-bars on the right show the color coding for the rightmost column.

We also perform the analysis using the power spectrum of the auxiliary fields and show the results in Fig. 6. Although results for all 6 fields are worse than those seen in Fig. 4, we notice that for the magnesium-to-iron ratio (Mg/Fe) field the model is able to match the reference power spectrum within ∼ 5% up to k ∼ 10 h/Mpc. This can be explained by the fact that, as seen in Fig. 5, the structures in the Mg/Fe maps are less complex compared to the other fields. These structures also extend well beyond the typical width of the masks. Interestingly, even for the SIMBA-based gas temperature maps the power spectra of the reconstructed maps are accurate at the 1% level only for k < 3 h/Mpc, indicating a difference in the small-scale morphological features with respect to the maps based on the IllustrisTNG simulations. The predicted power spectra show a systematic error within ∼ 1% up to k ∼ 1 h/Mpc for the gas density (M_gas) and the electron density (n_e) fields. This scale extends to k ∼ 2 h/Mpc for the total matter density (M_tot) field. On the other hand, the neural network completely fails to return reliable predictions for the gas pressure (P) maps, resulting in significantly biased estimates of the power spectra. We do not attempt to provide a physical explanation for these results and leave this for future work.
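For reference, the ratio between the power spectrum of a reconstructed map and that of its ground truth can be estimated along the following lines. This is a minimal sketch under stated assumptions, not the estimator used in this work: the function names, the binning in integer multiples of the fundamental mode, and the normalisation convention are our own choices, and the box size in the usage example is a placeholder. The normalisation cancels in the ratio.

```python
import numpy as np

def power_spectrum_2d(field, box_size):
    """Azimuthally averaged power spectrum of a square 2D map.

    box_size is the side length of the map (e.g. in Mpc/h); k is returned in
    the inverse unit. Modes are binned by the nearest integer multiple of the
    fundamental wavenumber, up to the Nyquist wavenumber.
    """
    n = field.shape[0]
    amplitudes = np.fft.fftn(field)
    # P = |delta_k|^2 * L^2 / N^4 for a 2D map (convention assumed here).
    power = np.abs(amplitudes) ** 2 * (box_size / n**2) ** 2
    modes = np.fft.fftfreq(n, d=1.0 / n)            # integer mode numbers
    kx, ky = np.meshgrid(modes, modes, indexing="ij")
    k_index = np.rint(np.sqrt(kx**2 + ky**2)).astype(int).ravel()
    sums = np.bincount(k_index, weights=power.ravel())
    counts = np.bincount(k_index)
    keep = slice(1, n // 2 + 1)                     # drop k = 0, stop at Nyquist
    k = (2.0 * np.pi / box_size) * np.arange(len(counts))[keep]
    return k, sums[keep] / counts[keep]

def power_ratio(reconstructed, ground_truth, box_size):
    """Return k and P_out(k) / P_ref(k) for one pair of maps."""
    k, p_out = power_spectrum_2d(reconstructed, box_size)
    _, p_ref = power_spectrum_2d(ground_truth, box_size)
    return k, p_out / p_ref

# Usage on a synthetic map: a perfect reconstruction gives a ratio of 1.
rng = np.random.default_rng(0)
truth = rng.normal(size=(64, 64))
k, ratio = power_ratio(truth, truth, box_size=25.0)
```

Averaging the ratio over all maps in the test set, together with the error on the mean, would give curves of the kind shown in the bottom panels of the power spectrum figures.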
Our analysis in this section shows that the model does not generalise particularly well to fields it was not exposed to during training. While the model fails to return reliable predictions for some fields, in other cases the validity of the predictions is limited to the largest scales (smallest wavenumbers k). These results highlight the need to train a model specifically for the field under investigation. In other words, our model has learned characteristic features of the gas temperature field that, although very generic due to the large variety of cosmological and astrophysical models present in CMD, are still very distinct from those present in other fields.

SUMMARY AND CONCLUSIONS
In this paper we test the ability of a state-of-the-art deep convolutional neural network architecture, based on the Mask-Aware Dynamic Filtering (MADF) module, to inpaint masked pixels in 2D maps of the CAMELS Multifield Dataset (CMD). We focus our attention on the gas temperature maps based on the IllustrisTNG simulations; CMD provides 15,000 such maps obtained from 1,000 state-of-the-art magnetohydrodynamic simulations with different values of the cosmological and astrophysical parameters.
The dataset is split into a training set of 10,000 maps, a validation set of 2,000 maps and a test set of 3,000 maps. We mimic the missing/masked data in the maps by applying two different kinds of binary masks: 1) regular-shaped masks that cover a continuous area of each map in a circular or rectangular patch randomly placed within the map and 2) irregular-shaped masks that are composed of a number of segments of various widths and lengths randomly placed across the map area. For each type of mask we test the model performance using two different extents, covering 15% and 30% of the total area. We train the model for 130 epochs using a batch size of 16, i.e. 625 iterations per epoch, for a total of 81,250 training iterations.
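As an illustration of the regular-shaped masks, the sketch below builds a randomly placed square or circular patch covering a chosen fraction of the map. It is not the mask generator used in this work: the function names are hypothetical, the rectangular patch is taken to be a square for simplicity, and the convention that 0 marks masked pixels and 1 marks valid ones is an assumption.

```python
import numpy as np

def square_mask(n, fraction, rng):
    """Binary n x n mask (1 = valid, 0 = masked) with one random square patch."""
    side = int(round(n * np.sqrt(fraction)))        # side^2 / n^2 ~ fraction
    x0 = rng.integers(0, n - side + 1)
    y0 = rng.integers(0, n - side + 1)
    mask = np.ones((n, n), dtype=np.uint8)
    mask[y0:y0 + side, x0:x0 + side] = 0
    return mask

def circular_mask(n, fraction, rng):
    """Binary n x n mask (1 = valid, 0 = masked) with one random circular patch."""
    radius = n * np.sqrt(fraction / np.pi)          # pi r^2 / n^2 = fraction
    margin = int(np.ceil(radius))                   # keep the disc inside the map
    cx = rng.integers(margin, n - margin)
    cy = rng.integers(margin, n - margin)
    yy, xx = np.ogrid[:n, :n]
    mask = np.ones((n, n), dtype=np.uint8)
    mask[(xx - cx) ** 2 + (yy - cy) ** 2 <= radius**2] = 0
    return mask

# Usage: masks covering roughly 15% and 30% of a 256 x 256 map.
rng = np.random.default_rng(42)
mask_15 = square_mask(256, 0.15, rng)
mask_30 = circular_mask(256, 0.30, rng)
masked_fraction = (mask_30 == 0).mean()
```

Because the patch is contiguous, a 30% mask of this kind removes entire structures from the map, whereas irregular masks of the same total area leave valid pixels close to every masked one.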
We check the model performance using the hold-out test set of 3,000 gas temperature maps and different binary masks. Through a qualitative visual comparison between the model output and the target ground truth, we first show that the model outputs are visually indistinguishable from the ground truth for irregular masks covering either 15% or 30% of the map. The difference becomes more evident for regular-shaped masks. In particular, for regular masks covering 30% of the data in each map, reticular-like artefacts start to appear at the location of the masked pixels, indicating a breakdown of the model for such large masks. We also quantify the statistical agreement between the output of the model and the unmasked maps using two different summary statistics: i) the probability density function and ii) the 2D power spectrum.
We compare the temperature probability density function of the model output with that of the ground truth using the Kolmogorov-Smirnov test in the masked regions. We find that, for irregular masks, the observed distribution of the KS-test p-values supports the null hypothesis that the reconstructed maps follow the same distribution as the corresponding ground-truth maps. For regular masks, on the other hand, the results of the KS-test indicate that the model fails to match the probability density function of the ground-truth temperature maps. In particular, for regular masks covering 15% of the pixels, a vast majority of the 3,000 test maps exhibit a p-value < 0.05, which indicates a rejection of the null hypothesis. For the largest regular masks, which cover 30% of the pixels, we find that the KS-test p-values are systematically below 0.05, indicating strong evidence against the hypothesis that the reconstructed field matches the ground truth in distribution. We do not find any correlation between the KS-test p-values and any of the 6 simulation parameters. We also show that the main source of this disagreement is the low-intensity pixels.
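The comparison restricted to the masked pixels can be sketched as below. This is an illustrative implementation under stated assumptions, not the one used in this work: the function names and the mask convention (0 = masked) are hypothetical, and the p-value is computed from the asymptotic Kolmogorov distribution, which is a good approximation for the large number of pixels in a masked region. A library routine such as scipy.stats.ks_2samp could be used instead.

```python
import numpy as np

def ks_two_sample(a, b, n_terms=100):
    """Two-sample KS statistic D and asymptotic two-sided p-value."""
    a = np.sort(np.ravel(a))
    b = np.sort(np.ravel(b))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    d = float(np.max(np.abs(cdf_a - cdf_b)))
    lam = np.sqrt(a.size * b.size / (a.size + b.size)) * d
    if lam < 0.2:
        # The series below converges slowly here; the p-value is 1 to high precision.
        return d, 1.0
    j = np.arange(1, n_terms + 1)
    p = 2.0 * np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * (j * lam) ** 2))
    return d, float(np.clip(p, 0.0, 1.0))

def ks_in_masked_region(reconstructed, ground_truth, mask):
    """Compare pixel distributions only where mask == 0 (masked pixels)."""
    hole = mask == 0
    return ks_two_sample(reconstructed[hole], ground_truth[hole])

# Usage on synthetic data: a shifted reconstruction is rejected at the 5% level.
rng = np.random.default_rng(0)
truth = rng.normal(size=(128, 128))
mask = np.ones_like(truth, dtype=np.uint8)
mask[32:96, 32:96] = 0
d_stat, p_value = ks_in_masked_region(truth + 1.0, truth, mask)
```

Repeating this for each map in the test set yields the distribution of p-values discussed above.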
Estimates of the 2D power spectra highlight an excellent agreement between the model output and the ground truth, with a systematic error below 1-2% up to k ∼ 20 h/Mpc, when data are masked using irregular masks covering up to 30% of the pixels. The accuracy deteriorates significantly when regular masks are employed, although the systematic offset remains within 5% up to the Nyquist wavenumber k_Nyq when only 15% of the pixels are masked. The model breaks down when regular masks covering 30% of the total area are used.
The main cause of the model breakdown when data are erased in large patches is the unique nature of the structures being removed, combined with the small number of maps (for each set of cosmological and astrophysical parameters) used to train the network. On the one hand, the neural network is unable to retrieve the statistical properties of the missing data from the unmasked pixels; on the other hand, it fails to learn the semantic features of the field from the ensemble of training maps for a fixed set of cosmological and astrophysical parameters. We thus expect an improvement in the model performance from increasing either the size of each map or the number of maps in the training set.
Finally, we use the model that was trained on the CMD gas temperature maps to perform inpainting on CMD maps of different fields: the SIMBA-based gas temperature (T_SIMBA), the total matter density (M_tot), the gas density (M_gas), the gas pressure (P), the magnesium-to-iron ratio (Mg/Fe), and the electron density (n_e). We find that, even when using irregular masks that extend over 15% of the pixels, the model performance degrades significantly compared to when it is applied to the same field it was trained on. An even more important result is that the model performance becomes strongly field-dependent, indicating the need to train the model specifically on the field under investigation.
We conclude that the model used in this work is able to recover reliable pixel-value distributions when data are missing in irregular-shaped patches. These results hold for gas temperature maps that span 1,000 different cosmological and astrophysical models and that exhibit very different morphological features such as halos, filaments, and voids. The power spectrum of the inpainted maps exhibits an impressive agreement with that of their unmasked versions: within 1% up to k_max = 20 h/Mpc and within 5% all the way to the Nyquist wavenumber at k ∼ 30 h/Mpc. For regular-shaped masks, our model fails to recover a reliable probability density function of the field in the masked patches regardless of their extent, while it yields power spectrum estimates accurate at the 5% level only when 15% of the pixels are masked. This can be a consequence of the very large variety of models seen by the network; we would expect a higher accuracy also for regular masks if the model were trained on a very large number of images with a fixed cosmological and astrophysical model.
The results presented in this paper have important consequences for cosmological surveys, where missing, masked, and damaged data are a very common issue. This paper paves the way to tackle that problem in a different way than is commonly done in the field. However, more work is needed in order to apply this method to real data. We plan to pursue this direction in future work.