Deep learning insights into non-universality in the halo mass function

The abundance of dark matter haloes is a key cosmological probe in forthcoming galaxy surveys. The theoretical understanding of the halo mass function (HMF) is limited by our incomplete knowledge of the origin of non-universality and its cosmological parameter dependence. We present a deep learning model which compresses the linear matter power spectrum into three independent factors which are necessary and sufficient to describe the $z=0$ HMF from the state-of-the-art AEMULUS emulator to sub-per cent accuracy in a $w$CDM$+N_\mathrm{eff}$ parameter space. Additional information about growth history does not improve the accuracy of HMF predictions if the matter power spectrum is already provided as input, because required aspects of the former can be inferred from the latter. The three factors carry information about the universal and non-universal aspects of the HMF, which we interrogate via the information-theoretic measure of mutual information. We find that non-universality is captured by recent growth history after matter-dark-energy equality and $N_\mathrm{eff}$ for $M\sim 10^{13} \, \mathrm{M_\odot}\, h^{-1}$ haloes, and by $\Omega_{\rm m}$ for $M\sim 10^{15} \, \mathrm{M_\odot}\, h^{-1}$. The compact representation learnt by our model can inform the design of emulator training sets to achieve high emulator accuracy with fewer simulations.


INTRODUCTION
The abundance of the dark matter haloes, and by extension the abundance of groups and clusters of galaxies in the observed universe, is strongly sensitive to cosmological parameters.Accordingly, cosmological parameters can be inferred from cluster number counts (Dodelson et al. 2016;Abbott et al. 2020;Abdullah et al. 2020;Costanzi et al. 2021).To do so requires combining observational mass estimates with theoretical abundance predictions.The work of Press & Schechter (1974) and Bond et al. (1991) established a basic paradigm for the latter, known as the 'extended Press-Schechter' (EPS) formalism.
The underlying assumption of EPS is that dark matter haloes form at peaks in the linear density field exceeding a pre-determined threshold.Predicting the mass of a halo is then equivalent to determining the radial extent of the corresponding density peak.In this picture, the effect of cosmology on halo abundance is determined by three ingredients: (i) the linear variance and therefore the abundance of peaks as a function of radius , given by (); (ii) the mass  enclosed within a sphere of the same radius; (iii) the height   of a peak that collapses into a halo.The first factor is fully determined by the linear matter power spectrum (), while the second depends only on the mean matter density.Halo mass function predictions based on the first two factors without any further dependence on cosmological parameters were defined as 'universal' by Jenkins et al. (2001), ★ E-mail: ningyuan.guo.20@ucl.ac.uk and we will follow this naming convention.EPS is the prototypical example of a universal halo mass function, provided the threshold  c is held fixed 1 .
The original EPS formalism offers strong physical insight but very limited accuracy, with a variety of studies showing up to 50 per cent deviations between numerical simulations and predictions; see e.g.Sheth & Tormen (1999), and a review in Kravtsov & Borgani (2012).However, by calibrating functional freedoms on numerical simulations, far greater predictive accuracy can be achieved while retaining the essential idea of universality (Jenkins et al. 2001;Reed et al. 2003;Tinker et al. 2008;Courtin et al. 2011;Bhattacharya et al. 2011;Watson et al. 2013;Despali et al. 2016;Diemer 2020).Universal halo mass functions generalize to new cosmological parameters, but only with an accuracy of a few per cent (Courtin et al. 2011;Bhattacharya et al. 2011;McClintock et al. 2019;Bocquet et al. 2020;Diemer 2020;Ondaro-Mallea et al. 2022).By contrast, extracting precise cosmological parameters from present and near-future survey data requires ∼ 1 per cent accuracy on theoretical abundances (Sartoris et al. 2016;McClintock et al. 2019;Euclid Collaboration: Castro et al. 2023).Thus, universal halo mass functions calibrated on nu-1 Holding  c fixed is justified by its weak cosmological dependence in analytic spherical collapse calculations.However, including its cosmological dependence can account for a small fraction of known non-universal behaviour (Courtin et al. 2011).merical simulations provide a good 'baseline' prediction, but do not in themselves reach sufficient accuracy for cosmology requirements.
Emulators (e.g.McClintock et al. 2019;Nishimichi et al. 2019;Bocquet et al. 2020) offer a promising route towards the required accuracy (although other theoretical models going beyond universality, e.g.Euclid Collaboration: Castro et al. 2023, also demonstrate 1 per cent accuracy).For example, Aemulus (which we use in this work) reaches per cent level accuracy over much of its 7D parameter space, which extends the baseline ΛCDM parameters with the dark energy equation of state  and the effective number of neutrinos  eff (McClintock et al. 2019).However, by necessity, any emulator must place a finite number of training simulations into a cosmological parameter space which ideally would span a larger number of dimensions (to fully encapsulate the physics of dark energy, neutrinos and primordial non-Gaussianity).If the origin of non-universality were to be clearly established, the architecture and/or training of such approaches could be optimized to cover these much larger spaces with greater accuracy.
Accordingly, a number of works have started to investigate physical contributors to non-universality (e.g.Courtin et al. 2011;Ondaro-Mallea et al. 2022;Euclid Collaboration: Castro et al. 2023;Gavas et al. 2023), especially focusing on the potential role of the linear growth history  () (Courtin et al. 2011;Ondaro-Mallea et al. 2022).In the present work, we provide a complementary perspective by using an interpretable machine learning approach.Specifically, we use an interpretable variational encoder (IVE; Lucie-Smith et al. 2022;Lucie-Smith et al. 2024), which is a deep learning network designed for knowledge extraction, building on SciNet (Iten et al. 2020).IVEs learn to compress input information into a small number of latent variables which can in turn be used to make predictions about physical quantities.For our purposes, the input is the linear matter power spectrum of a cosmological model and optionally the growth history too, and the output is the halo abundance as a function of mass.We train the IVE to reproduce HMFs at  = 0, generated by Aemulus (McClintock et al. 2019), to within 0.25 per cent residuals for  = 10 13.2−15 ℎ −1 M ⊙ .
The purpose of training an IVE in this way is not to supplant emulators directly, but has two linked goals.The first is to determine the inputs and dimensionality of the latent space required to reproduce per-cent-level-accurate halo mass functions.The second is to then interpret the physical meaning of the discovered latent parameters.Specifically with respect to inputs, we are able to test whether  () is independently informative compared to providing the IVE with () alone.With respect to latent interpretation, the key tool is a robust estimator of mutual information (MI; Piras et al. 2023), enabling us to probe the information content of the compressed space and to distinguish non-universal from universal effects.MI is an information theoretic measure of linear and non-linear correlation between variables (see e.g. a review in Vergara & Estévez 2013).As such, it is agnostic to any a priori expectations on what drives beyond-universal variations in the HMF within Aemulus, giving us a new perspective on this crucial question.
This paper is organized as follows.We provide an overview of our model in Section 2, in particular describing the training data and procedure in Sections 2.1 and 2.2 respectively.We present the results on determining the underlying dimensionality of the latent space and the required inputs in Sections 2.3 and 2.4 respectively.We proceed with interpreting the latent representation in Section 3. We first describe the MI metric in Section 3.1, and then present an overview of the information content of the latents in Section 3.2.We focus on the latents' universal and non-universal information content in Sections 3.3 and 3.4 respectively, and discuss the connection between non-universality and halo mass definition in Section 3.5.Finally, we discuss the results and conclude in Section 4.

OVERVIEW OF THE MODEL
Our goal is to investigate the origin of the universal and non-universal information in the HMF, through the use of the IVE interpretable deep learning technique (Iten et al. 2020;Lucie-Smith et al. 2022;Lucie-Smith et al. 2024); a schematic illustration is shown in Fig. 1.The IVE learns to predict the  = 0 HMF for a range of different cosmologies, when given as input e.g. the linear matter power spectrum.In doing so, it generates a low-dimensional latent representation, which captures all the information in the inputs required to predict the HMF.By investigating the cosmological information in the latents, we can gain insight into how cosmological parameters map on to the universal and non-universal aspects of the HMF.
Since predicting the non-universal HMF requires information beyond the linear mass variance (), we provide as inputs to the IVE the linear matter power spectrum (), and optionally the linear growth function  () too.This choice of inputs allows us to investigate whether there is additional information about the  = 0 HMF in  () over information already present in the ().
The IVE consists of an encoder-decoder framework, similar to that in Iten et al. (2020) and Lucie-Smith et al. (2022).The encoder is a neural network made of fully-connected layers which takes in the 1D linear matter power spectrum (or a concatenation of () and  ()) and returns the mean and standard deviation of each Gaussian latent distribution; the decoder is another neural network made of fullyconnected layers which takes in a random sample from the latent distributions and a query halo mass log , and returns the HMF d/d log .Further technical details regarding the architecture are presented in Appendix B2.
The IVE is similar to a variational autoencoder (VAE; Kingma & Welling 2013), but because its task is not to reconstruct the input, the IVE's latent space does not retain all information from the input, but only the information that is relevant to predicting the HMF.This design allows us to interpret the latent representation to extract the relevant physical factors.Relative to our previous work using IVEs (Lucie-Smith et al. 2022;Lucie-Smith et al. 2024), the model used here is simpler (using one-dimensional inputs and outputs), and is more similar to the architecture presented in Iten et al. (2020).In order to interpret the latent space, we require it to be disentangled, i.e. each latent should encode an independent factor of variation in the HMF.We achieve this by using a regularization term in the loss function (see Section 2.2.1), following Iten et al. (2020) and Lucie-Smith et al. (2022).Once the IVE model learns a disentangled latent representation, we interpret it using MI as described in the Introduction.We estimate the MI of each latent variable with the HMF, as well as with the input linear power spectrum and growth functions, to determine what information each latent variable encodes.The subper-cent-level accuracy requirements on the learnt mapping places stringent requirements on the accuracy of the disentangled model, which is technically challenging to achieve in comparison with these prior works; our annealing-based approach to this problem is outlined in Section 2.2 and fully described in Appendix B4.

Training and test data
We now describe the construction of the ground-truth HMF outputs and the inputs used to train the IVE.We train two IVE models which differ by their inputs: the first uses only () as input, while the  2022) find information on growth history relates to non-universality, the comparison between our two models allows us to determine whether growth history is required in addition to () to achieve a more accurate description of the HMF.We compare the prediction accuracy of the two IVEs; if growth history provides additional relevant information, the IVE also given  () should predict the HMF to higher accuracy.

Ground truth HMFs
We produce the HMFs to train our IVE models using the state-ofthe-art HMF emulator Aemulus (McClintock et al. 2019).Aemulus can achieve 1 per cent precision over its seven-dimensional cosmological parameter space varying Ω b ℎ 2 , Ω c ℎ 2 , ,  s ,  8 ,  0 , and  eff (McClintock et al. 2019).Using Aemulus is particularly convenient for an exploration of HMFs with our machine learning approach, because it can generate unlimited training samples.By contrast, the largest available simulation suites only have data for a few thousand cosmological parameter samples (Villaescusa-Navarro et al. 2020).
While training on Aemulus limits our exploration to its cosmological parameter space, this is sufficient for our study.
To ensure we train on reliable HMFs, we choose cosmological parameter samples for our dataset to lie within the domain of validity of Aemulus.Following the approach that DeRose et al. ( 2019) used to construct the Aemulus training set, we project a 7D Latin hypercube into the cosmological parameter space such that the resulting samples cover approximately the ±3 region allowed by CMB+BAO+SNIa, and follow degeneracies among cosmological parameters.We provide details on the cosmological parameter sampling process in Appendix B1.During training, we encountered a bug in Aemulus which does not affect its overall accuracy but which generates a correlation between residuals and the value of ; this is also described in Appendix B1.We generate a dataset with 10 5 cosmological parameters samples, where 48000 samples are used for the training set, 12000 samples for the validation set, and 40000 samples are used for the test set.We checked that the size of the training set we used saturated the accuracy of the learnt mapping.
The halo mass function is generally written as 2 (Press & Schechter 2 Note that throughout this paper we use log to denote logarithms to base 10, 1974; Bond et al. 1991) where  is the number density of haloes with mass  per logarithmic mass interval,  b is the background matter density of the universe, and  () is the multiplicity function.The latter gives the shape of the halo mass function in terms of (, ), the rms mass variance of the linear density field given by Here, (, ) is the linear matter power spectrum at the specified redshift, and W the Fourier transform of the top hat filter with  = 4 3  b  3 .Given cosmological parameters, Aemulus outputs the HMF for haloes with masses  200b , defined as the mass enclosed within a sphere of overdensity 200 with respect to the cosmic mean matter density.To do so, it first emulates the fitting parameters , ,  ,  of the multiplicity function (Tinker et al. 2008(Tinker et al. , 2010)), where  is a normalization constant such that all dark matter resides in haloes.Then, it converts equation (3) to the HMF via equation (1), calculating () from the linear ().Note that Aemulus learns a non-universal HMF, as the fitting parameters are explicit functions of cosmology.Fig. 2 shows the distribution of  () as a function of () for the set of cosmologies in our training sample.The effect of non-universality is at maximum of ± ∼ 2 per cent level.Learning this small but non-trivial effect places stringent demands on the IVE prediction accuracy; we will return to this in Section 2.2.2.

Inputs to the IVE
For each set of cosmological parameters in the training data, we generate the IVE inputs: the linear matter power spectrum () at  = 0 and the linear growth function  (), using CAMB (Lewis et al. 2000).For modelling dark energy with  ≠ −1, we use the Parameterized Post-Friedmann (PPF) approximation as implemented in CAMB (Hu & Sawicki 2007).The linear matter power spectrum () in units of ℎ −3 Mpc 3 at  = 0 is evaluated for 10 −4 ≤ /(ℎMpc −1 ) ≤ 10 at 200 points logarithmically spaced in .The broad range of -values allows the IVE access to information at all scales, not just those that are directly encoded within ().The growth function is derived from CAMB's output using the redshift-dependent transfer function values as  () =  (,  = 1ℎMpc −1 )/ ( =  n ,  = 1ℎMpc −1 ), where  ∈ [0, 5] and  n is the redshift where the growth is normalized.We choose  n = 50 to give a broad distribution of  () at low redshifts, when growths differ the most due to dark energy.The growth function is evaluated at 100 values of  ∈ [0, 5] linearly spaced in scale factor, which samples the lower redshifts more finely.
We perform a standard data pre-processing step to normalize our inputs and outputs to be of order unity (Goodfellow et al. 2016).We first calculate  mean () and  mean () of the mean cosmology, for which each of the seven cosmological parameters takes the mean value of its training set distribution.Specifically, the parameters of the mean cosmology are Ω b ℎ 2 = 0.022, Ω c ℎ 2 = 0.118,  = −1,  s = 0.962, ln 10 10  s = 3.087,  0 = 68.3km s −1 Mpc −1 ,  eff = 3.44.We then divide out the set of power spectra we obtain by  mean (), and similarly divide out the set of growth functions by  mean ().We also take the logarithm of the inputs to reduce their dynamic range.We rescale the distribution of log  normalized () for all  in the training set to the range [−1, 1], and rescale the distribution of log  normalized () for all  similarly.

Outputs of the IVE
For each cosmological parameters sample in our dataset, we initially evaluate the HMF on a grid of 500 mass points linearly spaced between log(/ℎ −1 M ⊙ ) = [13.2,15.0] using the  200b mass definition.The mass range is chosen where Aemulus is most reliable: below log(/ℎ −1 M ⊙ ) = 13.2, the simulations that Aemulus was trained on did not converge to within 1 per cent with respect to simulations of higher mass resolution (DeRose et al. 2019).Above log(/ℎ −1 M ⊙ ) = 15.0, the Poisson noise in the binned halo counts data used to train Aemulus is ≳ 10 per cent (McClintock et al. 2019). 3We use the same mass range for all cosmologies since this mass range is covered by the binned halo counts from all Aemulus training simulations at  = 0. To normalize our HMFs for machine learning, we divide out the set of HMFs by that of the mean cosmology.We then take the logarithm to reduce their dynamic range.
To produce query-ground truth pairs, for the training set HMFs, we randomly sample 50 masses per HMF from the initial grid of 500 halo masses.We tested that increasing the number of queries per HMF above 50 did not significantly improve the model's prediction accuracy.For the validation set, we randomly sample 10 masses per HMF (the model training is not sensitive to this specific choice).For the test set, to evaluate the IVE predictions across the mass range, we subsample the initial grid of 500 masses by taking every 50-th mass.We additionally include the maximum mass of  = 10 15 ℎ −1 M ⊙ to test the IVE performance at the upper mass bound.The distribution of log(d/d log ) normalized for all query masses in the training set is rescaled to [−1, 1], and the distribution of all query masses log(/ℎ −1 M ⊙ ) in the training set is also rescaled to [−1, 1].

Loss function
Training the IVE amounts to optimizing parameters of the encoder and decoder such that the model achieves high prediction accuracy and the latent representation is disentangled.This is achieved via minimizing a -VAE loss function (Higgins et al. 2017;Burgess et al. 2018), which is written Here,  is the number of training samples per batch and the first term  pred measures the accuracy of the model's predictions, for which we adopt the mean squared error: The second term of equation ( 4), D   , regularizes the latent space to encourage disentanglement.Further details are given in Appendix B3.The balance between accuracy and regularization is controlled by the hyperparameter , which we tune while training to achieve both high prediction accuracy and disentanglement.
To verify that the latent variables are disentangled, we measure the MI between latent variables (see Section 3.1), checking that it is negligible compared to the information each latent variable contains about the ground truth HMF.

Optimization
Since our goal is to discover which factors govern the HMF at  = 0, our IVE models must be accurate enough to learn non-universality in our training data.Fig. 2 shows this is at a maximum of ∼ 2 per cent, with a 95 per cent confidence interval of ± ∼ 0.9 per cent at low  −1 (low mass) and ±1.3 per cent at high  −1 (high mass).A full investigation places stringent demands on the accuracy in the HMF predictions, such that the IVE can reproduce Aemulus even in regimes where its accuracy is better than per cent level.Achieving this level of accuracy together with disentangled latents requires a two-stage approach involving carefully tuned annealing strategies.The first stage aims to achieve disentanglement while improving prediction accuracy.We find models trained with a constant  in the loss function in equation (4) either do not have disentangled latent variables when the value of  is small, or for high values of  the latent variables are disentangled at the cost of losing prediction accuracy.We therefore use -annealing, where we start with a high value of  that disentangles the latents, and then slowly decrease the value of  with each training epoch  to improve prediction accuracy (Shao et al. 2020).The slow decrease in  allows the latents to remain disentangled while prediction accuracy improves.Further details of the annealing procedure are given in Appendix B4.Additionally, we halve the learning rate and double the batch size when the validation loss does not improve over the previous 40 epochs.We minimize the loss function in equation ( 4) using the AMSGrad optimizer (Reddi et al. 2019).

Determining the dimensionality of the latent space
Fig. 3 shows the results from training the IVE using different numbers of latent variables, and for different input choices.We show the predictive accuracy of the IVE models in terms of the ratio between predicted and ground truth HMFs.The top panel shows the mean and 95 per cent confidence interval of the residuals of the predictions when training the IVE with different numbers of latent variables.We find that an IVE with a three-dimensional latent space can achieve similar accuracy to an IVE with seven latent variables; the latter is the maximum number of latent variables we expect to need, since this is the number of cosmological parameters varied to generate the training set.As we further decrease the latent space dimensionality to two, we find that the accuracy of the IVE significantly worsens.We therefore conclude that three latent variables are required and sufficient to capture all the information needed to predict Aemulus HMFs at  = 0.The residuals of the three-latent-variable models are ≤ 0.25 per cent, i.e. reaching the level of the emulator errors inherent in Aemulus itself.As non-universality varies HMFs in the Aemulus parameter range by up to ∼ 2 per cent, to achieve this accuracy our models must have learnt information on non-universality.

Determining the required inputs
The bottom panel of Fig. 3 compares the residuals of the two IVEs, one trained on () alone and the other on both () and  () respectively.In both cases, three latents are necessary and sufficient to achieve the best accuracy.The IVE achieves the same accuracy regardless of whether  () information is included in its inputs, implying that  () does not carry additional information over that contained in () for describing the HMF at  = 0 within the CDM+ eff parameter space considered.
This finding that  () does not carry additional information on the HMF seems at first to contradict previous work by Courtin et al. (2011);Ondaro-Mallea et al. (2022), who find differences in the HMFs of simulations performed with the same power spectrum but different growth histories.However, in order to decouple the effects of growth history from the shape of the power spectrum, these studies initialized their simulations using power spectra calculated from a different set of cosmological parameters to those governing the growth history.For example, Ondaro-Mallea et al. ( 2022) initialized their simulations with () calculated using a fixed Ω m = 0.307, while the growth history evolved according to Ω m = [0.148,1].By definition, any physical effect associated with growth history is therefore unrecoverable from the power spectrum in such cases.Conversely, our approach reflects the fact that, within the Aemulus parameter space and where all cosmological parameters are selfconsistent, it is possible to infer any relevant information directly from the power spectrum at  = 0.
Ondaro-Mallea et al. ( 2022) further showed on a separate set of CDM simulations (where () and growth follow consistent cosmological parameters) that including growth history in addition to () [given in equation ( 2)] improves the modelling of nonuniversality.However, while () can be obtained from (), the transformation loses information when evaluated over a finite mass range.In other words, () for  ≥ 10 13.2 ℎ −1 M ⊙ does not contain all the information in () for 10 −4 ≤ /(ℎMpc −1 ) ≤ 10 (the relevant ranges given in Section 2.1).Hence there is no tension between previous results and our finding that  () adds no information once () is known.

INTERPRETATION OF LEARNED LATENTS
In the previous section we presented the baseline IVE model, which takes () as input and encodes it within a three-dimensional latent space which is necessary and sufficient to predict Aemulus HMFs.We now examine the information content captured within the latent space, in order to interpret the cosmological information related to universal and non-universal aspects of the HMF.
To gain intuition on the meaning of the latent variables, we start by visualizing the impact of each latent parameter on the final HMF.To do so, we systematically vary the value of one latent, while keeping the others fixed, and show the resulting HMFs in the top panels of Fig. 4. In each case, the chosen latent varies between −3 and +3, where  is the standard deviation in the latent distribution (which is close to Gaussian by construction).The plots therefore give the  Variations in the predicted HMF when systematically varying the value of one latent variable, while keeping the others fixed.Each panel from left to right varies the universal, mapping and non-universal latents, respectively.Lower panels: Fractional differences in the predicted HMF with respect to that of the reference mean cosmology (defined in Section 2.1.2),again when varying each latent variable, keeping the others fixed.This gives an indication of a lower limit to the magnitude of each latent's effect across the parameter space as a whole.These independent aspects of the halo mass function were discovered automatically by the IVE during training.sought-after visual indication of the independent variation induced by each latent.However, one should bear in mind that the quantitative effect of each latent is somewhat larger than these plots illustrate, because they condition on the remaining two latents.
The panels from left to right show the impact of each latent variable in turn; the three latents are denoted 'universal', 'mapping' and 'non-universal' respectively, for reasons that we will shortly explain.The universal latent primarily affects the amplitude of the HMF at the high-mass end; the mapping latent controls the normalization of the HMF at the low-mass end, pivoting around  ≃ 10 14.6 ℎ −1 M ⊙ ; the non-universal latent controls the curvature of the HMF at intermediate mass scales.The latents exhibit a hierarchy in that the universal latent induces changes in the HMF of order 100 per cent, the mapping latent carries 10 per cent effects, while the non-universal latent carries ∼ 1 per cent level effects.
In order to quantify the information carried by these latents in detail, we use tools based on the measurement of marginal and conditional MI, as we discuss below.

Mutual information analysis
We use MI both to determine whether latent variables are disentangled, and to interpret what the latent variables encode.MI is a measure of the amount of information shared between two variables  and , given by where (, , ) is the joint probability density function of ,  and .The generalization to more than one conditioned variable follows by promoting  to a vector of quantities.Generally, an increase in the MI after conditioning on  means that  and  provide synergistic information on , whereas a decrease in MI after conditioning on  means that  provides redundant information to  about .The estimation of conditional MI is also carried out using GMM-MI.Previously in astronomy, conditional MI has been used to investigate information content in multi-band galaxy surveys (Chartab et al. 2023), and to study the relation between galaxy morphology and environment (Bhattacharjee et al. 2020).

Overview of the information content in latent space
As previously discussed in Section 2.2.1, we assess the disentanglement of the latent space using MI.The models we present in Section 2.4 are disentangled with MI ≲ 0.002 nats.This disentangled 3D latent representation contains all cosmology-dependent information required to predict the (non-universal) HMF at  = 0.The left panel of Fig. 5 shows the MI between latent variables and the ground truth d/d log ; the middle panel shows the MI between latent variables and the ground truth d/d log  conditioned on the other two latents; the right panel shows the MI between the latents and  () as a function of .We now discuss the key features seen in each panel in turn.
The left panel shows that there are two dominant latents -the universal and mapping latents -which carry most of the information about the HMF.The nomenclature we use to describe the latents will become apparent as we analyse the latents in more detail.The universal latent predominantly carries information about the high-mass end of the HMF, in particular showing a peak at  = 10 14.6 ℎ −1 M ⊙ where the MI reaches ∼ 4.6 nats.This behaviour only emerges in disentangled latent spaces, i.e. when we tried training with insufficient  to obtain disentanglement, none of our latents exhibited such a clear peak.We conclude that 10 14.6 ℎ −1 M ⊙ is the mass-scale which best disentangles the three different factors of variation in the HMF.
The mapping latent has an opposite trend to the universal one: its MI peaks at the low-mass end  = 10 13.2 ℎ −1 M ⊙ and decreases towards higher masses.The fact that the universal and mapping latents capture information about the high-mass and low-mass ends of the HMFs is consistent with the intuitive picture provided by Fig. 4. The peak in the universal latent MI also corresponds to the pivot point in the mapping latent, so that the universal latent is almost sufficient on its own to predict the HMF at that mass.
The information content in the third latent is highly subdominant compared to the other two.However, as we noted in the discussion of the top panel of Fig. 3, this latent is necessary; the two dominant latents allow the IVE model to predict the HMF to an accuracy of ≲ 0.5 per cent, but the third latent variable is required to further improve the prediction accuracy to ≤ 0.25 per cent.We adopt the three-dimensional latent space since our goal is to reproduce results from Aemulus without any significant noise.The third latent is at a level somewhat below the intrinsic uncertainties in Aemulus itself, but we will show that it none the less encodes physically interpretable effects.
The middle panel of Fig. 5 shows a complementary view of the information content in the latents.By conditioning the MI on the other two latents, we quantify the intuitive picture presented by the latent traversals shown in Fig. 4, where the effect of a single latent is illustrated by fixing the other two.Conditional MI thus reveals the effect of a single latent, which is particularly valuable in the case of the non-universal latent whose unconditioned MI is driven to zero by the dominance of the other two latents.As such, the middle panel of Fig. 5 confirms that the non-universal latent does carry additional information about the HMF.The two troughs in the non-universal latent line correspond to the dual pivot points seen in the traversal plot (right panel of Fig. 4).Similarly, in the conditional MI view, the mapping latent contains information not only at low masses but also at high masses, with a trough corresponding to the previously identified pivot scale where the universal latent MI peaks ( = 10 14.6 ℎ −1 M ⊙ ).The universal latent MI retains its peak at this scale, even once conditioned on the other two latents, once again reflecting the disentanglement constraint that ensures the latents capture independent degrees of freedom.
The right panel of Fig. 5 examines the information in the latents about non-universal behaviour, by showing the MI between the latents and  ().As explained in the Introduction, and following Jenkins et al. (2001), non-universality is defined by the effect of cosmological parameters on  ().Thus, any non-zero MI in the right panel of Fig. 5 indicates non-universal behaviour.Since the Tinker  = 0 HMF is universal with no variation in  () as a function of (), the MI between the latents and the Tinker  () is consistent with zero. 4trikingly, the maximal information on  () is carried by the non-universal latent, despite its subdominant effects on the overall mass function, justifying our choice of nomenclature.This is particularly true at the low-mass end (low  −1 ).Additional non-universal information at the high-mass end (high  −1 ) is carried by the the mapping latent.On the other hand, the universal latent contains a peak of 4.6 nats of information about the HMF, but less than 0.2 nats of non-universal information; this is why this latent is referred to as the universal latent.Similarly, the mapping latent carries about 2 nats of information about the HMF peaking at the low-mass end, in comparison to the 0.4 nats of non-universal information it carries about the high-mass end.We can therefore conclude that most of the information carried by the mapping latent is universal.We will return to a more detailed investigation of non-universality in Section 3.4.In Section 2.4 we showed that, despite being provided only () inputs, the latent space fully encapsulates the information about  () required to predict the HMF.It is instructive to therefore quantify that information explicitly.Fig. 6 shows the conditional MI of each latent with  (), normalized such that  ( = 0) = 1, conditioned on the other latents.We find that disentanglement of the latents leads to a trade-off in information before and after matter-dark-energy equality,  m=DE .This was not imposed during training, but rather discovered by the IVE when provided with only () information as input.This trade-off is particularly prominent for the universal and mapping latents.The non-universal latent, on the other hand, picks up information about the growth across all time.We observe that the three latents contain similar amounts of conditional information roughly at  m=DE .Furthermore, we find that the non-universal latent's conditional MI peaks at the redshift where the universal latent's conditional MI reaches its minimum.We denote this redshift  late ≃ 0.11, and we will discuss it further in Section 3.4.

Universal information
We now discuss the information content related to the universal part of the HMF that has been learnt by the latent representation, i.e. the information which could still be captured by an approximation which adopts a fixed  ().As just discussed, this information is primarily encoded in the universal and the mapping latents.
We start by discussing the universal latent.The abundance of high mass galaxy clusters is known to be a sensitive probe of cosmology (e.g.White et al. 1993;Eke et al. 1996;Rozo et al. 2010;Norton et al. 2023).In particular, it is common to quote sensitivity to the parameter combination  8 ≡ √︁ Ω m /0.3 8 , which encodes the am-plitude of matter fluctuations within a certain redshift range (Jain & Seljak 1997).For reasons that we will explain below, in our case the natural combination is Ω 0.46 m  8 , which is almost equivalent to  8 : the exponent is nearly the same, and the √ 0.3 normalization in  8 has no impact on information content.The top panel of Fig. 7 shows a scatter plot to illustrate the close correlation between the universal latent and the parameter combination Ω 0.46 m  8 .The scatter plot is additionally coloured by Ω m which captures the (small) remaining scatter in the tight relation with Ω 0.46 m  8 .To understand this tight relationship, recall that the MI with the HMF peaks at the mass scale  = 10 14.6 ℎ −1 M ⊙ .Such haloes form half their mass by  form ≃ 0.46.In turn, the cosmic density at this redshift for our mean cosmology (Ω m,0 = 0.30,  = −1.0) is well approximated5 by Ω m ( form ) ≈ Ω 0.46 m ; this is illustrated in the lower panel of Fig. 7. Using this argument, it is possible to anticipate that the parameter combination Ω 0.46 m  8 is an excellent proxy for information in the universal latent.
We next consider the mapping latent.Our designation for this latent derives from the hypothesis that, once the normalization is determined by the universal latent, the next most important information will concern the mapping between  () and the HMF.This hypothesis is largely borne out by our analysis, as we now explain.First, we find that the information in this latent is well approximated by the parameter combination Ω 0.35 m d log  d log  evaluated at  = 10 13.2 ℎ −1 M ⊙ .The mass scale  = 10 13.2 ℎ −1 M ⊙ is where the MI between the latent and the HMF peaks, as shown in the left panel of Fig. 5.We find that the residual scatter in the relationship between the mapping latent and this proxy is well described by  eff .This is illustrated in the top panel of where the right hand side comprises of two components: the multiplicity function  () and a factor Ω m d log  d log  which maps  () to the HMF.Other than the exponent on Ω m (discussed below), the latter is the parameter combination found to best describe the mapping latent.Therefore, the role of the mapping latent can be thought of as a mapping between the multiplicity function  () and the halo mass function d/d log .
Additionally, since the abundance of lower mass haloes depends more strongly on Ω m d log  d log  than on  (), this result is also consistent with the finding that the mapping latent carries mostly information on the abundance of lower mass haloes (see left panel of Fig. 5).Similar to the universal latent case, we find that the exponent of 0.35 in Ω m can be predicted from the formation history of lowmass haloes as shown in Fig. 8: at the redshift  form where haloes of  = 10 13.2 ℎ −1 M ⊙ form half their mass, Ω m ( form ) ≈ Ω 0.35 m for the mean Ω m = 0.3 in the training set.
In principle, one could extend the Ω  near-perfect linear correlation with the mapping latent (as found for the universal latent).We choose not to pursue this route in order to restrict ourselves to describing the latents in terms of physically interpretable parameter combinations only, rather than empirically motivated parameter combinations.In Fig. 9, we additionally use MI to show the close resemblance between the parameter dependencies identified and the main information content in the latents.
In the top row of Fig. 9, we compare MI between the universal latent and (),  () and the HMF with that of its main parameter dependence, Ω 0.46 m  8 .We find that the MI curves of the latent and its proxy cosmological parameter dependence overlap closely.This agrees with the tight correlation seen between Ω 0.46 m  8 and the universal latent in the top panel of Fig. 7, even without accounting for the extra small dependence on Ω m .
The bottom row of Fig. 9 compares the MI between the mapping latent and (),  () and the HMF, with that of its proxy, Ω 0.35 m d log  d log  evaluated at  = 10 13.2 ℎ −1 M ⊙ .We account for the mapping latent's additional dependence on  eff by conditioning on it.We find that the two curves are very similar, confirming that this parameter combination is a good proxy of the cosmological information in the mapping latent.We note that the mapping latent, which has the highest MI with the HMF at the low-mass end, encodes high information on () at small  (large scales).In contrast, the universal latent, which encodes the most information on the HMF at the high-mass end, has high information with () at larger  (smaller scales).This may seem counter-intuitive at first, but it can be explained by the main parameter dependence of each latent.The universal latent has high MI with  8 ; from equation (2), we expect both very large scales and small scales to contribute minimally to the integrand; this is indeed the case in the top left panel of Fig. 9. On the other hand, the mapping latent has high MI with Ω m .The power spectrum is affected on all scales by Ω m , but on small scales it is additionally affected by e.g. the baryon density and  eff .Hence, () is most informative of Ω m on large scales, leading to a high MI between the mapping latent and () on those scales.This also leads to the high MI between the mapping latent and  () at low redshifts, when the growth function normalized at  = 0 is most strongly dependent on Ω m .normalized to unity at  = 0 (middle), and the ground truth HMF (right).In pink dashed line, we show the same MI quantities but using Ω 0.46 m  8 instead of the universal latent.Bottom row: MI between the mapping latent and the same three functions, conditioned on  eff .In pink dash-dotted line, we show the same conditional MI quantities but using Ω 0.35 m d log  d log  evaluated at  = 10 13.2 ℎ −1 M ⊙ instead of the mapping latent.

Non-universal information
In the previous section, we discussed the 'universal' information in the first two latents.However, as discussed in Section 3.2, the latent variables also capture information on non-universality, i.e. information that is not present in ().This was already demonstrated in the right panel of Fig. 5 through the MI between each latent variable and  ().The plot further demonstrates that there is an inversion in the hierarchy of information: the universal latent which dominates the overall information content of the halo mass function learns the least out of all three latents about non-universality.
We now consider the interpretation of the non-universal information in the latents.Fig. 10 shows the MI between each latent and  (), but now conditioned on the other two latents to isolate the effects.The panels from top to bottom illustrate the universal, mapping and non-universal latents respectively.The top panel reconfirms that the universal latent carries little information about non-universality (solid line), since the MI between the latent and  () is approximately 0.25 nats for all  −1 values.We previously showed that the universal latent carries almost the same information as the parameter combination Ω 0.46 m  8 .When examining its contribution to non-universality, we find that the information content closely mimics that in  8 (dashed line).This is further demonstrated by calculating the MI between the latent and  () but conditioning on  8 (in addition to the other latents).The grey line indicates that this conditional MI is close to zero at all scales, meaning that there is negligible further information about non-universality in the universal latent once  8 is known.However, since the overall contribution to non-universality is anyway concentrated in the other two latents, the overall effect of  8 on non-universality will be strongly subdominant and we do not consider it further.
We next consider the drivers of non-universal behaviour in the case of the mapping latent shown in the middle panel of Fig. 10.The mapping latent carries more significant information about non-universality at high  −1 (i.e.high mass).As discussed in Section 3.3, this latent carries information about Ω m , d log  d log  and  eff .From the perspective of non-universality, we find that the information carried by Ω m is near-identical to that carried by the latent.To verify this explicitly, the dotted line shows that once Ω m is conditioned on (in addition to the other latents), the non-universality information in the mapping latent is close to zero.
The bottom panel shows the information about non-universality in the non-universal latent.As previously noted, this dominates the non-universal information, especially at low  −1 (i.e.low mass).Previously in Fig. 6, we showed that the information in the nonuniversal latent is strongly related to the recent growth history after matter-dark-energy equality; the conditional MI between the latent and the growth function peaks at  late ≃ 0.11.We accordingly find that the growth factor at this particular redshift largely encodes the non-universal information content; the dashed line in Fig. 10 traces a similar dependence on  −1 .However, unlike for the other two latents, here we find there is a substantial difference between the information in this leading order dependence and the information in the latent itself.
To understand more fully what is driving the non-universality, Fig. 11 shows the MI between the non-universal latent and the most relevant cosmological parameter combinations (conditioned on the other two latents).We show this separately for two IVEs trained independently on Tinker HMFs, which are universal by definition, and Aemulus HMFs.This allows us to assess the amount of encoded non-universal information.We find that the  ( late ) encoding is a better description of the non-universal aspects compared to the  eff parametrization used by Ondaro-Mallea et al. (2022).The non-universal latent is also highly correlated with matter-darkenergy equality, but this does not maximize the information content as  ( late ) does.By contrast, the Tinker latents contain only a small fraction of the information in the Aemulus latents about these quan- Figure 10.Conditional MIs between  ( ) and various quantities, for interpreting the non-universal information content of the latents.The MIs are shown as functions of  −1 , which increases with mass.From top to bottom, the panels show the universal, mapping and non-universal latents respectively, in each case conditioned on the remaining two latents.The leading order dependencies for the latents are encoded respectively by  8 , Ω m and  ( late ), as illustrated by the dashed lines.By further conditioning the MI between the latent and  ( ) on these leading-order dependencies, we obtain the grey dotted lines, demonstrating that there is then little information remaining.In the case of the non-universal latent (bottom panel), we need to condition on  eff as well as  ( late ) to reach this point of information saturation.
tities.The cosmological parameter dependence of  ( late ) can be approximated using the results of Linder (2005) as where Ω m ( z) is the matter density parameter at redshift z ≃ 0.05, and  ≃ 0.55 + 0.02(1 + ) for  < −1 and  ≃ 0.55 + 0.05(1 + ) otherwise.The matter density parameter in turn depends on its value at  = 0, and .Equation ( 9) is derived assuming that the growth rate is well-approximated by Ω m ()  in CDM.The derivation of equation ( 9) and a demonstration of its accuracy are given in Appendix A.
In addition to these growth-related indicators, we also find that the non-universal latent adds to the other latents significant information about the effective number of neutrino species,  eff .However, the added information on  eff is subdominant to that about the growth Conditional MI between the non-universal latent and X given the other two latent variables, where X is each of the cosmological quantities on the -axis.We compare the MI of latents of two IVE models, separately trained on Tinker (grey) and Aemulus HMFs (orange); the former does not contain any non-universal information by construction.The non-universal latent has high information on parameters associated with the recent growth history, as well as  eff .The parameter  eff is the effective growth rate parametrization used by Ondaro-Mallea et al. (2022).
history.Returning to the bottom panel of Fig. 10, the dotted line shows that once  eff and  ( late ) are conditioned upon, there is basically no information left about non-universality.

Connection between mass definition and non-universality
Our results apply to dark matter halo abundances defined using the  200b mass definition; this choice is motivated by the spherical collapse model and is commonly used (together with the 500  c overdensity definition) in cluster cosmology analyses.Previous work has shown that non-universality varies depending on the mass definition or equivalently, the mass distribution out to different physical scales (Despali et al. 2016;Ondaro-Mallea et al. 2022); e.g. the 500 c overdensity definition resolves only the inner parts of haloes, 200 b the virial region, and the splashback mass the outer region (Diemer & Kravtsov 2014;Adhikari et al. 2014;More et al. 2015;Diemer 2020).
It has been shown that mass definitions probing the inner regions of haloes (500  c ) tend to exhibit more non-universality than those that probe out to the outskirts of haloes (splashback) for cluster-sized haloes (Diemer 2020).Our result that non-universality is tied to latetime growth can explain this finding from a physical perspective, since the inner regions of haloes are not as sensitive to late-time growth as the outer-regions.In this picture, we postulate that nonuniversality can originate because a given halo mass definition (in our case  200b ) artificially truncates cosmological information contained in the halo accretion history (in our case the late time growth).Conversely, in the case of a mass definition sensitive to the accretion history relating to the outskirts (and hence late time growth), that information would already be incorporated into the mass definition, meaning that the HMF would not need to be reparametrized to account for the extra accretion physics.We therefore expect a mass definition enclosing more of the density profile (and hence more information about accretion history) to lead to a more universal HMF.
This picture ties to our previous results in Lucie-Smith et al. ( 2024), which showed that the universality in the density profiles of haloes may originate from a universality in the halo accretion histories themselves.Further, our interpretation implies that the larger amounts of non-universality seen in the HMF at higher redshifts (Diemer 2020) arise due to the fact that haloes at higher redshifts incorporate less of their accretion history within any given mass definition compared to  = 0.This in turn may explain the close connection between the mass function across redshift and across cosmologies utilised by previous work including the cosmology-rescaling technique (Angulo & White 2010).
In addition to the effect of actual increase in mass through accretion, the mass of haloes can also appear to undergo a spurious 'pseudo-evolution' simply due to redshift evolution of the background density used in the mass definition (Diemer et al. 2013).We may therefore expect that this non-physical effect related to the overdensity definition links to Ω m , and indeed we see that the nonuniversality encoded in the mapping latent is described by this parameter.

DISCUSSION
We have used deep learning to generate new insights into accurately modelling the HMF and understanding universal and non-universal information encoded within it.We trained a deep learning framework (IVE) to compress linear matter power spectrum (and, optionally, growth function) information into a low-dimensional latent space, that is then decoded into a halo mass function.The training set was generated from the existing Aemulus emulator (McClintock et al. 2019).We then interpreted the latent representation using MI to extract knowledge on the physical quantities required to model the HMF in Aemulus.We summarize our main findings below.
(i) The linear growth function  () does not carry information about the present-day HMF that is independent of (,  = 0) in the CDM+ eff parameter space.While  () can play a physical role in halo growth, our architecture can extract relevant aspects of the growth history from (,  = 0).This is not possible when using (,  = 0), since over a finite radius range the latter is a lossy compression of the former.
(ii) Three independent latent variables are necessary and sufficient to predict the HMF at  = 0 to ≤ 0.25 per cent residuals for  = 10 13.2−15 ℎ −1 M ⊙ in the CDM+ eff parameter space.In other words, the IVE with three latents reproduces Aemulus essentially perfectly, since Aemulus itself produces halo mass functions with slightly better than 1 per cent accuracy.
(iii) One latent variable primarily encodes universal information related to the mass variance ().It has a cosmological parameter dependence Ω 0.46 m  8 very similar to  8 .This can be predicted from the formation history and mass fluctuation variance of high-mass haloes.
(iv) The second latent variable encodes the factor Ω 0.35 m d log  d log  .This corresponds closely to the 'mapping' factor within equation ( 8) responsible for mapping from the multiplicity function  () to the HMF.The power of Ω 0.35 m corresponds to evaluating the mapping at the formation time of low-mass haloes in our sample.The latent also carries some information about the effective neutrino number  eff .This latent carries information about non-universality (through its Ω m dependence) at the high-mass end of the HMF, i.e.  ∼ 10 15 M ⊙ ℎ −1 .
(v) The third latent variable primarily encodes non-universal information about the low-mass end of the HMF, i.e.  ∼ 10 13 M ⊙ ℎ −1 .This non-universal information can be parametrized by the linear growth function after matter-dark-energy equality, specifically at  late = 0.11, with a contribution also from  eff .
Recently, Ondaro-Mallea et al. (2022) studied simulations designed to isolate the role of the linear growth rate and that of the local slope of the power spectrum.To do so they considered cosmologies with values of Ω m and   , respectively, that are well outside current data constraints.They demonstrated that a measure of growth rate in the recent past, which corresponds to the growth rate at  ≃ 0.4, is related to non-universality.They additionally suggested a smaller dependence of non-universality on the slope of the power spectrum; this dependence was later confirmed by Euclid Collaboration: Castro et al. (2023) using a similar methodology with extreme spectral tilts.
Conversely, our conclusions on the origin of non-universality are drawn across the relevant cosmological parameter space for forthcoming surveys, without the need to construct extreme test cosmologies.We find a strong role for recent growth in establishing non-universality, but we find that growth since matter-dark-energy equality (and more specifically, at  ≃ 0.1) characterizes its role significantly better than the proxy suggested by Ondaro-Mallea et al. (2022).We do not find the local slope of the power spectrum or   to play any significant role in non-universality at  = 0 within the currently viable cosmological parameter space we consider.
Overall, our deep learning technique has allowed us to segregate and quantify universal and non-universal aspects of information in the HMF, and to explain these aspects in terms of physical proxies related to cosmological halo formation.This approach builds on the finding in the existing literature that non-universality is related to growth history (Courtin et al. 2011;Ondaro-Mallea et al. 2022); however, the existing literature is tied to the specific EPS-motivated functional form, whereas our more general approach allows us to gain physical insights that are challenging to obtain through empirical fitting formulae.
In this paper, we focused on training IVE models to predict the HMF at a single redshift  = 0. Since the HMF exhibits strong nonuniversality with redshift evolution (Reed et al. 2007;Tinker et al. 2008;Bhattacharya et al. 2011;Diemer 2020) and forthcoming surveys will observe clusters out to  = 2 (Euclid Collaboration: Adam et al. 2019;Cerardi et al. 2024), future work will need to model both the cosmology and the redshift dependence of the HMF (e.g.Mc-Clintock et al. 2019;Bocquet et al. 2020;Artis et al. 2021).The IVE framework we have presented in this paper can be straightforwardly extended to address this problem by adding redshift as an additional query.Our framework can also be applied to a range of other training data, including HMFs generated by other emulators such as Dark Emulator (Nishimichi et al. 2019) and Mira-Titan (Bocquet et al. 2020), or directly to numerical simulations.Such generalizations would test how our findings extend beyond the domain of validity of Aemulus.Further, comparing the latent spaces of IVEs trained on HMFs using different halo definitions will allow us to explicitly test the connections between information in halo accretion histories and HMF non-universality discussed in Section 3.5.
The latent representation that IVE models learn can be applied to assist with the design of emulator training sets.The training sets of existing emulators are generally designed to uniformly cover the cosmological parameter space (e.g.Heitmann et al. 2016;DeRose et al. 2019).However, this approach does not take into account which directions within the space are more or less informative.By contrast, the three-dimensional latent space of the IVE corresponds to independent factors governing changes in the HMF.Therefore, by sampling this 3D latent space more uniformly, the accuracy of existing emulators could be improved using a small number of additional training simulations.We propose that instead of adding simulations to uniformly sample the cosmological parameter space with higher density, simulations can be added such that the 3D latent space is more uniformly covered.Taken together with our proposed extensions to redshift dependence, alternative cosmological parameter spaces and halo definitions, the IVE-based approach thus promises new insights that will help tackle the crucial problem of predicting accurate theoretical HMFs for forthcoming data analyses.(ii) Obtain the eigenvectors and eigenvalues  2  of the concatenated chain by centring the data and performing principle component analysis; the eigenvalues give the variance in the direction of each eigenvector.
(iv) Project the Latin hypercube into the cosmological parameter space.Each dimension of the Latin hypercube is projected onto a single eigenvector, and the range for each dimension is scaled to [−3, 3] ×   .The MCMC chains vary only 6 cosmological parameters.The missing parameter,  eff , is obtained from the 7th dimension of the hypercube, scaled appropriately to the minimum and maximum  eff values in Aemulus training cosmologies.The aemulus training points were also obtained in this way.
(v) Discard all samples lying outside the convex hull of Aemulus training cosmologies.This eliminates the possibility for our network to have trouble learning the HMF due to incorrect ground truths produced by extrapolating the emulator outside its domain of validity.
The procedure gives us a list of 628849 cosmological parameter samples, from which we randomly sample 10 5 to produce our dataset.We verified that 10 5 samples saturated the accuracy of the IVE.
As described in Section 2.1, Aemulus predicts the HMF by emulating fitting parameters to the multiplicity function  (), and converts this into the HMF by calculating () using CLASS (Lesgourgues 2011; Blas et al. 2011).During our investigation, we have found that Aemulus misconfigured CLASS such that the power spectrum used to evaluate () is always calculated using  = −1, even for  ≠ −1 cosmologies.Because this misconfiguration was already present when training Aemulus, this does not affect the overall accuracy of Aemulus, but it does result in a correlation between the emulator's residuals and .We find for  > −1, the emulator underpredicts the HMF especially at the high mass tail; the opposite is true for when  < −1.We expect this to have only a small effect on our results relating to non-universality, as this issue affects high mass haloes, whereas we find that non-universality affecting the high mass haloes is mostly related to Ω m rather than .The recent growth history, on which  has a stronger effect, affects mostly the nonuniversality for lower mass haloes where the residuals of Aemulus are small across the whole parameter space.

B2 IVE architecture
The IVE architecture consists of an encoder that compresses the input () (and, optionally,  ()) into a low-dimensional latent representation, and a decoder which maps the latent representation and a given halo mass  to the corresponding differential halo number density d/d log .
The IVE takes as input a 1D array of either () at  = 0, or a concatenation of () with  ().This input x is compressed into a low-dimensional latent representation by the encoder, which consists of a neural network with 6 fully-connected layers.Each fully-connected layer has 512 neurons, and each neuron follows  = ℎ(w • x + b), where w and b are the weights and biases that are optimized when training the network, and ℎ is the activation function.We use leaky ReLU activation (Maas et al. 2013) for all hidden layers except the last layer, for which we use a tanh activation to ensure smoothness of the encoder output.Weights are initialized with the He et al. (2015) uniform initialization scheme for layers with leaky ReLU activation, and with Glorot & Bengio (2010) uniform initialization for the layer with tanh activation.
The encoder maps from the input x to a multivariate distribution in the latent space (|x), where  denotes the latent representation.(We use  instead of the conventional z to avoid possible confusion with redshift.)The encoder approximates the latent distribution by a factorized Gaussian, (|x) =   N (  (x),   (x)), where  is the dimensionality of the latent space and   ,   are the means and standard deviations of the Gaussian distribution for each component of the latent representation.Using this assumption, the encoder maps the inputs to   and   .
The decoder first samples a latent vector  from the encoded Gaussian distribution.This, together with a query log , is passed into another neural network with 6 fully-connected layers, with the same number of neurons per layer, the same activation functions, and the same initialization schemes as the encoder.The output from the decoder is an estimate of log d/d log  at the given halo mass.

B3 The regularization term in the loss function
The IVE loss function was discussed in Section 2.2.1.Here we consider further the second term in equation ( 4), which is the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951), measuring how close the latent distribution returned by the encoder (|x) is to a prior distribution ().We choose as the prior distribution a diagonal standard Gaussian () =   N (  = 0,   = 1).This leads to a closed form for the regularization term: The diagonal standard Gaussian prior encourages an interpretable latent space: it ensures continuity and completeness in the latent space since all samples are mapped to a localized region in the latent space (see Burgess et al. 2018), and it promotes independence between latent variables since the marginal latent distribution () = ∫ (|x) (x)dx is encouraged to be diagonal.This causes each latent variable to learn an independent factor of variation governing the HMF, which is crucial for the latent representation to be interpretable.

B4 Training procedure
In Section 2.2.2 we outlined our use of a -annealing scheme to improve the prediction accuracy while maintaining a disentangled latent space.Here, we give more details on this procedure.We used a training data batch size of 200 and a learning rate of 5 × 10 −4 , and varied  as a function of epoch  according to a generalized logistic function of the form where   and   are the initial and final values that  asymptotes to,  controls the rate of decrease, and  c is the epoch where the rate of decrease in  is maximum.
We performed experiments with different   ,   , , and  c , and chose as our final models those with (i) a prediction accuracy that is comparable to when the model is trained with  = 0, and (ii) the lowest possible MI between latent variables.In particular, we ensured that the MI between latents remained below the MI between the non-universal latent and the ground truth HMF.
To achieve this, we performed a three-stage -annealing procedure.In the first stage, we set   = 3.6 × 10 −5 ,   = 10 −8 ,  = 0.001 and  c = 2000.With this set of parameters, we were able to achieve an accuracy comparable to that of a three-latent IVE model trained with  = 0, but at the expense of insufficient disentanglement.In particular, the MI between the non-universal latent and the other latents was higher than that between the former and the ground truth HMF.We performed a second and third stage of annealing to improve both the model accuracy and the latents' disentanglement.In the second stage, we (i) resumed the training starting from epoch 5000, where the accuracy matched that of a two-latent IVE model and the MI between all latents was < O (10 −4 ) nats, and (ii) used   = 1.4 × 10 −6 ,   = 10 −7 ,  = 0.0015 and  c = 6000.This set of parameters ensured that () decreased at a slower rate than in the first stage, which turned out to be crucial to maintain the disentanglement.In the third stage, we (i) resumed training at epoch 6000, where the prediction accuracy was again comparable to a twolatent IVE but the MI between the latents was ∼ O (10 −4 ) nats, and (ii) used   = 8.485 × 10 −7 ,   = 10 −7 ,  = 0.002 and  c = 9000.We trained the model for a further 6250 epochs until  plateaued to its   value.This three-stage process allowed us to finally achieve a sufficient level of disentanglement while reaching high prediction accuracy.
Once the -annealing procedure finished, we further refined the prediction accuracy of the model to ensure it is comparable to the same model trained with  = 0. We increased the batch size from 200 to 500 and halved the learning rate, and continued to double the batch size and halve the learning rate every time the loss on the validation set ceased to improve over 40 training epochs, until a maximum batch size of 8000 (limited by GPU memory).This last training phase improved the accuracy without affecting the level of disentanglement;  was kept constant at 10 −7 during this stage.
Decreasing the learning rate reduces the possibility that the optimizer will overstep the minimum (Alsing et al. 2020), while increasing the batch size increases the accuracy of gradient direction estimates.
Our models reached convergence in 311 additional epochs when the validation loss saturated.In total, training the IVE model (including all -annealing and prediction accuracy refinement) lasted 12561 epochs.
This paper has been typeset from a T E X/L A T E X file prepared by the author.

Figure 1 .
Figure1.The interpretable variational encoder (IVE) consists of an encoder compressing the inputs (which can be the linear matter power spectrum  ( ) at  = 0, or  ( ) concatenated with the linear growth function  ()) into a low-dimensional latent representation, followed by a decoder that maps the latent representation and a given halo mass log  to the output halo number density d/d log .The latent representation encodes all the information required to predict the HMF at  = 0 into a set of independent latent components, which can then be interpreted in terms of their cosmological content.

Figure 3 .
Figure 3. Mean and 95 per cent confidence interval of the HMF prediction residuals for IVE models trained using different latent dimensionalities (upper panel), and different inputs (lower panel).Grey shaded bands indicate ±0.2 per cent.Upper panel: Three latent variables can predict the HMF with similar accuracy as seven latent variables (the maximum number expected).Decreasing to two latents significantly worsens the prediction accuracy.Lower panel: Adding  () to the inputs does not improve the prediction accuracy compared to training on  ( ) alone.

Figure 4 .
Figure 4. Upper panels: Variations in the predicted HMF when systematically varying the value of one latent variable, while keeping the others fixed.Each panel from left to right varies the universal, mapping and non-universal latents, respectively.Lower panels: Fractional differences in the predicted HMF with respect to that of the reference mean cosmology (defined in Section 2.1.2),again when varying each latent variable, keeping the others fixed.This gives an indication of a lower limit to the magnitude of each latent's effect across the parameter space as a whole.These independent aspects of the halo mass function were discovered automatically by the IVE during training.

Figure 5 .
Figure5.Left panel: MI between each latent variable and the ground-truth HMF in nats.Solid lines and shaded bands indicate the mean and standard deviation in the MI as estimated by GMM-MI(Piras et al. 2023).Middle panel: MI between each latent variable and the ground-truth HMF, conditioned on the other two latent variables.Right panel: MI between each latent and  ( ) as function of  −1 , showing the amount of non-universal information (i.e.information about variation in  ( ) as a function of  (  )) captured in each latent variable.The grey line shows as reference the MI between latent variables and the universal Tinker mass function's  ( ) which, as expected, is consistent with MI = 0 and thus has no non-universal information.

Figure 6 .
Figure 6.Conditional MI between each latent variable and the growth function  () (normalized such that  (0) = 1) given other latent variables.The black dashed line indicates  late = 0.11, and the red shaded region shows the range of the redshifts of matter-dark-energy equality  m=DE in the training set.

Fig
d log  carries physical meaning from which the 'mapping' latent nomenclature originates.The halo mass function in equation (1) can be written as d d log  ∝ Ω m d log  d log   () ,

Figure 7 .
Figure 7. Cosmological parameter dependence of the 'universal latent'.Upper panel: the universal latent shows a near-perfect correlation with Ω 0.46 m  8 ; this is close to the standard  8 ∝ Ω 0.5 m  8 which is also known to affect the abundance of high mass haloes.The small residual scatter between the two is correlated with Ω m .Lower panel: Ω m () as a function of redshift  for the range of cosmologies used in the training set (blue band).The red band indicates the half-mass formation redshifts  form of high-mass haloes estimated from Correa et al. (2015), corrected for the  200b mass definition.The black dashed line indicates Ω 0.46 m .The fact that Ω m ( form ) ≈ Ω 0.46 m indicates that the latent captures the amplitude of fluctuations at the formation time of the high-mass haloes.

Figure 8 .
Figure 8. Cosmological parameter dependence of the 'mapping latent'.Upper panel: the mapping latent is strongly correlated with Ω 0.35 m

Figure 9 .
Figure9.Top row: MI between the universal latent and different functions: the linear matter power spectrum  ( ) (left), the growth function  () normalized to unity at  = 0 (middle), and the ground truth HMF (right).In pink dashed line, we show the same MI quantities but using Ω 0.46 m  8 instead of the universal latent.Bottom row: MI between the mapping latent and the same three functions, conditioned on  eff .In pink dash-dotted line, we show the same conditional MI quantities but using Ω0.35 , Non-universal latent| N eff , D(z late ), other latents) Non-universal latent D(z late ) MI( f (σ ), X | other latents) [nats] Figure11.Conditional MI between the non-universal latent and X given the other two latent variables, where X is each of the cosmological quantities on the -axis.We compare the MI of latents of two IVE models, separately trained on Tinker (grey) and Aemulus HMFs (orange); the former does not contain any non-universal information by construction.The non-universal latent has high information on parameters associated with the recent growth history, as well as  eff .The parameter  eff is the effective growth rate parametrization used byOndaro-Mallea et al. (2022).
Figure A1.A demonstration that the approximate expression for  ( late ) given in equation (A3) closely approximates the numerically computed values of  ( late ) across our parameter space.

Figure B1 .
Figure B1.Cosmological parameter space covered by the dataset we use to train and test our IVE models (blue).Black points indicate the cosmologies used to train Aemulus (McClintock et al. 2019).
Smith et al. 2024)we demonstrated the utility of MI in the physical interpretation of IVE latent spaces.In this work, we introduce an additional tool for interpretation in the form of conditional MI.This measures the MI between two latent variables conditioned on knowledge about additional variables; in other words, conditional MI quantifies how much information is shared amongst two variables once the conditioned variables are already known.It is defined for three variables as (Piras et al. 2023)2013)f two variables are statistically independent; see e.g. the review byVergara & Estévez (2013).Estimating MI requires approximating the joint and marginal distributions in equation (6).To do this, we use the GMM-MI (pronounced 'Jimmie') package(Piras et al. 2023), which fits a Gaussian mixture model to the distribution of  and  samples.This simultaneously provides the joint and marginal densities.The package then evaluates the integral in equation (6) using Monte Carlo integration.It also estimates the error in the MI estimate due to finite sample size by bootstrapping.The final estimate returned is the mean and standard deviation obtained from the bootstrap.In previous work (Lucie-Smith et al. 2022; Lucie-MI(,  | ) = ∭ (, , ) log () (, , ) (, ) (, ) d d d ,