CONNECT: A neural network based framework for emulating cosmological observables and cosmological parameter inference

Bayesian parameter inference is an essential tool in modern cosmology, and typically requires the calculation of $10^5$--$10^6$ theoretical models for each inference of model parameters for a given dataset combination. Computing these models by solving the linearised Einstein-Boltzmann system usually takes tens of CPU core-seconds per model, making the entire process very computationally expensive. In this paper we present \textsc{connect}, a neural network framework emulating \textsc{class} computations as an easy-to-use plug-in for the popular sampler \textsc{MontePython}. \textsc{connect} uses an iteratively trained neural network which emulates the observables usually computed by \textsc{class}. The training data is generated using \textsc{class}, but a novel algorithm for choosing favourable points in parameter space reduces the required number of \textsc{class} evaluations by two orders of magnitude compared to a traditional inference run. Once \textsc{connect} has been trained for a given model, no additional training is required for different dataset combinations, making \textsc{connect} many orders of magnitude faster than \textsc{class} (and making the inference process entirely dominated by the speed of the likelihood calculation). For the models investigated in this paper we find that cosmological parameter inference run with \textsc{connect} produces posteriors which differ from the posteriors derived using \textsc{class} by typically less than $0.01$--$0.1$ standard deviations for all parameters. We also stress that the training data can be produced in parallel, making efficient use of all available compute resources. The \textsc{connect} code is publicly available for download at \url{https://github.com/AarhusCosmology}.


Introduction
For the past two decades the method of choice for cosmological parameter estimation has been based on stochastic optimisation techniques, typically Markov-chain Monte Carlo (MCMC) methods. These methods have the advantage that they are very robust and do not require derivatives of the cost (likelihood) function. They also scale easily to large numbers of parameters, which allows for a simple treatment of nuisance parameters. However, a major disadvantage is that a single calculation of the cost function in cosmology can be very expensive because it requires a full solution of the Einstein-Boltzmann equations of linear perturbation theory (and perhaps even a calculation of non-linear corrections). Such a computation typically takes tens of seconds on a single CPU core, and does not parallelise well beyond 10 cores. A fully converged MCMC run typically requires between $10^5$ and $10^6$ solutions of the Einstein-Boltzmann solver, so the total computation time can easily reach days or weeks, in particular for more complex cosmological models. Furthermore, a new MCMC run must be performed either when a new cosmological model is required or when new data is added. In the latter case, which is common in modern applications, a complete analysis with several dataset combinations can become numerically prohibitive.
The purpose of the present paper is to remedy this through a new framework for emulating cosmological observables based on machine learning via neural networks (NN) which we call connect (Cosmological Neural Network Emulator of Class using TensorFlow). We demonstrate, via a new plug-in written for the publicly available MontePython MCMC code [1,2] that we can reduce the time required for a full MCMC run to hours rather than days or weeks. A similar plug-in for the code Cobaya [3] has also been implemented. connect assumes a cosmological model but is independent of the targeted dataset, and separates itself from other Einstein-Boltzmann emulators by allowing for user-friendly plug-and-play generation of a neural network emulator for any cosmology that the user may want to investigate, the only requirement being a working class implementation. With a simple Boolean input argument supplied, we have modified MontePython to automatically generate training data with class, train a neural network emulator to sufficient precision and conduct the MCMC analysis using this emulator, with a very significant decrease in total computation time relative to simply running an MCMC analysis directly with class.
The idea of using machine learning, and specifically neural networks, to speed up computations in cosmology has existed for several years. Much focus has been on emulating N-body codes (e.g. [4]) due to them being massively time consuming. Since the training data is expensive to generate, the field of emulating N-body codes is markedly data starved. Accordingly, the machine learning tools usually employed in that context include Gaussian processes [5,6] and polynomial chaos expansion coupled with principal component analysis [4]. However, when large data samples are available, and especially when the dimensionality is large, neural networks are often superior to other supervised learning strategies (as evident, for example, in the recent dominance of neural networks in the ImageNet Large Scale Visual Recognition Challenge [7]), and are therefore the obvious choice of strategy for emulating Einstein-Boltzmann codes, which are many orders of magnitude faster at generating data than N-body codes.
Examples of use of neural networks in emulation of Einstein-Boltzmann codes date back to the early CosmoNet [8,9], and the approach has since been revisited on numerous occasions with various target variables. CLASSNET [10,11] is embedded in class and learns the source functions, reducing the time required to solve linear perturbation equations. Ref. [12] targets LSS angular power spectra, whereas ref. [13] learns the linear matter power spectrum, both using neural networks. More recently, CosmoPower [14] emulates class computations of CMB spectra with temperature, polarisation and lensing anisotropies, as well as the matter power spectrum. connect, contrary to these works, emulates a wide range of customisable outputs: The user simply defines the desired class output variables in an input file, and the connect framework automatically generates a network emulating these.
Although Einstein-Boltzmann solvers generate data much faster than their N-body siblings, the total time required for the combined process of gathering data, training a network from it and performing parameter inference with the network is still dominated by the data generation. To optimise the emulation scheme, it is therefore most vital to improve on the data gathering method, e.g. by optimising the amount of information extracted from each Einstein-Boltzmann computation or by generating training data in the most important points of the cosmological parameter space. These topics fall under the machine learning field of active learning [15][16][17]. Most early active learning algorithms focused on selecting new data at regions of large uncertainty of the emulator in so-called uncertainty sampling. Individual active learning algorithms in uncertainty sampling typically differ on how they approximate the uncertainty of the emulator. Query-by-committee [16] algorithms estimate the uncertainty as the spread in predictions from a set of learners trained on the currently available dataset, whereas expected model change approaches [16], such as the expected gradient length algorithm [18], select new data that optimise an approximate expected improvement of the emulator; e.g. where the new training gradient has the largest magnitude in the case of expected gradient length sampling. In the case of a fully-connected neural network emulator, however, a measure of network uncertainty is not readily available. Neural network uncertainties can be naturally estimated using architectures such as Monte Carlo dropout [19] or Bayesian neural networks [20], but given the scope of the paper, we leave such endeavours for future investigation.
Furthermore, an important distinction between the classical active learning applications and the one at hand is that in addition to minimising the global emulator uncertainty, we are especially interested in minimising the error in the regions of parameter space that correspond to cosmologies of large likelihood. This duality, i.e. selecting data where (i) the likelihood value is large and (ii) the current emulator uncertainty is large, has been explored previously in the context of cosmological inference. Particular examples of such active learning strategies in cosmology include ref. [21], in which a Gaussian process is used to emulate the likelihood function, from which new data can be selected based on their weighting according to some balance of the Gaussian process uncertainty and its current estimate of the likelihood at the proposed points. A similar approach was adopted in ref. [22] in an iterative fashion, as well as in ref. [23], where it was found that the exploratory behaviour is increasingly important in many-dimensional problems. However, with Einstein-Boltzmann emulators, the overhead introduced by the Gaussian processes in the Bayesian optimisation may cancel this gain in efficiency, since Gaussian processes are known to scale disadvantageously with the size of the training data [24]. Additionally, batch acquisition can be non-trivial, and the optimisation of the acquisition function itself contributes considerable overhead, rendering such more advanced active learning methods useful only when the generation of data is slow, e.g. for N-body simulations. Indeed, it was shown in [23] that the overhead involved in these acquisition methods often becomes the computational bottleneck when emulating the relatively fast Einstein-Boltzmann solvers.
In this work, we present an iterative data-generating procedure that, with little overhead, concentrates data generation around regions of large likelihood while retaining enough spread to reduce uncertainty far from the maximum-likelihood point. Our algorithm produces parameter space samples with an MCMC chain run by a neural network iteratively trained on the same points, including a method of protecting against spuriously oversampled regions. With this, we find vastly increased emulator accuracy relative to a standard Latin hypercube sampling of training data. We developed the framework with the notoriously difficult posterior of decaying cold dark matter [25] as a reference, and tested it blindly on a ΛCDM model with variable neutrino mass and degeneracy parameter, on both of which it performs excellently.
This paper is structured as follows. In section 2 we specify the design of the neural network architecture employed in connect, and in section 3 we describe the novel iterative algorithm for placing training data at advantageous points in the parameter space. In section 4 we describe the use of connect through MontePython and present resulting MCMC analyses, using connect, for the decaying cold dark matter and massive neutrino cosmological models. Finally, we discuss and conclude on our findings in section 5.

Neural network design
The method used for the emulation is a fully connected deep neural network consisting of an input layer, multiple hidden layers, and an output layer (see e.g. [26] for a recent overview). The input layer consists of the cosmological parameters from which we would like to extract an output, i.e. any numeric parameter that the Einstein-Boltzmann solver code class takes as input. The hidden layers have a much larger dimension of several hundred nodes each, in order to create enough trainable weights for the network to capture the correct behaviour of the class computations. The output layer consists of all the specified spectra and output parameters one wishes to emulate: any output that class can compute (CMB spectra, matter power spectra, background functions, thermodynamical parameters, and derived parameters).
The first step is to gather training data for the network, which requires a method for sampling in the space of cosmological input parameters; the construction of this method is discussed further in section 3. When the sample of input values has been constructed, we use class to calculate the specified output values for each point in the sampled data. These are combined into a single output array for each point, while the cosmological input parameters are put into a single input array for each point. Together, the sets of input arrays and output arrays constitute our training data.
Using the TensorFlow framework [27], we train the network on the training data for a specified number of epochs, where an epoch refers to one full pass over the entire training dataset. Each network used for our results has been trained for 300 epochs with batch sizes of 512. For each batch, the loss function is minimised using a specified optimisation algorithm (we have used the ADAM optimiser [28] for this work and as the default in connect), which slightly adjusts the weights of the network as the errors are propagated backwards. The network is then ready for the next epoch, where the whole procedure is repeated so that the network performs better with each epoch.
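As a concrete illustration of the weight update performed after each batch, the following is a minimal sketch of a single ADAM update step in plain NumPy; the variable names and hyperparameter defaults are illustrative, not connect's exact training configuration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update of the weights w given the gradient of the loss."""
    m = beta1 * m + (1 - beta1) * grad       # running first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2    # running second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy demonstration: minimise f(w) = (w - 3)^2
w = np.array(0.0)
m = v = np.zeros_like(w)
for t in range(1, 3001):
    grad = 2 * (w - 3)                       # "backpropagated" gradient
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

In practice all of this is handled by TensorFlow's built-in optimiser; the sketch only makes explicit what "slightly adjusting the weights" means per batch.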

Network architecture
An Einstein-Boltzmann solver can be seen as a function mapping the cosmological parameters into a set of observables such as the CMB anisotropy coefficients or the linear matter power spectrum. Since this mapping can be very general, the most conservative neural network structure to employ is a fully connected, feed-forward, deep neural network [26]. This involves several hidden layers where each node in a layer is connected to each node in the next layer. For all results in this paper we use 6 hidden layers with 512 nodes in each, inspired by the architecture in ref. [14]. Too few nodes and layers restrict the ability of the network to emulate the desired computation, while too many both make it prone to overfitting [26] and require larger datasets. We find that our chosen values evade both of these concerns, and by varying these network parameters slightly, we find only modest changes in the network performance. We therefore leave a more thorough investigation of the optimal network architecture for future work. However, one soft requirement is that the evaluation time of the connect architecture must not exceed the evaluation time of typical likelihood codes such as Planck [29] or Planck lite [30]. We have conducted rough benchmarking of the likelihood codes and the connect evaluation time, and find that at around 12 layers (∼3 × 10^6 trainable parameters), the evaluation time of connect becomes greater than that of Planck lite, giving an approximate upper bound on the architecture complexity. Furthermore, since connect allows the user to choose the outputs to emulate, one should keep in mind that the ideal architecture will vary with the size of the output, with larger outputs naturally requiring larger network complexity.
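The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below counts the trainable weights and biases of a fully connected feed-forward network; the input and output sizes (6 and 1000) are illustrative choices, not the exact dimensions used for the results in this paper:

```python
def n_trainable(n_in, n_hidden, n_layers, n_out):
    """Total weights + biases of a fully connected feed-forward network."""
    n = n_in * n_hidden + n_hidden                   # input -> first hidden
    n += (n_layers - 1) * (n_hidden**2 + n_hidden)   # hidden -> hidden
    n += n_hidden * n_out + n_out                    # last hidden -> output
    return n

# connect's default architecture: 6 hidden layers of 512 nodes
print(n_trainable(6, 512, 6, 1000))    # roughly 1.8 million parameters
# around 12 hidden layers the count approaches ~3e6, matching the
# evaluation-time bound quoted in the text
print(n_trainable(6, 512, 12, 1000))
```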
For example, Einstein-Boltzmann solvers such as class do not evaluate the C_ℓ coefficients for every ℓ, but rather at a reduced set of approximately 10^2 ℓ-values, from which the full set of C_ℓ coefficients is constructed by interpolation. This significantly reduces the output dimension, and we have consequently chosen the set of ℓ-values directly computed by class for the output layer.
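The reconstruction of the full multipole range from the sparse set of computed ℓ-values can be sketched as a simple interpolation. The sparse grid and the toy spectrum below are hypothetical, and linear interpolation is used for simplicity; class itself uses its own grid and spline interpolation:

```python
import numpy as np

# hypothetical sparse grid of ~10^2 multipoles (logarithmically spaced)
ell_sparse = np.unique(np.geomspace(2, 2500, 100).astype(int))
cl_sparse = 1.0 / (ell_sparse * (ell_sparse + 1.0))   # toy spectrum shape

# interpolate onto every integer multipole up to ell = 2500
ell_full = np.arange(2, 2501)
cl_full = np.interp(ell_full, ell_sparse, cl_sparse)
```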

Choosing a loss function
When training a neural network, one always has to make choices regarding the optimisation of the network. First of all, we need a way of quantifying how well the output from the network fits the desired output from the training data: the loss function [26]. A simple choice for a loss function would be the widely used mean squared error (MSE),

$$\mathcal{L}_{\rm MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2, \qquad (2.1)$$

where x is the output from the network and y is the corresponding output from the training data. This loss function ensures that the network performs equally well on every output node and is thus the apparent choice if we are to remain agnostic about our network. There are, however, situations where this approach is not optimal, and the CMB spectra are an example hereof. Measurement errors on the CMB temperature and polarisation power spectra are a combination of cosmic (sample) variance, noise, and finite beam width (see e.g. [31]). Modern CMB probes, such as Planck, in general provide spectra which are cosmic variance limited (except for B-mode polarisation) effectively out to the maximum ℓ-value measurable with the given beam width, and can therefore be reasonably approximated by assuming cosmic variance out to some maximum ℓ beyond which the error goes to infinity. Using this observation as a guide, we therefore modify equation (2.1) with ℓ-dependent coefficients,

$$\mathcal{L}_{\rm CV} = \frac{1}{N}\sum_{\ell}\frac{\left(x_\ell - y_\ell\right)^2}{\sigma_{{\rm CV},\ell}^2}, \qquad \sigma_{{\rm CV},\ell} = \sqrt{\frac{2}{2\ell+1}}\,y_\ell, \qquad (2.2)$$

with $\sigma_{{\rm CV},\ell} \to \infty$ beyond the maximum ℓ.

Figure 1 shows the CMB spectra of a ΛCDM model as calculated by class and by two neural network models with the different loss functions, L_MSE and L_CV. The figure also includes the errors between the spectra from the neural network models and class, calculated as the absolute difference scaled by the rms-values of the spectra. This choice is made because a plain relative error is misleading when the values of the spectra are close to zero: the relative error would be very large even though the discrepancy is rather small.
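A minimal NumPy sketch of such a cosmic-variance-weighted loss might look as follows; the exact normalisation and cutoff handling in connect may differ, and this only illustrates the ℓ-dependent weighting:

```python
import numpy as np

def loss_mse(x, y):
    """Plain mean squared error over all output nodes."""
    return np.mean((x - y) ** 2)

def loss_cv(x, y, ell, ell_max=2500):
    """MSE weighted by the approximate cosmic variance per multipole.

    Errors beyond ell_max get infinite variance, i.e. zero weight."""
    sigma = np.sqrt(2.0 / (2.0 * ell + 1.0)) * y   # cosmic variance of C_ell
    weights = np.where(ell <= ell_max, 1.0 / sigma**2, 0.0)
    return np.mean(weights * (x - y) ** 2)
```

For a fixed relative error, the weight grows with ℓ, which is exactly the behaviour that shifts accuracy from low to high multipoles.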
Evaluating the loss functions on a single cosmological model may misrepresent their performance over a larger region of cosmological parameter space. We therefore use the neural network models on a set of 20,000 points from a high-temperature MCMC sampling and calculate the error in the same way. Figure 2 shows the 1σ and 2σ percentiles of this set of errors for both loss functions. It is clear from figure 2 that the cosmic variance loss function has the desired effect of improving the accuracy at high ℓ at the cost of accuracy at low ℓ. This in turn improves the results of parameter inference compared to using the mean squared error loss function. When including additional output, such as derived parameters, we need to use a combination of both loss functions, and we also need to assign to all non-CMB output an importance similar to that of the highest ℓ-mode, so as not to get low accuracy on these.

Choosing an activation function
Another choice we need to make is that of an activation function [26]. The traditional choice is the Rectified Linear Unit (ReLU) [32]. However, the main drawback of ReLU is that training can become more difficult due to its derivative being exactly zero for negative input. For our application, we found that the following parameterised ReLU with a smoothing between the positive and negative parts, as suggested in ref. [33], works well:

$$\mathcal{A}(x) = \left[\gamma + \left(1-\gamma\right)\odot\frac{1}{1+e^{-\beta\odot x}}\right]\odot x,$$

where the parameters β and γ control the smoothing and the slope of the negative part, respectively, and ⊙ represents elementwise multiplication. These two free parameters are allowed to be trained alongside the weights of the network, and we can furthermore assign different parameters to each node in a layer, which are then optimised during training. From figure 3, it is evident that this activation function performs better than the simple ReLU activation function.
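Assuming the activation of ref. [33] takes the smoothed, trainable-ReLU form described above, a NumPy sketch reads:

```python
import numpy as np

def activation(x, beta, gamma):
    """Smoothed, trainable ReLU variant: beta sets the smoothing,
    gamma the slope of the negative part (both trainable per node)."""
    sigmoid = 1.0 / (1.0 + np.exp(-beta * x))       # elementwise
    return (gamma + (1.0 - gamma) * sigmoid) * x

x = np.linspace(-5, 5, 11)
y = activation(x, beta=2.0, gamma=0.1)
# large positive inputs pass through almost unchanged (ReLU-like),
# large negative inputs are scaled by roughly gamma
```

In the limit β → ∞ and γ = 0 this reduces to the plain ReLU, which makes the role of the two trainable parameters explicit.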

Normalisation of inputs and outputs
When using an artificial neural network it is beneficial, and often necessary, to consider scaling of the training data [34]. This is especially true in our case, since the input nodes and output nodes vary over several orders of magnitude. If we did not consider this at all, the loss function would only have significant contributions from the larger values, while small numbers, such as the C_ℓs, would have a vanishing impact on the total loss. We are therefore required to address this problem in some manner. There are several ways to do so, including the following:

1. Min-max scaling, where data belonging to each node in the input (output) layer is transformed to the same interval, e.g. [0, 1], using the minimal and maximal values of the data within the node:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}.$$

2. Logarithmic scaling, where the data is transformed to logarithmic space so that values differing by several orders of magnitude come to lie within the same order of magnitude (or a few apart). Since $\log x - \log y = \log(x/y)$, this also ensures that optimisation of the absolute loss in logarithmic space is equivalent to an optimisation of the relative loss, meaning that larger orders of magnitude will not be favoured above smaller orders of magnitude.
3. Standardisation, where data belonging to each node in the input (output) layer is transformed to have zero mean (µ = 0) and unit variance:
$$x' = \frac{x - \mu}{\sigma}.$$

The input arrays in the training data are automatically normalised with standardisation using TensorFlow's own preprocessing.Normalization routine based on a usual batch normalisation scheme [34]. The means and variances are stored as weights in the input layer of the model, and we thus do not need to do anything explicit to the inputs. It is not quite as easy with the output arrays, since no similar routine is available for the output layer. We instead have to normalise the output arrays manually, and we have therefore implemented all three of the above methods in connect.
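The three normalisation options can be sketched as per-node (per-column) transforms over a training set; this is a minimal illustration, not connect's exact implementation, and the positivity shift in the logarithmic variant is an assumed convention:

```python
import numpy as np

def minmax_scale(data):
    """Map each output node (column) to the interval [0, 1]."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo)

def standardise(data):
    """Give each output node zero mean and unit variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def log_scale(data, shift=None):
    """Move each node into logarithmic space after shifting it positive."""
    if shift is None:
        shift = 1.0 - data.min(axis=0)   # ensure strictly positive values
    return np.log(data + shift)
```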
All three methods of normalisation yield good results compared to simply multiplying all spectra by a constant factor of 10^{10}, but the accuracy is much better when using min-max scaling or standardisation. This is because all nodes in the output layer then have the same span, and the network treats them similarly. When using logarithmic scaling, the orders of magnitude are still very similar, but there is a clear difference between the C_ℓs and other kinds of output, such as derived parameters, since the C_ℓs will have values around or below −20 while other parameters will have values closer to or above zero. This leads to a difference in how the output nodes are viewed and treated by the network, and it is thus harder to achieve convergence. In our implementation of logarithmic scaling, we found that the performance can be further increased by taking the logarithm twice (after a shift of all the data to positive values), since the difference in orders of magnitude between C_ℓs and derived parameters is quite large. Standardisation yields a slightly better result than min-max scaling, as is apparent from figure 4, and has therefore been chosen as the default normalisation method. All results in this paper have been produced with standardisation as the normalisation method, except for the comparisons between different methods of normalisation in this section.

Sampling of training data
The training data can be sampled in various ways, with different methods having different strengths and weaknesses. The most agnostic way of sampling the parameter space would be a grid-based method. To get a good resolution, however, grids can be very costly, and we end up with many points that yield almost identical output, since many of them differ in only a single parameter. To circumvent this, we can use Latin hypercube sampling, where no two models share any parameter values. This is much more efficient and proves sufficient for the training of the neural network. This way of sampling still yields a uniform distribution of points in the parameter space, so the trained network will be able to emulate the output for all models in the parameter space (within the boundaries of the Latin hypercube) with similar accuracy.
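A Latin hypercube sample in d dimensions can be generated in a few lines: each parameter axis is divided into N equal strata, each stratum receives exactly one point, and the strata are permuted independently per axis. This is a minimal sketch on the unit cube, not connect's actual sampler; dedicated packages offer optimised variants:

```python
import numpy as np

def latin_hypercube(n_points, n_dim, rng=None):
    """Unit-cube Latin hypercube: one sample per stratum on every axis."""
    rng = np.random.default_rng(rng)
    # one uniform draw inside each of the n_points strata, per dimension
    strata = np.arange(n_points)[:, None]
    samples = (strata + rng.random((n_points, n_dim))) / n_points
    # shuffle the strata independently along every dimension
    for dim in range(n_dim):
        rng.shuffle(samples[:, dim])
    return samples  # scale each column to the desired parameter bounds

lhc = latin_hypercube(100, 6, rng=0)
```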
For a large Latin hypercube containing all reasonable models, it is, however, rare that we would ever need to use the network on the outer parts of the hypercube. This is because most models near the edges (and especially the corners) have very low likelihoods, since they are very far from the best-fit points of most datasets. If we disregard all such unlikely models, a better way of sampling would be to mimic the shape of the actual posterior distribution. With high dimensionality in the parameter space, this proves much more efficient than Latin hypercube sampling, since only a small fraction of the models are of actual use in the latter. A way of illustrating this effect is to imagine a simple hyperspherical posterior with radius R centred on the best-fit point. The ratio of the volume of the hypersphere to that of a hypercube with side length 2R in d dimensions is given by

$$\frac{V_{\rm sphere}}{V_{\rm cube}} = \frac{\pi^{d/2}}{2^d\,\Gamma(d/2+1)},$$

and in high-dimensional spaces this decreases rapidly. With just 3 parameters, the corners of the hypercube make up almost half of the volume, and with 9 parameters, less than a percent of the volume is within the hypersphere. By focusing only on models within such a hypersphere, we could utilise our resources much better and increase the performance of the network on all the relevant models of interest. Actual posteriors typically have a much more complicated shape than a hypersphere; however, the argument still holds, since many of the cosmological parameters have a vanishing likelihood only a few standard deviations away from the best-fit point. We cannot simply expect a hyperspherical sampling to be representative of the posterior distribution, so we need a way of sampling training data from the actual posterior distribution instead. We therefore propose to sample the training data using an MCMC method with a high sampling temperature.
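The quoted numbers follow directly from the sphere-to-cube volume ratio:

```python
from math import pi, gamma

def sphere_to_cube_ratio(d):
    """Volume of a d-ball of radius R over that of the cube of side 2R."""
    return pi ** (d / 2) / (2 ** d * gamma(d / 2 + 1))

print(sphere_to_cube_ratio(3))   # about 0.52: corners are almost half the volume
print(sphere_to_cube_ratio(9))   # about 0.006: under a percent inside the sphere
```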
It may seem a little strange to use an MCMC method to create the training data for a neural network that is itself to be used in an MCMC analysis, but the idea is that we need far fewer data points for the training data than for the actual MCMC analysis. A high-temperature MCMC sampling running for a few hours is sufficient to obtain the same (or better) accuracy on the relevant models as one would get from Latin hypercube sampling with 10^5–10^6 points. As noted in ref. [35], we thus obtain a set of training data from the exact region of parameter space where emulation is relevant, instead of having the majority of the training data unrealistically far away from the best-fit point.
In this paper, we present results obtained with the default connect temperature of T = 5. Since the temperature alters the likelihood L as L → L^{1/T}, a temperature of T = 5 corresponds to increasing the standard deviation of a Gaussian likelihood by a factor of √5 ≈ 2.2, to the effect that the generated training data mainly lies within roughly the 5σ contour of the posterior. The temperature is a free input parameter, however, and may be increased if the user desires higher accuracy further away from the posterior mode.
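For a Gaussian likelihood the effect of the temperature is easy to verify explicitly: raising L ∝ exp(−x²/2σ²) to the power 1/T yields another Gaussian with standard deviation √T σ. A short numerical check:

```python
import numpy as np

sigma, T = 1.0, 5.0
x = np.linspace(-10, 10, 2001)

log_like = -0.5 * x**2 / sigma**2               # Gaussian log-likelihood
tempered = log_like / T                          # L -> L^(1/T)
widened = -0.5 * x**2 / (np.sqrt(T) * sigma)**2  # Gaussian with sigma*sqrt(T)

assert np.allclose(tempered, widened)            # identical curves
```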

Iterative sampling
We can improve on this and make the sampling even more efficient by exploiting the fact that we only need to sample from something roughly similar to the posterior distribution rather than the actual one, due to the high sampling temperature and the neural network's ability to interpolate in a well-sampled area. We therefore do not need to search the parameter space with the precision of class, which results in many slow calculations of the likelihood. With a typical acceptance rate of 0.3, a lot of computation time is wasted calculating the likelihood of rejected steps. A way around this is to use a neural network trained on only a small number of Latin hypercube points (∼10^4): enough to give a decent, but not great, accuracy. We then use this neural network model to calculate likelihoods during the MCMC sampling much faster and sample new points for the training data. We then only need to use class to calculate the output part of this new training data, which means that we effectively skip the class computations for all the rejected steps. Since the initial neural network model had a relatively low accuracy, we have only gained a rough sample around the posterior distribution, but new models trained on this new training data show a major improvement. We can repeat the process starting from this new model, and the resulting models improve with each iteration. Using this method we can thus avoid a huge number of expensive class computations that would only result in rejected steps. We could include the initial Latin hypercube data in the total dataset, but doing so lowers the accuracy of the neural network model in the relevant parts of the parameter space. This is because the network contains a limited set of weights, and by forcing it to learn the behaviour in the outermost regions of the parameter space, we inhibit the training in the more relevant regions. It is thus advantageous to exclude the initial training data even though the class computations have already been performed. An illustrative flowchart of this sampling algorithm is shown in figure 5.
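Schematically, the iterative procedure can be expressed as the following loop. Every argument is a placeholder callable standing in for the Latin hypercube generator, the class call, the network training, and the emulator-driven MCMC sampling; none of these names belong to connect's actual API:

```python
def iterative_training(generate_lhc, run_class, train_network, run_mcmc,
                       converged, n_initial=5000):
    """Sketch of the iterative data-gathering loop (placeholder callables)."""
    # initial network trained on a small Latin hypercube
    network = train_network([(p, run_class(p)) for p in generate_lhc(n_initial)])
    data = []   # the initial LHC data itself is excluded from later training
    while True:
        new_points = run_mcmc(network)          # fast high-temperature sampling
        # class is only run on the kept points, skipping all rejected steps
        data += [(p, run_class(p)) for p in new_points]
        network = train_network(data)
        if converged(data):
            return network
```

A toy invocation with trivial stubs (e.g. `run_class=lambda p: p * 2`) exercises the control flow without any real cosmology.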
During the completion of this paper, a few similar approaches have been published [36][37][38], which further increases our confidence in this type of sampling. When sampling in this way, we want the fewest possible points that still represent the posterior well enough for good accuracy. We are therefore not interested in using all of the points from the high-temperature MCMC samplings, since the burn-in period yields unfavourable points for the training data. In order to get a representative set of training data, we therefore sample longer MCMC chains and keep only the last N points. The question remains of how to determine when each high-temperature MCMC sampling should end, as well as when the accuracy of an iteration is acceptable. We propose similar answers to both questions, namely to stop when the variance falls below a certain threshold: for the individual high-temperature MCMC samplings, the variance between the chains; for the iterations, the variance between the kept points from two consecutive iterations.

Reduction of over-densities in sample points
We could use all of the kept points from each MCMC run, but we would then get a lot of similar points in our total dataset, since each iteration samples from roughly the same distribution. It would be beneficial to have a way of determining which points we can safely discard, so as not to waste computational power enlarging our dataset where it is already well sampled. A simple, yet effective, way of doing this is to compare the local density of sample points around each proposed point with the average density across the sample, and only accept the point if the local density does not exceed the average by more than a chosen tolerance; all accepted points are then added to the dataset. The tolerance might seem arbitrary at first, but we found that 2 standard deviations gave the most consistent results. There are several reasons why oversampling should be avoided, and computational waste is only one of them. Another is that oversampling certain regions biases the training, since these regions are given more weight in the calculation of the total loss. When repeating the sampling over several iterations, the oversampled regions may differ between two iterations and thus result in trained neural networks with different biases. These in turn lead to different samples, making convergence of the data between two consecutive iterations immensely difficult and requiring a large number of iterations.
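One possible realisation of such a density-based filter is sketched below. The neighbourhood radius, the probabilistic fallback, and the exact form of the acceptance rule are illustrative assumptions modelled on the description above, not connect's exact algorithm; only the mean-plus-2σ tolerance comes from the text:

```python
import numpy as np

def thin_oversampled(points, radius, n_sigma=2.0, rng=None):
    """Discard points in regions whose local density is anomalously high.

    A point is kept outright if its neighbour count within `radius` stays
    below mean + n_sigma * std of all counts; otherwise it is kept only
    with a probability that shrinks with the local over-density."""
    rng = np.random.default_rng(rng)
    # brute-force pairwise neighbour counts (fine for ~10^4 points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    counts = (dists < radius).sum(axis=1) - 1      # exclude the point itself
    threshold = counts.mean() + n_sigma * counts.std()
    keep_prob = np.minimum(1.0, threshold / np.maximum(counts, 1))
    keep = (counts <= threshold) | (rng.random(len(points)) < keep_prob)
    return points[keep]
```

On a roughly uniform sample, almost all points survive; only spuriously over-dense clumps are thinned out.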
Next, in figure 6 we provide an actual demonstration of how the iterative procedure works for the case of a decaying cold dark matter (DCDM) model. This model is described by the additional cosmological parameters ω_dcdm^ini and Γ_dcdm (see e.g. ref. [25] for details on the model parameters). The figure shows the training data acquired in each iteration in the 2-dimensional (ω_dcdm^ini, log10(Γ_dcdm)) parameter space, and only the points accepted by the aforementioned algorithm are shown in colour.
From the different iterations in figure 6 it is clear that we might need to further discard some of the points in our dataset. The points from the first iteration, as sampled by the neural network model trained on the initial Latin hypercube data, often have little to no overlap with those from the other iterations, and including them in the dataset thus leads to worse accuracy in the relevant part of the parameter space, since the network has to spend some of the training on an irrelevant region. It would therefore be beneficial to remove the data from the first iteration altogether, just as we removed the initial Latin hypercube data. This way of discarding data is more wasteful than the filtration of points from the MCMC samplings, since we are again throwing away already computed class models. To combat this waste of resources, we decrease the number of points sampled by the initial neural network model. For some cosmological models the first sample is not far from the consecutive samples, and in those cases the data from the first iteration could be kept without lowering the accuracy; we have therefore included the option of keeping it if one wishes to do so. Looking at figure 6, one could make an argument for keeping the first iteration (or, though wasteful, throwing away the second as well), but we discard it to be on the safe side. The first iteration contains 5,000 class computations, and the maximal number of new points from each iteration is 20,000. The final dataset contains 19,999 points from i = 2 (one class computation returned an error and was excluded), 7,475 points from i = 3, 4,781 points from i = 4, and 2,415 points from i = 5. We thus see that the number of points taken from each iteration decreases due to convergence, so fewer and fewer class computations need to be performed.

Integration with MontePython
In order to gain any real benefit from a neural network emulating cosmological observables, we need to be able to use the network in place of an Einstein-Boltzmann solver such as class in an MCMC analysis. We have therefore made a module for connect as a plug-in for the popular MCMC code MontePython. Using this plug-in along with the Planck lite likelihood [30], one can reach speedups of two to three orders of magnitude, meaning that a reasonable inference can be done in mere minutes.

Considerations and usage
Now that the computation of CMB spectra has been sped up significantly, it no longer dominates the time per step in an MCMC chain. Instead, the likelihood evaluation dominates when using the connect plug-in, which limits the achievable speedup for certain likelihoods. With the full Planck clik likelihood, we only see speedups of less than one order of magnitude, because the large number of nuisance parameters makes the likelihood evaluation slow. To really see the benefit of the plug-in, we need the much faster Planck lite likelihood, which is marginalised over the nuisance parameters, leaving only a single one in the likelihood computation. This leads to speedups of several orders of magnitude.
A few things in the source code of MontePython are specific to class, and for the sake of usability we did not want to alter anything in the source code. Our solution is to make the new plug-in inherit from the Cython wrapper of class, classy, and thereby trick MontePython into believing that the connect module is class. This way a user does not have to alter any code, and any version of MontePython supporting class can be used. One simply sets the path of the cosmological module to that of the connect plug-in instead of a class installation and adds the name of a trained connect model to the data.cosmo_arguments dictionary in the parameter file.
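Concretely, the setup could look as follows. The paths and the argument key below are illustrative assumptions; consult the connect documentation for the exact names:

```python
# In the MontePython configuration file (e.g. default.conf):
# point the 'cosmo' path at the connect plug-in instead of a class build.
path['cosmo'] = '/path/to/connect_public/mcmc_plugin'   # illustrative path

# In the MontePython parameter file: tell the plug-in which trained
# connect model to load ('connect_model' is an assumed key name).
data.cosmo_arguments = {'connect_model': 'dcdm_example'}
```

Everything else in the parameter file (likelihoods, sampled parameters, priors) stays exactly as it would for a class-based run.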
Since the plug-in inherits from the classy wrapper, it will automatically use class to compute any derived parameter that is not emulated by connect. The background calculation of class takes a time comparable to a connect emulation, so we can simply let class handle any derived parameter that depends only on the background module. It is, however, a good idea to include any other derived parameter needed for the MCMC analysis in the emulation output, since all other modules of class are slower, and this would otherwise impact the computation time significantly.
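The mechanism can be illustrated with a minimal sketch. `ClassyStub` stands in for the real classy wrapper, and the method names mirror the wrapper functions lensed_cl() and get_current_derived_parameters(); this is not the actual connect source code:

```python
# Minimal sketch of the plug-in idea (not the actual connect implementation).
# In the real plug-in the wrapper inherits from classy.Class, so MontePython
# cannot tell the difference between connect and class.

class ClassyStub:
    """Stand-in for the classy wrapper around class."""
    def lensed_cl(self, lmax=2500):
        raise RuntimeError("slow path: full Einstein-Boltzmann solve")

    def get_current_derived_parameters(self, names):
        # background-only quantities are cheap, so class can compute them
        return {name: 0.0 for name in names}          # dummy values

class ConnectWrapper(ClassyStub):
    """Serves emulated spectra, but falls back to class for everything else."""
    def __init__(self, emulator):
        self.emulator = emulator                      # trained neural network

    def lensed_cl(self, lmax=2500):
        return self.emulator(lmax)                    # fast path: ~ms per call
    # derived parameters not in the emulation output fall through to the parent
```

Only the expensive spectra are overridden; everything else inherits the parent's behaviour, which is exactly why no MontePython code needs to change.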

Inference with Planck lite
Using the plug-in for MontePython, we first sampled the parameter space of the DCDM model which we used to test and validate the training algorithm. In figure 7 we show the posteriors resulting from both a standard class-based MCMC run with approximately 5 × 10^5 accepted chain elements (∼5,000 CPU core-hours) and a connect-based run with approximately the same number of chain elements (∼10 CPU core-hours). As can clearly be seen, the agreement is excellent! The difference in both means and standard deviations is around 0.01-0.1 standard deviations for all parameters. Figure 8 shows the connect-based runs using data from iterations 1, 3 and 5 for a subset of the cosmological parameters (ω_dcdm^ini, Γ_dcdm and H0), and we can clearly see the progress through the iterations. The first iteration does not lead to a particularly good model, but it manages to find a rough region in which to sample during the following iterations. This improves until the iterations are halted by convergence of the data. When comparing the results of the iterative sampling method to the performance of Latin hypercube sampled neural network models in figure 9, we see that not even 10^6 points (individual class computations) are enough for the Latin hypercube sampling to match the results of the iterative sampling, and certainly not a Latin hypercube with as few points as in the dataset from the iterative sampling (34,670 class computations). The Latin hypercubes are of course sampled logarithmically in the Γ_dcdm parameter, as in the iterative case, but this does not help much: a network of finite size has to accommodate a huge number of points in parameter space that are very far from the region of interest. A network with many more nodes and hidden layers could perhaps be trained to behave nicely across the entire span of the Latin hypercube.
Next, we used the exact same setup for a completely different massive neutrino model, described by the parameters m_ncdm and deg_ncdm. In figure 10 we show the iterative acquisition of training data in the mass-degeneracy plane. As can clearly be seen, this model is significantly easier for the algorithm, and it converges in just 3 iterations because the likelihood function is significantly more Gaussian than that of the DCDM model. The first iteration again contains 5,000 class computations, and the maximum number of new points from each iteration is now 30,000 (see section 5 for a discussion of this choice). The final dataset contains 30,000 points from i = 2 and 8,959 points from i = 3. We then perform the same comparison between MCMC runs with class and connect as in the DCDM case, and the result is shown in figure 11. In this case the agreement is even better than for DCDM, with means and confidence regions differing by around or less than 10^-2 standard deviations for all parameters except H0, which differs by around 10^-1. Figure 12 shows the results of MCMC runs based on the data from the three iterations for a subset of the cosmological parameters (m_ncdm, deg_ncdm and H0), and we clearly see that convergence is achieved very quickly, since the second iteration is very close to the third. This is quite remarkable given that the first iteration samples no training data in the immediate vicinity of the best-fit point; that the second iteration is already this good demonstrates how well the sampling works when the likelihood function is simpler and more Gaussian. Figure 13 shows how the results of MCMC runs using models trained on Latin hypercubes of different sizes stack up against the results from the last iteration. It is apparent that a Latin hypercube of the same size as the dataset from the iterative process (38,959 points) does not come even remotely close to the performance of the last iteration.
We also see that more than 10^6 points are needed to even come close to the same result as the iterative process, but a much larger dataset would require a larger network architecture as well to perform reasonably. This would raise the evaluation time for each model and is thus not a viable solution, not to mention that the huge number of class computations would defeat the purpose of emulation. The larger Latin hypercube seems reasonable for the parameters with a Gaussian posterior, but for a parameter whose likelihood function increases towards a boundary of the prior, we again conclude that the Latin hypercube cannot represent the cosmological models close to the boundary. Lastly, we note that models trained by connect are also fairly accurate beyond 2σ. Figure 14 shows the contours of posteriors in the massive neutrino model as estimated by connect and class, respectively, out to 5σ. Evidently, the agreement is reasonable even out to 5 standard deviations. This accuracy is a consequence of the increased temperature with which the training data sampling chains are run, and the training temperature can be further tuned manually to accommodate even stricter accuracy requirements on a broader region.

Discussion and Conclusions
We have presented connect, a novel neural network based framework for cosmological parameter inference. The method relies on an iterative MCMC-based generation of training data, which proves highly efficient in training the network to perform extremely well for MCMC-based cosmological parameter inference. We have tested the robustness and versatility of the method by first building the network structure and training algorithm on a decaying cold dark matter (DCDM) cosmological model, demonstrating that we can achieve results nearly identical to well-converged class-based MCMC runs, and subsequently using connect in an off-the-shelf manner to run parameter inference on a different massive neutrino cosmology. Even though the method was not previously tested or optimised on this model, we find results which are just as impressive as for the DCDM model.

Hyperparameters of the iterative sampling.
A few things can be customised when using the iterative sampling method, including the convergence criteria, the size of the initial Latin hypercube, the prior bounds for the individual high-temperature MCMC samplings, the maximum number of points to include from each iteration, and the filtering of new data points. The convergence criterion for the individual MCMC runs is that R − 1 < 0.01 holds between the chains for all sampling parameters, and the criterion for halting the iterations is that the variance between the data acquired from two consecutive iterations also satisfies R − 1 < 0.01. Depending on the complexity of the likelihood, these criteria could be loosened. When filtering points from each iteration to avoid oversampling in certain regions, the number of points actually included naturally decreases over time, so another halting criterion for the iterations could be when (almost) no new points are added to the total dataset. These two criteria could even be combined, giving greater robustness and a better guarantee that the whole best-fit region has been sampled. The maximum number of points included in each iteration is the number of points extracted from the high-temperature MCMC samplings before any filtering. This number can be varied depending on how difficult the likelihood function is to sample. A simpler likelihood function allows for very quick convergence with a larger maximum, as is apparent from our results for the massive neutrino model, where this parameter was set to 3 × 10^4 while it was set to 2 × 10^4 for the DCDM results. There can, however, also be reason to raise this number for more complicated models, since we would then have a better representation from the MCMC chains. This is not ideal for the first few iterations, though, since we would waste many more class computations given the low accuracy of the first iterations.
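For reference, the R − 1 statistic referred to above is the Gelman-Rubin diagnostic. A textbook version can be sketched as follows; the criterion actually implemented in MontePython differs in detail, but the idea is the same:

```python
import numpy as np

def r_minus_one(chains):
    """Textbook Gelman-Rubin R - 1 for an array of shape
    (n_chains, n_samples, n_params)."""
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)                  # (n_chains, n_params)
    W = chains.var(axis=1, ddof=1).mean(axis=0)        # within-chain variance
    B = n * chain_means.var(axis=0, ddof=1)            # between-chain variance
    var_hat = (n - 1) / n * W + B / n                  # pooled variance estimate
    return np.sqrt(var_hat / W) - 1.0                  # -> 0 as chains mix
```

Chains drawn from the same distribution give R − 1 close to zero, while a chain stuck in a different region of parameter space inflates the between-chain variance B and drives R − 1 well above the 0.01 threshold.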
The size of the initial Latin hypercube was chosen to be 10^4 points for both of our cosmological models, but this could perhaps be brought down, given that the bounds are chosen quite wide in order to encapsulate all regions of significant likelihood. If one knows roughly where the best-fit region is, a much smaller Latin hypercube could be sufficient. The priors on the high-temperature MCMC samplings were chosen to be the same as the bounds of the initial Latin hypercube, but if the hypercube is shrunk, the priors should be set differently from the bounds so as not to exclude significant parts of the parameter space. Due to the nature of the iterative method, convergence should be reached even if the initial Latin hypercube has little to no overlap with the best-fit region (it might take more iterations, though), but more tests are needed regarding this.
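A basic Latin hypercube construction looks as follows. This is a simple sketch for illustration, not the sampler used internally by connect, which may differ in detail (e.g. logarithmic sampling in selected parameters such as Γ_dcdm):

```python
import numpy as np

def latin_hypercube(n, bounds, seed=None):
    """Sample n points in a Latin hypercube within the given (low, high) bounds.
    Each dimension is divided into n equal strata containing one point each."""
    rng = np.random.default_rng(seed)
    d = len(bounds)
    strata = np.tile(np.arange(n), (d, 1))        # stratum indices per dimension
    strata = rng.permuted(strata, axis=1).T       # independent shuffle per dim
    u = (strata + rng.random((n, d))) / n         # uniform jitter inside stratum
    lo, hi = np.asarray(bounds, dtype=float).T
    return lo + u * (hi - lo)
```

By construction, every one of the n equal-width strata in every dimension contains exactly one point, which is what gives the Latin hypercube its space-filling property with relatively few samples.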

CPU time comparisons.
Comparing the CPU intensity of inference runs with connect and with standard class-based methods is somewhat involved. There is an initial overhead in training the connect emulator for a given model, of order 5 × 10^4 class evaluations depending on the number of iterations and how many points each iteration contributes. However, once the network has been trained, the CPU consumption of a connect-based MontePython inference comes almost exclusively from the likelihood evaluation when using the full Planck likelihood, not from the emulation. When using the Planck lite likelihood instead, the likelihood cost is of a similar size to an evaluation of the neural network, making the speedup much more profound during the MCMC runs. A single evaluation of the C_ℓ spectra for a cosmological model takes of order 5 s (depending on the cosmological model; about twice as much for DCDM) using class on a modern Intel CPU core, while connect only uses around 3 ms (including interpolation of the C_ℓ's)! This is an immense speedup of three orders of magnitude, and it is also apparent in the time consumption of an MCMC run. The MCMC runs using class as cosmological module each took 200 hours with 6 chains and 6 CPU cores per chain; this enables the class computations to be parallelised, which brings down the total time at the cost of using 6 times the CPU cores. When using connect as cosmological module, we again use 6 chains, but only one CPU core per chain is necessary since the evaluation of the network is not parallelisable. These runs, however, take less than two hours to reach the same level of convergence and a similar number of accepted steps, so the MCMC runs using connect are sped up by a factor of more than 600. This agrees well with the difference in the evaluation times when factoring in the time consumption of the Planck lite likelihood evaluations.
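As a rough consistency check of these numbers, one can estimate the per-step speedup from the quoted evaluation times. The Planck lite evaluation time below is an assumed illustrative value, not a measurement from this work:

```python
# Back-of-envelope estimate of the per-step MCMC speedup quoted above.
t_class = 5.0       # s per C_l evaluation with class (quoted above)
t_connect = 3e-3    # s per connect emulation (quoted above)
t_lite = 5e-3       # s per Planck lite evaluation (assumed for illustration)

speedup = (t_class + t_lite) / (t_connect + t_lite)
print(f"estimated speedup: {speedup:.0f}x")   # same order as the observed > 600x
```

With a likelihood of a few milliseconds, the step time becomes dominated by the likelihood rather than the emulator, which is exactly the regime described above.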
We also need to consider the overhead from the sampling of training data, which consists of a number of class computations and, in the case of iterative sampling, high-temperature MCMC sampling and training of neural network models. There is not much to do about the class computations except to limit their number and use many CPU cores on a cluster. With the iterative sampling method the class computations are embarrassingly parallelisable, and the only way to limit the number of computations is to optimise the choice of points in parameter space to compute and not throw any away in the end. Unfortunately we have to throw away the initial Latin hypercube, since including it worsens the performance of the network, and for complicated likelihood shapes we often need to discard the first iteration as well, as argued previously. This means that there are around 10^4 class computations that we have performed but cannot use in the final dataset. The high-temperature MCMC samplings are much less expensive in CPU time, but with a normal Metropolis-Hastings algorithm the sampling is not very parallelisable, so its time consumption is significant compared to the class computations utilising hundreds of CPU cores at once. The training of the network is quite fast on a GPU: it normally takes around 5-8 minutes to train our networks using datasets of ∼5 × 10^4 points on two GPUs with distributed training, so we can probably not bring this down any further. At this point a single iteration of the sampling takes about one hour with 500 CPU cores and two GPUs allocated. The class computations utilise all CPUs for around 10-15 minutes, whereafter the training uses both GPUs for 5-8 minutes, and lastly the MCMC sampling uses only a handful of CPU cores for around 20-40 minutes depending on the difficulty of achieving convergence.
This means that we only use a fraction of the resources for most of the time, but limiting the number of CPUs to match the MCMC sampling would increase the time of the class computations many times over. The biggest speedup in the iterations would thus come from modifying the high-temperature MCMC sampling to use a more parallelisable algorithm, run on either a GPU or on the many CPU cores available anyway for the class computations. This could bring the sampling part down to a few minutes or even seconds, leaving only the class computations and training as the time-consuming parts and thus decreasing the time for each iteration to around 20 minutes when using the same resources.
MCMC methods and other sampling strategies. For the results presented in this paper we have used the Metropolis-Hastings algorithm for the sampling of parameter space through integration with the popular code MontePython. A benefit of this is that we can utilise the entire library of likelihoods contained within MontePython: nothing has to be rewritten, and our connect plug-in really provides a plug-and-play solution. The standard Metropolis-Hastings algorithm used for the sampling is, however, not parallelisable to more than a handful of chains running simultaneously, and the MCMC is therefore quite time consuming compared to everything else during an iteration (assuming enough CPU cores for the class computations to be parallelised). In order to speed up the process, we need to consider alternative sampling methods as well as how to utilise different likelihoods in the analysis.
There are many ways of sampling the parameter space of a model, and MCMC with Metropolis-Hastings is widely chosen mainly due to the cost of evaluating Einstein-Boltzmann solver codes. With the use of neural networks for emulation, a whole new world of parameter inference beyond MCMC sampling opens up. Because the neural network is a smooth function, gradients are not only easy to access but also numerically very stable. This means that gradient-based methods like Hamiltonian Monte Carlo [39] are now a possibility. It is even possible to move away from MCMC methods and compute profile likelihoods, given that optimisation is so much faster and more robust than when using class.
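To illustrate why smooth gradients matter, the core of Hamiltonian Monte Carlo is the leapfrog integrator, sketched below on a toy Gaussian log-posterior. With an emulator, `grad_logp` would come from automatic differentiation of the network rather than the analytic expression used here:

```python
import numpy as np

def leapfrog(q, p, grad_logp, eps=0.1, n_steps=10):
    """One Hamiltonian Monte Carlo trajectory via the leapfrog integrator.
    q: position (parameters), p: momentum, grad_logp: gradient of log-posterior."""
    p = p + 0.5 * eps * grad_logp(q)          # initial half step in momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                       # full step in position
        p = p + eps * grad_logp(q)            # full step in momentum
    q = q + eps * p
    p = p + 0.5 * eps * grad_logp(q)          # final half step in momentum
    return q, p
```

The leapfrog scheme is symplectic, so the Hamiltonian is nearly conserved along a trajectory, which is what makes the long, informed proposals of HMC possible; with class, the gradient evaluations alone would be prohibitively expensive.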
In order to really utilise the speedup of the emulation, we need some way around the time consumption of likelihood evaluations, which is especially cumbersome for the full Planck likelihood. An idea used by ref. [14] is to translate the likelihoods into TensorFlow syntax so that they run rapidly on a GPU. This way one can perform the sampling of the parameter space on the GPU as well, achieving parameter inference in mere seconds (using a parallelisable sampling method such as the affine-invariant sampling algorithm [40]). Translating the heavier likelihoods requires a lot of work, so a full library of GPU-compatible likelihoods is probably not realisable in the next few years. A simpler solution could be optimisation of the existing likelihood codes (where possible), since many were probably written in the knowledge that the class computations would always be much slower and therefore had no reason to be written in the most optimal way. This too requires a significant amount of work, and so it is most likely not happening in the immediate future. An easier-to-code solution could in fact be emulation of the likelihood functions themselves. This would allow both faster sampling with normal MCMC methods, since the evaluation of the likelihood functions would then be on the level of the connect evaluations, and compatibility with GPUs, allowing for more sophisticated gradient-based or parallelisable sampling methods and thus reliable parameter inference in seconds.

Reproducibility.
We provide the complete connect framework on GitHub for public use at https://github.com/AarhusCosmology/connect_public. All the parameter files used for connect are included, as well as a brief description of how to use connect on its own and with MontePython. Any version of MontePython supporting class as cosmological module can be used with connect without any alterations to the source code. We used our own version of class written in C++, and thus named class++, but any version supporting the wrapper functions lensed_cl() and get_current_derived_parameters() can be used. class++ along with a forked version of MontePython are both available from our GitHub organisation page https://github.com/AarhusCosmology. Trained models are stored in a folder called trained_models. We have also included a space for all the training data, which is automatically sorted into folders with a specified job name and subfolders named after the number of data points (iteration number) upon creation of the data when using Latin hypercube sampling (iterative sampling). These data folders are collected in the folder data. Within the source folder, the custom_functions.py file contains classes for adding new activation functions and loss functions, giving the user an easy way of implementing new custom functions for training networks. Within the source/architecture folder, users can in addition easily define entirely new architectures for the network and tailor everything to their needs. In the folder mcmc_plugin we have made a wrapper for trained models to mimic class. This structure is depicted in figure 15.