A machine learning approach to galaxy properties: Joint redshift - stellar mass probability distributions with Random Forest

We demonstrate that highly accurate joint redshift - stellar mass PDFs can be obtained using the Random Forest (RF) machine learning (ML) algorithm, even with few photometric bands available. As an example, we use the Dark Energy Survey (DES), combined with the COSMOS2015 catalogue for redshifts and stellar masses. We build two ML models: one containing deep photometry in the $griz$ bands, and the second reflecting the photometric scatter present in the main DES survey, with carefully constructed representative training data in each case. We validate our joint PDFs for $10,699$ test galaxies by utilising the copula probability integral transform (copPIT) and the Kendall distribution function, and their univariate counterparts to validate the marginals. Benchmarked against a basic set-up of the template-fitting code BAGPIPES, our ML-based method outperforms template fitting on all of our pre-defined performance metrics. In addition to accuracy, the RF is extremely fast, able to compute joint PDFs for a million galaxies in just over $2$ hours with consumer computer hardware. Such speed enables PDFs to be derived in real-time within analysis codes, solving potential storage issues. As part of this work we have developed GALPRO, a highly intuitive and efficient Python package to rapidly generate multivariate PDFs on-the-fly. GALPRO is documented and available for researchers to use in their cosmology and galaxy evolution studies at https://galpro.readthedocs.io/.


INTRODUCTION
The next generation of photometric surveys such as the Rubin Observatory Legacy Survey of Space and Time (LSST; LSST Science Collaboration 2009) and Euclid (Laureĳs et al. 2011) will observe ★ E-mail:sunil.mucesh.18@ucl.ac.uk billions of galaxies. The sheer amount of data generated will enable studies ranging from the cosmic large-scale structure (LSS), to the formation and evolution of galaxies, to be conducted in unprecedented detail; ultimately leading to a transformation in our understanding of the Universe. However, one of the key challenges will be developing algorithms which can quickly and reliably extract physical properties and redshifts of galaxies.
The success of many scientific analyses critically hinges on redshift measurements. For example, redshifts are required in weak lensing tomography (Hu 1999); one of the primary probes to unveil the nature of dark energy. As a result, a large number of methods now exist to estimate redshifts from photometric data (photo-zs) (see Salvato et al. 2019, for a review). In general, they are either physically motivated or data-driven.
Template fitting methods fall into the former category as they require prior knowledge in the form of template spectral energy distributions (SEDs). These templates are fit to the observed fluxes, and photo-zs are usually determined using chi-square minimisation (e.g. Bolzonella et al. 2000). Baum (1962) originally applied template fitting to estimate photo-zs of elliptical galaxies. Since then, a plethora of codes have been developed for the task such as HYPERZ (Bolzonella et al. 2000), BPZ (Benítez 2000), LEPHARE (Arnouts et al. 1999), ZEBRA (Feldmann et al. 2006), EAZY (Brammer et al. 2008) and BCNZ2 (Eriksen et al. 2019).
Galaxies are described by a wide range of physical properties, with stellar mass, star formation rate, age and metallicity being among the most important. Template fitting codes such as FAST (Kriek et al. 2009), CIGALE (Burgarella et al. 2005;Noll et al. 2009;Boquien et al. 2019), MAGPHYS (da Cunha et al. 2011) and BMASTELLARMASSES (Palmese et al. 2020a) have been specifically designed to output these quantities. Meanwhile, the application of ML in this field has been fairly limited, but literature has now begun to emerge (Acquaviva 2016;Stensbo-Smidt et al. 2016;Bonjean et al. 2019;Delli Veneri et al. 2019).
While single-value ('point') estimates are useful, probability distribution functions (PDFs) have become increasingly important in recent years as a full characterisation of the uncertainties, beyond a point estimate and an error bar, is required for accurate analyses. This has been particularly true in the role of redshifts for weak lensing cosmology (e.g. Bonnett et al. 2016), where it has been shown that using distributions instead of point estimates can improve the accuracy of cosmological measurements (Mandelbaum et al. 2008;Myers et al. 2009). It is possible to extract redshift PDFs using both template fitting and ML methods. However, ML methods have recently grown in use due to their efficiency. For example, packages such as ArborZ (Gerdes et al. 2010), TPZ (Carrasco Kind & Brunner 2013), SOMz (Carrasco Kind & Brunner 2014), SkyNet (Bonnett 2015) and ANNz2 (Sadeh et al. 2016) all have foundations in ML. To reach a consensus on the best algorithm in terms of PDF accuracy, Schmidt et al. (2020) and Deprez et al. (submitted) have compared a dozen or more popular algorithms from both approaches.
The redshift and physical properties of a galaxy, measured via modelling its photometry, are correlated, and thus should be described with a multivariate distribution. The commonly used marginal distributions in redshift, stellar mass, etc., constitute a loss of information and could potentially introduce biases into a scientific analysis as a result. Consequently, a new class of template fitting codes have come to the fore such as BAYESed (Han & Han 2012, BEAGLE (Chevallard & Charlot 2016) and BAGPIPES (Carnall et al. 2018). They utilise Bayesian statistical techniques such as Markov chain Monte Carlo (MCMC; Goodman & Weare 2010;Foreman-Mackey et al. 2013) and nested sampling algorithms (Skilling 2006;Feroz & Hobson 2008, Feroz et al. 2009 to generate multivariate posterior distributions of the most important properties. By estimating redshift and physical properties simultaneously, they allow for any uncertainties on redshift to propagate to the statistical constraints on physical properties, whilst accounting for any potential correlations (Chevallard & Charlot 2016). The only drawback is that it is not feasible to obtain these distributions for a large number of galaxies. For example, BAGPIPES takes on average a few minutes to fit each galaxy, making it prohibitively expensive to fit modern datasets where sample numbers can exceed hundreds of millions, let alone upcoming surveys where the numbers will exceed a billion. Moreover, the results of the fit to each galaxy must somehow be stored in a way that is accessible to scientific analysis routines.
Based on the speed and the competitive performance of ML algorithms when used to estimate photo-zs, it is possible that an ML approach to the problem could be promising. With this in mind, we take a significant step towards realising the ultimate goal of extracting full posterior distributions of galaxy properties using ML by first focusing on 2D posterior distributions of redshift and stellar mass. We choose these properties as they are two of the most important and accurate to predict (Walcher et al. 2011;Conroy 2013). Furthermore, joint PDFs are straightforward to visualise and thus ideal for uncovering any hidden correlations or degeneracies which exist between the properties. Joint redshift -stellar mass PDFs have many potential science applications such as determining the evolution of the stellar mass function (SMF) (e.g. Papovich et al. 2003;Mortlock et al. 2015;Capozzi et al. 2017), the cross-correlation function between galaxies and galaxy groups (Yang et al. 2005), understanding the connection between stellar mass and dark matter in galaxy clusters (Palmese et al. 2016(Palmese et al. , 2020a, and the stellar-to-halo mass relation (SHMR) (see Wechsler & Tinker 2018, for an overview). However, their storage remains a potential issue. Unless there is a revolution in data storage, it will not be feasible to store a large number of multivariate PDFs. To solve this dilemma, we have developed GALPRO, a highly intuitive and efficient Python package for rapid, on-the-fly generation of n-dimensional PDFs. GALPRO is documented and available for fellow researchers to use in their analyses at https://galpro.readthedocs.io/.
An interesting application of GALPRO could be to generate joint redshift -luminosity PDFs for measurements of the Hubble constant from gravitational wave events that lack an electromagnetic counterpart (Schutz 1986;Palmese et al. 2019;Soares-Santos & Palmese et al., 2019). The use of full redshift PDFs rather than point estimates is very important for standard siren measurements (Palmese et al. 2020b), and the inclusion of joint redshift -luminosity PDFs allows one to correctly define the selection function of the galaxy sample at the same time.
The outline of this paper is as follows. In Sec. 2 we give a brief introduction to machine learning, describe the Random Forest algorithm and outline the method we use to extract point estimates, marginal and joint posterior probability distributions of redshift and stellar mass. In Sec. 3 we describe the preprocessing steps we perform to construct the necessary datasets. In Sec. 4 we describe the different random forest models we train and explain the motivation behind them. We compare, discuss and validate our results in Sec. 5, and place them into a familiar context via a comparison to those achieved by BAGPIPES in Sec. 6. Finally, we summarise this work in Sec. 7.

MACHINE LEARNING
Machine learning (ML) is a subset of artificial intelligence (AI) which focuses on the development of computer algorithms that can learn to make predictions or decisions without being explicitly programmed to do so. In general, there are three types of ML algorithms: supervised, unsupervised and reinforcement learning. With supervised learning, the algorithm is given labelled data (i.e. the correct answers), and it learns a mapping between the input and target features. On the other hand, unsupervised learning algorithms are not given any labelled data and are left to their own accord to find structure and discover hidden patterns within data. Lastly, reinforcement learning algorithms give computers the ability to interact with a dynamic environment to achieve a pre-defined goal.
The application of ML in astrophysics began as early as the 1990s with the use of artificial neural networks (ANNs) for stargalaxy separation (e.g Odewahn et al. 1992) and morphological classification of galaxies (e.g Storrie-Lombardi et al. 1992). With the advent of large scale surveys such as the Sloan Digital Sky Survey (SDSS; Gunn et al. (2006)) and more recently, the Dark Energy Survey (DES; The Dark Energy Survey Collaboration 2005; Dark Energy Survey Collaboration 2016; Lahav et al. 2020), ML algorithms have been widely adopted to cope with the enormous influx of data and to do novel science (see Baron 2019, for a recent review). This trend is likely to continue with the next generation of surveys such as the Rubin Observatory Legacy Survey of Space and Time (LSST; LSST Science Collaboration 2009) and Euclid (Laureĳs et al. 2011) as they will produce an order of magnitude more data than the previous. In the next section we describe the Random Forest algorithm and outline our method for estimating joint redshift -stellar mass posterior probability distributions.

Random Forest
Random Forest (RF; Breiman 2001) is a supervised learning algorithm based on ensemble learning, as it utilises many decision trees to make predictions. These trees are a type of data structure, which allow one to make a decision using a series of yes/no questions, and they can be used for regression and classification. They are built using a recursive algorithm which splits the data usually into two groups at each step until some pre-defined threshold is achieved. The job of the algorithm is to identify groups that have similar input and target features and is therefore related to the kNN algorithm (Altman 1992). The main components of the decision tree are the root, decision and leaf nodes. The root node defines the first and the most optimum split. The decision nodes describe the subsequent splits, and the leaf nodes contain the final groups.
The exact process of building a decision tree is as follows. At each step, all possible splits are evaluated in all dimensions of the input feature space. For classification, the data is split to best separate different classes, and this is achieved by maximising the information gain (IG), or in other words, minimising the impurity using metrics such as the information entropy, Gini entropy and classification error (Carrasco Kind & Brunner 2013). For regression, the data is split such that the average values of the target variable are representative of the groups. Usually, the variance is minimised to accomplish this using the loss function: where is the number of data points in a group, ,˜are the values of the target variable in the group, and¯is the group mean of the target variable.
Once the decision tree is built or trained, it can be used to make predictions. If the training data used to build the tree are complete and representative, then a new datum will end up in a leaf node that is representative of itself. The content of the leaf node can then be used to make a prediction. For classification, the prediction is the mode, and for regression, it is the mean of the leaf node.
The simplicity of the decision tree algorithm makes it one of the most popular learning mechanisms. However, decision trees only perform well on training data as they are prone to overfitting. The RF algorithm solves this issue by constructing multiple decision trees and by making a few adjustments. For example, when building the decision trees, only a subset of the training data and features are used. This technique is called feature bagging and injects randomness, making RFs more flexible and better suited to make predictions on data not encountered before. By using multiple decision trees in combination with feature bagging, RF aims to preserve the low bias of a single decision tree whilst simultaneously reducing variance to successfully navigate the bias-variance tradeoff. In summary, a RF can be built using the following process: (i) Create a bootstrapped dataset by sampling randomly from the training data with replacement.
(ii) Choose a random subset of input features when building a decision tree using the bootstrapped data.
(iii) Repeat the process to build multiple decision trees, thus creating a 'forest'.
The process of predicting with a RF is similar to predicting with a single decision tree. The only difference is that predictions from all the decision trees are collated. For classification, the prediction is the most predicted class, and for regression, it is the mean of all the values predicted by the decision trees. As is the case with many ML algorithms, RF has hyperparameters, which need to be specified beforehand. These hyperparameters can be tuned to give the best performance; and some of the most important are: • n_estimators -The number of decision trees used to build the RF determines its effectiveness. Each decision tree is built using a subset of training data. As a result, if the number used is too small, then the likelihood of complete coverage of the training data decreases, resulting in poor performance. The performance improves as the number of trees increases, but at a cost, which is the time taken for training. The key is to find the right balance between performance and training time because the gains become negligible after a certain point.
• max_features -The maximum number of features considered at each step when building the decision trees controls the correlation between them and hence, the flexibility of the RF. Usually, √ features are sufficient to build each decision tree, where is the total number of input features.
• max_depth -The maximum depth defines the number of levels in the decision tree, and it determines how finely or coarsely the training data is grouped. A low depth leads to underfitting, and if the depth is too high, it may lead to overfitting. In essence, the maximum depth provides a stopping criterion. The minimum number of training samples in a leaf node (min_samples_leaf), and the minimum number of training samples in a leaf node before the data is split (min_samples_split) also serve the same purpose.

Joint PDF Estimation Method
The RF algorithm has previously been utilised to extract point estimates (Carliles et al. 2008;Carliles et al. 2010) and PDFs (Carrasco Kind & Brunner 2013) of redshift. Recently, Bonjean et al. (2019) used the algorithm to predict stellar masses and star formation rates of galaxies. They built a single model to predict both target variables simultaneously. The process of building decision trees to achieve this is conceptually similar to building them to predict one target variable. The only difference is that at each step, to decide the best split, the average loss function for two or more variables is minimised. In eq. 1,˜and¯are now a vector of target variables and the means respectively. As this loss function is scale dependant, the target variables must be transformed to place them on scales with similar ranges otherwise the variance of one will dominate, resulting in the algorithm expending more effort in getting one target variable correct at the expense of others (Breskvar et al. 2018). Once trained, the leaf nodes in the decision trees contain values of the target variables.
We apply this methodology to predict redshift and stellar mass simultaneously, thus preserving any correlation between the properties. As both variables are continuous, we use regression trees to build the forest. However, it is entirely possible to use classification trees as shown by Gerdes et al. (2010) and Carrasco Kind & Brunner (2013). Another motivation for using regression trees is that they are generally faster to train and better suited to non-uniform data. To summarise the process: • Galaxies cluster together in n-dimensional space if they have comparable values of input features.
• The algorithm identifies these clusters by minimising the loss function (eq. 1), with redshift and stellar mass being the target variables.
• These clusters end up in the leaf nodes of the decision trees. In the end, the leaf nodes contain redshifts and stellar masses of similar galaxies.
We extract point estimates of redshift and stellar mass by running a 'new' galaxy down all the decision trees and using the mean of all the predicted values. To build marginal posterior distributions, we aggregate the values of redshift and stellar mass in the leaf nodes across all the decision trees respectively. Finally, we combine the aggregated values to build joint posterior distributions. We would like to point out that our method is flexible and can be adapted to generate joint PDFs of any other combination of properties. However, we chose redshift and stellar mass as they are two of the most important and accurate properties to predict. Furthermore, the method is flexible and can be applied to generate n-dimensional PDFs. We describe the implementation of the RF in this work, and the input features in Sec. 4.

DATA
We use data from two different surveys to train and test our RF models. These are the Dark Energy Survey (DES; The Dark Energy Survey Collaboration 2005; Dark Energy Survey Collaboration 2016; Lahav et al. 2020) and the Cosmological Evolution Survey (COSMOS; Scoville et al. 2007).

Cosmological Evolution Survey (COSMOS)
The Cosmological Evolution Survey (COSMOS) observed a 2 deg 2 equatorial field in the entire spectral range from radio to X-ray with both ground and space-based telescopes, collecting photometric and spectroscopic data. In this field, ∼ 2 million galaxies were detected, spanning 75% of the age of the Universe (Scoville et al. 2007).
We use the COSMOS2015 (Laigle et al. 2016) catalogue from the survey for its photo-zs and stellar masses. Usually, to train a ML algorithm to predict photo-zs, spectroscopic redshifts (spec-zs) are used. However, the photo-zs in this catalogue have been shown to be precise and accurate. Compared to photo-zs from surveys such as the DES and the Sloan Digital Sky Survey (SDSS; Gunn et al. 2006), the COSMOS photo-zs have been computed using more than 30 bands spanning a huge portion of the electromagnetic spectrum, as opposed to 4 or 5 optical bands. The most precise photo-zs have been estimated for very bright, low redshift, star-forming galaxies, with a normalised median absolute deviation (NMAD; Hoaglin et al. 2000) of 0.007, of which 0.5% are catastrophic outliers. Furthermore, in the deepest regions of the survey, 90% of galaxies with stellar mass greater than 10 10 at = 4 have been detected. The high photo-z precision and the overall completeness of the survey in stellar mass makes this an exemplary dataset to use in this work.

Dark Energy Survey (DES)
The Dark Energy Survey (DES) is a visible and near-infrared survey that has imaged ∼ 5100 deg 2 of the South Galactic Cap ten times in photometric bands using the Dark Energy Camera (DECam; Flaugher et al. 2015) over a span of six years, starting in 2013. It is expected to have generated ∼ 310 million galaxies with photo-zs, once all the data has been processed. In addition, the survey targeted a set of four fields with a total of ten DECam pointings over 27 deg 2 for supernova (SN) science. This SN survey had an approximately weekly cadence and thus many more epochs per pointing than the main survey (Neilsen et al. 2019). We use two datasets from the DES survey, which are discussed in the following sections.

DES Deep Fields
As part of the DES Year 3 (Y3) cosmology analysis, observations from the SN survey were combined with community data, additional DES exposures (particularly in -band) and coincident near-infrared data to form the DES Deep Fields catalogue (DF, Hartley & Choi et al., in prep.). The principal aims of the DF project are to improve calibration of redshift distributions in the main survey and to act as a prior on the population of full multi-colour images for Balrog (discussed in the next section), to better understand the systematics and selection function of the wide-field (WF) survey. These goals rely on the fact that the DF represents a statistically complete, yet effectively noiseless, population of the galaxies that are found in the WF survey. Other motives include conducting galaxy evolution studies, science with the faintest possible sources and the properties of the host galaxies of transient events.
The Y3 DF catalogue consists of data from three SN fields plus the COSMOS field, with a total coverage of 5.88 deg 2 and photometry of over 1.7 million objects (after masking for image defects) in DECam and VIRCam bands. We combine the deep (∼ 1.25 magnitudes fainter than the WF data) and precise photometry in this catalogue with the accurate redshifts and stellar masses from the COSMOS2015 catalogue to produce a 'baseline' DF dataset. Specifically, we utilise the bulge+disk model-fit magnitudes computed using the Multi-Object Fitting (MOF; Drlica-Wagner et al. 2018) algorithm.
Our goal is to produce valid posterior PDFs of galaxies in the main DES survey and to achieve this we require a suitable dataset with which to train a RF model. The photometric errors in the DF dataset would not reflect those in the WF and so would lead to biased results if used directly as training data. Furthermore, the COSMOS field does not overlap the main survey area and the redshifts and stellar masses that could be derived from model fitting to the 4-band WF data are grossly imprecise compared to those in the COSMOS2015 catalogue. In essence, we require a catalogue of DF galaxies which emulate galaxies in the WF to overcome these issue, and for this, we take advantage of the Balrog algorithm.

Balrog
Balrog is a Python package designed for the purpose of measuring the selection function of imaging surveys (Suchyta et al. 2016;Spencer et al., in prep.). The process by which it achieves this is as follows. A realistic ensemble of fake stars and galaxies are generated using GALSIM (Rowe et al. 2015), including survey characteristics appropriate to their intended sky location, e.g. seeing FWHM. The fake objects are then embedded into real survey images, thus inheriting many of their properties. Finally, the objects are detected and measured using SExtractor (Bertin & Arnouts 1996) in the same way as the original survey images. The output catalogue comprises a Monte Carlo sampling of the selection function and measurement biases and naturally accounts for systematic effects arising from the photometric pipeline, detector defects, seeing and other sources of observational systematic errors.
The Balrog process requires a prior population of galaxies from which to draw objects. The DES Y3 Balrog catalogue (Spencer et al., in prep.) was produced by injecting model fits of galaxies drawn randomly from the Y3 DF catalogue into DES Y3 single-epoch images and then measuring their properties. This catalogue contains true and measured photometry of nearly 4 million objects, and it provides us with ready-made emulated galaxies that reflect our target WF dataset, the DES Y3 GOLD (Sevilla-Noarbe et al. 2020). By combining the Y3 Balrog catalogue with COSMOS2015, we obtain a dataset that closely matches and is representative of the WF data, capturing many of the details of the objects' noise properties, but with the addition of accurate redshifts and stellar masses. From the catalogue, we use composite model magnitudes in this work. In the next section we outline the preprocessing steps we perform to create the DF and WF datasets.

Preprocessing
To construct the DF dataset, we first cross-match galaxies in the Y3 DF and the COSMOS2015 catalogues using TOPCAT (Taylor 2005) with a matching radius of 1 , and this serves the dual purpose of enabling the use of accurate photo-zs (PHOTOZ) and stellar masses (MASS_BEST) in our analysis and removing galaxies in all the other fields besides the COSMOS field. Next, we discard stars and any galaxies with erroneous redshifts and stellar masses by ensuring 0 < < 9.99, and we produce a magnitude-limited sample by selecting galaxies with < 23.5. These cuts automatically remove saturated objects and bad areas (FLAG_PETER = 0). We discover that there are some faint galaxies with close to zero or even negative fluxes in the bands, resulting in their magnitudes being undefined. To solve this issue, we convert all galaxy fluxes into 'asinh' magnitudes or 'luptitudes' (Lupton et al. 1999), defined as: where 0 = 0 − 2.5 log , = 2.5 log , is the flux, is an arbitrary softening parameter and 0 is the magnitude zero point. The authors state that the optimal value of = √ where is the standard deviation of the flux. We set the value of to be the median of the standard deviations. Additionally, we transform flux errors into luptitude errors using: Luptitudes behave like magnitudes for bright photometry and like fluxes for faint photometry, with the turning point in the behaviour determined by the softening parameter. By converting to luptitudes, we avoid introducing an additional selection effect by not discarding galaxies with negative fluxes.
To produce the WF dataset, we start anew and match 'WF' galaxies in the Y3 BG catalogue to their counterparts in the Y3 DF using the ID column. Next, we cross-match the galaxies in the intermediate catalogue to the COSMOS2015 catalogue. There are multiple scattered WF copies of each DF galaxy in the BG catalogue to efficiently sample the DES selection function, and to preserve this we keep all of the copies. This is an important aspect of our set-up, as it captures the selection function through the galaxy detection probability as a function of true photometry and light profile, and the asymmetric scatter between photometry and galaxy properties (redshift and stellar mass) that it induces. We remove any galaxies with erroneous flux measurements by selecting all galaxies with MEAS_CM_FLAG = 0 (Spencer et al., in prep.). Finally, we repeat all the aforementioned cuts and steps used in constructing the DF dataset, the only difference being that on this occasion, we apply the -band cut to the magnitudes of WF galaxies. Thus, we have 'augmented' a completely realistic target dataset which effectively replicates the systematics in the WF survey without compromising on the accuracy of redshifts and stellar masses.
After all the preprocessing steps, there are 53, 491 galaxies in the DF dataset and 392, 376 galaxies in the WF dataset. Each dataset contains the following information: luptitudes and luptitude errors, photo-zs and stellar masses. Additionally, we compute all the relevant lupticolours; and the associated errors using the standard error propagation formula.
where 1 and 2 are the errors on the luptitudes, and is the error on the computed lupticolour. Fig. 1 shows the marginal and the joint distribution of redshifts and stellar masses of galaxies in the DF dataset, and the distributions of luptitudes. The average redshift and stellar mass is approximately 0.7 and 5 × 10 9 respectively. For the sake of brevity, we do not show a similar figure for the WF dataset as the distributions are broadly similar.
We perform an 80 : 20 split on the DF and WF datasets to create The colours in the joint distribution indicate the density of points. The DF dataset is created by cross-matching galaxies in the DES Y3 Deep Fields (DF) and the COSMOS2015 catalogues. All galaxies with erroneous redshift and stellar masses are discarded from the dataset, and a magnitude-limited sample is produced by selecting galaxies with < 23.5. The luptitudes in the dataset are computed from fluxes in the Y3 DF catalogue, while the redshifts and stellar masses are from the COSMOS2015 catalogue. their training and testing datasets respectively. As there are multiple copies of each galaxy in the WF dataset, we ensure that there is no admixture of unique galaxies in its training and testing datasets. In other words, unique galaxies which exist in the training dataset do not appear in the testing dataset, and vice-versa. As a consequence, there are 314, 196 and 79, 080 galaxies in the WF training and testing datasets respectively. Lastly, we randomly sample 10, 699 galaxies without replacement from the WF testing dataset to construct its final version. We do this to ensure that the number of galaxies in both the DF and WF testing datasets matches, thus enabling us to make a fair comparison when testing our RF models.
The training datasets represent prior information which the RF models utilise in order to make predictions on the test datasets. As a result, one must construct a suitable and representative training data set (as we have done) when using outputs from a ML model in their scientific analysis. In the next section we describe the different RF models, explain the motivation behind them and the implementation of the RF algorithm we use in this work.

MODELS & IMPLEMENTATION
We train and test two different RF models, with redshift and stellar mass as the target variables and the following as input features: • luptitudes • luptitude errors • − , − and − lupticolours, and their associated errors We build the first model using the DF dataset and refer to it as DES-DF from here onwards. The high precision photometry of DF galaxies combined with the accurate redshifts and stellar masses allows us to establish the baseline performance. We build the second model to produce valid posterior PDFs of galaxies in our target dataset (the DES Y3 GOLD) by training on the WF dataset. We refer to this model as DES-WF.
To train and test our RF models, we use the implementation of the algorithm in the Python ML package scikit-learn. In particular, we use the RandomForestRegressor module from the package, which allows us to do regression. Before training, we do not perform feature scaling as the RF algorithm is invariant under monotonic transformations. Furthermore, we do not scale the target features because redshift and stellar mass (in the logarithmic form) have similar ranges. Besides, scikit-learn automatically normalises the variances of individual target variables so that they contribute equally to the loss function.
As previously discussed in Sec. 2.1, RF has hyperparameters which can be tuned to increase the performance of a model. Therefore, we tune our RF models before training using a combination of random search and grid search. We first set up a wide grid of hyperparameters and run the models using 100 different combinations.
Next, we use a grid search around the best hyperparameters found in the previous searches. After tuning, we find that the performance of the models, in terms of the root-mean-square error (RMSE), only improves by 1 − 2%. In principle, one could use metrics associated with the validity of PDFs (described in Secs. 5. 2.1 and 5.3.1). However, we opted for the simple RMSE as we do not believe that there exists a single metric that can fully characterise the performance of a model. Given the insignificant improvements in the performance of our models, we ultimately resorted to using the following default scikit-learn hyperparameters for training both models: , where is the total number of input features, to ensure that our models have sufficient prior information as we are using a limited number of photometric bands to begin with. We train and test both models on a 13" Macbook Pro (2.4 GHz Intel Core i5, 16GB LPDDR3) using GALPRO, and it takes less than 2 and 6 minutes respectively to generate PDFs for 10, 699 galaxies. In the next section we compare, discuss and validate the point estimates, marginal and joint posterior PDFs of redshift and stellar mass of test galaxies estimated from the trained models.

Point Estimates
We extract point estimates by averaging predictions from all the decision trees in a given RF model. In order to quantify how the models are performing, we use the median absolute deviation (MAD) metric for stellar mass and its normalised form for redshift. The MAD is defined as: whereˆand˜are the predicted and 'true' values of redshift and stellar mass of galaxies respectively. The normalised MAD (NMAD) is calculated in a similar fashion with the only difference being that the biasˆ−˜is divided by 1 +˜. Fig. 2 shows the redshifts and stellar masses of test galaxies versus the predictions made by DES-DF and DES-WF. Most of the data points lie close to the diagonal, which indicates that the predicted redshifts and stellar masses are accurate. However, there are outliers at low and high redshifts and low stellar masses. There is a lack of training data available in these regions, as can be observed in Fig. 1. Given the strong correlation between the accuracy of a RF model and the abundance of training data, these outliers are to be expected.
Moreover, the degradation in performance could be due to degeneracies which exist in the colour-redshift space. For example, at < 0.2, there is a lack of strong spectral features that can be detected in the bands. Using the band can break the degeneracies. However, we do not use it as an input feature as the band is not available in the DES data. Furthermore, in the redshift range, 1.2 < < 2.2, there is a lack of strong spectral features in the visible bands (Bolzonella et al. 2000). These degeneracies can lead to incorrect clustering of training galaxies and thus inaccurate point predictions.
Comparing the two models, the point-estimate performance of DES-DF is better than DES-WF, with a NMAD and MAD of 0.04 and 0.15 dex for redshift and stellar mass respectively. There is a visible increase in the scatter in the DES-WF plots, and this is reflected in the values of the performance metric doubling for redshift to 0.08 and increasing by ∼ 73% to 0.26 dex for stellar mass. This drop in performance is primarily due to the degraded photometric precision, which makes it difficult for the RF to cluster galaxies, resulting in inaccurate predictions. Nevertheless, DES-WF still performs well for a significant portion of test galaxies as can be observed.

Marginal Probability Distributions
The point estimates we extracted are not perfect. In general, inaccuracies can arise from: • Incomplete and incorrect information -The information provided to a ML algorithm may not be sufficient to learn the perfect mapping between the input features and target variables. For example, to estimate redshifts to a high degree of accuracy, spectroscopic data are required. However, we use photometric data which only provides a rough sampling of the underlying SED. Furthermore, the data used for training and testing have to be accurate. In our case, the redshifts and stellar masses we use to train our RF models may contain some errors. They have been estimated using the template fitting code LEPHARE, which utilises template SEDs and they may not be a perfect representation of the true SED. Therefore, the mappings learnt by the RFs may not be entirely accurate, and this could lead to the observed errors in the estimates. Furthermore, we predict redshifts and stellar masses using 4 band photometry while those in the COSMOS2015 catalogue are computed using more than 30 bands. Consequently, there will be subtle differences between our predictions and the 'truth'.
• Unrepresentative and incomplete training data -The lack of representative and complete training data can also lead to errors. In our case, the training data is highly likely to be representative. However, in some regions, the data are sparse, and therefore, do not provide a complete sampling of the target population. For example, at low and high redshifts, the number of galaxies available for training reduces dramatically as can be observed in Fig. 1, and this causes the performance of the algorithm to suffer. Furthermore, the effect of sample variance from the small COSMOS area can lead to some incompleteness.
• ML algorithms and hyperparameters -Different ML algorithms learn using different methods. As a result, predictions on the same datum can be slightly different. Furthermore, the hyperparameters can also have an effect, as discussed in Sec. 2.1. However, the performance of ML algorithms suitable for a specific problem generally converges given sufficient and good quality training data.
In order to characterise uncertainties associated with our point estimates, we extract marginal posterior distributions of redshift and stellar mass. We do this by aggregating the redshift and stellar mass values in the leaf nodes of the decision trees in a RF that are representative of the test galaxy in question. We extract the distributions from the trained models and validate them using several techniques and metrics described in the next section.

Marginal PDFs Validation
Unlike point estimates, it is not possible to validate individual redshift and stellar mass PDFs as the true distributions are not available. Consequently, we aim to determine the validity of the marginal PDFs as a whole. We use the framework developed by Gneiting et al. (2007), which is founded on the paradigm of maximising the sharpness of the predictive distributions subject to calibration. Sharpness refers to the concentration of predictive distributions and is a property of the distributions only. The authors describe calibration as the statistical consistency between the distributions and the truth. We refer to this as validation as it better captures the essence of use in our context. However, for consistency, we will use the former when describing the authors' work. In this paper, we focus on calibration to validate the marginal PDFs produced by our models, rather than sharpness, as the latter is useful when ranking competing calibrated methods. Furthermore, as demonstrated by Bordoloi et al. (2010), one could use the framework to empirically recalibrate marginal PDFs. However, this can be challenging and could potentially result in unforeseen issues. Gneiting et al. (2007) introduce three modes of calibration: probabilistic, marginal and exceedance. The first two modes are the most important, and they can be empirically assessed. As a result, we focus on these to determine if the marginal PDFs produced by our models are valid and exclude exceedance calibration in our anal-ysis. Probabilistic calibration can be assessed using the probability integral transform (PIT; Rosenblatt (1952)). It is the cumulative distribution function (CDF) evaluated at its true redshift or stellar mass.
where˜is the 'true' redshift or stellar mass and ( ) is the marginal PDF. If the marginal PDFs are probabilistically calibrated, then the true redshifts and stellar masses should be random draws from their respective distributions. This statement is equivalent to requiring that the CDF evaluated at the true redshift should not have a preferred value. In this case, for an ensemble of galaxies, the distribution of PIT values should follow the standard uniform distribution ( (0, 1)) (Dawid 1984), i.e. one percent of galaxies should have their spectroscopic redshifts found within the first percentile of their CDFs, and so on. Deviations from uniformity can be interpreted as follows. If the marginal PDFs are overly broad, then fewer objects will have true redshifts in the tails of their PDF, instead being closer to 0.5, and the PIT distribution will be convex shaped. Conversely, if they are overly narrow, then the PIT distribution will be concave shaped. Finally, if the PIT distribution has a gradient, then this means that the marginal PDFs are biased. In the past, the PIT distribution has been utilised to determine the validity of red- shift PDFs (e.g. Bordoloi et al. 2010;Polsterer et al. 2016;Tanaka et al. 2018;Schmidt et al. 2020;Deprez et al. (submitted)). The uniformity of the PIT distribution is a necessary condition for marginal PDFs to be valid. However, Hamill (2000) has shown that uniformity can also arise from biased distributions. Therefore, probabilistic calibration may not be sufficient in some cases, and marginal calibration may be required to reach a concrete conclusion. Marginal calibration is associated with the equality of the predicted and true distributions of redshift and stellar mass. Specifically, the average predictive CDF (ˆ) is compared to the true empirical CDF (˜).
where is the number of test galaxies, is the predicted CDF,ĩ s the true redshift or stellar mass of a galaxy and 1 is the indicator function, defined as: If the PDFs are marginally calibrated, then the average predictive CDF should equal the true empirical CDF. To assess probabilistic calibration, we check the uniformity of the PIT distributions visually and use quantile-quantile (Q-Q) plots to highlight deviations. In a Q-Q plot, the quantiles of one distribution are plotted against the quantiles of another distribution. In our case, these are the PIT and (0, 1). If the two distributions are identical, then the quantiles match and lie along the diagonal. Furthermore, we use several metrics to quantitatively determine the uniformity of the PIT distributions (Schmidt et al. 2020) such as the Kullback-Leibler (KL;Kullback & Leibler 1951) divergence, Kolmogorov-Smirnov (KS;Shiryayev 1992) test and Cramér-von Mises (CvM;Cramér 1928) test. All of these metrics measure the similarity between two distributions in different ways. The KL divergence is defined by the following integral: where ( ) and ( ) are the reference ( (0, 1)) and target (PIT) PDFs, respectively. The KS test is a non-parametric test and is the maximum distance between the empirical distribution function ( ( )) and the cumulative distribution function ( ( )) of the reference distribution.
where is the supremum of the set of distances. The CvM is an alternative to KS test and is more sensitive to the edges of a distribution.
A value of zero for the different metrics indicates that there is a perfect match between the two distributions. Fig. 3 shows the redshift and stellar mass PIT distributions and Q-Q plots for the models. The dashed black line represents the quantiles of (0, 1), and the quantiles of the PIT distributions are shown using the solid blue curves. The values of the metrics, along with the percentage of catastrophic outliers, are also indicated. We define a catastrophic outlier to be any galaxy for which the true value of redshift or stellar mass is completely outside the support of its marginal PDF.
Visually, the PIT distributions of DES-DF and DES-WF appear to be uniform, and this is reinforced by the quantiles of the PIT distributions lying close to the diagonal in the Q-Q plots, if not on it. Consequently, at first glance, both models seem to be performing equally well. However, on closer inspection, subtle differences can be observed in the PIT distributions. The PIT distributions of DES-DF are more uniform compared to those of DES-WF, and the main difference arises at the edges. Specifically, the PIT distributions of DES-WF are slightly concave-shaped as indicated by the minor deviations in the Q-Q plots at the extremes and quantitatively confirmed by the significantly larger CvM criterion values. Hence, the marginal PDFs produced by DES-WF are somewhat overly narrow or underdispersed. Taking into account the degraded photometry, DES-WF is still performing admirably with only small increases in the number of catastrophic outliers compared to DES-DF. Overall, both models are producing probabilistically calibrated marginal PDFs and performing at an unprecedented level.
To assess marginal calibration, we plot the difference between the average predictive and true empirical CDFs of redshift and stellar mass at regular intervals in their respective ranges. If the PDFs are marginally calibrated, then only minor fluctuations about the zero line are expected. Fig. 4 shows the redshift and stellar mass marginal calibration for the models. There are negligible fluctuations about the zero line, with maximum deviations of ∼ 0.005. Therefore, both models are producing marginally calibrated redshift and stellar mass PDFs, with DES-DF performing marginally better with a smaller average deviation compared to DES-WF. To summarise, the marginal PDFs are both probabilistically and marginally calibrated, thus giving us confidence that they are valid. Finally, in the next section we analyse and perform validation checks of the joint redshift -stellar mass posterior distributions.

Joint Probability Distributions
In general, a joint PDF encompasses more information than its marginals. Therefore, we extract joint redshift -stellar mass PDFs of test galaxies from DES-DF and DES-WF. We build the distributions by combining the aggregated values of redshift and stellar mass in the leaf nodes across all the decision trees. Fig. 5 shows some examples of the joint PDFs of the same test galaxies produced by the models. The gold and white stars alongside the dashed lines indicate the 'true' and predicted redshifts and stellar masses respectively. We remind the reader that the predicted redshifts and stellar masses are computed by averaging the predictions from all the decision trees in a RF. Visually, the joint PDFs of the same test galaxy look alike and occupy similar regions of the redshift-stellar mass space. However, the joint PDFs produced by DES-WF are more spread out compared to the ones produced by DES-DF, or in other words, the probability is more dispersed. This is a reflection of the degraded photometry in the WF dataset. Overall, we do not expect the joint PDFs of the same galaxy to resemble each other perfectly as both models have been trained using different datasets.

Joint PDFs Validation
It is more challenging to validate joint PDFs compared to marginal PDFs as the relatively straightforward methods adopted to validate the latter are no longer applicable. As a result, we use the multivariate extensions of probabilistic and marginal calibration developed by Ziegel & Gneiting (2014) to validate joint PDFs in our case. These are probabilistic copula calibration and Kendall calibration respectively. Probabilistic copula calibration can be empirically assessed by using the copula probability integral transform (copPIT): Figure 5. Examples of joint redshift -stellar mass PDFs produced by the DES-DF and DES-WF models of the same test galaxies (in rows). The gold and white stars alongside the dashed lines represent the 'true' and predicted redshifts and stellar masses of the galaxies respectively. The predicted redshifts and stellar masses are computed by averaging the predictions from all the decision trees in the individual RFs. The green circles indicate the values of redshift and stellar mass in the leaf nodes that are representative of the test galaxies. Figure 6. copPIT distributions for the DES-DF and DES-WF models. They are overlaid with Q-Q plots to aid in visually assessing the probabilistic copula calibration of joint redshift -stellar mass PDFs of test galaxies. The dashed black and solid blue lines represent the quantiles of (0, 1) and copPIT distributions respectively. The percentage of catastrophic outliers along with the values of the Kullback-Leibler (KL) divergence, Kolmogorov-Smirnov (KS) test and Cramér-von Mises (CvM) metrics are also stated to quantify uniformity of the copPIT distributions. We define a catastrophic outlier to be any galaxy that is completely outside the support of its marginal PDFs. Probabilistic copula calibration is the multivariate analogue of probabilistic calibration.

Figure 7.
The difference between the 'average Kendall distribution function' (K ) and the empirical CDF of the predicted joint CDFs evaluated at the 'true' redshifts and stellar masses (˜), plotted at regular intervals in the probability space ∈ [0, 1]. This diagnostic plot is used to assess the Kendall calibration of the joint PDFs produced by the DES-DF and DES-WF models. Kendall calibration is the multivariate analogue of marginal calibration.
where (˜) is the joint CDF evaluated at the true redshift and stellar mass, and K is the Kendall distribution function, defined as: where ( ) is the predicted joint CDF and ∈ [0, 1]. Simply put, the Kendall distribution function is the CDF of ( ). For marginal PDFs, it corresponds to the standard uniform distribution and the copPIT coincides with the PIT. To assess Kendall calibration, we compare what we refer to as the 'average Kendall distribution function' (K ) to the empirical CDF of the predicted joint CDFs evaluated at the 'true' redshifts and stellar masses (˜).
Probabilistic copula calibration and Kendall calibration can be interpreted in the same manner as their univariate counterparts. As such, probabilistic copula calibration ascertains if the 'true' redshifts and stellar masses of galaxies are random draws from their corresponding joint PDFs, as they should be. If this is the case, then for an ensemble, the copPIT distribution is uniform, and the joint PDFs are probabilistically copula calibrated. On the other hand, Kendall calibration probes how well the dependence structure between redshift and stellar mass is predicted on average, and can be understood as marginal calibration of the Kendall distribution. IfK is comparable to˜, then the joint PDFs are Kendall calibrated. Once again, if both modes of calibration are satisfied, then we can claim with some conviction that the joint PDFs are valid overall. Furthermore, we would like to point out that while we use probabilistic copula calibration and Kendall calibration to validate our joint redshift -stellar mass PDFs, they can be applied to validate higher dimensional PDFs also. Fig. 6 shows the copPIT distributions for the DES-DF and DES-WF models. The distributions are uniform with minor deviations which are more prominent for DES-WF. Overall, both models are performing well with no substantial differentiation and producing joint PDFs that are probabilistically copula calibrated. Furthermore, in comparison to the PIT distributions in Fig. 3, the copPIT distributions of DES-WF are somewhat less uniform as primarily reflected by the large CvM value. Hence, the marginal PDFs produced by the model are better probabilistically calibrated than the joint PDFs. Fig. 7 shows the difference betweenK and˜at regular intervals in the probability space . For DES-WF, the fluctuations about the zero line are smaller compared to those for DES-DF, thus indicating that the joint PDFs produced by the former are better Kendall calibrated. We believe that DES-WF is better capturing the redshift-stellar mass dependence structure as it is trained using the WF dataset which contains multiple scattered copies of the same DF galaxies, resulting in better incorporation of photometric errors present in the data into the model. Collectively, the joint PDFs are less marginal/Kendall calibrated compared to the marginal PDFs as the deviations are larger in magnitude. However, we hypothesize that the deviations in the Kendall calibration are not significant given the complex nature of joint PDFs, and to prove this, we compare our results to those achieved by the template fitting code BAGPIPES in the next section.

TEMPLATE FITTING COMPARISON
The different diagnostic plots and the metrics we utilise to validate the marginal and joint PDFs produced by our RF models are difficult to fully appreciate without familiar context. Consequently, we utilise Bayesian Analysis of Galaxies for Physical Inference and Parameter EStimation, or BAGPIPES (Carnall et al. 2018) to benchmark our results. BAGPIPES is a Python package which uses MultiNest (Feroz & Hobson 2008, Feroz et al. 2009 nested sampling algorithm to model the emission from galaxies and to fit these models to any combinations of spectroscopic and photometric data in order to output multivariate posteriors distributions of parameters such as redshift and stellar mass, hence making it ideal for comparison.
The photometry in the COSMOS2015 and DES Y3 DF catalogues have been calibrated independently of one another. So although we can expect them to be broadly consistent, it is possible that small differences in absolute calibration between the two remain. Even minor offsets in the calibration baseline may have a significant impact on the stellar mass posterior PDFs produced using BAGPIPES with respect to COSMOS2015, and perhaps also some subtle effects in redshift. Accordingly, validation of the PDFs using the point predictions in the catalogue would not be appropriate. To solve this dilemma, we run BAGPIPES on Subaru , , + and + + bands' photometry from the catalogue in place of the DES DF bands. We specifically choose these bands in order to imitate the DES bands as far as possible and therefore allow for an adequate comparison between the template fitting method and our ML-based method. Although this does not match exactly the degradation in the information provided to the RF, it is nevertheless very similar as we measure PDFs using four optical bands instead of the 30-plus bands available in the catalogue. Importantly, however, we avoid introducing any possible systematic effects that could arise from inter-dataset calibration differences.
The model templates used by Laigle et al. (2016) cannot be exactly reproduced in BAGPIPES. It is important for the validity of our comparison that the 4-band PDFs and the truth values are constructed under the same set of model assumptions. Therefore, we produce a new set of truth values using the 22 COSMOS bands (including the 4 aforementioned) listed in Tab. 1. In both the 4-band and 22-band runs, we employ the same physical information about the model as outlined in Tab. 2. These choices were made to closely mimic the set-up adopted by Laigle et al. (2016) to compute the redshifts and stellar masses in the COSMOS2015 catalogue so that we can make a fair comparison. There are, however, slight differences which we cannot negate, and as such, a direct comparison is not possible. Nevertheless, they are mostly similar, and the aggregate metric results should be comparable. We compute total COSMOS flux and flux errors from those measured in a 3" diameter aperture, and correct for photometric and systematic offsets, and foreground galaxy extinction, before initiating the runs. We define the true values of redshift and stellar mass from the 22-band run to be the mean predictions for each galaxy. Finally, we extract marginal and joint PDFs of redshift and stellar mass from the 4-band run and validate them using these new 'truth' values. We utilise a total of 14 nodes for both runs, with each node consisting of 12 Xeon X5660 cores and 16GB of random-access memory (RAM). The runs take approximately 900 and 1, 400 hours to generate PDFs for 10, 699 galaxies respectively. Naturally, we only run BAGPIPES on test galaxies in the DF dataset.
Template fitting with four bands is known to be difficult due to degeneracies in the parameter space (see Renzini 2006, for a review). To compensate, authors sometimes restrict the parameter space, for example, by neglecting dust extinction to improve results (e.g. Capozzi et al. 2017), and this amounts to a hard prior in the galaxy population. By design, RF includes an implicit prior built from the training data. We approximate the effect of this prior by applying a 2d population prior formed from the redshifts and stellar masses in the 'truth' catalogue to the PDFs estimated by BAGPIPES using the 4-band photometry. To apply the prior, we fit a kernel density estimate (KDE) to the 'true' redshifts and stellar masses. We use 1% of the total number of point predictions to fit this prior, and this equates to ∼ 200, 000 data points. Next, we compute the prior probability density at each redshift -stellar mass sample point output by the BAGPIPES nested sampling (with 4-band photometry). We produce a smoothed posterior of these points, weighted by the prior probability, via another KDE. Finally, we draw 1000 importance samples from this smoothed posterior. We repeat this process for all the galaxies.
We explored the possibility of applying a full 6d prior because, in principle, it should further improve the results. However, doing so caused a large number of galaxies to become catastrophic outliers. It is beyond the scope of this work to go through the painstaking process of carefully optimising a high-dimensional prior, as we simply want a comparison that assists the reader's intuition in interpreting the result from our RF models. Nevertheless, we still had a considerable percentage (6 − 7%) of catastrophic outliers even with our 2-dimensional prior. These outliers can skew the performance in terms of the metrics we have chosen and can often be treated separately in scientific analyses. Hence, we remove these outliers and then perform the different calibration checks to better gauge the performance of the population at large. Fig. 8 shows the PIT and the copPIT distributions alongside the marginal and Kendall calibration plots from the analysis, and for comparison, they are overlaid with results from the DES-DF model, labelled as GALPRO. The PIT distributions are not uniform and indicate biased marginal PDFs for the galaxy population, and this correlates well with the marginal calibration plots which have large fluctuations about the zero line. Nevertheless, the marginal redshift PIT distribution is competitive with template fitting approaches used in code comparison works, e.g. Schmidt et al. (2020, Fig . 2), Deprez et al. (submitted, Fig. 7). However, these studies use deeper data than in this work. Unsurprisingly, a small number of joint PDFs are also biased as reflected by the non-uniform copPIT distribution. Despite the biased PDFs, BAGPIPES does manage to capture the dependence structure between redshift and stellar mass on a similar level to that achieved by the RF. On the whole, RF  Table 2. Fixed and fitted parameters with their associated priors for the delayed exponentially declining ( −2 − / ) star formation history (SFH) model used in the BAGPIPES runs. The model is not readily available in BAGPIPES, so we lightly modify the code to meet our requirements. We adopt the Calzetti et al. (2000) attenuation curve, stellar population synthesis (SPS) models of Bruzual A. & Charlot (1993) and a Kroupa & Boily (2002) initial mass function (IMF). is the attenuation in the V-band, is the star formation timescale, is the metallicity, is the ionisation parameter, is the lifetime of H II regions and is a constant which controls the extra attenuation towards them. Figure 8. Comparison diagnostic plots for benchmarking the performance of GALPRO on test galaxies in the DF dataset against that of BAGPIPES on a comparable dataset, which is composed of the same galaxies but with Subaru photometry in 4 bands ( , , + and + +) from the COSMOS2015 catalogue. The marginal and joint PDFs of redshift and stellar mass produced by BAGPIPES are validated using a 'truth' catalogue constructed by running BAGPIPES on photometry in 22 COSMOS bands listed in Tab. 1. outperforms BAGPIPES on the metrics we have considered in our analysis. Having said that, it should be possible for BAGPIPES to match the performance of the RF through judicious use of priors and great care in photometric calibration. A great advantage of the RF is that the large effort that would be required to do so is not necessary. An implicit prior is automatically applied, transferring information from the rich training data set to our target data.
To summarise, we have benchmarked the performance of GAL-PRO against BAGPIPES, and by doing so, we have been able to place our results into context. We have found that our ML-based method performs better in every aspect compared to a template fitting method that employs a fairly standard set-up. Thus, we have confidence that our models are producing valid marginal and joint posterior probability distributions, based on the different calibration modes and metrics we have employed in our analysis.

CONCLUSIONS
The emergence of template fitting methods with the capability of generating multivariate probability distribution functions (PDFs) of redshift and physical properties of galaxies represents a paradigm shift. These PDFs account for potential correlations between different galaxy properties and fully characterise uncertainties associated with point estimates of the quantities. However, with their potential benefits, comes the task of generating them quickly, which is difficult given their complexity. For example, the template fitting code BAGPIPES takes a few minutes to fit each galaxy. While this may not seem significant, the amount of time required to generate them for hundreds of thousands of galaxies, let alone the billions that will be observed with the upcoming photometric surveys such as the LSST and Euclid, quickly becomes impractical. Coupled with the difficulty of storing such PDFs, a solution that enables on-the-fly production at speed is greatly desirable.
In this work we tackle the problem by using a machine learning (ML) based approach. We introduce a novel method based on the Random Forest (RF) algorithm to generate joint PDFs. As an example, we generate PDFs for the probability space in redshift and stellar mass, as they are two of the most important to accurately predict. Our method can be generalised to extract n-dimensional PDFs. However, we focus on this specific two-dimensional space as it is easy to visualise and exhibits well-known correlations between the properties.
To demonstrate the method, we train two RF models to produce joint PDFs of galaxies in the DES deep field (DF) and the main wide-field (WF) DES survey respectively. We separately combine the COSMOS2015 catalogue, with the DES Y3 DF and the Y3 Balrog to construct the necessary datasets, which contain 53, 941 and 392, 376 galaxies respectively. From the trained models, we extract point estimates, marginal and joint PDFs of 10, 699 test galaxies. We then proceed to determine the validity of both sets of PDFs, and for this, we utilise the notions of probabilistic copula calibration and Kendall calibration to validate the joint PDFs and their univariate counterparts to validate the marginals. We highlight in particular the advantage of incorporating realistic photometric errors into the RF has on Kendall calibration. We benchmark our results against those achieved by BAGPIPES, adopting a basic set up and simple population-derived prior in redshift and stellar mass, to provide some context to the metric values and guide our intuition. We find that our ML-based method is producing valid PDFs with only small calibration errors, and performs at a superior level on every metric we consider in our analysis compared to BAGPIPES.
Despite the success of our method, template-fitting approaches such as BAGPIPES undoubtedly still have a vital role to play in building the training samples for ML-based codes.
To conclude, joint redshift -stellar mass PDFs have many potential science applications from determining the evolution of the stellar mass function (SMF), to constraining the stellar-to-halo mass relation (SHMR). Consequently, we have developed GALPRO, a highly intuitive and efficient Python package for rapidly generating n-dimensional PDFs on-the-fly, thus solving the potential issue of storage. We have trained and tested our RF models using GALPRO on a 13" Macbook Pro (2.4 GHz Intel Core i5, 16GB LPDDR3) and found that, at best, it takes on average a few milliseconds to generate a PDF. Thus, GALPRO can potentially offer a 10, 000 reduction in run time compared to packages based on template fitting methods, making it ideal for the impending era of 'big data'. Of course, one must ensure that the training data set is representative and suitable for their scientific analysis to fully reap the benefits of GALPRO. This paper has been typeset from a T E X/L A T E X file prepared by the author.