Inferring halo masses with Graph Neural Networks

Understanding the halo-galaxy connection is fundamental in order to improve our knowledge on the nature and properties of dark matter. In this work we build a model that infers the mass of a halo given the positions, velocities, stellar masses, and radii of the galaxies it hosts. In order to capture information from correlations among galaxy properties and their phase-space, we use Graph Neural Networks (GNNs), that are designed to work with irregular and sparse data. We train our models on galaxies from more than 2,000 state-of-the-art simulations from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project. Our model, that accounts for cosmological and astrophysical uncertainties, is able to constrain the masses of the halos with a $\sim$0.2 dex accuracy. Furthermore, a GNN trained on a suite of simulations is able to preserve part of its accuracy when tested on simulations run with a different code that utilizes a distinct subgrid physics model, showing the robustness of our method. The PyTorch Geometric implementation of the GNN is publicly available on Github at https://github.com/PabloVD/HaloGraphNet


INTRODUCTION
In 1933 Fritz Zwicky found out that the mass of the Coma cluster should be much larger than the one from its luminous component (Zwicky 1933). That finding pointed out to the existence of an unknown type of nonluminous matter: dark matter. The requirement of unseen matter was later supported by the observation of rotation curves of galaxies (Rubin et al. 1978;Bosma 1978). Nowadays, there is overwhelming evidence for pablo.villanueva.domingo@gmail.com fvillaescusa@flatironinstitute.edu the existence of dark matter, although we are still ignorant of its fundamental properties (Young 2017).
We now believe that dark matter is the backbone of the distribution of matter in the Universe: it concentrates in high-density regions called halos, that are connected by thin filaments with intermediate density and surrounded by gigantic regions with low density (voids). It is within halos that gas can cluster, cool, and form stars and galaxies (Somerville & Davé 2015). Dark matter halos are therefore the environment where galaxies reside.
Understanding the halo-galaxy connection is fundamental to improve our knowledge on the nature and properties of dark matter. There are two possible directions to take in the halo-galaxy connection. On one hand, given a dark matter halo and its properties and environment, predict the number and distribution of galaxies it hosts. This task is fundamental in order to create galaxy mocks with the correct clustering on all scales needed for forward modeling approaches (Wechsler & Tinker 2018). On the other hand, given a set of galaxies, it may be useful to determine some properties of their host halo such as its mass, spin, and concentration. This task is fundamental to derive cosmological constraints from the abundance of dark matter halos.
All the above techniques do not make use of all available information. For instance, the abundance matching technique only considers the total stellar mass in the system, disregarding information about its clustering state. In this work we attempt to build a model that can use all available information from observations and/or simulations, e.g. phase-space information, stellar masses, galaxy sizes, to infer the mass of the halo hosting the galaxies. For this, we made use of neural networks and their capacity as universal function approximators.
Different machine learning (ML) algorithms have been already used to perform this task. For instance, Convolutional Neural Networks (CNNs) have been already applied in order to predict dynamical galaxy cluster masses Ho et al. 2019;Kodi Ramanah et al. 2020, 2021Yan et al. 2020;Gupta & Reichardt 2020;de Andres et al. 2021), as well as other ML techniques (Ntampaka et al. 2015(Ntampaka et al. , 2016Green et al. 2019;Armitage et al. 2019;Haider Abbas 2019). Galaxy and subhalo masses can be inferred from different subhalo properties via multilayer perceptrons (Shao et al. 2021) or other ML algorithms (von Marttens et al. 2021), as well as from images with CNNs (de los Rios et al. 2021). Other works have been employed different ML algorithms to predict the mass of a halo from the properties of the halo and galactic group (Man et al. 2019;Calderon & Berlind 2019;Lucie-Smith et al. 2020). These ML-based approaches have been shown to outperform other traditional techniques to infer halo masses. However, some of these works make use of N-body computations plus semianalytical galaxy formation models, rather than more accurate hydrodynamical simulations. Moreover, several of the features employed may not be easily observable, which could complicate their applicability to real data.
Although these approaches make use of global properties of a halo as well as individual features of the galaxies, they do not incorporate explicitly the relationship between galaxies or subhalos, either in the form of clustering in configuration space and/or distribution in phase-space. In this article, we aim at predicting halo masses exploiting the halo-galaxy connection, using a novel method based on Graph Neural Networks (GNNs). This type of neural network shares the typical training procedure of other deep learning techniques, but it is applied to data structured in the form of mathematical graphs. To understand their significance, it is useful to compare them to other deep learning frameworks. CNNs are mostly employed with regular data (grids), such as images and 3D grids; CNNs automatically account for traslational invariance. Recurrent neural networks, on their side, are designed to treat sequential data, such as chains of characters in natural language or time series. However, GNNs can be applied when dealing with irregular data, where data points may have arbitrary relations. 1 That is the case of point clouds, as a galaxy catalogue can be regarded. GNNs have been successfully employed in different fields such as chemistry, computer vision, natural language processing, social networks or particle physics (Wu et al. 2019;Bronstein et al. 2021). There are already some applications of GNNs in cosmology, for instance in order to perform symbolic regression (Cranmer et al. 2019;Cranmer et al. 2020), to predict the redshift of galaxies (Beck & Sadowski 2019), or to allocate resources in an unsupervised way in order to select galaxies (Cranmer et al. 2021). GNNs have also been applied in other physics fields, such as particle physics (Shlomi et al. 2020), but still represent a novel, promising and mostly unexplored way to extract information from irregular data.
We train our GNNs using galaxies from simulations of the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project (Villaescusa-Navarro et al. 2021a) to extract information from galaxy properties and their phase-space distribution. Since CAMELS contains thousands of state-of-the-art (magneto-)hydrodynamic simulations with different values of the cosmological and astrophysical parameters, our method accounts for uncertainties in cosmology and astrophysics. Furthermore, since CAMELS contains two different suites of hydrodynamic simulations run with two different codes that employ different subgrid physics, we can quantify the robustness of our results to astrophysical uncertainties. Ultimately, we would like to build a tool that can infer the mass of galaxy systems, like our own Milky Way and Andromeda, just from the observed properties of those galaxies and their satellites. Knowing the mass of the host dark matter halos of those system will allow us to make consistency checks within the ΛCDM model.
The article is structured as follows. We start by reviewing the basics on graphs and GNNs in Sec. 2. In Sec. 3 we describe the data we use to train our model together with an outline of the training procedure. The main results of this work are presented in Sec. 4. Some aspects of the interpretability of the GNNs are examined in Sec. 5, followed by a discussion of the main conclusions in Sec. 6.

GRAPH NEURAL NETWORKS
In this section we review the basics of graph neural networks, firstly summarizing the fundamentals of graphs, subsequently detailing how to build graphs from the galaxies of halos, and finally introducing how to build a neural network on a graph based on the message passing scheme. We refer the reader to Bronstein et al. (2021); Battaglia et al. (2018); Hamilton (2020) for comprehensive references on graph neural networks and geometric deep learning.

General concepts on graphs
We start by discussing some generic concepts of graphs and standard definitions, since some astrophysicists, cosmologists and, ML practitioners may not be familiar with the terminology. A graph can be defined as a tuple G = (V, E), where V denotes the set of nodes, and E ⊆ V × V the set of edges. For two nodes of the graph i, j ∈ V, there is an edge connecting them if (i, j) ∈ E. Nodes i, j are thus coined as neighbors. The connectivity can be described by the adjacency matrix A ij , which takes value of 1 if the pair (i, j) is connected by an edge, being 0 otherwise. A graph is undirected when, if (i, j) ∈ E, then (j, i) ∈ E too. A loop is an edge which connects a node to itself. We restrict our discussion to simple graphs, i.e., undirected graphs without loops, since these are enough for our purposes (although self-loops can be implicitly employed in some GNN architectures).
Given a node i, we denote its feature vector as x i ∈ R nin , encoding the relevant physical information about the node, with n in the number features. The feature where |V| is the number of nodes in the graph, comprises the feature vectors of all nodes. The neighborhood of the node i ∈ V, denoted by N i , includes every node which shares and edge with i, and can thus be written as Note that N i does not include the node i, since we dismiss loops. Analogously, we can define the set of feature vectors of the neighbors of the node i as A graph is complete if every pair of nodes is connected by an edge, having thus |V| 2 = |V|(|V| − 1)/2 edges in total. 2 The edges set can hence be written as E = V × V \ {(i, i) ∈ V} (excluding loops). As will be shown later, most of the halos employed here lead to complete graphs. 3

Halos as graphs
The general terminology discussed above allows us to set up our problem. Consider a halo h with mass M h which hosts several galaxies. We build a graph, denoted by G h , by considering all the (central and satellite) galaxies of the halo as the nodes of the graph. The feature vector of a node i, x i , will include information about the corresponding galaxy, namely the 3D comoving position, p, the stellar mass, M * , the modulus of the velocity, v, and the stellar half-mass radius (i.e., the comoving radius containing half of the stellar mass in the galaxy), R * . These galactic properties are listed on Table 1. Positions are expressed with respect to the center of the halo (chosen as the point with minimum

Message passing
Global pooling Sketch of a GNN with one message passing layer acting on a graph G h composed of 5 nodes. During the message passing step, node b receives a message from each of its neighbors j ∈ N b , aggregating and updating its features to the hidden vector h b (red). Once this step is done over all nodes, a global pooling step is performed, aggregating the hidden node features over all vertices of the graph G h , which leads to the global target y (green). Figure based on https://github.com/PetarV-/TikZ. gravitational potential energy), while velocities are written relative to the center of mass frame of all particles. 4 The mass of the halo is taken as the total mass of all components (dark matter, gas, stars, and black holes) within a virial radius enclosing a density 200 times the critical density, usually known as M 200c . Unlike other graph applications, such as, e.g., social networks, we are dealing with distributions of particles not physically connected among them (sometimes known as point clouds in graph contexts), and thus the edges are not predetermined. There is not an unique way to build the edges between the nodes. Since gravity is a long-range force, every galaxy is affected by the rest, and all the nodes should be connected in some way. However, gravitational effects, such as tidal forces, are stronger for shorter distances. Hence, one can expect that the dynamics and properties of a given galaxy will be more affected by nearby companions rather than by distant ones. Different approaches can be followed to connect the graphs and build the neighborhood of each node. In our case, we create edges from a node taking into account the neighbors within a fixed certain radius, which is taken as a hyperparameter. Its specific value is given from the optimization outlined in Sec. 3.2. A similar procedure could be considering the k nearest neighbors for each node. However, this approach does not properly reflect the effect of clustering. It may miss some neighbors in a clustered region which should be connected, and include others much farther away, which in principle should have less influence in the node. We have checked that performance slightly worsens using this approach instead of considering nodes within a certain distance. It is worth it to remark that we only consider galaxies that are part of the same halo, excluding thus interlopers and explicit relations between halos.
Among the available halos in the CAMELS catalogues, we work only with those containing more than one galaxy, excluding therefore halos without satellites. The halos excluded, typically very low mass halos, form trivial single-node graphs, where the advantages of graph structured data with edge connections between nodes is absent and therefore are not very interesting for our purposes here. We have explicitly checked that by including those halos without satellites our results worsen. Note also that this fact limits the application of our method to real observations to halos with more than one known satellite galaxy.

Message passing
GNNs employ the so-called neighborhood aggregation or message passing scheme. 5 Each node i receives a message from each of its neighbors j, appending its information to its own features. This makes it possible to create a hidden feature vector h i , which updates the node and contains aggregated data from its neighborhood. From the hidden vectors, our GNN is aimed at inferring the mass of the halo, which is a global property of the graph. The architecture is thus composed of two main steps: 1) update nodes via message passing, and 2) extract global information of the graph. We designate as Graph Layer, GL, to a generic message passing layer, which can be written as where h i is the output feature vector, denote a differentiable, permutation invariant aggregation function, such as the maximum, the mean or the sum, while ψ is the message function, a differentiable function, like a Multi Layer Perceptron (MLP), which depends upon the feature vector of the node, x i , as well as on those from its neighbors, x j , for j ∈ N i . The distinct ways to build ψ lead to different graph layers. 6 The function ψ maps ψ : R nin → R nout , with n in and n out the number of initial and output features, respectively. As shown in Sec. 2.4, n in may not correspond to the number of features considered of each node. A common advantage of GNNs is that they exploit locality of data, since closer nodes may present stronger relations than those farther away. Here, locality is enforced by assuming that the functions ψ are the same for all nodes and graphs. 7 The final GNN incorporates a last step to infer the global graph target, and thus can be of the form where φ is a differentiable function, chosen as a MLP with three hidden layers with 300 channels separated by ReLu activation functions, and y is the output of the network. In our case, since we perform likelihood-free inference to estimate the mean and standard deviation of the posterior of the mass of the halo, the target is given by a vector of two components y = [y, σ], containing the mean prediction y and its expected standard deviation, σ. The targeted quantity is the logarithm of the mass of the halo, y = log 10 [M h /(M /h)]. The likelihood-free approach to estimate the posterior mean and standard deviation is detailed in Sec. 3.2. The equation above can be easily modified to include some global graph quantities u, which may help to train the network, such as the number of nodes or the total stellar mass, writing . We include these global features in our graphs, although we have checked that removing them does not leave a significant impact in the results. Figure 1 shows an sketch of this basic architecture, illustrating the message passing scheme. Permutation equivariance and invariance are key concepts in GNNs. We say that a function acting on the node feature matrix f (X) is permutation invariant if for every permutation matrix P, one has f (PX) = f (X), leaving thus unchanged the output. On the other hand, f is permutation equivariant if it transforms as f (PX) = Pf (X). 8 A message passing layer GL(x i , X i ) should be permutation equivariant, since a reordering of the input nodes produces the same permutation in the outputs, although the output feature space can be different. However, any global quantity of the graph such as the final output of the network, the halo mass, must be permutation invariant, since its value should not depend on the ordering of the nodes. The final aggregation step, i∈G h GL(x i , X i ), can be regarded as a function f (X) which fulfills by construction permutation invariance, f : R |V|×nout → R nout . In this case, the aggregation may be termed as a global pooling layer, since it reduces dimensionality. Note that the aggregation method here does not have to be the same as in the message passing layer. Furthermore, it is possible to employ different aggregation operators, such as maximum, sum, and mean, and concatenate them, which is the approach used in this work. The last MLP φ makes use of n out global features of the graph in order to extract a global property vector y, the halo mass and its expected standard deviation in our case, φ : R nout → R 2 . Finally, we can apply multiple successive GNN layers, the updated feature vector of the neighbors. In such a case, the node after k updates will encode information from the nodes of its k-hop neighborhood. We take the number of message passing layers as an optimizable hyperparameter, although a single GNN layer is sufficient to provide the best results, as shown in Sec. 3.2.

Graph layer architecture
So far we have discussed the general shape of the GNN, but the explicit design of the message passing layer remains undetermined. There are many possible ways to construct the specific architecture of the GNN layer, which is defined by the message function ψ and the aggregation scheme . For instance, one could not take into account neighbors to update each node, and employ only the information of the node x i to extract the hidden information h i . These types of architectures, a subset of GNNs, are known as DeepSets (Zaheer et al. 2017). No aggregation function is thus needed to update the node features, and the hidden layer can just be written as h i = ψ(x i ).
However, to fully exploit the relations among the nodes, it is useful to incorporate neighborhood and edge information actually performing message passing. Here we assume an architecture based on the edge convolutional layer, coined as EdgeNet (Wang et al. 2019), where for each neighbor j of the node i, the relative vectors x i − x j are concatenated to the feature vector x j . The aggregation function is chosen as the maximum, writing thus the hidden layer as where the square brackets indicate array concatenation. In this case, the initial input number is given by n in = 2n feat , where n feat is the number of features. The differentiable function ψ is taken as a MLP with three hidden layers, with 300, 300 and 100 hidden channels, separated by ReLu activation functions. We have checked that other choices of the number of channels do not improve the preditions of the net. There are other popular architectures, such as Point-Net (Qi et al. 2017a,b), which employs only the relative spatial positions for the message passing, p i − p j , rather than the full feature vector, or the Graph Convolutional Network (Kipf & Welling 2017), where the neighbor information is simply incorporated by taking a linear combination of the features and summing over all the neighbors, imitating a convolution operation. We have checked that the EdgeNet (Eq. 5) outperforms the other architectures mentioned above via a hyperparameter optimization procedure, as commented in Sec

METHODS
In this section, we specify the details regarding the data employed and the training of the network.

The CAMELS simulations
The CAMELS project (Villaescusa-Navarro et al. 2021a, 2022) comprises a set of state-of-the-art hydrodynamic and N-body simulations, specially suited and designed to train and test ML algorithms. They include thousands of realizations varying two cosmological parameters, namely the matter density parameter Ω m , and the variance of the linear field on 8 Mpc/h at z = 0, σ 8 , plus four astrophysical parameters controlling the efficiency of supernovae and active galactic nuclei (AGN) feedback. The rest of cosmological parameters are kept fixed to standard values in flat cosmologies with cosmological constant: the baryonic fraction Ω b = 0.049, the reduced Hubble constant h = 0.6711, the tilt of the primordial power spectrum n s = 0.9624, the equation of state of the dark energy (corresponding to a cosmological constant) w = −1, and the total neutrino mass M ν = 0 eV. Each simulation follows the evolution from z = 127 to z = 0 of 256 3 DM particles and 256 3 gas resolution elements within a box of size periodic volume of size 25 Mpc/h. The DM particle mass resolution is ∼ 10 8 M /h. A collection of different cosmological fields in form of 2D maps and 3D grids from the CAMELS simulations is available as the CAMELS Multifield Dataset (CMD) 10 , intended to be a standard cosmological dataset for ML applications (Villaescusa-Navarro et al. 2021b), while the full dataset has been recently made publicly available (Villaescusa-Navarro et al. 2022).
The CAMELS project includes two suites of simulations, with different astrophysics modeling and subgrid physics.
On the one hand, a suite of magneto-hydrodynamic simulations with the subgrid physics models of IllustrisTNG (Weinberger et al. 2017;Pillepich et al. 2018;Nelson et al. 2019), performed with the code Arepo 11 (Weinberger et al. 2020, see also Springel et al. 2018;Marinacci et al. 2018;Naiman et al. 2018 for more details). Its galaxy formation model is based on the predecessor Illustris (Vogelsberger et al. 2013(Vogelsberger et al. , 2014. On the other hand, another suite employs the SIMBA subgrid physics model (Davé et al. 2019), making use of the code GIZMO 12 (Hopkins 2015). SIMBA is built on its precursor MUFASA (Davé et al. 2016) with the addition of supermassive black hole growth and feedback (Anglés-Alcázar et al. 2017). The hydrodynamics simulations are accompanied by N-body counterparts, run with the GADGET-III code (Springel 2005). 13 Within each simulation suite, there are different sets, according to the configuration of astrophysical and cos-mological parameters. The CV simulations (standing for Cosmic Variance) include 27 simulations sharing their astrophysical and cosmological parameters, fixed to standard values (Ω m = 0.3 and σ 8 = 0.8), and varying only the random seed to generate the initial conditions. The LH set (standing for Latin-Hypercube) consists of 1,000 simulations, varying all the astrophysical and cosmological parameters (together with the random seed) along a latin-hypercube. 14 We make use of both sets for training our networks, in order to check whether the architectures considered are robust enough for different cosmologies and baryonic feedback parameters. The halos and subhalos are identified using the SUBFIND algorithm (Springel et al. 2001). We define galaxies as subhalos that contain more than 20 star particles. Although the CAMELS suite contains data for several redshifts, only simulations at z = 0 are considered in this work. We refer the reader to the CAMELS webpage 15 and Villaescusa-Navarro et al. (2021a) for further details.
As an example of the differences between IllustrisTNG and SIMBA, Fig. 2 shows the number of halos as a function of the number of galaxies the halos contain for the simulations of the LH set. While most of the halos in the simulations only host a few galaxies, there are a number of them that contain hundreds of satellites. Note that for a fixed halo mass, the SIMBA simulations contain more galaxies than the IllustrisTNG simulations. In general, SIMBA simulations usually contain many more galaxies than their IllustrisTNG counterpart, being an average of ∼ 1200 and ∼ 700 per simulation respectively, both at z = 0. Furthermore, SIMBA simulations tend to be more stochastic, and usually can produce more extreme scenarios (with larger effects of the baryonic feedback effects) than IllustrisTNG. These facts reflect the inherent differences between the two codes/subgrids models. We refer the reader to Villaescusa-Navarro et al. (2021a) for a more detailed comparison between different observables of both suites, such as clustering or halo population, and to Villaescusa-Navarro et al. (2022); Villanueva-Domingo & Villaescusa-Navarro (2022) for more information on the distribution and correlations between the different galactic features of the CAMELS catalogues.

Training procedure
We construct our dataset, the collection of graphs, based on the procedure outlined in Sec. 2.2. We restrict Figure 2. Number of halos sorted by the number of galaxies that they host in the IllustrisTNG (blue) and SIMBA (purple) suites, for the LH sets. On average, the halos of the SIMBA simulations contain more galaxies than the halos of the IllustrisTNG simulations, reflecting the large differences between the two simulation suites given their distinct subgrid physics models.
our analysis to halos with more than one galaxy, as mentioned above. The models are trained on simulations of the CV and LH sets separately, where in each case we split the dataset into training (70%), validation (15%) and testing (15%) sets. We employ an Adam optimizer (Kingma & Ba 2014) and L2 regularization. The batch size is set to 128 and the number of epochs limited to 150. The GNN architecture is implemented following the prescription outlined in Sec. 2.3, concatenating one or several message passing layers and appending at the end a global pooling and MLP to infer the halo mass.
We perform a hyperparameter optimization in order to get the best values for the hyperparameters, following a bayesian model-based optimization procedure with the Tree Parzen Estimator (TPE, Bergstra et al. 2011), making use of the Python package Optuna 16 (Akiba et al. 2019). The hyperparameters considered for this optimization are the learning rate, the weight decay, the number of message passing layers, the distance to define neighborhoods, and the specific architecture (among those defined in Sec. 2.4). We perform at least 75 trials for each suite and set, where each trial is a specific choice of the values of the hyperparameters. The optimization procedure leads to different values for each simulation suite and set, with learning rates ranging between 10 −5 and 6 × 10 −4 , weight decay values between 10 −8 and 10 −7 , and the neighborhood distance between 2 and 20 Mpc/h. One unique message passing layer (i.e., one update of the hidden features of the nodes from Eq. 3) and the EdgeNet architecture (among those discussed in Sec. 2.4) are the optimal choices for all cases. The large values of the neighborhood radii obtained imply that most of the graphs created are complete, around 98% for both the CV and LH sets. That means that most of the galaxies are connected with each other within the halo, and most nodes can access to information about each other companion of the graph at one hop.
We follow a likelihood-free bayesian inference approach to sample the expected standard deviation of the outputs. This procedure allows us to reproduce some properties of the posterior (its second centered moment in this case) without the need of a likelihood (Jeffrey et al. 2021). We follow Jeffrey & Wandelt (2020) and design our model to output two quantities: the mean and standard deviation of the halo mass posterior. The loss function needed to achieve that is given by and Note that we have included log functions to make sure that the two contributions are similar. For instance, if the errors are much smaller than the mean, the loss will be dominated by the mean and the errors may not be accurately computed (see Villaescusa-Navarro et al. 2021b, for further details). It is worth emphasizing the symmetries fulfilled by the GNNs. By construction, GNNs are permutation invariant, guaranteed by neighborhood aggregation as discussed in Sec. 2.3. Furthermore, given that graphs are written in the center of mass rest frame, our framework is also translational invariant, since any displacement of the galaxy group would not alter the relative coordinates with respect to the center, and hence the graph would be the same. Moreover, for our special case of point clouds, our GNNs should be rotationally invariant, since an arbitrary rotation of all galaxies around the center of the halo should not change the global halo properties. We try to enforce this symmetry by performing random rotations on each graph at every training epoch, as a data augmentation procedure. This practice also helps in alleviating overfitting, given that the network becomes robust when tested in systems with arbitrary rotations. Overfitting also becomes absent in our training procedure with the proper selection of the hyperparameters.
We write and train the models making use of PyTorch Geometric 17 (Fey & Lenssen 2019). Our implementation of the GNNs, HaloGraphNet, is publicly available on GitHub 18 (Villanueva-Domingo 2021).

RESULTS
In this section, the main results of the work are discussed. We first introduce a benchmark model to estimate halo masses from stellar masses, and next we detail the results of training and testing the GNNs in the different CAMELS simulation sets and suites considered, examining also their robustness over different astrophysical modeling.

Predictions from stellar masses
Before starting to discuss the accuracy of the GNN predictions, it may be useful to set a benchmark model to predict halo masses, to test the worthiness and degree of improvement of using GNNs. A traditional approach exploits the relation between stellar and halo mass, based on abundance matching (Behroozi et al. 2010;Wechsler & Tinker 2018). Figure 3 shows the mass of a halo M h as a function of its total stellar mass M * ,tot = i M * ,i (summing over all stellar particles in the central and satellite galaxies within the halo) in the IllustrisTNG and SIMBA suites, for the CV (left) and the LH (right) sets. 19 One can notice a clear correlation between stellar and halo masses, although with a large scatter. This is specially noteworthy in the LH set, where multiple values of the cosmological and astrophysical parameters are considered, leading to completely different outcomes. Shaded areas denote the standard deviation of the points, which is around ∼ 0.2 dex in the CV case, but grows up to ∼ 0.3 − 0.4 dex in the LH set.
Taking advantage of the expressed correlation, we can build a naive estimator of the halo mass based only on the total stellar mass of galaxies. A simple 4th degree polynomial fit is shown in dashed lines in Fig. 3. This can be regarded as a benchmark model for comparing with our forthcoming models based on GNNs, in order to further evaluate their strength. To check its accuracy for predicting y = log 10 [M h /(M /h)], we employ the mean relative error , with N the number of test halos, as well as the correlation coefficient (or coefficient of determination) R 2 , defined in the usual way as with y truth the mean of true values. This naive estimator gives fairly accurate results in the CV set, with relative errors around 1% and a linear correlation coefficient of R 2 0.94. However, when an analogous fit is attempted in the LH case, the relative error worsens down to ∼ 1.7% (∼ 2.4%) for SIMBA (IllustrisTNG), with R 2 0.84 (R 2 0.67). Note that these errors are for the logarithm of the mass, y, rather than for the halo mass itself, which correspond to relative errors in the mass between ∼ 50 − 120%. This fact illustrates the non-trivial dependence of the halo and stellar masses on the different astrophysical and cosmological scenarios. Notice that the LH set contains some extreme scenarios which may be unlikely to describe the real universe. In any case, the most appropriate astrophysical parameters to describe our cosmos are still unclear, and thus, models able to robustly marginalize over baryonic feedback effects are required. Thus, a prediction of the halo mass from only the stellar mass when a broad range of astrophysical models are considered seems to be quite inaccurate. In the following, we shall show how a GNN is able to overcome this difficulty by considering further features and taking advantage of the graph structure of halos.

Inferring halo masses with GNNs
Here we first discuss the results of training a GNN to predict halo masses using galaxies from simulations of the CV set. As stated in Sec. 3.1, this set contains 27 simulations with fixed fiducial values for the cosmological and astrophysical parameters, only varying the random seed. The left panels of Fig. 4 show the accuracy of the network in the CAMELS CV sets, where the top panel stands for the IllustrisTNG subgrid model and the bottom one for the SIMBA suite. The vertical axis is the difference between the predicted and true logarithms of the halo mass, log 10 (M h /(M /h)), with respect to the true value in the horizontal axis. Error bars have been estimated via likelihood-free inference, sampling the standard deviation of the posterior, as outlined in Sec. 3.2. One can see that the performance is fairly good in both cases, as the linear correlation coefficient of R 2 = 0.96 − 0.97 confirms, with a relative error lower than ∼ 1% in y (∼ 25 − 40% in the mass itself). The dashed lines show the mean of test points, while the shaded region depicts their actual standard deviation, which extends up to ∼ 0.14 dex. Thus, most of the test predictions lie within this region, although there are some outliers. The neural network is thus able test dataset is shown in each case. Shaded regions and dashed lines correspond to real standard deviation and mean of test points, respectively. While in the CV set, astrophysical and cosmological parameters are fixed to fiducial values, the LH set comprises a broad range of astrophysical and cosmological scenarios. Even so, the GNN is still able to learn the halo/galaxy relation and predict masses in the LH case, only slightly worsening the prediction with respect to the CV case.
to accurately infer the halo mass given some features of its galaxies.
Nevertheless, the previous results only show that the GNN is capable of predicting the mass of a halo when cosmology and astrophysics is known, i.e., when the parameters are fixed. However, the specific values of the relevant parameters for the real Universe are not well known yet, specially the astrophysical ones. In order to marginalize over uncertainties in cosmology and astrophysics, we have trained our network in the LH simulation set, which includes one thousand simulations varying cosmological and astrophysical parameters, together with the random seed (to also incorporate effects of sample variance), as noted in Sec. 3.1. The right panels of Fig. 4 show the GNN predictions for training the GNN in IllustrisTNG (top) and SIMBA (bottom). One can see that the results only slightly worsen from the equivalent case in the CV simulation set, with R 2 = 0.90−0.92 and relative errors of ∼ 1%. Uncertainties are also larger than in the CV set, roughly by a factor of 2, meaning that the model trained in the LH set offers lower precision. This worsening is expected since the network needs to marginalize over the astrophysical and cosmological parameters, differently from the network trained on the CV set. It should be noted that a small bias appears at low masses, below 10 11 M /h, which deviates up to 0.4 dex (also present in the SIMBA CV case). This could be explained due to the smaller number of low-mass halos in the datasets, preventing the network to properly learn in that range. Moreover, the expected fewer satellites in low-mass halos could also affect the predictions, since the GNN counts on less satellites to extract information. In any case, these results indicate that our model is able to learn the mapping between astrophysics/cosmology and the halo mass, and thus marginalize over the value of the parameters present in the simulations to make an accurate prediction.
To further evaluate the predictive power of GNNs, it is worth comparing these results to those obtained from the naive fit based only on stellar mass outlined in Sec. 4.1. For both the CV and LH sets, the GNN predictions outperform those from the polynomial fit, presenting larger R 2 coefficients and lower relative errors. The improvement is especially clear in the LH case, where the R 2 is significantly better than the benchmark method. Moreover, the scatter around the true values spans ∼ 0.14 and 0.2 dex, compared to the larger standard deviations from the naive fit, which are 0.2 and 0.3 − 0.4 dex respectively. Note that 0.2 dex is a factor up to ∼ 1.6 in the mass (rather than in the logarithm), while 0.3 dex is a factor ∼ 2, meaning a ∼ 100 % error in the mass. These results demonstrate how taking advantage of the graph structure and further galaxy features, it is possible to attain richer correlations and better results.
Nevertheless, it has to be emphasized that the linear correlation coefficient R 2 and the relative error cannot constitute a complete statistical summary for testing the accuracy of the GNN. It is because our network is also predicting the standard deviation of the target, for which an additional component to the loss has been included, as discussed in Sec. 3.2. Thus, neither R 2 nor the relative error quantify the sampling accuracy of the standard deviation. To test whether the uncertainties are reasonably predicted, it is useful to compute the χ 2 , defined as Note that minimizing the loss function contribution from Eq. 7 tends to drive the χ 2 towards unity. This is actually the case in the LH set, where χ 2 = 1.00 for IllustrisTNG and χ 2 = 1.03 for SIMBA, indicating that uncertainties are accurately predicted. In the CV cases, however, while the mean predictions are more accurate, some errors are underestimated, leading to larger χ 2 values, around ∼ 4 in SIMBA. Moreover, since the test dataset is smaller, a few outliers with too small standard deviations greatly impact the value of the χ 2 .
One can also compute how many points in the test dataset present an accurate uncertainty by counting the fraction which fulfill the conditions |y truth,i −y infer,i | ≤ σ i and |y truth,i − y infer,i | ≤ 2σ i , i.e., how many points lie within one and two times the standard deviation of the posterior. We find that for the LH cases, the fraction of points fulfilling the above conditions is 69% and 95% respectively for both suites. For the CV sets, these fractions only slightly deviate, 62 and 90% for IllustrisTNG, and 72 and 96% for SIMBA, respectively. For Gaussian distributed errors, these fractions should be around 68 and 95% respectively. Note however that our calculation of the posterior mean and standard deviation does not make any assumption about the form of the posterior. Therefore, the numbers quoted above should be taken with caution and comparison with the Gaussian case should be done in a careful manner.
There is another way to figure out whether uncertainties are correctly sampled. The shaded regions in Fig.  4 represent the actual standard deviation of the test points computed within several mass bins. Therefore, if the predicted uncertainties are accurately sampled, their mean value σ should correspond to those shaded regions. For the CV case, despite the GNN providing more accurate models, mean uncertainties σ are underpredicted with respect to actual scatter by ∼ 40 % for IllustrisTNG and ∼ 20 % for SIMBA. However, in the LH case, uncertainties are better sampled, only deviating ∼ 3 % and ∼ 7 % for IllustrisTNG and SIMBA respectively. This implies that models trained in LH, despite predicting slightly less accurate results than those from the CV set, provide a better sample of the posterior uncertainties.

Robustness over different subgrid physics models
Subgrid physics, i.e. the models used to simulate unresolved astrophysical processes such as the feedback from supernovae and black holes, can only be implemented in a phenomenological way and therefore there is not a unique subgrid model that best represents reality. Thus, having a ML model robust over different subgrid scenarios would be needed in order to obtain predictions that do not depend on the particular type of simulation used to train the networks.
To check whether our GNN fulfills this requirement, we have taken the model previously trained on simulations from the IllustrisTNG suite, and we have tested it on galaxies from the SIMBA simulations, which make use of a completely different subgrid physics model for AGN and SN feedback. The top left panel of Fig. 5 shows the predictions for the halo mass in the CV set, i.e., trained in IllustrisTNG CV and tested in SIMBA Figure 5. Same as Fig. 4 but either using a model trained with the IllustrisTNG suite and tested with the SIMBA simulations (top) or trained in SIMBA and tested in the IllustrisTNG suite (bottom), for CV (left) and LH (right) sets. A model trained in a given suite worsens its behavior when tested in the other one, appearing biased in the CV case. However, in the LH set, it is possible to find a mapping between the parameter space of both subgrid physics models, alleviating such biases.
CV. We see that the performance becomes worse, and actually a bias appears. This offset may arise from the fact that the IllustrisTNG and SIMBA models make different predictions for the halo-galaxy connection, as shown in Fig. 3. This can be related with the fact that astrophysical parameters are not completely correlated and calibrated between both cases. While the CV set assumes fiducial values for the astrophysical and cosmological parameters, the default values in the Illus-trisTNG suite do not correspond to the ones in SIMBA, since they refer to different quantities and physics. The absence of a one-to-one relation between both suites in the CV set may explain why the network fails to make a robust prediction.
The right panel of Fig. 5 depicts the results of testing the network trained on galaxies of the IllustrisTNG LH set on galaxies of the SIMBA LH set. In this case, the bias present in the CV case disappears, and only a broad scatter holds. The absence of the offset can be attributed to the intrinsic marginalization of astrophysical effects carried out by the network (although a slight tilt is present in the trend). Moreover, the linear correlation coefficient only slightly decreases with respect to testing in IllustrisTNG (top right panel of Fig. 4), and is actually better than in the CV counterpart (top left panel of Fig. 5), in spite of dealing with a much broader astrophysical parameter space. The χ 2 also remains closer to unity than in the CV case, meaning that uncertainties are better sampled in the LH set. The fraction of points fulfilling the conditions |y truth,i −y infer,i | ≤ σ i and |y truth,i − y infer,i | ≤ 2σ i is 65 % and 93 % respectively, while much lower for the CV case (20 and 46% respec-tively). These facts imply that models trained in the LH set generalize better than those trained in the CV one.
An analogous experiment has been carried out, employing the model trained in the SIMBA suite and testing it in IllustrisTNG. The results are shown in the bottom panels of Fig. 5 for the CV set (left) and LH set (right). Note the offset in the CV case, which is similar but opposite (underpredicting the mass) to the one appearing in the top left panel of Fig. 5. This is reasonable, since given that the IllustrisTNG model overpredicts the mass in SIMBA, the contrary should be expected when a SIMBA model is tested in Illus-trisTNG. In the LH scenario, predictions are slightly worse, decreasing the truth-prediction correlation down to R 2 = 0.8, and also poorer than in the CV case. Both sets present larger χ 2 values and a deviation from the Gaussian counterpart, given that only a fraction of ∼ 50% and ∼ 80 % points lie within one and two times the posterior standard deviation, respectively.
It is noteworthy that, for the LH set, a model trained in IllustrisTNG generalizes better in SIMBA than the opposite case. An interpretation of this outcome is problematic. SIMBA usually covers more extreme astrophysical scenarios than IllustrisTNG, presenting also more galaxies per halo (see Fig. 2), facts which could have an impact in this different cross testing. SIMBA simulations usually show a more stochastic behavior, presenting more scatter in some properties such as galaxy scaling relations. This can be due to the small box size compared to the large scale effect of the SIMBA AGN jets (Davé et al. 2019;Villaescusa-Navarro et al. 2021a). Nevertheless, although we observe a higher variance and some outliers, most of the predictions only deviate up to 0.4 dex from the ground truth. In all cases, the mean relative error in the logarithm of the mass lies below 2%. The GNNs are able to recover the true mass in the majority of the cases within the standard deviation uncertainty. This shows that the models are relatively accurate even when they are applied to simulations with a different subgrid physics modeling, manifesting their robustness.

INTERPRETABILITY OF THE GNN
Neural networks represent superb tools to deal with multiple problems, but the factors that determine their behavior and performance are difficult to understand. It is always desirable to gain some interpretability of our ML models, in order to determine which features and inputs are the most relevant for predicting the output. To do so, we make use of the Python library Captum 20 (Kokhlikyan et al. 2020), a package designed for model interpretability and attribute selection. Specifically, we employ the saliency method for computing the gradients of the outputs with respect to the inputs and features employed (Simonyan et al. 2013). In that way, larger gradients, and hence larger saliency values, imply that the prediction is more sensitive to variations of that variable.
We have computed the gradients with respect to the features employed in the training of the GNN, averaged over all galaxies and taken the absolute value. This approach gives us a saliency value for each property, which is shown in Fig. 6 for the IllustrisTNG suite in the CV (blue) and LH (red) sets. For both sets, we can notice that, as could be expected, each coordinate of the position presents a very similar importance, due to implicit rotational invariance present in the problem. Regarding the CV case, the most relevant features to determine the output appear to be the stellar mass, M * (as expected from the tight correlation shown in left panel of Fig. 3) and subsequently, the stellar half-mass radius, R * . However, in the LH case, the network focuses more on R * and v rather than the stellar mass. One can interpret this change of learning behavior from the larger scatter in stellar mass shown in Fig. 3 in LH with respect to the CV set. In other words, the large variety of astrophysical and cosmological scenarios in the LH set may require the network to base its predictions on features  Figure 7. Saliency graphs for six different halos from the IllustrisTNG CV (top) and LH (bottom) datasets. Colors denote the saliency attributes, meaning that galaxies with larger values are more relevant for the prediction. Sizes of the nodes are proportional to the stellar mass of the galaxies. In the CV set, central galaxies usually have the greatest impact on the prediction, although if there are massive satellites, they can also contribute significantly (e.g., right panel). In the LH case, however, more attention is often focused on smaller satellites rather than on central galaxies.
that do not exhibit such large scatter. While the predictor mostly rely on the M * -M h correlation when the scatter is small, velocities and galaxy size become better tracers otherwise. Nevertheless, one has to be cautious with these interpretations, since the saliency values may not give a complete picture. For instance, spatial position importance may be underestimated since its weight could be split into the three coordinates. It is common in computer vision tasks to calculate the saliency map, which indicates those pixels in an image that are most relevant for the final output (see, e.g., Villanueva-Domingo & Villaescusa-Navarro 2021 for an application in cosmology). In our case, we are dealing with graphs rather than images. Therefore, for a given halo it is possible to compute the saliency graph, which shows the nodes whose features are more relevant for predicting the halo mass. Saliency values can be computed using the same procedure as above, but taking the absolute value and averaging over all features at each node. Examples of such saliency graphs for different halos are depicted in Fig. 7, where the color indicates the saliency value, and the size is proportional to the stellar mass. Neighbors are connected by lines. Chosen samples present relative errors lower than 1.5 % to ensure that their saliency graphs are meaningful. Top row stands for the CV set and bottom for LH, both in Illus-trisTNG. In the CV set, as one would naively expect, the central galaxies provide the most relevant nodes for the output. These galaxies are also those with larger stellar masses. However, given that the stellar mass is an informative property, as seen in Fig. 6, halos with relatively massive satellites can also show other relevant nodes besides the central one. On the other hand, since in the LH set, stellar mass is less informative, as seen in Fig. 6, the network may focus more on some low mass satellites rather than in central galaxies, which become less important. This illustrates the importance of the satellite population in our method. Hence, one has to be cautious when applying these models to galactic systems, since excluding some relevant satellites may have an impact in the predictions and induce a bias in the results.
It is thus pertinent to ask ourselves which satellites leave a greater impact on the halo mass. Fig. 8 depicts the saliency value of each satellite galaxy as a function of distance to the center, excluding central galaxies. Point size is proportional to the stellar mass of the node. One can notice a trend where closer galaxies become more relevant than those farther away. Moreover, this tendency seems to be relatively independent on their stellar mass. Furthermore, in real applications, the membership of galaxies lying at the boundaries of the system may be less secure, being not clear whether a far galaxy is a satellite or not. Fig. 8 shows that removing or including such candidates would not have a significant impact in the results of the network, given the decreasing saliency value with the distance. This plot employs the IllustrisTNG CV dataset, although similar qualitative conclusions can be extracted from the other cases.
These tests provide us with enlightening information about how GNNs learn and predict their outputs, as well as which are the most relevant components to understand the halo/galaxy connection. It has to be noted that the previous interpretations have to be taken with caution, since the saliency maps can be sometimes dominated by noise in the gradients and lose meaningfulness. Further development is required in order to obtain a better understanding of the training procedure and the emergent properties of halos from galaxies. 6. DISCUSSION

Main conclusions
Constraining the total mass of dark matter halos from galaxies still presents a challenge from both theoretical and observational perspectives, given the large contribution of the dark matter component. In this work we have presented a new method based on artificial intelligence designed to infer the total mass of a halo from the properties of the galaxies it hosts. The point cloud arrangement of halo-galaxy catalogues has been exploited in order to structure halos as mathematical graphs, where galaxies constitute the nodes, connected by proximity. This organization of data makes it possible to employ GNNs, naturally suited to operate with graphs, to extract global permutation invariant quantities, as is the case of the halo mass. The models are fed with different observable galactic features, such as the position, velocity, stellar mass, and stellar half-mass radius.
We have made use of the large collection of state-ofthe-art hydrodynamic simulations from the CAMELS project to train our networks, which include thousands of simulations covering different astrophysical scenarios. Training GNNs over this dataset allows us to achieve precise models capable of predicting the mass of a halo with remarkable accuracy. Here we outline some of the key conclusions of our method: • Our models are able to accurately infer the mass of a halo when trained and evaluated in simulations with fixed astrophysical and cosmological parameters, with a precision better than ∼ 1% (in terms of the logarithm of the halo mass). There is a general tendency where closer galaxies to the center present larger saliency values, and therefore, are more relevant for predicting the output.
• The networks have also been trained in simulations with different astrophysical and cosmological scenarios, successfully marginalizing over a broad astrophysical parameter space, and learning the connection between the halo mass and the properties of the galactic components. This provides accurate predictions of the total halo mass, with relative errors (in the logarithm of the halo mass) around ∼ 1%.
• We have proven that the trained networks in a simulation suite still provide relatively precise predictions when tested in simulations with a different subgrid physics model, only increasing the mean relative error up to ∼ 2%. This illustrates the robustness of this method with respect to the astrophysical modeling.
• Our results strongly rely on the velocity and size of satellite galaxies, demonstrating the importance of other galactic features beyond the stellar mass.
• We have performed likelihood-free bayesian inference, providing additionally an estimate for the standard deviation without knowing the actual likelihood.
• Hyperparameter optimization has been carried out to maximize the performance of the networks.

Comparison to previous works
It is not straightforward to compare our results to previous works, given that the datasets employed, problem setups and features considered can be notably different. In the following, we compare our approach to other ML methods estimating the halo mass in previous literature, while emphasizing that a proper comparison between two techniques would require the use of common data and problem statement.
• Calderon & Berlind (2019) apply several ML techniques, such as standard MLPs, to predict the halo mass from galaxy groups data. They train their models in semianalytical galaxy catalogues calibrated to SDSS data, outperforming traditional methods to infer the dynamical mass. Their 1σ scatter region spans ∼ 0.4 − 0.6 dex (a fractional difference of ∼ 3 − 5%), while for the same masses, our GNN reduces down to ∼ 0.14 dex for the CV set and ∼ 0.2 dex for the LH one. In terms of the relative error in the mass (rather than in the logarithm), their scatter gets up to ∼ 250 − 400%, while our models reduce it down to 40 % and 60 % for CV and LH respectively. Furthermore, our models make use of fewer galaxy features (6 rather than their 9) and the LH simulations cover a large volume of the astrophysical and cosmological parameter space.
• Man et al. (2019) face a similar problem of predicting halo masses from group galaxy properties, splitting their datasets in red and blue groups, according to the color of the central galaxies. They train a random forest estimator, obtaining scatter comparable to ours (note the √ 2 factor of difference in their definitions), but employing 9 features with a semianalytical galaxy formation model with fixed parameters.
• von Marttens et al. (2021) predict galactic properties, such as the total mass, training different ML models in the TNG100 simulation. As in our case, the galaxies considered present stellar masses above 10 8 M /h, although they predict the subhalo mass rather than the halo one. They employ up to 15 features, including photometric, kinematic, and structural properties. When all features are included, they are able to get accurate results, with correlation of R 2 = 0.92. This case can be compared to our IllustrisTNG CV set, since it presents fixed astrophysical and cosmological pa-rameters, 21 where we obtain a better correlation of R 2 = 0.97 even using only 6 features. Moreover, we are also capable of obtaining a similar accuracy in the LH set than their models in TNG100.
• Lucie-Smith et al. (2020) train CNNs on the density field of dark matter halos to infer their masses, with an accuracy comparable to the one of our network train on the CV set. However, note that this is a very different task, since that method relies on the 3D dark matter density field rather than on observable features, as is the case of our GNNs. Moreover, the authors employ N-body simulations with fixed cosmology rather than full hydrodynamic simulations. In any case, one has to be cautious at comparing these different approaches, due to the distinct datasets, assumptions and features considered, being appropriate only for illustrating the potential of GNN models.

Caveats and future prospects
While the ML method presented here shows reasonable accuracy, it however has some caveats that have to be emphasized. For instance, it may not be obvious whether one galaxy belongs to a halo or not, in cases where two galactic groups are close together. This can be exacerbated in real observations, which take place in redshift-space. This may cause the appearance of interlopers in a halo which could be counted as its own satellites, distorting the results. The effect of the presence of other halos in the environment when building the graph is also disregarded, which could have an impact for some close halos gravitationally bound. A way to deal with the influence of surrounding groups may be to include some global feature regarding the amount of galaxies or halos within a certain distance from the halo which is evaluated.
Note that, since only halos with more than one galaxy have been considered, this method should only be applied to halos that contain multiple satellites (above the assumed mass threshold). In those cases where the existence of satellites is not clear, the method may not be reliable. Machine learning approaches could also be employed to identify the satellites around a given central galaxy, i.e., training a network to predict whether a subhalo is part of a given halo or not.
It is important to highlight that, given the small size of the simulation boxes, the amount of very massive galaxies may be undersampled. The fraction of galaxies of galaxies with stellar mass above 10 11 M /h is 1% for both suites, implying a relatively low amount of those large objects. Training in larger simulation boxes with a higher amount of large halos would be advisable to ensure the reliability of our method when applied to very massive galaxies.
Moreover, we have only trained with halos at z = 0, but it would be also convenient to derive models capable of inferring halo masses at earlier times. New difficulties may arise in that case, such as redshift space distortions, or the fact that we could expect fainter galaxies at higher redshift (since for a given luminosity threshold, the number density of galaxies decreases with redshift), reducing thus the size of the training dataset.
The cross-tests discussed in Sec. 4.3 are aimed to illustrate the robustness of the network under different conditions regarding the baryonic feedback modeling. While our results are still relatively accurate even when they are tested in a different subgrid physics model than the one used for training, it would be desirable to further maximize their robustness. While training on Il-lustrisTNG and SIMBA simultaneously could enhance the performance, there is no guarantee that the network would be more robust under other conditions. It is possible that the model learns some kind of bimodal distribution of simulations, identifying first whether the halo corresponds to a IllustrisTNG or a SIMBA simulation, and then predicting the halo mass. A further improvement on this would be to train on IllustrisTNG plus SIMBA and cross-test in a third different suite. Such additional suite with equivalent features to those of IllustrisTNG and SIMBA is still missing, although the CAMELS collaboration is already working to include other suites using different hydrodynamic codes and baryonic feedback models.
The results presented here are obtained by restricting ourselves to a reduced set of several observable quantities. However, it is possible in principle that using further variables, the results could improve. Additional correlations with the halo mass could arise if supplementary features are considered to train the GNNs. For instance, previous works like von Marttens et al. (2021) have shown the importance of photometric variables, such as luminosities at different wavelenghts, for inferring the total halo mass. Additionally, considering, e.g., radii enclosing 20% and 80% of the mass (or light) may provide further information about the concentration and morphology of the halo. Among other appealing galactic properties that could be contemplated are the gas mass, the HI mass, metallicities, velocity dispersion, etc. Furthermore, it would be desirable to explore more complex GNN architectures to find whether more accurate results can be achieved, reducing the scatter and the predicted uncertainties. The training of GNNs on galactic properties combined with other observables such as lensing could also enhance its predictive power.
The framework developed in this article has been proposed to infer the total mass of a given halo. However, since it is based on extracting global quantities from galaxy features, it could be generally applied to predict any other global quantity of the halo, such as its concentration, spin, characteristic age, etc. Depending on the specific global quantity considered, different galaxy or subhalo features may be required in order to maximize accuracy, which in principle could differ from those employed here. Moreover, one could apply GNN architectures to other problems, such as edge prediction. For instance, given a set of galaxies, a GNN could be trained to identify those which conform separate halos, as a ML alternative to friend-of-friends algorithms. Furthermore, note that our GNNs are trained to marginalize over baryonic feedback and cosmological parameters, but they cannot identify the specific parameters underlying the simulation of a given halo. However, a GNN can be explicitly trained to infer the values of the cosmological parameters from a simulation employing its galactic population, a task which has been tackled with CAMELS catalogues in Villanueva-Domingo & Villaescusa-Navarro (2022).
It would be desirable to synthetize the predictive power of the GNNs into an analytical formula, via symbolic regression, as has been done already in other cosmological contexts (Cranmer et al. 2019;Cranmer et al. 2020;Shao et al. 2021;Wadekar et al. 2020). The analytical formulae derived in this way would depend upon the messages exchanged between the nodes. Nonetheless, obtaining a model susceptible to be replicated by a concise formula presents further difficulties. Performing symbolic regression requires interpretable models. One way to achieve this is to enforce sparse weights, obtained, e.g., by applying L1 regularization, which can affect the accuracy of the network. Sometimes, increasing the accuracy of a network reduces its extrapolation properties, and the other way around (Shao et al. 2021). Hence, deriving simple analytical formulae via symbolic regression from these GNNs remains as a desirable but challenging step for future work.
Our mass inference method based on GNNs is developed for galactic systems composed by central and satellite galaxies, but it could also be applied to galaxy clusters, given that the data description as graphs would be equivalent. While we believe our method will be perfect to perform mass estimates of galaxy clusters, the CAMELS simulations may not be the best training dataset given the small volume they cover, which translates into a lack of galaxy clusters, as already pointed out by previous works Wadekar et al. (2022).
Besides galaxy clusters, the description as graphstructured data makes this kind of network suitable for other types of astrophysical systems, which are characterized by point clouds. Examples of these may include globular clusters, stellar populations within a galaxy or even planetary systems. Any point distribution could take advantage of the graph structure presented here for halos, in order to derive global quantities of such systems. An unexplored window remains open to apply all the power of GNNs to astrophysics.
Given that halo masses can be accurately predicted from numerical simulations by using the method shown here, the natural step forward would be applying this kind of model to real data. Once a GNN model is trained in simulations, one can predict the mass of a real halo using the observed kinematic and internal features of some real galactic systems as input data. Note that this procedure presents further challenges. Spectroscopic measurements would be required to infer the velocity, redshift information and 3D positions of every satellite and central galaxy. The satellite population of distant halos could not be completely observable given their faintness, which could induce some biases in the results. Moreover, additional problems may arise when considering several redshifts at once, such as dealing with redshift space distortions. However, our local galactic neighborhood provides us with a simple scenario to apply our method, where the previous difficulties can be handled. This task is carried out in a companion paper , where GNNs trained with CAMELS simulations are employed to predict the masses of the halos containing the Milky Way and Andromeda. In that article, independent predictions for both halos are presented, which are consistent with other standard methods for estimating the dynamical masses of our galaxy and its companion. This represents a success of applying ML models trained with numerical simulations on real data, illustrating the power of artificial intelligence to enhance our knowledge of the Universe.

DATA AVAILABILITY
The models and implementation of GNNs in Py-Torch Geometric, HaloGraphNet, are available on GitHub 22 (Villanueva-Domingo 2021). Details on the CAMELS simulations can be found in https://www. camel-simulations.org.