Characterizing structure formation through instance segmentation

Dark matter haloes form from small perturbations to the almost homogeneous density field of the early universe. Although it is known how large these initial perturbations must be to form haloes, it is rather poorly understood how to predict which particles will end up belonging to which halo. However, it is this process that determines the Lagrangian shape of proto-haloes, and it is therefore essential for understanding their mass, spin, and formation history. We present a machine learning framework to learn how the proto-halo regions of different haloes emerge from the initial density field. We developed one neural network to distinguish semantically which particles become part of any halo, and a second neural network that groups these particles by halo membership into different instances. This instance segmentation is done through the Weinberger method, in which the network maps particles into a pseudo-space representation where different instances can easily be distinguished through a simple clustering algorithm. Our model reliably predicts the masses and Lagrangian shapes of haloes object by object, as well as other properties such as the halo-mass function. By comparing to the degree of agreement between two N-body simulations with slight differences in their initial conditions, we find that our model extracts information close to optimally. We publish our model open source and suggest that it can be used to inform analytical methods of structure formation by studying the effect of systematic manipulations of the initial conditions.


Introduction
Dark matter (DM) haloes are the primary structures in the universe within which galaxies form and evolve. Acting as gravitational anchors, they play a pivotal role in connecting theoretical cosmology with empirical observations from galaxy surveys. Given their significance in cosmology, a comprehensive understanding of DM haloes and their behaviour is paramount. Currently, our most detailed insights into their formation and properties come from N-body simulations (see Frenk & White 2012, for a review). These computationally intensive simulations model the interactions of vast numbers of particles, pinpointing the regions of the density field where gravitational collapse leads to the formation of DM haloes (e.g. Angulo & Hahn 2022). Therefore, understanding the formation and behaviour of DM haloes is essential to bridge the gap between theoretical models and observational data.
However, providing quick and accurate predictions of halo formation (based on the initial conditions of a simulation) remains a challenging task for physically motivated models. An accurate model for halo formation must be able to capture the nonlinear growth of density fluctuations. Previous analytical or semi-analytical models for halo formation, such as the top-hat spherical collapse (Gunn & Gott 1972; Gunn 1977; Peebles 1980), the Press-Schechter / Excursion Set Theory (Press & Schechter 1974; Bond et al. 1991; Lacey & Cole 1993), or ellipsoidal collapse approaches (e.g. Sheth et al. 2001; Sheth & Tormen 2002), qualitatively reproduce the behaviour of the halo-mass function and the merging rate of haloes; however, they fail to predict these quantities accurately (e.g. Jiang & van den Bosch 2014). Further, N-body simulations show the formation of "peak-less" haloes, which cannot be accounted for by any of these methods (Ludlow & Porciani 2011).

⋆ daniellopezcano13@gmail.com
Traditional analytical methods have provided foundational insights into the process of halo formation, but they struggle to capture its full complexity. Machine Learning (ML) techniques have emerged as a promising alternative, capable of capturing the intricate non-linear dynamics inherent to the gravitational collapse of structures. ML algorithms can be trained on N-body simulations to emulate the results of much more expensive calculations. Previous studies have trained ML models to map initial positions and velocities of particles to their final states (He et al. 2019; Giusarma et al. 2019; Alves de Oliveira et al. 2020; Wu et al. 2021; Jamieson et al. 2022) and to predict the distribution of non-linear density fields (Rodríguez et al. 2018; Perraudin et al. 2019; Schaurecker et al. 2021; Zhang et al. 2023; Schanz et al. 2023).
Further, ML has been used to predict and gain insights into the formation of haloes. Some studies utilized classification methods to anticipate whether a particle will become part of a halo (Lucie-Smith et al. 2018; Chacón et al. 2022; Betts et al. 2023) or to predict its final mass category (Lucie-Smith et al. 2019). In Lucie-Smith et al. (2020), a regressor network is trained to predict the final halo mass for the central particle in a given simulation crop. The work by Bernardini et al. (2020) demonstrates how ML segmentation techniques can be applied to predict halo Lagrangian regions. In Berger & Stein (2019), a semantic segmentation network is trained to predict Peak-Patch haloes. In Lucie-Smith et al. (2023), a network is trained to predict the mass of haloes when provided with a Lagrangian region centred on the centre of mass of proto-halo patches, and it is then used to study assembly bias when exposed to systematic modifications of the initial conditions.
While interesting qualitative insights have been obtained in these studies, it would be desirable to develop a model that accurately predicts halo membership at the particle level, surpassing some of the limitations of previous works. An effective model should predict particles forming realistic N-body haloes, improving upon previous models restricted to simpler halo definitions (e.g. Berger & Stein 2019, where Peak-Patch haloes are targeted). Additionally, an ideal model should be able to predict disconnected Lagrangian halo patches, overcoming the limitations of methods like the watershed technique used in Bernardini et al. (2020), which can only handle simply connected regions. Furthermore, particles within the same halo should share consistent mass predictions, avoiding different halo-mass estimates for particles belonging to the same halo.
We present a general ML framework to predict the formation of haloes from the initial linear fields. We create an ML model designed to forecast the assignment of individual particles in the initial conditions of an N-body simulation to their respective haloes. To do so, we train two distinct networks: one for conducting semantic segmentation and another for instance segmentation. Together, these two networks constitute what is known as a panoptic-segmentation model. Our model effectively captures the dynamics of halo formation and offers accurate predictions. We provide the models used in this study for public access through our GitHub repository: https://github.com/daniellopezcano/instance_halos.
The rest of this paper is organized as follows: In §2, we define the problem of identifying different Lagrangian halo regions from the initial density field (§§2.1), introduce the panoptic segmentation method (§§2.2), present the loss function employed to perform instance segmentation (§§2.3), describe the simulations used for model training (§§2.4), assess the level of indetermination in the formation of proto-haloes (§§2.5), outline the CNN architecture (§§2.6), and explain our training process (§§2.7). In §3, we present the outputs of our semantic model (§§3.1) and our instance segmentation approach (§§3.2). We investigate how our model reacts to changes in the initial conditions in §§4.1 & §§4.2, and study how the predictions of our model are affected when varying the cosmology (§§4.3). We conclude with a summary and final thoughts in §5.

Methodology
We aim to predict the formation of DM haloes provided an initial density field. To comprehensively address this problem, we divide this section into distinct parts. In §§2.1, we explain the problem of predicting halo collapse and discuss the most general way to phrase it. In §§2.2, we introduce panoptic segmentation techniques and explain how they can be employed to predict halo formation. We divide §§2.2 into two separate parts: semantic segmentation and instance segmentation. In §§2.3, we describe the loss function employed to perform instance segmentation. In §§2.4, we present the suite of simulations generated to train and test our models. In §§2.5, we assess the level of indetermination of proto-halo formation. In §§2.6, we explain how to build a high-performance model employing convolutional neural networks. Finally, in §§2.7, we present the technical procedure followed to train our models.

Predicting structure formation
The goal of this work is to develop a machine-learning framework to predict the formation of haloes from the initial conditions of a given universe. Different approaches are possible to define this question in a concrete input/output setting. We want to define the problem in a way that is as general as possible, so that our model can be used in many different contexts.
The input of the model will be the linear density field discretized onto a three-dimensional grid δ_ijk. A slice through such a linear density field is shown in the top panel of Figure 1 and represents how our universe looked at early times, e.g., z ≳ 100. Beyond the density field, we also provide the linear potential field φ_ijk as an input. The information included in the potential is in principle degenerate with the density field if the full universe is specified. However, if only a small region is provided, then the potential contains additional information about, e.g., the tidal field sourced by perturbations outside of the region considered.
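For illustration, the potential input can be computed from the density field by solving Poisson's equation on the periodic grid. The following Fourier-space sketch assumes the convention ∇²φ = δ; the sign and normalization are illustrative assumptions, not necessarily those of the paper's pipeline:

```python
import numpy as np

def displacement_potential(delta, boxsize):
    """Solve Poisson's equation, nabla^2 phi = delta, on a periodic grid
    via FFT. The sign/normalization convention here is an illustrative
    assumption."""
    n = delta.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=boxsize / n)   # angular wavenumbers
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    delta_k = np.fft.fftn(delta)
    phi_k = np.zeros_like(delta_k)
    mask = k2 > 0                      # leave the mean (k = 0) mode at zero
    phi_k[mask] = -delta_k[mask] / k2[mask]   # since -k^2 phi_k = delta_k
    return np.real(np.fft.ifftn(phi_k))
```

For a single plane-wave density mode δ = cos(k₁x), this returns φ = −δ/k₁², as expected from the equation above.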
The model shall predict which patches of the initial density field become part of which haloes at later times. Concretely, we want it to group the N³ initial grid cells (corresponding, e.g., to particles in a simulation) into different sets, so that each set contains exactly all particles that end up in the same halo at a later time. Additionally, there has to be one special extra set that contains all remaining particles that do not become part of any halo. In the ML literature, this task is called an instance segmentation problem. Note that it is different from typical classification problems, since (A) the number of sets depends on the considered input and (B) the sets have no specific order. In practice, it is useful to define the different sets by assigning different number-labels to them. For example, one possible set of particles belonging to the same halo can be given the label "1", another set the label "2", and so forth. These number-labels do not have a quantitative meaning and are permutation invariant; for example, interchanging the label "1" with "2" yields the same sets.
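Because the number-labels are permutation invariant, two instance maps should be compared up to relabelling. A minimal sketch of such a comparison (treating the non-halo label like any other set; the function name is ours, for illustration):

```python
import numpy as np

def labels_equivalent(a, b):
    """True if two instance labelings define the same sets, i.e. they are
    equal up to a permutation (relabelling) of the number-labels."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    if a.shape != b.shape:
        return False
    forward, backward = {}, {}
    for la, lb in zip(a, b):
        # each label of `a` must map to one label of `b`, and vice versa
        if forward.setdefault(la, lb) != lb or backward.setdefault(lb, la) != la:
            return False
    return True
```

For example, interchanging the labels "1" and "2" yields an equivalent labelling, whereas moving a particle between sets does not.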
We show such a labelling of the initial space in the bottom panel of Fig. 1. In this case, the labels were inferred from the membership to haloes in an N-body simulation that employs the initial conditions depicted in the top panel of Fig. 1 (see §§2.4). Our goal is to train a model to learn this instance segmentation into halo sets by training it on the output of N-body simulations.
We note that other studies have characterised the halo-formation process through a slightly different prediction problem. For example, Lucie-Smith et al. (2020) train a neural network to predict the final halo masses directly at the voxel level. While their approach offers insights into halo formation, our method provides a broader perspective: halo masses can be inferred easily through the size of the corresponding sets, but other properties can be inferred as well, for example the Lagrangian shapes of haloes, which are important to determine their spin (White 1984). Furthermore, our approach ensures the physical constraint that particles that become part of the same halo are assigned the same halo mass.

Fig. 1. In this work, we present a machine-learning approach to predict the formation of haloes (as in the bottom panel) from the initial conditions field (top panel).

Panoptic Segmentation
The proposed problem requires us first to segment the particles semantically into two different classes (halo or non-halo) and then to classify the particles inside the halo class into several different instances. The combination of such semantic plus instance segmentation is sometimes referred to as panoptic segmentation. Several strategies have been proposed to solve such panoptic segmentation problems (Kirillov et al. 2016; Bai & Urtasun 2016; Arnab & Torr 2017; De Brabandere et al. 2017; Kirillov et al. 2018, 2023), and they usually operate in two steps:

1. Semantic segmentation: The objective of this task is to predict, for each voxel in our initial conditions (representing a tracer particle in the N-body code), whether it will be part of a DM halo at z = 0. This is a classification problem, and we employ the balanced cross-entropy (BaCE) loss (Xie & Tu 2015) to tackle it:

\mathcal{L}_{\rm BaCE} = -\beta \sum_i Y_i \log \hat{Y}_i - (1-\beta) \sum_i (1 - Y_i) \log(1 - \hat{Y}_i) .    (3)

Here, Y represents the ground-truth data vector; each entry corresponds to a voxel and is equal to 1 if the associated particle ends up being part of a halo, and 0 otherwise. Ŷ contains the model predictions, with each entry representing the probability that the particle ends up in a halo. The parameter β handles the class imbalance and is calculated as the number of negative samples divided by the total number of samples. We measure β using our training simulations (see §§2.4) and obtain a value of β = 0.5815. After training our network, we need to choose a semantic threshold to generate the final semantic predictions. This threshold is calibrated to ensure that the fraction of predicted particles belonging to haloes equals 1 − β, resulting in a value of 0.589 (refer to Appendix C for an in-depth explanation).

2. Instance segmentation: The objective of this task is to recognize individual haloes (instances) by identifying which particles (among those predicted to be part of a DM halo) belong to the same object and separating them from all others.
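The BaCE loss for the semantic step can be sketched in a few lines of NumPy (using the β value quoted in the text as the default):

```python
import numpy as np

def balanced_cross_entropy(y_true, y_pred, beta=0.5815, eps=1e-7):
    """Balanced cross-entropy (Xie & Tu 2015). beta is the fraction of
    negative (non-halo) voxels; positive voxels are weighted by beta and
    negative ones by (1 - beta), compensating the class imbalance."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical safety for log()
    pos = -beta * y_true * np.log(y_pred)
    neg = -(1.0 - beta) * (1.0 - y_true) * np.log(1.0 - y_pred)
    return np.mean(pos + neg)
```

With β > 0.5, the rarer positive (halo) class contributes more per voxel, which is the point of the balancing.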
Instance segmentation tasks are not conventional classification problems: they have to deal with a varying number of instances and a permutation-invariant labelling. To our knowledge, there is no straightforward way to phrase the problem of classifying each voxel into a flexible number of permutable sets through a differentiable loss function. Typical approaches therefore train a model on a related differentiable loss and then apply a post-processing step on top of it. Unfortunately, this means that the loss function does not directly reflect the true objective.
Various approaches have been proposed to tackle this problem (Kirillov et al. 2016; Bai & Urtasun 2016; Arnab & Torr 2017; De Brabandere et al. 2017; Kirillov et al. 2018, 2023). A popular method is the watershed technique (Kirillov et al. 2016; Bai & Urtasun 2016). This method uses a network to predict the semantic segmentation and the borders of different instances (Deng et al. 2018), and then applies a watershed algorithm to separate the different instances in a post-processing step. However, the watershed approach comes with several limitations:

- It cannot handle the identification of disconnected regions belonging to the same instance, a problem known as occlusion.
- It is necessary to select appropriate threshold values for the watershed post-processing step to generate the final instance map. These parameters are typically chosen manually to match some particular metric of interest, but they might negatively impact the prediction of other properties. For instance, Bernardini et al. (2020) apply the watershed technique to predict Lagrangian halo regions identified with the HOP algorithm (Eisenstein & Hut 1998). However, they choose the watershed threshold to reproduce the halo-mass function, which does not ensure that the Lagrangian halo regions are correctly predicted.
- The watershed approach struggles to identify the borders of Lagrangian halo regions, since these are difficult to define. In Fig. 1 it can be appreciated that the borders of halo regions are very irregular. There also exist points in the "interior" of these regions that are "missing", which makes it particularly complex to define the border of a halo. Despite these challenges, in §A we apply the watershed method to predict the formation of FoF haloes and discuss how the border-prediction problem can be addressed.

An approach that offers greater flexibility for grouping arbitrarily arranged particles was presented by De Brabandere et al. (2017). We will follow this approach throughout the remainder of this work. The main idea behind this method, which we will refer to as the "Weinberger approach", is to train a model to produce a "pseudo-space representation" for all the elements of our input space (i.e., voxels/particles in the initial conditions). An ideal model would map voxels belonging to the same instance close together in the pseudo-space while separating them from voxels belonging to different instances. Consequently, the pseudo-space distribution would consist of distinct clouds of points, each representing a different instance (see Fig. 2). The post-processing step required to generate the final instance segmentation in the Weinberger approach is a clustering algorithm that operates on the pseudo-space distribution.

Weinberger loss
The Weinberger approach possesses some advantages over other instance segmentation techniques. First of all, the loss function more closely reflects the instance segmentation objective; that is, to classify different instances into a variable number of permutation-invariant sets. Secondly, the approach is more flexible and makes fewer assumptions: for example, it can handle occlusion cases and does not need to assume the existence of well-defined instance borders.
In Fig. 2, we schematically illustrate the effects of the individual components of the Weinberger loss. Each point in this figure represents the pseudo-space embedding of an input voxel. The colours indicate the assigned labels based on the ground truth: points sharing the same colour belong to the same instance, whereas different colours depict separate instances. The "centre of mass" of each cluster is computed and indicated with a coloured cross as the "cluster centre". The Weinberger loss is constituted by three separate terms:

- Pull force, equation (4): Given a certain instance c (where C is the total number of instances), a point i belonging to that set, whose pseudo-space position is x_i, feels an attraction proportional to its distance from the instance centre μ_c = (1/N_c) Σ_{i=1}^{N_c} x_i, where N_c is the number of members associated with the instance c:

\mathcal{L}_{\rm Pull} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \max\left(0, \|\mu_c - x_i\| - \delta_{\rm Pull}\right)^2 .    (4)

Points closer than δ_Pull (a hyperparameter of the Weinberger loss) to the instance centre do not experience any pull force. The pull force is represented in Fig. 2 as coloured arrows pointing towards the instance centres outside the solid-line circles, which symbolize the distance δ_Pull to the instance centres.

- Push force, equation (5): Two instances c_A and c_B repel each other if the distance between their instance centres in the pseudo-space, μ_{c_A} and μ_{c_B}, is smaller than 2δ_Push (a hyperparameter of the Weinberger loss); the repulsion grows the closer the two centres are:

\mathcal{L}_{\rm Push} = \frac{1}{C(C-1)} \sum_{c_A=1}^{C} \sum_{c_B \neq c_A} \max\left(0, 2\delta_{\rm Push} - \|\mu_{c_A} - \mu_{c_B}\|\right)^2 .    (5)

In Fig. 2, the push force is represented as grey arrows; the dashed circles represent the distance δ_Push to the instance centres.

- Regularization force, equation (6): To avoid an arbitrarily extended pseudo-space distribution, all instance centres feel an attraction towards the pseudo-space origin:

\mathcal{L}_{\rm Reg} = \frac{1}{C} \sum_{c=1}^{C} \|\mu_c\| .    (6)

The overall effect of these forces is combined in the total Weinberger loss:

\mathcal{L}_{\rm W} = c_{\rm Pull}\,\mathcal{L}_{\rm Pull} + c_{\rm Push}\,\mathcal{L}_{\rm Push} + c_{\rm Reg}\,\mathcal{L}_{\rm Reg} ,    (7)

where c_Pull, c_Push, and c_Reg are hyperparameters that regulate the strength of the different components.
Minimizing equation (7) ensures that the pseudo-space mapping produces instance clusters separated from each other. A model trained effectively will predict pseudo-space distributions in which points corresponding to the same instance are grouped together and distinctly separated from other instances. In an ideal scenario in which the Weinberger loss is zero, all points are closer than δ_Pull to their corresponding cluster centres, and cluster centres are at least 2δ_Push apart. Realistically, however, the Weinberger loss will not be exactly zero, necessitating a robust clustering algorithm for accurate instance-map predictions.
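The full loss of equation (7) can be sketched in NumPy; the defaults below are the hyperparameter values quoted in §§2.7, and the function name is ours:

```python
import numpy as np

def weinberger_loss(x, labels, d_pull=0.5, d_push=1.5,
                    c_pull=1.0, c_push=1.0, c_reg=0.001):
    """NumPy sketch of the discriminative ("Weinberger") loss of
    De Brabandere et al. (2017). x: (N, D) pseudo-space positions;
    labels: (N,) ground-truth instance label per particle."""
    centres = []
    pull = 0.0
    for lab in np.unique(labels):
        pts = x[labels == lab]
        mu = pts.mean(axis=0)                         # instance centre
        centres.append(mu)
        d = np.linalg.norm(pts - mu, axis=1)
        pull += np.mean(np.maximum(0.0, d - d_pull) ** 2)   # hinged pull term
    centres = np.asarray(centres)
    c = len(centres)
    pull /= c

    push = 0.0                                        # hinged push between centres
    if c > 1:
        for a in range(c):
            for b in range(c):
                if a != b:
                    gap = np.linalg.norm(centres[a] - centres[b])
                    push += np.maximum(0.0, 2.0 * d_push - gap) ** 2
        push /= c * (c - 1)

    reg = np.mean(np.linalg.norm(centres, axis=1))    # pull centres to origin
    return c_pull * pull + c_push * push + c_reg * reg
```

For two tight, well-separated clusters, the pull and push terms vanish and only the small regularization term remains, matching the "ideal scenario" described above.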
In Appendix B we describe the clustering algorithm that we have developed to robustly identify the different instance maps.
In our clustering algorithm, we first compute the local density for each point in our pseudo-space based on a nearest-neighbour calculation. We then identify groups as the descending manifolds of density maxima surpassing a specified persistence-ratio threshold. Particles are assigned to groups according to proximity and density characteristics. We merge groups selectively, ensuring that the persistence threshold is met. The algorithm relies on three key hyper-parameters for optimal performance: N_dens, N_ngb, and p_thresh. This approach effectively segments the pseudo-space distribution of points even when perfect separation is not achieved, thus enhancing the reliability of the predicted instance maps.
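The grouping stage can be illustrated with a simplified density-peak sketch. This is not the exact algorithm of Appendix B: the persistence-based merging controlled by p_thresh is omitted, and the density estimate and linking rule below are illustrative assumptions:

```python
import numpy as np

def cluster_pseudo_space(x, n_dens=20, n_ngb=15):
    """Simplified density-peak grouping of pseudo-space points. The local
    density is estimated from the distance to the n_dens-th neighbour, and
    each point is linked to the densest of its n_ngb nearest neighbours,
    following links up to a density maximum (persistence merging omitted)."""
    n = len(x)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # O(N^2)
    order = np.argsort(dist, axis=1)                  # neighbours by distance
    dens = 1.0 / (dist[np.arange(n), order[:, n_dens]] + 1e-12)
    ngb = order[:, : n_ngb + 1]                       # includes the point itself
    link = ngb[np.arange(n), np.argmax(dens[ngb], axis=1)]
    labels = np.empty(n, dtype=int)
    for i in range(n):                                # follow links to a maximum
        j = i
        while link[j] != j:
            j = link[j]
        labels[i] = j                                 # label = index of the peak
    return labels
```

Because density strictly increases along each link, the chains terminate at local maxima, which act as the group labels; a KD-tree would replace the brute-force distance matrix in practice.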

Dataset of Simulations
We generate twenty N-body simulations with different initial conditions to use as training and validation sets for our panoptic segmentation model. Our simulations are carried out using a lean version of L-Gadget3 (see Springel et al. 2008; Angulo et al. 2012, 2021). For each of these simulations, we evolve the DM density field employing N_DM = 256³ DM particles in a volume of V_box = (50 h⁻¹ Mpc)³, resulting in a DM particle mass of m_DM = 6.35 × 10⁸ h⁻¹ M_⊙. All our simulations employ the same softening length, ε = 5 h⁻¹ kpc, and share the cosmological parameters derived by Planck Collaboration et al. (2020), that is, σ_8 = 0.8288, n_s = 0.9611, h = 0.6777, Ω_b = 0.048252, Ω_m = 0.307112, and Ω_Λ = 0.692888. Our suite of simulations is similar to the one employed in Lucie-Smith et al. (2020).
We use a version of the NgenIC code (Springel 2015) that uses second-order Lagrangian Perturbation Theory (2LPT) to generate the initial conditions at z = 49. We employ a different random seed for each simulation to sample the Gaussian random field that determines the initial density field. We identify haloes at redshift z = 0 in our simulations using a Friends-of-Friends (FoF) algorithm (Davis et al. 1985) with linking length b = 0.2. In this work, we only consider haloes formed by 155 particles or more, corresponding to M_FoF ≳ 10¹¹ h⁻¹ M_⊙. We use 18 of these simulations to train our model and keep 2 of them to validate our results.
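The quoted particle mass follows directly from the cosmology and box size, and the 155-particle cut then gives the minimum FoF halo mass:

```python
# Mean DM particle mass: m_DM = Omega_m * rho_crit * (L/N)^3, with
# rho_crit ~ 2.775e11 h^2 Msun/Mpc^3 (the h factors cancel for L in
# h^-1 Mpc and masses in h^-1 Msun).
RHO_CRIT = 2.775e11        # h^2 Msun / Mpc^3
omega_m = 0.307112
L, N = 50.0, 256           # box side in h^-1 Mpc, particles per dimension

m_dm = omega_m * RHO_CRIT * (L / N) ** 3   # ~6.35e8 h^-1 Msun
m_min = 155 * m_dm                         # FoF threshold, ~1e11 h^-1 Msun
```

This reproduces m_DM = 6.35 × 10⁸ h⁻¹ M_⊙ and confirms that 155 particles correspond to roughly 10¹¹ h⁻¹ M_⊙.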

Assessing the level of indetermination
In addition to the training and test sets, we run a set of simulations to establish a target accuracy for our model. These simulations test to what degree small sub-resolution changes of the initial density field can affect the final Lagrangian halo regions.
Structure formation simulations resolve the initial conditions of a considered universe only to a limited degree and therefore exhibit an inherent degree of uncertainty. (1) The numerical precision of simulations is limited (e.g. to 32-bit floating-point numbers), and therefore any results that depend on the initial conditions beyond machine precision are inherently uncertain. For example, Genel et al. (2019) show that changes in the initial displacement of N-body particles at the machine-precision level can lead to differences in the final locations of particles as large as individual haloes.
(2) The initial discretization can only resolve the random perturbations of the Gaussian random field down to a minimum length scale of the mean particle separation. If the resolution of a simulation is increased, then additional modes enter the resolved regime and act as additional random perturbations. Such additional perturbations may induce random changes in the halo assignment of much larger-scale structures.
A good model should learn all aspects of structure formation that are certain and well resolved at the considered discretization level. However, there is little use in predicting aspects that are under-specified and may change with resolution level. Therefore, we conduct an experiment to establish a baseline for how accurate our model shall be.
We run two additional N = 256³ simulations with initial conditions generated by the MUSIC code (Hahn & Abel 2011). For these simulations we keep all resolved modes fixed (up to the Nyquist frequency of the 256³ grid), but we add to the particles different realisations of the perturbations that would be induced by the next higher resolution level. We do this by selecting one out of every 2³ particles from two initial-condition files with 512³ particles and with different seeds at the highest level ("level 9" in MUSIC). Therefore, the two simulations differ only in the random choice of perturbations that are unresolved at the 256³ level. We refer to these two simulations as the "baseline" simulations.
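The particle selection amounts to taking every second particle along each dimension of the 512³ Lagrangian grid. A schematic sketch (the function name and the assumed row-major grid ordering of the position array are ours):

```python
import numpy as np

def downsample_ics(pos_highres, n_high=512, n_low=256):
    """Select every second particle per dimension (one of every 2^3 = 8)
    from a high-resolution particle load arranged on an n_high^3
    Lagrangian grid, yielding the matching n_low^3 particle set.
    Assumes row-major grid ordering of the (n_high^3, 3) position array."""
    step = n_high // n_low
    grid = pos_highres.reshape(n_high, n_high, n_high, 3)
    return grid[::step, ::step, ::step].reshape(-1, 3)
```

Applying this to two 512³ files that share the resolved modes but differ in the "level 9" perturbations yields two 256³ particle loads differing only in unresolved small-scale noise.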
In Fig. 3 we show a slice of the Lagrangian halo patches at z = 0 through these two simulations (left and right panels, respectively). Each pixel corresponds to a certain particle, and the colour map represents the mass of the halo that the particle becomes part of, which corresponds to the size of the associated halo-set. We can appreciate that the outermost parts of the Lagrangian regions are particularly affected, while the innermost parts remain unchanged. Notably, in certain instances, significant changes appear due to the merging of haloes in one of the simulations where separate haloes are formed in the other (black-circled regions).
Throughout this article, we will use the degree of correspondence between the baseline simulations as a reference accuracy level. We consider a model close to optimal if the difference between its predictions and the ground truth is similar to the differences observed between the two baseline simulations. A lower accuracy than this would mean that a model has not optimally exploited all the information that is encoded in the initial conditions. A higher accuracy than this level is not desirable, since it is not useful to predict features that depend on unresolved aspects of the simulation and may be changed by increasing the resolution level.

V-Net Architecture
V-nets are state-of-the-art models, the product of many advances in the field of ML over the last decades (Fukushima 1980; Lecun et al. 1998; Krizhevsky et al. 2012; Szegedy et al. 2014; Long et al. 2014; Ronneberger et al. 2015; He et al. 2015). They are a particular kind of convolutional neural network (CNN) developed and optimized to efficiently map between volumetric inputs and volumetric outputs. V-nets are formed by two separate modules: the encoder (or contracting path), which learns to extract large-scale abstract features from the input data; and the decoder (or up-sampling path), which translates the information captured by the encoder into voxel-level predictions (also making use of the information retained in the skip connections). We train V-nets to minimize the loss functions presented in §§2.2 and §§2.3. We now explain the technical characteristics of how we have implemented a V-net architecture in TensorFlow (Abadi et al. 2015) (see Fig. 4 for a schematic representation of our network architecture):

- Input: Our network is designed to accept as input 3D crops consisting of 144³ voxels (even powerful modern GPU cards, such as the NVIDIA Quadro RTX 8000 with 48 GB of memory used in this work, are insufficient to accommodate a larger input size while maintaining a reasonably complex network architecture). For the results presented in §3, we employ two input channels for the semantic segmentation model, corresponding to the initial density field and the displacement potential, which is defined through Poisson's equation as ∇²φ = δ. For the instance segmentation model, we include three additional input channels corresponding to the Lagrangian positions of particles. This is necessary since the network has to be able to map different haloes with the same density (and potential) structure at different locations in the initial field to different locations in the pseudo-space.

- Encoder / contracting / down-sampling path: This module consists of consecutive down-scaling blocks that reduce the number of voxels per dimension by half at each level of the network. The purpose of the down-scaling path is to enlarge the network's field of view, enabling per-voxel predictions that take into account distant regions of the field. Achieving this with large convolution kernels would be impractical, as they would consume excessive memory. Within each down-sampling block, we apply three consecutive convolution operations, each followed by a Leaky-ReLU activation function. The number of convolution filters in a contracting block doubles with each level of compression to improve the performance of the model. For each level, the latent maps computed before the final convolution (the one used to reduce the data size) are temporarily stored to serve as a skip connection for the up-scaling path. In Fig. 4 we show the dimensions of the latent maps computed at each level of the contracting path; the deepest level of our network has a size of 9³ × 128.

- Decoder / up-sampling path: This path operates opposite to the contracting path; each up-scaling block doubles the number of voxels per dimension, ultimately recovering an image with the same dimensions as the original input (see Fig. 4). The up-sampling path facilitates the extraction of smaller-scale features that influence the final per-voxel predictions. Within an up-sampling block, the final convolution is substituted with a transposed convolution operation, which doubles the output size per dimension.

- Output: The final module of our network takes as input the latent maps with dimensions 144³ × 16. Its functionality varies depending on the task at hand. For semantic segmentation, a single convolution operation is performed, resulting in a latent map of 144³ × 1; this map is subsequently cropped to 128³ × 1, and finally a sigmoid activation function is applied. For instance segmentation, we work in a three-dimensional pseudo-space, hence we employ a convolution with three filters to obtain 144³ × 3 maps, which are afterwards cropped to 128³ × 3. In both cases, the final cropping operation is implemented to enhance model performance by focusing on the central region of the image.
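The spatial bookkeeping of the architecture can be verified with a few lines (a trivial sketch; the per-level filter counts are read off Fig. 4 and not reproduced here):

```python
def contracting_shapes(n_in=144, n_levels=4):
    """Voxels per dimension at each level of the contracting path:
    each down-scaling block halves the spatial size."""
    shapes = [n_in]
    for _ in range(n_levels):
        shapes.append(shapes[-1] // 2)
    return shapes

# 144^3 input -> 9^3 deepest level after four halvings
levels = contracting_shapes()   # [144, 72, 36, 18, 9]

# the 144^3 output map is cropped to the central 128^3 region,
# i.e. an 8-voxel margin is discarded on each side
margin = (144 - 128) // 2
```

This confirms that four halvings take the 144³ input to the 9³ bottleneck, and that the final crop removes an 8-voxel border.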
The V-Net architecture we have implemented is a state-of-the-art model that encompasses over 3 × 10⁶ trainable parameters.

Training
We train our segmentation networks using a single NVIDIA Quadro RTX 8000 GPU card. As mentioned in §§2.4, we employ 18 simulations for training, dividing the training process into separate stages for the semantic and instance models.
To ensure robust training and enhance the diversity of training examples without needing to run more computationally expensive simulations, we apply the following data-augmentation operations each time we extract a training sample from our simulation suite:

1. Select one of the training simulation boxes at random.
2. Select a random voxel as the centre of the input/output regions.
3. Extract the input (144³) and target (128³) fields of interest by cropping the regions around the central point, taking into account the periodic boundary conditions of the simulations.
4. Randomly transpose the order of the three input grid dimensions q_x, q_y, q_z.
5. Randomly choose whether to flip the axes of the input fields.
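The augmentation steps above can be sketched as follows (the function name and array layout are illustrative; the matching 128³ target crop is handled analogously and omitted for brevity):

```python
import numpy as np

def draw_training_sample(boxes, n_in=144, rng=None):
    """Sketch of the augmentation pipeline: random box, random periodic
    crop, random transposition, and random flips. Each entry of `boxes`
    is a (channels, N, N, N) array of input fields."""
    rng = np.random.default_rng() if rng is None else rng
    box = boxes[rng.integers(len(boxes))]              # 1. random simulation
    n = box.shape[1]
    c = rng.integers(n, size=3)                        # 2. random centre voxel
    # 3. periodic crop: roll the chosen centre into the crop window, then slice
    shift = tuple(int(n_in // 2 - ci) for ci in c)
    crop = np.roll(box, shift=shift, axis=(1, 2, 3))[:, :n_in, :n_in, :n_in]
    # 4. random transposition of the three spatial dimensions q_x, q_y, q_z
    crop = np.transpose(crop, (0, *(rng.permutation(3) + 1)))
    # 5. random flips of the spatial axes
    for ax in (1, 2, 3):
        if rng.random() < 0.5:
            crop = np.flip(crop, axis=ax)
    return crop
```

Using np.roll before slicing implements the periodic boundary conditions without any explicit padding logic.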
To train our semantic and instance segmentation networks, we minimize the respective loss functions, equation (3) and equation (7), employing the Adam optimizer implemented in TensorFlow (Abadi et al. 2015). We train our models for over 80 epochs; each epoch performs mini-batch gradient descent using 100 batches, and each batch is formed by 2 draws from the training simulations. We deliberately choose a small batch size to avoid memory issues and to ensure the network's capability to handle large input and output images (144³ and 128³, respectively). Selecting a small batch size induces more instability during training; we mitigate this issue by applying the gradient-clipping normalization operation defined in TensorFlow during the backpropagation step.
The hyper-parameter β enters the balanced cross-entropy loss of equation (3). Regarding the hyper-parameters in the Weinberger loss, equation (7), we adopt the values presented in De Brabandere et al. (2017), as we have observed that varying these parameters does not significantly affect our final results. The specific hyper-parameter values are the following: c_Pull = 1, δ_Pull = 0.5, c_Push = 1, δ_Push = 1.5, and c_Reg = 0.001. We have conducted a hyper-parameter optimization for the clustering algorithm described in Appendix B and found the following values: N_dens = 20, N_ngb = 15 and p_thresh = 4.2 (see Table 2).
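The structure of the Weinberger loss with the quoted constants can be made explicit. The numpy sketch below implements the pull, push, and regularization terms of the discriminative loss of De Brabandere et al. (2017) for a batch of pseudo-space positions and true instance labels; vectorization details and the exact reduction over batches differ from our TensorFlow implementation:

```python
import numpy as np

def weinberger_loss(embed, labels, c_pull=1.0, d_pull=0.5,
                    c_push=1.0, d_push=1.5, c_reg=0.001):
    """Discriminative (Weinberger) loss of De Brabandere et al. (2017).
    embed:  (N, D) pseudo-space positions of N particles
    labels: (N,) instance id of each particle"""
    ids = np.unique(labels)
    mus = np.array([embed[labels == c].mean(axis=0) for c in ids])
    # pull term: attract members towards their instance center (hinged at d_pull)
    pull = np.mean([
        np.mean(np.maximum(
            np.linalg.norm(embed[labels == c] - mu, axis=1) - d_pull, 0.0) ** 2)
        for c, mu in zip(ids, mus)])
    # push term: repel pairs of instance centers closer than 2 * d_push
    push = 0.0
    if len(ids) > 1:
        push = np.mean(
            [np.maximum(2 * d_push - np.linalg.norm(mus[a] - mus[b]), 0.0) ** 2
             for a in range(len(ids)) for b in range(len(ids)) if a != b])
    # regularization term: keep instance centers near the origin
    reg = np.mean(np.linalg.norm(mus, axis=1))
    return c_pull * pull + c_push * push + c_reg * reg
```

Once trained, particles of the same instance cluster within d_pull of their center while distinct centers sit more than 2 d_push apart, which is what makes the subsequent clustering step simple.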
Our semantic and instance models are designed to predict regions comprising 128^3 particles due to technical limitations regarding GPU memory. To overcome this limitation and enable the prediction of larger simulation volumes, we have developed an algorithm that seamlessly integrates sub-volume crops. For our semantic model, we serially concatenate sub-volume predictions to cover the full simulation box. For our instance network, we propose the method described in Appendix D. In summary, this method works as follows: we generate two overlapping lattices. Both lattices cover the entire simulation box, but the second one is shifted with respect to the first one (its sub-volume centers lie on the nodes of the first one). The overlapping regions between the lattices are employed to determine whether instances from different crops should be merged or not. We have verified that this procedure is robust by checking that the final predictions are not sensitive to the particular lattice choice.
We train our semantic and instance networks separately. The semantic predictions are not employed at any stage during the training process of the instance model. To compute the instance loss, equation (7) is evaluated using the true instance maps and the pseudo-space positions. The semantic predictions are only employed once both models have been trained. We use them to mask out pseudo-space particles not belonging to haloes. Then, the clustering algorithm described in Appendix B is applied to identify clusters of particles in the pseudo-space, which yields the final proto-halo regions.
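The overlap-based merging decision can be illustrated with a deliberately simplified stand-in for the procedure of Appendix D: for each instance of one crop, find which instance of the shifted crop covers it inside the overlap region, and merge the two labels when the covered fraction is large enough. The threshold min_frac and the use of -1 as the background marker are assumptions of this sketch, not the paper's actual criterion:

```python
import numpy as np

def merge_decisions(labels_a, labels_b, min_frac=0.5):
    """Given the instance labels of two crops restricted to their overlap
    region (flattened arrays of equal length, -1 = background), return the
    list of (label_a, label_b) pairs that should be merged."""
    pairs = []
    for la in np.unique(labels_a):
        if la < 0:  # skip background
            continue
        sel = labels_a == la
        candidates, counts = np.unique(labels_b[sel], return_counts=True)
        best = np.argmax(counts)
        # merge when one instance of crop B dominates the overlap of `la`
        if candidates[best] >= 0 and counts[best] / sel.sum() > min_frac:
            pairs.append((int(la), int(candidates[best])))
    return pairs
```

Applying such pairwise decisions with a union-find structure over all overlap regions would stitch the per-crop instances into box-wide proto-haloes.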

Model Evaluation
In this section, we test the performance of our models for semantic segmentation (§3.1) and instance segmentation (§3.2). We use the two simulations reserved for validation to generate the results presented in this section.

Semantic Results
In Fig. 5, we compare the predictions of the semantic segmentation network with the halo segmentation found in the validation simulation. The leftmost panel illustrates a slice of the ground truth. Voxels/particles of the initial conditions belonging to a DM halo at z = 0 are shown in red; blue voxels represent particles not belonging to a DM halo at z = 0.
The central panel of Fig. 5 displays the probabilistic predictions from our semantic model for the same slice. The colour map indicates the probability assigned to each pixel of belonging or not to a DM halo. Voxels with a white colour have a 50% predicted probability of belonging to a halo. The neural network tends to smooth out features, assigning uncertain probabilities to regions near halo borders, while consistently assigning high probabilities to inner regions and low probabilities to external regions. In the ground truth it is possible to observe that some interior particles within proto-haloes belong to the background; we refer to these as "missing voxels". One of the consequences of the smoothing effect of our network is to ignore these missing voxels, predicting a homogeneous probability of collapse in the interior regions of proto-haloes. The missing voxels in the Lagrangian structure seem to be a feature so sensitive to the initial conditions that it is impossible to capture accurately at a voxel level. This is supported by the fact that the missing voxels also change significantly between the baseline simulations (see Fig. 3).
The rightmost panel of Fig. 5 shows the pixel-level error map for the same slice. We select a semantic threshold value equal to 0.589 to generate these results. We choose this value for the semantic threshold so that the total predicted number of particles that belong to a halo matches the number of collapsed voxels in the validation simulations. In Appendix C we further analyze the sensitivity of our semantic results to the value chosen for the semantic threshold. We use different colours to represent the corresponding classes of the confusion matrix: green corresponds to true positive (TP) cases, blue to true negatives (TN), black to false negatives (FN), and red to false positives (FP). Some regions are particularly challenging for the network to predict, likely due to their sensitivity to changes in the initial conditions. For example, in the rightmost panel of Fig. 5, many FN regions appear as black string-like structures surrounding TP collapsed regions. These FN cases likely correspond to particles infalling into the halo at z = 0, identified as part of the FoF group despite not having completed their first pericentric passage. Capturing this behaviour might be particularly challenging for the network, since the exact shape of these "first-infall" regions is more sensitive to small changes in the initial conditions and can also be influenced by distant regions of the proto-haloes that do not completely fit within the field of view of our network (which can occur for very massive proto-haloes). We can also see FP regions that appear between the FN string-like regions and the TPs corresponding to the central Lagrangian regions of haloes. Additionally, the boundaries of the largest haloes may be especially difficult for the network to predict, since they only fit partially into the field of view.
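The threshold choice described above is an abundance matching: sort the predicted probabilities and pick the value above which exactly as many voxels lie as there are collapsed voxels in the ground truth. A minimal numpy sketch (function name and tie handling are our own assumptions):

```python
import numpy as np

def abundance_matched_threshold(probs, n_collapsed):
    """Pick the semantic threshold so that the number of voxels predicted as
    collapsed (probs > threshold) matches the ground-truth count."""
    flat = np.sort(probs.ravel())
    if n_collapsed >= flat.size:
        return 0.0  # everything collapses; any threshold below min works
    # the n_collapsed highest-probability voxels end up above the threshold
    return flat[-n_collapsed - 1]
```

This calibration fixes the global abundance of collapsed voxels while leaving the spatial pattern entirely to the network.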
The results presented in Fig. 5 suggest, upon visual inspection, that our model accurately captures many of the complex dynamics that determine halo collapse. To rigorously assess the performance of our model we need to quantify the results obtained from our semantic network and compare them with the differences between the baseline simulations, as discussed in §2.
In Table 2 we present the values of some relevant metrics that we can employ to evaluate the performance of our semantic network (we have considered the semantic threshold of 0.589). In particular, we study the behaviour of five different metrics: the true positive rate TPR = TP/(TP + FN), the true negative rate TNR = TN/(TN + FP), the positive predictive value PPV = TP/(TP + FP), the accuracy ACC, and the F1-score, which is a more representative score than the accuracy when considering unbalanced datasets:
F1 = 2 TP / (2 TP + FP + FN). (9)
Table 2 also contains the scores measured using the baseline simulations. Our model returns values for all the metrics very close to the optimal target from the baseline simulations. This demonstrates the reliability of our model in predicting the well-specified aspects of halo collapse. See Appendix C for a more detailed discussion of the performance of our semantic model and the relation between the selected semantic threshold and the results contained in Table 2.
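All five scores follow directly from the boolean collapse maps; a small numpy sketch (the function name is ours):

```python
import numpy as np

def semantic_metrics(pred, truth):
    """Confusion-matrix scores from boolean collapse maps of equal shape."""
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return {
        "TPR": tp / (tp + fn),                  # true positive rate
        "TNR": tn / (tn + fp),                  # true negative rate
        "PPV": tp / (tp + fp),                  # positive predictive value
        "ACC": (tp + tn) / (tp + tn + fp + fn), # accuracy
        "F1": 2 * tp / (2 * tp + fp + fn),      # F1-score, equation (9)
    }
```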
In addition to the optimal case, we compare our semantic model with the explicit implementation of the excursion set theory from ExSHalos (Voivodic et al. 2019). The ExSHalos code grows spheres around the density peaks in the Lagrangian density field until the average density inside crosses a specified barrier for the first time. The barrier shape is motivated by the ellipsoidal collapse (Sheth et al. 2001; de Simone et al. 2011) with three free parameters that were fitted to reproduce the mean mass function of our simulations. In Table 2 we include the semantic metrics measured with the ExSHalos results. While ExSHalos can describe halo formation to some degree, there exist some aspects that go beyond the spherical excursion set paradigm which are better captured by our semantic model. A more detailed analysis of the results obtained with ExSHalos is presented in Appendix E.
In Fig. 6 we compare the values of the predicted TPR as a function of ground truth halo mass (TPR_Pred, solid green line) with the TPR values measured from the baseline simulations (TPR_base, solid black line). It is possible to perform this comparison for the TPR because, in the ground truth data, we retain information about the mass of the FoF-haloes associated with each DM particle. Therefore, we can compute the fraction of TP cases in different ground-truth mass bins by selecting the voxels according to the mass associated with them in the ground truth.
In Fig. 6, the values of TPR_base increase with halo mass, indicating that particles that end up in lower-mass haloes are more sensitive to small-scale changes in the initial conditions and are consequently harder to predict accurately. Our network's predictions follow a similar trend, albeit with some discrepancies. The model seems to under-predict the number of particles that end up in haloes with masses lower than M_True ⪅ 10^12 h^-1 M_⊙ (dotted vertical black line in Fig. 6). This indicates that our model tends to under-predict the number of pixels identified as TPs at the lower mass end. For haloes whose mass is greater than 10^12 h^-1 M_⊙, our model returns accurate predictions over a broad range, extending more than two orders of magnitude in halo mass.
In this subsection, we have demonstrated that our semantic model extracts most of the predictable aspects of halo formation by comparing our results with the baseline simulations (which only differ in unresolved aspects of the initial conditions). We now employ the predictions of our semantic network to generate the final results using our instance segmentation model.

Instance Results
We provide some examples of our instance predictions in Fig. 7. The left column displays the ground truth masses of halo Lagrangian regions extracted from the simulation results (analogous to Fig. 3); the right column shows the predictions obtained from our segmentation pipeline. We compute halo masses from the instance predictions by counting the number of particles/voxels that have been assigned the same label and multiplying that by the particle mass of our simulations, m_DM = 6.35. The shapes of the halo contours are well captured, thanks in part to the semantic predictions. The instance segmentation pipeline successfully distinguishes the different haloes that have formed, and in most cases correctly separates neighbouring haloes. This is not a trivial task, since the size of halo Lagrangian regions varies across several orders of magnitude. Therefore, the instance segmentation pipeline must correctly separate wildly different particle groupings in the pseudo-space. Fig. 7 shows that our instance segmentation pipeline correctly identifies different Lagrangian halo regions in the majority of cases. However, we note that differences arise on the one hand for very small haloes that are close to the resolution limit, and on the other hand for very large haloes that exceed the field of view of the network.
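The mass assignment reduces to a label count; a numpy sketch (using -1 as the background marker is an assumption of this illustration):

```python
import numpy as np

def halo_masses(instance_labels, m_dm):
    """Halo mass per predicted instance: number of member voxels times the
    simulation particle mass. Background voxels are marked with -1."""
    ids, counts = np.unique(instance_labels[instance_labels >= 0],
                            return_counts=True)
    return dict(zip(ids.tolist(), (counts * m_dm).tolist()))
```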
In Fig. 8, we present a comparison between the ground truth halo masses and the predicted masses associated with the particles/voxels in our validation set. To generate these results we apply the following procedure: we select all the ground truth voxels/particles that end up in FoF-haloes and study the predictions associated with them. We can associate a predicted mass with every voxel that belongs to a DM halo. In these cases, we can compare the predicted mass values (M_Pred) with the ground truth masses (M_True) at a voxel level. This comparison is shown in the main panel of Fig. 8 as black violin plots ("violins" henceforth). The mass range covered by the black violins goes from M_True = 10^11 h^-1 M_⊙, corresponding to the minimum mass of haloes (155 particles), to M_True ≈ 10^14.7 h^-1 M_⊙, which is the mass of the most massive halo identified in the validation simulations. The number of high-mass haloes is smaller than that of small-mass ones, and therefore the higher-mass end of the violin plot exhibits more noise. We can appreciate that the median predictions (black dots) correctly reproduce the expected behaviour (ground truth) over several orders of magnitude.
The voxels identified as part of a halo in the ground truth, but not in the predicted map, are false negative (FN) cases. For these occurrences, we can study the dependence of the false negative rate (FNR) as a function of the ground truth halo mass (solid black line in the top panel of Fig. 8; analogous to Fig. 6). We can also study the reciprocal case in which a voxel is predicted to be part of a halo (hence, it has an associated M_Pred) but the ground truth voxel is not collapsed. These cases correspond to false positives (FP), but to make a comparison as a function of mass we can only express it in terms of the predicted mass. Therefore, we show as a dashed black line in the top panel of Fig. 8 the false discovery rate, FDR = FP/(FP + TP).
We compare our results with those obtained from the baseline simulations. In the main panel of Fig. 8 we present the corresponding violin plots from the baseline simulations with green lines. The range that the green violins span is smaller than that of the black violins, since the most massive halo identified in the baseline simulations has a mass of M_True ≈ 10^14.4 h^-1 M_⊙. In the top panel, the solid and dashed green lines represent the FNR and FDR respectively. As expected, the FNR and FDR coincide in the case of the baseline simulations. The top panel results demonstrate that our predictions are comparable to those of the baseline simulations (as pointed out in Fig. 6) over most of the considered mass range. However, they get progressively worse for masses below M_True ⪅ 10^12 h^-1 M_⊙ (vertical dotted black line), deviating from the baseline trend. This indicates that our model struggles to capture the correct behaviour of lower-mass haloes but produces accurate predictions for higher-mass ones. When comparing the violin plot distributions of our model with the baseline simulations we obtain similar (but slightly broader) contours. Achieving a scatter similar to that of the baseline simulations indicates that our model captures the well-resolved aspects of halo formation. We want to emphasize that precise predictions for halo masses are not directly enforced through the training loss, but are a side product of precisely reproducing halo Lagrangian patches. The scatter broadens for smaller halo masses and the network loses accuracy in these cases, sometimes associating smaller haloes close to a big Lagrangian patch with their closest more massive neighbour.
In the main panel of Fig. 8, we include the violin plot lines presented in Lucie-Smith et al. (2020) (blue violin lines). In this study, a neural network was trained to minimize the difference between predicted and true halo masses at the particle level, using as inputs the initial density field or the potential. The focus of Lucie-Smith et al. (2020) is to examine how different features of the initial conditions influence mass predictions within a framework that mirrors analytical models.
The comparison between our methodology and Lucie-Smith et al. (2020) in Fig. 8 highlights the differing outcomes that arise from the distinct objectives and constraints each model employs. While both models ultimately predict halo masses, we suggest that our approach benefits from the rigid operator that groups particles together and assigns them the same halo mass. Therefore, analytical approaches to predicting the formation of structures may benefit from knowing about the fate of neighbouring particles. Since in excursion set formalisms this is only possible to a limited degree, this increases the motivation for considering alternative approaches, like the one proposed by Musso & Sheth (2023a).
In Appendix E, we include a comparison of our instance model with the predictions of ExSHalos (Voivodic et al. 2019). In Fig. E.1, we show a map-level comparison between the Lagrangian shapes of friends-of-friends proto-haloes and the ExSHalos predictions. The shapes of proto-haloes predicted by the ExSHalos implementation are limited to sphere-like volumes, which affects its flexibility and, consequently, its accuracy (see Table 2). While ExSHalos correctly replicates the halo mass function, it does not capture the detailed Lagrangian shapes of proto-haloes.
In Fig. 9 we present the halo mass function (HMF) computed using the validation simulations (solid black line). The dashed black line shows the predicted HMF computed using the results of our instance segmentation pipeline. We can appreciate that our predictions reproduce the N-body results over a range that spans more than two orders of magnitude. Our results improve upon the prediction mass range for the HMF of previous similar approaches (Berger & Stein 2019; Bernardini et al. 2020). This is despite the fact that Bernardini et al. (2020) select their hyper-parameters to reproduce the HMF, while Berger & Stein (2019) reproduce the HMF corresponding to Peak Patch haloes (Stein et al. 2019) instead of the HMF associated with FoF haloes. In Fig. 9 we also include a solid blue line representing the theoretical HMF predictions using the model by Ondaro-Mallea et al. (2022). We compare this result with the HMF associated with the haloes predicted by our model using the density and potential fields of a realization with 1024^3 particles and a volume of V_box = (200 h^-1 Mpc)^3. Both lines show a good agreement in the 10^12 - 10^15 h^-1 M_⊙ range.
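The HMF itself is a histogram of the predicted halo masses per unit comoving volume; a minimal numpy sketch (the binning convention, dn/dlog10 M, is our own choice):

```python
import numpy as np

def halo_mass_function(masses, box_size, bins):
    """dn/dlog10(M): comoving number density of haloes per dex in mass.
    masses:   iterable of halo masses (h^-1 M_sun)
    box_size: box side length (h^-1 Mpc)
    bins:     log10-mass bin edges"""
    counts, edges = np.histogram(np.log10(masses), bins=bins)
    dlog10m = np.diff(edges)
    centres = 0.5 * (edges[1:] + edges[:-1])
    return centres, counts / (box_size ** 3 * dlog10m)
```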
We conclude that our semantic plus instance segmentation pipeline correctly reproduces the Lagrangian halo shapes of FoF-haloes spanning a mass range between 10^12 h^-1 M_⊙ and 10^14.7 h^-1 M_⊙. We have tested the accuracy of our results employing different metrics (presented in several tables and figures). Quantities inferred from our predicted Lagrangian halo regions, such as the predicted halo masses, correctly reproduce the trends computed using N-body simulations and improve upon the results presented in previous studies.

Experiments
In this section, we test how our network reacts to systematic modifications of the input density field and potential, and how well it generalizes to scenarios that lie beyond the trained domain. Therefore, we analyze the response to large-scale density perturbations, to large-scale tidal fields, and to changes in the variance of the density field.

Response to large scale densities
We study the response of the haloes to a large-scale over-density such as is typically considered in separate universe simulations (Wagner et al. 2015a; Lazeyras et al. 2016; Li et al. 2014). We add a constant δ_ϵ to the input density field δ(q) so that the new density field δ*(q) is given by
δ*(q) = δ(q) + δ_ϵ, (11)
and to maintain consistency with Poisson's equation, see equation (8), we add a quadratic term to the potential:
φ*(q) = φ(q) + (δ_ϵ / 6) |q - q_0|^2, (12)
where q_0 is an arbitrary (and irrelevant) reference point (Stücker et al. 2021a), which we choose to be the centre of our considered domain. Note that we break the periodic boundary conditions here, so it is difficult to apply this operation to the whole box; instead we consider it only for a smaller region to avoid boundary effects.
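This separate-universe modification, a constant density offset plus the matching quadratic potential term, can be applied directly to the input grids. In the numpy sketch below the grid conventions and units (Poisson's equation taken as ∇²φ = δ, with q_0 at the box centre) are simplifying assumptions:

```python
import numpy as np

def add_large_scale_density(delta, phi, delta_eps, box_size):
    """Return delta*(q) = delta(q) + delta_eps and the potential with the
    quadratic term (delta_eps / 6) |q - q0|^2 added, so that the Laplacian
    of phi* still equals delta* (in units where Poisson reads so)."""
    n = delta.shape[0]
    q = (np.arange(n) + 0.5) * box_size / n      # cell-centred coordinates
    qx, qy, qz = np.meshgrid(q, q, q, indexing="ij")
    q0 = box_size / 2.0                          # reference point: box centre
    r2 = (qx - q0) ** 2 + (qy - q0) ** 2 + (qz - q0) ** 2
    return delta + delta_eps, phi + delta_eps / 6.0 * r2
```

The factor 1/6 follows from the Laplacian of |q - q0|^2 being 6, so the added term contributes exactly delta_eps to the source of Poisson's equation.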
We show how haloes respond to this modification in Fig. 10. The middle panel shows the predicted masses associated with the particles/voxels (in a similar way to Fig. 7) for the reference field, δ_ϵ = 0. The upper and lower panels show the results of adding a constant term of δ_ϵ = -0.5 and δ_ϵ = 0.5 to the initial over-density field, respectively.
Increases in the background density lead to more mass collapsing onto haloes, thus generally increasing the Lagrangian volume of haloes. Furthermore, this leads in many cases to previously individual haloes merging into one bigger structure. This is qualitatively consistent with what is observed in separate universe simulations (e.g. Dai et al. 2015; Wagner et al. 2015b; Barreira et al. 2019; Jamieson & Loverde 2019; Terasawa et al. 2022; Artigas et al. 2022).
To evaluate quantitatively whether the model has learned the correct response to large-scale density perturbations, we test whether it recovers the same halo bias that has been measured in previous studies (see Desjacques et al. 2018, for a review). In separate universe experiments, the linear bias parameter can be inferred as the derivative of the halo mass function n(M) with respect to the large-scale density:
b_1 = ∂ ln n(M) / ∂ δ_ϵ. (13)
Therefore, Lazeyras et al. (2016) used the halo mass function measured in separate universe simulations with different large-scale densities δ_ϵ to measure the bias parameters through a finite differences approach. While our qualitative experiment from Fig. 10 follows this in spirit, it is difficult to do the same measurement here, since the addition of the quadratic potential term in equation (12) breaks the periodic boundary conditions and makes it difficult to measure the mass function reliably over a large domain. Therefore, we instead adopt an approach to infer the bias from the unperturbed δ_ϵ = 0 case. Paranjape et al. (2013) show that the Lagrangian bias parameter can be measured by considering the (smoothed) linear over-density δ_i at the Lagrangian locations of biased tracers:
b_1 = (1 / (N σ^2)) Σ_i δ_i, (14)
where the sum goes over N different tracers (e.g. all haloes in a given mass bin) and where σ^2 = ⟨δ^2⟩ is the variance of the (smoothed) linear density field. Since this measurement gives meaningful results only on reasonably large scales, we smooth the Lagrangian density field with a Gaussian kernel of width σ_r = 6 h^-1 Mpc. We measure the smoothed linear density δ_i at the Lagrangian centre of mass of each halo patch and then measure the bias by evaluating equation (14) for true and predicted halo patches respectively. These two seem consistent, showing that the model has correctly learned the bias relation that is captured inside the training set. However, this (L = 50 h^-1 Mpc) relation is not consistent with the well-measured relation from larger-scale simulations, indicated as a black solid line adopted from Lazeyras et al. (2016). This is because very massive haloes M ≫ 10^14 h^-1 M_⊙ do not form in simulations of such a small volume, but they are important for obtaining the correct bias of smaller-mass haloes, since wherever a large halo forms, no smaller halo can form. Our network has never seen such large scales, so it is questionable whether it has any chance of capturing the large-scale bias correctly. However, it might be that what it has learned in the small-scale simulations transfers to larger scales. To test this, we evaluate the network on two larger boxes, L = 100 h^-1 Mpc and L = 200 h^-1 Mpc, shown as orange and green lines in Figure 11. These cases match the true bias relation better, but still show some significant deviation, e.g. at M ∼ 10^14 h^-1 M_⊙. Therefore, we conclude that the network generalizes only moderately well to larger scales and halo masses. Improved performance could possibly be achieved by extending the training set to larger simulations and by increasing the field of view of the network.
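The bias measurement described above can be sketched in numpy: smooth the linear density field with a Gaussian kernel in Fourier space, then average the smoothed over-density at the tracer positions and divide by the field variance. Function names and grid conventions are our own:

```python
import numpy as np

def gaussian_smooth(delta, box_size, sigma_r):
    """Smooth a periodic density grid with a Gaussian kernel of width sigma_r
    (same length units as box_size), applied in Fourier space."""
    n = delta.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx ** 2 + ky ** 2 + kz ** 2
    kernel = np.exp(-0.5 * k2 * sigma_r ** 2)
    return np.fft.ifftn(np.fft.fftn(delta) * kernel).real

def lagrangian_bias(delta_at_halos, sigma2):
    """Estimator of equation (14): b1 = (1 / (N sigma^2)) sum_i delta_i,
    with delta_i the smoothed over-density at each proto-halo centre and
    sigma2 the variance of the smoothed field."""
    return np.mean(delta_at_halos) / sigma2
```

In practice one would evaluate the smoothed field at the Lagrangian centre of mass of each halo patch in a given mass bin and repeat the estimate per bin to obtain b1(M).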

Response to large scale tidal fields
In a second experiment, we want to study the response of haloes to purely anisotropic changes of the initial conditions, by adding a large-scale tidal field. We therefore aim to emulate a modification similar to the ones considered in anisotropic separate universe simulations (Schmidt et al. 2018; Stücker et al. 2021b; Masaki et al. 2020; Akitsu et al. 2021). We modify the input potential through the term
φ*(q) = φ(q) + (1/2) (q - q_0)^T T (q - q_0), (15)
where T is a trace-free tidal tensor. (Fig. 12 caption: the different panels show the cases with λ_z = -0.5, 0 and 0.5, corresponding to a stretching tidal field, no tidal field, and a compressing tidal field in the vertical direction, respectively. A negative (stretching) tidal field delays infall and shrinks the proto-halo patches in the corresponding direction, whereas a positive (compressing) tidal field facilitates infall and extends the proto-halo patches.)
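The modification of equation (15) for a diagonal trace-free tensor can be applied to a potential grid as follows; splitting -λ_z equally between the x and y eigenvalues is an illustrative choice, not necessarily the configuration used in the experiments:

```python
import numpy as np

def add_tidal_field(phi, lam_z, box_size):
    """Add the quadratic term (1/2) (q - q0)^T T (q - q0) of equation (15)
    for T = diag(-lam_z/2, -lam_z/2, lam_z), which is trace-free, so the
    density field is left unchanged. q0 is taken at the box centre."""
    n = phi.shape[0]
    q = (np.arange(n) + 0.5) * box_size / n - box_size / 2.0  # q - q0
    qx, qy, qz = np.meshgrid(q, q, q, indexing="ij")
    quad = 0.5 * (-lam_z / 2 * qx ** 2 - lam_z / 2 * qy ** 2
                  + lam_z * qz ** 2)
    return phi + quad
```

Because T is trace-free, the Laplacian of the added term vanishes, so Poisson's equation still holds with the original density field.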
Since we are considering a trace-free tidal tensor, we do not need to include any modifications to the initial density field. The results of introducing the tidal field are presented in Fig. 12. In the upper panel, in which we have imposed a value of λ_z = -0.5, the regions of typical proto-haloes are slightly reduced in the z-direction and extended in the y-direction. Further, in some cases haloes merge additionally in the y-direction while separating in the z-direction. In the bottom panel, with λ_z = 0.5, we observe the opposite behaviour, with proto-halo shapes elongated in the z-direction and reduced in the y-direction. These observations are consistent with the naive expectation: a positive λ_z means a contracting tidal field in the z-direction, which facilitates infall in this direction, whereas a negative λ_z delays the infall. Therefore, proto-haloes appear extended in the direction where the tidal field has a contracting effect. This should not be confused with the response of the halo shapes in Eulerian space, which has the opposite behaviour, reducing the halo's extent in the direction where the tidal field is contracting (Stücker et al. 2021b). Therefore, a large-scale tidal field has the effect that the direction from which more material falls in is the direction along which the final halo is less extended.
However, by comparing Figures 10 and 12, we note that the effect of modifying the eigenvalues of the tidal tensor (while keeping the trace fixed) is much less significant than modifying its trace δ by a similar amount. Modifying δ leads to strong differences in the abundance and the masses of haloes, whereas the modifications to the tidal field strongly affect the shapes but have a much smaller effect on typical masses, if any at all.
Our investigation into the role of anisotropic features in the initial conditions complements the findings of Lucie-Smith et al. (2020). They find that anisotropic features of the initial conditions do not significantly enhance halo mass predictions when compared to predictions based on spherical averages. Therefore, they conclude that including anisotropic features would not significantly improve the mass predictions that can be obtained within excursion set frameworks. This observation is consistent with masses not changing significantly when applying a large-scale tidal field. However, we find that anisotropic features are in general important for the formation of structures, since they affect which particles become part of which halo.
Finally, we note that the response of the Lagrangian shape of haloes is particularly interesting in the context of tidal torque theory (White 1984). To predict the angular momentum of haloes, tidal torque theory requires knowledge of both the tidal tensor and the Lagrangian inertia tensor of haloes. Further, it has been argued that the misalignment of the tidal field and the Lagrangian inertia tensor is a key factor for predicting galaxy properties (Moon & Lee 2023). Our experiments show that modifications of the tidal tensor itself also trigger modifications of the Lagrangian shape. Precisely understanding this relation would be relevant to correctly predict halo spins from the initial conditions. Note that such responses are inherently absent in most density-based structure formation models (e.g. Press & Schechter 1974; Bond et al. 1991; Sheth & Tormen 2002), but could possibly be accounted for by recently proposed approaches based on the Lagrangian potential (Musso & Sheth 2021a, 2023b).

Response to changes in the variance of the density field

We compare the HMFs predicted by our model with the results of 10 different boxes, each one spanning L = 50 h^-1 Mpc, with σ_8 values set to 0.5802 (blue lines), 0.8288 (black lines), and 1.077 (red lines). The model's predictions reveal a discrepancy with the anticipated HMF behaviour beneath the threshold of ∼ 10^12.7 h^-1 M_⊙ for both σ_8 ≈ 0.5802 and σ_8 ≈ 1.077. This discrepancy is attributed to the model's training on datasets characterized by the specific σ_8 from Planck Collaboration et al. (2020). The model's ability to extrapolate to different variances remains limited. At higher masses, however, the network's predictions correspond more closely with the expected HMF. This partial alignment suggests that the network possesses some degree of generalization capability. Nonetheless, for reliable application across varying cosmologies, incorporating these scenarios into the training set is essential.

Discussion & Conclusions
We present a novel approach to understand and predict halo formation from the initial conditions employed in N-body simulations. Benchmark tests indicate that our model can predict Lagrangian FoF-halo regions efficiently, taking around 7 minutes on a GPU for a simulation with 256^3 particles in a volume of (50 h^-1 Mpc)^3. For those interested in leveraging or further enhancing our work, we have made our codes publicly available: https://github.com/daniellopezcano/instance_halos.
Our model consists of a semantic network that reliably recognizes regions in Lagrangian space where haloes form, and an instance segmentation network that identifies individual haloes from the semantic output. Our predictions accurately reproduce simulation results and outperform traditional analytical and semi-analytical techniques, as well as prior ML methods.
The foundation of our instance segmentation model is the Weinberger approach, first introduced by De Brabandere et al. (2017). This technique lets us develop a more general framework for identifying Lagrangian halo patches than previous attempts. Employing the Weinberger loss, we bypass some limitations of other instance segmentation methods, like the watershed technique employed by Bernardini et al. (2020). With our approach, we manage to predict the complicated Lagrangian shapes of haloes that form in N-body simulations. This is notably more difficult than predicting the spherical Peak-Patch haloes considered by Berger & Stein (2019).
Additionally, we quantify to what extent halo formation is undetermined by the resolved scales of the initial conditions, to establish an optimal performance limit for machine learning methods. We infer this limit by comparing two simulations which only differ in their initial conditions realization on scales beyond the resolution level. We find an agreement between our model predictions and the reference simulations similar to the agreement between the two "baseline" simulations. This shows that our model extracts the information encoded in the initial conditions close to optimally. We suggest that such reference experiments may also be used as a baseline in other ML studies to establish whether information is extracted optimally.
Upon evaluating our semantic model, we measure an accuracy of 0.864 and an F1-score of 0.838. Compared to the baseline simulations, which have an accuracy of 0.903 and an F1-score of 0.884, our model results stand remarkably close, demonstrating its capability to predict halo regions nearly matching the natural variability of N-body simulations.
We also assess our instance segmentation network using various metrics. As depicted in Fig. 8, our model closely aligns with the baseline across a broad mass range, outperforming previous methods like Lucie-Smith et al. (2020). We speculate that our approach benefits from the physical constraint that different particles belonging to the same halo are assigned the same halo mass. Moreover, the halo mass function (HMF) predictions in Fig. 9 closely match the ground truth values across three orders of magnitude. The visual representations in Fig. 7 reinforce our model's precision, faithfully replicating the positions and shapes of Lagrangian halo patches.
We have tested through experiments how the network reacts to systematic modifications of the initial conditions. We find that the network correctly captures the response to density perturbations at the finite box size provided in the training set. However, it struggles to generalize to larger box sizes and to cosmologies with different amplitudes of the density field σ_8. This could easily be improved by increasing the diversity of the training set.
Further, we have found that our network utilizes information from the potential field that is not encoded in the density field of any finite region. Modifications to a large-scale tidal field are consistent with the same linear density field, but do affect the potential landscape. Our network predicts that such tidal fields affect the Lagrangian shape of haloes in an anisotropic manner, consistent with the intuitive expectation of how a tidal field accelerates and decelerates the infall anisotropically.
We have demonstrated the robustness of our model in its current applications, and we believe it could find utility in several other scenarios, such as crafting emulated merger trees, aiding separate-universe style experiments (e.g. Lazeyras et al. 2016; Stücker et al. 2021b), and informing the development of analytical methods for halo formation (e.g. Musso & Sheth 2021b, 2023a). Other works, such as MUSCLE-UPS (Tosone et al. 2021), can also benefit from our semantic predictions alone by informing their algorithm about which particles will collapse into haloes.
Additionally, our model can be used to help understand the development of spin and intrinsic alignments in haloes and galaxies by establishing how tidal fields modify the Lagrangian shapes of haloes. This is a vital ingredient for predicting the spin of haloes through tidal torque theory (White 1984). We can also employ our model to predict changes in the Lagrangian regions of haloes in combination with the "splice" technique presented by Cadiou et al. (2021). We believe this approach can provide new insights into how modifications of the environment of haloes in the initial conditions affect their final properties. We encourage experts in these fields to use our open-source code as a basis for tackling and exploring these and other related problems.
The models we have presented in this paper can easily be extended to characterize other properties of haloes. One possible extension would be to add a further dimension to our instance network's output to predict final halo concentrations. In this extension, each particle would carry a concentration prediction whose average (over all particle members of the same halo) would be trained to minimize the mean square error with respect to the true halo concentration.
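Such a concentration head could be trained with a loss of the following form. This is a sketch of the proposed extension only; the function and variable names are hypothetical and not part of the published model:

```python
import numpy as np

def concentration_loss(pred_c, halo_id, true_c):
    """Mean squared error between halo-averaged per-particle concentration
    predictions and the true halo concentrations (illustrative sketch).
    pred_c:  per-particle concentration predictions
    halo_id: halo membership of each particle
    true_c:  dict mapping halo id -> true concentration"""
    err = 0.0
    ids = np.unique(halo_id)
    for h in ids:
        c_hat = pred_c[halo_id == h].mean()   # halo-averaged prediction
        err += (c_hat - true_c[h]) ** 2
    return err / len(ids)
```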
The findings presented in this work are promising, but some aspects of our models would benefit from further investigation. For instance, extending our methodology to halo properties beyond mass would be a logical next step. It would also be interesting to test our model's performance under a wider variety of simulation conditions, including variations in cosmology and redshift. An additional avenue of exploration might involve capturing intricate structural details, specifically the gap features in the predicted Lagrangian halo regions. Generative Adversarial Networks (GANs) have demonstrated potential for reproducing data patterns in the context of cosmological simulations (e.g. Rodríguez et al. 2018; Villaescusa-Navarro et al. 2021; Schaurecker et al. 2021; Robles et al. 2022; Nguyen et al. 2023; Zhang et al. 2023). Hence, a GAN-like approach might help recreate these gap features, further improving our model's ability to mimic the structures of haloes found in N-body simulations.
In conclusion, this study showcases the potential of machine learning for facilitating the study of halo formation processes in the context of cosmological N-body simulations. We provide a fast model that exploits the available information close to optimally. We hope our approach serves as a useful tool for researchers working with N-body simulations, opening avenues for future advancements.

where k = N_dens is a hyper-parameter of the clustering algorithm. We accelerate this step with the cKDTree from the scipy package in python (Virtanen et al. 2020).
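The cKDTree-accelerated density estimate could look as follows. This is a sketch assuming a standard k-nearest-neighbour estimator; the paper's exact normalization is not shown in this excerpt, and only the overall 1/r_k^d scaling matters for the clustering:

```python
import math
import numpy as np
from scipy.spatial import cKDTree

def knn_density(points, k=20):
    """kNN density estimate in the pseudo-space, accelerated with scipy's
    cKDTree. Density ~ k neighbours / volume of the ball enclosing them."""
    d = points.shape[1]
    tree = cKDTree(points)
    # distances to the k nearest neighbours (k+1 including the point itself)
    r, _ = tree.query(points, k=k + 1)
    r_k = r[:, -1]                       # distance to the k-th neighbour
    # volume of the d-dimensional ball of radius r_k
    vol = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / vol
```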
Then we determine groups as the descending manifolds of those maxima that exceed a persistence ratio threshold ρ_max/ρ_sad ≥ p_thresh between maximum and saddle-point. The descending manifold corresponds to the set of particles from whose locations following the local density gradient would lead to the same maximum (e.g. Sousbie 2011; Tierny et al. 2017). For this, we use a slightly modified version of the density segmentation algorithm used in subfind (Springel et al. 2001): We consider the particles from highest to lowest density. For each particle i we consider, among its N_ngb nearest particles, the sub-set of particles that have a higher density than ρ_i (this set may be empty). Among these we select the set B_i of the (up to) two closest particles. This set can have zero, one, or two particles.
- If the set B_i is empty, then there is a density maximum ρ_max = ρ_i and we start growing a new subgroup around it.
- If the set B_i contains a single particle, or two particles that belong to the same group, the particle i is attached to the corresponding group.
- If B_i contains two particles of different groups, then i is potentially a saddle-point. We check whether the group with the lower density maximum ρ_max has a sufficient persistence ρ_max/ρ_i ≥ p_thresh. If not, then we merge the two groups (and keep the denser maximum). Otherwise, we keep both groups and assign the particle to the group of the denser particle in B_i. (This step corresponds to following the local discrete density gradient.)

Note that unlike the subfind algorithm, we merge groups not at every saddle-point, but only if they are below the persistence threshold. Therefore, sufficiently persistent groups are grown beyond their saddle-point and ultimately correspond to the descending manifold of their maximum.
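The three rules above can be sketched as follows. This is an illustrative implementation, not the paper's code; it exploits the fact that cKDTree returns neighbours sorted by distance, so the first two denser neighbours are the closest ones forming B_i:

```python
import numpy as np
from scipy.spatial import cKDTree

def persistence_groups(pos, rho, n_ngb=15, p_thresh=4.2):
    """Persistence-based watershed grouping following the rules in the text."""
    tree = cKDTree(pos)
    _, ngb = tree.query(pos, k=min(n_ngb + 1, len(pos)))
    group = np.full(len(rho), -1, dtype=int)
    gmax = []                                  # peak density of each group
    for i in np.argsort(-rho):                 # densest particle first
        denser = [j for j in ngb[i][1:] if rho[j] > rho[i]]
        B = denser[:2]                         # up to two closest denser particles
        gs = {int(group[j]) for j in B}
        if not B:                              # local maximum: start a new group
            group[i] = len(gmax)
            gmax.append(rho[i])
        elif len(gs) == 1:                     # follow the discrete density gradient
            group[i] = gs.pop()
        else:                                  # potential saddle-point
            g_a, g_b = gs
            g_lo, g_hi = (g_a, g_b) if gmax[g_a] < gmax[g_b] else (g_b, g_a)
            if gmax[g_lo] / rho[i] < p_thresh:     # not persistent enough: merge
                group[group == g_lo] = g_hi
                group[i] = g_hi
            else:                              # keep both; attach to denser side
                group[i] = group[max(B, key=lambda j: rho[j])]
    return group
```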
The clustering algorithm has three hyper-parameters: N_dens, N_ngb, and p_thresh. We have performed a hyper-parameter optimization over these and found that N_dens = 20, N_ngb = 15 (quite close to the default parameters of the subfind algorithm, 20 and 10 respectively) and p_thresh = 4.2 give the best results, though our results are not very sensitive to moderate deviations from these values. We can understand the quantitative value of the persistence ratio threshold by considering that the relative scatter of our density estimate is small, so that a density contrast of p_thresh = 4.2 corresponds to a ≈ 6.5σ outlier of the log-density (see Appendix B). Therefore, the persistence ratio threshold p_thresh ensures that it is very unlikely that our algorithm mistakes a spurious overdensity in the pseudo-space for a group.
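The quoted significance can be reproduced under the assumption that the kNN density estimate carries Poisson noise with relative variance of order 1/N_dens (an approximation; the derivation in Appendix B may differ in detail):

```latex
\sigma_{\ln\rho} \simeq \frac{1}{\sqrt{N_{\mathrm{dens}}}}
                = \frac{1}{\sqrt{20}} \approx 0.224 ,
\qquad
\frac{\ln p_{\mathrm{thresh}}}{\sigma_{\ln\rho}}
  = \frac{\ln 4.2}{0.224} \approx 6.4 ,
```

consistent with the ≈ 6.5σ figure quoted in equation (B.3).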
where we set IoU_thresh = 0.5.

5. We summarize each connected component in the graph into a new label. After this operation, the new labels in grid 1 and grid 2 agree for most voxels, and we can choose that label as our final label. However, for a small fraction of voxels the labels still disagree, because the corresponding instances had too little overlap to be identified with each other. In this case, we assign to the corresponding voxel the label of the instance that contains the larger number of voxels in total.
We illustrate the different steps of this procedure in Fig. D.1.

Regarding the semantic segmentation network, we can merge the predictions corresponding to different crops independently since, in this case, we are truly working with a translation-invariant network. We employ the central 64³ voxels (analogous to 'Lattice1') of separate predictions and merge them together to generate the final full-box predictions of the semantic segmentation network.
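The graph-based merging of the two lattice predictions (connecting instances with IoU ≥ IoU_thresh and resolving disagreements by instance size) can be sketched as follows. Function names and the treatment of the background label 0 are assumptions; this is not the paper's code:

```python
import numpy as np

def merge_lattices(lab1, lab2, iou_thresh=0.5):
    """Merge two overlapping instance-label grids (0 = background) into one
    map via an IoU graph and union-find connected components."""
    # count voxel overlaps between every pair of instances across the grids
    pairs, counts = np.unique(
        np.stack([lab1.ravel(), lab2.ravel()]), axis=1, return_counts=True)
    size1 = dict(zip(*np.unique(lab1, return_counts=True)))
    size2 = dict(zip(*np.unique(lab2, return_counts=True)))

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(x, y):
        parent[find(x)] = find(y)

    for (a, b), n in zip(pairs.T, counts):
        if a == 0 or b == 0:
            continue
        if n / (size1[a] + size2[b] - n) >= iou_thresh:  # sufficient overlap
            union(('1', a), ('2', b))

    # final label: the shared component where both grids agree,
    # otherwise the label of the larger instance
    comp_id, next_id = {}, [1]
    def comp_label(node):
        r = find(node)
        if r not in comp_id:
            comp_id[r] = next_id[0]; next_id[0] += 1
        return comp_id[r]

    out = np.zeros(lab1.size, dtype=int)
    f1, f2 = lab1.ravel(), lab2.ravel()
    for v in range(lab1.size):
        a, b = f1[v], f2[v]
        if a == 0 and b == 0:
            continue
        if a and b and find(('1', a)) == find(('2', b)):
            out[v] = comp_label(('1', a))
        else:
            na = size1.get(a, 0) if a else 0
            nb = size2.get(b, 0) if b else 0
            out[v] = comp_label(('1', a)) if na >= nb else comp_label(('2', b))
    return out.reshape(lab1.shape)
```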

Appendix E: Comparison with ExSHalos
In this appendix, we explore how the results obtained with the ExSHalos code (Voivodic et al. 2019) compare against our semantic and instance predictions.
As mentioned in §3.1, ExSHalos is an explicit implementation of excursion set theory that identifies haloes in Lagrangian space by growing spheres around density peaks until the average density inside crosses a specified barrier for the first time. The barrier shape is motivated by ellipsoidal collapse (Sheth et al. 2001; de Simone et al. 2011), and we have fitted the three free parameters of the model to reproduce the mean halo mass function of our simulations.
As Fig. E.1 shows, the physical approach of the ExSHalos algorithm enables it to identify, with a reasonable degree of accuracy, the locations of proto-haloes in Lagrangian space, as well as their masses. However, the built-in assumption that proto-haloes are spherical gives only a crude approximation to the actual proto-halo shapes.
In Table 2 we quantify the differences between ExSHalos and friends-of-friends employing several semantic metrics.
In Fig. E.2 we present a violin plot analogous to Fig. 8. This plot shows a comparison between the ground truth halo masses (friends-of-friends) and the predicted masses from our model associated with the particles/voxels in our validation set (black violin plots in the main panel). We also include the comparison between the masses of ExSHalos haloes and of friends-of-friends haloes (purple violin plots). We have generated the violin plots of ExSHalos employing all our simulations (both training and validation) to achieve better statistics. Our model predictions achieve greater mass accuracy than ExSHalos throughout all mass bins considered here.
In the upper panel of Fig. E.2, we show the False Negative Rate (FNR) as solid lines against the ground truth halo mass, and the False Discovery Rate (FDR) as dashed lines against the predicted mass. This plot is analogous to the top plot in Fig. 8 (see §3.2 for details). We additionally include solid and dashed purple lines corresponding to the ExSHalos case. It is clear that ExSHalos yields higher FNR and FDR values than both the baseline case and our model predictions, indicating more semantically misclassified particles.

Fig. 1. Example of the prediction problem considered in this article. Top panel: Slice of the three-dimensional initial density field of an N-body simulation. Each voxel (represented here as a pixel) corresponds to a particle that can become part of a halo at later times. Bottom panel: Regions in the initial condition space (same slice as the top panel) that are part of different DM haloes at redshift z = 0. Pixels coloured in white do not belong to any halo. Pixels with different colours belong to different haloes.

Fig. 2. Example of a two-dimensional pseudo-space employed to separate different instances according to the Weinberger loss. Coloured points represent individual points mapped into the pseudo-space. The centres of the clusters are shown as coloured crosses. Coloured arrows depict the influence of the pull force term, which only affects points outside the δPull range of their corresponding cluster centre. Grey arrows show the influence of the push force, which manifests if two cluster centres are closer than the distance 2δPush.
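The pull and push terms illustrated above can be written down compactly. This is a sketch of a Weinberger-style discriminative loss under the stated hinge distances; the weights c_pull and c_push (listed in Table 1) default to 1 here, and the paper's exact formulation may include further terms:

```python
import numpy as np

def weinberger_loss(x, labels, delta_pull, delta_push, c_pull=1.0, c_push=1.0):
    """Pull/push loss on pseudo-space embeddings x with instance labels."""
    ids = np.unique(labels)
    centers = np.array([x[labels == i].mean(axis=0) for i in ids])
    # pull: penalize points farther than delta_pull from their cluster centre
    pull = 0.0
    for c, i in zip(centers, ids):
        d = np.linalg.norm(x[labels == i] - c, axis=1)
        pull += np.mean(np.clip(d - delta_pull, 0, None) ** 2)
    pull /= len(ids)
    # push: penalize centre pairs closer than 2 * delta_push
    push = 0.0
    if len(ids) > 1:
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = np.linalg.norm(centers[a] - centers[b])
                push += np.clip(2 * delta_push - d, 0, None) ** 2
        push /= len(ids) * (len(ids) - 1) / 2
    return c_pull * pull + c_push * push
```

Tight clusters with well-separated centres yield zero loss, which is exactly the configuration the clustering algorithm of Appendix B can then separate.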

Fig. 3. Slice of the Lagrangian halo regions of the two "baseline" simulations (left and right panels respectively). These simulations differ only in sub-resolution perturbations to the initial conditions, and their level of agreement sets a baseline for the desired accuracy of our models. The colours in both panels represent the mass of the halo associated with each particle for the different Lagrangian halo patches. Circled regions highlight Lagrangian patches whose associated mass changes significantly between the two simulations.

Fig. 4. Flowchart of the particular V-Net architecture we have implemented. The network can take as input multiple channels with dimensions of 144³ (top left green cube) and generates predictions for the central voxels with dimensions 128³ (top right red cube). The flowchart illustrates the encoder and decoder paths, along with other distinctive features of the network. Notably, the hidden layers and skip connections are represented by purple and yellow cubes, with their respective dimensions annotated at their centres. The down-sampling and up-sampling blocks are shown as brown and purple trapezoids; in their centres we indicate the number of filters employed for the convolution (or transposed convolution) operations.
The hyper-parameter β in equation (3) is determined by computing the ratio of negative samples to the total number of samples in the training data. The value of β measured in different training simulations lies in the interval [0.575, 0.5892]. There exists a slight predominance of voxels/particles that do not collapse into DM haloes with mass M_FoF ≳ 10¹¹ h⁻¹ M⊙ at z = 0 for the Planck Collaboration et al. (2020) cosmology. We fix the hyper-parameter β in equation (3) to the mean value β = 0.5815.
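Measuring β and using it in a class-balanced loss can be sketched as follows. Equation (3) is not reproduced in this excerpt, so the weighting convention below (β on the positive class, 1 − β on the negative class) is an assumption about its common form, not the paper's exact definition:

```python
import numpy as np

def measure_beta(y):
    """Fraction of negative (non-collapsing) samples in the binary labels y."""
    return 1.0 - np.mean(y)

def balanced_bce(p, y, beta=0.5815):
    """Class-balanced binary cross-entropy (assumed form of equation 3):
    the rarer positive class is up-weighted by beta = negative fraction."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -np.mean(beta * y * np.log(p) + (1 - beta) * (1 - y) * np.log(1 - p))
```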

Fig. 5. Slice through the predictions of our semantic segmentation network applied to a validation simulation. Left panel: Ground truth representation showing in red the voxels/particles belonging to a DM halo at z = 0 and in blue those particles that do not belong to a DM halo. Central panel: Probabilistic predictions of the semantic network with colour-coded probabilities for halo membership. Right panel: Pixel-level error map indicating true positive (green), true negative (blue), false negative (black), and false positive (red) regions resulting after applying a semantic threshold of 0.589 to our predicted map. The network effectively captures complex halo boundaries and exhibits high validation accuracy (acc = 0.86) and F1-score (F1 = 0.83).

Fig. 6. True Positive Rate as a function of the halo mass associated with the ground truth voxels. We present the results measured from the model predictions (solid bright green line) in comparison to the optimal target accuracy from the baseline simulations (solid black line). The vertical dotted line at 10¹² h⁻¹ M⊙ marks the point where the model predictions start to differ from the baseline results.

Fig. 7. Examples of the instance segmentation results obtained with our model. Left column: ground truth masses obtained using N-body simulations. Right column: predicted masses obtained using our instance segmentation pipeline. The model can predict the Lagrangian patches of haloes, although some small differences exist, e.g. regarding the connectivity of haloes.

Fig. 8. "Violin plot" visualizing the distribution of predicted halo masses (at a voxel level) for different ground-truth mass bins. The black violin plots show the results obtained with our instance segmentation model. Green violin plots show the agreement between the two baseline simulations, representing an optimal target accuracy. The blue violin plots in the main panel show the results presented in Lucie-Smith et al. (2020). The solid black line in the top panel shows the false negative rate (FNR) as a function of the ground truth halo mass. The dashed black line represents the fraction of predicted collapsed pixels that are not actually collapsed as a function of predicted halo mass (false discovery rate, FDR). The green lines in the top panel correspond to the analogous results obtained from the baseline simulations. The model predicts haloes accurately object-by-object for masses M ≳ 10¹² h⁻¹ M⊙.

Fig. 9. Halo mass function (HMF) computed using our N-body simulations reserved for validation (solid black line). The dashed black line represents the predicted HMF using the Lagrangian halo regions obtained with our instance segmentation pipeline. The solid blue line shows the HMF prediction from Ondaro-Mallea et al. (2022). The dashed blue line corresponds to the HMF obtained after evaluating our model on a simulation with 1024³ particles and V_box = (200 h⁻¹ Mpc)³.

Fig. 10. Response of proto-haloes to large-scale over-densities. The three panels show over-densities of δϵ = −0.5, 0, and 0.5 respectively. A larger large-scale density tends to increase the Lagrangian volume of haloes and leads to additional mergers in some cases.

Fig. 11. Linear Lagrangian bias parameter b1L of the haloes, measured for different boxsizes L, comparing simulation and model. The model agrees well with the simulation at the L = 50 h⁻¹ Mpc scale, but both are inconsistent with the true large-scale bias relation from Lazeyras et al. (2016) due to effects of the limited simulation volume. Evaluation on larger boxes moves the prediction closer to the known relation, but some deviation remains.

Fig. 12. Response of proto-halo regions to a large-scale tidal field. The different panels show the cases λz = −0.5, 0, and 0.5, corresponding to a stretching tidal field, no tidal field, and a compressing tidal field in the vertical direction respectively. A negative (stretching) tidal field delays infall and shrinks the proto-halo patches in the corresponding direction, whereas a positive (compressing) tidal field facilitates infall and extends the proto-halo patches.

4.3. Response to changes in the variance of the density field

We now study whether our model can generalize to scenarios different from the training set by investigating how it responds to variations in σ8, deviating 30% from the original Planck Collaboration et al. (2020) cosmology. We aim to discern whether the network, trained on a single variance setting, has gained enough insight into halo formation to anticipate outcomes for different values of the variance of the initial density field. These modifications only affect the initial conditions, which are fully visible to the network, so it could be possible that the network correctly extrapolates to these scenarios. In Fig. 13 we show how the HMF reacts to changes in σ8 in comparison to the measured mass functions from Ondaro-Mallea et al. (2022) (solid lines) as a benchmark. Our predictions for the HMF (dashed lines) are generated by taking the average re-

Fig. A.3. Results of our watershed segmentation approach presented in an analogous way to results from Fig. 7.
(B.2), so that at a fixed background density, a density contrast of p_thresh = 4.2 due to Poisson noise corresponds to Δ log ρ = log(p_thresh) ≈ 1.43 ≈ 6.5 σ_logρ (B.3).

Fig. D.1. The top panel, labelled 'Lattice1', shows the individual instances predicted in the first lattice arrangement. Each colour represents a distinct label assigned to a group of voxels within the 64³ central region of the sub-volumes. The middle panel, 'Lattice2', displays the second set of predictions using a lattice shifted by half the offset in each dimension. Here again, different colours represent unique instance labels. The bottom panel, 'Combined', presents the final merged full-box prediction. It is generated by synthesizing the labels from 'Lattice1' and 'Lattice2' using the graph-based method to connect overlapping instances. The resulting image shows larger, coherent structures, indicative of the correct performance of combining both lattices.
In Fig. E.1 we show a map-level comparison between the Lagrangian proto-haloes identified in one of our validation simulations with the friends-of-friends algorithm (left panel), and the ExSHalos haloes detected employing the code presented in Voivodic et al. (2019) (central panel). The ExSHalos regions in Lagrangian space are spherical by construction (see the middle panel of Fig. E.1).

Fig. D.1. Process of merging predictions from two overlapping lattice structures to produce a full-box instance segmentation map. 'Lattice1' (top) and 'Lattice2' (middle) represent predictions from the initial and shifted lattice grids, respectively, with unique colour-coded labels for instances. Black dashed lines indicate the lattice employed in each case, while thin dashed grey lines correspond to the lattice employed in the reciprocal scenario. 'Combined' (bottom) depicts the final synthesized full-box map, where instances have been merged based on their overlap, demonstrating the effectiveness of the methodology in generating contiguous and comprehensive halo segmentations from smaller, predicted sub-volumes.
Fig. E.1. Slices through the Lagrangian field of friends-of-friends proto-haloes, and the corresponding predictions using the ExSHalos algorithm. Left panel: ground truth masses obtained using N-body simulations (friends-of-friends proto-haloes). Central panel: predicted masses obtained using the ExSHalos algorithm. Right panel (analogous to the right panel of Fig. 5): semantic pixel-level error map between ExSHalos and friends-of-friends haloes indicating true positive (green), true negative (blue), false negative (black), and false positive (red) regions.

Fig. E.2. "Violin plot" visualizing the distribution of predicted halo masses (at a voxel level) for different ground-truth mass bins. The black violin plots show the results obtained with our instance segmentation model. Green violin plots show the agreement between the two baseline simulations, representing an optimal target accuracy. The purple violin plots in the main panel correspond to the comparison with the ExSHalos predictions. The solid black line in the top panel shows the false negative rate (FNR) as a function of the ground truth halo mass. The dashed black line represents the fraction of predicted collapsed pixels that are not collapsed as a function of predicted halo mass (false discovery rate, FDR). The green and purple lines in the top panel correspond to the analogous results obtained from the baseline simulations and ExSHalos respectively.

Table 1. Hyper-parameters employed in our instance segmentation pipeline: δPull, δPush, cPull, cPush.

Table 2. Performance metrics of our semantic segmentation model, along with the ExSHalos results, compared against the optimal target accuracy estimated from the baseline simulations. The table presents the True Positive Rate (TPR), True Negative Rate (TNR), Positive Predictive Value (PPV), and Negative Predictive Value (NPV).