Relating the structure of dark matter halos to their assembly and environment

We use the ELUCID N-Body simulation to study the relation of structural properties of dark matter halos to their assembly history and environment. We find that the complexity of halo assembly histories can be well described by a small number of principal components. Using decision trees built with the random ensemble method, we find that the correlation between halo concentration and assembly history is tight: more than 60% of the variance in halo concentration can be explained by assembly history alone. Halo shape and spin parameters are found to depend not only on assembly history, but also on environment and initial condition, and these dependencies are not orthogonal. The structures of more massive halos depend more strongly on higher-order components of the assembly history. The relations between halo intrinsic properties and environment are weak compared to their variances, with the anisotropy of local tidal fields having the strongest correlation with halo properties. Our method of dimension reduction and regression can help simplify the characterization of the halo population and clarify the degeneracy among halo properties.


INTRODUCTION
In the concordant Λ Cold Dark Matter (ΛCDM) cosmology, dark matter halos, the dense clumps formed through gravitational collapses of the initial density perturbations, are the basic building blocks of the cosmic web. The formation history of a halo depends not only on the properties of the local density field, but may also be affected by the environment within which it forms. Since galaxies are believed to form in the gravitational potential wells of dark matter halos, the halo population provides the link between the dark and luminous sectors of the Universe. Consequently, the understanding of the formation, structure and environment of dark matter halos and their relations to each other have long been The mass assembly histories of halos in general are complex. In the literature a common practice is to focus on the main-branch assembly histories, ignoring other branches (e.g., van den Bosch 2002;Wechsler et al. 2002;Zhao et al. 2003a,b). Attempts have been made to describe the histories of individual halos with simple parametric forms (Wechsler et al. 2002;van den Bosch 2002;Zhao et al. 2003a,b;Tasitsiomi et al. 2004;McBride et al. 2009;Correa et al. 2015;Zhao et al. 2009). These simple models are useful to provide some crude description of halo assembly histories, but are not meant to give a complete characterization. Because of this, a variety of formation times have also been defined to characterize different aspects of the assembly histories of dark matter halos (see e.g., Li et al. 2008, for a review). These formation times, usually degenerate among themselves, again only provide an incomplete set of information about the full assembly history.
The environment of a halo is also complex. To the lowest order, the average mass density around individual halos in a population can be used to characterize the distribution of the population relative to the underlying mass density field. In general, the spatial distribution of halos (halo environment) can depend on the intrinsic properties of halos. The mass dependence, often called the halo bias (Mo & White 1996), is a natural outcome of the formation of halos in a Gaussian density field (Sheth et al. 2001;Zentner 2007). Furthermore, correlations have also been found between halo bias and halo assembly history. Halos, particularly lowmass ones, that formed earlier tend to be more strongly clustered (Gao et al. 2005;Gao & White 2007;Li et al. 2008). This phenomenon is now referred to as the halo assembly bias, and various studies have been carried out to understand its origin (Wang et al. 2007;Sandvik et al. 2007;Zentner 2007;Dalal et al. 2008;Desjacques 2008;Wang et al. 2009;Lazeyras et al. 2017). In addition to assembly history, halo bias has also been analyzed in its dependencies on other halo properties, such as halo concentration (Wechsler et al. 2006;Jing et al. 2007), sub-structure occupation (Wechsler et al. 2006;Gao & White 2007), halo spin (Bett et al. 2007;Gao & White 2007;Hahn et al. 2007;Wang et al. 2011), and halo shape (Hahn et al. 2007;Faltenbacher & White 2010;Wang et al. 2011). These dependencies are collectively referred to as the 'secondary bias', and sometimes also as the 'assembly bias', presumably because these intrinsic properties may be related to halo formation.
Since halo properties are intrinsically correlated, it is necessary to investigate the joint distribution of different properties. Along this line, Lazeyras et al. (2017) investigated halo bias as a function of two halo properties.
They found that in all combinations of halo properties considered, halo bias can change with the second parameter when the first is fixed, and that the maximum of the halo bias occurs for halos with special combinations of the halo properties. For the environment, new parameters have also been introduced in addition to the local mass density. For example, Wang et al. (2011) used local tidal fields to represent halo environment, and Salcedo et al. (2018) used the closest distance to a neighbor halo more massive than the halo in question. However, it is still unclear which quantity is the driving force of halo bias, and which combination of bias sources can provide a more complete representation. Answering these questions requires more advanced statistical tools to measure the capability of models to fit the data and to identify hidden degeneracy between model parameters.
Some statistical tools are available for such investigations. For unsupervised learning tasks, the Principal Component Analysis (PCA) is a powerful tool to determine the main sources that contribute to the sample scatter, and decompose the sample scatter along principal axes. For example, Wong & Taylor (2012) used this technique to reduce the dimension of halo mass assembly history (MAH), while Cohn & van de Voort (2015) and Cohn (2018) applied the PCA to model star formation history of galaxies. For supervised learning tasks, the Ensemble of Decision Trees (EDT) or Random Forest (RF) is capable of both classification and regression. This method can also capture the nonlinear patterns, effectively reducing model complexity, and discovering the dominant factors in target variables. RF has been widely used recently, e.g. in identifying galaxy merger systems (de los Rios et al. 2016), in galaxy morphology classification (Sreejith et al. 2017;Dobrycheva et al. 2017), in predicting neutral hydrogen contents of galaxies (Rafieferantsoa et al. 2018), in determining structure formation in N-body simulations (Lucie-Smith et al. 2018), in measuring galaxy redshifts (Stivaktakis et al. 2018), in classifying star-forming versus quenched populations (Bluck et al. 2019), in identifying the best halo mass proxy in observation (Man et al. 2019), and in estimating star formation rate and stellar mass of galaxies (Bonjean et al. 2019).
In this paper, we use both the PCA and the RF regressor to investigate the dependence of halo structure properties on halo assembly history and environment. The paper is organized as follows. In §2 we describe the simulation and the halo quantities to be analyzed. In §3, we describe our methods of dimension reduction and regression. In §4 we demonstrate how to extract information from halo assembly history, using both the main branch and the full merger tree. In §5 we analyze the relation between halo structure properties and assembly history. We focus on the upper limit of the prediction capability provided by the assembly history and identify the most important factors affecting halo structure properties. We also investigate the dependence of halo structure and assembly history on halo environment and halo mass. We summarize our main results in §6.

The Simulation
The N-body simulation used here is the ELUCID simulation carried out by Wang et al. (2016) using L-GADGET code, a memory-optimized version of GADGET-2 (Springel 2005). The simulation uses 3072 3 dark matter particles, each with a mass 3.08 × 10 8 h −1 M , in a periodic cubic box of 500 co-moving h −1 Mpc on a side. The cosmology parameters used are those based on WMAP5 (Dunkley et al. 2009): a ΛCDM Universe with density parameters Ω K,0 = 0, Ω M,0 = 0.258, Ω B,0 = 0.044 and Ω Λ,0 = 0.742, a Hubble constant H 0 = 100 h km s −1 Mpc −1 with h = 0.72, and a Gaussian initial density field with power spectrum P (k) ∝ k n with n = 0.96 and an amplitude specified by σ 8 = 0.80. A total of 100 snapshots, uniformly spaced in log(1 + z) between z = 18.4 to z = 0, are taken and dumped.
Halos and sub-halos with more that 20 particles are identified by the friends-of-friends (FoF, see e.g. Davis et al. 1985) and SUBFIND (Springel 2005) algorithms. Halos and sub-halos among different snapshots are linked to build halo merger trees. 1 Halo virial radius R vir is related to halo mass, M halo , through where ρ is the mean density of the universe, and ∆ vir is an over-density obtained from spherical collapse model (Bryan & Norman 1998). The center of a halo is assumed to be the position of the most bound particle of the main sub-halo. Halo mass M halo is computed by summing over all the particles enclosed within R vir . The virial velocity, V vir , is defined as V vir = GM halo /R vir , where G is the gravitational constant. We use halos with masses ≥ 10 10 h −1 M directly from the simulation, and use Monte-Carlo based merger trees to extend the mass resolution of trees down to 10 9 h −1 M (see Chen et al. 2019). From the halo merger tree catalog constructed above, we form four samples according to halo mass: S 1 contains 2,000 halos with 10 11 ≤ M halo / h −1 M ≤ 10 11.2 ; S 2 contains 1,000 halos with 10 12 ≤ M halo / h −1 M ≤ 10 12.2 ; S 3 contains 500 halos with 10 13 ≤ M halo / h −1 M ≤ 10 13.2 ; and S 4 contains 500 halos with 10 14 ≤ M halo / h −1 M ≤ 10 14.5 . We also construct a complete sample, S c , which contains 10,000 halos with M halo / h −1 M ≥ 10 11 to represent the total halo population. All halos in samples S 1 , S 2 , S 3 , S 4 and S c are randomly selected from simulated halos at z = 0. When halos are needed to be divided into sub-samples according to some properties, the sample size of S c may be insufficient. In this case, we construct another larger sample by the following steps: starting from all simulated halos at z = 0 with M halo / h −1 M = 10 11 -10 14.6 , we divide them into mass bins of width of 0.3 dex. If the number of halos in a bin exceeds 10,000, we randomly choose 10,000 from them; otherwise all halos in this mass bin are kept. This gives a sample of 94,524 halos, which is referred to as sample S L . Due to the mass resolution of the ELUCID simulation, some properties of small halos cannot be derived reliably. Whenever these properties are used, we use another halo sample, S c , which contains all halos with M halo ≥ 5 × 10 11 h −1 M in sample S c .
Note that we use only a fraction of all halos in a given mass range available in the simulation to save computational time. We have made tests using larger samples to confirm that the samples are sufficiently large to obtain robust results.

Halo assembly history
Following the literature (e.g., van den Bosch 2002;Wechsler et al. 2002;Zhao et al. 2003a,b), we define the halo mass assembly history (MAH) of a halo as the main-branch mass M z as a function of redshift z in the halo merger tree rooted in that halo. Based on theoretical consideration of halo formation (e.g. Zhao et al. 2009), we use the following quantity as the mass variable: where σ(M 0 ) and σ(M z ) are the R.M.S. of linear density field at the mass scale M 0 at z = 0 and the one at M z at redshift z. Similarly, we use as the time variable, where δ c,0 = 1.686 is the critical over-density for spherical collapse, and D(z) is the linear growth factor at z. We use the transfer function given by Eisenstein & Hu (1998) and the linear growth factor D(z) from Carroll et al. (1992). These definitions for the mass and time variables are well motivated by the self-similar behavior of halo formation expected in the Press-Schechter formalism (e.g. Press & Schechter 1974;Mo et al. 2010). Thus, in our definition, the MAH of a halo is a vector s = (s(z 1 ), s(z 2 ), ..., s(z M )) T , with each of its elements being the main-branch mass at a snapshot in the merger tree. Such a high-dimension vector is obviously too complex to be useful in characterizing the formation of a halo. To overcome this problem, a common practice is to characterize the full MAH by a set of formation times (e.g., Li et al. 2008). In our analysis, we will use both formation times (see below §2.3) and Principal Components (PCs, see §4) to reduce the dimension of MAH.
The MAH introduced above only uses the main branch of a halo merger tree, and thus does not provide the full information about the formation of a halo. It is thus interesting to increase the information content by including the side branches (see e.g., Parkinson et al. 2008, for describing the Monte-Carlo trees). To this end, we define, for each halo at z = 0, a quantity that describes its progenitor mass distribution at redshift z: where M p is the halo mass of a progenitor halo, and F (M p , z) is the cumulative fractional mass of progenitors at redshift z, defined as the sum of the masses of all progenitors with masses below M p at z. So defined, s p is just the fractional halo mass that is contained in progenitors per dex of halo mass at given redshift, and it is a function of both M p and z. From a simulated halo merger tree, we calculate s p in 16 mass bins equally-spaced between log(M p /M halo ) = −3.0 and 0.2, and in 16 redshift bins from z = 7 to 0. We use s p,i,j to denote the values of s p for the ith mass bin at the jth snapshot. Thus, the Full Mass Assembly History (FMAH) of a halo is given by all s p,i,j . We use a vector, s p = (s p,1,1 , s p,2,1 , ..., s p,1,2 , s p,2,2 , ...), to denote the FMAH of a halo.

Halo formation times
Halo formation times are widely used in the literature to characterize the MAH of dark matter halos. Because of the diversity in MAH, different formation times can be defined to describe different aspects of the assembly. Here we summarize the halo formation times we will use in our analysis. Most of the definitions can be found in Li et al. (2008), but we also add some new definitions that have been used in the literature. Since the halo mass and redshift in different time steps are discrete, the mass-redshift relation is linearly interpolated within adjacent time steps.
• z mb,1/2 : the highest redshift at which the main branch has assembled half of its final mass M halo (e.g., van den Bosch 2002; Shi et al. 2018).
• z mb,core : the highest redshift at which the main branch has reached a fixed mass M h, core = 10 11.5 h −1 M . Halo at such a mass typically has the highest star formation efficiency Van Den Bosch et al. 2003), and therefore is capable to form a bright galaxy in it.
• z mb,0.04 : this is the highest redshift at which the main branch of a halo assembled 4% of its final mass. This formation time is found to be related to the concentration of the halo in a wide range of cosmological models (see Zhao et al. 2009).
• z mp,1/2 : simiar to z mb, 1/2 , but using the most massive progenitors (MMPs) in the entire tree rooted from a halo, instead of the main branch. This definition is used in Wang et al. (2011).
• z mp,core : similar to z mb, core , but using MMPs.
• z 0.02,1/2 : the highest redshift at which half of the halo mass has been assembled into its progenitors with masses ≥ 2% of the halo mass. This definition is used in Navarro et al. (1997) to study the correlation between halo concentration and formation history.
• z core,1/2 : the highest redshift at which half of the halo mass has been assembled into its progenitors with masses M h, core ≥ 10 11.5 h −1 M . This represents the time when the massive progenitors are capable to form large amounts of stars.
• z core,f : the highest redshift at which a fraction f = 1 2 (M halo /M h, core ) −γ of the halo mass has been assembled into its progenitors with masses M h, core ≥ 10 11.5 h −1 M , where γ = 0.32. This definition takes into account the dependence of star formation efficiency on halo mass Yang et al. (2003).
• z mb,vmax : the redshift at which the main branch has achieved its maximum virial velocity. This definition, therefore, reflects the formation of the gravitational potential well.
• z mp,vmax : the same as z mb, vmax , but using the maximum virial velocity of MMPs.
• z lmm : the last major merger time of a halo. Here the major merger is defined as a merger event in which the mass ratio r = m/M between the two merger parts is larger than 1/3. Major merger is violent event and may change the halo structure significantly.
• s tree,β : the tree entropy with entropy update efficiency β (see Obreschkow et al. 2019). Different from the formation parameters defined above, this parameter is not associated with any specific event of the halo assembly history, but describe the complexity of the whole tree. By construction, s tree,β is bound to the range [0, 1]. A close-to-zero s tree,β represents a continuous accretion history, while a close-to-one s tree,β describes a history that is given by the merger of two progenitors of equal mass. The parameter β ∈ [0, 1] controls the balance between the entropy inherited from the progenitors and that generated in recent merger events. We choose β = 0.1 so that the tree entropy can reflect its assembly history at high z.

Halo concentration
Dark matter halo profiles are usually modeled by a universal two-parameter form -the NFW profile (Navarro et al. 1997), where ρ crit = 3H 2 /(8πG) is the critical density of the Universe. This profile is specified by a dimensionless amplitude, δ, and a scale radius, r s . The scale radius is usually expressed in terms of the virial radius, R vir , through a concentration parameter, c ≡ R vir /r s . Since both ρ and ∆ vir in equation (1) are known for a given cosmology, the profile of a halo can be specified by the parameter pair (M halo , c), where the halo mass M halo is defined in 2.1. We obtain the concentration parameter of a halo through the following steps (see Bhattacharya et al. 2013): • We divide the volume centered on a halo into N r = 20 radial bins equally spaced between 0 and R vir (see 2.1 for definitions of halo center and radius), and calculate the mass within each bin i, M i , using the number of particles in the bin.
• We compute the mass expected from the NFW profile in this bin M i,NFW (c), assuming a concentration parameter c.
• We define an objective function, χ 2 (c), to be minimized as • We minimize the objective function to find the best-fit concentration parameter c fit = argmin c χ 2 (c).
In what follows, we will drop the subscript 'fit' and use c to denote the concentration parameter obtained this way. As tested by Bhattacharya et al. (2013), changing the radius range and binning scheme in the fitting only introduces a negligible difference in c. According to our tests, our results presented in the following sections are also insensitive to such changes.

Halo axis ratio
We model the mass distribution in a dark matter halo with an ellipsoid and use its axis ratio to characterize its shape. Our modeling consists of the following steps (see MacCiò et al. 2007): • We start from a FOF halo and calculate the spatial position dr i relative to the center of mass for each particle linked to the halo. The inertia tensor M of the halo is given by the dyadic of dr i by summing over all the N p particles in that halo: where m i is the particle mass.
• We calculate the eigenvalues λ M,i and eigenvectors v M,i (i = 1, 2, 3) of M, and rank the eigenvalues in descending order: λ M,1 ≥ λ M,2 ≥ λ M,3 . So defined, v M,i gives the axis direction of the inertia ellipsoid, and a M,i = λ M,i gives the length of the corresponding axis.
• We define the axis ratio q axis of the halo as q axis = a M,2 + a M,3 2a M,1 .
So defined, q axis = 1 for a spherical halo, and close to 0 if the halo is very elongated along the major axis.

Halo spin
Following Bullock et al. (2001), we define the spin parameter of a halo as where M halo , R vir and V vir are, respectively, the halo mass, virial radius and virial velocity defined in §2.1. The total angular momentum j is defined as where m i is the particle mass, dr i and dv i are particle position and velocity vectors relative to the center of mass, respectively. The summation is over all the N p particles linked in the FOF halo.

Environmental quantities
We define a number of quantities to describe environment in which a halo resides, including the density contrast, as specified by the bias factor and the tidal tensor. These quantities are computed directly from the N-body simulation.
Firstly we follow Mo & White (1996) to define halo bias b as the ratio between halo-matter cross correlation function and the matter-matter auto-correlation function. For each simulated halo i, we calculate its bias b i by where ξ hm,i (R) is the overdensity centered at halo i at a radius R, and ξ mm (R) is the matter-matter autocorrelation function at the same radius in the simulation at the redshift in question. On linear scales, the bias factor is expected to depend only on halo mass, independent of R. We have checked the values of b on different scales and found that the bias factor is almost constant at R > 5 h −1 Mpc. In the following we compute both ξ hm,i and ξ mm for R between 5 and 15 h −1 Mpc at z = 0, and obtain the corresponding local linear bias b i for each halo using the above equation.
For the tidal field, we follow Ramakrishnan et al. (2019) and define the halo environment in the following steps: we first divide the simulation box into a sufficiently fine grid (N cell at each side), and compute the density field ρ(x) on the grid points by assigning the dark matter particles with the clouds-in-cell method (Hockney & Eastwood 1988). To describe the environment of a given halo, the density field is smoothed on some scale R sm with a Gaussian kernel. We then compute the potential field Φ(x) by solving the Poisson equation where δ(x) = ρ(x)/ρ − 1 is the over-density and ρ is the mean density of the Universe. Next we obtain the tidal field T (x) through Finally, we solve the eigenvalue problem of the tidal tensor at each grid point to find the eigenvalues λ 1 (x), λ 2 (x), λ 3 (x) (ranked in a descending order). With all these, we obtain four environmental quantities for each halo: the local bias factor, and the three eigenvalues of the tidal tensor: λ T ,R,i (i = 1, 2, 3). We follow the definition in Ramakrishnan et al. (2019) to define the local tidal anisotropy at each halo's position as where the halo-centric tidal shear q 2 is q 2 = 1 2 [(λ 1 − λ 2 ) 2 + (λ 2 − λ 3 ) 2 + (λ 1 − λ 3 ) 2 ], and δ = λ 1 + λ 2 + λ 3 is the local over-density. For our application, we follow Ramakrishnan et al. (2019) to choose R sm = 4R vir , although we have checked that our conclusion does not change significantly by using other values of R sm . We use N cell = 2560, which is sufficiently fine for computing α for the smallest halo in our sample.

Effect of un-relaxed halos
As demonstrated by MacCiò et al. (2007), halos that have undergone recent major mergers may be un-relaxed and have structural properties significantly different from virialized halos. Following MacCiò et al. (2007) we define two parameters to quantify the dynamic states of halos. The first is the χ 2 c parameter, defined as the minimized χ 2 in fitting the NFW profile, normalized by halo mass. The second is the offset parameter, x off , defined as the distance from the center of mass of the halo to the most bound particle, normalized by the virial radius. Figure 1 shows the distribution of the χ 2 c and x off for sub-samples with different last major merger time.
As one can see, if a halo has experienced a recent major merger (log δ c (z lmm ) < 0.3), it is likely that its profile deviates from the NFW profile, and that its most bound particle is far away from the center of mass of the halo.
Those un-relaxed halos have structural properties that are very different from virialized halos, and including them in our sample will significantly increase the variance of halo properties, thereby affecting the statistics derived from the sample. To reduce their effects, we exclude all halos that have undergone a major merger at z < 0.3. This will remove 9.6%, 9.6%, 14.2%, 21.8%, 10.5%, and 13.8% halos in samples S 1 , S 2 , S 3 , S 4 , S c , and S L , respectively. To make our conclusion even less dependent on relaxation processes, we compute λ s and q axis using only simulated particles that are within a radius 2.5 r s from the most bound particle. According to our test, our results are not sensitive to the radius chosen.
In Table 1, we summarize the halo assembly, structure, and environment properties used in our analysis. Because many parameters can be defined to characterize various aspects of the MAH and FMAH, it is not suitable to use all of them directly to study the relationship between halo assembly and other halo properties. In the following two sections, we describe a method that can be used to effectively reduce the dimension of the MAH and FMAH, which has some advantages over the parameterization using formation times.
where γ = 0.32. IV. the major merger is defined as a merger event in which the mass ratio between two merger parts is larger than 1/3.
Throughout this work, we use two statistical methods to analyze halo properties. The Principal Component Analysis (PCA) is used to reduce the complexity of quantities in high-dimension space, and the Ensemble of Decision Trees (EDT), also called the Random Forest (RF), is used to study correlations among different quantities. A brief description of the two methods is given below. 2 For a more detailed description, see PRML (Bishop 2006).
The programming interfaces and implementation can be found in scikit-learn.

Principal Component Analysis
PCA is an unsupervised, reduced linear-Gaussian dimension reduction method (Pearson 1901;Hotelling 1933). Consider a set of N vectors have maximal variance, where P is the projection operator. It can be shown that the problem to be solved is equivalent to solving the eigenvalue problem for the sample covariance matrix of X, defined as x i is the sample mean. If we rank the eigenvalues in a descending order, the first eigenvector of S, v 1 , is the direction along which the sample has the maximum projected variance, λ 1 = v T 1 Sv. This variance is exactly the first eigenvalue of S. Similarly, the ith eigenvector and the ith eigenvalue are, respectively, the direction and value of the ith largest variance. Consider the space V M spanned by the first M eigenvectors. The linearity of the transformation can be used to prove that the projected variance in M is to find a lower-dimension representation for it. The ith component of x is called the ith Principal Component (PC) of this data point in the sample.
Any dimension-reduction algorithm will inevitably lose information contained in the original data. In PCA, the proportional variance explained (PVE) by the ith PC, defined as is used to quantify the importance of the ith PC. The cumulative PVE, defined as can be used to quantify the performance of using the first M PCs in the dimension reduction. Typically, if the data in question are generated from an intrinsic process of lower dimension, the CPVE should quickly converge to 1 as M increases. We will see later that halo assembly histories have this property.
The inverse operation of the projection,x = Px , allows one to reconstruct the original vector, but with information loss. We use the following quantity, to quantify the reconstruction error of the data point x.

The Random Forest
The Random Forest regressor/classifier is a supervised, decision-tree based, highly-nonlinear, non-parametric model ensemble method in statistical learning (Breiman 2001). Here we first introduce the decision tree algorithm, and then discuss how the trees are assembled to a forest to make classification.
Given a set of observations each consisting of a vector of predictor features x i ∈ R M and a target random variable y i (continuous in the regression problem, discrete in the classification problem), a decision tree T can be trained to fit the data by minimizing some error functional E(T|D) (residual sum of square, RSE, in the regression; Gini index in the classification). The tree can then be used to predict the target value for a future observation, x.
A decision tree is built by sequentially bi-partitioning the feature space along some axes. In the first step, the direction to be bi-partitioned and the position of the partition plane are both chosen to minimize the error functional E(T|D). After a partition, the feature space is split into two sub-regions, each of which can be bipartitioned further to minimize the error function. This recursive process partitions the feature space into a treelike structure, and is continued until some stop criterion (e.g. the maximum tree-height, or the maximum number of tree nodes, or the minimum number of data points in individual leaf nodes, defined as the nodes at the top of the tree) is achieved.
Once a tree is built, the number of data points split according to each predictor feature can be computed, and is normalized to 1. This quantity, denoted as I T|D (x), where x is a predictor in the predictor vector x, gives the relative importance of each predictor x in explaining the target y. Such a tree model suffers from the over-fitting problem: the more complicated the tree is, the less training error it will have. However, the tree will eventually be dominated by noise as its height increases. Many methods have been proposed to control such over-fitting, e.g. the cross-validation, bootstrap and jackknife ensembles. Here we adopt the RF, an extension of the bootstrap ensemble, designed specifically to deal with over-fitting in tree-like algorithms. The building of a RF involves two levels of randomness. First, one uses n re bootstrap re-samplings, each is used to train a tree T i (i = 1, ..., n re ). Second, when training each of the n re trees, only a random subset (size n var < M) of all M predictor variables is used at each partition step. The n re trees are then combined, and the final prediction for a given feature x is then averaged among all trees (arithmetic average in regression, and majority voting in classification). Also, the importance of the predictor, I T|D (x), is averaged among trees, which we denote it I RF|D (x). This procedure has an advantage that the test-performance (variance explained, or R 2 , in regression problems; mis-classification rate, in classification problems) can be directly estimated with the out-of-bag (OOB) sample in the bootstrap process, and therefore a test sample is not necessary. In addition, RF does not suffer from the issue of scaling/arbitrary transformation of predictors, which exists in many non-linear approaches, such as K-nearest-neighbors (KNNs), support vector machines (SVMs), etc.
The RF method has some free parameters to be specified, and we choose them based on the following considerations: (1) The number of trees in the forest, n re , should be as large as possible to suppress over-fitting, but a larger n re is computationally more difficult. In our analysis, we choose n re = 100, which is sufficiently large for most applications of RF. (2) The number of predictors randomly chosen in partition, n var , also controls the suppression of over-fitting. We optimize the value of n var by maximizing the OOB score through grid searching.
(3) The tree termination criterion affects the complexity of each tree. We choose to control the number of data points in the leaf nodes, s leaf . This choice makes the tree self-adaptive when more data points are available, and also reduces issues associated with transformations of target variables and the choice of the error functional. The value of s leaf is also optimized by maximizing the OOB score, again through grid searching.

Principal Components of Halo Assembly Histories
Here we apply the PCA described in §3.1 to reduce the dimension of MAH. We will show the advantages of using such a method in two aspects. First, the principal components (PCs) capture the most variance among all linear, low-dimension representations, and are the best in reducing the reconstruction error. We will demonstrate its power in representing halo MAH. Second, PCs are tightly correlated with some other halo properties, indicating that PCs are not only mathematicallyoptimized approximations to the MAH, but also have clear physical meanings.
For a given sample with N halos, the MAHs are represented by S = (s 1 , s 2 , ..., s N ) T , where s i are the MAH of the ith halo, as defined in §2.2. In simulation, the MAH of a halo is not traced below the mass resolution of the simulation. Halos of different mass therefore may be traced down to different snapshots, resulting in different lengths of the vector s i . For a given sample, we choose a snapshot where 90% of the MAHs can be traced back to this time. MAHs extend beyond this snapshot are truncated, and MAHs that are terminated before this snapshot are padded with 0. In this way all the s i will have the same length M, suitable for PCA. After applying PCA to S, the s for each halo is transformed into a new coordinate system, producing a new vector that consists of a series of PCs, pc MAH = (PC MAH,1 , PC MAH,2 , ..., PC MAH,M ). Figure 2 shows the PVE and CPVE curves (see §3.1) for the PCs of 5 samples, S 1 , S 2 , S 3 , S 4 and S c , with different halos masses, defined in §2.1. It is clear that the first several PCs, among all cases, capture most of variance of the halo MAH (> 80% by using the first 3 PCs). This demonstrates that strong degeneracy exists in the MAH of individual halos, suggesting that the assembly history can be described by using only a few eigen-modes. As shown by the CPVE curve, using a single parameter is insufficient to describe the MAH. It can at most explain as much variance as PC 1 , which is 55% for the most massive halos (≈ 10 14 h −1 M ), and 70% for the smallest halos (≈ 10 11 h −1 M ). In principle, we can add more PCs, which typically leads to better capture of the subtle structures in the MAHs. Figure 2 also shows the distribution of the error e in each sample when MAHs are reconstructed with the first 3 PCs (see §3.1 for the reconstruction algorithm). The reconstruction error is almost all below 10%, demonstrating that the MAHs of halos can be represented well by a small number of PCs. Figure 3 shows some examples of the reconstructed MAHs using the first 3 PCs. Here ∆s(z) = s(z)−s(z) is shown for several halos in each halo-mass sample, where s(z) is the mean MAH in the sample. The overall shape of MAH is well captured by PCs, although the MAHs of individual halos are quite diverse. Some fine structures in the MAH, caused by violate changes in the formation history due to merger events, are missed in the reconstruction. They can, in principle, be captured by including more PCs.
For the FMAH, the strategy is the same as for the MAH. For a sample of N halos, their FMAHs are described by S p = (s p,1 , s p,2 , ..., s p,N ), where s p,i is the FMAH of the ith halo in the sample (see §2.2). The PCA can be applied to S p by transforming s p of indi- vidual halos into a series of PCs denoted by pc FMAH = (PC FMAH,1 , PC FMAH,2 , ... PC FMAH,M ). Since FMAH of a halo is a high-dimension vector, the number of FMAH PCs needed to achieve a given amount of variance reduction will be larger than that of the MAH PCs. Figure 4 shows the PVE and CPVE curves for five samples defined in §2.1. While the CPVE curves also converge when a certain number of FMAH PCs are used, the situation here is worse than the MAH, since about 20 FMAH PCs are needed to achieve a CPVE of ≥ 80%. This reflects the complexity of the full assembly history of individual halos.
Again, we use the first 20 FMAH PCs to recover the FMAH of each halo, and the distribution of the reconstruction error e is also shown in Figure 4. The error is larger than that in the case of MAH, reflecting again the complexity of the full mass assembly history. Over the entire halo mass range, the error is controlled to be below 0.5, and more FMAH PCs can be added to the reconstruction if a higher reconstruction precision is needed. The reconstructed FMAH of a few halos are also shown in Figure 5 in comparison with the simulated FMAH. The small difference between the two indicates that the PCs used can represent the true FMAH well.
The complexity of the full assembly history clearly makes it more difficult to describe without the use of PCA. There have been attempts to describe the FMAH with a limited number of parameters, such as the tree entropy s tree (see §2.3 and Obreschkow et al. 2019). We emphasize, however, that any low dimensional parametrization will lose significant amounts of information contained in the halo merger tree, and reduce the predicting power of using the correlation between FMAH and other halo properties.

PCs versus Halo Formation Times
The diversity of the assembly history showed in the Figure 3 indicates that no single parameter can provide a complete description of the MAH. To show this more clearly, Figure 6 plots the correlation between z mb,1/2 (see § 2.3 and Table 1) and the first two MAH PCs. A strong and nearly linear correlation between PC MAH,1 and z mb,1/2 is seen, which indicates that the variance in z mb,1/2 contributes significantly to the variance in the MAH of halos. However, large scatter still exists in the relation. This suggests that the direction of the largest scatter in s space is not fully aligned with the scatter caused by half-mass formation time, and that other parameters must also contribute to the distribution of halos in the s space. Moreover, when compared to PC MAH,1 , the relation of PC MAH,2 with z mb,1/2 is much weaker, and tends to be more nonlinear, as seen from the triangular shape of the contours in the right panel of Figure 6. Thus, the information contained in the second PC of MAH, which contributes 10 to 20% of the total variance as seen from Figure 2, is largely missed when half-mass formation time alone is used to predict the MAH.
One may use more than one formation time to represent a MAH. The problem here is that there is usually strong degeneracy between different formation times, and it is unclear which combination is the most suitable to use. To demonstrate this, Figure 7 shows how different formation times contribute to the diversity of the halo MAH. Here we use Sample S c , as described in the §2.1, to regress the first 3 MAH PCs on all the formation times defined above. As the output of the RF, the performance, R 2 , and the contribution of each formation time to each PC, are plotted. For all PCs, the contribution from z mb,0.04 is the most dominant, while z mb,1/2 is also significant. This demonstrates that there is large linear correlation between these two formation times. Interestingly, the last major merger redshift, z lmm , which contributes little to PC MAH,1 , is increasingly important as the PC order increases. For PC MAH,3 , the contribution from z lmm is comparable to those from the other two formation times. Since higher order PCs are capable of capturing more subtle patterns in the feature space, this behavior of z lmm means that major merger is an important factor in producing the fine structure in MAH.
The completeness and degeneracy problem can, in principle, be overcome if we use PCs to describe the assembly history, since PCs are linearly independent of each other. Typically, the use of a small number of loworder PCs can solve most of the problems associated with the regression of halo MAH. If more subtle infor-mation is needed, one can always add more PCs without introducing too much degeneracy into the problem.

RELATING HALO STRUCTURE TO ASSEMBLY HISTORY AND ENVIRONMENT
In this section we investigate how halo structural properties are related to halo assembly history and environment.

Halo Concentration versus Assembly
It is well known that halo concentration is correlated with halo mass assembly history (see e.g., Navarro et al. 1997;Jing 2000;MacCiò et al. 2008;Wechsler et al. 2002;Zhao et al. 2003aZhao et al. ,b, 2009). However, it is still unclear which single parameter best predicts the concentration, and whether combinations of multiple parameters can improve the prediction precision. The difficulties involved are twofold. First, the number of possible combinations is large, and the degeneracy between different predictors is usually hard to quantify. Second, traditional fitting algorithms are either too simple to capture the non-linear structure in the relation or too complicated to be free of over-fitting. This is the 'bias-variance' trade-off, a core problem in statistics that is hard to handle without advanced statistical tools. To tackle the problem, we use Random Forest to study the relation of halo concentration, c, with halo assembly and environmental properties. To this end, we use RF regressor to build the relation, c = RF(x), where x can either be halo assembly properties, or environmental properties, or a combination of them. Figure 8 shows the contours of halo distribution in the (c, assembly) plane, obtained from the complete sample, S c . Here PC MAH,1 and z mb,1/2 are used to represent halo assembly. It is clear that c has strong correlation with both PC MAH,1 and z mb,1/2 , but it is difficult to judge which correlation is stronger from the contour plots.
To quantify the results, we perform the RF regression, c = RF(x), either on the first 10 MAH PCs (x = {PC MAH,i } 10 i=1 ), or on formation times (x = formation times ), using sample S c (see §2.1). RF regression provides the overall performance, R 2 , and the contribution I(x) from each predictor variable x ∈ x, and the results are shown in Figure 9. As one can see, if a large number of predictors are used, the upper limit in prediction of c using assembly history is larger than 60%. This indicates that the concentration parameter c of a halo is largely determined by its assembly history. About 35% of the variance is still missing if one uses only the mean relation to predict c from assembly history. We have checked the result by including M halo and the PCs of the FMAH into the predictors and found that the regression performance changes little. This demonstrates that the missing information about c is not contained in halo mass or the side-branches of the merger tree. We have also tried to combine halo formation times and PCs of the MAH as predictors, and found that the overall performance again remains unchanged, indicating that the MAH PCs dominate the information content about halo concentration. Furthermore, as seen from the upper panel of Figure 9, the first MAH PC is by far the most important, accounting for about 70% of the total information provided by the MAH. As a comparison, the combination of the three formation times, z mb,1/2 , z mp,1/2 and z mb,0.04 , contains about the same amount of information as PC MAH,1 . Thus, if a single parameter is to be adopted as the predictor of c, PC MAH,1 is the preferred choice.

Axis Ratio and Spin versus Assembly
The upper panel of Figure 10 shows the relation between q axis and the following three assembly variables: the first PC of MAH, PC MAH,1 , the main-branch halfmass formation time, z mb,1/2 , and the last major merger time, z lmm . Compared to the concentration parameter, the shape parameter shows much weaker correlations with all the three assembly indicators, although there are noticeable trends. A late-formed halo (larger PC MAH,1 or smaller z mb,1/2 ) on average has a smaller axis ratio, i.e. more elongated. A halo with more recent major merger also tends to be more elongated.
We exploit the power of RF to regress q axis on a large set of variables x: q axis = RF(x). We investigate three cases: (1) x is the first 10 PCs of the MAH, (2) x is the first 20 PCs of the FMAH, and (3) x is all the formation time indicators (include the tree entropy) defined in §2.3. In addition, we also include halo mass as a predictor. The reason why we use such a large number of predictors is that we want to obtain an upper limit of the prediction power of halo assembly for q axis , and at the same time to select the most important predictors among them. The results of the regression performance, and the contributions from each predictor in explaining q axis , are shown in Figure 11 using red symbols. From this, we can draw the following conclusions: (1) If we only use halo mass and assembly history to predict the axis ratio, the upper limit of the explained variance is less than 20%. This is in contrast to the concentration parameter, for which the percentage is larger than 60%. We have checked that using the combination of MAH PCs (or FMAH PCs) and formation times makes a small improvement (less than 2%). (2) There is no single dominating predictor that outperforms the other ones. When PC MAH 's are used, the first two PCs are the most important predictors. However, a long tail in I(x) indicates that higher-order PCs are also not negligible. If PC FMAH 's are used, the long tail becomes lower. When formation times are used as predictors, almost all of them contribute significantly, and roughly equally, to the axis ratio. All these suggest that the axis ratio can be affected by many factors related to halo mass assembly, but the effects are all small.
We carry out a similar analysis for the spin parameter, λ s . The lower panel of Figure 10 shows the dependence of λ s on PC MAH,1 , z mb,1/2 , and z lmm . It is clear that none of these assembly parameters has a strong correlation with halo spin. There is a weak trend that halos that assembled late (large PC MAH,1 and small z mb,1/2 ) tend to spin faster, and that a recent major merger tends to increase the halo spin.
The blue symbols of Figure 11 shows the result of using RF to regress λ s on three sets of variables: the first 10 MAH PCs, the first 20 FMAH PCs, and all the formation times. The situation here is very similar to that for the axis ratio. First, the upper limit to the prediction performance is about 20%, and this upper limit can be achieved by all the three sets of predictor variables. Second, no single variable dominates the contribution, as indicated by the long and low tail in the I(x) plot. Third, in all cases the halo mass is not the most important factor, and it is less important than either PC MAH,1 , or PC FMAH,1 or z mb,1/2 .
The similarity in behavior between λ s and q axis suggests that these two quantities may have similar origins. As we will show below, λ s and q axis are strongly correlated between themselves, and both show strong correlation with the anisotropy of the local tidal field.

Environmental Effects on Halo Structure
In §5.1 and §5.2 we have seen that the first PC of the halo assembly history is a prominent indicator for c, less prominent but still the best for q axis and λ s . Therefore, we start by building a Random Forest regressor y = RF 1 (PC MAH,1 , M halo ), where y is one of the three structural quantities. The inclusion of M halo is motivated by the fact that halo mass is traditionally considered as one of the most important quantities distinguishing halos. We also calculate the performance R 2 1 as well as the contributions I 1 (x) of the regressor. Next, we add the two environmental quantities, the bias factor b and tidal anisotropy α T , to build a second regressor, y = RF 2 (PC MAH,1 , M halo , b, α T ), and obtain the corresponding R 2 2 and I 2 (x). With these two regressors, we can judge the importance of environment to halo structure y as follows: • If R 2 1 ≈ R 2 2 , then either the environmental affect on y is weak, or it is degenerate with halo mass and as- sembly history. We can also look at the contribution parameter I(x env ) for the environmental parameter x env . If I(x env ) ≈ 0 for all x env , then environment is indeed unimportant in affecting internal structure properties; otherwise environmental effects on y must be degenerate with the assembly history.
• If R 2 1 < R 2 2 , then environment must have effects on structure, although they may be partly degenerate with the assembly history. We can identify such degeneracy by comparing the contribution of environment, I(x env ), with that of other predictors. If I(x env ) is large but the increase in prediction power, R 2 2 − R 2 1 , is small, then the effect of environment on halo structure is likely a manifestation of the dependence on assembly history through the degeneracy between environment and assembly (assembly bias). In other words, the contribution of the environment factors is taken from that of halo assembly. Figure 12 shows the contribution, I(x), of these two regressors, for each of the three structure properties described above, with the values of R 2 indicated. To estimate the uncertainty in the results, we generate 100 random subsamples, each consisting of half of the halos randomly selected from the original sample S c without replacement. The errors of R 2 and I(x) are then estimated as the standard deviations among these subsamples. The reason why we do not use the standard bootstrap technique is that the sampling with replacement will lead to artificially large R 2 in the Random Forest regressor, because the repeated data points may both appear in the training set that shape the decision trees  Figure 12. Contributions I from predictors to halo structural quantities: concentration c (red), shape parameter qaxis (blue) or the spin parameter log λs (purple). The Solid lines with circular makers show the results when using the first MAH PC PCMAH,1, halo mass M halo , tidal anisotropy αT and bias factor b as predictors. The Dashed lines with triangular makers show the results when using only PCMAH,1 and M halo without environment properties. The overall performances R 2 2 (with environment) and R 2 1 (without environment) are also indicated in the panel. In each case, the error bars are calculated with 100 half-size resamplings without replacement. Halos are taken from the complete sample S c with M halo ≥ 5 × 10 11 h −1 M . and the out-of-bag set that are used to test the performance, making the test performance over-estimated.
In the case of c, the inclusion of environment only increase R 2 from 0.56 to 0.57. The importance factor I(x env ) is below 0.1. These two results indicate that environment has little impact on halo concentration c, that the concentration c is well determined by the first PC of MAH, and that the weak dependence of c on environment is mainly through the dependence of halo assembly on environment (assembly bias). This is consistent with the result obtained with the Gaussian Process Regression by Han et al. (2019) who found that the dependence of halo bias on halo concentration is mainly through the dependence of the bias on formation time.
For q axis , the R 2 is doubled, from 0.05 to 0.11, when environment factors are included. From the importance curve, I(x), one can also see that these environment factors take away about half of the contribution from the halo mass and PC MAH,1 . These results together suggest that environment can affect halo shape significantly, at least as important as halo mass and the main PCs of the MAH.
For the spin parameter, the value of R 2 increases from 0.16 to 0.20 when environment factors are included. The y R 2 q axis 0.28 ±0.02 log s 0.30 ±0.02 Figure 13. Contributions I from predictors to halo structural quantities yα: shape parameter qaxis (blue) or the spin parameter log λs (purple). This is the same as Figure 12, except that we add the initial condition of each structure property at redshift z = 3 and a β-structure parameter y β into the predictor variables. For shape parameter, the y β is the spin, for spin parameter, the y β is the shape. The overall performances R 2 is also indicated in the panel. In each case, the error bars are calculated with 100 half-size resamplings without replacement. Halos are taken from the complete sample S c with M halo ≥ 5 × 10 11 h −1 M . increase, 0.04, is about 20%, suggesting that these environment factors do have a sizable effect on halo spin. The contribution, I(b) + I(α T ) = 35%, is larger than the 20% they contribute to R 2 , suggesting that some of the contribution is actually taken from halo assembly history and halo mass. Thus, when interpreting the dependence of halo spin on environment, one should remember that part of it may actually come from the degeneracy with halo assembly history.
Another difference between c and the other two structural parameters is in their values of R 2 . Both q axis and log λ s have fairly small R 2 , much smaller than 50%. This indicates that the major contributors to these two structural parameters are not found. In general, factors that can affect halo structural properties can be classified into three categories: the initial condition of the halo, the intrinsic properties of halos (e.g., M halo , P C MAH,1 ), and halo environment (e.g., α T and b). The interaction of halos with their environment depends not only on the environment, but also on halo properties. For example, since a halo is coupled to the local tidal torque only through its quadrupole, we expect that the spin and shape of a halo should be correlated. Motivated by this, we add the spin parameter into the predictors of shape, and vise versa. We denote the target structure parameter as y α and the added structure parameter y β . In addition, we also consider the 'initial condition' of a halo by tracing all the halo particles back to redshift z = 3 and using these particles to calculate the corresponding shape and parameters. This quantity for y α , denoted by y α,z=3 , is also added into its set of predictors. Figure 13 shows the result when this set of variables are used to predict halo structures. Among all the predictor variables, y β is the most important for y α , indicating that q axis and log λ s are strongly correlated. For λ s , its initial value also matters, ranked as the second important predictor. The performance, R 2 , for both λ s and q axis is now boosted significantly, to ∼ 0.3. However, even in this case R 2 is till less than 50%, meaning that the causes of the main parts of the variances in both λ s and q axis are still to be identified. This also indicate that, unlike the concentration parameter c, which is determined largely by halo assembly, λ s and q axis may depend on the details of both formation and environment.

Dependence on Halo Mass
The regressors for halo structures presented in §5.1, §5.2 and §5.3 are all built using the complete halo sample. The model training processes and performance measurements are thus dominated by low-mass halos which are more abundant. However, halos with different masses may have different properties. For example, the halo assembly bias, as reflected by the correlation between halo formation time and the bias factor, is found to be significant only for low-mas halos (e.g., Gao et al. 2005;Gao & White 2007;Li et al. 2008). It is, therefore, interesting to see how the structural properties of massive halos depend on assembly and environment, and how environmental and assembly effects are related to each other.
Here we quantify such halo mass dependence by building RF regressors for sub-samples of halo mass, using the large sample S L . For halos with given mass, M halo , and for each of the three structural properties, y = c, q axis and log λ s , we build three forest regressors y = RF(x|M halo ) with different sets of predictor variables x: 1. x = (PC MAH,1 , PC MAH,2 , PC MAH,3 ), the first three PCs of halo MAH; 2. x = (PC MAH,1 , α T , b), the first PC of halo MAH and environmental parameters; 3. x = (PC MAH,1 , PC MAH,2 , PC MAH,3 , α T , b), the first three PCs of MAH plus the environmental parameters.
The reason for including PC MAH,2 and PC MAH,3 is that the MAHs of massive halos may be more complicated than low-mass ones, and high order PCs may be needed to capture the more subtle components in their MAHs. Figure 14 shows the contribution curves and performances of regressors y = RF(x|M halo ) for different halo structural properties, y, using different predictor variables, x, and for halos of different masses. In the case where x is the first three MAH PCs (panels in the left column), the PC MAH,1 is always the most important for the structures of low-mass halos. However, as halo mass increases, the importance of PC MAH,1 decreases and eventually is taken over by higher order PCs (the second or the third). Massive halos may have more diverse accretion histories (see, e.g., Obreschkow et al. 2019, who found the tree entropy increases with halo mass), so their structural properties may also be more complex. By using higher order PCs, the complex formation history can be captured so that a better prediction to halo structures can be achieved.
As discussed in §5.3, for the total halo population, environmental effects are different for the three halo structural properties. Similar conclusions can be reached for halos of given mass. The middle and right columns in Figure 14 show results for regressors that combine the MAH and environment as predictors. For the halo concentration, the PCs of MAH always outperform environmental quantities, although there is some increase in I(x) for environment quantities at the high mass end. Compared with the regressor with only MAH PCs (upper left panel), the performance R 2 including the two environmental properties (upper right panel) only increases slightly. This indicates again that the environmental effect on halo concentration is mainly through the dependence of halo MAH on environment.
The environmental effect on the shape parameter, q axis , is totally different. As seen from the contribution curves, the environment is as important as MAH, and including environment variables increases R 2 significantly. This implies that environmental effect on the halo shape parameter is important, and that the effect is not degenerate with the MAH. The environmental contribution to spin parameter, λ s , is intermediate, stronger than that to the concentration parameter but less than that to the shape parameter. The value of R 2 after including environment variables increases, but less significantly than in the case of the shape parameter. This suggests that the environmental effect does contribute to halo spin, but part of the contribution is taken from the assembly.

Halo Assembly Bias
As a final demonstration of the application of the Random Forest regressor, we show how assembly parameters correlate with environment for halos of given mass, which is usually referred to as halo assembly bias. The purpose here is to identify the best correlated pair of variables (X, Y ) at given halo mass, where X is an assembly property and Y is an environmental property. The method to measure the correlation strength is straightforward. First, we bin halos in a sample S L into sub-samples according to halo mass. Within each sub-sample, we build a Random Forest regressor for each pair of variables (X, Y ). The value of R 2 (X, Y |M halo ) then provides a measurement of the correlation strength between the two quantities. Figure 15 shows the results for different cases, where the environmental quantity Y is either the bias factor b or the tidal anisotropy pa-rameter α T , and X is either PC MAH,1 or log δ c (z mb,1/2 ). The dependence of log δ c (z mb,1/2 ) on b is only present for halos with M halo < 10 13 h −1 M , and totally absent for more massive ones. The values of R 2 between b and the two assembly properties are both smaller than 2%, indicating the correlation between b and assembly history is week. These results are consistent with those obtained previously (e.g., Gao et al. 2005;Gao & White 2007;Li et al. 2008): the assembly bias is significant only for low-mass halos, and one has to average over a large number of halos to detect the weak trend. For comparison, we also build regressors between structure properties (c, λ s and q axis ) and b for halos  Figure 15. Random Forest regression performance R 2 (X, Y |M halo ) of pairs of variables (X, Y ) for halos with given mass, where X is an environmental quantity (left panel: halo bias factor b, right panel: tidal anisotropy parameter αT ), and Y is either a structural quantity or an assembly history quantity, as represented with different colors. The error bars of R 2 are obtained by 10 half-size resamplings without replacement. Halos are taken from the large sample SL. Linear regression slope k(X, Y |M halo ) of pairs of variables (X, Y ) for halos with given mass, where X is an environmental quantity and Y is either a structural quantity or an assembly history quantity. Each panel is for a pair of (X, Y ). The error bars are standard errors estimated from linear regression model. The error bars of R 2 are obtained by 10 half-size resamplings without replacement. Halos are taken from the large sample SL. of given mass. The results are also shown in the Figure 15. Clearly, the dependence of these properties on b is also weak. The results are consistent with those obtained previously by Mao et al. (2018), who found that b is better correlated with c and λ s than with assembly properties for massive halos.
In a recent paper, Ramakrishnan et al. (2019) showed that the tidal anisotropy parameter, α T , is a good variable that correlates well with many structure properties. We present the correlation between α T and other halo quantities in the right panel of Figure 15. It is clear that α T shows better correlation with halo intrinsic properties than the bias factor, as indicated by the larger values of R 2 . In particular, the correlation between the assembly properties (δ c (z mb,1/2 ) and P C MAH,1 ) and α T are significant, except at the very massive end. In §5.3, we demonstrate that a part of the contribution from environment to structural properties is produced by the degeneracy between environment and halo assembly history. The strong relation between α T and assembly history is a proof of this degeneracy. Note that both q axis shows a strong correlation with α T for massive halos, indicating that local tidal field plays an important role in determining the shape of the halo.
As mentioned above, the small values of R 2 between assembly and environment imply that assembly bias is a weak relation compared to the variance. To detect such a bias thus needs the average over a large num-ber of halos. To obtain the mean trend in the bias relation, we can build linear regression models between pairs of variables, and use the slopes of the regression, k, to represent the mean trends between the two variables in question. Figure 16 shows the slopes of halo properties with b and α T for halos with different masses. Given any two variables, a larger absolute value of k means more significant linear correlation of the two variables for halos of a given mass. The relations between halo environment and different halo properties show different trends with halo mass. For example, the slope k(log δ c (z mb,1/2 ), b|M halo ) is significant only at the low-mass end, while k(PC MAH,1 , b|M halo ) is significant at the low-mass end and becomes less significant at the high-mass end, suggesting that halo assembly bias is important for low-mass halos and less important for massive ones. In contrast, the slopes k(c, b|M halo ), k(q axis , b|M halo ) and k(log λ s , b|M halo ) are both more significant at the high-mass end. As discussed in Wang et al. (2011), this halo mass dependence is a result of the competition between the environmental effect and self-gravity of the halo. For example, a higher-density environment not only provides more material for halos to accrete, but also is the place of stronger tidal field that tend to suppress halo accretion. Depending on the halo mass, one of the two effects will dominate. For example, for lower-mass halos where self-gravity is weaker, the local tidal field may be effective to make them more elongated and spin faster, and in preventing them from accreting new mass so as to make their mass assembly earlier and concentration higher. This is indeed what can be seen from the lower panels of Figure 16. We note, however, that the physical units of different curves in these panels are not the same, so that these curves cannot be compared directly with one another. Note also that a significant value of k does not necessarily imply a significant R 2 in Figure 15, and vice versa, because a linear model cannot represent faithfully data of non-linear relations. In addition, the value of R 2 depends not only on the value of k, but also on the absolute value of the variance in the regressed sample.

SUMMARY AND DISCUSSION
In this paper, we have used the ELUCID N-body simulation to relate the structure properties of dark matter halos to their assembly histories and environment. Our analysis is based mainly on the PCA and the RF regressor. Our main results and their implications can be summarized as follows.
(1) The spaces occupied by halo MAH and FMAH are high-dimensional, and PCA provides a simple and yet effective tool to reduce their dimensions. By using the first three PCs of the MAH, more than 80% of the variance can be explained. Even just the first PC can capture ∼ 55% of the variance for massive halos with M halo ∼ 10 14 M and ∼ 70% for low-mass halos with M halo ∼ 10 11 M . The reconstructed MAH with the first 3 PCs from the reduced space is a good approximation to the original one, with relative error < 10%. The main structure of MAH of a single halo is always recovered, while more subtle structures can also be included by using higher order PCs. The FMAH is more difficult to represent, requiring a much larger number of PCs (> 20) to achieve 80% CPVE.
(2) The PCs of the MAH have clear physical meanings. The lower order PCs, such as the first PC, are tightly related to halo formation times, such as z mb,1/2 , z mb,0.04 , etc., while higher-order PCs reflect more about events, such as the last major merger. PCs are preferred to halo formation times, as they are linearly independent, avoiding the degeneracy that exits among different formation times.
(3) Halo concentration is tightly related with the MAH. Using RF, we are able to show that the halo assembly history can explain more than 60% of the variance in the concentration parameter, among which 70% is contributed by the first PC of the MAH. This suggests that the density profile of a dark matter halo is largely determined by its MAH, consistent with the finding of Lu et al. (2006) that the density profiles of individual halos can be modeled accurately from their MAH. For shape parameter, q axis , and the spin parameter, λ s , halo assembly can only explain about 15% and 20% of the variance, respectively. The corresponding importance curves, I(x), are also quite flat, indicating that there is no single property of assembly that dominates these two structure properties.
(4) Environmental effects on c, q axis , and λ s , as described by their correlations with b and α T , behave differently. Environment has almost no effect on c, but has a significant effect on both q axis and λ s . For the axis ratio, q axis , environment is as important as the first PC of the MAH, and there is little degeneracy between environment and MAH. For λ s , on the other hand, including environment increases prediction performance by about 20%, but the effect is degenerated with that of MAH.
(5) The correlation between q axis and λ s is strong, indicating that these two quantities have some common origin. The initial condition of λ s is also important. However, putting all factors we have considered together, the values of R 2 are still smaller than 0.5 for both q axis and λ s , indicating that these two halo quantities may be affected by many subtle factors and thus difficult to model.
(6) The structural properties depend mainly on the first PC of the MAH for low-mass halos, but have significant dependence on higher order PCs for high-mass halos. The conclusions in (4) for the overall population still holds for halos of a given mass: environment has almost no effect on c once MAH is included; environment is more important for q axis and λ s , although its effect on λ s is partly degenerated with that of MAH.
(7) The tidal anisotropy, α T , has stronger correlation with halo assembly and structure than b does. α T is strongly correlated with halo assembly history for all halos except the most massive ones, and also shows strong correlation with the halo shape parameter, indicating that the local tidal field plays an important role in shaping halos.
For dimension reduction tasks such as those for halo MAH, one may also use nonlinear algorithms, such as the Locally Linear Embedding (LLE, Roweis 2000), and the Spectral Embedding (Belkin & Niyogi 2003)). The degrees of freedom of these manifold-learning techniques are higher than those of the linear algorithms, so that they are not stable for noise data. Indeed we have tried implementing LLE to halo MAH, but found that it does not outperform the PCA in terms of both the reconstruction error and the correlation with halo structural properties.
The correlation of structure properties with halo assembly and environment revealed by the RF analysis provides insights about the halo population in the cos-mic density field. Since galaxies form and evolve in dark matter halos, understanding the formation, structure and environment of dark matter halos is a crucial step in establishing the link between galaxies and halos. Empirical approaches, such as halo abundance matching (Mo et al. 1999;Vale & Ostriker 2004;Guo et al. 2010), clustering matching (Guo et al. 2016), age matching (Hearin & Watson 2013), conditional colormagnitude diagram (Xu et al. 2018), halo occupation distribution (Jing et al. 1998;Berlind & Weinberg 2002), conditional luminosity function , and those based on SFR (Lu et al. 2014;Moster et al. 2018;Behroozi et al. 2019), all use halo properties to make predictions for the galaxy population. A key question in all these models is which halos quantities should be used as the predictors of galaxies. Using too little information about the halo population will make the model too simple to capture the real effects of halos on galaxy formation; using too many halo properties may be unnecessary because of the degeneracy between them. Our results, therefore, provide a foundation to build galaxy formation models such as those listed above.