StarGO: A New Method to Identify the Galactic Origins of Halo Stars

Zhen Yuan; Jiang Chang; Projjwal Banerjee; Jiaxin Han; Xi Kang; M. C. Smith

doi:10.3847/1538-4357/aacd0d

1. Introduction

According to the hierarchical structure formation theory, the Milky Way (MW) grows to its current size through frequent accretion and merger events. During these violent processes, satellite galaxies are tidally disrupted and the disk gets heated. The stellar halo is built up at the same time as a repository of stars from various origins (Bullock & Johnston 2005; Font et al. 2006; De Lucia & Helmi 2008; Cooper et al. 2010; Deason et al. 2016). Due to the approximately dissipationless nature of stars, substructures in the stellar halo, such as the stellar debris from a satellite or groups of stars that originated from the Galactic disk, may retain the memory of their origins. The identification of these substructures is the first step toward unraveling the evolution history of the MW. A number of such substructures have been found in the last decade, adding strong support to the scenario of hierarchical structure formation. One famous example is the discovery of the Sagittarius dwarf galaxy (Ibata et al. 1994, 1995; Yanny et al. 2000) and its tidal streams (Mateo et al. 1996; Ibata et al. 2001; Majewski et al. 2003), both of which are located in the stellar halo.

The current hierarchical structure formation paradigm implies that the inner stellar halo contains a wealth of information about the early assembly history of the MW, as the stars there tend to be accreted a long time ago. However, identifying substructures in configuration space is not easy due to the fact that the accreted substructures in the inner stellar halo have undergone mixing for a long time. Furthermore, this region is also populated by star groups likely originated from the disk, e.g., Monoceros (Bergemann et al. 2018; Laporte et al. 2018), which makes substructure identification from satellites difficult.

On the other hand, identifying substructures in phase space can be relatively easier, given the additional information from the velocities. In particular, the separations of stars in the integral-of-motion space are much better conserved, and thus provide a natural coordinate system for identifying the original grouping of stars (Helmi et al. 1999, 2006; Klement et al. 2009; Smith et al. 2009; Smith 2016). Previous searches of substructures in the inner stellar halo were hindered by the limited astrometric data. With the advent of Gaia, we now have 5D astrometric data for an unprecedented number of stars (1332 million) from Data Release 2 (Lindegren et al. 2018). Cross matching the Tycho-Gaia Astrometric Solution (TGAS; Gaia Collaboration et al. 2016) with other surveys such as RAVE (Casey et al. 2017), LAMOST (Luo et al. 2015), 2MASS (Skrutskie et al. 2006), and APOGEE (Anders et al. 2014) has produced a stellar library within ∼20 kpc and has already led to several discoveries. For example, Koposov et al. (2017) discovered faint MW satellites by searching for overdensities in configuration space, Helmi et al. (2017) found a substructure of halo stars in integral-of-motion space, and Myeong et al. (2017) identified the existence of a comoving star cluster with additional information of metallicity distribution.

Despite the increasing discovery of identified substructures, their number is far below the predictions from ΛCDM cosmology. According to Aquarius simulations (Springel et al. 2008), hundreds of streams are expected in the Gaia sky (Gómez et al. 2013; Maffione et al. 2015). To systematically identify these substructures using the vast amount of astrometric data, several methods have been developed, including distance based methods such as friends-of-friends (FoF; Helmi & de Zeeuw 2000), and density-based algorithms such as Mean Shift (Gómez & Helmi 2010) and Watershed (Helmi et al. 2017).

In this paper, we propose a new method of substructure identification that requires essentially no free parameters. Our method utilizes a machine-learning technique called self-organizing map (SOM) (Kohonen 2001), that maps out the topology of a high-dimensional data set onto a 2D map. Using the fact that stars with the same origin have similar orbital energy and angular momentum, we first apply SOM to the n-Dimensional (n-D) input space constructed from these quantities and visualize the results in a 2D neural map. Then, we develop a new adaptive group identification scheme based on the resulting 2D map. As SOM retains the topological structure of the data set, it can manifest the fine structures in the data. This makes our method particularly well suited in identifying groups that are weakly clustered. We test the performance of our method by applying it to a mock catalog generated from our simulation of a MW-like system with realistic observational uncertainties.

The paper is organized as follows. Section 2 describes the details of model setup and simulations for generating the mock catalog; Section 3 discusses the details of SOM and group identification of StarGO; Section 4 presents the results of StarGO applied to the mock catalog and comparisons with FoF; and Section 5 provides our conclusion.

2. Simulations

2.1. Overview

A popular approach of building a stellar halo is using zoom-in ΛCDM cosmological simulation of a MW-like system accompanied by post-process of star tagging using semi-analytic models such as galform (De Lucia & Helmi 2008; Cooper et al. 2010; Tumlinson 2010). Although such models can retain the realistic accretion history of a MW-like system, they do not include any stellar components, such as the disk and the bulge, which are crucial for modeling the kinematics of stars in the inner stellar halo.

Another approach involves using an analytic potential for the dark matter and stellar components of the MW, while using N-body models for the dark matter component of satellites (Helmi & de Zeeuw 2000). The main advantage of this approach is that it can achieve a higher resolution than pure N-body simulations. Instead of using the actual merger history, an artificial one is used to build up a synthetic stellar halo. Gómez et al. (2010) used a similar approach but with a time dependent MW potential. In such studies, stars in satellites are assigned to particles in post-process and in situ stars are added as background contamination. In a different approach of resimulation of a MW-like system, Jean-Baptiste et al. (2017) modeled the MW and satellites as collections of both dark matter and star particles to get a live N-body simulation.

In this study, we use a static analytic potential to model the dark matter halo of the MW, while particles are used to model stars in the disk and the bulge. For satellites, particles are used for both dark matter and stars. Although our method cannot account for dynamical friction since we use an analytic potential, it is expected to have a minor effect for the mass range of satellites chosen in our study (Amorisco 2017; Frings et al. 2017). This makes our method computationally much less expensive compared to studies which use live models (e.g., Jean-Baptiste et al. 2017).

Similar to the second approach mentioned above, we use an artificial merger history. Zoom-in cosmological simulations of MW-like systems suggest that the main contributors to the inner stellar halo are a few satellites that infall at early times (De Lucia & Helmi 2008; Cooper et al. 2010). In this study, we build up an synthetic stellar halo through several minor mergers following the same ideas as Boylan-Kolchin et al. (2008) and Amorisco (2017), where the dynamical friction from dark matter halo can be neglected. On the other hand, energy and angular momentum exchange between the MW center (the disk and the bulge) and satellites is important since all the satellites have pericenter distances within 20 kpc (see Table 2). This however, is automatically taken into account as they are modeled using particles. Here, we model the accretion of nine satellites which infall in the first 0–4 Gyr (see Table 2), where the total simulation time is 12 Gyr.

We use progenitors with infall mass ratios relative to the virial mass of MW of ≲1:25. In this mass range, even under dynamical friction from the halo, the initial orbital imprints of satellites can be retained in some of the stars (Amorisco 2017). The infall radial velocity, V_r, and tangential velocity, V_θ, are set to be some fraction of the virial velocity of the MW, ${V}_{\mathrm{vir},\mathrm{MW}}$ . The adopted value of the fractions are taken from the preferred range from cosmological simulations by Jiang et al. (2014). We calculate the circularity j, defined as the ratio of the total angular momentum to the angular momentum for a circular orbit of the same energy. The satellites in our model have j = 0.2–0.8, which are consistent with values found in studies of accreted satellites from cosmological simulations (Jiang et al. 2014), as well as from other models of the stellar halo (Amorisco 2017).

2.2. Our Model

The dark matter halo of the MW is described as a Navarro-Frenk-White (NFW) potential (Navarro et al. 1997) with virial mass M_vir = 10¹²M_⊙ and concentration parameter c = 7 (Macciò et al. 2008). The stellar part of the MW consists of a Hernquist bulge (Hernquist 1990) and an exponential disk with the total stellar mass of M_* = 0.03M_vir and particle mass of 3 × 10⁴M_⊙. The bulge component contributes ∼20% of M_*, with a density profile ρ_b given by

$\begin{eqnarray}&&{\rho }_{{\rm{b}}}(r)=\displaystyle \frac{{M}_{{\rm{b}}}}{2\pi }\displaystyle \frac{a}{r{(r+a)}^{3}},\end{eqnarray} \tag{ 1 }$

where M_b, a, and r are the mass, scale length, and radius, respectively. The rest ∼80% of the total stellar mass is in the stellar disk of mass M_d with the density profile parameterized by scale length R_s and scale height z₀ = 0.2R_s, given by

$\begin{eqnarray}&&{\rho }_{{\rm{d}}}(r,z)=\displaystyle \frac{{M}_{{\rm{d}}}}{4\pi {z}_{0}{R}_{{\rm{s}}}^{2}}{{\rm{sech}} }^{2}(\displaystyle \frac{z}{2{z}_{0}})\exp (-\displaystyle \frac{R}{{R}_{{\rm{s}}}}),\end{eqnarray} \tag{ 2 }$

where z and R are the height and radius, respectively.

For pre-cooked progenitors of satellites, we use a similar recipe to Chang et al. (2013). Each satellite has a dark matter halo with a NFW profile, and an exponential disk with a total stellar mass of 1% of the virial mass of the satellite. We use particle mass of 5 × 10⁴M_⊙ and 5 × 10³M_⊙ for the dark matter halo and stellar disk, respectively. We design three types of progenitors: H-m, M-m, and L-m, with total masses of 4 × 10¹⁰M_⊙, 10¹⁰M_⊙ and 2.5 × 10⁹M_⊙, respectively (see Table 1).

Table 1. Properties of the MW and Satellite Galaxies

Type	M_vir(M_⊙)	V_vir(km s⁻¹)	c	N_h	M_*(M_⊙)	R_s(kpc/h)	N_*	m_p,*(M_⊙)
MW	10¹²	162.6	7.0	⋯	3 × 10¹⁰	3.01	6 × 10⁵	5 × 10⁴
H-m	4 × 10¹⁰	55.6	9.3	7.84 × 10⁵	8 × 10⁸	0.98	1.6 × 10⁵	5 × 10³
M-m	10¹⁰	35.0	10.6	1.96 × 10⁵	2 × 10⁸	0.57	4 × 10⁴	5 × 10³
L-m	2.5 × 10⁹	22.1	12.2	4.9 × 10⁴	5 × 10⁷	0.34	10⁴	5 × 10³

Note. For each type of galaxy, M_vir and V_vir are the viral mass and viral velocity (for satellites these values refer to their initial values), respectively. Note that R_vir has the same value as V_vir, with the unit of kpc/h. For the dark matter halo part, c is the concentration parameter of the NFW model, N_h is the number of particles in the dark matter halo, and m_p,h = 5 × 10⁴M_⊙ is the mass for each dark matter particle. For the stellar part, M_* is the stellar mass, R_s is the disk scale length, N_* is the number of star particles, and m_p,* is the mass for each star particle.

Download table as: ASCII Typeset image

Using H-m, M-m, and L-m, we create nine satellites (sat1–9) with distinct infall scenarios by changing their initial velocities and positions (see Table 2). All of the satellites are released at distance R_ini, with an initial radial velocity V_r, and tangential velocity V_θ (Benson 2005; Jiang et al. 2014). The inclination of the satellite orbit with respect to the disk is characterized by i, which is the angle between the initial orbital direction of the satellite and the initial Galactic z direction. The values of i are chosen from a range of 0°–60°. We choose R_ini to be less than the virial radius for all the satellites to emulate the early infall scenarios when the MW is smaller.

Table 2. Details of the Mock Catalog

	Type	T_inf(Gyr)	${R}_{\mathrm{ini}}/{R}_{\mathrm{vir},\mathrm{MW}}$	$({V}_{{\rm{r}}},{V}_{\theta })/{V}_{\mathrm{vir},\mathrm{MW}}$	i(°)	r_peri(kpc)	j = J/J_circ	N_sat
sat1	H-m	0	0.4	(0.64, −0.64)	60	22.8	0.71	3,609
sat2	M-m	0	0.4	(0.96, 0.32)	45	9.1	0.32	1,970
sat3	L-m	0	0.4	(1.0, 0.2)	30	5.4	0.20	59
sat4	H-m	2	0.6	(0.96, 0.32)	45	13.2	0.32	5,598
sat5	M-m	2	0.6	(0.96, −0.32)	0	13.2	0.32	1,961
sat6	L-m	2	0.6	(0.72, 0.72)	45	38.7	0.71	9
sat7	H-m	4	0.8	(0.6, −0.2)	0	10.6	0.32	4,985
sat8	M-m	4	0.8	(0.36, 0.36)	30	22.2	0.71	38
sat9	L-m	4	0.8	(0.48, 0.16)	15	8.4	0.32	341

Note. The initial condition of each satellite before it falls into the MW at T_inf are represented by the distance R_ini, velocity (V_r, V_θ), and inclination angle i. The pericenter distance and circularity of the orbit for each satellite are denoted as r_peri and j, respectively. The number of stars from each satellite N_sat is shown in the last column.

Download table as: ASCII Typeset image

2.3. Catalog

To generate data sample from our simulation, we select all the stars within 10 kpc relative to the Sun, which is taken to be at the galactocentric distance R_⊙ = 8 kpc. This is done to get data with the coverage similar to the Gaia sky. We address the sampling bias introduced by the solar position by generating eight samples, where the solar position is rotated by 45° each time in the x–y plane. We find the samples to be quantitatively similar, i.e., number of stars in each population is similar. In this paper, we use only one of the samples for demonstration, the details of which are given in Table 2.

The disk and the bulge population accounts for more than 95% in our sample, most of which are distributed close to the Galactic disk plane. As we are mainly interested in the halo stars, we exclude all the stars from these two components to generate the mock catalouge. In real observational samples, this can also be done relatively easily by using a cut for metallicity or distance from the mid-plane. The total number of stars in each satellite is denoted by N_sat (see Table 2). Most of the stars in the extracted heliocentric volume come from six of the nine satellites (sat1, sat2, sat4, sat5, sat7, and sat9).

K-giants are ideal tracers of the stellar halo because they are bright, and distances can be reliably estimated from photometry. Therefore, we construct a mock catalog of K-giant stars, adopting errors similar to what will be obtained for a sample of such stars from LAMOST and Gaia DR2. Specifically, we assign the distance error to be 20% according to the distance estimation method using photometry (Xue et al. 2014). We adopt the error of radial velocity to be 7 km s⁻¹, which is consistent with LAMOST (Schönrich & Aumer 2017). The proper motion error at G ∼ 16–17 mag is 0.1–0.2 mas yr⁻¹ from Gaia DR2 (Lindegren et al. 2018). We take 0.15 mas yr⁻¹ for all the stars in the catalog. This is a reasonable number for K-giants, as even at 10 kpc, they have G ∼ 16–17 mag. The corresponding error in tangential velocity for a star at 10 kpc is about 7 km s⁻¹, i.e., comparable to our assumed error in radial velocity from low-resolution spectroscopy.

Before we discuss the details of our group identification method and the results, it is useful to first visualize the input data. We plot stars in the Aitoff projection of angular momentum space shown in Figure 1. The left panel shows stars from different satellites, whereas the right panel shows the corresponding density map. We can clearly see that stars of the same origin tend to cluster, although stars of different origins often overlap each other. This can also be seen from Figures 2(a)–(c), where stars are plotted in (L_x, L_y), (L_x, L_z), and (L_z, E). The clustering can be clearly seen in the corresponding density maps (see the right panel of Figures 1 and 2(d)–(f)). This clustering information is the basis of all substructure identification methods. Below, we discuss the details of our method applied to different input spaces discussed above.

**Figure 1.** Aitoff projection of angular momentum of stars from mock catalog in the galactic reference frame. Left panel: stars from different satellites are assigned unique colors as follows: sat1 (salmon), sat2 (olive), sat3 (magenta), sat4 (purple), sat5 (pink), sat6 (blue), sat7 (cyan), sat8(gold), and sat9(light blue). Right panel: the corresponding density map of all the satellite stars in the same projected space as in the left panel.
Download figure:
Standard image High-resolution image

**Figure 2.** Top row: projection of all the stars from the mock catalog in (a) L_x–L_y, (b) L_x–L_z, and (c) L_z–E plane, using the same color codes of Figure 1. Bottom row (b)–(d): the corresponding density map of all the stars in the same plane as the top row.
Download figure:
Standard image High-resolution image

3. Method

Here, we develop a substructure identification method based on SOM, which belongs to unsupervised learning domain. We apply SOM to the mock data catalog, followed by a novel group identification procedure that utilizes the visualization of SOM output in the 2D neural map. Below, we give a brief introduction to SOM followed by details of our novel group identification procedure.

3.1. Self-organizing Map

The aim of SOM is to map a n-D input data to a 2D neural map, while retaining the topological structures within the data at the same time. The starting point is the construction of a 2D map of m × m neurons, each of which are located at a different grid point (a, b). Neurons have initially randomized weight vectors ${\boldsymbol{w}}$ with the same dimension and range as the n-D input vectors ${\boldsymbol{v}}$ . Given an input vector ${{\boldsymbol{v}}}^{i}$ , for the ith star from the data catalog, we first find the neuron that has the closest weight vector to ${{\boldsymbol{v}}}^{i}$ by finding the neuron with the minimum value of $| {\boldsymbol{w}}-{{\boldsymbol{v}}}^{i}|$ . Such a neuron is defined to be the best matching unit (BMU). The learning process involves improving weight vectors ${\boldsymbol{w}}$ of all the neurons toward ${{\boldsymbol{v}}}^{i}$ according to their distances ${d}_{a,b}^{i}$ to the BMU located at (a_i, b_i) on the 2D neural map, where ${d}_{a,b}^{i}$ is defined as

$\begin{eqnarray}&&{d}_{a,b}^{i}=\sqrt{{(a-{a}_{i})}^{2}+{(b-{b}_{i})}^{2}}.\end{eqnarray} \tag{ 3 }$

The change in the weight vector ${{\boldsymbol{dw}}}_{a,b}^{i}$ due to the ith star is given by

$\begin{eqnarray}&&{{\boldsymbol{dw}}}_{a,b}^{i}={\alpha }_{q}\exp \left(-\displaystyle \frac{{{d}_{a,b}^{i}}^{2}}{{\sigma }_{q}^{2}}\right)({{\boldsymbol{v}}}^{i}-{{\boldsymbol{w}}}_{a,b}^{i}),\end{eqnarray} \tag{ 4 }$

where α_q characterizes the learning rate, and σ_q controls the neighboring influence of neurons around the BMU for the qth iteration. We can see from the above equation that the change in the weight of a neuron is sensitive to its distance from the BMU. The learning process is performed using ${{\boldsymbol{v}}}^{i}$ for each star in the data set. The learning process is then repeated for a total number of N_iter iterations. For the qth iteration, the corresponding α_q and σ_q is

$\begin{eqnarray}{\alpha }_{q} & = & {\alpha }_{0}(1-q/{N}_{\mathrm{iter}}),\\ {\sigma }_{q} & = & {\sigma }_{0}(1-q/{N}_{\mathrm{iter}}),\end{eqnarray} \tag{ 5 }$

where we use typical fiducial values of α₀ = 0.3 and σ₀ = max(m, n)/2 similar to Geach (2012). We find that the results are independent of α₀ and σ₀ for reasonable variation around this fiducial value. As the number of iteration q increases, $| {\boldsymbol{dw}}|$ is reduced due to the decrease in α_q and σ_q, leading to refinement of the learning process. The learning process is considered to be complete when $| {\boldsymbol{dw}}| /| {\boldsymbol{w}}| \to$ 0.

3.2. Group Identification

We feed the mock catalog in a given input space to a 80 × 80 neural network. After the application of SOM, the clustering structures can be visualized by using the differences between the weight vectors of neighboring neurons, which are the elements of the u-matrix defined as

$\begin{eqnarray}&&{u}_{a,b}={\mathrm{log}}_{10}(| {{\boldsymbol{w}}}_{a\pm 1,b}-{{\boldsymbol{w}}}_{a,b}| +| {{\boldsymbol{w}}}_{a,b\pm 1}-{{\boldsymbol{w}}}_{a,b}| ).\end{eqnarray} \tag{ 6 }$

Neurons mapped to stars in highly clustered regions tend to have similar angular momentum and orbital energy, leading to lower values of u and vice versa. Figure 3(a) shows the distribution of u while Figure 3(b) shows the resulting 2D 80 × 80 neural map, where u is represented by the gray color scale. Each star can be mapped to its BMU in the 2D map. We note that every neuron can be associated with more than one star or no stars at all.

**Figure 3.** Results from the first iteration of workflow of StarGO to the mock catalog in the (E, L_x, L_y, L_z) space. (a) The blue shaded histogram shows the distribution of u, where the cyan and salmon dashed lines denote u_m and u_thr values, respectively. (b) 2D neural map resulting from SOM, where the u value between adjacent neurons is represented by the gray color scale. (c) The same map as (b) which shows the selected neurons in different colors according to step 2–3 of the workflow. The neurons with u < u_m are marked by colored pixels; each pixel associated to a group is colored cyan, while groups with more than 30 stars have their pixels enclosed with a blue box. (d) The neurons with u < u_thr are marked by salmon pixels according to step 4 of the workflow.
Download figure:
Standard image High-resolution image

Based on the 2D map generated by SOM, we develop a novel algorithm for identification of substructures. Below, we list the steps adopted for group identification starting with the application of SOM followed by our algorithm.

1.
We first normalize each component of the input vector. For each dimension of the input vector, we calculate the 95% confidence interval for the whole sample. We then divide each component of the input vector for each star by this normalizing factor. We then apply SOM to all the stars in the normalized input space and calculate u-matrix from the resulting map. We associate each star to its BMU in the 2D map.
2.
We group neighboring neurons with u < u_m to form candidate seed groups (marked by cyan pixels in Figure 3(c)), where u_m is the median value of u for all neurons in the map.
3.
A candidate seed group resulting from Step 2 is considered as a bona fide seed group (enclosed with blue boxes in Figure 3(c)) if more than 30 stars are associated to neurons in the group.
4.
If there are more than one bona fide seed group, we maximize the size of each group by increasing the value of u_thr (shown as salmon dashed line in Figure 3(a)). This results in the merging of multiple groups and we increase u_thr until we have two groups remaining from our original set of seed groups (marked by salmon pixels in Figure 3(d)). We stop increasing u_thr when these two groups are as large as possible, i.e., if u_thr was increased any further then these would merge. Additional groups can arise as the increased u_thr results in the formation of some new groups that have more than 30 stars associated to them.
5.
Star associated with the identified neuron groups form the corresponding identified star groups. For each identified group, we apply SOM followed by the above group identification procedure by repeating steps 1 to 4 (see workflow in Figure 4). Stars not belonging to any group are designated as "unidentified."
6.
We stop the group identification procedure when no more than one seed group can be found after step 3.

**Figure 4.** Workflow of StarGO.
Download figure:
Standard image High-resolution image

The group identification algorithm above allows us to find substructures adaptively for a given data set. At each iteration, neurons with u < u_m correspond to stars that have clustering above the median value. We set the minimum number of stars for seed groups to be 30 to discard small groups with low significance. The threshold of identified group is also set to be 30, which is slightly below the number of stars from the smallest population in the catalog. Increasing the value of u_thr in step 4 is designed to maximize the completeness of grouped stars. Because we apply our algorithm iteratively to each group, the final set of groups represent the smallest indivisible group that has at least 30 stars.

3.3. Size and Convergence Check

We check the dependence of the results on the size of neural network by using networks of size 50 × 50, 80 × 80, and 100 × 100 neurons. We find that the convergence is achieved for network size of 80 × 80, with the larger network yielding almost identical results. The smaller network fails to identify fine structures due to coarse griding. Thus, we use the network of size 80 × 80 throughout the study. We perform additional checks on dependence of the results on the iteration number N_iter. We use N_iter = 200, 300, and 400, finding the results are already converged for N_iter = 200. We adopt N_iter = 200 as the default value.

4. Results and Discussions

In this section, we apply StarGO to the mock catalog in the (E, L_x, L_y, L_z) space. We compare the results with the corresponding results using FoF. We know that in axisymmetrical potential of the MW, the orbital parameters E, L, and L_z are known to be approximately conserved, whereas L_x and L_y evolve coherently (Helmi & de Zeeuw 2000; Knebe et al. 2005; Klement 2010; Gómez & Helmi 2010; Maffione et al. 2015). In a more realistic scenario, such as in our model which includes a live N-body disk and bulge, the conservation of these quantities are more strongly violated. However, after stars from a satellite interact with the stars in the MW and other satellites, some of them can still have similar E and L, due to very similar disruption history. This can be seen from the Aitoff projection of L shown in (θ, ϕ) in the left panel of Figure 1, where stars of the same origin tend to cluster. Similar clustering can also be seen in (L_x, L_y), (L_x, L_z), and (L_z, E) (see Figures 2(a)–(c)). Substructure identification methods can exploit these clusterings to identify star groups.

4.1. StarGO

Following steps 1–6 of the workflow (see Figure 4), we apply SOM to the mock catalog in the (E, L_x, L_y, L_z) space. Figure 5(Ia) shows the training results after the application of SOM (step 1 of the algorithm) on the 2D neural map. Each BMU is represented with the color and symbol according to the satellite of the associated stars, where the darker symbols represent multiple stars mapped to the same neuron. Some neurons are BMUs of stars belonging to different satellites, which can be seen with overlaid symbols. After we perform the group identification algorithm (steps 2–4), the neurons in seed groups are marked by blue pixels and the neurons with u < u_thr are marked by salmon pixels in Figure 5(Ib), same as Figures 3(c)–(d). Figure 5(Ic) shows the identified star groups (Group A–G) using the same color coding as Figure 5(Ia). The group identification applied to individual groups requires six more iterations for Group A and one more iteration for Group E before we reach the end of the workflow (see Figure 4). For illustration, we show the detailed group identification procedure for two more iterations for Group A in Figures 5(II)–(III). Figure. 6 shows the full schematic diagram for the hierarchical group identification from StarGO with the detailed results listed in Table 3.

**Figure 6.** Schematic diagram for the hierarchical group identification from StarGO. The results for star groups after each iteration is represented by tree maps in each row. Every tree map contains rectangles of different colors and areas, each of which represent a distinct star group. Indivisible groups with purity ≥60% are colored according to the major contributor with the same color scheme as in Figure 1. If an indivisible group cannot be associated to any dominant single satellite, the rectangle is colored in white. The divisible groups are denoted by light pink rectangles, and the unidentified stars (i.e., stars not associated to any group) are denoted by gray rectangles. The relative area of each rectangle in a tree map approximately represents the number fraction of stars from the corresponding group.
Download figure:
Standard image High-resolution image

Table 3. Results from StarGO Applied to the Mock Catalog in the (E, L_x, L_y, L_z) Space

i = 0
GrpID	A	B	C	D	E	F	G
Sat1				63		50	61
Sat2			35
Sat9		26

N_grp	Next	48	39	63	Next	50	66
purity(%)	iteration	54	90	100	iteration	100	92

i = 1	A							E

GrpID	A0	A1	A2	A3	A4	A5	A6	E0	E1	E2

Sat1								180	35	52
Sat4			45		36
Sat5							49
Sat7		27
Sat9						49

N_grp	Next	33	86	Next	44	74	64	182	35	52
purity(%)	iteration	82	52	iteration	82	66	77	99	100	100

i = 2	A0		A3

GrpID	A0a	A0b	A3a	A3b

Sat5				265
Sat7		143

N_grp	Next	208	1887	284
purity(%)	iteration	69	Spurious	93

i = 3	A0a

GrpID	A0a0	A0a1	A0a2

Sat4			75
Sat7	122

N_grp	131	Next	80
purity(%)	93	iteration	94

i = 4	A0a1

GrpID	A0a1a	A0a1b	A0a1c

Sat7		92	30

N_grp	next	101	31
purity(%)	iteration	91	97

i = 5	A0a1a

GrpID	A0a1a0	A0a1a1	A0a1a2

Sat7	113		87

N_grp	138	Next	117
purity(%)	82	iteration	74

i = 6	A0a1a1

GrpID	A0a1a1a	A0a1a1b

Sat4	185	29

N_grp	252	38
purity(%)	73	76

Note. For each iteration, we list the identified satellites and the number of stars in the corresponding identified groups. N_grp denotes the total number of stars in each identified group. Groups that require further iteration are highlighted in bold at each step.

Download table as: ASCII Typeset image

We find that for most identified groups, the major contribution is from a single satellite. If the purity for a group is ≥60%, we identify the group with the corresponding satellite, which we refer to as the dominant contributor. Using this criteria, StarGO is able to find a total of 24 star groups, out of which 21 can be identified with satellites. One group is considered as a spurious group (Group A3a), which has roughly equal fraction of stars from sat4 and sat7 with purity ∼40%. The remaining two groups (Group B and A2) have purity ranging from 50%–54% and thus cannot be strictly identified with a satellite using our criteria. We find that all of the six major satellites (sat1, sat2, sat4, sat5, sat7, and sat9) in the mock catalog can be identified with at least one group. The number fraction f of stars of a satellite in the extracted volume that are identified with its corresponding groups is ≥5% for all major satellites except sat2 (f = 1.7%). This is likely due to the fact that sat2 is more heavily disrupted compared to the other five major satellites. On the other hand, for the two largest contributors to the mock catalog, sat4 and sat7, StarGO is able to identify 6.6% (four groups) and 12% (seven groups) of the stars, respectively. Overall, StarGO is able to identify a total of 1850 stars from satellites within the analyzed volume that are the major contributors to the identified groups. This constitutes a fraction f_tot = 10% of the total number of stars in the mock catalog.

4.2. Friends-of-friends

A widely used method of substructure identification is FoF, which is the standard procedure of finding halo groups used in cosmological simulations. In this case, the linking length l_p is a key parameter, which is chosen empirically. The typical value of l_p is set to be 0.2 times the interparticle distance, which is a characteristic length scale used in the definition of dark matter halo. When FoF is applied to substructure identification in the integral-of-motion space, all "distances" have units of angular momentum or energy, such that interparticle distances lose their physical meaning. Following (Helmi & de Zeeuw 2000), we set the characteristic length scale to be the dispersion of the total angular momentum of stars ΔL_cluster. We note that, similar to StarGO, we use a normalized input space which is dimensionless. Thus, the dispersion of the angular momentum can be used as a scale for every dimension. The linking length l_p is set to be ηΔL_cluster, where η is empirically chosen from a range of ∼0.1–0.2. Finding optimal values of η and determining ΔL_cluster are problematic for substructure identification.

Following the procedures from Helmi & de Zeeuw (2000), we test the performance of FoF applied to the mock catalog. We estimate ΔL_cluster from the distribution of L_z. Specifically, we set it to be half of 68% confidence interval, which roughly corresponds to the ±1σ range for a normal distribution. We apply FoF using different values of η. In order to visualize the results from FoF for easy comparisons with StarGO, we again use the 2D neural map. To do this, we map each star to its BMU resulting from the initial application of SOM. We use distinct colors to plot the stars of each group identified by FoF. As in the case of StarGO, we only consider groups with more than 30 stars. Figures 7(a)–(e) shows the results for three different values of η. As we can see, the group identification is sensitive to η. The optimal results are found for η = 0.15–0.25, which gives the maximum number of identified satellites within the extracted heliocentric volume and highest f_tot. The results for η = 0.15, 0.20, and 0.25 are listed in Table 4.

**Figure 7.** Results from FoF applied to the mock catalog in the (E, L_x, L_y, L_z) space, visualized on the 2D neural map resulting from the initial application of SOM shown in Figure 5(Ia). Star groups identified by FoF are shown in distinct colors using different values of η: a) η = 0.1 b) η = 0.15 c) η = 0.2 d) η = 0.25 e) η = 0.3. The spurious group which cannot be identified with any satellite is colored in gray.
Download figure:
Standard image High-resolution image

Table 4. Results from FoF Applied to the Mock Catalog in the (E, L_x, L_y, L_z) Space

η = 0.15
GrpID	A	B	C	D	E	F	G	H	I	J	K

Sat1						41
Sat4					29			35	22	33
Sat5		102									32
Sat7			47	42

N_grp	1919	102	50	44	42	41	40	35	34	33	32
purity(%)	spurious	100	94	95	69	100	spurious	100	65	100	100

η = 0.20

GrpID	A	B	C	D	E	F	G

Sat1			99			36
Sat4					40
Sat5		282
Sat7							30
Sat9				52

N_grp	5276	302	100	57	40	36	30
purity(%)	Spurious	93	99	91	100	100	100

η = 0.25

GrpID	A	B	C	D	E	F	G	H	I	J	K	L	M

Sat1		272	95		72	67	57	43	40	36	31
Sat2												28
Sat4													31
Sat9				67

N_grp	9277	274	95	74	72	67	57	43	40	36	31	31
purity(%)	Spurious	99	100	90	100	100	100	100	100	100	100	90	100

Download table as: ASCII Typeset image

For η = 0.20, FoF is able to identify sat1, sat4, sat5, sat7, and sat9 with f_tot = 3.9%, where six out of seven groups can be identified with satellites. When η is reduced to 0.15, FoF is still able to identify sat1, sat4, sat5, sat7 with f_tot = 2.1%, but is unable to identify sat9. In this case, nine out of the 11 groups can be identified with satellites. On the other hand, for η = 0.25, FoF can identify sat1, sat2, sat4, sat9 with f_tot = 4.5%, where 12 out of 13 groups can be identified with satellites. In contrast, StarGO is able to identify all the satellites with f_tot = 10%. Interestingly, for all the three values of η, the group that cannot be identified with any single satellite (marked by gray pixels in Figure 7) is the largest group, with roughly equal contributions from sat4 and sat7. This is likely due to the fact that sat4 overlaps heavily with sat7 (see Figures 1 and 2). It results in weak clustering features such that FoF is barely able to distinguish them. This can clearly seen from Table 5, where FoF gives very low values of f for sat4 and sat7 for η = 0.20 and is unable to identify sat7 for η = 0.25. η = 0.15 gives highest values of f = 2.1%,1.8% for sat4 and sat7 respectively, which are still below f = 6.6%, 12% obtained from StarGO. Similarly, FoF also fails to identify sat2 for η = 0.15 and η = 0.2, and gives low value of f = 1.4% for η = 0.25, whereas StarGO gives slightly better result of f = 1.7%. As mentioned before, this is likely due to the fact that sat2 has gone through severe disruption which results in weak clustering signal in the input space. Even for the optimal range of values of η, the variation of FoF results can be seen from the fact that sat5 can be easily identified for η = 0.15 and η = 0.20 but cannot be identified at all for η = 0.25. On the other hand, StarGO gives higher value of f = 16% compared to f = 14% from the best case of FoF with η = 0.2. Similarly, the identified fraction of stars from sat1 increases sharply from 3.7% to 20% as η in increased from 0.20 to 0.25. Compared to such variations, StarGO is able to identify sat1 with a moderate value of f = 12%.

Table 5. Comparison of f between StarGO and FoF

Satellite		sat1	sat2	sat3	sat4	sat5	sat6	sat7	sat8	sat9	Total
N_sat		3609	1970	59	5598	1961	9	4985	38	341	18570

f(%)	StarGO	12	1.7		6.6	16		12		14	10
	FoF (η = 0.15)	1.1			2.1	6.8		1.8			2.1
	FoF (η = 0.2)	3.7			0.1	14		0.6		15	3.9
	FoF (η = 0.25)	20	1.4		0.5					22	4.5

Download table as: ASCII Typeset image

For values of η outside of the optimal range, FoF identifies fewer satellites or has even lower values of f and f_tot within the analyzed volume (shown in Figure 7). For η = 0.1, FoF can find only one group of 36 stars (f_tot = 0.19%), which is associated with sat7. For η = 0.3, there are three groups identified from sat1, sat4 and sat9 with f_tot = 1.5%.

5. Conclusion

In this paper, we present a new substructure identification method StarGO that identifies and visualizes star groups hierarchically on top of a 2D neuron map. Our algorithm first maps the multidimensional phase space coordinates of stars into a 2D map using SOM while conserving the topological structure of the data set. It then identifies a hierarchy of star groups adaptively according to the significance of clustering at each step.

We test our algorithm using a mock catalog of stars within a heliocentric radius of 10 kpc generated from a simulated MW-like system, and compare the results against that from an FoF algorithm. In the tests we take into account observational errors that are expected for K-giants in the Gaia DR2 and LAMOST DR5 catalogs. In comparison to FoF, StarGO is able to identify star groups dominated by each of the six major satellites, whereas FoF is able to identify at most five, even after optimizing the linking length. In addition, StarGO can identify a higher fraction of stars from almost all the satellites compared to FoF (see Table. 5). If we consider the number of stars from the dominant satellite in each group, we find that StarGO is able to identify a total of 10% of the total satellite population in the extracted heliocentric volume, whereas for FoF this fraction is below 4.5%.

In conclusion, StarGO is able to identify star groups efficiently by combining the sensitivity and visualization ability of SOM with an adaptive clustering algorithm. The adaptive group identification procedure allows us to systematically search for substructures while avoiding uncertainties from nuisance parameters. Overall, the results from StarGO are better than the results from FoF, even when using an optimized linking length. StarGO is an ideal tool to explore high-dimensional data set from the recently released Gaia DR2. Our method will be particularly useful for studies of the inner stellar halo, for example when applied to the cross-match of Gaia DR2 and spectroscopic surveys.

Z.Y. gratefully acknowledges Jingying Lin for inspiring discussions about SOM. Z.Y. thanks Xiangxiang Xue for sharing her expertise in applying FoF to the real data, and Chao Liu for insightful discussions about the algorithm of StarGO. Z.Y. is also indebted to Yi Peng Jing, Yu Luo, and Hong Guo for commenting on the early draft of this paper. All the authors thank the anonymous referee for valuable and constructive comments which greatly improved this work. This work is supported by the National Key Basic Research Program of China (No. 2015CB857003). Z.Y. and P.B. acknowledge the support of NSFC-11533006. J.C. and X.K. acknowledge the support of NSFC-11333008. J.X.H. acknowledge the support of JSPS Grant-in-Aid for Scientific Research JP17K14271. J.C. also acknowledges the support of the Opening Project of Key Laboratory of Computational Astrophysics, National Astronomical Observatories, Chinese Academy of Sciences. M.C.S. acknowledges financial support from the CAS One Hundred Talent Fund and from NSFC grants 11673083 and 11333003. This work was also supported by the National Key Basic Research Program of China 2014CB845700.

We gratefully acknowledge the use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Shanghai Astronomical Observatory.

StarGO: A New Method to Identify the Galactic Origins of Halo Stars

Article metrics

Permissions

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction