Blind Search of The Solar Neighborhood Galactic Disk within 5kpc: 1,179 new Star clusters found in Gaia DR3

Studying open clusters (OCs) is essential for a comprehensive understanding of the structure and evolution of the Milky Way. Many previous studies have systematically searched for OCs near the solar system, within 1.2 kpc or within 20 degrees of Galactic latitude. However, few studies have searched for OCs at higher Galactic latitudes and greater distances. In this study, using a hybrid unsupervised clustering algorithm (Friends-of-Friends and pyUPMASK) and a binary classification algorithm (Random Forest), we extended the search region to high Galactic latitudes (|b| >= 20 degrees) and performed a fine-grained blind search for Galactic clusters in Gaia DR3. After cross-matching, the newly discovered cluster candidates were fitted with isochrones to estimate their main physical parameters (age and metallicity). The candidates were then checked by manual visual inspection, and their statistical properties were compared with previously published cluster catalogs. In the end, we found 1,179 new clusters with considerable confidence within 5 kpc.


INTRODUCTION
Open clusters (OCs) are chemically homogeneous stellar populations whose members share the same age and the same kinematics (proper motion and radial velocity) and lie at approximately the same distance from us. OCs are ideal laboratories and powerful tools for studying star formation and evolution (Krumholz et al. 2019; Bossini et al. 2019). For example, the vast majority of OCs are located near the Galactic plane and thus serve as excellent tracers of the recent formation history of the Galactic disk.
Accurate determination of cluster membership is critical for the study of open clusters, as it directly impacts the estimation of their fundamental astrophysical parameters. Most studies (Cantat-Gaudin et al. 2019; Castro-Ginard et al. 2019a; Cantat-Gaudin et al. 2020; He et al. 2022d; Castro-Ginard et al. 2022a) use an unsupervised machine learning algorithm, such as the Density-Based Spatial Clustering of Applications with Noise algorithm (DBSCAN). DBSCAN can find arbitrarily shaped clusters by adjusting two parameters, i.e., the neighborhood radius (Eps) and the density threshold (MinPts). However, a single pair of Eps and MinPts can only locate clusters with a specific distribution density. Searching for clusters with different member star densities using one global pair of Eps and MinPts therefore results in a significant identification bias: clusters are over-segmented, or multiple clusters are merged into one (Hunt & Reffert 2021; He 2020).
In addition to single clustering algorithms, hybrid algorithms have been studied effectively. Chi et al. (2023b) (hereafter Paper I) proposed a hybrid method of pyUPMASK and Random Forest (RF) to identify potential OCs and successfully re-authenticated 46 reliable clusters out of 807 clusters that had been removed in the identification of Li et al. (2022a), which shows that the hybrid method is effective. The hybrid method consists of three steps: rough clustering with the friends-of-friends (FoF) algorithm, a member census, and OC identification with the RF model. The advantage of using the FoF algorithm to group stars is that the clustering considers a five-dimensional weighted parameter space of position, parallax, and proper motion. We investigated related algorithms that have been widely used in searching for star clusters and present them in Table 1.
However, the search for open clusters is a long and challenging task (Deb et al. 2022). One of the challenges is identifying more OCs. Piskunov et al. (2006) estimated that there are about 100,000 OCs in the Milky Way, yet the number of OCs identified in the literature is less than one-tenth of this theoretical estimate. A series of studies have searched for OCs using the first Gaia data release and subsequent catalogs. More than 6,000 Galactic star clusters (SCs) have been detected in published Gaia data catalogs (He et al. 2022a). About 1,200 pre-Gaia open clusters have been re-identified based on Gaia Data Release 2 (Gaia Collaboration et al. 2018). A total of 4,000 OCs have been released (Liu & Pang 2019b; Castro-Ginard et al. 2019b; Castro-Ginard et al. 2020; Li et al. 2022b) based on Gaia Data Release 2 (Gaia Collaboration et al. 2018) and EDR3 (Lindegren et al. 2021). Recently, He et al. (2022b) reported 1,656 new star clusters found in the Galactic disk (|b| < 20°) beyond 1.2 kpc, using Gaia EDR3 data. Chi et al. (2023a) (hereafter Paper II) proposed e-HDBSCAN and reported 83 OCs.
Another challenge is to search for OCs at higher Galactic latitudes. OCs are rare in high-latitude regions of the Galaxy, and the majority are found in the thin disk. Only a few efforts have focused on hunting for OCs at higher Galactic latitudes and greater distances. For example, He et al. (2022c) searched nearby all-sky regions (ϖ > 0.8 mas) based on the astrometry of Gaia EDR3 and reported 270 candidates that had not been cataloged before, of which 46 clusters are newly found at |b| > 20°. Li et al. (2022c) performed a search at high Galactic latitude (|b| > 20°) with Gaia EDR3 and reported 35 OCs in the region with |b| ≥ 25°. In addition, Sim et al. (2019) manually searched the higher-Galactic-latitude regions and identified five clusters with |b| > 20°.
Searching for OCs at higher Galactic latitudes is helpful because it can provide a better understanding of both OCs and Galactic details outside the Galactic plane. Only a few hundred OCs are known at high Galactic latitudes, which is far from sufficient for statistical studies of the properties of high-Galactic-latitude OCs (e.g., less dust, larger distances, and the main-sequence branch in CMDs) and for structural studies of the Milky Way (Li et al. 2022c).
Gaia Data Release 3 (Gaia DR3; Gaia Collaboration et al. 2022) brings new opportunities for star cluster identification. Gaia DR3 contains radial velocities (RVs) in the solar system barycentric reference frame (Recio-Blanco et al. 2022). The sample includes 34 million stars with magnitudes in the RV band G_RV ≤ 14. RVs are useful for assessing the reliability of the classification of an OC candidate (Castro-Ginard et al. 2022a). More abundant stellar radial velocity and physical information provide an opportunity to study cluster membership and kinematics (He et al. 2022a).
In this study, we use the same hybrid method as Paper I, which has been proven effective for blind searches and combines unsupervised clustering algorithms (FoF and pyUPMASK) with a binary classification algorithm (RF). The Galactic altitude of OCs can reach |z| = 200-400 pc (He et al. 2022c). Additionally, the pursuit of high-Galactic-latitude OCs and an examination of their properties contribute to our understanding not only of OCs but also of the Galactic regions beyond the Galactic plane. The origin of such OCs remains unclear. We therefore extended the search region from the Galactic plane to high Galactic latitudes (|b| ≥ 20 degrees) and attempted a fine-grained blind search based on Gaia DR3 within 5 kpc of the solar neighborhood.
The rest of the paper is structured as follows. Section 2 describes the data preparation based on the Gaia DR3 database. The methodology developed for identifying open clusters is presented in Section 3. Section 4 presents all results, including the 1,179 new star clusters found in this study. The discussion is presented in Section 5. Finally, conclusions are presented in Section 6.

DATA PREPARATION
We use Gaia DR3 to perform the OC blind search in this study. Gaia DR3 includes full five-parameter astrometric solutions (positions, parallaxes, and two proper motions) for more than 1.468 billion sources (Lindegren et al. 2021). In addition, photometry is available in three bands, G, G_BP, and G_RP, for sources between a bright limit of G ≈ 3 mag and a limiting magnitude of G ≈ 21 mag.
After excluding faint stars (G > 18 mag) and limiting the parallax (ϖ) to the range corresponding to distances from 0.14 kpc to 5 kpc, we obtained more than 20 million stars. In accordance with many previous studies, we selected five parameters, i.e., celestial positions (l, b), parallax (ϖ), and proper motions (µ_α cos δ, µ_δ), for the identification of open clusters. These five parameters were first normalized for each target source, and the normalized values were assembled into a five-component vector per source. To facilitate the clustering calculation and improve the clustering result, following Liu & Pang (2019b) and Paper II, we assigned a weight vector to each star:
$$ w = \frac{\{\cos b,\ 1,\ 0.5,\ 1,\ 1\}}{0.2\cos b + 0.7} \qquad (2) $$
The cosine of b accounts for the contraction of l at a given b in spherical geometry, and the denominator (0.2 cos b + 0.7) is a normalization factor that guarantees $\sum_{i=1}^{5} w_i = 5$. The parallax weight is set to 0.5 because the uncertainty of the parallax is greater than that of the other parameters, so a low parallax weight reduces its impact on cluster identification.
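The weighting scheme can be checked numerically. The following is a minimal sketch, assuming the weights are ordered as (l, b, parallax, µ_α cos δ, µ_δ); the function name `fof_weights` is hypothetical and is not taken from the authors' code.

```python
import math

def fof_weights(b_deg):
    """Return the five weights for a star at Galactic latitude b (degrees).

    Assumed order: (l, b, parallax, pm_alpha_cos_delta, pm_delta).
    """
    cos_b = math.cos(math.radians(b_deg))
    norm = 0.2 * cos_b + 0.7            # normalization factor from the text
    raw = (cos_b, 1.0, 0.5, 1.0, 1.0)   # cos b for l, 0.5 for parallax
    return tuple(w / norm for w in raw)

for b in (0.0, 30.0, 75.0):
    w = fof_weights(b)
    print(b, [round(x, 3) for x in w], round(sum(w), 6))
```

Because the raw weights sum to cos b + 3.5 = 5 (0.2 cos b + 0.7), the normalized weights sum to 5 at every latitude.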
After data preprocessing, we finally obtained a star source dataset X_d with a size of 218,152,787 rows × 5 dimensions for cluster identification using the FoF algorithm.

Rough Clustering Based on FoF
Similar to Paper I, we first roughly divided the data into many data regions according to Galactic longitude (l), Galactic latitude (b), and parallax (ϖ). The numbers of divisions for ϖ, b, and l are 8, 16, and 64, respectively. To avoid splitting clusters across regions as much as possible, each data region must be no smaller than two times the typical cluster size (20 pc) (Portegies Zwart et al. 2010). To deal with potential clusters located at region boundaries, we set an overlap between each pair of adjacent regions (0.2 mas in ϖ, and 10 pc in l and b).
After applying the above scheme, the whole search volume was divided into 8,596 data regions. Referring to Paper I, we first performed rough clustering using the FoF algorithm, with the linking length l_FoF set as in Paper I from N_star, the number of stars in each region, and b_FoF, the linking length factor. According to previous studies (Liu & Pang 2019b; Li et al. 2022b; Li & Mao 2023; Chi et al. 2023b), we set b_FoF to 0.2.
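To illustrate the grouping step, here is a minimal friends-of-friends sketch in pure Python. It assumes the points are already normalized and weighted, and it takes the linking length as an input rather than reproducing the l_FoF expression of Paper I. The function name and the brute-force pair search are illustrative choices, not the authors' implementation, which would need a spatial index to handle millions of stars.

```python
import math

def friends_of_friends(points, linking_length):
    """Group points so that any two points closer than linking_length
    are friends; groups are the connected components of that relation."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < linking_length:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy 2D example: a chain of three points and a separate pair.
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(friends_of_friends(pts, linking_length=0.15))
```

Points 0 and 2 end up in the same group even though they are farther apart than the linking length, because both are friends of point 1; this chaining is what lets FoF follow irregularly shaped overdensities.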
Next, we merged the clustering results obtained from each data region. We adopted a recursive merge strategy to account for clusters at the edges of the data regions: star clusters in two adjacent regions are merged if more than fifty percent of the members of the smaller cluster are shared. The overlapping regions in the model were limited to no less than 20 pc (the size of a typical cluster). Therefore, two detections of the same cluster typically share at least 50% of their members unless the cluster is larger than 60 pc and symmetrically straddles the overlap region, which is extremely rare.
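The merge rule can be sketched as follows, assuming each cluster is represented as a set of source identifiers and reading "fifty percent of their minimum members" as more than half of the smaller cluster. The function names and the repeat-until-stable loop are hypothetical stand-ins for the recursive strategy described above.

```python
def should_merge(members_a, members_b, min_shared_fraction=0.5):
    """True if the shared members exceed the given fraction of the smaller cluster."""
    a, b = set(members_a), set(members_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return False
    return len(a & b) > min_shared_fraction * smaller

def merge_groups(groups):
    """Repeatedly merge any pair of groups that satisfies should_merge."""
    merged = [set(g) for g in groups]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if should_merge(merged[i], merged[j]):
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Integers stand in for Gaia source identifiers.
print(merge_groups([{1, 2, 3, 4}, {3, 4, 5}, {10, 11}]))
```

Restarting the scan after each merge lets a newly enlarged group absorb further neighbours, which is one simple way to handle a cluster that appears in more than two overlapping regions.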
In merging regions, we set the minimum number of member stars (MMS) to 10, following Sim et al. (2019), Cantat-Gaudin et al. (2020), Hunt & Reffert (2021), and Castro-Ginard et al. (2022a). MMS directly affects the size of the clusters we eventually identify, so its value must be chosen carefully. We determined the appropriate MMS from two considerations. 1) According to Sim et al. (2019), most clusters have fewer than 50 members. Hunt & Reffert (2021) also suggested setting the minimum possible size of a star cluster to 10 for HDBSCAN, which can detect the majority of OCs in the Gaia data sample. 2) Previous studies suggest that 10 is a reasonable lower bound. In the most widely used open cluster catalog (Cantat-Gaudin et al. 2020), almost all objects (98.5%, 1986/2017) have more than 10 members, and the smallest cluster has only nine members. In Castro-Ginard et al. (2022a), almost all objects (98.9%, 621/628) have more than ten members.

Member Star Determination
The membership probability of each star is derived with pyUPMASK from kernel density estimates (KDEs), where P_star, KDE_m, and KDE_nm are the membership probability of the star, the KDE of the members, and the KDE of the field stars, respectively. Selecting reliable cluster members by membership probability has been widely used, e.g., by Jaehnig et al. (2021), He et al. (2022a,c,d), and Niu et al. (2020). Stars with membership probability p < 0.5 are cut to reduce the contamination of field stars; this is an optimal threshold according to Soubiran et al. (2018) and Carrera et al. (2019). Analogously, after feeding X_d to the FoF clustering model to obtain many rough cluster candidates, we carried out a membership census of each candidate with pyUPMASK and kept the members whose membership probability is greater than 0.5, which is the same threshold as in Jaehnig et al. (2021) but smaller than that in Gao (2018).
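The probability cut can be sketched as below. The text here does not reproduce the expression linking P_star to KDE_m and KDE_nm, so this sketch assumes the common two-class form P = KDE_m / (KDE_m + KDE_nm); that form, the function names, and the toy density values are all assumptions made for illustration only.

```python
def membership_probability(kde_m, kde_nm):
    """Assumed two-class form: member density over total density."""
    total = kde_m + kde_nm
    if total == 0.0:
        return 0.0
    return kde_m / total

def select_members(stars, threshold=0.5):
    """Keep stars whose membership probability is greater than the threshold.

    `stars` maps a (hypothetical) source identifier to (kde_m, kde_nm).
    """
    return [
        source_id
        for source_id, (kde_m, kde_nm) in stars.items()
        if membership_probability(kde_m, kde_nm) > threshold
    ]

toy_stars = {"s1": (3.0, 1.0), "s2": (1.0, 4.0), "s3": (2.0, 2.0)}
print(select_members(toy_stars))
```

With a strict "greater than 0.5" cut, a star whose member and field densities are equal (s3 above) is rejected.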

Open Cluster Identification
We performed a series of processing steps to analyze and identify OCs. We first selected OC candidates with the RF model; we then filtered the results using the proper-motion dispersion; after excluding the published OCs by cross-matching, we performed isochrone fitting and classified the OCs; finally, we obtained credible OCs through manual inspection.

Random forest modelling
We used a random forest (RF) model to isolate the most likely cluster members based on membership probabilities, without normalizing high-dimensional data. As in Paper I, in the third step we trained a Random Forest model to detect OCs among the potential candidates, using samples collected in the work of Cantat-Gaudin & Anders (2020b) with Gaia DR2 and Gaia EDR3 (Castro-Ginard et al. 2019b; Castro-Ginard et al. 2020; Castro-Ginard et al. 2022b).

Filtering using proper-motion dispersion
To select potentially real OCs from spatial over-density structures, we applied the proper-motion dispersion criterion used by Hunt & Reffert (2021), Hao et al. (2022), Hao et al. (2022a), and Paper I. The criterion is expressed in terms of σ²_µα* and σ²_µδ, the dispersions of the proper-motion components µ_α* and µ_δ, respectively. It should be noted that this criterion is necessary but not sufficient for the identification of OCs: a candidate whose dispersion satisfies the criterion is not guaranteed to be a true cluster. However, by applying it, we can filter out most of the candidates that do not meet the criterion, obtain high-quality cluster candidates, and reduce the final identification effort.

Cross-match
To exclude as many reported clusters as possible, we cross-matched our candidates with pre-Gaia cluster catalogs, OC catalogs based on Gaia, and globular cluster catalogs. The pre-Gaia cluster catalogs (MWSC) contain 3,006 star clusters gathered by Dias et al. (2012) and Kharchenko et al. (2013), aggregated from various data sources. Since these data contain no relevant proper-motion parameters, we could only compare the mean sky coordinates within 5σ (where σ is the uncertainty listed in both catalogs for each quantity). For the Gaia-based catalogs, to eliminate as many already-discovered OCs as possible and obtain OC candidates that have not been noticed before, we consider an OC to be matched to a cataloged one if their astrometric mean parameters (l, b, ϖ, µ_α, µ_δ) are compatible within 5σ (where σ is the uncertainty quoted in both catalogs for each quantity), which is consistent with He et al. (2022a), He et al. (2022c), and Hao et al. (2022a). The list of previously published sources, including LP, Ferreira Series, CWNU, Hao Series, UBC, and so on, is presented in Table 1.
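The 5σ compatibility test can be sketched as follows. How the two catalogs' uncertainties are combined is not specified in the text, so this sketch assumes they are added in quadrature; the function name, the tuple layout (l, b, ϖ, µ_α, µ_δ), and the toy values are hypothetical, and wrap-around of Galactic longitude at 0/360 degrees is deliberately not handled.

```python
import math

def is_compatible(mean_a, err_a, mean_b, err_b, n_sigma=5.0):
    """True if every mean parameter agrees within n_sigma.

    The combined sigma is assumed to be the quadrature sum of the
    two catalog uncertainties for each quantity.
    """
    for xa, sa, xb, sb in zip(mean_a, err_a, mean_b, err_b):
        sigma = math.hypot(sa, sb)
        if abs(xa - xb) > n_sigma * sigma:
            return False
    return True

# Toy values in the order (l, b, parallax, pm_alpha, pm_delta).
candidate = (120.00, 5.00, 0.50, -1.0, 2.0)
known     = (120.10, 5.05, 0.52, -1.2, 2.1)
errors    = (0.05, 0.05, 0.02, 0.1, 0.1)

print(is_compatible(candidate, errors, known, errors))
```

A candidate is treated as already published only if all five quantities agree, so a single discrepant parameter (for example, a very different parallax along the same line of sight) keeps it in the list of potential new clusters.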
We also applied the same method to cross-match our candidates against globular cluster (GC) catalogs and exclude known GCs.

Isochrone fitting and classification
We performed isochrone fitting for each new result, following the methods described in Paper I. The PARSEC theoretical isochrone models (Bressan et al. 2012) were updated with the Gaia EDR3 passbands using the photometric calibrations from ESA/Gaia. The isochrones were reddened using an extinction curve with R_V = 3.1 to derive the physical parameters (age and metallicity) of the clusters.
A log-normal initial mass function (Chabrier 2003) is used to generate an isochrone library from log(t/yr) = 6.0 to 11.13 in steps of Δ(log t) = 0.03, with metal fractions from 0.002 to 0.042 in steps of 0.002. An objective fitting function $\bar{d^2}$ was applied to all new OC candidates,
$$ \bar{d^2} = \frac{1}{n}\sum_{k=1}^{n} \left| x_k - x_{k,nn} \right|^2 $$
where n is the number of selected members in a cluster candidate, and x_k and x_{k,nn} are the positions of the member stars and of the points on the isochrone that are closest to the member stars, respectively. A series of isochrones are produced with the parameters in Table 2, following Chi et al. (2023b). After the isochrone fitting, we classified the fit results to facilitate the identification of well-fitting OCs. We did this by calculating the dispersion σ_{d²} of d², which reflects how closely the core sample follows the isochrone. We subsequently classified the candidates into three categories (class A, class B, and class C) according to the stringent isochrone-fitting criteria listed in Equation 9.
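As a rough illustration of the objective function, the sketch below computes the mean squared distance between each member star and its nearest isochrone point, which is how the text describes n, x_k, and x_{k,nn}. It assumes both are given as (color, magnitude) pairs in comparable units; any axis scaling used by the authors is not reproduced, and the function name and toy values are hypothetical.

```python
def mean_squared_distance(members, isochrone):
    """Mean squared distance from each member to its nearest isochrone point.

    `members` and `isochrone` are lists of (color, magnitude) pairs.
    """
    total = 0.0
    for cx, cy in members:
        # Squared distance to the closest isochrone point (x_k,nn).
        total += min((cx - ix) ** 2 + (cy - iy) ** 2 for ix, iy in isochrone)
    return total / len(members)

toy_members = [(0.0, 0.0), (1.0, 1.0)]
toy_isochrone = [(0.0, 0.1), (1.0, 1.2), (2.0, 2.0)]
print(mean_squared_distance(toy_members, toy_isochrone))
```

In a fit, this value would be evaluated for every isochrone in the library, and the isochrone with the smallest value taken as the best match.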
$$ \text{Class} = \begin{cases} \text{A}, & r_n < 0.1 \ \text{and}\ \bar{d^2} < 0.02 \ \text{and}\ \sigma_{d^2} < 0.04 \\ \text{B}, & (r_n < 0.1 \ \text{and}\ \bar{d^2} \ge 0.02) \ \text{or}\ (r_n < 0.1 \ \text{and}\ \sigma_{d^2} \ge 0.04) \\ \text{C}, & \text{others} \end{cases} \qquad (9) $$
Candidates with clear CMDs (r_n < 0.1, σ_{d²} < 0.04, and $\bar{d^2}$ < 0.02) and enough bright member stars, i.e., more than 20 members with magnitude less than 17, are class A. Candidates that also satisfy r_n < 0.1 but have an unclear isochrone are class B. The remaining candidates, which have a loose CMD distribution, are class C. See Liu & Pang (2019b) and Paper I for more details about the calculation of r_n.
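The three-way classification can be written as a small decision function. This is a sketch of the thresholds quoted in the text only; the function and argument names are hypothetical, and the additional class A requirement of more than 20 members brighter than magnitude 17 is not encoded here.

```python
def classify_candidate(r_n, d2_mean, d2_sigma):
    """Assign class A, B, or C from the isochrone-fit statistics."""
    if r_n < 0.1 and d2_mean < 0.02 and d2_sigma < 0.04:
        return "A"
    if (r_n < 0.1 and d2_mean >= 0.02) or (r_n < 0.1 and d2_sigma >= 0.04):
        return "B"
    return "C"

for stats in [(0.05, 0.01, 0.03), (0.05, 0.03, 0.03), (0.20, 0.01, 0.01)]:
    print(stats, classify_candidate(*stats))
```

Written this way, class B is every candidate with r_n < 0.1 that fails at least one of the two fit-quality thresholds, and class C is every candidate with r_n of 0.1 or more.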

Comprehensive Analysis Based on Visual Inspection
Since a real OC should have clear main-sequence features on its CMD, following Cantat-Gaudin et al. (2020) and He et al. (2022a), we performed manual visual inspections of the spatial distributions (SDs), proper motion distributions (PMDs), parallax distributions (PDs), ϖ vs µ_α diagrams, and isochrone-fitting results to screen out the most reliable candidates and further check the quality of the candidate clusters.

RESULTS
From the 218,152,787 stars produced by pre-processing, we obtained 14,701 stellar aggregates. Of these, 23 aggregates were rejected by the RF model, and 1,063 aggregates were eliminated by proper-motion dispersion filtering. Among the remainder, 1,244 OCs can be cross-matched with published catalogs, leaving 12,371 likely new OCs that need further identification (i.e., isochrone fitting and visual inspection), of which 12,316 are located at latitudes of |b| < 20 degrees and 55 are located at latitudes of |b| > 20 degrees.
We further carried out an isochrone-fitting and divided all possible clusters (12,371 OCs) into three classes based on the approach mentioned in subsection 3.3.4, i.e., 1,194 (class A, see Figure 5), 5,252 (class B), and 5,925 (Class C). Class A means the OCs have clear main sequences in the CMD and more star members, which is also what we focus on.
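The candidate bookkeeping quoted above can be verified with a few lines of arithmetic. This is only a consistency check on the numbers given in the text; the variable names are illustrative.

```python
# Candidate counts quoted in the text.
aggregates = 14_701
rejected_by_rf = 23
rejected_by_pm_dispersion = 1_063
matched_to_catalogs = 1_244

likely_new = (aggregates - rejected_by_rf
              - rejected_by_pm_dispersion - matched_to_catalogs)

by_latitude = {"|b| < 20 deg": 12_316, "|b| > 20 deg": 55}
by_class = {"A": 1_194, "B": 5_252, "C": 5_925}

print(likely_new, sum(by_latitude.values()), sum(by_class.values()))
```

All three totals come to 12,371, so the filtering steps, the latitude split, and the class split are mutually consistent.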
After manual visual inspection, 1,179 OCs are considered to be real Galactic star clusters. Derived astrophysical parameters for the final OCs are given in Table 3. All member sources are presented in Table 5.
In addition, as far as we know, only a few star clusters have been discovered at high Galactic latitudes of |b| ≥ 20 degrees. Based on manual visual inspection, ID3252 and ID14525 are identified as true high-Galactic-latitude OCs. Their 5-panel plots are presented in Figure 1. The remaining 53 candidates located at high Galactic latitudes are not listed in this work because they do not have clear main-sequence features.

Plausibility analysis
The 1,179 OCs we found are reasonably plausible. The 1,179 new reliable cluster candidates in Class A account for 8.02% of all possible clusters in this work. This ratio is of the same order as the 3.11% (76/2443) of Liu & Pang (2019b), who used FoF to identify star clusters in Gaia DR2. On the other hand, the fitting criteria we adopted were more stringent than those of Liu & Pang (2019b); for example, the $\bar{d^2}$ value was more strictly restricted (0.04 in Liu & Pang (2019b), 0.02 in this work). As a result, thanks to the huge volume of high-quality data in Gaia DR3, more clusters are eventually identified, and their quality and reliability are higher. As shown in Figure 2 (c), most of the OCs we report have fewer than 100 member stars; 740 clusters have fewer than 50 members (63 percent of the total), and 228 clusters have fewer than 30 members (about 19 percent of the total). This indicates that our fine-grained blind search method can effectively detect small OCs.
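The percentages in this paragraph follow directly from the quoted counts. The short check below recomputes them; the labels and helper name are illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

ratios = {
    "confirmed class A / all aggregates": percent(1_179, 14_701),
    "Liu & Pang (2019b), 76 of 2443": percent(76, 2_443),
    "fewer than 50 members": percent(740, 1_179),
    "fewer than 30 members": percent(228, 1_179),
}

for label, value in ratios.items():
    print(f"{label}: {value:.2f}%")
```

The first two ratios reproduce the quoted 8.02% and 3.11%, and the member-count fractions come out at roughly 63% and 19%.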
To validate the results of this study, we compared them to the OCs in CG20 in terms of parallax distribution (see Figure 2 (d)). Compared with CG20, our clusters mostly have parallaxes above 0.25 mas and are similar in number, size, and distribution. The peak of the median distribution of our results (990) (see Figure 8 (a)) is located at 1 km s^-1, which is consistent with H22. Most RV dispersions are smaller than 13 km s^-1 (right panel of Figure 8). Figure 2 (b) shows the age and Z distributions of the SC candidates. The new SC candidates are younger than log(age/yr) = 8.5, and the left panel (Figure 2 (a)) shows that many of the OCs are metal-poor (log(Z/Z⊙) smaller than 0.4). From the comparison of the above four aspects, combined with the CMD properties of the original candidates, the current candidates show the characteristics of genuine SCs.

Member Size Analysis and Comparison
For most of the matched clusters, we found that the clusters have more member stars than those identified in Gaia DR2. Figure 3 (a) and (d) indicate that there is more concentrated membership around the cluster halo. We can re-detect more members in NGC2682 (M67), which are located in more concentrated areas and less contaminated by field stars (see Figure 3 (c)). We also discovered more members further away from the center in NGC1662 while maintaining good CMD main sequence characteristics (see Figure 3 (b)). That means previously reported cluster scales may be underestimated, which is consistent with the work of Zhong et al. (2022).

Deficiencies
The distribution of the Class A OCs on the celestial sphere is shown in Figure 4. Some examples of the comparison of our results with previous work based on Gaia data are given in Figure 1. The cross-matched candidates at high Galactic latitude are shown in Figure 1, whose panels show, from left to right, the spatial distribution, the parallax histogram, the proper-motion distribution, and the color-magnitude diagram (CMD) with the best-fitting isochrone. Examples of class A, class B, and class C candidates are shown in Figure 5, Figure 6, and Figure 7, respectively. In those figures, some member stars deviate from the main sequence, possibly due to inhomogeneous heavy reddening and/or field star contamination.
In future work, outlying stars will be further removed using a suitable algorithm, such as a Bayesian method, to authenticate the class B and class C candidates. To determine whether the clusters are real, only two fundamental parameters, age and metallicity, are inferred using the isochrone fitting method. However, more information about these OCs is expected to be obtained with subtler methods and models, such as the advanced stellar population synthesis (ASPS) model (Li et al. 2016; Li et al. 2017).
Note to Table 3: Nmem is the number of cluster members, N17 is the number of members with magnitude less than 17, and Nrv is the number of members that have a radial velocity. RVmean, RVstd, and RVmad are the mean value, standard deviation, and median value of RV dispersion, respectively. The fields in the table are described in Table 4. This table and its description are available in their entirety in machine-readable form.

Linking length factor b_FoF
In this study, we set the value of the linking length factor (b_FoF) to 0.2, in accordance with Liu & Pang (2019b). Liu & Pang (2019b) pointed out that they chose 0.2 because the value is commonly used in the dark matter halo identification of cosmological simulations (Springel et al. 2001).
We examined whether different values of b_FoF affect OC identification, because we noticed that b_FoF can significantly impact the clustering results. To verify that 0.2 is a reasonable value for b_FoF, we first selected real star data for five regions from Gaia DR3, i.e., ID 1325 (106,905 stars), ID 76 (107,417 stars), ID 14 (64,011 stars), ID 67 (82,124 stars), and ID 18 (58,714 stars). We then identified OCs using different b_FoF values. The experimental results (see Figure 9) show that the larger b_FoF is, the larger the number of detected clusters and the smaller the average cluster size. When b_FoF is less than 0.2, the identified groups are basically the same. However, the number of identified groups increases significantly when b_FoF is greater than 0.2.
Then, we selected the well-studied M67 cluster for testing. After a square query with a side of 5.5 degrees around the RA/Dec coordinate (132.85, 11.83) in Gaia DR3, we created a test dataset of 172,557 stars containing M67 (NGC 2682). After the same data preprocessing, we obtained 42,568 stars. We identified OCs using b_FoF values of 0.1, 0.2, and 0.3 and obtained 2, 10, and 60 groups, respectively. All of these b_FoF values identify M67 correctly. According to the results of the above two experiments, the value of b_FoF greatly influences the identification of OCs. However, considering that each candidate needs to be verified manually at a later stage, taking the value of b_FoF as 0.2 may be a relatively reasonable compromise to balance the workload of manual verification and the correct rate of the identification model.
[Table 4. Description of the catalog of star cluster properties (Table 3). Columns: Column, Format, Unit, Description.]
[Figure 6. Same as Figure 5, but for examples of class B (e.g., ID 13523).]

CONCLUSIONS
To the best of our knowledge, over 7,000 OC candidates have been found in our Galaxy using different methods and algorithms. Identifying and confirming whether newly documented clusters in different published catalogs are genuine SCs requires a homogeneous census of member stars, which will be a challenging but necessary effort.
We carried out a broad blind search for Galactic star clusters. According to the isochrone-fitting criteria, all candidates were divided into three classes, i.e., 1,194 (class A), 5,252 (class B), and 5,925 (class C). After a series of stringent examinations, 1,179 likely genuine OCs in class A are presented in this study.
To sum up, this work enriches the sample of OCs, which serve as tracers of the Galaxy, within 5 kpc of the Solar System, and is especially useful for the study of the Local Arm. This catalog will serve the community as a useful resource for tracing the chemical and dynamical evolution of the MW.
To determine whether the clusters are real, we currently use the isochrone fitting method with two fundamental parameters, i.e., age and metallicity. New algorithms will be needed to discover more OCs accurately in the future. One possible approach is to incorporate spectroscopic data of member stars to estimate more cluster parameters. In addition, more than 10,000 objects in classes B and C still need to be identified, and many candidate open clusters with complex main sequences require more advanced models for fitting and identification. Binary open clusters are likely to exist among the OCs reported here, and some clusters, such as ID00236, have tidal tails and could be disintegrating OCs. Both cases are worthy of further investigation.