
Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra


Published 2023 July 5 © 2023. The Author(s). Published by the American Astronomical Society.
Citation René Andrae et al 2023 ApJS 267 8 DOI 10.3847/1538-4365/acd53e


0067-0049/267/1/8

Abstract

We derive and publish data-driven estimates of stellar metallicity [M/H] for ∼175 million stars with low-resolution XP spectra published in Gaia DR3. The [M/H] values, along with Teff and $\mathrm{log}g$, are derived using the XGBoost algorithm, trained on stellar parameters from APOGEE, augmented by a set of very-metal-poor stars. XGBoost draws on a number of data features: the full set of XP spectral coefficients, narrowband fluxes derived from XP spectra, and broadband magnitudes. In particular, we include CatWISE magnitudes, as they reduce the degeneracy of Teff and dust reddening. We also include the parallax as a data feature, which helps constrain $\mathrm{log}g$ and [M/H]. The resulting mean stellar parameter precision is 0.1 dex in [M/H], 50 K in Teff, and 0.08 dex in $\mathrm{log}g$. This all-sky [M/H] sample is substantially larger than published samples of comparable fidelity across −3 ≲ [M/H] ≲ +0.5. Additionally, we provide a catalog of over 17 million bright (G < 16) red giants whose [M/H] values are vetted to be precise and pure. We present all-sky maps of the Milky Way in different [M/H] regimes that illustrate the purity of the data set, and demonstrate the power of this unprecedented sample to reveal the Milky Way's structure from its heart to its disk.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The chemical composition of stars, reflected in their photospheric abundances, is a fundamental stellar observable. To zeroth order, it can be summarized by the mean metallicity [M/H], which varies by orders of magnitude among the stellar populations in the Galaxy, while the individual abundance ratios among heavy elements tend to vary far less.

In the context of the Milky Way, or other resolved nearby galaxies such as the Magellanic Clouds, having vast samples of stars with [M/H] estimates across all stellar populations matters greatly for both galaxy and stellar evolution: [M/H] traces the "chemical evolution" of the galaxy, which reflects the combination of the star formation history, stellar yields, gas inflow, and feedback (e.g., Dekel & Silk 1986; Matteucci 1994; Tremonti et al. 2004). Large and systematically selected samples of low-[M/H] stars are needed to quantify and test stellar yields and the importance of different nucleosynthetic channels, in particular of the earliest stars in a galaxy (e.g., Tinsley 1979; McWilliam 1997). Furthermore, [M/H] is indispensable for studying the chemodynamics of, say, the Milky Way: the determination and evolutionary interpretation of the stars' distribution in the space of orbits, ages, and element abundances (e.g., Hayden et al. 2015; Weinberg et al. 2019, 2022). Studies of the chemical and dynamical evolution of the Galaxy are closely linked, as the abundances, in particular [M/H], also serve as a proxy, albeit a complex one, for stellar age (e.g., Tinsley 1980; Twarog 1980; Nordström et al. 2004; Gallazzi et al. 2005; Rix et al. 2022).

While [M/H] is the fundamental measurement for abundances, α-element enhancement has long been established as arguably the next most important abundance observation, as it reflects the relative roles of core-collapse and thermonuclear supernovae in the enrichment of a star's birth material (Tinsley 1979; Hayden et al. 2015; Weinberg et al. 2019). The dimensionality of the abundance space for elements through the iron peak is not yet fully settled (Ness et al. 2019; Ting & Weinberg 2022). The abundances of elements beyond the iron peak, which arise primarily via the s- and r-process, are of great interest (e.g., Sneden et al. 2008). However, the observational determination of these elements requires spectra of relatively high resolution and signal-to-noise ratio (S/N; e.g., Ji et al. 2019a, 2019b).

Over the last decade(s), these arguments have motivated a suite of large-scale spectroscopic surveys: the Sloan Digital Sky Survey (SDSS) I–IV (York 2000), LAMOST (Cui et al. 2012), GALAH (De Silva et al. 2015), Gaia-ESO (Gilmore et al. 2012) in the past; SDSS-V (Kollmeier et al. 2017) and WEAVE (Dalton et al. 2012) now; and 4MOST (de Jong et al. 2019) in the future. Although these ground-based surveys have now reached sample sizes of nearly ${10}^{7}$ stars, all have had highly incomplete sky coverage and complex selection functions. Only SDSS-V will provide ground-based spectroscopic all-sky coverage over the next few years (Kollmeier et al. 2017; Almeida et al. 2023).

As Gaia's most recent data release DR3 (Gaia Collaboration et al. 2022c) has again made abundantly clear, Gaia is not only a photometric and astrometric mission, but also a spectroscopic one. Gaia obtained spectra both with the Radial Velocity Spectrometer (RVS) instrument, at a resolution of ∼8000 around the near-IR Ca triplet (G. Seabroke et al. 2023, in preparation; Sartoretti et al. 2022), and very-low-resolution spectra (R ∼ 40–150) taken with the two prisms BP and RP (Carrasco et al. 2021; De Angeli et al. 2022) that together cover the wavelength range from ∼350 nm to ∼1000 nm (Montegriffo et al. 2022). In the following, we denote these BP and RP data as XP spectra. These spectra—and the astrophysical parameters derived from these spectra—were released in Gaia DR3 for both XP and RVS: 220 million and 1 million spectra, as well as 470 million and 6 million sets of astrophysical parameters, for XP and RVS, respectively.

To maximize the Gaia data set suitable for chemodynamical studies of our Galaxy, one would need to have abundances, at least [M/H], for most stars that have RVS velocities. In the context of DR3, this can be done with [M/H] based on XP spectra, but not on RVS spectra or derived abundances, as these are published only for a subset of 6 million stars (Recio-Blanco et al. 2022) compared to 33.8 million stars that have RVS velocities (Katz et al. 2022).

An extensive set of metallicities for Gaia sources with XP spectra was published as part of DR3 (Andrae et al. 2022). By design, these [M/H] values were derived using synthetic model spectra in comparison with the XP spectra, with the goal of a consistent approach to stellar parameter estimates across much of the color–magnitude diagram (CMD). Unfortunately, external validation has shown that these [M/H] values have important shortcomings (e.g., systematics and a high rate of "catastrophic" outliers) due to two aspects known and stated at the time of publication: first, knowledge of the Gaia XP system is detailed but imperfect, so that significant discrepancies between the predictions of the synthetic model and the XP data exist and lead to erroneous [M/H] estimates. Second, the spectra of different Teff and $\mathrm{log}g$ have very different information content about [M/H] at low resolution and for some temperatures (e.g., OB stars), the XP spectra are simply not informative on [M/H].

On the other hand, it has been established that for cool stars, even very-low-resolution spectra are informative about [M/H] (Ting et al. 2017). By using a data-driven approach to estimate [M/H] and by focusing on stellar types whose low-resolution spectra are informative about [M/H], one can overcome these limitations. This has recently been shown by Rix et al. (2022), who produced a large set of [M/H] estimates toward the Galactic center. Similar methods have been employed to study the halo of the Galaxy, for example to map out its last major merger (Belokurov et al. 2023; Chandra et al. 2023).

Here we set out to build on this work and produce a comprehensive catalog of high-fidelity stellar [M/H] estimates (with Teff and $\mathrm{log}g$, as corollary) that:

  1. Includes essentially the entire sample of published XP spectra, acknowledging that the [M/H] estimates at low S/N and high Teff are potentially unreliable. This requires the identification of subsamples where the [M/H] estimates are precise, accurate, and robust (i.e., with negligible outliers), as verified by external comparison.
  2. Is "all sky," accepting that the reach of such a catalog varies across the sky due to (a) the Gaia experimental setup, (b) the details of the DR3 data release, and (c) the changing source density and dust extinction.
  3. Is data driven, drawing on high-quality training sets that cover essentially the whole metallicity range present in the Local Group: −3 < [M/H] < 0.5.
  4. Draws mostly on XP spectral information, but also utilizes relevant information that is available for most of the target sample: broadband photometry across a wide range of wavelengths (extending to the Wide-field Infrared Survey Explorer (WISE) in the infrared) to constrain the overall spectral energy distribution and reduce the Teff–reddening degeneracy; and parallaxes, ϖ, which are, even at low or negative ϖ/δϖ, highly informative about the luminosity or absolute magnitude M of the star, as ${10}^{{M}_{\lambda }/5}\propto \varpi \,{10}^{{m}_{\lambda }/5}$. Indirectly, ϖ therefore informs $\mathrm{log}g$ and [M/H].

Since the first submission of this article, two similar works have been published: Yao et al. (2023) classify 188,000 candidates for very-metal-poor stars ([Fe/H]< −2) using XP spectra and XGBoost while Zhang et al. (2023) build an empirical forward model from LAMOST training examples in order to estimate stellar parameters with realistic uncertainties for all 220 million published XP spectra.

The rest of the paper is organized as follows: in Section 2, we explain how we compile the training sample and what input features we choose for XGBoost, and we show first internal validation results. In Section 3, we define our application sample and validate our results on external data that have not been used for training. In the closing Section 4, we illustrate the power of this sample by showing a set of all-sky maps in different metallicity bins, which illustrate that even the (rare) low-metallicity subsamples have little if any contamination. In the summary and outlook, we touch on obvious future astrophysical uses of this sample. The catalogs produced in this work are published online at doi:10.5281/zenodo.7599788 (Andrae et al. 2023).

2. Training XGBoost

We seek to train XGBoost models (Chen & Guestrin 2016) to estimate stellar metallicity, effective temperature, and surface gravity from XP spectra, drawing on the subset of objects for which both XP spectra and externally derived stellar parameters of high fidelity exist.

2.1. Training Sample Selection

Encouraged by the results in Rix et al. (2022), we train XGBoost models using as data features both XP coefficients (Carrasco et al. 2021; De Angeli et al. 2022) and synthesized photometry that was computed with GaiaXPy (e.g., Gaia Collaboration et al. 2022b). For the most part, we do this for stars with literature labels from APOGEE DR17 (Abdurro'uf & Aerts 2022). However, APOGEE DR17 has no metallicity estimates below [M/H] ∼ −2.5, and XGBoost cannot extrapolate beyond the metallicity range of its training sample. Therefore, we augment the APOGEE DR17 training sample by the set of very/ultra-metal-poor stars from Li et al. (2022), who provide a consistent and fairly extensive set of [M/H] determinations for a set of (apparently) bright stars. We also replace the AllWISE photometry (Cutri et al. 2021) used in Rix et al. (2022) by CatWISE photometry (Marocco et al. 2021), which is deeper and thus achieves higher completeness.

APOGEE DR17 contains a total of 733,901 stars. Of these, 647,025 actually have the stellar parameters Teff, $\mathrm{log}g$, and [M/H], and 64,401 have a crossmatch to Gaia. Among these, only 599,662 achieve S/Ns above 50 in the APOGEE spectra. For the purpose of this paper, we also require XP spectra to be available, which is the case in the Gaia DR3 data for 537,412 of these stars. We also require CatWISE photometry in the W1 and W2 bands, as these bands greatly aid in reducing the temperature–extinction degeneracy. This reduces the APOGEE DR17 training set to 510,413 entries, which contain only 485,850 unique Gaia source IDs, i.e., there are "duplicates" that represent repeated APOGEE observations of the same star. In such cases, we adopt the mean APOGEE parameters averaged over all repeat observations as training labels. The results of Li et al. (2022) encompass 385 stars, of which 291 stars have published XP spectra, as well as W1 and W2 photometry in CatWISE.
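The averaging of repeat APOGEE observations described above can be sketched as follows. This is a minimal illustration in plain Python and not the pipeline used for the catalog; the function name `average_repeat_labels`, the tuple layout, and the example values are hypothetical.

```python
from collections import defaultdict

def average_repeat_labels(rows):
    """Average labels over repeat observations of the same source.

    rows: iterable of (source_id, teff, logg, mh) tuples (hypothetical layout).
    Returns a dict mapping source_id -> (mean teff, mean logg, mean mh).
    """
    grouped = defaultdict(list)
    for source_id, teff, logg, mh in rows:
        grouped[source_id].append((teff, logg, mh))
    averaged = {}
    for source_id, labels in grouped.items():
        n = len(labels)
        # Column-wise mean over all repeat observations of this source.
        averaged[source_id] = tuple(sum(col) / n for col in zip(*labels))
    return averaged

# Made-up example: source 1 was observed twice, source 2 once.
observations = [
    (1, 4800.0, 2.5, -0.50),
    (1, 4900.0, 2.7, -0.25),
    (2, 5800.0, 4.4, 0.0),
]
training_labels = average_repeat_labels(observations)
```

After this step each source ID appears exactly once, so the number of training examples equals the number of unique sources.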

The resulting [M/H] distribution of this training sample is shown in Figure 1. Already from APOGEE, we have good coverage for [M/H] < −1. But this figure also shows how critical the inclusion of stars from Li et al. (2022) is to cover metallicities below −2.5, eventually down to a minimum value of −4.37. This expanded training sample removes an important limitation at low metallicity of the work by Rix et al. (2022).


Figure 1. Distribution of [M/H] in the training sample of stars, which is drawn from SDSS-APOGEE DR17 (red) and the very-metal-poor stars from Li et al. (2022; blue).


It is worth taking a closer look at the distribution of all stellar parameters in our training sample in Figure 2, to understand over which range we can expect XGBoost to return robust estimates. While Figure 2(c) suggests that we have a good coverage of main-sequence dwarfs and red-giant stars, the temperature range is limited to 3107 to 6867 K. In particular, we have no OBA stars, white dwarfs, or ultracool dwarfs in our training sample. Furthermore, Figure 2(a) shows that we have essentially no training examples for [M/H] < −2 and Teff < 4000 K. Also, Figure 2(b) shows that we have only very few training examples for metal-poor dwarfs with $\mathrm{log}g\gt 3.5$ and [M/H] < −1. This will likely preclude robust and precise parameter estimates in this regime.


Figure 2. Distributions of the [M/H] training sample in terms of effective temperature, surface gravity, and metallicity. The dominant SDSS-APOGEE DR17 part of the sample is shown as the logarithmic density map, and the metal-poor training stars from Li et al. (2022) as black dots.


2.2. Completeness of Photometry

Rix et al. (2022) noticed that some bands synthesized from the XP coefficients with GaiaXPy had negative fluxes and thus invalid magnitudes, particularly narrow bands in the blue where fluxes are often low. This leads to a rapidly decreasing completeness of XGBoost predictions in Rix et al. (2022) at the faint end. Here, we address this issue more systematically to keep the completeness toward the faint end as high as possible. First, for all stars in our application sample, we synthesize their photometry with GaiaXPy in the following photometric systems that we a priori believe to be useful for estimating metallicities (for details see Gaia Collaboration et al. 2022b):

  1. Pristine,
  2. Stromgren_Std,
  3. JPLUS,
  4. PanSTARRS1_Std,
  5. Sky_Mapper,
  6. Gaia_2.

Second, for all photometric bands, whether derived from the XP spectra or taken from CatWISE and AllWISE, we investigate the completeness of the magnitudes as a function of GBP in Figure 3. Evidently, the completeness of the synthesized photometry diminishes much earlier in some bands than in others. Further investigation reveals that bands with pivot wavelengths below ≈420 nm are the first to be affected by incompleteness. This is consistent with our interpretation of the incompleteness arising through noise in the XP coefficients, given that the BP spectra have low transmission and thus low S/Ns for wavelengths below 420 nm. Furthermore, Figure 3 shows that AllWISE would limit the completeness at all magnitudes, and that CatWISE can reach much higher completeness, especially at the faint end where there are numerous stars. Still, CatWISE does not reach full completeness even at the bright end.
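The link between negative synthesized fluxes and incompleteness can be illustrated with a short sketch. A non-positive flux has no logarithm and hence no valid magnitude, so the completeness in a magnitude bin is the fraction of sources whose flux in the band is positive. The function names, the zero-point argument, and all numbers below are hypothetical and serve only as an illustration.

```python
import math

def flux_to_mag(flux, zero_point=0.0):
    """Return a magnitude, or None when the flux is not positive."""
    if flux <= 0.0:
        return None  # log10 is undefined, so no valid magnitude exists
    return zero_point - 2.5 * math.log10(flux)

def completeness_by_bin(gbp_mags, band_fluxes, bin_edges):
    """Fraction of sources with a valid magnitude in each G_BP bin.

    Bins without any source yield None.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    valid = [0] * n_bins
    for gbp, flux in zip(gbp_mags, band_fluxes):
        for i in range(n_bins):
            if bin_edges[i] <= gbp < bin_edges[i + 1]:
                totals[i] += 1
                if flux_to_mag(flux) is not None:
                    valid[i] += 1
                break
    return [v / t if t else None for v, t in zip(valid, totals)]

# Made-up sources: faint stars have noisy, partly negative blue fluxes.
gbp = [15.2, 15.8, 17.1, 17.5, 17.9, 17.3]
fluxes = [3.0, 1.2, 0.4, -0.1, -0.3, 0.2]
completeness = completeness_by_bin(gbp, fluxes, [15.0, 16.0, 17.0, 18.0])
```

In this toy example the bright bin is fully complete while only half of the faint sources retain a valid magnitude.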


Figure 3. Completeness of the sample at a given GBP magnitude in several bandpasses that limit the application of XGBoost, which requires the full set of features for both training and testing. For GBP ≲ 17.7, CatWISE is the most severe limitation, while for GBP ≳ 17.7 the two narrow bandpasses synthesized with GaiaXPy in the far blue (e.g., Pristine_mag_CaHK and Jplus_mag_J0395) become severely incomplete.


2.3. Input Features for XGBoost

For the final set of input features for XGBoost, we adopt all bands that achieve a completeness of 95% or higher at GBP = 18. These are 31 bands, and for each band the XGBoost input feature is the color obtained from the apparent G magnitude minus the magnitude in this band. We choose the G magnitude for all colors for two reasons: first, the G magnitude is measured independently from the XP spectra from which all synthetic photometry is derived; and second, the G magnitudes have very high S/Ns. Additionally, we use the three Gaia colors G − GBP, G − GRP, and GBP − GRP as input features, as well as several colors including CatWISE photometry (namely W1 − W2, G − W1, G − W2, and GBP − W2). Therefore, our final set of data features comprises 38 colors and all 110 XP coefficients normalized to G = 15. This may appear confusing at first, because the synthesized photometry is fully redundant with the XP coefficients and adding redundant features could even be detrimental to the scientific performance (curse of dimensionality). Ultimately though, the choice to include both XP coefficients and photometry synthesized from XP as input features for XGBoost is a matter of feature selection that we test during cross-validation (see Table 1 in Rix et al. 2022). As it turns out, both are required in order to achieve optimal [M/H] results, and the omission of either XP coefficients or synthetic photometry would lead to a noteworthy increase in the [M/H] errors during cross-validation and later application. This implies that XGBoost is unable to extract all information fully from the XP coefficients alone. Instead, our manual help to "rephrase" the information in terms of synthesized photometry is required in order to make the information more easily accessible for XGBoost.
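The band selection and the construction of color features described above can be sketched in a few lines. The band names (`band_A` etc.), the completeness values, and the magnitudes are placeholders invented for illustration; only the 95% threshold and the convention "G minus band magnitude" are taken from the text.

```python
def select_bands(completeness_at_gbp18, threshold=0.95):
    """Keep bands whose completeness at G_BP = 18 reaches the threshold."""
    return sorted(b for b, c in completeness_at_gbp18.items() if c >= threshold)

def color_features(g_mag, band_mags, bands):
    """Colors defined as apparent G magnitude minus the band magnitude."""
    return {f"G-{band}": g_mag - band_mags[band] for band in bands}

# Placeholder completeness values and magnitudes (not real measurements).
completeness_at_gbp18 = {"band_A": 0.99, "band_B": 0.96, "band_C": 0.62}
selected = select_bands(completeness_at_gbp18)
colors = color_features(
    15.0, {"band_A": 15.5, "band_B": 14.25, "band_C": 16.0}, selected
)
```

Using G as the common reference means that every color shares one high-S/N measurement that is independent of the XP spectra.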

For deriving stellar parameters, in particular $\mathrm{log}g$, the absolute magnitude is highly informative: e.g., it straightforwardly differentiates between giants and dwarfs. While a substantive subset of the stars with XP spectra have good parallax S/Ns, from which we can estimate absolute magnitudes, many sample members have parallaxes that are consistent with zero or even negative. Therefore, we added a data feature that reflects or places limits on the absolute magnitude, but is linear in the parallax in order to remain well behaved in cases of noisy or even negative parallaxes. Specifically, we opted for input features of the form

$$\varpi \cdot {10}^{{m}_{X}/5}\propto {10}^{({M}_{X}+{A}_{X})/5},\qquad (1)$$

where ϖ denotes the parallax, mX is the apparent magnitude in some band X, while MX and AX are, respectively, the absolute magnitude and dust attenuation in the same band X. We added five such input features, for the photometric bands X = G, GBP, GRP, W1, and W2. The parallax in Equation (1) has been corrected for the parallax zero-point according to Lindegren et al. (2021). We find that these additional features not only help to estimate $\mathrm{log}g$ but also improve our [M/H] estimates by ∼10%, with metal-poor giants benefiting in particular. The complete list of all input features and the details of the XGBoost configuration are provided in Appendix A.
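A parallax-based feature of the kind described above can be sketched as follows. The snippet assumes that the parallax is given in milliarcseconds and uses the standard distance modulus, for which ϖ · 10^(m/5 − 2) = 10^((M + A)/5); the normalization constant and the function name are assumptions and may differ from the exact definition used for the catalog.

```python
import math

def parallax_feature(parallax_mas, apparent_mag):
    """Feature that is linear in the parallax: varpi * 10**(m/5 - 2).

    For a noise-free parallax in mas this equals 10**((M + A)/5).
    It remains finite for zero or negative parallaxes.
    """
    return parallax_mas * 10.0 ** (apparent_mag / 5.0 - 2.0)

# A star at 1 kpc (parallax 1 mas) with M + A = 0 has m = 10.
feature_1kpc = parallax_feature(1.0, 10.0)

# A noisy, negative parallax still gives a finite feature value,
# whereas a distance modulus would require log10 of a negative number.
feature_negative = parallax_feature(-0.2, 17.0)
```

Because the feature is linear in ϖ, Gaussian parallax noise maps to Gaussian noise in the feature, which is the motivation for not taking the logarithm.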

We use the exact same set of input features for training the XGBoost models for [M/H], Teff, and $\mathrm{log}g$, all based on the training labels described in Section 2.1. Our objective is to maximize the number of stars for which all required input features are available. In that case, the completeness of our results would be dominated by the completeness of the CatWISE photometry (see Figure 3).

2.4. Internal 20-fold Cross-validation

For internal validation, we assess the quality of the XGBoost results on the training sample, using 20-fold cross-validation: 20 times we set aside disjoint sets comprising 5% of the data for subsequent testing of a model trained on the other 95% of the data. In the end, the training label of each object in the training sample has been compared to the prediction of a statistically independent XGBoost model, i.e., one that did not see this object during training. These cross-validation results are summarized in Figure 4.
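The logic of the 20-fold scheme can be sketched in plain Python. The helper `k_fold_predictions` and the toy data are hypothetical, and the stand-in "model" simply predicts the mean training label; it is not XGBoost and only marks where a real regressor would be fitted.

```python
import random

def k_fold_predictions(features, labels, fit, predict, k=20, seed=0):
    """Out-of-fold predictions: each object is predicted by a model
    that never saw it during training."""
    n = len(labels)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    out_of_fold = [None] * n
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in held_out:
            out_of_fold[i] = predict(model, features[i])
    return out_of_fold

# Stand-in model (NOT XGBoost): predicts the mean of the training labels.
def fit_mean(train_features, train_labels):
    return sum(train_labels) / len(train_labels)

def predict_mean(model, feature):
    return model

# Toy data: 100 objects, so each of the 20 folds holds 5% of the sample.
toy_features = [[float(i)] for i in range(100)]
toy_labels = [0.01 * i for i in range(100)]
toy_predictions = k_fold_predictions(
    toy_features, toy_labels, fit_mean, predict_mean
)
```

The residuals between `toy_predictions` and `toy_labels` then play the role of the cross-validation residuals; with a real regressor, `fit` and `predict` would wrap its training and prediction calls.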


Figure 4. Cross-validation of the XGBoost parameters on the training sample (SDSS-APOGEE DR17 and Li et al. 2022). The plots show the results of the 20-fold cross-validation of the 5% portions of the training sample, withheld in the training. Rows from top to bottom show [M/H] residuals, Teff residuals, and $\mathrm{log}g$ residuals. Columns from left to right show residuals versus the training sample's [M/H], Teff, $\mathrm{log}g$, and Gaia's apparent GBP magnitude. The numbers in the top right corners quote the median absolute difference (MedAD) and the root mean square difference (RMSD). The density map is logarithmic. These plots illustrate the remarkable precision of the approach: 0.10 dex in [M/H], 50 K in Teff, and 0.08 dex in $\mathrm{log}g$. Note that these variances still include all the uncertainties in the APOGEE estimates.


The first row of Figure 4 is most important because it shows the cross-validation of the [M/H] estimates. Panel (a) shows that, for the most part, our results are accurate, i.e., unbiased with respect to the APOGEE reference [M/H]. There are only a few outliers where XGBoost assigns a higher [M/H] value than APOGEE. However, for training labels [M/H] < −3 XGBoost tends to overestimate [M/H]. Most likely, this is a consequence of mixing different definitions of "metallicity" in our training sample: while APOGEE provides [M/H] estimates, Li et al. (2022) provide estimates of [Fe/H]. Astrophysically, very old stars will have low iron content but may already have been enhanced in other elements, such that the stars from Li et al. (2022) may be genuinely low in [Fe/H] but have higher [M/H], which is recognized by XGBoost learning [M/H] from the majority of training examples provided by APOGEE. Panel (b) shows how our [M/H] estimates depend on Teff values: the agreement is overall very good, and for stars hotter than ∼5500 K our [M/H] estimates are closer to the training labels than for cooler stars. The main reason for this is that for Teff > 5500 K the APOGEE sample contains virtually no stars with [M/H] below −0.75, limiting the comparison to the "easy" metal-rich regime. Panel (c) shows that our [M/H] residuals do not exhibit any noteworthy trends with the training sample's $\mathrm{log}g$, i.e., our [M/H] estimates work just as well for main-sequence dwarfs as they do for red-giant stars. Panel (d) shows that our [M/H] estimates remain robust (though obviously less precise) to the very faint end of GBP ∼ 20. The overall RMSD to the training sample's [M/H] estimates is 0.106 and half of the stars differ by less than 0.044 from their reference values. This performance is about 10% better than that found for the sample of only red giants in Rix et al. (2022; see Table 1 therein). This improvement, despite the expanded coverage of the CMD, is mainly due to the inclusion of luminosity estimates (see Equation (1)) as features. As in Rix et al. (2022), the current [M/H] estimates remain unbiased as the AK extinction increases, as is evident from Figure 5(a). Including CatWISE photometry is the key here. Furthermore, Figure 5(b) establishes that there are also no systematics with parallax (i.e., inverse distance).
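The two summary statistics quoted in this section, the median absolute difference (MedAD) and the root mean square difference (RMSD), can be computed as in the following sketch. The function names and the example values are made up for illustration.

```python
import math
import statistics

def medad(estimates, references):
    """Median absolute difference between estimates and reference values."""
    return statistics.median(abs(e - r) for e, r in zip(estimates, references))

def rmsd(estimates, references):
    """Root mean square difference between estimates and reference values."""
    diffs = [e - r for e, r in zip(estimates, references)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Made-up [M/H] values: predictions versus reference labels.
predicted = [-0.5, 0.0, -1.0, 0.25]
reference = [-0.5, 0.25, -1.5, 0.25]
example_medad = medad(predicted, reference)
example_rmsd = rmsd(predicted, reference)
```

The MedAD is insensitive to a small number of outliers, whereas the RMSD is inflated by them, which is why the two numbers quoted for the same residuals can differ by a factor of two or more.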


Figure 5. Cross-validation of XGBoost on the training sample (SDSS-APOGEE DR17 and Li et al. 2022). The dependence of the test error for [M/H] on WISE AK extinction (panel (a)) and parallax (panel (b)). [M/H] shows no systematics with either.


Table 1. Abridged Table of the 174,922,161 XGBoost Estimates Presented in this Work

source_id      catwise_w1  catwise_w2  in_training_sample  mh_xgboost  teff_xgboost  logg_xgboost
4295806720     15.796      15.942      False               −0.256      5991.6        4.551
38655544960    11.837      11.879      False               −0.212      4791.6        4.604
1275606125952  14.366      14.438      False               −0.438      5177.3        4.489
1653563247744  14.734      14.802      False               −1.286      6102.0        4.017
2851858288640  10.904      10.932      True                −0.454      5899.5        4.295
3332894779520  10.316      10.385      True                0.178       4915.7        3.594
3371550165888  12.280      12.350      False               −0.388      4912.7        4.520
3508989119232  13.386      13.439      False               −0.518      5339.2        4.558
4711579935744  12.725      12.767      False               −0.323      5896.7        4.338
4814659150336  14.207      14.241      False               −0.289      4163.3        4.667

Note. The full table is available online (Andrae et al. 2023). The Gaia DR3 source_id is sorted in ascending order. The Boolean flag in_training_sample indicates whether or not a source was part of the XGBoost training sample. We provide minimal information in order to save data volume.


In particular, Rix et al. (2022) restricted their analysis to bright (GBP < 16 mag) red giant branch (RGB) stars (teff_xgboost < 5500 K and logg_xgboost < 3.5). Figures 4(b) and (c) suggest that the [M/H] estimates from the current work are also robust outside the RGB and panel (d) suggests that this also holds down to the faintest stars which have their XP spectra published in Gaia DR3. Note that we cannot test with this validation sample whether our [M/H] estimates remain so precise and robust also for very-metal-poor stars.

The other two rows of Figure 4 show the XGBoost residuals for Teff (middle) and $\mathrm{log}g$ (bottom). The RMSDs are remarkably small: 54 K for Teff and 0.089 for $\mathrm{log}g$. Furthermore, the residuals do not show obvious systematics and appear to remain robust down to GBP ∼ 20.

3. Stellar Parameters from XP Spectra via XGBoost

We now turn to applying the XGBoost estimator, trained as just described, to an all-sky sample of stellar sources with XP spectra and CatWISE photometry.

3.1. Sample Selection

We define the sample to which we apply the XGBoost estimator as all sources in Gaia DR3 that have XP spectra, valid parallaxes and proper motions, and valid XP-derived and CatWISE photometry; the parallaxes do not have to differ significantly from zero. The following ADQL query

    SELECT source_id
    FROM gaiadr3.gaia_source_lite
    WHERE has_xp_continuous = 'true'
      AND parallax IS NOT NULL
results in 218,132,063 stars. Since we require complete photometry in CatWISE and synthesized passbands (see Section 2.2), not all of them have the complete set of input features to XGBoost. The final number of stars that satisfy these additional conditions is 174,922,161 (∼80.2%). Their apparent magnitude distributions are shown in Figure 6. It is important to note that in Gaia DR3 XP spectra were only published for sources brighter than G = 17.65, yet Figure 6 shows sources fainter than that. The reason is that the XP spectra of presumed QSOs, galaxies, and ultracool dwarfs were exempt from the Gaia DR3 publication limit of G = 17.65. Consequently, these objects may be contaminants in our stellar parameter catalog: they manifest as a small bump at G ∼ 19 in the distributions of G and GRP.


Figure 6. Apparent magnitude distributions—G (black), GBP (blue), GRP (red), and W1 (gray)—for the full application sample of 174,922,161 stars without any quality cuts. The bump in the G-band distribution at G ∼ 19 reflects contamination by galaxies, QSOs, and ultracool dwarfs.


The overall result of this analysis is given in Table 1: the three stellar parameters, [M/H], Teff, and $\mathrm{log}g$, for 175 million sources, specified by their Gaia DR3 source ID and with a flag indicating whether each source was included in the XGBoost training.

3.2. External Validation with Other Surveys

We can validate these XGBoost results by a comparison to other surveys not used in the training. Specifically, we compare to results from GSP-Spec after calibration in Gaia DR3 (Recio-Blanco et al. 2022), GALAH DR3 (Buder et al. 2021), and SkyMapper DR2 (Chiti et al. 2021).

To start, we compare the [M/H] estimates from XGBoost with other metallicity estimates. For GSP-Spec (Figure 7(a)), the agreement is excellent: there are no discernible systematics and very few outliers. Importantly, the GSP-Spec comparison is mostly limited to [M/H] > −1, where we expect the [M/H] estimates to be robust. For GALAH (Figure 7(b)), we still see good overall agreement across the full metallicity range. However, there are some outliers, where GALAH estimates [Fe/H] below −1, while XGBoost estimates [M/H] above −0.5. We also note a small systematic offset below an [Fe/H] of −1 where XGBoost's [M/H] is ∼0.2 lower than GALAH's [Fe/H]. These outliers and the slight offset have also been observed in the results of Rix et al. (2022). Yet, unlike in Rix et al. (2022), we no longer see a saturation of XGBoost metallicities below −2, where we now see a continuation of the one-to-one relation with GALAH. This is the result of including the very-metal-poor stars of Li et al. (2022) in our training sample, thus extending the APOGEE metallicity range. For the SkyMapper photometric metallicities (Figure 7(c)), substantial scatter is evident both visually and quantitatively. Successful comparisons to the external spectroscopic surveys suggest that this scatter is probably inherent to SkyMapper.


Figure 7. Comparison of the XGBoost [M/H] estimates with GSP-Spec's calibrated metallicity estimates in Gaia DR3 (Recio-Blanco et al. 2022), with GALAH DR3 (Buder et al. 2021), and with SkyMapper DR2 (Chiti et al. 2021). For SkyMapper DR2, we impose the quality flag equal to 0. Numbers quote MedAD and RMSD. No quality cuts were applied, and the density maps are logarithmic. The comparison with GSP-Spec DR3 is very good, but the comparison is limited (mostly by GSP-Spec) to [M/H] ≳ −1. The comparison with GALAH DR3 is also very good, with a small systematic offset at [M/H] ≤ −1, already noted in Rix et al. (2022). Comparison with the photometric [M/H] estimates from SkyMapper DR2 shows substantially increased scatter. The good comparison of the XGBoost results with other surveys makes it likely that this is attributable to SkyMapper issues.


We further investigate the origin of the outliers and systematics of our [M/H] estimates in Figure 8 where we directly compare the metallicity estimates from APOGEE (i.e., the training sample underlying XGBoost) to those from GALAH. First, we observe the same small systematic offset below an [Fe/H] of −1 between GALAH and APOGEE, i.e., the XGBoost model has correctly learned from APOGEE and simply reflects this difference. Second, we can also see the outliers where GALAH votes for [Fe/H] < −1 whereas APOGEE votes for [M/H] > −0.5, so again XGBoost has faithfully learned from its APOGEE training sample. Consequently, both effects are traced back to genuine differences between GALAH and APOGEE and are thus not introduced by XGBoost. In fact, these outliers in Figure 7(b) mostly have high temperatures in GALAH and potentially correspond to outliers in GALAH DR3 itself.


Figure 8. Comparison of the [M/H] estimates from APOGEE DR17 (Abdurro'uf & Aerts 2022) to the [Fe/H] estimates from GALAH DR3 (Buder et al. 2021; top panel) and to the [Fe/H] estimates from SkyMapper DR2 (Chiti et al. 2021; bottom panel). There are 22,662 stars in common between APOGEE and GALAH and 696 in common between APOGEE and SkyMapper. Numbers quote MedAD and RMSD.

Quantitatively, Figure 7 shows that our XGBoost [M/H] estimates compare very well with those from GSP-Spec and GALAH, with half of the stars differing by no more than 0.092 and 0.068, respectively. This external validation error is somewhat larger than the cross-validation error of 0.042 found for APOGEE in Figure 4. This most likely reflects subtle differences between APOGEE's and GSP-Spec's [M/H] estimates (our XGBoost estimates are tied to the APOGEE scale), as similar scatter is found in the direct comparison of these surveys.
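The two summary statistics quoted in the figures can be illustrated with a minimal sketch. It assumes that MedAD denotes the median absolute difference between two sets of estimates (consistent with "half of the stars differing by no more than") and that RMSD is the root-mean-square difference; the function name and the five example values are hypothetical.

```python
import math
import statistics


def medad_and_rmsd(a, b):
    """Return (MedAD, RMSD) of the pairwise differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    medad = statistics.median(abs(d) for d in diffs)
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return medad, rmsd


# Hypothetical [M/H] values for five stars from two surveys;
# the third star is a deliberate outlier.
mh_xgboost = [-0.10, 0.05, -1.20, -0.45, 0.20]
mh_other = [-0.15, 0.00, -0.90, -0.50, 0.25]

medad, rmsd = medad_and_rmsd(mh_xgboost, mh_other)
print(f"MedAD = {medad:.3f}, RMSD = {rmsd:.3f}")
```

In this toy example the single outlier inflates the RMSD while leaving the MedAD unchanged, which is why quoting both numbers is informative.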

For the scatter of temperatures and surface gravities when comparing with GSP-Spec we find RMSDs of 112 K for Teff and 0.223 for $\mathrm{log}g$; and when comparing with GALAH, we find 166 K for Teff and 0.119 for $\mathrm{log}g$. These are again slightly higher than the RMSDs from the 20-fold cross-validation on APOGEE (54 K and 0.089, respectively) quoted in Figure 4. For $\mathrm{log}g$, the difference to GSP-Spec is larger than for APOGEE or GALAH, but we also did not apply the empirical corrections for GSP-Spec's $\mathrm{log}g$ recommended in Recio-Blanco et al. (2022).

3.3. [M/H] Estimates at Faint Apparent Magnitudes

Of particular interest is the publication limit of XP spectra in Gaia DR3, which was set at G = 17.65. Figure 4(d) suggests that our XGBoost results may remain robust as we approach the publication limit, but we would like to confirm this with an independent validation sample. Unfortunately, both GSP-Spec and GALAH DR3 are of no use in exploring this regime, as both samples are limited to bright stars. Therefore, we make use of the LAMOST DR6 data (Wu et al. 2011, 2014). As is evident from Figure 9, the [M/H] differences between XGBoost and LAMOST degrade "gracefully" toward the faint end, which means that the random scatter increases smoothly and no systematics appear. At G = 16 the central 68% interval ranges from −0.2 to +0.2 and even at G = 17.65 it ranges from −0.3 to +0.4. In fact, these variances include the [Fe/H] uncertainties from LAMOST, which typically are of the order of 0.25 around G = 17.65. Assuming that these uncertainties add in quadrature, the −0.3 to +0.4 interval at G = 17.65 implies an uncertainty of 0.17–0.32 attributable to XGBoost.
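The quadrature argument can be written out as a short sketch. It uses only the rounded values quoted in the text, so the result lands close to, but need not exactly reproduce, the quoted range; the function name is hypothetical.

```python
import math


def subtract_in_quadrature(total, known):
    """Scatter that remains after removing a known contribution in quadrature."""
    return math.sqrt(total**2 - known**2)


sigma_lamost = 0.25  # typical LAMOST [Fe/H] uncertainty near G = 17.65

lower = subtract_in_quadrature(0.3, sigma_lamost)  # from the -0.3 bound
upper = subtract_in_quadrature(0.4, sigma_lamost)  # from the +0.4 bound
print(f"XGBoost contribution: {lower:.2f} to {upper:.2f}")
```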

Figure 9. Differences in [M/H] between XGBoost and LAMOST DR6 as a function of apparent G magnitude. The Gaia DR3 publication limit for XP spectra is G = 17.65. Black lines indicate the 16th, 50th, and 84th percentiles as a function of G. Color maps indicate the logarithmic number density.

While a random error of ∼0.33 in [M/H] at G ∼ 17 is acceptable, it still represents a substantial increase from the error of 0.1 at the bright end. What are the possible origins of this increased noise? First, an earlier version of our catalog was based on AllWISE photometry instead of CatWISE, but apart from having significantly lower completeness (see Figure 3) it produced the same results at the faint end. This rules out the CatWISE photometry as origin for the increased noise. Second, Gaia DR3 parallaxes can become very noisy toward the faint end, such that the features defined in Equation (1) could begin to confuse XGBoost at the faint end. However, if we remove these features from the XGBoost input and thus become entirely independent from the parallax, we find no improvement either. This only leaves the XP spectra as the source of the increased noise toward the faint end. More precisely, we suspect that it is not the XP coefficients themselves but rather the synthesized narrowband photometry which is becoming increasingly susceptible to noise toward the faint end. This interpretation is also supported by Figure 3, which reminds us that the synthetic photometry becomes incomplete due to noise, leading to negative flux values even when XP spectra are available.

3.4. External Validation with Solar Analogs

Gaia Collaboration et al. (2022a) compiled a list of 5863 solar-analog candidates, whereof 5759 are in our sample. According to XGBoost, their mean [M/H] is 0.012 ± 0.105 and the central 90% interval ranges from −0.167 to 0.178. Figure 10 shows their distribution, which is consistent with a Gaussian of standard deviation 0.1. This is in excellent agreement with the solar value and demonstrates that our [M/H] estimates are also reliable at least for solar-like main-sequence dwarfs, whereas the estimates from Rix et al. (2022) were applicable only to giant stars.

Figure 10. Distribution of [M/H] estimates from XGBoost for 5759 solar-analog candidates from Gaia Collaboration et al. (2022a). A Gaussian with zero mean and standard deviation of 0.1 is given by the dashed line, illustrating that the [M/H] estimates are precise and accurate on the main sequence at high metallicities and for Teff ∼ 5772 K.

3.5. External Validation with Clusters

Among our XGBoost results, we find 22,477 member stars in 36 open clusters from Gaia Collaboration et al. (2018). As a first instructive example, Figure 11(b) shows how XGBoost's metallicity estimates vary with GBP − GRP color in the Praesepe cluster. The expected literature value is recovered only within a certain color range but otherwise XGBoost systematically underestimates the metallicity. This underestimate is related to the limited temperature range of the training sample (3107–6867 K, see Section 2.1). In the absence of interstellar extinction (such as for Praesepe), this temperature range roughly corresponds to a GBP − GRP color range from 0.5 to 2.3. Indeed, Figure 11(b) shows that the underestimated [M/H] values occur mainly for colors bluer than 0.5 or redder than 2.5, with a less pronounced underestimation of about 0.15 also in the range from 1.5 to 2.5. As is also evident from Figure 11(a), the Praesepe member stars are dominated by main-sequence dwarfs. We also note that the solar analogs showing excellent agreement in Figure 10 have intrinsic colors of GBP − GRP = 0.818 ± 0.029 (Gaia Collaboration et al. 2022a), and would thus fall well within the regime of good agreement between 0.5 and 1.5 in Figure 11(b).

Figure 11. Validation of the [M/H] estimates in the main sequence, using the Praesepe cluster. The cluster's CMD of all 653 members with [M/H] is illustrated in panel (a). Their [M/H] estimates are shown as a function of color in panel (b). The horizontal dashed line indicates the metallicity of 0.16 (Z = 0.02) adopted by Gaia Collaboration et al. (2018). For 0.5 < GBP − GRP < 1.5 the metallicity agreement is excellent, whereas for 1.5 < GBP − GRP < 2.5 the estimates are systematically too low by 0.15 dex. Outside of these color ranges the agreement is poor. We attribute the offsets and the poor estimates to possible systematics and poor sampling of the CMD space in the training sample. The [M/H] estimates for main-sequence stars with colors outside 0.5 < GBP − GRP < 2.5 are manifestly unreliable.

Obviously, many Praesepe member stars fall into a temperature range that is not covered sufficiently by our training sample. Unfortunately, we cannot select based on our catalog's XGBoost temperature estimates because these values are also strictly confined to the range 3107–6867 K of the training examples. Instead, we select on input features as shown in Figure 12: for every individual cluster member star, we ask where its features fall into the two diagrams and we retain it for further consideration if and only if it has a sufficiently high number of training examples nearby in these diagrams. After this filtering procedure, Figure 13 shows that the XGBoost [M/H] estimates agree reasonably well with the adopted mean metallicities of the 36 open clusters from Gaia Collaboration et al. (2018). We do see a slight positive offset below an [Fe/H] of −1 and a slight negative offset around solar [Fe/H]. The latter is probably similar to the offset seen in Figure 11 for Praesepe and color GBP − GRP > 1.4. The slight positive offset below an [Fe/H] of −1 may be due to XGBoost occasionally overestimating [M/H] in that regime (see Figure 4(a)).
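A filter of this kind can be sketched as follows. This is an illustration only, not the procedure used for the catalog: the bin width, the minimum number of training examples, and the approximation of "nearby" by "in the same grid cell" are all assumptions, and the function names are hypothetical. The sketch handles a single two-feature diagram; with two diagrams, a star would be kept only if it is covered in both.

```python
import math
from collections import Counter

BIN_WIDTH = 0.1     # assumed bin size along each feature axis
MIN_TRAINING = 10   # assumed minimum number of training stars per bin


def cell(x, y, width=BIN_WIDTH):
    """Integer grid cell containing the point (x, y)."""
    return (math.floor(x / width), math.floor(y / width))


def training_density(training_points, width=BIN_WIDTH):
    """Count training examples per grid cell of one feature diagram."""
    return Counter(cell(x, y, width) for x, y in training_points)


def is_covered(point, density, min_training=MIN_TRAINING, width=BIN_WIDTH):
    """True if the star's cell holds at least min_training training examples."""
    x, y = point
    return density[cell(x, y, width)] >= min_training


# Hypothetical training sample: 12 stars falling into one cell of the diagram.
training = [(0.85, 0.32)] * 12
density = training_density(training)

inside = is_covered((0.82, 0.35), density)    # same cell as the training stars
outside = is_covered((3.05, 1.25), density)   # no training examples in this cell
print(inside, outside)
```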

Figure 12. XGBoost input feature distributions of the application sample (color maps), overlaid with black contours of the training sample. The lowest contour is at one star per bin, i.e., it encloses the full training sample, and the other contours successively increase by factors of 10. This shows that our full sample extends across important portions of color–color space that are not covered by the training sample. The resulting parameter estimates in these regimes will be inevitably unreliable.

Figure 13. Comparison of the [M/H] estimates from XGBoost, after filtering on the input features, with the mean metallicities of 36 open clusters from Gaia Collaboration et al. (2018). Black dots show the median [M/H] and the gray error bars show the 16th and 84th percentiles in each cluster.

3.6. External Validation with Wide Binaries

Given the catalog of El-Badry et al. (2021), we find 55,033 pairs of wide binaries where each component star is observed as an individual source by Gaia. Since both stars from each binary pair have formed from the same gas cloud, XGBoost should estimate the same metallicity. In fact, Figure 14 shows that their [M/H] estimates are consistent with each other. The RMSD divided by $\sqrt{2}$ is 0.105.
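The reason for dividing by $\sqrt{2}$ is that the difference of two independent estimates, each with scatter σ, has a scatter of $\sqrt{2}\,\sigma$. A minimal sketch of the statistic, with a hypothetical function name and made-up values for four pairs:

```python
import math


def per_star_scatter(mh_primary, mh_secondary):
    """RMSD of the pairwise [M/H] differences, divided by sqrt(2)."""
    diffs = [a - b for a, b in zip(mh_primary, mh_secondary)]
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return rmsd / math.sqrt(2)


# Hypothetical [M/H] estimates for the two components of four wide binaries.
primary = [0.0, -0.3, 0.1, -1.2]
secondary = [-0.1, -0.2, -0.1, -1.0]

sigma = per_star_scatter(primary, secondary)
print(f"per-star scatter = {sigma:.3f}")
```

The returned value estimates the uncertainty of a single [M/H] estimate under the assumption that both components have equal and independent errors.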

Figure 14. Comparison of the [M/H] estimates for the components of the 30,748 wide binaries from El-Badry et al. (2021). We quote RMSD divided by $\sqrt{2}$ because we want to quantify the difference between two noisy [M/H] estimates.

An inspection of the XGBoost surface gravities reveals that essentially all wide binary members are main-sequence dwarf stars, which most likely is a selection effect from El-Badry et al. (2021) focusing on high-quality astrometry in Gaia DR2. Being main-sequence stars, the limitations from Figure 11(b) apply: if we restrict the comparison to wide binaries in which both stars fall within the color range 0.5 < GBP − GRP < 1.5, the RMSD divided by $\sqrt{2}$ drops from 0.105 to 0.079 and the mean difference is 0.004. In contrast, if we only consider pairs where one component is in the good color range, whereas the second component is within 1.5 < GBP − GRP < 2.5, the redder stars have on average a 0.05 lower [M/H] than the (unbiased) bluer stars. This offset is smaller than the systematic underestimation of 0.15 seen in Praesepe in Figure 11(b).

We also note that while most of the wide binaries in Figure 14 are at [M/H] > −0.5, there are a handful of systems with [M/H] < −1 and that in those cases the XGBoost estimates still hold.

3.7. OBA Stars as a Failure Mode

In this section, we investigate the results for OBA stars. Gaia Collaboration et al. (2022a) compiled a list of 3,023,388 OBA stars, whereof 2,371,118 are in our sample. These are beyond the temperature range of the training sample, i.e., we intentionally break the model assumptions of our XGBoost model. Given that absorption lines are often washed out in very hot stars, we expect XGBoost to misinterpret such stars as metal poor. Since OBA stars are mostly young, they should have solar-like metallicities or higher. However, XGBoost assigns a median [M/H] of −0.443 to the stars in this sample and about 11% of the OBA stars are assigned an [M/H] lower than −1 by XGBoost. Consequently, the XGBoost estimates are clearly not viable for OBA stars. We note, however, that Gaia Collaboration et al. (2022a) report contamination from metal-poor stars, i.e., not all of those may actually be hot OBA stars.

4. Illustration of the Sample

The results of our XGBoost analysis are listed in Table 1. They are unprecedented in sample size at such precision and accuracy (σ([M/H]) ∼ 0.1 dex, σ(Teff) ∼ 50 K, and $\sigma (\mathrm{log}g)\sim 0.08$ dex) and can be used for a vast array of science applications, which is beyond the scope of this paper. This combination of sample size and data quality warrants a rigorous modeling of the selection function (see, e.g., Rix et al. 2021), which is also beyond the scope of this paper. What we will do here is to provide two, only qualitative, illustrations of the sample's science potential: its total [M/H] distribution and all-sky maps in different bins of [M/H]. More generally, we emphasize that each science application warrants specific vetting of the subsample used.

4.1. [M/H] Distribution of the Sample

Perhaps the most compact way to present the sample is to show its [M/H] distribution. We show this distribution in Figure 15 for all 174,922,161 stars, for 43,520,755 likely RGB stars, and for 18,858,968 RGB stars with high-quality parallaxes (see below). The [M/H] distribution for this last subset in Figure 15, restricted to the Milky Way within about 10 kpc, is the one that can be taken most at face value as an observational approximation of the "total" metallicity distribution of the Galaxy. Of course this is a flux-limited sample, whose extent is limited by distance, dust extinction, and (in part of the sky) crowding. Proper volume corrections of this [M/H] distribution would be a complex exercise (see, e.g., Rix et al. 2021) that is beyond the scope of this work. Such an analysis must include the G ≤ 17.65 publication limit of XP spectra in Gaia DR3 (e.g., introducing foreground dust extinction), the completeness of photometry synthesized from XP spectra (i.e., loss of stars due to negative synthetic fluxes caused by noisy XP spectra, see Section 2.2), and the crowding-afflicted completeness of CatWISE photometry.

Figure 15. [M/H] distributions resulting from XGBoost for all stars in our catalog (black), RGB stars (red, teff_xgboost < 5300 K, and logg_xgboost < 3.5), and RGB stars with high-quality parallaxes (orange). The difference in the distributions between the RGB stars with and without high-quality parallaxes is physical, as the parallax cut eliminates mostly distant stars, often in the halo or the Magellanic Clouds, which are metal poor. The parallax cut on the RGB sample matters, as most of this subsample also has RVS radial velocities: for these stars orbits can be calculated (e.g., Rix et al. 2022).

Nonetheless, this distribution shows a number of remarkable features, extending over a factor of 10,000 in metallicity within a single galaxy. It starts at [M/H] ≈ −3.5, rises steeply to [M/H] ≈ −2.5, and then follows $d\mathrm{log}N/d[{\rm{M}}/{\rm{H}}]\sim 1$ to [M/H] ≈ −1.0. At this point, the onset of the old disk in metallicity, the [M/H] distribution rises quickly to a maximum near [M/H] ≈ −0.4, stays flat to [M/H] ≈ +0.2, and drops steeply beyond. The implications of this distribution in terms of chemical enrichment warrant study in a framework of chemical evolution models, such as that of Weinberg et al. (2017). The interpretation in terms of halo, old disk, thin disk, etc., will be most powerful when combining this information with orbital information. We do not pursue these avenues in this paper, but stress only two points: first, our training set, extended in [M/H] compared to Rix et al. (2022), shows that there are likely hundreds of mostly bright (G < 16) extremely metal-poor giants ([M/H] ≤ −3) and over 10,000 very-metal-poor giants ([M/H] ≤ −2) observable in the Milky Way. Second, the steep slope of the [M/H] distribution argues that spurious [M/H] determinations remain rare also at the lowest accessible metallicities.

Figure 16. Illustration of the quality cuts in effective temperature and ${M}_{W1}={W}_{1}+5\cdot {\mathrm{log}}_{10}(\varpi /100)$ for the vetted RGB sample (colored density map, 17,558,141 stars) compared to the full sample (gray density map, 174,922,161 stars). The cuts were designed with two goals in mind: first, isolate a subsample of bright giants for which the [M/H] estimates should be most precise and robust, to be used, e.g., in Galactic chemodynamics. Second, they are limited in temperature to 5200 K, which was empirically found to be highly effective to eliminate contamination of the metal-poor subsample by unrecognized hotter and reddened stars.

4.2. Monoabundance All-sky Maps

Since we have an all-sky sample with precise and robust metallicities, it behooves us to make all-sky maps as a function of metallicity to illustrate it. Figures 17 and 18 provide all-sky maps of the sample's number density in various metallicity ranges for two samples: first, the complete (unfiltered) sample of all 174,922,161 stars, and second a vetted sample of 17,558,141 RGB stars. The vetted RGB sample was designed to eliminate spurious [M/H] estimates at the expense of sample size, in particular to eliminate sample contamination among the metal-poor stars that result from unrecognized instances of hotter but reddened stars. After some experimentation, we adopted the following selection criteria illustrated in Figure 16:

  • 1.  
    phot_g_mean_mag < 16;
  • 2.  
    ϖ/σϖ > 4;
  • 3.  
    logg_xgboost < 3.5;
  • 4.  
    teff_xgboost < 5200 K;
  • 5.  
    MW1 > −0.3−0.006 × (5500 − teff_xgboost);
  • 6.  
    MW1 > −0.01 × (5300 − teff_xgboost);
  • 7.  
    (G − W2) < 0.2 + 0.77 × (GBP − W1);

where ${M}_{W1}={W}_{1}+5\cdot {\mathrm{log}}_{10}(\varpi /100)$.
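The seven criteria can be applied row by row, as in the following sketch. The column names are taken from Table 2; that the cuts use the zero-point-corrected parallax (parallax_corrected, in mas) is an assumption, the function names are hypothetical, and the example star is made up.

```python
import math


def absolute_w1(w1, parallax_mas):
    """M_W1 = W1 + 5 log10(parallax / 100), with the parallax in mas."""
    return w1 + 5.0 * math.log10(parallax_mas / 100.0)


def passes_vetted_rgb_cuts(row):
    """Apply the seven selection criteria to one catalog row (a dict)."""
    plx = row["parallax_corrected"]  # assumption: corrected parallax is used
    if plx <= 0:
        return False  # criterion 2 cannot hold; also keeps log10 defined
    g = row["phot_g_mean_mag"]
    teff = row["teff_xgboost"]
    m_w1 = absolute_w1(row["catwise_w1"], plx)
    color_limit = 0.2 + 0.77 * (row["phot_bp_mean_mag"] - row["catwise_w1"])
    return (
        g < 16                                     # criterion 1
        and plx / row["parallax_error"] > 4        # criterion 2
        and row["logg_xgboost"] < 3.5              # criterion 3
        and teff < 5200                            # criterion 4
        and m_w1 > -0.3 - 0.006 * (5500 - teff)    # criterion 5
        and m_w1 > -0.01 * (5300 - teff)           # criterion 6
        and (g - row["catwise_w2"]) < color_limit  # criterion 7
    )


# Hypothetical red giant at a distance of about 1 kpc.
giant = {
    "phot_g_mean_mag": 10.6, "phot_bp_mean_mag": 11.2,
    "catwise_w1": 8.3, "catwise_w2": 8.35,
    "parallax_corrected": 1.0, "parallax_error": 0.02,
    "teff_xgboost": 4800.0, "logg_xgboost": 2.4,
}
print(passes_vetted_rgb_cuts(giant))
```

For a full table, the same inequalities can be evaluated on arrays instead of single rows.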

Figure 17. Sky maps showing the logarithmic number density of all unfiltered stars with −3 < [M/H] < −1.2 (top panel), −0.9 < [M/H] < −0.6 (middle panel), and −0.3 < [M/H] < 0.1 (bottom panel). These illustrate two important issues: first, the incompleteness of XP spectra in the two sickle-shaped regions at high latitude; and second, a significant contamination of the metal-poor bin in the unfiltered sample, which manifests itself as a thin disk near the Galactic plane. These contaminants are presumably hotter and highly reddened stars with weaker metal lines that are not recognized as such by our algorithm, as it lacks good training sets in this CMD regime.

Figure 18. Sky maps showing the logarithmic number density of vetted RGB stars for −3 < [M/H] < −1.2 (top panel), −0.9 < [M/H] < −0.6 (middle panel), and −0.3 < [M/H] < 0.1 (bottom panel). This vetted sample (see Figure 16) is restricted to red giants with good S/Ns (G < 16), Teff cuts that eliminate contaminants in the low-metallicity subsample, and significant parallaxes (which explain the "disappearance" of the Magellanic Clouds compared to Figure 17). The top panel qualitatively illustrates how clean the metal-poor subsample is: it prominently shows the "Poor Old Heart of the Galaxy" (Rix et al. 2022), without any traces of spurious sample members in the disk that are so dominant in the high-metallicity subsample (bottom panel).

Only 11,853 of the 2,371,118 OBA stars identified by Gaia Collaboration et al. (2022a) that are in our sample pass these quality cuts, i.e., these cuts succeed in eliminating 99.50% of OBA stars from this sample. Notably, Gaia Collaboration et al. (2022a) find that their OBA star sample has some contamination from "halo" stars (i.e., metal-poor stars), which they eliminate kinematically, but cannot eliminate for metal-poor stars with disk-like orbits. Hence, our metal-poor sample may be even purer than the above comparison implies.

For the convenience of the user, this vetted RGB subset is provided as a separate table, as described in Table 2. In this table, we also provide auxiliary information from Gaia about each source's astrometry, photometry, and RVS radial velocity that are available for a substantive fraction of the sample. This provides all the information necessary for the user to compute stellar orbits.

Table 2. Table Description of the 17,558,141 Vetted RGB Results Provided Online (Andrae et al. 2023)

Column name              Description
source_id                GDR3 identifier in ascending order
l                        Galactic longitude [deg]
b                        Galactic latitude [deg]
ra                       R.A. [deg]
dec                      decl. [deg]
parallax_corrected       parallax with zero-point correction [mas]
parallax_error           parallax error [mas]
pmra                     proper motion R.A. [mas/yr]
pmra_error               error of proper motion R.A. [mas/yr]
pmdec                    proper motion decl. [mas/yr]
pmdec_error              error of proper motion decl. [mas/yr]
ruwe                     astrometric quality flag
radial_velocity          radial velocity [km/s]
radial_velocity_error    radial velocity error [km/s]
phot_g_mean_mag          apparent G magnitude [mag]
phot_bp_mean_mag         apparent GBP magnitude [mag]
phot_rp_mean_mag         apparent GRP magnitude [mag]
catwise_w1               apparent W1 magnitude [mag]
catwise_w2               apparent W2 magnitude [mag]
mh_xgboost               XGBoost estimate of [M/H]
teff_xgboost             XGBoost estimate of Teff [K]
logg_xgboost             XGBoost estimate of $\mathrm{log}g$
in_training_sample       membership in training sample

Note. We adopt the column names from the Gaia DR3 archive where appropriate. We emphasize that the zero-point correction of Lindegren et al. (2021) has been applied to the parallaxes in this table.

The three panels of the all-sky maps for the unfiltered sample in Figure 17 clearly show the imprint of the Gaia scanning law: in particular two "crescents" of lower sample density at high latitudes are attributable to too few transits that prevented the publication of XP spectra in Gaia DR3 (see De Angeli et al. 2022, Figure 29 therein). Moreover, dust extinction in the Galactic plane causes many stars to be dimmed to G > 17.65; they may be too faint to have XP spectra published or even too faint to be in the Gaia catalog at all. Note that the extinction in the Galactic plane appears most dramatic among the two metal-poor bins (top and middle panels), as these have far fewer foreground stars. The top map in Figure 17 also shows an implausible set of seemingly metal-poor stars near the disk, preferentially in star-forming regions. Most likely, these objects have spuriously low [M/H] estimates, and actually are reddened OBA stars of presumably higher metallicity that are misinterpreted as metal poor (see Section 3.7).

It is these stars that most immediately show that the full unfiltered sample must contain some spurious [M/H] estimates, which motivated our vetted sample of bright giant stars. Comparison of the two top panels of Figures 17 and 18 shows that these spurious sources are absent in the vetted giant sample.

Apart from these issues, the top maps in Figures 17 and 18 both clearly show the central concentration of metal-poor stars toward the Galactic center, extensively discussed as the "Poor Old Heart of the Milky Way" in Rix et al. (2022). Figure 17 also shows the two Magellanic Clouds, which are missing from Figure 18 due to the cut on parallax quality of $\tfrac{\varpi }{{\sigma }_{\varpi }}\gt 4$.

4.3. Potential Filtering

While we have illustrated only two examples of how to vet or filter the overall table of [M/H] estimates, it is clear that there are further limitations of our stellar parameter estimates that may compromise the use of our catalog. Here, we provide some guidance on potential filtering by the user:

  • 1.  
    In Rix et al. (2022), we used a bright RGB sample defined by teff_xgboost < 5300 K, logg_xgboost < 3.5, and GBP < 16. While this is still possible, we point out that our results in this work also hold for main-sequence dwarfs (see Figure 4(c)). Nevertheless, if the focus is on RGB stars, we recommend dropping or relaxing the selection on GBP, given that our results are robust toward the faint end (see Figure 9).
  • 2.  
    Given that OBA stars are problematic (see Section 3.7), one could take the golden sample of OBA stars from Gaia Collaboration et al. (2022a) and remove all known OBA stars from the sample.
  • 3.  
    The Gaia DR3 publication limit of G = 17.65 for XP spectra was not strict. Rather, XP spectra for 162,686 QSOs and 26,500 galaxies were also published down to the survey detection limit. Furthermore, XP spectra of ultracool dwarfs were published beyond G = 17.65. Given our training sample's temperature range of 3107–6867 K, ultracool dwarfs are not covered. Therefore, the user may consider removing all 129,997 results for G > 17.65.
  • 4.  
    The user may want to give special consideration to globular clusters, as those represent regions of high source density where the color–color diagram windows assigned to the XP spectra may begin to overlap, thus compromising the XP spectra and our derived [M/H] estimates. This effect was illustrated for Omega Centauri in Figure 27 of Creevey et al. (2022).
  • 5.  
    When working with [M/H] for stars of the main sequence, we recommend limiting the color range to 0.5 < GBP − GRP < 1.5 for unbiased results (see Figures 10 and 11). The Teff estimates of stars on the main sequence may be precise over a wider range.
  • 6.  
    In order to prevent invalid extrapolations beyond the training sample, the user can check if a source's colors fall within the training sample, as we did for clusters in Section 3.5 and Figure 12. To this end, our catalog contains a column named in_training_sample, which is a Boolean flag that indicates if a source was part of the training sample.
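Three of these suggestions (items 2, 3, and 5) translate directly into simple per-source checks, sketched below. The column names are those of Table 2 and are assumed to apply to the full catalog as well; the function name, the flag names, and the set known_oba_ids (standing in for the identifiers of the OBA golden sample) are hypothetical. The color flag is only meaningful for main-sequence stars.

```python
GAIA_XP_LIMIT = 17.65  # nominal G-band publication limit for XP spectra


def filter_flags(row, known_oba_ids=frozenset()):
    """Evaluate three of the suggested filters for one catalog row (a dict)."""
    bp_rp = row["phot_bp_mean_mag"] - row["phot_rp_mean_mag"]
    return {
        "not_known_oba": row["source_id"] not in known_oba_ids,      # item 2
        "within_xp_limit": row["phot_g_mean_mag"] <= GAIA_XP_LIMIT,  # item 3
        "ms_color_ok": 0.5 < bp_rp < 1.5,                            # item 5
    }


# Hypothetical main-sequence star with GBP - GRP = 1.0.
star = {
    "source_id": 1,
    "phot_g_mean_mag": 14.2,
    "phot_bp_mean_mag": 14.6,
    "phot_rp_mean_mag": 13.6,
}
flags = filter_flags(star, known_oba_ids={42})
print(flags)
```

Which flags to require depends on the science case, so the sketch returns them separately rather than combining them into a single decision.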

5. Summary

We have derived and presented a catalog of data-driven, precise, accurate, and robust metallicity estimates [M/H] (as well as Teff and $\mathrm{log}g$) for 175 million stars from Gaia DR3. These estimates were derived using an externally trained XGBoost algorithm that draws on an extensive set of data features: parallaxes, low-resolution XP spectra, robust synthetic photometry based on those XP spectra, and CatWISE photometry. By construction, the resulting parameters are tied to the stellar parameter scale of the main training set, SDSS DR17 (APOGEE). The entire catalog is published and available online (Andrae et al. 2023).

This catalog greatly improves on our earlier catalog in Rix et al. (2022) in several respects: (1) it is all sky, not restricted to stars toward the Galactic center. (2) It covers much of the stellar color–magnitude plane, not just red giants. (3) It encompasses all stars with XP spectra, not just the bright ones (G < 16). (4) The [M/H] estimates overcome the [M/H] ≳ −2.5 limitation in Rix et al. (2022) by augmenting the main APOGEE DR17 sample by the very and extremely metal-poor stars from Li et al. (2022) in the training of the XGBoost algorithm. (5) It replaces AllWISE (Cutri et al. 2021) by CatWISE (Marocco et al. 2021), thus improving completeness substantially.

For stars within our training sample's temperature range (from 3107 K to 6867 K), our empirical approach recovers [M/H] to within an rms test error of 0.1 (from cross-validation in Figure 4) and an rms validation error of 0.146 on GALAH DR3 (see Figure 7). In particular, our empirical results exhibit the same systematics that our APOGEE-dominated training sample exhibits in comparison to GALAH DR3, i.e., our results are perfectly consistent with the known discrepancies between the spectroscopic surveys. An independent validation on solar-analog candidates from Gaia Collaboration et al. (2022a), on members of the Praesepe cluster, and on wide binaries from El-Badry et al. (2021) confirms a typical [M/H] uncertainty of 0.1 with negligible bias for stars on the main sequence in this intermediate temperature regime. Toward the faint end, the [M/H] errors increase moderately, reaching 0.15 at G ∼ 14, 0.2 at G ∼ 16, and finally ∼0.4 at G ∼ 17.65 (see Figure 9), but we do not see systematic errors emerge. We suspect that this "graceful" degradation is probably caused by increasing noise in the narrowband photometry synthesized from the XP spectra.

We provide the full, unfiltered catalog of all ∼175 million [M/H] estimates without applying any quality cuts (see Table 1), as we had already published a smaller catalog with highly conservative cuts as part of Rix et al. (2022). The purpose of the current work is to push toward what is maximally possible in [M/H] estimates from XP spectra, to allow further and broader scientific exploitation of the Gaia DR3 data. Consequently, the user is advised to vet each data subset carefully to understand its limitations for each astrophysical application. In particular, all applications that draw on stars with parameters not well represented in the training set require caution. We provide some guidance in Section 4.3. For user convenience, we also define a vetted sample of 17.5 million RGB stars (see Table 2) with conservative cuts to ensure high data quality. This sample is still much larger than the sample in Rix et al. (2022), which contained only 2 million RGB stars toward the Galactic center.

However, we emphasize that main-sequence stars from the overall sample presented here can also be used reliably for [M/H] analysis, as long as they are in the color range 0.5 < GBP − GRP < 1.5, where they achieve typical [M/H] uncertainties between 0.079 for wide binaries (see Section 3.6) and 0.1 for solar analogs (see Figure 10). Additionally, in Appendix B we provide instructions on how to retrieve all Gaia DR3 sources with radial velocity measurements and astrometry and match those with our metallicity catalog in order to facilitate chemodynamical studies.

This catalog is already being used for various upcoming research projects, e.g., on the metallicity gradient in the Large Magellanic Cloud (R. Andrae et al. 2023, in preparation), the chemodynamics of the Milky Way disk (V. Chandra et al. 2023, in preparation), and on stellar rotation in open clusters (E. Pancino et al. 2023, in preparation).

Although Gaia DR3 is only a few months old, our work also allows us to be very optimistic for Gaia DR4; the fact that we can obtain robust and reliable [M/H] estimates even for the faintest stars at the publication limit of XP spectra in Gaia DR3 (see Figure 4(d) and Figure 9) suggests that useful [M/H] estimates might also be achievable for fainter XP spectra that will be published in Gaia DR4. In addition, it is likely that the XP spectra themselves will improve substantially from Gaia DR3 to Gaia DR4, due to improved processing and about twice as many observing epochs. All this bodes extremely well for the science potential of the XP spectra that will be published in Gaia DR4.

Acknowledgments

We thank our colleague Morgan Fouesneau for valuable discussions on this work and on the manuscript. In particular, R.A. thanks Yang Huang for useful background information about LAMOST and SMSS errors and Francois-Xavier Pineau for his valuable help in crossmatching Gaia and CatWISE data.

This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST)) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences.

Facilities: Gaia, WISE - Wide-field Infrared Survey Explorer, Sloan - Sloan Digital Sky Survey Telescope, LAMOST, AAT - Anglo-Australian Telescope, Skymapper - ANU Siding Spring Observatory 1.3m Skymapper Telescope.

Software: numpy (Harris et al. 2020), scipy (Virtanen et al. 2020), matplotlib (Hunter 2007), astropy (Astropy Collaboration et al. 2013, 2018, 2022), GaiaXPy (https://gaia-dpci.github.io/GaiaXPy-website/), gaiadr3_zero-point (https://gitlab.com/icc-ub/public/gaiadr3_zeropoint).

Appendix A: Details of XGBoost

In order to ensure reproducibility, this appendix provides further details about our XGBoost model. First and foremost, we list the input features used for all XGBoost models ([M/H], Teff, $\mathrm{log}g$):

  1. 110 XP coefficients, each divided by $10^{(15-G)/2.5}$ for normalization;
  2. Five features of the form of Equation (1) for G, GBP, GRP, W1, and W2;
  3. Seven observed colors: G − GBP, G − GRP, GBP − GRP, GBP − W2, G − W1, G − W2, and W1 − W2;
  4. 31 synthesized colors, each of the form G − X, for X being (using GaiaXPy nomenclature) StromgrenStd_mag_b, StromgrenStd_mag_y, Jplus_mag_gJPLUS, Jplus_mag_iJPLUS, Jplus_mag_J0515, Jplus_mag_J0861, Jplus_mag_J0660, Panstarrs1Std_mag_gp, Panstarrs1Std_mag_rp, Panstarrs1Std_mag_ip, Panstarrs1Std_mag_zp, Panstarrs1Std_mag_yp, SkyMapper_mag_g, SkyMapper_mag_r, SkyMapper_mag_i, SkyMapper_mag_z, Gaia2_mag_C1B431, Gaia2_mag_C1B556, Gaia2_mag_C1B655, Gaia2_mag_C1B768, Gaia2_mag_C1B916, Gaia2_mag_C1M467, Gaia2_mag_C1M506, Gaia2_mag_C1M515, Gaia2_mag_C1M549, Gaia2_mag_C1M656, Gaia2_mag_C1M716, Gaia2_mag_C1M747, Gaia2_mag_C1M825, Gaia2_mag_C1M861, and Gaia2_mag_C1M965.
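As an illustration of the first and third feature groups, the following is a minimal per-star sketch rather than the code used for the catalog. The function names and the dictionary keys are hypothetical; only the normalization factor and the list of colors are taken from the text above. The features of the form of Equation (1) and the synthesized colors are not reproduced here.

```python
def normalize_xp_coefficients(coeffs, g_mag):
    """Divide XP coefficients by 10^((15 - G) / 2.5).

    This rescales the coefficients of a star of magnitude G to those
    of an otherwise identical star at G = 15.
    """
    scale = 10.0 ** ((15.0 - g_mag) / 2.5)
    return [c / scale for c in coeffs]


def observed_colors(g, bp, rp, w1, w2):
    """Return the seven observed colors listed above (keys are illustrative)."""
    return {
        "G-BP": g - bp,
        "G-RP": g - rp,
        "BP-RP": bp - rp,
        "BP-W2": bp - w2,
        "G-W1": g - w1,
        "G-W2": g - w2,
        "W1-W2": w1 - w2,
    }


# Toy values, not real Gaia data: a star 2.5 mag brighter than G = 15
# has its coefficients divided by a factor of 10.
toy_coeffs = normalize_xp_coefficients([10.0, -2.0], g_mag=12.5)
toy_colors = observed_colors(g=12.5, bp=13.0, rp=11.8, w1=10.2, w2=10.1)
```

For a star at exactly G = 15 the scale factor is 1, so the coefficients pass through unchanged.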

Concerning the details of the XGBoost models themselves, the following Python code shows how the training was configured:

    # The experimental import is required to enable the estimator in scikit-learn < 1.0.
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingRegressor

    xgboost = HistGradientBoostingRegressor(
        loss='least_squares',  # renamed 'squared_error' in scikit-learn >= 1.0
        min_samples_leaf=20,
        max_depth=50,
        max_leaf_nodes=500,
        max_iter=10000,
        max_bins=255,
        l2_regularization=1.0e-9
    )

Specifically for training the [M/H] model, we weighted the training examples with $e^{-[\mathrm{M/H}]/5}$ in order to put more emphasis on low-metallicity examples:

    import numpy

    # MH: array of training [M/H] labels; Features: matrix of input features
    Weights = numpy.exp(-MH / 5)
    modelMH = xgboost.fit(Features, MH, sample_weight=Weights)

The XGBoost models for Teff and $\mathrm{log}g$ were configured in exactly the same way but did not use any weighting during training.
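To give a sense of how strong the [M/H] weighting is, the short sketch below evaluates the weight at a few metallicities spanning the range quoted in the abstract. The function name is hypothetical and the chosen metallicities are purely illustrative.

```python
import math


def metallicity_weight(mh):
    """Training weight exp(-[M/H] / 5) for a star with metallicity mh (in dex)."""
    return math.exp(-mh / 5.0)


for mh in (-3.0, -1.0, 0.0, 0.5):
    print(f"[M/H] = {mh:+.1f}  weight = {metallicity_weight(mh):.3f}")

# Relative emphasis of a star at [M/H] = -3 over one at [M/H] = +0.5
ratio = metallicity_weight(-3.0) / metallicity_weight(0.5)
```

Under this weighting, a training star at [M/H] = −3 counts about twice as much as one at [M/H] = +0.5 (the ratio is exp(0.7) ≈ 2.01), so the emphasis on metal-poor stars is moderate.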

Appendix B: Extracting Gaia DR3 Sources with Radial Velocities and Astrometry

For chemodynamical studies, the reader may want to match our [M/H] estimates to stars that have radial velocity measurements and astrometry in Gaia DR3. Table 2 contains all the information necessary to calculate orbits (for a given potential), but "only" for 13.3 million vetted RGB stars. Our full Table 1 contains only source IDs and stellar parameter estimates (in particular [M/H]). Due to user quota limitations, it is impossible to upload either Table 2 or Table 1 to the Gaia archive for direct crossmatching. Instead, the user needs to download the data from the Gaia archive 9 and then match via source_id locally. The following ADQL query retrieves astrometry and radial velocity for 33,653,049 stars that have both available in Gaia DR3:

    SELECT
      source_id,
      ra, dec,
      parallax, parallax_error,
      pmra, pmra_error,
      pmdec, pmdec_error,
      radial_velocity, radial_velocity_error
    FROM gaiadr3.gaia_source_lite
    WHERE radial_velocity IS NOT NULL
      AND parallax IS NOT NULL
    ORDER BY source_id

The last line ensures that the results are sorted by Gaia source_id in ascending order. This greatly facilitates the local crossmatch for the reader, e.g., allowing for binary search during crossmatching.
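The local crossmatch by binary search could look like the following minimal sketch, which uses only the Python standard library. The function name and the toy source IDs are hypothetical; real Gaia source_id values are 64-bit integers, and for tables of this size an array-based equivalent such as numpy.searchsorted would likely be preferable.

```python
from bisect import bisect_left


def crossmatch_sorted(sorted_ids, query_ids):
    """For each query ID, return its index in sorted_ids, or -1 if absent.

    sorted_ids must be in ascending order, as guaranteed by the
    ORDER BY source_id clause of the query.
    """
    matches = []
    n = len(sorted_ids)
    for sid in query_ids:
        i = bisect_left(sorted_ids, sid)  # O(log n) lookup per query ID
        matches.append(i if i < n and sorted_ids[i] == sid else -1)
    return matches


# Toy IDs for illustration only.
archive_ids = [101, 205, 309, 412, 587]   # sorted, from the archive download
catalog_ids = [309, 999, 101]             # from the [M/H] catalog
print(crossmatch_sorted(archive_ids, catalog_ids))  # [2, -1, 0]
```

The returned indices can then be used to pull the astrometry and radial velocity rows for each matched star. Entries of -1 mark catalog stars without a radial velocity in Gaia DR3.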

Footnotes

  • 3  

    Essentially all stars with RVS velocities also have parallaxes and proper motions, which are needed along with the RVS velocity to calculate orbits.

  • 4  

    XGBoost is a tree-based method. As such, it segments the feature space and assigns as output to each segment the average training labels within that segment.

  • 5  

    The absolute magnitude and extinction in Equation (1) remain unknown. We merely highlight the astrophysical meaning of these features.

  • 6  
  • 7  
  • 8  

    The exact thresholds for this selection depend on the binning used in Figure 12.

  • 9  