Spectrophotometric Parallaxes with Linear Models: Accurate Distances for Luminous Red-giant Stars


Published 2019 September 16 © 2019. The American Astronomical Society. All rights reserved.
Citation: David W. Hogg et al. 2019, AJ, 158, 147. DOI: 10.3847/1538-3881/ab398c


Abstract

With contemporary infrared spectroscopic surveys like APO Galactic Evolution Experiment (APOGEE), red-giant stars can be observed to distances and extinctions at which Gaia parallaxes are not highly informative. Yet the combination of effective temperature, surface gravity, composition, and age—all accessible through spectroscopy—determines a giant's luminosity. Therefore spectroscopy plus photometry should enable precise spectrophotometric distance estimates. Here we use the overlap of APOGEE, Gaia, the Two Micron All Sky Survey (2MASS), and the Wide-field Infrared Survey Explorer (WISE) to train a data-driven model to predict parallaxes for red-giant branch stars with $0\lt \mathrm{log}g\leqslant 2.2$ (more luminous than the red clump). We employ (the exponentiation of) a linear function of APOGEE spectral pixel intensities and multiband photometry to predict parallax spectrophotometrically. The model training involves no logarithms or inverses of the Gaia parallaxes, and needs no cut on the Gaia parallax signal-to-noise ratio. It includes an L1 regularization to zero out the contributions of uninformative pixels. The training is performed with leave-out subsamples such that no star's astrometry is used even indirectly in its spectrophotometric parallax estimate. The model implicitly performs a reddening and extinction correction in its parallax prediction, without any explicit dust model. We assign to each star in the sample a new spectrophotometric parallax estimate; these parallaxes have uncertainties of less than 15%, depending on data quality, which is more precise than the Gaia parallax for the vast majority of targets, and certainly for any stars more than a few kiloparsecs distant. We obtain 10% distance estimates out to heliocentric distances of 20 kpc, and make global maps of the Milky Way's disk.


1. Introduction

The Milky Way disk is a beautiful object of study for cosmology and galaxy evolution. Its evolution involves gas accretion and cooling, star formation, and orbital dynamics. To understand the details of this story, we need precise measurements of the stars and their properties over a large part of the disk, not just very near the Sun. Since we live in the middle of this disk, and the disk is large and dusty, these measurements require infrared spectroscopy of luminous giants. This motivates the APO Galactic Evolution Experiment (APOGEE; Majewski et al. 2017), APOGEE-2, and Sloan Digital Sky Survey V (SDSS-V; Kollmeier et al. 2017) projects, which take spectra in the infrared, and deliver detailed abundances along the entire red-giant branch up to the tip. The maps made with these data will reveal the dynamics of the Milky Way and its disk, and show us how and where stars form, and how they migrate around the disk with time.

As we operate these spectroscopic surveys, the Gaia Mission (Gaia Collaboration et al. 2016) has also been revolutionizing our view of the disk. It delivers good proper motions over most of the Galaxy, and extremely valuable parallax information locally. However, Gaia does not deliver precise geometric parallaxes over a large fraction of the disk, and certainly not in the dusty and crowded regions. For these reasons, the value of Gaia in these infrared spectroscopy projects is not to deliver distance information directly, but rather to calibrate stellar models, which then deliver distance information through relationships between stellar luminosities and spectral characteristics.

Luminous red-giant stars are valuable for mapping the Milky Way disk for a number of reasons, only one of which is their great luminosities (and hence brightnesses even at large distance). They are very common stars, much more frequent than distance-indicating stars like Cepheids. They are produced by stellar populations at all metallicities and almost all ages, so they can be found in all parts of the Galaxy. They have temperatures and surface gravities such that it is straightforward to spectroscopically measure metallicities, detailed chemical abundances, and stellar parameters, even including ages (Martig et al. 2016; Ness et al. 2016). The red-giant branch is photometrically near-orthogonal to reddening vectors by dust, so they are easy to de-redden or dust-correct. And finally, in physical models of red giants, their predicted luminosities are simple functions of composition, surface gravity, temperature, and age. Thus, if we can measure these properties of red giants, we can (in principle) predict their luminosities and use them as standard candles. That is, red giants are standardizable candles.

In this project, we develop purely data-driven techniques to predict red-giant parallaxes or distances with spectroscopy and photometry. Our method is predicated on the expectation that red-giant stars are dust-correctable, standardizable candles with such data, but it makes no use whatsoever of any stellar interior, stellar exterior, or dust-reddening models. We only use these physical expectations about red giants to recommend data features—inputs—for a completely data-driven model. The data-driven method uses patterns in the observed data themselves to discover the relationships between spectral features in the spectra of the stars, photometry (including colors), and parallax (or distance).

Data-driven models are very often more accurate than physical models of stars for this kind of work, because they have the flexibility to discover regularities or offsets in the data that are not adequately captured by the physical models. They do better than physical models when there are lots of training data with properties that are good for performing regressions.

Simple data-driven models (like the ones we will employ here) also suffer from many limitations for this kind of work: because they are constrained only by the data, and not physical law, they can learn relationships that we know to be physically unlikely, and (by the same token) they cannot benefit from our physical knowledge. They are required, in some sense, to spend some of the information in the data on learning physical laws or relationships. Not all of the information in the data is being used optimally when we ignore our physical models. Relatedly, they are not (usually) interpretable or generalizable or useful for extrapolations significantly outside the training set: a data-driven model trained with one kind of data cannot usually be used in another context with very different input data. And the internal properties of the model are not useful, in general, for improving our understanding or intuition about the astrophysics in play.

All that said, our goal here is to produce a precise mapping tool for surveys like APOGEE and SDSS-V. So we accept the limitations of the data-driven models in exchange for their performance. This work builds on the success we have seen in building other kinds of data-driven tools for stellar spectroscopy, for example to measure stellar parameters (Ness et al. 2015), abundances (Casey et al. 2016; Ho et al. 2017; Ness et al. 2018), and masses and ages (Ness et al. 2016).

There is a huge range of complexity or capacity in data-driven models, from linear regressions (what we use here) to the most advanced machine-learning methods (like deep neural networks). In this work we stay on the low-complexity side of this, as we do not want to face the (literally combinatoric) choices in model structure that advanced methods bring. And we want to control or understand the smoothness or flexibility of the model. Deep networks can model extremely complex function spaces, which is good in some contexts, but bad when you know that the functional dependencies you hope to model are smooth. In that case, much of the information in the data can get spent on learning the smoothness of the function.

What follows is a regression: we use Gaia measurements of parallax to learn relationships between the spectroscopic and photometric properties of stars and their parallaxes. We then use that learned model to estimate improved parallaxes for stars with poor (or no) Gaia measurements. This regression is very sensitive to certain aspects of the data or experimental design. One is that we will do far better with the regression if we have an accurate model for the noise process in the Gaia data. For this reason, we make use of a justifiable likelihood function for the Gaia astrometry (Hogg 2018). Another is that regressions can be strongly biased if we have a bad selection function or data censoring. This means that we cannot apply cuts to the Gaia astrometry or signal-to-noise unless those cuts are correctly accounted for in our generative model for the data sample. Below we will make no such cuts. That means that our training step has the odd property that much of the training data are low in signal-to-noise; we include training parallax measurements even if they are negative! The fact that we are using a justified likelihood function ensures that these data will only affect the model in sensible and justifiable ways. The alternative—cutting to high signal-to-noise data—is the more traditional approach, but it leads to biased inferences (unless the cuts are included correctly in the likelihood function). The method in which we make no cuts on astrometric measurements and we keep an unmodified, justified likelihood function is unbiased, and simple.

2. Assumptions of the Method

Our position is that a methodological technique is correct inasmuch as it delivers correct or best results under its particular assumptions. In that spirit, we present here the assumptions of the method explicitly.

  • Features. Perhaps our most fundamental assumption is that the parallax or distance of a star can be predicted from the features we provide, which are the full set of (pseudo-continuum-normalized) pixels from the APOGEE (Majewski et al. 2017) spectrum, plus the G, ${G}_{\mathrm{BP}}$, and ${G}_{\mathrm{RP}}$ photometry from Gaia (Gaia Collaboration et al. 2016), plus the J, H, and Ks photometry from the Two Micron All Sky Survey (2MASS; Skrutskie et al. 2006), plus the W1 and W2 photometry from the Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010). That is, we assume that these spectrophotometric features are sufficient to predict the parallax in the face of variations in stellar age, evolutionary phase, composition, and other parameters, and also interstellar extinction.
  • Good features. We assume that the spectrophotometric features are known for each star with such high fidelity (both precision and accuracy) that we do not need to account for errors or uncertainties or biases in the features. That is, we assume that the features are substantially higher in signal-to-noise than the quantities we are trying to predict; in particular we are assuming that the photometry and the spectroscopy is better measured than the astrometry. This is true for most features for most stars, but it does not hold universally.
  • Representativeness. We assume that the training set constructed from the overlap of Gaia and APOGEE data sets constitutes a representative sample of stars, sufficiently representative to train the model for all other stars. Although this assumption is not terrible, it has a weakness: the stars at greatest distances and greatest local (angular) stellar crowding have the least precise Gaia parallax measurements, and therefore will effectively get less weight in the fits performed to train the model.
  • Sparsity. We expect that only a small subset of the full complement of APOGEE spectral pixels will be relevant to the prediction of parallax (the spectrophotometric parallax). That is, we expect that many of the pixels will or ought to get no weight in the final model that we use to predict luminosities and distances.
  • Linearity. Perhaps the most restrictive assumption of this work is that the logarithmic parallax (or, equivalently, the logarithmic distance, or, equivalently, the distance modulus) can be predicted as a completely linear function of the chosen features. We only assume this linearity within a small range of stellar surface gravity, but the assumption is strong, and strongly limits the flexibility or capacity of the model. We make it to ensure that our method is easy to optimize and the results are easy to interpret. It is also the case that linear models extrapolate better than higher-order or more flexible models; this is not extremely important to the present work, but it does protect our predictions at the edges (in terms of temperature, surface gravity, and chemical abundances) of our sample. That said, the choice of this model (log parallax or log distance is a linear combination of input features) will certainly introduce particular and hard-to-learn biases into our results. Indeed, any choice of model that is not strictly physically correct will introduce particular biases. We are effectively assuming that we can produce useful distance estimates nonetheless.
  • Likelihood. We assume that the Gaia parallaxes are generated by a particular stochastic process, in which the difference between the Gaia catalog parallax (plus a small offset, which we learn below) and the true parallax is effectively drawn from a Gaussian with a width set by the Gaia catalog uncertainty on the parallax. This is the standard assumption in all properly probabilistic Gaia mission inferences to date (as discussed in Hogg 2018), but it subsumes a number of related assumptions, like that the Gaia noise model is correct, that the stars are only covariant with one another at negligible levels, and that there are no significant outliers or large-scale systematics.
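The likelihood assumption can be made concrete with a short sketch (illustrative only; the function and variable names are ours, not from any released pipeline). The key point is that a negative catalog parallax is a perfectly valid input:

```python
import numpy as np

def parallax_log_likelihood(varpi_true, varpi_obs, sigma_obs, offset=0.0):
    """Per-star Gaussian log-likelihood for a Gaia catalog parallax.

    The (offset-adjusted) catalog parallax varpi_obs + offset is modeled
    as drawn from a Gaussian centered on the true parallax varpi_true
    with the catalog uncertainty sigma_obs.
    """
    resid = (varpi_obs + offset) - varpi_true
    return -0.5 * (resid / sigma_obs) ** 2 - np.log(np.sqrt(2.0 * np.pi) * sigma_obs)

# a negative measured parallax still yields a finite, usable likelihood
ll = parallax_log_likelihood(varpi_true=0.05, varpi_obs=-0.02, sigma_obs=0.1)
```

Because the likelihood is evaluated in parallax space, no logarithm or inverse of the measurement is ever required.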

In addition to making the above assumptions, we also avoid various practices with the data that are tempting but lead to important biases. The common practices of cutting to parallaxes that are good to 20% (see, for example, Helmi et al. 2018; Trick et al. 2019), or catalog parallaxes that are positive, or parallaxes that are smaller or larger than something, are all practices that will bias results on stellar collections. That is, if you cut on parallax or parallax signal-to-noise and you subsequently take an average of parallaxes for some population, or perform a regression (as we do here), the results of that average or regression will be strongly biased. We never cut on parallax or parallax signal-to-noise. We will (below) cut on Gaia uncertainty estimates on parallax. These uncertainty estimates are a function only of spacecraft orientation history (scanning law) and telemetry; they are not a function of the magnitudes of the parallaxes or their measurements. For these reasons, the uncertainty estimates on the parallaxes can be used to make sample cuts (and we do use them). These cuts will not bias our regressions. By using a justifiable likelihood function (the likelihood assumption above), we can use all of the Gaia parallaxes without fearing that the low signal-to-noise and negative parallaxes will cause trouble for our regressions, and without any of the biases that enter when cuts are made.
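The bias introduced by a positivity or signal-to-noise cut is easy to demonstrate with a toy Monte Carlo (the numbers below are invented for illustration): a distant population with true parallax smaller than the measurement noise has an unbiased sample mean only if the negative measurements are retained.

```python
import numpy as np

rng = np.random.default_rng(3)
true_parallax = 0.1                      # mas; a hypothetical distant population
sigma = 0.15                             # per-star noise larger than the signal
varpi = true_parallax + sigma * rng.normal(size=200_000)

mean_all = varpi.mean()                  # unbiased, despite negative values
mean_cut = varpi[varpi > 0].mean()       # biased high by the positivity cut
```

Here `mean_all` recovers the true 0.1 mas, while `mean_cut` overestimates it by more than 50%.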

Along similar lines, we never assume that the distance is the inverse of the measured parallax. In what follows, a star's distance is a latent property of the star, which generates the Gaia catalog parallax through a noisy process (again, this is the likelihood assumption above). We never take the inverse or the logarithm of the measured parallax at any time. This is related to the no cuts point above: if you take the inverse or logarithm of the parallax, you cannot operate safely on the negative and low signal-to-noise parallaxes, which in turn will require making cuts on the sample, which will in turn bias the results.

Finally, we never use Lutz–Kelker corrections (Lutz & Kelker 1973) or distance posteriors (e.g., Bailer-Jones et al. 2018). These both involve (implicit or explicit) priors on distance. When multiple stars are combined as we combine them here (below), use of distance posteriors instead of parallax likelihoods is not just unjustified, but it also leads to an effective raising of the distance prior to an enormous power. That is, the effective prior on a N-star inference performed naively with distance posteriors (made with a weak prior) can end up bringing into the inference an exceedingly strong prior. Therefore we do not use any prior-contaminated parallax or distance inputs to the inference.

3. Method

There are not many choices for building a model of the stars that is both justifiable probabilistically and consistent with the assumptions stated in the previous section. Here we lay out the model and methodology. We apply the method to real data in the following sections.

The model for the parallax and the log-likelihood function can be expressed heuristically as

Equation (1): $\varpi =\exp (\theta \cdot x)$

Equation (2): ${\chi }^{2}(\theta )=\displaystyle \sum _{n=1}^{N}\frac{{[{\varpi }^{({\rm{a}})}{}_{n}-\exp (\theta \cdot {x}_{n})]}^{2}}{{[{\sigma }^{({\rm{a}})}{}_{n}]}^{2}}$

where ${{\varpi }^{({\rm{a}})}}_{n}$ is the Gaia measurement (adjusted; see below) of the astrometric parallax of star n, the model is that the logarithm of the true parallax can be expressed as a linear combination of the components of some D-dimensional feature vector xn, θ is a D-vector of linear coefficients, ${{\sigma }^{({\rm{a}})}}_{n}$ is the Gaia estimate of the uncertainty on the parallax measurement, and χ2(θ) is (twice) the negative-log-likelihood for the parameters θ under the assumption of known Gaussian noise and that there are N independently measured stars n. The feature vector xn contains photometry in a few bands, and all 7400-ish pixels in the pseudo-continuum-normalized APOGEE spectrum, so D is on the order of 7400.
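A minimal sketch of Equations (1) and (2), assuming a feature matrix X with one row per star (names are illustrative):

```python
import numpy as np

def chi2(theta, X, varpi_obs, sigma_obs):
    """Twice the negative log-likelihood of Equations (1)-(2): the model
    parallax is exp(theta . x_n), compared with the (adjusted) Gaia
    parallax under known Gaussian noise."""
    model = np.exp(X @ theta)            # predicted parallax for each star
    return np.sum(((varpi_obs - model) / sigma_obs) ** 2)

# tiny synthetic check: two stars, two features
X = np.array([[1.0, 0.5], [1.0, -0.5]])
theta = np.array([-1.0, 0.2])
varpi = np.exp(X @ theta)                # noiseless "observations"
assert chi2(theta, X, varpi, np.ones(2)) == 0.0
```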

In addition, we assume that many entries in the parameter D-vector θ will be zero (the sparsity assumption). In order to represent this expectation, we optimize not χ2 but a regularized objective function

Equation (3): ${\chi }^{2}(\theta )+\lambda \,||P\,\theta |{|}_{1}$

where λ is a regularization parameter, P is a projection operator that selects out from the θ vector only those components that pertain to the APOGEE spectral pixels, and $||y|{|}_{1}$ is the L1-norm or sum of absolute values of the components of y. This kind of regularization (L1) adds a convex term to the optimization and leads to sparse optima. The P operator makes it such that we only regularize the components of θ that multiply the spectral pixels; we ask for the spectral model to be sparse but we do not ask for the photometric model to be sparse. We set the value of λ by cross-validation across the A/B split (see below). In the end, the L1 regularization zeros out about 75% of the model coefficients.
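The regularized objective of Equation (3) can be sketched as follows, with a boolean mask standing in for the projection operator P (all names here are illustrative):

```python
import numpy as np

def objective(theta, X, varpi_obs, sigma_obs, lam, spectral_mask):
    """Equation (3): chi-squared plus an L1 penalty applied only to the
    coefficients that multiply spectral pixels; the boolean spectral_mask
    plays the role of the projection operator P, so the photometric
    coefficients are left unpenalized."""
    model = np.exp(X @ theta)
    chisq = np.sum(((varpi_obs - model) / sigma_obs) ** 2)
    return chisq + lam * np.sum(np.abs(theta[spectral_mask]))

# photometric terms first (unpenalized), spectral-pixel terms after
theta = np.array([0.5, -0.2, 0.3, 0.0])
mask = np.array([False, False, True, True])
X = np.ones((3, 4))
obj = objective(theta, X, np.ones(3), np.ones(3), lam=10.0, spectral_mask=mask)
```

By construction, only the masked (spectral) coefficients contribute to the penalty term.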

This optimization is not convex—there are multiple optima in general. In particular, there is a large, degenerate, pathological optimum where the exponential function underflows, all parallaxes are predicted to vanish, and there is no gradient of anything with respect to the parameter D-vector θ. This large, bad optimum (it is a local, not global, optimum) must be avoided in optimization. In practice we avoid it by optimizing first for the very highest signal-to-noise (better than 5% measurements of astrometric parallax) training set stars, and then using that first optimum as a good initialization for the full optimization. This method works because in the limit of high signal-to-noise, the problem asymptotically approaches the convex problem of L1-regularized linear least-squares fitting. Optimization is performed with scipy.optimize using the L-BFGS-B algorithm (Byrd et al. 1995).
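The two-stage initialization strategy can be sketched on a toy problem (the data, thresholds, and dimensions below are invented for illustration; the real fit uses the 7414-dimensional features and the regularized objective):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# toy stand-in for the real fit: 2 features, 200 stars
X = np.column_stack([np.ones(200), rng.normal(size=200)])
theta_true = np.array([-0.5, 0.3])
sigma = np.full(200, 0.01)
varpi = np.exp(X @ theta_true) + sigma * rng.normal(size=200)
snr = np.abs(varpi) / sigma

def negloglike(theta, X, varpi, sigma):
    # Equation (2): chi-squared under the exponentiated linear model
    model = np.exp(X @ theta)
    return np.sum(((varpi - model) / sigma) ** 2)

# stage 1: fit only the high signal-to-noise stars to get a safe start,
# far from the pathological optimum where the exponential underflows
good = snr > 20.0
res0 = minimize(negloglike, x0=np.zeros(2),
                args=(X[good], varpi[good], sigma[good]), method="L-BFGS-B")

# stage 2: refit with all stars, initialized at the stage-1 optimum
res = minimize(negloglike, x0=res0.x, args=(X, varpi, sigma),
               method="L-BFGS-B")
```

The warm start keeps the optimizer in the basin of the good optimum, where the problem is nearly a linear least-squares fit.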

Once the model is optimized—and therefore $\hat{\theta }$ is determined—the output of the model is a prediction of the parallax, or really what we will call the spectrophotometric parallax. This spectrophotometric parallax ${{\varpi }^{(\mathrm{sp})}}_{m}$ for any star m in the validation or test set is assigned according to

Equation (4): ${\varpi }^{(\mathrm{sp})}{}_{m}=\exp (\hat{\theta }\cdot {x}_{m})$

where $\hat{\theta }$ is the optimal parameter vector according to Equation (3), and xm is the feature D-vector for star m. The model can be trained on a training set of stars and applied to a validation set or a test set to make predictions. The only requirements are that every test set object m must have a full feature vector xm just as every training set object n must have a full feature vector xn.

The linear model permits straightforward propagation of uncertainty from the feature inputs to the spectrophotometric parallaxes. Although the model fundamentally assumes that the prediction features are noise free (see the good features assumption in Section 2), there are in fact small uncertainties on the features that we can propagate to make uncertainty estimates on the spectrophotometric parallaxes

Equation (5): ${[{\sigma }_{m}^{(\mathrm{sp})}]}^{2}={[{\varpi }^{(\mathrm{sp})}{}_{m}]}^{2}\,{\hat{\theta }}^{\top }\,{C}_{m}\,\hat{\theta }$

where ${\sigma }_{m}^{(\mathrm{sp})}$ is the uncertainty on the inferred (or predicted) spectrophotometric parallax ${{\varpi }^{(\mathrm{sp})}}_{m}$, the right-hand side is a scalar product, and Cm is the covariance matrix of the input features. In what follows we do not actually instantiate the full covariance matrix; we presume that input features are independently measured; that is, we assume that Cm is a diagonal matrix with the uncertainty variances of the elements of xm along the diagonal.
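With a diagonal $C_m$, Equations (4) and (5) reduce to a few lines (names are illustrative): since the derivative of the predicted parallax with respect to the features is the parallax times θ, the propagated variance is the parallax squared times the quadratic form.

```python
import numpy as np

def spectrophotometric_parallax_with_error(theta_hat, x, x_var):
    """Equation (4) plus the diagonal-covariance version of Equation (5):
    since d(varpi)/dx = varpi * theta, the propagated variance is
    varpi^2 * theta^T C theta with C = diag(x_var)."""
    varpi_sp = np.exp(theta_hat @ x)
    sigma_sp = varpi_sp * np.sqrt(np.sum(theta_hat ** 2 * x_var))
    return varpi_sp, sigma_sp

theta_hat = np.array([-1.0, 0.2, 0.1])
x = np.array([1.0, 2.0, -1.0])            # toy feature vector
x_var = np.array([0.0, 0.01, 0.04])       # feature variances (the constant 1 is noise free)
varpi_sp, sigma_sp = spectrophotometric_parallax_with_error(theta_hat, x, x_var)
```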

We perform all of the fitting in a two-fold train-and-test framework, in which the data are split into two disjoint subsets, A and B. The model trained on the A data is used to predict or produce the spectrophotometric parallaxes ${{\varpi }^{(\mathrm{sp})}}_{m}$ values (and hence the distances and parallaxes) of the B data, and the model trained on the B data is used to produce the ${{\varpi }^{(\mathrm{sp})}}_{m}$ values of the A data. This ensures that, in the estimate of any individual star's spectrophotometric parallax, none of the Gaia data pertaining to that star were used. This makes the parallax estimates from the spectrophotometric feature vectors xm statistically independent of the Gaia parallax measurements. They are not independent globally—Gaia data were used to train the model—but each star's spectrophotometric parallax estimate is independent of the astrometric estimate from Gaia on a star-by-star basis.
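The A/B train-and-test scheme amounts to the following pattern (a placeholder stands in for the real optimization of Equation (3); all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stars = 1000
label_is_A = rng.random(n_stars) < 0.5    # random binary A/B label per star

def train(X, varpi, sigma):
    # placeholder for the real optimization of Equation (3)
    return np.zeros(X.shape[1])

X = rng.normal(size=(n_stars, 3))
varpi = rng.normal(size=n_stars)
sigma = np.ones(n_stars)

theta_A = train(X[label_is_A], varpi[label_is_A], sigma[label_is_A])
theta_B = train(X[~label_is_A], varpi[~label_is_A], sigma[~label_is_A])

# each star is predicted by the model trained on the *other* half, so its
# own Gaia parallax never enters its own spectrophotometric prediction
varpi_sp = np.where(label_is_A, np.exp(X @ theta_B), np.exp(X @ theta_A))
```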

4. Data

We take the APOGEE (Allende Prieto et al. 2008; Wilson et al. 2010; Majewski et al. 2017) spectral data from SDSS-IV (Blanton et al. 2017) DR14 (Abolfathi et al. 2018). Because we want to make a purely linear model, which has very little capacity, we restrict our consideration to a small region in stellar parameter space. We cut the APOGEE data down to the range of $0\lt \mathrm{log}g\lt 2.2$, which isolates stars that are more luminous than the red clump. The APOGEE pipeline (García Pérez et al. 2016) values of surface gravity $\mathrm{log}g$ (which we use) have uncertainties, but this cut leads to a clean sample of luminous red giants, and it is a cut only on spectral properties of the stars (not on photometry or astrometry).

From the APOGEE data on these low-gravity stars, we take the spectral pixels, of which there are 7405 (after cutting out pixels that rarely or never get data), on a common rest-frame wavelength grid. That is, every APOGEE star is extracted on (or interpolated to) the same wavelength scale. In detail, we obtain the APOGEE spectral pixels from the pipeline-generated aspcapStar files. We then pseudo-continuum-normalize the spectra according to the procedure developed in The Cannon (Ness et al. 2015): that is, the pseudo-continuum is a spline fit to a set of pixels chosen to be insensitive to stellar parameters. We use as our spectral data the normalized spectral pixel values.
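A schematic of this pseudo-continuum normalization, assuming the set of continuum pixels is already known (this is a sketch, not the actual implementation of The Cannon; all names are ours):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def pseudo_continuum_normalize(wavelength, flux, continuum_mask, s=1.0):
    """Sketch of the normalization step (after Ness et al. 2015): fit a
    smoothing spline through pixels pre-identified as insensitive to
    stellar parameters, then divide it out."""
    spline = UnivariateSpline(wavelength[continuum_mask],
                              flux[continuum_mask], s=s)
    return flux / spline(wavelength)

wave = np.linspace(15100.0, 17000.0, 200)  # Angstroms, an APOGEE-like range
flux = np.full(200, 2.0)                   # flat toy spectrum
mask = np.zeros(200, dtype=bool)
mask[::10] = True                          # hypothetical continuum pixels
norm = pseudo_continuum_normalize(wave, flux, mask)
```

For a flat input spectrum the normalized flux is unity everywhere, as expected.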

Because the APOGEE target selection is based on 2MASS (Skrutskie et al. 2006), every APOGEE star also comes with 2MASS photometry. That is, for each star n, we have 2MASS near-infrared photometry Jn, Hn, and Kn.

We used the Gaia (Gaia Collaboration et al. 2016) DR2 (Gaia Collaboration et al. 2018) data archive (Mora et al. 2018) official matches (P. M. Marrese et al. 2018, in preparation) to match the APOGEE+2MASS stars to the WISE catalog (Wright et al. 2010), according to the Gaia archive internal match criteria. This gives, for each matching star n, mid-infrared photometry W1n and W2n at 3.4 and 4.6 μm. In detail, we use the w1mpro and w2mpro catalog entries.

We match this full-match catalog to the Gaia DR2 using the Gaia archive official match (P. M. Marrese et al. 2018, in preparation) to the 2MASS IDs. We take from the Gaia DR2 catalog the photometric data Gn (phot_g_mean_mag), ${{G}_{\mathrm{BP}}}_{n}$ (phot_bp_mean_mag), and ${{G}_{\mathrm{RP}}}_{n}$ (phot_rp_mean_mag), which will become part of the feature vector xn.

We need complete feature-vector information for all stars. For this reason, we define the parent sample to be the set of all stars that meet the APOGEE and Gaia cuts and also have the complete set of photometry: Gn, ${{G}_{\mathrm{BP}}}_{n}$, ${{G}_{\mathrm{RP}}}_{n}$, Jn, Hn, Kn, W1n, and W2n and also an APOGEE spectrum. In addition, we applied two light color cuts to remove stars with obviously contaminated or outlying photometry:

Equation (6)

Equation (7)

These cuts removed roughly 2% of the APOGEE stars. This parent sample contains 44,784 stars and is shown in Figure 1.

Figure 1.

Figure 1. Top: color–color diagrams for the 44,784 parent sample stars. The points are colored by the APOGEE pipeline estimates of surface gravity. The color cuts used to trim outliers are shown as gray lines. Bottom: the same but for the 28,226 training set stars. Note that there is less color range in the training set than in the parent sample.


Every parent sample star gets, in addition, a randomly assigned binary label (A or B). This is used for two-fold validation and jackknife. In short, we will use the model trained on the training set A data to assign spectrophotometric parallaxes to the parent sample B data and use the model trained on the training set B data to assign spectrophotometric parallaxes to the parent sample A data.

For training, we need A and B training sets. We define the training set stars to be parent sample stars that also have a measured Gaia parallax. In addition to meeting all parent sample cuts, each training set star must meet an uncertainty criterion ${{\sigma }^{({\rm{a}})}}_{n}\lt 0.1\,\mathrm{mas}$, i.e.,

Equation (8): ${\sigma }^{({\rm{a}})}{}_{n}\lt 0.1\,\mathrm{mas};$

it must have been observed at more than eight (widely separated) epochs,

Equation (9): visibility_periods_used > 8;

as well as meet the additional quality criterion

Equation (10)

which eliminates detections that are not consistent with a single point-spread function (see Bailer-Jones et al. 2018), ensuring that the parallax measurements are good. These cuts leave 28,226 stars total in the training sets. But—as we have emphasized above—we never cut on parallax or on the ratio of parallax to its uncertainty. For this reason, most of the training set stars do not have significantly measured parallaxes, and more than 2% even have negative parallaxes! But we will perform our training such that it is unbiased in these circumstances. Indeed, the training would necessarily be biased if we ever cut on parallax or parallax signal-to-noise.

In detail, for each star n in the training sets, we take from Gaia DR2 (Gaia Collaboration et al. 2018) the astrometric parallax ${{\varpi }^{({\rm{a}})}}_{n}$ and its uncertainty ${{\sigma }^{({\rm{a}})}}_{n}$. Importantly, to each Gaia parallax measurement we add a positive offset ϖ0 to adjust for the underestimation of parallaxes reported by the Gaia Collaboration (Lindegren et al. 2018). The Gaia recommended all-sky average value is ϖ0 = 0.029 mas (Lindegren et al. 2018), but we adopt ${\varpi }_{0}=0.048$ mas because that value optimizes our cross-validation objective. It also happens to be in the range of other values found in the literature (Arenou et al. 2018; Zinn et al. 2019). Of course we really expect the offset to be a function of sky position and color (at least), so the offset we find is not recommended for further use as any kind of universal value. The spectrophotometric parallaxes we generate do not depend strongly on this choice. It does not have a large effect because it moves the training data by only a tiny fraction of the root variance of the uncertainties. It is also the case (for the same reason) that we do not strongly prefer (in a statistical sense) the ϖ0 = 0.048 mas value over the Gaia value of 0.029 mas.
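The selection of ϖ0 by cross-validation can be sketched as a one-dimensional grid search (the synthetic held-out predictions below are invented for illustration, with a true offset of 0.048 mas built in):

```python
import numpy as np

def cv_score(offset, varpi_obs, sigma_obs, varpi_pred):
    """Cross-validation objective for a candidate zero-point offset:
    chi-squared of held-out predictions against adjusted parallaxes."""
    return np.sum(((varpi_obs + offset - varpi_pred) / sigma_obs) ** 2)

rng = np.random.default_rng(1)
varpi_pred = rng.uniform(0.1, 1.0, size=5000)                 # mas, hypothetical
varpi_obs = varpi_pred - 0.048 + 0.02 * rng.normal(size=5000)  # offset + noise
sigma_obs = np.full(5000, 0.02)

grid = np.arange(0.0, 0.101, 0.001)  # candidate offsets in mas
best = grid[np.argmin([cv_score(w0, varpi_obs, sigma_obs, varpi_pred)
                       for w0 in grid])]
```

The grid minimum recovers the built-in offset to within the grid spacing.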

The training sets are shown in the bottom panels of Figure 1. Comparison of the top and bottom panels of Figure 1 shows that there is more color range in the parent sample than in the training set. This challenges the representativeness assumption in Section 2. The principal reason for the difference is that the Gaia quality cuts exclude stars preferentially from crowded regions, which also tend to be the most dust-obscured. The hope of the model assumptions is that the training set will contain sufficient dust variation that the model will naturally learn the dust corrections and extrapolate acceptably. In terms of the more mundane properties of Gaia parallax ${\varpi }^{({\rm{a}})}$, APOGEE spectroscopic signal-to-noise ratio, or surface gravity $\mathrm{log}g$, there are small differences between the distributions in the training sets and the parent sample. However, in each of these dimensions, the training set does span the full range of the parent sample, so the training set does have full coverage in the space.

For every star n in the full parent sample we construct the feature vector xn as

Equation (11): ${x}_{n}=[1,{G}_{n},{G}_{\mathrm{BP},n},{G}_{\mathrm{RP},n},{J}_{n},{H}_{n},{K}_{n},{W1}_{n},{W2}_{n},\mathrm{ln}\,{f}_{1n},\mathrm{ln}\,{f}_{2n},\ldots ,\mathrm{ln}\,{f}_{{Ln}}]^{\top }$

where the 1 permits a linear offset, the photometry is from Gaia, 2MASS, and WISE, respectively, the fluxes in the L = 7405 APOGEE spectral pixels (for which there are reliable and consistent data) for star n are denoted f1n, f2n, and so on, and we have taken logarithms of those fluxes. These feature vectors live in a D-dimensional space where D = 7414. For every star n in either of the training sets we additionally require a Gaia-measured astrometric parallax ${{\varpi }^{({\rm{a}})}}_{n}$ and uncertainty ${{\sigma }^{({\rm{a}})}}_{n}$.
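Assembling the feature vector of Equation (11) is then mechanical (the magnitudes below are invented for illustration; any star with a missing entry is excluded from the parent sample):

```python
import numpy as np

def make_feature_vector(G, G_BP, G_RP, J, H, K, W1, W2, spectral_fluxes):
    """Equation (11): a constant 1 (linear offset), eight photometric
    magnitudes, and the logarithms of the normalized spectral fluxes."""
    return np.concatenate([[1.0, G, G_BP, G_RP, J, H, K, W1, W2],
                           np.log(spectral_fluxes)])

x = make_feature_vector(12.0, 12.8, 11.2, 10.1, 9.5, 9.3, 9.2, 9.25,
                        np.ones(7405))   # D = 9 + 7405 = 7414
```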

5. Results and Validation

We optimize two models, one for training set A and one for training set B. The two models are applied to the two splits of the parent sample, parent sample A and parent sample B, using the A-trained model on parent sample B and vice versa. For the purposes of assessing the accuracy and precision of the model, this train-and-test framework constitutes a two-fold cross-validation. Results of this cross-validation are shown in Figure 2. For the stars with the highest Gaia-measured parallax signal-to-noise, the spectrophotometric model predicts the Gaia parallaxes with little bias and a scatter of less than 9%.

Figure 2.

Figure 2. Predictive accuracy of the model in the two-fold cross-validation. Each panel shows Gaia astrometric parallaxes ${{\varpi }^{({\rm{a}})}}_{n}$ on the horizontal axis and our spectrophotometric parallaxes ${\varpi }^{(\mathrm{sp})}$ on the vertical axis. The points are colored by the spectroscopic (APOGEE) signal-to-noise ratio. The left panel shows the full parent sample, the middle panel shows the training set (both A and B combined), and the right panel shows a subset of stars with very high astrometric parallax signal-to-noise. This latter set plays no particular role in the method, but it can be used to demonstrate or assess the prediction precision. The fractional precision of the prediction in the right panel is better than 9%. Note that by construction, there can be no negative spectrophotometric parallaxes; that is, the quantity plotted on the vertical axis is positive-definite.


We used this cross-validation framework to adjust the regularization parameter λ. We performed a coarse grid search in λ, using the A/B split as the two-fold cross-validation, with the objective of the cross-validation being prediction accuracy. We get a prediction accuracy of better than 9% at the (coarsely) optimal setting of the regularization strength.
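The coarse search over the regularization strength can be written as a one-liner over a grid; `cv_fractional_error` is a hypothetical callable that retrains both folds at a given λ and scores the held-out predictions against the Gaia parallaxes.

```python
import numpy as np

def coarse_grid_search(lambdas, cv_fractional_error):
    """Pick the L1 strength lambda minimizing the cross-validated
    fractional prediction error (lower is better). The callable is
    assumed to wrap the full two-fold train-and-test machinery."""
    scored = [(lam, cv_fractional_error(lam)) for lam in lambdas]
    return min(scored, key=lambda pair: pair[1])

# e.g. a coarse logarithmic grid (the grid spacing is our choice):
lambda_grid = np.logspace(-2, 2, 9)
```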

This 9% precision estimate is conservative in the sense that it does not deconvolve or correct for the contributions from the Gaia noise. However, this test is performed with Gaia's best stars, which are bright, nearby, in low-extinction regions, and (because of these things) near solar metallicity. There could in principle be additional bias or scatter for stars that are in dustier regions or at lower metallicities. Figure 3 shows no suggestion of any such trends when we color the residuals by metallicity or H − W2 color (which is a reddening proxy). However, that figure does show that stars with greater reddening have larger scatter against the Gaia parallaxes. Additional tests confirm that our spectrophotometric parallaxes have the largest scatter in the dustiest and most crowded regions, as expected given the relatively low angular resolution of the photometry.

Figure 3. Prediction validation colored by relevant features. Each panel shows the Gaia astrometric parallax vs. the spectrophotometric parallax, colored by a different data feature that might be relevant to the residuals. Again the quantity plotted on the vertical axis is positive-definite.


Another validation of the results can be obtained by looking at known clusters or spatially compact objects in the data. In Figure 4, we show the parallax distribution from Gaia and the spectrophotometric parallax distribution from our model for stars that are astrometrically confirmed members of three stellar clusters. The cluster members were identified by hand, by inspecting sky and proper-motion distributions centered on literature values. In each case we plot only very securely identified cluster members; that is, the membership is conservative. We chose the three displayed clusters to span some range in metallicity, age, and extinction, and therefore to test the accuracy of the method for different kinds of populations. With the exception of M71, the clusters indicate that both the Gaia and the spectrophotometric parallaxes are unbiased over the range of abundances and ages they represent. The case of M71 is troubling, but after searching for trends with housekeeping data (also see Figure 3) we do not find any correlations with the parallax offsets, although M71 is a higher-extinction cluster.

Figure 4. Gaia astrometric parallaxes ${{\varpi }^{({\rm{a}})}}_{n}$ and spectrophotometric parallaxes ${{\varpi }^{(\mathrm{sp})}}_{n}$ for member stars for three of the stellar clusters in the APOGEE observing footprint. The spectrophotometric parallaxes and their inverse-variance-weighted mean and the astrometric parallaxes and their inverse-variance-weighted mean are shown as vertical lines. Cluster members were identified to be close to the cluster in celestial coordinates and comoving in proper motion.
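The per-cluster vertical lines are inverse-variance-weighted means; a minimal sketch of that combination (the function name is ours):

```python
import numpy as np

def ivar_weighted_mean(values, sigmas):
    """Inverse-variance-weighted mean and its formal uncertainty,
    as used for the per-cluster parallax lines in Figure 4."""
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    mean = np.sum(w * np.asarray(values, dtype=float)) / np.sum(w)
    return mean, 1.0 / np.sqrt(np.sum(w))
```

Members with small uncertainties dominate the mean; a member with a 100x larger error bar is effectively ignored.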


It is interesting and valuable to compare the spectrophotometric parallax precision to the astrometric parallax precision. Figure 5 compares uncertainties between the spectrophotometric outputs and the astrometric inputs (and recall that information is proportional to the inverse square of the uncertainty). The brief summary is that the spectrophotometric parallaxes are more precise (and even far more precise) than the astrometric parallaxes for stars further than a few kiloparsecs from the Sun. This is not unexpected; Gaia parallax uncertainties are in the 0.1 mas range, whereas our uncertainties are in the 10% range.
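As a toy version of this comparison (all numbers here are illustrative, not fits to the catalog): with a constant astrometric uncertainty and a constant fractional spectrophotometric uncertainty, the two are equally informative where they are equal.

```python
# Toy crossover between a constant astrometric uncertainty and a
# constant fractional spectrophotometric uncertainty; beyond the
# crossover distance the spectrophotometric estimate carries more
# information (information ~ 1/sigma^2). Both numbers are
# illustrative assumptions, not values from the catalog.
sigma_ast_mas = 0.1   # assumed Gaia parallax uncertainty
frac_sp = 0.10        # assumed spectrophotometric fractional error
varpi_cross_mas = sigma_ast_mas / frac_sp   # parallax where they match
d_cross_kpc = 1.0 / varpi_cross_mas         # corresponding distance
```

Real Gaia uncertainties vary star by star (and are often larger than this toy value), which is why the crossover quoted in the text is a few kiloparsecs.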

Figure 5. Comparison of precisions between the spectrophotometric parallaxes and the astrometric parallaxes, as a function of magnitude and heliocentric distance. The distances shown here are simply the inverses of the spectrophotometric parallaxes ${{\varpi }^{(\mathrm{sp})}}_{n}$. Note that the spectrophotometric estimates are more precise for most stars. The information content of a parallax estimate (or any measurement) is proportional to the inverse square of its uncertainty.


It is also interesting to compare these estimates to other, previous spectroscopic distance efforts. We performed comparisons with two other efforts (Schultheis et al. 2014; Queiroz et al. 2018). In both cases, the agreement is better at smaller distances (larger parallaxes) and worse at larger distances, with the distances (parallaxes) presented here being smaller (larger) on average than the other investigations' results for the most distant (and hence most difficult) stars. In detail, the comparison with the more recent sample (Queiroz et al. 2018) is better: the scatter is at the 15% level for stars at distances <8 kpc, and there is no visible bias between the results at these distances. The distances presented here also compare well with those from a recent machine-learning approach (Leung & Bovy 2019), which appeared while this paper was under review.

In the end, this method produces a spectrophotometric parallax estimate ${{\varpi }^{(\mathrm{sp})}}_{m}$ and uncertainty ${\sigma }_{m}^{(\mathrm{sp})}$ for every star m. Table 1 shows a snippet of this catalog of output, with the full catalog available as an electronic data table associated with this paper.7 In order to use this output with Gaia astrometry, we recommend joining our catalog to the Gaia catalog on the 2MASS ID (name), as per the instructions at the Gaia archive (Mora et al. 2018). In order to use this output with APOGEE radial velocities and element abundances, we recommend joining our catalog to the SDSS-IV catalog on the 2MASS ID (name). In addition to this catalog, all the code used for this project is available online in an associated code repository.8
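The recommended join on the 2MASS ID can be done with a standard table merge; the mini-tables below are hypothetical two-row stand-ins, with only the `2MASS_ID` key and the Table 1 column names taken from this paper.

```python
import pandas as pd

# Hypothetical mini-catalogs for illustration; in practice these
# would be the full output catalog and a Gaia/APOGEE table.
spec = pd.DataFrame({
    "2MASS_ID": ["2M00000002+7417074", "2M00000317+5821383"],
    "spec_parallax": [0.3185, 0.4428],
    "spec_parallax_err": [0.0164, 0.0183],
})
gaia = pd.DataFrame({
    "2MASS_ID": ["2M00000002+7417074", "2M00000317+5821383"],
    "parallax": [0.3191, 0.3643],
})

# Inner join on the shared 2MASS identifier.
joined = spec.merge(gaia, on="2MASS_ID", how="inner")
```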

Table 1.  The Generated Spectrophotometric Parallaxes and Their Uncertainties

2MASS_ID Gaia_parallax Gaia_parallax_err spec_parallax spec_parallax_err training_set sample
2M00000002+7417074 0.3191 0.0329 0.3185 0.0164 0 B
2M00000317+5821383 0.3643 0.0510 0.4428 0.0183 0 A
2M00000546+6152107 0.3154 0.0351 0.3292 0.0164 0 B
2M00000797+6436119 0.0771 0.0841 0.1943 0.0194 1 B
2M00001719+6221324 0.1395 0.0439 0.1564 0.0128 0 B
2M00002005+5703467 0.3003 0.0247 0.2372 0.0203 0 B
2M00002036+6153416 0.0657 0.0637 0.1347 0.0131 1 A
2M00002118+6136420 0.1936 0.0323 0.2173 0.0133 1 B
2M00002227+6223341 0.1965 0.0450 0.1461 0.0089 0 B
2M00002472+5518473 0.0947 0.0207 0.1473 0.0096 1 B
2M00002908+6140446 0.1675 0.0460 0.1914 0.0176 1 A
2M00003061+5820590 0.1669 0.0339 0.3502 0.0290 1 A

Note. This table is just a snippet; the full catalog of all 44,784 objects in the combined parent samples is available online. In order to reproduce the figures in this paper, this catalog should be joined to the Gaia, 2MASS, WISE, and APOGEE catalogs by 2MASS ID. All floating point numbers are angles in mas.


6. Discussion

We have demonstrated that photometry and spectroscopy can be used to predict or estimate distances to stars. This is not new! What is new is that it is possible to do this very well for luminous giants with unknown ages, with a purely linear model acting on magnitudes and (logarithmic) spectral fluxes, and in an entirely data-driven way: a linear model trained on noisy parallax data from Gaia.

The model and method are based on a raft of assumptions, listed above in Section 2. We make some comments there, but we emphasize some particulars here.

One strong assumption of this work is that our set of features is sufficient to build a predictive model for distance. Our argument for our included features is that they ought to be sufficient to estimate a dust-independent apparent magnitude and the stellar parameters (especially age and surface gravity) that are sensitive to luminosity. In particular, since our model is linear in the logarithm of the parallax, it is also linear in any kind of apparent magnitude at fixed distance and also linear in distance modulus at fixed luminosity. Our feature decisions were highly motivated by these kinds of considerations.
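The linearity claims follow from a standard identity (with ϖ in arcseconds so that d = 1/ϖ in parsecs):

```latex
\ln \varpi = \theta^{\top} x
\qquad\Longrightarrow\qquad
\mu \equiv m - M = 5\log_{10} d - 5
   = -\frac{5}{\ln 10}\,\ln\varpi - 5
   = -\frac{5}{\ln 10}\,\theta^{\top}x - 5 ,
```

so any model linear in the log-parallax is linear in the distance modulus μ, and hence linear in any apparent magnitude at fixed luminosity (or distance).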

However, our choice of features is very rigid. In principle the data themselves should tell us what features to include. In that direction, it would make sense to choose features using a more complex technique like deep learning or an auto-encoder. These methods generate good features automatically. They also involve an enormous set of hyper-parameter-like choices, and they involve sacrificing certain kinds of interpretability. All that said, we expect that a better set of features would do better; in this sense the project here is just a first step toward precise spectrophotometric distance estimates.

As we said above, we required that the log-parallax (or, equivalently, log-distance) prediction be constructed from a linear combination of the input data or feature vector components. This may seem like an absurd simplification; it raises two questions: why did it work, and could we have done better? The answer to the first question is that our surface gravity cut restricts us to such a small part of stellar parameter space that the linearized model of stellar spectrophotometry is not a bad approximation. Also, because our linear model included photometry in magnitudes (that is, logarithmic) and was exponentiated to make the parallax prediction, we knew in advance that much of the needed flexibility (for dust attenuation, for example) would be close to linear. That is, it was a combination of limited model scope and some cleverness in the feature engineering and model structure.

The answer to the second question is that almost certainly we could have done better! Since models like Gaussian processes and deep networks subsume the linear model (at least approximately), they could have delivered better—or at least similar—results. And indeed, after the appearance of this manuscript, such an investigation has appeared (Leung & Bovy 2019) and it does produce very good results. The issues with going to a much more complex model are manifold, however. We would have had many more decisions to make and many more hyper-parameters to set. We would have used some (or maybe much) of the information in the data to learn the simple fact that the problem is close to linear. A more complex model might also have edge issues, because flexible models do not extrapolate outside the convex hull of their training data. In the linear model it is also—unlike in more flexible models—trivial to propagate uncertainties from the feature space to the prediction. We used that, above, to propagate uncertainty in Equation (5). And finally—in principle—linear models are better for visualization and interpretation; the linear model contains essentially only first derivatives in the data space.
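A minimal sketch of the first-order propagation alluded to here; this shows the idea behind the paper's Equation (5), not its exact form, and the covariance handling is our simplification.

```python
import numpy as np

def propagate_parallax_uncertainty(theta, x, C_x):
    """First-order error propagation through varpi = exp(theta . x).
    C_x is the covariance matrix of the feature vector x: the
    log-parallax variance is theta^T C_x theta, and the chain rule
    (d varpi / d ln varpi = varpi) converts it to a parallax
    uncertainty. A sketch, not the paper's exact Equation (5)."""
    varpi = np.exp(theta @ x)
    sigma_ln = np.sqrt(theta @ C_x @ theta)
    return varpi, varpi * sigma_ln
```

This triviality of propagation is exactly what a linear (log-linear) model buys relative to a deep network, where the Jacobian varies over the data space.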

Another advantage of a linear model over more general methods (like deep learning, say), is that it is possible to look inside the linear model and check whether the dependencies represented there make sense. For example, the feature vector xn for star n contains a set of photometry in magnitudes. Linear combinations of these photometric measurements are like complex synthetic magnitudes or colors. Similarly, linear combinations of spectral pixels are like a projection of the spectral space.

For an example of how the model might be interpreted, we can look at the photometric part of the linear model and see if the combination of photometry does indeed look like a dust-corrected absolute magnitude, as we expected (see Section 4). It does! Indeed the model automatically finds a photometric combination that includes a term close to H − W2, which is known as a good extinction estimator. And we find that on the two-fold A/B split, the photometric model is extremely stable. As for the spectroscopic-pixel part of the model, we find that interpretation is less simple, for two reasons: one is that the number of pixels is enormous, so the model has huge flexibility. Another is that although the two splits, A and B, both give similar-performing models, they have (in detail) different linear coefficients in the spectral space. That is, the model is not as stable in the spectroscopic domain as it is in the photometric domain. This is a generic feature of high-capacity machine-learning methods. And it means that we did not obtain as much insight from the spectroscopic model as we had hoped up front.

Although the model, training, and prediction make no use whatsoever of stellar models, we did use stellar models indirectly to select the parent sample of stars. We used the APOGEE pipeline (García Pérez et al. 2016) surface gravity $\mathrm{log}g$ measurements. This tiny use of stellar models could have been obviated by selecting stars not on the basis of the derived physical parameter, but instead selecting stars to be similar, observationally, in the spectral space. That would have led to a clean sample and made the project completely independent of stellar models.

Our uncertainty estimation is very naive; it presumes that the features are measured with no intrinsic correlations (probably not true for neighboring pixels in the APOGEE spectra, for example) and it does not allow for any biases in the feature inputs. What is shown in Figure 3 is that the biggest deviations are for stars with large (red) H − W2 colors. These are in dusty and crowded regions; we expect that it is the crowding that causes the greatest problem, because Gaia and WISE (and probably also 2MASS) photometry are all affected by overlapping and blended sources.

We validated the uncertainty estimates by looking at the chi-squared statistic and robust variations of it. We find that chi-squared is larger than expected if the errors are treated naively. However, robust estimates show that this is driven by outliers; median absolute differences are consistent with our expectations under this very naive noise model. We also find that the outliers that drive up the chi-squared statistic tend to be red in H − W2.
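The chi-squared and robust checks described here can be sketched as follows; the quadrature combination of the two error bars is the naive noise model discussed in the text, and the function name is ours.

```python
import numpy as np

def residual_statistics(pred, varpi_ast, sigma_pred, sigma_ast):
    """Naive chi-squared per star and a robust, MAD-based scatter of
    the normalized residuals against Gaia. Outliers inflate the
    former but barely move the latter, which is the behavior
    reported in the text."""
    z = (pred - varpi_ast) / np.hypot(sigma_pred, sigma_ast)
    chi2_per_star = np.mean(z ** 2)
    # 1.4826 converts median absolute deviation to a Gaussian sigma.
    robust_sigma = 1.4826 * np.median(np.abs(z - np.median(z)))
    return chi2_per_star, robust_sigma
```

For well-calibrated Gaussian errors both statistics are close to unity; a chi-squared well above a near-unity robust sigma is the outlier signature.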

Figure 4 suggests that there may be some hidden biases in the results; the cluster M71 looks inconsistent between the Gaia and the spectrophotometric parallaxes. We have looked at colors, sky position, and cluster properties, and we do not have any simple explanation for the discrepancy. This is left as a caveat for the reader and our users.

Because of our train-and-test framework, each star is given a spectrophotometric parallax based on coefficients derived from training on a complementary training set sample, of which that star is not a member. As we emphasize above, this ensures that the Gaia measurement ${{\varpi }^{({\rm{a}})}}_{m}$ of star m is never used, even indirectly, in generating the spectrophotometric estimate ${{\varpi }^{(\mathrm{sp})}}_{m}$ for that star m. This leads to true statistical independence of the parallax estimates generated here from the Gaia parallaxes. The two could even be combined naïvely in an inference.

This paper and project deliver greatly improved distance estimates for luminous red-giant stars, more luminous than the red clump. The purpose of all this is to make global maps of the Milky Way. Just as a demonstration, we show kinematics and element abundances for stars as a function of position in the Milky Way disk in Figure 6. In this visualization, it is possible to see the rotation of the disk and the kinematic center of the Galaxy; that might even provide a distance estimate to the Galactic center. It also shows extremely strong chemical gradients. In a companion paper (Eilers et al. 2019), we use this information to estimate the (asymmetric-drift-corrected) circular-velocity curve for the Milky Way.

Figure 6.

Figure 6. Map of the kinematics and element abundances in the Milky Way disk. Each arrow represents the three-space Galactocentric velocity of each star in the sample, projected onto the Galactic disk plane. To make these velocities, raw Gaia proper motions and APOGEE radial velocities and (for distance) inverses of spectrophotometric parallaxes were used (in a standard Galactocentric transformation), so the map shown here is not in any way noise-deconvolved. Each arrow has a color corresponding to the APOGEE reported metallicity. This map shows the rotation of the Galaxy out to large radius, and also indicates the kinematic center of the disk rotation. The stars shown in this map have distance uncertainties on the order of 10%, so at significant heliocentric distances, the interpretation of the quantitative information in this plot requires care.


Finally, one comment on the use of these spectrophotometric parallaxes. They are noisy. And they linearly combine a set of features into a logarithmic parallax or logarithmic distance. Because the input features have close-to-Gaussian noise (mainly because they are observed at high signal-to-noise), the linear combination will have close-to-Gaussian noise. That is, the noise in the estimated distances or parallaxes ought to be close to Gaussian in the logarithm. This is a consideration in the use of these outputs: precise use of these involves building a likelihood function and performing inference, just as the precise use of the Gaia data presented here required the same. Justified likelihood functions are critical to making good inferences.
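A minimal sketch of the recommended usage, assuming (as argued above) Gaussian noise in the log-parallax; this is an illustrative likelihood for the distance, not the exact one any downstream paper uses. Units are mas for the parallax and kpc for the distance, so that ϖ = 1/d.

```python
import numpy as np

def ln_likelihood_distance(d_kpc, varpi_sp_mas, frac_sigma):
    """Log-likelihood (up to an additive constant) for a heliocentric
    distance d, given a spectrophotometric parallax with Gaussian
    noise in the logarithm. frac_sigma is the fractional parallax
    uncertainty, which for small values equals the log-space sigma.
    A sketch of the recommended inference, not a published formula."""
    resid = np.log(varpi_sp_mas * d_kpc)  # ln(varpi_sp) - ln(1/d)
    return -0.5 * (resid / frac_sigma) ** 2
```

The likelihood peaks at d = 1/ϖ and is symmetric in log-distance, which is the sense in which the noise is "Gaussian in the logarithm."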

It is a pleasure to thank Coryn Bailer-Jones (MPIA), Sven Buder (MPIA), Andy L. Jones, Daniel Michalik (ESTEC), Melissa Ness (Columbia), Joel Zinn (OSU), and the participants in the MPIA Milky Way Group Meeting and the APOGEE Science Telecon, for help with this project. The final version of this paper benefited substantially from the comments from the anonymous referee. This project was developed in part at the 2018 NYC Gaia Sprint, hosted by the Center for Computational Astrophysics of the Flatiron Institute in New York City in 2018 June.

This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC; https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.

This publication makes use of data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation.

This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, funded by the National Aeronautics and Space Administration.

Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS website is www.sdss.org.

SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe/University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam, Max-Planck-Institut für Astronomie (Heidelberg), Max-Planck-Institut für Astrophysik (Garching), Max-Planck-Institut für Extraterrestrische Physik, National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

Facilities: SDSS-IV (APOGEE-2), Gaia, 2MASS, WISE.

Software: Astropy (Astropy Collaboration et al. 2013; The Astropy Collaboration et al. 2018), IPython (Pérez & Granger 2007), matplotlib (Hunter 2007), numpy (Oliphant 2006), scipy (Jones et al. 2001).
