Temperatures and Metallicities of M Dwarfs in the APOGEE Survey

M dwarfs have enormous potential for our understanding of structure and formation on both Galactic and exoplanetary scales through their properties and compositions. However, current atmosphere models have limited ability to reproduce spectral features in stars at the coolest temperatures (Teff < 4200 K) and to fully exploit the information content of current and upcoming large-scale spectroscopic surveys. Here we present a catalog of spectroscopic temperatures, metallicities, and spectral types for 5875 M dwarfs in the Apache Point Observatory Galactic Evolution Experiment (APOGEE) and Gaia-DR2 surveys using The Cannon: a flexible, data-driven spectral-modeling and parameter-inference framework demonstrated to estimate stellar-parameter labels (Teff, log g, [Fe/H], and detailed abundances) to high precision. Using a training sample of 87 M dwarfs with optically derived labels, with Teff calibrated with bolometric temperatures and [Fe/H] (up to 0.5 dex) calibrated with FGK binary metallicities, we train a two-parameter model with predictive accuracy (in cross-validation) of 77 K and 0.09 dex, respectively. We also train a one-dimensional spectral classification model using 51 M dwarfs with Sloan Digital Sky Survey optical spectral types ranging from M0 to M6, to a predictive accuracy of 0.7 types. We find Cannon temperatures to agree to within 60 K with color-derived temperatures for a subsample of 1702 sources, and Cannon metallicities to agree to within 0.08 dex for a subsample of 15 FGK+M or M+M binaries. Finally, our comparison between Cannon and APOGEE pipeline (ASPCAP DR14) labels finds that ASPCAP is systematically biased toward reporting higher temperatures and lower metallicities for M dwarfs.


Introduction
Low-mass stars, with masses M⋆ < 0.7 M⊙ and effective temperatures Teff < 4000 K, are by far the most common type of star, comprising ∼70% of the Galaxy's population by number (Bochanski et al. 2010). With nuclear fusion timescales τ > 10¹¹ yr (Laughlin et al. 1997), the chemical compositions of the M-dwarf population trace the nucleosynthetic processes and interstellar mixing of heavy elements from many generations of shorter-lived, high-mass stars, and are a unique probe for piecing together Galactic structure and evolution (Bochanski et al. 2010; Woolf & West 2012).
Additionally, the low masses of M dwarfs make for easier detection of planets by variability in radial velocity (Trifonov et al. 2018), high ratios of planet-to-star radii make for easier detection of exoplanet transits in observations of light curves (Nutzman & Charbonneau 2008), and shorter orbital periods (for a fixed stellar insolation flux) allow for discovery of new planets in less observation time than for more massive stars. For these reasons, M dwarfs are primary candidates for exoplanet searches, including by the NASA Kepler (e.g., Dressing & Charbonneau 2015) and Transiting Exoplanet Survey Satellite (e.g., Muirhead et al. 2018) missions. As a result, detailed and precise knowledge of M-dwarf chemical compositions has become key to constraining the properties, formation scenarios, and atmospheric conditions of potentially habitable exoplanets observable with the James Webb Space Telescope (Clampin 2008).
Advances in instrumentation and the implementation of several spectroscopic surveys in the past decade, such as the Sloan Digital Sky Survey (SDSS; Eisenstein et al. 2011; Blanton et al. 2017) and the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Zhao et al. 2012), have dramatically increased the sample of known M dwarfs (West et al. 2011; Guo et al. 2015), with spectroscopic catalogs of over 70,000 sources, enabling studies of M-dwarf abundances on a Galactic scale. The Apache Point Observatory Galactic Evolution Experiment (APOGEE; Majewski et al. 2015) survey, as part of the SDSS III/IV mission, has introduced the largest sample of M dwarfs observed with high-resolution spectroscopy (Deshpande et al. 2013).
Elemental abundance measurements from high-resolution spectra of F, G, and K stars have achieved extremely high precision (down to 0.01-0.03 dex; Nissen & Gustafsson 2018), enabled by improvements in atmosphere models including realistic assumptions of 3D local thermodynamic equilibrium (Asplund 2005), and by differential abundance techniques using equivalent widths (Bedell et al. 2014). However, the determination of precise metallicities for M dwarfs has remained a long-standing challenge due to the formation of diatomic and triatomic molecules at M-dwarf temperatures, with absorption from TiO and VO in the optical, H2O and CO in the infrared, and hydrides (FeH, CaH, CrH, MgH, etc.) present in the spectra of the latest spectral types (Allard et al. 1997). Atmospheric models often fail to reproduce these spectral features (e.g., Mann et al. 2013b) because of incomplete line lists and opacities. The presence of millions of weak, blended transitions and the absence of a clear continuum make it difficult to deconvolve individual features and extract line strengths from equivalent widths. The combination of these effects limits our ability to explore the information content of high-resolution spectra using traditional methods.
A number of studies focused on improving the precision of M-dwarf metallicities have used systems of M dwarfs in common proper motion with an FGK star, together with strong, isolated lines in the spectra of the M dwarf (e.g., Rojas-Ayala et al. 2010; Terrien et al. 2012; Neves et al. 2014; Newton et al. 2014; Lindgren et al. 2016), to develop precise empirical relations (as good as ∼0.07 dex). However, these metallicity calibrations do not take advantage of the full wavelength coverage available, nor of the information about the overall spectral shape that is often used to determine Teff and spectral type. Furthermore, earlier calibrations are generally based on moderate-resolution data (with some exceptions: Neves et al. 2014; Lindgren et al. 2016) and so cannot use the greater spectral information provided by APOGEE's resolution.
In this work we build a data-driven model for M-dwarf APOGEE spectra with The Cannon (Ness et al. 2015; Casey et al. 2016; Ho et al. 2017a; Behmard et al. 2019), a fully empirical model that employs no line lists or radiative transfer models. The Cannon is a generative model that parameterizes the flux at each pixel of a spectrum in terms of a set of stellar labels (a flexible number of parameters chosen by the user; described in more detail in Section 3). The model in this sense is used to transfer labels from spectra for which we know parameters to those for which we do not. This data-driven approach effectively circumvents the challenges of physically modeling the atmosphere of a star (and associated issues such as incomplete line lists or opacities), provided that we have a subset of spectra in the data set with known (and very accurately measured) reference labels, possibly measured from other data.
The data-driven approach of The Cannon is ideal in certain cases: if stellar labels are known for a small number of stars but there are spectra taken for many more; if it is computationally expensive to obtain labels for a star, and there are many stars that need labels; or if there are spectral models or techniques that work in one wavelength range or resolution but not in another. Existing methods to model M-dwarf spectra in the near infrared at high resolution are computationally expensive, and often calibrated over a narrow range of Teff and/or metallicity. The Cannon thus fills this niche: it does not require the use of specific lines or opacity information that may be missing from the models; instead it allows us to determine labels from the low-level metallicity information spread across thousands of lines, and as we demonstrate, it does so with very good precision.
Here we take M-dwarf labels from samples of well-characterized stars that are present in the SDSS-IV APOGEE sample, and use those labels to train a model and label all of the M dwarfs observed by SDSS-IV APOGEE. One set of labels consists of physical parameters (effective temperatures and metallicities); the other consists of spectral types. This paper is organized as follows: in Section 2 we describe the technical specifications of the data from the APOGEE and Gaia surveys, as well as previous studies of M dwarfs in APOGEE. Section 3 describes our model implementation using The Cannon framework, and Section 4 describes our sample selection and derivation of training parameters. In Section 5 we present our experimental results, evaluate the predictive accuracy of our models, apply our model to a selected test sample of nearly 6000 sources, and examine the validity of our parameters against color-temperature relations and metallicities of binary pairs. Finally, in Section 6 we discuss model performance, future improvements, and implications of our results.

Data
The APOGEE survey is a high-resolution (R ∼ 22,500), H-band (1.5-1.7 μm), multi-epoch survey that has obtained over 250,000 stellar spectra up to its fourteenth data release (DR14; Abolfathi et al. 2017). Fundamental parameters for each of these stars are estimated by the APOGEE Stellar Parameter and Chemical Abundances Pipeline (ASPCAP; García Pérez et al. 2016), which employs a χ² fitting procedure using the FERRE code to fit radiative transfer models and determine atmospheric parameters, 15 chemical abundances, and microturbulence parameters (Mészáros et al. 2012). The pipeline uses MARCS plane-parallel/spherical models (Gustafsson et al. 2008) for low temperatures (2800 K < Teff < 3500 K), and ATLAS9 plane-parallel models (Castelli & Kurucz 2003) for higher temperatures (Teff ≥ 3500 K). APOGEE is primarily designed to target bright stellar populations, particularly red giants, with dereddened photometry and color cutoffs of 7 ≤ H ≤ 13.8 and [J − K]_0 ≥ 0.5 (Zasowski et al. 2013), with the objective of studying Galactic composition and evolution. However, numerous cool, main-sequence sources have also been observed, either as targets proposed by the APOGEE M-dwarf ancillary survey (∼1200 sources; Deshpande et al. 2013) or serendipitously.
A number of studies from the M-dwarf ancillary survey have already measured fundamental atmospheric parameters and kinematics using spectral synthesis with atmospheric model grids. These studies include Deshpande et al. (2013) and Gilhool et al. (2018), which studied the radial and rotational kinematics of 700+ sources; Souto et al. (2017, 2018), which modeled three exoplanet-hosting M dwarfs (Kepler-138, Kepler-186, and Ross 128), determining Teff/log g/metallicity and 13 elemental abundances; Rajpurohit et al. (2018), which tested BT-Settl (Allard et al. 2012) and MARCS (Gustafsson et al. 2008) model grids on 45 M dwarfs to estimate Teff/log g/metallicity; and Skinner et al. (2018), which identified and measured mass ratios and radial velocities for 44 M-dwarf spectroscopic binaries. This work complements existing studies by producing a model-independent catalog of spectroscopic temperatures and metallicities to test against model predictions for the entire APOGEE M-dwarf sample, which we quantify to contain at least 10,000 sources to date (DR14).
The ASPCAP pipeline releases several types of data files, with various levels of processing: ap1D (the raw one-dimensional spectra for individual visits), apVisit (the individual visit spectra with telluric subtraction), apStar (the co-added apVisit spectra), and aspcapStar, which contains the pseudo-continuum-normalized, rest-frame-shifted, co-added spectrum of all observed epochs (see García Pérez et al. 2016 for a complete description of the pipeline). We use the aspcapStar spectra for our study. Previous work has recommended an alternative pseudo-continuum normalization (Ness et al. 2015), but we did not find obvious issues with the normalization in our analysis, so we retain the survey pipeline outputs.

Method
The Cannon is a regression model that relies on two assumptions: first, that sources with identical labels have near-identical flux at each wavelength pixel; and second, that the expected flux at each pixel varies continuously with change in label.
Inferring the label of a star with such a model requires two steps: first, the training step, in which a generative model describing the probability density function of the flux is constructed at each pixel from the set of spectra with known reference labels; and second, the test step, in which the model is applied to determine the labels of a spectrum.
Following the procedure of Ness et al. (2015) and Ho et al. (2017a), we adopt a simple linear model that assumes that the flux at each pixel of the spectrum can be parameterized as a function of a label vector ℓ and coefficient vector θ. For each star n, at wavelength pixel λ, we assume that the measured flux is the sum of the product of the coefficient and label vectors, and observational noise:

$$ f_{n\lambda} = \boldsymbol{\theta}_\lambda^{T} \cdot \boldsymbol{\ell}_n + \left[ s_\lambda^2 + \sigma_{n\lambda}^2 \right]^{1/2} \xi_{n\lambda}, \tag{1} $$

where the bracketed term is the quadrature sum of the intrinsic scatter of the model at each pixel, s_λ, and the uncertainty due to instrumental effects, σ_nλ, which is then multiplied by a Gaussian random number ξ_nλ with zero mean and unit variance. This corresponds to the single-pixel log-likelihood function

$$ \ln p\left(f_{n\lambda} \mid \boldsymbol{\theta}_\lambda, \boldsymbol{\ell}_n, s_\lambda^2\right) = -\frac{1}{2} \frac{\left[ f_{n\lambda} - \boldsymbol{\theta}_\lambda^{T} \cdot \boldsymbol{\ell}_n \right]^2}{s_\lambda^2 + \sigma_{n\lambda}^2} - \frac{1}{2} \ln\left( s_\lambda^2 + \sigma_{n\lambda}^2 \right), \tag{2} $$

which gives the probability density function of the measured flux, given the labels, coefficients, and scatters. We apply a quadratic parameterization of the model such that the label vectors for the two models are all combinations of reference labels up to second order:

$$ \boldsymbol{\ell}_n = \left[ 1,\ \mathrm{SPT},\ \mathrm{SPT}^2 \right], \tag{3} $$

$$ \boldsymbol{\ell}_n = \left[ 1,\ T_{\rm eff},\ [\mathrm{Fe/H}],\ T_{\rm eff}^2,\ T_{\rm eff} \cdot [\mathrm{Fe/H}],\ [\mathrm{Fe/H}]^2 \right]. \tag{4} $$

Equation (3) is the label vector for the spectral type model, and Equation (4) is the label vector for the physical parameter model; the first element "1" is included to allow flexibility for a linear offset to the model. We find that a second-order parameterization is sufficient for reproducing the flux of each spectrum to 1% accuracy, as discussed further in Section 5.1.

The training step consists of optimizing the likelihood function (Equation (2)) for the coefficient vector and scatter (θ_λ and s_λ) given the fixed label vector (ℓ_n) constructed from the reference labels. The test step consists of optimizing the likelihood function for the labels at fixed θ_λ and s_λ obtained in the training step (see Ness et al. 2015 for further description). In the training step, the regression is designed to predict spectral pixels given labels, by learning zeroth, first, and second derivatives of the data with respect to the labels. In the test step, the regression is designed to predict labels given the spectral derivatives.
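The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the catalog: every pixel is given equal weight, the intrinsic scatter is only measured from the residuals rather than optimized, the labels are assumed to be rescaled to order unity, and the function names (`label_vector`, `train_model`, `infer_labels`) and the synthetic spectra are hypothetical.

```python
import numpy as np

def label_vector(teff, feh):
    """Quadratic label vector [1, T, F, T^2, T*F, F^2], one row per star."""
    t = np.atleast_1d(np.asarray(teff, dtype=float))
    f = np.atleast_1d(np.asarray(feh, dtype=float))
    return np.stack([np.ones_like(t), t, f, t**2, t * f, f**2], axis=1)

def train_model(flux, teff, feh):
    """Training step: least-squares coefficients for every pixel at once.

    flux has shape (n_stars, n_pixels). Simplification: unweighted fit;
    the scatter is the residual standard deviation, not a fitted parameter.
    """
    design = label_vector(teff, feh)                        # (n_stars, 6)
    theta, *_ = np.linalg.lstsq(design, flux, rcond=None)   # (6, n_pixels)
    scatter = np.std(flux - design @ theta, axis=0)
    return theta, scatter

def infer_labels(flux_one, theta, start, n_iter=20):
    """Test step: Gauss-Newton search for the labels of one spectrum."""
    labels = np.array(start, dtype=float)
    for _ in range(n_iter):
        t, f = labels
        model = label_vector(t, f)[0] @ theta
        # Derivatives of the model flux with respect to each label.
        d_t = theta[1] + 2 * t * theta[3] + f * theta[4]
        d_f = theta[2] + t * theta[4] + 2 * f * theta[5]
        jac = np.stack([d_t, d_f], axis=1)                  # (n_pixels, 2)
        step, *_ = np.linalg.lstsq(jac, flux_one - model, rcond=None)
        labels += step
    return labels

# Synthetic, noise-free demonstration with labels rescaled to [-1, 1]
# (for example (Teff - 3500 K) / 500 K; the scaling is an assumption).
rng = np.random.default_rng(0)
n_stars, n_pix = 60, 40
t_train = rng.uniform(-1, 1, n_stars)
f_train = rng.uniform(-1, 1, n_stars)
true_theta = rng.normal(0.0, 0.05, (6, n_pix))
true_theta[0] = 1.0          # continuum level
true_theta[3:] *= 0.2        # weak second-order terms
flux_train = label_vector(t_train, f_train) @ true_theta

theta, scatter = train_model(flux_train, t_train, f_train)
target = label_vector(0.3, -0.2)[0] @ true_theta
recovered = infer_labels(target, theta, start=[0.0, 0.0])
print(recovered)
```

Rescaling the labels before building the quadratic terms keeps the design matrix well conditioned; raw temperatures of a few thousand kelvin squared would otherwise dominate the fit numerically.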

Sample Selection
The Cannon model can in principle be trained on any physical or empirical labels available beyond those that typically parameterize theoretical atmospheric models (Teff, log g, [M/H], etc.), such as additional physical parameters (e.g., mass/age; Ho et al. 2017b) or empirical proxies for physical parameters (e.g., spectral types, colors, magnitudes), giving a wide range of flexibility to the model. However, choosing a training sample with high-quality labels is critical to its performance. Test (output) labels are only accurate if the training labels are accurate, and only precise if the training labels are measured consistently across the training sample. It is also critical to have a training sample with the dynamic range to span the entire parameter space of interest, because The Cannon does not extrapolate well outside the parameter space of the training sample. Finally, The Cannon assumes that the dependence of the spectrum on labels is continuous and smooth, and in this implementation that it is well approximated by quadratic functions. If that is not true, there will be features that The Cannon cannot reproduce.
For the purpose of this study, we have constructed two different training samples: first a one-dimensional spectral type model, and second, a two-dimensional physical parameter model, which describes the temperature and metallicity. The choice of training labels, the dimensionality of our data set, and requirements for a good training set are discussed further in Section 6.

Spectral Type Training Sample
The spectral type training sample consists of 51 sources spanning M0-M9, cross-matched from the catalog of West et al. (2011, hereafter W11), which contains 78,841 M dwarfs from SDSS. For each source in the catalog, spectral types were determined both through an automated routine that compares spectral type templates to data using The Hammer (Covey et al. 2007) and by visual inspection, to a reported accuracy of ±1 type. A sequence of spectra from the training sample spanning M0-M9 is shown in Figure 1.

Physical Parameter Training Sample
The physical parameter training sample consists of 87 sources with reference labels in Teff and [Fe/H]. The sample combines sources from the catalog of Mann et al. (2015, hereafter M15) with sources from a previously unpublished extension sample to M15, analyzed using similar data and identical techniques to M15. The major difference in the extension sample is that its sources had lower-quality or no parallaxes (prior to Gaia data); hence they were omitted from the M15 study and were less vetted for binarity than the M15 sample. However, all sources in the training sample were visually inspected for binarity by their color-magnitude position before being added.
The M15 catalog in total contains 183 sources and the extension sample another 500 stars. Both samples were primarily selected from the proper-motion-selected CONCH-SHELL (Gaidos et al. 2013) M-dwarf catalog. All targets have low-resolution optical spectra from the SNIFS spectrograph (Lantz et al. 2004) and infrared spectra taken with the SpeX Spectrograph (Rayner et al. 2003), which have been combined to estimate largely empirical bolometric fluxes. Effective temperatures have been estimated by comparing the SNIFS spectra to BT-Settl atmospheric models (Allard et al. 2011). A subsample of 29 sources with measured angular diameters from long-baseline optical interferometry (Boyajian et al. 2012) is used to calibrate the model comparison, including masking of spectral regions poorly reproduced by the model spectra (Mann et al. 2013b). Based on the difference between assigned Teff values and those from angular diameters, the absolute uncertainty on Teff is estimated to be 60 K, although the relative uncertainty is likely a factor of ∼2 better.
Iron abundances ([Fe/H]) are assigned to the physical parameter sample based on the strength of metal-sensitive lines in the near-infrared SpeX spectra (Rojas-Ayala et al. 2010) using the calibration from Mann et al. (2013a). The relation between these lines and an absolute [Fe/H] scale is calibrated using wide binaries containing an F-, G-, or K-type primary and an M-dwarf companion, under the assumption that binaries formed from the same molecular cloud and therefore have the same metallicity (Bonfils et al. 2005). Uncertainties are estimated to be ∼0.08 dex based on irreducible scatter in the empirical relation between selected lines and the assigned [Fe/H] from the primary star. As with Teff, relative errors on [Fe/H] are smaller, estimated to be 0.04-0.06 dex over most of the temperature and metallicity range considered here.
We note that surface gravity is not included as a training label. The reason for this is that for main-sequence M dwarfs, the parameter is almost entirely redundant with metallicity. The properties of M dwarfs, unlike those of their more massive counterparts, do not change measurably over the age of the universe after arriving at the zero-age main sequence. Hence perfect knowledge of abundances and Teff for an M dwarf should uniquely determine its surface gravity, position on a color-magnitude diagram, and overall luminosity. While we only had [Fe/H] for the training sample, for the uncertainties considered here, lack of information about [α/Fe] or specific abundances will only be important compared to other uncertainties in extreme cases (e.g., carbon stars).

Temperature/Metallicity Model
For the physical parameter model, we trained The Cannon on 87 M dwarfs with two-dimensional temperature/metallicity labels, to a precision of 77 K/0.09 dex as estimated by the cross-validation scatter, similar to the uncertainties on the original training sample of 60 K/0.08 dex. We note for this model that five out of 87 sources show possible rotational line broadening identified by visual inspection (as indicated by the red circles in Figure 2), while the remaining sources show no obvious broadening. These broadened sources have high χ² values (those sources with χ² > 80,000 in Figure 4), and the labels for these five sources are biased by an average of +65 K and −0.08 dex. However, removing them from the training sample does not significantly change the overall scatter and bias of the model. For the model overall, the cross-validation bias is +4 K/+0.008 dex with the rapid rotators included in the training set, and +5 K/+0.01 dex when they are excluded. Hence we do not remove them from the training sample.
To assess the validity of our model's labels we used a leave-one-out cross-validation (LOOCV) test, in which we train a model on all sources but star n, then apply the model trained on N − 1 sources to obtain the labels for star n. Precision (scatter) and bias of the model for each test are calculated as the standard deviation and mean, respectively, of the difference between training and test (or LOOCV) labels (Figure 2). Since the LOOCV test both evaluates how well the model reproduces the training values and penalizes the model for overfitting, we adopt the LOOCV scatter as the estimate of the model's precision. The set of training, test, and cross-validated labels for each training source is reported in Tables 1 and 2.
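The LOOCV procedure can be sketched generically as follows. This is an illustrative outline only: the stand-in model is a one-label linear fit on synthetic spectra rather than the quadratic Cannon model, the sign convention (predicted minus reference) is an assumption, and all function names are hypothetical.

```python
import numpy as np

def loocv(flux, labels, fit, predict):
    """Predict each star's label with a model trained on the other N - 1 stars."""
    n_stars = len(labels)
    predicted = np.empty(n_stars, dtype=float)
    for n in range(n_stars):
        keep = np.arange(n_stars) != n
        model = fit(flux[keep], labels[keep])
        predicted[n] = predict(model, flux[n])
    return predicted

def bias_and_scatter(reference, predicted):
    """Mean and standard deviation of (predicted - reference)."""
    diff = predicted - reference
    return diff.mean(), diff.std()

# Stand-in model: flux = offset + slope * label at every pixel.
def fit_linear(flux, labels):
    design = np.column_stack([np.ones_like(labels), labels])
    coeffs, *_ = np.linalg.lstsq(design, flux, rcond=None)
    return coeffs                     # row 0: offsets, row 1: slopes

def predict_linear(coeffs, flux_one):
    offset, slope = coeffs
    # Least-squares label for a single spectrum.
    return np.dot(slope, flux_one - offset) / np.dot(slope, slope)

rng = np.random.default_rng(1)
labels = rng.uniform(-1, 1, 30)
slopes = rng.normal(0.0, 0.1, 50)
flux = 1.0 + np.outer(labels, slopes) + rng.normal(0.0, 0.01, (30, 50))

cv = loocv(flux, labels, fit_linear, predict_linear)
bias, scatter = bias_and_scatter(labels, cv)
print(bias, scatter)
```

Because each star is excluded from the model that predicts it, the resulting scatter reflects out-of-sample performance, which is why it is the more conservative precision estimate.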
Another mode of analysis available with The Cannon is to examine how the model flux changes with respect to each training parameter, which makes our model interpretable for discovering or verifying atomic or molecular lines with strong dependence on different physical parameters. The top two panels of Figures 12 and 13 show two example spectra and model fits for two different temperatures (Figure 12) and two different metallicities (Figure 13), with atomic and molecular features identified by the abundance analysis of Souto et al. (2017). The bottom panels of Figures 12 and 13 show the derivative of flux with respect to temperature and metallicity at each pixel, taken at the median training values. In order to evaluate which spectral features show statistically significant change with respect to input label, we compute the error of the derivative at each pixel using a jackknife statistic (with a 1σ level overplotted in red):

$$ \sigma_{\theta,m}^2 = \frac{N-1}{N} \sum_{n=1}^{N} \left( \theta_{/n,m} - \theta_{m} \right)^2, $$

where σ_θ,m is the error at pixel m, N is the total number of stars in the sample indexed by n, θ is the coefficient vector trained on all N sources, and θ_/n is the coefficient vector trained on N − 1 sources excluding star n. A summary of identified lines with derivative values greater than the 2σ jackknife error is given in Tables 3 and 4.
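A delete-one jackknife of this kind can be sketched as below. The sketch assumes the standard jackknife variance with deviations taken about the full-sample fit, and it uses a single-label linear slope per pixel as a stand-in for the model derivative; the function names and synthetic data are hypothetical.

```python
import numpy as np

def jackknife_error(fit, flux, labels):
    """Delete-one jackknife error on the coefficients returned by fit().

    fit(flux, labels) must return an array of coefficients (one per pixel).
    Deviations are measured about the fit to all N stars.
    """
    n_stars = len(labels)
    theta_all = fit(flux, labels)
    sq_dev = np.zeros_like(theta_all)
    for n in range(n_stars):
        keep = np.arange(n_stars) != n
        sq_dev += (fit(flux[keep], labels[keep]) - theta_all) ** 2
    return np.sqrt((n_stars - 1) / n_stars * sq_dev)

def fit_slope(flux, labels):
    """Stand-in derivative: slope of flux against a single label per pixel."""
    design = np.column_stack([np.ones_like(labels), labels])
    coeffs, *_ = np.linalg.lstsq(design, flux, rcond=None)
    return coeffs[1]

rng = np.random.default_rng(2)
labels = rng.uniform(-1, 1, 40)
true_slope = np.array([0.0, 0.2, -0.1])      # pixel 0 carries no label signal
flux = 1.0 + np.outer(labels, true_slope) + rng.normal(0.0, 0.02, (40, 3))

slope = fit_slope(flux, labels)
sigma = jackknife_error(fit_slope, flux, labels)
significant = np.abs(slope) > 2 * sigma      # pixels that respond to the label
print(slope, sigma, significant)
```

For a simple mean, this estimator reduces exactly to the usual standard error of the mean, which is a convenient sanity check on any implementation.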
The spectra contain roughly 8000 pixels, so we might expect the χ² values to be close to 8000 in magnitude, but they are much higher. This discrepancy follows from the fact that, while the spectral model is good at the level of a few per cent, the signal-to-noise ratio of a typical spectrum is more than 100. That is, the χ² values do show that the model is not good in the frequentist sense; it is only good at the level of a few per cent.
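As a rough order-of-magnitude check of this argument (an illustrative estimate, not a calculation from the analysis itself), suppose the model departs from the data by a fractional amount ε at each pixel, independent of the noise σ ≈ 1/(S/N) on the normalized flux. The values ε = 0.03 and S/N = 100 are assumed stand-ins for "a few per cent" and "more than 100":

$$ \chi^2 \approx N_{\rm pix}\,\frac{\sigma^2 + \epsilon^2}{\sigma^2} = N_{\rm pix}\left[ 1 + \left( \epsilon \cdot \mathrm{S/N} \right)^2 \right] \approx 8000 \times \left[ 1 + (0.03 \times 100)^2 \right] = 8 \times 10^{4}. $$

Under these assumptions χ² lands near the 80,000-100,000 range that appears in the quality cuts, rather than near 8000.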

Spectral Type Model
We trained The Cannon on 51 M dwarfs in the range M0-M9 with a one-dimensional spectral type label, and obtained a precision of ±0.9 spectral types, similar to the uncertainty of the original training label of ±1 spectral type. We note, however, that the training sample is distributed heavily toward sources of earlier type, with a median spectral type of M3 and only one M8 and one M9 source. As seen in Figure 3, the model performs poorly at reproducing spectral types >M8, which confirms that The Cannon does not extrapolate well to labels outside the training sample space. Because of this shortage of late-type sources, we report our spectral type model to be precise to ±0.7 spectral types for the range M0-M6. Repeating the analysis of Section 5.1, Figure 3 shows the LOOCV test for the labels reported in Table 5, and Figure 14 shows the derivative of model flux with varying spectral type.

Test Sample
Out of the total APOGEE DR14 catalog of 258,475 sources, we selected the 254,478 sources that were in the cross-match with Gaia-DR2 (Gaia Collaboration et al. 2018) and applied Gaia color-magnitude cuts of 1 < BP − RP < 6 and 7.5 < M_G < 20 for sources with positive parallaxes (ϖ > 0), yielding a sample of 14,828 sources. From there we applied additional selection criteria, described below, to identify a sample of single, main-sequence M stars with minimal contamination from reddened K dwarfs, pre-main-sequence stars, and binaries:

1. Quality of fit cut: We apply a Cannon model χ² cut of less than 100,000, chosen to remove badly fit sources (such as fast rotators) but include χ² values close to the distribution of the training sample (Figure 4).
2. Color-magnitude cuts: Using Gaia and photometry from the Two Micron All Sky Survey (2MASS) we apply the additional color-magnitude selections shown in Figure 5 to remove sources above the main sequence (which are likely pre-main-sequence stars, reddened K dwarfs, and/or multiples), and subdwarfs below the main sequence.
3. Model extrapolation cuts.
4. Astrometric cut: Using the Gaia renormalized unit weight error (RUWE), a metric for evaluating the fit of the astrometric solution described in the additional release notes (Lindegren 2018), we apply a cut of RUWE < 1.2 to remove sources with high astrometric error or noise, such as binaries (see Figure 6).
5. Binary cut: To remove further contamination from binary sources, we applied an additional color-magnitude cut on sources above the main sequence, which we selected by eye in Figure 6.

Figure 3. Leave-one-out cross-validation test for the West-trained spectral type model. Predictive accuracy, as computed from the scatter in cross-validation, is 0.9 subtypes.
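Part of the selection described above can be sketched in plain Python. This is a simplified illustration under several assumptions: absolute magnitudes are computed from parallax without any extinction correction, only the Gaia quadrangle quoted in the Figure 5 caption is applied (the 2MASS quadrangle, the model extrapolation cuts, and the final binary cut are omitted), and the dictionary keys and function names are hypothetical rather than catalog column names.

```python
import math

# Gaia quadrangle from the Figure 5 caption, as (BP - RP, M_G) vertices.
GAIA_QUAD = [(1.4, 7.5), (2.2, 7.5), (4.2, 14.0), (3.3, 14.0)]

def absolute_magnitude(apparent_mag, parallax_mas):
    """M = m + 5 log10(parallax / 100 mas); extinction is ignored here."""
    return apparent_mag + 5.0 * math.log10(parallax_mas / 100.0)

def inside_polygon(x, y, vertices):
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):                 # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def passes_cuts(star):
    """Apply the parallax, color-magnitude, chi-squared, and RUWE cuts."""
    if star["parallax"] <= 0:                    # positive parallaxes only
        return False
    color = star["bp_rp"]
    m_g = absolute_magnitude(star["g_mag"], star["parallax"])
    if not (1 < color < 6 and 7.5 < m_g < 20):   # initial Gaia cuts
        return False
    if star["chi2"] >= 100_000:                  # quality-of-fit cut
        return False
    if not inside_polygon(color, m_g, GAIA_QUAD):
        return False
    return star["ruwe"] < 1.2                    # astrometric cut

# Hypothetical sources; parallax in mas.
stars = [
    {"g_mag": 10.0, "parallax": 100.0, "bp_rp": 2.5, "chi2": 50_000, "ruwe": 1.0},
    {"g_mag": 10.0, "parallax": 100.0, "bp_rp": 2.5, "chi2": 50_000, "ruwe": 1.5},
    {"g_mag": 10.0, "parallax": 100.0, "bp_rp": 1.2, "chi2": 50_000, "ruwe": 1.0},
    {"g_mag": 10.0, "parallax": -1.0, "bp_rp": 2.5, "chi2": 50_000, "ruwe": 1.0},
]
kept = [passes_cuts(s) for s in stars]
print(kept)
```

Checking the parallax sign first avoids taking the logarithm of a non-positive number when computing the absolute magnitude.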
The top, middle, and bottom panels of Figure 7 show the sources in Gaia color-magnitude space before and after selection, colored by temperatures, metallicities, and spectral types determined by The Cannon, with their respective training samples overplotted in orange. Each plot shows the expected gradient: temperature increases with decreasing color, spectral subtype increases with increasing color, and the metallicity gradient is largely perpendicular to the main-sequence branch. We also note that applying our model is computationally cheap: the time to train and test a model on all 14,828 sources was two minutes on a 2.7 GHz Intel Core i7 laptop. Table 6 outlines the parameters included in the test sample catalog, which can be downloaded from the online journal. Two versions of the catalog are included: the first contains all 14,828 sources before selection, and the second contains the 5875 sources kept after making Selections 1-5 described in this section.

Temperature Validation
As a validation test of our derived temperatures, we perform a comparison with several color-temperature relations from the literature, which use combinations of 2MASS and visual-band photometry to predict temperatures (similarly to the evaluation of ASPCAP temperatures by Schmidt et al. 2016). To obtain visual-band magnitudes, we cross-matched the 5875 sources in our "safe" test sample to the AAVSO Photometric All-Sky Survey DR9 (APASS; Henden et al. 2016), obtaining a subsample of 1702 sources with both BV photometry measured by APASS and 2MASS JHK photometry from APOGEE. Figure 8 shows Cannon versus photometric temperatures on the right, and ASPCAP versus photometric temperatures on the left, for each of the 1702 sources, colored by their respective spectroscopic metallicities.

[Figure 4 caption; beginning missing] ... Skinner et al. 2018). Right: the distribution of χ² fits for all 14,828 sources in the APOGEE-Gaia cross-match, with color cuts 1 < G_BP − G_RP < 6 and 7.5 < M_G < 20, and ϖ > 0. We apply a quality cut of χ² < 100,000 to the test sample for those sources we report as "safe."

Figure 5. Gaia and 2MASS color-magnitude cuts for the 12,037 sources with χ² < 100,000. Overplotted with orange triangles are the 67 out of 87 sources in the training sample that have parallaxes measured by Gaia. The coordinates for the selected quadrangles are {(1.4, 7.5), (2.2, 7.5), (4.2, 14), (3.3, 14)}, corresponding to (BP − RP, M_G) for the Gaia color-magnitudes shown in the left panel, and {(0.7, 4.3), (1.1, 4.3), (0.7, 9), (1.1, 9)}, corresponding to (J − K, M_K) for the 2MASS color-magnitudes shown in the right panel.
Compared to the color-metallicity-derived temperatures of Mann et al. (2015) and Boyajian et al. (2012), both ASPCAP and Cannon temperatures show similar scatters of ∼60 K, but are offset by a constant. We find The Cannon to be in better agreement with Mann et al. (2015) and Boyajian et al. (2012), with ASPCAP overestimating Teff on average by ∼110-140 K, and The Cannon underestimating Teff on average by ∼10-20 K, with the largest deviation in the latter at the lowest and highest Teff.

Metallicity Validation
As a check of the reliability of our test sample metallicities, we cross-matched our final M-dwarf sample with the catalog of >50,000 high-confidence, widely separated binaries identified in Gaia-DR2 by El-Badry & Rix (2018). In total we found 216 of the APOGEE M dwarfs to be in binary pairs (46 FGK+M, 155 M+M, and 15 WD+M). Out of the 155 M+M pairs, eight had both components in APOGEE. Cross-matching the list of FGK+M dwarf companions with several catalogs/surveys with measured stellar metallicities, we found an additional seven sources with FGK metallicities from LAMOST (Zhao et al. 2012) and APOGEE (ASPCAP). The metallicity measurements for the 15 M-dwarf binaries and their companions are given in Table 1 and shown in Figure 9. The overall scatter is 0.08 dex, an improvement over the scatter of ASPCAP metallicities, which is 0.15 dex for these 15 sources. The internal consistency of the two models (the scatter of the eight M+M pairs with both components in APOGEE) is 0.06 dex for The Cannon and 0.12 dex for ASPCAP.
As expected, the Toomre diagram in Figure 10 shows that higher-metallicity sources in the sample are concentrated in the low-velocity space corresponding roughly to the thin-disk population, while the thick-disk population contains a slightly higher concentration of lower-metallicity sources. Separating the two populations into separate histograms (also shown in Figure 10), we find that thick-disk sources are marginally more metal-poor than thin-disk sources.

Figure 11 shows that ASPCAP metallicities are systematically lower than Cannon metallicities. We further find that the bias is temperature-dependent: at the highest temperatures (Teff > 3600 K) ASPCAP and Cannon metallicities are consistent to a scatter of 0.05-0.06 dex and offset by an average of −0.12 to −0.15 dex, while at the lowest temperatures (Teff < 3200 K) ASPCAP and Cannon are consistent to a scatter of ∼0.13 dex and offset by an average of −0.3 dex.

Discussion
We trained a data-driven model (The Cannon; Ness et al. 2015) to deliver high-quality atmospheric parameters (Teff and [Fe/H]) for M-type dwarf stars from high-resolution infrared spectra from APOGEE. This work was motivated by the problem that M-dwarf stars are difficult to model physically; the data are better than the models in important senses. Indeed we find that our data-driven model is both accurate in the data domain (as a spectral synthesis model) and precise in the latent domain (as a tool for deriving physical parameters). This accuracy and precision are consistent with previous work with The Cannon (Ness et al. 2015, 2018; Casey et al. 2016; Ho et al. 2017a), but here extend to a new regime in spectral type (Teff). The primary result of this work is that we have compiled a catalog of 5875 M dwarfs with Cannon temperatures, metallicities, spectral types, and six-dimensional kinematics. These data are provided in Table 6.
Figure 6. The top panel shows the test sample of 8335 sources after applying Selections 1-3 described in Section 5.3. The bottom panel shows the same test sample reduced to 6221 after applying an astrometric quality cut of RUWE < 1.2 (Selection 4). To further remove sources that were likely binary contamination (Selection 5), we cut out the sparse sources above the red line, which lie above the majority of the main sequence with temperatures and metallicities that deviate from the expected gradient, reducing the final test sample to 5875 sources. The red line shown is constrained by the (BP − RP, M_G) coordinates {(3.5, 12), (1.87, 7.5)}.
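The astrometric and binary-sequence cuts described in the Figure 6 caption (Selections 4 and 5) can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' code: the garbled caption coordinates are read as (3.5, 12) and (1.87, 7.5) in (BP − RP, M_G), "above the red line" is taken to mean brighter than the line (smaller M_G, as on a conventional color-magnitude diagram with an inverted magnitude axis), and the line is extrapolated beyond its two endpoints. The names `line_mg` and `passes_cuts` are hypothetical.

```python
# Endpoints of the red line in (BP - RP, M_G), as read from the Figure 6 caption.
P1 = (1.87, 7.5)
P2 = (3.5, 12.0)
RUWE_MAX = 1.2  # astrometric quality threshold (Selection 4)


def line_mg(bp_rp):
    """Absolute G magnitude of the red line at a given BP - RP color."""
    slope = (P2[1] - P1[1]) / (P2[0] - P1[0])
    return P1[1] + slope * (bp_rp - P1[0])


def passes_cuts(bp_rp, abs_g, ruwe):
    """True if a source survives both the RUWE cut and the binary-sequence cut.

    Assumption: a source "above" the line is brighter than the line,
    i.e. it has a smaller M_G at the same color, and is rejected.
    """
    if ruwe >= RUWE_MAX:
        return False
    return abs_g >= line_mg(bp_rp)


# Hypothetical sources: (BP - RP, M_G, RUWE).
sources = [(2.5, 10.5, 1.0), (2.5, 8.5, 1.0), (2.5, 10.5, 1.4)]
kept = [s for s in sources if passes_cuts(*s)]
print(f"{len(kept)} of {len(sources)} hypothetical sources kept")
```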
While The Cannon achieves excellent precision at predicting labels and reproducing spectral features, the accuracy of labels it produces is limited by the accuracy, relative precision, size, dynamic range, and representation of the training sample. That is, being a supervised method, The Cannon is never any better in a mean (bias) sense than the input training data, although it can be better in a precision or variance sense. The catalog we have produced is a label transfer from parameters provided in our input data (M15) and it implicitly adopts all the biases and issues from those input data. It is also limited to the stellar-parameter domain of that input catalog. That said, this work provides an external validation of the M15 stellar parameters.
Figure 7. Full sample of 14,828 M dwarfs colored by Cannon labels before selection (left) and final sample of 5875 M dwarfs after applying the selection criteria described in Section 5.3 (right), which reduce contamination from sources that are not similar to the training sample (sources that are not single, main-sequence M stars, such as pre-main-sequence stars, spectroscopic binaries, and K dwarfs). Overplotted with orange triangles are the M15 and W11 training samples, for their respective Cannon test labels. Temperature increases with decreasing color, spectral subtype increases with increasing color, and metallicity increases perpendicularly up from the main-sequence branch, as expected. Deviations from these gradients seen at the upper boundary of the main sequence are likely remaining contamination from the binary sequence.
The model we have developed does have limitations, however. For example, it delivers chi-squared goodness-of-fit measures that are large; the model is not technically an accurate description of the spectra, especially when the spectra are observed at signal-to-noise levels above 100. The model does not include some known physical and instrumental effects, such as line broadening from rotation or convection (for example, Behmard et al. 2019), or binarity and the superposition of multiple stellar spectra (as in, say, El-Badry et al. 2018). The model also does not include any adjustments for instrumental variations, such as the small but significant variations of APOGEE resolution with spectrograph fiber number (as included in Ness et al. 2018).
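To make the goodness-of-fit statement concrete, the sketch below computes a reduced chi-squared for one spectrum. It assumes the usual definition (sum of squared, uncertainty-normalized residuals divided by the number of pixels minus the number of fitted labels); the text does not state which form was used. The function name `reduced_chi2` and the pixel values are hypothetical.

```python
def reduced_chi2(flux, model, sigma, n_fit_params=2):
    """Reduced chi-squared between an observed spectrum and a model spectrum.

    Assumption: degrees of freedom = number of pixels - number of fitted
    labels (two here, for Teff and [Fe/H]).
    """
    if not (len(flux) == len(model) == len(sigma)):
        raise ValueError("flux, model, and sigma must have the same length")
    dof = len(flux) - n_fit_params
    if dof <= 0:
        raise ValueError("need more pixels than fitted parameters")
    chi2 = sum(((f - m) / s) ** 2 for f, m, s in zip(flux, model, sigma))
    return chi2 / dof


# Hypothetical normalized fluxes for five pixels.
flux = [1.00, 0.95, 0.90, 1.02, 0.98]
model = [1.00, 0.96, 0.92, 1.00, 0.97]

# Same residuals, two noise levels: per-pixel S/N of about 100 and about 200.
chi2_snr100 = reduced_chi2(flux, model, [0.01] * 5)
chi2_snr200 = reduced_chi2(flux, model, [0.005] * 5)
print(f"reduced chi2: {chi2_snr100:.2f} (S/N ~ 100), {chi2_snr200:.2f} (S/N ~ 200)")
```

With fixed model residuals, halving the noise quadruples chi-squared, which is one way to see why a model with small systematic mismatches looks worse at high signal-to-noise.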
The APOGEE instrument was designed to be sensitive to more than a dozen individual elemental abundances in stellar spectra. So the M-dwarf spectra analyzed here contain individual elemental abundance information that we have ignored. Exploitation of that information requires a better training set of M dwarfs than we have at present, but is an important goal for the future with these data.
While a detailed analysis of atmospheric model limitations is beyond the scope of this paper, our results provide an avenue to compare the metallicity scale for FGK stars to the less well-understood metallicity scale for M dwarfs. These results show that metallicities derived from atmospheric models (ASPCAP) are systematically biased metal-poor compared to Cannon-based metallicities trained on sources with metallicities calibrated to those of FGK companions. At the high-temperature end (Teff > 3600 K), the ASPCAP metallicity bias is −0.12 to −0.15 dex with a scatter of 0.05 to 0.06 dex relative to Cannon metallicities, and it increases to a bias of −0.3 dex and scatter of 0.13 dex at the low-temperature end (Teff < 3200 K) (Figure 11). We suspect that this metal-poor bias, while not explored to a great extent in this work, is due to the line lists of the models: the optimizer of the pipeline may be lowering the continuum level and metallicity of the fit to compensate for the missing lines or opacities. We also note that this analysis was [...] (Souto et al. 2017). Further analysis would need to be done to quantify the metallicity improvement for M dwarfs in future data releases of APOGEE, and to determine whether the metallicity bias is found in other model grids (besides the ATLAS/MARCS models used by the ASPCAP pipeline) and whether the effect is present at other wavelengths.
Table 1. The overall scatter between the 15 metallicity pairs is 0.08 dex.
Given that physics-based spectral models of M dwarfs have issues, one of the possible future values of the data-driven model shown here is that it is highly interpretable: it contains within it first and second derivatives of the spectral expectation with respect to the atmospheric parameters. We show some of these derivatives in Figures 12, 13, and 14 and deliver the relevant Tables 3 and 4. These tables summarize spectral features in the APOGEE bandpass that are found to be strong temperature and metallicity indicators. In the long run, this is the primary value of data-driven models for astronomy: to provide physical insights that drive physical understanding. It is our hope that The Cannon, and models like it, will lead to new and improved physical models which will, in turn, put The Cannon out of business.
The histogram (right) shows the metallicity distribution of thin- and thick-disk stars, with blue corresponding to v_tot < 70 km s^-1 and green to 70 km s^-1 < v_tot < 180 km s^-1, respectively. The mean ± standard deviation metallicity of the thin-disk distribution is [Fe/H] = 0.00 ± 0.17 dex, and that of the thick-disk distribution is [Fe/H] = −0.14 ± 0.19 dex.
Figure 11. A comparison of ASPCAP DR14 and Cannon metallicities for 8335 test sample sources separated into temperature bins of 100 K. We see that the scatter and metal-poor bias of ASPCAP DR14 metallicities clearly increase with decreasing temperature.
Figure 12. Top two panels of each plot: APOGEE spectra (black), overlaid by the Mann-trained Cannon model for two sources of differing temperatures and similar metallicities. Third panel of each plot: Derivative of The Cannon model with respect to temperature, taken at the median training temperature, Teff = 3463 K; an error estimate computed using a jackknife statistic at each pixel is marked in red, making it possible to distinguish which features vary significantly with change in spectral type, and which are likely due to noise.
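The way a data-driven model carries first and second derivatives can be illustrated with a single pixel. The sketch below assumes the quadratic-in-labels form commonly used for The Cannon (Ness et al. 2015), with labels expressed as offsets from a pivot such as the median training temperature; whether the model in this work uses exactly this parameterization is an assumption. The coefficient values and the names `model_flux`, `dflux_dteff`, and `d2flux_dteff2` are hypothetical.

```python
def model_flux(theta, dt, dm):
    """Quadratic model for one pixel.

    theta = (c0, cT, cM, cTT, cTM, cMM); dt and dm are the offsets of
    Teff (K) and [Fe/H] (dex) from the pivot labels.
    """
    c0, ct, cm, ctt, ctm, cmm = theta
    return c0 + ct * dt + cm * dm + ctt * dt * dt + ctm * dt * dm + cmm * dm * dm


def dflux_dteff(theta, dt, dm):
    """Analytic first derivative of the pixel flux with respect to Teff."""
    _, ct, _, ctt, ctm, _ = theta
    return ct + 2.0 * ctt * dt + ctm * dm


def d2flux_dteff2(theta):
    """Analytic second derivative with respect to Teff (constant for a quadratic)."""
    return 2.0 * theta[3]


# Hypothetical coefficients for one pixel, NOT fitted values from this work.
theta = (0.95, -2.0e-4, 0.05, 1.0e-7, 3.0e-5, -0.02)

# At the pivot (dt = dm = 0) the derivative reduces to the linear coefficient.
analytic = dflux_dteff(theta, 0.0, 0.0)

# Cross-check with a central finite difference, exact for a quadratic up to rounding.
h = 1.0
numeric = (model_flux(theta, h, 0.0) - model_flux(theta, -h, 0.0)) / (2.0 * h)
print(f"dF/dTeff at pivot: analytic {analytic:.2e}, numeric {numeric:.2e} per K")
```

Evaluating `dflux_dteff` at every pixel would produce a derivative spectrum of the kind plotted in the third panel of Figure 12; the jackknife error estimate mentioned in that caption is not reproduced here.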