Machine Learning Applied to Star–Galaxy–QSO Classification and Stellar Effective Temperature Regression

Published 2018 December 14 © 2018. The American Astronomical Society. All rights reserved.
Citation: Yu Bai et al 2019 AJ 157 9. DOI: 10.3847/1538-3881/aaf009


Abstract

In modern astrophysics, machine learning has gained increasing popularity owing to its powerful ability to make predictions or calculated suggestions for large amounts of data. We describe an application of the supervised machine-learning algorithm, random forests (RF), to star/galaxy/QSO classification and stellar effective temperature regression, based on the combination of Large Sky Area Multi-Object Fiber Spectroscopic Telescope and Sloan Digital Sky Survey spectroscopic data. This combination enables us to obtain reliable predictions with one of the largest training samples ever used. The training samples are built with a nine-color data set of about three million objects for the classification and a seven-color data set of over one million stars for the regression. The performance of the classification and regression is examined with validation and blind tests on objects in the RAdial Velocity Extension, 6dFGS, UV-bright Quasar Survey, and Apache Point Observatory Galactic Evolution Experiment surveys. We demonstrate that RF is an effective algorithm, with classification accuracies higher than 99% for stars and galaxies, and higher than 94% for QSOs. These accuracies are higher than the machine-learning results of previous studies. The total standard deviations of the regression are smaller than 200 K, similar to those of some spectrum-based methods. The machine-learning algorithm with broad-band photometry provides a more efficient approach for dealing with massive amounts of astrophysical data than do traditional color cuts and spectral energy distribution fits.

1. Introduction

Nowadays, astronomy and cosmology are concerned with the study and characterization of millions of objects, which can be quickly identified with their optical spectra. However, billions of sources in wide-field photometric surveys cannot be followed up spectroscopically, and appropriate identification of the various source types is complicated (Krakowski et al. 2016). Traditionally, a separator between stars and galaxies is a morphological measurement (Vasconcellos et al. 2011), but this approach quickly reaches its limit at low image resolution. Another separation involves magnitude and color criteria, but the criteria become too complex to be described with functions in a multi-dimensional parameter space.

However, this parameter space can be effectively explored with machine-learning algorithms, e.g., support vector machines (SVM; Cortes & Vapnik 1995; Christianini et al. 2000; Kovács & Szapudi 2015; Krakowski et al. 2016), random forests (RF; Breiman 2001; Yi et al. 2014; Reis et al. 2018), and k-nearest neighbors (Fix & Hodges 1951; Garcia-Dias et al. 2018). Machine learning teaches computers to learn from "experience" without relying on a predetermined equation or an explicit program. It finds natural patterns in the data that generate insight and help us to make better decisions and predictions. Machine-learning algorithms have helped us to deal with complex problems in astrophysics, e.g., automatic galaxy classification (Huertas-Company et al. 2008, 2009), Morgan–Keenan (MK) spectral classification (Manteiga et al. 2009; Navarro et al. 2012; Yi et al. 2014), variable star classification (Pashchenko et al. 2018), and spectral feature recognition for QSOs (Parks et al. 2018).

The "experience" used for machine learning is also known as training data, which is the key to making effective predictions. Classifications from spectroscopic surveys are ideal training data due to their high reliability. Several studies have explored the performance of star/galaxy/QSO classification (e.g., Suchkov et al. 2005; Ball et al. 2006; Vasconcellos et al. 2011; Kovács & Szapudi 2015; Krakowski et al. 2016). In these studies, the machine-learning classifiers were built with photometric colors and spectroscopic classes, and they show more accurate predictions than traditional methods such as color cuts (Weir et al. 1995). However, there are still some locations in multi-color space that were not explored by the classifiers, owing to the small sizes of the spectroscopic samples. Therefore, a machine-learning classifier built from a large spectroscopic sample is required to cover a more complete multi-color space and to further yield accurate classification of billions of sources.

After separating stars from galaxies and QSOs, we want to understand their nature. The stellar spectral classification, the MK spectral types, is the fundamental reference frame for stars. However, MK classification is based on features extracted from the spectra (Manteiga et al. 2009; Daniel et al. 2011; Navarro et al. 2012; Garcia-Dias et al. 2018), which limits its application to stars with high signal-to-noise ratios (S/Ns). On the other hand, the spectral features of different types can be very similar, making it difficult to draw clear cuts between spectral types (Liu et al. 2015). An alternative method is to estimate the effective temperature with multiple colors, which only requires photometric data and can therefore cover a greater area of the sky. Theoretical studies have indicated that combining broad-band photometry allows atmospheric parameters and interstellar extinction to be determined with fair accuracy (Bailer-Jones et al. 2013; Allende Prieto 2016). However, this approach has not yet been validated against real observational data.

In this paper, we take advantage of the archival data from the Sloan Digital Sky Survey (SDSS) and the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) surveys (Section 2) to build a star/galaxy/QSO classifier (Section 3) and stellar effective temperature regression (Section 4) based on one of the largest machine-learning samples. The validation and blind tests are applied to explore the performance of the prediction in Sections 3 and 4. In Section 5, we present comparisons with other machine-learning methods and an application of the spectral energy distribution (SED) fit to the real observational data. A summary and suggestions for future work are given in Section 6.

2. Data

2.1. SDSS and LAMOST Spectroscopic Surveys

SDSS is an international project that has made the most detailed three-dimensional maps of our universe. The fourth stage of the project (SDSS-IV) started in 2014 July, with plans to continue until mid-2020 (Blanton et al. 2017). The automated spectral classification of the SDSS-IV is determined with the chi-square (χ2) minimization method, in which the templates are constructed by performing a rest-frame principal-component analysis (Bolton et al. 2012; Blanton et al. 2017). The first data release in the SDSS-IV, DR13, includes over 4.4 million sources, in which galaxies comprise 59%, QSOs 23%, and stars 18% (Albareti et al. 2017).

Another ongoing spectroscopic survey is undertaken by LAMOST (Cui et al. 2012; Zhao et al. 2012), which mainly aims at understanding the structure of the Milky Way (Deng et al. 2012). On 2016 June 2, LAMOST finished its fourth year of survey (the fourth data release; DR4), having obtained spectra of more than 7.6 million sources, including 91.6% stars, 1.1% galaxies, and 0.2% QSOs. The LAMOST one-dimensional (1D) pipeline recognizes spectral classes by applying a cross-correlation with templates (Luo et al. 2015). An additional independent pipeline and visual inspection are carried out in order to double-check the galaxy and QSO identifications. Here, we adopt SDSS DR13 plus LAMOST DR4, since they are matched in the time of their data releases.

Figure 1 shows the comparison between the two spectroscopic surveys. The objects of LAMOST are dominated by stars, while over half of the objects in SDSS are galaxies. The combination of the two surveys can provide a larger and more balanced training sample for the classification. In order to add more QSO samples, we adopt the 13th edition of the catalog of quasars and active nuclei (Veron13; Véron-Cetty & Véron 2010), which includes 23,108 objects. If an object is included in more than one catalog, the priority order is Veron13, SDSS, and then LAMOST.

Figure 1. The comparison of object types between the SDSS DR13 (red) and LAMOST DR4 (blue) spectroscopic surveys.

2.2. SDSS and Wide-field Infrared Survey Explorer (WISE) Photometric Surveys

The combination of optical and infrared (IR) data for huge numbers of sources has proved effective in star/galaxy classification (Baldry et al. 2010; Henrion et al. 2011) and stellar parameter determination (Allende Prieto 2016). SDSS has imaged over 31,000 square degrees in five broad bands (ugriz), and DR13 includes photometry for over one billion objects. WISE (Wright et al. 2010) performed an all-sky survey with images at 3.4, 4.6, 12, and 22 μm and yielded more than half a billion objects.

In order to obtain the training sample, we extract objects with available model magnitudes in the g, r, and i bands from the SDSS and LAMOST spectroscopic surveys and cross identify them with the WISE All-Sky Data Release catalog with the help of the Catalog Archive Server Jobs (CasJobs) system. Similar to Krakowski et al. (2016), we use the w1(2)mpro magnitudes. The J, H, and K magnitudes are also extracted in order to cover the near-IR bands. Our selection requires zWarning = 0 for the SDSS objects, and S/Ns higher than 2 in the W1 and W2 bands. We adopt w?mag13 as the indicator for extended objects (Bilicki et al. 2014; Krakowski et al. 2016; Kurcz et al. 2016; Solarz et al. 2017), which is defined as

w?mag13 = w?mag_1 − w?mag_3,  (1)

where w?mag_1 and w?mag_3 are the magnitudes measured within circular apertures of radii 5.″5 and 11″, respectively. The question mark is the channel number in the catalog.
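As an illustration, the concentration indicator of Equation (1) is just the difference of the two aperture magnitudes; the magnitudes below are hypothetical values, not catalog entries:

```python
import numpy as np

def concentration(mag_ap1, mag_ap3):
    """w?mag13 = w?mag_1 - w?mag_3 (difference of aperture magnitudes).

    A point source leaks little extra flux into the larger 11'' aperture,
    so the difference stays small; an extended source shifts the value."""
    return np.asarray(mag_ap1) - np.asarray(mag_ap3)

# hypothetical W1 aperture magnitudes for a star-like and a galaxy-like source
print(concentration([15.30, 14.80], [15.05, 14.10]))
```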

3. Star/Galaxy/QSO Classification

Classification is foundational in astronomy, and it is the beginning of understanding the relationships between disparate groups of objects and identifying the truly peculiar ones (Gray & Corbally 2009). In this section, we present the machine-learning method and performance tests of our classifier.

3.1. Method

We use CasJobs to cross identify the photometric data with the spectral catalogs of LAMOST, SDSS, and Veron13. The result has 2,973,855 objects, including 2,123,831 stars, 806,139 galaxies, and 43,885 QSOs. We present in Figure 2 the color–color diagram that is often used for star–galaxy separation (e.g., Jarrett et al. 2011; Goto et al. 2012; Ferraro et al. 2015). The contours of the three classes overlap in the color–color diagram. Neither the cut W1 − W2 = 0.8 (Stern et al. 2012; Yan et al. 2013) nor W1 − J = −1.7 (Goto et al. 2012) can cleanly separate the stars, galaxies, and QSOs.

Figure 2. The color–color diagram for stars (red), galaxies (blue), and QSOs (black) in our training sample.

We build a nine-dimensional color space: g − r, r − i, i − J, J − H, H − K, K − W1, W1 − W2, w1mag13, and w2mag13. Each object is weighted with the quadratic sum of its photometric uncertainties. Holdout validation is applied to test the total accuracies of different machine-learning algorithms, in which a random partition of 20% is held out for the prediction and the rest is used to train the classifier.
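The construction of the color feature space and the 20% holdout can be sketched with scikit-learn, used here purely as an illustrative stand-in (the paper does not specify its software). The feature matrix, labels, and weighting scheme below are synthetic placeholders for the real catalog:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
colors = ["g-r", "r-i", "i-J", "J-H", "H-K", "K-W1", "W1-W2",
          "w1mag13", "w2mag13"]

# random stand-ins for the ~3 million objects and their nine colors
n = 5000
X = rng.normal(size=(n, len(colors)))
y = rng.choice(["STAR", "GALAXY", "QSO"], size=n, p=[0.70, 0.27, 0.03])
# one reading of the per-object weight: root-sum-square of (hypothetical)
# photometric uncertainties in the nine bands
w = np.sqrt((rng.uniform(0.01, 0.1, size=(n, len(colors))) ** 2).sum(axis=1))

# hold out a random 20% of the sample for validation
X_train, X_test, y_train, y_test, w_train, _ = train_test_split(
    X, y, w, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=w_train)
print("holdout accuracy: %.3f" % clf.score(X_test, y_test))

# predictor importance estimates (cf. Figure 3), normalized to sum to one
for name, imp in zip(colors, clf.feature_importances_):
    print(f"{name:8s} {imp:.3f}")
```

With random features, the holdout accuracy simply approaches the majority-class fraction; on the real colors the classifier exploits genuine structure.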

Table 1 lists the accuracies and time costs of the different algorithms for the 20% held-out samples (80% of the samples are used for training and 20% are held out for testing). Since the validation is applied, the time costs are approximate. The RF algorithm (Breiman 2001) shows the best combined performance on time cost (57 minutes) and total accuracy (99.2%). Other methods, for example, the k-nearest neighbors and the SVM, either take more time to build the classifiers or show lower total accuracies.

Table 1.  The Accuracies and Time Costs of Different Algorithms

Algorithm                   Accuracy [%]   Time Cost (a)
Simple Tree (b)             97.6           minutes
Medium Tree (c)             98.6           minutes
Complex Tree (d)            98.8           minutes
Linear Discriminant (e)     98.3           a minute
Quadratic Discriminant (f)  98.2           a minute
Fine KNN (g)                98.7           an hour
Medium KNN (h)              99.1           an hour
Coarse KNN (i)              99.0           hours
Cosine KNN (j)              99.0           hours
Cubic KNN (k)               99.1           hours
Weighted KNN (l)            99.0           hours
RF                          99.2           hours
Linear SVM (m)              98.9           a week
Quadratic SVM (n)           90.6           a week
Fine Gaussian SVM (n)       98.9           a week
Cubic SVM (n)               72.9           a week
Medium Gaussian SVM (n)     99.1           a week
Coarse Gaussian SVM (n)     99.2           a week

Notes.

(a) One worker for the parallel computing. (b) Few leaves to make coarse distinctions between classes. (c) Medium number of leaves for finer distinctions between classes. (d) Many leaves to make many fine distinctions between classes. (e) Creates linear boundaries between classes. (f) Creates nonlinear boundaries between classes. (g) Finely detailed distinctions between classes. (h) Medium distinctions between classes. (i) Coarse distinctions between classes. (j) Medium distinctions between classes, using a cosine distance metric. (k) Medium distinctions between classes, using a cubic distance metric. (l) Medium distinctions between classes, using a distance weight. (m) Makes a simple linear separation between classes. (n) Makes a nonlinear separation between classes.

The working principle of RF is that it builds an ensemble of unpruned decision trees and merges them together to obtain a more accurate and stable prediction. The algorithm consists of many decision trees, and it outputs the class that is the mode of the classes output by the individual trees (Breiman 2001; Gao et al. 2009). RF is often used when we have very large training data sets and a very large number of input variables. One big advantage of RF is fast learning from very large amounts of data. Gao et al. (2009) listed many other advantages of RF.
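This principle, bagging unpruned trees and taking the mode of their votes, can be shown with a minimal from-scratch sketch on synthetic data (not the paper's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)

# grow an ensemble of unpruned trees, each on a bootstrap resample of the data
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))          # sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# the forest's prediction is the mode of the individual trees' votes
votes = np.array([t.predict(X) for t in trees])     # shape (n_trees, n_samples)
forest_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble accuracy on the training sample: %.3f"
      % (forest_pred == y).mean())
```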

After selecting the best algorithm, we apply the holdout validation and test the RF classifier 10 times. The average accuracy is 99%, and the results are shown in Table 2 and Figure 4. In the classifier, the contributions of the nine colors differ, as described by the predictor importance estimates (Figure 3). We find that the IR colors play an important role in our classifier, similar to the result of Krakowski et al. (2016).

Figure 3. The predictor importance estimates for the classification and the regression.

Table 2.  The Comparison of the Average Performance

            All Samples          Uniform Samples
            c [%]     p [%]      c [%]     p [%]
Stars       99.6      99.7       99.6      96.5
Galaxies    98.9      97.8       97.6      92.5
QSOs        71.9      88.5       88.9      97.4
Accuracy        99                   95

Note. All samples: the classifier using all samples. Uniform samples: the classifier using samples with the same numbers of the different classes. c = completeness, and p = purity.

We adopt the measures defined by Soumagnac et al. (2015) to show the performance of the classifier. These measures, completeness (c) and purity (p) for the star, galaxy, and QSO samples, have been used in other machine-learning studies (Kovács & Szapudi 2015; Krakowski et al. 2016). We use the following equations (here for galaxies):

c_G = T_G / (T_G + F_GS + F_GQ),  (2)

p_G = T_G / (T_G + F_SG + F_QG),  (3)

where T_G is the number of correctly classified galaxies, F_GS and F_GQ are the numbers of galaxies misclassified as stars and QSOs, and F_SG and F_QG are the numbers of stars and QSOs misclassified as galaxies, respectively. The QSO sample shows the lowest completeness and purity, probably due to its small sample size.
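Completeness and purity follow directly from the confusion matrix. A sketch with hypothetical counts (rows are true classes, columns are predicted classes):

```python
import numpy as np

# rows = true class, columns = predicted class; order: star, galaxy, QSO
# (hypothetical counts, not the paper's results)
cm = np.array([[9900,   80,  20],
               [  60, 4800, 140],
               [  10,  120, 870]])

def completeness(cm, k):
    # fraction of true members of class k that are recovered
    return cm[k, k] / cm[k].sum()

def purity(cm, k):
    # fraction of objects predicted as class k that truly belong to it
    return cm[k, k] / cm[:, k].sum()

for k, name in enumerate(["stars", "galaxies", "QSOs"]):
    print(f"{name:8s} c = {completeness(cm, k):.3f}  p = {purity(cm, k):.3f}")
```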

In order to test this effect, we normalize the sample sizes of the three classes and apply the 20% holdout validation again (about 8800 objects of each type for the testing). The average result is shown in Table 2 and the right panel of Figure 4. The three classes now have similar completeness and purity. This result implies that the performance of the classifier cannot be judged by these measures alone.

Figure 4. The comparison of the confusion matrices for two classifiers: one using all samples and the other using samples with the same numbers of the different classes.

We also apply the magnitude binnings suggested by Krakowski et al. (2016) to test the completeness, since different magnitudes stand for different stellar and galactic types and distances. The binnings are 12 < W1 < 13, 13 < W1 < 14, 14 < W1 < 15, and 15 < W1 < 16. The completenesses for stars, galaxies, and QSOs are similar to those calculated without binning. For the uniform samples, the lower performance of the galaxy sample is probably due to the relatively high contamination from the QSO sample, rather than to information lost from the galaxy sample.

3.2. Blind Test

This section describes various tests using the classifier made from the LAMOST, SDSS, and Veron13. These tests allow us to quantify the performance of the classification.

3.2.1. 6dF Galaxy Survey

The 6dF Galaxy Survey (6dFGS) has mapped the nearby universe over nearly half the sky (Jones et al. 2004, 2009). The final redshift release of the 6dFGS contains 124,647 spectrally identified galaxies. We match the galaxies with the SDSS and WISE archive data, which yields 12,300 galaxies. We then remove the galaxies that were used to build the classifier, leaving 8382 galaxies. We feed the classifier with the nine colors of these galaxies and obtain the predicted types. The classifier outputs three scores for each entry, corresponding to the probabilities of being a star, a galaxy, and a QSO. The type with the largest score is adopted as the predicted type. The classifier also outputs the standard deviation (σ) of each score.

About 99.5% of the galaxies are classified correctly, and 40 galaxies are incorrectly classified as stars. Here, all the predicted QSOs are treated as galaxies, since there is no QSO subtype in the 6dFGS. The classification result is shown in Figure 5. The scores of the correctly classified galaxies are larger than those of the incorrectly classified ones. About 73% of the correctly classified galaxies have σ < 0.2, while only 55% of the incorrectly classified galaxies do. This indicates that the classifier is very uncertain about the types of the incorrectly classified galaxies.
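In an RF, a per-object score and its standard deviation can be obtained from the spread of the individual trees' class probabilities. A scikit-learn sketch on synthetic data (the real classifier and sample are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=9, n_informative=5,
                           n_classes=3, random_state=1)
clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# per-tree class probabilities for one object: shape (n_trees, n_classes)
per_tree = np.array([t.predict_proba(X[:1])[0] for t in clf.estimators_])

score = per_tree.mean(axis=0)   # the classifier's score for each class
sigma = per_tree.std(axis=0)    # spread of the trees' votes: large = uncertain
print("scores:", np.round(score, 3), " predicted class:", score.argmax())
print("sigma :", np.round(sigma, 3))
```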

Figure 5. The classification result of the 6dFGS. Left panel: the distribution of the scores for correctly classified galaxies (green) and incorrectly classified galaxies (blue); a score of one means 100%. Right panel: the distribution of the scores' standard deviation.

3.2.2. RAdial Velocity Extension (RAVE)

RAVE is designed to provide stellar parameters to complement missions that focus on obtaining radial velocities to study the motions of stars in the Milky Way's thin and thick disk and stellar halo (Steinmetz et al. 2006). Its fifth data release (DR5) contains 457,588 stars in the southern sky (Kunder et al. 2017). There are 935 stars also observed by SDSS and WISE. We remove the stars used to build the classifier and end up with 737 stars. We feed the classifier with the nine colors, and it yields 736 stars and one QSO, an accuracy of 99.9%. The incorrectly classified star, SDSS J154142.28-194513.1, is located in the bright halo of κ Lib. Its colors are probably polluted by the bright star.

We also take advantage of the g, r, and i magnitudes from APASS (Munari et al. 2014), which have been matched with the RAVE stars (Kunder et al. 2017). Not all of the stars are detected in both WISE and APASS; there are 435,012 stars with seven valid colors. The prediction contains 434,735 stars, 264 galaxies, and 13 QSOs, an accuracy of 99.9%. The classification result is shown in Figure 6. The incorrectly classified stars have smaller scores and larger σ, implying a high uncertainty in their types.

Figure 6. The classification results of APASS-RAVE. Left panel: the distribution of scores for correctly classified stars (red) and incorrectly classified stars (blue). Right panel: the distribution of the scores' standard deviation.

3.2.3. UV-bright Quasar Survey (UVQS)

The first data release of the all-sky UVQS contains 1055 QSOs selected from GALEX and WISE photometry and identified with optical spectra (Monroe et al. 2016). We cross identify the QSOs with SDSS and WISE, which yields 262 QSOs. We remove the QSOs used to build the classifier, which leaves 237 QSOs. The classifier yields 224 QSOs, 12 galaxies, and one star, an accuracy of 94.5%. Again, the incorrectly classified QSOs show smaller scores and larger σ (Figure 7). The accuracies of the blind tests are summarized in Table 3.

Figure 7. The classification result of the UVQS. Left panel: the distribution of scores for correctly classified QSOs (green) and incorrectly classified QSOs (blue). Right panel: the distribution of the scores' standard deviation.

Table 3.  The Accuracies of the Blind Tests

Survey   Accuracy
6dFGS    99.5%
RAVE     99.9%
UVQS     94.5%

4. Effective Temperature Regression

We need additional information on stars, after separating them from galaxies and QSOs. The stellar spectral classification organizes vast quantities of diverse stellar spectra into a manageable system and has served as the fundamental reference frame for studies of stars for over 70 years (Gray & Corbally 2009). In this section, we present the method and tests of our regression.

4.1. Method

The LAMOST 1D pipeline only provides rough classification results, and the accuracy of the subclasses is still not robust (Jiang et al. 2013). Therefore, we instead adopt the effective temperatures (Teff) from the A-, F-, G-, and K-type star catalog, which was produced by the LAMOST stellar parameter pipeline (LASP; Wu et al. 2014). We also extract the Teff computed with the SEGUE Stellar Parameter Pipeline (SSPP; Allende Prieto et al. 2008; Lee et al. 2008a, 2008b) in the SDSS. Both samples are dominated by G stars and have similar distributions of Teff (Figure 8).

Figure 8. The normalized distributions of Teff. The green bars show the LAMOST training sample, and the blue bars show that of the SDSS.

In the classification (Section 3), RF showed advantages in both accuracy and training time. Here, we also adopt the RF algorithm to build the regression of stellar effective temperature.

These temperatures and seven colors, g − r, r − i, i − J, J − H, H − K, K − W1, and W1 − W2, of 1,327,071 stars are used to train the RF for regression. We apply 10-fold cross validation in order to test the performance of the regression. The cross validation partitions the sample into ten randomly chosen folds of roughly equal size. One fold is used to validate the regression that is trained on the remaining folds, and this process is repeated ten times so that each fold is used exactly once for validation.
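The 10-fold scheme can be sketched as follows; the seven-color sample and the toy temperature law are synthetic stand-ins for the real catalog:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n = 3000                                 # stand-in for the ~1.3 million stars
X = rng.normal(size=(n, 7))              # seven colors: g-r ... W1-W2
# toy temperature law: one dominant color plus 50 K of noise (illustrative only)
teff = 5500.0 + 800.0 * X[:, 0] + 50.0 * rng.normal(size=n)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
folds = KFold(n_splits=10, shuffle=True, random_state=0)  # ten random folds
pred = cross_val_predict(reg, X, teff, cv=folds)          # each star predicted
                                                          # exactly once
resid = pred - teff
print("offset %.1f K, scatter %.1f K" % (resid.mean(), resid.std()))
```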

We present the results of the cross validation in Figure 9. The one-to-one correlation is shown in the left panel. In order to estimate the uncertainty of the prediction, we bin the predicted Teff with a step size of 100 K and fit the distribution of the corresponding test Teff with a Gaussian function. We calculate the root-sum square of the standard deviation and the offset of the fit, which is adopted as the uncertainty of the prediction (the blue error bars in Figure 9). The Gaussian fit to the total residuals is shown in the right panel of Figure 9, and the fitted offset (μ) and σ are listed in Table 4. The red bars in Figure 3 are the importance estimates for the regression. The optical and 2MASS colors are much more important than the WISE colors, in contrast with the classification. This may be because the majority of our sample is G- and K-type stars.
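The Gaussian fit to the residuals and the root-sum-square uncertainty can be sketched as follows, here applied to synthetic residuals drawn to mimic the cross-validation statistics:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, mu, sigma):
    return a * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
# toy residuals (predicted - test Teff), drawn near the Table 4 values
resid = rng.normal(-27.0, 136.0, 50000)

# histogram of the residuals, then a least-squares Gaussian fit to the counts
counts, edges = np.histogram(resid, bins=100)
centers = 0.5 * (edges[:-1] + edges[1:])
(a, mu, sigma), _ = curve_fit(gaussian, centers, counts,
                              p0=[counts.max(), 0.0, 100.0])

# root-sum square of offset and width, adopted as the prediction uncertainty
uncertainty = np.hypot(mu, sigma)
print("mu = %.0f K, sigma = %.0f K, rss = %.0f K"
      % (mu, abs(sigma), uncertainty))
```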

Figure 9. The one-to-one correlation of the regression (left panel). The blue error bars stand for the root-sum square of the standard deviation and the offset of the Gaussian fit in the bins of the predicted Teff. The color bar stands for the color of the density contour in the log scale. The Gaussian fit (red) of the total residual (black) is shown in the right panel.

Table 4.  The Gaussian Fits to the Total Residuals

                    μ (K)      σ (K)
Cross Validation    −27 ± 2    136 ± 2
RAVE                −93 ± 3    175 ± 3
APOGEE              −36 ± 2    182 ± 2

4.2. Blind Test

In this subsection, we use the Teff extracted from the spectrum-based methods to test the actual performance of the regression.

4.2.1. RAVE

The RAVE pipeline processes the RAVE spectra and derives estimates of Teff, log g, and [Fe/H] (Kunder et al. 2017). The pipeline is based on the combination of the MATrix Inversion for Spectral SynthEsis algorithm (Recio-Blanco et al. 2006) and the DEcision tree alGorithm for AStrophysics (Bijaoui et al. 2012). The pipeline is valid for stars with temperatures between 4000 and 8000 K. The estimated errors in Teff are approximately 250 K, and ∼100 K for spectra with S/N ∼ 50 (Kunder et al. 2017).

We adopt the photometry from APASS and WISE in the RAVE database to construct the input colors. The sample is restricted to S/N ≳ 50 and a quality flag Algo_Conv ≠ 3 or 4. There are 165,011 stars left. We present the prediction results in Figure 10 and list the parameters of the Gaussian fit to the total residuals in Table 4.

Figure 10. The one-to-one correlation of the stars in the APASS-RAVE (left panel). The Gaussian fit (red) of the total residual (black) is shown in the right panel.

4.2.2. APOGEE

The Apache Point Observatory Galactic Evolution Experiment (APOGEE), one of the programs in SDSS-III, has collected high-resolution (R ∼ 22500) high signal-to-noise (>100) near-infrared (1.51–1.71 μm) spectra of 146,000 stars across the Milky Way (Majewski et al. 2017). These stars are dominated by red giants selected from the 2MASS. Their stellar parameters and chemical abundances are estimated by the APOGEE Stellar Parameters and Chemical Abundances Pipeline (García Pérez et al. 2016). The typical error in Teff is ∼100 K (Mészáros et al. 2013).

We extract the photometric data of SDSS and WISE with the help of CasJobs. We feed the RF regression with the seven colors of 13,685 stars. The prediction is shown in Figure 11, and the parameters of the fit to the total residuals are listed in Table 4.

Figure 11. The one-to-one correlation of the stars in the APOGEE (left panel). The Gaussian fit (red) of the total residual (black) is shown in the right panel.

We find that the offsets of the validation and the predictions are less than 100 K, and the standard deviations are less than 200 K (Figures 10 and 11). Lee et al. (2015) applied the SSPP to LAMOST stars and compared the results to those from the RAVE and APOGEE catalogs. The offsets of Teff between different pipelines range from 36 to 73 K, and the standard deviations range from 79 to 172 K. This indicates that our RF regression can determine stellar temperatures with fair accuracy.

5. Discussion

Machine learning has been adopted as a successful alternative approach for determining reliable object classes, stellar types, and types of variable stars (e.g., Kovács & Szapudi 2015; Liu et al. 2015; Krakowski et al. 2016; Kuntzer et al. 2016; Pashchenko et al. 2018; Sarro et al. 2018). This is not the first time the technology has been used to classify objects or to regress stellar parameters. In this section, we compare our classification and regression to the results of other studies.

5.1. Comparisons with Other Machine-learning Methods

Ball et al. (2006) applied the supervised decision tree algorithm to classify the stars and galaxies in SDSS DR3. They used the colors u − g, g − r, r − i, and i − z of 477,068 objects with spectroscopic attributes to train the machine-learning classifier, and they performed cross validation to test the performance. The accuracy and completeness were over 90%. In addition to the optical colors, IR colors are included in our multi-color data set, since they have shown considerable importance in machine-learning methods (Henrion et al. 2011). Our larger training sample and the IR-aided color set result in a better performance of our classifier: over 99% for the star and galaxy classification. We also test the performance of some decision tree algorithms, for which the accuracies are ∼98%. Compared to a single decision tree, the random forest avoids overfitting to the training set and limits the error from the bias (Hastie et al. 2008).

Krakowski et al. (2016) used the SVM learning algorithm to classify WISE × SuperCOSMOS objects based on the SDSS spectroscopic sources. The training sample included over one million objects, 95% of which were galaxies, 2% stars, and 3% QSOs. They used six parameters: W1, W3, W1 − W2, R − W1, B − R, and w1mag13. A 10-fold cross validation was performed to test the classifier, and the total accuracy was 97.3%. Instead of magnitudes, we adopt colors, which are independent of distance. Our training sample shows better compositional balance, and its size is three times larger than theirs. We also try some SVM algorithms; the accuracies range from 70% to 99%. The time cost to build the SVM classifier is much longer than that for the RF classifier. For a classification problem, RF gives the probability of belonging to each class (Breiman 2001), while SVM relies on the concept of "distance" between points and needs more time to calculate. The RF algorithm also shows better performance than SVM in other fields (e.g., Liu et al. 2013).

Liu et al. (2015) employed an SVM-based classification algorithm to predict MK classes with 27 line indices measured from a small sample of 3134 LAMOST stellar spectra. A 50% holdout validation was performed to test the accuracy of the classifier. The completeness of A and G stars reached 90%, while that of other types was below 80%. Since the spectral features of different types can be very similar, clear cuts on these features probably lead to misclassification. We therefore adopt the regression of Teff rather than the MK classification in order to avoid such an effect. The research of Liu et al. (2015) also implies that a large sample could cover a larger area of the parameter space and further yield more reliable predictions.

Sarro et al. (2018) constructed regression models to predict the Teff of M stars with eight machine-learning algorithms. The training sample was built with features extracted from the BT-Settl grid of synthetic spectra. The models were then applied to two sets of real spectra from the NASA Infrared Telescope Facility and Dwarf Archives collections. Sarro et al. (2018) used the root mean/median square errors (RMSE/RMDSE) to describe the prediction errors. The RMSEs were from 160 to 390 K, and the RMDSEs from 90 to 220 K, varying with the algorithm and the S/N. Our prediction for A, F, G, and K stars gives similar results: RMSE/RMDSE (RAVE) = 246/140 K and RMSE/RMDSE (APOGEE) = 247/130 K, implying that our regression built with photometric data can achieve accuracy similar to that of the spectrum-based model.
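The two error measures are simple to compute; the Teff values below are hypothetical:

```python
import numpy as np

def rmse(pred, true):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

def rmdse(pred, true):
    """Root median square error: robust to the tails of the residuals."""
    return np.sqrt(np.median((np.asarray(pred) - np.asarray(true)) ** 2))

# hypothetical true and predicted Teff values (K)
true = np.array([5200.0, 5800.0, 6100.0, 4900.0, 5500.0])
pred = np.array([5350.0, 5750.0, 6180.0, 4600.0, 5540.0])
print("RMSE  = %.0f K" % rmse(pred, true))    # 157 K: pulled up by the 300 K outlier
print("RMDSE = %.0f K" % rmdse(pred, true))   # 80 K: the median ignores it
```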

5.2. SED Fit

Another way to determine Teff is fitting stellar SEDs with synthetic templates. A theoretical study concluded that broad-band photometry from the UV to the mid-IR allows atmospheric parameters and interstellar extinction to be determined with good accuracy (Allende Prieto 2016). That study used SEDs extracted from the ATLAS9 model atmospheres (Mészáros et al. 2012) and added interstellar extinction to these SEDs in order to construct the theoretical templates. The test SEDs were also extracted from the ATLAS9 model, with some random noise added, and were then fitted with the templates using the χ2-optimization method. The standard deviations of the total residuals ranged from 130 to 380 K, depending on the bands used for the fitting. We follow this procedure to fit 10^5 simulated SEDs extracted from the BT-Cond theoretical model (Baraffe et al. 2003; Barber et al. 2006; Allard & Freytag 2010). Since the simulated SEDs have random stellar parameters, about 70,000 SEDs are located inside the reasonable ranges. The result is shown in the upper panels of Figure 12. We only plot the ∼64,000 samples with χ2 ≲ 5.88, which is one standard deviation for a χ2 distribution with five degrees of freedom (Teff, log g, [Fe/H], E(B − V), and the scaling factor). The residuals are fitted with a Gaussian-like function:

Equation (4)

The standard deviation of Teff is 207 ± 15 K for the 12-band fit (FUV, NUV, ugriz, JHK, W1, and W2). The standard deviations of the other parameters are also similar to the results in Allende Prieto (2016), indicating that, in theory, the multi-band SED fit can constrain the atmospheric parameters and interstellar extinction well.
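The core of the χ2-optimization step can be sketched as follows. This is a minimal brute-force version with a mock random template grid standing in for the BT-Cond SEDs; the multiplicative scale factor (the fifth degree of freedom above) is solved analytically for each template by setting dχ2/ds = 0:

```python
import numpy as np

def chi2_best_template(obs_flux, obs_err, templates):
    """Brute-force chi^2 template fit with an analytic best-fit scale factor.

    obs_flux, obs_err : (n_bands,) observed SED and its uncertainties
    templates         : (n_templates, n_bands) synthetic SEDs
    Returns the index of the best-fitting template and its chi^2.
    """
    w = 1.0 / obs_err**2
    # Optimal multiplicative scale for each template (from d chi^2 / d s = 0).
    s = (templates * obs_flux * w).sum(axis=1) / (templates**2 * w).sum(axis=1)
    resid = obs_flux - s[:, None] * templates
    chi2 = (resid**2 * w).sum(axis=1)
    best = int(np.argmin(chi2))
    return best, float(chi2[best])

rng = np.random.default_rng(0)
grid = rng.uniform(1.0, 2.0, size=(50, 12))   # mock 12-band template grid
truth = 3.0 * grid[7]                         # "observed" star: template 7, scaled
err = np.full(12, 0.05)
obs = truth + rng.normal(0.0, 0.05, size=12)  # add photometric noise
idx, chi2 = chi2_best_template(obs, err, grid)
```

In the real procedure the grid axes are Teff, log g, [Fe/H], and $E(B-V)$, so the index of the minimum-χ2 template maps back to the recovered atmospheric parameters.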

Figure 12.

Figure 12. Results of the theoretical simulation (upper panels) and of the application to the LSS-GAC catalog (lower panels). The color bars indicate the colors of the density contours. We use the Gaussian-like function (red line) to fit the total residuals of the theoretical simulation, and a Gaussian function to fit the residuals of the stars in the LSS-GAC catalog.


However, the result is worse than expected (the lower panels in Figure 12) when we apply the SED templates to fit observed SEDs, namely the stars in the LAMOST Spectroscopic Survey of the Galactic Anticentre (LSS-GAC; Liu et al. 2014; Yuan et al. 2015). The standard deviation of Teff is 454 ± 5 K and the offset is −365 ± 5 K, larger than those of the machine-learning regression by a factor of about three. We also try this technique in other ways, e.g., fitting the stars in RAVE or using a 10-band fit. The standard deviations of Teff are about 400 K, much worse than both the theoretical simulation and the machine-learning regression. This implies that the atmospheric parameters of observed stars cannot be well estimated by SED fitting with χ2 minimization; based on photometric data, machine learning shows better performance on the Teff estimate.

5.3. A Scientific Application

The ESA space mission Gaia is performing an all-sky survey at optical wavelengths, and its primary objective is to survey more than one billion stars (Gaia Collaboration et al. 2016). Its second data release (Gaia DR2; Gaia Collaboration et al. 2018) includes ∼1.3 billion objects with valid parallaxes. These parallaxes are obtained with a complex iterative procedure involving various assumptions (Lindegren et al. 2012). Such a procedure may produce parallaxes for galaxies and QSOs, which should show no significant parallax (Liao et al. 2018).

We have applied the classifier to 85,613,922 objects in Gaia DR2 based on the multi-wavelength data from Pan-STARRS and WISE (Bai et al. 2018a). The result shows that the sample is dominated by stars (∼98%), with galaxies and QSOs making up the remaining ∼2%. For the objects with negative parallaxes, about 2.5% are galaxies and QSOs. About 99.9% of the sample comprises stars if the relative parallax uncertainties are smaller than 0.2, implying that the threshold 0 < σπ/π < 0.2 yields a very clean stellar sample (Bai et al. 2018a).
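The parallax-quality cut quoted above is simple to apply to any astrometric catalog. A minimal sketch, using mock column values in place of the actual Gaia DR2 `parallax` and `parallax_error` fields:

```python
import numpy as np

def clean_star_mask(parallax, parallax_error, max_rel_err=0.2):
    """Select sources satisfying 0 < sigma_pi / pi < max_rel_err,
    i.e. positive parallax with small relative uncertainty."""
    parallax = np.asarray(parallax, dtype=float)
    parallax_error = np.asarray(parallax_error, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        rel = parallax_error / parallax
    return (parallax > 0) & (rel > 0) & (rel < max_rel_err)

# Mock catalog: two good stars, one negative parallax, one noisy parallax.
plx = np.array([5.0, 1.0, -0.3, 0.4])      # mas
plx_err = np.array([0.1, 0.15, 0.2, 0.2])  # mas
mask = clean_star_mask(plx, plx_err)
```

The negative-parallax source and the source with σπ/π = 0.5 are both rejected, which is exactly how the cut removes the contaminating galaxies and QSOs discussed above.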

6. Summary and Future Work

In this work, we have classified objects into stars, galaxies, and QSOs, and further regressed the effective temperatures of stars using machine learning, with RF as the algorithm of choice. The classifier is trained with about three million objects from SDSS, LAMOST, and the Véron catalog, and the regression is trained with one million stars from SDSS and LAMOST. To examine the performance of the classifier, we perform three blind tests using objects spectroscopically identified in RAVE, 6dFGS, and UVQS. The total accuracies are over 99% for RAVE and 6dFGS, and higher than 94% for UVQS. We also perform two blind tests for the regression using the stellar Teff estimated with spectroscopic pipelines in RAVE and APOGEE. The offsets and the standard deviations of the total residuals are below 100 K and 200 K, respectively.

Our classifier shows high accuracy compared with other machine-learning algorithms in previous studies, indicating that combining broad-band photometry from the optical to the mid-infrared allows classification with very high accuracy. Machine learning provides an efficient approach to determining the classes of huge numbers of objects with photometric data, e.g., the over 400 million objects in the SDSS-WISE matched catalog.

Since there is no clean color or spectral-feature cut between the different spectral types, we adopt the Teff regression rather than the MK classification to provide further basic information on the stars. Our regression shows similar or even better performance than the SED χ2 minimization and some spectrum-based methods. The RF regression enables us to estimate Teff without spectral data for stars that are too numerous or too faint for spectroscopic observation, or for stars in large-area time-domain surveys (e.g., the Pan-STARRS1 survey; Chambers et al. 2016).
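The photometric Teff regression described above can be sketched with scikit-learn's RandomForestRegressor. This is an illustrative toy, not the authors' MATLAB implementation: the mock "colors" and the linear Teff relation below are invented stand-ins for the paper's seven-color training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Mock training set: 7 colors per star and a Teff label loosely tied
# to the first two colors, plus noise (all values are synthetic).
rng = np.random.default_rng(1)
colors = rng.normal(0.0, 1.0, size=(2000, 7))
teff = (5500.0 + 800.0 * colors[:, 0] - 300.0 * colors[:, 1]
        + rng.normal(0.0, 50.0, size=2000))

# Train on the first 1500 stars, evaluate residuals on the rest.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(colors[:1500], teff[:1500])
pred = rf.predict(colors[1500:])
resid = pred - teff[1500:]
```

In the real application the feature vector is the observed broad-band colors and the label is the pipeline Teff from the LAMOST-SDSS spectroscopic training sample; the blind-test residuals then yield the offsets and standard deviations quoted in the summary.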

We will test regressions for other stellar parameters with machine-learning algorithms. We will also try to decouple the effective temperature and the interstellar extinction based on a large sample, such as LAMOST-SDSS-Gaia. Future well-controlled samples, e.g., LAMOST-II and SDSS-V (Kollmeier et al. 2017), will also provide an opportunity to explore the multi-dimensional parameter space with this technique for classification and regression.

The machine-learning results in this work are developed with MATLAB and are available upon request to the first author as MAT files.

We are grateful to Stephen Justham for valuable discussions. This work was supported by the National Program on Key Research and Development Project (grant No. 2016YFA0400804) and the National Natural Science Foundation of China (NSFC) through grants NSFC-11603038/11333004/11425313/11403056. Some of the data presented in this paper were obtained from the Mikulski Archive for Space Telescopes (MAST). STScI is operated by the Association of Universities for Research in Astronomy, Inc., under NASA contract NAS5-26555. Support for MAST for non-HST data is provided by the NASA Office of Space Science via grant NNX09AF08G and by other grants and contracts.

The Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project, which was built by the Chinese Academy of Sciences, funded by the National Development and Reform Commission, and operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences.

Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS website is http://www.sdss.org/.

SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

Footnotes

  • CPU: i7-3770 @ 3.40 GHz, five workers for the parallel computing.

  • Signal-to-noise ratio in the RAVE database.

  • The quality flag in the RAVE catalog, see Section 6.1 in Kunder et al. (2017) for detail.

  • For example, the Gaussian kernel SVM classifier has the highest accuracy among the SVM algorithms but takes over 10 times longer to train than the RF classifier.

  • Some studies have shown that the UV emission is from higher regions of the stellar atmosphere, which leads to discrepancies between observations and the theoretical models (Bai et al. 2018b).
