Estimation of Stellar Atmospheric Parameters with Light Gradient Boosting Machine Algorithm and Principal Component Analysis

In this paper, we propose a new method to estimate stellar atmospheric parameters from photometric data, based on principal component analysis (PCA) and the light gradient boosting machine (LightGBM) algorithm. We first use PCA to transform the photometric data in all bands (u, v, g, r, i, and z) and then use LightGBM to estimate the stellar atmospheric parameters. The experimental results show that the root mean square errors of the method for the effective temperature, surface gravity, and metallicity are 90 K, 0.40 dex, and 0.20 dex, respectively. We then compare PCA + LightGBM with original photometric data (OPD) + LightGBM and color index data (CID) + LightGBM. The results show that PCA + LightGBM outperforms both CID + LightGBM and OPD + LightGBM, and that it solves the problems of model instability and inaccurate estimates caused by using OPD or CID directly as input to LightGBM. We believe the new features obtained by PCA can be applied to the photometric data to be collected by the Chinese Space Station Telescope.


Introduction
A main task of modern astrophysics is to find out when and how galaxies form and evolve. Our own galaxy, the Milky Way, offers a unique opportunity to study galaxies in considerable detail by measuring and analyzing the properties of its stars (Jurić et al. 2008). The principal quantities in such stellar analysis are the stellar atmospheric parameters, which must be measured accurately: effective temperature (T_eff), surface gravity (log g), and metallicity ([Fe/H]). The photometric method is widely used to estimate these parameters.
Effective temperature is relatively easy to estimate compared with the other stellar atmospheric parameters and has been widely considered in previous studies. Strömgren (1956) and Crawford (1958) proposed the uvbyβ photometric system, which yielded four commonly used color indices, namely b − y, m_1, c_1, and β, and accurately calculated the atmospheric parameters (T_eff, log g, and [Fe/H]) for stars of different spectral types. Magain (1987) used the InfraRed Flux Method (IRFM) to establish relations among B − V, b − y, V − K, and T_eff, which can be used to estimate the T_eff of dwarf and subgiant stars. Alonso et al. (1996) studied a sample of 475 dwarf and subdwarf stars using the IRFM and obtained their T_eff with an average accuracy of approximately 1.5%. Alonso et al. (1999) applied the IRFM to a sample of approximately 500 giant stars to derive their T_eff with an internal mean accuracy of approximately 1.5% and a maximum uncertainty in the zero point of the order of 0.9%. They also established a relation among T_eff, color, and [Fe/H]. Holmberg et al. (2007) consolidated the calibrations of uvbyβ photometry into T_eff, [Fe/H], distance, and age for F and G stars; for different b − y ranges, they established the relation between b − y and T_eff, with an accuracy of approximately 60 K. Casagrande et al. (2010) presented an IRFM investigation that used a large set of solar twins to set the absolute zero point of the T_eff scale to within a few degrees. They also established a color-T_eff relation for [Fe/H] ranging from −5 to 0.4. For luminosity classes IV and V (dwarfs) and classes II and III (giants), Huang et al. (2015) devised an empirical, metallicity-dependent calibration between T_eff and color. The calibration was based on direct T_eff measurements, accurate to better than 2.5%, of approximately 200 nearby stars. The T_eff range of the calibration was 3100-10,000 K for dwarf stars of spectral types M5 to A0 and 3100-5700 K for giant stars of types K5 to G5.
Estimating [Fe/H] from photometric information has also been widely studied. Olsen (1984) discussed four-color uvby photometry for several hundred late-type dwarf stars of types G, K, and M; a relation between uvby photometry and [Fe/H] was established, with an accuracy of approximately 0.17 dex. Nissen (1981) performed photoelectric measurements of F0 − G2 giant stars and determined the [Fe/H] of 179 stars. Holmberg et al. (2007) revised the relation between [Fe/H] and uvby photometry established in previous studies and obtained a more elaborate relation among [Fe/H] and b − y, m_1, and c_1, with an accuracy of 0.12 dex. Jurić et al. (2008) used more than 6000 FG stars from the Sloan Digital Sky Survey (SDSS) to perform a polynomial fit of [Fe/H] against u − g and g − r, with an accuracy comparable to spectroscopic results.
Most of the above studies use color index data to determine the stellar parameters. However, there are two obstacles to extending the use of color index data (CID) to large-scale prediction of stellar atmospheric parameters. First, these features are strongly correlated with each other, which causes model instability; as a result, applying previous models to different types of stars is difficult. Second, not all bands are fully considered. For example, it is commonly assumed that only u − g and g − r are sensitive to [Fe/H]; however, our experiments show that a weighted combination of all bands carries information about the [Fe/H] of stars.
Considering the aforementioned problems, we first use principal component analysis (PCA) to construct features and eliminate the correlation among variables, and then employ the light gradient boosting machine (LightGBM) algorithm to estimate the stellar atmospheric parameters. The experimental results show that the PCA + LightGBM method can estimate the stellar parameters accurately and outperforms linear regression (LR), support vector regression (SVR), artificial neural networks (ANNs), and other tree-based algorithms (such as XGBoost and random forest).
The paper is organized as follows. We introduce the LightGBM algorithm and PCA in Sections 2 and 3, respectively. We then introduce the data in Section 4. We compare various experiments in Section 5 and detail the feature importances extracted for the stellar atmospheric parameters. Finally, we summarize the study in Section 6.

Method: LightGBM
LightGBM is a gradient boosting decision tree (GBDT) algorithm with gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). It uses a histogram-based algorithm to find the best split points. The GOSS and EFB techniques are introduced to handle large numbers of data instances and features, respectively, while the histogram-based algorithm reduces the number of candidate split points and accelerates training. Hence, LightGBM mainly addresses the slowness of traditional GBDT when training on massive data. For our experiment, since the feature dimension is very low and the sample size is relatively large, we use only GOSS and the histogram-based algorithm. We briefly introduce GBDT in Section 2.1 and then explain GOSS and the histogram-based algorithm in Sections 2.2 and 2.3, respectively. A detailed introduction to EFB can be found in Ke et al. (2017).

GBDT
In the field of machine learning, a learner whose generalization performance is slightly better than random guessing is called a weak learner. GBDT is an ensemble algorithm that takes regression decision trees as weak learners. The regression decision trees of GBDT are built sequentially, and the resulting structure is shown in Figure 1. In this section, we briefly introduce the GBDT training procedure; for more detail, we refer the reader to Friedman (2001).
Suppose that (x_i, y_i), i = 1, 2, …, N, represents the data set, where x_i ∈ R^n is the magnitude data vector corresponding to the n bands, and y_i contains the stellar atmospheric parameters corresponding to x_i. We refer to y_i as the target value of x_i. Given an input x_i, the output of GBDT is

F(x_i) = Σ_{m=1}^{M} h_m(x_i), (1)

where h_m is the mth regression decision tree, h_m(x_i) denotes the predicted value of the mth regression decision tree for x_i, and the constant M is the number of regression decision trees. GBDT takes the following steps to obtain Equation (1):

1. Initialize the model as f_0(x_i) = 0.
2. For each sample in the sample set O, compute the residual g_m(x_i) = y_i − f_{m−1}(x_i).
3. Fit the mth regression decision tree h_m to the residuals (x_i, g_m(x_i)).
4. Update the model: f_m(x_i) = f_{m−1}(x_i) + h_m(x_i).

In the four aforementioned steps, m ranges from 1 to M in turn. Finally, a total of M regression decision trees are created, and the prediction value is obtained through the following formula:

F(x_i) = f_M(x_i) = Σ_{m=1}^{M} h_m(x_i).

Many decision-tree-based ensemble algorithms differ in the third step. GBDT adopts a simple method to build h_m in the third step: each region (parent node) is recursively divided into two regions (left and right child nodes) over the sample set O, and the output value of each region is determined by performing the following three steps to construct a binary decision tree.
1. Starting from the root node, select the optimal splitting variable j and the optimal splitting point d (a value of variable j) by minimizing the squared error between the predicted and true values of the two regions (left and right child nodes):

min_{j,d} [ min_{c_1} Σ_{x_i ∈ R_1(j,d)} (g_m(x_i) − c_1)² + min_{c_2} Σ_{x_i ∈ R_2(j,d)} (g_m(x_i) − c_2)² ], (2)

where R_1(j, d) = {x_i | x_i^j ≤ d} and R_2(j, d) = {x_i | x_i^j > d} are the left and right child nodes, and c_1 and c_2 are the predicted values of the left and right child nodes, respectively. 2. After the first step, the left and right child nodes become the new parent nodes of the next split; split each new parent node according to the first step. 3. Repeat the first and second steps until h_m is established.
After the three aforementioned steps are performed, h_m is established, as shown in Figure 1. The residuals of the samples in set O differ in magnitude: samples with small residuals have already converged to their true values, and fitting them again leads to overfitting of the GBDT and increases the computational cost of Equation (2). To improve performance, GOSS and the histogram-based algorithm were proposed; they are described in the following subsections.
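To make the four steps concrete, the following is a minimal sketch of squared-loss GBDT in Python, using scikit-learn's DecisionTreeRegressor as the weak learner. The function names and the defaults M = 100 and max_depth = 3 are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=100, max_depth=3):
    """Squared-loss GBDT: each tree h_m is fitted to the residuals
    g_m(x_i) = y_i - f_{m-1}(x_i) left by the previous ensemble."""
    f = np.zeros(len(y))                 # step 1: initialize f_0 = 0
    trees = []
    for m in range(M):
        residuals = y - f                # step 2: residuals g_m(x_i)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f += h.predict(X)                # steps 3-4: fit h_m and update f_m
        trees.append(h)
    return trees

def gbdt_predict(trees, X):
    """Equation (1): F(x) = sum over m of h_m(x)."""
    return sum(h.predict(X) for h in trees)
```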

GOSS
GOSS is a sampling method that treats samples with different residual values (g_m(x_i)) differently. It balances reducing the amount of data against preserving the accuracy of GBDT by changing the second step of GBDT. The principle of GOSS is as follows:

1. Sort the absolute values of the residuals in descending order.
2. Record the data corresponding to the first a% of residuals as subset A; mark the remaining (1 − a)% of samples, those with the smallest residuals, as subset B.
3. Randomly sample b% of the data from subset B and record it as subset C. Each sample in subset C is assigned the weight (1 − a)/b, and the tree h_m is then trained on the union A ∪ C.

The above three steps replace the residual calculation in the second step of GBDT. Ke et al. (2017) proved that, by focusing on the samples with large residuals, GOSS can increase the generalization ability of the model relative to GBDT without losing accuracy.
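The following is a minimal sketch of the GOSS sampling step, assuming the residuals g_m(x_i) are available as a numpy array; the defaults a = 0.05 and b = 0.45 anticipate the values found to be optimal later in the paper.

```python
import numpy as np

def goss_sample(residuals, a=0.05, b=0.45, seed=0):
    """Gradient-based one-side sampling over the residuals g_m(x_i):
    keep the a% of samples with the largest |residual| (subset A),
    randomly draw b% of the full data from the remainder (subset C),
    and up-weight C by (1 - a) / b to keep the residual sum unbiased."""
    rng = np.random.default_rng(seed)
    n = len(residuals)
    order = np.argsort(-np.abs(residuals))      # descending |g_m(x_i)|
    n_top = int(a * n)
    A = order[:n_top]                           # subset A: large residuals
    B = order[n_top:]                           # subset B: remaining (1 - a)%
    C = rng.choice(B, size=int(b * n), replace=False)   # subset C
    idx = np.concatenate([A, C])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b               # re-weight subset C
    return idx, weights                         # h_m is then fitted on A and C
```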

Histogram-based Algorithm
The main cost in GBDT is finding the best split points. To alleviate this problem, Ke et al. (2017) used a histogram-based algorithm to bucket continuous variable values into discrete bins and used these bins to construct feature histograms during training. That is, the jth variable is divided into s bins in ascending order of x_i^j (i = 1, 2, …, N), and each bin (called a feature histogram) includes N/s samples. In this way, the candidate splitting points d are changed from the N sample values to the s bin boundaries. Finally, in the same way as Equation (2), we find the optimal splitting variable j and the optimal splitting point d among the bin boundaries, and we obtain the optimal h_m by minimizing the squared error of the two child nodes, where c_1 and c_2 are the predicted values of each node. This changes the third step of GBDT by reducing the number of candidate split points (from N to s), thereby achieving a speedup. Because s is much less than N, the model fits much faster than GBDT.
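Below is a minimal sketch of the binning step, assuming equal-frequency bins so that each bin holds about N/s samples; the default s = 255 mirrors LightGBM's default max_bin but is otherwise an illustrative choice.

```python
import numpy as np

def histogram_split_candidates(x_j, s=255):
    """Bucket a continuous variable into s equal-frequency bins and return
    the bin boundaries, which become the only candidate split points."""
    quantiles = np.linspace(0, 100, s + 1)[1:-1]     # interior percentiles
    return np.unique(np.percentile(x_j, quantiles))  # at most s - 1 candidates

# Instead of evaluating all N sample values as split points for variable j,
# the tree builder evaluates only the s bin edges:
x_j = np.random.randn(100_000)
candidates = histogram_split_candidates(x_j, s=255)
print(len(candidates))   # ~254 candidate splits instead of ~100,000
```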

PCA
In this section, we briefly introduce PCA; we refer to Abdi & Williams (2010) and Jolliffe & Cadima (2016) for more information. PCA is widely used in spectral classification and dimensionality reduction (Storrie-Lombardi et al. 1994; Singh et al. 1998; Bu et al. 2014), and it is also used in the data preprocessing of regression analysis to reduce collinearity and improve the stability of the regression model.
For consistency and simplicity, we suppose that the original data contain n variables, X = (x_1, x_2, …, x_n)^T, in which x_i (i = 1, 2, …, n) is called the ith band variable. PCA combines the original variables into new variables through the following linear combination:

F_i = a_i^T X, i = 1, 2, …, n, (3)

where F_i is the ith new variable (called the ith principal component) and a_i (i = 1, 2, …, n) is the coefficient vector to be calculated, which satisfies a_i^T a_i = 1. To make each principal component represent the potential structure of the original data, PCA requires that the variances of the principal components decrease in turn and that the covariance of different principal components be 0, which can be expressed mathematically as

var(F_1) ≥ var(F_2) ≥ … ≥ var(F_n), cov(F_i, F_j) = 0 (i ≠ j). (4)

Based on Equation (3), we obtain

var(F_i) = a_i^T var(X) a_i, cov(F_i, F_j) = a_i^T var(X) a_j. (5)

We calculate each principal component in turn, following the variance ordering in Equation (4). Therefore, F_1 is calculated first, and a_1 is obtained from Equation (5) by solving

max a_1^T var(X) a_1 subject to a_1^T a_1 = 1. (6)

According to Equation (6), we can calculate a_1 through the Lagrange multiplier method by optimizing

L(a_1, λ_1) = a_1^T var(X) a_1 − λ_1 (a_1^T a_1 − 1), (7)

where λ_1 is the Lagrange multiplier. Setting the derivative with respect to a_1 to zero gives var(X) a_1 = λ_1 a_1, so |var(X) − λ_1 I| = 0 and λ_1 is an eigenvalue of var(X). According to Equation (7), a_1 is the eigenvector of var(X) belonging to the eigenvalue λ_1. Thus,

var(F_1) = a_1^T var(X) a_1 = λ_1 a_1^T a_1 = λ_1. (8)

According to Equations (6)-(8), var(F_1) is the largest eigenvalue of var(X), and a_1 is the eigenvector corresponding to the largest eigenvalue of var(X). Similarly, a_i (i = 2, …, n) is the eigenvector corresponding to the ith largest eigenvalue of var(X). Substituting a_i into Equation (3), we obtain each principal component F_i. These new variables, (F_1, F_2, …, F_n), represent the potential structure of the original data, which makes it convenient to explore the data and improves the generalization ability of the model.
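The whole procedure amounts to an eigendecomposition of the covariance matrix, as in the following numpy sketch (illustrative, not the implementation used in the paper):

```python
import numpy as np

def pca(X):
    """PCA via eigendecomposition of the covariance matrix var(X).
    Rows of X are stars; columns are the n band variables."""
    Xc = X - X.mean(axis=0)                    # center each variable
    cov = np.cov(Xc, rowvar=False)             # var(X)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    a = eigvecs[:, order]                      # columns are a_1, ..., a_n
    F = Xc @ a                                 # principal components F_i = a_i^T X
    return F, eigvals[order], a                # var(F_i) equals the ith eigenvalue
```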

Data Introduction
In this section, we introduce the data sources and the preprocessing procedure. To simulate the future Chinese Space Station Telescope (CSST) and obtain similar data, we select the common stars of the Apache Point Observatory Galactic Evolution Experiment Data Release 16 (APOGEE DR16), the Stellar Abundance and Galactic Evolution survey (SAGE), and the Panoramic Survey Telescope and Rapid Response System Data Release 1 (Pan-STARRS DR1) to obtain u, v, g, r, i, and z-band photometric data and stellar atmospheric parameters (T_eff, log g, and [Fe/H]) for 37,979 stars.
APOGEE is one of the programs in the Sloan Digital Sky Survey III (Majewski et al. 2017) and Sloan Digital Sky Survey IV (Blanton et al. 2017); it is a large-scale stellar spectroscopic survey conducted in the near-infrared (H-band) portion of the electromagnetic spectrum. APOGEE is expected to observe more than 700,000 stars, mainly galactic red giants from all stellar populations, but also red giants in the Magellanic Clouds and other nearby dwarf galaxies, as well as numerous cool (FGKM) dwarf stars (Zasowski et al. 2013, 2017). The stellar atmospheric parameters in our experiment are taken from APOGEE DR16, which contains high-resolution (R ∼ 22,500), near-infrared (15140-16940 Å) spectra of approximately 430,000 stars covering both the northern and southern sky, from which radial velocities, stellar parameters, and chemical abundances of up to 26 species are determined (Jönsson et al. 2020).
The photometric data of the u and v bands are taken from the SAGE survey (Fan et al. 2018), which used the 2.3 m Bok telescope of the Steward Observatory of the University of Arizona to complete the u- and v-band observations, obtaining high-precision photometric data for approximately 500 million stars. The actual 100σ depths of the single-epoch images are u ∼ 17.3 and v ∼ 16.8 (Zheng et al. 2019). The SAGE system is designed to be more sensitive to stellar atmospheric parameters than uvbyβ, so the accuracy of its stellar parameter measurements could be better than that of broadband photometry.
The photometric data for the g, r, i, and z bands are taken from Pan-STARRS DR1. Located on Haleakala and dedicated to sky-survey observations, the Pan-STARRS system is a 1.8 m aperture, f/4.4 telescope illuminating a 1.4 Gpixel detector spanning a 3°.3 field of view (Tonry et al. 2012). The mean wavelengths of g, r, i, and z are 486.6, 621.5, 754.5, and 867.9 nm, respectively. Although the Pan-STARRS filter system has much in common with that used in previous surveys such as the SDSS (York et al. 2000), the g filter extends 20 nm redward of the SDSS g filter, paying the price of admitting the 5577 Å sky emission line in exchange for greater sensitivity and lower systematics in photometric redshifts. The z filter is cut off at 920 nm, giving it a response different from the detector-defined SDSS z band. Therefore, these bands are more sensitive to stellar atmospheric parameters. Chambers & Pan-STARRS Team (2018) describe the Pan-STARRS data in detail.
We first correct the measured photometric magnitudes in each band for interstellar reddening using E(B − V) from Schlegel et al. (1998), adopting reddening coefficients deduced from Schlafly & Finkbeiner (2011). After the reddening corrections, we use the following criteria to select data: 1. Delete the samples with missing values in any band. 2. Based on the distribution of magnitude errors, select samples whose magnitude errors are less than 0.1 mag in the u and v bands and less than 0.05 mag in the g, r, i, and z bands.
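A sketch of this preprocessing in pandas is shown below; the column names ("ebv", "<band>_err") and the numeric reddening coefficients are hypothetical placeholders, since the paper only cites Schlafly & Finkbeiner (2011) for the adopted values.

```python
import pandas as pd

BANDS = ["u", "v", "g", "r", "i", "z"]
# Placeholder reddening coefficients R_band; the paper adopts values deduced
# from Schlafly & Finkbeiner (2011), which are not reproduced here.
R = {"u": 4.2, "v": 3.8, "g": 3.3, "r": 2.3, "i": 1.7, "z": 1.3}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the reddening correction and the two selection criteria.
    Column names ('ebv', '<band>', '<band>_err') are hypothetical."""
    df = df.copy()
    for b in BANDS:
        df[b] = df[b] - R[b] * df["ebv"]          # m_0 = m - R_band * E(B - V)
    df = df.dropna(subset=BANDS)                  # criterion 1: no missing bands
    err_limit = {b: 0.1 if b in ("u", "v") else 0.05 for b in BANDS}
    for b in BANDS:                               # criterion 2: error cuts
        df = df[df[f"{b}_err"] < err_limit[b]]
    return df
```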
We obtain a data set containing 37,979 stars in total. The distribution of stellar atmospheric parameters of these stars is shown in Figure 2.
It is clear that our data set contains a large number of stars with high temperatures and high [Fe/H], and includes large numbers of both dwarf and giant stars. Therefore, the data cover a wide range of stellar atmospheric parameters, and the performance obtained on them can fully demonstrate the generalization ability of our method.

Data Experiment
In this section, we first use the original photometric data (OPD) + LightGBM and CID + LightGBM experiments to illustrate in detail the two problems faced by previous studies: the strong correlation between variables, and the failure of the color index method to consider weighted combinations of all bands. We then use a PCA + LightGBM experiment to analyze the performance improvement obtained after solving these problems.
We randomly divide the data into a training set (80%) and a test set (20%) and use the root mean square error (RMSE) as the performance metric of the model:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ), (9)

where ŷ_i is the predicted value for the ith star. We use the maximal information coefficient (MIC) to measure the dependence between variables, including both linear and nonlinear relations and even nonfunctional dependence (Reshef et al. 2011). MIC values range from 0 to 1; the closer the value is to 1, the stronger the relation between the variables, so MIC lets us understand the relations within the data before the experiment. We use the variance gain ratio as the feature importance of the jth variable,

I_j = Gain_j / Σ_k Gain_k, (10)

where Gain_j is the total variance gain of all tree splits on variable j; this is conducive to distinguishing the contributions of different variables to the estimation of the stellar atmospheric parameters. The experiments are carried out in a PyCharm 2019.3.1 environment running on a personal computer with an Intel Core i7 2.60 GHz CPU and 16 GB of memory.
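For reference, the following sketch shows how these quantities can be computed in Python; using the minepy package for MIC is our assumption, as the paper does not name its MIC implementation.

```python
import numpy as np
from minepy import MINE   # a MIC implementation (Reshef et al. 2011)

def rmse(y_true, y_pred):
    """Equation (9)."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mic(x, y):
    """MIC between two variables, in [0, 1]."""
    m = MINE(alpha=0.6, c=15)     # the reference implementation's defaults
    m.compute_score(x, y)
    return m.mic()

# Gain-based feature importance (Equation (10)), normalized to fractions,
# can be read off a fitted LightGBM model:
# gains = model.booster_.feature_importance(importance_type="gain")
# importance = gains / gains.sum()
```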

OPD + LightGBM
We first use the original magnitude data of the six bands as input to LightGBM to derive the stellar atmospheric parameters. The results on the test set are RMSE(T_eff) = 106 K, RMSE(log g) = 0.51 dex, and RMSE([Fe/H]) = 0.25 dex, as shown in Figure 3. The experimental results show that LightGBM cannot estimate the stellar atmospheric parameters well with this input. The main reasons are as follows: 1. The MIC values between the variables and the stellar atmospheric parameters are low; for example, as shown in subgraph A of Figure 4, the MIC between the u band and T_eff is small. 2. As shown in subgraph A of Figure 4, the MIC between the variables themselves is between 0.37 and 0.95, which is very strong. In Figure 5, the feature importances of T_eff and log g are dominated by the u and z bands; however, the MIC between T_eff and the v band is 0.21, which is greater than that with the z band (0.1). This inconsistency makes it difficult to estimate T_eff accurately, and the same is true for log g.
Therefore, we believe that the photometric data cannot be used directly as input to estimate the stellar atmospheric parameters. It is necessary to transform the variables appropriately to reduce the MIC between bands and to increase the MIC between each band and the stellar atmospheric parameters, making the model more stable.

CID + LightGBM
A color index is the difference in magnitude between any two bands of the same star. Color indices are strongly related to stellar parameters; the relation between T_eff and V − K is one example. Therefore, we also input the CID to estimate the stellar atmospheric parameters.
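A minimal sketch of constructing all pairwise color indices from the six band magnitudes follows; the column names are assumed to match the band letters.

```python
from itertools import combinations
import pandas as pd

BANDS = ["u", "v", "g", "r", "i", "z"]

def color_indices(df: pd.DataFrame) -> pd.DataFrame:
    """Build all C(6, 2) = 15 pairwise color indices from the band magnitudes."""
    out = pd.DataFrame(index=df.index)
    for b1, b2 in combinations(BANDS, 2):
        out[f"{b1}-{b2}"] = df[b1] - df[b2]
    return out
```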
The CID used in the experiment contains 15 variables (C(6, 2) = 15): u − v, u − g, u − r, u − i, u − z, v − g, v − r, v − i, v − z, g − r, g − i, g − z, r − i, r − z, and i − z. The results are shown in Figures 6 and 7. We find that the MIC between the new variables and the parameters (T_eff, log g) in subgraph B of Figure 4 is improved, which in turn reduces the RMSE of T_eff and log g on the test set by 9% and 12%, respectively (compared with the OPD method in Section 5.1). However, this method does not reduce the MIC among the new variables (subgraph B of Figure 4), with the result that LightGBM is unable to fully recognize the important features. For example, as shown in subgraph B of Figure 4, u − g and T_eff have a high MIC (0.65), which suggests that u − g should play an important role in predicting T_eff. However, as shown in Figure 7, the model treats u − g as an unimportant feature of T_eff: its importance for T_eff accounts for only 0.2%. This is a consequence of the model instability caused by the high MIC between u − g and the other variables; that is, the model does not fully establish the functional mapping from the variables to T_eff, which may degrade the predictive ability of the model. The same is true for log g and [Fe/H].
Although the CID method enhances the MIC between new variables and parameters, it does not reduce the strong MIC among new variables. Therefore, the model cannot fully recognize the important features of the data itself, so it cannot be directly applied to other data sets.

PCA + LightGBM
To investigate whether PCA + LightGBM can improve the performance of estimating stellar atmospheric parameters, we first apply PCA to the data so that the correlation between the new variables (PCs) is 0 (see Equation (4)). Then, we apply LightGBM to the PCs; the experimental results are shown in Figure 8. Compared with the CID method in Section 5.2, the RMSEs of T_eff, log g, and [Fe/H] decrease by 6%, 11%, and 13%, respectively. Furthermore, the model can accurately identify the important features of the data.
As shown in subgraph C of Figure 4, the main reasons are that (1) the MIC between the new variables (PCs) is less than 0.26 (the correlation between the PCs is 0), and (2) the MIC between the new variable F_2 and the parameters (T_eff, log g) is enhanced. Therefore, LightGBM can identify the important variables for the parameters (T_eff, log g), improving the model's generalization performance. For example, the variable with the maximal MIC with T_eff (or log g) is F_2 (subgraph C of Figure 4), and LightGBM likewise identifies F_2 as the most important feature of T_eff (or log g) (Figure 9). The MIC values between T_eff and the other PCs (F_1, F_3, F_5, F_4, and F_6) are all lower than 0.14. Consistent with the MIC values, the LightGBM model shows that the total importance of these PCs (F_1, F_3, F_5, F_4, and F_6) for T_eff is only 7.6%. In addition, there is no significant difference in the importance proportions among these PCs, which are 3.1%, 2.3%, 0.9%, 0.8%, and 0.5%, respectively, indicating that LightGBM reasonably establishes the functional mapping from the PCs to T_eff. The same is true for log g. PCA makes the new variables more independent and enhances the MIC between the new variables and the stellar parameters, thus improving the final prediction accuracy. PCA itself can be considered a generalization of the CID method: the CID method considers only the difference of two variables and does not comprehensively consider weighted sums of all variables. Therefore, the new variables obtained by PCA take more of the data's information into account, and the final model is more stable. The specific combination formula of each new variable (F_i, i = 1, 2, …, 6) is given in Equation (11). We take Equation (11) as a generalization of the CID: it considers not only the difference in magnitude between any two bands but also a weighted sum of the data in every band. In this way, the problems caused by the CID method are solved, and the performance of the model is improved.
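Putting the pieces together, the following sketch shows the PCA + LightGBM pipeline, here built from scikit-learn's PCA and the LightGBM package; hyperparameters are omitted and would be set by the Bayesian optimization described below.

```python
import numpy as np
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

def fit_pca_lightgbm(X, y):
    """PCA + LightGBM: decorrelate the six band magnitudes, then regress one
    stellar parameter (T_eff, log g, or [Fe/H]) on the resulting PCs."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)   # 80/20 split, as in Section 5
    pca = PCA(n_components=6)                  # keep all 6 PCs: decorrelation,
    F_train = pca.fit_transform(X_train)       # not dimensionality reduction
    F_test = pca.transform(X_test)             # reuse the training-set loadings
    model = lgb.LGBMRegressor().fit(F_train, y_train)
    rmse = np.sqrt(np.mean((np.asarray(y_test) - model.predict(F_test)) ** 2))
    return model, pca, rmse
```

Note that the PCA loadings are fitted on the training set only and reused on the test set, so no test information leaks into the new features.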

Comparison of Experimental Results of Different Algorithms
To further assess the performance of LightGBM, we perform similar experiments on the PCs with LR, SVR, random forest, GBDT, XGBoost, and an ANN. These algorithms have their own advantages: LR has strong interpretability; SVR has powerful kernel functions; GBDT, random forest, and XGBoost are classic decision-tree-based ensemble learning algorithms; and ANNs are widely used machine learning models.
The algorithms used in our test were implemented with Python libraries: LR, SVR, GBDT, and random forest were built with the scikit-learn library, while XGBoost and LightGBM were built with the XGBoost and LightGBM libraries, respectively. Bayesian optimization, implemented with the bayesian-optimization library (Nogueira 2014), is used to select the hyperparameters that give each model its best result. Bayesian optimization uses Gaussian processes to optimize objective functions that are very costly or slow to evaluate. It selects the best parameters in a continuous space and can be effective in practice even if the underlying function is stochastic, nonconvex, or noncontinuous. Therefore, using Bayesian optimization to select model hyperparameters is more efficient than grid search or random search (Dewancker et al. 2015).
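A sketch of the tuning loop with the bayesian-optimization package follows; the search bounds are illustrative, not those of Tables 1-3, and the cross-validated negative RMSE is used as the objective because the package maximizes rather than minimizes.

```python
import lightgbm as lgb
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

def make_objective(X, y):
    def objective(num_leaves, learning_rate, max_bin):
        model = lgb.LGBMRegressor(
            num_leaves=int(num_leaves),       # bayes_opt proposes floats
            learning_rate=learning_rate,
            max_bin=int(max_bin),
        )
        # Maximize negative RMSE, i.e., minimize RMSE
        return cross_val_score(
            model, X, y, cv=5, scoring="neg_root_mean_squared_error").mean()
    return objective

# Illustrative search bounds, not the configurations of Tables 1-3:
# optimizer = BayesianOptimization(
#     f=make_objective(X, y),
#     pbounds={"num_leaves": (15, 255), "learning_rate": (0.01, 0.3),
#              "max_bin": (250, 10000)},
#     random_state=0)
# optimizer.maximize(init_points=5, n_iter=25)
# best_params = optimizer.max["params"]
```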
The hyperparameter configurations of each model are shown in Tables 1-3, and the experimental results are listed in Table 4. We find that the training speed of LightGBM is more than three times that of XGBoost and more than ten times that of GBDT, and the RMSE of LightGBM is lower than that of the other algorithms in almost every case. The high efficiency of the LightGBM algorithm provides a new motivation for using photometric data to estimate stellar atmospheric parameters. For future spectral data, we will use the EFB algorithm in LightGBM for feature bundling, which will further improve the algorithm's efficiency.

Optimize the LightGBM
To further study the LightGBM algorithm, we quantitatively analyze the impact of its hyperparameters on estimating T_eff from the following two aspects; the impact on the other parameters is similar.
1. We explore the impact of a and b on the results, limiting the ranges of a and b to 0.001-0.5 (the other hyperparameters take the best values in Table 2). The experimental results are shown in Figure 10. We find that the RMSE of T_eff reaches its minimum when a = 0.05 and b = 0.45. This shows that, to improve the generalization ability of the model, we need to pay attention to the samples whose residuals fall in the top 5% rather than treat all samples equally. The settings of a and b are an important reason why the generalization ability of LightGBM is better than that of the other algorithms. 2. We explore the impact of the number of bins (max_bin) on the results, varying the number of bins from 250 to 10,000 in steps of 513 (20 steps); the other hyperparameters take the best values in Table 2. The experimental results are shown in Figure 11. We find that the training time of the model increases linearly with the number of bins. A good setting of the number of bins is the main reason why the LightGBM algorithm is faster than the other algorithms. The corresponding LightGBM settings are sketched below.
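For completeness, the sketch below shows how these two knobs map onto LightGBM's public parameters; the mapping of a and b to top_rate and other_rate is our reading of the GOSS interface, and recent LightGBM releases select GOSS via data_sample_strategy instead of boosting_type.

```python
import lightgbm as lgb

# GOSS's a and b correspond to LightGBM's top_rate and other_rate; the best
# values found for T_eff in our experiments were a = 0.05 and b = 0.45.
model = lgb.LGBMRegressor(
    boosting_type="goss",   # newer releases: data_sample_strategy="goss"
    top_rate=0.05,          # a: fraction of large-residual samples kept
    other_rate=0.45,        # b: fraction randomly sampled from the rest
    max_bin=255,            # number of histogram bins s; larger is slower
)
```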
The advantages of LightGBM are mainly reflected in the above two aspects. The use of GOSS makes the model focus on the samples with large residuals, which improves its generalization ability. The setting of the number of bins reduces the number of candidate split points and thus the training time.