Objective Separation between CP1 and CP2 Based on Feature Extraction with Machine Learning

In the eighth data release (DR8) of the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, more than 318,740 low-resolution stellar spectra with types from B to early F and signal-to-noise ratios >50 were released. With this large volume of early-type stars, we applied machine-learning algorithms to search for class-one and class-two chemically peculiar stars (CP1 and CP2) and to identify spectral features that distinguish the two classes in low-resolution spectra. We selected the XGBoost algorithm after comparing the classification efficiency of three ensemble machine-learning algorithms. Using XGBoost followed by visual inspection, we present a catalog of 20,694 sources, including 17,986 CP1 and 2708 CP2 stars, of which 6917 CP1 and 1652 CP2 stars are newly discovered. We also list the spectral features, discovered through XGBoost, that separate CP1 from CP2 stars. The stellar parameters (effective temperature (T eff), surface gravity (log g), and metallicity ([Fe/H])), the spatial distribution in Galactic coordinates, and the color-magnitude distribution are provided for all entries of the catalog. The T eff of CP1 stars ranges from ∼6000 to ∼8500 K, while that of CP2 stars ranges from ∼7000 to ∼13,700 K. The log g of CP1 stars ranges from 2.8 to 4.8 dex, peaking at 4.5 dex, and that of CP2 stars ranges from 2.0 to 5.0 dex, peaking at 3.6 dex. The [Fe/H] of both CP1 and CP2 stars ranges from −1.4 to 0.4 dex, and the [Fe/H] of CP1 stars is on average higher than that of CP2 stars. Almost all of the targets in our sample are located near the Galactic plane.


Introduction
The chemically peculiar (CP) stars are important because they can help us to understand the evolution and interaction of atomic diffusion processes, magnetic fields, and stellar rotation. The CP stars are characterized by the presence of certain absorption lines of abnormal strength or weakness, which indicate peculiar surface chemical abundances, and their spectral types range from B to F (Preston 1974; Ghazaryan et al. 2018). There are four main classes of CP stars: the metallic-line or Am stars (hereafter, CP1), the magnetic Bp/Ap stars (hereafter, CP2), the mercury-manganese stars (hereafter, CP3), and the He-weak stars (hereafter, CP4). In addition, other classes of CP stars that have no designation in the scheme of Preston (1974) have been studied, such as the He-rich stars and the λ Bootis stars (Gray & Corbally 2009).
The CP1 stars show weaker Ca II K lines and enhanced lines of iron and heavier elements in their spectra than normal A-type stars. As a result, the spectral types derived from the Ca II K line and from the metallic lines differ. For typical CP1 stars (Roman et al. 1948), the K-line type is earlier by five or more spectral subclasses than that derived from the metallic-line spectrum. Cowley et al. (1969) further divided CP1 stars into subclasses according to the K line and other metallic lines. The CP2 stars exhibit excesses of elements such as Si, Sr, Cr, or rare-earth elements. Most CP2 stars possess stable and globally organized magnetic fields with strengths of up to several tens of kG (Babcock 1947; Aurière et al. 2007). The CP3 stars are characterized by enhanced lines of Hg, Mn, and other heavy elements, whereas the main characteristic of the CP4 stars is anomalously weak He lines. Many observational studies have shown that most CP1 and CP3 stars are in binary systems (Wolff & Preston 1978; Abt & Levy 1985), whereas about 30% of CP2 stars are binaries (Southworth et al. 2011).
Several theories have been proposed to explain the formation of CP stars, such as the atomic diffusion model (Browne 1968; Michaud 1970; Richer et al. 2000), the supernova model (Stothers 1963; Guthrie 1967), the magnetic field accumulation model (Havnes & Conti 1971), and the collision model (Cowley 1977). The magnetic field accumulation model suggests that the abundance anomalies arise because the magnetic field in the atmosphere captures numerous atoms from the interstellar medium; these captured atoms move along the magnetic field to the surface of the stellar atmosphere by diffusion, producing the abnormal element abundances. The collision theory holds that the abundance anomalies may be induced by collisions between CP stars and planets or smaller planetary bodies. Such a collision would fundamentally change the chemical composition of the surface atmosphere of a CP star, producing the observed abundance anomalies.
With continuing progress in theoretical research and observational techniques, some models are gradually developed, improved, replaced, or even discarded. The collision model has been largely replaced by a merger scenario (Tutukov & Fedorova 2010), and the atomic diffusion model has been well developed to explain the formation of CP stars (Michaud et al. 2015). In the merger scenario, one of the main channels for the formation of CP2 stars is probably the merger of close binary systems. The high surface magnetic fields of CP2 stars are probably generated in the convective envelopes of the precursor stars. In the atomic diffusion model, the observed chemical peculiarities are ascribed to the interplay between radiative levitation and gravitational settling, which leads to element separation. Most elements sink under the force of gravity, but those elements with obvious absorption lines are accelerated toward the stellar surface by the diffusion process.
The formation mechanism of CP stars could be extremely complex. Building a satisfactory model requires more observational data to obtain detailed physical parameters and to constrain and test the models repeatedly. To date, ∼17,000 confirmed or probable CP1 stars and ∼5600 confirmed or probable CP2 stars have been found. The first CP catalog was provided by Renson et al. (1991), which contains ∼4000 confirmed or probable CP1 stars and ∼3500 confirmed or probable CP2 stars collected from a large number of literature sources and some CP star catalogs. More than 20 yr later, a powerful spectroscopic survey, the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST), enabled Hou et al. (2015) to find 3537 CP1 candidates with an empirical separation curve derived from the line index of the Ca II K line and a group of nine Fe lines. Subsequently, Qin et al. (2019) found 9372 CP1 candidates and 1132 CP2 candidates in LAMOST DR5 by using the Random Forest (RF) algorithm. Hümmerich et al. (2020) presented a sample of 1002 mCP stars (mostly Ap/CP2 stars, with several He-weak/CP4 stars) identified by searching for the characteristic 5200 Å flux depression (Maitzen 1976; Paunzen et al. 2005) in the low-resolution spectra of LAMOST DR4.
It is fairly straightforward to separate CP from non-CP stars in low-resolution spectra by identifying the characteristic lines and blends. The classical textbooks by Jaschek & Jaschek (1990), The Classification of Stars, and Gray & Corbally (2009), Stellar Spectral Classification, describe the classification of the CP stars and the relevant lines and blends in detail. With the features and classification criteria presented in Gray & Corbally (2009), CP1 and CP2 stars can be readily identified by eye at LAMOST resolution in spectra with sufficient S/N. To obtain a relatively pure CP1 star sample, Qin et al. (2019) removed 1132 suspected CP2 candidates from their CP1 star sample by employing the 4077 Å blend as a reference line to identify CP2 star candidates. We checked the 1132 excluded CP2 candidates and found that most of the spectra are actually of CP1 stars. The main reason is that the 4077 Å blend used by Qin et al. (2019) is not a sufficient criterion: according to Gray & Corbally (2009), CP1 stars may also show significantly enhanced 4077 Å blends. Although separating CP1 from CP2 stars is fairly straightforward with the correct set of criteria, it is hard to do manually for the large amount of data provided by spectral surveys such as LAMOST. Consequently, automated algorithms based on feature learning have to be considered.
In this paper, we search for CP1 and CP2 stars in LAMOST DR8 with machine-learning (ML) methods and compile a reliable catalog of the CP1 and CP2 stars found. In the catalog, CP1 and CP2 candidates are objectively classified using the differences in spectral features between them discovered by the ML algorithm, and the stellar parameters of these objects are calculated as well. In addition, some statistical investigations are conducted for these CP stars. This paper is organized as follows. In Section 2, we introduce the ML methods and the data used for model training and for the CP1 and CP2 search. In Section 3, we describe the application of the trained model to search for CP1 and CP2 stars in LAMOST DR8, analyze the important features in which CP1 and CP2 stars differ from normal early-type stars, and identify the features that distinguish CP1 from CP2 stars. In Section 4, we give a statistical analysis of the two samples of CP1 and CP2 stars, including the stellar parameter distribution, color-magnitude distribution, and spatial distribution. Finally, we summarize the work and present our conclusions in Section 5.

Machine-learning Methods
In the past decades, ML methods have been successfully applied to the classification of stellar spectra (Schierscher & Paunzen 2011; Kheirdastan & Bazarghan 2016; Li et al. 2019; Qin et al. 2019; Flores et al. 2021). We select three ensemble ML algorithms, namely Random Forest (RF), Extra-trees, and Extreme Gradient Boosting (XGBoost), to train and test the input data set. The code of these algorithms comes from scikit-learn. Among the three, XGBoost demonstrates the best performance. We also use SHAP to derive spectral features from the XGBoost model. These ML algorithms and the SHAP interpretation are briefly introduced as follows.
1. Random Forests: The RF algorithm provides a nonlinear supervised classification model. It contains multiple decision trees, and each tree in the ensemble is built from a sample drawn with replacement from the training set. Each tree in the forest is an independent classifier, which classifies the input data set independently. Among the classification results obtained from all trees, the most frequent one is taken as the final classification of the RF model (Breiman 2001).
2. Extra-trees: Extra-trees is similar to RF in that it builds multiple trees and splits nodes using random subsets of features. There are two key differences between Extra-trees and RF: Extra-trees does no bootstrapping (meaning it samples without replacement), and nodes are split at random thresholds rather than at the most discriminative ones (Geurts et al. 2006).
3. XGBoost: XGBoost is a supervised learning algorithm based on the gradient boosting framework, which can be used to solve classification and regression problems. XGBoost is also composed of multiple decision trees, and it offers faster parallel processing and higher accuracy than a traditional decision tree. Each new decision tree of XGBoost learns the residual between the target value and the value predicted by all of the previous trees. The trees make decisions together, and the results of all trees are added up as the final prediction. Each tree is generated by binary recursive splitting (Chen & Guestrin 2016).
4. SHAP: SHAP (SHapley Additive exPlanations) interprets the output of an ML model by attributing each prediction to the individual input features (Lundberg & Lee 2017). For each input sample $X_i$, the model produces a prediction value $Y_i$, which can be expressed as
$$Y_i = \Phi_0 + S(X_{i,1}) + S(X_{i,2}) + \cdots + S(X_{i,k}),$$
where $\Phi_0$ is usually the mean of the model output, $S(X_{i,j})$ represents the SHAP value of the jth feature of the ith sample of the data set, and $k$ is the number of features. $S(X_{i,j}) > 0$ indicates that the feature increases the predicted value and plays a positive role. Conversely, $S(X_{i,j}) < 0$ indicates that the feature reduces the predicted value and plays a negative role.
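The additive decomposition of a prediction into a base value plus per-feature SHAP values can be checked on a toy model. The following minimal sketch computes exact Shapley values by enumerating feature subsets; the three-feature `toy_model`, the random background sample, and all helper names are illustrative assumptions and are not part of the SHAP package or of the pipeline used in this work.

```python
from itertools import combinations
from math import factorial

import numpy as np


def toy_model(x):
    """Hypothetical model: a nonlinear function of three features."""
    return 2.0 * x[0] + x[1] * x[2] - 0.5 * x[2]


def value(subset, x, background):
    """Expected model output when only the features in `subset` are known.

    Unknown features are filled in from a background sample, which is one
    common convention and an assumption of this sketch.
    """
    idx = np.array(subset, dtype=int)
    filled = background.copy()
    filled[:, idx] = x[idx]
    return float(np.mean([toy_model(row) for row in filled]))


def shapley_values(x, background):
    """Exact Shapley values obtained by enumerating all feature subsets."""
    k = len(x)
    phi = np.zeros(k)
    for j in range(k):
        others = [f for f in range(k) if f != j]
        for size in range(k):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(k - size - 1) / factorial(k)
                gain = value(subset + (j,), x, background) - value(subset, x, background)
                phi[j] += weight * gain
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))  # stand-in for a training set
x_i = np.array([1.0, -2.0, 0.5])       # one sample to explain

phi_0 = value((), x_i, background)     # base value: mean model output
s_i = shapley_values(x_i, background)  # one SHAP value per feature
reconstructed = phi_0 + s_i.sum()      # should reproduce toy_model(x_i)
```

Exhaustive enumeration scales exponentially with the number of features, so it is only practical for toy cases; tree-specific algorithms are used for real spectra with thousands of wavelength bins.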

Data for the Search
LAMOST, also called the GuoShouJing Telescope, is a Chinese national scientific research facility that is operated by the National Astronomical Observatories of China (NAOC, CAS). It is a special reflecting Schmidt telescope with an effective aperture of 3.6-4.9 m and a field of view of 5° (Cui et al. 2012; Zhao et al. 2012). Four thousand fibers are installed on the focal plane, which enables it to observe 4000 objects simultaneously. The telescope is located at Xinglong Observatory (longitude 117.58°E, latitude 40.39°N) of NAOC and is dedicated to a spectral survey of the entire available northern sky (Luo et al. 2012). By the end of 2021 March, LAMOST DR8 had released 11,214,076 spectra. The wavelength coverage of the spectra is 3690-9100 Å with a resolution of 1800 at 5500 Å. In this work, CP1 and CP2 stars are searched for among 318,740 DR8 spectra with a signal-to-noise ratio larger than 50 in the g band, comprising 4825 B-type, 157,405 A-type, and 156,510 F0-type spectra. Figure 1 shows the distribution of their signal-to-noise ratio (S/N) in the g band against the G magnitude of Gaia DR2, from which we can see that the sample mainly ranges from 12 to 16 mag.

Training and Testing Data Sets
We collected known samples of CP1 and CP2 stars from the works of Hou et al.

Search for CP1 and CP2 Stars from LAMOST DR8

Preprocessing
The LAMOST spectra cover the wavelength range from 3690 to 9100 Å with a resolution of 1800 at 5500 Å. First, each spectrum was shifted to the rest frame according to the released radial velocity, as shown in Figure 5, and rebinned to a uniform spacing of 1 Å with cubic spline interpolation. Then, the spectrum was truncated to the violet-blue region from 3800 to 5600 Å because the obvious spectral features of CP1 and CP2 stars mainly appear in this region. With the rebinned and truncated spectra, a seventh-order polynomial was used to fit the pseudocontinuum, with the strong spectral lines, cosmic rays, and sky emission residuals from the data reduction masked. Finally, each spectrum was normalized by its pseudocontinuum, as shown in Figure 6.
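The preprocessing steps can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name `preprocess`, the asymmetric iterative clipping that stands in for the masking of strong lines, cosmic rays, and sky residuals, and the synthetic test spectrum are all assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

C_KMS = 299792.458  # speed of light in km/s


def preprocess(wave, flux, rv_kms, lo=3800.0, hi=5600.0, order=7,
               n_iter=5, clip_low=1.5, clip_high=3.0):
    """Rest-frame shift, 1 A rebinning, truncation, and normalization."""
    # 1) Shift the observed wavelengths to the rest frame.
    rest = wave / (1.0 + rv_kms / C_KMS)
    # 2) Rebin to a uniform 1 A grid restricted to the violet-blue region.
    grid = np.arange(lo, hi + 1.0, 1.0)
    rebinned = CubicSpline(rest, flux)(grid)
    # 3) Fit a seventh-order polynomial pseudocontinuum. Pixels far below
    #    the fit (absorption lines) or far above it (cosmic rays, sky
    #    residuals) are rejected iteratively; this clipping is an assumed
    #    stand-in for the masking described in the text.
    x = (grid - grid.mean()) / (0.5 * (hi - lo))  # scaled to [-1, 1]
    keep = np.ones(grid.size, dtype=bool)
    for _ in range(n_iter):
        coeffs = np.polynomial.polynomial.polyfit(x[keep], rebinned[keep], order)
        cont = np.polynomial.polynomial.polyval(x, coeffs)
        resid = rebinned - cont
        sigma = np.std(resid[keep])
        keep = (resid > -clip_low * sigma) & (resid < clip_high * sigma)
    # 4) Normalize by the pseudocontinuum.
    return grid, rebinned / cont, cont


# Synthetic spectrum: sloped continuum, one absorption line, 0.5% noise.
rng = np.random.default_rng(0)
rest_wave = np.linspace(3700.0, 5700.0, 3000)
continuum = 1000.0 * (1.0 + 0.3 * (rest_wave - 4700.0) / 1000.0)
line = 1.0 - 0.5 * np.exp(-0.5 * ((rest_wave - 4340.0) / 3.0) ** 2)
flux = continuum * line * (1.0 + rng.normal(scale=0.005, size=rest_wave.size))
rv = 30.0  # km/s, hypothetical radial velocity
obs_wave = rest_wave * (1.0 + rv / C_KMS)

grid, norm, cont = preprocess(obs_wave, flux, rv)
```

Scaling the wavelength axis to [-1, 1] before fitting keeps the high-order polynomial fit numerically well conditioned.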

Classification between Normal Early-type Stars and CP1 and CP2
As described in Section 2.2.2, we selected spectra of 1771 CP1 and 1780 CP2 stars as the positive sample; the S/N distributions of the two sets of spectra are similar, as shown in Figure 4. We removed the known CP1 and CP2 spectra from the 318,740 selected spectra of LAMOST DR8 and randomly selected 8298 spectra from the remainder as the negative sample. These labeled spectra were divided into training and test sets in a ratio of 7:3.
Compared with the RF and Extra-trees algorithms, the XGBoost algorithm performs best in searching for CP1 and CP2 stars among early-type stars, reaching an accuracy of 98.85% and a recall of 97.57% on the test set.
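The 7:3 split and the two quoted metrics can be illustrated with a short sketch. The sample sizes are those given in the text; splitting within each class (a stratified split), the random seed, and the function names are assumptions, since the text does not specify how the split was drawn.

```python
import numpy as np


def split_70_30(labels, rng):
    """Return train/test index arrays with a 7:3 split within each class."""
    train, test = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(0.7 * idx.size))
        train.append(idx[:n_train])
        test.append(idx[n_train:])
    return np.concatenate(train), np.concatenate(test)


def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all spectra and recall of the positive (CP) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / y_true.size
    recall = tp / (tp + fn)
    return float(accuracy), float(recall)


# 1771 CP1 + 1780 CP2 positives and 8298 negatives, as quoted in the text.
labels = np.concatenate([np.ones(1771 + 1780, dtype=int),
                         np.zeros(8298, dtype=int)])
train_idx, test_idx = split_70_30(labels, np.random.default_rng(42))

# Toy predictions to show how the two metrics are computed.
acc, rec = accuracy_and_recall([1, 1, 1, 0, 0, 0, 0, 0],
                               [1, 1, 0, 0, 0, 0, 0, 1])
```

Because the negative class is more than twice as large as the positive class, recall of the CP class is the more informative of the two numbers for a search of this kind.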
With the trained XGBoost model, we obtained 10,776 CP1 and CP2 mixed candidates from the remaining spectra. According to the trained model, the SHAP values of the extracted separation features between the mixed CP1 and CP2 and normal stars were calculated.
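Once a matrix of SHAP values is available, ranking wavelength bins by importance is a simple reduction. The sketch below assumes a precomputed array `shap_values` of shape (number of spectra, number of wavelength bins) and ranks bins by mean absolute SHAP value, which is one common convention; the data here are synthetic, with contributions injected at two arbitrary bins purely for illustration.

```python
import numpy as np


def rank_features(shap_values, wavelengths, top=10):
    """Rank wavelength bins by mean absolute SHAP value, largest first."""
    importance = np.mean(np.abs(shap_values), axis=0)
    order = np.argsort(importance)[::-1][:top]
    return [(float(wavelengths[i]), float(importance[i])) for i in order]


# 1 A grid over the violet-blue region used in this work.
wavelengths = np.arange(3800.0, 5601.0, 1.0)

# Synthetic SHAP matrix: small random values everywhere.
rng = np.random.default_rng(1)
shap_values = rng.normal(scale=0.01, size=(200, wavelengths.size))

# Inject strong contributions at two bins (synthetic, not measured values).
i_a = int(np.searchsorted(wavelengths, 4130.0))
i_b = int(np.searchsorted(wavelengths, 3934.0))
shap_values[:, i_a] += 0.8
shap_values[:, i_b] -= 0.5

ranked = rank_features(shap_values, wavelengths, top=5)
```

The sign of the individual SHAP values, which is discarded by the absolute value here, tells whether a stronger feature pushes a spectrum toward or away from the CP class.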
We found that the most important features separating the CP stars from non-CP stars are the 4130 Å blend (Si II 4128 Å, Eu II 4130 Å, Si II 4131 Å) and the Ca II K line at 3934 Å, as shown in Figure 7. Because some features are not resolved at the resolution of the LAMOST low-resolution spectra (∼2.5 Å), we use "blend" to denote these blended features. The figure indicates that the abundances of Si and Eu in CP1 and CP2 stars are higher than those in normal stars, while CP1 and CP2 stars are notably deficient in Ca II. Figure 8 shows the average spectra of CP1 and CP2 stars at LAMOST resolution. Although the two classes are similar, it is fairly straightforward to distinguish them with the detailed list of the characteristic lines and blends provided by Gray & Corbally (2009). However, dealing with thousands of spectra manually is inefficient. ML algorithms might be a solution to this problem.
Figure 2. The area filled in red is the flux depression region centered on the 5220 Å blend with a bandwidth of 230 Å. The flux depression at the 5220 Å blend is more obvious for the CP2 star than for the CP1 star.

Classification between CP1 and CP2
We labeled the training and test sets using the spectra of the 1771 known CP1 and 1780 known CP2 stars described earlier. The three algorithms were compared, and XGBoost again performed best. The accuracy and recall on the test set are 99.29% and 97.86%, respectively. With the trained model, the mixed sample of 10,776 candidates was classified into 7880 CP1 and 2896 CP2 candidates.
The most important features separating the spectra of CP1 and CP2 stars are shown in Figure 9, including 4935/4936 Å (hereafter, the 4935 Å blend), 4416 Å, 5081/5082 Å (hereafter, the 5081 Å blend), and 4402 Å. The elements that may contribute to these features are listed in Table 2. The 4935 Å blend could be contributed by Ni I or Cr I. This is a new feature that has not been presented in previous works. The 4416 Å feature and the 5081 Å blend may be contributed by Ni I and Fe II, respectively. The 4402 Å feature may be contributed by Ni I or Fe I.
For an easier understanding of these features, we plotted the separation features with their SHAP values on the spectra of CP1 and CP2 stars in Figure 10. In the figure, the blend of Sr II, Cr II, and Si II around 4077 Å is also an important feature for separating CP1 from CP2 stars. This is consistent with the definitions of CP1 and CP2 stars: in general, Sr, Cr, Eu, or Si is extremely overabundant in CP2 stars, while the abundances of Sr, Cr, Eu, and Si in CP1 stars are only slightly higher than those of ordinary stars. This means that the accuracy and purity of the initial sample are relatively high.

Line Indices Defined for CP1 and CP2
Figure 3. Showcase of manually identified characteristic lines of nine new CP2 star candidates. They were classified as spectral subclasses B6, A1, A2, A3, A5, A6, A7, A9, and F0 from the top row to the bottom row. The spectral subclasses presented by the LAMOST catalogs are rough estimates. The reliable spectral subclass is given by the MKCLASS code in column "SpT_mkclass" of Table 1.
According to the top-ranking separation features between CP1 and CP2 stars obtained through the XGBoost classifier, we defined line indices for further study. The definition of each line index is similar to that of Hou et al. (2015). First, we drew a straight line as the continuum through two flux points, namely the peak values within 5 Å on either side of the line center. The line index of each feature was then calculated with the following equation:
$$\mathrm{Index} = \int_{\lambda_0}^{\lambda_1} \left[1 - \frac{f(\lambda)}{g(\lambda)}\right] d\lambda,$$
where $\lambda_0$ and $\lambda_1$ are the wavelengths of the left-hand and right-hand peaks of the feature, respectively, and $f(\lambda)$ and $g(\lambda)$ are the observed flux and the value of the straight line at wavelength $\lambda$, respectively. The line indices of the 10 most important features are shown in Figure 9. We found that the line index distributions of four features are highly distinguishable, which means that there may be significant abundance differences in the corresponding elements between CP1 and CP2 stars. For a sample with high purity, we expect an obvious difference between CP1 and CP2 stars in the calculated line indices. Based on the values of the four line indices, we removed outliers from our sample with the quartile method and obtained a purer sample including 6917 new CP1 and 2708 CP2 candidates. Of the CP2 candidates, 1056 are included in the known samples published by Renson & Manfroid (2009), among others. The purification process is as follows.
For the distribution of each line index, we calculated the quartiles and removed the data points above Q3 + 1.5 × (Q3 − Q1) or below Q1 − 1.5 × (Q3 − Q1), where Q3 and Q1 are the upper and lower quartiles, respectively. In this way, we obtained a pure and reliable sample of CP1 and CP2 stars from LAMOST DR8. After the purification, the line index distributions of these features are highly distinguishable between the spectra of CP1 and CP2 stars, as shown in Figures 11 and 12 at 3σ confidence. This means that the purity and reliability of our sample are very high.
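The line index measurement and the quartile-based purification can be sketched together. The sketch assumes an equivalent-width form of the index, that is, the integral of 1 − f(λ)/g(λ) between the two peaks, with g(λ) the straight line through the peaks found within 5 Å on either side of the line center; the function names and the synthetic data are illustrative assumptions.

```python
import numpy as np


def line_index(wave, flux, center, half_width=5.0):
    """Equivalent-width style index with a local straight-line continuum."""
    left = (wave >= center - half_width) & (wave < center)
    right = (wave > center) & (wave <= center + half_width)
    # Peak flux on each side of the line center defines the continuum points.
    i0 = np.flatnonzero(left)[np.argmax(flux[left])]
    i1 = np.flatnonzero(right)[np.argmax(flux[right])]
    sl = slice(i0, i1 + 1)
    # Straight line g(lambda) through the two peaks.
    g = np.interp(wave[sl], [wave[i0], wave[i1]], [flux[i0], flux[i1]])
    integrand = 1.0 - flux[sl] / g
    # Trapezoidal integration between the two peaks.
    steps = np.diff(wave[sl])
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * steps))


def iqr_mask(values, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)


# Synthetic normalized spectrum with one absorption feature at 4416 A.
wave = np.arange(4400.0, 4433.0, 1.0)
flux = 1.0 - 0.4 * np.exp(-0.5 * ((wave - 4416.0) / 1.5) ** 2)
index_4416 = line_index(wave, flux, 4416.0)

# Synthetic index distribution with a single obvious outlier.
indices = np.concatenate([np.linspace(1.0, 2.0, 99), [10.0]])
kept = iqr_mask(indices)
```

In practice the mask would be evaluated separately for each of the four line indices and each class, and a star would be kept only if it passes all of them; whether the original work combined the masks in exactly this way is not stated.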
We also studied the correlation among the line indices of the important features of CP1 and CP2 stars with the Pearson correlation coefficient ρ, which is defined as
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}},$$
where Cov(X, Y) represents the covariance between any two line indices X and Y, and Var[X] and Var[Y] are the variances of X and Y, respectively. From the line index diagrams shown in Figure 12, we found that any two features are only weakly correlated for either the CP1 or the CP2 stars. The maximum correlation coefficient is 0.21 (between the 4416 Å feature and the 5081 Å blend) for CP1 stars and 0.16 (between the 4402 Å feature and the 4935 Å blend) for CP2 stars. Finally, the sample of CP1 and CP2 stars is listed in Table 1.
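The Pearson coefficient between two line indices follows directly from the covariance and the variances. The short sketch below evaluates it from that definition and cross-checks the result against NumPy's `corrcoef`; the two index arrays are synthetic and only mimic a weak correlation.

```python
import numpy as np


def pearson(x, y):
    """Pearson coefficient: Cov(X, Y) / sqrt(Var[X] * Var[Y])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(cov / np.sqrt(x.var() * y.var()))


# Two synthetic, weakly correlated line-index samples.
rng = np.random.default_rng(7)
index_a = rng.normal(size=2000)
index_b = 0.2 * index_a + rng.normal(size=2000)

rho = pearson(index_a, index_b)
rho_numpy = float(np.corrcoef(index_a, index_b)[0, 1])
```

The population form of the covariance and variance is used consistently in both the numerator and the denominator, so the normalization cancels and the result agrees with `corrcoef`.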

The Distribution of Stellar Atmospheric Parameters
Figure 7. Feature importances for the separation between CP stars and normal stars, computed from the XGBoost model with the SHAP package. Each row represents a feature, and each dot represents a sample. The value of the feature increases as the color changes from blue to red. The horizontal axis is the SHAP value.
Figure 8. The normalized averaged spectra of CP1 and CP2 stars. The red spectrum in the upper panel is the averaged spectrum of CP2 stars. The blue spectrum in the bottom panel is the averaged spectrum of CP1 stars. The black spectra in both panels are all of the normalized spectra of CP2 and CP1 stars. For these spectra, the pseudocontinua are fitted with a seventh-order polynomial, and all of the spectra are normalized by dividing them by the pseudocontinua.
Figure 10. The red and blue spectra are the averaged spectra of CP2 and CP1 stars, respectively. The value on the vertical axis is the SHAP value, which represents the impact on the separation between CP2 and CP1 stars. The upper and bottom panels show the comparison in the wavelength ranges from 3800 Å to 4700 Å and from 4700 Å to 5600 Å, respectively.
The atmospheric parameters of our sample stars were determined by comparing the observed spectra to the KURUCZ library of theoretical spectra (Castelli & Kurucz 2003), using the spectral region of 3800-5500 Å. The observed spectra were shifted into their rest frames by adopting the radial velocities from the LAMOST 1D pipeline. To absorb the continuum differences between the observed and synthetic spectra, a fifth-order multiplicative polynomial was applied to each synthetic spectrum so that it had the same pseudocontinuum as the target spectrum; the choice of a fifth-order polynomial follows Du et al. (2021). We adopted a χ² algorithm to compare the target spectrum with each of the adjusted synthetic spectra and found the five best-matching reference spectra by sorting χ². The parameters were then interpolated by a linear combination of the five best-matching spectra, following Yee et al. (2017). To avoid contamination by peculiar metal lines, we masked the features of CP1 and CP2 stars obtained both in this work and in the previous works of Hou et al. (2015) and Qin et al. (2019). The newly derived stellar atmospheric parameters of our sample stars are presented in Table 1.
Figures 13 and 14 show the distributions and statistical results of the stellar atmospheric parameters of the CP1 and CP2 stars. We compared the distribution of stellar atmospheric parameters of the newly found CP1 stars with that of the published CP1 star samples in Figure 14. The T eff distribution of the newly found CP1 stars (red histogram in the left-hand panels of Figure 14) is basically the same as that of the published samples (blue histogram in the left-hand panels of Figure 14), ranging from ∼6000 to ∼8500 K and peaking at ∼7600 K, which is consistent with the previously defined temperature range (from ∼6000 to ∼10,000 K) of CP1 stars.
For the CP2 stars, the effective temperature ranges from ∼7000 to ∼13,700 K with two peaks at T eff ≃ 10,000 K and T eff ≃ 7700 K, which suggests that our CP2 sample could include both high- and low-temperature populations. The high-temperature population is mainly contributed by early A-type and B-type stars, while the low-temperature population is mainly contributed by later A-type and F-type stars.
For log g, as shown in the middle panels of Figure 14, the distributions of the newly found and published sample stars are basically the same, with CP1 stars ranging from ∼2.8 to ∼4.8 dex and peaking at ∼4.5 dex, and CP2 stars ranging from ∼2.0 to ∼5.0 dex and peaking at ∼3.6 dex.
As shown in the right-hand panels of Figure 14, the [Fe/H] of both CP1 and CP2 stars ranges from −1.4 to 0.4 dex. The [Fe/H] of most CP1 stars is slightly higher than that of CP2 stars, and the [Fe/H] of most CP1 stars lies between −0.5 and 0.25 dex.
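The template-matching procedure described at the start of this section (a fifth-order multiplicative polynomial, χ² ranking, and a linear combination of the five best matches) can be sketched on synthetic data. Everything below is an assumption made for illustration: the toy template grid with a single line whose depth depends on one parameter, the clipping and renormalization of the combination weights, and all function names. It is not the code used to derive the catalog parameters.

```python
import numpy as np


def fit_multiplicative_poly(obs, template, x, order=5):
    """Template multiplied by the least-squares polynomial that best maps
    it onto the observed spectrum."""
    design = np.vander(x, order + 1) * template[:, None]
    coeffs, *_ = np.linalg.lstsq(design, obs, rcond=None)
    return design @ coeffs


def match_parameters(obs, err, templates, params, x, n_best=5):
    """Chi-square ranking, then a weighted mean over the best templates."""
    adjusted = np.array([fit_multiplicative_poly(obs, t, x) for t in templates])
    chi2 = np.sum(((obs - adjusted) / err) ** 2, axis=1)
    best = np.argsort(chi2)[:n_best]
    # Linear combination of the best-matching spectra; negative weights are
    # clipped and the rest renormalized (a simplification of this sketch).
    weights, *_ = np.linalg.lstsq(adjusted[best].T, obs, rcond=None)
    weights = np.clip(weights, 0.0, None)
    weights /= weights.sum()
    return float(weights @ params[best]), best, chi2


# Toy grid: one line at 4340 A whose depth grows with a T_eff-like parameter.
wave = np.arange(3800.0, 5501.0, 1.0)
x = (wave - wave.mean()) / (0.5 * (wave[-1] - wave[0]))
params = np.arange(7000.0, 9001.0, 250.0)  # hypothetical parameter grid (K)
depths = 0.2 + 0.5 * (params - 7000.0) / 2000.0
profile = np.exp(-0.5 * ((wave - 4340.0) / 8.0) ** 2)
templates = 1.0 - depths[:, None] * profile[None, :]

# Observed spectrum: the 8000 K template with a continuum tilt and noise.
rng = np.random.default_rng(3)
err = np.full(wave.size, 0.002)
obs = templates[4] * (1.0 + 0.2 * x) + rng.normal(scale=0.002, size=wave.size)

estimate, best, chi2 = match_parameters(obs, err, templates, params, x)
```

Because the best-matching templates are nearly collinear, the unconstrained weights are poorly determined; the clipped and renormalized weights only guarantee that the estimate stays within the parameter range spanned by the five best templates.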

The Spectral Subtypes
The spectral subtypes of the sample stars were rederived with the MKCLASS code (Gray & Corbally 2014; Gray et al. 2016). The spectral subtypes from the LAMOST catalog are presented in Table 1 (see column "SpT_lamost"). When investigating the consistency of the derived temperatures with the spectral subtypes, we found that the derived temperatures do not agree with the spectral subtypes given in the LAMOST catalog. A significant number of A1- and A2-type stars appear to have overestimated effective temperatures. In addition, the hotter B-type CP2 stars, which form a significant fraction of the CP2 star population, seem to be curiously underrepresented in the sample of CP2 stars. The main cause may be the inaccurate spectral subtypes of the LAMOST catalog. The spectral subtypes in column "SpT_lamost" of Table 1 were taken directly from the LAMOST catalog. These automatically derived subtypes are in most cases rough estimates, with uncertainties reaching ∼5 subtypes.
To solve this problem, the spectra of our sample stars were reclassified with the MKCLASS code. The rederived spectral subtypes are presented by the column "SpT_mkclass" in Table 1. The column "Quality" represents a quality evaluation of spectral subtypes given by the MKCLASS code. The outputs of quality evaluation include "excellent", "vgood", "good", "fair", and "poor" (the corresponding meanings are given in Gray et al. 2016). Checking for these subtypes, we found that the spectral subtypes given by the MKCLASS code are more reliable than those given by the LAMOST catalogs, and the temperature matched well with the newly derived spectral subtypes.
The null flux in a large wavelength range will reduce the reliability of classification. For example, some candidates with null flux at around 5200 Å are not in fact CP2 stars. These spectra have been selected in an automatic search of CP2 stars because of the null flux in the spectra that might have been misidentified by the code as a strong 5200 Å depression. In addition, the null flux in a large wavelength range also leads to the inaccurate classification of spectral subtypes. By checking the spectra of sample stars, some spectra with the null flux at around 5200 Å are wrongly classified as M-type with a quality evaluation of "poor". These candidates should be carefully considered or reclassified for future uses. For the null flux in a small wavelength range (several angstroms), the effect on the classification of spectral subtypes can be neglected.
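A simple screening step can flag spectra whose null flux near 5200 Å might mimic a flux depression. The sketch below measures the longest run of null (zero or non-finite) pixels inside a window; the window center, the window half-width, and the 10 Å threshold separating "large" from "small" gaps are assumptions chosen only for illustration.

```python
import numpy as np


def longest_null_run(wave, flux, lo, hi, tol=0.0):
    """Length (in A) of the longest run of null flux inside [lo, hi]."""
    window = (wave >= lo) & (wave <= hi)
    w = wave[window]
    f = flux[window]
    is_null = (np.abs(f) <= tol) | ~np.isfinite(f)
    longest, start = 0.0, None
    for i, flag in enumerate(is_null):
        if flag and start is None:
            start = i  # a new run of null pixels begins here
        if start is not None and (not flag or i == is_null.size - 1):
            end = i if flag else i - 1  # last null pixel of the run
            longest = max(longest, w[end] - w[start])
            start = None
    return float(longest)


def flag_null_near_5200(wave, flux, center=5200.0, half_window=115.0,
                        min_run=10.0):
    """True if a long null-flux gap lies near the 5200 A region."""
    run = longest_null_run(wave, flux, center - half_window,
                           center + half_window)
    return run >= min_run


wave = np.arange(3800.0, 5601.0, 1.0)

# Spectrum with a 50 A gap of null flux: should be flagged.
flux_gap = np.ones(wave.size)
flux_gap[(wave >= 5180.0) & (wave <= 5230.0)] = 0.0

# Spectrum with only a few null pixels: should not be flagged.
flux_small = np.ones(wave.size)
flux_small[(wave >= 5200.0) & (wave <= 5202.0)] = 0.0

flagged_gap = flag_null_near_5200(wave, flux_gap)
flagged_small = flag_null_near_5200(wave, flux_small)
```

Flagged spectra would still need visual inspection, since the flag only indicates missing data and says nothing about the true classification of the star.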

The Color-Magnitude Distribution
In Figures 15 and 16, 5745 known and 3441 newly discovered CP1 stars and 602 known and 1006 newly discovered CP2 stars are compared, respectively. The color-magnitude distributions of the newly discovered CP1 and CP2 stars are basically similar to those of the known CP1 and CP2 stars.

Spatial Distribution
The spatial distribution of the CP2 and CP1 stars in LAMOST DR8 is shown in Figure 17. The density of CP1 and CP2 stars toward the Galactic anticenter (GAC) is obviously higher than in other areas. The density distribution of our new sample stars in Galactic coordinates is similar to that of the published sample stars presented by Renson & Manfroid (2009), Hou et al. (2015), Qin et al. (2019), and Hümmerich et al. (2020). This density pattern can be explained by the observational strategy and the real spatial distribution. On the one hand, the GAC survey is an important component of the LAMOST survey, which results in more observations being carried out in this region. On the other hand, stars are mainly born in the Galactic disk, where more young objects concentrate. With our work, we significantly increase the sample size of known Galactic CP1 and CP2 stars, which is helpful for future in-depth statistical studies.
Table 1 (column notes). "T eff": effective temperature. "log g (dex)": surface gravity with error bar. "[Fe/H] (dex)": metal abundance with error bar. "SpT_lamost": spectral type from the LAMOST catalog. "SpT_mkclass": spectral type derived by the MKCLASS code. "Quality": quality evaluation given by the MKCLASS code. "Star type": the subclass of CP stars; "CP1" and "CP2" indicate that the stars are CP1 and CP2 star candidates, respectively. The symbols "*" and "#" in the "Notes" column mark the candidates obtained from the published literature and from this paper, respectively.

Summary and Conclusion
In this paper, we present a reliable and pure sample of 17,986 CP1 and 2708 CP2 stars from the LAMOST DR8 spectra, obtained with ML methods. The sample includes 11,069 known CP1 and 1056 known CP2 stars collected from the published literature of Renson & Manfroid (2009), among others. Based on feature extraction with XGBoost, we present the important features separating CP stars (the mixture of CP2 and CP1 stars) from normal B/A/F-type stars. The important separation features are the Ca II K line and the 4130 Å blend, and the corresponding elements are Ca, Si, and Eu (shown in Figure 7). By using the CP1 and CP2 samples collected from the publications of Renson & Manfroid (2009), Hou et al. (2015), Qin et al. (2019), and Hümmerich et al. (2020) as the training and test data sets of the XGBoost classifier, we extracted the important separation features between CP1 and CP2 stars from the trained model. The important separation features between CP1 and CP2 stars at the low resolution of LAMOST are the 4935 Å blend, the 5081 Å blend, 4416 Å, and 4402 Å. The corresponding elements are Ni I, Cr I, Fe I, and Fe II (shown in Table 2). In addition, the line indices of these features were calculated with the method of Hou et al. (2015). The outliers in the line index of each feature for the CP2 and CP1 sample stars were then removed with the quartile method, and a highly reliable and pure sample including 6917 newly found CP1 and 1652 newly found CP2 candidates was obtained. For each candidate, the effective temperature, log g, and [Fe/H] were determined, and the spectral subtype was derived with the MKCLASS code. A statistical analysis of our sample and the known sample stars is presented. We compare the distribution of log g versus T eff of our newly found CP1 sample with that of the samples from the literature (shown in Figures 13 and 14).
The density distributions of log g and T eff of our sample are similar to those of the samples from the literature. The T eff of CP1 stars ranges from ∼6000 to ∼8500 K. The log g of CP1 stars ranges from ∼2.8 to ∼4.8 dex, peaking at ∼4.5 dex, and that of CP2 stars ranges from ∼2.0 to ∼5.0 dex, peaking at ∼3.6 dex.
As shown in Figure 16, the color-magnitude density distribution of the newly found CP2 stars from LAMOST is slightly different from that of the samples from the literature. There are two possible explanations: a real difference in the distribution or selection effects. Meanwhile, the color-magnitude density distributions of our sample of CP1 stars and of the samples from the literature are basically the same. For all CP2 and CP1 stars, the density in the Galactic disk is obviously higher than in other areas.
Figure 16. Same as Figure 15, but for the sample of CP2 stars.
Figure 17. The spatial distribution of CP2 and CP1 stars from LAMOST DR8 in Galactic coordinates. "HW+QL" denotes stars obtained from the published papers of Hou et al. (2015) and Qin et al. (2019). "RM+HPB+QL" denotes stars obtained from the published papers of Renson & Manfroid (2009), Hou et al. (2015), and Qin et al. (2019). The published CP1 stars and those from this paper are represented by gray crosses and blue dots, respectively. The CP2 stars obtained from the published papers and from this paper are represented by green plus signs and red dots, respectively. The thick gray solid line is the equatorial plane.
Figure 8 shows the normalized average spectra of the CP2 and CP1 stars. A comparison shows that the Ca II K line of CP1 stars is on average deeper than that of CP2 stars. However, it should be noted that this is only an average result rather than a criterion for manually distinguishing CP1 from CP2 stars. The flux depression at the 5220 Å blend is more obvious in CP2 stars than in CP1 stars (see Figure 2), which can be used to visually recognize CP2 stars from their spectra. However, the flux depression at the 5220 Å blend in CP2 spectra is temperature dependent (Maitzen 1976; Hümmerich et al. 2020). This makes it difficult to distinguish them, especially in massive spectral databases.
The most recommended criteria for interactively identifying CP stars among normal spectra of B-, A-, and F-type stars, and for separating CP2 from CP1 stars, are provided in the classical textbooks by Jaschek & Jaschek (1990) and Gray & Corbally (2009). However, when coping with the amount of data provided by large-scale surveys such as LAMOST, a one-by-one interactive feature check becomes extremely inefficient. Therefore, the automated, feature-extraction-based algorithm XGBoost has been used in this paper. We note that, owing to the limitations of low-resolution spectra, more detailed work and further identification require follow-up high-resolution spectroscopic observations.
Note. The corresponding elements are obtained from Moore et al. (1966).