A New Classification Model for the ZTF Catalog of Periodic Variable Stars

Using the second data release from the Zwicky Transient Facility (ZTF, Bellm et al. 2019), Chen et al. (2020) created the ZTF Catalog of Periodic Variable Stars (ZTF CPVS), which contains 781,602 periodic variable stars (PVSs) with 11 class labels. Here, we provide a new classification model for PVSs in the ZTF CPVS using a convolutional variational autoencoder and a hierarchical random forest. We cross-match the sky coordinates of PVSs in the ZTF CPVS with those in the SIMBAD catalog. We identify non-stellar and extragalactic objects that were not previously classified, such as Quasi-Stellar Objects, Active Galactic Nuclei, supernovae and planetary nebulae. We then create a new labelled training set with 13 classes in two levels. We obtain a reasonable level of completeness (>90 %) for certain classes of PVSs, although we have poorer completeness in other classes (∼40 % in some cases). Our new labels for the ZTF CPVS are available via Zenodo (Cheung et al. 2021).


THE ZTF CPVS
The ZTF is a wide-field optical survey conducted using a 48-inch Schmidt telescope with a 47 deg² field of view (Bellm et al. 2019). Chen et al. (2020) used the second data release of the ZTF to create the ZTF CPVS. They searched for and classified new PVSs down to an r-band magnitude of ∼20.6. By measuring the g- and r-band periods, phase difference, amplitudes, absolute Wesenheit magnitude and adjusted R² (which represents how well the data are fitted by the Fourier function), they grouped PVSs into 11 distinct types using linear cuts on these observational features. Moreover, 79.5% of the PVSs presented in the ZTF CPVS are newly classified objects. They report a misclassification rate of 2% when compared to other photometrically classified samples, such as the ATLAS (Heinze et al. 2018), WISE (Chen et al. 2018), ASAS-SN (Jayasinghe et al. 2018), and Catalina (Drake et al. 2014; Drake et al. 2017) catalogs.
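
For reference, the adjusted R² mentioned above can be computed as in the following minimal sketch. It assumes the standard textbook definition, R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1), for n data points and p fitted parameters besides the constant term. The exact convention adopted for the catalog is not stated here, and the function name and toy numbers are illustrative only.

```python
def adjusted_r2(observed, fitted, n_params):
    """Adjusted R^2 of a fit (standard definition, assumed here).

    observed : measured values (e.g. magnitudes of a light curve)
    fitted   : model values at the same epochs (e.g. a Fourier fit)
    n_params : number of fitted parameters, not counting the constant term
    """
    n = len(observed)
    if n != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if n - n_params - 1 <= 0:
        raise ValueError("need more data points than fitted parameters")

    mean_obs = sum(observed) / n
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))  # residual scatter
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)           # total scatter
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")

    r2 = 1.0 - ss_res / ss_tot
    # Penalise the plain R^2 for the number of free parameters.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)


# Toy numbers only (not ZTF data).
toy_observed = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(adjusted_r2(toy_observed, toy_fitted, n_params=2))
```

Values close to 1 indicate that the model reproduces most of the variance of the data even after accounting for the number of free parameters.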
Here we present a new set of classifications based on a deep generative feature space and an independent set of class labels obtained from the SIMBAD catalog (Wenger et al. 2000).

METHODOLOGY
Our classifier is based on learned latent features generated in Chan et al. (2021, in prep.; hereafter C21), as well as hand-engineered features from the ZTF CPVS, including the periods, amplitudes and mean magnitudes of both the g- and r-band light curves. The C21 latent features are generated by a convolutional variational autoencoder, with ten features describing the light curves of each object.
We extract object labels from the SIMBAD catalog (Wenger et al. 2000) by cross-matching the sky coordinates of PVSs in the ZTF CPVS with those in SIMBAD (using Astroquery, Ginsburg et al. 2019). We find 31,541 successfully cross-matched objects, which we use as our labelled data set. We then construct 13 class labels in two levels. The first level contains Active Galactic Nuclei-like objects (AGNL, including blazars and quasars), Cepheids (CEP), eclipsing binaries (EB), long-period variables (LPV), Mira variables (Mira), RR Lyrae stars (RR), and the catch-all categories of other pulsating variables (Pul_oth) and peculiar types (Pec). The second level further classifies the Pec class into carbon stars (C-Type), horizontal branch stars (HB), red giant branch stars (RGB), S-Type stars (S-Type), young stellar object-like objects (YSOL), and other variables (V_oth).
We split the data set into training and test sets with a ratio of 7:3 using the Python package scikit-learn (Pedregosa et al. 2011). We note that our training set is highly imbalanced, with the largest class containing 10,745 objects and the smallest containing just 41 objects. We balance the training set using the Python package imbalanced-learn (Lemaître et al. 2017) with default learning parameters, which performs synthetic minority over-sampling (Chawla et al. 2002; Lemaître et al. 2017).
arXiv:2112.04010v1 [astro-ph.IM] 7 Dec 2021
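
The positional cross-match described above can be illustrated with a short, self-contained sketch. The actual query in this work is performed with Astroquery against SIMBAD, and the code below does not call that service. It only shows the underlying idea, a nearest-neighbour match within a search radius, using a brute-force loop. The function names, the 2 arcsec radius, and the toy coordinates are assumptions made for illustration; the matching radius used for the catalog is not specified in the text.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(targets, reference, radius_arcsec=2.0):
    """Return, for each target, the label of the nearest reference source.

    targets       : list of (ra_deg, dec_deg)
    reference     : list of (ra_deg, dec_deg, label)
    radius_arcsec : hypothetical search radius; unmatched targets give None
    """
    radius_deg = radius_arcsec / 3600.0
    labels = []
    for ra, dec in targets:
        best_label, best_sep = None, radius_deg
        for ref_ra, ref_dec, label in reference:
            sep = angular_separation_deg(ra, dec, ref_ra, ref_dec)
            if sep <= best_sep:  # keep the closest source inside the radius
                best_label, best_sep = label, sep
        labels.append(best_label)
    return labels


# Toy positions only (not catalog data).
toy_targets = [(10.0, 20.0), (50.0, -5.0)]
toy_reference = [(10.0001, 20.0, "RR"), (200.0, 10.0, "EB")]
print(cross_match(toy_targets, toy_reference))
```

A brute-force loop scales as the product of the two catalog sizes, so for a catalog of this size a spatial index or a dedicated query service is the practical choice; the loop is kept here only because it makes the matching criterion explicit.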
Finally, the hand-engineered features and the latent vectors µ of the cross-matched objects are fed into the hierarchical random forest provided by imbalanced-learn for training, with no hyper-parameter optimization conducted.
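
The two-level label structure can be made concrete with a small sketch of the routing logic. It is not the implementation used in this work: the two classifiers are passed in as plain callables (hypothetical stand-ins for the trained random forests), and the function names are illustrative. On our reading, treating Pec as a parent node rather than a final class gives 7 + 6 = 13 final classes, which matches the total quoted above.

```python
# Level-1 classes and the level-2 subclasses of "Pec", as listed in the text.
LEVEL1 = ("AGNL", "CEP", "EB", "LPV", "Mira", "RR", "Pul_oth", "Pec")
LEVEL2_OF_PEC = ("C-Type", "HB", "RGB", "S-Type", "YSOL", "V_oth")


def level1_label(label):
    """Map any final label to the level-1 class it belongs to."""
    if label in LEVEL2_OF_PEC:
        return "Pec"
    if label in LEVEL1:
        return label
    raise ValueError(f"unknown label: {label!r}")


def predict_hierarchical(features, level1_clf, level2_clf):
    """Classify at level 1, then refine only the objects predicted as 'Pec'.

    level1_clf, level2_clf : hypothetical callables that map one feature
    vector to a label (stand-ins for the two trained random forests).
    Returns a list of (level1_label, level2_label_or_None) tuples.
    """
    results = []
    for x in features:
        first = level1_clf(x)
        second = level2_clf(x) if first == "Pec" else None
        results.append((first, second))
    return results


# Toy stand-in classifiers, for illustration only.
def toy_level1(x):
    return "Pec" if x[0] > 0.5 else "RR"


def toy_level2(x):
    return "YSOL"


print(predict_hierarchical([[0.9], [0.1]], toy_level1, toy_level2))
```

With this layout the second-level model is trained and evaluated only on objects in the Pec branch, so its performance figures are naturally reported separately from those of the first level.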

RESULTS
Here we discuss the classification results and their implications. We show the confusion matrices of our classification results for the test set in Figure 1 (a) and (b). Our new classification model has excellent completeness for certain classes of objects, such as AGNL (93 %), RR (92 %), and EB (92 %). However, the completeness for some classes, such as CEP (48 %), Mira (46 %), and YSOL (36 %), is poorer. This is likely due to insufficient samples of these classes in our training set. In addition, we obtain a class-averaged accuracy of 0.97 (0.84), precision of 0.71 (0.45), error rate of 0.03 (0.16), F1-score of 0.69 (0.50), and purity of 0.87 (0.54) in the first (second) level of our new classification. The second-level classification performs worse than its first-level counterpart, which may also be due to insufficient samples in the data set or intrinsic overlap between classes (e.g., C-Type and S-Type variables may be intrinsically very similar).
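
To make the quoted quantities concrete, the sketch below derives per-class completeness, precision, F1-score and accuracy from a confusion matrix and then averages them over classes. It assumes the common one-vs-rest definitions (completeness as recall, accuracy as (TP + TN)/N per class); the exact formulas behind the numbers above are not given in the text, and purity is left out because its definition is not stated. The matrix is a toy example, not the one shown in Figure 1.

```python
def per_class_metrics(confusion, labels):
    """Per-class metrics from a confusion matrix.

    confusion[i][j] = number of objects of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # missed members of class k
        fp = sum(row[k] for row in confusion) - tp  # contaminants in class k
        tn = total - tp - fn - fp
        completeness = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + completeness
        f1 = 2.0 * precision * completeness / denom if denom else 0.0
        metrics[name] = {
            "completeness": completeness,
            "precision": precision,
            "f1": f1,
            "accuracy": (tp + tn) / total,          # one-vs-rest accuracy
        }
    return metrics


def class_average(metrics, key):
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in metrics.values()) / len(metrics)


# Toy confusion matrix for three generic classes (illustrative numbers only).
toy_labels = ["A", "B", "C"]
toy_confusion = [
    [90, 8, 2],
    [5, 92, 3],
    [20, 30, 50],
]
toy_metrics = per_class_metrics(toy_confusion, toy_labels)
for key in ("accuracy", "precision", "completeness", "f1"):
    print(key, round(class_average(toy_metrics, key), 3))
```

Under these definitions the class-averaged accuracy can remain high while the class-averaged precision is much lower, because the true negatives of every one-vs-rest problem dominate the accuracy when there are many classes.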
We highlight specific classifications that differ from those in the ZTF CPVS. In particular, we find that the SR variable category in the ZTF CPVS very likely consists of multiple classes. As shown in Figure 1 (c) and (d), this class includes LPV, AGNL, C-Type, RGB, YSOL, and V_oth variables according to our classifier. Furthermore, we note that the ZTF CPVS provides class labels only for Galactic periodic variable stars. However, our cross-matching and classification results reveal that the ZTF CPVS may contain non-stellar variables. For instance, our cross-matched results include planetary nebulae, supernovae and active galactic nuclei. In addition, 11,837 of the SR variables in the ZTF CPVS are classified as AGNL by our classifier. We hope that a closer comparison of these labels can lead to improved purity in the ZTF CPVS.
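
A comparison of this kind amounts to cross-tabulating the original catalog labels against the new ones. The following minimal sketch shows one way to build such a table; the function name and the toy label lists are illustrative assumptions and do not reproduce the counts quoted above.

```python
from collections import Counter, defaultdict


def cross_tabulate(old_labels, new_labels):
    """Count how objects of each original class are distributed over new classes.

    Returns a mapping: original label -> Counter of new labels.
    """
    if len(old_labels) != len(new_labels):
        raise ValueError("label lists must have the same length")
    table = defaultdict(Counter)
    for old, new in zip(old_labels, new_labels):
        table[old][new] += 1
    return table


# Toy labels only (not catalog data).
toy_old = ["SR", "SR", "SR", "EA", "EA"]
toy_new = ["LPV", "AGNL", "LPV", "EB", "RR"]

toy_table = cross_tabulate(toy_old, toy_new)
for old_label, counts in toy_table.items():
    print(old_label, dict(counts))
```

Reading the table row by row shows which original categories are split across several new classes, which is the pattern reported here for the SR category.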

CONCLUSION
We present a new classifier and photometric labels for PVSs in the ZTF CPVS. Our new classifier is a two-level hierarchical random forest that uses latent features generated by a convolutional variational autoencoder and class labels obtained from the SIMBAD catalog. We obtain a reasonable level of completeness (>90 %) for certain classes of PVSs, although we have poorer completeness in other classes (∼40 % in some cases). Furthermore, we find non-stellar and extragalactic objects within the ZTF CPVS that were not previously identified. Finally, our classifications are available on Zenodo.
Figure caption (partial): In (g), we plot the distribution of AGNL, LPV, and Pec variables that were previously classified as SR in the ZTF CPVS. (h) is the same as (g), but for all AGNL, LPV, and Pec variables. Panels (i) and (j) are analogous to (g) and (h), respectively: in (i), we plot the distribution of EB, Pec, and RR variables that were previously classified as EA in the ZTF CPVS, and (j) is the same as (i), but for all EB, Pec, and RR variables.