Supernova host galaxy association and photometric classification of over 10,000 light curves from the Zwicky Transient Facility

Here we present a catalog of 12,993 photometrically classified supernova-like light curves from the Zwicky Transient Facility, along with candidate host galaxy associations. By training a random forest classifier on spectroscopically classified supernovae from the Bright Transient Survey, we achieve an accuracy of 80% across four supernova classes, resulting in a final data set of 8,208 Type Ia, 2,080 Type II, 1,985 Type Ib/c, and 720 SLSN. Our work represents a pathfinder effort to supply massive data sets of supernova light curves with value-added information that can be used to enable population-scale modeling of explosion parameters and investigate host galaxy environments.


INTRODUCTION
The upcoming Legacy Survey of Space and Time (LSST) by the Vera C. Rubin Observatory will provide exciting opportunities to investigate hundreds of thousands of supernovae and their host galaxy environments (Ivezić et al. 2019). However, this large influx of data also comes with numerous technical challenges, including the classification of supernovae, where techniques often suffer from limited and biased training data (Boone 2019; Villar et al. 2020; Sravan et al. 2020; Sánchez-Sáez et al. 2021). Within this context, we were motivated to develop a supernova classifier for the alert stream of the Zwicky Transient Facility (ZTF; Bellm et al. 2019) and provide a catalog of supernova-like light curves for as many events as possible.

Training Data
Our classifier training data were built using supernovae observed by ZTF that have been spectroscopically classified by the Bright Transient Survey (BTS; Perley et al. 2020), with the condition that the light curve contained alerts both before and after the peak of the event. We split the data into four main types: Type Ia (2,247), Type II (534), Type Ib/c (179), and SLSN (39). Treatment of this class imbalance, which is amplified by our data being skewed towards bright, low-redshift objects, is discussed in §2.5.

Preprocessing with Gaussian Process Regression
To mitigate challenges associated with sparse and incomplete data, we employed Gaussian process regression (GPR), which allows us to use the available data and uncertainties in both bands to interpolate the missing data points. We fit the GPR using the Python package george (Ambikasaran et al. 2015), following the method introduced by Boone (2019). We then corrected each light curve for Milky Way extinction using the dust maps of Schlegel et al. (1998).
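As an illustration of the interpolation step, the following is a minimal sketch of a two-dimensional Gaussian process over time and wavelength, written directly in NumPy rather than with george. It is not the pipeline used for the catalog. The effective wavelengths, the Matérn-3/2 kernel with fixed length scales, the zero mean, and the names `BAND_WAVELENGTH`, `matern32`, and `gp_interpolate` are all assumptions made for the example; in practice the hyperparameters would be fit to each light curve.

```python
import numpy as np

# Approximate effective wavelengths of the ZTF g and r bands in Angstrom
# (assumed round values, for illustration only).
BAND_WAVELENGTH = {"g": 4800.0, "r": 6400.0}


def matern32(xa, xb, length_scales, amplitude):
    """Matern-3/2 kernel on (time, wavelength) points with per-axis length scales."""
    diff = (xa[:, None, :] - xb[None, :, :]) / length_scales
    s = np.sqrt(3.0) * np.sqrt((diff ** 2).sum(axis=-1))
    return amplitude ** 2 * (1.0 + s) * np.exp(-s)


def gp_interpolate(t, band, flux, flux_err, t_grid, time_scale=20.0, wave_scale=6000.0):
    """Zero-mean GP fit to both bands at once; returns {band: (mean, std)} on t_grid.

    The length scales are held fixed here (an assumption), not optimized.
    """
    x_obs = np.column_stack([t, [BAND_WAVELENGTH[b] for b in band]])
    scales = np.array([time_scale, wave_scale])
    amplitude = np.abs(flux).max()

    # Covariance of the observations: kernel plus measurement variance.
    k_obs = matern32(x_obs, x_obs, scales, amplitude) + np.diag(flux_err ** 2)
    chol = np.linalg.cholesky(k_obs)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, flux))

    result = {}
    for name, wavelength in BAND_WAVELENGTH.items():
        x_new = np.column_stack([t_grid, np.full(len(t_grid), wavelength)])
        k_cross = matern32(x_new, x_obs, scales, amplitude)
        mean = k_cross @ alpha
        v = np.linalg.solve(chol, k_cross.T)
        var = amplitude ** 2 - (v ** 2).sum(axis=0)
        result[name] = (mean, np.sqrt(np.clip(var, 0.0, None)))
    return result
```

Because the kernel couples the two bands through the wavelength axis, an epoch observed only in g still constrains the r-band prediction at that time, which is what allows missing points in one band to be filled in from the other.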

Wavelet Transformation and Other Features
To model our supernova light curves we use a stationary wavelet transform (Lochner et al. 2016). We sampled the GPR curve at 100 points and applied a two-level wavelet transform, resulting in 800 coefficients per object. To deal with this high dimensionality, we use Principal Component Analysis (PCA), which reduces the 800 dimensions to 10. Along with these 10 wavelet features, we calculate the area beneath the light curve, the multi-band variance, the duration, and the range of the light curve.
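The feature extraction can be sketched as follows. This is an illustrative NumPy implementation, not the code used for the catalog: the Haar wavelet, the periodic boundary handling, and the function names `haar_swt`, `wavelet_features`, and `pca_reduce` are assumptions. Under these assumptions, one way to arrive at 800 coefficients is 100 samples × 2 bands × 2 levels × 2 coefficient sets (approximation and detail).

```python
import numpy as np


def haar_swt(signal, levels=2):
    """Undecimated (stationary) Haar transform with periodic boundaries.

    Returns [approx_1, detail_1, ..., approx_L, detail_L]; every array has
    the same length as the input because no downsampling is performed.
    """
    approx = np.asarray(signal, dtype=float)
    coeffs = []
    for level in range(levels):
        # The two filter taps are spread 2**level samples apart at each level.
        shifted = np.roll(approx, -(2 ** level))
        detail = (approx - shifted) / np.sqrt(2.0)
        approx = (approx + shifted) / np.sqrt(2.0)
        coeffs.extend([approx, detail])
    return coeffs


def wavelet_features(curve_g, curve_r, levels=2):
    """Concatenate the wavelet coefficients of both bands into one vector."""
    return np.concatenate(haar_swt(curve_g, levels) + haar_swt(curve_r, levels))


def pca_reduce(features, n_components=10):
    """Project the rows of `features` onto their leading principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

For a set of objects, stacking the output of `wavelet_features` row by row gives an array of shape (number of objects, 800), which `pca_reduce` maps to (number of objects, 10).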

Host Galaxy Association
We associated supernovae with candidate host galaxies following GHOST (Gagliano et al. 2021). GHOST also determines whether or not an event is a star by applying the same host galaxy association, but with tighter constraints on the criteria for a star to be the host. This allows us to easily filter out impurities in our final data set in §3.3. We estimate a host misassociation rate of 7.25%; see Gagliano et al. (2021) for more details.

Classifier
We used the balanced random forest classifier from the imbalanced-learn package (Lemaitre et al. 2016). We tuned the model with the randomized cross-validated search (RandomizedSearchCV) in scikit-learn (Pedregosa et al. 2012) and decided on the following parameters: 2,000 estimators, no bootstrap, maximum features of 4, maximum depth of 45, sampling with replacement, and a class weight of balanced subsample.

Performance
To measure the performance of our classifier we used stratified k-fold cross validation in scikit-learn (Pedregosa et al. 2012). We used the weighted F1-score, which takes into consideration both the precision and recall while also accounting for class imbalance, to evaluate our classifier.
Our classifier achieved an overall accuracy of 80% and a weighted F1-score of 0.82 according to 3-fold cross validation. Because our training data are biased towards bright, low-redshift objects, and many of the ZTF light curves being evaluated are only partially complete, these results likely represent the best-case performance of our classifier.
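The evaluation procedure can be sketched with scikit-learn as below. For brevity the example uses a plain `RandomForestClassifier` and synthetic data as stand-ins, so the scores it produces say nothing about the real classifier; the helper `cross_validated_scores` is hypothetical, and pooling the out-of-fold predictions before scoring (rather than averaging per-fold scores) is an assumption of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cross_validated_scores(clf, X, y, n_splits=3, seed=0):
    """Accuracy and weighted F1 from stratified k-fold out-of-fold predictions."""
    # Stratification keeps the class proportions the same in every fold.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_pred = cross_val_predict(clf, X, y, cv=folds)
    # average="weighted" weights each class's F1 by its number of true members.
    return accuracy_score(y, y_pred), f1_score(y, y_pred, average="weighted")


# Synthetic, imbalanced stand-in data with 14 features (not real light curves).
rng = np.random.default_rng(0)
y_cv = np.repeat(np.arange(4), [150, 36, 12, 6])
X_cv = rng.normal(size=(y_cv.size, 14)) + 2.0 * y_cv[:, None]

stand_in = RandomForestClassifier(n_estimators=50, random_state=0)
demo_accuracy, demo_f1 = cross_validated_scores(stand_in, X_cv, y_cv)
```

Stratification matters here because the rarest class has very few members: an unstratified split could leave a fold with no SLSN at all, making the per-class scores undefined for that fold.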

Comparison to Previous Works
The most direct comparison to our classifier is the transient classifier within ALeRCE (Sánchez-Sáez et al. 2021). According to their confusion matrix, they achieved the following values of median accuracy, with upper and lower limits representing the 95th and 5th percentile, respectively: $76^{+7}_{-6}$% of Type Ia, $53^{+7}_{-9}$% of Type II, $50^{+17}_{-6}$% of Type Ib/c, and $100^{+0}_{-26}$% of SLSN. Our classifier achieved the following accuracy: 85% of Type Ia, 60% of Type II, 69% of Type Ib/c, and 77% of SLSN. Our classifier achieved higher accuracy for every class except SLSN, while using only 14 features versus their 152 features. However, it should be noted that our classifier used a larger training set, which could be a source of difference in capability.
Table 1 shows results from our catalog of 12,993 objects that have been photometrically classified using our classifier (including our training set). Light curves were first obtained by querying the ALeRCE database (Sánchez-Sáez et al. 2021) for events that had a minimum of 10 alerts and a minimum probability of 0.7 of being a transient according to their classifier. These parameters were chosen based on our own experimentation. We found that 10 alerts provided, on average, sufficient coverage for our classifier to be useful, while still including a large portion of events. The minimum probability of 0.7 was chosen because we found that it provided an optimal balance between supernova purity and the overall number of light curves gathered. Additionally, we used GHOST for star removal.
Outside our training set, the distribution of predicted classes is 57.6% Type Ia, 17.9% Type II, 18.6% Type Ib/c, and 6% SLSN. To determine whether this distribution is representative of the actual discovery rate of supernovae, we compare it to the distribution reported by BTS (Perley et al. 2020): 72.7% Type Ia, 19.9% Type II, 5.9% Type Ib/c, and 1.4% SLSN.
This indicates that our classifier overestimates the number of Type Ib/c while underestimating the number of Type Ia.
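The size of these differences can be checked with a few lines of arithmetic on the percentages quoted above. The variable names are arbitrary, and the ratio of the two fractions is simply one convenient way to express the comparison.

```python
# Class fractions in percent, copied from the text.
catalog_fraction = {"Ia": 57.6, "II": 17.9, "Ib/c": 18.6, "SLSN": 6.0}
bts_fraction = {"Ia": 72.7, "II": 19.9, "Ib/c": 5.9, "SLSN": 1.4}

# A ratio above 1 means the class is more common in our catalog than in BTS.
fraction_ratio = {
    name: catalog_fraction[name] / bts_fraction[name] for name in catalog_fraction
}

for name, ratio in fraction_ratio.items():
    print(f"{name:5s} catalog/BTS = {ratio:.2f}")
```

By this measure the Type Ia fraction is about 0.8 times the BTS value and the Type Ib/c fraction about 3 times, and the SLSN fraction is also roughly 4 times higher than in BTS.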

CONCLUSION
We have presented a simple yet effective classification model that achieves 80% accuracy in distinguishing SN Ia, II, Ib/c, and SLSN events in ZTF light curves. We have also provided a large set of 12,993 ZTF light curves that have been classified using only photometry and associated with candidate host galaxies. These products can be used to build upon current classification techniques, as well as to enable population-scale studies of supernova explosions.