DeepZipper: A Novel Deep Learning Architecture for Lensed Supernovae Identification

Large-scale astronomical surveys have the potential to capture data on large numbers of strongly gravitationally lensed supernovae (LSNe). To facilitate timely analysis and spectroscopic follow-up before the supernova fades, an LSN needs to be identified soon after it begins. To quickly identify LSNe in optical survey datasets, we designed ZipperNet, a multi-branch deep neural network that combines convolutional layers (traditionally used for images) with long short-term memory (LSTM) layers (traditionally used for time series). We tested ZipperNet on the task of classifying objects from four categories -- no lens, galaxy-galaxy lens, lensed type Ia supernova, lensed core-collapse supernova -- within high-fidelity simulations of three cosmic survey data sets -- the Dark Energy Survey (DES), Rubin Observatory's Legacy Survey of Space and Time (LSST), and a Dark Energy Spectroscopic Instrument (DESI) imaging survey. Among our results, we find that for the LSST-like dataset, ZipperNet classifies LSNe with a receiver operating characteristic area under the curve of 0.97, predicts the spectroscopic type of the lensed supernovae with 79% accuracy, and demonstrates similarly high performance for LSNe 1-2 epochs after first detection. We anticipate that a model like ZipperNet, which simultaneously incorporates spatial and temporal information, can play a significant role in the rapid identification of lensed transient systems in cosmic survey experiments.


INTRODUCTION
Strong gravitational lensing is an expansive probe of both astrophysics and cosmology. In systems with strong lensing, light from a background object (the source) is deflected by the gravitational potential of an interposed foreground object (the lens), producing characteristic features such as arcs, Einstein rings, and/or multiple images (Treu 2010). A key subclass of lenses, strongly lensed transients, have time-variable brightness in the source galaxy. Due to the cosmological distances involved in strong lensing, the transient objects most commonly observed as lensed are quasars, which vary in brightness on time scales of several years (Hook et al. 1994; Helfand et al. 2001), and supernovae, which can reach peak brightness within days and then dim over the course of weeks to months (Mihalas 1963).
One of the principal features of lensed transients is the time delay between the arrival of photons that take different paths around the lens. Photons from multiple images of source objects travel different distances to Earth and experience different magnitudes of the gravitational potential due to the geometry of the lensing system. Given the constant speed of light, the differences in the path lengths and gravitational potentials traversed by the photons produce an offset in the arrival times for photons that are emitted at the same point in the source's lightcurve. Therefore, sources with time-varying brightness in these systems exhibit an observable time delay between the individual images of the lensed source (Refsdal 1964).
Strongly lensed transients are particularly useful for a wide range of investigations. For example, the magnification of these background sources and their environments can reveal new information on the eruption and prevalence of these objects at earlier times in the Universe (Treu & Marshall 2016). Time-delay cosmography (TDC) is a technique that uses the time-varying brightness of compact systems that have undergone strong gravitational lensing (SL) to perform a geometric measurement of H_0 (Refsdal 1964). TDC entails a measurement of this time delay and modeling of the full strong lensing system; for a given lens model, H_0 is then inversely proportional to the measured time delay. This technique has been utilized with quasars (persistent variable objects) to measure H_0 to better than five percent precision in time-varying SL systems (Shajib et al. 2020; Wong et al. 2019).
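For reference, the inverse proportionality between H_0 and the time delay follows from the standard time-delay relation, sketched below. The notation is introduced here for illustration and does not appear in the text above: \theta_i and \theta_j are image positions, \beta is the source position, \psi is the scaled lens potential, z_d is the lens redshift, and D_d, D_s, and D_ds are the angular diameter distances to the lens, to the source, and between the two.

```latex
\Delta t_{ij} = \frac{D_{\Delta t}}{c}
  \left[ \phi(\theta_i, \beta) - \phi(\theta_j, \beta) \right],
\qquad
\phi(\theta, \beta) = \frac{(\theta - \beta)^2}{2} - \psi(\theta),
\qquad
D_{\Delta t} \equiv (1 + z_{\rm d}) \frac{D_{\rm d} D_{\rm s}}{D_{\rm ds}}
  \propto \frac{1}{H_0}
```

Each angular diameter distance scales as c/H_0, so a measured delay combined with a lens model for the Fermat potential difference \Delta\phi fixes D_{\Delta t}, and hence H_0 \propto \Delta\phi / \Delta t.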
The cosmic expansion rate today, H_0, is a critical parameter for understanding the evolution of the Universe. There are multiple probes of H_0, including the cosmic distance ladder, extrapolation from the cosmic microwave background (CMB), and strongly lensed variable sources, like supernovae and quasars. There is currently a significant tension amongst the probes (Freedman 2021), particularly between early-Universe probes like the CMB (Planck Collaboration et al. 2020) and late-Universe probes like the type-Ia SNe distance ladder anchored with Cepheid variable stars (Riess et al. 2021). Time-delay cosmography, as a probe of H_0, does not rely on anchoring measurements to late-Universe objects or extrapolating from early-Universe physics, so it offers a new perspective on the expansion rate of the Universe today.
Table 1. Summary of the instruments and observational procedures emulated in the simulated data sets: camera properties, observing conditions, and survey cadence. We show only the i band for each property, because it illustrates the key discerning features between the datasets. The cadence displays the mean and standard deviation of the intra-band time separation. The full cadence information and data quality properties are available in the deeplenstronomy input files, which accompany this work (Morgan et al. 2021a).
Aside from quasars, supernovae (SNe) are the other major class of common time-varying sources that are bright enough to be detected at cosmological distances. SNe present an experimental challenge in TDC because they become bright and fade on the scale of months, therefore requiring rapid identification and analysis to obtain a measurement of the time delay. However, the larger variability on shorter timescales compared to quasars offers a competitive advantage in measuring the time delay. Another advantage of LSNe is that a common subclass of SNe (SNe-Ia) has a standardizable brightness (Tripp 1998), which can facilitate highly accurate modeling of the lensing gravitational potential and therefore a more precise H_0 measurement (Foxley-Marrable et al. 2018a; Birrer et al. 2021a; Kolatt & Bartelmann 1998; Oguri & Kawano 2003). To date, only a handful of LSNe have been detected (Kelly et al. 2015; Rodney et al. 2021; Amanullah et al. 2011; Quimby et al. 2014), and only two LSNe-Ia have been discovered (Goobar et al. 2017; Rodney et al. 2015). Large optical surveys are ideal datasets to search for LSNe, since high area coverage and returning to the same field multiple times increase the chances of observing an LSN, though the rarity of LSNe still makes their detection a challenging problem. For example, based on the area covered, imaging depth, and length of observations, only 0.5-2 LSNe are expected to be in DES data (Oguri 2019). Looking forward to the next era of optical survey astronomy, the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST; Ivezić et al. 2019) plans to cover the entire southern sky to greater depth and with higher cadence than any predecessor survey, e.g., the Dark Energy Survey (DES; Diehl et al. 2018). Preliminary forecasts indicate that the Rubin Observatory will detect hundreds to thousands of these systems (Goldstein & Nugent 2016; Oguri 2019; Wojtak et al. 2019).
Each detected time-varying SL system has the potential to produce an independent measurement of H_0, meaning that the measured statistical precision on H_0 using this technique will scale with 1/√N, where N is the number of detected systems. Therefore, one of the main goals in the LSST era is the identification and characterization of as many LSNe-Ia as possible to precisely measure H_0 (The LSST Dark Energy Science Collaboration et al. 2018).
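As a rough illustration of the 1/√N scaling, the short sketch below combines N independent measurements under the simplifying assumption that every system yields the same fractional uncertainty. The 10% per-system value and the function name are placeholders chosen for illustration, not results from this work.

```python
import math


def combined_fractional_error(per_system_error: float, n_systems: int) -> float:
    """Statistical error on H_0 from n independent, equally precise systems."""
    if n_systems < 1:
        raise ValueError("need at least one system")
    return per_system_error / math.sqrt(n_systems)


# Hypothetical 10% precision per lensed system (illustrative only).
for n in (1, 4, 25, 100):
    print(n, round(combined_fractional_error(0.10, n), 4))
```

Under these assumptions, moving from a handful of systems to roughly one hundred improves the statistical precision by an order of magnitude, which is why the expected LSST yield matters.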
Fast and robust algorithms for detecting LSNe-Ia are essential to keep pace with the data stream of LSST and surveys with comparable data size. Furthermore, because supernovae fade after their explosion, they must be identified rapidly to facilitate follow-up observations and more detailed characterization of the system for lensing analyses. One approach to detecting LSNe is to observe known SL systems and wait for a SN to appear. This approach leverages existing SNe detection strategies and infrastructure and is expected to detect all SNe in known SL systems in the southern hemisphere. However, the Rubin Observatory's LSST will probe deeper than any previous optical survey. It is expected that a large population (∼10%) of all SL systems will have a source galaxy that is too faint to have been detected previously, and they will be missing from the list of target systems (Ryczanowski et al. 2020). Another approach leverages the standardizable brightness of SNe-Ia and proposes the search for brighter than expected (due to lensing magnification) SNe near elliptical galaxies (Goldstein & Nugent 2016). This magnitude threshold technique is expected to detect ∼500 LSNe-Ia, but is specific to elliptical lens galaxies and will miss LSNe-Ia whose dates of peak brightness do not align with the LSST cadence.
In this work, we introduce a deep neural network architecture designed to identify LSNe-Ia without the requirements of targeting known SL systems, elliptical galaxies, or observing the peak brightness of the SN. Deep learning algorithms have been highly successful, both in terms of accuracy and speed, in the fields of image-based SL system detection (Jacobs et al. 2019, among several others) and light-curve-based SNe classification (Möller & de Boissière 2019, among several others). A convolutional neural network (CNN; LeCun et al. 1989) is a kind of deep algorithm that slides learnable matrix operators along images to emphasize or de-emphasize characteristic shapes, such as lensing arcs. Recurrent neural networks (RNNs; Elman 1990) can model sequences of data, such as lightcurves, to make classifications based on how the data vary in time. Within the RNN architecture, long short-term memory (LSTM) networks (Hochreiter & Schmidhuber 1997), which introduce cells with learnable pathways for information to be passed along, have demonstrated improved performance in sequence characterization. In our approach, we combine a CNN and an LSTM as two branches and utilize the information from both branches together, an architecture that leverages the data structure of each input and is tailored to the challenge of immediate LSNe-Ia identification. Other studies, such as Ramanah et al. (2021), have combined convolutional and recurrent layers in different deep learning architectures to target LSNe identification, though our multi-branch approach is unique in that it places the spatial and temporal information on the same footing from the start.
We present this work as follows: In Section 2, we describe the simulations used for training and testing ZipperNet, the data processing procedure for utilizing both spatial and time series information, and the architecture of ZipperNet. In Section 3, we describe the results from applying ZipperNet to four simulated datasets that emulate modern survey data products. In Section 4, we discuss the performance of ZipperNet with respect to the dataset properties and the network architecture. We conclude in Section 5.

Data Simulation
We simulated images of astronomical strong lensing systems with the open-source software package deeplenstronomy (Morgan et al. 2021b), which is built around lenstronomy (Birrer & Amara 2018; Birrer et al. 2021b), a widely used package that performs gravitational lensing calculations, modeling, and simulations in a variety of contexts. deeplenstronomy provides additional features that are important for the era of large-scale surveys and deep learning studies of strong lenses, e.g., image and SN injection, probability distribution sampling, and realistic observing conditions.

Survey Emulation
We simulated four datasets, distinguished by their camera specifications, observing conditions, and cadence, each emulating a distinct modern or next-generation cosmic survey: one wide-field and one deep-field dataset for DES (DES-wide and DES-deep, respectively); a wide-field LSST dataset (LSST-wide); and a three-day cadence Dark Energy Camera (DECam; Flaugher et al. 2015) dataset similar to the Dark Energy Spectroscopic Instrument DECam Observation of Transients (DESI-DOT) program (Palmese & Wang; DECam Proposal 2021A-0148). All datasets utilize the g, r, i, and z optical filters and include 45-by-45-pixel images. The DES-wide, DES-deep, and DESI-DOT datasets simulate images from the DECam, which connects the pixel size, gain, and read noise across those datasets. The LSST-wide dataset simulates LSSTCam (Stalder et al. 2020) images, which have similar but slightly adjusted values for the camera properties.
The DES-wide and DESI-DOT datasets both use the real observing conditions (seeing and sky brightness) from the DES wide-field survey (Abbott et al. 2018). For DES-wide and DESI-DOT, the exposure times are 90 and 60 seconds, respectively. The DES-deep dataset has different seeing, sky brightness, zeropoint, and exposure times chosen for the DES SN program (Abbott et al. 2019). In general, these exposure times are on the order of 200 seconds, but the acceptable seeing criteria can be worse than the DES wide-field survey. The LSST-wide observing conditions are estimated from simulations of the first year of the survey and utilize 30-second exposures (Marshall et al. 2017).
Each dataset has a specific and distinct cadence, and the density of observations significantly affects the analysis in this work. The LSST-wide, DESI-DOT, and DES-deep datasets contain a baseline of 14 epochs per band. While the LSST main survey cadence is still being designed at the time of this writing, our fiducial 14-epoch data sequences are sufficiently short that they are obtainable from both the "baseline" and "rolling" cadences that are under consideration for the survey. The DES-wide dataset contains seven exposures in each band spread over 5.5 years to match the real survey. We sampled the observation times of several fields from the DES footprint to generate the deeplenstronomy simulations. The LSST-wide cadence is estimated using several realizations of an intra-band spacing of 12 ± 5 days over a three-month period (Marshall et al. 2017). The DES-deep cadence is estimated using several realizations of an intra-band spacing of 6 ± 1 days over a one-month period (Abbott et al. 2019). Lastly, the DESI-DOT cadence is an exposure in each band every three nights over a one-month period.
We seek to avoid jargon confusion between astronomy and machine learning contexts with respect to the term, "epoch." In this work, "epoch" refers to one astronomical exposure or data-collection period. When discussing neural network training steps, we use the term "training iteration" instead of the traditional "epoch" that is used in machine learning. The data sets are summarized in Table 1. All deeplenstronomy input files for this analysis are accessible for reproduction of the datasets in Morgan et al. (2021a).

Object, System, and Population Simulation
In total, we simulate 17 different types of astronomical systems to reflect the diversity of systems that classifiers are likely to encounter when applied to observed optical survey data: (1) one galaxy behind one foreground galaxy; (2) two galaxies at the same redshift and with small angular separation (1 to 4 arcseconds); (3) one galaxy (with one SN-Ia) behind one foreground galaxy; (4) one galaxy (with one SN-CC) behind one foreground galaxy; (5) one galaxy; (6) two galaxies (with small angular separation and at the same redshift) in front of one background galaxy that is at a higher redshift; (7) one galaxy behind one foreground galaxy that has one star from the Milky Way in the image cutout; (8) two galaxies with small angular separation and at similar redshifts that has one star from the Milky Way in the image cutout; (9) one galaxy (with one SN-Ia) behind one foreground galaxy that has one star from the Milky Way in the image cutout; (10) one galaxy (with one SN-CC) behind a foreground galaxy that has one star from the Milky Way in the image cutout; (11) one galaxy that has one star from the Milky Way in the image cutout; (12) two galaxies with small angular separation and at similar redshifts in front of one background galaxy at a higher redshift and one star from the Milky Way in the image cutout; (13) empty sky; (14) one Milky Way star; (15) two Milky Way stars; (16) one galaxy with one SN-Ia; and (17) one galaxy with one SN-CC. Figure 1 shows sample images and time series of the 17 systems.
To further enhance realism, the properties of all simulated objects are drawn from real data: all inherent physical correlations of these parameters are included in our dataset. First, a galaxy that enters the simulations as the lens has properties drawn from a population of ∼2,000 observed galaxies. The velocity dispersion and spectroscopic redshift were measured by the Sloan Digital Sky Survey (SDSS; York et al. 2000). We obtain a color-independent ellipticity, as well as a band-wise half-light radius, Sérsic profile index, and magnitude, from DES Year 1 data (Tarsitano et al. 2018; Abbott et al. 2018). Next, a putative source galaxy, whether used in a lensing or non-lensing system, draws its properties from a population of ∼500,000 galaxies measured by DES (Abbott et al. 2021). The band-wise magnitudes of foreground Milky Way stars were also drawn from DES data (Abbott et al. 2021). Finally, the SNe were injected using public SN spectral energy distributions (Kessler et al. 2010) available in deeplenstronomy, which redshifts the distribution and calculates the observed magnitude in each band. The injected SN reaches peak brightness anytime between 20 days before the first observation and 20 days after the final observation, so the dataset contains falling lightcurves, rising lightcurves, and complete lightcurves. We do not include the effects of microlensing in our simulated dataset, because it is expected to be small compared to the change in brightness observed from a SN (Foxley-Marrable et al. 2018b).
For all four survey emulation datasets, we simulate the same strong lensing systems: all strong lensing systems are emulated in all four of the cosmic survey contexts. While our simulated datasets are subject to selection biases from the detection limits of DES and SDSS for source and lens galaxies, respectively, the data nonetheless contain realistic collections of object properties, which enables the validation of our deep learning detection method on realistic survey data.

Data Processing
deeplenstronomy emulates observational surveys by producing a time series of images. With the exception of small effects from observing conditions, images in a series will be approximately identical, because astronomical objects are approximately stationary on month-long timescales. Even in the case of a SN, the primary difference is the presence of one or more point sources in some of the images in the time series. We condensed the image information to a single-image input for ZipperNet by averaging all images in the time series on a pixel-by-pixel basis within each band. This processing reduces noise fluctuations from the observing conditions while preserving the presence of SNe, thus increasing the overall signal-to-noise ratio of the image and making faint objects more visible. After averaging the images, the pixel values of the mean images are scaled to range from 0 to 1 on a per-example basis to preserve color relationships.
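The averaging and scaling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in this analysis: the array layout (epochs, bands, rows, columns), the function name, and the choice of min-max scaling with a single minimum and maximum shared by all bands of an example are assumptions.

```python
import numpy as np


def mean_image(series: np.ndarray) -> np.ndarray:
    """Collapse a time series of images into one scaled image per band.

    series has the assumed shape (n_epochs, n_bands, n_pix, n_pix).
    The result has shape (n_bands, n_pix, n_pix) with values in [0, 1].
    """
    stacked = series.mean(axis=0)  # pixel-by-pixel mean within each band
    lo, hi = stacked.min(), stacked.max()
    if hi == lo:  # constant example: avoid division by zero
        return np.zeros_like(stacked, dtype=float)
    # One (lo, hi) pair for the whole example keeps the relative
    # brightness between bands, i.e. the color information.
    return (stacked - lo) / (hi - lo)


rng = np.random.default_rng(0)
example = rng.normal(loc=10.0, scale=1.0, size=(14, 4, 45, 45))
image_input = mean_image(example)
print(image_input.shape)
```

Scaling each band separately would also map values to [0, 1], but it would discard the band-to-band brightness ratios that the per-example scaling is meant to preserve.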
To concisely characterize the temporal behavior of a time series of images and to avoid relying on source identification or deblending algorithms, we follow a process that reflects a standardized background-subtracted aperture flux measurement in astronomy. We measure the signal (S) and background (B) to extract a background-subtracted brightness (S − B) of individual images within a predefined circular aperture at the center of each image. In equation form, this process can be expressed as
$$ S - B = \sum_{i,j}^{N} X_{i,j} W_{i,j} - \frac{\sum_{i,j}^{N} W_{i,j}}{\sum_{i,j}^{N} (1 - W_{i,j})} \sum_{i,j}^{N} X_{i,j} (1 - W_{i,j}), $$
where i and j index the row and column of the image pixels, N gives the number of pixels along one dimension of the images, X is an image, and W is the aperture. W_{i,j} is zero outside the aperture and one inside the aperture, so the second term is the flux outside the aperture rescaled to the aperture area. In this work, the circular aperture has a 20-pixel radius, which corresponds to 5.26 arcseconds for DECam and 4.0 arcseconds for LSSTCam, both much larger than typical galaxy-scale lens Einstein radii, which are approximately in the range [0.5, 1.2] arcsec. For processing within the neural networks, we again scale the extracted brightness to between 0 and 1 on a per-example basis. A byproduct of this process is the significant increase in noise in the photometry measurements; we find, however, that this effect does not hinder the deep learning methods. After averaging the images and extracting lightcurves, each example input to ZipperNet is a 45-pixel-by-45-pixel image in each band and a lightcurve of the extracted brightness at each time step in each band.
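A minimal NumPy sketch of this aperture measurement follows. It assumes that the background is estimated from the pixels outside the aperture and rescaled to the aperture area, and that the lightcurve is min-max scaled per example; the function names and the array layout (epochs, bands, rows, columns) are hypothetical and chosen for illustration.

```python
import numpy as np


def circular_aperture(n_pix: int = 45, radius: float = 20.0) -> np.ndarray:
    """Binary mask W: one inside a centered circle, zero outside."""
    center = (n_pix - 1) / 2.0
    rows, cols = np.indices((n_pix, n_pix))
    inside = (rows - center) ** 2 + (cols - center) ** 2 <= radius ** 2
    return inside.astype(float)


def background_subtracted_flux(image: np.ndarray, aperture: np.ndarray) -> float:
    """S - B for one image, with B scaled from the outside-aperture pixels."""
    outside = 1.0 - aperture
    signal = np.sum(image * aperture)
    background = np.sum(image * outside) * aperture.sum() / outside.sum()
    return float(signal - background)


def extract_lightcurve(series: np.ndarray, aperture: np.ndarray) -> np.ndarray:
    """Map (n_epochs, n_bands, N, N) images to a scaled (n_epochs, n_bands) lightcurve."""
    outside = 1.0 - aperture
    signal = (series * aperture).sum(axis=(-2, -1))
    background = (series * outside).sum(axis=(-2, -1)) * aperture.sum() / outside.sum()
    flux = signal - background
    lo, hi = flux.min(), flux.max()
    return (flux - lo) / (hi - lo) if hi > lo else np.zeros_like(flux)


aperture = circular_aperture()
flat_sky = np.full((45, 45), 3.0)          # uniform background only
with_source = flat_sky.copy()
with_source[22, 22] += 100.0               # point source at the image center
print(background_subtracted_flux(flat_sky, aperture))     # approximately 0
print(background_subtracted_flux(with_source, aperture))  # approximately 100
```

Because the measurement is a pair of masked sums, it needs no source detection or deblending, which is the property the text relies on.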
The operations required to extract the photometry of the systems have little computational cost and can be easily broadcast across many images at once, which is a great benefit considering the scale of modern astronomical datasets.
Lastly, we define a classification scheme for our 17 simulated systems. We construct a four-class problem, where the classes are "No Lens," "Lens," "LSNe-Ia," and "LSNe-CC," as shown in Figure 1. The No Lens class collects the cases where there is no gravitational lensing present (labeled as 2, 5, 8, 11, 13, 14, 15, 16, and 17). The Lens class collects cases where there is gravitational lensing but no SNe in the background galaxy (labeled as 1, 6, 7, and 12). The LSNe-Ia and LSNe-CC classes collect cases with gravitational lensing and a SN in the background galaxy (labeled as 3 and 9 and as 4 and 10, respectively). Each of the four classes contains 1,250 examples, with equal representation of the individual constituent cases. To augment the datasets and increase the size of each class eightfold to 10,000, we rotated and mirrored the images; the lightcurve was unaffected due to the circular aperture extraction method. When we split the datasets into smaller training and testing datasets, none of the examples in the testing dataset are rotated or mirrored versions of objects in the training dataset: the two datasets are rotated and mirrored independently.
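The class assignment and the eightfold augmentation can be written compactly, as sketched below. The system numbers and class names come from the text; the dictionary and function names, the (bands, rows, columns) image layout, and the specific choice of four 90-degree rotations each paired with a mirror image are assumptions made for illustration.

```python
import numpy as np

# System labels from the text, grouped into the four classes.
CLASS_TO_SYSTEMS = {
    "No Lens": [2, 5, 8, 11, 13, 14, 15, 16, 17],
    "Lens": [1, 6, 7, 12],
    "LSNe-Ia": [3, 9],
    "LSNe-CC": [4, 10],
}
SYSTEM_TO_CLASS = {
    system: name for name, systems in CLASS_TO_SYSTEMS.items() for system in systems
}


def eightfold_augment(image: np.ndarray) -> list:
    """Return the 8 rotated/mirrored copies of a (n_bands, n_pix, n_pix) image."""
    copies = []
    for k in range(4):
        rotated = np.rot90(image, k=k, axes=(1, 2))  # rotate in the image plane
        copies.append(rotated)
        copies.append(np.flip(rotated, axis=2))      # mirror left-right
    return copies


image = np.arange(4 * 45 * 45, dtype=float).reshape(4, 45, 45)
augmented = eightfold_augment(image)
print(len(augmented), SYSTEM_TO_CLASS[9])
```

To respect the constraint that no test example is a transformed copy of a training example, the split would be applied to the unaugmented examples first and the augmentation applied to each subset afterwards.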

ZipperNet
Our ZipperNet architecture is designed to treat the image-based information and the lightcurve-based information on equal footing. On one branch, the images are passed through convolutional layers, flattened into one-dimensional arrays, and condensed in size. On another branch, the lightcurves are passed through recurrent layers composed of LSTM cells and flattened into one-dimensional arrays. The flattened images and lightcurves output by each branch are condensed to equal sizes, concatenated, and then mapped to four output features, one for each class of our problem. We then obtain a single classification by determining which of the four output features has the largest value. By zipping convolutional layers and recurrent layers into one coherent deep learning architecture (in joining the branches), the training of the network will optimize weights in both types of layers simultaneously. The architecture is illustrated in Figure 2, and the specifications of each layer are presented in Table 2. All deep learning code in this analysis utilizes the PyTorch (Paszke et al. 2019) library.
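The two-branch layout described above can be sketched in PyTorch as follows. This is an illustrative skeleton only: the class name, the number of layers, the channel counts, kernel sizes, and hidden sizes are placeholder assumptions and are not the values of Table 2. It also assumes 45-by-45-pixel images in four bands and lightcurves of 14 epochs in four bands.

```python
import torch
from torch import nn


class ZipperNetSketch(nn.Module):
    """Two-branch classifier: CNN on images, LSTM on lightcurves (sizes are placeholders)."""

    def __init__(self, n_bands=4, n_epochs=14, n_classes=4, hidden=32, branch_dim=64):
        super().__init__()
        # Image branch: convolutions, flatten, then condense to branch_dim.
        # With 45x45 inputs the spatial size goes 45 -> 41 -> 20 -> 18 -> 9.
        self.image_branch = nn.Sequential(
            nn.Conv2d(n_bands, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, branch_dim), nn.ReLU(),
        )
        # Lightcurve branch: LSTM over epochs, flatten, condense to branch_dim.
        self.lstm = nn.LSTM(input_size=n_bands, hidden_size=hidden, batch_first=True)
        self.lightcurve_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_epochs * hidden, branch_dim), nn.ReLU(),
        )
        # Joined ("zipped") features mapped to one output per class.
        self.classifier = nn.Linear(2 * branch_dim, n_classes)

    def forward(self, image, lightcurve):
        image_features = self.image_branch(image)        # (batch, branch_dim)
        sequence_out, _ = self.lstm(lightcurve)          # (batch, n_epochs, hidden)
        lightcurve_features = self.lightcurve_head(sequence_out)
        joined = torch.cat([image_features, lightcurve_features], dim=1)
        return self.classifier(joined)                   # (batch, n_classes)


model = ZipperNetSketch()
images = torch.rand(5, 4, 45, 45)      # batch of mean images
lightcurves = torch.rand(5, 14, 4)     # batch of (epochs, bands) lightcurves
logits = model(images, lightcurves)
predicted_class = logits.argmax(dim=1)  # class with the largest output feature
print(logits.shape, predicted_class.shape)
```

Because both branches feed a single output layer, one backward pass through the loss updates the convolutional and recurrent weights together, which is the joint optimization the text describes.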
For each of the four datasets in our analysis, we trained an individual ZipperNet: we trained on 90% (9,000 samples) of the simulated data and used the remaining 10% (1,000 samples) for testing. We did not utilize any of our simulated data as a validation dataset for hyperparameter optimization. Rather, we chose the ZipperNet hyperparameter settings based on an independent toy dataset composed of images of different shapes (squares versus circles) with different time-varying properties (parabolic versus linear change in total brightness) and fixed the settings for each of the four ZipperNet instances. This choice is motivated by a desire to keep the model constant and prevent the hyperparameter settings from favoring one of the simulated datasets over another. We therefore control confounding variables in our experiment such that we can connect differences in model performance between the four simulated datasets to dataset properties.
We chose a batch size of five for the training because LSTM cells generally perform better when processing smaller amounts of information at the same time. We also utilized categorical cross entropy as the loss function of the network and a learning rate of 0.001 with the Adam (Kingma & Ba 2017) optimizer. As the network trained, we monitored the accuracy (the number of correct classifications divided by the total number of samples) for the training and testing datasets. In each case, the training and testing accuracy plateaued after ∼ 10 training iterations, but we allowed the training to continue for 40 training iterations. The fully trained network is chosen as the point during training with the highest testing accuracy. We use the term "training iteration" in place of the traditional "epoch" to avoid confusion with the astronomy term "epoch" utilized in other parts of this analysis. The accuracy for the training and testing sets for each of the four datasets is presented in Table 3.
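The training configuration described above (batch size of five, categorical cross entropy, Adam with a learning rate of 0.001, and selection of the training iteration with the highest testing accuracy) can be sketched as a short PyTorch loop. The helper names, the stand-in linear model, and the random data are placeholders so that the example runs on its own; they are not part of this analysis.

```python
import copy

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def accuracy(model, features, labels):
    """Fraction of samples whose largest output matches the label."""
    model.eval()
    with torch.no_grad():
        return (model(features).argmax(dim=1) == labels).float().mean().item()


def train_and_select(model, train_set, test_x, test_y, n_iterations=40):
    """Train, then restore the weights from the iteration with the best test accuracy."""
    loader = DataLoader(train_set, batch_size=5, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    best_accuracy, best_state = -1.0, None
    for _ in range(n_iterations):  # one pass over the training set per iteration
        model.train()
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            loss_fn(model(batch_x), batch_y).backward()
            optimizer.step()
        current = accuracy(model, test_x, test_y)
        if current > best_accuracy:
            best_accuracy = current
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_accuracy


# Stand-in model and random data, used only to exercise the loop.
torch.manual_seed(0)
x = torch.randn(60, 8)
y = torch.randint(0, 4, (60,))
stand_in = nn.Linear(8, 4)
trained, best = train_and_select(
    stand_in, TensorDataset(x[:50], y[:50]), x[50:], y[50:], n_iterations=3
)
print(round(best, 3))
```

In a two-input setting such as ZipperNet, the dataset would yield (image, lightcurve, label) triples and the model call would take both inputs; the selection logic is unchanged.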

RESULTS
In general, we find that the ZipperNet model is capable of identifying SL systems, identifying the presence of an SN within SL systems, and distinguishing between LSNe-Ia and LSNe-CC. Furthermore, we demonstrate that ZipperNet can perform these classifications on simulated datasets across wide ranges of depth, observing conditions, and cadence. We use a four-class confusion matrix to compare the predicted and true labels resulting from ZipperNet's execution (see Figure 3). In all the matrices, the strong representation along the main diagonal indicates correct classifications. ZipperNet had the lowest performance on the DES-wide data, and we discuss this result in Section 4.
There are multiple primary sources of confusion. Confusion between the No Lens and Lens classes is likely due to pixel-based features being difficult to distinguish from the images. This confusion is the strongest within the DES-wide dataset, so we attribute this behavior to the imaging depth, since the DES-wide dataset is the shallowest dataset simulated and faint source galaxies would become more difficult to identify. In the deeper datasets, this confusion is caused by exposures with high seeing or systems with small Einstein radii, where in both cases objects blur together. The confusion between the LSNe-Ia and LSNe-CC classes is likely due to the difference in cadence and seeing in the surveys. DES-deep, LSST-wide, and DESI-DOT are much higher cadence datasets than DES-wide (see Table 1), indicating that more densely sampled SN lightcurves are easier for the LSTM cells to classify. A comparison of the DES-deep and DES-wide confusion matrices indicates that there are situations where cadence can be more important than seeing (specifically noting the LSNe-Ia versus LSNe-CC confusion), since the DES-wide dataset had better seeing than DES-deep, but that these situations require dramatic differences in cadence density. Furthermore, the LSST-wide and DESI-DOT datasets have much better seeing than the DES-deep dataset, which shows the importance of being able to resolve spatial features when making classifications. We discuss the importance of the cadence and seeing in more detail in Section 4.
In practice, a general LSN identifier is itself a useful tool: LSNe-CC can be utilized for time-delay cosmography measurements, though they offer less precision on the final H_0 measurement. If we reframe this classification scheme from a four-class problem to a two-class problem (No Lens and Lens in one class and LSNe-Ia and LSNe-CC in a second class), the performance is boosted.
We can obtain two-class problem classifications from our four-class network outputs by selecting the class predicted by the network and sorting it into LSN or not LSN. Figure 4 shows a Receiver Operating Characteristic (ROC) curve for the DES-wide, LSST-wide, DES-deep, and DESI-DOT datasets. ROC curves are standard tools for assessing the predictive power of a classifier by calculating the false positive rate and true positive rate at all possible probabilities output by the classifier. A perfect classifier will have an Area Under Curve (AUC) of 1.0 while a classifier that guesses randomly will have an AUC of 0.5. ZipperNet shows high performance when classifying LSNe versus everything else, and this high performance extends across the LSST-wide, DES-deep, and DESI-DOT datasets. This result can also be interpreted from the confusion matrix (Figure 3), where there is little confusion between the LSNe classes and non-LSNe classes. We comment on this high performance in the context of ZipperNet's architecture in Section 4.
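To make the ROC construction concrete, the sketch below computes an AUC from per-example scores in plain Python. A threshold sweep needs a continuous score, so the example assumes the LSN score is the sum of the two LSN-class probabilities; that choice, the class ordering, and the function names are assumptions for illustration rather than the procedure used to produce Figure 4.

```python
def lsn_score(class_probabilities):
    """Two-class score from four-class outputs.

    Assumes the ordering (No Lens, Lens, LSNe-Ia, LSNe-CC).
    """
    return class_probabilities[2] + class_probabilities[3]


def roc_auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as one half. labels are 1 for LSN and 0 for everything else.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical network outputs for four examples (illustrative only).
outputs = [
    [0.05, 0.05, 0.60, 0.30],  # true LSN
    [0.10, 0.20, 0.30, 0.40],  # true LSN
    [0.70, 0.20, 0.05, 0.05],  # not LSN
    [0.20, 0.60, 0.10, 0.10],  # not LSN
]
labels = [1, 1, 0, 0]
print(roc_auc([lsn_score(o) for o in outputs], labels))
```

The pairwise form is equivalent to integrating the ROC curve and makes the two reference values in the text easy to see: perfectly separated scores give 1.0, and scores that carry no information give 0.5 on average.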
We also estimate the baseline true-positive rate and false-positive rate for LSNe using this technique in the different datasets. When applying the ZipperNet technique to real data, we would expect real data to be used in the training and validation, which would produce more accurate estimates. That being said, we can initially report a LSN true (false) positive rate of 90.2% (29.6%) for DES-wide, 87.0% (21.9%) for DES-deep, 91.5% (9.8%) for LSST-wide, and 89.0% (12.6%) for DESI-DOT. The true-positive rate of 91.5% for LSST-wide is higher than the estimated recovery rates of LSNe for the non-deep-learning approaches mentioned in Section 1. Furthermore, with relatively low false-positive rates, we do not expect the data stream of the LSST to be overwhelmed by other astronomical systems being incorrectly labeled as LSNe.
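For clarity on how such rates follow from a four-class confusion matrix, the sketch below collapses the matrix into LSN versus not-LSN counts. The class ordering, the function name, and the example counts are hypothetical and are not the matrices of Figure 3.

```python
LSN_CLASSES = {2, 3}  # assumed ordering: No Lens, Lens, LSNe-Ia, LSNe-CC


def lsn_rates(confusion):
    """Return (true-positive rate, false-positive rate) for LSN vs. everything else.

    confusion[t][p] is the number of examples of true class t predicted as class p.
    """
    tp = fn = fp = tn = 0
    for true_class, row in enumerate(confusion):
        for predicted_class, count in enumerate(row):
            true_lsn = true_class in LSN_CLASSES
            predicted_lsn = predicted_class in LSN_CLASSES
            if true_lsn and predicted_lsn:
                tp += count   # Ia/CC mix-ups still count as a found LSN
            elif true_lsn:
                fn += count
            elif predicted_lsn:
                fp += count
            else:
                tn += count
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical counts, 250 test examples per class (illustrative only).
example_confusion = [
    [200, 40, 5, 5],
    [30, 200, 10, 10],
    [5, 15, 180, 50],
    [5, 15, 60, 170],
]
tpr, fpr = lsn_rates(example_confusion)
print(round(tpr, 3), round(fpr, 3))
```

Note that confusion between LSNe-Ia and LSNe-CC does not lower the two-class true-positive rate, which is one reason the two-class performance exceeds the four-class accuracy.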
In the two-class problem, we performed additional analysis to interpret the features ZipperNet identified for making classifications. Figure 5 displays examples from the LSST-wide dataset arranged into groups of correctly classified LSNe (true positives), other astronomical systems correctly classified as not LSNe (true negatives), other astronomical systems erroneously labeled as LSNe (false positives), and LSNe that were missed (false negatives). These examples were found to be representative of the general relationship between the properties of objects and the predictions made by ZipperNet. ZipperNet is able to correctly classify LSNe when the Einstein radius is small (0.5-1.0 arcsec) and with foreground stars in the image, both of which would trouble a standalone CNN. Upon close inspection, the true-positive images display galaxies with non-uniform light profiles, hinting at the presence of lensing, but the dominating feature is clearly the large fluctuation in brightness detected by the LSTM. The true negatives show systems with evidence of lensing as well, but this time there is no temporal behavior to indicate the presence of a SN. The false positives contain images of lensing or crowded fields with small, but non-negligible coherent time-varying behavior; they in general contain both spatial and temporal features similar to the LSNe class. Lastly, the false negatives are the most important group to understand due to the rarity of LSNe. In some cases, a star in the aperture used to extract the scaled brightnesses can be bright enough to obscure the change in brightness of the LSNe. Similarly, if the source galaxy is distant and the alignment of the lensing system does not produce sufficient magnification, the imaging may not be deep enough to see a lensed source or a background SN. Both of these cases of false negatives demonstrate difficult-to-detect systems with LSNe and point toward a physically motivated selection as opposed to inaccurate feature representations learned by ZipperNet. In general, we find that ZipperNet learns features we would a priori expect.

DISCUSSION
In general, ZipperNet can identify SL systems, identify LSNe within those systems, and classify the LSNe as LSNe-Ia or LSNe-CC. The variance in the performance across different datasets indicates a correspondence between data quality and LSN identification power. The DES-deep and DESI-DOT datasets have slightly higher cadences (more samples within a time series) than the LSST-wide dataset, but the ZipperNet accuracy was considerably higher for the LSST-wide dataset, which has higher seeing (lower image quality). The DES-deep dataset emulates that of the DES SNIa observing program, in which exposures were collected on nights with slightly poorer (higher) seeing to optimize the seeing of the exposures used for weak lensing measurements. In DES data processing and SNIa analysis, the difference-imaging (Kessler et al. 2015) and scene-modeling (Brout et al. 2019) techniques measure SNe when seeing is up to 2 arcsec. We bypassed those time-consuming techniques with our circular aperture extraction method for the lightcurve brightnesses. Without those techniques, the performance of ZipperNet is degraded, because the higher seeing in the DES-deep dataset obscures image patterns that would otherwise be detectable in LSST-wide and DESI-DOT datasets.
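To illustrate the kind of circular aperture extraction mentioned above, the following is a minimal sketch rather than the extraction code used in this work. It assumes a stack of single-band image cutouts centered on the object; the function name `aperture_lightcurve`, the array layout, and the aperture radius in pixels are all hypothetical choices made for the example, and the sketch does not apply any brightness scaling.

```python
import numpy as np


def aperture_lightcurve(cutouts, radius_pix):
    """Sum pixel values inside a centered circular aperture, per epoch.

    cutouts: array of shape (n_epochs, height, width) for a single band
             (an assumed layout, not taken from the paper).
    radius_pix: aperture radius in pixels (hypothetical parameter).
    Returns an array of shape (n_epochs,) with one brightness per epoch.
    """
    cutouts = np.asarray(cutouts, dtype=float)
    _, height, width = cutouts.shape

    # Pixel-coordinate grids and the geometric center of the cutout.
    yy, xx = np.mgrid[:height, :width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0

    # Boolean mask selecting pixels inside the circular aperture.
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius_pix ** 2

    # Apply the same mask to every epoch and sum the selected pixels.
    return cutouts[:, mask].sum(axis=1)
```

Calling this once per band yields band-wise lightcurves directly from the image stack, without deblending or difference imaging. The trade-off is that any other source inside the aperture contributes to the summed brightness.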
A second data quality factor in ZipperNet's predictive power is the cadence of the observations. The DES-wide dataset has excellent depth and seeing, but a low-density time sampling: ZipperNet fails to perform at similar levels to the other datasets. With DESI-DOT, which has high cadence and depth and seeing similar to those of DES-wide, ZipperNet was able to learn the underlying features of the four classes extremely well. Overall, we find that together seeing of 1.2 arcsec (corresponding to typical upper limits on Einstein radii of galaxy-scale lenses) and a cadence with intra-band spacing of 15 nights (corresponding to roughly the timescale for SNe evolution) can also improve performance.
Nevertheless, even when ZipperNet had confusion between LSN-Ia and LSN-CC (primarily caused by deficient cadence of the dataset) or confusion between the No Lens and Lens classes (primarily caused by seeing greater than typical Einstein radii), ZipperNet performs extremely well as an LSNe finder. Reducing the classification to two classes, LSN and everything else, shows high performance for the LSST-wide, DES-deep, and DESI-DOT datasets (Figure 4). The ZipperNet architecture as an LSNe finder in this setting does not suffer from the dependence on seeing observed in the four-class problem and is slightly less dependent on the cadence. By balancing the image and temporal inputs, ZipperNet finds weightings and combinations of the two data products optimal for LSNe detection. We interpret this result as a demonstration of a key strength of the ZipperNet architecture.
Finally, we compare the ZipperNet architecture to a standalone "RNN," a standalone "CNN," and a combination of those two standalone networks ("COMBO") in the context of the two-class problem and the LSST-wide dataset. The standalone networks are identical in structure to the corresponding constituents of ZipperNet and trained under the same conditions. The combination classifier does not connect the feature representations of the standalone networks internally: it merely uses the outputs from each of the standalone classifiers, requiring them to individually identify an object as an LSN. Therefore, the COMBO classifier reflects a simplistic approach to LSN identification in astronomical surveys: first, a CNN is used to identify a SL system, and then a transient detection algorithm is used to search for LSNe. We expect that the standalone CNN will identify systems with lensing, while the standalone RNN will identify systems with SNe: by requiring both a lensing classification and a SN classification, we establish a baseline for the performance of deep learning architectures that are not connected internally in ways similar to ZipperNet's design. ROC curves for the different classifiers are shown in Figure 6. The CNN performs well and mostly identifies all systems with lensing or point sources in galaxies, but this performance can still result in high numbers of false positives (due to the rarity of LSNe) in practical applications. The RNN is effectively a SN identifier in this context: the vast majority of the false positives come from classifying unlensed SNe (labels 16 and 17) as LSNe. While the ROC AUC is high for the RNN, it alone could not be used as an LSN finder due to the much larger volumetric rate of SNe compared to LSNe.
As shown by this test, ZipperNet outperforms both constituent networks (CNN and RNN) and more importantly the simplistic combination of its constituent networks' outputs (COMBO), indicating that connecting the feature representations of the RNN and CNN internally yields better overall performance and is a worthwhile deep learning strategy for the problem of LSNe detection. The ZipperNet architecture shows the highest performance.
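Two ideas in this comparison can be made concrete with a short sketch: the COMBO rule, which requires both standalone classifiers to flag an object, and the effect of class rarity on the purity of a candidate list. The code below is an illustration under stated assumptions, not the implementation used in this work. The function names, the 0.5 threshold, and the rates in the usage note are hypothetical and are not measurements from the paper.

```python
import numpy as np


def combo_predict(p_cnn, p_rnn, threshold=0.5):
    """COMBO-style decision: flag an object only if both standalone
    classifiers individually exceed the (hypothetical) threshold."""
    p_cnn = np.asarray(p_cnn, dtype=float)
    p_rnn = np.asarray(p_rnn, dtype=float)
    return (p_cnn >= threshold) & (p_rnn >= threshold)


def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected purity of the flagged sample via Bayes' rule.

    tpr: true-positive rate of the classifier.
    fpr: false-positive rate of the classifier.
    prevalence: fraction of all objects that are truly LSNe.
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)
```

For example, `precision_at_prevalence(0.9, 0.01, 0.001)` evaluates a classifier with purely illustrative rates on a population in which one object in a thousand is an LSN. Most flagged objects are then false positives, which is why a high ROC AUC alone does not guarantee a usable candidate list when the target class is rare.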
The motivating challenge for ZipperNet is to facilitate the identification of LSNe during the Rubin Observatory's LSST, which is expected to have a very high data stream rate compared to previous large-scale surveys. The high data stream rate is not a challenge for ZipperNet because classifications can be parallelized and preprocessing is economical (because we directly extract band-wise lightcurves from images without the need for deblending or expensive photometric analysis). The role of a tool like ZipperNet in this setting would be to process the data stream and report a list of candidates ordered by their probability of being an LSN. At present, our focus is the binary classification produced by ZipperNet, with confidence quantified by the measured true and false positive rates. However, a small amount of additional calibration could straightforwardly map the ZipperNet outputs to physical probabilities to facilitate the generation of candidate lists.
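As a rough illustration of the candidate-list role described above, the sketch below orders objects by classifier score. It assumes one score per object is already available; the function name `rank_candidates` and the `top_k` parameter are hypothetical. The scores are treated only as a ranking statistic, not as calibrated probabilities.

```python
import numpy as np


def rank_candidates(object_ids, scores, top_k=None):
    """Return (object_id, score) pairs sorted from highest to lowest score.

    object_ids: sequence of identifiers, one per object.
    scores: classifier outputs, same length as object_ids.
    top_k: optional cap on the length of the returned list.
    """
    scores = np.asarray(scores, dtype=float)

    # A stable sort keeps the input order for objects with equal scores.
    order = np.argsort(-scores, kind="stable")
    if top_k is not None:
        order = order[:top_k]
    return [(object_ids[i], float(scores[i])) for i in order]
```

Because each object is scored independently, the scoring step can be split across workers and the ranked lists merged afterwards.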
The remaining test of ZipperNet for Rubin Observatory main survey operations is detection at various epochs into the SN light curves for community alerts: How early into the light curve can ZipperNet find an LSN? We present in Figure 7 the LSN true-positive rates as functions of the number of lightcurve epochs after the first detection. In this context, an r-band magnitude brighter than 24.4 mag serves as the first detection, which is motivated by the LSST science requirements (The LSST Dark Energy Science Collaboration et al. 2018). We find that even when only one or two epochs are present in the lightcurve after the first detection, the true-positive rate is > 75%, and it improves as more epochs are added. After five post-detection observing epochs in the light curve, the LSN identification has a true-positive rate > 95%. Furthermore, even LSNe fainter than the 5σ limiting magnitude are identifiable with a true-positive rate of ∼ 80%, indicating high performance where other detection methods would lose sensitivity: both proposed methods discussed in Section 1 rely on SNe being brighter than the detection threshold and realized as individual objects. For these faint LSNe, ZipperNet likely finds success by identifying systems with image-based strong lensing features and then noticing small changes in brightness. This result supports the expectation that ZipperNet will perform well as a real-time LSN identifier in Rubin Observatory main survey data.
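The bookkeeping behind a curve of true-positive rate versus epochs after first detection can be sketched as follows. This is an assumed reconstruction for illustration, not the analysis code behind Figure 7. Only the 24.4 mag r-band threshold comes from the text; the function names and the convention of returning -1 for never-detected lightcurves are hypothetical.

```python
import numpy as np

DETECTION_MAG = 24.4  # r-band first-detection threshold quoted in the text


def epochs_after_first_detection(r_mags):
    """Count the epochs that follow the first r-band detection.

    r_mags: r-band magnitudes in time order. Brighter means a smaller
            magnitude, so a detection is r_mag < DETECTION_MAG.
    Returns -1 if the object never crosses the threshold (an assumed
    convention for this sketch).
    """
    r_mags = np.asarray(r_mags, dtype=float)
    detected = np.flatnonzero(r_mags < DETECTION_MAG)
    if detected.size == 0:
        return -1
    return len(r_mags) - 1 - int(detected[0])


def tpr_by_epoch(n_epochs_after, predicted_lsn):
    """True-positive rate among true LSNe, grouped by epochs after detection.

    n_epochs_after: one epoch count per true LSN.
    predicted_lsn: whether the classifier flagged each one as an LSN.
    """
    counts = np.asarray(n_epochs_after)
    flagged = np.asarray(predicted_lsn, dtype=bool)
    return {int(k): float(flagged[counts == k].mean()) for k in np.unique(counts)}
```

Truncating each simulated lightcurve at successive epochs and re-running the classifier on the truncated input would populate `n_epochs_after` and `predicted_lsn` for such a curve.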
In summary, ZipperNet introduces a new deep learning architecture for lensed transient detection where spatial and temporal features are treated on the same footing in a single framework. This balanced framework produces a ROC curve AUC of 0.97 when identifying LSNe and 79% accuracy at outright identification of LSNe-Ia in LSST wide-field data even in the early phases of the lightcurve. Therefore, we expect ZipperNet to play a large role in the rapid identification of LSNe for spectroscopic characterization and time-delay cosmography during the main survey operations of the Rubin Observatory.

CONCLUSION
Detecting LSNe soon after explosion will be an important goal for the Vera C. Rubin Observatory and other high-cadence optical surveys. In this work, we introduced ZipperNet, a deep learning tool for LSNe identification. ZipperNet combines a convolutional neural network with a recurrent neural network to simultaneously process spatial and temporal data. We utilized deeplenstronomy to simulate four distinct optical survey datasets for testing. ZipperNet performed well when the cadence and seeing were LSST-like or better. Specifically, ZipperNet was able to identify LSNe in LSST-like data with a ROC AUC of 0.97 and distinguish LSNe-Ia from LSNe-CC in LSST-like data with 79% accuracy. With ZipperNet, high-cadence optical surveys can accurately identify both LSNe-CC and LSNe-Ia early into their light curves, and furthermore distinguish between the two transient classes. Thus, we expect ZipperNet to be a powerful tool in identifying lensing systems for time-delay cosmography measurements during the Rubin Observatory main survey operations.