Convolutional Neural Networks for Searching Superflares from Pixel-level Data of the Transiting Exoplanet Survey Satellite

In this work, six convolutional neural networks (CNNs) have been trained on different feature images and arrays from a database of 15,638 superflare candidates on solar-type stars, collected from three years of observations by the Transiting Exoplanet Survey Satellite ({\em TESS}). These networks replace manual visual inspection, which has in recent years been the direct way to search for superflares and exclude false positive events. Unlike other methods, which search for superflare signals using stellar light curves only, we identify superflares through {\em TESS} pixel-level data, with a lower risk of admitting false positive events, and thus provide more reliable identifications for statistical analysis. The evaluated accuracy of each network is around 95.57\%. After applying ensemble learning to these networks, the stacking method raises the accuracy to 97.62\% with a 100\% classification rate, and the voting method raises the accuracy to 99.42\% with a relatively lower classification rate of 92.19\%. We find that superflare candidates with short durations and low peak amplitudes have lower identification precision, as their superflare features are harder to recognize. The database of 71,732 solar-type stars and 15,638 superflare candidates from {\em TESS}, together with the corresponding feature images and arrays and the CNNs trained in this work, is publicly available.


INTRODUCTION
As time-domain astronomy develops, a large amount of observational data needs to be scientifically processed. Astronomers have applied data-mining methods, including machine learning, to the processing of observational data. For example, a convolutional neural network (CNN) was used to search for strong-lensing systems (He et al. 2020). Machine learning and recurrent neural networks (RNNs) were used for detecting exoplanet transits (e.g., Rao et al. 2021; Cui et al. 2022; Ofman et al. 2022). Zhang et al. (2021) used random forests for the classification of XMM Data Release (DR) 9 sources into four classes. Georgoulis et al. (2021) used deep-learning methods for solar flare forecasting. An unsupervised clustering method was used for classifying fast radio bursts (FRBs; Chen et al. 2022), and CNNs were used for searching for FRBs in radio data (e.g., Agarwal et al. 2020; Yang et al. 2021). Machine-learning methods have already been applied to enhance the efficiency of selecting stellar flares from Kepler data (e.g., Vida & Roettenbacher 2018).
Superflares are stellar bursting phenomena with durations and energies larger than those of typical solar flares (Hudson 2021). The total energy of superflares on solar-type stars can exceed $10^{33}$ erg. Superflares have severe impacts on the space weather around the host star and its exoplanets (e.g., Segura et al. 2010; Airapetian et al. 2016; Lingam & Loeb 2017; Atri 2017). A single solar superflare could severely damage the habitability of the Earth. The rapid increases of $^{14}$C in tree rings may be attributed to solar superflares (Miyake et al. 2012, 2013; Park et al. 2017; Wang et al. 2017). Fortunately, it has been predicted that the Sun generates superflares with low probability (Shibata et al. 2013). Given the limitations of current observational techniques, detailed observations of distant stars remain impossible, so in recent years astronomers have turned to space telescopes to study stellar superflares. Many works have used Kepler data to build the connection between solar flares and stellar superflares (Maehara et al. 2012; Shibayama et al. 2013; Maehara et al. 2015; Notsu et al. 2019; Okamoto et al. 2021).
TESS began observing on 2018 August 7 and covers 85% of the sky (Ricker et al. 2015), while Kepler covered only 0.25% of the sky (Van Cleve et al. 2016). Tu et al. (2020, 2021) used TESS data and Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) spectroscopic data to strengthen the view that superflares can be generated by isolated stars (e.g., Shibata et al. 2013), instead of through star-planet interactions (e.g., Rubenstein & Schaefer 2000; Ip et al. 2004). Comparing the statistical results from TESS with those from Kepler, it has been found that TESS is less sensitive than Kepler for detecting relatively weak stellar flares (Doyle et al. 2019; Tu et al. 2020). The reason is that the bandpass of TESS, 600-1000 nm (Ricker et al. 2015), is redder than that of Kepler, 420-900 nm (Van Cleve et al. 2016).
Superflares on other stars give us a chance to study stellar activity statistically. The most crucial step is collecting enough true superflare events. From the Kepler era to the TESS era, superflare candidates have been selected through automatic algorithms followed by visual inspection (e.g., Maehara et al. 2012; Tu et al. 2020). Manual visual inspection requires a long time to verify true superflare events; therefore, there is an urgent demand for an automatic and efficient method to replace it. One practical choice is machine learning. Vida & Roettenbacher (2018) used random sample consensus (a traditional data-mining method) to find flares in Kepler data. Feinstein et al. (2020) used a CNN to search for stellar flares in TESS data, and their network was used in subsequent research (Feinstein et al. 2021). Vida et al. (2021) used an RNN to search for stellar flares, which can be applied to both Kepler and TESS data.
The above works, including the CNN and RNN methods, enhance the efficiency and automation of searching for stellar flares. These methods surely improve the accuracy of the collected superflare samples compared to traditional light-curve outlier-searching algorithms, such as fitting the quiescent variability (e.g., Walkowicz et al. 2011; Wu et al. 2015; Davenport 2016; He et al. 2018), the iterative σ-clipping approach, and the analysis of flux changes between all pairs of consecutive points (e.g., Maehara et al. 2012, 2015; Shibayama et al. 2013; Okamoto et al. 2021). However, these works (Vida & Roettenbacher 2018; Feinstein et al. 2020; Vida et al. 2021) used only stellar light curves to search for stellar flare events. According to our experience with the visual inspection of superflare candidates from pixel-level data, a number of candidates show superflare features in their light curves but no superflare signal in the pixel-level data. Obviously, such candidates should not be treated as convincing superflare events; their superflare-like shapes in the stellar light curves may be caused by CCD noise, pipeline errors, or other disturbing factors. Visual inspection of pixel-level data is the most direct way of excluding contamination from false-positive events. Naturally, we pursue a machine-learning method based on pixel-level data, which excludes as many false events as possible, instead of one based on stellar light curves only. A database of true superflares can then be constructed by machine-learning methods, and a statistical study of stellar activity based on this database is more reliable and scientifically valuable. So, in this work, we attempt to use deep-learning methods, namely CNNs, together with ensemble-learning methods, including stacking and voting, to search for superflare events in the pixel-level data of TESS.
We expect that CNNs can more accurately search for true superflare events based on pixel-level data.
This paper is organized as follows. In Section 2, we briefly describe the process of selecting solar-type stars and superflare candidates. All candidates passing our improved data pipeline have been visually inspected and classified into three classes. In Section 3, we present the construction of the data sets and the six CNNs, each structured according to one of the six kinds of feature data. Ensemble learning, including the stacking CNN and voting, is performed in Section 4. The results of the six CNNs, comparisons among them, and some discussion are given in Section 5. A summary is presented in Section 6.

SOLAR-TYPE STARS AND SUPERFLARE CANDIDATES
In this section, we briefly illustrate how we select solar-type stars and search for superflare candidates for training the CNNs. For more details, one could refer to our previous works (Tu et al. 2020, 2021). Differences and improvements are specifically discussed in the following.

Selection of solar-type stars
We select solar-type stars according to the following criteria: (1) stellar effective temperature ($T_{\rm eff}$), 5100 K $\leq T_{\rm eff} <$ 6000 K, and (2) surface gravity ($\log g$), $\log g >$ 4.0 (e.g., Shibayama et al. 2013; Maehara et al. 2015; Tu et al. 2020, 2021). These criteria were used in Maehara et al. (2012). As the Sun is a G-type dwarf star, the effective temperature range (5100 K $\leq T_{\rm eff} <$ 6000 K) ensures that the selected targets are G-type stars, and the surface gravity cut ($\log g >$ 4.0) ensures they are dwarf stars instead of giants or sub-giants. Maehara et al. (2012) also defined Sun-like stars, a subset of solar-type stars with stricter criteria: effective temperature within 5600 K $\leq T_{\rm eff} <$ 6000 K and stellar period longer than 10 days. Stars whose properties are even more similar to those of the Sun can be defined as solar analogs, which also belong to the subset of solar-type stars. So, the solar-type stars selected in this work are broadly similar to the Sun. $\log g$ and $T_{\rm eff}$ are both gathered from the TESS input catalog (TIC v8; Stassun et al. 2019). As TESS has a pixel scale of 21″, it may not be able to distinguish contamination from the main target within one pixel. The Hipparcos-2 catalog (van Leeuwen 2007) and Gaia Early Data Release (EDR) 3 (Gaia Collaboration et al. 2020) are used to flag those solar-type stars that are possible binary systems or are contaminated by brighter stars within a 21″ distance. Gaia EDR3 gives $T_{\rm eff}$ estimates, with some taken from Gaia DR2 (Gaia Collaboration et al. 2018).
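The two cuts above amount to a simple row filter over a stellar catalog. A minimal sketch in Python, where the column names (`tic_id`, `teff`, `logg`) are illustrative rather than the actual TIC v8 schema:

```python
# Sketch of the solar-type / Sun-like selection cuts described above.
# Column names are illustrative, not the real TIC v8 schema.

def is_solar_type(teff, logg):
    """Criterion (1): 5100 K <= Teff < 6000 K; criterion (2): log g > 4.0."""
    return 5100.0 <= teff < 6000.0 and logg > 4.0

def is_sun_like(teff, period_days):
    """Stricter Sun-like subset: 5600 K <= Teff < 6000 K and P > 10 d."""
    return 5600.0 <= teff < 6000.0 and period_days > 10.0

catalog = [
    {"tic_id": 1, "teff": 5500.0, "logg": 4.4},  # G-type dwarf: kept
    {"tic_id": 2, "teff": 5800.0, "logg": 3.5},  # sub-giant: rejected by log g
    {"tic_id": 3, "teff": 6500.0, "logg": 4.3},  # too hot: rejected by Teff
]
solar_type = [s for s in catalog if is_solar_type(s["teff"], s["logg"])]
```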
Specifically, of all 71,732 solar-type stars, 277 stars are possible binary systems according to the Hipparcos-2 catalog. After cross-matching with the Gaia EDR3 catalog, 753 stars have brighter stars within a 21″ distance, which cannot be distinguished by TESS, and 739 stars have brighter stars at distances greater than 21″ but less than 42″. Unlike our previous work (Tu et al. 2021), where we excluded stars that are binary systems or contaminated by brighter stars, here we flag these stars in our database. This is because, instead of judging whether superflares come purely from solar-type stars, this work focuses on selecting superflare candidates whose pixel-level light curves show superflare shapes or features. This also serves the purpose of collecting as much data as possible in order to obtain better CNN performance. Besides, 5435 stars may contain M-type star candidates within the 42″ range. Jackman et al. (2021) found that flares from M-type stars neighboring TESS targets account for 5.8% ± 1.0% of flares in TESS data. These stars are recommended to be carefully inspected in the catalog. As Gaia EDR3 does not include $\log g$, and Section 3.1 of Tu et al. (2021) demonstrated that observations of solar-type stars may not be influenced by M-type stars, these stars are not excluded but marked with flags. In total, 71,732 solar-type stars from three years of TESS observations are collected.
We use the pre-search data conditioned (PDC) light curve of every single star for period estimation and photometric variability calculation. The periods of these stars are estimated through the Lomb-Scargle method (Lomb 1976; Scargle 1982). We also set the false-alarm probability to $10^{-4}$ to assess the periodicity of a star (e.g., Cui et al. 2019). The photometric variability ($R_{\rm var}$) is derived by
$$R_{\rm var} = F_{95\%} - F_{5\%}, \quad (1)$$
where $F_{95\%}$ and $F_{5\%}$ stand for the upper 95% and lower 5% of the ranked normalized flux of the stellar light curve (Basri et al. 2010). The first- and third-year observations of TESS both cover the southern hemisphere of the sky. Stars in this area are observed multiple times, so their period and $R_{\rm var}$ are calculated based on all observations from TESS. Note that combining all observations from TESS is accomplished by stitching light curves from different sectors of TESS data: we take the PDC light curve of every single star from each sector, normalize the flux within the sector (about 27 days), and then stitch the normalized fluxes from different sectors. Due to the unique observational mode of TESS, some stars are only observed for 27 days. Even when we stitch light curves from multiple sectors, the estimation of longer stellar periods remains hard due to the limitations of TESS (Tu et al. 2021). Other methods should be considered for determining the stellar periodicities of TESS stars, for instance, measuring the rotational velocity from the stellar spectrum (e.g., Reiners & Basri 2008; Reiners et al. 2012), or using the autocorrelation function (e.g., McQuillan et al. 2014). Specific information on the 71,732 solar-type stars can be found in Table 1.
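The $R_{\rm var}$ statistic reduces to taking percentiles of the ranked normalized flux. A minimal numpy sketch (the Lomb-Scargle period search itself, e.g. via astropy's `LombScargle`, is omitted; the sinusoid below is only a toy light curve):

```python
import numpy as np

def r_var(flux):
    """R_var = F_95% - F_5% of the ranked normalized flux (Basri et al. 2010)."""
    f = np.asarray(flux, dtype=float)
    return np.percentile(f, 95.0) - np.percentile(f, 5.0)

# Toy example: rotational modulation with a 5-day period and 1% amplitude,
# sampled over roughly one 27-day TESS sector.
t = np.linspace(0.0, 25.0, 2000)
flux = 1.0 + 0.01 * np.sin(2.0 * np.pi * t / 5.0)
rv = r_var(flux)   # close to the analytic value for a 1%-amplitude sinusoid
```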

Superflare candidates
We select superflare candidates through the same method as in Tu et al. (2020, 2021). Each solar-type star's PDC light curve from TESS has been checked by the automatic algorithm; the updated algorithm has been used to process the TESS second-year observations (Tu et al. 2021). Briefly, outliers in the light curve are selected by the following equation,
$$\Delta F_i = s\sqrt{\left|(F_i - F_{i-n-1})(F_{i+1} - F_{i-n})\right|}, \quad (2)$$
where $F$ is the normalized flux of one sector; $s = 1$ when $(F_i - F_{i-n-1}) > 0$ and $(F_{i+1} - F_{i-n}) > 0$ are both satisfied, and otherwise $s = -1$. As $n$ describes the rising time of flares (Tu et al. 2020), we set $n = 2$ and $n = 5$ to search for outliers on the short and long rising phases of light curves, respectively. If $n = 2$, we cut the light curve 0.01 to 0.05 days before, and 0.05 to 0.10 days after, the peak flux for further analysis. Correspondingly, if $n = 5$, we use the light curve in intervals of 0.03 to 0.15 days before, and 0.15 to 0.25 days after, the peak flux. A quadratic function $F_q(t)$ is used to fit these light curves, in order to remove the long-term variability of the star. We obtain the light curve that keeps the flare signal through the following equation,
$$F_{\rm flare}(t) = F(t) - F_q(t). \quad (3)$$
The first and last points whose photometric errors $F_{\rm error}(t)$ satisfy the following equation are set to be the start and end times of a flare,
$$3 \times F_{\rm error}(t) < F_{\rm flare}(t). \quad (4)$$
$F_{\rm flare}(t_{\rm peak})$ at the peak point stands for the peak amplitude of a flare. The stellar bolometric luminosity is calculated through the Stefan-Boltzmann law,
$$L_* = 4\pi R_*^2 \sigma_{\rm SB} T_{\rm eff}^4, \quad (5)$$
where the stellar radius $R_*$ and surface temperature $T_{\rm eff}$ are taken from the TIC v8 catalog. We estimate the superflare energy by the equation
$$E_{\rm flare} = L_* \int_{t_{\rm start}}^{t_{\rm end}} F_{\rm flare}(t)\,dt, \quad (6)$$
where the stellar luminosity and flare flux are calculated in Equations (5) and (3), and the integral runs from the start to the end time of a flare. In total, 15,638 flare candidates are selected from all solar-type stars in the three years of TESS observations. They are all visualized through images as shown in Figure 1.
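The selection steps above can be sketched end-to-end: the signed-difference outlier statistic, the quadratic detrending, and the Stefan-Boltzmann energy estimate. This is an illustrative reimplementation, not the authors' pipeline; in particular, the 3σ outlier threshold relative to the scatter of the difference statistic is our assumption:

```python
import numpy as np

SIGMA_SB = 5.670374419e-5   # Stefan-Boltzmann constant, erg cm^-2 s^-1 K^-4

def flare_outliers(F, n=2, clip=3.0):
    """Signed geometric mean of two consecutive flux differences; points whose
    statistic exceeds `clip` standard deviations are flare-candidate outliers
    (the threshold choice is an assumption of this sketch)."""
    F = np.asarray(F, dtype=float)
    d = np.zeros_like(F)
    for i in range(n + 1, len(F) - 1):
        d1 = F[i] - F[i - n - 1]
        d2 = F[i + 1] - F[i - n]
        s = 1.0 if (d1 > 0 and d2 > 0) else -1.0
        d[i] = s * np.sqrt(abs(d1 * d2))
    return np.where(d > clip * np.std(d))[0]

def flare_flux(t, F, quiet_mask):
    """Subtract a quadratic fit F_q(t) of the quiescent variability."""
    coeff = np.polyfit(t[quiet_mask], F[quiet_mask], 2)
    return F - np.polyval(coeff, t)

def flare_energy(t_sec, f_flare, r_star_cm, teff):
    """E = L_* * integral(F_flare dt), with L_* = 4 pi R_*^2 sigma T_eff^4."""
    f = np.clip(np.asarray(f_flare, dtype=float), 0.0, None)
    lum = 4.0 * np.pi * r_star_cm**2 * SIGMA_SB * teff**4
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t_sec))  # trapezoid rule
    return lum * integral
```

For a Sun-like star ($R_* \approx 6.957\times10^{10}$ cm, $T_{\rm eff} \approx 5772$ K), a flare with relative amplitude 0.01 lasting 600 s yields an energy of order $10^{34}$ erg, i.e., in the superflare regime.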
It should be noted that we standardize the time intervals, which span from 0.05 days before to 0.1 days after the flare peak time stamp; the peak time stamp thus sits at the 1/3 position of the light curve. This ensures that the slowly decaying phase of superflares is presented in the images. Panels (c) and (d) of Figure 1 show the star's original pixel-level images before the flare and at the flare peak, respectively. The blue frames encircle the aperture masks of the TESS pipeline, which are the same in both panels. Panels 1(e) and 1(f) show pixel-level light curves with normalized and unnormalized flux, respectively. Through visual inspection of this pixel-level information, presented as in Figure 1, we classify all candidates into three classes: the gold class, the silver class, and the none class.
• The gold class collects superflare candidates that clearly show flare features in their pixel-level data. The panels in row 2 of Figure 2 are the same as those in Figure 1, from which this candidate is convincingly classified as the gold class.
• The silver class gathers superflare candidates that do not show flare features as perfectly as those in the gold class, although they may still show flare signals in the pixel-level data of the normalized flux. For instance, pixels [5,5] and [6,5] (in [horizontal index, vertical index] notation) in row 1 of Figure 2 show pinnacle-shaped light curves in both the normalized- and unnormalized-flux panels, but the pixels around them show higher noise than those in the gold class.
• The none class collects the remaining candidates, which are false events including planet transits and cosmic-ray signals. Row 3 of Figure 2 gives an example of the none class: we hardly find pinnacle-shaped light curves in the pixel-level data, and even though a few light curves with unnormalized fluxes show a hint of flare-shaped features, unrelated noise strongly dominates.
To be clear, the candidates in the gold class and silver class are both true superflares. The reason we split these superflares into two categories is that they have different features. In particular, for data in the silver class, bright pixels (e.g., pixels [5,5] and [6,5]) in the pixel-level light-curve panels of Figure 2 show pinnacle-shaped light curves (compare rows 1 and 2 of this figure). These pixels should be inspected more cautiously, because they show superflare signals but not as perfectly as those in the gold class. All 15,638 candidates are classified into these three categories by manual visual inspection. Detailed information can be found in Table 2. For the convenience of users who are only interested in superflare data, the duration, peak luminosity, and energy of superflares are collected in this table. This information is concealed for a candidate if at least one of the following criteria is satisfied: (1) the host star carries the flag HB or GB21 in Table 1 after cross-matching with the Hipparcos-2 and Gaia EDR3 catalogs; (2) the candidate's rising phase is longer than its decaying phase; (3) the candidate is visually inspected and classified into the none class. A column in Table 2 lists all the file names corresponding to the feature images and arrays used for CNN training, which will be described in Section 3.1.

DEEP LEARNING
This section describes the process of obtaining three-dimensional arrays and feature images for training the CNNs, and how we set up the training, validation, and test data sets. The specific structures of the six networks are also shown. Two ensemble-learning methods, stacking and voting, are used to enhance the performance of the CNNs.

Three-dimensional arrays and feature images
This work focuses on automatically identifying superflares through the pixel-level data of TESS. Figures 1 and 2 show pixel-level light curves, according to which we visually classified the superflare candidates into three classes.
In particular, we transform the pixel-level data into arrays and feature images for further CNN training. The first data format is a three-dimensional array, as visualized in Figure 3(a). One dimension of this array is time, and the other two are pixel indices. The array contains the original pixel-level information over the whole flare light-curve interval; one may think of these arrays as short film clips that record changes in the pixel-level image over just 0.15 days. The image from the TESS data contains 11×11 pixels with a 2-minute cadence, so these three-dimensional arrays are of size 11×11×108. These arrays contain the original observations of flare candidates from TESS.
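Cutting such a cube out of a target-pixel time series is straightforward. A sketch, where `tpf_flux` is assumed to be an array of shape (time, 11, 11) at 2-minute cadence:

```python
import numpy as np

def td_array(tpf_flux, peak_index, cadence_min=2.0, pre_days=0.05, post_days=0.10):
    """Cut the 0.15-day window around the flare peak (0.05 d before, 0.10 d
    after, so the peak sits at the 1/3 position) and reorder to 11x11xT."""
    n_pre = int(round(pre_days * 24 * 60 / cadence_min))    # 36 cadences
    n_post = int(round(post_days * 24 * 60 / cadence_min))  # 72 cadences
    cube = tpf_flux[peak_index - n_pre : peak_index + n_post]
    return np.transpose(cube, (1, 2, 0))                    # (11, 11, 108)
```

With the 2-minute cadence, 0.05 + 0.10 days correspond to 36 + 72 = 108 cadences, matching the 11×11×108 size quoted above.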
Then, we transform the array into the pixel-level light curves shown in Figure 3(b), which are exactly the same as those shown in panels (e) and (f) of Figure 1. In this feature image, we lay out 11×11 squares, corresponding to the 11×11 pixels of the TESS pixel-image data; this procedure reduces the three-dimensional data to two-dimensional images. Each square contains the light curve of the corresponding pixel, as shown in Figure 3(c). In order to reduce the complexity of the subsequent CNNs, we place the peak time of each pixel-level light curve at the 1/3 position of the time axis. The blue frames mark the aperture pixels masked by the TESS pipeline. To discard color information, which is entirely unrelated to superflare searching, the feature image is converted to a gray-scale image with just one channel (traditional color images have three RGB channels), as shown in Figure 3(d). To emphasize the pinnacle shape of superflare light curves, the areas under the light curves are filled. We use white for light curves from the aperture pixels masked by TESS, so that they are distinguished from the other, ordinary light curves, which are in gray.
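Rendering such a gray-scale feature image can be done directly with numpy, without a plotting library. A sketch: each pixel's light curve is rasterized as a filled silhouette in its own square, with aperture pixels drawn in white and the rest in gray (the cell size and gray levels are our illustrative choices):

```python
import numpy as np

def render_pl_image(cube, aperture_mask, cell=64):
    """Rasterize an (11, 11, T) cube into one gray-scale image of filled
    light-curve silhouettes; 11 * 64 = 704 matches the network input size."""
    ny, nx, T = cube.shape
    img = np.zeros((ny * cell, nx * cell))
    xs = np.linspace(0, cell - 1, T).astype(int)
    for j in range(ny):
        for i in range(nx):
            lc = cube[j, i]
            h = (lc - lc.min()) / (lc.max() - lc.min() + 1e-12) * (cell - 1)
            level = 1.0 if aperture_mask[j, i] else 0.5   # white vs. gray
            block = img[j * cell:(j + 1) * cell, i * cell:(i + 1) * cell]
            for x, height in zip(xs, h):
                block[cell - 1 - int(height):, x] = level  # fill under the curve
    return img
```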
Further, we line up the sequences of all 11×11 pixel light curves and show the flux in gray scale, as in Figure 3(e). Each line of this feature image stands for the corresponding pixel, and from left to right it follows the time axis of the flare light curve. For example, in the image, the line of pixel No. 60 suddenly brightens and then gradually darkens, reflecting the rapid rise and slow decay of the light curve.
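This lineup-sequence image is just the flux cube reshaped to (121, time) with per-row scaling; a minimal sketch:

```python
import numpy as np

def render_ls_image(cube):
    """Line up all 11 x 11 = 121 pixel light curves as rows of a gray-scale
    image; a flare shows up as a row that suddenly brightens and slowly fades.
    Each row is min-max scaled to [0, 1]."""
    ny, nx, T = cube.shape
    rows = cube.reshape(ny * nx, T)
    lo = rows.min(axis=1, keepdims=True)
    hi = rows.max(axis=1, keepdims=True)
    return (rows - lo) / (hi - lo + 1e-12)
```

A flare injected into pixel [5,5] lands in row 5×11 + 5 = 60 of the image, producing the brightening-then-fading stripe described above.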
As both the unnormalized and normalized fluxes are derived, we have two versions each of the three-dimensional arrays, the pixel-level images, and the lineup-sequence images. These six kinds of data are visualized in Figure 4, and we construct six CNNs, one for each of them. All six feature images and arrays are derived from the same superflare candidate; in other words, we intend to use the six CNNs to classify a single superflare candidate, because these feature images and arrays describe the candidate from different angles, even though they come from the same pixel-level information from TESS.
• Three-dimensional arrays (TD arrays) are basically original records of TESS observations but without any details on the TESS aperture masks.
• Pixel-level images (PL images) carry information about the TESS aperture masks and show the pixel-level PSF shape of a true superflare, i.e., the relationship between the stellar center pixel and the nearby pixels.
• The images of the lineup-sequence light curves (LS images) reveal the contrast of dark and bright most clearly, delicately portraying the rapid-rising and slow-decaying phases of superflare light curves. However, the LS images lack any information about the TESS aperture masks.
Data augmentation is a method that can be used to enlarge a data set and enrich its variation. We do not apply this method to the feature images, for the following reasons.
• The pixel-level light curves, as shown in panels (e) and (f) of Figure 1, should reveal a rapid rise ahead of a slow decay, the typical superflare shape. This unique shape is what we look for in superflare signals and is a vital factor when we inspect superflare candidates. Any augmentation method such as flipping or rotation would destroy this basic factor; we might even end up with a flawed CNN that accepts false superflare signals showing the inverse pattern of a slow rise ahead of a rapid decay. This situation should be avoided.
• Superflare signals in light curves can be treated as outliers, and selecting those outliers is successfully implemented as in Section 2.2; similar selections can also be executed efficiently with other outlier-selection algorithms (e.g., Walkowicz et al. 2011; Wu et al. 2015). In this work, CNNs are used to recognize the feature images exported by the algorithm. As this algorithm automatically selects outliers in light curves and processes the pixel-level data of TESS into unified feature images and arrays, the CNN structures can be kept as lightweight as possible (Figure 6). Lightweight CNNs also ensure that we can train them more easily with limited data. So, data from augmentation methods such as image zooming or padding seem excessive for the lightweight CNNs in this work.

Training, validation, and test data sets
As introduced in Section 2.2, 15,638 flare candidates are selected and classified into three classes: 1268 in the gold class, 3792 in the silver class, and 10,578 in the none class. For the convenience of the subsequent k-fold cross validation of the CNNs, we randomly split the 15,638 flare candidates into seven stacks. Figure 5 shows the workflow of this process. Each of the first six stacks contains 2,348 data, and the last stack contains 1,550 data, so the test set takes roughly 15% of the whole data set. We choose each stack in turn as the test data set; the others are merged and then evenly split into 10 ministacks. One of the ministacks is set as the validation data set, and the others are combined as the training data set. The validation set is thus about 10% of the whole database, and the training set about 75%. According to the above workflow, one test data set corresponds to 10 validation and 10 training data sets. As there are no shared data among the test, validation, and training data sets, we obtain 70 clusters of data (7 test sets × 10 training and validation sets), each including a training set, a validation set, and a test set. As the three classes are imbalanced, we ensure that the stacks and ministacks contain the three classes in almost the same proportions, to avoid issues due to differing proportions in the k-fold cross validation.
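The class-balanced splitting can be sketched by dealing each class's shuffled indices round-robin into the stacks. Note that the paper's actual stacks have sizes 2,348 (six of them) and 1,550; this sketch simply balances the classes across near-equal stacks to show the idea:

```python
import numpy as np

def stratified_stacks(labels, n_stacks=7, seed=0):
    """Split sample indices into `n_stacks` groups so that each group keeps
    roughly the same gold/silver/none proportions (counts per class differ
    by at most one between groups)."""
    rng = np.random.default_rng(seed)
    stacks = [[] for _ in range(n_stacks)]
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            stacks[k % n_stacks].append(int(i))
    return [np.array(s) for s in stacks]
```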

Convolutional neural networks
In this work, selecting superflares from light curves can be considered a time-series problem or time-series forecasting problem, which is traditionally solved using RNNs instead of CNNs (e.g., Hewamalage et al. 2021; Lim & Zohren 2021). There are two reasons for choosing CNNs here. First, RNNs can be used to process stellar light curves (e.g., Vida & Roettenbacher 2018; Rao et al. 2021; Cui et al. 2022; Ofman et al. 2022), but selecting superflare signals can be treated as finding outliers, which is easily done by a number of simple but effective algorithms, as introduced in Section 1; it is therefore unnecessary to use RNNs to process the stellar light curves. Second, according to our experience with manual visual inspection of superflare events, relying on a single univariate stellar light curve may lead to false-positive results, which can be avoided by using the additional information provided by pixel-level data. For instance, the left panel of row 3 in Figure 2 shows a rapid rise and slow decay in the univariate light curve; however, this is a nonsuperflare event, as there are no superflare signals in the pixel-level light curves shown in the middle and right panels of row 3. As shown in panels (e) and (f) of Figure 1 and row 2 of Figure 2, the PSF shape reveals that the brightest pixel has a higher superflare amplitude than the surrounding pixels, while the outer pixels stay still over time. Besides, the aperture masks of the TESS pipeline, shown as blue frames in these figures, also provide pixel-level information and ensure that superflare signals are revealed inside these masks, as introduced in Section 3.1. This information can be identified more easily by CNNs than by RNNs, as it varies in the pixel-level data from one TESS target to another. So, CNNs are practicable because we transform a time-series problem into an image-classification problem.
Those images contain much more pixel-level information, which is helpful and necessary for identifying true superflare events.
Having set the training, validation, and test data sets, we construct six CNNs according to the candidates' TD arrays, PL images, and LS images. The validation data set is used to tune the hyperparameters of the networks. Before settling on the six CNNs, one for each of the TD, PL, and LS data in their normalized and unnormalized versions, we constructed a whole series of networks and selected the most stable and best-performing one according to the validation data set.
Flow charts of these six CNNs are shown in Figure 6. We use the AdaGrad algorithm (Duchi et al. 2011) as the optimizer when training the networks. The learning rate, learning rate decay, and weight decay are listed in the figure for each network. None of these networks are the same, as we account for the different identities of the feature images and arrays; the stability shown on the validation data by each network is also considered. Note that, in order to standardize the inputs of each network, we resize the PL images and LS images to resolutions of 704×704 and 121×121, respectively. The TD arrays are all resized to 11×11×110 through interpolation along the time axis. To the outputs we apply the Softmax function (normalized exponential function; Bishop 2006) as follows,
$$c_i = \frac{\exp(o_i)}{\sum_{k=0}^{K} \exp(o_k)}. \quad (7)$$
As we only have three classes, $K = 2$. $o_i$ represents the original output for the $i$th class, so after being exponentially normalized, $c_i$ represents the probability of classifying one input candidate into the $i$th class, and we assign the input to the class with the maximum $c_i$. During the training of the networks, the cross-entropy loss function is used, which can be written as
$$L = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=0}^{K} t_i^{(j)} \ln c_i^{(j)}, \quad (8)$$
where $K = 2$ and $n$ stand for the number of classes (indexed from 0) and the count of data in the training set, respectively, and $j$ denotes the $j$th datum of the training set. $t_i^{(j)} = 1$ if the $j$th datum is manually classified into the $i$th class; otherwise $t_i^{(j)} = 0$. The purpose of training the networks is to reduce the value of the loss function and raise the classification accuracy. Besides, as the three classes are imbalanced, data rebalancing is also realized through the cross-entropy loss function by weighting it according to the proportions of the three classes in the training set.
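The Softmax normalization and the class-weighted cross-entropy loss can be written down directly in numpy. A sketch; the per-class weights used for rebalancing are shown as an explicit weight vector, and the specific values below are placeholders rather than the paper's:

```python
import numpy as np

def softmax(o):
    """c_i = exp(o_i) / sum_k exp(o_k); shifted by the max for stability."""
    e = np.exp(o - np.max(o, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def weighted_cross_entropy(logits, targets, class_weights):
    """Cross-entropy over the three classes, weighted per class to rebalance
    the gold/silver/none imbalance (weight values here are placeholders)."""
    c = softmax(logits)
    n = len(targets)
    w = np.asarray(class_weights, dtype=float)[targets]
    return float(-np.sum(w * np.log(c[np.arange(n), targets] + 1e-12)) / n)
```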
The results for the 70 training and validation data sets are shown in Figure 7; these can also be treated as the results of the k-fold cross validation. Note that we classify candidates into three classes, but both the gold class and the silver class contain superflare events, so the results from these two classes are combined to calculate the accuracy of a network; the reason we set three classes was illustrated in Section 2.2. From the figure, we can see that the loss and accuracy of the six networks stabilize as the training epochs increase. Some of the networks show mild overfitting, as the validation accuracy is somewhat lower than the training accuracy. There are two reasons for this. First, we only have about 15,000 data, and the networks may be relatively complicated, with many neural units that would be better constrained by more data. Second, we tend to choose networks that perform well on the validation data, with higher accuracy and stable training progress. We also tried simplifying the networks and penalizing the loss function with a higher weight decay; however, neither enhanced the performance of the networks, but rather reduced it. So, we control overfitting by adding dropout layers between the convolutional layers and by setting a learning rate decay to avoid a relatively high learning rate as the training epochs increase.

CNN results
In Equation (9), we list an example of the confusion matrix, which is helpful for introducing recall, precision, the true-positive rate (TPR), and the false-positive rate (FPR),
$$\begin{pmatrix} {\rm TP} & {\rm FN} \\ {\rm FP} & {\rm TN} \end{pmatrix}, \quad (9)$$
where TP, FN, FP, and TN stand for the numbers of true-positive, false-negative, false-positive, and true-negative events. In this work, true events are superflare events, the ground truth is the manual classification, and the CNN results are the classifications based on the CNN outputs. The accuracy, recall ($R$), precision ($P$), TPR, and FPR can then be derived as
$${\rm Accuracy} = \frac{\rm TP+TN}{\rm TP+FN+FP+TN},\quad R = {\rm TPR} = \frac{\rm TP}{\rm TP+FN},\quad P = \frac{\rm TP}{\rm TP+FP},\quad {\rm FPR} = \frac{\rm FP}{\rm FP+TN}. \quad (10)$$
The break-even point (BEP) and the area under the receiver operating characteristic (ROC) curve (AUC) are also derived from the network results. BEPs are the points where precision equals recall in the PR curve, obtained by thresholding at each of the ranked probabilities of positive events and calculating the corresponding recall and precision; the ROC curve applies the same thresholds and calculates the corresponding TPR and FPR, and the AUC is the area under this curve. The best possible network has Accuracy = 100%, $R$ = TPR = 1, $P$ = 1, and FPR = 0, with BEP = 1 and AUC = 1, meaning that it recalls all positive events with 100% precision and perfectly distinguishes positive events from negative events.
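These definitions translate one-to-one into code. A sketch, including a simple threshold-sweeping AUC (trapezoidal area under the ROC curve):

```python
import numpy as np

def metrics(tp, fn, fp, tn):
    """Accuracy, recall (= TPR), precision, and FPR from the confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    return accuracy, recall, precision, fpr

def auc_from_scores(scores, labels):
    """Sweep a threshold down the ranked positive-class scores, tracing the
    ROC curve, and integrate it with the trapezoid rule."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=float)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1.0 - y) / (1.0 - y).sum()])
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))
```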
Each of the seven test data sets is inputted into the corresponding networks. For example, the test set in Figure 5 is stack 0, which is inputted into the networks trained on the combination of ministacks without ministack 0. Then, the same test set is inputted into the networks trained on the data without ministack 1, and so on. This test set is thus inputted into the corresponding networks 10 times, as are the other test sets. Finally, we get 70 results for each of the six networks. Their accuracy, recall, and precision are all listed in Table 3. As these results come from the test data sets, they indicate the generalization ability of the CNNs. These results are based on the binary classification of superflares versus nonsuperflares instead of the three classes, as illustrated above.

ENSEMBLE LEARNING
The six CNNs for the TD arrays, PL images, and LS images do not show perfect performance, so we look for a way to further improve the accuracy of classifying superflare candidates. After training the six separate networks, ensemble learning is a handy option, including the stacking and voting methods. Here, we illustrate these two methods, through which we improve the superflare classification accuracy.

Stacking
The stacking method can be understood as training another CNN (hereafter the stacking CNN), whose inputs are a combination of the outputs of several individual networks (Wolpert 1992; Breiman 1996). As introduced in Section 3.2, we have seven test sets × 10 training and validation sets, where each test set, validation set, and training set is independent, and for each test set there are 10 training and validation sets. To avoid overfitting when training the stacking CNN, the outputs obtained by feeding the training data sets into the trained networks (the six CNNs) are not used to train the stacking CNN (Zhou 2021). These six networks are illustrated in Section 3.3. For instance, treat stack 0 in Figure 5 as the test set. First, we input the validation set ministack 0 and the test set into the six networks trained on the remaining ministacks, and we get the first 1/10 of the training data (train-part-0) and the first test set (test stack 0) for the stacking CNN. Then, we input ministack 1 and the same test set into the networks trained on the remaining ministacks, and get the second 1/10 of the training data (train-part-1) and the second test set (test stack 1), and so on. Finally, for the test set stack 0, we have 10 test stacks and one training set combined from the 10 train parts. There are seven test sets (aka stacks in Figure 5), so we get seven corresponding training sets, each of which corresponds to 10 test stacks for the stacking CNN. The outputs of the six networks are combined into an array with a size of 6×3 (each of the six networks outputs an array with a size of 1×3) and inputted into the stacking CNN. The new 7×10 test stacks are used for the k-fold cross validation of the stacking CNN. The workflow of the stacking CNN can be found in Figure 8(a). The results of putting the 70 test stacks into the trained stacking CNN are listed in Table 3. These 70 test stacks are also used in the following voting method.
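The data manipulation at the heart of the stacking step can be sketched as follows: each of the six base networks emits a 1×3 Softmax vector, and these are combined into the 6×3 array that the stacking CNN receives (random stand-in outputs are used here in place of the trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable Softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for the outputs of the six trained base CNNs on one candidate:
# each network yields a 1x3 probability vector over (gold, silver, none).
base_outputs = [softmax(rng.normal(size=3)) for _ in range(6)]

# The stacking CNN's input is the 6x3 combination of these vectors.
stacking_input = np.stack(base_outputs)   # shape (6, 3)
```

Each row sums to one, so the stacking CNN effectively learns how much to trust each base network's probability estimate.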

Voting
Voting is another method to integrate several separate CNNs, whose outputs are treated as votes to classify the candidates. Here, we use two voting schemes, soft voting and hard voting, which can be formulated as S_i(x) = (1/6) Σ_{j=1}^{6} c_i^j and H_i(x) = (1/6) Σ_{j=1}^{6} 1(i = argmax_k c_k^j). Here, S(x) and H(x) stand for soft voting and hard voting, and x is the candidate data to be classified. c_i^j is the output after applying the Softmax function in Equation 7, i denotes the i-th class (we classify candidates into the gold class, silver class, and none class, so i ∈ {0, 1, 2}), and j denotes the j-th network, from 1 to 6. The corresponding order of the networks is listed in the first six rows of Table 3. The specific workflows of voting are shown in panels (b) and (c) of Figure 8. In order to strictly classify candidates through the voting method, we set a limit at 75%: max H(x) or max S(x) must be larger than this limit for the candidate to be labeled. Otherwise, the voting method refuses to label the candidate, and the data remain for manual labeling after visual inspection. For hard voting, 75% means that more than four of the six individual networks have the same voting result. As some candidates are denied classification, we define the classification rate as Rate = (N − N_TBD)/N, where N is the number of all data in a test stack, and N_TBD is the number of data that cannot be classified and are labeled 'TBD' in the voting flows. As the outputs derived from the Softmax function can be treated as probabilities for classifying a candidate, we regard each of the six CNNs as having 100 votes; for soft voting, 75% means that more than 75% of the 600 votes support the same result. We do not set this kind of limit for the six separate CNNs, for the following reasons. (1) There is no practical meaning for the limit if we set it for each of the six CNNs.
(2) It is hard to get a specific percentile limit that is reasonable and explicable.
(3) Even if we set 75% as the limit for each of the six CNNs, the accuracy and other properties do not achieve better results than the voting methods. (4) Voting from the six CNNs can refuse to classify those candidates that may be hard to classify or unlike anything previously used to train the CNNs. So, we insist on using the maximum output of each of the six CNNs to classify candidates instead of setting a limit. The results of the voting methods are listed in Table 3.
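A minimal sketch of the two voting rules and the classification rate, under the 75% limit described above (the class indices and the 'TBD' label follow the conventions of this work; the example probability rows are illustrative):

```python
import numpy as np

LIMIT = 0.75  # winning share must exceed 75% for a candidate to be labeled

def soft_vote(probs):
    """probs: (6, 3) array, one Softmax output row per base CNN.
    Average the probabilities (the '600 votes') and label the candidate
    only if the winning class share exceeds the limit."""
    share = probs.mean(axis=0)
    return int(share.argmax()) if share.max() > LIMIT else "TBD"

def hard_vote(probs):
    """Each CNN casts a single vote for its argmax class; exceeding 75%
    means more than four of the six networks agree."""
    counts = np.bincount(probs.argmax(axis=1), minlength=3)
    share = counts / len(probs)
    return int(share.argmax()) if share.max() > LIMIT else "TBD"

def classification_rate(labels):
    """Rate = (N - N_TBD) / N over one test stack."""
    n_tbd = sum(1 for lab in labels if lab == "TBD")
    return (len(labels) - n_tbd) / len(labels)

# Five networks strongly favor class 0, one favors class 1:
agree = np.array([[0.9, 0.05, 0.05]] * 5 + [[0.1, 0.8, 0.1]])
# Three networks favor class 0 and three favor class 1 -> no 75% majority:
split = np.array([[0.9, 0.05, 0.05]] * 3 + [[0.05, 0.9, 0.05]] * 3)
```

In the `agree` case both rules label the candidate as class 0, while the evenly `split` case is rejected by both and would fall back to visual inspection.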

Publicly available data set
In Section 2.2, we introduced the method through which we select superflare candidates from three years of TESS observations. These 15,638 flare candidates are all from solar-type stars (with stellar effective temperatures of 5100 K ≤ T_eff < 6000 K and surface gravities of log g > 4.0), and their information is listed in Table 2. It is worth noting that only 5060 of the 15,638 candidates are superflare events; the others are nonsuperflare data. The corresponding stellar properties of their target stars are listed in Table 1. Note that, although the stellar periods may be derived from a combination of three years of observations, we should be careful when using periods larger than 10 days, because a sector of TESS has a limited observing span of about 27 days, and the observation of each sector is normalized by itself (Tu et al. 2020). We classify these data into three classes: the gold class, silver class, and none class, among which the gold and silver classes collect superflare events. As these three classes have different characteristics, and in order to improve the performance of the CNNs, we still use the three classes for training the CNNs, but treat superflare versus nonsuperflare events as a binary classification problem when calculating the properties of a CNN. The three classes with their own characteristics are described in Section 3.1 and shown in Figure 2.
We emphasize that these candidates are classified into three classes; however, we do not use these data for any statistical research like Tu et al. (2020, 2021), which is beyond the topic of this article. Using the CNNs in this work to process the four years of TESS observations will accomplish collecting all superflare events detected by TESS. Following that, statistical research can be performed, for instance, comparing stellar flare frequencies on different types of stars, or comparing stellar flares on one type of star but with different ages. Readers who are interested in these data can find the corresponding events as the specific information is released. One could also use these data to create new feature images or data for one's own CNN training. Besides, the six kinds of data sets, including TD-normal., TD-unnormal., PL-normal., PL-unnormal., LS-normal., and LS-unnormal., are all publicly released 1 . Their corresponding file names are also listed in Table 2. These data sets are therefore useful and valuable for the community searching for stellar superflares. They are also useful for searching for superflares on other types of stars, as determining the type of a star is entirely independent of detecting superflare candidates, which relies only on light-curve and pixel-level data.

Comparing CNNs
We compare the results of the networks and methods listed in Section 3.2; these results are listed in Table 3. The classification rate, accuracy, recall, precision, BEP, and AUC are used to indicate the generalization capabilities of the different CNNs and methods. Note that: (1) In order to compare the results of the six different CNNs, we use the same seven test sets (the stacks in Figure 5). Each of the six CNNs is trained once on each of the 10 training data sets, so we get a total of 70 outputs (7×10). The only difference is the inputs of these CNNs, which are the different corresponding feature images or arrays. For the voting and stacking methods, the 70 test stacks are inputted into the corresponding stacking CNNs, which are trained on each of the seven training sets. The processes of setting the data sets for deep learning and ensemble learning are introduced in Sections 3 and 4. So, the original candidates and the counts of data in each of the three classes are all the same when comparing these CNNs and voting methods. (2) The ultimate purpose of this work is to efficiently distinguish superflare events from nonsuperflare events. Although we set three classes to improve the performance of the six CNNs, the results according to the two-class classification (superflare or nonsuperflare) are also obtained. According to the results of the two-class classification, we discuss using these CNNs to search for superflares in the future.
For comparing these CNNs and voting methods, the properties in Table 3 are visualized in Figure 9. From this figure, the accuracy of the six CNNs is around 95.5%. TD-normal. is clearly the best network of the six CNNs, and we use its properties to compare with stacking and soft voting in Figure 10. As the shaded areas (std intervals) of the six CNNs are relatively larger than those of the stacking and voting methods, the six CNNs may not be as stable as the stacking CNN and voting methods across all test sets. We also compare the PR curves and ROC curves of TD-normal., the stacking CNN, and soft voting in Figure 10; both of these curves indicate the generalization and performance of a CNN. Compared with the CNN of TD-normal., the accuracy and other properties of the stacking CNN are significantly enhanced. The six CNNs and the stacking CNN both classify all candidates with a 100% classification rate, unlike the voting methods. This notable enhancement strengthens the idea that each of the six CNNs and their inputs have their own individuality, and a combination of the six CNNs can improve the redundancy, or error tolerance, of searching for superflares. From Figure 10, it is noticeable that the performance of the ensemble-learning methods clearly exceeds that of the six CNNs. Obtaining high generalization performance through one CNN alone is not that practical, as we use fewer than 15,638 data instances to train a network, and only about one-third of them are labeled as superflare events. The well-known deep-learning benchmark databases MNIST (Lecun et al. 1998; Kussul & Baidyk 2004) and Fashion-MNIST (Xiao et al. 2017) have 60,000 training instances, even though they are classified into ten classes and each class is definitive with clear boundaries. In this work, only superflare events are distinguished from other nonsuperflare events with clear features and boundaries.
However, it would also be possible to define many subclasses within the none class, such as exoplanet transits, cosmic-ray events, CCD errors, and pipeline noise. For convenience, we blend all these nonsuperflare events into the none class to distinguish them from superflare events, which may confuse the CNNs when nonsuperflare events are inputted. So, the limited number of superflare instances and the blending of all nonsuperflare events together may be why we have not yet obtained a single CNN that is better, more stable, and comparable to the ensemble-learning methods. Using ensemble-learning methods is more complicated, as we have to train and maintain six kinds of CNNs and input data, but ensemble learning is currently a practical and relatively powerful method for classifying superflare candidates. Besides, the stacking CNN, hard voting, and soft voting are more stable and better than the six CNNs according to Figure 9.
If we describe the enhancement of the stacking CNN as taking advantage of the differences between the six CNNs and their inputs, the voting methods instead rely on the result consistency of the six CNNs. For the voting methods, the accuracy, recall, precision, BEP, and AUC are calculated based on the classified candidates, excluding those labeled 'TBD' after voting, as shown in Figures 8(b) and 8(c). Comparing the stacking CNN with the voting methods in Figure 9, it is evident that the voting methods are more stable and exceptional, and soft voting performs better than hard voting. In Figure 10, soft voting is really close to the ideal vision of a machine-learning classifier, with R ∼ 1, P ∼ 1, FPR ∼ 0, and TPR ∼ 1. However, the classification rate decreases from 100% to about 92%, as shown in Figure 9; this is a trade-off between performance and the classification rate. The voting methods consider each of the six CNNs evenly by uniting their outputs. As we set 75% as the limit above which a candidate is classified, this can be understood as ensuring that voting results are broad, representative, and accurate among the six CNNs. We also checked those candidates whose maximum voting results did not reach the 75% limit. These candidates are intractable even under visual inspection, as their signal-to-noise ratios (S/N) are relatively low and they are hard to classify. This may result from the fact that each of the six CNNs is also hard to train to detect superflare features at such low S/N. Another reason might be that some kinds of data have not received adequate attention because they are underrepresented in the training sets; such data may not be well learned by the CNNs and are rejected for classification. The voting method maximally exposes these unclassified data, with an exposure rate of about 8% (100% − 92%) for soft voting.
Besides the decrease in the classification rate (derived in Equation 18), the other indicators of network generalization and performance are all distinctly enhanced, as shown in Figure 9, and the accuracy is improved to values higher than 99%. So, we could improve the accuracy of soft voting to more than 99.42%, assuming that the manual visual inspection is 100% accurate and trustworthy. As some kinds of data are too scarce to receive adequate attention while training the CNNs, it is likely that some of them are new transient phenomena. Through one-by-one visual inspection, we will find these new phenomena instead of submerging them in endless data. Returning to our initial purpose of using CNNs for searching superflare events, we pursue an ideal vision in which the automatically searched events are all true-positive superflare events in a database. At the present moment, the benefits of artificial intelligence are reflected more in efficiency at processing repeatable jobs than in handling extraordinary unseen situations. We still recommend that readers who are willing to use our CNNs for searching superflare events at least visually inspect the remaining ~8% of unclassified data. These data can also be used to enlarge the superflare database for training the six CNNs or other networks in the future.

Duration and amplitude
As we have discussed, superflare candidates with relatively low S/N are hard to inspect visually. The same also applies to the six trained CNNs. It is possible that the duration and amplitude of a superflare are related to its S/N; these two properties can be intuitively recognized, and superflares with long durations and large amplitudes are more obvious, with abundant superflare features. In Figure 11, it is obvious that the performance of the stacking CNN is not better than that of soft voting, as shown in panels (a1), (a2), (b1), and (b2). The candidates with short durations and low peak amplitudes are not classified as well as those with long durations and high amplitudes. In other words, the six trained CNNs and the stacking CNN do not perform well on candidates with relatively low S/N. For soft voting, greater performance is achieved by rejecting the classification of candidates with shorter durations and lower peak amplitudes, as shown in panels (a3) and (b3) of Figure 11. So, when using the classification results from the CNNs in this work, we should pay more attention to those candidates that are classified as superflare events but have short durations and low peak amplitudes. In the future, updates of the six CNNs and the stacking CNN presented in this work, or newly constructed CNNs, should place emphasis on superflare instances with lower S/N. Here, we propose two ways: (1) collect more data with relatively low S/N, and (2) stretch or compress the time axis so that the time interval of a superflare has the same scale of width in the TD arrays, PL images, and LS images.
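The second proposal, stretching or compressing the time axis so every flare occupies the same width, can be sketched with simple linear interpolation (an illustrative helper, not part of the published pipeline; the target length of 64 samples is an assumption):

```python
import numpy as np

def rescale_flare(flux, n_out=64):
    """Stretch or compress a flare segment to a fixed number of samples
    via linear interpolation, so short and long events occupy the same
    width along the time axis of the feature data."""
    flux = np.asarray(flux, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(flux))
    t_new = np.linspace(0.0, 1.0, n_out)
    return np.interp(t_new, t_old, flux)

short_event = [1.0, 1.4, 3.0, 1.6, 1.1]   # a brief, coarsely sampled flare
long_event = np.ones(300)                 # a long, finely sampled flare
long_event[140:160] = 2.0
```

Both events come out with the same length, so a CNN sees flares of different durations at a comparable spatial scale.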

SUMMARY
This work focuses on developing CNNs for automatically searching for superflares in pixel-level data. We use 15,638 superflare candidates, which are classified into three categories, namely, the gold class, silver class, and none class; examples of these classes are shown in Figure 2. These candidates are from 71,732 solar-type stars (with 5100 K ≤ T_eff < 6000 K and log g > 4.0) observed during three years of TESS observations. The candidates were automatically searched through our previous methods (Tu et al. 2020, 2021) and visually inspected in this work. The information on the solar-type stars and the superflare properties is listed in Tables 1 and 2.
Six CNNs have been trained and validated by k-fold cross validation. These six CNNs are trained on feature data including TD arrays, PL images, and LS images. All superflare candidates and feature data are publicly available; examples of these feature data are visualized in Figure 3. As there are uneven counts of instances in the gold class, silver class, and none class, data rebalancing is considered when training the CNNs. The six CNNs are different from each other and are developed considering the characteristics of the different features.
Ensemble learning is used to combine the outputs of the six CNNs. The six trained CNNs, the stacking CNN, and the code are publicly available 2 . After comparing indicators of network performance, such as the accuracy, recall, precision, BEP, and AUC, we found that the stacking and voting methods evidently enhance the performance and generalization ability beyond the six CNNs. The reason is that the combination of the six CNNs depicts superflare features from far more points of view than a single CNN. The stacking CNN improves the redundancy, or error tolerance, of searching for superflares. The voting methods, including soft voting and hard voting, show better performance than the stacking CNN, but at the cost of the classification rate decreasing from 100% to about 92%; meanwhile, the accuracy is improved from 97.62% to over 99%. The voting methods achieve better result consistency than the six CNNs. We set a limit of 75%, which ensures that the votes united from the six CNNs are broadly representative, more accurate, and reliable while searching for superflares. Soft voting is the best method, with R ∼ 1, P ∼ 1, FPR ∼ 0, and TPR ∼ 1. The ~8% of unclassified data are visually inspected one by one; these data are basically intractable, with relatively low S/N, and hard to classify. Visual inspection of these data will not only improve the accuracy to higher than 99.42%, but also enlarge the training database for future updates of the six CNNs.
The performance of the six CNNs is better on superflare candidates with high S/N (i.e., with longer durations and higher peak amplitudes). The voting methods are more capable of rejecting candidates with shorter durations and lower peak amplitudes, which is why soft voting is better than the stacking CNN. The voting methods also offer chances to find new transient phenomena, which are not recognized by the trained CNNs at all and will not be classified automatically but will be left for visual inspection.
In this work, six CNNs and ensemble-learning methods are developed for the purpose of automatically identifying superflare events from pixel-level data, which ensures that the selected superflares are true-positive events. Contamination by false-positive events would influence the statistical analysis of superflare data; for instance, the number of nonsuperflare events is more than twice that of superflare events in the database of this work, a statistical weight that cannot be neglected. Although the superflare events in this work are selected from solar-type stars, the trained CNNs are based only on observational data, and stellar parameters are not considered. So, these trained CNNs and methods can also be applied to the detection of stellar flares on other types of stars. CNN training is time and data consuming, and the performance and generalization of a CNN depend on the scale of the database. In the future, updates of the CNNs in this work should still be based on a larger superflare database, especially including superflares with low S/N. CNNs will efficiently process archived and upcoming data from TESS. Namekata et al. (2021) probably found a stellar filament eruption associated with a superflare event through optical spectroscopic observation. If CNN forecasting of stellar superflares becomes available in the future, it will buy time for preparing follow-up optical spectroscopic observations. If so, work like that of Wang et al. (2021) could capture the whole spectroscopic evolution from before the stellar flare until its end, and Namekata et al. (2021) would not need to spend precious ground-based observing time on a single star, nor rely on chance to find a stellar filament eruption. Many more stellar coronal mass ejection events could then be detected more efficiently.
We can foresee that the database of stellar superflares can be efficiently enlarged by the CNNs in this work, and more interesting statistical research can be done in the future, including but not limited to the comparison of superflares on different types of stars.

[Figure 1 caption, recovered fragment:] Panel (b) gives the whole light curve of the sector from which this superflare candidate is searched; the peak of the superflare is marked by the blue arrow. Panels (c) and (d) show the pixel-level images at the pre-flare and peak time stamps of the superflare, respectively; in these two panels, blue frames encircle the aperture masks of the TESS pipeline. We use these two panels to check whether the area of the apertures is stable and unchanged, and whether the brightest area is identical before and at the peak of the flare; if it changes, the quality of the data is specially considered. For each pixel in the images, its light curves with normalized and unnormalized flux are correspondingly shown in the square frames of panels (e) and (f). The red solid lines stand for F_q(t), and the three colored dots stand for the start, peak, and end time stamps of the flare.

[Figure 2 caption, recovered fragment:] The patterns contain pixel-level light curves, exactly the same as those shown in Figure 1. Here, peaks 1 and 2 stand for candidates that belong to the silver class and gold class, respectively; peak 3 belongs to the none class as a nonsuperflare (negative) event.

[Figure 4 panel titles: line-up sequences of light curves with unnormalized flux (LS-unnormal.); pixel-level light curves with unnormalized flux (PL-unnormal.); three-dimension data with normalized flux (TD-normal.).]

[Figure 4 panel title: three-dimension data with unnormalized flux (TD-unnormal.)]

Figure 4. Examples of TD arrays, PL images, and LS images, all derived from the candidate displayed in Figure 1. Here, 'normal.' and 'unnormal.' represent whether the flux data are normalized or unnormalized. Figure 3 illustrates how we obtain these feature images. Examples of TD-normal. and TD-unnormal. may not be accurately visualized; these two kinds of data are three-dimensional arrays with one dimension describing time, and the arrays can be understood as film clips that reveal the changes of the pixel-level image during the whole flare time. In the PL-normal. and PL-unnormal. images, flare light curves colored in white are those data from the apertures masked by TESS; the other pixel light curves are colored in gray.

[Figure 5 flow-chart labels: evenly split into 10 ministacks (ministack-0, ministack-1, ministack-2, ..., ministack-9); validation set; training set; test set.]

Figure 5. Flow chart for setting the training, validation, and test data sets. As shown by this workflow, we have seven test data sets, and each of them corresponds to 10 validation and 10 training sets. The data balance among the three classes is considered when we split the data into stacks and ministacks. The training set, validation set, and test set are 75%, 10%, and 15% of the whole database, respectively. The separation into stacks and ministacks is for k-fold cross validation.

Figure 6. Flow charts of the six CNNs. For each network, its learning rate (lr), learning rate decay (lr decay), weight decay, and epochs are all listed under the title of the network. All networks are constructed from convolutional layers and fully connected layers. The three brackets in some layers denote the neural kernel size, kernel moving stride, and padding size. The numbers of input channels and output channels are also listed in each layer.
The outputs of each network are passed through the Softmax function (Equation 7), so the outputs can be regarded as probabilities for classifying a candidate into the three classes.

[Figure caption, recovered fragment (PR/ROC comparison):] Point (b) represents the point where FPR = 0 and TPR = 1; this is the result of the ideal network, as all positive candidates are selected without any negative ones. As TD-normal. is the best network of the six CNNs and soft voting is better than hard voting, we use the results from these two to compare with the stacking CNN. The results from soft voting, the stacking CNN, and TD-normal. are shown in blue, red, and yellow, respectively. The solid curves represent the median values of the results, and the shaded areas denote the 5% to 95% ranges of each property.

Figure 11. Specific recall, precision, and classification rate in each bin of superflare duration and amplitude. The bin sizes of the duration and amplitude are in logarithmic scale. We compare the stacking CNN with soft voting, represented by the red and blue curves, respectively. Recall and precision are defined in Equations (11) and (12), and rate stands for the classification rate defined in Equation (18). These three properties are all in units of percent in this figure. The error bars are the standard deviations of each property. The data from which we plot this figure are the same as those of Figures 9 and 10 and Table 3, but with specific results in bins of duration and peak amplitude.

[Table 2 notes, recovered fragment:] (c) The serial number of a flare candidate captured from the star in a specific sector; for example, the first row stands for the candidate that is the eleventh one detected from the light curve of TIC 382575967 in sector seven, with n = 5 set in Equation 2. (d) Labels of the gold class, silver class, and none class, represented by 0, 1, and 2, respectively. All labels are set by visual inspection of the candidates.
Here, we have 1268 candidates in the gold class, 3792 candidates in the silver class, and 10,578 candidates in the none class. (e) Superflare peak time stamps of TESS. (f) Superflare peak luminosity in units of erg s^-1. (g) Superflare energy in units of erg, which is calculated by Equation 6. (h) Superflare duration in units of s, which is calculated by t_end − t_start; the start and end times of the superflare are derived through Equation 4. (i) File names of the feature images and arrays of the candidate. These files are used for training the CNNs; their names are structured by the type of data, the number of the sector, the source, and the serial number of the candidate. Note that the data of TD-normal. and TD-unnormal. are not saved in any picture format but in the array format of Numpy (a Python package) with the file name extension .npy. (This table is available in its entirety in machine-readable form.)

[Table 3 notes, recovered fragment:] The accuracy, recall, and precision are defined in Equations (10), (11), and (12), respectively. The BEP and AUC are introduced in Section 3.3. The mean and standard deviation of each property are listed in the table. For each parameter of the six CNNs and ensemble-learning methods, the best results are shown in bold. TD-normal. is the best CNN among the six CNNs. Soft voting has the best results except for the classification rate, which we discussed in Section 5.2.