Re-Evaluating Small Long-Period Confirmed Planets From Kepler

We re-examine the statistical confirmation of small long-period Kepler planet candidates in light of recent improvements in our understanding of the occurrence of systematic false alarms in this regime. Using the final Data Release 25 (DR25) Kepler planet candidate catalog statistics, we find that the previously confirmed single-planet system Kepler-452b no longer achieves 99% confidence in the planetary hypothesis and is not considered statistically validated, in agreement with the finding of Mullally et al. (2018). For multiple planet systems, we find that the planet prior enhancement for belonging to a multiple planet system is suppressed relative to previous Kepler catalogs, and we find that the multi-planet system member Kepler-186f also no longer achieves 99% confidence in the planetary hypothesis. Because of the numerous confounding factors in the data analysis process that leads to the detection and characterization of a signal, it is difficult to determine whether any one planetary candidate achieves a strict criterion for confirmation relative to systematic false alarms. For instance, when taking into account a simplified model of processing variations, the additional single-planet systems Kepler-443b, Kepler-441b, Kepler-1633b, Kepler-1178b, and Kepler-1653b have a non-negligible probability of falling below 99% confidence in the planetary hypothesis. The systematic false alarm hypothesis must be taken into account when employing statistical validation techniques in order to confirm planet candidates that approach the detection threshold of a survey. We encourage those performing transit searches of K2, TESS, and other similar data sets to quantify their systematic false alarm rates. Alternatively, independent photometric detection of the transit signal or radial velocity measurements can eliminate the false alarm hypothesis.


INTRODUCTION
The final Kepler DR 25 pipeline run (Twicken et al. 2016) and Kepler science office Robovetter classification (Thompson et al. 2018) provide an unprecedented uniform catalog of planet candidates for understanding the dynamics and occurrence outcomes of the planet formation process. The planet candidate catalog was accompanied by an equally impressive set of supplemental data to help measure the biases and corrections necessary to make full use of the catalog. The following is a non-exhaustive list of supplemental data products associated with the DR 25 Kepler data:
• Stellar parameter catalog (Mathur et al. 2017)
• Database of transit signal injection followed by the full pipeline and classification recovery (Christiansen 2017; Coughlin 2017a)
• Database of transit signal injection followed by the transiting planet search module alone, for greater detail into signal recovery (Burke & Catanzarite 2017a,b)
• Kepler pipeline open software release and documentation (Jenkins 2017), for understanding implementation details of the pipeline algorithms
• Astrophysical false positive classification (Morton et al. 2016), as well as false positive analysis taking into account ground-based follow-up observations (Bryson et al. 2017)
• False positive classification taking into account the stellar image position
• Characterization of the systematic false alarms through data permutation techniques (Coughlin 2017a; Thompson et al. 2018)
The last item offers new insight into quantifying the Type I error rate of a transit survey. Jenkins et al. (2002) used theoretical arguments to propose that a 7.1 σ threshold would produce < 1 false alarm (or Type I error) in the entire Kepler data set under certain assumptions regarding the expected noise. However, experimental data, even after detrending, can contain residual systematics in excess of the expected noise, causing an enhanced false alarm rate.
The simulated false alarms described in Coughlin (2017a) represent the first time a transit survey has quantified its Type I false alarm rate based upon the experimental data for the express purpose of constraining the detections induced by instrumental or data processing systematics that can mimic a transit signal. It was recognized in analyzing the Q1-Q16 Kepler planet candidate catalog that there was evidence for excess detections at long periods, likely due to low-level systematic false alarms that mimic the low signal-to-noise ratio, few-transit events highly sought after for detection. Burke et al. (2015) concluded that quantifying the rate of the false alarm contamination was a leading systematic uncertainty in constraining the GK dwarf habitable zone planet occurrence rate. Astrophysical false positives, such as background eclipsing binary signals, were also identified in the nascent stage of transit discoveries as a contaminating source of false positive transit signals (Brown 2003). Thus, transit discoveries relied heavily on radial velocity confirmation and follow-up observations to vet against astrophysical sources of contamination (Torres et al. 2004; Mandushev et al. 2005). Systematic false alarms in the discovery data (typically based upon small telescopes) could be rejected by additional photometric follow-up at the predicted transit times (typically using larger telescopes) in order to provide an independent detection of the transit signal and confirm its ephemeris. While the high signal-to-noise ratio (SNR) of most candidates eliminates the concern for systematic false alarms, low-SNR detections from Kepler are extremely difficult to follow up photometrically from the ground due to their long orbital periods, long transit durations, and faint host magnitudes.
Statistical validation methods were developed to vet against astrophysical false positives for Kepler candidates (Fressin et al. 2011; Morton & Johnson 2011; Lissauer et al. 2012; Morton 2012; Díaz et al. 2014; Lissauer et al. 2014; Rowe et al. 2014). Although statistical validation achieved success in constraining the possibilities for astrophysical false positives, systematic false alarms became a problem for Kepler candidates near the detection threshold. However, up to and including the Q1-Q12 Kepler planet candidate catalog (Borucki et al. 2011a,b; Batalha et al. 2013; Burke et al. 2014; Rowe et al. 2015), it was possible to mitigate systematic false alarm contamination by examining the extra 6-12 months of Kepler data that were available post-discovery to ensure that a transit signal detected in less data continued to increase in significance. Starting with the Q1-Q16 Kepler catalog and the end of observations of the prime mission field, looking ahead at more data was no longer an option for dealing with systematic false alarms.
Although the Kepler false alarm data set from Coughlin (2017a) was primarily focused on addressing the occurrence rate contamination problem, Mullally et al. (2018) recently demonstrated that the false alarm rate as a contaminating source impacts the statistical validation method employed to confirm planet candidates. They conclude that the planet Kepler-452b (the closest approximation to the planet radius, R_p, and orbital period, P_orb, of Earth orbiting a Sun-like star; Jenkins et al. 2015) is not a confirmed planet anywhere close to the 99.7% probability claimed by Jenkins et al. (2015).
In this paper, we follow up on and expand the work by Mullally et al. (2018) to address the full sample of Kepler confirmed planets in the small-planet, long-period regime. In Section 2 we show how to frame the systematic false alarm contamination problem in terms of the Bayesian probability framework adopted by the statistical validation literature. In Section 3 we apply this framework to the full sample of Kepler confirmed planets orbiting targets hosting a single planet candidate, and Section 4 extends the calculation to targets hosting multiple planet candidates. Sections 3 and 4 update the planet probability in order to identify several previously confirmed planets that can no longer be confirmed at the 99% reliability level. Section 6 provides an alternative analysis that focuses on targets on the electronically 'quiet' Kepler detectors. Section 7 discusses options for reconfirming these newly unconfirmed planets and provides guidance for the K2 and TESS missions (Howell et al. 2014; Ricker et al. 2016) for avoiding systematic false alarms in their statistical validations. Finally, Section 8 summarizes the findings of this study.

CONFIRMING PLANETS RELATIVE TO SYSTEMATIC FALSE ALARMS
In this study, we follow the procedure for statistical validation of transiting planet signals using a Bayesian framework (Fressin et al. 2011; Morton & Johnson 2011; Morton 2012). In summary, the validation calculation proceeds by specifying the likelihood that a signal of interest matches what is expected of a bona fide planet (L_PL), an astrophysical false positive (L_AFP), and a systematic false alarm (L_SFA), as well as a prior probability for each scenario (π(PL), π(AFP), π(SFA)). Overall, the Bayesian probability that a Kepler signal originates from a planet given the data is

P(PL|data) = π(PL) L_PL / [π(PL) L_PL + π(AFP) L_AFP + π(SFA) L_SFA].   (1)

For this study, we concentrate on the probability of the signal originating from a planet relative to a systematic false alarm given the data (i.e., P(PL|data)/P(SFA|data)) and ignore the astrophysical false positive probability. Ignoring the astrophysical false positive probability is justified as we restrict ourselves to the subset of Kepler discoveries that are considered confirmed planets, where previous literature has validated the planet relative to astrophysical false positives alone (i.e., P(AFP|data) < 0.01 P(PL|data)). Thus, we need to quantify π(PL) L_PL / π(SFA) L_SFA.
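As a toy numerical sketch of the validation posterior above (all input numbers below are invented for illustration, not measured Kepler quantities):

```python
# Toy evaluation of the validation posterior; inputs are invented.

def prob_planet(pi_pl, pi_afp, pi_sfa, L_pl=1.0, L_afp=1.0, L_sfa=1.0):
    """P(PL|data) = pi_PL L_PL / (pi_PL L_PL + pi_AFP L_AFP + pi_SFA L_SFA)."""
    num = pi_pl * L_pl
    return num / (num + pi_afp * L_afp + pi_sfa * L_sfa)

# With equal likelihoods and a false alarm prior comparable to the planet
# prior, the posterior falls far short of the 0.99 validation threshold.
print(round(prob_planet(pi_pl=1.0, pi_afp=0.002, pi_sfa=0.86), 3))  # 0.537
```

This makes the later discussion concrete: a prior ratio π_SFA/π_PL near unity drives the posterior toward 0.5 regardless of how well astrophysical false positives are excluded.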

Prior Probabilities and Likelihoods
For the planet prior probability, π(PL), we adopt the observed Kepler objects of interest (KOIs) classified as planet candidates. Using the Kepler planet candidate detections for measuring the planet prior probability was adopted by Torres et al. (2015) to confirm numerous Kepler planets. The Torres et al. (2015) measurement of the planet prior consists of counting the number of Kepler planet candidates having P_orb within a factor of two of the candidate and R_p within 3 σ of the measured planetary radius, and then scaling the resulting count by the number of targets surveyed. This quantification of π(PL) assumes that astrophysical false positives (and systematic false alarms) are rare and that a majority of the planet candidates discovered by Kepler are bona fide planets. Thus, this estimate of π(PL) is an overestimate, but from the expected rates of astrophysical blend scenarios, it is accurate enough within a factor of a few for making statistical validations (Morton 2012). Rather than estimating the planet prior distribution through summing detections in a predefined region of parameter space, we employ kernel density estimation (KDE) techniques to estimate the planet prior as described below.
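The Torres et al. (2015)-style counting can be sketched as follows; the mini-catalog and the candidate parameters below are invented for illustration only:

```python
# Hypothetical sketch of the planet-prior counting: candidates with P_orb
# within a factor of two and R_p within 3 sigma of the measured radius.
# The mini-catalog below is invented for illustration.

def count_similar(catalog, porb, rp, rp_sigma):
    return sum(1 for (p, r) in catalog
               if porb / 2 <= p <= 2 * porb and abs(r - rp) <= 3 * rp_sigma)

catalog = [(350.0, 1.5), (400.0, 1.6), (90.0, 1.4), (380.0, 5.0)]  # (P_orb [d], R_p [Re])
n = count_similar(catalog, porb=385.0, rp=1.6, rp_sigma=0.2)
pi_pl = n / 75522  # scale the count by the number of surveyed targets
print(n)  # 2
```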
In order to quantify π(SFA), it would be preferable to have a quantitative model for the shapes, amplitudes, frequency, and environmental drivers of the systematics present in the Kepler data. Such a model is currently not available for Kepler, especially at the level of SNR approaching the threshold for detection of transit signals in the Kepler pipeline analysis, MES = 7.1 (Jenkins et al. 2002; Christiansen et al. 2012; Jenkins 2017), where the MES (Multiple Event Statistic) represents the SNR for detection in the Kepler pipeline. The MES is the test statistic of the Kepler pipeline search algorithm, which quantifies the presence of a transit signal relative to the null hypothesis. As opposed to the statistic for a single transit event (the Single Event Statistic, SES), the MES is measured over the multiple transit events that contribute to the signal. Instead, in order to measure π(SFA), we rely upon the Kepler data itself to provide a model for false alarm signals. The key insight to quantifying π(SFA) is to modify the Kepler relative flux time series in such a manner that the original periodic dimming events of a transiting planet no longer coherently combine to make detections. We use data permutation techniques of flux inversion and data scrambling (described below) to suppress transit signals in the original data and generate modified time series. The modified time series can be re-searched for transit signals, and the detected signals that are classified as 'planet candidates' are almost entirely due to systematic false alarms. Re-analyzing the full, modified Kepler data set required automating the classification process (i.e., the 'Robovetter') as described in Mullally et al. (2015), Coughlin et al. (2016), and Thompson et al. (2018).
Our definition of π(SFA) is the same as the definition of π(PL), but we take our 'planet candidates' from the search and classification performed on the flux inversion and data scrambling analyses and divide by the number of targets contributing to detections. Coughlin (2017a) and Thompson et al. (2018) describe our two methods for generating the simulated false alarms, but we summarize them here. The first data modification, which we denote as inversion (INV), takes advantage of the fact that the transit signals we seek have a flux deficit. The amplitude of the relative flux time series is modified with a negative sign, such that dimming (brightening) signals become brightening (dimming) signals, respectively. As the planet search algorithm returns dimming signals for consideration, the data inversion converts periodic dimming transit signals into periodic brightening signals, and the systematic false alarms (that previously were periodic brightening events in the unmodified flux time series) become detectable planet candidate signals by the search algorithm. Data inversion relies on a key assumption that the systematic false alarms contaminating the planet candidate signals are symmetric upon data inversion. While some false alarms are symmetric under inversion, such as those due to 'rolling-band' pattern noise (Caldwell et al. 2010), other false alarms are not. For example, the cosmic ray induced sudden pixel sensitivity dropout (SPSD; Jenkins et al. 2010; Stumpe et al. 2012) produces a drop in flux followed by a slow recovery. Another weakness of data inversion is that there is only one synthetic data set that can be generated by this process.
The second data modification employs statistical resampling of the data using a process similar to block bootstrap methods, which we denote as data scrambling (SCR). In order to preserve the time-correlated structure (on time scales 1 < τ < 24 hr) of the false alarms that mimic transit signals while removing the repetitive transit signal on P_orb time scales, we reorder data using yearly or Kepler quarter blocks. For instance, the normal time ordering of Kepler observing quarters, [Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17], was reordered into [Q13, Q14, Q15, Q16, Q9, Q10, Q11, Q12, Q5, Q6, Q7, Q8, Q1, Q2, Q3, Q4, Q17] for one alternative permutation of the data. The block data scrambling/permutation tests performed using the DR 25 pipeline search are described in more detail in Coughlin (2017a) and Thompson et al. (2018). The data scrambling test allows us to test the hypothesis that systematic signals, randomly distributed in time, align by chance and mimic a repeating transit signal, especially at the low-SNR, few-transit (N_tran = 3) limit of the Kepler pipeline search. By performing independent data reorderings, we can generate many synthetic data sets. However, the time-consuming nature of performing the full pipeline and classification steps in the exact same fashion as the observed/unpermuted data limited us to three scrambled iterations. One limitation of the block permutation is that astrophysical and transit signatures with periodicities shorter than the block time scale remain coherent and can lead to detections matching their presence in the original data. Fortunately, this is not an issue in the domain of interest in this paper.
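A minimal sketch of the two data modifications on a toy quarter-blocked light curve (the flux values are placeholders, and the scramble order shown is one yearly-block permutation):

```python
# Toy illustration of the INV and SCR data modifications; the flux values
# are placeholders and each "quarter" is reduced to a short list.

def invert(flux):
    """INV: negate relative flux so dimming transits become brightenings."""
    return [-f for f in flux]

def scramble(blocks, order):
    """SCR: reorder quarter blocks, preserving intra-block correlations."""
    return [blocks[q] for q in order]

quarters = {q: [0.0, 0.0, 0.0] for q in range(1, 18)}  # toy per-quarter flux
scr1 = [13, 14, 15, 16, 9, 10, 11, 12, 5, 6, 7, 8, 1, 2, 3, 4, 17]
reordered = scramble(quarters, scr1)
print(len(reordered), invert([1.0, -2.0, 0.5]))  # 17 [-1.0, 2.0, -0.5]
```

Keeping whole quarters intact is the point of the block permutation: systematics correlated on hour-to-day time scales survive, while a true P_orb-periodic transit no longer phases up.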
Note that astrophysical phenomena can still generate detections in the modified data sets. Strong periodic signals (deep eclipsing binaries, 'heartbeat' tidally induced variables, and self-lensing white dwarfs, to name a few; Sahu & Gilliland 2003; Thompson et al. 2012; Kruse & Agol 2014) can contaminate the false alarm detections in the modified data sets. Thus, the modified planet candidate list is filtered (see Section 2.3.3 of Thompson et al. 2018) to avoid contamination by transiting planets, eclipsing binaries, heartbeat stars, and self-lensing binary systems. In addition to the contamination list from Thompson et al. (2018), and since we are interested in the false alarm contamination among the sample of previously confirmed planets (which generally are of 'higher quality' and undergo more inspection than the typical Kepler planet candidate), we visually inspect the planet candidates from the modified data sets to remove ones with transit signals that do not fit the qualitative expectation for a repeatable transit signal. This by-eye filtering simulates the process by which planet candidates are inspected to identify the most promising candidates suitable for confirmation based upon how well they match the expectation for a transit signal. Although not ideal from a statistical and/or repeatability aspect, visual inspection of the flux time series phased to the orbital period of the detection is one of the most common techniques employed in transit surveys for making decisions on what to prioritize for observational and analysis follow-up. Qualitatively, one prioritizes transit signals that possess a clear box/U-shaped flux decrement offset from the out-of-transit data. In addition, 'good' transit signals have out-of-transit data that is well behaved and does not show evidence for coherent signals of similar depth and duration to the transit signal of interest.
From the inversion test, we remove the planet candidates found for Kepler input catalog (KIC) identifiers 5795353, 10782875, and 7870718. From scrambling test two, we remove planet candidates found for KIC 7531327, 10336674, and 10743850. From scrambling test three, we remove planet candidates found for KIC 5778913, 7694693, and 8176169. No planet candidates were removed from scrambling test one. We do not perform the same visual inspection of the planet candidates from the unmodified data, thus our final sample likely underestimates π(SFA).
We simplify the problem further by assuming that L_PL = L_SFA (i.e., a systematic false alarm signal and a transiting planet signal have equal probabilities of explaining the data). This assumption is valid as the classified planet candidates from the modified and unmodified data sets must pass a battery of 52 quantitative metrics that ensure that the signal is consistent with the transiting planet hypothesis. In particular, the Marshall test, employed by the Kepler Robovetter, performs a formal Bayesian model comparison to ensure that a transit model has comparable (or higher) probability of fitting the individual transit events versus the other models considered, such as a step discontinuity or SPSD. However, metrics like this are ineffective in the low-SNR limit, where the detailed shape of the events cannot be measured. Thus, our detections are equally well modeled as a transit or a systematic. Visual examination of the data (such as from the Kepler data validation reports and Kepler science office vetting documents; Coughlin 2017b) for the planet candidates found in the modified data sets shows that they are qualitatively indistinguishable from the planet candidates found in the unmodified data series (see Mullally et al. 2018, for an example). With equivalent likelihoods, the odds ratio probability P(PL|data)/P(SFA|data) simplifies to the prior probability ratio, π_r = π_PL/π_SFA.

RESULTS FOR SINGLE PLANET SYSTEMS
In order to calculate π(PL) and π(SFA), we use a subset of the Kepler targets that is optimized for studying planet occurrence rates for GK dwarfs. See Burke et al. (2015), Christiansen (2017), and Burke & Catanzarite (2017b) for a discussion, but in summary, in addition to selecting targets based upon their spectral type, targets are selected for their suitability as relatively quiet, well-behaved targets with well-modeled recoverability of transit signals with the Kepler pipeline. The target selection criteria were based upon studying a database of 1.2×10^8 transit injection and recovery trials over the Kepler targets in order to reject targets with noise properties making them unsuitable for an accurate quantification of their recovery of transit signals (Burke & Catanzarite 2017a). Using this sample ensures that our estimate of π(SFA) is not contaminated by detections around targets with ill-behaved or overly noisy flux time series. The selection criteria result in 75,522 Kepler targets having 3900 < T_eff < 6000 K and log g > 3.8, with a similar data quantity selection as Burke et al. (2015), and exclude targets with noise properties resulting in transit recovery statistics that do not follow the adopted recovery model (Burke & Catanzarite 2017b). Stellar parameters are taken from the original DR 25 stellar catalog (Mathur et al. 2017).
We base our calculation of π(PL) on the planet candidates identified by Thompson et al. (2018) in the original/unmodified Kepler DR 25 pipeline analysis, requiring that they are associated with the above stellar sample. Figure 1 shows the distribution of the planet candidates in P_orb versus R_p, focusing on the small, long-period candidates (circles). Values for R_p are taken from the DR 25 planet candidate catalog (Thompson et al. 2018; Hoffman & Rowe 2017) and are not necessarily the best individual values available from the literature. We base our calculation of π(SFA) on the planet candidates identified in the four modified data sets (one from inversion and three from scrambling) (Coughlin 2017a; Thompson et al. 2018). Figure 1 shows the distribution of the planet candidates from the modified data search and classification after our filtering from inversion (triangle), scrambling one (diamond), scrambling two (square), and scrambling three (pentagon). The planet candidates from the modified data sets concentrate toward the small, long-period region. However, they do extend up to R_p ∼ 3.5 R_⊕. R_p is a derived quantity, and the distribution in target stellar radius, photometric noise, and impact parameter results in a wide distribution of SNR values for a given P_orb and R_p. Given this degeneracy in SNR values for a given P_orb and R_p, for the remainder of this study we analyze the planet candidates in the more directly measured quantities of the number of transit events detected, N_tran, and the SNR of the event as quantified by the MES calculated by the Kepler pipeline (Jenkins et al. 2002). Working in the N_tran versus MES parameters improves the validity of our assumption that L_PL = L_SFA and removes the uncertainty in the stellar parameter estimates. Figure 2 shows the same data as in Figure 1, but with the planet candidate distribution in N_tran as a function of MES.
For display purposes, a small uniform random value is added to N_tran in order to avoid point overlap, as N_tran is an integer value. The planet candidates from the modified data sets, representing systematic false alarms, are heavily concentrated near the selection threshold, MES = 7.1. We find that there are 16, 6, 11, and 18 planet candidates passing all the filters from the modified data sets of inversion, scrambling one, scrambling two, and scrambling three, respectively. Specifically, in addition to requiring the planet candidate to be hosted by a member of the stellar sample, we require P_orb > 12 day, 0.6 < R_p < 6.0 R_⊕, N_tran < 100, and MES < 15. There are 346 planet candidates from the unmodified catalog that meet the same selection criteria. Table 1 provides the properties for the planet candidates shown in Figures 1 and 2 from the unmodified and modified data sets.
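The selection cuts above can be expressed as a simple filter; the toy rows below are invented, not actual catalog entries:

```python
# The paper's candidate selection cuts as a filter; toy rows are invented.

def passes_cuts(porb, rp, ntran, mes):
    """P_orb > 12 d, 0.6 < R_p < 6 R_earth, N_tran < 100, MES < 15."""
    return porb > 12.0 and 0.6 < rp < 6.0 and ntran < 100 and mes < 15.0

toy = [  # (P_orb [d], R_p [Re], N_tran, MES) -- illustrative values only
    (384.8, 1.6, 4, 9.8),
    (3.5, 1.1, 300, 25.0),
    (130.0, 1.2, 9, 7.5),
]
kept = [row for row in toy if passes_cuts(*row)]
print(len(kept))  # 2
```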
There are also modified data set planet candidates extending toward short P_orb ∼ 25 day and high N_tran ∼ 47. We find these targets do not have threshold crossing event (TCE) detections from the DR 25 Kepler pipeline run (Twicken et al. 2016), and there are no KOIs associated with these targets in the cumulative KOI table at the NASA Exoplanet Archive. Having false alarm contamination at lower P_orb and higher N_tran was a priori unexpected, given the extremely low probability of chance periodic alignment of systematics, emphasizing the need to measure the contamination rate rather than rely on a priori expectations alone. Definitively elucidating the mechanism responsible for the short period detections is difficult, since MES ∼ 8 and N_tran ∼ 50 detections imply an individual transit SNR ∼ 1 that cannot be distinguished from noise. From visual inspection of the pre-search data conditioning Kepler light curves for these targets, the signals appear to be due to stellar noise that was coherent enough to trigger a detection when either inverted or scrambled. Human evaluation of the Robovetter diagnostic output plots demonstrates that both the human and the algorithmic Robovetter struggle to differentiate between instrumental false alarms, stellar variability, and planet transit signals in this regime. We examined the possibility that the pattern of data scrambling happened to match a bona fide planet with transit timing variations (TTVs) that resulted in a more uniform ephemeris with the imposed data scrambling. The transit events in the scrambling runs were associated with their original timings from the unmodified data series. We searched for plausible orbital periods in the original unmodified data series that result in minimum scatter in orbital phase for the transit events. At these preferred periods we find that the most plausible detections of bona fide planets with TTVs were for the SCR1 candidates with KIC 3241702 and 12453916. The pattern of TTVs requires discontinuous jumps in the transit timings of 0.4 to 0.9 times the transit duration at the yearly block size boundaries. We find the SCR2 candidates with short P_orb require larger discrete jumps of 2-3 times the transit duration at the scramble block boundaries to be explained by TTVs. Given the correlation of the TTV jumps with the scrambling block boundaries, we prefer the explanation that low-level stellar variability produced these short-period candidates in the modified data series, rather than instrumental false alarms or planet transit signals with TTVs.

[Figure 2 caption: Same data as Figure 1, but in the parameters number of transits versus MES, which are more directly related to the observational 'quality' of the transit events. The classified planet candidates from the DR 25 Kepler pipeline run (circle) and planet candidates representing systematic false alarms detected from the modified data sets using inversion (triangle), scrambling one (diamond), scrambling two (square), and scrambling three (pentagon) are shown.]
Using the samples of planet candidates in Figure 2, we calculate the number density of false alarms and planet candidates using KDE techniques. We apply a multivariate KDE analysis using a Gaussian kernel with the bandwidth determined by Silverman's rule of thumb, which is appropriate if the true distribution is Gaussian. Silverman's rule of thumb bases the bandwidth on the sample standard deviation, separately in each dimension. The KDE analysis is performed logarithmically in the number of transits and linearly in MES. The KDE probability densities are converted to number densities by multiplying by the number of planet candidates in each sample; π_SFA is given by the sum of the (simulated, false alarm) planet candidates over the four modified data sets, divided by four to provide the expected number density for a single set of data. Figure 3 shows the log of the relative prior probability, calculated by evaluating π_r at 300×300 uniformly spaced grid locations in the MES versus N_tran parameters. We indicate the specified levels of π_r using color shading.

[Figure 3 caption: Color bar indicates best estimates of the odds ratio level. Based upon the location of Kepler-452b (labeled 452b), there is a 10^−0.07 = 0.86 odds ratio of its transit signal being due to a systematic false alarm relative to being a bona fide planet. Multiple planet systems will have an additional reduction to the odds prior ratio (not shown in the color bar) by a factor of π_γm = 10 and π_γm = 14 for planets in systems with two and systems with three or more planets, respectively.]
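A sketch of the KDE-based prior-ratio map using synthetic stand-in samples (the random draws below are invented; only the sample sizes, 346 unmodified candidates and 51 false alarms summed over the four modified data sets, follow the text):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-ins: (log10 N_tran, MES) pairs for planet candidates
# and for false alarms from the modified data sets (invented draws).
pc = np.vstack([np.log10(rng.integers(3, 90, 346).astype(float)),
                rng.uniform(7.1, 15.0, 346)])
sfa = np.vstack([np.log10(rng.integers(3, 10, 51).astype(float)),
                 7.1 + rng.exponential(1.0, 51)])

# Gaussian kernels with Silverman's rule-of-thumb bandwidth.
kde_pc = gaussian_kde(pc, bw_method="silverman")
kde_sfa = gaussian_kde(sfa, bw_method="silverman")

# Convert probability densities to number densities; the false alarm sum
# is divided by four (one inversion + three scrambling data sets).
grid = np.vstack([np.full(300, np.log10(4.0)), np.linspace(7.1, 15.0, 300)])
pi_pl = kde_pc(grid) * 346
pi_sfa = kde_sfa(grid) * 51 / 4.0
log_pi_r = np.log10(pi_pl / pi_sfa)  # one slice of the Figure 3-style map
print(log_pi_r.shape)  # (300,)
```

The grid here is a single N_tran slice for brevity; the map in the text evaluates π_r on a full 300×300 grid in (MES, N_tran).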
Also shown in Figure 3 are the Kepler confirmed planets (circles), using values for the DR 25 detection MES and derived planet radius provided by the Kepler cumulative table hosted by the NASA Exoplanet Archive (data retrieved 2018 January 16). A select set of confirmed planets that are discussed throughout the text are specifically labeled. As shown by Mullally et al. (2018), we also find that Kepler-452b is in a region highly contaminated by systematic false alarms; Kepler-186f also has detection parameters in the systematic contaminated region. Based on the DR 25 data set, Kepler-452b has the highest π_SFA/π_PL = 0.86 among the previously confirmed Kepler planets. Using the BLENDER analysis techniques, Jenkins et al. (2015) find π_AFP/π_PL = 0.0023 when considering astrophysical sources of false positives only; with the inclusion of the false alarm hypothesis, we find that the odds ratio in favor of the planet hypothesis is ∼360 times lower. Thus, Kepler-452b is not a statistically validated planet, in agreement with Mullally et al. (2018). Besides Kepler-452b, Kepler-186f and Kepler-445d are the only other confirmed planets where the DR 25 estimate of π_r violates the 1% validation level. We defer the discussion of Kepler-186f and Kepler-445d to Section 4, where we take into account their membership in multiple planet systems. Other statistical validations in the literature (in particular BLENDER) adopt a stricter π_AFP/π_PL = 1/370, or 3 σ, confidence in the planetary hypothesis. Adopting this stricter threshold, the confirmed single planet Kepler-443b also fails to meet the 3 σ confidence in the planetary hypothesis for DR 25. The DR 25 π_r estimates for these confirmed planets are provided in Table 2.

RESULTS FOR MULTIPLE PLANET SYSTEMS
The above discussion does not take into account the impact of multi-planet systems. Lissauer et al. (2012), Lissauer et al. (2014), and Rowe et al.
(2014) demonstrate the statistical arguments in favor of validating multiple-candidate systems as planets even in the face of a large (∼0.3) false alarm fraction for single detections (as is the case for the false alarm contamination rates for Kepler in the small planet, long period parameter space). Approximately half of the previously confirmed Kepler planets in Table 2 that do not meet the statistical validation threshold are in multiple planet systems. In this section, we extend the arguments outlined in Lissauer et al. (2014) to a lower SNR parameter space containing the candidates contaminated by false alarms. Lissauer et al. (2014) calculate the multiplanet 'boost' on the planet prior, π_γm, that results from the planet counts and their distribution among the targets. The multiple planet statistics from Lissauer et al. (2012, 2014) were based on half as much Kepler data (Q1-Q8) and adopted an SNR > 10 threshold (equivalent to SNR ≈ 14 for Q1-Q17 data), where SNR is the transit fit signal-to-noise ratio. The SNR > 10 threshold was specifically adopted to avoid false alarm contamination, as they did not have the tools to address this contamination at the time (Rowe et al. 2014). We update the single and multiple planet system counts based upon the planet candidates identified around our control sample of N_t = 75,522 well-behaved G and K dwarfs observed by Kepler (see Section 3). The candidates under investigation in this study are well below the thresholds of Lissauer et al. (2014). We find that, using the DR 25 planet candidate sample down to the MES = 7.1 threshold (required to include the candidates under investigation), π_γm is significantly reduced. Sinukoff et al. (2016) also discuss variations in π_γm for different planet statistics as applied to their K2 survey.
From the target control sample, the DR 25 Kepler planet search and classification finds N c =2125 planet candidates with P orb >1.6 day (the same P orb threshold as Lissauer et al. 2014). The breakdown by planet multiplicity is 1233, 248, 85, 26, 5, and 2 targets having one, two, three, four, five, and six planet candidates, respectively. This yields N m =366 targets hosting two or more planet candidates. We define the reliability of the single candidate host sample relative to systematic false alarms as P = 1/(1 + π SFA /π PL ).
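The multiplicity counts and the reliability definition above can be checked with a few lines of arithmetic. This is an illustrative sketch, not code from the paper; the counts are the values quoted in the text.

```python
# Sketch (not the authors' code): verify the DR 25 multiplicity counts quoted
# in the text and convert a prior odds ratio into a reliability P.
counts = {1: 1233, 2: 248, 3: 85, 4: 26, 5: 5, 6: 2}  # targets per multiplicity

n_candidates = sum(mult * n for mult, n in counts.items())
n_multi = sum(n for mult, n in counts.items() if mult >= 2)

print(n_candidates)  # 2125 planet candidates with P_orb > 1.6 day
print(n_multi)       # 366 targets hosting two or more candidates

def reliability(odds_sfa_pl):
    """P = 1 / (1 + pi_SFA/pi_PL): planet reliability of a single candidate."""
    return 1.0 / (1.0 + odds_sfa_pl)

# Example: the highest prior odds ratio among previously confirmed planets
print(round(reliability(0.86), 2))  # 0.54
```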
As described in Section 2.1 of Lissauer et al. (2012), and further refined in Lissauer et al. (2014), a multiplicity boost on the planet prior is derived based upon two methods. The first uses the expected number of false alarms relative to the observed candidates in multiplanet systems. The second uses the multiple planet target counts to establish a multiplicity boost. Depending on the method and differences between Lissauer et al. (2012) and Lissauer et al. (2014), the multiplanet statistics from Lissauer et al. (2014) yield a range of 16<π γm <35 for targets with two planets and 23<π γm <100 for targets with three or more planets. Using the same methods, but with the multiplanet statistics from this study, the multiplicity boost on the planet prior ranges from 10<π γm <20 for targets with two planets and 14<π γm <50 for targets with three or more planets. π γm is roughly halved with the full DR 25 Kepler planet candidate sample relative to what is found using the earlier sample of planet candidates at higher SNR.
The suppressed multiplicity boost is driven by the higher number of planet candidates found with more data, as the number of observed targets is fixed. This can be demonstrated by considering π γm for targets with three or more planets. The dominant false positive scenario is the 'two or more planets plus one false positive' case (equation 6 of Lissauer et al. 2014). From this contribution alone, one can show that π γm ∝ (N t /N k )(N 3+c /N m ), where N t is the number of targets in the sample, N k is the number of targets with one or more planet candidates, N 3+c is the number of planet candidates associated with targets hosting three or more planets, and N m is the number of targets with two or more planet candidates. Using the multiplanet statistics in this study, the values of N t /N k and N 3+c /N m are 0.58 and 0.89 times the values based upon the sample of planet candidates used in Lissauer et al. (2014), respectively. Overall, the π γm suppression is dominated by having more planet candidates with more data (N t /N k ), with a minor contribution from two planet systems being preferred relative to three or more planet systems with more data (N 3+c /N m ).
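If, as the dominant-scenario argument implies, π γm scales with the product of the N t /N k and N 3+c /N m terms, the quoted 0.58 and 0.89 factors reproduce the roughly factor-of-two suppression. A minimal arithmetic sketch (the ratios are taken from the text):

```python
# Hedged sketch: under the dominant 'two or more planets plus one false
# positive' scenario, pi_gamma_m scales roughly as (N_t/N_k) * (N_3+c/N_m).
# The factors below are this study's values divided by those of
# Lissauer et al. (2014), as quoted in the text.
ratio_nt_nk = 0.58    # change in N_t / N_k
ratio_n3c_nm = 0.89   # change in N_3+c / N_m

suppression = ratio_nt_nk * ratio_n3c_nm
print(round(suppression, 2))  # ~0.52: pi_gamma_m is roughly halved
```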
In order to account for π γm in estimating the systematic false alarm contamination, we scale π r by 1/π γm . From the DR 25 estimate of π r = 0.54 for Kepler-186f, equation 8 of Lissauer et al. (2012), and the conservative π γm =14 for systems with three or more candidates, Kepler-186f no longer meets the 1% criterion for statistical validation. When taking into account false alarm contamination and π γm , we revise the planet hypothesis probability of Kepler-186f to 96%. Quintana et al. (2014) show that Kepler-186f met a 3 σ confidence (>99.7% probability) for the planetary hypothesis when considering astrophysical false positives alone and a conservative π γm based on the sample from Lissauer et al. (2014) and Rowe et al. (2014). We note that Kepler-445d was not detected in the final DR 25 Kepler pipeline run. Thus, we evaluated its single star π r based upon the DR 24 Kepler pipeline run parameters. However, given its DR 24 detection MES=7.13 and the systematic MES difference between pipeline runs (see Section 5 and Figure 4), we consider the test of this object inconclusive, and we are unable to properly determine its confirmation status relative to the false alarm contamination measured in DR 25.
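The Kepler-186f revision above follows directly from the quoted numbers. The following is an illustrative check, not the authors' code:

```python
# Illustrative check: fold the multiplicity boost pi_gamma_m into the DR 25
# prior ratio for Kepler-186f and convert the result to a planet-hypothesis
# probability. Values are those quoted in the text.
pi_r = 0.54        # DR 25 pi_SFA/pi_PL for Kepler-186f (single-star estimate)
pi_gamma_m = 14    # conservative boost for systems with 3+ candidates

pi_r_boosted = pi_r / pi_gamma_m
prob_planet = 1.0 / (1.0 + pi_r_boosted)

print(round(pi_r_boosted, 3))  # 0.039 > 0.01, so the 1% criterion fails
print(round(prob_planet, 2))   # 0.96, the revised 96% planet probability
```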

5. MES VARIATION
In this section, we discuss the sensitivity of the inferred level of systematic false alarm contamination to the uncertainty and systematics in determining the MES detection value for a given signal. MES is approximately the SNR of the transit signal, MES ≈ ∆/σ, where ∆ is the transit depth and σ is the noise averaged over the transit duration. However, in detail MES is a specific quantity based upon the algorithm employed in the Kepler pipeline to estimate the non-stationary noise and its covariance. From a single Kepler data set, and considering statistical noise alone, it is not possible to know the intrinsic depth of a transit signal. Analogously, from a single Kepler data set, it is not possible to know the expected MES value, since it is proportional to the intrinsic transit signal depth. Jenkins et al. (2002) show that for a given signal strength, the MES detection statistic will follow a Gaussian distribution with unit variance around the intrinsic signal strength. Thus, from the measured MES in DR 25, m DR25,⋆ , due to statistical noise, a hypothetical repeat experiment of Kepler (without combining data) would result in p(m|m DR25,⋆ ) = (1/√(2π)) exp[−(m − m DR25,⋆ )²/2], where m is the intrinsic MES expected based on the intrinsic depth of the signal scaled by the noise.
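The unit-variance Gaussian assumption above has a simple consequence: given a measured MES, the probability that the intrinsic MES lies below the survey threshold follows from the normal CDF. A minimal sketch, with m_obs = 9.1 chosen only as an illustrative value near the contaminated regime:

```python
import math

# Sketch under the Jenkins et al. (2002) assumption: the measured MES is
# Gaussian-distributed with unit variance about the intrinsic MES (and, by
# symmetry of the Gaussian, vice versa). m_obs = 9.1 is illustrative only.
def gaussian_cdf(x, mu, sigma=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

m_obs = 9.1
p_below_threshold = gaussian_cdf(7.1, m_obs)  # P(intrinsic MES < 7.1 | m_obs)
print(round(p_below_threshold, 3))  # ~0.023
```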
In addition to the intrinsic uncertainty in measuring MES, there is a systematic uncertainty: MES is not a directly observed quantity. MES is influenced by data analysis choices (e.g., instrument calibration, aperture selection, data conditioning, noise estimation, ephemeris estimation, and planet model fitting) that can lead to systematic variations in MES estimates. Thus, independent analyses of the same underlying data can lead to different MES estimates for a signal, influenced by the numerous analysis choices. We demonstrate the potential for systematic MES variations by comparing the detection MES between the DR 25 and the previous DR 24 Kepler release. We show in Figure 4 the difference in detection MES between KOIs classified as planet candidates with N tran <50 that were detected in both DR 25 and DR 24. For the population of planet candidates in common with MES<50, the median measured MES difference between DR 25 and DR 24 is ∆ = 0.6, with a robust mean absolute deviation of σ = 0.8. The measured MES difference between DR 24 and DR 25 shows that different choices of processing algorithms can perturb the noise components and depth estimates. The average difference in MES between DR 25 and DR 24, ∆ = 0.6, is in the sense that DR 25 has on average lower detection MES than DR 24 for signals in common. The lower MES in DR 25 primarily results from a change to the noise estimation algorithm: for DR 24 and previous releases, photometric noise estimates systematically overestimated the noise at the Kepler observing quarter boundaries and underestimated the noise away from quarter boundaries. The average difference in MES between DR 24 and DR 25 is a warning that comparing SNR estimates between different processing algorithms, or between different pipelines and SNR calculations, is fraught with systematic biases.
Only an analysis of systematic false alarms that consistently analyzes detections in both the observed time series and the modified time series, using the same algorithms and processes, can accurately assess systematic false alarm contamination. [Figure 4 caption: Difference in detection MES between the DR 25 Kepler data release and the previous DR 24 data release. Even though the underlying data set is the same, the scatter in the detection MES difference between data releases illustrates the uncertainty in MES that needs to be taken into account when evaluating the false alarm contamination.]
The variation in MES can have important consequences for estimating systematic false alarm contamination between different processing algorithms. As an illustration, we examine a hypothetical scenario by showing the confirmed planet locations using their DR 24 MES values in relation to the prior ratio estimate of DR 25 in Figure 5. In order to compare to the DR 25 prior ratio estimate, we remove the systematic ∆ = 0.6 offset in MES. Unfortunately, the data permutation data sets do not exist for the DR 24 Kepler pipeline code base, so we cannot verify that the prior ratio estimate from DR 25 is appropriate. However, even though an individual signal can have MES variations, after correcting for the systematic MES difference the variation is symmetrical, and the prior ratio estimate based on the full population of signals is likely to be similar between DR 25 and DR 24. Thus, we assume that π r (m DR24,⋆ ) = π r (m DR25,⋆ ) after removing the systematic ∆ = 0.6 difference between m DR24,⋆ and m DR25,⋆ for signals in common. With these assumptions, we show that using DR 24 MES values (less ∆ = 0.6 MES), Kepler-452b, MES=9.09, and Kepler-186f, MES=9.10, had higher MES estimates relative to DR 25. Assuming a similar prior ratio for DR 24 as for DR 25, Kepler-452b would still be unvalidated at the >1% level, but, including π γm , Kepler-186f would remain validated. Conversely, Kepler-441b has a lower MES in DR 24 relative to DR 25, and one would consider Kepler-441b unvalidated at the >1% level, where it remained validated with the DR 25 analysis. Despite the contradicting validation, the MES variations between DR 24 and DR 25 are within the expected MES variations shown for the signals in common between DR 24 and DR 25. Due to systematic data analysis choices, independent analyses of the same underlying data can come to different conclusions regarding the validation of planet signals. [Figure 5 caption: Same as Figure 3, but using the DR 24 measured MES after removing the average difference in MES between DR 24 and DR 25 shown in Figure 4. Given the level of MES variation between DR 25 and DR 24, and assuming a similar amplitude and pattern of the false alarm contamination for DR 24 as measured in DR 25, Kepler-441b would be considered an unvalidated planet, whereas Kepler-186f (with the multiple planet prior boost) remains a validated planet.]
Out of concern that independent analyses may arrive at contradictory validations for an individual system, we are motivated to identify an extended set of planet confirmations for which contradictory validations due to data analysis choices can occur. To model this possibility, we calculate the predictive posterior distribution of the prior ratio for a hypothetical ensemble of data reprocessings. The predictive posterior distribution provides the expected distribution for the prior ratio in light of a hypothetical ensemble of data reprocessings that result in an unobserved distribution of measured MES values for a given signal, m ⋆ , based upon the DR 25 MES measurement, m DR25,⋆ . The predictive posterior distribution for the prior ratio of an individual signal in an ensemble of reprocessings is given by

p(π r |m DR25,⋆ ) = ∫ Ωm p(π r |m ⋆ ) p(m ⋆ |m DR25,⋆ ) dm ⋆ , (3)

where m ⋆ is the measured MES of a hypothetical reprocessing and Ω m is the integration domain over MES. In equation 3 and what follows, we suppress the dependence on N tran , as its integer value is known without uncertainty. We model the expected distribution of measured MES values due to reprocessing, p(m ⋆ |m DR25,⋆ ), as a zero mean Gaussian with standard deviation σ = 0.8, based upon our empirical distribution of MES variations observed between DR 25 and DR 24.
In order to evaluate the conditional probability, p(π r |m ⋆ ), we note that the prior ratio represents a direct relation between m ⋆ and π r . Thus, the prior ratio can be viewed as a change-of-variable/probability transformation function φ(m ⋆ ) = π r (m ⋆ ). Since a direct relationship between m ⋆ and π r exists, the conditional probability can be written using the Dirac delta generalized function, p(π r |m ⋆ ) = δ(π r − φ(m ⋆ )). Furthermore, we assume that a reprocessing of the data will result in a prior ratio dependence on m ⋆ that is equivalent to what was estimated in the DR 25 analysis (i.e., π r (m ⋆ ) ∼ π r (m DR25,⋆ )). This follows from the arguments given above regarding MES differences between the DR 25 and DR 24 processing. The reprocessing results in symmetric, zero-mean Gaussian-distributed MES variations (after removing the systematic difference in MES).
Due to the numerical nature of the KDE estimate for π r , we calculate p(π r |m DR25,⋆ ) through Monte-Carlo sampling of m ⋆ followed by substitution into the prior ratio result. We perform the predictive posterior distribution estimate using 25,000 Monte-Carlo iterations, sufficient to reach convergence. In order to evaluate the KDE false alarm distribution for MES values near and below the MES=7.1 threshold, we additively reflect the false alarm distribution about MES pk =7.5 (the apparent peak in the false alarm distribution) in order to extrapolate results towards lower MES. The apparent peak in the false alarm distribution is an artifact of the MES=7.1 survey threshold and the well-known bias of KDE implementations to underestimate a distribution near sharp boundaries. In reality, the false alarm distribution continues to increase for MES<MES pk . In particular, evaluating the false alarm prior at MES=6.0, or δMES=1.5 below MES pk , involves evaluating the KDE estimate at MES pk and at MES pk +δMES=9.0. The logarithmic difference in the false alarm prior between MES pk and MES pk +δMES is added to the logarithm of the false alarm prior at MES pk in order to determine the extrapolated false alarm prior at MES pk −δMES. The reflective extrapolation method results in a reduced skewness in the distribution for π r .
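The Monte-Carlo procedure above can be sketched in a few lines. This is a hedged illustration, not the paper's code: log_pi_fa is a placeholder stand-in for the paper's KDE of the false alarm prior, and the reflection about MES pk =7.5 implements the logarithmic extrapolation described in the text.

```python
import math
import random

# Hedged sketch of the Section 5 procedure: sample hypothetical reprocessed
# MES values about the DR 25 measurement (zero-mean Gaussian, sigma = 0.8)
# and push them through a prior-ratio curve. log_pi_fa is an assumed
# placeholder for the paper's KDE; the reflection about MES_PK extrapolates
# the false alarm prior to MES < MES_PK as described in the text.
MES_PK = 7.5

def log_pi_fa(mes):
    # Placeholder (assumption): log10 false alarm prior falling with MES.
    return -0.5 * (mes - MES_PK)

def log_pi_fa_extrapolated(mes):
    if mes >= MES_PK:
        return log_pi_fa(mes)
    delta = MES_PK - mes
    # Reflect: add the logarithmic drop between MES_PK and MES_PK + delta
    # to the value at MES_PK, so the prior keeps rising below the peak.
    return log_pi_fa(MES_PK) + (log_pi_fa(MES_PK) - log_pi_fa(MES_PK + delta))

def sample_pi_r(m_dr25, n_iter=25000, sigma=0.8, seed=1):
    rng = random.Random(seed)
    samples = [10.0 ** log_pi_fa_extrapolated(m_dr25 + rng.gauss(0.0, sigma))
               for _ in range(n_iter)]
    return sorted(samples)

pi_r_samples = sample_pi_r(9.1)  # m_dr25 = 9.1 is illustrative only
median = pi_r_samples[len(pi_r_samples) // 2]
p95 = pi_r_samples[int(0.95 * len(pi_r_samples))]
print(median < p95)  # True: the sampled distribution is broad and skewed
```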
The resulting distribution of π r for confirmed planets in and near the false alarm contamination region spans 20 dex and is highly skewed, with the mode of the distribution typically occurring near its 95th percentile. In Table 2, we report the median (π r,med ), mode (π r,mod ), 68.2th percentile (π r,1 ), and 95th percentile (π r,2 ) of the distribution of π r resulting from the Monte-Carlo evaluation for a selection of previously confirmed Kepler planets that potentially have π r distributions overlapping with the 1% validation threshold. We require that the mode of the distribution satisfy π r,mod <0.01. In order to ensure that the mode peak is significant relative to the broader underlying distribution, we also require that the 95th percentile satisfy π r,2 <0.01. Based upon the variations in MES observed between DR 25 and DR 24, it is possible that alternative analyses can influence the outcome of a strict cut on whether a planet remains validated. Our model in this section considers a statistical ensemble of outcomes influencing the measured MES values for transit signals. If one wants to make a validation decision robust against the influence of data analysis decisions, then this model indicates that Kepler-452b, Kepler-186f, Kepler-441b, Kepler-443b, Kepler-1633b, Kepler-1178b, and Kepler-1653b have a non-negligible probability of overlapping with the 1% validation threshold. Given the subtle data analysis decisions that lead to a detection and a MES estimate, it is difficult to exactly assign a planet probability relative to systematic false alarms. Furthermore, the black and white decision as to whether to consider a planet candidate a confirmed planet depends upon the risk tolerance of the scientific question at hand. In this paper, we outline a specific set of criteria that lead us to conclude that the above extended list of planets are no longer confirmed.
Other reasonable data analysis choices and criteria could lead to different black and white confirmation choices, but it should be acknowledged that including planets from the above list risks contaminating a bona-fide planet sample at the >1% level, increasingly so as one includes detections towards lower MES and fewer N tran .
6. QUIETEST Kepler DETECTORS

Electronic noise properties vary among the detectors that make up the Kepler camera (Van Cleve & Caldwell 2016). Some detectors have elevated levels of rolling band pattern noise, elevated read noise, or out-of-spec gain values. The detectors impacted by these less ideal properties are highlighted in Table 13 of Van Cleve & Caldwell (2016). We select all the Kepler camera channels that have a 'yellow' or 'red' indicator in Table 13 of Van Cleve & Caldwell (2016) and exclude Kepler targets that pass through these noisier channels during any Kepler observing season. Of the 206,150 Kepler targets that have at least one quarter of observations, 93,079 remain (a 55% reduction) after removing targets that pass through the designated noisier channels. After the spectral type, data quality, and analysis region cuts, there are 346 observed planet candidates. Requiring the target host to be on the quietest channels results in 150 planet candidates (57% fewer) for analysis. The systematic false alarm candidates are reduced from 51 down to 15 (71% fewer). The larger fractional reduction in false alarm candidates suggests the false alarms are slightly over-represented on the noisy channels. Candidates and false alarms that remain for the quietest channels are indicated by a flag in Table 1. Figure 6 repeats the prior ratio analysis for the subset of candidates detected on the quietest channels. The elevated false alarm to planet prior ratio region does not extend to as high a MES in the low N tran region for the quietest channel data relative to including all Kepler channels. We also note that the unexpected population of false alarm candidates with N tran >10 remains represented in the sample of detections on the quietest Kepler detectors. For the main conclusions of this study, we adopt results using targets from all channels as in Section 3.
We present the quiet channel analysis in order to demonstrate that the systematic false alarm contamination is present even when limiting the analysis to the most well-behaved Kepler channels, and that it is qualitatively consistent with the all-channel analysis. For DR 25 and the 'well-behaved' Kepler detectors, Kepler-452b and Kepler-186f remain unvalidated due to false alarm contamination.

7. DISCUSSION
Unconfirming a planet does not imply that it is definitively a false positive, only that the KOI should properly be considered a planet candidate. Even in the case of Kepler-452b (the previously confirmed planet suffering the highest false alarm contamination), P(SFA)/P(PL)=0.86, indicating that the previously confirmed Kepler planets are more likely bona-fide planets than systematic false alarms. The accepted convention in the literature is to consider a planet statistically confirmed if the planet hypothesis is favored at the 99% level. This threshold is arbitrary, but has the qualitative goal of producing a sample of transiting planets confirmed with a reliability approaching that of radial velocity surveys. However, depending on the scientific goal and the risk posture of the investigation, including lower fidelity planet candidates in a confirmed planet sample may be worthwhile.

The Case for Independent Detection
There are several avenues to eliminate the false alarm hypothesis and reconfirm these planets. First, high precision radial velocity observations can detect the planet by measuring its mass with an independent method. A planet mass measurement is strongly aided by knowing the orbital period of the signal a priori rather than having to search for the orbital period in the radial velocity data alone. However, despite this advantage, the current radial velocity precision and available observing time prevent radial velocity confirmation of these faint (13<Kpmag<15) Kepler targets with small expected radial velocity semi-amplitudes (0.3<K p <0.9 m s −1 for the planets unconfirmed in this study). Alternatively, the false alarm hypothesis could be ruled out with a better model of the low-level systematics in Kepler data, an understanding of their environmental drivers, or new vetting metrics that more cleanly differentiate the false alarms from planet candidates. Machine learning techniques applied to Kepler data have been helpful in this direction (McCauliff et al. 2015; Thompson et al. 2015; Armstrong et al. 2017; Shallue & Vanderburg 2018; Pearson et al. 2018), but have yet to outperform the expert guided decision tree method employed by the Robovetter. However, the Robovetter does employ machine learning techniques as a subset of its decision metrics. The additional analysis must include data other than the flux time series alone, as the SNR of these candidates in the flux time series is insufficient to distinguish systematic contamination from planet transit signals.
The final possibility for eliminating the false alarm hypothesis is to re-observe the transit events with an independent instrument. The long P orb >100 day periods and long transit durations of these Kepler candidates preclude the use of large ground-based observatories: the chance of a transit event optimally timed for mid-transit near meridian crossing at a large telescope is negligible. In addition, the ephemeris uncertainty (for Kepler-452b, the current 6 hr uncertainty in ingress time relative to mid-transit grows by ∼30 min every four transit events) prevents taking advantage of possible observing chances from the ground. HST is currently the only available resource for timely follow-up of these important small, long-period Kepler candidates. Assuming HST can achieve orbit-to-orbit photometric stability approaching the Poisson expectation, we expect HST to achieve approximately two times higher SNR than Kepler for a single transit event employing the long pass F350LP filter on the WFC3 UVIS channel. Spitzer, with its smaller aperture and relatively narrower bandpass, has a ∼6 times lower SNR than HST for a single transit event and G dwarf hosts. Toward this end, we executed a pilot HST Cycle 25 program (Program ID: 15129) to recover a transit of the habitable zone super-Earth Kepler-62f (P orb =267 day). The purpose of the HST program is to demonstrate the feasibility of HST to reconfirm important individual Kepler discoveries. In addition, HST observations of a statistically large sample of Kepler candidates are critical for an accurate planet occurrence measurement in the regime of terrestrial, habitable zone planets orbiting GK dwarfs in light of the significant false alarm contamination (Burke et al. 2015). The modified data series currently provides our best method of measuring the false alarm contamination impacting statistical validation and planet occurrence rates.
However, there is no guarantee that our procedure faithfully represents the true underlying false alarms present in the original unmodified data, and the HST observations can eliminate this concern. Kepler-62f was chosen because, when the contamination is cast in the P orb and R p parameters, it suffers enhanced false alarm contamination. We show in Figure 7 the same analysis as shown in Figure 3, but in the alternative parameterization of P orb and R p . Similar to Figure 1, Figure 7 shows that, due to the spread in SNR for a given location in P orb and R p , an elevated systematic false alarm rate can occur over a large region of parameter space. From an occurrence rate perspective, it is necessary to understand false alarms when cast in the P orb and R p basis. However, for this study, the mix of SNR levels implies a varying likelihood, which violates our assumption of L PL = L SFA . Our expectation is that Kepler-62f avoids the false alarm contamination, as its MES=14.3 separates it from the contaminating false alarms. Kepler-62f provides a higher SNR test case for the capabilities of HST before attempting to reconfirm lower SNR Kepler candidates.

Minimum SNR for Statistical Confirmation
It has generally been accepted in the statistical validation of Kepler candidates from the prime mission (Rowe et al. 2014; Morton et al. 2016; Torres et al. 2017) that adopting SNR>10 for Kepler planet candidates is sufficient to avoid contamination by false alarms. Before an acceptable threshold to avoid false alarms had been quantified, limiting the sample to SNR>10 was helpful guidance. Based on the DR 25 analysis, from Figure 3, we find that enabling a 1% validation relative to false alarms for single planet systems necessitates MES≳9.0 for systems with N tran <10 events and MES≳8.0 for 10<N tran <60. In addition, we find that the MES detection statistic from the Kepler pipeline is systematically lower than the SNR derived from the transit fit. The SNR of the transit fit, rather than MES, is what has been employed when defining a sample for validation. For DR 25, we find that the median ratio between the detection MES and the transit fit SNR is 0.81, with a sample standard deviation of 0.5 in the ratio. The transit fit SNR is typically expected to be higher, and can be systematically different between algorithms due to different choices in the pre-detection/pre-fit filter, knowing ahead of time where the event is located in the case of fitting, and different choices for the duration and depth of the transit event. Accounting for the systematic difference between the detection MES and the transit fit SNR, the threshold to avoid false alarms during statistical validation of single planet systems in DR 25 is SNR≳11.2 for N tran <10 events and SNR≳10 for 10<N tran <60 events. Taking into account π γm , multiple planet systems avoid false alarms for SNR≳10 for N tran <10 events, and avoid them down to the detection threshold for N tran >10. These thresholds are appropriate for the DR 25 pipeline analysis. Detections from Kepler data using other pipelines or alternative data processing, without a corresponding quantification of the false alarm population, require higher thresholds.
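The MES-to-SNR conversion above is a simple rescaling by the quoted median ratio. A hedged sketch (the assumption is that dividing by the median ratio of 0.81 approximates the equivalent transit fit SNR; the exact thresholds in the text were derived from the full distribution):

```python
# Illustrative sketch: convert a detection-MES threshold to an approximate
# transit-fit SNR threshold using the DR 25 median MES-to-SNR ratio of 0.81.
# The mapping SNR ~ MES / 0.81 is an assumed simplification for illustration.
MES_TO_SNR = 0.81

mes_thresholds = {"Ntran < 10": 9.0, "10 < Ntran < 60": 8.0}
snr_thresholds = {k: round(v / MES_TO_SNR, 1) for k, v in mes_thresholds.items()}
print(snr_thresholds)  # ~11.1 and ~9.9, consistent with SNR ~ 11.2 and ~10
```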
We show in Section 5 that modest changes in data analysis can result in MES variations. To be robust against potential MES variation, MES≳10.5 for systems with N tran <10 events and MES≳9.5 for 10<N tran <60 (SNR≳13 and SNR≳11.9, respectively) is needed.
This study provides important lessons for the continuing K2 phase of the Kepler mission (Howell et al. 2014) and the upcoming TESS mission (Ricker et al. 2016). Systematic false alarms are a dominant contamination source preventing statistical validation when the detection is made within several dex of the significance threshold adopted for the survey. To complicate the situation, given the variety and expansive set of K2 search pipelines and their dissimilar data systematics, general statements such as the above guidance based upon Kepler may not apply (Vanderburg & Johnson 2014; Armstrong et al. 2015; Foreman-Mackey et al. 2015; Montet et al. 2015; Sanchis-Ojeda et al. 2015; Aigrain et al. 2016; Crossfield et al. 2016; Kovács et al. 2016; Luger et al. 2016). For instance, Crossfield et al. (2016) adopt a threshold of SNR>12 for detection. Without tests such as inversion and block bootstrap resampling, is SNR>13 (as for Kepler single planets) sufficient to avoid false alarm contamination, or is a ∆SNR=3.5 offset relative to the detection threshold (implying SNR>15.5), or something else entirely, required? In addition, the SNR values for the same signal may be systematically different from one practitioner to another (as we have demonstrated by comparing the Kepler MES and the SNR from transit fitting). At least for Kepler, we have shown that the false alarm hypothesis exceeds the astrophysical false positive hypothesis in the low MES, few transit event regime for the planet detections that were previously confirmed.
The contribution of false alarms should not be ignored when using the statistical validation method for planet confirmation. Furthermore, if the primary goal of a transit study is to identify candidates for statistical validation, then analyzing inverted and block bootstrap permuted data in order to characterize the false alarm rate of the survey is more important than characterizing the survey completeness through transit injection and recovery. However, this warning is tempered by the fact that the K2 and TESS missions have access to brighter targets, have a short observing baseline (∼30 day), and, in the case of TESS, a smaller aperture than Kepler. Thus, more traditional means of radial velocity confirmation and ground-based transit recovery are viable options for eliminating false alarm contamination.
The same warning applies to employing statistical validation on samples of detections from an independent and/or 'deeper' search of Kepler prime mission data. Recently, Shallue & Vanderburg (2018) announced the confirmation of Kepler-90i and Kepler-80g, detected during an independent search and machine learning classification of Kepler prime mission data. The candidates Kepler-90i and Kepler-80g were detected near SNR∼9 by Shallue & Vanderburg (2018), who did not quantify the systematic false alarm hypothesis during the validation of their planets. Based upon the Kepler pipeline and classification performance, we do not find significant false alarm contamination down to the MES=7.1 threshold in the P orb ∼14 day (N tran ∼100) regime of these two detections. However, Shallue & Vanderburg (2018) did run their machine learning classification on the inversion threshold crossing events (TCEs) identified by the Kepler pipeline (MES>7.1 threshold). Their machine learning classifier finds four times more planet candidates than the Robovetter. They do not report the distribution in P orb and R p of their false alarm planet candidates, so we are unable to determine whether these overlap with their planet confirmations. We conjecture that if they encounter a four times higher false alarm rate when classifying TCEs identified by the Kepler pipeline employing a MES>7.1 threshold, then their independent search employing a deeper SNR>5 threshold will result in false alarm contamination too large to maintain a 1% threshold for statistical validation. The multiple planet prior boost would also be further suppressed by adopting a lower SNR detection threshold.

8. CONCLUSION
The statistical validation technique has provided a bountiful population of planets with which to constrain the planetary system outcomes of planet formation. However, we have shown that for the Kepler prime mission discoveries, statistical validations currently cannot be extended down to the detection threshold of the survey. Most, but not all, statistical validation studies of Kepler discoveries limited the planet candidate sample by requiring SNR>10 in the transit model fit in order to render the false alarm contamination hypothesis negligible. We find that this qualitative judgment was a very helpful guide, but not conservative enough to avoid false alarm contamination preventing >99% planet probability confirmation when faced with a quantitative measurement of the false alarm population. In particular, Kepler-452b and Kepler-186f suffer the highest levels of false alarm contamination among the previously confirmed Kepler planets in DR 25. They have false alarm probabilities that exceed the 1% validation level in DR 25. Furthermore, the extended set of planets Kepler-441b, Kepler-443b, Kepler-1633b, Kepler-1653b, and Kepler-1178b have a non-negligible probability of exceeding the 1% validation level when taking into account a model that considers a statistical ensemble of processing variations. Given the sensitivity of the MES determination for a given signal to subtle data analysis choices, it is difficult to exactly measure the planet probability relative to systematic false alarms. Qualitatively, the confirmation confidence level is intended to produce a sample of transiting planets confirmed with a reliability approaching that of confirming the transit signal by additionally measuring the planet mass from radial velocity observations. However, depending on the scientific goal and the risk posture of the investigation, including lower fidelity planet candidates, such as the ones unconfirmed in this study, in a confirmed planet sample may be worthwhile.
Our findings do not preclude the possibility that the planets unconfirmed in this study are in fact bona fide planets. The periodic dimming signals representing these candidates cleanly pass the battery of 52 vetting metrics employed by the Robovetter as consistent with a planet transit model; the betting odds are in favor of them being planets. In order to reconfirm these planet candidates, we find that independent detection of the transit event with HST represents the most viable option for these relatively faint targets with P_orb > 100 day and shallow transits. We estimate that HST can achieve twice the SNR per transit of Kepler. A pilot HST Cycle 25 program (ID: 15129) was developed in order to demonstrate that HST can achieve the orbit-to-orbit photometric stability at the Poisson limit needed to re-observe a transit of Kepler-62f.
Another viable option for reconfirmation is to further understand the drivers responsible for the systematic false alarms in Kepler, enabling their removal. Table 1 provides a listing of the small, long-period candidates identified in the unmodified and modified data sets. A targeted examination of this short list of candidates, in conjunction with the pixel-level transit injection database (Christiansen 2017), may provide additional vetting metrics that can significantly separate the false alarm detections from the ground-truth injected signals.
Avoiding false alarm contamination for long-period Kepler candidates requires selecting candidates from DR 25 with a detection MES > 9. This represents ∆MES = 1.9 above the detection threshold of MES = 7.1. In addition, the systematic difference between the transit fit SNR and the detection MES results in an equivalent threshold of SNR > 11.2. This SNR threshold applies to single candidate systems; a SNR > 10 threshold is appropriate for multiple candidate systems, taking into account the multiple planet prior boost. As guidance for other transit surveys, such as K2 and TESS, it is not clear whether to treat these thresholds in an absolute sense (SNR ∼ 11) or a relative sense (∆ ∼ 2 above the survey's adopted threshold). In lieu of measuring the systematic false alarm rate, we recommend adopting the more conservative threshold. However, due to the relatively brighter hosts and shorter observing baseline, many of the K2 and TESS candidates can eliminate the systematic false alarm contamination down to the survey's detection threshold by independently confirming the transit signal with relative photometry observations or radial velocity measurements.

a DR 25 estimate of the prior odds ratio between the systematic false alarm scenario and the planet scenario, π_r.
b Median estimate of the π_r distribution using a model accounting for processing variations.
c Mode estimate of the π_r distribution using a model accounting for processing variations.
d 68.2th percentile (1σ) estimate of the π_r distribution using a model accounting for processing variations.
e 95th percentile (2σ) estimate of the π_r distribution using a model accounting for processing variations.
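The threshold guidance above can be summarized as a small selection rule. The function below is a sketch of the DR 25 recommendations only (single candidate systems: transit fit SNR > 11.2, equivalent to detection MES > 9; multiple candidate systems: SNR > 10, reflecting the multiple planet prior boost), not a general-purpose validation criterion for other surveys:

```python
def eligible_for_validation(snr, in_multi_system):
    """Return True if a DR 25 long-period candidate clears the transit fit
    SNR threshold at which systematic false alarm contamination becomes
    negligible (sketch of the thresholds recommended in this work)."""
    threshold = 10.0 if in_multi_system else 11.2
    return snr > threshold

# A single candidate system near the old SNR > 10 rule of thumb fails the
# stricter DR 25 criterion, while the same SNR in a multi system passes.
print(eligible_for_validation(10.5, in_multi_system=False))  # False
print(eligible_for_validation(10.5, in_multi_system=True))   # True
```

For K2 or TESS, whether to apply these numbers absolutely or as an offset of ∆ ∼ 2 above that survey's own detection threshold remains an open question, as discussed above.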