Reliability Correction is Key for Robust Kepler Occurrence Rates

The Kepler DR25 planet candidate catalog was produced using an automated method of planet candidate identification based on various tests. These tests were tuned to obtain a reasonable but arbitrary balance between catalog completeness and reliability. We produce new catalogs with differing balances of completeness and reliability by varying these tests, and study the impact of these alternative catalogs on occurrence rates. We find that if there is no correction for reliability, different catalogs give statistically inconsistent occurrence rates, while if we correct for both completeness and reliability, we get statistically consistent occurrence rates. This is a strong indication that correction for completeness and reliability is critical for the accurate computation of occurrence rates. Additionally, we find that this result is the same whether using Bayesian Poisson likelihood MCMC or Approximate Bayesian Computation methods. We also examine the use of a Robovetter disposition score cut as an alternative to reliability correction, and find that while a score cut does increase the reliability of the catalog, it is not as accurate as performing a full reliability correction. We get the same result when performing a reliability correction with and without a score cut. Therefore removing low-score planets removes data without providing any advantage, and should be avoided when possible. We make our alternative catalogs publicly available, and propose that these should be used as a test of occurrence rate methods, with the requirement that a method should provide statistically consistent occurrence rates for all these catalogs.


INTRODUCTION
The Kepler space telescope (Koch et al. 2010) has delivered unique data that enables the characterization of exoplanet population statistics, from hot Jupiters in short-period orbits to terrestrial-size rocky planets in orbits with periods up to ∼one year 1 . By observing >150,000 stars nearly continuously for four years looking for transiting exoplanets, Kepler detected over 4,000 planet candidates (PCs) (Thompson et al. 2018), leading to the confirmation or statistical validation of over 2,300 exoplanets. This rich trove of exoplanet data has delivered many insights into exoplanet structure and formation, and promises deeper insights with further analysis. One of the most exciting insights to be gained from Kepler data is η⊕, the occurrence rate of temperate, terrestrial-size planets orbiting Sun-like stars. η⊕ is also a critical input to the design of future space telescopes for the characterization of habitable exoplanets.
Fully utilizing Kepler data to calculate accurate occurrence rates requires a thorough understanding of how well it reflects the underlying exoplanet population. There are several ways in which the Kepler planet candidate catalog does not directly measure the real planet population. In this paper we focus on the following:

• The catalog is incomplete, missing real planets.

• The catalog is unreliable, being polluted with false positives (FPs).

Low completeness and low reliability are particularly acute at the Kepler detection limit, which happens to coincide with the period and radius of Earth-Sun analog exoplanets. We therefore focus our attention on the period and radius range 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕, which spans the Kepler detection limit. We refer to this range as our domain of analysis.

Bryson et al. (2020) developed a method for using data provided in the final Kepler data release (DR25) to characterize the completeness and reliability of the DR25 planet candidate catalog, and used this characterization in occurrence rate calculations. The occurrence rates presented in Bryson et al. (2020) were not considered definitive, and several issues with the occurrence rate calculations were discussed in detail. In this paper we address two of these issues: reliance on the specific balance between completeness and reliability of the DR25 planet candidate catalog, and the assumption that planet occurrence is described by a Poisson likelihood.
We address possible dependence on the balance of the DR25 PC catalog between completeness and reliability by creating new PC catalogs that give greater emphasis to either completeness or reliability. As described in §2.3, these new catalogs provide lists of planet candidates in Kepler data that are just as valid as the DR25 PC catalog, so we would expect a good occurrence rate measurement to give the same result for all of these catalogs. We will find that, when correcting for completeness and reliability, the method of Bryson et al. (2020) computes the same occurrence rates for all the catalogs.
We address the possible dependence on the assumption of a Poisson likelihood by computing the occurrence rates using Approximate Bayesian Computation (ABC) as described in Kunimoto & Matthews (2020) and Kunimoto & Bryson (2020), following methods developed in Mulders et al. (2018) and He et al. (2019). The ABC method treats completeness and reliability, as well as the statistics of the planet population, very differently from the Poisson likelihood method used by Bryson et al. (2020). We find that both methods produce essentially the same occurrence rates.
We also examine the use of the DR25 Robovetter Disposition Score (Thompson et al. 2018) to provide a high-reliability planet candidate catalog, potentially making a correction for reliability unnecessary. We use the consistency of results for the PC catalogs presented in this paper as the diagnostic criterion. We find that, while using the disposition score significantly improves consistency, correction for reliability is still indicated. Not using the disposition score and correcting for reliability gives the most consistent results.

This paper is structured as follows: in §2.1 and §2.2 we review the DR25 catalog and stellar properties used in Bryson et al. (2020). We describe our alternative planet candidate catalogs in §2.3, and the Bayesian computation of the planet population and occurrence rates in §2.4. We present our results in §3, and interpret these results in §4.
All results reported in this paper were produced with Python code, mostly in the form of Python Jupyter notebooks, found at the paper GitHub site 2 . This site also contains the alternative PC catalogs described in §2.3.
2. METHODOLOGY

2.1. DR25 Completeness, Reliability, and Score

Each planet candidate catalog starts with the DR25 catalog of 34,032 threshold crossing events (TCEs) (Twicken et al. 2016), which are periodic transit-like events, identified by a matched filter (Jenkins 2002), that have a combined signal strength above a threshold (set to 7.1σ for DR25). Identification of the PCs from the TCEs was performed by a fully automated Robovetter (Coughlin 2017). The Robovetter applies a variety of tests to each TCE, many of which were tuned on the synthetic test datasets described below. When a TCE passes tests indicating a resemblance to a planetary transit or eclipsing binary, it is elevated to a Kepler Object of Interest (KOI). If the KOI passes further tests, it is elevated to planet candidate (PC) status. Such automated vetting (and transit detection) is critical for the production of a statistically uniform catalog that is amenable to statistical correction for completeness and reliability. The DR25 planet candidate catalog (Thompson et al. 2018) contains 4034 identified PCs out of 8054 KOIs.
The DR25 Robovetter uses a number of metrics to identify instrumental false alarms, and the inverted and scrambled data sets were used to tune their pass/fail thresholds. For an extensive discussion, see Thompson et al. (2018).

Detection and Vetting Completeness
The DR25 completeness products are based on injected data: a ground truth of transiting planets obtained by injecting artificial transit signals with known characteristics into all observed stars at the pixel level (Christiansen 2017). A large number of transits were also injected on a small number of target stars to measure the dependence of completeness on transit parameters and stellar properties. These data were then analyzed by the Kepler detection pipeline to produce a catalog of detections at the injected ephemerides, called injected and recovered TCEs, which were then sent through the same Robovetter used to identify planet candidates.
Detection completeness is defined as the fraction of injected transits that are recovered as TCEs by the Kepler detection pipeline, regardless of whether or not those TCEs are subsequently identified as planet candidates. We use the detection completeness of Burke & Catanzarite (2017), which was computed for each target star as a function of period and simulated Multiple Event Statistic (MES; a measure of the signal-to-noise ratio (S/N) specific to the Kepler pipeline), based on stellar noise properties measured in that star's Kepler light curve. The result is referred to as the detection completeness contours.
Vetting completeness is defined as the fraction of detected injected transits that were identified as planet candidates by the Robovetter. Vetting completeness is computed for a population of stars as a function of the simulated MES and orbital period of the injected transits. We use the method of Bryson et al. (2020), which models vetting completeness as a binomial problem with a rate given by a product of rotated logistic functions of MES and orbital period. The specific shape of the logistic functions depends on the Robovetter thresholds. The detection completeness contours are multiplied by this vetting completeness function for each star.
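As an illustration of this vetting completeness model, the sketch below evaluates a binomial rate as a product of logistic functions of rotated (MES, log-period) coordinates. The rotation angle and all shape parameters here are placeholders, not the fitted Robovetter values.

```python
import numpy as np

def logistic(x, x0, k):
    """Logistic curve rising from 0 to 1 around x0 with width k."""
    return 1.0 / (1.0 + np.exp(-(x - x0) / k))

def vetting_completeness(mes, period, params):
    """Illustrative rotated-logistic vetting completeness model.

    The (MES, log-period) coordinates are rotated by angle theta, and a
    logistic curve is applied along each rotated axis; their product is the
    binomial rate at which detected injections pass the Robovetter.
    All parameter values are placeholders, not fitted values.
    """
    x0_u, k_u, x0_v, k_v, theta = params
    u = np.cos(theta) * mes + np.sin(theta) * np.log10(period)
    v = -np.sin(theta) * mes + np.cos(theta) * np.log10(period)
    return logistic(u, x0_u, k_u) * logistic(v, x0_v, k_v)
```

In the paper's model this rate enters a binomial likelihood for the number of injected-and-recovered TCEs that pass vetting, so the fitted parameters depend on the Robovetter thresholds used for each catalog.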
The product of vetting and detection completeness as a function of period and MES is converted to a function of period and planet radius for each star. This product is further multiplied by the geometrical transit probability for each star, which is a function of planet period and radius, given that star's radius. The final result is a completeness contour for each star that includes detection and vetting completeness, as well as geometrical transit probability.
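A per-star completeness contour of this kind can be assembled as below. The circular-orbit transit probability and the solar-unit form of Kepler's third law are standard, but the function names and inputs are illustrative, not the paper's implementation.

```python
import numpy as np

R_SUN_AU = 0.00465  # solar radius in AU

def transit_probability(period_days, m_star_msun, r_star_rsun):
    """Geometric transit probability ~ R_star / a for a circular orbit,
    with semi-major axis a from Kepler's third law in solar units."""
    a_au = m_star_msun ** (1.0 / 3.0) * (period_days / 365.25) ** (2.0 / 3.0)
    return np.minimum(1.0, r_star_rsun * R_SUN_AU / a_au)

def completeness_contour(det_comp, vet_comp, period_days, m_star_msun, r_star_rsun):
    """Per-star completeness on a (period, radius) grid: detection
    completeness x vetting completeness x geometric transit probability."""
    return det_comp * vet_comp * transit_probability(period_days, m_star_msun, r_star_rsun)
```

For an Earth-Sun analog (period = 365.25 days, solar mass and radius) the geometric transit probability is about 0.5%, which is why summing these contours over tens of thousands of stars is needed before occurrence rates near the detection limit can be inferred.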

Vetting Reliability
Vetting reliability is the fraction of planet candidates that are true planets (Thompson et al. 2018). Reliability is estimated by determining the rate of two types of misidentified planet candidates: astrophysical false positives and non-transit-like false alarms. Astrophysical false positives are periodic transit-like signals that are not caused by planets, most often eclipsing binaries. We evaluate the probability that a PC is an astrophysical false positive using the False Positive Probability (FPP) of Morton et al. (2016). False alarms can be caused by stellar variability but, especially at longer periods, are dominated by instrumental artifacts (Thompson et al. 2018). False alarm reliability R FA is defined as the fraction of PCs that are not false alarms. The final reliability for each planet candidate is the product (1 − FPP) R FA.

False alarm reliability R FA is decomposed into two rates: the false alarm effectiveness E FA, the fraction of true false alarms correctly identified as false alarms by the Robovetter, and the observed false alarm rate F FA, the fraction of TCEs classified as false alarms. E FA is measured using observed data manipulated to eliminate true periodic transit-like signals, called inverted and scrambled data, described in Thompson et al. (2018); any TCE found in the inverted and scrambled data is a false alarm. The reliability against false alarms is then given by (Thompson et al. 2018)

R FA = 1 − [(1 − E FA)/E FA] × [F FA/(1 − F FA)].

R FA is computed as a two-dimensional function of MES and period, based on which PCs were found in the scrambled and inverted data and on the observed false alarms. The reliability of a PC is computed by evaluating R FA at that PC's MES and period. In Bryson et al. (2020), E FA and F FA are characterized separately. However, we have found that this separate characterization resulted in negative values of R FA, which are not meaningful, when applied to the high-completeness catalog described in §2.3.
Therefore in this paper we characterize E FA and F FA in a joint MCMC inference using a likelihood formed from the product of the individual likelihoods for E FA and F FA . The requirement of 0 ≤ R FA ≤ 1 is enforced by making this requirement part of the prior for the joint inference. We performed this joint inference of E FA and F FA on all catalogs discussed in this paper, including DR25.
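Concretely, with F FA expressed as a fraction of all TCEs, the Thompson et al. (2018) relation gives the false alarm reliability as a function of effectiveness and observed false alarm rate. The sketch below evaluates it; the clipping to [0, 1] mirrors the prior constraint on the joint inference described above.

```python
import numpy as np

def false_alarm_reliability(e_fa, f_fa):
    """False alarm reliability from effectiveness E_FA and observed
    false alarm rate F_FA (the fraction of TCEs classified as false alarms):

        R_FA = 1 - [(1 - E_FA) / E_FA] * [F_FA / (1 - F_FA)]

    Clipped to [0, 1], mirroring the prior constraint 0 <= R_FA <= 1.
    """
    r = 1.0 - (1.0 - e_fa) / e_fa * f_fa / (1.0 - f_fa)
    return np.clip(r, 0.0, 1.0)
```

Perfect effectiveness (E FA = 1) gives R FA = 1 regardless of the false alarm rate, while low effectiveness combined with a high false alarm rate drives R FA toward zero, which is the regime where the separate characterization can go negative without the clip.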

Robovetter Disposition Score
As described in Thompson et al. (2018), the Robovetter disposition score is a measure of the confidence of the Robovetter's classification of a TCE as a PC or FP. The score is measured by varying the Robovetter metrics according to their uncertainties, and the score of a TCE is the fraction of those variations for which the TCE is classified as a PC. The score can be thought of as the propagation of the uncertainty of the Robovetter metrics for each TCE. A high-score PC (score near 1.0) is almost always classified as a PC, while a low-score PC (score near 0.0) is almost always classified as an FP.
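The score computation can be caricatured as a Monte Carlo over metric uncertainties. In this sketch the pass rule (every perturbed metric above its threshold) and the Gaussian perturbations are simplifying assumptions standing in for the full Robovetter decision logic.

```python
import numpy as np

rng = np.random.default_rng(0)

def disposition_score(metrics, thresholds, sigmas, n_draws=10_000):
    """Fraction of metric perturbations for which the TCE passes vetting.

    Each metric is perturbed by Gaussian noise with its own sigma; a draw
    'passes' when every perturbed metric exceeds its threshold. This pass
    rule is a stand-in for the actual Robovetter tests.
    """
    metrics = np.asarray(metrics, dtype=float)
    draws = rng.normal(metrics, sigmas, size=(n_draws, metrics.size))
    passes = (draws > np.asarray(thresholds, dtype=float)).all(axis=1)
    return passes.mean()
```

A TCE whose metrics sit far above every threshold scores near 1.0, and one far below scores near 0.0, matching the PC/FP intuition above; borderline TCEs get intermediate scores.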
Robovetter disposition score and false alarm reliability are often conflated, but are conceptually very different. The score of a PC is determined by the Robovetter metrics for that PC, based on that PC's observed data. False alarm reliability is determined by the rate of the Robovetter's identification of PCs in the inverted and scrambled data, and by the observed rate of false alarms, at the PC's MES and period. The relationship between score and reliability is shown in Figure 1 for PCs in our domain of analysis, 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕. The left panel shows false alarm reliability. PCs with shorter periods have high false alarm reliability but strongly varying score, indicating no relationship. Long-period PCs, on the other hand, show a strong correlation between score and false alarm reliability. This correlation provides some justification for the use of score as a proxy for false alarm reliability, producing a "high-reliability" PC population by removing those PCs below a score threshold (Mulders et al. (2018), for example). But such a score cut will remove many high-reliability planets. The right panel of Figure 1 shows the PC reliability including astrophysical false positives. We see that even a high score cut such as 0.9 does not remove all low-reliability PCs, due to their having a high false positive probability. This indicates that reliability correction is still useful when using a score cut. Figure 2 shows the impact of various score cuts on the PC population in our domain of analysis. In this paper we study the impact of score cuts and address the possibility that using high-score PCs may provide more accurate occurrence rates.

Figure 1. Left: Robovetter disposition score plotted against false alarm reliability for the DR25 GK PC population with 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕, with the color of each planet showing its orbital period and marker size showing its radius. Most PCs, with short orbital periods, have high reliability but strongly varying score, while long-period PCs have a strong correlation between score and reliability. Right: the same PC population showing the combined false alarm and astrophysical false positive reliability. In this case a high score cut does not remove PCs which have low reliability due to a high false positive probability.

Input Stellar and Planet Catalogs
As in Bryson et al. (2020), our stellar catalog uses the Gaia-based stellar properties from Berger et al. (2020b) combined with the DR25 stellar catalog at the exoplanet archive, with the cuts described in the baseline case of Bryson et al. (2020). This gives us 57,015 GK stars whose noise properties and observational coverage make them appropriate for a statistical exoplanet survey.
We use planet properties from the Kepler Threshold Crossing Events (TCE) catalog, with planet radii corrected using the Gaia-based stellar radii from Berger et al. (2020b), as in Bryson et al. (2020). These radii differ from those in Berger et al. (2020a) by a constant factor of 1.00226, due to a small difference in the adopted value of R⊕/R☉.

Varying the Robovetter Vetting Thresholds
We produce and compare different planet candidate catalogs, with differing balances between completeness and reliability, by varying the thresholds used by the Robovetter to identify planet candidates (PCs) from the transit signals identified as threshold crossing events (TCEs) by the Kepler data analysis pipeline. We produce four PC catalogs for the stellar population described in §2.2:

• DR25, the original Kepler planet candidate catalog, which was analyzed in detail in Bryson et al. (2020).
• High Reliability, which uses more restrictive Robovetter thresholds and rejects more borderline transit detections compared to the original DR25 catalog, resulting in higher reliability and lower completeness.

• High Completeness, which uses less restrictive Robovetter thresholds and accepts more borderline transit detections compared to the original DR25 catalog, resulting in lower reliability and higher completeness.

• FPWG PC, which attempts to tune the Robovetter thresholds to pass DR25 false positive KOIs that are identified as possible planets by the Kepler False Positive Working Group (FPWG) (Bryson et al. 2015).

Figure 2. The DR25 planet candidate population after imposing various score cuts. Top left: score cut = 0. Top right: score cut = 0.6. Bottom left: score cut = 0.7. Bottom right: score cut = 0.9. The planet candidates are colored and sized by reliability, with planet radius error bars. The background color map and contours indicate the summed completeness function η(p, r). The box on the left indicates the region integrated to obtain the occurrence rate F1, while the box on the right indicates the integration region for the occurrence rate ζ⊕. The ζ⊕ box extends out to 438 days. The occurrence rate F0 is the integral over the range of the figure.
The high-reliability and high-completeness catalogs use the alternative Robovetter thresholds described in §5.2 of Thompson et al. (2018). Details of the alternative vetting thresholds are given in Appendix A, as well as the new planet candidates that appear in the high-reliability and FPWG PC catalogs. We believe that these four catalogs are equally valid, if imperfect, catalogs of planet candidates in the Kepler data, each with a different balance of reliability and completeness. These catalogs are shown for our domain of analysis, 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕, in Figure 3. We see that there is a strong variation between the cases in the number of planets with orbital periods > 200 days.
In the high completeness case, 14 planet candidates appear within our domain of analysis that were vetted as false positives in DR25. We manually inspected these 14 new PCs using the TCERT Vetting Reports 3 , and found that only 5 presented plausible planetary transit signals.
The other 9 are likely false alarms due to instrumental artifacts or stellar variability.
For each catalog and its corresponding Robovetter thresholds, we run the Robovetter on the injected, inverted, and scrambled data, producing the data required to compute the vetting completeness and reliability of each catalog.

Figure 4 shows example distributions of vetting completeness for our planet candidate catalogs for the long-period, low-MES case of period = 365 days and MES = 10. In this case there is a large variation in the vetting completeness between the catalogs. As expected, the high-reliability case has lower vetting completeness. Figure 5 shows the resulting false alarm reliability distributions in two cases near the detection limit. Again we see a large amount of variation in the reliability between the catalogs we consider. As expected, the high-completeness catalog has lower false alarm reliability at a given period and MES.

Occurrence Rate Methods
We compute occurrence rates as the average number of planets per star f(p, r) over a specified stellar population and range of planet period p and radius r. We do this in two major steps: 1) the determination of d²f/(dp dr), the differential rate of planets per star for the specified stellar population and catalog of planet candidates on those stars, and 2) the integration of that rate over the specified planet period and radius range.

Step 1, described in this section, is performed for the period and radius range 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕. Using the results of step 1, step 2 computes several occurrence rates by integrating the differential rate over the various ranges described in §3.
To explore the dependence of our occurrence rates on the specific Bayesian inference method, we compute all occurrence rates for the four catalogs from §2.3 using the Poisson-likelihood MCMC method of Burke et al. (2015) and the Approximate Bayesian Computation method of Kunimoto & Matthews (2020). Both methods are modified to account for vetting completeness and reliability, with the Poisson-likelihood method described in Bryson et al. (2020) and the ABC method described in Kunimoto & Bryson (2020). Both methods use a standard power-law model for the planet population differential rate λ: for θ = (F0, α, β),

λ(p, r, θ) ≡ d²f/(dp dr) = F0 C p^β r^α,    (2)

where f is the number of planets per star and C is a normalization constant. Given θ, we find f for a particular period and radius range by integrating λ over that range. The normalization C in Equation (2) is chosen so that the integral of λ from rmin to rmax and from pmin to pmax equals F0, so F0 is the number of planets per star in our domain of analysis. The occurrence rates we present in this paper are integrals of λ over various period and radius ranges.
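Under the standard power-law form λ(p, r, θ) = F0 C p^β r^α (our reading of Equation (2), with C the normalization over the domain of analysis), the rate and its normalization have the closed form sketched below. The parameter values in the example are arbitrary.

```python
import numpy as np

P_MIN, P_MAX = 50.0, 400.0   # days
R_MIN, R_MAX = 0.75, 2.5     # Earth radii

def norm_const(alpha, beta):
    """Normalization C so that the integral of C * p^beta * r^alpha over the
    domain of analysis equals 1 (closed form; assumes alpha, beta != -1,
    where the antiderivative would instead be logarithmic)."""
    int_p = (P_MAX ** (beta + 1.0) - P_MIN ** (beta + 1.0)) / (beta + 1.0)
    int_r = (R_MAX ** (alpha + 1.0) - R_MIN ** (alpha + 1.0)) / (alpha + 1.0)
    return 1.0 / (int_p * int_r)

def rate_density(p, r, theta):
    """Differential rate lambda(p, r, theta) = F0 * C * p^beta * r^alpha."""
    f0, alpha, beta = theta
    return f0 * norm_const(alpha, beta) * p ** beta * r ** alpha
```

Integrating rate_density over the full domain returns F0 by construction; integrating over a sub-box gives occurrence rates for that region.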
Both inference methods use the same stellar and planet populations, and the same characterization of completeness and reliability, computed using the approach of Bryson et al. (2020). The steps are as follows:

• Select a subset of the target star population, which will be our parent population of stars that are searched for planets. We apply various cuts intended to select well-behaved and well-observed stars, and we restrict our analysis to GK dwarfs.
• Use the injected data to characterize vetting completeness.
• Compute the detection completeness, incorporating vetting completeness and geometrical probability for each star and sum over the stars, as described in §2.1.1.
• Use observed, inverted, and scrambled data to characterize false alarm reliability, as described in §2.1.2.
• Assemble the collection of planet candidates, including computing the reliability of each candidate from the false alarm reliability and false positive probability.
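The last step combines the two failure modes from §2.1.2 into a single per-candidate reliability; a one-line helper, with illustrative numbers:

```python
def candidate_reliability(fpp, r_fa):
    """Total reliability of a planet candidate: the probability that it is
    neither an astrophysical false positive (FPP) nor a false alarm
    (false alarm reliability R_FA), assuming the two are independent."""
    return (1.0 - fpp) * r_fa
```

For example, a candidate with FPP = 0.1 and false alarm reliability R_FA = 0.95 has total reliability 0.9 × 0.95 = 0.855.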
Because the choice of score cut and catalog changes both the vetting completeness and reliability of the PC population, all of these steps except the stellar parent sample selection are computed for each catalog and score cut.

For the Poisson likelihood inference of the parameters in Equation (2), reliability is implemented by running the MCMC computation many times, with planets removed with a probability given by their reliability. For details see Bryson et al. (2020). For the Poisson likelihood case, we infer the coefficients of Equation (2) for four choices of score cut for all four catalogs.

The ABC-based inference of the parameters in Equation (2) is computed using the approach of Kunimoto & Bryson (2020). In summary, the underlying Kepler planet population is forward modeled by drawing Np = F0 Ns planets according to Equation (2), where Ns = 57,015 is the total number of stars in the sample. Following the procedure of Mulders et al. (2018), this involves assigning each planet a period between 50 and 400 days from the cumulative distribution function of p^β, and a radius between 0.75 and 2.5 R⊕ from the cumulative distribution function of r^α. The detectable planet sample is then simulated from the underlying population by drawing from a Bernoulli distribution with the star-averaged detection probability. We compare the detected planets to the observed PC population using a distance function, which quantifies agreement between the period distributions, radius distributions, and sample sizes of the catalogs. For the first two distances, we chose the two-sample Anderson-Darling (AD) statistic, which has been shown to be more powerful than the commonly used Kolmogorov-Smirnov test (Engmann & Cousineau 2011). With each iteration of the ABC algorithm, model parameters are accepted when the resulting population's distance from the observed population is less than the 75th quantile of the previous iteration's accepted distances.
These simulations are repeated within a Population Monte Carlo ABC algorithm to infer the parameters that give the closest match between simulated and observed catalogs. Following the guidance of Prangle (2017), we confirmed that our algorithm converged by observing that the distances between simulated and observed catalogs approached zero with each iteration. We perform the ABC inference on the four catalogs, without a score cut.

This forward model is appropriate for estimating the average number of planets per star in a given period and radius range, as is achieved by the Poisson likelihood method. However, forward modeling has the advantage of being more versatile, especially in the face of increasingly complicated population models. Notably, Mulders et al. (2018) first used forward modeling to explore planetary architectures by taking into account correlations in planet properties, while He et al. (2019) used ABC to describe exoplanet periods and sizes using a clustered point process model. Another advantage is the ease with which ABC can implement reliability: rather than requiring many inferences on different catalogs, we modify the distance function to accept reliability-weighted data.
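The ABC forward model and distance described above can be sketched as follows. The flat detection probability, the equal weighting of the three distance terms, and the parameter values are all illustrative assumptions; the paper's model uses the star-averaged completeness and reliability-weighted data.

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(1)

def draw_power_law(n, lo, hi, exponent):
    """Inverse-CDF sampling from x^exponent on [lo, hi] (exponent != -1)."""
    u = rng.uniform(size=n)
    e1 = exponent + 1.0
    return (lo ** e1 + u * (hi ** e1 - lo ** e1)) ** (1.0 / e1)

def simulate_detected(theta, n_stars, detection_prob):
    """Forward-model one catalog: draw Np = F0 * Ns planets from the power
    laws, then keep each with probability detection_prob(period, radius)."""
    f0, alpha, beta = theta
    n_planets = int(round(f0 * n_stars))
    periods = draw_power_law(n_planets, 50.0, 400.0, beta)
    radii = draw_power_law(n_planets, 0.75, 2.5, alpha)
    keep = rng.uniform(size=n_planets) < detection_prob(periods, radii)
    return periods[keep], radii[keep]

def distance(sim, obs):
    """Distance between simulated and observed catalogs: two-sample
    Anderson-Darling statistics on periods and radii plus a simple
    sample-size term (one of many possible weightings)."""
    d_period = anderson_ksamp([sim[0], obs[0]]).statistic
    d_radius = anderson_ksamp([sim[1], obs[1]]).statistic
    d_count = abs(len(sim[0]) - len(obs[0])) / max(len(obs[0]), 1)
    return d_period + d_radius + d_count
```

In the Population Monte Carlo loop, θ draws whose distance falls below the 75th quantile of the previous iteration's accepted distances are kept, shrinking the tolerance with each iteration.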

RESULTS
We test our occurrence rate methods described in §2.4 on the four catalogs described in §2.3, using both the Poisson likelihood and ABC inference methods. For the Poisson likelihood method, we study the impact of score cut. As explained in §2.3, we treat these catalogs as equally valid, alternative measurements of the exoplanet population by Kepler, and will take the consistency of results using these catalogs as a diagnostic of the quality of the occurrence rate method.
For all cases, we present both the population model parameters θ for the differential occurrence rate λ(period, radius, θ) given by Equation (2), and several occurrence rates obtained by integrating λ(period, radius, θ) over various planet period and radius ranges. The occurrence rates we compute are:

• Γ⊕, the log-differential rate of planets per star evaluated at Earth's period and radius, Γ⊕ ≡ d²f/(d log p d log r) = p⊕ r⊕ λ(p⊕, r⊕, θ).
• F0, the number of planets per star in our domain of analysis, given by integrating λ over 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕.
All these occurrence rates probe the Kepler detection limit, with F1 being furthest from the detection limit. F0, F1, and part of ζ⊕ are shown in Figure 3.

Figure 6 shows the distributions of the occurrence rates resulting from the posteriors of θ, all near the Kepler detection limit, for the four catalogs described in §2.3. The left panels show the occurrence rates without correcting for reliability, while the right panels are corrected for reliability. For all occurrence rates, if we do not correct for reliability the different choices of Robovetter thresholds yield different occurrence rates, in some cases varying by more than a factor of 3. When we correct for reliability, we find that the different choices of Robovetter thresholds yield very similar occurrence rates, with closely overlapping distributions. We take this to indicate that the methods we use for completeness and reliability characterization and correction are working as intended, and that correction for both completeness and reliability is required. This is the case even when the PC population is polluted by a significant number of false alarms, as it is for the high-completeness catalog described in §2.3.

Figure 7 shows the same occurrence rate posteriors as Figure 6, but with a score cut applied to the planet catalog that removes planets with Robovetter disposition score below 0.9. This score cut is expected to yield a higher-reliability population in all the catalogs. This population for the DR25 catalog is shown in the lower right panel of Figure 2. We see that, without correcting for reliability (left panels), the distributions for the different catalogs are more consistent than in the left panels of Figure 6. As discussed in §2.1.3, even a high score cut of 0.9 passes planets with low reliability due to astrophysical false positive probability, so correction for reliability is appropriate even with a score cut.
But there are relatively few such low-reliability planets with this high score cut, so the impact of reliability correction on the distributions in Figure 7 (right panels) is minor. Comparing Figures 7 and 6 shows, however, that whether corrected for reliability or not, the spread of occurrence rates from the four catalogs using a score cut is notably larger than the spread when correcting for reliability without a score cut.

Poisson Likelihood Method
We compute the planet population model parameters θ and the resulting occurrence rates for the score cuts 0, 0.6, 0.7, and 0.9 for all four catalogs. We provide results both uncorrected and corrected for reliability. The resulting θ are given in Table 1, and the occurrence rates are given in Table 2. These tables give the median values and 16th and 84th percentile confidence intervals. These tables introduce the maximum separation metric, which quantifies the separation between the distributions from the catalogs. For each pair of catalogs {i, j}, we compute di,j = (mi − mj)/σdi,j, the difference in the medians mi and mj divided by the uncertainty σdi,j in that difference, propagated from σi and σj, where σi is the 68% confidence interval of distribution i. The maximum separation in a row is then the largest di,j over all pairs {i, j} in that row.
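The maximum separation metric can be computed as below. Propagating the uncertainty in the difference of medians by adding the σi in quadrature is our reading of the definition.

```python
import itertools
import numpy as np

def max_separation(medians, sigmas):
    """Largest pairwise separation d_ij = |m_i - m_j| / sqrt(s_i^2 + s_j^2)
    over all pairs of catalogs (quadrature propagation assumed)."""
    best = 0.0
    for i, j in itertools.combinations(range(len(medians)), 2):
        d = abs(medians[i] - medians[j]) / np.hypot(sigmas[i], sigmas[j])
        best = max(best, float(d))
    return best
```

With four catalogs there are six pairs, and a row's maximum separation is the largest of the six di,j values.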
Based on the maximum separation, we see in Tables 1 and 2 that the spread between the different catalogs can be over 2σ with no score cut and not correcting for reliability. Correcting for reliability reduces the separation to well under 1σ. Applying a score cut of 0.6 or 0.7 also reduces the separation to under 1σ with and without reliability correction. Applying a score cut of 0.9 results in a separation in F 0 greater than 1σ, though the separation in α and β is less than 1σ. The larger separation in F 0 for a score cut of 0.9 drives a separation greater than 1σ for the occurrence rates in Table 2.
The results in Table 2 are shown graphically in Figure 8. In this figure we see that, though the medians are separated by less than the error bars, there is a consistent bias towards higher occurrence rates when not correcting for reliability compared to correcting for reliability except for a score cut of 0.9. This is also seen in Figure 9, which shows the difference of each occurrence rate from the value for the DR25 catalog corrected for reliability without a score cut. While this difference is less than 1σ, without reliability correction there is a consistent bias towards higher occurrence rates when using a score cut of 0.6 or 0.7. When using a score cut of 0.9 without reliability correction, this bias disappears, but there is a larger spread of values across the catalogs, consistent with the maximum separation metric. These biases disappear with reliability correction, implying that these biases are due to planets that have low reliability due to astrophysical false positive probability.
It is reasonable to conjecture that applying a score cut eliminates low-quality planet candidates, which would result in more accurate occurrence rates. But as we have seen above, even with a high score cut there are low-reliability planet candidates due to astrophysical false positive probability, so an accurate occurrence rate requires reliability correction in any case. For all score cuts considered above, a score cut combined with reliability correction gives results highly consistent with reliability correction and no score cut. So there is no evidence that applying score cuts results in more accurate occurrence rates.

ABC Method
We compute the rate function coefficients θ using the ABC method described in §2.4 for the four catalogs with no score cut. The results, including occurrence rates, are given in Table 3, and the occurrence rates are compared with those from the Poisson likelihood method in Figure 10. The results are very consistent with those from the Poisson likelihood method with no score cut, exhibiting the same large variation across catalogs when not correcting for reliability and strong consistency when correcting for reliability. This is consistent with the results in Kunimoto & Bryson (2020). It is notable that the error bars using ABC are somewhat smaller than those from the Poisson likelihood.
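For readers unfamiliar with ABC, the core idea behind §2.4 can be illustrated with a toy rejection-sampling version. The real analysis forward-models the full detection process with completeness and reliability; everything below (flat prior, single Poisson count, tolerance of 2) is a simplified stand-in for illustration only:

```python
import math
import random

random.seed(0)

def poisson_draw(lam):
    """Draw a Poisson-distributed count via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def abc_rejection(observed_count, prior_max=50.0, n_draws=20000, tol=2):
    """Toy ABC: draw a candidate rate from a flat prior, forward-simulate
    a count, and keep the rate when the simulated count lands within `tol`
    of the observed count. The accepted rates approximate the posterior."""
    accepted = []
    for _ in range(n_draws):
        rate = random.uniform(0.0, prior_max)
        if abs(poisson_draw(rate) - observed_count) <= tol:
            accepted.append(rate)
    return accepted

posterior = abc_rejection(observed_count=12)
print(sum(posterior) / len(posterior))  # rough posterior mean for the rate
```

The key contrast with the Poisson likelihood MCMC is that ABC never evaluates a likelihood: it only requires the ability to simulate catalogs from proposed parameters, which is what lets completeness and reliability enter through the forward model itself.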

DISCUSSION
In this paper we presented four planet candidate catalogs created from Kepler DR25 detections and vetting metrics via the Kepler Robovetter following Thompson et al. (2018). Each catalog uses different choices of vetting thresholds, chosen for differing balances between completeness and reliability. The specific vetting thresholds for each catalog are equally reasonable and defensible, so each catalog can be considered a legitimate, though imperfect, planet candidate catalog. Therefore, occurrence rate measurements using these catalogs should provide consistent results. By applying the Robovetter with these thresholds to observed, injected, inverted, and scrambled data, we characterized the completeness and reliability of each catalog using the techniques of Bryson et al. (2020). We find that if we do not correct for reliability, occurrence rate estimates using these catalogs vary widely (compare, for example, the range of distributions in the left panel of Γ⊕ in Figure 6 with Figure 14 in Kunimoto & Matthews 2020), while correcting for both completeness and reliability brings them into statistical agreement. This indicates that correction for completeness and reliability is critical for robust occurrence rates and strongly suggests that the simulated false alarms in the inverted/scrambled data are a statistically accurate representation of the true false alarm population.

Table 1 Note—Median values and 16th and 84th percentile error bars of the posteriors of θ in Equation (2) for the high reliability, DR25, FPWG PC, and high completeness catalogs. F_0 is the occurrence rate of planets with 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕. These values were computed using the Poisson likelihood MCMC method. The maximum separation in each row is the maximum over the row of the difference in medians divided by the propagated uncertainty of that difference.
Using the criterion of consistent results across the four catalogs, we investigated the use of Robovetter disposition score cuts as a method of correcting for reliability. We found that score cuts, which remove those planet candidates with score below a threshold value, significantly improve the consistency of occurrence rates from the four catalogs without reliability correction, compared to applying neither a score cut nor a reliability correction. Using a score cut without reliability correction can produce results from the four catalogs consistent with correcting for reliability without a score cut. However, using a score cut without reliability correction yields occurrence rates that are somewhat biased towards high values relative to those with reliability correction. This bias is removed when using a score cut together with reliability correction, and implementing a score cut with reliability correction yields the same result as reliability correction without a score cut. Therefore we recommend always correcting for reliability, without a score cut, when possible, because score cuts remove data without providing any advantage. If correcting for reliability is not possible, then a score cut is a reasonable alternative, but will give less accurate results.
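The distinction drawn above between a score cut and a full reliability correction can be sketched numerically. In this toy example (all numbers invented), each candidate carries a disposition score and a reliability; a score cut discards low-score candidates outright, while the reliability correction keeps every candidate but weights it by its probability of being a true planet:

```python
# Toy candidate list: (disposition_score, reliability). Values invented.
candidates = [
    (0.95, 0.99), (0.88, 0.97), (0.72, 0.90),
    (0.55, 0.60), (0.30, 0.35), (0.10, 0.15),
]
completeness = 0.5  # assumed average completeness in this period-radius cell

# No correction: every candidate counts as one planet.
n_raw = len(candidates) / completeness

# Score cut at 0.6: drop low-score candidates, still no reliability weight.
n_cut = sum(1 for s, r in candidates if s >= 0.6) / completeness

# Reliability correction: weight each candidate by its reliability.
n_rel = sum(r for s, r in candidates) / completeness

print(n_raw, n_cut, n_rel)  # prints 12.0 6.0 7.92
```

The score cut throws away the partial evidence carried by low-score candidates (here, an expected 0.60 + 0.35 + 0.15 = 1.1 true planets), whereas reliability weighting retains it, which is the sense in which removing low-score planets "removes data without providing any advantage."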
We found that the above behavior of the occurrence rate calculation is the same for both the Poisson likelihood and Approximate Bayesian Computation methods, in spite of the dramatic difference in these methods' treatment of completeness, reliability, and the statistics of the planet population model.
The occurrence rates presented in this paper are, like those in Bryson et al. (2020), illustrative. In particular, the occurrence rates ζ⊕ and the SAG13 η⊕ involve significant extrapolation beyond the region that is well sampled by the data. We therefore treat these occurrence rates with some skepticism, though it is remarkable how robust they are against variations in Robovetter vetting thresholds, Robovetter disposition score cuts, and the Bayesian inference method.
There are at least three aspects of the occurrence rate calculations presented in this paper that may compromise accuracy:
• Incorrect population model. The product of independent power laws in period and radius in Equation (2) may not correctly describe the planet population. There is ample evidence that exoplanet populations are significantly more complex (Fulton et al. 2017; Mulders et al. 2019; Pascucci et al. 2019), and may not be well described by simple broken power laws. Incorrect population models can lead to large inaccuracies when extrapolated, as we do for ζ⊕ and the SAG13 η⊕.
• Not accounting for planet multiplicity. Zink et al. (2019) showed that the existence of short-period transiting planets can inhibit the detection of longer-period transiting planets on the same star. They estimate that, after correcting for the impact of planet multiplicity on detection completeness, occurrence rates of longer-period planets on individual stars that have short-period transits may be as much as 16% higher. However, it is difficult to correct for this effect in our methods, which rely on a uniform completeness characterization across the parent stellar population.
• Incomplete vetting metrics. We use the vetting metrics of Thompson et al. (2018). While these metrics are remarkably thorough, they do not fully exploit the Kepler data. For example, further vetting metrics based on pixel data can help distinguish astrophysical signals from instrumental artifacts. Such metrics could potentially yield catalogs that are both more complete and more reliable, which may result in different, theoretically more accurate occurrence rates.
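The separable population model questioned in the first bullet above can be made concrete. Assuming Equation (2) has the form λ(P, R) = F_0 · C · P^α R^β, with the constant C chosen so that the integral over 50 ≤ P ≤ 400 days and 0.75 ≤ R ≤ 2.5 R⊕ equals F_0 (matching the definition of F_0 used with Table 1), integrated rates over any sub-box follow directly. A minimal sketch with invented parameter values:

```python
import math

def powerlaw_integral(exponent, lo, hi):
    """Integral of x**exponent from lo to hi."""
    if exponent == -1:
        return math.log(hi / lo)
    return (hi ** (exponent + 1) - lo ** (exponent + 1)) / (exponent + 1)

def occurrence(F0, alpha, beta, p_range, r_range):
    """Integrate the separable power-law rate over a period-radius box,
    normalized so that the 50-400 day, 0.75-2.5 Earth-radius box gives F0."""
    norm = powerlaw_integral(alpha, 50, 400) * powerlaw_integral(beta, 0.75, 2.5)
    box = powerlaw_integral(alpha, *p_range) * powerlaw_integral(beta, *r_range)
    return F0 * box / norm

# Invented parameter values, for illustration only: an F1-like rate over
# 50-200 days and 1-2 Earth radii.
print(round(occurrence(1.0, -0.5, -1.0, (50, 200), (1.0, 2.0)), 3))  # prints 0.315
```

This separability is exactly what the cited population studies call into question: if period and radius dependences are coupled (e.g. by the radius valley), no choice of α and β reproduces the true rate in every sub-box, and extrapolated quantities like ζ⊕ inherit the model error.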
We expect that the robustness demonstrated in this paper would still apply with improved population models and vetting metrics. Strictly speaking, the DR25 catalog is a catalog of objects that pass the Robovetter with a specific set of metrics. We strongly believe that this catalog, when used with the associated measures of completeness and reliability, provides a high-quality measurement of the true transiting planet population. Using the same metrics but changing the metric thresholds as described in §2.3 provides slightly different measurements of the same population, so we expect the resulting slightly different catalogs to be statistically consistent with each other. Adding different metrics, on the other hand, may potentially measure different populations of transiting planets and false alarms/positives, yielding a catalog that more closely matches the true underlying population. Therefore new vetting metrics may yield statistically different occurrence rates that may be closer to the true value.

CONCLUSIONS
In this paper we explored the impact of using several alternative planet candidate catalogs derived from Kepler data on exoplanet occurrence rates. We find statistically consistent occurrence rates using these catalogs, so long as we correct for catalog completeness and reliability. Ignoring reliability, however, results in statistically inconsistent occurrence rates between the catalogs. This implies that (a) completeness and reliability correction is necessary for accurate occurrence rates, and (b) the completeness and reliability of these catalogs are correctly statistically measured using injected, inverted, and scrambled data. In particular, the false alarms in the inverted and scrambled data statistically represent the false alarms in the observed Kepler data. This result is independent of the computational method. We make the four planet catalogs we use, as well as the data required for their completeness and reliability characterization, publicly available. We recommend that other occurrence rate methods be tested using these catalogs to demonstrate that they yield statistically consistent results.

Table 2 Note—Occurrence rate results for various score cuts when not correcting for reliability, resulting from high reliability, DR25, FPWG PC, and high completeness vetting. F_1 is the integrated planet rate over 50 ≤ period ≤ 200 days and 1 ≤ radius ≤ 2 R⊕, ζ⊕ is the integrated rate within 20% of Earth's orbital period and size, and SAG13 η⊕ is the integrated rate over 237 ≤ period ≤ 860 days and 0.5 ≤ radius ≤ 1.5 R⊕.
This paper illustrates the importance of correcting for both completeness and reliability when performing demographic studies. This lesson surely applies to any survey whose catalogs are incomplete and not fully reliable. Our ability to characterize catalog completeness and reliability depends on being able to create our catalogs in a uniform and repeatable way, so the same catalog inclusion criteria can be applied to both observed and ground-truth data. The ground-truth data must statistically represent both true and false detections.
Large-scale transit surveys such as K2, TESS, and PLATO, as well as RV and microlensing surveys, provide wonderful opportunities for a deeper understanding of exoplanet demographics. These surveys will require a similar ability to characterize their completeness and reliability in order to provide high-quality results.

ACKNOWLEDGMENTS
We thank NASA, Kepler management, and the Exoplanet Exploration Office for continued support of and encouragement for the analysis of Kepler data. Gijs Mulders provided helpful comments. We thank Bill Borucki and the Kepler team for the excellent data that makes these studies possible.

Figure 9. The difference in the medians of the catalogs from DR25 with no score cut, divided by the 1σ error in those differences, for various occurrence rates and score cuts, computed with the Poisson method. Left: without correcting for reliability. Right: corrected for reliability.

A. ROBOVETTER VETTING THRESHOLDS
We created the catalogs described in §2.3 by changing a subset of the DR25 Robovetter thresholds described in Thompson et al. (2018). Table 4 shows those thresholds that were changed from the DR25 PC catalog; all other thresholds were unchanged. Figures 11 and 12 show histograms of these metrics for the injected and inverted/scrambled data described in §2.1. Injected data provides true transits, while the inverted/scrambled data contains no true transits, so any detected transit in the inverted/scrambled data is a false alarm. Ideally, the thresholds would separate the injected (true transit) population from the inverted/scrambled (no transit) population. We see, however, that for these metrics there is not a clean separation between data with true transits and data with no true transits, which makes the choice of threshold difficult. Thompson et al. (2018) describes how these thresholds were chosen for a particular balance of completeness and reliability. Our alternative thresholds provide different balances of completeness and reliability, and we see from Figures 11 and 12 that there is considerable freedom in the choice of those thresholds.

Table 5 gives the total number of planets in each catalog, and how many remain after applying various score cuts. Table 6 lists the TCEs that were given planet candidate status in the high-completeness or FPWG PC catalogs that do not appear in the DR25 PC catalog (there were no new TCEs given PC status in the high-reliability catalog). For each TCE in Table 6, the false alarm reliability, computed as described in §2.1.2 for each catalog, is given for the catalog in which it appears. A missing reliability value indicates that the TCE was not a PC in that catalog. These new PCs are shown in Figure 13.

Figures 11 and 12. Robovetter metrics and thresholds. Shaded histogram: metric distribution for false alarms from the inverted and scrambled data. Line histogram: metric distribution for true transits from the injected data.
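The completeness/reliability trade encoded in these thresholds can be sketched numerically. Given metric values for injected (true transit) and inverted/scrambled (false alarm) detections, moving the threshold changes both quantities at once. All numbers below are invented, and the "reliability" here is a simplified pass-fraction stand-in that ignores the scaling between simulated and observed false alarm populations used in the paper:

```python
def completeness_reliability(injected, false_alarms, threshold):
    """Completeness: fraction of injected (true) signals passing the cut.
    Reliability stand-in: fraction of passing detections that are injected."""
    tp = sum(1 for m in injected if m >= threshold)
    fp = sum(1 for m in false_alarms if m >= threshold)
    completeness = tp / len(injected)
    reliability = tp / (tp + fp) if tp + fp else 0.0
    return completeness, reliability

# Invented metric values for illustration
injected = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3]  # true transits
false_alarms = [0.55, 0.45, 0.35, 0.2, 0.1]      # no true transit

for thr in (0.3, 0.5, 0.7):
    c, r = completeness_reliability(injected, false_alarms, thr)
    print(f"threshold={thr}: completeness={c:.2f}, reliability={r:.2f}")
```

Because the two histograms overlap, raising the threshold always buys reliability at the cost of completeness; this is why no single threshold choice is privileged and why the four catalogs, each at a different point on this trade, should all yield the same occurrence rates after correction.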
The thresholds given in Table 4 are shown in Figures 11 and 12.

Figure 13. Planet candidates resulting from the high completeness and FPWG Robovetter thresholds that are not in the DR25 PC population. The DR25 PC population is shown for comparison. The dashed box is the period-radius range used when computing our population model parameters.