A Probabilistic Approach to Kepler Completeness and Reliability for Exoplanet Occurrence Rates

Exoplanet catalogs produced by surveys suffer from incompleteness (not every planet is detected) and imperfect reliability (not every entry in the catalog is a true planet), particularly near the survey's detection limit. Therefore, occurrence rate studies of the exoplanet population using such a catalog must correct for completeness and reliability. The final Kepler data release, DR25, features a uniformly vetted planet candidate catalog and data products that facilitate such corrections. We present a new probabilistic approach to the characterization of Kepler completeness and reliability, making full use of the Kepler DR25 completeness and reliability products. We illustrate the impact of completeness and reliability with a standard occurrence rate method, using a stellar properties catalog that incorporates Gaia stellar radii and a uniform treatment of the stellar population. This is the first exoplanet occurrence rate calculation that characterizes and corrects for reliability using all of the DR25 completeness and reliability products. Correcting for reliability has a significant impact near the Kepler detection limit: when computing the occurrence rate of exoplanets with orbital period and radius within 20% of Earth's around GK dwarf stars, correcting for reliability yields 0.013 (+0.009, -0.006); not correcting yields 0.033 (+0.017, -0.012). We also show that using Gaia-based stellar properties, rather than those in the DR25 stellar catalog, reduces the occurrence rate in the same parameter space by a factor of two. We critically examine the DR25 catalog and the assumptions behind our occurrence rate method, including the possibility that the DR25 catalog under-represents small, long-period planets. We propose several ways in which confidence in both the Kepler catalog and occurrence rate calculations using that catalog can be improved.


INTRODUCTION
Corresponding author: S. Bryson, steve.bryson@nasa.gov

The Kepler space telescope (Koch et al. 2010) has delivered unique data that enable the characterization of exoplanet population statistics, from hot Jupiters in short-period orbits to terrestrial-size rocky planets with orbital periods of up to one year 1 . By observing >150,000 stars nearly continuously for four years in search of transiting exoplanets, Kepler detected several thousand planet candidates (PCs) (Thompson et al. 2018), leading to the confirmation or statistical validation of over 2,300 exoplanets. This rich trove of exoplanet data has delivered many insights into exoplanet structure and formation, and promises deeper insights with further analysis. One of the most exciting quantities to be derived from Kepler data is η⊕, the occurrence rate of temperate, terrestrial-size planets orbiting Sun-like stars. η⊕ is also a critical input to the design of future space telescopes intended to discover and characterize habitable exoplanets, such as HabEx and LUVOIR.
Fully exploiting Kepler data requires a thorough understanding of how well it reflects the underlying exoplanet population. There are several ways in which the Kepler planet candidate catalog does not directly measure the real planet population:
• the catalog is incomplete, missing real planets;
• the catalog may be unreliable, polluted with false positives;
• the catalog may be inaccurate, with observational errors leading to incorrect planet properties.
Lack of completeness and reliability are particularly acute at the Kepler detection limit, which happens to coincide with the period and radius of Earth-Sun analog exoplanets. We therefore focus our attention on a period and radius range spanning the Kepler detection limit.
In this paper we address vetting incompleteness, a significant component of incompleteness caused by incorrectly classifying detected true planets as false positives, and vetting reliability, degraded by incorrectly classifying detections as planet candidates when they are in fact not true planets. We address accuracy by using new, uniformly determined stellar properties based in part on Gaia observations, described in §3.1.
We focus our analysis on the final Kepler data release, DR25 (Thompson et al. 2018), and its associated planet candidate catalog 2 . DR25 contains several products designed to support the characterization of the completeness and reliability of the DR25 planet candidate catalog. The primary contribution of this paper is a new probabilistic approach to using the DR25 completeness and reliability products to characterize vetting completeness and reliability. We illustrate the impact of completeness and reliability with standard occurrence rate computations, and examine the impact on occurrence rates of changes in various assumptions. This is the first occurrence rate computation that fully uses the DR25 completeness and reliability products to characterize vetting reliability.

Previous Work
Kepler's survey of the Cygnus field involved four years of data collection and another four years of pipeline development, data processing, and survey characterization, culminating in the final deliveries referred to as Data Release 25 (DR25). Incremental data deliveries enabled preliminary science investigations, and several occurrence rate studies were executed as the survey progressed. The simplest, first-look estimates used Gaussian cumulative distribution functions (CDFs) as proxies for pipeline completeness (Borucki et al. 2011), restricted samples where completeness was assumed to be near unity (Howard et al. 2012), and linear approximations to a Gaussian CDF (Fressin et al. 2013). Lacking a full characterization of the Kepler pipeline, others employed independent detection pipelines, including injection and recovery experiments operating on flux light curves to quantify the detection completeness (Petigura et al. 2013; Foreman-Mackey et al. 2014; Dressing & Charbonneau 2015). Hsu et al. (2018) performed an occurrence rate calculation using approximate Bayesian computation (ABC) and computed a habitable zone occurrence rate taking into account the effect of planet multiplicity on completeness.
The performance of the Kepler pipeline was characterized incrementally as more data were collected, using transit injection and recovery operating on raw pixel fluxes as described in Section 2.1 (Christiansen et al. 2013, 2016; Christiansen 2017). These early studies provided feedback to the Kepler pipeline, whereby deficiencies were identified and corrected (Twicken et al. 2016). Occurrence rate calculations using 16 out of 17 quarters of data and the associated pipeline completeness offered a benchmark computation for testing methodologies and comparing independent pipelines. Systematic errors were explored and the tallest tent poles were identified. Among these, two standouts emerged: stellar property uncertainties and catalog reliability.
The first studies to include a treatment of catalog reliability focused on identifying astrophysical false positives, either deterministically through follow-up observations (Santerne et al. 2012) or probabilistically via population synthesis (Morton & Johnson 2011; Morton & Swift 2014; Morton et al. 2016). While most treatments culled or weighted the planet population, others sought to model both the planet and astrophysical false positive sources as part of the planet occurrence estimation (Fressin et al. 2013), with Farr et al. (2014) applying a mixture model approach. These efforts used idealized models of the false positive and false alarm populations. An extremely useful astrophysical false positive probability statistic was developed by Morton et al. (2016).
The most significant effort to date to characterize the reliability of the Kepler planet candidate catalog is the final Kepler DR25 catalog paper (Thompson et al. 2018). The DR25 catalog includes a Robovetter score, which estimates the confidence with which the Robovetter vetted a TCE. Thompson et al. (2018) suggest that restricting occurrence rate studies to high-Robovetter-score planet candidates avoids the problem of low-reliability candidates (we test this approach in §6.4.2). Hsu et al. (2018) and Mulders et al. (2018) apply this high-score approach in their occurrence rate studies. Burke et al. (2019) analyze DR25 reliability using the inverted and scrambled data, approaching reliability characterization via kernel density estimation; their focus is how reliability impacts statistical exoplanet validation.

This Paper
The primary purpose of this paper is to present a probabilistic characterization of Kepler vetting completeness (§4.2) and reliability against false alarms (§5.1), and to show the impact on standard occurrence rate computations. This probabilistic analysis is robust against sparse data and resolves detailed structure in the dependence of vetting completeness and reliability on orbital period and transit signal strength. We explore the impact of using this characterization to correct occurrence rates based on Kepler data by performing a standard Poisson-likelihood-based occurrence rate calculation, following Burke et al. (2015). Our characterization depends on the exoplanet population, which in turn depends on the parent stellar sample that is searched for planets. We restrict our analysis to GK dwarf stars.
Our method of accounting for completeness and reliability proceeds by executing the following steps:
• Select a subset of the target star population, which will be our parent population of stars that are searched for planets. We apply various cuts intended to select well-behaved and well-observed stars, and we restrict our analysis to GK dwarfs, as described in §3.1.
• Use the injected data to characterize vetting completeness, described in §4.2.
• Use observed, inverted and scrambled data to characterize false alarm reliability, described in §5.1.
• Assemble the collection of planet candidates, including computing the reliability of each candidate from the false alarm reliability and False Positive Probability.
• Compute the desired occurrence rates, presented in §6.
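The fourth step combines each candidate's false-alarm reliability with its astrophysical False Positive Probability. As a minimal sketch, assuming the two contamination mechanisms are independent (the function name and the simple product are illustrative, not the paper's exact prescription):

```python
def net_reliability(false_alarm_reliability, fpp):
    """Combine false-alarm reliability with the astrophysical false
    positive probability (FPP), treating the two failure modes as
    independent. Hypothetical helper for illustration only."""
    return false_alarm_reliability * (1.0 - fpp)

# e.g. a candidate with 70% false-alarm reliability and a 5% FPP
r = net_reliability(0.7, 0.05)  # 0.665
```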
We perform our completeness and reliability analysis on the planet radius range 0.5 ≤ radius ≤ 15 R⊕. We perform our completeness analysis on the period range 50 ≤ period ≤ 500 days and our reliability analysis on the period range 50 ≤ period ≤ 600 days. Because our occurrence rates are intended to illustrate the impact of vetting completeness and reliability where both are low, we compute them over 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕.
Following an introduction to the details of completeness, reliability and the data products that support their computation in §2, this paper has four major parts. §3 assembles the stellar and planet catalogs we use in our analysis; we choose a stellar catalog that incorporates Gaia stellar radii and features a uniform treatment of the parent stellar sample. §4 describes catalog completeness and our characterization of vetting completeness. §5 describes our characterization of catalog reliability. In §6 we perform our baseline occurrence rate computations corrected for vetting completeness, emphasizing the difference between correcting and not correcting for reliability. We then explore alternative occurrence rate calculations, including using an alternative stellar properties catalog, demonstrating the impact of not correcting for vetting completeness, and restricting our analysis to planet candidates with Robovetter score > 0.9.
Throughout this paper we present results with confidence intervals that are the 14th and 86th percentiles of posterior distributions resulting from MCMC analysis using fixed inputs. These confidence intervals do not account for uncertainties in the inputs; we address input uncertainties in §6.3.
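As a concrete sketch of how such intervals are extracted, assuming a toy array of MCMC posterior samples (the distribution and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
posterior = rng.normal(0.013, 0.005, size=100_000)  # toy MCMC samples

# the 14th/86th percentiles bracket the median, as quoted in this paper
lo, med, hi = np.percentile(posterior, [14, 50, 86])
interval = (med - lo, hi - med)  # lower/upper confidence interval
```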
All results reported in this paper were produced with Python code, mostly in the form of Python Jupyter notebooks, found at the paper GitHub site 3 .
2. COMPLETENESS, RELIABILITY AND OCCURRENCE RATES

As described above, completeness has two components. Detection completeness is the fraction of true planets that are detected by the Kepler pipeline. Vetting completeness is the fraction of detected true planets that are correctly vetted as planet candidates. Vetting reliability is the fraction of vetted planet candidates that are true planets.
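These three fractions can be written down directly. A minimal sketch with hypothetical counts (names and numbers are illustrative):

```python
def detection_completeness(n_detected_true, n_true_planets):
    # fraction of true planets detected by the Kepler pipeline
    return n_detected_true / n_true_planets

def vetting_completeness(n_true_vetted_pc, n_detected_true):
    # fraction of detected true planets correctly vetted as PCs
    return n_true_vetted_pc / n_detected_true

def vetting_reliability(n_true_pc, n_vetted_pc):
    # fraction of vetted planet candidates that are true planets
    return n_true_pc / n_vetted_pc

# e.g. 1000 true planets, 800 detected, 720 of those vetted PC;
# 900 PCs in total, of which 720 are true planets
dc = detection_completeness(800, 1000)   # 0.8
vc = vetting_completeness(720, 800)      # 0.9
vr = vetting_reliability(720, 900)       # 0.8
```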
During catalog creation, the reliability of the planet candidate catalog is increased by detecting and removing false positives using a variety of tests. When the transit signal is weak it can be difficult for these tests to distinguish false positives from true planets, so maximizing reliability via stringent tests can cause true planets to be classified as false alarms, reducing vetting completeness. The DR25 catalog addressed this problem with uniform automated vetting via the Robovetter, using tests that were tuned to strike a balance between vetting completeness and reliability. This uniform automated vetting made it possible to vet synthetic and modified data sets designed to statistically mimic true planets and false positives, described in §2.1, in exactly the same way that the observed data was vetted. In this way the completeness and reliability of the planet candidate catalog can be measured and corrected in occurrence rates.
We distinguish two broad classes of phenomena that pollute the planet candidate catalog:
• Astrophysical False Positives, such as grazing or eclipsing binaries, which produce a planet-transit-like signal with a regular ephemeris in observed light curves that is not due to planetary transits. There has been extensive effort to identify and remove such false positives from Kepler catalogs (e.g., Morton & Johnson 2011; Bryson et al. 2013; Fressin et al. 2013; Coughlin et al. 2014; Morton et al. 2016; Thompson et al. 2018), and for high-SNR transits the resulting removal of astrophysical false positives is very effective. For low SNR, however, it is more difficult to distinguish astrophysical false positives from true planetary transits. In this paper we address astrophysical false positives via the probabilistic evaluation of Morton et al. (2016).
• False Alarms, which trigger a transit detection with a regular ephemeris, but are not due to regularly repeating astrophysical phenomena. The dominant source of false alarms in Kepler data is instrumental artifacts. There are two important classes of instrumental artifacts that have been identified as responsible for the overwhelming majority of false alarms at long periods: rolling bands and statistical and pixel fluctuations.
-Rolling bands (Van Cleve & Caldwell 2009; Caldwell et al. 2010) are thermally dependent quasi-sinusoidal electrical signals in the output of the Kepler CCDs. As the Kepler telescope slowly changes its attitude relative to the Sun, different parts of the photometer are illuminated. While the thermal insulation of the telescope and electronics is very good, it is not perfect, and the resulting thermal variations cause the rolling bands to slowly move across the CCDs, often introducing signals that look very much like transits. Because Kepler's attitude is determined by its 372-day orbit, rolling bands often induce transit-like signals that repeat with a nearly regular ephemeris with an approximately 372-day period. This is the cause of the sharp peak of TCEs in the left panel of Figure 1. Rolling bands are highly dependent on focal plane position: some Kepler CCD channels have much more severe rolling bands than others.
-Statistical and pixel fluctuations are independent, unrelated dips in light curves due to cosmic ray hits, single transits, and statistical fluctuations that trigger transit detections when they accidentally fall into a regular ephemeris. This class of false alarms becomes much more common for long-period ephemerides because they only require three or four events to fall on a regular ephemeris, and explains the broader "shoulders" of the tall peak in the left panel of Figure 1.
In addition, stellar variability triggers false alarm transit detections on a regular ephemeris, typically at short periods. False positives and false alarms are treated differently in our analysis, as described in §5.
In an ideal world, measuring planet occurrence rates from Kepler data would be simple: divide the number of detected planets by the number of observed stars, correcting for the geometric probability of a planet transit. To get an accurate occurrence rate, however, this approach must be corrected for completeness and reliability. This correction can be large for Earth-Sun analog systems, which are at the Kepler detection limit where both completeness and reliability are very low. Specifically, instrumental false alarms are a significant source of false transit detections in the long-period, low signal-to-noise (SNR) region of most interest for η⊕ studies of G and K dwarf stars. In this regime, as shown in Figure 1, the number of instrumental false alarms is very large compared to the expected population of true exoplanet detections.

Figure 1. Left: the distribution of DR25 TCEs (Thompson et al. 2018), showing a dramatic excess near the Kepler orbital period of 372 days and SNR between 7 and 15. Right: the distribution of DR25 planet candidates (PCs) over the same space; note the scale change on the vertical axis. While the vast majority of detections have been identified as false alarms and removed from the planet candidate population, there remains a possible small excess of PCs near the Kepler orbital period with SNR between 7 and 15.
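The idealized computation described above can be sketched in a few lines (the function and values are hypothetical; the real calculation in §6 must also correct for completeness and reliability):

```python
def naive_occurrence_rate(n_planets, n_stars, geometric_prob):
    # divide detected planets by stars searched, weighting each
    # detection by the inverse geometric transit probability (~ R*/a)
    return n_planets / (n_stars * geometric_prob)

# e.g. 10 detected planets around 10,000 stars with a 0.5% chance
# of transiting: about 0.2 planets per star in this toy example
rate = naive_occurrence_rate(10, 10_000, 0.005)
```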

DR25 Vetting and Reliability Products
The DR25 planet candidate catalog (Thompson et al. 2018) contains 4034 identified planet candidates (PCs) out of 8054 Kepler Objects of Interest (KOIs). The KOIs were extracted from a catalog of 34,032 transit detections called threshold crossing events (TCEs), which are periodic transit-like events (as identified by a matched filter; Jenkins 2002) that have a combined signal strength above a threshold (typically 7.1σ). Identification of the PCs from the KOIs was performed by a fully automated Robovetter. The Robovetter applies a variety of tests to each TCE, many of which are based on the synthetic test datasets described below; planet candidates are the TCEs that pass all tests in the Robovetter's logic tree. Such automated vetting (and transit detection) is critical for the production of a statistically uniform catalog that is amenable to statistical correction for completeness and reliability.
The DR25 completeness products are based on injected data: a ground truth of transiting planets is obtained by injecting transit signals with specific characteristics on all observed stars at the pixel level (Christiansen 2017). These data are then analyzed by the Kepler detection pipeline to produce a catalog of detections at the injected ephemerides called injected and recovered TCEs, which are then sent through the same Robovetter used to identify planet candidates. The fraction of injected transits that are recovered as TCEs measures detection completeness, while the fraction of recovered TCEs that are vetted as planet candidates measures vetting completeness. A large number of transits were also injected on a small number of target stars to measure the dependence of completeness on transit parameters and stellar properties. These data are used to create high-resolution, per-target-star detection contours described in §4.1, providing completeness for each target star as a function of planet orbital period and radius (Burke & Catanzarite 2017).
The rate of false alarms, measured for the first time in DR25, is characterized by manipulating observed data so that it contains no true astrophysical transiting exoplanet signals, creating a ground truth in which any TCE or vetted planet candidate is an instrumental false alarm. There are two basic manipulations that create the data used to characterize the rate of false alarms:
• Data inversion flips the light curves "upside down" so that true transiting signals increase in brightness and are therefore not identified as transits. This is believed to preserve the quasi-sinusoidal rolling bands described above. The distribution of TCEs detected in the inverted data reproduces well the sharp peak at a period of 372 days that is seen in the distribution of observed TCEs in the left panel of Figure 1.
• Data scrambling shuffles the Kepler observational quarters in a way that destroys the regular ephemeris of astrophysical transit signals, preventing their detection by the Kepler pipeline. While this also prevents the detection of the same false alarms that are detected in the original observed data, it is believed to preserve the statistics of detections due to statistical and pixel fluctuations. The distribution of TCEs detected in the scrambled data is very similar to the broad shoulder near periods of one year in the distribution of observed TCEs seen in the left panel of Figure 1. Three different shuffles of the Kepler data are available.
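Both manipulations are simple to express on a toy light curve. A minimal sketch (the noise model, quarter boundaries, and scramble order are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
flux = 1.0 + 1e-4 * rng.standard_normal(400)  # toy normalized light curve
flux[100:105] -= 5e-4                          # a toy transit-like dip

# Data inversion: reflect about the median so dips become brightenings,
# which the transit search will not flag as transits
inverted = 2.0 * np.median(flux) - flux

# Data scrambling: reorder contiguous "quarters" (here 4 equal chunks
# in one arbitrary order), destroying any regular transit ephemeris
quarters = np.array_split(flux, 4)
scrambled = np.concatenate([quarters[i] for i in (2, 0, 3, 1)])
```

Note that scrambling preserves the set of flux values (and hence their noise statistics) while breaking periodicity.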
The DR25 Robovetter uses a number of metrics to identify instrumental false alarms, and the inverted and scrambled data sets were used to tune their pass/fail thresholds. For an extensive discussion, see Thompson et al. (2018). For many Robovetter metrics, the distribution of values from the inverted/scrambled data overlaps the distribution of values in the observed data, making it difficult to distinguish false alarms from true planets, particularly at long period and low SNR. This results in low reliability in some parameter spaces of the Kepler planet candidate catalog, especially near the detection limit. An effort to characterize this catalog reliability is described in Section 4 of Thompson et al. (2018). The work presented in this paper is an attempt to improve on this characterization.

The Stellar Catalog
Our occurrence rate calculation starts with a parent stellar population of GK dwarf stars that is searched for planetary transits. The properties of each star determine the likelihood that a transiting planet of a given size will be observed. The radius of the planet is derived from the fitted radius ratio and the stellar radius. While the most accurate stellar properties for each star are desirable for understanding the properties of the transiting planet, a statistical occurrence rate requires stellar properties of uniform quality. This is an issue for Kepler data because target stars with actual transit detections are much better characterized than most target stars without transit detections, which can potentially lead to unknown biases in the estimated detection completeness of §4.2.
The stellar catalog associated with the DR25 catalog, Q1-Q17 DR25 (with supplement) 4 , is based on heterogeneous observations, with some stars having properties derived from asteroseismic data, others from spectral data, and most from photometric data, all fitted to Dartmouth isochrones (Mathur & Huber 2016; Mathur et al. 2017). Berger et al. (2018) combined the DR25 stellar catalog with Gaia parallaxes to improve stellar radii, yielding an average radius precision of less than 10% for most Kepler stars. However, the adopted effective temperatures were still heterogeneous, and no revisions of other stellar properties such as mass and surface gravity were performed.
To obtain uniform treatment of the parent stellar population, for our baseline case we use the stellar catalog of Berger et al. (in prep), which extends Berger et al. (2018) by deriving a full set of stellar properties from isochrone fitting using broadband photometry, Gaia parallaxes and spectroscopic metallicities where available. We use these consistently fitted stellar properties to assure uniform treatment over the entire parent population. We recompute the 4-parameter stellar limb darkening model coefficients using these stellar properties via the table tableeq5.dat in Claret & Bloemen (2011), assuming a microturbulent velocity of 2 km s −1 . In §6.4.1 we compare the resulting baseline occurrence rates with those using the DR25 stellar properties. We address the issue of possible bias against small planets in this catalog in §6.4.1 and Appendix C.
Because we require information such as observational completeness from the DR25 catalog and the binary flag from Berger et al. (2018) for each target star, we merge Berger et al. (in prep), the DR25 stellar catalog (with supplement), and Berger et al. (2018), keeping only the 177,798 stars that are in all three catalogs. We remove possibly poorly characterized, binary and evolved stars using the following cuts:
• Remove stars with qualityFlag = highRUWE, which indicates that the Gaia DR2 astrometric goodness-of-fit fails a threshold test; see Berger et al. (in prep) and Lindegren (2018). This leaves 169,166 stars.
• Remove stars that, according to Berger et al. (2018), are likely binaries (Bin flag = 1 or 3; we allow Bin = 2 because that indicates a nearby companion star found via high-resolution imaging, which was only performed on a subset of the target stars). This leaves 166,694 stars.
• Remove evolved stars, using the evolstate package 5 to determine the evolution state of each star using the isochrone-fitted T eff , radius and logg as inputs. This amounts to recomputing the Evol flag described in Berger et al. (2018) using the Berger et al. (in prep) stellar properties. We remove those stars with Evol > 1, indicating that they are likely not dwarfs. After removing these stars 109,093 stars remain.
We then remove stars whose observations were not well suited for long-period transit searches (Burke & Catanzarite 2017):
• Remove noisy targets identified in the KeplerPorts package 6 , leaving 107,494 stars.
• Remove stars with NaN limb darkening coefficients, leaving 107,143 stars.
• Remove stars with NaN observation duty cycle, leaving 106,653 stars.
• Remove stars with a decrease in observation duty cycle > 30% due to data removal from other transits detected on this star, leaving 102,086 stars.
• Remove stars with data span < 1000 days, leaving 90,829 stars.
• Remove stars with the DR25 stellar properties table timeoutsumry flag ≠ 1, leaving 85,158 stars. This flag = 1 indicates that the Kepler pipeline completed its transit search on this star without timing out.
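The sequence of cuts above maps naturally onto boolean masks on a merged stellar table. A minimal sketch with a toy table (column names are hypothetical stand-ins for the actual catalog columns):

```python
import pandas as pd

stars = pd.DataFrame({
    "kepid":        [1, 2, 3, 4],
    "data_span":    [1400.0, 900.0, 1450.0, 1420.0],  # days
    "dutycycle":    [0.90, 0.80, float("nan"), 0.85],
    "timeoutsumry": [1, 1, 1, 0],
})

mask = (
    (stars["data_span"] >= 1000)     # well-observed: span >= 1000 days
    & stars["dutycycle"].notna()     # drop NaN observation duty cycle
    & (stars["timeoutsumry"] == 1)   # transit search ran to completion
)
selected = stars[mask]               # only kepid 1 survives these cuts
```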
Finally, we select our GK population using the isochrone-fitted effective temperature range 3900 K ≤ T_eff < 6000 K, using the temperature limits of Pecaut & Mamajek (2013), leaving 58,974 stars in our parent GK dwarf population. The distribution of luminosities of these stars, computed as R*^2 T_eff^4 in Solar units, is shown in Figure 2.
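The luminosity in Solar units follows from the Stefan-Boltzmann relation L ∝ R^2 T_eff^4; a short sketch (the solar effective temperature value is our assumption):

```python
TEFF_SUN = 5772.0  # K, nominal solar effective temperature (assumed)

def luminosity_solar(radius_rsun, teff_k):
    # L/Lsun = (R/Rsun)^2 * (Teff/TeffSun)^4, from L ∝ R^2 Teff^4
    return radius_rsun ** 2 * (teff_k / TEFF_SUN) ** 4
```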
The largest remaining GK star has a stellar radius of 1.517 R⊙. Burke & Catanzarite (2017) state that the per-star detection completeness described in §4.1 is invalid for stars of radius > 1.25 R⊙ because completeness is characterized only for transit durations below the maximum 15 hours searched by the Kepler pipeline. We extend this limit to R* > 1.35 R⊙ because we restrict our orbital periods to 400 days, which keeps transit durations under 15 hours for our GK stellar population. We do not impose the stellar radius criterion in our baseline, instead opting for the physically motivated selection based on the Evol flag (though we use the R* > 1.35 R⊙ radius cut when using the DR25 stellar properties catalog in §6.4.1). Our baseline stellar population has 1,051 stars, or 1.78%, with R* > 1.35 R⊙. The maximum transit duration across our baseline population for a 400-day period and eccentricity = 0 is 14.73 hours. The distribution of transit durations for our baseline stellar population assuming a 400-day period and eccentricity = 0 is shown in Figure 3, which shows that our population approaches, but does not exceed, the 15-hour duration limit.

The Planetary Catalog
Our planetary catalog is the Q1-Q17 DR25 Kepler Object of Interest (KOI) table at the exoplanet archive 2 (Thompson et al. 2018), restricted to planet candidates (KOIs with koi_pdisposition = CANDIDATE) on stars in the catalog from Berger et al. (in prep). We accept the CANDIDATE and FALSE POSITIVE dispositions resulting from the uniform Robovetter run on the TCEs.
For our baseline case, we recompute the planet radii R_p (in Earth radii) from the stellar radii R* (in Solar radii) in Berger et al. (in prep) and the ratio of the planet radius to the stellar radius, A = R_p/R*, from the koi_ror column of the KOI table, as R_p = A R* (R⊙/R⊕), where R⊙ (R⊕) is the Solar (Earth) radius. We compute the planet radius uncertainties σ_Rp from the stellar radius uncertainties σ_R* and the radius ratio uncertainties σ_A via standard propagation of uncertainties:

σ_Rp = R_p sqrt( (σ_A/A)^2 + (σ_R*/R*)^2 ),

where the upper and lower uncertainties are computed independently.
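These two formulas can be sketched directly (the conversion constant is approximate, and the single symmetric uncertainty shown here is a simplification of computing the upper and lower uncertainties independently):

```python
import math

RSUN_OVER_REARTH = 109.2  # approximate ratio of Solar to Earth radius

def planet_radius_rearth(ror, rstar_rsun):
    # Rp (Earth radii) from A = Rp/R* (koi_ror) and R* (Solar radii)
    return ror * rstar_rsun * RSUN_OVER_REARTH

def planet_radius_sigma(ror, rstar_rsun, sigma_ror, sigma_rstar):
    # standard propagation for a product: relative errors in quadrature
    rp = planet_radius_rearth(ror, rstar_rsun)
    return rp * math.hypot(sigma_ror / ror, sigma_rstar / rstar_rsun)

rp = planet_radius_rearth(0.01, 1.0)  # ~1.09 Earth radii
```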

THE COMPLETENESS MODEL
As described in §1, the set of planet candidates in the DR25 KOI catalog is not expected to be complete: particularly near the Kepler detection limit we expect that some transiting planets will be detected while others will be missed, and some of those detected will be misclassified as false positives. Detection completeness is a measure of the fraction of true transiting planets that are detected. Vetting completeness is a measure of the fraction of detected true transiting planets that are correctly classified as planet candidates. We expect detection and vetting completeness to be functions of the orbital period and the signal-to-noise ratio (SNR), which in the Kepler data processing pipeline is measured as the Multiple Event Statistic (MES; Jenkins 2002). MES measures the combined significance of all observed transits in the de-trended, whitened light curve.
Detection and vetting completeness are both measured using the DR25 transit injection data products 7 (Christiansen 2017). Christiansen et al. (2013, 2015, 2016) used these injection products to produce average detection curves as a function of MES for various stellar populations. We use the star-by-star detection completeness model of Burke & Catanzarite (2017), derived from a comprehensive database of 1.2 × 10^8 transit injection and recovery trials, which we summarize in §4.1. In §4.2 we introduce a new probabilistic approach to modeling vetting completeness.

7 https://exoplanetarchive.ipac.caltech.edu/docs/KeplerSimulated.html

Combined Detection and Vetting Completeness
We use the characterization of Kepler detection completeness computed by a modified version of the KeplerPorts code base 3 . This software computes a completeness function η_s(p, r) as a function of period p and planet radius r for each star s. The completeness function is described in detail in Burke & Catanzarite (2017). We briefly summarize the main steps for calculating the completeness function and describe the augmentations that incorporate vetting completeness.
The detection completeness calculation begins with estimating the MES expected for a given planet period and radius based on the stellar properties of the host. For each period and radius, a central-crossing transit depth is estimated based on the stellar properties and limb darkening provided by the stellar catalog. The central-crossing transit depth is converted into an expected MES by interpolating in the tabulated values of the one-sigma depth functions for each target (Burke & Catanzarite 2017). The one-sigma depth function is the signal depth that results in an MES of one, and is a function of the planet period and the expected transit duration. The resulting expected MES is mapped to detection completeness based on analysis of the injected data. We treat this pipeline detection completeness estimate and the vetting completeness as independent, so the detection completeness is multiplied by the vetting completeness function ρ(p, expectedMES, θ) described in §4.2.
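The interpolation step can be sketched for a single target; the tabulated periods and one-sigma depths below are invented placeholders for the actual DR25 products:

```python
import numpy as np

# Hypothetical one-sigma depth function: the transit depth (ppm) that
# would yield an expected MES of one, tabulated against period (days)
periods_d       = np.array([50.0, 100.0, 200.0, 400.0])
one_sigma_depth = np.array([20.0, 25.0, 32.0, 45.0])

def expected_mes(period_d, depth_ppm):
    # expected MES = central transit depth / interpolated one-sigma depth
    return depth_ppm / np.interp(period_d, periods_d, one_sigma_depth)
```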
This produces the combined detection and vetting completeness for a central transit. The impact of noncentral transits is accounted for through MES smearing, which convolves completeness with a distribution derived from analysis of the injected data. Completeness is then multiplied by the tabulated window function, which accounts for observational gaps for this star and the Kepler pipeline's requirement of at least three transit events for a detection, and by the geometric transit probability assuming a uniform distribution of the cosine of the orbital inclination angle.
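The depth-to-MES step above can be sketched numerically. The following is a minimal sketch, assuming a tabulated one-sigma depth function (OSDF) for a single target; the table values and function name are invented for illustration (this is not the KeplerPorts API), and the OSDF's dependence on transit duration is omitted.

```python
import numpy as np

def expected_mes(depth_ppm, period_days, osdf_periods, osdf_depths_ppm):
    """Convert a central-transit depth to an expected MES by interpolating
    the tabulated one-sigma depth function (OSDF) at the planet's period.
    The OSDF is the depth that yields an MES of one sigma for this target,
    so the expected MES is depth / OSDF(period)."""
    one_sigma_depth = np.interp(period_days, osdf_periods, osdf_depths_ppm)
    return depth_ppm / one_sigma_depth

# Invented OSDF table for one target: deeper signals are needed at long
# periods, where fewer transits contribute to the MES.
periods = np.array([10.0, 100.0, 300.0, 500.0])
osdf = np.array([20.0, 60.0, 110.0, 150.0])  # ppm per unit MES

mes = expected_mes(depth_ppm=600.0, period_days=100.0,
                   osdf_periods=periods, osdf_depths_ppm=osdf)
```

In the full calculation this expected MES is then mapped to a detection probability using the injection results and multiplied by the vetting completeness.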
The output is a collection of completeness functions η_s(p, r), one for each star s, which includes detection completeness, vetting completeness, and geometric transit probability. We sum these functions to create η(p, r) = Σ_{s=1}^{N_*} η_s(p, r), where N_* is the number of searched stars. The summed completeness η(p, r) is used in the occurrence rate calculations in §6.
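The assembly of per-star completeness functions and their sum over stars can be sketched as follows; all per-star factors are made-up arrays standing in for the quantities derived from the injection products and window functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stars, n_p, n_r = 3, 20, 15

# Hypothetical per-star completeness factors on a shared (period, radius)
# grid; the real factors come from the DR25 injection data products.
detection = rng.uniform(0.2, 1.0, (n_stars, n_p, n_r))  # pipeline detection
vetting = rng.uniform(0.7, 1.0, (n_stars, n_p, n_r))    # Robovetter PC rate
window = rng.uniform(0.8, 1.0, (n_stars, n_p, 1))       # >= 3 transits criterion
geometric = 0.005 * np.ones((n_stars, n_p, 1))          # transit probability (schematic)

# eta_s(p, r) for each star, then the summed completeness eta(p, r).
eta_s = detection * vetting * window * geometric
eta = eta_s.sum(axis=0)
```

Dividing a count of detected planets in a (period, radius) cell by this summed completeness corrects it to a per-star occurrence.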

Vetting Completeness
Vetting completeness is the fraction of detected TCEs that were correctly vetted as PCs. This vetting is uniformly performed on both the observed TCEs and the injected-data TCEs, both resulting from the Kepler data analysis pipeline, with the DR25 Robovetter (Coughlin 2017; Thompson et al. 2018) using the same thresholds in both cases. Because in the injected data every TCE is by definition a PC, vetting completeness is simply the fraction of injected on-target TCEs that were identified as PCs by the Robovetter. We study the dependence of the injected vetting completeness on TCE period and expected MES by binning the detected injected TCEs on a regular grid. Our approach treats vetting completeness as a statistical property of a stellar population, analyzed separately for each choice of stellar population, stellar properties, or other choices that may change the stellar or planet population. We present our vetting completeness analysis of the baseline GK star population in detail. Other cases described in §6.4 require independent analysis, which can be found in the htmlArchive folders on the paper GitHub site 3. Previous treatments of vetting completeness, e.g. Thompson et al. (2018), partition the expected MES-orbital period plane into cells and divide the number of injected TCEs vetted as PC by the total number of injected TCEs in each cell, yielding an estimate of the vetting completeness in that cell. Mulders et al. (2018) do the same on a radius-orbital period plane, and Coughlin (2017) does the same with multiple parameter combinations (MES, period, planet radius, stellar radius, stellar temperature, and insolation flux). Using these data, one can estimate the dependence of the vetting completeness on expected MES or planet radius and orbital period from the measured fraction in each cell via, for example, χ² fitting to a parametric model assuming a Gaussian likelihood, as done in Mulders et al. (2018).
When there are many TCEs and many detections, this method can be expected to work well. Near the Kepler detection limit, however, there are few TCEs and fewer correct PC dispositions, leading to strong gridding dependence and the possibility of adjacent cells having very different values. For example, if adjacent cells each contain only one TCE, and one is vetted as PC while the other is vetted as false positive, then these adjacent cells will have completeness 1 and 0, respectively. Cells with no TCEs require special treatment. Addressing these problems by requiring cells large enough to contain many TCEs can smooth out details of the functional dependence of vetting completeness.
Rather than fitting a parametric model of a particular functional form using χ 2 methodology, we take a probabilistic approach using a binomial likelihood that readily handles sparsely populated regions of parameter space. Specifically, we treat the injected TCEs as a collection, with a rate ρ of (correctly) vetted PCs and a rate (1 − ρ) of (incorrectly) vetted FPs, and vetting by the Robovetter as draws from this collection. This is a classic binomial problem, in which the probability distribution of correctly vetting a PC depends on ρ and the number of TCEs in the underlying collection. For example, if there is only one TCE in a cell with a ρ = 50% probability of being vetted as a PC, the probability distribution of (correctly) vetting that TCE as a PC is extremely broad, with equal probability of PC or FP. Thus it is expected that such adjacent cells with single TCEs will have vetting completeness 0 and 1, with equal likelihood. Cells with no TCEs are gracefully handled because they have zero probability of a vetted PC. This allows the use of fine grids that can detect details of the functional dependence of vetting completeness.
By partitioning the expected MES-period space with a grid and computing ρ in each grid cell, we can measure the dependence of ρ on expected MES and period, inferring the function ρ(p, m); this is what we do in the next section. Figure 4 shows the number of TCEs in each grid cell detected by the Kepler pipeline in the injected data at the correct ephemeris in the GK baseline stellar population. The injections were made with expected MES between about 8 and 15, and periods less than 500 days (see Christiansen 2017, for details). Figure 5 shows the percentage of TCEs in each cell that were vetted as PC by the Robovetter. Perfect completeness is 100%. We see that for high expected MES and periods < 200 days the completeness in each cell is typically near 100%, while for periods > 300 days and expected MES < 15 the completeness drops off. We characterize this behavior using a function ρ(p, m, θ) of planet period p and expected MES m, where the exact form of ρ is specified below and θ is the vector of function parameters. Given a specific form for ρ, we infer θ from the number of injected TCEs that are correctly vetted as PC in each cell.
We will infer our rate function ρ(p, m, θ) via MCMC Bayesian inference. We treat each grid cell as an independent, identically distributed binomial realization, which leads to the likelihood

L(n, c|θ) = Π_{i,j} (n_{i,j} choose c_{i,j}) ρ(p_i, m_j, θ)^{c_{i,j}} (1 − ρ(p_i, m_j, θ))^{n_{i,j} − c_{i,j}},  (2)

where n = {n_{i,j}} is the set of the number of injected TCEs in each cell, and c = {c_{i,j}} is the set of the number of injected TCEs vetted as PC in cell (i, j).
We perform the MCMC inference using the emcee package 8, which requires the log likelihood log L(n, c|θ) = Σ_{i,j} [c_{i,j} log ρ(p_i, m_j, θ) + (n_{i,j} − c_{i,j}) log(1 − ρ(p_i, m_j, θ))] + const. We considered several functional forms for ρ(p, m, θ), described in Appendix A. Figure 5 suggests a product of functions that are approximately, but not exactly, coordinate aligned. Qualitatively, Figure 5 also suggests that a generalized logistic function may be a good fit. We construct many, though not all, of the functional forms considered in this paper from this generalized logistic function.
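A minimal sketch of the per-cell binomial log likelihood follows; the grid counts are invented for illustration, and the θ-independent binomial coefficient is dropped because it does not affect the sampling (the paper samples θ with emcee, which is not shown here).

```python
import numpy as np

def log_likelihood(rho, n, c):
    """Binomial log likelihood over grid cells, dropping the theta-independent
    binomial coefficient:
        sum_ij [ c_ij log rho_ij + (n_ij - c_ij) log(1 - rho_ij) ].
    rho : model PC rate per cell (clipped away from 0 and 1)
    n   : number of injected TCEs per cell
    c   : number of those TCEs vetted as PC per cell
    Cells with n_ij = 0 contribute zero, so sparse regions of parameter
    space are handled without special cases."""
    rho = np.clip(rho, 1e-12, 1.0 - 1e-12)
    return float(np.sum(c * np.log(rho) + (n - c) * np.log(1.0 - rho)))

n = np.array([[0, 1], [1, 8]])  # injected TCE counts; note the empty cell
c = np.array([[0, 1], [0, 6]])  # correctly vetted PCs per cell
flat_model = 0.75 * np.ones_like(n, dtype=float)  # a constant-rate model
ll = log_likelihood(flat_model, n, c)
```

A sampler then explores θ by evaluating this function with ρ(p_i, m_j, θ) in place of the constant-rate model.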
Appendix A describes our use of the Akaike Information Criterion (AIC) and other considerations to select the form of ρ that best fits the data. In all cases, before applying the function we transform from (period, expected MES) coordinates to homogeneous coordinates on the unit square [0, 1] × [0, 1], which allows rotation. Of the functions we considered, we find that a product of a non-rotated simplified logistic function in period p times a rotated logistic in p and expected MES m best fits the data. We used the uniform priors −1 ≤ x_0, y_0 ≤ 2, 10^{−4} < k_x, k_y < 10^4, −180 < φ < 180, and 0 < A < 1, and initialized θ by minimizing − log(L) using the Python optimize package. Our MCMC computation used 100 walkers, and ran for 5000 steps after 5000 steps of burn-in. Figure 6 shows the resulting posteriors. Figure 7 shows the resulting rate function ρ(p_i, m_j, θ̂) evaluated at the median of the posteriors θ̂, along with the underlying rates for each grid cell. Figure 8 shows two example positions on the expected MES-orbital period plane, illustrating both the dependence of vetting completeness on these parameters and the spread of vetting completeness due to the posterior θ distribution. We find that the approach described in this section is robust against changes in the grid: changing the grid resolution does not significantly change the results, so long as the resolution is sufficient to resolve features in the underlying data. Figure 9 shows the mean and standard deviation of 1000 realizations, drawn from the posterior θ distribution, of the fraction of correctly vetted PCs for each cell. This figure should be compared with Figures 4 and 5. As expected, where the number of TCEs per cell is low in Figure 4, the standard deviation is high. Figure 10 shows the residual of the observed data in Figure 5 from the mean rate in Figure 9, in units of the standard deviation shown in Figure 9.
We see that while there are isolated large outliers, as well as larger residual values where the standard deviation is high, there is no indication of a bias in ρ(p_i, m_j, θ).
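A rotated-logistic rate function of the kind described above can be sketched as follows; the functional form, parameter values, and coordinate ranges here are illustrative stand-ins for the specific form selected in Appendix A.

```python
import numpy as np

def logistic(x, x0, k):
    """Simple logistic Y rising from 0 to 1 with midpoint x0 and steepness k."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def rho(p, m, theta, p_range=(0.0, 500.0), m_range=(8.0, 25.0)):
    """Illustrative product form: a non-rotated logistic in period times a
    rotated logistic in (period, expected MES), scaled by amplitude A.
    (Period, MES) are first mapped to the unit square so that the rotation
    angle phi (degrees) is well defined."""
    x0, y0, kx, ky, phi, A = theta
    x = (p - p_range[0]) / (p_range[1] - p_range[0])
    y = (m - m_range[0]) / (m_range[1] - m_range[0])
    t = np.radians(phi)
    # Rotate (x, y) about (x0, y0) by phi degrees; keep the rotated x.
    xr = x0 + (x - x0) * np.cos(t) - (y - y0) * np.sin(t)
    return A * logistic(x, x0, kx) * logistic(xr, x0, ky)

val = rho(p=100.0, m=15.0, theta=(0.2, 0.5, 10.0, 5.0, 0.0, 0.95))
```

With phi = 0 the rotation is the identity, so the value at the midpoint x0 is A × 0.5 × 0.5.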

THE RELIABILITY MODEL
In this section we characterize the reliability of planet candidates in the Q1-Q17 DR25 KOI catalog. We apply the probabilistic approach of §4.2 to the problem of characterizing the probability that a DR25 planet candidate is in fact a false alarm due to instrumental systematics or some types of stellar variability. We rely on the Q1-Q17 DR25 False Positive Probabilities table at the NASA Exoplanet Archive 2 (Morton et al. 2016) to provide the probability that the planet candidate is a false positive due to astrophysical signals that imitate transits. The final reliability for each planet candidate is the product of the false alarm reliability and the astrophysical reliability (one minus the false positive probability).

Vetting Reliability

Thompson et al. (2018) defined reliability as the ratio of the number of PCs that are true exoplanets, T_PC, to the number of observed planet candidates, N_PC:

R = T_PC / N_PC = 1 − (N_FP / N_PC) (1 − E) / E,  (6)

where N_FP is the number of observed false positives and E ≡ N_FP / T_FP is the false positive effectiveness, defined as the number of identified FPs, N_FP, divided by the number of true FPs, T_FP. The second equality in equation (6) is exact when all quantities are from the same population, such as the observed data analyzed by the DR25 catalog. Unfortunately, the true PCs and false positives, T_PC and T_FP, are unknown for the observed data. As explained in Thompson et al. (2018), however, we can use the inverted and scrambled data sets 7 described in §1, which are designed so that every detection is, by definition, a false alarm.
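This relation between reliability, the observed false positive count, and effectiveness can be checked numerically; the counts below are invented for illustration.

```python
def reliability(n_pc, n_fp, effectiveness):
    """Reliability in the style of Thompson et al. (2018):
        R = 1 - (N_FP / N_PC) * (1 - E) / E,
    where E = N_FP / T_FP is the fraction of true false positives that the
    Robovetter actually identifies. High E means few false positives leak
    into the PC list, so reliability stays high even when FPs outnumber PCs."""
    return 1.0 - (n_fp / n_pc) * (1.0 - effectiveness) / effectiveness

# Invented counts: 100 PCs, 300 identified FPs, 99% effectiveness.
r = reliability(n_pc=100, n_fp=300, effectiveness=0.99)
```

Even with three times as many identified false positives as planet candidates, a 99% effectiveness leaves the PC sample about 97% reliable; the sensitivity to E is why it must be measured carefully.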
Astrophysical transit-like events such as KOIs or eclipsing binaries can trigger detections in the inverted and scrambled data, compromising the use of these data to measure false alarms. This happens in two ways: 1) the transits and eclipses add signals unlike the false alarms we are trying to measure, and 2) the Robovetter is not tuned to detect and remove these kinds of signals. Thompson et al. (2018) describe in their §2.3.3 how the lists of inverted and scrambled detections were cleaned of signals from known transiting systems. Essentially, targets that are known binaries (Kirk et al. 2016) and known KOIs (Thompson et al. 2018) are removed from the list of detections, so they count as neither an FP nor a PC. For the inverted set, because self-lensing and heartbeat-binary events can produce signals that look like inverted transits, 54 targets with significant periodic signals were also removed from the list. The detections dropped from the inverted and scrambled data used in this study, as well as in Thompson et al. (2018), are collected in the files kplr_droplist_inv.txt and kplr_droplist_scr*.txt (one for each scrambled data set) on the paper GitHub site 3. These stars are removed from the inverted/scrambled data before the analysis described in this section.

Characterization of False Alarm Effectiveness EFA
To determine the false alarm effectiveness, E FA , we combine the inverted data with each of the three scrambled data sets (see Coughlin 2017, for details on the properties of each set) to create three data sets, called "inverted/scrambled", where every TCE should be considered a false alarm. We proceed as in §4.2, covering the period-(observed) MES plane with a regular grid, and measure the ratio of the number of false alarms to the number of TCEs in each cell. This problem is more challenging than the analysis of vetting completeness in §4.2 because the Robovetter has been tuned to do a very good job correctly identifying false alarms, resulting in relatively few cells with TCEs incorrectly vetted as PCs.
We therefore combine the three inverted/scrambled data sets through concatenation in order to produce a somewhat stronger signal. This amounts to averaging the three data sets at the input level, avoiding the small-number statistics issues that would arise if we fit the three data sets separately and averaged the resulting posteriors. We refer to this concatenated data set as the combined inverted/scrambled data. Figure 11 shows the number of TCEs detected in the combined inverted/scrambled data. We see that most detected TCEs in these data are at MES < 15 and period ≥ 250 days. Figure 12 shows the fraction of correctly vetted false alarms, a measure of E_FA, in each cell. The signal we are measuring is small, and is dominated by smaller fractions at low MES and period ≤ 200 days.
We use the same likelihood as in §4.2, equation (2), where in this case E_FA plays the role of ρ, n_{i,j} is the number of TCEs detected in the combined inverted/scrambled data in cell (i, j), and c_{i,j} is the number of false positives identified in cell (i, j). We perform the MCMC inference as described in §4.2. We considered several functions, described in Appendix A.2, and determined that a simple rotated logistic function best describes this data set. For θ = [x_0, k_x, φ, A], E_FA(p, m, θ) is given by a rotated logistic, equation (9), built from the logistic function Y of equation (4), where p is the orbital period and m is the observed MES. We used the uniform priors −1 ≤ x_0 ≤ 2, 10^{−4} < k_x < 100, −180 < φ < 180, and 0 < A < 1. The MCMC run used a hand-tuned initial condition because the maximum likelihood solution found by Python's optimize package was physically unreasonable (A ≫ 1, for example) and violated the prior. Our MCMC computation used 100 walkers, and ran for 5000 steps after 5000 steps of burn-in. Figure 13 shows the resulting posteriors. The residual of the observed effectiveness in Figure 12 from the mean of 1000 realizations drawn from the posterior θ distribution, in units of standard deviation, is shown in Figure 15, demonstrating an overall good fit to the data.

Characterization of the Observed False Alarm Rate FFA
For F_FA we count the number of false positives found in the observed data, but we must be careful to consider only non-transit-like false alarms in order to be consistent with our characterization of effectiveness. We identify such false alarms by selecting on the not-transit-like (NTL) flag: NTL = 0 indicates that the Robovetter identified the false positive as transit-like, which identifies 21 candidate astrophysical false positives in our GK population inside 50 ≤ period ≤ 600 days and 0.5 ≤ radius ≤ 15 R⊕. We manually examined these FPs and identified those that show a consistent astrophysical signal in all transits as astrophysical false positives. Two TCEs with NTL = 0 did not show such a consistent astrophysical signal and are deemed likely false alarms: 004371172-01 and 009394762-01. The other 19 FPs with NTL = 0 were identified as astrophysical and removed from the set of FPs used in the analysis of F_FA. Figure 16 shows the number of TCEs detected in the observed data. We see that most detected TCEs in these data are at MES < 15 and period ≥ 250 days. The close correspondence with Figure 11 shows that most of these TCEs are false alarms. Figure 17 shows the fraction of identified false alarms (identified via the NTL flag as described above), a measure of F_FA, in each cell.
We proceed in a manner very similar to the inference of E_FA in §5.1.1. We use equation (2) as the likelihood, with F_FA playing the role of ρ, where n_{i,j} is the number of TCEs detected in the observed data in cell (i, j), and c_{i,j} is the number of false alarms identified in cell (i, j). We perform the MCMC inference as described in §4.2. We considered several functions, described in Appendix A.3, and determined that the same simple rotated logistic function as that used in §5.1.1 best describes this data set. In equation (9), F_FA replaces E_FA, providing F_FA(p, m, θ). We used the uniform priors −1 ≤ x_0 ≤ 2, 10^{−4} < k_x < 100, −180 < φ < 180, and 0 < A < 1, and initialized θ by minimizing − log(L) using the Python optimize package. Our MCMC computation used 100 walkers, and ran for 5000 steps after 5000 steps of burn-in. Figure 18 shows the resulting posteriors, giving x_0 = 0.680 +0.027/−0.029, k_x = 14.053 +1.447/−1.304, φ = −157.409 +3.588/−3.505, and A = 0.983 +0.004/−0.004. The rate function F_FA(p_i, m_j, θ̂) for the posterior median is shown in Figure 19. As in §4.2, 1000 realizations of the FP rate function were created, drawing from the posterior θ distribution. The residual of the observed false alarm fraction in Figure 17 from the mean of these realizations, in units of standard deviation, is shown in Figure 20, demonstrating an overall good fit to the data.

Computing the False Alarm Reliability RFA
Once we have the rate functions F FA and E FA , we can compute the false alarm reliability R FA (p, m) from equation (8). In practice we evaluate F FA and E FA at a desired period and observed MES, either on a regular grid or for specific planet candidates. Figure 21 shows the resulting reliability function in the period-MES plane. We see that for low MES there is decreased reliability around period 250 to 450 days, corresponding to the high number of TCEs in that range found in the inverted/scrambled data (see Figure 11), consistent with the excess of detections in Figure 1. Figure 22 shows the reliability function evaluated over the full posteriors of F FA and E FA for three example periods and observed MES. We see that for low MES near 1-year orbital periods the reliability drops to about 0.6 and has a large spread.

Astrophysical Reliability
The reliability function determined in §5.1 only provides the probability that a planet candidate is not a false alarm. To determine the probability that a candidate is not an astrophysical false positive, such as a grazing or eclipsing binary, we use the Q1-Q17 DR25 False Positive Probabilities 2 created using the technique developed in Morton et al. (2016). These probabilities were computed for all KOIs based largely on photometric data, including transit light curves and measured magnitudes. We therefore assume that they are still valid even though we are using different stellar properties. We define the astrophysical reliability of a planet candidate as one minus the false positive probability of that candidate.

Computing the Reliability for Each Planet Candidate
We compute the reliability for each planet candidate by first evaluating R FA (p, m) as described in §5.1.3, where p is the observed orbital period and m is the observed MES of the planet candidate from the KOI catalog. Then we define the reliability R = R FA (p, m) · (1 − FPP) where FPP is the false positive probability for that planet candidate from the Q1-Q17 False Positive Probabilities table.
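The per-candidate combination of the two reliability factors can be sketched as follows; the candidate names and values are hypothetical, and in practice r_fa comes from evaluating R_FA(p, m) at the candidate's period and observed MES.

```python
# Hypothetical candidates with a false alarm reliability (r_fa) and a
# false positive probability (fpp) from the DR25 FPP table.
candidates = [
    {"koi": "hyp-1", "r_fa": 0.95, "fpp": 0.02},
    {"koi": "hyp-2", "r_fa": 0.60, "fpp": 0.30},
]

# Final reliability: false alarm reliability times astrophysical
# reliability (1 - FPP).
for cand in candidates:
    cand["reliability"] = cand["r_fa"] * (1.0 - cand["fpp"])
```

Because the two factors multiply, a candidate must score well on both tests to be considered reliable.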

ILLUSTRATIVE OCCURRENCE RATES
We present several illustrative occurrence rates, focusing on long-period, small planets, where vetting completeness and reliability have the greatest impact. We compute our occurrence rates with the method of Burke et al. (2015), modeling occurrence as a Poisson point process with a rate given by a product of power laws in orbital period and planet radius. We perform our occurrence rate analysis over the range 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕, and integrate the resulting rate over two ranges considered by Burke et al. (2015):
• F1: 50 ≤ period ≤ 200 days and 1 ≤ radius ≤ 2 R⊕, and
• ζ⊕: within 20% of Earth's orbital period and radius.
Figure 23 shows our baseline planet candidate population, with the planet markers sized and colored by that planet's reliability, and the background and contours showing the completeness function η(p, r), including geometric transit probability. The F1 and ζ⊕ regions are indicated by boxes. We see that while the F1 region is reasonably well populated, it has a large completeness correction of ∼500. ζ⊕, however, contains only one low-reliability planet and has a completeness correction > 10^4, leading to large uncertainties in the estimate of ζ⊕.

Methodology
Following Youdin (2011) and Burke et al. (2015), we study the number of planets per star as a function of orbital period p and planet radius r, f(p, r), by inferring the population rate function λ(p, r) ≡ d²f/dp dr from a collection of planet detections at (p_i, r_i) with a known completeness function η(p, r) and reliability R_FA. If λ(p, r, θ) is a specific function parameterized by the parameter vector θ, then our problem is to determine θ. We proceed by Bayesian inference: given a set of planet candidates with orbital periods and radii {p_i, r_i}, i = 1 ... N_p, where N_p is the number of planet candidates, by Bayes' theorem the probability of θ is

P(θ|{p_i, r_i}) ∝ P({p_i, r_i}, i = 1 ... N_p|θ) π(θ),  (10)

where π(θ) is a prior on θ. Fixing π(θ), finding the highest-probability θ amounts to maximizing the likelihood P({p_i, r_i}, i = 1 ... N_p|θ).
In Appendix B we show that maximizing the likelihood is equivalent to treating planet occurrence as a Poisson point process with rate λ(p, r, θ) that depends on period, planet radius, and parameters θ, with likelihood

P({p_i, r_i}|θ) = e^{−Λ(D)} Π_{i=1}^{N_p} η(p_i, r_i) λ(p_i, r_i, θ),  (11)

where Λ(D) = ∫_D η(p, r) λ(p, r, θ) dp dr is the integral over the whole period-radius space D, and η(p, r) is the summed completeness function from §4. However, we point out that equation (11) is not itself a Poisson probability, as is sometimes implied in the literature.
The likelihood in equation (11) accounts for completeness but not reliability. Because this likelihood is derived from a Poisson distribution, which is defined only for discrete integer counts, we cannot account for reliability by weighting a planet's contribution by its reliability. We instead address reliability by performing multiple Bayesian inferences of θ using equation (10), drawing from the planet candidates according to their reliability. For example, a planet candidate with reliability 0.9 is included in 90% of these inferences, while another planet candidate with reliability 0.2 is included in 20% of them. The θ posteriors of these inferences are then concatenated to produce the posterior distribution of θ accounting for reliability.
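The reliability-sampling procedure above can be sketched as follows; the candidate reliabilities are invented, and run_inference is a stand-in for a full MCMC inference that here just returns fake posterior samples whose mean tracks the number of included candidates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented reliabilities for five hypothetical planet candidates.
reliability = np.array([0.99, 0.95, 0.90, 0.40, 0.15])

def run_inference(included_mask):
    """Stand-in for one Bayesian inference over the included candidates;
    returns fake theta posterior samples for illustration only."""
    n_included = included_mask.sum()
    return rng.normal(loc=float(n_included), scale=0.1, size=200)

posteriors = []
for _ in range(100):
    # Each candidate enters this run with probability equal to its reliability.
    mask = rng.random(reliability.size) < reliability
    posteriors.append(run_inference(mask))

# Concatenating the per-run posteriors yields the reliability-marginalized
# posterior distribution.
combined = np.concatenate(posteriors)
```

Low-reliability candidates appear in few runs, so they contribute weakly to the combined posterior, which is the desired behavior.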
Our domain of analysis includes Earth's orbital period and size, so we are not, strictly speaking, extrapolating to this regime. But, as described above, our approach to reliability with a Poisson likelihood will remove low-reliability planets near Earth's orbit and size from many, though not all, of the runs, effectively resulting in an extrapolated result.
Following Youdin (2011) and Burke et al. (2015), we model the planet candidate population rate λ(p, r, θ) as a product of power laws in period and radius. Inspired by Foreman-Mackey's implementation of Burke et al. (2015) 9, we adapt the form resulting from solving explicitly for the normalization C_n in Burke's equation (8) and using it in his unbroken power law equation (7) (Burke et al. 2015): for θ = (F_0, α, β),

λ(p, r, θ) = F_0 [(α + 1) p^α / (p_max^{α+1} − p_min^{α+1})] [(β + 1) r^β / (r_max^{β+1} − r_min^{β+1})],

where (p_min, p_max) and (r_min, r_max) bound the analysis domain D. This form ensures that ∫_D λ(p, r, θ) dp dr = F_0, so F_0 can be interpreted as the integrated planetary occurrence rate over the period-radius range used in the analysis.
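The normalized double power law and its key property, that it integrates to F_0 over the analysis box, can be sketched and checked numerically; the exponent values are arbitrary, and the normalization written here assumes α, β ≠ −1.

```python
import numpy as np

def rate(p, r, theta, p_min=50.0, p_max=400.0, r_min=0.75, r_max=2.5):
    """Separable power law lambda(p, r) = F0 * c_p * p^alpha * c_r * r^beta,
    with the constants chosen so the integral over the analysis box equals
    F0 (so F0 is the integrated occurrence rate). Illustrative sketch of
    the unbroken power law of Burke et al. (2015)."""
    F0, alpha, beta = theta
    c_p = (alpha + 1.0) / (p_max ** (alpha + 1.0) - p_min ** (alpha + 1.0))
    c_r = (beta + 1.0) / (r_max ** (beta + 1.0) - r_min ** (beta + 1.0))
    return F0 * c_p * (p ** alpha) * c_r * (r ** beta)

# Midpoint-rule check that the rate integrates to F0 over the box.
edges_p = np.linspace(50.0, 400.0, 2001)
edges_r = np.linspace(0.75, 2.5, 2001)
mid_p = 0.5 * (edges_p[:-1] + edges_p[1:])
mid_r = 0.5 * (edges_r[:-1] + edges_r[1:])
dp = edges_p[1] - edges_p[0]
dr = edges_r[1] - edges_r[0]
lam = rate(mid_p[:, None], mid_r[None, :], theta=(0.8, -0.5, -1.5))
integral = float(lam.sum() * dp * dr)
```

Building the normalization into the rate function means F_0 can be given a simple prior and read directly from the posterior.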

Baseline Results
To perform our Bayesian MCMC inference, we use the emcee package. To measure the impact of correcting for reliability, we run inferences both without and with the reliability correction. For our inference without reliability correction, we use 16 walkers and run for 5000 steps after 1000 steps of burn-in. For our inference with reliability correction, we run 100 inferences as described in §6.1, probabilistically sampling from the planet candidates according to their reliability, with each inference using 16 walkers and running for 2000 steps after 400 steps of burn-in. In both cases the walkers of each MCMC run are initialized in a small Gaussian distribution centered on the maximum-likelihood solution for that inference's planet population. The posteriors from each of the 100 inferences with reliability correction were thinned by a factor of 10 and concatenated to produce the θ posteriors. Table 1 shows the median and the 16th and 84th percentiles of these posterior distributions both with and without reliability correction. We see that reliability has an overall impact of about 30% on F_0, the integrated rate over our period and radius range of 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕. Figure 26 shows the marginalized population rate function λ(p, r, θ) for the posterior θ distribution, accounting for uncertainty. This figure also compares the predicted number of planet detections with the binned planet candidates.
Note—Baseline occurrence rate results comparing not accounting for reliability with accounting for reliability against both false positives and false alarms (middle column) and accounting for false alarm reliability only. Central values and error bars are the median and the 16th and 84th percentiles of the θ posteriors of the Bayesian inference.
Here Γ⊕ ≡ d²f/d log p d log r = p⊕ r⊕ λ(p⊕, r⊕, θ), evaluated at Earth's period and radius; F1 is the integrated planet rate over 50 ≤ period ≤ 200 days and 1 ≤ radius ≤ 2 R⊕; and ζ⊕ is the integrated rate within 20% of Earth's orbital period and size.
Figure 25 and Table 1 show F1 and ζ⊕, as well as Γ⊕ ≡ d²f/d log p d log r = p⊕ r⊕ λ(p⊕, r⊕, θ), with and without accounting for reliability, evaluated over all posterior values of θ. We see that even though there is significant overlap in the distributions with and without reliability, accounting for reliability has a strong impact: Γ⊕ and ζ⊕ are reduced by more than 50%, which can be understood from the very small number of low-reliability planets in the ζ⊕ region in Figure 23. F1 is the integrated rate over a region of higher reliability, but reliability still has a strong effect. F0 is the integrated rate over our entire period-radius analysis range, but it is dominated by the fact that there are more high-reliability planet candidates, so reliability has an impact similar to F1. Table 1 also shows the impact of accounting only for false alarm reliability, ignoring astrophysical false positive reliability, indicating that false alarm reliability accounts for about half the impact of the reliability correction. We also computed occurrence for the SAG13 definition of η⊕ 10 (https://exoplanets.nasa.gov/system/internal_resources/details/original/680_SAG13_closeout_8.3.17.pdf), 237 ≤ period ≤ 860 days and 0.5 ≤ radius ≤ 1.5 R⊕. Without accounting for reliability, we find η⊕ = 0.294 +0.173/−0.110, consistent with the SAG13 results, while accounting for reliability yields η⊕ = 0.111 +0.084/−0.048. This result should be treated with caution because it involves extrapolation beyond the domain of both reliability and detection completeness characterization.
We find that the impact of accounting for reliability is significant for small planets in long-period orbits. While the median values of occurrence rates in this regime are not much more than "one σ" apart, the observed shifts in the distributions, on the order of 40%, are systematic and clearly not due to statistical fluctuations.

Simple Estimates of the Impact of Input Uncertainty
A full treatment of uncertainties in occurrence rates is beyond the scope of this paper. Uncertainties in stellar properties would need to be accounted for in the selection of the parent stellar population, in the modeling behind the detection completeness, and in their impact on the Robovetter. In this work we do, however, produce uncertainties in the false alarm reliability in §5.1 through the MCMC posteriors of the fit functions, as well as planet radius uncertainties that follow from the stellar radius and transit fit uncertainties as described in §3.2. In this section we present simple experiments that examine the impact of the reliability and planet radius uncertainties on our occurrence rates. We study the impact of planet radius uncertainties separately from the impact of reliability uncertainty.

Impact of Planet Radius Uncertainty
We study the impact of planet radius uncertainties without accounting for reliability. We proceed in the same way that we study the impact of reliability, performing several inference runs with a planet population in each run selected after applying the planet radius uncertainties. Specifically, for each run, prior to restricting the planet candidate population to the radius range 0.75 ≤ radius ≤ 2.5 R⊕, we add to each planet's radius an error drawn from a Gaussian distribution with width equal to that planet's radius uncertainty. Each planet is randomly assigned its upper or lower error bar with 50% probability. The planet candidate population is then restricted to the range 0.75 ≤ radius ≤ 2.5 R⊕, and the inference is run.
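One draw of the radius-perturbed planet population can be sketched as follows; the radii and error bars are invented, and the half-Gaussian construction is one way to apply asymmetric error bars with equal probability, as described above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented planet radii (R_earth) with asymmetric error bars.
radius = np.array([0.8, 1.4, 2.4, 2.6, 0.7])
sigma_up = np.array([0.05, 0.10, 0.15, 0.20, 0.05])  # upper error bars
sigma_lo = np.array([0.04, 0.08, 0.12, 0.15, 0.04])  # lower error bars

def perturbed_population(radius, sigma_up, sigma_lo, r_min=0.75, r_max=2.5):
    """One realization of the planet population: each radius is shifted by a
    half-Gaussian draw using the upper or lower error bar (chosen with 50%
    probability), then the analysis box cut is applied."""
    use_up = rng.random(radius.size) < 0.5
    sigma = np.where(use_up, sigma_up, sigma_lo)
    shift = np.abs(rng.normal(0.0, sigma))
    new_radius = radius + np.where(use_up, shift, -shift)
    return new_radius[(new_radius >= r_min) & (new_radius <= r_max)]

sample = perturbed_population(radius, sigma_up, sigma_lo)
```

Planets near the box edges can enter or leave the population from draw to draw, which is exactly the effect the 1000 inference runs average over.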
The impact of planet radius uncertainties resulting from 1000 inference runs is shown in Table 2 and the top row of Figure 27. We see a small, consistent broadening of the width of the distributions and resulting error bars, and possibly a small systematic shift towards higher occurrence rates, but the overall impact is minor. We believe this is due to the smaller uncertainties resulting from using Gaia stellar properties, and from the fact that near the boundary of our planet size range there are many planets, so planets are equally likely to exit and enter our range due to uncertainty.

Impact of Reliability Uncertainty
We study the impact of uncertainties in reliability by modifying the method of computing occurrence rates with uncertainty described in §6.2. Prior to each inference run, we draw from the posteriors of the parameter vectors for false alarm effectiveness (§5.1.1) and the observed false alarm rate (§5.1.2). We use these draws to evaluate the false alarm effectiveness and observed rate functions at each planet candidate's period and observed MES, from which the false alarm reliability is computed. The reliability is then computed as described in §5.3, and each planet is included in that run with probability given by this computed reliability. In other words, the reliability rate function is realized for each run and applied to the planet candidate population. Table 3 and the bottom row of Figure 27 compare occurrence rates with reliability correction but no reliability uncertainty, computed with the median θ̂ of §5.1.1 and §5.1.2, with the reliability distribution that results from using the respective full θ posterior distributions. We see no significant impact due to reliability uncertainty in 1000 inference runs. This is in spite of the broad distributions of the low-reliability planet candidates shown in Figure 28, which shows the false alarm reliability values R_FA resulting from the θ posterior distributions. We believe this lack of impact on occurrence rates is due to the very small uncertainties for high-reliability targets (see Figure 22), combined with the facts that the low-reliability targets occur with less frequency in each inference and that most distributions in Figure 28 are close to symmetric, so they are well represented by their medians.
Note—A study of the impact of planet radius uncertainties, not accounting for reliability. See Table 1 for an explanation of the rows.
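One realization of this procedure for a single candidate can be sketched as follows; the posterior samples are invented scalar summaries standing in for the full θ posteriors, and the scalar relation used for the false alarm reliability is only a stand-in in the style of equation (6), not the paper's actual rate-function evaluation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented posterior samples: an effectiveness summary and a summary of the
# observed false-alarm-to-PC ratio, each reduced to one scalar per sample.
e_fa_samples = np.clip(rng.normal(0.995, 0.002, size=5000), 0.9, 1.0)
fa_ratio_samples = np.clip(rng.normal(0.5, 0.05, size=5000), 0.0, 2.0)

# For each inference run, draw one posterior sample, realize the candidate's
# false alarm reliability, and include the candidate with that probability.
included = []
for _ in range(200):
    i = rng.integers(e_fa_samples.size)
    e, f = e_fa_samples[i], fa_ratio_samples[i]
    r_fa = 1.0 - f * (1.0 - e) / e  # equation-(6)-style stand-in relation
    included.append(rng.random() < r_fa)
frac_included = float(np.mean(included))
```

For a high-reliability candidate like this one, the inclusion fraction barely moves as θ varies, which is consistent with the weak impact of reliability uncertainty reported above.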

Variations
In this section we explore the impact of changing some of the inputs and assumptions in the baseline occurrence rate computed in §6.2. Our motivation is to understand the dependencies of the occurrence rate on these inputs and assumptions. In all cases except §6.4.2, the same models were found to best fit the vetting completeness, false alarm effectiveness, and observed false alarm rate as in the baseline case, though the parameters of these models took different values for the different variations.
Note—A study of the impact of false alarm reliability uncertainties. See Table 1 for an explanation of the rows.
We present several tables with occurrence rate results that should be compared with Table 1. This comparison is shown graphically for F 1 and ζ ⊕ at the end of this section in Figure 31.

Using the Q1-Q17 DR25 Stellar Properties
Our baseline occurrence rates are substantially lower than several occurrence rates based on pre-Gaia stellar properties. In this section we repeat our analysis, replacing the Gaia-based catalog of Berger et al. (in prep) with the pre-Gaia Q1-Q17 DR25 stellar properties from the NASA Exoplanet Archive. We perform the same cuts as described in §3.1, with the exception that there is no cut on binary or evolved flags (these do not exist in the Q1-Q17 DR25 stellar properties) and we remove all stars with radius > 1.35 R⊙. The final catalog contains 75,541 GK stars.
We perform the same analysis as in the baseline case: computing the vetting completeness for this stellar catalog, the summed completeness function η, the reliability, and the occurrence rates specified in §6.2. Figure 29 shows the resulting planet population, summed completeness and reliability. Comparison with Figure 23 shows that this catalog has more planet candidates in our period-radius range than when using Berger et al. (in prep). This results in the higher occurrence rates shown in Table 4, which should be compared with Table 1.
The choice of catalog has a stronger impact on ζ⊕ than on F1: when not correcting for reliability, F1 based on the DR25 stellar properties is about 18% higher than our baseline using Berger et al. (in prep), while the DR25-based ζ⊕ is about 60% higher. When correcting for reliability, F1 based on the DR25 stellar properties is about 30% higher, while the DR25-based ζ⊕ is twice as high. Computing the SAG13 definition of η⊕ using the DR25 stellar properties without correcting for reliability yields η⊕ = 0.503^{+0.247}_{−0.165}, while correcting for reliability gives η⊕ = 0.220^{+0.143}_{−0.087}.

The Berger et al. (in prep) stellar catalog used in our baseline occurrence rates differs from the DR25 stellar catalog used in this section in both the values of the stellar properties themselves and in the cuts used to define the stellar parent population. In Appendix C we describe a study to determine the relative impact of the difference in stellar properties vs. the impact of the different population cuts on the difference in occurrence rates. We find that the difference in occurrence rates is primarily due to the different stellar properties, primarily stellar radius and effective temperature (leading to different GK selections), and that the differing population cuts have a minor impact.

Baseline with a Score Cut of 0.9
The Robovetter outputs a score for each TCE, indicating the confidence with which the Robovetter vetted that TCE (Thompson et al. 2018). This score is not equivalent to reliability: for example, the Robovetter confidently but incorrectly vetted several TCEs in the inverted/scrambled data as PCs, with scores as high as 0.923. But score is roughly correlated with reliability, and Thompson et al. (2018) suggests computing high-reliability occurrence rates by considering only planet candidates with Robovetter score above some threshold.
This will result in a smaller planet candidate population with lower completeness, but the resulting larger completeness correction will, in principle, correct the occurrence rate.
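The compensation described above can be made concrete with a toy numerical check (all numbers here are hypothetical, chosen only to illustrate the arithmetic): a stricter cut removes detections, but dividing by the correspondingly lower completeness recovers, in expectation, the same underlying count.

```python
# Hypothetical true number of planets in some period-radius cell.
n_true = 200

completeness_baseline = 0.80   # fraction of true planets that survive vetting
completeness_score_cut = 0.45  # stricter score cut: fewer survive

# Expected detected counts under each cut.
detected_baseline = n_true * completeness_baseline
detected_score_cut = n_true * completeness_score_cut

# Completeness-corrected estimates agree in expectation.
est_baseline = detected_baseline / completeness_baseline
est_score_cut = detected_score_cut / completeness_score_cut
print(est_baseline, est_score_cut)  # both 200.0
```

In practice the correction only works "in principle" as the text says: with few detections, the Poisson scatter in the detected count is amplified by the large correction factor, which is the small-statistics concern raised in §6.4.2.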
In this variation we impose an aggressive score cut, rejecting any planet candidate with score < 0.9. We use the Berger et al. (in prep) catalog, and compute the completeness and reliability as in the baseline case, treating any TCE with score < 0.9 as a false positive/alarm. Mulders et al. (2018) use this score cut in their analysis, but they consider a very different period-radius range, so their results are not directly comparable to ours.
The result is a smaller, higher-reliability planet candidate population, as shown in Figure 30, with noticeably lower completeness (compare the contours in Figure 23). In this case the false alarm vetting efficiency was best fit by a constant value of 0.999, resulting in a false alarm reliability very close to 1 over the entire period-radius range. The few planet candidates with lower reliability in Figure 30 owe that to their astrophysical false positive probability, which results in F1 and ζ⊕ being slightly suppressed, as shown in Table 5. This illustrates that score cuts cannot be relied on to provide a population that is high reliability with respect to astrophysical false positives.
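This point can be made concrete with a two-line computation: if total reliability is the product of a false-alarm term and an astrophysical term (our reading of §5.3; the factorization and all numbers below are illustrative), then a false-alarm reliability near 1 does not guarantee high total reliability.

```python
# Hypothetical example: even with false-alarm reliability ~1 (as found for
# the score >= 0.9 population), astrophysical false positives still lower
# the total reliability of an individual candidate.
r_false_alarm = 0.999  # constant false-alarm reliability fit in this section
p_astro_fp = 0.15      # hypothetical astrophysical false positive probability

r_total = r_false_alarm * (1.0 - p_astro_fp)
print(r_total)  # roughly 0.85: capped by the astrophysical term
```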
The agreement in ζ⊕ between this score cut and the baseline given in Table 1 is remarkable given the lack of planet candidates smaller than 2 R⊕ with orbital period > 220 days shown in Figure 30 (compare Figure 23). We interpret this agreement as an indication that the baseline ζ⊕ is dominated by extrapolation because, in the baseline population, long-period, small planets have low reliability, as discussed in §6.2. Because the baseline results and those requiring score ≥ 0.9 are essentially unconstrained extrapolations from radius > 2 R⊕ and orbital period < 220 days to smaller planets and longer periods, we believe it is premature to conclude that using this score cut provides accurate occurrence rates for radius < 2 R⊕ and orbital period > 220 days. In §7 we propose a strategy to explore this question.

Figure 28. The distribution of reliabilities assigned to planet candidates with median reliability < 0.9 in the reliability uncertainty study. As expected, the low-reliability candidates have broad distributions.

Figure 29. The planet candidate population when using the Q1-Q17 DR25 stellar properties, colored and sized by reliability. Compared with the baseline population in Figure 23 there are substantially more planets in both the F1 box (on the right) and at period > 300 days, leading to higher occurrence rates. See Figure 23 for a description of the elements of this figure.

Figure 30. The planet candidate population using the Berger et al. (in prep) stellar properties but only including planet candidates with a Robovetter score ≥ 0.9, colored and sized by reliability. Compared with the baseline population in Figure 23 there are substantially fewer planets in both the F1 box (on the right) and at period > 300 days, but also lower completeness leading to larger completeness corrections. Note the complete absence of small planets with orbital period > 220 days. See Figure 23 for a description of the elements of this figure.

Baseline Without Vetting Completeness
This variation measures the impact of not including vetting completeness. This will result in a smaller completeness correction where vetting completeness is low, so we expect somewhat lower long-period, small-planet occurrence rates. Comparison of Table 6 with Table 1 shows a small suppression in Γ⊕, F1 and ζ⊕ when not accounting for vetting completeness.

Note-Occurrence rate results for the baseline using Berger et al. (in prep) stellar properties and only including planet candidates with Robovetter score > 0.9. As expected, the result is a high-reliability population, so accounting for reliability makes little difference. Compare Table 1.

Baseline Without MES Smearing
This variation measures the impact of not smearing the MES in the calculation of detection completeness, described in §4.1. Comparison of Table 7 with Table 1 shows an increase in small-planet, long-period occurrence rates, reflected in increases in Γ⊕ and ζ⊕, with essentially no impact on F1.

Summary
The impact of the variations considered in this section on F1 and ζ⊕ is shown in Figure 31.

DISCUSSION
In this paper we show that a proper characterization of vetting completeness and reliability is important, particularly near the detection limit. In particular, in §6.2 we find that characterizing Kepler reliability and completeness can impact occurrence rates by more than a factor of two near Kepler's detection limit (see Table 1). We introduce a new approach to characterizing vetting completeness and reliability for the Kepler DR25 planet candidate population. This approach casts the problem as one of binomial probabilities via parameterized rate functions fitted to the DR25 injection, inverted and scrambled data. We develop parameterized models of completeness (described in §4), false alarm effectiveness (§5.1.1) and the observed false alarm rate (§5.1.2). The particular parametric models we use are selected via the Akaike information criterion, which chooses the parametric model that maximizes the likelihood penalized for the number of model parameters (see Appendix A). We do not claim that our parametric models are the best or in any sense "true", just that they are the best of the parametric models we considered, described in Table 8. But our best models do a good job of accounting for the data, and are robust against choices such as grid resolution.
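Model selection by the Akaike information criterion, AIC = 2k − 2 ln L̂, can be sketched generically. The candidate models and synthetic data below are illustrative only, not the models of Table 8: a constant vetting efficiency (k = 1) is compared against a logistic dependence on MES (k = 2), each fit by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic vetting outcomes: a detection is vetted as PC with a
# MES-dependent probability (this generative model is illustrative only).
mes = rng.uniform(7, 20, size=500)
p_true = 1 / (1 + np.exp(-(mes - 10)))
vetted = rng.random(500) < p_true

def bernoulli_loglike(p, y):
    # Log-likelihood of Bernoulli outcomes y under success probabilities p.
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return np.sum(np.where(y, np.log(p), np.log(1 - p)))

# Model 1 (k=1): constant efficiency; the MLE is the sample mean.
p_const = vetted.mean()
ll_const = bernoulli_loglike(np.full_like(mes, p_const), vetted)

# Model 2 (k=2): logistic in MES; crude grid-search MLE for illustration.
best_ll = -np.inf
for a in np.linspace(0.1, 3, 30):
    for b in np.linspace(5, 15, 50):
        ll = bernoulli_loglike(1 / (1 + np.exp(-a * (mes - b))), vetted)
        best_ll = max(best_ll, ll)

aic_const = 2 * 1 - 2 * ll_const
aic_logistic = 2 * 2 - 2 * best_ll
print(aic_const > aic_logistic)  # True: the MES-dependent model is preferred
```

The model with the smaller AIC wins; the 2k term is what "corrects" the likelihood for the number of parameters, so a more flexible model must earn its extra parameters.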
We caution, however, that vetting completeness and false alarm reliability as defined in this paper are properties of the specific Robovetter metrics and vetting thresholds behind the DR25 planet candidate catalog, as well as of our analysis method, rather than properties of the detections themselves. For example, a different choice of Robovetter metrics may increase completeness while decreasing reliability, or vice versa. While a low reliability for a transit detection from the analysis in this paper is reason to be cautious about asserting that the detection is due to a true planet, further analysis of Kepler data can potentially result in higher confidence that a transit signal is due to a true planet. For example, as described in §2, at least one major source of false alarms, rolling bands, is highly dependent on focal plane position (Van Cleve & Caldwell 2009; Van Cleve et al. 2009). Though some of the DR25 vetting metrics, such as Skye (Thompson et al. 2018), are focal-plane dependent, the reliability analysis in this paper largely ignores focal plane dependence by averaging over the focal plane, and potentially underestimates the reliability of a detection in a focal plane position known to have a low occurrence of, for example, rolling bands. Pixel-level analysis of transit events beyond that used in DR25 may be useful in distinguishing false alarms due to statistical fluctuations and cosmic ray events. These observations can potentially be implemented as new Robovetter metrics, which could result in a higher-reliability, more complete planet candidate catalog.
The four years of Kepler's observation of its primary field provide a data set unlikely to be surpassed in the near future. Full exploitation of this data for understanding exoplanet populations is only partially complete. This paper is an attempt to complete a significant step in that exploitation. We deliberately chose to limit the innovations in this paper to the characterization of and correction for completeness and reliability, and the use of the uniform Gaia-based stellar properties catalog of Berger et al. (in prep). We show the impact of these innovations by computing occurrence rates using standard methods from Burke et al. (2015) in order to facilitate comparison with previous occurrence rates based on similar methods. The following discussion critically examines the assumptions underlying these occurrence rates, revealing weaknesses in both the DR25 catalog and the occurrence rate calculation method, and outlines some of the directions that we believe will prove fruitful in addressing these weaknesses.

Assumptions Underlying the Baseline Occurrence Rate
We illustrate the impact of reliability by computing a variety of occurrence rates near the Kepler detection limit (see §6). We chose our specific occurrence rate method, Bayesian inference using a dual power law population model in period and radius, because it is standard and well-understood. We believe that our occurrence rates provide high-confidence insight into what the DR25 planet candidate catalog tells us about the exoplanet population for the period and radius range of 50 ≤ period ≤ 400 days and 0.75 ≤ radius ≤ 2.5 R⊕ under the following assumptions:
• The parent stellar population is statistically well-described by Berger et al. (in prep).
• Detection and vetting completeness in the injected data, along with the analysis described in §4, capture the statistical behavior of detection and vetting completeness in the observed data.
• The false alarms in the inverted and scrambled data capture the statistical behavior of the false alarms in the observed data.
• The astrophysical false positives probability statistically captures the probability of astrophysical false positives in the planet candidates.
• Exoplanets are distributed according to a Poisson point process.
• Dependence of planet occurrence on period and radius is modeled by a product of power laws in period and radius.
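The last assumption, a product of power laws in period and radius, has the generic form λ(p, r) ∝ F₀ p^β r^α. A minimal sketch, with a normalization convention chosen for illustration (so that F₀ is the expected number of planets per star over the domain; this convention, and the parameter values, are ours, not necessarily the paper's):

```python
import numpy as np

# Domain from the text: 50 <= period <= 400 days, 0.75 <= radius <= 2.5 R_earth.
P_MIN, P_MAX, R_MIN, R_MAX = 50.0, 400.0, 0.75, 2.5

def dual_power_law(period, radius, f0, alpha, beta):
    # lambda(p, r) = f0 * C * p**beta * r**alpha, with C chosen so that
    # the integral of lambda over the domain equals f0 (one common
    # normalization convention). Assumes alpha != -1 and beta != -1.
    ip = (P_MAX**(beta + 1) - P_MIN**(beta + 1)) / (beta + 1)
    ir = (R_MAX**(alpha + 1) - R_MIN**(alpha + 1)) / (alpha + 1)
    return f0 * (period**beta * radius**alpha) / (ip * ir)

# Midpoint-rule integration over the domain confirms the normalization.
n = 400
dp, dr = (P_MAX - P_MIN) / n, (R_MAX - R_MIN) / n
p = P_MIN + (np.arange(n) + 0.5) * dp
r = R_MIN + (np.arange(n) + 0.5) * dr
pp, rr = np.meshgrid(p, r, indexing="ij")
total = np.sum(dual_power_law(pp, rr, f0=0.1, alpha=-1.5, beta=-0.7)) * dp * dr
print(total)  # close to 0.1
```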
We discuss each of these assumptions in turn.
The parent stellar population: As stated in §3.1, we choose the Berger et al. (in prep) stellar properties because they are informed by Gaia radii and uniformly treat the stellar properties across the parent population. While detailed observations of individual stars may provide more accurate stellar properties for individual stars, our method requires statistically uniform analysis. This is provided by the isochrone fitting approach of Berger et al. (in prep). Therefore we believe that this stellar catalog is very well suited to our analysis. We showed in §6.4.1, however, that the occurrence rate depends critically on the stellar properties in the parent population. Inaccuracies, in particular biases in stellar radius estimates, can have a strong impact on occurrence rates.
Detection and vetting completeness: The injected data and analysis from §4 makes many assumptions. The detection completeness analysis makes several empirical approximations (described in Burke & Catanzarite (2017)) that may not apply well to individual planet candidate host stars. As described in §3, some of our stellar and planet candidate population approaches the restrictions stated in Burke & Catanzarite (2017) with respect to transit duration. Because our occurrence rates include regions with very few planet candidates, it is possible that the detection completeness for long-period planets is not as well modeled as that for short-period planets. Regarding vetting completeness, we are assuming that the Robovetter vets the injected detections with the same statistical accuracy as real transiting planets in the observed data. While we have confidence in this assumption, it is possible that some true planet transit signals have properties not captured by injection which confound the Robovetter, such as asymmetric transit shapes due to non-zero eccentricity, transit timing variations, and out-of-transit flux variations.
False alarm characterization: As we discussed in §5, inverted and scrambled data is believed to statistically model three identified classes of false alarms: rolling bands, statistical fluctuations combined with cosmic ray-induced pixel events that conspire to imitate long-period small transiting planets and stellar variability. The evidence for this belief is that the distribution of detected TCEs in the inverted and scrambled data closely matches the clearly anomalous distribution of detections in the observed data centered on the Kepler orbital period (see Thompson et al. (2018)), and tuning the Robovetter to eliminate this distribution from the PC population in the inverted and scrambled data also eliminates the anomalous distribution in the observed data. While using the inverted and scrambled data to model the false alarm population is clearly effective, there are likely other types of false alarms not represented by the inverted/scrambled data, though these seem to be a minor component compared to those represented by inverted/scrambled data. Such unmeasured false alarms would cause an overestimate of false alarm reliability.
Astrophysical false positive characterization: The false positive probabilities (Morton et al. 2016) are computed making strong assumptions about the lack of evidence for stellar multiplicity associated with the transit host star. While these false positive probabilities model stellar multiplicity as candidate hypotheses, the prior used in this model strongly assumes a lack of evidence for stellar multiplicity. As described in Hirsch et al. (2017) and Ciardi et al. (2015), there is evidence that a non-trivial fraction, possibly 20%, of Kepler target stars have unknown stellar companions. Such companions could cause an overestimate of the reliability of a subset of the PC population.
Poisson Likelihood: The use of the Poisson likelihood (Equation (11)) for the distribution of exoplanets is a standard choice, but may not be correct. For example, the assumption that the probabilities of different planets occurring on the same star are independent of one another (an assumption behind Equation (11)) is almost certainly not correct, as indicated by the existence of many packed exoplanet systems. There is also evidence that the detection of one planet on a star can prevent the detection of other planets on the same star. Likelihood-free methods, such as approximate Bayesian computation as applied to occurrence rates in Hsu et al. (2018), or population sampling methods, may yield more accurate occurrence rates.
The power law population model: Evidence is mounting against the use of a simple product of power laws in period and radius when modeling exoplanet population statistics. This is already apparent in the top-left panel of Figure 26, where the power law is a poor fit to the observed planet population as a function of radius. This is likely due to the Fulton gap (Fulton et al. 2017), though the orbital periods in our analysis are somewhat longer than in Fulton's analysis. Further, Petigura et al. (2018) presents evidence that host star metallicity is an important parameter in exoplanet population statistics. As pointed out by Hsu et al. (2018), model mis-specification is unlikely to lead to accurate results. Several authors have avoided the use of parameterized models in occurrence rate computations (for examples, see Foreman-Mackey et al. (2014); Hsu et al. (2018); Howard et al. (2012)), which is likely to lead to more accurate occurrence rates.
We believe that whatever method of statistical analysis is applied to the planet candidate catalog, characterizing and correcting for vetting completeness and reliability is critical. In the long-period, small-planet regime we have shown that reliability can reduce occurrence rates by a factor of two. We expect that this will be the case regardless of the statistical method and model, because reliability is a property of the planet candidate catalog. The effect of vetting completeness is less dramatic in the DR25 planet candidate catalog (as opposed to detection completeness, which is very important), but should not be neglected. Other planet catalogs may increase vetting reliability at the expense of vetting completeness, in which case vetting completeness can be more significant.
For habitability studies, the common practice of grouping together a wide class of stars and computing occurrence rates as functions of period and radius is potentially misleading. For example, the large range of stellar luminosities in our GK population shown in Figure 2 means that not all stars share the same habitable zones expressed as orbital periods. But grouping such a wide class of stars is necessary to provide the required statistics, due to the small number of long-period, small planet detections and the sparseness of false alarms described in §5.1.1. Occurrence rates computed as functions of insolation flux and planet radius for the same class of stars would provide the needed statistics, and are likely more informative for habitable exoplanet population studies. Recent improvements in stellar characterization of the parent stellar sample, represented by Berger et al. (in prep), make insolation-flux based occurrence rates a viable alternative.

Improving the Planet Candidate Catalog
The discussion in §7.1 outlines the assumptions behind extracting our occurrence rate from the DR25 planet candidate catalog, and how those assumptions may fail. The DR25 catalog itself can likely be improved upon, particularly in the long-period, small-planet regime. There is evidence that several long-period, small-radius detections were incorrectly classified as false positives: the Kepler False Positive Working Group (FPWG) (Bryson et al. 2015) has identified several TCEs vetted as false positives in DR25 that are viable planet candidates, identified with fpwg disp status = POSSIBLE PLANET in the Kepler certified false positive table at the NASA Exoplanet Archive. This is expected because the DR25 vetting process deliberately balanced statistical uniformity against accuracy for individual objects; the uniformity was required for the study in this paper, but compromised accurate vetting for some objects. In principle, the lowered completeness resulting from mis-classifying true planets as false positives is corrected by characterizing vetting completeness. But in the long-period, small-planet regime there are very few, low-reliability detections and very large completeness corrections (see Figure 23), leaving this regime vulnerable to large errors due to small statistics.
Accurate characterization of completeness and reliability as developed in this paper opens an intriguing approach to addressing the problem of few detections at long period and small radius: planet candidate catalogs that have lower reliability and higher completeness. This would mitigate the small-statistics problem by providing more detections with a smaller completeness correction resulting in better statistical constraints on the extrapolations discussed in §6.4.2. We recommend an exploration of Robovetter thresholds that increase the number of detections, lowering reliability and increasing completeness. The methods to measure reliability described in this paper are a crucial step towards being able to extract more accurate occurrence rates from such a catalog.
We believe that the reliability of the planet candidate catalog can be improved by the development of metrics beyond those described in Thompson et al. (2018). We provide two examples that may prove fruitful.
• As described above, different regions of the Kepler focal plane have different false alarm characteristics, which can be leveraged to more accurately evaluate the likelihood that a transit signal is due to a false alarm.
• Pixel-level analysis can be developed beyond the DR25 vetting metrics, based on the expectation that false alarms are likely to be significantly different from star-like transit signals at the pixel level, particularly in difference images (Bryson et al. 2013).
We expect that such improved vetting metrics will address the small statistics problem by increasing the reliability of planet candidates near the detection limit, so they have stronger statistical weight in occurrence rate calculations. Follow-up observations can potentially play a role in validating the reliability characterization developed in this paper. Ground- or space-based observations of a significant number of DR25 PCs, confirming them as planets or determining them to be false positives, could provide a ground truth for the number of PCs that are true planets. This ground truth could be used to independently compute the reliability of the DR25 PC population. We caution against using such follow-up observations to modify the planet candidate catalog, however, as that is likely to violate the uniformity assumptions behind the completeness correction.

CONCLUSION
This paper presents a new, probabilistic approach to statistically characterizing the vetting completeness and reliability of the Kepler DR25 exoplanet catalog. Using a standard occurrence rate calculation, we demonstrated that correcting for reliability can have a significant impact on occurrence rates, particularly near the Kepler detection limit at orbital periods longer than 200 days and planet radii < 1.5 R⊕. We also showed that the choice of stellar properties for the searched stellar sample has a significant effect on occurrence rates. The results in this paper were made possible by the uniform detection and vetting methods behind DR25, which lend themselves to statistical characterization. We believe that the results presented in this paper are directly applicable to other exoplanet surveys such as K2, TESS, and PLATO, so long as they create their catalogs in a similarly uniform way and expend the effort to create test data sets that measure completeness and false positives.
We thank NASA, Kepler management, and the Exoplanet Exploration Office for continued support of and encouragement for the analysis of Kepler data. We thank Bill Borucki and the Kepler team for the excellent data that makes these studies possible. We thank Daniel Foreman-Mackey for his Python Jupyter notebooks, which kicked off and inspired the work presented in this paper. T.A.B and D.H. acknowledge support by the National Science Foundation (AST-1717000).

B. DERIVATION OF THE LIKELIHOOD FROM THE POISSON PROBABILITY
We briefly summarize Bayesian inference using a Poisson likelihood. We will work in the period-radius parameter space.
If our planet population is described by a point process with a period- and radius-dependent rate λ(p, r), then the probability that n_i planets occur around an individual star in some region B_i (say a grid cell) of period-radius space is

\[
P\{N(B_i) = n_i\} = \frac{\Lambda(B_i)^{n_i}}{n_i!}\, e^{-\Lambda(B_i)}, \tag{B1}
\]

where \Lambda(B_i) = \int_{B_i} \lambda(p, r)\, dp\, dr.
We now cover our entire period-radius range D with a sufficiently fine regular grid with spacing ∆p and ∆r so that each grid cell i, centered at period and radius (p_i, r_i), contains at most one planet. Then in cell i

\[
\Lambda(B_i) \approx \lambda(p_i, r_i)\, \Delta p\, \Delta r. \tag{B2}
\]

We now ask: what is the probability of a specific number n_i of planets in each cell i? We assume that the probabilities of planets in different cells are independent, so

\[
P\{N(B_i) = n_i,\ i = 1, \dots, K\} = \prod_{i=1}^{K} \frac{\Lambda(B_i)^{n_i}}{n_i!}\, e^{-\Lambda(B_i)} \approx (\Delta p\, \Delta r)^{K_1}\, e^{-\int_D \lambda(p, r)\, dp\, dr} \prod_{i=1}^{K_1} \lambda(p_i, r_i), \tag{B3}
\]

because the B_i cover D and are disjoint, so \sum_i \Lambda(B_i) = \int_D \lambda(p, r)\, dp\, dr. Here K is the number of grid cells and K_1 is the number of grid cells that contain a planet, which equals the number of planet candidates. So the grid has disappeared, and we only need to evaluate λ(p, r) at the planet locations (p_i, r_i) and integrate the rate function λ over the entire domain.

We do not observe all the planets, however. We account for incompleteness, including geometric transit probability, by replacing λ(p, r) with η_s(p, r)λ(p, r) in equation (B3), where η_s(p, r) is the completeness function for star s measured in §4. The result is the probability

\[
P\{N(B_i) = n_i,\ i = 1, \dots, K\} = (\Delta p\, \Delta r)^{K_1}\, e^{-\int_D \eta_s(p, r)\lambda(p, r)\, dp\, dr} \prod_{i=1}^{K_1} \eta_s(p_i, r_i)\, \lambda(p_i, r_i). \tag{B4}
\]
We now consider the probability of detecting planets around a set of N_* stars. Assuming that the planet detections on different stars are independent of each other, the joint probability of a specific set of detections {n_{s,i}} in cell i on all stars indexed by s is

\[
P\{N_s(B_i) = n_{s,i},\ s = 1, \dots, N_*,\ i = 1, \dots, K\} = \prod_{s=1}^{N_*} (\Delta p\, \Delta r)^{K_1^{(s)}}\, e^{-\int_D \eta_s(p, r)\lambda(p, r)\, dp\, dr} \prod_{i=1}^{K_1^{(s)}} \eta_s(p_i, r_i)\, \lambda(p_i, r_i) = V\, e^{-\int_D \eta(p, r)\lambda(p, r)\, dp\, dr} \prod_{i=1}^{K_1} \eta_{s_i}(p_i, r_i)\, \lambda(p_i, r_i), \tag{B5}
\]

where K_1^{(s)} is the number of planet candidates detected on star s, K_1 = \sum_s K_1^{(s)} is the total number of planet candidates, s_i is the host star of planet candidate i, V = (\Delta p\, \Delta r)^{K_1} is a constant independent of λ, and \eta(p, r) = \sum_{s=1}^{N_*} \eta_s(p, r) is the sum of the completeness functions over all stars.
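For inference one works with the logarithm of this likelihood, dropping the multiplicative constant V, which does not depend on λ. A numerical sketch under toy choices (the functional forms, amplitudes, and grid resolution below are illustrative, not the paper's fitted quantities):

```python
import numpy as np

def poisson_loglike(lam, eta, periods, radii, domain, n=200):
    # log L = sum_i log(eta(p_i, r_i) * lam(p_i, r_i))
    #         - integral_D eta(p, r) * lam(p, r) dp dr,
    # with the constant V dropped since it does not depend on lam.
    p_min, p_max, r_min, r_max = domain
    dp, dr = (p_max - p_min) / n, (r_max - r_min) / n
    p = p_min + (np.arange(n) + 0.5) * dp   # midpoint grid in period
    r = r_min + (np.arange(n) + 0.5) * dr   # midpoint grid in radius
    pp, rr = np.meshgrid(p, r, indexing="ij")
    integral = np.sum(eta(pp, rr) * lam(pp, rr)) * dp * dr
    return np.sum(np.log(eta(periods, radii) * lam(periods, radii))) - integral

# Toy rate and summed-completeness functions (hypothetical forms).
lam = lambda p, r: 0.002 * p**-0.5 * r**-1.0
eta = lambda p, r: 1000.0 * np.exp(-p / 300.0)  # summed over ~1000 stars

periods = np.array([60.0, 120.0, 350.0])        # toy candidate periods
radii = np.array([1.0, 1.6, 2.2])               # toy candidate radii
loglike = poisson_loglike(lam, eta, periods, radii, (50.0, 400.0, 0.75, 2.5))
print(loglike)
```

A sampler such as MCMC would evaluate this log-likelihood at proposed parameter values of λ, with the integral term penalizing rate functions that predict more detections than are observed.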

C. COMPARISON OF CATALOG CUTS
In §6.4.1 we found that using the DR25 stellar properties catalog results in larger occurrence rates than the baseline of §6.2, which uses the stellar properties from Berger et al. (in prep). This difference can result both from the difference in the stellar properties themselves and from the fact that different stellar population cuts were used. In particular, as described in §3.1, the Berger et al. (in prep) catalog contains only stars with good Gaia noise characteristics, and we impose further Gaia fit quality requirements by removing stars with qualityFlag = highRUWE.
In this appendix we explore the relative impact of the difference in stellar properties between the two catalogs compared with the impact of the different cuts. We consider the Berger et al. (in prep) catalog with and without the cuts specific to this catalog. We then consider the same stars as Berger et al. (in prep), but using the DR25 stellar properties and cuts. Finally we consider the DR25 stellar catalog and its restriction to those stars contained in the supplemental catalog of Mathur & Huber (2016). In all cases all steps of the occurrence rate computation are recomputed, including detection/vetting completeness and reliability.
We compute the occurrence rates F1 and ζ⊕, defined in §6, for two cases using the stellar properties of Berger et al. (in prep), which differ in the population cuts:
• Case 1: The baseline of §6.2, starting with the Berger et al. (in prep) catalog, with all the cuts described in §3.1, and planet radii corrected for Gaia stellar radii as described in §3.2. Starts with 186,548 stars and ends up with 58,974 GK stars after cuts.
• Case 2: Same as case 1, starting with the Berger et al. (in prep) catalog, except without the highRUWE, Bin or Evol cuts described in §3.1, replacing these cuts with the stellar radius cut of §6.4.1, which removes stars with R* > 1.35 R⊙. Starts with 186,548 stars, and ends up with 66,956 GK stars after cuts.
We examine three cases using the DR25 stellar properties, which differ in the population cuts:
• Case 3: The same cuts as case 2, starting with the Berger et al. (in prep) catalog, except using DR25 stellar properties and original DR25 planet radii. Starts with 186,548 stars and ends up with 71,168 GK stars after cuts.
• Case 4: The DR25 stellar catalog as described in section 4.1 using original DR25 planet radii. Starts with 200,038 stars and ends up with 75,541 GK stars after cuts.
• Case 5: The DR25 stellar catalog as in case 4, restricted to those stars in Mathur & Huber (2016) and using original DR25 planet radii. Starts with 197,096 stars and ends up with 74,989 GK stars after cuts.
Cases 1 through 3 start with the same stars, and differ in the cuts and in the use of Gaia-based vs. DR25 stellar properties. The occurrence rates F1 and ζ⊕ for the various cases are given in Table 12 and shown in Figure 32. We see that using the same stellar properties gives similar occurrence rates, with a noticeable difference in occurrence rates computed using different stellar properties. Differing cuts using the same stellar properties have a much smaller impact. We therefore conclude that the stellar properties (including differences in GK classification due to differences in effective temperature) are the dominant cause of the different occurrence rates, and the population cuts play a minor role. In all cases correcting for reliability has a significant impact on ζ⊕.

Figure 32. Results without correcting for reliability are shown with black dots, and corrected for reliability with red squares. Cases 1 and 2 use the stellar properties of Berger et al. (in prep) with differing catalog cuts, while cases 3, 4 and 5 use DR25 stellar properties with differing catalog cuts.