Impact of Correlated Noise on the Mass Precision of Earth-analog Planets in Radial Velocity Surveys

Characterizing the masses and orbits of near-Earth-mass planets is crucial for interpreting observations from future direct imaging missions (e.g., HabEx, LUVOIR). Therefore, the Exoplanet Science Strategy report (National Academies of Sciences, Engineering, and Medicine 2018) recommended further research so future extremely precise radial velocity surveys could contribute to the discovery and/or characterization of near-Earth-mass planets in the habitable zones of nearby stars prior to the launch of these future imaging missions. Newman et al. (2021) simulated such 10-year surveys under various telescope architectures, demonstrating they can precisely measure the masses of potentially habitable Earth-mass planets in the absence of stellar variability. Here, we investigate the effect of stellar variability on the signal-to-noise ratio (SNR) of the planet mass measurements in these simulations. We find that correlated noise due to active regions has the largest effect on the observed mass SNR, reducing the SNR by a factor of $\sim$5.5 relative to the no-variability scenario -- granulation reduces by a factor of $\sim$3, while p-mode oscillations has little impact on the proposed survey strategies. We show that in the presence of correlated noise, 5-cm s$^{-1}$ instrumental precision offers little improvement over 10-cm s$^{-1}$ precision, highlighting the need to mitigate astrophysical variability. With our noise models, extending the survey to 15 years doubles the number of Earth-analogs with mass SNR $>$ 10, and reaching this threshold for any Earth-analog orbiting a star $>$ 0.76 M$_{\odot}$ in a 10-year survey would require an increase in number of observations per star from that in Newman et al. (2021).


INTRODUCTION
Future flagship mission concepts such as LUVOIR (The LUVOIR Team 2019) and HabEx (Gaudi et al. 2020) will be capable of performing high-contrast direct imaging to characterize the atmospheres of nearby Earth analogs around Sun-like stars, and to search for potential biosignatures. Given the large observing time necessary for such detections, the efficiency of these missions benefits from a predefined target list of promising, potentially habitable near-Earth-mass candidates, rather than relying on their own blind direct imaging searches to supply these atmospheric characterization targets (Dressing et al. 2019). Transit surveys, e.g., Kepler (Borucki et al. 2010), TESS (Ricker et al. 2014), CHEOPS jluhn@uci.edu arXiv:2204.12512v1 [astro-ph.EP] 26 Apr 2022 (Broeg et al. 2013), PLATO (Rauer et al. 2014), while capable of probing the habitable zones of low mass stars (e.g., Anglada-Escudé et al. 2016;Gillon et al. 2017;Gilbert et al. 2020), will be unable to provide complete candidate target lists as their sensitivity to orbital period remains a challenge for the long orbital periods that correspond to the habitable zones of Sun-like stars. More importantly, the transit probability of such planets is ≈ 0.5%, so we would expect to detect fewer than one transiting Earth analog among the ∼150 Sun-like stars within 15 pc (Gupta et al. 2021). In contrast, radial velocity (RV) surveys are sensitive to planets both at longer periods and at wider ranges of orbital inclinations and are therefore well suited to providing more complete candidate target lists. Even if the results of an EPRV survey were not available in time to provide a target list for a direct imaging mission, measuring the planet masses with EPRVs would still be critical for interpreting direct imaging observations, e.g., providing an atmospheric scale height to interpret atmospheric properties, characterizing the extent of dynamical interactions with any other planets, and informing the planet's formation history (e.g., von Paris et al. 2013;Nayak et al. 2017;Dorn et al. 2018;Suzuki et al. 2018).
With this idea in mind, the NASA/NSF Extreme Precision Radial Velocity (EPRV) Working Group was established to study the feasibility of a large-scale RV survey to detect and precisely measure the masses of Earth analogs in support of future direct imaging missions, such as LUVOIR or HabEx (Newman et al. 2021). One focus of the working group was to explore the specific design of such a survey, including telescope size, longitudinal coverage, and instrument capabilities, decisions which affect not only the science results of the mission but also the required cost to build and operate. To fully explore the impact of these decisions on the scientific goals to measure precise masses, Newman et al. (2021) investigated ten "architectures"-defined to encompass all decisions about telescope size, observing plan, instrument, longitudinal coverage, etc.-and simulated a 10-year survey of ∼100 nearby stars that are likely targets for future direct imaging missions. The promising results of these simulations found that multiple architectures could measure the masses of habitable zone Earth analogs to better than 10%, under optimistic assumptions for the mitigation of stellar variability. A mass precision of 20% or better is expected to result in uncertainties in atmospheric properties being dominated by the quality of spectroscopy and/or atmospheric modeling, rather than the mass uncertainty (Batalha et al. 2019). We follow the EPRV Working Group and adopt a more stringent target mass precision of 10%, which is the necessary precision for distinguishing between composition models (e.g., Valencia et al. 2007), and also allows a margin for additional uncertainty due to stellar variability under more realistic assumptions. Despite being a conservative cutoff, achieving a mass precision of 10% has additional benefits. For small eccentricities, the RV signal attributable to the orbital eccentricity is well approximated by the epicyclic approximation, i.e., a sinusoidal signal with amplitude eK. Therefore, a mass precision of 10% also implies an eccentricity precision of 0.1, which places stringent constraints on an orbital parameter that has implications for the formation and evolution of the Earth-like planet and its planetary system.
The survey simulations carried out for the EPRV working group simulation (Newman et al. 2021) were made under optimistic assumptions, assuming that each RV measurement has uncertainty dominated by photon noise and/or instrumental precision and that the measurement noise for each observation is independent and uncorrelated. The most straightforward interpretation of the results is that it assesses how RV surveys would perform if surveying stars with no intrinsic stellar variability (and any instrumental noise is uncorrelated). An alternative more plausible interpretation is that it assesses how RV surveys would perform if improvements in the analysis of EPRV observations were to effectively remove all signals due to stellar variability (and any correlated instrumental noise). Indeed, the spectral resolution, signal-to-noise and wavelength coverage for the survey architectures considered by the EPRV working group were chosen to provide information that could be used to mitigate stellar variability, based on analyses of simulated solar data (e.g., Davis et al. 2017). At the moment, stellar variability-driven by many different astrophysical mechanisms, each with different amplitudes and timescales-remains the largest hurdle in the push toward detecting true Earth analogs in RVs (Crass et al. 2021). In order to properly characterize the efficiency of these simulated survey architectures, it is crucial to account for the stellar signals that will be imprinted in the radial velocities. In this study, we quantify the effects of stellar variability on the expected mass precision of the future surveys considered by the EPRV working group.

Magnetic Activity
The longest timescale variations are due to stellar magnetic activity cycles, similar to the Sun's 11-year magnetic cycle, with times of high activity usually associated with higher RV variability. These periods of high activity are expected to correspond to an increase of surface activity features (faculae/plages, spots) that introduce variations on the star's rotational timescale, of order days to months. On the Sun, RV variability is dominated by faculae in large, concentrated regions of plages that locally suppress the convective blueshift (Meunier et al. 2010;Haywood et al. 2016;Milbourne et al. 2019). Observed correlations between activity metrics (such as log R HK , a measure of the chromospheric emission in the Ca II H&K lines) and radial velocity measurements (e.g., Campbell et al. 1988;Saar et al. 1998;Santos et al. 2000;Wright 2005;Isaacson & Fischer 2010;Luhn et al. 2020) have led to some success identifying activity-induced signals using periodogram analyses (e.g., Hatzes et al. 2010) and more recently using Gaussian process models that jointly fit simultaneous RV and activity time series (Rajpaul et al. 2015;Gilbertson et al. 2020b). In some cases, additional information about rotationally modulated activity signals can be gleaned from high-precision photometry and can contribute to disentangling activity and planetary signals in RV time series (e.g., Queloz et al. 2009;Aigrain et al. 2012;Haywood et al. 2014).
However, Luhn et al. (2020) find that while the RV RMS is positively correlated with such stellar activity metrics, activity-induced signals do not always manifest similarly in the RVs, with most stars showing positive correlation between individual RV measurements and simultaneous activity (indicative of local suppression of the "net convective blue shift"), other stars showing no correlation (perhaps due to a pole-on inclination) and still others showing a negative correlation, perhaps indicating the stellar surface is dominated by bright magnetic features (e.g., Milbourne et al. 2019). Indeed, slowly rotating Sun-like stars have been shown to increase in brightness as their activity increases (Radick et al. 1983;Lockwood et al. 1984Lockwood et al. , 2007 and so we expect these stars to have RV variability dominated by bright active regions.

Granulation
Stellar granulation, the surface manifestation of convection, operates on shorter timescales (minutes to hours). Large, bright cells of hot rising gas cover the stellar surface, dominating over the darker, cool intergranular lanes of infalling material. This gives rise to the "net convective blueshift" across the surface of a star (e.g., Dravins et al. 1981;Gray 2009;Meunier et al. 2010). The convective cells vary in size (several hundred km to tens of Mm) and lifetimes (8 minutes to 1 day) from granulation to "supergranulation" (Del Moro 2004;Hall 2008). In photometry, the granulation signal has been studied in both the time domain ("8 hour flicker" seen in Bastien et al. 2014) and in the frequency domain of asteroseismology (e.g., Michel et al. 2008;Gilliland et al. 2010;Kallinger et al. 2014). In RVs, Collier Cameron et al. (2019) see evidence of granulation signals in the intraday variations of the daily high-cadence HARPS-N solar observations (Dumusque et al. 2015), matching the total granulation contribution expected (∼0.5-1 m s −1 ) for a solar-like star (Meunier et al. 2015). One strategy for mitigating the effects of granulation on RV surveys is to take multiple observations per night (Dumusque et al. 2011). If the observations within a night are averaged while assuming each observation to be independent, then one would underestimate the uncertainty in the nightly RV. Therefore, it is important to consider the strength of correlation between observations when evaluating this or other EPRV observing strategies. For a more in-depth review of the effect of granulation on the detection of Earth-mass planets, see Cegla (2019).

Oscillations
Stellar pressure-mode, or "p-mode," oscillations operate on timescales of just minutes for solar-like stars. The convective motions in a star generate acoustic pressure waves that ripple throughout the interior of star, creating deformations of the stellar surface. The deformations can be seen as intensity fluctuations in high-precision photometry of stars. Kepler revealed a wealth of information about the general characteristics and behavior of p-mode oscillations of stars from asteroseismology analyses. In these analyses, the power spectral density (PSD) of a star reveals a bump of excess power due to oscillations, often modeled as a Gaussian, with characteristic frequency, amplitude, and width dependent upon stellar properties of the star, notably the surface gravity, log g, and the effective temperature, T eff (e.g., Brown et al. 1991;Kjeldsen & Bedding 1995;Kallinger et al. 2014). In RV measurements, the oscillation bumps are more pronounced since the intensity fluctuations also correspond to material physically moving, compounding the effect. On the Sun, p-mode oscillations operate with a dominant timescale ∼5 minutes and amplitude 19 cm s −1 (Chaplin et al. 2019).
RV surveys can effectively mitigate the effect of p-mode oscillations by choosing integration times that average over the dominant oscillation timescale (Chaplin et al. 2019;Gupta et al. 2021). The EPRV working group survey simulations incorporated this observing strategy by setting a minimum exposure time of 5 minutes for each observation 1 and so we expect oscillations to be the least impactful source of correlated noise.

Paper Organization
In this work, we build on the results of Newman et al. (2021) by including a more realistic model for correlated noise due to stellar activity, granulation, and oscillation in the calculation of planet mass precision. We choose Gaussian process (GP) noise models for each of the three key mechanisms responsible for stellar variability, following the prescription of activity in Gilbertson et al. (2020b) and the model for oscillations and granulation in Pereira et al. (2019). For the RV Community Challenge, Dumusque (2016) took a slightly different approach, modeling oscillations and granulation components based on the power spectral density of these signals and adding an activity component based on simulations using SOAP 2.0 (Dumusque et al. 2014). Hall et al. (2018) applied a similar SOAP 2.0-based activity component (but no granulation or oscillations) to simulated survey designs for the Terra Hunting Experiment. An advantage of the noise models in this work is that we scale granulations and oscillations based on stellar properties and use the GP formulation of these signals to construct the covariance matrix for a simulated set of observations, rather than assuming independent observations. In Section 2 we summarize the simulated EPRV surveys from Newman et al. (2021). In Section 3 we describe the GP covariance kernels used in this work, which are necessary for the calculation of the mass precision outlined in Section 4. In Section 5, we analyze the effect of correlated noise on the simulated EPRV surveys and make comparisons to similar survey simulations with ideal and white noise models. Additionally, we investigate the choice of several survey parameters, including instrumental precision and survey duration. In Section 6, we discuss the implications of our results for planing EPRV surveys, consider caveats to the assumptions made in both the GP models and planet mass SNR calculation, and identify opportunities for further research. Finally, we summarize our conclusions in Section 7.

SIMULATED EPRV PLANET SURVEYS
We use the results of the simulated EPRV surveys in Newman et al. (2021), who ran a realistic observing program under various telescope architectures. Each architecture has distinct Northern and Southern Hemisphere surveys, assuming identical instruments, longitudinally distributed across existing observatories in each hemisphere. The simulations for each architecture include plausible estimates for the time allocation granted to the spectrograph on each telescope (varies with the telescope size) and take into account the observing season for each site and night-to-night observing window for each star. Additionally, the simulations account for typical weather losses. Table 1 gives a summary of the architecture default values.
For this study, we are concerned with the ability to detect Earth mass planets orbiting in the habitable zones of their stars and so we assume each star on the "green target list" has an Earth-mass habitable-zone planet in a circular orbit. In this case, the RV semi-amplitude scaling relation is where the period of a habitable zone is determined by the mass of the star: assuming a MS luminosity scaling relation L ∝ M α . We choose α = 4 for this work and assume orbits are viewed edge-on (sin i = 1).
We focus on 5 in architectures in particular: Architectures I, IIa, IIb, VIIIa, and VIIIb. These were chosen to provide a variety of telescope number and sizes. Figure 1 shows a summary diagram of the telescopes that make up each architecture while Table 2 shows a more detailed summary of the parameters for each telescope architecture.
Each survey observed the same 109 stars, which Newman et al. (2021) refers to as the "green target list". These stars had been previously identified as likely targets for one or more direct imaging mission concepts and are amenable 12.0 cm s −1 6.5 -16.5 min Note-N obs,tot gives the total number of observations for a given architecture. N obs,star reports the mean number of observations per star (N obs,tot /109). The column labeled "Telescopes" gives the telescope class breakdown for each architecture. For architectures made up of more than one telescope class, the following columns give the total number of observations in each telescope class N obs,tels and the subsequent mean number of observations per star in each telescope group N obs,tels,star . The final two columns give the mean photon precision, σ phot , for an observation in that telescope class and the 25th and 75 percentile range for the exposure times, τexp, for that telescope class. to RV surveys based on their spectral type (F7-K9) and v sin i < 5 km/s. The observation time, exposure duration, and photon noise level of each observation in the simulations follow from the telescope aperture and properties of each star on the "green target list". We summarize the details of each architecture below and refer the reader to Newman et al. (2021) for a more in depth discussion of the characteristics of each architecture.

Architecture I
Architecture I contains six fully robotic 2.4 m telescopes (based on the Automated Planet Finder, Vogt et al. 2014) with Northern Hemisphere (NH) telescopes at Kitt Peak (KP), Calar Alto (CA), and Mauna Kea (MK) observatories and Southern Hemisphere (SH) telescopes at Las Campanas (LC), Sutherland (SA), and Siding Spring (SS) observatories. This architecture assumes each site is a dedicated facility for an EPRV survey and allocates 100% of time for observing. Over the simulated 10-year survey, Architecture I obtains 308923 observations (average 2834 per star). Note that Architecture I in Newman et al. (2021) used an instrumental uncertainty of 10 cm s −1 . To better compare the effects of instrumental uncertainty in the presence of correlated noise, our default value for the instrumental uncertainty in Architecture I is 5 cm s −1 , to match the other architectures. We will later investigate in Section 5.4 how each architecture performs if we assume instrumental precision of 10 and 30 cm s −1 .

Architecture IIa
Similar to Architecture I, Architecture IIa contains larger telescope designs to achieve higher precision for a fixed exposure. Each hemisphere is equipped with two 3.5 m telescopes (KP, CA in the north; SS, SA in the south) as well as a 6 m telescope (MK in the north; LC in the south). This architecture also assumes 100% telescope time allotted to this survey. Over the simulated 10-year survey, Architecture IIa obtains a total of 587813 observations (average 5392 per star) divided into 285718 observations across the two 6 m telescopes (2621 per star) and 302095 observations across the four 3.5 m telescopes (average 2772 per star).

Architecture IIb
Architecture IIb is the same as Architecture IIa but with the two 6 m telescopes replaced with 3.5 m telescopes, such that all six sites are identical. Over the 10-year survey, Architecture IIb obtains a total of 515753 observations (average 4732 per star).

Architecture VIIIa
Architecture VIIIa contains two 10 m telescopes at MK in the north and LC in the south, with the same four additional 3.5 m telescopes as in Architecture IIa located at KP and CA in the north and SA and SS in the south. This architecture assumes 20% telescope time allotted to the two 10 m telescopes (one night per week) and 100% time on the four 3.5 m telescopes. Over the simulated 10-year survey, Architecture VIIIa obtains a total of 437198 observations (average 4010 per star) divided into 60354 observations across the two 10 m telescopes (544 per star) and 376844 observations across the four 3.5 m telescopes (average 3457 per star).

Architecture VIIIb
Architecture VIIIb is the same as Architecture VIIIa but with six 2.4 m telescopes instead of four 3.5 m telescopes. As a result, there is a 10 m telescope and a 2.4 m telescope at both MK in the north and LC in the south. The two 10 m telescopes assume the same 20% observing allocation as in Architecture VIIIa (and the same observations). Over the simulated 10-year survey, Architecture VIIIb obtains a total of 516225 observations (average 4736 per star) with 45871 observations across the six 2.4 m telescopes (average 4182 per star). Newman et al. (2021) used survey simulations to establish the feasibility of measuring precise masses assuming that all stellar variability components could be removed. Under these optimistic assumptions, several architectures performed well, and would be able to measure planet masses with SNR 10 for most of the stars on the green target list. Given this result, choosing an architecture for the EPRV precursor survey could be largely driven by cost and feasibility. The goal of this study is to update these results with correlated noise models to investigate which architectures are well suited for detecting Earth-mass planets in the habitable zones in the presence of more realistic stellar variability models. We model the RV signal due to stellar variability as a Gaussian Process. A covariance kernel k(|t − t |), specifies the form of covariances as a function of the time lag between observations. We model the RV perturbations due to stellar variability as a sum of four terms, each a Gaussian Process: one term for activity, one term for oscillations, and two terms for granulation. We describe each in the sections below.

Active Regions
We model RV perturbations due to rotationally linked stellar variability, hereafter "Active Regions," or "AR," using a GP model from Gilbertson et al. (2020b), hereafter Gil20b. Gilbertson et al. (2020a) simulated time-series of solar ), which comes from Gil20b, a sum of a Matern-5/2 GP (timescale λ = 2.52 d, amplitude a0 = −0.106884 m s −1 ) and its first derivative (with amplitude a1 = −1.154 m s −1 ). The covariance kernel is described by the absolute time difference between observations, ∆ ≡ |t − t |. Note that these values have been fit to simulations of solar active regions; we use this same kernel for every star in our sample (i.e., we do not scale amplitude or timescale with stellar properties). spectra focusing on the effects of faculae/plages and spots (collectively "active regions") using a custom version of SOAP 2.0 (Dumusque et al. 2014). Gil20b measured the apparent radial velocity and stellar variability indicators from each spectra and fit several Gaussian Process (GP) models to the resulting timeseries of stellar activity-induced RVs. Based on their results, we adopt a linear sum of a Matern-5/2 GP kernel and its first derivative as the GP kernel for modeling the effects of active regions. Equation 9 of Gil20b shows the generalized form of the kernel. In the case of only using the RVs (i.e., not including additional time series such as activity indicators), the resulting GP kernel is simply a difference of the 0th order and 2nd order time derivatives of the kernel for a latent process (in this case the Matern-5/2 kernel), so we can write where {a 0 , a 1 } are amplitude parameters (in m/s) and k M 52 (|t − t |) is the Matern-5/2 kernel: with timescale λ. We compute the second derivative of the kernel to derive the term in the right hand of Equation 3 where we have used the shorthand R ≡ √ 5/λ and ∆ ≡ |t − t |. From solar simulations of Gil20b, the best fit for the solar timescale hyperparameter λ is 2.52 d, and the best-fit solar amplitudes are {a 0 , a 1 } = {−0.106884, −1.154} m s −1 . The resulting covariance kernel is shown in Figure 2. The covariance matrix therefore has diagonal term (|t − t | = 0) of 0.361 m 2 s −2 , which implies a white noise term for active regions: σ AR = 0.6 m s −1 . The correlated structure of activity-induced variability has not yet been well quantified as a function of spectral type. We adopt the solar values for all stars in our sample, unlike the oscillation and granulation kernels in the following sections, which we scale with stellar properties. We discuss the potential impact of a model that depends on stellar parameters in Section 6.3.2.

Stellar Oscillations
We model the RV perturbations due to stellar oscillations using a GP model from Pereira et al. (2019), hereafter P19. P19 describe the transformation between the GP kernels for a stochastically driven, damped, harmonic oscillatorimplemented in the celerite framework described in Foreman-Mackey et al. (2017)-and the Gaussian functional form commonly used to approximate the stellar p-mode oscillation bump in asteroseismology power spectral density analyses. In the limit where the oscillator quality factor Q 1, the kernel becomes where ∆ is the the absolute value of the time between observations as above, ω osc is the characteristic frequency of the undamped oscillator, S osc is proportional to the power of the spectrum at ω = ω osc , Q is the quality factor, and η = |1 − (4Q 2 ) −1 | 1/2 . P19 further give the relation between these kernel parameters and the asteroseismic parameters where P g is the power excess of the oscillation bump and ν max is the frequency of maximum p-mode oscillation power. When simulating stellar oscillations for stars other than the Sun, we adopt a scaling relationship for ν max from Chaplin et al. (2019) To scale the power excess P g with stellar parameters, we follow Guo et al. (2022), hereafter Guo22, who transformed photometric scaling relations into scaling relations for the apparent RV signal due to stellar variability, where a gran = 3382ν −0.609 max is the granulation amplitude 2 (Kallinger et al. 2014, Guo22), σ env (µHz) = 0.174ν 0.88 max is the width of the Gaussian power excess, r osc is the amplitude ratio between the photometry in the Kepler bandpass and radial velocity power spectra, β(T eff ) is the correction factor from bolometric flux to the Kepler bandpass given by Equation 13 in Guo22, and α = 0.9 (Guo22). Therefore, given stellar parameters log g and T eff , one can compute ν max and from there calculate the kernel parameters in Equation 6.
The GP kernel above (Equation 6) describes the covariance between two instantaneous points in time. Unlike the active regions kernel in Section 3.1, the timescale in Equation 6 is not always much longer than exposure times. Therefore, we compute the double integral for two observations with durations δ 1 and δ 2 and with observation start times separated by time ∆. The integration can be performed analytically by parts. See Appendix A for the equations and derivation of the full oscillation covariance kernel and treatment of overlapping observations.

Granulation
We model the apparent RV signal due to granulation as a GP with two terms as in Guo22, where S 1 , S 2 define the power scaling and ω 1 , ω 2 are the characteristic frequencies for each granulation component. When modeling granulation for stars other than the Sun, we adopt scaling relationships for ω from Kallinger et al. (2014) and Guo22, Similarly, S 1 and S 2 are calculated using photometric scaling relations to convert to RV scaling relations. First, following P19, we adopt We set a 1 = a 2 = a gran in RVs (as in photometry, Guo22) and rescale the amplitudes for the photometric signature of granulation a gran,phot (ppm) = 3382ν −0.609 max (13) to obtain the amplitude for the RV signature of granulation a gran,RV (m/s) = a gran,phot r gran As with the oscillation kernel, the granulation kernel can be scaled to a given star using log g and T eff .
We follow the same approach as for the oscillation component for the granulation component to compute the double integral across two observations. See Appendix B for the equations and derivation of the full granulation covariance kernel and treatment of overlapping observations.

ESTIMATING THE MASS PRECISION
We wish to estimate the measurement precision for K, the radial velocity semi-amplitude due to a putative planetary signal. After defining the covariance kernels for each of the activity components in Section 3, we can compute the full In this case, the two observations are the same duration (δ1 = δ2), as is typically the case for observations of a given star from the same telescope.
covariance matrix C as a sum of the covariance kernels described in Section 3 for each pair of observations i, j and adding diagonal terms due to photon and instrumental noise: where δ ij is the Kronecker delta and σ phot,i is the photon noise for observation i, and σ instr is the constant instrumental uncertainty term. From this we can calculate the uncertainty in measuring the semi-amplitude, K, assuming planets are on circular orbits such that where v are the true, kinematic radial velocities at observation times t, for a planet with orbital period P (given in Equation 2) and phase φ 0 . If we assume the orbital period and phase are well known, then generalized least squares gives the parameter uncertainty in the estimate of K Often, the orbital period and phase are well known when performing RV follow-up of a planet identified via transit follow-up surveys. While direct imaging could provide secure planet detections, the uncertainty of the orbital period and phase will be greater than is typical of planets discovered via transit. When the orbital period is unknown, fitting for all parameters at once makes the problem non-linear, so generalized least squares is no longer appropriate. Unfortunately, performing a Markov chain Monte Carlo analysis (e.g., Ford 2006) is orders of magnitude more computationally demanding than generalized least squares. The formal uncertainty from generalized least squares provides a more direct comparison to the results of Newman et al. (2021) and represents both an approximation to and a lower bound on the uncertainty if one were to allow for potential correlations with orbital period and phase. We justify the assumption of well known periods through an investigation of simulated RV time series with 1000 realizations of planet phase and stellar variability-drawn from the GP kernels described previously in Section 3-and find that the simulated RV time series results in a significant periodogram peak 3 at the "correct" period (within 3% of the true planet period) at a median success rate of 96% across all stars in this sample 4 . We discuss implications of this assumption in Section 6.3.4. We also wish to emphasize that while the focus of this study is the precision of the mass SNR for Earth analogs, it will be important to obtain planet mass measurements that are not only precise, but also accurate.
For the purposes of this study, we are primarily interested in understanding the impact of correlated noise due to stellar variability on the yield of EPRV surveys for potentially Earth-like planets. Therefore, we compare the results of various surveys, using generalized least squares to approximate the uncertainty in each planet mass for each survey. We calculate the signal to noise ratio for the planet semi-amplitude estimate,K, which corresponds to the SNR for the estimated planet mass. Note that this SNR quantifies the precision of the mass estimate and is not necessarily a useful criterion for detecting a planet, nor does it provide information about the nature of the signal. Traditionally, astronomers have refrained from claiming a detection of an exoplanet until the data show that the orbital period is well constrained and coherent. Adopting such a criterion would require exploring a large parameter space (e.g., computing a periodogram) and checking whether the power near the true period is significantly larger than both a detection threshold and the power at any other putative orbital period. Such caution is particularly important when attempting to discover planets that may have short orbital periods and/or when using a relatively small number of observations, due to potential aliasing of the true period with the observational window function. Aliasing will be less important for the EPRV surveys explored in Newman et al. (2021) and this study, since these focus on orbital periods near a year and each star is to be observed thousands of times. Nevertheless, quantifying the significance of a planet detection is complicated and computationally demanding once one allows for the possibility of multiple planets and non-circular (or even non-Keplerian) orbits (Nelson et al. 2020). Further, direct imaging of a planetary system could provide independent detection of planets and constraints on their orbits. For the purposes of both Newman et al. (2021) and this study, we use generalized least squares to approximate the uncertainty for the mass estimate, assuming that the orbital period and phase are well known. This is in line with other assumptions (e.g., circular orbit, any other planetary signals are removed perfectly, etc.) that mean the uncertainties we compute will underestimate the uncertainties of a fully realistic EPRV survey. Because we have fixed the planet sizes and periods and rely on the simulated observing times from Newman et al. (2021), the only remaining parameter that can affect the final SN R is the phase φ 0 , i.e., how the observations line up with the planet's orbital phase. To account for this, we draw 20 random samples of φ 0 between 0 and 2π and report the median SNR for each planet and survey architecture.
In summary, we perform simulations asserting that each star has an Earth-mass planet in a circular, edge-on orbit in its habitable zone. Rather than performing a signal injection and recovery analysis, we instead use the covariance matrix following generalized least squares to calculate the SNR for the resulting planet's Doppler amplitude. This allows for quick, efficient computation of the expected performance for each telescope architecture.
We compare the resulting uncertainties under 3 analysis models for the "noise" due to stellar variability: correlated stellar noise, white noise (independent and identically distributed, or iid), and no stellar noise (ideal star).
• Correlated noise (corr): The correlated noise model calculates the covariance matrix in Equation 16 including all off-diagonal elements between observation pairs.
• White noise (iid): The white noise model uses only the diagonal terms of the covariance matrix, such that stellar variability is assumed to only contribute independent white noise ("jitter") to each observation. This is equivalent to adding a Kronecker delta term in front of each stellar variability kernel in Equation 16, such that only k(∆ = 0) is evaluated for each mechanism.  Histograms of RV SN R for measurements of Earth-mass planets for proposed RV survey architectures and various noise models for the effects of stellar variability on RV measurements. Each column shows a different telescope architecture, as described in Section 2 and Figure 1. Diagrams for each telescope architecture are repeated in the top right of each histogram for easy reference. Histograms are colored according to the assumptions for how stellar variability affects the RV measurements: correlated noise ("corr" in red/pink), white noise (labeled as "iid") in blue, and no additional noise due to stellar variability ("ideal") plotted for just the top row in black/grey. The vertical dashed red line indicates where planet mass measurements will have a precision of 10% or better. The exact percentage of planets above this threshold is given in red and blue in the bottom right, corresponding to the correlated and white noise models, respectively. We also include in parentheses the percentage of planets with mass SNR > 5 (20% mass precision). These data were generated using the default survey duration of 10 years and instrumental precision 5 cm s −1 . Note that the percentage of planets with mass measurement SNR > 10 in the "ideal" scenario (shown in the black/grey histograms) is 100% for every architecture, meaning that the labels also indicate the reduction expected by including the specified noise source.

Architecture Performance for Baseline Survey for Ideal, White, and Correlated Noise Models
In this section we perform the analysis described in Section 4 to explore the effect of each of the various correlated noise models on the precision of mass measurements. We compute the set of SNR's for each star's mass measurement using the full noise model (active regions, granulation, and oscillations) assuming the nominal 10-year survey with 5 cm s −1 instrumental precision. The results are shown in Figure 5, where we summarize the performance of each architecture by reporting the number of planets measured with mass SNR > 10 (i.e., 10% planet mass uncertainty), followed by the number with SNR > 5 (i.e., 20% planet mass uncertainty) in parentheses. We also report in Table 3 the median SNR (along with the 25th and 75th percentiles) for each architecture.
While each of these architectures is capable of measuring Earth-analog planets with mass precision better than 10% in the "ideal" case, we find that is not the case when accounting for additional noise due to stellar variability. First, we look at the white noise ("iid") models and see that Architecture I easily performs the worst as the only architecture with less than 50% of planets measured with SNR > 10. We expect this from Table 2, since Architecture I has the fewest number of observations per star. The relative performance of the rest of the architectures can be predicted based on the average number of observations per star. This follows expectations that in a white noise scenario, performance between architectures will depend largely on the number of observations.
In simulations assuming correlated noise due to stellar variability, the number of planets with masses measured with SNR > 10 drops drastically relative to either white noise or ideal simulations. From the white noise to the correlated noise case, we expect to measure planets masses to 10% or better for 4-6 times fewer planets, depending on the architecture. At best, we expect only 17% of planet masses to be measured to better than 10%. That is, in the presence of correlated noise due to stellar variability, a 10-year survey with ∼5000 observations per star and instrumental precision of 5 cm s −1 is not likely to measure planet masses to better than 10% for the majority of habitable Earth-mass planets around Sun-like stars, given the assumptions of our correlated noise models (e.g., solar-like activity amplitudes) and the underlying architecture simulations (e.g., assuming no coordination between telescopes and no further progress in the ability to mitigate the effects of stellar variability beyond the high cadence and large number of observations). Despite this, the histograms show that the majority of planet masses are measured with 5 ≤SNR≤ 10 in all architectures, with the worst architecture still capable of measuring planet masses to better than 20% for 70% of stars. With a SNR > 5 threshold, is is expected that the mass precision (20% or better) will lead to atmospheric characterization with uncertainties dominated by spectroscopic uncertainties and atmospheric modeling Batalha et al. (2019). While each architecture is able to achieve this threshold for a majority of the targets on the "green target list," there is a steep drop off as the threshold is increased. Due to the steep drop off, the number of stars with SNR > 5 or SNR > 10 will be sensitive to the specifics of our correlated noise model for stellar variability. Similar to Newman et al. (2021), we default to discussing results in terms of performance under a SNR > 10 threshold. However, where Newman et al. (2021) used this conservative threshold as a way of accounting for stellar variability, we quantitatively consider the survey performance under a plausible correlated noise model. Nevertheless, one could consider adopting the same conservative threshold (SNR > 10) to allow for additional variability effects that are not captured in our model (see discussion in Section 6). We can quantitatively describe the reduction in the SNR from the ideal to white noise to correlated noise models. We define the median ratio of the SNR from one noise model to another using the notation We report the values of R ideal/iid , R iid/corr , and R ideal/corr in Table 3. Across the five architectures, the white noise model reduces the SNR by a factor of ∼3-6 from the ideal model, and the correlated noise model reduces the SNR by a factor of ∼2 from the white noise model (a factor of ∼4-9 from white noise).
It is also useful to compare the performance between similar architecture pairs (Architecture IIa vs. Architecture IIb and Architecture VIIIa vs. Architecture VIIIb) in the correlated noise case to see how the architecture design choice is affected by correlated noise. Comparing Architecture IIa and Architecture IIb, we see they perform similarly, despite Architecture IIa having, on average, 660 more observations per star (a 14% increase). The telescope number and location are the same between the two architectures, and the increased number of observations comes from the shorter exposure times on the two larger 6 m telescopes in Architecture IIa. These observations also have photon precision of 7 cm s −1 instead of the 9 cm s −1 photon precision for the 3.5 telescopes. Despite these benefits for Architecture IIa, the improvements are not distinguishable. Comparing Architecture VIIIa and Architecture VIIIb, we do see improved performance in Architecture VIIIb over Architecture VIIIa. Despite the lower photon noise from the 2.4 m telescopes, the additional telescope site more than makes up for it due to increased number of observations (on average 18% more observations per star) and longitudinal coverage. Remember that in Architecture VIIIa the 10 m telescopes only observe a star once per week on average. In Architecture VIIIb, the additional telescope in both the northern and southern hemispheres are located at the same observatory as the 10 m telescopes, and therefore this gap in cadence and longitudinal coverage is filled.
In the next sections we look at the performance of each architecture under each of the ideal, white and correlated noise models as we change 1) which astrophysical mechanisms are included in the noise models, 2) the survey duration, and 3) the instrumental uncertainty.

Individual Effects of Active Regions, Granulation, and Oscillation
Here we examine the effect of each mechanism individually. We follow the same procedure as above, calculating the formal errors assuming the data are generated and analyzed with matching noise models by including only one of the three stellar variability kernels at a time (k AR , k osc , and k gran in Equation 16). We note that we expect oscillations to be least impactful due to our choice of exposure times (following Chaplin et al. 2019). The results are shown in Figure 6. Let us first consider the impact of each noise mechanism on Architecture IIa (see second column of Fig Figure 6). We can use the same ratio notation as above, but we now include subscripts to denote the stellar variability mechanism, e.g., R ideal/iid osc refers to the ratio of the mass SNR in the oscillation-only ideal model to the mass SNR in the oscillationonly white noise model. If we consider only the white noise model (blue histograms) for each mechanism, then the impact is small for oscillations and granulation, which reduce the mass SNR from the no variability ("ideal") case by a median factor of R ideal/iid osc = 1.01 and R ideal/iid gran = 1.14, respectively. However, the impact is significant for active regions, which reduces the mass SNR from the ideal case by a median factor of R ideal/iid AR = 3.96. If we consider the correlated noise models one noise source at a time, then each of these factors is increased and the reduction in the mass SNR is more severe. Active regions are still the most important effect, reducing the mass SNR by a median factor of R ideal/corr AR = 6.64, but the impact of granulation is also significant, reducing the mass SNR by a median factor of R ideal/corr gran = 2.5. As in the white noise case, the effect of oscillations is small, reducing the mass SNR from the ideal case by only a factor of R ideal/corr osc = 1.51. If we compare the AR-only noise model to including all three noise sources, then we see that the combination is slightly worse than AR-only, presumably due to the increased effect of granulation. In total, the combined effect of all variability sources reduces the mass SNR by a median factor of 7.42 in the correlated case, compared to a median factor of 3.98 in the white noise case. These values, along with the values for each architecture are reported in Table 4. We conclude that correlated "noise" due to active regions and granulation is an important effect that must be considered when forecasting the science yield of an EPRV survey similar to those proposed by the Newman et al. (2021).
Next, we compare the projected number of Earth-analog detections for the five most promising EPRV survey architectures considered by Newman et al. (2021), while accounting for the effects of correlated "noise" due to stellar variability. Qualitatively, each of the survey architectures considered by Newman et al. (2021) is affected similarly, but the magnitude of the effect differs due to the typical number and spacing of observations dictated by the survey architecture. Considering all three sources of correlated noise, Architecture IIa performs the best. To understand the reasons, it is useful to consider the relative performance of each survey architecture in the presence of each type of stellar variability.
First, we consider the white noise scenario and note that the differences between architectures are minimal for granulation-only and oscillation-only. Architecture I performs the best in the oscillation-only scenario, as expected since this can be explained by the long exposure times on the 2.4 m telescopes in this architecture (refer to Table 2) that average over the typical p-mode oscillations more effectively than architectures with larger telescopes and shorter exposure times 5 . Other architectures perform worse than Architecture I in the oscillation-only scenario, despite having ∼40-90% more observations per star, due to their shorter integration times. However, we emphasize that the performance difference between architectures in the oscillation-only case is minimal and not drastically different than the ideal model, suggesting that oscillations will not be the dominant hurdle to measuring precise masses of Earth-mass exoplanets in future surveys, provided they adopt similar exposure times.
Differences between architectures in the white noise granulation-only case are even smaller. For most stars, the granulation timescales are longer than integration times for any of the telescopes considered, so exposure time is not as effective in mitigating granulation as it is for oscillations. Instead, granulation mitigation occurs from observing a given star multiple times per night across the multiple telescopes (Dumusque et al. 2011;Meunier et al. 2015). Architecture IIa performs only slightly better, likely due to the larger number of observations. The biggest differences between architectures in the white noise model come from the AR-only scenario, with results closely matching those described in Section 5.1. Since the AR model has the largest white noise amplitude, the number of observations per star accounts for the performance differences between architectures. Note the poor performance of Architecture I when accounting for the effects of active regions. Even in the white noise scenario, less than 40% of stars are detected with SNR > 10, nearly a factor of two worse than the next best architecture, Architecture VIIIa.
Finally, we compare the yield of different survey architectures in the presence of correlated "noise" from each type of stellar variability. Qualitatively, the story is similar. For each mechanism, the SNR of mass measurements decreases relative to a white noise model. Again, stellar variability due to active regions is the dominant mechanism for reducing the survey yield relative to ideal or white noise simulations. Architecture IIa, Architecture IIb, and Architecture VIIIb perform the best in the presence of AR-induced noise, but the number of planets with masses measured at SNR > 10 decreases significantly to at most 20% of stars surveyed, even in the absence of granulation or oscillations. While either granulation or oscillations alone does not significantly reduce the number of planets with mass SNR > 10, each does reduce the expected mass SNR significantly for most architectures (excluding Architecture I; Architecture IIb shows only a modest effect). Despite the fact that neither granulation nor oscillations alone significantly reduces the number of planets with mass SNR > 10, each does still introduce a reduction in the overall planet mass SNR in most architectures. This additional reduction due to granulation and oscillations is what accounts for the reduction in the number of planets with masses measured at SNR > 10 when comparing the simulations including all three mechanisms for stellar variability to the AR-only model. In other words, AR-induced variability is the mechanism that reduces the planet mass precisions to the SNR ∼ 10 level; once at this level, it is important to account for the additional reduction due to granulation (smaller) and oscillations (smallest) to estimate the number of stars above the SNR > 10 threshold. These differences between architectures are reflected when using SNR > 5 threshold as well, with improved overall performance across the architectures, as discussed in Section 5.1.

Varying Survey Duration
Next, we explore the effect of survey duration, considering 2-, 5-, 10-, and 15-year surveys. In the case of survey durations less than 10 years, we truncate the simulated observation time series, noting that this induces a proportional decrease in the number of observations during the course of the survey. Given the previous results, we expect poor performance from the 2 year and 5 year surveys; we include these as reference points for other instrument surveys (e.g., Gupta et al. 2021). To simulate a 15-year survey, we populate each additional year beyond year 10 with a year randomly selected from the previous 10 years. The results are shown in Figure 7. The SNR increases as the survey duration (and thus the number of total observations) increases. An important result is the substantial benefit of a 15-year survey in terms of number of planet masses measured with SNR > 10. In both correlated and white noise models, the SNR scales such that a 50% increase in number of observations results in an increase in SNR by a factor of ∼1.22, as expected from √ N , implying the improvements are due to the increased number of observations rather than increased baseline. In almost every architecture, increasing the survey duration from 10 to 15 years (and thus increasing the planet mass SNRs by 22%) results in twice as many planet masses measured with SNR > 10.
Note, however, that our AR model does not account for long-timescale variations in activity as seen in the activity cycles of many stars, including the Sun. While the added observations from extending to a 15-year survey will certainly improve the SNR of planet detections, the benefits may therefore be overestimated under this model. We discuss the possibilities of accounting for and modeling long-term activity cycles in Section 6.3.3. In this case, we have used the default 5 cm s −1 instrumental noise and included all stellar noise components (active regions, granulation, oscillations) in both the correlated (corr) and white noise (iid) calculations. Note that the third row (10-year survey) is then identical to Figure 5.

Varying Instrumental Uncertainty
Finally, we explore the effect of instrumental precision (σ instr ) on survey yield. Our noise model (Equation 16) assumes each observation is contaminated by 5 cm s −1 , 10 cm s −1 , and 30 cm s −1 of non-calibratable instrumental noise that is Gaussian and uncorrelated between observations. For this case, we return to the default 10-year survey and include all 3 astrophysical noise sources (active regions, granulation, and oscillations), both correlated and uncorrelated, as well as the ideal scenario with no stellar variability. The results are shown in Figure 8.
Regardless of the survey architecture, there is a major increase in yield (for planets above SNR > 10) as the instrumental precision improves from 30 cm s −1 to 10 cm s −1 . However, improving the instrumental precision from 10 cm s −1 to 5 cm s −1 results in minimal improvement in survey performance. For many of the telescopes in these architectures, improving the instrumental precision from 10 cm s −1 to 5 cm s −1 means improvements down to the photon noise level (refer to Table 2). Since many of the telescopes have photon noise level between 5 and 10 cm s −1 , improving the instrumental precision below 10 cm s −1 has diminishing returns given that the photon noise is added in quadrature with the other variability components. The slight improvements seen in Architecture IIa, Architecture VIIIa, and Architecture VIIIb come from the 6 m and above telescope classes, which have one telescope per hemisphere with photon noise of 7, 4, and 4 cm s −1 , respectively, whereas Architecture I and Architecture IIb have photon noise In this case, we have used the default 10-year survey duration and included all stellar noise components (active regions, granulation, oscillations) in both the correlated (corr) and white noise (iid) calculations. As expected, the 5 cm s −1 instrumental precision performs the best. However, there is not a significant improvement for most telescopes by going down to 5 cm s −1 from 10 cm s −1 . For many of the telescopes in these architectures, improving the instrumental precision from 10 cm s −1 to 5 cm s −1 means improvements down to the photon noise level. The slight improvements seen in Architecture IIa, Architecture VIIIa, and Architecture VIIIb come from the 6 m and above telescope classes, which have photon noise below 10 cm s −1 (refer to Table 2).

Dependence on Stellar Mass
Finally, we investigate the effect of host star mass on the SNR of planet mass measurements, shown in Figure 9. Given the promising improvement by extending the survey duration, we include results for both 10 and 15-year survey durations in Figure 9. Following Equation 1 and Equation 2, we expect the signal of an Earth-mass planet in the habitable zone to be proportional to M −1.5 . However, fitting a power law to the mass SNR reveals power-law slopes closer to -2. This indicates that σ K ∝ M 0.5 under our model of stellar variability (both white and correlated noise). Since our model for stellar variability due to active regions does not depend on the stellar properties, the scaling σ K ∝ M 0.5 must be due to effects of the combination of granulation and oscillation. It will be important for future studies to examine correlated noise models for magnetic activity to establish how the timescales and amplitudes vary with spectral type. Including a scaling of active regions with stellar type would result in a different slope, as would a scenario in which the effectiveness of active region mitigation strategies depended on stellar mass. Conversely, we may find that our current AR model uniformly over-or underestimates the amplitude of rotationally linked, activityinduced variability. Correcting for this would result in a uniform vertical offset for each star, with the overall slope unchanged. We discuss the impact of an activity model dependent on stellar properties in Section 6.3.2. Figure 9. Planet mass SNR as a function of stellar mass for each architecture, including no noise ("ideal"), white noise ("iid"), and correlated noise ("corr"). We show this for the default 10-year survey (top row) as well as the 15-year survey (bottom row), both using the default 5 cm s −1 instrumental noise. Notice the strong trend with mass and the effect that introducing noise (correlated or uncorrelated) has on the overall SNR for these planets. In particular, in no case are we able to measure a planet mass with SNR above 10 (and therefore measure masses to 10% precision) for a solar mass star. In fact, the cutoff mass above which we cannot achieve SNR= 10 is closer to 0.76 M (K2). Figure 9 shows which target stars would be expected to yield a mass measurement with SNR > 10 (or 5) if they host an Earth-analog based on our survey simulations. In the correlated noise case, most mass measurements have a SNR of 5-10. If one adopts a mass SNR threshold of 10 then the EPRV surveys considered would be limited to planets around low mass stars, M 0.76 M (K2) for a 10-year survey and M 0.85 M (K0) for a 15-year survey. While this paints a bleak picture for measuring precise masses of true Earth-analogs with present instruments and data analysis techniques, we note that each of the five architectures considered here is capable of measuring the mass of Earth-analogs with SNR > 5 or better up to M ∼ 1.1 M . Obtaining more precise mass measurements is also possible if the field were to make significant improvements in our ability to mitigate stellar variability. 6. DISCUSSION

Comparison of Different Survey Architectures
In our simulations, Architecture IIa performs the best across all tests, with Architecture IIb performing similarly well. While neither is capable of measuring more than 20% of planet masses with SNR > 10 (except in the case of a 15-year survey), they are both routinely capable of detecting more than 80% of planet masses with SNR > 5 in most scenarios, including in the case of 10 cm s −1 instrumental precision. Given the relatively poor performance of Architecture VIIIa under the correlated noise models, our simulations suggest that similar RV surveys may want to disfavor larger (10 m) telescopes, if they indeed come at the cost of fewer observations, as prescribed in these architectures. However, there may be additional benefits of larger telescopes to correct or characterize stellar variability in detail, which is not considered in these simulations, e.g., validating MHD simulations of magnetoconvection (Cegla 2019). We note that Architecture IIb outperformed Architecture VIIIb, despite having roughly the same number of observations per star. This result suggests that an array of small (2.4 m) telescopes with photon noise above 10 cm s −1 is also disfavored. Therefore, it seems that there is a sweet spot in terms of telescope size, coverage, and photon precision. Our simulations favor architectures with a longitudinally spaced array of 3.5 m telescopes with 100% observing time and 7-10 cm s −1 instrumental precision. We recommend that the size and observing time of observatories for an EPRV survey be revisited once the community learns more about how effectively the effects of stellar variability can be mitigated as a function of spectral resolution, SNR and wavelength coverage.

Active Regions & Rotationally Linked Variability
The architecture that performed best was able to measure planet masses to better than 10% for only 17% of the sample in a 10-year survey with ∼5000 observations, and for only 33% of the sample in a 15-year survey with ∼7500 observations. Given our results that rotationally linked variability was the dominant mechanism, we therefore find that a large number of observations is needed to overcome active regions, if using purely time-domain methods. Given the larger number of observations required by the surveys simulated in this study, we encourage further research to develop stellar variability indicators that would enable future surveys to overcome the challenges of stellar variability with fewer observations. Our results here highlight the importance of ongoing efforts on other fronts to account for and mitigate stellar variability (see, e.g., Table A-3 of Crass et al. 2021). Additional improvements (relative to our results) may be possible if combined with simultaneous activity time series, which have not been accounted for in these calculations. Further improvements may be also possible if we are able to correct stellar activity in wavelength domain (e.g., Davis et al. 2017;Dumusque 2018;Wise et al. 2018;Collier Cameron et al. 2020;de Beurs et al. 2020;Holzer et al. 2021;Cretignier et al. 2021). With these approaches, larger apertures may still be valuable or even critical, if our ability to correct for or characterize stellar variability depends on SNR and/or resolution.

Granulation & Oscillations
Increasing telescope size has diminishing returns due to integration time and the combination of oscillations and granulation. While smaller telescopes make fewer total number of observations for a given survey duration and target list, the longer exposure times on smaller telescopes helps them to suppress the amplitudes of spurious RV signals due to oscillation and granulation.
It is important to note that the stellar sample studied here contains mostly solar analogs, which allow for proper averaging over the dominant stellar p-mode timescale (5 minutes for the Sun). Additionally, the granulation timescales are such that observations spanning multiple telescopes in a given night effectively average in a manner similar to Dumusque et al. (2011), who used multiple observations per night using a single telescope. These observation strategies may not be as effective for surveys of more evolved stars, with longer timescales for granulation (days) and oscillation (hours). Additional improvement for mitigating granulation may be possible following Dumusque et al. (2011) on each telescope, rather than each architecture as a whole, though at the cost of a reduced target list in order to observe each star more frequently. Since granulation is currently not the limiting factor in the success of these surveys, we deem such measures as non-critical. As above with activity, advancements in mitigation techniques in the wavelength domain (Collier Cameron et al. 2020) may require new observational approaches to mitigate granulation that could favor larger apertures or higher cadence.
Considering only oscillations and granulation driven variability, stars modestly less massive than Sun (i.e., M 0.7 − 0.9M , or K5-G8) appear to be better targets for EPRV surveys of Earth-analogs than stars somewhat more massive than the Sun (i.e., M ≥ 1.2M , or F7 and earlier). As seen in Section 5.5 and Figure 9, the masses of planets around the lower mass stars will be measured with better fractional precision than Earth analogs around solar-mass stars, due to a combination of the increased signal as a result of the closer habitable zones (primary effect, K ∝ M −1.5 ) and the decreased granulation and oscillation amplitudes and timescales of their host stars (secondary effect σ K ∝ M 0.5 ).

Target Selection
In light of the strong dependence of planet mass SNR on stellar mass in the presence of correlated noise (as shown in Figure 9), future survey simulations may wish to adjust the choice of target stars.
For example, it may be worthwhile to remove the small sample of high-mass stars where the planet mass SNR is unlikely to exceed ∼5 in a 15-year survey and reallocating that time to the remaining stars. Alternatively, one could adjust the observing scheduler, so as to aim for a more uniform sensitivity across host star mass. Implementing such design choices will depend on the exact survey goals, and will require future simulations to adequately test their impact.

A Priori Covariance Matrix
Our analysis has assumed that the covariance matrix for the GP noise models is known a priori, and that the only free parameters are planet parameters. In practice, the underlying covariance matrix will not be perfectly known, and will contribute additional loss of precision in the planet mass SNR. While the granulation and oscillation models are rooted in relatively well-established asteroseismic scalings, the AR model used here is intentionally a first-order approach. However, it is reasonable to assume that any star selected for such dedicated surveys will have additional observations that can help constrain the appropriate hyperparameters of a more accurate activity model (e.g., highcadence photometry, knowledge of long-term activity cycles from legacy Ca II H&K measurements).

Improve model for Active Regions & Rotationally-linked Activity as a function of stellar properties
As described in Section 3, the active regions model we implement comes from a fit to simulated solar observations based on SOAP2.0 (Dumusque et al. 2014) that also used simultaneous activity indicators. Therefore, the model assumes a base level of activity mitigation and it is likely that these results represent a reduced active regions component, as opposed to the full expected solar variability due to active regions. Indeed, the short timescale and low amplitude indicate that this may indeed be the case, as it is dominated not by rotational modulation, but other shorter-timescale effects. However, the model is useful as it attempts to represent a realistic level down to which we may be able to correct for using current methods (i.e., using simultaneous photometry, or removing correlations with activity indicators) rather than an optimistic or pessimistic scenario. Recently, Langellier et al. (2020) used a quasi-periodic kernel fit to 800 days of HARPS-N solar data to describe the variability of activity-induced signals, which provides a ground-truth test for the effects of activity-induced variability.
Both of these models for stellar variability (Gilbertson et al. 2020b;Langellier et al. 2020) are based on the Sun. Therefore, the amplitudes and timescales for the covariance kernels do not depend on stellar properties. A more sophisticated model that scales the kernel parameters based on stellar properties is beyond the scope of this paper. For now, we estimate the effect such a model might introduce on our results. First, consider the sign of the mass dependence of the white noise amplitude of activity-induced variability due to faculae/plages/spots. Among F, G, and K stars, activity-induced RV variability is positively correlated with mass (e.g., Wright et al. 2004;Isaacson & Fischer 2010;Luhn et al. 2020). A simple model where the amplitudes, {a 0 , a 1 } in Equation 3, scale with stellar mass (or T eff ) would therefore introduce an additional positive mass dependence for σ K . As a result, we would expect steeper negative slopes in Figure 9, with their value at M = M fixed. Such a model would reduce the activity-induced amplitudes of the low mass stars in the sample, and thus increase the threshold stellar mass at which Earth-analogs are no longer characterized with planet mass SNR > 10. Another simplified approach would be to scale the amplitudes with measured activity metrics, such as σ K ∝ R HK 1.66 as in Gupta et al. (2021). Haywood et al. (2020) found that the unsigned magnetic flux of the Sun is a good proxy for rotationally-modulated activity-induced velocity variations. Unfortunately, it has not yet been demonstrated that this it is practical to measure this quantity for other stars. If it becomes practical for measures of the unsigned magnetic flux to be extended to other stars, then the results could inform the activity model in future EPRV survey simulations and/or lead to improved activity mitigation strategies.
In addition to amplitude scalings, we must also account for dependence of the correlation timescales on stellar properties. It is reasonable to expect that the correlation timescale would scale with the stellar rotation period (e.g. Giles et al. 2017). As a population, more massive stars rotate more rapidly than less massive star (van Saders & Pinsonneault 2013). Since shortening the correlation timescale has the effect of increasing the planet mass SNR, this effect is expected to partially counteract the effect of scaling the amplitude with stellar mass.

Magnetic Activity Cycles
We have not considered long term activity trends in this analysis. It is currently impractical to implement a realistic global population model that accounts for long period activity cycles. Legacy surveys like the Mt. Wilson survey (e.g., Baliunas et al. 1995) and the California Planet Search (e.g., Isaacson & Fischer 2010) have monitored the Ca II H & K activity for stars spanning several decades. However, while recent efforts have provided long time baseline analyses of activity cycles (Baum et al. 2022), clear trends between stellar properties and magnetic cycle periods and amplitudes and stellar properties have not been established. Further, it is not possible to predict a priori whether a star will exhibit cyclical behavior in its activity time series. Even among known cycling stars, it is not possible to predict the amplitude-or even the sign-of the corresponding RV effect. Since long term activity cycles would be well sampled both in RV's and various simultaneous activity metrics, one could model them with an additional GP term. However, we expect that the effects of activity cycles are more complex. For example, the amplitude of variability induced by active regions would increase during periods of high activity and vice versa (Meunier et al. 2010). This could be modeled hierarchically, e.g., one GP with a long correlation timescale to model the magnetic activity cycle that multiplies the GP for active regions (Equation 3). We leave the development of such a model to future studies.

Effects of unknown orbital period, phase, non-zero eccentricities, and inclination effects
Our calculations have assumed circular orbits with known orbital period and phase. In practice, EPRV surveys uncertainties in the orbit will result in covariances between planet masses and orbit parameters and necessitate more computationally expensive analyses (e.g., MCMC) to characterize the uncertainties in planet masses and orbits. Given the number of observations in EPRV surveys and the orbital periods of habitable zone planets, we anticipate that aliases due to the solar day and lunar month be much less problematic than they can be for planets with shorter orbital periods. However, we expect added difficulties for planets with an orbital period close to one solar year, particularly for stars far from the ecliptic poles (i.e., subject to annual observability constraints). For such planets, the survey duration may need to be extended for a timescale greater than P −1 − (solar year) −1 −1 in order to sample the full range of orbital phases and build confidence that the putative RV signal is not an artifact.
We have assumed edge-on orbits, as done in (Newman et al. 2021), so as to allow direct comparisons. For understanding the effect of correlated noise, assuming an edge-on orbit is sufficient. However, the SNR that we (and Newman et al. 2021) report overestimates the signal expected with these noise models for a population of planets with randomly oriented orbits. When choosing how much observing time to allocate to reach a desired SNR threshold, this effect should be considered. To account for a realistic inclination distribution, the SNR values reported here can be multiplied by sin i, where cos i is drawn from a uniform distribution. Assuming randomly oriented orbital planes, the median of sin i ∼0.866. For our baseline survey simulations, assuming isotropic orbit orientations would decrease the percentage of planets with masses measured with SNR > 10 in Architecture IIa from 17% to 7% and percentage of planets with masses measured with SNR > 5 from 81% to 71%.

Effects of additional planets
Our calculations have assumed each stars hosts a single planet. In practice, most inner planetary systems contain multiple planets (He et al. 2020). Inferring the number of planets in a system from RV observation is a long-standing challenge (e.g., Nelson et al. 2014). Even once the number and approximate orbital periods of planets are known, inferring the masses and orbits of multiple planets simultaneously is expected to result in increased uncertainties and sometimes non-trivial correlations in the their masses and orbital parameters. Near-degeneracies can be particularly problematic for planets with period ratios near 1:2 (or other low-order mean motion resonances). While the fraction of planetary systems with dynamically significant mean motion resonances is relatively small (Veras & Ford 2012), planets near the 1:2 (and 2:3) mean-motion resonances are relatively common (Fabrycky et al. 2014). We encourage future EPRV survey simulations to consider the implications of multiple planet systems.

Incorporating Recent Wavelength-Domain Data Analysis Strategies for Mitigating Stellar Variability
Recent wavelength-domain efforts have also focused on disentangling spectral line deformations due to variability from Doppler shifts due to center-of-mass motions from orbiting planets (e.g., Davis et al. 2017;Dumusque 2018;Wise et al. 2018;Collier Cameron et al. 2020;Zhao & Tinney 2020). Such techniques may result in the ability to "clean" the RV time series, removing some level of variability. While these techniques focus specifically on activity-induced variability, they may be more broadly applicable to any spectral deformation, for example, those from granulation or oscillations. However, these methods have yet to be tested for the observing cadence or long survey durations expected for the types of surveys studied in this work. It is possible that improvements in stellar variability mitigation will result in a simple amplitude reduction for each of the active regions, granulation, and oscillation components studied here. However, it is also possible that the residuals after applying such techniques will have a different correlation structure. In either case, the correlated noise models in this work should be updated as the RV community continues to improve their ability to disentangle Doppler shifts and stellar variability.

Considering Alternative Observing Strategies for Mitigating Stellar Variability
In this study, we calculated the impact of correlated noise on the planet mass SNR for one reasonable survey strategy. Future studies could investigate alternative observing strategies. For example, a survey could intentionally take multiple observations of the same star each night to help reduce the effects of granulation. As another example, a survey might intentionally clump observations of a given target, so as to better characterize rotationally-linked variability (e.g., intensive observations over 3 rotation periods bracketed by observations at a lower rate.)

Empirical measurements of Activity & νmax
In this work, we assumed all stars have the same level of activity. In practice, each star will differ in activity, which can be constrained with some combination of a priori knowledge (e.g., through legacy Ca II H & K surveys like the Mt. Wilson program) and simultaneous Ca II emission monitoring, as recommended by Crass et al. (2021). In Section 6.3.2, we discussed how our results might be affected by including an activity component that depends on stellar properties. It is also possible that an activity noise model that depends on empirical activity indicators would shed light on the survey target selection. Incorporating empirical measurements of activity might then allow surveys to target only the stars for which we expect to measure planet masses of Earth analogs to better than 20%, (i.e., removing the high-mass end of Figure 9) and allow the observation time to be given back to the remaining sample.
We have also used scaling relations for ν max that depend on log g and T eff . In principle, one could measure the ν max for every star on the target list, so as to include a more accurate description of that star's oscillations and granulation spectrum. A direct measurement of ν max would reduce the uncertainty associated with the granulation and oscillation amplitudes and timescales, which will otherwise be driven by uncertainty in estimating log g. We expect that more precise measurements of ν max will improve the accuracy and precision of the calculated planet mass SNR for each star by only a small amount, given that the contribution of granulation and oscillations overall is small compared to that of rotationally linked activity. While it is not likely that such precision will affect the results for this type of study, such efforts may prove worthwhile for properly analyzing and interpreting the data from these RV surveys on a star by star basis.

Statistical methodology for detecting planets in presence of stellar variability
In this work, we've focused on what the formal measurement precision for K would be. That's similar to, but different than actually detecting a planet. Things get even more complicated when there are likely multiple planets, each of which could be aliased with each other, stellar activity, window function, etc. Doing proper planet detection calculations would be much more computationally expensive and the formal measurement precision calculated here serves as a basic check on our expectations for various survey designs.
We have also assumed that we're working on a univariate time series of velocity measurements that have been cleaned of any contamination due to stellar variability. In practice, RV surveys will produce a multivariate time series including a raw RV measurement and multiple stellar variability indicators that also provide information about the state of any active regions, and potentially granulation and oscillations. Future studies should explicitly consider the how uncertainty in the contribution from stellar variability propagates to affect the accuracy and precision of planet mass measurements. When there is a potentially viable/practical strategy for using those, then it will be necessary to rerun simulations to provide more realistic results. In this paper, we have explored solving the problem of stellar variability by simply getting lots of observations with current accuracy. There are complementary efforts to try to measure velocities in ways that are less contaminated by stellar variability that may provide additional improvements (e.g., Davis et al. 2017;Dumusque 2018;Wise et al. 2018;Collier Cameron et al. 2020;de Beurs et al. 2020;Holzer et al. 2021;Cretignier et al. 2021).

SUMMARY & CONCLUSIONS
We provide easy-to-use Gaussian process (GP) models for the apparent radial velocity (RV) signals caused by stellar oscillations and granulation as a function of stellar properties. We provide analytic expressions that also account for effects of finite integration time. active regions (AR) and rotationally linked stellar variability (Gilbertson et al. 2020b), we have a powerful toolbox for quantifying the effects of stellar variability on EPRV surveys.
We find that accounting for correlated noise is critical to accurately estimating the mass precision of future EPRV surveys. The effect of correlated noise reduces the planet mass SNR of Earth analogs by a factor of 7.42 from the no variability case and a factor of 1.86 from the white noise case for the survey design that performs best, Architecture IIa.
Rotationally linked stellar variability has the largest effect (compared to granulation or oscillations) on the detection and characterization of Earth-analogs in EPRV surveys. Our AR-only model with correlated noise results in a reduction of the planet mass SNR of Earth analogs by a factor of 6.64 from the no variability model and a factor of 1.67 from our AR-only white noise model for Architecture IIa.
Granulation also has a significant effect on the detection and characterization of Earth-analogs, particularly for stars more massive than the Sun. Our granulation-only model with correlated noise results in a reduction of the planet mass SNR of Earth analogs by a factor of 2.2 from our granulation-only white noise model and a factor of 2.5 from the no variability model for Architecture IIa.
The current observational strategy for mitigating the effects of oscillations on RV measurements is sufficient to characterize the masses of Earth-analogs. Our oscillation-only model with correlated noise results in a reduction of the planet mass SNR of Earth analogs by a factor of 1.5 from our oscillation-only white noise model and a factor of 1.5 from the no variability model for Architecture IIa.
The choice of observatory architecture significantly affects the precision of mass estimates and the fraction of Earthanalog planets which are expected to have masses measured with a given accuracy. For the default 10-year surveys with 5 cm s −1 instrumental precision, the percentage of Earth analogs with planet masses measured with SNR > 10 varies from 7% to 17% (4% to 7% when inclination is random) among the architectures studied.
Increasing the survey duration from 10 to 15 years (and thus increasing the number of observations by 50%) can significantly affect the science yield of future EPRV surveys such as those proposed by Newman et al. (2021). For 15year surveys with 5 cm s −1 instrumental precision, the percentage of Earth analogs with planet masses measured with SNR > 10 roughly doubles in each architecture compared to the 10-year survey, varying from 21% to 34% among the architectures studied. As these improvements follow the expectations of √ N , they arise primarily from the increased number of observations during the longer survey duration. A 10-year survey with 50% more observations may see similar results, however the increased rate of observations will increase the correlations between observations, which may affect each architecture differently. Future survey simulations may wish to consider such an approach. Improving instrumental precision (or reducing non-calibrated instrument noise) from 30 cm s −1 to 10 cm s −1 is expected to nearly double the percentage of planet masses measured with precision better than 10% (or an increase of 8-10 percentage points for mass precision better than 20% ).
Improving to the instrumental precision from 10 cm s −1 to 5 cm s −1 is not expected to have a significant impact on the yield of EPRV surveys similar to those proposed by Newman et al. (2021).
Our simulations indicate that none of the five survey architectures proposed by Newman et al. (2021) studied in this work would be expected to measure masses of Earth-analogs with 10% precision (or better) for most of the host stars on their "green target list," given the assumptions and caveats as detailed in Section 6.3.
Our simulations suggest all five of the survey architectures proposed by Newman et al. (2021) studied in this work (Architecture I, Architecture IIa, Architecture IIb, Architecture VIIIa, and Architecture VIIIb) are expected to measure masses of Earth-analogs with 20% precision (or better) for >70% of the host stars on their "green target list".
Our simulations suggest that three of the five survey architectures proposed by Newman et al. (2021) (Architecture IIa, Architecture IIb, and Architecture VIIIb) could be expected to measure masses of Earth-analogs with 10% precision (or better) for >30% of the host stars on their "green target list", if only the survey duration were extended from 10 to 15 years.
We recommend future research to better characterize the magnitude and temporal characteristics of apparent RV variability due to faculae, star spots, and other rotationally-linked stellar variability, particularly for stars likely to be targeted by future direct imaging mission-as also suggested in Table A-3 of Crass et al. (2021).
As researchers develop various methods for modeling and mitigating the effects of stellar variability, it will be important to characterize both the magnitude and the temporal correlations of the residuals (i.e., the apparent stellar RV that is not subtracted by the model) for each potential mitigation method. When assessing the utility of potential strategies for mitigating stellar variability, it will be important to use simulations such as in this study to quantify the benefit of each method (as opposed to simply reporting the magnitude of residuals).
We thank the anonymous referee, as well as Megan Bedell, Jason Wright, and Suvrath Mahadevan for their helpful comments and suggestions. This research was partially supported by Heising-Simons Foundation Grant #2019-1177. This work was partially supported by NASA Exoplanet Research Program Grant #80NSSC18K0443. This work was supported by a grant from the Simons Foundation/SFARI (675601, E.B.F.). This work was partially supported by funding from the Center for Exoplanets and Habitable Worlds, which is supported by the Pennsylvania State University, the Eberly College of Science, and the Pennsylvania Space Grant Consortium.