The Detectability of Rocky Planet Surface and Atmosphere Composition with JWST: The Case of LHS 3844b

The spectroscopic characterization of terrestrial exoplanets will be made possible for the first time with JWST. One challenge to characterizing such planets is that it is not known a priori whether they possess optically thick atmospheres or even any atmospheres altogether. But this challenge also presents an opportunity - the potential to detect the surface of an extrasolar world. This study explores the feasibility of characterizing the atmosphere and surface of a terrestrial exoplanet with JWST, taking LHS 3844b as a test case because it is the highest signal-to-noise rocky thermal emission target among planets that are cool enough to have non-molten surfaces. We model the planetary emission, including the spectral signal of both atmosphere and surface, and we explore all scenarios that are consistent with the existing Spitzer 4.5 $\mu$m measurement of LHS 3844b from Kreidberg et al. (2019). In summary, we find a range of plausible surfaces and atmospheres that are within 3 $\sigma$ of the observation - less reflective metal-rich, iron oxidized and basaltic compositions are allowed, and atmospheres are restricted to a maximum thickness of 1 bar, if near-infrared absorbers at $\gtrsim$ 100 ppm are included. We further make predictions on the observability of surfaces and atmospheres, perform a Bayesian retrieval analysis on simulated JWST data and find that a small number, ~3, of eclipse observations should suffice to differentiate between surface and atmospheric features. However, the surface signal may make it harder to place precise constraints on the abundance of atmospheric species and may even falsely induce a weak H$_2$O detection.


Characterizing Terrestrial Exoplanets in the Era of JWST
During the next few years the James Webb Space Telescope (JWST) will serve as the prime observatory enabling us to characterize rocky exoplanets using spectroscopic information about the planetary thermal emission. The challenges to characterizing such planets include their smaller sizes and increased potential for atmospheric diversity, compared to more well-studied hot Jupiters. Another less-appreciated challenge is that we lack a priori knowledge of whether individual terrestrial exoplanets possess atmospheres at all. This leads to a potential source of confusion in characterizing rocky worlds -it is not initially clear whether a planet's emission originates from its surface, or from its optically thick atmosphere. (A third possibility is a semioptically thin atmosphere, for which the surface is seen directly at certain wavelengths corresponding to transparent windows through the atmosphere.) To optimize the scientific yield of observations of rocky exoplanets with JWST, it is therefore crucial to understand the spectroscopic fingerprints of both their atmospheres and surfaces and how these translate to an emission signal recorded by the telescope.
Terrestrial exoplanets that are found to not possess atmospheres could present the first opportunities to char-acterize the surface composition of planets beyond our solar system. The analogs of airless or nearly airless rocky bodies in the solar system are Mercury, Mars, the Moon, and various asteroids. Before they could be spatially well resolved, many of these objects were studied using the spectra of reflected solar radiation or of thermal emission. The lunar mare are generally dark and absorb strongly at 1 µm and 2 µm, indicating that they are basaltic in composition (Pieters 1978), while the lunar highlands are bright and relatively featureless, pointing to a plagioclase feldspar composition (Pieters 1986). The spectral features of the Martian surface, together with its visually red appearance and its polarization properties, show that it is rich in ferric oxides (McCord & Westphal 1971). Spectra of Mercury's surface initially indicated that it had a similar composition to that of the lunar highlands due to its relatively flat spectrum (Blewett et al. 1997). However, recent observations from spacecraft of Mercury have pointed to a surface which is closer to ultramafic (Nittler et al. 2011) -the more recent revision demonstrates the challenges in using remote sensing data to uniquely constrain planetary surface composition. Present-day Earth possesses a granitoid surface, a tertiary crust that is a record of plate tectonics and the incorporation of water in subducted crustal materials. All of these solar system examples present plausible surface compositions for terrestrial exoplanets. Additional possibilities include a metal-rich surface, which would be indicative of a world with the mantle stripped off (Hu et al. 2012), among others, as discussed in Section 4.2 of Mansfield et al. 2019).
At the limit of airless planets, (Hu et al. 2012) proposed to use reflected and emitted light spectra of terrestrial exoplanets to constrain their surface compositions. However, the surface characterization may be complicated by the fact that such planets could possess substantial atmospheres, which could obscure the surface from view. We therefore need to explore whether an exoplanet's surface composition can be constrained in the absence of an atmosphere, and whether degeneracies exist between determining atmospheric and surface properties for cases in which an overlying atmosphere is present. This is the purpose of our current work.

The Rocky Super-Earth LHS 3844b
In this study, we focus on the specific test case of the hot, rocky exoplanet LHS 3844b. We assess our ability to characterize the planet with JWST with a focus on plausible atmospheric and surface compositions, given our current knowledge of the planet's properties. LHS 3844b, discovered by the Transiting Exoplanet Survey Satellite (TESS) in 2018, is a presumed tidally-locked super-Earth with a radius of 1.303 ± 0.022 R Earth and orbits its red dwarf M5-type host star every 11 hours (Vanderspek et al. 2019). With an emission spectroscopy metric (ESM) of 28 (Kempton et al. 2018), LHS 3844b provides one of the strongest emission signals for any exoplanet that is believed to have a dayside made of solid rock (in contrast to even hotter magma worlds), making it a prime target for future characterization.
Due to the very high irradiation level from the host star which promotes atmospheric escape any hydrogendominated primordial atmosphere of LHS 3844b is thought to have been lost over the course of its lifetime (Owen & Wu 2013;Owen & Wu 2017). However, the presence of a secondary atmosphere is difficult to predict by theory as such can be produced through many pathways during the planet's evolution, e.g., degassing during accretion (Elkins-Tanton & Seager 2008), vaporization of a molten mantle (Schaefer et al. 2012), reactions between nebula gas and magma (Kite & Schaefer 2021), high-energy impacts (Lupu et al. 2014) and volcanism (Gaillard & Scaillet 2014). However, there are also a number of potential conditions that would prevent the formation and retention of a thick secondary atmosphere on LHS 3844b or similar planets, including (i) a volatile-poor formation and associated inhibition of late-stage degassing (Kite & Barnett 2020), (ii) stripping of the atmosphere due to impacts (Schlichting et al. 2015), (iii) primordial atmosphere loss, as heavier molecules will be dragged away alongside H 2 when the flux is high (Hunten et al. 1987), and (iv) a high stellar UV flux preventing the revival of an atmosphere (Kite & Barnett 2020).
Empirical data appear to strengthen the picture that LHS 3844b does not possess a thick atmosphere. Namely, Spitzer phase curve observations with IRAC Channel 2 at 4.5 µm by Kreidberg et al. (2019) measured a high dayside brightness temperature of 1040 ± 40 K together, an undetectably small emission from the nightside and a phase curve offset consistent with zero. Since the main driver of heat redistribution is large-scale atmospheric circulation, the finding of a negligible longitudinal heat transport points toward the absence of a thick atmosphere (Showman et al. 2013;Koll 2022). Furthermore, comparing atmospheric model predictions to the measured eclipse depth of 380 ± 40 ppm Kreidberg et al. (2019) narrowed the possible surface pressure to 10 bars or less. Simulating atmospheric mass loss over the planet's history, Kreidberg et al. (2019) further concluded that only with an initial water inventory of at least ∼ 100 Earth oceans could atmospheric erosion be prevented. Additionally, more recent ground-based observations using 13 transits in the optical to nearinfrared by Diamond-Lowe et al. (2021) reveal a flat transmission spectrum, which is also consistent with a bare rock scenario. Specifically, the transmission spectrum observations ruled out a hydrogen-dominated atmosphere with high confidence (unless there are highaltitude clouds that would also flatten the spectrum) and disfavored a water-steam atmosphere with a surface pressure ≥ 0.1 bar.
The most promising strategy to constrain the surface composition of rocky exoplanets is through secondary eclipse observations which probe the planetary reflection and thermal emission spectrum. In contrast to the transmission spectrum, the planetary surface signal is imprinted on the emission spectrum in wavelengths where the atmosphere is optically thin. In addition, as the thermal emission signal scales with the planetary temperature, secondary eclipse spectroscopy appears to be especially well suited for atmospheric detection and characterization of hot planets, such as LHS 3844b (Morley et al. 2017;Koll et al. 2019). Due to these reasons, we model the planetary emission spectrum and make predictions on the detectability of LHS 3844b's atmosphere and surface using secondary eclipse observations in the current work.

Outline of This Study
First, we constrain the parameter space of surface and atmospheric properties of LHS 3844b that is consistent with the measured Spitzer eclipse depth (Sect. 3.1). It turns out that the consistent surfaces are metal-rich, basaltic and iron oxidized crusts (see Fig. 2) and atmospheres up to ∼ 1 bar surface pressure, if a near-infrared absorber at ≥ 100 ppm is included (see Fig. 3 and Table 2). Among all surface and atmosphere combinations that are consistent with observations we focus on two limiting cases. On one end, we assume that there is no atmosphere and the planetary signal stems solely from the surface; this is the no atmosphere limit (Sect. 3.1.1). On the other end, we include an atmosphere overlying a rocky surface, exploring a large number of compositions and thicknesses. We assess the maximum surface pressure that is consistent with Spitzer, and for each tested atmospheric background and gas absorber pair we pick the model that provides the largest gas absorber features in the planetary spectrum; this the thick atmosphere limit (Sect. 3.1.3). As shown in Fig. 4 the atmosphere models in this limit consist of the following compositions and surface pressures: O 2 with 100 ppm CO 2 and 0.1 bar, O 2 with 100 ppm SO 2 and 1 bar, O 2 with 1 ppm H 2 O and 10 bar and N 2 with CH 4 and 1 bar. For the no and thick atmosphere limits we assess the observabil-ity with JWST, determining to what extent JWST will help characterize this planet under optimal conditions in terms of surface and atmosphere visibility within the constraint of the Spitzer measurement (Sect. 3.2). These results are shown in Figures 5,6,9 and Tables 3,4,6, illustrating that the more reflective surfaces and all of the atmospheric setups will be detectable with less than a handful of eclipse observations. Finally, using a Bayesian framework we explore how well atmospheric abundances may be constrained from observations and what impact the surface signal has on the atmospheric retrieval (Sect. 3.3). We find that, while the presence of atmospheric species can be recovered with 5 eclipses, it will not be straightforward to pinpoint their precise abundance. On top of that, the surface signal makes it harder for the retrieval model to disfavor the presence of atmospheric species and may even mimic atmospheric H 2 O (see Figures 10,11 and Tables 7,8). We conclude and summarize in Sect. 4, discussing new constraints for the surface and atmosphere of this planet based on our results and how these constraints are narrower than the previous findings of Kreidberg et al. (2019). Lastly, we discuss implications for characterizing terrestrial exoplanet surfaces and atmospheres in the coming years with JWST.

Radiative Transfer Modeling
We generate model spectra of LHS 3844b with different assumed atmospheric species and surfaces with the open-source, 1D radiative transfer (RT) code HELIOS 1 (Malik et al. 2017(Malik et al. , 2019b, which simulates the atmosphere in radiative-convective equilibrium and the corresponding planetary emission. In contrast to most radiative transfer models used in the exoplanet field, HELIOS takes into account the radiative effects of the surface on both the atmosphere and the resulting planetary spectrum. Previously, only a grey surface albedo, constant over all wavelengths, could be modeled with HELIOS, as described in detail in Malik et al. (2019a). We have since then expanded the code's functionality to include a surface albedo as a function of wavelength, enabling the consideration of realistic surface albedos in the model. We describe updates to the HELIOS code and the new method of modeling the surface in Appendix A.
In HELIOS the radiative transfer calculation is performed twice. First, the atmospheric temperature profile and the surface temperature in radiative-convective equilibrium are obtained. For this step, we use the k-distribution method with 420 wavelength bins between 0.245 µm and 10 5 µm (distributed evenly in wavenumber space) and 20 Gaussian points within each bin. Since in this work each atmospheric set-up consists of only one major absorbing species, it is not necessary to assume any correlation between opacities from different sources (as is done in the correlated-k approximation). Once the equilibrium state is found, HELIOS is used in postprocessing mode to generate the emission spectrum using opacity sampling with a resolution of R = 4000.
Convection is treated in HELIOS via convective adjustment with the adiabatic coefficient ∇ ad = (∂ log T /∂ log P ) S -where T is the temperature, P is the pressure and S is the entropy -set to 2/7 (1/4) if a diatomic (tri-atomic) molecule acts as the atmospheric bulk gas, according to the ideal gas approximation. We model dayside-averaged conditions and use the scaling theory for heat redistribution of Koll (2022), their Eq. (10), making the day-to-night heat transport dependent on the thickness of the atmosphere. Note that the scaling of the heat redistribution subtracts the total heat that is assumed to be transported to the nightside from the incoming stellar flux in order to estimate the dayside heat content, but it does not include any vertical dependency of the day-to-night heat flow. In the no-atmosphere case, we assume no global heat transport and set the redistribution parameter accordingly to f = 2/3 (Burrows et al. 2008;Hansen 2008). Lastly, a numerical limitation of the HELIOS code is that it is not possible to model a surface without an overlaying atmosphere, and so we approximate this situation by including an atmosphere with P surf = 2 × 10 −9 bar which has a ∼ 10 −10 Planck mean optical depth. 2 We use a PHOENIX stellar model (Husser et al. 2013) to simulate the spectrum of the host star, downloaded from the online spectral library 3 and interpolated for the stellar parameters of LHS 3844. Further input parameters include the planetary surface gravity, semi-major axis, radius of the planet, radius of the star, temperature of the star, stellar surface gravity and stellar metallicity. The adopted numerical values for these parameters are displayed in Table 1.

Atmospheric and Surface Compositions Considered
For the atmospheric models of LHS 3844b we choose N 2 , O 2 and CO 2 as potential dominant atmospheric 2 According to our tests, an atmosphere with P surf < 10 −6 bar does not noticeably affect the surface temperature or planetary emission found in the model. 3 ftp://phoenix.astro.physik.uni-goettingen.de/HiResFITS/ species. From an empirical perspective, there are two Solar system rocky bodies with a thick atmosphere that have N 2 as the main constituent: Earth, which is of similar size to LHS 3844b, and Titan. There are a number of hypothetical pathways that may lead to substantial amounts of atmospheric N 2 (see discussion in Lammer et al. 2019). For instance, N 2 outgassing can potentially counteract nitrogen fixation by lightning, meteoritic impacts, high-energy UV or flares, leading to atmospheric N 2 buildup (Chameides & Walker 1981;Mikhail & Sverjensky 2014;Airapetian et al. 2016). Another theoretical prediction is an O 2 dominated atmosphere which is a potential product of an initially water-rich atmosphere that lost most of its hydrogen due to high irradiation from the host star (Wordsworth & Pierrehumbert 2014;Luger & Barnes 2015;Wordsworth et al. 2018). The high stellar irradiation places LHS 3844b close to the "cosmic shoreline" (Zahnle & Catling 2017), assuming an Earth-like bulk density. The idea here is that if this planet started with a thick envelope and a sufficient water abundance, it is expected to have undergone a Runaway Greenhouse effect at one point during its history, possibly ending up as a Venus-analogue (Kane et al. 2014).
Additional absorbers in our models include CO 2 , CO, SO 2 , CH 4 and H 2 O. They are all found in the atmospheres of Earth, Mars and Venus and they are predicted to be ubiquitous and major constituents in rocky exoplanet atmospheres as well, be it from mantle degassing (Schaefer & Fegley 2011;Schaefer et al. 2012;Liggins et al. 2021), meteoritic accretion (Lupu et al. 2014;Zahnle et al. 2020), photochemistry (Gao et al. 2015) or late-stage volcanism (Gaillard & Scaillet 2014). Additionally, all of these species exhibit strong absorption bands in the infrared, providing an opportunity to be detected with JWST.
Our atmospheric grid consists of the following gas combinations. On one side we explore oxygen-rich and oxidizing atmospheres: bulk O 2 with H 2 O, bulk O 2 with CO 2 , bulk O 2 with SO 2 , and a pure CO 2 scenario; and on the other side we explore nitrogen-rich and reducing atmospheres: bulk N 2 with CO, bulk N 2 with CH 4 , bulk N 2 with CO 2 , and a pure CO scenario.
For the surface contribution, six realistic surfaces compositions are tested: metal-rich (pyrite), feldspathic (97% Fe-plagioclase, 3% augite), ultramafic (60% olivine, 40% enstatite), basaltic (76% plagioclase, 8% augite, 6% enstatite, 5% glass, 1% olivine), granitoid (40% K-feldspar, 35% quartz, 20% plagioclase, 5% biotite), and iron oxidized (50% nanophase hematite, 50% basalt), which are ubiquitous in the Solar System and common products of geological processes (see Sect. 1  The albedo data is from Hu et al. (2012). and further discussion in Hu et al. 2012). We use the geometric albedo spectra from Hu et al. (2012) for these surfaces, pre-tabulated between 0.3 and 25 µm, which were obtained from experimental data taken on rock powders and extended by radiative transfer modeling. Although these albedo spectra are mineralogically realistic, spectra of the surface of Mars and the Moon differ from that of mineral mixtures in the laboratory due to various effects (e.g., Carli et al. 2015;Bishop et al. 2019). We extrapolate the tabulated albedo data to the whole wavelength range of the RT calculation by using the albedo value at 0.3 µm for all smaller and the value at 25 µm for all larger wavelengths. Lastly, we include a uniform grey surface with a geometric albedo of 0.1 as control to test the impact of the non-grey albedo variation of the realistic surfaces. The albedo spectra used in this work are shown in Figure 1

Simulating JWST observations
To simulate observations with JWST the python package Pandexo 5 ) is used. We focus on two JWST instruments that will be suitable and are planned to be utilized for rocky exoplanet characterization, MIRI LRS (Low Resolution Spectrometer) and NIRSpec G395H. Being on the longwave end of the JWST capabilities, the bandpass of MIRI LRS (5.02 -13.86 µm) maximizes the recorded secondary eclipse depth (planet-to-star contrast), since the emission of the hotter star falls off faster with wavelength than the planetary emission. Although the relative eclipse depth is smaller in the NIRSpec G395H range (2.87 -5.18 µm), a larger stellar flux at these wavelengths leads to a smaller overall photon noise. Additionally, many atmospheric species possess strong spectral signatures in the near-infrared range, making NIRSpec G395H one of the preferred choices for exoplanet observations. When simulating observations with Pandexo, we insert the model values from the HELIOS spectrum (consistent with noise-free data), set a given instrument, resolution, and number of secondary eclipse measurements, and add the simulated statistical noise as error bars. For all tests done over the entire instrument wavelength range, points are rebinned using Pandexo's integrated functionality which defines new wavelength bins of a constant resolution and uses the included "uniform tophat mean" function to calculate the new binaveraged function values together with the their uncertainty. In some cases, in addition to using the whole instrument range, we also focus on the detectability of individual spectral features by sampling over the width of these features, manually applying Pandexo's rebinning procedure. As for the systematic noise floor, we assume a value of 30 ppm based on tentative estimates (Matsuo et al. 2019;Schlawin et al. 2020Schlawin et al. , 2021. To test this assumption, the same analysis as shown in C7 has been conducted with the noise floor set to 0 ppm, which has led to indistinguishable results as our nominal setup. Therefore, we assume that the choice of noise floor has negligible impact on our results.

Bayesian Retrieval Modeling
We retrieve on the simulated JWST spectrum using a modified version of the Bayesian retrieval code PLATON (Zhang et al. 2018(Zhang et al. , 2020. Our version, first utilized in Ih & Kempton (2021), differs from the main branch of PLATON in that it allows for measuring abundances of multiple species during retrieval (in a so-called "free" retrieval). PLATON natively supports custom abundance profiles during forward models, but is configured to only perform equilibrium chemistry retrievals, in which the mixing ratios of all species are prescribed by the metallicity and C-to-O ratio defined globally, and temperature and pressure per layer. We relax this restriction by allowing the abundances of each species to be included as a retrieved parameters. All other details regarding radiative transfer remain the same as the original implementation.
In our free retrieval, the atmosphere is parameterized as follows. For the composition of the atmosphere, we assume two possible background gases that do not have a spectral signature (O 2 and N 2 ). We then assign a parameter corresponding to the vertically fixed logabundance of each species other than the background gases and a (linear) abundance for one of the background gases. The abundance of the other background gas is then derived as the remainder from unity. The thermal structure of the atmosphere is described by using the parametric T-P profile as given by Madhusudhan & Seager (2009). Having been developed with gas planets in focus, the retrieval code does not offer the possibility to include a solid surface. Still, we mimic the location of the surface in the retrieval modeling by the pressure level above which an isotherm is assumed, set by the parameter P 3 in the formalism of Madhusudhan & Seager (2009). Since a surface radiating as a blackbody is equivalent to an optically thick atmosphere of a single temperature, P 3 should in theory correspond to a limit on the surface pressure. Approximating the surface with a blackbody does not allow us to make any statements about a non-zero surface albedo and thus we do not retrieve on the surface but merely on the atmosphere in this work.
We perform our retrievals on the spectra generated using Pandexo by binning (resampling) the data down to a resolution of R = 100 and using uncertainties based on five simulated secondary eclipse observations. 3. RESULTS

LHS 3844b Spitzer eclipse depth constraint
In the following we first explore which of the tested surface types agree with the Spitzer eclipse depth measurement without considering an atmosphere (no atmosphere limit). Then, analogously, we find the atmospheric models (varying composition and thickness) consistent with the Spitzer measurement. Among those we determine the atmospheres with the largest spectral features that may allow for characterization with JWST (thick atmosphere limit).
Whether or not a model is in agreement with the Spitzer observation is determined by comparing the HELIOS-generated spectrum of the planet for each surface with the observed Spitzer 4.5 µm eclipse depth. The model is considered consistent with the data point if the simulated emission over the Spitzer bandpass, obtained by convolving the model spectrum with the Spitzer IRAC channel 2 bandpass function 6 , is within a 3 σ confidence interval from the observed value.

No atmosphere limit
In the no atmosphere limit we generate planetary emission spectra including the surface albedo signal only. Each model assumes that the whole planetary dayside is covered by one of the considered surface compositions. The secondary eclipse spectra of the surfaceonly models are shown in Figure 2. Based on this analysis, the Spitzer data point is most consistent with a metal-rich surface (1.39 σ from Spitzer measurement), followed by an iron oxidized (2.05 σ) and a basaltic (2.28 σ) surface. The ultramafic (3.61 σ), feldspathic (4.79 σ) and granitoid (4.80 σ) surfaces are excluded by the data based on the assumed confidence limit. Curiously, all explored models predict a significantly smaller eclipse depth than Spitzer with no single model prediction within 1 σ of the observation. First, this could be due to stochastic noise (although unlikely), but it could also indicate that the error of the Spitzer observation is underestimated due to an unknown systematic bias. As the reanalysis of the raw Spitzer phase curve of Kreidberg et al. (2019) is beyond the scope of this work, we limit ourselves to taking the Spitzer measurement at its face value. Second, it should be noted that not only is the albedo affected by grain size but even more the shortwave albedo can be lowered by small amounts of minor surface constituents, e.g., carbon, like on Mercury's surface (Izenberg et al. 2014), and metallic iron, produced by space weathering. Thus, by increasing the absorption of light in the shortwave and consequently boosting the thermal emission in the infrared, minor modeling assumptions can in theory affect the ordering of surfaces as well as the answer to whether they are consistent with the Spitzer data point. Testing these assumptions is again beyond the scope of this work.
Compared to the previous analysis of Kreidberg et al. (2019) we obtain consistently shallower eclipse depths for the same surface compositions and also find the ultramafic crust inconsistent with the observation, which is in contrast to the modeling result in Kreidberg et al. (2019). This discrepancy is discussed in Sect. 4.2.

Including an atmosphere
As bottom boundary our nominal atmosphere models use the metal-rich surface. Being most consistent with Spitzer, this choice maximizes the allowed atmospheric parameter space. Following the style of Fig Table 2. We find that the modeled eclipse depths result from a combination of three distinct effects: day-night heat transport, greenhouse warming and strength of gaseous absorption at 4.5 µm. Optically thick atmospheres transport more heat from the day to the night side than thinner atmospheres and consequently lead to a smaller dayside emission (Koll 2022). This mechanism translates to a decreasing eclipse depth with increasing surface pressure and also a decreasing eclipse depth with increasing absorber abundance. The second effect is the green- Figure 2. Secondary eclipse spectra for each of the tested surfaces and the grey albedo control case, shown at R = 100. The actual Spitzer 4.5 µm measurement is shown in black, where the y-axis error bar represents the uncertainty and the x-axis error bar represents the width of the bandpass. Modeled Spitzer points for each surface are shown in their respective color. We find the grey, metal-rich, iron oxidized and basaltic surfaces to be consistent (bold), i.e., < 3 σ from the observation, and the ultramafic, granitoid and feldspathic surfaces to be inconsistent with the Spitzer measurement. The granitoid and feldspathic mock Spitzer points are almost directly overlapping.
house effect. As all of the explored absorbers are greenhouse gases, the T-P profiles increase with pressure in optically thick atmospheric regions. (Note that the surface is smoothly connected to the first atmospheric layer via convective stability, i.e., temperature jumps at surface are not permitted.) If the planetary emission originates at or near the surface, increasing surface pressure consequently leads to an increased thermal emission and larger eclipse depth. Similarly, increasing the greenhouse gas abundance increases the strength of the greenhouse effect which warms the surface leading to a larger eclipse depth in spectral window regions (the total energy budget remains constant in radiative equilibrium). Finally, some species (CO 2 , CO) exhibit a strong absorption band around 4.5 µm whereas others do not (H 2 O, SO 2 ) or even exhibit an absorption window (CH 4 ) which affects the eclipse depth at that wavelength location. These combined effects lead to non-monotonic trends in the eclipse depth versus surface pressure or absorber abundance (as visible in Fig. 3) and also nonmonotonic trends in the maximum surface pressure consistent with Spitzer versus the absorber abundance (as visible in Table 2).
Overall, see Table 2, we find that some atmospheres with a surface pressure of 10 bar are consistent with the Spitzer measurement, but those possess a mixing ratio of only 1 ppm of a near-infrared absorber, which in the vast majority of cases has a marginal effect on the atmospheric extinction (1 ppm of H 2 O being the exception among our models). Once absorbing species with a mixing ratio of at least 100 ppm are included, the maximum surface pressure for any tested atmosphere is 1 bar, obtained for compositions with CH 4 , SO 2 or H 2 O. With CO 2 or CO as absorber, the maximum surface pressure allowed decreases to 0.1 bar.
The eclipse depths of the atmospheric models with O 2 and CO 2 deviate from the model predictions in Kreidberg et al. (2019) and also the overall maximum surface pressure is lower than the previously obtained maximum limit of 10 bar previously found. These differences are discussed in Sect. 4.2.
Lastly, exploring the impact of the surface on the allowed atmospheric parameter space, Figure C2 shows the analogous suite of atmospheric models as Fig. 3 but with a basaltic surface. As expected, the modeled eclipse depths for each atmospheric composition are generally more consistent with Spitzer for the metal-rich surface than the basaltic surface since the latter is overall more reflective and thus less consistent with the Spitzer observation.

Determining the thick atmosphere limit
In the following we pick for each gas pair (background and infrared absorber) the atmospheric model that provides the largest spectral features. Those models are then used for the JWST observability analysis presented in Sect. 3.2.3 and the retrieval analysis in Sect. 3.3. The eclipse spectra of the atmospheric models that are consistent with the Spitzer measurement are shown in Figure 4. Since the atmospheric optical thickness depends on both the amount of absorbers in the atmosphere and the overall atmospheric density, the size of absorption features is degenerate in those parameters. As CO 2 and CO are strongly absorbing in the Spitzer bandpass, the only atmospheric models not excluded by Spitzer are those that either have no significant amount of these absorbers or low surface pressures. Hence, model spectra including those two absorbers have only weak features. The models including CO 2 appear very similarly, independent of the background gas, as the only noticeable absorption features are due to CO 2 . Thus, we use only one of these compositions for further analysis, choosing the O 2 with 100 ppm CO 2 model. No model with CO exhibits visible features and thus no CO models are used for further analysis. In contrast, H 2 O, SO 2 and CH 4 have no absorption feature directly at 4.5µm which allows for larger atmospheric abundances of these species, still satisfying the Spitzer constraint. Among the corresponding models, we find that O 2 with 100 ppm SO 2 and N 2 with 1% CH 4 provide the largest spectral features. Specifically, SO 2 possesses a large double-peaked feature at 7 -9 µm and a smaller one at 4 µm and CH 4 has strong absorption bands at 2 -3 µm and from 4 to 8 µm. Lastly, the O 2 with 1 ppm H 2 O model provides the largest spectral feature among all models that include H 2 O. Interestingly, the striking feature at 5.5 -7.5 µm is not due to H 2 O alone, but also has a contribution from O 2 -O 2 CIA, which covers the small dip at 6.3 µm in the H 2 O opacity.

Observability with JWST
In the following, we first explore the observability of surfaces without an overlying atmosphere (no atmosphere limit). For that we use all six surface types shown in Fig. 1 including the three surfaces excluded by the Spitzer measurement for general applicability of our results (see Sect. 3.1.1). In contrast, the observability of atmospheres is assessed only for the atmospheric models that are consistent with the recorded Spitzer eclipse depth and that provide the largest signal for each background composition and gas absorber (thick atmosphere limit, see Sect. 3.1.3).

No Atmosphere limit
First, we test how many secondary eclipse observations would be necessary with JWST in order to distinguish each surface emission from a blackbody spectrum using the MIRI LRS or the NIRSpec G395H instrument. For the two models to be considered "distinguishable", a chi-square test is conducted and a p-value is obtained for a given number of eclipses, where the p-value represents the probability that the simulated data from the model to be tested are consistent with a reference model. We begin with one eclipse and increase in integer steps until the two models are distinguishable by 3 σ, given by a p-value of less than 0.0027. If the number of eclipses exceeds 30 we declare the models indistinguishable under realistic observation times. The blackbody spectrum is obtained by fitting the mock JWST spectrum using SciPy's "curve fit" function, with uncertainties on each data point at the given resolution obtained by Pandexo. The mock data used as input for the blackbody fitting . Predicted Spitzer 4.5 µm eclipse depth for the explored atmospheric models in combination with a metal-rich surface as a function of surface pressure compared to the Spitzer measurement (black horizontal line). Oxidizing/O2-dominated atmospheres are on the left and reducing/N2-dominated atmospheres on the right. The grey shaded area corresponds to 1 σ uncertainty and the pink shaded area to 3 σ uncertainty in the negative direction from the observed value. Table 2. Maximum surface pressure (bar) consistent with the Spitzer 4.5 µm eclipse depth at 3 σ, for each atmospheric composition modeled. Abundances listed in the left hand column correspond to abundance of the secondly-listed gas. The overall maximum surface pressure consistent with Spitzer is 10 bar among all set-ups and 1 bar once an infrared absorber at ≥ 100 ppm is included. The atmospheric models in bold possess the largest features and are used in the JWST observability analysis. If a value is listed as "No Soln.", then there was no solution which was consistent with Spitzer.

Maximum Pressure (bar) Consistent with Spitzer Eclipse Depth
are down-sampled to a resolution of 3, which we have found provides the most precise fit (not shown). The number of eclipses needed to distinguish each surface from a blackbody is listed in the first column and the first row of Table 3, depending on the JWST instrument used. In addition, the surface spectra for all plausible (i.e., consistent with Spitzer) surfaces together with the blackbody fit are displayed in Figure 5 and the inconsistent-with-Spitzer surface compositions are displayed in Figure 6. We find that overall NIR-Spec is more viable than MIRI for the purpose of distinguishing surfaces from a blackbody. Still, only the surfaces inconsistent with the Spitzer measurement are predicted to be distinguishable with less than 10 eclipses. That is because these surfaces exhibit more pronounced albedo features and higher albedos in general, making them more distinct from a blackbody spectrum. Specifically, we find that the ultramafic, granitoid and feldspathic surfaces should be distinguishable with 4, 4, and 6 eclipses, respectively. Perhaps surprisingly, the grey surface, although having no albedo features, differs from pure blackbody emission as well. This is a consequence of assuming a zero albedo for a blackbody compared to the 0.1 albedo in the grey model, creating a "tilt" between the two corresponding spectra. With MIRI, no single surface will be distinguishable from a blackbody with a feasible number of eclipses.
Additionally, we assess the number of eclipses needed to distinguish each surface from one another, again using MIRI or NIRSpec. After testing a variety of binning resolutions, we have found that the models are maximally distinguishable for a resolution of R = 3. It appears that a small number of data points are sufficient, since the surface features tend to be broad and the comparison appears to be driven by the overall eclipse depth rather than the shape of individual features. Table 3 shows the number of eclipses needed to distinguish surface pairs using an R = 3 binning. Secondary eclipse spectra for each pair of surface spectra are shown in Fig. C3. It can be seen that many of Figure 4. Secondary eclipse spectra for each atmosphere & metal-rich surface model at the maximum surface pressure consistent with the Spitzer observation. On each plot, the spectrum labelled "No Atmosphere" represents the surface-only spectrum for the metal-rich surface for comparison. The Spitzer 4.5 µm point is shown in black, where the y-axis error bar represents the uncertainty and the x-axis error bar represents the width of the bandpass. The atmospheric models with the largest features (bold) are used in the JWST observability analysis. No models are picked for the N2 with CO2 and the N2 with CO cases as the former is very similar to O2 with CO2 and the latter does not exhibit any noticeable features. the surfaces are more easily distinguishable from each other in the NIRSpec G395H band than in the MIRI LRS band. For example, the metal-rich and basaltic surfaces are indistinguishable in the MIRI LRS band, but are distinguishable with 6 secondary eclipses in the NIRSpec G395H band. Additionally, while the metalrich and iron oxidized surfaces are indistinguishable with MIRI LRS, they are distinguishable with 10 secondary eclipses using NIRSpec G395H. Overall, we find that less reflective materials are harder to disentangle due to their small features and eclipse depth variations than the more reflective ultramafic, granitoid and feldspathic surfaces.
In addition to utilizing the entire instrument wavelength range and constant binning for the pairwise distinguishability, we also conduct the same exploration focusing on the most prominent surface feature on ei-ther of the two surfaces and using optimized binning over the range of that feature. This is only done using the MIRI LRS instrument, since there are no prominent surface features within the NIRSpec G395H wavelength range (see Fig. C3 on the right). Using qualitative ranking, we choose three prominent features to be included in this analysis: The double-feature around 10 µm for the ultramafic, the double-feature at 8.5 µm for the granitoid, and the double-feature around 10 µm for the feldspathic surface. The grey, basaltic, metal-rich, and iron oxidized surfaces do not exhibit any sufficiently significant features for this analysis. Table 4 shows the number of eclipses needed to distinguish the ultramafic, granitoid and feldspathic from all other surfaces using the described method, while the corresponding pairs of surface spectra are shown in Figure C4. . Secondary eclipse spectra of the surface-only models, mock data points for observations with JWST MIRI LRS (left) and NIRSpec G395H (right) and the corresponding blackbody fits. Note that the blackbody models don't appear as smooth curves because the emitted flux is divided by the stellar spectrum model to obtain the eclipse depth. Only the surfaces consistent with the Spitzer 4.5 µm eclipse depth are shown here. The mock data points are rebinned to R = 10 with error bars corresponding to 5 eclipse observations. The original model spectra are downsampled from their native resolution to R = 100 for clarity.
Using this optimized binning over the most prominent surface feature, the number of eclipses needed to distinguish between three of the surface combinations was reduced, meaning that this method represented an improvement upon constant binning over the entire instrument wavelength range. More specifically, the number of eclipses was reduced from 6 to 3 for the granitoid and ultramafic surfaces, from 27 to 11 for the granitoid and feldspathic surfaces, and from 16 to 10 for the feldspathic and ultramafic surfaces. However, all other sur-face combinations tested using this optimized binning either remained unchanged or needed a greater number of eclipses than in the runs using the whole instrument range. As pointed out before, this result indicates that the constraining of surface compositions is largely driven by the Bond albedo of the surface (overall eclipse depth offset over the width of an instrument bandpass) rather than the spectral variation of features. Figure 6. Secondary eclipse spectra of the surface-only models, mock data points for observations with JWST MIRI LRS (left) and NIRSpec G395H (right) and the corresponding blackbody fits. Shown here are the surfaces inconsistent with the Spitzer 4.5 µm eclipse depth. The mock data points are rebinned to R = 10, with error bars corresponding to 5 eclipse observations. The model spectra are downsampled from their native resolution to R = 100 for clarity.  In the previous section we have fit the model spectra with a blackbody to determine whether surface features are detectable. However, the blackbody fit can also be used to determine the brightness temperature over a certain wavelength range. The brightness temperature in turn allows to infer the planetary Bond albedo, which is a useful quantity that can be used to place constraints on the presence and properties of an atmosphere, as elucidated in detail in Mansfield et al. (2019). However, as pointed out in that work, inferring the Bond albedo from MIRI observations is prone to be biased due to the non-grey shape of the surface albedo. In the following we re-enact their assessment of this issue for LHS 3844b using the non-grey radiative transfer code HELIOS and extend it to NIRSpec observations as well. First we assume that the brightness temperature is equivalent to the temperature of the blackbody fit for each surface. The inferred albedo is then calculated according to

Inferring planetary albedo and temperature
where α in is the inferred albedo, T bb is the temperature from the blackbody fit, T star is the temperature of the star, a is the semi-major axis of the planetary orbit, and R star is the radius of the star. We set the heat redistribution factor f to 2/3 as predicted for a "vanishingly thin" atmosphere. The wavelength-integrated emissivity is a priori unknown and thus set to 1 in our reference models, which is equivalent to assuming that the surface radiates as a blackbody. We also discuss the effects of assuming < 1 in Appendix B. We calculate the true Bond albedo for each surface by integrating the flux reflected from the planet over all wavelengths, and dividing by the total stellar flux received by the planet.
The upper two rows of Table 5 list the temperatures of best fit associated with each of these blackbody models based on the MIRI or NIRSpec spectra, while the third row of Table 5 lists the true surface temperatures obtained in the HELIOS models.
The Bond albedos and inferred albedos for each surface are listed in the bottom part of Table 5. Each surface has a separate inferred albedo for each JWST instrument, in contrast to the the Bond albedo which is a single quantity resulting from the interplay between the planetary surface and the stellar irradiation. Figure 7 displays the inferred brightness temperatures against the true model temperatures (left panel) and the inferred albedos against the Bond albedo (right panel) for each surface based on simulated MIRI observations. We find that the surface temperature of the planet is consistently underestimated by the blackbody fitting. This is because the fitting routine assumes an emissivity of 1 (equivalent to an albedo of 0), which assumes that the planet radiates heat more efficiently than it does in reality, finding a lower temperature for a given amount of thermal heat. The right panel shows that the majority of surfaces have an inferred albedo in the MIRI LRS band which is lower than the true Bond albedo. This can be explained by the non-grey albedo variation, discussed in detail in Mansfield et al. (2019). As a realistic, non-grey surface albedo exhibits strong variations in the near to mid-infrared, the planetary emission is muted in the albedo peaks, and amplified in the albedo troughs, in accordance with Kirchhoff's law of radiation. Since the majority of the surfaces tested here have a lower albedo within the MIRI bandpass than outside of it, emission within the bandpass is amplified, leading to an underestimation of the inferred albedo. In contrast, the albedo of the Fe-oxidized surface is determined accurately and moreover the metal-rich albedo is overestimated. Here, the wavelength variation effect is countered by another bias induced by the assumption of an emissivity of unity which leads to an overestimation of the albedo, best visible for the grey surfaces added as control cases (for more details see Appendix B). Lastly, another bias on the inferred albedo stems from the reflected stellar light being superimposed on the planetary thermal emission, mimicking a higher planetary emission and lower albedo. However, as the fraction of reflected light in the MIRI bandpass is minor we find that this effect is relatively small, see Fig. C5, left panel.
Interestingly, compared to the previous MIRI predictions in Mansfield et al. (2019) our new calculation predicts a consistently smaller observational bias for the albedo. After comparing the current and previous methodologies, we have found that the choice of stellar spectrum plays a significant role for the calculation of the inferred albedo. In the present study we use a PHOENIX stellar spectrum in all steps of the inferred albedo calculation. In Mansfield et al. (2019), the planetary temperature was determined using a PHOENIX stellar spectrum, but when inferring the brightness temperature from the secondary eclipse depth a blackbody was assumed for the host star. This seemingly minor inconsistency appears to be responsible for the stark difference between the current and previously obtained albedo values.
With the NIRSpec instrument the albedo can be inferred more accurately than with MIRI, even though the temperature of the blackbody fit is similarly underestimated, see Fig. 8. The albedo is in general, just as with MIRI, somewhat underestimated. However, here the main reason for the bias is the scattered light, the amount of which in the NIRSpec bandpass is signifi-cant for the more reflective surfaces. If the scattered light is removed from the analysis, the inferred albedo values become higher and predominantly overestimated, see Fig. C5, right panel. The emissivity effect, a major source of inference bias when using MIRI, is only significant for the higher-albedo surface in the case of NIRSpec, as is further explained in Appendix B.

Thick Atmosphere limit
In the following we explore whether the atmospheres in the thick limit (recall Sect. 3.1.3) are detectable with JWST, i.e., if they can be distinguished from a pure surface spectrum. First, we utilize the whole instrument bandpasses of MIRI LRS and NIRSpec G395H and choose the resolution R = 10 as this represents a good compromise between statistical noise and sampling of the spectral features. The number of secondary eclipse observations needed to detect the atmosphere in each tested scenario is listed in Table 6, top rows. We find that all of the atmospheres are detectable with ≤ 9 eclipse observations if using the preferred instrument, with NIRSpec being preferable at finding CO 2 and CH 4 , and MIRI preferable at finding SO 2 and H 2 O. The detections with the lowest time requirement are O 2 |SO 2(100ppm) with MIRI and N 2 |CH 4(1%) with NIRSpec, both requiring 3 eclipse observations. The O 2 |H 2 O (1ppm) atmosphere is detectable with MIRI and 7 eclipses. As a worst case observing scenario, with MIRI the O 2 |CO 2(100ppm) atmosphere is indistinguishable from a surface. The spectra of the atmospheric models and the associated surface-only spectra can be seen in Figure C7.
Similar as in the surface observability analysis, we conduct another examination focusing on individual atmospheric absorption features or bands in isolation instead of taking the whole instrument range. Specifically, for SO 2 we test the feature at 4 µm and the double-peaked band around 8 µm, for CH 4 we test the feature at 3.5 µm and the band around 7 µm and for CO 2 we test the feature at 4.3 µm. There are no discernible features within the MIRI instrument range for the atmosphere with CO 2 and, analogously, no discernible features withing the NIRSpec range for the atmosphere with H 2 O. The resolution and binning of each spectrum is chosen to ensure that there are two to three data points sampling the feature being probed. The number of secondary eclipses needed to distinguish each atmospheric feature from the surface-only spectrum is listed in Table 6, bottom rows. The resulting atmospheric model spectra and their no-atmosphere counterparts are shown in Figure 9, with error bars representing the uncertainty after 5 secondary eclipse measurements. We find that isolating the gaseous features increases the detectability significantly to the extent that all explored atmospheres can be distinguished from a surface with at most 3 eclipses if using the preferred instrument for each case. Most promisingly, the CH 4 and SO 2 cases can be detected with only a single eclipse using NIRSpec or MIRI, respectively. The O 2 |H 2 O (1ppm) atmosphere is detectable with MIRI and 2 eclipses and the O 2 |CO 2(100ppm) atmosphere with NIRSpec and 3 eclipses.

Atmospheric Retrieval
In addition to the chi-square tests presented in the last section, we further run atmospheric retrieval models in order to determine how JWST observations may help constrain the atmospheric composition of LHS 3844b using a Bayesian framework. To this end, we use the simulated JWST data from both NIRSpec G395H and MIRI LRS instruments with errors equivalent to 5 eclipses as input for the retrieval modeling. Note that, as described in Sect. 2.5, our code generating the forward models, HELIOS, adds the non-grey surface to the spectrum, but the retrieval code PLATON does not include the surface and only assumes an isotherm extending to the infinite, which should mimic retrieving the pressure where the surface is expected. Hence this analysis further allows us to explore the impact of the realistic surface albedo on the results of the retrieval modeling.
Instead of immediately using the forward models that include a non-grey surface, we first test the retrieval performance on pure atmosphere models, i.e., models with the atmosphere and a zero albedo surface. This test is indicative of whether the atmospheric signal alone is sufficient to make any statements about the gas inventory of the planetary envelope. Secondly, we then use the same atmospheric models and include the albedo signal of various surfaces. We include the crusts that are consistent with the Spitzer measurement, basaltic, metal-rich and iron oxidized with the addition of ultramafic in order to have an example of a more reflective surface. Finally, the last set of retrievals uses the surface spectrum only in order to explore whether the surface albedo variation itself can be interpreted as a false atmospheric signal by the retrieval algorithm in the case of the planet having no atmosphere at all.
The results of the atmospheric retrievals with and without surface contribution are listed in Table 7 and the surface-only retrievals in Table 8. Striking results that are discussed separately and presented in Figures 10 and  11 are highlighted in magenta. Note that the species O 2 and N 2 , not explicitly listed in the tables, are treated as background gases in the retrieval and their mixing ratio always adds up to unity with the other absorbers.  In the atmosphere-only set-up the retrieval model manages to recover the absorbing species in each case it is present (see Fig. 10, top row). Curiously, only the true H 2 O mixing ratio of 1 ppm lies within the 1 σ region of the retrieved posterior, log f H2O = −6.67 +3.38 −2.80 . The mixing ratios of the other absorbers, CO 2 , SO 2 and CH 4 are somewhat overestimated by the retrieval algorithm compared to their true values. This is likely a consequence of a degeneracy between the mixing ratio of an absorber and the parameterized temperature profile used in the retrieval. Specifically, what may be happening is that the retrieval code sets the photosphere predominantly at too low pressures (and gas densities) so that a higher mixing ratio of an absorber is needed to provide the optical depth that is necessary for a given spectroscopic signal. This photosphere mismatch may in turn be a consequence of HELIOS and PLATON having radiative transfer solvers and ingredients that are not identical, i.e., they use different opacity line lists and scattering prescriptions, which could result in the calculation of slightly different optical depths for the same input conditions. Lastly, just as the retrieval algorithm succeeds at recovering present species, it also correctly disfavors absent species, as the retrieved mean mixing ratios of the latter do not exceed ∼ 0.1 ppm in any of the  tested cases (see Fig. C8 for a full grid of atmosphereonly model posteriors). We find that the addition of a realistic surface crust generally appears not to impede our ability to detect absorbing species that are present in the atmosphere and sufficiently visible in the planetary spectrum. In all setups the posterior distributions of absorbers that are present are consistent between the cases with and without a non-gray surface signal. However, we find a tendency that a non-grey surface makes it harder for the retrieval code to reject species that are absent in the forward model. This phenomenon is best visible in the case of the O 2 |H 2 O (1ppm) atmosphere (see Fig. 11). Whereas the posteriors of the atmosphere-only setup appear wellbehaved, with a clear upper limit on the mixing ratio (see CO 2 , SO 2 and CH 4 panels of Fig. 11), the posterior distributions in the case with the ultramafic surface remain broad and undetermined. Technically, it appears that the retrieval algorithm tries to match the surface features in the planetary spectrum with combinations of small amounts of atmospheric absorbers leading to many degenerate solutions. For instance, for SO 2 higher mixing ratios of 1 ppm are strongly disfavored in the no-surface case, (log f SO2 = −9.17 +2.54 −1.68 ), but once the ultramafic signal is added (log f SO2 = −5.75 +3.26 −3.95 ) even mixing ratios of ∼ 1 % cannot be ruled out.
This last finding is further supported by our results from retrieving on the surface-only models. In general we find that the surface signatures imprinted on the planetary spectrum mislead the retrieval algorithm to falsely allow for the presence of atmospheric gases, with possible mixing ratios given by broad posterior distributions with a mean around ∼ 1 ppm. The extreme case is H 2 O for which the retrieved mixing ratios are higher with a mean around ∼ 100 ppm in the cases of the basaltic, fe-oxidized and ultramafic crusts (see Fig. 10, bottom row). Also, in these cases the posterior distributions are more clearly defined and could misleadingly point to a weak H 2 O detection. Figure C9 shows the full grid of surface-only model posteriors.  Table 7. Logarithm of input and retrieved gas volume mixing ratios for the atmospheres in the "thick" limit (see Sect. 3.1.3) without and with realistic surface crust."N Inc" means that the species is not included in the forward model. Individual results discussed in Sect. 3.3 and presented in Figures 10 and 11 are highlighted in magenta. Note that O2 and N2, treated as background gases in the retrieval modeling, are not directly retrieved and thus their mixing ratios are not listed here. Figure 9. Secondary eclipse spectra of atmospheric models vs. surface-only models over the range of the strongest absorption bands in the respective models, overlaid with MIRI LRS (left) and NIRSpec G395H (right) simulated data points. The metalrich surface is used for this test. Error bars show uncertainties for 5 eclipse observations. Model spectra are downsampled from the native resolution to R = 100 for clarity.  Table 8. Logarithm of retrieved gas volume mixing ratios for surface models without including an atmosphere. Individual results discussed in Sect. 3.3 and presented in Figure 10 are highlighted in magenta. Note that O2 and N2, treated as background gases in the retrieval modeling, are not directly retrieved and thus their mixing ratios are not listed here. Surface only models Figure 10. Top: Posterior distributions of the retrieved volume mixing ratios of CO2, SO2, H2O and CH4 using atmospheric forward models without a non-grey surface. The retrieved mean and 1-sigma width is given on top of each panel and the input mixing ratio (the "true" value) is shown as red, vertical line. The atmospheric composition used in the forward model is listed below the horizontal axis label. All atmospheric species are recovered in the retrieval, although the retrieved mixing ratio tends to be somewhat overpredicted for CO2, SO2 and CH4. Bottom: Posterior distributions of the retrieved H2O volume mixing ratios from the surface-only (no atmosphere) forward models. The surface crust used in the forward model is listed below the horizontal axis label. The basaltic, fe-oxidized and ultramafic surfaces lead to a relatively pronounced posterior distribution that may erroneously be interpreted as a weak detection of atmospheric H2O. Ultramafic surface, O 2 |H 2 O (1ppm) Figure 11. Posterior distributions of volume mixing ratios using the O2 with 1 ppm H2O atmosphere as input for the retrieval modeling, once without a non-grey surface (top) and once with an ultramafic surface (bottom). The retrieved mean and 1-sigma width is given on top of each panel and the input mixing ratio (the "true" value) is shown as red, vertical line. Compared to the atmosphere-only case, adding the ultramafic signal leads to "washed-out" posteriors making it harder to place limits on the mixing ratios. Note that O2 and N2 are treated as background gases in the retrieval modeling and are not directly retrieved.

New Constraints on the Surface and Atmosphere of LHS 3844b and their Observability
In this study we have explored the feasibility of characterizing the atmosphere and surface of the rocky super-Earth LHS 3844b with JWST. To find the parameter space of atmospheres and surface types that are plausible for LHS 3844b, we have modeled the planetary emission of LHS 3844b, including the spectral signal of both atmosphere and surface, and exhaustively explored all scenarios that are consistent with the existing Spitzer 4.5 µm measurement of Kreidberg et al. (2019). For the surface we have assumed six crust compositions that are common in the solar system and found that the surfaces that are consistent with Spitzer are metal-rich, iron oxidized and basaltic crusts. In contrast, inconsistent with the data are ultramafic, granitoid and feldspathic surfaces, whose high albedos lead to a planetary thermal emission too far below the recorded one. Our atmospheric models consist of O 2 , N 2 , CO 2 and CO dominated atmospheres with H 2 O, CO 2 , CO, CH 4 and SO 2 as additionally included near-infrared absorbers of various mixing ratios. We have found that in order to be consistent with Spitzer the maximum surface pressure is 10 −1 bars for the models including CO 2 or CO and 1 bar for the models including H 2 O, CH 4 or SO 2 if the mixing ratio of the included infrared absorber ≥ 100 ppm.
Next, we have conducted a JWST observability analysis exploring two limits; on one end we have assumed there is no atmosphere and on the other end we have investigated the atmospheric scenarios that provide the largest gas features while being consistent with the Spitzer data. In the no-atmosphere limit our analysis predicts that with the MIRI LRS instrument it will be very difficult to disentangle specific surface types from a blackbody null-hypothesis. However, with NIRSpec G395H the more reflective surfaces (albeit implausible given the Spitzer constraint) could be differentiated from a blackbody with ≤ 6 eclipse observations. Making use of not only the feature strength but also the total offset of the surface emission (or reflection) over an instrument bandpass, we have found that among the plausible surfaces the basaltic and metal-rich surfaces are distinguishable with 6 eclipse measurements and the metal-rich and iron oxidized surfaces are distinguishable with 10 eclipse measurements using NIRSpec. The more reflective surfaces disfavored by Spitzer, the granitoid, feldspathic and ultramafic crusts, would be distinguishable from most of the other surfaces with 1 -4 eclipses using NIRSpec.
Exploring the limit with an atmosphere, we have found that each atmospheric model is distinguishable from a surface-only spectrum in under 3 eclipse observations if the better-suited JWST instrument, MIRI or NIRSpec, is used in each case. The most amenable cases are the atmospheres with SO 2 and CH 4 , each requiring only a single eclipse to be constrained using MIRI or NIRSpec, respectively.
Lastly, we have run a suite of atmospheric retrieval models to determine how JWST may help constrain the atmospheric composition of LHS 3844b with a Bayesian framework. We have also explored whether surface albedo variations could bias the retrieval algorithm or even be mistaken for atmospheric signatures. Using uncertainties based on 5 eclipse observations, we have been able to detect all four of the included gas absorbers, albeit three of the retrieved mixing ratios are somewhat overpredicted. Furthermore, we have found that the non-grey surface has negligible effect on an atmospheric species' retrieved mixing ratio if the species' spectral signature is sufficiently large. However, the surface contamination of the atmospheric retrieval may lead to a false, weak detection of atmospheric H 2 O and may also make it harder to disfavor the presence of gas species. LHS 3844b will be observed during 3 secondary eclipses with JWST MIRI/LRS as part of Cycle 1 of the General Observer program. First, these observations will extend our knowledge about the presence and thickness of an atmosphere and also verify the existing Spitzer data point, testing the measured eclipse depth at 4.5 µm and its derived error. Furthermore, according to our predictions these MIRI observations will indicate whether the planet's surface is granitoid in nature, but the allocated time will not suffice to place any constrains on the other potential surface crusts. Also, if present, any significant amounts of H 2 O, SO 2 or CH 4 should be detected, provided the planetary atmosphere is thick enough. Kreidberg et al. (2019) and

Comparison to
Implications for Surface and Atmosphere Our current study expands the LHS 3844b phase curve analysis of Kreidberg et al. (2019) (herein after K19) on multiple fronts. First, we have extended the number of tested surface types from ultramafic, feldspathic, basaltic and granitoid to further include a metal-rich and an iron oxidized surface. Additionally, while K19 assumed atmospheres made up of O 2 , N 2 and CO 2 , we have tested a wider range of atmospheres by including CO, CH 4 , SO 2 and H 2 O as absorbers. Finally, our modeling of LHS 3844b is done using a radiative-convective two-stream radiative transfer code that additionally ac-counts for energy balance at the planet's solid surface. The analysis of LHS 3844b in K19 was performed against atmosphere models using the techniques described in Morley et al. (2017), which utilize a simplified temperature prescription and do not natively account for a surface.
While our results are in broad agreement with the LHS 3844b phase curve analysis of K19, we also find differences between our and their model predictions that are worth discussion. First, the wavelength-varying secondary eclipse depths and consequently the modeled Spitzer 4.5 µm eclipse depths of the bare-rock models in our work is significantly lower than predicted by K19 for the same conditions. Second, analogously to the bare-rock models, the eclipse depths of the models with (optically) thin atmospheres are also significantly lower than found by K19 for the same conditions. In contrast, the optically thick atmospheres provide eclipse depths that are larger than in K19. For example, K19's calculated eclipse depths of the O 2 + 1 ppm CO 2 atmospheres are larger than ours for all surface pressures ≤ 10 bar. However, K19's models with 100% CO 2 provide a shallower eclipse depth throughout almost all surface pressures (apart from 1 bar) compared to ours.
We believe the discrepancy in the atmosphere models can at least partially be attributed to the sophistication of the atmospheric radiative transfer modeling as we utilize self-consistent radiative-convective equilibrium models whereas they relied on parameterized temperature profiles which do not take the non-grey radiative feedback of gas absorbers into account. Yet, the modeling treatment of the atmosphere can only explain the cases with optically thick atmospheres, as otherwise the atmosphere has a marginal impact on the planetary spectrum. The discrepancy in the optically thin atmosphere and no-atmosphere models we believe is due to the treatment of the host star radiation. In K19, they scaled the stellar spectrum model to match the measured stellar flux density over the Spitzer 4.5 µm bandpass. However, this absolute flux measurement is not consistent with the parameters of LHS 3844 derived from SED fitting (Vanderspek et al. 2019;Kreidberg et al. 2019). In the present work, we have chosen to follow the literature and use a PHOENIX stellar spectrum for the parameters given in K19 without any additional spectrum scaling, as it remains unclear whether the eclipse modeling should be guided by the single-band photometric measurement (that could be prone to an unknown systematic error) or the parameters derived from the overall stellar fit. Ultimately, if multiple flux measurements of the star cannot be reconciled, the correct modeling procedure remains ambiguous. This highlights the need for accurate stellar flux measurements as any uncertainty in the treatment of the star directly affects the secondary eclipse depth.
A shallower eclipse depth prediction for the bare-rock models compared to K19 directly translates to tighter constraints on the type of surfaces that are possible for this planet when taking the Spitzer data point at face value. We recover their result that a basaltic surface is consistent with Spitzer at 3 σ, but our ultramafic model is outside of this confidence interval in contrast to their modeling.
In terms of atmospheric thickness, our new modeling sets the top limit on the surface pressure at ∼ 1 bar down from the previous limit of 10 bar, if an infrared absorbing gas is included at ≥ 100 ppm. Their best-fit models (< 1 σ deviation from observation) that include a nonmarginal amount of near-infrared absorber at ≥ 10 ppm require a thin atmosphere with a surface pressure < 0.1 bar. Interestingly, our models that fall within 1 σ allow atmospheres with up to 1 bar surface pressure if CH 4 is included. That is because CH 4 is not only a strong greenhouse gas, thus warming the deep atmosphere and surface, but also CH 4 is weakly absorbing in a spectral window around 4.5 µm, enabling a higher thermal emission from the deep atmosphere in the Spitzer bandpass.

General Implications for the Characterization of Rocky Exoplanets
Expanding the study of exoplanet atmospheres into the terrestrial planet regime is a major goal in the JWST era. Yet such planets will still be challenging atmospheric characterization targets, even with the improved observing capabilities of JWST. It is therefore necessary to establish robust strategies for extracting meaningful constraints on terrestrial planet atmospheres. Among such targets, hot rocky planets orbiting bright stars, such as LHS 3844b present the best opportunities for detailed characterization.
A zeroth order question to answer for rocky exoplanet targets is whether they possess atmospheres at all, and if so, how thick said atmospheres are Kreidberg et al. 2019). The next natural follow-on question is to establish the composition of the planet's atmosphere and surface. In this work, we have demonstrated how to go about addressing questions of surface and atmosphere composition for the thick (or thin) and no atmosphere case, as applied to the most amenable rocky target for thermal emission characterization, LHS 3844b. We have quantified the amount of observing time required and the limitations to which types of surfaces and atmospheres can be uniquely distinguished. We have also identified shortcomings in our current re-trieval capabilities for measuring the properties of terrestrial planet atmospheres. The approach that we have outlined here can be applied, in principle, to any rocky planet target that presents sufficiently high signal-tonoise thermal emission.
One particularly subtle challenge that we have discussed at length in this work is that of measuring a planet's albedo and its surface temperature. When interpreting the measured planetary emission one should be aware that due to the less-than-unity emissivity of the realistic surfaces, the inferred brightness temperature is substantially lower than the true surface temperature for both NIRSpec and MIRI observations. However, while inferring the planetary albedo (using the techniques described in this paper and in Mansfield et al. 2019) with NIRSpec is relatively accurate, using MIRI the inferred albedo is somewhat underestimated for the higher-albedo surfaces and overestimated for the loweralbedo surfaces.
The Kreidberg et al. (2019) Spitzer phase curve observations and our interpretation in this paper have already provided a phenomenal degree of insight into the properties of LHS 3844b. If the planet possesses an atmosphere at all, it is thinner than the Earth's. Yet this conclusion is based only on photometric measurements in a single bandpass. JWST will, for the first time, open the door to spectroscopic characterization of LHS 3844b as well as a wider range of rocky planets, ultimately providing much deeper insight into the fundamental properties of terrestrial planet surfaces and atmospheres. We look forward to the era of rocky planet characterization that is about to begin. For this work we have updated the numerical forward stepping algorithm of HELIOS that iterates towards radiativeconvective equilibrium and, in contrast to the previous implementation where the surface temperature was calculated directly from the energy balance across the surface boundary, see Eq. (4) in Malik et al. (2019a), we have also included the surface temperature in the numerical iteration.
The original expression for the temperature iteration, see Eq. (24) in Malik et al. (2017), required the knowledge of local atmospheric properties, such as the density and heat capacity. Furthermore, the radiative timescale was used, see Eq. (27) in Malik et al. (2017), to make the forward stepping independent of local atmospheric inertia, allowing for a faster and more stable convergence towards equilibrium. However, since the steady-state radiative equilibrium merely requires a vanishing net flux divergence across each atmospheric layer, we have found the inclusion and calculation of such a large number of atmospheric properties not necessary. That is why we have simplified the original formalism to the following expressions. Note that we denote the wavelength-integrated "bolometric" flux with F ([F] = erg s −1 cm −2 ) and the spectral flux with F ([F ] = erg s −1 cm −3 ). The change in temperature of atmospheric layer i ,∆T i , between successive iteration steps is calculated as where P i is the pressure in the center of layer i, ∆P i is the pressure difference across layer i, ∆F net,i is the net flux difference across layer i and f pre,i is a dimensional prefactor given by The goal of this prefactor is first to make the iteration more stable by dampening the impact of ∆F net,i , thus making the temperature iteration smoother between neighboring layers. Second, f pre,i is used to adapt the temperature step during the iteration. If the temperature is close to equilibrium, f pre,i becomes increasingly smaller. This prevents numerical oscillations around the equilibrium temperature without ever reaching it (details on the exact evolution of f pre,i are given in Malik et al. 2017). Analogous to the atmospheric layer temperatures, the surface temperature starts from the planetary dayside-averaged temperature at the beginning of the run and then advances with each iteration step as with where F intern is the internal heat flux. The index "0" denotes the first atmospheric layer or interface from the bottom. The prefactor f pre,surf is calculated with Eq. (A2), but using ∆F net,surf in the denominator. The net flux at the surface boundary is given by F net,0 = F ↑,0 − F ↓,0 . One caveat of including the surface in the numerical iteration of the atmosphere is that the surface and the first atmospheric layer sometimes become "stuck", oscillating back and forth between iterations, prone to happen when the near-surface atmosphere is optically thick. We employ a numerical trick to solve that issue. As long as the first atmospheric layer is not in radiative equilibrium, we use F net,1 in place of F net,0 in Eq. (A4), which essentially includes the first atmospheric layer in the energy balance of the surface. Once the first atmospheric layer has converged, we switch back to the correct energy balance expression for the surface, as given by Eq. (A4). Finally, the wavelength-varying surface albedo is included in the upward flux from the surface. Namely, the upward pointing spectral flux at the surface is calculated as F ↑,0,λ = A surf,λ F ↓,0,λ + (1 − A surf,λ )πB λ (T surf ), where A surf,λ is the wavelength-dependent surface albedo, B λ is the Planck function and T surf is the surface temperature.

B. ROLE OF THE EMISSIVITY ON THE INFERRED ALBEDO
The effect of assuming different emissivity values on the inferred albedo is shown in Fig. C6. When fitting a blackbody curve to a spectrum, the total area under the curve integrated from zero to infinity, analogous to the total energy emitted, is not conserved. This discrepancy is exacerbated if the assumed emissivity of the blackbody model significantly differs from the one of the realistic surface spectrum. Moreover, the farther the wavelength region for the fitting is separated from the peak of the planetary thermal emission, the larger is the discrepancy in the areas below the two curves. In the MIRI case, if the blackbody fit assumes an emissivity higher than the one used in the physical model the total energy of the blackbody emission is smaller than in the physical model, resulting in a too low blackbody temperature and consequently a too high inferred albedo. This is best visible for the three grey albedo models with α = 0, α = 0.1 and α = 0.3. When the fit for the grey albedo models is conducted with the correct emissivity the inferred albedo matches the real bond albedo.
When using NIRSpec, this emissivity effect is somewhat smaller and also non-monotonic. This is because in this case the instrument bandpass overlaps with the peak planetary emission making the areas between the blackbody curve and the planetary spectrum more similar. Furthermore, there appears to be a dependence on individual albedo variation. The surfaces showing an inverse trend have an overall higher albedo and a more strongly varying albedo within the NIRSpec bandpass.
C. ADDITIONAL FIGURES Figure C1. Temperature-pressure (T-P) profiles for oxidizing/O2-dominated atmosphere models (left) and reducing/N2dominated atmosphere models (right) assuming a surface pressure of 10 −3 bars (top) and 10 bars (bottom). Figure C2. Predicted Spitzer 4.5 µm eclipse depth for the explored atmospheric models in combination with a basaltic surface as a function of surface pressure compared to the Spitzer measurement (black horizontal line). Oxidizing/O2-dominated atmospheres are on the left and reducing/N2-dominated atmospheres on the right. The grey shaded area corresponds to 1 σ uncertainty and the pink shaded area to 3 σ uncertainty in the negative direction from the observed value. Figure C3. Secondary eclipse spectra of surface-only model pairs over the full range of the NIRSpec G395H (top) and MIRI LRS (bottom) bandpasses with overlaid mock data points at R = 3. The error bars correspond to 5 secondary eclipse observations. The model spectra are downsampled to R = 100 for clarity. Figure C4. Secondary eclipse spectra of surface-only model pairs over the range of selected surface features within the MIRI LRS bandpass with overlaid mock data points. Each case uses a resolution (shown in top left corner) to optimally match the spectral features. The error bars correspond to 5 secondary eclipse observations. The model spectra are downsampled to R = 100 for clarity. Figure C5. Left: Bond Albedo vs. Inferred Albedo for simulated MIRI LRS data at R = 10. The black dotted line represents the location where the inferred and Bond albedos are equal. The inferred albedo is determined using once the full spectrum (i.e., both planetary emission and reflected light contributions) and once the emitted radiation only. The nominal case uses the full spectrum. Right: Analogous to the left panel but using simulated NIRSpec G395H data at R = 10. Figure C6. Left: Bond Albedo vs. Inferred Albedo for simulated MIRI LRS data at R = 10. The black dotted line represents the location where the inferred and Bond albedos are equal. Three values of emissivity, 1.0, 0.9, and 0.7, are assumed for the planetary emission when inferring the temperature and the albedo. The nominal value used is 1.0. Right: Analogous to the left panel but using simulated NIRSpec G395H data at R = 10. Figure C7. Secondary eclipse spectra of atmospheric models vs. surface-only models, overlaid with MIRI LRS (left) and NIRSpec G395H (right) simulated data points at R = 10 over the whole instrument bandpass. The metal-rich surface is used for this test. Error bars show uncertainties for 5 eclipse observations. Model spectra are downsampled from the native resolution to R = 100 for clarity.
Atmosphere only models Ultramafic Figure C9. Same as Fig. C8 but showing here the retrieval results using the surface-only (no atmosphere) forward models with different surface crusts.