Information content of JWST-NIRSPEC transmission spectra of warm Neptunes

Warm Neptunes offer a rich opportunity for understanding exo-atmospheric chemistry. With the upcoming James Webb Space Telescope (JWST), there is a need to elucidate the balance between investments in telescope time versus scientific yield. We use the supervised machine learning method of the random forest to perform an information content analysis on a 11-parameter model of transmission spectra from the various NIRSpec modes. The three bluest medium-resolution NIRSpec modes (0.7 - 1.27 microns, 0.97 - 1.84 microns, 1.66 - 3.07 microns) are insensitive to the presence of CO. The reddest medium-resolution mode (2.87 - 5.10 microns) is sensitive to all of the molecules assumed in our model: CO, CO2, CH4, C2H2, H2O, HCN and NH3. It competes effectively with the three bluest modes on the information encoded on cloud abundance and particle size. It is also competitive with the low-resolution prism mode (0.6 - 5.3 microns) on the inference of every parameter except for the temperature and ammonia abundance. We recommend astronomers to use the reddest medium-resolution NIRSpec mode for studying the atmospheric chemistry of 800-1200 K warm Neptunes; its corresponding high-resolution counterpart offers diminishing returns. We compare our findings to previous JWST information content analyses that favor the blue orders, and suggest that the reliance on chemical equilibrium could lead to biased outcomes if this assumption does not apply. A simple, pressure-independent diagnostic for identifying chemical disequilibrium is proposed based on measuring the abundances of H2O, CO and CO2.


INTRODUCTION
With the much anticipated launch of the James Webb Space Telescope (JWST) in 2021 (and Cycle 1 proposals due in 2020), the exoplanet community is studying the balance between investments of telescope time and scientific yield (Beichman et al. 2014; Barstow et al. 2015;Greene et al. 2016;Howe et al. 2017;Batalha & Line 2017). Both the Guaranteed Time Observations as well as the Early Release Science programs are designed to gain an understanding of systematics and data reduction strategies (Stevenson et al. 2016;Bean et al. 2018;Kilpatrick et al. 2018) and will provide the first opportunities to obtain JWST transit spectroscopy data over a wide range of infrared wavelengths for many of the bestknown transiting exoplanets.

Motivation I: anticipated chemical diversity of warm Neptunes
One of the unexpected outcomes of the Kepler mission is that ∼ 1000 K sub-Neptune-to Neptune-sized exoplanets on short-period orbits are common (e.g., Petigura et al. 2013;Crossfield et al. 2016), which we will collectively term "warm Neptunes" in the current study. Their bulk densities indicate the presence of a hydrogen-and/or helium-dominated atmosphere. The Transiting Exoplanet Survey Satellite (TESS) is discovering warm Neptunes orbiting bright stars (e.g., Dragomir et al. 2019;Esposito et al. 2019;Quinn et al. 2019;Trifonov et al. 2019). With no example in our Solar System, a deeper understanding of the properties of warm Neptunes is expected to shed light on exoplanet formation processes. The complete chemical inventory of their atmospheres is currently unknown and it is expected that JWST spectra will allow the exoplanet community to make significant progress on this question.
Across a temperature range of 800-1200 K, warm Nep-tunes are theoretically predicted to exhibit remarkable chemical diversity with water (H 2 O), methane (CH 4 ), carbon dioxide (CO 2 ) and carbon monoxide (CO) having a wide range of volume mixing ratios as the elemental abundance of carbon (C/H) and the carbon-to-oxygen ratio (C/O) vary (Moses et al. 2013). At ∼ 1000 K, equilibrium chemistry predicts a transition from CH 4 -to CO-dominated atmospheres toward higher temperatures (e.g., Moses et al. 2011;Madhusudhan 2012;Venot et al. 2014;Heng & Tsai 2016). However, 800-1200 K is also the temperature range where the assumption of chemical equilibrium breaks down, because the chemical and dynamical timescales become comparable and photochemistry may not be negated by high temperatures. For example, Madhusudhan & Seager (2011) find tentative evidence for the over-abundance of CO (compared to expectations from chemical equilibrium) in the warm Neptune GJ 436b, which has an equilibrium temperature of about 649±60 K (Torres et al. 2008); see also (Morley et al. 2017). In our own Jupiter, the over-abundance of CO was interpreted as a sign of disequilibrium chemistry due to atmospheric mixing (Prinn & Barshay 1977). Similarly, Oppenheimer et al. (1998) detected an excess of CO in the brown dwarf Gliese 229B. For all of these reasons, warm Neptunes with atmospheric temperatures in the range of 800-1200 K are the next frontier in understanding atmospheric chemistry from transmission spectroscopy.

Motivation II: accuracy of constraining elemental abundances and C/O
The key controlling parameters of atmospheric chemistry are the set of elemental abundances (mainly C/H, O/H, N/H) and C/O (e.g., Burrows & Sharp 1999;Madhusudhan 2012;Heng & Tsai 2016). Atmospheric mixing and photolysis act to complicate the translation between the elemental and molecular abundances (e.g., Moses et al. 2011;Tsai et al. 2017). As already noted by  and Greene et al. (2016), the best approach for inferring the elemental abundances and C/O from spectra is to directly retrieve the abundances of the major carbon, oxygen and nitrogen molecular carriers, where X i are the volume mixing ratios of molecules and X H = 2X H2 + 4X CH4 + X HCN + 2X C2H2 + 2X H2O + 3X NH3 . In H 2 -dominated atmospheres (as studied here), X H ≈ 2X H2 . It is important to note that these are inferred quantities in the gas phase, which may differ from their bulk values due to condensation, e.g., sequestration of oxygen into olivine. Only in extremely hot conditions, such as for ultrahot Jupiters and main-sequence stars, may we reasonably assume that the photospheric and bulk elemental abundances are similar (e.g., .  cautions that retrieving directly for the molecular abundances results in a non-uniform prior for C/O (see their Section 3.3).
In the current study, we consider 7 molecules. CO and CH 4 are the major carbon carriers (Burrows & Sharp 1999) with CO 2 being a minor carbon carrier unless the metallicity is highly enriched (e.g., Moses et al. 2013;Heng & Lyons 2016). H 2 O and CO are major oxygen carriers (Burrows & Sharp 1999). Acetylene (C 2 H 2 ) becomes non-negligible as C/O approaches unity (e.g., Moses et al. 2011;Madhusudhan 2012;Heng & Tsai 2016). NH 3 competes with molecular nitrogen (N 2 ) as the major nitrogen carrier (Burrows & Sharp 1999), while hydrogen cyanide (HCN) is an important link between the carbon and nitrogen reservoirs (e.g., Moses et al. 2011). The accuracy of retrieving for the elemental abundances hinges on a spectrum having sufficient spectral resolution, signal-to-noise and wavelength coverage to accurately account for the molecules that are present in sufficient amounts. If not all of the molecules are properly accounted for, it will lead to erroneous inferences about C/O.
A second approach is to assume chemical equilibrium and parametrise all of the molecular abundances by two numbers: C/O and the metallicity. Chemical equilibrium is a local approximation in the sense that each patch of atmosphere has no memory of its past and all of the molecular abundances may be completely determined once one has knowledge of the local temperature and pressure. Metallicity has three definitions in the astronomical literature: stellar astrophysicists refer to the relative abundance of all elements heavier than helium by mass (Section 3.12 of Asplund et al. 2009), observational spectroscopists refer to the elemental abundance of iron by number (Section 4.2 of Asplund et al. 2009) and atmospheric chemists typically refer to the elemental abundance of a volatile element (e.g., carbon) by number (Moses et al. 2013). In the third definition, it is usually assumed that the ratios of the elemental abundances are kept fixed to their solar values with the exception of C/H or O/H, which are allowed to be free parameters in order to allow for a variable C/O (e.g., Moses et al. 2013;Heng 2018;Drummond et al. 2019). In chemical equilibrium, knowledge of the abundance of a single carbon or oxygen carrier is sufficient to constrain C/H or O/H, respectively. However, if chemical equilibrium is a poor assumption, then misleading conclusions will follow. None of the atmospheres of Solar System bodies are well-described by chemical equilibrium.
One of the goals of the current study is to examine the relationship between the accuracy of retrieving for the elemental abundances and hence C/O. Classical information content analysis is based on computing Jacobians, which are the derivatives of the model output (e.g., transit depth) with respect to the parameters. See Section 2 of Batalha & Line (2017) for a recent review. 1 Classical information content analysis is a time-consuming process. For example, Section 3.1 of Batalha & Line (2017) states, "For each of these 84 combinations of planet types, we compute a separate Jacobian" (these authors' emphasis). Howe et al. (2017) introduces the use of "mutual information" (see their Section 2), but remark how "the difficulty with the use of mutual information is that it is computationally intensive, especially for the dense data sets produced by JWST." For reasons of computational feasibility, Howe et al. (2017) adopted a simple three-parameter model that assumed an isothermal transit chord, gray clouds and a metallicity. 2 Batalha & Line (2017) assumed chemical equilibrium models described by the metallicity and C/O and a non-gray treatment of clouds.
In the current study, we adopt a qualitatively different approach to information content analysis. Recently, Márquez-Neila et al. (2018) demonstrated that the classical machine learning method of the "random forest" (Ho 1998;Breiman 2001;Criminisi et al. 2011) may be adapted to perform atmospheric retrieval, as a complement to standard methods such as nested sampling (Skilling 2006;Feroz & Hobson 2008;Feroz et al. 2009Feroz et al. , 2013. Fisher et al. (2020) compares random forest retrieval to other methods (nested sampling, Bayesian neural networks). There are three distinct advantages of random forest retrieval in terms of practical implementation. First, it performs "feature importance", which in the context of spectra means it is able to compute the relative importance of each data point for constraining each parame-ter of a chosen model used to interpret the spectra. Second, it is able to easily perform large suites of mock retrievals in the form of "real versus predicted" (RvP) plots (Márquez-Neila et al. 2018;Fisher et al. 2020;Oreshenko et al. 2020). Third, since the random forest may be trained on a pre-computed model grid of arbitrary sophistication, the obstacles of computational feasibility encountered by Howe et al. (2017) and Batalha & Line (2017) may be overcome. Instead of assuming chemical equilibrium, we allow each of our 7 molecules to take on a broad range of abundances and infer the elemental abundances and C/O from the retrieved abundances.
Examples of RvP plots are shown in Figure 1, where we perform a suite of 20,000 mock retrievals for Hubble Space Telescope (HST) Wide Field Camera 3 (WFC3) transmission spectra of the warm Neptune GJ 436b. These RvP plots may be used to quantify the ability of a retrieval to accurately recover each parameter value of the model. The random forest reports the mean predicted values of the parameters in the RvP plots. The figure of merit used is the "coefficient of determination" (R 2 ), where R 2 = 0 means zero predictability (zero correlation between the real versus predicted values of a parameter) and R 2 = 1 means perfect predictability. Model degeneracies will generally lower the value of R 2 (Márquez- Neila et al. 2018;Fisher et al. 2020;Oreshenko et al. 2020). The RvP plots reproduce widely accepted knowledge in the exoplanet retrieval literature: WFC3 transmission spectra probe mainly H 2 O, CH 4 and NH 3 with some sensitivity to HCN, but are insensitive to CO and CO 2 . Furthermore, while cloud particle radius and abundance may be retrieved, one is blind to the retrieval of cloud composition. If CO and H 2 O are present in comparable abundances, then the retrieval will only accurately infer the H 2 O abundance, leading to an inaccurate estimate of C/O. Figure 2 shows the accompanying feature importance plots. Each feature importance plot quantifies the relative importance of each of the 13 data points in the WFC3 transmission spectrum for determining the value of a parameter. For example, the two data points near 1.4 µm constrain the water abundance, which matches our intuition of a water feature being present at these wavelengths. The bluest data points constrain the cloud abundance and particle radius. When the feature importance is about equal for all 13 data points (e.g., for CO), it often indicates a lack of sensitivity to a given parameter, which can only be confirmed by cross-matching with the R 2 ≈ 0 value from the RvP plot ( Figure 1).
In the current study, we will demonstrate the usefulness of both feature importance and RvP analysis for understanding the information content of JWST NIRSpec transmission spectra of warm Neptunes. The random forest technique has also been applied to ground-based spectra of brown dwarfs at medium spectral resolution (Oreshenko et al. 2020) and ultrahot Jupiters at high spectral resolution (Fisher et al. 2020).  Real versus predicted (RvP) values of the various parameters from a suite of 20,000 mock retrievals on HST-WFC3 transmission spectra using the random forest method. For clarity (and with no loss of generality), we only show 5000 of these mock retrievals. The stellar and exoplanetary parameters of GJ 436 and the warm Neptune GJ 436b, respectively, are assumed (see text). The synthetic spectra are composed of 13 wavelength bins from 0.8-1.7 µm following Kreidberg et al. (2015) and to provide continuity with Márquez-Neila et al. (2018). Each synthetic data point assumes an optimistic photon-limited uncertainty of 20 parts per million (ppm). The blue and red points correspond to cloudfree (α0 < 10 −9 cm −1 ; see text for definition) and cloudy (α0 > 10 −9 cm −1 ) models, respectively. Negative and positive values of the coefficient of determination (R) correspond to negative and positive correlations, respectively.
The current study is restricted to transmission spectroscopy at optical to near-infrared wavelengths. Specifically, we consider the NIRSpec instrument on JWST. 3 In the lowresolution (∼ 100) prism mode, NIRSpec has a simultaneous wavelength coverage of 0.6-5.3 µm. It is suitable for stars fainter than J ≈ 10, corresponding to Kepler and fainter TESS targets. At medium resolution (∼ 1000), NIRSpec has four modes: 0.7-1.27 µm (G140M/F070LP), 0.97-1.84 µm (G140M/F100LP), 1.66-3.07 µm (G235M/F170LP) and 2.87-5.10 µm (G395M/F290LP). These modes are suitable for stars fainter than J ≈ 6-8. Four high-resolution (∼ 2700) modes exist as well, but as we will show these do not add much interpretational value, in terms of retrieving elemental and molecular abundances, to what the medium-resolution modes already offer. Table 1 provides a summary of the JWST NIRSpec modes we will consider in the current study.

Layout of study
In Section 2, we describe our methods of computation. In Section 3, we present the results from our information content analyses and also an improved diagnostic for chemical disequilibrium. In Section 4, we compare our results to those of previous studies and discuss their implications for planning JWST observations.    Figure 1, which shows the "feature importance" plots from the random forest retrieval analysis. Each feature importance plot quantifies the relative importance of each data point in a HST-WFC3 transmission spectrum for determining the value of a given parameter. The entries in each panel add up to unity. 1987,1992,1998,2003,2005,2009,2013) and HITEMP (Rothman et al. 2010) spectroscopic databases; the pressure broadening parameters for H 2 -He mixtures are taken from the ExoMol database. A review of the spectroscopic databases may be found in Tennyson & Yurchenko (2017). For a review of how to compute opacities given inputs from the spectroscopic databases, we refer the reader to, for example, the appendix of Rothman et al. (1998), Grimm & Heng (2015), Chapter 5 of Heng (2017) and Yurchenko et al. (2018). All opacities are calculated with the HELIOS-K opacity calculator (Grimm & Heng 2015) from 10 −8 -10 3 bar. Table 3 states the details of the opacities, including the range of wavenumbers/wavelengths over which spectroscopic data needed as input exist. When computing transmission spectra, we use opacity sampling at the resolutions stated in Table 1.

Clouds
In the current study, we will use the terms "cloud", "haze" and "aerosol" interchangeably, based on the reasoning that while these terms may reflect different formation pathways, the effects on a spectrum follow a common phenomenological treatment. There is no consensus on the use of these terms: Earth scientists use "haze" versus "cloud" as a measure of particle size, while planetary scientists use these terms to refer to photochemical and thermochemical formation origins, respectively.
If a cloud consists of spherical particles of a single radius (i.e., a monodisperse cloud), then its cross section is where Q is the extinction efficiency. It may be computed using Mie theory (Mie 1908).  use the open-source LX MIE Mie code to calibrate a convenient fitting function for Q, where Q 1 ≈ 4 (Kitzmann & Heng 2018), the dimensionless size parameter is x = 2πr c /λ and λ is the wavelength. This fitting function smoothly transitions between the regimes of small (x 1; Rayleigh and non-gray continuum) and large (x 1, gray continuum) particles. For simplicity, we assume a = 4; see Table 2 of  for the values of a as a function of the composition. Refractory and volatile condensates correspond to Q 0 ∼ 10 and ∼ 100, respectively (see Table 2 of ). The cloud extinction coefficient is assumed to be uniform along the transit chord.
It is worth noting that this simplified treatment of the cloud cross section does not capture composition-specific spectral features (e.g., Cushing et al. 2009;Lee et al. 2014).
As it is calibrated on first-principles calculations, our treatment of clouds is an improvement over the gray cloud assumption of Howe et al. (2017) and the approach of Greene  Figure 1, the blue and red points correspond to cloudfree (α0 < 10 −9 cm g −1 ) and cloudy (α0 > 10 −9 cm g −1 ) transmission spectra, respectively. For clarity of presentation, we show only 5000 out of the actual 20,000 mock retrievals performed.    (2017), who used a combination of a "cloud top pressure" (for gray clouds) and a power-law parametrisation (for non-gray clouds).

Total extinction coefficient
The total extinction coefficient is where m i is the mass, X i is the volume mixing ratio, κ i is the opacity of each molecule. The sum is performed over all of the molecules in the system. The mass density and mean molecular mass of the atmosphere are given by ρ and m, respectively. The extinction coefficient associated with clouds is written as The mean molecular mass, cloud volume mixing ratio (X c ) and Q 1 are subsumed into a single fitting parameter,

Transmission spectra
Consistent with Greene et al. (2016) and Howe et al. (2017), we assume isothermal, non-isobaric transit chords. 4 We use the HELIOS-O code to compute transmission spectra (Gaidos et al. 2017;Bower et al. 2019). Each model atmosphere is divided into 150 annuli in pressure (P ) from 10 −8 -10 bar. The limit of 10 bar is chosen to ensure that the atmosphere is fully opaque at the lower boundary and has no bearing on the final outcome of the calculation.
At each wavelength, the slant optical depth is computed using (Brown 2001 where x is the spatial coordinate along the line of sight. The transmission function along each line of sight is Integrating along the radial coordinate yields the transit depth (Brown 2001), JWST spectra are expected to encode enough information  to break the normalisation degeneracy (Benneke & Seager 2012;Griffith 2014;Barstow et al. 2015;Heng 2019). Nevertheless, we account for this degeneracy by matching the computed white-light radius of each model to the measured one (R = 0.3767 R J from 0.5-1.0 µm; Torres et al. 2008).
Across the wavenumber range of 1/λ = 1800-17000 cm −1 (wavelength range of 0.6-5.5 µm), we assume a uniform spacing in log (1/λ) corresponding to 6700 points, such that the spectral resolution is approximately constant with an average value of 3000. The spectra are then restricted in wavelength and binned down to a spectral resolution of 100, 600 or 1000, depending on which modes of NIRSpec one is studying (see Table 1). An optimistic, photon-limited uncertainty of 20 ppm per data point is assumed, consistent with Greene et al. (2016). The intention is to identify the possible weaknesses of each NIRSpec mode even under idealized conditions.
On a desktop computer (Intel Core i9-7960X CPU), it takes HELIOS-O, which is written in the C++ programming language, about 1 second to compute each model. For the entire grid of 100,000 models, this amounts to about 30 hours of computing time.

Random forest retrieval
The "random forest" is a classical, supervised method of machine learning (Ho 1998;Breiman 2001). It belongs to a class of methods known as Approximate Bayesian Computation (ABC). Within the ABC framework, it has been demonstrated that one may compute approximate posterior distributions and perform model comparison via computation of the Bayesian evidence (Sisson et al. 2019).
As is appropriate for continuous quantities such as transit depths or radii, a regression tree (rather than a decision tree) is used to classify transmission spectra with different sets of parameter values (treated as "labels"; Márquez-Neila et al. 2018). A bootstrapping method is used to generate an uncorrelated forest of regression trees, and the combined output of the random forest yields the posterior distributions of parameters (Criminisi et al. 2011). Following Fisher et al. (2020, we take as output all of the entries in a leaf, rather than the average of the leaf, as the sampled posterior distribution of a parameter. As demonstrated by Márquez-Neila et al. (2018), the random forest produces two additional diagnostics: feature importance plots, which quantify the relative importance of each data point in the transmission spectrum for constraining each parameter; and RvP plots, which quantify the degree to which each parameter may be predicted in mock retrievals given the noise model. The RvP analysis is essentially an efficient way to generate large suites (∼ 10 4 ) of mock retrievals, which is computationally challenging to accomplish using standard retrieval methods (e.g., Barstow et al. 2015).
The range of values of the model parameters, as well as the assumptions on their prior distributions, are stated in Table 2. Each parameter is randomly drawn from its prior and a noise-free transmission spectrum is generated, as explained in Section 2.2. In order to add noise, each point in the synthetic spectrum is assumed to follow a Gaussian distribution with a standard deviation of 20 ppm. The points are then randomly sampled from these distributions, centred on their noise-free values. This is repeated to build a grid of 100,000 models for the forest, split into 80,000 for training and 20,000 for testing. The random forest consists of 1000 trees. Tree splitting is performed using the following steps: the range of values of each parameter is normalized such that its maximum value is 100; tree splitting ceases when the change in total variance of the parameter values (as a node is split into two branches) is less than a stated tolerance, which is set to 0.01. Each time a tree is split, a random subset of ∼ √ N points is used, where N is the total number of spectral points, to reduce biases. Tree pruning methods are not used. The implementations of the random forest method and R 2 metric are from the opensource scikit.learn library (Pedregosa et al. 2011) in the Python programming language.
On a desktop computer (Intel Core i7 CPU), it takes HELA, which is written in the Python programming language, about 10 minutes to train the random forest.

RESULTS
As an illustration, we will use the example of GJ 436b for our calculations: the GJ 436 star has a stellar radius of R = 0.455 R and GJ 436b has a surface of g = 1318 cm s −2 (von Braun et al. 2012). The qualitative conclusions of our study do not depend on the choice of these parameter values. Table 1 lists the 4 medium-resolution modes of JWST NIRSpec. The expectation is that the M1 (0.7-1.27 µm) and M2 (0.97-1.84 µm) modes, which probe a collective wavelength range similar to the WFC3 instrument of HST, encode the most information on cloud properties (e.g., Lecavelier des Etangs et al. 2008), but may be insensitive to important carbon-bearing molecules such as CO and CO 2 . Therefore, we begin the discussion by comparing the M1 and M4 (2.87-5.10 µm) modes. Figure 3 shows the outcomes of performing 20,000 mock retrievals for each of the modes in turn. For clarity of presentation (and with no loss of generality), we display only 5000 out of the 20,000 mock retrievals. It is emphasized again that the random forest reports the mean predicted value of each parameter. 5 Based on the similar R 2 values obtained, the M1 and M4 modes do comparably well at constraining the H 2 O and NH 3 abundances, as well as the temperature. The M4 mode outperforms the M1 mode by more than 0.1 in R 2 value for constraining the abundances of HCN, CH 4 and C 2 H 2 . As demonstrated by the low R 2 values, the M1 mode is insensitive to CO 2 (R 2 = 0.138) and essentially blind to CO (R 2 = −0.003). The M4 mode offers drastic improvements on constraining CO (R 2 = 0.508) and CO 2 (R 2 = 0.779) due to their spectral features across 4-5 µm (Figure 4).

RvP analysis of different JWST NIRSpec modes
When a mock retrieval fails to predict the value of a given parameter, the RvP analysis returns values that are the mean of the range considered. In the case of CO, since the range of volume mixing ratios considered is 10 −9 -10 −2 (in loguniform spacing), the random forest returns X CO = 10 −6 -10 −5 for the M1 mode. In other RvP plots where the predicted values of the parameters level off at a value that is below the mean of the range considered, these indicate the minimum or threshold value of a parameter that can be constrained given the noise model. For example, X H2O 10 −7 for both the M1 and M4 modes. Generally, volume mixing ratios as low as ∼ 10 −8 may be constrained given the assumed 20 ppm noise floor.
In Figure 3, the points have been color-coded blue (α 0 < 10 −9 cm −1 ) or red (α 0 > 10 −9 cm −1 ) to correspond to cloudfree or cloudy atmospheres, respectively. This threshold value of α 0 was obtained by trial-and-error and is guided mainly by inspecting the RvP behavior of both α 0 and r c . The bimodal behavior of α 0 above this threshold is an indication of the degeneracy between the degree of cloudiness and the molecular abundances. The trend of r c leveling off at 1 µm is the outcome of the cloud opacity becoming gray/constant as the cloud particles become large compared to the wavelengths probed. This trend is consistent with the basic principles of Mie theory. In all of the RvP plots of the molecular abundances and temperature, a subpopulation of the red (cloudy) points cluster in the middle of the range of values considered, indicating that the random forest does not predict a value for the given parameter.
The M1 and M4 modes constrain α 0 (R 2 = 0.488 versus 0.452) and r c (R 2 = 0.207 versus 0.210) almost equally well. Both the M1 and M4 modes have no sensitivity to the cloud composition (via Q 0 ; R 2 ≈ 0), which implies that it is challenging to identify cloud composition by constraining changes in the gradient of the spectral continuum alone. It does not rule out the possibility that higher-order spectral features that are composition-specific may retain constraining power (e.g., Cushing et al. 2009;Lee et al. 2014).
For completeness, the RvP plots of the M2, M3 and L modes are included in the Appendix as Figures A2, A4 and A6, respectively. The M2 mode exhibits similar behavior as the M1 mode in that it is somewhat insensitive to CO 2 (R 2 = 0.322) and nearly blind to CO (R 2 = 0.039). The M3 mode (1.66-3.07 µm) is blind to CO (R 2 = 0.075), but sensitive to CO 2 (R 2 = 0.669). The L mode has good sensitivity to CO 2 (R 2 = 0.763), but is largely insensitive to CO (R 2 = 0.171). Section 4.2 and Figure 9 performs a detailed  comparison of the R 2 values of every parameter for all of the modes considered in the present study. Figure 4 shows the feature importance analysis of the M1 versus M4 modes. Each panel shows the fractional importance of each data point for constraining a given parameter. It cannot be over-emphasized that the feature importance values cannot be compared between panels, because the entries are normalised such that they add up to unity within the same panel.

Feature importance analysis of JWST NIRSpec modes
The feature importance analysis of the M4 mode reproduces our intuition about the warm Spitzer Space Telescope channels. Channel 1 of the IRAC instrument, which ranges from about 3.1-3.9 µm and is often quoted as the "3.6 µm channel", probes several spectral features of methane (e.g., Sudarsky et al. 2003;Fortney et al. 2005Fortney et al. , 2006Fortney et al. , 2010. Channel 2 of IRAC, which ranges from about 3.9-5.1 µm and is often quoted as the "4.5 µm channel", probes carbon monoxide (e.g., Sudarsky et al. 2003;Fortney et al. 2005Fortney et al. , 2006Fortney et al. , 2010Charbonneau et al. 2008). It is consistent with the narrative that the flux ratios of these channels probe the relative abundances of CH 4 to CO, and is thus a measure of disequilibrium chemistry (e.g., Madhusudhan & Seager 2011;Moses et al. 2011).
Other properties are less apparent without detailed scrutiny of the feature importance plots. Generally, molecules such as H 2 O, HCN, NH 3 , CH 4 and C 2 H 2 have multiple spectral lines distributed across the wavelength ranges of both the M1 and M4 modes. For the M4 mode, there are strong CO 2 features between 4-5 µm. It also encompasses a CO feature at 4.7 µm, which explains the ability of the 4.5 µm channel of IRAC to constrain carbon monoxide. The M1 and M4 modes are equally good at constraining α 0 and the cloud particle radius (based on comparing the R 2 values as discussed earlier), but these constraints come from different wavelength regions.
Parameters associated with R 2 ≈ 0 typically have almost equal feature importance distributed across wavelength, as is the case for Q 0 (both M1 and M4) and CO (only M1). For completeness, we include in the Appendix the feature importance plots of the M2, M3 and L modes in Figures A3, A5 and A7, respectively.
Consistent with the insensitivity of the M1 mode to CO, CO 2 and C 2 H 2 , the posterior distributions of these molecules are unconstrained (Figure 5). The M4 mode does surprisingly poorly on CO, but this is because its spectral lines are being masked by those of CO 2 and CH 4 (see Appendix). Both modes obtain only an upper limit for NH 3 , which is absent from this model atmosphere. Overall, the M4 mode does somewhat better at retrieving the C/O ratio compared to the M1 mode ( Figure 6).
Identifying the minimum set of molecules needed to explain a spectrum may be achieved using Bayesian model comparison (e.g., Benneke & Seager 2012;Waldman et al. 2015; or deep learning methods (e.g., Waldmann 2016), which are beyond the scope of the present study.

An alternative diagnostic for detecting chemical disequilibrium
Line & Yung (2013) previously proposed a simple diagnostic for identifying chemical disequilibrium in an atmo- sphere, based on measuring the volume mixing ratios associated with the following chemical reaction (e.g., Moses et al. 2011), When rewritten in the formalism of Heng & Tsai (2016), equation (2) of Line & Yung (2013) is the reciprocal of where P 0 = 1 bar is an arbitrary reference pressure. If the transit chord probed is in chemical equilibrium, then the preceding expression is the equilibrium constant, where R univ = 8.3144621 J K −1 mol −1 is the universal gas constant and ∆G 0,1 is the molar Gibbs free energy of the reaction (at the reference pressure) tabulated in the JANAF database 6 and listed in, for example, Table 2 of Heng & Lyons (2016).
The key idea proposed by Line & Yung (2013) is to retrieve for the volume mixing ratios (X CO , X CH4 , X H2O ) and obtain an estimate for equation (11). If this estimate disagrees with K eq (which requires the retrieved temperature as an input), then the region of the atmosphere probed by transmission spectroscopy is in chemical disequilibrium. The major uncertainty with this approach is that the pressure probed in transmission (P ) needs to be accurately and precisely known, especially since it appears as the square of itself in equation (11). See Section 6.3 of Greene et al. (2016) for a critique of Line & Yung (2013).
Using the same concept, we propose to focus on another chemical reaction (e.g., Moses et al. 2011), where the corresponding combination of volume mixing ratios has no dependence on pressure (e.g., Heng & Tsai 2016), because the number of molecules associated with the reactants and products is the same. As we will see in Section 4.2, only the M4 mode of JWST NIRSpec is highly sensitive to the presence of CO, CO 2 and H 2 O. By retrieving for their mixing ratios and obtaining an estimate for the preceding expression, one may then compare it to the corresponding equilibrium constant, where ∆G 0,2 is again listed in Table 2 of Heng & Lyons (2016). In chemical equilibrium, equations (14) and (15) are equal. Figure 7 shows that K eq,2 varies by a factor of about 7 from 800-1200 K.
To accurately employ this diagnostic, the spectra measured using JWST NIRSpec would have to be of a good enough quality to demonstrate that X CO X H2O /X CO2 is sufficiently different from K eq,2 . In Figure 8, we show as an illustration the pair of posterior distributions of X CO X H2O /X CO2 from retrievals on a mock spectrum corresponding to the M1 and M4 modes for the carbon-rich case study considered in Figure 5. The posterior corresponding to the M4 mode firmly excludes the equilibrium value of K eq,2 ≈ 0.7, indicating that the carbon-rich model atmosphere considered is out of chemical equilibrium. The posterior corresponding to the M1 mode is only marginally consistent with the equilibrium value. 4. DISCUSSION 4.1. Comparison to previous work 4.1.1. Greene et al. (2016) Greene et al. (2016) did not perform an information content analysis, but they did study mock retrievals across several exoplanet types (see their Tables 1 to 3) and JWST  modes (see their Table 4), both in emission and transmission. Six molecules were explicitly considered in the mock retrievals: CO, CO 2 , H 2 O, CH 4 , NH 3 and N 2 . For transmission spectra, the transit chord was assumed to be isothermal. The cloud model consists of a cloud top pressure (for gray clouds) and a power-law prescription (for non-gray clouds consisting of small particles). A key finding of Greene et al. (2016) is: "λ = 1-2.5 µm transmission spectra will often constrain the major molecular constituents of clear solarcomposition atmospheres well." The fourth rows of Figures 7 and 8 of Greene et al. (2016) show mock retrievals for a warm Neptune (700 K) and warm sub-Neptune (600 K), respectively, with clouds and solar metallicity. It is worth noting that Greene et al. (2016) have fixed X CO2 = 3.16 × 10 −11 and X CO = 10 −9 for both cases (see their Table 3). While the comparison is imperfect, the M2 mode (0.97-1.84 µm) may be compared to the NIRISS mode (1-2.5 µm) considered by Greene et al. (2016). Our RvP analysis in Figure A2 (Appendix) suggests that CO (with R 2 = 0.039 for the M2 mode) is undetectable across the wavelength range of NIRISS, which is consistent with the unconstrained posterior distributions of X CO obtained by Greene et al. (2016) in the fourth rows of their Figures 7 and 8. Since the contributions of CO and CO 2 to C/O are negligible in both cases, we have This explains why the posterior distributions of C/O associated with the 1-2.5 µm versus 1-5 µm retrievals are similar in the fourth rows of Figures 7 and 8 of Greene et al. (2016). As a further check, the third row of Figure 6 of Greene et al. (2016), which describes a mock retrieval for a hot Jupiter (X CO ∼ 10 −4 ), shows an unconstrained posterior distribution of X CO associated with 1-2.5 µm. However, the posterior distribution of X CO associated with 1-5 µm is bounded on both sides, consistent with the findings of the current study.  (see their Table 2). Mock atmospheric retrievals are performed using a Markov Chain Monte Carlo code. Their Table  3 lists the 11 hot Jupiters considered in their study. Figure 7 of Howe et al. (2017) shows examples of calculations of Jacobians with respect to the metallicity, temperature and pressure. Even though Howe et al. (2017) suggest the use of Jacobians to diagnose cloud properties, they ultimately do not explore this option in their study. For reasons of computational feasibility, Howe et al. (2017) opted for a 3-parameter model that explores the temperature (of the isothermal transit chord), metallicity and cloud top pressure (or equivalently, a constant cloud opacity). Howe et al. (2017) remarked that, "For our simple forward model, the instrument that consistently gives the most information is the NIRISS G700XD mode," which corresponds to a wavelength range of 0.6-2.8 µm. At face value, the statement about the NIRISS G700XD mode appears to be at odds with the conclusions of the current study that the blue modes of NIRSpec are suboptimal for constraining the elemental abundances and C/O (see Section 4.2). The solution to this conundrum lies in the assumption of chemical equilibrium made by Howe et al. (2017). In chemical equilibrium, knowledge of the elemental abundances, temperature and pressure allows one to fully specify all of the molecular abundances. Equivalently, one can back out the elemental abundances if the temperature, pressure and only a subset of the molecular abundances are known.
For example, at a given temperature and pressure one can infer O/H given only X H2O if chemical equilibrium is assumed (e.g., Heng 2018). It bypasses the need to detect CO or CO 2 , which are generally needed, in a chemical disequilibrium situation, for accurately inferring O/H. If the ratios of O/H to the other elemental abundances are further assumed to take on their solar values, then the metallicity may be inferred as a single number (e.g., Heng 2018). Otherwise, the metallicity is generally a set of numbers given by the different elemental abundances. The inferred information content of the 0.6-2.8 µm mode hinges on accepting these assumptions. 4.1.3. Batalha & Line (2017) Batalha & Line (2017) used an approach to information content (IC) analysis that is similar to that of Howe et al. (2017), which is based on computing Jacobians. Their model explorations are based on a WASP-62b-like gas giant, where the main parameters are the temperature, C/O and metallicity. It is unclear if the C/H or O/H has a fixed (solar) ratio to the other elemental abundances. The cloud model follows that of Greene et al. (2016). Key conclusions from Batalha & Line (2017) include: "A single observation with NIRISS always yields the highest IC content spectra with the tightest constraints, regardless of temperature, C/O, [M/H], cloud effects or precision." As elucidated in Section 4.1.2, this conclusion hinges on the assumption of chemical equilibrium. The temperature range considered by Batalha & Line (2017) (600-1800 K) crosses the transition where chemical equilibrium starts to break down at low temperatures. 4.1.4. Nixon & Madhusudhan (2020) In a recent study, Nixon & Madhusudhan (2020) assessed the random forest technique for atmospheric retrieval. They compared several retrievals using both random forests and the traditional nested-sampling method. They also added the extension of a likelihood function to the forest to produce posteriors that match the nested-sampling retrievals. The close agreement between their extended random forest and the nested-sampling posteriors is unsurprising as the same likelihood function is used in both. The agreement implies consistency and not necessarily veracity.
In their comparisons, Nixon & Madhusudhan (2020) show some discrepancies between the standard random forest and the nested-sampling retrievals. In an improvement on the implementation of Márquez-Neila et al. (2018), we have upgraded the trees in the forest to predict the entire set of parameters in the given leaf, as opposed to taking the average value of each leaf (as described in Section 2.3). This gives a more accurate sampling of the posterior. This upgrade is not included in the standard random forest used in Nixon & Madhusudhan (2020), and could account for the discrepancies in their Figure 13. Nixon & Madhusudhan (2020) also discuss the issue in Cobb et al. (2019), who showed an example where the forest predicts an overconfident, incorrect value of ammonia at the prior minimum in a mock retrieval. As discussed in Section 4.4 and Figure A4 of Fisher et al. (2020), this effect arises from a limitation of the training set used and not because of the random forest. Specifically, because spectroscopic line list data needed to compute the ammonia opacity did not exist above 1500 K, the ammonia mixing ratio was artificially set to 10 −13 when the temperature crossed this threshold. Fisher et al. (2020) showed that this artefact was also detected using the nested-sampling method. In other words, Cobb et al. (2019) succeeded in identifying the limitation of the training set, but drew the wrong conclusion from their findings.
In Section 4 of Nixon & Madhusudhan (2020), it is suggested that the forest cannot be used for a retrieval with many parameters, claiming that: "A Random Forest retrieval with n free parameters appears to require 10 n models for an adequate training set." There is in fact no explicit rule for the size of the training set, which will likely depend on many variables such as the relationships between the parameters, the prior ranges, the resolution of the spectra, etc. We found no issues in the current study when using our 11-parameter model on both the WFC3-and JWST-like spectra. One can see from the predicted vs real plots that the forest's performance is quite reasonable given the degeneracies one expects from multiple parameters.

Recommendations for JWST observing proposals
In Figure 9, we consolidate all of our findings into a summary plot that quantifies the predictive power of every JWST NIRSpec mode considered in the current study. Several key points arise from inspecting Figure 9.
• The three bluest medium-resolution modes (M1, M2 and M3) are essentially blind to CO (R 2 ≈ 0), implying that the derived elemental abundances of carbon and oxygen may be inaccurate if CO is a major constituent, data are only available in these modes ( 1.8 µm) and the atmospheric abundances are out of chemical equilibrium.
• All of the modes do equally well at constraining α 0 (which subsumes the cloud abundance) and the cloud particle size (which is constrained by the slope of the spectral continuum), but do equally poorly at identifying cloud composition via constraining the change in slope of the spectral continuum.
• Perhaps the most surprising finding is that the M4 mode (2.87-5.10 µm) out-performs the low-resolution (∼ 100) prism mode (0.6-5.3 µm) on the ability to constrain every parameter except for the temperature and ammonia abundance. Both modes constrain the cloud properties equally well (or poorly). In the tradeoff between spectral resolution (by a factor ∼ 10) and wavelength coverage, the former triumphs.
• While increasing the resolution from ∼ 100 to ∼ 1000 enhances the constraining power substantially, a further increase of resolution to ∼ 2700, corresponding to the high-resolution modes of JWST NIRSpec, adds diminishing value. We demonstrate this by performing a RvP analysis of the M4 mode with a resolution of ∼ 2700, which we label as "H4" in Figure 9. On average, the R 2 value increases by 0.026 or about 5.3% across the 11 parameters. The biggest improvement in R 2 is associated with CO: from 0.508 to 0.577 (increase of 13.6%).
In this study, we have adopted a fiducial noise model in which every spectral value is sampled with an uncertainty of 20 ppm, which is the optimistic theoretical noise floor of JWST (Beichman et al. 2014). As a sensitivity test, we perform another set of calculations using PandExo  to simulate a Bright Object Time-Series observation of GJ 436b and to obtain a more realistic noise model for application to the M4 mode. The noise model is simulated by assuming a single transit time series of GJ 436b with the G395M grating and the sub2048 subarray read-out mode.
The standard deviation as predicted by PandExo varies between 200 and 550 ppm over the M4 wavelength range. Our model spectra that serve as training data are subsequently interpolated onto the wavelength grid simulated by PandExo.
An example model spectrum after interpolation and addition of noise is shown in Figure 10. Figure 9 shows the constraining power for various model parameters obtained using the different modes, including M4 with the realistic noise model obtained with PandExo. Despite the fact that the noise of the realistic model is ∼ 10 to 20× higher than initially assumed, the qualitative conclusions remain unchanged: the M4 mode's ability (or inability) to constrain the 11 parameters of the model are similar to when 20 ppm uncertainties are assumed. The exception is CO, where the R 2 value drops from 0.508 to 0.389. However, the R 2 value associated with CO 2 remains high: 0.731 versus 0.779.
Overall, we recommend that the medium-resolution M4 mode be used as it offers the most balanced portfolio of constraining power across temperature, molecular abundances and cloud properties. If the goal is to constrain these parameters accurately in order to infer the elemental abundances and C/O without assuming chemical equilibrium, the mediumresolution M4 mode is sufficient; the corresponding highresolution mode is unnecessary.
We acknowledge financial support from the Swiss National Science Foundation, the European Research Council (via a Consolidator Grant to KH; grant number 771620), the PlanetS National Center of Competence in Research (NCCR), the Center for Space and Habitability (CSH) and the Swiss-based MERAC Foundation. We are grateful to Brice-Olivier Demory for constructive discussions and advice on the manuscript. APPENDIX A. ADDITIONAL FIGURES Figure A1 shows various transmission spectra associated with the carbon-rich case study of Section 3.3. It is apparent that the transmission spectra with X CO = 0, 10 −3 and 10 −2 are very similar. The similarity of these spectra is due to the spectral lines of CO being masked by those of CH 4 and CO 2 at the chosen abundances (X CO = X CH4 = 10 −3 , X CO2 = 10 −4 ). The transmission spectrum with X CO = 0.1 is markedly different only because CO is so abundant that it changes the mean molecular mass-and hence the pressure scale height-significantly.
For completeness, we include in Figures A2 to A7 the RvP and feature importance plots of the M2, M3 and L modes.  Table 1 for an explanation of the modes and wavelength coverage. Zero and perfect predictability correspond to R 2 = 0 and R 2 = 1, respectively. For comparison, the WFC3 channel (0.8-1.7 µm) of HST is included. The H4 mode covers the same wavelength range as the M4 mode, but at a higher resolution of ∼ 2700. Figure 10. A sample model spectrum from the training set, computed at a resolution of 3000 over the full wavelength range (black) and interpolated onto the wavelength grid of the M4 mode (NIRSpec G395M; red). The gray points denote a single realization of the model with errors sampled from the noise model as computed by PandExo. Figure A1. Transmission spectra corresponding to the carbon-rich case study of Figure 5, but with the CO abundance removed (labeled "0") or varied from 10 −3 (its default value) to 10 −1 . Three additional curves with CO only, CH4 removed and CO2 removed are included.     Importance for joint prediction Figure A7. Same as Figure 4, but for the L mode.