Parameterizing pressure-temperature profiles of exoplanet atmospheres with neural networks



Introduction
With now over 5000 confirmed planet detections outside the solar system (see, e.g., Christiansen 2022), the exoplanet science community is increasingly expanding from detecting new planets to also characterizing them. One important tool for this are atmospheric retrievals (AR), that is, "the inference of atmospheric properties of an exoplanet given an observed spectrum" (Madhusudhan 2018). The properties of interest here include, for example, the chemical composition of the atmosphere (i.e., abundances of different chemical species) or the presence of clouds or haze. Deriving these properties from a spectrum is a classic example of an inverse problem. As with many such problems, the standard approach is to combine Bayesian inference methods, such as nested sampling (Skilling 2006), with a simulator for the "forward" direction, in our case, turning a set of atmospheric parameters θ into the corresponding spectrum. Given the spectrum y of a planet of interest, one can then estimate a posterior distribution p(θ | y) by iteratively drawing parameter samples, simulating the corresponding spectra, and comparing them with y.

Fig. 1. Schematic illustration of the problem setting considered in this paper: "Parameterizing PT profiles" means finding a way to represent PT profiles as vectors z ∈ R^d, together with a decoding mechanism f_z : P → T that converts these z-vectors back into a function that maps pressure values onto temperature values. Our goal in this work is to learn both the representations z and the mechanism f_z from data obtained with simulations using full climate models.
However, as mentioned above, it is common to limit oneself to one-dimensional approximations. The implications of such an approximation are discussed, for example, in Blecic et al. (2017).
There are, in principle, two different ways to determine the PT profile during an AR: One may either solve a set of (simplified) radiative transfer equations, or approximate the PT profile through an ad hoc fitting function (see, e.g., Seager 2010; Line et al. 2013; Heng 2017). In the latter case, which is the one that we consider in this work, one has to choose a parameterization for the PT profile, that is, one needs to assume a parametric family of functions f_z : P → T which map pressure values onto corresponding temperature values and have tunable parameters z ∈ R^d that control what exactly this mapping looks like (see Fig. 1 for an illustration). These parameters z are a subset of the full set of parameters θ for which the retrieval procedure estimates a posterior distribution (see above). Throughout this work, we commonly refer to z as a (latent) representation of a PT profile. The choice of f_z will, in general, be determined by a trade-off between different (and possibly conflicting) requirements and desiderata, including the following:
- For every profile that we can expect to find during the retrieval, there should exist a z such that f_z is a good approximation of this profile (surjectivity).
- For every value of z from some given domain (e.g., a subset of R^d), f_z is a valid PT profile. Together with surjectivity, this implies that the image of z → f_z is equal to the set of physically sensible (and relevant) PT profiles.
- The mapping z → f_z should be smooth in the sense that small changes in z should only result in small changes in f_z (i.e., continuity); otherwise, the retrieval loop may not converge.
- The dimensionality d = dim(z) should be as small as possible, since methods such as nested sampling often do not scale favorably with the number of retrieval parameters.
- We should be able to define a simple prior p(z) for z, and for z ~ p(z), the induced distribution over functions f_z should constitute an appropriate prior distribution over the PT profiles of the population of planets that we are considering.
- Ideally, the dimensions of z are interpretable as physical quantities, or correspond to physical processes (e.g., "z_1 is the mean temperature," or "z_2 controls the presence of an inversion").

Of course, as suggested above, no choice of parameterization will perfectly satisfy all requirements simultaneously, and some sort of compromise is required.
Related work. Barstow & Heng (2020), who have identified the parameterization of PT profiles as one of the "outstanding challenges of exoplanet atmospheric retrievals," find the approaches of Madhusudhan & Seager (2009) and Guillot (2010) to be the most popular among the atmospheric retrieval community. These parameterizations, as well as the ones proposed by Hubeny et al. (2003), Hansen (2008), Heng et al. (2011), and Robinson & Catling (2012), all consist of (semi-)analytic expressions that seek to describe the relationship between pressure and temperature mainly in terms of physically interpretable quantities, such as the opacity or optical depth. Simpler (yet less physically motivated) approaches include the use of splines (e.g., Zhang et al. 2021) as well as low-order polynomials (e.g., Konrad et al. 2022). Finally, a data-driven approach was presented by Schreier et al. (2020), who compute a singular value decomposition of a representative set of PT profiles and parameterize new profiles as a combination of the first k singular vectors. All these approaches typically use four to six parameters to describe a PT profile, which is often a substantial fraction of the total number of retrieved parameters. To the best of our knowledge, no studies to date have used machine learning (especially neural networks) to parameterize pressure-temperature profiles of exoplanet atmospheres.

Contributions.
In this work, we explore a conceptually new approach to parameterizing PT profiles that is based on the idea of using neural networks to learn efficient (i.e., low-dimensional) representations of PT profiles, as well as the corresponding decoding mechanisms, from simulated data. Using two different datasets of PT profiles obtained with full climate models, we show that our approach can achieve better fit quality with fewer parameters than our two baseline methods, while still easily integrating into an existing atmospheric retrieval framework.

Method
Our goal is the following: We want to learn a model (in our case: a neural network) that takes two inputs: (1) a pressure value p, and (2) a vector z ∈ R^d, also called the "representation", which contains a highly compressed description of a PT profile. The output of the model is the temperature t at pressure p, for the profile described by z. During an atmospheric retrieval, the Bayesian inference routine can then propose values of z, which can be converted to a PT profile by conditioning the network on the proposed z (that is, fixing the input z) and evaluating it at different pressures p_i to get the corresponding temperatures t_i. Thus, we will infer a posterior distribution over z, which corresponds directly to a posterior over PT profiles.
Gebhard, T. D., et al.: A&A, 681, A3 (2024)

To learn such a network, we assume that we have a dataset D of PT profiles which were generated by a full climate model or radiative transfer code to ensure they are physically consistent. These models typically simulate an atmosphere in a layer-by-layer fashion, meaning that each PT profile is given in the form of a set of support points, {(p_i, t_i)}_{i=1,...,N}, where N is the number of layers, and p_i and t_i are the pressure and temperature in the i-th layer. Importantly, our method does not require that p_i be the same for all profiles in D: Different profiles can use different pressure grids, and this is indeed the case for the datasets we work with. Moreover, in principle, even the number of layers N can vary between profiles. (We do not explicitly consider this case here, but it implies only minor technical complications at training time.) This flexibility also makes it easy, for example, to combine PT profiles from different models or pre-existing atmosphere grids into a single training dataset.
The approach that we propose in the following is inspired by the idea of a (conditional) neural process (NP; Garnelo et al. 2018a,b). A neural process is a type of machine learning model that learns a distribution over functions and combines the advantages of Gaussian processes with the flexibility and scalability of neural networks. In astrophysics, the usage of (conditional) neural processes is still relatively rare; examples include Park & Choi (2021), Čvorović-Hajdinjak et al. (2022), and Jankov et al. (2022), all of which use conditional NPs for interpolation.
Our method works as follows (see also Fig. 2). We employ two neural networks: an encoder E and a decoder D. The encoder is only required during training and will not be used during an AR. At training time, we take a PT profile, consisting of a vector p = (p_1, ..., p_N) of pressure values and a vector t = (t_1, ..., t_N) of corresponding temperatures, from our training dataset and pass it to E, which outputs a representation z ∈ R^d of the profile:

z = E(p, t).    (1)

The dimensionality d of z can be chosen freely (but changing d will require re-training E and D). Generally, we want to keep d as low as possible (e.g., d = 2), but of course, higher values also imply more expressive or informative representations.
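To make the encoder's role concrete, here is a minimal sketch in plain NumPy. The architecture (a small fully connected network acting on the concatenated pressure and temperature vectors), all layer sizes, and the input normalization are illustrative assumptions, not the actual design described in Appendix B:

```python
import numpy as np

def init_mlp(sizes, rng):
    """He-style random weights for a small fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

rng = np.random.default_rng(0)
N, d = 101, 2                                 # atmospheric layers, dim(z)
encoder = init_mlp([2 * N, 64, 64, d], rng)   # E: (p, t) -> z

# Toy profile: log10 pressures and (normalized) temperatures, concatenated.
log_p = np.linspace(0, -6, N)
t = 280.0 - 30.0 * log_p
z = mlp_forward(encoder, np.concatenate([log_p, t / 300.0]))
print(z.shape)  # -> (2,)
```

The point is only the interface: a whole profile goes in, a d-dimensional vector z comes out.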
In the next step, the representation z is then used to condition the decoder network D. Once conditioned on z, the decoder network is a PT profile, that is, a function that takes in a single value (a pressure) and outputs a single value (a temperature). We then pass all pressure values p_i of our input profile through D to get the respective predicted temperature values t̃_i:

t̃_i = D(p_i | z),    (2)

where we use D(· | z) : R+ → R+ to denote the function that is given by conditioning D on z (i.e., fixing z as an input). We train the encoder and decoder jointly (i.e., we update the weights of the neural networks) to minimize the difference between the true and predicted temperatures. More specifically, we minimize the following mean squared error (MSE) reconstruction loss, which is a standard choice in machine learning for regression problems:

L_rec = (1/N) Σ_{i=1}^{N} (t_i − t̃_i)².    (3)

In practice, we do this not only for a single PT profile, but minimize the average of L_rec over our entire training dataset in a batched fashion. We also note that this choice of reconstruction loss assigns the same weight to all N atmospheric layers, and refer to Sect. 5.4 for a brief discussion of possible alternatives.
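The conditioning-and-reconstruction step can be sketched as follows. The decoder `D` here is a hypothetical stand-in (a linear function of log-pressure whose offset and slope are read from z), not a trained network:

```python
import numpy as np

def condition(D, z):
    """Turn the two-argument decoder D(p, z) into a PT profile D(. | z)."""
    return lambda p: D(p, z)

def reconstruction_loss(D, z, log_p, t):
    """Mean squared error between true and predicted temperatures (Eq. 3)."""
    t_pred = np.array([condition(D, z)(p) for p in log_p])
    return np.mean((t - t_pred) ** 2)

# Hypothetical stand-in decoder: offset and slope in log-pressure set by z.
D = lambda p, z: z[0] + z[1] * p

log_p = np.linspace(0, -6, 51)
t = 1800.0 + 120.0 * log_p   # toy hot-Jupiter-like profile
print(reconstruction_loss(D, np.array([1800.0, 120.0]), log_p, t))  # -> 0.0
```

With the matching z, the conditioned decoder reproduces the toy profile exactly, so L_rec vanishes.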
To ensure that we can define a simple prior for z during a retrieval, we furthermore want to constrain the distribution of the z's produced by the encoder. There are various potential approaches for this (see Appendix A). For this work, we adopt an idea from Zhao et al. (2017) and add a second term to the loss function that encourages z to follow a standard Gaussian:

L_MMD = MMD²({z_j}_{j=1,...,b}, {s_j}_{j=1,...,b}),  with  s_j ~ N_d(0, 1).    (4)

Here, N_d(0, 1) denotes a d-dimensional standard Gaussian distribution (i.e., mean zero and identity matrix as covariance matrix). Furthermore, b is the batch size of our training, and MMD stands for Maximum Mean Discrepancy (Borgwardt et al. 2006; Gretton et al. 2012), which is a kernel-based metric that measures whether two samples come from the same distribution (see Appendix A.2 for more details). By minimizing this MMD term, we encourage the model to produce values of z that follow the same distribution as the s_j, that is, a d-dimensional standard Gaussian. Experimentally, we found that despite this MMD term, there are sometimes a few profiles that are mapped to a z very far from the origin, which is undesirable if we want to define a simple prior for z during a retrieval. To prevent this problem, we introduce a third loss term with a softplus barrier function on the norm of z:

L_norm = softplus_k(∥z∥₂ − τ),  with  softplus_k(x) = (1/k) · log(1 + exp(k · x)).    (5)

This is a smooth version of the ReLU(x) := max(0, x) function, where k controls the amount of smoothing: larger values of k increase the similarity to ReLU. For this work, we have fixed k at 100 (ad hoc choice), after observing that using a standard ReLU function destabilized the training for us. The parameter τ defines the threshold on ∥z∥₂ and was set to τ = 3.5 (ad hoc choice).
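The two regularization terms can be sketched in NumPy as follows. The Gaussian kernel (with bandwidth σ = 1) is an assumption here; the actual kernel choice is described in Appendix A.2. The barrier uses k = 100 and τ = 3.5 as in the text, in a numerically stable formulation:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy, Gaussian kernel."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def norm_loss(z, tau=3.5, k=100.0):
    """Softplus barrier on ||z||_2: ~0 below tau, growing ~linearly above it."""
    x = k * (np.linalg.norm(z, axis=-1) - tau)
    return np.mean((np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))) / k)

rng = np.random.default_rng(1)
z_batch = rng.normal(size=(256, 2))   # encoder outputs for one batch
s_batch = rng.normal(size=(256, 2))   # samples s_j from N_d(0, 1)
print(mmd2(z_batch, s_batch) < 0.05)  # -> True (same distribution)
print(norm_loss(np.array([[5.0, 5.0]])) > 3.0)  # -> True (||z|| = 7.07 > tau)
```

When the batch of z values already looks standard-Gaussian, L_MMD is near zero; a z far outside the τ-ball is penalized roughly in proportion to how far it exceeds the threshold.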
The total loss that we minimize is then just a weighted sum (with β, γ ∈ R+) of our three loss terms:

L = L_rec + β · L_MMD + γ · L_norm.    (6)

The hyperparameters β and γ control the tradeoff between the different loss terms: (1) Increasing β corresponds to placing a stronger prior on the distribution of z in the latent space, usually at the cost of a higher reconstruction error. We have tried different values of β and found that the performance is relatively similar over a range of about 0.1 to 1. Too high values of β can lead to model collapse, where the prior on z becomes too strong and prevents the encoder from learning meaningful representations. For this work, we set β = 0.1. This was an ad hoc choice and not the result of systematic optimization, which would likely yield a different value for each dataset and value of dim(z). (2) The value of γ should be chosen such that γ · L_norm dominates the other loss terms when ∥z∥₂ > τ. We set γ = 10 for this work (again without optimization).
To use our trained models during atmospheric retrieval, we provide a convenient Python wrapper that allows the model to be loaded as a normal function, taking as input a value for z and a numpy array of pressure values, and returning an array of corresponding temperatures. We note again that neither the lengths of the pressure grids nor the exact values need to match the pressure grid(s) used during training. (However, the range of pressure values should be approximately the same; otherwise, evaluating D(· | z) at pressure values it has never encountered during training may result in nonphysical outputs.)
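A sketch of what such a wrapper might look like. The class name, the warning behavior, the pressure units, and the stand-in decoder are all assumptions for illustration and do not reflect the actual interface of the published code:

```python
import warnings
import numpy as np

class PTProfileModel:
    """Hypothetical wrapper: takes z and an array of pressures (in bar),
    returns the corresponding temperatures."""

    def __init__(self, decoder, log_p_min=-6.0, log_p_max=0.0):
        self.decoder = decoder          # trained D: (log_p, z) -> T
        self.log_p_min = log_p_min      # training pressure range (log10)
        self.log_p_max = log_p_max

    def __call__(self, z, pressures):
        log_p = np.log10(np.asarray(pressures, dtype=float))
        if log_p.min() < self.log_p_min or log_p.max() > self.log_p_max:
            warnings.warn("Pressures outside the training range; "
                          "decoder outputs may be nonphysical.")
        return np.array([self.decoder(lp, z) for lp in log_p])

# Stand-in decoder for illustration (a real one is a trained network).
model = PTProfileModel(lambda lp, z: 250.0 + z[0] * lp)
temps = model(np.array([10.0]), np.array([1.0, 0.1, 0.01]))
print(temps)  # -> [250. 240. 230.]
```

Note that the input pressure grid is arbitrary, which mirrors the property stated above: only the covered pressure range matters, not the exact grid.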

Datasets
We validate our proposed approach using two different, publicly available datasets of PT profiles: one for terrestrial planets, and one for hot Jupiters.A visual comparison of the two datasets and the respective P-T space that they cover is shown in Fig. 3.

PYATMOS
The PYATMOS dataset was generated by Chopra et al. (2023) and consists of 124 314 stable atmospheres of Earth-like planets around solar-type stars, generated with Atmos, a 1D coupled photochemistry-climate model (Arney et al. 2016; Meadows et al. 2018). The atmospheres in this dataset were simulated using N = 101 layers covering an altitude range of 0-80 km. Consequently, each PT profile is given by two 101-dimensional vectors for the pressures and temperatures. In addition, about 80 other atmospheric quantities are available, such as fluxes and abundance profiles of chemical species. We manually removed one PT profile from the dataset because it was completely out-of-distribution and caused stability problems during training.
Furthermore, it has been brought to our attention that the simulations of the PYATMOS dataset suffered from a couple of issues that affect the physical correctness of the simulated PT profiles. In particular, the ClimaMain.f code specified a H2O k-coefficient file that assumed pressure broadening by H2O itself, which is not appropriate for an N2-dominated atmosphere. Further, the model version also did not allow for convection above 40 km. Consequently, the simulated PT profiles tend to underpredict stratospheric and mesospheric temperatures compared to the expectation from a real Earth twin. Newer versions of the Atmos code are not affected by these issues and use updated H2O and CO2 k-coefficients (Teal et al. 2022; Vidaurri et al. 2022). While this may limit the scientific usefulness of the PT profiles, it should not affect our ability to use them to demonstrate the applicability of our method as such, that is, to show how we can learn efficient, low-dimensional parameterizations of PT profiles from simulated data.

GOYAL-2020
This dataset was published by Goyal et al. (2020) and consists of 11 293 PT profiles, corresponding to 89 hot Jupiters, each with four recirculation factors, six metallicities, and six C/O ratios. Due to missing data, we manually removed three profiles. The data were simulated with Atmo (Amundsen et al. 2014; Tremblin et al. 2015, 2016; Drummond et al. 2016), a radiative-convective equilibrium model for atmospheres, using N = 51 atmospheric layers. In addition to the PT profiles, the dataset also includes transmission and emission spectra for each atmosphere, as well as over 250 abundance profiles of different chemical species.

Experiments and results
In this section, we describe the experiments that we have performed to test our proposed approach and show our results.

Training and evaluation procedure
We use stochastic gradient descent to train our models (i.e., one pair of neural networks E and D for each combination of a dataset and a value of dim(z)) on a random subset of the datasets described in Sect. 3, and evaluate their performance on another random subset that is disjoint from the training set. For more detailed information about our network architectures, their implementation, and our training procedure, see Appendix B, or take a look at our code, which is available online.

Reconstruction quality
As a first experiment, we study how well our trained model can approximate previously unseen PT profiles from the test set (cf. Appendix B.3), and compare the performance with two baseline methods: (1) low-order polynomials, and (2) a PCA-based approach similar to the method of Schreier et al. (2020).
Setup. For each dim(z) ∈ {1, 2, 3, 4}, we train three instances of our model (using three different random seeds to initialize the neural networks and control the split between training and validation) on the training set. We do not consider higher values of dim(z) here mainly because one of the goals of this work was to demonstrate that our method requires fewer fitting parameters than the baseline methods. Then, we take the learned decoder and, for each profile in the test set, use nested sampling to find the optimal z (denoted z*) to reproduce that profile. In this case, "optimal" means that z* minimizes the mean squared error (MSE):

z* = arg min_z (1/N) Σ_{i=1}^{N} (t_i − D(p_i | z))².

We add this additional optimization step to the evaluation to remove the effect of the encoder: Ultimately, we are only interested in how well D can approximate a given profile for some z, which does not necessarily have to match the output of E for that profile perfectly. We have chosen nested sampling over gradient-based optimization, which is possible given that D is fully differentiable both with respect to p and z, because the latter is generally not available during an AR, unless one uses a differentiable forward simulator (see, e.g., Diff-τ from Hou Yip et al. 2022). Nested sampling, therefore, should give us a more realistic idea of which profiles we can find during a retrieval. Our optimization procedure is based on UltraNest (Buchner 2021), with 400 live points and a truncated standard normal prior, limiting each z_i to [−4, 4] (because L_norm limits ∥z∥ to τ = 3.5).
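To illustrate the fitting step without pulling in a sampler, the following sketch deliberately replaces nested sampling with a plain random search over the prior box [−4, 4]^d; the actual procedure uses UltraNest with 400 live points, as described above. The toy decoder is hypothetical:

```python
import numpy as np

def fit_z(decoder, log_p, t_true, d=2, n_draws=20000, seed=0):
    """Random-search stand-in for the nested-sampling fit: draw z from the
    prior box [-4, 4]^d and keep the MSE-minimizing z*."""
    rng = np.random.default_rng(seed)
    best_z, best_mse = None, np.inf
    for _ in range(n_draws):
        z = rng.uniform(-4.0, 4.0, size=d)
        mse = np.mean((t_true - decoder(log_p, z)) ** 2)
        if mse < best_mse:
            best_z, best_mse = z, mse
    return best_z, best_mse

# Hypothetical decoder: z sets an offset and a slope in log-pressure.
decoder = lambda log_p, z: 300.0 * (1.0 + 0.1 * z[0]) + 30.0 * z[1] * log_p
log_p = np.linspace(0, -6, 51)
t_true = 330.0 - 45.0 * log_p          # exactly reachable at z = (1, -1.5)
z_star, mse = fit_z(decoder, log_p, t_true)
print(mse < 200.0)  # -> True (coarse search, but close to the optimum)
```

The point is the shape of the problem: a low-dimensional search over z, scored by the MSE between the decoded and the target profile.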
For the polynomial baseline, we simply fit each profile in the test set using a polynomial with degree n poly − 1 for n poly ∈ {2, 3, 4, 5}.The minimization objective of the fit is again the MSE.
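The polynomial baseline can be reproduced with `np.polyfit`, which solves the equivalent least-squares (MSE) problem; fitting temperature as a function of log10(P) is an assumption here, as is the toy profile:

```python
import numpy as np

# Toy profile: a quadratic function of log10(P).
log_p = np.linspace(0, -6, 101)
t = 280.0 + 10.0 * log_p + 2.0 * log_p ** 2

rmse = {}
for n_poly in (2, 3, 4, 5):
    coeffs = np.polyfit(log_p, t, deg=n_poly - 1)   # least-squares = MSE fit
    t_fit = np.polyval(coeffs, log_p)
    rmse[n_poly] = np.sqrt(np.mean((t - t_fit) ** 2))

print(rmse[2] > 1.0, rmse[3] < 1e-6)  # quadratic matched from n_poly = 3 on
```

A linear fit (n_poly = 2) leaves a visible residual on this profile, while n_poly ≥ 3 matches it to numerical precision, which illustrates how the baseline's error depends on the number of fitting parameters.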
Finally, for the PCA baseline, we compute a PCA on all the temperature vectors in the training set, and then fit the temperature vectors in the test set as a linear combination of the principal components (PCs), for n PC ∈ {2, 3, 4, 5}, once again minimizing the MSE.We note that this PCA baseline must be taken with a grain of salt, for two reasons.First, it requires us to project all profiles onto a single common pressure grid to ensure that the ith entry of the temperature vectors always has the same interpretation (i.e., temperature at pressure p i ).We do this by linear interpolation.Consequently, the common pressure grid is determined by the intersection of the pressure ranges of all profiles, meaning that profiles may get clipped both at high and low pressures.In practice, this affects in particular the GOYAL-2020 data where, for example, the pressure in the deepest layer varies by more than an order of magnitude between different profiles, whereas for PYATMOS, the profiles all cover a similar pressure range.Second, combining principal components only returns vectors, not functions, meaning that if we want to evaluate a profile at an arbitrary pressure value, we again have to use interpolation.
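A compact sketch of the PCA baseline, assuming all profiles already share a common pressure grid (the interpolation step described above is omitted); the toy training set is an assumption:

```python
import numpy as np

def pca_components(T_train, n_pc):
    """Principal components of mean-centered training temperature vectors."""
    mean = T_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(T_train - mean, full_matrices=False)
    return mean, Vt[:n_pc]

def pca_fit(t, mean, components):
    """Least-squares fit of a test profile as mean + linear combination of PCs."""
    coeffs = components @ (t - mean)      # components are orthonormal rows
    return mean + coeffs @ components

rng = np.random.default_rng(2)
grid = np.linspace(0, -6, 51)
# Toy training set: profiles that vary in offset and slope only.
T_train = rng.uniform(200, 400, (500, 1)) + rng.uniform(-50, 50, (500, 1)) * grid
mean, pcs = pca_components(T_train, n_pc=2)
t_test = 310.0 - 25.0 * grid
t_rec = pca_fit(t_test, mean, pcs)
print(np.max(np.abs(t_rec - t_test)) < 1e-6)  # 2 PCs span this 2-parameter family
```

Because the toy profiles vary along only two directions, two principal components reconstruct the test profile essentially exactly; real PT profiles need more components, which is what Fig. 5 quantifies.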
Results. We show some example profiles and the best-fit reconstructions using our model in Fig. 4. The main results of this experiment, the distribution of the root mean square error (RMSE) on the test set, are found in Fig. 5. The RMSE is simply defined as the square root of the MSE defined in Eq. (3), that is, the root of the mean squared difference between a given temperature vector and its best-fit reconstruction. As Fig. 5 shows, our method achieves a significantly lower reconstruction error than both baseline methods, both for the PYATMOS dataset and for the GOYAL-2020 dataset (see Fig. D.2). We find that on the PYATMOS dataset, even the one-dimensional version of our model (i.e., each profile is parameterized as just a single number) outperforms the polynomial baseline with five fitting parameters, and is approximately on par with the PCA solution using five components.
To account for the absolute scale of the data (i.e., Earth-like temperatures vs. hot Jupiters) and allow comparisons across the two datasets, we also compute the mean relative error (MRE):

MRE = (1/N) Σ_{i=1}^{N} |t_i − t̃_i| / t_i.

We compute this quantity for every profile in the test set, and then take the median of this distribution. For PCA and our method, the medians are then also mean-averaged over the different runs.
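In code, the MRE for a single profile is a one-liner; normalizing by the true temperature t_i is the standard convention and an assumption here, since the exact formula is not reproduced in this extraction:

```python
import numpy as np

def mean_relative_error(t_true, t_pred):
    """MRE: relative temperature error |t_i - t~_i| / t_i, averaged over layers."""
    return np.mean(np.abs(t_true - t_pred) / t_true)

t_true = np.array([300.0, 250.0, 200.0])
t_pred = np.array([303.0, 250.0, 198.0])
print(round(mean_relative_error(t_true, t_pred), 4))  # -> 0.0067
```

Because each layer's error is divided by its own temperature, the metric is comparable between Earth-like (hundreds of K) and hot-Jupiter (thousands of K) profiles.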
The final results are given in Fig. 6. Looking at the mean relative errors, we note that our errors for the GOYAL-2020 dataset are systematically larger than for the PYATMOS dataset. We suspect that this can be explained as the result of two factors: First, the PYATMOS training dataset is ten times larger than the GOYAL-2020 training dataset, and, second, the GOYAL-2020 dataset covers a more diverse set of functions than the PYATMOS dataset.

Interpreting the latent representations of PT profiles in the context of the corresponding atmospheres
In this section, we take a closer look at the latent representations of PT profiles that our models learn and the extent to which they lend themselves to interpretation. We begin with Fig. 7, where we present an illustration of our entire method using a model with dim(z) = 2 that we trained on the GOYAL-2020 dataset. The top part of the figure shows the latent representations of the test-set profiles, whose distribution closely follows the standard Gaussian target, as we would expect, since this was the reason we introduced the L_MMD term to the loss function. The bottom part of the figure illustrates what happens if we place a regular grid on the latent space and, for each z on the grid, evaluate D(· | z) on a fixed pressure vector p′, which does not have to match the p of any profile in the training or test dataset. For example, p′ can have a higher resolution (i.e., more atmospheric layers) than the training data.
In both parts of the figure, we observe that PT profiles with different shapes, implying different physical processes, are mapped to different parts of the latent space, and that similar PT profiles are grouped together in the latent space. Here, "similar" refers not only to the shape of the PT profiles, but also to the properties of the corresponding atmospheres. To see this, we turn to Fig. 8. As mentioned above, both the PYATMOS and GOYAL-2020 datasets contain not only PT profiles, but also various other properties of the corresponding atmospheres, such as concentrations or fluxes of chemical species. In Fig. 8, we show several scatter plots of the latent representations of the test set (i.e., the z we obtained in the first experiment), color-coded according to some of these atmospheric properties, for example, the mean concentration of CO2 in the atmosphere corresponding to a given PT profile. For ease of visualization, we limit ourselves to the case dim(z) = 2. We observe that there are clear patterns in these scatter plots that show a strong correlation between our latent representations and the properties of the corresponding atmospheres, again demonstrating that PT profiles from physically similar atmospheres are grouped together.
With this in mind, we return to the lower half of Fig. 7, where we evaluate D on a grid of z values. We observe that not only do different shapes of PT profiles correspond to different parts of the latent space, but the PT profiles given by D(· | z) also vary smoothly as a function of z. This means that the decoder not only reproduces the real PT profiles that were used to train it, but also allows smooth interpolation between them. Along the axes of the latent space, one can identify specific behaviors: For example, fixing z_2 = 0 and increasing z_1 leads to cooling in the upper atmosphere, while fixing z_1 = 2 and increasing z_2 leads to heating that extends further into the upper layers. Profiles around the center of the latent space (i.e., z_1 ≈ z_2 ≈ 0) are almost isothermal, indicating a relatively uniform temperature distribution throughout the atmosphere. A negative value of z_1 is generally associated with the presence of a thermal inversion, and for these cases, z_2 seems to encode the altitude where the inversion happens. For instance, for z_1 = −2 the profile shows an inversion in the mid-atmosphere (around log10(P) = −2) for z_2 = 2, whereas for z_2 = −2, the inversion only happens at very high altitudes/low pressures around log10(P) = −6.
Finally, we can also draw direct connections between the behavior of the PT profiles in latent space and the correlations of z with some of the atmospheric parameters that we show in Fig. 8. For example, a high concentration of TiO (panel 3 of Fig. 8) corresponds to the hot, inverted atmospheres in the upper left of the latent space. This is consistent with expectations from the literature, where TiO is indeed one of the canonical species proposed to cause temperature inversions in hot Jupiters (e.g., Hubeny et al. 2003; Fortney et al. 2008; Piette et al. 2020).

Overall, this suggests that the latent space is to some extent interpretable: While the latent variables may not correspond directly to known physical parameters -as is the case, for example, for the analytical parameterizations of Madhusudhan & Seager (2009) and Guillot (2010) -their relationships with other variables may provide insight into the behavior of exoplanet atmospheres.
Finally, a small word of caution: While we have just shown that the latent space is in principle interpretable, any specific interpretation -that is, a statement along the lines of "PT profiles with hot upper layers are found in the top left corner of the latent space" -will only hold for one specific model.If we re-train that same model using a different random seed to initialize the neural network, or using a different dataset, we will end up with an equivalent but different latent space.This is because our model is, in general, not identifiable, and there exists no preferred system of coordinates for z: For example, we can rotate or mirror the latent space without losing information, and unless we add additional constraints to the objective function, there is no reason why the model should favor any particular orientation.However, this simply means that the value of z can only be interpreted for a given trained model.It does not, in any way, affect the ability of our model to be used for atmospheric retrievals.

Usage in an atmospheric retrieval
A major motivation for the development of our method is the possibility of using it for atmospheric retrievals. In this experiment, we show how we can easily plug our trained model into an existing atmospheric retrieval pipeline. The target PT profile we use for this experiment is taken from the PYATMOS test set. We made sure that it is covered by the training data (i.e., there are similar PT profiles in the training set); however, in order not to make it too easy for our method, we chose a PT profile shape that is comparatively rare in the training data (see below). In Appendix C, we consider an even more challenging setting and show two additional results for PT profiles that were generated with a different code and that are not covered by our training distribution.
Setup. For our demonstration, we add our trained model to the atmospheric retrieval routine first introduced in Konrad et al. (2022) and developed further in Alei et al. (2022b) and Konrad et al. (2023), which is based on petitRADTRANS (Mollière et al. 2019), LIFEsim (Dannert et al. 2022), and PyMultiNest (Buchner et al. 2014). The goal of this experiment is to compare the PT profile parameterization capabilities of a version of our model (trained on the PYATMOS dataset using dim(z) = 2) with the fourth-order polynomial (i.e., five fitting parameters) used in Konrad et al. (2022) and Alei et al. (2022b). We perform two cloud-free atmospheric retrievals (using either our PT model or the polynomial one) on the simulated thermal emission spectrum of an Earth-like planet orbiting a Sun-like star at a distance of 10 pc, assuming a photon noise signal-to-noise ratio of 10 at a wavelength of λ = 11.2 µm. The wavelength range was chosen as [3 µm, 20 µm], and the spectral resolution was R = λ/∆λ = 200. The number of live points for PyMultiNest was set to 700. In addition to the parameters required for the PT profile, we retrieve 10 parameters of interest; see Table 1 for an overview. This setup closely matches Konrad et al. (2022) and Alei et al. (2022b).
For the retrieval with our model, we use a two-dimensional uniform prior, U²(−4, 4), for z. The reason for choosing this uniform prior instead of a Gaussian is that the shape of the ground-truth PT profile is relatively under-represented in the PYATMOS dataset that we use for training. This means that our trained model, together with a Gaussian prior on z, assigns it a very small prior probability, making it difficult for the retrieval routine to find the correct PT profile. Using a uniform prior is an easy way to give more weight to "rare" profiles. We discuss this choice, as well as potential alternatives, in more detail in Sect. 5.3.
Results. We show the main results of this experiment in the form of the retrieved PT profile and spectrum residuals in Fig. 9. First, we find that the spectrum residuals demonstrate that both retrievals reproduce the simulated spectrum with sufficient accuracy, despite the differences in the PT parameterization. The residuals' quantiles are centered on the truth and are significantly smaller than the photon noise level. Second, for the retrieved PT profiles, we visually find that the result obtained with our model, unlike the polynomial baseline, is in good agreement with the ground truth, except at the highest layers of the atmosphere, where it tends to underestimate the true temperature. We believe this can be explained by the fact that the upper layers of the atmosphere (i.e., low pressures) have little to no effect on the simulated exoplanet spectrum (cf. the emission contribution functions in the right panel of Fig. 9), making it hard for the retrieval routine to constrain the PT structure in the upper atmosphere. Third, the retrieved constraints for the surface pressure P_0 and temperature T_0 are much tighter and more accurate for our model than for the polynomial baseline. Of course, this is also partly because our model represents a much narrower prior over the PT profiles that we are willing to accept as an explanation of the data (i.e., the spectrum) compared to the polynomial. This is an assumption, and in this case, it is justified by construction. In general, however, the decision of what is an appropriate prior for a given atmospheric retrieval problem is beyond the scope of the method as such. We discuss this in more detail in Sect. 5.2. Finally, we note that the retrieval using our model was significantly faster than the baseline: by decreasing the number of parameters required to fit the PT profile from five (polynomial) to two (our model), we reduced the total runtime by a factor of 3.2.

5. Discussion
In this section, we discuss the advantages and limitations of our method and suggest directions for future work.

5.1. Advantages
Gebhard, T. D., et al.: A&A, 681, A3 (2024)

Our approach for parameterizing PT profiles has multiple potential benefits compared to existing approaches. First, our model makes realistic, physically consistent PT profiles, which would otherwise require a computationally expensive solution of the radiative transfer equations, available as an ad hoc fitting function. It works, in principle, for all types of planets (Earth-like, gas giants, etc.), as long as we can create a suitable training dataset through simulations. In this context, "suitable" refers, for example, to the total size of the dataset and its density in parameter space (cf. the single out-of-distribution profile for PYATMOS). Our method then provides a tool that allows using the distribution of the PT profiles in that training set as a prior for the PT profile during a retrieval. Second, during an atmospheric retrieval, the parameterization of our model uses a simple prior for z that does not require hard-to-interpret fine-tuning (see also Sect. 5.3). Third, as we have shown in Sect. 4, our method can fit realistic PT profiles with fewer parameters than the baseline and still achieve a lower fit error. Limiting the number of parameters needed to express a PT profile during a retrieval can lead to faster processing times or, conversely, allow the retrieval of additional parameters when operating within a fixed computational budget. We would like to emphasize at this point that accelerating atmospheric retrievals is not only relevant for the analysis of existing observational data (e.g., from the James Webb Space Telescope), but can also be beneficial during the planning stages of future exoplanet science missions, such as LIFE (Quanz et al. 2022) or the recently announced Habitable Worlds Observatory (Clery 2023), which require the simulation of thousands of retrievals.
Finally, looking to the future, it seems plausible that approaches that replace the entire Bayesian inference routine with machine learning may gain further traction in the context of atmospheric retrievals (see, e.g., Márquez-Neila et al. 2018, Soboczenski et al. 2018, Cobb et al. 2019, Ardevol Martinez et al. 2022, Hou Yip et al. 2022, or Vasist et al. 2023). In this case, reducing the number of retrieval parameters may no longer be a major concern: once trained, the inference time of these models is relatively independent of the number of parameters. However, even in this case, our approach will still be useful to provide a parameterized description of the posterior over physically consistent PT profiles.

5.2. The role of the training dataset
The practical value of our method stands or falls with the dataset that is used to train it. This should come as no surprise, but there are a few aspects of this that we discuss in more detail here.
First, it is important to understand that the training dataset defines the distribution over functions to which our model learns to provide parameterized access. When using our model in an atmospheric retrieval, this distribution then becomes the prior over the set of PT profiles to be considered. Compared to, for example, low-order polynomials, this is a much narrower prior: while polynomials can fit virtually anything (this is the idea behind a Taylor series), our model will only produce outputs that resemble the PT profiles it encountered during training. If this is a good fit for a given target profile, we can expect to significantly outperform the broader prior of polynomials. Conversely, if our prior does not cover the target profile, we cannot expect good results. However, this is not a limitation of our method in particular, but is true for all approaches that use a parameterization for the PT profile instead of, for example, a level-by-level radiative transfer approach. As Line et al. (2013) explain, using "[a] parameterization does force the retrieved atmospheric temperature structure to conform only to the profile shapes and physical approximations allowed by that parameterization." Thus, as always in Bayesian inference, the decision to use our method as a prior for the PT profile (rather than, e.g., low-order polynomials) represents an assumption about the space of possible explanations for the data (e.g., the observed spectrum) that one is willing to accept, and it is up to the scientist performing an atmospheric retrieval to decide whether that assumption is justified.
Finally, it is important to keep in mind that our model will learn to replicate any anomalies and imbalances present in the training data. For example, if atmospheres without inversions are over-represented in the training data (compared to atmospheres with inversions), our learned function distribution will also place a higher probability on PT profiles without inversions, which will obviously affect the results of a retrieval.

5.3. Choosing a prior distribution for z
The last point of the previous section is closely related to the question of which prior one should choose for z during a retrieval. The natural choice, of course, is a d-dimensional standard Gaussian prior for z, as that is the distribution that we encourage during training through the L_MMD loss. However, there are a few additional considerations to keep in mind here. First, using a Gaussian prior for z means that the induced distribution over PT profiles, given by D(· | z), will (approximately) match the distribution of the training dataset. This is a good choice if the training dataset itself constitutes a good prior for the PT profiles one wants to consider during a retrieval. On the other hand, consider the case mentioned above where the training dataset contains all the expected shapes, but is highly biased towards one of the modes, not because this reflects the (assumed) true distribution of atmospheres, but as an artifact of the dataset generation process. In this case, it may make sense to choose a non-Gaussian prior (e.g., a uniform distribution) for z to compensate for this imbalance. When doing so, it is important to ensure that z stays in the value range that the decoder D encountered during training (i.e., ∥z∥ < τ). Another (and perhaps more principled) way to solve this problem is to resample the training dataset to balance the different modes (e.g., "inversion" and "no inversion"). For our experiment in Sect. 4.4, we have validated that such a resampling-based approach does indeed work, but we have chosen not to include all the details of this additional experiment so as not to distract from the main message of our paper.
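The resampling-based balancing mentioned above can be sketched as follows. The function name, mode labels, and inverse-frequency weighting are illustrative assumptions, since the details of this additional experiment are not included here:

```python
import numpy as np

def balance_modes(profiles, labels, rng=None):
    """Resample a training set so that each mode (e.g., 'inversion' vs.
    'no inversion') is equally represented in expectation.

    `profiles` is an (n, ...) array of PT profiles and `labels` an (n,)
    array of mode labels. Each profile is drawn with probability inversely
    proportional to the size of its mode.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    counts = {mode: np.sum(labels == mode) for mode in np.unique(labels)}
    weights = np.array([1.0 / counts[m] for m in labels])
    weights /= weights.sum()
    idx = rng.choice(len(labels), size=len(labels), replace=True, p=weights)
    return profiles[idx], labels[idx]
```

After resampling, a standard Gaussian prior on z no longer implicitly down-weights the rarer mode.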
Finally, there is the fact that our trained model not only returns physically consistent PT profiles, but also allows smooth interpolation (or, to some extent, extrapolation) between them. In cases where this is not desired, for example because one wants to be sure that only PT profiles very close to the training dataset are produced, one can choose a data-driven prior distribution for z. A simple way to achieve this is to train a normalizing flow (see, e.g., Kobyzev et al. 2021 or Papamakarios et al. 2021 for recent introductions) to match the distribution of z's obtained from the training dataset. If we then use the normalizing flow to sample z, we will almost perfectly reproduce the distribution of the training dataset. However, this can also exacerbate some of the problems discussed earlier, for example, if the training data is unbalanced.
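As a rough sketch of such a data-driven prior, one can fit a density model to the latent codes of the training set and then sample from it. Here we use a Gaussian kernel density estimate from SciPy as a lightweight stand-in for a normalizing flow; the toy `z_train` below is randomly generated, not an actual set of encoder outputs:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for a normalizing flow: fit a kernel density estimate to the
# latent codes z of the training set and sample new z's from it. A flow
# (e.g., RealNVP) would play the same role with a more expressive density.
rng = np.random.default_rng(42)
z_train = rng.normal(size=(2, 5000))  # toy latent codes, shape (d, n)

kde = gaussian_kde(z_train)
z_samples = kde.resample(1000, seed=0)   # new latent samples, shape (d, 1000)
log_prob = kde.logpdf(z_samples[:, :5])  # density is available for reweighting
```

A flow additionally provides exact, differentiable log-densities, which the KDE stand-in only approximates.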

5.4. Directions for future research
We see several directions for future work. For example, to account for the fact that not all parts of the atmosphere contribute equally to the observed spectrum, one possible approach could be to use a modified reconstruction loss that is weighted by the respective contribution function for each point in the profile. In the case of emission spectra, this could focus the expressiveness of the model on regions of higher pressure, which have a greater influence on the spectrum, rather than on regions at high altitudes, where the thermal structure of the atmosphere is difficult to constrain from the available observational data. For transmission spectra, which probe the upper layers of the atmosphere, one would do the opposite and give more weight to high altitudes.
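A contribution-weighted reconstruction loss of this kind could look like the following sketch; the function name and the normalization choice are illustrative assumptions, not an implementation from this work:

```python
import numpy as np

def weighted_mse(t_true, t_pred, contribution):
    """Reconstruction loss weighted by the emission contribution function.

    Layers that contribute more to the observed spectrum (typically higher
    pressures for emission spectra) receive a larger weight, focusing the
    model's expressiveness where the data can actually constrain it.
    """
    w = np.asarray(contribution, dtype=float)
    w = w / w.sum()  # normalize so the scale is comparable to a plain MSE
    return float(np.sum(w * (np.asarray(t_true) - np.asarray(t_pred)) ** 2))
```

With uniform weights, this reduces to the ordinary mean squared error; for transmission spectra, one would simply supply weights that peak at low pressures instead.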
Another promising direction could be to extend our method to parameterize not only temperature as a function of pressure, but also chemical abundance profiles (e.g., the concentration of CO 2 in each layer), to make them accessible during a retrieval.
Finally, given a sufficiently large grid of training data, our approach can also be used to parameterize the thermal structure of 3-dimensional atmospheres, where temperature depends not only on pressure but also on the longitude and latitude.

6. Summary and conclusion
In this work, we have introduced a new approach to parameterizing PT profiles in the context of atmospheric retrievals of exoplanets. Unlike existing approaches to this problem, we do not make any a priori assumptions about the family of functions that describe the relation between P and T, but instead learn a distribution over such functions from data. In our framework, PT profiles are represented by low-dimensional real vectors that are used to condition a particular neural network, which, once conditioned, is the corresponding PT profile (i.e., a single-argument, single-output function that maps pressure onto temperature values).
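The conditioning mechanism can be made concrete with a minimal sketch: a small network takes a (log-)pressure value together with the latent code z, and fixing z yields an ordinary single-argument function P → T. The layer sizes and random weights below are arbitrary placeholders, not the architecture used in this work:

```python
import numpy as np

# Hypothetical toy decoder: a tiny MLP mapping (log-pressure, z) -> temperature.
rng = np.random.default_rng(0)
d_latent, hidden = 2, 16
W1 = rng.normal(size=(hidden, 1 + d_latent)); b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(1, hidden)); b2 = rng.normal(size=1)

def decoder(log_p, z):
    """Evaluate the decoder network at one pressure value, conditioned on z."""
    x = np.concatenate([[log_p], z])
    h = np.tanh(W1 @ x + b1)
    return float((W2 @ h + b2)[0])

# Conditioning on a fixed z turns the network into a PT profile, i.e., a
# single-argument function of (log-)pressure:
z = np.array([0.3, -1.2])
f_z = lambda log_p: decoder(log_p, z)
temperatures = [f_z(lp) for lp in np.linspace(-6, 0, 5)]
```

During a retrieval, only z changes from one proposal to the next, while the network weights stay fixed.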
We experimentally demonstrated that our proposed method works for both terrestrial planets and hot Jupiters. When compared with existing methods for parameterizing PT profiles, we found that our method gives better approximations on average (i.e., smaller fitting errors) while requiring a smaller number of parameters. Furthermore, we showed that our learned latent spaces are still interpretable to some extent: for example, the latent representations are often directly correlated with other properties of the atmospheres, meaning that physically similar profiles are grouped together in the latent space. Finally, we have demonstrated by example that our method can be easily integrated into an existing atmospheric retrieval framework, where it resulted in a significant speedup in terms of inference time and produced a tighter and more accurate posterior for the PT profile than the baseline.
Given access to a sufficient training dataset, we believe that our method has the potential to become a valuable tool for exoplanet characterization: not only can it allow the use of physically consistent PT profiles during a retrieval, but it can also reduce the number of parameters required to model the thermal structure of an atmosphere. This can either reduce the overall computational cost or free up resources for other retrieval parameters of interest, allowing for more efficient and accurate atmospheric retrievals, which in turn could improve our understanding of exoplanetary habitability and the potential for extraterrestrial life.

Fig. 2.
Fig. 2. Schematic illustration of our model: during training, the encoder network E maps a PT profile (consisting of a vector p = (p_1, ..., p_N) of pressure values and a vector t = (t_1, ..., t_N) of corresponding temperature values) onto a latent representation z ∈ R^d. This latent code is then used to condition a decoder network D, and D(· | z) is evaluated at each p_i to obtain the corresponding predicted temperature. During an atmospheric retrieval, z is proposed by the Bayesian inference method (e.g., nested sampling).

Fig. 3.
Fig. 3. All the PT profiles in our two datasets. Each segment of a line, that is, the connection between the points (p_i, t_i) and (p_{i+1}, t_{i+1}), is color-coded by the density of the profiles at its respective (p, t) coordinate, which was obtained through a 2D KDE. The green line in the PYATMOS plot shows the one PT profile that we manually removed for being out-of-distribution (https://github.com/timothygebhard/ml4ptp/blob/main/scripts/plotting/fig-3-datasets/plot-dataset.py).

Fig. 5.
Fig. 5. Distributions of the reconstruction error (RMSE) for a polynomial baseline, a PCA-based baseline, and our method, for different numbers of fitting parameters. For our model, as well as for the PCA baseline, each distribution is the average of three runs using different random seeds controlling the initialization and the train/validation split. The dashed lines indicate the respective medians. Results are shown for the PYATMOS dataset; additional plots for the GOYAL-2020 dataset are found in Fig. D.2 (https://github.com/timothygebhard/ml4ptp/blob/main/scripts/plotting/fig-5-error-distributions/plot-error-distributions.py).

Fig. 8.
Fig. 7. A visual illustration of the encoder E, the decoder D, and their connection via the latent space. See Sect. 4.3 in the main text for a detailed explanation. This figure is best viewed in color (https://github.com/timothygebhard/ml4ptp/blob/main/scripts/plotting/fig-7-overview/create-plots.py).

Fig. 9.
Fig. 9. Results for our simulated atmospheric retrieval of an Earth-like planet using petitRADTRANS, LIFEsim, and PyMultiNest. Left: the retrieved PT profiles for the polynomial baseline and our model, together with the emission contribution function. Right: the relative fitting error of the spectrum, computed as (true spectrum − simulated spectrum)/true spectrum.
Fig. 6. Comparison of the median test set MRE for the different methods and datasets (lower is better). Reminder: for each profile, we compute the mean relative error (over all atmospheric layers) and then aggregate by computing the median over all profiles. Finally, for the PCA baseline as well as for our method, the results are also mean-averaged over different random seeds (https://github.com/timothygebhard/ml4ptp/blob/main/scripts/plotting/fig-6-mre-comparison/plot-mre-comparison.py).
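The aggregation described in this caption can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def median_test_set_mre(t_true, t_pred):
    """For each profile, compute the mean relative error over all
    atmospheric layers, then take the median over all profiles.

    `t_true`, `t_pred`: temperature arrays of shape (n_profiles, n_layers).
    """
    rel_err = np.abs(t_pred - t_true) / np.abs(t_true)
    mre_per_profile = rel_err.mean(axis=1)  # one MRE per profile
    return float(np.median(mre_per_profile))
```

The median over profiles makes the summary statistic robust to a few poorly fitted outlier profiles.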

Table 1.
Overview of all parameters in our simulated atmospheric retrieval. Notes. This table includes the respective priors, true values, and retrieved values for both PT profile parameterization schemes. For the latter, we report the median as well as the 2.5% and 97.5% percentiles.