Data-Driven Equation Discovery of a Cloud Cover Parameterization

A promising method for improving the representation of clouds in climate models, and hence climate projections, is to develop machine learning-based parameterizations using output from global storm-resolving models. While neural networks can achieve state-of-the-art performance within their training distribution, they can make unreliable predictions outside of it. Additionally, they often require post-hoc tools for interpretation. To avoid these limitations, we combine symbolic regression, sequential feature selection, and physical constraints in a hierarchical modeling framework. This framework allows us to discover new equations diagnosing cloud cover from coarse-grained variables of global storm-resolving model simulations. These analytical equations are interpretable by construction and easily transferable to other grids or climate models. Our best equation balances performance and complexity, achieving a performance comparable to that of neural networks ($R^2=0.94$) while remaining simple (with only 11 trainable parameters). It reproduces cloud cover distributions more accurately than the Xu-Randall scheme across all cloud regimes (Hellinger distances $<0.09$), and matches neural networks in condensate-rich regimes. When applied and fine-tuned to the ERA5 reanalysis, the equation exhibits superior transferability to new data compared to all other optimal cloud cover schemes. Our findings demonstrate the effectiveness of symbolic regression in discovering interpretable, physically-consistent, and nonlinear equations to parameterize cloud cover.


Introduction
Due to computational constraints, climate models used to make future projections spanning multiple decades typically have horizontal resolutions of 50-100 km (Eyring et al., 2021).The coarse resolution necessitates the parameterization of many subgrid-scale processes (e.g., radiation, microphysics), which have a significant effect on model forecasts (Stensrud, 2009).Climate models, such as the state-of-the-art ICOsahedral Nonhydrostatic (ICON) model, exhibit long-standing systematic biases, especially related to cloud parameterizations (Crueger et al., 2018;Giorgetta et al., 2018).A fundamental component of the cloud parameterization package in ICON is its cloud cover scheme, which, in its current form, diagnoses fractional cloud cover from large-scale variables in every grid cell (Giorgetta et al., 2018;Mauritsen et al., 2019).As cloud cover is directly used in the radiation (Pincus & Stevens, 2013) and cloud microphysics (Lohmann & Roeckner, 1996) parameterizations of ICON, its estimate directly influences the energy balance and the statistics of water vapor, cloud ice, and cloud water.The current cloud cover scheme in ICON, based on Sundqvist et al. (1989), nevertheless makes some crude empirical assumptions, such as a near-exclusive emphasis on relative humidity (see Grundner et al. (2022) for further discussion).These assumptions may impede the search for a parameterization that faithfully captures the available data.
With the extended availability of high-fidelity data and increasingly sophisticated machine learning (ML) methods, ML algorithms have been developed for the parameterization of clouds and convection (e.g., Brenowitz and Bretherton (2018); Gentine et al. (2018); Krasnopolsky et al. (2013); O' Gorman and Dwyer (2018); see reviews by Beucler et al. (2023) and Gentine et al. (2021)).High-resolution atmospheric simulations on stormresolving scales (horizontal resolutions of a few kilometers) resolve deep convective processes explicitly (Weisman et al., 1997), and provide useful training data with an improved physical representation of clouds and convection (Hohenegger et al., 2020;Stevens et al., 2020).There are only few approaches that learn parameterizations directly from observations (e.g., McCandless et al. (2022)), as these are challenged by the sparsity and noise of observations (Rasp et al., 2018;Trenberth et al., 2009).Therefore, a two-step process might be required, in which the statistical model structure is first learned on high-resolution modeled data before its parameters are fine-tuned on observations (transfer learning), leveraging the advantage of the consistency of the modeled data for the initial training stage before having to deal with noisier observational data.
Neural networks and random forests have been routinely used for ML-based parameterizations.Unlike traditional regression approaches, they are not limited to a particular functional form provided by combining a set of basis functions.They are usually fast at inference time and can be trained with very little domain knowledge.However, this versatility comes at the cost of interpretability as explainable artificial intelligence (XAI) methods still face major challenges (Kumar et al., 2020;Molnar et al., 2021).Given this limitation, we ask: Can we create data-driven cloud cover schemes that are interpretable by construction without renouncing the high data fidelity of neural networks?
Here, we use a hierarchical modeling approach to systematically derive and evaluate a family of cloud cover (interpreted as the cloud area fraction) schemes, ranging from traditional physical (but semi-empirical) schemes and simple regression models to neural networks.We evaluate them according to their Pareto optimality (i.e., whether they are the best performing model for their complexity).To bridge the gap between simple equations and high-performance neural networks, we apply equation discovery in a datadriven manner using state-of-the-art symbolic regression methods.In symbolic regression, as opposed to regular regression, the user first specifies a set of mathematical operators instead of a set of basis functions.For instance, including division as a mathematical operator may introduce rational nonlinearities, whose ubiquity and importance have been illustrated, e.g., in Kaheman et al. (2020).Based on these operators, the symbolic regression library creates a random initial population of equations (Schmidt & Lipson, 2009).Inspired by the process of natural selection in the theory of evolution, symbolic regression is usually implemented as a genetic algorithm that iteratively applies genetically motivated operations (selection, crossover, mutation) to the set of candidate equations.At each step, the equations are ranked based on their performance and simplicity, so that the top equations can be selected to be included in the next population (Smits & Kotanchek, 2005).Advantages of training/discovering analytical models instead of neural networks include an immediate view of model content (e.g., whether physical constraints are satisfied) and the ability to analyze the model structure directly using powerful mathematical tools (e.g., perturbation theory, numerical stability analysis).Additionally, analytical models are straightforward to communicate to the broader scientific community, to implement numerically, and fast to execute given the existence of optimized implementations of well-known functions.
To our knowledge, Zanna and Bolton (2020) marks the first usage of automated, data-driven equation discovery for climate applications.Training on highly idealized data, they used a sparse regression technique called relevance vector machine to find an analytical model that parameterizes ocean eddies.In sparse regression, the user defines a library of terms, and the algorithm determines a linear combination of those terms that best matches the data while including as few terms as possible (Brunton et al., 2016;Rudy et al., 2017;Zhang & Lin, 2018;Champion et al., 2019).In a follow-up paper, Ross et al. (2023) employed symbolic regression to discover an improved equation, again trained on idealized data, that performs similarly well as neural networks across various metrics and has greater generalization capability.Nonetheless, they had to assume that the equation was linear in terms of its free/trainable parameters and additively separable as their method included an iterative approach to select suitable terms.For the selection of terms, they took a human-in-the-loop approach rather than solely relying on the genetic algorithm.Additionally, the final discovered equation relied on high-order spatial derivatives, which may not be feasible to compute in a climate model.To prevent this issue, we only permit features we can either access or easily derive in the climate model.
Guiding questions for this study include: Using symbolic regression, can we automatically discover a physically consistent equation for cloud cover whose performance is competitive with that of neural networks?Given that modern symbolic regression libraries can handle higher computational overhead, we want to relax prior assumptions of linearity or separability of the equation.Then, what can we learn about the cloud cover parameterization problem by sequentially selecting performance-maximizing features in different predictive models?Finally, how much better do simple models generalize and/or transfer to more realistic data sets?
We first introduce the data sets used for training, validation and testing (Sec 2), the diverse data-driven models used in this study (Sec 3), and evaluation metrics (Sec 4), before studying the feature rankings, performances and complexities of the different models (Sec 5.1).We investigate their ability to reproduce cloud cover distributions (Sec 5.2), transfer to higher resolutions (Sec 5.3), and adapt to the ERA5 reanalysis (Sec 5.4).We conclude with an analysis of the best analytical model we found using symbolic regression (Sec 6).

Data
In this section, we introduce the two data sets used to train and benchmark our cloud cover schemes: We first use storm-resolving ICON simulations to train high-fidelity models (Sec 2.1), before testing these models' transferability to the ERA5 meteorological reanalysis, which is more directly informed by observations (Sec 2.2).

Global Storm-Resolving Model Simulations (DYAMOND)
As the source for our training data, we use output from global storm-resolving ICON simulations performed as part of the DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains (DYAMOND) project.The project's first phase ('DYAMOND Summer') included a simulation starting from August 1, 2016 (Stevens et al., 2019), while the second phase ('DYAMOND Winter') was initialized on January 20, 2020 (Duras et al., 2021).In both phases, the ICON model simulated 40 days, providing three-hourly output on a grid with a horizontal resolution of 2.47 km.
Following the methodology of Grundner et al. (2022), we coarse-grain the DYA-MOND data to an ICON grid with a typical climate model horizontal grid resolution of ≈ 80 km.Vertically, we coarse-grain the data from 58 to 27 layers below an altitude of 21 km, which is the maximum altitude with clouds in the data set.For cloud cover, we first estimate the vertically maximal cloud cover values in each low-resolution grid cell before horizontally coarse-graining the resulting field.For all other variables, we take a three-dimensional integral over the high-resolution grid cells overlapping a given low-resolution grid cell.For details, we refer the reader to Appendix A of Grundner et al. (2022).Due to the sequential processing of some parameterization schemes in the ICON model, condensatefree clouds can occur in the simulation output.To instead ensure consistency between cloud cover and the other model variables, we follow Giorgetta et al. (2022) and manually set the cloud cover in the high-resolution grid cells to 100% when the cloud condensate mixing ratio exceeds 10 −6 kg/kg and to 0% otherwise.
We remove the first ten days of 'DYAMOND Summer' and 'DYAMOND Winter' as spin-up, and discard columns that contain NaNs (3.15% of all columns).From the remainder, we keep a random subset of 28.5% of the data, while removing predominantly cloud-free cells to mitigate a class imbalance in the output ('undersampling' step).We then split the data into a training and a validation set, the latter of which is used for early stopping.To avoid high correlations between the training and validation sets, we divide the data set into six temporally connected parts.We choose the union of the second (≈ Aug 21-Sept 1, 2016) and the fifth (≈ Feb 9-Feb 19, 2020) part to create our validation set.For all models except the traditional schemes, we additionally normalize models' features (or 'inputs') so that they have zero mean and unit variance on the training set.
We define a set of 24 features F that the models (discussed in Sec 3) can choose from.For clarity, we decompose F into three subsets: q c , q i , T, p, RH} groups the horizontal wind speed U [m/s] and thermodynamic variables known to influence cloud cover, namely specific humidity q v [kg/kg], cloud water and ice mixing ratios q c [kg/kg] and q i [kg/kg], temperature T [K], pressure p [P a] , and relative humidity RH with respect to water, approximated as: The second subset F 2 contains the first and second vertical derivatives of all features in F 1 .These derivatives are computed by fitting splines to every vertical profile of a given variable and differentiating the spline at the grid level heights to obtain derivatives on the irregular vertical grid.Finally, the third subset F 3 def = {z, land, p s } includes geometric height z [m] and the only two-dimensional variables, i.e., land fraction and surface pressure p s [P a].
In Grundner et al. (2022) we found it sufficient to diagnose cloud cover using information from the close vertical neighborhood of a grid cell.By utilizing vertical derivatives to incorporate this information, we ensure the applicability of our cloud cover schemes to any vertical grid.Since our feature set F contains all features appearing in our three baseline 'traditional' parameterizations (see Sec 3.1), we deem it comprehensive enough for the scope of our study.

Meteorological Reanalysis (ERA5)
To test the transferability of our cloud cover schemes to observational data, we also use the ERA5 meteorological reanalysis (Hersbach et al., 2018).We sample the first day of each quarter in 1979-2021 at a three-hourly resolution.The days from 2000-2006 are taken from ERA5.1, which uses an improved representation of the global-mean temperatures in the upper troposphere and stratosphere.Depending on the ERA5 variable, they are either stored on an N320 reduced Gaussian (e.g., for cloud cover) or a T639 spectral (e.g., for temperature) grid.Using the CDO package (Schulzweida, 2019), we first remap all relevant variables to a regular Gaussian grid, and then to the unstructured ICON grid described in Sec 2.1.Vertically, we coarse-grain from approximately 90 to 27 layers.
The univariate distributions of important features such as cloud water and ice do not match between the (coarse-grained) DYAMOND and (processed) ERA5 data.The maximal cloud ice values that are attained in the ERA5 data set are twice as large as in the DYAMOND data.We illustrate this in Fig 1, next to a comparison of the distributions of cloud water, relative humidity and temperature.Due to differences in the distributions of cloud ice, cloud water and relative humidity, we consider our processed ERA5 data a challenging data set to generalize to.

Data-Driven Modeling
We now introduce a family of data-driven cloud cover schemes.We adopt a hierarchical modeling approach and start with models that are interpretable by construction, i.e., linear models, polynomials, and traditional schemes.As a second step, we mostly focus on performance and therefore train deep neural networks (NNs) on the DYAMOND data.To bridge the gap between the best-performing and most interpretable models, we use symbolic regression to discover analytical cloud cover schemes from data.These schemes are complex enough to include relevant nonlinearities while remaining interpretable.

Existing Schemes
We first introduce three traditional diagnostic schemes for cloud cover and train them using the BFGS (Nocedal & Wright, 1999) and Nelder-Mead (Gao & Han, 2012) unconstrained optimizers (which outperform grid search methods in our case), each time choosing the model that minimizes the mean squared error (MSE) on the validation set.Before doing so, we multiply the output of each of the three schemes by 100 to obtain percent cloud cover values.The first is the Sundqvist scheme (Sundqvist et al., 1989), which is currently implemented in the ICON climate model (Giorgetta et al., 2018).The Sundqvist scheme expresses cloud cover as a monotonically increasing function of relative humidity.It assumes that cloud cover can only exist if relative humidity exceeds a critical relative humidity threshold RH 0 , which itself is a function of the fraction between surface pressure and pressure: If then the Sundqvist cloud cover is given by The Sundqvist scheme has four tunable parameters {RH 0,surf , RH 0,top , RH sat , n}.As properly representing marine stratocumulus clouds in the Sundqvist scheme might require a different treatment (see, e.g., Mauritsen et al. (2019)), we allow these parameters to differ between land and sea, which we separate using a land fraction threshold of 0.5.
The second scheme is a simplified version of the Xu-Randall scheme (Xu & Randall, 1996), which was found to outperform the Sundqvist scheme on CloudSat data (Wang et al., 2023).It additionally depends on cloud water and ice, ensuring that cloud cover is 0 in condensate-free grid cells.It can be formulated as The Xu-Randall scheme has only two tuning parameters: {α, β}.
The third scheme was introduced in Teixeira (2001) for subtropical boundary layer clouds.Teixeira arrived at a diagnostic relationship for cloud cover by equating a cloud production term from detrainment and a cloud erosion term from turbulent mixing with the environment.We can express the Teixeira scheme as where RH def = min{RH, 1 − 10 −9 } bounds relative humidity to 1 − 10 −9 to ensure reasonable asymptotics, q s = q s (T, p) is the saturation specific humidity (Lohmann et al., 2016), and {D, K} are the detrainment rate and the erosion coefficient, which are the two tuning parameters of the Teixeira scheme.
Besides those three traditional schemes, we additionally train the three neural networks (cell-, neighborhood-, and column-based NNs) from Grundner et al. (2022) on the DYAMOND data.These three NNs receive their inputs either from the same grid cell, the vertical neighborhood of the grid cell, or the entire grid column.Thus, they differ in the amount of vertical locality that is assumed for cloud cover parameterization.As the 'undersampling step' has to be done at a cell-based level, we omit it when pre-processing the training data for the column-based NN.Nevertheless, the column-based NN is evaluated on the same validation set as all other models.Now that we have introduced three semi-empirical cloud cover schemes, which can be used as baselines, we are ready to derive a hierarchy of data-driven cloud cover schemes.

Developing Parsimonious Models via Sequential Feature Selection
Our goal is to develop parameterizations for cloud cover that are not only performant, but also simple and interpretable.Providing many, possibly correlated features to a model may needlessly increase its complexity and allow the model to learn spurious links between its inputs and outputs (Nowack et al., 2020), impeding both interpretability (Molnar, 2020) and generalizability (Brunton et al., 2016).Therefore, we instead seek parsimonious models.As our feature selection algorithm we use (forward) sequential feature selection (SFS).

Sequential Feature Selection
SFS starts without any features and carefully selects and adds features to a given type of model (e.g., a second-order polynomial) in a sequential manner.At each iteration, SFS selects the feature that optimizes the model's performance on a computationally feasible subset of the training set, which is sufficiently large to ensure robustness (see also Sec 2.1).More specifically; let F contain all potential features of a model (type) M .Let us further assume that the SFS approach has already chosen n features P n ⊆ F at a given iteration (note that P 0 := ∅).In the next iteration, the SFS method adds another feature P n+1 = P n ∪ { f }, such that f ∈ F \ P n maximizes the model's performance as measured by the R 2 -value.Thus, the SFS method tests whether indeed holds on the training subset for all features ĝ ∈ F\P n .With the SFS approach, we discourage the choice of correlated features and enforce sparsity by selecting a controlled number of features that already lead to the desired performance.However, if two highly correlated features are both valuable predictors (as will be the case with RH and ∂ z RH), the SFS NN would pick them nonetheless.Another benefit is that by studying the order of selected variables, optionally with the corresponding performance gains, we can gather intuition and physical knowledge about the task at hand.On the way, we will obtain an approximation of the best-performing set of features for a given number of features.There is however no guarantee of it truly being the best-performing feature set due to the greedy nature of the feature selection algorithm, which decreases its computational cost.Due to the high cost, we could only verify that the models would pick the same first two features (or four features in the case of the linear model) using a non-greedy selector.However, we found that for some random data subsets the second-order polynomial temporarily outperforms the third-order polynomial due to the earlier pick of a third-order feature that decreased the score later on.

Linear Models and Polynomials
We allow first-order (i.e., linear models), second-order, and third-order polynomials.For each of these model types, we run SFS using the SequentialFeatureSelector of scikit-learn (Pedregosa et al., 2011).In the case of linear models, the pool of features F 1 to choose from is precisely F (see Sec 2.1).For second-order polynomials, F 2 also includes second-degree monomials of the features in F, i.e., Analogously we also consider third-degree monomials in the case of third-order polynomials.Thus, the set of possible terms grows from 25 to 325 for the second-order and would grow to 2925 for the third-order polynomials.However, to circumvent memory issues for the third-order polynomials, we restrict the pool of possible features to combinations of the ten most important features.The choice of these ten features is informed by the SFS NNs (Sec 3.2.3),which are able to select informative features for nonlinear models.In addition to these ten features, we also incorporate air pressure to later classify samples into physically interpretable cloud regimes.To be specific, this implies that By considering combinations of only eleven features, we reduce the total amount of possible terms from 2925 to 364.After obtaining sequences of selected features for each of the three model types, we fit sequences of models with up to ten features each using ordinary least squares linear regression.

Neural Networks
We train a sequence of SFS NNs with up to ten features using the "mlxtend" Python package (Raschka, 2018).As in the case of the linear models, the pool of possible features is F. We additionally train an NN with all 24 features in F for comparison purposes.As our regression task is similar in nature (including the vertical locality assumptions it makes for the features), we use the "Q3 NN" model architecture from Grundner et al. (2022) for all SFS NNs."Q3 NN"'s architecture has three hidden layers with 64 units each; it uses batch normalization and its loss function includes L 1 and L 2 -regularization terms following hyperparameter optimization.After deriving the sequence of ten features on small training data subsets (see Sec 5.1.1)we train the final SFS NNs on the entire training data set, always limiting the number of training epochs to 25 and making use of early stopping.Without the greedy assumption of the SFS approach we would already need to test more than 2000 NNs for three features.
Due to the flexibility of NNs, when combining SFS with NNs, we obtain a sequence of features that is not bound to a particular model structure.In Sec 3.2.2 and 3.3, we therefore reuse the SFS NN feature rankings for other nonlinear models to restrict their set of possible features.The combination of SFS with NNs also yields a tentative upper bound on the accuracy one can achieve with N features: If we assume that i) SFS provides the best set of features for a given number of features N ; and ii) the NNs are able to outperform all other models given their features, one would not be able to outperform the SFS NNs with the same number of features.Even though the assumptions are only met approximately, we still receive helpful upper bounds on the performance of any model with N features.

Symbolic Regression Fits
To improve upon the analytical models of Sec 3.1 and 3.2.2without compromising interpretability, we use recently-developed symbolic regression packages.We choose the PySR (Cranmer, 2020) and the default GP-GOMEA (Virgolin et al., 2021) libraries, which are both based on genetic programming.GP-GOMEA is one of the best symbolic regression libraries according to SRBench, a symbolic regression benchmarking project that compared 14 contemporary symbolic regression methods (La Cava et al., 2021).PySR is a very flexible, efficient, well-documented, and well-maintained library.In PySR, we choose a large number of potential operators to enable a wide range of functions (see Appendix C for details).We also tried AIFeynman and found that its underlying assumption that one could learn from the NN gradient was problematic for less idealized data.Other promising packages from the SRBench competition, such as DSR/DSO and (Py)Operon, are left for future work.PySR and GP-GOMEA can only utilize a very limited number of features.Regardless of the number of features we provide, GP-GOMEA only uses 3− 4, while PySR uses 5−6 features.For this reason, PySR also has a built-in tree-based feature selection method to reduce the number of potential features.Since the SFS NNs from Sec 3.2.3already provide a sequence of features that can be used in general, non-linear cases, we instead select the first five of these features to maximize comparability between models.The decision to run PySR with five features is also motivated by the good performance (R 2 > 0.95) of the corresponding SFS NN (see Sec 5.1.2).Each run of the PySR or GP-GOMEA algorithms adds new candidates to the list of final equations.From ≈ 600 of resulting equations, we select those that have a good skill (R 2 > 0.9), are interpretable, and satisfy most of the physical constraints that we define in the following section.The search itself is performed on the normalized training data (see also Sec 2.1).As a final step, we refine the free parameters in the equation using the Nelder-Mead and BFGS optimizers (as in Sec 3.1).

Physical Constraints
To facilitate their use, we postulate that simple equations for cloud cover C(X) ought to satisfy certain physical constraints (Gentine et al., 2021;Kashinath et al., 2021): 1) The cloud cover output should be between 0 and 100%; 2) an absence of cloud condensates should imply an absence of clouds; 3-5) when relative humidity or the cloud water/ice mixing ratios increase (keeping all other features fixed), then cloud cover should not decrease; 6) cloud cover should not increase when temperature increases; 7) the function should be smooth on the entire domain.We can mathematically formalize these physical constraints (PC): While these physical constraints are intuitive, they will not be respected by data-driven cloud cover schemes if they are not satisfied in the data.In the DYAMOND data, the first physical constraint is always satisfied, and PC 2 is satisfied in 99.7% of all condensatefree samples.The remaining 0.3% are due to noise induced during coarse-graining.In order to check whether PC 3 -PC 6 are satisfied in our subset of the coarse-grained DYA-MOND data, we extract {q c , q i , RH, T }.We then separate the variable whose partial derivative we are interested in.Bounded by the min/max-values of the remaining three variables, we define a cube in this three-dimensional space, which we divide into N 3 equallysized cubes.In this way, the three variables of the samples within the cubes become more similar with increasing N .If we now fit a linear function in a given cube with the separated variable as the inputs and cloud cover as the output, then we can use the sign of the function's slope to know whether the physical constraint is satisfied.
On one hand, the test is more expressive the smaller the cubes are, as the samples have more similar values for three of the four chosen variables and we can better approximate the partial derivative with respect to the separated variable.However, we only guarantee similarity in three variables (omitting, e.g., pressure).On the other hand, as the size of the cubes decreases, so does the number of samples contained in a cube, and noisy samples may skew the results.We therefore only consider the cubes that contain a sufficiently large number of samples (at least 10 4 out of the 2.9 • 10 8 ).
We collect the results in Table 1, and find that the physical constraint PC 3 (with respect to RH) is always satisfied.The other constraints are satisfied in most (on average 76%) of the cubes.Thus, from the data we can deduce that the final cloud cover scheme should satisfy PC 1 -PC 3 in all and PC 4 -PC 6 in most of the cases.To enforce PC 1 , we always constrain the output to [0, 100] before computing the MSE.With the exception of the linear and polynomial SFS models, we already ensure PC 1 during training.For PC 2 , we can define cloud cover to be 0 if the grid cell is condensatefree.We can combine PC 1 and PC 2 to define cloud fraction C (in %) as and our goal is to learn the best fit for f (X).In the case of the Xu-Randall and Teixeira schemes, ensuring PC 2 is not necessary since they satisfy the constraint by design.

Performance Metrics
We use different metrics to train and validate the cloud cover schemes.We always train to minimize the mean squared error (MSE), which directly measures the average squared mismatch of the predictions f (x i ) (usually set to be in [0, 100]%) and the corresponding true (cloud cover) values y i : The coefficient of determination R 2 -value takes the variance of the output To compare discrete univariate probability distributions P and Q, we use the Hellinger distance As opposed to the Kullback-Leibler divergence, the Hellinger distance between two distributions is always symmetric and finite (in [0, 1]).
As our measure of complexity we use the number of (free/tunable/trainable) parameters of a model.A clear limitation of this complexity measure is that, e.g., the expression f (x) = ax is considered as complex as g(x) = sin(exp(ax)).However, in this study, most of our models (i.e., the linear models, polynomials, and NNs) do not contain these types of nested operators.Instead, each additional parameter usually corresponds to an additional term in the equation.In the case of symbolic regression tools, operators are already taken into account (see Appendix C) during the selection process, and we find that the number of trainable parameters suffices to compare the complexity of our symbolic equations in their simplified forms.Finally, this complexity measure is one of the few that can be used for both analytical equations and NNs.

Cloud Regime-Based Evaluation
We define four cloud regimes based on air pressure p and the total cloud condensate q t (cloud water plus cloud ice) mixing ratio: 1. Low air pressure, little condensate (cirrus-type cloud regime) 2. High air pressure, little condensate (cumulus-type cloud regime) 3. Low air pressure, substantial condensate (deep convective-type cloud regime) 4. High air pressure, substantial condensate (stratus-type cloud regime) Pressure or condensate values that are above their medians (78 787 Pa and 1.62•10 −5 kg/kg) are considered to be large, while values below the median are considered small.Each regime has a similar amount of samples (between 35 and 60 million samples per regime).In this simplified data split, based on Rossow and Schiffer (1991), air pressure and total cloud condensate mixing ratio serve as proxies for cloud top pressure and cloud optical thickness.These regimes will help decompose model error to better understand the strengths and weaknesses of each model, discussed in the following section.

Performance on the Storm-Resolving (DYAMOND) Training Set
In this section, we train the models we introduced in Sec 3 on the (coarse-grained) DYAMOND training data and compare their performance and complexity on the DYA-MOND validation data.We start with the sequential feature selection's results.

Feature Ranking
We perform 10 SFS runs for each linear model, polynomial, and NN from Sec 3.3.Each run varies the random training subset, which consists of O(10 5 ) samples in the case of NNs and O(10 6 ) samples in the case of polynomials (as polynomials are faster to train).We then average the rank of a selected feature and note it down in brackets.We omit the average rank if it is the same for each random subset.By P d , d ∈ {1, 2, 3} we denote polynomials of degree d (e.g., P 1 groups linear models).The sequences in which the features are selected are: Regardless of the model, the selection algorithm chooses RH as the most informative feature for predicting cloud cover.This is consistent with, e.g., Walcek (1994), who considers RH to be the best single indicator of cloud cover in most of the troposphere.Considering that the cloud cover in the high-resolution data was only derived from the cloud condensate mixing ratio, the models' prioritization of RH is quite remarkable.From the feature sequences, we can also deduce that cloud cover depends on the mixing ratios of cloud condensates in a very nonlinear way: The polynomials choose q i q c as their third feature and do not use any other terms containing q i or q c .The NNs choose q i and q c as their second and third features, and are able to express a nonlinear function of these two features.The linear model cannot fully exploit q i and q c and hence attaches less importance to them.
Since RH and T are chosen as the most informative features for the linear model, we can derive a notable linear dependence of cloud cover on these two features (the corresponding model being f (RH, T ) = 41.31RH− 15.54T + 44.63).However, given the possibility, higher order terms of T and RH are chosen as additional predictors over, for instance, p or q v .Finally, ∂ z RH is an important recurrent feature for all models.Depending on the model, the coefficient associated with ∂ z RH can be either negative or positive.If ∂ z RH ̸ = 0, one can assume some variation of cloud cover (i.e., cloud area fraction) vertically within the grid cell.Thus, ∂ z RH is a meaningful proxy for the subgrid vertical variability of cloud area fraction.Since the effective cloud area fraction of the entire grid cell is related to the maximum cloud area fraction at a given height within the grid cell, this could explain the significance of ∂ z RH.

Balancing Performance and Complexity
In Fig 2, we depict all of our models in a performance × complexity plane.We measure performance as the MSE on the validation (sub)set of the DYAMOND data and use the number of free parameters in the model as our complexity metric.We add the Pareto frontier, defined to pass through the best-performing models of a given complexity.The SFS sequences described above are used to train the SFS models of the corresponding type.The only exception is the swapped order of ∂ z p and ∂ zz p for the NNs, as we base the sequence shown in Fig 2 on a single SFS run.For the SFS NNs with 4-7 features, it was possible to reduce the number of layers and hidden units without significant performance degradation, which reduced the number of free parameters by about an order of magnitude and put them on the Pareto frontier.
For most models, we train a second version that does not need to learn that condensatefree cells are always cloud-free, but for which the constraint is embedded by equation ( 6).For such models, condensate-free cells are removed from the training set.In addition to the schemes of Xu-Randall and Teixeira (see Sec 3.1), we find that it is also not necessary to enforce PC 2 in the case of NNs, since they are able to learn PC 2 without degrading their performance.PC 1 is always enforced by default for all models.
We find that, even though the Sundqvist and Teixeira schemes are also tuned to the training set, linear models of the same complexity outperform them.However, these linear models do not lie on the Pareto frontier either.The lower performance of the Teixeira scheme is most likely due to the fact that it was developed for subtropical boundary layer clouds.Its MSE experiences a reduction (to 290 (%) 2 ) when evaluated exclusively within the subtropics (from 23.4 to 35 degrees north and south).Among the existing schemes, only the Xu-Randall scheme with its two tuning parameters set to {α, β} = {0.9,9•10 5 } is on the Pareto frontier as the simplest model.With relatively large values for α and β, cloud cover is always approximately equal to relative humidity (i.e., C ≈ RH 0.9 ) when cloud condensates are present.The next models on the Pareto frontier are third-order SFS polynomials P 3 with 2-6 features with PC 2 enforced.To account for the bias term and the output of the polynomial being set to zero in condensate-free cells, the number of their parameters is the number of features plus 2. We then pass the line with R 2 = 0.9 and find three symbolic regression fits on the Pareto frontier, each trained on the five most informative features for the SFS NNs.All symbolic regression equations that appear in the plot are listed in Appendix D. We will analyze the PySR equation with arguably the best tradeoff between complexity (11 free parameters when phrased in terms of normalized variables) and performance (M SE = 103.95(%) 2 , improved spatial distribution as illustrated in Fig S2) in Sec 6.The remaining models on the Pareto frontier are SFS NNs with 4-10 features and finally the NN with all 24 features defined in Sec 2.1 included (M SE = 30.51(%) 2 ).Interestingly, the (quasi-local) 24-feature NN is able to achieve a slightly lower MSE (30.51 (%) 2 ) than the (non-local) column-based NN (33.37 (%) 2 ) with its 163 features.The two aspects that benefit the 24-feature NN are the additional information on the horizontal wind speed U and its derivatives, and the smaller number of condensate-free cells in its training set due to undersampling (Sec 2.1 and 3.1).The SFS NN with 10 features already shows very similar performance (M SE = 34.64(%) 2 ) to the column-based NN with a (12 times) smaller complexity and fewer, more commonly accessible features.
Comparing the small improvements of the linear SFS models (up to M SE = 250.43(%) 2 ) with the larger improvements of SFS polynomials (up to M SE = 190.78 (%) 2 ) with increasing complexity, it can be deduced that it is beneficial to include nonlinear terms instead of additional features in a linear model.For example, NNs require only three features to predict cloud cover reasonably well (R 2 = 0.933), and five features are sufficient to produce an excellent model (R 2 = 0.962) because they learn to nonlinearly transform these features.
The PySR equations can estimate cloud cover very well (R 2 ∈ [0.935, 0.940]).However, while the PySR equations depend on five features, the NNs are able to outperform them with as few as four features (R 2 = 0.944).This suggests that the NNs learn better functional dependencies than PySR, as they do better with less information.However, the improved performance of the NNs comes at the cost of additional complexity and greatly reduced interpretability.

Split by Cloud Regimes
In this section, we divide the DYAMOND data set into the four cloud regimes introduced in Sec 4. We evaluate the models located at favorable positions on the Pareto frontier (at the beginning to maximize simplicity, at the end to maximize performance, or on some corners to optimally balance both).Of the two PySR equations, we consider the one with the lowest MSE (as in Sec 6 later).Furthermore, we explore benefits that arise from training on each cloud regime separately and whether using a different feature set for each regime could ease the transition between regimes.
In general, we find that the PySR equation (except in the cirrus regime) and the 6-feature NN can reproduce the distributions quite well (Hellinger distances < 0.05), while the 24-feature NN shows excellent skill (Hellinger distances ≤ 0.015).However, all models have difficulty predicting the number of fully cloudy cells in all regimes (especially in the regimes with fewer cloud condensates).
Focusing first on the predictions of the Xu-Randall scheme, we find that the distributions exhibit prominent peaks in each cloud regime.By neglecting the cloud condensate term and equating RH with the regime-based median, we can approximately rederive these modes of the Xu-Randall cloud cover distributions in each regime using the Xu-Randall equation ( 4).With our choice of α = 0.9, this mode is indeed very close (absolute difference at most 8% cloud cover) to the median relative humidity calculated in each regime.By increasing α, we should therefore be able to push the mode above 100% cloud cover and thus remove the spurious peak.However, this comes at the cost of increasing the overall MSE of the Xu-Randall scheme.
For the PySR equation (and also the 24-feature NN), the cirrus regime distribution is the most difficult to replicate.The Hellinger distances suggest that it is the model's functional form, and not its number of features that limits model performance in the cirrus regime.Indeed, the decrease in the Hellinger distance between the PySR equation and the 6-feature NN is larger (0.049) than the decrease between the 6-and the 24-feature NN (0.02).Technically, the PySR equation has the same features as the 5-feature and  not the 6-feature NN, but the Hellinger distances of these two NNs to the actual cloud cover distribution are almost the same (difference of 0.003 in the cirrus regime).We want to note here that, while the PySR equation features a large Hellinger distance, it actually achieves its best R 2 score (R 2 = 0.84) in the cirrus regime as the coefficient of determination takes into account the high variance of cloud cover in the cirrus regime.In the condensate-rich regimes, the PySR equation is as good as the 6-feature NN and even able to outperform it on the stratus regime.To improve the PySR scheme further in terms of its predicted cloud cover distributions, and combat its underestimation of cloud cover in the cirrus regime, we now explore the effect of focusing on the regimes individually.
By training SFS NNs just like in Sec 5.1.1 but now on each cloud regime separately, we find new feature rankings: Cirrus regime: Cumulus regime: By rerunning PySR within each regime and allowing its discovered equations to depend on the newly found five most important features, we find equations that are better able to predict the distributions of cloud cover.In the supplementary information (SI), we present one of the equations per regime that strikes a good balance between performance and simplicity and show the predicted distributions of cloud cover.
As expected, cloud water is not an informative variable in the cirrus regime (with an average rank of 9.5).Based on q i , RH and T alone, we are able to discover equations that reduce the number of cloud-free predictions and improve the distributions for low cloud cover values (Hellinger distances of ≈ 0.05).We do not attribute these improvements to new input features, but rather to the ability of the equation to adopt a novel structure.Similarly, the features q i , q c and RH are sufficient to decrease the Hellinger distance from 0.049 to 0.041 within the cumulus regime.
In the condensate-rich regimes (deep convective and stratus), cloud water and/or ice are already present, making the exact amount of cloud condensates less pertinent.By focusing on the three most significant features RH, T and ∂ z RH, we find equations with an enhanced distribution of cloud cover within the deep convective regime (with Hellinger distances of only 0.02).The equations specific to the deep convective regime display strong nonlinearity, with the equation selected for the SI including a fourth-order polynomial of relative humidity and temperature.While the five most important features of the stratus regime also differ from the SFS NN features of Sec 5.1.1,we were not able to improve upon the Hellinger value of our single PySR equation through exclusive training within the stratus regime.A notable aspect of the stratus regime is the increased significance of ∂ z RH, which is discussed later (see Sec 6.2).
While the approach of deriving distinct equations tailored to each cloud regime, emphasizing regime-specific features, holds potential for improving predicted cloud cover distributions, the resulting MSE across the entire dataset is lower (≈ 113 (%) 2 ) compared to our chosen single PySR equation (≈ 104 (%) 2 ).Moreover, the number of free parameters increases to 33, which is three times the count of our single PySR equation.Lastly, formulating distinct equations for each cloud regime requires special attention at the regime boundaries to ensure continuity across the entire domain.Therefore, we henceforth focus on equations that generalize across cloud regimes.

Transferability to Different Climate Model Horizontal Resolutions
Designing data-driven models that are not specific to a given Earth system model and a given grid is challenging.Therefore, in this section we aim to determine which of our selected Pareto-optimal ML models are most general and transferable.We explore the applicability of our schemes at higher resolutions, nowadays also typical for climate model simulations.
To evaluate the performance of our models at higher resolutions, we coarse-grain some of the DYAMOND data to horizontal resolutions of ≈ 20 km (R2B7) and ≈ 40 km (R2B6) to complement our coarse-grained data set at ≈ 80 km (R2B5).For simplicity, in this section, we omit any coarse-graining in the vertical and do not retune the schemes for the higher resolutions.In Fig 4 we present R 2 -values for each resolution for the same models as in the previous section.We note that the lack of vertical coarse-graining can explain the slight decrease in performance on 80 km when compared to the results depicted in Fig 2 .We observe a clear, almost linear, tendency of all schemes to improve their R 2 -score on the coarse-grained data sets as we increase the resolution.The increasing standard deviation σ of cloud cover by ≈ 1.6% per doubling of the resolution (with σ ≈ 23.8% at 80 km) is not sufficient to explain this phenomenon.On the one hand, we find these improvements surprising, considering that the schemes were trained at a resolution of 80 km.On the other hand, at the low resolution of 80 km, the inputs are averaged over wide horizontal regions and bear very little information about how much cloud cover to expect.At higher resolution, large-scale variables and cloud cover are more closely related.Cloud water and ice reach larger values and become more informative for cloud cover detection.This is evident in the Xu-Randall scheme, which relies heavily on cloud condensates and shows a significant increase in its ability to predict cloud cover at higher resolutions.Our analysis reveals that the most skillful schemes at 20 km are the 6-feature NN and our chosen PySR equation.The 24-feature NN relies on many first-and secondorder vertical derivatives in its input, so its deteriorated performance could be an artifact of not vertically coarse-graining the data in this section.
Overall, the schemes exhibit a noteworthy capacity to be applied at higher resolutions than those used during their training.

Transferability to Meteorological Reanalysis (ERA5)
To our knowledge, there is no systematic method to incorporate observations into ML parameterizations for climate modeling.In this section, we take a step towards transferring schemes trained on SRMs to observations by analyzing the ability of the Paretooptimal schemes to transfer learn the ERA5 meteorological reanalysis from the DYA-MOND set.
To do so, we take a certain number (either 1 or 100) of random locations, and collect the information from the corresponding grid columns of the ERA5 data over a certain number of time steps in a data set T .Starting from the parameters learned on the DYAMOND data, we retrain the cloud cover schemes on T and evaluate them on the entire ERA5 data set.In other words, the free parameters of each cloud cover scheme are retuned on T .The retuning method is the same as the original training method, the difference being that the initial model parameters were learned on the DYAMOND data.We can think of T as mimicking a series of measurements at these random locations, which help the schemes adjust to the unseen data set.The first columns of the three panels show no variability because the schemes are applied directly to the ERA5 data without any transfer learning (T = ∅).None of the schemes perform well without transfer learning (R 2 < 0.15), which is expected given the different distributions of cloud ice and water between the DYAMOND and ERA5 data sets (Fig 1).That being said, the SFS NNs retain their superior performance (MSE ≈ 300 (%) 2 without retraining), especially compared to the non-retrained SFS polynomials, which exhibit MSEs in the range of 1375±55 (%) 2 and are therefore not shown in Panel c.
For most schemes, performance increases significantly after seeing one grid column of ERA5 data, with the exception of the SFS NNs with more than 6 features and the GPGOMEA equation.The performance of the GPGOMEA equation varies greatly between the selected grid columns, and the SFS NNs with many features appear to underfit the small transfer learning training set.The models with the lowest MSEs are (1) the slightly more complex of the two PySR equations (median MSE = 148 (%) 2 ); and (2) the SFS NNs with 5 and 6 features (median MSE = 200 (%) 2 ).While we cannot confirm that fewer features (5-6 features) help with off-the-shelf generalizability of the SFS NNs, they do improve the ability to transfer learn after seeing only a few samples from the ERA5 data.
After increasing the number of time steps to be included in T to 32 (corresponding to one year of our preprocessed ERA5 data set), the performances of the models start to converge and the SFS NNs with 5 and 6 features and its large number of trainable parameters outperform the PySR equation (with median ∆MSE ≈ 35 (%) 2 ).From the last column we can conclude that a T consisting of 100 columns from all available time steps is sufficient for the ERA5 MSE of all schemes to converge.Remarkably, the order from best-to worst-performing model is exactly the same as it was in Fig 2 on the DYA-MOND data set (in addition, Fig S3 visually demonstrates the improved spatial distribution of predicted cloud cover by the fully tuned PySR equation).Thus, we find that the ability to perform well on the DYAMOND data set is directly transferable to the ability to perform well on the ERA5 data set given enough data, despite fundamental differences between the data sets.This suggests a notable degree of structural robustness of the cloud cover models.
A useful property of a model is that it is able to transfer learn what it learned over an extensive initial dataset after tuning only on a few samples.We can quantify the ability to transfer learn with few samples in two ways: First, we can directly measure the error on the entire data set after the model has seen only a small portion of the data (in our case the ERA5 MSEs of the 1/1-column).Second, if this error is already close to the minimum possible error of the model, then few samples are really enough for the model to transfer learn to the new data set (in our case, the difference of MSEs in the 1/1-column and the 100/1368-column).In terms of the first metric (MSEs in (%) 2 ), the leading five models are the more complex PySR equation (147.6), the 5-and 6-feature NNs (199.6/199.8), the simpler PySR equation (216.8), and the 6-feature polynomial (254.6).In terms of the second metric (difference of MSEs in (%) 2 ), the top five models are again the more complex PySR equation (86.0), the 6-, 5-, and 4-feature polynomials (149.1/149.4/150.5),and the simpler PySR equation (152.3).If we add both metrics, weighing them equally, then the more complex PySR equation has the lowest inability to transfer learn with few samples (233.7),followed by the simpler PySR equation (369.1) and the 5-and 6-feature SFS NNs (370.5/374.5,where all numbers have units (%) 2 ).As the more complex PySR equation is leading in both metrics, we can conclude that it is most able to transfer learn after seeing only one column of ERA5 data, and we further investigate its physical behavior in the next section.

Physical Interpretation of the Best Analytical Scheme
We find that the two PySR equations on the Pareto frontier (see Fig 2) achieve a good compromise between accuracy and simplicity.Both satisfy most of the physical constraints that we defined in Sec 4.1.In this section, we analyze the (more complex) PySR equation with a lower validation MSE as we showed that it generalized best to ERA5 data (see Fig 5).We also conclude that the decrease in MSE is substantial enough (∆MSE = 3.04% 2 ) to warrant the analysis of the (one parameter) more complex equation.The equation for the case with condensates can be phrased in terms of physical variables as where To compute cloud cover in the general case, we plug equation ( 10) into equation ( 6), enforcing the first two physical constraints (C(X) ∈ [0, 100]% and in condensate-free cells C(X) = 0).On the DYAMOND data we find the best values for the coefficients to be {a 1 , . . ., a 9 , ϵ} = {0.4435,1.1593, −0.0145 584.8036 m, 2 km −1 , 1.1573 mg/kg, 0.3073 mg/kg, 1.06}.
Additionally, RH = 0.6025 and T = 257.06K are the average relative humidity and temperature values of our training set.
In this section, we use our symbolic model to elucidate the fundamental physical components that facilitate the parameterization of cloud cover from storm-resolution data, following the themes outlined in the subsequent subsections.

Relative Humidity and Temperature Drive Cloud Cover, Especially in Condensate-Rich Environments
The function I 1 (RH, T ) can be phrased as a Taylor expansion to third order around the point (RH, T ) = (RH, T ).The first coefficient a 1 specifies I 1 's contribution to cloud cover for average relative humidity and temperature values, i.e., a 1 = I 1 RH, T .While C(X) = a 1 at (RH, T ) if I 2 ≈ I 3 ≈ 0, the I 3 -term dominates when cloud condensates are absent, setting C(X) to 0. The following two parameters a 2 and a 3 are the partial derivatives of equation ( 10) at (RH, T ) w.r.t.relative humidity and temperature, i.e., a 2 = (∂I 1 /∂RH)| (RH,T ) and a 3 = (∂I 1 /∂T )| (RH,T ) .As a 2 is positive, cloud cover generally increases with relative humidity (see Fig 6a and 7a).To ensure PC 3 (∂C/∂RH ≥ 0) in all cases, we replace RH with max{RH, where c 1 = RH − a 2 /a 4 ≈ 0.317 and c 2 = a 5 /(2a 4 ) ≈ 1.623 • 10 −4 K −2 .We derive equation ( 11) by solving ∂f /∂RH = 0 for RH.Condition (11) of replacing RH triggers in roughly 1% of our samples.It ensures that cloud cover does not increase when decreasing relative humidity in cases of low relative humidity and average temperature (see Fig 7).
Modifying the equation ( 10) in such a way does not deteriorate its performance on the DYAMOND data.Fig 7b illustrates how the modification ensures PC 3 in an average setting (in particular for T = T ).It would be difficult to apply a similar modification to the NN, which in our case violates PC 3 for RH > 0.95.We can also directly identify another aspect of equation ( 10): the absence of a minimum value of relative humidity, below which cloud cover must always be zero (the critical relative humidity threshold ).
Since a 3 = (∂I 1 /∂T )| (RH,T ) is negative, cloud cover typically decreases with temperature for samples of the DYAMOND data set (see Fig 6f )).However, I 1 does not ensure the PC 6 (∂C/∂T ≤ 0) constraint everywhere.For instance, in the hot limit lim T →∞ I 1 (RH, T ), whether conditions are entirely cloudy or cloud-free depends upon relative humidity (in particular, whether RH > RH).
The coefficient a 4 = (∂ 2 I 1 /∂RH 2 )| (RH,T ) is precisely the curvature of I 1 w.r.t.RH, causing the equation to flatten with decreasing RH (taking (11) into account).It is consistent with the Sundqvist scheme that changes in relative humidity have a larger impact on cloud cover for larger relative humidity values.The final coefficient a 5 of I 1 is  a third-order partial derivative of I 1 w.r.t.T and RH.More precisely, .
The corresponding term becomes important whenever the temperature and relative humidity deviate strongly from their mean.In the upper or lower troposphere, where temperature conditions differ from the average tropospheric temperature, the a 5 -term either further increases cloud cover in wet conditions (e.g., the tropical lower troposphere) or decreases it in dry conditions (e.g, in the upper troposphere or over the Sahara).The contribution of the a 5 -term for selected vertical layers is illustrated in the second row of Fig A1 .When fit to the ERA5 data, the coefficients of the linear terms are found to be stable, while the emphasis on the non-linear terms is somewhat decreased; a 4 is 1.53 and a 5 is 2.5 times smaller.

Vertical Gradients in Relative Humidity and Stratocumulus Decks
The second function Its magnitude is controlled by the coefficient a 6 .If a 6 were 50% smaller (which it is when fit to ERA5 data), it would decrease the absolute value of I 2 by 87.5%.We introduce a prefactor of 1.5 for a 7 so that −a 7 describes a local maximum of I 2 (found by solving I ′ 2 (∂ z RH) = 0).We will now focus on the reason for this distinct peak of I 2 ≈ 0.8 at ∂ z RH = −a 7 .Removing the I 2 -term, we find that the induced prediction error is largest, on average, in situations that are i) relatively dry (RH ≈ 0.6), ii) close to the surface (z ≈ 1000m), iii) over water (land fraction ≈ 0.1), iv) characterized by an inversion (∂ z T ≈ 0.01 K/m), and v) have small values of ∂ z RH (∂ z RH ≈ −2 km −1 = −a 7 ; compare also to the cloud cover peak in Fig 6g).Using our cloud regimes of Sec 5.2, we find the average absolute error is largest in the stratus regime (4% cloud cover).Indeed, by plotting the globally averaged contributions of I 1 , I 2 and I 3 on a vertical layer at about 1500m altitude (Fig A1 ), we find that I 2 is most active in regions with low-level inversions where marine stratocumulus clouds are abundant (Mauritsen et al., 2019).From this, we can infer that the SFS NN has chosen ∂ z RH as a useful predictor to detect marine stratocumulus clouds and the symbolic regression algorithm has found a way to express this relationship mathematically.It is more informative than ∂ z T (rank 10 in Sec 5.1.1),which would measure the strength of an inversion more directly.Indeed, stratocumulus-topped boundary layers exhibit a sharp increase in temperature and a sharp decrease in specific humidity between the cloud layer to the inversion layer.Studies by Nicholls (1984) and Wood (2012) reveal a notable temperature increase of approximately 5 − 6 K and a specific humidity decrease of about 4−5 g/kg.In ICON's grid with a vertical spacing of ≈ 300 m at an altitude of 1000 − 1500 m, the decrease in relative humidity would attain values of ≈ −2.5 km −1 .It is important to note that the vertical grid may not precisely separate the cloud layer from the inversion layer, making it reasonable to maximize the parameter I 2 at a relative humidity gradient of ∂ z RH = −2 km −1 .Vertical gradients of relative humidity below −3, km −1 are extremely sporadic and confined to the lowest portion of the planetary boundary layer, where the vertical spacing between grid cells can get very small.In such cases, the attenuating effect of I 2 is unlikely to have significant physical causes.In contrast, vertical relative humidity gradients exceeding 1 km −1 are common in the marine boundary layer due to evaporation and vertical mixing of moist air in the boundary layer.In this context, I 2 generally increases cloud cover which aligns with the fact that cloud cover is typically 5−15% greater over the ocean compared to land (Rossow & Schiffer, 1999).With the estimated values for a 6 and a 7 , relative humidity would need to increase by 10% over a height of 260 m to increase cloud cover by 10%.

Understanding the Contribution of Cloud Condensates to Cloud Cover
The third function I 3 (q c , q i ) is always negative and decreases cloud cover where there is little cloud ice or water.It ensures that PC 4 and PC 5 are always satisfied.First of all, in condensate-free cells, ϵ serves to avoid division by zero while also decreasing cloud cover by 100%.Furthermore, the values of a 8 or a 9 indicate thresholds for cloud water/ice to cross to set I 3 closer to zero.When tuned to the ERA5 data set, the values for both a 8 and a 9 are roughly six times larger, making the equation less sensitive to cloud condensates.As larger values for cloud water are more common for cloud ice, we already expect I 3 to be more sensitive to cases when cloud ice actually does appear.By comparing the distributions of cloud ice/water at the storm-resolving scale, we provide a more rigorous derivation in Appendix B for why a 9 should indeed be smaller than a 8 .A simple explanation is that we usually find ice clouds in the upper troposphere, where convection is associated with divergence, causing the clouds to spread out more.
Given that equation ( 10) is a continuous function, the continuity constraint PC 7 is only violated if and only if the cloud cover prediction is modified to be 0 in the condensatefree regime (by equation ( 6)), and would be positive otherwise.The value of ϵ dictates how frequently the cloud cover prediction needs to be modified.In the limit ϵ → 0 we could remove the different treatment of the condensate-free case.In our data set, equation (10) yields a positive cloud cover prediction in 0.35% of condensate-free samples.Thus, the continuity constraint PC 7 is almost always satisfied (in 99.65% of our condensatefree samples).

Ablation Study Confirms the Importance of Each Term
To convince ourselves that all terms/parameters of equation ( 10) are indeed relevant to its skill, we examine the effects of their removal in an ablation study (Fig 8).We found that for the results to be meaningful, removing individual terms or parameters requires readjusting the remaining parameters; in a setting with fixed parameters the removal of multiple parameters often led to better outcomes than the removal of a full eq z RH 3 a 7 a 4 I 2 a 2 a 5 a 1 a 3 q c q i I 3 Term set to zero single one of them.The optimizers (BFGS and Nelder-Mead) used to retune the remaining parameters show different success depending on whether the removal of terms is applied to the equation formulated in terms of normalized or physical features (the latter being equation ( 10)).Therefore, each term is removed in both formulations, and the better result is chosen each time.To ensure robustness of the results, this ablation study is repeated for 10 different seeds on subsets with 10 6 data samples.We find that the removal of any individual term in equation ( 10 ) is more important to take into account than cloud water (∆M SE = 88/63 (%) 2 ), especially for the ERA5 data set in which cloud ice is more abundant (see Fig 1).More generally, out of the functions I 1 , I 2 , I 3 we find I 1 (RH, T ) to be most relevant (∆M SE = 1300/763 (%) 2 ), followed by I 3 (q c , q i ) (∆M SE = 119/123 (%) 2 ) and lastly I 2 (∂ z RH) (∆M SE = 18/0 (%) 2 ), once again matching the order of features that the SFS NNs had chosen.

Conclusion
In this study, we derived data-driven cloud cover parameterizations from coarsegrained global storm-resolving simulation (DYAMOND) output.We systematically populated a performance × complexity plane with interpretable traditional parameterizations and regression fits on one side and high-performing neural networks on the other.Modern symbolic regression libraries (PySR, GPGOMEA) allow us to discover interpretable equations that diagnose cloud cover with excellent accuracy (R 2 > 0.9).From these equations, we propose a new analytical scheme for cloud cover (found with PySR) that balances accuracy (R 2 = 0.94) and simplicity (10 free parameters in the physical formulation).This analytical scheme satisfies six out of seven physical constraints (although the continuity constraint is violated in 0.35% of our condensate-free samples), provid-ing the crucial third criterion for its selection.In a first evaluation, the (5-feature) analytical scheme was on par with the 6-feature NN in terms of reproducing cloud cover distributions (Hellinger distances < 0.05) in condensate-rich cloud regimes, yet underestimating cloud cover more strongly in condensate-poor regimes.While discovering distinct equations in each cloud regime can improve the Hellinger distances, both the overall complexity and mean squared error of a combined piecewise equation increase.This supports choosing a single continuous analytical scheme that generalizes across cloud regimes.When applied to higher resolutions than their training data we find that the cloud cover schemes further improve their performance.This finding opens up possibilities for leveraging their predictive capabilities in domains with increased resolution requirements.
In addition to its interpretability, flexibility and efficiency, another major advantage of our best analytical scheme is its ability to adapt to a different data set (in our case, the ERA5 reanalysis product) after learning from only a few of the ERA5 samples in a transfer learning experiment.Due to the small amount of free parameters and the initial good fit on the DYAMOND data, our new analytical scheme outperformed all other Pareto-optimal models.We found that as the number of samples in the transfer learning sets increases, the models converged to the same performance rank on the ERA5 data as on the DYAMOND data, indicating strong similarities in the nature of the two data sets that could make which data set serves as the training set irrelevant.In an ablation study, we found that further reducing the number of free parameters in the analytical scheme would be inadvisable; all terms/parameters are relevant to its performance on the DYAMOND data.Key terms include a polynomial dependence on relative humidity and temperature, and a nonlinear dependence on cloud ice and water.
Our sequential feature selection approach with NNs revealed an objectively good subset of features for an unknown nonlinear function: relative humidity, cloud ice, cloud water, temperature and the vertical derivative of relative humidity (most likely linked to the vertical variability of cloud cover within a grid cell).While the first four features are well-known predictors for cloud cover, PySR also learned to incorporate ∂ z RH in its equation.This additional dependence allows it to detect thin marine stratocumulus clouds, which are difficult, if not impossible to infer from exclusively local variables.These clouds are notoriously underestimated in the vertically coarse climate models (Nam et al., 2012).In ICON this issue is somewhat attenuated by multiplying, and thus increasing relative humidity in maritime regions by a factor depending on the strength of the low-level inversion (Mauritsen et al., 2019).Using symbolic regression, we thus found an alternative, arguably less crude approach, which could help mitigate this long-standing bias in an automated fashion.However, we need to emphasize that in particular shallow convection is not yet properly resolved on kilometer-scale resolutions.Therefore, shallow clouds such as stratocumulus clouds are still distorted in the storm-resolving simulations we use as the source of our training data (Stevens et al., 2020).To properly capture shallow clouds it could be advisable to further increase the resolution of the high-resolution model, training on coarse-grained output from targeted large-eddy simulations (Stevens et al., 2005) or observations.A crucial next step will be to test the cloud cover schemes when coupled to Earth system models, including ICON.We decided to leave this step for future work for several reasons.First, our focus was on the equation discovery methodology and the analysis of the discovered equation.Second, our goal was to derive a cloud cover scheme that is climate model-independent.Designing a scheme according to its online performance within a specific climate model decreases the likelihood of inter-model compatibility as the scheme has to compensate the climate model's parameterizations' individual biases.For instance, in ICON, the other parameterizations would most likely need to be re-calibrated to adjust for current compensating biases, such as clouds being 'too few and too bright' (Crueger et al., 2018).Third, the metrics used to validate a coupled model remain an active research area, and at this point, it is unclear which targets must be met to accept a new ML-based parameterization.That being said, the superior transferability of our analytical scheme to the ERA5 reanalysis data not only suggests its applicability to observational data sets, but also that it may be transferable to other Earth system models.
In addition to inadequacies in our training data (see above), which somewhat exacerbate the physical interpretation of the derived analytical equations, our current approach has some limitations.Symbolic regression libraries are limited in discovering equations with a large number of features.In many cases, five features are insufficient to uncover a useful data-driven equation, requiring a reduction of the feature space's dimensionality.To measure model complexity, we used the number of free parameters, disregarding the number of features and operators.Although the number of operators in our study was roughly equivalent to the number of parameters, this may not hold in more general applications and the complexity of individual operators would need to be specified (as in Appendix C).
Our approach differs from similar methods used to discover equations for ocean subgrid closures (Ross et al., 2023;Zanna & Bolton, 2020) because we included nonlinear dependencies without assuming additive separability, instead fitting the entire equation non-iteratively.By simply allowing for division as an operator in our symbolic regression method, we found rational nonlinearities in the equation whose detection would already require modifications such as Kaheman et al. (2020) to conventional sparse regression approaches.Despite our efforts, the equation we found is still not as accurate as an NN with equivalent features in the cirrus-like regime (the Hellinger distance between the analytical scheme and the DYAMOND cloud cover distribution was more than twice as large as for the NN).Comparing the partial dependence plots of the equation with those of the NN could provide insights and define strategies to further extend and improve the equation, while reducing the computational cost of the discovery.There are various methods available for utilizing NNs in symbolic regression for more than just feature selection, one of which is AIFeynman (Udrescu et al., 2020).While AIFeynman is based on the questionable assumption that the gradient of an NN provides useful information, a direct prediction of the equation using recurrent neural networks presents a promising avenue for improved symbolic regression (Petersen et al., 2021;Tenachi et al., 2023).Nonetheless, our simple cloud cover equation already achieves high performance.
Our study thus underscores that symbolic regression can complement deep learning by deriving interpretable equations directly from data, suggesting untapped potential in other areas of Earth system science and beyond.In this section, we plot average function values for the three terms I 1 , I 2 , and I 3 of equation ( 10).We focus on the vertical layer roughly corresponding to an altitude of 1500 m to analyze if one of the terms would detect thin marine stratocumulus clouds.Due to their small vertical extent, these clouds are difficult to pick up on in coarse climate models, which constitutes a well-known bias.To compensate for this bias, the current cloud cover scheme of ICON has been modified so that relative humidity is artificially increased in low-level inversions over the ocean (Mauritsen et al., 2019).Analyzing Fig A1, we find that the regions of high I 2 -values correspond with regions typical for low-level inversions and low-cloud fraction (Mauritsen et al., 2019;Muhlbauer et al., 2014).These I 2 -values compensate partially negative I 1 -and I 3 -values in low-cloud regions of the Northeast Pacific, Southeast Pacific, Northeast Atlantic, and the Southeast Atlantic.The I 3 -term decreases cloud cover over land and is mostly inactive over the oceans due to the abundancy of cloud water.The I 1 -term is particularly small in the dry and hot regions of the Sahara and the Rub' al Khali desert and largest over the cold poles.The a 5 -term is the only term in I 1 that cannot be explained as a linear or a curvature term.In the upper troposphere, the term is negative due to relatively cold and dry conditions.In August, temperatures are coldest in the southern hemisphere, so the term has a strong negative effect, especially over the South Pole.In the middle troposphere, temperatures are near the average of 257 K, weakening the term overall.Negative patches in the subtropics are due to the dry descending branches of the Hadley cell.The lower troposphere is relatively warm, especially in the tropics, resulting in a large positive a 5 -term under humid conditions, and a negative term under dry conditions.

Appendix B The Sensitivity of Cloud Cover to Cloud Water and Ice
In Equation ( 10), cloud cover is more sensitive to cloud ice than cloud water.In this section, we show that we can explain this difference in sensitivity from the stormscale distributions of cloud water and ice alone (Fig B1).On storm-resolving scales, a grid cell is fully cloudy if cloud condensates q t exceed a small threshold a > 0. Otherwise it is set to be non-cloudy.We can thus express the expected cloud cover as the probability of q t exceeding the threshold a E[C] = P[q t > a] = ∞ a f qt (q t )dq t , (B1) where f x is the probability density function of some variable x.As we can express cloud condensates as a sum of cloud water q c and cloud ice q i , we can also derive f qt from f qc and f qi by fixing q t and integrating over all potential values for q c f qt (q t ) = qt 0 f qc (z)f qi (q t − z)dz. (B2) In the following, we introduce the subscript s as a placeholder for either liquid or ice.
According to Fig B1, the storm-resolving cloud ice/water distributions feature distinct peaks at q s = 0, which can be modeled by weighted dirac-delta distributions.For q s > 0, we can approximate f qc and f qi with exponential distributions.After normalizing the distributions so that their integrals over q s ≥ 0 yield 1 we arrive at f qs (q s ) = (λ s exp(−λ s q s ) + w s δ(q s ))/(w s /2 + 1).
By rephrasing w s in terms of λ s and µ s , the mean of f qs , we get f qs (q s ) = λ s µ s (λ s exp(−λ s q s ) + (−2 + 2/(λ s µ s ))δ(q s )).(B3) By plugging in the expressions (B3) and (B2) into equation (B1) and letting a → 0 + we find the expected cloud cover to be a function of the shape parameters λ s and the means µ s for cloud water and ice f (RH, q c , q i ) = 0.009e 8.725RH + 12.795 log(229004q i + 0.774(e 11357qc − 1)) − 178246q c + 66 10) GP-GOMEA [161.45/12]: f (RH, T, q c , q i ) = (0.028e 6.253RH + 5RH − 0.076T + 4)/(183894q i + 0.73e 6565qc−91207qi − 0.62) + 92.3 Note that the assessed number of parameters is based on a simplified form of the equations in terms of its normalized variables.The amount of parameters in a given equation is at least equal to the assessed number of parameters minus one (accounting for the zero in the condensate-free setting).

Open Research
The cloud cover schemes and analysis code are preserved (Grundner, 2023).DYAMOND data management was provided by the German Climate Computing Center (DKRZ) and supported through the projects ESiWACE and ESiWACE2.The coarse-grained model output used to train and evaluate the neural networks amounts to several TB and can be reconstructed with the scripts provided in the GitHub repository.
Beucler acknowledges funding from the Columbia University sub-award 1 (PG010560-01).Gentine acknowledges funding from the NSF Science and Technology Center, Center for Learning the Earth with Artificial Intelligence and Physics (LEAP) (Award 2019625).This manuscript contains modified Copernicus Climate Change Service Information (2023) with the following datasets being retrieved from the Climate Data Store: ERA5, ERA5.1 (neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus Information or Data it contains).The projects ESiWACE and ESiWACE2 have received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 675191 and 823988.This work used resources of the Deutsches Klimarechenzentrum (DKRZ) granted by its Scientific Steering Committee (WLA) under project IDs bk1040, bb1153 and bd1179.where the features have been normalized over the training set.The total number of free trainable parameters is 33 (8 + 6 + 11 + 5 for the regime-specific equations above + 2 for the switch between cloud regimes + 1 for the condensate-free regime).

Figure 1 .
Figure 1.A comparison of the univariate distributions of four variables from the coarsegrained DYAMOND and ERA5 data sets.The y-axes are scaled logarithmically to visualize the distributions' tails.While cloud ice is often larger in our processed ERA5 data set, cloud water tends to be smaller than in the DYAMOND data.The distributions of temperature and relative humidity are comparable.

Figure 2 .
Figure 2. All models described in Sec 3 in a performance × complexity plot.The dashed vertical lines mark the R 2 = 0.95-and R 2 = 0.9-boundaries.Models marked with a cross satisfy the second physical constraint PC2 (using equation (6)).Only the best PySR and GP-GOMEA symbolic regression fits are shown.The NNs in cyan are the column-, neighborhood-and cell-based NNs when read from left to right.The SFS NN with the lowest MSE contains all 24 features described in Sec 2.1.For the SFS NNs, the last added feature is specified in curly brackets.Since the validation MSE of the SFS NNs decreases with additional features, we can extract the features for a given SFS NN by reading from right to left (e.g., the features of the SFS NN marked with {qc} are {qi, qc, RH}).
3. In Fig 3, we compare the cloud cover predictions of Pareto-optimal models (on Fig 2's Pareto frontier) with the actual cloud cover distribution in these regimes.

Figure 3 .
Figure 3. Predicted cloud cover distributions of selected Pareto-optimal models evaluated on the DYAMOND data, divided into four different cloud regimes.The numbers in the upper left indicate the Hellinger distance between the predicted and the actual cloud cover distributions for each model and cloud regime.

Figure 4 .
Figure 4. Selected Pareto-optimal models evaluated on DYAMOND data (Aug 11-20, 2018), coarse-grained horizontally to three different resolutions.Only data below an altitude of 21 km is considered.

Figure 5 .
Figure 5. Performance of DYAMOND-trained Pareto-optimal cloud cover schemes on the ERA5 data set after transfer learning.The labels on the x-axis denote how many grid columns taken across how many time steps make up the transfer learning training set.Each setting is run with six different random seeds and the diamond-shaped markers indicate the respective medians.
Fig 5 shows the MSE of the Pareto-optimal cloud cover schemes on the ERA5 data set after transfer learning on data sets T of different sizes.

Figure 6 .
Figure 6.Top row: 1D-or 2D-plots of the three terms I1, I2, I3 as functions of their inputs.In Panels a and b, the axis-values are bound by the respective minima and maxima in the DYA-MOND data set, while those minima/maxima were divided by 5000 in Panel c.The vertical black lines indicate the region of values covered by Panels d-g.Bottom row: Conditional average plots of cloud cover with respect to relative humidity and temperature (Panels d-f) or ∂zRH (Panel g).

Figure 7 .
Figure 7. Panel a: Contour plot of ∂RH f as a function of relative humidity and temperature.The contour marks the boundary where ∂RH f = 0. Panel b: Predictions of the PySR equation (10) with and without the modification (11) as a function of relative humidity.For comparison, the predictions of the SFS NN with 24 features are shown.The other features are set to their respective mean values.

Figure 8 .
Figure 8. Ablation study of equation (10) on the DYAMOND and ERA5 data sets.The removal of the function I1 leads to a very large decrease of MSE (of 1300/763 (%) 2 ) on the DYA-MOND/ERA5 data sets and is therefore not shown.
) would result in a noticeable reduction in performance on the DYAMOND data (∆M SE ≥ 3.4 (%) 2 in absolute and (M SE abl −M SE f ull )/M SE abl ≥ 3.2% in relative terms).Even though Fig 6g) suggests a cubic dependence of cloud cover on ∂ z RH, it is the least important term to include according to Fig 8. Applied to the ERA5 data, we can even dispense with the entire I 2 term.Furthermore, we find that the quadratic dependence on RH can be largely compensated by the linear terms.The most important terms to include are those with cloud ice/water and the linear dependence on temperature.Coinciding with the SFS NN feature sequences in Sec 5.1.1,cloud ice (∆M SE = 96/102 (%) 2

Figure A1 .
Figure A1.The first row shows maps of I1(RH, T ), I2(∂zRH) and I3(qc, qi) on a vertical layer with an average height of 1490m.In the second row we zoom in on the contribution of the term in I1 corresponding to the a5-coefficient on three different height levels (roughly at 11 km, 4 km, 320 m).All plots are averaged over 10 days (11 Aug-20 Aug, 2016).The data source is the coarse-grained three-hourly DYAMOND data.

Figure B1 .
Figure B1.The distributions of cloud water and cloud ice on storm-resolving scales (2.5 km DYAMOND Winter data).For positive values we approximate these distributions very loosely with exponential distributions.

Figure 2 .
Figure 2. Averaged cloud cover from 11 Aug to 20 Aug, 2016 at an altitude of ≈ 1500 m.The data source is the coarse-grained three-hourly DYAMOND data.

Figure 3 .
Figure 3. Averaged cloud cover from 11 Aug to 20 Aug, 2016 at an altitude of ≈ 1500 m.The data source is the remapped three-hourly ERA5 reanalysis data.

Table 1 .
The percentage of data cubes that fulfill a given physical constraint.Only the cubes with a sufficiently large amount of samples are taken into account.The last column shows the proportion of cubes (across all sizes we consider) in which the constraint is satisfied on average.