Physics-aware nonparametric regression models for Earth data analysis

Process understanding and modeling is at the core of scientific reasoning. Principled parametric and mechanistic modeling dominated science and engineering until the recent emergence of machine learning (ML). Despite great success in many areas, ML algorithms in the Earth and climate sciences, and more broadly in physical sciences, are not explicitly designed to be physically-consistent and may, therefore, violate the most basic laws of physics. In this work, motivated by the field of algorithmic fairness, we reconcile data-driven ML with physics modeling by illustrating a nonparametric and nonlinear physics-aware regression method. By incorporating a dependence-based regularizer, the method leads to models that are consistent with domain knowledge, as reflected by either simulations from physical models or ancillary data. The idea can conversely encourage independence of model predictions with other variables that are known to be uncertain either in their representation or magnitude. The method is computationally efficient and comes with a closed-form analytic solution. Through a consistency-vs-accuracy path diagram, one can assess the consistency between data-driven models and physical models. We demonstrate in three examples on simulations and measurement data in Earth and climate studies that the proposed ML framework allows us to trade-off physical consistency and accuracy.


Introduction
Physicists and environmental scientists attempt to model systems in a principled way through analytic descriptions that encode scientific understanding and domain expertise of the underlying processes. Conservation laws, physical principles or phenomenological behaviors are generally formalized using mechanistic models and differential equations. Such classical approaches in physics have been, and remain to be, the dominant framework for modeling complex natural Earth and climate phenomena. With the availability of large datasets collected with different remote sensing instruments and, generally, cheap and widely distributed in-situ data collections, the physical modeling paradigm is being complemented, sometimes challenged (and in many cases replaced) by the statistical, machine learning (ML) paradigm, which offers a prior-agnostic approach that does not make direct use of existing scientific knowledge [1][2][3].
Machine learning models can fit observations very well, but predictions may be physically inconsistent or even implausible. For example, ML models can commit large extrapolation errors, and their predictions can violate fundamental laws like mass or energy conservation [4,5]. This has been perhaps the most important criticism to ML algorithms, and a relevant reason why, historically, physical modelling and ML have often been treated as two different fields under very different scientific paradigms (theorydriven versus data-driven). Likewise, there is an ongoing debate around the limitations of traditional methodological frameworks: both about their scientific insight and discovery limits in general [1,2] and in the geosciences and hydrology in particular [6,7]. Recently, however, integration of domain knowledge and achievement of physical consistency by teaching ML models about the governing physical rules of the Earth system has been proposed as a principled way to provide strong theoretical constraints on top of the observational ones [4,8,9]. The synergy between the two approaches has been gaining attention, with recent approaches including redesigning model's architecture, augmenting the training dataset with simulations, and including physical constraints in the cost function to be optimized [4,6,7,[9][10][11][12][13][14][15].
The integration of physics in ML models may lead to improved performance and generalization but, more importantly, to improved consistency and credibility of such models. This hybrid approach has an interesting regularization interpretation: the inclusion of domain knowledge in the ML model reduces the parameter space to search upon by discarding implausible models. Therefore, physics-aware ML models combat overfitting better, typically become simpler (sparser), and require less training data to achieve similar performance [7,8,15,16]. Physicsaware ML thus leads to enhanced computational efficiency, and constitutes a stepping stone towards the goal of achieving more interpretable ML models [8,[17][18][19][20].
Among the many ML models available, kernel methods [21] have shown excellent theoretical properties and practical performance in Earth observation [22] data problems. Kernel methods have been primarily used for classification and regression problems, but also for data clustering, anomaly detection and dimensionality reduction [23]. Kernel methods generalize linear methods easily while still relying on linear algebra operations. The idea is to implicitly map the data into a reproducing kernel Hilbert space (RKHS) [24] where nonlinearities are taken into account, and solve the problem in this new space rather than in the original data space. The solution then typically becomes analytic and only involves simple linear algebra operations on a kernel (similarity) matrix that contains all pairwise similarities between the training data samples.
Our main goal here is to reconcile data-driven models with physics modeling by incorporating physical knowledge in the ML models. More specifically, we propose a nonparametric physics-aware regression method that is based on kernel theory and enables us to understand the role of the physics component from information-theoretic and regularization perspectives. Including prior knowledge in ML has traditionally been associated with the concept of regularization [20,25,26]. Regularizers are typically designed to enforce some desirable features on the model predictions, like smoothness using the ℓ 2 -norm on model weights, or directly on the model parameters, like sparsity using an ℓ 1 -norm. Recently, other regularizers have been proposed to minimize the sum of all violations of a known physical law [6]. We adopt an alternative regularization approach based on statistical dependence. This approach was introduced for enforcing fairness in model predictions [27], which states that the model predictions should be as statistically independent of predefined sensitive variables as possible. Our hypothesis is that encoding algorithmic fairness and consistency with domain knowledge through regularization play similar roles [28].
Following this idea, our approach achieves (physical) consistency by encouraging the model's predictions to be dependent on variables that encode physical knowledge or independent from biased information, similarly to the way that fair algorithms ensure that the model's predictions are independent of sensitive or protected variables: income predictions should be arguably independent of gender and race, and in a conceptually similar way, estimates of the forced response in (simulated) global mean temperature should be independent of variations in internal variability in that respective climate change scenario. Therefore, our overarching goal can be defined as: that is, learn a function f that fits well a target variable y from an input variable x, and at the same time, either make the predictionsŷ dependent with the ancillary variables s or statistically independent of them, depending on the problem and application. In this paper, for f we use a nonparametric regression function and for measuring independence the norm of the cross-covariance operator, both based on kernel methods [23].
Inspired by the fair kernel learning method [27] used in the context of algorithmic fairness, we propose physics-aware kernel learning (PKL). PKL includes a dependence-based regularizer to a given objective function that, depending on the application, may enforce model predictions to resemble a physical model output, a set of simulated data, or additional/supplementary observations (see details in Proposed physics-aware nonparameteric regression section). PKL can be trained not only to increase dependence with underlying knowledge but, alternatively, can also be used to ensure predictions are independent of biased/irrelevant information that can arise e.g. from observational errors or anomalous data variability not related to the physical process of interest.
We propose to use the Hilbert-Schmidt independence criterion (HSIC) [29] to measure the dependence between the predictions and physical variables. HSIC excels in capturing all higher order moments of dependence between random variables, is easy to compute and manipulate, and has appealing theoretical properties of convergence to the true dependence measure [29]. The PKL method leads to a closed-form analytic solution. We will show that the PKL method tackles the problem of physical consistency and model-data agreement in a straightforward manner. The performance of PKL will be illustrated in several examples in Earth and climate sciences. For example, PKL allows us to assess the degree of realism of models in biophysical parameter retrievals, or to detect the effect of external forcing in the climate system while aiming for independence to the exact representation and magnitude of key modes of internal variability. We anticipate that the PKL approach will be of great utility and flexibility for nonparametric data analysis in applied research in general, and in Earth and climate sciences in particular.

Oceanic chlorophyll data
We used the SeaBAM dataset [30,31], which gathers 919 in-situ measurements of chlorophyll concentration around the United States and Europe. The dataset contains in situ pigments and remote sensing reflectance measurements (Rrs, [sr −1 ]) at a set of given wavelengths (412, 443, 490, 510 and 555 nm) that are present in the SeaWiFS ocean color satellite sensor. The chlorophyll concentration values are between 0.019 and 32.79 mg m −3 . Although SeaBAM data originate from various researchers, the variability in the radiometric data is limited. At high Chla concentrations CC (mg m −3 ), the dispersion of radiance ratios increases, mostly because of the presence of more optically complex (Case II) waters, that is low values of the ratio of pigment concentration to scattering coefficient. At lowest concentrations the highest R(490)/R(555) ratios are slightly lower than the theoretical limit for clear natural waters. More information about the data can be obtained from SEABAM, and an extensive analysis in [31]. In addition to the observational data, we used several models to guide PKL. Morel1 and CalCOFI 2-band linear are described by s = 10 a0+a1κ , while OC2/OC4 models follow s = a 0 + 10 a1+a2κ+a3κ 2 +a4κ 3 . The ratio κ depends on the physical model used [31]: the Morel1 model uses κ = log(R(443)/R(555)), the Cal-COFI and OC2 models use κ = log(R(490)/R(555)), while the OC4 model uses κ = log(max{R(443), R(490), R(510)}/R(555)). The goal here is to predict the concentrations from reflectance measurements while being consistent with these parametric models that encapsulate some domain knowledge.

Hyperspectral and vegetation in situ data
We collected hyperspectral data from the CHRIS sensor as well as in situ measurements of chlorophyll content (Chla), leaf area index (LAI) and fractional vegetation cover (fCOVER) [32]. LAI is a dimensionless quantity that characterizes plant canopies. It is defined as the one-sided green leaf area per unit ground surface area in broadleaf canopies. fCOVER corresponds to the fraction of ground covered by green vegetation, and in practice it quantifies the spatial extent of the vegetation. It is indeed independent from the illumination direction and sensitive to the vegetation amount and derived from the LAI plus a number of structural parameters of the canopy. The data were obtained during the SPARC-2003 (SPectra bARrax Campaign) and SPARC-2004 campaigns in Barrax, Spain. The region consists of approximately 65% dry land and 35% irrigated land. Green LAI was derived from canopy measurements made with a LiCor LAI-2000 digital analyzer. Each elementary sampling unit (ESU) was assigned to a LAI value, which was obtained by the average of 24 measures (8 data readings × 3 replications) [33]. fCOVER was estimated from ground measurements using hemispherical photographs taken with a digital camera with a fish-eye lens. The final fCOVER estimate for each ESU was calculated as the average of twelve measurements. In total, nine crop types (garlic, alfalfa, onion, sunflower, corn, potato, sugar beet, vineyard and wheat) were sampled, with fieldmeasured values of LAI that vary between 0.4 and 6.3, Chla between 2 and 55 µg cm −2 and fCOVER between 0 and 1. Additionally, 30 random bare soil spectra with a biophysical (Chla, LAI, fCOVER) value of zero were added to broaden the dataset to nonvegetated samples. Concurrently, we used CHRIS images Mode 1 (62 spectral bands, 34 m spatial resolution at nadir). The images were geometrically and atmospherically corrected. A total of n = 136 data points in a 62-dimensional space were thus used to fit a PKL model. The goal here was to estimate LAI while being consistent with fCOVER, and estimate chlorophyll content while being consistent to LAI.

Internal variability and forced variability data
The US CLIVAR Working Group on large ensembles (LEs) contains a data archive of initial-condition LEs conducted with different climate models run within their CMIP5 setup. From the seven climate models available, we selected three different ones: CanESM2 for training, CSIRO-Mk3-6-0 for validation, and CESM1-CAM5 for testing. We emphasize that different train/validation/test splits are possible (and should be tested in real-world applications), but here we only show an illustration which is why we present this setup and given that results are robust for alternative choices. Model simulations for all three contained the period between 1950 and 2100 with historical and RCP 8.5 forcing conditions and incorporated more than 30 different ensemble members [34]. All model runs were defined with a spatial regridded world map of 5 • × 5 • resolution (in total d = 2592 grid points). The aggregated ensemble of each model, and all the grid points, defined the predictors matrix. We split it into a training, validation and test sets. We selected the spatially explicit simulated monthly near-surface air temperature (TAS) as our variable of interest (i.e. predictors). The ENSO internal variability is captured by the Niño3.4 index, and is taken from the climate variability diagnostics package for LEs developed by NCAR's Climate Analysis section [35]. Our goal is to predict the forced climate response from a single ensemble member, and the 'true' forced response was extracted as the average across the full ensemble, which is a standard procedure in the field. The predictors and metrics used were taken as the DJF seasonal average and standardized to zero mean and unit standard deviation.

Proposed physics-aware nonparameteric regression
We are given a set of inputs, x i ∈ X , and the corresponding targets, y i ∈ Y, for i = 1, . . . , n. Furthermore, we define s i ∈ S the set of variables to which we want to emphasize the (in)dependence. These will be referred to as the ancillary variables. We take x i to be an i.i.d. sample from an X -valued random variable x, and similarly for s. For simplicity, we will assume that the inputs are vectorial, i.e. x i ∈ R d×1 , s i ∈ R q×1 and that the targets are scalars, i.e. y i ∈ R, but the exposition can be trivially extended to non-Euclidean or structured domains which admit positive definite kernel functions. We let X ∈ R n×d denote the matrix of n observed inputs corresponding to d explanatory covariates, S ∈ R n×q denotes the matrix of n observations of q ancillary variables, y ∈ R n×1 denotes the vector of observed targets, which we could assume are distorted with biases or noise, andŷ is the prediction.
Fitting a consistent-regularized model f * ∈ H for some hypothesis class H reduces to optimizing a regularized empirical risk functional [27,36]: where V is the loss function, Ω acts as an overfitting/ complexity penalty on f, and I measures the statistical (in)dependence between the model f and the ancillary variables, the latter depending on the sign that accompanies I. The two regularization parameters λ ∈ [0, ∞) and µ ∈ (−∞, +∞) control smoothness and consistency of the solution, respectively. Note that the sign of µ forces dependence or independence. By setting µ = 0, the solution of the standard kernel ridge regression model is obtained. This does not mean that the solution is inconsistent but one can find a more consistent model, and thus conflicting less with ancillary information, by changing µ. As µ = 0 implies a trade-off between different objectives, selecting an optimal value is subject on the application [37]. On another note, one should be aware that the problem could be ill-conditioned for high negative values of µ.
For the consistency penalization term, I, we adopted the HSIC [29] between the predicted response f = [ f(x 1 ), . . . , f(x n )] and the ancillary variable s = [s 1 , . . . , s n ]. The HSIC measures independence between random variables f and s. Given the dataset D with n samples drawn from the joint distribution P( f, s), an empirical estimator of HSIC is defined as [29]: where K and L are the kernel matrices computed using kernel functions k and l, respectively, and H = I − 1 n 11 ⊤ has the role of centering the data in the feature space. For a broad family of kernels k and l, the population HSIC equals 0 if and only if the two involved variables are statistically independent [29]. With appropriate choices of kernels k and l for the input data X and the ancillary variable(s) S, respectively, the HSIC regularizer captures all types of statistical dependence between f and the ancillary variable s.
In this work we use the HSIC regularization in combination with kernel ridge regression as the model function class, which leads to a closed-form (analytic) solution. This regularization can be incorporated in other ML models, like neural networks and Gaussian processes. More information on the kernel physics-aware kernel regression (PKL) solution used in this work, and connections to other ML models can be found in the appendices. An illustration of the PKL method for a toy example can also be found in the supplementary material S1 (available online at stacks.iop.org/ERL/17/054034/mmedia).

Results and discussion
We describe three cases to illustrate the capabilities of PKL to encode physical knowledge about the system under consideration and to assess consistency between physical knowledge and the data. The PKL solutions are summarized in the form of an easy to understand consistency-vs-accuracy path diagram, i.e. HSIC-vs-RMSE path, that describes the relationship between the degree of consistency with physical knowledge and the accuracy of the regression model. This relationship is described as a function of the amount of dependence-based regularization that is captured by a hyperparameter µ, which varies between −∞ (dependence) and +∞ (independence). By observing how the path diagram changes upon encouraging (or alternatively reducing) dependence, one obtains an indication of how physical variables and model predictions are connected. This analysis allows us to assess the agreement between model predictions and a set of physical variables by comparing their corresponding path diagrams. Considering that HSIC takes values between zero and infinity, the corresponding HSIC (µ) values are then scaled by the HSIC (µ = 0) value to allow comparison across multiple consistency-vs-accuracy paths: the normalized HSIC is equal to 1 when µ = 0, and corresponds to the standard kernel regression solution. In what follows we will use x for the input, y for the target and s for the ancillary variables (see proposed physics-aware nonparameteric regression section).

Consistency with models for biophysical parameter estimation
A classical problem in remote sensing and geosciences involves estimation of biophysical parameters of interest from remote sensing (satellite) observations. We are given multidimensional measurements x ∈ R d , acquired from a satellite sensor at d different wavelengths, from which we want to estimate a parameter of interest y ∈ R. This can be done using satellite and in-situ measurement pairs (x, y). Traditionally, the community has designed a great many empirical, parametric models for estimation. We aim to use the PKL method to solve the fitting problem and, more importantly, to assess the most appropriate parametric model to rely on by studying the consistency-vs-accuracy path diagrams.
We first illustrate the performance of the PKL model for the estimation of in-situ chlorophyll concentration from multispectral reflectance images. The two measurements are subject to high levels of uncertainty, owing both to the difficulties in groundtruth data acquisition, and the noise inherent in satellite-obtained data. Moreover, there is commonly a time mismatch between the acquired image and the recorded in-situ measurements, which is critical, for instance, for coastal water monitoring. We use the SeaBAM dataset [30,31] (see details in Data section), and apply PKL to model ocean chlorophyll concentration y from radiances x := [R(λ 1 ), . . . , R(λ d )] ∈ R d at a set of given wavelengths, λ i , while encouraging consistency with four standard parametric models (Morel1, CalCOFI 2-band linear, OC2 and OC4) [31] in a set of four different experiments, one for each model as the ancillary variable s. Results are shown in figure 1. PKL solves the fitting problem with improved accuracy, i.e. a lower error (RMSE) can be achieved compared to standard kernel ridge regression (µ = 0), and allows to quantify the quality of the different parametric models considered through their respective consistency-vs-accuracy paths. Encouraging consistency with more recent models, such as OC2 and OC4, leads to a lower PKL prediction error (lower RMSE) compared to older models, such as Morel1 and CalCOFI, meaning that they fit the data better (i.e. the consistency between the data and the model is higher). When increasing dependence with each parametric model, the HSIC regularizer begins to dominate the training cost function and similarity of the target variable with the ancillary variable is reinforced. This happens until a certain turning point in µ where too much dependency leads to an increased error. In summary, this simple example shows how PKL may benefit the estimation of Earth system parameters such as the in-situ chlorophyll concentration from remote sensing by enforcing consistency with well-known parametric models that encapsulate domain knowledge.

Consistency with ancillary in situ data
Very often one does not have access to a parametric model as in the previous example, but to a set of ancillary observational data that the predictions should be consistent with. This is a standard case in remote sensing where some variables are coupled, e.g. greenness and chlorophyll content of the vegetation. We illustrate the use of the PKL in two cases: (1) to estimate leaf area index (y := LAI) while being consistent with the fraction of green vegetation cover (s := fCOVER), and (2) to predict chlorophyll-a concentration (y := Chla), while being consistent with the leaf area index (s := LAI). We use data from a terrestrial campaign where hyperspectral images and the aforementioned biophysical parameters were measured (see Data section for details). Figure 2 shows the joint evolution of HSIC and RMSE. The color intensity along the paths represents the increasing (decreasing) value of the consistency hyperparameter µ as it goes from reducing consistency with the ancillary variable (positive) to increasing consistency (negative). Optimal solutions, i.e. predictions with lower RMSE and higher HSIC to the ancillary variable, are obtained for negative values of µ, meaning that increasing the consistency between the variables (LAI-fCOVER and Chla-LAI) also helps in reducing the RMSE of the prediction. Similarly to the example of parametric models, higher consistency allows for higher accuracy (lower error) up to certain values starting from which a trade-off appears between consistency and accuracy. This example illustrated that, while many accurate solutions are possible when retrieving biophysical parameters with ML, one can achieve an optimal solution that is both accurate and ensures consistent results across variables.

Learning patterns of forced warming under uncertain climate variability
Separating forced and internal variability is a key challenge in climate science [34,38,39], and in particular in the context of detection and attribution (D&A) of externally forced climate change [40,41]. D&A typically uses fingerprints that encapsulate the structure of forced warming from model simulations into a spatial or spatiotemporal pattern [40][41][42]. Observations and climate model control simulations are then projected onto those fingerprints to assess whether a signal, such as multidecadal temperature trends at the global or continental scale, exceed internal climate variations [42].
In recent research, forced climate signals have been extracted from observations via signal-to-noise optimized fingerprints from climate models obtained through statistical learning [39,[43][44][45], thus building on ideas from traditional D&A. Statistical learning algorithms in this context aim to minimize the influence of internal variability and model disagreement to extract a forced signal [46]. However, patterns of internal variability (such as El Niño Southern Oscillation, ENSO) may nonetheless project onto the fingerprint, for example, if the fingerprint is derived from a given model (or a given set of models) but then applied to an unseen test model (or observations) with a potentially different representation of key modes of internal variability, such as ENSO. This phenomenon may lead to an inaccurate extraction of forced signals from models or observations, and could lead to over-or under-confidence in D&A statements if models systematically under-or overestimate the magnitude of internal variability [44,47]. Hence, robustness in fingerprint extraction with respect to model structural differences, or potential systematic biases in the models' representation of internal variability or of forced patterns, is an important issue that still remains a key uncertainty. This issue is related to distributional robustness in statistics [48] and transfer learning in ML [49]. For example, climate models simulate sea surface temperature patterns and warming that are broadly consistent with observations on a global scale [50], and also capture key modes of internal variability such as ENSO [51]. However, models differ (1) in the time scales of irregular ENSO fluctuations, (2) in the characteristic patterns of ENSO-related climate variability, and (3) in how ENSO variability is projected to evolve in a warming climate [51]. Moreover, differences across models in the forced warming and variability in multidecadal trends are particularly pronounced in the equatorial Pacific [50,52] [52]. It is currently unclear whether this discrepancy is due to an unusual realization of the observed climate (i.e. internal variability), or whether the models show systematic biases in the forced response compared to observations [50].
Here, we illustrate the idea of estimating the forced climate response from individual ensemble members (similar to e.g. [44,46]) but in a way that takes into account the ENSO-related variability by reducing consistency to the ENSO3.4 region variability expressed by the Niño3.4 index. Reducing consistency in this context thus implies that ENSOrelated variability does not project onto the fingerprint (i.e. the prediction of the forced response should be independent of the ENSO3.4 region variability), and hence our goal is to attain improved robustness in the context of uncertainties related to variability in the equatorial Pacific.
For this purpose, we employ the multi-model large ensemble archive [34] and the climate variability diagnostics package [35] (more details in Data section). Winter (December-January-February, DJF) seasonal average values of the forced response in each climate model are calculated as the ensemble average of multiple model runs that differ from each other by a small perturbation. By taking the average of the ensemble, the small oscillations between models that are attributed to internal variability are smoothed out and the end product reflects the forced climate response in the absence of variability [35]. These values constitute the target variable y of our prediction. The statistical model is based on DJF anomalies in individual ensemble members as predictor variables x. We run the PKL with Niño3.4 as the ancillary variable s that we want the prediction to be independent to (i.e. PKL is run with positive increasing values of the consistency hyperparameter, µ > 0). For illustration, we employ a different climate model for training (CanESM2), validation (CSIRO-Mk3-6-0) and testing (CESM1-CAM5).
Results are presented in figure 3. Without reducing consistency (µ = 0), the predictionŷ and Niño3.4 are fairly correlated (Pearson's correlation ρ = −0.72) and dependent from the prediction for the unseen test model ( figure 3(a)). This implies that ENSO-related uncertainties or potential systematic biases may alias into the prediction of the forced response for an unseen model or in observations. However, if Niño3.4 is taken as the ancillary variable and upon reducing consistency, µ > 0, the prediction becomes more decorrelated (ρ(ŷ, s) = 0.03) and independent (HSIC (ŷ, s) = 0.0005) of the variability introduced by ENSO3.4 ( figure 3(b)). The results imply that our forced response estimate would be more robust to ENSO-related uncertainties with an appropriate magnitude of µ across different climate models or between models and observations. An illustration of the effect the independence constraint has for the prediction can be seen in the supplementary material S2.
In summary, we have illustrated that estimating a climate signal (such as the forced response) from climate variables can be learnt in a way that is 'consistent' to certain known model structural uncertainties such as the representation of internal variability and warming in the equatorial Pacific that determine ENSO-related variability in the climate system.

Conclusions
Machine learning models have been traditionally considered black box models where interpretability of the decisions is hidden behind a complex architecture. ML models do not necessarily carry physical meaning, even if generally accurate, and hence they can produce non-physical or even nonsensical predictions. A robust trustworthy model should be in accordance with the known and agreed rules of the physical world. This dichotomy between how humans encode knowledge mathematically in mechanistic models and how ML models encode knowledge through their expression learnt from data has been the reason for longstanding debates in the last century: data-driven versus model-driven science.
In the last decade, performing accurate predictions has replaced process understanding in many applied areas of science and engineering. In these areas it is believed that data is the only needed regularizer of the model class. The problem, however, comes in cases of data-limited regimes, model misspecification, or omitted-variable bias. In all these cases, the ML model is an incomplete and often simplistic representation of reality, and generally attributes the effect of the missing variables to the estimated effects of the included variables. Besides, one can achieve similarly accurate results with completely different models; that is the issue of equifinality or non-identifiability. In recent years, many efforts have been conducted to develop ML algorithms that are more transparent, accountable, interpretable and consistent with first principles. However, developing hybrid ML models that include expert/domain knowledge (in the form of physics rules, parametrization and constraints) have been only marginally treated in the recent literature, for example, using deep neural networks, kernel methods and Gaussian processes [4,6,[9][10][11][12][13][14][15]. The field is interesting in broad terms and generally applicable to all domains where data and models coexist.
We proposed a general methodology based on kernels to combine two types of knowledge priors: smoothness of the function class and consistency of the predicted target with ancillary variables, with particular focus on Earth and climate sciences. Motivated from the algorithmic fairness literature, we introduced a physics-aware kernel-based model that exploits ancillary information or outputs from mechanistic models to regularize pure data-driven ML based on scientific knowledge. The PKL regression model includes a regularizer that measures the dependence of the learned function with physical variables, ancillary data or model simulations. The selection of a kernel-based regularizer based on HSIC leads to a simple and analytic solution where a probabilistic model interpretation is also amenable. HSIC ensures fast (exponential) convergence rates of the population measure to the population quantity as the sample size grows. PKL generalizes both linear and nonlinear kernel-based regression models, it is easy to implement and inherits all properties of the kernel methods treatment. For instance, recent advances in kernel and Gaussian processes modeling could be included in our predictive function, e.g. sparse formulations for improved computational scalability, deep models for improved expressive functions, or machines to learn ordinary/partial differential equations and control [10,11,15,53]. Likewise, the HSIC regularizer could be used in other ML models like neural networks. See appendices for more details on the derivation of PKL, a GP probabilistic interpretation of the HSIC regularizer and the use in standard feedforward neural networks. The present work focuses on kernel methods for their simplicity, since solutions are analytically available. Nevertheless, adding a physics-aware regularizer in the form of HSIC can be included in the loss function of other types of ML methods. The PKL model is fully functional, and being implemented as an open-source software tool,it allows to reuse current and upcoming kernel models and tools.
We anticipate that the PKL will constitute a convenient, useful and flexible tool for nonparametric data analysis in applied research where data, process understanding, principled models, and simulations are generally available. This is the general case in Earth sciences, but also for climate science studies where statistical learning is often adversely affected by uncertainties in regional variability characteristics, disagreement between models and structural uncertainty, and systematic biases.

Data availability statement
The data that support the findings of this study will be openly available following an embargo at the following URL/DOI [54]: https://github.com/IPL-UV/PKL. Data will be available from 18 April 2022.

Consistency regularization framework
The PKL functional for the function class f used in this work, is defined as: where we adopted the reproducing RKHS H k as the hypothesis class. Given that the selected consistency penalty term, HSIC, only depends on the unknown function f through its evaluations at the training inputs {x i }, direct application of the Representer's theorem [55] tells us that the optimal solution can be written as a linear combination of kernel function evaluations f = ∑ n i=1 α i k(·, x i ). Hence, we obtain the so-called dual problem: which can now be solved for α directly. In the case of the squared L 2 norm for V := f(x) − y 2 2 , the above dual problem has an analytic solution [27]: The dependence estimator is sensitive to the hyperparameters of the kernel functions k and l. To avoid variations in HSIC exclusively due to changes in these parameter values, some form of normalization is needed. Nevertheless, one cannot normalize both kernels arbitrarily since the optimization problem would become nonlinear with α and would not admit a closed-form solution anymore. As the parameters from K are adjusted to the data during the optimization process, one could partially normalize the cross-covariance on L. We introduce a modification inspired by the normalized version of HSIC called NOCCO [56]. This modification simply replaces HLH with HLH(HLH + nϵI) −1 , where ε is a regularization parameter used in the same way as [56]. The solution then becomes: In all our experiments, we used the squared exponential (SE) kernel for both k and l, that is e. g. k(a, b) = exp(− a − b 2 /(2σ 2 )), which captures sample similarity well in most of the problems, and only one hyperparameter σ needs to be tuned. For the standard kernel ridge regression (KRR) we implemented a standard cross-validation scheme to determine the ridge regression parameters, σ and λ, setting the squared error loss as the metric to minimize. For the ancillary variable we set the SE hyperparameter to the average Euclidean distance between all points, which is a commonly used heuristic in the kernel methods literature [23,57]. A much larger sigma would make HSIC approximate a linear dependence estimate, while a much smaller legnthscale σ would make HSIC insensitive to dependence [29].

A probabilistic treatment: the Gaussian process approach
Using HSIC in a kernel-based regression framework can be actually seen as a modified Gaussian process prior. From a Gaussian Process (GP) perspective, the prior covariance corresponds to a posterior covariance for meta-observations encouraging that the predictive function f remains constant as the ancillary variables vary. Let us assume that the loss corresponds to the negative conditional log-likelihood in some probabilistic model, i.e. that V( f(x i ), y i ) = − log p (y i |f(x i )), which is true for a wide class of loss functions. We write V(y, f) = − ∑ n i=1 log p(y i |f(x i )) to denote the rescaled conditional negative log-likelihood. Consider now using explicit feature mapping x i → ϕ(x i ) and denoting the feature matrix by Φ, we have f = Φβ and recast optimization as: where β play the same role as α in the PKL, and we defined δ = µ/λn for convenience. The two regularization terms correspond, up to an additive constant, to a negative log-prior of , which in turn gives a prior on the evaluations: By directly applying the Woodbury-Morrison formula, the covariance matrix in this prior becomes (K −1 + δHLH) −1 , compared to K in the standard GP case. Thus, adding an HSIC regularizer corresponds to modifying the prior on function evaluations f. The posterior mode in a Bayesian model using a modified GP prior becomes: where k X· = [k(·, x 1 ), · · · , k(·, x n )] ⊤ , for any training set {x i } n i=1 . Treating the learning problem in the GP framework actually allows to derive uncertainty estimates and hyperparameter tuning using marginal likelihood maximization.

Neural network training with the physics-dependence loss
Neural networks can also benefit from the new loss including the dependence term. Including the HSIC term in standard backpropagation leads to simple updating rules for training neural networks. Let us denote a feedforward neural network functionŷ = F(X; W), parameterized by a set of weights W. The backpropagation algorithm uses gradient descent to adjust model weights iteratively W[t] ← W[t − 1] + η∇J W , where J = e is the loss (cost, energy) function that depends on training error e = y −ŷ and η > 0 is the learning rate. Including the physics-dependence loss in a neural network training is straightforward, and only involves replacing the output error e = y −ŷ with e ′ = y − (λI + µSS ⊤ )ŷ, which intuitively trades the training error for physics consistency. The standard backpropagation algorithm can be applied to the new error function, J ′ W := e ′ 2 . Alternatively one could use other training algorithms like Adam or automatic differentiation if the interest is to learn the dependence measure parameters end-to-end. The algorithm allows us to train large scale problems while encoding physical consistency concepts easily.