Causal hybrid modeling with double machine learning

Hybrid modeling integrates machine learning with scientific knowledge to enhance interpretability, generalization, and adherence to natural laws. Nevertheless, equifinality and regularization biases pose challenges in hybrid modeling to achieve these purposes. This paper introduces a novel approach to estimating hybrid models via a causal inference framework, specifically employing Double Machine Learning (DML) to estimate causal effects. We showcase its use for the Earth sciences on two problems related to carbon dioxide fluxes. In the $Q_{10}$ model, we demonstrate that DML-based hybrid modeling is superior in estimating causal parameters over end-to-end deep neural network (DNN) approaches, proving efficiency, robustness to bias from regularization methods, and circumventing equifinality. Our approach, applied to carbon flux partitioning, exhibits flexibility in accommodating heterogeneous causal effects. The study emphasizes the necessity of explicitly defining causal graphs and relationships, advocating for this as a general best practice. We encourage the continued exploration of causality in hybrid models for more interpretable and trustworthy results in knowledge-guided machine learning.


Introduction
Machine learning (ML), specifically deep learning (DL), has proven to be effective in identifying and modeling complex patterns from data sets.This led to unprecedented progress in fields such as computer vision [1], natural language processing [2], and speech recognition [3].These data-driven models also increasingly complement or even substitute mechanistic methods in science [4,5].
In the Earth sciences, for instance, the common way to understand and model the Earth's properties, structure, and processes is using knowledge of first principles, realized in mechanistic models based on functional equations [6].These models allow principled predictions of how the system under study would behave under different conditions [7].Nevertheless, they are not always sufficient to capture the complex and usually not completely known relationships in the real world.
Computational constraints and missing understanding have led to simplified or even missing representation of important processes in the current generation of climate models [8].Structural limitations often necessitate parameterizations to approximate complex processes.Significant uncertainties include the representation of cloud feedbacks [9], resolving ocean components at varying resolutions [10], surface energy partitioning [11], representing key processes like vegetation response to CO 2 [12], and difficulties in representing functional structures across different biome types [13].Addressing these challenges is essential for enhancing the accuracy and reliability of Earth system models in projecting future climate change and weather extremes.
Integration of machine learning (ML) with abundant Earth data presents a promising avenue to overcome the limitations of current Earth system models [14,15].Support vector machines [16], random forests (RFs) [17], or neural networks (NNs) [18] are highly flexible, make little prior assumptions on the functional form and can integrate the large datasets abundant in Earth and climate sciences.
The flexibility of ML models comes with some known downsides: (i) Many popular machine learning models are black boxes, meaning that we do not understand the internal reasoning behind the model's predictions [19].(ii) Often, ML models are not robust and fail to generalize out of the domain of the data used for training [20,21].(iii) They violate physical properties and laws of nature, such as conservation laws, symmetries, or equi-and invariances [14,22].These are crucial matters in Earth and climate sciences, where a prime goal is to make realistic predictions on the Earth's system under a changing climate [23].
All these issues are gaining attention in ML and Earth system science literature.Research in generalization and extrapolation aims at ensuring robustness outside of the training domain [24][25][26].Explainable artificial intelligence (XAI) tackles questions on the explainability of black box models [27][28][29], which find growing usage in remote sensing problems [30,31].At the same time, the general goal of explaining black boxes is being challenged by advocates for glass box models, i.e., inherently interpretable models [32,33], and there is an ongoing debate on the evaluation and rigorousness of XAI methods [34,35].
A flourishing area of research is science-aware or knowledge-guided machine learning (KGML), which combines the knowledge-driven and data-driven worlds to overcome inconsistencies [36].These methods increasingly find their way into various domains within Earth sciences [37][38][39][40][41][42].One example is physics-informed neural networks (PINNs) [43], where an additional term is added to the loss for training that punishes deviations from physical laws encoded with ODEs or PDEs.Alternatively, ML models can be trained on a combination of data and simulations from physical models to improve consistency in the sparse observation regime [37].
Finally, hybrid modeling replaces some components of mechanistic models with machine learning [44][45][46].This constraint makes the models more interpretable and serves as a regularizer for better generalization to unseen data.If we use deep learning models as the machine learning component, the only requirement for fitting these hybrid models is that the parametric components are differentiable [47].Then, gradient-based optimization allows joint optimization of the neural network (NN) parameters and physical parameters of the mechanistic model and leads to seamless data integration.In the following, we will refer to this as gradient-descent-based hybrid modeling (GD-based HM).It serves as a baseline for our proposed method.
There are persisting challenges in hybrid modeling.Firstly, these models are prone to equifinality, which denotes the existence of multiple models and sets of parameters that describe the data similarly well.Already in the standard mechanistic modeling, this is a well-known difficulty when not only model performance but also retrieving meaningful parameters is the goal.In this setting, robust inference already poses a challenge [48], which becomes even more difficult and prohibitively expensive in deep learning [49,50].Ultimately, equifinality can jeopardize the interpretability of the results.Second, regularization techniques in machine learning can introduce bias on the physical parameters [45].Finally, given the flexibility of non-parametric models such as NNs, it is tempting to use different sets of variables for the model and choose the ones that lead to the best overall performance.For a pure prediction task, that is a sensible procedure [51].For hybrid modeling, though, apart from equifinality, this can lead to bias or different interpretations of the parameter of interest in the causal sense.We might be right for the wrong reasons and imperil the desired interpretability of the hybrid model (see Box 1 for an illustrative example).
In many instances, physical equations encode actual cause-effect relationships.It is essential to capture the causal relationships between the variables to obtain interpretable and more accurate models.Respecting the causal direction of time has shown to be effective in training PINNs for chaotic systems where previous approaches failed [52].Furthermore, coupling causal discovery to identify the causal drivers in climate models before applying deep learning algorithms improved performance and interpretability [53,54].Causally constrained recurrent NNs more accurately reflect underlying processes and were shown to enhance our understanding of methane in wetlands [55].Ultimately, causality aims at being right for the right reasons.
Therefore, we believe it is time for a causal hybrid modeling framework, where we introduce an explicit physical prior by assuming a causal graph and framing the problem as a causal effect estimation problem within the hybrid modeling framework.We will show how this approach leads to well-defined problems, thus mitigating equifinality and being robust to biases of training and regularization.As a first step, we propose a method based on double machine learning (DML) [56].DML is a causal effect estimation technique developed in econometrics, where it is common to investigate the effect of some proposed treatment on an outcome variable [57,58].It has recently been used for effect estimation in the environmental sciences [59].We suggest that this causal effect estimation technique can be applied to a class of hybrid models where the effect of some input driver on the output is encoded.We coin this method DML-based hybrid modeling (DML-based HM).
Apart from the causal perspective, DML has favorable properties over naive fitting approaches.Regularization of the estimators for the non-parametric part of the equation can introduce substantial bias in estimating the parametric part of the equation.Using DML, even for erroneous estimators, we can still obtain consistent estimators of the causal effect coefficient.This is particularly useful if the confounding effects are high-dimensional or are described by a complicated function that is hard to learn.Furthermore, it enables us to make inferences, as the estimators are shown to be approximately normally distributed, which yields confidence intervals [56].
Within the proposed framework based on DML, we can solve problems that can be transformed into a regression problem of the form where T is a one-dimensional input variable and X and W are further sets of predictors.We assume that f is a known transformation of T , and our hybrid modeling goal is to estimate the non-parametric functions θ and g.We will see relevant examples of problems that fall into this class.This includes, in particular, the problems where θ describes the effect of T on Y .This effect can be constant or depend on some other predictors X.
We demonstrate the advantages of DML-based HM in two examples around carbon fluxes: 1.The temperature sensitivity Q 10 model for ecosystem respiration [60][61][62] and, 2. the light-use efficiency model for carbon flux partitioning [63].
These two models are particularly relevant as they allow statements on the productivity and respiration of plants under changing conditions.
Our contributions are as follows: In the case of synthetic data for Q 10 , DML retrieves the Q 10 temperature sensitivity parameter more robustly and efficiently than the GD-based HM approach, especially in the low data regime and under regularization.It retrieves Q 10 values consistent with the literature on measured respiration data.We show how equifinality can yield misleading results and how causal prior knowledge can solve the problem without giving up flexibility.In the carbon flux partitioning problem, we show how the method can be extended to the non-linear heterogeneous case, where the hybrid modeling retrieves consistent fluxes and shows competitive performance to the current state-of-the-art neural network.
In essence, we introduce DML-based HM as a novel approach to fitting hybrid models and show that the obtained estimates are more efficient and robust than the ones from GD-based HM.We describe a path to better pose problems with equifinality, enforcing causal interpretability instead of hoping for it.
T Y X, W θ(X) Figure 1: Causal graph of treatment effect estimation of T on Y .Sets X and W can enter both as confounders and mediators.Treatment effect θ can be heterogeneous and dependent on X or constant.

Box 1: Equifinality in hybrid modeling
Modeling the temperature dependence of ecosystem respiration R eco is a fundamental step in better understanding biosphere evolution and responses under global warming scenarios [64][65][66].The functional relationship between temperature and respiration has been classically represented via the Q 10 respiration model: where Q 10 is the parameter describing temperature sensitivity, X is a set of meteorological drivers and R b describes the base respiration.Including air temperature T A as a driver of R b is an optional choice if we are to believe that there are effects of temperature beyond the exponential dependency through Q 10 .A common hybrid modeling approach amounts to using a NN as an estimator for R b , treating Q 10 as a trainable parameter, and fitting everything end-to-end with gradient descent, as it has been done in [45].
Equifinality in this problem can be shown by reformulating (2) for c > 0: as the temperature sensitivity.In this case, we would obtain one of the solutions by chance and thus reach erroneous conclusions about the temperature sensitivity.
In this example, equifinality arises because the problem is mathematically ill-posed.It is less obvious, however, when introducing several non-parametric models in more complicated physical equations.In practice, we will obtain a distribution over the parameters mainly driven by inductive biases of the learning algorithm or the network architecture [67] and which are not guided by any physical knowledge.Additional explicit information can alleviate this problem.These include the introduction of additional losses or adding prior knowledge [68,69].Similarly, a regularization term can make the problem identifiable.This has been formally proven for solving hybrid ODEs [70].Regularization, however, is known to introduce bias on parameters of interest in semi-parametric modeling problems [56].
2 Double machine learning for hybrid modeling -a causal perspective Our setting considers problems that can be expressed as in (1), which can be studied under a causal perspective, see Fig. 1.The parameter θ describes the direct effect of some treatment variable T on the outcome variable Y .Moreover, we have access to sets of predictors X and W that are confounding or mediating the effect of T on Y .Confounders are common causes of T and Y , while mediators are variables through which T indirectly affects Y .The inclusion of mediators has important implications for the interpretation of the results.When we estimate the effect of T on Y with mediators, we only obtain the direct effect by discounting the effects through these mediators.The variables in X can further enter as effect modifiers by modulating the effect θ of T on Y .Technically, we can use all mediators and confounders as effect modifiers when we include them all in X, leaving W empty, or treat θ as a constant effect by instead leaving X empty.At this point, we need to be careful with the choices of control variables X and W as we need to assume that all relevant confounders are observed and included.In particular, this means we need to be careful not to include mediators that have an unobserved common cause with Y or that we introduce a common effect of T and Y .Both cases would open a new path and substantially bias the estimation [71].As per the DML framework, we must define an auxiliary equation that models the confounding and mediating effects of X and W on T .Assuming, without loss of generality, centered noise for both equations, we obtain Sometimes, the original problem formulation must be manipulated to fit our setting.We will see examples of given transformations f , though the identity f (T ) = T could also be used when the relationship is assumed linear in T .The causal effect θ is modeled either as a constant coefficient or as a function of some covariates (heterogeneous effect).
We proceed according to the partialling out method in the DML framework [56]: We call the estimators in (i) and (ii) the first-stage estimators.The primary benefit of the DML framework is that it yields fast estimation rates and, under certain assumptions, asymptotic normality of θ.It is robust to errors in the first-stage estimators due to overfitting or regularization bias.This robustness stems from the observation that the moment equations corresponding to the final least squares loss in (iv) fulfill Neyman orthogonality with respect to the first-stage estimators [56].This approach has been analyzed for a large set of model classes [56,[72][73][74][75].For example, any combination of linear regression, decision trees, support vector machines, or NNs can be used to model the treatment and/or the outcome models.Similarly, any of these or a combination of models could be chosen to estimate the treatment effect.To maintain the theoretical guarantees of the DML framework, it is important to split the data and perform the first two fitting steps ((i),(ii)) on a different data subset than the last fitting step for the residuals (iv).By doing cross-fitting, data efficiency can be maintained.
If the only object of the analysis is the interpretable treatment effect θ, the task is completed by the above DML procedure.Nevertheless, as is usually the case in hybrid modeling tasks, we are probably also interested in obtaining an estimator of g.For this, we have two options: The plug-in estimator (i) uses all estimators fitted in the previous steps and can be obtained at no additional computational cost.A derivation of this estimator is given in Appendix A.1.On the downside, in contrast to θ, there are no theoretical guarantees on how well it describes g.Option (ii) adds a final supervised learning step, with the advantage being that we are not limited to using the X and W to estimate θ.Once θ has been estimated in a well-posed setting, we can now introduce, for example, T as a driver in the estimation of g.We can combine all estimators to obtain the fitted hybrid model for Eq. ( 1) (see Fig. 2 for a summary of the proposed procedure).By separating the problem into a causal inference and a standard supervised learning step, we have maintained its well-posedness.Next, we will explain how this technique can be effectively applied in two use cases around carbon fluxes.

Case studies
Carbon fluxes are crucial in the global carbon cycle, a key component of the Earth's climate system [76].Net ecosystem exchange N EE is the net carbon dioxide flux measured using the eddy covariance (EC) technique [77].The data for our studies is half-hourly data from FLUXNET, a global network of EC towers that collect data on carbon dioxide, energy fluxes, sensible heat fluxes, and water vapor exchange between the atmosphere and the terrestrial biosphere [78].It offers comprehensive measurements of meteorological parameters and constitutes a crucial data source for ecosystem modeling and climate research.
Different biogeochemical processes contribute to the carbon balance of the land [79].In particular and as common, we split N EE as where gross primary production GP P describes the gross carbon uptake by the environment and ecosystem respiration R eco denotes the carbon release of all organisms.

The Q 10 model
A common parametrization of R eco is the Q 10 respiration model [60][61][62]: This model highlights temperature T A as a principle driver of respiration, with Q 10 denoting the temperature sensitivity parameter.Furthermore, R b describes the base respiration, and X a set of meteorological drivers.Following the example of [45], we use data from the EC tower in Neustift, Austria, available in the FLUXNET2015 dataset [80].
Based on this site, we extensively probe the DML-based HM in the controlled setting of synthetic data and showcase its potential on measured data.As the goal of this paper is not to provide a comprehensive analysis of global Q 10 values, we limit ourselves to this site for our first use case.
Data Synthetic data is generated from a Q 10 model with seasonally varying base respiration and measured air temperature T A , and with true constant Q 10 set to 1.5 (for details, see Appendix B.1.1).We provide additional experiments for Q 10 values of 1.25 and 1.75 to showcase the robustness of the results.
Ecosystem respiration is a latent flux not directly observed at flux towers during the day.It can only be measured as nighttime N EE, as without photosynthesis, we assume GP P to be zero or under controlled conditions like a sealed chamber [79].Applying DML-based HM Applying a log-transform to (8) and setting f The resulting equation ( 9) describes a partially linear regression problem [81] equivalent to (1).Here, log(R b (•)) represents the non-parametric function g(•) as we do not know the functional form of R b .We aim to estimate the constant linear effect θ = log(Q 10 ) of the transformed temperature f (T A ) on the log-transformed ecosystem respiration.In this work, we employ and compare both NNs and RFs as examples for first-stage estimators.
After obtaining the estimator Q10 , we fit a NN on Figure 3: Assumed causal graphs for the estimation with the causal hybrid modeling approach in Q 10 estimation.
We compare the causal DML-based HM to the standard GD-based HM as described in [45].We fit with a NN representing the base respiration R b .The weights of the NN are optimized together with Q 10 using the Adam [82] optimizer.
We run the experiments with and without regularization for all involved NNs in both hybrid modeling approaches.
For this, we use dropout at a rate of 0.2.This technique randomly drops connections in a NN during training and was found to have a sparsifying effect on the model [83].We provide additional experiments with weight decay [84], another common regularization technique in deep learning at a rate of 0.1.To showcase the effect of equifinality, we also introduce T A as an additional predictor in R b .We will apply the same training procedure and NN architectures for both hybrid modeling approaches for comparability and to show robustness in the presence of biased estimators.We only drop the final nonlinearity for the first-stage estimators in the DML-based HM.Details on the NNs and their training can be found in Appendix B.3.
Causal graph of the Q 10 model The causal graph we assume for the Q 10 model is shown in Fig. 3.The smooth potential radiation cycle given by SW SM POT and SW SM,diff POT represent seasonality and, thus, has a confounding effect on temperature T A and R eco .For the real data, we add V P D to the graph, representing humidity and water availability.This variable enters as a mediator in the graph as temperature affects evaporation and how much water the air can hold [85].Furthermore, water availability also has a strong effect on respiration [86].However, the temperaturesensitivity Q 10 should only describe the immediate temperature effect [85].We model the effects of water in the base respiration factor R b .Thus, assuming this graph, with our choices of variables, we estimate only the direct, immediate effect and not the one mediated through water or confounded by seasonality.

Problem formulation
Direct measurements of GP P or R eco at the ecosystem level are difficult to obtain [79].Alternatively, partitioning methods estimate these fluxes numerically from the measured N EE.Common approaches implement functional relationships based on physiology and estimate the fluxes using data-driven models [87][88][89][90][91]. Several hybrid-modeling approaches have recently been proposed modeling both fluxes with NNs [38,68,92].
Separating a single signal into two additive signals is generally prone to equifinality issues.[38] tried to break the symmetry between fluxes in the partition by enforcing different sets of explanatory environmental covariates for the two fluxes and applying a simple hybrid model.In particular, the authors combined NNs with the light-use-efficiency model given by where LU E models the linear efficiency of the incoming shortwaves SW on the resulting GP P .In this form, GP P was modeled as the product of the incoming radiation and LU E parametrized by a NN.[68] showed that with different random initializations, this approach can lead to different resulting fluxes.The equifinality of the solution becomes particularly evident in extreme conditions.The authors can reduce variability through a multi-task learning approach.They introduce a second loss, forcing the network to learn to predict solar-induced chlorophyll fluorescence (SIF) from the separated GP P as both signals are known to be correlated under normal conditions.
Data As a proof of concept, we evaluate the proposed method on synthetically generated data (see Appendix C.3).We only used measured N EE for the real data and applied the hybrid modeling approach site-wise per year.For the data selection of real data from FLUXNET2015 [80], we closely followed [38] to compare our method to the neural network approach that imposes similar structural equations.We chose the same set of 36 different FLUXNET2015 sites (see Appendix B.2) and used the same quality criterion to select site-years, i.e., years of a specific site.This implies that fitting is done year-wise per site, and only measured data is used.To have enough high-quality data, only site-years for the analysis are selected where at least 80% of the meteorological data and 10% of each daytime and nighttime N EE were measured.As a target, similar to [38], we use the N EE obtained from the 50th percentile of the CUT method [80].For comparison, we use the respective partitioned R eco and GP P fluxes obtained from the daytime [90] and nighttime [87] methods, already provided as part of the FLUXNET2015 dataset.Moreover, we compare the partitions to the results obtained with NNs from [38].
Applying DML-based HM We want to fit the following flux partitioning equation where X and W are sets of meteorological drivers and f transforms the incoming radiation to allow for more flexible light-response curves, leading to a potentially non-linear light-use efficiency model.Here, R eco (•) and LU E(•) represent g(•) and θ(•) in the equivalent problem (1), respectively.This time, we use the estimator of R eco obtained from the first-stage estimators.As a proof of concept, we apply this method with f being the identity function for linearly generated data over different noise levels (see Appendix C.3).
For real data, the assumption of a linear relationship to SW is violated as GP P saturates with increasing light.We will thus first fit a transformation f of the light curve before applying the DML schema.In order to find f , we finally fit α and β in with a moving window of 15 days, we always transform the 5 days in the center of the fitting interval.This procedure is motivated by the daytime flux partitioning method [90], which estimates a parameterized rectangular hyperbola over moving windows to obtain GP P .This heuristic allows us to find a flexible, smoothly changing light response curve.
Other ways to obtain such a transformation can be envisaged.For the synthetic data, we use inputs according to how the data was generated, i.e., vapor pressure deficit V P D and temperature T A for X and the seasonal cycle of potential radiation for W .On the real data, we use day of the year doy, V P D, temperature T A , and soil water content SW C (for the sites where it is available) for X and leave W empty (For the assumed causal graphs, see Section 3.2.1).We use gradient boosting regressors [93], an ensemble method of multiple shallow decision trees for all involved fitting steps.
Causal graph of the LU E model The causal graphs assumed for the LU E model are shown in Fig. 4. As R eco is modeled similarly to the Q 10 model, we keep the same variables modeling the seasonal cycle.In addition to that, we include V P D and T A , which were used to model GP P .The incoming radiation SW has an effect on the temperature as well as on water vapor [85].Thus, both variables enter as mediators on the path to N EE.For the real data, we use the day of the year DOY to model the seasonality, which continues to be a confounder.In addition to the V P D and T A , we add soil water content, which also enters as a mediator when available.Consequently, we estimate GP P as the direct effect of light on N EE, discounting the indirect effects through temperature, V P D, and SW , which we allocate to RECO.Note that in this setup, these three variables are still entered as modifiers on the effect of light on N EE, affecting GP P .Table 1 summarizes the variables used for the different setups of the use cases.

Results and Discussion
We show the applicability of our causal DML-based HM on two carbon flux modeling problems.We estimate the temperature sensitivity parameter in the Q 10 model to showcase the robustness to regularization biases.We further illustrate the flexibility of the method to tackle the carbon flux partitioning problem.We simulated ecosystem respiration data from observations of FLUXNET.The true Q 10 parameter was set to 1.5.We sample 100 datasets of varying sample sizes to see how the methods perform in different data regimes.We compare the GD-based HM approach using NNs to the proposed causal DML-based HM framework in two possible instantiations, either using RFs or NNs as first-stage estimators.Experiments are run with and without applying dropout regularization and introducing T A as an additional predictor in base respiration.
The Q 10 estimation results are shown in Fig. 5. First, Fig. 5a shows the results where no dropout was applied to the NNs.In this case, the estimates of the GD-based HM approach, where T A is included as a predictor for R b , show values that are, on average, between 2.1 and 2.3 over all sample sizes.They show a substantial mismatch to the true value of 1.5 and a wide spread at each sample size.This illustrates that equifinality expresses itself in the estimations as a wide range of values that hardly decreases with increasing sample size.We are not obtaining the full range of R > 0 values, which is by (8) mathematically possible, but a range that is constraint alone by the initial Q 10 value, the network's implicit biases and the first optimization steps of the gradient descent algorithm.This can make us mistake this for a valid inference of the method.Instead, methods that exclude T A as a predictor find good estimators that converge with increasing data size.This is, in general, an encouraging result for all hybrid modeling approaches in this setup.Over the whole range, the GD-based HM shows wider spreads than the DML-based HM approaches, which converge notably faster with increasing data size.At low data, they also have lower bias than the GD-based HM approach.Remarkably, the random forest shows very little bias for solving this task over the whole data regime.Experiments corresponding to Q 10 values of 1.25 and 1.75 (see Appendix C.2) exhibit minor variations in magnitude, proportional to the effect parameter.However, they consistently affirm the findings obtained for Q 10 = 1.5.These results showcase the data efficiency of the DML-based approach.At the same time, it is currently computationally less efficient.The causal DML-based HM involves various fitting steps, which may seem uncomfortable compared to the usual end-to-end learning with NNs.One may think of ways also to make DML end-to-end possible.Here, one would apply NNs for all fitting steps and introduce a common loss over all optimization problems optimized with gradient descent.By weighting these losses adaptively, one can force this training to first fit the first stage estimators and then the treatment effect variable similar to what has been done in fitting PINNs respecting temporal and spatial causality [52].Efforts would need to be put into parallelizing the fitting of the first-stage estimators to make this approach computationally less costly.

Robustness against regularization bias.
Dropout is commonly used in deep learning for regularization [83] or uncertainty quantification [94].Fig. 5b shows the Q 10 estimations where dropout is applied to all NNs of the GD-based HM approach and the HM approach based on DML.With dropout, the GD-based HM approach has a more challenging time finding a good solution.It substantially overestimates the value of Q 10 in the low data regime and only slowly gets more constrained and closer to the true value at the upper end of the used sample sizes.While the GD-based method got notably worse with the introduction of dropout, the DML shows robust results for the estimations over the full data range.On average, the Q 10 estimations perform similarly to the experiments without dropout.In the low data regime, the bias in the estimation even decreased further.When fitting the GD-based HM with T A , the regularization with dropout has a positive effect.The estimated values for Q 10 are closer to the true value, and the spread reduces with more data points.The regularization through dropout restricts the space of solutions and reduces equifinality even though more data is necessary to overcome the stochasticity introduced through dropout.In Appendix C.1, we show additional results with weight decay [84], another common regularization technique.As it yields qualitatively similar results (see Fig. 10), we conclude that the presented findings are not only inherent to dropout.
In light of the results, DML, in combination with dropout, can be effectively used for a full probabilistic assessment of hybrid models with inference on the parameter of interest and the non-parametric part, as dropout is also a common technique for obtaining uncertainty estimates for NNs [94].While the GD-based HM approach suffered from the

GD-based HM without T A GD-based HM with T A DML-based HM with NNs DML-based HM with RFs
Figure 6: Estimation of Q 10 on real data.Both DML-based HM find on average a Q 10 value of 1.401 and 1.411 for RFs and NNs, respectively.This agrees with values from the literature that find a Q 10 value around 1.41±0.1 [95].The value for the GD-based HM is lower at 1.331 when leaving out T A as a predictor.With T A , problems of equifinality show up again.application of dropout, the DML approach was robust.Moreover, the technique further yields confidence bands for the approximately normally distributed estimators.By separating both estimations, we can obtain a distribution over the estimated Q 10 and safely obtain uncertainty estimates for R b using dropout.

Results on real data
As discussed in Section 3.1, we obtain measured respiration data using nighttime N EE measurements.We apply GD-based HM and DML-based HM with NNs and RFs without dropout to the data.We used the full dataset of over 100 different random seeds.The obtained distributions of Q 10 are shown in Fig. 6.The GD-based HM approach finds a mean value of 1.322, with a skewed distribution and estimated values ranging between 1 and 2. Including T A as a predictor in the GD-based approach, the values lie in a completely different range between 2.5 and 3.5, with the mean being 2.816.The estimations based on DML yield a mean of 1.407 and 1.409 for the RFs and NNs, respectively, with similarly peaked distributions.The results of the DML estimate agree fairly well with the results of [95] that after controlling for seasonal confounding, find that Q 10 takes values around 1.41 ± 0.1 independently of mean-annual temperature and biome.

CO 2 flux partitioning
We apply the causal DML-based HM to the problem of carbon flux partitioning as defined in (7).In this scenario, we model the effect as a heterogeneous treatment effect, a function of other predictors, parametrized with an ML model.We use gradient boosting estimators for all three estimators involved.Moreover, we show that the plug-in estimator for R eco obtained by combining the first-stage estimators yields useful values without the need for an additional refit.

Consistent flux partitioning
We use vapor pressure deficit V P D, air temperature T A , and day of the year (for seasonality) as drivers over all sites.Where available, we also included soil water content.Since we do not have access to the real partial fluxes, we compare the retrieved fluxes to the ones obtained by the NN approach described in [38] and by the established daytime and nighttime methods [87,90].The daytime and nighttime methods are assumed to capture a simple cycle depending on a few meteorological drivers.New methods may deviate but should show a similar pattern overall.For the partitioned fluxes of two methods (x i ) N i=1 and (y i ) N i=1 , we compute the R 2 , the root-mean-square error (RMSE), given by N i=1 (xi−yi) 2  N , and the bias as the difference between the sample means x and ȳ.The results are reported in Table 2.
Table 2: Cross consistency in terms of R 2 , RM SE and bias of retrieved GP P , RECO and estimated N EE between the established daytime (DT) [90] and nighttime (NT) [87] methods and the GD-based HM with neural networks (NN) [38] and DML-based HM (DML), proposed in this work.The reported statistics are median and in brackets 0.25/0.75quantiles over all site-years.Overall the consistency of the method based on DML lies in a similar range of values to the NN approach [38] when compared to the daytime and nighttime methods.The estimated data uncertainty of the used

Flux
. For almost all compared fluxes, our method lies under this threshold in terms of RMSE.Only for the GP P and N EE of the nighttime method, the values lie on average slightly above with 1.97 µ mol CO2 m 2 s and 1.92 µ mol CO2 m 2 s , respectively.The nighttime method fits respiration overnight and obtains GP P as the residuals between the estimated R eco and measured N EE.Thus, by construction, the N EE of the nighttime method corresponds to the measured N EE.Hence, both N EE and GP P of the nighttime method are higher in noise, and thus, a higher RMSE of our method is expected.When comparing the bias between methods, the causal DML-based HM shows a slightly smaller bias compared to both standard methods than these methods between them in almost all cases.Furthermore, it lies in a similar range to the GD-based HM.Overall, our method shows higher similarity to the daytime method, which is expected due to the fitting of the rectangular hyperbola in the first step.The retrieved GP P is similar to the daytime method as the NN approach, and the obtained N EE is even closer.At the same time, the obtained R eco shows a larger deviation even to the daytime method.This is because we used the plugin-in estimator for R eco obtained from the first-stage DML estimators.
We could obtain a more sophisticated estimator by refitting another model on the residuals, as done in the case of the Q 10 model, where we could also employ SW as a predictor without experiencing equifinality.It would even allow using the previously estimated GP P as a predictor of R eco .As an additional proof of concept, we apply the method to synthetic data with different levels of heteroscedastic noise.The method finds robust estimates even to high levels of noise.The results can be found in Appendix C.3.

Learned functionalities
The consistency tables served as a sanity check that the methods produce reasonable estimations that contain similar trends over the day and year.The next questions are: Where do they produce similar outputs?When do the outputs differ?For this, we compare the retrieved fluxes on two different sites.In Fig. 7, we see the retrieved GP P flux over a few days in July 2006 in France Le Bray.We compare the DML-based HM to the GD-based HM, daytime and nighttime methods.The retrieved GP P of the daytime and hybrid modeling methods show similar patterns.High V P D, which marks low water availability, reduces productivity.The daytime method implements this functionality parametrically.The LU E function of the DML-based HM approach learned a similar functionality that decreases with  The LUE shows a consistent functionality over the different years where an increase in V P D, which marks lower water availability, reduces productivity.This is also consistent with the functionality that the daytime method implements parametrically.increasing V P D and has preferred temperatures roughly between 15 • C and 30 • C (see Fig. 8).It is consistent over the four consecutive years the method was applied to at this site.This demonstrates that the causal hybrid modeling approach can learn a similar functional relationship as the parametric daytime method in a non-parametric way.The nighttime method shows a noisier but qualitatively similar pattern.
To highlight the differences between the methods, we look at a grassland site in Santa Rita (US) [96].Fig. 9 shows the estimated R eco over few days in July 2010.The selected time window was preceded by two months without rain, leading to low soil water content and, in turn, reduced respiration activity [86].During the shown period, a rain event leads to a sudden increase in soil water content.Such an event is expected to lead to a sudden increase in respiration as it stimulates microbial activity [86].We find that the daytime and nighttime methods cannot capture this sudden behavior as their estimation is based on window fitting and cannot detect sudden changes in dynamics.While R eco estimated with the nighttime method increases even before the event, the daytime method yields slowly increasing respiration flux shortly after the event.Instead, the fluxes estimated with the non-parametric hybrid modeling approaches show an increase right at the event's time, demonstrating that they can adapt to sudden changes in dynamics.A difference between both hybrid modeling approaches shows that the GD-based HM estimates a stronger respiration pulse but yields a noisier estimate from the onset of the event.
Our approach offers unique advantages.While traditional daytime and nighttime methods are fully interpretable, they struggle to capture rapid dynamic changes due to their parametric nature.On the other hand, the end-to-end GDbased methods, such as the approach by [38], may lack interpretability due to non-identifiability or implicit functional constraints, relying on assumptions with unclear implications.In contrast, our causal interpretation-based approach offers a middle ground, providing reasonable estimates of fluxes while maintaining interpretability as it is grounded in causal assumptions.By identifying GPP as the causal effect of light on NEE, our method offers a clear and meaningful interpretation of the flux partitioning process.While it may not match the predictive performance and flexibility of pure deep learning, it offers a valuable alternative by combining interpretability with reasonable estimation accuracy.
The analysis we carried out merely serves as a proof of concept toward a causally meaningful flux partitioning method.To maintain comparability, we ran the experiments on the same sites and years with similar quality filters as [38].For both DML-based HM as well as the GD-based HM approach with NNs, further research is necessary before they can be employed at scale in the data processing pipelines of FLUXNET sites.In particular, this would require a comprehensive analysis of the performance over all FLUXNET sites to disentangle the effects of geographical region, climate, vegetation, data quality, and data availability on the consistency of new flux partitioning methods.This should ideally be accompanied by simulations of sets of land surface models tailored for different land cover types to benchmark the adaptability of data-driven methods.This is beyond the scope of this work, which aims at introducing a causal approach to hybrid modeling.As for today, a benchmarking set and standardized evaluation pipeline are not available but could become key in the future when more data-driven flux partitioning models are developed.Understanding how these local factors influence the data-driven methods is crucial as the flux partitioning products serve as ground truth for downstream tasks such as upscaling from the site level to global fluxes as aimed for in the FLUXCOM project [97].

Conclusions
Machine learning is becoming a complementary tool to enhance scientific research and discovery in all fields of science.Its limitations are evident: lack of transparency and interpretability, weak generalizability to unseen data, and violation of governing laws.Hybrid modeling aims to incorporate scientific knowledge to overcome these limitations.However, this alone is insufficient to obtain the interpretability we hope for.Spurious links between variables can lead to equifinality: many models describe the data similarly well.Therefore, we must also teach these hybrid models what seems evident to us: correlation is not causation.And it is causation that we want.
In this paper, we propose a first step in this direction.We split the fitting of hybrid modeling involving treatment effects into subsequent steps, where we first estimated the causal effect with DML and then estimated the remaining of the model.By separating different estimation steps and being explicit about the underlying causal graph and the causal effect, we were able to obtain a well-defined problem that, originally was ill-posed and, in practice, suffering from equifinality.We applied this technique to two problems of carbon flux estimation, namely, Q 10 estimation in ecosystem respiration and carbon flux partitioning.We demonstrated the superiority of DML in retrieving parameters describing causal effects over end-to-end estimations with usual hybrid modeling approaches using NNs.The estimation is shown to be efficient and robust and effectively reduces bias through regularization techniques such as dropout and weight decay.On real data, it could retrieve a value for Q 10 consistent with the literature.We further showed the flexibility of the method by transforming the treatment and fitting a heterogeneous treatment effect of the LU E model for carbon flux partitioning as a non-parametric function.The retrieved fluxes were consistent with the ones of established methods, showed reasonable functional dependencies, and could improve on known limitations stemming from the window fitting of these methods.
We note that to apply the method effectively, assuming a causal graph and being explicit about the causal relationships of the involved variables is essential.This also includes thinking about unobserved confounders, mediators, and correlations between variables.We believe that this should be a general best practice.Our method encourages machine learners and practitioners to do so.A remaining problem is that even though we could show that it has broader applicability than the standard semi-linear regression problem, its relevance is still limited to hybrid models of a particular form containing parameters or non-parametric functions describing causal effects.
Integrating causality with hybrid modeling is crucial for achieving more interpretable and reliable outcomes in knowledge-driven machine learning.Our work has showcased this integration in two important problems in ecology through the application of causal effect estimation.Our causal hybrid modeling framework holds promise for enhancing interpretability and causal inference across diverse scientific fields that demand more insightful machine learning models.Looking ahead, we encourage further exploration and integration of causality concepts within hybrid modeling techniques.

B.3 Details on the neural networks
The NNs used for the GD-based HM had two hidden layers with 16 units each.A tanh nonlinearity was applied at the end of each hidden layer.A final softplus function was applied to the output of the last layer to obtain non-negative results for the base respiration.This function is a smooth approximation of the ReLU function.For the case of regularization, dropout was applied to the outputs of the hidden layers at a rate of 0.2.To probe other instances of regularization, we also used weight decay with hyperparameter 0.1 instead of dropout.The initial Q 10 is sampled from a Gaussian with σ = 0.1 and µ = 1.5 (or 1.25, 1.75 for the respective experiments).For the DML-based HM approach, we used the same network architecture without final softplus for the first-stage estimators.For the estimation of R b after obtaining Q 10 , we used the same network again, but this time we included the softplus nonlinearity.We used stochastic gradient descent with the Adam optimizer [82] for the training.We apply exponential learning rate decay as a scheduler with a decay rate of 0.95 over 500 steps.We trained the first stage estimators of the DML over 2000 iterations each.For the GD-based HM and the final g estimator in the causal DML-based HM, we trained over 10000 iterations.To avoid overfitting, 20% of the data is always kept as validation data for model selection.Table 4: Coefficient of determination R 2 for generated data on all 36 flux sites with different heteroscedastic noise levels between the gross primary production (GPP), ecosystem respiration (RECO) and net ecosystem exchange (NEE) obtained with the DML approach and the respective ground truth.For NEE, the noise-free value is stated.The reported statistics are the median and in brackets, the 0.  ) for generated data on all 36 flux sites with different heteroscedastic noise levels between the GPP, RECO and NEE obtained with the DML approach and the respective ground truth.For NEE, the noise-free and noisy values are stated.The reported statistics are the median and, in brackets, the 0.25 and 0.75 quantiles over all site-years. σ

Thus, a flexible
enough function estimator (e.g. a NN) could learn R b (X, T A )c (T A −T ref A )/10 and obtain Q10 c

Figure 2 :
Figure 2: Schema of the proposed approach: (i) Frame the problem as a treatment effect estimation problem and assume causal graph.(ii) Build estimators of Y and f (T ) and deploy DML in the constant or heterogeneous treatment effect setting.(iii) Estimate g with plug-in estimator or via a final fitting on the residuals.And finally, (iv) Combine θ and ĝ into a causally interpretable hybrid model.
We use 2003 to 2007 for training and keep 2008 and 2009 for testing.Moreover, we consider only measured observations, which amount to approximately 10% of the nighttime data for training (4331 data points).

Figure 4 :
Figure 4: Assumed causal graphs for the estimation with the causal hybrid modeling approach in flux partitioning.
HM without TA GD-based HM with TA DML-based HM with NNs DML-based HM with RFs (a) Without dropout.HM without TA GD-based HM with TA DML-based HM with NNs (b) With dropout.

Figure 5 :
Figure 5: Simulation study for Q 10 estimation with the GD-based HM and the DML-based HM over 100 sampled datasets at different sample sizes.The plots show average and 95% CI for the estimated Q 10 for different methods without (a) and with (b) dropout applied as a regularizer in the NN regression models.The true Q 10 parameter has a value of 1.5.Introducing T A as a predictor in R b leads to equifinality problems.Dropout as a regularizer introduces bias on the estimation of Q 10 in the GD-based HM case, while the causal hybrid modeling approach performs satisfactorily in the absence of equifinality.

Figure 7 :
Figure 7: Retrieved GP P flux of daytime method, nighttime method and DML-based HM in July 2006 in France Le-Bray.The DML-based HM retrieved a similar flux to the daytime method that decreases with the increase of V P D.

Figure 8 :
Figure 8: Functional behavior of the learned LUE in the years 2005 to 2008 over V P D and T A.The LUE shows a consistent functionality over the different years where an increase in V P D, which marks lower water availability, reduces productivity.This is also consistent with the functionality that the daytime method implements parametrically.

Figure 9 :
Figure 9: Retrieved R eco flux of daytime, nighttime, and both hybrid modeling methods in July 2010 in Santa Rita in the US.The daytime and nighttime methods show slow adaption to the change in dynamics caused by a rain pulse event that followed a long drought.Both hybrid modeling approaches can retrieve the expected immediate increase in respiration.The estimates of the GD-based HM are lower and less noisy.
HM without TA GD-based HM with TA DML-based HM with NNs DML-based HM with RFs (a) Q10 of 1.25 without dropout.HM without TA GD-based HM with TA DML-based HM with NNs (b) Q10 of 1.25 with dropout.HM without TA GD-based HM with TA DML-based HM with NNs DML-based HM with RFs (c) Q10 of 1.75 without dropout.HM without TA GD-based HM with TA DML-based HM with NNs (d) Q10 of 1.75 with dropout.

Figure 11 :
Figure 11: Additional simulation study for Q 10 estimation with the GD-based HM and the DML-based HM similar to Fig. 5 with different values for Q 10 In a) and b) Q 10 was set to 1.25 and in c) and d) to 1.75.The findings are qualitatively similar to the case of 1.5.The magnitude of the errors grows with the magnitude of Q 10 .

Table 1 :
Summary of variables for the experiments.

Table 3 :
FLUXNET sites used for flux partitioning experiments with DML.