Causality guided machine learning model on wetland CH 4 emissions across global wetlands

towers (30 sites from 4 wetlands types: bog, fen, marsh, and wet tundra) to construct a causality-constrained machine learning (ML) framework to explain the regulative factors and to capture CH 4 emissions at sub-seasonal scale. We found that soil temperature is the dominant factor for CH 4 emissions in all studied wetland types. Ecosystem respiration (CO 2 ) and gross primary productivity exert controls at bog, fen, and marsh sites with lagged responses of days to weeks. Integrating these asynchronous environmental and biological causal relationships in predictive models significantly improved model performance. More importantly, modeled CH 4 emissions differed by up to a factor of 4 under a + 1 ◦ C warming scenario when causality constraints were considered. These results highlight the significant role of causality in modeling wetland CH 4 emissions especially under future warming conditions, while traditional data-driven ML models may reproduce observations for the wrong reasons. Our proposed causality-guided model could benefit predictive modeling, large-scale upscaling, data gap-filling, and surrogate modeling of wetland CH 4 emissions within earth system land models.


Introduction
Methane (CH 4 ) has been the second most important contributor to post-industrial global warming after carbon dioxide (CO 2 ), with a Global Warming Potential (GWP) of 28-34 times of CO 2 over a 100-year time horizon (Bergamaschi et al., 2013;IPCC, 2013).Wetland CH 4 emissions are the largest natural global sources, contributing around 20-30% to global emissions (Bousquet et al., 2006;Chen and Prinn, 2006;Saunois et al., 2020).Global warming (Koffi et al., 2020), anthropogenic emissions (Boothroyd et al., 2017), wetland expansion (Zhang et al., 2017), and increasing methanogenic substrate availability (Schuur et al., 2008) are expected to increase CH 4 emissions and thereby amplify climate warming (Tao et al., 2020).Freshwater wetlands remain the largest and most uncertain natural CH 4 source to the atmosphere (Peltola et al., 2019;Saunois et al., 2020), but with considerable discrepancies among bottom-up biogeochemistry models, top-down atmospheric inversion models, and data-driven machine learning models (Koffi et al., 2020;Peltola et al., 2019;Saunois et al., 2020).Therefore, improvements in the understanding of uncertainty sources and development of robust modeling frameworks for CH 4 emissions are required to estimate present-day and future wetland CH 4 emissions (Dean et al., 2018).
Wetland CH 4 emissions are affected by multiple environmental (e.g., temperature, redox conditions) and biological (e.g., plant photosynthesis, microbial enzyme activity) factors (Delwiche et al., 2021;Knox et al., 2021).Wetland CH 4 is produced by methanogens under anaerobic conditions (Mayer and Conrad, 1990), with the production rate controlled by multiple drivers such as temperature, availability of substrate (Bergman et al., 2000;Schaufler et al., 2010;Whalen, 2005), O 2 , and alternative electron acceptors (Pasut et al., 2021).After production, CH 4 can be emitted to the atmosphere through various pathways (e.g., diffusion, ebullition, plant aerenchyma transport) that are affected by temperature, water depth, air pressure, and plant aerenchyma properties (Bastviken, 2009;Knox et al., 2021;Morin et al., 2014;Rey-Sanchez et al., 2018;Villa et al., 2020).CH 4 can be oxidized by aerobic bacteria when passing through oxic soil or water during transport (Wahlen, 1993) or even via anaerobic pathways (anaerobic oxidation of methane, AOM) (Fan et al., 2021).The impacts of environmental and biological factors on CH 4 emissions are often non-linear and operate over a range of time scales (Sturtevant et al., 2016).For example, the response of CH 4 production to temperature is observed to be hysteretic (Chang et al., 2021) due to seasonal substrate availability and microbial activity (Chang et al., 2020).The response of CH 4 emissions to GPP may be delayed and the relationship between them has been observed to be lagged by hours to days (Hatala et al., 2012a;Rinne et al., 2018), while CH 4 emission responses to water table fluctuations can be lagged by days to months (Chen et al., 2021;Goodrich et al., 2015;Sturtevant et al., 2016).The multi-driver dependency, nonlinearity, and time-lagged characteristics make it challenging to understand how CH 4 emissions interact with environmental and biological factors and to accurately represent them in predictive models (Kim et al., 2020;Sturtevant et al., 2016;Turner et al., 2021).
In most ecosystem biogeochemical models, wetland CH 4 production is represented as a function of net primary production and/or heterotrophic respiration (as a proxy for microbial activity), with both constrained by environmental scalars (Melton et al., 2013;Wania et al., 2013;Xu et al., 2016).For example, temperature sensitivity scalars have been proposed based on observed CH 4 emissions (Yvon-Durocher et al., 2014).However, in situ observations reveal high variability and uncertainty in CH 4 emissions even with nearly identical environmental conditions (Chadburn et al., 2020;Granberg et al., 1997;Hemes et al., 2018;Koch et al., 2014;Rinne et al., 2018;Villa et al., 2021;Zona et al., 2016), implying much more complex functional relationships between CH emissions and environmental and biological factors.A few ecosystem models explicitly represent more of the underlying microbial, plant, and abiotic processes leading to wetland CH 4 emissions (e.g., ecosys (Grant et al., 2015;Grant et al., 2017a;Grant et al., 2017b), BAMS4 (Pasut et al., 2021), andJSBACH-methane (Castro-Morales et al., 2018)) and confirm that these nonlinear interactions should be considered to improve model predictions of methane emissions (Chang et al., 2019).
In addition to the ecosystem biogeochemical models, Machine Learning (ML) models are becoming useful tools for capturing complex nonlinear relationships, and have achieved good performance in gap filling CH 4 emission data (Hatala et al., 2012a;Hatala et al., 2012b;Irvin et al., 2021;Kim et al., 2020;Knox et al., 2019;Morin et al., 2014) and spatial upscaling (Peltola et al., 2019).However, widely-applied ML frameworks do not accurately represent lagged CH 4 emission dependencies (Kim et al., 2020).Including lagged variables as predictors may improve ML model performance, but risks overfitting, especially for multiple-driver dominated ecosystems with limited temporal observations (Kim et al., 2020).Furthermore, commonly used ML models do not consider causality constraints (Pearl, 2019;Reichstein et al., 2019).Such ML models may fit an observational dataset well while not being driven by causal relationships (Pearl, 2019;Runge et al., 2019a).In this study, we explore whether an ML model that represents lagged responses and considers underlying causal relationships can improve process understanding and wetland CH 4 emission predictions.
We used CH 4 emission measurements at 30 eddy covariance towers covering 4 wetland types (bog, fen, marsh, and wet tundra), to test three hypotheses: (1) It is possible to infer with statistical confidence causal relationships between drivers and CH 4 emissions.(2) The environmental drivers significantly affecting methane emissions differ among the wetlands by their type and location.(3) Future model predictions that are well calibrated based on current flux observations, but differ in their assumed causal relationships between drivers and methane emissions, will diverge significantly.To test these hypotheses, we develop an integrated framework that combines causality and ML to improve understanding of causal relationships affecting CH 4 emissions and modeling of wetland CH 4 emissions across various wetland ecosystems.In this work, a causal relationship exists between predictor (X) and CH emissions if, when excluding the confounding effects from other predictors and from the history of CH 4 emissions, knowing the predictor (X) could significantly reduce the uncertainty in predicting CH 4 emissions (Abdul Razak and Jensen, 2014;Runge et al., 2019a).The overarching goal of this study is to develop, train, and validate a ML model to improve predictive modeling of wetland CH 4 emission for diverse wetlands.
K. Yuan et al.

Study sites and data description
The dataset used in this study is from the FLUXNET-CH 4 synthesis activity, which compiles, standardizes, and gap-fills available daily eddy covariance CH 4 emission data, via the regional networks of AmeriFlux, EuroFlux, OzFlux, and AsiaFlux (Delwiche et al., 2021;Knox et al., 2019).We focus on four types of natural freshwater wetlands (bog, fen, marsh, and wet tundra), and use 30 wetland sites, each with at least one year of CH 4 observations (Fig. 1; Table 1).The wetland classification is based on the site-specific literature (Delwiche et al., 2021).Daily CH 4 emissions (F CH4 ) and 13 potential drivers are considered in our analysis: Air Temperature (T a ), Topsoil Temperature (T s ) (detailed information of soil temperature depth can be seen in Delwiche et al. (2021)), Water  S1).These variables are widely acknowledged as important driving factors for wetland CH 4 emissions (Knox et al., 2021;Oertel et al., 2016).Details of data standardization for the FLUXNET-CH 4 dataset are presented in Knox et al. (2019).In this study, we used the observed non-gap-filled measurements to maintain the original dynamic patterns and avoid potential biases from the gap-filling algorithms that have their own assumed causal relationships.

Transfer entropy analysis
We employ a transfer entropy approach with PCMCI framework (Runge et al., 2019b) to identify non-linear directional relationships between environmental and biological factors and F CH4 .Transfer entropy is a powerful tool to reveal the causality for non-linear and asynchronous systems (Bouskill et al., 2020;Liu et al., 2019;Schreiber, 2000).The approach quantifies information entropy flow from source variables (e.g., T a ) to the target variable (F CH4 ) by measuring the information entropy reduction in the target variables when excluding effects from various confounders (Yuan et al., 2022;Li et al., 2022).If transfer entropy is statistically significant, the causal relationship from a source variable to the target variable is confirmed.For each pair of variables of interest, we calculate the transfer entropy (T) from source variable X to a target variable Y considering the confounders of Z (Schreiber, 2000): where l is the corresponding time lag of source variable X. p is the probability density.Compared with the linear and nonlinear correlation based methods (e.g., mutual information in Knox et al. (2021)), transfer entropy can explicitly exclude confounding effects when detecting the causal strength from one variable to F CH4 through removing shared information between confounders (Z) and the target variable (Y).
In theory, all potential confounders should be included when identifying causal relationships.However, in practice, too many confounders will cause high dimensionality and statistical instability issues (Runge et al., 2019a;Yuan et al., 2021).For simplicity, previous studies often considered the immediate history of a target variable as the confounder, assuming that it contributes the most confounding information to the target (Ruddell and Kumar, 2009;Yuan et al., 2021).However, wetland F CH4 can be jointly regulated by multiple factors including the history of F CH4 .To minimize the interferences from important confounders and to avoid high dimensionality, we adaptively considered three confounders that have the strongest control on the variation of F CH4 through the PCMCI framework (Runge et al., 2019b).PCMCI contains two key steps: (1) PC (named after its inventors Peter and Clark) (Spirtes et al., 2000) and ( 2) Momentary Conditional Independency (MCI) (Runge et al., 2019b).To infer the causal strength from a source variable to the target variable, we firstly used the transfer entropy method in PC to rank the contribution of all potential confounders (e.g., air temperature, soil water content) with relative lower dimensionality (Spirtes et al., 2000), and used transfer entropy in MCI to calculate the causal strength from a source variable to the target variable by excluding the information entropy from the most important confounders (Runge et al., 2019b).We iteratively conducted the causal inference process for each variable to obtain the causal strength (Fig. S1).
The shuffled surrogate method (Kantz and Schürmann, 1996) was employed to test the statistical significance of transfer entropy.This method randomly shuffles source and target time series to destroy time correlations.Shuffled surrogate transfer entropy was computed 100 times through Monte Carlo simulations.A one-tailed significance test is then applied to determine the 95% confidence of the transfer entropy (Ruddell and Kumar, 2009).

CH 4 emission predictive models
We develop a causality constrained interpretable ML model based on the Long-Short-Term-Memory framework (Guo et al., 2019a;Hochreiter and Schmidhuber, 1997;Li et al., 2020) for prediction (hereafter causal-LSTM).We compared the causal-LSTM model performance, internal functional relationships, and model sensitivity against its baseline LSTM model (described below), to illustrate the benefit of including causality constraints in prediction.

Baseline model
The baseline naïve LSTM model has been widely used in time sequence predictions (Alahi et al., 2016;Li et al., 2020).One of the advanced features of LSTM is the gate mechanism that controls the information flow to be memorized or forgotten, which enables capturing short-term and long-term dependencies underlying data sequences.Here, we use the LSTM model for prediction, given the lagged responses of emissions to environmental and biological factors.The recursive representations of LSTM and prediction can be represented as: where x t (t is time step, 0<t ≤ T) is the input vector, c t is the cell memory state vector, and h t represents the hidden state vector with useful information for predictions.In this study, x t represents the biotic and abiotic drivers across sites; h T is the hidden state vector at the last time step T; ŶT+1 is the predicted F CH4 at the time step T+1; and W l and b l are the parameters that need to be learned.The f in Eq. ( 2) is an integrated function that includes five individual equations: For the LSTM, we used the recursive feature elimination (RFE) method (Guyon et al., 2002) to remove spurious predictors.Specifically, we iteratively removed one predictor, used the remained predictors to train the LSTM, and calculated the correlation coefficient between observations and predictions after removing the predictor.Then, we removed the weakest predictor which showed the lowest impacts on model performance, and repeated the predictor elimination process until only one predictor was left.Finally, we present LSTM modeling results based on the subset of predictors selected by RFE method that have the highest model performance.

Causality constrained LSTM
Although baseline LSTM is capable of capturing short-term and longterm dependencies in the input time series, it works as a black-box and cannot explicitly select important driving variables and lacks interpretability of its predictions.Also, the dependencies identified within the LSTM model are based on correlations rather than causality (a more informative directional relationship).To this end, the LSTM model can be improved through attention mechanism, an effective weight assignment method, to increase its transparency (Alahi et al., 2016;Guo et al., 2019a;Li et al., 2020;Liang et al., 2018;Qin et al., 2017;Vaswani et al., 2017).The weight mechanism explicitly and dynamically assigns larger weights to more important variables, thereby improving model performance and interpretability (Guo et al., 2019a;Li et al., 2020).However, without the guide or constraint of causality, the correlation-based ML models may represent wrong processes (e.g., mis-capture dominant causal drivers) (Moraffah et al., 2020;Pearl, 2019;Runge et al., 2019b), making the model unreliable, especially for predictions using multiple drivers with similar seasonal trends (confounding) information under climate change (Runge et al., 2019a).In addition, we further imposed additional constraints using causal relationships from input variables to the target variable and led to the causal-LSTM model.The causal-LSTM model first calculated the causal relationship using transfer entropy.And then through optimization, it reduced the model biases on both prediction error and structure difference between model captured variable dependency and observation-based causal strength.Below, we introduce the weight assignment mechanism (attention mechanism) in the LSTM approach and describe details of how we incorporate causality constraints in the model.
Similar to the baseline LSTM, the ith driving variable at time step t can be iteratively transformed to a hidden state vector h t i through the gate mechanism Guo et al., 2019a;Hochreiter and Schmidhuber, 1997;Li et al., 2020;Qin et al., 2017).To represent the importance of the ith variable at time step t, a weight, w i t or w i t is dynamically calculated through Eqs.(4) and ((5), and assigned to h t i .Then, the weighted summation h sum i of h t i across time steps is obtained to represent the summarized information for the ith driving variable: (5) Where W p is a parameter matrix that needs to be learned, and tanh is the hyperbolic tangent function.T is the total number of time steps.
To further represent the relative importance of the ith driving variable compared to other driving variables, a weight, α i , is obtained and normalized as α i ′ : where W a is a learnable parameter matrix.Finally, using the weighted sum of all driving variables, the model generates the prediction ŶT+1 : where the linear function with weight W o and bias b o , along with weight α i ′ produce the final prediction.
To make the internal structure of the model more consistent with underlying physical processes, we use transfer entropy inferred causal relationships to constrain the variable importance (variable weight) in the predictive model.A larger transfer entropy from a driver (e.g., soil temperature) to F CH4 implies variations of the driver can cause larger variations in CH 4 emissions, compared to other drivers (Ruddell and Kumar, 2009).Similarly, a larger variable weight (α i ′ ) indicates that the i th variable plays more important roles in modeling the target variable (Guo et al., 2019a;Li et al., 2020;Liang et al., 2018;Qin et al., 2017).To guide the model to learn dependencies between causally dominant drivers and F CH4 , we measure the difference between transfer entropy inferred feature importance vector α TE and that of the model captured feature importance vector α k for each sample k, and integrate the difference along with modeled errors into the final loss function (Eq.( 11)).
In the vector α TE , α TEi represents the transfer entropy from the ith driving variable to F CH4 .In α k , α k,i represents the ith variable weight, α i ′ , for a sample k.Each vector is divided by its summation to obtain a probability distribution ranging from 0 to 1, and KL-Divergence (Kullback and Leibler, 1951) (the second item in the loss function, Eq. ( 11)) is used to measure the distribution difference between the two vectors: where λ is a structural punishment parameter, and a larger λ means that the model puts more emphasis on structural similarity instead of errors.In Eq. ( 11), the first right hand side term is the errors between observations and predictions, while the second term is the structural similarity between causality inferred feature importance and importance the model captured.N is the number of predicted data samples, and n is the number of variables.The baseline LSTM uses only the first term on the right-hand side for the loss function, while the causal-LSTM has additional constraint from causal relationships via the second term (Eq.( 11)).
The model parameters are learned via a back-propagation algorithm (Rumelhart et al., 1986) by minimizing the integrated loss (Eq.( 11)) with a variational dropout to avoid overfitting (Gal and Ghahramani, 2016).We used the intra-site validation scheme to test model performance on capturing intra-site temporal variations of F CH4 .Specifically, in each experiment, for each site, we randomly sampled 80% of data as a training dataset, remained 10% as a validation dataset (used to avoid overfitting during training (Prechelt, 1998)), and retained the remaining 10% as a test dataset (a holdout dataset used to unbiasedly evaluate the final model).We repeated each experiment 20 times to reduce model bias due to random data selection.We compared the model performance with different λ values (Fig. S2), and selected the best one (λ=0.005)that has the lowest prediction errors.To evaluate the lag effects for model improvement, we varied the lengths (one-week vs. one-month) of time series input used in the models.In addition, we also used the leave-one-site-out scheme (here referred as inter-site validation) (Jung et al., 2011) to test model performance on spatial extrapolation of F CH4 on each tested site.Other detailed experimental settings of each model are listed in Table S2.

Causal relationships derived from transfer entropy
Transfer entropy analysis revealed that daily F CH4 was most strongly K. Yuan et al. driven by soil temperature (T s ) in the four analyzed wetland ecosystem types (bog, fen, marsh, and wet tundra; Fig. 2a), with a range of different time lags.The statistics of dominant drivers at each individual site also showed that T s dominated in most sites (Fig. S3).Furthermore, the strength of the T s → F CH4 relationship declined with increasing mean air temperature (slope = -0.0014,R value = -0.63,p value <0.05) (Fig. 2b).This inverse relationship suggested that CH 4 emissions in colder regions were more sensitive to temperature than in warmer areas.The control from air temperature (T a ) was weaker than that from T s and was prominent only at fen and marsh wetlands (Fig. 2a).
Two biological factors, Ecosystem Respiration (RECO) and plant Gross Primary Production (GPP), also exerted strong controls on daily F CH4 in bog, fen, and marsh wetlands.These strong relationships between F CH4 and vegetation carbon turnover are consistent with the findings of many previous studies (Hatala et al., 2012a;Mitra et al., 2020;Rinne et al., 2018).Plant GPP stimulates CH 4 production indirectly by providing carbon input, mainly via root exudates fueling microbial activity, which produces substrates (such as acetate and CO 2 ) for acetotrophic and hydrogenotrophic methanogenesis (Bastviken, 2009;Mitra et al., 2020;Ström et al., 2012;Whiting and Chanton, 1993).Additionally, GPP can be seen as a proxy of plant-mediated CH 4 transport via aerenchyma tissue (Bastviken, 2009;King et al., 1998;Turetsky et al., 2014).Previous studies argued that the relationship between GPP and F CH4 may be due to covariation with confounding drivers (e.g., soil temperature) (Chang et al., 2021;Knox et al., 2019).In this study, we confirmed the existence of a strong coupling from GPP and RECO with F CH4 by removing confounding effects when identifying the causal relationships across multiple wetland types.
Compared with temperature and biological factors, the controls from other variables on F CH4 were much weaker (Figs.2a, S3) and less consistent across wetland types.For example, VPD controlled F CH4 more at bog and fen ecosystems, while PA showed weak causal relationships with F CH4 across all sites.For water-related factors, significant controls on F CH4 existed only at a few sites, which may be attributed to limited observations of water table depth (D wt , 16 sites) and soil water content (θ, 9 sites), and limited variations of soil wetness across studied sites (more details are discussed in Section 4.1).

F CH4 predictions with causal constraints
Because causal relationships varied across wetland ecosystems, we trained independent ML models for each wetland type (bog, fen, marsh, and wet tundra).Two types models were considered: Long Short-Term Memory (LSTM) and causality-constrained interpretable LSTM (causal-LSTM).We found that Causal-LSTM performed consistently better than LSTM for all four wetland types with higher Pearson correlation coefficient (R) and lower relative MAE (mean absolute error) when inputting four weeks of historical drivers (Table S3 and S4).For example, R values in LSTM ranged from 0.861 to 0.908 and relative MAE ranged from 0.271 to 0.433, while R in causal-LSTM ranged from 0.904 to 0.921 and relative MAE ranged from 0.217 to 0.368 (Fig. 3, Tables S3 and S4).Consistently, with one week of inputs, the causal-LSTM also showed significantly higher R and lower relative MAE compared with LSTM in all wetland types (p < 0.05, Tables S5 and S6).We also compared the causal-LSTM approach with four other widely used ML algorithms (random forest, decision tree, artificial neural networks, and support vector machine), and found that causal-LSTM had the highest prediction accuracy (Fig. S4), with R value of 0.94 between observations and predictions of causal-LSTM across all sites (Fig. S5).
For model evaluation with the inter-site validation scheme, causal-LSTM also performed reasonably well with R value of 0.75 between observations and predictions (Fig. S6) and lower biases than that of LSTM (Table S7).However, the inter-site validation performance of causal-LSTM dropped, compared with the intra-site validation scheme  K. Yuan et al. especially for the marsh (the mean R value dropped to 0.81 in bog, 0.81 in fen, 0.86 in wet tundra, and 0.69 in marsh), which may be due to the strong spatial heterogeneity of F CH4 magnitude (e.g., mean CH 4 emission ranged from 2.706 nmol m − 2 s − 1 to 165.472 nmol m − 2 s − 1 across different sites) and environmental conditions (e.g., annual precipitation in marsh varied from ~200 to ~1400 mm/year).Overall, we conclude that the causal-LSTM provides the most effective approach to model wetland F CH4 .
The results showed that model performance tended to be improved as the length of input time series increased from one to four weeks for both causal-LSTM (Fig. 3a and b, yellow vs. red bars) and LSTM models (Fig. 3a and b, green vs. purple bars).For R, the performance of both models at bog, and marsh was significantly (p < 0.05; Tables S8 and S9) improved as the input data length increased.Similarly, in terms of relative MAE, the causal-LSTM model showed significantly lower biases (p < 0.05; Table S10) in bog, marsh, and wet tundra ecosystems, and LSTM showed significant lower biases in bog and wet tundra (p<0.05,Table S11).Overall, longer histories of drivers (i.e., memories) can provide additional information for predictions, especially in bog, marsh, and wet tundra.

Soil temperature versus soil water control on F CH4
Wetland CH 4 emissions are regulated by multiple biotic (i.e., production, oxidation) and abiotic (i.e., advection, diffusion) processes, with each posting different dependencies on environmental factors.Therefore, the emergent relationships between wetland methane emissions and the corresponding environmental factors are expected to be complex and diverse across different wetland ecosystem types and across sites with different climate conditions (Turetsky et al., 2014).Among those environmental variables, previous studies have identified temperature and soil water content as major abiotic drivers for wetland CH 4 emissions (Knox et al., 2021;Song et al., 2011;Strachan et al., 2015) because soil water saturation and warm soil conditions are two prerequisites for anaerobic production of wetland CH 4 (Riley et al., 2011).
Here, we found strong soil temperature control on CH 4 emissions across bog, fen, marsh, and wet tundra ecosystems.The stronger causal relationship of T s → F CH4 compared to T a → F CH4 is consistent with the hypothesis that air temperature may decouple from soil temperature in colder ecosystems (e.g., wet tundra) due to snow insulation of the ground (Kim et al., 2007).Similar strong correlations between T s and wetland F CH4 have been reported in numerous site-level studies (Granberg et al., 1997;Knox et al., 2021;Morin, 2019).
We also found relatively weak control from soil water related variables (Fig. 2), admit low confidence because of limited data.For example, soil water content had weak control in fen ecosystems, and water table depth had moderate control in bog and marsh ecosystems, but not in fen or wet tundra ecosystems.The lack of sensitivity may partly be due to the data quality of water related variables (D wt is available in ~50% of our studied 30 sites, and θ is available in only ~30% of the 30 sites (Table S1)).Another potential reason is the fact that the sites used in this study all experienced relatively low variation of D wt (mean standard deviation is 10.6 cm).Strong seasonal fluctuations of soil water are more expected at rice paddy or tropical swamp ecosystems (Jauhiainen et al., 2005;Mezbahuddin et al., 2014), which are not included in this study.For example, water table depth could vary ~80 cm at a managed rice paddy site in northern California and plays an important role in driving CH 4 emissions during both growing season and fallow periods (Knox et al., 2016).Although not frequently occurred, extreme droughts may result in significantly different water table at fen and bog sites that will reduce the methane emission (Brown et al., 2014;Rinne et al., 2020).However, ML model was trained with majority of the data to capture non-extreme conditions.In addition, we note that several studies reported weak dependencies between D wt and F CH4 (Jackowicz-Korczyński et al., 2010;Rinne et al., 2007;Rinne et al., 2018).Given the limitations in sites and water-related data availability, our results highlight the need for more eddy covariance and ancillary measurements in bog, fen, marsh, and wet tundra ecosystems, particularly measurements under long-term drying and rewetting conditions, or experiencing natural flooding and water table fluctuation.These observations will facilitate a more complete picture of how various factors affect wetland CH 4 emissions within these wetland ecosystems.

Causal relationships inform model evaluation and development
In addition to commonly used model evaluation metrics (e.g., MAE and R), causal inference provided additional metrics to evaluate and benchmark models in terms of internal causal structures.Causal relationships may also help select process-based models with model causal structures similar to those in observations.In this analysis, we found that methane ML models can achieve comparable performance even though they have diverse causal relationships.We visualized variable importance within LSTM and causal-LSTM models and validated the modeled relationships against observed causal relationships identified by transfer entropy analysis (Fig. 4).The feature importance of causal-LSTM and LSTM were calculated according to attention weight statistics (Guo et al., 2019b;Li et al., 2020) and the feature importance derived from RFE (Guyon et al., 2002;Meyer et al., 2019) of 20 repeated experimental results, respectively, and were both normalized to 0~1.We found that LSTM mainly used dependencies from wind, atmospheric pressure, soil and air temperature, and total ecosystem respiration to estimate F CH4 , which were different from those inferred from observations and causal-LSTM (Fig. 4).The feature importance in the causal-LSTM model is much more consistent with observations, confirming the effectiveness of the causality constraints.
The inferred causal relationships from biological and environmental variables on CH 4 emissions vary across different wetland types and time windows.Our results show that soil temperature dominantly controls F CH4 in wet tundra, while biotic variables along with soil temperature codominate F CH4 in fens, bogs, and marshes.The different controls imply that different ecosystems need to be considered separately in machine learning model development (Turetsky et al., 2014).Also for each wetland ecosystem, the responses of F CH4 rely on processes with short time lags (e.g., CH 4 transport, microbial activity) and long time lags (e.g., fine-root turnover).Integrating both short and long causal relationships may also improve model performance.(c) observations.Colors show the corresponding normalized feature importance that is normalized between 0 and 1, with higher value indicating higher importance.

Implications of considering causal relationships in CH 4 emission projections
Our results, in line with previous studies, suggest that data-driven ML models may accurately reproduce observations with the wrong reasons (Pearl, 2019;Reichstein et al., 2019;Runge et al., 2019a).The different causal relationships built within predictive models are critically important for climate change projections, since the responses of CH 4 emissions to climate change strongly depend on the strength of the underlying causal relationships.Thus we hypothesized that although both LSTM and causal-LSTM performed reasonably well under present-day conditions, their predictions under warming climate could differ due to their differences in internal functional relationships, or altered combinations of forcing mechanisms.To test this hypothesis, we conducted a theoretical soil warming experiment (+1 • C) at all sites through modeling.We acknowledged that more complex changes can occur in a real soil warming experiment (e.g., soil drying caused by warming) (Pries et al., 2017).However, this simple soil warming experiment isolates impacts from other environmental or biological variables and focuses only on the temperature effect.
For each wetland type, we calculated the mean change in F CH4 due to soil warming across all site years.We defined response ratio to warming by percentage change of F CH4 under warmed and controlled conditions.Large differences between the LSTM and causal-LSTM existed in response to warming, especially for bogs (4.9% vs 21.8%) and fens (2.7% vs 10.1%) (Fig. 5).The differences in causal-LSTM predictions are significantly larger than those of LSTM for bog, fen, marsh and wet tundra sites (p < 0.05; Table S12).Overall, the LSTM model estimated lower methane emission in response to warming than causal-LSTM model, primarily due to the less important role of soil temperature in its internal model functions (Fig. 4a).Therefore, this work highlights the importance of considering causal relationships in modeling CH 4 emissions under a changing climate.We advocate the use of these types of causal relationship constraints for other ecosystem variables calculated through machine learning approaches (e.g., FLUXNET-MTE GPP (Jung et al., 2011)).In addition, causality constrained ML models could serve as surrogate modules for efficient parameterization and high accuracy prediction, especially for processes that lack theoretical understanding and mathematical model structures.

Conclusions
Based on in situ eddy covariance measurements of daily CH 4 emissions (F CH4 ) at 30 eddy covariance sites in bog, fen, marsh, and wet tundra wetlands, we found consistent causal regulations from soil temperature on F CH4 using a transfer entropy approach.We also confirmed important causal relationships with ecosystem respiration (RECO) and gross primary production at bog, fen, and marsh wetlands.The transfer entropy approach explicitly excludes confounding variables and therefore reduces the possibility that the observed causal relationship between F CH4 and RECO or GPP was due to covariation with other environmental drivers, such as temperature (Chu et al., 2014;Knox et al., 2019).We then developed a predictive model that integrated the transfer entropy inferred causal relationships for F CH4 simulations.The causality constrained model outperformed other baseline ML models in terms of accuracy (relative MAE and R); more importantly, we demonstrated that including underlying causal relationships in predicting F CH4 under a 1 • C soil warming could differ by up to a factor of 4, compared with traditional ML models.Our results highlighted that those causal relationships can be used to benchmark, evaluate, and improve wetland methane emission models.Our proposed causality constrained model could benefit large-scale upscaling, data gap-filling, and surrogate modeling of wetland CH 4 emissions within earth system land models.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.(a) Geographic locations and wetland types of the 30 selected eddy covariance sites.(b) Mean annual temperature and precipitation of each site.

Fig. 2 .
Fig. 2. (a) Causal relationships between environmental and biological factors and daily F CH4 averaged across sites within four wetland ecosystems.Colors in the grid squares show the strength of transfer entropy (normalized to range from 0 to 1) from each variable to F CH4 ; darker colors represent larger values (a grey grid means that the observation data is unavailable).(b) Relationship between the strength of T s → F CH4 relationships and site Mean Annual Temperature (MAT); the grey bounds show a 95% confidence interval.

Fig. 3 .
Fig. 3. Model performance comparison with different input lengths for LSTM (green and purple boxes) and causal-LSTM (yellow and red boxes), in terms of (a) correlation coefficient (R), and (b) relative MAE between predictions and observations.The boxes represent 25th to 75th percentiles, and the whiskers represent 5th to 95th percentiles of R or MAE for each wetland type.

Fig. 4 .
Fig. 4. Comparison of feature importance in (a) LSTM, (b) causal-LSTM, and(c) observations.Colors show the corresponding normalized feature importance that is normalized between 0 and 1, with higher value indicating higher importance.

Fig. 5 .
Fig. 5. F CH4 response ratio to an imposed +1 • C soil warming of LSTM (green) and causal-LSTM (yellow) models.The boxes represent 25th to 75th percentiles, and the whiskers represent 5th to 95th percentiles of R or MAE for each wetland type.

Table 1
site information of the 30 sites used in this analysis.