How Interpretable Machine Learning Can Benefit Process Understanding in the Geosciences

Interpretable Machine Learning (IML) has rapidly advanced in recent years, offering new opportunities to improve our understanding of the complex Earth system. IML goes beyond conventional machine learning by not only making predictions but also seeking to elucidate the reasoning behind those predictions. The combination of predictive power and enhanced transparency makes IML a promising approach for uncovering relationships in data that may be overlooked by traditional analysis. Despite its potential, the broader implications for the field have yet to be fully appreciated. Meanwhile, the rapid proliferation of IML, still in its early stages, has been accompanied by instances of careless application. In response to these challenges, this paper focuses on how IML can effectively and appropriately aid geoscientists in advancing process understanding, an aspect that is often underexplored in more technical discussions of IML. Specifically, we identify pragmatic application scenarios for IML in typical geoscientific studies, such as quantifying relationships in specific contexts, generating hypotheses about potential mechanisms, and evaluating process-based


Introduction
The widespread application of machine learning (ML) in the geosciences, particularly for predictive modeling, represents a significant technological advance (e.g., Bi et al., 2023; Ham et al., 2019). While their predictive capabilities are widely acknowledged, ML methods are often considered separate from the fundamental scientific methodologies of the geosciences, typically being viewed as more practical tools for simulation and forecasting rather than integral components of scientific exploration (Nearing et al., 2021). It was hoped that ML would revolutionize scientific inquiry (H. Wang et al., 2023), but this anticipated transformation has yet to fully manifest. One concern is that these innovations may not be fully aligned with the core scientific goals of the discipline (e.g., Birhane et al., 2023).
In recent years, the ML community has made significant progress in developing strategies to improve model interpretability, leading to the evolution of interpretable ML (IML) and explainable AI (XAI) (Gunning & Aha, 2019; Murdoch et al., 2019). Although IML and XAI are used somewhat differently in the ML community (for example, IML focuses more on models, whereas XAI includes a broader set of techniques to make ML more explainable; Rudin et al., 2022), we choose not to emphasize these distinctions in this paper. Our concentration is more on the broader concept and practical applications of IML, and thus most terms related to IML can be used interchangeably with XAI in the following discussions. In this paper, we approach IML from a practical perspective, focusing primarily on the use of post-hoc interpretation techniques, although inherently interpretable models are also discussed in context. These post-hoc interpretation techniques, such as Shapley additive explanations (SHAP) (Lundberg & Lee, 2017), integrated gradients (Sundararajan et al., 2017), and local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016), are intended to help users demystify the inner workings of complex, often opaque ML models. For geoscientists with ML expertise, these techniques (albeit with varying effectiveness) provide a way to demonstrate the credibility of a model by justifying its mechanisms against existing knowledge (Dwivedi et al., 2023). Here, however, we highlight the potential of IML to benefit a much broader range of geoscientists, including those who have not engaged with ML models in their research. Essentially, IML provides a new lens for exploring, interpreting, and understanding the complex relationships within geoscientific data (Toms et al., 2020). Through the process of interpreting ML models, we may gain insight into how different input features interact and influence geoscientific phenomena, including relationships that might be difficult to identify through traditional analysis (e.g., Ham et al., 2023; Jiang et al., 2024; Kraft et al., 2019).
Despite the progress made in implementing IML to improve scientific understanding and discovery in many fields (Roscher et al., 2020b), its integration into established geoscientific methodologies still requires both more diverse and careful application. On the one hand, IML is often introduced through a data science-centric lens that typically focuses on its fundamental concepts, various algorithms, and major research trajectories and trends (e.g., Adadi & Berrada, 2018). However, for geoscientists engaged with process-based models, the value of IML may not be immediately apparent, as IML is more likely to be perceived as a technical tool for justifying or debugging ML models. On the other hand, the rapid proliferation of IML has also led to instances of careless application without a thorough understanding of its limitations and underlying assumptions (Arif & MacNeil, 2022; Molnar et al., 2022; Roscher et al., 2020a).
Therefore, this paper aims to bridge this gap by highlighting both the practical benefits and good practices of using IML in geoscientific research. Our goal is not to provide an exhaustive review of IML techniques and their applications in the geosciences, which have been extensively covered in the literature, such as Gevaert (2022) on IML in Earth observation and remote sensing, Başağaoğlu et al. (2022) in hydroclimate, and Mamalakis, Ebert-Uphoff, and Barnes (2022) in meteorology and climate science. Here, we narrow our focus to the direct relevance of IML for broad geoscience purposes, particularly with respect to process understanding, an aspect that has been variously highlighted as critical in AI for Earth system science (e.g., Irrgang et al., 2021; Reichstein et al., 2019; Shen et al., 2023) but still needs more focused discussion. Specifically, we will concentrate on promising applications of IML for targeted insights for geoscientists, including non-linear quantification of relationships within data, generation of hypotheses about potential mechanisms, and evaluation of process-based models. We present a general but practical workflow for effectively integrating IML into routine research activities, from translating specific geoscience research questions into IML tasks to obtaining actionable insights from the IML models. Importantly, we identify several common pitfalls of IML applications and emphasize that careless use of IML can not only lead to potentially misleading conclusions but also undermine its credibility in the geoscience field. Ultimately, we hope to make IML more accessible and relevant to a broader range of geoscientists, enabling them to properly use these innovative tools in their scientific endeavors and opening new avenues for understanding Earth's complex systems.

Demystifying IML for Geoscientists
While there has been a surge in IML research over the past decade, the concept of deriving interpretable models from data has a longer tradition (Molnar et al., 2020). For instance, linear regression, which dates back to the early nineteenth century and has been widely used in scientific studies, can be broadly considered an early incarnation of IML. Linear regression has evolved into a variety of regression analysis tools, such as logistic regression, generalized linear models (GLMs) (Nelder & Wedderburn, 1972), and generalized additive models (GAMs) (Wood, 2017). These models are often designed based on specific distributional assumptions and a predefined limit on complexity to ensure interpretability. Similarly, models such as decision trees (Quinlan, 1986) and decision rules (Quinlan, 1987) are inherently interpretable, as their decision logic can be easily traced by examining the learned rules or structured hierarchy of decisions.
However, relying solely on the inherent interpretability of these simple models can have limitations, particularly in terms of predictive performance, as their simplicity may restrict their ability to capture arbitrarily non-linear interactions. On the other hand, complex ML models, such as deep neural networks (NNs) and boosting algorithms, often outperform simpler models in terms of accuracy, but lack inherent interpretability (Dramsch, 2020; Molnar et al., 2020). As a result, popular IML research uses post-hoc interpretation methods to explain the output of "black-box" ML models (Dwivedi et al., 2023). Particularly, the built-in measure of feature importance in random forests (RFs) was a major milestone (Breiman, 2001a). Since around 2015, the IML field has experienced significant growth, with the emergence of numerous model-agnostic explanation methods applicable to different ML model types as well as model-specific explanation methods tailored to NNs or tree ensembles (Molnar et al., 2020). These post-hoc methods (hereafter referred to as interpretation methods or interpretation techniques) do not simplify the model itself, but rather provide a window into the complex interactions and non-linear relationships that the model has captured from the data.
Generally, interpretation methods analyze the relationships learned by ML models by examining the model components or sensitivities (Figure 1). For instance, activation maps help reveal how internal representations are formed by convolutional NNs by visualizing the layer-wise activation patterns (Olah et al., 2017). In comparison, interpretation methods (e.g., SHAP, integrated gradients, and LIME) study the sensitivity aspect of ML models by perturbing the original inputs, computing the gradient of model outputs with respect to inputs, or approximating complex models with inherently interpretable models. Other than understanding the contribution of each input feature to individual predictions, interpretation methods such as partial dependence plots (Friedman, 2001) and permutation feature importance (Altmann et al., 2010) illustrate the general impact of features across the data set. Overall, compared to inherently interpretable models, post-hoc methods bring partial but functional interpretability without sacrificing the predictive power of advanced ML models.
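To make the two global methods named above more concrete, the following is a minimal sketch of permutation feature importance and a partial dependence curve, written with the Python standard library only. The function `toy_model`, the synthetic data, and all helper names are illustrative assumptions, not part of any study cited here; `toy_model` merely stands in for a trained ML model whose dependence on each input is known in advance.

```python
import random


def toy_model(row):
    """Hypothetical stand-in for a trained model: linear in x0, quadratic in x1, ignores x2."""
    x0, x1, x2 = row
    return 3.0 * x0 + x1 ** 2


def mse(model, X, y):
    """Mean squared error of the model on rows X with targets y."""
    return sum((model(r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, col, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling one column, which breaks its link to y."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in X]
        rng.shuffle(shuffled)
        X_perm = [r[:col] + (v,) + r[col + 1:] for r, v in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - base)
    return sum(increases) / n_repeats


def partial_dependence(model, X, col, grid):
    """Mean prediction when one column is forced to each grid value for every row."""
    curve = []
    for g in grid:
        preds = [model(r[:col] + (g,) + r[col + 1:]) for r in X]
        curve.append(sum(preds) / len(preds))
    return curve


# Synthetic, noise-free data drawn uniformly from [-1, 1] for three features.
rng = random.Random(42)
X = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(500)]
y = [toy_model(r) for r in X]

importances = [permutation_importance(toy_model, X, y, c) for c in range(3)]
pd_x1 = partial_dependence(toy_model, X, col=1, grid=[-1.0, 0.0, 1.0])
```

In this sketch the ignored feature x2 receives zero importance, and the partial dependence curve for x1 is symmetric around zero, reflecting the quadratic term. Note that both methods replace feature values independently of the other features, so with correlated inputs they can evaluate the model on implausible combinations.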
Figure 1. The relationship between data, machine learning (ML) models, and post-hoc interpretation techniques, in the framework of interpretable ML (IML), as well as the usefulness of their results in Earth science studies. The primary goal of using IML in this context is to uncover relationships within the data used to make predictions. Dark blue arrows represent the flow from data through opaque ML models to post-hoc interpretation techniques that make the ML models interpretable. The revealed relationships can support various aspects of geoscientific research, with green boxes indicating applications that are directly relevant to broader geoscience studies in terms of process understanding.
In summary, IML is not an entirely new concept but can be regarded as a form of data analysis, or more specifically, an approach to understanding data through the lens of the data-driven models that process it. This perspective, though differing from the formal definition of IML, is useful in its pragmatism. IML essentially extends the capabilities of traditional statistical tools by providing sophisticated methods for analyzing variable relationships, which is particularly valuable in the geosciences where complex interactions and non-linear relationships are common. For readers interested in a more detailed technical understanding of the IML algorithms and methods discussed in this paper, we recommend referring to comprehensive reviews (e.g., Adadi & Berrada, 2018; Barredo Arrieta et al., 2020; Başağaoğlu et al., 2022; Gevaert, 2022; Gilpin et al., 2018; Gunning et al., 2019; Mamalakis, Ebert-Uphoff, & Barnes, 2022; Molnar et al., 2020; Murdoch et al., 2019; Roscher et al., 2020a, 2020b) that provide in-depth discussions of various interpretation techniques, their theoretical underpinnings, their implementation details, and their applications in various subfields of the geosciences.

Usefulness of IML for Geoscientists
IML offers a variety of applications in the geosciences, and its usefulness may be most apparent to geoscientists who focus on ML, primarily to justify and diagnose their models for predictive tasks (e.g., Kratzert et al., 2019; Mayer & Barnes, 2021). In this paper, however, we will not discuss such applications in depth, but rather explore how IML can be directly used to potentially enhance process understanding for the field (Figure 1).

Quantifying Relationships Within a Given Context
A fundamental aspect of process understanding in the geosciences is quantifying the relationships within data, including identifying which variables are most influential, understanding the nature of their influence (whether linear, non-linear, or conditional), and determining how changes in one variable might affect others. IML is directly applicable in this context, equipping geoscientists with the tools necessary to quantitatively delineate relationships within established frameworks. These relationships may be partially known, but possibly remain qualitative, conceptual, or local. For example, IML has been used to explore relationships between environmental variables and diverse phenomena, such as species distributions (Ryo et al., 2020), flooding mechanisms (Jiang, Zheng, et al., 2022), landslide generation processes (Brenning et al., 2015), and soil-vegetation coupling (W. Li et al., 2022). IML has facilitated the identification of hotspot regions where precipitation anomalies are highly sensitive to anthropogenic warming (Ham et al., 2023), or where regional temperature signals exhibit significant sensitivity to aerosol forcing (Labe & Barnes, 2021). Overall, IML allows geoscientists to refine and enhance the current scientific understanding in a quantifiable and non-linear context. However, it should be emphasized that the relationships uncovered by IML using predictive models, while potentially useful, are not inherently causal, as discussed in more detail in Section 4.2.
In general, the primary approaches to infer variable relationships in the geosciences are conventional statistical analysis and numerical experiments using process-based models. Conventional (parametric) statistical methods, which are based on solid theory and usually provide additional confidence intervals, prediction intervals, and significance tests, are best suited for confirming well-defined hypothetical relationships. In contrast, IML enhances data exploration within large, high-dimensional data sets that often contain a multitude of interacting factors and patterns that are not readily apparent through traditional statistical analysis (Breiman, 2001b). Moreover, a practical advantage of IML is its ability to provide granular interpretations for individual instances (Lundberg et al., 2020), which is important in scenarios where we need to understand specific data points, such as the potential drivers of extreme events (van Oldenborgh et al., 2021).
Numerical experiments, such as controlled experiments, scenario analyses, and sensitivity analyses using process-based models, are common in the geosciences and critical to understanding how systems respond to environmental change (e.g., O'Neill et al., 2016). However, conducting such numerical experiments can be time-consuming, limiting the number of experiments that can be realistically conducted. Controlled simulation experiments also run the risk of inadvertently disrupting natural interdependencies, such as the typically anticorrelated relationship between temperature and precipitation at interannual scales during summer (Madden & Williams, 1978). If specific variables are manipulated in isolation, these experiments may lead to artificial combinations of variables that are not physically plausible. This consideration is particularly important for understanding compound weather and climate events, where the combination of non-extreme drivers can lead to extreme impacts (Zscheischler & Seneviratne, 2017). Moreover, the effectiveness of numerical experiments often depends on a well-established understanding of the underlying mechanisms of the systems. In cases where these mechanisms are not fully known, or where comprehensive process-based models are not available, the application of IML to observational data may be partially useful (Irrgang et al., 2021).

Generating Hypotheses About Mechanisms With IML
In traditional geoscientific research, hypothesis generation often follows a time-intensive path that begins with careful observation of phenomena, followed by the formulation of a hypothesis based on those observations (Sivapalan & Blöschl, 2017). This process typically involves extensive data collection, analysis, and integration of multiple data sources (often based on the researcher's intuition) to identify patterns or anomalies. While thorough, this approach can be slow and sometimes limited by the inherent biases and perceptual limitations of human analysis. This is especially challenging in the era of big Earth data, where traditional analytical methods may struggle to navigate the complexities inherent in large, diverse, and multimodal data sets (X. Li et al., 2023). In comparison, IML can quickly analyze large and complex data sets, such as multidimensional data from multiple sources (e.g., satellite imagery, sensor networks, and historical records). For instance, using a large data set of Earth observations and climate variables, Kraft et al. (2019) analyzed variable contributions to temporally lagged dependencies (i.e., memory effects) in vegetation modeling through interpretation of long short-term memory (LSTM) models. This investigation revealed some new aspects of memory effects, such as their associations with climate gradients. While IML by itself does not confirm causality, because ML models may predict the right outcome for the wrong reasons (Lapuschkin et al., 2019), the correlations, statistical dependencies, and patterns it uncovers can still be informative. For example, IML may reveal that certain variables, previously deemed unlikely to be relevant, play a significant role in predictions, or that the relevance of variables shifts in ways that defy initial expectations (Ryo et al., 2020). These findings can prompt geoscientists to reevaluate their prior assumptions, providing valuable starting points for further rigorous testing and investigation through targeted studies and experiments (Carloni et al., 2023). This ability of IML to efficiently sift through and interpret large amounts of data can accelerate the hypothesis generation process, and thus the entire research cycle. This rapid turnaround is particularly beneficial in climate research or natural hazard assessment, where timely insights can have a significant impact (van Oldenborgh et al., 2021). Furthermore, IML's ability to handle large data sets means that hypotheses can be generated and refined in real time as new data become available, keeping pace with the dynamic and evolving nature of the Earth system.

Evaluating Process-Based Models With IML Insights
The relationships and patterns revealed by IML also facilitate the assessment of the variability of specific factors across models, data sets, or scenarios (Reichstein et al., 2019), something that is often less emphasized. Understanding whether different models consistently reproduce the dependence structure of variables observed in real-world data would help evaluate and refine process-based models (e.g., Gnann et al., 2023). Process-based models are essential for projections of future trends, though the reliability of these models in simulating future climate events cannot be directly evaluated. Traditional model evaluation and intercomparison have largely relied on benchmarking approaches that focus on univariate comparisons, where the performance of models is assessed based on their ability to reproduce observed values of individual variables (Jägermeyr et al., 2021). However, the univariate approach may overlook compensating errors that arise from interactions among multiple variables within a system, potentially masking problems in model structure or parameterization (Touzé-Peiffer et al., 2020). C. Müller et al. (2024) emphasize the need to include analyses of functional properties in process-based model evaluation, which may reveal more about model plausibility and skill than merely comparing variables, since different model responses to drivers may offset each other in the historical evaluation period, but not in future scenarios. To this end, the use of IML to evaluate these multifaceted relationships holds promise to provide geoscientists with a tool that complements and enhances traditional evaluation techniques and moves toward pattern- and process-oriented model evaluation (Reichstein et al., 2019). By inter-comparing IML-derived relationships from models with those from observational data sets, we can uncover the consistencies and inconsistencies in their covariability, and thus identify specific aspects of the model that may require adjustment or further investigation. Such evaluations are particularly relevant for addressing the challenges of predicting extreme climate and weather events under climate change, which are often caused by complex interactions among multiple factors (Zscheischler et al., 2018). In this case, however, the challenges of out-of-distribution predictions and representational biases are considerable.
Recently, advanced data science methods such as complex networks and causal discovery algorithms have been increasingly used in climate model evaluation. For example, Feldhoff et al. (2014) applied complex networks to evaluate a regional climate model simulating multiple climate variables in South America, where the characteristics of the constructed networks were compared between the model and reanalysis data. Likewise, Nowack et al. (2020) used causal networks to evaluate coupled model intercomparison project phase 5 (CMIP5) models, focusing on their ability to simulate atmospheric dynamical interactions represented by lagged correlations between climate variables at remote locations. They found that models that more accurately capture characteristic causal relationships tend to have smaller biases in their precipitation simulations. However, the potential for using IML for model evaluation, for example, in various model intercomparison projects (e.g., Eyring et al., 2016; Warszawski et al., 2014), remains largely unexplored. IML could be used to systematically compare these models to identify common or different model biases, to constrain uncertainties in climate change projections, and to provide a comprehensive overview of areas for improvement.

Typical Workflow of IML for Process Understanding
Having established the relevance and applicability of IML in the geosciences, this section is dedicated to outlining an actionable workflow for the effective use of IML in geoscientific research (Figure 2a). This workflow is intended as a practical guide to assist geoscientists in structuring their research questions and methodologies around IML to achieve reliable and meaningful results. Here, we focus on presenting the key stages and general principles of the workflow, exemplified by selected cases from the existing literature (Figures 2b-2g). Detailed technical discussions and more extensive examples can be found in Supporting Information S1.
In all cases, the decision to use IML, whether simple or complex, should be contextually appropriate to the specific complexity and demands of the data and research questions. Once the IML workflow has been implemented, it is advisable to compare the results with those derived from traditional analysis methods to assess the unique insights and added value that IML can bring to the study.

Translating Geoscientific Research Questions Into IML Tasks
The first and perhaps most critical step is to clearly define the research question and translate it into a task that can be effectively addressed using IML methods. Typical investigations may focus on identifying key influencing factors and their contributions, or untangling dependencies and conditional effects. For example, geoscientists may want to understand how a specific outcome (e.g., extreme weather and climate events) can be attributed to potential drivers (e.g., Davenport & Diffenbaugh, 2021; Jiang, Zheng, et al., 2022; Kondylatos et al., 2022; Ryo et al., 2020; R. Wang et al., 2021). In these scenarios, the IML task is to quantify the relationships between these events (Y) and a number of possible influencing factors (X), such as atmospheric conditions or geographic features. Beyond attribution to individual factors, IML can be used to determine how multiple factors interactively affect a particular outcome (e.g., H. Wang et al., 2022; Xu et al., 2023). The question of critical thresholds in systems can also sometimes be translated into IML tasks of identifying inflection points in the contribution of X relative to its value (e.g., Chakraborty et al., 2021). These examples and additional case studies are elaborated in Text S1 in Supporting Information S1.
At this stage, it is important to form preliminary hypotheses based on existing knowledge, literature review, or exploratory data analysis. These hypotheses can guide the selection of appropriate IML methods that address specific types of data and are consistent with the goals of the research. For instance, if the hypothesis involves exploring the complex interaction effects between variables in tabular data, the IML methods considered should be capable of explicitly quantifying these interactions. Possible approaches may include the use of tree-based models, such as RFs or extreme gradient boosting (XGBoost) (Chen & Guestrin, 2016), in conjunction with TreeSHAP (Lundberg et al., 2020), which allows the decomposition of model predictions into the contributions of feature pairs based on the structured decision paths inherent in these models. Alternatively, one could consider Explainable Boosting Machines (Lou et al., 2013), where interaction terms can be explicitly specified during model configuration and each interaction term is modeled and interpreted separately.
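As an illustration of what a pairwise contribution means, the sketch below computes a Shapley interaction value by brute-force enumeration of feature subsets. It is not TreeSHAP, which obtains comparable quantities efficiently from tree structure; the model `toy_model`, the single reference point `baseline` (used here in place of a background data set), and the helper names are assumptions made for this example only.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical model with one pairwise interaction (x0 * x1) and one additive term (x2)."""
    return x[0] * x[1] + x[2]


def value(model, x, baseline, subset):
    """Model output when features in `subset` keep their values and all others use the baseline."""
    mixed = [x[k] if k in subset else baseline[k] for k in range(len(x))]
    return model(mixed)


def shapley_interaction(model, x, baseline, i, j):
    """Shapley interaction value for features i and j (i != j), by exhaustive subset enumeration."""
    m = len(x)
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of a subset of this size; the factor 2 splits the effect between (i, j) and (j, i).
        weight = factorial(size) * factorial(m - size - 2) / (2 * factorial(m - 1))
        for s in combinations(others, size):
            s = set(s)
            # Joint effect of i and j beyond the sum of their separate effects, given subset s.
            delta = (value(model, x, baseline, s | {i, j})
                     - value(model, x, baseline, s | {i})
                     - value(model, x, baseline, s | {j})
                     + value(model, x, baseline, s))
            total += weight * delta
    return total


x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi_01 = shapley_interaction(toy_model, x, baseline, 0, 1)  # interacting pair
phi_02 = shapley_interaction(toy_model, x, baseline, 0, 2)  # purely additive pair
```

For this toy model the interacting pair (x0, x1) receives a non-zero value, while the additive pair (x0, x2) receives zero. The enumeration grows exponentially with the number of features, which is why tree-specific algorithms are used in practice.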
For readers seeking detailed guidance on method selection, numerous comprehensive reviews and studies (e.g., Barredo Arrieta et al., 2020; Bommer et al., 2024; Graziani et al., 2023; Mamalakis, Barnes, & Ebert-Uphoff, 2022; McGovern, Lagerquist, et al., 2019; Schwalbe & Finzel, 2023; Zhong et al., 2022) are available that thoroughly examine the suitability and conditions for using specific IML methods. For instance, Schwalbe and Finzel (2023) provide a structured and detailed taxonomy of IML methods that synthesizes insights from a multitude of surveys on IML techniques, metrics, and characteristics, which can assist researchers in identifying the most suitable IML methods for various domain-specific explanation use cases. Additionally, Bommer et al. (2024) discuss metrics for evaluating different IML methods in the context of climate science. They highlight key considerations in selecting an appropriate IML method and propose a framework using evaluation metrics to support the selection of an appropriate IML method for a specific research task.

Preparing and Preprocessing Data
Data preparation is a fundamental step in the IML workflow. The accuracy and reliability of IML outcomes depend heavily on the collection of appropriate and comprehensive data relevant to the defined problem. Typically, variable selection is improved iteratively, guided by model evaluation and interpretation in subsequent steps. In addition to following general principles of data preparation for ML models, such as handling missing values or outliers (Zhu et al., 2023), it is necessary to ensure that the data adequately reflect the temporal and spatial scales relevant to the processes under study (Jiang, Bevacqua, & Zscheischler, 2022; W. Li et al., 2022). Importantly, data volume alone may not be sufficient for IML studies; diversity within the data is equally important (Fang et al., 2022), and different scenarios, conditions, and variations should be included. However, the sample distribution should not disproportionately favor, for instance, certain climatic zones or geographic features, a common issue in site-based observational data sets (Chu et al., 2017). In addition, as a general principle, data often require cleaning, formatting, and transformation to be used effectively (e.g., L. Yu et al., 2006), and this is no different for geoscience data. Depending on the research question, it may also be necessary to remove seasonality and long-term trends from time series data in order to focus on more specific variables of interest (e.g., Davenport & Diffenbaugh, 2021; W. Li et al., 2022) (detailed in Text S2 in Supporting Information S1).
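One simple way to remove seasonality and a long-term trend from a monthly series is sketched below: subtract a mean seasonal cycle (climatology), then subtract a least-squares linear trend. This is only one of several possible choices; the synthetic series, the assumption of complete years of monthly data, the order of the two steps, and the function names are all illustrative and are not taken from the cited studies.

```python
import math


def remove_seasonal_cycle(values, period=12):
    """Subtract the mean of each position in the cycle (e.g., each calendar month)."""
    climatology = []
    for p in range(period):
        same_month = values[p::period]
        climatology.append(sum(same_month) / len(same_month))
    return [v - climatology[t % period] for t, v in enumerate(values)]


def remove_linear_trend(values):
    """Subtract an ordinary least-squares straight line fitted against the time index."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]


# Synthetic 20-year monthly series: a seasonal cycle plus a steady upward trend.
series = [10.0 * math.sin(2 * math.pi * (t % 12) / 12) + 0.05 * t for t in range(240)]

anomalies = remove_linear_trend(remove_seasonal_cycle(series))
```

In this sketch the resulting anomalies are small compared with the original seasonal amplitude and have zero mean. A small residual remains because the climatology absorbs part of the trend, which is one reason the choice and order of such steps should be checked against the research question.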

Training and Validating ML Models
Training an ML model for process understanding may require more consideration than for purely predictive tasks. For example, the choice of an appropriate ML model should be informed by the complexity of the geoscience question and data, as well as the goals of the analysis. In general, the chosen model should be as complex as necessary to capture the essential dynamics of the data, but as simple as possible so that its interpretations can be translated into concise and actionable insights (Toms et al., 2020). The full extent of complexity is often not immediately evident, so it is wise to start with a simpler and more transparent model as a baseline and increase complexity incrementally (discussed in Section 4.5). Also, different ML models have unique strengths and are better suited to specific types of geoscience questions, and the choice of model affects how well it can handle spatial and/or temporal aspects of the data (Grinsztajn et al., 2022; Ham et al., 2019; Jiang, Bevacqua, & Zscheischler, 2022; Kraft et al., 2019; Kratzert et al., 2019; Lees et al., 2022; Saha et al., 2021) (detailed in Text S3 in Supporting Information S1).
An important consideration throughout the model building process is to prevent potential information leakage and ensure that the resulting model is generalizable and does not learn shortcuts (Schratz et al., 2019; Sweet et al., 2023). This requires careful management of the training and test data sets with consideration of the specific characteristics of the data (Bischl et al., 2023; Brenning, 2022; Davenport & Diffenbaugh, 2021; de Burgh-Day & Leeuwenburg, 2023; Lopez-Gomez et al., 2023; McGovern, Jergensen, et al., 2019; Meyer & Pebesma, 2022) (detailed in Text S3 in Supporting Information S1), and implementing strategies such as regularization and early stopping to prevent the model from exploiting certain patterns in the training data to overfit (Ying, 2019).
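One common way to limit leakage from spatial or temporal autocorrelation is to split the data by whole blocks (for example, years, regions, or catchments) rather than by individual samples. The sketch below shows a minimal blocked cross-validation split; the function `blocked_folds`, the round-robin assignment of blocks to folds, and the example of monthly samples grouped by year are assumptions for illustration, and real applications may need buffers between blocks or other blocking designs.

```python
def blocked_folds(block_ids, n_folds):
    """Build (train, test) index lists so that no block is shared between training and test."""
    unique_blocks = sorted(set(block_ids))
    # Round-robin assignment of whole blocks to folds.
    fold_of_block = {b: k % n_folds for k, b in enumerate(unique_blocks)}
    folds = []
    for f in range(n_folds):
        test_idx = [i for i, b in enumerate(block_ids) if fold_of_block[b] == f]
        train_idx = [i for i, b in enumerate(block_ids) if fold_of_block[b] != f]
        folds.append((train_idx, test_idx))
    return folds


# Hypothetical example: six years of monthly samples, blocked by year.
years = [2000 + i // 12 for i in range(72)]
folds = blocked_folds(years, n_folds=3)

for train_idx, test_idx in folds:
    train_years = sorted({years[i] for i in train_idx})
    test_years = sorted({years[i] for i in test_idx})
    print("train:", train_years, "test:", test_years)
```

Compared with a random split, each test fold here contains only years that the model never saw during training, so a high test score is less likely to reflect memorized within-year patterns.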
Careful evaluation of model performance is essential to derive meaningful interpretations in subsequent steps. While sufficient predictive accuracy is necessary, it alone does not guarantee that the model has effectively captured the underlying patterns and relationships in the data (Murdoch et al., 2019). Therefore, a comprehensive, multifaceted approach to evaluation is essential. Ideally, model performance should be tested across diverse subsets that vary in time, space, and/or feature distribution (Sweet et al., 2023). Rigorous testing helps to challenge the model, ensuring that it has not only learned specific patterns, shortcuts, or biases that may be inherent to a particular segment of the training data, but has instead developed a broad, generalizable understanding of the data (discussed in Section 4.1). Typically, fitting multiple, independent models (e.g., Jiang, Bevacqua, & Zscheischler, 2022; McGovern, Jergensen, et al., 2019) or exploring different data sets (e.g., Davenport & Diffenbaugh, 2021; Ham et al., 2019; W. Li et al., 2022) and then examining the distribution of their performance metrics can solidify the robustness of the findings. Ultimately, the success of IML in producing reliable and insightful results depends on thorough and thoughtful ML model training and validation.

Implementing Interpretations and Ensuring Robustness
The choice of an appropriate interpretation technique depends not only on its compatibility with the specific ML model, but also on the level of explanation required (e.g., explanation for individual predictions or global understanding of model behavior). In recent years, SHAP values have gained popularity for their ability to provide detailed insight into each feature's contribution to instance-level model predictions, which can be further aggregated to provide a global perspective of the data set (Lundberg et al., 2020). Other methods such as integrated gradients or expected gradients can be applied to temporal models including LSTM (e.g., Jiang, Bevacqua, & Zscheischler, 2022; Kratzert et al., 2019), while techniques such as layer-wise relevance propagation or occlusion sensitivity are often used for image-based models (e.g., Ham et al., 2023; Toms et al., 2020). However, choosing among the available interpretation techniques can be challenging due to the lack of ground truth for evaluation. Several metrics have been developed to evaluate the suitability and effectiveness of interpretation techniques, focusing on comparing key properties between techniques for specific research problems (e.g., Hedström et al., 2024; Nauta et al., 2023). Typical examples of these properties include faithfulness (a feature assigned high importance by the interpretation technique should significantly affect the model's prediction) and robustness, which assesses the stability of the explanations against minor input variations. This evaluation is critical for making an informed decision about the most appropriate interpretation technique(s). Readers are encouraged to consult recent studies (e.g., Bommer et al., 2024; Mamalakis, Barnes, & Ebert-Uphoff, 2022) that have conducted comprehensive evaluations of different methods against various metrics tailored to the specific context of Earth science.
It should be recognized that no interpretation technique is universally optimal or suitable for all models and tasks, and results from different interpretation methods have been found to be inconsistent (Krishna et al., 2022; Mamalakis, Barnes, & Ebert-Uphoff, 2022). It is therefore advisable to use more than one method whenever possible to assess the robustness of findings. In addition, an essential consideration for some interpretation techniques is the selection of appropriate baselines or background data, which serve as reference points for understanding how different feature values shift the model output from a base value. Different baselines can lead to divergent interpretations (Mamalakis et al., 2023).
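The baseline dependence can be illustrated with a small sketch of integrated gradients on a toy function. The function `f`, its analytic gradient, and both baselines are assumptions chosen for illustration; the integral is approximated with a midpoint Riemann sum along the straight path from the baseline to the input.

```python
import numpy as np

def f(x):
    # Toy differentiable "model" with an interaction term (hypothetical).
    return x[..., 0] * x[..., 1] + 2.0 * x[..., 0]

def grad_f(x):
    # Analytic gradient of f with respect to both inputs.
    return np.stack([x[..., 1] + 2.0, x[..., 0]], axis=-1)

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum of the gradient along the path baseline -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * grad_f(path).mean(axis=0)

x = np.array([1.0, 3.0])
zero_baseline = np.array([0.0, 0.0])
other_baseline = np.array([2.0, 1.0])  # e.g., a climatological mean (assumed)

ig_zero = integrated_gradients(x, zero_baseline)
ig_other = integrated_gradients(x, other_baseline)

for name, b, ig in [("zero", zero_baseline, ig_zero),
                    ("other", other_baseline, ig_other)]:
    # Attributions sum to f(x) - f(baseline) in both cases (completeness).
    print(name, ig, "sum:", ig.sum(), "f(x) - f(b):", f(x) - f(b))
```

In this toy case the attribution of the first input is positive under the zero baseline and negative under the second baseline, even though the model and the input are unchanged.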
To ensure the robustness and generalizability of the interpretation results obtained, the interpretations should be confirmed as not merely artifacts of the specific data set, ML model, or interpretation technique used. For example, the major patterns of interpretation results should remain as consistent as possible under minor input data perturbations or when using independent data sets from various data sources. Therefore, validation across multiple satellite products, model-based data, or in-situ measurements is valuable (e.g., W. Li et al., 2022). The inclusion of random variables unrelated to the target variable can also serve as a point of comparison to evaluate the importance of genuine features (e.g., Zhou & Hooker, 2021). Furthermore, the sensitivity of interpretation results to various model configurations (e.g., filter sizes in CNNs, temporal lengths in LSTM, random seeds) should be examined (Mishra et al., 2021). Note that the uncertainty arising from the above processes is an important aspect to consider when applying IML, which will be further discussed in Section 4.4.
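The random-variable comparison can be sketched as follows. This is a simplified illustration under stated assumptions: synthetic data, an ordinary least-squares model standing in for the ML model, and permutation importance measured as the increase in test-set mean squared error. It is not the specific procedure of the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Append a random probe feature that is unrelated to the target.
Xp = np.hstack([X, rng.normal(size=(n, 1))])
X_train, y_train = Xp[:1000], y[:1000]
X_test, y_test = Xp[1000:], y[1000:]

# Least-squares fit with an intercept (stand-in for any predictive model).
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(Xm):
    return np.hstack([Xm, np.ones((len(Xm), 1))]) @ coef

def mse(Xm, ym):
    return np.mean((predict(Xm) - ym) ** 2)

base_error = mse(X_test, y_test)
importance = []
for j in range(Xp.shape[1]):
    increases = []
    for _ in range(20):
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break link to target
        increases.append(mse(X_perm, y_test) - base_error)
    importance.append(np.mean(increases))

for name, imp in zip(["feature 0", "feature 1", "random probe"], importance):
    print(f"{name}: {imp:.4f}")
```

A genuine feature whose importance is not clearly above that of the probe would warrant caution before being interpreted.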

Distilling Interpretation Results Into Geoscientific Understanding
The process of distilling meaningful geoscientific insights from interpretation results requires interpreting the revealed model behavior within the existing geoscientific context. Some interpretation methods are capable of directly describing model behavior within its operational domain by illustrating how input features affect predictions on average, or what concepts a model has generally learned to encode. These include partial dependence plots (Friedman, 2001), permutation feature importance (Altmann et al., 2010), and several emerging techniques such as concept relevance propagation (Achtibat et al., 2023), network dissection (Bau et al., 2020), and structural causal model-based feature relevance (Reimers et al., 2020). In contrast, some interpretation methods (e.g., SHAP values) focus on instance-level explanations, detailing the contribution of individual variables to specific predictions. Figures 2b-2d showcase the form of instance-level interpretation results based on three types of data typical in the geosciences (i.e., spatial data, multivariate time series, and tabular data) from the literature (Davenport & Diffenbaugh, 2021; Jiang, Bevacqua, & Zscheischler, 2022; H. Wang et al., 2022). For spatial data, for instance, the pixel relevance map in Figure 2b highlights areas that significantly influence model predictions. For multivariate time series, the interpretation assigns feature importance values over time, revealing how input variables contribute to specific predictions at each time step, as shown in Figure 2c. In the context of tabular data, which often has fewer dimensions, interpretations tend to be more straightforward (Figure 2d), indicating how each input variable moves the output value from the model's baseline value to the actual prediction for a given instance. In addition, several other interpretation methods, such as anchor algorithms (Ribeiro et al., 2018) and counterfactual explanations (Wachter et al., 2017), can provide more problem-specific and actionable insights for decision making by identifying precise conditions for predictions or pinpointing minimal input changes that alter the outcome.
Generally, elevating these instance-level interpretations to a comprehensive understanding requires synthesizing these individual insights into a cohesive perspective. Figures 2e-2g present aggregated interpretation results, corresponding to those in Figures 2b-2d, using various strategies. For example, methods such as composite maps (Figure 2e) and clustering of feature importance (Figure 2f) can help identify key features or common underlying mechanisms across different scenarios or instances. Moreover, investigating how a feature's contribution to model predictions changes with its value and the value of other variables can be informative. For instance, the bee swarm plots in Figure 2g provide a dense summary of each input feature's impact on model output, while the dependence plot illustrates how model predictions depend on interactions between multiple features. These examples are described in more detail in Text S4 in Supporting Information S1 and can be found in the respective literature. In addition, examining variations in feature contributions is also helpful in identifying thresholds or saturation points at which a feature value begins to have diminishing or increasingly significant effects on the predicted outcome (Chakraborty et al., 2021).
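The step from instance-level attributions to a global summary can be sketched with a toy additive model, for which Shapley values under a marginal (interventional) value function reduce to each feature's own term minus the mean of that term. The model, the data, and the choice of mean absolute attribution as the global summary are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

def model(X):
    # Hypothetical additive model: linear in feature 0, quadratic in feature 1.
    return 2.0 * X[:, 0] + X[:, 1] ** 2

# Per-feature terms of the additive model.
terms = np.column_stack([2.0 * X[:, 0], X[:, 1] ** 2])

# Instance-level attributions: each term minus its data-set mean.
phi = terms - terms.mean(axis=0)
base_value = terms.mean(axis=0).sum()

# Global summaries: signed contributions cancel out, absolute ones do not.
mean_signed = phi.mean(axis=0)
global_importance = np.abs(phi).mean(axis=0)

print("mean signed attribution:  ", mean_signed)
print("mean absolute attribution:", global_importance)
```

Plotting `phi[:, 1]` against `X[:, 1]` would recover the quadratic dependence, which is the kind of information a dependence plot conveys.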

Common Pitfalls and Good Practices
To effectively apply IML in the geosciences, it is essential to recognize and understand common pitfalls, which are not isolated but interrelated, and to adopt good practices that ensure robust, reliable, and scientifically sound outcomes. This section aims to summarize some key considerations and practical advice on both what to avoid and how best to approach IML applications.

Model Interpretations Do Not Always Reflect Data Truths
A common pitfall in seeking insights from IML is the misconception that the model's interpretations necessarily equate to truths about the underlying data-generating process or real-world phenomena. In reality, the interpretations offered by these methods merely estimate how a specific ML model arrives at certain predictions based on inputs (Good & Hardin, 2012). Misinterpreting these as direct insights into real-world phenomena can lead to misleading or incorrect conclusions, especially if the model's learned decision rules do not match the actual underlying data relationships (Figure 3a). For example, models that are underfitted due to overly general decision rules will perform poorly on both training and test data, indicating a failure to capture the true underlying relationship. Conversely, overfitted models that learn rules too close to the training data, including noise and anomalies, may also struggle to generalize the underlying relationships. Perhaps more imperceptibly, even models that perform well on training and independent and identically distributed test data, but not on out-of-distribution data, may misrepresent the data-generating process. This scenario can occur when a model relies on superficial or spurious patterns (e.g., shortcut learning) (Geirhos et al., 2020), for instance, classifying images based on embedded text labels rather than their actual features. In essence, ML algorithms can skillfully perform tasks based on spurious, non-physical relationships, but the true relationships may deviate from the correlations initially observed in the training data.
In the context of the geosciences, the unique spatial and temporal structures inherent in geoscientific data, such as autocorrelation, make these issues particularly critical. For instance, in large-scale ecological mapping studies, ML models are often used to characterize the relationship between local environmental conditions (e.g., climate, topography, and soil types) and targets of interest, such as vegetation reflectance properties, in order to extrapolate the targets of interest beyond the sampling locations (Ploton et al., 2020). However, it has been reported that the predictive power of ML models in the literature is often evaluated using nonspatial cross-validation, which can lead to misleadingly confident interpretations of model accuracy and reliability where autocorrelation may act as a shortcut (Stock et al., 2023). Consequently, any inference of ecological determinism based on post-hoc interpretations of these models must be approached with extreme caution (Ploton et al., 2020). In practice, rigorous validation is essential, using resampling procedures such as holdout or (repeated) cross-validation, depending on sample size. These validation procedures should reflect the structure of the prediction task, taking into account spatial or temporal prediction distances and out-of-sample estimation where appropriate (Brenning, 2022), since validation results as well as model interpretations will inevitably depend on the chosen resampling strategy (Meyer & Pebesma, 2022; Schratz et al., 2019; Sweet et al., 2023). Ideally, model interpretations should also be validated against out-of-sample data sets (e.g., independent data sets relevant to the study) to ensure that the insights they provide are not artifacts of the unique characteristics or biases present in the training data but reflect more general patterns and relationships. Overall, insights derived from IML should not be regarded as definitive interpretations of data truths, but rather as hypotheses that require further validation through additional analysis and experimentation.

Tendency of Causal Interpretation
In IML applications, a subtle yet significant risk is the often-unintentional misinterpretation of relationships derived from predictive models as causal in nature (Figure 3b). Standard supervised ML models are designed to exploit associations in the data rather than explicitly model causal relationships. Generally, predictive models focus primarily on understanding the observational conditional probability $p(Y \mid X = x_0)$ by inferring the probable values of Y when X is observed to be value $x_0$. Conversely, causal tasks focus on the interventional probability $p(Y \mid \mathrm{do}(X = x_0))$, which describes the effect of a change or intervention in X (e.g., setting it to $x_0$) on Y (Pearl & Mackenzie, 2018). For example, consider a flood prediction model that uses vegetation cover as one of its input variables. Such a model might perform well by exploiting the observed association between vegetation cover and certain flood processes (Calder & Aylward, 2006). However, this observed association within the predictive model does not inherently reveal the direct impact of interventions in vegetation cover (e.g., afforestation or deforestation) on flood events (Rogger et al., 2017). This is because the observed association may arise from correlations between vegetation and climate characteristics or geomorphology, which also influence the distribution and characteristics of flood events. Moreover, when building predictive ML models, it is common to include as many explanatory variables as possible to maximize performance. However, this approach can be counterproductive when the goal is interpretation. For example, research has shown that IML methods used to identify influential variables and uncover underlying functional relationships in ecology are negatively affected by the inclusion of spurious variables (those that are correlated with, but not causally related to, the target variable) (Q. Yu et al., 2021). Therefore, when process understanding is important, it can be helpful to construct ML models using independent variables that have clear causal effects on response variables.
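The gap between the observational and interventional quantities can be made tangible with a small simulation. The structural model below is entirely hypothetical: a confounder Z (think of climate) drives both X (vegetation cover) and Y (a flood indicator), and X has a true causal effect of 0.5 on Y. All coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def outcome(x, z, rng):
    # Hypothetical mechanism for Y: true causal effect of X is 0.5.
    return 0.5 * x + 2.0 * z + rng.normal(size=len(x))

# Observational regime: the confounder Z influences X.
z = rng.normal(size=n)
x_obs = 1.0 * z + rng.normal(size=n)
y_obs = outcome(x_obs, z, rng)
slope_obs = np.polyfit(x_obs, y_obs, 1)[0]

# Interventional regime, do(X = x): X is set independently of Z.
x_do = rng.normal(size=n)
y_do = outcome(x_do, z, rng)
slope_do = np.polyfit(x_do, y_do, 1)[0]

print(f"slope from observational data:  {slope_obs:.2f}")
print(f"slope from interventional data: {slope_do:.2f}")
```

Under these assumptions the observational slope is about 1.5, because it absorbs the influence of Z, whereas the interventional slope recovers the causal effect of about 0.5. A predictive model trained on the observational data would be right to use the larger slope for prediction, yet wrong as a guide to the effect of changing X.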
Typically, an important condition for a predictive model to yield a causal effect estimate is that its input variables are independent of unobserved confounders (i.e., variables that affect both the input and the model target). Otherwise, interpretations derived directly from a predictive model do not directly indicate whether a variable acts as a cause, an effect, or has no causal relationship with the target variable (Molnar et al., 2022). Despite the awareness that correlation does not imply causation, there is a tendency to interpret the results of IML methods from a causal perspective (Arif & MacNeil, 2022), especially if such an interpretation is consistent with preexisting beliefs or theories. Recent literature suggests that predictive ML models have already been conflated with causality in ecological studies, where ML models are increasingly being misused for causal interpretations (Arif & MacNeil, 2022). When interpreting IML outputs, it is important to use language that accurately reflects the nature of these findings. For example, terms such as "associated with" may be more appropriate than "driven by" (Thapa et al., 2020). However, there remains the possibility that readers may interpret correlational statements as causal (Gershman & Ullman, 2023). Explicitly stating the limitations of the analysis and acknowledging the potential for alternative explanations or confounding factors can help readers understand the nature of the relationships presented.
In most cases, IML should not be considered a definitive source of causal knowledge. The challenge of causal discovery and inference remains an important open question in ML research (Runge et al., 2023). In general, a thorough investigation is needed to make explicit under which assumptions causal insights can be extracted from the interpretation of ML models (Janzing et al., 2020). Recently, there has been a growing interest in integrating causal inference concepts such as structural causal models, do-operators, and causal metrics into ML interpretation (e.g., Carloni et al., 2023; Reimers et al., 2020). For example, Heskes et al. (2020) proposed causal Shapley values, which extend the traditional Shapley value framework by explicitly incorporating interventional expectations to account for both direct and indirect contributions of a feature to the model's predictions. Similarly, the knockoff framework allows causal exploration with ML models by generating synthetic control variables to rigorously assess the importance of features, distinguishing between causally relevant features and correlated features (Popescu et al., 2021). In addition, innovations such as double ML (Chernozhukov et al., 2018) and causal ML (Tesch et al., 2023) are being explored in Earth science research. To robustly explore and validate causal relationships, it may be necessary to complement IML findings with additional causal inference frameworks, such as quasi-experimental approaches (Butsic et al., 2017) and time-series causal analysis (Runge et al., 2019).

Multicollinearity and Dependence Among Features
Another issue that often receives insufficient attention in IML applications is interdependence among features, where one or more features can be explained non-linearly by ML models using the other features, which is referred to as "concurvity" in some contexts (Wood, 2017). A widely known example of this is multicollinearity among input variables, where some features are strongly correlated with one another. In addition to exacerbating the risk of misattribution of causality discussed above, the problem of multicollinearity can also affect the reliability of IML results (Figure 3c). While this concern is well recognized in classical statistical analysis, for example, through the variance inflation factor (Mansfield & Helms, 1982), its importance seems to be less emphasized in the context of IML. This oversight may be due to the fact that ML models, even when trained on multicollinear data, are likely to retain predictive power, especially when the test data used have a similar dependence structure (Farrell et al., 2019). However, this predictive power does not negate the interpretive challenges posed by multicollinearity, especially when attempting to derive quantifiable insights for process understanding from a predictive model. This issue is particularly prevalent and critical in the geosciences, where variables often exhibit strong dependence and multicollinearity due to the interconnected nature of Earth systems. A case study in atmospheric chemistry demonstrated that correlated and dependent features can lead to spurious process-level explanations, where chemical reactions can be wrongly attributed to fundamentally incorrect compounds (Silva & Keller, 2024).
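As a quick diagnostic before interpretation, the variance inflation factor mentioned above can be computed directly from the feature correlation matrix, since VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal element of the inverse correlation matrix. The sketch below uses synthetic features; the variable roles in the comments and the threshold of 10 are assumptions (the latter a common rule of thumb, not a strict criterion). Note that the VIF captures only linear dependence and would not flag non-linear concurvity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)              # e.g., air temperature (assumed)
x1 = x0 + 0.1 * rng.normal(size=n)   # near-duplicate, e.g., a related proxy
x2 = rng.normal(size=n)              # an unrelated feature
X = np.column_stack([x0, x1, x2])

def variance_inflation_factors(X):
    # Diagonal of the inverse correlation matrix gives VIF_j = 1 / (1 - R_j^2).
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

vif = variance_inflation_factors(X)
for j, v in enumerate(vif):
    flag = "high" if v > 10 else "ok"  # rule-of-thumb threshold (assumed)
    print(f"feature {j}: VIF = {v:.1f} ({flag})")
```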
Theoretically, ML models could arbitrarily assign importance or weight across highly correlated variables when making predictions because they carry similar information about the target variable. In this case, the importance of features may be spread across multiple features, suggesting a weak or negligible association with the response (Brenning, 2023), or it may show high variability and even directional shifts (Chan et al., 2022). Furthermore, the presence of multicollinearity can lead to unreliable interpretations, especially when using perturbation-based methods. When features are highly correlated, these perturbations can extrapolate into "uncharted" regions within the feature space that lie outside the observed joint distribution of the variables, leading to biased assessments of feature importance (Hooker et al., 2021).
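The extrapolation problem can be shown with two synthetic, strongly correlated features. Permuting one of them, as permutation-based importance does, produces feature combinations that never occur in the observed data. The data and the simple gap-based check below are illustrative assumptions, not a formal out-of-distribution test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)  # tightly coupled to x0

# In the observed data the two features never differ by much.
gap_observed = np.abs(x1 - x0)

# Permuting x1 breaks the coupling, as in permutation importance.
x1_perm = rng.permutation(x1)
gap_permuted = np.abs(x1_perm - x0)

# Share of permuted samples lying outside anything seen in the observations.
frac_outside = np.mean(gap_permuted > gap_observed.max())

print(f"correlation before permutation: {np.corrcoef(x0, x1)[0, 1]:.3f}")
print(f"correlation after permutation:  {np.corrcoef(x0, x1_perm)[0, 1]:.3f}")
print(f"permuted samples outside observed range: {frac_outside:.0%}")
```

The model is then queried mostly in regions where it was never trained, so the resulting importance reflects its extrapolation behavior rather than the relationships it learned from the data.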
A common and straightforward strategy to mitigate the effects of multicollinearity is to exclude highly correlated variables whose information may be redundant in feature selection (Katrutsa & Strijov, 2017). However, this can sometimes conflict with the goal of identifying underlying relationships based on a comprehensive set of as many relevant variables as possible, which may contain subtle but crucial information. For instance, two climate variables may be closely related, but may affect an ecological process differently under varying conditions. In this case, it is important to increase the diversity of the environment (e.g., varied climate regions, geographic conditions, and species diversity) for the variables. This increased diversity can help account for multifaceted relationships between variables, especially when certain correlations are actually dependent on other factors (Dormann et al., 2013). For example, the dependence between soil moisture and evapotranspiration is generally determined by water and energy availability, which varies with season and geographic location (Hsu & Dirmeyer, 2023). Moreover, where possible and appropriate, closely related variables may be grouped or transformed for collective or conditional interpretation, where their contributions can be considered more holistically, rather than attempting to separate the individual contributions of these variables (Brenning, 2023; Jiang et al., 2024). Krell et al. (2023) further suggest that models based on gridded geospatial data can be sensitive to the choice of grouping scheme, and thus it is beneficial to compare explanations from multiple grouping schemes for more accurate insights, as each may probe the model differently.

Uncertainty in Interpretations
Using interpretation to enhance the transparency of ML models may inadvertently create an illusion of certainty about their results. However, as highlighted earlier, these interpretations are subject to various uncertainties, including those inherent in the data, the structure and training processes of the ML model, and the specific assumptions and computations behind the interpretation methods (Figure 3d). For example, multiple distinct ML models with comparable performance may provide divergent explanations for the same set of data (i.e., model multiplicity (Breiman, 2001b) or equifinality), which raises the question of how to discern which explanation is the most accurate or valid. The different narratives offered by each model often stem from their unique approaches to processing and using the input data, including biases in feature selection (Strobl et al., 2007). Furthermore, while predictive accuracy is not typically a primary concern in the pursuit of process understanding, interpretations from poorly or unstably performing models are likely to be inherently unreliable (Murdoch et al., 2019). In many cases, applying different interpretation methods to a single model (Mamalakis, Barnes, & Ebert-Uphoff, 2022), or even applying the same interpretation method but with varying settings or hyperparameters (S. Müller et al., 2023), can lead to different results. The variance in the latter case can be largely due to the approximations used by the interpretation techniques. These approximations simplify complex mathematical models into forms that are more understandable and computationally manageable, but can vary with each computation when stochastic processes are involved. For instance, the LIME method constructs simpler, surrogate models based on perturbed samples to locally approximate the prediction function of complex models (Tulio Ribeiro et al., 2016). Consequently, the explanations provided by LIME are sensitive to changes in the number of perturbed samples (Bansal et al., 2020). Similarly, Monte Carlo integration methods are often used to approximate Shapley values, which are also subject to sampling variability (Goldwasser & Hooker, 2023; Štrumbelj & Kononenko, 2013).
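The sampling variability of Monte Carlo Shapley estimates can be seen in a small sketch that compares exact Shapley values, obtained by enumerating all feature orderings, with estimates from a few randomly sampled orderings. The toy model, the instance, and the value function (features outside the coalition are replaced by a fixed baseline) are assumptions for illustration; practical implementations differ in how they handle absent features.

```python
import itertools
import numpy as np

def model(x):
    # Toy model with an interaction between features 0 and 1 (hypothetical).
    return x[0] * x[1] + 3.0 * x[2]

def value(coalition, x, baseline):
    # Features outside the coalition are set to their baseline value.
    z = baseline.copy()
    for j in coalition:
        z[j] = x[j]
    return model(z)

def shapley_from_orderings(orderings, x, baseline):
    # Average marginal contribution of each feature over the given orderings.
    phi = np.zeros(len(x))
    for order in orderings:
        coalition = []
        previous = value(coalition, x, baseline)
        for j in order:
            coalition.append(j)
            current = value(coalition, x, baseline)
            phi[j] += current - previous
            previous = current
    return phi / len(orderings)

x = np.array([1.0, 2.0, 1.0])
baseline = np.zeros(3)

# Exact values: all 3! = 6 orderings.
exact = shapley_from_orderings(list(itertools.permutations(range(3))), x, baseline)

# Monte Carlo estimates: 10 random orderings, repeated with 30 seeds.
estimates = []
for seed in range(30):
    rng = np.random.default_rng(seed)
    sampled = [rng.permutation(3) for _ in range(10)]
    estimates.append(shapley_from_orderings(sampled, x, baseline))
estimates = np.array(estimates)

print("exact:", exact)
print("std of estimates across seeds:", estimates.std(axis=0))
```

In this example only the interacting features show run-to-run spread; the purely additive feature is estimated exactly from any ordering. Reporting the spread across repeated runs, or increasing the number of sampled orderings, makes this source of uncertainty visible.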
For example, Hu et al. (2023) compared 11 IML methods to gain process understanding of climate and crop interactions from crop yield prediction modeling and found divergent results among these methods. They advised that future studies should not uncritically rely on the variable importance rankings produced by a single IML method to draw definitive conclusions. In practice, it is advisable to consider approaches or strategies for quantifying uncertainty in IML explanations, such as probabilistic and bootstrapping techniques. For example, Slack et al. (2020) proposed a Bayesian framework to generate probabilistic versions of LIME and SHAP, instead of pointwise estimates of feature importance. To account for various sources of uncertainty and enhance the robustness and reliability of interpretations, it may be beneficial to perform IML analysis repeatedly by resampling the data, using different subsets of data, varying initial random seeds in ML models, or applying multiple interpretation methods (e.g., Jiang et al., 2024; Labe & Barnes, 2021; W. Li et al., 2022). In addition, it is important to be aware of the assumptions, limitations, and potential weaknesses of the interpretation methods applied to realistic and complex geoscientific data sets. Recently, Mamalakis, Barnes, and Ebert-Uphoff (2022) developed synthetic attribution benchmark data sets specifically tailored for geoscience applications, providing a solid foundation for more falsifiable and rigorous research. Bommer et al. (2024) also introduced a suite of metrics to evaluate the effectiveness of different interpretation methods in climate research, such as robustness, faithfulness, randomization, complexity, and localization, to facilitate the selection of the most appropriate interpretation methods for both technically robust and contextually relevant applications.
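A bootstrapping strategy of the kind mentioned above might look like the following sketch, in which the data are resampled with replacement, the model is refit, and the spread of a simple importance measure is recorded. The synthetic data, the least-squares model, and the use of fitted coefficients as the importance measure are all simplifying assumptions; with a real ML model, the refit and the interpretation step would replace `fit_coefs`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x0 = rng.normal(size=n)
x1 = x0 + 0.2 * rng.normal(size=n)  # strongly correlated with x0
x2 = rng.normal(size=n)             # independent of the others
X = np.column_stack([x0, x1, x2])
y = x0 + x1 + x2 + 0.5 * rng.normal(size=n)

def fit_coefs(X, y):
    # Least-squares fit with intercept; returns the feature coefficients only.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

# Refit on 500 bootstrap resamples and collect the importance estimates.
boot = np.empty((500, 3))
for b in range(500):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit_coefs(X[idx], y[idx])

lower, upper = np.percentile(boot, [5, 95], axis=0)
width = upper - lower
for j in range(3):
    print(f"feature {j}: 90% interval [{lower[j]:.2f}, {upper[j]:.2f}]")
```

Here the two correlated features receive much wider intervals than the independent one, which illustrates how reporting intervals rather than point estimates exposes unstable attributions.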

Gap Between Complexity and Interpretability
The development of post-hoc interpretation methods has somewhat alleviated the long-standing trade-off between accuracy and interpretability of ML models (Murdoch et al., 2019). However, extracting scientific insights from ML models with complex structures remains a practical challenge (Figure 3e). Interpreting the internal mechanisms of complex, high-performing ML models in a human-understandable way often requires a degree of simplification that may obscure the subtle intricacies captured by the model. Moreover, even when interpretation methods accurately reflect the algorithmic functioning of ML models, the resulting explanations are not necessarily intuitive and aligned with human understanding in specialized domains (Ehsan et al., 2022). This mismatch between the computational logic of algorithms and human intuition can lead to misinterpretations, requiring thoughtful translation of algorithmic explanations into terms that are both accessible and relevant to the domain (Achtibat et al., 2023). In a study examining ozone mapping models, SHAP values revealed that the models placed more importance on geographical features such as absolute latitude and altitude than on chemical factors like NOx emissions (Betancourt et al., 2022). The authors noted that this finding might appear counterintuitive, as ozone chemistry is typically expected to play a more significant role in such models. On the other hand, comparing these interpretations to existing knowledge can be fraught with cognitive biases that tend to reinforce existing theories or expectations and potentially overlook novel insights. For example, if a model suggests an unconventional factor as influential in climate change, it may be dismissed if it contradicts long-held beliefs, despite its potential validity. These challenges highlight the need to balance the advanced computational accuracy of complex ML models with the practical need for clear, concise, and actionable insights.
As noted previously, an iterative model building strategy is advocated, where complexity is incrementally increased and the interpretability of the model is continuously evaluated (Molnar et al., 2022). This method aims to find a sweet spot where the model achieves both high accuracy and meaningful interpretability. For example, a GAM and its geospatial extensions can serve as a gradual transition between linear models and complex ML models in this iterative process (Rudin, 2019; Wood, 2017). The additive structure of a GAM is specified prior to model fitting, allowing for the estimation of prescribed features or interaction effects. In addition to GAM implementations based on smoothing splines (Wood, 2017), tree-based GAM smoothers, such as Explainable Boosting Machines (Lou et al., 2013), can provide greater flexibility and robustness, especially in high-dimensional situations. Such additive models are often as accurate as state-of-the-art ML models (e.g., XGBoost), while remaining inherently interpretable (Goetz et al., 2015). Furthermore, as noted by Betancourt et al. (2022), the counterintuitive results for ozone attribution may arise because the purely data-driven model approach is inherently a posteriori and not process-oriented in any way, that is, scientific consistency was not enforced during the training process. These shortcomings highlight the value of hybrid (Reichstein et al., 2019) or differentiable modeling (Shen et al., 2023) strategies in Earth sciences that aim to create inherently interpretable models, that is, models that follow a domain-specific set of constraints that make the reasoning processes understandable (Rudin et al., 2022). These strategies involve the integration of physical relationships or models into ML architectures (e.g., Jiang et al., 2020; Kraft et al., 2022; C. Wang et al., 2024). In this way, the complexity inherent in the data can be effectively managed by anchoring the models in well-established scientific principles, which helps constrain the models to plausible behaviors and thus reduces ambiguity in their explanations.

Conclusion and Outlook
The rapid development of AI and its subfield IML has opened new frontiers in various scientific disciplines, including the geosciences. However, amidst the rapid expansion in the use of IML, there has been both a tendency toward careless and superficial application and an underestimation of its much broader potential in the field. This study aims to address these issues, improve the accessibility and relevance of IML to a wider range of geoscientists and, more importantly, facilitate more effective and appropriate use of these innovative tools. In this paper, we specifically focus on the potential benefits of IML for process understanding in the geosciences. It is anticipated that IML will become an indispensable method for enhancing our current, often conceptual and qualitative understanding with quantifiable non-linear insights, and for generating innovative hypotheses with large data sets. In particular, IML is expected to play an important role in evaluating and revising existing process-based models. However, it is important to recognize that AI tools alone are not sufficient to drive progress in domain science, and to remain vigilant about the potential risks of scientific monocultures that AI-led science may foster (Messeri & Crockett, 2024). Rather than advocating a shift away from process-based modeling, we emphasize the complementary role of IML in addressing tasks that are challenging for traditional methods. While the current application of IML to understanding the complexities of the Earth system is in its early stages, its far-reaching implications are undeniable. We envision a future in which a broad spectrum of geoscientists benefit from the insights provided by IML, using it as an advanced analytical method in an era of abundant data to deepen our understanding of Earth's complex systems both directly and indirectly.
This study presents a practical workflow with examples for geoscientists to effectively integrate IML into their research. In particular, we identify several potential pitfalls that are likely to be encountered when applying IML. We advocate cross-disciplinary collaboration between geoscientists, data scientists, and ML experts to tailor IML tools to specific geoscience needs, with a focus on causal and multifactorial process considerations, knowledge integration, and uncertainty quantification. In general, we argue for a pragmatic approach to these tools and their more thoughtful use in geoscientific research to ensure responsible knowledge production. While existing research has pioneered the use of IML, we recognize the need to be more cautious in drawing conclusions, especially in scenarios where rigorous validation is not possible. We encourage researchers to carefully evaluate the robustness of their results, taking into account the good practices we have suggested, before reporting them, with the goal of further solidifying the role of IML as a reliable and effective tool for advancing geoscientific research.

Figure 2. Workflow and examples of applying interpretable machine learning for geoscientific process understanding. (a) Flowchart illustrating the general workflow, where gray boxes represent objects and red boxes represent operations (explained in the corresponding subsections). (b-g) Illustrate how the algorithmic explanation results for different types of data can be translated into scientific understanding with examples from the literature (briefly explained in Section 3.5 and detailed in Text S4 in Supporting Information S1). (b, e) Are modified from Davenport and Diffenbaugh (2021), where (b) shows a sea level pressure anomaly map for a given day, with the IML-derived pixel-wise relevance indicating its contribution to the classification of the day as having large-scale extreme precipitation circulation patterns (EPCP). (e) Presents composite relevance maps for EPCP days, which aggregate the relevance maps exemplified in (b). (c, f) Are adapted from Jiang, Bevacqua, and Zscheischler (2022). (c) Shows the IML-derived feature importance of precipitation, temperature, and day length over 180 days for predicting streamflow on the subsequent day. (f) Illustrates the results of a clustering analysis applied to all feature importance values across events and basins, with the bar plot indicating the average feature contribution pattern (aP: antecedent precipitation from 180 to 7 days before the event) and the map showing the proportion of events falling into this cluster in individual basins. (d, g) Are adapted from H. Wang et al. (2022). (d) Indicates the contribution of seven variables in predicting gross primary productivity for a specific sample, as estimated by the SHAP value. The actual values of these variables are shown in gray. The top plot in (g) illustrates the relationship between feature contribution (x-axis) and values (color) across all variables. The bottom plot in (g) is a dependence plot of vapor pressure deficit (VPD) versus its contribution value along the soil water content (SWC) gradient in grasslands. For more information, including definitions of other abbreviations, see the respective references.
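
The per-sample contributions described for panel (d) can be illustrated with a minimal sketch of exact Shapley values. This is not the procedure used by H. Wang et al. (2022): the three-feature `toy_model`, the all-zero `baseline`, and the choice to represent an "absent" feature by its baseline value are assumptions made purely for illustration. Practical SHAP implementations typically average over a background data set and rely on approximations rather than enumerating every feature ordering.

```python
from itertools import permutations
from math import factorial


def exact_shapley(model, sample, baseline):
    """Exact Shapley values for one sample.

    Features are switched from their baseline value to their sample value
    one at a time, and the resulting change in model output is averaged
    over all possible feature orderings.
    """
    n = len(sample)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous_output = model(current)
        for j in order:
            current[j] = sample[j]
            output = model(current)
            contributions[j] += output - previous_output
            previous_output = output
    n_orders = factorial(n)
    return [c / n_orders for c in contributions]


def toy_model(x):
    # Hypothetical response: one additive term plus an interaction term.
    a, b, c = x
    return 2.0 * a + b * c


sample = [1.0, 2.0, 3.0]      # hypothetical feature values for one sample
baseline = [0.0, 0.0, 0.0]    # hypothetical reference values

phi = exact_shapley(toy_model, sample, baseline)
print(phi)  # [2.0, 3.0, 3.0]
```

The contributions sum to the difference between the model output for the sample (8.0) and for the baseline (0.0), and the interaction term `b * c` is split equally between the two interacting features. Both properties are worth keeping in mind when reading contribution plots: a feature's attribution depends on the chosen reference and may include shared interaction effects.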

Figure 3. Common pitfalls in geoscience interpretable machine learning (IML) applications. (a) ML model training can result in underfitting, overfitting, shortcut learning, or successful capture of the underlying data generation process. These results can be reflected by sufficient model performance on training data, independent and identically distributed (i.i.d.) test data, and out-of-distribution (o.o.d.) test data, as indicated by the corresponding links in the diagram. (b) The difference between predictive and causal goals. The predictive model generally captures only the observational distribution of the data and cannot be equated with causal insights based on the interventional distribution. (c) Strongly interdependent input variables can lead to varying feature importance scores in different model runs, because they carry similar information about the target output. (d) Different methodological choices can lead to diverse insights, thereby introducing uncertainty into the interpretation process. (e) Complex models may accurately capture intricate data patterns, but the interpretations may be difficult for humans to intuitively understand, hindering the ability to gain actionable insights from the IML framework.
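
Pitfall (c) can be reproduced with a small synthetic experiment, sketched below under stated assumptions. The data, the two-feature linear model, and the use of fitted coefficients as a stand-in for feature importance are all hypothetical choices for illustration; they are not taken from the studies cited in this paper. The function `coefficient_spread` refits the model on bootstrap resamples (mimicking repeated model runs) and reports how much the coefficient of the first feature varies, alongside the variation of the summed coefficients.

```python
import random
import statistics


def fit_two_feature_ols(x1, x2, y):
    """Least-squares coefficients for y ~ b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def coefficient_spread(noise_in_x2, n_samples=200, n_runs=200, seed=0):
    """Return (std of b1, std of b1 + b2) across bootstrap refits.

    x2 is a noisy copy of x1, so a small `noise_in_x2` means the two
    inputs are strongly interdependent. The target depends only on x1.
    """
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n_samples)]
    x2 = [a + rng.gauss(0, noise_in_x2) for a in x1]
    y = [a + rng.gauss(0, 0.5) for a in x1]

    b1_values, sum_values = [], []
    for _ in range(n_runs):
        # Each bootstrap resample plays the role of one "model run".
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        b1, b2 = fit_two_feature_ols(
            [x1[i] for i in idx], [x2[i] for i in idx], [y[i] for i in idx]
        )
        b1_values.append(b1)
        sum_values.append(b1 + b2)
    return statistics.stdev(b1_values), statistics.stdev(sum_values)


if __name__ == "__main__":
    for noise in (0.05, 1.0):
        b1_std, sum_std = coefficient_spread(noise)
        print(f"noise in x2 = {noise}: std(b1) = {b1_std:.3f}, "
              f"std(b1 + b2) = {sum_std:.3f}")
```

With strongly interdependent inputs (small `noise_in_x2`), the individual coefficient fluctuates widely between runs even though the combined effect of the two inputs remains stable. In this toy setting, importance is more robust when assigned to the group of interdependent variables than to either variable alone.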