Identifying Causal Interactions Between Groundwater and Streamflow Using Convergent Cross‐Mapping

Groundwater (GW) is commonly conceptualized as causally linked to streamflow (SF). However, confirming where and how it occurs is challenging given the expense of experimental field monitoring. Therefore, hydrological modeling and water management often rely on expert knowledge to draw causality between SF and GW. This paper investigates the potential of convergent cross‐mapping (CCM) to identify causal interactions between SF and GW head. Widely used in ecology, CCM is a nonparametric method to identify causality in nonlinear dynamic systems. To apply CCM between variables the only required inputs are time‐series data (stream gauge and bore), so it may be an attractive alternative or complement to expensive field‐based studies of causality. Three upland catchments across different hydrogeologic settings and climatic conditions in Victoria, Australia, are adopted as case studies. The outputs of the method seem to largely agree with a priori perceptual understanding of the study areas and offered additional insights about hydrological processes. For instance, it suggested weaker SF‐GW interactions during and after the Millennium Drought than in the previous wet periods. However, we show that CCM limitations around seasonality, data sampling frequency, and long‐term trends could impact the variability and significance of causal links. Hence, care must be taken while physically interpreting the causal links suggested by CCM. Overall, this study shows that CCM can provide valuable causal information from common hydrological time‐series, which is relevant to a wide range of applications, but it should be used and interpreted with care and future research is needed.

surface water are dictated by multiple processes that are often difficult to measure, insufficiently monitored, and inaccurately conceptualized (Barthel, 2014;Fleckenstein et al., 2010;Krause et al., 2011;Peterson & Western, 2018;Ross, 2012;Staudinger et al., 2019). For instance, hydrologic models often assume unidirectional gaining streams and the importance of the hydrogeologic setting is often neglected by considering horizontal homogeneous isotropic GW formations (Barthel & Banzhaf, 2016;Rinderer et al., 2018). Also, powerful experimental methods, like water balance with geochemical signatures (Atkinson et al., 2015;Cartwright & Morgenstern, 2015;Cartwright et al., 2014;Gilfedder et al., 2013;Hartmann et al., 2013;Rimmer & Hartmann, 2014) have the ability to constrain the spatial heterogeneity of GW response and sources of baseflow (Ali et al., 2015;Seibert et al., 2003Seibert et al., , 2011, but they have their own inherent uncertainty (Atkinson et al., 2015;Barthel, 2014;McDonnell, 2014) and are sufficiently costly to often limit their application to small study areas and short periods of time (Barthel & Banzhaf, 2016). This is the case particularly in the developing world, which is under-represented in hydrological studies (T. P. Burt & J. J. McDonnell, 2015) and where financial resources are scarce and increasing water demands are managed with limited hydrological data (Davtalab et al., 2017;FAO, 2017;Hulsman et al., 2021). Therefore, hydrological models, which provide an indirect way to identify dominant hydrological processes, arguably remain the most practical and prevalent method to understand process interaction and estimate SF-GW interactions at the catchment scale. Yet, the original development of a model often relies on limited data and the perceptual understanding of the developers to draw causal links and define the model structure that is usually static, therefore, representing a single subset of working hypothesis that might not be appropriate across hydrologically distinct catchments Staudinger et al., 2019). In fact, the choice of model structure can represent a large source of uncertainty that hinders the ability of hydrological models to characterize the hydrological processes in the catchment of interest.
In the context of hydrologic models, distinguishing model structures from rainfall-runoff (RR) response is challenging. For instance, Knoben et al. (2020) evaluated 36 conceptual RR models having different structures (1-15 parameters) on daily observed SF across 559 catchments in the United States of America and have shown that in most catchments up to 28 different models were very close to the best performing model. This illustrates the difficulties in using RR models to identify hydrological processes, as these 28 models are effectively multiple working hypotheses (Clark et al., 2011;Fenicia et al., 2011) with different conceptualizations of processes within the catchments. For example, from this study it would be difficult to infer which catchments have significant SF-GW interactions, as only 10 models had a GW store. Moreover, even if all models had a representation of GW processes, the different structures could result in internal fluxes and storage behavior that lack realism when compared with independent complementary data. Therefore, the nonidentifiability of model structure contributes significantly to the challenge of having models "working for the right reasons" (Kirchner, 2006), despite the advances represented by frameworks like Flux Mapping (Khatami et al., 2019), SUPERFLEX , SUMMA (Clark et al., 2015a(Clark et al., , 2015b, and MARRMoT (Knoben et al., 2018). Given the difficulties of inferring hydrological processes from models as well as the high cost of detailed field monitoring, there is a clear need for more direct data-driven approaches to investigate SF-GW interactions from commonly available monitoring data.
Causal inference methods (CIMs) seek to identify causality between variables directly from the data. Although these methods are gaining attention in earth sciences (Goodwell et al., 2020;Runge et al., 2019), their application in hydrological sciences remains mostly focused on linear methods, particularly cross-correlation and Granger causality (Granger, 1969). Linear CMIs are faster to compute and easier to relate, however, they assume linear and separable systems which reflects the common conceptualization that variables behave linearly and information about a causative variable is not present in the affected variable. Yet, it is widely known that correlation is not sufficient to infer causality, nor does lack of correlation infers lack of causation. In addition, the linear conceptualization might be inadequate for catchments as they can have an underlying nonlinear dynamic (Kirchner, 2009;Sivakumar, 2000Sivakumar, , 2009Sivakumar & Singh, 2012) with finite resilience to disturbances (Peterson, 2009;Peterson & Western, 2014a;Peterson et al., 2014Peterson et al., , 2021Saft et al., 2015).
The theory governing this nonlinear dynamic, called dynamical system theory (DST), has been widely applied in different areas of hydrology (Sivakumar, 2000(Sivakumar, , 2009(Sivakumar, , 2017. For instance, Rodriguez-Iturbe et al. (1989) has investigated the chaotic dynamics and limits of predictability of rainfall. Jayawardena and Lai (1994) applied it to predict rainfall and SF and Delforge et al. (2020) to analyze and forecast SF recessions. Erkyihun et al. (2016), Rajagopalan et al. (2019), and Sangoyomi et al. (1996) utilized it for hydroclimatic forecasting and simulation. Also, studies that identified low-dimensional dynamics across different hydrological processes showed that complexity can emerge from a small set of governing equations, which underpins the idea that simple deterministic hydrological models can adequately describe complex hydrological processes (Procaccia, 1988). Nonlinear CIMs have been used in hydrology, for instance, to investigate interactions between soil moisture and precipitation (Wang et al., 2018), ecohydrological interactions (Ruddell & Kumar, 2009), dynamics of treatment wetlands (Medina et al., 2019), stream chemistry dynamics (Jiang & Kumar, 2019), drivers of evapotranspiration (ET) (Ombadi et al., 2020), hydrological connectivity in karst systems (Delforge et al., 2022), in rivers deltas (Sendrowski & Passalacqua, 2017), and between GW and SF (Rinderer et al., 2018). Also, recent studies have investigated the potential and limits of some nonlinear CIMs, particularly the impact of noise, sample size, and the occurrence of spurious causal links (Delforge et al., 2022;Ombadi et al., 2020). However, the application of these methods remains incipient and further exploration is important to foster their use in hydrology.
The nonlinear CIM investigated in this paper, called convergent cross-mapping or simply CCM (Sugihara et al., 2012), is based on time-delay embedding proposed by Takens (1981). Here, it was used to investigate SF-GW interactions from the perspective of empirical causal inference: that is, are GW head dynamics causing SF dynamics, or vice versa, or both? This paper explores the potential of the method to contribute to the long-standing challenge of identifying hydrological processes and understanding heterogeneity by comparing the CCM results with a priori perceptual understanding of processes in three upland catchments located in different hydrogeologic and topographic settings of southeast Australia. In light of a high number of spurious causation identified by CCM and reported in the literature (Delforge et al., 2022;Ombadi et al., 2020), we also explore some limitations of the method that are yet not well covered, particularly the role of seasonality, data sampling frequency, and long-term trends. This is done with multiyear GW head and SF data encompassing a wet and a dry period, including a long and severe multiyear drought that has produced complex hydrological behaviors in the study areas.
In the next sections we describe the causality framework applied in this paper (Section 2), present a general account of the study areas along with a priori perceptual understanding of hydrological processes (Section 2.2), and describe the input data for the CCM analysis (Section 2.2.2). The CCM results are presented in Section 3 and discussed in Section 4.

Dynamic Systems, State-Space Reconstructions, and Attractors
Nonlinear dynamic systems, notably observed by Lorenz (1963) while running weather simulations, are not stochastic, meaning that they follow certain sets of rules to change from one state to the other (Deyle & Sugihara, 2011;Tsonis, 1992). In fine, the temporal behavior of a dynamical system is often pictured as a trajectory within the state-space diagram which represents its evolution from some unknown initial state (Ivancevic & Ivancevic, 2008). Dynamical system's states are solutions of the equations that describe how the system evolves along the state-space trajectories and how the D-explanatory variables interact over time. The temporal behavior of these systems produces trajectories that are recurrent and converge to a portion of the state-space, which is called an "attractor." This sub-space of attraction can be multi-dimensional, but its dimension is less than the D-dimension of the system's state-space driven by D-variables. Deterministic systems with long-term predictability have attractors of integer dimension. In the case of dynamic systems highly sensitive to initial conditions, attractors have fractal (or non-integer) dimension, and the long-term predictability of the system is not certain. Such attractors have a manifold geometry and are called "strange attractors," terminology coined by Ruelle and Takens (1971).
State-space reconstructions using a single time-series often rely on Takens' theorem (Takens, 1981). Essentially, Takens considered time-series as a unidimensional projection of the attractor of a dynamical system (Deyle & Sugihara, 2011). Figuratively, in the context of watersheds, the state-space could be tri-dimensional and explained by rainfall, GW head, and SF, each representing one dimension of the manifold. Therefore, projecting the state-space onto one of its dimensions would yield the time-ordered value of one of the explanatory variables. In other words, time-series are considered a sequential observation of the system's state-space. Accordingly, Takens' theorem suggests that a system's state-space can be reconstructed, or embedded, using enough lagged coordinates of a single time-series to evaluate the characteristics of a dynamic system. Therefore, even if the dynamic system behaves according to multiple variables, information about the system behavior should still be encoded in the state-space reconstructed from a single time-series, or in the so called shadow manifold. Takens (1981) and Deyle and Sugihara (2011) describe how a reconstruction of the system state can be done and for a more thorough explanation of state-space reconstructions and attractors the reader is directed to Ivancevic and Ivancevic (2008), Sugihara et al. (2012), and Tsonis (2018).

Convergent Cross-Mapping (CCM)
Convergent cross-mapping (CCM) is a non-parametric method to infer causality between two variables in nonlinear dynamical systems. This method is based on Takens' theorem of manifold reconstruction. To identify nonlinear interactions, CCM tests for topological similarities between reconstructed attractor manifolds (Sugihara et al., 2012;Tsonis et al., 2018). In other words, CCM detects causality by checking if two variables behave consistently while revisiting similar states. This is done by measuring the ability of increasingly large fractions of the historical time-series (or library size) of one variable to forecast states of the other variable. Essentially, stronger correlations between observed and predicted states indicate stronger interactions. In detail, to test the hypothesis that variable X causes variable Y, the first step is to construct a shadow manifold from Here, E is the embedding dimension, or the number of lags used to reconstruct the shadow manifold and tau is the embedding delay which was set at 1 time step of the time-series. For a single forecast at a time of reference, E+1 nearest neighbors are identified in M Y according to their Euclidean distance to Y considering a specific library size. The library size (or sample length) defines the subsample length of the shadow manifold M Y where nearest neighbors can be identified from. Then, the time-indices of the identified nearest-neighbors are used to map points in X which are then averaged considering the Euclidean distances in M Y to compute the forecast of X*. For all time indices, CCM cross-map skill is then quantified as the Pearson's correlation between the observed X and forecasted X*. As for each library size the M Y is bootstrapped (randomly and without replacement) a number of times (set to one hundred in this study), the cross-map skill for each unique library size is computed as the mean Pearson's correlation.
Concisely, variable X causes Y and pertains to the same system if it is possible to cross-map X from the shadow manifold of Y, as the causal variable leaves an information signature in the time-series of the affected variable. Hence, the causal variable can be estimated from its signature in the affected variable, which is counterintuitive and different to the popular GC framework (Granger, 1969) and those derived from it, such as conditional mutual information and transfer of entropy (Fraser, 1989;Fraser & Swinney, 1986;Schreiber, 2000), which assume the separability of the causative from the affected variable.
Another important concept is that the affected variable can be used to estimate the states of the causal variable, but not the other way around, as the dynamics of the causal variable does not depend on the affected variable. To illustrate, for a catchment with significant two-way SF-GW interactions, cross-mapping would be possible from SF to GW and vice-versa. However, in cases where SF-GW interactions are primarily one way (i.e., GW drives SF), recovery of information would be similarly one-sided (i.e., only GW's causal relationship on SF can be estimated). Also, CCM estimates the average interactions over the analyzed time-series rather than for a specific subset. Thus, when analyzing multiyear hydrographs, the results refer to the entire time period and not, for example, to a single day or over a season. For a more thorough and visual explanation of the CCM fundamental ideas and algorithm, the reader is directed to Sugihara et al. (2012), Ye et al. (2015), Delforge et al. (2022), and to the following animation: http://tinyurl.com/EDM-intro.
In applying CCM, the SF and GW head used in the CCM analysis were not transformed, therefore, they are the original hydrographs. The optimal embedding dimension (E) to reconstruct the shadow manifolds (Table S4 in Supporting Information S1) was identified by finding the E values (from 2 to 10) that gives the highest cross-map skill (ρ). The CCM analysis was carried out assuming a contemporaneous causality given that a lag time of zero invariably produced the higher cross-correlation in our data. The rEDM 1.2.3 package (Ye et al., 2020) for the programming language R 4.0.0 (R Core Team, 2020) was the software package used in the data analysis.

Convergence and Significance to Distinguish Unidirectional, Bidirectional, and Non-Causal Interactions
To distinguish causation from spurious causality using CCM, it is necessary to observe convergence in the cross-map skill, which indicates information flow from the causal to the affected variable. This is done by checking if larger library seizes yield increasingly larger cross-map skills. To illustrate, Figure 1(a) shows that larger library sizes produce higher cross-map skill from Y to X (hypothetical variables). The idea is that longer samples generate denser manifolds with closer nearest neighbors that allow more precise cross-map estimates. as only variable X drives Y, without feedbacks, because only the cross-map skill (ρ) from Y to X significantly exceeds the cross-correlation at full library size and at the same time rejects the null hypothesis of a strong common forcing or seasonality as the driver of the results (see cross-correlation lines and green marker in relation to the boxplots in the figure). (b) A strong bi-directional link between X and Y, with a somewhat weaker effect of X on Y. (c) Shows no causal interaction between X and Y, and (d) suggests a unidirectional effect of Y on X. As indicated in the text, the library size refers to the number of data points considered.
Moreover, for this convergence to be significant, the cross-map skill (mean Pearson's correlation) at full library size should be significant to avoid being confounded as spurious causality. To assess the significance of the results in this study (p < 0.01), we tested whether the CCM results could be due to seasonality in the hydrographs or to a strong common forcing (i.e., seasonal rainfall or ET) using Fisher's Z test to check whether the cross-map skill (ρ) at full library size was significantly higher than the highest lagged cross-correlation (0-30 time steps) as well as stochastic replicates of SF and GW head. The seasonal method available in Ye et al. (2020) was used to create k = 200 stochastic replicates, that is, surrogate time-series for each hydrograph. Here, the method computes the seasonality of the annual cycle with a smoothing spline and shuffles the residuals without additive Gaussian noise. The p-value is calculated according to (n + 1)/(k + 1), where k is the total number of surrogates and n represents the amount of replicates that produced higher cross-map skill than the actual data. Figure 1 explains the interpretation of the different types of interaction using the convergence and significance criteria. This figure uses horizontal lines to illustrate the highest lagged cross-correlation between the variables and boxplots to show the difference between the CCM results obtained from stochastic replicates and the one from the original data. The blue boxplots represent the distribution of CCM results obtained from stochastic replicates of the variable X cross-mapping the observed variable Y. The red boxplots show the results from surrogate time-series of Y cross-mapping the variable X. The green target circles show the CCM results obtained with the original datasets. Accordingly, Figures 1a and 1d show unidirectional causal links between the variables, as the significance criteria is achieved only for one direction of interaction (X causing Y) even though convergence is achieved in all cases. Figure 1b illustrates a bi-directional causal link since the convergence and significance criteria are achieved for both cross-mapping directions. Figure 1c shows no causal interaction between the variables as both the convergence and significance criteria were not achieved.

Exploring the Role of Data Length and Frequency, Seasonality, Long-Term Trends, and Climatic Conditions on the CCM Analysis
As mentioned, in addition to use CCM to test an a priori conceptual understanding of processes in the study areas, we seek to explore the role of seasonality, long-term trends and data sampling frequency on the CCM results. Seasonality potentially leads to spurious causality as well as represent the dynamic of a third common cause. Data sampling frequency affects both seasonality and length of the data available for the analysis.
To evaluate the dynamics of SF-GW interactions at different time steps as well as the role of data sampling frequency on the significance of the results, the CCM analysis was done using the original daily data and using monthly aggregated hydrographs. The monthly SF hydrographs were derived as the monthly summation of daily SF and the monthly GW head time-series as the head observed in the last day of the month. The same significance tests were applied to both daily and monthly CCM results and a comparison between the daily and monthly analysis in the context of the stochastic replicates was used to evaluate the role of seasonality in the CCM results, as at the monthly time step both SF and GW hydrographs are expected to be more seasonal.
Regarding long-term trends, they can be problematic for the CCM framework as they can violate the recurrence assumption of dynamic trajectories, thus, reducing the capacity to verify whether two time series behave consistently in the state-space. This means that nearest neighbors in the state-space could be irrelevant, a behavior that would contradict the CCM framework. To avoid this issue, one could use different data filtering techniques to eliminate trends, however, as trends in nature are real (Gleick, 1987) and filtering the data usually changes its dimensionality (Badii et al., 1988;Chennaoui et al., 1990;Jayawardena & Lai, 1994;Mpitsos et al., 1988), we investigated the role of trends in our CCM results using the original datasets. To examine the role of long-term trends jointly with different climatic conditions, we divided the daily CCM analysis for bores S_bore1 and S_ bore2 in Sunday Creek into three subperiods of approximately 10 years each. These subperiods encompassed different climatic conditions as well as remarkably different trends in both SF and GW head hydrographs. The 1,500 days moving average (MA1500, roughly 4 years) was used to make the long-term trends more noticeable in the SF hydrograph of Sunday Creek.

Study Areas and A priori Perceptual Understanding of Hydrological Processes
The study area encompasses three upland catchments located in different hydrogeological settings in the state of Victoria, Australia (Figure 2), where SF and GW interactions are important management topics. Victoria's climate is classified as temperate (Peel et al., 2007) with high intra-annual seasonality in SF and rainfall (Peel et al., 2001). Moreover, the region experiences severe multiyear droughts (Potter et al., 2010) and the recent Millennium Drought (1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)) was the most severe on record (CSIRO, 2010;Van Dijk et al., 2013). Meteorologically, this drought ended in 2010 with large floods caused by a strong La Niña event (National Climate Centre -Bureau of Meteorology, 2011;Van Dijk et al., 2013). However, the hydrological drought appears to have continued, producing complex RR non-recovery behavior that challenges the assumption that SF always recovers from drought (Peterson et al., 2021), and there is much speculation as to whether changes in ET and the "long memory" of GW dynamics play a part in the persistence of the hydrological drought (Hughes et al., 2012;Peterson et al., 2021;Petrone et al., 2010;Saft et al., 2015, 2016a, Saft, Peel, Western, & Zhang, 2016Scaife & Band, 2017). For instance, a recent multi-disciplinary study suggests the duration of the drought as the main driver of the RR non-recovery together with interconnected GW processes, namely: declining GW levels and recharge which are associated with unsaturated zone expansion and less interaction between surface and underground water . At the same time, the same study highlights other plausible causes, such as increased evaporative demand, and the need for more long-term field monitoring of internal and subsurface processes to consolidate a physical explanation for the multiyear dynamics.
The Australian Bureau of Meteorology (Doody et al., 2017) considers all study areas as ground water dependent ecosystems (GDE) and therefore we expect GW to be driving SF in all three catchments. Various characteristics of the three catchments are provided in Table 1 and, according to the joint analysis and discussion that is presented below, our conceptual understanding suggests that GW drives SF more significantly in Brucknell Creek than in Sunday or Ford Creek. Additionally, the causal relationship in the opposite direction (flow driving GW) is conceptualized to be less significant in general, as the study areas are upland catchments where recharge from the stream is conceptualized to be less active than rainfall infiltration. Within the catchments, we expect bores in valley bottoms to have stronger links with SF, as they tend not only to be closer to the stream but also to be in areas that are more porous and permeable given sedimentary formations and/or deeper soils.
In terms of the hydrological characteristics provided in Table 1, at all three catchments the annual RR ratio shifted in response to the Millennium Drought but to different degrees (Saft, Peel, Western, & Zhang, 2016) and appears to have complex non-recovery behavior (Peterson et al., 2021). Brucknell Creek was the one less affected, where the RR ratio was 30%-50% lower during the drought. Ford and Sunday Creek were more strongly impacted, with decreases of 50%-75%, and >75%, respectively (Saft, Peel, Western, & Zhang, 2016). As previously mentioned, changes in GW might explain these shifts, however, there is no detailed monitoring of the hydraulic gradient between the river and the surrounding water table that could describe changes in the gaining and loosing stream processes. Moreover, classic RR models cannot reproduce this behavior given their representation of SF-GW interactions  and therefore the use of a data driven approach to shed light on the SF-GW interactions in the study areas is of paramount importance for future research.
Focusing on possible anthropogenic causes of RR shifts, the volume of GW extractions within the catchments could help explain these different behaviors, but the catchment with the smaller shift in the RR ratio was the one that has the highest annual permissible consumptive volume (PCV) of GW, namely Brucknell Creek. In this catchment, the PCV represents approximately 330% of the mean annual flow, while in Ford and Sunday Creek it represents only around 14% and 9% of the mean annual flow for the entire record, respectively.
At the same time, while rainfall and potential evapotranspiration (PET) are different between the catchments, Brucknell Creek does not have the larger mean annual rainfall. Neither has Sunday Creek the highest PET demand. In addition, all three catchments have seasonal SF (low flows in summer and high flows in winter, as seen in Figure S1 in Supporting Information S1), which is mostly driven by seasonality in evaporative demand  Figure S2 in Supporting Information S1) followed by rainfall ( Figure S3 in Supporting Information S1). Brucknell Creek has a stronger seasonality as 39% of its monthly flow variance is explained by the seasonal cycle computed with a smoothing LOESS (Savitzky & Golay, 1964) whereas in Ford and Sunday Creek this percentage decreases to 18% and 1%, respectively.
Brucknell Creek is a perennial stream, even during the Millennium Drought, and has the highest RR ratio (0.14), a baseflow index (BFI as per Lyne & Hollick, 1979;Nathan & McMahon, 1990) of 0.27 while in the other catchments BFI is below 0.13. Also, the intra-annual variation of monthly SF is more than 30% smaller in Brucknell Creek than in Ford or Sunday Creek. With regard to hydrogeology, Brucknell Creek sits mostly on a regional high productivity porous limestone aquifer, having a small portion of the catchment located on local low productivity fractured basalt aquifers (Brodie et al., 1998;Dahlhaus et al., 2002). Conversely, Ford and Sunday Creek are on local low productivity aquifers that are composed of fractured sedimentary and low-grade metamorphic rocks (Brodie et al., 1998;Walker et al., 2003). Figure S4 in Supporting Information S1 presents a detailed hydrogeological map of each catchment emphasizing the distinct types of aquifers and the location of the observation bores.
Although the RR shift magnitudes cannot be easily explained, as previously mentioned, Brucknell Creek is expected to have more significant GW and SF interactions that potentially could explain the smaller shift in the RR ratio during the long and severe Millennium Drought, as its regional high-productivity aquifer could offer extra buffer to sustain SF-GW connectivity thanks to its larger recharge area and larger GW storage. At the same time, all these parameters do not shed light on the strength and direction of this interaction across the catchments, or on how different areas of the aquifer might interact differently with the SF. Only using commonly available monitoring data, this study offers a methodology that seeks to derive more granular analysis of SF-GW interactions and a way to further develop and test conceptual models.

Streamflow and Groundwater Data
The daily observed SF data was sourced from Victoria's Department of Land, Environment, Water and Planning (DELWP) using the Water Measurement Information System ( Figure S8 in Supporting Information S1). The GW head observation data was from DELWP's State Observation Bore Network (SOBN), available at the same information system. There are 34 SOBN in the catchment areas, however, in this study we consider only those that had (a) multiyear monitoring (>5 years of data), (b) a relatively high number of observations per year of monitoring (>9 obs./year), and (c) monitoring period fully or partially covering the Millennium Drought (1997-2009, approximately). In total, 12 bores fulfilled these criteria in the study areas, two in Ford Creek, and five bores in each of Brucknell and Sunday Creeks. Table 1 and Figure S5 in Supporting Information S1 summarize the raw GW data for each catchment.
Given the paucity of the observed GW head time-series in Victoria, the raw observations were temporally interpolated on a daily time-step using the freely available HydroSight software (https://github.com/peterson-tim-j/ HydroSight/wiki). HydroSight uses the modeling approach of Peterson & Western (2014b) to interpolate between each observation based on climate drivers (Peterson & Western, 2018). In this study, we first identified and removed outlier observations as per Peterson et al. (2017) and then interpolated the hydrographs based on area weighted daily rainfall and PET at the location of each bore. These climate forcings were sourced and derived using the freely available R-package AWAPer (https://github.com/peterson-tim-j/AWAPer, Peterson et al., 2019). The PET data were determined using results from Morton's complementary relationship areal model (Guo et al., 2016;Morton, 1983) with daily maximum and minimum temperature and vapor pressure data from the Australian Water Availability Project (AWAP) database, and net solar radiation calculated with empirical equations from the Food and Agriculture Organization (Allen et al., 1998). The reliability of the resulting interpolated hydrographs was then evaluated using leave-every-other-point-out cross-validation. The average performance across all bores was 0.84 (Nash-Sutcliffe coefficient of efficiency) and the cross-validation results are presented in Table S1 and Figure S7 in Supporting Information S1. The observed and interpolated GW head data are shown in Figure S5 in Supporting Information S1. The SF and GW head time-series used in the CCM analysis are presented in their transformed form (zero mean, unit variance) in Figure S9 in Supporting Information S1.

Results
The CCM results indicate statistically significant causal relationships between GW head and SF in all three upland catchments at both daily and monthly time steps. Figures 3-5 present the daily and monthly CCM results and their interpretation for all 12 bores evaluated in Brucknell, Sunday, and Ford Creek, respectively, along with the location within the hydrogeologic setting of each catchment. The daily results are plotted with contiguous lines whereas the monthly results are shown with dashed lines. For a more detailed bore location and description of their hydrogeology, the reader is directed to Figure S4 in Supporting Information S1 and to Dahlhaus et al. (2002) and Walker et al. (2003).
For both daily and monthly time scales, three types of SF-GW causal interactions were identified across the study catchments, namely (a) GW is causal to SF, marked in red as GW-> SF; (b) SF is causal to GW, marked in blue as SF -> GW; and (c) GW and SF are causal to each other, marked in green as GW <-> SF. For instance, at the daily time step, Figure 3(a) indicates a significant bidirectional link between GW head and SF, as the cross-map skill (ρ) at maximum library size substantially exceeds the cross-correlation in both directions and is significant when compared to the surrogate hydrographs. At the same time, as the cross-map skill from head to flow is smaller than in the opposite direction, the results indicate a slightly weaker effect of flow on head. In terms of unidirectional link, Figure 3d shows GW head significantly driving SF without feedbacks, as only the cross-map skill (ρ) from flow to head significantly exceeds the cross-correlation at maximum library size although it is significant when compared to the stochastic replicates. Conversely, Figure 4c indicates SF unidirectionally affecting GW head, because only the cross-map skill (ρ) from head to flow is significant when compared to the cross-correlation and surrogate time-series. As mentioned in the Section 2, the direction of causation is contrary to the cross-mapping direction since only the causal variable can leave footprints on the manifold of the affected variable, making the backward cross-mapping possible.  Table 2, the interpretation of the CCM results in terms of the SF-GW interaction for each bore in the three study is presented as color-coded annotations within Figures 3-5 and 7. Figure 6 uses boxplots to show the significance of both the daily and monthly CCM results in relation to the null hypothesis that the results are driven by a common strong forcing or seasonality (p < 0.01). The blue boxplots represent the distribution of CCM results obtained from stochastic replicates of GW cross-mapping the original SF. The red boxplots show the results from surrogate time-series of SF cross-mapping actual GW. The green target circles show the CCM results obtained with the original hydrographs and their relative distance to the ones obtained with stochastic replicates. The boxplots for the daily and monthly analysis are grouped together for each bore, which are grouped by catchment. Figure 6a shows the results obtained for Brucknell Creek, while Figures 6b and 6c present the surrogate analysis for Sunday and Ford Creek, respectively. Figure 6d shows the surrogate analysis for three subperiods with different trends in the hydrographs and different climatic characteristics for two bores (S_bore1 and S_bore2) in Sunday Creek. Figures 6a-6c show that the monthly stochastic replicates tend to produce higher median cross-map skills, regardless of the direction of interaction, than the daily stochastic replicates, particularly in Brucknell Creek. Figures 6a-6d show that the daily CCM results are significant for most of the bores in all catchments when compared to the surrogate time-series. Figure 7 and Table 3 examine in more detail the role of long-term trends jointly with different climatic characteristics on the CCM results. Figures 7b, 7d and 7f show the original daily data emphasizing the three subperiods evaluated and the trends in the data. Figure 7(a) shows the daily CCM results for the first wet period of 1975-1985. Figure 7b presents the CCM results for the wet period just before the Millennium Drought, from 1985 to 1996. Figure 7c shows the CCM analysis for during and after the Millennium Drought, from 1996 to 2013. The daily results for S_bore1 are plotted with contiguous lines whereas the results for S_bore2 are shown For the monthly results, the reader is directed to Table S3 in Supporting Information S1. For the monthly results, the reader is directed to Table S3 in Supporting Information S1.
with dashed lines. As mentioned in Section 2, despite the time steps used in the analysis, the causation results relate to the average interactions over the period evaluated rather than to a specific point or subset of this period, therefore, relating to the perceptual understanding of predominant hydrological processes within the study areas over the period analyzed. Tables S2 and S3 in Supporting Information S1 show the daily and monthly CCM results, respectively, along with their statistical significance in the context of both cross-correlation and surrogate hydrographs.

CCM Results and the A priori Perceptual Understanding of Processes
The literature on CCM in hydrology remains incipient and, as such, this paper offers new insights into the potential of this method, hopefully contributing to the long-standing challenge of identifying dominant hydrological processes for a given catchment. At the same time, given the limited range of environments considered here, we would like to highlight the empirical nature of our findings. Not only the sample size needs to be increased, preferably using hundreds of more diverse and longer hydrographs, but detailed field monitoring should be implemented to confirm the results. Furthermore, the CCM technique itself is relatively young and in addition to the  Table 2 and the color-coded annotations within the figure (red = GW causes SF; blue = SF causes GW; green = bidirectional SF-GW interaction). For the monthly results, the reader is directed to Table S3 in Supporting Information S1.
limitations investigated and discussed in this study (Section 4.2), several aspects need further exploration, such as the effect of error and intermittency in the data.
Besides the limitations of the method, the daily and monthly CCM results (all bores in Figures 3-5) offer significant statistical evidence that suggests a robust estimation of the long-term average causal pathways in the study areas which would not be evident from correlation in our data (e.g., F_bore2 in Table 2). Moreover, the results are consistent with the a priori understanding of SF-GW interactions Specifically, they suggest that Brucknell Creek has stronger SF-GW interactions than in Sunday and Ford Creek, which is in accordance with the understanding that this catchment has more significant and persistent SF-GW processes that led to smaller shifts in RR ratio during the Millennium Drought.
The SF-GW causal links presented large variability in strength and direction among bores and between catchments, identifying three types of causal links. The variability in the causality results are in agreement with experimental work that suggests that SF and GW interactions are complex and take place with different strengths and directions in response to a multitude of factors (Barthel, 2014;Barthel & Banzhaf, 2016). In this study, the meteorological forcings, namely rainfall and PET, differed by only 16% and 12%, respectively, between Brucknell, Ford and Sunday Creek (see Table 1, Figures S2 and S3 in Supporting Information S1) and, therefore, might not be the major factors driving the variability in the SF-GW interactions between catchments. Conversely, the different topography (hilliness) and hydrogeology (aquifer type) of the study areas seem to be good indicators of the differences in the long-term average causal pathways in the catchments.
In order to explore the role of topography, Figure 8 shows the 1-s Multi-resolution Valley Bottom Flatness (MrVBF) of the catchment areas (Gallant & Dowling, 2003), which is a topographic index used to map depositional/erosional areas and infer valleys and hills. Zero MrVBF values indicate steep and erosional terrain, whereas larger values indicate increasingly larger depositional areas, which are often associated with deeper soils (Gallant & Dowling, 2003). Hence, in Figure 8   In terms of how the hydrogeologic settings impact the daily SF-GW interactions, by comparing the CCM results with the hydrogeology of the catchments ( Figure S4 in Supporting Information S1), one can notice that GW head in the regional porous and productive aquifer of Brucknell Creek appear to drive SF more strongly than in the local fractured and low productivity aquifers found in Ford and Sunday Creek, where SF eventually ceases during prolonged dry periods. When further cross-referencing the CCM results with the hydrologic characteristics of the catchments (Table 1), we see that the higher runoff ratios, stronger seasonality, and lower intra-annual variation of monthly flow occur in the catchment with greater influence of GW head on SF, namely Brucknell Creek. This reinforces the understanding that SF-GW interactions are more significant in this catchment.  (a) 1975-1985, (c) 1985-1996, and (e) 1996-2013, where there were remarkable trends in the hydrographs and climatic conditions were different (before, and during and after the Millennium Drought). (b) and (d) present the daily GW head hydrographs for bores S_bore1 and S_bore2, respectively. (f) Shows the daily SF of Sunday Creek at Tallarook and its moving average of 1,500 days to emphasize multiyear trends. For the interpretation of the results for each bore, the reader is directed to Table 3 and the color-coded annotations within the figure.
The variability in our results are in accordance with previous experimental studies that also found spatial heterogeneity in SF-GW interactions Cartwright & Morgenstern, 2015;Grayson et al., 1997;Rosenbaum et al., 2012;Seibert et al., 2003Seibert et al., , 2011, however, at the same time, our results offer complementary insights. For instance, Cartwright and Gilfedder (2015) mapped GW inflows to 62 km of the Deep Creek, southeast Australia, using Radon surveys and found that gaining sections of the river were in reaches closer to steep valleys or in the edge of its floodplain (close to hills with basement rock), suggesting that steeper and less permeable areas were the major source of baseflow. At first glance, this seems to contradict our findings. However, it is important to notice that our CCM results represent the long-term average causality of the GW dynamics of each bore to the catchment SF and, although our study found that bores in valley bottoms are more causal on SF, it does not invalidate the idea of steep and less permeable areas as sources of baseflow.
In fact, what it suggests is that GW head variations in steeper terrain affect the catchment SF less intensely than those in valley bottoms. If we consider that GW moves in a continuum from top to bottom of a hillslope until it discharges to the stream and, at the same time, assume that the underground formation of valley bottoms tend to be more porous and permeable, we can conceptualize that head fluctuations in the valley bottom represent larger displacements of water and, consequently, it would affect the SF more strongly.
Similarly, Bishop et al. (2011) identified a strong and spatially structured differentiation of local flow within a small till catchment, as during runoff events most of the SF was coming from the interflow in the more porous and permeable upper level of the soil. As our bores are all relatively shallow and have a shallow depth to water table, it can be that GW head in the valley bottoms are responsible for triggering strong interflow in areas where soils tend to be deeper and more porous.
Additionally, GW head in the valleys could also be linked to soil saturation levels across wider areas of the catchment and therefore it would significantly trigger different runoff generation mechanisms in the catchment (Ali et al., 2015;Hughes et al., 2012;Petrone et al., 2010;Saffarpour et al., 2016;Seibert et al., 2003). In this case, conceptually, higher GW levels in the valley bottoms would imply the occurrence of saturation excess runoff across larger areas of the catchment, potentially making it the major runoff mechanism. In other words, higher GW levels would not only be related to more baseflow but when close to the surface they could also influence saturation excess runoff and interflow. Thus, our results suggest the existence of a balance between catchment hilliness and villainess that yields the strongest SF-GW interaction. Therefore, not necessarily the river flowing through the steeper catchment will present the stronger contribution of GW to SF.
In summary, the CCM results suggests that (i) GW head in the regional porous/productive aquifer of Brucknell Creek is driving SF more strongly than in the local fractured aquifers of Sunday and Ford Creek and that (ii) there is significant variability in SF-GW causality strength and direction within a catchment. Bores in the valley bottoms tend to be more causal on SF, and this interaction can go either way (GW causal on flow, or vice versa). These insights also suggest that CCM could allow more granular and rigorous testing of perceptual models and  contribute to the structural design of hydrologic models and reduce the uncertainty of their projections (Anderson & Woessner, 1992;Beven, 1993;Blöschl et al., 2019;Gallart et al., 2007;Konikow & Bredehoeft, 1992;Oreskes et al., 1994).

Exploring Limitations of CCM
While the results described above are insightful, CCM has limitations that reportedly produce a high false positive rate when identifying causal links in hydrology (Delforge et al., 2022;Ombadi et al., 2020). Hence we have explored the role of seasonality, sampling frequency, and long-term trends in our results. The significance of these aspects on the CCM results are discussed in the sections below.

The Role of Seasonality
As mentioned in Section 4.1, the meteorological forcings (PET, rainfall) are not remarkably different between the study areas, however, together with SF and, to a lesser degree, GW head, they present remarkable seasonality ( Figures S1, S2, S3, and S6 in Supporting Information S1), which is problematic for causal inference studies as cycles are both deterministic and predictable and are therefore a likely confounding factor. In addition, CCM is a bi-variate causality framework and hence the identified causal links might be the result of a third (or more) strong common forcing, a hypothesis that is inevitable in hydrology when studying SF and GW interactions. Both SF and GW are forced by rainfall and PET (and their seasonality) and hence some degree of association might always be present. In fact, a synthetic study has shown that disconnected SF and GW reservoirs could always exhibit CCM convergence due to the common meteorological forcing (Delforge et al., 2022). This issue is known as the Moran effect or Reichenbach principle of the common cause (Moran, 1953;Reichenbach, 1956). Both the impact of seasonality and Moran effect on the significance of our results was evaluated by comparing the CCM results for the real data against those obtained for surrogate time-series that preserved the seasonality and shuffled the residuals of the original hydrographs. The results of this analysis are presented in Figure 6.
Regarding the null hypothesis of a third common cause, the considerable variability in causal direction and strength within and between catchments and the significance of the CCM results (Figures 3-7) suggest that it would not be a major issue, particularly for the daily analysis. If the CCM results obtained from the original hydrographs were mainly due to a strong common forcing, we would expect homogeneous results across all sites with CCM presenting convergence and significance in a similar fashion. This expectation derives from the fact that the catchments are relatively small and the climate forcing is fairly similar across them. At the same time, the fact that eight out of the twelve causal links in Table 2 are bi-directional raises the hypothesis that the variables could be strongly synchronized. With a strong common forcing or a strong driver (i.e., GW strongly driving SF), the dynamics of the forced variable could become subordinate to the forcing and CCM would indicate a bi-directional link when in fact the interaction is unidirectional. The lagged version of CCM (Ye et al., 2015) could help resolve this issue of synchrony, but it was outside the scope of this study that assumed a contemporaneous causality given the higher cross-correlation at zero time lag in our data.
Similarly, one can notice that the cross-map skill of both real and surrogate hydrographs was generally higher in Brucknell Creek (Figure 6a) than in Sunday and Ford Creek (Figures 6b and 6c), even more so when looking at the monthly CCM results, when seasonality becomes more pronounced in the hydrographs (Figures S1 and S6 in Supporting Information S1). In fact, the results in Figures 3-6 suggest that the monthly analysis often produces cross-map skills higher than for the daily data. This indicates that increased variance captured by the seasonal cycle can strengthen the SF-GW interaction suggested by CCM. Therefore, it is possible that the differences in cross-map skill between catchments is related to the different degrees of variance explained by seasonality, particularly the higher seasonality of SF and GW head in Brucknell Creek (Table 1, Figure S6a in Supporting Information S1), and not to different processes that emerge from the topography and hydrogeology of the study areas. Ultimately, seasonality hinders the significance and the physical interpretability of the causal links suggested by the CCM analysis.

The Role of Data Length and Sampling Frequency
The issue of data length in CCM was explicitly explored by Ombadi et al. (2020) which concluded that CCM is applicable to short daily time-series (<100 days). The results from Delforge et al. (2022) also converge to the same conclusion and suggest that convergence is not worth checking for daily hydrological time-series. However, as mentioned, DST assumes infinite, noise and error free datasets, even though different authors suggest different lengths of data as the minimum to properly reconstruct a system manifold (Kurths & Herzel, 1987;Nerenberg & Essex, 1990;Procaccia, 1988;Smith, 1988;Sugihara et al., 2012). Moreover, the dynamics of the system can vary according to the time scale.
Therefore, longer observation time scales, as in the case of our monthly time-series, could affect not only the significance of the results in relation to both stochastic replicates and cross-correlation (due to shorter time-series), but also the ability to properly reconstruct the state-space that underpins the method. Although the general convergence of our CCM results suggest that short daily hydrological time-series are likely appropriate (Delforge et al., 2022;Ombadi et al., 2020), some monthly curves did not finish converging, particularly, those for GW cross-mapping SF in bores S3, S4, F1, F2. Even though these are significant convergence values, they illustrate the effect of data length and the spatial variability of convergence rate from site to site. On the same note, difficulties to unfold the dynamics of hydrological processes at longer time scales is reported in the literature. For instance, the seminal work of Rodriguez-Iturbe et al. (1989) managed to unfold the chaotic dynamics of a rainfall event using 1990 data points of 15 s interval but the same was not possible for 148 years of weekly rainfall data (around 7,696 data points), even though the authors suggested the possibility of a weekly chaotic behavior in a large dimensional space.
The issue of sampling frequency and data length can be exacerbated by the application of CCM to the entire SF hydrograph instead of using only the baseflow. The higher dimensionality or randomness of the daily flood hydrograph can be harder to cross-map from GW head because of its smoother and more deterministic behavior, possibly even more so due to the paucity of the observed GW head and the use of interpolated data. These aspects could be causing the CCM assimetry between the monthly and daily analysis, that is, monthly data generally producing higher cross-map skills. Using baseflow hydrographs could mitigate this issue, however, not only it is hard to estimate baseflow (Su et al., 2016) but it directly represents how much of the SF is coming from GW which can defeat the purpose of using a casual inference method. Additionally, GW head dynamics are potentially linked to runoff processes, for instance, by activating saturation excess runoff, muting the response to rainfall and possibly contributing to changes in RR states. Therefore, applying CCM to the entire SF hydrograph tests the link between GW head and all these processes, not only low-flow dynamics. Anyway, baseflow filters can and should be used to complement and interpret the CCM results, as done in this study. To summarize, the issues of sampling frequency and data length remain sources of uncertainty in this study and highlights the need to have longer, better, and diverse hydrological monitoring to substantiate insights derived from CCM analysis.

The Role of Long-Term Trends and Climatic Conditions
To examine the role of long-term trends jointly with different climatic conditions, Figure 7 and Table 3 show the daily CCM analysis for bores S_bore1 and S_bore2 in Sunday Creek for three subperiods emcompassing different climatic conditions and different trends in both SF and GW head hydrographs. Figure 7 also shows the original data emphasizing these subperiods.
For the first period, from 1975 to 1985 (Figure 7a), the GW head hydrographs for both bores and the SF had little trend during a relatively wet period. Even though the trends were similar in this period, the CCM results were quite different for the two observations bores. For S_bore1, the CCM analysis indicates a unidirectional causal link where SF was driving GW head, which corroborates with the signatures typical of strong recharge events that are present in the GW head hydrograph. Whereas for S_bore2, CCM indicates a relatively weaker bidirectional causal link that supports the smoother and more seasonal characteristics of its GW head hydrograph.
For the second period, from 1985 to 1996 (Figure 7b), the GW head hydrographs presented remarkably different trends, where S_bore1 displayed little trend while S_bore2 experienced a strong upward trend. In this period, SF had little trend while the climate was relatively wet in comparison to the subsequent Millennium Drought. Although S_bore1 and S_bore2 had very different trends in their hydrographs, the CCM results were rather similar and indicated a unidirectional causal link where GW head was driving SF, suggesting a shift to a dominantly gaining stream before the Millennium Drought (van Dijk et al., 2009).
For the last period, from 1996 to 2013 (Figure 7c), hence during and after the Millennium Drought, the GW head hydrographs for both bores and the SF presented similar downward trends, which would change only in 2010 with the end of the drought. Both bores presented similar trends in the hydrographs and the CCM results indicate a bidirectional interaction between SF and GW head, where SF drives GW head more strongly in S_bore1 than S_bore2. Potentially, S_bore1 had a stronger connection to the stream during the recharge event observed in 2010 toward the end of the drought. However, this contradicts the sharper GW rise observed in S_bore2 during the same period.
As mentioned, pronounced trends can be problematic for CCM as it violates the recurrence hypothesis that underpins the idea of identifying causality by checking if shadow manifolds behave consistently while revisiting similar states. Hence, the differences between the CCM results for the entire observation period and for the three subperiods could primarily be caused by different trends in the data. Trends can also produce artificially high cross-map skills because of temporal autocorrelation rather than recurrence, which was not controlled in this study with a Theiler window (Theiler, 1986). This parameter (exclusion radius in the rEDM package) ignores from the universe of possible nearest neighbors the points whose time indices are too close to the predicted. As Sunday Creek has generally lower cross-map skills than in Brucknell and Ford Creek (Figures 3-,5 and 7), it seems that trends in its GW data ( Figure S5 in Supporting Information S1) are potentially diminishing CCM's predictive skill. To help solve this conundrum, a Theiler window could be systematically applied to investigate the role of autocorrelation and the current data could be detrended before the application of CCM. However, this could make the results questionable as we know that trends in nature are real (Gleick, 1987) and filtering the data can change its dimensionality (Badii et al., 1988;Chennaoui et al., 1990;Jayawardena & Lai, 1994;Mpitsos et al., 1988). Hence, the application of CCM in Sunday Creek would require longer time-series to confirm whether the trends are part of a long time scale dynamics and would disappear into the future.
At the same time, one can look at the results in light of the different climatic conditions during the three subperiods. Particularly, the CCM results suggest weaker SF-GW interactions, regardless of the direction, during and after the Millennium Drought (1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013) than in the previous wet periods. At first glance, this seems inconsistent with the general understanding that GW (baseflow) should contribute more significantly to SF during droughts. However, this long drought seems have strongly affected both GW and SF  and produced generally lower SF-GW connectivity triggered by declining GW levels and soil moisture (Grayson et al., 1997;Hughes et al., 2012;Petrone et al., 2010). This behavior would be consistent with the observation of similar SF hydrograph shapes (Trotter et al., 2022) and lower RR ratios during and after the Millennium Drought (Peterson et al., 2021;Saft, Peel, Western, Perraud, & Zhang, 2016;Saft, Peel, Western, & Zhang, 2016). Additionally, between 1975 and1985 the GW head in S_bore1 and S_bore2 presents remarkably different trends, if not opposite, while the CCM results are quite similar. This raises the question of whether the long-term trends are strong enough to be impacting the results.
In summary, although it is possible to interpret the CCM results in light of the different climatic conditions, care must be taken while physically interpreting the results because of the impact of long-term trends in CCM. Thus, future research is necessary to investigate the role of trends using a larger and diverse sample of time-series with long trends under different climatic conditions while systematically applying a Theiler window.

Concluding Remarks
We have examined the potential and some limitations of CCM to identify causal links between SF and GW head in three upland catchments located in different hydrogeologic settings using multiyear daily and monthly time-series encompassing both wet and dry periods. Overall, this study shows that CCM has the potential to unlock valuable causal information from SF and GW head hydrographs that would not be evident from cross-correlation in our data. It allowed to test an a priori perceptual understanding of processes in the study areas and offered additional insights about hydrological processes, both spatially and temporally. Specifically, the CCM results suggest that: (a) GW head in the regional porous/productive aquifer is driving SF more strongly than in the local fractured aquifers; (b) there is significant variability in SF-GW causality strength and direction within a catchment; (c) Bores in the valley bottoms tend to be more causal on SF, and this interaction can go either way (GW causal on flow, or vice versa); (d) SF-GW interactions were weaker during and after the Millennium Drought than in the previous wet periods possibly due to weaker hydraulic connectivity. This is important to understand catchment dynamics as CCM offered more granular insights that would not be evident from cross-correlation. Given the often-limited resources for environmental monitoring, we argue that commonly available monitoring data and CCM can help to guide and optimize the implementation of detailed fieldwork for different purposes, for instance, in water contamination studies. At the same time, care must be taken while physically interpreting the causal links suggested by this method, as CCM has limitations that can lead to spurious causality.
This paper investigated how the CCM results were affected by seasonality, data frequency, and long-term trends.
Besides the use of surrogate data to test our results more rigorously, these aspects impact both the significance and the physical interpretability of the causal links. This corroborates with the high false positive rate of CCM in the hydrology literature (Delforge et al., 2022;Ombadi et al., 2020). Therefore, future research is needed to further investigate these and other issues, such as the role of data error and intermittency, ideally using a large and heterogeneous sample of catchments and time-series. Ultimately, it is hard to prove causality from statistical methods and they should be coupled with detailed field monitoring to further substantiate the relationships suggested by CCM or other CIMs.

Perspectives on CCM in Hydrology
Finally, the application of dynamic system theory to identify causality in hydrology remains limited. Therefore, we would like to touch upon some perspectives that shall represent exciting interdisciplinary research opportunities, for instance, among hydrologists, mathematicians, and meteorologists.
6.1. How Does CCM Compare to Other Causal Inference Frameworks to Infer SF-GW Causality at Different Climatic, Hydrogeologic, and Topographic Settings?
CCM was particularly developed to account for state-dependent nonlinear interactions (Runge et al., 2019;Sugihara et al., 2012), however, it can present high false positive rates (Delforge et al., 2022;Ombadi et al., 2020). Therefore, using large and diverse datasets of catchments/hydrographs which could be clustered according to hydrological similarity (Papacharalampous et al., 2021;Rinderer et al., 2019), how would the CCM results compare with other nonlinear CIMs, such as Transfer Entropy (Schreiber, 2000) and multivariate PC algorithm (Spirtes et al., 2000)? Would the causal links remain attributable to differences in hydrogeology, topography, and climate?

How Sensitive Is the Significance of the CCM Results to the Stochastic Replication Technique and Other Adjustable Parameters of CCM?
Here we used standard CCM and a stochastic replication technique in rEDM; however, using a global sensitivity analysis approach, future research should investigate the sensitivity of the results to stochastic techniques commonly used in hydrology and also to adjustable parameters of CCM, such as the prediction horizon (lagged version of CCM), the embedding delay (tau), and the Theiler window to help understand confounding factors (e.g., seasonality, strong synchrony) and the role of data autocorrelation.

What Is the Role of Seasonal Dynamics in the Long-Term SF-GW Interactions?
Given the complex RR response in the study areas during the Millennium Drought, we analyzed the average long-term causal interactions between SF and GW by using the entire time-series. However, a strong seasonal dynamic shall be present in the long-term interactions. This could be further investigated with a CCM analysis focused on different seasons. For instance, is GW more significant to SF in dry months or in the wet season?

What Is the Role of Droughts and Persistent Shifts in Rainfall-Runoff on the CCM Results?
We analyzed three catchments during and after the Millennium Drought. From our daily CCM results, the catchments with higher shifts in runoff where those with the lower SF-GW interactions. However, there are many more catchments in Australia that shifted and have been presenting different recovery behaviors. Therefore, looking at a larger sample size, could the shift in runoff and persistence in a lower state affect the CCM results due to long-term trends in the hydrographs?

Conflict of Interest
The authors declare no conflicts of interest relevant to this study.

Data Availability Statement
Streamflow and groundwater data used in this study were from the Water Measurement Information System of DELWP, https://data.water.vic.gov.au/. Climate data were from the AWAP project, www.bom.gov.au/jsp/awap/. The computational resources used in the data analysis were kindly provided by Spartan, the high-performance computational cluster of the University of Melbourne (Lafayette et al., 2016). We thankfully acknowledge the valuable comments from five anonymous reviewers, whose feedback greatly improved the manuscript.