A hybrid empirical and parametric approach for managing ecosystem complexity: Water quality in Lake Geneva under nonstationary futures

Significance This paper develops a hybrid approach to account for the complex interactions affecting lake water quality and its management in a nonlinear, changing world. The approach uses data to leverage our first-principles understanding of the mechanisms operating on dissolved oxygen in the lake. This yields a manageable, but more complete systems perspective for environmental management of the lake under climate change, where our analysis suggests that multiple modes of intervention may be necessary to achieve a healthy lake.


Supplementary Information Text
Additional detail on the history of Lake Geneva.
The history of Lake Geneva is an iconic example where eutrophication from runoff gave way to partial recovery through control of anthropogenic phosphorus inputs, but where the extent of recovery has not met expectation. Although lake-averaged TP has been reduced by a factor of 6 from up to ~90 μg L -1 in 1980 to ~15 μg L -1 today ( Figure 1B), this TP reduction did not result in decrease in lake averaged CHL or change in oxygen depletion rate (1).
The complex response to re-oligotrophication is tentatively explained by an interdependent intersection of chemical, biological and physical processes, some well documented, some more hypothetical. These are graphically summarized in Figure S1. First, TP internal loading from the sediments complicates any attempt to simply correlate TP external loading and lake averaged TP (2, 3) despite earlier seminal work suggesting a simple first-order relationship (4). Second, ecological observations during the re-oligotrophication period suggest significant changes in phytoplankton composition and productivity (5,6), zooplankton community structure (7) and the fish community (8)food-web alterations thought to result from complex, coupled effects of top-down and bottom-up controls (9). Third, increasing air temperature (the most clear effect of climate change for the Lake Geneva area) contributes to modify the thermal structure and study (10) suggests also promotes less edible harmful cyanobacteria. The change in thermal structure also leads to prolonged duration of the summer stratified period and decrease in the frequency of deep mixing event (11,1,12).

R-Markdown Code
Below are the key steps of data processing and computation for reproducing EDM and hybrid-model calculations in the main text. The full materials for calculations are available at the GitHub URL listed in the text, https://github.com/SugiharaLab/Geneva_Hybrid. In several cases we have condensed the code by calling functions to perform auxiliary tasks like data curation and complex plotting. These are contained in the helper scripts sourced below, but need to be retrieved from the GitHub by readers who desire to reproduce the full plots.
Also, please note we have made liberal use of the pipe operator "%>%" from the magrittr package. This operator takes the left-hand side as the first argument to right hand side, so "x %>% f" is equivalent to "f(x)" and "x %>% f(y,z,…)" is equivalent to "f(x,y,z,…)".
This project was completed using Version 0.7.3 of rEDM. Since there were substantial changes to syntax with the advent of rEDM 1.0, it is easiest to simply install the old version in a local "lib" folder for this project. This will require most machines to have Rtools installed in the default location.  We additionally source a set of help functions to neaten plot creation.

load(data_file)
This yields the following variables for our primary analysis:

names(data_lake_geneva)
## [1] "date" "oxydeep" "T_surf" "T_bot" ## [5] "T_delta" "h_mix" "Ptot_lake" "Ptot_epi" ## [9] "PO4_lake" "PO4_epi" "chl" "wind_speed" ## [13] "T_air" "rhone" "T_surf_model" "h_mix_model" Finally, the hybrid model contains an EDM predictor for DO over six-month intervals. To do this, we need to expand on the basic tools of the rEDM package and define a function that performs iterative multivariate EDM forecasts of DO. We define equivalent functions for both simplex and S-map EDM predictors that have the same general structure. The S-map predictor is used for historical analysis, but due to its higher computational overhead, iterative forecasting for the hybrid model was done with the less computationally intensive simplex predictor. We also calculate cross-map skill for the dynamic biogeochemistry variables, chl and TP_surf.
These results are the basis for the table panels of Figure S4.

Short-term Multivariate forecasting
To perform short-term multivariate EDM forecasting analysis, we define functions to make calls to the rEDM function block_lnlp() to perform the short-term multivariate forecasting analyses across lists of specificed embedding coordinates.

State-Dependent Interaction Coefficients
These results are the basis for Figure 2. First, we examine the S-map approximations of the local linear coefficients.

Hybrid Modeling
Having selected a set of embedding variables to define our multivariate EDM model for DO, we can now write the EDM predictor module that goes inside of the hybrid model. It is a function that calls in the training data, the data representing the simulation scenario, the time we are initializing the forecast, the initial condition of oxygen, and the length of time (in months) to perform the iterative forecasts.

Historical predictions
The hybrid model is first run retrospectively. These results are used for Figure 4a

Scenario Exploration
To perform the long-term exploration of future scenarios with the hybrid model, first we import archived runs of Simstrat for atmospheric warming scenarios. Code that integrates system calls to Simstrat execution is provided in the Github, but requires independent installation of the Simstrat software.   Figure S1: Illustration of the food web rearrangement in Lake Geneva documented in previous studies prior and after re-oligotrophication . Most significant to this study, the food web rearrangement includes changes in the pelagic food web functioning that controls organic matter exports that ultimately condition the rate of deep oxygen consumption. (1) Prior to reoligotrophication in the 1990s, the phytoplankton of Lake Geneva was dominated by small pelagic algae (mostly diatoms, Cyclotella spp. (14). Since re-oligotrophication, Mougeotia gracillima, a non-toxic filamentous macroalgae, has increasingly dominated the phytoplankton composition (15). M. gracillima is considered as inedible for most zooplankton species (16). M. has decreased in size (19) and abundance (9,19). (5) The main zooplanktivorous fish, Coregonus lavaretus, has undergone large increases in abundance (20). (6) Finally, in Lake Geneva, exported organic matter, with an export ratio of ~34%, is mostly autochthonous (21).
The consequence of all these changes are inferred to be a decoupling of surface chlorophyll from nutrient concentrations, and a greater fraction of primary carbon exported to deep water and sediments.  When =0, all weights are equal, but as  increases to large values (shown here up to =10), the regression is increasingly sensitive to just the immediate neighborhood of the target. In this way, the S-map regression approximates the local linear dynamics on the trajectories in statespace, i.e. the system Jacobian. Although the example we show here was designed to fit in 3dimensions, the weighting on Euclidian distance works just as well in higher dimensions (e.g. the S-map analysis in main text Figure 1D includes up to 7 dimensions). inputs of oxygenated river water, and consumption at depth due to biogeochemical activity.
Anthropogenic forcing is split into 2 scenarios (orange arrows), increases in air temperature related to climate change that directly affect the Simstrat component and another for change in the watershed use that modify the nutrient loading (i.e., re-oligotrophication) that directly affect the EDM component. The climate change scenario is used to force the equation-based model (Simstrat) which estimates the evolution of the lake thermal structure and outputs mixed-layer height (hmix) based on the depth of the thermocline (grey arrow). This drives the simple 2-box model. When hmix exceeds 250m, increase in oxygen from mixing is estimated based on the fraction of the hypolimnion waters above hmix. Historically, this only occurs in winter months. In winter months without deep mixing, water discharged from the Rhone river will sink to the bottom. With or without mixing, the estimated lake thermal structure (hmix) as well as the reoligotrophication scenario then serve as forcing parameters for the equation free model (EDM) estimating the rate of DOB change over 6-months, encapsulated in Figure 3. Figure S5: Identifying mechanistic embeddings for other dynamic biogeochemistry variables.
First, convergent cross-mapping is used to distinguish possible drivers from CHL (top) and TPsurf (bottom). Variables are the same as in Table S1, with the addition of a seasonal index, sin(), where  = 2(day of year)/365. To generate a minimum dimensional explicit embedding for predicting CHL and TPsurf dynamics under management scenarios, we then use a greedy search approach on the most relevant drivers. In each case we begin with the two drivers at the core of management questions, lake temperature in the epilimnion Tsurf(t) output from the Simstrat model and total phosphorus across all depths TPlake(t), then try adding other variables with significant evidence of causal association based on time-lag CCM analysis (A,C). To predict monthly dynamics of CHL, we find that adding the seasonal cycle, sin(v) where v is the celestial angle of the Earth from perigee tracking insolation I(t) to the S-map regression on the drivers Tsurf(t) and TPlake(t), improves forecasting, but additional drivers do not. To predict monthly dynamics of TPsurf, we find the same. These models can be roughly interpreted as nonstationary seasonal cycles where oscillations depend on nutrient loading and temperature.
In both cases, these embeddings out-perform univariate and simple stationary seasonal models (B,D). The multiview forecast skill (22) is also included as a useful comparison; multiview is a model averaging procedure for obtaining highly accurate EDM predictions in exchange for clear (mechanistic) interpretation. The multiview forecast skill (which does not depend on a nonlinear tuning parameter) can be thought of as an estimate on the upper bound predictability possible from low-dimensional nonlinear analysis of these variables. Note that for predicting TPsurf(t + 1mo), a small additional improvement in forecast skill is possible by also including the 0 th univariate time lag, i.e. TPsurf(t). Figure S6: Identifying robust set of drivers for explicit ("mechanistic") empirical dynamic model for predicting month-to-month changes in DOB when previous DOB is included as an EDM variable. Near-term (1-month) forecast skill of DOB is high due to fundamental autocorrelation in the time-series, which makes distinguishing between embeddings difficult once DOB itself is included as a variable (in contrast to Fig 1D). If we apply iterative forecasting with S-map, however, difference in skill becomes more apparent. For DOB, we find that eliminating the exogenous physical variables of Rhone discharge (Q) and atmospheric temperature (Tatm) leads to more robust forecast skill through time, while eliminating any of the biogeochemical variables leads to decreased performance. Consequently, we focus S-map coefficient and hybrid modelling efforts on the embedding TPlake(t), TPsurf(t), hmix(t),Tsurf(t), CHL(t), DOB(t) to predict changes in DOB through time.