Predicting water quality from geospatial lake, catchment, and buffer zone characteristics in temperate lowland lakes

(cid:129) Lake water quality can be predicted with machine learning and geospatial predictors. (cid:129) Buffer zone geomorphology metrics are predictors of eutrophication related vari- ables. (cid:129) Landscapehistoryandcatchmentsoiltype are in ﬂ uential on alkalinity and pH. (cid:129) Lakesurfaceareaisamastervariablewith strong in ﬂ uence on pH, color, and pCO 2 . (cid:129) National upscaling can be improved using water quality estimates for > 180.000 lakes. the predictions we had generated for the remaining seven variables. This improved model performance (R 2 = 0.45 – 0.78). Overall, the uncovered relationships were in line with the ﬁ ndings of previous studies, e.g., total nitrogen was positively related to catchment agriculture and chlorophyll a, Secchi depth, and alkalinity were in ﬂ uenced by soil type and landscape history. Remarkably, buffer zone geomorphology (curvature, ruggedness, and elevation) had a strong in ﬂ uence on nutrients, chlorophyll a , and Secchi depth, e.g., curvature was positively related to nutrients and chlorophyll a and negatively to Secchi depth. Lake area was a strong predictor of multiple variables, especially its relationship with pH (positive), pCO 2 (negative), and color (negative). Our analysis shows that the combination of machine learning methods and geospatial data can be used to predict lake water quality and improve national upscaling of predictions related to nutrient and carbon cycling.


Introduction
Lakes and wetlands provide a diverse range of essential ecosystem services, e.g., biodiversity, carbon sequestration, food production, water supply, and recreation (Janssen et al., 2021;Peterson et al., 2003). Lakes only cover approximately 3 % of the global surface area (Pekel et al., 2016), nevertheless, lakesand small lakes in particularhave disproportionately high greenhouse gas (GHG) emissions of CO 2 and CH 4 (Holgerson and Raymond, 2016) and support high biodiversity (Biggs et al., 2017). Freshwater habitats face multiple threats due to human activity, including reclamation, habitat degradation, and eutrophication (Moreno-Mateos et al., 2012;Riis and Sand-Jensen, 2001). Accurate predictions of essential water quality variables in lakes across large scales and understanding the relationships with important drivers can pave the way for better management strategies (Read et al., 2015). Furthermore, the ability to make local, lake-level predictions may reduce the uncertainty associated with the upscaling of GHG emissions (Martinsen et al., 2020a) and other important processes. Here, we used country-level data to investigate the ability of machine learning models to predict eight essential water quality variablesalkalinity, pH, total phosphorus (TP), total nitrogen (TN), chlorophyll a, Secchi depth, color, and the partial pressure of carbon dioxide (pCO 2 )from readily available geospatial data.
Several water chemical variables can be used to evaluate the state and quality of freshwater ecosystems (Bhateria and Jain, 2016;Kalff, 2002). These variables are routinely measured in monitoring programs in many countries to determine the ecological quality and function of lakes. This is especially the case for the major nutrients, nitrogen and phosphorus (Stanley et al., 2019), which are closely connected to biodiversity (Jeppesen et al., 2000), phytoplankton development (Kalff and Knoechel, 1978), GHG-emission (Beaulieu et al., 2019;Huttunen et al., 2001), and water clarity (Jeppesen et al., 2000). The nutrient state influences lake primary production and, in turn, the variation in pH and pools of inorganic carbon species, e.g., pCO 2 and alkalinity (Kragh and Sand-Jensen, 2018;Trolle et al., 2012). The relationship between nutrients and production is modulated by the amount of available light (Krause-Jensen and Sand-Jensen, 1998), suspended particles, and colored dissolved organic compounds (Kirk, 1994). Thus, an understanding of lake water quality is obtainable from water samples and measurements of a restricted set of lake variables. However, this approach does not scale well to larger regions and unvisited lakes. High degrees of temporal and spatial variation intensify the challenge of predicting water quality variables in lakes.
Lakes are influenced by their surroundings, both the immediate surroundings (buffer zone) and the topographical area (catchment area, watershed, basin, upslope area, etc.; the term catchment is used onwards) that delivers water to the lake (Jeppesen et al., 1999;Staehr et al., 2012). On its way to the lake as either surface water or groundwater, water chemistry is influenced by land use, geology, soil type, natural vegetation, and human activity (Marx et al., 2017;Rapp et al., 1985;Read et al., 2015). The strength of the relationship between lake chemistry and catchment conditions is modulated by the speed at which water travels through the landscape (Fraterrigo and Downing, 2008;Smits et al., 2017) and the influence of internal lake processes. The presence of streams may reinforce the relationship between catchment and lake chemistry by offering efficient transportation (Abell et al., 2011;Nielsen et al., 2012). The shape of the catchment landscape can be described using geomorphological variables that can be computed from a digital elevation model (DEM; Hengl and Reuter, 2008) and is often used to predict quantities such as water occurrence (Lidberg et al., 2020) and soil composition (Hengl et al., 2014).
Several attempts have been made to quantify the relationship between lake water quality and spatial catchment or buffer zone characteristics (Nielsen et al., 2012;Toming et al., 2020). In-lake concentrations of the major nutrients, nitrogen and phosphorus, increase with agriculture intensity in buffer zones and catchments (Arbuckle and Downing, 2001;Taranu and Gregory-Eaves, 2008). The ability to predict a major limiting nutrient such as phosphorus prompts an expectation that closely related water quality measures such as Secchi depth and chlorophyll a levels are predictable from similar sets of variables. Carbonate or silicate mineral levels in the catchment are, as expected, strong predictors of the level of chemical weathering products in streams and downstream lakes (Marx et al., 2017). Similarly, pCO 2 in streams is predictable from catchment characteristics (Lauerwald et al., 2015;Martinsen et al., 2020a). However, parallel relationships for inorganic carbon and nutrients in streams might not be readily transferable to lakes, as the pronounced increase in residence time and, consequently, higher influence of in-lake metabolism, may modify the relationships (Hotchkiss et al., 2015;Martinsen et al., 2020b). Potentially, some of this variation can be accounted for by including variables related to lake bathymetry, which in turn may be related to the slope and shape of the landscape in the buffer zone surrounding the lake (Messager et al., 2016). Despite the expectation that a lake's water quality and its buffer zone or catchment are closely connected, the relationships are not simple and their effects might be outweighed by the pronounced spatial and temporal variation. Furthermore, non-linear relationships and interactions among predictor variables may challenge linear approaches to assessing lake water quality.
Applying methods and models from the field of machine learning could alleviate some of these problems (Olden et al., 2008). These models are trained to minimize prediction error on new observations, i.e., observations not used for training the model, with the goal of maximizing the ability of the models to generalize. They are well-suited for tasks with many observations and predictor variables and can handle complex relationships and inter-correlation (James et al., 2013). The models can be viewed as flexible, functional approximators that capture the relationships between the response and predictor variables (Breiman, 2001). This approach contrasts with traditional statistical modeling, e.g., the family of linear models, where the functional relationships are specified upfront and performance is assessed with the data that had been used to fit the model.
It could be argued that the complexity of some machine learning models may result in 'black-box' models. However, several techniques exist to identify influential variables and assess their relationship with the response variable, making machine learning models a tool for both improving generalization and uncovering important drivers (Molnar, 2019). Therefore, machine learning appears to be a tractable option for improving our ability to predict lake water quality across a wide range of different lakes (Read et al., 2015). Predictions of water quality for a national collection of lakes may both yield estimates for unvisited lakes and act as a reference for ongoing monitoring. That is, the predictions can be used as readily available predictor variables in future modeling efforts of water quality, large-scale nutrient and carbon cycling, and biodiversity Domisch et al., 2015;Shen et al., 2020).
In this study, we set out to test how well the levels of eight water quality variables can be predicted using geospatial predictors and machine learning models. This is done using lake monitoring data from up to 1054 Danish lakes, collected during the last 20 years, as the response variables and a large collection of readily available predictor variables at the lake, buffer zone, and catchment scale. Specifically, we hypothesize that: 1) the predictability of water quality variables can be improved by including a wider range of predictor variables; 2) complex relationships between response and predictor variables can be modeled using flexible machine learning models; and 3) similar or closely connected water quality variables share important drivers. The overall aim is to produce a country-level dataset with predictions of the eight water quality variables for all 180,378 Danish lakes.

Study region
We used water quality data collected as part of the Danish national monitoring program to train machine learning models to use predictor variables at the lake, buffer zone, and catchment level, and used these models to make predictions for Denmark's 180,378 mapped lakes. Despite Denmark's small area (approx. 43,000 km 2 ), the lakes span a large gradient in water chemistry due to the variable influence of the last glacial period (Weichselian glaciation). At their maximum extent, glaciers covered only parts of Denmark (Fig. 1), resulting in large differences in catchment geology between the eastern (glacier-influenced) and western parts of the country. This makes Denmark a suitable study area for investigating the influence of catchment geology and land use on lake water quality and it is representative of the lowland North-temperate regions. The mean annual air temperature is 8.1°C and annual precipitation is 704 mm (Fick and Hijmans, 2017 We used publicly available data from the national surface water monitoring program (MFVM and DCE, 2021) to calculate annual averages of eight key water quality variables: Alkalinity, pH, total phosphorus and nitrogen, chlorophyll a, Secchi depth, color, and pCO 2 . These eight variables were selected because they are important for the ecological quality (nutrients, chlorophyll a, Secchi depth, and color), involved in carbon cycling (pCO 2 , alkalinity, and pH) or predictors of biodiversity (several). Furthermore, measurements of these variables were available for a wide range of lakes with good seasonal coverage. Color is affected by the quantity of humic compounds or the 'colored' fraction of the dissolved organic matter pool. The water quality variables were measured using standard methods, and pCO 2 was calculated from alkalinity, pH, and water temperature using the seacarb R-package (Gattuso et al., 2021). The investigated lakes generally have high carbonate alkalinity (Table 1), which reduces the potential influence of organic alkalinity on the estimation of pCO 2 (Abril et al., 2015;Liu et al., 2020). To further reduce the potential bias, and similar to other studies (Hastie et al., 2017), we excluded observations with pH below 5.4. However, the data generally vary in both temporal and spatial sampling intensity, and thus required some preprocessing to calculate a robust annual average for each lake and response variable.

Interpolation and annual averages
We used data from a 20-year period (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019). We used only surface water samples (during water column stratification) or mixed samples (no stratification). We calculated the monthly median for each lake and month, excluding lakes with less than four monthly values. To interpolate values for missing months, we fitted generalized additive models (GAM) for each water quality variable. The term 'month' was modeled as a cyclic cubic spline to make December and January line up, with 'lake' as a random effect, using the mgcv Rpackage (Wood, 2017(Wood, , 2011. For the GAM analysis, the response variables were log 10 (x + 1) transformed. The model fits a global shape of the intra-annual variation with a different intercept for each lake. All models showed a good fit, with R 2 values ranging from 0.56 to 0.96. The fitted GAM were then used to fill the gaps in the annual time series. Finally, we calculated the annual average for each response variable resulting in data from 924 to 1054 lakes (1054 unique lakes; Table 1).

Catchment delineation
We delineated the topographical catchment of 180,377 lakes (one of the country's 180,378 mapped lakes could not be delineated), including catchments of the 1054 lakes whose water quality data were used in this study. Lake polygons and streamlines for Danish lakes are publicly available (SDFE, 2021). We used a publicly available high-resolution digital elevation model (DEM; 1.6 m resolution) based on LiDAR data from a national survey conducted in 2007, with some hydrological corrections added subsequently (SDFE, 2021).

DEM preprocessing
To ease the computations, we first delineated drainage sub-basins that were then used as units for further hydrological processing. Sub-basins were delineated based on an aggregated (10 m resolution) version of the high-resolution DEM. The DEM was hydrologically corrected by 'breaching', which carves through obstacles to enable flow routing. This contrasts with 'filling', which raises the elevation within depressions that would otherwise impede flow routing (Lindsay and Creed, 2005). Following this, sub-basins were labeled using the algorithm described in Barnes et al. (2014a). The breaching and labeling algorithms are part of the the RichDEM Python/C++ package (Barnes, 2016). The delineated subbasins were then merged iteratively with smaller neighboring sub-basins using GRASS GIS (Neteler and Mitasova, 2013) and further enlarged using a spatial buffer zone of 1000 m to avoid edge artifacts. This resulted in 114 sub-basins and 164 islands for further hydrological processing.

Lake catchments
We split the high-resolution DEM based on the delineated sub-basins and performed hydrological corrections by breaching as described above, flat resolution (Barnes et al., 2014b), and, finally, assignment of flow directions (deterministic-eight method; O' Callaghan and Mark (1984)). These methods are also part of the RichDEM library. The topographical catchment for each lake was delineated by identifying all grid cells that contribute flow to the lake, using the NumPy and Numba Python libraries (Harris et al., 2020;Lam et al., 2015). Finally, catchment polygons were simplified, retaining approximately 10 % of their constituting points to speed up the extraction of summary statistics. This geospatial analysis also relied on other software libraries such as GDAL (GDAL/OGR contributors, 2021) and the sf (Pebesma, 2018), exactextractr (Baston, 2021), and raster (Hijmans, 2021) R-packages.

Predictor variables
We extracted data for 132 predictor variables at the lake (L), buffer zone (B), and catchment (C) levels from a wide range of geospatial data sources (Table S1). The lake level includes attributes based on the lake polygons, e.g., area and shoreline length. The catchment level includes summary statistics of land use (Corine Land Cover: Bossard et al., 2000), geology (Pedersen et al., 2011), geomorphology (DEM and digital surface model derivatives, e.g., slope, aspect, and curvature (Horn, 1981;Zevenbergen and Thorne, 1987), and climate (Fick and Hijmans, 2017). For each lake, we also calculated summary statistics of geomorphological variables within buffer zones of different distances (50, 100, and 250 m). We expect that catchment characteristics are useful for predicting lake water quality, though this relationship might be influenced by, in this case, unknown lake bathymetric attributes such as volume. For this reason, we have included buffer-zone geomorphological variables, which can be expected to correlate with lake bathymetry (Messager et al., 2016).

Predictive modeling
We used nine machine learning models (Table S2) to predict each of the eight water quality variables from the lake, buffer zone, and catchment characteristics. The models vary in complexity and their ability to approximate the functional relationship between the water quality variables and geospatial characteristics. Importantly, we assessed each model's predictive performance on unseen data using the R-squared (R 2 ), root-mean-squarederror (RMSE), and mean-absolute-error (MAE) metrics. Therefore, the data were split randomly into training (80 %) and test (20 %) sets, where the training set was used for model selection and training, and the test set was reserved for the final assessment.

Preprocessing
We applied some preprocessing to both the response and predictor variables. This is necessary to reduce distributional skewness and differences in units that could affect model performance. Seven water quality response variables were log 10 (x) transformed; the eighth, alkalinity, was log 10 (x + 1) transformed. For the predictor variables, we used median imputation to fill a few (three observations in the training data) missing values, removed near-zero variance variables, square-root transformed the catchment land use and geology proportions, applied Yeo-Johnson transformation (Yeo and Johnson, 2000), standardized variables to unit standard deviation, and finally removed predictor variables iteratively until the Spearman rank correlation coefficient of all variable pairs were below 0.7. This left 50 out of the 132 candidate predictor variables (Table S1) for further analysis. The recipes Rpackage was used for preprocessing (Kuhn and Wickham, 2021).

Model selection and training
We compared the predictive performance of nine machine learning models: featureless (AVG), linear model (LM), k-nearest neighbor (NN: Beygelzimer et al., 2019), regression tree (TREE: Therneau and Atkinson, 2019), partial least square regression (PLSR: Liland et al., 2021), elastic net (ELAST: Friedman et al., 2010), neural network (NNET: Venables and Ripley, 2002), support vector machine (SVM: Meyer et al., 2021), and random forest (RF: Wright and Ziegler, 2017). For each response variable, we compared the performance of the nine models using 5-fold crossvalidation repeated 5 times (outer loop) to select the best model for further analysis. Many of the models depend on hyperparameter tuning for optimal performance (Table S2). A tuning search was performed for each crossvalidation split using 4-fold cross-validation (inner loop) and a 30iteration random search, and the best set of hyperparameters was then used to fit the model. Following model selection, the best-performing model was trained on the entire training set using the same hyperparameter tuning procedure as for the model selection, except that the number of random search iterations was increased to 100. The final models were then evaluated on the test set and used to make predictions for all lakes in Denmark.

Model interpretation
To identify the important predictor variables and the functional relationships between response and predictor variables, we computed the permutation variable importance (normalized to 0-1 range) and accumulated local effects (ALE) using the iml R-package (Molnar et al., 2018). Furthermore, we visualized the relationship between the permutation variable importance scores for each response variable using principal components analysis (PCA). Machine learning models were trained and evaluated using the mlr R-package (Bischl et al., 2016).

Using predictions as predictor variables
Thus far, the eight water quality variables have been treated separately. However, the water quality predictions, now available for all Danish lakes, provide a straightforward way to upscale related lake variables. To provide an example of such usage and a performance estimate, we used 5-fold crossvalidation to evaluate the performance of an RF model that incorporated all observations of the eight water quality variables, along with the predicted values of the remaining seven. The same tuning procedure as in the model selection analysis was used.

Data availability
All analysis was performed in R version 4.1 (R Core Team, 2021) or Python version 3.8 (Van Rossum and Drake, 2011). All raw data are publicly available from the sources cited in the main text. All scripts used for the analysis and the resulting data products (e.g., lake catchments, predictions, and models) are available from an online repository (http:// doi.org/10.17894/ucph.a344db4b-3d71-4a48-8293-a17b4ccf0e9d).

Danish lakes and catchments
180,378 Danish lakes have been mapped, covering a total area of 713 km 2 (1.66 % of the national area). Small lakes <1 ha make up 97.5 % of the total (Fig. 2 A). Mean and median lake surface areas are 3951 m 2 and 495 m 2 respectively. The catchments of Danish lakes cover a non-overlapping area of 31,811 km 2 (73.8 % of the national area) with a mean and median area of 1.1 km 2 and 0.016 km 2 , respectively (Fig. 2 B). Thus, the catchment area is generally much larger than the lake surface area, with a mean and median lake to catchment ratio of 0.47 and 0.04, respectively (Fig. 2 C).

Water quality variables
Data for eight water quality variables for 924-1054 Danish lakes cover a wide range of conditions in terms of nutrients, inorganic carbon, and light attenuation (Table 1). These lowland lakes are generally nutrient-rich because of the high population density and intense agricultural land use in the country. However, a sampling bias in the lakes that are included in the analysis results in medium-and large-size lakes being overrepresented (Fig. 2 A). Further, there is a pronounced variation in lake alkalinity from the western to eastern parts of Denmark, a consequence of the last glacial period, which left the southwestern part free of ice cover and subject to sand deposition by rivers (Fig. 1). The eight water quality variables are highly inter-correlated (Table S3), as expected.

Model selection
For each of the eight response variables, we compared the performance of nine predictive models ranging from very simple (AVG and LM) to more complex and often better-performing models. For six response variables, the best-performing model was RF, with SVM having the best performance for the remaining two variables (pH and color; Fig. S1). Generally, all models performed well on the test set, explaining 28-60 % of the variation (R 2 ; Table 2).

Water quality predictions for Danish lakes
We used the trained models to make predictions for 180,377 Danish lakes (Fig. 3). This is straightforward because the predictor variables are readily available from geospatial databases. The density distributions of the predictor for some response variables differ from those of the observations used to train the models, which is likely a consequence of the bias in sampling of a higher proportion of medium-to large lakes, relative to small lakes (Fig. 2 A). This is the case for pCO 2 , pH, color, and Secchi depth, for which the predicted national average (Fig. 3) is much different from the observations (Table 1) because the many small lakes are predicted to have higher pCO 2 and color along with lower pH and Secchi depth. The difference is less pronounced for TP, TN, alkalinity, and chlorophyll a, which are not affected by lake size to the same degree.

Using water quality predictions to train new models
In addition to generating knowledge that is immediately useful in nature management, model predictions can be used in future national modeling efforts. To demonstrate this, we computed cross-validated model performance using observations of each of the eight water quality variables as the response variable, with the predictions of the remaining seven variables as the predictors (Table 3). For all variables, the mean performance was superior to that of models that only use geospatial variables at the lake, buffer zone, and catchment level. This is a result of the interdependence between the water quality variables, which often are highly correlated (Table S3). Moreover, it indicates that the trained predictive models must capture relevant relationships for each of the eight water quality variables and not only consider them individually. This in turn improves the performance when the predictions are subsequently used as part of new modeling efforts.

Drivers of lake water quality
Fifty predictor variables were used to train the predictive models. These variables directly or indirectly cover a range of effects and interactions at the lake, buffer zone, and catchment level. The influence of each predictor variable on each of the eight response variables generally differs widely, but some commonalities are present for similar variables such as nutrients (Fig. 4). Generally, the incorporation of several predictor variables enhanced the performance of the predictive models. The exception is the pCO 2 levels, which was reliably and consistently predicted on the basis of a single predictor variable: lake area. Among the top ten predictors for the water quality variables, the major categories of predictor variables are present, i.e. geomorphology (curvature and elevation), land use (non-irrigated arable land and coniferous forest), geology (clayey till and sandy deposits), and lake metrics (area, distance to the coastline, distance to main stationary line of ice cover during the last glaciation), and the three spatial levels (lake, buffer zone, and catchment).

Permutation variable importance
In general, many of the predictor variables expected a priori to be important are present among the most important variables (Fig. 4). This is the case for lake area and many of the catchment soil and land use variables, which previous studies suggested would be promising predictor variables. Perhaps more surprising is the apparent importance of geomorphological variables at both the buffer and catchment levels. These variables may influence lake water quality through both direct (e.g., soil erosion) and indirect pathways (e.g., buffer zone geomorphology as related to lake bathymetry).
For alkalinity and pH, clayey till, the distance to the main stationary line, and coniferous forest for pH, are important predictors, emphasizing the influence of the last glaciation on inorganic carbon chemistry in Danish lakes (Fig. 1). Lake area, catchment forest cover, and presence of freshwater deposits were important predictors of color. For response variables directly related to eutrophication (TP, TN, Secchi depth, and chlorophyll a), buffer zone curvature and elevation, unexpectedly, were among the most important predictors; for Secchi depth, ruggedness also was a surprisingly important predictor. Less surprisingly, agriculture (non-irrigated arable land), ice cover during the last ice age, catchment geology (clayey till and sandy deposits), and landscape position (distance to the coastline) were all important predictors (Fig. 4).

Response for the most important predictors
The average influence of each predictor variable along a range of values was examined with ALE. From the response curves of the most important variables (Fig. 5), it is evident that the trained predictive models can capture a range of relationships, with the RF appearing more step-like because it is an ensemble of regression trees, in contrast to the more linear

Table 2
Predictive performance on the test set of the best performing machine learning models as root-mean-squared-error (RMSE), mean-absolute-error (MAE), and variance explained (R 2 ) for eight water quality variables (log 10 -transformed). relationship with SVM (color and pH). For many of the predictor variables, the relationships, either positive or negative, are similar to a priori expectations. Alkalinity increases with the proportion of clayey till in the catchment and the distance to the main stationary line and, thus coverage by glaciers during the last glaciation (Fig. 1). Color decreases with lake area and increases with the proportion of mixed forest in the catchment. For pCO 2 , the dominant relationship is the negative relationship with lake area, and consequently, its relationships are positive with color and negative with pH. For the eutrophication-related variables, the influence of buffer zone geomorphology (e.g., curvature and elevation, and ruggedness for Secchi depth) is perhaps the most striking, with a positive relationship with TP, TN, and chlorophyll a.

Similarities in important drivers of water quality
As highlighted above, there are similarities in the permutation variable importance among the eight response variables. This is not surprising since there are significant correlations among the eight variables (Table S3). The similarities can be visualized using PCA (Fig. 6), showing that the eutrophication-related variables generally group together, near Secchi depth and pCO 2 , but distanced from pH, alkalinity, and color.

Predictability of water quality variables
From a large candidate set of 132 predictor variables, we used 50 to train predictive models for eight water quality variables. We selected readily available predictor variables, so that predictions could be made for all current and potential future lakes. Some variables belong to well-defined categories, e.g., land use, geology, or geomorphology, while others place the lake in the landscape and account for spatial autocorrelation, e.g., distance to the coastline, distance to the main stationary line of the  long-gone glaciers . The predictive performance ranged from moderate to strong, both regarding the accuracy as RMSE or MAE and regarding the proportion of explained variance as R 2 . Thus, this study demonstrates the ability of machine learning models to yield good predictions of several lake water quality variables from readily available data on lake, buffer zone, and catchment geospatial variables (Fig. S2). This result is a significant improvement over previous approaches. Many existing models use geospatial variables to predict water quality by predicting the concentrations of nutrients like TN and TP (Stanley et al., 2019), but few have attempted to generate predictions for pCO 2 , pH, and alkalinity. This motivated us to include a broader range of important lake variables and evaluate the relationships among them. Multiple studies have used landscape characteristics to predict lake water quality, but their accuracy and the proportion of variance explained has generally been low (Gémesi et al., 2011;Nobre et al., 2020). Comparing the performance of various models may be difficult because they use different predictor variables and because some studies evaluate their model's performance on the basis of data used to train the model, while others use an independent test set. For TN and TP, the R 2 values reported here are similar to previous studies of Denmark (Nielsen et al., 2012) and climatically similar Estonia (Sepp et al., 2022). For pCO 2 , other studies have achieved similar R 2 values using mainly lake variables instead of the more indirect landscape drivers (Humborg et al., 2010;Lapierre et al., 2017). In general, the framework presented here yielded good performance compared to existing studies, despite not using any measured lake chemical variables as predictors. This is a great step forward because the data can be upscaled efficiently, whereas models that are dependent on data from individual lakes cannot.
The fact that data from catchments and buffer zones are good predictors of lake chemistry is highly promising, though not a surprise (Arbuckle and Downing, 2001;Gémesi et al., 2011;Nobre et al., 2020). Previous studies have noted that the degree of lake-catchment interaction depends on the Fig. 4. The permutation variable importance (normalized to 0-1 range with 1 and 0 being the most and least important respectively) of 50 predictor variables calculated using the best performing machine learning model for each of the eight water quality variables. Variables are determined at the levels of the lake (L), catchment (C), and buffer zones (B; e.g., B100 is a 100 m wide buffer zone surrounding the lake).
fluvial connectivity of the catchment, i.e., how efficiently water and elements are transported within and to the lake system (Fraterrigo and Downing, 2008;Read et al., 2015). In an effort to accommodate this, we included variables describing lake-stream connectivity but these were found not to be of great importance. However, the distance to the coastline and terrain elevation were important for several water quality variables and should capture some aspects of the lakes' position in the fluvial network (Olden et al., 2001). This could be developed further by applying network analysis and thereby emplace the lake in the fluvial network (Jones, 2010).
While the topographical catchment units are straightforward to delineate, they are also prone to error (Oksanen and Sarjakoski, 2005). Errors may be due to inaccurate catchment delineations or water transport that does not follow the terrain surface (Lindsay, 2016). This may especially be the case in flat regions like Denmark, which has a high degree of subsurface water transportation via, e.g., sewerage, drainage, or groundwater. The use of a very high-resolution DEM and the applied preprocessing methods should alleviate some of the issues associated with flow routing in flat landscapes (Lidberg et al., 2017).

Drivers of lake water quality
We expected catchment land use, geology, and soil composition to be influential predictor variables of water quality. For several of the response variables, the predictive models also uncovered such relationships but additionally found geomorphology at the buffer and catchment level to be of particular importance. Especially for eutrophication-related variables TN, TP, Secchi depth, and chlorophyll a, the influence was apparent. Generally, the water-quality variables were highly inter-correlated (Table S3) and in line with results presented in numerous studies. One exception is pCO 2 , that has been considered to a lesser extent in this context, and shows the strongest correlations to color and pH, but exhibited no apparent relationship to TN and TP. The similarities among important predictor variables between the eight water quality variables were examines using PCA. The water quality variables directly related to eutrophication, TN, TP, and chlorophyll a, group together, with pCO 2 and Secchi depth slightly distanced. This is expected, as pCO 2 and Secchi may not be directly influenced by the drivers of eutrophication but are affected by attributes such Fig. 5. Influence of the four most important predictor variables assessed using Accumulated Local Effects (ALE) for each of the eight water quality response variables. Lines are colored according to their ranked importance, with 1 being the most important.
as lake area. Further distanced from these are alkalinity, pH, and color, which differ to a much larger degree in their drivers.
For alkalinity, the strong impact of the last glaciation was pronounced, and as expected shared influential predictors with pH (Rebsdorf et al., 1991), which was positively influenced by lake area and negatively by coniferous forest cover. Other studies have also found an influence of coniferous forest cover on lake pH (D'Arcy and Carignan, 1997). The positive influence of surface area on pH, and the inverse for pCO 2 , is likely related to the higher gas transfer velocity (Holgerson et al., 2017), the decreasing influence of terrestrial organic matter, and CO 2 supersaturated groundwater inputs relative to lake volume associated with increasing lake area (Martinsen et al., 2020b). The overshadowing influence of lake area on pCO 2 is somewhat surprising, given that landscape processes and geomorphology are important predictors of pCO 2 in streams (Martinsen et al., 2020a;Rocher-Ros et al., 2019). It is likely that the increased residence time and seasonal stratification-mixing dynamics (Weyhenmeyer et al., 2012), even in the shallow Danish lakes, overrides these effects on surface water pCO 2 . For color, the negative influence of lake area and positive influence of forest cover is in line with previous studies in Danish lakes and the higher input of terrestrial humic organic material relative to lake water volume (Sand-Jensen and Staehr, 2009Staehr, , 2007. We also observed a negative relationship between color and terrain slope, which has been reported by other studies as well (Rasmussen et al., 1989). The success of terrain slope as a predictor of terrestrially influenced carbon pools, e.g., color, DOC (D'Arcy and Carignan, 1997), and pCO 2 (Martinsen et al., 2020a), is likely due to the influence of terrain slope on soil thickness and organic matter accumulation. Jankowski et al. (2014) found that the quality of organic matter in streams, expressed as the temperature sensitivity of respiration increased as catchment slopes became steeper, while the quantity of organic matter decreased. This may explain some of the observed variations in the pools of dissolved carbon, as differences in substrate, and the degree to which they have been exposed to microbial processing is influenced by the transit time within the catchment. A steep catchment terrain also impacts pCO 2 directly, through higher gas exchange velocities (Wallin et al., 2011), and the relative role of photochemical degradation of colored dissolved organic matter during transit (Köhler et al., 2002).
Geomorphological variables may influence lake nutrient variables directly or indirectly. For example, soil erosion, which is directly influenced by landscape geomorphology, land use, and soil type is known to be an important delivery mechanism for phosphorus (Laubel et al., 2003). Erosion of soil that contains particle-bound phosphorus is driven by physical forcing through precipitation and wind, and is particularly important in regions with exposed soils and steep slopes (Grant et al., 1996;Kronvang et al., 2007). Our finding regarding the influence of geomorphology on chlorophyll a may be a result of its impact on TP, as these two variables are highly correlated (Table S3). Furthermore, phosphorus is often the limiting nutrient for phytoplankton development in lakes and thus an important predictor of chlorophyll a and potentially Secchi depth (Jackson et al., 2007;Jeppesen et al., 2000). The high importance of catchment agriculture cover for TN is expected, as most nitrogen input to Danish streams comes from agriculture-dominated catchments, whereas phosphorus inputs are generally dominated by point sources (Jeppesen et al., 1999) and are very sensitive to soil erosion in steep terrain. Indirect effects of geomorphology on lake nutrients are also likely, because buffer zone topography is a good predictor of lake bathymetry (Messager et al., 2016;Sobek, 2011) and in turn TP, TN, chlorophyll a, and Secchi depth (Fee et al., 1996;Qin et al., 2020).
The lake water quality predictions produced in this study can also be used to investigate the traditional water quality relationships of interest, e.g., the relationships between chlorophyll a and TP and TN (Kalff, 2002). Our predictions cover the nationwide range of environmental characteristics and lake types which should provide an improved view of such relationships. The slope between chlorophyll a and TN appears similar to those found in earlier more restricted studies but appears to be lower for TP (Fig. S3). This could be due to the presence of inflection points and non-linearities (Quinlan et al., 2021) which is more likely when considering a much wider gradient in lake sizes and nutrient conditions in the nationwide data. This could be further promoted by the inclusion of small lakes which tend to have higher dissolved organic carbon and color levels (as also shown here) which may restrict phytoplankton biomass and primary production (Sand-Jensen and Staehr, 2009Staehr, , 2007. Additionally, the relationships are modified by water depth (e.g., euphotic depth: mixed depth) in a complex manner as light, vertical mixing, and wind-induced particle resuspension may reach shallow bottoms (Krause-Jensen and Sand-Jensen, 1998; Yuan and Jones, 2020).

Predictive modeling and large-scale investigations
Machine learning models, as opposed to traditional linear models, can have several advantages. Their predictive performance can often be superior because they are sufficiently flexible to capture complex, non-linear relationships and interdependencies that are present in monitoring data (Read et al., 2015), and they scale well to many more observations (Olden et al., 2008). Furthermore, the researchers do not define a priori the functional relationship between the response and predictor variables, a task that is more straightforward for experimental data but challenging for a diverse range of predictors presented here (Breiman, 2001). A model that can approximate the functional relationships may even reveal new or surprising relationships (Read et al., 2015). However, the flexibility of the models may also result in poor performance when extrapolating outside the range of predictor variables or new spatio-temporal domains, and this possibility should be taken into account when evaluating model performance in an extrapolation setting (Meyer et al., 2018;Meyer and Pebesma, 2021). Consequently, while we are confident that equivalent datasets can be assembled for other countries and regions, especially when considering the availability of readily available data on elevation, land use, soil composition, etc. with near-global coverage Hengl et al., 2014), the model performances reported here is strictly valid only for Denmark, the range of environmental conditions is so extensive that the models should be applicable in many regions with a similar climate, geology, and soil conditions.
In the present analysis, we only considered the spatial variation, that is, we used multi-year aggregations of lake water quality, assuming that the inter-annual variation is low. We believe this is a reasonable assumption for the 2000-2019 period in a Danish context, where lake water quality improved markedly with the implementations of the water action plans in 1987, but has since slowed down (Kronvang et al., 2005). Including temporal variation in a predictive model could be advantageous for some usecases. Indeed, GAM models that interpolated monthly values captured seasonal variation very well, suggesting that a spatio-temporal predictive (forecasting) model also could perform well. Approaches that combine machine learning models and process knowledge could result in further improvements (Hanson et al., 2020).
As a secondary step in our analysis, we show that the water quality predictions can be used subsequently to train even better-performing models. This highlights an advantage of considering multiple response variables simultaneously and suggests that making predictions for all available lakes, and not only for the lakes used for modeling, could be a suitable direction for future studies. This is commonly done for variables that are continuous in space, e.g., climate or soil type Fick and Hijmans, 2017;Hengl et al., 2014), and less so for discrete units such as lakes. Adding additional water chemical response variables using the existing monitoring data to create a collection of predictions for each lake could provide exciting stepping stones for future upscaling exercises. However, monitoring data, as in Denmark, may contain biases. In our study as well as others (Stanley et al., 2019;Wagner et al., 2008), small lakes were undersampled, leading to biases in sample statistics that potentially influenced model performance. The lack of studies of small lakes may result in significant differences between sampled observations and large-scale predictions of water quality variables where lake size is influential, e.g., pH, pCO 2 , and color in this study. For pCO 2 , we found an approximately two-fold difference between the mean computed for observations from monitoring data and the national predicted values, and, despite the higher gas transfer velocity in larger lakes, this may have significant implications for carbon emission estimated over large geographical areas (Holgerson and Raymond, 2016;Martinsen et al., 2020b).

Conclusions
A significant proportion of the variation in water quality between lakes can be explained using geospatial predictor variables related to land use, geology, geomorphology, and lake morphometry. Some of the relationships have previously been described, however, we present a surprisingly strong influence of buffer zone geomorphology (curvature, elevation, ruggedness, slope, etc.) on lake water quality. Eutrophication related variables share many important predictor variables but some variables (e.g. pH, alkalinity, and color) are ultimately driven by other processes, which appear in our analysis when considering similarity between water quality variables based on the contribution of important predictor variables. Furthermore, we show that relationships between water quality and predictor variables can be approximated using flexible machine learning models and, subsequently, the most important variables and their relationships can be identified. These models automate an otherwise error-prone step of the modeling process, thereby improving predictive performance. The water quality predictions can be included as predictor variables in new models and, if available for entire regions, form the basis for improved large-scale estimates of nutrient and carbon cycling by moving beyond estimates based on discrete size classes. Additionally, the predictive models can be used to inform approaches to nature management by providing water quality estimates for lakes, both existing and contemplated, with varying lake, buffer zone, and catchment characteristics. Future studies should address how the temporal dimension can be incorporated and the potential influence of biases associated with data from regional monitoring programs.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.