Environmentally driven risk assessment for algal bloom occurrence in shallow lakes

An algal bloom is a complex hydro-biological phenomenon driven by multi-attribute environmental processes and thus remains difficult to predict. In this paper, a comprehensive modelling framework for forecasting algal bloom risks in shallow lakes is presented, based on long-term field observation and modelling of eutrophic shallow lakes. In the procedure, the major factors and their suitable ranges are investigated, and the individual influence of each driving factor is evaluated quantitatively, using an integrated approach of orthogonal design and regression analysis. By analysing the possible combined effects of the major driving factors and the relationship between algal bloom risk and these factors, a cost-effective environmentally driven risk assessment model is developed to forecast the likelihood of algal bloom occurrence, through a parameter optimization and prediction comparison routine. The risk model, which requires only readily available meteorological and water quality data, has been calibrated and validated against long-term field observations of algal blooms in Taihu Lake, with a prediction accuracy higher than 70%. For this closed shallow lake, the influence of hydrodynamics can be indirectly reflected by the variation of wind speed, and total phosphorus, water temperature, photosynthetically active radiation, and average wind speed can generally be used as the major bloom-driving factors. This study provides a practical framework for the development of algal bloom early warning schemes for shallow lakes and helps to understand the combined function of complex bloom-driving factors.


Introduction
The occurrence of eutrophication and algal blooms in inland lakes has been increasing worldwide, bringing a range of hazards including water discoloration, drinking water crises, hypoxia and fish kills (Guan et al 2022). Algal blooms are a complex hydro-biological phenomenon driven by multi-attribute processes. Multiple factors such as trophic status, physical hydrography and meteorological conditions influence algal growth and abundance. The most commonly observed environmental constraints on (or promoters of) algal photosynthesis are nutrients, water temperature (WT) and solar radiation (Gibbons and Bridgeman 2020). There is a general consensus that the probability of algal bloom occurrence increases in response to increased nutrient loading (Titman 1976, Xu et al 2022). Hydrographic factors (related to the wind for shallow lakes) also play an important role, as most algal blooms occur in weakly flushed water bodies with relatively long residence times (Mao et al 2014).
Algal blooms can occur and subside over rather short time scales, and are hence notoriously difficult to observe and predict (Janssen et al 2019, Li et al 2021b, Pyo et al 2022). To effectively and proactively reduce hazards caused by algal blooms in freshwater lakes, numerous algal bloom modelling studies have been conducted. Recently, algal blooms have been predicted using either process-based (PB) or data-driven (DD) modelling approaches (Rousso et al 2020). PB models such as PROTECH (Page et al 2018), DYRESM-CAEDYM, and PCLake (Chou et al 2021) are primarily based on some generally accepted framework of eutrophication kinetics, in which biological and ecological knowledge can be incorporated to solve mass conservation equations numerically. However, their practical application is often hampered by the lack of knowledge of many complex and highly non-linear processes, and by the difficulty of calibrating numerous model parameters (typically over 20-30 parameters). As an alternative approach, DD modelling techniques, such as Regression and Indexes (Mchau et al 2019, Ding et al 2021), Artificial Neural Networks (Yang et al 2018, Park et al 2021) and Bayesian Networks (Mu et al 2021), have prospered in algal bloom modelling, combined with remote sensing (Li et al 2022a). DD models are mainly based on data mining algorithms and statistical techniques that analyse and identify patterns amongst monitoring data to create predictive rules linked to algal bloom dynamics. Nevertheless, since DD models rely strongly on the availability and quality of data, their performance can be severely constrained if the data are not appropriate for the model's purpose or are not appropriately preprocessed.
For instance, in Taihu Lake, water quality monitoring and enrichment bioassay experiments (Xu et al 2015), statistical analysis methods (Xu et al 2022) and machine learning (Lu et al 2016, Shan et al 2020) have been used extensively to identify the major driving factors of algal blooms and to study the relationship between cyanobacterial blooms and environmental factors. However, few studies have quantified the individual effect of the major driving factors on chlorophyll-a (Chl-a) concentrations (an indicator of algal biomass). For the prediction of algal bloom occurrence in Taihu Lake, numerous PB models (Huang et al 2012, Li et al 2022b) and DD models (Wang et al 2022) have been adopted and proposed, some combined with new technology such as remote sensing (Li et al 2020). However, the combined effect of the driving factors usually cannot be fully revealed in DD models, while PB models are usually inefficient. A simple and practical model with high accuracy is still needed for effective and proactive algal bloom management in shallow lakes.
In this paper, we develop a comprehensive framework for estimating algal bloom risk in shallow lakes, in which key meteorological, hydrodynamic and biochemical factors, as well as their combined influence, are properly accounted for. First, a statistical method is used to identify the major environmental factors driving algal bloom occurrence. Second, the individual influence of each driving factor is investigated quantitatively. Third, by evaluating the possible combined effect of the major driving factors, a model is developed to forecast the likelihood of algal bloom occurrence. The predicted results are demonstrated using Taihu Lake as a case study, and the effectiveness and limitations of the proposed modelling framework are finally discussed.

Study area
Taihu Lake is the third-largest freshwater lake in China. It is a shallow water body with an average depth of 1.9 m and a surface area of 2338 km². The lake area is influenced by a humid subtropical climate, with an average annual precipitation of about 1000-1400 mm and an annual average air temperature of 14.9 °C-16.2 °C. The lake can be characterized by several segments, including four major embayments: Zhushan Bay, Meiliang Bay, Gonghu Bay, and Eastern Lake Bay (figure 1). The circulation of this lake is primarily driven by wind, and its hydraulic retention time is around 300 days (if not considering temporary water transfer from the Yangtze River) (Mao et al 2008). The effect of water transfer on water transport processes in the lake is strongly influenced by hydrodynamic conditions induced by wind, and is also influenced by the water transfer routes and flow discharge (Hu et al 2010, Li et al 2013). Taihu Lake was oligotrophic as recently as the 1950s and 1960s, with Chl-a concentrations of about 2 mg m⁻³ and total phosphorus (TP) of about 10 mg m⁻³ (Chen et al 2003). However, since the late 1980s, the water quality of this lake has severely deteriorated and algal blooms have become a frequent nuisance (especially in Meiliang Bay, the most eutrophic region). A series of countermeasures addressing effluent diversion and water quality improvement has been implemented since 2007 (cost ∼US$14 billion), but nutrient loading and cyanobacterial blooms have not responded as expected to abatement efforts (Qin et al 2019).

Data collection
The basic geometry and bathymetry data are provided by the Taihu Basin Authority. Monthly samples are taken at 14 monitoring stations across the lake, and the field data from 1995 to 2006 and from 2017 are collected to calibrate and validate the forecasting model. For each station, a number of physical and water quality parameters are measured, including Chl-a, WT, pH, chemical oxygen demand (CODMn), dissolved oxygen (DO), and the nutrient concentrations of total nitrogen (TN) and TP. Meteorological data of average wind speed (AWS) and photosynthetically active radiation (PAR) are collected at two sites: Taihu Weather Station and Dongshan Weather Station.

Model development framework
A model development framework is proposed for forecasting algal bloom risks in shallow lakes, using a variety of statistical methods as well as a model optimization routine. As shown in figure 2, the proposed framework consists of five steps. This stepwise framework is based on long-term field observation and modelling of eutrophic shallow lakes. The model input is the monitoring data of meteorological, hydrodynamic and biochemical factors of the lake. The output is the algal bloom risk R calculated from the combined effect model. The intermediate parameters of the model are mainly the weight parameters in the combined effect models in step 4 and are calibrated through an optimization algorithm.

Figure 1 (partial caption). IV-Western Lake; V-Central Lake; VI-Xuhu Lake; VII-Eastern Lake Bay; VIII-Southern Lake; IX-Wuli Lake.

Combined effect modelling
To quantify the combined effect of factors driving algae growth, the algal bloom risk assessment formula of the additive model, equation (1), uses the following notation: R_add represents the algal bloom risk assessed by the additive model; IE_i represents the individual effect of the ith major bloom-driving factor; ω_i^add represents the weight of the ith major bloom-driving factor; and G represents the number of major bloom-driving factors.
The algal bloom risk assessment formula of the multiplicative model uses the following notation: R_mul represents the bloom risk assessed by the multiplicative model; ω_i^mul represents the weight of the ith major bloom-driving factor (ω_i^mul > 0); IE_i and G have the same meaning as in equation (1).
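The two formulas are not reproduced in the text above. One reading that is consistent with the symbol definitions, offered here only as an assumption and not as the authors' exact equations, is a weighted sum for the additive model and a weighted product (weights as exponents) for the multiplicative model:

```latex
% Assumed forms, inferred from the symbol definitions only.
R_{\mathrm{add}} = \sum_{i=1}^{G} \omega_i^{\mathrm{add}} \, IE_i ,
\qquad
R_{\mathrm{mul}} = \prod_{i=1}^{G} IE_i^{\,\omega_i^{\mathrm{mul}}} ,
\quad \omega_i^{\mathrm{mul}} > 0 .
```

Under this reading, a single factor with an individual effect near zero suppresses the multiplicative risk almost entirely, whereas in the additive form it only removes its own weighted share.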
We present a composite model to account for the combined effect of the driving factors based on the general mass balance equation for water quality variables (Hydraulics 2006). The theory of the composite model can be found in the supplementary material of this paper. Algal bloom risk assessed by the composite model is represented by R_com. The final form of this formula uses the following notation: IE(PAR) represents the individual effect of PAR; IE(WT) represents the individual effect of WT; IE(C_N) represents the individual effect of the limiting nutrient; ω_PAR, ω_WT and ω_CN are their respective weights. The nutrient (TN or TP) with the greater Pearson correlation coefficient with Chl-a is chosen as the limiting nutrient.
In all three risk assessment formulas, the individual effect of each driving factor is normalized as y = (x − x_min) / (x_max − x_min), where x and y represent the individual effect of a driving factor before and after the normalization, respectively, and x_min and x_max represent the minimum and maximum effect of an individual factor before the normalization, respectively.
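As a minimal sketch of how normalized individual effects could be combined, the snippet below applies min-max normalization and then evaluates an additive and a multiplicative risk. The weighted-sum and weighted-product forms are assumptions inferred from the symbol definitions, and the effect and weight values are hypothetical; the composite model is not sketched because its form is given only in the supplementary material.

```python
import math


def min_max_normalize(x, x_min, x_max):
    """Scale an individual effect to [0, 1] from its minimum and maximum."""
    if x_max == x_min:
        raise ValueError("x_max must differ from x_min")
    return (x - x_min) / (x_max - x_min)


def additive_risk(effects, weights):
    """Assumed additive form: weighted sum of normalized individual effects."""
    return sum(w * e for w, e in zip(weights, effects))


def multiplicative_risk(effects, weights):
    """Assumed multiplicative form: product of effects raised to positive weights."""
    return math.prod(e ** w for w, e in zip(weights, effects))


# Hypothetical normalized effects for TP, WT, PAR and AWS (illustrative only).
effects = [0.8, 0.6, 0.5, 0.4]
# Hypothetical weights; in the framework these are calibrated by optimization.
weights = [0.4, 0.3, 0.2, 0.1]

r_add = additive_risk(effects, weights)
r_mul = multiplicative_risk(effects, weights)
```

With weights that sum to one and effects in [0, 1], both risks stay in [0, 1], which keeps them comparable with risk thresholds defined on the same interval.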

Parameter optimization
We develop two objective functions which measure the predictive capability of the three combined effect models. For the optimization of each risk assessment model, the decision variables are the weights contained in the corresponding formula and the risk thresholds between all bloom grades. Consider algal blooms divided into n grades (1, 2, ..., n) in ascending order of severity (non-bloom is included and designated grade 1), and let R_i^1, R_i^2, ..., R_i^{L_i} represent the assessed risks of the observed blooms of grade i, computed by one of the three risk assessment models, where L_i represents the observed frequency of blooms of grade i. Two objectives are then defined, both to be maximized: F_1, the average prediction accuracy (APA) over the bloom grades, and F_2, the minimum prediction accuracy (MPA) over the bloom grades. Here pre_i represents the prediction accuracy of blooms of grade i; RT_j, j ∈ {1, 2, ..., n − 1}, represents the risk threshold between blooms of grade j and j + 1 (RT_0 = 0; RT_n = 1; RT_{j−1} < RT_j < RT_{j+1}); and Tag_i^k is the binary signal indicating whether the assessed risk of the kth observed bloom of grade i falls between the lower risk threshold RT_{i−1} and the upper risk threshold RT_i.
The grading standard for algal blooms in Taihu Lake has been studied in depth, focusing on the location and area of the bloom, algal cell density, economic loss and number of people affected (Liu et al 2011b). As the concentration of Chl-a in water reflects the change in algal biomass and is commonly used in Taihu Lake, we take the standard proposed by Liu et al (2011a) as a reference and simplify it for convenient application. Non-bloom, small bloom and large bloom correspond to Chl-a concentrations of 0-30 µg l⁻¹, 30-50 µg l⁻¹ and larger than 50 µg l⁻¹, respectively.
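The grading rule and the two objectives described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-grade accuracy is taken as the fraction of observed blooms of a grade whose assessed risk falls inside that grade's threshold interval, the interval is treated as closed below and open above (closed at RT_n = 1), and a Chl-a value of exactly 30 or 50 µg l⁻¹ is assigned to the lower grade. The function names and sample risks are hypothetical.

```python
def bloom_grade(chl_a):
    """Grade from Chl-a in ug/l: 1 = non-bloom, 2 = small bloom, 3 = large bloom."""
    if chl_a <= 30:  # boundary handling is an assumption
        return 1
    if chl_a <= 50:
        return 2
    return 3


def prediction_accuracies(risks_by_grade, thresholds):
    """Per-grade accuracy pre_i.

    risks_by_grade[i] lists the assessed risks of observed blooms of grade i + 1.
    thresholds is [RT_1, ..., RT_{n-1}]; RT_0 = 0 and RT_n = 1 are added here.
    """
    bounds = [0.0] + list(thresholds) + [1.0]
    accuracies = []
    for i, risks in enumerate(risks_by_grade):
        lower, upper = bounds[i], bounds[i + 1]
        is_last = i == len(risks_by_grade) - 1
        # Tag is 1 when the risk lies in [lower, upper); the top grade includes 1.0.
        hits = sum(
            1 for r in risks if lower <= r < upper or (is_last and r == upper)
        )
        accuracies.append(hits / len(risks))
    return accuracies


# Hypothetical assessed risks for two grades (non-bloom, bloom), threshold 0.5.
risks_by_grade = [[0.1, 0.2, 0.6, 0.3], [0.7, 0.4, 0.9]]
pre = prediction_accuracies(risks_by_grade, thresholds=[0.5])

apa = sum(pre) / len(pre)  # F1: average prediction accuracy
mpa = min(pre)             # F2: minimum prediction accuracy
```

Maximizing the minimum accuracy alongside the average discourages solutions that score well on the frequent non-bloom grade while missing most blooms.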

Identification of major bloom-driving factors
To eliminate the impact of extreme values, Chl-a and the environmental factors are transformed by the natural logarithm before the correlation coefficients are calculated (Chen et al 2001). Pearson correlation coefficients for environmental factors versus Chl-a are given in table S1 in the supplementary material. All environmental factors but AWS and DO show positive relationships with Chl-a, and all correlations are significant at the 0.01 level. The major bloom-driving factors are then selected according to the biochemical mechanisms linking these environmental factors and phytoplankton metabolism (Mao et al 2009). Specifically, Chl-a is most strongly related to CODMn and pH, which along with DO can be viewed as consequences of algae growth. Thus, CODMn, pH and DO are excluded to avoid collinearity among environmental factors. Of the two nutrient factors collected, TN shows a trifling relationship with Chl-a, while TP shows a much stronger relationship. This comparison suggests that TN plays a non-limiting role in Taihu Lake. Moderate or strong relationships are established between the other bloom-driving factors (i.e. WT, PAR, AWS) and Chl-a. Therefore, TP, WT, PAR, and AWS are identified as the major bloom-driving factors.
The maximum Chl-a occurs in the eighth experimental group with TP, WT, PAR and AWS at level 4, 3, 3 and 2, respectively (i.e. TP is 0.15-0.80 mg l⁻¹, WT is 18.00-25.20 °C, PAR is 5.34-6.86 MJ m⁻² and AWS is 3.67-5.34 m s⁻¹). Analysis of variance (ANOVA, table 3) suggests that the impacts of variations of TP, WT, and AWS on Chl-a are significant (P < 0.01). Moreover, the impact of PAR variation is also significant (P < 0.05). The main effect plot (figure 3) is the graphical representation of the results of the range analysis. According to figure 3, Chl-a reaches a relatively high value when TP is at level 3 or 4 and decreases markedly when TP is at level 1 or 2. Thus, the optimal TP in terms of Chl-a is level 3 and 4, which can be considered the suitable range of TP. Based on this approach, the suitable ranges of the major bloom-driving factors in terms of Chl-a are given in table 4.
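The two statistical steps in this section, the Pearson correlation on log-transformed data and the range analysis behind the main effect plot, can be sketched with the standard library alone. The function names and the sample values are hypothetical and serve only to show the calculations; they are not data from Taihu Lake.

```python
import math
from collections import defaultdict


def log_pearson(x, y):
    """Pearson correlation of ln(x) and ln(y); all values must be positive."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sx = math.sqrt(sum((a - mx) ** 2 for a in lx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ly))
    return cov / (sx * sy)


def main_effects(levels, response):
    """Mean response at each level of one factor (basis of the range analysis)."""
    groups = defaultdict(list)
    for level, value in zip(levels, response):
        groups[level].append(value)
    return {level: sum(v) / len(v) for level, v in sorted(groups.items())}


# Hypothetical TP (mg/l) and Chl-a (ug/l) samples, exactly proportional here.
tp = [0.05, 0.08, 0.12, 0.20]
chl_a = [10.0, 16.0, 24.0, 40.0]
r = log_pearson(tp, chl_a)

# Hypothetical orthogonal-design runs: factor level and observed Chl-a per run.
effects = main_effects([1, 1, 2, 2], [10.0, 20.0, 30.0, 50.0])
effect_range = max(effects.values()) - min(effects.values())
```

A larger range between level means indicates a factor whose variation moves Chl-a more, which is how the suitable levels of a factor can be read off a main effect plot.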

Quantification of individual effects of factors
The use of regression analysis to derive relationships between algal biomass and the four major bloom-driving factors results in four polynomials (table 5). Figure 4 illustrates how the relationships between Chl-a and the four factors vary. According to figure 4, the way in which Chl-a changes with a particular factor is rather similar to that indicated in figure 3. However, when AWS increases from level 1 to level 4, figure 4 shows that Chl-a first declines sharply and then stabilizes, which is not in accordance with the trend presented in figure 3. This may be explained by the wind speed measured at the weather stations not being representative of the 14 monitoring stations.
Note: The significance levels are indicated by * P < 0.05 and ** P < 0.01.
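The regression step can be illustrated with a small least-squares polynomial fit. The sketch below solves the normal equations directly so that it needs no external library; the polynomial degree and the synthetic data are assumptions for illustration, since the fitted polynomials themselves are given only in table 5.

```python
def polyfit(x, y, degree):
    """Least-squares polynomial coefficients [c0, c1, ..., cd], lowest order first."""
    n = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** p for p, c in enumerate(coeffs))


# Synthetic data generated from y = 1 + 2x + 3x^2 (not field measurements).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * v + 3.0 * v * v for v in xs]
coeffs = polyfit(xs, ys, degree=2)
```

For real monitoring data, scaling each factor to a narrow range before fitting helps keep the normal equations well conditioned.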

Figure 5 shows the Pareto optimal front for APA and MPA calibration for the three risk assessment models.
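A Pareto optimal front for two maximized objectives can be extracted from a set of candidate solutions by discarding every point that another point matches or beats on both objectives. The sketch below shows that filter; the (APA, MPA) pairs are hypothetical and do not come from the calibration reported in the paper.

```python
def pareto_front(points):
    """Return the non-dominated (APA, MPA) pairs when both are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical (APA, MPA) pairs from different weight and threshold settings.
candidates = [
    (0.80, 0.70),
    (0.78, 0.75),
    (0.75, 0.72),  # dominated by (0.78, 0.75)
    (0.82, 0.60),
    (0.70, 0.76),
]
front = pareto_front(candidates)
```

Comparing the fronts of the additive, multiplicative and composite models on the same axes is one way to choose among them, since a front lying further toward the upper right offers better trade-offs.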

Calibration and validation of the risk assessment models
Model predictions of both bloom and non-bloom occurrence compared with field observations are summarized in table 6. The model has an overall prediction accuracy of (412 + 139)/681 = 0.809. Specifically, the non-bloom prediction accuracy is 0.806 and the bloom prediction accuracy is 0.818, while the false alarm rate is as low as 99/511 = 0.194.
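The calibration figures quoted above can be checked with a few lines of arithmetic. The number of observed bloom cases (170) is not stated in the text; it is derived here as 681 − 511, which is an inference from the quoted totals rather than a value read from table 6.

```python
# Counts quoted in the text for the calibration period.
total = 681                # all calibration observations
non_bloom_observed = 511   # denominator of the false alarm rate
correct_non_bloom = 412    # observed non-bloom predicted as non-bloom
correct_bloom = 139        # observed bloom predicted as bloom
false_alarms = 99          # observed non-bloom predicted as bloom

# Derived (not quoted in the text): number of observed bloom cases.
bloom_observed = total - non_bloom_observed

overall_accuracy = (correct_non_bloom + correct_bloom) / total
non_bloom_accuracy = correct_non_bloom / non_bloom_observed
bloom_accuracy = correct_bloom / bloom_observed
false_alarm_rate = false_alarms / non_bloom_observed

print(round(overall_accuracy, 3), round(non_bloom_accuracy, 3),
      round(bloom_accuracy, 3), round(false_alarm_rate, 3))
```

The non-bloom accuracy and the false alarm rate share the same denominator and therefore sum to one (412 + 99 = 511).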
Validation of the calibrated composite model is implemented as a follow-up step based on observed data not used in the calibration (table 7). The overall prediction accuracy amounts to 0.736. The prediction accuracies of non-bloom and bloom occurrences are 0.708 and 0.827, respectively. About 92.9% of the 381 predicted non-bloom occurrences are successful. Additionally, a case of 18 consecutive days at a monitoring site in the western lake in April 2017 was used to verify the timeliness of the comprehensive model (table 8). In this case, a comparison of simulated and observed bloom and non-bloom occurrences achieved an accuracy of 0.875. The prediction accuracies of non-bloom and bloom occurrences are 0.897 and 0.786, respectively. About 94.5% of the 55 predicted non-bloom occurrences are successful. The results show the timeliness and accuracy of the model prediction.

Figure 5. Pareto optimal front showing the APA and MPA of two bloom grades for three risk assessment models.

Figures 6 and 7 show the prediction accuracy of non-bloom and bloom occurrence for different monitoring stations, and the temporal variation of Chl-a and bloom risk for different lake regions, based on the complete datasets from the field observation. The prediction accuracy of bloom occurrence reaches at least 0.786 at most stations; two stations (C1 and C2) are slightly lower, at 0.636 and 0.667, respectively. The prediction accuracy of non-bloom occurrence can reach 0.836, with three stations (W2, M1, and M3) somewhat lower, down to a minimum of 0.615. It is noticeable that for stations W1, W2, and M2, the prediction accuracy of non-bloom occurrence is below 0.62. This may result from complex hydrodynamics. During the field observation, a water diversion project from Gonghu Bay to Wuli Lake was conducted, increasing the current velocity around W1 and W2. In addition, as M2 is located near the tributaries of the lake (figure 1), the surrounding region has relatively complicated hydrodynamics.

Strengths and limitations
The modelling framework in this study combines experimental design (the orthogonal design), statistical analysis (correlation and regression), optimization and PB equations (the combined effect model) into a DD process to assess algal bloom risks in shallow lakes. Compared with algal bloom modelling studies using PB models (Huang et al 2012, Li et al 2022b) or DD modelling approaches (Wang et al 2022), it is simple and reveals the algal bloom mechanisms to some degree. To quantify the relationships between the bloom risk indicator and each driving factor, a factorial design method for experiments is adopted, which helps to select representative observations and to separate the effects of each individual factor. Basing the composite model on the general mass balance equation for water quality variables keeps the model equations very simple while retaining the timeliness and accuracy of the model prediction. The framework also provides valuable information on algal bloom mechanisms when assessing the bloom risks of the lake in real time.

Figure 6. Prediction accuracy of non-bloom and bloom occurrence at 14 monitoring stations (bloom occurrence has not been observed at S1, E1 and G1 during the field observation).
Although the results of the model are promising, there are several limitations in this model framework. Firstly, when applying the modelling framework to a different lake, the essential steps should be redone based on long-term field observations of the environmental factors of the new lake: the major bloom-driving factors need to be identified and their suitable ranges reinvestigated correspondingly, and the selected risk assessment model might be different according to the Pareto optimal fronts of the combined effect models. Secondly, the spatial and temporal differences in the algal bloom driving mechanisms in large lakes (Li et al 2020) cannot be addressed at once; the modelling work needs to be repeated to reveal these characteristics. Thirdly, AWS is used to reflect the influence of hydrodynamics for the shallow lake in this study, while other factors, such as the prior wind field, lacustrine topography, hydrophytes, and the inflow and outflow of the lake, are not included. As a result, the model is not sufficiently reliable and stable across all listed bloom grades and lake regions (see the Model performance section in the supplementary material).

Implications for lake functioning and management
Research on algal blooms in Taihu Lake has increased rapidly in recent years, especially after a severe blue-green algae bloom event in 2007. Scholars found that TP and TN were the most important factors affecting the success of the phytoplankton community over the past 20 years (Deng et al 2014, Li et al 2022b). Other changes in the physical environment, such as WT, hydrodynamics, and stratification, also directly affect algae growth (Berger et al 2007, Li et al 2022a). This coincides with the major bloom-driving factors identified by the method of this study, and the framework also revealed that the overall effect of TP is more significant than that of TN and quantified the individual effects of the factors (Paerl et al 2011, Guo et al 2017, Li et al 2022b). The overall prediction accuracy of this model is about 0.809 in calibration and 0.736 in validation, and the prediction accuracy of the Chl-a concentration is more than 0.8, which is satisfactory compared with existing studies (Liang et al 2020, Li et al 2021a). Results of the case study in Taihu Lake indicate that Eastern Lake, Gonghu Bay, and Southern Lake are the lake regions with the fewest bloom occurrences and relatively low bloom risk, followed by Central Lake and Western Lake (figure 7), indicating that the northwest area of Taihu Lake is more likely to suffer from algal blooms (Li et al 2020). For most lake regions the assessed bloom risk agrees reasonably with Chl-a, but as observed in figure 7(f), bloom occurrence is often observed with assessed risk below the threshold at WL1. Firstly, 5 of the 6 false negatives occur in summer or autumn, when east or southeast wind predominates within the lake area; wind-induced algae drift toward the west or northwest over this period (Kong et al 2009) could lead to unpredicted bloom occurrences. Secondly, TP of the inflow from the adjacent tributaries is always higher than that at WL1 during the last three false negatives (no detailed monitoring data are available for the first three), and thus the assessed bloom risk at WL1 cannot represent that in the peripheral region because of TP dilution. This indicates that wind monitoring is important for algal bloom assessment in large shallow lakes such as Taihu Lake. Furthermore, judging from the results of the major driving factor identification, controlling the TP concentration in Taihu Lake is still the best method for the Chinese government to solve the problem of algal blooms (Xu et al 2022).

Figure 7. Temporal variation of Chl-a and bloom risk for eight regions ((a) Meiliang Bay (M4), (b) Central Lake (C2), (c) Wuli Lake (W2), (d) Eastern Lake Bay (E1), (e) Gonghu Bay (G1), (f) Southern Lake (S1), (g) Western Lake (WL1), (h) Western Lake2 (WL2)) of Taihu Lake; shaded region indicates period(s) when assessed risk exceeds the threshold; Chl-a exceeding 30 µg l⁻¹ indicates the bloom condition.

Conclusions
The major factors driving the algal blooms (phosphorus, WT, solar radiation, and wind) are statistically identified. The individual effect of each driving factor has been investigated based on an orthogonal experimental design procedure and regression analysis. By analysing the possible combined effect of the major driving factors, an environmentally driven risk assessment model is developed to forecast the likelihood of algal bloom occurrence with an accuracy higher than 70%. This study provides a comprehensive framework for estimating the algal bloom risk in shallow lakes based on long-term field data.

Data availability statement
The data generated and/or analysed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.