Probabilistic prediction of algal blooms from basic water quality parameters by Bayesian scale-mixture of skew-normal model

The timeliness of monitoring is essential to algal bloom management. However, acquiring algal bio-indicators can be time-consuming and laborious, and bloom biomass data often contain a large proportion of extreme values limiting the predictive models. Therefore, to predict algal blooms from readily water quality parameters (i.e. dissolved oxygen, pH, etc), and to provide a novel solution to the modeling challenges raised by the extremely distributed biomass data, a Bayesian scale-mixture of skew-normal (SMSN) model was proposed. In this study, our SMSN model accurately predicted over-dispersed biomass variations with skewed distributions in both rivers and lakes (in-sample and out-of-sample prediction R2 ranged from 0.533 to 0.706 and 0.412 to 0.742, respectively). Moreover, we successfully achieve a probabilistic assessment of algal blooms with the Bayesian framework (accuracy >0.77 and macro-F 1 score >0.72), which robustly decreased the classic point-prediction-based inaccuracy by up to 34%. This work presented a promising Bayesian SMSN modeling technique, allowing for real-time prediction of algal biomass variations and in-situ probabilistic assessment of algal bloom.


Introduction
The frequency and intensity of algal blooms have increased globally (Hallegraeff 1993, Ho et al 2019, Xiao et al 2019a. As one of the most serious environmental problems, algal blooms profoundly threaten local water safety and cause critical losses to aquatic biodiversity (Qin et al 2010, Olokotum et al 2020, Amorim and Moura 2021. Traditionally, algal biomass monitoring is of primary importance to aquatic health assessment and bloom management (Lee and Lee 1995, Ye et al 2014, Wu et al 2017. However, limited by the laborious and time-consuming process in field survey and lab analysis (Liu et al 2020), the timeliness of algal monitoring is often insufficient to meet demands, e.g. Lake Erhai and coastal waters in Hong Kong, China (Wang et al 2018, Guo et al 2020. In recent decades, with the expansion of in-situ water quality monitoring systems, mathematical approaches using basic physicochemical parameters become promising to improve water management outcomes by achieving real-time biomass predictions (Glibert et al 2010, Shmueli and Koppius 2010).
The ecological relationships between algal overproliferation and basic water quality parameters, such as water temperature, pH, dissolved oxygen (DO), conductivity and clarity, have been widely observed (Weisse 2008, Mantzouki and Visser 2015, Visser et al 2016. For instance, the increase in water temperature can boost the algal blooms within an optimal range, and intensive algal photosynthesis usually alters the DO, pH and conductivity of water columns (Flynn et al 2015), and reduces water clarity due to the high biomass accumulations (Mantzouki and Visser 2015). Over past years, based upon such tight empirical links, a wide variety of methods have been successfully applied for predicting the algal variations and trends, including deep neural networks (Lee and Lee 2018, Lee et al 2022, Liu et al 2022), hybrid evolutionary algorithms (Recknagel et al 2014, Ye et al 2014, and support vector regressions (García-Nieto et al 2020). More recently, with the cheaper availability of computation, Bayesian regression has risen in great popularity (Qian et al 2019, Zhang et al 2019. Bayesian approaches usually have high power of prediction and allow the use of probabilistic paradigm to address the modeling uncertainties (He et al 2020). In practice, the employing of Bayesian regression model is often useful for bloom management efforts, especially for analyzing the exceeding risks of algal biomass according to different guidelines (Cha et al 2014, Mellios et al 2020).
However, algal data samples often have large variances and contain a big proportion of extreme values (Fletcher et al 2005), posing great challenges for empirical models (Gelman et al 2013, Cusack et al 2015, Haakonsson et al 2020. To overcome this issue, commonly the data pre-transformation can be a feasible way to scale the data range and eliminate the presence of extreme values, e.g. Box-Cox (Chung et al 2007). Recently, as an alternative, the scalemixture of skew-normal (SMSN) modeling assumption (Branco and Dey 2001) also provides solutions to handle the irregular data characteristics (Benites et al 2019). With extra scale factor and shape parameters, the SMSN models can strongly accommodate occasional data and show generate robust modeling analysis in many other study fields (Montenegro and Branco 2016, Silva et al 2020, Mirfarah et al 2021. Nevertheless, to our knowledge, there have been no previous reports of utilizing this tool to predict algal variations. In view of the above considerations, our main objectives were to explore the Bayesian SMSN regression to predict algal blooms, by (a) using only basic water quality parameters that are convenient to measure; (b) modeling biomass variations having extreme data distribution characteristics; (c) incorporating probabilistic framework to enhance the assessment accuracy of algal blooms. The Bayesian SMSN models were developed and validated using three ecological datasets with records spanning from 2012 to 2019, which were acquired from one large river system (Zhejiang, China) and two multilake systems (Wisconsin, USA) with cyanobacterial and chlorophyll-a (Chl-a) levels analyzed.
The proposed approach achieved real-time prediction of algal biomass dynamics and in-situ assessment of algal blooms, supporting water environmental management.

Monitoring data
The Hangjiahu Region Rivers are in the downstream reaches of Lake Taihu, located in Taihu Basin (figure 1). Taihu Basin is one of the most developed areas in China, surrounded by many large cities including Shanghai, Suzhou, Wuxi, and Hangzhou. Over the past decades, aquatic ecosystems in Taihu Basin have continuously suffered eutrophication and harmful cyanobacterial blooms (CyanoHABs) problems due to excessive nutrient inputs (Qin et al 2019). During 2018-2019, we sampled Hangjiahu Region rivers at a quarterly frequency (spring: 12-18 April 2018; summer: 13-19 June 2018; fall: 25-31 October 2018; and winter: 2-8 January 2019), in total, there were 31 sites and 124 collected samples (table 1). To acquire cyanobacterial abundance data, the riverine cyanobacteria samples were identified down to the species level (Hu 2006) using a microscope (BX53, Olympus Inc., Japan) in the laboratory and were quantified as cyanobacterial cell biomass. Physicochemical parameters including pH, turbidity, DO, conductivity, photosynthetically active radiation (PAR), water temperature, and water depth, were measured in-situ with portable multiparameter analyzers (YSI EXO2, YSI Inc., U.S.A.), and water transparency was measured with Secchi disk (Secchi disk depth, SDD; Shanghai Changmu Environment Technology Ltd) (table 2).
The two multi-lake districts, i.e. Trout Lake Region and Madison Lakes Region, are located in northern and southern Wisconsin, respectively (figure 1). These lakes are monitored by the North Temperate Lakes-Long Term Ecological Research (NTL-LTER, https://lter.limnology.wisc.edu/) project, and are sampled every 2 weeks during the icefree season (late March or early April through early September) and every 6 weeks during the ice-covered season. In this study, the Trout lakes dataset was collected in five lakes and two bog lakes from 2015 to 2018, and the Madison lakes dataset was collected in two lakes from 2013 to 2018 (table 1). The Chl-a concentrations were analyzed spectrophotometrically, and cyanobacterial samples were identified to species via microscope and were reported as cell biomass. Together, water physicochemical parameters including water temperature, SDD, pH, DO, and PAR were measured at each site with multi-parameter sondes (YSI EXO2, YSI Inc., U.S.A.) (table 2). All data for the two lake systems were obtained from the LTER website (https://lter.limnology.wisc.edu/about/ overview).

Model development 2.2.1. Bayesian SMSN regression model
Continuous algal data, such as cell biomass, are commonly modeled as a normal or lognormal distribution. However, sampling data often have extreme values that can cause skewness, fat-tailedness, and even multimodality in the distribution, which violated Normal or lognormal assumptions. The use of the SMSN distribution family can well characterize these departures (Benites et al 2019). In this case, considering that response variable y (i.e. cyanobacterial biomass and Chl-a concentration) were all positive, we assumed it followed a log-SMSN distribution. Formally, the general class of SMSN distributions was given by the location µ, the scale σ 2 , the positive scale factor s, and the shape (skewness) λ. Hereby, the log-SMSN model can be restrictively written as following hierarchical representations within the Bayesian framework (Marchenko and Genton 2010, Cabral et al 2012) (more details in text S1): where i denotes ith observation, the distribution of s determines the form of log (y i ). For example, when s = 1, the SMSN distribution degenerates to skewnormal (SN) distribution, and with both s = 1 and λ = 0, we retrieve the normal distribution. Here, we assigned s as following: where ν 1 > 0 and ν 2 > 0, and can be considered as two unknown 'degree of freedom' parameters that characterize the shape of the SMSN distribution. In fact, with the given s, the general SMSN case becomes a special generalized-skew-t case (Branco and Dey 2001) (or well-known as the skewed Pearson type VII case (Nadarajah andGupta 2005, Shimizu andIida 2006)), as we can have the usual skew-t case with ν 1 = ν 2 .
Then we constructed linear model conditional on the location µ i by including water quality parameters as predictors: are vectors of regression parameters. Note that the monitoring data were collected across sites and dates, the spatial and temporal variations involved in data may largely influence the model estimates (Carstensen and Lindegarth 2016). Here, we also partially pooled the external site-specific and date-specific information to improve the posterior estimates. Thus, γ j are site-specific random effects varying by site j; δ k are date-specific random effects varying by date k. For the seasonal dataset of Hangjiahu rivers, k = 1, 2, 3; for the multi-weekly datasets of Trout and Madison lakes, monthly sales (i.e. k = 1, . . . , 12) were well-fitted according to the previous study in this area (Xiao et al 2019b).
Additionally, our preliminary correlation analysis showed that some predictor variables were highly correlated (figure S1). Therefore, ridge regression was developed to address the potential multicollinearity problems in the linear model (Dormann et al 2013). Via ridge regression, additional regularization parameters τ = (τ 0 , τ 1 , . . . , τ p ) were taken to describe the prior precision of regression parameters β, thereby to restrict the overfitting of training data with the collinear variables (McElreath 2018). We assigned the priors for the ridge estimates of regression parameters as (Shi et al 2016, Assaf et al 2019: For the random effects, the weakly informative priors were assigned as: For two 'freedom' parameters characterizing the population-level data distribution, we considered the priors suggested by Rômulo Barbosa Cabral et al (2012): For the skewness and scale parameters, the weakly informative priors were also assigned as:

Computation procedures
All computations for Bayesian inference were programmed in the R environment (R Core Team 2020) using the RStan interface (Stan Development Team 2020) to Stan (Stan Development Team 2019). The Markov chain Monte Carlo (MCMC) algorithm was applied using the No-U-Turn sampler to sample for parameter posterior distributions. We ran four chains for 20 000 iterations, discarded the first 5000 (burning), and retained the second 15 000 (sampling) iterations per chain to obtain 60 000 MCMC samples in total. We also pre-set the sampler controlling parameters (adapt_delta = 0.99, stepsize = 0.95, and max_treedepth = 25) and re-parameterized the Stan codes for efficient and stable computations in sampling procedures. The convergences of Markov chains were assured by R-hat statistics (R is maintained under 1.01). The predictors were centered through standardization to achieve a reliable and stable posterior estimate. Model predictions were summarized as medians (point prediction) with credible intervals of the predictive posterior distributions (PPDs; probabilistic prediction). Predictions on new observations from new groups were obtained using the marginal of random effects (McElreath 2018). Moreover, for modeling simplification, a stepwise regression procedure was adopted to reduce predictor variables based on the five-fold cross-validation results using the calibration datasets (details in table S2).

Probabilistic assessments of algal blooms
To inform the algal blooms in three different waterbodies, two alert standards related to the healthbased drinking water supplies were provided, as defined by the World Health Organization (2021). Two algal bloom thresholds were categorized according to either cyanobacterial biomass or Chl-a concentration in the water sample: alert level 1 (300 µg l −1 of cyanobacterial biomass or 1 µg l −1 of Chl-a); and alert level 2 (4000 µg l −1 of cyanobacterial biomass or 12 µg l −1 of Chl-a). Thus, we used the entire posterior distributions to calculate the probability of exceeding the two standards, denoted as the proportion of exceeded MCMC samples (more computation details in text S2).

Model evaluation
The model performance was assessed via both calibration data (in-sample) and validation data (outof-sample). The regression model was evaluated with correlation-coefficient (R 2 ) and root-mean-squareerror (RMSE), which were the measures of the deviation of predicted values from the observed values, and calculated as: where n is the number of data points;ŷ i and y i are the ith predicted and observed values;ȳ is the mean of y i . The prediction performance on algal bloom stages was evaluated based on the confusion matrix in terms of accuracy and macro-F 1 score (F 1 ). The accuracy was normally used to accounts for the overall correct rates of classification, defined as: Accuracy = True Positives + True Negatives n where n is the number of data points. The F 1 statistic considered both the true rate and false rate of classification when measuring the overall accuracy, and was calculated using the precision and recall as: True Positives True Positives + False Negatives where B is the number of classes from the confusion matrix.

Distribution of algal biomass data
For each of the three algal datasets, the distributions of response variables (i.e. Chl-a concentration or cyanobacterial biomass) were drawn by the histogram method (figure 2). The detailed description was also listed in table S1, with the statistics of Shapiro-Wilk normality statistic (W), coefficient of Skewness (SK), and coefficient of Pearson's kurtosis (K). In general, over-skewed and over-dispersed characteristics were found in all three algal datasets, and a large proportion of outliers can be also observed due to the inclusion of numerous extreme values (figure 3). For instance, Chl-a concentration in Trout lakes (figure 3(a)) presented the most extreme skewed and leptokurtic features (highest SK value of 6.49, highest K value of 49.9), showing a violation of normality (lowest W value of 0.34). Moreover, although the excessive extreme values were well reduced in the logarithmic scale, the logarithmic distribution still presented asymmetry and multimodality ( figure 3(b)). This again justified the use of SMSN assumption for modeling the irregular algal samples.

Predicting algal biomass variations
Three sets of optimal modeling predictors were identified via stepwise regression procedures (table S2). In general, variables including pH, conductivity, water depth, water temperature, DO, and SDD showed stronger relevance to the algal variations, and pH and conductivity were the common inclusion of model predictors for three cases. In addition, the variance components of site-specific variation estimated by the model were nearly two orders of magnitude larger than the temporal variation, showing larger modeling uncertainties in spatial factors as compared to temporal factors (table S3). The predictive performance of SMSN models was shown in figure 3. Overall, calibration R 2 values ranged from 0.533 to 0.706, indicating that the SMSN model presented successful goodness-of-fit for the algal biomass data violating the normality assumption. When comparing the out-of-sample predictions, the SMSN models still achieved acceptable accuracy, with validation R 2 values ranging from 0.412 to 0.742. The three case studies indicated that the developed Bayesian SMSN models presented a reliable tool to predict algal variability.

Probabilistic prediction of algal blooms
The model prediction performance of algal bloom was shown in table 3. Results of all three probabilistic models presented accurate assessment with low false classification rates, as revealed by the calculated accuracy rate and F 1 score (accuracy >0.758 and F 1 > 0.725). Taking Trout lakes as an example, in the early alert level 1 ( figure 4(a)  by the model, achieving an overall accuracy rate of 88.1% and an overall F 1 score as high as 0.841. In addition, the probabilistic method achieved high classifying accuracy in the validation data as good as in the calibration process (figures 4(b) and (d)). Similarly, the other two models for Hangjiahu rivers and Madison lakes also suggested satisfactory performances (figures S2, S3 and table 3).
As a view of comparison, the classification performance of directly using point predictions was also listed in table 3. The results showed very comparable accuracy, however, when the linear models did not yield strong regression performance, the pointprediction-based classification presented high misjudgments for the actual exceedances of alert standards (e.g. Hangjiahu rivers in figure 4(a) and table 3). Moreover, the point-prediction-based classification resulted in more false positives and false negatives in all case studies, as indicated by its lower F 1 scores (table 3).

Discussion
In this study, we presented a promising Bayesian SMSN approach to predict algal biomass from the aquatic environment fluctuations. Differing from many of previous works, our model was conducted only based on the standard water physicochemical parameters (e.g. DO, pH and conductivity). Comparatively, this is an advantage over predictive models that require time-consuming predictors (e.g. nutrients), since these parameters can be rapidly measured in-situ with portable sensors and have been included in the basic monitoring of most waterbodies. Using the historical observations, our method can therefore facilitate future algal monitoring via achieving real-time and reliable biomass variation estimates. We also found the relatively strong effects of water temperature, conductivity, and DO on the algal variation (table S3), which was in line with many relevant studies (Cha et al 2014, Xiao et al 2019a, Haakonsson et al 2020, Liu et al 2023. This highlighted that the selection of such indicators that were closely related to the algae growth can be critical for future analogue modeling. To date, increasing studies have been aware of this importance. In Australia, Recknagel et al (2014) successfully predicted algal dynamics in three sub-tropical reservoirs with conductivity, turbidity, DO and water temperature; in China, an early warning system for phytoplankton blooms was developed based on in-situ automated online sondes in Xiangxi Bay (Ye et al 2014); and in South America, coastal cyanobacterial blooms were accurately predicted (R 2 = 0.82) from water temperature and conductivity conditions (Haakonsson et al 2020). Moreover, with the cheaper availability of sonde technologies in the future, using a simplified parameter approach could further cut down the costs of algal monitoring systems, which would greatly benefit aquatic environment management. Interestingly, our model worked well not only for lakes, but also for the riverine system. Compared to the relatively static lakes, the hydrological condition plays a more important role in algal distributions in river systems . The stream flow can cause unexpected changes to the relationships between algal biomass and environmental factors from site to site (Smith et al 1999, Jaiswal andPandey 2019). This high spatial variation may lead to a problem of Simpson's paradox in a crosssectional ecological modeling (Qian et al 2015), and often make the linear estimates of riverine algal variations less applicable (Cha et al 2016). In our preliminary analysis for Hangjiahu rivers (figure S1(a)), the low correlations between cyanobacterial biomass and water quality parameters posed great challenges to conducting linear models. Nevertheless, as presented in this study and many other previous works (Malve and Qian 2006  Level 1   pooling of such site-specific heterogeneity as external information can succeed in addressing this issue. The spatial hierarchical structure appears to be useful to improve the overall fitting ability of a model based on large-scale monitoring data. In addition, although spatial heterogeneity was substantially stronger than temporal heterogeneity for affecting algal biomass in our three cases, considering that the distribution of algal community usually has high variation in both spatial and temporal scales (Kolber and Falkowski 1995), temporal hierarchy is still important and need to care within a time-varying algal prediction model. In recent years, given the increasing monitoring of spatiotemporal scales for ecology science (Xiao et al 2014), there are broader applications of the proposed approach.
Under favorable conditions such as climate warming (Xiao et al 2019a(Xiao et al , 2019b, stable hydraulic status (Park et (table 3 and figure 3). Further, since the family of SMSN distribution conceptually allows for the possibility of outliers in the data distribution, namely that the SMSN regression model is robust to avoid parameter inference bias (Silva et al 2020). In practice, a robust and high fault-tolerance approach with powerful predictability could better support the decisionmaking works such as the bloom management.
We showed that the incorporation of probabilistic framework benefits the assessment of algal bloom stages, which successfully addressed the high false rate problem when employing regression pointprediction (table 3). For ecological studies, regression estimates are often useful for classification purposes. However, this utilization often tends to show appreciable false rates even it has good overall accuracy (Motamarri and Boccelli 2012), since regressionbased outcomes are typically point-wise and inevitably involve a large amount of uncertainties (Zhao and Kockelman 2002, Carstensen and Lindegarth 2016, Hutorowicz and Pasztaleniec 2021. The uncertainty may come from the spatiotemporal variations, inaccuracy and mistakes in measurements when collecting the source data (He and Kolovos 2018), or resulted from the statistical models (Carstensen and Lindegarth 2016). Nevertheless, the effect of uncertainty on data-driven ecological research receives less attention (Carstensen and Lindegarth 2016), though it has been informed that the uncertainty will propagate through the input to the output of models (Zhao and Kockelman 2002) and may bias a modeling analysis if without prior acknowledged (He et al 2020). Fortunately, in a Bayesian model, the overall uncertainties can be propagated forward to the entire PPDs via inference (McElreath 2018). The PPDs approximate the probability of true values within a creditable interval, offering a natural uncertainty assessment framework to the parameter and outcome estimates (Qian et al 2004, He andChristakos 2018). In our case, the PPDs of algal biomass estimates were applied for the probabilistic prediction of algal bloom stages. Using this method, all of the categorizations presented high accuracy even when the regression models performed poorly (figure 4 and table 3). Additionally, this development gave us direct information about the probability of water samples in exceeding certain alert standards, which can further be used as the scientific basis for the lake or river managers to build bloom-warning advisories.

Conclusion
This work presented a promising and efficient Bayesian probabilistic SMSN modeling technique, allowing for the real-time prediction of algal variations and in-situ assessments of algal bloom stages, which: (a) Required only basic physiochemical water quality parameters. (b) Had good prediction performance on biomass data having over-dispersed characteristics and containing a big proportion of extreme values. (c) Achieved robust prediction accuracy of algal blooms through combining probabilistic framework.
In the future, the modeling could be enhanced via involving more diverse predictor variables such as hydrometeorological and anthropogenic factors, which were limited by the dataset as shown in the current study.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.