THE APPLICATION OF TRINDEX TO PREDICT HARMFUL ALGAL BLOOMS IN LAKE TORMENT (NOVA SCOTIA, CANADA)

This paper introduces the Threshold Index (hereafter TRINDEX) for predicting Harmful Algal Blooms (HAB) in Lake Torment (Nova Scotia, Canada). TRINDEX is built on a logarithmic transformation, and the bloom thresholds are then established with a discrimination test, the Receiver Operating Characteristic (ROC) curve. Cohort studies are also presented to show how accurately TRINDEX predicts blooms in comparison with real observations.


Introduction
Harmful algal blooms (HAB) have become more prominent and frequent events in Canadian fresh waters [1]. HAB can cause damaging physical effects (through excess accumulation and the ensuing hypoxia in the water) or chemical effects (through toxin production) on the various organisms living in the waterbody and surrounding areas. It is also well known that HAB can negatively affect the environment and human and animal life through the food chain or via direct contact with water containing HAB. Eutrophication has long been considered the main cause of HAB, with intensive growth of cyanobacteria in freshwater bodies. Therefore, various indices, mostly based on nutrients, have been developed to explain how eutrophication can lead to HAB patterns. These include single and combined indices. Single indices, such as total phosphorus (TP), chlorophyll-a (Chl-a) or phycocyanin (PC) concentration, or Secchi disk measurements [2][3][4][5][6][7], are still very common in water management practice. However, they predict phytoplankton growth poorly under the real conditions of a waterbody. Some combined indices for HAB prediction are very specific to local environmental conditions. Moreover, few studies have focused on determining the thresholds at which HAB occur, except for [8], which predicted the risk of cyanobacterial dominance, from which a threshold could be identified.
Our index is based on the Trophic Index proposed by Vollenweider and co-workers [9] to characterize the trophic state of coastal marine water (Emilia-Romagna coast) of the Adriatic Sea, modified and adapted for freshwater media. The proposed index, called TRINDEX, is intended to estimate effectively the variation of cyanobacterial dominance in phytoplankton communities. Moreover, TRINDEX can assist in determining the threshold for cyanobacterial bloom onset because it incorporates phycocyanin, the pigment characteristic of cyanobacterial development.
In this paper, we first introduce the new indicator used to estimate potential bloom occurrence in freshwater bodies and then determine the threshold for bloom occurrence. Next, we use cohort studies to evaluate the accuracy of TRINDEX in bloom prediction. Samples and real bloom observations were obtained from Lake Torment (Nova Scotia). Our ultimate goal is to introduce a new and simple way to help predict HAB patterns via the determination of bloom thresholds, and then to continue applying TRINDEX in the long-term water management and monitoring programs of areas affected by HAB.

Study sites
Our research was conducted at Lake Torment in the province of Nova Scotia, as mentioned above. The location and geographical information are displayed in Fig. 1.
Lake Torment (LT) is located in Kings County (Nova Scotia, Canada). The lake depth varies from 1.0 to 6.5 meters, and around the lake there is a residential area with 250 cottages and leisure activities. In addition, the lake is surrounded by forest, with one Christmas tree farm nearby. According to the Kings Report [10], LT is dystrophic, with brown water, low pH, low carbonate levels and especially high organic content. HAB are identified annually and have recently intensified, reaching high densities. The research of Nguyen-Quang and collaborators [11] shows that the cyanobacterial strains in the LT ecosystem are mostly Dolichospermum flos-aquae, which appeared at random times between June and November. Microcystis sp. also appeared in the fall in mixed blooms with Dolichospermum flos-aquae.

Field Sampling Process and Lab Analysis
Water sampling was done at bi-weekly or monthly frequency from May through November during the 2016-2018 period, with 170 samples in total (statistical analysis is shown in Table 2). Water was taken at 0.5 m from the surface and at 1 m from the bottom at 10 different locations. Dissolved oxygen (DO) was measured directly in the field with a YSI probe (Professional Plus, Hoskin Scientific Ltd, USA). The other parameters (phosphate (PO4), nitrate (NO3), Chl-a and PC) were analysed at our lab at Dalhousie University.
Water samples were filtered through GF/A Whatman filters, which were used for the determination of pigment concentrations. The filtered water was then used for nutrient analyses. The used filters were extracted in 90 % acetone (for Chl-a) and in phosphate-buffered saline (for PC). For better extraction, sonication (50 % amplitude for 30 seconds) and a two-step centrifugation (at room temperature at 3500 g for 10 minutes, followed by a second centrifugation at 4 ºC and 13000 g for 1.5 hours) were performed. Chl-a and PC concentrations were measured with a Turner 10AU Fluorometer (Turner Designs, USA) based on a calibration standard curve for each pigment. The dissolved fractions of phosphate and nitrate were measured by a photometer using a tablet reagent system [12].

Mathematical formulation
Our goal in this section is only to summarize the main points of the TRINDEX concept and calculation. We do not show the details of the formula development here; readers are referred to [13] for these.
To deal with the non-normal distribution of most environmental data, the logarithmic transformation is a mathematically appropriate way to transform the data into an approximately normal form, as below.

\[
\mathrm{TRINDEX} = \frac{k}{n} \sum_{i=1}^{n} \frac{\log M_i - \log L_i}{\log U_i - \log L_i}
\]

where M_i is the measured parameter i; L_i is the lower limit (concentration) of parameter i; U_i is the upper limit (concentration) of parameter i; k is the scale factor giving the maximum value of the considered range (0 to 10), so k = 10 by default; and n is the total number of parameters M_i considered.
Two scenarios, TRINDEX1 and TRINDEX2, follow the research of Hushchyna and Nguyen-Quang [13]. Specifically, TRINDEX1 consists of PC, D%O, PO4 and NO3, while TRINDEX2 includes all the parameters of TRINDEX1 plus the pigment Chl-a.
The quantity (log U_i − log L_i) is the difference between the logarithms of the upper and lower limits. Once these limits are set, all values outside this range should be excluded. Therefore, to obtain a range that covers different trophic conditions, we used the limit of detection (LOD) as the lower limit and the maximum value measured for the considered variable as the upper limit.
Our experimental HAB data for Lake Torment are not normally distributed. The log transformation described above converts them towards a normal distribution so that TRINDEX can then be processed. The statistical software R, combined with SPSS, was used to carry out all steps, including those that follow.
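The calculation summarized in this section can be illustrated with a minimal Python sketch. It assumes the index is the k/n-scaled sum of the log-normalised terms (log M_i − log L_i)/(log U_i − log L_i), in line with the definitions above and the Trophic Index construction of [9]; the function name `trindex` and all numerical values are hypothetical and are not field data from Lake Torment.

```python
import math


def trindex(measured, lower, upper, k=10.0):
    """Scaled mean of log-normalised parameters (assumed TRINDEX form).

    measured, lower, upper: equally long sequences holding M_i, L_i, U_i.
    k: scale factor, 10 by default so the index spans 0 to 10.
    """
    if not measured or not (len(measured) == len(lower) == len(upper)):
        raise ValueError("need equally sized, non-empty parameter lists")
    total = 0.0
    for m, lo, up in zip(measured, lower, upper):
        # Values outside [L_i, U_i] are excluded, so reject them here.
        if not (0 < lo <= m <= up) or lo == up:
            raise ValueError("each value must satisfy 0 < L <= M <= U, L < U")
        total += (math.log10(m) - math.log10(lo)) / (
            math.log10(up) - math.log10(lo)
        )
    return k / len(measured) * total


# Hypothetical sample in the TRINDEX1 order: PC, D%O, PO4, NO3.
measured = [0.03, 20.0, 0.05, 0.5]
lower = [0.001, 1.0, 0.01, 0.1]   # stand-ins for limits of detection
upper = [1.0, 100.0, 1.0, 10.0]   # stand-ins for observed maxima

score = trindex(measured, lower, upper)
print(f"TRINDEX1 (illustrative) = {score:.3f}")
```

Because every term is normalised to the interval from 0 to 1, a sample sitting at all lower limits scores 0 and one at all upper limits scores k, which is why the choice of limits matters as much as the measurements themselves.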

Determination of the bloom thresholds
Because the pigment Chl-a is found in all algal species, including green and blue-green algae, as well as in other microplants, the pigment PC is presumably a better parameter to reflect the presence of cyanobacteria in the whole phytoplankton community. Therefore, we suggest that a PC concentration in the lake water greater than 0.03 ± 0.002 mg/L can be considered the onset of a bloom (PC criterion based on [5]), equivalent to a cyanobacterial cell count of 20,000 cells/mL. PC is therefore included in TRINDEX to evaluate the bloom patterns of cyanobacteria.
The onset of a bloom may appear as a visible bloom or scum, but this may not be stable. At onset (the critical phase), the surface bloom can appear and disappear within a short period of time, whereas the supercritical phase is more stable, with blooms or scums remaining visible for longer periods (many hours or many days). Depending on ambient conditions, the onset status can either develop into a 'stable bloom' or vanish completely.
As in clinical practice [14], where a 'yes or no' decision is usually required for a 'diseased or non-diseased' situation, two states are defined here for the bloom: 'yes' (bloom occurrence) and 'no' (no bloom).
Four possible outcomes can result for each trial: true positive, true negative, false positive and false negative. At this point, a cut-off is introduced to measure the discrimination, i.e. the ability of the TRINDEX test to classify correctly those with or without the 'disease' as a binary variable, which here corresponds to bloom occurrence (yes) or no bloom (no), respectively. ROC (Receiver Operating Characteristic) analysis is a binary discrimination test that assesses the predictive power of a binary classification system in a decision-making process and helps to identify the threshold. A ROC curve, however, contains numerous observed cut-off points, so identifying the best one strengthens the reliability of the model.
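The four outcomes can be tallied for one candidate cut-off as in the following sketch. It assumes that a sample is classified as 'bloom' when its index value is at or above the cut-off; the index values, observations and the cut-off of 4.8 are hypothetical illustrations, not Lake Torment data.

```python
def confusion_counts(scores, bloom_observed, cutoff):
    """Count true/false positives and negatives for a single cut-off."""
    tp = fp = tn = fn = 0
    for score, observed in zip(scores, bloom_observed):
        predicted = score >= cutoff  # assumed decision rule
        if predicted and observed:
            tp += 1
        elif predicted and not observed:
            fp += 1
        elif not predicted and not observed:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical index values and bloom observations.
scores = [3.1, 3.8, 4.2, 4.9, 5.3, 4.7, 3.5, 5.0, 4.6, 4.9]
bloom_observed = [False, False, False, True, True,
                  False, False, True, True, False]

counts = confusion_counts(scores, bloom_observed, cutoff=4.8)
sens, spec = sensitivity_specificity(*counts)
print(counts, round(sens, 3), round(spec, 3))
```

Repeating this tally for every candidate cut-off yields the pairs of sensitivity and (1 - specificity) that make up the ROC curve.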
The point with coordinates (0,1) is where sensitivity and specificity are both equal to 1.0. The best cut-off point (or optimal threshold point) can be determined by the minimum distance from the point (0,1) to the ROC curve [15] (see Fig. 2). Mathematically, the distance from the point (0,1) to any point on the ROC curve is calculated as below.

\[
d = \sqrt{(1 - S_n)^2 + (1 - S_p)^2} \tag{4}
\]

where S_n denotes the sensitivity and S_p the specificity. Based on formula (4), the best cut-off point is determined as below.

\[
d_{\min} = \min_{c} \sqrt{\bigl(1 - S_n(c)\bigr)^2 + \bigl(1 - S_p(c)\bigr)^2} \tag{5}
\]

where c ranges over all cut-off points on the curve. Once the ROC curve is drawn, the area under the curve (AUC) can be used to evaluate the overall performance of the discrimination test (Table 1). AUC may take values ranging from 0.5 (no discrimination) to 1 (perfect discrimination).
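The minimum-distance rule can be sketched in a few lines of Python. The sketch treats every distinct index value as a candidate cut-off, assumes a sample is positive when its value is at or above the cut-off, and measures the Euclidean distance from each ROC point to (0,1). The data and the function names `roc_points` and `best_cutoff` are hypothetical; the study itself used SPSS.

```python
import math


def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for each distinct score."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
        points.append((cutoff, tp / positives, tn / negatives))
    return points


def distance_to_ideal(sensitivity, specificity):
    """Distance from the ROC point (1 - Sp, Sn) to the ideal corner (0, 1)."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)


def best_cutoff(points):
    """Pick the ROC point closest to (0, 1)."""
    return min(points, key=lambda p: distance_to_ideal(p[1], p[2]))


# Hypothetical index values and bloom observations.
scores = [3.1, 3.8, 4.2, 4.9, 5.3, 4.7, 3.5, 5.0, 4.6, 4.9]
labels = [False, False, False, True, True,
          False, False, True, True, False]

cutoff, sens, spec = best_cutoff(roc_points(scores, labels))
d_min = distance_to_ideal(sens, spec)
print(cutoff, round(sens, 3), round(spec, 3), round(d_min, 3))
```

A real analysis would normally rely on an established ROC routine; the explicit loop is used here only to keep each step of the rule visible.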

Fig. 2. Finding the best cut-off point (optimal threshold) for the ROC curve

Table 1. The AUC criteria to evaluate the accuracy of a diagnostic test

Another approach to estimating the effectiveness of our test is the Youden index J [16], which can be defined as follows.

\[
J = \max_{c} \{ \mathrm{sensitivity}_c + \mathrm{specificity}_c - 1 \} \tag{6}
\]

where c ranges over all possible criterion values.
The Youden index J, ranging between 0 and 1, is commonly used to measure overall diagnostic effectiveness [17]. Values of J close to 1 indicate relatively good effectiveness, while values close to 0 indicate limited effectiveness.
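As a small illustration of the Youden index, the sketch below takes a list of (cut-off, sensitivity, specificity) triples and returns the largest value of sensitivity + specificity - 1 together with the cut-off at which it occurs. The triples are hypothetical and the function name `youden_index` is an assumption made for this example.

```python
def youden_index(points):
    """Return (J, cutoff), where J = max over c of (Sn_c + Sp_c - 1).

    points: iterable of (cutoff, sensitivity, specificity) triples.
    """
    return max((sn + sp - 1.0, cutoff) for cutoff, sn, sp in points)


# Hypothetical ROC points: (cut-off, sensitivity, specificity).
points = [
    (4.2, 1.00, 0.500),
    (4.6, 1.00, 0.667),
    (4.9, 0.75, 0.833),
    (5.0, 0.50, 1.000),
]

j, cutoff = youden_index(points)
print(round(j, 3), cutoff)
```

In this illustrative set the Youden index peaks at a different cut-off from the one closest to (0,1), which shows that the two criteria need not select the same threshold.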

Cohort studies applied for TRINDEX in Torment
Cohort studies [18] are a type of medical research used to investigate the causes of disease and to establish links between risk factors and health outcomes. The word cohort means a group of people and has its roots in military usage. These studies follow groups of people and can be prospective (forward-looking) or retrospective (backward-looking).
Prospective studies are planned and carried out over a future time period, while retrospective cohort studies look at data that already exist and try to identify risk factors for particular conditions. Inspired by this, we assume that the risk factors in our problem are represented by TRINDEX1 or TRINDEX2, and that the health outcomes are the bloom situations: no bloom (no disease) or bloom (disease). For more details on cohort studies, refer to [18].
A cohort study follows up two or more groups from exposure to outcome. In its simplest form, a cohort study compares the experience of a group exposed to some factor with another group not exposed to the factor. If the former group has a higher or lower frequency of an outcome than the unexposed group, then an association between exposure and outcome is evident [18]. TRINDEX1 is chosen for the cohort analysis in this research to predict the probability of blooms in Lake Torment. We use R with the package 'epiR' to perform our studies.

Descriptive statistics and normal distribution test for TRINDEX
Some basic statistical features of TRINDEX1 and TRINDEX2 are shown below. For TRINDEX1 (Table 2), the mean and median are nearly equal (3.93 and 3.92) and the skewness (0.956) lies within the range from -1.0 to +1.0. This variable is therefore likely to be normally distributed. For TRINDEX2, however, although the mean and median are close (4.09 and 3.96), the skewness (1.464) lies outside the range from -1.0 to +1.0, so this variable seems unlikely to be normally distributed. The histograms of the distributions of TRINDEX1 and TRINDEX2 are shown in Fig. 3. In both cases the histograms show a bell-shaped curve. However, to confirm whether these two datasets are normally distributed, Q-Q plots are examined for both.
In a Q-Q plot, if the data points lie close to a straight line, the dataset can be considered normally distributed. In contrast, if the data points are scattered far from the line, the dataset is not normally distributed. The Q-Q plots for TRINDEX1 and TRINDEX2 are displayed in Fig. 4; in each case the majority of data points lie very close to the diagonal line. Therefore, it can be concluded that the logarithmic transformation has converted our field data from Lake Torment (which are not normally distributed) into two quasi-normally distributed datasets, TRINDEX1 and TRINDEX2.
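The two screening steps described above, the skewness range and the Q-Q comparison, can be sketched with the Python standard library. The skewness estimator used here is the adjusted Fisher-Pearson form, which is an assumption about how the reported values were obtained, and the plotting positions (i + 0.5)/n are one common convention among several. The index values are hypothetical.

```python
from statistics import NormalDist, mean, median, stdev


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (assumed estimator)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


def qq_pairs(values):
    """Pair each ordered value with the normal quantile expected at its rank."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    return [
        (fitted.inv_cdf((i + 0.5) / n), v)
        for i, v in enumerate(sorted(values))
    ]


# Hypothetical index values, not the Lake Torment dataset.
values = [3.2, 3.5, 3.7, 3.8, 3.9, 3.9, 4.0, 4.1, 4.3, 4.6, 5.1]

skew = sample_skewness(values)
print(f"mean={mean(values):.2f} median={median(values):.2f} skew={skew:.3f}")
print("within -1..+1:", -1.0 <= skew <= 1.0)
for expected, observed in qq_pairs(values):
    print(f"{expected:6.2f} {observed:6.2f}")
```

Points whose observed values track the expected quantiles closely would fall near the diagonal of a Q-Q plot; large departures in the tails are what the visual check is meant to reveal.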

ROC Curves and threshold values to predict bloom
With the assistance of SPSS, the ROC curves for TRINDEX1 and TRINDEX2 were built as shown in Fig. 5. The yellow line is the reference line, below which the curve is not valid. The blue and green lines (TRINDEX1 and TRINDEX2, respectively) illustrate the trade-off between sensitivity (blooms that really occur, i.e. true positives) and (1 - specificity), i.e. non-blooming cases identified as positive (false positives).
Based on formula (5), TRINDEX1 has a cut-off value of 4.815 (d_min = 0.212) and TRINDEX2 has a cut-off value of 4.605 (d_min = 0.134). For both TRINDEX1 and TRINDEX2, the sensitivity at these cut-off points is lower than the specificity (79.2 % vs. 95.9 % for TRINDEX1, and 87.5 % vs. 95.2 % for TRINDEX2), which leads to the conclusion that both indices are likely to be appropriate for diagnosing the 'disease' (bloom) in water reservoirs.
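The reported minimum distances can be checked directly from the sensitivity and specificity quoted above, taking the distance to be the Euclidean distance from the ROC point to (0,1). The function name `distance_to_ideal` is hypothetical; the four percentages are the ones reported in the text.

```python
import math


def distance_to_ideal(sensitivity, specificity):
    """Euclidean distance from the ROC point (1 - Sp, Sn) to (0, 1)."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)


# Sensitivity and specificity as reported at the two cut-off points.
d_trindex1 = distance_to_ideal(0.792, 0.959)
d_trindex2 = distance_to_ideal(0.875, 0.952)

print(f"TRINDEX1: d = {d_trindex1:.3f}")
print(f"TRINDEX2: d = {d_trindex2:.3f}")
```

Both values round to the reported d_min figures (0.212 and 0.134), so the quoted cut-off points are consistent with the quoted sensitivity and specificity.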
In addition, the Youden index can be determined from formula (6). It is 0.73 for TRINDEX1 and 0.78 for TRINDEX2. Both values are close to 1.0, indicating the good effectiveness of the two indices.

Cohort studies
Cohort studies were carried out for Lake Torment as previously mentioned. The matrix considered in the study is defined by the number of bloom cases predicted by TRINDEX1 based on the PC criterion, versus the observed bloom cases during the period 2016-2018.
According to our data analysis, the observed bloom and no-bloom cases are shown in Table 4, and Table 5 displays the results of our cohort studies. In Table 5, the Outcome+ (bloom) and Outcome- (no bloom) columns summarize the input data for the calculation. The incidence risk (Inc. risk) is defined as the probability of risk (very high bloom possibility). Expose+ is the case of real observations, while Expose- is the case predicted using TRINDEX1. For Lake Torment, the incidence risk with TRINDEX1 (high-risk probability of blooms) is 14.1 %, while the real observations give 17.6 % (bold numbers).
The difference between the TRINDEX1 prediction and the observations is 3.53 % (defined as the attributable risk), with a 95 % CI of -4.23 % to 11.29 %.
In other words, using TRINDEX1 for bloom prediction gives a risk probability not far from the real observations (14.1 % versus 17.6 %), and the 95 % CI for this difference ranges from -4.23 % to 11.29 %.
Because the 95 % CI contains zero, extending from negative to positive values, this difference is not statistically significant, i.e. the difference in bloom risk between the two is compatible with random variation. In other words, the TRINDEX1 prediction is quite close to the real observations, which means the TRINDEX1 prediction for Lake Torment is quite accurate.
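The risk difference and its confidence interval can be sketched as follows. The study used the R package 'epiR'; this Python version is only an illustration using a Wald (normal approximation) interval. The counts of 30 observed and 24 predicted bloom cases out of 170 samples each are assumptions back-calculated from the reported 17.6 % and 14.1 % and the sample total, and are not stated in the text.

```python
import math
from statistics import NormalDist


def risk_difference(cases_a, total_a, cases_b, total_b, level=0.95):
    """Risk difference (a minus b) with a Wald confidence interval."""
    risk_a = cases_a / total_a
    risk_b = cases_b / total_b
    diff = risk_a - risk_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        risk_a * (1 - risk_a) / total_a + risk_b * (1 - risk_b) / total_b
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# ASSUMED counts (not given in the text): 30/170 observed, 24/170 predicted.
diff, low, high = risk_difference(30, 170, 24, 170)

print(f"attributable risk = {diff:.2%}, 95% CI ({low:.2%} to {high:.2%})")
print("CI contains zero:", low <= 0.0 <= high)
```

With these assumed counts the sketch gives values that round to the reported 3.53 % and the interval from -4.23 % to 11.29 %, and the interval spans zero, which is the basis for calling the difference not statistically significant.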

Conclusions
In this paper, the prediction of Harmful Algal Blooms (HAB) via TRINDEX uses a simple logarithmic transformation combined with a classification/discrimination technique and in situ sampling. TRINDEX is introduced as a new assessment tool for cyanobacterial bloom prediction. A state-of-the-art discrimination test (the ROC curve) was used to confirm the threshold values, from which bloom incidence can be forecast. It can be said that this is the first time cohort studies have been introduced into HAB research. They were used to establish links between risk factors (the parameters involved) and bloom outcomes in order to estimate the accuracy of TRINDEX. Further research on TRINDEX development and cohort studies for other waterbodies is envisaged.
All our efforts rely on a close relationship between observations and a simple model, leading to the development of a forecasting capability. Building an observation- and modelling-based framework would not only increase our knowledge of the fundamental ecological relationships needed to construct better models but also establish the structure of an early warning system, whose goal is the protection of livelihoods and water resources.