Meta-classification of remote sensing reflectance to estimate trophic status of inland and nearshore waters

Common aquatic remote sensing algorithms estimate the trophic state (TS) of inland and nearshore waters through the inversion of remote sensing reflectance (Rrs (λ)) into chlorophyll-a (chla) concentration. In this study we present a novel method that directly inverts Rrs (λ) into TS without prior chla retrieval. To successfully cope with the optical diversity of inland and nearshore waters the proposed method stacks supervised classification algorithms and combines them through meta-learning. We demonstrate the developed methodology using the waveband configuration of the Sentinel-3 Ocean and Land Colour Instrument on 49 globally distributed inland and nearshore waters (567 observations). To assess the performance of the developed approach, we compare the results with TS derived through optical water type (OWT) switching of chla retrieval algorithms. Metaclassification of TS was on average 6.75% more accurate than TS derived via OWT switching of chla algorithms. The presented method achieved > 90% classification accuracies for eutrophic and hypereutrophic waters and was > 12% more accurate for oligotrophic waters than derived through OWT chla retrieval. However, mesotrophic waters were estimated with lower accuracy from both our developed method and through OWT chla retrieval (52.17% and 46.34%, respectively), highlighting the need for improved base algorithms for low moderate biomass waters. Misclassified observations were characterised by highly absorbing and/or scattering optical properties for which we propose adaptations to our classification strategy.


Introduction
Eutrophication is the process whereby nutrient enrichment leads to excessive primary production of phytoplankton (cyanobacteria and algae) in water bodies (Conley et al., 2009;Smith et al., 2006). The main causes of eutrophication are non-point pollution from agricultural practices, urban development and energy production and consumption (Glibert et al., 2005;Mainstone and Parr, 2002). Increasing frequency and extent of phytoplankton blooms can have implications for ecosystem services and health (Heisler et al., 2008;Lewis et al., 2011;Nixon, 1995). In affected waters, cyanobacteria may produce cyanotoxins which adversely affect human and animal health (Codd, 2000;Merel et al., 2013).
Naturally, lentic waters such as lakes are significant emitters of the greenhouse gases carbon dioxide (CO 2 ), nitrous oxide (N 2 O) and methane (CH 4 ) (Cole et al., 2007;DelSontro et al., 2018). Enhanced eutrophication due to anthropogenic climate change is expected to increase aquatic CH 4 emissions from lentic waters by 30 -90% over the next century (Beaulieu et al., 2019;Tranvik et al., 2009). Over the last decades, several frameworks have been developed to assess and manage eutrophication. Carlson (1977) proposed a Trophic State Index (TSI) linking transparency (Secchi disk depth (zSD [m])), surface phosphorus (P [mg/l]) and phytoplankton chlorophyll-a (chla [mg/m 3 ]) concentrations to the trophic state (TS) of lakes. The index partitioned TS into three classes: oligo-, meso-and eutrophic. In later work Carlson and Simpson (1996) introduced an additional TS class (hypereutrophic) to include extreme biomass scenarios. More recently, other parameters linked to water optical properties, such as turbidity (NTU) and colour scales, were employed for the retrieval of TS (Binding et al., 2007;Lehmann et al., 2018;Wang et al., 2018).
Of the aforementioned TSI parameters, in situ measurements of chla are most frequently used to estimate TS. Chla is a reliable proxy directly for phytoplankton biomass and indirectly for primary production (Carlson, 1977;Huot et al., 2007;Kasprzak et al., 2008). In situ derived chla is a core indicator in monitoring programs such as the European Water Framework Directive or the U.S. Clean Water Act (Carvalho et al., 2008;Keller and Cavallaro, 2008;Søndergaard et al., 2005). While the extraction of chla from in situ collected water samples has few, and likely low, associated uncertainties, this monitoring approach cannot be scaled up to include remote sites and short-lived phytoplankton bloom phenomena (Schaeffer et al., 2013;Tyler et al., 2016). Aquatic remote sensing complements in situ measurements for the estimation of surface water concentrations by providing a spatial and temporal observation advantage (Mouw et al., 2015).
In aquatic remote sensing the inherent optical properties (IOPs, i.e. absorption, backscatter and fluorescence) of water and the optically active constituents (OACs), namely phytoplankton pigments (ϕ(λ)), nonpigmented particles (nap(λ)) and the absorption by the chromophoric fraction of dissolved organic matter (a cdom (λ)[1/m]), impact the remote sensing reflectance (Rrs (λ, sr − 1 )) vector (Gordon et al., 1988;Morel and Prieur, 1977). Rrs (λ, sr − 1 ) is defined as the ratio of water-leaving radiance L w ( μW cm − 2 sr − 1 nm − 1 ) to total downwelling irradiance E d ( μW cm − 2 nm − 1 ) : Rrs (λ) is thus the critical optical property to derive information from a water body about OACs dispersed in the water column (O'Reilly et al., 1998). The retrieval of phytoplankton chla concentration, or the phytoplankton absorption component, a ϕ (λ)[1/m], can be expressed as a function estimation problem that requires inversion of Rrs (λ): whereby x is the quantity to invert Rrs (λ) for Siegel (1997, 1995). The inversion of Rrs (λ) is known to be mathematically ill-posed, as multiple combinations of IOPs can result in the same Rrs (λ) vector and may thus cause ambiguity in the inversion (Defoin-Platel and Chami, 2007;Sydor et al., 2004). OAC compositions and concentrations strongly vary across inland and nearshore waters, thus accurate modelling of Eq. 2 has led to the development of numerous chla retrieval algorithms over the past decades (see reviews by Blondeau-Patissier et al., 2014, Matthews, 2011, Odermatt et al., 2012, Tyler et al., 2016. Chla retrieval algorithms may be divided into two categories: empirical and semi-analytical. As the name implies, algorithms of the former category are based on empiricism, in which a functional relationship between an OAC and the optical Rrs (λ) vector is established from field observations and domain knowledge. Popular examples are the Fluorescence Line Height (FLH) (Gower et al., 1999), the Maximum Peak-Height (MPH) (Matthews et al., 2012) and Maximum Chlorophyll Index (MCI) (Gower et al., 2005) algorithms, which use band arithmetic to relate spectral phenomena associated with phytoplankton to the concentration of chla.
Machine learning (ML) algorithms also belong to the empirical category. Typically, ML algorithms are based on non-linear regression models developed with large datasets consisting of field and/or simulated observations Pahlevan et al., 2021). Regression approaches can also be used to retrieve IOPs such as a ϕ (λ) (Craig et al., 2012). Retrieved a ϕ (λ) is then scaled to chla concentration.
Algorithms of the second category, semi-analytical solution algorithms (SAA), invert Rrs (λ) for IOPs . SAA base the retrieval on physical reasoning, but partly employ statistical methods (hence the term 'semi'). In the inversion for a ϕ (λ), SAA show many variants and differ in their definition of the a ϕ (λ) spectral shape, the method to calculate the magnitude of a ϕ (λ) and the defined relationship between Rrs (λ) and a ϕ (λ).
The scaling of a ϕ (λ) to chla derived from SAA or regression approaches can be significantly confounded in optically complex inland and nearshore waters due to pigment packaging and the contribution of accessory pigments to absorption (Bricaud et al., 1995;Simis et al., 2007). Unless this variability is accounted for, non-linear effects in the relationship between a ϕ (λ) and chla will also affect TS estimation.
To reduce retrieval errors Rrs (λ) can be assigned into previously defined and distinct optical water types (OWTs) (Moore et al., 2014;Spyrakos et al., 2018). OWTs are then utilised to guide the retrieval, since a single chla algorithm in practice often shows limited accuracy across a range of OWTs. OWT switching and blending of several algorithms following prior classification into known OWTs has become established practice (Eleveld et al., 2017;Neil et al., 2019).
Whether empirical or SAA algorithms are included in an OWT scheme or stand-alone, they estimate the TS of a water body indirectly by inverting Rrs (λ) for chla or by scaling a ϕ (λ) to chla. The retrieved concentration then indicates a TS class. In a recent study, Shi et al. (2019) outlined that significant uncertainties may propagate into TS estimation due to the limited precision associated with inversion for chla. To overcome intermediate chla retrieval when TS information is ultimately required, Shi et al. (2019) developed an approach that directly relates the light absorption coefficient of OACs to TS using the quasi-analytical algorithm (QAA) by Lee et al. (2002).
In this study, we develop a methodology to overcome issues associated with indirect TS derivation through inversion of Rrs (λ) into chla (or a ϕ (λ)). To accomplish this, our method inverts for TS classes directly through modelling of the TSI system as a classification task. To retrieve TS classes instead of a chla concentration value a classification algorithm is required. For the classification of TS we establish a relationship between the Rrs (λ) vector and a TS class through an in situ dataset (n = 2184) of co-located chla and Rrs (λ) measurements.
We recognise the limited validity of a single algorithm for the information retrieval across many OWTs. However, instead of common switching or blending of algorithms through OWT schemes, we stack multiple classification models and combine their TS class predictions through a higher-level classifier.
To illustrate classification of TS, we define our dataset in this study as D = {(y i , x i ), i = 1, …, N}, where y i is the TS class and x i is a vector representing the Rrs (λ) values of the i-th instance. Examples of vector classification algorithms (classifiers) are decision trees, support vector machines, neural networks and k-nearest neighbours (Ham et al., 2005;Mou et al., 2017). The aim of vector classifiers is to learn statistically meaningful patterns of observations through the minimisation of a defined loss criterion (Vapnik, 1999). In practice, different statistical approaches overlap because they learn the same properties for a given Rrs (λ) vector. Each classifier model is the result of a statistical learning process generating unique class decision boundaries. Therefore, while one algorithm may fail to correctly predict the true TS class, another may succeed (Polley and van der Laan, 2011;Ting and Witten, 1999). The combination of individual learners is the baseline for ensemble learning. The idea explored here is to construct a strong single learner from several weak learners.
Two of the most popular ensemble methods are bagging (Breiman, 1996) and boosting (Freund and Schapire, 1996). Bagging combines the predictions of weak learners using different bootstrap samples of the training set. Boosting sequentially trains a series of weak learners with weighted versions of the training set based on the performance of previously constructed learners. Wolpert (1992) proposed a linear combination of individual models to form an ensemble and named it "stacked generalisation" also known as "stacking". van der Laan et al. (2007) have extended the original stacking approach with a cross-validation framework and coined it "Super Learner". The classification framework presented here is based on the concept by van der Laan et al. (2007). In the context of this research, we call this higher-level classification algorithm the "meta-classifier". The meta-classifier acts as a meta instance, learning from the decisions of each individual model to predict TS classes for Rrs (λ).
In this study we develop an algorithm for the direct classification of Rrs (λ) into TS. To successfully cope with the optical diversity of inland and nearshore waters, we explore the concept of stacking classifiers in a meta-learning scheme. We evaluate the method on the multispectral resolution of the Sentinel-3A Ocean and Land Colour Instrument (OLCI) for 49 inland and nearshore sites. An established practice is to estimate TS through inversion of Rrs into chla, whereby multiple chla algorithms may be combined in an OWT scheme to address in-water optical complexity. The developed meta-classifier involves multiple classification algorithms, thus we assess the utility of our approach through comparison with TS derivation via OWT switching of several chla retrieval algorithms.

Methods
In situ bio-optical data were sourced from LIMNADES (Lake Biooptical Measurements and Matchup Data for Remote Sensing: https:// limnades.stir.ac.uk/). We assembled a subset of LIMNADES with 2751 samples of co-located in situ chla and hyperspectral Rrs (λ) measurements. The datasets and measurement protocols of OACs and the derivation of Rrs (λ) are described in Spyrakos et al. (2018) (see Table 1 herein). Rrs (λ) in all datasets were measured just above the water surface (0 + ). We did not apply an additional correction to Rrs (λ), assuming the measurements were made under optimal viewing angles and quality controlled during the original study. For brevity, we omit the wavelength-dependency for the remainder of the paper.
For the development of the classification method two independent datasets were created: one for training the classification algorithms and one for evaluating their performance. A priority in the development process was to avoid the allocation of observations from the same water body to both training and test datasets. Mixing or randomising the in situ measurements across both datasets would introduce knowledge about the water bodies included in the test set to the classification algorithms in the training stage and prevent an independent evaluation of the proposed method. We therefore split the entire dataset using the first letter of the water bodies names: A-D (n = 567, 20%) for testing and E-Y (n = 2184, 80%) for training.

Radiometric data pre-processing
The in situ hyperspectral Rrs measurements from the training and test datasets were spectrally resampled to the multispectral band configuration of Sentinel-3A OLCI, normalised and subsequently assigned a TS class (Fig. 1). These steps are detailed below.
Resampling of the hyperspectral data was necessary to combine Rrs from multiple sources and sensors. The spectral data were convolved to the spectral response function (SRF) of OLCI because the observation frequency, spectral sensitivity and resolution of this sensor are relevant to observe the TS of inland and nearshore waters (Kravitz et al., 2020). To convolve Rrs for each OLCI band i, we calculated the values of the spectral albedo for the bands: where (λ) is the wavelength, (λ i ) is the center wavelength in the i-th spectral band, ϕ i (λ) is the SRF of the i-th spectral band, (λ 1 , λ 2 ) are the boundary wavelengths of the considered spectral range, R is the spectral albedo and R(λ i ) is the mean spectral albedo in the i-th spectral band. The mean albedo values within the bands represent the spectral albedo seen by the sensor and are often also called the spectral signature of the viewed surface. We note that the convolution was performed on Rrs rather than the mathematically correct L w and E d measurements. Any effects are considered negligible at the 10-nm bandwidth of OLCI (Burggraaff, 2020). Several previous optical classification studies classified untreated Rrs, which is sensitive to both amplitude and spectral shape (Lee et (Binding et al., 2008;Binding et al., 2010;Binding et al., 2013;Bradt, 2012;Dall'Olmo et al., 2003;Dall'Olmo et al., 2005;Dall'Olmo and Gitelson, 2006;Gitelson et al., 2007;Gitelson et al., 2008;Gurlin et al., 2011;Li et al., 2013;Li et al., 2015;Schalles, 2006 (Bradt, 2012;Dall'Olmo et al., 2003;Dall'Olmo et al., 2005;Gitelson et al., 2008;Li et al., 2013;Li et al., 2015;Moore et al., 2014;Schalles, 2006) USTIR 29 Lake Balaton (Hungary) (Riddick et al., 2015) UL 14 Lake Bogoria (Kenya) (Tebbs et al., 2013) particularly relevant to optically complex water bodies, the reflectance vector has been normalised prior to the classification Mlin et al., 2011;Spyrakos et al., 2018;Xi et al., 2015;Xi et al., 2017). The amplitude of Rrs is strongly influenced by particulate (back-) scattering, while the shape is primarily affected by a ϕ (λ), a cdom (λ) and the absorption by non-pigmented particles (a nap (λ)[1/m]) (Roesler et al., 1989). In this study we followed the prior normalisation approach of Rrs to emphasise the shape: where r n (in units of nm − 1 ) indicates the normalised spectrum obtained through trapezoidal integration between λ 1 (412 nm) and λ 2 (753 nm) (Mlin and Vantrepotte, 2015). The resampled and normalised reflectance datasets are displayed in Fig. 2. To enable the inversion for TS we established a relationship between Rrs and the TS classes. We used the TSI definition by Carlson (1977) and the extension made for hypereutrophic waters by Carlson and Simpson (1996). The TSI definition provides a logarithmic model to interpret chla concentrations as indicators for TS classes. We separated our Rrs dataset into four reflectance TS classes (oligo-, meso-, eutro-and hypereutrophic) based on the chla concentrations measured from co-located in situ water samples ( Table 2). The compiled dataset covers a wide range of bio-optical conditions and trophic states found across inland and nearshore waters. Fig. 3 displays the logarithmic distribution of chla [mg/m 3 ], total suspended matter (TSM [g/m 3 ]), and a cdom (443) [1/m] for the training and test sets. The minimum and maximum values of the OACs are given in Table 3. The training set was used multiple times throughout the classification scheme, whereas the resultant classification algorithms were applied once to the test set for evaluation.

Meta-classification of remote sensing reflectance
In our dataset D = {(y i , x i ), i = 1, …, N}, y i represents the trophic class values and x i the reflectance vector values of the i-th instance. To classify an instance, a library L with k = 4 base classification algorithms (base-classifiers), i.e. L = {k 1 , k 2 , …, k 4 }, was created. The library is a collection of vector classifiers. We invoked the k-th base-classifier in L to predict the class for each instance x i , along with its true TS classification y i . Combining these predictions along with the true trophic class vector led to a new dataset, the level-zero data. The level-zero dataset was treated as the training ground for a new learning problem subsequently solved by an additional classification algorithm, the meta-classifier.

Base-classifiers
We trained the meta-classifier with the predictions of four baseclassifiers characterised by different statistical assumptions: eXtreme Gradient Boosting (XGBoost), LightGBM (LGBM), Naïve Bayes (NB) and a neural network (NN). The classifier assumptions and the training procedure, involving the stacking of base-classifiers to fit the metaclassifier, are as follows.
The first two classifiers are XGBoost and LGBM (Chen and Guestrin, 2016;Ke et al., 2017). Both classifiers have their statistical origin in gradient boosting machines (GBM) combined with decision trees as base-learners (GBDT) (Freund and Schapire, 1997;Friedman et al., 1998;Friedman, 2000). GBDTs create new models sequentially to provide more accurate estimates of the target variable. The principle is to construct new learners that focus on weak areas already learnt, or in statistical terms, construct learners correlated with the negative gradient of the used loss function (Natekin and Knoll, 2013). For reviews on boosting algorithms see Bühlmann and Hothorn (2007), Schapire (2003). The prediction of the XGBoost algorithm at each iteration t is based on the defined objective function Ĵ : where and The objective function Ĵ (t) consists of two parts: L(θ) and Ω(θ). θ describes the parameters in the equation. L(θ) is a differentiable convex loss function that measures the difference (residual) between a class prediction ŷ i and y i at the t-th iteration. The goal of the optimisation process is to construct a tree structure that minimises a loss function in each iteration. The updated tree structure in each iteration learns from the previous tree's model decision and uses the residuals to fit a new residual tree. To construct models that generalise and avoid overfitting, Eq. 5 denotes a regularisation term Ω(θ). T in Eq. 7 is the number of terminal nodes in a tree and γ the learning rate (between 0 -1). γ is multiplied by T to enable tree pruning. Terminal nodes and the learning rate are hyper-parameters that we optimised in separate steps (see where λ is an additional regularisation parameter, and w j enables to control the weights of the tree leaves (Goodfellow et al., 2016). Ω(θ) prevents overfitting and allows a better generalisation of the constructed model.
In this study we used the multi-class logarithmic loss function (mlogloss) for both XGBoost and LGBM. Mlogloss measures the performance of the models with an output probability value between 0 and 1 and increases when the predicted probability diverges from the actual class label: where N is the number of observations, M the number of TS class labels, y ij a variable with the predicted class label and p ij is the classification probability output by the classifier for the i-th instance and the j-th label (Bishop, 1995;Hsieh, 2009). Solving Eq. 8 becomes challenging and computationally demanding. Therefore, Eq. 8 is transformed using a second-order Taylor expansion (Bishop, 2006). The transformation allows the objective function to depend only on the first and second order derivatives of a point in the loss function, also speeding up the process. The main difference between XGBoost and LGBM is the tree construction process. Both classifiers can capture highly non-linear feature-target class relationships. The models can be precisely controlled by tuning a set of hyper-parameters. In addition, each classifier can be trained on both small and large datasets making them suitable for any given classification task. The third classifier in our ensemble is a Naïve Bayes (NB) probabilistic model based on the Bayes' Theorem: where P(y i ⃒ ⃒ x) is the conditional probability that a reflectance spectrum x belongs to a trophic class y i . The Bayes' rule specifies how this conditional probability can be calculated from the features (wavelengths) of the reflectance vectors of each trophic class, and likewise the unconditional probability (Lewis, 1998). The NB classifier calculated the probability of each trophic class for a given reflectance and output the trophic class with the highest one. We specifically wanted to include the assumption that for some reflectance spectra independent, single wavelengths are dominant, and hence strongly influence the class assignment. Since the wavelengths of multispectral reflectance vectors are at least partly correlated, the NB assumption is naive. In our ensemble of classifiers, NB is one of several base-classifiers generating a TS prediction. Following the theory of stacked generalisation, the metaclassifier should recognise when the NB assumptions apply through evaluation of the predictions for each test reflectance. In case the NB assumptions hold, high prediction accuracies are expected and thus the meta-classifier could prioritise the NB predictions in the decisionmaking to generate a final TS estimate. In case the NB assumptions do not apply, the accuracies will be low and lead to higher influence on the meta-classifier of other, more accurate base-classifiers. As the fourth base-classifier we used a NN. NNs have shown success

Fig. 2.
Hyperspectral in situ data (top row) and the resulting Sentinel-3A OLCI resampled and normalised multispectral reflectance spectra (bottom row) for both training (green) and test measurements (blue).

Table 2
TSI classification after Carlson and Simpson (1996) and assigned reflectance spectra per class in our training and test sets. n is the number of observations. across a diverse set of waters due to their aptitude to approximate nonlinear input-output functions (Brockmann et al., 2016;Doerffer and Schiller, 2007;Hieronymi et al., 2017;Ioannou et al., 2013;Krasnopolsky et al., 2002;Krasnopolsky et al., 2018). In this study, a NN with one layer and multiple hidden neurons h j was trained. The output of the NN for the test dataset was given by: with where x i and y are input and output vectors, respectively; w j and w j are weights, a j and â are fitting parameters and ϕ is the non-linear hyperbolic tangent activation function (Hsieh, 2009). In Eq. 10, n is the number of the non-linear activation function, ϕ in Eq. 11. The defined objective function Ĵ (t) (Eq. 5) for the NN minimised the mlogloss function (Eq. 8) and the regularisation term Ω ( θ ) = 1 2 ‖ w ‖ 2 2 , also known as weight decay. The NN was trained using backpropagation (Goodfellow   , 2016). For the multi-class output layer, we used the standard softmax function (Bishop, 2006). It is worth noting that the NN with one layer can be considered shallow, whereas it is becoming more common to use "deeper" NNs characterised by more layers. We could have added additional layers or used a more advanced architecture such as a mixture density network (MDN), as recently demonstrated for inland and coastal waters in Pahlevan et al. (2020). However, the intent of this paper is to present metaclassification, and not to showcase various NN architectures. Further, a deeper NN increases training time and requires more in situ measurements which are naturally limited. For this application, we therefore opted to keep the NN architecture as basic as possible.
Our library L consists of a diverse set of base-classifiers and underlying statistical models that forwarded the unique information learnt about the reflectance spectra, constituting each trophic class, to the meta-classifier.

Meta-classifier
The meta-classification training and prediction procedures were multi-step processes, as we illustrate schematically in Fig. 4. All classifiers were trained using a v-fold cross-validation scheme with v = 5. Cross-validation enables performance assessment of the classification algorithms during the training process (Schaffer, 1993). The process of training the meta-classifier on the predictions of the base-classifiers in our library L = {k 1 , k 2 , …, k 4 } was as following: 1. We split the reflectance training set into 5 exclusive folds of n/v = 2184/5 ∼ 437.

For each fold
(a) Reflectance spectra in fold v i were validation data (hold-out set), while the remaining observations (80% of the reflectance spectra) constituted the training set. Each base-classifier was fit on the training set. (b) With each base-classifier we predicted the outcome ŷ i for each reflectance instance x i in a validation set v i . The resulting loss of each base-classifier was estimated between the true outcome y i and its prediction ŷ i for all reflectance spectra.
(c) For each classifier, the v-estimated loss rates over the v-validation sets were averaged resulting in the cross-validated loss. For each reflectance the model of the respective base-classifier with the smallest cross-validated loss was selected for subsequent use by the meta-classifier.
Combined with the true trophic class y i , we stacked the crossvalidated predictions made on the training set of each base learner to generate a vector of level-zero predictions: pzero train = {(p i , x i ) = 1, …, N}. This important step constituted the training set of the metaclassifier, where each feature in pzero train was a single prediction of the base-classifiers. For each prediction, we knew the true outcome y i and hence provided the meta-classifier the necessary training data. The meta-classifier learnt which of the base-classifiers predicted the true trophic class y i for each training reflectance. We used a separate NN as the meta-classifier. We selected a NN because of its high approximation capability to learn the non-linear decision boundaries necessary to distinguish between the base-classifier predictions. The task of the metaclassifier NN was to select the most accurate base-classifier for each reflectance and assign a final trophic class. The training procedure of the meta-classifier required the level-zero predictions pzero train .
In the application, each base-classifier in L predicted a trophic class for each reflectance in our test set, resulting in a vector with level-zero test predictions: pzero test = {(p i , x i ) = 1, …, N}. These predictions were then stacked and provided to the meta-learner NN to estimate a final trophic class for each test reflectance.
In this study we utilised the open-source ML-ensemble Python library that interfaces with Skicit-Learn (Flennerhag, 2017;Pedregosa et al., 2011).

Hyper-parameter optimisation
We optimised the learning process of the considered classifiers through hyper-parameter optimisation (HPO). Given a learner A of any of the base-classifiers k in our library L, hyper-parameters were exposed λ x Λ. Tuning hyper-parameters changed the way model A learnt to correctly classify training reflectance spectra in the dataset D. For example, a hyper-parameter of the base-classifiers XGBoost and LGBM limits the maximum depth of the constructed tree. Further, the NNs Fig. 4. Schematic diagram of the training and application processes included in the meta-classification framework. require a selection of neurons in a layer (as opposed to weights that are model parameters learnt during training). Mathematically, HPO can be represented as: where f(x) is the objective function to minimise (or maximise) -such as the mlogloss -and x * is the lowest (or highest) value of a function for a set of hyper-parameters we have drawn from a domain X . In practice, X was a previously defined grid of hyper-parameters. f is a black-box function and has a large set of hyper-parameter combinations that are computationally costly to evaluate. The search that optimises f is often either manually performed or accomplished by selecting randomly from a set of hyper-parameters. Another option is to search through a grid consisting of a substantial combination of all possible hyper-parameter configurations (Bergstra et al., 2011;Bergstra and Bengio, 2012;Thornton et al., 2012). Because our meta-classification approach involved the training of classifiers with several hyper-parameters, a manual, random or grid search approach was considered unpractical. These search approaches are time intensive and susceptible to missing an optimal hyper-parameter configuration. In this study, we instead followed the concept of Bayesian optimisation (Jones et al., 1998;Streltsov and Vakili, 1999). As in the Naïve Bayes classifier, Bayesian optimisation is based on the Bayes' Theorem stating that the posterior probability (or hypothesis) M of a learner (or model) A given data points D is proportional to the likelihood of D given M multiplied by the prior probability of M: Bayesian optimisation methods can be understood as a sequence process that builds a probabilistic model by keeping track of past evaluation results (Brochu et al., 2010). A probabilistic model is build by mapping hyper-parameters to a probability of a score on the objective function f.
One can represent this model as P(C|x), where C is the score for each hyper-parameter x. In literature, the model P(C|x) is called utility (or surrogate) function u. In this study, we built the model with a Gaussian Process (GP) as the prior probability model on f (Rasmussen and Williams, 2005). GPs have become a standard in Bayesian optimisation (Martinez-Cantin, 2014;Snoek et al., 2012). The surrogate function u was then optimised for and based on the posterior distribution for sampling the next set of hyper-parameters (x t+1 ): To find the next best point to sample f from, a point was chosen that maximised an acquisition (or selection) function, here the expected improvement (EI): where x is the current optimal set of hyper-parameters. Maximising x was the objective to improve upon f most. f was continually evaluated against the true objective function until a defined maximum of iterations was reached. By continually updating the surrogate probability model, Bayesian reasoning led to reliable results. The next set of hyperparameters was selected based on the previous performance history instead of a costly grid search through all possible hyper-parameter combinations. Several libraries exist that implement Bayesian optimisation. Here we used Scikit-Optimize, as it was built on top of Scikit-Learn (Head et al., 2018).

Optical water type switching to derive trophic status
To assess the performance of the developed meta-classifier, we compare it against derivation of TS through OWT switching of chla retrieval algorithms (see Fig. 5).
Our OWT switching approach is based on three previous studies. First, the OWTs for all Rrs in our training and test datasets are available from Spyrakos et al. (2018) (Fig. 6). Second, our dataset is almost identical (98% common) to the one used in the study by Neil et al. (2019) (n = 2807, compared to n = 2751 herein). Neil et al. (2019) assessed the performance of 48 chla algorithms on their dataset, resulting in one best-performing algorithm per OWT (see Table 5 therein). Since the datasets of the two studies are nearly identical, the performance results of Neil et al. (2019) are considered valid for the dataset of the present study. Third, Neil et al. (2019) recommended chla algorithms for groups of OWTs when applied to independent, unknown data (such as the test set herein). We slightly modified the selection of algorithms, based on recent performance evaluations from the European Copernicus Global Land Service (Simis et al., 2020). Four chla algorithms were thus assigned to groups of OWTs (Table 4). Restriction of the OWT scheme to four chla algorithms increases the quality of the exercised comparison, as the meta-classifier is equally based on four base-learners.
For the retrieval of chla through OWTs two approaches exist. The first approach uses the most dominant/highest OWT membership score a Rrs received in the clustering process performed in Spyrakos et al. (2018). Chla is then retrieved with an assigned algorithm per OWT. In the present study we refer to this approach as OWT switching of chla algorithms. The second approach utilises the highest n OWT membership scores per Rrs to retrieve chla with n algorithms. The n retrieved chla values are then weighted to reflect the OWT membership scoring, resulting in a blended chla value (Moore et al., 2014).
The blending procedure varies depending on the value of n and the definition of the weighting function. Since the largest impact on the chla retrieval originates from the algorithm chosen for each OWT, we simplified the process and utilised the highest OWT membership score per Rrs.
The meta-classifier was trained with 80% observations of the overall dataset. The coefficients of the chla algorithms included in the switching scheme were thus re-calibrated solely with measurements included in the respective OWT group of the training data set. For example, in our OWT switching scheme, the OC2 algorithm was assigned for OWTs 3, 9, 10, 13, which combined constitute 511 observations in the training set. Using these OWT group measurements of the training set, the coefficients of the OC2 fourth-order polynomial were estimated using nonlinear least squares fitting. As a result of the dataset split into independent training and test sets (which did not take into account OWT memberships), all measurements included in OWT 13 were assigned to the training set. Therefore the OC2 algorithm could only be applied to measurements of OWTs 3, 9 and 10 included in the test set.
We did not re-calibrate the coefficients of the Gons and Gilerson 2band algorithms, as the number of required chla-specific absorption (a * ϕ (λ)[m 2 g − 1 ] and backscatter [1/m] measurements included in the respective OWT training groups was low and thus insufficient for this purpose.
Chla from a ϕ (665)[1/m] retrieved by QAA Mishra was estimated using the equation by Bricaud et al. (1998): where a and b were calibrated with training data included in OWT group 7.
Unlike the meta-classifier that operated on normalised Rrs (see Section 2.1), the chla retrieval algorithms were applied to nonnormalised Rrs at corresponding OLCI wavelengths.
Each algorithm retrieved chla for the test Rrs measurements contained in the assigned group of OWTs. TS was subsequently derived from the retrieved chla concentration according to the TS class ranges depicted in Table 2.

Performance evaluation
To evaluate the base-classifiers independently of the meta-classifier and vice versa, we compared them to a separate support vector machine (SVM) classifier (Cortes and Vapnik, 1995). Identical to the baseclassifiers, we used the same training and test sets and procedures to train the SVM.
For the evaluation of TS classifications either through metaclassification or derived via conventional chla retrieval, we calculated the following metrics: 1. Overall Accuracy (OA). Accuracy of all reflectance instances x correctly classified into each of the four TS classes M, divided by the total number of test samples (n = 567): 2. Average Accuracy (AA). Average classification accuracy across all four trophic classes: 3. Kappa Coefficient (Kappa). Percentage agreement corrected by the level of agreement that could be expected due to chance alone: where p 0 was the accuracy and p e was the probability of agreement by chance (Cohen, 1960;Congalton, 1991).
OA and AA are not equal because each trophic class holds a different number of samples. Because of the dataset split procedure, the dataset suffers from an imbalance between the classes (see Table 2). Using OA alone lacks precision because the smaller number of samples in the trophic classes 1 and 2 have less impact on the final accuracy score than class 3 (eutrophic). Hence, we calculated AA for all classification models. Because the eutrophic class has the highest amount of samples, large differences between OA and AA may indicate a biased classification model. We included the Kappa coefficient to estimate the probability of a correct class assignment by chance alone.
In the comparison of the meta-classifier against TS derived via OWT switching, we evaluated the results of the regression of chla retrieved from an algorithm (estimated (E)) versus the in situ chla values (observed (O)).
We assessed the residuals of E i − O i with log-transformed metrics, as they enable a robust assessment of the algorithms over large concentration ranges of chla (O'Reilly and Werdell, 2019;Seegers et al., 2018) :   Fig. 5. OWT switching scheme of chla algorithms to derive TS. OWT clustering of the dataset was performed in Spyrakos et al. (2018). Chla algorithm selection was based on benchmark results from Neil et al. (2019) for this dataset and modifications undertaken in Simis et al. (2020). Each group of OWTs was assigned one chla algorithm. Algorithm coefficient calibration was performed on the respective OWT group training data and the re-calibrated algorithms were applied to the test observations of the respective OWT group. TS was derived from the retrieved chla value based on the TS class ranges defined in Tabl.e 2.
1. Bias, which quantifies the average difference between estimated chla and the observed in situ value and is robust to systematic errors produced by an algorithm: 2. Mean Absolute Error (MAE), which captures the error magnitude accurately but can be sensitive to outliers: 3. Median Absolute Percentage Error (MdAPE), which is outlierresistant. For each sample (i): wherex is the median of Additionally, we provide the slope of the linear regression fit to enable comparisons with previously published results. We omit the coefficient of determination (r 2 ) as it lacks a response to bias and is sensitive to outliers and thus subject to false interpretation (Seegers et al., 2018).

Meta-classification
For the meta-classifier LGBM and XGBoost were base-learners.
LGBM and XGBoost performed similarly with slightly higher prediction accuracies by LGBM across all classes. Differences were primarily due their distinct approaches to build the decision trees. Both models constituted balanced base-learners without major prediction failures for any of the TS classes.

Table 4
Chlorophyll-a algorithms included in the OWT switching scheme. Calibration coefficients for each model highlighted in bold. explained by the higher number of test samples in the eutrophic class 3 (n = 332) and the disproportionate impact on the overall accuracy metric. Consequently, AA becomes a more relevant metric because it incorporates the imbalanced dataset distribution. The NB AA score was approximately 15% lower at 60.22%. NB assumptions only applied to eutrophic and partially hypereutrophic waters (68.22%). For oligotrophic waters, the NB classifier performed comparable to the other classifiers.

OWTs
The base-classifier NN achieved the highest accuracies for oligotrophic systems (61.11%). Compared to LGBM and XGBoost, the results were inferior for mesotrophic (45.65%) and hypereutrophic waters (82.24%). However, as for the other classifiers, the NN scored high accuracies for eutrophic waters (92.17%). Whereas for oligotrophic and eutrophic waters the prediction accuracies by the NN were competitive, the model's predictions were not balanced across the entire set of TS classes. It is to observe that higher accuracies for the oligotrophic class were accompanied by lower precision for mesotrophic waters. Similarly, the eutrophic class was retrieved with high precision, whereas less accurate predictions for hypereutrophic waters were made. Because the NN in this study is considered shallow, adding depth to the architecture may stabilise the predictions made across the TS classes. Therefore, more thorough experiments with different NN architectures need to be undertaken than they were exercised in this study. For exemplification of the meta-classifier concept the NN sufficed to add meaningful information to the ensemble of base-learners.
The non-base SVM classifier scored the highest accuracy for mesotrophic waters (63.04%, 6.52% more accurate than the highest baselearner prediction by LGBM (56.52%)) and hypereutrophic waters (0.94% compared to LGBM predictions). SVM predictions were 10.87% and 1.87% more accurate than from the meta-classifier for these two Table 5 Classification accuracies of the different classifiers for the test set shown in percentages. The highest accuracy in each row is shown in bold.  Fig. 7. Classification matrices for predictions made by all classifiers on the independent test set (n = 567). The percentage of reflectance spectra assigned per TS class is shown. Yellow colours indicate high, purple colours low percentages per classifier. TS classes are denoted as 1 = Oligotrophic, 2 = Mesotrophic, 3 = Eutrophic, 4 = Hypereutrophic.
classes, which in sum represents a significant performance gain. However, the SVM misclassified a large proportion of the eutrophic class (73.79% compared to 92.16% by the meta-classifier), reducing all performance metrics significantly.
In the present study, the SVM functioned as a standalone comparison model which therefore could not be incorporated into the ensemble of base-learners. However, given the performance gains of the SVM over other base-learners for mesotrophic and hypereutrophic waters, the addition of the SVM to the ensemble should be investigated. Before adding the SVM, it needs to be clarified whether the eutrophic misclassifications are a primary consequence of the model's more accurate mesotrophic and hypereutrophic classifications. If included, it is furthermore important to validate if the lower accuracies for eutrophic waters can be accurately handled by the meta-classifier without an overall performance loss for this class. Only if the meta-classifier can discard misclassifications accurately, performance gains of the SVM for mesotrophic and hypereutrophic waters would improve overall metaclassifier accuracies. Inclusion of the SVM into the ensemble of baselearners would also require to re-train the meta-classifier NN.
The meta-classifier achieved the highest classification accuracies across all performance metrics (OA: 83.92%, AA: 75.41%, Kappa: 71.72%) and the oligotrophic class 1 (66.67%). In comparison, the baseclassifiers' average accuracy for oligotrophic waters was 56.25%. The meta-classifier improved on this score by 10.42%. Compared to the oligotrophic class, the meta-classifier did not improve over the most accurate base-classifiers for mesotrophic waters. The decision-making process of the meta-classifier to prioritise a reliable base-classifier became increasingly complex in the case of strongly differing baseclassifier predictions. For mesotrophic waters, the meta-classifier had to discard the poor performing base-classifiers NB (22.83%) and NN (45.65%) in favour of the more accurate XGBoost (53.26%) and LGBM (56.52%). The meta-classifier was not able to entirely dismiss the NB and NN classifiers compared to the most reliable performance achieved by the base-learner LGBM. Despite the poor performances by NB and the NN, the meta-classifier scored 52.17% prediction accuracy for mesotrophic waters. Since the selection of a base-classifier for a reflectance was learnt using the training data, choosing incorrect classifiers for a reflectance of the test set was an expected outcome in heterogeneous classification scenarios.
In contrast, the meta-classifier generated highly accurate results for eutrophic and hypereutrophic waters (92.16% and 90.65%, respectively), which were significantly higher than for the oligo-and mesotrophic classes. Confusion by the meta-classifier between these two classes was below 10%. Out of the four chla algorithms, the OC2 algorithm performed accurately for low chla concentrations, but struggled toestimate higher chla waters accurately (MAE of 0.79 [mg/m 3 ], negative bias (-0.19)). The OC2 stagnated at approximately 14 mg/m 3 of chla, which can be explained by the saturation of the polynomial to calculate higher chla concentrations. The same stagnation can be observed in the algorithm's application to the training data it was calibrated with (grey hexagons in Fig. 8, metrics not shown).

Optical water type switching of chla algorithms
The failure of the OC2 algorithm to retrieve chla accurately for higher concentrations is an expected outcome, since the OC2 algorithm was designed for clear waters, where phytoplankton dominates the Fig. 8. Performance evaluation of chla retrieval algorithms included in the OWT switching scheme: OC2, Gilerson 2-band, Gons and QAA Mishra. Coloured circles represent algorithm retrievals for measurements included in the respective OWT test groups (n = 567). For illustrative purposes, grey hexagons represent algorithm retrievals for the respective OWT training groups the algorithms were calibrated with. Metrics are shown for test data. optical properties. The retrieval result of OC2 indicates that the blue and green areas of the test Rrs were not only changing as a function of phytoplankton. Thus the blue/green ratio of the OC2 algorithm led to inaccurate retrievals. TS derivation following the chla retrieval through OWT switching is shown in Fig. 9. The accuracy of the OWT chla algorithm switching approach to derive TS for the test dataset was 79.54% (OA), 68.66% (AA) and 63.38 % (Kappa). As for the meta-classifier, the largest errors occurred for oligo-and mesotrophic waters, whereas the retrieval was highly accurate for eutro-and hypereutrophic waters (> 85% accuracy). For oligotrophic waters, the meta-classifier was 12.38% more accurate (66.67% versus 54.29%, respectively) and 5.83% more precise in the classification of mesotrophic waters than derived through OWT switching of chla algorithms (52.17% and 46.34%, respectively). The developed meta-classifier was slightly more accurate for eutrophic (4.12%) and hypereutrophic waters (4.67%). Using the AA metric that incorporates the imbalance of samples per TS class, the meta-classifier was on average 6.75% more accurate than the OWT switching of chla algorithms (75.41% and 68.66%, respectively).

Misclassifications of oligo-and mesotrophic classes
Both the meta-classifier and OWT switching scheme misclassified a high percentage of oligo-and mesotrophic reflectance spectra. Here we investigate the misclassifications of the meta-classifier that are higher for both classes when derived through OWT switching of chla algorithms.
The meta-classifier misclassified 19.44% reflectance spectra of the oligotrophic class as mesotrophic and 38.04% reflectance spectra of the mesotrophic class were falsely classified as eutrophic (see Fig. 7). None of the oligotrophic and mesotrophic test waters were misclassified as hypereutrophic. To investigate the misclassifications, we plotted the distributions of the OACs per TS class of the training and test sets (Fig. 10). Based on the TS definition and the split of measurements into training and test sets after each Rrs was assigned a TS class, the two datasets showed almost identical chla concentrations within each class. Greater variation occurred only in the hypereutrophic class for which a maximum chla [mg/m 3 ] concentration was not defined. In contrast, TSM [g/m 3 ] concentrations and a cdom (443)[1/m] strongly varied between the oligo-and mesotrophic classes. Since chla concentrations were low for both the oligo-and mesotrophic classes, TSM was dominated by inorganic particle loads, leading to highly turbid and strongly scattering water properties.
Based on the constituent medians of the OACs, the optical properties of the oligotrophic class in the training set were mostly dominated by phytoplankton chla, as a cdom (443) The misclassified reflectance spectra of both the oligotrophic and mesotrophic waters reflect the influence of high sediment loads (Fig. 11). The Rrs vectors of misclassified oligotrophic instances (19.44% as mesotrophic and 13.89% as eutrophic) do not reflect a significant reduction in Rrs values at 560 nm to 620 nm that characterises correctly assigned oligotrophic class observations. Moreover, misclassifications show high reflectance values in the red to near-infrared part of the spectrum. The reflectance spectra are similar in shape and magnitude compared to the training data of the mesotrophic and eutrophic waters. A comparable pattern can be observed for the 35 misclassifications of the mesotrophic class (classified as eutrophic), wherein both shape and magnitude are similar to the training vectors of the eutrophic class.
The reflectance spectra contained in the test sets of the two lowest TS classes were influenced by higher concentrations of absorbing a cdom (λ) and/or concentrations of scattering particles than represented in the provided training data. Consequently, the corresponding Rrs vectors were substantially less present in the training sets, which influenced the learning of the classifiers. Without appropriate representation of these waters, the classifies were unable to adjust their class decision boundaries accordingly. For the classifiers in the training stage, the corresponding Rrs vectors were more similar to those abundant in higher trophic classes, which consequently led to incorrect TS predictions on the test set.

Meta-learning
A single retrieval algorithm often has limited suitability for use over a range of optically complex waters. Meta-learning represents a novel approach to handle the limits of individual algorithms. In this study, the prediction accuracies of the base-classifiers and the SVM strongly varied across the four TS classes. Overall, the meta-classifier was able to identify with high precision the correct and incorrect TS class predictions made by the individual base-classifiers. The high classification accuracy achieved by the meta-classifier over the separate SVM and the base-classifiers validate the stacking theory. Training a meta-learner on the predictions of base-learners can result in significant prediction improvements and reduces the dependency on individual algorithms. Meta-learning also decreases the requirement for knowledge about the performance of a single retrieval algorithm prior to its application to unseen observations. Inherently, the meta-learner has access to the prediction performance of each base-learner during the application through the provided level-zero predictions. Independent of the encountered water type the meta-learner can thus decide on a specific base-learner for each observation. Fig. 9. Classification matrix for TS predictions on the independent test set (n = 567) derived from OWT switching of chla algorithms. The percentage of reflectance spectra assigned per TS class is shown. Yellow colours indicate high, purple colours low percentages. TS classes are denoted as 1 = Oligotrophic, 2 = Mesotrophic, 3 = Eutrophic, 4 = Hypereutrophic.

Direct trophic status classification
By directly classifying Rrs into TS, the presented method avoided some of the issues inherent to TS derivation via chla. For example, the meta-classifier was not confronted with the task of scaling a ϕ (λ) to chla. Naturally, phytoplankton is part of TSM and produces dissolved organic matter. Indirectly, specific phytoplankton groups favour certain water conditions, and also cluster by turbidity or dissolved organic matter  loads (e.g. due to riverine influence). When higher or lower phytoplankton absorption efficiency is correlated to changes in a cdom (λ) or a nap (λ), the classifiers can incorporate the resulting influence on the absorption budget in their decision-making through the varying contribution of these IOPs to the Rrs vector.
The base-classifiers were trained using Rrs with previously assigned TS classes. Notably, the chla values used to define the TS class ranges were unknown to the classifiers during the training process. Since the TS class ranges were a function of chla, the base-classifiers learnt this functional relationship indirectly. Corresponding Rrs vectors were treated by the classifiers without knowledge of the OAC concentrations. Consequently, the classifiers learnt to define a TS class decision boundary in their feature space through the provided Rrs vectors, whereby the input features corresponded to the values at the band positions of OLCI. Additional bands may be added to the Rrs vectors in the training stage to improve the optical distinction of TS classes.

Adaptation of the classification framework
In misclassified turbid oligotrophic and mesotrophic waters the optical properties were dominated by high inorganic particle loads. These properties weakened the established relationship between chla and the Rrs vectors that defined the TS class assignment. However, scenarios where biological productivity is light-limited due to high suspended sediment loads are common in natural waterbodies (e.g. rivers) and must thus be better incorporated into the presented classification scheme. To adapt the classification method to turbid waters, other optical parameters can be employed for the TS class assignment. For example, the TSI definition by Carlson (1977) enables to relate transparency (in the form of zSD measurements), which is inversely related to turbidity, to TS. The TS class assignment would then be based on the relationship of transparency to Rrs. While this TS class assignment could be useful for turbid waters, it likely has its own limitations. Therefore, the TS class assignment should ideally be based on the encountered water conditions, warranting the definition of an optical criterion to switch between chla and zSD in the assignment. The use of other optical or water colour indicator parameters to classify Rrs directly into TS might require a different TSI definition than used in the present study.

Conclusion
This is the first study that demonstrates direct classification of Rrs into TS to overcome issues that are inherent to TS derivation via chla. For the classification of TS, we stacked unique base-classifiers in a metalearning scheme. The classifiers of this study were trained with a large in situ dataset of co-located Rrs and chla measurements (n = 2184). When applied to test observations (n = 567), the developed approach demonstrated that direct meta-classification of TS can significantly outperform indirect TS derivation via OWT switching of chla algorithms. The meta-classifier estimated eutrophic and hypereutrophic waters with > 90% prediction accuracy, making the proposed method a reliable tool to assess and monitor eutrophication of inland and nearshore waters. Our method was able to improve retrieval accuracies for oligo-and mesotrophic waters over OWT switching of chla algorithms by 5-12%. Nevertheless, accurate classification of TS from low -moderate biomass waters influenced by high TSM concentrations and/or a cdom (λ) remains a primary challenge to solve.
The classifiers of the presented study were trained with 80% of the dataset. Improvements to the developed approach can be based on the inclusion of additional base-learners such as the SVM and the re-training of the classifiers with the entire dataset. In addition, the TS class assignment may be based on other optical TS indicators. Performance improvements for the oligo-and mesotrophic waters are therefore likely. In this study we exemplified the algorithm on the multispectral resolution of Sentinel-3A OLCI. After resampling of the training dataset, the algorithm can, however, be applied to other sensors such as the Multispectral Instrument (MSI) of the Sentinel-2 satellite to enable crossmission retrievals of TS with the same method.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.