Ensembles of multiple spectral water indices for improving surface water classification
International Journal of Applied Earth Observations and Geoinformation

Mapping surface water distribution and its dynamics over various environments with robust methods is essential for managing water resources and supporting water-related policy design. Thresholding a single water index image (TSWI) is a common way of using a water index (WI) for mapping water because it is easy to use and can obtain acceptable accuracies in many applications. As more and more WIs become available, each with distinct merits, the real-world application of TSWI often faces two practical concerns: (1) selection of an appropriate WI and (2) determination of an appropriate threshold for a given WI. These two issues are problematic for many users, who rely either on time-consuming trial-and-error procedures or on personal preferences that are somewhat subjective. To better deal with these two practical concerns, an alternative way of using WIs is suggested here by transforming the current paradigm into a simple but robust ensemble approach called Collaborative Decision-making with Water Indices (CDWI). A total of 145 subsite images (900 × 900 m) from 22 Landsat-8 OLI scenes covering various water-land environments around the world were used to assess the performance of TSWI and CDWI. Five benchmark WIs were adopted in the five TSWI methods and the CDWI method: the Normalized Difference Water Index (NDWI), the Modified NDWI (MNDWI), the Automated Water Extraction Indices without (AWEI0) and with (AWEI1) consideration of shadows, and the state-of-the-art 2015 water index (WI2015). Two aspects of performance were analyzed: accuracy (indicated by both F1-score and Youden's Index) over various environments, and the sensitivity of accuracy to threshold. The results demonstrate that CDWI produced higher accuracies than the five TSWI methods in most application cases. In particular, the percentage of cases in which CDWI produced the higher F1-score exceeded the percentage in which the TSWI method did, i.e., 67% (CDWI) vs. 15% (TSWI NDWI), 54% (CDWI) vs. 22% (TSWI MNDWI), 42% (CDWI) vs. 12% (TSWI AWEI0), 57% (CDWI) vs. 17% (TSWI AWEI1), and 34% (CDWI) vs. 12% (TSWI WI2015). Moreover, the F1-score of CDWI is less sensitive to threshold changes than that of the five TSWI methods. These benefits make CDWI a robust approach for mapping water. The uncertainty of the CDWI method is thoroughly discussed, and general guidance (a look-up table) for determining the parameters of the CDWI method is also suggested. The underlying framework of CDWI could be readily generalized and applied to other satellite images, such as Landsat TM/ETM+, MODIS, and Sentinel-2 images.


Introduction
Inland water is an important earth resource that provides ecosystem services (Karpatne et al., 2016; Ogashawara et al., 2017), such as serving as a key habitat for the flora and fauna of aquatic ecosystems and supporting biodiversity conservation (Vörösmarty et al., 2010). It is also a key component of Earth's hydrologic cycle and, as such, supports many aspects of daily life, including drinking water, agricultural irrigation, electricity production, and transportation (Huang et al., 2018). Spatially explicit monitoring of water changes is, therefore, essential for a variety of scientific disciplines and for informing land-use policy and decision-making (Berry et al., 2005; Ma et al., 2010; Pekel et al., 2016).
As remote sensing is well recognized for detecting spatiotemporal patterns of land cover, it has been widely used for monitoring water changes for various purposes, such as water resource inventory, flooding and drought assessment, and urban hydrological evaluation (Allen and Pavelsky, 2018; Berry et al., 2005; Shao et al., 2019). Generally, the success of mapping water bodies with remote sensing images relies on the distinct reflectance spectra of water in comparison with other land features: water generally shows lower reflectance and a decreasing pattern of reflectance from visible to infrared wavelengths (Bukata et al., 2018). Based on such optical characteristics, various types of water classification methods have been developed, which can be broadly grouped into indirect and direct strategies.
The indirect strategy considers water bodies as one of several broad land cover categories, and the water bodies are extracted from a land use/land cover map derived from image classification methods, such as deep learning, random forest, and support vector machine (Cao et al., 2019). The direct classification strategy classifies an image into water and non-water (land) categories directly. It is easy to use and widely adopted in practice (Allen and Pavelsky, 2018; Berry et al., 2005; Cooley et al., 2017; Guo et al., 2017). One of the most common approaches is called Thresholding Single Water Index (TSWI), in which the water index (WI) is derived from two or more spectral bands with a carefully designed algorithm such that water pixels gain high values and non-water pixels gain low values (Ji et al., 2009). In TSWI processing, a WI is first selected and the corresponding WI image is generated; pixels with WI values higher than (or, in some cases, lower than) a predefined threshold are then categorized as water, and all others as non-water (Huang et al., 2018).
As WIs are sensor dependent, this research focuses only on the WIs designed for Landsat images. The Normalized Difference Water Index (NDWI; McFeeters, 1996) is considered the first-generation WI for classifying water with the TSWI method. It is calculated from the green and near-infrared (NIR) bands of Landsat TM with an equation similar to that of NDVI, which is used for vegetation (Tucker, 1979), and a threshold of 0 is suggested for delineating water areas. NDWI was the most widely used index (McFeeters, 2013) before the Modified Normalized Difference Water Index (MNDWI) was introduced by Xu (2006). MNDWI was designed because NDWI with the TSWI method cannot efficiently suppress the signal from built-up areas, such that the suggested threshold of 0 fails to distinguish water bodies from built-up surfaces accurately. The equation of MNDWI is similar to that of NDWI, but the NIR band is replaced by the first shortwave infrared band (Band 5) of Landsat TM imagery. MNDWI is the most widely used WI for a variety of applications, including surface water mapping, land use/cover change analyses, and ecological monitoring research (Allen and Pavelsky, 2018; Ji et al., 2009). In certain situations, however, the performance of MNDWI may be relatively poor due to the presence of low-reflectance surfaces such as asphalt roads and due to shadow effects. To overcome such issues, Feyisa et al. (2014) proposed two new WIs, the Automated Water Extraction Index with (AWEI1) and without (AWEI0) consideration of shadows. AWEI0 and AWEI1 are considered highly useful WIs and have been applied with TSWI to extract water bodies from Landsat imagery (Huang et al., 2018; Jiang et al., 2014). Fisher et al. (2016) conducted a comprehensive inter-comparison of the existing WIs and designed the latest water index (WI2015). WI2015 is derived from linear discriminant analysis and involves all the bands of Landsat TM/ETM+ except the blue band, and it demonstrates accuracy similar to that of some of the prevailing WIs.
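As a minimal sketch of the two normalized-difference indices described above, the following Python functions compute NDWI and MNDWI for a single pixel and apply the suggested threshold of 0. The function names, the zero-denominator handling, and the sample reflectance values are illustrative assumptions and are not taken from the original studies.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b); 0.0 is an assumed fallback for a zero denominator."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def ndwi(green: float, nir: float) -> float:
    """NDWI (McFeeters, 1996): green versus near-infrared reflectance."""
    return normalized_difference(green, nir)


def mndwi(green: float, swir1: float) -> float:
    """MNDWI (Xu, 2006): the NIR band is replaced by the first shortwave infrared band."""
    return normalized_difference(green, swir1)


def tswi(wi_value: float, threshold: float = 0.0) -> bool:
    """Threshold a single WI value; True means the pixel is labeled water."""
    return wi_value > threshold


# Hypothetical surface reflectances (0-1), chosen only for illustration.
water = {"green": 0.06, "nir": 0.02, "swir1": 0.01}
built = {"green": 0.15, "nir": 0.14, "swir1": 0.20}

for name, px in (("water", water), ("built-up", built)):
    print(name,
          "NDWI->water:", tswi(ndwi(px["green"], px["nir"])),
          "MNDWI->water:", tswi(mndwi(px["green"], px["swir1"])))
```

With these assumed values, the built-up pixel is labeled water by NDWI at threshold 0 but not by MNDWI, which illustrates the motivation given above for replacing the NIR band.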
The motivation for proposing different WIs reflects the fact that water-land environments in the real world are very heterogeneous and that the performance of the TSWI method with any single WI is unstable over different environments (Yang et al., 2018). Therefore, an average user of the TSWI method faces two basic concerns: (1) which WI should be chosen from the existing WIs, and (2) what threshold is appropriate for a given WI?
In general, the answer to the first concern involves some personal preference because there is no clear guidance on WI selection for different water-land environments, such as wetland, mountain, urban, forest, and desert (Fisher et al., 2016; Ji et al., 2009). As a consequence, the same image classified by different TSWI users could produce inconsistent results due to different choices of WIs (Feyisa et al., 2014; Huang et al., 2018). For the second concern, three types of thresholds have been reported, depending on the availability of ground reference data, i.e., the real outlines of water bodies obtained at the same time as the image acquisition. Case 1: If enough reference data are available in an application, the local optimal threshold is suggested because such a threshold can be determined (or trained) from the reference data. In most applications, however, obtaining timely reference data can be difficult, especially for highly dynamic water landscapes (e.g., rivers and wetlands during flood events). Case 2: If there are no reference data, the choices are the locally-adaptive threshold and the predefined threshold. The locally-adaptive threshold is determined from the WI image itself with segmentation techniques, so that the threshold varies self-adaptively among images (Huang et al., 2018; Li and Sheng, 2012; Wen et al., 2020). One obvious shortcoming of the locally-adaptive threshold is that it depends heavily on the image extent and its land/water ratio, such that the threshold can be vastly different for the same location when it is determined from different extents. Predefined thresholds are often recommended by the original WI inventors or by other experienced authorities. To the best of our knowledge, predefined thresholds are widely used in average water mapping applications because they are easy to apply. However, this type of threshold should be used with caution because it cannot guarantee satisfactory results given the complex water-land environments in the real world (Feyisa et al., 2014; Fisher et al., 2016).
In summary, the application of TSWI faces the two common concerns mentioned above, and the ways of dealing with them are unsatisfactory when sufficient reference data are lacking. Thus, alternative solutions have been explored over the past few years (Huang et al., 2018), including the construction of new WIs that are robust and relatively insensitive to threshold selection, or the development of new methods using multiple existing WIs (Sánchez et al., 2018; Wang et al., 2018). The latter is regarded as the most appropriate approach because combining multiple WIs lets their merits complement one another, so that the combination applies to more diverse environments than the TSWI method (Yang et al., 2015). Such a strategy is, to some extent, in line with collaborative decision-making theory, in which multiple variables produce complementary information to support a more robust result than any single variable (Kacprzyk and Fedrizzi, 2012).
Inspired by these ideas, this research proposes a new way of using WIs, based on collaborative decision-making theory, to deal with the two concerns mentioned above that commonly arise in the TSWI method. The new approach is expected to have two advantages over the TSWI method: (1) less dependence on WI selection and (2) less sensitivity to WI thresholds. Specifically, the new approach transforms the current paradigm of using WIs (i.e., the TSWI method) into a simple but robust ensemble way of using WIs called Collaborative Decision-making with Water Indices (CDWI). The CDWI was tested in a variety of water-land environments around the world and assessed by comparing its performance with that of the TSWI method using five benchmark WIs.

Test sites and subsites
Performances of water classification methods are generally affected by two error sources: the aquatic environments to which they are applied and the surrounding land features (Yang et al., 2018). The aquatic environments are often characterized by a variety of watercolors (e.g., dark, yellow, red, and brown) and water types (e.g., river, reservoir, pond, and ditch). The surrounding land features are usually recognized as vegetation conditions (high-density vegetation, sparse vegetation, etc.), built-up areas (roads and buildings), and shadows (cloud shadows, building shadows, and terrain shadows). The combinations of these two error sources make the selection of test sites tricky and time-consuming. Fortunately, many test sites have already been used for validating water classification methods in previous studies, and such sites guided the selection of test sites in this study. In the end, 22 test sites were carefully selected, some taken from Yang et al. (2015) and Feyisa et al. (2014) and some newly selected by considering their spatial representativeness (Fig. 1). These sites are scattered around the world and cover a variety of water-land environments (Table 1).
Within each test site, several subsites of 900 × 900 m each were selected for preparing test data (as exemplified in Fig. 1b, 1c, and 1d). The subsites were manually selected with expert knowledge in true-color composites of Landsat-8 OLI images (R: Band 4, G: Band 3, B: Band 2) by following two criteria: (1) the subsites should cover both water and land; (2) the subsites should cover as many different types of watercolors, water types, and land features as possible. Overall, 145 subsites were selected from these 22 test sites (Table 1). Although various land features are covered by these subsites, their sample sizes (or areas) vary significantly due to their different frequencies of occurrence in the real world. For example, vegetated land is more likely to be sampled than shadowed land near water bodies. To mitigate such imbalanced sample sizes, 35 additional subsites covering only "uncommon" land features (e.g., built-up land and shadowed land) were selected. In the end, a total of 180 subsites were selected.

Landsat-8 OLI images
A total of 22 Landsat-8 OLI images, each covering one test site and acquired in different seasons, were selected (Table 1). They are standard Landsat-8 surface reflectance Level-2 products with 30 m spatial resolution; more information on these products can be found in the Landsat 8 Product Guide (https://www.usgs.gov/media/files/landsat-8-collection-1-land-surface-reflectance-code-product-guide, accessed on Dec. 20, 2020). The images were first downloaded from the USGS Earth Resources Observation and Science Center Science Processing Architecture on Demand Interface (https://espa.cr.usgs.gov/) and then clipped into sub-images using subsite-defined square polygons (900 × 900 m, see Fig. 1). Only the pixels that were entirely contained by the subsite square polygons were selected. In total, 180 clipped subsite images with 153,140 pixels of seven-band surface reflectance (floating-point values ranging from 0 to 1) were extracted and stored as integer values after scaling by 10,000 (any pixels with values less than 0 or greater than 10,000 were masked).

High spatial resolution images
PlanetScope Analytic Ortho Scene (PSAOS) products served as reference data for labeling Landsat-8 pixels as water or non-water. PSAOS images have high spatial resolution (3 m) and high temporal resolution (1-3 days), which makes them an ideal reference data source. They consist of four bands: blue (455-515 nm), green (500-590 nm), red (590-670 nm), and NIR (780-860 nm). Before being distributed to users, they are orthorectified to remove distortion caused by terrain and to eliminate the perspective effect on the ground (not on buildings), as well as to restore the geometry of an image taken at zenith (Planet Labs Inc., 2018).
PSAOS images were carefully selected such that their acquisition dates matched exactly those of the corresponding Landsat images (Table 1). In other words, each PSAOS image and the corresponding Landsat-8 image were captured on the same day. All the PSAOS images were obtained from Planet Explorer (https://www.planet.com/explorer/; Planet Team, 2017) and manually georeferenced to the corresponding Landsat-8 image. The georeferencing errors of the PSAOS images were less than one pixel (30 m), which minimized the geolocation error that could potentially propagate to the final classification results.

Test dataset preparation
Each test pixel (153,140 in total) holds several attributes: location, source image, band reflectance, WI values, feature type (water or non-water), and percentage of water. The first three attributes were obtained directly from the source Landsat-8 image. WI values were derived from band reflectance with specific algorithms (detailed in Section 3.1). Feature type and percentage of water were identified with the help of the PSAOS reference images in three steps. First, the PSAOS images were displayed in false color (R: NIR, G: Red, B: Green) and carefully classified into water polygons (including different watercolors) and non-water polygons (including vegetated land, built-up land, or shadowed land) through visual digitization with expert experience. Then, the water area percentage of each corresponding 30 m by 30 m pixel was derived with a series of spatial analysis functions (e.g., create fishnet, clip, etc.) coded in a Python script in ArcGIS 10.5 (version 10.5.0.6491; ESRI, 2016). Finally, all the pixels with a water percentage higher than 50% were labeled as water, and the others as non-water (Feyisa et al., 2014; Yang et al., 2015). The non-water type was further identified as vegetated land, built-up land, or shadowed land. In addition, pixels with a water percentage equal to 0 (non-water) or 100% (water) were considered pure pixels, and the others mixed pixels. The numbers of water pixels, non-water pixels, pure pixels, and mixed pixels are listed in Table 2. The dataset is available at the Mendeley Data repository (https://doi.org/10.17632/mfp7jvw7yk.1).
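The labeling rule in the final step can be sketched as follows. This is only an illustration of the 50% rule and the pure/mixed distinction stated above; the function name and the dictionary layout are assumptions and do not reproduce the ArcGIS script used in the study.

```python
from typing import Dict


def label_pixel(water_pct: float) -> Dict[str, str]:
    """Label one 30 m pixel from its reference water-area percentage (0-100)."""
    if not 0.0 <= water_pct <= 100.0:
        raise ValueError("water percentage must lie between 0 and 100")
    # Water if more than half of the pixel area is water, otherwise non-water.
    feature = "water" if water_pct > 50.0 else "non-water"
    # Pure pixels are entirely land (0%) or entirely water (100%).
    purity = "pure" if water_pct in (0.0, 100.0) else "mixed"
    return {"feature": feature, "purity": purity}


for pct in (0.0, 35.0, 50.0, 72.5, 100.0):
    print(pct, label_pixel(pct))
```

Note that a pixel with exactly 50% water is labeled non-water under the "higher than 50%" rule.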

Methods

The common way of using spectral water indices: TSWI
Although numerous Landsat WIs have been developed over the past three decades, five are prevailing with distinct merits for different water-land environments: NDWI, MNDWI, AWEI0 (also known as AWEI nsh), AWEI1 (also known as AWEI sh), and WI2015 (Table 3). The application of the TSWI method with a WI is straightforward: a predefined threshold is applied to a preselected single WI image. Pixels with values larger than the threshold are labeled as water, and the others as non-water. Please note that the applications of the TSWI method with NDWI, MNDWI, AWEI0, AWEI1, and WI2015 are denoted hereafter as TSWI NDWI, TSWI MNDWI, TSWI AWEI0, TSWI AWEI1, and TSWI WI2015, respectively.

Principle of CDWI
An alternative way of using WIs for water classification is proposed here to handle the common concerns of using TSWI: WI selection and the determination of the corresponding threshold. The approach is named Collaborative Decision-making with Water Indices (CDWI). It combines a group of weighted and thresholded WI images to generate a new water probability image, to which a new decision-making probability threshold is applied to extract water. The rationale of the collaborative decision-making principle is that a group of variables can provide potentially complementary information to support a more reliable decision than one based on a single component (Kacprzyk and Fedrizzi, 2012). When it comes to handling the concerns of TSWI, CDWI provides an alternative way of selecting WIs and a potentially stable threshold for extracting water. The step-by-step procedure of CDWI is as follows (see also Fig. 2), and a ready-to-use Python script is attached as a supplementary file.
• Step 1: Select a group of WIs and calculate the corresponding WI images. In this study, the five prevailing WIs were used as listed in Table 3. The reason for selecting these WIs is that they have been reported to show complementary merits in classifying water over different water-land environments. For example, MNDWI was designed to separate water from vegetated areas and built-up areas (Ji et al., 2009; Xu, 2006), AWEI0 performs better than other WIs in differentiating built-up land from water, and AWEI1 is good at distinguishing shadow from water (Feyisa et al., 2014).
• Step 2: Apply an appropriate predefined threshold to each WI image to initially classify water (labeled 1) and non-water (labeled 0). Note that this step is also known as applying TSWI for water classification.
• Step 3: Apply an appropriate weight to each initially classified TSWI image. The sum of all weights is 1. A TSWI method with better performance should be assigned a larger weight for its classified TSWI image.
• Step 4: Sum up all weighted images to obtain a new CDWI image. Its pixel values are considered to represent water probability. The larger the CDWI pixel value, the greater the confidence that the pixel is water.
• Step 5: Apply a probability decision-making threshold (T CDWI) to binarize the CDWI image and obtain the final water image.
From the perspective of the collaborative decision-making process, the workflow of CDWI can be understood as follows. Consider a decision-making committee named CDWI, whose job is to decide whether image pixels are water or not. It has several experienced committee members (i.e., TSWI NDWI, TSWI MNDWI, TSWI AWEI0, TSWI AWEI1, and TSWI WI2015 in this study) with different abilities (weights). In the collaborative decision-making process, each committee member first independently makes an initial decision (water or non-water) with the TSWI method. Then, each member assigns its weight (W) to the corresponding TSWI image. The sum of all weighted TSWI images forms a new CDWI image awaiting the final decision: pixels with values larger than T CDWI are classified as water, and the others as non-water.
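Steps 3 to 5 of the procedure can be sketched in a few lines of Python. This is a minimal illustration and not the supplementary script: the function names and the four example pixels are hypothetical, and the weights and T CDWI are the values reported later in this study (Fig. 4). One assumption is made explicit in the code: a pixel whose CDWI value equals T CDWI is counted as water, so that a pixel flagged only by TSWI MNDWI and TSWI AWEI0 (0.640 + 0.008 = 0.648) passes the suggested threshold.

```python
from typing import Dict, List

# Weights and T_CDWI as reported in this study (Fig. 4).
WEIGHTS: Dict[str, float] = {
    "NDWI": 0.000, "MNDWI": 0.640, "AWEI0": 0.008, "AWEI1": 0.019, "WI2015": 0.333,
}
T_CDWI = 0.648


def cdwi_image(tswi: Dict[str, List[int]], weights: Dict[str, float]) -> List[float]:
    """Steps 3-4: weight each binary TSWI result and sum the results per pixel."""
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1")
    n_pixels = len(next(iter(tswi.values())))
    return [sum(weights[name] * labels[i] for name, labels in tswi.items())
            for i in range(n_pixels)]


def cdwi_classify(cdwi: List[float], t_cdwi: float, eps: float = 1e-9) -> List[int]:
    """Step 5: binarize. ASSUMPTION: values equal to t_cdwi count as water;
    eps guards against floating-point round-off in the weighted sum."""
    return [1 if value >= t_cdwi - eps else 0 for value in cdwi]


# Step 2 output for four hypothetical pixels (1 = water, 0 = non-water).
tswi_results = {
    "NDWI":   [1, 1, 0, 0],
    "MNDWI":  [1, 1, 0, 1],
    "AWEI0":  [1, 0, 1, 1],
    "AWEI1":  [1, 0, 1, 0],
    "WI2015": [1, 0, 1, 0],
}
water_map = cdwi_classify(cdwi_image(tswi_results, WEIGHTS), T_CDWI)
print(water_map)  # [1, 0, 0, 1]
```

In this example the second pixel (flagged by NDWI and MNDWI only, sum 0.640) and the third pixel (flagged by AWEI0, AWEI1, and WI2015, sum 0.360) fall below the threshold, whereas the fourth pixel reaches it exactly.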

CDWI parameters estimation
The application of CDWI requires three types of parameters: (1) the predefined WI thresholds (T NDWI , T MNDWI , T AWEI0 , T AWEI1 , and T WI2015 ) for applying the five TSWI methods, (2) the weights (W NDWI , W MNDWI , W AWEI0 , W AWEI1 , and W WI2015 ) of the five TSWI methods, and (3) the CDWI threshold (T CDWI ) for slicing the final CDWI image (Fig. 2). Since the predefined thresholds have already been recommended by the previous authors in applying the five TSWI (Table 3), they are directly adopted in this CDWI approach as well. The other two parameters were estimated in the following ways (Fig. 3).
(1) Weights of the five TSWI methods
According to the principle of CDWI, a TSWI method showing better performance should hold a larger weight. The performances of the five TSWI methods were therefore assessed and their weights determined as follows. First, we prepared 1,000 sample sets, each formed by 1,000 pixels randomly selected from the test dataset: 500 water and 500 non-water. Note that equal numbers of water and non-water pixels minimize the uncertainty caused by imbalanced sample sizes (Warmink et al., 2010). Then, the five TSWI methods with the corresponding recommended predefined thresholds (Table 3) were applied to each sample set, and their accuracies were evaluated by the F1-score, a harmonic accuracy assessment metric detailed in Section 3.3.1 (Daskalaki et al., 2006; Zhong et al., 2019). Each sample set produced five F1-scores for the five TSWI methods, and the method holding the maximum F1-score was considered the best performer and its count was increased by one. After this process was run over all 1,000 sample sets, each WI had a final count number (N), and the sum of the five count numbers equals 1,000. Finally, the weight of a TSWI method was determined as the proportion of its count value to the sum of all count values (Fig. 3a). In this study, for example, the weight of TSWI NDWI (W NDWI in Fig. 2) was calculated as Eq. (1):

$$W_{NDWI} = \frac{N_{NDWI}}{N_{NDWI} + N_{MNDWI} + N_{AWEI0} + N_{AWEI1} + N_{WI2015}} \tag{1}$$

(2) CDWI threshold (T CDWI)
Since the CDWI image is the sum of several weighted TSWI images (Fig. 2), any pixel value of the CDWI image is the sum of one combination of TSWI weights. In total, there are 31 different combinations of weights in this study (Fig. 2): W NDWI, W MNDWI, W AWEI0, W AWEI1, W WI2015, W NDWI + W MNDWI, W NDWI + W AWEI0, W NDWI + W AWEI1, …, and W NDWI + W MNDWI + W AWEI0 + W AWEI1 + W WI2015 (i.e., 1). Therefore, the final recommended T CDWI should be determined from this list. The determination process is straightforward. First, we generated 1,000 sample sets in the same way as described above. Each sample set produced 31 F1-scores after the 31 candidate CDWI thresholds were applied independently. Among these 31 F1-scores, the maximum score and its corresponding threshold were identified and counted. After this procedure was applied to all 1,000 sample sets, the threshold with the largest count number was identified as the recommended T CDWI, because it held the maximum F1-score in more cases than any other candidate threshold (Fig. 3b).
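The counting scheme behind the weights and the list of 31 candidate thresholds can be illustrated with the short sketch below. The function names are hypothetical, ties between methods are not handled, and the win counts are an assumption: they are back-calculated from the weights reported in this study (weight × 1,000 sample sets) rather than recomputed from pixel data.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List

METHODS = ["NDWI", "MNDWI", "AWEI0", "AWEI1", "WI2015"]


def weights_from_winners(winners: List[str]) -> Dict[str, float]:
    """Share of sample sets in which each TSWI method held the best F1-score."""
    counts = Counter(winners)
    return {m: counts[m] / len(winners) for m in METHODS}


def candidate_thresholds(weights: Dict[str, float]) -> List[float]:
    """Sums of every non-empty combination of weights (2**5 - 1 = 31 here)."""
    sums = []
    for size in range(1, len(weights) + 1):
        for combo in combinations(weights.values(), size):
            sums.append(round(sum(combo), 3))
    return sums


# ASSUMPTION: win counts implied by the reported weights, one entry per
# sample set; NDWI never held the maximum F1-score, so it has no entries.
winners = ["MNDWI"] * 640 + ["WI2015"] * 333 + ["AWEI1"] * 19 + ["AWEI0"] * 8
weights = weights_from_winners(winners)
candidates = candidate_thresholds(weights)

print(weights)
print(len(candidates), sorted(set(candidates)))
```

Because the NDWI weight is zero in this study, several of the 31 combinations share the same sum, so the number of distinct candidate values is smaller than 31.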

Accuracy assessment
As mentioned in Section 2.1, 145 of the 180 subsite images cover both water and land features (the other 35 cover only land features). Therefore, the five TSWI methods and the CDWI method were applied to these 145 subsite images to assess the stability of their accuracies over different water-land environments around the world. As previous studies suggested, both the F1-score and Youden's Index (YI) (Youden, 1950) were used to assess the accuracies of the six methods (Li et al., 2019; Zhong et al., 2019; Wen et al., 2016). The F1-score is the harmonic mean of the producer's accuracy and the user's accuracy (Daskalaki et al., 2006; Eq. (2)):

$$\text{F1-score} = \frac{2 \times \text{Producer's accuracy} \times \text{User's accuracy}}{\text{Producer's accuracy} + \text{User's accuracy}} \tag{2}$$

The producer's accuracy is the percentage of correctly classified water pixels out of the total number of true water pixels. The user's accuracy is the percentage of correctly classified water pixels out of the total number of pixels classified as water. The F1-score reaches its best value at 1 and its worst at 0. It is considered more objective than overall accuracy (the percentage of correctly classified pixels, both water and non-water, out of the total number of pixels) in our binary classification case because a water body mostly covers a small portion of the image under evaluation. The YI has often been used for determining local optimal thresholds (Wen et al., 2016), and it was considered here as an indicator of water classification accuracy (Eq. (3)). The larger the YI value, the smaller the sum of omission error and commission error.
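Both metrics can be computed from a binary confusion matrix, as in the minimal sketch below. The F1-score follows Eq. (2). For YI, the standard definition of Youden's index (sensitivity + specificity − 1, where sensitivity equals the producer's accuracy for water) is used; that this matches the exact form of Eq. (3) is an assumption. The function names and the example labels are illustrative only.

```python
from typing import List, Tuple


def confusion(truth: List[int], pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) with water coded as 1 and non-water as 0."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of producer's and user's accuracy (Eq. (2))."""
    producer = tp / (tp + fn) if tp + fn else 0.0
    user = tp / (tp + fp) if tp + fp else 0.0
    return 2 * producer * user / (producer + user) if producer + user else 0.0


def youden_index(tp: int, fp: int, fn: int, tn: int) -> float:
    """Standard Youden's index: sensitivity + specificity - 1 (assumed form)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


# Hypothetical reference and classified labels for eight pixels.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion(truth, pred)
print(round(f1_score(tp, fp, fn), 3), round(youden_index(tp, fp, fn, tn), 3))
```

In this form, YI equals 1 minus the omission error rate for water minus the false-positive rate for non-water, which is consistent with the statement that a larger YI corresponds to a smaller sum of omission and commission errors.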

Sensitivity to thresholds
Sensitivity to thresholds, defined as how much the accuracy would change by changing the threshold values for a given method (TSWI methods or CDWI method), is indicated by the slope of a threshold-accuracy curve. A robust classification method should, therefore, be less sensitive (low absolute slope value) to threshold changes. For TSWI methods, such thresholds are the predefined WI thresholds; for the CDWI method, such thresholds involve both the predefined WI thresholds and T CDWI.
For a given TSWI method, the classification result is affected purely by the predefined threshold (Fig. 2). Each predefined threshold outputs one classification result and one accuracy (F1-score or YI value). The sensitivity analysis, thus, involves selecting different predefined thresholds and calculating their corresponding accuracies. To make this selection more objective, the local optimal thresholds of the 145 subsite images served as candidate predefined thresholds. For a subsite image, the local optimal threshold was determined as the one at which the YI reached its maximum value (Fisher et al., 2016).
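The search for a local optimal threshold can be sketched as a scan over candidate thresholds that keeps the one with the largest YI. The candidate grid, the function names, and the example values below are assumptions; the text does not specify how the candidates were enumerated, and YI is taken in its standard form (sensitivity + specificity − 1).

```python
from typing import List, Sequence, Tuple


def youden_index(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Standard Youden's index for binary labels (1 = water, 0 = non-water)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


def local_optimal_threshold(wi_values: Sequence[float], truth: Sequence[int],
                            candidates: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, YI) for the candidate that maximizes YI.
    Ties keep the first candidate encountered (an arbitrary choice)."""
    best_t, best_yi = candidates[0], float("-inf")
    for t in candidates:
        pred = [1 if v > t else 0 for v in wi_values]
        yi = youden_index(truth, pred)
        if yi > best_yi:
            best_t, best_yi = t, yi
    return best_t, best_yi


# Hypothetical WI values and reference labels for one subsite image.
wi_values: List[float] = [0.5, 0.3, 0.1, -0.1, -0.3, -0.5]
truth = [1, 1, 1, 0, 0, 0]
candidates = [-0.4, -0.2, 0.0, 0.2, 0.4]
print(local_optimal_threshold(wi_values, truth, candidates))  # (0.0, 1.0)
```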
For the proposed CDWI method, accuracy relies on both the five predefined WI thresholds (Table 3) and T CDWI (Figs. 2 and 3). To make the sensitivity analysis clearer, T CDWI was fixed (to the suggested value) when analyzing the sensitivity of CDWI to the WI thresholds, while the WI thresholds were fixed (to the suggested values, see Table 3) when analyzing the sensitivity of the CDWI method to T CDWI. Each group of five selected WI thresholds produces one F1-score of the CDWI. As each WI threshold could be chosen from the 145 candidate local optimal thresholds, $145^5$ (= 64,097,340,625) different threshold groups could be generated, with $145^5$ accuracies. To reduce this huge computational burden, the 145 candidate local optimal thresholds were split into 15 equal-interval groups and the central value of each group was selected instead. This leaves $15^5$ (= 759,375) WI threshold groups and $15^5$ corresponding CDWI accuracies. Each selected WI threshold generates one accuracy for the corresponding TSWI method but $15^4$ (= 50,625) accuracies for the CDWI method. To make them comparable, the mean accuracy of the CDWI method was used for the sensitivity analysis.
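The reduction from 145 candidate thresholds to 15 representative values, and the resulting counts, can be checked with the sketch below. It assumes that "equal-interval groups" means equal-width bins between the smallest and largest candidate threshold and that the "central value" is the bin midpoint; the function name is hypothetical.

```python
from typing import List, Sequence


def equal_interval_centers(values: Sequence[float], n_groups: int) -> List[float]:
    """Midpoints of n_groups equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    return [lo + (i + 0.5) * width for i in range(n_groups)]


# Size of the search space before and after the reduction.
full_groups = 145 ** 5      # every combination of 145 thresholds for 5 WIs
reduced_groups = 15 ** 5    # every combination of 15 bin centers for 5 WIs
per_threshold = 15 ** 4     # CDWI runs sharing one fixed threshold of one WI

print(full_groups, reduced_groups, per_threshold)
print(equal_interval_centers([0.0, 1.5, 3.0], 3))  # [0.5, 1.5, 2.5]
```

For a fixed threshold of one WI, the remaining four WIs each take 15 values, which gives the $15^4$ CDWI accuracies that are averaged before comparison with the single TSWI accuracy.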

Suggested CDWI parameters
The parameters for applying the CDWI method were estimated carefully (Fig. 3) and could be used directly in further applications, given that they were evaluated on a dataset collected from various water-land environments around the world. To estimate the weights of the five TSWI methods, their accuracies were assessed. Overall, TSWI MNDWI showed the best performance for classifying water, followed by TSWI WI2015, TSWI AWEI1, TSWI AWEI0, and TSWI NDWI. Accordingly, the suggested weights of these five TSWI methods were estimated as 0.640, 0.333, 0.019, 0.008, and 0.000, respectively (Fig. 4a). Note that TSWI NDWI performed the worst among the five TSWI methods and received zero weight because it did not achieve the highest F1-score in any of the 1,000 sample sets.
With regard to T CDWI, 0.648 is suggested as the best value for further applications because it obtained the largest number of cases with the maximum F1-score among all the candidate CDWI thresholds (Fig. 4b). This result means that pixels with values larger than 0.648 in the CDWI image (the sum of the weighted TSWI images) are more likely to be water than non-water. Furthermore, this T CDWI is the sum of the weights of MNDWI (W 2 = 0.640) and AWEI0 (W 3 = 0.008), which statistically implies that pixels classified as water by both TSWI MNDWI and TSWI AWEI0 are more likely to be correctly classified than those classified as water by only one of the two.

Accuracy assessment over different environments
The six methods were applied to the 145 individual subsite images to compare their accuracies over different water-land environments (Fig. 5). All the TSWI methods and the CDWI method obtained high accuracies, with F1-scores and YI values greater than 0.9 for most subsites (Fig. 5). Although they all performed relatively well, differences in their performances can be observed. In general, the number of subsites whose accuracies were improved by the CDWI method was much greater than the number of subsites whose accuracies were decreased by it. For example, 54% of the subsite images classified by the CDWI method produced higher F1-scores than those produced by the TSWI MNDWI method, and only 22% of the subsite images had lower F1-scores with CDWI than with the TSWI MNDWI method (Fig. 5b). Moreover, the absolute mean value of the decreases in accuracy was smaller than that of the increases. Taking YI as an example, this pattern can be observed as |-0.029| vs. 0.087 in Fig. 5a, |-0.009| vs. 0.022 in Fig. 5b, |-0.011| vs. 0.021 in Fig. 5c, etc. This finding shows that the CDWI method is, in general, more likely to obtain a better water classification result than any TSWI method.

Sensitivity to predefined WI thresholds
Each subsite image has its own local optimal threshold. Across all subsite images, the local optimal thresholds varied significantly, as shown in Fig. 6. Generally, the histograms of those local optimal thresholds approximately follow Gaussian distributions. The F1-score of any TSWI method changed dramatically with different WI thresholds (the blue lines in Fig. 6). Overall, the sensitivity curves of all the TSWI methods are unimodal and peak around the suggested predefined thresholds used in this study (see Table 3). These sensitivity curves can be broadly categorized into three types: high sensitivity with a steep slope, moderate sensitivity with a moderate slope, and low sensitivity with a roughly flat slope. The farther a threshold is from the suggested predefined threshold, the higher the sensitivity of a TSWI method to that threshold (Fig. 6). In contrast, the proposed CDWI method showed the least sensitivity to threshold. That is, no matter what threshold was used, the accuracies of the CDWI method changed less than those of any TSWI method. For example, when the threshold changed from −0.45 to 0.26, the F1-score of TSWI MNDWI changed from 0.82 to 0.97, whereas the mean F1-score of the CDWI method changed from 0.912 to 0.918 (Fig. 6b). This low sensitivity to threshold indicates that the uncertainties related to threshold determination can be significantly reduced with the CDWI method compared with the TSWI methods. This characteristic of the CDWI method could make users less worried about whether the selected thresholds are optimal in applications without reference data.

Sensitivity to T CDWI
Overall, the sensitivity of CDWI accuracy to T CDWI is relatively low (Fig. 7). The mean F1-score of the CDWI method changes from 0.940 to 0.956 as T CDWI changes from 0.008 to 1.000. It generally shows a "∩" pattern: a short increase, a long plateau, and then a slight decrease. The YI value shows a similar sensitivity pattern. It is noteworthy that the accuracy produced by a combined T CDWI (i.e., the sum of two or more TSWI weights) is overall higher than that produced by a single T CDWI (i.e., a single TSWI weight), as explained here. Among all the candidate T CDWI values (denoted by the x-axis tick labels in Fig. 6), the single T CDWI values are 0.000 (W 1 ), 0.008 (W 2 ), 0.019 (W 3 ), 0.333 (W 4 ), and 0.640 (W 5 ); the rest are combined T CDWI values. The mean F1-score produced by T CDWI = 0.027 (W 2 + W 3 ) is 0.952, which is larger than the mean F1-score produced by either W 2 alone (0.940) or W 3 alone (0.951). The same holds for the T CDWI suggested in this study (0.648, W 2 + W 5 ; Fig. 4): its mean F1-score and YI value are both larger than those produced by the corresponding single T CDWI values, 0.008 (W 2 ) and 0.640 (W 5 ).
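The candidate T CDWI values follow directly from the five weights quoted above: each candidate is either a single weight or the sum of two or more weights. The following sketch enumerates them. The function name `candidate_thresholds` is hypothetical, and rounding to three decimals is an assumption made to match the precision of the reported weights.

```python
from itertools import combinations

# TSWI weights as reported in the text (W1..W5).
weights = {"W1": 0.000, "W2": 0.008, "W3": 0.019, "W4": 0.333, "W5": 0.640}

def candidate_thresholds(weights):
    """Map each distinct subset sum of the weights to the subsets that produce it."""
    candidates = {}
    names = list(weights)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            value = round(sum(weights[n] for n in combo), 3)
            candidates.setdefault(value, []).append("+".join(combo))
    return dict(sorted(candidates.items()))

cands = candidate_thresholds(weights)
single = sorted(weights.values())                       # single T_CDWI values
combined = [v for v in cands if v not in weights.values()]  # combined T_CDWI values
```

Because W 1 is zero, adding it to a subset does not change the sum, so these weights yield 16 distinct candidates ranging from 0.000 to 1.000, including 0.027 (W 2 + W 3 ) and the suggested 0.648 (W 2 + W 5 ).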

Pure pixels vs. Mixed pixels
One commonly recognized source of uncertainty in a water classification method is water-land mixed pixels, or water-land boundary pixels (Comber et al., 2012; Yang et al., 2015). To better understand how the CDWI works, we compared the performances of the six methods in classifying both pure pixels and mixed pixels of the 145 subsite images (Fig. 8). All the TSWI methods and the CDWI method performed worse for mixed pixels than for pure pixels. This is because the TSWI methods were developed on the principle that water and land features have distinct reflectance properties: water shows a decrease in reflectance from the visible to infrared wavelengths, while land features (e.g., vegetation) often do not show such a reflectance pattern (Xiong et al., 2018). Moreover, those WI methods are "hard" classification methods that use a Boolean set (i.e., 0 or 1) to restrict each pixel to either the water or the non-water type (Yang et al., 2015). Therefore, with TSWI methods, classifying mixed pixels often introduces more errors than classifying pure pixels, because the reflectance properties of the water and non-water components are averaged within a pixel (Fisher et al., 2016). How to reduce the class uncertainty of mixed pixels in water classification is accordingly a research topic for many researchers.
Fig. 6. The sensitivity of F1-score to threshold for the five TSWI methods and the CDWI method. The sensitivities are indicated by the slope of the sensitivity curve, i.e., the threshold-against-F1 curve. The red-colored values are the predefined WI thresholds suggested by previous studies, as listed in Table 3. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. The sensitivity of CDWI accuracy (F1-score and YI value) to T CDWI . The red-colored threshold (0.648) marks the suggested T CDWI in this study (see Fig. 4).
Various techniques have been developed in attempts to reduce the uncertainty of mixed pixels in water classification. Some are based on the idea of "soft" classification, such as sub-pixel classification and fuzzy classification (Dewi et al., 2016; Xiong et al., 2018). Others use machine learning techniques that include mixed pixels in the training process (Foody and Mathur 2006). In this study, the CDWI achieved higher performance than the TSWI methods in classifying water from mixed pixels (Fig. 8), which suggests that it offers an alternative way of reducing the uncertainty of mixed pixels. For a mixed pixel labeled as water (i.e., with a water percentage larger than 50%), the CDWI process can be regarded as accumulating the probability that the water pixel is correctly classified. That is, whether a mixed pixel is water or non-water is not decided by the result of a single TSWI method (unless that method has a large weight) but collectively by the results of several TSWI methods. Based on this understanding, we highly recommend applying CDWI where mixed pixels are very common, such as small water bodies (e.g., ponds) or water bodies with large perimeter-area ratios (e.g., dikes, creeks, tidal channels, and mountainous reservoirs), as shown in Fig. 9.
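The collective decision described above can be illustrated as a weighted vote over the binary water maps of the individual TSWI methods. The sketch below is a hedged reading of that idea, not the authors' implementation: it assumes that a pixel is labeled water when the summed weights of the methods voting "water" reach T CDWI (a ">=" rule), and the function name `cdwi_vote` and the three example pixels are hypothetical. The weights and the threshold are the values quoted in the text.

```python
import numpy as np

def cdwi_vote(binary_maps, weights, t_cdwi):
    """Weighted vote over binary water maps (1 = water, 0 = non-water).

    Assumption: water if the accumulated weight >= t_cdwi. A small tolerance
    guards against floating-point round-off in the sum.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in binary_maps])
    w = np.asarray(weights, dtype=float).reshape(-1, *([1] * (stack.ndim - 1)))
    score = (stack * w).sum(axis=0)          # accumulated weight per pixel
    return score >= t_cdwi - 1e-9, score

weights = [0.000, 0.008, 0.019, 0.333, 0.640]   # W1..W5 from the text
t_cdwi = 0.648                                  # suggested T_CDWI (W2 + W5)

# Three hypothetical mixed pixels; each array is one TSWI method's result.
maps = [
    [0, 1, 0],   # method weighted W1
    [1, 1, 0],   # method weighted W2
    [0, 1, 0],   # method weighted W3
    [0, 1, 0],   # method weighted W4
    [1, 0, 1],   # method weighted W5
]
water, score = cdwi_vote(maps, weights, t_cdwi)
```

Under these assumptions the first pixel is labeled water because two methods agree (W 2 + W 5 ), whereas the third pixel is not, even though the highest-weighted method alone votes water. The same function accepts 2-D rasters, since the operation is plain element-wise raster algebra.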

Different compositions of land features
We observed that in some subsites the CDWI performed worse than the TSWI methods, as illustrated in Fig. 5 (the blue dots below the 1:1 line). One reason could be that the parameters of the CDWI were estimated from a simulated general scenario rather than from the specific scenario of each subsite. The general scenario was simulated by 1,000 sample sets, each formed by 1,000 randomly selected water and non-water pixels from the test dataset. Since the dataset was collected from various water-land environments around the world (Fig. 1 and Table 1), the general scenario consists of water with different colors and land that is mostly covered by vegetation, with some parts covered by built-up land and shadows. For some specific scenarios, however, the proportions of land components may differ considerably from those of the general scenario. For example, an urban area is mostly occupied by built-up land and building shadows, with only a small portion of vegetated land. In such a case, the CDWI with the parameters suggested in Fig. 4 may not perform as well as indices carefully designed for urban areas, such as AWEI0 and MNDWI (Feyisa et al., 2014).
To explore more application scenarios, we simulated a variety of land environments consisting of different fractions of three typical land features, namely vegetated land, built-up land, and shadowed land. For each simulated land environment, the corresponding five WI weights were estimated in the same way as for the general scenario (Fig. 10). Overall, the performances (indicated by TSWI weights) of both TSWI NDWI and TSWI AWEI1 are insensitive to the land environment, and these two methods received the lowest weights (0 or close to 0). When the fraction of shadowed land is larger than 10%, TSWI MNDWI receives a larger weight than the other TSWI methods. This implies that, in a scene with a large portion of shadows, the image classified by the TSWI MNDWI method should be given the dominant weight when applying the CDWI method; likewise, if one only wants to use a single TSWI method, TSWI MNDWI is the suggested choice. The results also show that AWEI0 is sensitive to the fraction of built-up land: the more built-up land in an application, the higher the weight TSWI AWEI0 receives (Feyisa et al., 2014; Fisher et al., 2016). In an extreme scenario such as an urban area, we suggest assigning the largest weight to TSWI AWEI0 when applying the CDWI method. We recommend that the above findings and Fig. 10 serve as general guidance (or a look-up table) for determining TSWI weights when using the CDWI method or for selecting WIs when using a TSWI method.
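A simulated land environment with prescribed fractions of vegetated, built-up, and shadowed land could be assembled by drawing pixels from per-class pools, as sketched below. This illustrates the sampling step only. The weight estimation itself is not reproduced because its procedure is not described in this section. The function name `sample_land_environment`, the pool sizes, and the example fractions are assumptions.

```python
import numpy as np

def sample_land_environment(pools, fractions, n_pixels, rng):
    """Draw n_pixels non-water pixel ids, mixing land classes by given fractions.

    pools:     dict mapping class name -> array of available pixel ids
    fractions: dict mapping class name -> fraction of the sample (sums to 1)
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    counts = {k: int(round(f * n_pixels)) for k, f in fractions.items()}
    # Absorb any rounding drift in the most abundant class.
    largest = max(fractions, key=fractions.get)
    counts[largest] += n_pixels - sum(counts.values())
    parts = [rng.choice(pools[k], size=c, replace=True) for k, c in counts.items()]
    return np.concatenate(parts), counts

# Hypothetical pools of non-water pixel ids for each land class.
pools = {
    "vegetated": np.arange(0, 5000),
    "built_up": np.arange(5000, 8000),
    "shadowed": np.arange(8000, 9000),
}
rng = np.random.default_rng(42)
sample, counts = sample_land_environment(
    pools, {"vegetated": 0.6, "built_up": 0.3, "shadowed": 0.1}, 1000, rng
)
```

Sweeping the three fractions over a grid and re-estimating the weights for each sample would give a table of weights per land environment, comparable in spirit to the guidance summarized in Fig. 10.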

Transferability of the CDWI
Unlike the common WI methods, which were designed for a specific sensor with fixed equations (Feyisa et al., 2014; McFeeters 1996; Xu 2006), the CDWI can also be considered a new framework that could readily be used in many applications involving different sensors. First, neither the number nor the form of the TSWI methods involved in the CDWI is fixed; both can be adjusted according to practical conditions. For example, existing water indices that are not used in this study, such as TCW (Crist 1985), WRI (Rokni et al., 2014), TSUWI, and MBWI, could be integrated readily into the CDWI method in further applications. Likewise, as newly designed water indices become available, they can be brought into the CDWI framework. Moreover, any water classification maps, whether obtained by TSWI methods or by more sophisticated methods (e.g., Random Forest and Support Vector Machine; see Acharya et al., 2016; Ireland et al., 2015), can be included in the CDWI method to determine the final water classification results. Second, although the proposed CDWI method was tested and demonstrated on Landsat-8 OLI images, it is also suggested for Landsat TM/ETM+ images because the WIs used here were all originally designed for Landsat TM/ETM+ images (Huang et al., 2018). Since these TSWI methods work well on the Landsat-8 OLI images in this study, they should be suitable for Landsat TM/ETM+ images as well. Third, the framework of the CDWI method can be applied to other types of images with bands different from those of Landsat images, such as MODIS (Sharma et al., 2015), Sentinel-2A/B (Du et al., 2016), and HJ-1A/B images (Lu et al., 2011). Because the bands of these images are very different from those of Landsat images, their sensor-dependent water indices should be carefully selected before using the CDWI method.
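One way to keep the index computation sensor-agnostic is to separate the band designations from the index equations, as in the following sketch. It covers only NDWI, (Green − NIR)/(Green + NIR), and MNDWI, (Green − SWIR1)/(Green + SWIR1). The band table lists the usual green, near-infrared, and first shortwave-infrared band names for each sensor, while the dictionary layout, the function names, and the reflectance values are illustrative assumptions.

```python
import numpy as np

# Band designations per sensor (green, near-infrared, shortwave-infrared 1).
BANDS = {
    "landsat8_oli": {"green": "B3", "nir": "B5", "swir1": "B6"},
    "landsat_tm_etm": {"green": "B2", "nir": "B4", "swir1": "B5"},
    "sentinel2": {"green": "B3", "nir": "B8", "swir1": "B11"},
}

def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0 where the denominator is 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def water_indices(image, sensor):
    """Compute NDWI and MNDWI from a dict of band name -> reflectance array."""
    m = BANDS[sensor]
    green, nir, swir1 = (image[m[k]] for k in ("green", "nir", "swir1"))
    return {
        "NDWI": normalized_difference(green, nir),
        "MNDWI": normalized_difference(green, swir1),
    }

# Two hypothetical pixels (water-like, vegetation-like) in Landsat-8 OLI bands.
image = {"B3": [0.10, 0.08], "B5": [0.02, 0.30], "B6": [0.01, 0.20]}
indices = water_indices(image, "landsat8_oli")
```

Supporting another sensor then only requires a new entry in `BANDS`, provided the bands are comparable. As noted above, the suitability of each index for a given sensor still has to be checked case by case.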
In summary, the proposed CDWI method has four potential advantages:
(1) The operation procedure of the CDWI is straightforward and requires only basic raster algebra. Users can bring any TSWI method into the CDWI framework.
(2) The CDWI is more robust than the TSWI methods, making it suitable for a wide range of applications over different water-land environments.
(3) The accuracy of the CDWI is less sensitive to threshold selection (both the predefined WI thresholds and T CDWI ) than that of the TSWI methods, so the need for tedious threshold tuning is reduced or avoided.
(4) The framework underlying the CDWI is neither WI-dependent nor sensor-dependent. It has the potential to be applied to other indices (e.g., impervious surface indices) and other sensors (e.g., Landsat TM/ETM+, MODIS, and Sentinel-2).

Conclusions
The TSWI methods are widely adopted in water mapping applications because they are easy to use and generally achieve acceptable performance. However, two concerns need to be carefully considered before applying them in practice: the selection of a WI and the determination of an appropriate threshold for the given WI. In practice, the answers to these two concerns can be affected by subjective factors, such as trial-and-error experiments and personal preference. To address these two concerns, we propose a new ensemble approach to using WIs for water mapping, namely the CDWI, which integrates five widely used WIs based on the collaborative decision-making principle.
A total of 145 subsite images were selected to represent different geographical areas with distinct water-land environments and different seasonal patterns. The performances of the CDWI method and the five TSWI methods were assessed in terms of accuracy and robustness. It was found that (1) the CDWI produced accuracies higher than or comparable to those of the five benchmark TSWI methods in most cases, making it less sensitive to application scenarios and, thus, suitable for a wider range of water-land environments; (2) the accuracy of the CDWI is much less sensitive to the predefined WI thresholds chosen for the TSWI methods; and (3) the underlying framework of the CDWI has great potential for transferability and further application. For example, it can be modified readily by adding new WIs in the future. Moreover, the principle underlying the CDWI method is not sensor-dependent and, thus, the proposed CDWI can be applied to different types of images, such as Landsat TM/ETM+, MODIS, Sentinel-2A/B, and HJ-1A/B images, in future applications.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.