remote sensing Estimating Water pH Using Cloud-Based Landsat Images for a New Classification of the Nhecolândia Lakes (Brazilian Pantanal)

: The Nhecol â ndia region, located in the southern portion of the Pantanal wetland area, is a unique lacustrine system where tens of thousands of saline-alkaline and freshwater lakes and ponds coexist in close proximity. These lakes are suspected to be a strong source of greenhouse gases (GHGs) to the atmosphere, the water pH being one of the key factors in controlling the biogeochemical functioning and, consequently, production and emission of GHGs in these lakes. Here, we present a new ﬁeld-validated classiﬁcation of the Nhecol â ndia lakes using water pH values estimated based on a cloud-based Landsat (5 TM, 7 ETM + , and 8 OLI) 2002–2017 time-series in the Google Earth Engine platform. Calibrated top-of-atmosphere (TOA) reﬂectance collections with the Fmask method were used to ensure the usage of only cloud-free pixels, resulting in a dataset of 2081 scenes. The pH values were predicted by applying linear multiple regression and symbolic regression based on genetic programming (GP). The regression model presented an R 2 value of 0.81 and pH values ranging from 4.69 to 11.64. A lake mask was used to extract the predicted pH band that was then classiﬁed into three lake classes according to their pH values: Freshwater (pH < 8), oligosaline (pH 8–8.9), and saline ( ≥ 9). Nearly 12,150 lakes were mapped with those with saline waters accounting for 7.25%. Finally, a trend surface map was created using the ALOS PRISM Digital Surface Model (DSM) to analyze the correlation between landscape features (topography, connection with the regional drainage system, size, and shape of lakes) and types of lakes. The analysis was in consonance with previous studies that pointed out that saline lakes tend to occur in lower positions compared to freshwater lakes. The results open a relevant perspective for the transfer of locally acquired experimental data to the regional balances of the Nhecol â ndia lakes.


Introduction
Lakes are transitory landscape features, sometimes created by catastrophes such as volcanic eruptions, floods, earthquakes, or human interventions, and they sometimes evolve slowly over a long period of time [1]. The morphology and distribution of lakes are related to physical, chemical, and biological events within the basins coupled to climatological constraints that play a major role  Monitoring and understanding the characteristics of global inland waters are of fundamental importance to scientists and policymakers. While conventional approaches tend to be limited in terms of spatial coverage and temporal frequency, remote sensing provides invaluable sources of data from local to global scales [19]. The chemical composition of the waters that supply the Pantanal is directly influenced by the lithology and land use of the surrounding areas [20,21], but both the fresh and saline lake waters of the Nhecolândia lakes belong to the same chemical family and suggest that most of the chemical compositional changes in the system are related to chemical sedimentation mainly involving Ca, Mg, K, Si, Al, and Fe [22][23][24][25]. The freshwater lakes present sodic, calcic, carbonate, chloride, and sometimes potassic waters, whereas the saline lakes contain sodic-carbonate water [23]. These patterns result in distinct conditions among the lakes that acquire different colors, making image classification difficult. Thus, classification using pH values is a suitable alternative for the classification of the lakes.
Previous studies have performed manifold classifications of the Nhecolândia lakes using passive and active sensors at short-term periods or single images (see [11,12,14,26]). Nowadays, Google Earth Engine (GEE) offers a new suitable cloud-based platform for environmental data analysis from local to planetary scales, with rapid access and processing of multiple orbital data from different missions [27]. The data catalogue contains a variety of standard Earth Science raster datasets consisting of imagery, geophysical, climate and weather, and demographic data collections with widely used geospatial datasets, such as the entire Landsat archive. GEE has exponentially increased the feasibility and reliability of remote sensing analysis, processing large volumes of data of long-term environmental analysis. Datasets that would previously have been analyzed as single, neighboring, or temporally sequential scenes can now be processed faster and mass bulked [28].
Taking into account the capabilities of GEE, we hereby propose an estimative of water pH values for Nhecolândia lakes using long-term time-series analyses of satellite images. These lakes are suspected to be a strong source of greenhouse gases (GHGs) to the atmosphere. Recent studies have highlighted that water pH is one of the key factors in controlling the biogeochemical functioning and, consequently, the methane, carbon dioxide, and nitrous oxide production and emission of water bodies [29,30]. In this context, the pH of surface water has been shown to be one of the key tools in controlling GHG emissions. Here, we create a new pathway to analyze the pH of the Nhecolândia lakes using a cloud-based time series (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017) of Landsat TM/ETM+ and OLI images, with threefold objectives: (I) To estimate pH values, (II) to analyze spectral signature variations of lakes with changes in pH, and (III) to correlate the different types of lakes with relief landforms and drainage networks.

Regional Settings
The Pantanal is located in the Upper Paraguay River Basin, at the center of the South-American continent ( Figure 1A). The floodplain is surrounded by plateaus that consist of calcareous formations in the north (Serra das Araras) and in the south (Serra da Bodoquena), basalts of the Serra Geral Formation mainly in the south-east, sandstone formations in the east, and crystalline rocks on the wetland basement [21,31]. Despite the flat relief with a low topographic gradient of about 0.3% (80 to 200 m), the Pantanal consists of several sub-regions with their own specificities regarding the date and duration of flooding [8], the sedimentation rates and geologic/geomorphologic characteristics [10,32], sediment mineralogy, and water chemical composition [21], among others. The floodplain is covered by quaternary sediments and is composed of several fluvial megafans, including that of the Taquari River, considered the largest active megafan in the world [18,33,34].
The Nhecolândia Region lies over the southern half of the Taquari megafan, comprising an area of 26,900 km 2 from the Taquari River to the Negro River in the south. Two distinct morphological zones are recognized: (I) The Upper Nhecolândia that is composed of pleistocenic morphologies such as abandoned channels/meander belts and paleolakes, nowadays gradually covered by sediments deposited by the Taquari River [35], and (II) the Lower Nhecolândia, the focus of our study (Figure 1), where thousands of lakes and ponds are spread in an area of~10,000 km 2 [8]. The area is subject to seasonal flooding, with peaks usually occurring from February to April, and a dry period from August Remote Sens. 2020, 12, 1090 4 of 21 to October [36]. Landscape units in the Nhecolândia Region comprise: (i) Forest woodlands, (ii) open wood savanna, (iii) open grass savanna, (iv) swampy grasslands, and (v) lakes [26]. The region presents a complex hydrographic network with water usually flowing along the hundred-meter-wide shallow waterways locally known as vazantes.
The climate is Tropical-Savanna (Aw, with "A" = Tropical and "w" = dry winter, [37]) according to the Köppen classification, with wet summers and dry winters. The mean annual temperature ranges from 20 • C to 27 • C [38] and the precipitation varies from 1200 to 1350 mm in the north-northeast, and from 710 to 1200 mm in the south-southwest [39][40][41]. The wetland is seasonally flooded by the inundation pulse [42], associated with low-pressure zones of the South American Summer Moonson [43]. The inundation of floodplains basically occurs in three ways [44]: (I) Overflow of natural river levees and flooding of lower-lying floodplains and waterways, draining fields locally referred to as vazantes, (II) a backwater effect of the Paraguay River, and (III) pooling of local rainfall due to low relief gradient.
The salinity observed in some lakes results from an ongoing process of accumulation controlled by the occurrence of low-permeability horizons in the soil cover surrounding the lakes [7] under the current climate conditions. In addition, recent studies pointed out that the salinity is the results of regional climate changes and biogeochemical transformations during the late Holocene [45,46]. The Nhecolândia waters evolved in an alkaline pathway under the influence of evaporation. In this context, the water pH increases with concentration, reaching values close to 10 [30]. Nevertheless, these variations in water pH are still an ongoing process of accumulation and evaporation under relatively humid climatic conditions and poor drainage conditions [7,23].
Baías and salobras are water bodies with pH < 9 and electrical conductivity (EC) below 750 µS cm −1 , connected to shallow waterways (vazantes) during the wet season and shrinking in the dry season, leaving the floor of some lakes totally exposed and occupied by herbaceous vegetation [26,47]. The salobras are usually deeper, less connected to the major drainage systems, and have higher pH values and total dissolved solids (TDS) concentrations than most baías [47]. Salinas are normally perennial, with circular to elliptical shapes in depressions of approximately 500-1000 m in diameter, and are 0.5-3.0 m lower in elevation than neighboring freshwater lakes [48]. This type of lake always presents pH values above 9 and TDS values between 1000 and 10,000 mgL −1 , and is usually free from annual floods once separated from the regional drainage system by forested sandy barriers (known as cordilheiras) that are 2-3 m higher than adjacent plains [47].

Landsat Surface Reflectance Dataset
We used a 2002-2017 time-series of Landsat images (5 TM, 7 ETM+, and 8 OLI) in the GEE online platform to acquire the pH of the Nhecolândia lakes ( Figure 2). To ensure the use of the real pixel values, we used calibrated top-of-atmosphere (TOA) reflectance images that are included in the "USGS Landsat (5 TM, 7, and 8) Collection 1 Tier 1 TOA Reflectance" on GEE. These collections are atmospherically corrected by the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) [49] that includes cloud, shadow, water, and snow masks produced using per-pixel saturation analysis and a CFMASK filter [50]. The orthorectified and atmospherically corrected images have Level-1 Terrain Precision (L1TP) processed data, with well-characterized radiometry and inter-calibration across the different Landsat sensors [50]. All the spectral bands were used in the model: 4 visible and near-infrared (VNIR), and 2 short-wave infrared (SWIR) bands of 30 m spatial resolution. Thermal bands and Landsat 8 coastal/aerosol bands were excluded. Landsat 8 OLI satellite band names were renamed in GEE to match the nomenclature of Landsat 5 and 7 (Landsat 8 OLI bands 2, 3, 4, 5, 6, and 7 renamed to 1, 2, 3, 4, 5, and 7, respectively).

Water and Vegetation Indexes
As the water pH is affected by natural variables such soil minerals, vegetation, precipitation, and temperature [53,54], we also included synthetic bands to the obtained Landsat collection ( Figure  2): (i) The Normalized Difference Vegetation Index (NDVI) [55], (ii) Automated Water Extraction Index with no shadow features (AWEInsh) [56], (iii) Normalized Difference Water Index (NDWI) [57] and (iv) Modified Normalized Difference Water Index (MNDWI) [58]. NDVI is a numerical indicator that assesses the occurrence of green vegetation. It is calculated as the ratio of the measured reflectance in the red and near infrared (NIR) bands [55], computed as follows: where is the reflectance in the NIR spectral range extracted from Landsat NIR bands (~0.76 to 0.90 μm) and is the reflectance in the red spectral range extracted from Landsat red bands (~0.63-0.69 μm).
The AWEI is divided into two indexes depending on the occurrence of shadow in the investigated area [56]. Here, we applied AWEInsh that is suitable for areas with no shadows, as described in Equation (2). Cloud coverage was masked using Quality Assessment (QA) bands contained in each Landsat sensor, using the Fmask method [51] to ensure the usage of only cloud-free pixels over the sampled lakes. Absent values of a given period were estimated through temporal interpolation of neighboring values (the mean of pixel values of scenes acquired before and after the acquisition of the scene containing the null pixel value) using the moving average smoothing method (adapted from Ivits et al. [52]) After the cloud masking, we obtained a dataset with 2081 surface reflectance scenes. The final selection of the image collections was then merged into a single collection to acquire the median values of each pixel for the 2002-2017 time-series.
NDVI is a numerical indicator that assesses the occurrence of green vegetation. It is calculated as the ratio of the measured reflectance in the red and near infrared (NIR) bands [55], computed as follows: where ρ NIR is the reflectance in the NIR spectral range extracted from Landsat NIR bands (~0.76 to 0.90 µm) and ρ red is the reflectance in the red spectral range extracted from Landsat red bands (~0.63-0.69 µm).
The AWEI is divided into two indexes depending on the occurrence of shadow in the investigated area [56]. Here, we applied AWEInsh that is suitable for areas with no shadows, as described in Equation (2).
where ρ green is the reflectance in the green spectral range extracted from the Landsat green band, and ρ nir is the reflectance in the red spectral range. The major limitation of NDWI is that it is sensitive to signal noises coming from land cover features in built-up and disturbed areas. Xu [58] noted that water bodies have stronger absorbance in the SWIR band than that observed in the NIR band, and highlighted that the MNDWI might be more efficient in evaluating water bodies in areas with a higher influence of land cover features not related to water. The index is described according to Equation (4): where ρ green is the reflectance in the green spectral range and ρ swir is the reflectance in the red spectral range.

Field Sample Data
Ground-truth data comprised a subset of 130 georeferenced surface water samples collected at a depth of approximately 20 to 30 cm (see location on Figure 1C) used for validation of the model. The pH values were acquired using a pH-meter (Hanna Instruments HI98140), with stabilization between 10 s and 1 min. Each sample contains pH information collected at the time of data acquisition measured directly on the surveyed lake. Some lakes were sampled more than once but in different years or seasons. The measured pH values within the 130 water samples followed a bimodal distribution (see Figure S1 in the Supplementary Materials), with values centered between 5.7 and 8.2. We also observed the occurrence of hypersaline lakes with pH values ranging from 9 to 10. The selected samples were well-distributed within the Nhecolândia region and were collected by different groups in different field campaigns. Thus, the field sample data represent the average pH condition of the Nhecolândia lakes.
All lake categories (freshwater, saline, and oligosaline) were surveyed. The final database presented samples collected by different groups, as follows: (i) 65 samples collected by the authors; 4 in 2009, 39 in 2014, 12 in 2015, and 10 in 2016; and (ii) 66 water samples measured by Costa et al. [12] in 2008. The water pH values of Costa et al. [12] were also determined in situ by using a Hydrolab Quanta G multimeter.
Samples not related to water bodies, as well as those located in lakes smaller than the minimum possible area mapped by the Landsat data (900 m 2 ), or in areas sampled in swampy regions adjacent to the lakes (vazantes) were excluded to avoid outliers. The resulting databases were combined and subsequently divided into independent validation and training datasets to predict the regional occurrence of saltwater and freshwater lakes according to the methodology described in the following sections.

Prediction of pH Values
The prediction of pH values to separate saline lakes from freshwater lakes at a broad scale was achieved by considering linear and non-linear regression methods. The filtered composition with a higher correlation between the auxiliary image bands and field-measured pH values among the five seasonal filters was selected using stepwise multiple linear regression (SMLR). The spectral values of the lakes used for pH prediction were collected for the pixel (30 × 30 m) that overlaps the georeferred water pH sample collected in the field. Field-measured pH values were set as dependent variables, while median Landsat pixels from the 2002 to 2017 time-series were set as explanatory variables. The pH values were predicted by applying linear multiple regression and symbolic regression based on genetic programming (GP), a powerful machine-learning modeling technique introduced by Koza [59]. The objective was to select the best global linear model (full collection from 2002 to 2017 or collections with specific seasonal filters) to predict the dependent variables. Unlike linear methods, the symbolic regression searches both the parameters and the form of the equations, allowing the automatic generation of GP functions [60]. The resulting equations of the regression between in situ pH and spectral bands (Landsat bands 1, 2, 3, 4, 5, 7, and synthetic bands NDVI, AWEInsh, NDWI, and MNDWI), were used to simulate pH bands along the Nhecolandia lakes.
We randomly classified the 130 samples into subsets, comprising training (90 samples) and independent validation datasets (40 samples). The training dataset was used to predict values of the pH and to derive the regression models to be used for generating the pH image band according to median Landsat time-series collection. The independent validation dataset was used to evaluate the quality of the regression models. We used the coefficient of determination (R 2 ), root-mean-square error (RMSE), and Akaike's Information Criterion (AIC) to access the accuracy of the predicted pH values against the measured validation values. The RMSE and AIC are defined as whereρ i and ρ i are the observed and predicted pH values, respectively, i represents the lake sample, P is the number of parameters used, n is the total number of observations, and Nln is the natural logarithm. The best model is the one with RMSE values closer to 0 and smaller AIC values. The GP uses a set of arithmetic and complex operators to model the relationship between pH and auxiliary Landsat pixels, by testing different operators until the best prediction model is achieved. The module chosen for the genetic programming takes numerous solutions to find the best fit for the dependent variable (pH) related to the independent variables. The equation found by the GP is then performed in the GEE environment applied to the bands or indexes with higher correlation, with the field-measured pH values that will generate the predicted pH band. Finally, we conducted principal component analysis (PCA) [61] to evaluate how different band groups and measured pH values explain distinct lake characteristics.

Analyzing the Spatial Distribution of Saline and Freshwater Lakes
We performed a supervised classification using the "Maximum Likelihood Classification" method to create a lake mask. However, for the analyzed period, most lakes of the Nhecolândia southwest portion were colonized by shrub vegetation or covered by macrophytes, resulting in the misclassification of a large number of lakes that were being classified as vazantes. Conversely, in the Geocover circa 1990, the majority of lakes were devoid of vegetation (see Supplementary Materials), fitting our purpose perfectly. This product is a global set of regional images mosaicked from the Landsat collection imagery collected from 1989 to 1993 [62]. The Geocover multispectral data, though systematically constrained to an 8-bit dynamic range, display a wide range of reflectance intensities within a single one-degree area [62], and resulted in better classification of the area. The final lake mask was used to extract the predicted pH band that was then classified into three lake classes according to their pH values: 1) Freshwater (pH < 8), 2) oligosaline (pH 8-8.9), and 3) saline (pH ≥ 9). These ranges was based on the studies of Rezende-Filho et al. (2012) and Furian et al. (2013).
The major goal of this analysis was to evaluate, at broad scale, the correlation between landscape features (topography, connection with regional drainage system, size, and shape of lakes) and variations in the lakes' pH estimated from Landsat bands, as well as to verify the relative topographic difference between saline and freshwater lakes.
The PRISM Digital Surface Model (DSM) [63], with a 30 m spatial resolution, was the altimetric data considered in the analysis. The original DSM presents values of surface altitude in meters, projected according to the vertical reference of the EGM 96 geoid. A trend surface generated from a third-order global polynomial interpolation was subtracted from the original model, creating a residual model of altitude based on the method proposed by Zani et al. [64]. The residual model, in our case, represents the local relative difference between lake levels and the surrounding environment.
In addition to estimation of the relative altitude of lakes, we also used the DSM to estimate the preferential regions of drainage flux to estimate the mean Euclidean distance between the centroid of classified lakes and the surrounding drainage network estimated from hydrological modeling of the DSM. We also correlated other metrics such as area, perimeter, and compactness of lake classes with the regional drainage system, as previously discussed by other researchers [7,26,65]. The compactness index was estimated by considering the Polsby-Popper index that varies from 0 to 1; the closer the value is to 1, the more rounded the observed feature (lake) is.

Spatial Modeling of Nhecolândia Lakes
By using GEE, we were able to analyze over 2081 scenes of median values of spectral signatures with cloud-free pixels for the 2002-2017 time-series, including the TM, ETM, OLI bands, and synthetic indexes. The use of five seasonal filters was necessary because the Pantanal is seasonally flooded. The correlation values between explanatory bands and dependent variables considering five different scenarios, as well as the results of SMLR analysis, are summarized in Table 1. The highest correlation was observed by considering the high-water season Landsat collection from 2002 to 2017, with an R 2 of 0.72 and RMSE of 0.90. It was followed by the collection for the same period, but with no seasonal filter. We also verified the correlation of pH values with spectral data at sampled locations for the same sample data acquisition period, considering separately the period of the two field campaigns (high water season in 2008 and the period from 2014 to 2017). We observed an R 2 of 0.67 for the median values for the 2008 samples (high water season from January to May 2008), while Remote Sens. 2020, 12, 1090 9 of 21 the 2014 to 2017 samples resulted in a correlation with an R 2 of 0.64 between the pH and spectral values. Therefore, the longer the period considered for the median Landsat data, the better the correlation that exists between the pH and spectral values (Table 1), explaining why the full collection (2002 to 2017) within the wet season in Table 1 resulted in the best SMLR model. Thus, the high-water collection from 2002 to 2017 was used as the reference collection for modeling the pH of the Nhecolândia lakes.
The PCA of the pH and median spectral values was applied to highlight the original and synthetic Landsat bands that better explain the spatial variability of the water pH within the sampled lakes. The two first factorial axes, F1 (first principal component) and F2 (second principal component) represented 79.64% and 10.77%, respectively, of the explained total variance (Figure 3a). Therefore, the first factorial plan explains up to 90% of the total variance in the sampling values (Figure 3b). It is also interesting to note that the third factorial, F3, explained less than 7% of the variance followed by F4 that explained 3% of the variance, making it difficult to interpret the real significance of these factors within variability in the spectral signatures and pH of the lakes (Figure 3b). related to the pH, comprises the synthetic band indexes spatially suited for studies of water bodies that refer to the group of interest in this research. The second group is composed of the reflectance of the original Landsat bands (Figure 3a). The first factorial plan, with 90% of the variance, highlights the weight of the spectral difference in the Nhecolândia lakes, with a wide range of chemical compositions from freshwater to saline lakes.
Lower spectral variability was observed in the 130 sampled lakes for the visible bands (see boxplot of band variance and basic statistics: Media, variance, standard deviation, etc., respectively available in Figure Suppl. 2 and Table S2 in the Supplementary Material). The lake with a maximum reflectance value in bands 1, 2, and 3 has a very close reflectance value compared to the lake that presented the minimum value. Band 4 (NIR) presented the higher variation (minimum of 0 and maximum of 2700) among the spectral bands. A higher variability of the reflectance values was only achieved by using the synthetic bands (water and vegetation indexes). The MNDWI and NDWI presented the higher variation of reflectance values with contrasting maximum and minimum values. The MNDWI band presented the highest variance among the sampled lakes. This variability is essential to represent the chemical and biological diversity of the lakes, which will ultimately better represent the pH variation in the sampled lakes.
The higher variance in the NDWI and MNDWI can be explained by the seasonality of the Pantanal wetland, which is directly affected by the Summer Monsoon that causes intense rainfall and large floods [43], but also experiences annual droughts after the passage of the flood-pulse, resulting in rivers and lakes decreasing in water level, with some eventually drying out [66][67]. We observed two major groups of variables within the analyzed samples (group 1: MNDWI, AWEInsh, NDWI, and pH; group 2: Green, red, blue, nir, swir 1, swir 2, and NDVI bands). However, most of the variability is explained by the first factor, with the variation in the AWEInsh, NDWI, and MDWI directly related to the spatial variability in the pH (Figure 3a). The first set of data, closely related to the pH, comprises the synthetic band indexes spatially suited for studies of water bodies that refer to the group of interest in this research. The second group is composed of the reflectance of the original Landsat bands (Figure 3a). The first factorial plan, with 90% of the variance, highlights the weight of the spectral difference in the Nhecolândia lakes, with a wide range of chemical compositions from freshwater to saline lakes.
Lower spectral variability was observed in the 130 sampled lakes for the visible bands (see boxplot of band variance and basic statistics: Media, variance, standard deviation, etc., respectively available in Figure S2 and Table S2 in the Supplementary Materials). The lake with a maximum reflectance value in bands 1, 2, and 3 has a very close reflectance value compared to the lake that presented the minimum value. Band 4 (NIR) presented the higher variation (minimum of 0 and maximum of 2700) among the spectral bands. A higher variability of the reflectance values was only achieved by using the synthetic bands (water and vegetation indexes). The MNDWI and NDWI presented the higher variation of reflectance values with contrasting maximum and minimum values. The MNDWI band presented the highest variance among the sampled lakes. This variability is essential to represent the chemical and biological diversity of the lakes, which will ultimately better represent the pH variation in the sampled lakes.
The higher variance in the NDWI and MNDWI can be explained by the seasonality of the Pantanal wetland, which is directly affected by the Summer Monsoon that causes intense rainfall and large floods [43], but also experiences annual droughts after the passage of the flood-pulse, resulting in rivers and lakes decreasing in water level, with some eventually drying out [66,67].

Regression Models for Predicting pH in Lakes
The predictive models for the pH of lakes were based on the high-water season from 2002 to 2017 that presented the highest correlation between the median pixel values and field-measured pH. We considered only models capable of explaining more than 60% of the pH spatial variation within the lakes. As the pH values are not constant, the use of a long-term time series gives us the typical value of each pixel, and by using median values, we avoid outliers.
The GP self-learning model was set to use the logarithm-squared error for automatic selection of the most suitable model. The model based on symbolic regression resulted in seven equations for solving the correlation between the pH and Landsat bands, from simple to more complex solutions, and we selected the equation (Equation (7)) that offered the higher correlation between the pH and explanatory data (Figure 4). Landsat bands 2, 3, 4, and 7 were not used in any of the seven predictive models, while MNDWI, NDWI, and AWEInsh appeared in 6, 5, and 4 of the resulting GP models, respectively. Hence, two models were selected, one derived from genetic programing based on symbolic regression (Equation (7)) and the other based on SMLR (Equation (8)). Four basic arithmetic operators (+, −, *, and /) and seven more complex operators ( √ , x 2 , power, tang, sin, cos, and log) were considered for generating the final GP model, as shown in Equation (7).
where pH ( where pH (SMLR) is the pH predicted by SMLR, band 2, and 4, AWEInsh and NDWI are the Landsat bands from 2002 to 2017, and the letters represent coefficients of the equation, as follows: a: 1.36338; b: 0.00110; c: 0.00818; d: 0.00392; e: 0.00120. Genetic programing based on symbolic regression (Equation (7)) resulted in a more complex solution than SMLR (Equation (8)), but the correlation results were better in explaining the spatial variability of pH in the Nhecolância lakes considering both the validation and total sample dataset ( Figure 4). The model obtained by GP explained more than 85% of the spatial pH variability (Figure 4c), while the model generated by SMLR explained 74% of the variability (Figure 4d), taking into account the validation dataset. Moreover, the correlation considering both validation and training datasets (Figure 4a,b) is better in the GP regression model, with an R 2 of 0.81 against 0.73 for the SMRL model.
The GP model proved the most accurate for modeling pH in the Nhecolândia using explanatory Landsat bands (Figure 4). The GP model based on the validation dataset had R 2 , RMSE, and AIC values of 0.85, 0.55, and −32.8, respectively, while the values of the R 2 , RMSE, and AIC for SMLR were 0.74, 0.85, and −27.06, respectively. As a result, Equation (7) was used to generate a synthetic Landsat band in the GEE cloud platform, referred to here as the pH Band. This proposed band allows for the systematic segmentation of saline and freshwater lakes. The Nhecolândia saline lakes tend to be more perennial, normally devoid of vegetation, and with high concentrations of cyanobacteria; on the other hand, the freshwater lakes are mostly temporarily affected by the flood pulse and have large amounts of aquatic macrophytes [7,13,29,68,69]. That explains why the GP model using the NDWI and MNDWI indexes presented the best correlation with the field-measured pH values. To demonstrate that saline lakes tend to be more perennial than freshwater lakes, we performed a time-series of NDWI for all the field-surveyed lakes, showing the higher values of NDWI for saline and oligosaline lakes ( Figure 5). The saline lakes may also present different colors such as green, black, or crystalline, while the freshwater lakes present crystalline water due to the bloom of cyanobacteria (see example in Figure 1e). The Nhecolândia saline lakes tend to be more perennial, normally devoid of vegetation, and with high concentrations of cyanobacteria; on the other hand, the freshwater lakes are mostly temporarily affected by the flood pulse and have large amounts of aquatic macrophytes [7,13,29,68,69]. That explains why the GP model using the NDWI and MNDWI indexes presented the best correlation with the field-measured pH values. To demonstrate that saline lakes tend to be more perennial than freshwater lakes, we performed a time-series of NDWI for all the field-surveyed lakes, showing the higher values of NDWI for saline and oligosaline lakes ( Figure 5). The saline lakes may also present different colors such as green, black, or crystalline, while the freshwater lakes present crystalline water due to the bloom of cyanobacteria (see example in Figure 1E).
The pH is an "invisible" parameter for remote sensing analysis, unlike dissolved or suspended sediment concentration (i.e., Park & Latrubesse [70]); thus, indirect factors such as the seasonality and the bloom of cyanobacteria explain the high correlation of the GP model. Due to the bloom of cyanobacteria, the saline lakes reflect even in the NIR and SWIR bands (used in the water indexes), while the baías absorb most of it. The bloom of cyanobacteria also shows an increment in the levels of chlorophyll that causes increases in the pH values [71]. Saline lakes typically have higher dissolved inorganic and organic C concentrations, and the biogeochemistry of C in saline lakes can be different from that of freshwater lakes such as carbonate precipitation/dissolution reactions and the chemical enhancement of CO 2 exchange rates at the air/water interface, being much more prevalent in saline environments [72,73]. The pH is an "invisible" parameter for remote sensing analysis, unlike dissolved or suspended sediment concentration (i.e., Park & Latrubesse [70]); thus, indirect factors such as the seasonality and the bloom of cyanobacteria explain the high correlation of the GP model. Due to the bloom of cyanobacteria, the saline lakes reflect even in the NIR and SWIR bands (used in the water indexes), while the baías absorb most of it. The bloom of cyanobacteria also shows an increment in the levels of chlorophyll that causes increases in the pH values [71]. Saline lakes typically have higher dissolved inorganic and organic C concentrations, and the biogeochemistry of C in saline lakes can be different from that of freshwater lakes such as carbonate precipitation/dissolution reactions and the chemical enhancement of CO2 exchange rates at the air/water interface, being much more prevalent in saline environments [72,73].
In limnological studies, pH and salinity are not necessarily synonyms, and the aim of this study was not to measure the salinity of the lakes, but the pH. However, it is important to highlight that pH and electrical conductivity (EC) have a direct correlation in the Pantanal lakes. Furian et al. [7] analyzed more than 300 samples collected on surface water and found a high correlation (R² = 0.87) between pH and the logarithm of EC. Therefore, it is safe to assume that pH is a direct proxy of water salinity in the Nhecolândia lakes. In our dataset, we found an R² slightly smaller (0.745, see graph in the Supplementary Material); however, one should remember that our dataset uses field samples collected in different campaigns and by distinct groups, once we also used samples from Costa et al. [12], which could explain the smaller value of the R 2 . We believe the model properly represents the reality observed in the field as we could capture the variation in pH values in the Nhecolândia region with a high correlation between field samples and our pH band. However, it is important to highlight that we did not aim to propose a universal pH-predicting model for the region. Our goal was to find a way to predict pH in non-observed lakes, in order to create a systematic map of lakes' pH, which is essential to further understand the geochemistry of the region.

Landscape Features Related to the Distribution of Lakes in the Nhecolândia
Our spatial analysis mapped around 12,150 lakes for the Lower Nhecolândia region (Figure 6). Based on the predicted pH band, ~92.5% of the lakes are freshwater lakes (pH values ranging 4.69 to 7.9). Nearly 900 lakes presented pH values of oligosaline and saline waters, with values ranging from In limnological studies, pH and salinity are not necessarily synonyms, and the aim of this study was not to measure the salinity of the lakes, but the pH. However, it is important to highlight that pH and electrical conductivity (EC) have a direct correlation in the Pantanal lakes. Furian et al. [7] analyzed more than 300 samples collected on surface water and found a high correlation (R 2 = 0.87) between pH and the logarithm of EC. Therefore, it is safe to assume that pH is a direct proxy of water salinity in the Nhecolândia lakes. In our dataset, we found an R 2 slightly smaller (0.745, see graph in the Supplementary Materials); however, one should remember that our dataset uses field samples collected in different campaigns and by distinct groups, once we also used samples from Costa et al. [12], which could explain the smaller value of the R 2 . We believe the model properly represents the reality observed in the field as we could capture the variation in pH values in the Nhecolândia region with a high correlation between field samples and our pH band. However, it is important to highlight that we did not aim to propose a universal pH-predicting model for the region. Our goal was to find a way to predict pH in non-observed lakes, in order to create a systematic map of lakes' pH, which is essential to further understand the geochemistry of the region.

Landscape Features Related to the Distribution of Lakes in the Nhecolândia
Our spatial analysis mapped around 12,150 lakes for the Lower Nhecolândia region ( Figure 6). Based on the predicted pH band,~92.5% of the lakes are freshwater lakes (pH values ranging 4.69 to 7.9). Nearly 900 lakes presented pH values of oligosaline and saline waters, with values ranging from 8 to a maximum of 11.64. It is important to note that the halo of freshwater pixels ( Figure 6B,C) around the saline lakes is due to the influence of seasonality.
Remote Sens. 2020, 12, x FOR PEER REVIEW 13 of 22 8 to a maximum of 11.64. It is important to note that the halo of freshwater pixels (Figure 6b,6c) around the saline lakes is due to the influence of seasonality. The global surface (second-order polynomial interpolator) generated from ALOS PRISM DSM had an R² of 0.93 compared to the original DSM PRISM that shows a high global topography component within the studied area. In our study, we found that a global third-order polynomial interpolator was useful for evaluating the relative position of the lakes (Figure 7a). The saline lakes tend to occur in lower positions compared to freshwater lakes, and there is continuous deepening from freshwater to saltwater lakes, where areas with a pH above 9 tend to occur on topographic positions lower than 4 m compared to the global DSM (Figure 7a).
We also verified two other assumptions about the Nhecolândia lakes, evolving size and typical shape. Saline lakes (Figure 7) tend to be rounded depressions ~500-1000 m, with diameters slightly larger than those of freshwater lakes, being 0.5 to 4 m lower in elevation than the freshwater areas (Figure 7a), and are cut off from floods by sandy barriers. They also tend to be larger with greater areas (Figure 7c) and perimeters (Figure 7d) and are usually deeper than freshwater lakes (Figure 7a) by an average of 0.8 to 1.5 m. The Polsby-Popper test also showed that there is a difference in shape between freshwater and saltwater lakes, where areas of saline lakes tend to be rounder (Figure 7b).
The estimation of relative distance from major waterways (Figure 8) showed no clear difference in distance from drainage networks between freshwater and oligosaline lakes (Figures 8a,b). However, the areas of saline lakes tend to be less connected than other lakes, with a higher frequency of lakes occurring at distances of 500 to 800 m from the major drainage system (Figure 8b). In addition, more than 44% of the saline lakes occurred at distances greater than 900 m from the drainage network, while 72% of freshwater lakes and 71% of oligosaline lakes occurred within less than 500 m from major drainage networks (Figure 8b). Thus, the wooded higher grounds (locally known as cordilheiras) surround these saline lakes, preserving them from the massive freshwater supply by the surface during the floods in the Pantanal [7,48]. They are exclusively supplied with subsurface flows when the freshwater level exceeds a natural soil threshold not visible within wooded areas. Figure 6. Spatial distribution of lakes' pH according to the pH predicted by the GP method (Equation (7)) for the entire study area. Areas B and C highlight two different regions with high and low frequencies of salt-water lakes, respectively (B,C). The histogram highlights the absolute frequency distribution of lakes according to the range of pH values observed in the region (A).
The global surface (second-order polynomial interpolator) generated from ALOS PRISM DSM had an R 2 of 0.93 compared to the original DSM PRISM that shows a high global topography component within the studied area. In our study, we found that a global third-order polynomial interpolator was useful for evaluating the relative position of the lakes (Figure 7a). The saline lakes tend to occur in lower positions compared to freshwater lakes, and there is continuous deepening from freshwater to saltwater lakes, where areas with a pH above 9 tend to occur on topographic positions lower than 4 m compared to the global DSM (Figure 7a).
We also verified two other assumptions about the Nhecolândia lakes, evolving size and typical shape. Saline lakes (Figure 7) tend to be rounded depressions~500-1000 m, with diameters slightly larger than those of freshwater lakes, being 0.5 to 4 m lower in elevation than the freshwater areas (Figure 7a), and are cut off from floods by sandy barriers. They also tend to be larger with greater areas (Figure 7c) and perimeters (Figure 7d) and are usually deeper than freshwater lakes (Figure 7a) by an average of 0.8 to 1.5 m. The Polsby-Popper test also showed that there is a difference in shape between freshwater and saltwater lakes, where areas of saline lakes tend to be rounder (Figure 7b).
The estimation of relative distance from major waterways ( Figure 8) showed no clear difference in distance from drainage networks between freshwater and oligosaline lakes (Figure 8a,b). However, the areas of saline lakes tend to be less connected than other lakes, with a higher frequency of lakes occurring at distances of 500 to 800 m from the major drainage system (Figure 8b). In addition, more than 44% of the saline lakes occurred at distances greater than 900 m from the drainage network, while 72% of freshwater lakes and 71% of oligosaline lakes occurred within less than 500 m from major drainage networks (Figure 8b). Thus, the wooded higher grounds (locally known as cordilheiras) surround these saline lakes, preserving them from the massive freshwater supply by the surface during the floods in the Pantanal [7,48]. They are exclusively supplied with subsurface flows when the freshwater level exceeds a natural soil threshold not visible within wooded areas. The dimensional characteristics of the lakes, and notably the relationship between the perimeter (water supply zone) and the volume (surface-depth) conditions the concentration rate of the lakes, and, therefore, their water pH.
Remote Sens. 2020, 12, x FOR PEER REVIEW 14 of 22 The dimensional characteristics of the lakes, and notably the relationship between the perimeter (water supply zone) and the volume (surface-depth) conditions the concentration rate of the lakes, and, therefore, their water pH.  The shape of the most alkaline lakes evolves, in part, by the chemical withdrawal of elements (mainly Si, Al, Mg, and K) from the sediments and clay neoformations in the surrounding beach [22]. This evolution conditions the rounded shape of the saline lake contours and can lead to coalescence between several lakes [74].
The presence of hundreds of saline lakes among thousands of freshwater lakes is one of the unique aspects of the lower Nhecolândia region. Knowing the exact number of existing lakes in the Nhecolândia is a hard task to accomplish and different amounts have been reported. A pioneering survey [8] (Por, 1995) speculated~10,000 lakes, with about 10% of the lakes being saline. A recent lake inventory based on field observations and orbital data [12] reported for about 8851 lakes, with 7.19% of them being salinas. The number proposed by Por [8] was an estimate, and differences between our predictions and those by Costa et al. [12] can be attributed to two main causes: 1) Usage of different methods and instruments for lake classification; for example, they used radar while we used optical sensors; 2) toward the south-westward section of the Lower Nhecolândia, many of the lakes have been colonized by shrubs and macrophyte vegetation, which may lead to the misclassification of some areas. Morphologically, the feature is still a lake, but the vegetation causes changes in the spectral response. The shape of the most alkaline lakes evolves, in part, by the chemical withdrawal of elements (mainly Si, Al, Mg, and K) from the sediments and clay neoformations in the surrounding beach [22]. This evolution conditions the rounded shape of the saline lake contours and can lead to coalescence between several lakes [74].
The presence of hundreds of saline lakes among thousands of freshwater lakes is one of the unique aspects of the lower Nhecolândia region. Knowing the exact number of existing lakes in the Nhecolândia is a hard task to accomplish and different amounts have been reported. A pioneering survey [8] (Por, 1995) speculated ~10,000 lakes, with about 10% of the lakes being saline. A recent lake inventory based on field observations and orbital data [12] reported for about 8851 lakes, with 7.19% of them being salinas. The number proposed by Por [8] was an estimate, and differences between our predictions and those by Costa et al. [12] can be attributed to two main causes: 1) Usage of different methods and instruments for lake classification; for example, they used radar while we used optical sensors; 2) toward the south-westward section of the Lower Nhecolândia, many of the lakes have been colonized by shrubs and macrophyte vegetation, which may lead to the misclassification of some areas. Morphologically, the feature is still a lake, but the vegetation causes changes in the spectral response.
The co-existence of saline and freshwater lakes is due to a differential hydrological regime controlled by the morphology of the soil cover, particularly by the presence of a morphological threshold consisting of green and grey sandy loam horizons, with high sodium fractions partially cemented by silica [25]. Most of the saline lakes are concentrated on the central and southeast portion, and some on the northwest portion. Toward the southwest portion, most of the lakes are freshwater and we observed practically no salinas (Figure 6b). One of the explanations for the lack of salinas in this region may be due to the avulsion that occurred in the Taquari River. The process began crevassing in 1988 and ended with the abandonment of the former channel in 1998 [33,34], changing its mouth from the southwest to ~100 km westward (see a picture of the Taquari avulsion in the Supplementary Material). This gradually triggered hydrological and phytophysiological changes in the pattern, intensity, and duration of the annual floods and also affected the water table dynamic, The co-existence of saline and freshwater lakes is due to a differential hydrological regime controlled by the morphology of the soil cover, particularly by the presence of a morphological threshold consisting of green and grey sandy loam horizons, with high sodium fractions partially cemented by silica [25]. Most of the saline lakes are concentrated on the central and southeast portion, and some on the northwest portion. Toward the southwest portion, most of the lakes are freshwater and we observed practically no salinas ( Figure 6B). One of the explanations for the lack of salinas in this region may be due to the avulsion that occurred in the Taquari River. The process began crevassing in 1988 and ended with the abandonment of the former channel in 1998 [33,34], changing its mouth from the southwest to~100 km westward (see a picture of the Taquari avulsion in the Supplementary Materials). This gradually triggered hydrological and phytophysiological changes in the pattern, intensity, and duration of the annual floods and also affected the water table dynamic, one of the factors responsible for recharging the lakes.

Remote Sensing Big Data for Inland Environments
The approach adopted in GEE for the spatial modeling and estimation of water pH was subdivided into different steps. The first and most important one, however, comprised the image tiling method implemented in GEE and was based on pixel band algebra [27]. Accordingly, processing each output tile requires retrieving only a small number of tiles for each input band that involves the concurrent processing of a limited volume of image pixels for collections with hundreds of images ( Figure 9). Thus, GEE allows for the fast computation of results at any requested scale or projection that would be impractical if working offline by downloading all Landsat scenes acquired from 2002 to 2017 (Figure 9). This aspect was especially important considering the studied area as we observed an increasing correlation between the pH values and the explanatory variables for longer periods (Figure 9). This assumption was only confirmed after considering a large number of observations, or in the case of the GEE, pixels collected at different time-periods over the same lakes. This was only possible by employing remote sensing Big-Data in GEE. The correlation (R 2 ) between the pH and the most correlated band (NDWI) continually increased according to the number of scenes considered in the analysis (Figure 9). This aspect is explained by the fact that we calculated the median pixel values of the lakes and a greater number of observations, probably better expressing the typical signatures of the lakes in a given location.
subdivided into different steps. The first and most important one, however, comprised the image tiling method implemented in GEE and was based on pixel band algebra [27]. Accordingly, processing each output tile requires retrieving only a small number of tiles for each input band that involves the concurrent processing of a limited volume of image pixels for collections with hundreds of images ( Figure 9). Thus, GEE allows for the fast computation of results at any requested scale or projection that would be impractical if working offline by downloading all Landsat scenes acquired from 2002 to 2017 (Figure 9). This aspect was especially important considering the studied area as we observed an increasing correlation between the pH values and the explanatory variables for longer periods (Figure 9). This assumption was only confirmed after considering a large number of observations, or in the case of the GEE, pixels collected at different time-periods over the same lakes. This was only possible by employing remote sensing Big-Data in GEE. The correlation (R²) between the pH and the most correlated band (NDWI) continually increased according to the number of scenes considered in the analysis (Figure 9). This aspect is explained by the fact that we calculated the median pixel values of the lakes and a greater number of observations, probably better expressing the typical signatures of the lakes in a given location. Studies based on remote sensing data have long investigated quality, characteristics, and changes in continental waters [75][76][77]. The measurement of water salinity through remote methods is also not novel. Thomann [78] remotely measured the coastal surface water salinity of the Mississippi and Louisiana rivers with radiometers at a wavelength of 21 cm. A pioneer in remote sensing imagery developed a regression model using Landsat multispectral scanner (MSS) images and color infrared photographs with surface measurements for salinity mapping of the San Francisco Bay Delta [79]. However, studies measuring salinity using remote sensing instruments on inland environments are, normally, created to analyze soil characteristics [80][81][82], with few focusing on spatio-temporal variations [83,84], and none using Big Data environments. Studies based on remote sensing data have long investigated quality, characteristics, and changes in continental waters [75][76][77]. The measurement of water salinity through remote methods is also not novel. Thomann [78] remotely measured the coastal surface water salinity of the Mississippi and Louisiana rivers with radiometers at a wavelength of 21 cm. A pioneer in remote sensing imagery developed a regression model using Landsat multispectral scanner (MSS) images and color infrared photographs with surface measurements for salinity mapping of the San Francisco Bay Delta [79]. However, studies measuring salinity using remote sensing instruments on inland environments are, normally, created to analyze soil characteristics [80][81][82], with few focusing on spatio-temporal variations [83,84], and none using Big Data environments.

Conclusions
Here, we presented a new window into the complex classification of the Nhecolândia lakes by using orbital cloud-based imagery. We investigated the correlation between the water pH of Nhecolândia lakes and their typical spectral signatures in different multispectral satellite bands, using hundreds of Landsat observations over the same area. The use of long-term periods was more efficient in predicting pH values than using short specific time periods. The model successfully predicted pH values using Landsat and synthetic bands for 2002-2017 time-series data, achieving an R 2 correlation of over 85% for pH prediction. The use of the GEE platform and top-of-atmosphere (TOA) reflectance images (Collection 1 Tier 1 TOA Reflectance) for Landsat 5 TM, 7, and 8 was the key tool used to achieve our major goal as we used more than 2081 Landsat scenes collected over the Nhecolândia region, which would be impractical if the images were manually downloaded. Based on our findings, we verified that~92.5% of the lakes of the Nhecolândia region are freshwater. From these findings, we also corroborated and statistically verified previous assumptions regarding the studies that pointed out that saline water lakes tend to be larger in area, have a greater perimeter, are rounder, and are topographically lower than freshwater lakes, in addition to verifying the higher isolation of salinas. These results open a relevant perspective for the transfer of locally acquired experimental data to the regional balance of pH values and distribution of the Nhecolândia lakes.