Machine Learning Application in Water Quality Using Satellite Data

Monitoring water quality is a critical aspect of environmental sustainability. Poor water quality affects not just aquatic life but also the wider ecosystem. The purpose of this systematic review is to identify peer-reviewed literature on the effectiveness of applying machine learning (ML) methodologies to estimate water quality parameters from satellite data. The data were gathered from the Scopus, Web of Science, and IEEE citation databases. Related articles were extracted, selected, and evaluated using advanced keyword search and the PRISMA approach. Bibliographic information was collected from journal publications of the previous two decades. Publications that applied ML to water quality parameter retrieval with a focus on the application of satellite data were identified for further systematic review. The search queries returned 1796 papers, of which 113 were eligible studies. The most commonly applied ML models were artificial neural network (ANN), random forest (RF), support vector machine (SVM), regression, cubist, genetic programming (GP) and decision tree (DT). The water quality parameters most commonly extracted were chlorophyll-a (Chl-a), temperature, salinity, colored dissolved organic matter (CDOM), suspended solids and turbidity. According to the systematic analysis, ML can be successfully extended to water quality monitoring, allowing researchers to forecast and learn from natural processes in the environment, as well as assess human impacts on an ecosystem. These efforts will also help restoration programs ensure that environmental policy guidelines are followed.


Water quality
Water quality describes the state of a water body in terms of its chemical, physical, and biological characteristics, including its suitability for a particular activity (e.g., fishing, swimming or drinking). Substances that can damage aquatic species if found in high enough quantities can also impair water quality. Monitoring water quality is a critical aspect of environmental sustainability. Poor water quality affects not just aquatic life but also the wider ecosystem. The following variables are also used as indicators of water quality: the content of dissolved oxygen (DO); the amount of fecal coliform bacteria from human and animal waste; the levels or ratio of the plant nutrients nitrogen and phosphorus; the amount of suspended particulate matter (turbidity); and the amount of salt (salinity) in the water. To assess water quality, quantities of substances such as pesticides, herbicides, heavy metals, and other pollutants can also be measured. The abundance of chlorophyll-a (Chl-a), a green pigment

Systematic review objectives
In this systematic review, the effectiveness of applying ML methodologies to retrieve water quality parameters from satellite data was investigated. Specifically, the objectives of the studies, the types of satellite data, the ML methodologies, and the significance or outcome of the ML application were summarized. Figure 1 lists the abbreviations, acronyms and symbols used in this manuscript.

Materials and Methods
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology was used to prepare and report the results of this study [5]. PRISMA is a standard method for conducting and reporting systematic reviews of existing research.

Eligibility criteria
This study focused on peer-reviewed publications that applied ML to estimate water quality parameters from satellite data. The search for and screening of publications focused on three criteria: (1) water quality parameters, (2) ML techniques, and (3) type of satellite.

Information sources and search
The peer-reviewed publications were searched in three resources: the Scopus, Web of Science and IEEE citation databases. The search was restricted to research articles published in English in peer-reviewed journals or conference proceedings. The queries were constructed with the Boolean operators AND and OR and are listed in Table 1. The searches were run against the title, keywords and abstract of the publications in each database separately.

Table 1. Search queries.
1  "machine learning" AND "satellite" OR "ocean colour" OR "organic" OR "phytoplankton" OR "salinity" OR "temperature" OR "time series" OR "water quality" OR "suspended" OR "CDOM"
2  "ocean colour" AND "ocean color" OR "forecast" OR "forecasting" OR "predict" OR "prediction"
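As an illustration, the two query strings in Table 1 could be assembled programmatically so that the same text is submitted to each database. The sketch below is a minimal example; the helper name `build_query` and the variable names are hypothetical and are not part of the original study. It reproduces the queries exactly as printed, without parentheses. Operator precedence can differ between databases, so explicit grouping may be needed in practice; the grouping used by the authors is not stated in the source.

```python
def build_query(required, alternatives):
    """Join one required phrase and a list of alternative phrases
    into a Boolean query string (hypothetical helper)."""
    quoted = [f'"{term}"' for term in alternatives]
    return f'"{required}" AND ' + " OR ".join(quoted)


# Terms copied from the queries listed in Table 1.
QUERY_1_TERMS = [
    "satellite", "ocean colour", "organic", "phytoplankton", "salinity",
    "temperature", "time series", "water quality", "suspended", "CDOM",
]
QUERY_2_TERMS = ["ocean color", "forecast", "forecasting", "predict", "prediction"]

query_1 = build_query("machine learning", QUERY_1_TERMS)
query_2 = build_query("ocean colour", QUERY_2_TERMS)

print(query_1)
print(query_2)
```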

Study selection
The eligibility of publications was evaluated in two steps: the publications were first screened by examining their titles, abstracts and methods, and the eligible publications were then confirmed by reading the full text.

Data collection and analysis
For each eligible publication, the objectives, methodologies, environments, problems investigated, language and datasets were documented. A narrative synthesis of all relevant papers was carried out and arranged by (1) research goal, (2) ML methodologies, and (3) scientific findings. The first perspective demonstrated satellite data applications for water quality monitoring, while the second gave an insight into current techniques and the challenges of applying ML to process and analyze water quality parameters. The third perspective showed the lessons that may be drawn about water quality concerns.

Risk of bias
This systematic review is subject to bias in certain respects. To begin with, there is a risk of bias in the review process because only one reviewer screened the literature, and the subjectivity of the inclusion and exclusion criteria may have influenced the selection of relevant articles. Furthermore, the year range was not specified during the search procedure. This implies that the search results cover all accessible years, starting with the earliest publication found in the individual databases and ending with the most recent (May 2021). Moreover, the search was limited to three databases, although many more (e.g., Google Scholar, ACM Digital Library) may contain additional material on the ML application in water quality using satellite data discussed in this paper.

Results
The process of identifying eligible articles is depicted in Figure 2. Initially, the queries returned 1796 publications. Screening for duplicates removed 473 publications. The titles and abstracts of the remainder were then read to examine the techniques against the aforementioned inclusion and exclusion criteria, which removed 1196 articles and retained 127 for a more in-depth examination. Following the full-text review, 14 studies were excluded because they were not published in English or because the full manuscript could not be accessed. Finally, 113 publications from 2001 to 2021 were included in the systematic review. Table 2 summarizes the publications in terms of the type of satellite used, the ML techniques involved, the water quality parameters extracted and the significance or outcomes of the studies.
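The counts reported above can be checked with simple arithmetic. The following is a minimal sketch; the function name `prisma_flow` and the dictionary keys are illustrative assumptions, and only the four input counts come from the text. The intermediate figure after duplicate removal is derived here by subtraction and is not stated in the source.

```python
def prisma_flow(identified, duplicates, excluded_on_screening, excluded_on_full_text):
    """Return the number of records remaining after each selection stage."""
    after_deduplication = identified - duplicates
    full_text_assessed = after_deduplication - excluded_on_screening
    included = full_text_assessed - excluded_on_full_text
    return {
        "identified": identified,
        "after_deduplication": after_deduplication,
        "full_text_assessed": full_text_assessed,
        "included": included,
    }


# Input counts taken from the Results section.
flow = prisma_flow(
    identified=1796,
    duplicates=473,
    excluded_on_screening=1196,
    excluded_on_full_text=14,
)

for stage, count in flow.items():
    print(f"{stage}: {count}")
```

The derived values agree with the 127 publications retained for full-text review and the 113 publications finally included.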

Discussion
The majority of the reviewed studies demonstrated that ML can be effectively applied to water quality monitoring via satellite or remote sensing. This section discusses the insights that can be drawn from the reviewed studies.

Importance of water quality monitoring
A variety of indicators are often used to assess water quality, e.g., turbidity, suspended solids, Chl-a concentration, pollution-sediment, DO, CDOM, nutrients (TP, TN, ammonia-nitrogen, nitrate, orthophosphate, silicate) and harmful algae; water temperature, salinity and many other pollutants are also used as water quality indicators. Nutrient and sediment loads have an impact on water quality. Excess nitrogen and/or phosphorus can lead to eutrophication and fish deaths by increasing algal blooms and aquatic plant growth. The terms suspended-sediment concentration (SSC) and total suspended solids (TSS) are frequently used interchangeably to denote pollution-sediment, which is a crucial parameter to consider because of its environmental, economic, and human health implications [4,21,30,65]. E. coli and cyanobacteria are hazardous organisms that can limit public usage of lakes and coastal waters by lowering dissolved oxygen levels and producing taste and odor problems. Notably, microcystins, which have been linked to liver cancer and tumors in people and animals, have been identified [16,46,102,104]. Monitoring water quality parameters such as Chl-a concentration is crucial in fisheries studies, management, and harvesting, since environmental factors affect the number and distribution of fish species, for example skipjack tuna [26,33].

Satellite | Water quality parameters | ML techniques and outcome | Reference
VIIRS | temperature, salinity, Chl-a | The MLP method produced promising outcomes. | [90]
MERIS, MODIS | Chl-a | LR, polynomial and exponential functions and PCA were used for the partitioning mechanism; the SVM method was used for the iterative classification process. | [92]
SMOS, Aquarius | salinity, temperature | SVM produced promising outcomes. | [93]
MODIS/Terra | total organic carbon (TOC) | The ANN model was chosen over GP and ELM for the forecasting method. | [95]
MODIS | Chl-a | Algorithms using SVM gave better results than DT and log-linear models. | [99]
MODIS/Terra | SDD, suspended solids, Chl-a | GP showed an advantage and was selected over ANN and MLR for estimation. | [106]
MERIS | Chl-a | MLP was able to show prediction performance. | [109]

Remote sensing for water quality
Optical and thermal sensors collect water quality information across a wide range of spectral and spatial resolutions. Watershed-scale models based on ocean color satellite data have been constructed for determining optically active components (OAC) such as Chl-a, suspended solids and CDOM. However, existing satellites cannot directly monitor all water quality parameters, including nutrient concentrations, DO and chemical oxygen demand (COD) levels, and microorganisms/pathogens, because some of these variables are not optically active or because hyperspectral data are not available at sufficiently fine spatial resolution. Therefore, some studies used OAC as proxies to estimate non-OAC parameters by determining the relationship between them [57], and also used possible band combinations from the satellite imagery [6].
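The proxy idea can be illustrated with a simple least-squares fit between an optically active component and a parameter that is not optically active. The sketch below is only an assumption-laden example: the data are synthetic, the pairing of Chl-a with total phosphorus (TP) and the coefficients are invented for illustration, and a straight line stands in for whatever relationship a given study actually fitted.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)

# Synthetic, made-up samples: satellite-derived Chl-a (OAC) paired with
# in-situ total phosphorus (non-OAC). Values and units are illustrative only.
chl_a = [rng.uniform(1.0, 50.0) for _ in range(40)]
total_p = [0.01 + 0.002 * c + rng.gauss(0.0, 0.005) for c in chl_a]

slope, intercept = fit_line(chl_a, total_p)
r2 = r_squared(chl_a, total_p, slope, intercept)


def estimate_tp(chl):
    """Estimate the non-OAC parameter from the OAC proxy."""
    return slope * chl + intercept


print(f"slope={slope:.5f}, intercept={intercept:.5f}, R^2={r2:.3f}")
print(f"estimated TP at Chl-a = 20: {estimate_tp(20.0):.4f}")
```

Such a proxy is only as good as the fitted relationship, so it would need to be re-established with in-situ samples for each water body.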

Machine learning application
Numerous studies have been conducted to determine water quality using satellite data. The majority of the research relied on empirical relationships between satellite-derived reflectance and target water quality parameters, applying relatively simple linear or nonlinear regressions to satellite data. The resulting empirical models have a limitation in that they may not operate effectively in diverse environments (such as the open ocean, coastal, river or inland waters). As a result, additional in-situ data and re-optimized parameter values are required [101]. Moreover, numerous ML models, which are sophisticated nonlinear data-driven approaches, have been tested and widely utilized. Some studies compared several ML models and selected the best-performing method for their research. Other studies used ML to improve measurement techniques, such as fluorescence line height measurement [7], sea surface salinity (SSS) measurement [9], estimation of particulate organic carbon (POC) [13], spectral noise reduction [19], reconstruction of missing values [24] and atmospheric correction [35].
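The model-comparison workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: the data are synthetic, the band ratio and its exponential link to Chl-a are invented, and a linear regression and a k-nearest-neighbour regressor stand in for the ANN, RF or SVM models used in the literature. Each candidate is fitted on a training split and the one with the lowest hold-out root-mean-square error (RMSE) is selected.

```python
import math
import random


def fit_linear(xs, ys):
    """Fit y = a * x + b by least squares and return a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_knn(xs, ys, k=5):
    """Return a k-nearest-neighbour regressor (mean of the k closest targets)."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def rmse(model, xs, ys):
    """Root-mean-square error of a model on held-out samples."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


rng = random.Random(42)

# Synthetic data: a hypothetical reflectance band ratio with a nonlinear
# (exponential) relationship to Chl-a, plus Gaussian noise.
band_ratio = [rng.uniform(0.5, 3.0) for _ in range(200)]
chl_a = [math.exp(1.2 * r) + rng.gauss(0.0, 0.5) for r in band_ratio]

# Simple hold-out split: first 150 samples for training, last 50 for testing.
train_x, test_x = band_ratio[:150], band_ratio[150:]
train_y, test_y = chl_a[:150], chl_a[150:]

candidates = {
    "linear": fit_linear(train_x, train_y),
    "knn": fit_knn(train_x, train_y, k=5),
}
scores = {name: rmse(model, test_x, test_y) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: hold-out RMSE = {score:.3f}")
print(f"selected model: {best_name}")
```

Because the synthetic relationship is nonlinear, the data-driven regressor is expected to outperform the straight line here, which mirrors the motivation for ML over simple empirical regressions; on real data the ranking depends on the water body and the sample size.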

Limitation
This systematic review has several limitations that should be acknowledged. Firstly, the ML-related keywords included in the search queries may not be sufficient to cover all related publications, so some relevant studies might not have been retrieved. Secondly, the review does not examine the performance metrics used to evaluate the ML methods. Thirdly, the review does not include a bibliometric analysis to show research trends.

Conclusion
This systematic review summarized how ML has been applied to satellite data to study water quality issues. The initial search returned 1796 publications; the search was then refined by removing 473 duplicate publications and excluding 1196 publications on unrelated topics. After screening the remaining 127 publications, 113 papers were selected for data extraction and synthesis. The results also showed that a wide variety of ML methods has been proposed, especially for the retrieval of water quality parameters. The most common ML approaches for monitoring water quality at regional and global scales were ANN, SVM, RF, DT, MLP, cubist and GP. According to the systematic analysis, ML can be successfully extended to water quality monitoring, allowing researchers to forecast and learn from natural processes in the environment, as well as assess human impacts on an ecosystem. These initiatives will also aid policymakers and water resource managers in taking proactive action to prevent the negative consequences of water pollution through restoration projects, as well as ensure that environmental regulations are followed.