SVM Model to Predict the Water Quality Based on Physicochemical Parameters

Analysis of water quality is an important and challenging task in the management of water bodies and requires immediate attention, because poor water quality adversely affects the health of living beings. Three parameters namely, pH, Total Dissolved Solids (TDS)


Introduction
Water is essential to the survival of all biotic species in an ecosystem and to human health and welfare. The main sources of water in India are rivers, lakes, oceans, etc. Water quality depends mainly on physicochemical and biological characteristics such as conductivity, TDS, color and taste, hydrogen ion concentration (pH), turbidity, temperature, inorganic heavy metals (magnesium, aluminum, titanium, iron, copper, tin, silver, gold, platinum, gallium, thallium, and hafnium), biological oxygen demand (BOD), dissolved oxygen (DO), etc. (Sharma, 2014; Pareek et al., 2018; Tadesse et al., 2018). However, water bodies are regularly contaminated. Contamination depends on several factors, such as industrial activities, the geographical conditions of the area, artificial products (gasoline, oil, road salts, and chemicals), and agricultural activities. Contaminants are further classified into five types: inorganic, organic, microorganisms, radionuclides, and disinfectants (Nollet, 2000). Of these, inorganic chemicals are mainly responsible for the deterioration of water quality. The motivation for this research is the growing water pollution in the hilly areas of Uttarakhand. Nainital is one of the largest historical tourist destinations in Uttarakhand, famous for its lake and natural beauty. In the last few years, however, it has faced water problems due to over-dependence on lake water for the drinking water supply.
As the town is presently a centre of tourism, it is critical to assess the water of Nainital Lake to guarantee safe drinking water for residents and tourists. The water quality of Nainital Lake has been found to be deteriorating rapidly for many anthropogenic reasons, such as increasing population, tourist traffic in the city that leads to pollution, intense boating activity, construction activity, disposal of solid waste, and throwing of edible material into the lake. Other factors include eutrophication, untreated sewage, lack of drainage maintenance, landslides, etc. (Sharma, 2014). Nainital Lake is considered the lifeline of the residents of the town and its surrounding areas. Therefore, it is essential to monitor and maintain the water quality of the lake.
Because of the non-linear nature of environmental systems, many studies in this field have applied different techniques (Sarkar and Pandey, 2015). Machine learning techniques are now widely employed for this problem. To assess water quality, one group of authors conducted an acute toxicity test on zebrafish (Danio rerio) in a toxic environment containing four metal ions. The behavioral response of the zebrafish was observed to assess water quality using biomonitoring and multiclass SVM. The experimental results showed that the zebrafish responded strongly to one of the four metal ions, and the study also found biomonitoring of zebrafish behavior to be highly cost-effective (Liao et al., 2011).
To develop an efficient model, the authors performed an experiment to determine the water quality of the Langat River Basin using Support Vector Machine. Six parameters measured in the catchment area of a dual reservoir were used. The experiment was performed in two ways, Scenario 1 and Scenario 2. The results showed that Scenario 1 performed better than Scenario 2 on the basis of MSE and correlation coefficient (Yahya et al., 2019).
In a case study on the Wen-Rui River, China, the authors used 11 hydrochemical variables to determine the concentration of dissolved oxygen in hypoxic river systems using Support Vector Machine. They used four different calibration models: General Regression Neural Network (GRNN), Multiple Linear Regression (MLR), Backpropagation Neural Network (BPNN), and Support Vector Machine (SVM). The prediction accuracy of the models was evaluated and compared using Mean Square, determination coefficient (R²), and Nash-Sutcliffe (NS). The experimental results showed that the SVM model outperformed the other models and can efficiently predict water quality (Ji et al., 2017).
To predict water quality using the least squares support vector machine (LS-SVM) method, the authors conducted experiments on time-series data. The experimental results showed that LS-SVM performed better than multilayer backpropagation and radial-basis neural networks, met the requirements of water quality prediction better, and worked well on small samples (Tan et al., 2012).
To analyze performance in predicting the water quality components of the Tireh River, situated in southwest Iran, the authors used three artificial intelligence techniques: Artificial Neural Network (ANN), Support Vector Machine (SVM), and Group Method of Data Handling (GMDH). Different transfer functions and kernel functions were evaluated in developing the ANN and SVM models. The results showed that tansig and RBF performed best as the transfer and kernel functions, respectively. A comparison of the GMDH model with the ANN and SVM models found that GMDH had acceptable performance, but SVM was the best model, with the lowest DDR value and high accuracy (Haghiabi et al., 2018).
Another study explored data-based methods for forecasting short-term water quality. Water quality indicators were predicted with Support Vector Machine and fed to a wavelet neural network optimized by a particle swarm algorithm to forecast the overall status index of water quality. To examine the effectiveness of the proposed approach, data from the Gubeikou monitoring section of the Miyun reservoir were used. The results showed that the proposed model was more stable and consumed less time than other data-driven models, including the traditional BP neural network model, the wavelet neural network model, and the Gradient Boosting Decision Tree model.
To predict monthly streamflow, the authors used two soft computing techniques, Artificial Neural Network (ANN) and Support Vector Machine (SVM). The two models were compared on the basis of the determination coefficient (R²), root mean square error (RMSE), and mean absolute error (MAE). The results showed that the SVM model was superior to the ANN model (Adnan et al., 2017).
These studies show that SVM is a powerful technique for resolving several kinds of environmental problems. The literature review found that only a few studies, based on chemical and statistical analysis, have previously been reported for Nainital Lake. In this paper, therefore, the authors propose an efficient model for the prediction of water quality based on LIBSVM. To the best of the authors' knowledge, this is the first prediction model using a machine learning technique reported for Nainital Lake.

Problem Statement and Objective
Increasing tourist traffic, improper treatment of sewage, and unplanned construction in the catchment area of Nainital city have increased the level of pollution. Because of this pollution, the quality of the lake water, which is the only source of drinking water for the local people, is deteriorating day by day. Therefore, it is very important to maintain the water quality of the lake. Several past studies of Nainital Lake reveal that its water is unsuitable for drinking (Sharma, 2014; Dalakoti et al., 2018). In this paper, the efficiency of Support Vector Machine models in predicting the water quality of Nainital Lake is evaluated.
The objectives of this research study are: a) to propose a powerful and efficient machine learning model for the prediction of water quality for Nainital Lake; b) to overcome the limitations of traditional methods and develop computational models that work efficiently and accurately.

Study Area
Nainital Lake, shown in Figure 1, is situated amidst the township of Nainital in Uttarakhand, India. It is a major tourist spot and an important source of fresh water. It is a natural, kidney-shaped lake with uniform temperature and density from top to bottom, situated at 29°24' N latitude and 79°28' E longitude. The lake covers a surface area of 48 hectares. Its maximum depth is 27.3 m and its mean depth is 16.2 m.

Selection of Appropriate Inputs
The experimental data used in this study were obtained from Uttarakhand Jal Sansthan, Nainital, and from a sensor-based device installed in Nainital Lake for determining water quality by the Uttarakhand Science Education and Research Centre (USERC), Dehradun. Time-series data covering 1.5 years, from 2018 to 2019, were used. Three water quality parameters, namely pH, Total Suspended Solids, and Turbidity, were chosen for SVM modelling. Statistics of the selected parameters are shown in Table 1.

Support Vector Machine
Support Vector Machine (SVM) is one of the most popular supervised machine learning algorithms and is used to solve problems of classification, regression, recognition, and time-series prediction (Kara et al., 2011). It is most commonly used for classification problems.

Basic Principle
SVM is based on the statistical learning theory proposed by Vapnik and Chervonenkis in the late 1960s (Vapnik and Chervonenkis, 1971). It became popular owing to features such as good generalization ability, high prediction accuracy, and fast training speed. SVM works well with small samples and can classify both linearly and non-linearly separable data. In a two-class problem, the training samples are represented in the input space as {x_1, x_2, x_3, …, x_n}, where x_i ∈ R^d, and their corresponding labels as y_i ∈ {-1, +1} (Kara et al., 2011).
The main objective of SVM is then to construct a hyperplane that separates the training samples with maximal margin, after mapping them from the input space to a higher-dimensional feature space, so that the vectors labeled +1 lie on one side of the hyperplane and the vectors labeled -1 lie on the other.
The training samples that lie closest to the hyperplane are called support vectors (Tong and Koller, 2001). The maximum-margin hyperplane is also known as the Optimal Separating Hyperplane (OSH). The mapping of training samples from the input space to the higher-dimensional space is performed with the help of a kernel function k(x_i, x_j), as shown in Figure 2.
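For reference, the maximal-margin objective described above is commonly written in the soft-margin form below. This is the standard textbook formulation, assuming the C-SVC variant of LIBSVM (Chang and Lin, 2011), and is not a reproduction of the paper's own equations. The symbols w, b, ξ_i, φ, and α_i are introduced here for illustration and do not appear in the text above.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\, w^{\top} w + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \qquad i = 1, \dots, n,
\end{aligned}
\qquad
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} y_i\, \alpha_i\, k(x_i, x) + b \right)
```

Here φ is the mapping to the feature space, k(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel function, and C is the regularization parameter that trades margin width against training errors. Only the support vectors have non-zero α_i in the decision function f(x).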

Figure 2. Transformation of the input space to the corresponding high-dimensional feature space using a kernel function.
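To make the role of the kernel function concrete, the following minimal sketch evaluates the two kernel types in the form used by LIBSVM: the polynomial kernel (γ·uᵀv + coef0)^d and the radial basis function kernel exp(-γ·‖u − v‖²). The function names, the sample vectors, and the parameter values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel in the LIBSVM form (gamma * u.v + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Hypothetical, already-scaled samples with three features each
# (illustrative values, not measurements from the lake).
sample_a = [0.2, 0.5, 0.1]
sample_b = [0.4, 0.1, 0.3]

k_poly = polynomial_kernel(sample_a, sample_b, gamma=1.0, coef0=1.0, degree=2)
k_rbf = rbf_kernel(sample_a, sample_b, gamma=0.5)
print(f"polynomial kernel: {k_poly:.4f}")
print(f"RBF kernel:        {k_rbf:.4f}")
```

Both functions return a similarity score without ever computing the high-dimensional mapping explicitly, which is the reason kernels make the transformation in Figure 2 tractable.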

Experiment
In SVM, the choice of kernel function plays a critical role in achieving high prediction accuracy. In this study, LibSVM was used for classification with two kernel function types, polynomial and radial basis function (RBF). Several levels of the polynomial degree (d), along with the regularization parameter (C) and the gamma constant (γ), were evaluated in parameter-setting experiments (Tarmizi et al., 2014; Lerios and Villarica, 2019).
To train and test the model on separate parts of the data, the cross-validation technique was used. In k-fold cross-validation, the input data are split into k subsets and the procedure iterates over all the subsets, using k-1 subsets as training data and keeping 1 subset as testing data. We split the data into k=10 subsets and ran cross-validation. The results obtained with both kernel types were compared to identify the best SVM predictive model. All experiments were executed in the WEKA software using LibSVM, a library developed for SVM implementation (Chang and Lin, 2011). Table 3 summarizes the parameter values used in the experiment for each kernel function type. To assess how efficient the predictive models are, the following classification measures were used: Accuracy, Precision, Recall, F1_measure, and ROC (Goutte and Gaussier, 2005; Sokolova et al., 2006). The classification measures are as follows:

a) Accuracy
It is the fraction of correct predictions made by the model over all the observed values. Accuracy is measured by Equation (3), where TP refers to true positive, TN refers to true negative, FP refers to false positive and FN refers to false negative.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

b) Precision
It is the fraction of correctly classified instances of the positive class out of all instances classified as that class. Precision is measured by Equation (4), where TP refers to true positive and FP refers to false positive.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

c) Recall
It is the fraction of correctly classified instances of the positive class out of all instances that actually belong to the positive class. Recall is measured by Equation (5), where TP refers to true positive and FN refers to false negative.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

d) F1_measure
It is the harmonic mean of Precision and Recall, and its value lies between 0 and 1. F1_measure is measured by Equation (6), which combines Precision and Recall as defined above.

$$\mathrm{F1\_measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$

e) Area Under the Curve-Receiver Operating Characteristic Curve
It is a graph illustrating the performance of classification models at various threshold settings. ROC represents the probability curve and AUC represents the degree of separability. It is one of the most important evaluation metrics for checking the performance of classification models.
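The four count-based measures defined above can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function name and the counts are hypothetical and are not results from the study.

```python
def classification_measures(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
measures = classification_measures(tp=40, tn=45, fp=5, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision exceeds recall because the hypothetical classifier makes fewer false-positive than false-negative errors, and F1 falls between the two.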

Result and Discussion
To determine an efficient parameter combination, a total of 320 experiments were run in the parameter-setting experiment: 80 parameter combinations were evaluated for the radial basis function, and 240 parameter combinations were evaluated for the polynomial function, corresponding to three levels of degree.
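The structure of the parameter-setting experiment, together with the 10-fold cross-validation used to score each combination, can be sketched as follows. The individual grid values are hypothetical: the text reports only the totals (80 and 240), so the 8 values of C, 10 values of γ, and the degrees 1 to 3 below are assumptions chosen to reproduce those totals. The helper `k_fold_indices` is likewise an illustrative name, and it splits the data sequentially without shuffling.

```python
from itertools import product

# Hypothetical grids: only the totals (80 and 240) come from the text.
C_VALUES = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]   # 8 values (assumed)
GAMMA_VALUES = [0.1 * i for i in range(1, 11)]            # 10 values (assumed)
DEGREES = [1, 2, 3]                                       # 3 levels (assumed values)

# One configuration per (C, gamma) pair for the RBF kernel.
rbf_grid = [
    {"kernel": "rbf", "C": c, "gamma": g}
    for c, g in product(C_VALUES, GAMMA_VALUES)
]
# The polynomial kernel repeats the same pairs for each degree.
poly_grid = [
    {"kernel": "poly", "C": c, "gamma": g, "degree": d}
    for d, c, g in product(DEGREES, C_VALUES, GAMMA_VALUES)
]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so that each sample is tested once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


print(len(rbf_grid), len(poly_grid), len(rbf_grid) + len(poly_grid))
folds = list(k_fold_indices(25, k=10))
print([len(test) for _, test in folds])
```

Each configuration in the two grids would be trained and scored on all ten folds, and the mean score used to compare configurations.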
The prediction accuracy of the polynomial SVM model ranges between 78.6792% and 99.434%, whereas that of the radial basis SVM model ranges between 78.6792% and 98.6792%. The best parameter combination (C, d, γ) of the SVM models for both kernel types is given in Table 4. Figure 3 shows the prediction accuracy obtained in the parameter-setting experiment.
The graphs in Figure 4 show the Precision measure obtained in the parameter-setting experiment. Precision ranges between 0.787 and 0.987 for the radial basis SVM model and between 0.787 and 0.994 for the polynomial SVM model. The polynomial SVM model attained its maximum value when C=1 and C=10. Figure 7 shows the plot of the Receiver Operating Characteristic curve, as mentioned in Table 5. The plotted graphs clearly show that the area under the ROC curve for the polynomial kernel, about 0.9964, is closer to 1.0 than that for the radial basis kernel, 0.9819.
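The area under the ROC curve reported above has a simple rank interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way on hypothetical labels and scores; it is not the WEKA implementation, and the data are illustrative only.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = positive class) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.4f}")
```

An AUC of 1.0 means every positive instance is ranked above every negative one, which is why values such as 0.9964 indicate near-perfect separability.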

Conclusion
Predicting the quality of water before it is used for drinking is a present-day necessity. Several conclusions can be drawn from the experiment carried out in this study and from the literature review. Heavy tourism activity, unplanned construction, and improper solid waste disposal have contributed to the deterioration of water quality. Nainital Lake is the only source of drinking water for the local people, so its planning and management are important. In this study, LibSVM was used with two kernel function types, which map training samples from the input space to a higher-dimensional space, and their results were compared. The experimental results demonstrate that LibSVM with the polynomial kernel function achieved the best classification accuracy of 99.434%. In conclusion, SVM proved efficient and effective in predicting the water quality of Nainital Lake.