Application of artificial neural network for predicting water levels in Hooghly estuary, India

Hydrodynamic models for morphodynamic studies in estuaries require continuous tidal water level data as boundary conditions. However, for the Hooghly estuary in India, measurement of continuous tidal water elevation data at the most downstream point is a very difficult task because of the remote location and the confluence with the deep sea. The tidal water level data at this station are measured for a half tidal cycle which is not useful for hydrodynamic modeling. However, at other upstream stations, tide water level data are measured continuously. Accordingly, in this study, an attempt is made to generate continuous tidal water level data at the remote station, using the data of the neighboring stations as input to an artificial neural network (ANN) model. A three-layered feed-forward backpropagation (FFBP) network with two hidden layers is selected and five different combinations of input vectors are used. Simulated water level data obtained from each model are compared with the observed data graphically as well as by estimating the standard error parameters. The best model suitable for prediction of continuous tidal elevation during any time of the tidal cycle and applicable throughout the year is then identified. It is found that tidal data from the nearest neighboring station are more suitable for training.


INTRODUCTION
In an estuary, studies on characteristics of sediment transport processes and the resulting changes in the bed morphology are essential tasks for operation and management of ports, construction of jetties and other structures, maintenance dredging works to provide adequate navigational depths for ships, etc. Usually, for these purposes, an appropriate two-dimensional or three-dimensional hydrodynamic model is used to assess the sediment transport phenomena and the resulting morphological changes. However, along with other data like bathymetry, sediment characteristics, etc., such a hydrodynamic model requires continuous tidal water level data over the period of simulation, to be provided at the key sections as boundary conditions. Unfortunately, in the large Hooghly estuary in India, it is extremely difficult to measure the tidal water level at the most downstream station near the sea continuously for 24 h because of the limited accessibility and proximity to the turbulent sea. Accordingly, tidal water level data at this station, which is an important boundary section for hydrodynamic modeling, are available only for daytime (12-h period). However, for neighboring upstream stations, tidal water level data are measured continuously. So, since a continuous tidal water level record at the remote station is essential for hydrodynamic modeling, an appropriate model needs to be developed which can simulate the continuous tidal water level at that station. This may preferably be done using the continuous water level data of the neighboring upstream stations, so that the characteristics of the tidal water wave during spring tides and neap tides, peaks, troughs, lags, etc., are adequately reflected in the simulated tidal water levels. 
Accordingly, in this study, an attempt is made to simulate the continuous tidal water level data at the remote station, by developing artificial neural network (ANN) models and training them with measured continuous water level data of the neighboring upstream stations as input.
Conventional tidal forecasting using the least-squares method for the prediction of long-term tidal levels requires a large number of parameters for harmonic analysis. Least Square Estimation (LSE) plays a major role in harmonic analysis. Cai et al. (2018) and Li et al. (2019) used the Inaction Method (IM) to predict short-term and long-term tidal levels, respectively. A large number of tide gauge data comprising long and continuous records were required for the long-term prediction. ANN models can be used as an alternative to conventional harmonic analysis for forecasting tidal elevation where measured data are limited (Lee & Jeng 2002; Meena & Agrawal 2015; Salim et al. 2015).
ANNs have also been attempted for simulation of tidal water level, discharge, and wave height in coastal areas, as substitutes for numerical models. Tsai & Lee (1999) used backpropagation ANN to forecast tidal water level variations, using two advance consecutive tidal data and two corresponding shocks as inputs to obtain one prediction value. Lee & Jeng (2002) used ANN models to predict hourly tidal levels over a long duration using a short-term hourly tidal record. Ustoorikar & Deo (2008) used ANN and genetic programming for filling up gaps in wave height data. Panda et al. (2010) compared the river stage data obtained from an ANN model with a one-dimensional hydrodynamic model. Feed-forward neural network architecture with Levenberg-Marquardt (LM) backpropagation training algorithm was used to train the neural network model using hourly water level data. Chen et al. (2012) applied ANN as an alternative modeling approach to simulate the water stage time-series of the Danshui River estuary in northern Taiwan and compared the results with vertical (laterally averaged) 2D and 3D hydrodynamic models. Hidayat et al. (2014a) developed a tidal discharge forecast model for a site on the Mahakam River. Hidayat et al. (2014b) used an M5 model tree machine-learning technique to reconstruct a disrupted water level record of the Mahakam Delta in Indonesia. Meena & Agrawal (2015) used ANN models to predict the tidal level variations at stations situated on the southwest coast of India and fill the gap for the missing data. A sequential learning radial basis function (RBF) network was used by Yin et al. (2013) for accurate real-time prediction of tidal level. Lee & Resdi (2014) presented a feed-forward backpropagation (FFBP) network and generalized regression neural network (GRNN) model for prediction of coastal high and low water, which provided effective prediction but was not suitable for one-year simulation. Salim et al. (2015) used FFBP and Nonlinear Auto Regressive with exogenous input (NARX) network to predict hourly tidal level variations at Mangalore, Karnataka, using a week's hourly tidal levels as input and found the NARX network performance better than the FFBP network.
An adaptive variable-structure online sequential extreme learning machine (OS-ELM) was used by Yin et al. (2016) for online tidal level prediction. Tidal prediction simulations were conducted based on the actual measured tidal data and meteorological data, and OS-ELM was found effective for short-term tidal predictions in terms of accuracy and speed.
In this study, simulation of the continuous tidal water level data at the most remote station in the Hooghly estuary is done with an ANN (FFBP) model using observed continuous tidal water level data of the neighboring upstream stations. These simulated continuous tidal water level data at the remote station are proposed to be used with a hydrodynamic model as one of the important boundary conditions. The simulated water level data may also be used for navigation purposes. The ANN (FFBP) model is selected because of its simplicity of structure, minimum data requirement, good convergence, and acceptable performance.
The following aspects are investigated while developing the models: (i) which of the upstream neighboring stations are suitable for training and whether the inclusion of more stations improves the model's performance; (ii) whether the model's performance is the same for any part of the tidal cycle (i.e., during spring tide and neap tide); (iii) whether 1-day ahead prediction is possible; and (iv) whether increasing the number of lagged inputs actually improves the model's performance.

STUDY AREA
The Hooghly estuary (Figure 1) is the lowest portion of the Ganga-Bhagirathi-Hooghly river system which joins with the Bay of Bengal and lies approximately between 21°31′N-23°20′N and 87°45′E-88°45′E. The Hooghly estuary is a funnel-shaped estuary. The width of this estuary varies from about 25 km at the mouth to about 6 km at the head. The estuary is very shallow with an average depth of 6 m and the maximum depth is about 20 m. It is a well-mixed estuary, and mixing zones of the estuary extend up to Diamond Harbour. The estuary is characterized by the presence of a large number of tidal bars and tidal islands. The large tidal variations, irregular coastlines, the presence of islands, and dredged navigational channels separated by shallow areas make the flow patterns quite complex.
The Hooghly estuary is the gateway to the two major riverine ports in India: the Haldia port and the Kolkata port. The Haldia port is shown in Figure 1, and the Kolkata port is situated further upstream of Diamond Harbour (at the top). The red lines shown in Figure 1 are the shipping channels. On the left side, the Eden Channel and the Jelligham Channel take the ships to the Haldia port, and the Rangafalla Channel on the right takes the ships towards the Kolkata port.
Operation and management of these two ports are totally dependent on the navigability of these shipping channels. Availability of adequate draft for movement of ships along these channels is largely affected by frequent sediment deposition. This poses a perennial problem to the Kolkata Port Trust (KoPT) Authority, and regular maintenance dredging is required, which is highly expensive. Disposing of the dredged material at appropriate locations in the estuary is also difficult, as tidal action brings the disposed dredged material back to the shipping channels. All these tasks require a proper understanding of the hydrodynamics of flow and characteristics of the morphological changes.
The morphology of the estuary is extremely dynamic with high rates of erosion and accretion, which makes it essential to develop appropriate hydrodynamic models for assessing the morphological behavior of the estuary. These hydrodynamic models require continuous tidal water level data at the key stations as boundary conditions. However, a continuous record of tidal water level data is not available at some of the remote stations in the estuary, like Sagar (Figure 1), Dadanpatra and Fraserganj (not shown in Figure 1). The tidal elevation data measured at these stations are usually for a 12-h stretch (daytime). However, at other upstream stations like Gangra, Haldia, and Diamond Harbour (Figure 1), tidal water level data are measured continuously for 24 h. Accordingly, in this study, it is attempted to simulate the continuous tidal water level data of the remote gauging station Sagar, situated at the mouth of the estuary, using continuous tidal water level data of the neighboring upstream stations, Gangra, Haldia, and Diamond Harbour, by developing a number of ANN models.

METHODS
ANN models consist of interconnected nodes which can extract complex nonlinear relationships from a set of input and output data. A typical network consists of one input layer, one or more hidden layers, and one output layer. Each layer is made up of interconnected nodes with a set of associated weights. The multilayer feed-forward (MLFF) network is the commonly used ANN model in water resources applications (Maier et al. 2010). The MLFF network has a unidirectional flow of information. The output from nodes in one layer is used as input to nodes in the next layer. Different learning rules are available to train the network, e.g., Delta Learning Rule, Memory-based Learning Rule, Hebbian Learning Rule, and Boltzmann Learning Rule. The learning algorithm is the specific mathematical method used to update the synaptic weights during each training iteration.
A set of optimal values of connection weights and thresholds is thus found that minimizes a predetermined error function between the ANN model simulated output and the target values. In the present study, the nonlinear least squares LM algorithm is adopted for this training, which has been found to have a faster convergence rate compared to the standard gradient descent BP algorithm (Habib & Meselhe 2006; Chen et al. 2012). As mentioned by Wu et al. (2014), the steps followed in ANN model development are selection of input variables, division of data, determination of the best network architecture, model calibration, and finally model validation.

Selection of input variables
Tidal water levels of the three upstream stations at Gangra, Haldia, and Diamond Harbour are selected as the input variables, and the tidal water level at Sagar is taken as the output variable. Five different models are developed and tested, with the input variables shown in Table 1. The measured water level data at all four stations are at half-hourly intervals. Hence, in the second column of Table 1, level at (t − 1) denotes level at half an hour before t; level at (t − 2) denotes level at 1 h before t, and so on. Similarly, level at (t − 48) indicates level at 24 h before t, i.e., 1 day before t. It may be mentioned here that Gangra is the nearest station and Diamond Harbour is the farthest station from Sagar.
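The lag notation can be made concrete with a short sketch. The snippet below is only an illustration in plain Python, not the procedure used in the study: the function name `make_lagged_inputs` and the synthetic series are hypothetical, and lags are counted in half-hour steps as described above.

```python
def make_lagged_inputs(levels, lags):
    """Build (t, input vector) pairs from one half-hourly series.

    levels: water levels sampled every 0.5 h at the input station.
    lags:   non-negative integer lags in half-hour steps, e.g. [0, 1, 2, 3].
    Only time indices t for which every lagged sample exists are returned.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(levels)):
        rows.append((t, [levels[t - lag] for lag in lags]))
    return rows


# Synthetic placeholder series (NOT measured data), 60 half-hourly samples.
gangra = [float(i) for i in range(60)]

# Inputs at t, t-1, t-2, t-3 (same-time and recent levels).
same_time = make_lagged_inputs(gangra, [0, 1, 2, 3])
# Inputs at t-48 ... t-51 (levels observed 24 h to 25.5 h before t).
one_day_back = make_lagged_inputs(gangra, [48, 49, 50, 51])

print(same_time[0])     # (3, [3.0, 2.0, 1.0, 0.0])
print(one_day_back[0])  # (51, [3.0, 2.0, 1.0, 0.0])
```

With lags of 48 to 51 steps, every input is at least 24 h old when the output at time t is needed, which is what makes a 1-day ahead use of such a model conceivable.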

Performance evaluation of the models
Performances of the developed ANN models are measured with respect to the following statistical parameters, which are calculated from the predicted values of ANN and the target values.

Root mean square error (RMSE)
It is one of the most widely used functions for assessing the performance of a predictive model and is defined as the square root of the average of the squared differences between the target and predicted values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(T_i - P_i\right)^2} \qquad (1)$$
where $T_i$ is the target data, $P_i$ is the corresponding value predicted by the ANN model, and $n$ is the number of data.

Table 1 (rows for Models III to V): input variables and model architecture

| Model | Input variables | Model architecture |
|---|---|---|
| III | Water levels at Gangra at time t, (t − 1), (t − 2), and (t − 3) | 4-6-6-1 |
| IV | Water levels at Gangra at time (t − 48), (t − 49), (t − 50), and (t − 51) | 4-7-7-1 |
| V | Water levels at Gangra at time t, (t − 1), (t − 2), ..., (t − 12) | 13-9-9-1 |

Nash-Sutcliffe model efficiency coefficient (E)
This coefficient, E, measures the differences between the observations and predictions relative to the variability in the observed data itself:
$$E = 1 - \frac{\sum_{i=1}^{n}\left(T_i - P_i\right)^2}{\sum_{i=1}^{n}\left(T_i - \bar{T}\right)^2} \qquad (2)$$
where $\bar{T}$ denotes the mean of $T_i$. As per Equation (2), E may range from −∞ to 1.0. A value of E = 1.0 indicates a perfect model, whereas E = 0.0 indicates that the observed mean is as good a predictor as the model, and E < 0 indicates that the model is worse than using the mean as a predictor.

Correlation coefficient (R)
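This is a measure of the linear correlation between the target values and predicted values:
$$R = \frac{\sum_{i=1}^{n}\left(T_i - \bar{T}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(T_i - \bar{T}\right)^2 \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}} \qquad (3)$$
where $\bar{T}$ and $\bar{P}$ denote the means of the target and predicted values, respectively. The value of R ranges between −1 and 1, with a value of 1 indicating a perfect positive linear relation.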

Mean absolute percentage error (MAPE)
This is a statistical measure of the predictive capability of a model and expresses the relative error in prediction with respect to the actual value of the variable. MAPE calculates the error as a percentage of the actual value:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{T_i - P_i}{T_i}\right| \qquad (4)$$
The combined use of R, RMSE, MAPE, and E helps to assess the performance of a model and compare the accuracy of any two modeling approaches. Statistically, ANN performance can be considered good when the values of R, MAPE, MSE, and RMSE are close to 1, 0, 0, and 0, respectively (Habib & Meselhe 2006).
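The four measures can be computed in a few lines. The sketch below is a minimal plain-Python illustration, not the toolbox routines used in the study; the function names and sample values are hypothetical, and MAPE is assumed to be expressed in percent.

```python
import math


def rmse(target, pred):
    """Root mean square error between target and predicted values."""
    n = len(target)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, pred)) / n)


def nash_sutcliffe(target, pred):
    """Nash-Sutcliffe efficiency: 1 - (error sum of squares / target variability)."""
    mean_t = sum(target) / len(target)
    num = sum((t - p) ** 2 for t, p in zip(target, pred))
    den = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - num / den


def correlation(target, pred):
    """Pearson linear correlation coefficient R."""
    mean_t = sum(target) / len(target)
    mean_p = sum(pred) / len(pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(target, pred))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_t * var_p)


def mape(target, pred):
    """Mean absolute percentage error (assumed in percent).

    Undefined when any target value is zero, which can occur for water
    levels measured relative to a datum.
    """
    n = len(target)
    return 100.0 / n * sum(abs((t - p) / t) for t, p in zip(target, pred))


# Made-up values for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(round(rmse(observed, predicted), 4))
print(round(nash_sutcliffe(observed, predicted), 4))
print(round(correlation(observed, predicted), 4))
print(round(mape(observed, predicted), 4))
```

Reporting all four together is useful because R alone is insensitive to a constant offset or scale error in the predictions, whereas RMSE, MAPE, and E penalize it.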

Network architecture
A typical multilayer perceptron (MLP) has an input layer, an output layer, and one or two 'hidden' layers of neurons to connect the input and output layers. The neurons are connected by excitation functions whose weights and biases are estimated during training. The best network architecture for an ANN model is determined considering the number of input variables used. Appropriate training parameters are selected and then the network is trained using pre-processed training data. Identification of the best network architecture involves determining the number of hidden nodes, the number of iterations, the learning rate, and the momentum coefficient. Parameters like learning rate, epoch, show, goal, and momentum coefficient are used to control the rate of convergence. All of these are obtained by trial and error, based on the mean square error (MSE) value. In this study, a learning rate of 0.1 is used in all models. A maximum of 1,500 epochs is assigned during training and testing of the ANN models. The value of show is taken as 500, that is, the training result is shown after every 500th iteration. In all models, a momentum coefficient of 0.3 is used. For convergence, the acceptable value of MSE is taken as 10⁻⁶, which is termed the goal.
Most of the reported works have used one or two hidden layers. The number of neurons in the hidden layer of an MLP determines the potential complexity of the model (Hidayat et al. 2014a; Rath et al. 2017). In this study, it is found that using two hidden layers yielded much better results than using a single hidden layer, keeping the total number of neurons in the hidden layers the same. So, two hidden layers with an equal number of nodes in each hidden layer are adopted for further use. A three-layer fully connected feed-forward network (Figure 2) is thus used in all the models, with two hidden layers and a single output layer, and a different activation function for each layer. The tan-sigmoid function is used in the first hidden layer, the log-sigmoid function is used in the second hidden layer, and the pure linear (purelin) function is used in the output layer.
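The layer arrangement described above can be sketched as a forward pass. This is a hedged illustration in plain Python, assuming a 4-6-6-1 layout; the weights are random placeholders rather than trained values, the uniform initialization is a stand-in and not the toolbox's scheme, and all function names are hypothetical.

```python
import math
import random


def tansig(x):
    """Tan-sigmoid activation, output in (-1, 1)."""
    return math.tanh(x)


def logsig(x):
    """Log-sigmoid activation, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def purelin(x):
    """Pure linear activation (identity)."""
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . inputs + b) per node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    # Placeholder uniform initialization for illustration only.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def forward(x, layers):
    """Propagate an input vector through the layers in order."""
    for (weights, biases), activation in layers:
        x = dense(x, weights, biases, activation)
    return x


rng = random.Random(0)
layers = [
    (init_layer(4, 6, rng), tansig),   # first hidden layer
    (init_layer(6, 6, rng), logsig),   # second hidden layer
    (init_layer(6, 1, rng), purelin),  # output layer
]

x = [0.2, 0.1, 0.0, -0.1]  # hypothetical normalized input levels
y = forward(x, layers)
print(len(y))  # 1
```

Because the output layer is linear, the predicted level is not squashed into a fixed interval, which matters when the target is a continuous water level.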
The optimum number of nodes in the hidden layer is selected by varying them and checking the improvement in the R value (Figure 3).
In Table 1, the column 'Model architecture' indicates the layers and the number of nodes used in the particular model. For example, in Model III, the term '4-6-6-1' indicates four inputs in the input layer, two hidden layers with six nodes each, and one node in the output layer. In the present study, design of the network, its training, and simulation are performed using the Neural Network Toolbox of MATLAB 13.0, developed by MathWorks®, Inc.
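As a worked illustration of the architecture notation, the sketch below parses such a string and counts the trainable weights and biases of a fully connected network. The helper name `count_parameters` is hypothetical, and the count assumes one bias per hidden and output node.

```python
def count_parameters(architecture):
    """Count weights and biases of a fully connected network.

    architecture: string such as '4-6-6-1' (inputs-hidden-hidden-output).
    Assumes every node in a hidden or output layer has one bias.
    """
    sizes = [int(s) for s in architecture.split("-")]
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases


for name, arch in [("III", "4-6-6-1"), ("IV", "4-7-7-1"), ("V", "13-9-9-1")]:
    print(name, arch, count_parameters(arch))
# III 4-6-6-1 79
# IV 4-7-7-1 99
# V 13-9-9-1 226
```

The count grows quickly with the number of lagged inputs, so a model with 13 inputs has almost three times as many parameters to fit as one with 4 inputs.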
The backpropagation algorithm requires initial values of layer weights and biases to start the training process. In the Neural Network Toolbox, the layer weights and biases are initialized using the Nguyen-Widrow method. Once a network is properly trained with acceptable levels of accuracy using a known dataset of input and target, it can be used for simulation. This implies that the trained network can be used to predict possible outputs for any other dataset of input, not used in the training process. Levels of acceptance of the predicted output can be verified from the parameters R, RMSE, MAPE, and E, as defined in Equations (1)-(4).

Dataset
Three different datasets are used in the study. Dataset 1 and Dataset 2 contain 14 days' tidal water level data measured continuously at an interval of 0.5 h at all four stations: Gangra, Haldia, Diamond Harbour, and Sagar. Dataset 1 was for spring tide measured during April 2014 and Dataset 2 was for neap tide measured during January 2015. On the other hand, Dataset 3 (spring tide measured during November 2016) contains 14 days' tidal water level data measured continuously at an interval of 0.5 h at stations Gangra, Haldia, and Diamond Harbour, but only 12-h data (during daytime) at an interval of 0.5 h over a period of 14 days measured at station Sagar. So, Dataset 3 contains half-cycle data at Sagar, whereas Dataset 1 and Dataset 2 contain full-cycle data at Sagar. In fact, Dataset 1 and Dataset 2 are the only two available datasets with full-cycle data at Sagar. In this study, Dataset 1 is used for training the networks, Dataset 2 is used for validation of the network outputs, and Dataset 3 is used for simulation (prediction). To improve the forecasting capability of ANN outside the range of training datasets, the input data are generally normalized in the range of 0 to +1 or −1 to +1 (Kisi 2004; Panda et al. 2010). Standardization of the input and output vectors is, therefore, done to keep the data within the interval of [−1, 1].
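One common way to bring data into [−1, 1] is linear min-max scaling. The sketch below assumes that form of scaling, which the text does not spell out; the function names and sample values are hypothetical. The scaling limits are taken from the training data only and reused for other datasets.

```python
def fit_minmax(values):
    """Return the (min, max) of the training values used for scaling."""
    return min(values), max(values)


def to_unit_range(values, lo, hi):
    """Map values linearly so that lo -> -1 and hi -> +1."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def from_unit_range(scaled, lo, hi):
    """Invert to_unit_range to recover levels in original units."""
    return [(s + 1.0) * (hi - lo) / 2.0 + lo for s in scaled]


# Made-up water levels in metres, for illustration only.
train_levels = [0.5, 2.0, 3.5, 5.0]
other_levels = [1.0, 4.0]

lo, hi = fit_minmax(train_levels)
train_scaled = to_unit_range(train_levels, lo, hi)
other_scaled = to_unit_range(other_levels, lo, hi)  # reuse training limits
recovered = from_unit_range(train_scaled, lo, hi)

print(train_scaled[0], train_scaled[-1])  # -1.0 1.0
```

Values outside the training range map outside [−1, 1], which is one reason a training set covering the larger spring-tide range is a reasonable choice when the validation set covers the smaller neap-tide range.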

RESULTS AND DISCUSSION
During training with Dataset 1, the following R values are obtained from the five models: (i) R = 0.8732 for Model I; (ii) R = 0.8667 for Model II; (iii) R = 0.9993 for Model III; (iv) R = 0.9907 for Model IV; and (v) R = 0.9994 for Model V. Since Model I and Model II yielded low values of R, these two models were not considered for further analysis. Since Model I contains only one input, it might not be able to represent the variability in the output water level. Model II has three input variables, which include data of stations located further upstream and away from the station Sagar. At these stations, there are appreciable lags in the tidal data and the tidal ranges are also larger compared to that of Sagar. These might have resulted in a lower R value.
Accordingly, it may be concluded that not all of the upstream stations are suitable for training and inclusion of more stations does not necessarily improve the performance of the model.
Next, validation is done for Models III, IV, and V, with Dataset 2. The statistical parameters R, MAPE, RMSE, and E estimated from the results of this validation are shown in Table 2. It may be seen from Table 2 that all three models yielded very close values of parameters R, MAPE, RMSE, and E, with Model IV resulting in marginally better values.
The water levels obtained from these three models with Dataset 2 are plotted with the corresponding observed levels at Sagar, and these are shown in Figures 4(a), 5(a), and 6(a), for Models III, IV, and V, respectively. It may be observed from these figures that all three models resulted in water levels very close to the observed levels. However, the peak values of the simulated levels are a little lower than the observed levels. One probable reason for this could be that Dataset 1, which is used in training, contains the water level data of spring tides, whereas Dataset 2, used in validation, has the water level data of neap tides (lower peaks). The purpose behind the use of these two datasets of different tides (spring and neap) was to examine whether the trained model could be used to predict any part of the tidal cycle (spring or neap), which was one of the objectives of this study. Based on the results, it may be stated that the trained network can be used to predict any part of the tidal cycle satisfactorily.
All three models are considered for simulation of the discontinuous Dataset 3. The simulated water levels are plotted with the observed water levels at Sagar for assessing the performances of the three models in terms of predicting full-cycle data. These plots are shown in Figures 4(b), 5(b), and 6(b) for Models III, IV, and V, respectively. Scatterplots of observed vs. ANN predicted levels are developed for all three models, and these are shown in Figure 7 for Dataset 2 and in Figure 8 for Dataset 3. It may be observed that in the case of Model III, the scatter and dispersion around the regression line are smaller.
It may be noted that both Model III and Model IV have four input variables, and in terms of the statistical parameters mentioned in Table 2, Model IV apparently performed better. Considering the variations in the tidal range, as well as high water levels and low water levels (peaks and troughs), it can be seen from Figure 5 that Model IV performed better. Model III marginally overpredicted the high water levels compared to Model IV.
Model V overpredicted the high water levels (Figure 6(b)) more than Model III did (Figure 4(b)). It may be noted that Model V has input variables similar to those of Model III but has nine additional variables. Inclusion of these additional variables was expected to improve the performance, but it did not happen. This indicates that increasing the number of variables does not necessarily improve the performance of an ANN model.
Based on the validation and simulation results, Model IV may be considered the preferred one, with the advantage of using data from 1 day earlier (i.e., already observed data) at the neighboring station as input. Hence, it may also be used for 1-day ahead forecast.

CONCLUSION
In this paper, an ANN-based approach is presented for prediction of complete cycles of tidal water levels at a remote station in an estuary where only half-cycle data can be physically measured. Based on the objectives of this study, the following conclusions are made from the results obtained.
• Comparisons of the graphical plots of observed and predicted tidal water levels, and the comparisons of statistical error parameters (R, MAPE, RMSE, and E) estimated from the predicted and observed tidal water levels, indicated that the developed ANN models may be accepted for tidal water level prediction.
• Water level data of the nearest upstream station are more suitable for training. Inclusion of data of additional upstream stations further away from the station concerned does not necessarily improve the performance of the model.
• The developed ANN models can be used to generate complete cycles of water level data at the remote station, for both spring and neap cycles.
• Performance of Model IV indicated that 1-day ahead prediction is possible.
• Performance of Model V indicated that increasing the number of lagged input variables does not improve the performance of the model.
• The generated tidal data can be used as a boundary condition in the hydrodynamic models.
However, it should be noted that, because ANN is a data-driven model, the above conclusions are primarily related to the system under study. Also, it should be mentioned here that availability of adequate and synchronous data at different stations is a concern. In this study, only two sets of continuous tidal water level data were available at Sagar, which were used in training and validation. With more continuous datasets available, additional training or validation could have been done. Data during the monsoon period were not available because the sea becomes very rough and there are administrative restrictions on movement at that time. Nevertheless, this study successfully identified the applicability of ANN in such field conditions.