A prediction model of aquaculture water quality based on multiscale decomposition

Abstract: In the field of intensive aquaculture, the deterioration of water quality is one of the main factors restricting the normal growth of aquatic products. Predicting water quality in real time constitutes the theoretical basis for the evaluation, planning and intelligent regulation of the aquaculture environment. Based on the design principles of decomposition, recombination and integration, this paper constructs a multiscale aquaculture water quality prediction model. First, the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) method is used to decompose the different water quality variables at different time scales step by step to generate a series of intrinsic mode function (IMF) components with the same characteristic scale. Then, the sample entropy of each IMF component is calculated, and the components with similar sample entropies are combined, so that the original data are recombined into several subsequences. A prediction model based on a long short-term memory (LSTM) neural network is then constructed to predict each recombined subsequence, and the Adam optimization algorithm is used to iteratively update the weights of the neural network during training. Finally, the predicted values of the subsequences are superimposed to predict the original water quality data. The dissolved oxygen and pH data of an aquaculture base were collected for prediction experiments, the results of which show that the proposed model has a high prediction accuracy and strong generalization performance.


Introduction
With the development of the Internet of Things, big data and artificial intelligence technology, aquaculture is increasingly becoming more intensive, precise and intelligent. In the field of high-density intensive aquaculture, predicting the development trend of water quality (that is, predicting the trends of variables such as dissolved oxygen, pH, temperature, and turbidity) in real time is of great significance for preventing the water quality from deteriorating and for avoiding the outbreak of disease.
Existing water quality prediction methods mainly include traditional statistical methods, such as regression analysis and time series methods, and intelligent calculation methods, such as neural networks and support vector machines (SVMs). For instance, Rajaee and Jafari [1] proposed integrating the discrete wavelet transform into artificial neural networks, gene expression programming, and decision trees for the prediction of water quality indicators. Haghiabi et al. [2] studied the application of artificial neural networks (ANNs), the group method of data handling (GMDH) and SVMs to the prediction of water quality. Rahman et al. [3] developed a set of step predictors, each of which predicts a specific timestamp, thereby providing new insights for the long-term prediction of dissolved oxygen. Barzegar et al. [4] studied the wavelet and extreme learning machine (WA-ELM) hybrid model for multi-step-ahead prediction and adopted the boosting ensemble method. Jafari et al. [5] proposed a water quality prediction model based on a hybrid wavelet genetic programming method and Shannon entropy. Rozario and Devarajan [6] employed the fuzzy C-means clustering method and constructed a radial basis function (RBF) neural network to predict the change trend of dissolved oxygen. Kisi et al. [7] proposed Bayesian model averaging (BMA) to estimate the hourly dissolved oxygen. Li et al. [8] established three dissolved oxygen prediction models using a recurrent neural network (RNN) model, a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Dabrowski et al. [9] studied a method to forecast the quality of prawn pond water that introduces mean reversion into multi-step-ahead forecasts of state-space models. Chen et al. [10] established a hybrid three-dimensional dissolved oxygen content prediction model based on an RBF neural network with K-means and subtractive clustering.
An RNN introduces the concept of time series into the network structure, making it more adaptable in time series data prediction and analysis tasks. Building on the RNN, an LSTM neural network [11] alleviates the vanishing gradient problem and reduces the exploding gradient problem of RNN models. Moreover, LSTM neural networks have a time loop structure that can effectively describe sequence data with temporal and spatial correlations and can solve the problem of long-distance dependence [12]. An LSTM adjusts the structure of the simple recurrent neural network by adding a gating mechanism to control the transmission of information in the network. As a variant of the LSTM, the gated recurrent unit (GRU) changes the gating mechanism and merges the cell state and hidden state [13,14]. A GRU directly passes the hidden state to the next unit, while an LSTM uses a memory cell to wrap the hidden state. The performance of GRU and LSTM networks is similar in many tasks. The GRU structure is simpler and has fewer parameters, so it tends to converge more easily. In recent years, LSTM and GRU networks have been regarded as effective methods for time series forecasting problems.
Michieletto et al. [15] studied the application of LSTM and phased LSTM (PLSTM) networks to the prediction of dissolved oxygen. Li et al. [16] proposed a water quality prediction model combining a sparse autoencoder with an LSTM network. Barzegar et al. [17] studied the application of a convolutional neural network (CNN)-LSTM hybrid deep learning model for short-term water quality prediction. Zhou et al. [18] proposed a water quality prediction method based on the improved gray relational analysis (IGRA) algorithm and an LSTM neural network. Zou et al. [19] proposed a water quality prediction method based on a bidirectional LSTM network with multiple time scales.
Aquaculture water quality data are nonlinear and nonstationary. Hence, if the original data are used directly for prediction, problems such as sensitivity to noise and low prediction accuracy arise. Empirical mode decomposition (EMD) [20], a fully adaptive nonlinear signal processing algorithm, can reduce the nonstationarity of the input data and improve the model prediction accuracy. Accordingly, Fijani et al. [21] proposed a hybrid water quality prediction model that combines the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and variational mode decomposition (VMD) algorithms with an extreme learning machine (ELM) and a least-squares SVM (LSSVM). Likewise, Huan et al. [22] studied a hybrid model combining ensemble EMD (EEMD) and an LSSVM for the prediction of dissolved oxygen. Similarly, Eze and Ajmal [23] proposed a combined dissolved oxygen prediction method based on EEMD and an LSTM neural network. Liu et al. [24] constructed a multiscale water temperature prediction model based on EMD and a back-propagation neural network.
For data sets with dynamic and nonlinear characteristics, relying only on signal decomposition limits the accuracy and efficiency of prediction, whereas reorganizing the decomposed sequences using sample entropy can reduce the workload and simplify the subsequent modeling. Sun et al. [25] proposed a hybrid wind speed forecasting model that includes fast ensemble empirical mode decomposition, sample entropy, phase space reconstruction and a back-propagation neural network with two hidden layers. Wang et al. [26] proposed a hybrid model composed of complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), sample entropy (SE), long short-term memory (LSTM) and random forest (RF) to predict coal prices accurately. Wu et al. [27] proposed a hybrid air quality index forecasting model using variational mode decomposition (VMD), sample entropy (SE) and a long short-term memory (LSTM) neural network. An increasing number of researchers combine sample entropy with decomposition techniques. Sample entropy can be used to analyze the complexity of the decomposed sequences, to reorganize the sequences and thereby reduce computational complexity, or to determine the number of decomposition layers.
The dissolved oxygen content and pH value are important factors that affect the quality of aquaculture water. When the dissolved oxygen content in aquaculture water falls below 4 mg/L, the intake of food by fish begins to decrease, and dissolved oxygen contents higher than 14.4 mg/L can cause gas bubble disease. In addition, the pH range of aquaculture water suitable for fish is from 7.5 to 8.5, and pH values less than 4 or greater than 10 can result in the death of a large number of fish. Dissolved oxygen and pH are therefore important indicators affecting the survival of aquatic organisms. By accurately predicting their development trends, breeders can detect abnormal water quality in advance, so as to avoid the death and disease of aquatic organisms and ensure the high-quality development of aquaculture.
By combining the principles of decomposition and reconstruction with deep learning, this paper constructs a water quality prediction model named CEEMDAN-SE-LSTM and conducts research on the prediction of dissolved oxygen and pH to forecast aquaculture water quality. The proposed model first applies CEEMDAN to decompose the dissolved oxygen and pH sequences at multiple scales, thereby obtaining a series of intrinsic mode functions (IMFs) with different characteristic scales and a remainder. Then, the IMF components with similar sample entropy (SE) are recombined to reduce the input complexity. Finally, the reconstructed sequences are applied to a trained LSTM neural network for single-step prediction, the values of which are integrated to obtain the final prediction result.
The contributions of this paper are listed as follows:
(1) This paper uses CEEMDAN to decompose dissolved oxygen and pH data into subsequences with different time scales. This process captures the characteristics and trends of the water quality series and transforms complex single-scale characteristics into simple multiscale characteristics for ease of prediction.
(2) The SE of each IMF is calculated, and IMFs with similar entropy values are merged into recombined sequences. Then, an LSTM prediction model is trained for each recombined sequence, the model structure is optimized for single-step prediction, and finally, the prediction results are integrated.
(3) The autocorrelation coefficient is used to measure the degree of correlation between different time points in the water quality series. It is used to select the time step (input window length) of the prediction model, thus avoiding redundant or insufficient input information and increasing the efficiency of the constructed prediction model.
The remainder of this paper is structured as follows: the CEEMDAN algorithm, SE and LSTM neural network are described in Section 2. The experimental process of the CEEMDAN-SE-LSTM model and a comparative analysis with other models are discussed in Section 3. The paper is summarized and the directions of future research on water quality prediction are discussed in Section 4.

Complete ensemble empirical mode decomposition with adaptive noise
EMD is driven by the variation in the data itself and can be applied directly without preliminary analyses. However, studies have shown that EMD suffers from mode mixing [28,29]. To solve the mode mixing problem, an ensemble version of EMD called EEMD [30] was developed, which adds white noise before the EMD decomposition so that each decomposed IMF contains a single mode. Although EEMD greatly reduces the possibility of mode mixing, it raises a new problem: residual noise remains in the signal after reconstruction. Therefore, complementary EEMD (CEEMD) [31] was proposed, which adds the white noise to the original data in pairs and thereby greatly alleviates the residual noise after reconstruction. CEEMD still has some problems, such as incompleteness and a large amount of calculation [32]. More recently, CEEMDAN [33] was proposed. CEEMDAN adds a limited amount of adaptive white noise at each stage, which can effectively suppress residual noise, increase the reconstruction accuracy, and reduce the number of iterations; moreover, CEEMDAN is reported to be more suitable for nonlinear signal analysis than other existing methods [34].
The steps for decomposing the original water quality time series in CEEMDAN are as follows:
(1) Add white noise following a normal distribution to the original series. The resulting water quality time series of the i-th experiment is shown in Eq (2.1), where $v_i(t)$ is the noise sequence added in the i-th experiment and $\varepsilon_0$ is the noise amplitude.
(2) Perform EMD on each of the n noise-added signals, and obtain the first IMF component by averaging the results, as in Eq (2.2).
(3) Obtain the first remainder by subtracting the first IMF component from the original data, as in Eq (2.3).
(4) Add white noise to the remainder, and continue the decomposition to obtain the second IMF component, as in Eq (2.4), where $E_k(\cdot)$ is the k-th IMF component produced by the EMD method.
(5) Following the above steps, continue the decomposition, and calculate both the remainder after the k-th decomposition and the (k+1)-th IMF component, as in Eqs (2.5) and (2.6).
(6) Repeat step (5) until the number of extremum points of the remainder does not exceed two, at which point the decomposition terminates. Assuming that m IMFs are obtained, the final remainder R(t) is described in Eq (2.7).
After the above steps, the original data are finally decomposed into several IMF components and a remainder. A better decomposition effect can be obtained by adjusting various parameters, such as the noise standard deviation (Nstd), number of realizations (NR), and maximum number of iterations (MaxIter).
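The numbered equations referenced in steps (1) to (6) are not reproduced in this text. Assuming the paper follows the standard CEEMDAN formulation, they would take roughly the following form. The symbols $x(t)$ for the original series, $r_k(t)$ for the k-th remainder and $\varepsilon_k$ for the stage-dependent noise amplitude are notational assumptions, not taken from the source.

```latex
\begin{align}
x_i(t) &= x(t) + \varepsilon_0 v_i(t), \quad i = 1, \dots, n \tag{2.1}\\
\mathrm{IMF}_1(t) &= \frac{1}{n} \sum_{i=1}^{n} E_1\big(x_i(t)\big) \tag{2.2}\\
r_1(t) &= x(t) - \mathrm{IMF}_1(t) \tag{2.3}\\
\mathrm{IMF}_2(t) &= \frac{1}{n} \sum_{i=1}^{n} E_1\big(r_1(t) + \varepsilon_1 E_1(v_i(t))\big) \tag{2.4}\\
r_k(t) &= r_{k-1}(t) - \mathrm{IMF}_k(t) \tag{2.5}\\
\mathrm{IMF}_{k+1}(t) &= \frac{1}{n} \sum_{i=1}^{n} E_1\big(r_k(t) + \varepsilon_k E_k(v_i(t))\big) \tag{2.6}\\
R(t) &= x(t) - \sum_{k=1}^{m} \mathrm{IMF}_k(t) \tag{2.7}
\end{align}
```

Under this form, summing the m IMFs and the final remainder recovers $x(t)$ exactly, which is the completeness property that the final superposition of subsequence predictions relies on.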

Sample entropy
SE [35] can be used to quantify the regularity of time series fluctuations. If the SE difference between two time series is small, the two series are highly similar.
The SE algorithm is expressed as follows:
(1) For an aquaculture water quality data sequence $\{x_i\}$ ($i = 1, 2, \dots, n$) obtained by sampling at equal time intervals, use m as the time window length and divide the original sequence into n-m+1 subsequences $X_m(i)$, as in Eq (2.8).
(2) Define the distance between two vectors $X_m(i)$ and $X_m(j)$ ($i \neq j$) as the maximum absolute difference between their corresponding elements, and calculate the distance between each pair of subsequences, as in Eq (2.9).
(3) Define the threshold F = r * std, where std is the standard deviation of the original sequence and r takes a value between 0.1 and 0.25 according to the application scenario. For each $X_m(i)$, count the number of other subsequences whose distance to it does not exceed F, take the ratio of this count to the number of subsequences other than $X_m(i)$ itself, denote the ratio as $C_i^m(t)$, and calculate the average value $\Phi^m(t)$ following Eq (2.10).
(4) Taking the length of the time window as m+1, repeat the above steps to obtain $\Phi^{m+1}(t)$, and then obtain the SE of the sequence, as in Eq (2.11).
The SE calculation is largely independent of the amount of data; moreover, the calculation speed is fast, and the anti-interference ability is strong [36]. Additionally, the SE is very sensitive to time series changes and thus has been widely used to measure the complexity of various time series.
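The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes the common convention SE = -ln(A/B), where B and A count matching template pairs of length m and m+1, and it uses n-m templates for both lengths so that the two counts are comparable (the text above uses n-m+1 subsequences for length m). The function name and the example series are hypothetical.

```python
import math
import random


def sample_entropy(series, m=2, r=0.2):
    """Sample entropy with window length m and threshold F = r * std."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    threshold = r * std  # F in the text

    def count_matches(length):
        # Build n - m templates of the given length (assumed convention).
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):  # self-matches excluded
                # Distance = maximum absolute element-wise difference.
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= threshold:
                    matches += 1
        return matches

    b = count_matches(m)      # matches at window length m
    a = count_matches(m + 1)  # matches at window length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return -math.log(a / b)


# Hypothetical example series: a strictly periodic signal and Gaussian noise.
periodic = [0.0, 1.0] * 50
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
```

A perfectly regular series yields an entropy of zero, while an irregular series yields a larger value, which is the property the recombination step relies on when it groups IMFs of similar complexity.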

Long short-term memory neural network
An LSTM cell maintains a cell state that is regulated by three gating units: a forget gate, an input gate and an output gate [37]. Each gate in an LSTM model has a unique function. The forget gate controls how much of the previous cell state is forgotten, while the input gate and output gate control the flow of data into and out of the cell [38].
The structure of a single neuron in an LSTM network is illustrated in Figure 1, in which $X_t$ and $h_t$ denote the input and output of the neuron at time t, respectively, and $C_t$ is the cell state at time t.
In an LSTM model, the forget gate uses the sigmoid activation function to determine what information can pass through the cell state. The forget gate generates a value from 0 to 1 based on the output $h_{t-1}$ at the previous moment and the current input $X_t$ to determine whether to completely or partially pass the information $C_{t-1}$ learned at the previous moment. The output formula of the forget gate is shown in Eq (2.12),
where $W_{xf}$, $W_{hf}$ and $W_{cf}$ are the relevant connection weights, $b_f$ is the bias matrix, and σ is the sigmoid activation function, the mathematical formula of which is described in Eq (2.13). The input gate determines which new information needs to be received and consists of two parts. The first part uses the sigmoid activation function to determine which values to update, and the second part applies the tanh activation function to generate a new candidate value $\tilde{C}_t$, as in Eqs (2.14) and (2.15).
In the above formulas, $W_{ii}$, $W_{ci}$, $W_{hi}$ and $W_{ch}$ are the corresponding weights, and $b_i$ and $b_c$ are bias matrices.
The tanh function is the hyperbolic tangent function, whose output lies between −1 and 1; its mathematical formula is shown in Eq (2.16).
The cell state $C_t$ runs through the entire LSTM chain and is updated through the input and forget gates, as in Eq (2.17).
In the above formula, the value of $C_t$ is determined by the cell state of the previous neuron $C_{t-1}$, the forget gate $f_t$, the input gate $i_t$ and the candidate value $\tilde{C}_t$.
The output gate determines the output of the model. First, an initial output is obtained through the sigmoid activation function, and then the value of $C_t$ is scaled to between −1 and 1 using the tanh activation function. Finally, the two results are multiplied element by element to obtain the output of the model, as in Eqs (2.18) and (2.19),
where $W_{ox}$ and $W_{oh}$ are the relevant connection weights and $b_o$ is a bias matrix. Through the gate structure, an LSTM can selectively retain or forget information as it flows through each neuron. This structure can effectively solve the problem of long-distance dependence and is suitable for the prediction of aquaculture water quality time series data.
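The gate computations described above can be illustrated with a single forward step of a one-unit LSTM cell. This is a sketch under stated assumptions: it uses the common formulation without the cell-state (peephole) weight $W_{cf}$ that appears in the paper's forget gate, it treats the input and state as scalars, and the parameter names in the dictionary are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    # Forget gate: how much of the previous cell state is kept.
    f_t = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])
    # Input gate: which part of the new candidate is written.
    i_t = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])
    # Candidate value, squashed to (-1, 1) by tanh.
    c_hat = math.tanh(p["w_xc"] * x_t + p["w_hc"] * h_prev + p["b_c"])
    # Cell state update through the forget and input gates.
    c_t = f_t * c_prev + i_t * c_hat
    # Output gate scales the squashed cell state to give the output.
    o_t = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# With all parameters set to zero, every gate outputs 0.5 and the candidate is 0.
zero_params = {name: 0.0 for name in (
    "w_xf", "w_hf", "b_f", "w_xi", "w_hi", "b_i",
    "w_xc", "w_hc", "b_c", "w_xo", "w_ho", "b_o")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

In the zero-parameter example, half of the previous cell state is retained and nothing new is written, which makes the role of each gate easy to check by hand.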

CEEMDAN-SE-LSTM hybrid prediction model
Aquaculture water quality data (such as dissolved oxygen and pH) are nonlinear and nonstationary and are easily affected by many factors, such as the water temperature, weather, and aquaculture density. This paper proposes a hybrid prediction model named CEEMDAN-SE-LSTM. The model first uses CEEMDAN to decompose the water quality sequence data and then uses the SE to reconstruct similar sequences. Finally, an LSTM network is used for the single-step prediction of each sequence before integrating to obtain the final prediction result. A flow chart of the prediction model is presented in Figure 2.
The CEEMDAN-SE-LSTM prediction model proposed in this paper mainly consists of four parts.
(1) Decomposition of water quality data: The CEEMDAN method is used to decompose the original water quality series into IMF components with different frequencies, so as to reduce the influence of the nonstationarity of the original series on the prediction accuracy.
(2) Combination based on SE: The sample entropy of each IMF is calculated separately, and IMFs with similar sample entropy values are recombined into a new sequence, which can effectively reduce the amount of calculation and avoid the inaccurate information extraction caused by over-decomposition.
(3) Prediction of each sub-sequence: According to the data characteristics of each sub-sequence, the hyperparameters of the LSTM neural network are optimized for individual prediction.
(4) Integration: The prediction results of each recombination sequence are added to obtain the prediction results of the final water quality data.
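The four parts above can be read as a pipeline in which each stage is a replaceable function. The sketch below is only a structural illustration: `hybrid_forecast` and its arguments are hypothetical names, and the toy decomposition and persistence predictor stand in for CEEMDAN and the trained LSTM models, which are not reproduced here.

```python
from typing import Callable, List, Sequence

Series = List[float]


def hybrid_forecast(
    series: Sequence[float],
    decompose: Callable[[Sequence[float]], List[Series]],
    recombine: Callable[[List[Series]], List[Series]],
    predict_next: Callable[[Series], float],
) -> float:
    """Decompose, recombine, predict each subsequence, then sum the predictions."""
    components = decompose(series)                          # part (1)
    subsequences = recombine(components)                    # part (2)
    predictions = [predict_next(s) for s in subsequences]   # part (3)
    return sum(predictions)                                 # part (4)


def toy_decompose(series):
    """Stand-in for CEEMDAN: a 3-point running mean plus the leftover detail."""
    trend = []
    for i in range(len(series)):
        window = series[max(0, i - 2): i + 1]
        trend.append(sum(window) / len(window))
    detail = [x - t for x, t in zip(series, trend)]
    return [detail, trend]  # the components sum back to the input series


def persistence(subsequence):
    """Stand-in for a trained model: predict that the last value repeats."""
    return subsequence[-1]


forecast = hybrid_forecast([6.0, 6.5, 7.0, 6.5], toy_decompose, lambda c: c, persistence)
```

Because the toy components add up to the original series, the summed persistence forecasts equal the last observed value, which shows that the integration step is a plain superposition.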

Sources of water quality data
The Shandong Yantai aquaculture base was selected as the experimental area. This aquaculture base is equipped with modern fishery equipment such as dissolved oxygen sensors, pH sensors, aeration pumps, and wireless monitoring systems. Dissolved oxygen data, which fluctuate considerably, were collected every 10 minutes during the 9 days from August 25 to September 2, 2019; after data preprocessing, a total of 1024 valid data points were retained. The pH of the aquaculture water was relatively stable and was measured once an hour, yielding a total of 634 valid data points. Eighty percent of the data were selected to train the prediction model, and the remaining twenty percent were used for testing.

Multiscale decomposition of water quality data
The CEEMDAN algorithm is used to decompose the dissolved oxygen and pH data at the marine aquaculture base in Laishan, Yantai, Shandong, and to identify and separate several IMF components and one residual component step by step. The results are shown in Figure 3.
Through the CEEMDAN algorithm, the original dissolved oxygen sequence is decomposed into seven IMF components with different characteristics and a residual signal. Likewise, the original pH sequence is divided into six IMF components and a residual signal. The results demonstrate that the features at different scales in the original data sequences are decomposed well.

Reconstruction of the IMF components based on the sample entropy
Because CEEMDAN produces a large number of IMF components, predicting each component directly would increase the computational cost. Therefore, this paper uses the SE to evaluate the complexity of each IMF component and then reconstructs the decomposed components based on the differences in the SE among the components.
Experimental verification suggests that, when calculating the SE of the data samples in this paper, the time window length parameter m = 2 and the threshold parameter F = 0.2 * std(IMF(i)) best reflect the different complexity of each component. The SE of each IMF component decomposed from the above dissolved oxygen and pH data sequences is plotted in Figure 4. Adjacent IMF components whose SE difference is less than 0.1 are considered similar (that is, their complexity and regularity are similar), and thus, these components can be recombined into a single new component. This recombination of IMF components can reduce the computational complexity of the prediction model and prevent the extraction of inaccurate information caused by overdecomposition. The IMF components of the dissolved oxygen and pH data sequences in this paper can be recombined into several subsequences, as shown in Table 1. According to the recombination scheme described in Table 1, the dissolved oxygen and pH data sequences were recombined separately, and the results are shown in Figure 5.
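The recombination rule described above can be sketched as follows. The sketch assumes that each IMF is compared with its immediate predecessor, which is one reading of "adjacent components whose SE difference is less than 0.1"; the function names and the entropy values in the example are hypothetical and do not reproduce Table 1.

```python
from typing import List, Sequence


def group_adjacent_by_entropy(entropies: Sequence[float], tol: float = 0.1) -> List[List[int]]:
    """Group consecutive IMF indices whose SE differs from the previous one by less than tol."""
    if not entropies:
        return []
    groups = [[0]]
    for k in range(1, len(entropies)):
        if abs(entropies[k] - entropies[k - 1]) < tol:
            groups[-1].append(k)   # similar complexity: extend the current group
        else:
            groups.append([k])     # otherwise start a new group
    return groups


def recombine(imfs: Sequence[Sequence[float]], groups: List[List[int]]) -> List[List[float]]:
    """Sum the IMFs of each group point by point to form the new subsequences."""
    length = len(imfs[0])
    return [[sum(imfs[k][t] for k in group) for t in range(length)] for group in groups]


# Hypothetical SE values for seven components, from high to low complexity.
example_entropies = [1.2, 0.9, 0.85, 0.4, 0.35, 0.33, 0.05]
example_groups = group_adjacent_by_entropy(example_entropies)
```

With these illustrative values, the seven components collapse into four subsequences, so four prediction models would be trained instead of seven.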

Construction of the CEEMDAN-SE-LSTM hybrid forecasting model
This paper uses the Keras deep learning library in Python based on the TensorFlow framework, adopts a sequential model structure, and combines an LSTM network with a dense layer to build a prediction model. To train the model, the mean square error (MSE) is selected as the loss function [39], adaptive moment estimation (Adam) is used as the parameter optimizer [40], and the dropout [41] method is used to prevent overfitting. The input time series window length, learning rate, number of iterations and other hyperparameters are tuned to improve the convergence speed and prediction accuracy.
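A model of the kind described above might be set up as in the following sketch. The layer sizes, dropout rate and learning rate are hypothetical placeholders, since the paper's tuned values are not given in this text; only the overall structure (Sequential model, LSTM plus dense layer, MSE loss, Adam optimizer, dropout) comes from the description. The helper `make_windows` is likewise an illustrative name.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (inputs, targets) pairs for single-step prediction."""
    count = len(series) - window
    inputs = [list(series[i:i + window]) for i in range(count)]
    targets = [series[i + window] for i in range(count)]
    return inputs, targets


def build_model(window: int, units: int = 32, dropout: float = 0.2,
                learning_rate: float = 1e-3):
    """Sequential LSTM + Dense model; all default values are hypothetical."""
    # Imported lazily so that make_windows can be used without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Input(shape=(window, 1)),  # one feature per time step
        keras.layers.LSTM(units),
        keras.layers.Dropout(dropout),          # regularization against overfitting
        keras.layers.Dense(1),                  # single-step output
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
    )
    return model


example_inputs, example_targets = make_windows([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
```

Before training, the inputs would need to be converted to an array of shape (samples, window, 1), and one such model would be fitted per recombined subsequence.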
This paper uses the autocorrelation coefficient to determine the length of the input time window. The autocorrelation coefficient [42,43] measures the correlation degree of the time series with itself at different time points. The mathematical formula of the autocorrelation coefficient is expressed in Eq (3.1).
where k is the lag order of the time series $X = \{x_1, x_2, \dots, x_n\}$ and u is the sample mean of the series. The value of $R_k$ lies between −1 and 1. When the absolute value of $R_k$ is greater than 0.8, the value at a given time point is strongly correlated with the value k steps earlier.
The four subsequences formed by the decomposition and recombination of the abovementioned dissolved oxygen sequence and their respective autocorrelation coefficients are shown in Figure 6. Taking the second dissolved oxygen subsequence (SE2) as an example, the autocorrelation coefficients at the first three lag orders are all greater than 0.8, indicating a strong correlation. Therefore, the length of the input time window is selected as 3; that is, the values at every 3 consecutive time points in SE2 are used to predict the value at the next time point. In the same way, the time window lengths of the other subsequences are determined by their autocorrelation coefficients.
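The window selection rule can be sketched as below. Since Eq (3.1) is not reproduced in this text, the sketch assumes the usual sample autocorrelation, in which the lag-k covariance is divided by the total sum of squared deviations; the function names, the `max_lag` cap and the example series are hypothetical.

```python
import math
from typing import Sequence


def autocorrelation(series: Sequence[float], k: int) -> float:
    """Sample autocorrelation at lag k (assumed form of Eq (3.1))."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)  # zero for a constant series
    numer = sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k))
    return numer / denom


def select_window(series: Sequence[float], threshold: float = 0.8, max_lag: int = 20) -> int:
    """Count the leading lags whose |R_k| exceeds the threshold (at least 1)."""
    window = 0
    for k in range(1, max_lag + 1):
        if abs(autocorrelation(series, k)) > threshold:
            window = k
        else:
            break  # stop at the first lag that is no longer strongly correlated
    return max(window, 1)


# Hypothetical smooth series: a sine wave with a period of 100 samples.
wave = [math.sin(2 * math.pi * t / 100) for t in range(1000)]
window = select_window(wave)
```

For a subsequence such as SE2, where only the first three lags exceed 0.8, this rule would return 3, matching the choice described above.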
Eighty percent of the data are extracted from each sequence to train the model. After training and verification, the settings of the parameters for each single-step prediction model, such as the learning rate, number of iterations, and batch size, are optimized. Then, the data of each subsequence test sample are input into the model for prediction; after the prediction result of each sequence is obtained, each single-step prediction value is superimposed to obtain the final predicted values of dissolved oxygen and pH.

Model evaluation and comparative analysis
To verify the prediction performance of the CEEMDAN-SE-LSTM model proposed in this paper, a variety of evaluation indicators [44,45] are used to evaluate the prediction effect of the model.
(1) The mean absolute error (MAE), the average value of the absolute error, can better reflect the actual situation of the error in the predicted value. The MAE is calculated using Eq (3.2).
(2) The root mean squared error (RMSE) is used to measure the deviation between the predicted value and the true value following Eq (3.3).
(3) The mean absolute percentage error (MAPE) is inversely proportional to the accuracy: the smaller the MAPE is, the more accurate the prediction. The MAPE is expressed in Eq (3.4).
In the above formulas, $y_i$ represents the true value, $\hat{y}_i$ represents the predicted value, and N is the number of samples.
We implemented three other prediction models (RBF, RNN and GRU) in Python. We also used CEEMDAN and SE to decompose and recombine the dissolved oxygen and pH time series and built three other hybrid prediction models: CEEMDAN-SE-RBF, CEEMDAN-SE-RNN and CEEMDAN-SE-GRU. With reference to the related literature, we reproduced the hybrid prediction model of variational mode decomposition and LSTM (VMD-LSTM) [46] and the hybrid prediction model of wavelet transform and LSTM (WT-LSTM) [47]. Using the same data samples, the above models are compared with the CEEMDAN-SE-LSTM prediction model proposed in this paper. The prediction errors of each model for dissolved oxygen and pH are shown in Table 2.
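The three error measures listed in items (1) to (3) can be computed as in the sketch below. Because Eqs (3.2) to (3.4) are not reproduced in this text, the standard definitions are assumed, with the MAPE expressed as a percentage; the example values are hypothetical and are not taken from Table 2.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted values.
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 5.0, 5.0]
errors = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

The RMSE penalizes large deviations more heavily than the MAE, while the MAPE is scale-free, so the three measures together give a fuller picture than any one of them alone.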
The experimental results confirm that the long-term memory function of an LSTM neural network gives it certain advantages in the prediction of water quality time series, and its prediction accuracy is higher than that of the RBF, RNN and GRU models. Compared with the single prediction models, the hybrid prediction models based on the principles of decomposition and recombination achieve better prediction effects. Compared with the other hybrid models, the CEEMDAN-SE-LSTM prediction model proposed in this paper has the lowest prediction error and the best performance in the prediction of the dissolved oxygen and pH of the aquaculture water. The prediction effect of each model on pH is shown in Figure 8.
To test the predictive performance of the model on long-period data, we used 14598 dissolved oxygen data points covering 111 days from June 2 to September 21, 2020 to conduct simulation experiments. Eighty percent of the data (11679 points) were selected to train the prediction model, and the remaining twenty percent (2919 points) were used for testing. The prediction effect of each model on dissolved oxygen is shown in Figure 9.
The results of this simulation experiment demonstrate that the prediction curve of the model constructed in this paper is closer to the original water quality data curve than are those of the other prediction models. The CEEMDAN-SE-LSTM model can quickly track abrupt changes in the data, and the agreement between the prediction curve and the original data curve is better than that of the other two hybrid models. Therefore, the prediction error of this model is smaller, and the fitting effect is better. Consequently, the proposed model is suitable for predicting aquaculture water quality data.

Conclusions
The quality of aquaculture water has a tremendous impact on the growth of aquatic organisms and thus is a key factor that determines the intensive and intelligent development of aquaculture. Therefore, accurately predicting water quality has always been a key issue to be resolved in the aquaculture field. This paper focuses on this problem from two perspectives, namely, multiscale decomposition and LSTM neural network optimization, and the CEEMDAN-SE-LSTM hybrid prediction model is proposed.
The CEEMDAN algorithm does not need a basis function to be set in advance and can automatically perform decomposition step by step according to the characteristics of the sequence. The individual IMF components obtained after decomposition reflect the fluctuating characteristics of the time series on different time scales. The decomposed subsequences of each water quality factor are predicted separately, and finally, the prediction results are integrated, which can improve the prediction accuracy compared with the direct prediction of the original sequences. The LSTM neural network addresses the short-term memory problem by adding gates to a recurrent neural network model. Compared with the other neural networks tested, the LSTM has a better efficiency and higher accuracy in the prediction of time series.
The prediction model proposed in this paper provides a scientific basis for accurately predicting aquaculture water quality and has important guiding significance both for the intelligent regulation and management of water quality and for ensuring the stable and efficient operation of aquaculture. However, the single-step prediction of each subsequence obtained by decomposition and recombination has a certain prediction error. Hence, the simple superposition of the single-step prediction results will increase the overall error. In the future, in-depth research will be conducted on an integrated stacking method to perform single-step prediction and further improve the prediction accuracy of the model.