Estimation of pressure drop in a demister for multi-stage flash desalination process based on stacked long short-term memory (LSTM) neural networks

Thermal desalination processes such as multi-stage flash desalination (MSF) use demisters to separate the flashed-off vapor from brine droplets in each flashing stage. The performance of an MSF plant depends on the quantity of fresh water produced, and the separation efficiency depends on the pressure drop across the demister, which in turn influences plant performance. This study proposes the application of Long Short-Term Memory (LSTM) neural networks for estimating the pressure drop across demisters. The stacked LSTM algorithm is effective in estimating the pressure drop for both experimental and real plant data, and its superiority over the reference benchmarks is evident: the Root Mean Square Error (RMSE) of the stacked LSTM estimation is 40% lower than that of the CFD model of Al-Fulaij et al (Al-Fulaij H, et al., 2016, Desalination, 385: 148–157).


Introduction
Demisters are commonly used in processes involving the separation of micron-scale droplets. They are mostly found in petroleum refining, pulp and paper, petrochemicals, dryers, scrubbers, flash drums, etc. A demister acts as a mist separator: a porous blanket of wire separates the liquid droplets from the vapor [1]. Wire materials such as stainless steel, monel, nickel, aluminium, copper, carbon steel, tantalum, polyethylene, fluoropolymer, and glass multifilament are commonly used as knitting meshes [2]. Even in operations involving high pressure, droplets as small as 10 μm are successfully removed by demisters [3]. When the mixed stream of liquid and vapor passes through a demister, the wire acts as an obstacle in the flow path. The smaller droplets and the vapor can move around the wire, while the bigger droplets adhere to the wire by surface tension and coalesce with other droplets [4]. When the combined liquid droplet becomes heavy enough to overcome the surface tension and the drag force, it starts to detach from the wires [1].
Packing density, void fraction, and specific surface area are major characteristics of demisters, while vapor velocity is the major indicator of demister performance [5]. When the vapor velocity is low, the momentum of a liquid droplet is also low, so it slips through the wire mesh and is carried over with the vapor stream. With high vapor velocity, the momentum of a liquid droplet is high and it collides with the wire; the droplet then has little chance to flow through the demister with the vapor stream, and the efficiency of the demister is therefore increased. Originally, El-Dessouky et al [1] and Al-Dughaither et al [6] proposed empirical models, which are simple models for determining demister efficiency. These empirical models relate vapor and liquid physical properties to demister characteristics including wire diameter, demister height, and demister area [7]. According to the empirical models, the pressure drop depends positively on the vapor velocity and the packing density, whereas it depends negatively on the wire diameter (1–5 mm). The El-Dessouky et al correlation [1] for determining the separation efficiency η_t is shown in equation (1), where ρ_l is the liquid density.
Even though the empirical models are easy to use, the values of the empirical constants vary from experiment to experiment, which is why the empirical models are not well suited to describe demister performance. Computational fluid dynamics (CFD) models were then introduced and proven to estimate demister performance effectively. Rahimi and Abbaspour simulated a CFD model for wire mesh demisters over a vapor velocity range of 1–7 m s⁻¹, a packing density of 200 kg m⁻³, a pad thickness of 0.2 m, and a wire diameter of 0.31 mm. The estimation was close to the experimental data and matched the empirical model of El-Dessouky et al at low vapor velocities [8]. Galletti et al utilized either Eulerian or Lagrangian CFD approaches for a waveplate demister, and the estimation closely matched the experimental data [9]. For the multi-stage flash (MSF) desalination process, Al-Fulaij et al [10], using the Eulerian approach for the vapor phase and the Lagrangian approach for the brine droplets, obtained estimates consistent with experimental data. The operating conditions for Al-Fulaij et al were a velocity range of 1.13–10.4 m s⁻¹, packing density of 80.317–176.35 kg m⁻³, demister pad thickness of 100–200 mm, wire diameter of 0.2–0.32 mm, droplet diameter of 10 μm, and volume fraction of inlet water droplets of 1×10⁻⁵ [10]. Increasing the vapor velocity initially increases the demister efficiency; however, the separation efficiency starts to drop once the flooding limit is passed. The demister efficiency varies inversely with pressure [11], so the pressure may be used to estimate the demister efficiency.
Both empirical models and CFD models are simple and provide reliable results; however, they rely heavily on assumptions. Deep learning, which refers to networks with several neural layers, handles non-linearity well and utilizes plant data to generate estimations without knowledge of the plant characteristics. Recent research shows that deep neural networks yield better performance for time series data [12]. The case study is a multi-stage flash desalination process utilizing wire mesh demisters to remove brine droplets from the flashed-off water vapor. In this work, a stacked Long Short-Term Memory (LSTM) algorithm is proposed for estimating the pressure drop from five distinct datasets [13]. The experimental data of El-Dessouky et al [1] and the work of Al-Fulaij et al [10] are used to evaluate the models. The details of the LSTM-based neural network are discussed in section 2. The experimental results and discussion are presented in section 3. The conclusion is given in section 4.

LSTM-based neural network
The Long Short-Term Memory (LSTM) network is a recurrent neural network variant in which past data is kept in memory, with three gates, namely the input, forget, and output gates, controlling the flow of information [14]. The LSTM, derived from the recurrent neural network by Hochreiter and Schmidhuber in 1997 [15], is shown in figure 1. At the forget gate, the sigmoid activation function, σ, generates the output at the current time, f_t, using the previous hidden unit, H_{t−1}, a bias, b_f, and the current input, x_t, as shown in equation (2). The output of the sigmoid activation function lies between zero and one: the gate blocks everything when the value is zero and passes everything through when the value is one [16]. The forget gate determines which information will be used; mostly, the key information is kept while useless information is dropped. The previous hidden unit, H_{t−1}, the weight matrices, W_i and W_H, a bias, b_i, and the current input, x_t, are combined at the input gate, i_t, as shown in equation (3). The modulation gate, c̃_t, transforms the previous state and the new input using a tanh activation function, as shown in equation (4).
The current cell state, c_t, is updated from the previous cell state, c_{t−1}, and the current input cell, i_t, as shown in equation (5). The previous cell state is multiplied (⊗) by the current output value of the forget gate; that product is added to the updated current input cell, activated with the hyperbolic tangent (tanh) function, to give the new current cell state. The output gate, o_t, uses a sigmoid activation function, as shown in equation (6). The hidden unit at time t, H_t, is calculated from equation (7). The sigmoid and hyperbolic tangent (tanh) activation functions are shown in equations (8) and (9), respectively.
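Since equations (2)–(9) are not reproduced here, the standard LSTM cell update they describe can be sketched in Python. This is a minimal scalar sketch: the weights and biases are illustrative placeholders, not trained values.

```python
import math

def sigmoid(x):
    # Equation (8): squashes its input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step for scalar inputs, following equations (2)-(7).

    W and b hold illustrative (not learned) weights: W[gate] = (w_x, w_h).
    """
    # Forget gate, equation (2): decide what to keep from the previous cell state
    f_t = sigmoid(W['f'][0] * x_t + W['f'][1] * h_prev + b['f'])
    # Input gate, equation (3): decide how much of the candidate to admit
    i_t = sigmoid(W['i'][0] * x_t + W['i'][1] * h_prev + b['i'])
    # Modulation (candidate) value, equation (4), using tanh per equation (9)
    c_tilde = math.tanh(W['c'][0] * x_t + W['c'][1] * h_prev + b['c'])
    # Cell-state update, equation (5): forget-gated memory plus gated candidate
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate, equation (6)
    o_t = sigmoid(W['o'][0] * x_t + W['o'][1] * h_prev + b['o'])
    # Hidden unit, equation (7)
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Tiny usage example with all weights set to 0.5 and all biases to 0.0
W = {g: (0.5, 0.5) for g in 'fico'}
b = {g: 0.0 for g in 'fico'}
h, c = lstm_step(1.0, 0.0, 0.0, W, b)
```

In a real network each gate has its own weight matrices and the states are vectors; the scalar form is used here only to make the gating arithmetic explicit.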
An ensemble method combines several base models in either a parallel or a sequential manner [12]. In this study, only the stacking technique is considered. When two LSTMs are trained simultaneously, the structure is called a bidirectional Long Short-Term Memory (BiLSTM) [17]. Two or more LSTMs trained in series are called a stacked LSTM; the output from the first LSTM layer is fed as the input to the second LSTM, and so forth. The LSTM algorithm is shown in figure 2. The number of LSTM layers determines the learning capacity, so learning should improve when LSTM layers are added to the network; however, the computation time increases and overfitting can occur [18]. Table 1 lists the network architectures for the LSTM, BiLSTM, and stacked LSTM. All inputs are preprocessed and normalized per minibatch before entering the LSTM layer to accelerate the learning process. For the LSTM algorithm, as shown in figure 2, the network is trained using 80 hidden units, a batch size of 25, the adaptive moment estimation (Adam) optimizer, and a learning rate of 0.0001. A dropout layer (50%) is added to avoid overfitting and thereby improve model performance. Features extracted by the previous layer are aggregated in a fully connected layer, and the output is then estimated through a regression layer.
For the BiLSTM algorithm, as shown in figure 3, all inputs are preprocessed and normalized per minibatch before entering the first LSTM layer to accelerate the learning process. The network is trained using 80 hidden units, a batch size of 25, the adaptive moment estimation (Adam) optimizer, and a learning rate of 0.0001. The output from the first LSTM layer is then fed into a dropout layer (50%), to avoid overfitting, and a fully connected layer. The data extracted from the first LSTM layer then pass through a batch normalization layer to be stabilized, accelerating the learning process before the second LSTM layer. That network is trained using 120 hidden units, a batch size of 25, the Adam optimizer, and a learning rate of 0.0001. The estimated output is then produced after a dropout layer (50%), a fully connected layer, and a regression layer.
The stacked LSTM algorithm, as shown in figure 4, starts with preprocessed and normalized input data to accelerate the learning process before the first LSTM layer, which is trained using 80 hidden units, a batch size of 25, the adaptive moment estimation (Adam) optimizer, and a learning rate of 0.0001. The output from the first LSTM layer enters a dropout layer (50%), to avoid overfitting, and a fully connected layer for feature extraction. Before the second LSTM layer, the data are again normalized to accelerate the learning process; this layer is trained using 100 hidden units with the same batch size, optimizer, and learning rate. After training, the data flow into a dropout layer (50%) to avoid overfitting and a fully connected layer for feature extraction. The data are normalized once more before the third LSTM layer, which is trained using 130 hidden units with the same batch size, optimizer, and learning rate. The estimated output is produced after a final dropout layer (50%), fully connected layer, and regression layer.
Two common metrics used to measure accuracy are the mean absolute error (MAE) and the root mean squared error (RMSE); lower values of these metrics indicate a better-performing model. The MAE is defined in equation (10), where y_i^target and y_i^predicted are the target and estimated values, respectively, for a total of N observations. The MAE measures how close the calculated values are to the target values [18] and captures only the average magnitude of the errors. The mean squared error (MSE) and root mean squared error (RMSE), shown in equations (11) and (12), average the squared errors and take the square root of that average, respectively. The MSE and RMSE are close to zero if the model estimates well [19]. All proposed LSTM algorithms are implemented in MATLAB with the Deep Learning Toolbox (Release 2021b, The MathWorks, Inc., Natick, Massachusetts, United States). All computations are performed on a laptop PC with a 1.80 GHz Intel Core i7 processor and 16 GB RAM. The input data are normalized for each minibatch (of size m) before training to improve the generalization ability and convergence speed [19]. The minibatch mean and variance are first calculated as in equations (13) and (14), respectively. Then each sample is normalized as in equation (15), where ε is a constant for numerical stability. The output is updated with the normalized input for each neural network layer as shown in equation (16), where γ and β are obtained from the optimization process [21].
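The error metrics of equations (10)–(12) and the minibatch normalization of equations (13)–(15) can be written compactly. This sketch uses plain Python lists and omits the learnable scale and shift γ and β of equation (16).

```python
import math

def mae(targets, preds):
    # Equation (10): mean absolute error over N observations
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)

def mse(targets, preds):
    # Equation (11): mean squared error
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

def rmse(targets, preds):
    # Equation (12): square root of the mean squared error
    return math.sqrt(mse(targets, preds))

def batch_normalize(batch, eps=1e-5):
    # Equations (13)-(15): minibatch mean and variance, then normalization;
    # eps is the small constant for numerical stability
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# Usage on a toy target/prediction pair
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
```

In the full batch normalization of equation (16), each normalized sample would further be scaled by γ and shifted by β, both learned during optimization [21].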
After the data have been normalized, they are split into 90% training and 10% testing. The weights and biases are adjusted during training so that the model error is minimized, and the model is verified during testing. The reason for the 90% training split is that the data used in this study are limited and highly imbalanced [22]. The process is also run 10 times, in the manner of ten-fold cross-validation, so that the test error rate estimate does not suffer from high bias and high variance [23]. Hyperparameters such as the minibatch size [24], the number of hidden units, and the dropout percentage (50%) affect the performance of a model. The gradient threshold is fixed at two to avoid exploding gradients. All LSTM algorithms are trained with the adaptive moment estimation (Adam) optimizer. The experimental results are compared to the El-Dessouky empirical correlation [1] and the Al-Fulaij et al CFD model [10]. The performance metrics are RMSE and MAE.
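The repeated 90/10 evaluation can be sketched as below. The function names and the toy mean-predictor model are illustrative only, not the MATLAB implementation used in the study.

```python
import random
import statistics

def repeated_holdout(data, train_fn, eval_fn, runs=10, train_frac=0.9, seed=0):
    """Repeat a random 90/10 train/test split `runs` times and summarize the
    per-run test scores, mirroring the repeated-split validation in the text."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)          # fit on 90% of the data
        scores.append(eval_fn(model, test))  # score on the held-out 10%
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage: the "model" is just the training-set mean of the target,
# and evaluation is the RMSE of the test targets against that mean.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)

def rmse_vs_mean(model, test):
    return (sum((y - model) ** 2 for _, y in test) / len(test)) ** 0.5

data = [(x, 2.0 * x) for x in range(50)]
avg_rmse, sd_rmse = repeated_holdout(data, fit_mean, rmse_vs_mean)
```

Reporting the spread across runs alongside the mean is what keeps the test error estimate from resting on a single lucky or unlucky split.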

Experimental results and discussion
This section presents experimental results that examine the effectiveness of the stacked LSTM algorithm in estimating pressure drop. In this study, we focus on pressure drop estimation from five different datasets. Two datasets from experiments contain vapor velocity at a specific packing density and wire diameter [1]. The other three datasets were collected from a real plant under two modes of operation: brine circulation at low and high temperature, and once-through operation at low temperature only [13]. The different conditions of the El-Dessouky et al experiments [1] and the real MSF plant data are employed to verify the effectiveness of our proposed models, as shown in table 2.
All LSTM algorithms produce estimates much closer to the real values than the El-Dessouky empirical correlation [1] and the Al-Fulaij et al CFD model [10]. The MAE performance comparison for all datasets is shown in figure 10. For dataset 1, the estimation errors are lower than those of the El-Dessouky empirical correlation [1] by as much as 57%, 58%, and 82% for the LSTM, BiLSTM, and stacked LSTM, respectively, and lower than those of the Al-Fulaij et al CFD model [10] by 2.62%, 5.03%, and 5.9%, respectively.
The RMSE performance metric for all datasets is shown in figure 11. For dataset 1, the estimation errors are lower than those of the El-Dessouky et al experimental correlation [1] by as much as 66%, 67%, and 85% for the LSTM, BiLSTM, and stacked LSTM, respectively, and lower than those of the Al-Fulaij et al CFD model [10] by 2.24%, 5.79%, and 5.7%, respectively.

Conclusion
In this paper, a stacked Long Short-Term Memory (LSTM) algorithm estimating the pressure drop across demisters in the multi-stage flash (MSF) desalination process is proposed. Five datasets are used in this study: two from the El-Dessouky et al experiments [1] and the rest from a real operating plant [13]. We select a high-accuracy, reliable correlation and a computational fluid dynamics (CFD) model as benchmarks to verify the model's effectiveness. The stacked LSTM algorithm performs better than the Al-Fulaij et al CFD model [10] and the El-Dessouky et al correlation [1]. The structure of the stacked LSTM, in which previous time steps influence the current time step, makes it well suited for dynamic estimation [25], and the data are estimated precisely with the stacked LSTM algorithm. Moreover, accuracy generally improves as the number of LSTM layers increases; however, the risk of overfitting tends to be higher with a greater number of LSTM layers [26]. The overfitting issue can be avoided by incorporating dropout layers into the stacked LSTM architecture. The Root Mean Square Error (RMSE) of the estimation from the stacked LSTM algorithm is 40% and 88% lower than that of the Al-Fulaij et al CFD model [10] and the El-Dessouky et al correlation [1], respectively.