A Novel Hybrid CNN-LSTM Scheme for Nitrogen Oxide Emission Prediction in FCC Unit

Fluid Catalytic Cracking (FCC), a key unit for the secondary processing of heavy oil, is one of the main sources of NOx pollutant emissions in refineries, which can be harmful to human health. Owing to its complex behaviour in reaction, product separation, and regeneration, it is difficult to accurately predict NOx emission during the FCC process. In this paper, a novel deep learning architecture formed by integrating Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM) for nitrogen oxide emission prediction is proposed and validated. CNN is used to extract features among multidimensional data. LSTM is employed to identify the relationships between different time steps. The data from the Distributed Control System (DCS) in one refinery was used to evaluate the performance of the proposed architecture. The results indicate the effectiveness of CNN-LSTM in handling multidimensional time series datasets, with an RMSE of 23.7098 and an R of 0.8237. Compared with previous methods (CNN and LSTM), CNN-LSTM overcomes the limitation of high-quality feature dependence and handles large amounts of high-dimensional data with better efficiency and accuracy. The proposed CNN-LSTM scheme would be a beneficial contribution to the accurate and stable prediction of irregular trends for NOx emission from the refining industry, providing more reliable information for NOx risk assessment and management.


Introduction
Fluid Catalytic Cracking (FCC) is one of the most important technologies for the secondary processing of heavy oil in refining and chemical enterprises [1]. Catalytic cracking reaction and catalyst regeneration are the main chemical processes of FCC. In the catalytic cracking reaction, crude oil is transformed into gasoline and diesel under catalysis, during which 40%-50% of the nitrogen in the feedstock is transferred to coke and deposited on the catalyst [2][3][4]. Then, coke-covered spent catalysts are burned in the regenerator for catalyst reactivation, heat balance, energy recovery, and stable operation. During the catalyst regeneration process, about 90% of the nitrogen in coke is converted into N2 and the rest into NOx and other reduced nitrogen compounds (NH3, HCN, etc.). NO and NO2 are the most commonly detected NOx species and have potential risks to human health. As a blood poison, NO can cause hemic hypoxia and depress the central nervous system by strongly binding with hemoglobin (HB); NO2 can cause bronchiectasis (even toxic pneumonia and pulmonary edema) by irritating and corroding the lung tissue [5,6]. Furthermore, with the development of refining technology, especially catalytic techniques, more heavy oil with high nitrogen content (such as residual oil and wax oil) is utilized. Therefore, it is urgent to accurately predict the NOx produced during the FCC process so as to effectively control the noxious gas discharged into the environment subject to technical and economic conditions. The FCC process is complex both from the modelling and from the control points of view [7][8][9][10][11]. Fortunately, many researchers have explored and developed semiempirical models, lumped kinetic models, and molecular-based kinetic models [12]. A comprehensive review of FCC process modelling, simulation, and control was reported by [13].
Many research studies have been conducted using different models for modelling, controlling, and optimizing the FCC process, with promising results [14][15][16]. With the development of statistical learning theory, machine learning algorithms have proved to be effective methods for simulating natural systems, capturing nonlinearity with limited computation costs. The application of machine learning algorithms in the field of FCC is still at an early stage. Michalopoulos et al. [17] and Bollas et al. [18] proved the applicability of Artificial Neural Networks (ANN) in predicting FCC products and optimized the operating conditions by developing ANN models for determining the steady-state behaviour of industrial FCC units. Zhang [19] established a NOx emission model by Support Vector Machine (SVM) and further optimized the parameters with an improved adaptive genetic algorithm. Gu et al. [20] constructed a boiler combustion model on the basis of Least Squares Support Vector Machines (LSSVM) and successfully forecasted NOx emissions and other parameters, which were verified by field data. Recent advances in artificial intelligence (AI), led by deep learning, offer powerful predictive tools for effectively modelling highly complex chemical processes (such as FCC). Shao et al. proposed a new fault diagnosis method for chemical processes by combining LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) [21]. Yang et al. integrated a deep neural network ("black box" model) with a lumped kinetic model ("white box" model) to create a novel "gray box" model for improving the efficiency and accuracy of simulating the FCC process [22]. However, to the best of the authors' knowledge, there are few research studies using deep learning algorithms for predicting the NOx emission in FCC units. Some research studies of pollution emission problems have been conducted in power plants [23]. Compared to power plants, the FCC process is relatively complex, with more factors involved.
Therefore, it is very difficult to predict NOx emissions in FCC units.
In this paper, a novel deep learning architecture for predicting NOx emissions in the FCC unit is proposed. The deep learning architecture is formed by integrating Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM) (referred to as CNN-LSTM hereafter), with CNN layers extracting features among several variables and LSTM layers learning time series dependencies. The data from the Distributed Control System (DCS) in one refinery was used to demonstrate the performance of CNN-LSTM in the FCC unit. The main contributions of this paper are (1) the proposal of a novel hybrid CNN-LSTM scheme which is able to extract features among different data sequences as well as features between different time steps; (2) the application of the proposed scheme to predict NOx emission during the FCC process, with significant results.

Convolutional Neural Network Model (CNN)
CNN is a special kind of neural network which is widely used in the field of image processing [24,25]. In CNN, a feature map is used to extract the features from the input of the previous layer with a convolution operation. The pooling layer is used to reduce the computational complexity by reducing the size of the output passed from one stack layer to the next, while at the same time preserving important information.
There are many pooling techniques available; the most widely used is maximum pooling, which keeps the maximum element within each pooling window.
The pooling layer receives the outputs of the convolution layer and maps them to the next layer. The last layer of a CNN is usually fully connected, for data classification. Figure 1 shows the basic architecture of CNN.
In neural network training, the accuracy and training speed can be affected by many factors [26], for example, the number of input layer nodes, the number of hidden layers, the number of hidden layer nodes, and the Internal Covariate Shift (ICS). That is to say, the inputs of the current layer change with the variation of parameters in the previous layers, which leads to more training time. In addition, if the inputs are distributed in ranges where the gradient of the activation function is low, the ICS can cause the gradient to vanish. In order to alleviate these problems, the Dropout method was included as follows.
Dropout (Figure 2) was first proposed by Hinton et al. in order to reduce the overfitting problem in neural networks [27][28][29][30][31][32]. In the dropout procedure, each unit is dropped with a probability of P, which reduces the local feature dependency of the model and consequently improves its generalization ability effectively.
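As a concrete illustration, inverted dropout can be sketched in a few lines of numpy (a minimal sketch, not the authors' implementation; the array sizes are arbitrary):

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p and rescale
    the survivors by 1/(1 - p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((1000, 64))
dropped = dropout(activations, p=0.5, rng=rng)
```

At inference time (`training=False`) the layer is an identity, so no rescaling is needed.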

Long Short-Term Memory Network (LSTM)
RNN is a kind of deep neural network which is specially designed to process sequential data [33]. Compared with traditional ANN, the characteristic of RNN is the inclusion of dependencies through time. The basic structure of an RNN is shown in Figure 3.
The left side and the right side of the architecture are the folded form and the expanded form, respectively. In equations (1)∼(4), t is time, x is the sequence of input data, h is the hidden layer state of the network, o is the output vector of the neuron, U is the parameter matrix from the input layer to the hidden layer, V is the parameter matrix from the hidden layer to the output layer, W is the parameter matrix between the hidden layers at different times, and ŷ_t represents the probability output of the predicted value after normalization. All the parameter matrices are shared by the hidden states at different times.
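In a standard simple-RNN formulation consistent with this notation (reconstructed here in a common textbook form, with b and c as bias vectors shared across time steps), equations (1)∼(4) take the shape:

```latex
a_t = U x_t + W h_{t-1} + b \quad (1) \\
h_t = \tanh(a_t) \quad (2) \\
o_t = V h_t + c \quad (3) \\
\hat{y}_t = \operatorname{softmax}(o_t) \quad (4)
```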
In order to solve the vanishing or exploding gradients during RNN training, researchers proposed LSTM by introducing a gate mechanism into RNN [34,35]. The gate mechanism is composed of an input gate, an output gate, and a forget gate. As a special type of RNN, the neurons in the LSTM model are connected to each other in a directed cycle. The basic structure of LSTM is shown in Figure 4 and is similar to that of RNN. The LSTM model saves long-term dependencies in an effective way by using the three gates to regulate and preserve the information entering every node state. The explanation of the LSTM gates and cells is provided in equations (5)∼(8), where b represents the bias vector, W is the weight matrix, x_t is the input vector at time t, and In, f, C, and O represent the input, forget, cell memory, and output gates, respectively.
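In a standard formulation consistent with this notation (σ denotes the sigmoid function and [h_{t-1}, x_t] the concatenated previous hidden state and current input), equations (5)∼(8) take the shape:

```latex
In_t = \sigma(W_{In}[h_{t-1}, x_t] + b_{In}) \quad (5) \\
f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \quad (6) \\
\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C) \quad (7) \\
O_t = \sigma(W_O[h_{t-1}, x_t] + b_O) \quad (8)
```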

CNN-LSTM
Due to the complementary characteristics of CNN and LSTM, a natural way to combine their advantages is to integrate the two. In this study, a new deep learning scheme was proposed by integrating CNN and LSTM. Two layers of CNN were used to ensure the correlation and effective extraction of multidimensional data. The feature sequence from the CNN layer was considered as the input for LSTM.

The time dependencies were further extracted in the LSTM layer. Three fully connected layers exist in the architecture, referred to as FC1, FC2, and FC3. FC1 and FC2 are used to obtain the features extracted by the CNN layer, and FC3 is used to conduct the final data prediction. Figure 5 shows the architecture of the proposed CNN-LSTM.

CNN Layer.
The input data (train_x) and output data (train_y) are defined with time step p and data features q. The ith sample from the training set, train_x_i, is fed into the network as an a × b matrix (x_kl). In the first convolution layer (1st ConV), the convolution kernel size, number, and step length are denoted as filter_size = (m, n), filter_num, and strides, respectively, and the jth convolution kernel is an m × n weight matrix W_j. The convolution operation, denoted ⊙, slides W_j over train_x_i; each element x of the resulting feature map is obtained by multiplying W_j element-wise with the corresponding receptive field:

x = W_j ∘ train_x_i^field = Σ_{k=1}^{m} Σ_{l=1}^{n} w_kl · x_kl,  x ∈ featureMap,

where ∘ means element-wise multiplication. ReLU is used as the activation function of 1st ConV, so the output of the convolutional layer is a nonlinear mapping of the convolution result. In the pooling layer, the data are compressed with a pooling window of pooling_size = (m′, n′). For every feature map, the output size after pooling is a′ × b′, where a′ = a/m′ and b′ = b/n′. Thus, the ith sample passes through the convolutional, activation, and pooling layers in turn. The convolution, activation, and pooling operations in 2nd ConV are similar to those in 1st ConV.
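The feature-map computation described for 1st ConV can be sketched in numpy (a minimal illustration with hypothetical sizes, stride 1, and no padding; not the authors' code):

```python
import numpy as np

def conv2d_valid(x, w):
    """Compute a feature map: multiply the kernel w element-wise with
    every (m x n) receptive field of x and sum (stride 1, no padding)."""
    m, n = w.shape
    a, b = x.shape
    out = np.empty((a - m + 1, b - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + m, j:j + n])
    return out

x = np.arange(16.0).reshape(4, 4)          # a 4x4 input sample
w = np.ones((2, 2))                        # a 2x2 kernel (toy weights)
fmap = np.maximum(conv2d_valid(x, w), 0)   # ReLU activation
```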
Dropout is denoted as dropout(λ); λ takes a value between 0 and 1 and gives the percentage of the data that should be discarded. For instance, dropout(0.5) means that 50% of the neuron outputs are discarded randomly.
The FC layer dense(α) sets the size of the output data in the last dimension. For the above input of shape [none, a′, b′, k], only the last dimension is changed after full connection, yielding [none, a′, b′, α].
Transform [samples, height, width, channels] into [samples, timesteps, features], and then feed the result into the LSTM layer. The modular construction of LSTM, in which the forget, input, and output gates are included, is shown as follows.
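The shape transformation from the CNN output to the LSTM input can be sketched in numpy (the dimensions used here are hypothetical):

```python
import numpy as np

# Hypothetical CNN output: 8 samples, height 1, width 10, 32 channels.
cnn_out = np.zeros((8, 1, 10, 32))

samples, height, width, channels = cnn_out.shape
# Treat each position along the width as one time step and fold the
# remaining axes into the feature dimension.
lstm_in = cnn_out.reshape(samples, width, height * channels)
```

The resulting tensor has shape [samples, timesteps, features], as the LSTM layer expects.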

LSTM Layer.
The forget gate is expressed as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f).

The input gate is expressed as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),

where W_f and W_i represent the weight matrices, and b_f and b_i the offsets, of the forget and input gates, respectively. The candidate cell state is calculated from the last output data and the current input data:

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C).

The current cell state (C_t) is then

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,

where the last cell state (C_{t−1}) is multiplied element-wise by the forget gate (f_t) and the candidate cell state (C̃_t) is multiplied element-wise by the input gate (i_t). The new cell state (C_t) is thus established from the current memory (C̃_t) and the long-term memory (C_{t−1}). On one hand, owing to the mechanism of the forget and input gates, the new cell state can store information from long ago or forget irrelevant content. On the other hand, the output gate controls the effect of the long-term memory on the current output:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o).

The final output of LSTM is decided by the output gate and the cell state (equation (29)):

h_t = o_t ⊙ tanh(C_t).
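A minimal numpy sketch of one LSTM step following these gate equations (the weight packing, initialization, and dimensions are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: forget gate f_t, input gate i_t, candidate cell
    state c_hat, output gate o_t. W maps the concatenated [h_prev, x_t]
    to the four gate pre-activations."""
    z = np.concatenate([h_prev, x_t]) @ W + b       # (4 * hidden,)
    hid = h_prev.size
    f_t = sigmoid(z[:hid])                          # forget gate
    i_t = sigmoid(z[hid:2 * hid])                   # input gate
    c_hat = np.tanh(z[2 * hid:3 * hid])             # candidate cell state
    o_t = sigmoid(z[3 * hid:])                      # output gate
    c_t = f_t * c_prev + i_t * c_hat                # new cell state
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
hid, feat = 4, 3                                    # toy sizes
W = rng.standard_normal((hid + feat, 4 * hid)) * 0.1
b = np.zeros(4 * hid)
h, c = np.zeros(hid), np.zeros(hid)
for t in range(5):                                  # run a short sequence
    h, c = lstm_step(rng.standard_normal(feat), h, c, W, b)
```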

Realization of CNN-LSTM.
The CNN-LSTM was realized in Keras with the TensorFlow backend, based on Figure 5 and the theory described in the previous sections (shown in Algorithm 1). After normalization, the training data (train_x, train_y) was fed into the constructed CNN model (1st ConV_model) to train the parameters with the loss function (loss_function, "mae" in our case) and the optimizer (optimizer, "adam" in our case). The feature map of the CNN was then extracted and reshaped to train the LSTM layer.
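A minimal Keras sketch consistent with this description (the window length, variable count, and fully connected layer widths are illustrative assumptions, not the authors' exact configuration; Conv1D over the time axis stands in here for the paper's 1 × 5 kernels):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout,
                                     Dense, LSTM)

timesteps, features = 10, 9   # hypothetical window length and variable count

model = Sequential([
    # Two convolution blocks extract features across the input variables.
    Conv1D(32, 5, activation="relu", padding="same",
           input_shape=(timesteps, features)),
    MaxPooling1D(2),
    Conv1D(32, 5, activation="relu", padding="same"),
    MaxPooling1D(2),
    Dropout(0.25),                  # empirical dropout rate from Section 5
    Dense(64, activation="relu"),   # FC1
    Dense(32, activation="relu"),   # FC2
    Dropout(0.25),
    LSTM(40),                       # 40 hidden nodes (see the LSTM section)
    Dense(1),                       # FC3: final NOx prediction
])
model.compile(loss="mae", optimizer="adam")   # loss and optimizer as stated
```

Training would then proceed with `model.fit(train_x, train_y, ...)` on the normalized data.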

Datasets.
Several key production factors that affect the nitrogen oxide concentration in the plant were selected from the 276 production factors of the catalytic cracking unit. According to expert consultation, the key production factors include the nitrogen content in raw materials, the process control parameters of the reactor (FCC reaction temperature, catalyst/oil ratio, and residence time), the regeneration process control parameters (regeneration mode, dense bed temperature, oxygen content in the furnace, and carbon monoxide concentration), and the catalyst species (platinum CO combustion catalyst and nonplatinum CO combustion catalyst).
A total of 2.592 × 10^5 samples collected over half a year were divided into training and validation sets in the proportion of 70% and 30%, respectively. As shown in Table 1, the key production factors were used as input data and the NOx emissions were used as labels.
In order to eliminate the dimensional effects among different variables, the original data was normalized using the MinMaxScaler function in Python (equations (25) and (26)):
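This scaling step can be sketched with scikit-learn's MinMaxScaler (the values below are toy readings for a single hypothetical variable):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw readings for one variable (e.g. a temperature channel).
raw = np.array([[650.0], [680.0], [700.0], [710.0]])

# X' = (X - Xmin) / (Xmax - Xmin), mapped onto the zoom range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)
restored = scaler.inverse_transform(scaled)   # undo scaling for reporting
```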

where X_max and X_min are the maximum and minimum values of the data, and max and min are the maximum and minimum values of the zoom range. In addition, the problem of time prediction was reconstructed as a supervised learning problem.
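The reconstruction into supervised learning can be sketched as a sliding window (a hedged illustration; the window length and column layout are assumptions, with the NOx label taken as the last column):

```python
import numpy as np

def to_supervised(series, window):
    """Recast a time-prediction problem as supervised learning: each
    sample is `window` consecutive rows of the process variables, and
    the label is the NOx value at the next time step."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window, :-1])   # all columns but the label
        y.append(series[i + window, -1])      # NOx at the next time step
    return np.array(X), np.array(y)

# Toy data: 100 time steps, 9 process variables plus 1 NOx column.
data = np.arange(100 * 10, dtype=float).reshape(100, 10)
X, y = to_supervised(data, window=5)
```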

CNN.
The hyperparameters in CNN mainly include weight initialization, learning rate, activation function, number of epochs, number of iterations, etc. Several important hyperparameters, namely, the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel, are discussed in this study.

LSTM.
Long Short-Term Memory (LSTM) is a kind of RNN in which tanh can be replaced by the sigmoid activation function, resulting in faster training. In LSTM, Adam was used as the optimizer, MSE was used as the loss function, and the identity activation function was used to complete the weight initialization. The hyperparameters in LSTM mainly comprise the number of hidden layer nodes and the batch size. The number of hidden layer nodes in LSTM has a direct influence on the learning results by affecting the nonlinear mapping ability, just as in feedforward neural networks. The batch size influences the computational cost and the learning accuracy by determining the amount of data used for each gradient update.

CNN-LSTM.
As a neural network model combining CNN with LSTM, CNN-LSTM has basically the same hyperparameters as CNN and LSTM, mainly including the learning rate η, the regularization parameter λ, the number of neurons in each hidden layer (such as the fully connected layers and the LSTM layer), the batch size, the convolution kernel size, the neuron activation function, the pooling layer size, and the dropout rate. All the related hyperparameters were investigated and analysed in Section 5.

Performance Criteria.
The performances of different algorithms were evaluated by the Root Mean Square Error (RMSE) (equation (27)) and the coefficient of determination (R2) (equation (28)) [36]. The RMSE value reflects the deviation between predicted and observed values:

RMSE = sqrt((1/N) Σ_{n=1}^{N} (o_n − p_n)^2),  (27)

where N is the data length, o_n is the nth observed value, and p_n is the nth predicted value. The R2 value reflects the accuracy of the model; it ranges from 0 to 1, with 1 denoting a perfect match:

R2 = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2,  (28)

where ŷ_i represents the predicted value, ȳ is the average of the observed values, and y_i is the observed value.
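Equations (27) and (28) translate directly into code (a small self-contained sketch with toy values):

```python
import numpy as np

def rmse(obs, pred):
    """Root Mean Square Error (equation (27))."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return np.sqrt(np.mean((obs - pred) ** 2))

def r2(obs, pred):
    """Coefficient of determination (equation (28))."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    ss_res = np.sum((obs - pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((obs - obs.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

obs = [10.0, 12.0, 14.0, 16.0]                  # toy observed values
pred = [11.0, 12.0, 13.0, 17.0]                 # toy predicted values
```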

CNN-LSTM.
The hyperparameters mentioned above were determined by the trial-and-error method. RMSE and R2 were considered as objective functions to optimize the size and number of convolution kernels, the batch size, the number of convolution layers, and the dropout probability. The results shown in Figure 6 indicate the process of optimizing the hyperparameters for the proposed method. The network structure adopts two convolutional parts as the CNN layer; the kernel size is 1 × 5 for both the first and the second CNN layers. Each convolution layer is followed by a Rectified Linear Unit (ReLU) layer (equation (29)) and a maximum pooling layer. The output of the CNN part is a 32-dimensional vector after these operations. All the vectors form a sequence that is fed into the LSTM layer:

ReLU(z_{i,j,k}) = max(z_{i,j,k}, 0),  (29)

where z_{i,j,k} is the input of the activation function at location (i, j) on the kth channel. ReLU allows neural networks to compute faster than the sigmoid or tanh activation functions and to train deep networks more effectively. In order to train a neural network with the strict backpropagation algorithm, the contribution of all samples to the gradient must be considered simultaneously.
With the incorporation of the LSTM network, the proposed CNN-LSTM network can be trained with time series data of the FCC unit. An LSTM layer followed by the FC layer is used to assign the predicted value to each frame in the sequence. The output of the CNN layer passes through two dropout layers and two FC layers to combine the features extracted by the CNN layer. During the training stage, the dropout layer randomly removes connections between the CNN layer and the FC layer in each iteration. In our experiments, we set the dropout rate to an empirical value of 0.25, which has shown effectiveness in performance improvement (the experiment on the dropout rate is shown in Figure 6(e)). The convolution layers, pooling layers, and activation function layers map the raw data to the feature space of the hidden layers, and the fully connected layer plays the role of a "classifier", mapping the learned feature representation to the label space of the sample.

CNN.
The hyperparameters of CNN were also determined by the trial-and-error method. RMSE and R2 were considered as objective functions to optimize the size and number of convolution kernels and the number of convolution layers. The results shown in Figure 7 indicate the process of optimizing the hyperparameters for CNN, from which one can conclude that the optimal values for the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel are 1, 16, and 1 × 5, respectively.

LSTM.
The optimization process of the hyperparameters for LSTM is shown in Figure 8. RMSE and R2 were used as quantitative performance criteria to evaluate the hyperparameters (i.e., the number of hidden layer nodes and the batch size). The process and results shown in Figure 8 indicate that the optimal values for the number of hidden layer nodes and the batch size are 40 and 500, respectively.

Experiments on Different Methods.
The accuracies of CNN, LSTM, and the proposed CNN-LSTM method in the training and validation stages were evaluated by R2 and RMSE. All methods were well tuned, and ten test runs were conducted to eliminate the random errors of each method. The average criteria for each method were calculated to evaluate the performance. The results are presented in Tables 2 and 3 and Figure 9, respectively. Compared with the traditional CNN and LSTM baselines, the proposed CNN-LSTM achieved better efficiency and accuracy.

Conclusions
In this paper, a novel CNN-LSTM scheme combining CNN and LSTM was proposed for the prediction of NOx concentration observed during the FCC process. Dropout was introduced to accelerate network training and address the overfitting issue. In our study, a series of hyperparameters (learning rate, regularization parameter, the number of neurons in each hidden layer, mini-batch size, convolution kernel size, neuron activation function, pooling layer size, and dropout rate) and conditions (raw materials, process control parameters of the reactor, regeneration process control parameters, and catalyst species) were selected and optimized. Experiments were conducted to evaluate the proposed scheme, with traditional methods (CNN and LSTM) as baseline models. The hyperparameters of all the methods were optimized to obtain the best results. RMSE and R2 were used to evaluate the performance of the different methods. Owing to its capability of extracting features among different data sequences and different time steps, CNN-LSTM achieved better efficiency and accuracy than the baseline models. This study suggests a promising direction for deep learning methods that integrate different architectures for their individual advantages. The CNN-LSTM scheme proposed in this paper would be a beneficial contribution to the accurate and stable prediction of irregular trends for NOx emission from the refining industry, providing more reliable information for NOx risk assessment and management. Future work will focus on attention and transformer mechanisms to obtain better results and explore the application of the proposed scheme on other datasets.

Data Availability
All data and program files included in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.