Support Vector Machine-based Soft Sensors in the Isomerisation Process +

This paper presents the development of soft sensor empirical models using the support vector machine (SVM) method for the continual assessment of 2,3-dimethylbutane and 2-methylpentane mole percentage as important product quality indicators in the refinery isomerisation process. During the model development, critical steps were taken, including selection and pre-processing of the industrial process data, which are broadly discussed in this paper. The SVM model results were compared with a dynamic linear output error (OE) model and a nonlinear Hammerstein-Wiener (HW) model. Evaluation of the developed models on independent data sets showed their reliability in the assessment of the component contents. The soft sensors are to be embedded into the process control system, and serve primarily as a replacement during the process analysers' failure and service periods.


Introduction
Process analysers, used for measurement of key process variables, are often weak links in refinery plants. Their long analysis time, tendency to fail, and high price usually make them impractical and unprofitable. Soft sensors that enable real-time prediction of key product properties offer an alternative to process analysers.
Soft sensors are rarely built on first-principles models; more often they are data-driven mathematical models, and they can describe the dynamics of complex industrial processes well 1 .
This paper presents data-driven soft sensors which have common steps in the development procedure: selection of real process data from plant history database, data pre-processing, determination of a model structure and regressors, model estimation and validation 2 .
The support vector machine is a popular method for soft sensor model development presented by Vapnik 3 as part of a general learning theory. The method has attractive features, such as the ability to learn well with only a very small number of free parameters, robustness, and computational efficiency compared to several other methods 4 .
The method is widely used for nonlinear system identification required in the process industry.
The application of SVM has been described in many published papers over the past few years. Meng et al. 5 developed a data-driven soft sensor based on twin support vector regression for cane sugar crystallisation. Ibrahim et al. 6 used SVM and surrogate column models for a novel optimisation-based design of crude oil distillation units. Lv et al. 7 proposed an SVM-based model for puerarin extraction. Shokri et al. 8 developed an SVM model for the prediction of the content of hydrogen sulphide in the hydrotreatment (HDT) refinery process. Xu et al. 9 used the least squares support vector machine (LS-SVM) for gas flow measurements, while Cheng and Liu 10 used LS-SVM to propose an online soft sensor for product quality monitoring in the propylene polymerisation process. Some earlier works should also be mentioned, such as the paper by Yan et al. 11 , where SVM was introduced during soft sensor modelling for light gas oil freezing point assessment in the distillation process, and the paper by Li et al. 12 , who developed a model for kerosene dry point assessment based on LS-SVM.
The research and application of soft sensors on an isomerisation process are still rare. Lukec et al. 13 proposed application of a software analyser for online estimation of isopentane content in the deisopentaniser column top product, in the feed treatment section for the isomerisation process, while Xianghua et al. 14 presented the model for para-xylene content estimation at an isomerisation unit reactor outlet.
In our research, development of soft sensor empirical models based on SVM for the continual assessment of mole percentage of 2,3-dimethylbutane and 2-methylpentane in the product streams of the refinery isomerisation process is presented. These components directly affect the octane number of the isomerate, the product of the isomerisation process. The model development procedure and the model results are presented and discussed.

+ This paper was presented at the Meeting of Young Chemical Engineers 2020 at the Faculty of Chemical Engineering and Technology, University of Zagreb.
*Corresponding author: Srečko Herceg: hsrecko@gmail.com; sherceg@fkit.hr

Material and methods
In this section, the SVM, dynamic output error, and Hammerstein-Wiener model structures are briefly explained. A separate part is dedicated to the description of the refinery isomerisation process, while soft sensor model development is explained in detail.

SVM model structure
The basic idea of support vector machine can be expressed as shown in Fig. 1. Input space objects, separated with a complex non-linear curve, are mapped (rearranged) into a so-called feature space, where the objects are linearly separable, i.e., an optimal separable curve can be found 15 .
Mathematically, this can be represented by Eq. (1), where a linear regression function in the feature space is the solution to the nonlinear regression problem:

$$f(\mathbf{x}) = (\mathbf{w} \cdot \Phi(\mathbf{x})) + b \qquad (1)$$

where x is the input vector, w is the weight (load) vector, b is a constant, Φ(x) is a "feature" function, and (w · Φ(x)) is the scalar product in the "feature" space. To obtain a model, the optimisation problem of the so-called structural risk minimisation principle must be solved. Employing the commonly used ε-insensitive cost function and introducing an adjusting constant C, the optimisation problem derived from Eq. (1) becomes:

$$\min_{\mathbf{w},\, b,\, \xi,\, \xi^*} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} (\xi_i + \xi_i^*) \qquad (2)$$

with the following constraints:

$$y_i - (\mathbf{w} \cdot \Phi(\mathbf{x}_i)) - b \le \varepsilon + \xi_i, \qquad (\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\, \xi_i^* \ge 0$$

where the adjusting constant, C, is a "penalty" factor that sets the trade-off between the forecast errors and the model complexity $\|\mathbf{w}\|^2$, while ε is the parameter of the ε-insensitive cost function and represents the radius of the tube located around the function f(x) (Fig. 2) 15 .
Deviation outside the [-ε, ε] region denotes the forecast error, represented by the slack variables $\xi_i$ and $\xi_i^*$. The points on the surface of and outside the ε-tube are called support vectors (SV). The percentage of SVs affects the model accuracy: as the percentage of SVs decreases, a more flattened model is obtained, and vice versa 15 .
The solution of the optimisation problem expressed in Eq. (2) is given by the equation:

$$f(\mathbf{x}) = \sum_{i} (\alpha_i - \alpha_i^*)\, K(\mathbf{x}, \mathbf{x}_i) + b$$

where K(x, x_i) is a kernel function, and α, α* are Lagrange multipliers. The radial basis function (RBF) is the most used kernel function, and is defined as:

$$K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2\right)$$

where γ is the free parameter of the RBF. The kernel function "avoids" cumbersome mathematical operations that take up a lot of computational time 15 .
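The prediction step of a trained SVM regression model can be illustrated with a minimal Python sketch. It evaluates the RBF kernel, the kernel expansion of f(x), and the ε-insensitive cost. The support vectors, coefficients, and the constant b are hypothetical toy values, not parameters of the models developed in this paper; only γ = 0.167 is taken from the default value quoted in the model estimation section.

```python
import math

def rbf_kernel(x, xi, gamma):
    """RBF kernel K(x, xi) = exp(-gamma * ||x - xi||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x, x_i) + b.

    dual_coefs[i] holds the difference (alpha_i - alpha_i*).
    """
    return sum(c * rbf_kernel(x, sv, gamma)
               for c, sv in zip(dual_coefs, support_vectors)) + b

def eps_insensitive_loss(y, y_hat, eps):
    """Zero inside the eps-tube, linear in the deviation outside it."""
    return max(0.0, abs(y - y_hat) - eps)

# Hypothetical toy model (assumption, for illustration only).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
b = 2.0
gamma = 0.167

y_hat = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b, gamma)
print(round(y_hat, 4))
print(eps_insensitive_loss(1.0, 1.05, 0.1), eps_insensitive_loss(1.0, 1.5, 0.1))
```

Only the support vectors enter the sum, which is why the percentage of SVs determines both the cost of a prediction and the complexity of the resulting model.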
The linear output error (OE) model predicts the output as

$$\hat{y}(t) = \frac{B(q)}{F(q)}\, u(t - nk)$$

where $B(q) = B_1 + B_2 q^{-1} + \dots + B_{nb} q^{-nb+1}$ is a polynomial matrix in $q^{-1}$ of dimensions n(y) × n(u), nb is the number of past process inputs, and nk is the input time delay expressed by the number of samples. $F(q) = 1 + F_1 q^{-1} + F_2 q^{-2} + \dots + F_{nf} q^{-nf}$ is a polynomial matrix in $q^{-1}$ of dimensions n(ŷ) × n(ŷ), and nf is the number of past outputs predicted by the model.
The most complex nonlinear dynamic model is the Hammerstein-Wiener (HW) model. It has a block structure described by three functions: w(t) = f(u(t)) is a nonlinear function transforming the input data u(t), x(t) = (B / F) w(t) is a linear transfer function where B and F are polynomials of the OE model, and ŷ(t) = h(x(t)) is a nonlinear function mapping the output data x(t) from the linear block to the model output. The nonlinear functions can be represented with many nonlinear units, such as wavelet, sigmoid, piecewise-linear, and others.
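As an illustration of the two dynamic structures, the following Python sketch simulates a single-input, single-output OE model as a difference equation and wraps it with static input and output nonlinearities to form an HW model. The coefficients, the delay, and the saturation nonlinearity are hypothetical choices made for the example; they are not the models identified in this work.

```python
def simulate_oe(u, b, f, nk):
    """Simulate y_hat(t) = B(q)/F(q) u(t - nk) for a SISO model.

    b = [B1, ..., Bnb]; f = [F1, ..., Fnf] (the leading 1 of F(q) is implied).
    Samples before t = 0 are taken as zero.
    """
    y_hat = []
    for t in range(len(u)):
        acc = 0.0
        for j, bj in enumerate(b):            # B(q) u(t - nk)
            idx = t - nk - j
            if idx >= 0:
                acc += bj * u[idx]
        for k, fk in enumerate(f, start=1):   # -(F(q) - 1) y_hat(t)
            if t - k >= 0:
                acc -= fk * y_hat[t - k]
        y_hat.append(acc)
    return y_hat

def simulate_hw(u, b, f, nk, f_in, h_out):
    """Hammerstein-Wiener: static f_in, linear OE block, static h_out."""
    w = [f_in(v) for v in u]          # w(t) = f(u(t))
    x = simulate_oe(w, b, f, nk)      # x(t) = (B / F) w(t)
    return [h_out(v) for v in x]      # y_hat(t) = h(x(t))

# Hypothetical first-order example: unit step input, one sample of delay.
u = [1.0] * 5
y_oe = simulate_oe(u, b=[0.5], f=[-0.5], nk=1)
y_hw = simulate_hw(u, [0.5], [-0.5], 1,
                   f_in=lambda v: v,              # identity input block
                   h_out=lambda v: min(v, 0.8))   # piecewise-linear saturation
print(y_oe)
print(y_hw)
```

Because the OE block feeds back its own past predictions rather than past measurements, it can run during periods when the analyser is unavailable.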

Process description
The goal of the refinery isomerisation process is to upgrade the octane number of light straight-run naphtha by processing paraffins (mainly pentane and hexane) together with hydrogen over a low-temperature, noble-metal, fixed-bed catalyst, the type mainly used today. In more detail, the feed paraffins are converted to high-octane iso-structures: normal pentane (nC5) to isopentane (iC5), and normal hexane (nC6) to 2,2- and 2,3-dimethylbutane. The process conditions, characterised by medium operating pressure, low temperature, and low hydrogen partial pressure, promote isomerisation and reduce unfavourable reactions (e.g., hydrocracking) 17 .
According to the process flow diagram (Fig. 3), the dried feed is mixed with make-up hydrogen, and heated before entering the reactor section. After passing the reactors, the isomerised product is stabilised in the stabiliser column. The liquid from the stabiliser bottom passes to gasoline blending, while the stabiliser overhead vapour product is caustic scrubbed to remove the HCl formed from the organic chloride added to the reactor feed to maintain catalyst activity, and then flows to the fuel gas system 17 .

Fig. 3 - Straight-through isomerisation process 17
A straight-through isomerization process can be improved by separating the stabilizer bottoms into normal and isoparaffin components by adding a deisohexanizer column (DIH) (Fig. 4).
The DIH column sidecut stream concentrates non-converted n-paraffins and newly-formed low-octane methylpentanes, and returns them to the reactor section. The isomerate is then drawn from the column top, while the heptane fraction is drawn from the column bottom 17 . Fig. 5 depicts the process flow diagram of the observed plant deisohexanizer section with the deisohexanizer column as its part. All process and measuring equipment, as well as control loops, are shown.

Soft sensor model development
2,3-DMB and 2-MP are the key components of the isomerate, the product of the refinery isomerisation process improved by adding a DIH distillation column. The components affect the octane number of the product. The mole fractions of high-octane 2,3-DMB and low-octane 2-MP are measured on-line by process analysers in the DIH sidecut stream and in the DIH overhead, respectively. The aim is to keep 2,3-DMB in the column top and 2-MP in the column side stream, and thus to regulate their molar percentage 18 . The components are also analysed by laboratory assays once a day.
As the process analysers quite often become unavailable due to failures and have long time delays, it was decided to develop soft sensors that would find their applications primarily as the analysers' replacement.
Soft sensor development follows a common procedure:
- potentially influential variable selection,
- data collection and pre-processing,
- preliminary research,
- model structure and regressor selection, model estimation, and validation,
- model implementation.

Potentially influential variable selection
Based on process studies and consultations with process experts, the potentially influential variables for 2,3-DMB and 2-MP SVM soft sensor model development were selected as shown in Table 1.

Data collection and pre-processing
Experimental data for model development were acquired from the plant history database, which contains up to a few years of historical data recorded every minute.
Data were collected for eight potentially influential variables (Table 1), and for 2,3-DMB and 2-MP content. Special attention was paid to selecting data periods covering significant process dynamics. The data were collected over several periods, lasting from 2-3 weeks to about 2 years, and then pre-processed, which included sample period determination, detection of outliers, and interpolation of missing data.
Based on the dynamics of the influential variables (further referred to as input variables), a sample period of 3 minutes was judged suitable. The sample period of the process analyser data (further referred to as output variables) was 30 minutes. The missing data were interpolated using cubic spline interpolation.
Additional data can be generated using cubic spline or multivariate adaptive regression spline (MARSpline) methods. MARSpline is normally used where there is more than one input variable, as well as when the output variable sample time is long 19 . In contrast, data interpolated using cubic splines have no physical basis, i.e., the relationship between input and output variables is not taken into account. However, because refinery processes are quite inertial, this method can be considered sufficiently reliable 20 .
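The resampling of 30-minute analyser data onto the 3-minute grid of the input variables can be sketched in plain Python with a natural cubic spline. The natural boundary condition (zero second derivative at both ends) and the measurement values are assumptions made for this illustration; the interpolation routine actually used in the study may differ in both respects.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3 * (ys[i + 1] - ys[i]) / h[i]
                    - 3 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution for the coefficients of each cubic piece.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

# Hypothetical analyser readings (mol %) every 30 minutes.
t_meas = [0.0, 30.0, 60.0, 90.0]
y_meas = [12.1, 12.4, 12.0, 12.3]
spline = natural_cubic_spline(t_meas, y_meas)

# Resample onto the 3-minute grid used for the input variables.
t_grid = [3.0 * k for k in range(31)]
y_3min = [spline(t) for t in t_grid]
print(len(y_3min), round(y_3min[5], 3))
```

The spline passes exactly through the measured points, so the interpolated series only adds values between analyser readings and never alters the readings themselves.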
The common methods for detection of outliers are 3-sigma 21 , as well as principal component analysis (PCA) and partial least squares (PLS) methods 22 , which perform only statistical inspection of data and tend to remove peak values that can contain useful information about process dynamics. Therefore, outlier detection cannot be fully automated, and the data should always be checked visually. All collected data were visually checked, and the number of outliers detected was negligible.
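A minimal sketch of the 3-sigma rule is given below. In line with the remark above, it only flags candidate samples for visual inspection and does not delete anything. The temperature values are hypothetical.

```python
import statistics

def three_sigma_flags(values):
    """Flag samples lying more than three standard deviations from the mean."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [abs(v - mean) > 3 * sigma for v in values]

# Hypothetical temperature record with one suspicious spike.
temperature = [10.0] * 20 + [50.0]
flags = three_sigma_flags(temperature)
candidates = [i for i, flagged in enumerate(flags) if flagged]
print(candidates)  # indices to be checked visually before any removal
```

A flagged peak may reflect a genuine process disturbance, so the decision to discard it remains with the engineer.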

Preliminary research
Since SVM models are static, it is very important to determine the output variable time delays regarding the change in the value of each input variable.
The output variable time delays were determined for 2,3-DMB and 2-MP content, respectively. For this purpose, an experiment on the observed plant was performed.
In the case of 2,3-DMB, it was observed how a change in one variable caused changes in the others. As may be seen in Fig. 5, the V17 reflux flow rate (FIR-028) was reduced slightly, and the times were measured until a change appeared in the V17 overhead vapour temperature (TIR-046), in the V17 21st tray temperature (TIRC-047), and finally in the 2,3-DMB content of the V17 side product (determined by the AIR-004B chromatograph). A relatively quick response was noticed. After 2 minutes, the column top temperature (TIR-046) and the temperature on the 21st tray (TIRC-047) started to rise. After 4 minutes, the 2,3-DMB content in the side product responded with a slight increase. Accordingly, the time delay of the 2,3-DMB content with respect to a change in the FIR-028 and FIRC-029 input variables was 4 minutes. Consequently, it was concluded that the time delay for the TIR-046 and TIRC-047 input variables was 2 minutes. Since the TIR-045 input variable was installed upstream of the AIR-004B chromatographic analyser, with only the P10A/B pump and the EA03 air cooler between them, the time delay was short (0-2 minutes). Therefore, time delays were determined directly for the following variables: TIR-046, TIRC-047, TIR-045, FIR-028 and FIRC-029.
In the case of 2-MP, the 2-MP content (determined by the AIR-005B chromatograph) did not change during the experiment. In order not to disturb the process regime, the experiment could not be continued and had to be stopped. Therefore, the output variable time delays in this case were determined by calculation. The main issue was to determine the time until the composition in the V12 separator changed. Due to the relatively large separator volume (about 107 m3), the change in composition took considerable time. The time until the composition in the V12 separator changed was calculated with the simple hydraulic calculation schematically represented in Fig. 6 23 , using data on the mass flow rate of the V17 top product vapours, density, pipe diameters and lengths between the V17 column, the V12 separator and the AIR-005B chromatograph, as well as the V12 volume. Taking into account the data obtained by the experiment, the 2-MP content time delays were determined for those variables. For the TIR-049, FIRC-020, and FIRC-026 variables, the calculation procedure was the same for both the 2,3-DMB and 2-MP content. However, due to the complex construction of the V17 column, the obtained results can be considered only rough estimates.
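The kind of hydraulic estimate described above can be sketched as a transit time through the piping plus a residence time in the separator. The sketch assumes plug flow, completely filled pipes and vessel, and a single constant density; the flow rate, density, and pipe dimensions are hypothetical, and only the V12 volume of about 107 m3 comes from the text. The calculation in Fig. 6 may be organised differently.

```python
import math

def pipe_transit_time_min(length_m, diameter_m, mass_flow_kg_h, density_kg_m3):
    """Plug-flow transit time through a pipe, in minutes."""
    vol_flow_m3_h = mass_flow_kg_h / density_kg_m3
    pipe_volume_m3 = math.pi * diameter_m ** 2 / 4 * length_m
    return pipe_volume_m3 / vol_flow_m3_h * 60

def vessel_residence_time_min(volume_m3, mass_flow_kg_h, density_kg_m3):
    """Mean residence time in a fully filled vessel, in minutes."""
    return volume_m3 / (mass_flow_kg_h / density_kg_m3) * 60

# Hypothetical operating data (assumptions); V12 volume from the text.
mass_flow = 64200.0   # kg/h
density = 600.0       # kg/m3
v12_volume = 107.0    # m3

delay_min = (pipe_transit_time_min(50.0, 0.2, mass_flow, density)
             + vessel_residence_time_min(v12_volume, mass_flow, density))
delay_steps = round(delay_min / 3)   # one step is 3 minutes
print(round(delay_min, 1), delay_steps)
```

With these illustrative numbers the separator dominates the delay, which is consistent with the observation that the 2-MP analyser responds much more slowly than the 2,3-DMB analyser.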
The output variable time delay results are displayed and additionally discussed in the next section "Results and discussion".
After the delays had been determined, a correlation analysis was performed on the selected potential input variables, yielding Pearson linear correlation coefficients 24 between the input and output variables. The analysis was performed on two independent data periods (during 2015 and 2016, respectively). The results are displayed and discussed in the next section.
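The Pearson linear correlation coefficient used in this screening step can be computed as in the sketch below. The two short series are hypothetical and only show the calculation for one input-output pair.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical input (tray temperature) and output (component content) samples.
tray_temperature = [80.1, 80.4, 80.9, 81.3, 81.8]
content = [12.0, 12.2, 12.5, 12.6, 13.0]
r = pearson_r(tray_temperature, content)
print(round(r, 3))
```

Since the coefficient captures only linear dependence, a low value suggests, but does not prove, that a variable can be dropped from a nonlinear model.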
Among the data periods examined, some were found in which the observed plant operated in a reduced product regime. However, since such periods are still rare, these data were not considered for the soft sensor development.
Descriptive statistics of the input and output variables were computed for the selected data periods, as presented in the "Results and discussion" section.

Model structure and regressor selection, model estimation and validation
SVM models were developed using TIBCO Statistica software. Selected data for model development were divided randomly: 75 % for training data, and 25 % for model validation. Randomly divided data enabled the selection of data of greater diversity for training, and consequently, better model results.
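A random 75 % / 25 % division of this kind can be sketched as follows. The seed, function name, and row representation are illustrative assumptions; the partitioning performed inside the TIBCO Statistica software is not reproduced here.

```python
import random

def random_split(rows, train_fraction=0.75, seed=0):
    """Randomly divide rows into training and validation subsets."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    valid = [rows[i] for i in indices[n_train:]]
    return train, valid

# Hypothetical data set of 100 rows, each row standing for one sample.
rows = list(range(100))
train, valid = random_split(rows)
print(len(train), len(valid))
```

Random division is possible here because the time delays are already built into each row, so the rows can be treated as independent static samples.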
The obtained output variable time delays were incorporated into the software by shifting each "Excel-like" data column representing an input variable backwards relative to the output variable data column by the number of steps equal to its time delay (one step is 3 minutes). The delay steps are presented in Tables 2 and 3.
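The column shifting described above can be sketched as follows: each output sample y(t) is paired with the input values recorded delay steps earlier. The tag names are taken from the text, but the delay steps and the data values are hypothetical; the actual steps are those listed in Tables 2 and 3.

```python
def align_with_delays(inputs, output, delay_steps):
    """Pair each output sample y(t) with the inputs u(t - delay).

    inputs: dict mapping variable name to a list of samples
    delay_steps: dict mapping variable name to its delay in sample steps
    """
    max_delay = max(delay_steps.values())
    rows = []
    for t in range(max_delay, len(output)):
        row = {name: series[t - delay_steps[name]]
               for name, series in inputs.items()}
        row["y"] = output[t]
        rows.append(row)
    return rows

# Hypothetical 3-minute samples and delay steps.
inputs = {
    "FIR-028": [0, 1, 2, 3, 4, 5],
    "TIR-046": [10, 11, 12, 13, 14, 15],
}
output = [100, 101, 102, 103, 104, 105]
delay_steps = {"FIR-028": 2, "TIR-046": 1}

rows = align_with_delays(inputs, output, delay_steps)
print(rows[0])
```

The first rows, for which some delayed input does not exist, are dropped, so the aligned data set is shorter than the raw one by the largest delay.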
The free model coefficients, C and ε, were optimised by a grid search algorithm with a defined search interval for each coefficient. The coefficient of the radial basis kernel function, γ, was initially set at the default value of 0.167, and then adjusted by trial and error. The procedure was repeated with the number of iterations set in the range from 1 000 to 1 000 000. The calculation was stopped, and the SVM was considered sufficiently trained, when the training error reached the value of 0.001 mol % 15 .
The developed models were validated based on FIT values, final prediction error (FPE), root mean square error (RMSE), and mean absolute error (MAE).
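The four validation criteria can be computed as in the sketch below. RMSE and MAE follow their standard definitions. The FIT and FPE expressions are assumptions: FIT is taken as the normalised root mean square measure common in system identification software, and FPE as Akaike's final prediction error with n_params free coefficients. The exact definitions used to produce the reported tables may differ. The data are hypothetical.

```python
import math

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def fit_percent(y, y_hat):
    """Assumed definition: 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    mean_y = sum(y) / len(y)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)))
    den = math.sqrt(sum((a - mean_y) ** 2 for a in y))
    return 100 * (1 - num / den)

def fpe(y, y_hat, n_params):
    """Assumed Akaike form: V * (1 + d/N) / (1 - d/N), V = mean squared error."""
    n = len(y)
    v = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    return v * (1 + n_params / n) / (1 - n_params / n)

# Hypothetical measured and predicted contents (mol %).
y_meas = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_meas, y_pred), mae(y_meas, y_pred))
print(round(fit_percent(y_meas, y_pred), 2), fpe(y_meas, y_pred, 2))
```

FPE penalises the number of free coefficients, which is relevant when an SVM model with three coefficients is compared with polynomial models that have dozens.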

Model implementation
Implementation of the models in the refinery isomerisation process is underway. The goal is to implement the developed models in the advanced process control module of the plant DCS. New variable labels for the 2,3-DMB and 2-MP content predicted by the models will be created and stored in the process history database.

Results and discussion
In this section, the results obtained from data collection and pre-processing, preliminary research -including calculation of input variables time delays, performance of the correlation analysis and descriptive statistics, as well as the developed 2,3-DMB and 2-MP content SVM model results compared with dynamic linear OE and nonlinear HW model, are presented and discussed.

Interpolation of missing data
Fig. 7 shows a part of the cubic spline interpolated output data compared with the real measurements (stepped curve). Very good interpolation of missing data can be observed.

Determining time delays
From Table 2 it can be seen that the time delays are relatively small. The AIRC-004A chromatographic analyser on the V17 sidecut product line was installed with no upstream vessel accumulating liquid, e.g., a separator (Fig. 5), which would otherwise slow down the mass transfer.
In the case of 2-MP content, the time delays are larger than in the case of 2,3-DMB. This is mainly due to the aforementioned V12 separator, i.e., the AIR-005B chromatograph is installed downstream of it.

Correlation analysis
Tables 4 and 5 show Pearson's linear correlation coefficients between the input and output variables within the 2015 and 2016 data periods, respectively. Observing the tables, it can be concluded that most of the inputs had a significant impact on the outputs. However, from Table 4 it can be seen that, for all four outputs, the potential input variable FIRC-020 (V17 side product flow) had low correlations, and could be excluded as a model input. Also, for the data in Table 5, the potential inputs FIRC-020 and FIRC-026 (V17 bottom product flow) could be excluded.
The time delays were not taken into account.

Final input variables and model determination periods
Earlier research 18 on dynamic polynomial models has shown that much better results were obtained for the 2,3-DMB content models during the 2015 data period, while in the case of 2-MP, the better results were obtained during 2016. To allow direct comparison of the SVM model results with the results of the dynamic polynomial models, the development of the proposed SVM models was based on the same corresponding periods. In the case of the 2,3-DMB content SVM model, the range was from November 27 to December 11, 2015, comprising 6 667 measured data, while in the case of the 2-MP model, the range was from January 1 to January 21, 2016, comprising 10 078 measured data, for each of the input variables and the outputs. The final input variables for the development of the SVM models, selected on the basis of the correlation coefficient results given in Tables 4 and 5, are given in Tables 6 and 7.

Descriptive statistics
Tables 8 and 9 show the descriptive statistics of the input and output variables within the 2015 and 2016 data periods, respectively. Figs. 8 and 9 show the comparison between the on-line chromatographic analyser data (2,3-DMB, 2-MP contents) and the laboratory assays within the 2015 and 2016 data periods, respectively.
From the plots, a good correlation between measured and laboratory data can be observed, which confirms the accuracy of the on-line chromatographic analysers, i.e., the validity of the selected data periods.

Model evaluation results
Based on statistical criteria, Tables 10 and 11 show the validation of the SVM, OE, and HW models on the validation data set for estimating the content of 2,3-DMB and 2-MP in the isomerisation process DIH column side and top product, respectively.
As may be seen from Table 10, the 2,3-DMB content SVM model shows superior results compared to both the dynamic polynomial OE and HW models, with only 3 free model coefficients (C, ε and γ). Table 11 shows that, for the 2-MP content, the dynamic models perform better, but they require dozens of free model coefficients.
The FIT values, as well as the values of FPE, RMSE, and MAE of the SVM models, indicate a satisfactory assessment, meaning that the process dynamics were well described.
With only three free coefficients, the SVM models are simpler to implement than the dynamic polynomial linear or nonlinear models, which also contributes to their robustness.
Figs. 10 and 11 depict the comparison between the measured data and the 2,3-DMB and 2-MP content model data, respectively. Very good correspondence between the measured data and the model outputs may be observed. Table 12 shows the comparison between the 2,3-DMB and 2-MP content SVM models. The 2-MP content SVM model is somewhat more complex than the 2,3-DMB model: a larger number of SVs was required to achieve an accurate model.
Overall, the obtained results confirm that the SVM method is suitable for nonlinear system identification in chemical plants.

Conclusion
Soft sensor models based on SVM for continual assessment of the mole percentage of 2,3-DMB and 2-MP, as the key components in products of the refinery isomerisation process, improved by adding a deisohexanizer, were developed. The models describe the process dynamics very well, and are therefore suitable for implementation within the isomerisation process plant distributed control system (DCS). Due to its robustness, it is expected that the method will be an alternative to expensive process analysers. The development and application of soft sensors are desirable and often necessary in modern process industry. The requirements for continuous improvements, especially in product quality and minimisation of energy consumption, urge increased application of soft sensors.

nC6 - normal hexane
PCA - principal component analysis
PLS - partial least squares
q, q^-1, q^(-nb+1), q^(-nf) - time shift operator
RBF - radial basis function
RMSE - root mean square error
SV - support vectors
SVM - support vector machine
t - time
u(t) - input data function
w(t) - nonlinear function
w - load vector
x(t) - linear transfer function
x, x_i (bold) - vector of input data
x, x_i - input data value
y, y_i - output data value
ŷ(t) - model output function
α, α* - Lagrange multipliers
γ - parameter of radial basis function
ε - parameter of ε-insensitive cost function
ξ, ξ*, ξ_i, ξ_i* - slack variables
Φ(x) - feature function