Determination of oil pollutants by three-dimensional fluorescence spectroscopy combined with improved pattern recognition algorithm

Petroleum refineries are one of the main sources of hazardous air pollutants, so the accurate determination of petroleum pollutants is of great significance to maintain ecological balance. In this study, three-dimensional (3D) fluorescence spectroscopy combined with pattern recognition algorithm is adopted to distinguish the composition and content of oil pollutants efficiently and accurately. Three hundred samples of kerosene, diesel, and gasoline mixed solutions with different concentrations are prepared. The principal component analysis is used to extract the optimal feature variables, and the correlation coefficient method is used to obtain eight groups of principal component features in the spectra. The dimension is selected as 8, and the principal component score is calculated, which is used as the input data of the extension neural network. Next, the pattern recognition method is improved, and the designed neural network has functions of both resolution and measurement. The results of neural network pattern recognition are used as the input of the concentration network. The first 270 samples are used as the training samples to train the network model, and the remaining 30 samples are used as test samples, which are applied to the input layer of the trained neural network. The relative fluorescence intensity, relative slope, and comprehensive background parameters are used as the input parameters, and the extension neural network is used for pattern recognition and evaluation of oil pollutants. The experimental results show that the average recognition rate of the improved pattern recognition algorithm for oil pollutants is 98.43%, and the average recovery rate of concentration is 98.67%. Further, the average time for pattern recognition is 1.53 s, while the parallel factor analysis algorithm takes 2.89 s. 
This suggests that the improved extension neural network is an effective and reliable pattern recognition method for the identification of mixed oil pollutants.


1 School of Electrical and Control Engineering, Xuzhou University of Technology, Xuzhou, China
2 College of Electrical Engineering, North China University of Science and Technology, Tangshan, China

Introduction
Oil pollution causes significant harm to human health, the environment, and local ecosystems. The main sources of oil pollutants are wastewater discharged from factories and offshore oil leakage, which cause severe water pollution. [1][2][3] Therefore, the accurate and efficient detection of the composition and content of oil pollutants is of great significance for maintaining the ecological balance. Fluorescence spectroscopy is a sensitive technique with high measurement accuracy, and it usually requires only a low sample concentration. It has been extensively applied for the detection of fluorescent substances such as oil pollutants, pesticides, and food. [4][5][6][7] The aromatic components in pollutants such as gasoline and diesel have strong fluorescence characteristics under ultraviolet light excitation. [8][9][10][11] The three-dimensional (3D) fluorescence spectrum reflects the simultaneous change in fluorescence intensity with the excitation and emission wavelengths, and it essentially represents the continuous distribution of energy in a two-dimensional (2D) region. [12][13][14] Compared with traditional 2D fluorescence spectroscopy, 3D fluorescence spectroscopy provides more comprehensive information and has strong recognition ability, which makes it suitable for the detection of multi-component fluorescent substances. [15][16][17][18] Traditional pattern recognition methods have poor detection accuracy and low speed. In recent years, several scholars have extracted the characteristic parameters of the fluorescence spectrum and applied statistical indicators such as the origin moment and kurtosis coefficient for spectral analysis, [19][20][21] but these indicators reflect only the overall characteristics of the 3D fluorescence spectrum.
Wang et al. [22] used the back propagation (BP) neural network combined with alternating trilinear decomposition (ATLD) and 3D fluorescence spectroscopy to examine the composition and content of fluorescent substances. The results showed that the BP neural network has a good data compression effect, and this method can be extended to the qualitative and quantitative analysis and rapid detection of trace polycyclic aromatic hydrocarbons (PAHs) in water. Azcarate et al. [23] used front-face fluorescence spectroscopy for the nondestructive evaluation of mayonnaise samples stored at two different temperatures. The results confirmed that the excitation-emission matrices (EEMs) in combination with N-way partial least squares discriminant analysis (NPLS-DA) provide information related to the fluorescent molecular structure of mayonnaise, facilitating the classification of samples as a function of the storage time. Wu et al. [24] applied total synchronous fluorescence (TSyF) spectroscopy and convolutional neural networks (CNNs) to identify and quantify counterfeit vegetable oils. The results confirmed the feasibility of this method for vegetable oil identification. Catena et al. [25] used the second-order calibration of EEMs and parallel factor analysis (PARAFAC) decomposition as an analytical approach for the detection of PAHs in a food matrix (smoked tuna). The experimental results showed that the PAHs were clearly identified and quantified with a decision limit (CCα) and capability of detection (CCβ) equal to 0.11 and 0.21 mg/L, respectively, for benzo[a]pyrene (BaP). Lenhardt et al. [26] used fluorescence spectroscopy coupled with PARAFAC and partial least squares discriminant analysis (PLS-DA) for the characterization and classification of honey. The number of fluorophores present in honey, the excitation and emission spectra of each fluorophore, and their relative concentrations were determined using a six-component PARAFAC model. The PLS-DA classification model, constructed from the PARAFAC model scores, detected fake honey samples with 100% sensitivity and specificity. The honey samples were also classified using PLS-DA with errors of 0.5% for linden, 10% for acacia, and nearly 20% for both sunflower and meadow mix.
In this study, three-dimensional (3D) fluorescence spectroscopy combined with a pattern recognition algorithm is adopted to distinguish the composition and content of oil pollutants efficiently and accurately. Three hundred samples of kerosene, diesel, and gasoline mixed solutions with different concentrations are prepared. The first 270 samples are used as the training samples to train the network model, and the remaining 30 samples are used as test samples, which are applied to the input layer of the trained neural network. Principal component analysis is used to extract the optimal feature variables, and the correlation coefficient method is used to obtain eight groups of principal component features in the spectra. The dimension is selected as 8, and the principal component score is calculated and used as the input data of the extension neural network. Next, the pattern recognition method is improved so that the designed neural network performs both resolution and measurement. The results of neural network pattern recognition are used as the input of the concentration network. The relative fluorescence intensity, relative slope, and comprehensive background parameters are used as the input parameters, and the extension neural network is used for pattern recognition and evaluation of oil pollutants. The experimental results show that the average recognition rate of the improved pattern recognition algorithm for oil pollutants is 98.43%, and the average recovery rate of concentration is 98.67%. Further, the average time for pattern recognition is 1.53 s, while the parallel factor analysis algorithm takes 2.89 s. These results indicate that 3D fluorescence spectroscopy combined with the extension neural network as the pattern recognition method is a reliable method for detecting oil pollutants.

Principle of principal component analysis
Principal component analysis (PCA) is a statistical method that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called principal components. Specifically, the new variables, which are fewer in number, are linear combinations of the original variables and retain as much of the original statistical information as possible. [27][28][29] From a mathematical point of view, PCA is a dimensionality reduction technique that maps high-dimensional data into lower dimensions. [30][31] Based on the idea of dimensionality reduction, the original variables $x_1, x_2, \cdots, x_p$, which have a certain correlation, are linearly combined and screened to form a new set of mutually uncorrelated comprehensive variables $m_1, m_2, \cdots, m_m$ ($m \le p$). [32][33][34] The specific steps of PCA are as follows:
(1) Standardize the original data, that is, subtract the mean value from each variable and then divide by the standard deviation to eliminate the effect of dimension.
(2) Calculate the correlation coefficient matrix $R = (r_{ij})$ based on the standardized data matrix $X = (x_{ij})$.
(3) Calculate the eigenvalues and eigenvectors of the correlation coefficient matrix $R$ to determine the number of principal components. First, the characteristic equation $|\lambda I - R| = 0$ is solved to obtain the eigenvalues $\lambda_i$. Then, the eigenvectors $e_i$ ($i = 1, 2, \cdots, p$) corresponding to these eigenvalues are obtained.
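The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch in Python, not the authors' MATLAB implementation; the function name `pca_correlation` and the argument `n_components` are assumptions introduced here, and the rule for choosing the number of components is left to the caller because the text does not state it.

```python
import numpy as np

def pca_correlation(X, n_components):
    """PCA on the correlation matrix (hypothetical helper, steps 1-3 of the text)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    # Step 1: standardize each variable (zero mean, unit standard deviation).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: correlation coefficient matrix of the standardized data.
    R = (Z.T @ Z) / (n_samples - 1)
    # Step 3: eigenvalues and eigenvectors of R (eigh, since R is symmetric),
    # sorted from the largest eigenvalue to the smallest.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Principal component scores of every sample on the leading components.
    scores = Z @ eigvecs[:, :n_components]
    return eigvals, eigvecs, scores
```

Because R is a correlation matrix, its eigenvalues sum to the number of variables p, so the ratio of each eigenvalue to p gives the share of variance carried by that component.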

Pattern recognition based on extension neural network
The standard BP network structure is used as the structure of the extension neural network model, and the expected output region $V_c$ is the expected output corresponding to the samples of class $c$. If the network training results of all classes of samples fall within the corresponding expected output regions, the classification is completed. The learning algorithm of the extension neural network model is based on the BP error principle. The expected output region $V_c$ of the class-$c$ pattern is a hypercube with $(\hat{y}_{c1}, \hat{y}_{c2}, \cdots, \hat{y}_{cL})$ as the center and $WD_c$ as the half side length. When $L = 2$, the expected output region $V_c$ is a square in 2D space. Assume that the $k$th sample belongs to the $c$th class, the actual output of the network is $(y_{k1}, y_{k2}, \cdots, y_{kL})$, and $V_k$ is the expected output region corresponding to the $k$th sample, so that $V_k = V_c$. An error function is then defined for the $i$th ($i = 1, 2, \cdots, L$) output unit corresponding to the $k$th sample.

Three-dimensional data and second-order calibration
A three-dimensional spectral image with the emission wavelength and excitation wavelength as the X-axis and Y-axis and the fluorescence intensity as the Z-axis is called a three-dimensional fluorescence spectrum (excitation-emission matrix, EEM). From the macroscopic aspect, the three-dimensional fluorescence spectrum mainly reflects the shape of the fluorescence spectrum of the substance and the gradient of the fluorescence intensity. It is a three-dimensional characteristic curve that simultaneously captures the changes in fluorescence intensity, excitation wavelength, and emission wavelength. The fingerprint can quickly and accurately determine the fluorescence intensity corresponding to a specific excitation-emission wavelength pair, and Rayleigh scattering and Raman scattering can also be easily distinguished in the fingerprint. The emission spectrum accurately locates the optimal emission wavelength of the substance, and the excitation spectrum accurately determines its optimal excitation wavelength. Three-dimensional fluorescence spectroscopy has become a popular choice among researchers for detecting petroleum pollutants because of its high sensitivity, good selectivity, rich information content, and nondestructive nature. Compared with the two-dimensional spectrum, the three-dimensional spectrum reflects more completely the spectral information contained in the mineral oil spectrum, which enables the three-dimensional fluorescence analysis method to better realize the qualitative and quantitative analysis of the substance.
With the development of sophisticated instruments, the understanding of 3D data is gradually maturing, and 3D data and second-order calibration methods are increasingly applied to the analysis of complex chemical systems. A data matrix can be generated by a single measurement on an analytical sample, and a set of matrices can be obtained by simultaneously or sequentially measuring a number of analytical samples. Thus, a 3D data matrix can be obtained by combining multiple matrices. A second-order calibration method is a method for analyzing a 3D data matrix. Tucker [35] proposed the three-mode PCA model (called the Tucker3 model) for the processing of a 3D data matrix. Its essence is to decompose the 3D data matrix X into three load matrices A, B, and C and a 3D core matrix G. The PARAFAC algorithm is a second-order calibration method in which the alternating least squares method is used to fit the trilinear model

$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk}$$

Its goal is to minimize the residual sum of squares [36]

$$s = \sum_{i}\sum_{j}\sum_{k} e_{ijk}^{2}$$

where $s$ is the residual sum of squares, and $F$ is the number of factors selected by the parallel factor method. $x_{ijk}$ is an element of the 3D data matrix X; $a_{if}$ is an element of the load matrix A; $b_{jf}$ is an element of the load matrix B; $c_{kf}$ is an element of the load matrix C; and $e_{ijk}$ is an element of the residual matrix E.
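To make the alternating least squares idea concrete, the following is a minimal sketch of a PARAFAC fit with plain NumPy. It is an illustration under assumptions, not the implementation used in the study: the function names, the random initialization, and the fixed iteration count are all choices made here, and no constraints (such as non-negativity) or convergence checks are included.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row (u * V.shape[0] + v) holds U[u, f] * V[v, f]."""
    n_factors = U.shape[1]
    return np.einsum('uf,vf->uvf', U, V).reshape(-1, n_factors)

def parafac_als(X, n_factors, n_iter=200, seed=0):
    """Fit x_ijk ~ sum_f a_if * b_jf * c_kf by alternating least squares (sketch)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, n_factors))
    B = rng.random((J, n_factors))
    C = rng.random((K, n_factors))
    # Unfold the three-way array along each mode.
    X1 = X.reshape(I, J * K)                        # columns indexed by (j, k)
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)     # columns indexed by (i, k)
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)     # columns indexed by (i, j)
    for _ in range(n_iter):
        # Update one load matrix at a time while the other two are held fixed.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X3 @ np.linalg.pinv(khatri_rao(A, B).T)
    residual = X - np.einsum('if,jf,kf->ijk', A, B, C)
    sse = float(np.sum(residual ** 2))              # residual sum of squares
    return A, B, C, sse
```

Each update is an ordinary least squares problem, which is why the residual sum of squares cannot increase from one step to the next; the result can still depend on the starting values, so several random starts are commonly compared in practice.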

Experiment
A Hitachi F-7000 fluorescence spectrometer, which can quickly complete 3D spectral scanning, was used as the experimental instrument. The photomultiplier tube (PMT) voltage was 400 V, the scanning rate was 12,000 nm/min, the scanning step was 5 nm, and the incident and exit slits were both 10 nm. The excitation wavelength range was 250-400 nm, and the emission wavelength range was 270-500 nm. The starting point of the emission scan was always 20 nm longer than the excitation wavelength to avoid interference from the Rayleigh scattering spectrum. The algorithm was implemented in MATLAB 8.0 or later. A solution of carbon tetrachloride and oil pollutants was prepared with an oil-to-carbon-tetrachloride ratio of 1:1000, and it was gradually diluted to obtain 300 samples with different concentrations. Samples no. 1-270 were used as the training samples, and samples no. 271-300 were used as the test samples. The concentrations of the different oil pollutants tested are shown in Table 1. The process of oil component detection is shown in Figure 1.
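The scan settings and the sample split described above can be written down as a small Python sketch. It assumes that the 5 nm step applies to both the excitation and the emission axis, which the text does not state explicitly; the helper names are hypothetical, and no measured intensities are included.

```python
import numpy as np

STEP_NM = 5  # scanning step from the text; assumed to apply to both axes

# Wavelength grids for one excitation-emission matrix (EEM).
excitation = np.arange(250, 400 + STEP_NM, STEP_NM)   # 250-400 nm
emission = np.arange(270, 500 + STEP_NM, STEP_NM)     # 270-500 nm

def rayleigh_valid_mask(ex, em, offset_nm=20):
    """True where the emission wavelength is at least offset_nm above the excitation."""
    return em[None, :] >= ex[:, None] + offset_nm

def split_samples(samples, n_train=270):
    """First n_train samples for training, the remainder for testing."""
    return samples[:n_train], samples[n_train:]

mask = rayleigh_valid_mask(excitation, emission)
sample_ids = np.arange(1, 301)                         # samples no. 1-300
train_ids, test_ids = split_samples(sample_ids)
```

Grid points where the mask is False lie inside the region excluded by the 20 nm offset and would carry no measured intensity in an EEM built on these axes.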
The 3D fluorescence spectrum of sample no. 4, which is a solution of diesel and gasoline, is shown in Figure 2. It can be seen that although the fluorescence intensity of each oil is different, the spectra of the two mineral oils overlap severely, and it is difficult to achieve spectral discrimination and concentration prediction by chemical methods. Further, the components cannot be distinguished in the 3D spectra of the other samples either, so those spectra are not shown here.

Improved pattern recognition
Firstly, the original fluorescence spectral data were standardized, and the correlation coefficient matrix was calculated. The dimension of the feature spectrum was selected as 8. Secondly, the optimal feature variables, which could reflect the complete features of fluorescence spectrum, were selected. The dimension of the feature space was compressed to reduce the amount of calculation, which was beneficial for selecting the feature with the largest amount of information and with the most significant effect on the spectrum classification. The correlation coefficients between the parameters were calculated. Finally, the principal component, which can be regarded as the characteristic spectrum of the original fluorescence spectrum of the sample, was selected. The extracted feature vectors are listed in Table 2.
Table 1 (fragment). Concentrations of the oil pollutants tested
Sample no. | Concentration (1st column) | Concentration (2nd column) | Concentration (3rd column)
… | … | 0 | 0
2 | 0 | 100 | 0
3 | 80 | 10 | 0
4 | 60 | 20 | 0
5 | 50 | 30 | 0
6 | 500 | 0 | 50
7 | 400 | 0 | 100
8 | 300 | 0 | 200
9 | 200 | 0 | 300
10 | 100 | 0 | 400
156 | 50 | 25 | 0
157 | 75 | 50 | 0
158 | 30 | 70 | 0
159 | 20 | 80 | 0
160 | 150 | 0 | 500
161 | 250 | 0 | 350
162 | 350 | 0 | 150
163 | 450 | 0 | 75
164 | 50 | 40 | 65
165 | 25 | 50 | 50

The load of a variable is defined as the coefficient of the variable in the linear combination equation multiplied by the square root of the corresponding eigenvalue of the principal component, but the coefficient itself is often called the load. The larger the load, the more similar the variable is to the principal component. Therefore, the load can be regarded as the correlation between the variable and the principal component. The value obtained for a sample by evaluating the linear combination of a principal component is called its score. The network input data are the principal component scores, as listed in Table 3. These principal component scores were input into the network as new data. The cross-validation method was used to avoid over-fitting in the classification process. Provided that enough information was retained, the top five characteristic parameters were selected (excluding the concentration information). The number of input nodes of the network model was set to 5, and the number of output nodes, that is, the number of refined oil types, was set to 3. In the extension neural network, the initial weights directly affect the training results. Starting from equalized initial weights, the network is trained on the training samples; the loop iteration in the learning algorithm generates the training error, and the training results show how closely the network output for each sample approximates the desired output. The neural network can be used for both accurate value calculation and pattern recognition. When used for pattern recognition, the number of output nodes is related to the number of classes. If there are two (three) classes, two (three) nodes can be used. Accordingly, the three classes can be expressed as (1,0,0), (0,1,0), and (0,0,1), that is, the expected output is (D1, D2, D3). The training results and expected output of the network model are listed in Table 4. The pattern recognition error curve is shown in Figure 3.
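The relation between load, correlation, and score described in this section can be checked numerically. The sketch below is in Python and uses hypothetical names (`loadings_and_scores`, `n_components`); it takes the load as the eigenvector coefficient multiplied by the square root of the eigenvalue, which for standardized data equals the correlation between a variable and the corresponding principal component score.

```python
import numpy as np

def loadings_and_scores(X, n_components):
    """Loads and scores of the leading principal components (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Standardize so that the analysis is based on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Load = coefficient in the linear combination * sqrt(eigenvalue).
    loadings = eigvecs * np.sqrt(eigvals)
    # Score = value of the linear combination for each sample.
    scores = Z @ eigvecs
    return loadings, scores, Z
```

For example, `np.corrcoef(Z[:, j], scores[:, i])[0, 1]` agrees with `loadings[j, i]` up to rounding, which is the sense in which the load can be read as a correlation.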
After training the network model with the training samples, the data of the test samples were input into the trained neural network. The input parameters included the concentration information (relative fluorescence intensity, relative slope, and comprehensive background parameters) for pattern recognition and measurement of oil pollutants. In the process of concentration measurement, the output value of the … Table 5. The statistical data of the corresponding characteristics are listed in Table 6. With the extension neural network as the pattern recognition method, the concentration measurement process took 1.53 s.
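The recognition step for a test sample can be pictured as checking which expected output region the network output falls into. The Python sketch below assumes the one-hot class centers (1,0,0), (0,1,0), (0,0,1) and a half side length of 0.3; that half side length, the function names, and the example outputs are hypothetical values chosen only for illustration, and the assignment of class indices to kerosene, diesel, and gasoline is not specified in the text.

```python
import numpy as np

# Centers of the expected output regions: (1,0,0), (0,1,0), (0,0,1).
CENTERS = np.eye(3)

def classify_by_region(output, centers=CENTERS, half_side=0.3):
    """Return the class index whose hypercube contains the output, or None.

    half_side = 0.3 is an assumed value; with half_side < 0.5 the three
    hypercubes around the one-hot centers cannot overlap.
    """
    output = np.asarray(output, dtype=float)
    inside = np.all(np.abs(centers - output) <= half_side, axis=1)
    hits = np.flatnonzero(inside)
    return int(hits[0]) if hits.size == 1 else None

def recognition_rate(outputs, labels):
    """Fraction of samples whose output falls inside the region of the true class."""
    correct = sum(classify_by_region(o) == y for o, y in zip(outputs, labels))
    return correct / len(labels)

# Hypothetical network outputs for three test samples and their true classes.
example_outputs = [[0.9, 0.1, 0.05], [0.2, 0.85, 0.1], [0.5, 0.5, 0.0]]
example_labels = [0, 1, 2]
rate = recognition_rate(example_outputs, example_labels)
```

An output that lies outside every region, like the third example, is counted as not recognized rather than being forced into the nearest class.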

PARAFAC algorithm for the detection of oil pollutants
The PARAFAC algorithm was applied for the analysis of oil pollutants. The core consistency diagnostic and the residual sum of squares were used jointly to estimate the number of factors. When the number of factors was 3, the core consistency coefficient decreased significantly, and the residual sum of squares also decreased. Because a sharp drop in core consistency indicates that the model contains too many factors, the number of factors was selected as 2 in this study. The analysis results of the PARAFAC model for the mixed solution sample are presented in Figures 4 and 5. Figure 4 shows a comparison between the theoretical and experimentally measured fluorescence excitation spectra, and Figure 5 shows a comparison between the theoretical and experimentally measured fluorescence emission spectra.

Conclusion
By combining the data representation advantages of PCA with the pattern recognition capability of the extension neural network for mixed component systems, the refined oil products were effectively identified and measured. Principal component analysis is used to extract the optimal feature variables, and the correlation coefficient method is used to obtain eight groups of principal component features in the spectra. The dimension is selected as 8, and the principal component score is calculated and used as the input data of the extension neural network. Next, the pattern recognition method is improved so that the designed neural network performs both resolution and measurement. The results of neural network pattern recognition are used as the input of the concentration network. The relative fluorescence intensity, relative slope, and comprehensive background parameters are used as the input parameters, and the extension neural network is used for pattern recognition and evaluation of oil pollutants. The experimental results show that the average recognition rate of the improved pattern recognition algorithm for oil pollutants is 98.43%, and the average recovery rate of concentration is 98.67%. The average pattern recognition rate of the oil pollutants based on the PARAFAC model is 93.1%, and the corresponding average recovery rates of diesel and gasoline are 92.19% and 89.73%, respectively. Further, the average time for pattern recognition is 1.53 s, while the parallel factor analysis algorithm takes 2.89 s. The comparison between the theoretical and experimental characteristic fluorescence excitation and emission spectra was used to verify that the extension neural network is a very powerful tool for spectral data analysis.
The extension neural network pattern recognition method used in this paper still tends to fall into local optima and converges slowly in application. Future research should improve the method to raise its recognition accuracy and efficiency, and should study its recognition performance in fields such as health care and food safety to expand its range of applications.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.