Multiple Regression and Big Data Analysis for Predictive Emission Monitoring Systems

Abstract

Predictive Emission Monitoring Systems (PEMS) offer a cost-effective and environmentally friendly alternative to Continuous Emission Monitoring Systems (CEMS) for monitoring pollution from industrial sources. Multiple regression is one of the fundamental statistical techniques for describing the relationship between dependent and independent variables. This model can be effectively used to develop a PEMS that estimates the amount of pollution emitted by industrial sources where the fuel composition and other process-related parameters are available; these parameters are often sufficient to predict the emission discharge with acceptable accuracy. In cases where PEMS are accepted as an alternative method to CEMS, which use gas analyzers, they can provide cost savings and substantial benefits for ongoing system support and maintenance. The described mathematical concept is based on the matrix algebra representation of multiple regression, involving multiple precision arithmetic techniques. Challenging numerical examples for statistical big data analysis are investigated. The numerical examples illustrate the computational accuracy and efficiency of the statistical analysis as the precision level increases. The programming language C++ is used for the mathematical model implementation. The data for research and development, including the independent fuel and dependent NOx emissions data, were obtained from CEMS software installed at a petrochemical plant.

Share and Cite:

Krougly, Z., Krougly, V. and Bays, S. (2023) Multiple Regression and Big Data Analysis for Predictive Emission Monitoring Systems. Applied Mathematics, 14, 386-410. doi: 10.4236/am.2023.145023.

1. Introduction

Continuous Emission Monitoring Systems (CEMS) have traditionally been used to monitor emissions from stationary sources. CEMS requires a significant capital investment and is very expensive to operate and maintain, due to the high initial cost of the hardware and the significant amount of labor required to maintain the equipment. Predictive Emission Monitoring Systems (PEMS) can be a cost-effective alternative: they continuously monitor dependent parameters and estimate emissions with accuracy close to that of direct measurement, greatly reducing the initial commissioning costs and ongoing maintenance.

PEMS development is based on proven scientific methods to perform statistical data analysis of the significant parameters used as inputs to a multiple regression model. Several commercial software packages, such as Minitab, MATLAB, and R, provide methods and techniques for multiple linear regression models [1] [2] [3] [4] . We focus on using the available techniques in combination with the C++ programming language to develop multiple linear and polynomial regression software that can work as a powerful PEMS model. Other methods used include big data analysis, vectors, matrices and linear algebra, high precision numerical calculations [5] [6] , and custom-developed classes and objects implementing high-performance scientific computing. Our PEMS model also uses numerical integration techniques, including integral representations for the cumulative distribution functions (CDF) of the t, F and Normal distributions. These methods require big statistical data analysis for testing the significance of regression coefficients and calculating significance levels for the F statistic.

The Data Acquisition System software was provided by Limesoft Inc. [7] to input and test the PEMS mathematical model, present results and compare them in real-time and on historical trend chart with actual NOx values obtained from the CEMS gas analyzer.

The PEMS model was verified against performance specification [8] requirements. It also has all the necessary test procedures to perform PEMS model evaluation, assessment and verification, to prove that PEMS software results have required accuracy in order to be accepted by the Environmental Agency.

The remainder of the paper is organized as follows. In Section 2, a brief description of the underlying theory is given, introducing the matrix algebra formulation of multiple linear regression. Section 3 applies predictive emission monitoring with matrix notation. Sections 4 and 5 describe statistical techniques and numerical integration used in PEMS model development. In Section 6, the multiple polynomial regression model is tested with big data analysis. In Section 7, the PEMS mathematical model is tested against an actual CEMS gas analyzer. The procedures described in Section 8 provide a framework for testing the PEMS model during normal engine operations. In Section 9 and the Appendix, we discuss the role of high precision software in statistical computations. In Sections 10, 11 and 12, we consider a comprehensive methodology for developing PEMS, the limitations of the PEMS model, and the environmental benefits of PEMS over CEMS.

2. Matrix Algebra Formulation to Multiple Linear Regression

A linear regression model specifies that the regression function is a linear function of the regressor (independent) variables. More realistic applications require more general regression models, because the regression function is not linear or because there are many regressor variables. Including polynomial terms in the regression model is a common way to achieve a better approximation to the true regression function; this leads to the term polynomial regression. Incorporating several regressor variables, with or without additional polynomial terms, gives a multiple regression model. The most effective way to express the mathematical operations is to write the fitted model in matrix notation.

Suppose that there are k regressor variables and n observations, $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$, $i = 1, 2, \ldots, n$. The regression function can be modeled as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \quad i = 1, 2, \ldots, n \quad (1)$$

This model is a system of n equations that can be written in matrix form as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad (2)$$

where

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \quad (3)$$

The vector $\mathbf{y}$ is the $n \times 1$ vector of the observations, the design matrix $\mathbf{X}$ is the $n \times (k+1)$ matrix containing the values of the input variables, $\boldsymbol{\beta}$ is the $(k+1) \times 1$ vector of the regression coefficients, and $\boldsymbol{\epsilon}$ is the $n \times 1$ vector of random errors. The vector of the least squares estimator $\hat{\boldsymbol{\beta}}$ is the solution of the normal equations, which can be written in matrix notation as

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} \quad (4)$$

Multiplying both sides of the normal equations (4) by $(\mathbf{X}'\mathbf{X})^{-1}$, we obtain the least squares estimate $\hat{\boldsymbol{\beta}}$:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \quad (5)$$

where $\mathbf{X}'$ is the transpose of the matrix $\mathbf{X}$.

If the variables $x_1, x_2, \ldots, x_k$ are linearly independent, the matrix $\mathbf{X}'\mathbf{X}$ is always nonsingular [4] , so standard methods for inverting matrices can be used to find $(\mathbf{X}'\mathbf{X})^{-1}$. In practice, we perform the matrix calculations using the systematic definition of the vector and matrix classes in C++ [9] .

The fitted value of the response variable at the data point $(x_{i1}, x_{i2}, \ldots, x_{ik})$ is:

$$\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_{ij}, \quad i = 1, 2, \ldots, n \quad (6)$$

The matrix form of the fitted values and residuals is:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \quad \text{and} \quad \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} \quad (7)$$

The test for significance of regression involves the null hypothesis $H_0$ that no linear relationship exists between the response variable $y$ and any of the $k$ regressors $x_1, x_2, \ldots, x_k$. The test of $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ is based on a statistic that has an F distribution:

$$F_0 = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)} = \frac{\mathrm{MSR}}{\mathrm{MSE}} \quad (8)$$

The rejection region and p value for a level $\alpha$ test are:

$$F_0 > f_{\alpha, k, n-k-1}, \quad p\text{ value} = 1 - F_{k, n-k-1}(F_0) \quad (9)$$

where $F_{k, n-k-1}$ denotes the CDF of the F distribution with $k$ and $n-k-1$ degrees of freedom.

We should reject $H_0$ if the computed value of the F statistic in Equation (8) is greater than the critical value $f_{\alpha, k, n-k-1}$ of the distribution with parameters $k$ and $n-k-1$. The test for significance of regression involves the total sum of squares SST, which is partitioned into a sum of squares due to regression, SSR, and a sum of squares due to error, SSE:

SST = SSR + SSE

The computation formula for the regression sum of squares is:

$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \quad (10)$$

The sum of squares for error is defined by

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \quad (11)$$

The strength of a regression model is measured by the coefficient of multiple determination $R^2$, which takes values between 0 and 1:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \quad (12)$$

Since $R^2$ always increases when regressors are added to the model, it is better to use the adjusted statistic $R^2_{\mathrm{adj}}$, which accounts for the degrees of freedom:

$$R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}/(n-k-1)}{\mathrm{SST}/(n-1)} \quad (13)$$

3. Predictive Emission Monitoring with Matrix Notation

Table 1 lists eight data set examples used as tests of the numerical calculations to various types of PEMS mathematical models.

The data that appeared in Table 2 correspond to Example 1 in Table 1. The format includes the observation interval, number of observations, observation number, observation vector, and design matrix, containing the input variables of data with n = 22,200 and k = 15. Table 3 shows the results of fuel combustion analysis.

C++ software multiple regression outputs are given in Table 4 for estimated β ^ and calculated t values.

Consider matrix algebra calculations for the data set presented in Table 5.

Fitted values are calculated by Equation (14)

Table 1. Data set examples used as tests of the numerical calculations.

Table 2. Example 1. Observation interval, number of observations, observation number, observation vector and design matrix, containing the input variables of data with n = 22,200 and k = 15.

Table 3. Results of fuel combustion analysis.

Table 4. Estimated $\hat{\beta}$ and calculated t values. Values written as 3.929e12 denote $3.929 \times 10^{12}$.

Table 5. Given observation vector y and design matrix X, containing the input variables.

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_{15} x_{i15} \quad (14)$$

The multiple linear regression model with k = 15 regression coefficients predicts the fitted value

$$\begin{aligned} \hat{y}_i = {} & 38.49 + 50.23 x_1 - 0.0001476 x_2 + 0.0012698 x_3 + 0.362444 x_4 + 2374.26 x_5 \\ & - 3682.95 x_6 - 31.94 x_7 + 29.13 x_8 - 1072.14 x_9 + 3275.63 x_{10} + 1.95 x_{11} \\ & - 5.68 x_{12} - 25.41 x_{13} - 25.32 x_{14} - 0.0163964 x_{15} \end{aligned}$$

This regression model can be used to obtain the fitted values by substituting each observation into Equation (14). Thus, the fitted value $\hat{y}_i$ for the data in Table 5 is:

$$\begin{aligned} \hat{y}_i = {} & 38.49 + 50.23(2.29092) - 0.0001476(26428.6) + 0.0012698(30834) \\ & - 0.0001476(89.2179) + 0.0012698(0.025518) + 0.362444(0.035515) \\ & + 2374.26(1.39106) - 3682.95(0.293697) - 31.94(0.089405) + 29.13(0.061245) \\ & - 1072.14(0.102182) + 3275.63(8.78347) + 1.95(2.89435) - 5.68(1.68749) \\ & - 0.0163964(171.713) = 51.09. \end{aligned}$$

The next step is the calculation of the estimated regression coefficients $\hat{\beta}$, t values, p values and significance levels. The sum of squares for error is calculated as

$$\mathrm{SSE} = \mathbf{e}'\mathbf{e} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

and

$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-k-1}$$

$$\mathrm{SSE} = \mathbf{e}'\mathbf{e} = 4.67^2 + \cdots + (-11.50)^2 = 739266$$

The estimate of the error variance is:

$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-k-1} = \frac{739266}{22200-15-1} = 33.3243$$

and $\hat{\sigma} = \sqrt{33.3243} = 5.77272$.

The diagonal elements $\mathrm{d.e.}(\hat{\beta}_i)$ of the matrix $(\mathbf{X}'\mathbf{X})^{-1}$ are used to calculate the standard errors of the estimates $\mathrm{s.e.}(\hat{\beta}_i)$.

For example

$$\mathrm{s.e.}(\hat{\beta}_0) = \hat{\sigma} \times \sqrt{\mathrm{d.e.}(\hat{\beta}_0)} = 5.77272 \times \sqrt{5.03727} = 12.9562$$

$$\mathrm{s.e.}(\hat{\beta}_1) = \hat{\sigma} \times \sqrt{\mathrm{d.e.}(\hat{\beta}_1)} = 5.77272 \times \sqrt{2734.87} = 301.89$$

$$\mathrm{s.e.}(\hat{\beta}_{15}) = \hat{\sigma} \times \sqrt{\mathrm{d.e.}(\hat{\beta}_{15})} = 5.77272 \times \sqrt{3.03898 \times 10^{-7}} = 0.0032$$

The corresponding t statistic values are

$$t_0 = \frac{\hat{\beta}_0}{\mathrm{s.e.}(\hat{\beta}_0)} = \frac{38.4872}{12.9562} = 2.97056$$

$$t_1 = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)} = \frac{50.2346}{301.8899} = 0.166401$$

$$t_{15} = \frac{\hat{\beta}_{15}}{\mathrm{s.e.}(\hat{\beta}_{15})} = \frac{-0.0163964}{0.0032} = -5.15233$$

The estimated coefficients $\hat{\beta}$, t values and p values are shown in Table 6. The t test statistics and p values for the significance of each regression coefficient are given in the second and third columns, and the significance levels are indicated in the last column. The low p values, marked +++, specify a significance level less than 0.01 and show that the corresponding parameter estimates $\hat{\beta}_i$ should be kept in the model and are useful in modeling the fitted values $\hat{y}_i$. Thirteen of

Table 6. Estimated coefficients β ^ , t values and p values. The last column specifies the following significance levels: + corresponds to 0.1, ++ to 0.05, and +++ to 0.01.

the sixteen p values are less than 0.01. Thus, all regression coefficients, except three ( β 1 , β 13 and β 14 ), are significantly different from zero at the level α = 0.01 .

The regression sum of squares is computed from Equation (10),

$$\mathrm{SSR} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 39240555.922 - \frac{862490.518^2}{22200} = 5732002.172$$

The sum of squares for error by Equation (11) is:

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 39979813.3969 - 39240555.922 = 739257.475$$

To test the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$, we calculate the statistic $F_0$ by Equation (8):

$$F_0 = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)} = \frac{5732002.172/15}{739257.475/(22200-15-1)} = \frac{382133.478}{33.324} = 11467.2$$

This is a highly significant result, since $F_0 > f_{0.05, 15, \infty} = 1.67$ and even $F_0 > f_{0.01, 15, \infty} = 2.04$ (the p value is considerably smaller than $\alpha = 0.01$). The null hypothesis should be rejected at any reasonable significance level. We conclude that there is a useful linear relationship between $y$ and at least one of the $k = 15$ regressors in the model. This does not mean that all fifteen regressors are useful. As Table 6 shows, thirteen of the sixteen p values are less than 0.01; all regression coefficients except three are significantly different from zero at the level $\alpha = 0.01$.

The computed adjusted $R^2_{\mathrm{adj}}$ statistic is:

$$R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}/(n-k-1)}{\mathrm{SST}/(n-1)} = 1 - \frac{739265.823/(22200-15-1)}{6471259.647/(22200-1)} = 0.886$$

where

$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE} = 5732002.172 + 739257.475 = 6471259.647$$

4. Numerical Integration for Cumulative Distribution Functions. Calculations of t and p Values

To find stable, accurate, and computationally efficient methods for performing big matrix calculations and numerical integration, software implementation with high precision arithmetic is needed. In [5] numerical examples illustrate the computational accuracy and efficiency of the numerical integration technique, particularly for the direct Laplace transform and its inverse, and the C++ implementation of the composite Simpson's Rule for numerical integration (direct Laplace transform). Several algorithms have been proposed in [6] and [10] for numerical integration techniques and software implementation with arbitrary-precision arithmetic.

A fundamental step in estimating p values, significance levels, and other statistics in ANOVA calculations is a numerical integration program to approximate the CDFs of the t, F and Normal distributions.

The CDF of a real-valued random variable X is the function given by

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du, \quad -\infty < x < \infty \quad (15)$$

The probability density function (PDF) of t distribution is [4] :

$$f(x) = \frac{\Gamma[(\nu+1)/2]}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \cdot \frac{1}{[(x^2/\nu) + 1]^{(\nu+1)/2}}, \quad -\infty < x < \infty \quad (16)$$

where $\nu = n - k - 1$ is the number of degrees of freedom and $\Gamma$ is the gamma function, $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\,dt$ for $\alpha > 0$.

The Normal distribution N ( μ , σ 2 ) with parameters μ and σ has PDF:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2} \quad (17)$$

The t distribution converges to the Normal distribution as the number of degrees of freedom $\nu$ approaches infinity. The t distribution provides the statistical basis for testing the significance levels of regression coefficients.

The F probability distribution has two shape parameters, denoted by $\nu_1$ and $\nu_2$. The parameter $\nu_1$ is the number of numerator degrees of freedom, and $\nu_2$ is the number of denominator degrees of freedom.

The formula for the PDF of F distribution [4] is:

$$f(x) = \frac{\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right) \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} x^{(\nu_1/2)-1}}{\Gamma\!\left(\frac{\nu_1}{2}\right) \Gamma\!\left(\frac{\nu_2}{2}\right) \left[\left(\frac{\nu_1}{\nu_2}\right)x + 1\right]^{(\nu_1+\nu_2)/2}}, \quad 0 < x < \infty \quad (18)$$

It is usually abbreviated as $f_{\nu_1, \nu_2}$.

The formula for the significance level of the F distribution does not exist in closed form; it is computed numerically. Table 7 gives the significance level $\alpha$ of the F distribution as a function of the shape parameters $\nu_1$ and $\nu_2$, with the F0 statistic equal to 5.

For example, if $\nu_1 = 5$, $\nu_2 = 1000$ and the F0 statistic is 5, the significance level is:

$$P(F_{\nu_1, \nu_2} > 5) = 1.42 \times 10^{-9}$$

The 3D plot in Figure 1 follows the outputs given in Table 7.

Table 8 and Figure 2 give the significance level α of the F distribution for different values of the shape parameters ν 1 and ν 2 and F0 statistic = 10.

Let’s consider a software program to perform numerical integration. The idea is to calculate an approximation of the area under the function’s curve between points a and b, as illustrated in Figure 3.

The technique consists of splitting the interval between a and b into n subintervals of equal length. Each subinterval has width $w = (b-a)/n$, while its height varies, so we choose the mid-point x of the subinterval and calculate the height as $h = f(x)$ (the vertical dashed line). Multiplying the height h by the width w gives the area of a rectangle that comes fairly close to the actual area under the function $f(x)$ in that subinterval. Doing this repeatedly, once for each subinterval, yields a good approximation of the entire area under the curve, especially if we have a large number of subintervals,

Table 7. Significance level α of the F ν 1 , ν 2 distribution and F0 statistic = 5.

Figure 1. Significance level α of the F ν 1 , ν 2 distribution and F0 statistic = 5.

Table 8. Significance level α of the F ν 1 , ν 2 distribution and F0 statistic = 10.

i.e., if n is large.

The numerical integration program was implemented in C++ high precision software [5] . The calculations can be performed up to 1000 decimal places [11] . The Appendix introduces C++ code used to implement numerical CDF and PDF for t and F distributions in arbitrary precision.

The p value is $p = 2P(X \ge |t|) = 2(1 - F(|t|))$, where the random variable X has the t distribution with $\nu$ degrees of freedom and $F$ is its CDF. The p values corresponding to the t distribution with $\nu = n - k - 1 = 22200 - 15 - 1 = 22184$ degrees of freedom are shown in Table 6. For example:

$$p_0 = 2 \times P(X \ge |t_0|) = 2 \times P(X \ge 2.97056) = 2 \times (1 - F(2.97056)) = 2(1 - 0.998514) = 0.00297$$

$$p_1 = 2 \times P(X \ge |t_1|) = 2 \times P(X \ge 0.1664) = 2 \times (1 - F(0.1664)) = 2(1 - 0.566079) = 0.86784$$

Figure 2. Significance level α of the $F_{\nu_1, \nu_2}$ distribution and F0 statistic = 10. Estimated data presented in terms of $\log_{10}\alpha$.

Figure 3. Numerical integration.

$$p_{15} = 2 \times P(X \ge |t_{15}|) = 2 \times P(X \ge |-5.15233|) = 2 \times (1 - F(5.15233)) = 2(1 - 0.99999987) = 2.6 \times 10^{-7}.$$

5. Confidence Interval and Prediction Interval

The confidence interval and prediction interval are calculated by the following equations:

$$\hat{\mu}_{y|\mathbf{x}_0} - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\, \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \le \mu_{y|\mathbf{x}_0} \le \hat{\mu}_{y|\mathbf{x}_0} + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\, \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \quad (19)$$

$$\hat{y}_0 - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 \left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \le Y_0 \le \hat{y}_0 + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 \left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \quad (20)$$

A regression model can be used to predict new or future observations on the response variable Y corresponding to the independent variables $x_{01}, x_{02}, \ldots, x_{0k}$. If $\mathbf{x}_0' = (1, x_{01}, x_{02}, \ldots, x_{0k})$, a point estimate of the future observation $Y_0$ at the point $(x_{01}, x_{02}, \ldots, x_{0k})$ is:

$$\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}. \quad (21)$$

A prediction interval is always wider than a confidence interval: the confidence interval expresses the error in estimating the mean of the distribution, while the prediction interval expresses the error in predicting a future observation from the distribution at the point $\mathbf{x}_0$.

We calculate the confidence interval and the prediction interval using the CDFs of the t distribution, with PDF given by Equation (16), and the Normal distribution, with PDF given by Equation (17).

6. Multiple Polynomial Regression Model

As shown in Table 6 for the multiple linear regression model, all regression coefficients except three ($\beta_1$, $\beta_{13}$, and $\beta_{14}$) are significantly different from zero at the level $\alpha = 0.01$. The three remaining coefficients have p values higher than 0.01; they are not statistically significant, so the data provide no evidence against the null hypothesis for them.

Example 2. Consider the following multiple polynomial regression model (the second order no-interaction model). There are $k = 15$ initial regressor variables $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$, $i = 1, 2, \ldots, n$, and r additional regressor variables with quadratic terms:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \beta_{k+1} x_{i1}^2 + \beta_{k+2} x_{i2}^2 + \cdots + \beta_{k+r} x_{ik}^2 + \epsilon_i, \quad i = 1, 2, \ldots, n \quad (22)$$

The estimated regression coefficients $\hat{\beta}$ are shown in Table 9. The t test statistics and p values for the significance of each regression coefficient are given in the second and third columns, and the significance levels are indicated in the last column. The low p values, marked +++, specify a significance level less than 0.01 and indicate that the corresponding regression parameter estimate $\hat{\beta}_i$ is useful in modeling the fitted values $\hat{y}_i$. Twenty-five of the thirty p values are less than 0.01. Thus, all regression coefficients except five (among them $\beta_1$, $\beta_{13}$, $\beta_{14}$, and $\beta_{29}$) are significantly different from zero at the level $\alpha = 0.01$.

For both linear and polynomial regression models, software output produced the desired information, which is summarized in Table 10.

Table 11 shows fitted values and residuals for n = 5485 data points. Example 3 and Example 4 correspond to the linear and polynomial regression models, respectively.

Figure 4 shows two scatter plots of the observed $y_i$ and fitted $\hat{y}_i$ values with n = 22,200 data points. Example 1 (left) and Example 2 (right) correspond to the linear and polynomial regression models, respectively.

The data with n = 96 was created as a fragment of the data with n = 22,200

Table 9. Example 2. Estimated coefficients β ^ , t values and p values for multiple polynomial regression model with n = 22,200 and k = 30. The last column specifies the following significance levels: + corresponds to 0.1, ++ to 0.05, and +++ to 0.01.

Table 10. Example 2. Observation interval, number of observations, observation number, fitted values, and residuals for the linear and polynomial regression models. The observation numbers are in accordance with the scatter plots in Figure 4.

Table 11. Fitted values and residuals for n = 5485 data points. Example 3 and Example 4 correspond to the linear and polynomial regression models, respectively.

Figure 4. Two scatter plots of the observed $y_i$ and fitted $\hat{y}_i$ values with n = 22,200 data points. Example 1 (left) and Example 2 (right) correspond to the linear and polynomial regression models, respectively.

observations. Figure 5 shows two scatter plots of the observed $y_i$ and fitted $\hat{y}_i$ values of the data with n = 96. Example 5 (left) and Example 6 (right) correspond to the linear and polynomial regression models, respectively.

Evidently, the two scatter plots in Figure 4 and in Figure 5, are very much alike. They demonstrate similar tracking boundary movements for observed and fitted values and can illustrate whether the algorithm has succeeded in obtaining high-order accuracy or has failed due to numerical instability.

Figure 4 and Figure 5 demonstrate a slight, visually observable improvement for the multiple polynomial regression model over the linear model: the fitted and observed values overlap more closely on the right scatter plots, which correspond to the multiple polynomial regression model.

Quantitative comparison shows slight improvements in accuracy through residual average and the coefficient of multiple determination R2, which is an important step in building realistic regression models.

For the polynomial model with n = 22,200 and k = 30, the residual average is 3.825 versus 4.571 for the linear model (a decrease of 16.34%), and the adjusted $R^2_{\mathrm{adj}}$ statistic is 0.917 versus 0.886 for the linear model.

For the polynomial model with n = 96 and k = 15, the residual average is 0.188 versus 0.275 for the linear model (a decrease of 31.6%), and $R^2$ is 0.968 versus 0.943 for the linear model.

For Example 6, the number of terms in the approximation was increased from k = 15 (linear model) to k + r = 15 + 15 = 30 (polynomial model). Naturally, it would seem that a further increase in the number of second-order terms would improve the accuracy of the approximation, since more terms seem to produce better solution results.

On the other hand, using a large number of terms will not be of benefit if the precision with which each term is calculated is insufficient.

In Figure 6, n = 5485, k = 15 (Example 3) and k = 30 (Example 4). A slight improvement is visually observed for multiple polynomial regression model over linear model. These figures illustrate that the fitted and observed values are visually more overlapping on the right scatter plot corresponding to the multiple polynomial regression model.

Figure 7 shows two scatter plots of the observation y i and fitted y ^ i values with n = 8444, for gases NO (Example 7, left) and NO2 (Example 8, right).

Figure 5. Two scatter plots of the observed $y_i$ and fitted $\hat{y}_i$ values of the data with n = 96. Example 5 (left) and Example 6 (right) correspond to the linear and polynomial models, respectively.

Figure 6. Two scatter plots of the observations y i and fitted y ^ i values of the data with n = 5485 for linear (left) and multiple polynomial regression model (right).

Figure 7. Two scatter plots of the observation y i and fitted y ^ i values with n = 8444, for NO (Example 7, left) and NO2 (Example 8, right).

Figure 8 illustrates, for Example 1, the histogram of residuals using PDF scaling. The area of each bar is the relative number of observations, and the sum of the bar areas is equal to 1. The histogram of residuals can be used to check whether the random error is normally distributed: a symmetric bell-shaped histogram evenly distributed around zero indicates that the normality assumption is likely to be true. This histogram indicates that the random error is not normally distributed; it suggests that the model's underlying assumptions may hold only in some regions, and it shows what can happen to prediction and inference when certain assumptions are violated.

Figure 8. Histogram of residuals.

7. PEMS Mathematical Model versus Actual CEMS Gas Analyzer

The PEMS mathematical model was tested against an actual CEMS gas analyzer. Table 12 shows, for Example 3 with n = 5485, the estimated coefficients $\hat{\beta}$, t test statistics and p values. The significance level of each regression coefficient is indicated in the last column. The low p values, marked +++, specify a significance level less than 0.01 and show that the corresponding parameter estimates $\hat{\beta}_i$ should be kept in the model and are useful in modeling the fitted values $\hat{y}_i$. Eleven of the sixteen p values are less than 0.01. Thus, all regression coefficients, except five ($\beta_0$, $\beta_1$, $\beta_7$, $\beta_{13}$ and $\beta_{14}$), are significantly different from zero at the level $\alpha = 0.01$.

Figure 6, for n = 5485, shows the scatter plot of the observation y i versus fitted y ^ i .

The Data Acquisition System software was provided by Limesoft Inc. [7] to input and test the PEMS mathematical model. We present results and compare them in real-time and on a historical trend chart with actual values obtained from the CEMS gas analyzer.

Figure 9 shows a screenshot of the PEMS Model Trend Chart for Example 3 with n = 5485. The predicted PEMS plot is given against the actual CEMS plot for 15 min average NO data. The predicted and actual NO values visibly overlap. The PEMS mathematical model appears to be an effective way to determine emissions based on historical and real-time process data.

8. Testing and Calculation Procedures Based on Performance Specification Requirement

The procedures described in this section provide a framework for testing PEMS

Table 12. Example 3 with n = 5485. Estimated coefficients β ^ , t values and p values. The last column specifies the following significance levels: + corresponds to 0.1, ++ to 0.05, and +++ to 0.01.

model for Example 3 during normal engine operations. They are based on performance specification [8] . The PEMS model must pass a relative accuracy (RA) test and accompanying statistical tests to be acceptable for use in demonstrating compliance with applicable requirements. We demonstrate these procedures for the data with n = 96 illustrated in the scatter plot in Figure 5.

First is the arithmetic mean of the differences:

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = \frac{1}{96} \times 3.3899 = 0.03531, \quad (23)$$

where n is the number of data points and $d_i$ is the difference between the observed and fitted values.

Second is the standard deviation of the differences:

$$S_d = \sqrt{\frac{\sum_{i=1}^{n} d_i^2 - n\bar{d}^2}{n-1}} = \sqrt{\frac{11.8414 - 96 \times 0.03531^2}{96-1}} = 0.3513 \quad (24)$$

Third is the confidence coefficient for α = 0.025 and for n 1 degrees of

Figure 9. A screenshot of C# software output. Predicted PEMS vs actual CEMS plot are given for 15 min average NO data.

freedom:

$$cc = t_{0.025}\, \frac{S_d}{\sqrt{n}} = 1.988 \times \frac{0.3513}{\sqrt{96}} = 0.071 \quad (25)$$

Fourth is the relative accuracy:

$$\mathrm{RA} = \frac{|\bar{d}| + |cc|}{\bar{y}} \times 100 = \frac{|0.03531| + |0.071|}{36.7638} \times 100 = 0.2896\%, \quad (26)$$

where $|\bar{d}|$ is the absolute value of the mean of the differences, $|cc|$ is the absolute value of the confidence coefficient, and $\bar{y} = 36.7638$ is the mean of the observation values. This is an acceptable result, as the RA must not exceed 20% when the PEMS measurements are between 10 ppm and 100 ppm.

Next is the test for significance of regression (with α = 0.025 ). The regression sum of squares computed by Equation (10) is:

$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 195.16939 \quad (27)$$

The sum of squares for error is defined by (11):

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 11.84140 \quad (28)$$

The F0 statistic versus the constant model is given by Equation (8):

$$F_0 = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)} = \frac{195.16939/15}{11.84140/(96-15-1)} = 87.90 \quad (29)$$

and the p value $= 6.49 \times 10^{-28}$. This low p value, less than 0.025, is statistically significant and indicates strong evidence against the null hypothesis.

Coefficient of multiple determination is calculated by Equation (12):

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SSR} + \mathrm{SSE}} = 1 - \frac{11.84140}{195.16939 + 11.84140} = 0.943 \quad (30)$$

and the multiple correlation coefficient is $R = \sqrt{0.943} = 0.97$. The PEMS correlation is acceptable, as it must be 0.8 or greater.

9. The Role of High Precision Arithmetic in Numerical Calculations

High precision software has been implemented in C++ for calculating numerical Laplace and inverse Laplace transforms [5] .

The Appendix introduces C++ code used to implement numerical CDF and PDF for t and F distributions in arbitrary precision.

Table 13 shows the sums of squares, the F0 statistic and $R^2$ evaluated in double precision (N = 16) and in high precision with N = 32 and N = 128. The notation * indicates a calculation error in double precision. First we compare Equations (11) and (28).

Double precision gives a calculation error in the shortcut formula: $\mathrm{SSE}^* = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 136.407$.

The solution by Equation (28), $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, gives $\mathrm{SSE} = 11.84$. The high precision software gives the accurate solution $\mathrm{SSE} = 11.72$.

Next we compare Equations (10) and (27).

Double precision gives a calculation error in the shortcut formula: $\mathrm{SSR}^* = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 70.604$.

Table 13. Sum of squares, F statistic and R2 evaluated in double with N = 16 and multiple precision level with N = 32 and N = 128. The notation * indicates the calculation error in double precision.

The solution by Equation (27), $\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, gives $\mathrm{SSR} = 195.169$. The high precision software gives the accurate solution $\mathrm{SSR} = 195.288$.

10. Comprehensive Methodology for Developing PEMS

PEMS development is based on techniques for statistical analysis of the significant parameters used as inputs to the multiple regression model. The methodology includes the following steps.

Data collection: Obtain historical data on the independent fuel and process parameters and the dependent NOx emissions from the CEMS software installed at the industrial source. The data should cover a representative range of operating conditions and emission levels.

Variable selection: Identify the most significant process parameters, such as oxygen, temperature, pressure, fuel flow, and fuel composition, affecting emissions formation. Perform correlation and regression analyses to determine the relationships between these variables and the emission levels.

Model development: Split the historical data into training and testing sets. Use the training set to develop the PEMS model and the testing set to assess its predictive accuracy. Adjust and refine the model as necessary to improve its performance.

Model evaluation: Assess the accuracy of the PEMS model by comparing real-time and historical trend charts of estimated emissions with actual NOx values obtained from CEMS gas analyzers. Verify the model against the performance specification [8] requirements and perform additional tests for PEMS model evaluation and assessment.

Model implementation: Integrate the PEMS model into the plant’s control system to provide real-time emission estimates and facilitate data-driven decision-making for emission control and optimization.

Ongoing model maintenance: Periodically update and recalibrate the PEMS model to account for changes in process conditions, equipment performance, and regulatory requirements.

11. Limitations of PEMS Model

Although PEMS offers several advantages over CEMS, it has certain limitations.

Applicability: PEMS may not be suitable for all types of industrial sources, especially those with highly variable or complex emission profiles such as coal-fired power stations or alternative fuel cement plants. Industries with rapidly changing operating conditions or multiple emission sources may require more sophisticated models or hybrid monitoring approaches combining PEMS with CEMS.

Data quality: The accuracy of PEMS is highly dependent on the quality and representativeness of the historical data used for model development. Poor data quality or gaps in data coverage can result in less accurate and reliable PEMS models.

Model adaptability: PEMS models may require regular updates and recalibration to maintain accuracy as process conditions change over time. This will often require close collaboration between plant operators and PEMS vendors.

12. Environmental Benefits of PEMS over CEMS

PEMS offers several environmental benefits compared to CEMS.

Enhanced process optimization and potential emissions reduction: by providing real-time insight into emission levels and continuous feedback on process performance, PEMS can help improve overall plant efficiency. This information can be used to optimize fuel usage, reduce excess air, improve combustion efficiency, and facilitate timely interventions to adjust process parameters and minimize emissions. This proactive approach to emissions control supports more effective and sustainable emission reduction strategies, ultimately reducing the release of pollutants into the environment.

Improved compliance and reporting: PEMS models can help industries maintain compliance with environmental regulations by providing accurate, real-time emissions data. This information can be used to generate reports for regulatory agencies and demonstrate ongoing compliance with emissions limits. Improved compliance can lead to fewer penalties and fines. The adoption of PEMS models signals a commitment to sustainable business practices and environmental stewardship. By demonstrating the implementation of advanced monitoring techniques and a proactive approach to emission reduction, industries can enhance their corporate social responsibility profiles and contribute to global efforts to mitigate climate change and reduce air pollution.

Reduced maintenance and downtime: PEMS models typically require less maintenance than CEMS equipment, resulting in reduced downtime for industrial facilities. This not only lowers maintenance costs but also minimizes the potential for accidental releases of pollutants during maintenance activities. By reducing the need for invasive and time-consuming maintenance procedures, PEMS contribute to a safer and more environmentally friendly workplace.

Reduced resource consumption: PEMS rely on existing process data and sensors, reducing the need for additional hardware and consumables used in CEMS. By minimizing the need for sampling equipment and consumable materials, such as calibration gases, PEMS help reduce the environmental impact associated with waste disposal and resource extraction. PEMS models typically consume less energy compared to the operation of CEMS analyzers, reducing the overall environmental impact. This benefit is particularly significant for large industrial facilities where energy consumption can be a major contributor to greenhouse gas emissions and operating costs. The reduced energy consumption also translates to cost savings for the facility.

13. Conclusion

Software-based PEMS uses data collected by the process sensors and analyzers to learn how various process parameters, such as oxygen, temperature, pressure, fuel flow, and fuel composition, affect emissions formation. In many cases this data can be used to develop and install a PEMS in place of a CEMS. The development is based on a regression model and statistical methods, together with previously collected CEMS data. To develop a stable, accurate, and computationally efficient PEMS model, this study used high precision C++ software to conduct the multiple regression analysis. Computational statistics methods were used to estimate NO and NO2 emissions from one of the stacks of a petrochemical refinery plant. The PEMS methods involved the creation of mathematical models to express the relationship between emissions and various operating and external parameters, such as flue gas temperature, excess combustion air, and heat load. The applicability of PEMS was tested with multiple regression analysis of big statistical data. The multiple linear regression model allows the response variable to be modeled as a function of more than one input variable; the computations are considerably more complex than in simple linear regression, and the most efficient way to handle them mathematically is the matrix algebra approach. In the multiple polynomial regression model, the response is not a linear function of the input variables, although many ideas carry over from linear regression. The mathematical processes in multiple regression require model fitting and making statistical inferences. We investigated the accuracy of the PEMS model by applying the test procedures described in performance specification [8].
The most important result is that the PEMS model was tested and found to be suitable for continuous emissions monitoring, provided that the influencing parameters continue to operate within the levels recorded and used for PEMS model development.

Appendix: C++ Code Used to Implement Numerical CDF and PDF for t and F Distributions in Arbitrary Precision

The statistical techniques and numerical integration used in PEMS model development were described in Sections 4 and 9. The calculations can be performed with up to 1000 significant decimal digits. The C++ code used to implement the numerical CDF and PDF for the t and F distributions in arbitrary precision is given below.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Akritas, M. (2016) Probability & Statistics with R for Engineers and Scientists, Pearson Education.
[2] Chapra, S.C. (2012) Applied Numerical Analysis with Matlab. 3rd Edition, McGraw-Hill.
[3] Hayter, A.J. (2012) Probability and Statistics for Engineers and Scientists, 4th Edition, Brooks/Cole, Cengage Learning.
[4] Montgomery, D.C. and Runger, G.C. (2020) Applied Statistics and Probability for Engineers. 7th Edition, John Wiley & Sons.
[5] Krougly, Z., Davison, M. and Aiyar, S. (2017) The Role of High Precision Arithmetic in Calculating Numerical Laplace and Inverse Laplace Transforms. Applied Mathematics, 8, 562-589.
https://doi.org/10.4236/am.2017.84045
[6] Krougly, Z. (2021) Accuracy and Precision Requirement in Probability Models. Reliability: Theory & Applications, 16, 133-151.
[7] Industrial Software Solutions for Environmental and Process Monitoring.
https://limesoft.ca/
[8] US EPA 40CFR 60 PS 16 (2017) Performance Specification 16—Specifications and Test Procedures for Predictive Emission Monitoring Systems in Stationary Sources,
https://www.epa.gov/sites/default/files/2017-08/documents/performance_specification_16.pdf
[9] Flowers, B.H. (2000) An Introduction to Numerical Methods in C++. Oxford University Press, Oxford.
[10] Krougly, Z.L., Jeffrey, D.J. and Tsarapkina, D. (2014) Software Implementation of Numerical Algorithms in Arbitrary Precision. 2013 15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, 23-26 September 2013, 131-137.
https://doi.org/10.1109/SYNASC.2013.25
[11] High-Precision Software Directory.
https://www.davidhbailey.com/dhbsoftware/

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.