Fast Minimum Error Entropy for Linear Regression

Abstract: The minimum error entropy (MEE) criterion finds extensive utility across diverse applications, particularly in contexts characterized by non-Gaussian noise. However, its computational demands are notable, primarily attributable to the double summation involved in calculating the probability density function (PDF) of the error. To address this, our study introduces a novel approach, termed the fast minimum error entropy (FMEE) algorithm, aimed at mitigating computational complexity through polynomial expansions of the error PDF. Initially, the PDF approximation of a random variable is derived via the Gram-Charlier expansion. Subsequently, the entropy of the random variable is derived and simplified. Following this, the error entropy of the linear regression model is delineated and expressed as a function of the regression coefficient vector. Lastly, leveraging the gradient descent algorithm, we compute the regression coefficient vector corresponding to the minimum error entropy. Theoretical scrutiny reveals that the time complexity of FMEE is O(n), in stark contrast to the O(n^2) complexity of MEE. Experimentally, our findings underscore the remarkable efficiency gains afforded by FMEE, with time consumption registering less than 1‰ of that observed with MEE. Encouragingly, this efficiency leap is achieved without compromising accuracy, as evidenced by the negligible differences observed between the accuracies of FMEE and MEE. Furthermore, comprehensive regression experiments on real-world electric datasets in northwest China demonstrate that our FMEE outperforms baseline methods by a clear margin.


Introduction
Entropy serves as a pivotal metric for quantifying the inherent uncertainty within a system, finding ubiquitous application across a multitude of domains including dimension reduction [1,2], parameter estimation [3,4], identification [5,6], feedback control [7,8], and anomaly detection [9–11]. The minimum error entropy (MEE) criterion stands out as a notable framework aimed at minimizing the entropy associated with estimation errors, thereby mitigating uncertainty within the estimation model. Particularly pertinent to linear regression estimation problems, MEE diverges from conventional least squares methodologies by not only considering the variance of prediction errors, but also incorporating higher-order cumulants, rendering it adept at handling non-Gaussian noise distributions [12,13]. Given the prevalence of non-Gaussian noise in real-world scenarios, the efficacy of MEE has been duly validated across various applications, encompassing adaptive filtering [14–16], face recognition [17,18], sparse system identification [19–21], stochastic control systems [22,23], and visible light communication [24].
Furthermore, the convergence properties of MEE have been meticulously examined in the prior literature [25–27]. In-depth analyses have delved into the convergence dynamics of fixed-point MEE methodologies [28], while the interplay between MEE and minimum mean-squared error (MMSE) estimation [29], as well as its relationship with maximum correntropy [30], has been extensively explored. Noteworthy extensions to the conventional MEE framework include kernel MEE variants [31], regularization strategies for MEE implementations [32], and semi-supervised and distributed adaptations [33]. Moreover, within the errors-in-variables (EIV) modeling realm, robustness analyses of MEE methodologies have been expounded upon [34], alongside novel methodologies such as the minimum total error entropy approach [35].
The computational demands of the MEE criterion stem from its time complexity, which scales quadratically as O(n^2). Central to MEE's computational load is the double summation required when calculating the gradient of the error probability density function (PDF), particularly when employing Parzen windowing [36] and certain types of kernel functions. Moreover, the selection of an inappropriate kernel function or kernel parameters can detrimentally impact the accuracy of the resulting PDF. To address this, efforts have been made to optimize the gradient descent MEE approach, as evidenced by the normalization of the step size in [37], leveraging the power of the input entropy. Additionally, innovative methodologies such as the quantization operator proposed in [38] have been instrumental in mitigating computational complexity. By mapping the error onto a series of real-valued code words, this approach effectively transforms the double summation into a single summation, consequently reducing the algorithm's computational overhead.
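The quantization idea of [38] can be illustrated with a short sketch. This is not the exact estimator from [38], whose codebook is built adaptively; here a fixed uniform grid stands in for the codebook, purely to show how weighting each codeword by its count turns the double summation over n² pairs into a sum over n samples and M ≪ n codewords.

```python
import numpy as np

def gauss(d, h=1.0):
    """Gaussian kernel of bandwidth h."""
    return np.exp(-d**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)

rng = np.random.default_rng(5)
e = rng.normal(size=2000)

# Exact information potential: O(n^2) double sum over all error pairs
V_exact = gauss(e[:, None] - e[None, :]).mean()

# Quantized variant: snap each error to a grid of M codewords and weight
# each codeword by its count, so the inner sum runs over M codewords
# instead of n samples: O(nM) with M << n.
step = 0.1
codes = np.round(e / step) * step                 # quantized errors
q, counts = np.unique(codes, return_counts=True)  # M codewords + their counts
V_quant = (gauss(e[:, None] - q[None, :]) * counts[None, :]).sum() / len(e)**2

print(V_exact, V_quant)
```

With a grid step much smaller than the kernel bandwidth, the quantized potential tracks the exact one closely while the inner summation shrinks from n terms to a few dozen.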
This paper presents a novel approach, the fast minimum error entropy (FMEE) estimation algorithm, tailored specifically for linear regression tasks and leveraging a polynomial expansion of the error probability density function (PDF). In contrast to the traditional minimum error entropy (MEE) methodology, which employs Parzen windowing for error PDF estimation, FMEE adopts the Gram-Charlier expansion to approximate the error PDF, resulting in a notable reduction in time complexity from O(n^2) to O(n). Notably, FMEE obviates the need for kernel functions or their associated parameters. The proposed algorithm entails several key steps: first, the PDF approximation of a random variable is derived via the Gram-Charlier expansion, and the entropy of the random variable is simplified using the orthogonality of the Hermite polynomials. Subsequently, the error of the linear regression model is ascertained, with the error entropy expressed as a function of the regression coefficient vector. Ultimately, gradient descent is employed to deduce the regression coefficient vector corresponding to the minimum error entropy. Experimental validation underscores the efficacy of FMEE, revealing a remarkable reduction in time consumption, which amounts to less than 1‰ of that of the MEE approach for identical problem settings. Importantly, minimal discrepancies are observed between the accuracy levels of the FMEE and MEE methodologies.
The rest of the paper is organized as follows: Section 2 introduces the related algorithms, FMEE is proposed in detail in Section 3, Section 4 contains the experiments, and the conclusions are in Section 5.

MEE
MEE minimizes the information contained in the error and thus maximizes the information captured by the estimated model. Usually, the second-order Renyi entropy [39] is used to measure the entropy contained in the error of the model:
H(E) = −log ∫ p_E(e)^2 de, (1)

where E is the error random variable of MEE, H(E) is the entropy of E, and p_E(e) denotes the PDF of E. In this paper, E_p(·) denotes the expectation, in order to avoid ambiguity. The PDF of E can be estimated by Parzen windowing:

p̂_E(e) = (1/n) Σ_{i=1}^{n} K_h(e − e_i), (2)

where n is the number of observed data points, K is the kernel function, and h is the bandwidth. Usually, the Gaussian kernel is chosen, K_h(x) = (1/(√(2π) h)) exp(−x^2/(2h^2)). Then, the empirical information of the error, Ĥ, can be calculated as:

Ĥ = −log( (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} K_{√2 h}(e_i − e_j) ). (3)

As the logarithmic function is monotonically increasing, it can be removed without any influence on the minimization. The transformed empirical error information is then obtained as Equation (4):

R = −(1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} K_{√2 h}(e_i − e_j). (4)

For linear regression, y = w^T x + e, the target of the model is to estimate w from the limited samples. As e_i = y_i − w^T x_i, the corresponding empirical error information is calculated as Equation (5):

R(w) = −(1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} K_{√2 h}( (y_i − w^T x_i) − (y_j − w^T x_j) ). (5)

The MEE estimator ŵ can then be obtained by minimizing R(w) with gradient descent. The time complexity of MEE is O(n^2) due to the double summation in Equation (5). As the time consumption of MEE is quadratic in the number of observed data points, the running time of MEE increases dramatically with the amount of observed data.
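For concreteness, the Parzen-window objective of Equation (5) and its O(n^2) gradient can be sketched as follows. The bandwidth, step size, and synthetic data here are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

def mee_gradient(w, X, y, h=1.0):
    """Gradient of R(w) = -(1/n^2) sum_ij G_h(e_i - e_j), e_i = y_i - w^T x_i.

    Building the n x n matrices of pairwise differences is what makes
    plain MEE O(n^2) per iteration."""
    e = y - X @ w
    d = e[:, None] - e[None, :]                                  # e_i - e_j
    G = np.exp(-d**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)    # Gaussian kernel
    # dG/dw = (d/h^2) * G * (x_i - x_j), so dR/dw = -(1/n^2) * sum of that
    coef = G * d / h**2
    grad_V = np.einsum('ij,ijk->k', coef, X[:, None, :] - X[None, :, :])
    return -grad_V / len(e)**2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -1.0, 1.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
for _ in range(200):                 # fixed-step gradient descent on R(w)
    w -= 0.5 * mee_gradient(w, X, y)
```

Every iteration materializes n × n pairwise-difference matrices, which is exactly the quadratic cost that FMEE avoids.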

The Hermite Polynomials
The Hermite polynomials [40] are a classical orthogonal polynomial sequence. They are defined through the derivatives of the standardized Gaussian PDF φ(ξ), as in Equation (6):

d^i φ(ξ)/dξ^i = (−1)^i H_i(ξ) φ(ξ), (6)

where H_i denotes the ith-order Hermite polynomial. The Hermite polynomials form an orthogonal system, as in Equation (7):

∫ H_i(ξ) H_j(ξ) φ(ξ) dξ = i! δ_ij, (7)

where δ_ij is the Kronecker delta. This paper uses the first five Hermite polynomials, shown in Equation (8):

H_0(ξ) = 1, H_1(ξ) = ξ, H_2(ξ) = ξ^2 − 1, H_3(ξ) = ξ^3 − 3ξ, H_4(ξ) = ξ^4 − 6ξ^2 + 3. (8)
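The orthogonality relation in Equation (7) is easy to verify numerically. The sketch below tabulates the Gram matrix of inner products ⟨H_i, H_j⟩ under the Gaussian weight on a fine grid; the expected result is diag(0!, 1!, 2!, 3!, 4!).

```python
import numpy as np

# First five probabilists' Hermite polynomials, as listed in Equation (8)
hermite = [
    lambda t: np.ones_like(t),       # H_0 = 1
    lambda t: t,                     # H_1 = xi
    lambda t: t**2 - 1,              # H_2 = xi^2 - 1
    lambda t: t**3 - 3 * t,          # H_3 = xi^3 - 3 xi
    lambda t: t**4 - 6 * t**2 + 3,   # H_4 = xi^4 - 6 xi^2 + 3
]

# Check Equation (7): integral of H_i(xi) H_j(xi) phi(xi) d(xi) = i! delta_ij,
# with phi the standard Gaussian PDF, via a Riemann sum on a fine grid.
t = np.linspace(-12, 12, 240001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
gram = np.array([[np.sum(Hi(t) * Hj(t) * phi) * dt for Hj in hermite]
                 for Hi in hermite])
print(np.round(gram, 6))   # diagonal should read 1, 1, 2, 6, 24
```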

The Gram-Charlier Expansion
The Gram-Charlier expansion provides an approximation of the PDF of a random variable, with no significant loss of accuracy, as demonstrated in [41]. Under the assumption that the random variable x is near the normal distribution with the same mean and variance as itself, the Gram-Charlier expansion of the PDF of x can be expressed as Equation (9):

p(x) ≈ (φ(ξ)/σ) [ 1 + (κ_3(x)/(3! σ^3)) H_3(ξ) + (κ_4(x)/(4! σ^4)) H_4(ξ) ], with ξ = (x − µ)/σ, (9)

where µ and σ are the mean and standard deviation of x, and ξ is the standardized variable, so that φ(ξ)/σ is the PDF of the Gaussian random variable whose mean and variance are the same as those of x. κ_3(x) and κ_4(x) are the third-order and fourth-order cumulants of x, also referred to as skewness and kurtosis. H_3 and H_4 are the third-order and fourth-order Hermite polynomials.
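As a sanity check, the expansion in Equation (9) can be evaluated from sample cumulants. For an exactly Gaussian sample the cumulant corrections should be negligible and the approximation should collapse to the Gaussian density; the sample size and evaluation grid below are illustrative.

```python
import numpy as np

def gram_charlier_pdf(t_grid, sample):
    """Gram-Charlier approximation of the PDF of `sample` (Equation (9)),
    built from the sample's standardized third and fourth cumulants."""
    mu, sigma = sample.mean(), sample.std()
    xi = (t_grid - mu) / sigma
    phi = np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi)
    z = (sample - mu) / sigma
    k3 = np.mean(z**3)                 # standardized skewness  kappa_3 / sigma^3
    k4 = np.mean(z**4) - 3.0           # standardized excess kurtosis kappa_4 / sigma^4
    H3 = xi**3 - 3 * xi
    H4 = xi**4 - 6 * xi**2 + 3
    return phi / sigma * (1 + k3 / 6 * H3 + k4 / 24 * H4)

rng = np.random.default_rng(1)
sample = rng.normal(size=100000)              # exactly Gaussian: corrections ~ 0
t = np.linspace(-4, 4, 801)
approx = gram_charlier_pdf(t, sample)
true_pdf = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(approx - true_pdf)))      # small deviation
```

Note that computing the two cumulants requires only single passes over the sample, which is the source of FMEE's O(n) cost.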

Methodology
This section derives the relation between the differential entropy and the Gram-Charlier expansion of the PDF of the error for linear regression. Moreover, the assumption is made that the PDF of the regression error is near the Gaussian distribution φ(ξ) that has the same mean and variance as the error.
The differential entropy [42] of a random variable x is given in Equation (10):

H(x) = −∫ p(x) log p(x) dx. (10)

Substituting Equation (9) into Equation (10) yields Equation (11). The resulting integral is rather difficult to compute directly. However, notice that if the PDF of x is near the normal density, as assumed, κ_3(x) and κ_4(x) are very small. Then, the approximation in Equation (12) can be used:

log(1 + ϵ) ≈ ϵ − ϵ^2/2, (12)

where ϵ denotes an infinitesimal quantity. Equation (11) can then be transformed into Equation (13). As the H_i form an orthogonal system, Equation (13) can be simplified into Equation (14):

H(x) ≈ (1/2) log(2π e σ^2) − κ_3^2(x)/(12 σ^6) − κ_4^2(x)/(48 σ^8). (14)

The details of the simplification leading to Equation (14) are as follows. First, it is necessary to show that

∫ ((ξ − µ)/σ)^2 H_3((ξ − µ)/σ) φ((ξ − µ)/σ) dξ/σ = 0 (15)

and

∫ ((ξ − µ)/σ)^2 H_4((ξ − µ)/σ) φ((ξ − µ)/σ) dξ/σ = 0. (16)

For Equation (15), substitute t = (ξ − µ)/σ; the integrand becomes t^2 (t^3 − 3t) φ(t) = (t^5 − 3t^3) φ(t). As exp(−t^2/2) is an even function and t^5 − 3t^3 is an odd function, the integral vanishes and Equation (15) holds. For Equation (16), substitute t = (ξ − µ)/σ again; the integrand becomes t^2 (t^4 − 6t^2 + 3) φ(t) = (t^6 − 6t^4 + 3t^2) φ(t), as written in Equation (19). The first term in Equation (19) gives ∫ t^6 φ(t) dt = 15 (Equation (20)), the second term gives −6 ∫ t^4 φ(t) dt = −18 (Equation (21)), and the last term gives 3 ∫ t^2 φ(t) dt = 3 (Equation (22)). According to Equations (20)–(22), the sum is 15 − 18 + 3 = 0, and Equation (16) holds.
Then, for Equation (13), expanding the product of the Gram-Charlier factors yields Equation (24). Notice that H_0 = 1, so the second term of Equation (24) is 0. As x is assumed to be close to a normal random variable, κ_3(x) and κ_4(x) are very small, so the last term of Equation (24) can be omitted: it involves third-order monomials in these cumulants, which are infinitesimally smaller than the terms containing only second-order monomials. Then, taking Equations (15) and (16) into account in Equation (24), and noticing again that H_0 = 1, Equation (14) holds.
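In standardized form, Equation (14) reads H(x) ≈ (1/2) log(2πeσ²) − k₃²/12 − k₄²/48, with k₃ and k₄ the standardized third and fourth cumulants. A minimal check of this closed form is that a Gaussian sample, whose cumulant corrections vanish, recovers the exact Gaussian entropy (1/2) log(2πeσ²):

```python
import numpy as np

def entropy_gc(sample):
    """O(n) entropy estimate via the Gram-Charlier simplification:
    H(x) ~ 0.5*log(2*pi*e*sigma^2) - k3^2/12 - k4^2/48,
    where k3, k4 are the standardized third/fourth cumulants
    (valid under the paper's near-Gaussian assumption)."""
    mu = sample.mean()
    sigma = sample.std()
    z = (sample - mu) / sigma
    k3 = np.mean(z**3)
    k4 = np.mean(z**4) - 3.0
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2) - k3**2 / 12 - k4**2 / 48

rng = np.random.default_rng(4)
gauss = rng.normal(loc=2.0, scale=1.5, size=200000)
h_est = entropy_gc(gauss)
h_true = 0.5 * np.log(2 * np.pi * np.e * 1.5**2)   # exact Gaussian entropy
print(h_est, h_true)
```

For strongly non-Gaussian distributions the truncated expansion is no longer reliable, which is exactly why the near-Gaussian assumption is stated up front.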
For the linear regression model, e_i = y_i − w^T x_i, and the standard deviation of the error e_w satisfies Equation (26):

σ^2(e_w) = (1/n) Σ_{i=1}^{n} ( y_i − w^T x_i − (ȳ − w^T x̄) )^2, (26)

where ȳ is the mean of y and x̄ denotes the mean of x.
To simplify the calculation of κ_4(e_w), e_w is shifted into a random variable with zero mean, as in Equation (27). Notice that the transformation in Equation (27) changes neither the standard deviation nor the entropy of e_w, since σ^2(e + c) = σ^2(e) and H(e + c) = H(e) when c is a constant. In the rest of the paper, for clarity, σ^2(·) is written as var(·).
Then, for linear regression, the entropy of the error can be expressed as Equation (28):

H(e_w) = (1/2) log(2π e · var(e_w)) − κ_3^2(e_w)/(12 var^3(e_w)) − κ_4^2(e_w)/(48 var^4(e_w)), (28)

where the cumulants κ_3(e_w) and κ_4(e_w) are given in Equations (29) and (30). To minimize H(e_w), the derivative of H(e_w) with respect to w is calculated in Equation (31), whose components are given in Equations (32)–(34). Then, the optimal w for the minimum of H(e_w) can be obtained by the gradient descent iteration scheme in Equation (35):

w ← w − α ∂H(e_w)/∂w, (35)

where α is the step size determined by the Armijo condition [43].
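Putting the pieces together, an FMEE-style estimator can be sketched as follows. The objective is the O(n) entropy surrogate of Equations (26)–(28); for brevity, a central finite-difference gradient with a fixed step replaces the analytic gradient of Equations (31)–(34) and the Armijo line search, so this is a simplified stand-in for the paper's algorithm, not a faithful reimplementation.

```python
import numpy as np

def fmee_objective(w, X, y):
    """O(n) entropy surrogate of the regression error (Equation (28)),
    with the error centered as in Equation (27)."""
    e = y - X @ w
    e = e - e.mean()                   # centering leaves variance/entropy unchanged
    sigma2 = np.mean(e**2)
    z = e / np.sqrt(sigma2)
    k3 = np.mean(z**3)
    k4 = np.mean(z**4) - 3.0
    return 0.5 * np.log(2 * np.pi * np.e * sigma2) - k3**2 / 12 - k4**2 / 48

def fmee_fit(X, y, lr=0.1, iters=500, fd=1e-5):
    """Gradient descent on the O(n) objective, using a central
    finite-difference gradient purely to keep the sketch short."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = np.empty_like(w)
        for k in range(len(w)):
            wp, wm = w.copy(), w.copy()
            wp[k] += fd
            wm[k] -= fd
            grad[k] = (fmee_objective(wp, X, y) - fmee_objective(wm, X, y)) / (2 * fd)
        w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
w_true = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
y = X @ w_true + 0.3 * rng.standard_t(df=5, size=500)   # heavy-tailed noise
w_hat = fmee_fit(X, y)
```

Each objective evaluation touches every sample a constant number of times, so the per-iteration cost grows linearly with n rather than quadratically.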

Experiments
In this section, a comprehensive array of experiments is undertaken, encompassing both numerical simulations and real-world scenarios. The numerical simulations serve to validate the efficacy of the FMEE approach and to scrutinize the time efficiency of both the FMEE and MEE methodologies. Concurrently, practical experiments are conducted that aim to forecast power outages within a city located in northwest China.
Regarding the numerical simulations, we consider a scenario where x ∈ R^10, with the model defined as y = w*^T x + e, where w* = [1, −1, 1, −1, 1, −1, 1, −1, 1, −1]^T and x follows a normal distribution N(0, I_10). Two distinct types of noise are examined in our experiments. First, Gaussian noise e ~ N(0, 1) is employed, while the second type comprises generalized Gaussian noise characterized by a probability density function f(e) ∝ exp(−(α|e|)^0.3). This heavy-tailed distribution aligns with previous works, such as references [27,35]. Although FMEE is derived under the assumption of near-Gaussian noise, it is anticipated that FMEE remains effective in the presence of non-Gaussian, heavy-tailed noise. Here, α denotes a constant adjusted to maintain a variance of 1 for e. For MEE, the step size is set to √0.005π, and the scale parameter of the Gaussian kernel is 10, mirroring the settings adopted in reference [27]. Across the experiments, sample sizes range from 100 to 500, with each combination of sample size and noise type repeated 100 times. The ensuing experimental results are detailed below.
Table 1 provides a comparative analysis of the time consumption of MEE and FMEE in the presence of Gaussian noise. FMEE demonstrates a notable acceleration in computational efficiency compared to MEE. Specifically, while the computational time of MEE exhibits quadratic growth relative to the sample size, FMEE's computational overhead scales linearly with the number of samples. Notably, FMEE achieves remarkable speed, processing 500 objects in approximately 0.1 s, positioning it as a highly promising candidate for time-sensitive applications.

Table 2 compares the mean squared error (MSE) of MEE and FMEE in the context of Gaussian noise. Overall, MEE exhibits superior performance compared to FMEE in terms of MSE. Particularly noteworthy is MEE's substantial advantage over FMEE when the sample size is 100. However, as the sample size increases, the discrepancy in MSE between the two algorithms diminishes; by the time the sample size reaches 500, the difference dwindles to a mere 2%. This observation suggests that while FMEE may lag slightly behind MEE, such disparity remains acceptable in practical applications, given FMEE's significantly reduced computational time.

Table 3 analyzes the iteration counts required for convergence of MEE and FMEE in the context of Gaussian noise. Notably, FMEE achieves convergence with significantly fewer iterations than MEE. Furthermore, the iteration count of MEE gradually decreases as the sample size increases, whereas the iteration count of FMEE remains relatively constant, irrespective of variations in sample size.

Table 4 provides a comparative assessment of the time consumption of MEE and FMEE in the context of non-Gaussian noise. Across all five experimental groups, FMEE consistently demonstrates a remarkable reduction in time consumption, amounting to less than 1‰ of MEE's time expenditure. Moreover, in accordance with theoretical predictions, the computational cost of MEE exhibits quadratic growth relative to the sample size, whereas FMEE's time overhead scales linearly with the sample size.

Based on the findings presented in Tables 1–6, it is evident that FMEE consistently yields outcomes comparable to MEE, albeit with significantly reduced time consumption and iteration counts. Notably, FMEE stands out for its remarkable efficiency, being approximately 1000 times faster than MEE and achieving convergence within approximately 0.1 s for 500 instances. Given its linear time complexity of O(n), FMEE holds considerable promise for deployment in real-time scenarios characterized by non-Gaussian noise.
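The complexity gap reported in Tables 1 and 4 comes down to the cost of one objective evaluation: the Parzen-window cost touches all n² error pairs, while the cumulant-based cost makes a few O(n) passes over the sample. A rough micro-benchmark (bandwidth arbitrary, timings machine-dependent) illustrates this:

```python
import time
import numpy as np

def mee_cost(e, h=1.0):
    """Parzen-window Renyi entropy of the error: O(n^2) double sum."""
    d = e[:, None] - e[None, :]
    return -np.log(np.mean(np.exp(-d**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)))

def fmee_cost(e):
    """Cumulant-based entropy approximation: a few O(n) passes."""
    e = e - e.mean()
    s2 = np.mean(e**2)
    z = e / np.sqrt(s2)
    k3 = np.mean(z**3)
    k4 = np.mean(z**4) - 3.0
    return 0.5 * np.log(2 * np.pi * np.e * s2) - k3**2 / 12 - k4**2 / 48

e = np.random.default_rng(3).normal(size=2000)
t0 = time.perf_counter()
c_mee = mee_cost(e)
t1 = time.perf_counter()
c_fmee = fmee_cost(e)
t2 = time.perf_counter()
print(f"MEE cost {c_mee:.4f} in {t1 - t0:.5f}s; FMEE cost {c_fmee:.4f} in {t2 - t1:.6f}s")
```

Doubling n quadruples the work in `mee_cost` but only doubles it in `fmee_cost`, mirroring the scaling trends in the tables.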
Subsequently, we undertake practical experimentation utilizing transformer characteristic data to forecast distribution network failures. The transformers are categorized into two groups based on their geographical location within distinct levels of urban activity in a specific city in northwest China. Initially, our attention is directed towards a subset of 1265 transformers situated within a smaller, less densely populated area of the aforementioned city. We randomly partition this dataset into two segments: a training set comprising 80% of the data, totaling 1012 instances, and a verification set encompassing the remaining 20%, comprising 253 instances. To be more specific, the transformer dataset comprises various characteristic variables, including the standardized heavy overload duration, maximum active load ratio, average active load ratio, mean three-phase unbalance, standardized heavy three-phase unbalance duration, and so forth.
We conduct a comparative analysis, pitting our proposed FMEE against three established baseline methodologies: logistic regression, neural network, and support vector machine. The dataset comprises the characteristic variables of the 1265 transformers associated with substantial load levels and undergoes 30 rounds of random partitioning. Each randomly divided dataset is then processed by the four algorithms, and their predictive efficacy on the verification set is assessed using the F-measure and error rate metrics.
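The two evaluation metrics are standard; for clarity, the sketch below computes them for binary failure labels, taking the F-measure as the usual harmonic mean of precision and recall on the positive (failure) class.

```python
import numpy as np

def f_measure_and_error_rate(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall on the positive
    class) and overall error rate for binary failure predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    error_rate = np.mean(y_pred != y_true)
    return f1, error_rate

# Toy labels for illustration only (not the transformer dataset)
f1, err = f_measure_and_error_rate([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(f1, err)
```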
Figure 1 depicts the F-measure evaluations stemming from 30 prediction runs conducted with the four algorithms. Notably, each evaluation uses an identical partitioning of the dataset across the four algorithms. A discernible trend emerges wherein all algorithms yield F-measure values surpassing 0.8. Our proposed FMEE exhibits the most promising performance, boasting an average F-measure of 0.904 and thereby outshining the other three baseline methodologies by a significant margin. To facilitate a comprehensive comparison among the four algorithms, Figure 2 depicts the error rate of the fault outage prediction results in the form of line and box charts. Notably, FMEE emerges as the frontrunner, showcasing a substantial advantage over the sophisticated support vector machine in terms of error rate. Specifically, our FMEE achieves an average error rate of 9.1%, outperforming the neural network, logistic regression, and support vector machine with error rates lower by 0.3%, 4.2%, and 4.3%, respectively. These findings underscore the clear superiority of our proposed FMEE methodology.

Furthermore, we extend our experimentation to a more densely populated urban region within the same northwest Chinese city. Figure 3 illustrates the F-measure evaluations derived from 30 independent tests conducted across the four comparative algorithms. On a comprehensive scale, our proposed FMEE attains an F-measure of 0.891, surpassing the baseline neural network, logistic regression, and support vector machine by margins of 0.013, 0.023, and 0.028, respectively.
Finally, we report the error rate of the fault outage prediction results derived from the four comparative algorithms, represented through line and box charts. As shown in Figure 4, our proposed FMEE achieves an average error rate of 11.2%, showcasing a notable superiority over the three baseline methodologies. These findings robustly underscore the efficacy of our algorithm in addressing regression problems.

Figure 1. The boxplot of F-measure among four algorithms for failure outage prediction results.

Figure 2. The boxplot of error rate among four algorithms for failure outage prediction results.

Figure 3. The boxplot of F-measure of four algorithms for failure outage prediction results.

Figure 4. The boxplot of error rate of four algorithms for failure outage prediction results.

Table 1. The running time of MEE and FMEE for Gaussian noise.

Table 2. The mean squared error of MEE and FMEE for Gaussian noise.

Table 3. The number of iterations of MEE and FMEE for Gaussian noise.

Table 4. The running time of MEE and FMEE for non-Gaussian noise.

Table 5 conducts a comparative evaluation of the mean squared error (MSE) of MEE and FMEE in the context of non-Gaussian noise. Generally, the disparity between the MSE values of FMEE and MEE is relatively minor. Notably, FMEE outperforms MEE when the sample size is 100 or 400, while MEE exhibits superior performance for sample sizes of 200, 300, and 500. Furthermore, it is observed that the MSE values for non-Gaussian noise, as depicted in Table 5, span a broader range than those for Gaussian noise, indicating greater variability in the results of both MEE and FMEE.

Table 5. The mean squared error of MEE and FMEE for non-Gaussian noise.

Table 6 compares the number of iterations required for convergence of MEE and FMEE in the context of non-Gaussian noise. FMEE consistently demonstrates faster convergence than MEE. Furthermore, the iteration count of MEE decreases as the sample size increases, whereas the iteration count of FMEE remains relatively stable despite variations in sample size.

Table 6. The number of iterations of MEE and FMEE for non-Gaussian noise.