Twin Hyper-Ellipsoidal Support Vector Machine for Binary Classification



I. INTRODUCTION
Support vector machines (SVMs), introduced by Vapnik and colleagues [1], [2], have become a popular method for classification and regression. The use of kernel techniques in SVM is one of the reasons why several extensions of this popular method have been proposed and applied in many fields [3]-[5].
Recently, many extensions of SVM have been presented to increase the generalization performance and improve the learning speed [6]-[8]. One of these methods, the twin support vector machine (TWSVM), was proposed by Jayadeva [9] for binary classification. TWSVM aims to generate two nonparallel hyperplanes for two classes, each of which is as close to the points of the corresponding class as possible and at a distance of at least one from the other. Each hyperplane is derived by solving an SVM-type quadratic programming problem (QPP). TWSVM classifies a new test point according to its distance from each of the two nonparallel hyperplanes, assigning it to the class of the nearest hyperplane. TWSVM has recently received more attention due to its lower computational complexity, mainly because each of its QPPs is only half as large as the full-sized QPP in the classical SVM. This is why many extensions of TWSVM have been proposed, such as recursive projection twin SVM [10], non-parallel plane proximal classifier [11], least squares twin SVM [12], least squares twin support vector hypersphere [13], sparse twin SVM [14], twin SVM [15], twin parametric margin SVM [16], twin support vector regression (SVR) [17], twin-parametric insensitive SVR [18], etc.
TWSVM, despite its faster learning speed and higher generalization performance than the classical SVM, cannot effectively depict the characteristics of two classes [19]. In particular, for the classification of two classes from different Gaussian distributions, two nonparallel hyperplanes are not suitable. Therefore, in [19], a twin-hypersphere support vector machine (THSVM) for binary classification of data was introduced. THSVM determines two hyperspheres, with each one covering as many data points in one class as possible while keeping far away from the opposite class. To find these hyperspheres, THSVM solves two smaller QPPs instead of one big QPP as in the classical SVM. THSVM benefits from higher learning speed than TWSVM, because it avoids matrix inversion. It uses two hyperspheres instead of two nonparallel hyperplanes to show the characteristics of two classes, which is more practical for datasets generated by Gaussian distributions. So THSVM is also superior to TWSVM in terms of its generalization performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Geng-Ming Jiang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
As mentioned earlier, one reason for the success of SVM and its extensions, such as TWSVM and THSVM, is the use of the kernel technique. Since the kernels used in SVM, TWSVM, and THSVM are based on Euclidean distance, these methods assume that the data points are distributed within a hyper-spherical region, whereas the data points of the two classes are often distributed within two different hyper-ellipsoidal regions. Therefore, the two hyperspheres of THSVM cannot effectively extract the information in the two classes. By the structural nature of a hypersphere, THSVM assumes that the data points in one class spread in all directions on the same scale. In other words, using Euclidean distance in the two hyperspheres of THSVM ignores the covariance matrices of the two classes of data.
Mahalanobis distance not only preserves the correlations between data points but is also scale-invariant, and therefore it is a better choice for working with hyper-ellipsoidal regions [20]. Mahalanobis distance generalizes Euclidean distance: when the covariance matrix is the identity matrix, Mahalanobis distance reduces to Euclidean distance.
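To make the contrast between the two distances concrete, the following minimal Python sketch computes squared Mahalanobis and Euclidean distances for a single illustrative covariance matrix. The function names and the numerical values are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def mahalanobis_sq(x, center, cov):
    """Squared Mahalanobis distance (x - c)^T cov^{-1} (x - c)."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    # Solve cov * z = diff rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))

def euclidean_sq(x, center):
    """Squared Euclidean distance between x and the center."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(diff @ diff)

# Illustrative (assumed) class statistics: wide along axis 0, narrow along axis 1.
center = np.zeros(2)
cov = np.diag([4.0, 0.5])

x_along = np.array([2.0, 0.0])   # lies along the wide direction
x_across = np.array([0.0, 2.0])  # lies along the narrow direction

d_along = mahalanobis_sq(x_along, center, cov)    # 4 / 4
d_across = mahalanobis_sq(x_across, center, cov)  # 4 / 0.5
d_identity = mahalanobis_sq(x_along, center, np.eye(2))
```

Both points are at the same Euclidean distance from the center, yet the Mahalanobis distance treats the point in the narrow direction as much farther away. With the identity covariance, the two distances coincide.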
In this paper, we propose a twin hyper-ellipsoidal support vector machine (TESVM) for the binary classification of data. First of all, a pair of Mahalanobis distance-based kernels is introduced. Then, according to these kernels, which are obtained by covariance matrices of two classes of data, TESVM generates two hyper-ellipsoidals via two smaller QPPs, in such a way that each hyper-ellipsoidal covers as many data points in one class as possible and keeps as far away from the other class as possible. Due to the use of Mahalanobis distance instead of Euclidean distance, TESVM is a more general case of THSVM. If the covariance matrices of the two classes reduce to identity matrices, TESVM performs similarly to THSVM.
The main idea of this paper is to use Mahalanobis distance-based kernels for two classes of data in the THSVM algorithm, which leads to improved generalization performance. TESVM, besides inheriting the merits of THSVM and TWSVM, such as fast learning speed, effectively incorporates the covariance matrix information of the two classes into the prediction phase.
Computational comparisons of SVM, TWSVM, THSVM, and TESVM in terms of generalization performance and learning CPU time on several benchmark, synthetic, and image datasets indicate that TESVM not only obtains fast learning speed but also demonstrates comparable generalization performance. This paper is organized as follows: The next section presents a brief review of two related algorithms and their formulas. In Section III our proposed TESVM algorithm is explained in detail, and its computational complexity and connections with other algorithms are discussed. The experimental results on the toy, benchmark, and image datasets are shown in Section IV, together with the analysis experiments. Finally, the last section concludes this paper.

II. BACKGROUND
Suppose the training points for two classes are as follows: where $N_i$ is the number of $n$-dimensional samples in class $i$, such that the matrix $A$ holds the $N_1 \times n$ samples of class +1 (with $A_i$ as the $i$th sample) and the matrix $B$ holds the $N_2 \times n$ samples of class −1, where $N_1 + N_2 = N$.

A. TWIN SUPPORT VECTOR MACHINE
As an extension of the classical SVM that improves its learning speed and generalization performance, Jayadeva [9] proposed the twin support vector machine (TWSVM). The significant difference between SVM and TWSVM is that TWSVM solves classification problems by generating two nonparallel hyperplanes, each of which passes through as many samples in one class as possible and stays at a distance of at least one from the other class. Each TWSVM hyperplane is constructed by solving a small, half-sized QPP, compared with a single large QPP in SVM. This leads to an approximately fourfold increase in learning speed. For the linear case, TWSVM solves the following two QPPs: where $c_1 > 0$ and $c_2 > 0$ represent the pre-specified penalty factors, and $\xi_i$ and $\xi_j$ are the slack variables. A test point X is assigned to a class according to its distance from the two nonparallel hyperplanes. More details and the nonlinear form of TWSVM can be found in [9].
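The nearest-hyperplane rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the hyperplane parameters `w1, b1, w2, b2` are hypothetical values rather than solutions of the QPPs, and the normalized perpendicular distance is an assumed choice of distance.

```python
import numpy as np

def twsvm_predict(X, w1, b1, w2, b2):
    """Assign each row of X to the class of its nearest hyperplane.

    (w1, b1) is the hyperplane fitted to class +1 and (w2, b2) the one
    fitted to class -1. Perpendicular distances are used (an assumption).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)

# Hypothetical hyperplanes: x0 = 3 for class +1 and x1 = 0 for class -1.
w1, b1 = np.array([1.0, 0.0]), -3.0
w2, b2 = np.array([0.0, 1.0]), 0.0

points = np.array([[3.0, 5.0],   # lies on the first hyperplane
                   [0.0, 0.1]])  # lies close to the second hyperplane
labels = twsvm_predict(points, w1, b1, w2, b2)
```

In practice the two hyperplanes would come from solving the two QPPs in [9]; only the final assignment step is shown here.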

B. TWIN-HYPERSPHERE SUPPORT VECTOR MACHINE
As mentioned in [9], the experimental results show that the TWSVM algorithm not only achieves faster learning speed but also obtains better generalization performance than the classical SVM. However, some shortcomings remain in TWSVM. First, its QPPs require the inversion of matrices of size $(m + 1) \times (m + 1)$ or $(l + 1) \times (l + 1)$, where $m$ denotes the dimension of the training points and $l$ the number of training points, which increases the central processing unit (CPU) learning time. Second, the algorithm cannot effectively depict the properties of the data in the two classes, since each hyperplane of TWSVM tries to be close to as many samples in the corresponding class as possible while staying at a distance of at least one from the other class. In particular, when the two classes come from two different Gaussian distributions, two nonparallel hyperplanes are not suitable. To overcome these shortcomings, the twin-hypersphere support vector machine (THSVM) was proposed in [19]. THSVM aims at generating two hyperspheres, each of which covers as many samples in one class as possible and is as far away from the other class as possible. In the learning phase of the THSVM classifier, the following pair of QPPs must be solved: where $c_1 > 0$ and $c_2 > 0$ represent the pre-specified penalty factors, and $\xi_i$ and $\xi_j$ are the slack variables. After optimizing Equation (4) with suitable parameters, the center of the positive hypersphere is obtained as follows: and its squared radius $R^{2}_{(1)}$ is obtained as: where the index set $I_R^{(1)}$ is calculated as: Similar equations for the negative hypersphere are obtained as in [19]. Once the two hyperspheres are calculated, a new test point is assigned to the positive or negative class according to the following minimization:

III. TWIN HYPER-ELLIPSOIDAL SUPPORT VECTOR MACHINE (TESVM)
As previously mentioned, a pair of hyperspheres can better depict the data characteristics of classes than two nonparallel hyperplanes, especially when the two classes are of different Gaussian distributions [19]. However, the THSVM is blind to the orientation information of samples in two classes. Note that for many real-world classification problems, orientation information in two classes is often different, which makes their covariance matrices different.
To exploit the covariance matrices of the two classes, we introduce Mahalanobis distance-based kernels to THSVM and propose a novel THSVM-based classifier, which we call the twin hyper-ellipsoidal support vector machine (TESVM) classifier.

A. TESVM CLASSIFIER
The TESVM classifier uses two SVM-type QPPs in its optimization problems, similar to THSVM and TWSVM. However, in contrast to the Euclidean distance-based kernels used in the two QPPs of THSVM and TWSVM, our proposed method employs Mahalanobis distance-based kernels. In brief, this model finds a pair of hyper-ellipsoidals such that each one not only covers as many samples in one class as possible and keeps as far away from the other class as possible, but also captures the orientation information of the corresponding class samples. A new sample is assigned to the positive or negative class according to which hyper-ellipsoidal it lies closest to. The aim is to fit one hyper-ellipsoidal for each class in the feature space with minimum effective radius $R$ ($R > 0$) that covers a majority of the samples and is far away from the other class.
Fact 1: The two TESVM hyper-ellipsoidals with Mahalanobis distance-based kernels in the feature space are obtained by solving the following pair of optimization problems: where the penalty factors $c_1, c_2 > 0$ and $v_1, v_2 > 0$ are pre-specified by the user, and $c_+$ and $R_+$ denote the center and radius of the corresponding hyper-ellipsoidal, respectively (proof in Appendix).
We now discuss the optimization problems, Equations (11) and (12), more precisely. The first term in the objective function of each problem minimizes the square of the effective radius of the corresponding hyper-ellipsoidal. This leads to the construction of a hyper-ellipsoidal that is as compact as possible. The second term in the objective functions maximizes the sum of squared Mahalanobis distances from the center of the corresponding hyper-ellipsoidal to the points of the opposite class, which causes the center of this hyper-ellipsoidal to be as far from the samples of the opposite class as possible, taking into account the orientation information of the corresponding class, embedded in the covariance matrices $\Sigma_+$ and $\Sigma_-$ of Equations (11) and (12), respectively. The constraints make the corresponding hyper-ellipsoidal cover the samples of the corresponding class with regard to the orientation information of its samples. Moreover, this information can change the shape of the corresponding hyper-ellipsoidal to better cover its class samples and be farther from the opposite class samples.
Besides, the slack variables $\xi_i$, $i \in I_+$, and $\xi_j$, $j \in I_-$, are added for error measurement. As the last term in the objective functions of Equations (11) and (12), the sum of the error variables is minimized, which reduces the misclassification caused by points belonging to the opposite class. We now solve the optimization problems in Equations (11) and (12).
Fact 2: The dual QPP form of Equation (11) is as follows: (proof in Appendix.)

Fact 3:
The simpler dual form of Equation (12) becomes: where $\beta_j$, $j \in I_-$, are the Lagrangian multipliers, and the center $c_-$ is computed as: also, we have: Once the positive and negative centers $c_\pm$ and their squared radii $R_\pm^2$ are obtained, we can acquire the following two hyper-ellipsoidals for the positive and negative classes, respectively: Based on the distance of a test point $x$ from the two hyper-ellipsoidals (Equation (12)), we can determine its class label, such that the test point belongs to the class whose hyper-ellipsoidal is located closer to $x$, i.e.:
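The closest-hyper-ellipsoidal rule can be illustrated in the input space with the short Python sketch below. It rests on several assumptions that are not taken from the paper: the centers, covariance matrices, and radii are hypothetical values rather than solutions of the dual QPPs, the score is the squared Mahalanobis distance divided by the squared radius (mirroring a THSVM-style rule), and `sigma` is a small assumed regularizer.

```python
import numpy as np

def ellipsoid_score(x, center, cov, radius_sq, sigma=1e-6):
    """Squared Mahalanobis distance to the center, scaled by the squared radius."""
    diff = np.asarray(x, dtype=float) - center
    reg_cov = cov + sigma * np.eye(len(center))  # guard against ill-conditioning
    return float(diff @ np.linalg.solve(reg_cov, diff)) / radius_sq

def predict(x, pos, neg):
    """Return +1 if x scores closer to the positive hyper-ellipsoid, else -1."""
    return 1 if ellipsoid_score(x, *pos) <= ellipsoid_score(x, *neg) else -1

# Hypothetical (center, covariance, squared radius) triples for each class.
pos = (np.array([3.0, 3.0]), np.diag([4.0, 0.5]), 1.0)
neg = (np.array([0.0, 0.0]), np.diag([0.5, 4.0]), 1.0)

# The same centers with identity covariances behave like two hyperspheres.
pos_sphere = (pos[0], np.eye(2), 1.0)
neg_sphere = (neg[0], np.eye(2), 1.0)

x = np.array([-1.0, 3.0])
label_ellipsoid = predict(x, pos, neg)
label_sphere = predict(x, pos_sphere, neg_sphere)
```

For this test point the two rules disagree: the spherical rule picks the negative class because its center is nearer in Euclidean terms, while the ellipsoidal rule picks the positive class because the point lies along that class's wide direction.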

B. COMPUTATIONAL COMPLEXITY
Our proposed method comprises two hyper-ellipsoidals to model two classes of data, each of which is constructed from the points of only one class. Therefore, each QPP of TESVM is roughly half the size of the QPP in the classical SVM, making TESVM almost four times faster.
In contrast to TWSVM and similar to TMSVM, our proposed algorithm has higher computational complexity. The calculation of the two Mahalanobis distance-based kernels $K_+(\cdot, \cdot)$ and $K_-(\cdot, \cdot)$ is the main cost in TESVM, since it requires matrix inversion. The time complexity of matrix inversion is $O(l^{2.3})$ [21]. However, by caching some necessary matrices, we can avoid the extra computation cost. For instance, caching such matrices avoids the extra computation cost in the test phase but consumes more RAM.
In summary, the computational complexity of TESVM is higher than that of the TWSVM and THSVM algorithms but, as shown in the experiments, the learning speed of TESVM is still faster than that of the classical SVM.

C. CONNECTION TO OTHER METHODS
In this section, we compare our proposed TESVM with other related classification algorithms.

1) CONNECTION TO TWSVM
As we saw earlier, the TESVM algorithm finds a pair of hyper-ellipsoidals to classify data points, while the TWSVM algorithm does so with a pair of hyperplanes. Each of TESVM's hyper-ellipsoidals, following the trend of the data distribution in different directions, tries to cover as many samples in the corresponding class as possible and keeps away from the other class. For TWSVM, each hyperplane passes through as many samples in one class as possible and resides at a distance of at least one from the other class. As another difference, TWSVM requires two matrix inversions of size $(l + 1) \times (l + 1)$ or $(m + 1) \times (m + 1)$ at a cost of $O(l^2)$ in its optimization problems. TESVM needs to calculate $K_+(\cdot, \cdot)$ and $K_-(\cdot, \cdot)$, which requires a matrix inversion [21] and makes the computational complexity of TESVM larger than that of TWSVM.

2) CONNECTION TO THSVM
Similar to THSVM and TWSVM, our proposed algorithm comprises a pair of QPPs. In the constraints of each QPP, only the samples of one class participate. In other words, each QPP is roughly half the size of all samples compared with the full-sized QPP in the classical SVM. THSVM aims to find two hyperspheres such that each one covers as many samples in one class as possible and keeps away from the other class. Note that a hypersphere is a set of points at a constant distance from the center of the hypersphere.
Also, it implicitly assumes that the growth rate of points in different directions is the same. The THSVM models rely on this assumption, which usually does not hold in real-world problems.
On the other hand, TESVM does not make this assumption and can effectively change its shape according to the different growth rates of points in different directions. Therefore, TESVM can capture the trends of points in one class in different directions while trying to cover as many samples in the corresponding class as possible and be as far away from the other class as possible. TESVM exploits this orientation information of samples in one class by employing Mahalanobis distance-based kernels in its optimization problems. Compared with THSVM, our TESVM considers that the samples of two classes are distributed in two different hyper-ellipsoidal regions. This makes TESVM surpass THSVM in terms of classification accuracy and generalization performance, as can be observed in the experiments.
However, the THSVM algorithm needs to solve two SVM-type optimization problems with no matrix inversion in the objective functions of its dual QPPs, while TESVM has to calculate matrix inversions at the cost of $O(l^{2.3})$, leading to higher computational complexity.

IV. EXPERIMENTS
To evaluate our proposed algorithm against THSVM, TWSVM, and SVM, we report results on several benchmark datasets in terms of classification accuracy and CPU learning time. These experiments were run in MATLAB on a system with a 2.26 GHz CPU and 4 GB of RAM. For simplicity, in all algorithms we set $c_1 = c_2 = c$ and $v_1 = v_2 = v$; they are selected from the set $\{2^i \mid i = -9, -8, \ldots, 0, \ldots, 9, 10\}$ using a randomly chosen 30% of the training data as a tuning set. The $v$ value in THSVM is chosen from the set $\{0.1, 0.2, \ldots, 0.9\}$. Once the parameters are determined, the tuning set is returned to the training set to obtain the final classifier.
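The parameter grid and the tuning split described above could be set up as in the following Python sketch. The experiments in the paper use MATLAB; the function names, the random seed, and the rounding of the 30% split here are assumptions made for illustration.

```python
import numpy as np

def parameter_grid():
    """Candidate values: powers of two for c (and v), tenths for v in THSVM."""
    powers_of_two = [2.0 ** i for i in range(-9, 11)]       # 2^-9, ..., 2^10
    thsvm_v = [round(0.1 * k, 1) for k in range(1, 10)]     # 0.1, ..., 0.9
    return powers_of_two, thsvm_v

def split_tuning_set(n_samples, tuning_fraction=0.3, seed=0):
    """Randomly hold out a fraction of the training indices as a tuning set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_tune = int(round(tuning_fraction * n_samples))
    return perm[n_tune:], perm[:n_tune]  # (remaining training, tuning)

powers_of_two, thsvm_v = parameter_grid()
train_idx, tune_idx = split_tuning_set(400)
```

After the best parameters are chosen on `tune_idx`, the final classifier would be trained on all indices, as stated in the text.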

A. TOY DATASET
In this section, to show the performance of our TESVM, we compare it with the other algorithms on the two-Gaussian problem dataset. Similar to [19], this dataset comprises 400 samples, 200 in each of the positive and negative classes, drawn from the Gaussian distribution $N((3, 3)^T, \mathrm{diag}\{4, 0.5\})$ for the positive class (blue samples) and $N((0, 0)^T, \mathrm{diag}\{0.5, 4\})$ for the negative class (red samples). For testing, we generated 2000 samples for each of the positive and negative classes from the same Gaussian distributions. As shown in Figure 1, the positive class (blue) has a roughly horizontal orientation and the negative class has a roughly vertical orientation, such that a structural conflict between the two classes appears.
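A dataset of this kind can be generated with the Python sketch below. It assumes that the diagonal entries are variances (not standard deviations); the function name and the seeds are illustrative choices, so the samples will not match those used in the paper.

```python
import numpy as np

def make_two_gaussians(n_per_class, seed=0):
    """Draw n_per_class points from each of the two Gaussian classes."""
    rng = np.random.default_rng(seed)
    # Positive class: elongated horizontally (variance 4 along axis 0).
    pos = rng.multivariate_normal([3.0, 3.0], np.diag([4.0, 0.5]), size=n_per_class)
    # Negative class: elongated vertically (variance 4 along axis 1).
    neg = rng.multivariate_normal([0.0, 0.0], np.diag([0.5, 4.0]), size=n_per_class)
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

X_train, y_train = make_two_gaussians(200, seed=0)   # 400 training samples
X_test, y_test = make_two_gaussians(2000, seed=1)    # 2000 test samples per class
```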
The results of SVM, TWSVM, THSVM, and TESVM with the linear kernel are shown in Figure 1. As seen in this figure, linear SVM, THSVM, and TESVM obtain better classification accuracy than linear TWSVM. This is because the two nonparallel hyperplanes of TWSVM cannot effectively extract the characteristics of the samples in the two classes. Therefore, the separating hyperplane of TWSVM cannot classify the training samples correctly. For SVM, since its attention is focused on the support vectors, the separating hyperplane lies almost in the middle of the support vectors of the two classes and can classify most of the training samples. On the other hand, one of the assumptions of SVM on data orientation or data shape is $\Sigma_+ = \Sigma_- = \Sigma$, which means that the positive and negative classes have the same covariance matrix. This assumption makes SVM ignore the orientation information of the samples in the two classes and is not appropriate for this example. THSVM uses two hyperspheres to model the two classes, so it can take the magnitude of the data scatter into account but cannot incorporate the orientation information of the two classes into its models. Thus the positive and negative classes are not effectively covered. On the other hand, the two hyper-ellipsoidals of our TESVM, owing to the hyper-ellipsoidal shape, exploit the orientation information of the two classes embedded in the corresponding covariance matrices. Therefore, they can follow the data trend of the two classes better than the other algorithms and cover the corresponding class samples more precisely. Thus TESVM's separating hyperplane can classify the samples more accurately than those of the other algorithms.
The results of TESVM and the other algorithms on the two-Gaussian problem are shown in Table 1. As can be seen, the THSVM and TESVM algorithms achieve better classification accuracy than the other algorithms. Besides, TESVM surpasses THSVM and TWSVM in terms of classification accuracy. This indicates that TESVM can more effectively extract the characteristics of the samples in the two classes. In terms of the learning CPU time, although TESVM spends more time in the learning phase than THSVM and TWSVM, it is still faster than the traditional SVM on the artificial two-Gaussian example.
As another experiment, we consider the checkerboard example. This example contains 16 squares of uniformly distributed samples taken from two classes of data. Similar to [19], in the checkerboard experiment we independently produced 10 groups of datasets, each one containing 1000 training samples (500 for each class) and 1000 testing samples (500 for each class). The classification results of SVM, TWSVM, THSVM, and TESVM on the checkerboard dataset with the Gaussian kernel are shown in Figure 2. Observing the hyperplanes, hyperspheres, and hyper-ellipsoidals of TWSVM, THSVM, and TESVM in Figure 2, we see that, similar to [19], the hyperspheres of THSVM effectively cover the samples of the corresponding classes and obtain a better result than TWSVM and SVM. This is because the information embedded in the samples of the two classes cannot be exploited by SVM and TWSVM. However, in THSVM, the two hyperspheres can only partially depict the characteristics of the two classes, since the orientation information still cannot be captured in the models. By considering the covariance matrices, we can effectively introduce the orientation information into TESVM's optimization problems. This makes it cover the samples of each class better, leading to a better separating hyperplane compared to the others.
The classification accuracy of SVM, TWSVM, THSVM, and TESVM on the checkerboard dataset is also shown in Table 1. As can be seen, TESVM's classification accuracy is much better than that of the other algorithms, and THSVM achieves higher classification accuracy than TWSVM and SVM. For the CPU learning time, TESVM needs more time than THSVM and TWSVM, but it is faster than the classical SVM. To further illustrate the difference between TESVM and THSVM, we show two-dimensional scatter plots for the 1000 samples (500 for each class) of this checkerboard dataset obtained by THSVM and TESVM in Figure 3.

In Figure 3-a, the distances of each test point $x_i$ from the positive and negative hyperspheres of THSVM are calculated. In Figure 3-b, the distances of $x_i$ from the two hyper-ellipsoidals of TESVM are obtained for the positive and negative hyper-ellipsoidals, respectively. If $x_i$ is closer to the positive one, it belongs to the positive class; otherwise, it belongs to the negative class. The two-dimensional projection plots represent the ability of each algorithm to separate the two classes. The dots marked in red denote the negative class and those in blue denote the positive class. As can be seen in Figure 3-a, each hyperplane of TWSVM tries to pass through as many points in one class as possible. So the distance of the points of each class from the corresponding hyperplane will be near zero, which causes an over-fitting problem. On the other hand, the distance between this hyperplane and the points of the opposite class also approaches zero, which reduces the generalization ability.
In THSVM, as observed in Figure 3-b, the points of each class are almost entirely covered by a hypersphere, such that their distances to the corresponding hypersphere are mostly less than one. This indicates that THSVM often succeeds in extracting the structure of each class and over-fits less than TWSVM. However, as can be seen from Figure 3-b, the distance of points in one class to the opposite hypersphere is still small, which keeps THSVM's generalization ability low.
On the other hand, each hyper-ellipsoidal of the TESVM algorithm not only tries to cover as many points in one class as possible and stay far away from the points of the opposite class, but can also effectively extract the characteristics and trends of the data points in one class. As seen in Figure 3-c, the distances of data points in one class from the corresponding hyper-ellipsoidal are mostly less than one. Therefore, TESVM, similar to THSVM, can successfully exploit the structure of each class, leading to decreased over-fitting. In addition, the distance of data points in one class from the opposite hyper-ellipsoidal greatly increases, improving the separability of the algorithm and achieving higher generalization performance compared with THSVM. Thus the TESVM algorithm can not only better extract the characteristics of each class, covering it more effectively, but also better discriminate between the two classes compared with THSVM.

B. BENCHMARK DATASET
We compared TESVM with the other related algorithms in terms of classification accuracy and CPU learning time on the common benchmark classification databases. A description of each database is shown in Table 2.
For a graphical comparison of TESVM with the other algorithms, the two-dimensional banana dataset was used. As the results of the four algorithms on the banana dataset in Figure 4 show, the separating hyperplane of TESVM is better than that of the others. This indicates that TESVM has better performance than the other algorithms. This is because each hypersphere in THSVM and each hyper-ellipsoidal in TESVM tries to cover as many samples in one class as possible and to be as far away from the other class as possible, while the hyperplanes in TWSVM try to pass through as many samples in one class as possible and keep away from the other class.
As shown in Figure 4, THSVM and TESVM can better extract the structure of two classes than TWSVM. On the other hand, for a class, since THSVM assumes that the growth rate of its samples in different directions is the same, it uses a hypersphere to cover its data points. The two hyper-ellipsoidals of TESVM also inherit the characteristics of the THSVM's hyperspheres and try to cover all samples of the corresponding classes.
However, each hyper-ellipsoidal considers the different growth of samples in one class in different directions by employing covariance matrices. So TESVM can better model each class and the separating hyperplane of TESVM is more accurate than that of THSVM.
The results of 10 independent runs of the nonlinear TESVM, THSVM, TWSVM, and SVM algorithms on the benchmark datasets, compared in terms of classification accuracy and CPU learning time, are given in Table 3. According to the results in this table, the accuracy of TESVM is better than that of the others; consequently, it has higher generalization performance. In addition to the classification accuracy, the CPU learning times are also given for each algorithm on every dataset. Note that the relevant matrices are stored during the learning process of these algorithms to reduce the CPU learning time, so no time is spent recomputing duplicate matrices. As can be seen, the TESVM algorithm needs more time in its learning process than THSVM and TWSVM. This is because we need to calculate the covariance matrices of the two classes to extract the orientation information of their samples. This information indicates the growth rate of samples in different directions. Nevertheless, the learning speed of our proposed algorithm is still faster than that of the classical SVM.

C. IMAGE RECOGNITION
In this section, for further comparison of the algorithms, we apply them to typical image recognition problems such as object recognition (COIL-20) [23] and handwriting recognition (USPS) [24]. COIL-20 contains 20 objects. Images of each object were taken 5° apart as the object was rotated on a turntable, and there are 72 images of each object. The size of each image is 32 × 32 pixels, with 256 gray levels per pixel. Thus, each image is represented by a 1024-dimensional vector.
The USPS database contains gray-scale handwritten images of the digits 0 to 9, with 1100 images for each digit in 256 gray levels. For this dataset, five pairs of digits were selected for odd vs. even digit classification. We randomly partitioned these images into two groups of the same size and repeated this process 10 times. One group was used for training and the other for testing. Also, we used the Gaussian kernel in the learning phase of all algorithms. The experimental results of the algorithms on the COIL-20 and USPS datasets are listed in Table 4. As shown in Table 4, in almost all cases, our TESVM algorithm achieved better classification accuracy than the others.

D. ANALYSIS EXPERIMENTS

1) AUC TEST
To evaluate the performance of the classifiers in this paper, we used the area under the receiver operating characteristic curve (AUC) as a criterion. AUC is defined as follows: where TP stands for true positive, FN for false negative, FP for false positive, and TN for true negative. If the output of an algorithm is no better than random guessing, the AUC value becomes 0.5, and for the best possible algorithm output the AUC value becomes 1. Similar to [25], all AUC values are medians of five twofold cross-validations. According to [26], in this paper, we performed five twofold cross-validations (5 × 2 fcv) as a simple method for model selection. In this method, all data are first divided into two equal parts, one used for training the algorithm and the other for evaluating it, and the process is repeated five times. The median AUC results over 5 × 2 fcv are shown in Table 5. Since all median AUC values are larger than 0.5, the classifier predictions are better than random guessing.
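As an illustration of an AUC computed from the four confusion-matrix counts, the Python sketch below uses the common single-threshold form, (1 + TPR − FPR) / 2. This particular formula is an assumption, since the definition used in the paper is not reproduced in the text above; it does agree with the two limiting cases mentioned there (0.5 for random guessing and 1 for a perfect classifier).

```python
def single_threshold_auc(tp, fn, fp, tn):
    """AUC of a hard-label classifier from confusion-matrix counts.

    Uses the assumed form (1 + TPR - FPR) / 2, where TPR = TP / (TP + FN)
    and FPR = FP / (FP + TN).
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return (1.0 + tpr - fpr) / 2.0

auc_perfect = single_threshold_auc(tp=50, fn=0, fp=0, tn=50)
auc_random = single_threshold_auc(tp=25, fn=25, fp=25, tn=25)
auc_example = single_threshold_auc(tp=40, fn=10, fp=5, tn=45)
```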
As can be observed, TESVM is better than the others in terms of AUC, and THSVM has a higher value than TWSVM and SVM. As is clear from Table 6, since the interquartile range (IQR) of TESVM is smaller than that of the others, its variation is lower and it is more stable than the other algorithms.
For better statistical comparison between the algorithms, as proposed in [26], [27], we used the Friedman test with corresponding post hoc tests. For this study, we used the average (mean) ranks of the four algorithms on AUC, shown in Table 7. Under the null hypothesis that the algorithms are all the same, the Friedman statistic is calculated as follows [10]:
$$\chi_F^2 = \frac{12N}{K(K+1)} \left[ \sum_{j} R_j^2 - \frac{K(K+1)^2}{4} \right] \qquad (20)$$
where $N$ represents the number of folds, $K$ is the number of algorithms, and $R_j$ is the average rank of the $j$th algorithm (of $K$ algorithms) over the $N$ folds, $R_j = \frac{1}{N} \sum_{i} r_i^j$, where $r_i^j$ is the rank, based on AUC, of the $j$th algorithm on the $i$th fold. Based on $\chi_F^2$, a better statistic can be derived. We can calculate the F statistic as follows:
$$F_F = \frac{(N-1)\chi_F^2}{N(K-1) - \chi_F^2} \qquad (21)$$
The F statistic is a random variable following the F distribution with $K - 1$ and $(K - 1)(N - 1)$ degrees of freedom. According to Equations (20) and (21), we have $\chi_F^2 = 25.9350$ and $F_F = 57.4207$, where $F_F$ follows the F distribution with (3, 27) degrees of freedom. For the significance level $\alpha = 0.05$, the critical value $F(3, 27)$ is 2.96, for $\alpha = 0.025$ it is 3.64, and for $\alpha = 0.01$ it is 4.60. Since the value of $F_F$ is much larger than the critical value, there are significant differences between the four algorithms. Recall that according to Table 7, the average rank of the TESVM algorithm is much lower than that of the other algorithms. This indicates that our TESVM algorithm performs better than the other three algorithms.
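The reported F statistic can be checked numerically with the short Python sketch below. It assumes the standard Friedman and Iman-Davenport formulas with N = 10 folds and K = 4 algorithms, which is consistent with the (3, 27) degrees of freedom stated above; the function names are illustrative.

```python
def friedman_chi2(avg_ranks, n_folds):
    """Friedman statistic from the average ranks of K algorithms over N folds."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_folds / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)

def iman_davenport_f(chi2, n_folds, k):
    """F statistic with (k - 1) and (k - 1)(n_folds - 1) degrees of freedom."""
    return (n_folds - 1) * chi2 / (n_folds * (k - 1) - chi2)

# Value of chi^2_F reported in the text, with N = 10 and K = 4.
f_stat = iman_davenport_f(25.9350, n_folds=10, k=4)

# Sanity checks on the Friedman statistic with made-up average ranks.
chi2_tied = friedman_chi2([2.5, 2.5, 2.5, 2.5], n_folds=10)     # no differences
chi2_extreme = friedman_chi2([1.0, 2.0, 3.0, 4.0], n_folds=10)  # fully consistent ranking
```

Plugging the reported $\chi_F^2$ into the second function reproduces the reported $F_F$ to the stated precision.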

V. CONCLUSIONS
In this paper, an improved THSVM algorithm for binary classification of data is presented. In our algorithm, Mahalanobis distance-based kernels are constructed from the covariance matrices of the two classes of data, and TESVM then uses these kernels to find two hyper-ellipsoidals, such that each one covers as many data points in one class as possible and stays as far away from the other class as possible. This improvement allows TESVM to take advantage of the orientation information of the two classes embedded in their covariance matrices. Note that for many real-world problems, two classes often have different covariance matrices.
The experiments on benchmark, synthetic, and image datasets in Section IV indicate that TESVM has better generalization performance than the other algorithms and faster learning speed than the classical SVM. Finally, increasing the learning speed of TESVM can be investigated in future work.
Further research could look into comparing the developed model with classification methods such as Bayesian networks and random forests. In addition, another improved model could be developed for high-dimensional datasets where the number of variables is greater than the number of observations.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

PROOF OF FACT 1
First, we have to calculate the covariance matrices, so according to [24] we can use the following equations: where $l^{(1)}$ and $l^{(2)}$ denote the number of samples in the positive and negative classes, respectively. $I$ and $e$ are a matrix and a vector, respectively, with appropriate dimensions. In addition, $\phi(X^{(1)}) = [\phi(x_k) \mid k \in I^{(1)},\ k = 1, \ldots, l^{(1)}]$ and similarly, $\phi(X^{(2)}) = [\phi(x_k) \mid k \in I^{(2)},\ k = 1, \ldots, l^{(2)}]$. To calculate the Mahalanobis distance-based inner product $\langle \phi(\cdot), \phi(\cdot) \rangle$ (i.e., kernels) in the feature space, we have to calculate the inverses of $\Sigma_\pm$. Since $\Sigma_\pm$ are often ill-conditioned, to improve the robustness, we add a small positive value $\sigma$ to the diagonal elements of $\Sigma_\pm$. Thus, instead of $\Sigma_\pm^{-1}$ we have to compute $(\sigma I + \Sigma_\pm)^{-1}$. By using the Woodbury matrix identity [27]: we set $U = \sigma I$ and $V = \phi(X^{(1)}) J_+$, so we have: where $K(X^{(1)}) = \phi(X^{(1)})^T \phi(X^{(1)}) = k(X^{(1)}, X^{(1)})$. Similarly, we obtain: where $K(X^{(2)}) = \phi(X^{(2)})^T \phi(X^{(2)}) = k(X^{(2)}, X^{(2)})$. So the Mahalanobis distance-based kernels $K_+(x_i, x_j)$ and $K_-(x_i, x_j)$ are obtained as follows:
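The role of the Woodbury identity in this step can be checked numerically. The Python sketch below assumes the special case $(\sigma I + V V^T)^{-1} = \frac{1}{\sigma}\left(I - V (\sigma I + V^T V)^{-1} V^T\right)$, with a random matrix `V` standing in for the centered feature matrix; the sizes and the value of `sigma` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, sigma = 50, 8, 0.1          # feature dimension, sample count, regularizer
V = rng.standard_normal((d, l))   # stand-in for the centered feature matrix

# Direct route: invert the d x d regularized matrix.
direct = np.linalg.inv(sigma * np.eye(d) + V @ V.T)

# Woodbury route: only an l x l matrix is inverted.
small_inv = np.linalg.inv(sigma * np.eye(l) + V.T @ V)
woodbury = (np.eye(d) - V @ small_inv @ V.T) / sigma

max_err = float(np.max(np.abs(direct - woodbury)))
```

The point of the identity is that the inverse in the (possibly very high-dimensional) feature space is replaced by the inverse of a matrix whose size is the number of samples, and that matrix depends on the data only through inner products, that is, through kernel evaluations.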

PROOF OF FACT 2
The Lagrangian function of Equation (11) is: $L(c_+, R_+, \xi, \alpha, r, s)$ where $s \geq 0$, $\alpha_i \geq 0$, $r_i \geq 0$, $i \in I_+$ denote the Lagrangian multipliers. Differentiating the Lagrangian function (Equation (27)) with respect to $c_+$, $R_+^2$, and $\xi_i$, $i \in I_+$, yields the following necessary and sufficient Karush-Kuhn-Tucker (KKT) optimality conditions: By selecting suitable parameters $v_1$ and $c_1$ in optimizing Equation (11), the condition will hold, and according to the KKT conditions of Equations (28) and (34), we have: and which also derives an implicit value for $v_1$ so that $0 < v_1 < 1$. By substituting Equations (28), (29), and (36) into Equation (27), the dual QPP form of Equation (11) is as follows: To solve Equation (37) we have to find the term Also, the term $\sum_{j \in I_-} (\phi(x_j) - c_+)^T \Sigma_+^{-1} (\phi(x_j) - c_+)$ can be computed similarly. Finally, by discarding the constant terms from the optimization problem Equation (29), it becomes simpler: Now the squared radius $R_+^2$ is obtained by the following formula: