Combining Linear Classifiers Using Probability-Based Potential Functions

The score function can be used as a measure for evaluating the predicted probabilities of classification models. In multiple classifier systems, one of the problems is that individual base classifiers determine their scoring functions in different ways. To alleviate this limitation, in this article, we propose a novel concept of calculating a scoring function defined by a probability-based potential function. The proposed potential functions take into account the distance of the recognized object from the decision boundary as well as the prior probabilities of the class labels. The proposed score function has the same nature for all linear base classifiers that constitute the multiple classifier model. Additionally, the proposed method is compared with other ensemble algorithms based on homogeneous linear base classifiers. The experiments on seventy datasets demonstrate the effectiveness of our method. To discuss the results of our experiments, we use multiple classification performance measures dedicated to standard and imbalanced datasets. A statistical analysis of the experimental results is also performed.


I. INTRODUCTION
The idea of building an ensemble of classifiers (EoC) is to compose a single strong model from a pool of weak or diverse ones. In general, EoCs improve on the capabilities of individual base models (base classifiers) by building a more stable and accurate complex model [1]. Solving real-world classification problems with EoCs was already discussed in [2], because EoCs increase the performance of individual classification models. Since then, many publications have appeared that indicate practical applications of EoCs. A network intrusion detection approach using an EoC was proposed in [3], whereas [4] presents the usefulness of EoCs in detecting cross-site scripting attacks for web security. EoCs have also been applied in many industrial fields: an optimal stacking ensemble for remaining useful life estimation was proposed in [5], and EoCs have been used for the classification of cutting tools [6] and for the in-line detection of surface defects on glass substrates of thin-film transistor liquid crystal displays [7]. In addition, EoCs are used in other applications such as marine sediment classification [8], land cover type classification [9], and medical diagnostics [10]. EoCs also play an important role in multi-label classification problems.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
In those classification problems, multiple classes may be simultaneously assigned to one object [11]. A possible solution to a problem posed in this way is to decompose the multi-label problem into a series of multi-class classification tasks [12]. Then the results of the created classifiers are combined to form a multi-label solution. This decomposition and aggregation are usually done using EoC approaches [13]-[15].
In general, the procedure for creating an EoC can be divided into three major steps: generation, selection, and fusion [16]. In the final phase of EoC building, the outputs of the base classifiers are combined. The classifier outcome can be represented by a class label, a subset of class labels ordered by plausibility, or a vector of all possible labels with the corresponding values of a scoring function, which may be expressed as probabilities. This article focuses on a new proposition of the score function, which is defined by a probability-based potential function. This proposition significantly expands the concept of the scoring function presented in the authors' earlier work. The scoring function proposed in [17] is based on the distance of the recognized object from the decision boundary of a given base classifier and the distance to the class centroid. The article [18], in turn, presents a weighted scoring function based on the Manhattan distance that uses the location of the cluster centroids and the distance to the decision boundary.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Given the above, the main objectives of this work can be summarized as follows:
• A proposal of a new scoring function based on a probability-based approach dedicated to linear base classifiers. We propose two versions of this potential function.
• The use of the proposed scoring function in the fusion process of homogeneous linear base classifiers.
• A new experimental setup for the comparison of the proposed method with other base classifier fusion methods using different classification performance measures.
The outline of the paper is as follows. In the next section (Section II), related works are outlined. The proposed method is presented in Section III. In Section IV, the experiments that were carried out are described, while the results and the discussion are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORK
In general, a classifier is a function that maps the feature space $X$ into a set of class labels $M$ [19]. Usually, when we talk about a classifier, we implicitly assume that the classifier is built using some kind of supervised learning procedure, that is, a procedure that incorporates information extracted from the training set [19]. The training set consists of training samples (taken from the feature space) and information about the classes to which these samples belong. In this article, it is assumed that the input space $X$ is a $d$-dimensional Euclidean space, $X = \mathbb{R}^d$. The paper is focused on linear (binary) classifiers. Consequently, each object from the input space $x \in X$ belongs to one of two available classes, so the output space is $M = \{-1; 1\}$.

A. LINEAR BINARY CLASSIFIER
A linear classifier produces a decision boundary that is described using a hyperplane $\pi$ defined by the following equation:
$$\pi: \langle n; x \rangle + b = 0, \tag{1}$$
where $n$ is a unit normal vector of the decision hyperplane, $b$ is the distance from the hyperplane to the origin and $\langle \cdot; \cdot \rangle$ is a dot product defined as follows [20]:
$$\langle u; v \rangle = \sum_{i=1}^{d} u_i v_i. \tag{2}$$
For each instance $x$, the linear classifier $\psi$ produces the discriminant function [21]:
$$\omega(x) = \langle n; x \rangle + b. \tag{3}$$
When the normal vector of the plane is a unit vector, the absolute value of the discriminant function equals the perpendicular distance from the decision hyperplane to the point $x$. The sign of the discriminant function depends on the side of the plane on which the instance $x$ lies.
Consequently, the decision of the linear classifier is determined using the sign of the discriminant function:
$$\psi(x) = \operatorname{sign}\big(\omega(x)\big). \tag{4}$$
The decision-plane coefficients ($n$ and $b$) are obtained in a supervised learning procedure using the training set $T$ containing $|T|$ (where $|\cdot|$ is the cardinality of a set) pairs of feature vectors $x$ and corresponding class labels $m$:
$$T = \left\{ (x^{(1)}, m^{(1)}), (x^{(2)}, m^{(2)}), \ldots, (x^{(|T|)}, m^{(|T|)}) \right\}, \tag{5}$$
where $x^{(k)} \in X$ and $m^{(k)} \in M$.
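The definitions above can be illustrated with a minimal sketch. The hyperplane, the test points and the function names are hypothetical, and assigning points lying exactly on the hyperplane to class 1 is an assumption made only for this example.

```python
import math


def discriminant(n, b, x):
    """Value of <n; x> + b; a signed distance when n is a unit vector."""
    return sum(ni * xi for ni, xi in zip(n, x)) + b


def predict(n, b, x):
    """Class label in {-1, 1} given by the sign of the discriminant function.

    Assumption: a point lying exactly on the hyperplane is assigned to class 1.
    """
    return 1 if discriminant(n, b, x) >= 0 else -1


# Hypothetical 2-D hyperplane x1 + x2 - 1 = 0, rescaled so that |n| = 1.
norm = math.sqrt(2.0)
n = (1.0 / norm, 1.0 / norm)
b = -1.0 / norm

point = (2.0, 1.0)
omega = discriminant(n, b, point)   # perpendicular distance, positive side
label = predict(n, b, point)
origin_label = predict(n, b, (0.0, 0.0))
```

Because the normal vector is rescaled to unit length, `omega` can be read directly as the perpendicular distance of the point from the decision hyperplane.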

B. ENSEMBLE OF LINEAR CLASSIFIERS
An ensemble classifier (or multiclassifier) is a set of properly trained classifiers whose decisions are then combined to produce the final decision of the system [22]. There are a few main reasons for building classifier ensembles [23]. First, the classifier training procedure may be interpreted as a procedure for exploring the hypothesis space. The goal is to find the best hypothesis that fits the training data. Unfortunately, a single classifier can search only a limited subspace of the entire hypothesis space. What is more, the optimization process connected with the classifier training may get stuck in one of the local optima. Those problems may be dealt with by employing a set of diverse classifiers [24].
Solving practical classification problems involves dealing with limited training/validation data. This may result in finding a set of classifiers which achieve the same classification quality. Combining responses of multiple classifiers may prevent the ensemble from choosing the wrong model [24].
The process of developing an ensemble system is divided into two main tasks: choosing the ensemble building strategy and choosing the method of output combination [22], [24].
The main goal of the ensemble building step is to provide the system with a set of accurate and diverse classifiers. Diversity of base classifiers is even more important than their high accuracy because combining classifiers whose predictions are identical gives no improvement over prediction of a single classifier [24]. There are two well-known ways of building a diverse set of classifiers. One is to build an ensemble of classifiers based on the different learning paradigm (heterogeneous ensemble) [25]. The other is to build a set of homogeneous classifiers (the same learning paradigm) which are learned on different training data. The most widely used methods of creating homogeneous ensembles are bagging [26], boosting [27], and random subspaces [28].
The second step of the ensemble creation process is to develop a method that combines the outputs of the classifiers forming the ensemble (a combiner). Two kinds of methods can be used to combine the responses of base classifiers, namely output weighting methods and meta-learning [29]. In the meta-learning methods, at least two levels of classifiers/regressors have to be learned. Classifiers on the first level are learned using the object description, and classifiers on higher levels are learned using the outputs of classifiers from the lower level [21], [30]. The output weighting methods can be divided into [31]:
• voting based [32] and support based [33];
• trainable [32] and untrainable [34];
• static [32], [34] and dynamic [33].
The idea of constructing ensemble classifiers has been and still is widely explored [22], [35], [36]. This is because they have proved to be an efficient tool for solving many classification problems across multiple domains. Ensembles of linear classifiers have also been utilized in many practical problems such as medical diagnosis [37], robotics [38] or bioinformatics [39], to name only a few. Linear classifiers owe their popularity to their low computational complexity and the fair accuracy that they may achieve. They are also less prone to overfitting [37]. Now, let us define an ensemble of linear classifiers as a set of $N$ classifiers:
$$\Psi = \left\{ \psi^{(1)}, \psi^{(2)}, \ldots, \psi^{(N)} \right\}.$$
This article is focused mainly on the methods of combining the outputs of linear classifiers constituting the ensemble.
In the literature, we may find multiple methods of doing so. The simplest strategy to combine the outcomes of multiple classifiers is to apply the majority voting scheme:
$$\omega_{MV}(x) = \sum_{i=1}^{N} \operatorname{sign}\big(\omega^{(i)}(x)\big),$$
where $\omega^{(i)}(x)$ is the value of the discriminant function provided by the classifier $\psi^{(i)}$ for point $x$. However, this simple yet effective strategy ignores the magnitudes of the discriminant functions $\omega^{(i)}(x)$.
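A short sketch may help to see why majority voting discards the magnitudes. It assumes that each vote is the sign of a base discriminant value and that a value of exactly zero counts as a vote for class 1; the numbers are made up for illustration.

```python
def sign(value):
    """Vote of a single base classifier (assumption: zero counts as +1)."""
    return 1 if value >= 0 else -1


def majority_vote(omegas):
    """Sum of the votes; the sign of the sum is the ensemble decision."""
    return sum(sign(w) for w in omegas)


# Two hypothetical sets of base discriminant values with identical signs.
confident = [5.0, 4.0, -0.1]   # two classifiers strongly support class 1
hesitant = [0.1, 0.2, -3.0]    # two classifiers barely support class 1

votes_confident = majority_vote(confident)
votes_hesitant = majority_vote(hesitant)
mean_hesitant = sum(hesitant) / len(hesitant)
```

Both sets produce the same vote count, although in the second one the average of the raw discriminant values is negative.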
Another simple strategy is model averaging [40]. The output of such a model is calculated by simply averaging the values of the discriminant functions:
$$\omega_{MA}(x) = \frac{1}{N} \sum_{i=1}^{N} \omega^{(i)}(x).$$
The model-averaging approach has a major drawback: the discriminant function of a linear classifier is unbounded, and its absolute value grows with the distance from the decision plane. Consequently, a classifier whose decision plane is placed far from the real decision boundary will produce high values of the discriminant function, which may negatively affect the ensemble. For the same reason, outliers may acquire abnormally high values of the discriminant function, which may also affect the decision of the ensemble. A compromise between the above-mentioned methods is to transform the discriminant function by applying a kind of sigmoid function to it [21]. A sigmoid function $S(\cdot)$ is a monotonic function that has finite upper and lower bounds. As a consequence, the distance-specific information is not lost, and the impact of misplaced decision boundaries is reduced. This approach is equivalent to applying a kind of logistic regression to the values of the discriminant function [41]. The value of the discriminant function may also be used to estimate the conditional probability of a class given the instance $x$ [41], [42]. The normalized outputs are then simply averaged:
$$\omega_{S}(x) = \frac{1}{N} \sum_{i=1}^{N} S\big(\omega^{(i)}(x)\big).$$
The other issue with combining linear classifiers is that the discriminant function of a linear classifier grows monotonically with the distance to the decision plane. It means that the linear classifier ignores the data spread along the normal vector of the decision plane, and it is implicitly assumed that the distribution is uniform. However, in many real-world datasets, objects are distributed unevenly. An example of this situation is visualised in Fig. 1. The figure presents a binary, two-dimensional, banana-shaped dataset and the decision boundary.
As we can see, the objects are placed in one cluster located in the intersection of the intervals $x_1 \in [-1.5; 2.5]$, $x_2 \in [-1.5; 2]$. Outside this area, there are no class-specific objects. Consequently, the discriminant function generated by the classifier should be low outside this area. Unfortunately, a linear classifier ignores this fact, and its support grows (along the normal vector $n$ of the decision boundary) outside this area. Transforming the discriminant function using a monotonic function, such as a sigmoid function, does not change the situation at all. This is because far from the decision boundary the transformed discriminant function approaches its upper (lower) limit. Being close to the limit still indicates high support for a particular class in the area where there are no class-specific instances. Ignoring the class-spread-related information does not change the outcome of a single classifier, since the sign of the discriminant function remains the same. However, our previous research has shown that employing this information may improve the classification quality for heterogeneous ensembles of linear classifiers [17]. In the previously proposed approach, the discriminant function is transformed using the following non-monotonic function:
$$g\big(\omega(x)\big) = \sqrt{2\zeta}\,\omega(x) \exp\big(-\zeta\,\omega(x)^2 + 0.5\big),$$
where $\zeta$ is a coefficient that determines the position and steepness of the peaks (positive and negative) [see Fig. 2]. This coefficient should be tuned during the training procedure. The translation constant 0.5 and the scaling factor $\sqrt{2\zeta}$ guarantee that the maximum and minimum values (peaks) of the transformed discriminant function are 1 and −1, respectively. The figure shows the shape of the function for different values of $\zeta$. Using this transformation, the prediction of the ensemble is calculated as follows:
$$\omega_{g}(x) = \frac{1}{N} \sum_{i=1}^{N} g\big(\omega^{(i)}(x)\big).$$
After combining the base classifiers, the final prediction of the ensemble is obtained according to the rule (4).
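The behaviour of the non-monotonic transformation can be checked numerically. The sketch below assumes the form sqrt(2ζ)·z·exp(−ζz² + 0.5), which is consistent with the translation constant 0.5 and the scaling factor √(2ζ) mentioned in the text; the hyperbolic tangent is used only as one possible example of a bounded monotonic function, and the value of ζ is arbitrary.

```python
import math


def nonmonotonic(z, zeta):
    """Assumed form of the transformation: sqrt(2*zeta) * z * exp(-zeta*z^2 + 0.5)."""
    return math.sqrt(2.0 * zeta) * z * math.exp(-zeta * z * z + 0.5)


def monotonic(z):
    """A bounded monotonic transformation (tanh chosen for illustration only)."""
    return math.tanh(z)


zeta = 2.0                                        # arbitrary illustrative value
peak_position = 1.0 / math.sqrt(2.0 * zeta)       # where the derivative vanishes
peak_value = nonmonotonic(peak_position, zeta)    # positive peak
trough_value = nonmonotonic(-peak_position, zeta) # negative peak
far_nonmonotonic = nonmonotonic(10.0, zeta)       # support far from the boundary
far_monotonic = monotonic(10.0)
```

Under this assumed form the peaks sit at ±1/√(2ζ), so increasing ζ moves them towards the decision boundary and narrows them at the same time. Far from the boundary the non-monotonic support vanishes, whereas the monotonic one stays near its upper limit.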
Harnessing the above-mentioned transformation allows the ensemble to improve the classification quality. This is because the function is tuned so that the potential is near zero in the areas where there are no training points. However, when the data distribution is imbalanced, the performance may degrade [17]. The other drawback of this method is that the $\zeta$ coefficient controls the position and the steepness of the peaks simultaneously. This can be seen in Fig. 2: the higher the value of $\zeta$, the closer the peak is to the decision boundary and the narrower it becomes. The solution may be to use an asymmetric, data-driven potential function. The potential function should possess the following properties:
• The function should be bounded within a given interval;
• The function should not be a sigmoid-shaped one;
• The function should not be an odd function;
• The function should allow multiple peaks;
• The peaks should indicate areas where the density of class-specific instances is high.

III. PROPOSED METHOD
In this section, the probability-based potential function is introduced. The concept description is preceded by the discussion that motivates the usage of such a potential function.

A. POTENTIAL FUNCTION
The discussion has brought us to the point where we realise that an asymmetric, data-driven potential function is needed. This section describes such a function based on a probabilistic framework. Harnessing the probabilistic model means that $x$ and $m$ are realizations of random variables $X$ and $M$, respectively. The joint distribution $P(X, M)$ is also given. Then, the value of the discriminant function $\omega(x)$ of a linear classifier (described by its normal vector $n$ and offset $b$) is also a realisation of a random variable defined as follows:
$$W = \langle n; X \rangle + b.$$
We denote its probability density function by $w(\omega)$. This one-dimensional distribution describes the data spread along the line defined by the normal vector $n$ of the decision plane. More precisely, it describes the distribution of the signed distances of points from the decision plane. This random variable is also jointly distributed with $M$: $P(W, M)$.
Taking this into consideration, the potential function $\beta$ should be proportional to the conditional probability of class 1 given $\omega(x)$:
$$\beta\big(\omega(x)\big) \propto P\big(M = 1 \mid \omega(x)\big).$$
The conditional probability is expressed as follows:
$$P\big(M = 1 \mid \omega(x)\big) = \frac{w\big(\omega(x) \mid M = 1\big) P(M = 1)}{\sum_{m \in M} w\big(\omega(x) \mid M = m\big) P(M = m)}.$$
The potential function $\beta\big(\omega(x)\big)$ is given as follows:
$$\beta\big(\omega(x)\big) = \frac{\exp\big(w(\omega(x) \mid M = 1) P(M = 1)\big)}{\sum_{m \in M} \exp\big(w(\omega(x) \mid M = m) P(M = m)\big)} - 0.5. \tag{17}$$
To turn the probability into the potential function, the following modifications have been applied.
• When an outlying object $x$ is considered, the sum $\sum_{m \in M} w\big(\omega(x) \mid M = m\big) P(M = m)$ may be close to zero. To avoid numerical problems, the softmax transformation is applied. The application of the softmax normalization also gives the potential function a desired shape. That is, for outliers lying far from the decision boundary, the potential function is close to zero [because the fraction in (17) is close to 0.5].
• To get the potential bounded within the interval $[-0.5; 0.5]$ (the sign of the potential indicates the class), we have to subtract 0.5 from the probability.
The above-defined potential function takes into account the prior class probabilities $P(M = m)$. Using these probabilities may not be suitable for imbalanced classification problems. If we want the potential function to use only the class-conditional densities, we simply assume $P(M = m) = 0.5\ \forall m$. It gives the following potential function:
$$\beta_b\big(\omega(x)\big) = \frac{\exp\big(0.5\, w(\omega(x) \mid M = 1)\big)}{\sum_{m \in M} \exp\big(0.5\, w(\omega(x) \mid M = m)\big)} - 0.5. \tag{18}$$
Given the above-defined potential function and the ensemble of linear classifiers, the outcome of the ensemble is calculated using simple averaging of the values of the potential function:
$$\omega_{\beta}(x) = \frac{1}{N} \sum_{i=1}^{N} \beta^{(i)}\big(\omega^{(i)}(x)\big),$$
where $\beta^{(i)}\big(\omega^{(i)}(x)\big)$ is the classifier-specific potential.
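The construction described above can be sketched in a few lines. The sketch assumes that the softmax is taken over the prior-weighted class-conditional densities and that 0.5 is subtracted afterwards, as the two listed modifications suggest; the density values and priors are hypothetical numbers, and the function names are illustrative.

```python
import math


def potential(dens_pos, dens_neg, prior_pos=0.5, prior_neg=0.5):
    """Softmax over prior-weighted densities of classes 1 and -1, shifted by 0.5.

    With the default priors (0.5 each) this corresponds to the balanced variant.
    """
    pos = dens_pos * prior_pos
    neg = dens_neg * prior_neg
    # exp(pos) / (exp(pos) + exp(neg)) rewritten as a logistic function.
    return 1.0 / (1.0 + math.exp(neg - pos)) - 0.5


def ensemble_potential(base_potentials):
    """Simple average of the classifier-specific potentials."""
    return sum(base_potentials) / len(base_potentials)


outlier = potential(0.0, 0.0, prior_pos=0.9, prior_neg=0.1)  # both densities vanish
biased = potential(1.0, 1.0, prior_pos=0.9, prior_neg=0.1)   # priors break the tie
balanced = potential(1.0, 1.0)                               # equal priors, equal densities
combined = ensemble_potential(
    [potential(1.2, 0.1), potential(0.8, 0.3), potential(0.2, 0.9)]
)
```

With equal densities the prior-weighted variant leans towards the class with the larger prior, while the balanced variant returns zero; an outlier with vanishing densities receives zero potential in both cases.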

B. PROBABILITY ESTIMATION
In the previous section, the potential function was developed. It was assumed that all the needed probabilities are known. Unfortunately, in real-world classification tasks, the probabilities have to be estimated from the training data. This section describes the techniques used to estimate the probabilities needed by the potential function. First, the linear classifier $\psi$ is trained using the training set $T$. Then, the training set is divided into two datasets containing objects belonging to classes −1 and 1, respectively:
$$T_{m} = \left\{ (x^{(k)}, m^{(k)}) \in T : m^{(k)} = m \right\}, \quad m \in \{-1; 1\}.$$
Given those sets, the prior probabilities may be easily estimated as:
$$\hat{P}(M = m) = \frac{|T_{m}|}{|T|}.$$
Estimating the class-conditional densities $w\big(\omega(x) \mid M = m\big)$ is a more complicated task. This is because an entire probability density function has to be estimated instead of a single value. To avoid any assumptions about the distribution type, a non-parametric estimation technique should be used [43], [44]. Two widely known examples of such techniques are histograms and kernel estimators [45]. The histogram-based estimators are often criticized because the shape of the estimated distribution strongly depends on the width and the number of bins. What is more, due to the binning approach, the histogram loses some information coming from the sample [44]. The kernel-based estimators, however, are devoid of these drawbacks [43], [44]. Consequently, the class-conditional densities are estimated using the kernel estimator [46]:
$$\hat{w}(\omega \mid M = m) = \frac{1}{|T_{m}|\, h} \sum_{(x^{(k)}, m^{(k)}) \in T_{m}} K\left( \frac{\omega - \omega(x^{(k)})}{h} \right),$$
where $h$ is the smoothing parameter (bandwidth) and $K(\cdot)$ is the kernel function. The kernel estimation technique uses a 'smooth' kernel function that is centered at each data point, and then the values coming from the centered kernels are summed up to form the estimated probability density function. The estimation result for a simple distribution is visualised in Fig. 3. The kernel function usually meets the following properties [44], [47]:
$$K(u) \geq 0, \qquad \int_{-\infty}^{\infty} K(u)\, du = 1, \qquad K(u) = K(-u).$$
Many different types of kernel functions have been described so far in the literature [44].
However, for the symmetric kernels described above, the shape of the kernel function has little impact on the estimator properties [48], [49]. On the other hand, Gaussian kernels are reported to produce the 'smoothest' estimators and, due to this fact, they are most frequently used [44]. The impact of the kernel shape on the resulting estimation is shown in Fig. 4.
FIGURE 4. Example of the kernel estimation. The vertical lines over the abscissa represent the realisations of the random variable being estimated. The realizations are generated using normal distributions with standard deviation equal to 0.5 and mean values equal to zero and two, respectively. The number of realizations is 50 for each of the normal distributions. The estimation was made using different kernels and the bandwidth selected using Silverman's rule [44]. As we can see, the shapes of the estimated distributions are quite similar for all of the investigated kernels.
In this study, we decided to use the Gaussian kernel (the kernel is the pdf of the standard normal distribution):
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right).$$
In contrast to the choice of the kernel shape, the choice of the kernel bandwidth is critical [44]. On the one hand, too large a value of the smoothing parameter causes the estimator to be oversmoothed. The oversmoothed estimator may lose important details of the density function. For example, the oversmoothed estimator may lose information about multiple modes when the peaks are too close to each other [43], [44]. The impact of the bandwidth on the estimation result is shown in Fig. 5.
On the other hand, too small a bandwidth value may cause the estimator to show many insignificant details of the density function. For example, it may show multiple modes when estimating a unimodal distribution [44]. The literature presents many techniques for finding the proper value of the kernel bandwidth [47], [50], [51]. However, for Gaussian kernels, Silverman's rule of thumb is often used in practice due to its simplicity and its ability to provide quite good results [44]. In our work, we also decided to select the bandwidth using Silverman's rule of thumb [48]:
$$h = 0.9 \min\left( \hat{\sigma}, \frac{IQR}{1.34} \right) n^{-1/5},$$
where $\hat{\sigma}$ is the sample standard deviation, $IQR$ is the interquartile range of the sample and $n$ is the sample size [52].
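The estimation procedure can be sketched with the Python standard library alone. The sketch assumes the common form of Silverman's rule, 0.9 · min(σ̂, IQR/1.34) · n^(−1/5), and the default quartile convention of `statistics.quantiles`, which may differ from the one used by the authors; the sample of discriminant values is made up.

```python
import math
import statistics


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth: 0.9 * min(std, IQR / 1.34) * n^(-1/5).

    Note: a sample with zero spread would give a zero bandwidth.
    """
    n = len(sample)
    sigma = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)


def gaussian_kernel(u):
    """Probability density function of the standard normal distribution."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(sample, h):
    """Return the estimated density as a function of one argument."""
    n = len(sample)

    def density(t):
        return sum(gaussian_kernel((t - s) / h) for s in sample) / (n * h)

    return density


# Hypothetical discriminant values of the training objects of one class.
sample = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
h = silverman_bandwidth(sample)
density = kernel_density(sample, h)
```

In the proposed method, one such estimator would be built per class from the discriminant values of the training objects belonging to that class.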

C. TOY EXAMPLES
In this section, simple examples of the potential-function-construction process are presented.

1) PROBABILITY ESTIMATION AND THE POTENTIAL FUNCTION
In this subsection, a simple example of constructing the potential field out of a balanced dataset is given. The training dataset and the linear classifier constructed for this set (using the nearest centroid rule [53]) are presented in Fig. 6. The estimated class-conditional densities are shown in Fig. 7. The potential function is visualised in Fig. 8.

2) ENSEMBLE-SPECIFIC DISCRIMINANT FUNCTIONS
In this subsection, we visualise the ensemble-specific discriminant functions obtained using different approaches to combining linear classifiers. In the examples, we used an imbalanced two-dimensional dataset. The Fisher LDA classifier [54] is used as the base classifier. The base classifiers of the ensemble are trained using the bagging approach, and the number of base classifiers is three. As we can see in Fig. 9, the response of the majority voting ensemble is rather crisp. The shades between red and green are visible only in the area where the decision areas of the base classifiers do not overlap.
For the ensemble using the model averaging approach, which is shown in Fig. 10, no clear decision boundary is visible. The values of the discriminant function decrease along the horizontal axis of the plot. Consequently, the more distant an object is from the decision boundary, the higher the absolute value of the discriminant function is.
The discriminant function generated by the ensemble using the $\beta$ potential function is visualised in Fig. 11. It may be seen that the green colour dominates the plot. It means that the ensemble is biased towards the majority class. However, we may see that the absolute value of the discriminant function for the majority class is the highest in the area where the value of the class-specific probability density function is the highest.
The discriminant function generated by the ensemble using the $\beta_b$ potential function is visualised in Fig. 12. In this case, the ensemble is not biased towards the majority class. As expected, the class-specific values of the discriminant function are the highest in the areas where many class-specific objects are placed.

IV. EXPERIMENTAL SETUP
In the experimental study, the proposed method was used to combine classifiers within a homogeneous ensemble of classifiers. The ensembles were created using a bagging approach [40]. Each generated ensemble consists of 11 classifiers, and each of them is learned on a bagging sample whose size equals 80% of the number of instances in the original dataset. The instances for a given base learner are chosen randomly.
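The sampling step can be sketched as follows. Drawing with replacement is the usual bagging convention and is an assumption here, as are the seed value and the function name; the experiments themselves were run in WEKA, so this is only an illustration of the sizes involved.

```python
import random


def bagging_indices(n_instances, n_classifiers=11, fraction=0.8, seed=0):
    """Draw one index sample per base classifier.

    Assumptions: sampling with replacement and a fixed, hypothetical seed.
    """
    rng = random.Random(seed)
    bag_size = round(fraction * n_instances)
    return [
        [rng.randrange(n_instances) for _ in range(bag_size)]
        for _ in range(n_classifiers)
    ]


# Hypothetical dataset with 100 instances.
bags = bagging_indices(100)
```

Each inner list selects the training objects of one base linear classifier.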
The following base classifiers were used to build the above-mentioned ensembles (except random forest and rotation forest):
• $\psi_{FLDA}$ - Fisher LDA [54];
• $\psi_{LR}$ - logistic regression classifier [57];
• $\psi_{PER}$ - perceptron classifier [58];
• $\psi_{NC}$ - nearest centroid (nearest prototype) classifier [53] with the Euclidean distance;
• $\psi_{SVM}$ - SVM classifier with the linear kernel (no kernel) [59].
The classifiers used were implemented in the WEKA framework [60]. If not stated otherwise, the classifier parameters were set to their defaults. The multi-class problems were dealt with using One-vs-One decomposition [14]. The source code of the proposed methods is available online (see footnote 1). For each of the employed kernel estimators, the kernel bandwidth was selected using Silverman's rule [48]. The Gaussian kernel is used.
To evaluate the proposed methods, six classification-quality criteria are used:
• Macro-averaged:
  - false discovery rate (1 − precision, FDR);
  - false negative rate (1 − recall, FNR);
  - Matthews correlation coefficient (MCC);
• Micro-averaged:
  - false discovery rate (1 − precision, FDR);
  - false negative rate (1 − recall, FNR);
  - Matthews correlation coefficient (MCC).
Macro- and micro-averaged measures were used to assess the performance for the majority and minority classes. This is because the macro-averaged measures are more sensitive to the performance for minority classes [61]. The criteria are bounded in the interval [0, 1], where zero denotes the best classification quality. The results obtained using the MCC criterion are also normalised to fit the above-mentioned property.
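The difference in sensitivity between the two averaging schemes can be seen on a small example. The sketch below uses one common definition of macro- and micro-averaged FNR computed from a confusion matrix; the exact computation used in the experiments is not specified here, and the confusion matrix is made up.

```python
def class_counts(confusion):
    """Return (true positives, false negatives) for every class.

    confusion[i][j] is the number of class-i objects predicted as class j.
    """
    return [(row[c], sum(row) - row[c]) for c, row in enumerate(confusion)]


def macro_fnr(confusion):
    """Average of the per-class false negative rates."""
    counts = class_counts(confusion)
    return sum(fn / (tp + fn) for tp, fn in counts) / len(counts)


def micro_fnr(confusion):
    """False negative rate computed from counts pooled over all classes."""
    counts = class_counts(confusion)
    tp = sum(c[0] for c in counts)
    fn = sum(c[1] for c in counts)
    return fn / (tp + fn)


# Hypothetical imbalanced problem: 100 majority and 10 minority objects.
confusion = [[90, 10],
             [8, 2]]
macro = macro_fnr(confusion)
micro = micro_fnr(confusion)
```

The poorly recognised minority class contributes half of the macro-averaged value but only a small share of the pooled counts, which is why the macro-averaged criterion reacts more strongly to it.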
The experimental procedure was conducted using the ten-fold cross-validation procedure. The data folds were generated using methods implemented in WEKA software. The random seed used to generate them is zero.
Following the recommendation of [62], the statistical significance of the obtained results was assessed using a two-step procedure. The first step was to perform the Friedman test [62] for each quality criterion separately. Since multiple criteria were employed, the family-wise error rate (FWER) should be controlled [63]. To do so, the Bergmann-Hommel procedure [63] for controlling the FWER of the conducted Friedman tests was employed. When the Friedman test showed that there was a significant difference within the group of classifiers, pairwise comparisons using the Wilcoxon signed-rank test [62] were performed. To control the FWER of the Wilcoxon-testing procedure, the Bergmann-Hommel approach was employed [63]. For all tests, the significance level was set to $\alpha = 0.01$.
Table 1 displays the collection of the 70 benchmark sets that were used during the experimental evaluation of the proposed methods. The table is divided into three column groups, each organized as follows. The first column contains the names of the datasets. The remaining ones contain the set-specific characteristics of the benchmark sets: the number of instances in the dataset $|S|$; the dimensionality of the input space $d$; the number of classes $C$; and the average imbalance ratio $IR$.
The datasets come from the Keel repository (see footnote 2) and are available online (see footnote 3). During the dataset-preprocessing stage, a few transformations were applied to the datasets. The PCA method [64] was applied, and the fraction of covered variance was set to 0.95. The attributes were also normalized to have zero mean and unit variance.

V. RESULTS AND DISCUSSION
To compare multiple algorithms on multiple benchmark sets, the average ranks approach is used. In this approach, the winning algorithm achieves rank equal to '1', the second achieves rank equal to '2', and so on. In the case of ties, the ranks of algorithms that achieve the same results are averaged. To provide a visualization of the average ranks, radar plots are employed. In the radar plot, each of the radially arranged axes represents one quality criterion. In the plots, the data is visualized in such a way that the lowest ranks are closer to the centre of the graph. Consequently, higher ranks are placed near the outer ring of the graph. Graphs are also scaled so that the inner ring represents the lowest rank recorded for the analyzed set of classifiers, and the outer ring is equal to the highest recorded rank. The radar plots are presented in Figs. 13-17.
2 https://sci2s.ugr.es/keel/category.php?cat=clas
3 https://github.com/ptrajdos/MLResults/blob/master/data/KeelData.tar.xz
The numerical results are given in Tables 2 to 6. Each table is structured as follows. The first row contains the names of the investigated algorithms. Then, the table is divided into six sections, one section per evaluation criterion. The first row of each section is the name of the quality criterion investigated in the section. The second row shows the p-value of the Friedman test. The third one shows the average ranks achieved by the algorithms. The following rows show p-values resulting from the pairwise Wilcoxon test. A p-value equal to 0.000 indicates that the p-value is lower than $10^{-3}$, and a p-value equal to 1.000 indicates that the value is higher than 0.999. P-values lower than $\alpha$ are bolded. Consequently, the bolded results show that there is a significant difference between classifiers.
Before we begin the analysis of the classification quality of the proposed methods, let us analyse the reference algorithms first.
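The average-rank computation described at the beginning of this section can be sketched as follows. The criterion values are made up, lower values are treated as better (as for the criteria used here), and ties are detected by exact equality, which is a simplification.

```python
def ranks_with_ties(scores):
    """Rank the scores so that the lowest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def average_ranks(results):
    """results[d][a] is the criterion value of algorithm a on dataset d."""
    per_dataset = [ranks_with_ties(row) for row in results]
    n_algorithms = len(results[0])
    return [
        sum(r[a] for r in per_dataset) / len(per_dataset)
        for a in range(n_algorithms)
    ]


# Hypothetical results of three algorithms on three datasets.
results = [
    [0.10, 0.20, 0.20],
    [0.30, 0.25, 0.40],
    [0.15, 0.15, 0.15],
]
avg = average_ranks(results)
```

The resulting list contains one average rank per algorithm, which is the quantity plotted on the radar plots and reported in the tables.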
We start with the analysis of the results connected with ensembles built using linear classifiers only. For the above-presented experimental setup, there are almost no significant differences between the reference methods. For the $\psi_{NC}$ base classifier, on the other hand, the $\psi_{MV}$ ensemble performs significantly worse in terms of the macro-averaged FDR measure than the other reference ensemble algorithms based on linear classifiers. It means that for $\psi_{NC}$, the majority voting strategy fails to identify the minority class. This may be due to the fact that $\psi_{NC}$ completely ignores inter- and intra-class variation. Now, the results related to the tree-based ensembles ($\psi_{RA}$ and $\psi_{RO}$) are analysed. For macro-averaged FDR and MCC, the $\psi_{RO}$ algorithm is significantly better than the ensembles built using linear classifiers for four of five base classifiers. What is more, for macro-averaged FDR it is better than the above-mentioned ensembles for two ($\psi_{NC}$ and $\psi_{SVM}$) out of five base classifiers. On the other hand, in terms of all macro-averaged measures, the $\psi_{RA}$ classifier is significantly better than the ensembles based on linear classifiers for two ($\psi_{NC}$ and $\psi_{SVM}$) out of five base classifiers. These results show that $\psi_{RA}$ and $\psi_{RO}$ tend to be better at recognizing minority classes.
The situation changes for the micro-averaged quality criteria. For those criteria, the $\psi_{RA}$ and $\psi_{RO}$ algorithms tend to be significantly better in terms of FNR and worse in terms of MCC. These results indicate that, for majority classes, $\psi_{RA}$ and $\psi_{RO}$ tend to make more false-positive predictions than the ensembles based on linear base classifiers.
A. ψ B1 VS REFERENCE
Observing the tables for the ψ B1 classifier, we may find three different patterns of behaviour.
First, for the ψ FLDA (Tab. 2, Fig. 13) and ψ NC (Tab. 5, Fig. 16) classifiers, ψ B1 gets a significantly worse score for the macro-averaged FNR and a significantly better score for the micro-averaged FNR and MCC measures. On the other hand, for the micro-averaged FDR, the ψ B1 ensemble is significantly better than ψ RA and ψ RO. It means that, for those classifiers, the ψ B1 classifier makes more false-negative predictions for the minority class than the reference ensembles built using linear classifiers. Moreover, it tends to be biased towards the majority class, although not as strongly as ψ RA and ψ RO.
The second pattern of behaviour may be observed for ψ LR (Tab. 3, Fig. 14) and ψ PER (Tab. 4, Fig. 15). For those classifiers, ψ B1 is also worse than the reference methods in terms of macro-averaged FNR, but it is comparable to the ensembles built using linear base classifiers in terms of micro-averaged measures. It means that its bias towards the majority class is smaller for those base classifiers.
For the above-considered cases, the bias towards the majority class is an effect of employing the prior class probabilities in the potential function. Employing those probabilities shifts the decision boundary in favour of the majority class.
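The effect of the prior probabilities on the decision rule can be illustrated with a minimal one-dimensional sketch. It is not the potential function proposed in this article: two unit-variance Gaussian densities over a hypothetical signed distance from the hyperplane stand in for the class-conditional estimates, and the means and the prior of 0.9 are assumptions chosen only for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def majority_threshold(prior_majority, grid):
    """Smallest grid point at which the majority class wins.

    The minority class is centred at -1 and the majority class at +1
    (both values are assumptions of this sketch).
    """
    score_majority = prior_majority * gaussian_pdf(grid, mean=+1.0)
    score_minority = (1.0 - prior_majority) * gaussian_pdf(grid, mean=-1.0)
    # The score ratio grows monotonically with x, so the first True marks the boundary.
    return grid[np.argmax(score_majority > score_minority)]


grid = np.linspace(-4.0, 4.0, 8001)  # step of 0.001

without_prior = majority_threshold(0.5, grid)  # priors ignored (equal weights)
with_prior = majority_threshold(0.9, grid)     # majority prior of 0.9

print("boundary without priors:", without_prior)  # close to 0
print("boundary with priors:", with_prior)        # close to -ln(9)/2, about -1.10
```

With equal weights the boundary sits midway between the two class centres. Weighting by the priors pushes it to about -1.10, so a larger part of the axis, including points near the minority centre, is assigned to the majority class.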
The third pattern is observed for the ψ SVM base classifier (Tab. 6, Fig. 17). For this base classifier, ψ B1 outperforms the reference ensembles based on linear base classifiers in terms of macro-averaged measures. ψ B1 is also comparable to them in terms of micro-averaged measures. Contrary to the previously observed patterns, for this base classifier, ψ B1 is not biased towards the majority class. What is more, it improves the classification quality for the minority classes. This effect may be related to the separation margin that is enforced by the SVM-based classifiers. Harnessing the prior class probabilities shifts the decision boundary in favour of the majority class; however, the separation margin is so wide that this shift does not cause a bias towards the majority class. The situation changes when ψ B1 is compared to ψ RA and ψ RO in terms of the macro-averaged criteria. In this case, ψ B1 is significantly worse than the tree-based reference ensembles.

B. ψ B2 VS REFERENCE
Comparing ψ B2 classifier with the reference methods allows us to observe the following patterns.
For ψ FLDA base classifier (Tab. 2, Fig. 13), there are no significant differences between the ψ B2 ensemble and the reference methods.
For the base classifiers other than ψ FLDA, the ψ B2 ensemble is significantly better than the linear-classifier-based ensembles in terms of the macro-averaged FNR measure. Additionally, for ψ NC (Tab. 5, Fig. 16) and ψ SVM (Tab. 6, Fig. 17), the investigated classifier is significantly better in terms of the macro-averaged FDR. Most importantly, for ψ PER, ψ NC, and ψ SVM, it is also better in terms of the macro-averaged MCC measure. It means that the ψ B2 classifier tends to outperform the linear-classifier-based reference ensembles in recognizing objects of minority classes.
The situation changes when comparing the ψ B2 ensemble with the tree-based ensembles (ψ RA and ψ RO). In this case, for the ψ NC and ψ SVM base classifiers, the ψ B2 ensemble is significantly worse in terms of macro-averaged FDR and MCC.
For the majority classes, the situation is quite different. For ψ NC (Tab. 5, Fig. 16) and ψ SVM (Tab. 6, Fig. 17), when compared to the linear-classifier-based ensembles, there are no significant differences in terms of the micro-averaged measures. It means that, for those base classifiers, the ψ B2 ensemble is able to improve the classification quality for minority classes without harming the recognition quality of the majority classes. For ψ LR and ψ PER, on the other hand, ψ B2 tends to be worse in terms of the micro-averaged measures. Consequently, for those classifiers, ψ B2 ensembles are biased towards minority classes.
The situation is a bit different when comparing ψ B2 with the tree-based ensembles. In this comparison, ψ B2 tends to be significantly better in terms of the micro-averaged MCC criterion and significantly worse in terms of micro-averaged FNR. It means that the ψ B2 ensemble is less biased towards the majority classes.
When ψ B1 and ψ B2 are compared directly, the result pattern is very clear for all the investigated base classifiers. That is, in terms of macro-averaged FNR and MCC, ψ B2 is always significantly better than ψ B1. It means that, for the minority class, ψ B1 makes significantly more false-negative predictions than ψ B2. In other words, employing prior probabilities makes the ensemble based on the ψ B1 potential function less sensitive to the minority class.
On the other hand, for micro-averaged FNR and MCC measures, ψ B1 is better than ψ B2 for three out of five base classifiers (ψ FLDA (Tab. 2, Fig. 13), ψ PER (Tab. 4, Fig. 15), ψ NC (Tab. 5, Fig. 16)). For the two remaining base classifiers (ψ LR (Tab. 3, Fig. 14) and ψ SVM (Tab. 6, Fig. 17)), no significant differences are observed. It means that ψ B1 is far better at identifying majority class examples. This fact, combined with the low classification quality in terms of macro-averaged measures, means that the ψ B1 classifier is biased towards the majority class. This is disadvantageous because, in many practical imbalanced classification problems, the minority class is the class of most interest [65].
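The way macro- and micro-averaged measures are read throughout this section can be illustrated with a small sketch. It assumes the usual definitions (macro: mean of the per-class values; micro: value computed from counts pooled over classes) and uses a hypothetical two-class problem with 90 majority and 10 minority objects; the helper names are ours.

```python
import numpy as np


def per_class_counts(y_true, y_pred, n_classes):
    """True positives and false negatives counted separately for each class."""
    tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in range(n_classes)])
    return tp, fn


def macro_fnr(tp, fn):
    """Mean of per-class FNR values: every class has the same weight."""
    return float(np.mean(fn / (tp + fn)))


def micro_fnr(tp, fn):
    """FNR from pooled counts: every object has the same weight."""
    return float(fn.sum() / (tp.sum() + fn.sum()))


# Class 0 is the majority (90 objects), class 1 the minority (10 objects).
y_true = np.array([0] * 90 + [1] * 10)
# A classifier biased towards the majority class: 8 of 10 minority objects are missed.
y_pred = np.array([0] * 90 + [0] * 8 + [1] * 2)

tp, fn = per_class_counts(y_true, y_pred, n_classes=2)
print("macro-averaged FNR:", macro_fnr(tp, fn))  # 0.4
print("micro-averaged FNR:", micro_fnr(tp, fn))  # 0.08
```

The same predictions give a poor macro-averaged FNR and a good micro-averaged FNR. This combination is why a low macro-averaged score together with a high micro-averaged score is interpreted above as a bias towards the majority class.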

D. MAIN FINDINGS
Given the above, the main advantages of the proposed method can be summarized as follows:
• Ensemble ψ B2 is better than ψ B1 in dealing with imbalanced data.
• In general, ψ B2 ensemble is comparable to ψ RA and ψ RO in terms of macro-averaged quality measures. It means that the classification quality for examples coming from the minority classes is similar to the classification quality obtained by non-linear ensembles.
• In general, in terms of the macro-averaged measures, the ψ B2 ensemble is better than the other ensembles built using linear classifiers, i.e. ψ MV, ψ MA, ψ SM and ψ PF.
Likewise, the main disadvantages of the proposed method can be summarized as follows:
• Creating the potential function for ψ B1 and ψ B2 imposes a greater computational burden than creating the potential function in ψ MV, ψ MA and ψ SM. This is because kernel estimators are employed. What is more, the bandwidth parameter of the kernel must be chosen, which generates an additional computational cost.
• Generally, for macro-averaged measures, the ψ B1 ensemble is worse than the reference methods. It means that ψ B1, due to employing prior class probabilities, is biased towards majority classes.
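The additional cost attributed above to the kernel estimators and to the bandwidth choice can be made concrete with a minimal sketch. It is an illustration under assumptions, not the procedure used in the experiments: a Gaussian kernel, a leave-one-out likelihood criterion, the candidate bandwidths, and the synthetic "signed distance" sample are all hypothetical choices.

```python
import numpy as np


def kde(query, sample, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated at `query`."""
    diffs = (query[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth


def loo_log_likelihood(sample, bandwidth):
    """Leave-one-out log-likelihood of the sample for a given bandwidth.

    All pairwise kernel values are needed, so one evaluation costs O(n^2).
    """
    n = sample.size
    diffs = (sample[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(kernel, 0.0)  # leave each point out of its own estimate
    density = kernel.sum(axis=1) / ((n - 1) * bandwidth)
    # Guard against log(0) for isolated points under very small bandwidths.
    return float(np.log(np.maximum(density, 1e-300)).sum())


rng = np.random.default_rng(0)
# Hypothetical signed distances of one class from a decision hyperplane.
distances = rng.normal(loc=1.0, scale=0.5, size=200)

# Every candidate bandwidth requires another O(n^2) pass over the sample.
candidates = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0])
scores = [loo_log_likelihood(distances, h) for h in candidates]
best_bandwidth = float(candidates[int(np.argmax(scores))])

print("selected bandwidth:", best_bandwidth)
```

A fixed rule such as voting needs none of this work, whereas here the search is repeated for every estimated density, which is the source of the overhead listed among the disadvantages.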

VI. CONCLUSION
In this article, a new method of combining linear classifiers has been proposed. Outputs of the base classifiers constituting the ensemble are combined via the potential functions. Two potential functions based on class-conditional probabilities have been developed. One of them ignores class prior probabilities. The proposed methods have been compared to four reference methods. The comparison was done in terms of six different quality criteria. The experiments were conducted using a large set of benchmark datasets (70 benchmark sets).
The experimental evaluation shows that the potential function that ignores prior probabilities outperforms most of the reference methods in terms of macro-averaged quality criteria. It means that the method is significantly better at recognizing minority class objects.
In this study, a simple bagging approach was used to build a diverse set of base classifiers. This simple yet effective method allowed us to achieve quite interesting results. Nevertheless, we believe that an ensemble building scheme tailored to the proposed potential function will improve the classification quality achieved by the proposed method. Future research should be aimed at this issue.