1 Introduction

K Nearest Neighbors (kNN) is an intuitive, distance-based method for classification. The outcome class of observation i is predicted by the most frequently occurring class among its k nearest neighbors in the feature space. A small value of k yields predictions with low bias but carries the risk of instability (Hassanat et al. 2014); a large k yields more stable but potentially biased predictions. kNN’s nonparametric nature and lack of assumptions make it suitable and effective in many settings (Hastie et al. 2001). Although kNN can be used for both binary and multiclass classification, in the current manuscript we focus on binary classification.

kNN is, however, not without its drawbacks; we point out two of them. First, kNN’s prediction accuracy deteriorates as the number of features grows, a manifestation of the curse of dimensionality (Hastie et al. 2001). Second, the kNN classifier does not allow for the identification or interpretation of the effects of individual features or pairs of features.

Past research focused on improving the predictive accuracy by creating an ensemble of classifiers, composed of random subsets of features or some optimal selection of features (Bay 1999; Domeniconi and Yan 2004; Gul et al. 2016; Li et al. 2011; Zhou and Yu 2005). For an ensemble method to work well, the separate models need to be both diverse and accurate (Bay 1999). kNN, however, is a stable method when it comes to small changes in the input data, especially for large k.

In bagging, bootstrap samples are taken from the data and a classifier is fitted on each bootstrap sample. These classifiers are then used to create predictions by either using their average, a majority vote, or some other decision rule. Breiman (1996) identified kNN’s stability as the limiting factor for its performance with bagging. Bay (1999) reported better performance for bagging when using a random subset of the features without resampling. Domeniconi and Yan (2004) noted that using a subset of features carries the risk of including non-informative features or discarding informative ones, and for that reason make use of feature relevance. Gul et al. (2016) combined the use of random feature subsets with resampled versions of the data.

Boosting is an iterative algorithm that gives more weight to data points that are difficult to classify. This is done by either directly weighting the input data, or by weighted resampling (Dietterich 2000). Weighting the data points does not improve the situation for kNN, however, because the classification of a data point depends on its neighbors and never on its own weight (Bay 1999). García-Pedrajas and Ortiz-Boyer (2009) proposed two methods in which not the points are weighted, but the input space is modified in such a way that difficult data points become easier to classify. Neo and Ventura (2012) modified the boosting-by-weighting algorithm such that the influence of the k nearest neighbors, rather than the weight of the point itself, is adapted by warping the distance function.

In stacking, the out-of-sample predictions of multiple base-learners are the input for a meta-learner to produce the final predictions. The stacking method originated with Wolpert (1992), who suggested that using predictions of classifiers holds more information than the original input itself. Breiman (1996) proposed a regression meta-learner, and noted that using the cross-validated base-learner predictions is necessary to avoid over-fitting. The literature lacks studies of stacking solely kNN models, but using a combination of support vector machines, kNN, and random forest is common (Mirończuk and Protasiewicz 2019; Wang et al. 2019; Yadrintsev and Sochenkov 2019).

In this paper, we propose SUBiNN, a Stacked Uni- and Bivariate Nearest Neighbors classifier. SUBiNN is an ensemble of kNN classifiers that reduces dimensionality by using only uni- and bivariate kNN classifiers, and allows the relevance of individual features and their pairwise interactions to be interpreted by using the Lasso as a meta-learner. Training the base-learners on different subsets of features creates diversity between classifiers, which facilitates the improving effect that stacking can have. The meta-learner then assigns an importance to each base-learner in the form of a coefficient, which can be used to filter out uninformative base-learners.

This paper is organized as follows. In the next section, we outline SUBiNN and the choices we made for its implementation, and we show a small empirical application. In Sect. 3, we provide a pilot study on the choice of k within SUBiNN. In Sect. 4, we describe four simulation experiments to test the classification performance of SUBiNN and compare it with other nearest neighbor based classifiers as well as Random Forests and the Support Vector Machine. In Sect. 5, we compare SUBiNN with the same classification methods on a large set of benchmark data sets, discussing the results in terms of classification performance as well as feature selection. We conclude with a general discussion.

2 SUBiNN

Stacking involves base-learners that generate cross-validated predictions and a meta-learner that takes these predictions and combines them into a final predictive model. Wolpert (1992) noted that in an ideal situation the base-learners are mutually orthogonal. Opitz and Maclin (1999) noted that the base-learners have to be both diverse and accurate for the ensemble to have an improving effect. In this paper, we use kNN base-learners and train each of them on a single feature or a pair of features, so that each base-learner uses different information. Nonnegative lasso regression is used as the meta-learner.

We denote by \({\mathbf {y}}\) the n-vector with binary outcomes, \(y_i \in \{0,1\}\) for \(i = 1,\ldots ,n\). Furthermore, let \({\mathbf {X}}\) be the \(n \times P\) matrix of features or predictor variables with elements \(x_{ip}\) for \(p = 1,\ldots , P\).

2.1 Base-learners: kNN

Because distances depend on the scaling of the original variables, we first standardize the features to have zero mean and variance 1. On each combination of features \(X_p\) and \(X_q\), with \(p = 1,\ldots , P\) and \(q = p, \ldots , P\), we apply kNN with 10-fold cross-validation. If \(p = q\), we fit a univariate classifier, leading to a main effect. If \(p \ne q\), we fit a bivariate classifier, which indicates a pairwise interaction effect. As distance measure we use the Euclidean distance. The choice of k is important for the performance of nearest neighbor classifiers and is a manifestation of the bias-variance trade-off. A small k will lead to unbiased but variable predictions, while a large k will lead to biased but stable predictions. The optimal value of k depends on many factors such as the sample size, the covariance structure of the predictors, and the class proportions (Enas and Choi 1986).

Using 10-fold cross-validation with a nearest neighbor classifier, we obtain for every combination of p and q a probabilistic prediction for every observation in the data, that is, the proportion of neighbors that are in class 1. These cross-validated predictions are collected in the \(n \times R\) matrix \({\mathbf {Z}}\), where \(R = P + P(P-1)/2\). Note that the columns of the matrix \({\mathbf {Z}}\) all have the same range, that is, the probabilities lie in the range 0 to 1.
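To make the construction of \({\mathbf {Z}}\) concrete, the following R sketch collects the 10-fold cross-validated kNN predictions for every single feature and every feature pair. It assumes a standardized feature matrix X and a 0/1 outcome vector y; the random fold assignment and the default \(k = \sqrt{n}\) are illustrative choices, not the authors' exact code.

```r
library(class)  # provides knn()

# Sketch: cross-validated predictions of uni- and bivariate kNN base-learners.
# X: n x P matrix of standardized features; y: 0/1 outcome vector.
subinn_Z <- function(X, y, k = round(sqrt(nrow(X))), nfolds = 10) {
  n <- nrow(X); P <- ncol(X)
  # all combinations (p, q) with q >= p; p == q gives a univariate base-learner
  pairs <- cbind(rep(1:P, times = P:1), unlist(lapply(1:P, function(p) p:P)))
  Z <- matrix(NA, n, nrow(pairs))
  folds <- sample(rep(1:nfolds, length.out = n))
  for (r in seq_len(nrow(pairs))) {
    vars <- unique(pairs[r, ])
    for (f in 1:nfolds) {
      test <- which(folds == f)
      pred <- knn(train = X[-test, vars, drop = FALSE],
                  test  = X[test, vars, drop = FALSE],
                  cl = factor(y[-test]), k = k, prob = TRUE)
      # knn() reports the vote share of the winning class; convert to P(class = 1)
      p_win <- attr(pred, "prob")
      Z[test, r] <- ifelse(pred == "1", p_win, 1 - p_win)
    }
  }
  list(Z = Z, pairs = pairs)
}
```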

2.2 Meta-learner: Lasso regression

The meta-learner takes \({\mathbf {y}}\) and \({\mathbf {Z}}\) as input. Because R might be very large, we need a meta-learner that regularizes. Furthermore, we would like a meta-learner that results in a sparse solution. Therefore we use Lasso regression (Tibshirani 1996, 2011).

As the kNN predictions already have the same range we do not use any further standardization of these variables. Moreover, because the cross-validated predictions are already in the range of the outcome variable, we do not need an intercept in the lasso regression model. Therefore, in the Lasso regression we minimize the following loss function

$$\begin{aligned} {\mathcal {L}}(\beta _1, \ldots , \beta _R) = \sum _i^n \left( y_i - \sum _r^R z_{ir}\beta _r\right) ^2 + \lambda \left( \sum _r^R |\beta _r| \right) , \end{aligned}$$

where we impose a nonnegativity constraint on the regression weights, that is we require \(\beta _r \ge 0\). Breiman (1996) suggested this non-negativity constraint on the coefficients of the regression meta-learner. Further support for this constraint is found in Leblanc and Tibshirani (1996) and Van Loon et al. (2020). For the choice of an optimal \(\lambda \) value we use 10-fold cross-validation.
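A minimal sketch of this meta-learning step uses cv.glmnet from the ‘glmnet’ package; taking lambda.min as the optimal penalty is our assumption, as the text only states that \(\lambda \) is chosen by 10-fold cross-validation.

```r
library(glmnet)

# Sketch: nonnegative lasso regression of y on the base-learner predictions Z,
# without intercept and without further standardization of the columns of Z.
cvfit <- cv.glmnet(x = Z, y = y, family = "gaussian",
                   intercept = FALSE,     # predictions are already on the 0-1 scale
                   standardize = FALSE,   # columns of Z share the same range
                   lower.limits = 0,      # beta_r >= 0
                   nfolds = 10)
beta_hat <- as.vector(coef(cvfit, s = "lambda.min"))[-1]  # drop the (zero) intercept
selected <- which(beta_hat > 0)  # base-learners retained by the meta-learner
```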

Linear regression, as opposed to logistic regression, might seem an odd choice when the outcome variable is dichotomous. We would like to point out that the classification boundaries of a linear regression and a logistic regression are often quite similar. The assumptions of linear regression, such as independent, normally distributed residuals with mean zero and constant variance, are not tenable for a dichotomous outcome. These assumptions, however, mainly influence the standard errors, the resulting statistical tests, and the p-values of a linear regression, not the estimates themselves, nor the predictions. In SUBiNN, we are not interested in standard errors or test statistics for the regression coefficients. Instead, we focus on predictive accuracy (Shmueli 2010). The final predictions may fall outside the [0, 1] range, in which case we simply set them to zero or one. The choice for a linear model instead of a logistic model is also motivated by interpretational issues: the interpretation in terms of a weighted combination on the probability scale is much simpler than on the log-odds scale.

Because we use linear regression as a meta-learner, the final outcome of SUBiNN becomes a weighted combination of the predictions of the kNN base-learners. If an estimated regression coefficient equals zero (i.e., \({\hat{\beta }}_r = 0\)) the corresponding pair of features (p and q) have no effect on the outcome. If \(p = q\), and the regression coefficient for the corresponding base-learner is zero there is no main effect for this feature; if \(p \ne q\) and the regression coefficient for the corresponding base-learner equals zero there is no interaction effect for this pair of features. The magnitude of the lasso regression coefficients provides variable importance measures.
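Continuing the sketch above (with Z, pairs, and beta_hat as defined there), the nonzero coefficients can be mapped back to main and interaction effects as follows.

```r
# Sketch: map nonzero lasso coefficients back to features and feature pairs.
selected <- which(beta_hat > 0)
effects <- data.frame(p = pairs[selected, 1],
                      q = pairs[selected, 2],
                      weight = beta_hat[selected])
effects$type <- ifelse(effects$p == effects$q, "main effect", "interaction")
effects[order(-effects$weight), ]   # importance ordering of the base-learners
```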

2.3 Final model and predictions for new data

After fitting the lasso regression model, we know which main effects and which pairwise interactions are important for predictive purposes (those with \(\beta _r > 0\)) and which are irrelevant (those with \(\beta _r = 0\)). Predictions of the SUBiNN model for a new observation with features \({\mathbf {x}}_+\) use only the selected subspaces. Suppose that only the pairs of variables 1 and 2, and 3 and 7 are selected by the lasso, with estimated regression weights 0.8 and 0.2, respectively. Then we compute the proportion of nearest neighbors with \(y_i = 1\) in the bivariate spaces of predictors 1 and 2, and 3 and 7 for the complete training data; say they are 0.6 and 0.4, respectively. The estimated probability for the new observation to be in class \(Y = 1\) is therefore \(0.8 \times 0.6 + 0.2 \times 0.4 = 0.56\), and the observation would be classified in class \(Y = 1\) (if we use the threshold of 0.5).
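This prediction step can be sketched as the following (hypothetical) helper function, where X, y, and pairs refer to the training data and the feature combinations from the base-learning sketch, and beta_hat to the lasso weights; the new observation is assumed to be standardized with the training means and standard deviations.

```r
# Sketch: SUBiNN prediction for one new, already standardized observation x_new.
predict_subinn <- function(x_new, X, y, pairs, beta_hat,
                           k = round(sqrt(nrow(X)))) {
  active <- which(beta_hat > 0)               # only the selected base-learners
  z_new <- sapply(active, function(r) {
    vars <- unique(pairs[r, ])
    pred <- class::knn(train = X[, vars, drop = FALSE],
                       test  = matrix(x_new[vars], nrow = 1),
                       cl = factor(y), k = k, prob = TRUE)
    p_win <- attr(pred, "prob")
    if (pred == "1") p_win else 1 - p_win     # proportion of neighbours in class 1
  })
  p_hat <- sum(beta_hat[active] * z_new)      # weighted combination of base-learners
  min(max(p_hat, 0), 1)                       # truncate to [0, 1] if necessary
}
```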

2.4 Empirical application

To show the merits of SUBiNN, we apply it here to the classification of patients with a panic disorder on the basis of personality characteristics. For 200 subjects we have information on the big five personality scales neuroticism, extraversion, openness, agreeableness, and conscientiousness, as well as an assessment of whether the person has a panic disorder. The data are a subsample from the study reported in Spinhoven et al. (2009).

The first step in the analysis is to obtain predicted probabilities of \(k\hbox {NN}\) classifiers for each of the personality characteristics and for each pair of the personality characteristics. We use \(k = 14 \approx \sqrt{200}\) for the number of neighbors. The 10-fold cross-validated predictions (probabilities) of \(k\hbox {NN}\) are collected in the matrix \({\mathbf {Z}}\), with 200 rows and 15 (i.e., \(P + P(P-1)/2\) with \(P=5\)) columns. In the second step, this matrix is used as the predictor matrix in a linear lasso regression with non-negativity constraints. We select an optimal value of the penalty parameter of the lasso through 10-fold cross-validation and with this optimal value obtain the estimated regression weights. This whole procedure is repeated 100 times, and we take the median of the 100 regression weights as our final regression weights.

For this data set, four base-learners were selected: the univariate base-learners for the features openness (regression weight 0.33) and agreeableness (regression weight 0.18), and the bivariate base-learners for (1) openness and conscientiousness (regression weight 0.13) and (2) agreeableness and conscientiousness (regression weight 0.30). \(k\hbox {NN}\) classification plots are shown in Fig. 1, where for the univariate base-learners we also use two-dimensional graphs in order to get a clearer picture. Each of the four plots indicates nonlinear decision boundaries.

Fig. 1

Nearest Neighbor classification plots for the four selected base-learners for the panic disorder data. Red points (in black and white: light grey) indicate subjects with panic disorder, while blue points (in black and white: dark grey) indicate subjects without panic disorder. The dots indicate observations, while the background colour indicates predictions. The black triangle indicates a new observation for which we would like to make a prediction. The upper two plots give the univariate base-learners, the lower two plots give the bivariate base-learners

To obtain a prediction for a new person, we only need values of the three predictors openness, agreeableness, and conscientiousness. Such a final prediction is an additive combination of the predictions of the four base-learners. Suppose we have a new person with standardized scores of 2, \(-2\), and 2 on these three features, respectively. This person is also indicated in Fig. 1 by the black triangle. Based on the four base-learners, we obtain estimated probabilities for having panic disorder. These four probabilities are 0.56 (openness), 0.73 (agreeableness), 0.57 (openness and conscientiousness), and 0.71 (agreeableness and conscientiousness). To obtain the final probability of having panic disorder we take a weighted combination of these four probabilities, that is

$$\begin{aligned} 0.33 \times 0.56 + 0.18 \times 0.73 + 0.13 \times 0.57 + 0.30 \times 0.71 = 0.60, \end{aligned}$$

where the weights are the estimated regression coefficients of the Lasso. We would classify this person into the panic disorder class.

3 Pilot study: the choice of k

The effect of k on predictions with kNN is large. Gul et al. (2016) fit their kNN model using an optimal value of k, derived by means of cross-validation. In SUBiNN this would add another layer of cross-validation. A small pilot study was therefore performed to study the effect of k on the predictive performance, base-learner selection, and model execution time.

3.1 Setup

For this experiment we made use of two benchmark datasets: the Dystrophy dataset from the ‘ipred’ package (Peters and Torsten 2019) and the Diabetes dataset from the ‘mlbench’ package (Leisch and Dimitriadou 2010). The Dystrophy dataset consists of 194 observations on 5 features (we used age and the four serum markers); the Diabetes dataset has 768 observations on 8 features.

For k equal to 5, 10, \(\sqrt{n}\), and a cross-validated optimal value, the model’s performance was measured with 10-fold cross-validation over 100 replications. We recorded which base-learners were selected, the model’s execution time, and the prediction accuracy. For k = \(\sqrt{n}\), n refers to the number of observations in the training set within that fold; within each fold, \(\sqrt{n}\) becomes 13 for Dystrophy and 26 for Diabetes. For k = opt, the function tune from the ‘e1071’ package (Meyer et al. 2019) was used, with the range of possible k values set from 1 to 20.
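The four k settings compared in the pilot can be sketched as follows; X_train and y_train are placeholders for the training part of a fold, and tune.knn is the kNN-specific wrapper around tune in the same ‘e1071’ package.

```r
library(e1071)

# Sketch: the four choices of k compared in the pilot study.
k_sqrt <- round(sqrt(nrow(X_train)))                  # 13 (Dystrophy) or 26 (Diabetes)
k_opt  <- tune.knn(x = X_train, y = factor(y_train),
                   k = 1:20)$best.parameters$k        # cross-validated optimum
k_grid <- c(5, 10, k_sqrt, k_opt)
```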

3.2 Results and conclusion

The results in Table 1 show that different values of k cause small differences in accuracy and sometimes large differences in the number of base-learners selected. While one could expect a cross-validated value of k to yield better performance, this does not hold for these datasets. For the Diabetes data set, the difference between k = \(\sqrt{n}\) and k = opt is virtually 0, while opt results on average in more selected base-learners. For the Dystrophy data set, the best k is 10, which also selects on average the smallest number of base-learners. The computational time for tuning k through cross-validation is substantial for both datasets.

Table 1 The average prediction accuracy, number of selected (non-zero coefficient) base-learners and execution time in seconds for SUBiNN

From the results of k = 5, 10, and \(\sqrt{n}\), the effect of k on the selected number of base-learners becomes visible. With a smaller k, the model is less stable and base-learner predictions are more variable between runs. This increases the chance that a base-learner that was fit on a non-informative feature or pair of features is selected.

The likely cause for the non-optimal performance of the optimal k is that this optimal value is unstable: it is calculated within another layer of cross-validation, leaving fewer training samples for fitting. We conclude that the performance gain from tuning k is not worth the added complexity. For that reason, in the remainder of this manuscript we use \(k = \sqrt{n}\). It is less arbitrary than choosing 5 or 10 because it depends on the sample size of the data, creates more stability in the predictions, and results in a smaller number of selected base-learners.

4 Simulation studies

In this section, we report on four simulation studies. The first two replicate those in Gul et al. (2016), with SUBiNN added. Simulation experiments three and four are adaptations of the first two.

In these simulation studies, we study the classification and feature selection performance of SUBiNN and compare it with other nearest neighbor based classifiers, that is, kNN, bagged kNN (BkNN), random kNN (RkNN), Multiple Feature Selection (MFS), and the ensemble method proposed by Gul et al. (2016), ESkNN. In our comparison, we also take two other, well-known, classifiers into account: Random Forests (RF) and the Support Vector Machine (SVM). Both often have high classification performance but are also considered black box techniques. We chose these two methods because they require minimal tuning in comparison to, for example, gradient boosted trees or extreme gradient boosted trees; Bentéjac et al. (2021), for example, conclude that with default settings the performance of these boosted tree methods is inferior to Random Forests. We would like to know whether the nearest neighbor classifiers, and especially our SUBiNN approach, are competitive with these. The models are compared in terms of their classification performance. For SUBiNN, we also analyze its feature selection capabilities.

Two-class data are generated to include both non-informative features and informative features with a varying covariance and correlation structure. The inclusion of non-informative features in the first and third experiments is a means to test a model’s robustness with respect to noise input features. Adapting the covariance structure of the informative features for one of the two classes in the second simulation experiment allows for the investigation of the effect of an increased difference in variances between classes. Lastly, the inclusion of a varying correlation structure between features in the fourth experiment is meant to show the effect of a difference in correlation structure between the two classes while keeping the variances the same.

4.1 Data generation for main simulation experiments

For the first two experiments the data generation follows the set up described in Gul et al. (2016). The last two experiments change the covariance structure of one of the classes.

For all experiments we generate 20 informative features from a multivariate normal distribution. For class 1 these features have a mean of 2 and a specified covariance matrix, \(N({\mathbf {2}}, \varvec{\Sigma })\), where the definition of \(\varvec{\Sigma }\) varies for the four simulation experiments (see below). Class 2 has a mean of 1 and constant variance 1, \(N({\mathbf {1}}, {\mathbf {I}})\), where \({\mathbf {I}}\) is the identity matrix. The data generation model therefore follows that of a quadratic discriminant analysis, where the covariance matrices differ per class, and lead to nonlinear classification boundaries. Non-informative features are drawn from a standard normal distribution, irrespective of the outcome class. The sample size is 1000 for all simulation experiments.

In the first part (experiments 1 and 2), the covariance matrix as specified by Gul et al. (2016) is used, \(\varvec{\Sigma } = w\varvec{\Psi }\), where \(\varvec{\Psi }\) has elements \(\psi _{p,q} = (1/2)^{|p - q|}\) for \(p, q = 1, \dots , 20\).

This covariance matrix follows an autoregressive structure, where the covariance between features declines with the distance between features. In the first experiment we use \(w=1\), and vary the number of non-informative features (0, 50, 100, 200 and 500). In the second experiment, the number of uninformative features is fixed at 50, and w is taken as 3, 5, 10, 15, and 20. All variances and covariances for the features determining class 1 are multiplied by w.
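A sketch of this data generating mechanism is given below. The 500/500 class split is our assumption; the text only fixes the total sample size at 1000.

```r
library(MASS)  # provides mvrnorm()

# Sketch: data generation for experiments 1 and 2.
n_class <- 500   # assumption: equal class sizes, 1000 observations in total
P_inf   <- 20    # informative features
w       <- 1     # w = 1 in experiment 1; 3, 5, 10, 15 or 20 in experiment 2
n_noise <- 50    # number of non-informative features (varies per condition)

Psi <- outer(1:P_inf, 1:P_inf, function(p, q) 0.5^abs(p - q))   # psi_pq = (1/2)^|p-q|

X1 <- mvrnorm(n_class, mu = rep(2, P_inf), Sigma = w * Psi)     # class 1: N(2, w * Psi)
X2 <- mvrnorm(n_class, mu = rep(1, P_inf), Sigma = diag(P_inf)) # class 2: N(1, I)
X  <- rbind(X1, X2)
y  <- rep(1:0, each = n_class)

# non-informative features: standard normal, independent of the class
X <- cbind(X, matrix(rnorm(2 * n_class * n_noise), ncol = n_noise))
```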

The data generation mechanism in experiment 1 without the non-informative features represents a worst case scenario for our SUBiNN model, in the sense that there is relatively good separation between the two classes in 20-dimensional space but more and more overlap between the classes in lower dimensional subspaces. Because SUBiNN works with one- and two-dimensional subspaces, this will have a detrimental effect. Other feature selection nearest neighbor classifiers (such as RkNN and MFS) will also perform worse than kNN and BkNN, which use all features. In Fig. 2 we show two bivariate scatterplots for data generated in this experiment. In the left-hand plot we show ‘adjacent’ features, where the correlation between the two features is relatively high in class 1, but low in class 2. In the right-hand plot we show a scatterplot of features 1 and 20, the features that have the lowest correlation in class 1, similar to the correlation between the features in class 2.

We expect that adding non-informative features does not have a large influence on SUBiNN, because separation between the classes is difficult in the two-dimensional spaces based on non-informative features. Therefore we expect the meta-learner to exclude those base-learners. In contrast, the non-informative features will have a negative effect on the classification performance of kNN and BkNN.

Fig. 2

Scatterplots of informative features 1 versus 2 (left hand side) and 1 versus 20 (right hand side) for the data generation model from experiment 1 (where \(w = 1\)), showing that in the low dimensional spaces there is quite some overlap between the observations in class 1 (indicated by \(\bullet \)) and class 2 (indicated by \(\blacktriangle \)). The covariance between features 1 and 2 for class 1 is much higher than the covariance between features 1 and 20

In experiment 2, changing w alters the variances and covariances in class 1; both are multiplied by this constant. That also means that the covariances that are virtually zero (for example, between feature 1 and features 13 to 20) remain virtually zero for all w. Furthermore, by increasing the variances, the observations of class 2 end up more and more in the middle of the observations of class 1. See Fig. 3 for scatterplots of features 1 and 2 and features 1 and 20 for this experiment. In this case, the increased variances become meaningful, and by increasing the variances of class 1 separation becomes possible. So, with increasing values of w we expect SUBiNN to have increased classification performance. We do not expect the meta-learner to favor any pair of features from the informative ones; as can be seen in Fig. 3, separation between the classes is equally good in the two-dimensional space of features 1 and 2 as in the two-dimensional space of features 1 and 20.

Fig. 3

Scatterplots of informative features 1 versus 2 (left hand side) and 1 versus 20 (right hand side) for the data generation model from experiment 2 where \(w = 20\), showing that in the two-dimensional subspaces the observations in class 1 (indicated by \(\bullet \)) and class 2 (indicated by \(\blacktriangle \)) can be quite well distinguished. The variance of the features in class 1 is the determining factor. The covariance between features 1 and 2 for class 1 is much higher than the covariance between features 1 and 20

With \(\varvec{\Sigma } = w\varvec{\Psi }\), increasing w increases both the variances of the features and the covariances between features, causing the correlations to remain roughly constant across values of w. In contrast with what Gul et al. (2016) state in their article, generating data in this manner with varying w does not result in a different correlation structure between variables, neither within the first class nor when pooling the two classes together.

Therefore, we designed a second set of experiments (experiments 3 and 4) where the correlation structure between the informative features is manipulated while keeping the variances of the informative features constant. For experiments 3 and 4, the covariance matrix of class 1 is adapted, and \(\varvec{\Sigma } = \varvec{\Phi }\), with elements \(\phi _{p,q} = 0.99^{w|p - q|}\).

The third experiment again introduces different numbers of non-informative features (0, 50, 100, 200, 500), with \(w = 1\). In the fourth and last experiment, the correlation structure of class 1 is adapted by setting w to 3, 5, 10, 15, or 20. At \(w = 1\), all features are extremely collinear; at \(w = 20\), the correlation between any two adjacent features is 0.82 and virtually 0 between the features whose indices are furthest apart.
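This correlation matrix for class 1 can be constructed as in the sketch below; the two printed entries illustrate the values just mentioned.

```r
# Sketch: correlation structure of class 1 in experiments 3 and 4.
P_inf <- 20
w <- 20   # w = 1 in experiment 3; 3, 5, 10, 15 or 20 in experiment 4
Phi <- outer(1:P_inf, 1:P_inf, function(p, q) 0.99^(w * abs(p - q)))
Phi[1, 2]      # adjacent features: 0.99^20, approximately 0.82
Phi[1, P_inf]  # features furthest apart: virtually 0
```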

In experiment 3, we again have the same situation as in experiment 1, where the separation between the two classes is good in 20-dimensional space. In the two-dimensional subspaces, the separation of the classes is created by the correlation between adjacent features (see Fig. 4). That is, features 1 and 2 have a correlation of 0.99 in class 1, whereas this correlation is 0 in class 2. Most information for distinguishing the two classes is therefore found in adjacent features, and we expect the meta-learner to select base-learners based on adjacent features. Like before, we do not expect non-informative features to have much influence on the classification performance of SUBiNN.

Fig. 4

Scatterplots of informative features 1 versus 2 (left hand side) and 1 versus 20 (right hand side) for the data generation model from experiment 3, showing that the observations of class 1 (indicated by \(\bullet \)) and 2 (indicated by \(\blacktriangle \)) can be distinguished reasonably well. The correlation between the features is the determining factor. This correlation is stronger for adjacent features (left hand side plot) than for features far apart (right hand side plot)

In experiment 4, we vary the value of w in the data generating mechanism. Since the correlations become smaller with increasing w, we expect the classification performance of SUBiNN to decrease: with increasing values of w it becomes more difficult to distinguish the two classes in the two-dimensional subspaces (compare Fig. 4 with Fig. 5).

Fig. 5

Scatterplots of informative features 1 versus 2 (left hand side) and 1 versus 20 (right hand side) for the data generation model from experiment 4 with \(w = 20\), showing that with increasing value of w it becomes more difficult to distinguish the two classes (compare to Fig. 4, which shows the case for \(w = 1\)). Class 1 is indicated by \(\bullet \) and class 2 by \(\blacktriangle \)

For each experiment we use 100 replications. Data sets with sample size 1000 were generated and partitioned into 10 sets. The eight models are subsequently fitted on 9 of these sets, including all tuning and possible internal cross-validation, and out-of-sample predictions are made on the left-out part. In total, we thus have 100 (replications) times 10 sets of model estimates and corresponding prediction errors. As the measure of prediction error we use the misclassification rate; the cross-validated prediction errors of all replications are averaged to obtain each model’s final misclassification rate. Furthermore, for each of the 1000 results per condition we record how often a particular feature or pair of features is selected by SUBiNN’s meta-learner.
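The evaluation scheme for a single replication can be sketched as follows; fit_and_predict is a hypothetical wrapper standing in for any of the eight classifiers, and the fold-wise misclassification rates are averaged over folds and, in the full study, over the 100 replications.

```r
# Sketch: one replication of the 10-fold evaluation scheme.
folds <- sample(rep(1:10, length.out = nrow(X)))
errors <- sapply(1:10, function(f) {
  test <- which(folds == f)
  pred <- fit_and_predict(X[-test, ], y[-test], X[test, ])  # hypothetical wrapper
  mean(pred != y[test])                                     # misclassification rate
})
mean(errors)
```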

4.2 Software implementations

R version 3.6.1 was used (R Core Team 2019). Random samples from the multivariate normal distributions were drawn using the function mvrnorm from the package ‘MASS’ (Venables and Ripley 2002). Any added non-informative features are drawn from a standard normal distribution, using base R’s rnorm function.

For the implementation in R of the seven comparison models, we follow Gul et al. (2016). For simple kNN, we used the function knn from the package ‘class’ (Venables and Ripley 2002). BkNN is implemented by fitting 1001 of these kNN models, where the input data are sampled with replacement; the final prediction is a majority vote of the 1001 outcomes. For these two models we used \(k = \sqrt{n}\).
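The BkNN implementation can be sketched as follows; X_train, y_train, and X_test are placeholders for the training and test parts of a fold.

```r
library(class)

# Sketch: bagged kNN with 1001 bootstrap resamples and a majority vote.
k <- round(sqrt(nrow(X_train)))
votes <- replicate(1001, {
  b <- sample(nrow(X_train), replace = TRUE)           # bootstrap sample
  pred <- knn(X_train[b, , drop = FALSE], X_test,
              cl = factor(y_train[b]), k = k)
  as.integer(as.character(pred))                       # back to 0/1 labels
})
bknn_pred <- as.integer(rowMeans(votes) > 0.5)         # majority vote over 1001 models
```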

For RkNN and MFS we used the rknn function from the ‘rknn’ package (Li 2015). The parameters of importance are again k; r, the number of models fitted (again taken to be 1001); and mtry, the number of random features drawn at each fit. Following Gul et al. (2016), we used a subset size of one third of the number of features, with a minimum of 2. For MFS we fit 1001 kNN models using a random draw of the features with replacement (again one third of the total number of features, with at least 2) and take the majority vote of the 1001 results.

For RF we used the function best.randomForest from the package ‘e1071’ (Meyer et al. 2019), which in turn uses the function randomForest from the package ‘randomForest’ (Liaw and Wiener 2002). The best.randomForest function performs parameter selection without a range specification, using 10-fold cross-validation, implemented with the argument tunecontrol where sampling = ‘cross’ and cross = 10.

SVM is implemented using the ksvm function from the package ‘kernlab’ (Karatzoglou et al. 2004), where the kernel is set to rbfdot and kpar is set to automatic to allow for automatic optimal parameter selection.

For ESkNN we made use of the package ‘ESkNN’ (Gul et al. 2015). In the first stage, 1001 models are fitted, with the number of random features taken to be 1/3 of the total number of features and the percentage of models to be selected is set to 40%. This function then returns the best 40% of models which can be used for prediction with the predict.esknnClass function.

SUBiNN is implemented as specified in the previous section. All base-learners are kNN models fitted using the package ‘class’ (Venables and Ripley 2002), fit on all single features and pairwise combinations of features. The predictions produced by these base-learners are used as input for the Lasso meta-learner, using cv.glmnet from the package ‘glmnet’. The predict function and the glmnet object are used to generate predictions for the test samples.

R code for all our analysis can be found on the github website of the first author (https://github.com/TElsten/subinn).

4.3 Results

4.3.1 Experiment 1: adding noise

For this experiment, we evaluate SUBiNN by comparing its accuracy and execution time to those of the other models. Additionally, the stability of base-learner selection is taken into consideration. The results with respect to classification performance are shown in Table 2.

Table 2 Average misclassification rate (mean) and standard deviations (Std) of the 8 methods for data with 20 informative and a varying number of added noise features

When looking at the performance of SUBiNN, two things become apparent. First, the prediction accuracy compared to the other methods is poor. Second, adding non-informative features has no effect on the prediction accuracy of SUBiNN. With regard to the first outcome, as alluded to in Sect. 4.1, the data generating mechanism is not beneficial for SUBiNN. The reason is that in each of the two-dimensional subspaces the classes are not well separable, while separation becomes easier when more informative features are used simultaneously. The fewer dimensions are used, the larger the overlap between the two clouds of class observations. None of the base-learners will thus produce very good predictions.

We see that the other nearest neighbor classifiers that work on subsets of features (RkNN, MFS, and ESkNN) also perform worse than the classifiers that use all features (kNN and BkNN). The reasoning is the same: separation is easier in higher dimensional spaces in this case. The performance of RF and SVM without non-informative features is comparable to that of kNN and BkNN.

With regard to the second outcome, SUBiNN is able to filter out the non-informative base-learners in the meta-step and is unaffected by the introduction of more noise. This appears to be a main strength of SUBiNN. Whereas SUBiNN’s classification performance remains the same irrespective of the number of non-informative features, the performance of all other classification methods deteriorates when non-informative features are included.

On average, the meta-learner used 22 base-learners, but it never included a non-informative feature. Over the replications, the number of selected base-learners fluctuated only slightly, and this fluctuation was unrelated to the number of non-informative features. Figure 6 shows the number of times each informative base-learner was used for \(w = 1\) and 500 non-informative features. Recall that we have 1000 results per condition (100 replications times 10 folds), so the maximum frequency in the cells is 1000. The figure does not include the half-informative (pair of an informative and a non-informative feature) or non-informative base-learners because they were never selected by the meta-learner.

Fig. 6

Frequency of base-learner selection with low covariance data, \(w = 1\), 500 uninformative features

As shown by the diagonal of Fig. 6, none of the single-feature base-learners was ever selected by the meta-learner. This is not surprising, because these one-dimensional features carry less information that distinguishes the two classes; as before, the lower the dimensionality, the more overlap between the classes. There is no clear pattern in which base-learners are selected. Adjacent features (with correlation 0.5 within class 1, but zero in class 2) seem to play a role, as do features with a small correlation (i.e., pairs 4 or more apart, with a correlation smaller than \(0.5^4 \approx 0.06\) within class 1). From Fig. 6, it is also clear that the feature selection was unstable between runs: the 22 base-learners selected on average are spread across the 190 possible pairs with no clear preference.

A major downside of the SUBiNN model is its execution time (see Table 3). The addition of 500 non-informative features resulted in \(\left( {\begin{array}{c}520\\ 2\end{array}}\right) + 520\) = 135,460 potential base-learners, a massive number. On the other hand, the final predictions can be made rapidly using only 22 base-learners.

Table 3 Average execution time in seconds of classifying data with varying number of added non-informative features

4.3.2 Experiment 2: varying covariance structure

Table 4 shows the classification results with increased variance and covariances (increasing values of w) and 50 uninformative features.

The SUBiNN model performs better when the data for class 1 are generated with higher values of w. This was as expected: increasing the variances and covariances for class 1 while keeping the variance for class 2 constant simply increases the differences between the two classes, making them more easily distinguishable.

SUBiNN already performs better than the other nearest neighbor classifiers when \(w=3\) and the differences become larger with larger values of w. We see that for the kNN and BkNN classifiers classification accuracy deteriorates with increasing w. For all nearest neighbor methods that take a subset of the features, performance increases with increasing w. SUBiNN outperforms the SVM when w becomes large (\(\ge 10\)). Random Forest (RF) is the best classifier for this situation.

Table 4 Average misclassification rate (Mean) and standard deviations (Std) of the 8 classification methods for data with 50 uninformative features and increasing variance/covariance of features
Fig. 7

Frequency of base-learner selection with high covariance data, \(w = 20\), 50 uninformative features

In Fig. 7 we show the number of times each informative base-learner was selected for the situation where \(w=20\). The model has a preference for feature pairs with low covariance in class 1 (the right-hand side rather than the left-hand side of Fig. 3). The feature pairs with the highest covariance, for instance the pairs 1-2, 2-3, etc., are not often selected. Again, SUBiNN did not select single-feature base-learners, because a single feature cannot capture the quadratic nature of the decision boundaries. The meta-learner makes use of more base-learners as w increases, ranging from on average 22 at \(w = 1\) to 36 at \(w = 20\). Again, no uninformative base-learners were ever included by the meta-learner. Figure 7 shows that the most favoured base-learners are selected up to 300 out of 1000 times.

4.3.3 Experiment 3: correlation and noise

In this experiment we generated data with highly collinear features in class 1 and increasing numbers of non-informative features. The results are shown in Table 5.

Table 5 Average misclassification rate (Mean) and standard deviation (Std) of the 8 classification methods for data with 20 informative features with high correlation (\(w = 1\)), and an increasing number of uninformative features

As in experiment 1, SUBiNN is insensitive to the non-informative features: the classification performance is equally good regardless of the number of non-informative features. Furthermore, the classification performance of SUBiNN is better than that of all other nearest neighbor classifiers. Comparing SUBiNN with SVM, we see that in the case of no uninformative features SVM performs better, but with uninformative features SUBiNN outperforms the SVM. A similar pattern can be observed when comparing SUBiNN to RF, although in that case the number of non-informative features needs to be large (\(\ge 200\)) before SUBiNN outperforms RF.

Figure 8 shows that the meta-learner preferred base-learners trained on the pairs of features that are most highly correlated in class 1 (cf. Fig. 4). SUBiNN used on average 21.5 base-learners. The non-informative base-learners were never included in the final model, and, as before, no base-learner fit on a single feature was ever included.

Fig. 8

Frequency of base-learner selection with highly correlated data, \(w = 1\), 500 uninformative features

Table 6 Average misclassification rate (Mean) and standard deviation (Std) of the 8 models for data with 20 informative features of decreasing correlation as w increases, and 50 uninformative features

4.3.4 Experiment 4: varying correlation structure

Table 6 shows the results for different levels of feature correlation, which decrease with increasing w. The number of uninformative features is kept constant at 50. SUBiNN outperforms all other nearest neighbor classifiers in most situations; the performances become roughly equal when \(w = 20\). SUBiNN outperforms the SVM for all w, except \(w = 20\). The performance of SUBiNN is comparable to that of RF, although at larger values of w RF seems to do a little better.

Similar to experiment 3, SUBiNN has a preference for base-learners consisting of pairs of features with the highest correlation; see Fig. 9, which shows the frequency of base-learner selection at \(w = 20\). Comparing the number of base-learners selected by SUBiNN across conditions, we see that as the correlation decreases the number of base-learners also decreases, ranging from on average 21 when \(w = 1\) to on average 14 when \(w = 20\).

Fig. 9

Frequency of base-learner selection with low-correlation data, \(w = 20\), 50 uninformative features

4.4 Conclusion

The four simulation experiments with a varying number of added uninformative features and a varying covariance and correlation structure have shown that, under the right conditions, SUBiNN can perform well. While in experiment 1 SUBiNN performs badly, in experiments 2, 3 and 4 SUBiNN outperformed the other nearest neighbor classifiers. Furthermore, SUBiNN successfully filters out all uninformative features, which is a great advantage; this result is probably due to the choice of a relatively high value for k. Compared to SVM, SUBiNN performed better in experiments 3 and 4 and also in experiment 2 when \(w \ge 10\). Compared to RF, SUBiNN often performs worse, except for experiment 3 with large numbers of non-informative features.

5 Benchmark data study

We used 22 benchmark data sets to compare the same set of models as in the simulation studies. The eight models are compared on their classification accuracy and feature selection capabilities.

5.1 Data

Table 7 specifies each dataset’s sample size, number of features, and the resulting number of SUBiNN base-learners. For all datasets, cases with missing values were omitted. Furthermore, in accordance with the procedure of Gul et al. (2016), at most 1000 observations were used; when the full dataset was larger, 1000 rows were sampled at random.

Categorical and nominal features were transformed into numerical features. Nominal features with more than 2 levels were transformed into indicator variables, one for each level of the feature. The Cylinder Bands dataset forms an exception: for this dataset, the nominal features with more than 2 levels were omitted, following Gul et al. (2016).

Table 7 Specification of benchmark datasets

5.2 Models

The software implementation of the eight models is analogous to the implementation in the simulation studies. All eight models are fit on the 22 benchmark datasets by means of 10-fold cross-validation, for 100 replications.

5.3 Results

Table 8 shows the cross-validated classification error of the models on the 22 benchmark datasets. The last ‘B-learners’ column denotes the average number of base-learners that were used by SUBiNN’s meta-learner. In the following we first discuss the classification accuracy and thereafter provide more detailed results about the selection of base-learners.

When we compare SUBiNN with the other nearest neighbor classifiers, we see that for 8 data sets it performs best, for 9 data sets its performance is intermediate, and for 5 data sets it performs worst. We ordered the rows in Table 8 into these three groups, such that in the upper part the data sets appear for which SUBiNN performs better than the other nearest neighbor methods, in the middle part its performance is intermediate, and in the lower part SUBiNN performs worse than the other nearest neighbor classifiers. The three parts are separated by the dashed lines.

Table 8 Misclassification rate (with standard deviations in smaller font) of the 8 classification methods on different benchmark datasets

For the 8 data sets on which SUBiNN performed best among the nearest neighbor classifiers, RF performs a little better on 7 of the 8, although in 5 cases the differences are within one standard deviation. For 1 data set, the Glaucoma dataset, SUBiNN outperforms all other methods, including RF.

For the 5 data sets on which SUBiNN performed worst among the nearest neighbor classifiers, RF performed slightly worse in one case. One of the five datasets is the solar data, for which all classification methods perform essentially equally well. Only on the Twonorm and Cylinder Bands datasets did SUBiNN perform substantially worse than the other models. The Twonorm dataset is a simulated data set with characteristics similar to the datasets used in the first simulation experiment: there is relatively good separation in the high dimensional space, but quite some overlap of the classes in every two-dimensional subspace. SUBiNN also makes considerably more errors than the other models on the Cylinder Bands dataset. This dataset has a large number of nominal features, which appears problematic for SUBiNN because there is not much information in the two-dimensional subspaces of this problem.

The only dataset for which SUBiNN obtains a considerably better performance is the Glaucoma dataset. It does so while using on average only 5.6 base-learners, a great reduction in dimensionality from the complete dataset of 66 features. kNN and ESkNN have the highest misclassification rate for this dataset. An explanation can be found in Peters et al. (2003), who used this dataset in their study of the applicability of indirect variables in classifying glaucoma. The dataset contains three such indirect variables, which are derived from clinical expert judgement and prior analysis, along with 63 direct measurement variables. For SUBiNN, these features resulted in 2211 potential base-learners. In almost every run, base-learners (64), (64, 66), and (65, 66), which correspond to the indirect measurements, were used in the final model. The meta-learner barely made use of any of the direct measurement variables but chose combinations of the three indirect variables. Note that for this data set the univariate base-learner for feature 64 was selected in every run. So, whereas in the simulation study univariate base-learners were never selected, application of SUBiNN to this data set shows that the univariate base-learners can indeed contribute to the final model.

Considering the choice of base-learners more generally, it seems that when SUBiNN performs relatively well, it uses a small number of base-learners, as can be seen in the last column of Table 8.

6 Conclusion and discussion

In this paper we proposed SUBiNN, a stacked uni- and bivariate nearest neighbor classifier. Researchers are often interested in the effects of predictor variables/features on an outcome variable or in the interaction effect of features on an outcome variable. Usually, such main effects and interaction effects are included in a logistic regression model. The logistic regression model, however, assumes that these effects are linear effects which is a rather strong assumption. Nearest neighbor analysis imposes no functional form, but does not have the capability to distinguish main and interaction effects. Moreover, nearest neighbor analysis is sensitive to the dimensionality of the feature space. In SUBiNN we built an ensemble method by combining nearest neighbor analyses in one and two-dimensional subspaces, thereby overcoming the curse of dimensionality and at the same time enabling identification of main and pairwise interaction effects without imposing a functional form on these effects.

In this respect, SUBiNN works similarly to generalized additive models that also focus on sums of one and possible two-dimensional smoothers (Hastie and Tibshirani 1990; Wood 2017). In SUBiNN, we finally have a sum of one and two-dimensional nearest neighbor classifiers, where the number of neighbors can be used to control the smoothness in a way.

In nearest neighbor classifiers the number of neighbors, k, is important. Small values of k lead to unbiased models with high variance, while high values of k lead to biased but stable models. In a small experimental study we investigated the choice of k for SUBiNN and found no large differences in classification performance. Therefore, we chose to use \(k = \sqrt{n}\) in the simulation studies and the benchmark study. When applying SUBiNN to an empirical dataset, a researcher might choose to use k as a tuning parameter, fitting the whole procedure with different values of k and selecting the one with the best classification performance. Another strategy is to define an even larger set of base-learners, where for every combination of features p and q base-learners are trained with varying values of k. The meta-learner then makes a selection out of this enlarged set of base-learners.

In the simulation studies, we compared SUBiNN to other nearest neighbor classifiers and to two methods that usually classify very accurately, Random Forests and Support Vector Machines. We generated data following a quadratic discriminant analysis model where we varied the covariance matrices and the number of uninformative features. One strong result for SUBiNN is its robustness with respect to noise features: SUBiNN never selected a base-learner involving an uninformative feature. In comparison to other NN classifiers, SUBiNN performed relatively well if there exist pairs of features that separate the observations of the two classes quite well, as in experiments 2 and 3. When the separating information is mainly available in higher dimensional subspaces, the performance of SUBiNN is not that good (as in experiment 1).

The comparison in terms of classification performance of SUBiNN with RF and SVM points out that overall RF is the best classifier. There have been further improvements on ensembles of classification trees that we did not study, such as (extreme) gradient boosted trees. They might be somewhat better in terms of classification performance, but require more tuning in order to outperform RF, and such tuning also carries the risk of overfitting. RF and other classification tree ensembles, however, are often considered black-box techniques (similar to SVM). Recently, there have been advances in developing interpretational tools for RF and other black box techniques (Sies and Van Mechelen 2020; Ribeiro et al. 2016). Others have adapted ensemble methods with trees to increase interpretability (Meinshausen 2010) and searched for optimal tree ensembles (Khan et al. 2020, 2021). We note that both in our simulation experiments and in the benchmark study, there was at least one dataset or condition for which SUBiNN outperformed RF.

A clear disadvantage of SUBiNN is its considerable execution time (see Table 3) when the number of (uninformative) features increases. As a result of the sparsity created by the Lasso, this long execution time only applies to the training stage and not to the final task of predicting the outcome. High performance computing can be used to parallelize computations, that is, every base-learner can be trained on a different computer or core. Methods to approximate the kNN distance mapping could also be implemented, such as Locality Sensitive Hashing (Andoni and Indyk 2008).

We focused on binary classification. Many classification problems have more than two classes, and \(k\hbox {NN}\) is well suited to such situations. To generalize SUBiNN to multiclass classification, the only thing needed is a good meta-learner. Two options are regularized linear discriminant analysis (Clemmensen et al. 2011; Trendafilov and Jolliffe 2007) or polytomous logistic regression with a lasso-type penalty (Friedman et al. 2010). Further research is needed.

The choice to use only univariate and bivariate base-learners stems from the wish to identify the main and interaction effects of the input features. Base-learners could be fit on subsets of three, four, or more features, thereby introducing three-way, four-way, and higher-order interactions of features. However, in practice three-way interaction terms are often difficult to interpret, let alone interactions of a higher order. Moreover, two-dimensional subspaces with classification boundaries can be visualized relatively easily, which becomes more difficult for three- or higher-dimensional subspaces.

In all our analyses we used the Euclidean distance to find the neighbors. The Euclidean distance is a simple choice, but might not be optimal in all situations. Other distance measures have been developed; see Cox and Cox (2000) for many types of proximity measures for different kinds of features and Gower (1971) for a general distance measure for features with different characteristics. Another choice we made is to standardize the features to have zero mean and unit variance. Although distance measures require some form of standardization of the features, other choices could result in different conclusions. Different scaling methods have been studied mainly in the context of cluster analysis (Schaffer and Green 1996; Steinley 2004).

We chose to use Lasso linear regression as the meta-learner. The Lasso tends to arbitrarily select a single base-learner among a group of highly correlated base-learners. Zou and Hastie (2005) note that Elastic Net regularization can alleviate this problem by including or excluding groups of highly correlated variables together.

In sum, we proposed and evaluated a new stacking ensemble learner based on univariate and bivariate \(k\hbox {NN}\) classifiers. When the meta-learner selects a small number of base-learners the results can be understood in terms of nonlinear main effects and two-variable interaction effects. SUBiNN often performs better than other NN-based classification methods and under certain conditions even outperformed Random Forests.