Extensions to rank-based prototype selection in k-Nearest Neighbour classification
Introduction
The k-Nearest Neighbour (kNN) rule is one of the best-known algorithms in the supervised classification field [1]. Its wide popularity comes from both its conceptual simplicity and its good results when categorizing a prototype with respect to its k nearest neighbours in the training set [2]. In spite of its longevity, it is still the subject of ongoing research [3], [4], [5]. However, since no classification model is generated from the training data, this algorithm generally exhibits low efficiency in both memory consumption and computational cost.
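The decision rule itself fits in a few lines of NumPy. The following is an illustrative sketch, not the experimental code of the paper; function and variable names are ours:

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, query, k=3):
    """Label a query point by majority vote among its k nearest prototypes."""
    # Euclidean distance from the query to every training prototype
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest prototypes
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example: two clusters, one query point near the second cluster
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([0.95, 0.9]), k=3))  # → 1
```

Note that the whole training set must be kept in memory and scanned for every query, which is the efficiency problem that motivates the methods discussed next.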
These shortcomings have been widely analysed in the literature, where three different families of solutions have been proposed:
- (i)
Fast Similarity Search (FSS) methods, which base their performance on the creation of search models for fast prototype retrieval from the training set [6], as, for example, the Approximating and Eliminating Search Algorithm (AESA) family of algorithms [7].
- (ii)
Approximate Similarity Search (ASS) algorithms, which work on the premise of searching for prototypes sufficiently similar to a given query in the training set, at the cost of a slight decrease in classification accuracy [8]; see, for instance, the methods in [9], [10].
- (iii)
Data Reduction (DR) techniques, which consist of pre-processing techniques that aim at reducing the size of the training set without affecting the quality of the classification [11].
In this work we shall focus on the latter family of methods, i.e., the ones which aim at reducing the size of the training set by means of pre-processing it.
DR can be broadly divided into two approaches: Prototype Generation (PG) [12] and Prototype Selection (PS) [13]. The former builds a new training set of artificial prototypes that represent the same information more efficiently, while the latter simply selects the most relevant prototypes from the initial training set. PS strategies are more general as regards data representation because they do not require knowledge of how the feature space is encoded [14], but only the distance values among the prototypes in the set. We therefore focus on this family of strategies.
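As an illustration of why PS needs only pairwise distances, the classical Condensed Nearest Neighbour rule (Hart's CNN, listed in the references) can be sketched over a precomputed distance matrix. This is a simplified sketch with our own naming, not the paper's implementation:

```python
import numpy as np

def condense(D, y):
    """Hart's CNN sketch: a prototype is added to the reduced set only if
    the current reduced set misclassifies it under the 1-NN rule.

    D is the full pairwise distance matrix; no feature vectors are needed,
    which is what makes PS applicable to any dissimilarity representation.
    """
    selected = [0]  # seed the reduced set with an arbitrary prototype
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            if i in selected:
                continue
            # Nearest prototype of the current reduced set
            nearest = min(selected, key=lambda j: D[i, j])
            if y[nearest] != y[i]:  # misclassified, so it must be kept
                selected.append(i)
                changed = True
    return sorted(selected)
```

Any metric (edit distance, DTW, etc.) can produce D, so the same routine covers structural data where no explicit feature space exists.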
Over the last decades, a number of proposals for performing PS have appeared, which will be reviewed in detail in the next section. Recently, rank-based approaches have been proposed, which order the prototypes of the training set according to their relevance to the success of the classification task. That is, prototypes are ranked following some criterion, after which they are selected according to the established order [15].
Among the current rank methods we identify two main drawbacks. The first is that, so far, the process does not take into account possible noise at the label level. It is true that kNN classification is robust to this type of phenomenon thanks to the parameter ‘k’, which tends to soften the impact of the noise by taking more neighbours into account when classifying. However, PS methods are applied before the classification process, so this robustness may be lost if the PS algorithm ignores the value of ‘k’ that will eventually be used. Thus, in this work we extend the current rank methods so that they also consider ‘k’ during the selection of the prototypes. The second drawback is that these methods require an extra parameter to be fixed, which regulates how many prototypes are finally selected. Given this, we also extend the rank methods to avoid the need for tuning this parameter, so that the selection criterion depends exclusively on the data itself. As will be seen in the experiments, these extensions provide greater robustness to noise, as well as optimal results in the trade-off between accuracy and efficiency, thus establishing the new procedures as successful alternatives to the PS methods proposed to date.
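The vote, rank, and select pipeline can be sketched as follows: every prototype votes for its k nearest same-class neighbours, and the prototypes receiving at least the average number of votes are kept. Both the one-vote-per-neighbour rule and the average-vote cut-off are simplified stand-ins for the heuristics of the actual methods and of our extensions; the function name is ours:

```python
import numpy as np

def rank_and_select(X, y, k=3):
    """Illustrative rank-based PS with a data-driven selection criterion.

    Each prototype casts one vote for each of its k nearest same-class
    neighbours; prototypes are then ranked by received votes and those with
    at least the average number of votes are kept, so no user-tuned set
    size is needed.
    """
    n = len(y)
    votes = np.zeros(n)
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]  # exclude the voter itself
        dists = np.linalg.norm(X[same] - X[i], axis=1)
        for j in same[np.argsort(dists)[:k]]:
            votes[j] += 1  # the k nearest same-class neighbours get a vote
    keep = np.where(votes >= votes.mean())[0]
    return keep, votes
```

Isolated (likely noisy) prototypes receive few or no votes and fall below the data-driven threshold, which is the intuition behind both proposed extensions.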
The rest of the article is structured as follows: Section 2 introduces previous attempts to PS, including those concerning rank methods. Section 3 describes our new strategy to extend previous rank methods. Section 4 presents the different data collections, evaluation metrics, and alternative PS strategies to benchmark with. Experimental evidence of the goodness of the proposed approach is given in Section 5 through a series of experiments and analyses. Finally, Section 6 outlines the main conclusions as well as promising lines for future work.
Background
Given that the work is framed in the context of PS, this section provides some background in this regard.
PS techniques aim at reducing the size of a given training set while maintaining (or even increasing) the accuracy of the classifier. To achieve this goal, these techniques select the most promising prototypes of the training set and discard the rest. Formally, let T denote an initial training set; PS seeks a reduced set R ⊆ T.
Typically, the accuracy
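The two competing objectives in this definition, classification accuracy and set-size reduction, can be measured with a short sketch (illustrative NumPy code with our own naming, not the paper's evaluation code):

```python
import numpy as np

def evaluate_reduction(X, y, keep, X_test, y_test, k=1):
    """Score a reduced set on a held-out set: kNN accuracy using only the
    kept prototypes, and the fraction of the training set removed."""
    correct = 0
    for q, label in zip(X_test, y_test):
        d = np.linalg.norm(X[keep] - q, axis=1)
        kn = np.argsort(d)[:k]
        vals, counts = np.unique(y[keep][kn], return_counts=True)
        correct += vals[np.argmax(counts)] == label  # majority vote
    accuracy = correct / len(y_test)
    reduction = 1 - len(keep) / len(y)
    return accuracy, reduction
```

A good PS method pushes both numbers up at once, which is why the experimental comparison below treats the problem as a trade-off rather than optimizing a single figure.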
Extensions to rank methods for prototype selection
This section describes the proposed extensions to improve the rank methods for PS. For the sake of clarity, we first introduce the basic notions of the aforementioned rank methods on which the modifications are performed. After that, we present the different proposals to improve their robustness against noisy instances. Finally, we explain the selection rule proposed to avoid the need for manual tuning.
Experimental setup
In this section we present the configuration of the experiments carried out to evaluate the proposed improvements: the considered datasets, the set of PS algorithms used to comparatively assess the performance of the proposed algorithms, and the evaluation protocol.
Results
In order to comprehensively evaluate our proposals, the experimental results are presented in two ways. First, we compare the classical rank methods with those including the proposal to improve the process in noisy environments. That is, we compare the classical rank-based PS algorithms, which do not take ‘k’ into account, with the new voting approach that considers the same ‘k’ for both the selection and the classification processes. Then, we also carry out an exhaustive comparison of
Conclusions and future work
In this paper we present extensions to some classical rank methods for PS based on voting heuristics. The first extension focuses on improving the tolerance of the reduced set to noisy data by considering the parameter ‘k’ of the classifier in the voting strategies. Additionally, a self-guided criterion is proposed for the actual selection, which eliminates the need to tune the external user parameter required by the classical methods.
We conduct experiments with several datasets and report the
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105803.
Acknowledgements
This work is supported by the Spanish Ministry HISPAMUS project TIN2017-86576-R, partially funded by the EU.
References (47)
- et al., Improving nearest neighbor classification using ensembles of evolutionary generated prototype subsets, Appl. Soft Comput. (2016)
- et al., Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation, Pattern Recognit. (2018)
- et al., New rank methods for reducing the size of the training set using the nearest neighbor rule, Pattern Recognit. Lett. (2012)
- et al., Local sets for multi-label instance selection, Appl. Soft Comput. (2018)
- et al., Multi-selection of instances: A straightforward way to improve evolutionary instance selection, Appl. Soft Comput. (2012)
- et al., InstanceRank based on borders for instance selection, Pattern Recognit. (2013)
- et al., Dynamic programming algorithm optimization for spoken word recognition
- et al., On the suitability of prototype selection methods for kNN classification with distributed data, Neurocomputing (2016)
- et al., Pattern Classification (2001)
- Nearest neighbor pattern classification, IEEE Trans. Inf. Theory
- A novel version of k nearest neighbor: Dependent nearest neighbor, Appl. Soft Comput.
- Efficient kNN classification with different numbers of nearest neighbors, IEEE Trans. Neural Netw. Learn. Syst.
- An algorithm for finding nearest neighbours in (approximately) constant average time, Pattern Recognit. Lett.
- Fast and accurate k-nearest neighbor classification using prototype selection by clustering
- A taxonomy and experimental study on prototype generation for nearest neighbor classification, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.)
- Prototype selection for nearest neighbor classification: Taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach. Intell.
- Prototype generation on structural data using dissimilarity space representation, Neural Comput. Appl.
- Multi-objective optimization
- The condensed nearest neighbor rule (corresp.), IEEE Trans. Inf. Theory