AEkNN: An AutoEncoder kNN-based classifier with built-in dimensionality reduction

High dimensionality, i.e. data having a large number of variables, tends to be a challenge for most machine learning tasks, including classification. A classifier usually builds a model representing how a set of inputs explains the outputs. The larger the set of inputs and/or outputs, the more complex that model becomes. There is a family of classification algorithms, known as lazy learning methods, which does not build a model. One of the best-known members of this family is the kNN algorithm. Its strategy relies on searching for a set of nearest neighbors, using the input variables as position vectors and computing distances among them. These distances lose significance in high-dimensional spaces. Therefore kNN, like many other classifiers, tends to perform worse as the number of input variables grows. In this work AEkNN, a new kNN-based algorithm with built-in dimensionality reduction, is presented. Aiming to obtain a new representation of the data, having a lower dimensionality but more informative features, AEkNN internally uses autoencoders. The distances computed from these new feature vectors should be more significant, thus providing a way to choose better neighbors. An experimental evaluation of the new proposal is conducted, analyzing several configurations and comparing them against the classical kNN algorithm. The obtained conclusions demonstrate that AEkNN offers better results in predictive and runtime performance.


Introduction
Classification is a well-known task within the area of machine learning [29]. The main objective of a classifier is to find a way to predict the label to be assigned to new data patterns. To do so, usually a model is created from previously labeled data. In traditional classification, each example has a single label. Different algorithms have been proposed in order to address this task. One of the classic methodologies is instance-based learning (IBL) [1]. Essentially, this methodology is based on local information provided by the training instances, instead of constructing a global model from the whole data. The algorithms belonging to this family are relatively simple; however, they have been shown to obtain very good results on classification problems. A traditional example of such an algorithm is k-nearest neighbors (kNN) [23].
The different IBL approaches, including the kNN algorithm [13], have difficulties when faced with high-dimensional datasets. These datasets are made of samples having a large number of features. In particular, the kNN algorithm presents problems when calculating distances in high-dimensional spaces. The main reason is that distances are less significant as the number of dimensions increases, tending to equate [13]. This effect is one of the causes of the curse of dimensionality, which occurs when working with high-dimensional data [8,9]. Another consequence that emerges in this context is the Hughes phenomenon. This fact implies that the predictive performance of a classifier decreases as the number of features of the dataset grows, keeping the number of examples constant [44]. In other words, more instances would be needed to maintain the same level of performance.
Several approaches have been proposed for facing the dimensionality reduction task. Recently, a few proposals based on deep learning (DL) [25,11] have obtained good results while tackling this problem. The rise of these techniques is produced by the good performance that DL models have had in many research areas, such as computer vision, automatic speech processing, or audio and music recognition. In particular, autoencoders (AEs) are DL networks offering good results due to their architecture and operation [17,41,84,67,94,14].
High dimensionality is usually mitigated by transforming the original input space into a lower-dimensional one. In this paper, an instance-based algorithm that internally generates a reduced set of features is proposed. Its objective is to obtain a better IBL method, able to deal with high-dimensional data. Specifically, the present work introduces AEkNN, a kNN-based algorithm with built-in dimensionality reduction. AEkNN projects the training patterns into a lower-dimensional space, relying on an AE for doing so. The goal is to produce new features of higher quality from the input data. This approach is experimentally evaluated, and a comparison between AEkNN and the kNN algorithm is performed considering predictive performance and execution time. The results obtained demonstrate that AEkNN offers better results in both metrics. In addition, AEkNN is compared with other traditional dimensionality reduction algorithms. This comparison offers an idea of the behavior of the AEkNN algorithm when facing the task of dimensionality reduction.
An important aspect of the AEkNN algorithm that must be highlighted is that it performs a transformation of the features of the input data, unlike other traditional algorithms that perform a simple selection of the most significant features. AEkNN performs this transformation taking into account all the features of the input data, although not all of them will have the same weight in the newly generated space.
In short, AEkNN combines two reference methods, kNN and AE, in order to take advantage of kNN in classification and reduce the effects of high dimensionality by means of AE. In this way, the proposed method presents a baseline for future works. Summarizing, the main contributions of this work are 1) the design of a new classification algorithm, AEkNN, which combines an efficient dimensionality reduction mechanism with a popular classification method, 2) an analysis of the AEkNN operating parameters that allows selecting the best algorithm configuration, 3) an experimental demonstration of the improvement that AEkNN achieves with respect to the kNN algorithm, and 4) an experimental comparison between the use of an autoencoder for dimensionality reduction with respect to other classical methods such as PCA and LDA.
This paper is organized as follows. In Section 2 a few fundamental concepts, such as machine learning, classification, and the kNN algorithm, are briefly introduced, and some details about DL techniques and AEs are provided. Section 3 describes relevant related works, focused on tackling the problem of dimensionality reduction in kNN. In Section 4, the proposed AEkNN algorithm is introduced. Section 5 defines the experimental framework, as well as the different results obtained from the experimentation. Finally, Section 6 provides the overall conclusions.

Preliminaries
AEkNN, the algorithm proposed in this work, is a kNN-based classification method designed to deal with high-dimensional data. This section outlines the essential concepts AEkNN is founded on, such as classification, nearest neighbors classification, DL techniques and AEs. The most basic concepts are introduced in subsection 2.1. The kNN algorithm is discussed in subsection 2.2, while DL and AEs are briefly described in subsections 2.3 and 2.4.

Machine Learning and Classification
In general terms, machine learning is a subfield of artificial intelligence whose aim is to develop algorithms and techniques able to generalize behaviors from information supplied as examples [46,33]. The different tasks that can be performed on machine learning can be classified following different criteria. One of these categorizations arises according to the training patterns used to train the machine learning system [3]. In this sense, it can be distinguished between supervised learning, when patterns are labeled, such as classification and regression, and unsupervised learning, when patterns are not labeled, such as clustering, among others.
Classification is one of the tasks performed in the data mining phase. It is a predictive task that is usually carried out through supervised learning methods [47]. Its purpose is to predict, based on previously labeled data, the class to assign to future unlabeled patterns. In traditional classification, datasets are structured as a set of input attributes, known as features, and one output attribute, the class or label. Depending on the number of values that this output class can take, the classification problem can be seen as:
-Binary classification, in which each pattern can only belong to one of two classes.
-Multi-class classification, in which each pattern can only belong to one of a limited set of classes.
Consequently, a binary classification can be seen as a problem of multi-class classification with only two classes.
Several issues can emerge while designing a classifier, some of them related to high dimensionality. According to the Hughes phenomenon [44], the predictive performance of a classifier decreases as the number of features increases, provided that the number of training instances is constant. Another phenomenon that particularly affects IBL algorithms is the curse of dimensionality. IBL algorithms are based on the similarity of individuals, calculating distances between them [1]. These distances tend to lose significance as dimensionality grows.

The kNN Algorithm
kNN is a non-parametric algorithm developed to deal with classification and regression tasks [4,23]. In classification, kNN predicts the class for new instances using the information provided by the k nearest neighbors, so that the assigned class will be the most common among them. Fig. 1 shows a very simple example on how kNN works with different k values. As can be seen, the prediction obtained with k = 3 would be B, with k = 5 would be A and with k = 11 would be A. An important feature of this algorithm is that it does not build a model for accomplishing the prediction task. Usually, no work is done until an unlabeled data pattern arrives, thus the denomination of lazy approach [5]. Once the instance to be classified is given, the information provided by its k nearest neighbors [24] is used as explained above.
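The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function name `knn_predict`, the Euclidean distance and the toy points are assumptions, with the toy data chosen only so that the prediction changes between k = 3 and k = 5, as in the example of Fig. 1.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Return the most common label among the k training points nearest to query."""
    # Lazy approach: nothing is precomputed, all distances are evaluated at query time.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two B points lie closest to the origin, three A points follow.
toy_x = [(0.5, 0.0), (0.0, 0.6), (0.7, 0.0), (0.0, 0.8), (0.9, 0.0)]
toy_y = ["B", "B", "A", "A", "A"]

print(knn_predict(toy_x, toy_y, (0.0, 0.0), k=3))  # two B votes against one A
print(knn_predict(toy_x, toy_y, (0.0, 0.0), k=5))  # three A votes against two B
```

Sorting all training points costs O(n log n) per query; it is used here only for brevity.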
One of kNN's main issues is its behavior with datasets having a high-dimensional input space, due to the loss of significance of traditional distances as the dimensionality of the data increases [13]. In such a high-dimensional space, distances between individuals tend to be the same. As a consequence, similarity/distance-based algorithms, such as kNN, usually do not offer adequate results.
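The loss of contrast between distances can be observed with a small simulation. The sketch below is only illustrative: it assumes points drawn uniformly from the unit hypercube and uses the relative contrast (d_max - d_min) / d_min as the measure, both of which are choices made here and not taken from the cited works.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimension grows: farthest and nearest neighbors
# end up at almost the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```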
In this article, kNN has been selected to perform classification tasks. kNN is very popular since it has a good performance, uses few resources and it is relatively simple [60,88,59,28]. The objective of this proposal is to present an algorithm that combines the advantages of kNN to classify with DL models to reduce dimensionality.

Deep Learning
DL [87,25] arises with the objective of extracting higher-level representations of the analyzed data. In other words, the main goal of DL-based techniques is learning complex representations of data. The main lines of research in this area are intended to define different data representations and create models to learn them [12].
As the name suggests, models based on DL are developed as multi-layered (deep) architectures, which are used to map the relationships between features in the data (inputs) and the expected result (outputs) [34,10]. Most DL algorithms learn multiple levels of representations, producing a hierarchy of concepts. These levels correspond to different degrees of abstraction. The following are some of the main advantages of DL:
-These models can handle a large number of variables and generate new features as part of the algorithm itself, not as an external step [25,19,76].
-DL provides performance improvements in terms of the time needed to accomplish feature engineering, one of the most time-consuming tasks [34].
-DL achieves much better results than other methods based on traditional techniques [20,49,40,90] while facing problems in certain fields, such as image recognition, speech recognition or malware detection.
-DL-based models have a high capacity of adaptation for facing new problems [87,10].
Recently, several new methods [11,52,25] founded on the good results produced by DL have been published. Some of them are focused on certain areas, such as image processing and voice recognition [52]. Other DL-based proposals have been satisfactorily applied in disparate areas, gaining advantage over prior techniques [25]. Due to the great impact of DL-based techniques, as well as the impressive results they usually offer, new challenges are also emerging in new research lines [11].
DL models have been widely used to perform classification tasks, obtaining good results [20,49,30,27]. However, the objective of this proposal is not to perform the classification directly with these models, but to use them for the dimensionality reduction task.
The goal of the present work is to obtain higher-level representations of the data but with a reduced dimensionality. Among the DL-based dimensionality reduction techniques, AEs have achieved good results [56]. An AE is an artificial neural network whose purpose is to reproduce the input into the output, in order to generate a compressed representation of the original information [10]. Section 2.4 thoroughly describes this technique.

Autoencoders
An AE is an artificial neural network able to learn new information encodings through unsupervised learning [17]. AEs are trained to learn an internal representation that allows reproducing the input into the output through a series of hidden layers. The goal is to produce a more compressed representation of the original information in the hidden layers, so that it can be used in other tasks. AEs are typically used for dimensionality reduction tasks because of their characteristics and performance [41,84,67,94,14,73,93]. Hence the importance of such networks in this paper.
The most basic structure of an AE is very similar to that of a multilayer perceptron. An AE is a feedforward neural network without cycles, so the information always goes in the same direction. An AE is typically formed by a series of layers: an input layer, a series of hidden layers and an output layer, with the units in each layer connected to those in the next one. The main characteristic of AEs is that the output has the same number of nodes as the input, since the goal is to reproduce the latter into the former throughout the learning process [10].
Two parts can be distinguished in an AE, the encoder and the decoder. The first one is made up of the input layer and the first half of hidden layers. The second is composed of the second half of hidden layers and the output layer. This is the architecture shown in Fig. 2. As can be seen, the structure of an AE is always symmetrical. The encoder and decoder parts in an AE can be defined as functions ω (Eq. (1)) and β (Eq. (2)), so that:

$$\omega : X \rightarrow F \qquad (1)$$
$$\beta : F \rightarrow X \qquad (2)$$

where $x \in \mathbb{R}^d = X$ is the input to the AE, and $z \in \mathbb{R}^p = F$ is the mapping contained in the hidden layers of the AE. When there is only one hidden layer (the most basic case) the AE maps the input X onto Z. For doing this, a weight vector W and a bias parameter b are used:

$$z = \gamma_1(Wx + b) \qquad (3)$$

Eq. (3) corresponds to the compression function, wherein an encoded input representation is obtained. Here, $\gamma_1$ is an activation function such as a rectified linear unit or a sigmoid function.
The next step is to decode the new representation z to obtain x', in the same way as x was projected into z, by means of a weight vector W' and a bias parameter b':

$$x' = \gamma_2(W'z + b') \qquad (4)$$

Eq. (4) corresponds to the decoder part, where the AE reconstructs the input from the information contained in the hidden layer. Here, $\gamma_2$ denotes the activation function of the decoder, analogous to $\gamma_1$ in the encoder.
During the training process, the AE tries to reduce the reconstruction error. This operation consists of back-propagating the obtained error through the network, and then modifying the weights to minimize such error. Algorithm 1 shows the pseudo-code of this process.

Algorithm 1 AE training algorithm's pseudo-code.
Inputs:
  TrainData: training data
for each instance in TrainData do
  // Do a feed-forward pass through the AE to obtain the output
  newCod ← feedForwardAE(aeModel, instance)
  // Calculate the error
  error ← calculateError(newCod, instance)
  // Backpropagate the error through the AE and perform the weight update
  aeModel ← backpropagateError(error)
end for

Learning a representation that allows reproducing the input into the output could seem useless at first sight, but in this case the output is not of interest. Instead, the concern is in the new representation of the inputs learned in the hidden layers. Such new codification is really interesting, because it can have very useful features [34]. The hidden layers learn a higher-level representation of the original data, which can be extracted and used in independent processes.
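The training step of Algorithm 1 can be summarized compactly for the single-hidden-layer case. The following is a sketch under stated assumptions: the squared-error loss, the learning rate symbol $\eta$ and the decoder activation $\gamma_2$ are chosen here for illustration, since the text does not fix a particular error function.

```latex
\begin{align*}
z &= \gamma_1(Wx + b), & x' &= \gamma_2(W'z + b'), \\
\mathcal{L}(x, x') &= \tfrac{1}{2}\,\lVert x - x' \rVert_2^2, \\
\theta &\leftarrow \theta - \eta\,\frac{\partial \mathcal{L}}{\partial \theta},
\qquad \theta \in \{W,\, b,\, W',\, b'\}.
\end{align*}
```

Each pass of the loop evaluates the first line (feed-forward), the second line (error) and the third line (backpropagation and weight update).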
Depending on the number of units the hidden layers have, two types of AEs can be distinguished:
-Those whose hidden layer has fewer units than the input and output layers. This type of AE is called undercomplete. Its main objective is to force the network to learn a compressed representation of the input, extracting new, higher-level features.
-Those whose hidden layer has more units than the input and output layers. This type of AE is called overcomplete. The main problem in this case is that the network can learn to copy the input to the output without learning anything useful, so when an enlarged representation of the input is needed, other tools must be used to prevent this problem.
In conclusion, AEs are a very suitable tool for generating a new lower-dimensional input space made of higher-level features. AEs have obtained good results in the accomplishment of this task. This is the main reason to choose this technique to design the AEkNN algorithm described later. However, it is important to note that there are other methods of dimensionality reduction that produce good results, such as denoising autoencoders [83], restricted Boltzmann machines [70] or sparse coding [89]. The objective of this proposal is to present an algorithm that hybridizes kNN with AEs. This establishes a baseline that allows supporting studies with more complex methods.

Dimensionality reduction approaches
In this section, previous works related to the proposal made in this paper are explored. Subsection 3.1 introduces classical proposals to tackle the dimensionality reduction problem. Some approaches for facing the dimensionality reduction task for kNN are portrayed in subsection 3.2.
In machine learning, dimensionality reduction is the process aimed at decreasing the number of considered variables, by obtaining a subset of main features. Usually two different dimensionality reduction methods are considered:
-Feature selection [58], where the subset of the initial features that provides more useful information is chosen. The final features undergo no transformation in the process.
-Feature extraction [57], where the process constructs from the initial features a set of new ones providing more useful and non-redundant information, facilitating the next steps in machine learning and in some cases improving understanding by humans.

Classical proposals for dimensionality reduction
Most dimensionality reduction techniques can be grouped into two categories, linear and non-linear approaches [79]. Some representative proposals found in the literature, those that can be considered traditional methods, are described below. Commonly, classic proposals for dimensionality reduction were developed using linear techniques. The following are some of them:
-Principal Components Analysis (PCA) [63,43] is a well-known solution for dimensionality reduction. Its objective is to obtain the lower-dimensional space where the data are embedded. In particular, the process starts from a series of correlated variables and converts them into a set of linearly uncorrelated variables. The variables obtained are known as principal components, and their number is less than or equal to the number of original variables. Often, the internal structure extracted in this process reflects the variance of the data used.
-Factor analysis [75] is based on the idea that the data can be grouped according to their correlations, i.e. variables with a high correlation will be within the same group and variables with low correlation will be in different groups. Thus, each group represents a factor in the observed correlations. The potential factors plus the error terms are linearly combined to model the input variables. The objective of factor analysis is to look for independent dimensions.
-Classical scaling [78] consists in grouping the data according to their similarity level. An algorithm of this type aims to place each data point in an N-dimensional space where the distances are maintained as much as possible.
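As an illustration of the first item in the list above, the first principal component can be obtained from the covariance matrix of the centered data. The sketch below is not a full PCA: it assumes that only the leading component is needed and swaps in power iteration for a complete eigendecomposition; the function name and the toy data are hypothetical.

```python
import math


def first_principal_component(data, iters=100):
    """Return the leading principal direction and the data projected onto it."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector
    # and that the data are not constant.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores


# Hypothetical toy data lying exactly on the line y = 2x.
direction, scores = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, scores)
```

The two correlated input variables are replaced by a single score per instance, which is the reduction PCA performs when only one component is kept.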
Despite their popularity, classical linear solutions for dimensionality reduction present the problem that they cannot correctly handle complex non-linear data [79]. For this reason, non-linear proposals for dimensionality reduction arose. A compilation of these is presented in [15,71,54,81]. Some of these techniques are Isomap [77], Maximum Variance Unfolding [86], diffusion maps [21] and manifold charting [79], among others. These techniques make it possible to work correctly with complex non-linear data. This is an advantage when working with real data, which are usually of this type.

Proposals for dimensionality reduction in kNN
There are different proposals trying to face the problems of kNN when working with high-dimensional data. In this section, some of them are collected:
-A method for computing distances for kNN is presented in [91]. The proposed algorithm performs a partition of the total data set. Subsequently, a reference value is assigned to each partition. Grouping the data in different partitions makes it possible to obtain a space of smaller dimensionality where the distances between the reference points are more significant. The method depends on the division of the data performed and the selection of the reference. This is a negative aspect, since a poor choice of parameters can greatly influence the final results.
-The authors of [48] analyze the curse of dimensionality phenomenon, which states that in high-dimensional spaces the distances between the data points become almost equal. The objective is to investigate when the different methods proposed reach their limits. To do this, they perform an experiment in which different methods are compared with each other. In particular, it is shown that the results of the kNN algorithm begin to worsen when the space exceeds eight dimensions. A proposed solution is to adapt the calculation of distances to high-dimensional spaces. However, this approach does not consider a transformation of the initial data to a lower-dimensional space.
-The proposal in [85] is a kNN-based method called kMkNN, whose objective is to improve the search of the nearest neighbors in a high-dimensional space. This problem is approached from another point of view, the goal being to accelerate the computation of distances between elements without modifying the input space. To do this, kMkNN relies on k-means clustering and the triangle inequality. The study shows a comparison with the original kNN algorithm where it is demonstrated that kMkNN works better considering the execution time, although it is not as effective when predictive performance is taken into account.
-A new aspect related to the curse of dimensionality phenomenon, occurring while working with the kNN algorithm, is explored in [65]. It refers to the number of times that a particular element appears as the closest neighbor of the rest of the elements. The main objective of the study is to examine the origins of this phenomenon as well as how it can affect dimensionality reduction methods. The authors analyze this phenomenon and some of its effects on kNN classification, providing a foundation which allows making different proposals aimed at mitigating these effects.
-In [37], the problem of finding the nearest neighbors in a high-dimensional space is analyzed. This is a difficult problem both from the point of view of performance and the quality of results. In this study, a new search approach is proposed, where the most relevant dimensions are selected according to certain quality criteria. Thus, the different dimensions are not treated in the same way. This can be seen as an extraction of features according to a particular criterion. Finally, an algorithm based on the previous approach, which faces the nearest neighbor problem, is proposed. However, this method makes a selection of the initial features that meet a certain criterion, so it does not take into account all the input features. Therefore, it could discard important information in the process.
-A method called DNet-kNN is presented in [62]. It consists in a non-linear feature mapping based on a Deep Neural Network, aiming to improve classification by kNN. DNet-kNN relies on Restricted Boltzmann Machines (RBMs) to perform the mapping of features. RBMs are another type of DL network. The work offers a solution to the problem of high-dimensional data when using kNN by combining this traditional classifier with DL-based techniques. The conducted experimentation shows that DNet-kNN improves the results of the classical kNN. DNet-kNN requires a pre-training of the RBM, which is an additional phase. In addition, experimentation is performed on datasets of digits and letters, not on actual images.
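The quantity studied in [65], i.e. how often each element appears among the nearest neighbors of the other elements, can be computed directly. The sketch below is only an illustration of that count: the function name `k_occurrences`, the Euclidean distance and the toy points are assumptions, not details taken from the cited study.

```python
import math


def k_occurrences(points, k):
    """Count how many times each point is among the k nearest neighbors of the others."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts


# Hypothetical one-dimensional toy data: the second point is the nearest
# neighbor of two other points, while the last point is nobody's neighbor.
print(k_occurrences([(0.0,), (1.0,), (3.0,), (10.0,)], k=1))
```

A strongly skewed distribution of these counts means that a few elements dominate the neighbor lists, which in turn affects kNN predictions.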
The aforementioned proposals represent a list of approaches that are closely related to the problem dealt with in this article. However, there are many other proposals that face the problem of dimensionality reduction from other perspectives [68,7,16,36,92].
In conclusion, different proposals have arisen to analyze and try to address the problems of IBL algorithms when they have to deal with high-dimensional data. These methods are affected by the curse of dimensionality, which raises the need for new approaches. Among the previous proposals, none presents a hybrid method based on IBL that incorporates dimensionality reduction intrinsically. In addition, none of the proposals obtains improvements both in predictive performance and in execution time. The present work aims to fulfill these aspects.
Once the main foundations behind our proposed algorithm have been described, this section proceeds to present AEkNN. It is an instance-based algorithm with a built-in dimensionality reduction method, performed with DL techniques, in particular by means of AEs.

AEkNN Foundations
As mentioned, high dimensionality is an obstacle for most classification algorithms. This problem is more noticeable while working with distance-based methods, since as the number of variables increases these distances are less significant [44,48]. In such a situation an IBL method could lose effectiveness, since the distances between individuals tend to become equal. As a consequence of this problem, different methods able to reduce its effects have been proposed. Some of them have been previously enumerated in Section 3.2.
AEkNN is a new approach in the same direction, introducing the use of AEs to address the problem of high dimensionality. The structure and performance of this type of neural network make it suitable for this task. As explained above, AEs are trained to reproduce the input into the output, through a series of hidden layers. Usually, the central layer in AEs has fewer units than the input and output layers. Thereby, this bottleneck forces the network to learn a more compact and higher-level representation of the information. The coding produced by this central layer can be extracted and used as the new set of features in the classification stage. In this sense, there are different studies demonstrating that better results are obtained with AEs than with traditional methods, such as PCA or multidimensional scaling [41,79]. Also, there are studies analyzing the use of AEs from different perspectives, either focusing on the training of the network and its parameters [84] or on the relationship between the data when building the model [67].
AEkNN is an instance-based algorithm designed to address classification with a high number of variables. It starts working with an N-dimensional space X that is projected into an M-dimensional space Z, with M < N. This way M new features, presumably of higher level than the initial ones, are obtained. Once the new representation of the input data is generated, it is possible to get more representative distances. To estimate the output, the algorithm uses the distances between each test example and the training ones, but based on the M higher-level features. Thus, the drawbacks of high-dimensional data in distance computation can be significantly reduced. As can be seen, AEkNN is a non-lazy instance-based algorithm. It starts by generating the model in charge of producing the new features, unlike the lazy methods that do not have a learning stage or model. AEkNN makes it possible to enhance predictive performance, as well as to obtain improvements in execution time, when working with data having a large number of features.

Method description
AEkNN consists of two fundamental phases. Firstly, the learning stage is carried out using the training data to generate the AE model that produces a new encoding of the data. Secondly, the classification step is performed. It uses the model generated in the first phase to obtain the new representation of the test data and, later, the class for each instance is estimated based on nearest neighbors. Algorithm 2 shows the pseudo-code of AEkNN, which is thoroughly discussed below, while Fig. 3 shows the algorithm process in a general way. The inputs to the algorithm are TrainData and TestData, the train and test data to be processed, k, the number of neighbors, and PPL, the percentage of elements per layer. This latter parameter sets the structure of the AE, i.e. the number of layers and elements per layer. It is a vector made of as many items as hidden layers are desired in the AE, each one indicating the percentage of units in that layer with respect to the number of input features. In section 5, different configurations are analyzed to find the one that offers the best results.
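To make the role of the PPL parameter concrete, the following sketch converts a PPL vector into the number of units of each hidden layer. The function name `layer_sizes`, the rounding rule and the example values are assumptions made for illustration only.

```python
def layer_sizes(n_features, ppl):
    """Units per hidden layer, each given as a percentage of the input features."""
    # Rounding to the nearest integer and forcing at least one unit are assumptions.
    return [max(1, round(n_features * pct / 100)) for pct in ppl]


# Hypothetical examples: a single hidden layer with 25% of 784 input features,
# and three hidden layers for a dataset with 200 input features.
print(layer_sizes(784, (25,)))
print(layer_sizes(200, (150, 25, 150)))
```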
The algorithm is divided into two parts. The first part of the code (lines 2-9) corresponds to the training phase of AEkNN. The second part (lines 11-12) refers to the classification phase. During training, AEkNN focuses on learning a new representation of the data. This is done through an AE, using the training data to learn the weights linking the AE's units. This process has to be repeated for each layer in the AE, as stated by the number of elements in the PPL parameter. This loop performs several tasks:
- In line 5 the function getSizeLayer is used to obtain the number of units in the layer. This value depends on the number of characteristics of the training set (TrainData) and the percentage applied to the corresponding layer, which is established by the PPL parameter.
- The function getAELayer (called in line 6 and defined in line 14) retrieves a layer of the AE model. The layer makes it possible to obtain a new representation of the data given as first parameter (modelData). The number of units in the AE layer generated in this iteration is given by the second parameter (sizeLayer). Firstly, the AE is initialized with the corresponding structure (line 15). Then, its weights are updated (lines 16-20). Finally, the generated AE layer is returned (line 21).
- The function applyAELayer (line 7) obtains a new representation of the data given as second parameter (modelData). To do this, the previously generated AE layer, represented by the first parameter (aeLayer), is used.
- The last step consists in adding the AE layer generated in the current iteration to the complete AE model (line 8).

Algorithm 2: Pseudo-code of the AEkNN algorithm.
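The training loop described above can be sketched as follows. This is only a structural outline under stated assumptions: fit_layer is a hypothetical callable standing in for getAELayer, the data of each iteration is assumed to be the output of the previous layer, and the truncating_layer stub is a placeholder that keeps the first features instead of training an actual AE.

```python
def train_ae_model(train_data, ppl, fit_layer):
    """Build the AE model layer by layer, following the loop over PPL."""
    n_features = len(train_data[0])
    model_data = train_data
    ae_model = []
    for fraction in ppl:
        # Counterpart of getSizeLayer: units relative to the input features.
        size_layer = max(1, round(n_features * fraction))
        # Counterpart of getAELayer: fit one layer on the current data.
        ae_layer = fit_layer(model_data, size_layer)
        # Counterpart of applyAELayer: re-encode the data with the new layer.
        model_data = [ae_layer(x) for x in model_data]
        # Add the layer to the complete AE model.
        ae_model.append(ae_layer)
    return ae_model


def encode(ae_model, x):
    """Apply every layer of the AE model to a single instance."""
    for layer in ae_model:
        x = layer(x)
    return x


def truncating_layer(data, size):
    """Placeholder for a trained AE layer: keeps the first `size` features."""
    return lambda x: x[:size]


train = [[1, 2, 3, 4], [5, 6, 7, 8]]
model = train_ae_model(train, [0.75, 0.5], truncating_layer)
print(encode(model, [1, 2, 3, 4]))
```

Passing the layer-fitting routine as a parameter keeps the loop independent of the concrete AE implementation, which in the original work relies on an external DL package.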
During classification (lines 11-12) the function classification is used (lines 24-35). The class for the test instances given as the first parameter (TestData) is predicted. Internally, this function transforms each test instance using the AE model generated in the training phase (aeModel), producing a new instance that is more compact and representative (line 27). This new set of features is used to predict a class with a distance-based classifier, using for each new example its k nearest neighbors (line 28). Finally, the function returns the error rate (result) for the total set of test instances (line 33). As can be seen, classification is conducted in a lower-dimensional space, mitigating the problems associated with a high number of variables.
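A minimal sketch of this classification step is given below. It assumes Euclidean distance and majority voting, which the description above does not fix explicitly, and it takes the encoder as a plain function; the names knn_predict and the feature-dropping encoder of the example are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training instances closest to the query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def classification(test_x, test_y, train_x, train_y, k, encode):
    """Error rate of kNN computed in the encoded (lower-dimensional) space."""
    encoded_train = [encode(x) for x in train_x]
    errors = 0
    for x, y in zip(test_x, test_y):
        z = encode(x)  # new, more compact representation of the instance
        if knn_predict(encoded_train, train_y, z, k) != y:
            errors += 1
    return errors / len(test_x)


# Toy data: the third feature is noise that the (stand-in) encoder drops.
train_x = [[0, 0, 9], [0, 1, 9], [5, 5, 9], [5, 6, 9]]
train_y = ["a", "a", "b", "b"]
drop_last = lambda x: x[:2]
print(classification([[0, 0.5, 0], [5, 5.5, 0]], ["a", "b"],
                     train_x, train_y, 3, drop_last))
```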
At this point, it should be clarified that the update of weights (lines 16-20) is done using mini-batch gradient descent [38]. This is a variation of the gradient descent algorithm that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients. The reason for using this technique is its better performance when dealing with large datasets.
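To make the weight update concrete, the sketch below trains a single AE layer with mini-batch gradient descent. It is a simplified illustration and not the implementation used in this work: the units are linear and have no bias terms, the loss is the squared reconstruction error, and the learning rate, batch size and number of epochs are arbitrary assumptions.

```python
import random


def forward(x, W, V):
    """Encode x with W, then decode the resulting code with V."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    r = [sum(v * hj for v, hj in zip(row, h)) for row in V]
    return h, r


def reconstruction_error(data, W, V):
    total = 0.0
    for x in data:
        _, r = forward(x, W, V)
        total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
    return total / len(data)


def train_linear_ae(data, n_hidden, epochs=100, batch_size=8, lr=0.05, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        # Split the shuffled training set into small batches.
        for start in range(0, len(order), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad_W = [[0.0] * n_in for _ in range(n_hidden)]
            grad_V = [[0.0] * n_hidden for _ in range(n_in)]
            for x in batch:
                h, r = forward(x, W, V)
                err = [ri - xi for ri, xi in zip(r, x)]
                # Error propagated back to the hidden units.
                back = [sum(V[i][j] * err[i] for i in range(n_in))
                        for j in range(n_hidden)]
                for i in range(n_in):
                    for j in range(n_hidden):
                        grad_V[i][j] += err[i] * h[j]
                        grad_W[j][i] += back[j] * x[i]
            # One update per batch, using the average gradient of the batch.
            step = lr / len(batch)
            for i in range(n_in):
                for j in range(n_hidden):
                    V[i][j] -= step * grad_V[i][j]
                    W[j][i] -= step * grad_W[j][i]
    return W, V


# Synthetic data with four features generated from two latent factors.
rng = random.Random(1)
data = []
for _ in range(40):
    a, b = rng.random(), rng.random()
    data.append([a, b, a + b, a - b])

W0, V0 = train_linear_ae(data, n_hidden=2, epochs=0)    # untrained weights
W1, V1 = train_linear_ae(data, n_hidden=2, epochs=100)  # trained weights
print(reconstruction_error(data, W0, V0), reconstruction_error(data, W1, V1))
```

Since every update only needs one batch, the cost of a step does not grow with the size of the training set, which is the practical advantage mentioned above.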
From the previous description it can be inferred how AEkNN accomplishes the objective of addressing classification with high-dimensional data. On the one hand, aiming to reduce the effects of working with a large number of variables, AEs are used to transform such data into a lower-dimensional space. On the other hand, the classification phase is founded on the advantages of IBL. In Section 5, the performance of AEkNN is analyzed.

Experimental study
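In order to demonstrate the improvements provided by AEkNN, the algorithm proposed in the present work, an experimental study was conducted. It has been structured into three steps, all of them using the same set of datasets:
- An analysis of the PPL parameter, aimed at selecting the AE configurations that work best.
- A comparison between AEkNN, with the selected configurations, and the classical kNN algorithm.
- A comparison between AEkNN and two traditional dimensionality reduction algorithms, PCA and LDA.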

Experimental framework
The conducted experimentation aims to show the benefits of AEkNN over a set of datasets with different characteristics. Their traits are shown in Table 1. The datasets' origin is shown in the column named Ref. For all executions, datasets are given as input to the algorithms applying a 2×5-fold cross-validation scheme. In both phases of experimentation, the value of k for the kNN classifier and for AEkNN is 5, since it is the recommended value in the related literature. In addition, several evaluation measures had to be computed in order to compare classification results. In this experimentation, Accuracy (5), F-Score (6) and area under the ROC curve (AUC) (9) were used.
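The validation scheme can be sketched as a generator of train/test index partitions. The sketch reads "2×5" as two repetitions of a 5-fold partition, which is an assumption, and it uses plain random shuffling without stratification; the function name repeated_kfold and the seed are illustrative.

```python
import random


def repeated_kfold(n_instances, n_repeats=2, n_folds=5, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_instances))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds groups of similar size.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold(23))
print(len(splits))  # 2 repetitions x 5 folds
```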
Accuracy (5) is the proportion of true results among the total number of cases examined:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$

where TP stands for true positives, instances correctly identified; FP for false positives, instances incorrectly identified; TN for true negatives, instances correctly rejected; and FN for false negatives, instances incorrectly rejected.
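The evaluation measures used in this experimentation can be computed with a few lines of code. The sketch below assumes a binary problem; Accuracy and F-Score are obtained from the confusion-matrix counts, while AUC is computed through its probabilistic reading, i.e. the fraction of positive/negative pairs in which the positive instance receives the higher score (ties count one half). The function names and the example numbers are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly by the scores."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(accuracy(tp=40, tn=45, fp=5, fn=10))
print(f_score(tp=40, fp=5, fn=10))
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```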
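F-Score (6) is the harmonic mean of Precision (7) and Recall (8):

$$\text{F-Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (6)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (7)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (8)$$

Finally, AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. AUC is given by Eq. (9), in which T is the decision threshold and FPR' denotes the derivative of FPR with respect to T:

$$\mathrm{AUC} = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\, \mathrm{FPR}'(T)\, dT \quad (9)$$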
Here, TPR stands for the true positive rate and FPR for the false positive rate.

The significance of the results obtained in this experimentation is verified by appropriate statistical tests. Two different tests are used in the present study:
- In the first part, the Friedman test [31] is used to rank the different AEkNN configurations and to establish whether any statistical differences exist between them.
- In the second part, the Wilcoxon [72] non-parametric signed-rank test is used. The objective is to verify whether there are significant differences between the results obtained by AEkNN and kNN.
These experiments were run in a cluster made up of 8 computers, each of them having 2 CPUs (2.33 GHz) and 7 GB of RAM. The AEkNN algorithm and the experimentation were coded in the R language [64], relying on the H2O package [2] for some DL-related functions.

PPL Parameter Analysis
AEkNN has a parameter, named PPL, that establishes the configuration of the model. This parameter allows selecting different architectures, both in number of layers (depth) and number of neurons per layer.
The datasets used (see Table 1) have disparate numbers of input features, so the architectures are defined according to this trait. Table 2 shows the considered configurations. For each model the number of hidden layers, as well as the number of neurons in each layer, is shown. The latter is indicated as a percentage of the number of initial characteristics. Finally, the notation of the associated PPL parameter is provided.

The results produced by the different configurations are presented grouped by metric. Table 3 shows the results for Accuracy, Table 4 for F-Score and Table 5 for AUC. These results are also represented graphically. Aiming to optimize the visualization, two plots with different scales have been produced for each metric. Fig. 4 represents the results for Accuracy, Fig. 5 for F-Score and Fig. 6 for AUC.

The results presented in Table 3 and in Fig. 4 show the Accuracy obtained by AEkNN with different PPL values. These results indicate that there is no configuration that works best for all datasets. The configurations with three hidden layers obtain the best results in 4 out of 14 datasets, whereas the configurations with one hidden layer win in 10 out of 14 datasets. This trend can also be seen in the graphs.

Table 4 and Fig. 5 show the F-Score obtained by AEkNN with different PPL values. The values indicate that the configurations with a single hidden layer get better results in 11 out of 14 datasets. The configurations with PPL = (0.25) and PPL = (0.75) are the ones that win most often (5 times each). The version with PPL = (0.25) shows disparate results, with the best values in some cases and bad results in others, for example with hapt, image or microv2. Although the version with PPL = (0.75) wins the same number of times, its results are more balanced.
In Table 5 the results for AUC obtained with AEkNN can be seen. Fig. 6 represents those results. For this metric, it can be appreciated that single hidden layer structures work better, obtaining top results in 12 out of 14 datasets. However, a configuration that works best for all cases has not been found.

Summarizing, the overall best configuration cannot be determined, since for each dataset there is a setting that works best. However, some initial trends can be drawn. Single-layer configurations obtain the best results more often than configurations with three hidden layers. Also, the configurations with PPL = (0.75) and PPL = (0.5) offer close to the best value in most cases, while the other configurations are sometimes far from the best value.

Once the results have been obtained, it is necessary to determine whether there are statistically significant differences among them in order to select the best configuration. To do this, the Friedman test [31] is applied. Average ranks obtained by applying the Friedman test for the Accuracy, F-Score and AUC measures are shown in Table 6. In addition, Table 7 shows the p-values obtained by the Friedman test.

As can be observed in Table 7, for AUC (which is considered a stronger performance metric) there are statistically significant differences between the considered PPL values if we set the p-value threshold to the usual range [0.05, 0.1]. However, for Accuracy and F-Score there are no statistically significant differences. In addition, the rankings show that two specific configurations offer better results than the remaining ones. In the three rankings presented, the results with PPL = (0.75) and PPL = (0.5) appear first, clearly ahead of the other values. Therefore, these two configurations are considered to be the best ones. Thus, the results of AEkNN with both configurations will be compared against the kNN algorithm.
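For reference, the average ranks and the Friedman statistic used in this kind of analysis can be computed as in the following sketch. It assumes that higher scores are better and that tied values share the mean of their ranks; the toy scores are invented for the example and do not correspond to any table of this study. The p-value would then be obtained from a chi-square distribution with k - 1 degrees of freedom, which is not included here.

```python
def average_ranks(scores):
    """scores[d][c] is the score of configuration c on dataset d.

    Rank 1 is assigned to the best (highest) score of each dataset.
    """
    n_conf = len(scores[0])
    totals = [0.0] * n_conf
    for row in scores:
        order = sorted(range(n_conf), key=lambda c: -row[c])
        pos = 0
        while pos < n_conf:
            end = pos
            # Extend the group while the next score ties with the current one.
            while end + 1 < n_conf and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean rank of the tied group
            for c in order[pos:end + 1]:
                totals[c] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic for N datasets and k configurations."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


toy = [[0.9, 0.8, 0.7], [0.9, 0.8, 0.7], [0.9, 0.8, 0.7]]
print(average_ranks(toy), friedman_statistic(toy))
```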

AEkNN vs kNN
This second part is focused on determining if the results obtained with the proposed algorithm, AEkNN, improve those obtained with the kNN algorithm. To do so, a comparison will be made between the results obtained with AEkNN, using the values of the PPL parameter selected in the previous section, and the results obtained with kNN algorithm on the same datasets.
First, Table 8 shows the results for each one of the datasets and considered measures, including running time. The results for both algorithms are presented jointly, and the best ones are highlighted in bold. As in the previous phase, two plots have been generated for each metric in order to optimize data visualization, since the range of results is very broad. Fig. 7 represents the results for Accuracy, Fig. 8 for F-Score, Fig. 9 for AUC, and Fig. 10 for runtime.

The results shown in Table 8 indicate that AEkNN works better than kNN for most datasets considering Accuracy. On the one hand, the version of AEkNN with PPL = (0.75) improves kNN in 11 out of 14 cases, obtaining the best overall results in 6 of them. On the other hand, the version of AEkNN with PPL = (0.5) obtains better results than kNN in 11 out of 14 cases, being the best configuration in 6 of them. In addition, kNN only obtains one best result. This can also be seen in Fig. 7.

Analyzing the data corresponding to the F-Score metric, presented in Table 8, it can be observed that AEkNN produces an overall improvement over kNN. The AEkNN version with PPL = (0.75) improves kNN in 11 out of 14 cases, obtaining the best overall results in 5 of them. The version of AEkNN with PPL = (0.5) obtains better results than kNN in 10 out of 14 cases, being the best configuration in 7 of them. kNN does not obtain any best result. Fig. 8 shows how the AEkNN versions produce higher values than kNN.
The data related to AUC, presented in Table 8, also show that AEkNN works better than kNN. In this case, each of the two versions of AEkNN improves kNN in 11 out of 14 cases, and together they obtain the best results in 13 out of 14 cases. Fig. 9 shows that the values tend to be higher for the versions of the new algorithm. kNN only gets one best result, specifically with the coil2000 dataset, maybe due to the low [...]

The running times for both algorithms are presented in Table 8 and Fig. 10. As can be seen, the configuration that takes the least time to classify is AEkNN with PPL = (0.5), which obtains the lowest value for all datasets. This is due to the higher compression of the data achieved by this configuration. In the same way, AEkNN with PPL = (0.75) obtains better running times than the kNN algorithm in all cases.
Summarizing, it can be observed that the results obtained through AEkNN improve those obtained with the original kNN algorithm for most of the datasets. Despite the transformation of the input space to reduce dimensionality, the classification performance of AEkNN is better than that of kNN in most cases. In addition, in terms of classification time, AEkNN with the higher compression of information significantly reduces the time spent on classification, without a negative impact on the other measures.
To determine whether there are statistically significant differences between the obtained results, the proper statistical test has been conducted. For this purpose, the Wilcoxon test is performed, comparing each version of AEkNN against the results of the classical kNN algorithm. Table 9 shows the results obtained for the Wilcoxon tests. As can be seen, the p-values are rather low, so it can be concluded that statistically significant differences exist between the two AEkNN versions and the original kNN algorithm in all considered measures, considering the p-value threshold within the usual [0.05, 0.1] range.

On the one hand, taking into account Accuracy, F-Score and AUC, the configuration with the best results is the one that keeps 75% of the original features (PPL = (0.75)). Therefore, it is the optimal solution from the point of view of predictive performance. The reason for this might be that there is less compression of the data and, therefore, less loss of information compared to the other considered configuration (50%). On the other hand, considering running time, the configuration with the best results is the one that keeps 50% of the original features (PPL = (0.5)). It is not surprising that having fewer features allows distances to be computed in less time.
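The statistic behind the Wilcoxon signed-rank test can be sketched as follows. The code only computes the sums of positive and negative ranks; it assumes that zero differences are discarded and that tied absolute differences share their mean rank, and it leaves the derivation of the p-value to a statistical package. The paired values in the example are invented and do not come from Table 8.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Group differences with the same absolute value.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        shared = (pos + end) / 2 + 1  # mean rank of the tied group
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Invented scores (in percent) of two methods over four datasets.
method_a = [90, 80, 70, 60]
method_b = [85, 82, 60, 60]
print(wilcoxon_signed_rank(method_a, method_b))
```

A large gap between the two rank sums indicates that one method is consistently ahead of the other across datasets, which is what a low p-value reflects.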

AEkNN vs PCA / LDA
The objective of this third part is to assess the competitiveness of AEkNN against traditional dimensionality reduction algorithms. In particular, the algorithms used are PCA [63,43] and LDA [32,92], since they are traditional algorithms that offer good results in this task [61]. To do so, a comparison is made between the results obtained with AEkNN, using the values of the PPL parameter selected in Section 5.3, and the results obtained with the PCA and LDA algorithms on the same datasets. It is important to note that the number of features selected with these methods is the same as with the AEkNN algorithm, so there are two executions for each algorithm.
First, Table 10 shows the results for each one of the datasets and considered measures. The results for the three algorithms are presented jointly, and the best ones are highlighted in bold. One plot has been generated for each metric aiming to optimize data visualization. In this case, the graphs represent the best value of the two configurations for each algorithm in order to better visualize the differences between the three methods. Fig. 11 represents the results for Accuracy, Fig. 12 for F-Score and Fig. 13 for AUC.
The results shown in Table 10 indicate that AEkNN works better than PCA and LDA for most datasets considering Accuracy. On the one hand, the version of AEkNN with PPL = (0.75) improves PCA in 12 out of 14 cases and LDA in 9 out of 14 cases, obtaining the best overall results in 7 of them. On the other hand, the version of AEkNN with PPL = (0.5) obtains better results than PCA [...]

Analyzing the data corresponding to the F-Score metric, presented in Table 10, it can be observed that AEkNN produces an overall improvement over PCA and LDA. The AEkNN version with PPL = (0.75) improves PCA in 12 out of 14 cases and LDA in 10 out of 14 cases, obtaining the best overall results in 6 of them. The version of AEkNN with PPL = (0.5) obtains better results than PCA in all cases and LDA in 8 out of 14 cases, being the best configuration in 4 of them. PCA does not obtain any best result and LDA obtains the best result in 4 cases. Fig. 12 shows how the AEkNN versions reach higher values than PCA and LDA.

The data related to AUC, presented in Table 10, also show that AEkNN works better than PCA and LDA. The AEkNN version with PPL = (0.75) improves PCA in 13 out of 14 cases and LDA in 10 out of 14 cases, obtaining the best overall results in 7 of them. The version of AEkNN with PPL = (0.5) obtains better results than PCA in 13 out of 14 cases and LDA in 8 out of 14 cases, being the best configuration in 4 of them. LDA only gets the best result in 4 cases. Fig. 13 confirms this trend.

Summarizing, it can be observed that the results obtained through AEkNN improve those obtained with the PCA and LDA algorithms for most of the datasets. The classification performance of AEkNN is better than that of PCA and LDA in most cases. This means that the high-level features obtained by the AEkNN algorithm provide more relevant information than those obtained by the PCA and LDA algorithms.
Therefore, the results obtained by classifying with the AEkNN algorithm improve those obtained with PCA and LDA.
The data obtained in the experimentation have been presented and compared above. However, it is necessary to verify whether there are significant differences between the results of the different algorithms. To do this, the Friedman test [31] is applied. Average ranks obtained by applying the Friedman test for the Accuracy, F-Score and AUC measures are shown in Table 11. In addition, Table 12 shows the p-values obtained by the Friedman test.

As can be observed in Table 12, for Accuracy, F-Score and AUC there are statistically significant differences between the compared algorithms if we set the p-value threshold to the usual range [0.05, 0.1]. It can be seen that AEkNN with PPL = (0.75) offers better results than the remaining ones. In the three rankings presented, the AEkNN configurations with PPL = (0.75) and PPL = (0.5) appear first, clearly ahead of the other values. Therefore, it is considered that AEkNN obtains better predictive performance, since its reduction of dimensionality generates more significant features.

General Guidelines on the use of AEkNN
AEkNN could be considered a robust algorithm on the basis of the previous analysis. The experimental work demonstrates that it performs well with the two PPL values considered. From the conducted experimentation some guidelines can also be extracted:
- When working with very high-dimensional datasets, it is recommended to use AEkNN with the PPL = (0.5) configuration. In this study, this configuration has obtained the best results for datasets having more than 600 features. The reason is that the input data has a larger number of features, which allows a greater reduction without losing relevant information. Therefore, AEkNN can compress more in these cases.
- When using binary datasets with a lower dimensionality, the AEkNN algorithm with the PPL = (0.5) configuration continues to be the best choice. In our experience, this configuration has been shown to work better for binary datasets with a number of features around 100. In these cases, the compression may be higher since it is easier to discriminate by class.
- For all other datasets, the choice of configuration for AEkNN depends on the indicator to be enhanced. On the one hand, if the goal is to achieve the best possible predictive performance, the configuration with PPL = (0.75) must be chosen. In these cases, AEkNN needs to use more of the original information. On the other hand, when the interest is to optimize the running time, while maintaining improvements in predictive performance with respect to kNN, the configuration with PPL = (0.5) is the best selection. The reason lies in the higher compression of the data: AEkNN needs less time to classify lower-dimensional data.
Summarizing, the configuration of AEkNN must be adapted to the data traits to obtain optimal results. For this, a series of tips have been established.
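These guidelines can be condensed into a small helper, shown here only as an illustrative reading of the tips above. The function name suggest_ppl and the goal labels are hypothetical, the 600-feature threshold is the one observed in this study, and the rule for binary datasets is applied regardless of the exact number of features although it was only observed for around 100.

```python
def suggest_ppl(n_features, n_classes, goal="performance"):
    """Suggest a PPL configuration following the guidelines of this section.

    goal is either "performance" (predictive quality) or "time" (running time).
    """
    if goal not in ("performance", "time"):
        raise ValueError("goal must be 'performance' or 'time'")
    if n_features > 600:
        # Very high-dimensional data tolerates a stronger compression.
        return (0.5,)
    if n_classes == 2:
        # Binary datasets with lower dimensionality (assumption: any size).
        return (0.5,)
    # Remaining datasets: the choice depends on the indicator to enhance.
    return (0.75,) if goal == "performance" else (0.5,)


print(suggest_ppl(n_features=1000, n_classes=10))
print(suggest_ppl(n_features=100, n_classes=2))
print(suggest_ppl(n_features=300, n_classes=6, goal="performance"))
print(suggest_ppl(n_features=300, n_classes=6, goal="time"))
```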

Concluding remarks
In this paper, a new classification algorithm called AEkNN has been proposed. This algorithm is based on kNN but aims to mitigate the problem that arises while working with high-dimensional data. To do so, AEkNN internally incorporates a model-building phase aimed at reducing the feature space, using AEs for this purpose. The main reason that has led to the design of AEkNN is the good results obtained by AEs when they are used to generate higher-level features. AEkNN relies on an AE to extract a reduced, higher-level representation that replaces the original data.
In order to determine whether the proposed algorithm offers better results than the kNN algorithm, an experimentation has been carried out. Firstly, the analysis of the results has made it possible to determine which AE structure works better. Furthermore, in the second part of the conducted experimentation, the results of the best configurations have been compared with the results produced by kNN. As has been stated, the results of AEkNN improve those obtained by kNN for all the metrics. In addition, AEkNN offers a considerable improvement with respect to the time invested in the classification.
In addition, a comparison has been made with other traditional methods applied to this problem, in order to verify that the AEkNN algorithm improves the results when carrying out the dimensionality reduction task. For this, AEkNN has been compared with LDA and PCA. The results show that the proposed AEkNN algorithm improves classification performance for most of the datasets used. This occurs because the features generated with the proposed algorithm are more significant and provide more relevant information to distance-based classification algorithms.
In conclusion, AEkNN is able to reduce the adverse effects of high-dimensional data while performing instance-based classification, improving both running time and classification performance. These results show that the use of AEs can be helpful to solve this kind of obstacle, opening up new possibilities of future work in which they are applied to help solve similar problems presented by other traditional models.