MAP-REDUCE BASED DISTANCE WEIGHTED K-NEAREST NEIGHBOR MACHINE LEARNING ALGORITHM FOR BIG DATA APPLICATIONS

With the evolution of Internet standards and advancements in various Internet and mobile technologies, especially since web 4.0, more and more web and mobile applications have emerged, such as e-commerce, social networks, online gaming, and Internet of Things based applications. Due to the deployment and concurrent access of these applications on the Internet and mobile devices, the amount and variety of generated data increase exponentially, and the new era of Big Data has come into existence. Presently available data structures and data analysis algorithms are not capable of handling such Big Data. Hence, there is a need for scalable, flexible, parallel, and intelligent data analysis algorithms to handle and analyze complex massive data. In this article, we propose a novel distributed supervised machine learning algorithm, called MR-DWkNN, based on the MapReduce programming model and the Distance Weighted k-Nearest Neighbor algorithm, to process and analyze Big Data in a Hadoop cluster environment. The proposed distributed algorithm is based on supervised learning and performs both regression and classification tasks on large-volume Big Data applications. Three performance metrics, namely Root Mean Squared Error (RMSE) and Determination coefficient (R2) for regression tasks and Accuracy for classification tasks, are used to measure the performance of the proposed MR-DWkNN algorithm. The extensive experimental results show an average increase of 3% to 4.5% in prediction and classification performance compared to the standard distributed k-NN algorithm, and a considerable decrease in Root Mean Squared Error (RMSE), with good parallelism characteristics of scalability and speedup, which demonstrates its effectiveness in Big Data predictive and classification applications.


1. Introduction.
Nowadays, recent applications such as credit ranking, pattern recognition, location-oriented Geographical Information Systems (GIS), facilities management, recommendation systems, computer vision, and smart-city services need well-organized query processing using big data methodologies. One of the best query retrieval methods is the Nearest Neighbor (NN) query. Given two sets of points, A as the query set and B as the training set, the NN query finds, for every point of A, its nearest neighbors in B. This form of query is very helpful in practice.
For example, kNN can be used in recommendation systems since it helps to find users with similar characteristics. It can also be used in an online video streaming platform, for example, to suggest content that a user is more likely to watch based on what other users watch. The kNN algorithm can also be used for image categorization. It is important in a variety of computer vision applications since it can group similar data points together, such as cats and dogs in separate classes. Additional applications of kNN include classification, graph-based computational learning, etc. When the datasets concerned are moderately small, kNN queries can be answered efficiently in a centralized environment. This is commonly not the case when we work with huge quantities of data, considering that we live in the era of the World Wide Web and mobile technology. A huge quantity of data is produced by GIS devices, sensor networks, geotagged tweets, scientific devices, etc., and processing it is very challenging; thus, mapping the computation onto a parallel and distributed computing environment is essential to get results in a reasonable amount of time.
The k-Nearest Neighbors (kNN) algorithm [1,2] is one of the top 10 supervised machine learning algorithms and performs classification as well as regression analysis in Big Data applications. Due to its non-parametric nature, easy implementation, and effectiveness, it has become an important part of the machine learning domain, and it lends itself naturally to parallel implementation in the MapReduce environment. As a result, kNN and the MapReduce framework have found applications in diverse fields such as pattern classification [3], image classification [4,5], big data classification [6], automated World Wide Web usage mining [7], and document classification [8,9,10,11,12].
The contributions of this paper include:
• A distributed and parallel Distance Weighted k-Nearest Neighbor model is proposed to analyze big data.
• A Hadoop MapReduce framework cluster is established to improve the execution efficiency and scalability of the proposed system.
• Different block sizes of data in the underlying HDFS are chosen to handle large datasets with high efficiency.
• Several experiments are designed and executed to show the speedup and scale-up of the proposed distributed algorithm.
• The classification accuracy of the proposed system is measured and analyzed for different numbers of nearest neighbors k.
• The performance of the proposed system is evaluated on authentic, standard, and reliable datasets.
The rest of this paper is organized as follows. Section two provides the literature survey and Section three provides an introduction to the distance weighted k-NN algorithm. Section four discusses the Hadoop big data framework and the architecture of the proposed MR-DWkNN model with its Hadoop implementation. The experimental setup, including the Hadoop computational cluster and its components, the benchmark datasets, and the performance metrics for evaluation, is described in section five. The conduct of experiments, performance analyses, and results in comparison with standard k-NN are presented in section six, while section seven draws conclusions from the experimental study and suggests directions for future work.

Literature Survey.
Several variants of the kNN algorithm have been proposed to handle various kinds of data and have been shown to be very effective. Keller et al. [13] proposed a modified version of the kNN algorithm based on fuzzy concepts, called fuzzy-kNN, with three techniques to assign fuzzy memberships to the data points. Denoeux [14] proposed a method, D-SkNN, based on Dempster-Shafer theory to address the problem of unseen pattern classification in a dataset based on the kNN algorithm. The author demonstrated the performance of D-SkNN with a real-time dataset as well as a simulated dataset and compared it with the standard kNN and majority voting methods. Kuncheva [15] specified an intuitionistic fuzzy version of the kNN rule, called IF-kNN, and incorporated a voting rule with weights assigned based on the membership and non-membership to a certain class. Based on a threshold value, the vote is categorized as positive or negative. Yang et al. [16] developed an enhanced version of the kNN algorithm with a fuzzy editing rule and incorporated a few asymptotic properties into kNN. The experiments conducted on various datasets confirmed that this approach outperforms the standard kNN algorithm. Huang et al. [17] developed a modified version of kNN, called DCT-kNN, which is based on feature weighting and class distribution. The experiments on UCI datasets showed a considerable improvement in classification accuracy.
Liu et al. [18] introduced the kNN method for multi-class classification problems and demonstrated its performance in the classification of multi-class datasets. Liu and Zhang [19] designed a variant of the kNN algorithm termed mutual nearest neighbors (MkNN) to remove noisy data and improve the data classification accuracy. Zhang [20] incorporated the certainty factor measure into the kNN algorithm to apply it to datasets with imbalanced class distribution; the variant is called kNN-CF. The authors also demonstrated that the accuracy of the kNN-CF algorithm is better than that of the standard kNN algorithm. Shichao et al. [21] proposed a novel method called self-reconstruction to determine the k-value of the kNN algorithm for each training sample and applied it to real datasets for data classification. The experimental results show that this method outperformed the standard data classification methods in terms of classification accuracy.
In recent years, MapReduce-based distributed and parallel machine learning algorithms have been a focus of the research community. The MapReduce framework has emerged as a powerful, robust, and distributed parallel programming model [22,23,24,25] that provides a solution with good performance and efficient execution for large-scale data analytic applications, including data mining, web page access ranking, graph analysis, image classification, and bioinformatics [26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44]. Kolb et al. [45] investigated the use of the MapReduce programming model for parallel entity resolution with automated data partitioning on real-world datasets.
The application of the kNN algorithm and its variants in the context of big data for classification and regression has already been considered. Triguero et al. [46] suggested a method for handling big data in kNN classification. They proposed a novel partitioning method based on the MapReduce framework and distributed the functioning of the algorithms to multiple nodes in the cluster environment without any loss of classification accuracy. The performance of this method was tested with 5.7 million instances on the Poker hand dataset and obtained an accuracy of 0.5171 for k=3; the results show that kNN is a suitable algorithm to handle big data.
Ding et al. [47] proposed a clustering-based approach for processing large high-dimensional datasets. The authors added Principal Component Analysis for dimensionality reduction to the kNN classification algorithm in the processing of large datasets. Deng et al. [48] introduced the kNN algorithm in big data applications for classification. The authors applied the k-means clustering algorithm to a large dataset, splitting it into several subsets. Finally, they applied the kNN method and its variants RC-kNN and LC-kNN to classify the samples in each subset of the dataset, with accuracies of 72.21%, 83.89%, and 86.35% on the MNIST dataset. In [49], the authors propose a new distributed and parallel kNN join operation on large real and synthetic datasets on the MapReduce platform. They demonstrated the scalability and efficiency of the method with hundreds of millions of records. In [50], the authors developed a cost-effective MapReduce-based k-Nearest Neighbor (MR-kNN) algorithm for Big Data classification and tested it for different k-values against the Pokerhand dataset. The classification accuracy was about 0.5386 for k=7. In [51], an iterative Hadoop MapReduce method called iHMR-kNN was developed for kNN-based classification and analysis of an image dataset. Singh, A.P. et al. [52] employed kNN in deep learning architectures for image classification, which shows a higher classification accuracy. In [53], a label-driven latent subspace learning model for multi-view image classification was developed, which produces an improved classification result.

Distance Weighted k-NN Algorithm.
Let $T_R = \{x_1, \dots, x_n\}$ be the training dataset with $n$ instances and $m$ attributes. All the instances are class-labeled data points $(x_i, c_j)$, $j = 1, \dots, n_c$, where $c_j$ are the corresponding class labels of the instances. Here $n > n_c$. In the k-NN algorithm, the learned target function may be either a categorical (discrete-valued) or a continuous-valued (real-valued) function. Given a query instance $x_q$ from the collection of query instances $T_Q$, its unknown class $c'$ is determined as follows.
Step 1: Compute the Euclidean distance between the query instance and all instances in $T_R$. Let the instance $x_i$ in $T_R$ be described by the feature vector $(a_1(x_i), \dots, a_m(x_i))$, where $a_r(x_i)$ is the value of the $r$-th attribute of $x_i$. Then $d(x_q, x_i)$ is the Euclidean distance between the query instance $x_q$ and the instance $x_i$ in the dataset $T_R$ and is computed as per equation (3.1):

$$d(x_q, x_i) = \sqrt{\sum_{r=1}^{m} \left(a_r(x_q) - a_r(x_i)\right)^2} \tag{3.1}$$

Step 2: Sort the instances in the dataset $T_R$ in ascending order of their Euclidean distances as given in equation (3.2), where $n$ is the total number of data points.

Step 3(a): For a discrete-valued learning target function: The general form of a discrete-valued learning function is $f : \mathbb{R}^m \to C$, where $C$ is the finite set of class labels. Return the class label for the query instance $x_q$ as given in equation (3.3).

Step 3(b): For a real-valued target function: The general form of the function is $f : \mathbb{R}^m \to \mathbb{R}$. The target attribute value for the query instance $x_q$ is computed as given in equation (3.4).

The above steps illustrate the standard kNN algorithm, and this can be enhanced with the inclusion of a weight function as given in equation (3.5).

Step 4(a): Select the k nearest neighbors from the dataset $T_R$. Let $T_R' = \{x_1, \dots, x_k\}$ be the k nearest instances. Assign a weight $w_i$ to the $i$-th nearest neighbor of the query $x_q$ using the distance-weighted function as given in equation (3.5). Based on the majority voting, assign the class label $c_j$ to the query instance $x_q$ of the discrete-valued function as given in equation (3.6).

Step 4(b): For real-valued target functions, the target attribute value for the query instance $x_q$ is computed as given in equation (3.7).

4. Hadoop Implementation of our proposed algorithm.

Hadoop Distributed Framework.
Hadoop is an Apache project, initially introduced in 2007 as a free and open-source framework with two major components: the Hadoop Distributed File System (HDFS) and the MapReduce (MR) programming model. It has become very popular in recent years because of its simplicity, built-in fault tolerance, scalability on data-intensive tasks, parallel execution of tasks, and ability to run on a cluster with thousands of commodity hardware machines. The Hadoop framework itself takes care of job scheduling, job execution, control of all underlying tasks in the cluster, and other runtime management tasks.
MapReduce is a software framework that enables application developers to express parallel procedures in terms of Map and Reduce functions. The MapReduce programming model lets developers write and execute applications on a cluster consisting of a few hundred to several thousand commodity hardware machines. The three major classes of a MapReduce program are Master/Driver, Mapper, and Reducer. The Master class is responsible for setting the execution parameters of the MapReduce job to be run in the Hadoop cluster. These parameters are the names of the Mapper and Reducer classes, the data types, and the name of the job to be executed. The MapReduce framework operates on (key, value) pairs. Each Map task processes an input split (block) and generates intermediate data in (key, value) format. The intermediate pairs are then sorted and partitioned by key so that, in the Reduce phase, pairs with the same key are sent to the same reducer for further processing [54,55].
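The flow of (key, value) pairs described above can be imitated in a few lines of plain Python. The following is only an in-memory sketch under stated assumptions: it is not Hadoop code, the function names `run_mapreduce`, `label_count_map`, and `sum_reduce` are hypothetical, and the three phases run sequentially rather than on a cluster.

```python
from collections import defaultdict


def run_mapreduce(splits, map_fn, reduce_fn):
    """Run map, shuffle, and reduce phases sequentially over in-memory splits."""
    # Map phase: each input split is processed independently
    # and emits intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))

    # Shuffle phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: keys are visited in sorted order, one reduce call per key.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def label_count_map(split):
    """Hypothetical mapper: emit (label, 1) for every record in the split."""
    for label in split:
        yield label, 1


def sum_reduce(key, values):
    """Hypothetical reducer: add up all counts collected for one key."""
    return sum(values)


splits = [["cat", "dog", "cat"], ["dog", "cat"]]  # two toy input splits
counts = run_mapreduce(splits, label_count_map, sum_reduce)
print(counts)
```

In a real Hadoop job the grouping step is performed by the framework between the Mapper and Reducer classes; here it is written out explicitly only to make the data flow visible.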

MapReduce Architecture of proposed Distance Weighted kNN Algorithm.
The main purpose of a supervised machine learning technique is to let an algorithm learn from previous historical data/events and extract knowledge from these events. The extracted knowledge is represented as an intelligent mathematical model that can be used to make predictions or classifications for a given new scenario or future events. In most general terms, a machine learning model consists of two phases, the training phase and the testing phase. Fig. 5.1 depicts the MapReduce architecture of the proposed Distance weighted kNN algorithm and its implementation.
Algorithm 1 provides the details of the Map function of the Distance weighted kNN. The input training dataset TR with n samples and m features is split into s partitions/blocks p1, p2, • • • , ps in the distributed file system of the Hadoop cluster. Each partition occupies one HDFS block of the size (64 MB/128 MB/256 MB) initially set in the Hadoop file system. Each partition contains n/s samples, distributed uniformly, with a default replication factor of 3. The dataset TQ contains the query instances and, since k-Nearest Neighbor is a lazy learner, the pattern matching process is initiated only on the submission of a query instance xq to the system. For each block of the input file, a separate map task is created, and each map task computes the Euclidean distance between the query instance xq and all the instances in the block. The Euclidean distances computed for all the blocks are converted into <key, value> pairs of the form <instance_id, (Euclidean distance, attributes)>, and these are stored in HDFS as intermediate results.
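The map step described above can be sketched in Python as follows. This is an illustrative approximation rather than the Algorithm 1 listing: the record layout `(instance_id, features, target)` and the function names are assumptions, and for brevity the emitted value carries only the distance and the target attribute instead of all attributes.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dwknn_map(block, query):
    """Hypothetical map task for one HDFS block.

    block: iterable of (instance_id, features, target) records (assumed layout).
    query: feature vector of the query instance x_q.
    Emits <instance_id, (distance, target)> pairs.
    """
    for instance_id, features, target in block:
        yield instance_id, (euclidean(query, features), target)


# Toy block with two training instances and a query at the origin.
block = [(0, (0.0, 0.0), 1.0), (1, (3.0, 4.0), 0.0)]
pairs = list(dwknn_map(block, (0.0, 0.0)))
print(pairs)
```

Because every block is processed independently of the others, one such map task can run per block in parallel, which is what makes the distance computation scale with the number of cores.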

Datasets.
In this research work, we use six large benchmark datasets from the UCI machine learning repository. Among the six datasets, three datasets with real-valued target attributes are used for the regression task (Table 6.1), and the remaining three datasets with discrete-valued target attributes are used for the classification task (Table 6.2). We tabulate the number of attributes (attributes), the data type of the attributes (Data types), the number of instances (instances), the file size (File size), and the year of publication (year). In our experimental work, all the datasets are partitioned into training and test datasets using an n-fold cross-validation technique.
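As an illustration of the n-fold partitioning mentioned above, the following sketch generates train/test index sets. The function name `k_fold_indices`, the shuffling with a fixed seed, and the choice of five folds are assumptions made for this example; the excerpt does not state the number of folds actually used.

```python
import random


def k_fold_indices(n_instances, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds rounds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # All folds except the i-th one form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_instances=10, n_folds=5))
for train, test in splits:
    print(len(train), len(test))
```

With five folds, each round uses 80% of the instances for training and 20% for testing.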

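Root Mean Square Error (RMSE):
The error of the regression model is measured in terms of RMSE. This metric is the square root of the mean of the squared differences between the predicted and expected values of the target attribute. RMSE is computed as given in equation (6.1):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{6.1}$$

where $y_i$ is the expected value, $\hat{y}_i$ is the predicted value of the target attribute for the $i$-th test instance, and $n$ is the number of test instances.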

Determination coefficient (R2):
This metric analyzes how much of the variation in one variable (y) can be explained by the variation in another variable (x). It evaluates how well a model approximates the real data points. The higher the R2, the better the prediction model. Its value, usually between 0 and 1, is computed using equation (6.2).
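The two regression metrics can be sketched in Python as follows. RMSE follows the verbal definition given above; for R2 the sketch assumes the common form "one minus the ratio of residual to total sum of squares", which may differ in detail from equation (6.2). The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between expected and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Determination coefficient, assumed here to be 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

Under this definition a perfect predictor gives R2 = 1, while a predictor no better than the mean of the target values gives R2 = 0.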
Accuracy: The performance of a classification model is measured using the accuracy metric. It is defined as the ratio of the number of correctly predicted samples to the total number of input samples in the dataset, as given in equation (6.3).

$$\mathrm{Accuracy} = \frac{\text{Number of correctly predicted samples}}{\text{Total number of input samples}} \tag{6.3}$$

These are three commonly used performance metrics for measuring the performance of standard predictors/classifiers.

Scalability: This defines the capability of an algorithm in a parallel computing cluster environment to scale with both the number of processing cores/nodes in the cluster and the number of instances in the dataset. It can also be defined as the capability of an s-times larger computing system in the cluster environment to execute an s-times larger computational task in the same job execution time as the original system, and this can be expressed as in equation (6.4).

$$\mathrm{Scalability}(s, TD) = \frac{JET(1, TD)}{JET(s, sTD)} \tag{6.4}$$

where s is the number of cores/nodes in the cluster environment, JET(1, TD) is the job execution time on one core/node with a data size of TD, and JET(s, sTD) is the job execution time of the parallel tasks with s cores/nodes in the cluster environment with a data size of s times TD. Ideal parallelism shows a constant scale-up with an increasing number of computing cores in the cluster and dataset size [56,57].

Speedup: Speedup is another measure employed in evaluating the performance of parallel algorithms; it quantifies how much the job execution time in the cluster is reduced by parallel execution. It is defined as the ratio of the time of sequential execution to the time of parallel execution. Speedup can be expressed as in equation (6.5).

$$\mathrm{Speedup}(s, TD) = \frac{JET(1)}{JET(s)} \tag{6.5}$$

where s is the number of cores in the cluster environment, JET(1) is the job execution time on one core with a data size of TD, and JET(s) is the job execution time of the parallel tasks with s cores in the cluster environment with the same data size TD.
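Equations (6.3) to (6.5) are simple ratios, and a small Python sketch may make their use concrete. The function names and all timing values below are hypothetical and serve only to illustrate the arithmetic; they are not measurements from the experiments.

```python
def accuracy(y_true, y_pred):
    """Equation (6.3): share of correctly predicted samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def scalability(jet_1_td, jet_s_std):
    """Equation (6.4): JET(1, TD) / JET(s, s*TD)."""
    return jet_1_td / jet_s_std


def speedup(jet_1, jet_s):
    """Equation (6.5): JET(1) / JET(s) for a fixed data size TD."""
    return jet_1 / jet_s


# Hypothetical job execution times in seconds (illustrative only).
acc = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
scale_up = scalability(jet_1_td=100.0, jet_s_std=125.0)  # s cores, s times the data
speed_up = speedup(jet_1=100.0, jet_s=25.0)              # s cores, same data
print(acc, scale_up, speed_up)
```

A scalability value of 1.0 corresponds to ideal scale-up, and a speedup equal to the number of cores s corresponds to ideal (linear) speedup.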

Conduct of experiments and discussion of results.
In this section, we evaluate the proposed supervised MR-DWkNN algorithm on six public datasets from the UCI Machine Learning repository. We describe the four experiments conducted and compare the results collected from these experiments with the MR-SDkNN algorithm.
1. In the first two experiments, the MR-DWkNN method was executed against the three datasets under the regression category and another three datasets under the classification category. The performance metrics are compared with the standard MR-SDkNN method as given in sections 5.1 and 5.2, respectively.
2. Third, the scalability performance of the MR-DWkNN model on regression and classification datasets is analyzed and reported in section 5.3.
3. Finally, the performance of our proposed method for different k-values was tested, and the performance metrics are described in section 5.4.

Performance of MR-DWkNN model with MR-SDkNN model.
In this experiment, the HDFS block size is set to 64 MB, and 10% of the instances from the Higgs dataset, 20% from the Susy dataset, and the entire Pokerhand dataset are chosen. Initially, we run the parallel version of the standard kNN algorithm, MR-SDkNN, over all six datasets; it is used for comparison with its variants. To do this, 20% of the instances are chosen randomly from each dataset as query instances (TQ), and the remaining 80% of the instances are considered as training instances (TR). The performance metrics root mean square error (RMSE) and Determination coefficient (R2) are recorded for the regression task, and accuracy is recorded for the classification task.
Afterward, we executed the MR-DWkNN algorithm over all three datasets under the classification category. For this, the entire input dataset file is split into equal-sized files of the HDFS block size, each containing an equal number of samples. The samples were distributed evenly across the files, and the performance of the proposed MR-DWkNN algorithm was studied. Initially, the MapReduce model is trained with the 80% training samples. Table 7.1 shows the classification accuracy of MR-SDkNN and MR-DWkNN on three benchmark datasets for two different k-values. Similarly, the regression performance of MR-SDkNN and MR-DWkNN is tabulated in Table 7.2. From these two tables, we can conclude that:
1. In the case of the classification task, the proposed MR-DWkNN algorithm produces an increase in classification accuracy in the range of 1.5% to 3.5% for all three datasets as compared with the standard MR-SDkNN.
2. In the case of the regression task, the MR-DWkNN algorithm produces an increase in Determination coefficient (R2) in the range of 1.5% to 3.2%, and consequently there is a fall in RMSE in the range of 1.5% to 2.5% for all three datasets as compared with the standard MR-SDkNN.

Scalability, dataset split and distribution.
To demonstrate how well both MR-SDkNN and MR-DWkNN scale up, two datasets under each category were chosen. The scalability experiments were performed by increasing the number of instances (the size of the dataset) in proportion to the number of cores. Table 7.3 summarizes the datasets used for the scalability experiment, the number of instances, the computing processor cores, the number of map tasks, and the scalability results obtained for both the MR-SDkNN and MR-DWkNN algorithms. From Table 7.3, it is observed that the scalability of both MR-SDkNN and MR-DWkNN decreases slowly as the size of the dataset and the number of processor cores used for the computation increase. ... of 7 map tasks are created and a cluster of 7 cores is used for its execution. Afterward, the size of the dataset and the number of cores in the cluster were increased proportionately. The results show that MR-SDkNN and MR-DWkNN maintain a scale-up higher than 94.78% and 91.87%, respectively.
Fig. 7.4 shows the scalability performance of our proposed MapReduce-based kNN versions on the Wave energy converters dataset. The initial size of the dataset is 123 MB with 0.288 million instances, which occupies one HDFS block of size 128 MB. For this experiment, the HDFS block size is configured as 128 MB. Initially, one core in the cluster is used to execute the proposed model, and subsequently the experiment was conducted with dataset sizes of 0.576 million, 1.152 million, 2.304 million, 4.608 million, and 9.216 million instances on 2, 4, 8, 16, and 32 cores, respectively. From this experiment, it is observed that the scalability of both MR-SDkNN and MR-DWkNN decreases gradually as the size of the dataset and the number of cores used for its execution increase, and that they maintain a scale-up higher than 87.69% and 85.67%, respectively. From these experiments, we can conclude that:
• Hadoop clusters can handle large volumes of data and provide as many processor cores as required for the execution of our proposed algorithm.
• Both MR-SDkNN and MR-DWkNN are able to scale up when the size of the dataset and the number of computing processor cores increase.
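The relation between dataset size, HDFS block size, and the number of map tasks used in these experiments can be approximated with a one-line calculation. The sketch below assumes exactly one map task per HDFS block and ignores Hadoop's configurable split-size settings; the function name `num_map_tasks` is hypothetical.

```python
import math


def num_map_tasks(file_size_mb, block_size_mb):
    """Approximate number of map tasks, assuming one task per HDFS block."""
    return max(1, math.ceil(file_size_mb / block_size_mb))


# The 123 MB Wave energy converters file fits into a single 128 MB block.
single_block = num_map_tasks(123, 128)
# With a 64 MB block size, the same file would need two blocks.
two_blocks = num_map_tasks(123, 64)
print(single_block, two_blocks)
```

Under this assumption, choosing a smaller block size increases the number of map tasks for the same file and therefore the degree of parallelism available to the cluster.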

Speedup.
To measure the speedup performance of both the MR-SDkNN and MR-DWkNN algorithms, two datasets under each category were chosen. The speedup experiments were performed by increasing the number of dataset splits in proportion to the number of computing cores in the cluster while keeping the size of the dataset fixed. Table 7.4 summarizes the datasets used for the speedup experiments, the number of instances, the computing processor cores, the number of map tasks, and the speedup (x times) results obtained for both the MR-SDkNN and MR-DWkNN algorithms. It is observed that the speedup of both MR-SDkNN and MR-DWkNN increases in proportion to the number of computing processor cores enabled in the cluster. From these experiments, we can conclude that:
• Both MR-SDkNN and MR-DWkNN are able to speed up when the number of computing processor cores increases for a fixed-size dataset.
• The execution efficiency of the proposed algorithms in the cluster improves proportionally with the size of the cluster.

Influence of neighborhood size k on performance metrics and comparison.
To investigate the influence of the neighborhood size k on the performance metrics, we have chosen all six datasets under the two categories. The performance of the proposed distributed algorithms is measured in a cluster with an HDFS block size of 64 MB, and the neighborhood size k ranges from 3 to 31. Table 7.5 shows the influence of the neighborhood size k on the classification accuracy on the three benchmark classification datasets. It is observed that the classification performance of the MR-DWkNN classifier is better than that of the MR-SDkNN classifier with increasing neighborhood size k. However, after a certain value of k, the classification accuracy starts decreasing on all three datasets. The best classification accuracy of each classifier on the benchmark datasets is shown in boldface against the corresponding k-value. The values in parentheses represent the corresponding dataset.
Table 7.6 shows the influence of the neighborhood size k on the regression task and its performance metric, the determination coefficient (R2). The determination coefficient (R2) shows how closely the data instances fit the regression line, and it also measures the strength of the relationship between the distributed machine learning model constructed in training and the response variable. The three benchmark datasets Superconductivity, Year prediction MSD, and Wave Energy converters under the regression category are chosen. From the results, it is observed that the prediction performance of the MR-DWkNN method is better than that of the MR-SDkNN method with increasing neighborhood size k. However, after a certain value of k, the R2 value starts decreasing on all three datasets. The best R2 value of each method on the benchmark datasets is shown in boldface against the corresponding k-value. The values in parentheses represent the corresponding dataset.
2. MR-DWkNN is a scalable approach in a multi-node cluster environment and a proven parallel approach with promising performance metrics in Big Data applications. The future work considered is the improvement of the runtime execution of Map and Reduce tasks through the use of other big data frameworks such as Spark and Flink. In addition, kNN may be integrated with various deep learning architectures for image analytics in multi-class classification.
... equation (3.6) and computes the target attribute value for the real-valued target function as given in equation (3.7). The detailed operation of the Reduce function is given in Algorithm 2. The MapReduce architecture of the standard k-NN is the same as Fig. 5.1 with the exclusion of the distance-weighted function given in equation (3.5), and is called the MR-SDkNN algorithm.
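The reduce step can be sketched in Python as follows. Since the weight function of equation (3.5) is not reproduced in this text, the sketch assumes the widely used inverse-squared-distance weight, which may differ from the weight actually used in MR-DWkNN; the function names and the `(distance, target)` input layout are likewise assumptions.

```python
from collections import defaultdict


def inverse_square_weight(distance, eps=1e-12):
    """Assumed weight function: closer neighbors receive larger weights."""
    return 1.0 / (distance ** 2 + eps)  # eps avoids division by zero


def dwknn_reduce(pairs, k, discrete):
    """Hypothetical reduce step over (distance, target) pairs for one query.

    Returns a class label (discrete=True) or a weighted mean (discrete=False).
    """
    nearest = sorted(pairs, key=lambda p: p[0])[:k]  # k nearest neighbors
    weights = [inverse_square_weight(d) for d, _ in nearest]
    if discrete:
        # Weighted majority vote: each neighbor votes with its weight.
        votes = defaultdict(float)
        for w, (_, label) in zip(weights, nearest):
            votes[label] += w
        return max(votes, key=votes.get)
    # Weighted average of the neighbors' target values.
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


pairs = [(1.0, "a"), (2.0, "b"), (2.0, "b"), (5.0, "a")]
label = dwknn_reduce(pairs, k=3, discrete=True)
value = dwknn_reduce([(1.0, 10.0), (2.0, 20.0)], k=2, discrete=False)
print(label, value)
```

In the toy classification example, an unweighted majority vote over the three nearest neighbors would return "b", whereas the distance-weighted vote returns "a" because the single closest neighbor outweighs the two more distant ones. Dropping the weights (setting them all to 1) turns the sketch into the standard k-NN variant.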

Algorithm 2: Reduce function of Distance Weighted kNN
Procedure DWkNN-REDUCE (key, value)
    Key: instance_id
    Value: Euclidean distance, attributes
    For all k-nearest instances of TR do the following
        Compute w_i as given in equation (3.5)
        Assign the distance-weighted value to all k instances
        If the target function is discrete-valued then
            Find the class label of x_q using equation (3.6)
            Compute the performance metric classification Accuracy
            Context.write(x_q, <class label, accuracy>)
        else
            Find the target attribute value of x_q using equation (3.7)
            Compute the performance metrics RMSE and Determination coefficient (R2)
            Context.write(x_q, <target attribute value, RMSE, Determination coefficient (R2)>)
        endif
    end for

Experimental Setup.
This section describes the multi-node Hadoop cluster configured for this research work (Sec 5.1), the specifications of the datasets chosen (Sec 5.2), and the various performance metrics employed to measure the classification and regression task performance (Sec 5.3).

Hadoop cluster environment.
The experiments designed for this research work are executed on a multi-node Hadoop cluster established with 25 physical machines. One machine is designated as the master node (name node) and configured to run the Hadoop services, and the remaining machines are configured as worker nodes (data nodes). The configuration of all the physical machines is a four-core Core i5 processor, 2.1 GHz clock speed, 16 GB RAM, 6 MB cache, and a 1 TB HDD with a 1 Gbps network card. The specific details of the software used in the cluster and the Hadoop environment configuration parameters are the following:
Hadoop framework version: 2.9.0
Operating system: Ubuntu Linux 18.04.03 LTS 64-bit
Replication factor: 3
HDFS block size: 64 MB/128 MB
Virtual memory for the map and reduce tasks: 8 GB

Fig. 7.1 shows the performance results of our proposed model on the Higgs dataset. The experiment was conducted with dataset sizes of 1.1 million, 2.2 million, 4.4 million, and 8.8 million instances on 12, 24, 48, and

Table 7.7 shows the classification accuracy of our proposed methods compared with the MR-kNN (Maillo et al.) and MRPR-FCNN (Triguero et al.) methods. It is concluded that our model MR-DWkNN classifies instances with a good accuracy rate for k = 1, 3, 5, and 7 as compared to the other models on the Poker hand dataset.

Conclusions and further work.
In this research work, we have developed MapReduce-based implementations of two different versions of the kNN algorithm: MR-SDkNN, based on the standard k-NN algorithm, and MR-DWkNN, based on the distance weighted k-NN algorithm. The distributed learning model is constructed with 80% of the instances as training instances, and the remaining 20% of the instances are used as test (query) instances. Various experiments are carried out on six recently published large-volume benchmark datasets from the UCI repository. The predictive and classification performance of MR-DWkNN is evaluated in terms of three metrics: Root Mean Squared Error (RMSE), Determination Coefficient (R2), and Accuracy. In addition to these, the scalability performance of the proposed algorithm was also tested on a Hadoop multi-node cluster. The results obtained from these experiments have shown that the main accomplishments of MR-DWkNN are the following:
1. There is an increase in performance, such as classification accuracy and determination coefficient (R2), of the proposed MR-DWkNN as compared to the MR-SDkNN.

Table 6.1: Summary description of the datasets with discrete-valued target attributes
In our research work, the performance of the proposed MR-DWkNN algorithm is assessed with the following four metrics.

Table 7.1: Performance metrics of a Classification task

Table 7.2: Performance metrics of a Regression task

Table 7.5: Influence of neighborhood size k on the classification task

Table 7.6: Influence of neighborhood size k on the regression task