Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop

Density-based clustering for big data is critical for many modern applications, ranging from Internet data processing to massive-scale moving object management. This paper proposes the Cludoop algorithm, an efficient distributed density-based clustering for big data using Hadoop. First, we propose a serial clustering algorithm, CluC, that leverages cell partition optimization and c-clusters to find clusters fast. CluC classifies points using the relationships of the connected cells around them instead of expensive complete neighbor queries, which significantly reduces the number of distance calculations. Second, we propose Cludoop, which can efficiently cluster very-large-scale data in parallel using the already existing data partition on the Map/Reduce platform. It employs the proposed serial clustering CluC as a plugged-in clustering on parallel mappers, along with a cell description instead of the complete cell in transmission, to reduce both network and I/O costs. Guided by the proposed cell-based principles, we also design a Merging-Refinement-Merging 3-step framework to merge c-clusters on the overlay of the assigned preclustering results on the reducer. Finally, our comprehensive experimental evaluation on 10 network-connected commodity PCs, using both huge-volume real and synthetic data, demonstrates (1) the effectiveness of our algorithm in finding correct clusters with arbitrary shape and (2) that our algorithm exhibits better scalability and efficiency than the state-of-the-art method.


Introduction
Clustering groups data objects into classes or clusters so that objects within a cluster have high similarity, while objects in different clusters differ significantly from one another. Clustering has played a crucial role in numerous applications ranging from pattern recognition, mobile sensor networks, and moving object management to location-based services. With the rise of big data science, clustering analysis has attracted considerable interest in big data mining, and density-based clustering is particularly useful for distance-based data mining in many applications with increasingly large-scale data, owing to its capability to discover clusters with arbitrary shape.
Existing density-based algorithms such as DBSCAN [1], OPTICS [2], DENCLUE [3], and GDDSCAN [4] can obtain good groups of data points in static large-scale and high-dimension databases and were widely applied in many applications in the past decade. However, applying these algorithms to current data-intensive applications is challenging due to rapidly increasing, distributedly stored big data. For example, the traffic data (GPS trajectories and infrared acquisition) from the intelligent transportation system in Jiangsu Province reaches 6.94 billion records, and 4 TB of sensor data were collected from elderly healthcare monitoring in Shanghai in one week; emerging wearable devices will further hasten the coming of the big medical data era. The dataset released by Twitter is larger than 133 TB, the data updated daily by Tencent's mobile applications reaches about 200 to 300 TB, and the operational data of Yahoo even reaches 5 PB. When the amount of data is this large, it is impossible to handle it using serial clustering methods on a single machine. Therefore, the best clustering algorithm is one that (a) builds on a scalable, effective serial algorithm, (b) runs efficiently on a distributed platform, and (c) does not need to preprocess the whole dataset.
For big data analysis, Map/Reduce and its open-source implementation Hadoop have attracted a lot of attention due to their parallel way of handling massive-scale data. Hadoop is a desirable distributed computing platform based on a shared-nothing cluster architecture. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop adopts the HDFS (Hadoop Distributed File System) storage structure and the simple Map/Reduce programming model. HDFS provides high-throughput access to application data and initially partitions data across multiple nodes; data is represented as <key, value> pairs. In the Map and Reduce phases, the Map and Reduce functions take key-value pairs as input and output key-value pairs. The Map function is first called on different partitions of the input data on each slave node in parallel. The outputs of the mappers are next grouped and merged by each distinct key. Then a Reduce function is invoked for each distinct key with the list of all values sharing that key. Finally the Map/Reduce framework executes the main function on a single master machine to postprocess the outputs of the Reduce functions. Depending on the application, a pair of Map and Reduce functions can be executed once or nested multiple times. Ever since it was first introduced, Map/Reduce has enjoyed great success because it allows easy development of scalable parallel applications that process big data on thousands of commodity nodes while tolerating node failures in the process; works in academia have also reflected this (see [5]). However, finding density-based clusters in big data remains a very challenging problem.
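As a concrete illustration of the key-value flow described above, the following toy Python driver mimics the map, group-by-key, and reduce steps on a single machine (the names such as run_mapreduce are ours for illustration; this is not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver mimicking the Map -> shuffle/group -> Reduce flow."""
    # Map phase: each input record yields zero or more <key, value> pairs.
    pairs = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle: group all values sharing the same key.
    pairs.sort(key=itemgetter(0))
    grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=itemgetter(0))}
    # Reduce phase: one call per distinct key with the list of its values.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Classic word count as the illustrative job.
def mapper(line):
    return [(w, 1) for w in line.split()]

def reducer(word, counts):
    return sum(counts)

result = run_mapreduce(["a b a", "b c"], mapper, reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

On Hadoop the grouping step is performed by the framework's shuffle across the network, and mappers run on different nodes; the sequential driver above only captures the data flow.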
This paper focuses on the problem of efficient, effective, and scalable density-based clustering for big data. Our method employs Hadoop as the computing platform and incorporates cell partition and c-cluster optimizations into density-based clustering. Our major research questions are (a) how to minimize the communication (network) cost among processing nodes, (b) how to avoid preprocessing the data, thereby reducing the I/O cost, and (c) how to extract exact density-based clusters. This paper therefore proposes Cludoop, a distributed density-based clustering method using Hadoop, which efficiently handles large-scale data without any preprocessing.
The main contributions of this paper include the following. (1) We propose a c-cluster definition along with cell-connected observations to significantly reduce the computational cost of neighbor range queries. (2) We propose an efficient serial clustering algorithm leveraging c-clusters and a neighbor searching optimization. (3) We propose a distributed clustering framework for big data on the Map/Reduce structure, using the proposed serial clustering algorithm as a plugged-in clustering in Map, along with a 3-step merging framework in Reduce. (4) We conduct comprehensive experiments on a Hadoop platform deployed on 10 commodity machines to evaluate the performance of Cludoop using large-scale real and synthetic data. The results show that our methods are both effective and efficient.
The paper is organized as follows: we first discuss related work in Section 2. In Section 3, we present preliminary notions and the problem statement. Section 4 introduces the theoretical ideas and presents the serial clustering algorithm. In Section 5, we present the distributed Cludoop algorithm. In Section 6, we perform an experimental evaluation of the effectiveness, efficiency, and scalability of our algorithm. Section 7 concludes the paper with a summary.

Related Work
In this section, we mainly review related work in the areas of density-based clustering, parallel variants of DBSCAN, and distributed clustering on Map/Reduce platform.
Density-Based Clustering. Density-based methods describe clusters as high-density areas of points separated from low-density regions in the data space, so clusters may exhibit arbitrary shapes in the high-density regions. There are two types of density-based clustering methods: algorithms based on local connectivity, such as DBSCAN [1] and OPTICS [2], and algorithms based on a density function, such as DENCLUE [3]. DBSCAN determines a nonhierarchical, disjoint partitioning of the data into several clusters. Clusters are expanded starting at arbitrary seed points within dense areas. Objects in areas of low density are assigned to a separate noise partition. DBSCAN is robust against noise, and the user does not need to specify the number of clusters in advance. OPTICS was proposed to solve the parameter selection problem of density-based clustering algorithms. The paper proposes two concepts, core-distance and reachability-distance, to organize points. The points are ordered by reachability-distance with respect to their core points to obtain the clustering structure, which contains hierarchical clusters under a broad range of parameter settings. OPTICS clearly visualizes the cluster structure via the ordered point list and can find arbitrarily shaped clusters and overlapping clusters. DENCLUE formalizes the cluster notion by nonparametric kernel density estimation, modelling the local influence of each data object on the feature space by a simple kernel function, for example, a Gaussian. It defines a cluster as a local maximum of the probability density.
Parallel Variants of Density-Based Clustering. Xu et al. [6] propose a parallel clustering algorithm, PDBSCAN, for mining large distributed spatial databases. It uses the so-called shared-nothing cluster architecture, whose main advantage is that it can be scaled up to a high number of computers. However, it is not fully parallel, as it still needs a single node to aggregate intermediate results. Januzaj et al. [7] describe a distributed clustering based on DBSCAN, DBDC, which forms a two-level local and global clustering. The local clustering is carried out independently on local data; the global clustering is then done on a central server based on the representatives transmitted from the local clusterings. Subsequently, they design a density-based distributed clustering [8] which allows a user-defined tradeoff between clustering quality and the number of objects transmitted from the local sites to the global server site, based on the DBDC idea. They first order all objects located at a local site according to a quality criterion reflecting their suitability to serve as local representatives and then send the best of these representatives to a server site, where they are clustered with a slightly enhanced DBSCAN algorithm to obtain high-quality clusters. Dash et al. [9] propose a parallel hierarchical agglomerative clustering based on partially overlapping partitioning on a shared-memory multiprocessor architecture for handling nested data; their experiments show the algorithm achieves near-linear speedup. Brecheisen et al. [10] present a parallel DBSCAN on a workstation cluster, parallelized by a conservative approximation of complex distance functions based on the concept of filter merge points; the final result is derived from a global cluster connectivity graph. Böhm et al. [11] implement several data mining tasks in the highly parallel environment of GPUs, including similarity search and clustering.
For density-based clustering, they design a parallel DBSCAN algorithm supported by their proposed similarity join on the GPU. They then propose a massively parallel density-based clustering method [12] using GPUs, leveraging the high parallelism combined with high memory-transfer bandwidth at low cost. Andrade et al. [13] also propose a GPU-parallel version of DBSCAN, named G-DBSCAN, using a graph to index point objects that are within a given distance threshold of each other. However, these parallel density-based clusterings cannot be transferred straightforwardly to Hadoop: GPU-capable parallel algorithms not only share main memory, but groups of their processors even share very fast memory units, while Hadoop uses a distributed file system.
Distributed Clustering on Map/Reduce. Ene et al. [14] develop partition-based clustering algorithms, k-center and k-median, running on Map/Reduce, which use sampling to decrease the data size and run in a constant number of Map/Reduce rounds. For density-based clustering, He et al. [15] propose a parallel DBSCAN using the Map/Reduce framework, MR-DBSCAN, which partitions all spatial data among the mappers by spatial location in a preprocessing stage, performs DBSCAN in each mapper, and merges the bordering spaces in the Reduce step. Similarly, Dai and Lin [16] also propose a Map/Reduce-based DBSCAN with a data partition scheme, partition with reduced boundary points (PRBP), which selects partition boundaries based on the distribution of data points. However, data partitioning based on the data space easily causes load unbalancing due to sparse data. Subsequently, He et al. [17] also propose a load balancing mechanism based on cost-based spatial partitioning for heavily skewed data. However, the above methods expend I/O cost on preprocessing, especially for distributedly stored big data; this is also one of the key points we resolve in this paper. Cordeiro et al. [18] present a clustering solution to find subspace clusters in high-dimensional data using Map/Reduce; however, they focus on the tradeoff between the I/O cost and the network cost and aim to dynamically choose the best strategy. These techniques did not explore the optimization opportunities enabled by the natural insights of serial density-based clusters; this is exactly what our work does to deliver a highly scalable distributed solution along with an efficient, optimized serial density-based clustering.

Definitions and Theorems.
This section introduces the definitions of related terms based on the notion of connected density in [1]. Eps and MinPts are the neighbor range parameter and the minimum number of neighbor points, respectively.
Definition 1 (Eps-range). The Eps-range of a point p is the circular area with radius Eps centered on the point p.
Definition 2 (Eps-neighbor). All points included in the Eps-range of the point p are called Eps-neighbors of p, denoted by N_Eps(p).

Therefore, N_Eps(p) is a set of data points. Let |N_Eps(p)| denote the cardinality of N_Eps(p). Obviously, the neighbor relation is symmetric for pairs of points.

Definition 3 (core point). A point p is a core point if |N_Eps(p)| ≥ MinPts.
Definition 4 (border point). For a point p, if |N_Eps(p)| < MinPts and there exists a core point q ∈ N_Eps(p), then p is a border point of the cluster including q.

Definition 5 (isolated point). A point p is classified as an isolated point if it is neither a core point nor a border point.
Note that isolated points can be considered as either anomalous points or noise.
Definition 6 (directly density reachable). A point p is directly density reachable from another point q if p is one of the Eps-neighbors of q and q is a core point.
The Eps-neighbors of a core point p are directly density reachable from p, and border points are directly density reachable from those of their Eps-neighbors that are core points.
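To make Definitions 2-6 concrete, the following Python sketch classifies the points of a small 2D dataset as core, border, or isolated using brute-force Eps-neighbor queries. This is a didactic baseline only (the paper's contribution is precisely avoiding such exhaustive queries), and we follow the common convention that a point is contained in its own Eps-neighborhood:

```python
import math

def classify_points(points, eps, min_pts):
    """Classify each point as 'core', 'border', or 'isolated'
    per Definitions 3-5, using brute-force Eps-neighbor queries."""
    def neighbors(p):
        # N_Eps(p): all points within distance eps of p (p included).
        return [q for q in points if math.dist(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"
        else:
            labels[p] = "isolated"
    return labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 2.4), (5.0, 5.0)]
labels = classify_points(pts, eps=1.5, min_pts=3)
```

The brute-force neighbor query costs O(n) per point; the cell-based lemmas below replace most of these queries with constant-time cell counting.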
Definition 7 (density reachable). A point p is density reachable from a point q if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p, such that each p(i+1) is directly density reachable from p(i).

Definition 8 (density connected). A point p is density connected to a point q if there exists a point o such that both p and q are density reachable from o.
Definition 9 (density-based cluster (d-cluster)). Let D be the set of points. A density-based cluster is a nonempty subset of D including at least one core point and all points that are density reachable from that core point.
Note that all core points would be classified into density-based clusters.

Lemma 10. If a core point p belongs to two d-clusters C1 and C2, then C1 and C2 should be merged into one d-cluster.

Proof. Since the point p ∈ C1 and p is a core point, N_Eps(p) ⊆ C1. According to Definition 9, the d-cluster C1 can be expanded to C1 = {q | q ∈ D and q is density reachable from p}. Similarly, due to p ∈ C2, C2 is also equivalent to {q | q ∈ D and q is density reachable from p}. So C1 and C2 should be merged into one d-cluster. In summary, a core point belongs to only one d-cluster. Figure 1 shows an example demonstrating Lemma 10; p is a core point and belongs to both the blue and green d-clusters, so the two d-clusters would be merged.
Definition 11 (cell). According to Eps, the data space is divided into square cells whose side length is Eps/(2 * √2), and each such partition of the space is called a cell. A cell is denoted by Cell(i1, i2). Let |Cell(i1, i2)| denote the number of points falling in Cell(i1, i2). In this paper, we describe the concept of cell in 2D space for simplicity, though it applies equally in d-dimensional space.
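A minimal sketch of the cell partition in Definition 11 (Python, 2D; the function and variable names are ours for illustration):

```python
import math
from collections import defaultdict

def build_cell_index(points, eps):
    """Map each 2D point into a square cell of side eps/(2*sqrt(2))
    (Definition 11). Returns: cell coordinates -> list of points."""
    side = eps / (2 * math.sqrt(2))
    cells = defaultdict(list)
    for (x, y) in points:
        cells[(math.floor(x / side), math.floor(y / side))].append((x, y))
    return cells

# With this side length, any point in a cell is at most eps away from
# every point in the 3 x 3 block of cells centered on that cell, which
# is the geometric basis of Lemma 12 below.
index = build_cell_index([(0.6, 1.2), (0.4, 1.2)], eps=2 * math.sqrt(2))
```

Here eps = 2√2 gives a cell side of exactly 1, so both sample points fall into cell (0, 1).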

Lemma 12. Given a point p in the cell Cell(i, j), if the total number of points in the cells Cell(i+a, j+b) (a = -1, 0, 1; b = -1, 0, 1) is at least MinPts, then p is a core point, and all points in these cells are directly density reachable from p.

Proof. Lemma 12 can be easily proven using the definition of cell. Since the edge length of a cell is Eps/(2 * √2), the distance between any point in the cells Cell(i+a, j+b) (a = -1, 0, 1; b = -1, 0, 1) and p is not greater than Eps. Therefore |N_Eps(p)| ≥ MinPts; thus p is a core point, and all points in these cells are directly density reachable from p. These 9 cells are called the inner cells with respect to Cell(i, j).

By Lemma 12, we can further deduce Corollary 13.

Corollary 13. Given a cell Cell(i, j), if the total number of points in the cells Cell(i+a, j+b) (a = -1, 0, 1; b = -1, 0, 1) is at least MinPts, then every point p ∈ Cell(i, j) is a core point.

Lemma 14. Given a point p in the cell Cell(i, j), there exist at most 36 cells in which the points need to be checked when verifying whether p is a core point or not.

Proof. Lemma 14 follows from Lemma 12 and the definition of Eps-neighbor (Definition 2). We demonstrate this directly in Figure 2. Suppose the central cell is Cell(i, j). First, for any p ∈ Cell(i, j), the Eps-range of p must be located within the depicted area according to the side length of the cell. Second, all of the 36 cells would be covered by the Eps-range of some point in Cell(i, j); in extreme cases, cells number 17 to number 36 are covered when p is located at one of the four vertexes. These 36 cells are called the outer cells with respect to Cell(i, j).
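The pruning behind Lemma 12, Corollary 13, and Lemma 14 can be sketched as follows (Python; a simplification that, in the fallback, scans the full 7 x 7 neighborhood, which conservatively covers the 9 inner plus 36 outer cells):

```python
import math

def count_inner(cell_counts, i, j):
    """Sum of point counts over the 9 inner cells of Cell(i, j)."""
    return sum(cell_counts.get((i + a, j + b), 0)
               for a in (-1, 0, 1) for b in (-1, 0, 1))

def is_core_fast(p, cell, cell_counts, cells, eps, min_pts):
    """Core-point test with the Lemma 12 shortcut: if the inner cells
    already hold min_pts points, p is core (and by Corollary 13 so is
    every point of the cell) with no distance computed. Otherwise fall
    back to scanning the surrounding candidate cells; since the cell
    side is eps/(2*sqrt(2)), the Eps-range reaches at most 3 cells away,
    so the 7 x 7 block suffices (cf. Lemma 14's 36 outer cells)."""
    i, j = cell
    if count_inner(cell_counts, i, j) >= min_pts:
        return True
    n = 0
    for a in range(-3, 4):
        for b in range(-3, 4):
            for q in cells.get((i + a, j + b), []):
                if math.dist(p, q) <= eps:
                    n += 1
    return n >= min_pts
```

A tighter implementation would skip the four corner cells of the 7 x 7 block, exactly matching Lemma 14's bound of 36 checked cells.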
According to the d-cluster definition (Definition 9) and Lemma 12, we observe that if a core point in the cell Cell(i, j) belongs to a d-cluster C, then all points in the inner cells with respect to Cell(i, j) belong to C. If all points falling in a cell can be determined to belong to a d-cluster C, we call it an inclusive cell of C. Therefore, the d-cluster C can be represented by all its inclusive cells together with the border points not in the inclusive cells. Based on this cell-connected observation, we formalize the cell-cluster (c-cluster) as a d-cluster represented in this way. Moreover, if a core point of one c-cluster also falls in an inclusive cell of another c-cluster, it belongs to the counterpart cluster as well, and the two c-clusters should be merged by Lemma 10.

Problem Statement.
In this work, we focus on an efficient parallel solution for finding density-based clusters from big data on the Hadoop platform. Next, we define the distributed clustering problem, which this paper aims to resolve, based on the above definitions.
Problem (distributed clustering with c-clusters). Given a massive-scale dataset D, a parameter setting (Eps and MinPts), and N network-connected computers on which the Hadoop platform is deployed, the distributed clustering problem is to output the exact c-clusters from D with respect to the given Eps and MinPts on the N computers.

Proposed Serial Clustering with c-Cluster
Our proposed distributed clustering solution is built on an optimized serial clustering method that we propose next: clustering with c-clusters (CluC). CluC aims to find the c-clusters in a large-scale dataset. In the following, we present a basic version of CluC, omitting the details of data structures and bookkeeping information, as shown in Algorithm 1. Here dist(p, C.cells) denotes the smallest distance between p and any point in the cells of the c-cluster C. Namely, for a point p not in inclusive cells, if there exists any point q in the c-cluster C such that dist(p, q) ≤ Eps, then p is a border point with respect to the c-cluster C. CluC first expands the inclusive cells recursively to avoid a large number of distance calculations; it can then quickly search for the border points in the pruned space that excludes the inclusive cells. Therefore, CluC achieves an efficient and also effective improvement over the state-of-the-art serial methods.

Cludoop Framework. The overall architecture of our Cludoop framework is depicted in Figure 3. Cludoop uses the existing data partition, runs the CluC algorithm in parallel in the mappers, and then merges the intermediate c-clusters in the reducers. The final c-clusters, which consist of inclusive-cell descriptions and a small number of border points, are obtained on one reducer/machine.

Preclustering on Mapper. As shown in Figure 3, Cludoop starts with the mappers reading the data in parallel from HDFS and employs CluC as a plugged-in clustering on each mapper. We call this phase preclustering to distinguish it from the whole clustering. In this phase, we first build the cell index for the received dataset, mapping the points into cells. Then each mapper performs the CluC algorithm on the cell-structured data in parallel, outputting the c-clusters and noise as the preclustering result. However, the preclustering results need to be normalized and shipped to the appropriate reducers over the network to obtain the final clusters, so shipping all c-clusters and noise points would incur heavy network traffic, and their normalization would cost much CPU time. How can we reduce the network cost for such a large-scale dataset? Our main idea is to ship only the c-clusters with their inclusive-cell descriptions and the border points to the reducers, rather than sending all points that have already been classified into c-clusters. Thus we use a simple description to represent each inclusive cell instead of its complete contents; the description still provides sufficient information for the reducers in the subsequent merging process. The basic preclustering procedure is sketched in Algorithm 1.

Algorithm 1 (CluC preclustering on a mapper).
Input: dataset D, parameters Eps and MinPts. Output: c-clusters
(1) // all points in D are Unclassified
(2) Get SetOfCells from D;
(3) // SetOfCells is Unclassified
(4) for each point p ∈ D do
(5)   Get Cell c ← p.cell;
(6)   if c.status = Unclassified then
(7)     if p.status = Unclassified then
(8)       Create c-cluster C ← (cid);
(9)       if expandcluster(D, SetOfCells, p, C, Eps, MinPts) then
(10)        Create c-cluster C ← (cid++);
(11)      end if
(12)    end if
(13)  end if
(14) end for
(15) return c-clusters

To efficiently merge the c-clusters located in nearby areas on different mappers, and at the same time to avoid computation unbalancing for skewed data, we combine spatial distribution and uniform division in the shuffle mechanism. We set the shuffle keys according to the number of reducers; similar to spatial partitioning, we first normalize the mean coordinate of the inclusive cells of each c-cluster to the corresponding nearest key value. Thus the c-clusters with the same key value on different mappers are sent to the same reducer in the shuffle step. Then we shift c-clusters from crowded keys to nearby unoccupied ones to balance the workload in the Reduce phase. So Cludoop achieves full load balancing throughout the whole Map/Reduce phases.
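One possible reading of this shuffle-key assignment, sketched in Python (the snapping granularity grid and the hash-based fold into the reducer range are our assumptions; the paper does not give the exact formula):

```python
def shuffle_key(inclusive_cells, grid, num_reducers):
    """Assign a c-cluster to a reducer key by snapping the mean
    coordinate of its inclusive cells onto a coarse grid, so that
    nearby c-clusters from different mappers meet at the same reducer.
    `grid` is a hypothetical snapping granularity (cells per key)."""
    mx = sum(i for i, _ in inclusive_cells) / len(inclusive_cells)
    my = sum(j for _, j in inclusive_cells) / len(inclusive_cells)
    # Snap to the nearest coarse key, then fold into the reducer range.
    key = (round(mx / grid), round(my / grid))
    return hash(key) % num_reducers
```

Two c-clusters whose inclusive cells lie in the same coarse region map to the same reducer, which is the property the merging phase relies on; the subsequent "shift from crowded keys" rebalancing step is not modeled here.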

Merging-Refinement-Merging Framework on Reducer.
In the Reduce phase, each reducer receives the normalized c-clusters and noise, denoted by {SC, SN}, where SC is the set of c-clusters with their cell descriptions and SN is the set of assigned noise points. We propose a Merging-Refinement-Merging 3-step framework, based on the following cell-based merging and refinement principles, to merge the assigned intermediate c-clusters on the reducer.
First, we observe that c-clusters should be merged when they share certain cells or connected cells. To characterize the merging process, we formalize two merging rules.

Merging 1. Given an exclusive cell Cell(i, j) in a c-cluster C, all c-clusters in {SC, SN} that have Cell(i, j) as an inclusive cell or have any border point falling in Cell(i, j) can be merged into C.

Merging 2. Given two connected exclusive cells, the two c-clusters covering them, respectively, can be merged into one c-cluster.

Merging 1 is intuitive. First, since Cell(i, j) is an exclusive cell of C, there must exist at least one core point in Cell(i, j). That core point must also be a core point of any c-cluster covering Cell(i, j) on the overlay of {SC, SN}; Cell(i, j) is then an exclusive cell of those c-clusters as well, so they can be merged by Lemma 18. Second, if a border point q of a c-cluster C' falls in Cell(i, j) on the overlay, then q also belongs to C, and q would be changed to a core point by the definition of exclusive cell. Therefore, C and C' should be merged by Lemma 10. Merging 2 implies that two c-clusters covering two connected exclusive cells, respectively, should be merged into one cluster. This is obvious, because any core points in these two cells are neighbors; thus the density reachable core points would be classified into one cluster. Therefore, they should be merged.
Actually, Merging 1 already covers the cases of Merging 2: given two connected exclusive cells, the core points in one cell are Eps-neighbors of the core points in the other, so the covering c-clusters are merged by Merging 1 as well.

Similarly, Refinements 1 and 2 depict the update cases for the border points not in inclusive cells on the overlay of {SC, SN}. Note that we cannot compute the exact number of neighbors of a point p, because we use the cell description instead of the points in the inclusive cell. In the refinement phase, we therefore employ an approximate method to obtain |N_Eps(p)| when checking whether p is a core point. If all vertexes of a cell lie in the Eps-range of p, that is, their distance to p is no larger than Eps, then all points in the cell are neighbors of p; if there exist vertexes whose distance to p is larger than Eps, we calculate the approximate rate r of the cell's area covered by the Eps-range of p and use r * |Cell| to estimate the number of neighbors of p in the cell.

Based on the above merging and refinement rules, we describe the Merging-Refinement-Merging framework to merge the preclustering results in the Reduce phase. The pseudocode of the framework is shown in Algorithm 2. First, we directly execute the first-round merging using the two proposed merging rules (lines 2-10). This step merges the overlapping c-clusters and noise in {SC, SN} to quickly reduce the number of c-clusters and noise points, updating the point counts of the shared cells at the same time. Next, the status of nonexclusive cells, unchecked border points, and noise is further refined on the overlay of {SC, SN} (lines 12-40). As shown in Algorithm 2, Refinement 1 is applied first to the nonexclusive cells. Then Refinements 2 and 3 are performed in turn for the unchecked border points and the noise. In the process of Refinement 1, most border point and noise candidates falling in the cells are also processed, which greatly reduces the number of points that need to be handled in Refinements 2 and 3; this is why we perform Refinement 1 first. More specifically, we also first utilize Lemma 12 to check whether a cell is an exclusive cell or a point is a core point, avoiding a large amount of distance calculation in most cases. Finally, the second-round merging is performed to reexamine the c-clusters on the updated cells and border points and obtain the final c-clusters over {SC, SN}.
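The vertex-based neighbor estimation used in the refinement step might be sketched as follows (Python; the paper does not give its exact area-ratio formula, so the coverage fraction is estimated here on a small regular sample grid inside the cell):

```python
import math

def approx_neighbors_in_cell(p, cell_origin, side, cell_count, eps, samples=8):
    """Estimate how many of a summarized cell's points lie in the
    Eps-range of p: if every cell vertex is within eps, all cell_count
    points qualify; otherwise scale cell_count by the approximate
    fraction of the cell's area covered by the Eps-range."""
    x0, y0 = cell_origin
    verts = [(x0, y0), (x0 + side, y0), (x0, y0 + side), (x0 + side, y0 + side)]
    if all(math.dist(p, v) <= eps for v in verts):
        return cell_count  # whole cell lies inside the Eps-range
    # Estimate the covered area fraction r on a samples x samples grid,
    # then return r * cell_count as in the refinement rule.
    inside = 0
    for a in range(samples):
        for b in range(samples):
            q = (x0 + (a + 0.5) * side / samples,
                 y0 + (b + 0.5) * side / samples)
            if math.dist(p, q) <= eps:
                inside += 1
    return cell_count * inside / (samples * samples)
```

The estimate implicitly assumes points are roughly uniform within a cell; since cells are only Eps/(2√2) wide, the error affects only marginal border points, which matches the small misclassification observed in the experiments.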
In this phase, the reducers execute the Merging-Refinement-Merging process in parallel. However, we still need one reducer on a single machine to return the final clusters in the final phase, as shown in Figure 3.

Experimental Setup and Methodologies.
A comprehensive performance study has been conducted on 10 network-connected commodity computers with the Hadoop platform deployed, to evaluate the effectiveness and efficiency of the proposed Cludoop algorithm. Each PC/node is equipped with an Intel Core 2 Duo P8600 processor with 8 GB memory and runs the Ubuntu 11.4 operating system. One node is configured as both NameNode and JobTracker; the other nodes are configured as DataNodes and TaskTrackers.
Real Datasets. We use two real datasets in our experiments. The first is a trajectory database from the GeoLife project [19, 20]; the total number of points in this dataset is about 15 million, reaching 1.46 GB in size. The second real dataset, denoted by Taxi, is used in the efficiency evaluation.
Synthetic Datasets. We also generate two synthetic datasets with our data generator developed in Matlab. The first synthetic dataset, denoted by Ssyn, is generated to test the effectiveness of clustering; it comprises multiple clusters of arbitrary shape with 4413 points and 300 noise points in a 1000 × 1000 square.
The second, large-scale dataset, denoted by Lsyn, is generated to evaluate the performance of the algorithm on big data; it contains about 2 billion data points in 2D space, reaching 53.6 GB in size. For generality, we normalize all coordinates into a fixed range. For quality metrics on unlabelled data, the reference clustering is obtained by DBSCAN and compared against the clustering result of our algorithm.
We also evaluate the performance of the Cludoop method by varying the most important parameters. In particular, we measure the scalability of the methods by varying the volume of the dataset and the number of worker nodes. Moreover, we measure the sensitivity of Cludoop's efficiency by varying the input parameter Eps over a large range.

Effectiveness Evaluation.
First, we evaluate the effectiveness of our proposed algorithm compared to DBSCAN using two evaluation metrics (visualization comparison and Precision/Recall) on the Ssyn and GeoLife data. Figure 4 shows the clustering results of the two algorithms on the Ssyn data when Eps is fixed to 20 and MinPts to 4. The black points are classified as noise. We can see that Cludoop finds almost the same clusters as DBSCAN under the same parameter setting. Only a very few marginal border points are misclassified, due to the approximate strategy employed for some border points in the Reduce phase. Figure 5 depicts the results of the two algorithms on the GeoLife data, setting Eps to 100 meters and MinPts to 10. Again, our algorithm shows excellent clustering ability, even better than on the Ssyn data. This is because the GeoLife locations are on or close to road networks, and only the location crowds at hot regions or crossings are classified as clusters; thus the ratio of border points not in inclusive cells is far below that of an arbitrarily distributed dataset.
The Precision and Recall of Cludoop on the Ssyn data are shown in Figure 6. For the Ssyn data, Precision and Recall are computed based on labelled points. From Figure 6(a), the Precision is nearly 100% as long as the parameters fall within a rather large range. Figure 6(b) shows that the Recall of our algorithm is also good over a sizable parameter space. For the real GeoLife data, we compute the Precision and Recall of the Cludoop algorithm based on the clustering result of DBSCAN with respect to the same parameters, excluding the noise. From Figure 7, the Precision and Recall of Cludoop's clusters reach on average 97% and 96%, respectively, compared to DBSCAN in all test cases. In particular, Cludoop shows nearly 100% Precision when Eps ≤ 100 and MinPts varies from 5 to 20. In summary, the effectiveness study confirms that our distributed algorithm can obtain the correct clusters and noise under a loose parameter setting.
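The paper does not spell out its exact Precision/Recall computation; one standard choice, consistent with using the DBSCAN result as the reference and excluding noise, is pairwise Precision/Recall over co-clustered point pairs, sketched here in Python:

```python
def precision_recall(predicted, reference):
    """Pairwise Precision/Recall of a predicted clustering against a
    reference clustering (e.g., DBSCAN output). Labels are cluster ids;
    None marks noise, which is excluded on the reference side. A true
    positive is a pair of points co-clustered in both labelings."""
    pts = [p for p in reference if reference[p] is not None]
    tp = fp = fn = 0
    for i, a in enumerate(pts):
        for b in pts[i + 1:]:
            same_ref = reference[a] == reference[b]
            same_pred = (predicted.get(a) is not None
                         and predicted.get(a) == predicted.get(b))
            if same_pred and same_ref:
                tp += 1
            elif same_pred:
                fp += 1
            elif same_ref:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Pairwise scoring has the advantage of being invariant to cluster-id permutations, so no label matching between the two clusterings is needed; the paper's exact metric may differ.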

Efficiency Evaluation.
Next we evaluate the efficiency of our algorithm compared to the MR-DBSCAN algorithm using the Taxi and Lsyn data. We vary the most important parameters to (1) assess the scalability of Cludoop versus MR-DBSCAN in terms of efficiency and (2) evaluate the sensitivity of our method to parameter variation.

Varying Volume of Dataset.
We first evaluate the scalability of our algorithm in terms of the volume of the dataset using the Lsyn data. In this experiment we randomly extract four subsets of the Lsyn data, from 20% to 80%. We fix Eps to 0.001 and MinPts to 200. Figure 8 shows the total running time of our algorithm and MR-DBSCAN on the five datasets. The Cludoop algorithm exhibits much better scalability than MR-DBSCAN in terms of CPU time; in particular, Cludoop saves on average 42% of the time compared to MR-DBSCAN over all tested cases. As the volume of the dataset increases, both algorithms require more time to process the additional data points. However, our Cludoop algorithm saves more time as the number of data points rises. This is because more cells can be determined directly to be inclusive cells, avoiding more distance calculations, as the volume of D increases at fixed Eps and MinPts.

Varying Number of Nodes.
Next, we evaluate the speedup of our algorithm as the number of nodes increases, varying the number of worker nodes N from 2 to 9 on the Taxi data with Eps fixed to 100 meters and MinPts to 4. As shown in Figure 9, the Cludoop algorithm again clearly outperforms MR-DBSCAN in terms of running time in all tested cases. In particular, the speedup of our algorithm reaches 6 times while that of MR-DBSCAN reaches only 4 times when the number of nodes increases from 2 to 9. However, the marginal speedup of our Cludoop also decreases as N increases, similar to MR-DBSCAN. This may be because more border points and noise points are sent to the reducers in the shuffle phase when more mappers are launched, spending more network cost and merging time on the reducers, although the increased number of parallel mappers reduces the preclustering time.

Varying Neighbor Range Threshold Eps. Finally, we analyze the sensitivity of our algorithm with respect to the important parameter Eps on both the 10% Lsyn and Taxi datasets. We vary Eps from 0.0005 to 0.003 on the 10% Lsyn data with MinPts fixed to 200, and from 100 to 300 meters on the Taxi data at fixed MinPts = 4. From Figures 10 and 11, our proposed Cludoop shows outstanding superiority in time consumption compared to MR-DBSCAN with respect to Eps. This is because (1) our algorithm reduces the cost of maintaining the cell index as Eps increases and (2) more cells can be directly determined to be inclusive cells at a larger Eps value when MinPts is fixed, saving much network cost in the shuffle and merging time in the reducer. In contrast, a larger Eps value leads to more duplicate points for MR-DBSCAN, costing more time in its Shuffle and Reduce phases, and is more likely to cause an imbalanced workload in the reducers because its data partition is based on Eps. However, a very large Eps value is actually meaningless for density-based clustering when MinPts is fixed, since it cannot return meaningful clusters.

Conclusion
Density-based clustering for the increasing number of big data applications is a very important yet difficult task. This paper proposes an efficient and load-balanced distributed density-based clustering algorithm, Cludoop, for big data using Hadoop.