A Group Mining Method for Big Data on Distributed Vehicle Trajectories in WAN

A distributed parallel clustering method, MCR-ACA, is proposed for mining groups with the same or similar features from big data on vehicle trajectories stored in a Wide Area Network (WAN), integrating the ant colony algorithm with the Map-Combine-Reduce computing framework. The heaviest clustering computation is conducted in parallel at the local nodes, and the local results are merged into small intermediates. The intermediates are sent to the central node, where clusters are generated adaptively. MCR-ACA thus avoids the heavy overhead of transferring large volumes of data, which improves computing efficiency while guaranteeing the correctness of clustering. MCR-ACA is compared with an existing parallel clustering algorithm on practical big data collected by the traffic monitoring system of Jiangsu province, China. Experimental results demonstrate that the proposed method is effective for group mining by clustering.


Introduction
Recently, big data on vehicle trajectories, collected by traffic monitoring systems based on license plate identification and RFID techniques, has become more and more important in practice. For example, the traffic monitoring system of Jiangsu province in China comprises more than 100 subsystems with more than 2 × 10^4 data sensors (collecting devices) distributed over 13 cities, covering 3 million acres. Around 5 × 10^4 million data records have been collected so far, with 70 million added every day. The traveling behavior of human beings usually follows patterns: because of habits or fixed working locations, people often go out at the same time along the same trajectory. The data collected by traffic monitoring systems therefore captures features of vehicle trajectories, which reflect characteristics of human behavior. It is reported that 93% of human behaviors can be foreseen [1] and that four spatiotemporal points are enough to uniquely identify 95% of individuals [2]. Likewise, it is possible to identify a driver with high probability from several trajectory records, and a driver group with the same or similar behavior features can be found by mining big data on vehicle trajectories. Such mining is important and widely applicable; for example, a band of criminal suspects could be found if they use cars for transportation. However, big data on vehicle trajectories is stored distributively in a WAN; it is huge in volume and hard to physically centralize by extraction.
The biggest challenges in big data clustering are designing effective clustering algorithms and distributed parallel computation [3]. For these issues, distributed parallel computing frameworks based on Cloud Computing [4] and MapReduce [5] have been proposed in recent years, such as the batch computing framework [6,7], the stream parallel computing framework [8], the customized parallel computing framework [9], and the mixed parallel computing framework [10]. On top of such frameworks, several distributed parallel clustering algorithms have been proposed. The density-based clustering algorithm DBCURE-MR [11] robustly finds clusters with various densities and is well suited to parallelization with MapReduce. A nonparametric accuracy estimation method and system [12] were proposed for speeding up big data analysis; sampling with replacement was adopted to obtain sampling points according to the sampling distribution, so the amount of data input to MapReduce can be decreased considerably. Taking into account the distributed nature of partitioned data and models, three clustering algorithms [13], K-means, Canopy, and fuzzy K-means, were implemented in parallel on MapReduce, improving the efficiency of distributed clustering significantly. For detecting communities in large networks, a parallel structural clustering algorithm was introduced [14], based on the similarity of edge structures and MapReduce. The interfaces and implementations for user-defined aggregation in several distributed computing systems were evaluated in [15], where the communication overhead of data-intensive applications is largely decreased by local clustering, which clusters the intermediate results generated by Map tasks and then transmits the clustered results to Reduce tasks.
The existing methods improve clustering efficiency either by parallel computing on physically centralized big data or by reducing the data scale through sampling. However, the communication overhead of data centralization and the impact of sparse data on clustering accuracy have not yet been considered. In this paper, by integrating the ant colony algorithm (ACA) with the Map-Combine-Reduce (MCR) computing framework, a MapReduce-based distributed parallel clustering method MCR-ACA is proposed for group mining on big data of vehicle trajectories in WAN. Some parallel ACA methods based on MapReduce have been proposed [16,17] (denoted MR-ACA); however, since these methods work on physically centralized big data in a LAN, they ignore the communication overhead of data centralization. The MCR-ACA method contains three stages: the Map operation, the Combine operation, and the Reduce operation. The computation tasks with the heaviest burden are conducted, and their results combined, in parallel on the data source nodes. The combined results are transmitted to the central node, where new cluster centers are generated adaptively. The presented method avoids the communication overhead of big data migration, improves clustering efficiency, and guarantees the accuracy of the global clusters among distributed nodes.
The rest of this paper is organized as follows. The problem of group mining for vehicle trajectories is described in Section 2. A distributed parallel clustering method MCR-ACA is proposed in Section 3. Section 4 shows the computational experiments, followed by the conclusion and future work in Section 5.

Group Mining for Vehicle Trajectories
Traffic monitoring systems usually adopt distributed frameworks in WAN, and some complex systems even use hierarchical ones. The traffic monitoring system of Jiangsu examined in this paper applies a 3-layer framework. There are 13 independent branch centers in the 13 cities, each responsible for integrating the independent data branches within its city, and a head center is in charge of all the branch centers. The main characteristics of big data of vehicle trajectories in WAN are therefore its multiple data sources and the difficulty of physical centralization.
In this paper, the data branches are called source nodes, city branch centers are city nodes, and the head center is the central node. The topological network is shown in Figure 1.
Traffic monitoring systems containing multiple independent subsystems are being developed in the cities, forming distributed data sources that grow rapidly. There are more than one hundred subsystems in the system of Jiangsu province. The amount of data in the subsystems is huge and growing quickly (the data increment in one city of Jiangsu is over 12 million records every day), and multimedia data in various formats (such as photos and videos) grows by several TB each day, which is very difficult to physically centralize at the central node. The trajectory data of cars collected by a subsystem is listed in Table 1.
Group mining for vehicle trajectories (GMVT for short) is critical for data clustering on the big data of distributed traffic monitoring systems in WAN. The main idea of mining groups on big data of vehicle trajectories is to automatically partition vehicle trajectories with the same or similar features by clustering on attributes, which consist of the metadata (e.g., time and location) of the vehicle trajectories. The information (the license plate numbers) of the vehicle trajectories in the same clusters can then be drawn from the partition result. A complete vehicle trajectory record includes the following metadata: license plate number, passing time, location, direction, speed, and car color. These attributes are set as metadata, and records collected by different sensors in various subsystems are normalized into the 6-tuple (license plate number, passing time, location, direction, speed, car color). The first record in Table 1 is normalized as (S032V0, 20130521073907, checkpoint at the crossroad of Suyuan road and west Qingshuiting road, 1, 41, A). Every element is assigned a weight, and features of vehicle trajectories with the same or similar elements are clustered on the data records.
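The normalization step above can be sketched as follows. The raw field names and dictionary layout are illustrative assumptions, not the actual sensor schema:

```python
# Sketch: normalizing a raw trajectory record into the 6-tuple
# (license plate number, passing time, location, direction, speed, color).
# The input field names are hypothetical.

def normalize_record(raw):
    """Map one raw sensor record to the normalized 6-tuple."""
    return (
        raw["plate"],      # license plate number, e.g. "S032V0"
        raw["time"],       # passing time, e.g. "20130521073907"
        raw["location"],   # checkpoint description
        raw["direction"],  # coded direction, e.g. 1
        raw["speed"],      # speed reading, e.g. 41
        raw["color"],      # coded car color, e.g. "A"
    )

record = {"plate": "S032V0", "time": "20130521073907",
          "location": "crossroad of Suyuan road and west Qingshuiting road",
          "direction": 1, "speed": 41, "color": "A"}
print(normalize_record(record))
```

Records from heterogeneous subsystems would all pass through such a mapping before any weights are applied.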
Suppose there are Z distributed data sources (subsystems) {S_1, S_2, ..., S_Z}. Each data source S_i, with 6 attributes, has m_i tables {T_{i,1}, T_{i,2}, ..., T_{i,m_i}}, and a table T_{i,j} has n_{i,j} records; in total there are Σ_{i=1}^{Z} Σ_{j=1}^{m_i} n_{i,j} records. The objective of group mining is to partition the set U of all these records into subsets U_1, U_2, ..., U_K such that in each subset the records have the same or similar attributes and ∪_{k=1}^{K} U_k = U. In an arbitrary subset U_k, all records are grouped according to the license plate number. The clustering accuracy rate is the ratio of the clustered records of one car to the total records of that car (e.g., if one car has n records in total and m of them are clustered into one class, the corresponding accuracy rate is (m/n) × 100%). If the accuracy rate is greater than a given threshold ε, the car is merged into this class, and all cars in the class have the same or similar trajectory features.
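The per-car accuracy rate defined above can be illustrated with a short sketch. The record ids, class labels, and threshold value below are made-up examples:

```python
# Sketch of the clustering accuracy rate: the share of one car's records
# that fall into its dominant class. If the rate exceeds the threshold
# epsilon, the car is merged into that class.

from collections import Counter

def accuracy_rate(car_records, cluster_of):
    """car_records: record ids of one car; cluster_of: id -> class label.
    Returns m/n, the fraction of records in the car's dominant class."""
    counts = Counter(cluster_of[r] for r in car_records)
    _, m = counts.most_common(1)[0]
    return m / len(car_records)

cluster_of = {1: "C1", 2: "C1", 3: "C1", 4: "C2"}
rate = accuracy_rate([1, 2, 3, 4], cluster_of)
print(rate)            # 3 of 4 records in class C1 -> 0.75
epsilon = 0.7          # assumed threshold
print(rate > epsilon)  # True: the car is merged into class C1
```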

Group Mining Methods for Vehicle Trajectory in WAN
In WAN, GMVT is critical for clustering vehicle trajectory data by attribute features such as time or location to construct classes with the same or similar attributes, according to which drivers with the same or similar features are identified as a group. Because of the two characteristics mentioned in Section 2, traditional parallel clustering methods are no longer efficient, which motivates the following method.

Figure 1: Topological network of traffic monitoring system.

Clustering Framework for Distributed Big Data in WAN.
Data clustering is very difficult for big data stored distributively in WAN, for two reasons: (i) the huge amount of data makes clustering computation time-consuming, which renders existing methods infeasible; (ii) the communication overhead of data migration is generally greater than the computing cost, so it is better to migrate computation rather than data. Therefore, a distributed parallel computing framework (MCR), based on MapReduce, is proposed in this paper. The framework of MCR is depicted in Figure 2.
Based on MCR, the traditional ACA is adapted to MCR-ACA for group mining for big data. The procedure of MCR is described as follows.
(i) Data in each source node is divided into data chunks D_{i,j}. (ii) Map operations are carried out on each data chunk D_{i,j} by a clustering strategy; all records in D_{i,j} are clustered by the given strategy. (iii) Combine operations merge the local clustering results of each chunk into small intermediate results. (iv) The intermediates are transmitted to the central node, where the Reduce operation merges them and generates the global clusters. (v) The method terminates if the global clustering converges or the maximal number of iterations t_max is reached. Otherwise, the comparison parameter is sent back to each data chunk by Reduce, and the next iteration starts from step (ii).
The computing operations with the heaviest burden are conducted in parallel at the source nodes: data in each source node is divided into data chunks, and all chunks are clustered in parallel, which leads to good efficiency.

Figure 2: MCR distributed parallel computing framework.
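The MCR data flow can be illustrated with a toy sketch in which rounding of 1-D points stands in for the real ant-colony clustering; the point is that only small count-and-sum intermediates would cross the network, never the raw chunks:

```python
# Toy sketch of the MCR flow: heavy clustering runs per chunk (Map),
# local results shrink to small intermediates (Combine), and only those
# intermediates reach the central node (Reduce).

def map_chunk(chunk):
    # Map: local clustering on one chunk; each point's coarse class key
    # is just its rounded value (a stand-in for the ACA clustering)
    return [(round(x), x) for x in chunk]

def combine(mapped):
    # Combine: per class, keep only a record count and an attribute sum
    summary = {}
    for key, x in mapped:
        n, s = summary.get(key, (0, 0.0))
        summary[key] = (n + 1, s + x)
    return summary

def reduce_all(summaries):
    # Reduce (central node): merge the intermediates from all chunks
    # and emit the global cluster centers
    totals = {}
    for summary in summaries:
        for key, (n, s) in summary.items():
            tn, ts = totals.get(key, (0, 0.0))
            totals[key] = (tn + n, ts + s)
    return {key: s / n for key, (n, s) in totals.items()}

chunks = [[1.1, 0.9, 2.2], [2.1, 1.0]]                   # two source-node chunks
intermediates = [combine(map_chunk(c)) for c in chunks]  # small payloads only
print(reduce_all(intermediates))                         # centers near 1.0 and 2.15
```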

MCR-ACA Method for Group Mining.
ACA was inspired by the phenomenon that ant individuals gather at a location with food through pheromone interaction among them [18,19]. By integrating ACA with the MCR computing framework, the group mining method MCR-ACA is proposed. The number of classes and the trajectory records in each class are determined adaptively, and clustering centers are generated over the iterations without being predefined, which is desirable for the considered problem.

Map Function of MCR-ACA.
A vector of attributes is built for each vehicle trajectory record; that is, a record to be clustered is denoted as r = (a_1, a_2, ..., a_6). m_{i,j} ants are assigned to data chunk D_{i,j}, each of which initially serves a randomly chosen record. A neighborhood with center r_j and radius R (set from experience) is established, denoted as N(r_j, R). The comprehensive similarity f(r_j) between the center r_j and all records within its neighborhood N(r_j, R) is defined by (1), where s_h is the similarity factor, representing the range of dimension h (the difference between the maximum and minimum of dimension h). Let d(r_i, r_j) be the space distance between two records r_i and r_j, calculated as a weighted distance according to (2), where w_h is a weight based on experience and data-collecting accuracy. The probability p_i(t) of clustering record r_i into class N(r_j, R) is then computed by (3), where α and β are control parameters and τ_ij(t) is the pheromone amount on the path from r_i to r_j at time t (τ_ij(0) = 1). The decision of putting down or moving record r_i is made in terms of the clustering probability p_i(t): (i) If p_i(t) is greater than or equal to the given threshold p_0, the ant puts down r_i and clusters it into class N(r_j, R). The traveled path length of the ant is saved, the location where r_i was put down is set as the start of a new traversal path, and another record is randomly assigned to the ant. (ii) If p_i(t) is less than p_0, the ant carrying r_i keeps moving to the next point with the largest p_i(t). r_i is dropped when the path length reaches its maximum or the ant has not found a proper cluster by the end of its travel (the record is regarded as abandoned), and the ant gets a new record. After all records in D_{i,j} have been traveled by ants, that is, all |D_{i,j}| records are clustered or abandoned, local clustering stops.
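Since equations (1)-(3) are not reproduced here, the following is a hedged stand-in of the common ant-clustering form rather than the paper's exact formulas: a weighted distance, a pheromone-weighted clustering probability with control parameters α and β, and the put-down/keep-carrying decision against the threshold p_0. The weights and parameter values are assumptions:

```python
# Illustrative Map-phase decision rule (not the paper's exact equations).

import math

def weighted_distance(r1, r2, w):
    # weighted distance d(r_i, r_j) between two attribute vectors;
    # a weighted Euclidean form is assumed here
    return math.sqrt(sum(wh * (a - b) ** 2 for wh, a, b in zip(w, r1, r2)))

def cluster_probability(sim, tau, alpha=1.0, beta=1.0):
    # pheromone-weighted score tau^alpha * sim^beta squashed into (0, 1);
    # sim plays the role of the neighborhood similarity f(r_j)
    score = (tau ** alpha) * (sim ** beta)
    return score / (1.0 + score)

def ant_decision(sim, tau, p0):
    # put the record down (cluster it) if p >= p0, else keep carrying it
    return "cluster" if cluster_probability(sim, tau) >= p0 else "carry"

w = (0.5, 0.3, 0.2)                                # assumed attribute weights
print(weighted_distance((1, 2, 3), (1, 2, 4), w))  # ~0.447
print(ant_decision(sim=0.9, tau=1.0, p0=0.441))    # "cluster"
print(ant_decision(sim=0.3, tau=1.0, p0=0.441))    # "carry"
```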
The Map function takes (⟨key, r⟩, ⟨p, L, c⟩) as the input key/value pair, where key is the key value of record r, p is the clustering probability, and L is the path length traveled while the ant carries r; p and L are initialized as 0. c is the coordinate of the node along the path of length L, also initialized as 0. t is the index of the current iteration, and τ_ij(t) is the pheromone value on the path from r_i to r_j after the t-th iteration, with initial value 1. In the output, p is the clustering probability for r, L is the path length when r is clustered or abandoned, and c is the node where r is clustered or abandoned. L_min is the minimal path length after the t-th iteration, which serves as the comparison parameter for the next iteration. The Map function on D_{i,j} is described in Algorithm 1.

Combine Function of MCR-ACA.
After local clustering, the pheromone on each path is updated as τ_ij(t+1) = (1 − ρ)τ_ij(t) + Δτ_ij, where ρ ∈ (0, 1] denotes the evaporation rate of the pheromone and Δτ_ij is the pheromone left by a passing ant: Δτ_ij is set to 1 if an ant passes by; otherwise it is set to 0.
For the records with clustering probability less than the given threshold p_0, the minimal path length L_{i,j} in data chunk D_{i,j} is recorded. All records with clustering probability not less than p_0 are combined according to their clustered nodes c: records with the same c are merged into the same class U_c, whose size is denoted as N_c. For the records r_{c,q} in class U_c, the sum of their attribute vectors is V_c = Σ_q r_{c,q}. Multiple Combine functions can be conducted in parallel for one data source, each working on one or several data chunks. The Combine function for D_{i,j} is described in Algorithm 2.
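The merging step above can be sketched as follows; the data layout (lists of node/vector pairs) is an assumption for illustration:

```python
# Sketch of the Combine step: clustered records sharing the same node c
# are merged into one class, emitting only a count N_c and an attribute
# sum V_c; unclustered records contribute only the minimal path length.

def combine_chunk(clustered, abandoned_lengths):
    """clustered: (node_c, attribute_vector) pairs with probability >= p0;
    abandoned_lengths: path lengths of records left unclustered."""
    classes = {}
    for c, vec in clustered:
        n, sums = classes.get(c, (0, [0.0] * len(vec)))
        classes[c] = (n + 1, [s + v for s, v in zip(sums, vec)])
    # only the minimal path length of the chunk is forwarded
    l_min = min(abandoned_lengths) if abandoned_lengths else None
    return classes, l_min

classes, l_min = combine_chunk(
    [("c1", (1.0, 2.0)), ("c1", (3.0, 2.0)), ("c2", (5.0, 5.0))],
    [4.2, 3.7])
print(classes)  # {'c1': (2, [4.0, 4.0]), 'c2': (1, [5.0, 5.0])}
print(l_min)    # 3.7
```

Whatever the chunk size, the output stays proportional to the number of classes, which is what makes the WAN transfer cheap.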
There are only two possible outputs from the Combine function: either the pairs (c, ⟨N_c, V_c⟩) or the minimal path length L_{i,j}. Therefore, the data sent to the central node in WAN can be greatly reduced. If data chunk D_{i,j} is combined into k_{i,j} classes, the communication overhead is k_{i,j} intermediate results (c, ⟨N_c, V_c⟩) plus one L_{i,j}. In tests on practical data, a data chunk of 1.8 GB needs to transmit only about 30 KB after combination, which is only (1/6) × 10^-4 of the original data volume.

Reduce Function of MCR-ACA.
At the t-th iteration, the two parts obtained from the Combine phase on the data chunks, that is, the intermediate results (c_{i,j}, ⟨N_{c_{i,j}}, V_{c_{i,j}}⟩) and the minimal path lengths L_{i,j}, are recombined by the Reduce function, and new clustering centers are generated. The weighted distances among clustering centers from different data chunks are calculated by (2). If a distance is less than or equal to R, the parts are merged into one class N(c_t, R). The global cluster center O_t at the t-th iteration is computed by O_t = Σ_{N(c_t,R)} V_c / Σ_{N(c_t,R)} N_c. The clustering converges and the global clustering result is output if |O_t − O_{t−1}| ≤ |O_{t−1} − O_{t−2}|. Otherwise, the minimal L_{i,j} is output as the comparison parameter and sent to each source node for the next iteration. The Reduce function is described in Algorithm 3.
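The center-merging rule can be sketched as follows, with 1-D attribute values to keep the example short; the list layout is an assumption:

```python
# Sketch of the Reduce step: intermediates (N_c, V_c) from different
# chunks whose provisional centers lie within radius R of each other are
# merged, and each global center is the weighted mean sum(V_c)/sum(N_c).

def reduce_centers(intermediates, R):
    """intermediates: (N, attr_sum) pairs from the Combine phase."""
    merged = []  # each entry: [N_total, sum_total]
    for n, s in intermediates:
        center = s / n
        for group in merged:
            # merge into an existing group if centers are within R
            if abs(group[1] / group[0] - center) <= R:
                group[0] += n
                group[1] += s
                break
        else:
            merged.append([n, s])
    return [s / n for n, s in merged]

# two chunks found nearly the same center (10.0 and 10.2), one a far one
print(reduce_centers([(3, 30.0), (2, 20.4), (4, 80.0)], R=0.5))
```

The first two intermediates merge into one global center near 10.08, while the third stays a separate class at 20.0.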

MCR-ACA Method Description.
MCR-ACA is constructed by integrating MCR with the Map function conducted on the data chunks in different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximal number of iterations is t_max. The MCR-ACA method is described in Algorithm 4.

Experimental Results on Practical Big Data
In the experiment, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. Two cities, Nantong and Changzhou, are selected, and two subsystems are chosen from each of them; Nanjing is the central city. Subsystems are linked by 1000 Mbps fiber within each city. The distance between Nantong and the central city is 270 kilometers, and there are 140 kilometers between Changzhou and the center Nanjing; the cities are connected over the Internet with a bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6-core), 4 GB memory, and a 300 GB disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities, and the Reduce function is conducted in the central node. In the MR-ACA experiment, all data is transmitted to the central node and processed by the four PCs there. The results are shown in Tables 2 and 3.
It can be seen from Table 2 that as R increases, the total clustering time increases while the clustering accuracy remains robust. Table 3 illustrates that as p_0 increases, the total clustering time tends to decrease and then increase, while the clustering accuracy fluctuates. To further compare the effect of different parameter pairs of R and p_0 on clustering time and accuracy, the comparison is based on the accuracy per unit time (i.e., accuracy/total time). According to Tables 2 and 3, R = 0.006 and p_0 = 0.441 form the ideal parameter pair. Therefore, the experimental parameters are set as follows: 56 Map functions, 4 Combine functions, 4 Reduce functions, a maximum of 5 iterations, 3 ants for each Map function, a neighborhood radius R of 0.006, p_0 = 0.441, and the similarity factor set by the range of each dimension. In the experiments, the records on vehicle trajectories are divided evenly into 56 chunks, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4, where the unit for the Map, Combine, Reduce, and total times is hours and the unit for accuracy is percentage. Table 4 implies that the accuracy of the two methods is similar and rises as the data amount becomes larger; the data amount is key to the clustering accuracy, and reducing the data scale by sampling also reduces the accuracy. The growth rate of clustering accuracy for MCR-ACA is greater than that of MR-ACA, as depicted in Figure 3.
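The accuracy-per-unit-time rule used to pick R and p_0 can be sketched as follows. The candidate measurements below are illustrative placeholders, not the values of Tables 2 and 3:

```python
# Sketch of the parameter-selection rule: among candidate (R, p0) pairs,
# pick the one maximizing accuracy per unit time (accuracy / total time).
# The trial numbers are made up for illustration.

def best_parameters(trials):
    """trials: ((R, p0), accuracy_percent, total_time_hours) triples."""
    return max(trials, key=lambda t: t[1] / t[2])[0]

trials = [((0.004, 0.441), 90.1, 2.4),
          ((0.006, 0.441), 91.0, 2.2),
          ((0.008, 0.441), 91.2, 2.9)]
print(best_parameters(trials))  # (0.006, 0.441)
```

Note that the pair with the highest raw accuracy is not chosen; the slightly less accurate but much faster pair wins on accuracy per hour.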
Furthermore, the computing time of the Map function of MR-ACA is longer than that of MCR-ACA, and the difference grows as the data amount increases. The reason is that all records are mixed up on disk after extraction, which makes the Map function more complicated at the central node than at the data source nodes. The Map function of MR-ACA works on data chunks divided from the mixed data stored at the central node, and these chunks become more heterogeneous in their elements as the data amount grows; therefore, the Map function of MR-ACA costs more computing time. The comparison is shown in Figure 4.
For the Reduce function, the time consumed in MR-ACA is about twice that in MCR-ACA, but less than the sum of the computing times of the Reduce and Combine functions in MCR-ACA; in fact, the processing done by the Combine function in MCR-ACA is included in the Reduce function of MR-ACA. As shown in Table 4, the total time of MR-ACA is on average 50% longer than that of MCR-ACA due to the data extraction time, which indicates that data extraction is the most influential factor for big data clustering. Table 4 also demonstrates that, as the data amount increases, the computation time of both methods increases rapidly while the accuracy improvement is quite limited. The reason is that, with the numbers of Map, Combine, and Reduce functions kept the same, the amount of data in the blocks handled by each function grows proportionally, which causes the computation time to grow rapidly. The clustering accuracy is a relative value, mainly determined by the data amount (the ratio of the clustered records of one car to the total records of that car); as the data amount grows, the clustering accuracy increases gradually. However, there is no direct relationship between the clustering accuracy and the computation time, which explains why the growth of computation time and the growth of accuracy are inconsistent. According to the experiment, the number of obtained classes is listed in Table 5.
The results show that 4 cars in a research group exhibit obviously similar trajectories, as listed in Table 6. Table 6 indicates that the 4 cars were caught by the same camera within half an hour (the plate number "S032V0" was misidentified as "S032W0" by the camera capture). Through clustering, the trajectories of these 4 cars are found to share plenty of traces with the same or similar features, which form a cluster: the time feature is every Friday evening, and the location feature overlaps along the way to the university, as shown in Figure 5.

Conclusion and Future Work
The critical issues for group mining on big data of vehicle trajectories are the difficulty of data centralization and the distribution of data sources. In this paper, a distributed parallel clustering method MCR-ACA is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of big data migration is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.
Group mining can be used in many scenarios. Based on the experimental results in this paper, two directions are promising for further work: (i) forecasting group behavior based on specific features; for example, if the time feature of a group is around midnight and the location feature is somewhere with a high crime incidence, the group can be regarded as a possible criminal group with high probability; (ii) outlier analysis for vehicle trajectories. Some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper), and the reason these trajectories are abandoned as outliers is useful for behavior forecasting.