ROAD NETWORK PARTITIONING METHOD BASED ON CANOPY-KMEANS CLUSTERING ALGORITHM

Abstract: As the scope of traffic signal control widens, the road network must be divided rationally according to its structure and the characteristics of its traffic flow in order to improve the stability and flexibility of the traffic control system. Road network partition can be regarded as a process of clustering road segments with similar attributes, so a clustering algorithm can be used to divide the road network into sub-areas. However, when the Kmeans clustering algorithm is used for road network partitioning, it easily falls into a local optimal solution. Therefore, we propose a road network partitioning method based on the Canopy-Kmeans clustering algorithm that uses real-time data on the central longitude and latitude, average speed and average density of each road segment. A vehicle network simulation platform is constructed in Vissim simulation software, and the real-time collected data of central longitude and latitude, average speed and average density of road segments are taken as sample data. The Kmeans and Canopy-Kmeans algorithms are used to partition the platform road network. Finally, a quantitative evaluation method of road network partition based on the macroscopic fundamental diagram is used to evaluate the partition results and thereby determine the optimal road network partition algorithm. Results show that both algorithms divide the road network into four sub-areas, but the sections contained in each sub-area differ slightly, so the optimal algorithm cannot be determined by inspection alone. However, according to the quantitative evaluation indices of the sub-area results (e.g. the sum of squares for error and the R-Square), the Canopy-Kmeans clustering algorithm is superior to the Kmeans clustering algorithm.
Canopy-Kmeans clustering algorithm can effectively partition the road network, thereby laying a foundation for the subsequent road network boundary control.


Introduction
With the rapid development of the social economy, the number of vehicles is increasing quickly, the urban traffic network is becoming increasingly complex and the scope of traffic signal control is widening. Because congestion is unevenly distributed and road types vary, the traffic density differs greatly between sections of the urban road network, which is not conducive to urban traffic management and control (Zhao et al., 2018; Klos and Sobota, 2019). Therefore, the urban road network needs to be divided. Walinchus (1971), an American scholar, first proposed the concept of road network partition. He pointed out that a large and complex road network can be divided into several independent sub-areas according to certain principles and indicators, and that appropriate control optimisation strategies can then be implemented according to the conditions of each sub-area, so that control of the road network is delegated level by level and the whole road network system becomes more flexible, efficient and reliable. Reasonable road network partition can therefore improve the stability and flexibility of the traffic control system.
Initially, the static division method was used to partition the road network. In this method, researchers divide the road network into sub-areas according to historical data of the road network (such as traffic flow, traffic density, road network structure and road network size). The static division method is easy to implement and is feasible for road networks with little change in traffic flow. However, when traffic flow changes randomly and strongly, a great deal of manpower and material resources must be invested to collect the traffic data again. Therefore, some scholars have studied dynamic sub-area division. For example, Yun et al. (2004) added travel time constraints, improved the road network zoning method in the accident management system, solved the improved model with a genetic algorithm and finally evaluated the traffic benefit. Zhou et al. (2013) proposed a new method of calculating the correlation degree to determine whether two adjacent intersections should be placed in the same sub-area. Xu et al. (2017) proposed a dynamic sub-area division method based on different congestion levels according to the homogeneity and relevance of intersections in different states of the road network.
Recent research shows that urban road network traffic flow follows an objective regularity, which is described by the macroscopic fundamental diagram (MFD) (Daganzo, 2007; Geroliminis and Sun, 2011). Godfrey (1969) first proposed the concept of MFD, but it was not until 2007 that Daganzo and Geroliminis revealed its theoretical principle (Daganzo, 2007; Geroliminis and Daganzo, 2008). These two scholars proposed that MFD is an internal objective law of the road network, which reflects the general relationship between the weighted traffic flow ($q_w$) and the weighted traffic density ($k_w$) in the road network. Some scholars use MFD theory to study road network sub-area division. Saeedmanesh et al. (2015) searched for the road section with the lowest density variance among each road section and its adjacent sections and combined them to form a set of road sections with similar density, which was defined as a snake set. Then, similar snake sets were merged and fine-tuned, and finally the heterogeneous road network was divided into several homogeneous areas. Later, they optimised three models and verified them on an actual large-scale road network. The results show that the algorithm is better than Ji's method (Saeedmanesh and Geroliminis, 2016). An et al. (2017) proposed a robust and efficient road network sub-area division algorithm using link connectivity and region growing technology, which was implemented and tested on the regional planning road network in Arizona, USA.
In fact, road network partition can be regarded as a process of clustering segments with similar attributes, and thus a clustering algorithm can be used to partition the road network into sub-areas. Clustering divides the data set into several clusters (classes) with similar attributes and features. The attributes or features of objects in the same cluster are similar to one another but are quite different from those in other clusters. Clustering algorithms have been applied to traffic condition identification, to the partition of road networks, traffic control periods, accident points and flows, and to other traffic fields. However, the application of clustering algorithms to road network partition is still in its infancy. Lopez et al. (2017) proposed clustering and post-processing methods based on normalised cut, density-based spatial clustering of applications with noise (DBSCAN) and growing neural gas. The method was validated on a large urban network in Amsterdam, the Netherlands. Wang (2017) proposed road network partition methods based on Kmeans clustering and an improved FCM algorithm and compared their advantages and disadvantages through analysis of an actual road network. However, the K value of the Kmeans algorithm must be set in advance, is difficult to estimate and is not universal. At the same time, random selection of the clustering centres may lead to different clustering results on each run, and the obtained result may not be the optimal solution.
Therefore, we propose a road network partitioning method based on the Canopy-Kmeans clustering algorithm to make up for the shortcomings of the Kmeans algorithm. We take the longitude and latitude of the section centre and the average speed and density of the section, collected in real time, as sample data and construct a vehicle-connected network simulation model. Then, the Kmeans and Canopy-Kmeans clustering algorithms are used to partition the road network. Finally, the quantitative evaluation method of road network partition based on MFD is used to evaluate the partition results, and the optimal road network partition algorithm is determined.

Brief introduction of Kmeans and Canopy algorithms

2.1. Introduction of Kmeans algorithm
The Kmeans algorithm is a classical distance-based unsupervised clustering method. The algorithm groups samples whose feature values are close to one another into clusters: the smaller the distance between two objects, the higher their similarity. The algorithm randomly selects K data objects from N data objects as initial clustering centres, calculates the distance between each remaining data object and the K clustering centres, assigns each remaining data object to the cluster whose centre is nearest, and then recalculates the mean of all data objects in each cluster and takes these means as the new clustering centres. This process is repeated until the sum of squared deviations is minimised. The advantages of the Kmeans algorithm are as follows: 1) simple idea and rapid running speed; 2) high computing efficiency and scalability for large data sets and 3) low time complexity, which makes it suitable for data mining of large data sets. However, the algorithm has the following shortcomings: 1) the pre-selected K value is difficult to estimate and 2) the initial clustering centres are randomly selected, so different results may appear in each calculation. If improper initial values are selected, the clustering result may not be the optimal one.
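The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions rather than the implementation used in the paper (which was programmed in MATLAB): the function name `kmeans`, the Euclidean distance, the `seed` parameter and the stopping test are all hypothetical choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Kmeans: random initial centres, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Randomly pick K of the N data objects as the initial clustering centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every object to every centre (Euclidean, an assumption).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # New centre = mean of the objects assigned to the cluster.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Final assignment and within-cluster sum of squared deviations.
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    sse = float(((X - centres[labels]) ** 2).sum())
    return labels, centres, sse
```

Running the function with different `seed` values on the same data is a simple way to observe the sensitivity to the initial centres noted above.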

Introduction of canopy algorithm
The Canopy algorithm is a rough clustering algorithm that does not require the K value to be specified beforehand. It executes rapidly, which gives it great practical value. The main idea is as follows: for an arbitrary data set V, randomly select one data object as the initial clustering centre, set two concentric region radii (T1 and T2), calculate the similarity of all data objects in the data set with a rough distance calculation method and divide the data set into several overlapping small data sets (each defined as a Canopy) according to the similarity of each data object. After many iterations, all data objects eventually fall within the coverage of a Canopy (Fig. 1).
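A minimal Python sketch of this idea follows. It assumes the common convention T1 > T2, in which objects within T1 of the centre join the Canopy and objects within T2 are removed from the candidate list; the function name `canopy`, the use of Euclidean distance as the "rough" distance and the threshold values in any call are our assumptions, not details given in the paper.

```python
import numpy as np

def canopy(X, t1, t2, seed=0):
    """Return a list of (centre_index, member_indices) canopies; requires t1 > t2."""
    assert t1 > t2, "the loose radius T1 must exceed the tight radius T2"
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))        # candidate objects not yet tightly covered
    canopies = []
    while remaining:
        # Randomly select one remaining object as the canopy centre.
        c = remaining[rng.integers(len(remaining))]
        d = np.linalg.norm(X[remaining] - X[c], axis=1)
        # Objects within T1 belong to this canopy (canopies may overlap).
        members = [i for i, di in zip(remaining, d) if di <= t1]
        canopies.append((c, members))
        # Objects within T2 are removed and cannot seed another canopy.
        remaining = [i for i, di in zip(remaining, d) if di > t2]
    return canopies
```

The number of canopies returned depends only on the data and the two radii, which is why the algorithm can supply a K value instead of requiring one.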

Road network partition method based on Canopy-Kmeans clustering
The sub-areas of the road network are divided by Canopy-Kmeans clustering based on real-time sample data, namely the longitude and latitude of the section centre and the average speed and density of the section. The algorithm flow is as follows:
Phase 1 (Data Pre-processing): Several Canopies and Canopy centres are determined by the Canopy algorithm based on the 'minimum and maximum principle' (Mao, 2012).
(1) Determine Canopy: The road network dataset is expressed as:
$$V = \{x_i \mid i = 1, 2, \ldots, n\}$$
where $x_i$ is the partition parameter of section $i$ in the road network and:
$$x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4})$$
where $x_{i1}$ is the longitude of the section centre, $x_{i2}$ is the latitude of the section centre, $x_{i3}$ is the average speed of the section and $x_{i4}$ is the average density of the section. If $\forall x_i \in V$ satisfies the following formula:
[...]
where $r_{ik}$ denotes whether $x_i$ belongs to class $k$. If $x_i$ belongs to the range of class $k$, then $r_{ik} = 1$; otherwise, $r_{ik} = 0$. The formula is expressed as follows:
$$r_{ik} = \begin{cases} 1, & x_i \in \text{class } k \\ 0, & \text{otherwise} \end{cases}$$

Quantitative evaluation method of road network partition based on MFD
Nagle et al. (2014) proposed that if the floating cars are evenly distributed in the road network and the proportion of floating cars is known, then the MFD of the road network can be estimated using the traffic trajectory estimation method (Edie, 1963), which is called the floating car data (FCD) estimation method. The formulas are as follows:
$$k_w = \frac{\sum_{j=1}^{m} t'_j}{\rho \, T \sum_{i=1}^{n} l_i}, \qquad q_w = \frac{\sum_{j=1}^{m} d'_j}{\rho \, T \sum_{i=1}^{n} l_i}$$
where $k_w$ and $q_w$ are the weighted traffic density (veh/km) and the weighted traffic flow (veh/h), respectively; $\rho$ is the ratio of floating cars; $T$ is the acquisition cycle (s); $n$ is the total number of road network sections; $l_i$ is the length of road section $i$ (km); $m$ is the number of floating cars during the acquisition cycle $T$ (veh); $t'_j$ is the driving time of the $j$-th floating car during the acquisition cycle $T$ (s) and $d'_j$ is the driving distance of the $j$-th floating car during the acquisition cycle $T$ (m).
According to Daganzo and Geroliminis, the MFD of a road network follows a cubic function of one variable, which can be expressed as follows:
$$q_w(k_w(t)) = a\,k_w(t)^3 + b\,k_w(t)^2 + c\,k_w(t) + d \qquad (11)$$
where $a$, $b$, $c$ and $d$ are the fitting coefficients, and a well-defined MFD exists in a homogeneous road network. Thus, the rationality of a road network partition can be evaluated according to the fitting degree of the MFD scatter points. However, accurately judging the fitting degree of the MFD scatter points by visual inspection alone is difficult. Therefore, a quantitative analysis method of the road network's MFD based on the sum of squares for error (SSE) and the R-Square is used to evaluate the rationality of the aforementioned road network partition clustering method (Wang, 2017), and its flow chart is shown in Fig 3.
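As a worked illustration of the FCD estimation method, the sketch below computes one MFD point from the driving times and distances of the floating cars in a single acquisition cycle. It follows the symbols defined above; the function name `fcd_mfd_point` and the explicit conversions (metres to kilometres, seconds to hours) that bring the flow to veh/h are our assumptions.

```python
def fcd_mfd_point(drive_times_s, drive_dists_m, link_lengths_km, rho, T_s):
    """Estimate (k_w in veh/km, q_w in veh/h) for one acquisition cycle.

    drive_times_s   : driving time of each floating car during the cycle (s)
    drive_dists_m   : driving distance of each floating car during the cycle (m)
    link_lengths_km : length of every road section in the network (km)
    rho             : ratio of floating cars (0 < rho <= 1)
    T_s             : acquisition cycle (s)
    """
    total_length_km = sum(link_lengths_km)
    # Density: total time spent by floating cars, scaled up by 1/rho.
    k_w = sum(drive_times_s) / (rho * T_s * total_length_km)
    # Flow: total distance travelled (km) per hour of observation, scaled up by 1/rho.
    q_w = (sum(drive_dists_m) / 1000.0) / (rho * (T_s / 3600.0) * total_length_km)
    return k_w, q_w

# Hypothetical example: 42 floating cars (rho = 0.42) each drive 1000 m over a
# full 120 s cycle on a 10 km network, i.e. 30 km/h at 10 veh/km.
k_w, q_w = fcd_mfd_point([120.0] * 42, [1000.0] * 42, [2.5] * 4, rho=0.42, T_s=120.0)
```

In this example the estimate satisfies flow = density × speed (300 veh/h = 10 veh/km × 30 km/h), which is a quick consistency check on the unit handling.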
Lin, X., Xu, J., Archives of Transport, 54(2), 95-106, 2020
(1) Drawing the MFD of each partition and determining its fitting function
The MFD of each partition is drawn using the relevant MFD theory, and a curve function is fitted to the MFD scatter plot.
(2) Calculating the SSE and the R-Square
The SSE and the R-Square between the scatter points and the fitting function are calculated to quantitatively evaluate the road network partitions. SSE is the deviation between the fitted and the actual data. The smaller the value of SSE, the smaller the deviation between the fitted and the actual data. The formula is as follows:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual value, $\hat{y}_i$ is the fitted value, $i$ indexes the $i$-th data point and $n$ is the total number of data points. The R-Square is determined by the sum of squares for regression (SSR) and the sum of squares for total (SST). SST is the sum of squares of the difference between each actual value and the mean value, which reflects the overall fluctuation of the scatter points. SSR is the sum of squares of the difference between each fitted value and the mean value. Their formulas are as follows:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad R\text{-square} = \frac{SSR}{SST}$$
where $\bar{y}$ is the mean of the actual values and the value of R-square is between 0 and 1. The closer the value of R-square is to 1, the better the curve fit.
(3) Analysing the fitting degree of MFD in road network partition
SSE and R-square are used to describe the fitting degree of the MFD. When SSE is smaller and R-square is closer to 1, the fitting degree of the MFD is higher: the MFD scatter points are more concentrated, the scatter is lower, the curve is clearer, the critical traffic density and maximum traffic flow of the road network are easier to determine, the traffic state of the road network is easier to distinguish at the macroscopic level and the traffic flow of the whole road network is more uniform. Conversely, if SSE is larger and R-square is smaller, then the fitting degree of the MFD is worse. If the MFD scatter points are dispersed, then the fitted curve is not distinct and determining the maximum flow is difficult. Moreover, the critical traffic density and the state of the road network are difficult to judge.
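The evaluation steps above can be sketched as follows. This is an illustrative Python version under our own assumptions (the paper does not give code): it fits the cubic MFD by ordinary least squares with `numpy.polyfit` and then computes SSE, SST, SSR and R-square as defined above. The function name `evaluate_mfd_fit` is hypothetical.

```python
import numpy as np

def evaluate_mfd_fit(k_w, q_w):
    """Fit q_w = a*k_w^3 + b*k_w^2 + c*k_w + d and return (coeffs, SSE, R-square)."""
    k_w = np.asarray(k_w, dtype=float)
    q_w = np.asarray(q_w, dtype=float)
    coeffs = np.polyfit(k_w, q_w, deg=3)        # [a, b, c, d], least-squares fit
    q_hat = np.polyval(coeffs, k_w)             # fitted values
    sse = float(np.sum((q_w - q_hat) ** 2))     # deviation of fit from actual data
    sst = float(np.sum((q_w - q_w.mean()) ** 2))
    ssr = float(np.sum((q_hat - q_w.mean()) ** 2))
    r_square = ssr / sst
    return coeffs, sse, r_square
```

Applying the function to the MFD scatter of each sub-area produced by two partition algorithms gives one (SSE, R-square) pair per sub-area, so the partitions can be compared index by index instead of visually.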
Fig. 3 Evaluation method of road network partition based on MFD

Empirical Analysis

5.1. Construction of vehicle-connected network simulation platform
To verify the effectiveness of the proposed algorithm, the core road network of Tianhe District in Guangzhou is selected as the research object (Lin et al., 2019), and a vehicle-connected network simulation platform based on Vissim simulation software is constructed (Fig 4). The road network consists of 8 grade-separated intersections, more than 60 at-grade intersections and more than 100 entrances and exits. The simulation results show that when the proportion of connected vehicles (i.e. floating cars) in the traffic volume of the road network (i.e. coverage) reaches 42%, the MFD estimation accuracy of the road network (i.e. the accuracy of weighted traffic density and weighted traffic flow) can reach 97%. Therefore, the coverage rate of connected vehicles is set to 42%. The connected vehicles upload their trajectory data every 15 seconds for a total of 32400 seconds. Then, the connected vehicle data file (*.fzp) of the simulation results is imported into an Excel file. The FCD estimation method is implemented by VBA macro programming. The statistical period of the data is 120 seconds. Finally, the weighted traffic flow $q_w$ and the weighted traffic density $k_w$ of the road network are obtained.

Result and evaluation of road network partition

5.2.1. Result of road network partition
The spectral clustering algorithm is programmed in MATLAB software by taking the simulation time and the weighted traffic density of the road network as sample data. The simulation time of the road network is divided into four stages, namely, low peak time, flat peak time, peak time and the over-saturation state. The results of the simulation time division of the vehicle network simulation platform are given in Table 1. Road network partitioning under the over-saturation state is taken as an example to analyse the results obtained with the Kmeans and Canopy-Kmeans algorithms. The section evaluation data file (*.str) of the simulation results is imported into Excel. The coordinates of the section centre point are calculated from the coordinates of the starting and ending points of the section. The average speed and density of the section under the over-saturation state are calculated according to the time division results in Table 1. Taking the X-coordinate of the central point, the Y-coordinate of the central point and the average speed and density of each section as sample data, the Kmeans and Canopy-Kmeans algorithms are programmed separately in MATLAB software to partition the road network (Figs. 5 and 6).
(3) Simulation data obtained from the vehicle-connected network simulation platform are taken as sample data, but the impact of traffic accidents, road construction, weather and other special circumstances on the traffic data of the road network was not considered. Therefore, in future work, the actual traffic data of the road network will be used to verify and analyse the algorithm.
For example, Ji et al. (2012) first used the Ncut method to initially divide the road network and then used a merge algorithm to roughly divide the initial division. Finally, the road network was divided into several homogeneous road network partitions by using a boundary adjustment algorithm to reduce the density variance of the sections. Haddad et al. (2014) designed a robust perimeter controller based on MFD at the boundary of two sub-regions, solved the flow regulation problem of boundary control of two sub-regions and then proposed a new sub-region division model.
Scholars have used clustering algorithms to investigate road network partition. For example, Li et al. (2009) proposed an automatic sub-area division method of the road network based on a spatial statistical clustering algorithm and used the floating car data of the actual Shanghai road network to realise the automatic sub-area division of the road network. Dai et al. (2010) established a weighted fuzzy clustering analysis method based on the fuzzy C-means clustering algorithm (FCM) and used the analytic hierarchy process to optimise the weights. Finally, the feasibility of the method was verified by taking the West Bank Economic Zone of the Straits as the experimental area. Yin et al. (2010) proposed dynamic road network partition based on spectral graph theory and a spectral clustering algorithm according to real-time traffic flow data and the topological structure attributes of road intersections. Du et al. (2014) proposed a framework for a traffic area partition method based on expressway traffic connection; it uses a weighted average distance clustering analysis method, based on OD traffic volume data between toll stations of the expressway network, to calculate the traffic connection volume. Feng et al. (2015) proposed a sub-area merging model based on a two-dimensional graph theory clustering algorithm for road network merging and used the F test to determine the optimal merging results. Pascale et al. (2015) proposed a homogeneous sub-area detection method based on a spatial clustering method, which was validated on the actual London road network.

Fig. 1. Schematic of canopy algorithm
(4) Repeat Steps 2 to 3 until the cluster centre remains unchanged.
(5) K clusters are obtained by outputting the results.
Fig 2 shows the flow chart of the aforementioned Canopy-Kmeans clustering algorithm.
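The overall Canopy-Kmeans flow (Canopy centres supply K and the initial centres, then the Kmeans iterations run until the centres stop changing) can be sketched as below. This is a simplified illustration under stated assumptions: canopy centres are picked in data order rather than by the 'minimum and maximum principle' of Mao (2012), only the tight radius `t2` is used, distances are Euclidean, and the names `canopy_centres` and `canopy_kmeans` are hypothetical. With real section data (coordinates, speed, density), the features would likely need to be scaled to comparable ranges first.

```python
import numpy as np

def canopy_centres(X, t2):
    """Phase 1 (simplified): choose canopy centres at least t2 apart."""
    remaining = np.arange(len(X))
    centres = []
    while remaining.size:
        c = X[remaining[0]]                  # first uncovered object becomes a centre
        centres.append(c)
        d = np.linalg.norm(X[remaining] - c, axis=1)
        remaining = remaining[d > t2]        # drop objects tightly covered by c
    return np.array(centres)

def canopy_kmeans(X, t2, n_iter=100):
    """Phase 2: Kmeans with K and the initial centres taken from the canopies."""
    centres = canopy_centres(X, t2)
    k = len(centres)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):  # centres unchanged: stop
            break
        centres = new_centres
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1), centres
```

Because the initial centres are derived from the data rather than drawn at random, repeated runs on the same data give the same partition, which addresses the run-to-run variability noted for plain Kmeans.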

Fig. 4 Layout of the simulation experiment area