Street-Level IP Geolocation Algorithm Based on Landmarks Clustering

Abstract: Existing IP geolocation algorithms based on delay similarity often rely on the principle that geographically adjacent IPs have similar delays. However, this principle often does not hold in the real Internet environment, which leads to unreliable geolocation results. To improve the accuracy and reliability of IP geolocation in the real Internet, a street-level IP geolocation algorithm based on landmarks clustering is proposed. First, we use probes to measure the known landmarks to obtain their delay vectors, and cluster the landmarks by these vectors. Second, the landmarks are clustered again by their latitude and longitude, and the intersection of the two clustering results is taken to form training sets. Third, we train multiple neural networks to learn the mapping between delay and location in each training set. Finally, we select one of the neural networks for the target according to delay similarity and relative hop counts, and then geolocate the target with this network. Because it combines delay clustering and geographical coordinate clustering, the proposed algorithm largely reduces the inconsistency between them and strengthens the mapping between them. We evaluate the algorithm through a series of experiments in Hong Kong, Shanghai, Zhengzhou and New York. The experimental results show that the proposed algorithm achieves street-level IP geolocation and, compared with existing typical street-level geolocation algorithms, improves geolocation reliability significantly.


Introduction
IP geolocation technology aims to obtain the geographic location of a given IP address [1]. It has been widely used in advertisement delivery, user positioning, attack source tracking and so on [2][3][4]. High-precision and reliable IP geolocation technology is getting more and more attention.
City-level IP geolocation methods include GeoPing [6], CBG (Constraint-Based Geolocation) [12], Octant [13], GeoWeight [14], LBG (Learning-based Geolocation) [15], Point of Presence (PoP) Analysis based Geolocation [16], GBLC (Landmark Clustering based Geolocation) [17], PoP Partition based Geolocation [18] and Geo-PoP [19]. These methods mainly use attributes such as delay, hop count and network structure to constrain the geographical location of the target IP to a certain area, or use a landmark of known geographical location as the estimated location. Among them, GeoPing takes the location of the landmark whose delay vector most closely resembles that of the target as the location of the target. CBG calculates the "delay-distance" conversion coefficient of each probe and estimates the location of the target IP through multiple probes. Octant and GeoWeight improve on CBG: on the basis of the relationship between delay and distance, they constrain the location of targets by using intermediate routers and statistical ideas, respectively. GBLC clusters the landmarks to filter out high-reliability landmarks and thereby improves the precision of city-level IP geolocation. PoP Analysis based Geolocation, PoP Partition based Geolocation and Geo-PoP extract the PoP network topology inside a city from tightly connected network nodes, and geolocate the target to the city to which the target-connected PoP belongs.
Street-level IP geolocation methods include SLG (Street-Level Geolocation) [20], IRLD (Identification Routers and Local Delay Distribution Similarity based Geolocation) [21], NC-Geo (Nearest Common Router based Geolocation) [22] and TNN (IP Geolocation Algorithm based on Two-tiered Neural Networks) [23]. These methods mainly adopt the idea of layer-by-layer approximation: they first geolocate the target IP to a larger range and then estimate its location within a smaller range. Among them, the SLG algorithm uses the landmark having the minimum relative delay with the target IP as the estimated location of the target IP. On the basis of the SLG algorithm, the IRLD algorithm considers the problems of delay expansion and anonymous routing, and replaces the minimum relative delay of SLG with the similarity of local delay distribution to geolocate the target IP, which handles anonymous routers better during geolocation. The NC-Geo algorithm estimates the location of the target IP by finding the landmarks that share the nearest common router with the target IP and using the minimum relative delay between the landmarks and the router, but it requires at least three landmarks to be connected to the common router. In essence, the IRLD and NC-Geo algorithms provide more precise geolocation under specific conditions of the SLG algorithm. The TNN algorithm uses a neural network to learn the mapping between delay and latitude and longitude, so as to realize IP geolocation. SLG, IRLD and NC-Geo take the location of the nearest landmark or router as the estimated location of the target, so when the nearest landmark or router is far from the target, the geolocation error is large. The main principle of the TNN algorithm is that IPs with similar geographical locations have similar delays, but the converse, that IPs with similar delays are geographically close, does not actually hold. Therefore, the use of delay similarity in the TNN algorithm leads to unreliable geolocation.
Aiming at the above problems, this paper constructs a geolocation algorithm that uses the delay and relative hop counts under ideal network conditions. The algorithm obtains the delays and paths from the probes to the landmarks, clusters the landmarks by delay, filters the clustering results with the latitude and longitude clusters of the landmarks to obtain the training sets, and trains a neural network on each training set. The delay similarity and the relative hop counts between the target and the training sets are used to judge which training set the target belongs to. When the relative hop count between the target and the training set satisfies the threshold condition, the neural network trained on that training set is used to locate the target. The algorithm uses both delay vector clustering and latitude and longitude clustering of the landmarks, which mitigates the problem of unreliable geolocation in the TNN algorithm. The proposed algorithm also avoids the limitations of the SLG, IRLD and NC-Geo algorithms by using neural networks to learn the mapping between delay and geographic location.
The rest of this paper is organized as follows. Section 2 reveals the correlation between IP delay similarity and geographical location distribution. Section 3 introduces the main steps of the algorithm and divides it into three stages as training sets filtering, neural network training and target geolocation to explain in detail. The performance of the algorithm is evaluated through the experiments in Section 4. Finally, Section 5 summarizes the work of this paper.

Relationship between Delay Similarity and Geographical Distribution
Over 21 days, we conducted more than 5,000 traceroute measurements on 105,461 street-level landmarks in China and the US, using nine probes in Beijing, Chengdu, Shanghai, Wuhan, Washington, Silicon Valley, New York, Atlanta and Seattle, and obtained a large amount of delay information.
In order to verify the relationship between delay similarity and geographical distribution, we use the K-means algorithm to cluster the delay vectors of the landmarks in China and the US separately in this section. We select the number of clusters K that maximizes the contour coefficient (commonly known as the silhouette coefficient). When the numbers of clusters in China and the United States are 326 and 168, respectively, the contour coefficients reach their maxima, which are 0.72 and 0.83, respectively. The delay clustering results and the geographical distribution statistics of the landmarks are shown in Tab. 1. In Tab. 1, there are 395 clusters whose landmarks are geographically spread over more than 300 km. Although the delays in these clusters are similar, the actual distance between the corresponding landmarks is greater than 300 km, which means that landmarks with similar delays are not necessarily geographically close.
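The selection of K described above can be illustrated with a minimal sketch. It computes the mean silhouette (contour) coefficient of a given labelling in pure Python and picks the candidate K with the highest value. The function names `silhouette_coefficient` and `best_k`, the Euclidean metric, and the toy delay vectors are assumptions made for illustration; the paper does not specify its implementation, and the K-means step that produces the labellings is assumed to be done elsewhere.

```python
import math


def silhouette_coefficient(vectors, labels):
    """Mean silhouette value s(i) = (b - a) / max(a, b) over all samples.

    a is the mean distance to the other members of the sample's own cluster,
    b is the smallest mean distance to the members of any other cluster.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(math.dist(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def best_k(vectors, labelings):
    """Return the K whose labelling maximises the silhouette coefficient.

    labelings maps each candidate K to the cluster labels produced for that K
    (hypothetical input; produced by any K-means implementation).
    """
    return max(labelings, key=lambda k: silhouette_coefficient(vectors, labelings[k]))


# Toy delay vectors (ms) from two probes to four landmarks.
delays = [(10, 20), (11, 21), (50, 60), (51, 61)]
print(best_k(delays, {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}))
```

A value close to 1 means that each landmark is much closer, in delay space, to its own cluster than to the nearest other cluster, which is why the maximum is used as the selection criterion.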
Among the above 494 clusters, the geographical locations of the landmarks in cluster W (a total of 315 landmarks), cluster X (124 landmarks), cluster Y (324 landmarks) and cluster Z (187 landmarks) are shown in Fig. 1. The landmarks in cluster W are located in Dallas, Houston, etc. in the United States. The landmarks in cluster X are located in Los Angeles, San Francisco, etc. in the United States. The landmarks in cluster Y are located in Shanghai, Hangzhou, etc. in China. The landmarks in cluster Z are located in Chengdu, Yibin, etc. in China. The average contour coefficients of cluster W, cluster X, cluster Y and cluster Z are 0.73, 0.76, 0.81 and 0.83, respectively. That is, the landmarks within each cluster have similar delays yet are spread over different cities, so it is unreliable to use the similarity of delays alone as the basis for geolocation. Tab. 2 shows that the geolocation accuracy rates of the TNN algorithm within geolocation errors of 10, 20 and 40 km are not very high. This may be because, when the TNN algorithm trains the neural network with all landmarks, landmarks with similar delays are often not geographically adjacent, so the mapping between delay and location learned by the neural network is not very strong.

Basic Principles and Main Steps of Proposed Algorithm
The basic idea of the algorithm is as follows: based on the rule that hosts in the same local area and under the same network conditions often have similar delays, the delays and relative hop counts are obtained for geolocation. The schematic framework of the algorithm is shown in Fig. 3.
Table note: "training set size > N" denotes the landmark set composed of all training sets that contain more than N landmarks; "PGE < X" is short for "proportion of the targets whose geolocation error is within X".
As shown in Fig. 3, the algorithm is divided into three parts: training sets filtering, neural networks training, and target geolocation. The specific steps of the algorithm are as follows:
Training sets filtering. First, deploy n probes $P_1, P_2, \ldots, P_n$, acquire the delays from the probes to the landmark set, and construct the absolute delay vectors $\mathrm{Vec}_j = (d_{1,j}, d_{2,j}, \ldots, d_{n,j})$ (1), where $\mathrm{Vec}_j$ represents the delay vector of the j-th landmark and $d_{i,j}$ represents the delay from probe i to landmark j. Then, cluster the landmarks by the delay vectors of Eq. (1). Next, cluster all the landmarks by their latitude and longitude. Finally, compute the intersection of the two clustering results and use each intersection as a training set, so that the delays, latitudes and longitudes of the landmarks in each training set are similar.
Neural networks training. Take the delays of the landmarks in the training set $C_i$ as input and their latitudes and longitudes as output, and train a neural network for each training set.
Target geolocation. First, acquire the delays from the n probes to the target and express them as the delay vector $(d_1, d_2, \ldots, d_n)$ (2), where $d_i$ represents the delay from probe i to the target. Then, use Eq. (2) to determine the delay cluster to which the target belongs, find the training set in that delay cluster that has the smallest relative hop count with the target, and record this relative hop count as V. Set the threshold U; if U ≥ V, input Eq. (2) into the neural network trained on that training set $C_j$ to obtain the latitude and longitude of the target; otherwise, end the algorithm.
Among them, training sets filtering, neural network training and target geolocation are the important parts of the algorithm, which will be described in detail in the following subsections.

Training Sets Filtering
Because landmarks with similar delays are not necessarily geographically close, training the neural network with all landmarks as one training set makes the geolocation result unreliable. Therefore, the training sets need to be filtered so that the delay, latitude and longitude of the landmarks in each training set are similar. The specific steps are as follows:
Input: Delay vectors of landmarks, longitude and latitude of landmarks
Output: Filtered training sets
Step 1 Perform K-means clustering on the landmarks using the delay vectors of Eq. (1), iterating the number of clusters in ascending order, select the value k that maximizes the contour coefficient, and record the clustering set as $D = \{D_1, D_2, \ldots, D_k\}$.
Step 2 Cluster all the landmarks by their latitude and longitude, again select the number of clusters that maximizes the contour coefficient, record it as h, and record the clustering set as $L = \{L_1, L_2, \ldots, L_h\}$.
Step 3 Calculate the non-empty pairwise intersections $L_a \cap D_b$ of the clusters in L and D, and record the final set of clusters as $F = \{C_1, C_2, \ldots, C_q\}$.
At this time, the delay, latitude and longitude of the landmarks in each training set are similar. The neural network is trained by using the landmarks in each training set, and the mapping between delay and latitude and longitude will be more reliable.
As a result, this training set can ensure that the samples with delay similarity are geographically close. Close geographical locations and delay similarity can indicate that the samples are similar in network local characteristics. For example, the samples have a common router, and the hop counts from them to the common router are not large, showing that these samples have very similar paths and thus share similarities in network characteristics such as network congestion. It is therefore reasonable to use the samples in such sets as training sets to train the neural networks.
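The intersection in Step 3 can be sketched as follows, assuming the two clusterings are available as per-landmark label lists. Landmarks end up in the same training set only if they share both a delay cluster and a latitude and longitude cluster. The function name `intersect_clusterings` and the `min_size` parameter are hypothetical and used only for illustration.

```python
from collections import defaultdict


def intersect_clusterings(delay_labels, geo_labels, min_size=1):
    """Group landmark indices by their (delay cluster, geo cluster) pair.

    delay_labels[i] and geo_labels[i] are the cluster labels of landmark i in
    the delay clustering D and in the latitude/longitude clustering L.
    Each non-empty pairwise intersection becomes one candidate training set.
    min_size is a hypothetical filter that drops very small sets.
    """
    if len(delay_labels) != len(geo_labels):
        raise ValueError("both label lists must describe the same landmarks")
    groups = defaultdict(list)
    for idx, key in enumerate(zip(delay_labels, geo_labels)):
        groups[key].append(idx)
    return {key: members for key, members in groups.items() if len(members) >= min_size}


# Five landmarks: landmark 2 shares a delay cluster with 0 and 1,
# but is geographically clustered with 3 and 4.
training_sets = intersect_clusterings([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
print(training_sets)
```

In this toy input, landmark 2 is separated from landmarks 0 and 1 even though their delays are similar, which is the filtering effect the section describes.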

Neural Network Training
Use the delay vectors (Eq. (1)) of the landmarks in $C_i$ as the input of the neural network, and use the latitude and longitude vectors of the landmarks in $C_i$ as the output, to train the neural network. This paper uses the multilayer perceptron neural network, whose structure is shown in Fig. 4 [24].
In the calculation formula of the hidden layer neurons, $H^{i-1}_k$ is the output value of hidden layer neuron k in layer i − 1, $w^i_{k,j}$ is the connection weight from hidden layer neuron k in layer i − 1 to neuron j in the current layer, and $\theta^i_j$ is the threshold of hidden layer neuron j in the i-th layer. $I_k$ is the input value of input layer neuron k. The hidden layer neuron activation function f(x) is set to the sigmoid function $f(x) = 1/(1 + e^{-x})$. In the calculation formula of the output layer neurons, $H^n_j$ is the output value of hidden layer neuron j in the n-th layer, $w_{j,x}$ and $w_{j,y}$ are the connection weights from hidden layer neuron j in the n-th layer to the output layer neurons x and y, and $\theta_x$ and $\theta_y$ are the thresholds of the output layer neurons x and y.
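A forward pass through such a multilayer perceptron can be sketched as below. The sketch rests on two assumptions that the text does not settle: each threshold is subtracted from the weighted sum before the activation, and the two output neurons are linear (no activation). The function names and the weight layout `weights[k][j]` are also illustrative choices, and training is not shown.

```python
import math


def sigmoid(x):
    """Numerically stable sigmoid f(x) = 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(delay_vector, hidden_layers, output_layer):
    """Map a delay vector to an estimated pair of output coordinates.

    hidden_layers is a list of (weights, thresholds) pairs, where weights[k][j]
    connects neuron k of the previous layer to neuron j of the current layer.
    Assumption: thresholds are subtracted, and the output layer is linear.
    """
    activations = list(delay_vector)
    for weights, thresholds in hidden_layers:
        activations = [
            sigmoid(
                sum(activations[k] * weights[k][j] for k in range(len(activations)))
                - thresholds[j]
            )
            for j in range(len(thresholds))
        ]
    out_weights, out_thresholds = output_layer
    return tuple(
        sum(activations[j] * out_weights[j][c] for j in range(len(activations)))
        - out_thresholds[c]
        for c in range(2)
    )


# Zero hidden weights give hidden outputs of exactly 0.5, so the result is easy to check.
hidden = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])]
output = ([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
print(forward((12.5, 30.1), hidden, output))
```

In practice the delays would typically be normalized before being fed to the network, since raw millisecond values saturate the sigmoid; the source does not state whether such a step is used.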

Target Geolocation
After training the neural network for each training set, in target geolocation, it is first necessary to judge the training set to which the target belongs. Then, the target delay vector is input into the neural network trained by the training set to obtain the latitude and longitude of the target. Specific steps are as follows:

Output: Target longitude and latitude
Step 1 Use the probes deployed in the previous stage to measure the delay of the target, and use the Ally method [25] and the Mercator method [26] to merge the router aliases. Construct the target delay vector of Eq. (2). Record the hop count between each router and the target in each measurement path.
Step 2 Calculate the Euclidean distance between the center of each $D_i$ and Eq. (2), and select the $D_i$ whose center has the smallest Euclidean distance to Eq. (2) as the delay cluster to which the target T belongs.
Step 3 Extract the routers on the measured paths from the probes to the landmarks of each set $C_j$ in $D_i$, denoted as $R_{C_j} = \{r_1, r_2, \ldots, r_s\}$, where $r_m$ is the m-th router in $R_{C_j}$ and s is the number of routers on the paths to $C_j$. The minimum hop count between $r_m$ and the landmarks in $C_j$ is recorded as $h_{r_m,C_j}$.
Step 4 Take the intersection of the routers on the probe-to-target paths and $R_{C_j}$ to obtain the common router set $M_j = \{r_1, r_2, \ldots, r_p\}$ (7), where p is the number of routers common to $R_{C_j}$ and the paths from the probes to the target. The relative hop count of T and $C_j$ is recorded as $L_j = \min_{r_m \in M_j} (h_{r_m,C_j} + h_{r_m,T})$ (8), where $h_{r_m,T}$ is the minimum hop count between $r_m$ and the target. The $C_j$ with the smallest $L_j$ is used as the training set to locate the target, and the smallest $L_j$ is recorded as V.
Step 5 Set the threshold U. If U ≥ V, use the neural network trained on the training set $C_j$ to geolocate the target; otherwise, end the algorithm.
The feasibility behind this strategy is as follows. The algorithm uses delay to determine the cluster to which the target belongs (clusters are obtained by time-delay clustering), but this cluster may produce multiple subclusters after the intersection with the cluster of latitude and longitude clustering. Therefore, it is necessary to calculate the relative hop count between the target and the landmarks in multiple clusters, and take the cluster with the smallest relative hop count as the target geolocation training set. It is worth noting that the "relative hop count between the target and the cluster" refers to the minimum relative hop count between the target and a landmark in the cluster, and to some extent, it represents the similarity between the target and the landmark. Specifically, the smaller the relative hop count, the more similar the target is to a certain sample in the cluster, and the mapping relationship between the target's delay and latitude and longitude is more consistent with that of the trained neural network for this cluster. On the other hand, the greater the relative hop count, the greater the difference between the target and the sample path in this cluster. At this time, network characteristics such as path and congestion will affect the mapping relationship between delay and latitude and longitude, making the mapping relationship of the target different from that of trained neural network for this cluster. Consequently, in the algorithm, it is reasonable to measure the reliability of geolocation by setting a corresponding threshold. When the relative hop count is greater than the threshold, the algorithm deems that the target cannot be geolocated, thus ensuring the reliability of the algorithm under different geolocation requirements.
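The selection logic of Steps 2 to 5 can be sketched as follows. The sketch assumes that the relative hop count between the target and a training set is the minimum, over their common routers, of the sum of the hop counts from the router to the target and from the router to the nearest landmark of the set. This reading follows the description above but is an assumption, as are the function names, the dictionary-based inputs and the router identifiers.

```python
import math


def nearest_delay_cluster(target_delays, centers):
    """Step 2: index of the delay-cluster center closest (Euclidean) to the target."""
    return min(range(len(centers)), key=lambda i: math.dist(target_delays, centers[i]))


def relative_hop_count(hops_to_target, hops_to_set):
    """Assumed form of the relative hop count: min over common routers of the hop sum.

    hops_to_target maps router -> minimum hops between the router and the target.
    hops_to_set maps router -> minimum hops between the router and the landmarks
    of one training set. Returns None when there is no common router.
    """
    common = hops_to_target.keys() & hops_to_set.keys()
    if not common:
        return None
    return min(hops_to_target[r] + hops_to_set[r] for r in common)


def select_training_set(hops_to_target, candidate_sets, threshold):
    """Steps 4-5: pick the set with the smallest relative hop count V, if V <= U.

    candidate_sets maps a training-set id to its hops_to_set dictionary.
    Returns (set_id, V), or None when the target cannot be geolocated.
    """
    best = None
    for set_id, hops_to_set in candidate_sets.items():
        hops = relative_hop_count(hops_to_target, hops_to_set)
        if hops is not None and (best is None or hops < best[1]):
            best = (set_id, hops)
    if best is None or best[1] > threshold:
        return None
    return best


# Hypothetical routers r1..r9 and two candidate training sets in the chosen delay cluster.
target_hops = {"r1": 2, "r2": 1, "r9": 5}
candidates = {"C1": {"r1": 1, "r3": 1}, "C2": {"r2": 3}}
print(select_training_set(target_hops, candidates, threshold=3))
```

Returning `None` rather than a fallback estimate mirrors the text: when the relative hop count exceeds the threshold, the target is treated as not geolocatable instead of being assigned an unreliable location.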

Experimental Results and Analysis
This section mainly verifies the rationality and effectiveness of the proposed algorithm. It includes two experiments: verification of the geolocation effect of the algorithm, and comparative verification. The experimental setups are shown in Tab. 3. In Tab. 3, the landmarks used in Section 2 to verify the correlation between delay similarity and geographical distribution include the landmarks used in these two experiments.
The IRLD algorithm and the NC-Geo algorithm perform geolocation under specific conditions of the SLG algorithm, which differ from the application scenario of this paper, whereas the geolocation conditions of the SLG algorithm are more general. Therefore, the algorithm proposed in this paper is compared with the SLG algorithm and the TNN algorithm.

Verification on the Geolocation Effect of the Algorithm
Based on the experimental setups in Tab. 3, we verify the effect of the geolocation algorithm in this subsection. 80% of the landmarks are randomly selected from each city as the candidate training data for the networks, and the remaining 20% of the landmarks (a total of 11,063) are used as unknown targets for geolocation verification. By applying the landmark clustering and filtering steps of the algorithm, the landmarks are divided into 67 clusters. Tab. 4 shows the relationship between the size of the training set, the number of clusters and their geographical locations. Tab. 5 shows the geolocation effects on the corresponding targets for training sets of different sizes under different geolocation thresholds. Tab. 5 and Fig. 5 show that as the total number of landmarks in the landmark set decreases, the number of samples in a single training set increases, and the number of targets that can be geolocated decreases, but the geolocation error (median error/maximum error) shows a downward trend. The reason is that a network trained on a training set with too few samples lacks generality, which is statistically reasonable. In addition, different geolocation thresholds influence the number of targets that can be geolocated and the geolocation error to different degrees. As the geolocation threshold increases, the number of targets that can be geolocated increases, but the geolocation error also increases.
It can be seen that the geolocation effect of the algorithm is closely related to the number of samples in the training set and the geolocation threshold. In fact, it is also closely related to the network characteristics of the geolocation target and the training set.
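The error statistics reported in this section (median error, maximum error and PGE < X) can be computed as in the following sketch. It assumes that the geolocation error is the great-circle distance between the estimated and the true coordinates, computed with the haversine formula and a mean Earth radius of 6371 km; the source does not state which distance measure it uses, and the function names are illustrative.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def error_summary(estimates, truths, limits_km=(10, 20, 40)):
    """Return (median error, maximum error, {X: PGE < X}) for the geolocated targets.

    estimates and truths are equally long lists of (lat, lon) pairs.
    PGE < X is the proportion of targets whose error is below X km.
    """
    errors = [haversine_km(*est, *true) for est, true in zip(estimates, truths)]
    pge = {x: sum(err < x for err in errors) / len(errors) for x in limits_km}
    return median(errors), max(errors), pge


# Three toy targets at the origin, estimated 0, 0.1 and 1 degree of longitude away.
med, worst, pge = error_summary([(0, 0), (0, 0.1), (0, 1)], [(0, 0)] * 3)
print(round(med, 2), round(worst, 2), pge)
```

Targets that the algorithm declines to geolocate would be excluded before calling `error_summary`, so the proportions refer only to targets that can be geolocated.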
Tab. 6 shows the number of targets that can be geolocated, geolocation error and other geolocation effects when the number of landmarks in the corresponding training sets in Zhengzhou and Hong Kong and the geolocation threshold are 4. Tab. 6 shows that under the same threshold condition, the smaller the relative hop count from the target to the corresponding training set, the higher its proportion among all targets that can be geolocated, and the higher the geolocation accuracy. This, to some extent, shows that the smaller the hop count from the target to the training set, the more similar the network characteristics of the target are to those of the landmarks in the training set. Thus, the use of the geolocation algorithm should fully consider the relative hop count from the target to the training set and the sample quantity of the training set.
Table notes: LQ is short for "landmark quantity"; U is short for "geolocation threshold"; QCG is short for "quantity of the targets that can be geolocated"; QCNG is short for "quantity of the targets that can't be geolocated"; MGE is short for "median geolocation error of the targets that can be geolocated".

Comparative Verification
In this subsection, we compare the geolocation effect of the proposed algorithm with those of the SLG algorithm and the TNN algorithm on the same targets and landmarks. Fig. 6 shows the cumulative distribution of the geolocation error of the proposed algorithm, the SLG algorithm and the TNN algorithm. The black line, red dashed line and blue dotted line indicate the cumulative distribution of the geolocation error of the proposed algorithm, the SLG algorithm and the TNN algorithm, respectively. It can be seen from Fig. 6 that when the geolocation threshold is 2 or 3, the geolocation effect of the proposed algorithm is better than those of the SLG algorithm and the TNN algorithm, but when the threshold is 4, part of the geolocation results of the proposed algorithm are weaker than those of the TNN algorithm. Tab. 7 shows, for a geolocation threshold of 4, the statistical results of the proposed algorithm, the SLG algorithm and the TNN algorithm for geolocation errors of 10, 20 and 40 km.
Tab. 7 shows that when the landmarks are composed of training sets with more than 100 training samples, the geolocation accuracy rate of the proposed algorithm within 20 km is 82.0%, while that of the TNN algorithm is 83.7%.
When the landmarks are composed of training sets with more than 300 training samples, the geolocation accuracy rates of the proposed algorithm within 20 and 40 km are 83.9% and 95.5%, respectively, while those of the TNN algorithm are 86.2% and 96.3%, respectively. Tab. 8 gives the reasons for this phenomenon. Tab. 8 shows that when the relative hop count between the target and the training set is relatively large, the neural network trained on the training set is not sufficient to reflect the network characteristics of the target, which may increase the error. The TNN algorithm uses all landmarks as one training set, which blurs local network characteristics. However, this does not guarantee that the TNN algorithm has higher geolocation reliability.
Table note: "PGE < X" is short for "proportion of the targets whose geolocation error is within X".
Section 2 shows the geolocation of all targets by the TNN algorithm. Fig. 2 shows that when the TNN algorithm geolocates all the targets that it considers can be geolocated, the number of targets geolocated is larger than for the proposed algorithm, but its geolocation error increases significantly. Tab. 2 shows that the geolocation accuracy rates of the TNN algorithm within geolocation errors of 10, 20 and 40 km are significantly lower than those of the proposed algorithm. Hence, the proposed algorithm is more reliable.

Conclusion
IP geolocation based on delay similarity is a classical family of IP geolocation algorithms. However, owing to the inconsistency between delay similarity and geographical proximity of IPs, the geolocation results of such algorithms are not reliable enough. Aiming at this deficiency, this paper proposes a street-level geolocation algorithm based on landmarks clustering. Experimental verification has been carried out on a total of 55,318 measurable street-level landmarks in Hong Kong, Shanghai, Zhengzhou and New York. The experimental results show that the proposed algorithm achieves street-level geolocation, and that its reliability is effectively improved compared with the SLG algorithm and the TNN algorithm.
Because the delays from the probes to the hosts are not stable enough during network measurement, the geolocation result is affected, whereas the paths observed in network measurement are stable. In future work, we will consider integrating path vectorization into the construction of the geolocation model to improve the geolocation accuracy.