An Alias Resolution Method Based on Delay Sequence Analysis

Abstract: Alias resolution, the mapping of IP addresses to routers, is a critical step in obtaining a network topology. Recent work on alias resolution relies on special fields in the packet, such as the IP ID and port number. However, for security reasons, most network devices drop packets that set options, and some of the relevant fields exist only in IPv4, so these methods cannot be used for alias resolution in IPv6. To solve these problems, we propose an alias resolution method based on delay sequence analysis. In this article, we present a new model to describe the distribution of Internet delays and give a mathematical proof. Experimental measurements using the Macroscopic Internet Topology Data Kit (ITDK) and the Ark IPv6 Topology Dataset show that the statistical differences between the delay models of most aliases are very small, whereas the statistical differences between the delay models of non-aliases are spread over a wide range. Applying wavelet decomposition to the delay sequences shows that, after the noise is filtered out, the approximate components and the detail components of the delay sequences of aliases are the same, which provides a theoretical explanation for the experimental results. This technique is applicable to both IPv4 and IPv6.


Introduction
The ping tool, which is based on ICMP, UDP or TCP, is widely used to measure Internet delay. It shows the reachability of a host on the Internet and the round-trip time from the source host to the destination host. The program reports errors, packet loss, time to live (TTL) and a statistical summary of the results, typically including the minimum, maximum and mean round-trip times and the standard deviation of the mean [Muuss (1983)]. Because it is cross-platform (Windows, Mac and Linux) and supports both IPv4 and IPv6, the ping tool was selected to obtain reliable experimental data.
Alias resolution is the process of identifying which interface IP addresses belong to the same router. It is required to convert the abstract IP-level topology discovered by traceroute into a more concrete router-level topology [Pansiot and Grad (1998); Gunes and Sarac (2007)]. Accurate and efficient identification of alias IPs is important for obtaining real router-level network topologies. Existing alias resolution methods fall into two types: those based on active probing and those based on statistical analysis. Methods based on active probing obtain response packets by probing interface IP addresses and perform alias resolution based on the source address, the IP identifier and the option fields of the response packets. Methods based on statistical analysis rely on router host naming rules, IP address assignment conventions and network composition, together with statistical results such as the network graph structure.
Typical alias resolution methods based on active probing are as follows. 1) Alias resolution according to the content of the response packets, including header and data. Multiple interfaces of a router usually share a counter. When a packet is generated, the counter value becomes the IP identification (IP-ID) of the packet, so the IP-ID series of aliases follows a linear or nonlinear relationship.
Ally considers that the IP-ID values of aliases in response packets are ordered and adjacent [Spring, Mahajan and Wetherall (2002)]. Radar Gun [Bender, Sherwood and Spring (2008)] considers that the IP-ID series in multiple response packets of aliases are similar. MIDAR [Keys, Hyun, Luckie et al. (2013)] considers that the IP-ID series of aliases share a monotonic trend. However, most routers or hosts were found to use a constant IP-ID (39%) or a local counter (34%); global counters account for approximately 18% and counters using random numbers for approximately 9% [Salutari, Cicalese and Rossi (2018)]. Options in ICMP packets are also used for alias resolution, for example by Sidecar [Sherwood and Spring (2006)], RIPAPT [Sherry, Katz-Bassett, Pimenova et al. (2010)] and Pythia [Marchetta, Persico and Pescapè (2013)]. In 2014, the IETF (Internet Engineering Task Force) recommended that network devices discard packets with options set, for security reasons. This policy makes option-based measurement infeasible. 2) Alias resolution according to inductive detection. Mercator and iffinder [Govindan and Tangmunarunkit (2000)] observe that when a UDP probe is sent to a high port on a router interface IP, the source address in the response packet header may be another IP on the same router; aliases are determined by comparing the destination address of the probe packet with the source address of the response packet. However, only a few IPs respond to UDP high-port probes, and even fewer respond from another source address [Govindan and Tangmunarunkit (2000)]. Beverly et al. [Beverly, Brinkmeyer, Luckie et al. (2013)] use packets of a specific size to induce alias information. Such a packet exceeds the 1280 B limit of the IPv6 MTU and is smaller than the 1500 B limit of the IPv4 MTU. The target splits the message into 2 packets for transmission and adds a fragment ID, which can be used for alias resolution. However, this measurement mechanism causes a heavy network load.
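The shared-counter idea behind the IP-ID methods above can be illustrated with a minimal sketch. It is a simplified illustration, not the algorithm of Ally, Radar Gun or MIDAR: it assumes two candidate interfaces have been probed alternately and that the replies are available as (timestamp, IP-ID) pairs. The function name `shares_counter` and the bound `max_step` are hypothetical.

```python
from typing import List, Tuple

IPID_MODULUS = 1 << 16  # the IPv4 Identification field is 16 bits wide


def shares_counter(samples: List[Tuple[float, int]], max_step: int = 1000) -> bool:
    """Return True if time-ordered IP-ID samples look like one shared counter.

    `samples` holds (timestamp, ip_id) pairs collected by probing two
    candidate interfaces alternately. `max_step` is a hypothetical bound on
    how far a counter may advance between two consecutive replies.
    """
    ordered = sorted(samples, key=lambda sample: sample[0])
    for (_, prev_id), (_, next_id) in zip(ordered, ordered[1:]):
        # Forward distance from prev_id to next_id, tolerant of 16-bit wraparound.
        step = (next_id - prev_id) % IPID_MODULUS
        # A constant IP-ID (step 0) or a large jump contradicts a shared counter.
        if step == 0 or step > max_step:
            return False
    return True
```

A check of this kind cannot succeed for devices that use constant, per-interface or random IP-IDs, which is the limitation reported by Salutari, Cicalese and Rossi (2018).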
The alias resolution theory based on statistical analysis is as follows: IPs with the same host name or similar naming rules [Gunes and Sarac (2006)], or in the same subnet [Augustin, Cuvellier, Orgogozo et al. (2006)], are aliases; alternatively, two addresses that directly precede a common successor are aliases, assuming point-to-point links are used [Spring, Dontcheva, Rodrig et al. (2004)]. The practicality of these theories is unsatisfactory. Router host names are difficult to acquire, and host naming rules do not follow a common standard. Routers have a large number of unknown interfaces, and there is no stable connection between the interfaces.
With the development of technology, machine learning has been applied to alias resolution, using neural networks to improve the accuracy and generality of the analysis. Grailet and Donnet [Grailet and Donnet (2017)] proposed an alias resolution algorithm that combines various measurement methods. With machine learning, clustering a large number of aliases becomes realistic. However, the interpretability of neural networks used for alias resolution remains unknown. This paper proposes an alias resolution technique using delay series and also explains why neural networks can use delays for alias resolution. Building on previous work, we propose a mathematical form of the delay model of an Internet link and give a proof. This model is used to calculate the difference between delay models, and we compare two methods of measuring that difference. This paper also introduces wavelet decomposition to distinguish and analyze aliases and finds that delay series can be used for alias resolution.

Theory of link delay distribution
Round-trip time (RTT) is the elapsed time between the instant a packet is released by the source and the instant the corresponding ACK is received by the source [Shakkottai, Brownlee and Broido (2004)]. The distribution of RTTs can dynamically reflect network link status, congestion and bandwidth over time intervals. Many theories show that network delays are regular, and a large body of research has explained the network delay distribution with accurate models. RTTs contain queuing delays, which are accentuated by geographic distance [Shakkottai, Brownlee and Broido (2004)]. Aliases on a router are in the same location, so their queuing delays are almost equal.

Figure 1: Circumstances of traceroute paths from the measurement host to the target aliases. The aliases are bound to interface 0 and interface 1. Solid lines represent common routes and dotted lines represent differing routes
In Fig. 1, circumstances A, B and C have many common routes, and the path similarity is high. The routing paths in D have almost no common route, and this circumstance is not considered in alias resolution. Aliases have many characteristics in common [Keys, Hyun, Luckie et al. (2013)]. Using a data set of 2,000,000 IPs from CAIDA, experiments show that about 96.7% of the routes from measuring hosts to aliases belong to circumstances A, B and C, and rarely to circumstance D. For non-aliases there is usually no common route in traceroute; about 16.7% of them share only a few hops (fewer than 3). The theory is based on the fact that identical datagrams on the same link in the same interval experience the same delay [Hu, Xiang, Wu et al. (2019)]. Due to the dynamics of the Internet, a small number of delay measurements is not enough to reflect the characteristics of the entire link, so a model of the delay distribution is needed. It was found that the round-trip delay distribution can be well approximated by a truncated normal distribution [Shakkottai, Brownlee and Broido (2004); Elteto and Molnar (1999)]. Papagiannaki et al. show that the single-hop queueing delay in an operational backbone network follows a Weibull distribution [Papagiannaki, Moon, Fraleigh et al. (2002)]. In its probability density function (PDF), C is a constant representing the processing delay, transmission delay and propagation delay. On a fixed link, the transmission delay and propagation delay are the same for identical packets, and the processing delay is negligible because routers usually process a packet within 1 ms.
For ease of calculation, the single-hop model of queueing delay degenerates into an exponential distribution within the error tolerance, because the fitted shape parameter is 0.82 [Papagiannaki, Moon, Fraleigh et al. (2003)]. As Fig. 2 shows, the Weibull distribution is very similar to the exponential distribution when the shape parameter is close to 1. The probability density function can therefore be written in exponential form. We prove in the appendix that the delay of the entire link is consistent with the Gamma distribution. The Gamma distribution is equivalent to the Weibull distribution only if the shape parameter equals 1. We regard a queuing event as a "bad" event in transmission: under perfect circumstances there is no queuing delay, since forwarding a packet without queuing is the most efficient. The shape parameter is the power of the variable plus one, so a shape parameter of 1 indicates that the rate of "bad" events is constant over time. This suggests that random external events cause the "bad" events, which corresponds to the generation mechanism of queuing delay. Network delay thus conforms to a definite distribution model. In fact, this model shows that the network delay distribution is affected by the entire link, including the TTL, the congestion at each hop, the forwarding performance of each router, etc. The difference between delay models therefore also reflects the difference between the characteristics of entire links. We introduce the energy distance [Székely and Rizzo (2013)] and the Wasserstein distance to measure the difference between network delay distribution models; the difference values of aliases are typically smaller than those of non-aliases.
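The step from per-hop exponential delays to a Gamma-distributed link delay can be sketched as follows. This is not the appendix proof, only a standard argument under stated assumptions: the queueing delays at the n hops are independent and exponentially distributed with a common rate, and the constant offset C, which only shifts the distribution, is omitted. The symbols k, b, λ, n, X_i and S_n are notation chosen for this sketch and do not come from the original text.

```latex
% Assumed notation: k = Weibull shape, b = Weibull scale, \lambda = 1/b,
% n = number of hops, X_i = queueing delay at hop i, S_n = total queueing delay.
\begin{align}
f_{\mathrm{W}}(x) &= \frac{k}{b}\left(\frac{x}{b}\right)^{k-1} e^{-(x/b)^{k}}, \quad x \ge 0
  && \text{(Weibull density)} \\
f_{X_i}(x) &= \lambda e^{-\lambda x}, \quad \lambda = 1/b
  && \text{(case } k = 1\text{)} \\
M_{X_i}(t) &= \mathbb{E}\!\left[e^{tX_i}\right] = \frac{\lambda}{\lambda - t}, \quad t < \lambda \\
M_{S_n}(t) &= \prod_{i=1}^{n} M_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^{n},
  \quad S_n = \sum_{i=1}^{n} X_i \\
f_{S_n}(s) &= \frac{\lambda^{n} s^{\,n-1} e^{-\lambda s}}{(n-1)!}, \quad s \ge 0
  && \text{(Gamma density, shape } n\text{, rate } \lambda\text{)}
\end{align}
```

For n = 1 the Gamma density reduces to the exponential density, which is also the Weibull density with k = 1, so the two families coincide only in that case. The exponent of s is n − 1, which is why the shape parameter equals that power plus one.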

Data sources
Because of the dynamics of the Internet, the datasets contain some errors: IPs in an alias group may no longer be on the same router, and non-aliases may since have been assigned to one router. We drop the IPs that differ significantly from the other aliases due to a high loss rate.

IPv4
The Macroscopic Internet Topology Data Kit (ITDK) is a classical Internet measurement dataset, including 3.01 million addresses extracted from the IPv4 Routed /24 Topology Dataset.
The ITDK contains data about connectivity and routing gathered from a large cross-section of the global Internet. This dataset is useful for studying the topology of the Internet at the router level, among other uses. The nodes file of the ITDK lists the set of interfaces that were inferred to be on each router.

IPv6
The Ark IPv6 Topology Dataset contains information useful for studying the IP-level topology of the IPv6 Internet. It uses scamper to perform ICMP-based traceroutes using the Paris traceroute technique. For each probed path, this dataset includes the IP address, RTT, reply TTL and ICMP responses for all hops. We use this dataset to verify the validity of the delay series theory in IPv6.

Methodology and data analysis
We analyze the delay distribution of a large number of aliases over 24 hours at a granularity of 5 minutes. The results are clear and stratified: Internet delay changes dynamically within a small range. Experiments also show the difference of non-aliases in terms of energy distance. We believe these conclusions will help later research.

Delay distribution over time
We send measurement probes to an alias group of about 100 aliases at a granularity of 5 minutes. The measurement lasted for 24 hours, giving 288 rounds. The characteristics of the delay distribution are clearly demonstrated by plotting the measurement results in a chronological 3D diagram. For a single IP, the latency distribution is stable, as shown in Fig. 4. Using the Wasserstein distance as the metric, the change between consecutive measurements does not exceed 2.0 for 88.2% of IPs. Using the energy distance, IPs with a value below 0.5 account for 92.7% of all aliases. This means that the delay distribution is stable and experimentally verifiable over a proper interval. However, a large interval between two measurements of an IP results in large errors, even larger than those between non-aliases. The accumulation of small errors over a short period of time can make a huge difference. Therefore, the measurement duration should be kept within a small range.
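The stability check described above can be sketched in a few lines. This is a minimal illustration, not the measurement code used in the experiments: it assumes each 5-minute round yields a list of RTT samples for one IP, computes the one-dimensional Wasserstein distance between consecutive rounds, and reports the fraction of steps that stay within a threshold. The function names and the default threshold of 2.0 (taken from the figure quoted in the text) are assumptions.

```python
from bisect import bisect_right
from typing import List, Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """First-order Wasserstein distance between two empirical 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    points = sorted(set(xs_sorted) | set(ys_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_x = bisect_right(xs_sorted, left) / len(xs_sorted)
        cdf_y = bisect_right(ys_sorted, left) / len(ys_sorted)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def consecutive_distances(windows: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each measurement round and the next one."""
    return [wasserstein_1d(a, b) for a, b in zip(windows, windows[1:])]


def stable_fraction(windows: Sequence[Sequence[float]], threshold: float = 2.0) -> float:
    """Fraction of consecutive rounds whose distance is within `threshold`."""
    distances = consecutive_distances(windows)
    if not distances:
        raise ValueError("need at least two measurement rounds")
    return sum(d <= threshold for d in distances) / len(distances)
```

With 288 rounds per IP, `stable_fraction` would be evaluated over 287 consecutive pairs; comparing rounds that are far apart in time is deliberately avoided, in line with the observation that long intervals inflate the error.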

Difference of aliases and no-aliases in delay
We deployed three hosts in Los Angeles and Seattle and performed extensive latency measurements of 400,000 IPs belonging to approximately 6,000 routers from ITDK. Analysis of the delay distributions shows that aliases are highly similar to one another, which is reflected in low differences in the energy distance, the Wasserstein distance and the wavelet analysis of the delay series. In contrast, the differences for non-aliases are high.
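As an illustration of the energy distance used in this comparison, the following sketch computes it directly from two lists of RTT samples. It assumes the square-root form of the statistic of Székely and Rizzo (2013), estimated over all sample pairs; whether the experiments use this form or its square is not stated in the text, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import Sequence


def mean_abs_diff(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average absolute difference over all pairs (x, y)."""
    return sum(abs(x - y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def energy_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Energy distance between two empirical delay samples.

    Uses sqrt(2 E|X - Y| - E|X - X'| - E|Y - Y'|), with each expectation
    replaced by the average over all sample pairs.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    squared = (
        2.0 * mean_abs_diff(xs, ys)
        - mean_abs_diff(xs, xs)
        - mean_abs_diff(ys, ys)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(squared, 0.0))
```

Two interfaces whose RTT samples cluster around the same values give a distance near zero, while samples centred on clearly different delays give a much larger value, which is the separation between aliases and non-aliases described above.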
To distinguish aliases from non-aliases by latency distribution, we randomly sampled 10,000 IP pairs from 4 million IP pairs, of which 5,000 were alias pairs and 5,000 were non-alias pairs. The thresholds are set to 5.0 and 2.0 for the energy distance and the Wasserstein distance, respectively, after regularization of the delay distribution. There is a clear difference between aliases and non-aliases. A delay series should not be summarized only as a probability distribution model; the sequential information in its values should also be used. Network delay has multiple sources: signal transmission on the link, router processing, congestion control, traffic control, etc. We use wavelet decomposition to extract the information important for alias resolution, as shown in Figs. 6 and 7. Because alias pairs share a large number of common routes, the link status and the running status of the routers are the same at the same time, so the primary component of the delay is the same. The error between alias pairs arises because the measurements are not fully simultaneous in practice.
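The comparison of approximate components can be sketched with a hand-written Haar decomposition. The text does not state which wavelet or how many levels were used, so the Haar wavelet, the averaging normalization (chosen so that the approximation stays in the same units as the delays) and all function names here are assumptions made for illustration.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: pairwise averages (approximation) and half-differences (detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    evens, odds = signal[0::2], signal[1::2]
    approx = [(a + b) / 2.0 for a, b in zip(evens, odds)]
    detail = [(a - b) / 2.0 for a, b in zip(evens, odds)]
    return approx, detail


def haar_approximation(signal: Sequence[float], levels: int) -> List[float]:
    """Approximation component after `levels` Haar steps (details discarded)."""
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx


def approximation_gap(series_a: Sequence[float], series_b: Sequence[float],
                      levels: int = 2) -> float:
    """Root-mean-square gap between the approximation components of two delay series."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    a = haar_approximation(series_a, levels)
    b = haar_approximation(series_b, levels)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```

Under these assumptions, two series that differ only by fast alternating noise have the same approximation component, whereas series measured over different paths keep a large gap; series lengths must be divisible by 2 to the power of `levels`.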

Conclusion
This paper verifies and illustrates the mathematical expression of the overall model of network delay. Results show that the Gamma distribution fits the network delay distribution very well. This paper proposes an alias resolution method based on the difference between delay probability distributions. Wavelet analysis was used to decompose the delay series, and it was found that aliases are similar not only in latency distribution: the approximate components of the denoised delay series are also almost the same in alias pairs, which exploits sequential information in the time domain. This idea is closer to reality, because the running status of a router does not always present a perfect periodicity, and such variations are partly recorded in the delay series. Extracting the main frequency-domain and time-domain information and filtering out unrelated noise are the main tasks for alias resolution using delay series. The wavelet analysis results show that delay is more than an isolated number for discrimination; its distribution and order information can also assist alias identification. Delay series are therefore feasible for alias resolution.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.