Computer Networks

Volume 51, Issue 10, 11 July 2007, Pages 2701-2716

Quantile sampling for practical delay monitoring in Internet backbone networks

https://doi.org/10.1016/j.comnet.2006.11.023

Abstract

Point-to-point delay is an important network performance measure as it captures service degradations caused by various events. We study how to measure and report delay in a concise and meaningful way for an ISP, and how to monitor it efficiently. We analyze various measurement intervals and potential metric definitions. We find that reporting high quantiles (between 0.95 and 0.99) every 10–30 min is the most effective way to summarize the delay in an ISP. We then propose an active probing scheme to estimate a high quantile with bounded error. We show that a small number of probes is sufficient to provide an accurate estimate. We validate the proposed delay monitoring technique on real data collected on the Sprint IP backbone network. To make our work complete, we lastly compare the overhead of our active probing technique with that of a passive sampling scheme and show that, for delay measurement, active probing is more practical.

Introduction

Point-to-point delay is a powerful “network health” indicator in a backbone network. It captures service degradation due to congestion, link failure, and routing anomalies. Obtaining meaningful and accurate delay information is necessary for both ISPs and their customers. Thus delay has been used as a key parameter in Service Level Agreements (SLAs) between an ISP and its customers [12], [33]. In this paper, we systematically study how to measure and report delay in a concise and meaningful way for an ISP, and how to monitor it efficiently.

Operational experience suggests that the delay metric should report the delay experienced by most packets in the network, capture anomalous changes, and not be sensitive to statistical outliers such as packets with options and transient routing loops [3], [11]. The common practice in operational backbone networks is to use ping-like tools. ping measures network round-trip times (RTTs) by sending ICMP requests to a target machine over a short period of time. However, ping was designed as a reachability tool, not as a delay measurement tool. Its reported delay includes uncertainties due to path asymmetry and ICMP packet generation times at routers. Furthermore, it is not clear how to set the parameters of measurement tools (e.g., the test packet interval and frequency) in order to achieve a given accuracy.

Inaccurate measurement defeats the purpose of performance monitoring. In addition, injecting a significant number of test packets for measurement may affect the performance of regular traffic, as well as tax the measurement systems with unnecessary processing burdens. More fundamentally, defining a metric that can give a meaningful and accurate summary of point-to-point delay performance has not been considered carefully.

We raise the following practical concerns in monitoring delays in a backbone network. How often should delay statistics be measured? What metric(s) capture the network delay performance in a meaningful manner? How do we implement these metrics with limited impact on network performance? In essence, we want to design a practical delay monitoring tool that is amenable to implementation and deployment in high-speed routers in a large network, and that reports useful information.

The major contributions of this paper are three-fold: (i) By analyzing delay measurement data from an operational network (the Sprint US backbone network), we identify high quantiles (0.95–0.99) as the most meaningful delay metrics, since they best reflect the delay experienced by most packets in an operational network, and we suggest a 10–30 min time scale as an appropriate interval for estimating them. High-quantile delay metrics estimated over such an interval provide the most representative picture of network delay performance: they capture the major changes and trends while being less sensitive to transient events and outliers. (ii) We propose and develop an active probing method for estimating high-quantile delay metrics. The novel feature of our proposed method is that it uses the minimum number of samples needed to bound the error of quantile estimation within a prescribed accuracy, thereby reducing the measurement overhead of active probing. (iii) We compare the network-wide overhead of active probing and passive sampling for delay measurement. To the best of our knowledge, this is the first effort to propose a complete methodology to measure delay in operational networks and to validate the performance of the active monitoring scheme on operational data.
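To make the metric in contribution (i) concrete, the following is a minimal Python sketch of reporting one high quantile per measurement interval. It is an illustration only, not the authors' implementation: the function names, the nearest-rank quantile definition, the `(timestamp_s, delay_ms)` input format, and the 600 s default interval (the low end of the 10–30 min range) are all assumptions made here.

```python
import math
from collections import defaultdict


def empirical_quantile(samples, p):
    """Nearest-rank p-quantile: the ceil(p * n)-th smallest sample."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    # Subtract a tiny tolerance so floating-point round-off in p * n
    # cannot push the rank one position too high.
    rank = max(1, math.ceil(p * len(ordered) - 1e-9))
    return ordered[rank - 1]


def quantile_per_interval(measurements, p=0.99, interval_s=600):
    """Group (timestamp_s, delay_ms) pairs into fixed intervals and
    return {interval_start_s: p-quantile of the delays in that interval}.

    The input format and defaults are hypothetical choices for this sketch.
    """
    buckets = defaultdict(list)
    for timestamp_s, delay_ms in measurements:
        buckets[int(timestamp_s // interval_s)].append(delay_ms)
    return {
        index * interval_s: empirical_quantile(delays, p)
        for index, delays in sorted(buckets.items())
    }
```

Reporting one value per interval keeps the summary concise, and a high quantile rather than the maximum keeps it from being dominated by a few outlier packets.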

The remainder of this paper is organized as follows. In Section 2, we provide the background and data used in our study. In Section 3, we investigate the characteristics of point-to-point delay distributions obtained from the packet traces and discuss metrics used in monitoring delay in a tier-1 network. In Section 4, we analyze how sampling errors can be bounded within pre-specified accuracy parameters in high-quantile estimation. In Section 5, we present the proposed delay measurement scheme and evaluate its performance using packet traces. In Section 6, we compare the overhead of active probing with that of passive sampling. In Section 7, we summarize related work. We conclude the paper in Section 8.

Section snippets

Data and background

We describe our data set and provide some background about point-to-point delay observed from this data.

Metrics definition for practical delay monitoring

The objective of our study is to design a practical delay monitoring tool that provides a network operator with a meaningful and representative picture of the delay performance of an operational network. Such a picture should tell the network operator about major and persistent changes in delay performance (e.g., due to a persistent increase in traffic load), not about transient fluctuations due to minor events (e.g., transient network congestion). Hence in designing a practical

Quantile estimation analysis

In this section, we develop an efficient and novel method for estimating high-quantile delay metrics: it estimates the high-quantile delay metrics within a prescribed error bound using the minimum number of required test packets. In other words, it attempts to minimize the overhead of active probing. In the following, we first formulate the quantile estimation problem and derive the relationship between the number of samples and the estimation accuracy. Then, we discuss the parameters involved to compute

Delay monitoring methodology

In this section, we describe our probing scheme and validate its accuracy using delay measurement data collected from the Sprint operational backbone network.
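As a rough illustration of how a probe count per interval might be sized, the sketch below uses the standard large-sample normal approximation for a sample quantile, whose variance is approximately p(1 - p) / (n f(q_p)^2), where f(q_p) is the delay density at the true p-quantile. This is a textbook result and is not claimed to be the derivation used in the paper. The function name, its parameters, and the idea of supplying the density as an input (in practice it would have to be estimated or lower-bounded from earlier measurements) are assumptions made for this sketch.

```python
import math
from statistics import NormalDist


def required_probes(p, eps, density_at_quantile, confidence=0.95):
    """Approximate number of probes n such that the sample p-quantile is
    within +/- eps of the true quantile with roughly the given confidence.

    Uses the asymptotic variance p * (1 - p) / (n * f**2) of a sample
    quantile, where f is the delay density at the true quantile.
    eps is in delay units (e.g. ms) and f in inverse delay units (1/ms).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if eps <= 0 or density_at_quantile <= 0:
        raise ValueError("eps and density_at_quantile must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = (z * z) * p * (1.0 - p) / (eps * density_at_quantile) ** 2
    return math.ceil(n)
```

Under this approximation the probe count grows as the error bound tightens and as the density at the quantile falls, which is typical in the tail of a delay distribution.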

Active vs. passive sampling: overhead comparison

In this section, we compare the network-wide overhead of our active measurement method with that of a passive sampling technique.

For the comparison, we first describe a passive sampling process for delay measurement in a network (see Fig. 12 for reference). We sketch here a hash-based scheme proposed in [6]. For delay measurement, all regular packets are hashed, passively sampled based on their hash values, and time-stamped at the measurement points. To capture the same sets of packets on different

Related work

IPPM (IP Performance Metrics) [15] has defined a set of metrics [10] for measuring the quality, performance, and reliability of Internet paths, and developed standard frameworks [35] for active probing. IPPM does not provide a complete delay measurement methodology as we do. Projects such as RIPE (Réseaux IP Européens) TTM (Test Traffic Measurement) [29] and Surveyor [19] implement IPPM metrics and provide GPS-enabled measurement infrastructures that can be deployed on the networks to be monitored. In these

Conclusions

We proposed a practical delay measurement methodology designed to be implemented in operational backbone networks. It consists of measuring high quantiles (between 0.95 and 0.99) of delay over 10–30 min time intervals using pseudo-random active probing. We justify each step and parameter of the technique and validate it on real delay measurements collected on a tier-1 backbone network. The accuracy of the measured delay can be controlled, and is guaranteed within a given error bound. Our method is

References (37)

  • Y. Bejerano, R. Rastogi, Robust monitoring of link delays and faults in IP networks, in: IEEE INFOCOM’03, San...
  • J.-C. Bolot, End-to-end packet delay and loss behavior in the Internet, in: Proceedings of ACM SIGCOMM, San Francisco,...
  • C. Boutremans, G. Iannaccone, C. Diot, Impact of link failures on VoIP performance, in: Network and Operating Systems...
  • B.-Y. Choi, S. Moon, Z.-L. Zhang, K. Papagiannaki, C. Diot, Analysis of point-to-point packet delay in an operational...
  • A. Downey, Using pathchar to estimate Internet link characteristics, in: Proceedings of ACM SIGCOMM, Cambridge, MA,...
  • N. Duffield, M. Grossglauser, Trajectory sampling for direct traffic observation, in: Proceedings of ACM SIGCOMM,...
  • NLANR (The National Laboratory for Applied Network Research), Active measurement project,...
  • CAIDA (The Cooperative Association for Internet Data Analysis), skitter,...
  • C. Fraleigh et al., Packet-level traffic measurements from the Sprint IP backbone, IEEE Network (2003)
  • G. Almes et al., A one-way delay metric for IPPM, Internet Request For Comments (1999)
  • U. Hengartner, S. Moon, R. Mortier, C. Diot, Detection and analysis of routing loops in packet traces, in: ACM SIGCOMM...
  • G. Huston, ISP Survival Guide: Strategies for Running a Competitive ISP (1998)
  • G. Iannaccone, C-N. Chuah, R. Mortier, S. Bhattacharyya, C. Diot, Analysis of link failures over an IP backbone, in:...
  • Sprint ATL IPMon project,...
  • IPPM, Internet Engineering Task Force, IP performance metric charter,...
  • V. Jacobson, pathchar....
  • R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling (1991)
  • S. Jamin, C. Jin, Y. Jin, R. Raz, Y. Shavitt, L. Zhang, On the placement of Internet Instrumentation, in: Proceedings...

    Baek-Young Choi received her B.S. from Pusan National University, Korea in 1993 and M.S. from Pohang University of Science and Technology, Korea in 1995. She earned her Ph.D. degree from the University of Minnesota, Twin Cities in 2003. All of her degrees are in Computer Science and Engineering. After her Ph.D., she held positions at Sprint Advanced Technology Labs and the University of Minnesota, Duluth as a post-doctoral researcher and 3M McKnight Distinguished Visiting Assistant Professor, respectively. She has been an assistant professor at the University of Missouri, Kansas City since 2005. Her research interests include network monitoring, wireless sensor networks and optical networks.

    Sue Moon received her B.S. and M.S. from Seoul National University, Seoul, Korea, in 1988 and 1990, respectively, both in Computer Engineering. She received a Ph.D. degree in computer science from the University of Massachusetts at Amherst in 2000. From 1999 to 2003, she worked in the IPMON project at Sprint Advanced Technology Labs in Burlingame, California. In August of 2003, she joined KAIST and now teaches in Daejeon, Korea. Her research interests are network performance measurement and monitoring of diverse network types and their security, anomaly, and fault resilience aspects.

    R.L. Cruz received the B.S. and Ph.D. degrees in Electrical Engineering from the University of Illinois, Urbana, in 1980 and 1987, respectively, and the S.M.E.E. degree from the Massachusetts Institute of Technology in 1982. Since 1987, he has been on the faculty at the University of California, San Diego. Dr. Cruz was elected to be a Fellow of the IEEE in 2003.

    Zhi-Li Zhang received a B.S. degree in Computer Science from Nanjing University, China, in 1986 and his M.S. and Ph.D. degrees in computer science from the University of Massachusetts in 1992 and 1997. In 1997 he joined the Computer Science and Engineering faculty at the University of Minnesota, where he is currently a professor. From 1987 to 1990, he conducted research in the Computer Science Department at Århus University, Denmark, under a fellowship from the Chinese National Committee for Education. He has held visiting positions at Sprint Advanced Technology Labs; IBM T.J. Watson Research Center; Fujitsu Labs of America; Microsoft Research China; and INRIA, Sophia-Antipolis, France.

    Christophe Diot received a Ph.D. degree in Computer Science from INP Grenoble in 1991. He was with INRIA Sophia-Antipolis from October 1993 to September 1998, Sprint (Burlingame, CA) from October 1998 to April 2003, and Intel Research (Cambridge, UK) from May 2003 to September 2005. He joined Thomson in October 2005 to start and manage the Paris Research Lab (http://parislab.thomson.net). His current research activities focus on communication services and platforms for the future. He is an ACM fellow.