Resilience of structured P2P systems under churn: The reachable component method

doi:10.1016/j.comcom.2008.01.051

Computer Communications

Volume 31, Issue 10, 25 June 2008, Pages 2109-2123

Abstract

Users in a peer-to-peer (P2P) system join and leave the network continuously, so understanding the resilience of P2P systems under high rates of node churn is important. In this work, we first show that a lifetime-based dynamic churn model for a P2P network that has reached stationarity is reducible to a uniform node failure model. This simple yet powerful result bridges the gap between complex dynamic churn models and the more tractable uniform failure model. We further develop the reachable component method and derive the routing performance of a wide range of structured P2P systems under varying rates of churn. We find that de Bruijn graph based routing systems offer excellent resilience under extremely high rates of node turnover, followed by a group of routing systems that includes CAN, Kademlia, Chord and randomized-Chord. We show that our theoretical predictions agree well with large-scale simulation results. We finish by suggesting methods to further improve the routing performance of dynamic P2P systems in the presence of churn and failures.

Introduction

In the past few years, we have witnessed an explosion in peer-to-peer (P2P) research, especially in the area of structured P2P systems or distributed hash tables (DHTs). In a real P2P network, nodes continuously log on and log off. As a result, understanding the resilience of P2P systems under realistic churn conditions has become an important research topic. To characterize the routing resilience of P2P systems under failure, we consider the measure of routability, defined as the expected number of routable node pairs divided by the number of possible node pairs among the surviving nodes. Note that routability is a “per query” measure (i.e. for a given query or routing request, routability is the probability that the message can be successfully routed from a random source to a random destination). The routability of a P2P network under failure has been estimated via simulations, and has been shown to be a function of the architecture and routing topology [1]. However, a general method that can compute an analytical routability expression for most structured P2P architectures has been lacking.
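As a rough illustration of the routability metric defined above, the following Python sketch estimates it by Monte Carlo sampling under uniform node failure. The fully populated Chord-like ring with fingers at clockwise distances 2^k, the greedy rule without backtracking, and the names `greedy_route` and `estimate_routability` are all assumptions made for illustration; this is not the simulation setup used in the paper.

```python
import random


def greedy_route(src, dst, alive, m):
    """Greedy clockwise routing on a hypothetical ring of 2**m node IDs.

    Each node is assumed to hold fingers at distances 2**k (k = 0..m-1).
    At every hop the largest non-overshooting finger that is alive is used;
    routing fails if no such finger exists (no backtracking).
    """
    n = 1 << m
    cur = src
    while cur != dst:
        dist = (dst - cur) % n
        nxt = None
        # Largest k with 2**k <= dist is dist.bit_length() - 1.
        for k in range(dist.bit_length() - 1, -1, -1):
            cand = (cur + (1 << k)) % n
            if alive[cand]:
                nxt = cand
                break
        if nxt is None:
            return False
        cur = nxt
    return True


def estimate_routability(m, q, trials=2000, seed=0):
    """Fraction of sampled surviving (source, destination) pairs that are routable.

    Each node fails independently with probability q (uniform failure model).
    A single failure pattern is drawn per call; averaging over several seeds
    would give a smoother estimate.
    """
    rng = random.Random(seed)
    n = 1 << m
    alive = [rng.random() >= q for _ in range(n)]
    survivors = [i for i in range(n) if alive[i]]
    if len(survivors) < 2:
        return 0.0
    ok = 0
    for _ in range(trials):
        src, dst = rng.sample(survivors, 2)
        ok += greedy_route(src, dst, alive, m)
    return ok / trials


if __name__ == "__main__":
    for q in (0.0, 0.1, 0.3, 0.5):
        print(q, estimate_routability(m=10, q=q))
```

With q = 0 every pair is routable, so the estimate is exactly 1.0; for q > 0 the value depends on the sampled failure pattern.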

In this work, we provide the reachable component method to study the routing performance of structured P2P systems under failures and churn. We derive analytical expressions for the routability of a wide range of P2P systems under uniform failure and varying churn parameters. The P2P systems investigated are: Symphony [2], Kademlia [3], Chord [4], CAN [5], and de Bruijn graph based systems [6], [7], [8]. For the same number of connections per node, we find that the de Bruijn routing topology is extremely resilient, even under very high rates of node turnover, followed by a group of “hypercube-like” routing systems that includes CAN, Kademlia, Chord, randomized-Chord and logarithmic Symphony (i.e. a Symphony system with a number of shortcuts that scales logarithmically with system size). We verify our analytical results against large-scale simulations of different P2P systems under churn.

The reachable component method is applicable to analyzing almost all the proposed structured P2P architectures under failures and churn, and provides a common framework to compare across them. Certainly, there are differences in the performances of various P2P architectures and the reachable component method provides the means for calculating and characterizing such differences in the presence of node turnovers. Furthermore, the analytical nature of the reachable component method provides insights into the nature of these differences in performance, thus enabling the development of methods to further improve performance.

Previous research efforts often employed the uniform failure model [1], [4], [7], [9], [10], [11], or focused on the connectivity properties of the system [12], [13]. The uniform failure model is often criticized as unrealistic, since a tool to accurately estimate the failure probability q in a continuously evolving P2P system is lacking [14]. Furthermore, in a structured P2P system, the analysis of routing performance is more relevant than the study of connectivity, since it is possible that two nodes in the same connected component cannot route to each other under system node failures (see Fig. 1).

Toward the goal of developing a model of dynamic node failures in P2P systems, Leonard et al. developed the lifetime-based node failure model [12], [13] (referred to as the lifetime model in this work): each joining node i is given a random lifetime L_i, drawn from a probability distribution; when failed neighbors are encountered due to log-off, nodes employ neighbor-recovery algorithms to replace the failed neighbors. However, neighbor-recovery is not instantaneous. In the lifetime model, the time required for repairing the failed links is denoted as the search-time S, which is assumed to be strictly positive. Furthermore, a new node will spend an amount of time equal to S to find all of its required neighbors upon joining the P2P network.

To better characterize node turnover in structured P2P systems, we develop a tool to accurately estimate the uniform failure probability q in a continuously evolving P2P system: under the lifetime-based assumptions, the dynamic churn model for a P2P system that has evolved long enough for renewal theory to hold is reducible to a uniform failure model with failure probability q. Intuitively, in a large dynamic P2P network in the steady state, there are always nodes that have just logged off while their affected neighbors are still trying to replace them; as a result, a steady fraction of nodes is in a failed state at any given time instant. In this work, we show that the steady-state probability of finding a given node in the failed state is $q = \bar{S} / (\bar{L} + \bar{S})$, where $\bar{L}$ and $\bar{S}$ are the mean lifetime and mean search-time, respectively.
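The expression for q can be evaluated directly once the two means are known. The short Python sketch below does this; the function name and the numerical values (a 30-minute mean lifetime and a 3-minute mean search-time) are hypothetical and chosen only to show the arithmetic.

```python
def steady_state_failure_probability(mean_lifetime: float,
                                     mean_search_time: float) -> float:
    """Return q = S_bar / (L_bar + S_bar).

    Both arguments must be positive and expressed in the same time unit.
    """
    if mean_lifetime <= 0 or mean_search_time <= 0:
        raise ValueError("mean lifetime and mean search-time must be positive")
    return mean_search_time / (mean_lifetime + mean_search_time)


if __name__ == "__main__":
    # Hypothetical values: mean lifetime 30 minutes, mean search-time 3 minutes.
    q = steady_state_failure_probability(30.0, 3.0)
    print(f"q = {q:.4f}")  # 3 / 33, prints "q = 0.0909"
```

Under these assumed values roughly 9% of nodes are in the failed state at any instant; shortening the search-time or lengthening the lifetime lowers q.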

In short, the methods and tools developed in this work are intended to assist system designers in assessing the routing performance of a deployed P2P system under different user lifetime characteristics. The rest of this paper is organized as follows. In Section 2, we discuss previous work on the resilience of P2P systems. In Section 3, we demonstrate how the dynamic churn model can be reduced to the uniform failure model. In Section 4, we present the reachable component method (RCM). In Section 5, we apply RCM to six DHT topologies and derive analytical expressions for each topology’s routability. In Section 6, we specify our simulation setup and compare our analytical results against large-scale simulation results. In Section 7, we discuss how RCM remains applicable under various modifications of the routing algorithm, such as allowing backtracking. In Section 8, we give our concluding remarks.

Section snippets

Related work

Network resilience in P2P systems has become an important research topic [15]. Gummadi et al. [1] showed through simulation that the routing topology of a DHT has a large effect on the network’s static resilience to random failures. In addition to simulation studies, theoretical work has predicted the performance of DHT systems under a static failure model [6], [9], [10], [11]. In particular, Wang et al. used Markov chains to model the routing process and calculated the “hit

Link between the Churn Model and the Uniform Failure Model

In the lifetime model, each joining node stays in the network for the lifetime L, then fails [12], [13]. After a node’s neighbor has departed, the node spends the search-time S looking for a replacement neighbor. By the same token, a newly joining node spends S amount of time searching for required connections.
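The alternation between a lifetime L and a search-time S described above can be simulated to see what fraction of time is spent in the failed state. The Python sketch below treats one neighbor slot as an alternating ON/OFF process and compares the empirical OFF fraction with the ratio of mean search-time to mean cycle length. The exponential distributions, the parameter values, and the name `simulate_failed_fraction` are assumptions for illustration only; the lifetime model allows other distributions, and this is not the paper's derivation.

```python
import random


def simulate_failed_fraction(mean_lifetime, mean_search_time,
                             cycles=200_000, seed=0):
    """Empirical fraction of time a neighbor slot spends in the failed state.

    Each cycle consists of an ON period (a lifetime L) followed by an OFF
    period (a search-time S). Both are drawn from exponential distributions
    here, which is an illustrative assumption.
    """
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        # random.expovariate takes the rate, i.e. 1 / mean.
        up_time += rng.expovariate(1.0 / mean_lifetime)
        down_time += rng.expovariate(1.0 / mean_search_time)
    return down_time / (up_time + down_time)


if __name__ == "__main__":
    mean_l, mean_s = 30.0, 3.0  # hypothetical means, same time unit
    empirical = simulate_failed_fraction(mean_l, mean_s)
    predicted = mean_s / (mean_l + mean_s)
    print(f"empirical = {empirical:.4f}, predicted = {predicted:.4f}")
```

As the number of cycles grows, the empirical fraction should approach the predicted ratio, which is the usual long-run behavior of an alternating renewal process.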

In the lifetime model [13], to prevent the network size from decreasing to zero, it is assumed that each failed node is immediately replaced by a new node with a randomized nodeID

Reachable component method

Since a dynamic lifetime-based churn model can be reduced to a uniform node failure model by invoking Lemma 1 (see Section 3), we begin our work by demonstrating the reachable component method under the uniform failure model.

Application of RCM on DHT routing topologies

Using the reachable component method, the analytical expressions for the routability of a DHT can be derived. In the derivations, finding the expression for p(h, q) through the state diagram is key. The analytical expressions derived in this section are verified against simulation results in Section 6. Furthermore, we will use the notation G(i, j) to denote the probability that, starting at state i, the routing process ever visits state j.

Simulation setup: uniform failure and churn

The analytical routability results derived in the previous section are compared against simulation results in this section for different network sizes (2¹⁶, 2¹⁸ and 2²⁰ nodes). For each DHT system, we first construct the connection topology and implement the routing algorithm. For each node failure probability q, we then perform the following: we randomly pick a root node and count the number of nodes in the system that the root node can reach by following the DHT’s

Improvement techniques

The reachable component method developed in this work is flexible and can be adapted and applied to estimating the performance of P2P networks under a number of different routing algorithms. So far in this work, we have used greedy routing to find the size of the reachable component under failure and churn. It is conceivable that we can improve a system’s routability under churn and failure by introducing modifications to the routing algorithms or to the system’s topology.

Concluding remarks

Using the lifetime model and the widely-adopted steady state network size assumption, we showed that a dynamic churn model for a P2P network that has reached stationarity is reducible to a uniform failure model in the steady state. Furthermore, we present the reachable component method (RCM), an analytical framework for characterizing structured P2P routing performance under random failures and churn. The method’s efficacy is demonstrated through an analysis of the routability of several

References (32)

P. Fraigniaud et al., D2B: a de Bruijn based content-addressable network, Theoretical Computer Science (2006)
K. Gummadi et al., The impact of DHT routing geometry on resilience and proximity
G.S. Manku, M. Bawa, P. Raghavan, Symphony: distributed hashing in a small world, in: Proceedings of Fourth USENIX...
P. Maymounkov et al., Kademlia: a peer-to-peer information system based on the XOR metric
I. Stoica et al., Chord: a scalable peer-to-peer lookup protocol for internet applications, IEEE/ACM Transactions on Networking (2003)
S. Ratnasamy et al., A scalable content-addressable network
D. Loguinov et al., Graph-theoretic analysis of structured peer-to-peer systems: routing distances and fault resilience
F. Kaashoek, D.R. Karger, Koorde: A simple degree-optimal hash table, in: Proceedings of the Second International...
O. Angel, I. Benjamini, E. Ofek, U. Wieder, Routing complexity of faulty networks, in: 24th ACM Symposium on Principles...
S.S. Lam et al., Failure recovery for structured P2P networks: protocol design and performance evaluation, SIGMETRICS Performance Evaluation Review (2004)
S. Wang, D. Xuan, W. Zhao, On resilience of structured peer-to-peer systems, in: Proceedings of GLOBECOM,...
D. Leonard, Z. Yao, X. Wang, D. Loguinov, On static and dynamic partitioning behavior of large-scale networks, in:...
D. Leonard et al., On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks
R. Bhagwan, S. Savage, G.M. Voelker, Understanding availability, in: Proceedings of the 2nd International Workshop on...
L. Massoulié, A.-M. Kermarrec, A. Ganesh, Network awareness and failure resilience in self-organising overlay networks,...
D. Liben-Nowell et al., Analysis of the evolution of peer-to-peer systems
