Resilience of structured P2P systems under churn: The reachable component method
Introduction
In the past few years, we have witnessed an explosion in peer-to-peer (P2P) research, especially in the area of structured P2P systems, or distributed hash tables (DHTs). In a real P2P network, nodes continuously log on and log off. As a result, understanding the resilience of P2P systems under realistic churn conditions has become an important research topic. Toward the goal of characterizing the routing resilience of P2P systems under failure, we consider the measure of routability, which is defined as the expected number of routable node pairs divided by the number of possible node pairs among the surviving nodes. Note that routability is a “per query” measure: given a query or routing request, routability is the probability that the message can be successfully routed from a random source to a random destination. The routability of a P2P network under failure has been estimated via simulations, and has been shown to be a function of the architecture and routing topology [1]. However, a general method for computing an analytical routability expression for most structured P2P architectures has been lacking.
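To make the routability definition concrete, the following minimal Python sketch counts routable ordered pairs among the survivors of a small hypercube. The hypercube geometry, the bit-fixing greedy rule without backtracking, and the names greedy_route and routability_exact are assumptions chosen for illustration; they are not taken from this paper.

```python
from itertools import permutations


def greedy_route(src, dst, alive, dim):
    """Greedy bit-fixing on a dim-dimensional hypercube (illustrative rule).

    At each step, forward to the first surviving neighbor that is one hop
    closer to dst. Without backtracking, the route is dropped as soon as no
    such neighbor survives.
    """
    cur = src
    while cur != dst:
        diff = cur ^ dst
        nxt = None
        for bit in range(dim):
            if (diff >> bit) & 1:
                cand = cur ^ (1 << bit)
                if cand in alive:
                    nxt = cand
                    break
        if nxt is None:
            return False
        cur = nxt
    return True


def routability_exact(alive, dim):
    """Routable ordered pairs divided by all ordered pairs of survivors."""
    pairs = list(permutations(sorted(alive), 2))
    if not pairs:
        return 0.0
    routable = sum(greedy_route(s, t, alive, dim) for s, t in pairs)
    return routable / len(pairs)


if __name__ == "__main__":
    # Nodes 1, 2 and 6 of a 3-dimensional hypercube have failed.
    print(routability_exact({0, 3, 4, 5, 7}, dim=3))
```

In this example, nodes 0 and 3 stay connected through nodes 4, 5 and 7, yet greedy routing between them fails in both directions, so 18 of the 20 ordered pairs are routable. This is the sense in which routability is a stricter measure than connectivity.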
In this work, we present the reachable component method to study the routing performance of structured P2P systems under failures and churn. We derive analytical expressions for the routability of a wide range of P2P systems under uniform failure and varying churn parameters. The P2P systems investigated are Symphony [2], Kademlia [3], Chord [4], CAN [5], and de Bruijn graph based systems [6], [7], [8]. For the same number of connections per node, we find that the de Bruijn routing topology is extremely resilient, even under very high rates of node turnover, followed by a group of “hypercube-like” routing systems that includes CAN, Kademlia, Chord, randomized-Chord and logarithmic Symphony (i.e., a Symphony system whose number of shortcuts scales logarithmically with system size). We verify our analytical results against large-scale simulations of different P2P systems under churn.
The reachable component method is applicable to analyzing almost all the proposed structured P2P architectures under failures and churn, and provides a common framework for comparing them. The various P2P architectures do differ in performance, and the reachable component method provides the means for calculating and characterizing these differences in the presence of node turnover. Furthermore, the analytical nature of the method provides insight into the origin of these differences, thus enabling the development of methods to further improve performance.
Previous research efforts often employed the uniform failure model [1], [4], [7], [9], [10], [11], or focused on the connectivity properties of the system [12], [13]. The uniform failure model is often criticized as unrealistic, since a tool to accurately estimate the failure probability q in a continuously evolving P2P system is lacking [14]. Furthermore, in a structured P2P system, the analysis of routing performance is more relevant than the study of connectivity, since it is possible that two nodes in the same connected component cannot route to each other under system node failures (see Fig. 1).
Toward the goal of developing a model of dynamic node failures in P2P systems, Leonard et al. developed the lifetime-based node failure model [12], [13] (referred to as the lifetime model in this work): each joining node i is given a random lifetime Li, drawn from a probability distribution; when failed neighbors are encountered due to log-off, nodes employ neighbor-recovery algorithms to replace the failed neighbors. However, neighbor-recovery is not instantaneous. In the lifetime model, the time required for repairing the failed links is denoted as the search-time S, which is assumed to be strictly positive. Furthermore, a new node will spend an amount of time equal to S to find all of its required neighbors upon joining the P2P network.
To better characterize node turnover in structured P2P systems, we develop a tool to accurately estimate the uniform failure probability q in a continuously evolving P2P system: using the lifetime-based assumptions, the dynamic churn model for a P2P system that has evolved long enough for renewal theory to hold is reducible to a q-percent uniform failure model. Intuitively, for a large dynamic P2P network in the steady state, there always exist nodes that have just logged off with their affected neighbors still trying to replace them; as a result, there is a steady fraction of nodes in a failed state at any given time instant. In this work, we will show that the steady-state probability of finding a given node in the failed state is q = E[S] / (E[L] + E[S]), where E[L] and E[S] are the mean lifetime and search-time, respectively.
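The steady-state argument can be checked numerically with a small simulation. The sketch below treats one neighbor slot as an alternating renewal process: it is alive for a random lifetime, then failed for a random search-time until a replacement is found. The exponential distributions, the parameter values, and the function name failed_fraction are assumptions made only for this illustration. The comparison value E[S] / (E[L] + E[S]) is the standard alternating-renewal result for the long-run fraction of time spent in the failed state, and it depends only on the two means.

```python
import random


def failed_fraction(mean_lifetime, mean_search, cycles, seed=0):
    """Estimate the long-run fraction of time a neighbor slot is failed.

    Each cycle is one alive period (lifetime L) followed by one failed
    period (search-time S). Both are drawn from exponential distributions
    here, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    alive_time = 0.0
    failed_time = 0.0
    for _ in range(cycles):
        alive_time += rng.expovariate(1.0 / mean_lifetime)
        failed_time += rng.expovariate(1.0 / mean_search)
    return failed_time / (alive_time + failed_time)


if __name__ == "__main__":
    mean_l, mean_s = 30.0, 3.0  # hypothetical values in arbitrary time units
    simulated = failed_fraction(mean_l, mean_s, cycles=100_000)
    predicted = mean_s / (mean_l + mean_s)
    print(f"simulated q = {simulated:.4f}, predicted q = {predicted:.4f}")
```

Swapping in another lifetime distribution with the same mean should leave the estimate essentially unchanged, which is what allows the churn model to be summarized by a single failure probability q.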
In short, the methods and tools developed in this work are intended to help system designers assess the routing performance of a deployed P2P system under different user lifetime characteristics. The rest of this paper is organized as follows. In Section 2, we discuss previous work on the resilience of P2P systems. In Section 3, we demonstrate how the dynamic churn model can be reduced to the uniform failure model. In Section 4, we present the reachable component method (RCM). In Section 5, we apply RCM to six DHT topologies and derive analytical expressions for each topology’s routability. In Section 6, we specify our simulation setup and compare our analytical results against large-scale simulation results. In Section 7, we discuss how RCM remains applicable under different modifications of the routing algorithm, such as allowing backtracking. In Section 8, we give our concluding remarks.
Section snippets
Related work
Network resilience in P2P systems has become an important research topic [15]. Gummadi et al. [1] showed through simulation results that the routing topology of a DHT has a large effect on the network’s static resilience to random failures. In addition to simulation studies, theory work has been done to predict the performance of DHT systems under a static failure model [6], [9], [10], [11]. In particular, Wang et al. used Markov chains to model the routing process and calculated the “hit
Link between the Churn Model and the Uniform Failure Model
In the lifetime model, each joining node stays in the network for the lifetime L, then fails [12], [13]. After a node’s neighbor has departed, the node spends the search-time S looking for a replacement neighbor. By the same token, a newly joining node spends S amount of time searching for required connections.
In the lifetime model [13], to prevent the network size from decreasing to zero, each failed node is assumed to be immediately replaced by a new node with a randomized nodeID
Reachable component method
Since a dynamic lifetime-based churn model can be reduced to a uniform node failure model by invoking Lemma 1 (see Section 3), we begin our work by demonstrating the reachable component method under the uniform failure model.
Application of RCM on DHT routing topologies
Using the reachable component method, the analytical expressions for the routability of a DHT can be derived. In the derivations, finding the expression for p(h, q) through the state diagram is key. The analytical expressions derived in this section are verified against simulation results in Section 6. Furthermore, we will use the notation G(i, j) to denote the probability that, starting at state i, the routing process ever visits state j.
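As a hedged illustration of how p(h, q) feeds into a routability expression, the sketch below works through an idealized hypercube with greedy bit-fixing and no backtracking. It is not the derivation from this paper: the geometry, the independence of node failures, the symbol n(h) for the number of nodes at h hops from the root, and the function names are all assumptions of this example. With m hops remaining there are m neighbors that make progress, so the step fails with probability q^m; at the last hop the only progress neighbor is the destination, which is assumed to have survived.

```python
from math import comb


def p_success(h, q):
    """p(h, q) for the idealized hypercube: probability that greedy routing
    reaches a surviving destination h hops away when every other node has
    failed independently with probability q."""
    prob = 1.0
    for m in range(h, 1, -1):  # m = hops remaining before this step
        prob *= 1.0 - q ** m   # at least one of the m progress neighbors survives
    return prob


def rcm_routability_hypercube(dim, q):
    """Expected fraction of surviving destinations a surviving root can reach.

    n(h) = C(dim, h) nodes lie h hops from the root. The survival factor
    (1 - q) of the destination appears in both the numerator and the
    denominator, so it cancels.
    """
    n_nodes = 2 ** dim
    reachable = sum(comb(dim, h) * p_success(h, q) for h in range(1, dim + 1))
    return reachable / (n_nodes - 1)


if __name__ == "__main__":
    for q in (0.0, 0.1, 0.3, 0.5):
        print(q, round(rcm_routability_hypercube(16, q), 4))
```

The same pattern, a sum over hop distances of n(h) times p(h, q) normalized by the number of other nodes, carries over to other geometries once their own n(h) and p(h, q) are known.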
Simulation setup: uniform failure and churn
The analytical routability results derived in the previous section will be rigorously compared to simulation results in this section for different network sizes (2^16, 2^18 and 2^20 nodes). For each DHT system, we will first construct the connection topology and implement the routing algorithm. For each node failure probability q, we then perform the following: we randomly pick a root node and count the number of nodes in the system that the root node can reach by following the DHT’s
Improvement techniques
The reachable component method developed in this work is flexible and can be adapted and applied to estimating the performance of P2P networks under a number of different routing algorithms. So far in this work, we have used greedy routing to find the size of the reachable component under failure and churn. It is conceivable that we can improve a system’s routability under churn and failure by introducing modifications to the routing algorithms or to the system’s topology.
Concluding remarks
Using the lifetime model and the widely-adopted steady state network size assumption, we showed that a dynamic churn model for a P2P network that has reached stationarity is reducible to a uniform failure model in the steady state. Furthermore, we present the reachable component method (RCM), an analytical framework for characterizing structured P2P routing performance under random failures and churn. The method’s efficacy is demonstrated through an analysis of the routability of several
References (32)
- et al., D2B: a de Bruijn based content-addressable network, Theoretical Computer Science (2006)
- et al., The impact of DHT routing geometry on resilience and proximity
- G.S. Manku, M. Bawa, P. Raghavan, Symphony: distributed hashing in a small world, in: Proceedings of Fourth USENIX...
- et al., Kademlia: a peer-to-peer information system based on the XOR metric
- et al., Chord: a scalable peer-to-peer lookup protocol for Internet applications, IEEE/ACM Transactions on Networking (2003)
- et al., A scalable content-addressable network
- et al., Graph-theoretic analysis of structured peer-to-peer systems: routing distances and fault resilience
- F. Kaashoek, D.R. Karger, Koorde: A simple degree-optimal hash table, in: Proceedings of the Second International...
- O. Angel, I. Benjamini, E. Ofek, U. Wieder, Routing complexity of faulty networks, in: 24th ACM Symposium on Principles...
- et al., Failure recovery for structured P2P networks: protocol design and performance evaluation, SIGMETRICS Performance Evaluation Review (2004)
- On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks
- Analysis of the evolution of peer-to-peer systems