Graph connectivity in log steps using label propagation

The fastest deterministic algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees, which requires pointer-chasing operations that increase memory access latency and are limited to shared-memory systems. Many of these PRAM algorithms are also very complicated to implement. Another popular method is "leader-contraction" where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability. Instead we investigate label propagation because it is deterministic and does not rely on pointer-chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps. We are able to solve the problems with label propagation for graph connectivity. We introduce a surprisingly simple framework for deterministic graph connectivity using label propagation that is easily adaptable to many computational models. It propagates directed edges and alternates edge direction to achieve linear edge count each step and sublinear convergence. We present new algorithms in PRAM, Stream, and MapReduce for a simple, undirected graph $G=(V,E)$ with $n=|V|$ vertices, $m=|E|$ edges. Our approach takes $O(m)$ work each step, but we can only prove logarithmic convergence on a path graph. It was conjectured by Liu and Tarjan (2019) to take $O(\log n)$ steps or possibly $O(\log^2 n)$ steps. We leave the proof of convergence as an open problem.


Introduction
Given a simple, undirected graph G = (V, E) with n = |V| vertices and m = |E| edges, the connected components of G partition V so that every pair of vertices in the same part is connected by a path, which is a sequence of adjacent edges in E. If two vertices are not connected then they are in different components. The distance d(v, u) is the shortest-path length between vertices v and u. The diameter D is the maximum distance in G. We wish to find the connected components of G in O(log n) steps using simple, deterministic methods that are adaptable to many computational models.
The fastest, deterministic parallel (NC) algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees [22,13,3,42], which requires pointer-chasing operations that increase memory access latency. Pointer jumping was a primary source of slowdown in a parallel minimum spanning tree algorithm [12]. The PRAM implementations are also limited to shared-memory systems [19,34,4]. Another popular method is "leader-contraction" where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability [2,26,36,23]. Not only is this method randomized but it can require adding many more edges to the graph. Instead we investigate label propagation because it is deterministic, easy to implement, and does not rely on pointer-chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps [37,39,41]. Adding and removing edges must be carefully managed to keep the edge count, and therefore the work, linear in each step. We are able to overcome the problems with label propagation for graph connectivity.
We introduce a surprisingly simple framework for deterministic, undirected graph connectivity using label propagation that is easily adaptable to many computational models. It achieves logarithmic convergence independently of the number of processors and without increasing the edge count. We employ a novel method of propagating directed edges in alternating direction while performing minimum reduction on vertex labels. We believe our solution to obtaining sublinear convergence and near optimal work for connected components is one of the simplest to date. Moreover, our experiments demonstrate fast convergence.
In this paper we say that a (v, u) edge is directed from v to u and (·, ·) denotes an ordered pair that distinguishes (v, u) from (u, v). We call the counter-oriented (v, u), (u, v) edges the twins of a conjugate pair. An undirected {u, v} edge in G is then comprised of these twins. To propagate a label w from edge (v, u) we create just the (u, w) twin. In the next step we reverse the direction to return the opposite twin, (w, u), if w is the minimum label for u. We'll call these two edge operations label propagation and symmetrization, respectively. Concomitant with label propagation is a min update on u's minimum label, which may or may not change. Contraction of the graph is due to the label propagation operation, and symmetrization ensures that a vertex and its minimum label are able to exchange new minimum labels. Thus every vertex in each step propagates and retains its current minimum label. Let l(v) be the current minimum label for v. Then for an edge (v, u) we get (u, l(v)) or (u, v) in a single step, due to either label propagation or symmetrization, respectively. Each edge is replaced by a new edge and hence the method maintains a stable edge count. The essential operations are summarized as follows.
• For every (v, u) edge if u is not l(v) then min update and label propagation, else symmetrization, repeating until no label changes.
Figure 1 illustrates our method where the starting minimum label for each vertex is the lowest vertex ID among its neighbors and itself, and eventually the graph is transformed into a star whose root is the component label. This is a practical and very simple technique for large-scale streaming and parallel graph connectivity. We present new algorithms in PRAM, Stream, and MapReduce. Our approach takes O(m) work each step, but we can only prove logarithmic convergence on a path graph. Despite the simplicity of our algorithm, the proof of logarithmic convergence is elusive and poses a rather interesting challenge. We conjecture that our algorithm takes O(log n) steps to converge. Our algorithm behaves well and empirically takes O(log n) steps on a range of difficult graphs. In 2019 Liu and Tarjan conjectured that an earlier version of our algorithm takes O(log n) steps or possibly O(log^2 n) steps [29]. We leave the proof of convergence as an open problem.
Figure 1: A path graph converges in three steps. After each step the output graph becomes the input for the next step. Undirected edges denote a pair of counter-oriented edges.

Our contribution
We introduce a simple, deterministic label propagation method for undirected graph connectivity. Our approach propagates directed edges in alternating direction to achieve fast convergence independently of the number of processors while also maintaining O(m) work each step. We present new algorithms in PRAM, Stream, and MapReduce. We will silently use the standard notation for asymptotic bounds to provide a familiar basis for comparison, but the reader should keep in mind that our bounds that depend on the convergence are conjecture only. If our conjecture of O(log n) convergence is true, then our label propagation algorithm on a Concurrent Read Concurrent Write (CRCW) PRAM achieves O(log n) time and O(m log n) work with O(m) processors. On an Exclusive Read Exclusive Write (EREW) PRAM it takes O(log^2 n) time and O(m log^2 n) work. In contrast, the fastest deterministic CRCW graph connectivity algorithms take O(log n) time and O((m + n) · α(m, n)) work using O((m + n) · α(m, n)/ log n) processors [22,13], where α(m, n) is the inverse Ackermann function. The best-known NC EREW algorithm [11] takes O(log n) time and O((m + n) log n) work with O(m + n) processors. Although our results are slower than those for the fastest deterministic CRCW and EREW algorithms, our method is much simpler and easier to implement. We also give an efficient Stream-Sort algorithm that takes O(log n) passes and O(log n) memory, and a MapReduce algorithm taking O(log n) rounds and O(m log n) communication overall. These would be the first deterministic O(log n)-step graph connectivity algorithms in the Stream and MapReduce models. For the purposes of this discussion we will assume O(log n) convergence holds for our algorithm. With that assumption, refer to Table 1 for a summary of these results and comparison to the current state-of-the-art.
The computational models we explored are briefly described in Section 3. We refer the reader to [10,33,31,28] for more complete descriptions. A survey of related work is given in Section 4. Then in Section 5 we introduce our principal algorithm which establishes the framework behind our technique. This leads to our PRAM results in Section 6. In Section 7 the framework is extended for Stream-Sort and MapReduce models, which introduces the subject of label duplication in Section 8 where we identify when pathological duplication of labels arises and how to address it. We give our Stream-Sort and MapReduce algorithms in Sections 9 and 10. Finally we briefly describe a parallel implementation of our principal algorithm in Section 11 followed by empirical results in Section 12.

Computational models
In a PRAM [18] each processor can access any global memory location in unit time. Processors can read from global memory, perform a computation, and write a result to global memory in a single clock cycle. All processors execute these instructions at the same time. A read or write to a memory location is restricted to one processor at a time in an Exclusive Read Exclusive Write (EREW) PRAM. Writes to a memory location are restricted to one processor at a time in a Concurrent Read Exclusive Write (CREW) PRAM. A Concurrent Read Concurrent Write (CRCW) PRAM permits concurrent read and write to a memory location by any number of processors, where concurrent writes to the same memory location are handled by a resolution protocol. A Combining Write Resolution uses an associative operator to combine all values in a single instruction. A Combining CRCW employs this to store a reduction of the values, such as the minimum, in constant time [40]. The Stream model [32,21] focuses on the trade-off between working memory space s and number of passes p over the input stream, allowing the computational time to be unbounded. In W-Stream [38] an algorithm can write to the stream for subsequent passes and in Stream-Sort [1] the input or intermediate output stream can also be sorted. In both W-Stream and Stream-Sort the output streams become the input stream in the next pass. In Stream-Sort the streaming and sorting passes alternate so a Stream-Sort algorithm reads an input stream, computing on the items in the stream, while writing to an intermediate output stream that gets reordered for free by a subsequent sorting pass. The streams are bounded by the starting problem size. An algorithm in Stream-Sort is efficient if it takes polylogarithmic passes and memory.
The MapReduce model [24,20,35] appeared some years after the programming paradigm was popularized by Google [14]. The model employs the functions map and reduce, executed in sequence. The input is a set of key, value pairs that are "mapped" by instances of the map function into a multiset of key, value pairs. The map output pairs are "reduced" and also returned as a multiset of key, value pairs by instances of the reduce function. A single reduce instance gets all values associated with a key. A round of computation is a single sequence of map and reduce executions where there can be many instances of map and reduce functions. Each map or reduce function can complete in polynomial time for input n. Each map or reduce instance is limited to O(n^{1−ε}) memory for a constant ε > 0, and an algorithm is allowed O(n^{2−2ε}) total memory. The number of machines/processors is bounded to O(n^{1−ε}), but each machine can run more than one instance of a map or reduce function.
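To make the round structure concrete, the following is a minimal sequential sketch of one MapReduce round. It is an illustration of the model only, not one of the algorithms in this paper. The names run_round, emit_twins, and closed_neighborhood_min are hypothetical, and the memory and machine limits of the model are not enforced. The example round computes the starting minimum label of each vertex (the minimum of its closed neighborhood) on a 4-vertex path.

```python
from collections import defaultdict


def run_round(pairs, map_fn, reduce_fn):
    """Simulate one round sequentially: map every pair, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    # One reduce instance receives all values associated with a key.
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


def emit_twins(v, u):
    """Hypothetical map function: emit both orientations of an edge."""
    return [(v, u), (u, v)]


def closed_neighborhood_min(v, neighbors):
    """Hypothetical reduce function: minimum of v and its neighbors."""
    return [(v, min([v] + neighbors))]


path_edges = [(1, 2), (2, 3), (3, 4)]
min_labels = run_round(path_edges, emit_twins, closed_neighborhood_min)
print(min_labels)  # [(1, 1), (2, 1), (3, 2), (4, 3)]
```

Grouping by key before reducing mirrors the rule that a single reduce instance gets all values for a key, which is why a giant component handled under one key would be processed sequentially.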

Related work
The famous 1982 algorithm by Shiloach and Vishkin [42] takes O(log n) time using O(m + n) processors on a CRCW PRAM, performing O((m+n) log n) work overall. In 1991 Cole and Vishkin improved the result to O(log n) time and O((m+n)·α(m, n)) work using O((m+n)·α(m, n)/ log n) processors [13], although the asymptotic bound hides a large constant. The constant was reduced by Iwama and Kambayashi in 1994 [22]. But these latter CRCW algorithms are very complicated and difficult to translate to other computational models because of the pointer operations. Our algorithm takes O(log n) time using O(m) processors, albeit on a more powerful CRCW. But it is more amenable to other models because it does not rely on pointer-jumping.
The fastest deterministic EREW algorithm takes O(log n) time using O(m + n) processors and is due to the 2001 work by Chong, Han, and Lam [11]. This algorithm relies on carefully merging adjacency lists, and improves the earlier 1995 result by Chong and Lam, which took O(log n log log n) time. These EREW algorithms require parallel sorting and pointer jumping. Our EREW algorithm is slower, taking O(log^2 n) time and O(m log^2 n) work using O(m) processors, but doesn't rely on pointer jumping or sorting and is far simpler to implement.
The best known deterministic Stream algorithm for connected components is given by Demetrescu, Finocchi, and Ribichini [15], taking O((n log n)/s) passes and s working memory size in W-Stream. Their algorithm can only achieve O(log n) passes using s = O(n) memory. A randomized s-t-connectivity algorithm by Aggarwal et al. [1] takes O(log n) passes and memory in Stream-Sort. It can be modified to compute connected components with the same bounds, but requires sorting in three of four steps in each pass [33]. In contrast, our Stream-Sort connected components algorithm is deterministic and takes O(log n) passes and O(log n) memory. It requires only one sorting step per pass and is straightforward to implement.
A randomized MapReduce algorithm by Rastogi et al. [36] was one of the first to show promise of fast convergence for connected components. But it was later shown in [2] that their Hash-to-Min algorithm [36] takes Ω(log n) rounds. It uses a single task to send an entire component to another, which for a giant component will effectively serialize the communication. In 2014 Kiveris et al. [26] introduced their Two-Phase algorithm, which takes O(log^2 n) rounds and O(m log^2 n) communication overall. Unlike the Hash-to-Min algorithm it avoids sending the giant component to a single reduce task.
We introduce a new MapReduce algorithm that is comparable to Two-Phase while being deterministic and simple to implement. Like Two-Phase our algorithm does not load components into memory or send an entire component through a single communication channel. We go further in memory conservation by maintaining O(1)-space working memory. Our MapReduce algorithm completes in O(log n) rounds using O(m log n) communication, thereby improving the state-of-the-art by an Ω(log n) factor in both convergence and communication.
Although we do not study the MPC model [5,6] in this paper, we want to highlight recent breakthrough work in this model. The MPC model is a generalization of MapReduce and other Bulk Synchronous Parallel (BSP) style models. The MPC model is more powerful than MapReduce; a MapReduce algorithm can be simulated in MPC with the same runtime. In 2018 Andoni et al. [2] gave a randomized O(log D · log log_{m/n} n) round algorithm in MPC for connected components. Their algorithm uses the leader-contraction method that works by selecting a small fraction of leader vertices while maintaining high probability that non-leader vertices are adjacent. To achieve this the authors add edges so the graph has uniformly large degree, but to avoid Ω(n^3) communication cost they carefully manage how edges are added. This result was later improved in 2019 by Behnezhad et al. [7], who gave an O(log D + log log_{m/n} n)-round, randomized algorithm in MPC. More recently, Liu et al. [30] gave a randomized CRCW PRAM algorithm based on the work of Andoni et al. [2] and Behnezhad et al. [7], taking O(log D + log log_{m/n} n) time using O(m) processors. These results use randomization and are not simple to implement, which conflicts with our motivation.
The work most closely related to ours is that of Liu and Tarjan [29] who gave a family of label propagation algorithms taking between O(log n) and O(log^2 n) steps. Similar to our approach they propagate labels by minimum reduction at each step, which creates a directed graph. But in their approach they maintain acyclicity and hence their algorithms produce a forest of trees. In contrast, our algorithm does not require maintaining trees as evident in Figure 1.
Algorithm 1
L_k ⊲ arrays for l(v) at each step k
Initialize L_1 with all starting l(v)
for k = 1, 2, . . . until labels converge do
…
They also perform a short-cutting operation in which the parent of every vertex is replaced with its grandparent, whereas we only exchange labels between edge endpoints. They had analyzed an earlier version of our algorithm and conjectured it takes polylog steps to converge. Our current algorithm in this paper [8] is simpler than our previous version and those in [29]. Moreover, it has the added benefit that the work per step is easily shown to take no more space than the input graph.

Principal algorithm
We begin with our principal algorithm to establish the framework and core principles. The essential operations, as succinctly summarized in the introduction, are simply for every (v, u) edge if u is not l(v) then min update and label propagation, else symmetrization, repeating until no label changes.
Here Algorithm 1 describes our method in full. We don't specify any model now so we can focus on the basic procedures. It should be noted that this principal algorithm achieves linear work and fast convergence in both sequential and parallel settings. We will use N_k(v) to denote the neighborhood of a vertex v at step k and N^+_k(v) = {v} ∪ N_k(v) as the closed neighborhood. Then let l(v) = min(N^+_k(v)) be the current minimum label for v. We use this l(v) notation without a step subscript for simplicity. In our algorithm listings we use arrays L_k in its place, e.g. L_k[v] holds l(v) for v at step k. Only two such arrays are needed in each step. Before the algorithm starts L_1 is initialized with the l(v). For all algorithms we use E_k to denote the edges that will be processed at step k, but E_k is a multiset because it may contain duplicates.
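We employ two edge operations, label propagation and symmetrization, defined as follows.
Label propagation: replace a (v, u) edge with (u, l(v)) if u ≠ l(v).
Symmetrization: replace a (v, u) edge with (u, v) if u = l(v).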
Since G does not contain loop edges then these operations cannot create loops, otherwise it would contradict the minimum value. Thus u = l(v) implies v ≠ l(v). Label propagation is primarily responsible for path contraction. Minimum label updates are concomitant with label propagation. Symmetrization keeps an edge between a vertex and its minimum label. Thus for an edge (v, u) in one step there will be either (u, v) or (u, l(v)) in the next step. In Algorithm 1 each edge is replaced by one new edge, either by label propagation or symmetrization, so the edge count does not increase. The algorithm is illustrated in Figure 1. Notice the (2, 1) edge is duplicated in the second step. Also observe that the (4, 2) edge in the second step produces (2, 1) for the third step because L_2[4] = 1.
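The two operations can be sketched sequentially as follows. This is a minimal illustration reconstructed from the prose description, not the paper's listing of Algorithm 1. The function name connected_components and the stopping rule (stop once every edge joins two equally labeled vertices) are assumptions of this sketch, and the example graph is a sequentially labeled 4-path rather than the graph of Figure 1.

```python
def connected_components(undirected_edges, max_steps=10_000):
    """Sketch of label propagation with alternating edge direction.

    Returns (labels, edges, steps). The stopping rule used here is an
    assumption of this sketch, not necessarily the paper's.
    """
    # E_1 holds both twins of every undirected edge.
    edges = []
    for a, b in undirected_edges:
        edges.append((a, b))
        edges.append((b, a))
    # L_1[v] is the minimum of the closed neighborhood of v.
    label = {}
    for v, u in edges:
        label[v] = min(label.get(v, v), u)
    steps = 0
    # Assumed stopping rule: every edge joins two equally labeled vertices.
    while any(label[v] != label[u] for v, u in edges):
        if steps == max_steps:
            raise RuntimeError("did not converge within max_steps")
        next_label = dict(label)
        next_edges = []
        for v, u in edges:
            if u != label[v]:
                # Label propagation with a min update on u.
                next_label[u] = min(next_label[u], label[v])
                next_edges.append((u, label[v]))
            else:
                # Symmetrization: return the opposite twin.
                next_edges.append((u, v))
        label, edges = next_label, next_edges
        steps += 1
    return label, edges, steps


labels, final_edges, steps = connected_components([(1, 2), (2, 3), (3, 4)])
print(labels, steps)  # {1: 1, 2: 1, 3: 1, 4: 1} 2
```

Reading from the current labels and writing to next_label keeps the two-array discipline described above, so the result does not depend on the order in which edges are visited. Under the assumed stopping rule the sketch halts as soon as labels are uniform, which can be before the edges form a star.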
We remark that Algorithm 1 can be terminated a number of ways without affecting the asymptotic bounds. For example, we can detect when labels no longer change by simply keeping a counter for the label propagation branch. At each step the counter is set to zero. If the minimum label l(v) for a vertex v is not v itself, the counter is incremented. Now observe that in the final star graph, only the root of the star can fall into the label propagation branch but since the root is the minimum label, the counter cannot be updated. This simple check and update takes O(1) time for each label propagation operation and is therefore free. We use this approach in our implementation of Algorithm 1 described in Section 11.
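The counter-based termination check can be sketched as below. This is an illustrative reading of the description above, with the hypothetical function name run_with_counter. It is run on a sequentially labeled 4-path, which is an assumption of this example and not the graph of Figure 1.

```python
def run_with_counter(undirected_edges, max_steps=1000):
    """Run the edge operations until the per-step counter stays at zero.

    The counter is reset each step and incremented in the label propagation
    branch only when l(v) is not v itself. Returns (labels, counter_trace).
    """
    edges = [(a, b) for a, b in undirected_edges]
    edges += [(b, a) for a, b in undirected_edges]
    label = {}
    for v, u in edges:
        label[v] = min(label.get(v, v), u)
    trace = []
    for _ in range(max_steps):
        counter = 0
        next_label = dict(label)
        next_edges = []
        for v, u in edges:
            if u != label[v]:
                if label[v] != v:
                    counter += 1  # O(1) check inside label propagation
                next_label[u] = min(next_label[u], label[v])
                next_edges.append((u, label[v]))
            else:
                next_edges.append((u, v))
        label, edges = next_label, next_edges
        trace.append(counter)
        if counter == 0:
            break
    return label, trace


labels, trace = run_with_counter([(1, 2), (2, 3), (3, 4)])
print(labels, trace)  # {1: 1, 2: 1, 3: 1, 4: 1} [2, 2, 1, 0]
```

On this path the counter reaches zero on the fourth step, one step after the edges have become a star, so the check costs one extra pass over the edges.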
In G an undirected edge {u, v} is comprised of counter-oriented twins (v, u), (u, v), thus there are 2m edges in total and G is symmetric. As stated in the introduction, we are careful to create just one twin. Thus at each step the graph may be directed. By propagating single directed edges and alternating the direction of the edge with a minimum label, we are able to limit the work in each step to O(m) edges while also maintaining overall connectivity. Say for a (v, u) edge that u is the minimum for itself and v. This (v, u) becomes (u, v) by symmetrization and then (v, u) again by label propagation, cycling until the algorithm ends or a new minimum is acquired. If at some later step either endpoint gets a new minimum, then that endpoint can propagate it to the other, thereby replacing their edge which prohibits retaining an obsolete minimum label. In this case, u cannot be passed to v again because there will be no edge to v.
Proof. First we will demonstrate that given (v, u) in a step then v, u remain connected in the next step because either (u, v) or (u, l(v)) is created. In the former case the connection is obvious. In the latter case v, u are connected through l(v) because either (v, l(v)) or (l(v), v) will be simultaneously created with (u, l(v)).
In the first case, (u, v) can be created as follows.
If v = l(v) then (u, v) is created by label propagation. If u = l(v) then (u, v) is created by symmetrization.
In the second case (u, l(v)) is created by label propagation, and simultaneously either (v, l(v)) or (l(v), v) will also be created as follows.
In the current step there must exist (v, l(v)) or (l(v), v) due respectively to label propagation or symmetrization from the previous step. Then an edge between v, l(v) is created by the following.
If (v, l(v)) exists then (l(v), v) is created by symmetrization. If (l(v), v) exists then (v, l(v)) is created by label propagation.
We have now established from these two cases that v and u remain connected after one step. By inductively applying this to every new edge, it follows that a connection between v and u is preserved in all subsequent steps.
Proof. Let v be the minimum label of a component and so the minimum label for v is itself. Let v′ denote any vertex that is not v but has v as its minimum label. Only the edges (v, u) and (u, v) need to be analyzed because all other edges will follow.
Any (v, u) is replaced with (u, v) by label propagation and so u gets v as its minimum label. Now u is a new v′ and will pass v by label propagation. Thus any subsequent (v′, u) is the same case as (v, u). Any (u, v) where l(u) is not v is replaced with (v, l(u)) by label propagation. Observe that this (v, l(u)) is the same case as (v, u). Moreover, any (u, v′) where l(u) is not v is the same case as (u, v), which subsequently leads to the case of (v, u).
Claim 1 establishes that if (v, u) exists at some step, then v, u are connected for all remaining steps. It follows that a component remains connected at each step. Then by induction on (v, u), (u, v) cases every vertex in a component gets the same representative label for that component.
Proof. Any new edge is a replacement of an existing (v, u) edge as follows.
If u ≠ l(v) then (v, u) is replaced with (u, l(v)) by label propagation. If u = l(v) then (v, u) is replaced with (u, v) by symmetrization.
Since there are 2m edges in G then there are 2m edges in the first step and by induction there are 2m edges in every step.
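The stable edge count can be checked directly with a small sketch. The helper names step and edge_counts are hypothetical, and the update rule is the one described in the text (one replacement edge per existing edge).

```python
def step(edges, label):
    """One step: each (v, u) is replaced by exactly one new edge."""
    next_label = dict(label)
    next_edges = []
    for v, u in edges:
        if u != label[v]:
            next_label[u] = min(next_label[u], label[v])
            next_edges.append((u, label[v]))  # label propagation
        else:
            next_edges.append((u, v))  # symmetrization
    return next_edges, next_label


def edge_counts(undirected_edges, steps):
    """Return |E_k| for the starting multiset and each of `steps` later steps."""
    edges = [(a, b) for a, b in undirected_edges]
    edges += [(b, a) for a, b in undirected_edges]
    label = {}
    for v, u in edges:
        label[v] = min(label.get(v, v), u)
    counts = [len(edges)]
    for _ in range(steps):
        edges, label = step(edges, label)
        counts.append(len(edges))
    return counts


print(edge_counts([(1, 2), (2, 3), (3, 4)], 5))  # [6, 6, 6, 6, 6, 6]
```

With m = 3 undirected edges the multiset holds 2m = 6 directed edges at every step, including steps taken after the labels have converged.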
This last result has significant practical benefit because it ensures that the read/write costs remain linear with respect to the edges at each step, otherwise on very large graphs these costs can be the bottleneck with respect to runtime performance.
We have shown that our algorithm is very simple and takes O(m) work each step, making it appealing for practical applications. This is demonstrated in Section 12 where our algorithm empirically converges in O(log n) steps on a variety of graphs. Despite the simplicity of our algorithm, the convergence in general is difficult to analyze. Hence we can only conjecture the following and leave the proof as a challenging open problem.
Conjecture 1. Algorithm 1 converges in O(log n) steps.
Instead, we can show the convergence on path graphs, which might provide insight to a more general proof. Observe that only minimum labels are propagated and a minimum for a vertex can only be replaced with a lesser label. At each step, minimum labels are exchanged between the endpoints of each (v, u) edge through label propagation, and symmetrization maintains the edge between a vertex and its minimum. Thus each length-two path is shortened by label propagation, and symmetrization ensures that a minimum label vertex and its subordinate can both get a new minimum from one or the other in a later step.
An interesting case is label convergence on a sequentially labeled path. On such a path, label convergence follows a Fibonacci sequence. A naïve label propagation algorithm will duplicate labels leading to a progressive increase in the number of edges. We give the following results for our algorithm. An interested reader can find the proofs in Appendix A.
Lemma 3. Algorithm 1 on a sequentially labeled path propagates labels in Fibonacci sequence, specifically at each step k the label difference […]
The label updates follow a Fibonacci sequence so the next statement holds. It is intuitive that the convergence of Algorithm 1 on any path doesn't take asymptotically more steps than on a sequentially labeled path.
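The Fibonacci behavior can be observed with a short simulation. This is a sketch under stated assumptions: the helper names step and label_history are hypothetical, the path is labeled 1 through 100, and the vertex 50 is chosen in the interior so that the path endpoints cannot influence it within the few steps shown.

```python
def step(edges, label):
    """One step of label propagation and symmetrization."""
    next_label = dict(label)
    next_edges = []
    for v, u in edges:
        if u != label[v]:
            next_label[u] = min(next_label[u], label[v])
            next_edges.append((u, label[v]))
        else:
            next_edges.append((u, v))
    return next_edges, next_label


def label_history(n, vertex, steps):
    """Minimum label of `vertex` on the path 1, 2, ..., n before each step."""
    edges = []
    for v in range(1, n):
        edges.append((v, v + 1))
        edges.append((v + 1, v))
    # Starting label: minimum of the closed neighborhood.
    label = {v: max(1, v - 1) for v in range(1, n + 1)}
    history = [label[vertex]]
    for _ in range(steps):
        edges, label = step(edges, label)
        history.append(label[vertex])
    return history


history = label_history(100, 50, 4)
gaps = [50 - x for x in history]
print(history)  # [49, 48, 47, 45, 42]
print(gaps)     # [1, 2, 3, 5, 8]
```

The gap between the vertex and its minimum label grows as 1, 2, 3, 5, 8, each value being the sum of the previous two.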
Claim 2. Algorithm 1 on any 4-path converges in at most three steps.
Proof. Observe that every vertex in a 4-path is within a distance of three from any other vertex in the path. Now recall that a vertex can only get a new minimum label by label propagation and label propagation contracts a length-two path. Let C(G) denote the minimum labeled vertex of the path and is therefore the component label.
It isn't difficult to see that if C(G) is not an endpoint of the 4-path then the algorithm converges to a star in two steps. This is because all vertices are a distance of two from C(G) and will be connected to it in the first step. Then it takes one more step to break the non-tree edge from the endpoint that originally was not connected to C(G).
If C(G) is an endpoint of the 4-path then the other endpoint does not get an edge to C(G) in the first step because it is a distance of three from C(G). But this other endpoint will be connected to some other vertex that is connected to C(G) and therefore gets C(G) in the next step. Since all other vertices get C(G) in the first step, then the total number of steps for convergence is three. Thus the claim holds.
Claim 3. Given two stars rooted by their respective minimum labels and then connected by the roots, it takes Algorithm 1 at most three steps to converge.
Proof. Let L, R be the respective roots of the left and right stars, and each is the minimum label for its star. Without loss of generality, let L be less than R. Now suppose the stars are connected by either an (L, R) or (R, L) edge.
If it is an (R, L) edge, then symmetrization creates (L, R) and label propagation passes L to each leaf node of R. But also symmetrization can create (R, u) edges from any (u, R) edges. Then a subsequent step replaces these (R, u) with (u, L) edges to complete the final rooted star. This takes two steps in total.
If it is an (L, R) edge, then label propagation creates (R, L). It then follows the steps for the previous case and hence takes three steps in total. Thus the claim holds.
Proof. Observe that an 8-path is just two 4-paths connected end-to-end. Let L, R be the smallest labels respectively for the two subpaths.
One of these labels will be the component label. Suppose the component label is in the middle of the path, and therefore is the endpoint of one 4-path that is adjacent to the endpoint of the other 4-path. The algorithm will converge both 4-paths simultaneously in the same number of steps as a single independent 4-path. This is because minimum labels are stored externally in an array and it is the minimum label from this array that is propagated. Since both 4-paths have an endpoint whose minimum label is the component minimum, then both converge in the same number of steps.
But we are interested in the worst-case. Suppose that L, R are at opposite ends of the path, one of which is the component label. The algorithm simultaneously converts each to a star rooted respectively at L and R, taking at most three steps by Claim 2.
The graph remains connected in concordance with Claim 1, and because of label propagation there will be an edge connecting L and R. It follows from Claim 3 that it takes at most three more steps to get the final star.
Likewise, doubling an 8-path yields a 16-path. Then this takes at most three more steps to converge than the 8-path.
A path of n = 4 · 2^k size can be generated by doubling k times. Each doubling takes at most three steps to merge connected stars according to Claim 3. Thus it takes at most 3k extra steps for all the intermediate merges for an n-path. It follows from n = 2^{k+2} that k ≤ log n. Then by Claim 2 it takes at most three steps to merge all concatenated 4-paths and 3 log n steps to merge intermediate stars. This leads to a total of 3 + 3 log n = O(log n) steps to converge.
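As a worked instance of the counting argument above, taking logarithms base 2 (an assumption of this illustration), consider a path with n = 1024 vertices:

```latex
\begin{align*}
n &= 4 \cdot 2^{k} = 2^{k+2} \;\Longrightarrow\; k = \log_2 n - 2 \le \log_2 n,\\
\text{steps} &\le 3 + 3k \le 3 + 3\log_2 n,\\
n = 1024:\quad k &= 8, \qquad \text{steps} \le 3 + 3 \cdot 8 = 27 \le 3 + 3 \cdot 10 = 33.
\end{align*}
```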
For the remainder of this paper we will assume Conjecture 1 is true.

PRAM algorithm
Our Algorithm 1 maps naturally to a PRAM. We will use the following semantics for our parallel algorithms. All statements are executed sequentially from top to bottom but all operations contained within a for all construct are performed concurrently. All other statements outside this construct are sequential. Recall that in a synchronous PRAM all processors perform instructions simultaneously and each instruction takes unit time. We use a Combining CRCW PRAM to ensure the correct minimum label is written in O(1) time [40]. An EREW algorithm follows directly from Algorithm 1 because it is well-known that a CRCW algorithm can be simulated in an EREW with logarithmic factor slowdown [25]. The only read/write conflict in Algorithm 1 is in the minimum label update. Here it does not require a minimum reduction in constant-time. Thus for a p-processor EREW, reading L_k[v] takes O(log p) time by broadcasting the value in binary tree order to each processor attempting to read it. It isn't difficult to see that a minimum value can be found in O(log p) time using a binary tree layout to reduce comparisons by half each step. This immediately proves Theorem 2.
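The binary tree minimum reduction mentioned above can be sketched sequentially as follows. The function name tree_min is hypothetical. Each pass of the while loop stands for one parallel round in which disjoint pairs are compared, so no memory cell is read or written by more than one comparison in a round.

```python
def tree_min(values):
    """Return (minimum, rounds) using pairwise reduction in tree order.

    Each round halves the number of candidates, so p values need
    about log2(p) rounds.
    """
    candidates = list(values)
    rounds = 0
    while len(candidates) > 1:
        # Compare disjoint adjacent pairs; an odd leftover is carried over.
        reduced = [
            min(candidates[i], candidates[i + 1])
            for i in range(0, len(candidates) - 1, 2)
        ]
        if len(candidates) % 2 == 1:
            reduced.append(candidates[-1])
        candidates = reduced
        rounds += 1
    return candidates[0], rounds


print(tree_min([7, 3, 9, 1, 4, 8, 2, 6]))  # (1, 3)
```

For eight competing labels the reduction finishes in three rounds, which matches the O(log p) cost attributed to the EREW minimum update.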

Extending to other models
The Stream and MapReduce models restrict globally-shared state so the minimum label for each vertex must be carried with the graph at each step. Recall in Algorithm 1 there may not be an explicit (v, l(v)) edge in a step but all minimum labels are kept in the global L_k array. So given a (v, u) edge but no explicit (v, l(v)) edge, we can still apply label propagation and produce (u, l(v)). Otherwise it would create some (u, w) edge where w is the minimum of the current set of neighbors but may not be the true minimum for v. This would cause the algorithm to fail, which is easily demonstrated on the graph in Figure 1. Moreover, in MapReduce the map and reduce functions are sequential so a giant component that is processed by one task will serialize the entire algorithm. We address these limitations by slightly altering the label propagation and symmetrization operations. Label propagation and symmetrization will now only proceed from vertices v ≠ l(v) to mitigate sequential processing of a giant component, thus skipping over intermediate representative labels. As before, u = l(v) implies v ≠ l(v), otherwise it would contradict the minimum function. Now symmetrization adds both edges, (l(v), v), (v, l(v)), so that v is always paired with its minimum label in the absence of random access to global memory. We also remark that symmetrization must be this way when ignoring v where v = l(v) because the vertices that would have created the edge (v, l(v)) are now ignored.
These minor changes do not invalidate the correctness or convergence established by Algorithm 1 because the same edges are created but with some added duplicates. The primary difference is that both (v, l(v)), (l(v), v) edges are created in the same step rather than strided across two consecutive steps. But this incurs label duplication that can lead to a progressive increase in edges if left unchecked.

Label duplication
The new label propagation and symmetrization for Stream and MapReduce can lead to O(log n) factor inefficiency as a result of increased label duplication, especially on sequentially labeled path or tree graphs. This leads to the following crucial observation.

Observation 1.
Adding counter-oriented edges in the symmetrization step of Algorithm 1 will pair each v with every new l(v) it gets for three steps on a sequentially labeled path graph.
Since symmetrization retains each l(v) for the next step, v gets its kth label l_k(v) = v − F_k for the next three steps because of the recurrence of F_k. Thus any new l(v) that v receives will return to v for a total of three steps unless l(v) is the minimum label for the component of v.
Once an l(v) is replaced with an updated minimum for v, it is no longer needed and only adds to the edge count. Symmetrization by Definition 4 retains (v, l(v)) so when v gets l(l(v)) from its current l(v), then in the next step l(l(v)) will propagate back to l(v). Since each vertex in a chain is a minimum label to vertices up the chain, then each vertex will in turn be back-propagated down the chain. This follows a Fibonacci sequence hence the duplication of labels grows rapidly. For example we can see from Lemma 3 that vertex 2 in a chain will be the minimum label for the 3, 4, 5, 7, 10, 15, 23, . . . vertices in sequence, and each of these vertices will return vertex 2 back to the neighbor from which it was received. Moreover, as seen in Observation 1 each new l(v) is retained by v for three steps. Relabeling the graph can avoid the pathological duplication but a robust algorithm is more desirable, especially in graphs that may contain a very long chain.
Suppose now that a (u, l(v)) edge is added to E_{k+1} only if that edge is not currently in E_k. This tests whether l_k(v) ∉ N_k(u), in which case it can be added to N_{k+1}(u). Since Definitions 3 and 4 are applied in models with limited random access and global memory, we leverage sorting to identify the next minimum label for each vertex and also to remove labels that would otherwise fail this membership test. Let E'_{k+1} be the intermediate edges that are created during the kth step and from which a subset is retained for step k + 1. Sorting the edges in E_k and E'_{k+1} will identify those edges that are duplicated across both steps and therefore should be removed. But an edge that is duplicated only within E'_{k+1} must be retained for proper label propagation. To avoid inadvertently removing such an edge, all duplicates in E'_{k+1} are first removed before merging and sorting with the E_k edges. After removing duplicates, the edges from E_k are also removed because these were only needed for the membership test. We apply this in our next algorithms.
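The sorting-based filter can be sketched as follows. This is a minimal in-memory illustration, not the streaming implementation: the function names, the OLD and NEW tag values, and the use of Python lists in place of sorted streams are all assumptions.

```python
from itertools import groupby

OLD, NEW = 0, 1  # hypothetical tags marking an edge's origin


def dedupe_sorted(items):
    """Drop adjacent repeats from an already sorted list."""
    return [key for key, _ in groupby(items)]


def next_edges(current_edges, candidate_edges):
    """Return the candidate edges (E'_{k+1}) that are not already in E_k.

    Intra-step duplicates are removed first, then the candidates are
    merged and sorted with the current edges so that copies of the same
    edge become adjacent. Any group containing an OLD copy is dropped,
    which also discards the E_k edges themselves.
    """
    new_unique = dedupe_sorted(sorted(candidate_edges))
    tagged = sorted([(e, OLD) for e in current_edges] +
                    [(e, NEW) for e in new_unique])
    kept = []
    for edge, group in groupby(tagged, key=lambda pair: pair[0]):
        tags = [tag for _, tag in group]
        if OLD not in tags:
            kept.append(edge)
    return kept
```

With current edges (1,0), (2,1), (3,2) and candidates (2,0), (2,0), (1,0), (3,1), the duplicate (2,0) is kept once, (1,0) is dropped as an inter-step duplicate, and the result is (2,0), (3,1).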

Stream-Sort algorithm
Our Stream-Sort algorithm in Algorithm 2 extends Algorithm 1 as described in Section 7 and removes duplicates by sorting in the manner described at the end of Section 8. It requires two stages per iteration step. The first stage performs label propagation and symmetrization and also returns the input edges. The second stage eliminates duplicates. A one-stage algorithm [9] was described in 2016, which is simpler to implement, but does not address label duplication.
Recall that Algorithm 1 does not return the current edges but creates new edges by symmetrization and label propagation. But in Stream-Sort we must return the current edges temporarily in the intermediate sorting stage in order to ignore duplicate edges. Hence we mark the edges to distinguish old from new. Here a NEW edge resulted from either symmetrization or label propagation, and an OLD edge is a current input edge. Label propagation and symmetrization follow Definitions 3 and 4. In the first stage Algorithm 2 reads sorted edges, hence l(v) = min(v, u) from the first edge of v. If u = l(v) for this first edge then symmetrization adds ⟨(u, v), NEW⟩ and ⟨(v, u), NEW⟩ to an intermediate stream E'_{k+1}, and for each remaining edge a ⟨(u, l(v)), NEW⟩ is added to E'_{k+1} by label propagation, and ⟨(v, u), OLD⟩ is also added. Note that if u = l(v) in the first edge of v, all remaining u cannot be l(v) due to sorting, and thus u, v ≠ l(v) for the remaining edges of v so label propagation can proceed. The intermediate stream E'_{k+1} is sorted by a sorting pass and input to the second stage. In the second stage the edges are sorted so all NEW and OLD versions of (v, u) are grouped together. Then any edge without an OLD member is added to a new output stream E_{k+1}, which will be the input stream in the next pass. At this point both intra- and inter-step duplicates have been removed. The algorithm repeats this procedure until no new minimum can be adopted. If Conjecture 1 holds, this would be the first efficient, deterministic connected components algorithm in Stream-Sort.

MapReduce algorithm
Our MapReduce algorithm described in Algorithm 3 is similar to Algorithm 2, managing duplicates as described at the end of Section 8. It takes two rounds per iteration step, the first to perform label propagation and symmetrization and the second to remove duplicates. A single-round algorithm [9] with less efficient communication is also available to the interested reader. The values for each key are sorted hence intra-step duplicates are adjacent and easily removed, permitting the algorithm to maintain O(1) working memory. Since the values are sorted then l(v) is simply the lesser between the key and first value. We omit the specifics on sorting and getting l(v) for brevity. Label propagation and symmetrization follow Definitions 3 and 4. The first round returns label propagation or symmetrization edges as NEW, and current edges as OLD, again to assist in removing duplicates.
If v ≠ l(v) then there is a u = l(v), thus symmetrization emits ⟨v, (l(v), NEW)⟩ and ⟨l(v), (v, NEW)⟩, and for each u ≠ l(v) a ⟨u, (l(v), NEW)⟩ is added by label propagation, and ⟨v, (u, OLD)⟩ is also added. This skips any local minimum or the component minimum, which could otherwise result in a large span of sequential processing. The second round accumulates these edges and every edge without an OLD member is returned with no markings. The two rounds are repeated until labels converge. Since duplicates are removed the total communication cost is O(m) per round.
Proof. Label propagation and symmetrization extend those in Algorithm 1 but create more duplicate edges. These duplicates are removed in each iteration step, thus correctness follows from Lemma 1. There are O(log n) steps due to Conjecture 1, and two rounds per step for a total of 2 log n = O(log n) rounds. The communication cost is proportional to the number of edges written after all rounds. Both inter- and intra-step duplicates are removed in each iteration step, so the total number of edges after the second round of each step is O(m) following Lemma 2. Thus for O(log n) rounds the overall communication is O(m log n) as claimed. This is comparable in runtime and communication to [26], but differs in that it is a deterministic MapReduce algorithm for connected components.

Implementation
We give some basic empirical results for our principal algorithm described in Algorithm 1. First we briefly describe our parallel implementation. We implemented the algorithm in C++ with POSIX threads, and for write conflicts we used atomic operations.
There is a write conflict in updating the minimum label for each vertex. We use the "compare and exchange" atomic operation to update the minimum label. But this atomic does not test relational conditions; it tests only whether the stored value matches an expected value. Thus to atomically update a minimum value, the compare and exchange result must be repeatedly tested in a simple loop. After each attempt, a thread must test whether its original data is still less than the stored value and retry if so. Eventually the thread with the minimum value will succeed and all other threads will test out. The winning thread itself will also test out because its value will be equal to the updated value.
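The retry loop can be sketched as below. This is an illustration in Python rather than the C++ implementation: the AtomicCell class is a hypothetical stand-in that emulates a hardware compare and exchange with a lock, and atomic_min is an assumed name.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for an atomic word, emulated with a lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, desired):
        """Store desired only if the cell still holds expected."""
        with self._lock:
            if self._value == expected:
                self._value = desired
                return True
            return False


def atomic_min(cell, candidate):
    """Lower the cell to candidate if candidate is smaller.

    Returns True if this call performed the write.
    """
    observed = cell.load()
    while candidate < observed:
        if cell.compare_exchange(observed, candidate):
            return True
        # Another thread changed the cell first: re-read and re-test.
        observed = cell.load()
    return False  # the stored value is already less than or equal
```

A thread leaves the loop either because its exchange succeeded or because the stored label is no longer larger than its own, so the smallest candidate always ends up in the cell regardless of thread interleaving.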
Since at each step every edge is replaced by a new edge independently of other edges, the edge list can be concurrently updated without synchronization and the work per step is exactly 2m by Lemma 2. An array of size 2m is initialized with the input edge list and each thread is given a unique subset of indices in this work array. Threads can then concurrently replace each edge in their subset of the work array without conflict. Threads are blocked until all edges are updated in a step, and then threads concurrently update the two label arrays.
Our implementation must detect when labels no longer change. We keep a counter in the label propagation branch to identify when label propagation no longer updates minimum labels. At each step the counter is set to zero. In the loop over (v, u) edges, we carry out label propagation if u is not l(v), otherwise symmetrization is performed. We increment the counter in the label propagation branch if l(v) ≠ v, that is, if the minimum label for v is not v itself. The algorithm halts if the counter value is zero at the end of a step. Effectively this updates the counter until every vertex other than the root is a leaf node of a star. In the final star graph only the root of the star can fall into the label propagation branch, and since the root is its own minimum label, l(v) is equal to v and the counter is not updated.
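The halting test can be illustrated with a sequential sketch of one possible reading of the loop described above. It is reconstructed from the prose only, so the function name, the max_steps guard, and the exact update order are assumptions, and it is not the threaded C++ implementation.

```python
def connected_labels(n, undirected_edges, max_steps=10_000):
    """Sequential sketch of label propagation with the halting counter.

    Vertices are 0..n-1. Returns (labels, steps), where labels[v] is the
    label held by v when the counter first stays at zero.
    """
    # Work array of directed edges, both orientations (2m entries).
    edges = []
    for a, b in undirected_edges:
        edges.append((a, b))
        edges.append((b, a))
    # Initial label: minimum over v and its neighbors.
    label = list(range(n))
    for v, u in edges:
        label[v] = min(label[v], u)
    for step in range(1, max_steps + 1):
        next_label = list(label)
        next_edges = []
        counter = 0
        for v, u in edges:
            lv = label[v]
            if u != lv:
                # Label propagation: (v, u) becomes (u, l(v)).
                next_edges.append((u, lv))
                next_label[u] = min(next_label[u], lv)
                if lv != v:
                    counter += 1  # v is not its own minimum label
            else:
                # Symmetrization: (v, l(v)) becomes (l(v), v).
                next_edges.append((lv, v))
        edges, label = next_edges, next_label
        if counter == 0:
            return label, step
    raise RuntimeError("no convergence within max_steps (assumed guard)")
```

Each input edge is replaced by exactly one output edge, so the work array keeps its size of 2m from step to step, in line with the description of the implementation.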

Experiments
We ran our parallel implementation of Algorithm 1 on a workstation with 28 Intel Xeon E5-2680 cores and 256 GB of RAM. Our experimental results are given in Table 2. The first column of Table 2 lists the graphs used in our experiments. All graphs were unweighted, without self-loops, and symmetrized with vertices labeled from 0 to n − 1, and are therefore simple, undirected graphs with a total of 2m edges. The first four graphs are large, sequentially labeled path graphs. Our naming convention for these uses a suffix that denotes the number of vertices by the base-2 exponent. Hence the graph named seqpath20 is a path of n = 2^20 vertices labeled in order from 0..n − 1. It is interesting to note that the convergence on these graphs follows the prediction asserted in Proposition 1. Using log_φ n with φ = 1.618, the predicted number of steps for n = 2^20, 2^22, 2^24, 2^26 is 29, 32, 35, 37. Our implementation takes one extra step to test for completion and so it closely matches the prediction.
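The predicted step counts can be reproduced with a few lines of arithmetic. The sketch below assumes the prediction is log_φ n rounded to the nearest integer, which is the rounding that matches the four values quoted above; the function name is illustrative.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, approximately 1.618


def predicted_steps(n):
    """log base PHI of n, rounded to the nearest integer (assumed rounding)."""
    return round(math.log(n) / math.log(PHI))


# Path graphs seqpath20, seqpath22, seqpath24, seqpath26.
predictions = [predicted_steps(2 ** e) for e in (20, 22, 24, 26)]
```

Since log_φ 2 is about 1.44, each additional factor of four in n adds roughly three steps.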
The next four graphs are road networks with relatively large diameters. The first three road networks are from the Stanford Network Analysis Project (SNAP) [27]. The fourth road network is the USA road network from the 9th DIMACS Challenge [16].
The next eight graphs are based on the example in Appendix I of Andoni et al. [2], which was devised to be difficult for fast convergence in graph connectivity. Each of these graphs is an r × c grid labeled sequentially in row-major order from 1..rc but with a bridge node, the zero label vertex, connecting to the first vertex of each row. Hence n = rc + 1, m = (2r − 1)c and the diameter is D = 2c. We use the naming convention of grid-0to[rows]by[columns]. The number of rows is fixed to r = 262144 for the first five of these graphs. For the last three such graphs the number of rows is r = 2^c and thus grows exponentially with respect to the diameter. Our algorithm converges linearly with respect to the diameter on these last three graphs.
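A small generator makes the vertex and edge counts of these grids easy to check. This is a sketch under the assumption that the grid includes both horizontal and vertical edges, which is the reading consistent with m = (2r − 1)c; the function name bridged_grid is hypothetical.

```python
def bridged_grid(rows, cols):
    """Build an r x c grid labeled 1..rc in row-major order plus bridge vertex 0.

    Returns (n, edges) with one undirected edge per tuple.
    """
    edges = []
    for r in range(rows):
        first = r * cols + 1
        edges.append((0, first))  # bridge to the first vertex of the row
        for c in range(cols):
            v = first + c
            if c + 1 < cols:
                edges.append((v, v + 1))     # horizontal neighbor
            if r + 1 < rows:
                edges.append((v, v + cols))  # vertical neighbor
    return rows * cols + 1, edges
```

The edge count is r bridge edges, r(c − 1) horizontal edges and (r − 1)c vertical edges, which sums to (2r − 1)c.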
The remaining graphs in the table are from SNAP [27]. These graphs have small diameter, which is expected for real-world networks. Our algorithm converges in O(D) steps on these graphs.
It is evident from the number of steps in the sixth column of Table 2 that our algorithm converges rapidly on these graphs, tending towards O(log n) convergence as we conjectured. Observe that it takes fewer steps to converge than the diameter in each of the graphs we tested, demonstrating a real practical benefit. Moreover, the convergence rate is independent of the number of processors.
The runtime of our implementation leaves room for improvement when compared to state-of-the-art implementations [17]. Due to the simplicity and fast convergence of our algorithm, we believe the runtime performance can be significantly improved. The fast convergence, simplicity, and extensibility to other computational paradigms make our algorithm appealing in practice.