1 Introduction

We consider the problem of performing t similar and independent tasks in a distributed system prone to processor crashes. This problem, called Do-All, was introduced by Dwork et al. [16] in the context of a message-passing system. Over the years the Do-All problem became a pillar of distributed computing and has been studied widely from different perspectives [20, 21].

The distributed system studied in our paper is based on communication over a shared channel, also called a multiple-access channel, and was first studied in the context of Do-All problem by Chlebus et al. [11].

The communication channel and the p crash-prone processors, also called stations, connected to it are synchronous. They have access to a global clock, which defines the same rounds for all operational stations. A message sent by a station in a round is received by all operational (i.e., not yet crashed) stations only if it is the only transmitter in this round; we call such a transmission successful. Otherwise, unless stated differently, we assume that no station receives any meaningful feedback from the channel, except an acknowledgment of its own successful transmission; this setting is called without collision detection, as opposed to the setting with collision detection, in which stations can distinguish between no transmission and simultaneous transmission by at least two stations (we refer to the latter in a few places of this work).

Stations are prone to crash failures. Allowable patterns of failures are determined by abstract adversarial models. Historically, the main distinction was between adaptive and oblivious adversaries; the former can make decisions about failures during the computation, while the latter has to make all decisions prior to the computation. Another characteristic of an adversary is its size-boundedness: an adversary is f-bounded if it may fail at most f stations, for a parameter \(0\le f< p\); a linearly-bounded adversary is simply a \(c\cdot p\)-bounded adversary, for some constant \(0<c<1\).

We introduce the notion of the ordered adaptive adversary, or simply ordered adversary, which can crash stations online but according to some preselected order (unknown to the algorithm) from a given class of partial orders, e.g., linear orders (i.e., all elements are comparable), anti-chains (i.e., no two elements are comparable), k-thick partial orders (i.e., at most k elements are incomparable). On the other hand, a strongly-adaptive adversary is not restricted by any constraint other than being f-bounded for some \(f<p\).

Adversaries described by a partial order are interesting in their own right. To the best of our knowledge, such general adversaries have not been considered in the literature so far, and hence they offer a novel framework for evaluating the performance, and improving the understanding, of reliable distributed algorithms. For instance, in hierarchical architectures, such as clouds or networks, a crash at an upper level may result in cutting off processors at lower levels, which could be seen as a crash of the lower levels from the system perspective. Linear orders of crashes could be motivated by the fact that each station has its own duration, unknown to the algorithm, and a crash of some station means that all stations with smaller durations should crash as well. Independent systems, e.g., ones located far away from each other, could be seen as a set of linearly ordered chains of crashes, each chain for a different independent part of the system. Furthermore, the study in this paper indicates that different partial orders restricting the adversary may require different algorithm design and yield different complexity formulas.

Another form of restricting adversarial power is to delay the effect of its actions by a number of time steps; we call such adversaries round-delayed. Such adaptive adversaries are motivated by various reactive attacking systems, especially in the analysis of security aspects. We show that this parameter can influence the performance of algorithms, independently of the partial-order restrictions on the adversary.

Due to the specific nature of the Do-All problem, the most accurate measure considered in the literature is work, which accounts for the total number of available processor steps in the computation. It was introduced by Kanellakis and Shvartsman in the context of the related Write-All problem [26]. We assume that algorithms are reliable, in the sense that they must perform all the tasks for any pattern of crashes such that at least one station remains operational in an execution. Chlebus et al. [11] showed that \(\varOmega (t+p\sqrt{t})\) work is inevitable for any reliable algorithm, even a randomized one, even for the channel with collision detection (i.e., when operational stations can distinguish no transmission from simultaneous transmission by at least two stations in a round), and even if no failure occurs. This is the absolute lower bound on the work complexity of the Do-All problem on a shared channel. It is known that this bound can be achieved even by a deterministic algorithm for channels with enhanced feedback, such as collision detection, cf. [11], and therefore such enhanced-feedback channels are no longer interesting from the perspective of reliable task performance.

Our goal is to check how different classes of adversaries, especially those constrained by a partial order of crashes, influence work performance of Do-All algorithms on a simple shared channel with acknowledgments only.

1.1 Previous work

The Do-All problem was introduced by Dwork, Halpern and Waarts [16] in the context of a message-passing model with processor crashes.

Chlebus, Kowalski and Lingas (CKL) [11] were the first to consider Do-All on a multiple-access channel. Apart from the absolute lower bound on work complexity, discussed earlier, they also showed a deterministic algorithm matching this performance in the case of the channel with collision detection. For the channel without collision detection, they developed a deterministic solution that is optimal for such a weak channel with respect to the lower bound \(\varOmega (t + p\sqrt{t} + p\min \left\{ f,t\right\} )\) that they proved. The lower bound holds also for randomized algorithms against the strongly adaptive adversary, that is, the adversary who can see random choices and react online, which shows that randomization does not help against a strongly adaptive adversary.

Furthermore, their paper contains a randomized solution that is efficient against a weakly adaptive adversary who can fail only a constant fraction of stations. A weakly adaptive adversary needs to select f crash-prone processors in advance, based only on knowledge of the algorithm but without any knowledge of random bits; then, during the execution, it can fail only processors from that set. This algorithm matches the absolute lower bound on work. If the adversary is not linearly bounded, that is, if \(f<p\) can be arbitrary, they only proved a lower bound of \( \varOmega \left( t + p\sqrt{t} + p\min \left\{ \frac{p}{p-f},t\right\} \right) \).

Clementi et al. [13] investigated Do-All in the communication model of a multiple-access channel without collision detection. They studied F-reliable protocols, which are correct if the number of crashes is at most F, for a parameter \(F<p\). They obtained tight bounds on the time and work of F-reliable deterministic protocols. In particular, the bound on work shown in [13] is \(\varTheta (t+F\cdot \min \{t, F\})\). In this paper, we consider protocols that are correct for any number of crashes smaller than p, which is the same as \((p-1)\)-reliability. Moreover, the complexity bounds of our algorithms, for the channel without collision detection, are parametrized by the number f of crashes that actually occur in an execution. Results shown in [13] also addressed the time perspective, with a lower bound on time complexity equal to

$$\begin{aligned} \varOmega \left( \frac{t}{p-F} + \min \left\{ \frac{tF}{p}, F + \sqrt{t} \right\} \right) . \end{aligned}$$

However, those protocols make explicit use of the knowledge of F. In this paper we give some remarks on time and energy complexity but, as opposed to the results in [13], those statements are correct for an arbitrary f, which does not need to be known by the system (see details in Sect. 9). Observe that, as opposed to worst-case time complexity, the considered work complexity can be seen as the average processor time multiplied by the number of processors.

1.2 Related work

Do-All problem. After the seminal work by Dwork, Halpern and Waarts [16], the Do-All problem was studied in a number of follow-up papers [7, 8, 10, 14, 17] in the context of a message-passing model, in which every node can send a message to any subset of nodes in one round. Dwork et al. [16] analyzed task-oriented work, in which each performance of a task contributes a unit to complexity, and the communication complexity defined as the number of point-to-point messages.

De Prisco et al.  [14] were the first to use the available processor steps [26] as the measure of work for solutions of Do-All. They developed an algorithm which has work \({\mathcal {O}}(t+(f+1)p)\) and message complexity \({\mathcal {O}}((f+1)p)\). Galil et al. [17] improved the message complexity to \({\mathcal {O}}(f p^\varepsilon + \min \{f+1, \log p\}p)\), for any positive \(\varepsilon \), while maintaining the same work complexity. This was achieved as a by-product of their investigation of the Byzantine agreement with crash failures, for which they found a message-efficient solution. Chlebus et al. [7] studied failure models allowing restarts.

Chlebus and Kowalski [10] studied the Do-All problem when occurrences of failures are controlled by the weakly-adaptive linearly-bounded adversary. They developed a randomized algorithm with expected effort (effort = work + number of messages) \({\mathcal {O}}(p\log ^*p)\) in the case \(p=t\), which is asymptotically smaller than the lower bound \(\varOmega (p\log p/\log \log p)\) on the work of any deterministic algorithm. Chlebus et al. [8] developed a deterministic algorithm with effort \({\mathcal {O}}(t+p^a)\), for some specific constant a, where \(1<a<2\), against the unbounded adversary; it was the first algorithm with the property that both work and communication are \(o(t+p^2)\) against this adversary. They also gave an algorithm achieving both work and communication \({\mathcal {O}}(t+p\log ^2 p)\) against a strongly-adaptive linearly-bounded adversary. All previously known deterministic algorithms had either work or communication performance \(\varOmega (t+p^2)\) when as many as a linear fraction of processing units could be failed by a strongly-adaptive adversary. Georgiou et al. [19] developed an algorithm with work \({\mathcal {O}}(t+p^{1+\varepsilon })\), for any fixed constant \(\varepsilon \), by an approach based on gossiping. Kowalski and Shvartsman [30] studied Do-All in an asynchronous message-passing model in which executions are restricted so that every message delay is at most d. They showed a lower bound \(\varOmega (t+pd\log _d p)\) on the expected work. They developed several algorithms, among them a deterministic one with work \({\mathcal {O}}((t+pd)\log p)\). For further developments we refer the reader to the book by Georgiou and Shvartsman [21].

Related problems on a shared channel. Most work in this model has focused on communication problems; see the surveys [6, 18]. Among the most popular protocols for resolving contention on the channel are Aloha [1] and exponential backoff [33]. The two most closely related research problems are as follows.

The selection problem asks how to have an input message broadcast successfully when only some of the stations hold input messages while the others do not. It is closely related to the leader election problem. Willard [36] developed protocols solving this problem in expected time \({\mathcal {O}}(\log \log n)\) on the channel with collision detection. Kushilevitz and Mansour [31] showed a lower bound \(\varOmega (\log n)\) for this problem in the absence of collision detection, which explains the exponential gap between this model and the one with collision detection. Martel [32] studied the related problem of finding the maximum among the values stored by a group of stations.

Jurdziński et al. [25] considered the leader election problem for the channel without collision detection, giving a deterministic algorithm with sub-logarithmic energy cost. They also proved a log-logarithmic lower bound for the problem.

The contention resolution problem concerns a subset of some k among all n stations that hold messages. All these messages need to be transmitted successfully on the channel as quickly as possible. Komlós and Greenberg [28] proposed a deterministic solution achieving this in time \({\mathcal {O}}(k+k\log (n/k))\), where n and k are known. Kowalski [29] gave an explicit solution of complexity \({\mathcal {O}}(k\,\mathrm{polylog}\, n)\), while the lower bound \(\varOmega (k(\log n)/(\log k))\) was shown by Greenberg and Winograd [22]. The work by Chlebus et al. [9] considered broadcasting spanning forests on a multiple-access channel, with locally stored edges of an input graph.

A significant part of recent results on the communication model considered in the literature focuses on jamming-resistant protocols, motivated by applications in single-hop wireless networks. To the best of our knowledge, this line of research was initiated in [2] by Awerbuch, Richa and Scheideler, who introduced a model of an adversary capable of jamming up to \((1-\epsilon )\) of the time steps (slots). The follow-up papers [34, 35] by Richa, Scheideler, Schmid and Zhang proposed several algorithms that can sustain communication even against a very strong, adaptive adversary. For the same model, Klonowski and Pająk [27] proposed an optimal leader election protocol, using a different algorithmic approach.

A similar model of a jamming adversary was considered by Bender et al. [3]. The authors consider a modified, robust exponential backoff protocol that requires \(O(\log ^2 n + T)\) channel-access attempts if there are at most T jammed slots. Motivated by saving energy, the authors aim to maximize global throughput while reducing device costs, expressed as the number of channel accesses.

Finally, there are several recent results on approximating the size of the network. Brandes et al. [4] proposed an algorithm for a network of n stations that returns a \((1+\varepsilon )\)-approximation of n with probability at least \(1-1/f\). This procedure takes \(O(\log \log n+\log f/\varepsilon ^2)\) time slots. This result was also proved to be time-optimal in [4]. Chen et al. [5] demonstrated a size-approximation protocol for a seemingly different model (namely an RFID system) that needs \(\varOmega \left( \frac{1}{\epsilon ^2 \log {1/\epsilon }} + \log \log {n}\right) \) slots for \(\epsilon \in [1/\sqrt{n},0.5]\) and negligible probability of failure. In fact, this result translates directly to the MAC model.

Table 1 Summary of main results; the first three were introduced in CKL [11], the others are presented in this paper

1.3 Our results

We introduce a hierarchy of adaptive adversaries and study their impact on the complexity of performing jobs on a shared channel. The most important parameter of this hierarchy is the partial order that describes adversarial crashes; hence we call such adversaries ordered. The other parameters are the number of crashes f (such adversaries are called size-bounded) and a delay c in the effect of the adversary's decisions (such adversaries are called c-Round-Delayed, or c-RD).

Since the adversaries that we introduce originate from partial order relations, the appropriate notions and definitions translate straightforwardly. The relation of particular interest when considering partially ordered adversaries is the precedence relation. Precisely, if some station v precedes station w in the partial order of the adversary, then we say that v and w are comparable. This also means that station v must be crashed by the adversary before station w. Consequently, a subset of stations in which every pair of stations is comparable is called a chain. On the other hand, a subset of stations in which no two different stations are comparable is called an anti-chain.

It is convenient to think about the partial order of the adversary from a Hasse diagram perspective. The notions of chains and anti-chains are intuitive when presented graphically: a chain is a pattern of consecutive crashes that may occur, while an anti-chain gives the adversary the freedom to crash in any order, due to the non-comparability of its elements/stations.

We show that adversaries constrained by an order of small width (i.e., with a short maximal anti-chain), as well as 1-RD adversaries, have very little power, resulting in performance similar to the one enforced by oblivious adversaries or linearly-ordered adversaries, cf. [11]. More specifically, we develop algorithms ROBAL and GILET, which achieve work performance close to the absolute lower bound \(\varOmega (t + p\sqrt{t})\) against “narrow-ordered” and 1-RD adversaries, respectively.

Fig. 1 The hierarchy of adversaries

In the case of ordered adversaries restricted by orders of arbitrary width \(k\le f\), we present algorithm GruTEch, which guarantees work \(\mathcal {O}\left( t + p\sqrt{t} + p\min \left\{ \frac{p}{p-f},t,k\right\} \log p\right) \), and we show that it is efficient by proving a lower bound for a broad class of partial orders. This also extends the result for a weakly-adaptive linearly-bounded adversary from [11] to any number of crashes \(f<p\), as the weakly-adaptive adversary is a special case of an ordered adversary restricted by a single anti-chain. Our results, together with [11], prove a separation between classes of adversaries. The easiest to play against, apart from the oblivious ones, are the following adaptive adversaries: 1-RD adversaries, ordered adversaries restricted by small-width orders, and linearly bounded adversaries. More demanding are ordered adversaries restricted by orders of width k, for larger values of k, and f-bounded adversaries for f close to p. The most demanding are strongly-adaptive adversaries, as their decisions and the way they act are least restricted. See Table 1 for detailed results and comparisons.

The hierarchy of the considered adversaries is illustrated in Fig. 1. It depends on three main factors. Additionally, we introduce several solutions for the specified settings. Consequently, our contribution complements the adversarial scenarios presented in the literature, together with a taxonomy describing the dependencies between different adversaries.

First of all, we have the vertical axis, which describes adversary features, that is, how restricted its decisions are. We have the Strongly-Adaptive adversary at the origin, which may decide online which stations will be crashed. Above the Strongly-Adaptive adversary is the Weakly-Adaptive adversary, which is slightly weaker and has to declare, before the algorithm execution, the subset of stations that will be prone to crashes. Next we have the k-Chain-Ordered adversary and its more general version, the k-Thick-Ordered adversary (Chain-O-Adaptive in Fig. 1 for consistency). Apart from declaring the faulty subset, the former adversary is restricted by partial orders being collections of k disjoint chains, while the latter by all partial orders of thickness k (and so k-chain orders as well). Finally, there is the Linearly-Ordered adversary (Linearly-O-Adaptive in Fig. 1) that we introduce in this paper; its order is described by a linear pattern of processor crashes.

The horizontal axis describes another feature of the adversary that we introduce in our paper, namely the Round-Delay of adversary decisions. Similarly, the configuration for the problem is hardest at the origin, and a 0-RD adversary is the strongest against which we may execute an algorithm. An interesting particularity is that if the Strongly-Adaptive adversary's decisions are delayed by at least one round, then we may design a solution whose work complexity is independent of the number of crashes.

The axis orthogonal to those already considered describes the channel feedback. At the origin we have a multiple-access channel without collision detection, then there is the beeping channel, followed by MAC with collision detection.

We can see that the most difficult setting is at the origin, and the further from the origin, the easier the problem. The boxes in Fig. 1 represent the algorithms and their work complexities in certain configurations of the model, i.e., the features described above. The bold boxes denote algorithms from this paper, and the remaining ones are from CKL [11]. Factors marked red denote the “distance” from the lower bounds, understood as how far the algorithms are from the optimum.

A subset of the results from this paper forms a hierarchy of partially ordered adversaries. The first solution we introduce is designed to work against a linearly-ordered adversary, whose pattern of crashes is described by a linear order. The upper bound of this algorithm does not depend on the number of crashes and is just logarithmically far from the minimal work complexity in the assumed model. The second algorithm serves the case when the adversary's partial order of stations forms a maximum-length anti-chain. Nevertheless, we also analyze this solution in an in-between situation, when the partial order consists of k chains of arbitrary lengths whose lengths sum to f.

To conclude, we would like to emphasize that, building on solutions from CKL [11], we introduce different algorithms and specific adversarial scenarios for more complex setups which, to some extent, fill the gaps for randomized algorithms solving Do-All in the most challenging adversarial scenarios and in communication channels providing the least feedback. Due to the basic nature of the considered communication model, a shared channel with acknowledgments only, our solutions are also implementable and efficient in various other types of communication models with contention and failures.

1.4 Document structure

We describe the model of the problem, communication channel details, different adversary scenarios and the complexity measure in Sect. 2. Section 3 is dedicated to the Two-Lists and Groups-Together procedures from [11], which are used (sometimes after small modifications) as a toolbox in our solutions. In Sect. 4 we present a randomized algorithm ROBAL solving Do-All in the presence of a Linearly-Ordered adversary. In the following section (Sect. 5) there is a work-efficient algorithm GruTEch that simulates a kind of fault-tolerant collision detection on a channel without such a feature. This is followed by Sect. 6, where we adjust this solution to a k-Chain-Ordered adversary. Finally, Sect. 7 contains a solution for the 1-RD adversary (algorithm GILET), and Sect. 8 is dedicated to the transition of Groups-Together to the beeping model. We conclude with a short summary in Sect. 9, which also contains some general remarks about the time and energy complexity of our algorithms.

2 Shared channel model and the Do-All problem

The Do-All problem was introduced by Dwork et al. [16] and considered further in numerous papers [8, 10, 12, 14, 17] under different assumptions regarding the model. In this section we formulate the model we consider, which is based on the one from [11].

In general, the Do-All problem is modeled as follows: a distributed system of computationally constrained devices is expected to perform a number of tasks [20]. We will call those devices processors or simply stations. The main efficiency measure that we use is work, i.e., the total number of processor steps available for computations [26].

2.1 Stations

In our model we assume p stations, with unique identifiers from the set \( \{1,\dots , p\} \). The distributed system of those devices is synchronized with a global clock, and time is divided into synchronous time slots, called rounds. All the stations start simultaneously at a certain moment. Furthermore, every station may halt voluntarily. In this paper, by \( n \le p \) we denote the number of operational, i.e., not crashed, stations.

2.2 Communication

The communication channel for processors is a multiple-access channel [6, 18], where a transmitted message reaches every operational device. All our solutions work on a channel without collision detection; hence, when more than one message is transmitted in a given round, the devices hear a signal indistinguishable from the background noise. In our model we assume that the number of bits that can be broadcast in a single transmission is bounded by \( \mathcal {O}(\log p) \); however, all our algorithms broadcast merely \( \mathcal {O}(1) \) bits, hence we omit the analysis of message complexity.

In a few places of this work we refer to the alternative channel setting with collision detection, in which there are three types of feedback from the channel:

  • Silence: no station transmits, and only a background noise is heard;

  • Single: exactly one station transmits legible information;

  • Collision: an illegible signal is heard (yet different from Silence), when more than one station transmits simultaneously.

Section 3.2 provides more details. We note here that the setting with collision detection is referred to only in the context of transforming algorithmic tools and lower bounds to the more challenging setting without collision detection primarily studied in this work.
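To make the two feedback regimes concrete, the following minimal Python sketch (our own illustration, not part of the formal model) maps the set of transmitters in a round to the feedback heard by all operational stations:

```python
def channel_feedback(transmitters, collision_detection):
    """Feedback heard by every operational station in a single round;
    `transmitters` is the set of station ids broadcasting in that round."""
    if len(transmitters) == 1:
        (station,) = transmitters
        return ("SINGLE", station)  # the message is received by everyone
    if collision_detection:
        # Silence and collision are distinguishable signals on this channel.
        return ("COLLISION", None) if transmitters else ("SILENCE", None)
    # Without collision detection, zero and >= 2 transmitters both sound
    # like background noise, i.e., like silence.
    return ("SILENCE", None)
```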

It is worth emphasizing that the communication channel in our model can be made resistant to non-synchronized processor clocks without increasing asymptotic performance, using methods developed previously [23]: when processor clocks are not synchronized, then we could replace each round by two rounds of appropriate lengths to compensate possible lags.

2.3 Adversaries

Processors may be crashed by the adversary. One of the factors that describes the adversary is its power f, which represents the total number of failures it may enforce. We assume that \( 0 \le f \le p-1 \), so at least one station always remains operational until an algorithm terminates. Stations that have crashed neither restart nor contribute to work. Another feature of the adversary is whether it is adaptive or not. Following the definition from [20], an adaptive adversary is one that has complete knowledge of the computation it is affecting, and makes instant dynamic decisions on how to affect the computation. A non-adaptive adversary, sometimes called oblivious, has to determine the sequence of events it will cause before the start of the computation. In this paper we focus on adaptive adversaries.

In previous work, the following adversarial models were considered, cf. [11, 21]:

  • Strongly-Adaptive f-Bounded: the only restriction of this adversary is that the total number of failures may not exceed f. In particular all possible failures may happen simultaneously.

  • Weakly-Adaptive f-Bounded: the adversary has to declare a subset of f stations prone to crashes before the algorithm execution. Yet, it may arbitrarily perform crashes on the declared subset.

  • Unbounded: that is Strongly-Adaptive \( (p-1) \)-Bounded.

  • Linearly-Bounded: an adversary of power f, where \( f = cp \), for some \( 0< c < 1 \).

We now introduce new adversarial models that complement the existing ones from the literature.

2.3.1 The Ordered f-Bounded adversary

Formally, the Ordered f-Bounded adversary has to declare, prior to the execution, a subset of at most f out of p stations that will be prone to crashes. Then, before starting the execution, the adversary has to impose a partial order on the selected stations, taken from a given family of partial orders. This family restricts the power of the adversary: the larger it is, the more power the adversary possesses. Moreover, as we show in this work, the structure of the available partial orders influences the asymptotic performance of algorithms and the complexity of the Do-All problem in the presence of an adversary restricted by these partial orders.

The adversary may enforce failures independently of time slots (even f in the same round), but with respect to the order. This means that a pre-selected crash-prone station can be crashed in a time slot if and only if all stations preceding it in the order have already been crashed by the end of that time slot.
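In other words, a crash is admissible exactly when all stations preceding the victim have already crashed. A minimal sketch of this predicate (the function and variable names are ours, used only for illustration):

```python
def can_crash(v, predecessors, crashed):
    """The ordered adversary may crash station v in a given time slot only
    if every station preceding v in its partial order has already crashed."""
    return all(u in crashed for u in predecessors[v])

# A chain 1 < 2 < 3 plus a station 4 incomparable with all of them.
predecessors = {1: set(), 2: {1}, 3: {1, 2}, 4: set()}
crashed = {1}
assert can_crash(2, predecessors, crashed)      # 1 has already crashed
assert not can_crash(3, predecessors, crashed)  # 2 is still operational
assert can_crash(4, predecessors, crashed)      # 4 has no predecessors
```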

In this work we focus on the following three types of partial orders.

The Linearly-Ordered f-Bounded adversary. Formally, the Linearly-Ordered f-Bounded adversary has to choose a sequence \( \pi = \pi (1)\dots \pi (f) \) designating the order on the selected set of f stations in which the failures may occur, where \(\pi (i) \) represents the id of the ith fault-prone station in the order. This means that station \( \pi (i) \) may be crashed if and only if stations \( \pi (j) \) are already crashed, for all \( j < i \). In what follows, the notion of the sequence \( \pi \) is consistent with a linear partial order.

The k-Chain-Ordered and k-Thick-Ordered f-Bounded adversaries. The k-Chain-Ordered f-Bounded adversary has to arrange the pre-selected f stations into a partial order consisting of k disjoint chains of arbitrary lengths that represent the order in which these stations may be crashed. In what follows there are k chains; we denote by \( l_{j} \) the length of chain j, and we assume that the sum of the lengths of all chains equals f.

While considering k-Chain-Ordered adversaries we will also define additional notions, useful in the analysis of certain results. We say that a partial order is a k-chain-based partial order if it consists of k disjoint chains such that:

  • no two of them have a common successor, and

  • the total length of the chains is a constant fraction of all elements in the order.

Furthermore, by the thickness of a partial order P we understand the maximal size of an anti-chain in P. An adversary restricted by a wider class of partial orders of thickness k is called k-Thick-Ordered.

The Anti-Chain-Ordered f-Bounded adversary. This adversary is restricted by a partial order which is an anti-chain of f elements, i.e., all f crash-prone stations are incomparable and thus can be crashed in any order. This adversary is the same as the Weakly-Adaptive f-Bounded adversary and the f-Thick-Ordered one.

2.3.2 The c-RD f-Bounded adversary

The c-RD adversary's decisions take effect with a delay of c rounds. This means that, considering time divided into slots (rounds), if the adversary decides to interfere with the system (crash a processor), then this inevitably happens after c rounds. In particular, the subsequent execution and random bits do not influence the decision and its effect: the decision is final and, once made by the adversary, cannot be changed during the delay. We still consider f-boundedness of the adversary, but apart from that it may decide arbitrarily, without declaring before the algorithm's execution which stations will be prone to crashes.

Special cases of the c-RD adversary are the 0-RD and 1-RD adversary models. The definition of the former is consistent with the Strongly-Adaptive adversary. The latter may answer the question of how much a delay influences the difficulty of the problem for a strong adversary.
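One way to picture the c-RD rule is as a pipeline of pending, irrevocable crash decisions; the following sketch (class and method names are ours, not part of the model definition) illustrates it:

```python
from collections import deque

class DelayedAdversary:
    """An f-bounded adversary whose decisions take effect after c rounds."""

    def __init__(self, c, f):
        self.c, self.budget = c, f
        self.pending = deque()  # queue of (effective_round, station) pairs

    def decide_crash(self, now, station):
        """Decide in round `now` to crash `station`; the decision is final
        and cannot be revoked during the delay."""
        if self.budget > 0:
            self.budget -= 1
            self.pending.append((now + self.c, station))

    def effective_crashes(self, now):
        """Stations whose crash becomes effective by round `now`."""
        crashed = []
        while self.pending and self.pending[0][0] <= now:
            crashed.append(self.pending.popleft()[1])
        return crashed
```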

2.4 Complexity measure

The complexity measure mainly used in our analysis is work, as mentioned before. It is the number of available processor steps for computations. This means that each operational station that has not halted contributes a unit of work even if it is idling. Since we use the work complexity measure extensively, we adopt the following definition from [21].

Definition 1

([21], Definition 2.2) Let A be an algorithm that solves a problem of size t with p processors, under adversary \( \mathcal {A} \). Let \( \mathcal {E}(A, \mathcal {A}) \) denote the set of all executions of algorithm A under adversary \(\mathcal {A}\). For execution \( \xi \in \mathcal {E}(A, \mathcal {A}) \), let \( \tau (\xi )\) be the time (according to the external clock) by which A solves the problem. By \(p_{i}(\xi )\) let us denote the number of processors completing a local computation step (e.g., a machine instruction) at time i of the execution, according to some external global clock (not available to the processors). Then the work complexity S of algorithm A is:

$$\begin{aligned} S = S_{\mathcal {A}}(t, p) = \max _{\xi \in \mathcal {E}(A, \mathcal {A})}\left\{ \sum _{i = 1}^{\tau (\xi )} p_{i}(\xi ) \right\} . \end{aligned}$$

For the randomized algorithms that we deal with in this paper, we assess the expected work \( S^{E}_{\mathcal {A}}(t, p) \), which is defined as the maximum over all executions \( \xi \in \mathcal {E}(A, \mathcal {A}) \) of the expectation of the sum \(\sum _{i = 1}^{\tau (\xi )} p_{i}(\xi )\) from Definition 1.

To illustrate the work complexity of a single execution of an algorithm, let us assume that an execution starts when all the stations begin simultaneously in some fixed round \( r_{0} \). Let \( r_{v} \) be the round in which station v halts or is crashed. Then its work contribution equals \( r_{v} - r_{0} \). In what follows, the algorithm complexity is the sum of such expressions over all stations, i.e., \( \sum _{1 \le v \le p}(r_{v} - r_{0}) \).
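For illustration (a toy computation, not an algorithm of this paper), the summation above can be evaluated directly from the rounds in which stations halt or crash:

```python
def work(r0, last_round):
    """Work of one execution: each station contributes one unit for every
    round from the common start r0 until it halts or is crashed."""
    return sum(r_v - r0 for r_v in last_round.values())

# Three stations start at round 0; two halt at round 5, one crashes at round 2.
print(work(0, {1: 5, 2: 5, 3: 2}))  # 5 + 5 + 2 = 12 units of work
```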

2.5 Tasks and reliability

We expect that processors perform all t tasks as a result of executing an algorithm. Tasks are initially known to all processors. We assume that tasks are similar (that is, each task requires the same number of rounds to be performed), independent (they can be performed in any order) and idempotent (every task may be performed many times, even concurrently by different processors, without affecting the outcome of its computation). We assume that one round is sufficient to perform a single task.

2.6 Do-All formal definition

Having explained the assumptions for tasks, we may now state the formal definition of the Do-All problem after [20]:

Do-All: Given a set of t tasks, perform all tasks using p processors, under adversary \(\mathcal {A}\).

In our considerations adversary \( \mathcal {A} \) from the definition above is one of the adversaries described in Sect. 2.3.

We assume that all our algorithms need to be reliable. A reliable algorithm satisfies the following conditions in any execution: all the tasks are eventually performed, if at least one station remains non-faulty, and each station eventually halts, unless it has crashed.

The Do-All problem may be considered completed or solved. It is considered completed when all tasks are performed, but their outcomes are not necessarily known by all operational stations. The problem is considered solved if in addition all operational processors are aware of the tasks’ outcomes. In this paper we do not assume that stations need to know tasks’ outcomes, yet algorithm ROBAL is designed in such a way that it may solve the problem, while all other algorithms complete it (performing tasks is confirmed by a collision signal that does not contain any meaningful message).

3 Useful algorithmic tools

3.1 Algorithm Two-Lists

In this subsection we describe the deterministic Two-Lists algorithm from [11], which is used in our solutions as a sub-procedure. It was proved to be asymptotically work-optimal against the (strongly-adaptive) f-Bounded adversary on a channel without collision detection, and its work complexity is \( \mathcal {O}(t + p\sqrt{t} + p\,\min \{f, t\}) \). A characteristic feature of Two-Lists is that its complexity is linear in t for some ranges of the parameters p and t, describing the number of processors and tasks, respectively (for details, see Fact 7).

3.1.1 Basic facts and notation

Two-Lists was designed for a channel without collision detection, which is why simultaneous transmissions are excluded in it. This is realized by a cyclic schedule of broadcasts (round-robin): stations maintain a transmission schedule and broadcast one by one, accordingly. Because of such a design, every message transmitted via the channel is legible to all operational stations.

Another important fact about Two-Lists is that stations maintain a list of tasks, which enables them to determine which tasks they are responsible for. Both the task list and the transmission schedule are maintained as common knowledge. As a result, stations may transmit messages of minimal length, just to confirm that they are still operational and have performed their assigned tasks.

Additionally, the transmission schedule and the task list are stored locally by each station, but the way stations communicate allows one to think of those lists as common to all operational stations.

Algorithm 1, Algorithm 2 (pseudocode listings)

Fig. 2 Task assignment in Two-Lists

Two-Lists is structured as a loop (see Algorithm 1). Each iteration of the loop is called an epoch. Every epoch begins with a transmission schedule and tasks assigned to processors. During the execution some tasks are performed, and when a station transmits this fact, it is confirmed by removing those tasks from list TASKS. However, due to adversary activity some stations may crash, which is recognized as silence heard on the channel in a round in which a station was scheduled to transmit. Stations recognized as crashed are also removed from the transmission schedule. Eventually a new epoch begins with updated lists.

Epochs are also structured as loops (see Algorithm 2). Each iteration is now called a phase and consists of three consecutive rounds, in which station v:

  1. Performs the first unaccomplished task that was assigned to v;

  2. Broadcasts one bit, confirming the performance of the tasks assigned to v, if it is v's turn to broadcast; otherwise v listens to the channel and attempts to receive a message;

  3. Depending on whether a message was heard, updates its information about stations and tasks.

An epoch consists of a number of phases determined by the current number of operational stations and outstanding tasks. In each epoch there is a repeating pattern of phases consisting of the following three rounds: (1) each operational station performs one task; next, (2) a transmission round takes place, in which at most one station broadcasts a message and the rest of the stations attempt to receive it; the phase ends with (3) an updating round, in which stations update their knowledge about operational stations and outstanding tasks.

3.1.2 The significance of lists

In the previous section we mentioned the concept of knowledge about stations and tasks that processors maintain. It was described somewhat abstractly, so now we explain it in detail. Furthermore, we provide information on how the stations are scheduled to transmit and how they know which tasks to perform.

It is not accidental that the algorithm was named Two-Lists, as the most important pieces of information about the system are maintained on two lists. The first is list STATIONS. It represents the processors operational at the beginning of an epoch and sets the order in which stations should transmit in consecutive phases. That list is operated by pointer Transmit, which is incremented after every phase. It points to exactly one station in a single iteration, which prevents collisions on the channel. Hence, when some station does not broadcast, we may recognize that it has crashed and eliminate it from STATIONS, setting the pointer to the following device.

The second list is TASKS. It contains the outstanding tasks, and the associated pointer is Task_To_Do\(_{v}\), separate for each station. Task assignment is organized in the following way (see Fig. 2 for a visualized example). Let us present the processors from list STATIONS as a sequence \(\langle v_{i} \rangle _{1 \le i \le n}\), where \( n = \texttt {|STATIONS|} \) is the number of operational stations at the beginning of the epoch. Each station is responsible for some segment of list TASKS, and the segments together cover the whole list. The length of the segment for station \( v_{i} \) equals i in a single epoch. A single task may belong to more than one segment at a time, unless the number of tasks is sufficiently greater than the number of stations.
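One way to picture this assignment is the following sketch (our own illustration; it ignores the exact overlap rule used in dense epochs and simply wraps segments around list TASKS):

```python
def assign_segments(stations, tasks):
    """Sketch of the Two-Lists task assignment: the i-th station on list
    STATIONS (1-indexed) receives a segment of length i, with segments taken
    consecutively from list TASKS and wrapping around when the total
    1 + 2 + ... + n exceeds the number of outstanding tasks."""
    segments, start = {}, 0
    for i, v in enumerate(stations, start=1):
        segments[v] = [tasks[(start + j) % len(tasks)] for j in range(i)]
        start += i
    return segments

# Stations v1, v2, v3 over five tasks: segments of lengths 1, 2 and 3.
print(assign_segments(["v1", "v2", "v3"], ["t1", "t2", "t3", "t4", "t5"]))
# {'v1': ['t1'], 'v2': ['t2', 't3'], 'v3': ['t4', 't5', 't1']}
# Task t1 belongs to two segments, as happens in a dense epoch.
```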

It is worth noting that lists STATIONS and TASKS are treated as common to all the devices, because common knowledge is maintained. In fact, however, every station has a private copy of those lists and operates on it with the appropriate pointers.

Finally, there are two additional lists maintained by each station. The first one is list \(\texttt {OUTSTANDING}_{v} \), which contains the segment of tasks that station v has been assigned to perform in an epoch. The second is list \(\texttt {DONE}_{v} \), which contains the tasks already performed by station v. These two additional lists are auxiliary, and their main purpose is to structure the algorithms in a clear and readable way.

3.1.3 Sparse versus dense epochs

The last important element of the Two-Lists description, which explains some subtleties, is the definition of dense and sparse epochs.

Definition 2

Let \( n = \texttt {|STATIONS|} \) denote the number of operational stations at the beginning of the epoch. If \( n(n + 1)/2 \ge \texttt {|TASKS|} \) then we say that an epoch is dense. Otherwise we say that an epoch is sparse.

The expression \( n(n + 1)/2 = 1 + 2 + \cdots + n \) from the definition above determines how many tasks may be performed in a single epoch. If all the broadcasts in Two-Lists are successful, then this is the number of performed (and confirmed) tasks.
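In code form, the test of Definition 2 is a one-liner (shown only for illustration):

```python
def is_dense(n, outstanding):
    # n stations can confirm at most 1 + 2 + ... + n = n(n+1)/2 tasks per epoch.
    return n * (n + 1) // 2 >= outstanding

print(is_dense(4, 10))  # True: 1 + 2 + 3 + 4 = 10 tasks can be covered
print(is_dense(4, 11))  # False: the epoch is sparse
```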

This is why, in a dense epoch, it is possible that some task is assigned to more than one station. A dense epoch may end when the list of tasks becomes empty. For sparse epochs, however, the ending condition corresponds to the fact that every station has had a possibility to transmit, i.e., pointer Transmit has passed all the devices on list STATIONS.

We end this section with results from [11] stating that Two-Lists is asymptotically work optimal, for the channel without collision detection and against the Strongly-Adaptive adversary.

Fact 1

([11], Theorem 1) Algorithm Two-Lists solves Do-All with work \(\mathcal {O}(t + p\sqrt{t} + p\min \{f, t\}) \) against the f-Bounded adversary, for any \( 0 \le f < p \).

Fact 2

([11], Theorem 2) The f-Bounded adversary, for \( 0 \le f < p \), can force any reliable, possibly randomized, algorithm for the channel without collision detection to perform work \( \varOmega (t + p\sqrt{t} + p\min \{f, t\}) \).

Fact 3

([11], Corollary 1) Algorithm Two-Lists is optimal in asymptotic work efficiency, among randomized reliable algorithms for the channel without collision detection, against the adaptive adversary who may crash all but one station.

3.2 Algorithm Groups-Together

Besides Two-Lists, which serves as a sub-procedure in our considerations, some of our algorithms are built on another algorithm from CKL [11]: Groups-Together. The design of both algorithms is similar, yet Groups-Together was introduced for a channel with collision detection. In what follows, we describe the technicalities and the main differences in this subsection.

Let us recall that a shared channel with collision detection provides three types of signals:

  • Silence: no station transmits, and only a background noise is heard;

  • Single: exactly one station transmits legible information;

  • Collision: an illegible signal is heard (yet different from Silence), when more than one station transmits simultaneously.

Simultaneous transmissions are excluded in Two-Lists, as they do not provide any valuable information when collision detection is not available. In such a case, a simultaneous transmission of multiple stations results in a silence signal and does not provide any meaningful information.

Because Groups-Together is specifically designed to work on a channel with collision detection, the feedback from collision signals is used extensively. The main difference is that instead of list STATIONS it maintains list GROUPS; indeed, in Groups-Together the stations are arranged into disjoint groups. Stations are assigned to groups as follows. Let n be the smallest number such that \( n(n+1)/2 > |\texttt {TASKS}| \) holds. Stations have unique identifiers from the set \( \{1,\dots , p\} \), and group \( g_{i} \) contains the stations whose identifiers are congruent to i modulo the number of groups. For this reason, any two groups from GROUPS differ in size by at most 1. Consequently, the initial partition results in \( \min \{\sqrt{t}, p\} \) groups.
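A minimal sketch of this initial partition (the function name is ours; it illustrates why group sizes differ by at most one):

```python
from math import isqrt

def initial_groups(p, t):
    """Sketch of the Groups-Together partition: station ids 1..p are split
    into g = min(sqrt(t), p) groups by their residue modulo g."""
    g = min(isqrt(t), p)
    groups = {i: [] for i in range(g)}
    for station in range(1, p + 1):
        groups[station % g].append(station)
    return groups

print(initial_groups(p=7, t=9))
# {0: [3, 6], 1: [1, 4, 7], 2: [2, 5]} -- sizes differ by at most 1
```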

Task assignment is the same as in Two-Lists, with the difference that now the algorithm operates on groups instead of single stations. In what follows, all the stations within a single group have the same tasks assigned and hence work together on exactly the same tasks. The round-robin schedule of consecutive broadcasts from Two-Lists also applies to Groups-Together, yet now it points to particular groups instead of single stations. Consequently, if a group broadcasts simultaneously and a collision signal (or a single transmission) is heard on the channel, this means that the tasks the group was responsible for have actually been performed and may be removed from list TASKS. However, if silence is heard, then we are sure that all the stations from the group have crashed.

Apart from the differences described above, Groups-Together is the same as Two-Lists. It is structured as a loop whose single iteration is called an epoch. An epoch is also structured as a repeat loop whose single iteration is called a phase. Phases contain three rounds, one of which is for transmission. If no transmission occurs in a phase, we call it silent; otherwise it is called noisy. The notions of dense and sparse epochs remain the same as in the Two-Lists analysis.

We finish this section with useful results from CKL [11] about Groups-Together.

Fact 4

([11], Lemma 4) Algorithm Groups-Together is reliable.

Fact 5

([11], Theorem 3) Algorithm Groups-Together solves Do-All with the minimal work \( \mathcal {O}(t + p\sqrt{t}) \) against the f-Bounded adversary, for any f such that \( 0 \le f < p \).

4 ROBAL: Random Order Balanced Allocation Lists

In this section we describe and analyze an algorithm for the Do-All problem in the presence of a Linearly-Ordered adversary on a channel without collision detection. Its expected work complexity is \( \mathcal {O}(t + p\sqrt{t}\log (p)) \), and it uses the Two-Lists procedure from [11] (cf. Sect. 3).

Algorithm 3, Algorithm 4, Algorithm 5 (pseudocode listings)

Fig. 3 (1) Initially we have p stations. (2) The adversary chooses f stations prone to crashes. (3) Then it declares the order according to which the stations will crash. (4) Mix-And-Test chooses a number of leaders, which are expected to be distributed uniformly within the adversary's linear order

ROBAL (Algorithm 3) works in such a way that, initially, it checks whether \( p^{2} > t \); if this does not hold (i.e., \( p \le \sqrt{t} \)), it executes the Two-Lists algorithm, whose complexity is linear in t for such parameters (see Fact 7). Otherwise, the main body of the algorithm is executed, where yet another specific condition is checked: \( \log _{2}(p) > e^{\frac{\sqrt{t}}{32}} \). If so, the algorithm assigns all tasks to each station. If every station has all the tasks assigned, then after t phases we may be sure that all the tasks are done, because at least one station always remains operational. Because of the specific range of the parameters, the redundant work in this case is acceptable (i.e., within the claimed bound) from the point of view of our analysis. However, we execute procedure Confirm-Work (Algorithm 4) in order to confirm this fact on the channel.

Confirm-Work is a type of leader election procedure. It assigns each station a certain probability of broadcasting, in such a way that exactly one station is expected to transmit within a number of trials. Because we cannot be sure what the actual number of operational stations is, the probability is changed multiple times, until all the tasks are confirmed.

If the specific conditions discussed above are not satisfied (Algorithm 3, lines 1–10), then Mix-And-Test is executed (Algorithm 5). It changes the order of stations on list \( \texttt {STATIONS} \): stations that performed successful broadcasts are moved to the front of that list. This procedure has two purposes. On the one hand, changing the order makes the adversary less flexible in crashing stations, as its order is already determined. On the other hand, we may predict, with high probability, to which interval \(n \in \left( \frac{p}{2^{i}}, \frac{p}{2^{i-1}}\right] \), for \( i = 1,\ldots , \lceil \log _{2}(p)\rceil \), the current number of operational stations belongs, which is important from the work analysis perspective.

Stations moved to the front of list STATIONS are called leaders. Leaders are chosen in a random process, so we expect that they will be uniformly distributed, in the adversary's order, among the stations that were not chosen as leaders. This allows us to assume that a crash of a leader is likely to be preceded by several crashes of other stations (see Fig. 3).

Let us consider procedure Mix-And-Test in detail. If n is the previously predicted number of operational stations, then each of the stations tosses a coin with probability of success equal to 1/n. If none or more than one of the stations broadcast, then silence is heard on the channel, as there is no collision detection. Otherwise, when exactly one station has successfully broadcast, it is moved to the front of list \( \texttt {STATIONS} \) and the procedure starts again with a decremented parameter. However, stations that have already been moved to the front do not take part in the following iterations of the procedure.
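A simplified sketch of this selection process (our own illustration, not the paper's pseudocode; for simplicity the estimate n is kept fixed across trials):

```python
import random

def mix_and_test_round(candidates, n_estimate):
    """One trial of Mix-And-Test: every non-leader station broadcasts with
    probability 1/n_estimate. Without collision detection, anything other
    than a single transmission is heard as silence."""
    transmitters = [v for v in candidates if random.random() < 1.0 / n_estimate]
    if len(transmitters) == 1:
        return transmitters[0]  # a successful broadcast: a new leader
    return None                 # silence: zero or at least two transmitters

def elect_leaders(stations, n_estimate, trials):
    leaders, candidates = [], list(stations)
    for _ in range(trials):
        leader = mix_and_test_round(candidates, n_estimate)
        if leader is not None:
            leaders.append(leader)     # moved to the front of list STATIONS
            candidates.remove(leader)  # leaders do not participate again
    return leaders

print(elect_leaders(list(range(1, 101)), n_estimate=100, trials=200))
```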

Upon having chosen the leaders, regular work is performed. An important feature of our algorithm, however, is that we do not perform full epochs, but only \( \sqrt{t} \) phases of each Two-Lists epoch. This allows us to be sure that the total work accrued in each epoch does not exceed \( p\sqrt{t} \). If, at some point, the number of successful broadcasts drops substantially, procedure Mix-And-Test (Algorithm 5) is executed again and a new set of leaders is chosen.

Before the algorithm execution, the Linearly-Ordered adversary has to choose f stations prone to crashes and declare the order in which those crashes may happen. In what follows, when there are unsuccessful broadcasts of leaders (crashes), we may be approaching the case where \( n \le \sqrt{t} \), and we can execute Two-Lists, whose complexity is linear in t for such parameters. Alternatively, the adversary spends the majority of its possible crashes, and the stations may finish all the tasks without any distractions.

4.1 Analysis of ROBAL

We begin our analysis with a general statement about the reliability of ROBAL.

Lemma 1

Algorithm ROBAL is reliable.

Proof

We need to show that all the tasks will be performed as a result of executing the algorithm. First of all, if we fall into the case where \( \frac{p}{2^{i}} \le \sqrt{t} \) (or initially \( p \le \sqrt{t} \)), then Two-Lists is executed, which is reliable as we know from [11].

Secondly, when \(\log _{2}(p) > e^{\frac{\sqrt{t}}{32}} \), we assign all the tasks to every station and let the stations work for t phases. We know that \( f < p \), so at least one station will perform all the tasks.

Finally, if those conditions do not hold, the algorithm runs an external loop in which variable i is incremented after each iteration. If the loop is performed \( \lceil \log _{2}(p) \rceil \) times, then we run Two-Lists. Variable i may fail to be incremented only if the algorithm enters and stays in the internal loop. However, this is possible only after all the tasks have been performed, because otherwise the internal loop runs only a constant number of times. \(\square \)

We now proceed to a statement bounding the worst case work of Two-Lists, which is used as a sub-procedure in ROBAL.

Fact 6

Two-Lists always solves the Do-All problem with \( \mathcal {O}(pt) \) work.

Work \(\mathcal {O}(pt) \) corresponds to the scenario in which every station performs every task; comparing this with how Two-Lists operates justifies the fact.

We have already mentioned that ROBAL is designed in such a way that, whenever \( \frac{p}{2^{i}}\le \sqrt{t} \) holds, the Two-Lists algorithm is executed, because the work complexity of Two-Lists for such parameters is \( \mathcal {O}(t) \). We prove this in the following fact.

Fact 7

Let n be the number of operational processors and t the number of outstanding tasks. Then for \( n \le \sqrt{t} \) the work complexity of Two-Lists is \( \mathcal {O}(t) \).

Proof

If \( n \le \sqrt{t} \), then the outstanding number of crashes satisfies \( f < n \), hence \( f < \sqrt{t} \). Algorithm Two-Lists has work complexity \( \mathcal {O}(t + p\sqrt{t} + p\,\min \{f, t\}) \). It follows that the complexity is \( \mathcal {O}(t + \sqrt{t}\sqrt{t} + \sqrt{t}\,\min \{\sqrt{t}, t\}) = \mathcal {O}(t) \). \(\square \)

Figure 3 presents the way we expect leaders to interleave with other stations in the adversary's order. The following lemma estimates the probability that, if a number of leaders were crashed, then overall a significant number of stations must have been crashed as well.

Lemma 2

Let us assume that we have n operational stations at the beginning of an epoch, of which \( \sqrt{t} \) were chosen as leaders. If the adversary crashes n/2 stations, then the probability that at least 3/4 of the overall number of leaders are among the crashed stations does not exceed \( e^{-\frac{1}{8} \sqrt{t}} \).

Proof

We have n stations, among which \( \sqrt{t} \) are leaders. The adversary crashes n/2 stations, and our question is: how many leaders were in this group?

The hypergeometric distribution with parameters N (the number of elements), K (the number of distinguished elements), l (the number of trials) and k (the number of successes) is given by:

$$\begin{aligned} \mathbb {P}[X = k] = \frac{\left( {\begin{array}{c}K\\ k\end{array}}\right) \left( {\begin{array}{c}N-K\\ l-k\end{array}}\right) }{\left( {\begin{array}{c}N\\ l\end{array}}\right) }. \end{aligned}$$

The following tail bound from [24] tells us that for any \( t > 0 \) and \( p = \frac{K}{N} \):

$$\begin{aligned} \mathbb {P}[X \ge (p + t)l] \le e^{-2t^{2}l}. \end{aligned}$$

Identifying this with our process, we have \( K = n/2 \), \( N = n \), \( l = \sqrt{t} \) and consequently \( p = 1/2 \). Taking \( t = 1/4 \) in the tail bound, we have that

$$\begin{aligned} \mathbb {P}\left[ X \ge \frac{3}{4}\sqrt{t}\right] \le e^{-\frac{1}{8}\sqrt{t}}. \end{aligned}$$

\(\square \)
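The bound is easy to sanity-check numerically; the following Monte Carlo sketch (our own illustration) samples the hypergeometric experiment from the proof:

```python
import random

def crashed_leader_fraction(n, leaders, trials=10_000):
    """Empirical probability that crashing n/2 of n stations hits at least
    3/4 of the `leaders` randomly placed leader stations (Lemma 2 setup)."""
    hits = 0
    for _ in range(trials):
        crashed = set(random.sample(range(n), n // 2))
        hit = sum(1 for v in random.sample(range(n), leaders) if v in crashed)
        if hit >= 0.75 * leaders:
            hits += 1
    return hits / trials

# For t = 400 tasks, sqrt(t) = 20 leaders among n = 200 stations the bound
# is e^{-sqrt(t)/8} ~ 0.082; the empirical frequency stays well below it.
print(crashed_leader_fraction(n=200, leaders=20))
```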

The following two lemmas give us the probability that Mix-And-Test diagnoses the number of operational stations properly, and hence, that the whole randomized part of ROBAL works correctly with high probability.

Lemma 3

Let us assume that the number of operational stations is in the interval \(\left( \frac{p}{2^{i}}, \frac{p}{2^{i-1}}\right] \). Then procedure Mix-And-Test(i, t, p) will return true with probability \( 1 - e^{-c\;\sqrt{t}\log _{2}(p)} \), for some \( 0< c < 1 \).

Proof

We start by proving the following claim.

Claim

Let the current number of operational stations be in \( \left( \frac{x}{2}, x\right] \). Then the probability of an event that in a single iteration of Mix-And-Test exactly one station will broadcast is at least \( \frac{1}{2\sqrt{e}} \) (where the \(\texttt {coin}^{-1}\) parameter is \( \frac{1}{x} \)).

Proof

(of the Claim) Let us consider a scenario where the number of operational stations is in \( \left( \frac{x}{2}, x\right] \) for some x. If every station broadcasts with probability of success equal to 1/x, then the probability of the event that exactly one station transmits is \( \left( 1 - \frac{1}{x}\right) ^{x-1} \ge 1/e \). Estimating the worst case, when there are \( \frac{x}{2} \) operational stations (and the probability of success remains 1/x), we have that

$$\begin{aligned} \frac{1}{2} \left( 1 - \frac{1}{x}\right) ^{\frac{x-2}{2}} \ge \frac{1}{2\sqrt{e}} \ . \end{aligned}$$

This concludes the proof of the Claim. \(\square \)
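As a quick numeric check of the Claim (our own illustration):

```python
def exactly_one(m, x):
    """Probability that exactly one of m stations transmits when each
    transmits independently with probability 1/x."""
    return m * (1.0 / x) * (1.0 - 1.0 / x) ** (m - 1)

for x in (10, 100, 1000):
    # the full population m = x, and the worst case m = x/2 from the Claim
    print(x, exactly_one(x, x), exactly_one(x // 2, x))
# exactly_one(x, x) tends to 1/e ~ 0.368, while exactly_one(x/2, x) stays
# above 1/(2*sqrt(e)) ~ 0.303, matching the bounds used in the proof.
```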

According to the Claim, the probability of the event that in a single round of Mix-And-Test exactly one station will be heard is at least \( \frac{1}{2\sqrt{e}} \).

We assume that \( n \in \left( \frac{p}{2^{i}}, \frac{p}{2^{i-1}}\right] \). We will show that the algorithm confirms the appropriate i with probability \( 1 - e^{-c\;\sqrt{t}\log _{2}p} \). For this purpose we need \( \sqrt{t} \) transmissions to be heard.

Let X be a random variable such that \( X = X_{1} + \cdots + X_{\sqrt{t}\log _{2}(p)}, \) where \( X_{1}, \ldots , X_{\sqrt{t}\log _{2}(p)} \) are Poisson trials and

$$\begin{aligned} X_{k} = \left\{ \begin{array}{ll} 1 &{} \text {if exactly one station was heard in trial }k\text {,}\\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$

We know that

$$\begin{aligned} \mu&= \mathbb {E}X = \mathbb {E}X_{1} + \cdots + \mathbb {E}X_{\sqrt{t}\log _{2}(p)} \\&\ge \frac{\sqrt{t}\log _{2}(p)}{2\sqrt{e}}. \end{aligned}$$

To estimate the probability that \( \sqrt{t} \) transmissions were heard we use Chernoff’s inequality: \( \mathbb {P}[X < (1-\epsilon )\mu ] \le e^{-\frac{\epsilon ^{2}\mu }{2}} \), for \( 0< \epsilon < 1 \).

We want to have that \( (1-\epsilon )\mu = \sqrt{t} \). Thus \( \epsilon = \frac{\mu - \sqrt{t}}{\mu } = \frac{\log _{2}(p) - 2\sqrt{e}}{\log _{2}(p)} \) and \( 0< \epsilon < 1 \) for sufficiently large p. Hence

$$\begin{aligned} \mathbb {P}[X < \sqrt{t}]\le & {} e^{-\frac{\left( \frac{\log _{2}(p) - 2\sqrt{e}}{\log _{2}(p)}\right) ^{2}}{2} \frac{\sqrt{t}\log _{2}(p)}{2\sqrt{e}}} \\= & {} e^{-c\;\sqrt{t}\log _{2}(p)} \ , \end{aligned}$$

for some bounded \( 0< c < 1 \). We conclude that with probability \( 1 - e^{-c\;\sqrt{t}\log _{2}(p)} \) we will confirm the correct i, which estimates the current number of operational stations. \(\square \)
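
The Chernoff step above is easy to instantiate numerically; a small sketch with illustrative t and p (names ours) follows.

```python
# Numeric instance of the Chernoff step in the proof of Lemma 3.
from math import exp, log2, sqrt

t, p = 10_000, 2 ** 20
mu = sqrt(t) * log2(p) / (2 * sqrt(exp(1)))    # lower bound on E[X]
eps = (log2(p) - 2 * sqrt(exp(1))) / log2(p)   # chosen so (1-eps)*mu = sqrt(t)
assert abs((1 - eps) * mu - sqrt(t)) < 1e-6
print(exp(-eps ** 2 * mu / 2))                 # bound on P[X < sqrt(t)]
```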

Lemma 4

With probability at least \( 1 - (\log _{2}(p))^{2}\,\max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log _{2}(p)} \} \), procedure \(\textsc {Mix{-}And{-}Test}(i, t, p) \) is not executed when there are more than \( \frac{p}{2^{i-1}} \) operational stations.

Proof

Let \( A_{i} \) denote the event that at the beginning of an execution of the Mix-And-Test(i, t, p) procedure there are no more than \( \frac{p}{2^{i-1}} \) operational stations.

The base case, when \( i = 0 \), is trivial, because initially we have p operational stations, thus \( \mathbb {P}(A_{0}) = 1 \). Let us consider an arbitrary i. We know that

$$\begin{aligned} \mathbb {P}(A_{i})= & {} \mathbb {P}(A_{i}|A_{i-1})\mathbb {P}(A_{i-1})\\&+\,\mathbb {P}(A_{i}|A^{c}_{i-1})\mathbb {P}(A^{c}_{i-1}) \ge \mathbb {P}(A_{i}|A_{i-1})\mathbb {P}(A_{i-1}). \end{aligned}$$

Let us estimate \( \mathbb {P}(A_{i}|A_{i-1}) \). Conditioned on the event \( A_{i-1} \), we know that when executing Mix-And-Test(\(i-1, t, p\)) we had at most \( \frac{p}{2^{i-2}} \) operational stations. Consequently, if we are now considering Mix-And-Test(i, t, p), then there are two options:

  1. Mix-And-Test(\(i-1, t, p\)) returned false,

  2. Mix-And-Test(\(i-1, t, p\)) returned true.

Let us examine what these cases mean:

  1. If the procedure returned false, then we know from Lemma 3 that with probability \( 1 - e^{-c\;\sqrt{t}\log _{2}(p)} \) there were no more than \( \frac{p}{2^{i-1}} \) operational stations. If that number were in \( \left( \frac{p}{2^{i-1}}, \frac{p}{2^{i-2}}\right] \), then the probability of returning false would be less than \( e^{-c\;\sqrt{t}\log _{2}(p)} \).

  2. If the procedure returned true, this means that when executing it with parameters \( (i-1, t, p) \) we had no more than \( \frac{p}{2^{i-1}} \) operational stations. Then the internal loop of ROBAL was broken, so according to Lemma 2 we conclude that the overall number of operational stations had to be reduced by half with probability at least \( 1 - e^{-\frac{1}{8}\sqrt{t}} \).

Consequently, we deduce that \( \mathbb {P}(A_{i}|A_{i-1}) \ge 1 - \max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log _{2}(p)} \} \). Hence \( \mathbb {P}(A_{i}) \ge (1 - \max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log _{2}(p)} \})^{i} \). Together with the fact that \( i \le \log _{2}(p) \), Bernoulli’s inequality gives

$$\begin{aligned} \mathbb {P}(A_{i}) \ge 1 - \log _{2}(p)\,\max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log _{2}(p)} \}. \end{aligned}$$

We conclude that the probability that the conjunction of events \( A_{1},\ldots ,A_{\log _{2}(p)} \) will hold is at least

$$\begin{aligned}&\mathbb {P}\left( \bigcap _{i = 1}^{\log _{2}(p)}A_{i}\right) \\&\quad \ge 1 - (\log _{2}(p))^{2}\,\max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log _{2}(p)} \}, \end{aligned}$$

which ends the proof. \(\square \)

We can now proceed to the main result of this section.

Theorem 1

ROBAL performs \(\mathcal {O}(t + p\sqrt{t}\log (p)) \) expected work against the Linearly-Ordered adversary in the channel without collision detection.

Proof

In the algorithm we constantly control whether the condition \( \frac{p}{2^{i}} > \sqrt{t} \) holds. If not, then we execute Two-Lists, whose complexity is \( \mathcal {O}(t) \) for such parameters.

If this condition does not hold initially, then we check another one, i.e., whether \( \log _{2}(p) > e^{\frac{\sqrt{t}}{32}} \) holds. For such a configuration we assign all the tasks to every station. The work accrued during such a procedure is \( \mathcal {O}(pt) \). However, when \( \log _{2}(p) > e^{\frac{\sqrt{t}}{32}} \), then together with the fact that \( e^{\frac{\sqrt{t}}{32}} > t \) for sufficiently large t, we have that \( \log _{2}(p) > t \) and consequently the total complexity is \( \mathcal {O}(p\log (p)) \).

Finally, the successful stations that performed all the tasks have to confirm this fact. We require that exactly one station transmits, and once this happens, the algorithm terminates. The expected value of a geometric random variable lets us conclude that this confirmation happens within an expected number of \( \mathcal {O}(\log (p)) \) rounds, generating \( \mathcal {O}(p\log (p)) \) work.

When none of the conditions mentioned above holds, we proceed to the main part of the algorithm. The testing by the Mix-And-Test procedure, over the disjoint cases where \( n \in \left( \frac{p}{2^{i}}, \frac{p}{2^{i-1}}\right] \), requires work that can be estimated by \( \mathcal {O}(p \sqrt{t}\log (p)) \), as there are \( \sqrt{t}\log _{2}(p) \) testing phases in each case and at most \( \frac{p}{2^{i}} \) stations take part in a single testing phase of a certain case.

In the algorithm we run through the disjoint cases where \( n \in \left( \frac{p}{2^{i}}, \frac{p}{2^{i-1}}\right] \). From Lemma 2 we know that when some of the leaders were crashed, then a proportional number of all the stations had to be crashed. When leaders are crashed but the number of operational stations still remains in the same interval, then the smallest number of tasks is confirmed if only the initial segment of stations transmits. As a result, when half of the leaders were crashed, the system still confirms \( \frac{t}{8} = \varOmega (t) \) tasks. This means that even if so many crashes occurred, \( \mathcal {O}(1) \) epochs still suffice to do all the tasks. Summing work over all the cases may be estimated as \( \mathcal {O}(p\sqrt{t}) \).

By Lemma 4 we conclude that the expected work complexity is bounded by:

$$\begin{aligned}&\left( (\log (p))^{2}\,\max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log (p)} \}\right) \\&\quad \mathcal {O}(pt + p\sqrt{t}\log ^{2}(p))\\&\quad + \left( 1 - (\log (p))^{2}\,\max \{ e^{-\frac{1}{8}\sqrt{t}} ,e^{-c\;\sqrt{t}\log (p)} \}\right) \\&\quad \mathcal {O}(p\sqrt{t}\log (p)) = \mathcal {O}(p\sqrt{t}\log (p)), \end{aligned}$$

where the first expression comes from the fact that if we entered the main loop of the algorithm, then we know that we are in a configuration where \( \log _{2}(p) \le e^{\frac{\sqrt{t}}{32}} \). Thus we have that

$$\begin{aligned} \frac{pt + p\sqrt{t}\log ^{2}(p)}{e^{\frac{\sqrt{t}}{8}}} = \frac{pt + p\sqrt{t}\log ^{2}(p)}{e^{\frac{\sqrt{t}}{16}}e^{\frac{\sqrt{t}}{16}}} \le \frac{p + p\log ^{2}(p)}{e^{\frac{\sqrt{t}}{16}}} \le p + p\log (p) = \mathcal {O}(p\log (p)), \end{aligned}$$

which ends the proof. \(\square \)

5 GruTEch: Groups Together with Echo

In this section we present a randomized algorithm designed to reliably perform Do-All in the presence of a Weakly-Adaptive adversary on a shared channel without collision detection. Its expected work complexity is \( \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), t\}\log (p)) \).

5.1 Description of GruTEch

Our solution is built on algorithm Groups-Together (details in Sect. 3.2) and a newly designed Crash-Echo procedure that works as a fault-tolerant replacement for the collision detection mechanism (which is not present in the model). In fact, the algorithm presented here is asymptotically only a logarithmic factor away from the lower bound shown in [11], which, to some extent, answers the open question stated therein.

[Algorithm pseudocode (figures f–i) omitted.]

The Crash-Echo procedure. Let us recall the details of Groups-Together from Sect. 3.2. All the stations within a certain group have the same tasks assigned, and when it comes to transmitting, they do it simultaneously. This strongly relies on the collision detection mechanism: the stations do not necessarily need to know which station transmitted, but they need to know that there is progress in task performance. That is why, if a collision is heard and all the stations within the same group were doing the same tasks, we can deduce that those tasks were actually done.

In our model we do not have collision detection; however, we designed a mechanism that provides the same feedback without contributing too much work to the algorithm’s complexity. Strictly speaking, we begin with choosing a leader, whose work is of dual significance. On one hand, he belongs to some group and performs tasks regularly. On the other hand, he also performs additional transmissions in order to indicate whether there was progress when stations transmitted.

When a group of stations is scheduled to broadcast, the Crash-Echo procedure is executed. It consists of two rounds: in the first, the whole group transmits together with the leader, and in the second, only the leader transmits. We may hear two types of signals:

  • loud: a legible, single transmission was heard. Exactly one station transmitted.

  • silent: a signal indistinguishable from the background noise is heard. Either no station or more than one station transmitted.

Let us examine the possible pairs (group & leader, leader) of signals heard in such an approach (see the sketch after this list):

  • (silent, loud): in the latter round the leader is operational, so he must have been operational in the former round as well. Because silence was heard in the former round, at least two stations (one of which was the leader) transmitted simultaneously, so the group’s tasks were performed. This is a fully successful case.

  • (loud, loud): the former and the latter round were loud, so we conclude that it was the leader who transmitted in both rounds. If the leader belonged to the group scheduled to transmit, then we have progress; otherwise not.

  • (silent, silent): if both rounds were silent, we cannot be sure whether there was any progress. Additionally, we need to elect a new leader.

  • (loud, silent): when the former round was loud but the latter silent, we cannot be sure whether the tasks were performed; a new leader needs to be chosen.
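
The case analysis above can be summarized in a few lines of code. The sketch below only illustrates the decision logic; the names (Signal, interpret, leader_in_group) are ours, not taken from the paper’s pseudocode.

```python
from enum import Enum

class Signal(Enum):
    LOUD = "loud"      # exactly one legible transmission
    SILENT = "silent"  # background noise: zero or at least two transmitters

def interpret(group_round: Signal, leader_round: Signal, leader_in_group: bool):
    """Map a Crash-Echo signal pair to (progress_confirmed, need_new_leader)."""
    if leader_round is Signal.LOUD:          # the leader is operational
        if group_round is Signal.SILENT:
            return True, False               # >= 2 transmitters incl. leader: progress
        return leader_in_group, False        # only the leader was heard in round one
    return False, True                       # leader possibly crashed; progress uncertain
```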

Recall that the Weakly-Adaptive adversary has to declare some f stations that are prone to crashes. The elected leader might belong to that subset and be crashed at some point. When this is detected, the algorithm has to switch to the Elect-Leader mode in order to select another leader. Consequently, the most significant question from the point of view of the algorithm’s analysis is the expected number of trials needed to choose a non-faulty leader.

Two modes. We need to select a leader and be sure that he is operational in order to have our progress indicator working instead of the collision detection mechanism. When the leader is operational we simply run Groups-Together algorithm with the difference that instead of a simultaneous transmission by all the stations within a group, we run the Crash-Echo procedure that allows us to distinguish whether there was progress.

Choosing the leader is performed by procedure Elect-Leader, where each station tosses a coin with the probability of success equal to 1 / p. If a station is successful, then it transmits in the following round. If exactly one station transmits, then the leader is chosen. Otherwise, the experiment is continued (for p rounds in total). If this still does not work, then the first station that transmits in a round-robin procedure becomes the leader.

Note that we have a special variable i used as a counter that is incremented in the Elect-Leader procedure until it reaches value p. We assume that this value is passed to Elect-Leader by reference, so that its incrementation is also recognized in the main body of the Epoch-Groups-CE algorithm, thus i is a global counter.
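A compact sketch of this election (under our own toy channel abstraction, with the counter i passed in and returned explicitly instead of by reference) could look as follows.

```python
import random

class Channel:
    """Toy shared channel without collision detection: a transmission is
    heard only if exactly one operational station transmits in a round."""
    def __init__(self, operational):
        self.operational = set(operational)
    def round(self, transmitters):
        alive = [s for s in transmitters if s in self.operational]
        return alive[0] if len(alive) == 1 else None

def elect_leader(stations, p, channel, i):
    # Phase 1: randomized symmetry breaking, at most p rounds in total
    # (i is the global round counter shared with the main algorithm).
    while i < p:
        i += 1
        candidates = [s for s in stations if random.random() < 1 / p]
        heard = channel.round(candidates)
        if heard is not None:
            return heard, i
    # Phase 2: round-robin fallback; the first station heard becomes leader.
    for s in sorted(stations):
        if channel.round([s]) is not None:
            return s, i
    raise RuntimeError("all stations crashed")

# Example: leader, i = elect_leader(range(8), 8, Channel({2, 5, 7}), 0)
```
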

5.2 Analysis of GruTEch

Let us begin the analysis of GruTEch by recalling an important result from [11].

Theorem 2

([11], Theorem 6) The Weakly-Adaptive f-Bounded adversary can force any reliable randomized algorithm solving Do-All in the channel without collision detection to perform the expected work

$$\begin{aligned} \varOmega (t + p\sqrt{t} + p\;\min \{p/(p-f), t\}). \end{aligned}$$

In fact, the theorem above was stated in [11] with the lower bound \( \varOmega (t + p\sqrt{t} + p\;\min \{f/(p-f), t\}) \); however, the proof relied on the expected round in which the first successful transmission takes place, and the authors did not take into consideration that this round has number at least 1. Hence we correct the bound as follows: \( \frac{f}{p-f} + 1 = \frac{f}{p-f} + \frac{p-f}{p-f} = \frac{p}{p-f}.\)

Lemma 5

GruTEch is reliable.

Proof

The reliability of GruTEch is a consequence of the reliability of Groups-Together. We do not make any changes in the core of the algorithm. Crash-Echo does not affect reliability, as it always finishes. The Elect-Leader procedure always finishes as well: its first loop is executed at most p times and then it ends, while its second loop awaits a broadcast in a round-robin manner. Since \( 0 \le f \le p - 1 \), at least one station always remains operational and will respond. \(\square \)

Let us define a sustainable leader as a station that was elected as a leader during some execution of procedure Elect-Leader and is either non-faulty or remains operational until the end of the execution.

Lemma 6

The total number of rounds during which procedure Elect-Leader is run (possibly split across several executions) until a sustainable leader is elected is at most \( \log (p) \frac{4p}{p-f} \) with probability at least \(1 - \frac{1}{p} \).

Proof

Recall that procedure Elect-Leader may be called several times, until a sustainable leader is selected at the latest. The total number of rounds needed to elect a sustainable leader during these calls is upper bounded by the number of rounds needed to hit the first non-faulty station across the executions of procedure Elect-Leader. Hence, in the remainder of the proof we estimate the total number of such rounds, with probability at least \(1-1/p\).

We have p stations, of which f are prone to crashes. Hence we have \( p - f \) non-faulty stations, and the probability that a non-faulty one responds in the election procedure is at least \( (p-f)/p \). We may observe that this probability only increases after failures in previous executions; in fact, after f executions we may be sure to choose a non-faulty leader. We will therefore estimate our process by the event of awaiting the first success in a sequence of trials, as our process is stochastically dominated by such a geometric process.

We have a channel without collision detection, so exactly one station has to transmit in order to elect a leader. Let x be the actual number of operational stations. The probability s of the event that a non-faulty station will be elected in the procedure may be estimated as follows:

$$\begin{aligned} \frac{p-f}{p} \left( 1 - \frac{1}{p}\right) ^{x-1}\ge & {} \frac{p-f}{p} \left( 1 - \frac{1}{p}\right) ^{p-1}\\\ge & {} \frac{p-f}{p}\cdot \frac{1}{4}\cdot \frac{p}{p-1}\\= & {} \frac{p-f}{4(p-1)} \ge \frac{p-f}{4p}. \end{aligned}$$

Let us estimate the probability of awaiting the first success in a number of trials. Let \( X \sim Geom((p-f)/4p) \). We know that for a geometric random variable with the probability of success equal s:

$$\begin{aligned} \mathbb {P}(X \ge i) = (1 - s)^{i-1}. \end{aligned}$$

Applying this to our case with \( i = \frac{4p}{p-f}\log (p) + 1 \) we have that

$$\begin{aligned}&\mathbb {P}\left( X \ge \frac{4p}{p-f}\log (p) + 1\right) \\&\quad = \left( 1 - \frac{1}{\frac{4p}{p-f}} \right) ^{\frac{4p}{p-f}\log (p)} \le e^{-\log (p)} = \frac{1}{p}. \end{aligned}$$

Thus the probability of a complementary event is

$$\begin{aligned} \mathbb {P}\left( X < \frac{4p}{p-f}\log (p) + 1\right) > 1 - \frac{1}{p}. \end{aligned}$$

\(\square \)
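
For concreteness, the tail computation can be evaluated for sample values of p and f; the sketch below (parameters are illustrative, not from the paper) reproduces the \( 1/p \) bound.

```python
# Numeric instance of the geometric tail used in Lemma 6.
from math import log

p, f = 1024, 768
s = (p - f) / (4 * p)                # lower bound on the success probability
rounds = (4 * p / (p - f)) * log(p)  # the number of rounds from the lemma
print((1 - s) ** rounds, 1 / p)      # the former is <= the latter
```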

Theorem 3

GruTEch solves Do-All in the channel without collision detection with the expected work \( \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), t\}\log (p)) \) against the Weakly-Adaptive f-Bounded adversary.

Proof

We may divide the work of GruTEch into three components: productive work, failing work, and the work resulting from electing the leader.

Firstly, the core of our algorithm is the same as in Groups-Together, with the difference that the Crash-Echo procedure takes twice as many transmission rounds. According to Fact 5, it is sufficient to estimate this kind of work as \( \mathcal {O}(t + p\sqrt{t})\).

Secondly, there is some work that results from electing the leader. According to Lemma 6, a sustainable leader is chosen within \( \frac{4p}{p-f}\log (p) \) rounds of executing Elect-Leader with high probability. That is why the expected work to elect a non-faulty leader is overall \( \mathcal {O}\left( p\;\frac{p}{p-f}\log (p)\right) \).

Finally, there is some amount of failing work that results from rounds in which the Crash-Echo procedure indicated that the leader was crashed. However, the work accrued during such rounds does not exceed the work resulting from electing the leader, hence failing work contributes \( \mathcal {O}\left( p\;\frac{p}{p-f}\log (p)\right) \) as well.

Consequently, we may estimate the expected work of GruTEch as

$$\begin{aligned}&\left( 1 - \frac{1}{p}\right) \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), t\}\log (p))\\&\qquad +\,\frac{1}{p}\mathcal {O}(p^{2})\\&\quad = \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), t\}\log (p)) \end{aligned}$$

which ends the proof. \(\square \)

6 How GruTEch works for other partial orders

The line of investigation originated by ROBAL and GruTEch leads to a natural question: can intermediate partial orders restricting the adversary yield different work complexities? In this section we answer this question in the affirmative by examining the GruTEch algorithm against the k-Chain-Ordered adversary on a channel without collision detection.

6.1 The lower bound

Theorem 4

For any reliable randomized algorithm solving Do-All on the shared channel and any integer \(0<k\le f\), there is a k-chain-based partial order of f elements such that the ordered adversary restricted by this order can force the algorithm to perform the expected work \(\varOmega (t+p\sqrt{t}+p\min \{k,f/(p-f),t\})\).

Proof

The part \(\varOmega (t+p\sqrt{t})\) follows from the absolute lower bound on reliable algorithms on a shared channel. We prove the remaining part of the formula. If \(k > c\cdot f/(p-f)\), for some constant \(0<c<1\), then that part is asymptotically dominated by \(p\min \{f/(p-f),t\}\) and it is enough to take the order being an anti-chain of f elements; clearly it is a k-chain-based partial order of f elements, and the adversary restricted by this order is equivalent to the weakly-adaptive adversary, for which the lower bound \(\varOmega (p\min \{f/(p-f),t\})\) follows directly from Theorem 2. Therefore, in the remainder of the proof, assume \(k\le c\cdot f/(p-f)\).

Consider the following strategy of the adversary in the first \(\tau \) rounds, for some value \(\tau \) to be specified later. Each station that wants to broadcast alone in a round is crashed at the beginning of this round, just before its intended transmission. Let \({\mathcal {F}}\) be the family of all subsets of stations containing k / 2 elements. Let \({\mathcal {M}}\) denote the family of all partial orders consisting of k independent chains of roughly (modulo rounding) f / k elements each. Consider the first \(\tau =k/2\) rounds. The probability \(\Pr (F)\), for \(F\in {\mathcal {F}}\), is defined to be the probability of an occurrence of an execution during the experiment in which exactly the stations from set F are failed by round \(\tau \). Consider an order M selected uniformly at random from \({\mathcal {M}}\). The probability that all elements of set \(F\in {\mathcal {F}}\) are in M is a non-zero constant. This follows from three observations. First, under our assumption, \(k<f\) (as \(k\le c\cdot f/(p-f)\) for some \(0<c<1\)). Second, from the proof of the lower bound in [11] with respect to sets of size O(f), the probability is a non-zero constant provided that in each round we have at most \(c'\cdot f\) crashed processes, for some constant \(0<c'<1\). Third, since each successful station can force the adversary to fail at most one chain, after each of the first \(\tau =k/2\) rounds there are still at least k / 2 chains without any crash, hence at most f / 2 crashes have been enforced and the argument from the lower bound in [11] applies. To conclude the proof, observe that a non-zero probability of not hitting any element outside M means that there is an \(M\in {\mathcal {M}}\) such that, with constant probability, the algorithm does not finish before round \(\tau \), thus imposing expected work \(\varOmega (pk)\). \(\square \)

6.2 GruTEch against the k-Chain-Ordered adversary

The analysis of GruTEch against the Weakly-Adaptive adversary relied on electing a leader. Precisely, since there are \( p - f \) non-faulty stations in an execution, we expect to elect a non-faulty leader within a certain number of trials.

Nevertheless, we could have chosen a faulty station as a leader, and the adversary could have chosen to crash that station. However, the number of such failing occurrences does not exceed the number of trials needed to elect a non-faulty one. When considering the k-Chain-Ordered adversary, these estimates are different.

When a leader is elected, he may belong to the non-faulty set (and this is expected to happen within a certain number of trials) or he may be elected from the faulty set, and thus be placed somewhere in the adversary’s partial order. Since the leader is elected in a random process, he appears in a random part of this order. Consequently, we may expect that if the adversary decides to crash the leader, then it is forced to crash several stations preceding the leader in one of the chains of its partial order. This is the key reason why the expected work complexity changes against the k-Chain-Ordered adversary.

Theorem 5

GruTEch solves Do-All in the channel without collision detection with the expected work

$$\begin{aligned} \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), k, t\}\log (p)) \end{aligned}$$

against the k-Chain-Ordered adversary.

Proof

By the same arguments as in Theorem 3, a non-faulty leader is chosen within \( \mathcal {O}\left( \frac{p}{p-f}\log (p)\right) \) trials in expectation, generating \( \mathcal {O}\left( p\;\frac{p}{p-f}\log (p)\right) \) work.

On the other hand, let us consider the work accrued in phases when the leader is chosen from the faulty set and hence may be crashed by the adversary. According to the adversary’s partial order, we initially have k chains, where chain j has length \( l_{j} \). If the leader was chosen from that order, then he belongs to one of the chains. We will show that the chosen leader is expected to be placed somewhere in the middle of that chain.

Let \( X_{j} \) be a random variable representing the position of the leader in chain j. We have that \( \mathbb {E}X_{j} = \sum _{i=1}^{l_{j}} \frac{i}{l_{j}} = \frac{1}{l_{j}} \cdot \frac{(1+l_{j})l_{j}}{2} = \frac{1+l_{j}}{2}. \)

We can see that if the leader was crashed, then half of the stations forming the chain (in expectation) were crashed as well. If at some other points of time faulty leaders are again chosen from the same chain, then by simple induction we may conclude that the chain is expected to be entirely crashed after \( \mathcal {O}(\log (p)) \) such events, as a single chain has length \( \mathcal {O}(p) \) at most. Consequently, if there are k chains, then after \( \mathcal {O}(k\log (p)) \) steps this process ends and we may be sure to choose a leader from the non-faulty subset, because the adversary will have spent all its failure possibilities.
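
The halving argument can be illustrated by a short simulation: if each elected faulty leader sits at a uniformly random position of the remaining part of its chain, the number of leader crashes needed to exhaust a chain of length l grows only logarithmically in l. The sketch below is ours and only illustrates this intuition.

```python
import random

def crashes_to_exhaust(l, trials=10_000):
    """Average number of leader crashes until a chain of length l is gone,
    assuming each leader is uniform over the remaining part of the chain."""
    total = 0
    for _ in range(trials):
        remaining = l
        while remaining > 0:
            pos = random.randint(1, remaining)  # leader's position in the chain
            remaining -= pos                    # the whole prefix is crashed
            total += 1
    return total / trials

print(crashes_to_exhaust(2 ** 20))  # about ln(2^20) ~ 14, i.e., logarithmic in l
```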

Finally, when a non-faulty leader is in place, the work accrued is asymptotically the same as in the Groups-Together algorithm, with the difference that each step is now simulated by the Crash-Echo procedure. This work equals \( \mathcal {O}(t + p\sqrt{t}) \).

Altogether, taking Lemma 6 into consideration, the expected work performance of GruTEch against the k-Chain-Ordered adversary is

$$\begin{aligned}&\left( 1 - \frac{1}{p}\right) \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), k, t\}\log (p))\\&\qquad +\,\frac{1}{p}\mathcal {O}(p^{2})\\&\quad = \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), k, t\}\log (p)) \end{aligned}$$

which ends the proof. \(\square \)

6.3 GruTEch against the adversary limited by an arbitrary order

Finally, let us consider the adversary that is limited by an arbitrary partial order \(P=(P,\succ )\). We say that two partially ordered elements x and y are incomparable if neither \(x \succ y \) nor \(y\succ x\) holds. Translated into the considered model, this means that the adversary may crash incomparable elements in any sequence during the execution of the algorithm (clearly, only if x and y are among the f stations chosen to be crash-prone).

Theorem 6

GruTEch solves Do-All in the channel without collision detection with the expected work

$$\begin{aligned} \mathcal {O}(t + p\sqrt{t} + p\;\min \{p/(p-f), k, t\}\log (p)) \end{aligned}$$

against the k-Thick-Ordered adversary.

Proof

We assume that the crashes forced by the adversary are constrained by some partial order P. Let us first recall the following lemma.

Lemma 7

(Dilworth’s theorem [15]) In a finite partial order, the size of a maximum anti-chain is equal to the minimum number of chains needed to cover all elements of the partial order.

Recall that the k-Thick-Ordered adversary is constrained by an order of thickness k. Clearly, the adversary, by choosing some f stations to be crashed, cannot increase the size of the maximum anti-chain. Thus, using Lemma 7, we consider a coverage of the crash-prone stations by at most k disjoint chains; any dependencies between the chains’ elements only create additional constraints for the adversary compared to the k-Chain-Ordered one. Hence we fall into the case concluded in Theorem 5, which completes the proof. \(\square \)

7 GILET: Groups with Internal Leader Election Together

In this section we introduce an algorithm for the channel without collision detection that is designed to work efficiently against the 1-RD adversary. Its expected work complexity is \( \mathcal {O}(t + p\sqrt{t}\log ^{2}(p)) \). The algorithm makes use of previously designed solutions from [11], i.e., the Groups-Together algorithm (cf. Sect. 3.2); however, we implement a major change in how the stations confirm their work (due to the lack of collision detection in the model).

[Algorithm pseudocode (figures j–m) omitted.]

In our model, the channel has no collision detection. That is why, whenever some group g is scheduled to broadcast, a leader election procedure Mod-Confirm-Work is executed in order to hear a successful transmission of exactly one station. Because all the stations within g had the same tasks assigned, if a leader is chosen, then we know that the group performed the appropriate tasks.

The inherent cost of such an approach to confirming work is that we may not be sure whether the removed groups really crashed. The effect is that if not all the tasks were performed and all the stations were found crashed, then we have to execute an additional procedure that finishes performing them reliably.

This is realized by a new list REMOVED containing removed stations, and by procedure Check-Outstanding, which assigns every outstanding task to all the stations. Then, even if with small probability we have mistakenly removed some operational stations, the algorithm still remains reliable and efficient.

7.1 Analysis of GILET

Lemma 8

GILET is reliable.

Proof

As in the case of GruTEch, reliability follows from the reliability of algorithm Groups-Together, since procedure Mod-Confirm-Work always terminates. If we make the mistake of removing some operational station from list GROUPS, then we execute procedure Check-Outstanding, which finishes all the outstanding tasks. \(\square \)

Lemma 9

Assume that the number of operational stations within a group is in the interval \(\left( \frac{k}{2^{i+1}}, \frac{k}{2^{i}}\right] \) and the \( \texttt {coin} \) parameter is set to \( \frac{k}{2^{i}} \). Then during \( \textsc {Mod{-}Confirm{-}Work} \) a confirming-work broadcast is performed with probability at least \( 1 - \frac{1}{p} \).

Proof

We assume that the number of operational stations is in \( \left( \frac{k}{2^{i+1}}, \frac{k}{2^{i}}\right] \). The probability that exactly one station broadcasts, estimated from the worst case where only \( \frac{k}{2^{i+1}} \) stations are operational, is at least \( \frac{1}{2\sqrt{e}} \), for the same reason as in the Claim within the proof of Lemma 3.

That is why we investigate the first occurrence of success in a sequence of trials with the probability of success equal to \( \frac{1}{2\sqrt{e}} \).

Let \( X \sim Geom\left( \frac{1}{2\sqrt{e}}\right) \). We know that for a geometric random variable with the probability of success equal s:

$$\begin{aligned} \mathbb {P}(X \ge i) = (1 - s)^{i-1}. \end{aligned}$$

Hence we will apply it for \( i = 2\sqrt{e}\log (p) + 1 \). We have that

$$\begin{aligned}&\mathbb {P}(X \ge 2\sqrt{e}\log (p) + 1)\\&\quad = \left( 1 - \frac{1}{2\sqrt{e}}\right) ^{2\sqrt{e}\log (p)} \le e^{-\log (p)} = \frac{1}{p}. \end{aligned}$$

Thus

$$\begin{aligned} \mathbb {P}(X < 2\sqrt{e}\log (p) + 1) > 1 - \frac{1}{p}. \end{aligned}$$

\(\square \)

Theorem 7

GILET performs \( \mathcal {O}(t + p\sqrt{t}\log ^{2}(p)) \) expected work on channel without collision detection against the 1-RD adversary.

Proof

The proof of the work performance of Groups-Together from [11] states that noisy sparse epochs contribute \( \mathcal {O}(t) \) to work and silent sparse epochs contribute \( \mathcal {O}(p\sqrt{t}) \). Dense epochs also contribute \( \mathcal {O}(p\sqrt{t}) \) work. Let us compare this with our solution.

Noisy sparse epochs contribute \( \mathcal {O}(t) \) because these are phases with successful broadcasts; since there are t tasks to perform, at most t such transmissions are necessary.

Silent sparse epochs, as well as dense epochs, consist of mixed work: effective and failing. In our case, each attempt at transmitting is now simulated by \( \mathcal {O}(\log ^{2}(p)) \) rounds, so the amount of work is asymptotically multiplied by this factor. Hence the work accrued during silent sparse and dense epochs contributes \( \mathcal {O}(p\sqrt{t}\log ^{2}(p)) \).

However, according to Lemma 9, with some small probability we could have mistakenly removed a group of stations from list GROUPS because Mod-Confirm-Work was silent. Eventually the list of groups may become empty while some tasks are still outstanding. In such a case we execute Check-Outstanding, where all the stations have the same outstanding tasks assigned and do them for \( |\texttt {TASKS}| \) phases (which actually means until they are all done). Since at least one station always remains operational, all the tasks will be performed. The work contributed in this case is at most \( \mathcal {O}(pt) \).

Let us now estimate the expected work:

$$\begin{aligned}&\left( 1 - \frac{1}{p}\right) \mathcal {O}(t + p\sqrt{t}\log ^{2}(p)) + \frac{1}{p}\mathcal {O}(pt)\\&\quad = \mathcal {O}(t + p\sqrt{t}\log ^{2}(p)), \end{aligned}$$

which completes the proof. \(\square \)

8 Transition to the beeping model

Up to this point we considered a communication model based on a shared channel, with the distinction that collision detection is not available. In this section we consider the beeping model.

In the beeping model we distinguish two types of signals. One is silence, where no station transmits. The other is a beep, which, when heard, indicates that at least one station transmitted. The model differs from the channel with collision detection by providing slightly different feedback, but, as we show, it has the same complexity with respect to reliable Do-All. More precisely, we show that the feedback provided by the beeping channel allows executing algorithm Groups-Together (cf. Sect. 3.2) and that it is work optimal as well.
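
The difference between the feedback models used in this paper can be made explicit with a tiny helper. The naming below is ours; "single" stands for a legible message, which in our base model additionally comes with the transmitter’s acknowledgment.

```python
def feedback(num_transmitters: int, model: str) -> str:
    """What operational stations hear in a round, per communication model."""
    if model == "beeping":
        # Beeps carry no content: one transmitter sounds like many.
        return "beep" if num_transmitters >= 1 else "silence"
    if num_transmitters == 1:
        return "single"          # a legible message is delivered to all
    if model == "collision_detection":
        return "collision" if num_transmitters >= 2 else "silence"
    # Without collision detection: 0 and >= 2 transmitters are identical.
    return "silence"
```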

8.1 Lower bound

We state the lower bound for Do-All in the beeping model in the following lemma.

Lemma 10

A reliable, possibly randomized, algorithm in the beeping communication model performs work \( \varOmega (t + p\sqrt{t}) \) in an execution in which no failures occur.

Proof

The proof is an adaptation of the proof of Lemma 1 from [11] to the beeping model. Let \( \mathcal {A} \) be a reliable algorithm. The part \( \varOmega (t) \) of the bound follows from the fact that every task has to be performed at least once in any execution of \( \mathcal {A} \).

Task \( \alpha \) is confirmed at round i of an execution of algorithm \( \mathcal {A} \), if either a station performs a beep successfully and it has performed \( \alpha \) by round i, or at least two stations performed a beep simultaneously and all of them have performed task \( \alpha \) by round i of the execution. All of the stations broadcasting at round i and confirming \( \alpha \) have performed it by then, so at most i tasks can be confirmed at round i. Let \( \mathcal {E}_{1} \) be an execution of the algorithm when no failures occur. Let station v come to a halt at some round j in \( \mathcal {E}_{1} \).

Claim: The tasks not confirmed by round j were performed by v itself in \( \mathcal {E}_{1} \).

Proof

(of the Claim) Suppose, to the contrary, that this is not the case, and let \( \beta \) be such a task. Consider an execution, say \( \mathcal {E}_{2} \), obtained by running the algorithm, crashing every station that performed task \( \beta \) in \( \mathcal {E}_{1} \) just before it was to perform \( \beta \), and crashing all the remaining stations, except for v, at step j. The signals heard on the channel are the same during the first j rounds in \( \mathcal {E}_{1} \) and \( \mathcal {E}_{2} \). Hence all the stations perform the same tasks in \( \mathcal {E}_{1} \) and \( \mathcal {E}_{2} \) till round j. The definition of \(\mathcal {E}_{2} \) is consistent with the power of the Unbounded adversary. The algorithm is not reliable, because task \( \beta \) is not performed in \( \mathcal {E}_{2} \) while station v is operational. This justifies the claim. \(\square \)

We estimate the contribution of the station v to work. The total number of tasks confirmed in \( \mathcal {E}_{1} \) is at most

$$\begin{aligned} 1+2+\cdots +j=\mathcal {O}(j^2)\ . \end{aligned}$$

Suppose some \( t' \) tasks have been confirmed by round j; since at most \( \mathcal {O}(j^{2}) \) tasks can be confirmed by then, we have \( j = \varOmega (\sqrt{t'}) \). The remaining \( t-t' \) tasks have been performed by v itself. The work of v is therefore at least

$$\begin{aligned} \varOmega (\sqrt{t'}+(t-t'))=\varOmega (\sqrt{t})\ , \end{aligned}$$

which completes the proof. \(\square \)

8.2 How algorithm Groups-Together works in the beeping model

Collision detection was a significant part of algorithm Groups-Together, as it provided the possibility of taking advantage of simultaneous transmissions. Because common knowledge about the tasks assigned to groups of stations is maintained, we were not interested in the content of a transmission, but in the fact that at least one station from the group remained operational, which guaranteed progress.

In the beeping model we cannot distinguish between Single and Collision; however, in the sense of detecting progress the feedback is equivalent. If a group g is scheduled to broadcast at some phase i, then there are two possibilities. If silence was heard, then all the stations in group g were crashed and their tasks remain outstanding. Otherwise, if a beep is heard, then at least one station in the group remained operational; as the transmission was scheduled for phase i, the corresponding i tasks were performed by group g.

Lemma 10 together with the work performance of Groups-Together allows us to conclude that the solution is also optimal in the beeping model.

Corollary 1

Groups-Together is work optimal in the beeping channel against the f-Bounded adversary.

9 Conclusions

This paper addressed the challenge of performing work on a shared channel with crash-prone stations against the ordered and delayed adversaries introduced in this work. The considered model is very basic; therefore, our solutions could be implemented, and remain efficient, in other related communication models with contention and failures.

We found that some orders of crash events are more costly than others, both for a given algorithm and for the problem as a whole. In particular, more shallow orders, or even slight delays in the effects of the adversary’s decisions, constrain the adversary and allow solutions to stay closer to the absolute lower bound for this problem.

All our algorithms work on a shared channel with acknowledgments only, without collision detection, which makes the setup challenging. While it was already shown that there is not much one can do against a Strongly-Adaptive \(f\)-Bounded adversary, our goal was to investigate whether there are other adversaries against which an algorithm can play efficiently.

Taking a closer look at our algorithms, each of them works differently against different adversaries. ROBAL does not simulate a collision detection mechanism, as opposed to the other two solutions, but tries to exploit good properties of an existing (but a priori unknown to the algorithm) linear order of crashes. On the other hand, its execution against a Weakly-Adaptive Linearly-Bounded adversary could be inefficient: the adversary could enforce a significant increase in the overall work by repeatedly crashing small numbers of stations, so that Mix-And-Test would be executed many times with the same parameters, generating excessive work. GruTEch, in turn, cannot work efficiently against the 1-RD adversary, as there is a global leader chosen to coordinate the Crash-Echo procedure that simulates confirmations in a way similar to a collision detection mechanism (recall that we do not assume collision detection given as channel feedback). Such an adversary could decide to always crash the leader, making the algorithm inefficient, as electing a leader is quite costly: the leader is chosen in a number of trials, which generates excessive work. Finally, GILET confirms every piece of progress by electing a leader in a specific way, which is efficient against the 1-RD adversary, but executing it against the Weakly-Adaptive adversary would result in an increase in the overall work complexity.

Remarks on time complexity. First of all, we emphasize that time complexity, defined as the number of rounds until all non-crashed stations terminate, is not the best choice to describe how efficient the algorithms are, because this strongly depends on how the adversary interferes with the system. In what follows we present some general bounds that might, however, overestimate the time complexity for a vast range of executions.

In all our considerations, at some point of an execution (even at the very beginning) it may happen that only the non-faulty stations remain operational, because the adversary has realized all of its possible crashes. Then at most t tasks must be performed by the remaining \( p - f \) stations. Hence, even if the tasks are equally distributed among the non-faulty stations, doing them all takes at least \( t/(p-f) \) rounds. On the other hand, initially t tasks are distributed among p stations, so, on average, a station works on t / p tasks. If the adversary decides to crash a station just before it was to confirm its tasks, then this prolongs the overall execution by t / p rounds. Because there are f crashes, at most tf / p additional rounds are needed to finish. However, it is also true that the stations are capable of performing t tasks in \( \sqrt{t} \) rounds; this corresponds to the triangular, Two-Lists-style assignment of tasks to stations. In this view each crash enforces an additional step of the execution, which gives an upper bound of roughly \( f + \sqrt{t} \) rounds.

All our algorithms obey the same time bounds for actually performing tasks or suffering crashes as mentioned above. Additionally, \( \mathcal {O}(\sqrt{t}\log (p)) \) rounds are needed for ROBAL to select sets of leaders throughout all the executions of the Mix-And-Test procedure. Consequently, the expected running time of ROBAL is \(\mathcal {O}\left( \frac{t}{p-f} + \min \left\{ \frac{tf}{p}, f + \sqrt{t}\right\} + \sqrt{t}\log (p) \right) \). Following the same reasoning, the GruTEch algorithm, apart from doing productive work in the presence of the adversary, requires additional time for the leader election mode, which is \(\mathcal {O}\left( \frac{p}{p-f}\log (p)\right) \) in expectation. The total expected running time of GruTEch is therefore \(\mathcal {O}\left( \frac{t}{p-f} + \min \left\{ \frac{tf}{p}, f + \sqrt{t}\right\} + \frac{p}{p-f}\log (p) \right) \). In GILET each transmission is confirmed by electing a leader, hence its expected running time is \(\mathcal {O}\left( \left( \frac{t}{p-f} + \min \left\{ \frac{tf}{p}, f + \sqrt{t}\right\} \right) \log ^{2}(p)\right) \).
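
For reference, the stated expected-time bounds can be written down directly. These helper functions are ours and purely illustrative (up to constant factors, with log taken base 2).

```python
# Expected running-time bounds (up to constants) as stated above.
from math import log2, sqrt

def robal_time(p, f, t):
    return t / (p - f) + min(t * f / p, f + sqrt(t)) + sqrt(t) * log2(p)

def grutech_time(p, f, t):
    return t / (p - f) + min(t * f / p, f + sqrt(t)) + (p / (p - f)) * log2(p)

def gilet_time(p, f, t):
    return (t / (p - f) + min(t * f / p, f + sqrt(t))) * log2(p) ** 2
```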

Remarks on energy complexity. Since our algorithms are randomized, it is also quite difficult to state tight bounds for the transmission energy used in executions. Here by transmission energy we understand the total number of transmissions undertaken by stations during the execution. Nevertheless, assuming that n denotes the number of operational stations and that a certain amount of work S has accrued by some time of an execution of any of our algorithms, \( S/\sqrt{n} \) is roughly an upper bound on the number of transmissions done by that time. This is because substantial parts of our algorithms are based on procedure Groups-Together, in which roughly \(\sqrt{n'}\) stations in a group transmit in a round, out of \(n'\ge n\) operational ones that contribute to the total work S. However, our algorithms also rely strongly on various leader-election-type procedures, therefore the total transmission energy cost of an execution may vary significantly.

Open problems. Further study of distributed problems and systems against ordered adversaries seems to be a natural future direction. Another interesting area is to study various extensions of the Do-All problem in the shared-channel setting, such as a dynamic model where additional tasks may arrive during the execution of the algorithm, partially ordered sets of tasks, or tasks with different lengths and deadlines; in other words, to develop scheduling theory on a shared channel prone to failures. In all the above-mentioned directions, including the one considered in this work, one of the most fundamental questions arises: is there a universally efficient solution against the whole range of adversarial scenarios? The different natures of the adversaries and the properties of the algorithms discussed above suggest that it may be difficult to design such a universally efficient algorithm.