RMWPaxos: Fault-Tolerant In-Place Consensus Sequences

Building consensus sequences based on distributed, fault-tolerant consensus, as used for replicated state machines, typically requires a separate distributed state for every new consensus instance. Allocating and maintaining this state causes significant overhead. In particular, freeing the distributed, outdated states in a fault-tolerant way is not trivial and adds further complexity and cost to the system. In this article, we propose an extension to the single-decree Paxos protocol that can learn a sequence of consensus decisions ‘in-place’, i.e., with a single set of distributed states. Our protocol does not require dynamic log structures and hence has no need for distributed log pruning, snapshotting, compaction, or dynamic resource allocation. The protocol builds a fault-tolerant atomic register that supports arbitrary read-modify-write operations. We use the concept of consistent quorums to detect whether the previous consensus still needs to be consolidated or is already finished so that the next consensus value can be safely proposed. Reading a consolidated consensus is done without state modifications and is thereby free of concurrency control and demand for serialisation. A proposer that is not interrupted reaches agreement on consecutive consensus decisions within a single message round-trip per decision by preparing the acceptors eagerly with the previous request.


INTRODUCTION
State machine replication [1] is a common technique for implementing distributed, fault-tolerant services. Commonly, replicated state machine (RSM) implementations are centered around the use of a consensus protocol, as replicas must sequentially apply the same commands in the same order to prevent the divergence of their states.
Existing consensus protocols like Paxos [2], [3], Raft [4], or variations thereof [5], [6], [7] that can be used to build an RSM are based on the idea of a command log. Once a replica learns one or multiple commands by consensus, it appends them to its persistent local command log. Several practical systems [8], [9], [10] follow this general approach.
However, the implementation of such a command log incurs additional challenges like log truncation, snapshotting and log recovery. In the case of Paxos, these problems have to be addressed separately on top of the consensus algorithm. This is a challenging and error-prone task as noted by Chandra et al. [11]. Other consensus solutions, e.g. Raft, consider some of these issues as part of the core protocol whilst sacrificing the ability to make consensus decisions without an elected leader. In either case, implementing consensus sequences requires extensive state management.
A command log is worth its overhead when the commands are small compared to the managed state. However, aggregating largely independent data into a bigger managed state, such as multiple key-value pairs in a key-value store, to compensate for the log overhead is counterproductive because the log would then unnecessarily order commands targeting different keys. Managing each key-value pair separately would be ideal, but this is impractical with log-based approaches due to the implied overhead and challenges outlined above.
In this paper, we present a novel approach called Read-Modify-Write Paxos (RMWPaxos) where the state of an RSM is managed 'in-place'. Instead of replicating a command log as an intermediate step, RMWPaxos replicates the latest state directly. A new command is processed by applying it to the current state and proposing the result as the next value in a sequence of consensus decisions. This makes it possible to use a fixed set of state variables for all decisions, thus avoiding the state management issues. At the same time, it enables using distributed consensus on a finer granularity than before and makes it trivial to use an arbitrary number of parallel consensus instances. This allows the fault-tolerant implementation of ubiquitous primitives like counters, locks, or sets. In addition to existing use cases like key-value stores, we believe that such fault-tolerant, fine-granular RSM usage might become more and more relevant with the rise of byte-addressable non-volatile memory and RDMA-capable low-latency interconnects.
Before presenting RMWPaxos, we introduce the notion of a consensus sequence register, a multi-writer, multi-reader register providing obstruction-freedom for reads and writes that performs any submitted write operation at-least-once. Writes are expressed in the form of update commands applied on an opaque object. Instead of explicitly agreeing on a sequence of commands, it agrees on the sequence of object states that result from the submitted update commands. By adhering to the safety properties of consensus, reads are guaranteed to observe the latest consistent state. Strengthening the register to apply writes exactly-once results in RMWPaxos, a fault-tolerant general atomic read-modify-write (RMW) register.
The main contributions are:
• We introduce the abstraction of a consensus sequence register and strengthen it to provide an atomic RMW register (Sect. 4). These abstractions can be used to implement RSMs. If updates are idempotent, the consensus sequence register suffices to build an RSM. Otherwise, the atomic RMW register is required (Sect. 5.6).
• We provide a new implementation of a fault-tolerant atomic write-once register by modifying the Paxos algorithm. In particular, we enhance Paxos by using the concept of consistent quorums (sets of replies containing identical answers) to reduce contention in read-heavy workloads. Once a consistent quorum is detected, the consensus decision is known. This allows learning the current value of the register in two message delays and prevents concurrent reads from blocking each other (Sect. 5.4).
• By further exploiting consistent quorums, we extend the atomic write-once register to a multi-write register that is a consensus sequence register. Here, a consistent quorum indicates the most recent consensus decision. This makes it possible to propose a follow-up value in-place, i.e. without a command log or multiple independent consensus instances with separate state variables (Sect. 5.4). If there is only a single writer, follow-up decisions can be made in two message delays (Sect. 5.8) without electing a leader.
• The consensus sequence register applies submitted updates from multiple writers at-least-once, which is sufficient when updates manipulate the opaque object (or parts of it) in an idempotent way (like adding a member to a set). We show that by using ordered links, exactly-once semantics can be achieved to build an atomic RMW register, called RMWPaxos (Sect. 5.5).

SYSTEM MODEL
We consider an asynchronous distributed system with processes that communicate by message passing. Processes work at arbitrary speed, may crash, omit messages and may recover with their internal state intact (a recovering process is indistinguishable from one experiencing omission failures). We do not consider Byzantine failures. A process is correct if it does not crash or recovers from crashes in finite time with its (possibly outdated) state intact. We assume that every process can be identified by its process ID (PID).
In the first part of this paper, processes send messages to each other via direct unreliable communication links. Links may lose or delay messages indefinitely or deliver them out-of-order. While a fair-loss property is desirable to support progress, it is not formally necessary for safety. In Sect. 5.5, we strengthen this and require reliable in-order message delivery.

THE CONSENSUS PROBLEM
The consensus problem describes the agreement of several processes on a common value in a distributed system. We differentiate between proposer processes that propose values and learner processes that must agree on a single proposed value. In practice, a process can also implement both roles. A correct solution to the consensus problem must satisfy the following safety properties [12]:
C-Nontriviality. Any learned value must have been proposed.
C-Stability. A learner can learn at most one value.
C-Consistency. Two different learners cannot learn different values.
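The three safety properties can be phrased as checks over a recorded execution. The following is a minimal sketch, assuming a hypothetical trace format (a set of proposed values and, per learner, the list of values it reported as learned); the function name `check_consensus` is illustrative and not part of the protocol.

```python
def check_consensus(proposed, learned):
    """Check the three consensus safety properties on a recorded trace.

    proposed: set of all values that were proposed (hypothetical trace format).
    learned:  dict mapping a learner name to the list of values it learned.
    Returns a tuple (nontriviality, stability, consistency) of booleans.
    """
    all_learned = [v for values in learned.values() for v in values]
    # C-Nontriviality: every learned value was proposed.
    nontrivial = all(v in proposed for v in all_learned)
    # C-Stability: each learner learns at most one distinct value.
    stable = all(len(set(values)) <= 1 for values in learned.values())
    # C-Consistency: no two learners learn different values.
    consistent = len(set(all_learned)) <= 1
    return nontrivial, stable, consistent
```

For example, a trace in which two learners report different proposed values satisfies the first two properties but violates C-Consistency.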
In addition to safety, the liveness property requires that some value is eventually learned if a sufficient number of processes are correct. However, guaranteeing liveness whilst satisfying the safety properties of consensus is impossible in an asynchronous system with one faulty process [13].

PROBLEM STATEMENT
We define a fault-tolerant register that is replicated on N processes and tolerates the crashes of a minority of replicas. The register holds a value v. Its initial value is v = ⊥. Any number of clients can read or modify v by submitting commands to any replica. The main motivation of our work is to provide a register abstraction that allows the implementation of a replicated state machine. For that, we start with a simpler abstraction, which we then extend.
Write-Once Atomic Register. Commands submitted to the register either write a value or read its current value. Read commands return either ⊥ or a value v_w that has been submitted by some write. The register is linearisable [14], i.e. all commands appear to take effect instantaneously at some time between their submission and the corresponding completion response from the register. Thus, once a read returns v_w, all subsequent reads must return v_w as well. However, an arbitrary number of reads is allowed to return ⊥ beforehand if no value was written yet. This is achieved by satisfying the safety properties stated in Sect. 3.
Consensus Sequence Register. We extend the write-once atomic register by allowing multiple clients to submit update commands that change the register's value. We say that a value v is the result of update sequence s(v) = u_1, ..., u_n iff v equals u_n ∘ ... ∘ u_1 applied on ⊥ (∘ being function composition). The register ensures that reads return values with growing update sequences. For that, we extend the safety properties of consensus.
CS-Nontriviality. Any read value is the result of applying a sequence of submitted updates.
CS-Stability. For any two subsequent reads returning values v_1 and v_2, s(v_1) is a prefix of s(v_2).
CS-Consistency. For any two reads (including concurrent ones) returning values v_1 and v_2, s(v_1) is a prefix of s(v_2) or vice versa.
The prefix relation on update sequences is reflexive, i.e. every update sequence is its own prefix. We note that update sequences are merely a tool to argue about the register's properties. The actual register implementation does not explicitly store them, but keeps the value resulting from the latest update. For updates, we additionally require the following properties:
CS-Update-Visibility. Any completed update is included at least once in the update sequence of all values returned by subsequent reads.
CS-Update-Stability. For any two subsequent updates u_1 and u_2, u_1 appears before u_2 in the update sequence of any returned value that includes both u_1 and u_2.
Atomic Read-Modify-Write Register. To satisfy linearisability, we strengthen CS-Update-Visibility by requiring that every completed update is included exactly once in the update sequence of all values returned by subsequent reads. This results in a general atomic read-modify-write (RMW) register [15]. Unlike specialised RMW registers that can perform a single type of RMW operation like test-and-set or fetch-and-add, this register can atomically execute arbitrary computations on its previous value.
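The prefix-based properties can be made concrete with a small sketch. It assumes update sequences are represented as Python lists of update identifiers and that CS-Consistency requires any two returned sequences to be prefix-related in one direction or the other; the helper names `is_prefix`, `cs_consistent`, and `apply_updates` are illustrative only.

```python
def is_prefix(s1, s2):
    """True if update sequence s1 is a (reflexive) prefix of s2."""
    return len(s1) <= len(s2) and list(s2[:len(s1)]) == list(s1)


def cs_consistent(sequences):
    """Assumed reading of CS-Consistency: every pair is prefix-related."""
    return all(is_prefix(a, b) or is_prefix(b, a)
               for a in sequences for b in sequences)


def apply_updates(updates, initial=None):
    """Compute the value u_n(...u_1(initial)...); None stands in for the empty value."""
    value = initial
    for update in updates:
        value = update(value)
    return value


def add_member(member):
    """An idempotent update command: add a member to a set-valued register."""
    return lambda value: (value or frozenset()) | {member}
```

With `add_member`, applying the same update twice yields the same value as applying it once, which is the kind of idempotent update for which at-least-once application is harmless.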
As liveness is impossible in our system model, wait-freedom [16] cannot be provided. However, we require obstruction-freedom [17] for a valid implementation of the registers. If wait-freedom is still required, an obstruction-free implementation can be extended with a leader election mechanism by assuming a ♦W failure detector [18].

IN-PLACE CONSENSUS SEQUENCE
In this section, we present our protocols that satisfy the properties of the register abstractions introduced in Sect. 4. The write-once atomic register makes use of the principles of Paxos consensus [2], [3] and adopts the concept of consistent quorums [19]. These concepts are then extended for the more powerful abstractions to allow a sequence of multiple consensus decisions 'in-place', i.e. on the same set of state variables by overwriting the previous consensus. A more detailed, albeit more informal description of a previous version is given by Skrzypczak [20]. We discuss how to build an RSM with our register in Sect. 5.6.

Paxos Overview
Our approach is derived from the Paxos protocol. In addition to proposers and learners, Paxos introduces the role of acceptor processes that coordinate concurrent proposals by voting on them. If a sufficient number of acceptors have voted for the same proposal, the proposal's value can be learned by a learner. Such a set of acceptors is called a quorum. A proposal is chosen if it has acquired a quorum of votes. The value of a chosen proposal is a chosen value. The size of quorums depends on the application and Paxos variant in use [7], [21], [22]. However, it is generally required that any two quorums have a non-empty intersection to prevent two disjoint quorums that voted for different values (as this would allow two learners to learn different values).

Figure 1: Consistent (left) and inconsistent (right) quorum with 7 acceptors. A quorum view Q for a system using n acceptors consists of |Q| = ⌊n/2⌋ + 1 elements (here 4).
In order for Paxos to learn a value, a quorum of acceptors, a learner, and the proposer, which has proposed the value, must be correct during the execution of the protocol. For simplicity, we consider any majority of acceptors to be a quorum. Thus, a system with 2F + 1 acceptors can tolerate at most F acceptor failures.
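The majority-quorum arithmetic above can be checked by brute force for small systems. The sketch below assumes acceptors are numbered 0 to n-1; the function names are illustrative.

```python
from itertools import combinations


def quorum_size(n):
    """Size of a minimal majority quorum among n acceptors."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest number of failed acceptors that still leaves a full quorum."""
    return n - quorum_size(n)


def all_quorums_intersect(n):
    """Brute-force check that every pair of minimal quorums shares an acceptor."""
    quorums = [set(q) for q in combinations(range(n), quorum_size(n))]
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

For n = 7 this gives quorums of 4 acceptors and tolerates F = 3 failures, matching the 2F + 1 rule.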
If enough processes are correct, then Paxos is obstruction-free [17], i.e. an isolated proposer without concurrent access succeeds in a finite number of steps. However, concurrent proposals can invalidate each other repeatedly, thereby preventing learners from learning any value. This scenario is known as duelling proposers.

Consistent Quorums
Similar to Paxos, our approach structures the communication between proposers and acceptors into phases. In each phase, a proposer sends a message to all acceptors and waits for a minimal quorum of replies. The seen quorum is consistent if the state indicated by the acceptors in the quorum is identical; otherwise, it is inconsistent (see Figure 1). Not waiting for more replies than necessary ensures that a minority of failed acceptors is tolerated without delaying progress.
If a proposer p observes an inconsistent quorum, it cannot infer which of the seen values is or will be chosen and learned. For example, if p receives the quorum depicted in the right part of Figure 1, it cannot decide which of the observed states exists in a majority since it has no information about the state of acceptors 1-3. In contrast, it is trivial for p to deduce the chosen value with a consistent quorum (Figure 1 left). Existing Paxos variants do not distinguish between consistent and inconsistent quorums. As we will see, detecting a consistent quorum allows the proposer to terminate the protocol early in the single-decree case. Furthermore, the consistent state can be used as the basis for follow-up proposals if multiple consensus decisions are needed in sequence.

Paxos Write-Once Atomic Register
In the following, we present our modifications to the original single-decree Paxos protocol for implementing a write-once atomic register. Its pseudocode is depicted in Algorithm 1. We discuss the differences to Paxos in Sect. 5.3.2. No separate learner role exists, as each proposer also implements the functionality of a learner in our implementation.

For brevity's sake, we use the following conventions when handling sets of reply messages: Let a process receive a set of reply messages S. Each message in S is an n-tuple denoted as ⟨t, e_1, ..., e_{n-1}⟩. We make use of pattern matching techniques commonly found in functional programming. The type t of the message is matched to ensure it has the correct format. Its payload is stored in tuple elements e_1 to e_{n-1}. Since messages in S may hold different values in the same tuple element, we define the following functions: cons_S(e_i) returns the value of e_i if it is equal for all messages in S, or false otherwise; max_S(e_i) returns the largest value of e_i; max_S(e_i, e_j) returns the value of e_j from the message with the largest value of e_i.

Algorithm 1 Paxos-based write-once atomic register

     6:     if t ← cons_Q(r_voted) ∧ (t.n > 0 ∨ op = read) then
     7:         ▷ return existing consensus
     8:         send ⟨DONE, v⟩ to client
     9:     else if r ← cons_Q(r_ack) ∧ inc then
    10:         ▷ r consistently prepared
    11:         v_prop ← max_Q(r_voted, v)
    12:         if v_prop = ⊥ then
    13:             ▷ no consensus yet; propose own value
    14:             v_prop ← val
    15:         ▷ propose value
    16:         send ⟨VOTE, r, v_prop⟩ to all acceptors
    17:     else  ▷ inconsistent quorum; retry with higher round
    18:         r_new ← max_Q(r_ack)
    19:         r_new.n ← r_new.n + 1
    20:         send ⟨PAXOS_PREP, r_new⟩ to all acceptors
    21: on receive ⟨VOTED, v⟩ from quorum Q:
    22:     send ⟨DONE, v⟩ to client
    Acceptor
    23: initialise:
    24:     r_ack ← (0, ⊥), val ← ⊥, r_voted ← (0, ⊥)
    Phase 1
    25: on receive ⟨PREPARE, op, rid⟩ from proposer p:
    26:     ▷ incremental phase 1
    27:     if op = write then
    28:         r_ack ← (r_ack.n + 1, rid)
    29:         send ⟨ACK, true, r_ack, r_voted, val⟩ to p
    30:     else
    31:         ▷ no state change for read
    32:         send ⟨ACK, false, (r_ack.n, rid), r_voted, val⟩ to p
    33: on receive ⟨PAXOS_PREP, r⟩ from proposer p:
    34:     ▷ canonical Paxos phase 1
    35:     if r > r_ack then
    36:         r_ack ← r
    37:         send ⟨ACK, true, r_ack, r_voted, val⟩ to p
    Phase 2
    38: on receive ⟨VOTE, r, val_new⟩ from proposer p:
    39:     if r ≥ r_ack then
    40:         r_ack ← r  ▷ improve r_ack consistency
    41:         r_voted ← r
    42:         val ← val_new
    43:         send ⟨VOTED, val⟩ to p
We furthermore assume that processes can keep track of multiple concurrent requests and know to which outstanding request a received reply belongs.
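The reply-set helpers described above can be sketched in a few lines. This is an illustration under stated assumptions: each reply is modelled as a dictionary from field name to value, field values are directly comparable (plain integers stand in for rounds here), and a sentinel `INCONSISTENT` replaces the paper's "false" so that a consistently false field is not mistaken for an inconsistent one. The names `cons`, `max_of`, and `max_by` are hypothetical.

```python
INCONSISTENT = object()  # sentinel standing in for the "false" result of cons_S


def cons(replies, field):
    """cons_S(e_i): the common value of a field, or INCONSISTENT if replies differ."""
    first = replies[0][field]
    if all(reply[field] == first for reply in replies):
        return first
    return INCONSISTENT


def max_of(replies, field):
    """max_S(e_i): the largest value of a field across all replies."""
    return max(reply[field] for reply in replies)


def max_by(replies, key_field, value_field):
    """max_S(e_i, e_j): value_field of the reply with the largest key_field."""
    return max(replies, key=lambda reply: reply[key_field])[value_field]
```

A quorum of replies is consistent with respect to a field exactly when `cons` does not return the sentinel.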

Protocol Description
The protocol has two phases. In the first phase, a proposer checks for concurrently proposed values and prepares acceptors to deny outdated proposals. In the second phase, a proposer proposes either its own or a value seen in the first phase. To eventually learn a value, both phases must be passed without interruption by other proposers. Concurrent proposals are ordered by so-called rounds (analogous to 'proposals numbered n' in [2] and 'ballot numbers' in [3]). A round is a tuple (n, id), where n is a non-negative integer and id some globally unique identifier. Rounds are partially ordered: r_1 < r_2 iff r_1.n < r_2.n. Furthermore, r_1 = r_2 iff r_1.n = r_2.n ∧ r_1.id = r_2.id. Newer proposals are indicated by higher rounds. Rounds with the same n but different id cannot be ordered.
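The partial order on rounds can be expressed directly as a small value type. The class below is a sketch; the class name `Round` and the helper `comparable` are illustrative, and `None` stands in for the undefined identifier ⊥.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Round:
    """A round (n, id); equality compares both fields, order compares only n."""
    n: int
    id: Optional[str] = None

    def __lt__(self, other):
        return self.n < other.n

    def __gt__(self, other):
        return self.n > other.n

    def __le__(self, other):
        return self < other or self == other

    def __ge__(self, other):
        return self > other or self == other


def comparable(r1, r2):
    """Two rounds are ordered only if they are equal or differ in n."""
    return r1 == r2 or r1 < r2 or r2 < r1
```

Two rounds with the same n but different identifiers are neither equal nor ordered, so a check of the form r ≥ r_ack fails for them.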
The protocol begins with proposer p receiving a REQ message from client c. The request is either a write that tries to set the register to a value val, or a read that returns the register's current value (here, val = ⊥). The client's request is handled asynchronously by the register. The client is notified by a DONE message once the request has been processed.
The proposer starts the first phase by choosing a round ID and sending it in a PREPARE message along with the request type to all acceptors (lines 2-4). Acceptors act as the distributed fault-tolerant storage. Each acceptor maintains three values (line 24): (1) the highest round r_ack it has acknowledged, (2) the last value val it has voted for, and (3) the round r_voted in which that value was proposed.
If an acceptor A receives a PREPARE message of a write request from proposer p, it knows that p intends to submit a new proposal in phase 2 of the protocol. A acknowledges this by incrementing its r_ack and updating the round's ID. Thereby, A promises p not to vote for any lower-numbered proposals in the future. Other proposals submitted by different proposers might already be in progress. A then replies to p with its current state in an ACK message and an indication that it has increased its round (lines 28-29). If p's PREPARE message belongs to a read request, then A answers without incrementing its r_ack (line 32). Leaving the state untouched when processing reads reduces their interference with other ongoing requests and is not part of canonical Paxos.
The second phase begins as soon as p has received ACK messages from any quorum Q of acceptors. If p received the same r_voted round in all messages from Q with r_voted.n > 0, then the value proposed in this round is already chosen. Thus, p can be certain that a consensus was already reached and delivers the chosen value back to the client. In other words, p has learned v. Similarly, p can be certain that no consensus was reached yet if r_voted is consistent with the initial round number 0. Here, p can return an empty value when processing a read (lines 6-8).
For an inconsistent quorum, p cannot deliver a result. If it received consistent r_ack rounds, it must execute phase 2 by issuing a proposal of its own. For inconsistent r_ack rounds, p executes a canonical single-decree Paxos phase 1 with an explicit round number by sending PAXOS_PREP messages (lines 33-37). This increments the r_ack rounds in the acceptors to enable the concurrency control necessary for helping to establish the unfinished consensus. If p has received messages with consistent r_ack rounds, it can issue a proposal. If one of the acceptors that replied to p already voted for any proposal, then p observes an inconsistent quorum as depicted in Figure 1. It cannot decide whether the proposal's value was already established and learned. Thus, to be on the safe side, it must choose the value seen in the highest round. Otherwise, p proposes its own value. The proposal is then sent in a VOTE message to all acceptors using the acknowledged round r_ack (lines 10-16).
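The proposer's choice between the three outcomes described so far (return the existing consensus, propose a value, or retry with a higher round) can be sketched as a pure function. This is an illustration under assumptions, not the reference implementation: rounds are modelled as (n, id) tuples, each ACK reply as a tuple (inc, r_ack, r_voted, v), `None` stands for ⊥, and the condition on inc is read as "every reply in the quorum reports an increment". The function name `phase2_decision` is hypothetical.

```python
def phase2_decision(op, val, acks):
    """Decide the proposer's next message from a quorum of ACK replies.

    op:   "read" or "write"
    val:  the value the client wants to write (None for reads)
    acks: list of (inc, r_ack, r_voted, v) tuples; rounds are (n, id) tuples
    """
    voted_rounds = {r_voted for (_, _, r_voted, _) in acks}
    acked_rounds = {r_ack for (_, r_ack, _, _) in acks}
    all_incremented = all(inc for (inc, _, _, _) in acks)

    # Consistent r_voted: the consensus is known (or known to be empty for reads).
    if len(voted_rounds) == 1:
        t = acks[0][2]
        if t[0] > 0 or op == "read":
            return ("DONE", acks[0][3])

    # Consistent and increased r_ack: the round is prepared, so propose.
    if len(acked_rounds) == 1 and all_incremented:
        r = acks[0][1]
        # Adopt the value voted for in the highest round, if any.
        v_prop = max(acks, key=lambda ack: ack[2][0])[3]
        if v_prop is None:
            v_prop = val  # no vote seen: propose own value
        return ("VOTE", r, v_prop)

    # Inconsistent quorum: retry with an explicit, higher round.
    r_max = max((ack[1] for ack in acks), key=lambda r: r[0])
    return ("PAXOS_PREP", (r_max[0] + 1, r_max[1]))
```

Keeping the decision free of I/O makes each branch easy to exercise with hand-written reply sets.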
All acceptors receiving the message vote for the proposal if they have not already given a promise for a higher-numbered proposal during a (concurrent) phase 1 execution. They notify the proposer p of their vote (lines 38-43). If p received a quorum of positive replies, it knows its proposed value was chosen and notifies the client of the established consensus (line 22). This concludes the protocol.
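The acceptor side of the protocol fits in a small class. The sketch below is written under assumptions: rounds are (n, id) tuples compared by n only, `None` stands for ⊥, message passing is replaced by return values, and an acceptor is assumed to stay silent (return `None`) when it rejects a PAXOS_PREP or VOTE message. Class and method names are illustrative.

```python
class Acceptor:
    """In-memory sketch of an acceptor for the write-once register."""

    def __init__(self):
        self.r_ack = (0, None)    # highest acknowledged round
        self.val = None           # last value voted for
        self.r_voted = (0, None)  # round in which val was voted for

    def on_prepare(self, op, rid):
        """Incremental phase 1: only writes change the acceptor's state."""
        if op == "write":
            self.r_ack = (self.r_ack[0] + 1, rid)
            return ("ACK", True, self.r_ack, self.r_voted, self.val)
        return ("ACK", False, (self.r_ack[0], rid), self.r_voted, self.val)

    def on_paxos_prep(self, r):
        """Canonical Paxos phase 1 with an explicit round."""
        if r[0] > self.r_ack[0]:
            self.r_ack = r
            return ("ACK", True, self.r_ack, self.r_voted, self.val)
        return None  # assumption: outdated rounds are not answered

    def on_vote(self, r, val_new):
        """Phase 2: vote unless a higher round was already acknowledged."""
        if r[0] > self.r_ack[0] or r == self.r_ack:  # r >= r_ack in the partial order
            self.r_ack = r
            self.r_voted = r
            self.val = val_new
            return ("VOTED", self.val)
        return None  # assumption: rejected votes are not answered
```

Note that a VOTE carrying a round with the same n as r_ack but a different identifier is rejected, because such rounds are not ordered.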

Comparison to Paxos
Our write-once atomic register is based on the same mechanism for safety as Paxos, but differs from the canonical single-decree Paxos [2] in several aspects: Consistent Quorums. In canonical Paxos, all proposals must complete both phases of the algorithm even if a value was already chosen. This effectively serialises concurrent reads and causes unnecessary state changes in acceptors (their round numbers). Our protocol, instead, terminates early and returns the result after the first phase when a proposer observes a consistent quorum. This (1) prevents state modifications by reads, (2) allows termination in two message delays, and (3) prevents live-locks caused by duelling proposers once all correct acceptors have agreed on a proposal. This is possible because once a proposal with value v is chosen, any proposal made in a higher round will contain v (see Sect. 5.3.3). As the value of the write-once register cannot change anymore, there is no need to execute the second phase.
Distinguishing between reads and writes. In canonical Paxos, reading the state of a consensus requires proposing a value when no proposal was seen yet, i.e. actually performing a write, which is unintended. For a read, a client can either (1) initiate the protocol as a proposer, in which case the protocol requires it to propose a (dummy) value itself when no value was chosen yet, or (2) ask a learner. However, a learner that has not learned a value also has to propose a (dummy) value to ensure its answer is up-to-date. As this dummy value might be written, the read semantics are violated. Drawing from the concept of consistent quorums, we support reads without the risk of changing the register's value and are also able to reliably recognise an empty register. A read only behaves like a write to help establish an ongoing proposal if it was only partially accepted. However, no value will be proposed that was not already proposed by a write.
Incremental round number negotiation. Proposers have to choose a high enough round number for their proposal to succeed. Canonically, a proposer chooses the round number itself. If it is too low, the proposer's attempt fails and it has to try again (see PAXOS_PREP) with a higher round. This works well when a leader makes the proposals, as it knows the previously used round number. Without a leader, however, the first guess of a proposer is likely to fail, costing a round trip even without concurrent access.
Instead, we let the acceptors increment their round on an initial round-less attempt and retrieve the 'assigned' round from the replies when they form a consistent quorum. Otherwise, we calculate a higher round number from the replies and retry like in Paxos.
Using incremental rounds is optional. If a proposer can determine a round number that likely succeeds, it can also start with that without violating the protocol's safety.
Single learner per request. In canonical Paxos, acceptors send their VOTED messages to a set of learner processes, which learn the value once they have received a quorum of votes for a proposal. Therefore, the number of messages sent is the product of the number of acceptors and the number of learners. In our approach, the proposer that has received a request acts as its sole learner. Thus, every acceptor sends only a single VOTED message.

Sketched Proof of Safety
In this section, we provide a proof sketch for our Paxos atomic write-once register. We show that the safety requirements of Sect. 3, as well as linearisability, are satisfied. Since our protocol has a close resemblance to canonical Paxos, we can use analogous arguments and invariants as described by Lamport [2] to prove safety.

Proposition 1. If a proposal p was learned in round r, then there exists a quorum of acceptors Q such that any acceptor in Q has given a vote for p (i.e. the proposal must have been chosen).
Proof Sketch. For any two acceptors a_1, a_2, which have voted for proposal p_1 and p_2 respectively in the same round r, it holds that p_1 = p_2 because rounds are uniquely identified by their ID. To learn a value, a proposer must either (a) receive a consistent quorum of r_voted rounds from acceptors at the beginning of phase 2, or (b) receive a quorum of VOTED messages. For (a) to be possible, a quorum with r = r_voted must exist. For (b), a quorum of acceptors must have voted for the proposer's proposal. Since a proposal is issued for a specific round, all replying acceptors have voted for a proposal in the same round.
C-Nontriviality is trivial to prove using proposition 1 since acceptors can only vote for a value that was previously proposed by a proposer. C-Stability and C-Consistency hold by satisfying the following invariant:

Proposition 2. If a proposal with value v_c and round r_c is chosen, then every proposal issued with round r > r_c by any proposer also has value v_c.
Proof Sketch. By proposition 1, there is a quorum Q that has voted for v_c in r_c. Since any two quorums have a non-empty intersection, any proposer p will receive at least one ACK reply of an acceptor included in Q. Furthermore, no acceptor has voted for a proposal valued v' with v' ≠ v_c in round r' with r' > r_c. This would imply the existence of a quorum Q' for which every acceptor has acknowledged round r' before voting for the proposal in round r_c. This contradicts the existence of Q since acceptors cannot vote for a lower round than they have previously acknowledged. Therefore, the proposal with the highest round that p receives has value v_c. Thus, p issues a proposal with v_c.

Proposition 2 assumes that rounds can be totally ordered. However, they are only partially ordered due to our modified negotiation mechanism. Thus, we must show:

Proposition 3. For any round number n, at most one proposal is issued.
Proof Sketch. A proposer can only issue a proposal in a round with round number n once it has received an acknowledgement from a quorum of acceptors with consistent and increased r_ack with round number n. Any acceptor can send at most one ACK message in which it has also increased its r_ack to have round number n. Thus, at most one proposer can receive such a quorum to make a proposal. If incremental rounds are not used, proposers have to choose their own unique round numbers (cf. canonical Paxos).

Proposition 4.
The Paxos-based write-once atomic register is linearisable.
Proof Sketch. Propositions 1 and 2 show that all writes return value v_c of the first chosen proposal as their result. Reads differ from writes in that they can return the initial value ⊥, but only if no value is chosen since a consistent quorum is required. Since a proposer must have learned a value before any write (or read not returning ⊥) can complete, any subsequent read will return v_c.

Consensus Sequence Register
In this section, we outline how our fault-tolerant write-once atomic register can be extended to support a sequence of updates. The typical approach in consensus is to chain multiple consensus instances on separate resources [2], [11]. In contrast, we aim to operate on the same set of resources.
Algorithm 2 Proposer's modified phase 2 supporting a sequence of writes

     1: on receive ⟨ACK, inc, r_ack, r_voted, v⟩ from quorum Q:
     2:     if cons_Q(r_voted) ∧ op = read then
     3:         ▷ read: return current consensus
     4:         send ⟨DONE, v⟩ to client
     5:     else if cons_Q(r_voted) ∧ r ← cons_Q(r_ack) then  ▷ modifications to propose next command
     6:         ▷ consensus established, r prepared; propose next consensus value
     7:         v_new ← cmd(v)
     8:         if v_new ≠ NOOP then
     9:             send ⟨VOTE, r, v_new⟩ to all acceptors
    10:         else
    11:             ▷ cmd, e.g. test-and-set, not applicable to latest value v
    12:             send ⟨DENIED⟩ to client
    13:     else if r ← cons_Q(r_ack) ∧ inc then
    14:         ▷ r consistently prepared
    15:         v_prop ← max_Q(r_voted, v)
    16:         ▷ write-through unfinished consensus
    17:         send ⟨VOTE, r, v_prop⟩ to all acceptors
    18:     else
    19:         ▷ inconsistent quorum; retry with higher round
    20:         r_new ← max_Q(r_ack)
    21:         r_new.n ← r_new.n + 1
    22:         send ⟨PAXOS_PREP, r_new⟩ to all acceptors

The required changes of the proposer's second phase are highlighted in Algorithm 2. In addition, the interface exposed to the client changes slightly. Instead of including a specific value val (see Algorithm 1 line 1) in a write request, clients include an update command cmd, which performs a transformation on the current value of the register. The acceptor's behaviour remains unchanged.
The protocol presented in Sect. 5.3 terminates the second phase early by making use of consistent quorums. A consistent quorum is a proof that the current value is established. By making use of this information, we can extend the protocol to handle a sequence of updates. Instead of returning the learned value to the client, the proposer now proposes a new value on top of the old consensus. This is done by applying cmd to the learned value and including the result in VOTE messages, which are then sent to all acceptors. Sometimes an update might be reduced to a no-op, denoted by NOOP, if it cannot be applied to the current value, e.g. when it includes compare-and-swap semantics or a required write lock is missing. As such an update has no effect, the protocol can terminate early.
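The step of applying an update command on top of an established consensus can be sketched as follows. The example assumes update commands are plain Python functions from the current value to the next value, with a sentinel `NOOP` signalling that the command is not applicable; the helper names `compare_and_swap` and `propose_next` are illustrative.

```python
NOOP = object()  # sentinel: the command cannot be applied to the current value


def compare_and_swap(expected, new):
    """Build an update command that only applies if the register holds `expected`."""
    def cmd(value):
        return new if value == expected else NOOP
    return cmd


def propose_next(cmd, value, r):
    """Given an established value and a prepared round r, pick the next message."""
    v_new = cmd(value)
    if v_new is not NOOP:
        return ("VOTE", r, v_new)  # propose the next consensus value in-place
    return ("DENIED",)             # nothing to propose; terminate early
```

A compare-and-swap against a stale expected value is thus answered without a second phase.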
When a proposer receives an inconsistent quorum, e.g. due to an unfinished consensus or message loss, it first has to complete the started consensus (lines 13-22). This will be referred to as a write-through. The proposer can then restart the protocol and attempt to process its own write request.
Safety. Intuitively, the register behaves as if executing multiple single-decree Paxos instances in sequence, with each instance using the previously chosen proposal and its round as initial state. Updates are applied on top of a chosen value, which is ensured by observing a consistent quorum. Thus, for any two values v1 and v2 that are chosen in this order, s(v1) is a prefix of s(v2). By an argument analogous to proposition 4, reads always return the latest chosen value. Thus, CS-Stability and CS-Consistency are satisfied. An update u can only complete if a value that includes u in its update sequence is chosen, as a quorum of VOTED messages is required. As only chosen values are returned, CS-Update-Visibility is guaranteed. No proposer applies u on any chosen value after u is completed. Thus, every subsequent update appears after the last occurrence of u in s(v) of a subsequently chosen value v (CS-Update-Stability).
Algorithm 3 RMWPaxos: A fault-tolerant atomic read-modify-write register
    [...]
     6:     if consQ(r_voted) ∧ op = read then    ▷ deliver read value
     7:         send DONE, v to client
     8:     else if consQ(r_voted) ∧ req_cur = consQ(req_prev) then
     9:         ▷ Proposer's previous attempt failed but was established by write-through
    10:         send DONE, v to client
    11:     else if consQ(r_voted) ∧ r ← consQ(r_ack) then    ▷ can propose next cmd
    12:         v_new ← cmd(v)
    13:         if v_new ≠ NOOP then    ▷ propose next cmd
    14:             send VOTE, r, v_new, req_cur, v, req_prev to all acceptors
    15:         else    ▷ cmd, e.g. test-and-set, not applicable to latest value
    16:             send DENIED to client
    17:     else if r ← consQ(r_ack) ∧ inc then    ▷ execute write-through
    18:         v_prev ← maxQ(r_voted, v)
    19:         req_tmp ← maxQ(r_voted, req_prev)
    20:         send VOTE, r, v_prev, req_tmp, ⊥, ⊥ to all acceptors
    21:     else    ▷ retry with higher round
    22:         r_new ← maxQ(r_ack)
    23:         r_new.n ← r_new.n + 1
    24:         send PAXOS_PREP, r_new to all acceptors
    25: on receive VOTED, v from quorum Q:
    26:     if proposer executed a write-through then
    27:         restart protocol from phase 1 with same ReqID
    28:     else
    29:         send DONE, v to client
    30: on receive LEARNED, v, req from any acceptor:
    31:     if proposer has not completed request req yet then
    32:         send DONE, v to client
    Acceptor
    33: initialise:
    34:     r_ack ← (0, ⊥), val ← ⊥, r_voted ← (0, ⊥), req ← ⊥
    Phase 1
    35: on receive PREPARE, op, rid from proposer p:
    36:     ▷ incremental phase 1
    37:     if op = write then
    38:         r_ack ← (r_ack.n + 1, rid)
    39:         send ACK, true, r_ack, r_voted, val, req to p
    40:     else
    41:         ▷ no state change for read
    42:         send ACK, false, (r_ack.n, rid), r_voted, val, req to p
    43: on receive PAXOS_PREP, r from proposer p:
    44:     [...]
            send VOTED, v to p
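The decision a proposer of the consensus sequence register takes after phase 1 can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' implementation: the names `Reply`, `is_consistent`, `next_step`, and `compare_and_swap` are hypothetical, a quorum is assumed to be consistent when all replies report the same voted round, and the round checks on r_ack are left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

NOOP = object()  # hypothetical sentinel: the update cannot be applied


@dataclass(frozen=True)
class Reply:
    """Phase-1 reply of one acceptor (field names are assumptions)."""
    r_voted: Tuple[int, str]  # round in which the acceptor last voted
    value: Any                # value it voted for in that round


def is_consistent(replies: Sequence[Reply]) -> bool:
    """Assumed criterion: every reply in the quorum reports the same voted round."""
    return len({reply.r_voted for reply in replies}) == 1


def next_step(replies: Sequence[Reply], cmd: Callable[[Any], Any]) -> Tuple[str, Any]:
    """Decide what to do with a quorum of phase-1 replies."""
    if is_consistent(replies):
        current = replies[0].value
        v_new = cmd(current)           # apply the update on the established value
        if v_new is NOOP:
            return ("terminate", current)   # update has no effect, stop early
        return ("vote", v_new)              # propose the successor value
    # Inconsistent quorum: first finish the started consensus (write-through)
    newest = max(replies, key=lambda reply: reply.r_voted)
    return ("write_through", newest.value)


def compare_and_swap(expected: Any, new: Any) -> Callable[[Any], Any]:
    """Example update that reduces to NOOP when the current value differs."""
    return lambda current: new if current == expected else NOOP
```

For instance, `next_step` on two replies that both voted in round `(3, "a")` for value 10 with an increment command yields `("vote", 11)`, whereas replies from different rounds yield a write-through for the value of the highest round.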

RMWPaxos: Atomic Read-Modify-Write Register
The consensus sequence register presented in the previous section is not atomic, as it is possible that an update command submitted by a client is proposed and applied multiple times by the same proposer. For example, consider the following scenario: Proposer p completes phase 1 and submits a proposal. However, it only gets a minority of acceptor votes, as some concurrent proposer p2 has already increased the r_ack rounds of a quorum of acceptors. In this case, p2 may observe an inconsistent quorum and therefore execute a write-through for p's proposal. If it succeeds, then p's proposal was effectively accepted, as the value proposed by p is chosen. However, p does not know this and retries.
For atomicity, it must be ensured that a proposer does not re-submit a proposal once a proposed value of a previous attempt is chosen. For that, we strengthen the system model of Sect. 2 by assuming reliable in-order message delivery. This can be provided by reliable communication protocols such as TCP. Note that messages can be lost if a TCP connection fails and is later re-established during the processing of some request. An easy way to deal with this is to treat the process as crashed until the request is completed. Afterwards, communication can proceed normally. Now, the protocol can be modified as follows (cf. Algorithm 3): For every write request that a proposer p receives, it generates a request ID (ReqID) consisting of its PID and some locally unique value (line 2). Every acceptor holds the ReqID of the last proposal it voted for and includes it in all ACK messages it sends. If a proposer submits a non-write-through proposal, it includes its own ReqID as req_cur and the ReqID received in phase 1 as req_prev in its VOTE messages (line 14). When voting, acceptors send a LEARNED message to the proposer identified by req_prev (line 52). If a proposer submits a write-through, it only includes the ReqID received in phase 1 as req_cur (line 20). In this case, acceptors update req ← req_cur but do not send the LEARNED message. Receiving a LEARNED message guarantees that the corresponding proposal was chosen. A proposer that retries a request with some ReqID stops the protocol if (1) it observes a consistent quorum with this ReqID (line 8), or (2) it receives a LEARNED message with it (line 30). In both cases, it notifies the client that its write request succeeded.
We note that it is easy to avoid sending values in LEARNED and VOTED messages back to the proposer if the proposer keeps track of its proposed values locally. By extension, it is not necessary to include val_prev in VOTE messages. For simplicity, this is not shown in Algorithm 3.
Safety. Assume a write request with ReqID r and update command u is processed by proposer p. Assume that p's attempt failed, but its proposed value is chosen (e.g. due to a write-through). Proposer p does not propose u as the direct successor of its own proposed value because it would observe a consistent quorum with ReqID r beforehand. Thus, assume that some successor value proposed by a different proposer is chosen. This means that LEARNED messages with ReqID r are sent to p by some quorum Q. Let p retry its request. In order to apply u and propose a new value, it must observe a consistent quorum Q'. As Q ∩ Q' ≠ ∅ and reliable ordered links are used, p receives a LEARNED message before receiving a consistent quorum. Thus, p does not apply u on a value whose update sequence already includes u.
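The two stop conditions for a retried request can be sketched in Python as follows. This is only an illustration of the bookkeeping on the proposer side; the class `RequestTracker` and its method names are hypothetical, and message handling and the quorum logic itself are assumed to live elsewhere.

```python
import itertools
from typing import Set, Tuple

ReqID = Tuple[str, int]  # (proposer PID, locally unique counter)


class RequestTracker:
    """Tracks which of a proposer's write requests are known to be chosen."""

    def __init__(self, pid: str) -> None:
        self.pid = pid
        self._counter = itertools.count(1)
        self._completed: Set[ReqID] = set()

    def new_request(self) -> ReqID:
        """Generate a ReqID from the PID and a locally unique value."""
        return (self.pid, next(self._counter))

    def on_learned(self, req: ReqID) -> bool:
        """Handle a LEARNED message; True means the client is notified now."""
        if req in self._completed:
            return False          # already reported to the client
        self._completed.add(req)
        return True

    def may_apply(self, req: ReqID, quorum_req: ReqID) -> bool:
        """On a retry, decide whether the update may still be applied.

        quorum_req is the ReqID reported by the observed consistent quorum.
        """
        if req in self._completed:
            return False          # condition (2): LEARNED was received earlier
        if quorum_req == req:
            self._completed.add(req)
            return False          # condition (1): own proposal is established
        return True
```

Because links are assumed to be reliable and ordered, a LEARNED message for a chosen proposal reaches `on_learned` before the retry observes a consistent quorum, so `may_apply` returns false in exactly the cases the safety argument covers.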

State Machine Replication
By using RMWPaxos, we can build a fault-tolerant replicated state machine using a fixed set of storage resources. The state is stored in the register and the state changes are done by the corresponding update commands. If updates are idempotent, the consensus sequence register suffices. One way to achieve this is by using transactional semantics like compare-and-swap.
In log-based approaches like Multi-Paxos [2, Sect. 3], acceptors accept commands, i.e. state transitions of the state machine. In our approach, in contrast, the acceptors accept the complete state. This has a number of implications. First, a dedicated set of learner processes is no longer required. Any process that wishes to learn the current state of the RSM can do so by executing a read. This process then acts as the sole learner in the context of this command. In contrast, Multi-Paxos requires multiple learners in order to have access to the state in a fault-tolerant manner. Since every learner must also learn every command in order to make progress, n * m VOTED messages are required in a setup with n acceptors and m learners. Our approach requires only n VOTED messages.
Second, by keeping the full state in acceptors, a sequence of commands can now be applied to the RSM in-place using the same set of acceptors. Thus, it is not necessary to allocate and free storage resources. This reduces the complexity of the protocol and its implementation. Due to the absence of any state management overhead, it is trivial to use arbitrarily many RMWPaxos instances in parallel, allowing a more fine-granular use of the RSM paradigm. This is especially useful if the state can be split into many independent partitions, as is often the case in key-value structured data.

Liveness
Reads and writes are obstruction-free [17] as long as a quorum of acceptors and the proposer receiving the requests are correct. Wait- or lock-freedom [16] cannot be guaranteed without further assumptions, as postulated by the FLP result [13]. A common assumption is the existence of a stable leader to which all requests are forwarded. The leader then acts as the sole proposer of the system. To handle leader failures, a ♦W failure detector [18] is necessary.

Optimisations
There are several ways to optimise the basic protocol. Fast Writes. Handling writes requires a proposer to complete both phases of the protocol. That means that at least four message delays are needed. By using a mechanism similar to Multi-Paxos [2], the first phase can be skipped by a proposer that processes multiple writes uninterrupted by other proposers. We refer to such writes as fast writes.
The modification is simple. Whenever an acceptor votes for a proposal made in round r, it sets r_ack to (r.n + 1, r.id) (cf. Algorithm 1 line 40). By doing so, it effectively behaves as if receiving a PREPARE message from the same proposer immediately after voting. Therefore, this proposer can skip the first phase when making its next proposal.
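A minimal Python sketch of this acceptor-side change is given below. The class and method names are hypothetical, and the acceptance check (reject any round smaller than r_ack, as in classic Paxos) is an assumption made for the sketch rather than a restatement of Algorithm 1.

```python
from dataclasses import dataclass
from typing import Any, Tuple

Round = Tuple[int, str]  # (round number n, proposer id)


@dataclass
class Acceptor:
    r_ack: Round = (0, "")    # highest acknowledged round
    r_voted: Round = (0, "")  # round of the last vote
    value: Any = None

    def on_vote(self, r: Round, v: Any) -> bool:
        """Vote for (r, v) unless a higher round was acknowledged meanwhile."""
        if r < self.r_ack:
            return False  # assumed rule: outdated round, proposer must redo phase 1
        self.r_voted, self.value = r, v
        # Fast-write step: behave as if the same proposer had already sent
        # its next PREPARE, so it can skip phase 1 for the next proposal.
        self.r_ack = (r[0] + 1, r[1])
        return True
```

After a successful vote in round `(1, "p")`, the acceptor already acknowledges `(2, "p")`, so proposer p can send its next VOTE directly. Any PREPARE from another proposer raises r_ack further and thereby interrupts the fast-write sequence.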
This optimisation is useful for single-writer settings or scenarios in which a proposer must execute multiple writes within a short period. As no locking or lease mechanism is used, an ongoing fast write sequence can be interrupted at any time by other proposers. Thus, we avoid the costs and unavailability associated with a leader and its (re-)election.
Fewer concurrency conflicts caused by reads. If a read observes a consistent quorum after the first phase, it returns a result without interfering with any concurrent request because acceptors do not modify their rounds. If a read observes an inconsistent quorum, a write-through is triggered, which can cause interference. Write-throughs cannot be prevented completely, as a crashing proposer can cause a proposal to be only partially established. Therefore, we adopt the idea of contention management [17], [23] to unreliably detect a crashed writer: When a reading proposer observes an inconsistent quorum, it stores the highest round it has received. Then, it retries phase 1 without an explicit round. If the quorum is again inconsistent, it checks whether progress was made by comparing the received rounds with the rounds from the previous iteration. If they remain unchanged, then it is possible that the writer crashed and a write-through must be triggered. Otherwise, the reader can try again. The proposer can keep collecting replies from its previous attempts, as it is possible to reach a consistent quorum with delayed replies.
To prevent a read from starving due to a continuous stream of writes, we define an upper limit on the number of retry attempts. Its effects are evaluated in Sect. 7.
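The retry strategy for reads can be sketched in Python as follows. The function `read_with_retries` and the shape of the `phase1` callback are hypothetical; the sketch assumes that each phase-1 attempt reports whether the quorum was consistent, the value read, and the rounds received, and it compares only the highest round between attempts.

```python
from typing import Any, Callable, List, Tuple

Round = Tuple[int, str]
# (quorum consistent?, value, rounds received in this attempt)
Phase1Result = Tuple[bool, Any, List[Round]]


def read_with_retries(phase1: Callable[[], Phase1Result],
                      max_retries: int) -> Tuple[str, Any]:
    """Return ('value', v) or ('write_through', None)."""
    consistent, value, rounds = phase1()
    if consistent:
        return ("value", value)          # no acceptor state was modified
    highest = max(rounds)                # remember the highest round seen
    for _ in range(max_retries):         # upper limit avoids read starvation
        consistent, value, rounds = phase1()
        if consistent:
            return ("value", value)
        if max(rounds) == highest:
            break                        # no progress: the writer may have crashed
        highest = max(rounds)            # progress was made, try again
    return ("write_through", None)
```

With `max_retries=0` the sketch falls back to an immediate write-through on the first inconsistent quorum, which corresponds to disabling the optimisation.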
Batching. Batching is a commonly used engineering technique to reduce bandwidth and contention by bundling multiple commands in a single request at the cost of higher response latency. Every proposer manages separate batches for read and update commands. A batch is processed at regular intervals by starting the protocol. For write batches, all update commands of the batch are applied in-order on the old value before proposing the resulting new value. When processing a read batch, the read value is simply returned to all clients. The size of all messages remains constant, independent of the number of batched commands. This shifts the performance bottleneck from internal communication to the processing speed of the respective proposers.
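Processing a write batch amounts to folding the update commands over the old value before a single proposal is made. The following Python sketch illustrates this; the function names are hypothetical and the commands are modelled as plain functions from old value to new value.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterable

Command = Callable[[Any], Any]


def apply_write_batch(old_value: Any, batch: Iterable[Command]) -> Any:
    """Apply all update commands in order; the result is proposed once."""
    return reduce(lambda value, cmd: cmd(value), batch, old_value)


def answer_read_batch(value: Any, clients: Iterable[str]) -> Dict[str, Any]:
    """Return the same read value to every client of a read batch."""
    return {client: value for client in clients}
```

Only the final value is sent to the acceptors, which is why the message size does not grow with the number of batched commands.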

ANALYSIS
In this section, we focus on additional aspects that might be beneficial for practical deployments. An experimental evaluation can be found in Sect. 7.
Compared to canonical Paxos and Multi-Paxos, our registers require a similar number of 2-4 message delays per consensus in the conflict-free case. Two additional message delays are needed by canonical Paxos when a valid round number is not known yet and by our registers when a read using incremental rounds has to help to establish a consensus. Reading a stable, established consensus with our approach needs only two message delays and no concurrency control, and it does not cause acceptor state changes, which are costly if the state must be persisted. Furthermore, our approach works on a fixed set of resources, which makes dynamic resource allocation, pruning, and deallocation unnecessary. This makes our register applicable on a more fine-granular level than other consensus-based approaches that rely on a command log.
Relying on consistent quorums harms neither robustness nor performance. As in canonical Paxos, a single replica with the highest round seen in an inconsistent quorum suffices to propose its value. On a consistent quorum, however, we can (a) terminate a read operation early without having to write and re-learn the consensus and (b) base the next consensus in our consensus sequence on it.
Not requiring an explicit leader provides more continuous availability. In our approach, any proposer can issue requests to the register at any time. When a proposer fails, other proposers can immediately proceed and do not need to wait for or elect a new leader. Still, a proposer submitting many requests in sequence without interference of other proposers can perform each write to the register in just two message delays (no batching), like a leader.

EXPERIMENTAL EVALUATION
We implemented RMWPaxos in Scalaris [24], a distributed key-value store written in Erlang. The correctness of our implementation was extensively tested using a protocol scheduler [25], which forces random interleavings of incoming messages. No safety violations were detected using this approach.
The primary focus of the evaluation is to show the scalability of our approach under different workloads, as absolute performance is highly dependent on the available hardware environment and engineering efforts that are independent of the actual approach. Our register aims to be a general primitive. Thus, we consider use-case dependent techniques that optimise network traffic and concurrent access, e.g. request batching, to be out of scope for this paper.
All benchmarks were performed on a cluster with two Intel Xeon E5-2670 v3, 2.40 GHz per node. All nodes are fully-connected with 10 Gbit/s links. Each cluster node hosts a single replica, which is a Scalaris node that encapsulates one proposer and one acceptor process. Load generation was performed on up to two separate cluster nodes using the benchmarking tool Basho Bench [26], which was modified to enable workloads with heterogeneous worker processes.
All shown measurements ran over a duration of 10 minutes with request data aggregation in one second intervals. We show the mean with 99 % confidence intervals (CI) and 99th percentile latencies. In almost all cases, the CI lies within two percent of the reported median.

Comparison with Raft and Multi-Paxos
First, we compare the performance of RMWPaxos with open-source implementations of Multi-Paxos [2], [27] and Raft [4], [28], two commonly used state-of-the-art protocols. To minimise the performance impact of the IO subsystem, we configured both approaches to write their data to a RAM disk. In RMWPaxos, data is stored by using Erlang's built-in term storage [29]. All approaches use three replicas. As both Multi-Paxos and Raft make use of a leader, we simulate a leader in the case of RMWPaxos by randomly selecting one node to which all requests are forwarded. As any leader election protocol can be implemented on top of RMWPaxos, we consider leader election to be out of scope.
We measured the throughput of all approaches using two scenarios: First, a counter that is accessed by an increasing number of clients (Figure 2a). Second, a binary value of increasing size accessed by a fixed number of clients (Figure 2b). In both scenarios, clients submit a request and wait for a response before issuing the next one.
All three approaches handle requests in a single round-trip between the leader and a quorum of following nodes. Thus, the observed differences can largely be attributed to their different strategies in handling the data locally. Due to the absence of any state management, RMWPaxos consistently outperforms both the Raft and the Multi-Paxos implementation for small state sizes. For the latter two, overhead caused by reading/writing data to the local file system increases request latency, which in turn negatively affects throughput. In addition, the Multi-Paxos and Raft implementations use mechanisms like checksum validation to protect against disk corruption.
For values smaller than or equal to 4 kB, all approaches exhibit nearly constant read performance. However, the throughput of RMWPaxos decreases for larger values. This is because the full value is always transferred to the proposer from a quorum of nodes when executing a read. This causes high network communication costs for large values. In contrast, the Raft and Multi-Paxos implementations include optimisations to keep data transfer costs between nodes constant when executing a read if the leader is stable. In Raft's case, an empty heartbeat log entry must be appended to the command logs to ensure that the data of the leader is up-to-date. This introduces a slight overhead when reading entries.

Leaderless Performance
RMWPaxos is derived from Paxos. Thus, it does not depend on the existence of a leader to satisfy the safety properties of consensus, in contrast to protocols like Raft, which do not work without a single leader. However, a leader is beneficial for progress because it prevents the duelling proposer problem. In the case of RMWPaxos, we can alleviate the need for a leader as it is trivial to deploy an arbitrary number of concurrent RMWPaxos instances. This way, load on a single RMWPaxos instance can be greatly reduced, depending on the workload.
We examined both single-writer (Figure 3) and multi-writer (Figure 4) workloads, as previous work in the design of data structures has shown that supporting concurrent modifications often inhibits their performance [23]. To better illustrate the effects of concurrent requests, we increased the system size to five replicas (acceptors).

Single-Writer.
To evaluate single-writer performance, we used one writing Basho Bench worker and up to 1024 concurrent readers with a different number of read retries (parameter X). The results are depicted in Figure 3. We observed that even a single retry (X=1) improves both read and write throughput greatly compared to disabling this optimisation (X=0). In the latter case, the register was overloaded due to concurrent write-through attempts by the readers if more than 64 readers were used, at times dropping throughput to 0. As these results are not stable, they are not shown in Figure 3a. Choosing a value for X larger than 2 has only a minor impact on the read throughput. As acceptors must handle more messages with an increasing number of clients, their response latency increases. This leads to a consistent decline of the write throughput, as shown in Figure 3b. Since the load is distributed more evenly across all replicas, the maximum observed throughput increased by roughly 70 % compared to our leader-based experiments (cf. Figure 2a), even though the system size increased from 3 to 5 replicas. Figure 3c shows the latency impact of using read retries. Read latency only increases by approx. 0.5 ms in the presence of a concurrent writer. This may contradict the expectation that some reads require multiple round trips, as they can observe an inconsistent quorum initially. However, proposers can continue collecting replies from the initial attempt and return a result once they observe a consistent quorum. As there is only a single writer, such a quorum always exists, at the latest after receiving a reply from every acceptor. This also means that no write-throughs are triggered by reads. Thus, both reads and writes succeed after a single round trip in a single-writer setup as long as no acceptor fails. Note that writes exhibit a slightly lower latency as they always succeed with a quorum of replies, whereas reads must wait for all replies in some cases.
Multi-Writer, Single-Register. All workers sent a uniform mix of read and write requests for the evaluation of multiple writers. Figure 4a compares the throughput of a read-heavy workload (5 % writes) with a write-heavy workload (50 % writes) [30]. Performance degradation caused by duelling proposers can be observed for both workloads. The throughput of the read-heavy workload scales up to four concurrent clients. Beyond that, the number of instances where clients repeatedly invalidate each other's proposals increases. In write-heavy workloads, even two concurrent clients are enough to have a negative impact on the system's performance. As shown in the previous experiments, a leader at the application level helps to handle write concurrency effectively.
Multi-Writer, Multi-Register. All previous measurements focused on a single register. As highlighted in Sect. 5.6, the absence of state management overhead easily allows for arbitrarily many registers to be used. We benchmarked configurations using up to 10^6 register instances and 512 concurrent clients. The registers were accessed according to a Pareto distribution with α ≈ 1.16 (80 % of requests targeted 20 % of registers). Figures 4b and 4c show the results for read-heavy (5 % writes) and write-heavy (50 % writes) workloads, respectively. The results are as expected. More concurrent clients can be handled without performance degradation due to duelling proposers by increasing the number of parallel registers. The system performs consistently better under read-heavy workloads, both in throughput and number of clients supported. This coincides with the results from the single-register evaluation. The load is evenly distributed across all replicas, as no leader is used.
In addition, contention is low in settings with a large number of parallel registers. This results in a higher achievable throughput than is possible with the use of a leader (cf. Figure 2a). These results show potential for future optimisations to mitigate the issue of high write contention on a single register whilst alleviating the bottleneck caused by a leader. As the single-writer evaluation has shown, a single register is able to handle high read concurrency (cf. Figure 3). Thus, load on a leader can be reduced by only forwarding writes to it. In addition, dynamic leader allocation for only highly contentious registers would further reduce the load on the leader because it would only handle a fraction of all requests.
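The relation between the Pareto shape parameter α ≈ 1.16 and the 80/20 access skew used in the multi-register benchmark can be checked with a short Python calculation. It assumes a continuous Pareto (Type I) distribution, for which the most popular fraction p of items receives a share of p^(1 - 1/α); the function name `top_share` is hypothetical, and how the benchmark maps samples onto registers is not covered here.

```python
import math


def top_share(alpha: float, top_fraction: float) -> float:
    """Share of requests that hit the most popular `top_fraction` of registers.

    Follows from the Lorenz curve of a Pareto Type I distribution (alpha > 1).
    """
    return top_fraction ** (1.0 - 1.0 / alpha)


# Shape parameter for which exactly 80 % of requests hit 20 % of registers
alpha_80_20 = math.log(5) / math.log(4)

print(round(alpha_80_20, 3))                  # shape parameter, close to 1.16
print(round(top_share(alpha_80_20, 0.2), 3))  # share of the top 20 %
print(round(top_share(1.16, 0.2), 3))         # share for the rounded alpha
```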

RELATED WORK
Starting with Lamport's work on the discovery of the Paxos algorithm [2], [3], numerous Paxos extensions [7], [31], [32], [33] have been proposed, most of them following the design of using multiple Paxos instances to learn a sequence of commands. As a notable exception, Generalized Paxos [12] and its derivatives [34] only use a single Paxos instance but require keeping track of an ever-growing set of commands in their messages. In all cases, pruning in some form must be implemented to prevent unbounded memory consumption, which introduces a considerable amount of complexity to the system. This is identified by Chandra et al. [11] as one of the main challenges for using Paxos-based designs in practical systems. Despite numerous efforts to make Paxos more approachable [35], [36], [37], reliable state management with Paxos is seldom discussed in detail. Only a few practical Paxos-based systems exist to date, such as Chubby [9], Spanner [8], Megastore [10], and Scalaris [24].
In recent years, various proposals were made to alleviate the dependence on a single leader. Mencius [5] evenly shares the leader's responsibilities by assigning individual consensus instances to single replicas. In Egalitarian Paxos [7], the replica receiving a command is regarded as its command leader. Each replica can act as a leader simultaneously for a subset of commands. This is achieved by decoupling command commit and application from each other and making use of the dependency constraints of each command. In contrast, depending on workload and load distribution, our approach does not need an explicit leader.
As of today, few consensus protocols exist that are not Paxos-based, most prominently Raft [4] and the closely related Zab protocol [6]. Both are based on the idea of a central command log. Furthermore, they require a strong leader, meaning that at most a single leader is allowed to exist at any given time. In contrast, we perform updates on a distributed state in-place and do not need a strong leader.
To the best of our knowledge, we present the first Paxos-based approach that, by implementing an atomic RMW register, neither relies on additional state management nor requires a leader to satisfy the safety properties of consensus. The register by Li et al. [38] only recasts the original Paxos without modification and provides a regular write-once register. The round-based register proposed by Boichat et al. [36] is not atomic and only write-once. It is similar to the approach of Li et al. and serves as a modular building block for several known Paxos variants like Multi-Paxos or Fast Paxos [21]. CASPaxos [39] provides a Paxos-based linearizable multi-reader multi-writer register by letting clients submit a user-defined function instead of a value. However, when handling concurrent writes, it is not guaranteed that all (or any) writes are processed by the register due to duelling proposers, which makes it unsuitable for implementing basic primitives like counters. The key-value consensus algorithm Bizur [40] is based on a set of single-writer multi-reader registers and therefore relies on the election of a strong leader.
The use of consistent quorums in conjunction with Paxos is first introduced by Arad et al. [19] in the context of group membership reconfigurations. In this context, a consistent quorum expresses a consistent view of the system in terms of group memberships.
Shared register abstractions were first formalized by Lamport [41]. Among them, the atomic register provides the strongest guarantees by being linearisable. Numerous implementations exist today. In particular, the multi-writer generalisation [42, p. 25ff.] of ABD [43] has the greatest resemblance to our approach. However, the properties of atomic registers alone do not suffice to solve consensus, as not every completed write is necessarily applied to the register when confronted with concurrent access. Moreover, only fixed values can be written. Our register abstractions provide arbitrary value transformations based on the register's previous value and ensure that completed writes are applied at-least-once (consensus sequence register) or exactly-once (RMWPaxos).

CONCLUSION
In this paper, we introduced register abstractions that satisfy the safety properties of consensus and allow consensus sequences. We provided implementations that extend the principles of Paxos consensus to allow a sequence of consensuses 'in-place' using a single set of storage resources, instead of using a new set of resources per consensus instance.
Additionally, read operations in RMWPaxos do not interfere with each other (are not serialised) and do not modify any state in the acceptors when the register is stable, i.e. no write operation is induced. This improves the parallel read throughput and saves unnecessary, potentially costly state changes of persistent storage for reads. When reads detect ongoing writes, they can either hope the writer will finish soon and mitigate the chance of duelling proposers by just retrying the read, or can start to support the writing themselves as the writer might have crashed. As we show in our evaluation (Sect. 7), the trade-off between both strategies and how often one should retry the read before helping the writer depends on the system deployment, the number of expected concurrent readers and writers, etc.
Avoiding the need for costly state management and complex protocols for state pruning, providing fast writes in two message delays, and supporting concurrent readers without serialisation opens up a wide new spectrum of use-cases for Paxos-based fault-tolerance. The protocols we provide are beneficial and applicable on a more fine-grained level than Multi-Paxos or similar approaches, as they have low system overhead and provide good scalability.
Code Availability. The source code for our RMWPaxos implementation [44] and the protocol scheduler [25] can be found on GitHub under the Apache License 2.0.