Delta state replicated data types

https://doi.org/10.1016/j.jpdc.2017.08.003

Highlights

  • Definition of delta-state CRDTs and their relation to state CRDTs.

  • Proofs of conditions to attain equivalence to state based CRDTs.

  • Anti-entropy algorithm for basic and causally consistent convergence.

  • Portfolio of delta state CRDTs including optimized sets, and recursive map.

Abstract

Conflict-free Replicated Data Types (CRDTs) are distributed data types that make eventual consistency of a distributed object possible and non ad-hoc. Specifically, state-based CRDTs ensure convergence by disseminating the entire state, which may be large, and merging it into other replicas. We introduce Delta State Conflict-Free Replicated Data Types (δ-CRDT) that can achieve the best of both operation-based and state-based CRDTs: small messages with an incremental nature, as in operation-based CRDTs, disseminated over unreliable communication channels, as in traditional state-based CRDTs. This is achieved by defining δ-mutators to return a delta-state, typically much smaller than the full state, that is to be joined with both local and remote states. We introduce the δ-CRDT framework and explain it by establishing a correspondence to current state-based CRDTs. In addition, we present an anti-entropy algorithm for eventual convergence, and another one that ensures causal consistency. Finally, we introduce several δ-CRDT specifications of both well-known replicated datatypes and novel datatypes, including a generic map composition.

Introduction

Eventual consistency (EC) is a relaxed consistency model that is often adopted by large-scale distributed systems [15], [18], [34] where availability must be maintained, despite outages and partitioning, whereas delayed consistency is acceptable. A typical approach in EC systems is to allow replicas of a distributed object to temporarily diverge, provided that they can eventually be reconciled into a common state. To avoid application-specific reconciliation methods, costly and error-prone, Conflict-Free Replicated Data Types (CRDTs) [32], [33] were introduced, allowing the design of self-contained distributed data types that are always available and eventually converge when all operations are reflected at all replicas. Though CRDTs are deployed in practice and support millions of users worldwide [9], [21], [30], more work is still required to improve their design and performance.

CRDTs support two complementary designs: operation-based (or op-based) and state-based. In op-based designs [26], [33], the execution of an operation is done in two phases: prepare and effect. The former is performed only on the local replica and looks at the operation and current state to produce a message that aims to represent the operation, which is then shipped to all replicas. Once received, the representation of the operation is applied remotely using effect. On the other hand, in a state-based design [5], [33] an operation is only executed on the local replica state. A replica periodically propagates its local changes to other replicas by shipping its entire state. A received state is incorporated with the local state via a merge function that deterministically reconciles both states. To maintain convergence, merge is defined as a join: a least upper bound over a join-semilattice [5], [33].
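As a minimal sketch of the state-based design just described, the following Python models a grow-only counter whose state maps each replica to the number of increments it has issued, with merge defined as a pointwise maximum. The names `inc`, `join`, and `value` and the dictionary representation are assumptions made for illustration, not definitions taken from the paper.

```python
from typing import Dict

# Hypothetical representation: replica id -> increments issued by that replica.
State = Dict[str, int]


def inc(state: State, replica: str) -> State:
    """Standard mutator: return the full state with this replica's entry bumped."""
    new_state = dict(state)
    new_state[replica] = new_state.get(replica, 0) + 1
    return new_state


def join(a: State, b: State) -> State:
    """Least upper bound of two states: pointwise maximum over all keys."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: State) -> int:
    """Query: the counter value is the sum of all per-replica entries."""
    return sum(state.values())


# Two replicas update concurrently, then exchange their full states.
a = inc(inc({}, "A"), "A")
b = inc({}, "B")
merged_at_a = join(a, b)
merged_at_b = join(b, a)
```

Both replicas end up with the same state regardless of merge order, and joining the same state a second time changes nothing, which is why lost or duplicated messages are tolerable in this design.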

Op-based CRDTs have some advantages as they can allow for simpler implementations, concise replica state, and smaller messages; however, they are subject to some limitations: First, they assume a message dissemination layer that guarantees reliable exactly-once causal broadcast; these guarantees are hard to maintain since large logs must be retained to prevent duplication even if TCP is used [20]. Second, membership management is a hard task in op-based systems especially once the number of nodes gets larger or due to churn problems, since all nodes must be coordinated by the middleware. Third, the op-based approach requires operations to be executed individually (even when batched) on all nodes.

The alternative is to use state-based systems, which are free from these limitations. However, a major drawback in current state-based CRDTs is the communication overhead of shipping the entire state, which can get very large. For instance, the state size of a counter CRDT (a vector of integer counters, one per replica) increases with the number of replicas, whereas in a grow-only Set the state size depends on the set size, which grows as more operations are invoked. This communication overhead limits the use of state-based CRDTs to data-types with small state size (e.g., counters are reasonable while large sets are not). Recently, there has been a demand for CRDTs with large state sizes (e.g., in RIAK DT Maps [10] that can compose multiple CRDTs and that we formalize in Section 7.4.9).

In this paper, we rethink the way state-based CRDTs should be designed, having in mind the problematic shipping of the entire state. Our aim is to ship a representation of the effect of recent update operations on the state, rather than the whole state, while preserving the idempotent nature of join. This ensures convergence over unreliable communication (in contrast to op-based CRDTs, which demand exactly-once delivery and are vulnerable to message duplication). To achieve this, we develop in detail the concept of Delta State-based CRDTs (δ-CRDT) that we initially introduced in [2]. In this new (delta) framework, the state is still a join-semilattice that now results from the join of multiple fine-grained states, i.e., deltas, generated by what we call δ-mutators. δ-mutators are new versions of the datatype mutators that return the effect of these mutators on the state. In this way, deltas can be temporarily retained in a buffer to be shipped individually (or joined in groups) instead of shipping the entire object. The changes to the local state are then incorporated at other replicas by joining the shipped deltas with their own states.
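The buffering idea can be sketched in Python as follows, assuming a counter whose state maps replica ids to increment counts. The `Replica` class, `inc_delta`, and `flush` are hypothetical names, and the sketch clears its buffer as soon as a group is produced, which ignores message loss; a real protocol would need to retain deltas until they are known to have arrived.

```python
from functools import reduce
from typing import Dict, List

State = Dict[str, int]  # hypothetical representation: replica id -> count


def join(a: State, b: State) -> State:
    """Pointwise maximum; deltas and full states live in the same lattice."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def inc_delta(state: State, replica: str) -> State:
    """Delta-mutator: return only the entry that changes, not the whole state."""
    return {replica: state.get(replica, 0) + 1}


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: State = {}
        self.buffer: List[State] = []  # deltas produced but not yet shipped

    def increment(self) -> None:
        delta = inc_delta(self.state, self.name)
        self.state = join(self.state, delta)
        self.buffer.append(delta)

    def flush(self) -> State:
        """Join all buffered deltas into one group and empty the buffer."""
        group = reduce(join, self.buffer, {})
        self.buffer = []
        return group

    def receive(self, delta: State) -> None:
        self.state = join(self.state, delta)


a, b = Replica("A"), Replica("B")
b.state = {"B": 40, "C": 7}  # B already holds state that A has never seen
a.increment()
a.increment()
group = a.flush()
b.receive(group)
b.receive(group)  # a duplicated message is absorbed by the idempotent join
```

The shipped group contains a single entry for replica A, however many entries the full state holds, and receiving it twice leaves B unchanged after the first join.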

The use of “deltas” (i.e., incremental states) may look intuitive in state dissemination; however, this is not the case for state-based CRDTs. The reason is that once a node receives an entire state, merging it locally is simple since there is no need to care about causality, as both states are self-contained (including meta-data). The challenge in δ-CRDT is that individual deltas are now “state fragments” and usually must be causally merged to maintain the desired semantics. This raises the following questions: is merging deltas semantically equivalent to merging entire states in CRDTs? If not, what are the sufficient conditions to make this true in general? And under what constraints is causal consistency maintained? This paper answers these questions and presents corresponding proofs and examples.

We address the challenge of designing a new δ-CRDT that conserves the correctness properties and semantics of an existing CRDT by establishing a relation between the novel δ-mutators and the original CRDT mutators. We prove that eventual consistency is guaranteed in δ-CRDT as long as all deltas produced by δ-mutators are delivered and joined at other replicas, and we present a corresponding simple anti-entropy algorithm. We then show how to ensure causal consistency using deltas by introducing the concept of delta-interval and the causal delta-merging condition. Based on these, we then present an anti-entropy algorithm for δ-CRDT, where sending and then joining delta-intervals into another replica state produces the same effect as if the entire state had been shipped and joined.
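One way to picture an anti-entropy scheme built on delta-intervals is the Python sketch below, which is only loosely modelled on the description above and is not the paper's algorithm. It assumes a grow-only set as the datatype, numbers every buffered delta with a local sequence counter, and tracks per-neighbour acknowledgements. The names `Node`, `make_message`, `on_delta`, `on_ack`, and `gc` are hypothetical, and crash recovery is omitted.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Set[str] = set()           # CRDT state (grow-only set)
        self.seq = 0                           # sequence number of the next delta
        self.deltas: Dict[int, Set[str]] = {}  # sequence number -> buffered delta
        self.acked: Dict[str, int] = {}        # neighbour -> highest acked sequence

    def _store(self, delta: Set[str]) -> None:
        """Join a delta into the state and append it to the buffer."""
        self.state |= delta
        self.deltas[self.seq] = set(delta)
        self.seq += 1

    def add(self, element: str) -> None:
        """Local update: the delta of adding an element is a singleton set."""
        self._store({element})

    def make_message(self, to: str) -> Optional[Tuple[Set[str], int]]:
        """Payload for neighbour `to`, or None if it has acked everything."""
        start = self.acked.get(to, 0)
        if start >= self.seq:
            return None
        if not self.deltas or min(self.deltas) > start:
            # A needed delta was garbage-collected: fall back to the full state.
            payload = set(self.state)
        else:
            # Delta-interval: join of the contiguous deltas start .. seq-1.
            payload = set().union(*(self.deltas[n] for n in range(start, self.seq)))
        return payload, self.seq

    def on_delta(self, payload: Set[str], seq: int) -> int:
        """Join a received payload unless it adds nothing; return the seq to ack."""
        if not payload <= self.state:
            self._store(payload)
        return seq

    def on_ack(self, sender: str, seq: int) -> None:
        self.acked[sender] = max(self.acked.get(sender, 0), seq)

    def gc(self, neighbours: Iterable[str]) -> None:
        """Drop deltas that every listed neighbour has acknowledged."""
        low = min(self.acked.get(n, 0) for n in neighbours)
        for n in [k for k in self.deltas if k < low]:
            del self.deltas[n]


a, b = Node("a"), Node("b")
a.add("x")
a.add("y")
lost = a.make_message("b")      # pretend this message never arrives
resent = a.make_message("b")    # no ack yet, so the same interval is rebuilt
ack = b.on_delta(*resent)
b.on_delta(*resent)             # duplicate delivery changes nothing
a.on_ack("b", ack)
a.add("z")
next_msg = a.make_message("b")  # only the unacknowledged suffix is sent
```

Because a sender keeps rebuilding the interval from the last acknowledged position, a lost message is covered by the next one, and a receiver never joins a later fragment without the earlier ones it depends on.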

We illustrate our approach through a simple counter CRDT and a corresponding δ-CRDT specification. Later, we present a portfolio of several δ-CRDTs that adapt known CRDT designs and also introduce a generic kernel for the definition of CRDTs that keep a causal history of known events and a CRDT map that can compose them. All these δ-CRDT datatypes, and a few more, are available online in a reference C++ library [3]. Our experience shows that a δ-CRDT version can be devised for all CRDTs, but this requires some design effort that varies with the complexity of different CRDTs. This refactoring effort can be avoided for new datatypes by writing all mutations as delta-mutations, and only deriving the standard mutators if needed; these can be trivially obtained from the delta-mutators.

This paper is an extended version of [2], adding the following material: Proofs of conditions to attain equivalence to state based CRDTs; Anti-entropy algorithm for basic convergence; Portfolio of delta state CRDTs including simple compositions and anonymous replicated types (grow only sets, two phase sets, lexicographic pairs, (Soundcloud [9]) last-writer-wins sets), named types (positive–negative counters, (Cassandra [16]) lexicographic counters); Kernel for causal CRDTs, with a universal join function; Optimized causal CRDTs (remove-wins sets, (Riak) flags [6]); Recursive map data type for causal CRDTs.

Section snippets

System model

Consider a distributed system with nodes containing local memory, with no shared memory between them. Any node can send messages to any other node. The network is asynchronous; there is no global clock, no bound on the time a message takes to arrive, and no bounds on relative processing speeds. The network is unreliable: messages can be lost, duplicated or reordered (but are not corrupted). Some messages will, however, eventually get through: if a node sends infinitely many messages to another

A background of state-based CRDTs

Conflict-Free Replicated Data Types [32], [33] (CRDTs) are distributed datatypes that allow different replicas of a distributed CRDT instance to diverge and ensure that, eventually, all replicas converge to the same state. State-based CRDTs achieve this by propagating updates of the local state, disseminating the entire state across replicas. The received states are then merged into the remote states, leading to convergence (i.e., consistent states on all replicas).

A state-based CRDT consists

Delta-state CRDTs

We introduce Delta-State Conflict-Free Replicated Data Types, or δ-CRDT for short, as a new kind of state-based CRDTs, in which delta-mutators are defined to return a delta-state: a value in the same join-semilattice which represents the updates induced by the mutator on the current state.

Definition 4.1 Delta-mutator

A delta-mutator mδ is a function, corresponding to an update operation, which takes a state X in a join-semilattice S as parameter and returns a delta-mutation mδ(X), also in S.
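To make the definition concrete, the relation between a standard mutator $m$ and its delta-mutator $m^{\delta}$ can be written as below, using a grow-only counter as the example. The counter state is assumed here to be a map $c$ from replica identifiers to naturals with pointwise maximum as join; the symbols $c$, $\mathsf{inc}_i$, and $\mathsf{inc}_i^{\delta}$ are notation chosen for this illustration.

```latex
% A standard mutator is recovered by joining the delta into the current state:
m(X) \;=\; X \sqcup m^{\delta}(X)

% Grow-only counter at replica i, with state c : \mathbb{I} \to \mathbb{N}:
\mathsf{inc}_i(c)          \;=\; c\{\, i \mapsto c(i) + 1 \,\}
\mathsf{inc}_i^{\delta}(c) \;=\; \{\, i \mapsto c(i) + 1 \,\}

% Join is the pointwise maximum:
(c \sqcup c')(j) \;=\; \max\bigl(c(j),\, c'(j)\bigr)

% Check: joining the delta into c changes only entry i.
(c \sqcup \mathsf{inc}_i^{\delta}(c))(i) = \max(c(i),\, c(i)+1) = c(i)+1
(c \sqcup \mathsf{inc}_i^{\delta}(c))(j) = \max(c(j),\, 0) = c(j) \quad (j \neq i)
```

The delta carries one entry while the full state carries one entry per replica, and joining the delta into $c$ yields exactly $\mathsf{inc}_i(c)$.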

State convergence

In the δ-CRDT execution model, and regardless of the anti-entropy algorithm used, a replica state always evolves by joining the current state with some delta: either the result of a delta-mutation, or some arbitrary delta-group (which itself can be expressed as a join of delta-mutations). Without loss of generality, we assume S has a bottom which is also the initial state. (Otherwise, a bottom can always be added, together with a special init delta-mutator, which returns the initial state.)
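The execution model above can be checked on a small case. The following Python sketch assumes a grow-only set, with the empty set as bottom and set union as join, and verifies that a replica starting from bottom reaches the same state whatever order the deltas arrive in, even when a delta-group and a duplicate are mixed in. The names `add_delta` and `join_all` are hypothetical.

```python
from functools import reduce
from itertools import permutations
from typing import FrozenSet, Iterable

State = FrozenSet[str]
BOTTOM: State = frozenset()  # bottom of the lattice, also the initial state


def join(a: State, b: State) -> State:
    """Join for a grow-only set is set union."""
    return a | b


def add_delta(state: State, element: str) -> State:
    """Delta-mutator for add: the delta is just the singleton {element}.

    For this datatype the delta does not depend on the current state.
    """
    return frozenset({element})


def join_all(deltas: Iterable[State], start: State = BOTTOM) -> State:
    """Evolve a replica from `start` by joining deltas one at a time."""
    return reduce(join, deltas, start)


deltas = [add_delta(BOTTOM, e) for e in ("a", "b", "c")]
group = join_all(deltas[:2])  # a delta-group: the join of two delta-mutations

# Deliver the three deltas, the group, and a duplicate in every possible order.
expected = frozenset({"a", "b", "c"})
outcomes = {join_all(order) for order in permutations(deltas + [group, deltas[0]])}
```

Every delivery order collapses to the single state `expected`, which follows from join being associative, commutative, and idempotent.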

Causal consistency

Traditional state-based CRDTs converge using joins of the full state, which implicitly ensures per-object causal consistency [12]: each state of some replica of an object reflects the causal past of operations on the object (either applied locally, or applied at other replicas and transitively joined).

Therefore, it is desirable to have δ-CRDTs offer the same causal-consistency guarantees that standard state-based CRDTs offer. This raises the question about how can delta propagation and

Portfolio of δ-CRDTs

Having established the equivalence to classic state based CRDTs we now derive a series of specifications based on delta-mutators. Although we cover a significant number of CRDTs, the goal is not to provide an exhaustive survey, but instead to illustrate more extensively the design of specifications with deltas. In our experience the intellectual effort of designing a delta-based CRDT is not much higher than designing it with standard mutators. Since standard mutators can be obtained from

Eventually convergent data types

The design of replicated systems that are always available and eventually converge can be traced back to historical designs in [22], [35], among others. More recently, replicated data types that always eventually converge, both by reliably broadcasting operations (called operation-based) or gossiping and merging states (called state-based), have been formalized as CRDTs [5], [26], [32], [33]. These are also closely related to BloomL [14] and Cloud Types [11]. State join-semilattices were used

Conclusion

We introduced the new concept of δ-CRDTs and devised delta-mutators over state-based datatypes which can detach the changes that an operation induces on the state. This brings a significant performance gain as it allows only shipping small states, i.e., deltas, instead of the entire state. The significant property in δ-CRDT is that it preserves the crucial properties (idempotence, associativity and commutativity) of standard state-based CRDT. In addition, we have shown how δ-CRDT can achieve


References (36)

  • P.S. Almeida, C. Baquero, R. Gonçalves, N.M. Preguiça, V. Fonte, Scalable and accurate causality tracking for...
  • P.S. Almeida, A. Shoker, C. Baquero, Efficient state-based CRDTs by delta-mutation, in: Networked Systems - Third...
  • C. Baquero, Delta-enabled-CRDTs (Retrieved 22.12.15). URL...
  • C. Baquero et al., Making operation-based CRDTs operation-based
  • C. Baquero et al., Using structural characteristics for autonomous operation, Oper. Syst. Rev. (1999)
  • Basho, Riak datatypes (Retrieved 22.12.15). URL...
  • Basho, Riak 1.4 (Retrieved 04.01.16). URL...
  • A. Bieniusa, M. Zawirski, N. Preguiça, M. Shapiro, C. Baquero, V. Balegas, S. Duarte, An optimized conflict-free...
  • Peter Bourgon, Consistency without Consensus: CRDTs in Production at SoundCloud (Retrieved 22.12.15). URL...
  • R. Brown et al., Riak dt map: A composable, convergent replicated dictionary
  • S. Burckhardt et al., Cloud types for eventual consistency
  • S. Burckhardt et al., Replicated data types: specification, verification, optimality
  • S. Burckhardt, D. Leijen, M. Fahndrich, Cloud types: Robust abstractions for replicated shared state, Tech. Rep....
  • N. Conway et al., Logic and lattices for distributed programming
  • S. Cribbs, R. Brown, Data structures in Riak, Riak Conference (RICON) (Oct...
  • Datastax, What’s New in Cassandra 2.1: Better Implementation of Counters (Retrieved 04.01.16). URL...
  • B.A. Davey et al., Introduction to Lattices and Order (2002)
  • G. DeCandia et al., Dynamo: Amazon’s highly available key-value store


    Paulo Sérgio Almeida is an assistant professor at the Department of Informatics at University of Minho, and a researcher at HASLab/INESC TEC. He obtained a M.Sc. degree from University of Porto in 1994 and a Ph.D. degree in Computer Science from Imperial College London in 1998. His research activities have been in the area of distributed systems, namely in causality tracking mechanisms, eventually consistent non-relational databases, fault-tolerant distributed aggregation algorithms, bloom filters, and distributed algorithms in graphs. In recent years the main focus of research has been on Conflict-free Replicated Data Types.

    Ali Shoker is a researcher at HASLab/INESC TEC and an invited assistant professor at the Department of Informatics at the University of Minho, Portugal. He obtained his Ph.D. degree in Informatics and Telecommunication from the University of Toulouse in Nov. 2012 and spent some time at EPFL LPD lab, Switzerland, and INSA de Lyon, France. His research is focused on Distributed Systems; in particular, large scale distributed data management and fault tolerance. More recently, his research is focused on Conflict-free Replicated Data Types, Edge/Fog Computing, and Internet of Things.

    Carlos Baquero is currently an assistant professor at the Computer Science Department in Universidade do Minho (Portugal). He obtained his M.Sc. and Ph.D. degrees from Universidade do Minho in 1994 and 2000. His research interests are focused on distributed systems, in particular in causality tracking, peer-to-peer systems, distributed data aggregation, highly dynamic distributed systems, both in internet P2P settings and in mobile and sensor networks. Recent research is focused on Conflict-free Replicated Data Types, Edge/Fog Computing, and Internet of Things.

    The work presented was partially supported by EU FP7 SyncFree project (609551), EU H2020 LightKone project (732505), and SMILES line in project TEC4Growth (NORTE-01-0145-FEDER-000020).

    1 HASLab/INESC TEC and Universidade do Minho, Portugal.
