DOI: 10.1145/3127479.3132693

RStore: efficient multiversion document management in the cloud

Published: 24 September 2017

ABSTRACT

Motivation. The iterative and exploratory nature of the data science process, combined with an increasing need to support debugging, historical queries, auditing, provenance, and reproducibility, calls for storing and querying a large number of versions of a dataset. This realization has led to many efforts at building data management systems that support versioning as a first-class construct, both in academia [1, 3, 5, 6] and in industry (e.g., git, Datomic, noms). These systems typically support rich versioning/branching functionality and complex queries over versioned information, but they lack the capability to host versions of a collection of keyed records or documents in a distributed environment or a cloud. Alternatively, key-value stores (e.g., Apache Cassandra, HBase, MongoDB) are appealing in many collaborative scenarios spanning geographically distributed teams, since they offer centralized hosting of the data, are resilient to failures, can easily scale out, and can handle a large number of queries efficiently. However, these stores do not offer rich versioning and branching functionality akin to hosted version control systems (VCS) like GitHub. This work addresses the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over them.

RStore Overview. Our primary focus here is to provide versioning and branching support for collections of records with unique identifiers. Like popular NoSQL systems, RStore supports a flexible data model; records with varying sizes, ranging from a few bytes to a few MBs; and a variety of retrieval queries to cover a wide range of use cases. Specifically, similar to NoSQL systems, our system supports efficient retrieval of a specific record in a specific version (given a key and a version identifier), or the entire evolution history for a given key. Similar to VCS, it supports retrieving all records belonging to a specific version to support use cases that require updating a large number of records (e.g., by applying a data cleaning step). Finally, since retrieving an entire version might be unnecessary and expensive, our system supports partial version retrieval given a range of keys and a version identifier.
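The four retrieval queries listed above can be made concrete with a toy in-memory model. This is only a sketch of the query semantics, not RStore's interface or storage layout: the class `ToyVersionedStore` and its method names are hypothetical, every version naively holds a full copy of its records, and "evolution history" is assumed to mean the record's value in every version that contains the key.

```python
from typing import Dict, List, Optional, Tuple


class ToyVersionedStore:
    """Naive model: each version maps key -> record (full snapshot, no sharing)."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, Dict[str, bytes]] = {}

    def commit(self, version: str, parent: Optional[str],
               changes: Dict[str, Optional[bytes]]) -> None:
        """Derive `version` from `parent`; a value of None deletes the key."""
        snapshot = dict(self._snapshots[parent]) if parent is not None else {}
        for key, value in changes.items():
            if value is None:
                snapshot.pop(key, None)
            else:
                snapshot[key] = value
        self._snapshots[version] = snapshot

    def get_record(self, key: str, version: str) -> Optional[bytes]:
        """Point query: one record in one version."""
        return self._snapshots[version].get(key)

    def get_history(self, key: str) -> List[Tuple[str, bytes]]:
        """Evolution of one key across versions, in commit order."""
        return [(v, snap[key]) for v, snap in self._snapshots.items() if key in snap]

    def get_version(self, version: str) -> Dict[str, bytes]:
        """Full version retrieval: all records of one version."""
        return dict(self._snapshots[version])

    def get_range(self, version: str, lo: str, hi: str) -> Dict[str, bytes]:
        """Partial version retrieval: keys with lo <= key <= hi in one version."""
        return {k: r for k, r in self._snapshots[version].items() if lo <= k <= hi}


store = ToyVersionedStore()
store.commit("v1", None, {"a": b"1", "b": b"2", "c": b"3"})
store.commit("v2", "v1", {"b": b"2x", "d": b"4"})  # update b, add d
store.commit("v3", "v1", {"c": None})              # a branch off v1 that deletes c

print(store.get_record("b", "v2"))
print(store.get_history("b"))
print(store.get_range("v2", "b", "c"))
```

Keeping a full snapshot per version makes every query trivial but duplicates unchanged records, which is exactly the storage cost that a compact versioned store has to avoid.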

Challenges. Addressing the above desiderata poses many design and computational challenges, and natural baseline approaches (see full paper [2] for more details) that attempt to build this functionality on top of existing key-value stores suffer from critical limitations. First, most of those baseline approaches cannot directly support point queries targeting a specific record in a specific version (and by extension, full or partial version retrieval queries) without constructing and maintaining explicit indexes. Second, all the viable baselines fundamentally require too many round trips between the retrieval module and the backend key-value store; this is because the desired set of records cannot be succinctly described as a query. Third, ingest of new versions is difficult for most of the baseline approaches. Finally, exploiting "record-level compression" is difficult or impossible in those approaches; this is crucial to be able to handle common use cases where large records (e.g., documents) are updated frequently with relatively small changes.
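To see why compressing a record against its previous version matters when large documents receive small edits, consider the following minimal sketch. It uses zlib's preset-dictionary feature as a stand-in for delta encoding; this technique is an assumption chosen for illustration and is not claimed to be the mechanism RStore uses. The helper names `compress_against` and `decompress_against` are hypothetical.

```python
import random
import zlib


def compress_against(base: bytes, new: bytes) -> bytes:
    """Compress `new`, letting zlib reference byte runs that occur in `base`."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(new) + comp.flush()


def decompress_against(base: bytes, delta: bytes) -> bytes:
    """Invert compress_against; the same `base` must be supplied."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()


# A 4000-byte pseudo-random "document" and a new version with one small insertion.
rng = random.Random(0)
old_doc = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(4000)).encode()
new_doc = old_doc[:2000] + b" small edit " + old_doc[2000:]

standalone = zlib.compress(new_doc, 9)       # new version compressed on its own
delta = compress_against(old_doc, new_doc)   # new version compressed against the old one

print(len(new_doc), len(standalone), len(delta))
assert decompress_against(old_doc, delta) == new_doc
```

The catch, and the reason this is hard to bolt onto a plain key-value store, is that decoding a record now requires fetching the version it was compressed against, so related versions of a record need to be stored and retrieved together.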

Key Ideas. To address these problems, RStore features a new architecture that partitions the distinct records into approximately equal-sized "chunks", with the goal of minimizing the number of chunks that need to be retrieved for a given query workload [2]. We establish that the system can adapt to different data and workload requirements through a few simple tuning knobs. The key computational challenge boils down to deciding how to optimally partition the records into chunks; we draw connections to well-studied problems like compressing bipartite graphs and hypergraph partitioning to show that the problem is NP-hard in general. Our system features a novel algorithm that exploits the structure of the version graph to find an effective partitioning of the records, and it is built on top of Apache Cassandra. An extensive experimental evaluation is performed over a large number of synthetically constructed datasets to show the effectiveness of RStore and to validate our design decisions.
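The partitioning objective can be illustrated with a small cost function. The sketch below only evaluates a given assignment of records to chunks; it is not the paper's partitioning algorithm. It assumes a workload in which each version is retrieved in full exactly once, and the names `chunks_touched` and `total_cost` and the record identifiers are hypothetical.

```python
from typing import Dict, Set


def chunks_touched(version_records: Set[str], chunk_of: Dict[str, int]) -> int:
    """Number of distinct chunks that must be fetched to materialize one version."""
    return len({chunk_of[record] for record in version_records})


def total_cost(versions: Dict[str, Set[str]], chunk_of: Dict[str, int]) -> int:
    """Sum of chunk fetches when every version is retrieved once."""
    return sum(chunks_touched(records, chunk_of) for records in versions.values())


# Two versions over six distinct records; v2 keeps a1 and b1 but rewrites c and d.
versions = {
    "v1": {"a1", "b1", "c1", "d1"},
    "v2": {"a1", "b1", "c2", "d2"},
}

# Both partitionings use three chunks holding two records each.
scattered = {"a1": 0, "c1": 0, "b1": 1, "d1": 1, "c2": 2, "d2": 2}
colocated = {"a1": 0, "b1": 0, "c1": 1, "d1": 1, "c2": 2, "d2": 2}

print(total_cost(versions, scattered))  # 5: v1 reads 2 chunks, v2 reads 3
print(total_cost(versions, colocated))  # 4: v1 reads 2 chunks, v2 reads 2
```

Placing records that appear together in the same versions into the same chunk lowers the cost even though the chunk count and chunk sizes are identical, which is the intuition behind treating the problem as a partitioning task.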

References

  1. S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. G. Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. PVLDB, 8(12):1346–1357, 2015.
  2. S. Bhattacherjee and A. Deshpande. Storing and querying versioned documents in the cloud. University of Maryland, College Park. Accessible at: https://www.cs.umd.edu/~bsouvik/paper/tech-report.pdf, 2017.
  3. P. Buneman, S. Khanna, K. Tajima, and W. C. Tan. Archiving scientific data. ACM Trans. Database Syst., 29:2–42, 2004.
  4. R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, 2011.
  5. J. M. Hellerstein et al. Ground: A data context service. In CIDR, 2017.
  6. M. Maddox, D. Goehring, A. J. Elmore, S. Madden, A. G. Parameswaran, and A. Deshpande. Decibel: The relational dataset branching system. PVLDB, 9(9):624–635, 2016.

Published in

SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing
September 2017
672 pages
ISBN: 9781450350280
DOI: 10.1145/3127479

Copyright © 2017 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 169 of 722 submissions (23%)
