RStore: efficient multiversion document management in the cloud

Authors: Souvik Bhattacherjee (University of Maryland), Amol Deshpande (University of Maryland)

SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing, September 2017, Page 658. https://doi.org/10.1145/3127479.3132693

Published: 24 September 2017

ABSTRACT

Motivation. The iterative and exploratory nature of the data science process, combined with an increasing need to support debugging, historical queries, auditing, provenance, and reproducibility, calls for storing and querying a large number of versions of a dataset. This realization has led to many efforts at building data management systems that support versioning as a first-class construct, both in academia [1, 3, 5, 6] and in industry (e.g., git, Datomic, noms). These systems typically support rich versioning/branching functionality and complex queries over versioned information, but they lack the capability to host versions of a collection of keyed records or documents in a distributed environment or a cloud. Alternatively, key-value stores¹ (e.g., Apache Cassandra, HBase, MongoDB) are appealing in many collaborative scenarios spanning geographically distributed teams, since they offer centralized hosting of the data, are resilient to failures, can easily scale out, and can handle a large number of queries efficiently. However, they do not offer rich versioning and branching functionality akin to hosted version control systems (VCS) like GitHub. This work addresses the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over them.

RStore Overview. Our primary focus here is to provide versioning and branching support for collections of records with unique identifiers. Like popular NoSQL systems, RStore supports a flexible data model; records with varying sizes, ranging from a few bytes to a few MBs; and a variety of retrieval queries to cover a wide range of use cases. Specifically, similar to NoSQL systems, our system supports efficient retrieval of a specific record in a specific version (given a key and a version identifier), or the entire evolution history for a given key. Similar to VCS, it supports retrieving all records belonging to a specific version to support use cases that require updating a large number of records (e.g., by applying a data cleaning step). Finally, since retrieving an entire version might be unnecessary and expensive, our system supports partial version retrieval given a range of keys and a version identifier.
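
The four query types listed above can be illustrated with a small in-memory sketch. It is not RStore's interface or implementation: the class name `ToyVersionedStore` and all method names are hypothetical, every version is stored as a full snapshot, and nothing is chunked, compressed, or distributed. It only shows what each query is expected to return.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Optional, Tuple


class ToyVersionedStore:
    """Hypothetical in-memory toy: one full snapshot (key -> record) per version."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, Dict[str, bytes]] = {}
        self._parent: Dict[str, Optional[str]] = {}

    def commit(self, version: str, parent: Optional[str],
               changes: Dict[str, Optional[bytes]]) -> None:
        """Derive `version` from `parent`; a value of None deletes that key."""
        snapshot = dict(self._snapshots[parent]) if parent is not None else {}
        for key, value in changes.items():
            if value is None:
                snapshot.pop(key, None)
            else:
                snapshot[key] = value
        self._snapshots[version] = snapshot
        self._parent[version] = parent  # keeps the version graph (branches allowed)

    def get(self, key: str, version: str) -> Optional[bytes]:
        """Point query: one record in one version."""
        return self._snapshots[version].get(key)

    def history(self, key: str) -> List[Tuple[str, bytes]]:
        """Key history: (version, record) for every version holding the key,
        in commit order."""
        return [(v, s[key]) for v, s in self._snapshots.items() if key in s]

    def get_version(self, version: str) -> Dict[str, bytes]:
        """Full version retrieval: all records of one version."""
        return dict(self._snapshots[version])

    def get_range(self, version: str, lo: str, hi: str) -> Dict[str, bytes]:
        """Partial version retrieval: records with lo <= key <= hi."""
        snapshot = self._snapshots[version]
        keys = sorted(snapshot)
        start, stop = bisect_left(keys, lo), bisect_right(keys, hi)
        return {k: snapshot[k] for k in keys[start:stop]}
```

Storing a full snapshot per version keeps the sketch short but duplicates every unchanged record, which is exactly the redundancy a compact multiversion store has to avoid.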

Challenges. Addressing the above desiderata poses many design and computational challenges, and natural baseline approaches (see full paper [2] for more details) that attempt to build this functionality on top of existing key-value stores suffer from critical limitations. First, most of those baseline approaches cannot directly support point queries targeting a specific record in a specific version (and by extension, full or partial version retrieval queries) without constructing and maintaining explicit indexes. Second, all the viable baselines fundamentally require too many back-and-forths between the retrieval module and the backend key-value store; this is because the desired set of records cannot be succinctly described as a query. Third, ingest of new versions is difficult for most of the baseline approaches. Finally, exploiting "record-level compression" is difficult or impossible in those approaches; this is crucial to be able to handle common use cases where large records (e.g., documents) are updated frequently with relatively small changes.
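
To make the last point concrete, the sketch below shows why compressing a record against its previous version pays off when changes are small. The abstract does not say which compression scheme RStore uses; as a stand-in, this example assumes zlib with the old version supplied as a preset dictionary, and the helper names `compress_against` and `decompress_against` are hypothetical.

```python
import hashlib
import zlib


def compress_against(base: bytes, new: bytes) -> bytes:
    """Compress `new`, letting zlib reference `base` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(new) + comp.flush()


def decompress_against(base: bytes, blob: bytes) -> bytes:
    """Invert compress_against; the same `base` must be supplied."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(blob) + decomp.flush()


# A roughly 13 KB record made of hard-to-compress lines, then a one-line edit.
lines = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(200)]
v1 = "\n".join(lines).encode()
lines[100] = "edited line"
v2 = "\n".join(lines).encode()

standalone = zlib.compress(v2, 9)   # v2 compressed on its own
delta = compress_against(v1, v2)    # v2 compressed relative to v1
print(len(v2), len(standalone), len(delta))
```

The record compressed relative to its predecessor comes out much smaller than the record compressed on its own, because almost all of its content can be expressed as back-references into the old version. Note that zlib only looks back 32 KB, so this particular stand-in stops helping for records much larger than that.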

Key Ideas. To address these problems, RStore features a new architecture that partitions the distinct records into approximately equal-sized "chunks", with the goal of minimizing the number of chunks that need to be retrieved for a given query workload [2]. We establish that the system can adapt to different data and workload requirements through a few simple tuning knobs. The key computational challenge boils down to deciding how to optimally partition the records into chunks; we draw connections to well-studied problems like compressing bipartite graphs and hypergraph partitioning to show that the problem is NP-hard in general. Our system, which is built on top of Apache Cassandra, features a novel algorithm that exploits the structure of the version graph to find an effective partitioning of the records. An extensive experimental evaluation is performed over a large number of synthetically constructed datasets to show the effectiveness of RStore and to validate our design decisions.
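
The partitioning objective can be illustrated on a toy instance. The sketch below is not the paper's algorithm: it measures chunk size by record count, considers only full version retrievals, and compares a name-ordered layout with a simple heuristic that places records belonging to the same set of versions next to each other. All function names are hypothetical.

```python
from typing import Dict, List, Set


def chunk_in_order(records: List[str], chunk_size: int) -> Dict[str, int]:
    """Fill fixed-capacity chunks with records in the given order."""
    return {r: i // chunk_size for i, r in enumerate(records)}


def chunks_touched(versions: Dict[str, Set[str]],
                   chunk_of: Dict[str, int]) -> Dict[str, int]:
    """Distinct chunks that a full retrieval of each version must fetch."""
    return {v: len({chunk_of[r] for r in recs}) for v, recs in versions.items()}


def order_by_membership(versions: Dict[str, Set[str]]) -> List[str]:
    """Heuristic (an assumption for illustration): sort records by the list of
    versions they belong to, so co-retrieved records end up adjacent."""
    membership: Dict[str, List[str]] = {}
    for v in sorted(versions):
        for r in versions[v]:
            membership.setdefault(r, []).append(v)
    return sorted(membership, key=lambda r: (membership[r], r))


# v2 keeps a0, a2, a4, a6 from v1 and replaces the odd records with b1..b7.
versions = {
    "v1": {"a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"},
    "v2": {"a0", "a2", "a4", "a6", "b1", "b3", "b5", "b7"},
}
all_records = sorted(set().union(*versions.values()))

by_name = chunk_in_order(all_records, chunk_size=4)
by_membership = chunk_in_order(order_by_membership(versions), chunk_size=4)

print(chunks_touched(versions, by_name))        # {'v1': 2, 'v2': 3}
print(chunks_touched(versions, by_membership))  # {'v1': 2, 'v2': 2}
```

Both layouts use three chunks of four records, but grouping by version membership lets v2 be read from two chunks instead of three. With many versions and branches, records rarely fall into such clean groups, which is where the hardness of the general problem comes from.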

References

[1] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. G. Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. PVLDB, 8(12):1346-1357, 2015.
[2] S. Bhattacherjee and A. Deshpande. Storing and querying versioned documents in the cloud. University of Maryland, College Park. Accessible at: https://www.cs.umd.edu/~bsouvik/paper/tech-report.pdf, 2017.
[3] P. Buneman, S. Khanna, K. Tajima, and W. C. Tan. Archiving scientific data. ACM Trans. Database Syst., 29:2-42, 2004.
[4] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12-27, 2011.
[5] J. M. Hellerstein et al. Ground: A data context service. In CIDR, 2017.
[6] M. Maddox, D. Goehring, A. J. Elmore, S. Madden, A. G. Parameswaran, and A. Deshpande. Decibel: The relational dataset branching system. PVLDB, 9(9):624-635, 2016.

Index Terms

RStore: efficient multiversion document management in the cloud
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Semi-structured data
    2. Database management system engines
      1. Parallel and distributed DBMSs

Published in

SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing
September 2017
672 pages
ISBN: 9781450350280
DOI: 10.1145/3127479

Copyright © 2017 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 September 2017
Qualifiers
- abstract
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%