
Disaggregating RocksDB: A Production Experience

Published: 20 June 2023

Abstract

As in the industry at large, there is a trend in Meta's data centers to migrate data from locally attached SSDs to cloud storage. We extended RocksDB [26], a widely used open-source storage engine designed and built for local SSDs, to leverage disaggregated storage. RocksDB's design, in particular the access patterns of its data and log files, makes an append-only distributed file system a desirable underlying storage layer. At Meta, we built disaggregated RocksDB on the Tectonic File System [35], which until then had mainly been used for our data warehouse and blob storage stacks. We identified metadata overhead and tail latencies as Tectonic's major performance gaps and addressed them accordingly. We addressed reliability, performance, and other requirements with both general and customized optimizations to the core RocksDB engine. We also took the time to deeply understand the common challenges faced by applications running on RocksDB and implemented enhancements to address them. The disaggregated design also let RocksDB move toward a more distributed architecture to improve performance.

References

  1. [n. d.]. Amazon EBS. https://aws.amazon.com/ebs/
  2. [n. d.]. Cachelib Repo. https://github.com/facebook/CacheLib
  3. [n. d.]. Ceph File system. https://docs.ceph.com/en/pacific/cephfs/index.html
  4. [n. d.]. Distributed locks with Redis. https://redis.io/topics/distlock
  5. [n. d.]. GlusterFS. https://www.gluster.org/
  6. [n. d.]. Hbase. https://hbase.apache.org/
  7. [n. d.]. HDFS. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
  8. [n. d.]. MySQL. https://www.mysql.com/
  9. [n. d.]. RDMA. http://www.rdmaconsortium.org/
  10. [n. d.]. RocksDB Benchmark Wiki Page. https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks
  11. 2009. Rados. https://ceph.io/en/news/blog/2009/the-rados-distributed-object-store/
  12. 2015. Introduction to HDFS Erasure Coding in Apache Hadoop. https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
  13. 2020. RocksDB-Cloud remote compaction. https://rockset.com/blog/remote-compactions-in-rocksdb-cloud/
  14. 2021. Cachelib. https://engineering.fb.com/2021/09/02/open-source/cachelib/
  15. 2021. How we built a general purpose key value store for Facebook with ZippyDB. https://engineering.fb.com/2021/08/06/core-data/zippydb/
  16. Muhammad Yousuf Ahmad and Bettina Kemme. 2015. Compaction management in distributed key-value datastores. Proceedings of the VLDB Endowment 8, 8 (2015), 850--861.
  17. Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. 2020. The CacheLib Caching Engine: Design and Experiences at Scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 753--768. https://www.usenix.org/conference/osdi20/presentation/berg
  18. Laurent Bindschaedler, Ashvin Goel, and Willy Zwaenepoel. 2020. Hailstorm: Disaggregated compute and storage for distributed LSM-based databases. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 301--316.
  19. Laurent Bindschaedler, Jasmina Malicevic, Nicolas Schiper, Ashvin Goel, and Willy Zwaenepoel. 2018. Rock You like a Hurricane: Taming Skew in Large Scale Analytics. In Proceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys '18). Association for Computing Machinery, New York, NY, USA, Article 20, 15 pages. https://doi.org/10.1145/3190508.3190532
  20. Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, et al. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In 18th USENIX Conference on File and Storage Technologies (FAST 20). 29--41.
  21. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26, 2 (2008), 1--26.
  22. James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1--22.
  23. Jeffrey Dean. 2010. Evolution and future directions of large-scale storage and computation systems at Google. (2010).
  24. David DeWitt and Jim Gray. 1992. Parallel database systems: The future of high performance database systems. Commun. ACM 35, 6 (1992), 85--98.
  25. Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Stumm. 2017. Optimizing Space Amplification in RocksDB. In CIDR, Vol. 3. 3.
  26. Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. RocksDB: Evolution of Development Priorities in a Key-Value Store Serving Large-Scale Applications. ACM Trans. Storage 17, 4, Article 26 (Oct. 2021), 32 pages. https://doi.org/10.1145/3483840
  27. Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 249--264.
  28. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. 29--43.
  29. Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan. 2018. Performance characterization of NVMe-over-Fabrics storage disaggregation. ACM Transactions on Storage (TOS) 14, 4 (2018), 1--18.
  30. Dave Hitz, James Lau, and Michael A Malcolm. 1994. File System Design for an NFS File Server Appliance. In USENIX winter, Vol. 94. 10--5555.
  31. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, and Sanjeev Kumar. 2016. Flash storage disaggregation. In Proceedings of the Eleventh European Conference on Computer Systems. 1--15.
  32. Mihir Nanavati, Jake Wires, and Andrew Warfield. 2017. Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 17--33.
  33. Edmund B Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. 2012. Flat datacenter storage. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 1--15.
  34. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351--385.
  35. Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt Lloyd. 2021. Facebook's Tectonic Filesystem: Efficiency from Exascale. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 217--231. https://www.usenix.org/conference/fast21/presentation/pan
  36. Rockset. 2018. RocksDBCloud. https://rockset.com/blog/rocksdb-cloud-enabling-the-next-generation-of-cloud-native-databases/
  37. Amitabha Roy, Laurent Bindschaedler, Jasmina Malicevic, and Willy Zwaenepoel. 2015. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles. 410--424.
  38. Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 69--87. https://www.usenix.org/conference/osdi18/presentation/shan
  39. Michael Stonebraker. 1986. The case for shared nothing. IEEE Database Eng. Bull. 9, 1 (1986), 4--9.
  40. Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Chris Petersen, Mikhail Antonov, Muhammad Waliji, Kyle Jamieson, Michael J Freedman, and Asaf Cidon. 2018. Live recovery of bit corruptions in datacenter storage systems. arXiv preprint arXiv:1805.02790 (2018).
  41. Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041--1052.
  42. Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. 2020. Building an elastic query engine on disaggregated storage. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 449--462.
