DOI: 10.1145/1383422.1383428
research-article

Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Published: 23 June 2008

ABSTRACT

Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations, to name a few.
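The chunk-and-deduplicate idea described above can be illustrated with a minimal Python sketch. It assumes fixed-size chunks and SHA-1 digests as chunk identifiers; the abstract does not say which chunking or hashing scheme the paper evaluates, and the names `cas_store` and `cas_restore` are hypothetical helpers introduced only for this illustration.

```python
import hashlib

def cas_store(data: bytes, chunk_size: int = 1024):
    """Split data into fixed-size chunks and keep one copy of each distinct chunk.

    Assumption: fixed-size chunking with SHA-1 digests as chunk names.
    """
    store = {}    # digest -> chunk bytes (each distinct chunk is stored once)
    recipe = []   # ordered list of digests needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)   # a repeated chunk is not stored again
        recipe.append(digest)
    return store, recipe

def cas_restore(store, recipe) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[digest] for digest in recipe)

# Synthetic input: 8 KB in which only two distinct 1 KB chunks occur.
data = b"A" * 4096 + b"B" * 2048 + b"A" * 2048
store, recipe = cas_store(data, chunk_size=1024)
assert cas_restore(store, recipe) == data
print(len(recipe), "chunks referenced,", len(store), "stored")
# prints "8 chunks referenced, 2 stored"
```

The recipe (the ordered digest list) is the metadata a CAS system must keep alongside the unique chunks, which is one source of the overheads the abstract mentions.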

In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, at the cost of reduced error resilience and 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
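To make the dependence on chunk-size concrete, the following Python sketch estimates disk space savings for several chunk sizes on a synthetic input. It is not the paper's measurement methodology: the overhead model (one 20-byte SHA-1 digest of metadata per chunk reference), the function name `space_savings`, and the synthetic data are all assumptions made for illustration.

```python
import hashlib

def space_savings(data: bytes, chunk_size: int, digest_bytes: int = 20) -> float:
    """Fraction of space saved by fixed-size-chunk CAS, under an assumed cost model.

    Stored bytes = bytes of distinct chunks + digest_bytes per chunk reference.
    A negative result means the metadata outweighs the eliminated duplicates.
    """
    unique = {}      # digest -> length of that distinct chunk
    n_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique[hashlib.sha1(chunk).digest()] = len(chunk)
        n_chunks += 1
    stored = sum(unique.values()) + n_chunks * digest_bytes
    return 1.0 - stored / len(data)

# Synthetic data-set: 16 records, each a 1 KB record-specific header
# followed by 3 KB of content shared by all records.
data = b"".join(i.to_bytes(4, "big") * 256 + b"S" * 3072 for i in range(16))

for size in (4096, 1024, 256):
    print(f"chunk-size {size:>4} B: savings {space_savings(data, size):+.1%}")
```

On this input, 4 KB chunks each contain a unique header, so no duplicates are found and the per-chunk metadata makes the savings slightly negative, while 1 KB chunks isolate the shared content. Real data-sets behave differently, which is why the abstract stresses that savings depend on the nature of the data as well as the chunk-size.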


Published in

HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing
June 2008
252 pages
ISBN: 9781595939975
DOI: 10.1145/1383422

Copyright © 2008 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 166 of 966 submissions, 17%
