ABSTRACT
Content Addressable Storage (CAS) is a data representation technique that partitions a given data-set into non-intersecting units called chunks and then efficiently recognizes chunks that occur multiple times. This allows CAS to eliminate duplicate instances of such chunks, reducing storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations, to name a few.
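The chunk-and-deduplicate idea described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-1 content hashes; neither choice is stated in the abstract, and the names `cas_store` and `cas_restore` are hypothetical, introduced only for this illustration.

```python
import hashlib


def cas_store(data: bytes, chunk_size: int = 1024):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Assumption: fixed-size chunking and SHA-1 digests are illustrative
    choices, not necessarily what the paper's system uses.
    """
    store = {}   # digest -> chunk bytes (each unique chunk stored once)
    recipe = []  # ordered list of digests needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)  # duplicates are not stored again
        recipe.append(digest)
    return store, recipe


def cas_restore(store, recipe) -> bytes:
    """Rebuild the original data by following the recipe of digests."""
    return b"".join(store[digest] for digest in recipe)


# Four 1 KB chunks, of which only two are distinct.
data = b"A" * 1024 + b"B" * 1024 + b"A" * 1024 + b"A" * 1024
store, recipe = cas_store(data)
restored = cas_restore(store, recipe)
```

Here the recipe lists four digests while the store holds only two chunks, so the duplicated content is kept once and the original data is still recoverable.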
In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, while trading off error resilience and incurring 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
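To make the dependence on chunk-size concrete, the following sketch estimates space savings for a small synthetic data-set at several chunk sizes. Everything in it is an assumption made for illustration: the synthetic data (eight 4 KB records sharing a 3 KB prefix), the fixed-size chunking, and the overhead model of one 20-byte digest per chunk. It does not reproduce the paper's measurements or its overhead accounting.

```python
import hashlib


def space_savings(data: bytes, chunk_size: int, digest_size: int = 20) -> float:
    """Fraction of space saved by deduplicating fixed-size chunks.

    Assumed cost model (hypothetical): stored bytes = unique chunk bytes
    plus one digest of `digest_size` bytes per chunk reference.
    """
    unique = {}    # digest -> length of the unique chunk
    n_chunks = 0   # total number of chunk references
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique[hashlib.sha1(chunk).digest()] = len(chunk)
        n_chunks += 1
    stored = sum(unique.values()) + n_chunks * digest_size
    return 1.0 - stored / len(data)


# Synthetic data: 8 records, each a shared 3 KB prefix plus 1 KB of
# record-specific bytes. SHAKE-256 is used only as a deterministic
# source of incompressible-looking bytes.
shared = hashlib.shake_256(b"shared").digest(3072)
records = [
    shared + hashlib.shake_256(b"unique-%d" % i).digest(1024)
    for i in range(8)
]
data = b"".join(records)

savings = {size: space_savings(data, size) for size in (4096, 1024, 256)}
for size, value in savings.items():
    print(f"chunk size {size:5d} B: savings {value:.1%}")
```

Under these assumptions, 4 KB chunks find no duplicates, 1 KB chunks expose the shared prefix, and 256-byte chunks find the same duplicates but pay more per-chunk metadata, which is one way the trade-off between chunk-size and overhead can arise.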