research-article

Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Authors:
Partho Nath, Cisco Systems, Inc., San Jose, CA, USA
Bhuvan Urgaonkar, Pennsylvania State University, University Park, PA, USA
Anand Sivasubramaniam, Pennsylvania State University, University Park, PA, USA

HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing, June 2008, Pages 35–44
https://doi.org/10.1145/1383422.1383428

Published: 23 June 2008

ABSTRACT

Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce or entertainment services and highly parallel scientific/engineering applications and simulations, to name a few.
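The chunk-and-deduplicate idea described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 as the chunk identifier; the abstract does not say which chunking or hashing scheme the paper uses, and the names `cas_store` and `cas_restore` are hypothetical.

```python
import hashlib

def cas_store(data: bytes, chunk_size: int = 1024):
    """Split data into fixed-size, non-overlapping chunks and deduplicate them.

    Returns (store, recipe): store maps a chunk's SHA-256 digest to its bytes,
    recipe is the ordered list of digests needed to rebuild the data.
    Fixed-size chunking and SHA-256 are assumptions made for illustration.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a repeated chunk is stored only once
        recipe.append(digest)
    return store, recipe

def cas_restore(store, recipe) -> bytes:
    """Rebuild the original bytes by looking up each digest in order."""
    return b"".join(store[digest] for digest in recipe)

# Synthetic input with obvious repetition: 10 KB made of only two distinct 1 KB chunks.
data = b"A" * 4096 + b"B" * 2048 + b"A" * 4096
store, recipe = cas_store(data, chunk_size=1024)
print(len(recipe), "chunk references,", len(store), "unique chunks stored")
```

The recipe keeps one digest per chunk position, while the store holds each distinct chunk once, which is where the space reduction comes from. It also hints at the resilience concern raised in the abstract: losing one stored chunk affects every position that refers to it.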

In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, at the cost of reduced error resilience and 14% CAS-related overhead. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
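To make the dependence on chunk-size concrete, the following sketch measures space savings on a synthetic data-set at several chunk sizes. It is an illustration under stated assumptions, not the paper's methodology: the cost model (32 bytes of metadata per chunk reference), the helper names `space_savings` and `filler_block`, and the synthetic data are all hypothetical, and the resulting numbers say nothing about the 84% and 14% figures reported above.

```python
import hashlib

def space_savings(data: bytes, chunk_size: int, digest_bytes: int = 32) -> float:
    """Fraction of space saved by deduplicating fixed-size chunks.

    Assumed cost model: every chunk reference costs digest_bytes of metadata,
    and every distinct chunk is stored exactly once.
    """
    unique = {}
    n_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique[hashlib.sha256(chunk).digest()] = len(chunk)
        n_chunks += 1
    stored = sum(unique.values()) + n_chunks * digest_bytes
    return 1.0 - stored / len(data)

def filler_block(seed: int, size: int = 1024) -> bytes:
    """Deterministic pseudo-random bytes, so blocks with different seeds differ."""
    return b"".join(
        hashlib.sha256(f"{seed}:{j}".encode()).digest() for j in range(size // 32)
    )

# Synthetic data-set: a repeated 1 KB pattern alternating with distinct 1 KB blocks.
pattern = filler_block(-1)
data = b"".join(pattern + filler_block(i) for i in range(64))

savings = {size: space_savings(data, size) for size in (512, 1024, 2048)}
for size, value in savings.items():
    print(f"chunk size {size:4d} B: savings {value:.3f}")
```

On this synthetic input, 2 KB chunks find no duplicates because every chunk contains a distinct block, 1 KB chunks align with the repeated pattern, and 512 B chunks find the same duplicates but pay for twice as many chunk references. A different data-set could shift the best chunk-size entirely, which is the first factor the abstract names.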


Published in
HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing
June 2008
252 pages
ISBN: 9781595939975
DOI: 10.1145/1383422
General Chairs: Manish Parashar (Rutgers University, USA), Karsten Schwan (Georgia Institute of Technology, USA)
Program Chairs: Jon Weissman (National e-Science Center, Edinburgh, University of Minnesota, USA), Domenico Laforenza (Information Science and Technology Institute, CNR, Italy)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
caching
compression
content addressable storage