Experimental Study on Chunking Algorithms of Data Deduplication System on Large Scale Data

Nisha, T. R.; Abirami, S.; Manohar, E.

doi:10.1007/978-81-322-2674-1_9

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 398)

Abstract

Data deduplication, also known as data redundancy elimination, is a technique for saving storage space. It is highly successful in backup storage environments, where a large number of redundancies may exist. These redundancies can be eliminated by computing and comparing fingerprints. The comparison may be done at the file level, or the files may be split into chunks and compared at the chunk level. File-level deduplication gives poorer results than chunk-level deduplication because it computes the hash value over the entire file and therefore eliminates only duplicate files. This paper presents an experimental study of various chunking algorithms, since chunking plays a very important role in a data redundancy elimination system.
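
The difference between file-level and chunk-level fingerprint comparison can be illustrated with a minimal sketch. It is not the paper's method: it assumes fixed-size chunking (the simplest possible scheme) and SHA-256 fingerprints, and the function names, the 4-byte chunk size, and the sample backup data are all hypothetical, chosen only for illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the fingerprint of a file or chunk."""
    return hashlib.sha256(data).hexdigest()


def file_level_stored_bytes(files: dict[str, bytes]) -> int:
    """Bytes kept when only whole-file duplicates are eliminated."""
    unique = {}
    for content in files.values():
        # A file is stored once per distinct whole-file fingerprint.
        unique.setdefault(fingerprint(content), len(content))
    return sum(unique.values())


def chunk_level_stored_bytes(files: dict[str, bytes], chunk_size: int = 4) -> int:
    """Bytes kept when each file is first split into fixed-size chunks (assumed scheme)."""
    unique = {}
    for content in files.values():
        for start in range(0, len(content), chunk_size):
            chunk = content[start:start + chunk_size]
            # A chunk is stored once per distinct chunk fingerprint.
            unique.setdefault(fingerprint(chunk), len(chunk))
    return sum(unique.values())


# Hypothetical backup set: 36 bytes in total before deduplication.
backups = {
    "monday.bak": b"AAAABBBBCCCC",
    "tuesday.bak": b"AAAABBBBDDDD",    # differs from Monday only in the last chunk
    "wednesday.bak": b"AAAABBBBCCCC",  # exact copy of Monday
}

print(file_level_stored_bytes(backups))   # 24
print(chunk_level_stored_bytes(backups))  # 16
```

In this toy data set, file-level comparison removes only the exact copy, while chunk-level comparison also removes the chunks that the partially modified file shares with the others.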

Author information

Authors and Affiliations

Department of Information Science and Technology, College of Engineering, Guindy, Anna University, Chennai, India
T. R. Nisha, S. Abirami & E. Manohar

Corresponding author

Correspondence to T. R. Nisha.

Editor information

Editors and Affiliations

Electrical & Electronics Engineering, Noorul Islam College of Engineering, Kumaracoil, Tamil Nadu, India
L. Padma Suresh
Electrical Engineering, IIT Delhi, New Delhi, India
Bijaya Ketan Panigrahi

Cite this paper

Nisha, T.R., Abirami, S., Manohar, E. (2016). Experimental Study on Chunking Algorithms of Data Deduplication System on Large Scale Data. In: Suresh, L., Panigrahi, B. (eds) Proceedings of the International Conference on Soft Computing Systems. Advances in Intelligent Systems and Computing, vol 398. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2674-1_9

DOI: https://doi.org/10.1007/978-81-322-2674-1_9
Published: 08 December 2015
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2672-7
Online ISBN: 978-81-322-2674-1
eBook Packages: Engineering, Engineering (R0)
