Large-Scale Data Management System Using Data De-duplication System

  • Conference paper
Proceedings of the Second International Conference on Computer and Communication Technologies

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 379))

Abstract

Data de-duplication is the process of finding duplicate data and eliminating it from the storage environment. De-duplication can be performed at several levels. At the file level, the entire file is considered as a single unit for duplicate detection. At the chunk level, the file is split into small units called chunks, and duplicates are detected among those chunks. At the byte level, comparisons are performed byte by byte. The fingerprint of each chunk is the main parameter for duplicate detection, and these fingerprints are stored in a chunk index. As the chunk index grows, it must be placed on disk, and searching the on-disk index for each fingerprint consumes so much time that it leads to the chunk lookup disk bottleneck problem. This paper alleviates that problem to some extent by placing a bloom filter in the cache as a probabilistic summary of all the fingerprints in the on-disk chunk index, so that lookups for most new fingerprints avoid the disk entirely. The experiments use backup data sets obtained from university labs, and performance is measured in terms of the data de-duplication ratio.
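The approach described in the abstract can be sketched as follows: chunks are fingerprinted with a cryptographic hash, and a cache-resident bloom filter is consulted before the (slow) on-disk chunk index. This is a minimal illustrative sketch, not the paper's implementation; the fixed chunk size, the bloom filter parameters (`size_bits`, `num_hashes`), and the use of SHA-1 fingerprints with a Python `set` standing in for the on-disk index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Cache-resident probabilistic summary of fingerprints.
    Parameters are illustrative, not taken from the paper."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with k different prefixes.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(1, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedup_store(data: bytes, chunk_size: int,
                bloom: BloomFilter, disk_index: set) -> int:
    """Fixed-size chunk-level de-duplication. The on-disk chunk index
    (modelled here as a set) is consulted only when the bloom filter
    says the fingerprint may already exist. Returns chunks stored."""
    stored = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.sha1(chunk).digest()  # chunk fingerprint
        if bloom.may_contain(fp) and fp in disk_index:
            continue  # duplicate chunk: skip storing it again
        disk_index.add(fp)  # the (slow) on-disk index update
        bloom.add(fp)
        stored += 1
    return stored
```

Because the bloom filter never yields false negatives, a "definitely absent" answer skips the disk lookup safely; false positives only cost one extra index probe. The de-duplication ratio used for evaluation would then be the total chunk count divided by the unique chunks actually stored.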



Author information

Correspondence to S. Abirami.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Abirami, S., Vikraman, R., Murugappan, S. (2016). Large-Scale Data Management System Using Data De-duplication System. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 379. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2517-1_23

  • DOI: https://doi.org/10.1007/978-81-322-2517-1_23

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2516-4

  • Online ISBN: 978-81-322-2517-1

  • eBook Packages: Engineering (R0)
