Large-Scale Data Management System Using Data De-duplication System

  • Conference paper
Proceedings of the Second International Conference on Computer and Communication Technologies

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 379))

Abstract

Data de-duplication is the process of finding duplicate data and eliminating it from the storage environment. De-duplication can be performed at several levels. At the file level, the entire file is considered as a single unit for duplicate detection. At the chunk level, the file is split into small units called chunks, and duplicates are detected among those chunks. At the byte level, comparisons are performed byte by byte. The fingerprint of each chunk is the main parameter for duplicate detection, and these fingerprints are stored in a chunk index. As the chunk index grows, it must be placed on disk, and searching the on-disk index for each fingerprint consumes so much time that it leads to the chunk lookup disk bottleneck problem. This paper alleviates that problem to some extent by placing a bloom filter in the cache as a probabilistic summary of all the fingerprints in the on-disk chunk index, so that lookups for most new fingerprints avoid the disk entirely. The experiments use backup data sets obtained from university labs, and performance is measured in terms of the data de-duplication ratio.
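The approach described in the abstract can be sketched as follows: chunks are fingerprinted with a cryptographic hash, and a cache-resident bloom filter is consulted before the (slow) on-disk chunk index. This is a minimal illustrative sketch, not the paper's implementation; the fixed chunk size, the bloom filter parameters (`size_bits`, `num_hashes`), and the use of SHA-1 fingerprints with a Python `set` standing in for the on-disk index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Cache-resident probabilistic summary of fingerprints.
    Parameters are illustrative, not taken from the paper."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with k different prefixes.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(1, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedup_store(data: bytes, chunk_size: int,
                bloom: BloomFilter, disk_index: set) -> int:
    """Fixed-size chunk-level de-duplication. The on-disk chunk index
    (modelled here as a set) is consulted only when the bloom filter
    says the fingerprint may already exist. Returns chunks stored."""
    stored = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.sha1(chunk).digest()  # chunk fingerprint
        if bloom.may_contain(fp) and fp in disk_index:
            continue  # duplicate chunk: skip storing it again
        disk_index.add(fp)  # the (slow) on-disk index update
        bloom.add(fp)
        stored += 1
    return stored
```

Because the bloom filter never yields false negatives, a "definitely absent" answer skips the disk lookup safely; false positives only cost one extra index probe. The de-duplication ratio used for evaluation would then be the total chunk count divided by the unique chunks actually stored.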



Author information

Correspondence to S. Abirami.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Abirami, S., Vikraman, R., Murugappan, S. (2016). Large-Scale Data Management System Using Data De-duplication System. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 379. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2517-1_23

  • DOI: https://doi.org/10.1007/978-81-322-2517-1_23

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2516-4

  • Online ISBN: 978-81-322-2517-1

  • eBook Packages: Engineering (R0)
