Abstract
Data de-duplication is the process of detecting duplicate data and eliminating it from the storage environment. De-duplication can be performed at several levels: at the file level, where the entire file is treated as the unit of duplicate detection; at the chunk level, where the file is split into small units called chunks and duplicates are detected among those chunks; and at the byte level, where comparisons are performed byte by byte. Duplicate detection relies primarily on the fingerprints of the chunks, which are stored in a chunk index. As the chunk index grows, it must be placed on disk, and searching the on-disk index for each fingerprint becomes time-consuming; this is known as the chunk lookup disk bottleneck problem. This paper alleviates that problem by placing a Bloom filter in the cache as a probabilistic summary of all the fingerprints in the on-disk chunk index: a negative answer from the filter guarantees the fingerprint is new, so the disk index is consulted only on a positive answer. The evaluation uses backup data sets obtained from university labs, and performance is measured in terms of the data de-duplication ratio.
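The scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the filter sizes, the SHA-1 chunk fingerprint, and the names `BloomFilter`, `fingerprint`, and `deduplicate` are assumptions chosen for the example, and a Python `set` stands in for the disk-resident chunk index.

```python
import hashlib

class BloomFilter:
    """In-memory probabilistic summary of chunk fingerprints (illustrative sizes)."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting a hash of the item
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False -> definitely absent; True -> possibly present (may be a false positive)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def fingerprint(chunk: bytes) -> bytes:
    # SHA-1 is a common fingerprint choice in the dedup literature (an assumption here)
    return hashlib.sha1(chunk).digest()


def deduplicate(chunks, bloom, on_disk_index):
    """Store only chunks whose fingerprint is not already indexed.

    A negative Bloom-filter answer skips the expensive on-disk lookup
    entirely, which is the point of caching the filter in memory.
    """
    stored = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        if bloom.might_contain(fp) and fp in on_disk_index:
            continue  # duplicate chunk: do not store it again
        bloom.add(fp)
        on_disk_index.add(fp)  # stand-in for updating the disk-resident chunk index
        stored += 1
    return stored
```

For example, feeding the chunk stream `[b"aaa", b"bbb", b"aaa"]` through `deduplicate` stores only two chunks; the de-duplication ratio would then be computed from bytes presented versus bytes actually stored.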
© 2016 Springer India
Cite this paper
Abirami, S., Vikraman, R., Murugappan, S. (2016). Large-Scale Data Management System Using Data De-duplication System. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 379. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2517-1_23
DOI: https://doi.org/10.1007/978-81-322-2517-1_23
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2516-4
Online ISBN: 978-81-322-2517-1
eBook Packages: Engineering (R0)