Research on Optimization Method of Merging and Prefetching for Massive Small Files in HDFS

Computer Science, 2017, Vol. 44, Issue (Z11): 516-519. doi: 10.11896/j.issn.1002-137X.2017.11A.109

ZHENG Tong, GUO Wei-bin, FAN Gui-sheng

School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237

Online: 2018-12-01 Published: 2018-12-01
Funding: This work was supported by the National Natural Science Foundation of China.

Abstract

Abstract: HDFS has clear advantages for storing massive numbers of files. However, when small files make up the vast majority of those files, its storage architecture, which relies on a single NameNode, severely degrades performance. To address this, a merging-based scheme is proposed: small files are merged into large files, and the mapping from each small file to its merged file is established and stored in HBase. To improve read speed, an LRU-based prefetching mechanism is built. Experiments show that the proposed method markedly improves the overall performance of HDFS when handling massive numbers of files.

Key words: HDFS, Massive files, Merging, Mapping, LRU, Prefetching mechanism
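
The ideas named in the abstract (merging, a small-file-to-merged-file mapping, and LRU-based prefetching) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class names `MergedStore` and `PrefetchingReader`, the `(offset, length)` index layout, and the "prefetch the next files in merge order" policy are all assumptions made for illustration. A `bytearray` stands in for a merged HDFS file and a Python dict stands in for the HBase mapping table.

```python
from collections import OrderedDict


class MergedStore:
    """Toy stand-in for one merged file plus its small-file index (hypothetical)."""

    def __init__(self):
        self.blob = bytearray()  # stands in for one large merged file in HDFS
        self.index = {}          # name -> (offset, length); stands in for the HBase mapping
        self.order = []          # names in merge order, used to pick prefetch candidates

    def add(self, name, data):
        # Append the small file to the merged blob and record where it landed.
        self.index[name] = (len(self.blob), len(data))
        self.order.append(name)
        self.blob.extend(data)

    def read(self, name):
        # Look up the mapping, then slice the small file out of the merged blob.
        offset, length = self.index[name]
        return bytes(self.blob[offset:offset + length])


class PrefetchingReader:
    """LRU cache in front of MergedStore that also prefetches neighbours (assumed policy)."""

    def __init__(self, store, capacity=4, prefetch=2):
        self.store = store
        self.capacity = capacity
        self.prefetch = prefetch
        self.cache = OrderedDict()  # oldest entry first, most recently used last
        self.hits = 0
        self.misses = 0

    def _put(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def get(self, name):
        if name in self.cache:
            self.hits += 1
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        self.misses += 1
        data = self.store.read(name)
        # On a miss, also load the next few files that were merged after this one.
        pos = self.store.order.index(name)
        for neighbour in self.store.order[pos + 1:pos + 1 + self.prefetch]:
            if neighbour not in self.cache:
                self._put(neighbour, self.store.read(neighbour))
        self._put(name, data)  # inserted last so the requested file is the most recent
        return data
```

In this sketch, a sequential scan over files that were merged together pays one cache miss and then hits on the prefetched neighbours. Whether that matches the access pattern and prefetch policy evaluated in the paper cannot be determined from the abstract alone.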

ZHENG Tong, GUO Wei-bin, FAN Gui-sheng. Research on Optimization Method of Merging and Prefetching for Massive Small Files in HDFS[J]. Computer Science, 2017, 44(Z11): 516-519. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.109
