Optimizing the Hadoop MapReduce Framework with high-performance storage devices

Moon, Sangwhan; Lee, Jaehwan; Sun, Xiling; Kee, Yang-suk

doi:10.1007/s11227-015-1447-3


Published: 29 May 2015

Volume 71, pages 3525–3548, (2015)

The Journal of Supercomputing

Sangwhan Moon¹, Jaehwan Lee², Xiling Sun³ & Yang-suk Kee³


Abstract

Solid-state drives (SSDs) are an attractive alternative to hard disk drives (HDDs) to accelerate the Hadoop MapReduce Framework. However, the SSD characteristics and today’s Hadoop framework exhibit mismatches that impede indiscriminate SSD integration. This paper explores how to optimize a Hadoop MapReduce Framework with SSDs in terms of performance, cost, and energy consumption. It identifies extensible best practices that can exploit SSD benefits within Hadoop when combined with high network bandwidth and increased parallel storage access. Our Terasort benchmark results demonstrate that Hadoop currently does not sufficiently exploit SSD throughput. Hence, using faster SSDs in Hadoop does not enhance its performance. We show that SSDs presently deliver significant efficiency when storing intermediate Hadoop data, leaving HDDs for Hadoop Distributed File System (HDFS). The proposed configuration is optimized with the JVM reuse option and frequent heartbeat interval option. Moreover, we examined the performance of a state-of-the-art non-volatile memory express interface SSD within the Hadoop MapReduce Framework. While HDFS read and write throughput increases with high-performance SSDs, achieving complete system performance improvement requires carefully balancing CPU, network, and storage resource capabilities at a system level.
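
The storage split described above (intermediate data on SSD, HDFS blocks on HDD, JVM reuse enabled) might be expressed roughly as follows. This is only a sketch under stated assumptions: the abstract does not name any configuration properties, so the Hadoop 1.x property names `mapred.local.dir`, `dfs.data.dir`, and `mapred.job.reuse.jvm.num.tasks` are assumed here, the mount points are hypothetical, and the helper `build_site_xml` is introduced purely for illustration. The heartbeat setting is left out because the abstract gives neither a property nor a value for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mount points; the abstract does not give any paths.
SSD_DIR = "/mnt/ssd/mapred/local"
HDD_DIR = "/mnt/hdd/dfs/data"


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed Hadoop 1.x names: intermediate (map output and spill) data on SSD.
mapred_site = build_site_xml({
    "mapred.local.dir": SSD_DIR,
    # -1 allows a JVM to be reused for any number of tasks of the same job.
    "mapred.job.reuse.jvm.num.tasks": -1,
})

# Assumed Hadoop 1.x name: HDFS block storage stays on the HDD.
hdfs_site = build_site_xml({"dfs.data.dir": HDD_DIR})

print(mapred_site)
print(hdfs_site)
```

On a Hadoop 2.x cluster the corresponding properties carry different names, so the dictionary keys would need to be adapted to the deployed version.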


Notes

We repeated each experiment five times, since the variance of the results was small enough.
I/O utilization is defined as the percentage of CPU time during which I/O requests were issued [10].

References

Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI’04: 6th symposium on operating system design and implementation, San Francisco
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: SOSP’03: 19th ACM symposium on operating systems principles
Apache Hadoop Project. http://hadoop.apache.org. Accessed 21 May 2015
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: MSST’10: 26th IEEE symposium on massive storage systems and technologies
Dell. Solid state drive vs. hard disk drive price and performance study. [White paper]
Shafer J, Rixner S, Cox A (2010) The Hadoop distributed filesystem: balancing portability and performance. In: ISPASS’10: IEEE international symposium on performance analysis of systems and software
Moon S, Lee J, Kee Y (2014) Introducing SSDs to Hadoop MapReduce Framework. In: IEEE Cloud’14: 7th IEEE international conference on cloud computing
Flexible IO Tester. http://git.kernel.dk/?p=fio.git;a=summary. Accessed 21 May 2015
DFSIO program. Available in Hadoop source distribution: src/test/org/apache/hadoop/fs/TestDFSIO. Accessed 21 May 2015
Linux iostat manual page. http://sebastien.godard.pagesperso-orange.fr/man_iostat.html. Accessed 21 May 2015
Terasort program. http://hadoop.apache.org/docs/r0.23.6/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed 21 May 2015
Twitter’s Hadoop-LZO. http://github.com/twitter/hadoop-lzo. Accessed 21 May 2015
JProfiler, ej-technologies GmbH. https://www.ej-technologies.com/products/jprofiler/overview.html. Accessed 21 May 2015
Cloud Computing, Intel Inc., Optimizing Hadoop deployments. [White paper]
Intel Xeon Processor-Based Servers, Big data analytics, Intel Inc., Optimizing Hadoop Deployments. [White paper]
Hortonworks Inc., Best practices: Linux file systems for HDFS. http://hortonworks.com/kb/linux-file-systems-for-hdfs. Accessed 21 May 2015
White T (2012) Hadoop: the definitive guide. O’Reilly Media Inc, USA
NVM Express Interface. http://www.nvmexpress.org. Accessed 21 May 2015
Samsung Enterprise Class SSD Datasheet. http://www.samsung.com/global/business/semiconductor/file/product/XS1715_ProdOverview_2014_1.pdf
Sur S, Wang H, Huang J, Ouyang X, Panda D (2010) Can high-performance interconnects benefit Hadoop distributed file system. In: MASVDC’10: workshop on micro architectural support for virtualization, data center computing, and clouds in conjunction with MICRO’10
Islam N, Rahman M, Jose J, Rajachandrasekar R, Wang H, Subramoni H, Murthy C, Panda D (2012) High performance RDMA-based design of HDFS over InfiniBand. In: SC ’12: the international conference on high performance computing, networking, storage and analysis
Appuswamy R, Gkantsidis C, Narayanan D, Hodson O, Rowstron A (2013) Scale-up vs scale-out for Hadoop: time to rethink? In: ACM symposium on cloud computing, 2 October 2013
Harter T, Borthakur D, Dong S, Aiyer A, Tang L, Arpaci-Dusseau A, Arpaci-Dusseau R (2014) Analysis of HDFS under HBase: a Facebook messages case study. In: FAST’14: 12th USENIX conference on file and storage technologies
SanDisk, Increasing Hadoop performance with SanDisk solid state drives (SSDs). [White paper]
Dai J, Huang J, Huang S, Huang B, Liu Y (2011) HiTune: dataflow-based performance analysis for big data cloud. In: Usenix ATC’11: USENIX annual technical conference
Joshi S, Liaskovitis V (2012) Java garbage collection characteristics and tuning guidelines for Apache Hadoop TeraSort workload. [White paper]
Chen Y, Ganapathi AS, Katz RH (2010) To compress or not to compress—compute vs. IO tradeoffs for MapReduce energy efficiency. Technical Report No. UCB/EECS-2010-36, University of California at Berkeley


Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77840, USA
Sangwhan Moon
School of Electronics and Information Engineering, Korea Aerospace University, Goyang-si, Republic of Korea
Jaehwan Lee
Advanced Datacenter Solution Group, Samsung Semiconductor Incorporation, Milpitas, CA, 95036, USA
Xiling Sun & Yang-suk Kee


Corresponding author

Correspondence to Jaehwan Lee.


About this article

Cite this article

Moon, S., Lee, J., Sun, X. et al. Optimizing the Hadoop MapReduce Framework with high-performance storage devices. J Supercomput 71, 3525–3548 (2015). https://doi.org/10.1007/s11227-015-1447-3

Issue Date: September 2015
DOI: https://doi.org/10.1007/s11227-015-1447-3
