Abstract
Solid-state drives (SSDs) are an attractive alternative to hard disk drives (HDDs) for accelerating the Hadoop MapReduce Framework. However, SSD characteristics and today’s Hadoop framework exhibit mismatches that impede indiscriminate SSD integration. This paper explores how to optimize a Hadoop MapReduce Framework with SSDs in terms of performance, cost, and energy consumption. It identifies extensible best practices that can exploit SSD benefits within Hadoop when combined with high network bandwidth and increased parallel storage access. Our Terasort benchmark results demonstrate that Hadoop currently does not sufficiently exploit SSD throughput. Hence, using faster SSDs in Hadoop does not by itself enhance its performance. We show that SSDs presently deliver significant efficiency when they store intermediate Hadoop data, leaving HDDs for the Hadoop Distributed File System (HDFS). The proposed configuration is further optimized with the JVM reuse option and a more frequent heartbeat interval. Moreover, we examine the performance of a state-of-the-art Non-Volatile Memory Express (NVMe) interface SSD within the Hadoop MapReduce Framework. While HDFS read and write throughput increases with high-performance SSDs, achieving complete system performance improvement requires carefully balancing CPU, network, and storage resource capabilities at a system level.
Notes
We repeated each experiment five times, since the variance of the results was small enough.
I/O utilization is defined as the percentage of CPU time during which I/O requests were issued [10].
References
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI’04: 6th symposium on operating system design and implementation, San Francisco
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: SOSP’03: 19th ACM symposium on operating systems principles
Apache Hadoop Project. http://hadoop.apache.org. Accessed 21 May 2015
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: MSST’10: 26th IEEE symposium on massive storage systems and technologies
Dell. Solid state drive vs. hard disk drive price and performance study. [White paper]
Shafer J, Rixner S, Cox A (2010) The Hadoop distributed filesystem: balancing portability and performance. In: ISPASS’10: IEEE international symposium on performance analysis of systems and software
Moon S, Lee J, Kee Y (2014) Introducing SSDs to Hadoop MapReduce Framework. In: IEEE Cloud’14: 7th IEEE international conference on cloud computing
Flexible IO Tester. Available at http://git.kernel.dk/?p=fio.git;a=summary. Accessed 21 May 2015
DFSIO program. Available in Hadoop source distribution: src/test/org/apache/hadoop/fs/TestDFSIO. Accessed 21 May 2015
Linux iostat manual page. http://sebastien.godard.pagesperso-orange.fr/man_iostat.html. Accessed 21 May 2015
Terasort program. http://hadoop.apache.org/docs/r0.23.6/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed 21 May 2015
Twitter’s Hadoop-LZO. http://github.com/twitter/hadoop-lzo. Accessed 21 May 2015
JProfiler, ej-technologies GmbH. https://www.ej-technologies.com/products/jprofiler/overview.html. Accessed 21 May 2015
Cloud Computing, Intel Inc., Optimizing Hadoop deployments. [White paper]
Intel Xeon Processor-Based Servers, Big data analytics, Intel Inc., Optimizing Hadoop Deployments. [White paper]
Hortonworks Inc., Best practices: Linux file systems for HDFS. http://hortonworks.com/kb/linux-file-systems-for-hdfs. Accessed 21 May 2015
White T (2012) Hadoop: the definitive guide. O’Reilly Media Inc, USA
NVM Express Interface. http://www.nvmexpress.org. Accessed 21 May 2015
Samsung Enterprise Class SSD Datasheet. http://www.samsung.com/global/business/semiconductor/file/product/XS1715_ProdOverview_2014_1.pdf
Sur S, Wang H, Huang J, Ouyang X, Panda D (2010) Can high-performance interconnects benefit Hadoop distributed file system. In: MASVDC’10: workshop on micro architectural support for virtualization, data center computing, and clouds in conjunction with MICRO’10
Islam N, Rahman M, Jose J, Rajachandrasekar R, Wang H, Subramoni H, Murthy C, Panda D (2012) High performance RDMA-based design of HDFS over InfiniBand. In: SC ’12: the international conference on high performance computing, networking, storage and analysis
Appuswamy R, Gkantsidis C, Narayanan D, Hodson O, Rowstron A (2013) Scale-up vs scale-out for Hadoop: time to rethink? In: ACM symposium on cloud computing, 2 October 2013
Harter T, Borthakur D, Dong S, Aiyer A, Tang L, Arpaci-Dusseau A, Arpaci-Dusseau R (2014) Analysis of HDFS under HBase: a Facebook messages case study. In: FAST’14: 12th USENIX conference on file and storage technologies
SanDisk, Increasing Hadoop performance with SanDisk solid state drives (SSDs). [White paper]
Dai J, Huang J, Huang S, Huang B, Liu Y (2011) HiTune: dataflow-based performance analysis for big data cloud. In: Usenix ATC’11: USENIX annual technical conference
Joshi S, Liaskovitis V (2012) Java garbage collection characteristics and tuning guidelines for Apache Hadoop TeraSort workload. [White paper]
Chen Y, Ganapathi AS, Katz RH (2010) To compress or not to compress—compute vs. IO tradeoffs for MapReduce energy efficiency. Technical Report No. UCB/EECS-2010-36, University of California at Berkeley
Cite this article
Moon, S., Lee, J., Sun, X. et al. Optimizing the Hadoop MapReduce Framework with high-performance storage devices. J Supercomput 71, 3525–3548 (2015). https://doi.org/10.1007/s11227-015-1447-3
DOI: https://doi.org/10.1007/s11227-015-1447-3