Optimizing the Hadoop MapReduce Framework with high-performance storage devices

The Journal of Supercomputing

Abstract

Solid-state drives (SSDs) are an attractive alternative to hard disk drives (HDDs) for accelerating the Hadoop MapReduce Framework. However, mismatches between SSD characteristics and today’s Hadoop framework impede indiscriminate SSD integration. This paper explores how to optimize a Hadoop MapReduce Framework with SSDs in terms of performance, cost, and energy consumption. It identifies extensible best practices that can exploit SSD benefits within Hadoop when combined with high network bandwidth and increased parallel storage access. Our Terasort benchmark results demonstrate that Hadoop currently does not sufficiently exploit SSD throughput; hence, using faster SSDs in Hadoop does not enhance its performance. We show that SSDs presently deliver significant efficiency when they store intermediate Hadoop data, leaving HDDs for the Hadoop Distributed File System (HDFS). The proposed configuration is further optimized with the JVM reuse option and the frequent heartbeat interval option. Moreover, we examine the performance of a state-of-the-art non-volatile memory express (NVMe) interface SSD within the Hadoop MapReduce Framework. While HDFS read and write throughput increases with high-performance SSDs, achieving complete system performance improvement requires carefully balancing CPU, network, and storage resource capabilities at a system level.
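
As a rough illustration of the storage split described in the abstract (intermediate MapReduce data on SSD, HDFS blocks on HDD, JVM reuse enabled), the following Python sketch renders the corresponding Hadoop configuration files. It is not the authors' setup: the property names assume a Hadoop 1.x deployment, the mount points are hypothetical, and `build_site_xml` is a helper invented for this example. The heartbeat interval setting is left out because its property name depends on the Hadoop version in use.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of property names and values as a Hadoop *-site.xml string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumption: Hadoop 1.x property names. All paths are hypothetical mount points.
MAPRED_SITE = {
    # Intermediate (map output / spill) data goes to the SSD.
    "mapred.local.dir": "/mnt/ssd0/mapred/local",
    # -1 lets one JVM be reused for any number of tasks of the same job.
    "mapred.job.reuse.jvm.num.tasks": -1,
}
HDFS_SITE = {
    # HDFS block storage stays on the HDDs.
    "dfs.data.dir": "/mnt/hdd0/dfs/data,/mnt/hdd1/dfs/data",
}

mapred_site_xml = build_site_xml(MAPRED_SITE)
hdfs_site_xml = build_site_xml(HDFS_SITE)
```

Writing `mapred_site_xml` and `hdfs_site_xml` to `mapred-site.xml` and `hdfs-site.xml` would be the next step on a real cluster; the values shown here are placeholders, not tuned recommendations.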

Notes

  1. We repeated each experiment five times, since the variance of the results is small enough.

  2. I/O utilization is defined as the percentage of CPU time during which I/O requests were issued to the device [10].
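
The I/O utilization figure in the second note can be approximated from two samples of the Linux `/proc/diskstats` counters, which is how the iostat tool cited in [10] is understood to derive it. The Python sketch below is an illustration under that assumption, not the measurement script used in the article; the function names and the sample line are hypothetical.

```python
def parse_io_ticks_ms(diskstats_text, device):
    """Return the 'time spent doing I/Os' counter (ms) for one device.

    Each /proc/diskstats line holds major, minor, device name, then the
    per-device counters; the tenth counter is the busy time in milliseconds.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise ValueError(f"device {device!r} not found")


def io_utilization_percent(ticks_start_ms, ticks_end_ms, interval_ms):
    """Percentage of the sampling interval during which the device was busy."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    busy_ms = ticks_end_ms - ticks_start_ms
    # Clamp at 100 to absorb small timing jitter between the two samples.
    return min(100.0, 100.0 * busy_ms / interval_ms)


# Hypothetical sample line in /proc/diskstats layout (busy time = 180 ms).
SAMPLE = "   8       0 sda 100 0 800 50 200 0 1600 150 0 180 200"
start_ticks = parse_io_ticks_ms(SAMPLE, "sda")
utilization = io_utilization_percent(start_ticks, start_ticks + 500, 1000)
```

In this example the device was busy for 500 ms of a 1000 ms interval, so `utilization` is 50.0. A value close to 100 indicates that the device is saturated for the sampled interval.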

References

  1. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI’04: 6th symposium on operating system design and implementation, San Francisco

  2. Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: SOSP’03: 19th ACM symposium on operating systems principles

  3. Apache Hadoop Project. http://hadoop.apache.org. Accessed 21 May 2015

  4. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: MSST’10: 26th IEEE symposium on massive storage systems and technologies

  5. Dell. Solid state drive vs. hard disk drive price and performance study. [White paper]

  6. Shafer J, Rixner S, Cox A (2010) The Hadoop distributed filesystem: balancing portability and performance. In: ISPASS’10: IEEE international symposium on performance analysis of systems and software

  7. Moon S, Lee J, Kee Y (2014) Introducing SSDs to Hadoop MapReduce Framework. In: IEEE Cloud’14: 7th IEEE international conference on cloud computing

  8. Flexible IO Tester. Available in http://git.kernel.dk/?p=fio.git;a=summary. Accessed 21 May 2015

  9. DFSIO program. Available in Hadoop source distribution: src/test/org/apache/hadoop/fs/TestDFSIO. Accessed 21 May 2015

  10. Linux iostat manual page. http://sebastien.godard.pagesperso-orange.fr/man_iostat.html. Accessed 21 May 2015

  11. Terasort program. http://hadoop.apache.org/docs/r0.23.6/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed 21 May 2015

  12. Twitter’s Hadoop-LZO. http://github.com/twitter/hadoop-lzo. Accessed 21 May 2015

  13. JProfiler, ej-technologies GmbH. https://www.ej-technologies.com/products/jprofiler/overview.html. Accessed 21 May 2015

  14. Cloud Computing, Intel Inc., Optimizing Hadoop deployments. [White paper]

  15. Intel Xeon Processor-Based Servers, Big data analytics, Intel Inc., Optimizing Hadoop Deployments. [White paper]

  16. Hortonworks Inc., Best practices: Linux file systems for HDFS. http://hortonworks.com/kb/linux-file-systems-for-hdfs. Accessed 21 May 2015

  17. White T (2012) Hadoop: the definitive guide. O’Reilly Media Inc, USA

  18. NVM Express Interface. http://www.nvmexpress.org. Accessed 21 May 2015

  19. Samsung Enterprise Class SSD Datasheet. http://www.samsung.com/global/business/semiconductor/file/product/XS1715_ProdOverview_2014_1.pdf

  20. Sur S, Wang H, Huang J, Ouyang X, Panda D (2010) Can high-performance interconnects benefit Hadoop distributed file system. In: MASVDC’10: workshop on micro architectural support for virtualization, data center computing, and clouds in conjunction with MICRO’10

  21. Islam N, Rahman M, Jose J, Rajachandrasekar R, Wang H, Subramoni H, Murthy C, Panda D (2012) High performance RDMA-based design of HDFS over InfiniBand. In: SC ’12: the international conference on high performance computing, networking, storage and analysis

  22. Appuswamy R, Gkantsidis C, Narayanan D, Hodson O, Rowstron A (2013) Scale-up vs scale-out for Hadoop: time to rethink? In: ACM symposium on cloud computing, 2 October 2013

  23. Harter T, Borthakur D, Dong S, Aiyer A, Tang L, Arpaci-Dusseau A, Arpaci-Dusseau R (2014) Analysis of HDFS under HBase: a Facebook messages case study. In: FAST’14: 12th USENIX conference on file and storage technologies

  24. SanDisk, Increasing Hadoop performance with SanDisk solid state drives (SSDs). [White paper]

  25. Dai J, Huang J, Huang S, Huang B, Liu Y (2011) HiTune: dataflow-based performance analysis for big data cloud. In: Usenix ATC’11: USENIX annual technical conference

  26. Joshi S, Liaskovitis V (2012) Java garbage collection characteristics and tuning guidelines for Apache Hadoop TeraSort workload. [White paper]

  27. Chen Y, Ganapathi AS, Katz RH (2010) To compress or not to compress—compute vs. IO tradeoffs for MapReduce energy efficiency. Technical Report No. UCB/EECS-2010-36, University of California at Berkeley


Corresponding author

Correspondence to Jaehwan Lee.

Cite this article

Moon, S., Lee, J., Sun, X. et al. Optimizing the Hadoop MapReduce Framework with high-performance storage devices. J Supercomput 71, 3525–3548 (2015). https://doi.org/10.1007/s11227-015-1447-3

  • DOI: https://doi.org/10.1007/s11227-015-1447-3
