research-article

Using realistic simulation for performance analysis of mapreduce setups

Authors:
Guanying Wang, Virginia Tech, Blacksburg, VA, USA
Ali R. Butt, Virginia Tech, Blacksburg, VA, USA
Prashant Pandey, IBM Almaden Research Center, San Jose, CA, USA
Karan Gupta, IBM Almaden Research Center, San Jose, CA, USA

LSAP '09: Proceedings of the 1st ACM workshop on Large-Scale system and application performance, June 2009, Pages 19–26
https://doi.org/10.1145/1552272.1552278

Published: 10 June 2009

ABSTRACT

Recently, there has been a huge growth in the amount of data processed by enterprises and the scientific computing community. Two promising trends ensure that applications will be able to deal with ever increasing data volumes: First, the emergence of cloud computing, which provides transparent access to a large number of compute, storage and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. However, the design space of these systems has not been explored in detail. Specifically, the impact of various design choices and run-time parameters of a MapReduce system on application performance remains an open question.

To this end, we embarked on systematically understanding the performance of MapReduce systems, but soon realized that understanding effects of parameter tweaking in a large-scale setup with many variables was impractical. Consequently, in this paper, we present the design of an accurate MapReduce simulator, MRPerf, for facilitating exploration of MapReduce design space. MRPerf captures various aspects of a MapReduce setup, and uses this information to predict expected application performance. In essence, MRPerf can serve as a design tool for MapReduce infrastructure, and as a planning tool for making MapReduce deployment far easier via reduction in the number of parameters that currently have to be hand-tuned using rules of thumb.

Our validation of MRPerf using data from medium-scale production clusters shows that it is able to predict application performance accurately, and thus can be a useful tool in enabling cloud computing. Moreover, an initial application of MRPerf to our test clusters running Hadoop revealed a performance bottleneck; fixing it resulted in a performance improvement of up to 28.05%.

Index Terms

Using realistic simulation for performance analysis of mapreduce setups
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Reviews

Reviewer: Tommaso Mazza

MRPerf, "a MapReduce simulator ... for facilitating exploration of the MapReduce design space," is introduced in this paper. The authors are motivated by two things: the emergence of cloud computing and the development of the MapReduce programming model. In particular, they focus on "the impact of various design choices and runtime parameters of a MapReduce system on application performance." Because understanding the effects of parameter tuning is practically impossible in systems with many variables, Wang et al. designed MRPerf to facilitate the exploration of the MapReduce design space. Thus, the authors' main goal was to show how it is possible to design a MapReduce infrastructure and to reduce the number of parameters that usually have to be tuned by hand. The claimed precision of the tool was validated using data from medium-scale production clusters running Hadoop. A performance bottleneck was detected, and a performance improvement of up to 28.05 percent was achieved.

After a quick introduction of the problem and an overview of the typical high-performance computing (HPC) components and configurations (which I would have avoided or at least condensed), Wang et al. successfully describe the simulator architecture. Subsequently, in section 3.2, they show the input file format. I usually dislike seeing chunks of code in this type of paper and find that citing supplementary materials or external documentation is preferable. The reason is that the limited length of a paper never succeeds in explaining the code, and the code never completely elucidates itself. The performance evaluation sections are complete and exhaustive.

Generally, apart from some minor issues, such as the use of the word "in-exact" in section 1.1, this is a good paper. It is well written and organized, and includes good discussions. I enjoyed reading it.

Online Computing Reviews Service


Published in
LSAP '09: Proceedings of the 1st ACM workshop on Large-Scale system and application performance
June 2009
42 pages
ISBN: 9781605585925
DOI: 10.1145/1552272
Program Chairs:
Dick Epema, Delft University of Technology, the Netherlands
Jose Moreira, IBM T.J. Watson Research Lab, USA
Carey Williamson, University of Calgary, Canada
Copyright © 2009 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
cloud computing
hadoop
mapreduce
simulation
Qualifiers
- research-article
Acceptance Rates
LSAP '09 Paper Acceptance Rate: 4 of 7 submissions, 57%
Overall Acceptance Rate: 4 of 7 submissions, 57%