ABSTRACT
The amount of data processed by enterprises and the scientific computing community has grown enormously in recent years. Two promising trends ensure that applications will be able to deal with ever-increasing data volumes: first, the emergence of cloud computing, which provides transparent access to a large number of compute, storage, and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. However, the design space of these systems has not been explored in detail. Specifically, the impact of various design choices and run-time parameters of a MapReduce system on application performance remains an open question.
To this end, we set out to understand the performance of MapReduce systems systematically, but soon realized that studying the effects of parameter tweaking in a large-scale setup with many variables was impractical. Consequently, in this paper, we present the design of an accurate MapReduce simulator, MRPerf, that facilitates exploration of the MapReduce design space. MRPerf captures various aspects of a MapReduce setup and uses this information to predict expected application performance. In essence, MRPerf can serve as a design tool for MapReduce infrastructure, and as a planning tool that makes MapReduce deployment far easier by reducing the number of parameters that currently have to be hand-tuned using rules of thumb.
Our validation of MRPerf using data from medium-scale production clusters shows that it predicts application performance accurately, and thus can be a useful tool in enabling cloud computing. Moreover, an initial application of MRPerf to our test clusters running Hadoop revealed a performance bottleneck; fixing it improved performance by up to 28.05%.
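To make the idea of predicting performance from a description of the setup concrete, here is a minimal sketch of a back-of-the-envelope estimator. It is not MRPerf's model, which the abstract does not describe: the `ClusterConfig` fields, the `estimate_job_seconds` function, and the wave-based cost formula are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    """Hypothetical setup description; not MRPerf's actual inputs."""
    nodes: int
    map_slots_per_node: int
    reduce_slots_per_node: int
    disk_mb_per_s: float      # assumed sequential disk throughput per task
    network_mb_per_s: float   # assumed aggregate shuffle bandwidth


def estimate_job_seconds(cfg, input_mb, split_mb, num_reducers,
                         map_output_ratio=1.0):
    """Rough wave-based estimate; ignores CPU cost, stragglers, and overlap."""
    # Map phase: tasks run in waves limited by the total number of map slots.
    num_maps = math.ceil(input_mb / split_mb)
    map_waves = math.ceil(num_maps / (cfg.nodes * cfg.map_slots_per_node))
    map_time = map_waves * (split_mb / cfg.disk_mb_per_s)

    # Shuffle phase: all map output crosses the network once.
    shuffle_mb = input_mb * map_output_ratio
    shuffle_time = shuffle_mb / cfg.network_mb_per_s

    # Reduce phase: each reducer processes an equal share of the shuffle data.
    reduce_slots = cfg.nodes * cfg.reduce_slots_per_node
    reduce_waves = math.ceil(num_reducers / reduce_slots)
    reduce_time = reduce_waves * (shuffle_mb / num_reducers / cfg.disk_mb_per_s)

    return map_time + shuffle_time + reduce_time


if __name__ == "__main__":
    # Sweep one parameter (node count) while holding the rest fixed.
    for nodes in (2, 4, 8):
        cfg = ClusterConfig(nodes, 2, 2, 64.0, 128.0)
        print(nodes, estimate_job_seconds(cfg, 1024, 64, 8))
```

Even a crude model like this shows why a simulator is attractive: each parameter can be swept in isolation without reconfiguring a real cluster. An accurate tool would additionally need to model effects this sketch deliberately leaves out, such as network topology, disk contention, and scheduling.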
Index Terms
- Using realistic simulation for performance analysis of mapreduce setups