Abstract
MapReduce-based systems have emerged as a prominent framework for large-scale data analysis, with fault tolerance as one of their key features. MapReduce introduced simple yet efficient mechanisms to handle different kinds of failures, including crashes, omissions, and arbitrary failures. This contribution discusses in detail the types of failures in MapReduce systems and surveys the mechanisms the framework uses to detect, handle, and recover from them. It also surveys state-of-the-art optimizations that improve fault tolerance in MapReduce, in particular in its open-source implementation, Hadoop. Finally, it identifies the remaining challenges and open issues in building efficient fault-tolerance mechanisms for MapReduce.
Notes
- 1.
- 2. Jeff Dean, one of the leading engineers at Google, said: (we) “lost 1600 of 1800 machines once, but finished fine”.
- 3.
- 4. Spot instances are virtual machine resources in Amazon Web Services (AWS) for which a user defines the maximum bid price that he/she is willing to pay. When there is little competition for them, prices are lower and the chance of obtaining them is higher. When demand rises, however, AWS has the right to stop the user's spot instances. If Amazon stops the spot instances, the user does not pay; if the user instead stops them before the hour completes, the user is obliged to pay for that consumption.
Acknowledgments
The research leading to these results has received funding from the H2020 project reference number 642963 in the call H2020-MSCA-ITN-2014.
Copyright information
© 2016 Springer International Publishing AG
About this chapter
Cite this chapter
Memishi, B., Ibrahim, S., Pérez, M.S., Antoniu, G. (2016). Fault Tolerance in MapReduce: A Survey. In: Pop, F., Kołodziej, J., Di Martino, B. (eds) Resource Management for Big Data Platforms. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-44881-7_11
DOI: https://doi.org/10.1007/978-3-319-44881-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44880-0
Online ISBN: 978-3-319-44881-7
eBook Packages: Computer Science (R0)