Abstract
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper, an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative Data Processing on Large Clusters” (PVLDB 3(1):285–296, 2010), presents HaLoop, a modified version of the Hadoop MapReduce framework designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler that exploits these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets. In the applications we tested, HaLoop improved runtimes by a factor of 1.85 on average compared with Hadoop, and shuffled only 4% as much data between mappers and reducers.
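To illustrate why inter-iteration caching can reduce shuffled data, the following is a minimal in-memory sketch, not HaLoop's or Hadoop's API. It assumes a toy rank-propagation loop in which the edge list never changes between iterations; the abstract does not specify what HaLoop caches, so this invariant-input setup is an assumption. The function `run_iterations`, its parameters, and the sample graph are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_iterations(edges, ranks, num_iters, cache_invariant):
    """Toy iterative driver: returns (final_ranks, records_shuffled).

    edges is assumed to be loop-invariant; ranks changes every iteration.
    With cache_invariant=False, the edges are regrouped ("shuffled") in
    every iteration, as a chain of independent MapReduce jobs would do.
    """
    shuffled = 0
    adjacency = None  # stands in for a cache of the grouped, invariant edges
    for _ in range(num_iters):
        if adjacency is None or not cache_invariant:
            # Group the invariant edges by source node and count each record.
            adjacency = defaultdict(list)
            for src, dst in edges:
                adjacency[src].append(dst)
                shuffled += 1
        # The rank table is the only data that must move every iteration.
        shuffled += len(ranks)
        new_ranks = defaultdict(float)
        for src, rank in ranks.items():
            targets = adjacency.get(src, [])
            for dst in targets:
                # Split the node's rank evenly among its out-neighbors.
                new_ranks[dst] += rank / len(targets)
        ranks = dict(new_ranks)
    return ranks, shuffled


# Hypothetical 3-node graph in which every node has an incoming edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
start = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

naive_ranks, naive_shuffled = run_iterations(edges, start, 5, cache_invariant=False)
cached_ranks, cached_shuffled = run_iterations(edges, start, 5, cache_invariant=True)
print(naive_shuffled, cached_shuffled)
```

Both runs compute the same ranks; only the count of records moved differs, because the cached variant groups the edge list once instead of once per iteration. The counts are specific to this toy model and say nothing about the measurements reported in the paper.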
Cite this article
Bu, Y., Howe, B., Balazinska, M. et al. The HaLoop approach to large-scale iterative data analysis. The VLDB Journal 21, 169–190 (2012). https://doi.org/10.1007/s00778-012-0269-7
DOI: https://doi.org/10.1007/s00778-012-0269-7