The HaLoop approach to large-scale iterative data analysis

Bu, Yingyi; Howe, Bill; Balazinska, Magdalena; Ernst, Michael D.

doi:10.1007/s00778-012-0269-7

Special Issue Paper
Published: 14 March 2012

Volume 21, pages 169–190, (2012)
The VLDB Journal

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets. Compared with Hadoop, on average, HaLoop improved runtimes by a factor of 1.85 and shuffled only 4% as much data between mappers and reducers in the applications that we tested. This paper is an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative Data Processing on Large Clusters” (PVLDB 3(1):285–296, 2010).
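
To make the idea of inter-iteration caching concrete, the following is a minimal single-process sketch in Python. It is not HaLoop's API or implementation: the function names (`partition_links`, `pagerank_step`, `run`), the toy graph, and the way "shuffled records" are counted are all assumptions made for illustration. It models a PageRank-style web-ranking loop in which the link table never changes between iterations, and compares how many records would cross the mapper-to-reducer boundary when that loop-invariant table is re-shuffled every iteration versus partitioned once and reused.

```python
from collections import defaultdict


def partition_links(edges):
    """Stand-in for shuffling the loop-invariant link table to the reducers."""
    links_by_src = defaultdict(list)
    for src, dest in edges:
        links_by_src[src].append(dest)
    return dict(links_by_src)


def pagerank_step(links_by_src, ranks, damping=0.85):
    """One iteration: spread each page's rank evenly over its out-links."""
    incoming = defaultdict(float)
    for src, dests in links_by_src.items():
        share = ranks[src] / len(dests)
        for dest in dests:
            incoming[dest] += share
    n = len(ranks)
    return {page: (1 - damping) / n + damping * incoming[page] for page in ranks}


def run(edges, iterations, cache_invariant):
    """Run the loop and count records moved between 'mappers' and 'reducers'."""
    pages = sorted({page for edge in edges for page in edge})
    ranks = {page: 1.0 / len(pages) for page in pages}
    shuffled = 0
    cached_links = None
    for _ in range(iterations):
        if cached_links is None or not cache_invariant:
            # Without caching, the unchanged link table is shuffled again.
            cached_links = partition_links(edges)
            shuffled += len(edges)
        # The rank table changes every iteration, so it always moves.
        shuffled += len(ranks)
        ranks = pagerank_step(cached_links, ranks)
    return ranks, shuffled


# Hypothetical toy graph: every page has at least one out-link.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
plain_ranks, plain_shuffled = run(edges, 10, cache_invariant=False)
cached_ranks, cached_shuffled = run(edges, 10, cache_invariant=True)
print(plain_shuffled, cached_shuffled)  # 70 34
```

Both runs compute identical ranks; only the amount of data moved differs. The counts here come from a four-edge toy graph and are not meant to reproduce the measurements reported in the abstract. The savings in a real system depend on how large the invariant data is relative to the data that changes per iteration, and on a scheduler that places tasks where the cached partitions already reside.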

References

Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Rasin, A., Silberschatz, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1), 922–933 (2009)
Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J.M., Sears, R.: Boom analytics: exploring data-centric, declarative programming for the cloud. In: EuroSys, pp. 223–236 (2010)
Ananthanarayanan, G., Agarwal, S., Kandula, S., Greenberg, A.G., Stoica, I., Harlan, D., Harris, E.: Scarlett: coping with skewed content popularity in mapreduce clusters. In: EuroSys, pp. 287–300 (2011)
Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Disk-locality in datacenter computing considered irrelevant. In: HotOS (2011)
Bancilhon, F., Ramakrishnan, R.: An amateur’s introduction to recursive query processing strategies. In: SIGMOD Conference, pp. 16–52 (1986)
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: SoCC, pp. 119–130 (2010)
Borkar, V., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: ICDE Conference (2011)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.: HaLoop: efficient iterative data processing on large clusters. PVLDB 3(1), 285–296 (2010)
Chaiken, R., Jenkins, B., Larson, P., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: Scope: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)
Cluster Exploratory (CluE) program. http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm (2010). Accessed 7 July 2010
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: NSDI 2010 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)
DeWitt, D.J., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)
DeWitt, D.J., Paulson, E., Robinson, E., Naughton, J.F., Royalty, J., Shankar, S., Krioukov, A.: Clustera: an integrated computation and data management system. PVLDB 1(1), 28–41 (2008)
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: HPDC, pp. 810–818 (2010)
Hadoop. http://hadoop.apache.org/ (2010). Accessed 7 July 2010
Hagan, M.T., Demuth, H.B., Beale, M.H.: Neural Network Design. PWS Publishing, Boston (1996)
Hdfs. http://hadoop.apache.org/common/docs/current/hdfs_design.html (2010). Accessed 7 July 2010
Hive. http://hadoop.apache.org/hive/ (2010). Accessed 7 July 2010
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, pp. 59–72 (2007)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
Loo, B.T., Condie, T., Hellerstein, J.M., Maniatis, P., Roscoe, T., Stoica, I.: Implementing declarative overlays. In: SOSP, pp. 75–90 (2005)
Mahout. http://lucene.apache.org/mahout/ (2010). Accessed 7 July 2010
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD Conference, pp. 135–146 (2010)
Moore, A.W., Zuev, D.: Internet traffic classification using bayesian analysis techniques. In: SIGMETRICS, pp. 50–60 (2005)
Morton, K., Balazinska, M., Grossman, D.: ParaTimer: a progress indicator for MapReduce DAGs. In: SIGMOD Conference, pp. 507–518 (2010)
Olston, C., Chiou, G., Chitnis, L., Liu, F., Han, Y., Larsson, M., Neumann, A., Rao, V.B.N., Sankarasubramanian, V., Seth, S., Tian, C., ZiCornell, T., Wang, X.: Nova: continuous pig/hadoop workflows. In: SIGMOD Conference, pp. 1081–1090 (2011)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: SIGMOD Conference, pp. 1099–1110 (2008)
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD Conference, pp. 165–178 (2009)
Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI (2010)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD Conference, pp. 495–506 (2010)
Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009)
Wiley, K., Connolly, A., Krughoff, S., Gardner, J., Balazinska, M., Howe, B., Kwon, Y., Bu, Y.: Astronomical image processing with hadoop. In: Gabriel, C. (ed.) Astronomical Data Analysis Software and Systems (2010)
Zaharia, M., Borthakur, D., Sarma, J.Sen, Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley, July (2011)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
Zhang, W., Wang, K., Chau, S.-C.: Data partition and parallel evaluation of datalog programs. IEEE Trans. Knowl. Data Eng. 7(1), 163–176 (1995)

Author information

Authors and Affiliations

University of California-Irvine, Irvine, CA, 92697, USA
Yingyi Bu
University of Washington, Seattle, WA, 98195, USA
Bill Howe, Magdalena Balazinska & Michael D. Ernst

Corresponding author

Correspondence to Yingyi Bu.

About this article

Cite this article

Bu, Y., Howe, B., Balazinska, M. et al. The HaLoop approach to large-scale iterative data analysis. The VLDB Journal 21, 169–190 (2012). https://doi.org/10.1007/s00778-012-0269-7

Received: 24 February 2011
Revised: 24 January 2012
Accepted: 28 February 2012
Published: 14 March 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s00778-012-0269-7
