The HaLoop approach to large-scale iterative data analysis

  • Special Issue Paper
The VLDB Journal

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. (This is an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative Data Processing on Large Clusters,” PVLDB 3(1):285–296, 2010.) HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets. Compared with Hadoop, HaLoop improved runtimes by a factor of 1.85 on average and shuffled only 4% as much data between mappers and reducers in the applications that we tested.
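
To make the idea of inter-iteration caching concrete, the following is a minimal sketch in plain Python, not HaLoop's or Hadoop's API. It simulates a PageRank-style loop and counts how many records would be shuffled to reducers when the loop-invariant link table is re-sent every iteration versus when it is kept after the first iteration. The graph `LINKS`, the function `run`, and the `cache_invariant` flag are all hypothetical names chosen for illustration, and the counts say nothing about the measurements reported in the paper.

```python
from collections import defaultdict

# Hypothetical toy input: a tiny link graph. It never changes across
# iterations, so it is the loop-invariant part of the computation.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def run(iterations, cache_invariant):
    """Run a toy PageRank-style loop and count records shuffled to reducers."""
    ranks = {page: 1.0 / len(LINKS) for page in LINKS}  # loop-variant data
    shuffled = 0
    cached_links = None  # stands in for a reducer-side cache

    for _ in range(iterations):
        # Rank records change every iteration, so they are always shuffled.
        shuffled += len(ranks)

        # Link records are invariant: without a cache they are shuffled
        # again in every iteration; with one, only in the first iteration.
        if cache_invariant and cached_links is not None:
            links = cached_links
        else:
            links = LINKS
            shuffled += sum(len(targets) for targets in LINKS.values())
            if cache_invariant:
                cached_links = links

        # Reduce step: join each rank with its out-links and split it evenly.
        new_ranks = defaultdict(float)
        for page, rank in ranks.items():
            for target in links[page]:
                new_ranks[target] += rank / len(links[page])
        ranks = dict(new_ranks)

    return ranks, shuffled


if __name__ == "__main__":
    ranks_plain, shuffled_plain = run(5, cache_invariant=False)
    ranks_cached, shuffled_cached = run(5, cache_invariant=True)
    print(ranks_plain == ranks_cached, shuffled_plain, shuffled_cached)
```

Both runs compute the same ranks; only the amount of simulated shuffle traffic differs, which is the effect a loop-aware scheduler would need to preserve by placing each iteration's tasks where the cached data already lives.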

References

  1. Abouzeid A., Bajda-Pawlikowski K., Abadi D.J., Rasin A., Silberschatz A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1), 922–933 (2009)

  2. Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J.M., Sears, R.: Boom analytics: exploring data-centric, declarative programming for the cloud. In: EuroSys, pp. 223–236 (2010)

  3. Ananthanarayanan, G., Agarwal, S., Kandula, S., Greenberg, A.G., Stoica, I., Harlan, D., Harris, E.: Scarlett: coping with skewed content popularity in MapReduce clusters. In: EuroSys, pp. 287–300 (2011)

  4. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Disk-locality in datacenter computing considered irrelevant. In: HotOS (2011)

  5. Bancilhon, F., Ramakrishnan, R.: An amateur’s introduction to recursive query processing strategies. In: SIGMOD Conference, pp. 16–52 (1986)

  6. Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: SoCC, pp. 119–130 (2010)

  7. Borkar, V., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: ICDE Conference (2011)

  8. Bu Y., Howe B., Balazinska M., Ernst M.: HaLoop: efficient iterative data processing on large clusters. PVLDB 3(1), 285–296 (2010)

  9. Chaiken R., Jenkins B., Larson P., Ramsey B., Shakib D., Weaver S., Zhou J.: Scope: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)

  10. Cluster Exploratory (CluE) program. http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm (2010). Accessed 7 July 2010

  11. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: NSDI 2010 (2010)

  12. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)

  13. DeWitt D.J., Gray J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)

  14. DeWitt D.J., Paulson E., Robinson E., Naughton J.F., Royalty J., Shankar S., Krioukov A.: Clustera: an integrated computation and data management system. PVLDB 1(1), 28–41 (2008)

  15. Dittrich J., Quiané-Ruiz J.-A., Jindal A., Kargin Y., Setty V., Schad J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)

  16. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative MapReduce. In: HPDC, pp. 810–818 (2010)

  17. Hadoop. http://hadoop.apache.org/ (2010). Accessed 7 July 2010

  18. Hagan M.T., Demuth H.B., Beale M.H.: Neural Network Design. PWS Publishing, Boston (1996)

  19. HDFS. http://hadoop.apache.org/common/docs/current/hdfs_design.html (2010). Accessed 7 July 2010

  20. Hive. http://hadoop.apache.org/hive/ (2010). Accessed 7 July 2010

  21. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, pp. 59–72 (2007)

  22. Jain A.K., Murty M.N., Flynn P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)

  23. Kleinberg J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)

  24. Loo, B.T., Condie, T., Hellerstein, J.M., Maniatis, P., Roscoe, T., Stoica, I.: Implementing declarative overlays. In: SOSP, pp. 75–90 (2005)

  25. Mahout. http://lucene.apache.org/mahout/ (2010). Accessed 7 July 2010

  26. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD Conference, pp. 135–146 (2010)

  27. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: SIGMETRICS, pp. 50–60 (2005)

  28. Morton, K., Balazinska, M., Grossman, D.: ParaTimer: a progress indicator for MapReduce DAGs. In: SIGMOD Conference, pp. 507–518 (2010)

  29. Olston, C., Chiou, G., Chitnis, L., Liu, F., Han, Y., Larsson, M., Neumann, A., Rao, V.B.N., Sankarasubramanian, V., Seth, S., Tian, C., ZiCornell, T., Wang, X.: Nova: continuous Pig/Hadoop workflows. In: SIGMOD Conference, pp. 1081–1090 (2011)

  30. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD Conference, pp. 1099–1110 (2008)

  31. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999)

  32. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD Conference, pp. 165–178 (2009)

  33. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: OSDI (2010)

  34. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD Conference, pp. 495–506 (2010)

  35. Wasserman S., Faust K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994)

  36. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009)

  37. Wiley, K., Connolly, A., Krughoff, S., Gardner, J., Balazinska, M., Howe, B., Kwon, Y., Bu, Y.: Astronomical image processing with Hadoop. In: Gabriel, C. (ed.) Astronomical Data Analysis Software and Systems (2010)

  38. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)

  39. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley, July (2011)

  40. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)

  41. Zhang W., Wang K., Chau S.-C.: Data partition and parallel evaluation of datalog programs. IEEE Trans. Knowl. Data Eng. 7(1), 163–176 (1995)

Author information

Corresponding author

Correspondence to Yingyi Bu.

About this article

Cite this article

Bu, Y., Howe, B., Balazinska, M. et al. The HaLoop approach to large-scale iterative data analysis. The VLDB Journal 21, 169–190 (2012). https://doi.org/10.1007/s00778-012-0269-7

  • DOI: https://doi.org/10.1007/s00778-012-0269-7
