research-article

SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

Published: 01 December 2012

Abstract

Today’s one-pass analytics applications tend to be data-intensive and must process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well suited to one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before analytical queries can run. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, reduces internal data spills by up to 3 orders of magnitude, and enables results to be returned continuously during the job.
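
The abstract names hash-based in-memory processing and a frequent-key technique without giving their details. As a rough illustration only, and not the SCALLA implementation, the Python sketch below shows one way such an idea could look: a hash table holds per-key partial sums, a Misra-Gries-style counter decides which keys stay resident when memory is full, and everything else goes to a spill list that stands in for disk. The class name `FrequentKeyAggregator`, the `capacity` parameter, the sum aggregate, and the eviction policy are all assumptions made for this example.

```python
from collections import defaultdict


class FrequentKeyAggregator:
    """Hypothetical sketch: incremental sum per key with bounded memory."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of keys held in memory
        self.table = {}           # key -> [frequency counter, partial sum]
        self.spilled = []         # (key, partial sum) pairs; stands in for disk

    def add(self, key, value):
        entry = self.table.get(key)
        if entry is not None:
            # Resident key: update its state in place, no sorting or spilling.
            entry[0] += 1
            entry[1] += value
        elif len(self.table) < self.capacity:
            # Free slot: start tracking this key in memory.
            self.table[key] = [1, value]
        else:
            # Table full: spill the incoming tuple, then apply a
            # Misra-Gries-style decrement and evict counters that hit zero.
            self.spilled.append((key, value))
            for k in list(self.table):
                self.table[k][0] -= 1
                if self.table[k][0] == 0:
                    self.spilled.append((k, self.table.pop(k)[1]))

    def result(self):
        # Merge resident state with spilled partial sums for exact totals.
        totals = defaultdict(int)
        for k, (_, partial) in self.table.items():
            totals[k] += partial
        for k, partial in self.spilled:
            totals[k] += partial
        return dict(totals)


# Skewed toy stream: two hot keys and twenty keys that occur once each.
stream = [("a", 1)] * 50 + [("b", 1)] * 30 + [(f"rare{i}", 1) for i in range(20)]
agg = FrequentKeyAggregator(capacity=2)
for k, v in stream:
    agg.add(k, v)
totals = agg.result()
```

In this toy stream the two hot keys stay in memory for the whole run, so only the tuples for rare keys reach the spill list, while the merged totals remain exact. A real system would spill to disk in batches and support aggregates other than sums.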

    • Published in

      ACM Transactions on Database Systems  Volume 37, Issue 4
      December 2012
      345 pages
      ISSN: 0362-5915
      EISSN: 1557-4644
      DOI: 10.1145/2389241

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 December 2012
      • Accepted: 1 September 2012
      • Revised: 1 August 2012
      • Received: 1 October 2011

      Qualifiers

      • research-article
      • Research
      • Refereed
