research-article

SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

Authors:
Boduo Li (University of Massachusetts Amherst)
Edward Mazur (University of Massachusetts Amherst)
Yanlei Diao (University of Massachusetts Amherst)
Andrew McGregor (University of Massachusetts Amherst)
Prashant Shenoy (University of Massachusetts Amherst)

ACM Transactions on Database Systems, Volume 37, Issue 4, Article No. 27, pp. 1–43. https://doi.org/10.1145/2389241.2389246

Published: 01 December 2012

Abstract

Today’s one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that the new platform significantly improves the progress of map tasks, allows reduce progress to keep up with map progress, reduces internal data spills by up to three orders of magnitude, and enables results to be returned continuously during the job.
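
The abstract names two ideas without giving their mechanics: hash-based in-memory processing in place of sort-merge, and a frequent-key technique for workloads whose key-state space exceeds memory. The following is only a minimal sketch of how such a combination might look, not the article's implementation. The class `FrequentKeyAggregator`, its `capacity` parameter, the use of a Misra-Gries-style counter to decide which keys stay in memory, and the in-memory `spilled` list standing in for disk spill files are all assumptions made for illustration.

```python
from collections import defaultdict


class FrequentKeyAggregator:
    """Incremental per-key sum with a bounded in-memory hash table.

    Hypothetical illustration: keys that look frequent (judged by
    Misra-Gries-style counters) keep their state in memory, while
    tuples for other keys are spilled. The `spilled` list is a
    stand-in for on-disk spill files.
    """

    def __init__(self, capacity):
        self.capacity = capacity  # max number of keys held in memory
        self.counters = {}        # key -> Misra-Gries-style counter
        self.state = {}           # key -> running aggregate (a sum here)
        self.spilled = []         # (key, partial value) pairs

    def update(self, key, value):
        if key in self.state:
            # Hash hit: fold the value into the in-memory state at once.
            self.state[key] += value
            self.counters[key] += 1
        elif len(self.state) < self.capacity:
            # Free slot: start tracking this key in memory.
            self.state[key] = value
            self.counters[key] = 1
        else:
            # Table full: spill the new tuple and decrement every counter;
            # keys whose counter reaches zero are evicted with their partial sum.
            self.spilled.append((key, value))
            for k in list(self.counters):
                self.counters[k] -= 1
                if self.counters[k] == 0:
                    self.spilled.append((k, self.state.pop(k)))
                    del self.counters[k]

    def snapshot(self):
        """Early answers for in-memory keys (partial if a key was ever spilled)."""
        return dict(self.state)

    def finalize(self):
        """Merge spilled partials with in-memory state into exact totals."""
        totals = defaultdict(int)
        for k, v in self.spilled:
            totals[k] += v
        for k, v in self.state.items():
            totals[k] += v
        return dict(totals)


agg = FrequentKeyAggregator(capacity=2)
for key, value in [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]:
    agg.update(key, value)
print(agg.snapshot())
print(agg.finalize())
```

In this toy stream the frequent key `a` never leaves memory, so its running total is available at any point through `snapshot()`, while `finalize()` merges the spilled partials to give exact totals of 3 for `a`, 2 for `b`, and 1 for `c`. No sorting is needed at any step, which is the property the abstract contrasts with sort-merge.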

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines


Published in

ACM Transactions on Database Systems Volume 37, Issue 4
December 2012
345 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/2389241

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2012
- Accepted: 1 September 2012
- Revised: 1 August 2012
- Received: 1 October 2011
Author Tags
- Parallel processing
- incremental computation
- one-pass analytics
Qualifiers
- research-article
- Research
- Refereed

