Abstract
Today’s one-pass analytics applications tend to be data-intensive and must process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets on a cluster of machines. However, the traditional MapReduce model is not well suited to one-pass analytics: it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before analytical queries can run. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique that extends such processing to workloads requiring a large key-state space. Evaluation of our Hadoop-based prototype on real-world workloads shows that the new platform significantly improves the progress of map tasks, allows reduce progress to keep up with map progress, reduces internal data spills by up to three orders of magnitude, and returns results continuously during the job.
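The idea of hash-based incremental processing with a frequent-key policy can be illustrated with a small sketch. This is not the platform's implementation: the function names `one_pass_sum` and `merge_spills`, the `capacity` parameter, and the Misra-Gries-style counter decrement used to decide which keys stay in memory are all assumptions chosen for illustration. The sketch keeps running per-key sums in a bounded in-memory hash table, so frequent keys are aggregated as tuples arrive, and writes everything else to a spill list that a later step merges.

```python
def one_pass_sum(pairs, capacity):
    """Aggregate (key, value) pairs in one pass with at most `capacity` keys in memory.

    Returns (in_memory, spilled): in_memory maps key -> running sum for keys
    currently held in the hash table; spilled is a list of (key, partial_sum)
    records for tuples and evicted states that did not stay in memory.
    """
    state = {}      # key -> running sum (the in-memory hash table)
    counters = {}   # key -> frequency estimate (Misra-Gries-style; an assumption)
    spilled = []    # stand-in for records written to disk

    for key, value in pairs:
        if key in state:
            # Key already tracked: update its state incrementally, no sorting needed.
            state[key] += value
            counters[key] += 1
        elif len(state) < capacity:
            # Free slot available: start tracking the new key.
            state[key] = value
            counters[key] = 1
        else:
            # Table full: spill the incoming tuple and decrement every counter;
            # keys whose counter reaches zero are evicted with their partial sum.
            spilled.append((key, value))
            for tracked in list(counters):
                counters[tracked] -= 1
                if counters[tracked] == 0:
                    spilled.append((tracked, state.pop(tracked)))
                    del counters[tracked]
    return state, spilled


def merge_spills(in_memory, spilled):
    """Combine in-memory sums with spilled partial sums into final totals."""
    totals = dict(in_memory)
    for key, partial in spilled:
        totals[key] = totals.get(key, 0) + partial
    return totals


if __name__ == "__main__":
    stream = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("d", 1), ("a", 1)]
    hot, cold = one_pass_sum(stream, capacity=2)
    print(hot, cold, merge_spills(hot, cold))
```

In this toy stream the frequent key `a` never leaves memory, so its total is available at any point during the pass, while the rare keys account for the only spilled records. A sort-merge design, by contrast, would have to buffer and sort tuples before any key's total could be produced.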
Index Terms
- SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce