Abstract
Popular big data computing platforms, such as Spark, provide a new computing paradigm for traditional database operations such as queries. Besides their ability to manage large-scale data, big data platforms have earned a reputation for simple programming interfaces and good scale-out performance. Traditional databases, however, have intrinsic optimization mechanisms for fundamental operators, which support efficient and flexible data processing. A comprehensive comparison of the data processing performance of these two kinds of platforms is therefore valuable. In this paper, we focus on the join operation, a primary and frequently used operator in both databases and big data analysis. We design and conduct extensive experiments to test the performance of the two classic platforms under unified datasets and hardware, revealing how factors such as computing schema and storage media influence performance. Based on the experimental analysis, we also offer advice on choosing a computing platform for different application scenarios.
Acknowledgment
This study is supported by the National Natural Science Foundation of China (Nos. 61363005, 61462017, U1501252, 61662013), Guangxi Natural Science Foundation of China (Nos. 2014GXNSFAA118353, 2014GXNSFAA118390, 2014GXNSFDA118036), Guangxi Key Laboratory of Automatic Detection Technology and Instrument Foundation (YQ15110), Guangxi Cooperative Innovation Center of Cloud Computing and Big Data, and the High Level Innovation Team of Colleges and Universities in Guangxi and Outstanding Scholars Program Funding.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Yang, C., Wang, Q., Yang, Q., Zhang, H., Zhang, J., Zhou, Y. (2017). Optimization Factor Analysis of Large-Scale Join Queries on Different Platforms. In: Bao, Z., Trajcevski, G., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science(), vol 10179. Springer, Cham. https://doi.org/10.1007/978-3-319-55705-2_3
DOI: https://doi.org/10.1007/978-3-319-55705-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55704-5
Online ISBN: 978-3-319-55705-2
eBook Packages: Computer Science, Computer Science (R0)