Optimization Factor Analysis of Large-Scale Join Queries on Different Platforms

Yang, Chao; Wang, Qian; Yang, Qing; Zhang, Huibing; Zhang, Jingwei; Zhou, Ya

doi:10.1007/978-3-319-55705-2_3


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10179)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Popular big data computing platforms, such as Spark, provide a new computing paradigm for traditional database operations, such as queries. Besides their ability to manage large-scale data, big data platforms have earned a reputation for simple programming interfaces and good scale-out performance. Traditional databases, however, have intrinsic optimization mechanisms for fundamental operators, which support efficient and flexible data processing. A comprehensive comparison of the data processing performance of these two kinds of platforms is therefore valuable. In this paper, we focus on the join operation, a primary and frequently used operator in both databases and big data analysis. We design and conduct extensive experiments to test the performance of the two classic platforms under unified datasets and hardware, which reveal how factors such as the computing schema and storage media influence performance. Based on the experimental analysis, we also offer advice on choosing a computing platform for different application scenarios.

Acknowledgment

This study is supported by the National Natural Science Foundation of China (Nos. 61363005, 61462017, U1501252, 61662013), Guangxi Natural Science Foundation of China (Nos. 2014GXNSFAA118353, 2014GXNSFAA118390, 2014GXNSFDA118036), Guangxi Key Laboratory of Automatic Detection Technology and Instrument Foundation (YQ15110), Guangxi Cooperative Innovation Center of Cloud Computing and Big Data, and the High Level Innovation Team of Colleges and Universities in Guangxi and Outstanding Scholars Program Funding.

Author information

Authors and Affiliations

Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, China
Chao Yang, Qian Wang, Huibing Zhang, Jingwei Zhang & Ya Zhou
Guangxi Cooperative Innovation Center of Cloud Computing and Big Data, Guilin University of Electronic Technology, Guilin, 541004, China
Jingwei Zhang
Guangxi Key Laboratory of Automatic Measurement Technology and Instrument, Guilin University of Electronic Technology, Guilin, 541004, China
Qing Yang

Corresponding author

Correspondence to Jingwei Zhang.

Editor information

Editors and Affiliations

Royal Melbourne Institute of Technology, Melbourne, Australia
Zhifeng Bao
Northwestern University, Evanston, Illinois, USA
Goce Trajcevski
University of New South Wales, Sydney, New South Wales, Australia
Lijun Chang
The University of Queensland, Brisbane, Queensland, Australia
Wen Hua

About this paper

Cite this paper

Yang, C., Wang, Q., Yang, Q., Zhang, H., Zhang, J., Zhou, Y. (2017). Optimization Factor Analysis of Large-Scale Join Queries on Different Platforms. In: Bao, Z., Trajcevski, G., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10179. Springer, Cham. https://doi.org/10.1007/978-3-319-55705-2_3

DOI: https://doi.org/10.1007/978-3-319-55705-2_3
Published: 22 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55704-5
Online ISBN: 978-3-319-55705-2
eBook Packages: Computer Science, Computer Science (R0)
