Synonyms
SQL-on-Hadoop
Definition
Over the last decade, the database field has witnessed significant innovations and changes in enterprise data platforms. First came the wave of NoSQL systems, which provide high scalability, although sometimes at the expense of ACID transactions and declarative SQL processing. On the analytics side, Hadoop emerged as the platform for all analytics needs of the enterprise. Although Hadoop started with just the MapReduce processing framework and the Hadoop Distributed File System (HDFS), it evolved into a multi-framework environment, supporting MapReduce, Spark, Tez, and others. Such processing environments, where data can be accessed and manipulated by multiple processing frameworks, are frequently referred to as data lakes. Given the popularity of SQL and its widespread use in enterprise analytics tools, it soon became evident that SQL processing on data lakes is critical to this emerging enterprise data platform.
In this entry, we discuss SQL analytics...
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Özcan, F., Pandis, I. (2018). SQL Analytics on Big Data. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_80648
DOI: https://doi.org/10.1007/978-1-4614-8265-9_80648
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering