research-article

SQL-on-Hadoop systems: tutorial

Authors:
Daniel Abadi (Yale University),
Shivnath Babu (Duke University),
Fatma Özcan (IBM Research, Almaden),
Ippokratis Pandis (Cloudera)

Proceedings of the VLDB Endowment, Volume 8, Issue 12, pp. 2050–2051. https://doi.org/10.14778/2824032.2824137

Published: 01 August 2015

Abstract

Enterprises are increasingly using Apache Hadoop, and more specifically HDFS, as a central repository for all their data, which comes from various sources including operational systems, social media and the web, sensors and smart devices, as well as their applications. At the same time, many enterprise data management tools (e.g., from SAP ERP and SAS to Tableau) rely on SQL, and many enterprise users are familiar and comfortable with SQL. As a result, SQL processing over Hadoop data has gained significant traction in recent years, and the number of systems that provide such capability has increased significantly. In this tutorial we use the term SQL-on-Hadoop to refer to systems that provide some level of declarative SQL(-like) processing over HDFS and NoSQL data sources, using architectures that include computational or storage engines compatible with Apache Hadoop.

Index Terms

SQL-on-Hadoop systems: tutorial
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015, 728 pages
ISSN: 2150-8097
Publisher: VLDB Endowment