Abstract
Enterprises are increasingly using Apache Hadoop, and more specifically HDFS, as a central repository for all their data, which comes from various sources, including operational systems, social media and the web, sensors and smart devices, as well as their applications. At the same time, many enterprise data management tools (e.g., from SAP ERP and SAS to Tableau) rely on SQL, and many enterprise users are familiar and comfortable with SQL. As a result, SQL processing over Hadoop data has gained significant traction in recent years, and the number of systems that provide such capability has grown considerably. In this tutorial, we use the term SQL-on-Hadoop to refer to systems that provide some level of declarative SQL(-like) processing over HDFS and NoSQL data sources, using architectures that include computational or storage engines compatible with Apache Hadoop.