A Systematic Review of SQL-on-Hadoop by Using Compact Data Formats

There are massive amounts of data generated by the IoT, online transactions, click streams, emails, logs, posts, social networking interactions, sensors, mobile phones and their applications, etc. The question is where and how to store these data in order to provide faster data access. Understanding and handling Big Data is a big challenge. The research direction in Big Data projects using Hadoop technology, MapReduce-style frameworks and compact data formats such as RCFile, SequenceFile, ORC, Avro and Parquet shows that only two of these data formats (Avro and Parquet) support both schema evolution and compression, and thus utilize less storage space. In this paper, a systematic review of SQL-on-Hadoop using the compact data formats Avro and Parquet has been performed over the past six years (2010–2015). Following the defined search strategy, 94 research papers have been identified, of which 17 have been analyzed as relevant. This work outlines the usage of the Avro and Parquet data formats in publications of conference proceedings, journals and magazines in IEEE Xplore, the ACM Digital Library and ScienceDirect. At the end of the review, the conclusion is drawn that a direct comparison of Avro and Parquet by compactness and speed does not exist in the literature.


I. INTRODUCTION
The amount of data captured by social media, the Internet of Things, enterprises and different types of applications is growing exponentially. There are huge volumes of raw data every day, but these data do not yield much information until processed. As a result of processing, raw data sometimes end up in a database, which makes the data accessible for further processing and analysis in a number of different ways.
For distributed and real-time processing of large data sets, the so-called Big Data, traditional computing techniques are becoming insufficient [6], [11], [23], [32]. Hadoop is one of the most common open-source Big Data frameworks in the industry today, capable of carrying out common Big Data related tasks. There is a growing business demand for the usage of Hadoop technology in Big Data analysis (storage, biological data, road, traffic, travel and tourism, telecommunication, enterprise data, citizens' information) [21]. In addition, Hadoop technology is becoming popular in such areas as cloud computing and Internet data management (storage, load balancing), in implementing MapReduce algorithms to solve various problems of handling large amounts of data, and in proposing new models that use HDFS [23].
Often raw data are stored in specific text formats, for instance JSON, CSV or XML. These formats allow data to be structured and available for humans to read and edit in the most convenient manner. However, storing raw data as plain text has a significant drawback: the disk space needed to store such files. For a Big Data cluster powered by Hadoop the problem is even bigger because of the replication of each data block within the Hadoop Distributed File System (HDFS). For instance, the recommended HDFS replication factor is 3, which means that each raw data block is replicated three times across the data nodes. Thus it is crucial to select an appropriate data format that utilizes HDFS storage space more efficiently for the task at hand. Secondly, the data storage format may impact the speed of data processing with Hadoop tools such as Hive. Several binary data storage formats exist, among them RCFile, ORC, Avro and Parquet. These formats are designed for systems that use MapReduce-style frameworks; each is a systematic combination of multiple components, including the data storage layout, data compression, and optimization techniques for data reading. This article investigates the research direction in Big Data projects using Hadoop technology, MapReduce-style frameworks and compact data formats such as Avro and Parquet, and answers the research questions: what is known about the differences in performance (query execution time) between the compact data formats Avro and Parquet, and which data format (Avro or Parquet) is more compact?
It is performed as a small-scale literature review. However, within the scope defined for this article, for instance the chosen search strategy and the selected time period, it can be considered a complete systematic literature review.
The systematic review is carried out by identification of research, selection of studies by various authors, deciding upon the inclusion and exclusion criteria, and analyzing the number of publications in this domain during the period from 2010 to 2015. This paper limits its scope to publications in the IEEE Digital Library (IEEE Xplore), the ACM Digital Library and ScienceDirect.
II. BACKGROUND

Hadoop technology is commonly used to manage Big Data projects. Hadoop is now the de facto standard for storing and processing big data, not only unstructured but also some structured data [7]. The Hadoop Distributed File System (HDFS) is designed to reliably store very large data sets and to stream those data sets at high bandwidth to user applications [24]. As a result, providing SQL analysis functionality for the big data residing in HDFS becomes more and more important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS [7]. This review does not focus only on Hive. Other SQL-on-Hadoop systems, such as Hortonworks Stinger or Cloudera Impala, are acceptable too, provided that the tests and comparisons of performance are based on queries selected or derived from well-known benchmarks such as TPC-H or TPC-DS.
There is another sphere of binary data storage format utilization: direct data sources. For instance, service data gathered from mobile phones can yield specific insights into people's behavior or serve to create other kinds of location intelligence reports. Assuming that a GPS data packet (timestamp, longitude and latitude) is 100 bytes on average and that a smartphone generates one every 8 seconds, quick calculations result in 0.043 MB/h, 1.03 MB/day and 376 MB/year per device. In 2014, over 1.2 billion smartphones were sold¹. If 1 billion devices produce a GPS data packet every 8 seconds, the result is about 1 PB/day, i.e., roughly 1,000 disk drives of 1 TB each are needed per day to store these data. The volume of data is enormous. The question is where and how to store these data in order to provide faster execution of data queries. This is the main rationale for this review.
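The storage arithmetic above can be checked with a short calculation. The sketch below (Python; it assumes the 100-byte packet and 8-second interval stated above and uses binary megabytes, which the quoted figures imply) also shows what the HDFS replication factor of 3 adds on top:

```python
MB = 1024 ** 2                          # binary megabytes, matching the figures above

packet_bytes = 100                      # assumed average GPS packet (timestamp, lon, lat)
interval_s = 8                          # one packet every 8 seconds

per_hour = packet_bytes * 3600 / interval_s / MB    # ~0.043 MB/h
per_day = per_hour * 24                             # ~1.03 MB/day
per_year = per_day * 365                            # ~376 MB/year

devices = 10 ** 9                       # one billion smartphones
fleet_pb_day = per_day * devices * MB / 1024 ** 5   # ~1 PB/day before replication
print(f"{per_hour:.3f} MB/h, {per_day:.2f} MB/day, {per_year:.0f} MB/year per device")
print(f"fleet: ~{fleet_pb_day:.2f} PB/day; with HDFS replication factor 3: "
      f"~{3 * fleet_pb_day:.1f} PB/day")
```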
The data storage formats mentioned in the Introduction have their advantages and disadvantages. As shown in Table 1, only the Avro and Parquet data formats support both important advantages: schema evolution and compression.
Avro relies on schemas. A schema defines the structure of the data and is used in the data reading and writing process. The data schema is defined in JSON and stored in the Avro file during the writing process. When Avro data are read, the schema used for writing is always present. This allows data to be written with no per-value overheads.
Avro can be used to pack many small files into a single Avro file in HDFS in order to reduce namenode memory usage: user-defined patterns and the specific data are encoded into a binary sequence and stored in one large container file [33].
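A minimal sketch of this write/read cycle, using the third-party fastavro library (the GpsPacket record and its fields are illustrative assumptions, not taken from the reviewed papers):

```python
from fastavro import parse_schema, reader, writer

# Hypothetical schema for the GPS packets discussed above, defined in JSON form.
schema = parse_schema({
    "type": "record",
    "name": "GpsPacket",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "longitude", "type": "double"},
        {"name": "latitude", "type": "double"},
    ],
})

records = [{"timestamp": 1420070400, "longitude": 24.105, "latitude": 56.946}]

# The schema travels in the file header, so the values carry no per-value overhead.
with open("packets.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")   # block-level compression

with open("packets.avro", "rb") as fh:
    for record in reader(fh):                       # schema is read back from the file
        print(record)
```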
Parquet [19] is a column-based storage format optimized for work with multi-column datasets. Parquet use cases typically involve a subset of those columns rather than entire records. Among the most often cited advantages of columnar data organization are data compression [27] and reduced disk I/O [1], which improve the performance of analytical queries [10]. Data compression algorithms perform better on data with low information entropy (high data value locality). Moreover, a system that operates on compressed data achieves the I/O performance benefits of compression without paying the CPU cost of decompression [1]. The layout of Parquet data files is optimized for queries that process large volumes of data.
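The column-oriented behavior described above can be illustrated with the third-party pyarrow library (a sketch; the file and column names are illustrative). Reading a subset of columns touches only the corresponding compressed column chunks instead of whole records:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "timestamp": [1420070400, 1420070408, 1420070416],
    "longitude": [24.105, 24.106, 24.107],
    "latitude":  [56.946, 56.946, 56.947],
})

# Each column chunk is encoded and compressed independently.
pq.write_table(table, "packets.parquet", compression="snappy")

# A scan-style query materializes every column ...
full = pq.read_table("packets.parquet")
# ... while an analytical query reads only the columns it needs,
# skipping the remaining column chunks on disk entirely.
subset = pq.read_table("packets.parquet", columns=["longitude"])
print(full.num_columns, subset.num_columns)         # 3 1
```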
The information stated above is what is already known about compact data formats such as Avro and Parquet. The choice of Avro and Parquet for deeper investigation is thus motivated by the need to examine these formats using various queries (scan, aggregation and join) from a well-known benchmark such as TPC-H, and to test the assumption that the row-oriented data access supported by Avro should provide better performance on scan queries, e.g., when all columns are of interest for processing, while the column-oriented Parquet format should provide better performance on column-oriented queries, e.g., when only a specific set of columns is selected.
Considering that short background information (the rationale for the survey) has been given in this section and the Introduction, the rest of the paper is organized as follows. Section III discusses the research methodology used to extract the relevant data for the systematic review. Section IV comprises the results, followed by the conclusions in Section V.

III. RESEARCH METHODOLOGY
This study has been undertaken as a systematic literature review (SLR) based upon the guidelines established for the Software Engineering domain [15]. In this section, the protocol used in the SLR is described, the research questions and their components are specified, and the requirements regarding the source and primary study selection, the evidence collection and the method of synthesis of such evidence are established. The results regarding each step are provided alongside the protocol, except the summary, which is addressed in Section IV.
In accordance with the guidelines [15], the following steps have been performed in order to conduct this research:
• Defining the objective and the research questions that the review is intended to answer.
• Defining the search strategy to be used to find primary studies, including the terms and resources to be searched.
• Selection of primary studies: individual studies contributing to a systematic review are called primary studies. The goal of this step is to find numerous primary studies related to the selected domain.
• Piloting the selection criteria on a subset of primary studies in order to determine which studies are relevant and which should be excluded from the systematic review. There are several inclusion/exclusion criteria to be considered:
o relevance of the topic;
o relevance of the subjects;
o context;
o publication venue [25].
In addition, it is important to develop a quality checklist in order to assess the individual studies.
• Assessment of quality. In order to evaluate the quality of the collected data, it is necessary to determine the strength of each individual research paper and to define more detailed inclusion/exclusion criteria. The previously developed and filled-in quality checklist can help to assess the quality of each individual research paper.
• Extraction of relevant data. It is important to define how the information required from each primary study is obtained.
• Data synthesis. It is necessary to extract data from the primary studies in order to answer the research questions, to tabulate the data in a consistent manner, and to determine whether the results formulated from the extracted data are consistent with each other.

A. Objective and research questions
In the context of the information given in the Introduction and Background sections of this article, it is crucial to select an appropriate data format that reduces the HDFS storage space and improves the speed of data processing with Hadoop tools such as Hive. The objective of this work is to perform a systematic literature review in order to answer the following research questions:
RQ.1: What are the differences in performance (query execution time) between the compact data formats Avro and Parquet?
RQ.2: Which data format (Avro or Parquet) is more compact?

B. Search strategy
To answer the research questions, a search strategy has been defined and an extensive search for research papers has been conducted. During data retrieval, the boundaries of the systematic review have been set. The search strategy includes the definition of the search scope by research keywords, search strings and sources.
Research keywords have been chosen based on the research questions. Synonyms of the keywords have not been considered because a term like "Hadoop" is a unique general term that can only be supplemented with related terms such as "Big Data", "HDFS", "MapReduce", "Hive", or specific data formats like "RCFile", "SequenceFile", "ORC", "Avro", "Parquet", etc. The term "Hadoop" has been predefined based on the names defined by the Hadoop developers on the Apache Hadoop website⁷. The final search string has been based on the experience from pilot searches, starting with the broadest search, by the term "Hadoop", for which the IEEE Xplore Digital Library's Full Text & Metadata search returns 6,348 articles, and ending with a narrow search, by the term "RCFile", which returns only 4 results. Sometimes search strings have to be adapted to the specific needs of digital libraries, but that has not been necessary in this case. The search string used to obtain the initial results of this review consists of a Boolean expression: ((Hadoop OR HDFS) AND (Avro OR Parquet)). The operator OR has been used in the Boolean expression in order to extend the list of results and retrieve articles where the Avro or Parquet data format is mentioned independently of the other. The same approach has been applied to the terms Hadoop and HDFS, because in the context of data compactness and storage some authors, for instance [33], use the term HDFS instead of Hadoop.
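As a sketch of how the Boolean expression screens candidate metadata (the titles below are hypothetical; the word-boundary check around "HDFS" is one way to avoid the "HDF surface" false match described in Section C):

```python
import re

def matches(text: str) -> bool:
    """((Hadoop OR HDFS) AND (Avro OR Parquet)) over title/abstract text."""
    t = text.lower()
    platform = "hadoop" in t or re.search(r"\bhdfs\b", t) is not None
    data_format = "avro" in t or "parquet" in t
    return platform and data_format

print(matches("Packing small files into Avro containers on HDFS"))  # True
print(matches("Restoration of parquet wood flooring"))              # False: no Hadoop/HDFS
```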
The criteria used to select sources of studies have been defined as follows:
• Must have a web search mechanism;
• Search mechanisms must allow customized searches by title and abstract (preferably full text);
• Must support the search using Boolean expressions;
• The abstract of the paper should be available for free. Abstract unavailability is the main reason why SpringerLink has not been chosen as an acceptable source during the first step (selection of primary studies);
• Full articles must be available for download under the existing contracts between the University of Latvia and the digital library; Google Scholar is also acceptable;
• The digital libraries should index papers on the Big Data topic written in English. Thus, the search strategy limits the search to papers written in English.
With the search string defined, the following digital libraries have been chosen as sources: IEEE Xplore Digital Library, ACM Digital Library and ScienceDirect. There are two additional search criteria: the item type and the publication period. The searched items are limited to journal articles and conference papers, and the publication period is set from 2010 through 2015 inclusive, in order to cover the six years of the most active Hadoop development [21].

C. Selection of primary studies
The search query presented in Section B has been used to retrieve the candidate articles from the digital library systems for the time period 2010–2015. As shown in Fig. 1, the first step of the selection process is based on this query.
The search in the digital libraries results in a total of 94 candidate papers: 78 from IEEE, only 1 from ACM, and 15 from ScienceDirect. As shown in Table 2 and Fig. 3, Hadoop technology has drawn the interest of researchers over the past six years. We can clearly see from Fig. 3 that the number of published research papers has increased exponentially from 2010 to 2015.

Fig. 1. The paper selection process consists of four steps:
Step 1. Execute the search query in all the sources, gathering the results (N=94).
Step 2. Exclude duplicated and irrelevant papers by reviewing the title, keywords and abstract of each paper (N=84).
Step 3. Exclude irrelevant papers by analyzing the introduction and conclusions (N=27).
Step 4. Exclude irrelevant papers by reading and analyzing the full text (N=17).

In the first step, the search results from all digital libraries have been gathered by using the citation download or export function. The search results have thus been obtained in CSV or another delimiter-separated format and imported into Excel. Subsequently, the results from all three digital libraries have been summarized in one sheet with common data fields. All the relevant studies used for this review are presented in an Excel file available at Dropbox⁸. Several parameters have been defined for future analysis and documented for each retrieved article in the summary sheet, among them: Patent Citation Count, Reference Count, Source (IEEE, ACM or ScienceDirect), Number of pages, Journal Name (for journal articles), and Country. In the second step, an Excel macro has been developed in order to perform faster screening and abstract text zooming from the delimiter-separated metadata. Then, the title, keywords and abstract of all papers have been reviewed. As a result, only 84 papers have been kept as relevant. Two of the 10 excluded papers have been recognized as irrelevant because the term "Parquet" had been used in the context of wood; moreover, the ScienceDirect search by the term "HDFS" had not been precise, matching "HDF surface" and ignoring the letter "S". Although presented at different conferences, two other papers ([29] and [30]) have been recognized as very similar and devoted to the NoSQL topic.
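The first two steps above amount to merging the per-library exports and screening them by keywords; an equivalent of the Excel-based workflow can be sketched in Python with pandas (the file and column names are hypothetical, since each library exports slightly different CSV fields):

```python
import pandas as pd

# Step 1: gather the exported citation files from the three libraries.
frames = [pd.read_csv(name) for name in ("ieee.csv", "acm.csv", "sciencedirect.csv")]
papers = pd.concat(frames, ignore_index=True)

# Step 2: drop duplicates, then screen the title, keywords and abstract.
papers = papers.drop_duplicates(subset=["title"])
text = (papers["title"] + " " + papers["keywords"] + " " + papers["abstract"]).str.lower()
relevant = papers[text.str.contains("hadoop|hdfs") & text.str.contains("avro|parquet")]
relevant.to_csv("candidates_after_step2.csv", index=False)
```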
In the third step, the remaining 84 papers have been analyzed individually to confirm their relevance in the context of the review. To select or discard papers, inclusion and exclusion criteria have been defined as follows: the abstract, introduction and conclusions of the paper should address topics such as storage space utilization, Hive, SQL, HDFS, data formats (Avro or Parquet), compactness measurements, performance measurements, or queries. A checklist covering these selection criteria has been developed by supplementing the previously created Excel file. Information about the publication venue (country) has been analyzed in this step as well.
In the fourth step, full-text reading has been performed for the remaining 27 articles. The quality checklist has been created and filled in during this step. The full texts of the articles have been found by using Google Scholar.

D. The assessment of relevance and data synthesis
Table 2 and Fig. 3 depict the data synthesis for the years 2010 to 2015. It can be clearly seen that the number of retrieved studies has increased over the past six years: 2 studies in 2010, 2 in 2011, 7 in 2012, 12 in 2013, 27 in 2014, and 44 in 2015. After applying the selection criteria, only 17 papers have been selected for data extraction and analysis: 1 study in 2013, 3 in 2014, and 13 in 2015. The selection criteria are based on the quality assessment performed with the help of the quality checklist. As shown in Table 3, the relevance criteria are based on the research questions, e.g., whether the article is about Avro, Parquet or both formats, and whether it provides a comparison by compactness or by performance based on queries selected or derived from well-known benchmarks like TPC-H or TPC-DS. As shown in Table 3, only 7 studies are based on a well-known benchmark such as TPC-H or TPC-DS. As shown in Fig. 2, most of the selected primary studies originate from the USA.

E. Analytics and conclusions
Why have the selected articles been considered relevant or excluded from further analysis? To answer this question, a short insight into each article is useful. Biookaghazadeh et al. [4] work with NetCDF, a self-describing data format that is not supported by existing big-data systems. In this article, four types of queries are defined and executed on the raw storage formats CSV and NetCDF. The experimental results obtained from typical queries on a geoscience dataset show that the introduced approach substantially outperforms the traditional CSV-based approach. The authors mention only the Parquet format, in the context of the need to improve support for scientific data formats such as NetCDF and HDF in big-data systems.
Cejka et al. [5] from Siemens AG compare the file sizes of four different serialization formats: Java serialization, Protocol Buffers, Thrift and Avro. Avro's results show that it is much slower in write speed, but much faster in read speed, than Protocol Buffers and Thrift. The file compression in Apache Avro is the best of the four. In order to evaluate the retrieval time of entries, the authors' own benchmark is used to retrieve data from databases such as Storacle, H2 and MongoDB. The Parquet format is not analyzed in this paper. Luckow et al. [16] compare different queries derived from the TPC-DS and TPC-HS benchmarks, executed on Hive/Text, Hive/ORC, Hive/Parquet, Spark/ORC and Spark/Parquet. Hive/Parquet shows better execution times than Spark/Parquet. Select, aggregate and join queries are executed on a comparable infrastructure, Hive/Spark versus an RDBMS. Generally, the RDBMS can outperform Hive and Spark; however, both deliver solid performance at a lower cost. The Avro format is not analyzed here.

Mammo and Srividya
Grover et al. [11] focus on benchmarking multiple SQL-like big data technologies over the Hadoop distributed file system (HDFS) for the Study Data Tabulation Model (SDTM) used in clinical trial databases, with the aim of improving the efficiency of research in clinical trials. The benchmark proposed in this paper provides an overview of the capabilities of SQL-on-Hadoop platforms such as Hive, Presto, Drill and Spark. The authors mention the Avro and Parquet formats, but do not analyze them in any kind of comparison. Only the Parquet format is mentioned in the future work section, as lightweight and fast with a columnar layout that can significantly boost I/O performance. Dong et al. [9] introduce the Record-aware Compression (RaC) scheme, which makes the compressed contents splittable, uses a lightweight Hadoop RecordReader, and preserves the parallelism and data locality properties as much as possible. In general, RaC can be used with other analytic platforms such as Spark and with higher-level abstractions of MapReduce such as Hive. In the evaluation, the authors show that using RaC can greatly reduce the data loading time and the required system memory. More importantly, the authors observe that the time spent on decompressing data in memory is trivial compared to the time required for loading data from persistent storage into memory. The experimental results lead the authors to believe that content-aware and data-specific compression is very promising in big data processing and analysis. However, there is no direct comparison of the Avro and Parquet data formats from the SQL point of view.
Zhang et al. [34] introduce Alovera, a fast stream processing system for large-scale data. Alovera can easily serialize records to HDFS by using Avro. The authors show that the record-oriented data need nearly half the time to be uncompressed when Avro is used to serialize the data stored in a columnar format, and that it is efficient to de-serialize the data. The Parquet format is not analyzed in this paper.
Zhou et al. [35] explore a Workload-Aware Column Order solution, WACO, to boost the scan operator over a wide table. Although this article does not investigate Avro, the authors implement the WACO solution on the Parquet data format on top of Hadoop 2.0. They conduct extensive experiments with the real-world TPC-H benchmark and the SDSS dataset, simulating a wide table, to demonstrate the superiority of their solution. The experimental results show that the approach is 2x faster than the state of the art.
Haynes et al. [12] introduce Terra Populus, which acts as a bridge between big data sources and researchers. Researchers are provided with convenient web applications that allow them to access, analyze and tabulate different datasets on a common platform. Terra Populus' Tabulator application employs Parquet on Spark to build dynamic queries for analyzing large population survey data. The authors note that Parquet allows greater compression per data type; Parquet usage gives a high compression ratio while still allowing fast data fetching. However, the Avro format is not analyzed here.
Pirzadeh et al. [20] report on an evaluation of four representative Big Data systems (MongoDB, Hive, AsterixDB, and a commercial parallel shared-nothing relational database system) using a micro-benchmark called BigFUN. Parquet is used in the benchmarking, while Avro is not mentioned at all. Floratou et al. [10] compare three analytical job execution environments available in the Hadoop ecosystem: Hive on MapReduce, Hive on Tez, and Impala are analyzed by using the well-known TPC-H benchmark. As a result, the authors confirm that Impala performs better than both versions of Hive. Although the authors mention Parquet and Avro, they do not analyze those formats in any kind of comparison. Son et al. [26] propose a novel column-store method called SSFile for Hadoop-based distributed systems. SSFile increases the actual amount of data processed per task and supports representative columnar execution techniques for efficient query processing. Through experiments, the authors show that SSFile significantly improves the performance of distributed processing. An Avro schema is used in creating and benchmarking SSFile, while the Parquet format is not mentioned at all. Furthermore, the authors use only a few queries from the TPC-H benchmark and do not justify this choice.
Yan and Yuan [31] build another TPC-DS-based benchmark by removing columnar optimization; they name it TPC-DS2, optimize the resource utilization, and maintain fairness among different types of queries. The authors present a price-based algorithm that achieves the optimization objective, implementing the algorithm in the open-source Impala system and conducting a set of experiments in a cluster environment using the TPC-DS workload. Experimental results show that the coordinated resource management solution can increase the aggregate utility by at least 15.4% compared with a simple fair resource share mechanism, and by 63.5% compared with the FIFO resource management mechanism. This work demonstrates a significant advantage of the Parquet format. The Avro format is not mentioned in this paper. Choi et al. [8] compare the CSV and Parquet file formats on MicroBricks and x86 clusters. The authors carry out the TPC-H benchmark by means of an open-source distributed SQL engine in Hadoop on both architectures. The experimental results are promising for MicroBricks computing: the query response times of the MicroBricks architecture outperform those of a commodity cluster without hurting the innate advantages of the MicroBricks cluster architecture. The Avro format is not analyzed in this paper.

IV. RESULTS
This section presents the results of the analysis of the relevant studies and the answers to the research questions. Although the field is in its earliest stages, there is clear evidence of increasing interest in big data studies, Hadoop technology, HDFS, and compact, fast, binary data formats. The aim of giving a synthesized overview of the trend of research publications on Hadoop technology and answering the research questions is reached by a detailed analysis of the relevant studies.
The analysis of the extracted data and the initially retrieved studies shows that Hadoop technology and compact data formats have drawn the interest of various researchers over the past six years. As shown in Fig. 3, the number of research papers published in conference proceedings, journals and magazines has increased significantly from 2010 to 2015. The growth follows an exponential trend, which implies a doubled number of articles by the end of 2016. The prognosis for 2016 is calculated by applying an exponential trend to the line shown in Fig. 3. As shown in Table 3, only 1 paper ([PS14]) focuses on both data formats (Avro and Parquet), while 7 papers use a TPC benchmark (as an indirect quality criterion) for one of the formats, mostly Parquet. This might be due to the row-based (Avro) and column-based (Parquet) data format specifics limiting comparison. However, it does not satisfy the business demand for knowledge about a comparison of both data formats. This is the gap for future research.
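The 2016 prognosis can be reproduced by fitting a straight line to the logarithm of the yearly counts reported in Section III.D (a sketch; only numpy is assumed):

```python
import numpy as np

years = np.array([2010, 2011, 2012, 2013, 2014, 2015])
counts = np.array([2, 2, 7, 12, 27, 44])        # retrieved papers per year (Table 2)

# Exponential trend: log(count) = a*year + b  <=>  count = e**b * (e**a)**year
a, b = np.polyfit(years, np.log(counts), 1)
print(f"growth factor ~{np.exp(a):.2f}x per year")          # close to doubling yearly
print(f"2016 forecast ~{np.exp(a * 2016 + b):.0f} papers")  # roughly double the 2015 count
```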
The only paper ([PS14]) that focuses on both data formats (Avro and Parquet) compares the row- and column-based data formats only from the compression point of view. The other papers do not provide a clear and reliable answer to the research question about the differences in performance (query execution time) between the compact data formats Avro and Parquet. Most of the 17 papers address only one format (Avro or Parquet), not both.

V. SUMMARY AND FINAL CONCLUSIONS
In this section, far from an exhaustive overview, some final conclusions are given while avoiding repetition.
There is a significant gap and a need for additional experiments and studies in order to answer the research questions about the Parquet and Avro formats. None of the 17 studies focuses directly on comparing the two binary data storage formats, Parquet and Avro, because of the design specifics of both. Parquet, as stated in the official documentation [19], is a column-oriented data storage format; thus, it should provide better performance on column-oriented queries, e.g., when only a specific set of columns is selected. As a counterpart, the Avro format is designed for row-oriented data access, e.g., when all columns are of interest for processing.
In this review, 17 papers have been studied in order to provide evidence of the popularity of Hadoop technology and the necessity of developing fast, compact, binary data formats. A high diversity of Hadoop technologies and data formats in use has been noticed. It has been a very time-consuming process to classify all the studies, extract the relevant information, assess validity and reliability, develop checklists, and draw conclusions at the end. Due to abstract unavailability, SpringerLink has been dropped from the list of digital libraries. Therefore, this review can be extended in the direction of covering studies from SpringerLink.


Fig. 3. Importance of topic by years

TABLE 1. COMPARISON OF DATA FILE FORMATS

TABLE 2. SEARCH RESULTS

TABLE 3. RELEVANCE CHECKLIST FOR SELECTED PRIMARY STUDIES
Fig. 2. Countries of origin represented in the selected primary studies