ABSTRACT
In recent years, real-time processing and analytics systems for big data, in the context of Business Intelligence (BI), have received growing attention. Traditional BI platforms that perform regular updates on a daily, weekly, or monthly basis are no longer adequate for fast-changing business environments. However, due to the nature of big data, achieving real-time capability with traditional technologies has become a challenge. MapReduce, a recent distributed computing technology, provides off-the-shelf high scalability that can significantly shorten the processing time for big data. Its open-source implementation, Hadoop, has become the de facto standard for processing big data; however, Hadoop is limited in its support for real-time updates. Improvements to Hadoop's real-time capability, as well as alternative real-time frameworks, have been emerging in recent years. This paper presents a survey of the open-source technologies that support big data processing in a real-time or near-real-time fashion, including their system architectures and platforms.
Index Terms
- Survey of real-time processing systems for big data