Abstract
Structured log data is a kind of append-only time-series data which grows rapidly as new entries are continuously generated and captured. It has become very popular in application domains such as Internet, sensor networks and telecommunications. In recent years, many systems have been developed to support batch analysis of such structured log data. But they often fail to meet the high throughput requirements of real-time log data ingestion and analytics. An efficient index is very important to accelerate log data analytics, and at the meanwhile to support high throughput data loading. This paper focuses on designing a specialized indexing scheme for real-time log data analytics. The solution adopts a dynamic global hash index to partition the tuples into hash buckets. Then the tuples in the hash buckets are sorted and buffered in the sort buffer queue. When the amount of data in the queue reaches a threshold, the data is packed into segments before spilling to the disks. Moreover, an intra-segment index is maintained by meta database. With such an indexing scheme, the database system achieves high throughput and real-time data loading and query performance. As shown in the experiments, the data loading throughput reaches 5 million tuples per second per node. The delay of data loading does not exceed 10 seconds, and a sub-second query performance is achieved for the given queries.
This work is supported by the Outstanding Innovative Talents Cultivation Funded Programs 2014 of Renmin University of China.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
http://docs.oracle.com/cd/b28359_01/server.111/b28313/indexes.htm#i1006549
https://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html
http://www.gbase.cn/comcontent_detail1/&i=30&comcontentid=30.html
http://www.oracle.com/technetwork/articles/sharma-indexes-093638.html
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. CACM 13(7), 422–426 (1970)
Boncz, P.A., Zukowski, M., Nes, N.: Monetdb/x100: Hyper-pipelining query execution. In: CIDR, vol. 5, pp. 225–237 (2005)
Chan, C.-Y., Ioannidis, Y.E.: Bitmap index design and evaluation. In: SIGMOD, vol. 27, pp. 355–366 (1998)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: A distributed storage system for structured data. In: OSDI (2006)
Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD, pp. 47–57 (1984)
He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In: ICDE (2011)
Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In: STOC, pp. 654–663 (1997)
Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. SIGOPS 44(2), 35–40 (2010)
Lehman, P.L., et al.: Efficient locking for concurrent operations on b-trees. TODS 6(4), 650–670 (1981)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: SIGMOD (2008)
Neil, P.O., Cheng, E., Gawlick, D., ONeil, E.: The log-structured merge-tree (lsm-tree). Acta Informatica 33(4), 351–385 (1996)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD (2009)
Ślźak, D., Wróblewski, J., Eastwood, V., Synak, P.: Brighthouse: an analytic data warehouse for ad-hoc queries. PVLDB 1(2), 1337–1345 (2008)
Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: Mapreduce and parallel dbmss: Friends or foes? CACM, 53(1), January 2010
Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: A column-oriented DBMS. In: VLDB, pp. 553–564 (2005)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using hadoop. In: ICDE (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bian, H., Chen, Y., Qin, X., Du, X. (2015). A Fast Data Ingestion and Indexing Scheme for Real-Time Log Analytics. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science(), vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_69
Download citation
DOI: https://doi.org/10.1007/978-3-319-25255-1_69
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25254-4
Online ISBN: 978-3-319-25255-1
eBook Packages: Computer ScienceComputer Science (R0)