research-article

Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB

Authors:
Jinzhao Xiao (Tsinghua University)
Yuxiang Huang (Tsinghua University)
Changyu Hu (Tsinghua University)
Shaoxu Song (Tsinghua University)
Xiangdong Huang (Tsinghua University)
Jianmin Wang (Tsinghua University)

Proceedings of the VLDB Endowment, Volume 15, Issue 10, pp. 2148–2160
https://doi.org/10.14778/3547305.3547319

Published: 01 June 2022

Abstract

Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Given the distinct features of different time series data, it is not surprising that different encoding strategies perform differently. In this study, we first summarize the features of time series data that may affect encoding performance, including scale, delta, repeat, and increase. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, which prescribes the limits on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness with regard to the various data features is then presented for the studied algorithms. To evaluate the encoding algorithms, we develop a benchmark that includes a data generator based on the aforesaid data features and several real-world datasets from our industrial partners. Finally, we present an extensive experimental evaluation using the benchmark. Notably, a quantitative analysis of encoding effectiveness with regard to the various data features is conducted in Apache IoTDB.
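As a rough illustration of why features such as delta and repeat can affect encoding effectiveness, the following minimal Python sketch applies two generic textbook techniques, delta encoding and run-length encoding, to a toy series. It is not the implementation used in Apache IoTDB or in the paper's benchmark. The function names (`delta_encode`, `run_length_encode`, `bit_width`) and the sample data are hypothetical, and the bit-width measure is a simplified stand-in for a real bit-packing scheme.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store the difference between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def bit_width(values):
    """Bits per value for naive fixed-width packing (magnitude plus a sign bit)."""
    if not values:
        return 0
    return max(abs(v) for v in values).bit_length() + 1


# Hypothetical, regularly sampled timestamps: large scale, small constant delta.
timestamps = [1000, 1010, 1020, 1030, 1040]

deltas = delta_encode(timestamps)
print(deltas)                         # [1000, 10, 10, 10, 10]

# The raw values need wider fields than the deltas that follow the first value.
print(bit_width(timestamps))          # 12
print(bit_width(deltas[1:]))          # 5

# Constant deltas form a single run, so run-length encoding shrinks them further.
print(run_length_encode(deltas[1:]))  # [(10, 4)]
```

In this toy case the series has a large scale but small, repeated deltas, so combining the two techniques is effective. A series with large, irregular deltas and few repeats would gain little from either, which is the kind of feature-dependent behaviour the study sets out to analyse.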


Index Terms

Information systems > Data management systems > Data structures > Data layout > Data compression

Index terms have been assigned to the content through auto-classification.


Published in
Proceedings of the VLDB Endowment, Volume 15, Issue 10
June 2022
319 pages
ISSN: 2150-8097
Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)
Publisher: VLDB Endowment
Badges: Artifacts Available / v1.1
