research-article
Artifacts Available / v1.1

Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB

Published: 01 June 2022

Abstract

Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Given the distinct features of various time series data, it is not surprising that different encoding strategies perform differently. In this study, we first summarize the features of time series data that may affect encoding performance, including scale, delta, repeat and increase. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, which prescribes the limits on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness with regard to various data features is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator for the aforesaid data features and several real-world datasets from our industrial partners. Finally, we present an extensive experimental evaluation using the benchmark. Notably, a quantitative analysis of encoding effectiveness with regard to various data features is conducted in Apache IoTDB.
