Abstract
Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems, such as Apache IoTDB, InfluxDB and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Given the distinct features of different time series data, it is not surprising that encoding strategies differ in performance. In this study, we first summarize the features of time series data that may affect encoding performance, including scale, delta, repeat and increase. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, which constrains how encoding algorithms can be implemented in the system. A qualitative analysis of encoding effectiveness with respect to the various data features is then presented for the studied algorithms. We further develop a benchmark for evaluating encoding algorithms, consisting of a data generator that covers the aforesaid data features and several real-world datasets from our industrial partners. Finally, we present an extensive experimental evaluation using the benchmark. Notably, a quantitative analysis of encoding effectiveness with respect to the various data features is conducted in Apache IoTDB.
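To make the link between data features and encoding effectiveness concrete, the following is a minimal Python sketch. It is not the benchmark or any Apache IoTDB implementation. The feature proxies in `describe` are illustrative guesses at what scale, delta, repeat and increase might measure, since the abstract does not define them, and the three size estimators are simplified textbook versions of fixed-width, delta and run-length encoding. All function names are hypothetical.

```python
from itertools import groupby


def describe(values):
    """Illustrative proxies (assumptions, not the paper's definitions)
    for the four features named in the abstract."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    n = len(deltas)
    return {
        "scale": max(values) - min(values),               # value range
        "delta": sum(abs(d) for d in deltas) / n,         # mean absolute difference
        "repeat": sum(1 for d in deltas if d == 0) / n,   # share of consecutive repeats
        "increase": sum(1 for d in deltas if d > 0) / n,  # share of increasing steps
    }


def bits_needed(ints):
    """Fixed bit width (one sign bit included) that fits every integer in ints."""
    return max(abs(v) for v in ints).bit_length() + 1


def plain_size(values):
    """Size in bits when every value is stored at one fixed width."""
    return len(values) * bits_needed(values)


def delta_size(values):
    """Size in bits when the first value and then fixed-width differences are stored."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return bits_needed(values[:1]) + len(deltas) * bits_needed(deltas)


def rle_size(values):
    """Size in bits when (value, run length) pairs are stored at fixed widths."""
    runs = [(v, len(list(g))) for v, g in groupby(values)]
    value_bits = bits_needed([v for v, _ in runs])
    count_bits = bits_needed([c for _, c in runs])
    return len(runs) * (value_bits + count_bits)


# Two small hand-made series: one large in scale but small in delta,
# the other dominated by repeated values.
smooth = [1000, 1001, 1003, 1004, 1006, 1007, 1009, 1010]
repeated = [7, 7, 7, 7, 7, 7, 9, 9]

for name, series in (("smooth", smooth), ("repeated", repeated)):
    print(name, describe(series),
          plain_size(series), delta_size(series), rle_size(series))
```

On these two toy series, the delta estimate is smallest for the slowly changing data and the run-length estimate is smallest for the repetitive data, which is the kind of feature-dependent behavior the study sets out to analyze. The sketch ignores per-block metadata and variable-width packing, so the sizes are only indicative.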
Index Terms
- Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB