Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 852)

Abstract

Scalable web systems are closely tied to the distributed storage systems used to process large amounts of data (big data). One example of such a system is Hadoop, with its many extensions supporting data storage, such as SQL-on-Hadoop systems and the “Parquet” file format. Another class of systems for storing and processing big data is NoSQL databases, such as HBase, which are used in applications requiring fast random access. The Kudu system was created to combine the advantages of Hadoop and HBase and to enable both efficient analysis of data sets and fast random access. The research consisted of a performance analysis of these systems. The experiment was conducted in the Amazon Web Services public cloud environment, where a cluster of nine virtual machines was configured. A fragment of the “Wikipedia Page Traffic Statistics” public dataset containing about one billion rows was used for the study. The results of the measurements confirm that the Kudu system is a promising alternative to the commonly used technologies.

Corresponding author

Correspondence to Ziemowit Nowak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Oleś, D., Nowak, Z. (2019). The Performance Analysis of Distributed Storage Systems Used in Scalable Web Systems. In: Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 39th International Conference on Information Systems Architecture and Technology – ISAT 2018. ISAT 2018. Advances in Intelligent Systems and Computing, vol 852. Springer, Cham. https://doi.org/10.1007/978-3-319-99981-4_27
