Distributed real-time ETL architecture for unstructured big data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Real-time extract-transform-load (ETL) is integral to the growing demand for faster business decisions across a large number of modern applications. Because of the volume and velocity of the data, extracting multi-source unstructured data streams and transforming them with disk data in a distributed environment are the building blocks of real-time ETL. Designing an architecture for these building blocks therefore remains a major challenge. In this paper, we focus primarily on expediting stream-disk joins during the transformation phase of ETL; this join is considered the most expensive operator in stream processing because of frequent disk access. We propose an architecture for real-time ETL that ingests unstructured data streams from multiple sources, without having to worry about the structure of the data sources, and transforms them after joining with distributed disk data. We also present a novel data pipeline stream-disk join that uses partition-based input and a best-effort in-memory database technique to reduce frequent disk access. The proposed architecture addresses the challenges of stream data loss, ignored unmatched streams, disk overhead, and real-time processing in a distributed environment. Experimental results obtained using a stream generator and real-world datasets on local and distributed machines show that the proposed architecture yields significantly improved throughput, especially for large numbers of stream tuples with large datasets.
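The idea of a stream-disk join with partition-based input and a best-effort in-memory store can be illustrated with a minimal sketch. This is not the join algorithm proposed in the paper, whose details are not given in the abstract. It assumes a plain Python dict as a stand-in for the distributed disk data, an LRU-bounded cache as the "best-effort" in-memory layer, and hash partitioning on a join key named `id`. All class, function, and field names here are hypothetical.

```python
import zlib
from collections import OrderedDict, defaultdict


class BestEffortCache:
    """Bounded in-memory copy of disk rows (LRU eviction is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def get(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return self.rows[key]
        return None

    def put(self, key, row):
        self.rows[key] = row
        self.rows.move_to_end(key)
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict least recently used row


def partition(stream_tuples, num_partitions, key="id"):
    """Group stream tuples by a stable hash of the join key."""
    parts = defaultdict(list)
    for t in stream_tuples:
        index = zlib.crc32(str(t[key]).encode()) % num_partitions
        parts[index].append(t)
    return parts


def stream_disk_join(stream_tuples, disk, cache, num_partitions=2, key="id"):
    """Join stream tuples with disk rows, reading disk only on a cache miss.

    `disk` is a dict standing in for the disk-resident data (hypothetical).
    Unmatched tuples are returned separately instead of being dropped.
    """
    joined, unmatched = [], []
    stats = {"disk_reads": 0}
    for _, part in sorted(partition(stream_tuples, num_partitions, key).items()):
        for t in part:
            k = t[key]
            row = cache.get(k)
            if row is None:
                stats["disk_reads"] += 1  # cache miss: fall back to disk
                row = disk.get(k)
                if row is not None:
                    cache.put(k, row)
            if row is None:
                unmatched.append(t)
            else:
                joined.append({**row, **t})  # enrich tuple with disk attributes
    return joined, unmatched, stats
```

In this sketch, repeated keys in the stream are served from memory after the first disk read, which is the effect the abstract attributes to the in-memory technique. Returning unmatched tuples separately reflects the stated goal of not ignoring them. How the actual architecture partitions input and populates memory may differ.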

Notes

  1. http://deepyeti.ucsd.edu/jianmo/amazon/index.html.

  2. https://www.mongodb.com/cloud/atlas.

Author information

Corresponding author

Correspondence to Erum Mehmood.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Erum Mehmood was involved in conceptualizing the idea and designing the solution architecture. Erum Mehmood and Tayyaba Anees contributed to the literature review, implementation, experimentation, and paper revisions. Tayyaba Anees was involved in overall supervision and discussion of results.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mehmood, E., Anees, T. Distributed real-time ETL architecture for unstructured big data. Knowl Inf Syst 64, 3419–3445 (2022). https://doi.org/10.1007/s10115-022-01757-7

  • DOI: https://doi.org/10.1007/s10115-022-01757-7
