Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model

  • Conference paper
Software Engineering Perspectives in Intelligent Systems (CoMeSySo 2020)

Abstract

Data comes from machines, transactions, and social media, and it is massive and disparate in nature. About 80% of today’s data is unstructured, while the remainder is semistructured or structured. Making efficient decisions at run time and storing such heterogeneous data with existing tools is a major challenge for management. Data harmonization can be used to solve the heterogeneity problem: its aim is to provide a uniform representation and remove all forms of heterogeneity from heterogeneous datasets. Recent studies have developed various models for integrating, mapping, and fusing structured and semistructured datasets, but no such model has been developed for structured, semistructured, and unstructured datasets together. Information extraction is a vital component for extracting data from textual datasets whose information may be held in different file formats, i.e., Excel, JSON, and text. To develop a textual data harmonization model, based on phrase similarity techniques, for heterogeneous datasets comprising structured, semistructured, and unstructured data, the data must first be preprocessed using Natural Language Processing techniques such as Bag of Phrases, Parts of Speech, and so on. This paper therefore focuses on a conceptual data harmonization model based on a text similarity technique, which will help to blend structured, semistructured, and unstructured data. The phrases selected from the heterogeneous datasets will go through training and testing using a Recurrent Neural Network.
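The preprocessing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract names a Recurrent Neural Network for phrase similarity, whereas this sketch substitutes plain Jaccard overlap between word bigrams so that it runs with the standard library only. The function names (`to_text`, `bag_of_phrases`, `jaccard`), the use of CSV as a stand-in for an Excel export, and the sample records are all assumptions made up for illustration.

```python
from __future__ import annotations

import csv
import io
import json
import re


def to_text(raw: str, kind: str) -> list[str]:
    """Flatten one source into plain-text records (one string per record)."""
    if kind == "csv":   # stand-in for a structured source such as an Excel export
        reader = csv.DictReader(io.StringIO(raw))
        return [" ".join(str(v) for v in row.values()) for row in reader]
    if kind == "json":  # semistructured source: a JSON array of flat objects
        return [" ".join(str(v) for v in obj.values()) for obj in json.loads(raw)]
    if kind == "text":  # unstructured source: one record per non-empty line
        return [line.strip() for line in raw.splitlines() if line.strip()]
    raise ValueError(f"unknown source kind: {kind}")


def bag_of_phrases(text: str, n: int = 2) -> set[str]:
    """Lowercase, tokenize on letters and digits, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set-overlap similarity in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample data, invented for this sketch.
csv_src = "id,item\nA1,stainless steel water bottle\n"
json_src = '[{"id": "B7", "item": "Stainless steel water bottle"}]'
text_src = "Customer asked about delivery times for garden furniture.\n"

records = to_text(csv_src, "csv") + to_text(json_src, "json") + to_text(text_src, "text")
bags = [bag_of_phrases(r) for r in records]

# Compare the structured record against the other two sources.
for other in (1, 2):
    print(f"record 0 vs record {other}: {jaccard(bags[0], bags[other]):.2f}")
```

Once every source is reduced to the same phrase representation, records from different formats become directly comparable. A learned similarity model such as the one the abstract proposes would take the place of `jaccard` here.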

Acknowledgments

This research was fully supported by the Ministry of Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS), Ref. No. FRGS/1/2018/ICT04/UTP/02/4.

Author information

Corresponding author

Correspondence to Ganesh Kumar.

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kumar, G., Basri, S., Imam, A.A., Balogun, A.O. (2020). Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model. In: Silhavy, R., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives in Intelligent Systems. CoMeSySo 2020. Advances in Intelligent Systems and Computing, vol 1294. Springer, Cham. https://doi.org/10.1007/978-3-030-63322-6_61
