Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model

Kumar, Ganesh; Basri, Shuib; Imam, Abdullahi Abubakar; Balogun, Abdullateef Oluwagbemiga

doi:10.1007/978-3-030-63322-6_61

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1294)

Included in the following conference series:

Proceedings of the Computational Methods in Systems and Software

Abstract

Data arrives from machines, transactions, and social media, and it is both massive and disparate in nature. About 80% of today’s data is unstructured, while the remainder is semistructured or structured. With existing tools, it is a major challenge for management to make efficient decisions at run time and to store data of such a heterogeneous nature. Data harmonization can address the heterogeneity problem: its aim is to provide a uniform representation and remove all forms of heterogeneity from heterogeneous datasets. Recent studies have developed various models for integrating, mapping, and fusing structured and semistructured datasets, but no such model has been developed for structured, semistructured, and unstructured datasets together. Information extraction is a vital component for extracting data from textual datasets, which may come in different file formats, e.g., Excel, JSON, and text. To develop a textual data harmonization model based on phrase similarity techniques for heterogeneous datasets comprising structured, semistructured, and unstructured data, the data must first be preprocessed using Natural Language Processing techniques such as Bag of Phrases and Parts of Speech. This paper therefore focuses on a conceptual data harmonization model based on a text similarity technique, which will help to blend structured, semistructured, and unstructured data. The phrases selected from the heterogeneous datasets will go through training and testing using a Recurrent Neural Network.
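
The idea of reducing structured, semistructured, and unstructured records to comparable bags of phrases can be illustrated with a minimal sketch. It is not the authors' model: word n-grams stand in for the Bag of Phrases and Parts of Speech preprocessing, plain cosine similarity stands in for the Recurrent Neural Network, and all records, field names, and function names below are hypothetical.

```python
import json
import math
import re
from collections import Counter


def flatten(value):
    """Yield every scalar in a nested dict/list structure as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from flatten(item)
    elif isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield str(value)


def bag_of_phrases(text, max_n=2):
    """Count word n-grams (n = 1..max_n), a crude stand-in for phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            bag[" ".join(words[i:i + n])] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two phrase-count bags."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical inputs: a spreadsheet-like row, a JSON document, free text.
structured_row = {"patient": "John Smith", "diagnosis": "type 2 diabetes"}
semistructured = json.loads(
    '{"record": {"name": "John Smith", "notes": ["type 2 diabetes", "metformin"]}}'
)
unstructured = "John Smith was diagnosed with type 2 diabetes last year."
unrelated = "Quarterly sales rose in the northern region."

# Step 1: give every source a uniform textual representation.
texts = {
    "structured": " ".join(flatten(structured_row)),
    "semistructured": " ".join(flatten(semistructured)),
    "unstructured": unstructured,
    "unrelated": unrelated,
}

# Step 2: convert each text to a bag of phrases and compare against one source.
bags = {name: bag_of_phrases(text) for name, text in texts.items()}
for name in ("semistructured", "unstructured", "unrelated"):
    print(name, round(cosine(bags["structured"], bags[name]), 3))
```

Joining field values with spaces produces some n-grams that span field boundaries; a fuller pipeline would extract phrases per field or with a parser before computing similarity.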

Acknowledgments

This research was fully supported by the Ministry of Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS), Ref. No. FRGS/1/2018/ICT04/UTP/02/4.

Author information

Authors and Affiliations

Computer and Information Science Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 32610, Perak, Malaysia
Ganesh Kumar, Shuib Basri, Abdullahi Abubakar Imam & Abdullateef Oluwagbemiga Balogun
Ahmadu Bello University, Zaria, Nigeria
Abdullahi Abubakar Imam
Department of Computer Science, University of Ilorin, Ilorin, 1515, Nigeria
Abdullateef Oluwagbemiga Balogun

Corresponding author

Correspondence to Ganesh Kumar.

Editor information

Editors and Affiliations

Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Radek Silhavy
Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Petr Silhavy
Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Zdenka Prokopova

Cite this paper

Kumar, G., Basri, S., Imam, A.A., Balogun, A.O. (2020). Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model. In: Silhavy, R., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives in Intelligent Systems. CoMeSySo 2020. Advances in Intelligent Systems and Computing, vol 1294. Springer, Cham. https://doi.org/10.1007/978-3-030-63322-6_61

DOI: https://doi.org/10.1007/978-3-030-63322-6_61
Published: 16 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63321-9
Online ISBN: 978-3-030-63322-6
eBook Packages: Intelligent Technologies and Robotics (R0)
