Abstract
Data generated by machines, transactions, and social media is enormous in volume and disparate in nature. About 80% of today’s data is unstructured, while the remainder is semistructured or structured. With existing tools, it is a major challenge for management to make efficient decisions at run time and to store such heterogeneous data. Data harmonization can address the heterogeneity problem: its aim is to provide a uniform representation and remove all forms of heterogeneity from heterogeneous datasets. In recent studies, various models have been developed for integrating, mapping, and fusing structured and semistructured datasets, but no such model has been developed for structured, semistructured, and unstructured datasets together. Information extraction is a vital component for extracting data from textual datasets, which may come in different file formats, e.g., Excel, JSON, and text. To develop a textual data harmonization model based on phrase similarity techniques for heterogeneous datasets comprising structured, semistructured, and unstructured data, the data must first be preprocessed using Natural Language Processing techniques such as Bag of Phrases and Part-of-Speech tagging. This paper therefore focuses on a conceptual data harmonization model based on a text similarity technique, which will help to blend structured, semistructured, and unstructured data. The phrases selected from the heterogeneous datasets will go through training and testing using a Recurrent Neural Network.
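The preprocessing and phrase-similarity steps described above can be illustrated with a minimal sketch. This is not the authors' model: the toy records, the helper names (`bag_of_phrases`, `jaccard`, `flatten`), and the use of Jaccard overlap in place of a trained Recurrent Neural Network are all assumptions made for illustration only.

```python
import json
import re


def bag_of_phrases(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) in lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    phrases = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrases.add(" ".join(tokens[i:i + n]))
    return phrases


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flatten(record):
    """Flatten a (possibly nested) dict into one string of keys and values."""
    parts = []
    for key, value in record.items():
        if isinstance(value, dict):
            parts.append(key)
            parts.append(flatten(value))
        else:
            parts.append(f"{key} {value}")
    return " ".join(parts)


# Hypothetical toy inputs standing in for an Excel row, a JSON document,
# and a free-text note. They are not taken from the paper.
structured_row = {"patient name": "Ali Khan", "blood pressure": "high"}
semi_structured = json.loads(
    '{"patient": {"name": "Ali Khan"}, "vitals": {"blood pressure": "high"}}'
)
unstructured = "The weather forecast predicts heavy rain tomorrow."

# Uniform representation: every source becomes a bag of phrases.
bags = {
    "structured": bag_of_phrases(flatten(structured_row)),
    "semistructured": bag_of_phrases(flatten(semi_structured)),
    "unstructured": bag_of_phrases(unstructured),
}

# Assumption: simple set overlap stands in for the learned similarity model.
sim_related = jaccard(bags["structured"], bags["semistructured"])
sim_unrelated = jaccard(bags["structured"], bags["unstructured"])
print(f"structured vs semistructured: {sim_related:.2f}")
print(f"structured vs unstructured:   {sim_unrelated:.2f}")
```

Once every source is reduced to the same phrase-level representation, records from different formats become directly comparable. In the model the abstract proposes, the selected phrases would then be used to train and test a Recurrent Neural Network rather than being compared by set overlap.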
Acknowledgments
This research was fully supported by the Ministry of Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS), Ref. No. FRGS/1/2018/ICT04/UTP/02/4.
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kumar, G., Basri, S., Imam, A.A., Balogun, A.O. (2020). Data Harmonization for Heterogeneous Datasets in Big Data - A Conceptual Model. In: Silhavy, R., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives in Intelligent Systems. CoMeSySo 2020. Advances in Intelligent Systems and Computing, vol 1294. Springer, Cham. https://doi.org/10.1007/978-3-030-63322-6_61
DOI: https://doi.org/10.1007/978-3-030-63322-6_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63321-9
Online ISBN: 978-3-030-63322-6
eBook Packages: Intelligent Technologies and Robotics (R0)