Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features

Wen, Jingran; Gouripeddi, Ramkiran; Facelli, Julio C.

doi:10.1007/978-981-10-6451-7_8

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 449)

Abstract

Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify biomedical dataset characteristics for metadata discovery. Four common types of biomedical data sources were included: genetic variants, protein structures, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identify the four data sources with average F1 scores ranging from 0.935 to 1.000. This study demonstrates that biomedical data of different types have different distributions of token-based document structural features and that such structural features can be leveraged for metadata discovery.
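
As a rough illustration of what token-based structural features might look like, the sketch below computes the share of tokens in a few simple categories for a piece of text. The categories, the function name `token_features`, and the two sample strings are assumptions made for illustration only; the abstract does not list the features actually used in the chapter.

```python
import re

# Hypothetical token categories (assumed for illustration; the abstract
# does not specify the chapter's actual feature set).
TOKEN_PATTERNS = {
    "numeric": re.compile(r"^[+-]?\d+(\.\d+)?$"),
    "alphabetic": re.compile(r"^[A-Za-z]+$"),
    "uppercase": re.compile(r"^[A-Z]+$"),
}


def token_features(text):
    """Return the fraction of whitespace-delimited tokens in each category,
    plus the mean token length."""
    tokens = text.split()
    total = len(tokens)
    features = {}
    for name, pattern in TOKEN_PATTERNS.items():
        hits = sum(1 for token in tokens if pattern.match(token))
        features[name] = hits / total if total else 0.0
    features["mean_token_length"] = (
        sum(len(token) for token in tokens) / total if total else 0.0
    )
    return features


# Made-up sample lines: one shaped like a protein structure record,
# one taken from ordinary prose.
structure_like = "ATOM 1 N MET A 1 27.340 24.430 2.614"
prose = "Metadata discovery is the process of recognizing semantics"

print(token_features(structure_like))
print(token_features(prose))
```

Feature dictionaries of this kind could then be turned into numeric vectors and passed to a decision tree learner, for example scikit-learn's `DecisionTreeClassifier`; whether the chapter's pipeline follows this exact shape is not stated in the abstract.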

Acknowledgments

bioCADDIE is supported by the National Institutes of Health (NIH) through the NIH Big Data to Knowledge, Grant 1U24AI117966-01. OpenFurther has received support from NCATS UL1TR001067, 3UL1RR025764-02S2, AHRQ R01 HS019862, DHHS 1D1BRH20425, U54EB021973, UU Research Foundation, NIBIB, NIH U54EB021973. Computer resources were provided by the University of Utah Center for High Performance Computing.

Author information

Authors and Affiliations

Department of Biomedical Informatics, The University of Utah, Salt Lake City, UT, 84108, USA
Jingran Wen, Ramkiran Gouripeddi & Julio C. Facelli
Center for Clinical and Translational Science, The University of Utah, Salt Lake City, UT, 84108, USA
Ramkiran Gouripeddi & Julio C. Facelli

Corresponding author

Correspondence to Julio C. Facelli.

Editor information

Editors and Affiliations

iCatse, B-3001, Intellige 2, Kyonggi University, Seongnam-si, Kyonggi-do, Korea (Republic of)
Kuinam J. Kim
Computer Science, Namseoul University, Cheonan, Ch'ungch'ong-namdo, Korea (Republic of)
Hyuncheol Kim
School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea (Republic of)
Nakhoon Baek

About this paper

Cite this paper

Wen, J., Gouripeddi, R., Facelli, J.C. (2018). Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_8

DOI: https://doi.org/10.1007/978-981-10-6451-7_8
Published: 31 August 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6450-0
Online ISBN: 978-981-10-6451-7
eBook Packages: Engineering, Engineering (R0)
