Abstract
Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify the characteristics of biomedical datasets for metadata discovery. Four common types of biomedical data sources were included: genetic variant data, protein structure data, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identified the four data sources with average F1 scores ranging from 0.935 to 1.000. The study demonstrates that biomedical data of different types have different distributions of token-based document structural features, and that these structural features can be leveraged for metadata discovery.
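To make the approach in the abstract concrete, the following is a minimal pure-Python sketch of the general idea: derive token-based structural features from a text snippet, fit a small decision tree, and score predictions with per-class F1. It is not the authors' implementation. The four features, the Gini-based tree, the depth limit, and the toy snippets are all assumptions chosen for illustration; the abstract does not say which token features or tree settings were used.

```python
import re
from collections import Counter

NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")

def token_features(text):
    """Hypothetical token-based structural features for one snippet."""
    tokens = text.split()
    n = len(tokens) or 1  # avoid division by zero on empty input
    return [
        sum(1 for t in tokens if NUMBER.fullmatch(t)) / n,          # numeric fraction
        sum(1 for t in tokens if t.isalpha()) / n,                  # alphabetic fraction
        sum(1 for t in tokens if t.isalpha() and t.isupper()) / n,  # all-caps fraction
        sum(len(t) for t in tokens) / n,                            # mean token length
    ]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=3):
    """Greedy decision tree: leaves are labels, nodes are (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between consecutive observed values
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:  # every feature vector is identical, so no split exists
        return majority
    _, f, thr, left, right = best
    return (f, thr,
            build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in set(y_true):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up snippets, one per source type (illustrative only, not study data).
TOY_DOCS = [
    ("variant", "1 12345 rs123 A G 50 PASS"),
    ("structure", "ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67"),
    ("publication", "We genotyped 120 samples and found 3 novel variants ."),
    ("english", "the quick brown fox jumps over the lazy dog"),
]
labels = [label for label, _ in TOY_DOCS]
tree = build_tree([token_features(text) for _, text in TOY_DOCS], labels)
predictions = [predict(tree, token_features(text)) for _, text in TOY_DOCS]
scores = f1_per_class(labels, predictions)
```

With one snippet per class the tree simply memorizes the training data, so the resulting scores say nothing about real performance. A meaningful evaluation would need many files per source type and held-out test data.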
Acknowledgments
bioCADDIE is supported by the National Institutes of Health (NIH) through the NIH Big Data to Knowledge, Grant 1U24AI117966-01. OpenFurther has received support from NCATS UL1TR001067, 3UL1RR025764-02S2, AHRQ R01 HS019862, DHHS 1D1BRH20425, U54EB021973, UU Research Foundation, NIBIB, NIH U54EB021973. Computer resources were provided by the University of Utah Center for High Performance Computing.
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wen, J., Gouripeddi, R., Facelli, J.C. (2018). Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_8
DOI: https://doi.org/10.1007/978-981-10-6451-7_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6450-0
Online ISBN: 978-981-10-6451-7
eBook Packages: Engineering, Engineering (R0)