Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features

  • Conference paper
IT Convergence and Security 2017

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 449)

Abstract

Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify biomedical dataset characteristics for metadata discovery. Four common types of biomedical data sources were included in this study: genetic variant data, protein structure data, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identify the four data sources with average F1 scores ranging from 0.935 to 1.000. This study demonstrates that biomedical data of different types have different distributions of token-based document structural features and that such structural features can be leveraged for metadata discovery.
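The general idea of classifying data sources from token-based features with a decision tree can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three features (share of numeric tokens, share of all-caps tokens, mean token length) are not the paper's feature set, the four toy strings are invented stand-ins for the four source types, and the small hand-rolled Gini-based tree stands in for whatever decision tree implementation the authors used. The function names `token_features`, `build_tree`, `predict`, and `f1_score` are hypothetical.

```python
import re
from collections import Counter

def token_features(text):
    """Illustrative features: numeric share, all-caps share, mean token length."""
    tokens = text.split()
    n = len(tokens) or 1  # guard against empty documents
    numeric = sum(1 for t in tokens if re.fullmatch(r"[-+]?\d+(\.\d+)?", t))
    upper = sum(1 for t in tokens if t.isalpha() and t.isupper())
    return [numeric / n, upper / n, sum(len(t) for t in tokens) / n]

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (score, feature index, threshold) with the lowest weighted Gini."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring feature values
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, thr)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Grow a tree of nested (feature, threshold, left, right) tuples; leaves are labels."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    split = best_split(X, y)
    if split is None:  # no feature separates the remaining samples
        return majority
    _, j, thr = split
    left = [(r, lab) for r, lab in zip(X, y) if r[j] <= thr]
    right = [(r, lab) for r, lab in zip(X, y) if r[j] > thr]
    return (j, thr,
            build_tree([r for r, _ in left], [lab for _, lab in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [lab for _, lab in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if x[j] <= thr else right
    return tree

def f1_score(y_true, y_pred, label):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented toy snippets, one per source type (not data from the study).
TOY_DOCS = [
    ("chr1 12345 rs123 A G 50.0 PASS", "variant"),
    ("ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N", "structure"),
    ("We evaluated the association of BRCA1 variants with breast cancer "
     "risk in 120 patients", "publication"),
    ("The quick brown fox jumps over the lazy dog near the river bank", "english"),
]

X = [token_features(text) for text, _ in TOY_DOCS]
y = [label for _, label in TOY_DOCS]
tree = build_tree(X, y)
predictions = [predict(tree, x) for x in X]
```

A new file's text could then be labelled with `predict(tree, token_features(text))`. With one invented example per class the tree simply memorises its training data, so this sketch says nothing about the F1 scores reported above; a realistic evaluation would need many files per source type and held-out test data.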

Acknowledgments

bioCADDIE is supported by the National Institutes of Health (NIH) through the NIH Big Data to Knowledge, Grant 1U24AI117966-01. OpenFurther has received support from NCATS UL1TR001067, 3UL1RR025764-02S2, AHRQ R01 HS019862, DHHS 1D1BRH20425, U54EB021973, UU Research Foundation, NIBIB, NIH U54EB021973. Computer resources were provided by the University of Utah Center for High Performance Computing.

Author information

Corresponding author

Correspondence to Julio C. Facelli.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wen, J., Gouripeddi, R., Facelli, J.C. (2018). Metadata Discovery of Heterogeneous Biomedical Datasets Using Token-Based Features. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_8

  • DOI: https://doi.org/10.1007/978-981-10-6451-7_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6450-0

  • Online ISBN: 978-981-10-6451-7

  • eBook Packages: Engineering, Engineering (R0)
