NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data

Lee, Justin Y.; Styczynski, Mark P.

doi:10.1007/s11306-018-1451-8

Original Article
Published: 23 November 2018

Metabolomics, volume 14, article number 153 (2018)

Abstract

Introduction

A common problem in metabolomics data analysis is the presence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely used solutions to this problem is to impute missing values with a k-nearest neighbors (kNN) algorithm that estimates the missing metabolite abundances. kNN implicitly assumes that missing values are distributed uniformly at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they fall below the limit of detection of the analytical instrumentation.
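As a rough illustration of the baseline, the following Python sketch imputes each missing abundance from the k most similar metabolites. It is not the authors' MATLAB code: the function name `knn_impute`, the metabolites-by-samples layout, the Euclidean-style distance over jointly observed samples, and the unweighted mean are all assumptions made for this example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in X (rows = metabolites, columns = samples) using kNN.

    Illustrative sketch only; layout, distance, and averaging are assumptions.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    for i in range(n):
        missing_cols = np.where(np.isnan(X[i]))[0]
        if missing_cols.size == 0:
            continue
        # Distance from metabolite i to every other metabolite, computed
        # only over samples in which both metabolites were measured.
        dist = np.full(n, np.inf)
        for r in range(n):
            if r == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                dist[r] = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
        order = np.argsort(dist)
        for j in missing_cols:
            # Conventional kNN skips neighbors that are themselves missing
            # in sample j and moves on to the next nearest candidate.
            usable = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(X[r, j])]
            if usable:
                out[i, j] = X[usable[:k], j].mean()
    return out
```

Because a neighbor that lacks a value in the target sample is simply skipped, the estimate is built only from metabolites that were detected in that sample, which is where the implicit missing-at-random assumption enters.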

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.
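The abstract does not spell out how NS-kNN treats neighbors, so the sketch below rests on a stated assumption: the k nearest metabolites are always used, and a neighbor that is missing in the target sample contributes the minimum observed value of that neighbor instead of being skipped. The name `ns_knn_impute`, the distance, and the unweighted mean are hypothetical choices for illustration; the authors' repository remains the authoritative implementation.

```python
import numpy as np

def ns_knn_impute(X, k=3):
    """No-skip kNN sketch for X (rows = metabolites, columns = samples).

    Assumption: a neighbor missing in the target sample is replaced by
    that neighbor's minimum observed value rather than skipped.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    # Minimum observed abundance per metabolite (NaN if never observed).
    row_min = np.array([np.nan if np.isnan(r).all() else np.nanmin(r)
                        for r in X])
    for i in range(n):
        missing_cols = np.where(np.isnan(X[i]))[0]
        if missing_cols.size == 0:
            continue
        dist = np.full(n, np.inf)
        for r in range(n):
            if r == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                dist[r] = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
        # Fix the k nearest neighbors once; none of them is skipped later.
        nearest = [r for r in np.argsort(dist) if np.isfinite(dist[r])][:k]
        if not nearest:
            continue
        for j in missing_cols:
            vals = [row_min[r] if np.isnan(X[r, j]) else X[r, j]
                    for r in nearest]
            out[i, j] = float(np.mean(vals))
    return out
```

Under this assumption, a value that is missing in several closely related metabolites at once is pulled toward low abundances, which matches the intuition that such values are likely below the limit of detection.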

Methods

We compare the imputation errors of three approaches: the original kNN algorithm (with two distance metrics), NS-kNN, and the recently developed KNN-TN algorithm. Each is applied to multiple experimental datasets with different types and levels of missing data.
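A comparison of this kind needs a complete dataset, a way to remove values, and an error measure on the removed entries. The sketch below is a generic stand-in, not the missing-data model proposed in the article: `mask_values` removes a chosen share of values as MNAR by deleting the globally lowest abundances and removes the rest uniformly at random, and `nrmse` is one common normalized error. Both function names and the single global threshold are assumptions.

```python
import numpy as np

def mask_values(X, total_frac, mnar_frac, seed=0):
    """Return a copy of X with NaNs inserted.

    total_frac: fraction of all entries to remove.
    mnar_frac:  fraction of the removed entries that are MNAR
                (taken from the lowest values, mimicking a detection limit).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_total = int(round(total_frac * X.size))
    n_mnar = int(round(mnar_frac * n_total))
    flat = X.ravel().copy()
    order = np.argsort(flat)
    mnar_idx = order[:n_mnar]        # lowest values: missing not at random
    remaining = order[n_mnar:]
    # Remaining removals are drawn uniformly at random.
    mcar_idx = rng.choice(remaining, size=n_total - n_mnar, replace=False)
    flat[mnar_idx] = np.nan
    flat[mcar_idx] = np.nan
    return flat.reshape(X.shape)

def nrmse(truth, imputed, mask):
    """Root-mean-square error on masked entries, scaled by std of truth."""
    diff = np.asarray(imputed)[mask] - np.asarray(truth)[mask]
    return float(np.sqrt(np.mean(diff ** 2)) / np.std(truth))
```

With these two helpers, any imputation function can be scored by masking a complete matrix, imputing it, and calling `nrmse(X, imputed, np.isnan(masked))` while sweeping `mnar_frac` from 0 to 1.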

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.

Acknowledgements

The authors acknowledge the National Science Foundation (MCB-1254382) and the National Institutes of Health (R35-GM119701) for financial support.

Author information

Authors and Affiliations

School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA, 30332-0100, USA
Justin Y. Lee & Mark P. Styczynski

Contributions

JYL participated in the design of the study, carried out the computational experiments, and helped draft the manuscript. MPS conceived of the study, participated in the design of the study, and helped draft the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mark P. Styczynski.

Ethics declarations

Ethical approval

The article does not contain any studies with human and/or animal participants.

Conflict of interest

The authors declare no conflicts of interest.

Software availability

The MATLAB code developed in this study is accessible via https://github.com/gtStyLab/NSkNN.

Electronic supplementary material

Supplementary material 1 (DOCX 7412 KB)

About this article

Cite this article

Lee, J.Y., Styczynski, M.P. NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14, 153 (2018). https://doi.org/10.1007/s11306-018-1451-8

Received: 06 August 2018
Accepted: 15 November 2018
Published: 23 November 2018
DOI: https://doi.org/10.1007/s11306-018-1451-8
