Cancer informatics by prototype networks in mass spectrometry

https://doi.org/10.1016/j.artmed.2008.07.018

Summary

Objective

Mass spectrometry has become a standard technique for analyzing clinical samples in cancer research. The spectrometric measurements reveal a great deal of information about the clinical sample at the peptide and protein level. The spectra are high dimensional and, because the number of samples is small, sparse coverage of the population is very common. In clinical research the calculation and evaluation of classification models is important. In classical statistics this is achieved by hypothesis testing with respect to a chosen level of confidence. In clinical proteomics the application of statistical tests is limited by the small number of samples and the high dimensionality of the data. Typically, soft methods from the field of machine learning are used to generate such models. However, these methods provide little or no additional information about the safety of the model decision. In this contribution the spectral data are processed as functional data and conformal classifier models are generated. The obtained models allow the detection of potential biomarker candidates and provide confidence measures for the classification decision.

Methods

First, wavelet-based techniques for the efficient processing and encoding of mass spectrometric measurements from clinical samples are presented. A prototype-based classifier is extended by a functional metric and combined with the concept of conformal prediction to classify the clinical proteomic spectra and to evaluate the results.

Results

Clinical proteomic data of a colorectal cancer and a lung cancer study are used to test the performance of the proposed algorithm. The prototype classifiers are evaluated with respect to prediction accuracy and the confidence of the classification decisions. The adapted metric parameters are analyzed and interpreted to find potential biomarker candidates.

Conclusions

The proposed algorithm can be used to analyze functional data as obtained from clinical mass spectrometry, to find discriminating mass positions and to judge the confidence of the obtained classifications, providing robust and interpretable classification models.

Introduction

Analysis of clinical proteomic spectra obtained from mass spectrometric measurements is a complicated issue [1]. One major objective is the search for potential biomarkers in complex body fluids such as serum, plasma, urine, saliva, or cerebrospinal fluid [2], [3], [4], [5]. Typically the spectra are given as high-dimensional vectors. Thus, from a mathematical point of view, an efficient analysis and visualization of high-dimensional data sets is required. Moreover, the amount of available data is restricted: patient cohorts are usually small in comparison to the dimensionality of the data.

In contrast to the widely applied multilayer perceptron [6], prototype-based classification allows an easy interpretation, which is of particular interest for many (clinical) applications. One prominent prototype-based classifier is the supervised relevance neural gas algorithm (SRNG) [7]. SRNG yields a robust classifier that can learn efficiently from labeled high-dimensional data, and it has already been used in different types of experiments [8], [9], [10], [11].

In general, the available approaches to modeling classifiers in clinical proteomics first transform the spectra into a vector space and then train a classifier. In this way the functional nature of the data is lost, which may lead to suboptimal classifier models. A functional representation of the data with respect to the metric used, together with a weighting or pruning of irrelevant parts of the inputs (which are not known a priori), would be desirable. A discriminative data representation is necessary. The extraction of such discriminant features is difficult for spectral data and is typically done by a parametric peak picking procedure. This peak picking is often criticized because some peaks that are present may not be detected and the functional nature of the data is partially lost. To avoid these difficulties we follow the approach given in [12], [13] and apply a wavelet encoding to the spectral data to obtain discriminative features. The obtained wavelet coefficients are sufficient to reconstruct the signal and still contain all relevant information of the spectra in a functional encoding. However, this better discriminating set of features is typically more complex, and hence a robust approach to determining the desired classification model is needed. Taking this into account, a feature selection based on a statistical pre-analysis of the data is applied and the SRNG algorithm is used to obtain predictive models.
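The claim that wavelet coefficients suffice to reconstruct the signal can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet and a synthetic two-peak spectrum; the wavelet family, the number of levels, and the names `haar_dwt` and `haar_idwt` are illustrative choices, not the encoding used in [12], [13].

```python
import numpy as np

def haar_dwt(signal, levels):
    """Multi-level orthonormal Haar transform.

    Returns [approx_L, detail_L, ..., detail_1], coarsest scale first.
    """
    approx = np.asarray(signal, dtype=float)
    if approx.size % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences
        approx = (even + odd) / np.sqrt(2.0)         # local averages
    return [approx] + details[::-1]

def haar_idwt(coeffs):
    """Invert haar_dwt exactly, going from the coarsest to the finest scale."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx

# Synthetic spectrum (assumption): two Gaussian peaks on a 1500-3500 Da axis.
mz = np.linspace(1500.0, 3500.0, 1024)
spectrum = (np.exp(-0.5 * ((mz - 2000.0) / 15.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((mz - 2900.0) / 25.0) ** 2))

coeffs = haar_dwt(spectrum, levels=4)
reconstructed = haar_idwt(coeffs)
features = coeffs[0]  # coarse approximation coefficients as a compact encoding
```

Keeping all coefficients gives a lossless functional encoding; keeping only the coarse approximation, as in `features`, trades detail for a much lower dimension.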

In this contribution, we focus on the conformal prediction concept incorporated in prototype-based learning vector quantizers (LVQ). The paper is organized as follows. First we briefly review the functional encoding of mass spectrometric data by means of a wavelet-based encoding. Subsequently the theory of the SRNG and its extension by a functional metric is reviewed. On this basis, the method of conformal prediction [14], [15] is reviewed and we show how it can be used together with LVQ approaches. The methodology is then applied to experimental data from two clinical proteomic studies. We evaluate the results not only by cross-validation but also in the light of conformal prediction, which allows the assessment of the classification safety by means of p-values as known from classical statistics.
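How p-values can be attached to a prototype classifier may be sketched as follows. This is a simplified inductive (split) variant with fixed prototypes and a held-out calibration set. The nonconformity score (distance to the closest prototype of the hypothesized class divided by the distance to the closest prototype of any other class), the toy data, and all function names are assumptions made for illustration and may differ from the measure used in the paper.

```python
import numpy as np

def nonconformity(x, label, prototypes, proto_labels):
    """Small when x lies close to a prototype of `label` and far from the others."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    d_same = d[proto_labels == label].min()
    d_other = d[proto_labels != label].min()
    return d_same / (d_other + 1e-12)

def conformal_p_values(x, calib_X, calib_y, prototypes, proto_labels):
    """One p-value per candidate label, computed against calibration scores."""
    calib_scores = np.array([
        nonconformity(xi, yi, prototypes, proto_labels)
        for xi, yi in zip(calib_X, calib_y)
    ])
    p_values = {}
    for label in np.unique(proto_labels):
        score = nonconformity(x, label, prototypes, proto_labels)
        # Share of calibration examples at least as nonconforming as x.
        p_values[int(label)] = (np.sum(calib_scores >= score) + 1) / (calib_scores.size + 1)
    return p_values

# Toy setup (assumption): one prototype per class in two dimensions.
prototypes = np.array([[0.0, 0.0], [4.0, 4.0]])
proto_labels = np.array([0, 1])
calib_X = np.array([[0.5, 0.0], [-0.5, 0.5], [1.0, 1.0],
                    [3.5, 4.0], [4.5, 3.5], [3.0, 3.0]])
calib_y = np.array([0, 0, 0, 1, 1, 1])

p_values = conformal_p_values(np.array([0.2, 0.1]), calib_X, calib_y,
                              prototypes, proto_labels)
ranked = sorted(p_values, key=p_values.get, reverse=True)
prediction = ranked[0]                   # label with the largest p-value
credibility = p_values[ranked[0]]        # largest p-value
confidence = 1.0 - p_values[ranked[1]]   # one minus the second largest p-value
```

With only six calibration examples the smallest attainable p-value is 1/7, so the confidence cannot exceed 6/7 here; larger calibration sets give a finer resolution.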

Section snippets

Preprocessing

The classification of mass spectra involves multiple preprocessing steps. In general, peak picking is used to locate and quantify the positions of peaks within the spectrum, and feature extraction is applied to the peak list to obtain an adequate feature matrix.

In the first step a number of procedures such as baseline correction, optional denoising, noise estimation and normalization are needed [16], [17]. On these prepared spectra the peaks have to be identified by scanning all local maxima and the

Bioinformatic methods

The supervised relevance neural gas algorithm is a prototype-based classification model, which will be introduced very briefly. Subsequently we extend the concept of conformal prediction, as introduced in [14], [15], to the context of prototype-based networks; this extension is used in the evaluation part to determine confidence values for the obtained classification results.
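The role of an adaptive metric in a prototype-based classifier can be sketched with a relevance-weighted squared Euclidean distance, as used in relevance learning vector quantization. The sketch below only shows classification with fixed prototypes and fixed relevance values; the training procedure of SRNG and the functional metric of the paper are not reproduced, and the prototypes, relevance values, and function names are hypothetical.

```python
import numpy as np

def relevance_distance(x, prototype, relevance):
    """Squared Euclidean distance with one non-negative weight per input dimension."""
    return float(np.sum(relevance * (x - prototype) ** 2))

def classify(x, prototypes, proto_labels, relevance):
    """Assign the label of the closest prototype under the weighted metric."""
    distances = [relevance_distance(x, w, relevance) for w in prototypes]
    return proto_labels[int(np.argmin(distances))]

# Hypothetical example: dimension 0 separates the classes, while dimension 2
# carries large variation that is unrelated to the class label.
prototypes = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 5.0]])
proto_labels = np.array([0, 1])
x = np.array([0.9, 0.0, 0.5])

uniform = np.full(3, 1.0 / 3.0)          # plain Euclidean metric up to scaling
adapted = np.array([0.98, 0.01, 0.01])   # assumed result of relevance learning

label_uniform = classify(x, prototypes, proto_labels, uniform)
label_adapted = classify(x, prototypes, proto_labels, adapted)

# Sorting the relevance values ranks input dimensions by their importance.
ranking = np.argsort(adapted)[::-1]
```

Under the uniform weighting the non-discriminative third dimension dominates and `x` is assigned to class 0, whereas the adapted weighting assigns it to class 1. Inspecting the largest relevance values is one way such a model points to candidate mass positions.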

Evaluation of prototype-based classifier models

Advanced prototype-based classification models typically show high regularization capabilities [27]. Nevertheless, the results of prototype networks also need a thorough analysis by cross-validation to obtain practical measures for rating the prediction capabilities of the current model. Besides these generic measures of confidence in the results obtained by a classification model, a more fine-grained confidence analysis would be desirable. Classical statistics typically allows a judgment on the

Clinical data

Serum protein profiling is a promising approach for classification of cancer versus non-cancer samples. The data used in this paper are taken from a colorectal cancer (CRC) study and patients from healthy individuals. Here it should be mentioned only that for each profile a mass spectrum is obtained within an analyzed mass-to-charge range of 1500–3500 Da. Two sample

Experiments and results

We focus on a supervised data analysis and reduce the dimensionality of the data by use of a problem specific wavelet analysis combined with a statistical selection criterion. We avoid statistical assumptions with respect to the underlying data sets, but take only measurement specific knowledge into account.

Hence we have a 101-dimensional and a 40-dimensional space of wavelet coefficients, and we use multiple algorithms and metrics to determine classification models. We focus on the presented SRNG algorithm.
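Reducing a set of wavelet coefficients with a statistical selection criterion can be sketched as a univariate ranking. The snippet does not specify the criterion, so the example assumes a rank-based score (the distance of the per-feature AUC from 0.5), chosen because it needs no distributional assumptions. The data are random numbers with two artificially shifted features, and `auc_score` and `select_features` are hypothetical names.

```python
import numpy as np

def auc_score(a, b):
    """Probability that a value drawn from a exceeds one from b (ties count 0.5)."""
    diff = a[:, None] - b[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def select_features(X, y, k):
    """Keep the k features whose AUC deviates most from the chance level 0.5."""
    scores = np.array([
        abs(auc_score(X[y == 1, j], X[y == 0, j]) - 0.5)
        for j in range(X.shape[1])
    ])
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep, scores

# Synthetic data (assumption): 40 samples, 10 coefficients, two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
X[y == 1, 3] += 4.0
X[y == 1, 7] -= 4.0

keep, scores = select_features(X, y, k=2)
X_reduced = X[:, keep]
```

In a cross-validated evaluation the selection would have to be repeated inside each training fold; otherwise the test folds influence which features are kept.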

Conclusions

We presented a specific pre-processing for mass spectrometric data analysis combined with an extension of the SRNG by a functional metric and the integration of conformal prediction. The presented processing of the spectra aims at a natural, compact encoding of the signals by means of a functional representation, while the classification model is especially suited to high-dimensional sparse data and allows strong regularization to reduce overfitting effects.

In an initial setup the

Acknowledgments

The authors are grateful to T. Elssner and M. Gerhard for useful discussions and support in interpretation of the results (both Bruker Daltonik GmbH Leipzig/Bremen, Germany). Further we would like to thank Luo Zhiyuan for helpful discussions on Hedging predictions (Computer Learning Research Center (CLRC), Royal Holloway, University of London, UK). Frank-Michael Schleif would also like to thank Beate Müller (Ritsumeikan University, Japan) for an effective working atmosphere during preparation

References (38)

  • B. Hammer et al.

    Generalized relevance learning vector quantization

    Neural Networks

    (2002)
  • A. Gammerman et al.

    Prediction algorithms and confidence measures based on algorithmic randomness theory

    Theoretical Computer Science

    (2002)
  • W. Pusch et al.

    Mass spectrometry-based clinical proteomics

    Pharmacogenomics

    (2003)
  • G. Fiedler et al.

    Standardized peptidome profiling of human urine by magnetic bead separation and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry

    Clinical Chemistry

    (2007)
  • E. Schäffeler et al.

    Magnetic bead based human plasma profiling discriminate acute lymphatic leukaemia from non-diseased samples

  • R. Schipper et al.

    Salivary protein/peptide profiling with SELDI-TOF-MS

    Annals of the New York Academy of Sciences

    (2007)
  • N. Guerreiro et al.

    Optimization and evaluation of SELDI-TOF mass spectrometry for protein profiling of cerebrospinal fluid

    Proteome Science

    (2006)
  • C. Bishop

    Pattern recognition and machine learning

    (2006)
  • B. Hammer et al.

    Supervised neural gas with general similarity measure

    Neural Processing Letters

    (2005)
  • T. Villmann et al.

    Supervised neural gas and relevance learning in learning vector quantisation

  • F-M. Schleif et al.

    Supervised relevance neural gas and unified maximum separability analysis for classification of mass spectrometric data

  • T. Villmann et al.

    Comparison of relevance learning vector quantization with other metric adaptive classification methods

    Neural Networks

    (2005)
  • T. Villmann et al.

    Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods

    Briefings in Bioinformatics

    (2008)
  • F-M. Schleif et al.

    Analysis of proteomic spectral data by multi resolution analysis and self-organizing-maps

  • F-M. Schleif et al.

    Supervised neural gas for functional data and its application to the analysis of clinical proteom spectra

  • V. Vovk et al.

    Algorithmic learning in a random world

    (2005)
  • A. Gammerman et al.

    Hedging predictions in machine learning

    The Computer Journal

    (2007)
  • R. Ketterlinus et al.

    Fishing for biomarkers: analyzing mass spectrometry data with the new ClinProTools software

    Biotechniques

    (2005)
  • Schleif F-M. Prototype based machine learning for clinical proteomics, Ph.D. thesis. Technical University Clausthal,...