Cancer informatics by prototype networks in mass spectrometry
Introduction
Analysis of clinical proteomic spectra obtained from mass spectrometric measurements is a complicated issue [1]. One major objective is the search for potential biomarkers in complex body fluids like serum, plasma, urine, saliva, or cerebrospinal fluid [2], [3], [4], [5]. Typically, the spectra are given as high-dimensional vectors. Thus, from a mathematical point of view, an efficient analysis and visualization of high-dimensional data sets is required. Moreover, the amount of available data is restricted: patient cohorts are usually small in comparison to the dimensionality of the data.
In contrast to the widely applied multilayer perceptron [6], prototype-based classification allows an easy interpretation, which is of particular interest for many (clinical) applications. One prominent prototype-based classifier is the supervised relevance neural gas algorithm (SRNG) [7]. SRNG leads to a robust classifier that can learn efficiently from labeled high-dimensional data, and it has already been used in different types of experiments [8], [9], [10], [11].
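The interpretability of a prototype-based classifier can be illustrated with a minimal sketch: a sample receives the label of its closest prototype, and per-feature relevance weights scale the distance. The weighted squared Euclidean distance is the form commonly used in relevance learning; the prototype vectors, relevance values, and function names below are hypothetical, and the training of prototypes and relevances is not shown.

```python
from typing import List, Sequence, Tuple

Prototype = Tuple[Sequence[float], str]  # (prototype vector, class label)


def weighted_sq_distance(x: Sequence[float], w: Sequence[float],
                         relevances: Sequence[float]) -> float:
    """Squared Euclidean distance with one relevance weight per feature."""
    return sum(r * (xi - wi) ** 2 for xi, wi, r in zip(x, w, relevances))


def classify(x: Sequence[float], prototypes: List[Prototype],
             relevances: Sequence[float]) -> Tuple[str, int]:
    """Return the label and index of the closest (winning) prototype."""
    winner = min(
        range(len(prototypes)),
        key=lambda i: weighted_sq_distance(x, prototypes[i][0], relevances),
    )
    return prototypes[winner][1], winner


# Hypothetical prototypes and relevances (illustrative values only).
prototypes = [([1.0, 0.0, 5.0], "control"), ([3.0, 0.0, 5.0], "cancer")]
relevances = [0.8, 0.1, 0.1]  # the first feature dominates the decision

label, winner = classify([2.6, 4.0, 1.0], prototypes, relevances)
```

Because the decision is tied to one winning prototype and a relevance profile, both can be inspected directly, which is the interpretability argument made above.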
In general, the available approaches to modeling classifiers in clinical proteomics first transform the spectra into a vector space and then train a classifier. In this way the functional nature of the data is lost, which may lead to suboptimal classifier models. A functional representation of the data with respect to the metric used, together with a weighting or pruning of input regions that are irrelevant but not known to be so a priori, would be desirable. A discriminative data representation is also necessary. Extracting such discriminant features is difficult for spectral data and is typically done by a parametric peak picking procedure. Peak picking is often criticized because some peaks that are present may not be detected and the functional nature of the data is partially lost. To avoid these difficulties we follow the approach given in [12], [13] and apply a wavelet encoding to the spectral data to obtain discriminative features. The obtained wavelet coefficients are sufficient to reconstruct the signal and thus still contain all relevant information of the spectra in a functional encoding. However, this more discriminative set of features is typically more complex, and hence a robust approach to determine the desired classification model is needed. Taking this into account, a feature selection based on a statistical pre-analysis of the data is applied, and the SRNG algorithm is used to obtain predictive models.
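The claim that wavelet coefficients suffice to reconstruct the signal can be checked with a small sketch. It assumes the orthonormal Haar wavelet purely for simplicity; the wavelet family actually used in [12], [13] is not stated in this text, and the toy spectrum and function names are hypothetical.

```python
import math
from typing import List, Sequence

S = 1.0 / math.sqrt(2.0)  # normalization of the orthonormal Haar filters


def haar_encode(signal: Sequence[float], levels: int) -> List[List[float]]:
    """Multi-level Haar transform.

    Returns [approx_L, detail_L, ..., detail_1]. The signal length must be
    divisible by 2**levels.
    """
    approx = list(signal)
    coeffs: List[List[float]] = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) * S for a, b in pairs]
        approx = [(a + b) * S for a, b in pairs]
        coeffs.insert(0, detail)  # coarser details go in front
    coeffs.insert(0, approx)
    return coeffs


def haar_decode(coeffs: Sequence[Sequence[float]]) -> List[float]:
    """Invert haar_encode, going from the coarsest level to the finest."""
    approx = list(coeffs[0])
    for detail in coeffs[1:]:
        finer: List[float] = []
        for a, d in zip(approx, detail):
            finer.extend([(a + d) * S, (a - d) * S])
        approx = finer
    return approx


# Hypothetical toy "spectrum" with a single peak.
spectrum = [0.0, 0.1, 0.0, 2.0, 4.0, 2.1, 0.1, 0.0]
coeffs = haar_encode(spectrum, levels=3)
reconstructed = haar_decode(coeffs)
n_coeffs = sum(len(c) for c in coeffs)
```

The transform yields as many coefficients as there are input samples, so dimensionality is only reduced by the subsequent selection step, not by the encoding itself.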
In this contribution, we focus on the conformal prediction concept incorporated in prototype-based learning vector quantizers (LVQ). The paper is organized as follows. First we briefly review the functional encoding of mass spectrometric data by means of a wavelet-based encoding. Subsequently, the theory of the SRNG and its extension by a functional metric is reviewed. After that, the method of conformal prediction [14], [15] is reviewed and we show how it can be used together with LVQ approaches. The methodology is then applied to experimental data from two clinical proteomic studies. We evaluate the results not only by cross-validation but also in the light of conformal prediction, which allows the classification safety to be assessed by means of p-values as known from classical statistics.
Section snippets
Preprocessing
The classification of mass spectra involves multiple preprocessing steps. In general, peak picking is used to locate and quantify the positions of peaks within the spectrum, and feature extraction is applied to the peak list to obtain an adequate feature matrix.
In the first step a number of procedures such as baseline correction, optional denoising, noise estimation and normalization are needed [16], [17]. In these prepared spectra the peaks have to be identified by scanning all local maxima and the
Bioinformatic methods
The supervised relevance neural gas algorithm is a prototype-based classification model, which will be introduced very briefly. Subsequently we extend the concept of conformal prediction, as introduced in [14], [15], to prototype-based networks. This extension is used in the evaluation part to determine confidence values for the obtained classification results.
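How conformal prediction can attach p-values to the decisions of a prototype classifier can be sketched as follows. The sketch rests on assumptions not confirmed by this text: the nonconformity score is taken as the ratio of the distance to the nearest prototype of the candidate class over the distance to the nearest prototype of any other class, the prototypes are kept fixed, and a separate labeled calibration set is used (an inductive simplification). All names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Prototype = Tuple[Sequence[float], str]
Sample = Tuple[Sequence[float], str]


def sq_dist(x: Sequence[float], w: Sequence[float]) -> float:
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def nonconformity(x: Sequence[float], label: str,
                  prototypes: List[Prototype]) -> float:
    """Assumed score: d(nearest same-class) / d(nearest other-class)."""
    d_same = min(sq_dist(x, w) for w, c in prototypes if c == label)
    d_other = min(sq_dist(x, w) for w, c in prototypes if c != label)
    return d_same / d_other if d_other > 0 else float("inf")


def p_values(x: Sequence[float], labels: Sequence[str],
             prototypes: List[Prototype],
             calibration: List[Sample]) -> Dict[str, float]:
    """p-value per candidate label: share of scores at least as nonconforming."""
    scores = [nonconformity(xi, yi, prototypes) for xi, yi in calibration]
    result = {}
    for label in labels:
        a_new = nonconformity(x, label, prototypes)
        # The +1 terms count the new example itself.
        result[label] = (sum(s >= a_new for s in scores) + 1) / (len(scores) + 1)
    return result


def predict(x, labels, prototypes, calibration):
    """Return (label, confidence, credibility) from the two largest p-values."""
    p = p_values(x, labels, prototypes, calibration)
    ranked = sorted(p, key=p.get, reverse=True)
    return ranked[0], 1.0 - p[ranked[1]], p[ranked[0]]


# Hypothetical prototypes and calibration samples.
prototypes = [([0.0, 0.0], "control"), ([4.0, 0.0], "cancer")]
calibration = [([0.5, 0.0], "control"), ([1.0, 0.5], "control"),
               ([3.5, 0.0], "cancer"), ([2.5, 0.0], "cancer")]
label, confidence, credibility = predict(
    [0.2, 0.1], ["control", "cancer"], prototypes, calibration)
```

Confidence is one minus the second-largest p-value and credibility is the largest p-value, so a prediction with high confidence but low credibility flags a sample that fits none of the classes well.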
Evaluation of prototype-based classifier models
Advanced prototype-based classification models typically show high regularization capabilities [27]. Nevertheless, the results of prototype networks also need a thorough analysis by cross-validation to obtain practical measures of the prediction capabilities of the current model. Besides these generic measures of confidence in the results obtained by a classification model, a more fine-grained confidence analysis would be desirable. Classical statistics typically allows a judgment on the
Clinical data
Serum protein profiling is a promising approach for classification of cancer versus non-cancer samples. The data used in this paper are taken from a colorectal cancer (CRC) study and patients from healthy individuals. We note only that for each profile a mass spectrum is obtained within an analyzed mass-to-charge range of 1500–3500 Da. Two sample
Experiments and results
We focus on a supervised data analysis and reduce the dimensionality of the data by use of a problem specific wavelet analysis combined with a statistical selection criterion. We avoid statistical assumptions with respect to the underlying data sets, but take only measurement specific knowledge into account.
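A statistical selection of wavelet coefficients might look like the following sketch. The text does not name the criterion here, so a distribution-free two-sample Kolmogorov-Smirnov statistic is assumed, computed per coefficient between the two classes. The data and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    best = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        best = max(best, abs(cdf_a - cdf_b))
    return best


def select_features(class_a: List[Sequence[float]],
                    class_b: List[Sequence[float]],
                    keep: int) -> Tuple[List[int], List[float]]:
    """Score each feature by its KS statistic and keep the top `keep` indices."""
    n_features = len(class_a[0])
    scores = [
        ks_statistic([row[j] for row in class_a], [row[j] for row in class_b])
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep]), scores


# Hypothetical coefficient matrices: rows are samples, columns are coefficients.
class_a = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.1], [0.9, 6.0, 0.3]]
class_b = [[3.0, 5.5, 0.2], [3.1, 4.5, 0.15], [2.9, 5.0, 0.25]]
selected, scores = select_features(class_a, class_b, keep=1)
```

A rank-based criterion of this kind makes no distributional assumption about the coefficients, which matches the stated aim of avoiding statistical assumptions about the data.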
Hence we have a 101-dimensional and a 40-dimensional space of wavelet coefficients, and we use multiple algorithms and metrics to determine classification models. We focus on the presented SRNG algorithm.
Conclusions
We presented a specific pre-processing for mass spectrometric data analysis combined with an extension of the SRNG by a functional metric and the integration of conformal prediction. The presented processing of the spectra aims at a natural, compact encoding of the signals by means of a functional representation, while the classification model is especially suited to high-dimensional sparse data and allows strong regularization to reduce overfitting effects.
In an initial setup the
Acknowledgments
The authors are grateful to T. Elssner and M. Gerhard for useful discussions and support in interpretation of the results (both Bruker Daltonik GmbH Leipzig/Bremen, Germany). Further we would like to thank Luo Zhiyuan for helpful discussions on Hedging predictions (Computer Learning Research Center (CLRC), Royal Holloway, University of London, UK). Frank-Michael Schleif would also like to thank Beate Müller (Ritsumeikan University, Japan) for an effective working atmosphere during preparation
References (38)
- et al. Generalized relevance learning vector quantization. Neural Networks (2002)
- et al. Prediction algorithms and confidence measures based on algorithmic randomness theory. Theoretical Computer Science (2002)
- et al. Mass spectrometry-based clinical proteomics. Pharmacogenomics (2003)
- et al. Standardized peptidome profiling of human urine by magnetic bead separation and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Clinical Chemistry (2007)
- et al. Magnetic bead based human plasma profiling discriminate acute lymphatic leukaemia from non-diseased samples
- et al. Salivary protein/peptide profiling with seldi-tof-ms. Annals of the New York Academy of Science (2007)
- et al. Optimization and evaluation of seldi-tof mass spectrometry for protein profiling of cerebrospinal fluid. Proteome science (2006)
- Pattern recognition and machine learning (2006)
- et al. Supervised neural gas with general similarity measure. Neural Processing Letters (2005)
- et al. Supervised neural gas and relevance learning in learning vector quantisation
- Supervised relevance neural gas and unified maximum separability analysis for classification of mass spectrometric data
- Comparison of relevance learning vector quantization with other metric adaptive classification methods. Neural Networks
- Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods. Briefings in Bioinformatics
- Analysis of proteomic spectral data by multi resolution analysis and self-organizing-maps
- Supervised neural gas for functional data and its application to the analysis of clinical proteom spectra
- Algorithmic learning in a random world
- Hedging predictions in machine learning. The Computer Journal
- Fishing for biomarkers: analyzing mass spectrometry data with the new clinprotools software. Biotechniques