Introduction

Magnetic resonance spectroscopy (MRS) is slowly becoming an accurate non-invasive complement to magnetic resonance imaging for the initial diagnostic examination of brain masses [1], since it provides useful chemical information about metabolites for characterizing brain tumors [2]. To achieve this status, clinical and pattern recognition (PR)-based classification of brain tumors using MRS data has been thoroughly investigated for more than fifteen years [1,3–13].

Clinical decision-support systems (CDSSs) based on PR should be developed so as to obtain high classification accuracy, interpretability in terms of clinical knowledge, and generalization of the performance to new samples obtained subsequently in different clinical centers [14–17]. Standardization of acquisition conditions and protocols should make data from different hospitals compatible and allow the development and evaluation of joint CDSSs. This standardization prevents possible bias from single-center or single-machine studies and, additionally, increases the number of cases available for classifier development and testing.

During the INTERPRET project [8,18], a protocol was defined to guarantee the compatibility of the signals acquired at different hospitals [19,20]. As a result, studies on automated brain tumor classification were carried out using these data. In previous studies [7,8,10,21], the ability of automatic classifiers based on short echo time (TE) MRS to discriminate among different brain tumor diagnoses was demonstrated. In addition, in [11,13,21], automated classification by means of long TE MRS was also studied and demonstrated. Other studies evaluated the extension of the classifiers towards 1H magnetic resonance spectroscopic imaging (MRSI) [12,21–24]. Every study reported above was developed and evaluated using data acquired during the same period of time. Besides, other automated classification studies, such as [2,13,25–28], have been reported on single-center MRS datasets of brain masses.

In order to provide the clinical community with robust results of automatic classification, it is advisable to extend the evaluation in time. The validation of classifiers on subsequent cases can consolidate the confidence of clinicians in the potential applicability of these classifiers. The multicenter eTUMOUR project [29] (2004–2009) has benefited from the data and expertise gathered by INTERPRET. The INTERPRET acquisition protocols for clinical, radiological, and histopathological data were extended to ex-vivo transcriptomic (DNA microarrays) and metabolomic (HR-MAS) data acquisition in eTUMOUR. Furthermore, the raw MRS data acquired during INTERPRET were incorporated into the eTUMOUR dataset for classifier development. This provides a unique opportunity to evaluate INTERPRET-based models by means of cases acquired at a later date in partly different hospitals with different instrumentation, but obtained using the same or compatible acquisition protocols. The multiproject-multicenter evaluation proposed in this study gives a close-up perspective of the conditions that predictive models may face in different real clinical environments.

In this study, six pairwise classifiers for glioblastoma (GBM), low-grade meningioma (MEN), metastasis (MET), and low-grade glial (LGG) diagnoses were developed and tested on single-voxel (SV) short TE MRS signals. Short TE MRS is fast (typically 5 min) and robust, so it is considered to be appropriate for routine clinical studies [1]. Most major hospitals currently use this acquisition protocol for the MRS evaluation of brain tumors. The short TE spectral pattern has been reported to contain a larger amount of information than long TE spectra, e.g. metabolites and other compounds that are considered useful for classification purposes [1,8,11]. Thus, creatine (Cr) (3.02, 3.92 ppm), choline (Cho) (3.21 ppm), N-acetyl aspartate (NAA) (2.01 ppm), myo-inositol (mI) and glycine (Gly) (3.55 ppm), mI/taurine (Tau) (3.26 ppm), glutamate/glutamine (Glx) (2.04, 2.46, 3.78 ppm), lactate (Lac) (1.31 ppm), and alanine (Ala) (1.47 ppm) are observed at short TE. Furthermore, macromolecules (MM) (5.4, 2.9, 2.25, 2.05, 1.4 and 0.87 ppm) and mobile lipids (ML) are also well detected at short TE [1,8]. Comparative studies on the use of short TE versus long TE have shown the benefit of using short TE or the combination of both echo times for automatic classification purposes [30].

Based on previous results from [10,11,18,21], good performance of the PR models could be expected for most of the classification problems, except for the discrimination of glioblastoma and metastasis [10]. Our performance estimations of models trained with INTERPRET data and tested on eTUMOUR cases confirmed this behaviour. We observed that pairwise discrimination between glioblastoma, meningioma, metastasis, and low-grade glial achieved an accuracy of around 90%. The exception was the discrimination between glioblastoma and metastasis, which did not exceed 78% accuracy. This study consolidates the results obtained by previous studies in automatic brain tumor classification using MRS. These results may also increase the confidence of the clinical community in the use of CDSSs that incorporate this kind of classifier for the interpretation of MRS biomedical signals and the diagnosis of brain tumors.

Materials and methods

Data acquisition

The training data used for classifier development were SV MRS signals at 1.5 T at short TE (point-resolved spectroscopic sequence (PRESS) or stimulated echo acquisition mode sequence (STEAM), 20–32 ms) that were acquired by international centers in the framework of INTERPRET [18]. The classes considered for inclusion in this study were based on the histological classification of the central nervous system (CNS) tumors set up by the World Health Organization (WHO) [31]: glioblastoma (GBM), MEN, MET, and LGG (Astrocytoma gII, Oligoastrocytoma gII, or Oligodendroglioma gII). The number of cases by class is summarized in Table 1.

Table 1 Number of training (INTERPRET) and test (eTUMOUR) cases per class used in the study

A total of 211 SV 1H (nuclear) magnetic resonance (MR) spectra from the INTERPRET database [19] were included. These signals were acquired with Siemens, General Electric (GE), and Philips instruments by six international centers. The acquisition protocols included PRESS or STEAM sequences, with the following spectral parameters: repetition time (TR) between 1,600 and 2,020 ms, TE of 20 or 30–32 ms, spectral width of 1,000–2,500 Hz, and 512, 1,024, or 2,048 data-points, as described in previous studies [19]. Every training spectrum and diagnosis was validated by the INTERPRET Clinical Data Validation Committee (CDVC) and expert spectroscopists [8].

The test data were provided by eight international institutions in the framework of eTUMOUR [29]. Cases were included if their SV short TE (STEAM 20 ms, PRESS 30–32 ms) MRS signal at 1.5 T had been validated by the expert spectroscopist of eTUMOUR and the original histopathology was available before 28 February 2007. As a result, 97 cases from eTUMOUR were considered for testing in this study. The test cases used to evaluate the performance of the classifiers were acquired in partly different hospitals at later dates than the training cases, using instruments of the three main manufacturers. Table 2 shows that the percentages of cases by manufacturer included in the test data are similar to the percentages in the training data. Table 3 shows the percentage of cases by center included in the training and test datasets. Forty percent of the training cases belong to one center that did not provide test data afterwards. Besides, 35% of the test cases belong to three new centers that were not providers of training data.

Table 2 Breakdown of cases per manufacturer included in the training (INTERPRET) and test (eTUMOUR) datasets
Table 3 Percentage of cases per acquisition center included in the training (INTERPRET) and test (eTUMOUR) datasets

Pre-processing

Each signal was pre-processed according to the INTERPRET protocol. A fully automatic pre-processing pipeline was available for the training data. Besides, a semi-automatic pipeline was defined for some new file formats of the test cases from GE and Siemens manufacturers. The semi-automatic pipeline was designed to ensure compatibility of its output with the automatic one.

Automatic pipeline

The steps of the automatic pre-processing pipeline were: (1) Eddy current correction was applied to the water-suppressed free induction decay (FID) of each case using the Klose algorithm [32]. (2) The residual water resonance was removed using Hankel-Lanczos singular value decomposition (HLSVD) time-domain selective filtering with ten singular values and a water region of [4.33, 5.07] ppm. (3) An apodization with a Lorentzian function of 1 Hz of damping was applied. (4) Before transforming the signal to the frequency domain using the fast Fourier transform (FFT), an interpolation was needed in order to increase the frequency resolution of the low-resolution spectra to the maximum frequency resolution used in the acquisition protocols (see [8] for details on the acquisition conditions and resolutions). This was carried out with the zero-filling procedure. (5) Afterwards, the baseline offset, which was estimated as the mean value of the region [11, 9] ∪ [−2, −1] ppm, was subtracted from the spectrum. (6) The normalization of the spectral data vector to the L2-norm was performed based on the data-points in the region [−2.7, 4.33] ∪ [5.07, 7.1] ppm. (7) Depending on the signal-to-noise ratio (SNR) and the tumor pattern, an additional frequency alignment check of the spectrum was performed by referencing the ppm-axis to (in order of priority) the total Cr at 3.03 ppm, the Cho-containing compounds at 3.21 ppm, or the ML at 1.29 ppm. (8) Finally, the region of interest was restricted to [0.5, 4.1] ppm, where, after the pre-processing filters, the resonances of the main metabolites arise and the contribution of the residual water is expected to be minimal; this yields a vector of 190 points for each spectrum. In summary, 211 INTERPRET cases and 47 cases of the eTUMOUR test dataset (32 from Philips and 15 from GE) were pre-processed with the automatic pipeline.
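
As an illustration only, steps 5, 6 and 8 of the pipeline above can be sketched in a few lines of Python. This is a minimal sketch and not the INTERPRET software: the function names, the toy 0.1 ppm grid and the synthetic spectrum are assumptions made for the example, whereas the ppm regions are the ones quoted in the text. The real spectra have a much finer grid, which gives 190 points in the region of interest.

```python
import math


def in_any(ppm, intervals):
    """Return True if ppm lies inside any closed (low, high) interval."""
    return any(low <= ppm <= high for low, high in intervals)


def subtract_baseline_offset(ppm_axis, spectrum,
                             regions=((9.0, 11.0), (-2.0, -1.0))):
    """Step 5: subtract the mean value of the signal-free regions."""
    values = [s for p, s in zip(ppm_axis, spectrum) if in_any(p, regions)]
    offset = sum(values) / len(values)
    return [s - offset for s in spectrum]


def l2_normalize(ppm_axis, spectrum,
                 regions=((-2.7, 4.33), (5.07, 7.1))):
    """Step 6: divide by the L2-norm of the points inside the regions."""
    norm = math.sqrt(sum(s * s for p, s in zip(ppm_axis, spectrum)
                         if in_any(p, regions)))
    return [s / norm for s in spectrum]


def restrict_roi(ppm_axis, spectrum, low=0.5, high=4.1):
    """Step 8: keep only the points with low <= ppm <= high."""
    kept = [(p, s) for p, s in zip(ppm_axis, spectrum) if low <= p <= high]
    return [p for p, _ in kept], [s for _, s in kept]


# Toy spectrum (hypothetical): 0.1 ppm grid from 11.0 down to -2.0 ppm,
# a constant offset of 2.0 and a single peak of height 10 at 3.0 ppm.
ppm_axis = [(110 - i) / 10 for i in range(131)]
raw = [2.0 + (10.0 if p == 3.0 else 0.0) for p in ppm_axis]

centred = subtract_baseline_offset(ppm_axis, raw)
normalized = l2_normalize(ppm_axis, centred)
roi_ppm, roi_spectrum = restrict_roi(ppm_axis, normalized)
```

The water region [4.33, 5.07] ppm is left out of the normalization regions, so residual water does not influence the scaling of the metabolite region.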

Semi-automatic pipeline

Due to limitations of the automatic pre-processing software, 50 test samples were pre-processed by a semi-automatic pipeline that was partially based on the java magnetic resonance user interface (jMRUI) [33]. The semi-automatic pipeline differed from the automatic pipeline in the following steps: (1) The phase of the water-suppressed FID was mainly corrected with the reference water. Additional manual zero-order and first-order phase correction was performed when needed. (2) Residual water was removed by means of the jMRUI implementation of the Hankel singular value decomposition (HSVD) algorithm [34]. The filter was parametrized as in the automatic pipeline. Steps 3–8 remained equivalent to the automatic pre-processing. As a result, a pre-processing pipeline based on different software implementations but compatible with the automatic one was set up, and comparable signals for testing the PR models were obtained.

Feature extraction

Several feature extraction methods based on PR were applied to the real part of the spectra prior to any classification approach. These methods included direct spectral peak integration (PI) on selected metabolite resonance regions [35], peak height of typical resonances (PPM) [36], principal component analysis (PCA) [37,38], independent component analysis (ICA) [39,40], and wavelet transform (WAV) [41,42]. Finally, some classification approaches were applied to the full region of interest, represented by a data vector of 190 points (190). The selected features for the classifiers were derived from previous studies [10,30] or from model validation based on the training dataset. In some approaches, standard normal variate (SNV) scaling was applied to the obtained features. The wavelet basis used in the experiments was coiflet 3 with nine levels [41]. Further information and experimental details about the methods used can be found in “Appendix A” of the on-line Supplementary Material.
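
To make two of these operations concrete, the following minimal sketch shows peak integration over ppm windows followed by SNV scaling. It is an illustration under stated assumptions, not the configuration used in the study: the window half-width of 0.08 ppm, the choice of four peaks, the synthetic Lorentzian spectrum and the use of the sample standard deviation in SNV are all hypothetical.

```python
import math


def peak_integration(ppm_axis, spectrum, regions):
    """Integrate the spectrum over each (low, high) ppm window.

    Uses a simple rectangle rule on a uniformly spaced ppm axis.
    """
    step = abs(ppm_axis[1] - ppm_axis[0])
    return [step * sum(s for p, s in zip(ppm_axis, spectrum)
                       if low <= p <= high)
            for low, high in regions]


def snv(features):
    """Standard normal variate: centre and scale one feature vector
    by its own mean and (sample) standard deviation."""
    mean = sum(features) / len(features)
    sd = math.sqrt(sum((f - mean) ** 2 for f in features)
                   / (len(features) - 1))
    return [(f - mean) / sd for f in features]


def lorentzian(p, centre, height, width=0.03):
    """Lorentzian line shape used only to build the toy spectrum."""
    return height / (1.0 + ((p - centre) / width) ** 2)


# Toy spectrum on the [0.5, 4.1] ppm region (0.01 ppm grid, hypothetical).
ppm_axis = [(410 - i) / 100 for i in range(361)]
# (centre in ppm, height): Cho, Cr, NAA and a lipid peak; heights are made up.
peaks = [(3.21, 3.0), (3.02, 2.0), (2.01, 1.0), (1.29, 0.5)]
spectrum = [sum(lorentzian(p, c, h) for c, h in peaks) for p in ppm_axis]

# Hypothetical integration windows of +/- 0.08 ppm around each peak.
regions = [(c - 0.08, c + 0.08) for c, _ in peaks]
pi_features = peak_integration(ppm_axis, spectrum, regions)
scaled_features = snv(pi_features)
```

Peak integration reduces each 190-point spectrum to a handful of values tied to known resonances, which keeps the features interpretable in clinical terms.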

Classification methods

Ten methods were applied to address the pairwise classifications. These methods included parametric discriminant analysis [43]: linear discriminant analysis (LDA), Fisher’s rank-reduced version of LDA (FLDA) [44], quadratic discriminant analysis (QDA), linear discriminant analysis with diagonal covariance matrix (dLDA), and quadratic discriminant analysis with diagonal covariance matrix (dQDA). Kernel-based models (support vector machines (SVM) [45] and least-squares support vector machines (LS-SVM) [46]) were also applied. Additionally, artificial neural networks (multilayer perceptron (MLP) [47] and bi-directional Kohonen networks (BDK) [27,48]) and single and ensemble [49] classifiers using K-nearest neighbours with local feature reduction by PCA (PCA-KNN) [50,51] were used.
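
As an example of the simplest of these models, a textbook version of LDA with a diagonal covariance matrix (dLDA) can be sketched as follows. This is a generic formulation and not the implementation used in the study; the function names, the class labels and the six two-feature training vectors are hypothetical.

```python
import math


def fit_dlda(X, y):
    """Fit dLDA: class means, class priors and one pooled variance
    per feature (diagonal covariance shared by all classes)."""
    classes = sorted(set(y))
    n, d = len(X), len(X[0])
    means, priors = {}, {}
    pooled = [0.0] * d
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / n
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
        for x in rows:
            for j in range(d):
                pooled[j] += (x[j] - means[c][j]) ** 2
    variances = [v / (n - len(classes)) for v in pooled]
    return {"means": means, "priors": priors, "variances": variances}


def predict_dlda(model, x):
    """Assign x to the class with the highest linear discriminant score."""
    def score(c):
        dist = sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, model["means"][c],
                                         model["variances"]))
        return -0.5 * dist + math.log(model["priors"][c])
    return max(model["means"], key=score)


# Hypothetical two-feature training set for one pairwise problem.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
           [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y_train = ["GBM", "GBM", "GBM", "MEN", "MEN", "MEN"]

model = fit_dlda(X_train, y_train)
label_1 = predict_dlda(model, [1.1, 2.0])
label_2 = predict_dlda(model, [3.1, 0.4])
```

Restricting the covariance matrix to its diagonal reduces the number of parameters to estimate, which matters when the number of training cases per class is small compared with the number of features.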

Bayesian strategies for regularization were also applied in some of the classifiers based on LS-SVM [52] and MLP [53]. Further information about these methods can be found in “Appendix B” of the on-line Supplementary Material.

A measure to evaluate unbalanced classifiers: the balanced error rate (BER)

The performance was measured by means of the error rate (ERR) and the balanced error rate (BER). In a binary classifier A versus B, BER is the average of the error rates on the A and B classes [54]. Let $n_A$ be the number of cases of class A and $e_A$ the number of misclassified cases of that class. Let $n_B$ be the number of cases of class B and $e_B$ the number of misclassified cases of that class. While the ERR is defined as $(e_A + e_B)/(n_A + n_B)$, the BER is defined as $\frac{1}{2}\left(e_A/n_A + e_B/n_B\right)$. BER is useful when one class is underrepresented compared to the other class, e.g. GBM versus LGG and GBM versus MET in the INTERPRET dataset and MEN versus GBM and MEN versus MET in the eTUMOUR dataset.
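
The difference between the two measures is easy to see on an unbalanced example. The sketch below implements both definitions; the function name and the case counts are hypothetical and are not taken from the study datasets.

```python
def error_rates(e_a, n_a, e_b, n_b):
    """Return (ERR, BER) for a binary problem A versus B.

    e_a, e_b: misclassified cases of class A and class B.
    n_a, n_b: total cases of class A and class B.
    """
    err = (e_a + e_b) / (n_a + n_b)
    ber = 0.5 * (e_a / n_a + e_b / n_b)
    return err, ber


# Hypothetical unbalanced problem: 90 cases of A (none misclassified)
# and 10 cases of B (half of them misclassified).
err, ber = error_rates(e_a=0, n_a=90, e_b=5, n_b=10)
```

With these counts the ERR is 5/100 = 0.05, whereas the BER is (0 + 0.5)/2 = 0.25, so the BER exposes the poor performance on the minority class that the ERR hides.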

Results and discussion

For each task, different combinations of feature extraction and classification methods were applied in the study. An estimation of the ERR and BER for the INTERPRET dataset using a tenfold cross validation (CV) was carried out for each model. Afterwards, the estimations of the ERR and BER were obtained on the independent test (IT) dataset of eTUMOUR. Table 4 shows the results of the best pairwise classifiers based on the IT estimations. A detailed list of the results is available in Sect. 1 of the on-line Supplementary Material.
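
The tenfold CV partition can be sketched as below. The text does not state whether the folds were stratified by class, so the stratification here is an assumption, as are the function name, the random seed and the hypothetical class sizes.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split case indices into k folds, spreading each class evenly.

    Stratification is an assumption of this sketch; the study only
    states that a tenfold cross validation was used.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[(offset + pos) % k].append(i)
        offset += len(indices)
    return folds


# Hypothetical pairwise training set: 30 cases of one class, 20 of the other.
labels = ["GBM"] * 30 + ["MEN"] * 20
folds = stratified_folds(labels, k=10)

splits = []
for held_out in folds:
    held = set(held_out)
    train_idx = [i for i in range(len(labels)) if i not in held]
    # A model would be fitted on train_idx and its errors counted on
    # held_out; the error counts of the ten folds give ERR and BER (CV).
    splits.append((train_idx, held_out))
```

After the CV estimate, the model is refitted on all training cases and evaluated once on the independent test set, which is never used for fitting or model selection.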

Table 4 Best results obtained for the six pairwise classification problems

The classification problems

Most of the discrimination problems among the four classes were solved with high accuracy in the eTUMOUR dataset. Table 4 shows that most of the best classifiers among GBM, MEN, MET, and LGG achieved an accuracy (1 − ERR) of around 90%. Decision-support methodologies with this level of accuracy may be useful for incorporation in integrated CDSSs for clinical purposes. In contrast, for GBM versus MET, the best result was an accuracy of 78% on the independent test, which is far from the accuracy obtained for the other discrimination problems. The glioblastoma versus metastasis discrimination by means of MRS is difficult with the use of SV spectroscopy alone [7,8,55–58]. Other approaches, such as MRSI coupled with magnetic resonance imaging (MRI) or the acquisition of an additional voxel adjacent to the brain mass, should provide relevant additional information for distinguishing between these two types of tumors [57–59].

Figure 1 shows the box-whisker plot of the performance (BER based on IT) for each problem based on the detailed list of the results (Sect. 1 of the on-line Supplementary Material). Note the high deviation of the distribution for GBM versus MET with respect to the others. In a multiple comparison at a 0.05 α-level based on Tukey’s honestly significant difference criterion for the Kruskal-Wallis nonparametric one-way analysis of variance [60], each problem had a mean rank that was significantly different from that of the GBM versus MET problem. The distributions of the other five discrimination problems overlapped. Nevertheless, the smallest non-outlier observation of the GBM versus LGG problem was higher than the smallest non-outlier observation of the remaining problems. This may indicate that the GBM versus LGG discrimination is more difficult to solve by SV short TE MRS than the other four discrimination problems.
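
For readers unfamiliar with the rank-based analysis, the Kruskal-Wallis H statistic on which this comparison is built can be computed as in the following sketch. It covers only the omnibus statistic (with the usual tie correction), not the subsequent Tukey multiple comparison of mean ranks, and the three groups of values are hypothetical rather than the BER values of the study.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: list of lists of observations (e.g. BER values per problem).
    """
    pooled = sorted((v, g) for g, values in enumerate(groups)
                    for v in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied observations at positions i..j-1 share the average of
        # the ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    h = (12 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    return h / (1 - tie_term / (n ** 3 - n))


# Hypothetical, fully separated groups of observations.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Under the null hypothesis, H approximately follows a chi-squared distribution with (number of groups − 1) degrees of freedom, so large values indicate that at least one group has a different mean rank.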

Fig. 1 Box-whisker plots of the performance for each problem in the eTUMOUR dataset (based on the detailed list of results included in Sect. 1 of the on-line Supplementary Material). Performance is measured in BER. The box indicates the region between the lower ($X_{0.25}$) and the upper ($X_{0.75}$) quartiles. The horizontal line inside the box indicates the median of the distribution, and the vertical lines (the “whiskers”) extend to at most 1.5 times the box width. Any outlier of the distribution is displayed with a cross

The different approaches obtained good results for the discrimination of the GBM and MEN classes. A multilayer perceptron with the full spectra achieved a BER of 0.09. The mode of the distribution of BER was below 0.20 for the GBM versus MEN problem.

The difficulty of the GBM versus MET discrimination was clearly observed in both the CV and the IT estimations (see Fig. 2). In the distribution of the IT results for this problem, the BER mode was 0.5, and the main distribution of the results ranged from 0.4 to 0.55. Some methods achieved a BER of 0.2; nevertheless, the main mass of the distribution was far from this value, which makes it difficult to ensure the reproducibility of these performances. These results agree with those already published in previous studies [8,10]. This is most probably due to the similar necrotic profile (high lipid peaks mask the rest of the metabolic information) of the metastasis cases and of most of the glioblastoma cases.

Fig. 2 Scatter plot of the performance measured in BER estimated by the IT set consisting of new eTUMOUR cases and the BER estimated by the CV using the INTERPRET cases. BER(IT) = BER(CV) is represented by the solid blue line, and the trend of the (BER(CV) < 0.2, BER(IT) < 0.3) region is indicated by the black dashed line

The mode of the BER for the GBM versus LGG problem was 0.2. Nevertheless, there was a set of regularized classifiers that obtained a BER of around 0.09. To be more precise, the best BER corresponded to the Bayesian framework for LS-SVM using peak integration (PI) values. Devos et al. [10] obtained comparable performances for this problem using LDA and standard LS-SVMs. In studies [25,61], statistically significant differences between GBM and LGG and between GBM and astrocytoma grade III were also found for different metabolite ratios with respect to Cr and/or water. In long TE, Menze et al. [13] observed a better performance with regularized methods than with the standard ones when classifying normal, non-progressive tumors (with radiation injury and stable disease) and brain tumors.

As expected, our results confirm that MEN can be easily discriminated from MET no matter what method is used. Most of the BER probability mass of the results was in the interval from 0.1 to 0.2. The best result achieved a BER of 0.07, which was based on PCA and a neural network with Bayesian regularization. These results are consistent with [10].

LS-SVM and LDA with different feature extraction methods achieved BERs of 0.08 and 0.11 for the meningioma versus low-grade glial problem. Most of the results for this problem were in the interval from 0.15 to 0.25, and the mode of the distribution was under 0.2. The low error in MEN versus LGG was also predicted by the CV results on the INTERPRET data. This result is consistent with the performances reported by Tate et al. [7] on a three-class discrimination problem: MEN versus astrocytomas grade II (A2) versus aggressive tumors (AGG) (which is composed of GBM and MET). In that study, the confusion submatrix of MEN versus A2 indicates no misclassifications between them. Identical results were obtained by Tate et al. [8] when extending the three-class classifier to MEN versus LGG versus AGG.

The distribution of BER for MET versus LGG had a clear trend towards the lower values (BER of 0.1), showing good performance for all the methods studied in this problem. PI combined with LDA, FLDA, MLP, or LS-SVM classification methods obtained the best performance for the IT set. The CV estimations of the errors also indicated good performance by the classifiers. These results are also consistent with [10].

The pre-processing techniques

Eight out of 50 semi-automatically preprocessed test cases were misclassified at least once by the pairwise BDK classifiers (GBM versus MET excluded). Also, 10 out of 47 of the automatically preprocessed test cases were misclassified at least once by the same classifiers. Based on these rates, no differences were observed in the classification of automatic and semi-automatic pre-processed signals. The semi-automatic pre-processing pipeline applied to the larger part of the test dataset was consistent with the automatic pipeline applied on the training set. This is an important practical conclusion because it suggests the compatibility of different pre-processing software tools, either in an automatic or a semi-automatic fashion for automatic classification in CDSSs.
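
The two rates quoted above are 8/50 = 16% and 10/47 ≈ 21%. The text compares them only descriptively; as an optional check that is not part of the original analysis, a two-sided Fisher exact test on the corresponding 2×2 table can be sketched as follows. The function name is hypothetical and only the four counts come from the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Rows: semi-automatic and automatic pipelines.
# Columns: misclassified at least once, never misclassified.
rate_semi = 8 / 50
rate_auto = 10 / 47
p_value = fisher_exact_two_sided(8, 42, 10, 37)
```

A p-value well above 0.05 would be consistent with the statement that the two pipelines show no difference in misclassification rate, although with fewer than 100 cases the test has limited power.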

The feature extraction methods

All the feature extraction methods applied in this study were based on PR. Therefore, we could not make any comparison between PR and metabolite quantification approaches. Approaches that take advantage of the combination of different TE [25,26,30,62–64] were not considered in order to ensure that results could be compared with previous analyses of this type of data [7,8,10,12,27,28,65–67]. Furthermore, although a feature extraction evaluation is not the aim of the present study and the setup of this study is not designed specifically for it, some effects of the different feature extraction methods are reported.

Figure 3 shows the box-whisker plot of the performance (BER) for each feature extraction (FE) method. GBM versus MET classifiers are not included because of their large difference in performance with respect to the other classification problems. The distributions of the results for all FE methods overlap, and no statistical differences were observed. Nevertheless, a noteworthy fact is the trend toward low BER values of the peak integration method compared to other methods. The study of Devos et al. [10] on the same four classes obtained similar performances when comparing the full region of interest, peak regions and PI. In [12], Simonetti et al. compared PCA, ICA, LCModel [67] and PI for feature extraction on short TE MRSI data, and they also obtained the best results with PI. In a single-center study, Opstad et al. [28] reported that the LCModel quantification obtained better results than PCA for two-step LDA classification. In long TE spectra, Lukas et al. [11] observed a better performance using the full region of interest rather than using PI or peak region extraction. Finally, Menze et al. [13] and Luts et al. [68] obtained an improvement when PR approaches (e.g. ICA, PCA, binned peak region and WAV) were used in short or long TE instead of quantification approaches.

Fig. 3 Box-whisker plots of the performance for each feature extraction method in the eTUMOUR dataset. Performance is measured in BER and the box-whisker characteristics are the same as in Fig. 1

The classification methods

The diversity of methods used for classification is broad enough to give a good overview of the effect that this selection has on the performance of the classifiers. Figure 4 shows the box-whisker plot of the performance (BER) for each classification method. As in the analysis of FE methods, GBM versus MET classifiers are not included in the distributions because of their large differences in performance with respect to the other classification problems. As observed in Fig. 4, the distributions overlap, but in general, lower BER values were obtained using a BDK. In [27], BDK was applied to PI values to discriminate between tumor grades and other tissues in the INTERPRET multi-voxel dataset. The study of Devos et al. [10] observed similar performances of their LDA and LS-SVM classifiers based on PI and evaluated by the area under the ROC curves. Tate et al. [7,8] based their three-class classifiers on LDA due to the ability of this method to project the results in a two-dimensional space for visualization. Note that FLDA shows similar results to the other methods on average; however, other methods like LS-SVM and BDK might be preferable for some discrimination problems (e.g. GBM vs. LGG).

Fig. 4 Box-whisker plots of the performance for each classification method in the eTUMOUR dataset. Performance is measured in BER and the box-whisker characteristics are the same as in Fig. 1

Finally, in Fig. 2, we summarize and compare the BER estimation obtained by the CV for the INTERPRET training dataset and by the IT consisting of the new eTUMOUR cases. Most of the results are in the (BER(CV) < 0.2, BER(IT) < 0.3) region, except for the GBM versus MET problem, which had a sparse distribution. The general trend in this region is indicated by the black dashed line. This indicates an underestimation of the BER by the CV evaluation. Such underestimation is typically observed in PR challenges [54], and it is usually produced by the overfitting of the models to the training dataset and the estimation of the error with samples that are not fully independent [69]. A noteworthy feature of our study is the evaluation of the predictive models using the new, subsequently acquired multicenter test set, which ensures the independence of the training and test sets. The GBM versus MET results are scattered in regions of larger error. For this problem, some overestimations of the CV error are also observed. This may show the difficulty of the problem and the randomness in the results. The results obtained for the rest of the discrimination problems confirm the expected behaviour of the predictive models.

Use of the study for automatic validation of MRS entries in brain tumour datasets

An intuitive method to compare datasets of signals is the visual inspection of their prototypical patterns. Figure 5 shows plots of the unimodal prototypes of the short TE spectra for the four tumour groups of the training and test datasets. Each prototype is represented by the unsmoothed mean function and the mean function ± the standard deviation function. The view is zoomed in on the [0.5, 4.1] ppm region used in our experiments. The observed resonances correspond to the main compounds reported in the “Introduction”. In general, the training and test prototype patterns for GBM, MET and LGG are close to each other, whereas the MEN prototype differs visually more. This may be because of a higher standard deviation in the test dataset around the 3.21 ppm peak with respect to the training dataset. Besides, the variation around 2.2 ppm is higher in the test-set mean than in the training one.
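
Such a prototype is a point-wise summary of the spectra of one class. A minimal sketch is given below; the function name and the two three-point toy spectra are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def prototype(spectra):
    """Point-wise mean and mean +/- SD of equally long spectra.

    spectra: list of spectra of one class, all on the same ppm axis.
    Returns (mean_fn, lower_fn, upper_fn).
    """
    columns = list(zip(*spectra))
    mean_fn = [mean(col) for col in columns]
    sd_fn = [stdev(col) for col in columns]  # sample SD (assumption)
    lower_fn = [m - s for m, s in zip(mean_fn, sd_fn)]
    upper_fn = [m + s for m, s in zip(mean_fn, sd_fn)]
    return mean_fn, lower_fn, upper_fn


# Two toy spectra of three points each (hypothetical values).
mean_fn, lower_fn, upper_fn = prototype([[1.0, 2.0, 3.0],
                                         [3.0, 2.0, 1.0]])
```

Plotting the three curves of the training set next to those of the test set, class by class, gives the kind of comparison shown in Fig. 5.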

Fig. 5 Unimodal prototypes of the short TE spectra for the four tumour groups of the training and test datasets. Each prototype is represented by the unsmoothed mean function and the mean ± SD function. The view is zoomed in on the [0.5, 4.1] ppm region used in our experiments

A practical result of this study is that cases that are repeatedly misclassified by the different techniques can be flagged as candidates for review for possible problems in voxel positioning, acquisition artifacts, normal-tissue contamination, or limitations in the classification methodology (e.g. patterns replicated in non-tumoral diseases, atypical MRS patterns and underrepresented tumor subtypes). In this way, even in the absence of biopsy, PR techniques can contribute to the automatic validation of cases, assisting the specialists in the detection of potential sources of error in the biomedical data acquired from patients.
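
The flagging idea can be sketched as a simple count of how often each case is misclassified across classifiers. This is an illustration only: the function name, the 50% threshold, the classifier names, the case identifiers and the predictions are all hypothetical, and the study does not define a specific flagging rule.

```python
from collections import Counter


def flag_cases(predictions, truth, min_fraction=0.5):
    """Return the case ids misclassified by at least min_fraction of
    the classifiers that evaluated them.

    predictions: {classifier_name: {case_id: predicted_label}}
    truth:       {case_id: validated_label}
    """
    errors, evaluated = Counter(), Counter()
    for case_predictions in predictions.values():
        for case_id, label in case_predictions.items():
            evaluated[case_id] += 1
            if label != truth[case_id]:
                errors[case_id] += 1
    return sorted(case_id for case_id in evaluated
                  if errors[case_id] / evaluated[case_id] >= min_fraction)


# Hypothetical labels and predictions for three cases.
truth = {"c1": "GBM", "c2": "LGG", "c3": "MEN"}
predictions = {
    "clf_a": {"c1": "GBM", "c2": "MET", "c3": "MEN"},
    "clf_b": {"c1": "GBM", "c2": "GBM", "c3": "MEN"},
    "clf_c": {"c1": "MEN", "c2": "MET", "c3": "MEN"},
}
flagged = flag_cases(predictions, truth)
```

A flagged case is not necessarily wrong; it is a candidate for expert review of the voxel position, the spectrum quality and the histopathological diagnosis.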

Figures 6 and 7 show some misclassified eTUMOUR cases that may be interesting to review. The eTUMOUR case et2274 was diagnosed by the original pathologist as oligodendroglioma 9450/3 (grade II, WHO), although a comment added to the free-text section of the eTUMOUR database (eTDB) referred to the presence of areas of anaplastic oligodendroglioma (grade III, WHO). Still, the final diagnosis proposed was grade II oligodendroglioma. The voxel allocation was carried out following the eTUMOUR acquisition protocol. The ML pattern is uncommon, as the high 0.9 and 1.3 ppm resonances show. The disappearance of these resonances at long TE (136 ms) rules out a significant necrotic contribution (results not shown, but see [30]). This pattern has been observed before [30], for example in the INTERPRET cases I0450 (oligoastrocytoma) and I0179 (oligodendroglioma), which are also misplaced in the short TE latent space of the INTERPRET decision-support system (DSS) 2.0 (http://azizu.uab.es/INTERPRET). In summary, et2274 seems to behave as a class outlier, and its consistent misclassification in our analysis may reflect precisely that. The eTUMOUR case et2206 was originally diagnosed as oligoastrocytoma 9382/3 (grade II, WHO), but there were some discrepancies regarding the glial subtype in the validation done by the pathological committee. It was misclassified by every MET versus LGG classifier, and also by some GBM versus LGG and MEN versus LGG classifiers. Its ML pattern at short TE is also uncommon, with relatively large 0.9, 1.3 and 2.8 ppm peaks that are reduced at long TE (results not shown), which also suggests a non-necrotic origin. The eTUMOUR case et2349 is a GBM without clearly visible ML, which was misclassified in every classification problem. The review by the experts did not indicate problems in the location of the voxel, which was mainly positioned in the highly cellular part of the tumour. The eTUMOUR case et2197 is a MET with a possible contribution to the MRS pattern from normal brain parenchyma, as can be deduced from the relative difference in size between the voxel used for acquisition and the small brain lesion. Its pattern shows similar Cho and Cr peak heights and relatively high NAA (at 2 ppm). However, the simultaneous appearance of high Lac/ML at 1.3 ppm suggests abnormality. Nonetheless, it is clearly an uncommon spectral pattern for a MET.

Fig. 6 Potential outliers (1/2) detected as a consequence of this study. Case numbering corresponds to eTUMOUR database (http://www.etumour.net) entries. For each case, the reference image and voxel location are shown on the left, and the region of interest of the real part of the short TE spectrum is shown on the right. For an easier visualization of the spectrum, vertical dashed lines indicate the position of the main resonances: Cho (3.21 ppm), Cr (3.02 ppm), NAA (2.01 ppm), L1 (1.29 ppm), L2 (0.92 ppm)

Fig. 7 Potential outliers (2/2) detected as a consequence of this study. Figure characteristics are the same as in Fig. 6

Conclusions

This study describes a multiproject-multicenter evaluation of automated brain tumor classifiers using single-voxel short TE MR spectra. To our knowledge, there is no previous work that evaluates predictive models trained with data acquired from a multicenter project using a new independent test set subsequently acquired from partly different centers. Classifiers were trained with cases acquired by six centers during the 2000–2002 period. They were tested with later cases acquired by eight institutions during the 2004–2007 period. This strategy provides a view that is close to a real environment where similar classifiers, integrated in a clinical decision-support system (CDSS), may be used in multiple hospitals to assist in the diagnosis of new cases.

Our major conclusion is that accurate classification of those new cases is feasible using data acquired in different hospitals and with different instrumentation, but with similar acquisition protocols. Specifically, in our experiments, classifiers developed from the INTERPRET dataset seem to be robust enough for predictive classification of prospective cases from eTUMOUR.

The pairwise discrimination between glioblastoma, meningioma, metastasis, and low-grade glial achieved accuracies of around 90%. However, the discrimination of glioblastoma and metastasis did not achieve a result better than 78% accuracy. Our results consolidate the conclusions of previous studies on automatic brain tumor classification using MRS, but with multiproject-multicenter data for training and subsequent testing.

A well-defined protocol for the acquisition of MRS (e.g. spectral parameters and voxel localization), and the application of quality controls to MRS spectra should allow the reproducibility of such classification rules and the successful use of decision-support systems (DSSs) in clinical environments.

The methodology provided in the present study may also be of use as “automatic flaggers” to help in the quality control of cases during the eTUMOUR multicenter project and beyond. The approach used in this work could be of use for pediatric brain tumour related studies [70] aimed at providing predictive information to pediatric neurosurgeons.

Hence, the conclusions obtained in this study are directly applicable to several of the tasks associated with the development of a CDSS for brain tumor diagnosis and prognosis and its deployment in clinical environments.
