Multiparametric MRI-based differentiation of WHO grade II/III glioma and WHO grade IV glioblastoma

Non-invasive, imaging-based examination of glioma biology has received increasing attention in the past couple of years. To this end, the development and refinement of novel MRI techniques, reflecting underlying oncogenic processes such as hypoxia or angiogenesis, has greatly benefitted this research area. We have recently established a novel BOLD (blood oxygenation level dependent) based MRI method for the measurement of relative oxygen extraction fraction (rOEF) in glioma patients. In a set of 37 patients with newly diagnosed glioma, we assessed the performance of a machine learning model based on multiple MRI modalities including rOEF and perfusion imaging to predict WHO grade. An oblique random forest machine learning classifier using the entire feature vector as input yielded a five-fold cross-validated area under the curve of 0.944, with 34/37 patients correctly classified (accuracy 91.8%). The most important features in this classifier as per bootstrapped feature importance scores consisted of standard deviation of T1-weighted contrast enhanced signal, maximum rOEF value and cerebral blood volume (CBV) standard deviation. This study suggests that multimodal MRI information reflects underlying tumor biology, which is non-invasively detectable through integrative data analysis, and thus highlights the potential of such integrative approaches in the field of radiogenomics.

We have recently established a novel BOLD (blood oxygenation level dependent) based MRI method for the measurement of vascular deoxygenation, i.e. relative oxygen extraction fraction (rOEF), in glioma patients. In a subgroup of patients, this non-invasive measure of hypoxia also correlated reasonably well with hypoxia-inducible factor 1α (HIF1α ) immunohistochemistry stainings 11 . Hypoxia has long been known to play a central role in gliomagenesis and is related to a more malignant tumor phenotype and therapy resistance 12 . Accordingly, several studies have shown that HIF1α protein expression increases with WHO grade 13 .
In order to evaluate the large feature vectors resulting from integrative analysis of image features and their association with underlying biology, machine-learning algorithms such as random forests and support vector machines clearly outperform traditional approaches such as correlation analysis and are increasingly being used in fields such as genomics 14 and imaging analysis 15 .
We hypothesized that integrative analysis of multimodal MRI data, considering anatomic sequences (T1 and T2 weighted) as well as physiological imaging (DSC perfusion and hypoxia imaging), reflects the underlying glioma biology and hence allows for its non-invasive detection.

Results
Patient characteristics. In total, this study included 37 patients (24 male, 13 female; mean age 58 years), 27 of which had a glioblastoma (73%; WHO grade IV), five an anaplastic glioma (13.5%; WHO grade III) and five a diffuse glioma (13.5%; WHO grade II). Of the grade II and III glioma, seven were diagnosed with an astrocytoma, two with a mixed oligoastrocytoma and one patient with a pure oligodendroglioma. Isocitrate dehydrogenase (IDH) mutation status was available for 35/37 patients: 31 patients were IDH wild type, while four patients (two WHO grade II glioma, two WHO grade III glioma) carried a mutant IDH allele.
Multimodal image feature analysis. We performed multimodal MRI on all 37 patients, including rOEF as a marker for hypoxia and perfusion imaging (Fig. 1). For image analysis, volumes of interest (VOI) were manually defined by CP and summary and volume statistics were extracted for these VOIs in all sequences, resulting in a total of 116 features (see methods for details).
Based on recent work on the (epi)genetic basis of gliomas [16][17][18] which indicate that WHO grade II and III gliomas can be rather subdivided by molecular status than WHO grade, we grouped these tumors as opposed to WHO grade IV glioblastoma.
We trained an oblique random forest with 300 trees, using logistic regression as the node model. To account for the bias in prediction accuracy, we performed a five-fold cross-validation to estimate classifier performance. The resulting model had an area under the curve of 0.944, with 34 of 37 patients correctly classified ( Fig. 2A; sensitivity 0.8889, specificity 1, positive predictive value 1, negative predictive value 0.7692). In this analysis, three patients with a glioblastoma got misclassified as grade II/III glioma. Of these, all three were IDH wild type and also otherwise showed no peculiarities. To investigate the features most important for this classification, we calculated the mean importance score from 1000 bootstrap iterations, using the same model parameters as above. Figure 2B displays the z-transformed mean importance scores for all 116 features. Of these, three features had a z-transformed importance score > 1.96: Standard deviation of the contrast-enhanced (ce) T1w signal and rCBV in FLAIR-hyperintense tumor as well as maximum rOEF signal in the high rOEF VOI, which generally broadly overlapped with edema (Fig. 3). For all these features, values were higher in glioblastoma compared to WHO grade II/III glioma. Table 1 summarizes these features.
In our cohort, only four of 37 patients carried a mutant IDH allele. This low frequency prohibited training a prediction model for IDH mutational status in our cohort.

Discussion
Novel MR imaging sequences reflecting underlying oncogenic processes have greatly advanced the field of radiogenomics. Here, we explored the ability of a machine-learning classifier based on an extensive multimodal MR imaging data set to distinguish between WHO grade II/III glioma and WHO grade IV glioblastoma.
For this study, we grouped WHO grade II and III gliomas as opposed to WHO grade IV glioblastomas. In the current fourth edition of the WHO classification 19 , signs of anaplasia and mitotic activity distinguish WHO grade II and III glioma. However, both criteria are rather subjective and prone to relevant inter-observer variability 20,21 , while the diagnosis of WHO grade IV glioblastoma, requiring the presence of necrosis and neoangiogenesis, is more reliable. Furthermore, recent advances in our understanding of the complex (epi)genetic basis of gliomas have led to the identification of several molecular subtypes associated with biology and prognosis [16][17][18] , and have shown that outcome differences between WHO grade II and III glioma more rely on the distribution of these molecular subgroups than on biological differences between WHO grade II and III glioma per se 22 , suggesting that grade II and III glioma may in fact be a single entity subdivided by molecular parameters. Glioblastoma, on the other hand, still have a worse prognosis than WHO grade II and III glioma, even when accounting for molecular parameters 17 . Based on these considerations and in accordance with large cohorts such as the cancer genome archive (TCGA), which also group WHO grade II and III glioma as "lower grade glioma" as opposed to grade IV glioblastoma, we also grouped our samples accordingly.
We identified three imaging features most important for the differentiation between WHO grades: One feature from a conventional anatomic sequence (ce T1w) and two features from physiological imaging (rCBV, rOEF). Using only image features derived from anatomic sequences to distinguish gliomas of different WHO grades is known to be limited by significant overlap between WHO grades 3,6,7 . Therefore, the use of several physiological imaging sequences, better capturing underlying tumor biology, has been explored. Of these, perfusion imaging has maybe been studied most extensively for its use in MRI-based glioma grading [23][24][25] . Most authors found maximum CBV values to be a significant discriminator between tumor grades (with higher values in higher WHO grades), while in our cohort, the standard deviation of the CBV was more important. This may be related to a finding of an earlier study, where we found maximum CBV to be reflective of IDH status 26 , and in our cohort, only four patients carried an IDH mutation. We hypothesize that standard deviation of the CBV in the FLAIR-hyperintense tumor reflects intratumoral heterogeneity, which is known to be a prominent feature of glioblastoma, both histopathologically and molecularly 27 . Maximum relative oxygen extraction fraction (rOEF) was higher in glioblastoma compared to WHO grade II/III glioma (Fig. 3, Table 1). Considering the preliminary correlation between HIF1α expression (as per immunohistochemistry) and areas of high rOEF 11 , this feature might well reflect hypoxic areas, again highly characteristic of glioblastoma and thus aid in the differentiation between WHO grade II/III glioma and WHO grade IV glioblastoma. Interestingly, high rOEF values were most prevalent in peritumoral edema, which also fits with recent observations of Jensen et al. who demonstrated expression of hypoxia markers (HIF1α and VEGF) in peritumoral edema 28 .
Only four of the ten WHO grade II and III patients in our cohort carried a mutant IDH allele, while large-scale population-bases studies suggest that between 60-80% of these tumors should be IDH mutant 29 . Unfortunately, this low number of IDH mutations also precluded training a classifier for the detection of IDH mutation status. A possible explanation for the lower number of IDH mutant patients in our cohort might lie in the selection criteria for this study. In this cohort, we aimed to include mostly patients with a suspected high grade glioma, because we wanted to correlate hypoxia as detected by rOEF with a PET-based measure of hypoxia, [ 18 F]-FMISO.
Unraveling such meaningful correlations in complex multivariate features sets requires elaborate computational algorithms that go beyond standard univariate statistical tests. The random forest algorithm excels in feature extraction and classification tasks for data sets with few observations but high-dimensional feature vectors -a property of most data sets used in radiogenomics studies, as well as ours. Here, we used a random forest implementation with "oblique" node models 30 that overcomes deficiencies of the regular random forest when dealing with correlated features, which also is a typical property of an imaging data set in general and ours in particular. Importantly, such algorithms allow to input a high-dimensional data set without prior selection of "candidate sequences", by automatically weighing each feature, and hence leverage the full information contained in the data set.
Five-fold cross-validation yielded an 91.8% accuracy (34 of 37 patients correctly classified), suggesting that advanced MR imaging indeed is able to predict underlying biology. However, the lack of an independent validation cohort as well as the relatively small sample size are limitations of our work and caution against strong conclusions, though we tried to account for this by bootstrapping and cross-validating all steps in training and evaluating our classifier. Furthermore, this study is retrospective in nature and hence suffers from limitations associated with retrospective data sets.
In summary, multimodal MR imaging analysis using machine-learning techniques holds great promise for advancing the field of radiogenomics, enabling non-invasive insight into tumor biology, and possibly in the future informing clinical decision making. Our results warrant future studies.

Methods
Subjects. 47 patients with suspected glioma were examined using an extended MRI protocol. Data of ten patients had to be excluded because of other or lacking diagnosis (n = 4), severe motion artifacts in R2′ measurement (n = 2) and technical problems or motion in the DSC-based CBV measurement (n = 4). All patients provided informed written consent. The study was approved by the ethics committee of the TU Munich and carried out in accordance with the approved guidelines.

Statistical analysis.
For supervised analysis of this high-dimensional data set, we trained an oblique random forest classifier 30 with 300 trees (https://cran.r-project.org/package= obliqueRF). To validate this model, we performed a five-fold cross-validation; only cross-validated performance measures are reported, using the pROC and caret packages (https://cran.r-project.org/package= pROC & https://cran.r-project.org/package= caret) 35 . Using a bootstrapping approach (with 1,000 iterations), we calculated feature importance scores to identify the features most important for classification from the oblique random forest. Briefly, importance counts how often a variable was deemed relevant (at 0.05 level) when chosen for a split at a node (increasing the importance value by 1) and how often it was irrelevant for the split (decreasing by 1