Improved prediction of outcome in Parkinson's disease using radiomics analysis of longitudinal DAT SPECT images

No disease modifying therapies for Parkinson's disease (PD) have been found effective to date. To properly power clinical trials for discovery of such therapies, the ability to predict outcome in PD is critical, and there is a significant need for discovery of prognostic biomarkers of PD. Dopamine transporter (DAT) SPECT imaging is widely used for diagnostic purposes in PD. In the present work, we aimed to evaluate whether longitudinal DAT SPECT imaging can significantly improve prediction of outcome in PD patients. In particular, we investigated whether radiomics analysis of DAT SPECT images, in addition to use of conventional non-imaging and imaging measures, could be used to predict motor severity at year 4 in PD subjects. We selected 64 PD subjects (38 male, 26 female; age at baseline (year 0): 61.9 ± 7.3, range [46,78]) from the Parkinson's Progressive Marker Initiative (PPMI) database. Inclusion criteria included (i) having had at least 2 SPECT scans at years 0 and 1 acquired on a similar scanner, (ii) having undergone a high-resolution 3 T MRI scan, and (iii) having motor assessment (MDS-UPDRS-III) available in year 4 used as outcome measure. Image analysis included automatic region-of-interest (ROI) extraction on MRI images, registration of SPECT images onto the corresponding MRI images, and extraction of radiomic features. Non-imaging predictors included demographics, disease duration as well as motor and non-motor clinical measures in years 0 and 1. The image predictors included 92 radiomic features extracted from the caudate, putamen, and ventral striatum of DAT SPECT images at years 0 and 1 to quantify heterogeneity and texture in uptake. Random forest (RF) analysis with 5000 trees was used to combine both non-imaging and imaging variables to predict motor outcome (UPDRS-III: 27.3 ± 14.7, range [3,77]). The RF prediction was evaluated using leave-one-out cross-validation. Our results demonstrated that addition of radiomic features to conventional measures significantly improved (p < 0.001) prediction of outcome, reducing the absolute error of predicting MDS-UPDRS-III from 9.00 ± 0.88 to 4.12 ± 0.43. This shows that radiomics analysis of DAT SPECT images has a significant potential towards development of effective prognostic biomarkers in PD.


Introduction
Parkinson's disease (PD) is a progressive, degenerative movement disorder, characterized by neuronal loss in the substantia nigra with the loss of dopaminergic terminals in the basal ganglia (Brooks et al., 1990;Garnett et al., 1987;Stoessl et al., 2011). Given the absence of proven disease modifying therapies for PD, there is a critical need to establish biomarkers of disease progression (Marek et al., 2008); e.g. an aim of the Parkinson's Progressive Marker Initiative (PPMI) (Parkinson Progression Marker Initiative, 2011). Furthermore, there is significant interest in prognostication of disease outcome, to properly adapt and power clinical trial studies, as applied to appropriate patients. Stratification of PD based on expected prognosis would allow better designs of disease modifying trials, with greater power to ascertain efficacy.
Imaging of the dopaminergic system with 123 I-ioflupane-dopamine transporter (DAT) SPECT is now widely used (Catafau and Tolosa, 2004;Grachev et al., 2012;Kupsch et al., 2012). DAT SPECT images are commonly assessed visually. However, there are increasing efforts to quantify uptake in regions-of-interest (ROIs) in order to provide more objective measures of disease severity (Badiavas et al., 2011;Koch et al., 2005). Quantitative imaging biomarkers also have the potential for earlier detection of disease. A significant way in which DAT SPECT imaging has been helpful is to identify a subgroup of PD patients who are symptomatic without evidence of dopamine deficit (SWEDDs). These patients have a significantly better prognosis. What remains of critical importance, is to discover further subsets in the PD population of different outcomes, to enable significantly improved targeted clinical trials for the assessment of novel therapies for PD.
Advanced radiographic metrics that quantify heterogeneity in shape and uptake have been explored, primarily in the field of oncology, in order to improve diagnosis as well as prediction of treatment response and survival outcome in different cancers (Eary et al., 2008;El Naqa et al., 2009;Hatt et al., 2015;Rahmim et al., 2016b;Soufi et al., 2017;Tixier et al., 2014;Tixier et al., 2011;van Velden et al., 2011;Vriens et al., 2012). Overall, the field of radiomics aims to extract a large number of quantitative features from radiological images, aiming to uncover correlates of disease characteristics that are ordinarily not visually observed or quantitatively measured (Aerts et al., 2014;Asselin et al., 2012;Chicklore et al., 2013;Kumar et al., 2012;Lambin et al., 2012). In a previous work, we assessed whether radiomics analysis improved correlations with clinical assessments (Rahmim et al., 2016a). The present work investigates whether radiomics analysis significantly adds to the ability of non-imaging and conventional measures to predict outcome. This is plausible since pathophysiologic studies have demonstrated very heterogeneous patterns of dopamine loss in the basal ganglia (Kish et al., 1988). As such, we hypothesize that radiomics analysis has the potential to significantly improve prediction of outcome in Parkinson's disease patients.

Longitudinal patient data
Longitudinal data were extracted from the PPMI database (www. ppmi-info.org/data) (Parkinson Progression Marker Initiative, 2011). The movement disorder societyunified Parkinson's disease rating scale (MDS-UPDRS)part III (motor) in year 4 was used as outcome (referred to as UPDRS-III from here on). Predictors included demographics (age, sex), as well as baseline (year 0) and year 1 DAT SPECT images and clinical measures. Clinical measures included: disease duration (DD) taken with respect to (a) time of diagnosis (DD-diag.) as well as (b) time of appearance of symptoms (DD-sympt.). This also included (c) the motor measure, UPDRS-III, and (d) the non-motor cognitive measure, Montreal Cognitive Assessment (MoCA), for both baseline (year 0) and year 1. For consistency, we only included patients with SPECT data acquired on similar kinds of scanner (Siemens, 2-headed ECAM or Symbia systems), and subjects who underwent high-resolution 3 T MRI. These selection criteria resulted in 64 PD subjects (38 male, 26 female; age at year 0: 61.9 ± 7.3, range [46,78]), with widely distributed year 4 outcome UPDRS-III: 27.3 ± 14.7; range [3,77].
Imaging was performed 4 ± 0.5 h following injection of DAT SPECT ( 123 I-Ioflupane; 111-185 MBq). Thyroid update was blocked via pre-treatment of subjects with saturated iodine solution (10 drops in water) or perchlorate (1000 mg) prior to injection. Data acquisition consisted of 128 × 128 raw SPECT projection data acquired every 3 degrees, 120 projections, 20% symmetric photopeak windows centered on 159 keV and 122 keV, and a total scan duration of~30-45 min. A HERMES system (Hermes Medical Solutions, Stockholm, Sweden) was used to perform iterative OSEM reconstruction on the input raw SPECT projection data, for all studies to ensure consistency.
Subsequently, PMOD (PMOD Technologies, Zurich, Switzerland) was used for attenuation correction. Ellipses where drawn on the images and Chang 0 attenuation correction was applied invoking a sitespecific mu as empirically derived from phantom data (as acquired in site initiation for the trial). Following this, standard 3D Gaussian postsmoothing (6.0 mm FWHM) was applied.

Image processing and quantification
Image processing consisted of segmentation of MRI images and registration of SPECT images onto MRI images, followed by quantitative feature extraction from SPECT images, as follows: 1) Segmentation: We focused on six ROIs, namely the caudate, putamen and ventral striatum (VS) (both right and left) for analysis. The occipital cortex was also segmented, which is commonly used as a reference region for normalization of counts (Badiavas et al., 2011;Djang et al., 2012;Koch et al., 2005). Segmentation consisted of: 1) affine registration of the T1-weighted MRI of the subject to the MNI305 atlas (Evans et al., 1993); 2) initial volumetric labeling of the registered MRI of the subject using the labels of the atlas; 3) correction of image inhomogeneity, i.e., variation of the image intensity due to the B1 bias field, using the results of the previous stage; 4) nonlinear volumetric alignment of the subject's affine-registered and inhomogeneity-corrected MRI to the MNI305 atlas; and 5) propagation of the atlas labels to the MRI generated in the previous step and then, mapping of the results back to the original MRI of the subject. The classification of each point in the space to a given label was achieved by finding the segmentation that maximized the probability of input (image) given the prior probabilities from the training set (MNI305 atlas). Details of this framework are presented by Fischl et al., and an implementation of the methods is available in FreeSurfer (Fischl et al., 2002;Fischl et al., 2004a;Fischl et al., 2004b). 2) Registration: To register SPECT images of each subject to the MRI of the subject, we employed a rigid, information-theory-based co-registration approach (Penny et al., 2011), implemented in two steps. The algorithm uses the normalized mutual information as the objective function along with a Gaussian smoothing kernel with a width of 7 mm. It also uses two separation levels, the average distance between sampled points of 4 and 2 mm, for the coarse to fine registration of the images, as elaborated by Collignon et al. (1995). Since DAT SPECT images have lower spatial resolution (typically 10 mm) compared to MRI (typically 1 mm), and depict centralized hyper intense striatal uptake (with minimal uptake elsewhere), standard coregistration commonly performs poorly. To tackle this, in the next stage, we performed linear intensity normalization of the T1-w images so that the average intensity of white matter equals 100. Then, we set the intensity value of the caudate and VS to 4000 and putamen to 1000. Finally, we employed the rigid co-registration algorithm on the co-registered DAT SPECT images from the previous stage and the manipulated T1-w images. In this way, the hyper intensity regions in the SPECT image are forced to better align to the regions around the favorable structures. In the end, we overlaid the structures segmented in MRI on the co-registered SPECT images. Furthermore, the more (m) vs. less (l) affected sides of the structures were determined with reference to the SPECT uptake at the putamen. 3) Feature extraction: A total of 92 imaging features were extracted, including 13 first-order intensity features, 22 morphological features, and 57 second-and higher-order textural features, describing the intensity and spatial distribution of radiotracer uptake. The definitions of these features are elaborated in the supplement. Of the 92 features, 75 of them incorporated DAT SPECT uptake information, while 17 of the morphological features were purely based on the MRI-based ROIs (see supplement). This latter is justified given findings of structural brain atrophy in PD patients (Brenneis et al., 2003;Zeighami et al., 2015). Of the 92 features, 4 features were conventional measures of SUVmax, SUVmean, SUVpeak and ROI-volume, though these were extracted for each of the 6 ROIs and on both years of imaging (year 0 and 1).

Data analysis
Multivariate analysis was performed in three groups, involving: (1) use of only non-imaging features (i.e. demographics and clinical measures), (2) additional use of conventional imaging features, and (3) additional use of radiomic features. For each group of predictors, a single regression tree was fitted to explore interactions among variables, followed by random forest analysis (R package randomForest) to identify the top predictors. Random forest analysis is an ensemble non-parametric machine learning method. It can handle complicated interactions among large number of variables and their non-linear effects efficiently. In fact, it has shown favorable performance in comparison to other machine learning methods in the context of radiomics analysis (Parmar et al., 2015). An observation is that the prediction model cannot be visualized using a single straightforward formula because the prediction is done through a collection of trees, and each tree has its own formula, which is also the case with artificial neural networks (ANNs).
Year 4 UPDRS-III was used as the response variable in all these analyses. Features with zero variance were removed before the analysis. For each group of predictors, 5000 bootstrap samples were randomly drawn from all patients. Each of these bootstrap samples included about two-third of the total number of patients, and the remaining one-third patients were called out-of-bag (OOB) for that bootstrap sample. A decision tree was grown from each bootstrap sample and internally validated using the corresponding OOB sample. This procedure was repeated for all bootstrap samples. The minimum node size was set to 5, and at each node, 1/3 of candidate features were randomly selected to determine how to split the node. To rank the importance of a predictor, all values of this predictor were permuted among all individuals in the OOB, and we put both permuted and non-permuted data down all trees to obtain their predictions. The average absolute difference in prediction between permuted and non-permuted OOB data was then used to measure the variable importance score and to select top predictors. Variables with larger difference scores were considered more important.
Prediction accuracy of random forest analysis was assessed using leave-one-out-cross-validation (LOOCV). Each time, one patient was excluded from the random forest analysis. The developed random forest algorithm was then applied to this excluded patient to obtain a prediction of UPDRS-III score. We then calculated the absolute difference between the predicted and observed UPDRS-III scores for the patient. The process was repeated for all patients. The average difference Δ from all patients was used to compare predictions using different groups of predictors. In particular, we investigated added value of radiomics features by comparing Δ from group 2 and group 3. For longitudinal analysis, combined features from year 0, year 1, as well as their differences were utilized. To assess performance when only baseline information was included, the abovementioned analysis was repeated when data from only year 0 were used to predict outcome in year 4.

Results
Fig. 1 depicts 3D segmentation as performed on a typical study, as well as transaxial, coronal and sagittal slices of the DAT SPECT image overlaid with the segmentations. Fig. 2 shows a single decision tree (left) and its performance (right). Patients with baseline UPDRS-III greater than or equal to 36 had the highest median year 4 UPDRS-III score. For patients with baseline UPDRS-III < 36, there were interactions among different radiomic features. However, it is seen that using a single decision tree results in a limited number of outcomes, which is why a random forest of decision trees is utilized to provide improved performance (next).   The performance of the tree on the data is shown, which is sub-optimal, given that only one of six outcomes can be arrived at, at the leaves. Use of random forest of decision trees aims to improve this performance. Radiomic features, such as difference Entropy, SZHGE and LZLGE as seen above, are elaborated in the supplement. (m) and (l) refer to the more and less affected sides, respectively (e.g. caudate(m) is the more affected caudate).
A. Rahmim et al. NeuroImage: Clinical 16 (2017) 539-544 prediction of outcome is 8.93 ± 0.91. It is seen that addition of only conventional imaging features does not improve prediction performance (9.00 ± 0.88). However, addition of radiomic features results in an absolute error in outcome prediction of 4.12 ± 0.43. Given the year 4 UPDRS-III distribution (27.3 ± 14.7), this represents a significant (p-value < 0.001) average improvement of 18% in prediction of outcome, and can be readily observed visually. Fig. 4 depicts the findings when data from only year 0 (baseline) are used to predict outcome. It is generally seen that predictions are poorer. What is more, in comparison to use of non-imaging features (10.77 ± 1.10), outcome prediction it not improved when adding conventional features (10.63 ± 1.07) or even by further addition of radiomic feature (9.93 ± 1.10). Overall, it is seen that longitudinal images (acquired at year 0 and year 1) are required to enable significantly improved prediction of motor outcome in year 4. Fig. 5 plots relative contribution of different predictors via a metric %IncMSE: incremental percent change in mean square error (MSE) by exclusion of a single feature. It is seen that UPDRS-III (motor) from year 1 and then from year 0 are the most important predictors. At the same time, it is seen that radiomics features were among the top 10 most important predictors. The critical observation, from Fig. 3, is that it is through the combination of conventional measures with radiomic features that one is able to significantly improve prediction of outcome.

Discussion
There exist two intertwined challenges in PD biomarker discovery. One is to identify biomarkers that track PD progression, and the other is to predict outcome (Marek et al., 2008). Both are very important clinically and in the discovery of disease modifying therapies. PD progression biomarkers are needed to assess efficacy of such therapies, while better prediction of outcome is needed in order to discover PD sub-types so as to more effectively design clinical trials in the first place.
In the present work, we utilized motor assessment (UPDRS-III) in year 4 as the outcome measure being predicted. We plan to extend our work to prediction of other outcomes, including cognitive assessment (MoCA) and time to initiation of symptomatic therapy (TIST). In fact, it is plausible that use of a single clinical metric may not be sufficient to truly track PD progression. As a result, it is meaningful to use a combination of metrics, or global statistical tests for assessment of disease progression in clinical trials (Huang et al., 2009), which we plan to utilize in future efforts involving prediction of outcome in PD.
There exist considerable difficulties and uncertainties with PD disease metrics. Disease duration is a particularly problematic metric, given the subtle and often nonspecific nature of early symptoms. The patients' abilities to detect the first symptoms is highly variable, and is impacted by a number of factors such as personality, education level and professional background, and the type of initial symptom (e.g. tremor vs. bradykinesia). This somewhat subjective nature of UPDRS evaluation makes the disease duration metric prone to inter-rater variability. UPDRS-III motor assessment itself is a highly variable   A. Rahmim et al. NeuroImage: Clinical 16 (2017) 539-544 measure (Horne et al., 2016;Mentzel et al., 2016), involving both patient-and rater-dependent variabilities. To reduce such variability, we averaged motor measures, if additionally performed up to 6 months before or after the visit. For instance, for year 4 outcome prediction, if a patient had a visit at 3.5 years and/or 4.5 years post-enrollment, the measures would be utilized and averaged with year 4 measures. In our past efforts, we investigated use of advanced texture analysis in quantitative brain PET imaging, in studies of PD (Gonzalez et al., 2013;Klyuzhin et al., 2016;Sossi et al., 2012) and Neuroinflammation (Rahmim et al., 2012). Furthermore, we found strong evidence that such measures retain their information even as one transitions from the higher resolution domain of PET images to the lower resolution domain of SPECT images by significant post-reconstruction blurring of PET images (e.g. up to 1 cm) (Blinder et al., 2014). In a subsequent DAT SPECT study (Rahmim et al., 2016a), we also demonstrated significantly improved correlation of radiomic features with clinical assessment. This prompted the present study, for the challenging task of predicting clinical outcome, making use of radiomic features involving DAT SPECT images. It is also worth noting that a potential of radiomic features is to move beyond conventional measures that require access to a good reference region for normalization, enabling more subtle detection and assessment of disease (Campbell and Shokouhi, 2017).
The large number of features used in radiomics analysis is a concern in the context of the curse of dimensionality, and there exist increasing scrutiny (Chalkidou et al., 2015) and efforts (Kumar et al., 2012) to avoid problems of overfitting and false discovery. One approach is to pre-eliminate a number of features as shown in independent studies to depict poor repeatability (test-retest) or poor reproducibility to variations in image reconstruction and processing parameters (Ashrafinia et al., 2017;Doumou et al., 2015;Galavis et al., 2010;Grkovski et al., 2015;Hatt et al., 2013;Leijenaar et al., 2013;Lu et al., 2016;Lv et al., 2017;Shiri et al., 2017;Tixier et al., 2012;van Velden et al., 2016). An alternative approach, which we have pursued in this work, is to perform feature selection within the training dataset itself, using unsupervised methods (Bartenhagen et al., 2010;Dy, 2008;Mitra et al., 2002), to reduce the dimensionality of the data. This involved elimination of features that have very low dynamic range or those that are very highly correlated with other features (redundancy).
Among predictors used in this work, we also included genomics profiling. In particular, we used a reduced list of 8 alpha-synuclein (SNCA) single nucleotide polymorphisms (SNPs), given previous findings on their association with increased risk of PD onset (Guella et al., 2016). However, no significant genomic interactions were found for progression. Our ongoing efforts include correlation of the SNPs with imaging phenotypes to boost statistical power, and analysis of a wider array of SNPs. We plan on performing extensive analysis on a larger patient dataset, including a larger training set as well as separate validation set, in order to assess reproducibility of our findings of significantly improved prognosis with inclusion of radiomic features.
We finally note that there is increasing recognition for the critical role of standardization and reproducible research for effective progress in the hunt for and established utility of biomarkers (Collins and Tabak, 2014;Economist, 2013;Poste, 2011). This is equally a concern in the field of radiomics which involves complicated imaging feature definitions and analyses (Hatt et al., 2017). To this end, there are now significant efforts underway for standardization of imaging biomarkers (Zwanenburg et al., 2016), which are critically needed for reproducible application of radiomics features and their successful translation to clinical practice.

Conclusion
This work has shown that use of longitudinal data (year 0 and year 1), combined with radiomics analysis, can result in significant improvements in prediction of outcome. Our results demonstrated that addition of radiomic features to conventional measures significantly improved (p < 0.001) prediction of outcome (namely that of year 4 motor performance as assessed using MDS-UPDRS-III), reducing the absolute error of prediction from 9.00 ± 0.88 to 4.12 ± 0.43 (MDS-UPDRS-III distribution: 27.3 ± 14.7). Radiomics analysis of DAT SPECT images holds significant potential towards development of effective biomarkers for prognostication of PD, with implications in design of clinical trials.

Conflicts of interest
None of the authors report any conflict of interest.
Grant support