Short-term MRI measurements as predictors of EDSS progression in relapsing-remitting multiple sclerosis: grey matter atrophy but not lesions are predictive in a real-life setting

Background Magnetic resonance imaging (MRI) is so far the best biomarker of inflammatory disease activity in relapsing-remitting multiple sclerosis (RRMS), but its association with disability is weak. The appearance of new MRI lesions is used to evaluate the response to immunotherapies in individual patients and is the most common primary outcome in phase-2 trials. Measurements of brain atrophy show promising outcomes in natural cohort studies and some phase-2 trials. From a theoretical perspective they might represent irreversible neurodegeneration and be more closely associated with disability. However, these atrophy measurements are not yet established as prognostic factors in real-life clinical routine. High field MRI has improved image quality and resolution, and new methods to measure atrophy dynamics have become available.
Objective To investigate the predictive value of MRI criteria that classify patients into high/low atrophy and inflammation groups, and to explore the predictive capacity of two consecutive routine MRI scans for disability progression in RRMS in a real-life prospective cohort.
Methods 82 RRMS patients (40 untreated, 42 treated with immunotherapies; mean age 40 years; median Expanded Disability Status Scale (EDSS) of 2) underwent two clinically indicated MRI scans (3 Tesla) within 5–14 months, and an EDSS assessment after a mean of 3.0 (1.5–4.2) years. We investigated the predictive value of predefined classifications into low/high inflammatory and atrophy groups for EDSS progression (≥1.5 if baseline EDSS = 0, ≥1.0 if baseline EDSS <5, ≥0.5 for other) by chi-square tests and by analysis of variance (ANOVA). The classifications were based on current scientific or clinical recommendations (e.g., treatment response criteria). Brain atrophy was assessed with three different methods (SIENA, SIENAX, and FreeSurfer). Post-hoc analyses aimed to explore clinical data and dynamics of MRI outcomes as predictors in multivariate linear and logit models.
Results Progression was observed in 24% of patients and was independent of treatment status. None of the predefined classifications were predictive of progression. Explorative post-hoc analyses found lower baseline EDSS and higher grey matter atrophy (FreeSurfer) to be the best predictors (R² = 0.29) of EDSS progression, and the overall accuracy was good (area under the curve = 0.81).
Conclusion Besides EDSS at baseline, short-term grey matter atrophy is predictive of EDSS progression in treated and untreated RRMS. The development of atrophy measurements for individual risk counselling and evaluation of treatment response seems possible, but needs further validation in larger cohorts. MRI atrophy estimates from the FreeSurfer toolbox seem to be more reliable than older methods.


INTRODUCTION
Disability progression in Multiple Sclerosis (MS) is mediated by acute inflammation as well as chronic inflammation and neurodegeneration (Hauser, Chan & Oksenberg, 2013; Friese, Schattling & Fugger, 2014). Magnetic Resonance Imaging (MRI) is currently the best available biomarker in relapsing-remitting Multiple Sclerosis (RRMS) (Hauser, Chan & Oksenberg, 2013), and new T2-hyperintense or contrast-enhancing lesions serve as outcomes of inflammation in clinical trials (Sormani et al., 2009; Stellmann et al., 2015). New lesions are also associated with treatment failure in individual patients (Rio et al., 2008). Lesion load at the time of diagnosis and its increase within the first five years are prognostic factors for long-term disability at a group level (Fisniku et al., 2008; Popescu et al., 2013; Tintore et al., 2015). However, the association between clinical and MRI measurements of inflammation and disability progression is moderate at best. In contrast, there is growing evidence that atrophy might be more closely associated with disability than lesions (Rocca et al., 2013; Jacobsen et al., 2014). Over 10 years, confirmed disability progression was associated with whole brain atrophy, cortical atrophy and ventricular volume (Zivadinov et al., 2016). Cross-sectional studies indicate a better correlation of atrophy with disability and cognitive decline than of lesions alone (Benedict, Carone & Bakshi, 2004; Steenwijk et al., 2016). Moreover, atrophy is discussed as an additional criterion to define treatment response within the concept of NEDA ("No evidence of disease activity"), as first studies report on the predictive value of, for example, percentage brain volume change for treatment response to interferon-beta (Perez-Miralles et al., 2015). However, MRI atrophy measurements are not yet established as individual prognostic factors, and their reliability has not yet been proven in a real-life setting.
Furthermore, data on short-term atrophy dynamics (e.g., within one year) as a predictor of disability progression are scarce (Popescu et al., 2013). Bielekova et al. (2005) were the first to combine simple MRI measurements of inflammation and atrophy as prognostic factors. They aimed to assign patients to four risk groups based on their baseline inflammatory activity (high or low) and respective atrophy (high or low). After eight years the algorithm failed to predict progression. Since then, however, high field MRI has improved image quality and resolution, and new methods to measure brain atrophy dynamics have become available (Smith et al., 2002; Fischl, 2012). It is therefore reasonable to investigate the predictive value of short-term atrophy and inflammation measurements from two MRI scans in a real-life setting, as most patients are likely to receive them for clinical monitoring anyway (Uher et al., 2015).
The current study was designed to validate the concept of Bielekova with different classification algorithms representing widely accepted criteria such as the Rio criteria for treatment failure. In addition, we aimed to explore how far different atrophy measurements (SIENA/SIENAX from the Functional MRI software library, fmrib.ox.ac.uk, and FreeSurfer, freesurfer.net) differ in their ability to predict EDSS progression.

Study design
The study was designed to assess the predictive value of two standard MRI scans for EDSS progression in treated and untreated RRMS in a real-life setting. Participants were consecutively recruited and underwent two baseline visits five to 14 months apart, each including a neurological assessment and an MRI scan. We scheduled annual follow-up visits, but due to an increasing dropout rate (25% in 2014) and poor compliance with scheduled visits, the study had to be terminated early with final visits in 2014/2015. As a result, patients had heterogeneous follow-up times (median 2.9 years, range 1.5-4.2).
Our analysis plan included two steps. In a hypothesis-driven approach, we used short-term changes of lesions and atrophy to define four risk groups and validate their predictive capacities: (I) low inflammation and low atrophy, (II) high inflammation and low atrophy, (III) low inflammation and high atrophy and (IV) high inflammation and high atrophy. Since the original publication of Bielekova (Bielekova et al., 2005), new methods to measure brain atrophy dynamics have become available (Smith et al., 2002; Fischl, 2012). We aimed for a comparison of three frequently used techniques (SIENA, SIENAX, FreeSurfer). Post-hoc, we explored clinical data and different volumetric methods in their ability to predict EDSS progression.

Patients
Patients aged between 18 and 60 years with a confirmed diagnosis of RRMS according to the revised McDonald Criteria (Polman et al., 2011) had to give written informed consent.
Patients were asked to participate at baseline if two MRI scans were clinically indicated within one year. The local ethics committee (Board of Physicians, Hamburg, No. PV4405) approved the study. Between the two baseline visits, patients had to remain either without any disease-modifying drug (DMD; untreated) or stable on a DMD (treated). MRI scans were not performed within 30 days after a steroid treatment. 109 patients were enrolled: 56 received a DMD and 53 opted against any DMD in a shared decision process. The expanded disability status scale (EDSS) (Kurtzke, 1983) of all patients was assessed by trained neurologists. Treatment at follow-up was labelled as "no change" or "change". The kind of treatment change was defined as "escalation", "no change" or "de-escalation".

MRI and image analysis
MRI data were acquired on a 3T scanner and included a magnetization prepared rapid acquisition gradient-echo (MPRAGE) T1-weighted sequence (T1, pre- and post-gadolinium (Gd)) and a PD-T2-weighted sequence (T2). The software JIM was used to semi-automatically mask lesions in the T2 (T2-hyperintense lesions), T1 (T1-hypointense lesions) and T1Gd sequences. Two raters counted lesions and evaluated new lesions. Regions of Interest (ROI) were semi-automatically placed around single lesions in the PD/T2, T1 and T1Gd sequences. The number of lesions was determined manually, while volumes were calculated automatically with the ROI-analysis function. Two raters evaluated the number of new T2/BH lesions. Afterwards, all images were processed with the FSL toolbox (Smith et al., 2004). Brain tissue volume, normalized for subject head size (NBV = normalized brain volume, NGM = normalized grey matter, NWM = normalized white matter), was estimated with SIENAX (Smith et al., 2002), and lesion volume was normalized based on the SIENAX results. To reduce the risk of false tissue assignment in lesions, lesion masks were dilated and filled with normal-appearing white matter contrast. Brain masks were manually corrected to minimize false tissue assignment by the FSL segmentation. Longitudinal atrophy was assessed with SIENA (Smith et al., 2002), and results were corrected for the individual duration between the two baseline scans to calculate an annualized Percentage Brain Volume Change (aPBVC). In addition, we used FreeSurfer (Version 5.2.0, http://surfer.nmr.mgh.harvard.edu/). To extract reliable and comparable volume estimates from both baseline MRI scans, images were processed with the FreeSurfer longitudinal stream (Reuter et al., 2012). We extracted volumes of the grey and white matter. Brain masks and white/grey matter segmentations were also manually corrected if needed.
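The annualization of the SIENA result can be illustrated with a minimal sketch. It assumes a simple linear scaling of the measured percentage brain volume change to a 12-month interval; the text does not state the exact correction that was used, and the function name and arguments are hypothetical.

```python
def annualized_pbvc(pbvc_percent: float, interval_months: float) -> float:
    """Scale a percentage brain volume change (PBVC) to a 12-month interval.

    Assumption: atrophy accrues linearly between the two baseline scans,
    so the annualized value is the raw PBVC times 12 / interval in months.
    """
    if interval_months <= 0:
        raise ValueError("interval_months must be positive")
    return pbvc_percent * 12.0 / interval_months


# Illustrative values only (not taken from the study data):
# a PBVC of -0.35% measured over 7 months.
example_apbvc = annualized_pbvc(-0.35, 7)
```

Under this assumption, a patient scanned 14 months apart and one scanned 5 months apart are placed on the same per-year scale before any cut-off is applied.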

Statistics
We report descriptive statistics according to the nature of the data, as means with standard deviation (sd) or as frequencies and/or percentages. Based on a single EDSS at follow-up and the lowest baseline EDSS, we calculated the absolute change and the EDSS progression of each patient. Progression was defined as an EDSS increase of 1.5 points or more (baseline EDSS = 0), of one or more points (baseline EDSS between one and four) or of 0.5 points or more (baseline EDSS above five) (Sormani & De Stefano, 2013). All changes were annualized based on the interval between the two baseline visits (i.e., scan one and two). To identify potential confounders for EDSS progression, we checked whether the variable baseline interval or the follow-up time differed between patients with and without EDSS progression (t-test). We also investigated whether baseline variables or follow-up times differed between treated and untreated patients. In case of significant differences, we adjusted further analyses for treatment status if possible.
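The progression rule can be written down as a short sketch. The band boundaries are an assumption: the Statistics text leaves baseline scores of 4.5 and 5.0 unassigned, so the sketch follows the wording of the abstract (≥1.0 if baseline EDSS <5, ≥0.5 otherwise). Function and argument names are hypothetical.

```python
from typing import Sequence


def edss_progression(baseline_scores: Sequence[float], follow_up: float) -> bool:
    """Return True if the follow-up EDSS counts as progression.

    The reference value is the lowest EDSS of the baseline visits,
    as described in the text. Thresholds follow the abstract
    (assumption for baseline scores of 4.5 and 5.0).
    """
    baseline = min(baseline_scores)
    change = follow_up - baseline
    if baseline == 0:
        return change >= 1.5
    if baseline < 5:
        return change >= 1.0
    return change >= 0.5


# Illustrative cases (not study data)
cases = [
    edss_progression([0.0, 1.0], 1.0),   # baseline 0, +1.0 -> no progression
    edss_progression([2.0, 2.5], 3.0),   # baseline 2.0, +1.0 -> progression
    edss_progression([5.5, 6.0], 6.0),   # baseline 5.5, +0.5 -> progression
]
```

Using the lowest baseline score makes the rule conservative with respect to fluctuations between the two baseline visits, since any later increase is measured from the better of the two examinations.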

Predefined criteria
Classification into low and high inflammation was defined by four different criteria:
(A) No lesion vs. at least one lesion per year (representing no inflammatory activity versus any activity, as used with the No evidence of disease activity (NEDA) outcome in clinical trials).
(B) Two lesions per year vs. fewer (MRI criterion of treatment non-response (Rio et al., 2008)).
(C) Four lesions per year vs. fewer (extending criteria A/B towards a higher inflammatory cut-off).
(D) One lesion per month (representing the original Bielekova criterion (Bielekova et al., 2005)).
The four corresponding definitions for low and high atrophy groups represented three commonly used methods to assess atrophy: (1) Absolute change of NBV (any atrophy vs. no atrophy, SIENAX).
Median NBV split values were 1,539,505 mm³ in untreated and 1,875,934 mm³ in treated patients. The predictive value of each combination of criteria (e.g., 1A, 2C, 3B etc.) for disease progression was evaluated by chi-square tests and by analysis of variance (ANOVA).
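The assignment to the four risk groups (I to IV) described in the study design can be sketched as follows. The sketch assumes that each lesion cut-off is inclusive ("at least N new lesions per year" counts as high inflammation) and that the atrophy classification has already been reduced to a high/low flag by one of the atrophy criteria; all names are hypothetical.

```python
from collections import Counter

# New lesions per year that count as "high inflammation" under criteria A-D.
# Criterion D (one lesion per month) is expressed as 12 lesions per year.
INFLAMMATION_CUTOFFS = {"A": 1.0, "B": 2.0, "C": 4.0, "D": 12.0}


def risk_group(new_lesions_per_year: float, high_atrophy: bool, criterion: str) -> str:
    """Map annualized new-lesion count and an atrophy flag to risk group I-IV."""
    high_inflammation = new_lesions_per_year >= INFLAMMATION_CUTOFFS[criterion]
    if not high_inflammation and not high_atrophy:
        return "I"
    if high_inflammation and not high_atrophy:
        return "II"
    if not high_inflammation and high_atrophy:
        return "III"
    return "IV"


def contingency(groups, progressed):
    """Count patients per (risk group, progression yes/no) cell."""
    return Counter(zip(groups, progressed))


# Illustrative patients (not study data): (lesions/year, high atrophy, progressed)
patients = [(0, False, False), (3, False, False), (1, True, True), (5, True, True)]
groups = [risk_group(n, atrophy, "B") for n, atrophy, _ in patients]
table = contingency(groups, [p for _, _, p in patients])
```

The resulting counts form the group-by-progression table on which a chi-square test would then be run.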

Post-hoc exploratory analyses
First, we investigated the ability of the following variables to predict EDSS change and progression in linear and logit models adjusted for treatment status: gender, age, number of T1-, T2- and Gd-lesions, the absolute change of lesion numbers and SIENAX volumes from Visit 1 to Visit 2, aPBVC, as well as global atrophy measurements from the longitudinal FreeSurfer processing (volumes: brain, white matter, grey matter, subcortical grey matter, cortical grey matter, supratentorial brain). Potential interactions with treatment status were investigated in the same way. P-values were corrected for multiple testing with the false discovery rate (FDR) method. The remaining significant predictors were afterwards combined in multivariate models by forward stepwise selection of variables based on the Akaike Information Criterion (AIC). To quantify the predictive value of the models, we calculated the coefficient of determination (R²) for linear models. In addition, we computed Receiver Operating Characteristic (ROC) curves and their area under the curve (AUC) from logit models with progression ("yes", "no") as a binary outcome. Sensitivity, specificity, and the negative (NPV) and positive predictive value (PPV) were estimated at the best threshold of the predicted values. Finally, we calculated odds ratios and their 95% confidence intervals (95% CI) for each variable. All analyses were performed in R 3.1.2.
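How AUC and the threshold-based accuracy measures are derived from predicted values can be sketched in a few lines. The sketch is not the original R analysis: it computes the AUC as the probability that a progressing patient receives a higher predicted value than a non-progressing one, and it assumes that the "best threshold" is the one maximizing Youden's index (sensitivity + specificity - 1), which the text does not specify. All names and the example data are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (progressor, non-progressor) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_threshold(scores, labels):
    """Accuracy measures at the threshold maximizing Youden's index (assumption)."""
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        youden = sens + spec - 1
        if best is None or youden > best["youden"]:
            best = {
                "threshold": t,
                "sensitivity": sens,
                "specificity": spec,
                "ppv": tp / (tp + fp),  # tp + fp >= 1 because t is an observed score
                "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
                "youden": youden,
            }
    return best


# Illustrative predicted probabilities and outcomes (not study data)
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
model_auc = auc(scores, labels)
operating_point = best_threshold(scores, labels)
```

Other definitions of the best threshold, such as the point closest to the top-left corner of the ROC curve, would be equally defensible and can give a different operating point.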

Cohort
Of the 109 patients recruited, a clinical follow-up of at least 1.5 years was available for 82 (75%) patients. Mean follow-up time was three years (range 1.5-4.2). Patients who did not attend the follow-up were younger (p = 0.018) and had a shorter disease duration at baseline (p < 0.001) than the follow-up cohort but did not differ in any other baseline parameter. In the follow-up cohort, 40 patients received DMDs and 42 were without medication. The variable time span between the baseline visits (median 7 months, range 5-14) was not associated with progression, nor was treatment status (stable or changed, chi-square p = 0.2, ANOVA p = 0.1) or escalation/de-escalation (chi-square p = 0.2, ANOVA p = 0.3). Further analyses were therefore not corrected for these potential confounders. Treated and untreated patients differed in follow-up time, EDSS and NBV at baseline (all p < 0.001), and we adjusted further analyses for treatment status.

Validation of predefined classification algorithm
None of the predefined classification algorithms into high/low inflammation and atrophy groups were able to predict EDSS progression (Table 2), except for one: change of FreeSurfer brain volume (Criterion 3) combined with at least four T2-lesions per year (Criterion C) in the whole cohort (p = 0.037). However, this algorithm failed to predict absolute EDSS change when adjusted for treatment status (p = 0.261), and a comparison of the three different atrophy measurements was not possible.

Explorative classification algorithms
The results of screening predictors are summarised in Table 3. SIENA and SIENAX measurements were not significantly associated with EDSS change or progression. After correction for multiple testing, only baseline EDSS, change of total grey matter volume and change of cortical grey matter remained significant. After stepwise selection of variables, the final multivariate linear model included treatment status, baseline EDSS and change of FreeSurfer grey matter volume (Table 4 and Fig. 1) as predictors (R² = 0.29). The corresponding logit model included cortical grey matter instead of total grey matter (Table 4). Separation between patients with and without progression was good (AUC = 0.81, Table 4 and Fig. 1). While higher atrophy indicated a higher risk of progression in all models, the association between baseline EDSS and progression was inverse, i.e., patients with lower EDSS had a higher risk of progressing.

DISCUSSION
So far, only lesion load and new lesions (within restrictions) can be used as individual predictors of disease progression in routine imaging of MS patients. We identified short-term grey matter atrophy as a potentially better predictor. Except for a low predictive value of Gd-enhancing lesions in treated patients, no lesion measurement was related to progression. From a pathophysiological perspective, it is reasonable to combine measurements of inflammation and neurodegeneration to predict disability accumulation after several years (Bielekova et al., 2005). Here, all but one of the simple classification algorithms into high and low inflammatory or atrophy groups failed to foresee EDSS progression (Bielekova et al., 2005; Fisniku et al., 2008; Popescu et al., 2013; Jacobsen et al., 2014; Tintore et al., 2015).
The negligible sensitivity of lesions in our cohort might be explained by the fact that previous studies mainly investigated patients with a clinically isolated syndrome (CIS), while we investigated established RRMS (Fernández, 2013; Odenthal & Coulthard, 2015; Tintore et al., 2015). It is well known from natural history data that relapses do not influence the risk of disability or the onset of a progressive disease course if they occur later than two years after disease onset (Degenhardt et al., 2009; Scalfari et al., 2010). We assume that our patients were in a later phase of the disease where T2-lesions may have only a minor impact, which is in accordance with other cohorts (Jacobsen et al., 2014; Uher et al., 2015) but not all (Popescu et al., 2013). The association between lesions and relapses could not be evaluated as information about relapses was not reliable, but based on the considerations stated this is not a major limitation. In our cohort, grey matter atrophy was more predictive than total brain or white matter atrophy. This observation is in line with previous studies, where progression was associated with cortical atrophy and subcortical grey matter changes (Rocca et al., 2013; Jacobsen et al., 2014). It is known that clinical disability is more closely associated with cortical pathology than with T2-lesions or normal-appearing white matter. DMDs or their change were not associated with disability progression. The missing effect of immunotherapies might be due to inconsistent treatment effects of DMDs. Again, these findings are in line with previous studies (Daumer et al., 2009; Hauser, Chan & Oksenberg, 2013). So far, most of the reported associations between grey matter and disability are based on absolute volumes, e.g., grey matter volume from a single MRI (Rocca et al., 2013; Jacobsen et al., 2014; Tintore et al., 2015).
The use of absolute cut-offs as predictors, for example Bielekova's 83% brain parenchymal fraction (Bielekova et al., 2005), is restricted and specific to each cohort, as different scanners, sequences and processing pipelines have a major influence on these values and calibration is not possible (Obuchowski et al., 2014). For example, even in our cohort the absolute brain volumes differed between treated and untreated patients, and in the opposite direction to what was assumed: treated and more disabled patients with longer disease duration had a higher baseline brain volume. As the scanner, sequences and analysis pipeline were the same, the observation must be due to an unknown bias. Using relative values, such as changes from baseline, is a feasible approach to overcome such shortcomings even though they are less informative than calibrated quantitative measurements (Obuchowski et al., 2014). In our study we used three different kinds of relative values (Fischl, 2012). Only the FreeSurfer algorithm was associated with progression, and it seems to be more reliable and sensitive than SIENA/SIENAX. However, computing these measurements still requires several hours and is not yet feasible for clinical routine. The higher risk of progression in patients with lower EDSS seems counterintuitive at first sight. EDSS scores below four mainly represent the neurological examination (Kurtzke, 1983), and even non-disabling new symptoms may lead to an increase of the EDSS. Most of our patients had no or only mild disability at baseline. Therefore, EDSS progression represents non-disabling symptoms in most cases. Whether such EDSS changes are predictive in the long run is questionable, and this uncertainty could not be reduced by confirming EDSS changes after three or six months, as such confirmation was not possible in our cohort (Ebers et al., 2008). Overall, the risk of progression in our cohort was in line with other cohorts (Jacobsen et al., 2014).
Furthermore, the median time from disease onset to EDSS three is about 12 years, which is still above the median disease duration at follow-up in our cohort (Scalfari et al., 2010). Our findings are somewhat limited as 25% of patients were lost to follow-up, which is similar to other cohorts (Jacobsen et al., 2014). Accounting for heterogeneous follow-up by implementing survival analyses was not possible, as the lack of independence of censoring violates a fundamental assumption of survival analyses (Leung, Elashoff & Afifi, 1997). As dropouts did not differ relevantly from follow-up patients, we assume no major impact on our results. Our relatively small sample size restricts the generalization of our findings, but overall FreeSurfer measurements are a promising method to enhance individual risk stratification.

CONCLUSION
Besides EDSS at baseline, grey matter atrophy within one year is a valuable predictor of EDSS progression in treated and untreated RRMS. The development of atrophy measurements for individual risk counselling and evaluation of treatment response seems possible, but defining a simple-to-compute, generalizable measurement is still challenging.