Fully automated segmentation of the cervical cord from T1-weighted MRI using PropSeg: Application to multiple sclerosis

Spinal cord (SC) atrophy, i.e. a reduction in the SC cross-sectional area (CSA) over time, can be measured by means of image segmentation using magnetic resonance imaging (MRI). However, segmentation methods have been limited by factors relating to reproducibility or sensitivity to change. The purpose of this study was to evaluate a fully automated SC segmentation method (PropSeg), and compare this to a semi-automated active surface (AS) method, in healthy controls (HC) and people with multiple sclerosis (MS). MRI data from 120 people were retrospectively analysed; 26 HC, 21 with clinically isolated syndrome, 26 relapsing remitting MS, 26 primary and 21 secondary progressive MS. MRI data from 40 people returning after one year were also analysed. CSA measurements were obtained within the cervical SC. Reproducibility of the measurements was assessed using the intraclass correlation coefficient (ICC). A comparison between mean CSA changes obtained with the two methods over time was performed using multivariate structural equation regression models. Associations between CSA measures and clinical scores were investigated using linear regression models. Compared to the AS method, the reproducibility of CSA measurements obtained with PropSeg was high, both in patients and in HC, with ICC > 0.98 in all cases. There was no significant difference between PropSeg and AS in terms of detecting change over time. Furthermore, PropSeg provided measures that correlated with physical disability, similar to the AS method. PropSeg is a time-efficient and reliable segmentation method, which requires no manual intervention, and may facilitate large multi-centre neuroprotective trials in progressive MS.


Introduction
Neuropathological and magnetic resonance imaging (MRI) studies have demonstrated the involvement of the spinal cord (SC) in multiple sclerosis (MS); neurodegeneration in the SC is thought to represent the main pathological substrate of irreversible locomotor disability (Abdel-Aziz et al., 2015; DeLuca et al., 2006; Ganter et al., 1999). In particular, SC MRI has provided indirect evidence of axonal degeneration by quantifying atrophy, i.e. a reduction in SC cross-sectional area (CSA) over time, with correlations identified between measures of CSA and physical disability (Kearney et al., 2015b; Lin et al., 2004; Losseff et al., 1996). Such associations support the notion that reliable CSA estimation over time could be a plausible endpoint for clinical trials for neuroprotection in MS (Kearney et al., 2014a), and a number of exploratory studies have been reported in the literature (Kalkers et al., 2002; Leary et al., 2003).
Previous methods used for measuring CSA have varied in their reproducibility and sensitivity to small changes, and all of them require some degree of operator input (Coulon et al., 2002; Horsfield et al., 2010; Kawahara et al., 2013; Kidd et al., 1993; McIntosh et al., 2011). Typically, intra- and inter-observer reproducibility is assessed from repeated measurements by estimating the coefficient of variation (COV); the currently established semi-automated active surface (AS) method offers intra- and inter-observer COV values of 0.44% and 1.07%, respectively (Horsfield et al., 2010). More recently, investigators have aimed to develop fully automated segmentation methods, which may minimize user bias and significantly reduce the image processing time (Asman et al., 2014; Chen et al., 2013; Koh et al., 2011).
However, the variety of image acquisitions, the types of image contrast and the variability of the field of view (FOV) required for each specific application make it particularly challenging for any individual method to account for so many variables simultaneously. A fully automated method called PropSeg, which accounts for such variability, has recently been developed (De Leener et al., 2014). PropSeg is based on the iterative propagation of a deformable model with adaptive contrast mechanisms and offers fast and reliable measurements of the cord CSA in a matter of seconds, as demonstrated in a pilot study of healthy volunteers and people with spinal cord injury (De Leener et al., 2014); importantly, the method has been reported to work with T1-, T2- and T2*-weighted acquisitions and at any level of the spinal cord.
In this study we evaluate PropSeg, as compared to the widely used semi-automated AS method (Horsfield et al., 2010), in a large cohort of healthy controls and people with MS, in order to test the following hypotheses: (i) PropSeg provides reproducible CSA measurements in the cervical SC. (ii) A reduction in CSA in the cervical SC, seen longitudinally in MS, can be reliably measured with PropSeg. (iii) There are associations between cervical SC CSA measures derived by PropSeg and clinical scores in MS.

Study participants
MRI data from 120 people were retrospectively analysed; 26 healthy controls (HC), 21 people with clinically isolated syndrome (CIS), 26 relapsing remitting (RR) MS, 21 secondary progressive (SP) MS and 26 primary progressive (PP) MS. The inclusion criteria for the CIS cohort, and the criteria used for MS diagnosis and MS subgroup classification, have been reported previously (Kearney et al., 2014b, 2015a). All people with CIS and MS had Expanded Disability Status Scale (EDSS) (Kurtzke, 1983) and Multiple Sclerosis Functional Composite (MSFC) score (Fischer et al., 1999) determined by the same neurostatus certified assessor. Z-scores for the 25-foot timed walk test (TWT), 9-hole peg test (HPT) and 3 s paced auditory serial addition test B (PASAT) were calculated using published normative values. For those participants who could not perform the TWT and HPT, an arbitrary value of 180 s or 300 s was assigned to that test, respectively. In addition, the American Spinal Injury Association (ASIA) motor (m) and sensory (s) scores (Maynard et al., 1997) were recorded for all participants. All clinical assessments were performed immediately before the MRI study. Demographic and clinical characteristics are summarised in Table 1. A total of 40 people returned for follow-up assessment, with MRI and clinical assessments repeated at the second visit; 10 HC (4 female (F), mean age (SD): 43.4 (8.9) years), 10 RRMS (6 F, 40.5 (9) years), 10 SPMS (4 F, 56.3 (5.9) years) and 10 PPMS (2 F, 56.2 (8.5) years). The mean (SD) interval to the follow-up visit was 14 (5.2) months for HC, 24 (3.74) months for RRMS, 16.3 (3.6) months for SPMS and 14.8 (4.9) months for PPMS.
Informed written consent was obtained from each study participant prior to inclusion in the study. The study received approval from the local Institutional Ethics Committee.

MRI acquisition protocol
Imaging was performed using a 3 T Philips Achieva MRI system with RF dual-transmit technology (Philips Medical Systems, Best, Netherlands) and the manufacturer's product 16-channel neurovascular coil.
The whole cervical cord was imaged using a magnetization-prepared 3D T1-weighted acquisition (with isotropic voxel size of 1 mm³) in the sagittal plane with FOV = 256 × 256 mm², matrix size = 256 × 256, TR = 8 ms, TE = 3.7 ms, TI = 860 ms (using linear k-space profile order), SENSE = 2 in the anterior-posterior direction and TFE factor of 205; the scan time for the acquisition was 6:30 min.

Image analysis
The 3D T1-weighted volume obtained from each study participant was processed using both the active surface (AS) (Horsfield et al., 2010) (Jim 6.0_019; http://www.xinapse.com/) and PropSeg (De Leener et al., 2014) (Spinal Cord Toolbox version 1.0; https://sourceforge.net/projects/spinalcordtoolbox/) segmentation methods in two different ways, which provide the CSA at C2/C3 and between C2 and C5, respectively: i) by reformatting the original sagittal volume in the axial plane and extracting 15 contiguous 1 mm thick slices orthogonal to the longitudinal axis of the cervical cord, centred at the C2/C3 level (this was done using the multi-planar reconstruction option available within Jim 6.0, which allows the handle of the reformatted volume to be manually positioned orthogonal to the longitudinal axis of the cervical cord, centred at C2/C3; the volume was subsequently resampled using sinc interpolation along the slice direction); and ii) by using the axial reformatted volume obtained from i), only this time processing a larger number of axial slices to cover the section of the cervical cord from the top of C2 to the base of the C5 vertebral body, as previously reported (Horsfield et al., 2010). The rationale for selecting and processing these two segments of the cervical cord in this study is based on previously published methods in MS, which were shown to offer reproducible atrophy measurements and/or were used to investigate the possibility that specific levels of the cervical cord were particularly sensitive to MS-related atrophy (Horsfield et al., 2010; Kearney et al., 2014a; Losseff et al., 1996; Rocca et al., 2011).

AS analysis method
Using the AS method, each scan was processed by a single rater (MY) as follows: a seed point was first placed in the centre of the cord on the most superior axial slice in which the odontoid process of the axis (C2) was still visible. The next seed point was placed in the centre of the cord on the slice that passed through the inferior border of C5. Starting at C5 and moving superiorly, a seed point was placed in the centre of the cord on every tenth slice until the seed point at the top of C2 was reached (Horsfield et al., 2010) (see Fig. 1 A-C). In this way, the boundary of the cord on all slices from C2 to C5 was identified and 15 slices corresponding to the C2/C3 level were subsequently processed for method i) and all the slices processed for method ii).

PropSeg analysis method
Using the PropSeg method, all 3D T1-weighted volumes were processed in their original form (sagittal plane) taking only a few minutes in total, simply by specifying the directory storing all the data. The processed volumes containing the binary mask of the whole cervical SC were then reformatted in the axial plane to match the processing of the AS method i.e. by extracting the equivalent slices as per i) and ii) described earlier. Fig. 1 D-F shows an example of the result obtained using PropSeg with the original sagittal volume and an example of a single axial reformatted slice through the C2/C3 intervertebral disc showing the cord contour identified using both PropSeg and AS segmentation methods for comparison.
Whilst for i) an equal number of slices was processed in all cases, the number of slices processed in ii) was not always the same due to anatomical variability; for this reason, CSA measurements were normalized by the number of slices as previously suggested (Healy et al., 2012).
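The normalization by slice count described above can be illustrated with a minimal Python sketch. It assumes that the binary cord mask is available as a list of axial slices, each a 2D list of 0/1 values, and that the in-plane pixel area is 1 mm² (as for the 1 mm isotropic acquisition). The function name `mean_csa` and the toy mask are hypothetical and are not part of PropSeg or Jim; counting whole voxels is a simplification that ignores partial-volume effects at the cord boundary.

```python
def mean_csa(mask_slices, pixel_area_mm2=1.0):
    """Return the mean cross-sectional area (mm^2) over axial mask slices.

    mask_slices: list of axial slices, each a 2D list of 0/1 voxel labels.
    pixel_area_mm2: assumed in-plane area of one voxel.
    """
    if not mask_slices:
        raise ValueError("at least one axial slice is required")
    # Area of each slice = number of cord voxels x in-plane pixel area.
    per_slice_area = [
        sum(sum(row) for row in axial_slice) * pixel_area_mm2
        for axial_slice in mask_slices
    ]
    # Dividing by the slice count makes segments of different length comparable.
    return sum(per_slice_area) / len(per_slice_area)


# Toy example (hypothetical data): two 3 x 3 slices with 4 and 6 cord voxels.
toy_mask = [
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[1, 1, 1], [1, 1, 1], [0, 0, 0]],
]
toy_mean_csa = mean_csa(toy_mask)
```

With this convention, a C2/C5 segment spanning more slices in one participant than in another still yields a single area value per participant that can be compared directly.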

Reproducibility assessment
Since the PropSeg method inherently outputs the same result each time the same scan is analysed, i.e. intra- and inter-observer reproducibility COV = 0%, the most appropriate test of reproducibility in this case was the ability of each segmentation method to obtain near-identical measurements when a number of the study participants underwent the same MRI examination twice (i.e. 'scan-rescan' assessment). For this purpose, 8 healthy controls (6 males, mean age 33.5, SD 6.7) and 8 people with MS (5 females, mean age 43.3, SD 11.3, 4 RRMS, 4 SPMS) were scanned twice during the same visit, being removed from the scanner and repositioned between the scans.
For the assessment of scan-rescan reproducibility, the intraclass correlation coefficient (ICC) was calculated and subsequently 1-ICC was reported; 1-ICC provides an estimate of the fraction of variability due to measurement error (within-subject) over the total variation, i.e. biological variation (between-subject) and within-subject variation (Bartlett and Frost, 2008). 95% confidence intervals (CI) and p-values were obtained using bias-corrected and accelerated non-parametric bootstrap with 1000 replicates. Mean changes in cord CSA over one year were investigated for each participant group (apart from CIS), each cervical SC segment and each segmentation method. For each group of participants and for each cervical SC segment, a formal comparison between mean CSA changes obtained with the two segmentation methods was performed using multivariate (bivariate) structural equation regression models; in this context these essentially fit two regression models simultaneously, allowing the comparison, across models, of relevant coefficients.
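As an illustration of the scan-rescan statistic, the following Python sketch computes an ICC and the corresponding 1-ICC from paired measurements. The text does not state which form of the ICC was used, so the one-way random-effects form for single measurements is an assumption here; the function name `icc_oneway` and the CSA values are hypothetical, and the bias-corrected and accelerated bootstrap used for the confidence intervals is not reproduced.

```python
def icc_oneway(measurements):
    """One-way random-effects ICC for n subjects, each measured k times.

    measurements: list of per-subject tuples, e.g. [(scan, rescan), ...].
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(measurements)
    if n < 2:
        raise ValueError("need at least 2 subjects")
    k = len(measurements[0])
    if k < 2 or any(len(m) != k for m in measurements):
        raise ValueError("each subject needs the same number (>= 2) of scans")

    grand_mean = sum(sum(m) for m in measurements) / (n * k)
    subject_means = [sum(m) / k for m in measurements]

    # Between-subject mean square (biological variation plus error).
    ms_between = k * sum((sm - grand_mean) ** 2 for sm in subject_means) / (n - 1)
    # Within-subject mean square (measurement error).
    ms_within = sum(
        (x - sm) ** 2 for m, sm in zip(measurements, subject_means) for x in m
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scan-rescan CSA values in mm^2 for three participants.
scan_rescan = [(70.0, 71.0), (80.0, 79.0), (90.0, 91.0)]
icc = icc_oneway(scan_rescan)
one_minus_icc = 1.0 - icc  # fraction of total variation due to measurement error
```

When the rescan differences are small relative to the spread between participants, as in this toy data, the ICC approaches 1 and 1-ICC approaches 0.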
To assess the potential usefulness of the CSA measure to detect change, the change ratio (CR), the ratio of the mean of within-subject changes/standard deviation (SD) of within-subject changes, was calculated for each segmentation method, each cord segment, and each group. This is because the sensitivity to change, or the power of a method, is related not to the absolute magnitude of the change but to the change relative to the SD of changes. To assess the potential sensitivity to patient pathology, for each patient group, effect size (ES) was calculated as the difference between the mean CSA change in that patient group and the mean CSA change in the control group, divided by the SD of the change in the patient group; again the magnitude of the difference relative to SD is crucial. For both CR and ES measures, higher values indicated a greater sensitivity and power of the MRI measure to detect change or difference. MRI measures with large CR denote that the individuals of a group show a homogeneously large amount of change over time relative to the 'noise'. Similarly, MRI measures with large ES would reflect a large and homogeneous difference between the change in a given group of patients and that in controls.
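The two ratios defined above can be written down directly. The Python sketch below follows the definitions in the text; the use of the sample SD (n − 1 denominator) is an assumption, and the function names and change values are hypothetical.

```python
from statistics import mean, stdev


def change_ratio(changes):
    """CR: mean of within-subject changes / SD of within-subject changes."""
    return mean(changes) / stdev(changes)


def effect_size(patient_changes, control_changes):
    """ES: (mean patient change - mean control change) / SD of patient change."""
    return (mean(patient_changes) - mean(control_changes)) / stdev(patient_changes)


# Hypothetical within-subject CSA changes (follow-up minus baseline, mm^2).
patient_changes = [-2.0, -4.0, -6.0]
control_changes = [-1.0, 0.0, 1.0]

cr_patients = change_ratio(patient_changes)
es_patients = effect_size(patient_changes, control_changes)
```

With changes defined as follow-up minus baseline, atrophy gives negative CR and ES values, so "higher" sensitivity corresponds to a larger magnitude under this sign convention.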

Associations between CSA measures and clinical scores
In order to investigate and compare associations between the clinical measures and the CSA measures from the two segmentation methods, each baseline clinical variable was used as the response (dependent) variable in linear regression models, fitted for each cervical cord segment, with the following explanatory variables: i) baseline AS-derived CSA; ii) baseline PropSeg CSA; iii) both AS and PropSeg baseline CSA. For each clinical variable, a comparison was made between the R-square of the model in i) and that of the model in ii). Models obtained in iii) were used to assess the comparative potential of the two segmentation methods to explain the variability of the clinical variable. Similar models were fitted using one-year MRI and one-year clinical measures, and using baseline MRI and one-year clinical measures. In this exploratory work, a number of statistical tests were performed. However, these examined several distinct null hypotheses rather than a single one; for this reason, no adjustment for multiple comparisons was made (Perneger, 1998). The significance level was set at 5%.
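The comparison of R-square between models i) and ii) can be sketched in Python as follows. This is a minimal illustration for single-predictor models only; the function name `r_squared` and all data values are hypothetical, and model iii), which includes both CSA measures, would require multiple regression and is not covered here.

```python
def r_squared(x, y):
    """R-square of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares relative to the total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Hypothetical baseline data for five participants.
clinical_score = [2.0, 3.5, 4.0, 6.0, 6.5]    # response, e.g. a disability score
csa_as = [82.0, 78.0, 75.0, 70.0, 71.0]       # model i): AS-derived CSA (mm^2)
csa_propseg = [81.0, 77.5, 76.0, 70.5, 69.0]  # model ii): PropSeg CSA (mm^2)

r2_as = r_squared(csa_as, clinical_score)
r2_propseg = r_squared(csa_propseg, clinical_score)
```

Comparing `r2_as` with `r2_propseg` for the same clinical variable mirrors the comparison between models i) and ii); a higher value indicates that the corresponding CSA measure explains more of the variability in the clinical score.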

Results
Representative mean (SD) CSA measurements obtained at each cervical SC level with each segmentation method and for each participant group at baseline are shown in Table 1. Out of 160 scans processed in total, PropSeg failed to correctly segment the cord only in 3 cases (1 healthy control and 2 RRMS), and these cases were manually processed by inserting seed points in the centre of the cord prior to the segmentation; the presence of MS lesions had no obvious effect on the performance of the segmentation method (see Fig. 2). Segmentation of the cord using the AS method was successful in all cases.

Reproducibility assessment
In the HC group, the estimated ICC values for the C2/C3 and C2/C5 levels were very similar using both segmentation methods. In the patient group, the estimated ICC values were slightly higher using the AS method than PropSeg, for both cervical cord levels. Nevertheless, the estimated ICC values were always above 0.98 (Table 2).

Change in CSA over time and effect size calculations
For the C2/C3 level, mean changes measured using the AS and PropSeg methods were not significantly different for any of the groups, except for borderline evidence of a greater (negative) change in RRMS using PropSeg than AS (p = 0.0425).
For the C2/C5 level, in controls there was no evidence of any of the changes over time, with either PropSeg or AS, being different from zero. No differences were observed between methods for the other groups.
Apart from in PPMS, the CR of one-year change in CSA was slightly higher using PropSeg than AS for both segments of the cervical SC (Table 3); CSA reduction was greater in patients than in controls, although the difference was not statistically significant.
The ES for the C2/C3 level was better using PropSeg than AS, apart from in the PPMS group. For the C2/C5 level, by contrast, PropSeg was worse than AS in all patient groups (Table 4).

Associations between CSA measures and clinical scores
Univariable models showed that baseline CSA measures for both segments of the SC were significantly associated with baseline clinical variables, for both segmentation methods (Table 5). At one-year follow-up, CSA measures were only significantly associated with ASIA-m and ASIA-s, for both cervical SC segments (Table 6). Baseline CSA measures for both segments predicted ASIA-m scores at one-year follow-up, for PropSeg and AS methods (for C2/C3: p = 0.001 and p = 0.003, respectively; for C2/C5: p < 0.001 and p = 0.001, respectively) (Table 7).
Multiple regression models showed that baseline CSA measures obtained with the PropSeg method were better than those obtained with the AS method at explaining the variability of the EDSS and ASIA-m (for the C2/C3 segment: p = 0.085 and p = 0.022, respectively; for the C2/C5 segment: p = 0.049 and p = 0.048). Additionally, baseline PropSeg measures at C2/C3 explained the variability of ASIA-s better than the AS method (p = 0.020). At one-year follow-up, there was no evidence that either method was better than the other at explaining the variability of any of the clinical measures. As regards prediction analyses, there was borderline evidence that the PropSeg method (C2/C5 measures at baseline) explained the variability of ASIA-m at one-year follow-up better than the AS method (also C2/C5 measures at baseline).

Discussion
This is the first study to apply a fully automated method (PropSeg) of spinal cord area measurement to people with MS. The results of this study demonstrate that: firstly, PropSeg provides a reproducible measurement of cord area both in healthy controls and people with MS, similarly to the widely used AS (Horsfield et al., 2010); secondly, PropSeg seems to be able to detect changes over time reliably, at least with the same sensitivity as the AS method; thirdly, PropSeg provides cord area measures that reflect physical disability, as shown by the presence of significant associations between obtained cord area values and physical disability, as well as being predictive of a specific measure of spinal cord dysfunction (ASIA-m) at one-year follow-up.
The current study demonstrates that a fully automated software package may be used to measure cord area in MS, acknowledging that only T1-weighted MRI was used in this particular study; the use of any other type of contrast, or the application of PropSeg to other neurological conditions, merits investigation in its own right. As the software is automated, we chose to demonstrate its reproducibility by measuring the scan-rescan ICC, which was > 0.98 for two different segments of the cervical cord. This agrees strongly with the AS measurements obtained in the current study and in a previous study that also measured its reproducibility (Kearney et al., 2014a). Here, manual intervention was required to identify the vertebral levels (i.e. slices corresponding to C2/C3 and C2/C5), to ensure a direct comparison between the two segmentation methods. However, a new feature has recently been added to PropSeg that automatically identifies vertebral levels using template-based approaches, allowing the user to prespecify the cord segment(s) of interest (De Leener et al., in press). In conjunction with the probabilistic mapping of spinal levels based on vertebral levels (Cadotte et al., 2015), such information might provide a more specific association between clinical deficits and the level of spinal cord atrophy, as shown in ALS (Cohen-Adad et al., 2013).
In order to use cord atrophy as an endpoint for a clinical trial the methodology must be sufficiently sensitive to a small reduction in cord area. In the current study we have shown that the reductions in cord area observed over one year, although not significant, were in line with the AS method, used as an anchor measure in this study. The lack of significant reduction may relate to the smaller number of patients followed up, which likely reduced the statistical power. The high CR and ES values obtained with PropSeg further emphasise the robustness of this technique when applied to a longitudinal study of cord atrophy in MS. However, the worse ES result observed at C2/C5 in all groups using PropSeg could be additionally informative. Bearing in mind the higher CR values of PropSeg as compared to AS at that level, coupled with the slightly lower ICC values observed for both methods at C2/C5 as compared to the C2/C3 level, this may be indicative of a reduced reliability of atrophy measurements when obtained at the C2/C5 level. The potential pitfalls of studying the C2/C5 level as opposed to C2/C3 have been mentioned elsewhere (Kearney et al., 2014a; Losseff et al., 1996; Reid, 1960) and the significance and relative clinical impact of the small variations in ICC, CR, and ES identified in this study have not been examined specifically. Nevertheless, it has been shown that, at least for the C2/C3 level, PropSeg provides reproducible measurements and can detect change with at least the same sensitivity as the AS method.

Table 2. Intraclass correlation coefficient (ICC, with 95% CI) for scan-rescan reliability using PropSeg and AS segmentation methods for measuring the cervical cord cross-sectional area (CSA) at the C2/C3 and C2/C5 levels in healthy control (n = 8) and MS cases (n = 8).
Owing to the longitudinal nature of this present study we were also able to examine the predictive ability of cord atrophy, in relation to physical disability in MS. The spinal cord specific measure of motor disability used in this study (ASIA-m) was predicted by cord atrophy using the fully automated PropSeg method. Importantly, as regards the univariable models, the obtained R-squared values were generally at least as high in the PropSeg method models as in the AS method models, and several times higher in the clinical models using PropSeg measures. However, more commonly used scales of physical disability in MS (such as the EDSS and MSFC) were not predicted by either the PropSeg method or the AS method.

Table 5. Associations between CSA measures and clinical scores at baseline (unadjusted). Columns, for each of PropSeg and AS: regression coefficient (95% CI), p-value, R².

Limitations and future directions
A number of limitations should be considered when interpreting the results of this study. Firstly, PropSeg has been evaluated in MS using only T1-weighted images and therefore the performance of the method using other forms of contrast in MS will need to be investigated specifically. However, due to the time-efficient and fully automated nature of the method, such assessments may be easily carried out on retrospective data.
As previously mentioned, only a subset of the cohort included at baseline was followed up at one year. This could be addressed in a future longitudinal study, in which a greater number of the baseline participants are followed up. One factor that may facilitate such a study would be to include people with progressive MS who have lower levels of physical disability, so that with time severe disability does not become a prohibitive factor for scanning.
Furthermore, the followed-up cohort consisted of people with different subgroups of MS. This may conceivably have influenced the overall rate of atrophy observed, thereby influencing the predictive power of this MRI parameter. A future longitudinal study containing either a single subgroup of MS or a sufficiently large cohort, so that between-group factors can be analysed, would be of importance.
Lastly, the current study was performed on data acquired in a single centre. Although the results obtained were in line with the hypotheses being investigated, many clinical trials in MS are performed in multiple centres and these hypotheses have not been tested in such a scenario. It would therefore be of importance to determine the sensitivity of the PropSeg method to pathology when different scanner manufacturers are introduced as a confounder. This could be addressed by analysing data from existing (or future) multi-centre trials in MS that include imaging of the spinal cord.

Conclusion
This study demonstrates that spinal cord atrophy may be measured reliably in multiple sclerosis using a fully automated image segmentation method. These results have direct implications for future clinical trials for neuroprotection in progressive MS, where previous attempts at spinal cord atrophy measurement have been limited by factors relating to reproducibility or sensitivity to change, both of which are addressed in this study.