Translating state-of-the-art spinal cord MRI techniques to clinical use: A systematic review of clinical studies utilizing DTI, MT, MWF, MRS, and fMRI

Background A recent meeting of international imaging experts sponsored by the International Spinal Research Trust (ISRT) and the Wings for Life Foundation identified 5 state-of-the-art MRI techniques with potential to transform the field of spinal cord imaging by elucidating elements of the microstructure and function: diffusion tensor imaging (DTI), magnetization transfer (MT), myelin water fraction (MWF), MR spectroscopy (MRS), and functional MRI (fMRI). However, the progress toward clinical translation of these techniques has not been established. Methods A systematic review of the English literature was conducted using MEDLINE, MEDLINE-in-Progress, Embase, and Cochrane databases to identify all human studies that investigated utility, in terms of diagnosis, correlation with disability, and prediction of outcomes, of these promising techniques in pathologies affecting the spinal cord. Data regarding study design, subject characteristics, MRI methods, clinical measures of impairment, and analysis techniques were extracted and tabulated to identify trends and commonalities. The studies were assessed for risk of bias, and the overall quality of evidence was assessed for each specific finding using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework. Results A total of 6597 unique citations were identified in the database search, and after full-text review of 274 articles, a total of 104 relevant studies were identified for final inclusion (97% from the initial database search). Among these, 69 studies utilized DTI and 25 used MT, with both techniques showing an increased number of publications in recent years. The review also identified 1 MWF study, 11 MRS studies, and 8 fMRI studies. Most of the studies were exploratory in nature, lacking a priori hypotheses and showing a high (72%) or moderately high (20%) risk of bias, due to issues with study design, acquisition techniques, and analysis methods. 
The acquisitions for each technique varied widely across studies, rendering direct comparisons of metrics invalid. The DTI metric fractional anisotropy (FA) had the strongest evidence of utility, with moderate quality evidence for its use as a biomarker showing correlation with disability in several clinical pathologies, and a low level of evidence that it identifies tissue injury (in terms of group differences) compared with healthy controls. However, insufficient evidence exists to determine its utility as a sensitive and specific diagnostic test or as a tool to predict clinical outcomes. Very low quality evidence suggests that other metrics also show group differences compared with controls, including DTI metrics mean diffusivity (MD) and radial diffusivity (RD), the diffusional kurtosis imaging (DKI) metric mean kurtosis (MK), MT metrics MT ratio (MTR) and MT cerebrospinal fluid ratio (MTCSF), and the MRS metric of N-acetylaspartate (NAA) concentration, although these results were somewhat inconsistent. Conclusions State-of-the-art spinal cord MRI techniques are emerging with great potential to improve the diagnosis and management of various spinal pathologies, but the current body of evidence has shown only limited clinical utility to date. Among these imaging tools, DTI is the most mature, but further work is necessary to standardize and validate its use before it will be adopted in the clinical realm. Large, well-designed studies with a priori hypotheses, standardized acquisition methods, detailed clinical data collection, and robust automated analysis techniques are needed to fully demonstrate the potential of these rapidly evolving techniques.


Background
The advent of magnetic resonance imaging (MRI) in the mid-1980s transformed the field of spinal cord imaging and provided clinicians with high-resolution anatomical images, directly leading to improved clinical decision-making. Conventional MRI techniques (spin echo, gradient echo, and inversion recovery sequences, with T1-, T2-, or proton density-weighting) have continued to mature over 3 decades of use, establishing MRI as the imaging modality of choice for most spinal disorders. However, conventional MRI provides little information regarding the health and integrity of the spinal cord tissue itself, because signal intensity changes are non-specific and do not correspond directly with aberrant physiological processes (Wada et al., 1995). This is reflected in the poor correlation of conventional MRI data with neurological and functional impairment in various spinal cord pathologies (Tetreault et al., 2013; Wilson et al., 2012), and in its failure to provide reliable prognostic information. In the degenerative condition cervical spondylotic myelopathy (CSM), weak correlates with clinical status have been identified using T2-weighted hyper-intensity (T2w-HI), T1-weighted (T1w) hypo-intensity, and measures of cord compression (Matsuda et al., 1999; Tetreault et al., 2013; Wada et al., 1995). In multiple sclerosis (MS), numerous studies have found that spinal cord lesion load is less important than atrophy, measured as the cross-sectional area (CSA) of the cord (Stevenson et al., 1998). As a result, conventional MRI techniques are of limited value in developing imaging biomarkers or predicting clinical outcomes because they are not sensitive and specific measures of the degenerative and regenerative changes that occur within the spinal cord at the microstructural and functional levels.
A 2013 international meeting of spinal cord imaging experts, sponsored by the International Spinal Research Trust (ISRT) and the Wings for Life (WfL) Spinal Cord Research Foundation, outlined 5 emerging MRI techniques that have the potential to revolutionize the field by elucidating details of the microstructure and functional organization within the spinal cord (Wheeler-Kingshott et al., 2014). This group highlighted the following 4 techniques due to their ability to characterize microstructural features of the spinal cord: diffusion tensor imaging (DTI), magnetization transfer (MT), myelin water fraction (MWF), and magnetic resonance spectroscopy (MRS). DTI measures the directional diffusivity of water, and several of the metrics that it produces correlate with axonal integrity and, to a lesser degree, myelination (Wheeler-Kingshott et al., 2002). MT involves an off-resonance saturating pre-pulse that takes advantage of the chemical and magnetization exchange between protons bound to lipid macromolecules and nearby water protons, and provides a surrogate measure of myelin quantity (Graham and Henkelman, 1997). This is most often expressed as a ratio between scans with and without the pre-pulse (MTR) or between the spinal cord and cerebrospinal fluid (MTCSF). MWF estimates the fraction of tissue water bound to the myelin sheath by fitting the T2 relaxation curve to a multi-exponential model and identifying the fraction of the signal with a T2 parameter between 15 and 40 ms (Wu et al., 2006). MRS quantifies either the absolute or relative concentrations of specific molecules of interest within a single large voxel, including N-acetylaspartate (NAA), myo-inositol (Ins), choline (Cho), creatine (Cre), and lactate (Lac) (Gomez-Anson et al., 2000).
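The two MT ratios described above can be illustrated with a minimal sketch. The function names and the example signal values are hypothetical, and the MTCSF form shown (cord signal divided by CSF signal in the MT-weighted image) is an assumption based on the one-line description in the text rather than a definition taken from any of the reviewed studies.

```python
def mt_ratio(signal_without_prepulse, signal_with_prepulse):
    """Return MTR in percent: the relative signal drop caused by the pre-pulse."""
    if signal_without_prepulse <= 0:
        raise ValueError("reference signal must be positive")
    drop = signal_without_prepulse - signal_with_prepulse
    return 100.0 * drop / signal_without_prepulse


def mt_csf_ratio(cord_signal, csf_signal):
    """Return MTCSF (assumed form): cord signal normalized by CSF signal."""
    if csf_signal <= 0:
        raise ValueError("CSF signal must be positive")
    return cord_signal / csf_signal


# Hypothetical voxel: 1000 a.u. without the pre-pulse, 650 a.u. with it.
print(mt_ratio(1000.0, 650.0))
print(mt_csf_ratio(400.0, 800.0))
```

Under these definitions, loss of myelin would be expected to lower MTR and raise MTCSF, which matches the direction of the group differences summarized later in the review.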
The expert panel also highlighted functional MRI (fMRI) of the spinal cord, due to its potential to characterize changes in neurological function, using either blood oxygen-level dependent (BOLD) contrast, which relies upon the concept of neuro-vascular coupling in which changes in neurological function produce corresponding changes in local blood flow, or signal enhancement by extravascular protons (SEEP), which is thought to detect neural activity indirectly through changes in the intracellular/extracellular volume ratio (Stroman et al., 2001). fMRI studies can involve a variety of designs, including motor tasks or sensory stimuli in block or event-related designs, and can visualize and provide indirect measures reflecting neuronal activity and connectivity occurring within the spinal cord.
All 5 of these emerging MRI techniques are highly amenable to quantitative analysis, offering the opportunity to develop quantitative MRI biomarkers that correlate with disability and/or predict outcomes. The development of these techniques may also provide more sensitive and specific diagnostic tests. For example, in the earliest stages of CSM, symptoms may include vague complaints of numbness and neck pain, and it may be unclear whether the cause is early myelopathy or musculoskeletal pain and peripheral nerve compression. Objective evidence of damage to the cord tissue could provide important information to prompt earlier surgery. Furthermore, quantitative biomarkers could act as surrogate outcome measures in clinical trials, such as trials of therapeutic remyelination agents in MS or spinal cord injury (SCI), providing short-term end-points and reducing the time and costs associated with novel drug development (Cadotte and Fehlings, 2013). In acute SCI, these techniques could potentially discriminate reversible and irreversible components of damage (demyelination, axonal loss, gray matter loss) early after injury, and thus provide a more accurate prognosis to help guide therapeutic strategies and focus rehabilitation resources.
Unfortunately, the application of these advanced MRI techniques to image the spinal cord is far from trivial. These techniques were initially developed and validated in brain imaging, but the spinal cord is a far more challenging structure from which to obtain accurate data. In fact, the spine is among the most hostile environments in the body for MRI, due to magnetic field inhomogeneity at the interfaces between bone, intervertebral disk, and cerebrospinal fluid (CSF), and also because of the small size of the cord and its white matter tracts, and the relatively large motion of the cord during cardiac and respiratory cycles. High-quality spinal cord imaging using these methods has only recently been achieved, requiring specialized acquisition sequences, complex shimming, custom receive coils, long acquisition times, and substantial post-processing to correct for motion, aliasing, and other artifacts.
This systematic review aims to summarize the progress of clinical translation of these imaging techniques to date, and identify the most common technical methods employed. The review will also highlight the major barriers that are currently preventing the adoption of these techniques into clinical use. The search was designed to identify all studies that applied one or more of these MRI techniques to assess for clinical utility in one or more of the following 3 key questions:
1. Diagnostic utility: Does the MRI technique provide metrics that demonstrate group differences or improved diagnostic accuracy (sensitivity/specificity) in the diagnosis of spinal pathologies?
2. Biomarker utility: Does the advanced MRI technique generate metrics that quantify the amount of injury and thus correlate with neurological/functional impairment and/or show longitudinal changes over time that correlate with changes in disability in spinal pathologies?
3. Predictive utility: Does the advanced MRI technique generate metrics that predict neurological, functional, or quality of life outcomes in spinal pathologies?

Electronic literature search
A systematic search of MEDLINE, MEDLINE-in-Progress, Embase, and Cochrane databases was conducted, with the results formatted in accordance with the PRISMA statement for systematic reviews and meta-analyses (Liberati et al., 2009). The search included literature published from January 1, 1985 to June 1, 2015 and sought all studies that describe the use of one or more of the state-of-the-art spinal cord MRI techniques (DTI, MT, MWF, MRS, and fMRI) on subjects with any clinical pathology (complete search terms listed in Appendix A, inclusion/exclusion criteria in Table 1). Studies that employed diffusion kurtosis imaging (DKI), an extension of DTI using multiple b-values, were included as these studies typically also report DTI metrics in addition to measures of kurtosis. Studies that employed advanced MRI techniques to image only the brain were excluded (e.g. brain MRS in CSM). We also excluded studies utilizing diffusion-weighted imaging (DWI) that only calculated an apparent diffusion coefficient, but did not calculate tensors (which require the use of diffusion-sensitizing gradients in at least 6 directions) or tensor-derived metrics such as fractional anisotropy (FA), axial diffusivity (AD), and radial diffusivity (RD). The search was limited to human studies, but no limits were placed on study design. Abstracts identified in the initial search were reviewed by 3 of the authors (A.R.M., I.A., N.S.) to determine relevant manuscripts for full-text review.
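For readers less familiar with the tensor-derived metrics named above, the following is a minimal sketch of how FA, MD, AD, and RD follow from the three eigenvalues of a fitted diffusion tensor, using the standard textbook formulas. The function name and the example eigenvalues are illustrative assumptions, not values from any reviewed study. The symmetric tensor has 6 independent elements, which is why at least 6 diffusion-encoding directions are needed to fit it.

```python
import math


def tensor_metrics(l1, l2, l3):
    """Compute scalar DTI metrics from eigenvalues sorted so that l1 >= l2 >= l3."""
    md = (l1 + l2 + l3) / 3.0          # mean diffusivity
    ad = l1                            # axial diffusivity (along the principal axis)
    rd = (l2 + l3) / 2.0               # radial diffusivity (perpendicular to it)
    norm_sq = l1 ** 2 + l2 ** 2 + l3 ** 2
    if norm_sq == 0:
        fa = 0.0                       # no diffusion at all: define FA as 0
    else:
        dev_sq = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
        fa = math.sqrt(1.5 * dev_sq / norm_sq)
    return {"FA": fa, "MD": md, "AD": ad, "RD": rd}


# Hypothetical white matter voxel, eigenvalues in units of 1e-3 mm^2/s.
print(tensor_metrics(1.8, 0.4, 0.3))
```

FA ranges from 0 (isotropic diffusion) to 1 (diffusion along a single axis), so partial volume contamination with freely diffusing CSF tends to pull FA downward.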
The inclusion criteria required that studies were original research that appeared to answer one or more of the key questions above and included a minimum of 24 total subjects, with at least 12 of these subjects with a specific spinal pathology. Thus, we included studies with at least 24 pathological subjects (with no control subjects), and studies with at least 12 pathological subjects and a total of at least 24 subjects (including controls). Studies that included 3 or more different groups for comparison (e.g. NMO vs. MS vs. healthy) were required to have at least 12 subjects with the primary pathology of interest. Case reports or smaller series, meeting abstracts, white papers, editorials, review papers, technical reports, or studies of only healthy subjects were excluded. The full text of each article was then analyzed by 2 of the authors (A.R.M., I.A.) in the context of each key question to determine suitability for final inclusion, with discrepancies resolved by discussion. If multiple articles were identified with redundant results based on the same group of subjects, only the most relevant article (larger sample size or more recent publication) was kept in the review. References of each full-text article and each review paper that were identified were also systematically checked to identify additional eligible articles (Fig. 1).
Table 1 (partial; only the rows below survive in this copy). Inclusion and exclusion criteria.
Exclusion (row heading not recovered): subjective or unvalidated outcome measures.
Study design, inclusion: retrospective or prospective cohort studies designed to assess the ability of an imaging factor to make a diagnosis, correlate with neurological/functional impairment, or predict neurological/functional outcome after at least 3 months; minimum 24 total subjects, with at least 12 having the spinal pathological condition of interest.
Study design, exclusion: review articles; opinions; technical reports; studies in healthy controls; animal or biomechanical studies.
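The sample-size rule stated above reduces to two thresholds, sketched here as a simple predicate. The function and parameter names are hypothetical; for studies with 3 or more groups, `n_pathological` is taken to mean the primary pathology of interest, as the text requires.

```python
MIN_TOTAL_SUBJECTS = 24
MIN_PATHOLOGICAL_SUBJECTS = 12


def meets_sample_size_criteria(n_pathological, n_controls=0):
    """Return True if a study satisfies the review's minimum sample sizes."""
    n_total = n_pathological + n_controls
    return (n_total >= MIN_TOTAL_SUBJECTS
            and n_pathological >= MIN_PATHOLOGICAL_SUBJECTS)


print(meets_sample_size_criteria(24))        # patients only, no controls
print(meets_sample_size_criteria(12, 12))    # 12 patients plus 12 controls
print(meets_sample_size_criteria(10, 30))    # too few pathological subjects
```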
For key question 1 (diagnostic utility), we sought all articles that compared the presence or absence of a specific MRI feature or the value of a quantitative metric between patients and controls, relating to diagnosis. For question 2 (biomarker utility), we sought articles that identified relationships between MRI metrics and measures of clinical disability, including the calculation of correlation coefficients (Pearson, Spearman, or multivariate) or identification of differences between severity groups. To be relevant to key question 3 (predictive utility), studies needed to assess the relationship between baseline MRI metrics and follow-up clinical data at a specified time at least 3 months after the initial imaging.

Data extraction
For each of the articles that met all inclusion/exclusion criteria after full-text review, the following data were extracted redundantly by 2 of the authors (A.R.M., Z.T.): study design, subject characteristics (age, gender, diagnosis, treatment(s) administered), follow-up duration, MRI sequences, MRI acquisition parameters, MRI data analysis methods, clinical data recorded, and results pertaining to diagnosis, correlation with disability, and correlation with outcomes. Differences in extracted data were resolved by discussion.

Data analysis and synthesis
Regarding diagnosis, we analyzed group differences and their statistical significance (P-value), and also the number of subjects with each specific MRI feature, present or absent (or a quantity above/below a threshold), that was reported for pathological and healthy subjects, to assess sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV). For correlations with disability and prediction of clinical outcomes, we collected results that were reported as odds ratios, univariate or multivariate correlation coefficients, and P-values.
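The four diagnostic accuracy measures named above follow directly from a 2 × 2 table of MRI result (abnormal/normal) against true status (pathological/healthy). The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not data from any reviewed study.

```python
def diagnostic_accuracy(tp, fn, fp, tn):
    """Compute SE, SP, PPV, and NPV from the cells of a 2 x 2 table.

    tp: pathological subjects with an abnormal result
    fn: pathological subjects with a normal result
    fp: healthy subjects with an abnormal result
    tn: healthy subjects with a normal result
    """
    def ratio(numerator, denominator):
        # Undefined when a row or column of the table is empty.
        return numerator / denominator if denominator else None

    return {
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


# Hypothetical cohort: 50 patients and 50 controls.
print(diagnostic_accuracy(tp=40, fn=10, fp=5, tn=45))
```

Note that PPV and NPV depend on the proportion of pathological subjects in the sample, so values from case-control designs do not transfer directly to clinical populations.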
Although many of the studies identified in this systematic review reported results using the same quantitative metrics, a formal meta-analysis was not performed due to the wide variation in acquisition and data analysis techniques. Such a meta-analysis would only be relevant for a group of studies that showed substantial homogeneity in subject populations, MRI techniques, regions of interest (ROIs), and clinical measures. However, trends in the data were tabulated and summarized independently by 2 authors (A.R.M., I.A.) and discrepancies were resolved by discussion.

Risk of bias for individual studies
Risk of bias was assessed for each article independently by 2 reviewers (A.R.M., I.A.). The risk of bias criteria were defined by the authors by consensus, combining criteria from the Center for Evidence-Based Medicine (CEBM) Diagnostic Study Appraisal Worksheet (CEBM Website) and The Journal of Bone & Joint Surgery for prognostic studies (Wright et al., 2003), in addition to the modifications described in Skelly et al. (2013). The criteria were further modified to also consider potential sources of bias related to technical factors. The criteria are summarized in Table 2. Factors that were considered to be potential sources of bias include retrospective, case series, or case-control study designs; failure to match or analyze differences in demographics (age, gender) or control for other confounders; heterogeneity in the diagnosis of the study population; non-random enrollment methods (e.g. convenience sampling or posters may have increased selection bias compared with consecutive enrollment); unreliable acquisition and analysis methods; and a narrow range of severity of illness. More specifically, acquisition techniques were considered to have a higher risk of bias if they produced wide confidence intervals for metrics (>20%), showed distortions/artifacts that frequently required the exclusion of slices/subjects (>5%), or were subject to potential systematic bias, such as acquisitions that have substantial partial volume effects due to in-plane resolution >1.5 × 1.5 mm², or thickness >5 mm. Analytical techniques were considered to confer a higher risk of bias if they involved manual processes (e.g. ROI selection) without blinding, or liberal statistical assumptions (e.g. uncorrected p < 0.05 for activations in fMRI). For diagnostic studies, failure to calculate and report diagnostic accuracy was considered a potential source of reporting bias, as it conceals how many pathological subjects have an "abnormal" result on a given metric.
Similarly, correlation studies that did not publish univariate or multivariate correlation coefficients do not disclose the strength of the correlation. Prognostic studies were also judged to have potential bias if the patients were not at a similar point in the course of disease (lacking internal validity), if the study did not achieve >80% clinical follow-up, if follow-up was not long enough for a majority of patients to show a clinical change, or if other known prognostic factors were not reported and analyzed. If an article failed to report important information for any of the aforementioned potential sources of bias, or technical details that are necessary to reproduce the image acquisition, it was considered to have an increased risk of bias. Following rating of each article for risk of bias by the 2 reviewers, discrepancies were resolved by discussion.
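The acquisition-related thresholds described in this section can be expressed as a simple screening function. This is a sketch under stated assumptions rather than the reviewers' actual procedure: the function and parameter names are hypothetical, the thresholds are read as greater than 20% (confidence interval width), greater than 5% (excluded slices/subjects), greater than 1.5 mm in either in-plane dimension, and greater than 5 mm slice thickness, and unreported values are flagged because the text treats missing information as increased risk.

```python
def acquisition_bias_flags(voxel_mm=None, ci_width_pct=None, excluded_pct=None):
    """Return a list of acquisition-related risk-of-bias flags for one study.

    voxel_mm: (in-plane x, in-plane y, slice thickness) in mm, or None
    ci_width_pct: confidence interval width as a percentage of the metric
    excluded_pct: percentage of slices/subjects excluded for artifacts
    """
    flags = []
    if voxel_mm is None:
        flags.append("resolution not reported")
    else:
        dx, dy, dz = voxel_mm
        if dx > 1.5 or dy > 1.5:
            flags.append("in-plane resolution > 1.5 mm")
        if dz > 5.0:
            flags.append("slice thickness > 5 mm")
    if ci_width_pct is None:
        flags.append("confidence interval not reported")
    elif ci_width_pct > 20.0:
        flags.append("confidence interval > 20%")
    if excluded_pct is None:
        flags.append("exclusions not reported")
    elif excluded_pct > 5.0:
        flags.append("exclusions > 5%")
    return flags


# Hypothetical study with coarse, thick voxels and frequent exclusions.
print(acquisition_bias_flags((2.0, 1.0, 6.0), ci_width_pct=25, excluded_pct=10))
```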

Overall quality of the body of literature
After individual article evaluation, the overall body of evidence with respect to each key question and specific finding was determined based upon precepts outlined by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group (Schünemann et al., 2008). The possible ratings for overall quality of evidence are high, moderate, low, very low, and insufficient. The initial quality of the overall body of evidence was considered high if the majority of the studies had low or moderately low risk of bias, and low if the majority of the studies had high or moderately high risk of bias. The body of evidence was then upgraded 1 or 2 levels (only if no downgrading occurred) on the basis of the following criteria: (1) large magnitude of effect or (2) dose-response gradient, or downgraded 1 or 2 levels on the basis of the following criteria: (1) inconsistency of results, (2) indirectness of evidence, (3) imprecision of the effect estimates (e.g., wide confidence intervals [CIs] >50% of the estimate), or (4) non-a priori statement of subgroup analyses. The final overall quality of evidence expresses our confidence in the estimate of effect and the impact that further research may have on the results (Schünemann et al., 2008). The overall quality reflects the authors' confidence that the evidence reflects the true effect and the likelihood that further research will not change this estimate of effect. For example, a high level of evidence suggests that the evidence reflects the true effect, and further research is very unlikely to change our confidence in the estimate. A grade of "insufficient" means that evidence either is unavailable or does not permit a conclusion.
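The rating procedure described above can be sketched as a small function. Several details are assumptions made for illustration: the function names are hypothetical, a tie in study counts starts at "low", results are clamped to the range "very low" to "high", and "insufficient" is treated as a separate judgment that is not produced by counting levels.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def initial_quality(n_low_risk_studies, n_high_risk_studies):
    """Start at 'high' if most studies have low/moderately low risk of bias."""
    return "high" if n_low_risk_studies > n_high_risk_studies else "low"


def final_quality(initial, downgrade_levels=0, upgrade_levels=0):
    """Apply downgrades; apply upgrades only if no downgrading occurred."""
    if initial not in ("high", "low"):
        raise ValueError("initial quality must be 'high' or 'low'")
    index = LEVELS.index(initial)
    if downgrade_levels > 0:
        index -= downgrade_levels      # upgrades are ignored in this case
    else:
        index += upgrade_levels
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


# Hypothetical finding: mostly high-risk studies, large effect, no downgrades.
start = initial_quality(n_low_risk_studies=1, n_high_risk_studies=9)
print(start, "->", final_quality(start, upgrade_levels=1))
```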

Study selection
The literature search was designed to be highly inclusive and generated a total of 6597 unique citations (Fig. 1). Following review of the title and abstract, 256 articles were retained for full-text review and 47 review papers were identified. The full-text review of the 256 articles excluded another 156, leaving 101 articles that met all inclusion/exclusion criteria and were relevant to one or more of the 3 key questions. The reference lists of these 101 articles and the 47 review papers identified another 18 articles for full-text review, and 1 additional study that was electronically published following the literature search was identified by the authors. Among these 19 articles, 3 were retained for a final total of 104 studies. Many of the articles excluded at the full-text stage employed advanced MRI techniques in the brain but not the spinal cord, or the number of subjects fell below the threshold. Several articles were also excluded that used MT as a method to enhance contrast between the spinal cord and surrounding tissues, but did not perform quantitative analyses such as computing MTR or MTCSF. Of the final 104 articles, 101 (97%) were identified by the electronic database search.
The systematic review identified 69 DTI studies, including 62 that performed ROI-based quantitative analysis and 16 that performed fiber tractography (FT), 25 MT studies, 1 MWF study, 11 MRS studies, and 8 fMRI studies. Ten of the studies employed multi-modal acquisition techniques, including DTI and MT (6 studies), DTI and fMRI (3 studies), or DTI and MRS (1 study). Eight studies that used DTI FT also performed ROI-based quantitative analysis. The chronological trends of each of these imaging techniques are displayed in Fig. 2. The number of DTI studies that used ROI-based analysis sharply increased in recent years, whereas FT analysis decreased slightly. MT studies decreased after 2003, but saw a resurgence in recent years. MRS, MWF, and fMRI have been used in only a small number of studies, and recent use of these techniques has been limited. Tables 3-8 summarize the details of each study included in the review, separated by the imaging modality that was employed (with DTI divided by analysis technique).

Methodology and risk of bias of individual studies
Among the 104 studies, the risk of bias assessment found moderately low risk (with regard to at least 1 of the key questions) in only 6 studies, with the remainder of studies showing moderately high (24) or high (74) risk. Among the 69 DTI studies, the risk of bias was judged to be high in 52, moderately high in 14, and moderately low in only 3 studies. For MT studies this risk was high in 12, moderately high in 8, and moderately low in 5 studies. MRS studies showed high risk of bias in 7 studies and moderately high risk in 4. All of the fMRI studies and the single MWF study were assessed to have high risk of bias. Most of the studies reviewed were exploratory in nature (i.e. early translational studies) and not clearly based on a priori hypotheses, frequently making many statistical comparisons without appropriate correction. Most were prospective cohort studies (101), and the remaining 3 were retrospective cohort studies. Furthermore, 43 of the 104 studies failed to account for confounding factors such as age and/or gender, either by ensuring age/gender-matched groups or by performing appropriate multivariate analyses. The vast majority of studies focused on a population with a homogeneous diagnosis (98/104), avoiding possible issues with internal validity. However, only 15 of the 104 studies clearly reported the use of consecutive or random enrolment procedures to avoid possible selection bias, whereas the remaining 89 studies either used convenience sampling or failed to report enrolment methods in detail. Most of the studies (82/104) included patients with a range of severity of impairment, including mild/early cases that are more difficult to diagnose.

Acquisition techniques
Among the reviewed studies, a large fraction utilized technical methods that could introduce significant bias in terms of quantitative results. The group of DTI studies used a wide range of pulse sequences, with the majority (41/69) employing a relatively straightforward single-shot EPI (ssEPI) sequence, whereas 3 studies used multi-shot EPI (msEPI), 9 studies used more complex reduced field of view (rFOV) techniques, 1 study used line scan DTI, 1 study utilized a fast spin echo (FSE) sequence, 1 study used a spectral adiabatic inversion recovery (SPAIR) sequence, and the remaining 13 studies did not provide sequence details. Acquisition parameters were also highly variable, including b-values, FOV, matrix, number of excitations (NEX), saturation bands, shimming, and the use of cardiac gating, which was employed in 16/69 (23%) studies. Two of the studies utilized multiple b-values and calculated measures of diffusion kurtosis, such as mean kurtosis (MK) and root mean square displacement (RMSD) (Hori et al., 2012; Raz et al., 2013). Twenty-seven of the 69 studies acquired images with very large voxels (greater than 1.5 × 1.5 × 5 mm in at least 1 dimension) or failed to report resolution, potentially biasing the results due to increased partial volume effects. Several studies also performed analyses that could introduce a systematic bias against the pathological group, such as obtaining FA from an ROI in thinned spinal cord tissue at the level of syringomyelia or a hemorrhagic SCI lesion, which is more likely to include voxels with partial volume effects that artificially lower FA (Cheran et al., 2011; Hatem et al., 2009, 2010; Koskinen et al., 2013; Yan et al., 2015). The group of MT studies tended to use more consistent acquisition methods with less variation, with 24/25 studies employing some form of gradient echo (GE) sequence, all studies using a sinc or Gaussian shaped saturating pre-pulse, and none of the studies utilizing cardiac gating.
Only 2 studies computed MTCSF following a single MT acquisition. The remaining 23 studies acquired images with and without a saturation pre-pulse, coregistered the images, and calculated MTR. The study investigating MWF used a 32-echo sequence with inversion recovery (without cardiac gating) to measure the short T2 component using a multi-exponential model, but this technique only acquired a single axial slice with an acquisition time of 30 min. All of the MRS studies uniformly employed similar acquisition sequences, making use of point-resolved spectroscopy (PRESS) with chemical shift selective (CHESS) water suppression, while cardiac gating was employed in 5/11 (45%). Unfortunately, these studies all produced metrics with wide confidence intervals within subject groups. All of the spinal fMRI studies were based on a fast spin echo (FSE) acquisition, and none used cardiac gating. The fMRI studies appeared to suffer from challenges with reliable acquisitions, although reporting was not detailed enough to determine confidence intervals or measures of reliability, as the results typically involved processed data in terms of group activations and connectivity analyses.

Analysis methods
Whole-cord ROIs were used in the vast majority of DTI, MT, and MWF studies. Among the 62 ROI-based DTI studies, 18 reported tract-specific metrics, 3 extracted metrics from WM, and 2 reported data from GM, with the remaining 39 reporting whole-cord metrics or non-specific ROIs (e.g. mixed GM and WM from a mid-sagittal slice). Among DTI FT studies, only 2 reported tract-specific metrics, with the remainder averaging results across all WM identified. Among the 25 MT studies, 5 reported tract-specific metrics, 1 averaged results across all WM, and 2 offered GM-specific metrics. All MRS results were whole-cord, and fMRI results were typically broken into cord quadrants (combining GM and WM). Only 5 of the ROI-based DTI studies performed automated (or semi-automated) selection of the ROI (Nair et al., 2010; Oh et al., 2013a,b, 2015; Toosy et al., 2014), whereas the other 57 studies introduced potential bias by performing manual ROI selection without blinding procedures. The most common automated method was a simple segmentation procedure, followed by extraction from the whole cord. Nair et al. (2010) used FA values of each subject to create a WM skeleton, and then used this map to draw ROIs from C1 to C6, in a method that is somewhat similar to tractography-based ROI selection. Toosy et al. (2014) performed automated segmentation and registration to a spinal cord template, and subsequently extracted whole-cord ROIs and also hyperintense lesions using an automated threshold-free cluster enhancement (TFCE) algorithm. In addition, 7 studies utilized a semi-automatic algorithm to perform spinal cord segmentation, but then performed manual exclusion of edge voxels that were subject to partial volume effects with contamination from CSF (Agosta et al., 2007a, 2008b, 2009a; Benedetti et al., 2010; Manconi et al., 2008; Valsasina et al., 2007), which could introduce bias in the same manner as manual ROI selection.
Another study performed random ROI placement to avoid issues of potential bias, but did not report the exact method of randomization (Kamble et al., 2011). Among the 16 DTI FT studies, 6 utilized automatic ROI selection based on the FT output, although 4 of these used manual seed points to initiate the FT algorithm and 1 did not report details on the use of seed points (Hatem et al., 2010). Budzik et al. (2011) performed semi-automated FT without manual seed points and extracted whole-cord ROIs automatically. Among the MT studies, 14 of the 25 studies utilized automatic or semi-automatic analysis methods to extract MTR or MTCSF, with only a minority of studies using manual ROI selection. Rather than exclude edge voxels manually, many of these studies excluded voxels based on a preset threshold of MTR < 10%. The single MWF study used manual ROI selection. The 11 MRS studies were all single-voxel ROIs, with relatively straightforward analysis methods.
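The threshold-based exclusion of edge voxels mentioned above for MT studies can be sketched in a few lines. This is an illustration under assumptions, not the pipeline of any reviewed study: the function name and the voxel values are hypothetical, and the cutoff is read as excluding voxels with MTR below 10%.

```python
def mean_mtr_excluding_edges(mtr_values_pct, threshold_pct=10.0):
    """Average MTR over an ROI after dropping voxels below the threshold.

    Voxels dominated by CSF show very low MTR, so a fixed cutoff removes them
    without an operator choosing which edge voxels to discard.
    """
    kept = [value for value in mtr_values_pct if value >= threshold_pct]
    if not kept:
        raise ValueError("no voxels remain after thresholding")
    return sum(kept) / len(kept)


# Hypothetical ROI: three cord voxels and two CSF-contaminated edge voxels.
print(mean_mtr_excluding_edges([35.0, 40.0, 5.0, 8.0, 45.0]))
```

Because the cutoff is fixed in advance, this step is reproducible across operators, which is the property that manual exclusion of edge voxels lacks.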
All of the fMRI studies used a complex series of steps in data analysis, and 7/8 of the reviewed studies made statistical assumptions without correcting for multiple comparisons, leading to potentially biased results. All of the fMRI studies manually divided the cord into quadrants or hemi-cords.

Evidence regarding diagnostic utility
Ninety-five of the 104 studies included in the review made comparisons between pathological subjects and healthy controls. Among these 95 studies, 88 had a high risk of bias, and 7 had a moderately high risk. The vast majority of these studies (89/95) only reported group differences and did not calculate diagnostic accuracy in terms of SE, SP, PPV, or NPV. Group comparisons between pathological subjects and healthy controls frequently showed similarities across different diseases including decreased FA, increased MD, increased RD, decreased MK, decreased MTR, increased MTCSF, and decreased NAA concentration, suggesting various clinical pathologies share common underlying injury mechanisms of demyelination, axonal loss, and GM loss. All 6 of the studies that reported diagnostic accuracy (SE, SP) results utilized DTI, with 4 showing moderate utility of DTI metrics in diagnosing CSM, 1 in CM, and 1 in MS. In CSM, the reported values of SE and SP of DTI metrics ranged from 50 to 100%, but tended to exceed those reported for T2w-HI. However, none of the reported values for diagnostic accuracy were sufficiently high to compete with the gold standard for CSM diagnosis, which is based upon clinical signs of myelopathy along with imaging evidence of any amount of cord compression (typically using conventional MRI). The evidence for diagnostic utility in the CM and MS studies was also not sufficient to consider DTI superior to existing diagnostics. Two studies (both using DTI) computed z-statistics for metrics at each vertebral level to determine if an individual measurement was normal or abnormal. Results pertaining to diagnostic utility are summarized for each clinical pathology in Table 9.
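The distinction between reporting group differences and reporting diagnostic accuracy can be made concrete with a short sketch. The FA values and the 0.60 cutoff below are hypothetical and are not taken from any reviewed study; the sketch only shows how SE, SP, PPV, and NPV follow from a 2 × 2 table once a cutoff has been fixed.

```python
def confusion_counts(patient_values, control_values, cutoff):
    """Count outcomes when a value below cutoff is called abnormal
    (appropriate for a metric such as FA that decreases with pathology)."""
    tp = sum(1 for v in patient_values if v < cutoff)
    fn = len(patient_values) - tp
    fp = sum(1 for v in control_values if v < cutoff)
    tn = len(control_values) - fp
    return tp, fp, fn, tn


def diagnostic_accuracy(tp, fp, fn, tn):
    """Return SE, SP, PPV, and NPV; None where a denominator is zero."""
    def ratio(num, den):
        return num / den if den else None
    return {
        "SE": ratio(tp, tp + fn),   # sensitivity
        "SP": ratio(tn, tn + fp),   # specificity
        "PPV": ratio(tp, tp + fp),  # positive predictive value
        "NPV": ratio(tn, tn + fn),  # negative predictive value
    }


# Hypothetical FA measurements at a single cord level.
patients = [0.45, 0.50, 0.62, 0.48]
controls = [0.66, 0.70, 0.58, 0.68]
counts = confusion_counts(patients, controls, cutoff=0.60)
print(counts, diagnostic_accuracy(*counts))
```

Two groups can differ significantly in mean FA while a substantial share of individuals is still misclassified, which is why a group comparison alone does not establish diagnostic utility. PPV and NPV additionally depend on the proportion of patients in the sample.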

Evidence regarding predictive utility
Longitudinal studies that assessed predictive utility of advanced MRI metrics were only conducted in a total of 10 studies involving MS (5), ALS (2), CSM (2), and CM (1). Among these, 6 utilized DTI, 3 used MRS, 1 used MT, and 1 used MWF. The risk of bias among these studies was assessed as high in 8 and moderately high in 2. Four additional studies collected longitudinal clinical data but did not report prediction of outcomes using baseline MRI metrics. Among the 10 studies investigating predictive utility, 5 employed a detailed battery of clinical assessments (Bellenberg et al., 2013; El Mendili et al., 2014; Freund et al., 2010; Ikeda et al., 2013; Jones et al., 2013). Baseline FA showed weak to moderate correlations with clinical outcomes such as ALSFRS in ALS (1 study), mJOA recovery ratio in CSM (1/2 studies), and EDSS in MS (2/2 studies), but not mJOA in CSM (1 study). Ratios involving NAA were predictive of outcome in ALS (1 study) and MS (1/2 studies). Results for predictive utility are summarized in Table 9.

Evidence summary
The vast majority of studies included in this review had high or moderately high risk of bias, leading to a low baseline quality of evidence for each of the specific findings listed in Table 10. For the specific finding that FA is decreased in terms of group differences between patients and healthy controls in ALS, CSM, myelitis, MS, neuromyelitis optica (NMO), and SCI, the overall quality of evidence was neither upgraded nor downgraded, and remained low. Other metrics MD, RD, MK, MTR, MTCSF, and NAA also showed group differences between patients and healthy subjects in various clinical conditions, but the quality of evidence for these metrics was downgraded to very low due to a low level of evidence (MK, MTCSF) or inconsistent results between studies (MD, RD, MTR, NAA). There was insufficient evidence available to make any recommendations regarding the diagnostic utility (in terms of detecting group differences) of AD, standard deviation of primary eigenvector orientation (SD(θ)), orientation entropy (OE), tractography pattern, MWF, and fMRI-based metrics due to a lack of evidence, inconsistent results, and wide confidence intervals in many of the studies. The overall quality of evidence for diagnostic accuracy (sensitivity and specificity) was also insufficient, which was downgraded 2 levels due to highly inconsistent results. In terms of biomarker utility, only FA demonstrated consistent results, and the quality of evidence was upgraded 1 level to moderate for showing a dose-response gradient. The evidence for other MRI metrics as biomarkers was inconsistent and imprecise, leading to a finding of insufficient evidence. Finally, the evidence regarding the predictive utility for all MRI metrics was inconsistent and imprecise, leading to a rating of insufficient.

Discussion
It is an exciting time in spinal cord imaging, as the emergence of powerful new MRI techniques has inspired a large number of early clinical studies of pathological spine conditions. The excellent research conducted to date has demonstrated tremendous potential for all of these techniques to elucidate aspects of the microstructure or function within the human spinal cord, adding numerous insights into the pathophysiology of several neurological diseases. Among the 5 new techniques addressed in this review, DTI has thus far generated the most research, comprising 66% of the included studies and showing a sharp increase within the past 6 years, particularly using ROI-based analysis (Fig. 2). This increase in interest is most likely related to the promising results that DTI studies have demonstrated, particularly with moderate evidence that FA is a biomarker for disability in numerous pathologies (Table 10). The correlation of FA with impairment appears to be strongest in diseases that are confined to the spinal cord (e.g. CSM), which is consistent with the concept that disability in more distributed diseases (e.g. MS) is caused by injury to both the brain and the spinal cord. Low evidence was also found suggesting that FA shows group differences compared with healthy controls in several conditions, but insufficient evidence was available to suggest that DTI provides improved diagnostic accuracy or prediction of outcomes over established methods. A very low level of evidence was found for group differences using other DTI metrics MD and RD, MT metrics MTR and MTCSF, and the MRS metric of NAA concentration. It is unclear based on the current body of evidence if these metrics have substantial diagnostic value, due to a lack of strong evidence and substantial inconsistencies in results to date. 
The lack of well-designed studies to determine the diagnostic utility of the advanced MRI techniques, with 93% having a high risk of bias and only 6% reporting sensitivity and specificity, suggests a profound knowledge gap for future research. Furthermore, several studies in the review suggested that the simple quantitative measure of spinal cord CSA (quantifying atrophy) outperforms all of the advanced MRI metrics in terms of diagnostic and biomarker utility (Kearney et al., 2014, 2015; Oh et al., 2013b, 2015), suggesting that stronger results are still needed to contemplate the clinical uptake of these techniques.

Table 9 (excerpt)
• FA decreased (7/7 studies), specifically in LCSTs (4/4 studies)
• MTR (in LCSTs) was decreased in ALS (1 study)
• NAA decreased in ALS (3/3 studies)
• FA correlated with ALSFRS (r = −0.55-0.74, R = 0.38, 4/6 studies)
• NAA/Cre correlates with ALSFRS (r = 0.79, 1/2 studies) and FVC (r = 0.66, 1 study)
• FA, MD changes over 1 year not correlated with change in ALSFRS (2/2 studies)
• MTR does not correlate with ALSFRS (1 study)
• FA predicted ALSFRS at 1 year (1 study)
• NAA/Cre and NAA/Myo predict ALSFRS at 1 year (r = −0.70-0.78, 1 study)
aSCI 3
• MD decreased (2/3 studies)
• FA decreased (2/3 studies)
• FA correlates with one or more components of ASIA motor score (2/2 studies)
CM 3 3
• FA decreased and MD increased at MCL (2/3 studies)
• FA had higher SE (73%) and SP (100%) than T2w-HI (1 study)

• FA correlates with thermal sensation in 1/2 ROIs (r = −0.63, 1/2 studies)
• FA (r = −0.64, P = 0.02) and number of FT fibers (r = −0.75, P = 0.02) correlate with average daily pain scores (1 study)

Interpreting the evidence in the context of risk of bias
Unfortunately, the vast majority of studies (98/104, 94%) completed to date have a high or moderately high risk of bias, indicating the relative immaturity of the research in the field thus far. Although we were unable to determine precisely how many of the studies were based on a priori hypotheses (often due to ambiguous reporting of methods), it was obvious that most studies were highly exploratory, as they frequently analyzed numerous metrics and ROIs/levels without statistical correction to avoid type I errors. The early nature of the body of evidence is also apparent in the fact that 86% of studies failed to explicitly use random/consecutive enrolment methods, and 41% did not perform age/gender matching in group comparisons or analysis for these potential confounders when assessing correlations or prediction of outcomes. Comparing the risk of bias between the 5 advanced MRI techniques, it was found to be lowest in MT studies, rated as moderately low in 20%, moderately high in 32%, and high in 48%, primarily as a result of more reliable, consistent acquisition methods and a tendency to more frequently utilize automated analysis techniques. However, in spite of these advantages, the MT studies (most commonly using the metric MTR) showed considerably less consistent results than the DTI metric FA in terms of detecting group differences and correlating with impairment. As a result, the overall quality of evidence for MTR (and MTCSF) to demonstrate group differences in various clinical conditions was considered very low, and the evidence for their utility as biomarkers was insufficient (Table 10).
This suggests that MTR is, overall, a weaker marker of pathological changes in the diseases studied than FA, although these metrics appear to measure separate components of microstructural change (Cohen-Adad et al., 2011; Wheeler-Kingshott et al., 2002), and the differences in consistency of results could alternatively be explained by technical factors. The risk of bias among DTI studies was assessed as high in 75% and moderately high in another 20%, largely as a result of problems with acquisition methods such as very large voxels (39%) and a lack of automated/objective analyses (86%). The lack of a substantial number of high quality DTI studies led to a low baseline level of evidence for FA, MD, RD, and MK to demonstrate group differences and utility as a biomarker (Table 10). The quality of evidence for FA as a biomarker was upgraded to moderate due to a "dose-response gradient" (a term used in GRADE) as it shows consistent and relatively strong correlations with impairment, whereas the evidence for MD, RD, and MK was downgraded to very low in terms of diagnostic utility (showing group differences) and insufficient in terms of value as biomarkers. The risk of bias in MRS studies was high in 64% and moderately high in the remaining 36%, related to technical problems with acquisitions that resulted in the exclusion of subjects and wide confidence intervals in reported metrics. NAA showed very promising results in some studies, but the overall evidence was again downgraded to very low in terms of group differences and insufficient for correlation with impairment due to inconsistent results and imprecise estimates of effect. The single MWF study and all of the spinal fMRI studies were deemed to have a high risk of bias, primarily relating to difficulties in acquiring reliable images and the use of liberal statistical assumptions.
As a result, none of the metrics investigated in these studies were deemed to have thus far demonstrated utility in terms of the three key questions.

The design of imaging studies for clinical translation
The incorporation of detailed clinical assessments into translational study protocols provides a richer and more objective characterization of patients' functional impairments compared with coarse clinical tools such as EDSS, JOA, mJOA, ALSFRS, and AIS. The majority of studies that investigated biomarker utility (57%) and half of the prognostic studies employed only a single coarse measure of impairment. The use of these summary measures of disability risks misrepresenting the degree to which the spinal cord and specific WM tracts are truly injured, as these measures are imprecise, and results can be strongly influenced by confounding factors, such as reporting bias (in self-reported measures) or brain involvement in distributed CNS diseases (e.g. MS). If considerable noise and inaccuracies are present in the clinical assessments, the process of trying to identify meaningful correlations with MRI metrics can become futile. Electrophysiology (EP) tests can additionally be used to augment the clinical information, although it is important that these tests do not replace detailed neurological/functional assessments, as in some cases they may not be sufficiently (Kerkovsky et al., 2012). However, it should be noted that a trend appears to be emerging, with many recent studies employing a broader array of clinical tests. Future studies that generate fine-grained clinical data using a battery of assessments are more likely to identify important correlations with disability, and such high fidelity data may even have the power to show strong relationships between MRI changes in individual WM tracts and focal neurological deficits that uniquely occur in each specific disease.

State-of-the-art spinal cord MRI acquisition techniques: a work in progress
"The only thing that is constant is change." -Heraclitus, 500 BC.
Although many technological advances have been made, the state-of-the-art spinal cord MRI techniques addressed in this review remain a work in progress, with many technical hurdles remaining. All of these imaging techniques are much more difficult to implement in the spinal cord than other regions, such as the brain, which has attracted many talented MRI physicists and engineers to take on this challenge. The issues of magnetic field inhomogeneity and physiological motion, leading to various artifacts and image distortions, remain significant barriers to high quality data collection for all of the techniques. DTI, most commonly based on spin echo EPI sequences, is an inherently noisy technique that typically requires large voxels and/or the use of multiple excitations to achieve acceptable SNR, both of which can increase partial volume effects at the cord periphery. The substantial variability in acquisition methods used by spinal cord DTI research groups indicates that this community is far from reaching consensus on the optimal approach to this difficult problem. The most common DTI sequence employed was ssEPI (59%), which tends to allow short acquisition times (<5 min in the majority of reviewed studies; Tables 3, 4). Eleven of the 69 studies took advantage of these short scan times and used the approach of performing multiple ssEPI acquisitions and averaging the results offline to improve SNR, using coregistration and motion correction tools. However, it should be noted that EPI involves important tradeoffs, as it is strongly affected by susceptibility artifact due to inhomogeneity in the magnetic field.
This effect can cause image distortions, particularly at the level of intervertebral disk spaces, which is exaggerated when herniated disks obliterate the anterior CSF, potentially introducing bias or invalidating metrics calculated in the compressed portion of the spinal cord in conditions such as CSM. For example, Kerkovsky et al. (2012) report decreased FA in patients with spinal cord encroachment (effacement of the CSF) who have neck pain or radiculopathy but no objective signs of myelopathy. This result could represent sub-clinical changes in the spinal cord microstructure, but could alternatively be explained by increased susceptibility artifact. In recent years, there has been increased use of rFOV techniques, although this approach was only utilized in 13% of the reviewed studies. These sequences are based on 2D radiofrequency (RF) excitation (Finsterbusch, 2009; Saritas et al., 2008) or oblique refocusing pulses (Dowell et al., 2009; Wilm et al., 2009), and allow the use of a smaller FOV with higher resolution while avoiding aliasing problems and decreasing distortions, albeit at a cost of increased acquisition time. Only a fraction of DTI studies (23%) employed cardiac gating, likely because most groups felt that the reduction in motion artifacts is not worth the increased acquisition time and added complexity of setting up cardiac monitoring equipment. Two diffusion studies collected data with multiple b-values and computed measures of diffusion kurtosis, which is a dimensionless measure of the deviation from a Gaussian probability curve, with a positive value reflecting a sharper peak and heavier tails (Hori et al., 2012; Raz et al., 2013). Both studies identified positive MK in all subjects, with pathological subjects in CSM (Hori et al., 2012) and MS (Raz et al., 2013) showing group decreases in MK.
However, it is unclear if DKI measures are sufficiently more powerful than simple DTI metrics to justify the added acquisition time required for multiple b-values. Furthermore, the optimal number of diffusion-sensitizing directions has not been established for DKI, and it may be possible to perform DKI with a smaller number of directions, possibly offsetting the need for multiple b-values. As mentioned above, all of the MT studies utilized similar acquisition methods such as GE sequences (except for the earliest study (Silver et al., 1997)), MT pre-pulse parameters, and resolution. The single MWF study was exploratory in nature, and further refinements in spinal cord MWF image acquisition, including decreased scan time, are needed prior to the initiation of more advanced clinical studies using this method. MRS, particularly of the spinal cord, is prone to motion artifact and low SNR, typically requiring relatively long acquisition times due to the use of complex shimming methods, a high number of signal averages, and cardiac gating to obtain useful data. The magnetic field inhomogeneity within the spinal canal makes it difficult to shim the B0 field, usually requiring high-order shimming procedures to attempt to compensate. As a result, the metabolite peaks show line broadening and decreased amplitude, making detection difficult. MRS studies had the highest use of cardiac gating (45%) among the techniques in this review. The MRS results demonstrate significant variations in metabolite concentrations and ratios, even among healthy individuals (Holly et al., 2009; Ikeda et al., 2013; Salamon et al., 2013), suggesting that noise may still be a major limitation. However, it may also be the case that there naturally exists a wide range of normal in the concentrations and ratios of the molecules that MRS captures, in which case it will be difficult for MRS to make strong assertions about individual patients, even with further technical improvements.
Nevertheless, MRS provides unique information compared with the other advanced MRI techniques, and further development may allow quantification of important CNS molecules such as glutamate (not reliably detected with current methods), which may suggest an important role for MRS to complement the other more anatomically specific techniques. All 8 of the spinal fMRI studies used a fast or turbo SE pulse sequence with SEEP contrast, compared with T2*-weighted EPI that is typically used in brain fMRI based on BOLD contrast. FSE is commonly employed in spinal fMRI to compensate for severe inhomogeneity of the magnetic field within the spinal canal, but the readouts from this technique are considerably slower than EPI, increasing the effects of physiological motion artifacts. The time to acquire each volume of images in the reviewed studies ranged from 8 to 13 s, collecting between 5 and 9 slices (axial orientation in 7 studies, sagittal in 1) per volume, indicating the relatively low temporal resolution compared with brain fMRI, in which an entire brain volume can be acquired in 2 to 4 s. Furthermore, the signal change relating to altered neural activity is frequently only 2-3% (Stroman et al., 2004), requiring high SNR to reliably differentiate active voxels from background noise. The overall results of the spinal fMRI studies did not show convincing changes in activation patterns in specific pathologies (only minor loss of ipsilateral focal activation), possibly due to technical problems achieving sufficient SNR. If, however, reliable activations can be detected with better temporal resolution and shorter acquisition time, fMRI will likely make a significant impact, with obvious applications in conditions such as SCI to detect new activity and connectivity as regeneration therapies (e.g. stem cells) are studied.
In summary, all 5 of the state-of-the-art spinal cord MRI techniques continue to face technical issues that require further innovations, and clinical studies face the limitation of needing to freeze on a specific acquisition methodology over the period of time required to complete data collection, even if it may not include the latest and greatest technical advances.
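The need for multiple b-values in DKI, noted above, can be shown with a minimal sketch. It assumes the commonly used kurtosis signal representation ln S(b) = ln S0 − bD + (1/6)b²D²K along a single diffusion direction, which is not stated in this review. All numerical values are hypothetical, and a real analysis would fit many directions to noisy data by least squares rather than solve exactly.

```python
import math


def dki_signal(s0, b, d, k):
    """Signal under the assumed model ln S = ln S0 - b*D + (b*D)**2 * K / 6."""
    return s0 * math.exp(-b * d + (b * d) ** 2 * k / 6.0)


def fit_d_and_k(s0, s1, b1, s2, b2):
    """Solve for D and K from a b=0 signal plus two non-zero b-values.

    With y_i = ln(S0/S_i) = b_i*D - b_i**2 * c and c = D**2 * K / 6,
    two distinct non-zero b-values give two equations in D and c.
    A single non-zero b-value leaves the system underdetermined.
    """
    y1 = math.log(s0 / s1)
    y2 = math.log(s0 / s2)
    c = (y1 / b1 - y2 / b2) / (b2 - b1)
    d = y1 / b1 + b1 * c
    k = 6.0 * c / d ** 2
    return d, k


# Hypothetical noise-free example: D in mm^2/s, b in s/mm^2.
true_d, true_k = 1.0e-3, 0.8
s_b1 = dki_signal(1.0, 1000.0, true_d, true_k)
s_b2 = dki_signal(1.0, 2000.0, true_d, true_k)
print(fit_d_and_k(1.0, s_b1, 1000.0, s_b2, 2000.0))
```

With K = 0 the model reduces to the mono-exponential decay that underlies standard DTI, so the kurtosis term quantifies how far the measured decay departs from Gaussian diffusion.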

State-of-the-art imaging deserves state-of-the-art analysis
The majority of DTI, MT, MWF, and fMRI studies included in this review used manual methods of ROI selection to extract quantitative metrics, with only 25/93 (27%) using automated or semi-automated ROI selection. In addition to being slow and imprecise, unblinded manual ROI selection is an obvious source of potential bias in studies, as the technician selecting the ROI can arbitrarily include or omit pixels of high or low signal (often present at the edge of the cord due to partial volume effects), and it is impossible to blind the technician in many scenarios (e.g. compressive myelopathy). The very low rate of objective analysis techniques for DTI studies (14%), compared with 56% of MT studies, is possibly due to greater problems with partial volume effects at the edge of the cord in DTI, where contamination with CSF causes an increase in isotropic diffusion and a corresponding decrease in FA, prompting 7 DTI studies to employ manual exclusion of edge voxels after performing semi-automated segmentation to identify the spinal cord. Furthermore, most studies (73/104, 70%) included in this review reported whole-cord metrics, which average the effects of a specific disease process across all GM and WM. Analyzing whole-cord metrics lacks the specificity of measuring changes in individual anatomical areas, such as WM tracts (which might be differentially affected in a certain disease), and it also potentially dilutes the sensitivity to detect small changes: a 10% change present in the WM might only show a 5% change in the whole-cord metric, which may no longer be statistically significant. To optimize the sensitivity and specificity of these techniques, the ideal solution is to analyze only the tissue that is most affected by a certain disease, such as the anterior horn GM and/or the lateral corticospinal tracts in ALS. 
Several groups are actively developing tools for this purpose, which can perform a series of complex data processing steps and automatically extract quantitative metrics from GM, WM, and specific WM-tracts (Cohen-Adad et al., 2014), even correcting for partial volume effects at the cord periphery (Levy et al., 2015). Tract-specific metrics, which were available in only 22/104 studies (21%), also have the advantage of potentially characterizing gradations of injury to each anatomical area within the cord, potentially correlating with or predicting focal neurological deficits. Fiber tractography (FT) is an interesting alternative to ROI-based quantitative analyses of DTI data. The DTI studies that employed FT were listed separately from ROI studies in Table 4, primarily to identify trends and commonalities among the methods used within FT studies. Among the FT studies reviewed, only 38% extracted quantitative metrics from the region defined by the FT results. The utility of FT in quantitative assessment of the spinal cord is controversial, as some have suggested that using FT to automatically define ROIs is inherently biased (Cohen-Adad et al., 2011), and most FT algorithms require manual seed points, as was identified in our review (only 1/16 studies did not require seed points). However, one study in this review reported improved measures of inter-observer reliability using FT-based ROIs vs. manual ROIs, again supporting the importance of automated, objective analysis methods (Van Hecke et al., 2009). Other studies derived quantitative measures from the FT output, such as number of fibers, fiber density, or fiber length (as surrogates for number of intact axons). However, the FT analysis is typically based on liberal assumptions of what constitutes a fiber, using low thresholds for minimum FA of 0.10-0.30 and angle of <20-70° when calculating connections between voxels.
The result is a very loose representation of the actual white matter that should be interpreted with caution. An alternative to using tractography to measure the organization of the white matter is to perform quantitative analysis of the directionality of the eigenvectors, which was performed in 2 studies using OE and SD(θ). These alternative methods are highly quantitative, and may turn out to be more reliable than tractography in characterizing white matter changes, but more data are needed to fully define their value. Half of the FT studies, all of which involved various forms of compressive myelopathy, only reported descriptions of the pattern of tracked fibers such as the degree of deformation or disruption. However, assignment of these descriptors is highly subjective and WM compression may be more accurately represented by geometric measurements (e.g. maximum spinal cord compression ratio). In comparing MT techniques, the use of MTR may have a theoretical advantage over MTCSF, as the CSF is prone to flow artifact that causes signal dropout, which could potentially bias results, but this was not an obvious drawback in the 2 studies that employed MTCSF. The calculation of MTR requires an added post-processing step, as images with and without an MT prepulse need to be co-registered accurately, but this is relatively straightforward with modern tools. No major technical challenges were identified in the analysis techniques employed by MWF and MRS studies, except for the use of manual ROIs in the MWF study (Laule et al., 2010). In all of the reviewed fMRI studies, time-series data were analyzed by convolving with a canonical hemodynamic response function, and activation maps (based on a p-value threshold or a clustering algorithm) were created.
Due to challenges in obtaining robust activations, most of the spinal fMRI studies used an uncorrected threshold of P < 0.05 for each voxel so that a greater number of activations could be identified, with the exception of one study (Cadotte et al., 2012). This uncorrected analysis runs a high risk of identifying false activations, particularly when hundreds of voxels are included, and therefore the results of these studies must be interpreted with caution. All of the fMRI studies also used manual ROI selection, typically dividing the cord into quadrants manually, contributing another potential source of bias to the analysis.
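The scale of the false-activation risk from an uncorrected voxel-wise threshold can be illustrated with simple arithmetic. The sketch below assumes that the voxel tests are independent, which is a simplification because neighboring voxels are spatially correlated, and the voxel count of 200 is hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of voxels passing the threshold when no true
    activation exists anywhere (holds regardless of correlation)."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false activation across all tests,
    assuming independent tests (an idealization for imaging data)."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_voxels = 200  # hypothetical number of cord voxels tested
print(expected_false_positives(0.05, n_voxels))
print(familywise_error_rate(0.05, n_voxels))
```

Under these assumptions, roughly 10 of 200 voxels would be expected to appear active by chance alone, and at least one false activation would be almost certain.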

Statistical analysis: a big data problem
Appropriate statistical analysis for complex clinical studies using quantitative MRI techniques is far from straightforward. These data can involve a large number of metrics, including multiple DTI indices or the output from multi-modal acquisitions, and the values might be extracted from numerous ROIs located in individual WM tracts at many rostro-caudal levels of the spinal cord. Furthermore, the above-mentioned trend toward using multiple clinical measures to fully characterize disability suggests that future studies will need to employ multivariate analyses with an increasing number of independent and dependent variables. The analysis of these studies quickly becomes a big data problem, and help from an experienced statistician is advisable to correctly design robust multivariate analyses that incorporate a priori variables of interest and potential confounding factors such as age and gender. It is of paramount importance that a priori hypotheses are clearly stated beforehand, to avoid an excessive number of comparisons and misrepresentation of the complex data that leads to unfounded conclusions. Among the studies reviewed, there were many cases where no correction was made for multiple comparisons, leading to findings that would not have been identified as significant with proper correction. In some cases, studies went as far as reporting conclusions that were clearly overstated or unfounded, which must be avoided in future translational research that will form the basis for clinical adoption of these techniques.
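How a correction for multiple comparisons changes which findings survive can be sketched with the Holm-Bonferroni step-down procedure. This is one standard method chosen here for illustration; it is not a method attributed to any of the reviewed studies, and the p-values below are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    P-values are tested from smallest to largest against alpha / (m - rank),
    and testing stops at the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # all larger p-values are retained as non-significant
    return rejected


# Hypothetical p-values for five metric/ROI comparisons in one study.
p = [0.001, 0.04, 0.03, 0.20, 0.012]
uncorrected = [value < 0.05 for value in p]
print(uncorrected)
print(holm_bonferroni(p))
```

In this toy example four of the five comparisons pass an uncorrected threshold of 0.05, but only two remain significant after correction, which mirrors the pattern described above.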

Limitations of this study
This systematic review attempted to perform an exhaustive review of all clinical studies utilizing the 5 advanced spinal cord MRI techniques. A large number of citations were analyzed in an attempt to identify all relevant articles, but it is still possible that relevant studies were missed, including those not available in English. On the other hand, the large scope of this review made it more difficult to discuss all of the subtleties involved in these MRI techniques. Also, the inclusion criteria arbitrarily excluded cohorts with fewer than 24 subjects or fewer than 12 pathological subjects. This threshold was originally set at 20 total subjects and 10 pathological subjects, but it was increased because the number of studies identified using the lower threshold was far greater than 100, which would have made the tables excessively long and the discussion even more difficult. However, we did not increase the threshold higher than 24 as we felt that several key studies would have been excluded. Studies that only analyzed the quantitative metrics apparent diffusion coefficient (ADC), generated from DWI, or CSA, derived from anatomical images, were also excluded for the purpose of focusing this review on new techniques. Spinal cord DWI has been in clinical use for many years for the detection of infarction and abscess, but the simple metric of ADC (equivalent to MD in DTI) may have value in specific applications as a measure of microstructural tissue changes. CSA is clearly a powerful quantitative metric that relates to cord atrophy, which should be considered for use in addition to the advanced MRI metrics in multivariate models. The search strategy excluded research that only studied healthy subjects, as these studies and those with smaller cohorts of pathological subjects tended to show less robust methodology and clinical relevance. 
This review also focused solely on advanced spinal cord imaging techniques, but several groups studying spinal cord pathologies have investigated imaging changes in brain microstructure and function, in part due to the relative simplicity of implementing these imaging protocols in the brain (Freund et al., 2013; Kowalczyk et al., 2012; Mikulis et al., 2002). Furthermore, this review was focused on the 5 most promising spinal cord imaging techniques identified by the recent expert panel, but several others are emerging that may make a substantial impact on this field, including perfusion imaging, susceptibility weighted imaging, T1 relaxometry, neurite orientation dispersion and density imaging (NODDI), and myelin g-ratio (Stikov et al., 2015).

Future directions
The path to clinical translation of technological innovations, such as new MRI techniques, invariably includes numerous challenges and there remains significant work to successfully bring these techniques into clinical use. Translational research typically involves a process that begins with small exploratory studies and transitions to large, carefully designed clinical trials, and several of the state-of-the-art spinal cord MRI techniques reviewed in this paper have demonstrated sufficiently strong results and are ready for this next step. Looking forward, the spinal cord imaging community will continue to drive these powerful techniques forward, with several key steps happening concurrently: 1) larger clinical studies with specific hypothesis-driven research questions will be designed and conducted to assess for clinical utility; 2) acquisition techniques will continue to evolve and be refined to maximize signal-to-noise ratio (SNR) and resolution while minimizing distortions, artifacts, and acquisition times; and 3) powerful data analysis tools will be developed that can automatically extract quantitative data from the GM, WM, and specific WM tracts. The long path to clinical translation is not easy, but in the coming years, we can expect many further innovations in this burgeoning field, which will hopefully lead to major improvements in the diagnosis and management of patients with spinal cord pathologies.
New techniques and innovations are also emerging that could dramatically alter the course of research in this field, but were not utilized by any of the studies in this review. For example, the development of high-strength gradients for DTI, highlighted by the Human Connectome Project, which uses 300 mT/m gradients (200 mT/m/ms slew rate) that are 8 times stronger than most clinical hardware, has provided new insights, such as mapping the axon diameter distribution in the human spinal cord. Recently, the introduction of inhomogeneously broadened MT (ihMT) imaging has demonstrated much higher specificity for myelin imaging than previous MT techniques (although the signal dropout is less pronounced, requiring subtraction between images, which decreases SNR substantially), which will likely spur new clinical studies to investigate its utility (Girard et al., 2015). Chemical exchange saturation transfer (CEST) imaging is a particular case of MT imaging that can quantify the biochemical composition of tissues based on labile protons (hydroxyl, amide, amine, and sulfhydryl moieties). Feasibility in the human spinal cord and application in MS patients have recently been demonstrated (Kim and Cercignani, 2014). In addition, none of the 104 studies that were reviewed used 7 T field strength, but with the proliferation of 7 T research systems and the recent announcement of 7 T clinical scanners, new clinical studies at ultra-high field strength are expected soon, and these could potentially show substantial improvements that strengthen the case for clinical utility. Analysis techniques may also undergo a revolution with the introduction of machine learning, as complex multivariate data from healthy and pathological subjects could be used to train classifiers, potentially increasing diagnostic sensitivity and specificity.
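To make the idea of training a classifier on multivariate data concrete, the following is a minimal sketch and not a method used by any of the reviewed studies. It assumes three features per subject (labelled FA, MTR, and CSA for illustration), generates purely synthetic values with arbitrary means and standard deviations, and uses a simple nearest-centroid classifier evaluated by leave-one-out cross-validation. All function names and numbers are hypothetical.

```python
import random
import statistics


def make_synthetic_cohort(n_per_group=30, seed=0):
    """Generate synthetic (FA, MTR, CSA) vectors. Values are illustrative, not measured data."""
    rng = random.Random(seed)
    # (mean, sd) per feature for each group: arbitrary assumptions for this sketch.
    params = {
        0: [(0.70, 0.05), (50.0, 3.0), (80.0, 8.0)],  # label 0: "healthy"
        1: [(0.62, 0.05), (46.0, 3.0), (72.0, 8.0)],  # label 1: "pathological"
    }
    samples = []
    for label, feats in params.items():
        for _ in range(n_per_group):
            samples.append(([rng.gauss(m, s) for m, s in feats], label))
    return samples


def fit_nearest_centroid(train):
    """Fit on (features, label) pairs and return a predict(features) function."""
    n_feat = len(train[0][0])
    # Standardize each feature with training-set statistics only.
    means = [statistics.fmean([x[j] for x, _ in train]) for j in range(n_feat)]
    sds = [statistics.stdev([x[j] for x, _ in train]) for j in range(n_feat)]

    def zscore(x):
        return [(x[j] - means[j]) / sds[j] for j in range(n_feat)]

    centroids = {}
    for label in {y for _, y in train}:
        rows = [zscore(x) for x, y in train if y == label]
        centroids[label] = [statistics.fmean([r[j] for r in rows]) for j in range(n_feat)]

    def predict(x):
        z = zscore(x)
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(z, centroids[c])),
        )

    return predict


def leave_one_out(samples):
    """Leave-one-out cross-validation; returns (sensitivity, specificity)."""
    counts = {(t, p): 0 for t in (0, 1) for p in (0, 1)}
    for i, (x, y) in enumerate(samples):
        # Refit without subject i so that the test subject never informs the model.
        predict = fit_nearest_centroid(samples[:i] + samples[i + 1:])
        counts[(y, predict(x))] += 1
    sensitivity = counts[(1, 1)] / (counts[(1, 1)] + counts[(1, 0)])
    specificity = counts[(0, 0)] / (counts[(0, 0)] + counts[(0, 1)])
    return sensitivity, specificity


sensitivity, specificity = leave_one_out(make_synthetic_cohort())
print(f"LOO sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

The cross-validation loop is the important design choice in this sketch: standardization and centroid estimation are repeated inside each fold, so the reported sensitivity and specificity are not inflated by information from the held-out subject. Real data would also require handling of confounders such as age and spinal level, which this example ignores.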
However, optimism for novel MRI methods must be tempered with practicality. Even if the clinical utility of one or more of these quantitative MRI techniques is clearly demonstrated, a considerable hurdle will still remain before widespread clinical adoption will occur. The concept of quantitative MRI has been used in the research domain for several years (e.g. CSA for MS), but is largely foreign to clinicians, and the exact method and workflow for its use need to be carefully considered, or these new techniques will be quickly abandoned. Radiologists, neurologists, and spine surgeons who have busy clinical practices are unlikely to sit at an imaging workstation and perform manual tasks to generate quantitative metrics, so data analysis will need to be fully automated, robust, and seamlessly integrated. The perception that new analysis methods are time consuming, unreliable, or inaccurate will render these new methods unacceptable. Thus, it is essential that sophisticated, automatic analysis tools be developed in parallel with advances in the imaging techniques themselves.

Conclusions
The current body of evidence of clinical studies using spinal cord DTI, MT, MWF, MRS, and fMRI is relatively limited, indicating the early stage of this translational research effort. However, moderate evidence indicates that the quantitative DTI metric FA successfully correlates with impairment in a number of neurological disorders. Low evidence suggests that FA shows tissue injury (in terms of group differences) in a number of disorders, but the evidence is insufficient to support its use as a diagnostic test or as a predictor of clinical outcomes. Very low evidence exists for other metrics to show pathological changes in terms of group differences in the spinal cord, including MD, RD, MK, MTR, MTCSF, and NAA, and the evidence is insufficient to determine if they can be used as a diagnostic test, biomarker, or prognostic marker in a clinical context. DTI has produced the most substantial results to date, but acquisition methods, data processing, and interpretation require further refinement, followed by standardization and cross-vendor validation, before this technology is ready for widespread clinical adoption. The path to clinical translation of these complex MRI techniques is not straightforward, and future translational studies are required that have clear a priori hypotheses, large enrolment numbers, short scan times, high quality acquisition techniques, detailed clinical assessments, automated analysis techniques, and robust multivariate statistical analyses (Fig. 3). It is also important to keep in mind that the definition of clinical utility is to be able to make assertions about individual patients, not just achieve significant group differences, setting a very high standard for success. However, much progress has already been made, and the spinal cord imaging community will undoubtedly make many great achievements in the years to come.
Fig. 3. Key points.
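The gap between a group difference and an assertion about an individual patient can be illustrated numerically. The sketch below is an idealized example rather than an analysis from this review: it assumes a single metric that is normally distributed with equal variance in healthy and pathological groups, in which case the area under the ROC curve (AUC) equals Φ(d/√2), where d is the standardized mean difference (Cohen's d) and Φ is the standard normal cumulative distribution function. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def auc_from_cohens_d(d):
    """AUC for two equal-variance normal groups separated by standardized difference d."""
    return STANDARD_NORMAL.cdf(d / 2 ** 0.5)


def midpoint_threshold_accuracy(d):
    """Accuracy of a cutoff halfway between the group means, assuming equal group sizes."""
    return STANDARD_NORMAL.cdf(d / 2)


for d in (0.2, 0.5, 0.8, 1.5, 3.0):
    print(
        f"d={d:.1f}  AUC={auc_from_cohens_d(d):.2f}  "
        f"accuracy={midpoint_threshold_accuracy(d):.2f}"
    )
```

Under these assumptions, an effect size of d = 0.8, which is conventionally considered large and is readily detected as a significant group difference in a moderately sized cohort, corresponds to an AUC of roughly 0.71 and classifies only about two thirds of individuals correctly. This is why group-level significance alone falls well short of the standard required for a diagnostic test.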