Multisite reliability of cognitive BOLD data
Research Highlights
- Person variance was more than 10 times larger than site variance for most ROIs.
- Person-by-site interactions contributed sizable unwanted variance to the total.
- Many voxels showed good to excellent between-site reliability.
- Regions of interest displayed fair to good between-site reliability.
- Acquiring reliable fMRI data requires ongoing quality assurance.
Introduction
Several multi-site functional magnetic resonance imaging (fMRI) studies are currently in progress or being planned (Van Horn and Toga, 2009). The larger samples made possible by multi-site studies can potentially increase statistical power, enhance the generalizability of study results, facilitate the identification of disease risk, increase the odds of finding uncommon genetic variations, make rare disease and subgroup identification possible, help justify multivariate analyses, and support cross-validation designs (Cohen, 1988, Friedman and Glover, 2006a, Jack et al., 2008, Mulkern et al., 2008, Van Horn and Toga, 2009). Yet the potential advantages of multi-site functional imaging studies could be offset by unwanted variation in imaging methods across sites. Even when the same activation task is used at different sites and the same image processing path is employed, potential site differences might arise from differences in stimulus delivery and response recording, head stabilization method, field strength, the geometry of field inhomogeneity, gradient performance, transmit and receive coil configuration, system stability, shimming method, type and details of the image sequence including k-space trajectory, type of k-space filtering, system maintenance, and environmental noise (Friedman and Glover, 2006a, Friedman and Glover, 2006b, Ojemann et al., 1998, Van Horn and Toga, 2009, Voyvodic, 2006). The large number of experimental factors that might differ between sites could introduce unwanted variation related to site and its interactions into a multi-site fMRI study. This unwanted variation may, in turn, undermine the advantages of increased statistical power and enhanced generalizability that would otherwise be associated with large-sample studies.
Given that unwanted between-site variation is itself likely to vary from multi-site study to multi-site study, determining the magnitude of site variation and evaluating its impact on the consistency of results across sites have become a critical component of multi-site fMRI studies (Friedman et al., 2008, Pearlson, 2009).
The consistency of blood oxygen level-dependent (BOLD) fMRI values across sites has been studied for a variety of behavioral activation tasks using several different statistical approaches. One common approach is to measure between-site consistency by assessing the extent of overlap of either observed or latent activation regions (Casey et al., 1998, Gountouna et al., 2010, Vlieger et al., 2003, Zou et al., 2005). These studies find only a modest degree of overlap in the extent of activation, with the number of regions found to be significantly activated varying by fivefold across sites in one study (Casey et al., 1998). Differences in field strength and k-space trajectory have accounted for significant between-site variation in some studies (Cohen et al., 2004, Voyvodic, 2006, Zou et al., 2005). Even when Cartesian k-space trajectories are used at all sites, differences in the type of image acquisition protocol can produce differences in the spatial extent and magnitude of the BOLD signal, as studies comparing gradient-recalled echo protocols with spin echo and asymmetric spin echo protocols show (Cohen et al., 2004, Ojemann et al., 1998).
Methods that measure the overlap of activation extent and volume across MR systems have been criticized for assuming invariant null hypothesis distributions across sites and for the use of a specific threshold to determine statistical significance (Suckling et al., 2008, Voyvodic, 2006). The distributions of the test statistics, however, can be checked, and if differences in distributions are found, methods are available to adjust the activation maps or modify the statistical analysis (Miller, 1986, Voyvodic, 2006). With regard to the second criticism, overlap consistency can be investigated across a range of thresholds, as in Zou et al.'s (2005) study. A more fundamental limitation of overlap methods is that they do not provide a standard against which to judge the importance of a particular degree of overlap. The question of how much overlap is necessary to produce statistically robust and generalizable findings typically remains after percent overlap statistics are presented. In addition, overlap methods do not address the question of how consistently subjects can be rank-ordered by their brain response across sites. For example, in a study involving assessment of the same subjects scanned at multiple sites, a high degree of overlap in regional activation might be observed in group-level mean statistical maps across sites even though the rank-ordering of the brain response of subjects changes randomly from site to site. This possibility underscores the point that cross-site reliability of fMRI measurements cannot be determined solely by examining the consistency of group activation maps across sites using a repeated-measures ANOVA model.
Variance components analysis (VCA), which assesses components of systematic and error variance by random effects models, is another commonly used method to assess cross-site consistency of fMRI data (Bosnell et al., 2008, Costafreda et al., 2007, Dunn, 2004, Friedman et al., 2008, Gountouna et al., 2010, Suckling et al., 2008). VCA provides several useful standards against which to judge the importance of consistency findings. For investigators interested in studying the brain basis of individual differences, one natural standard is to compare variance components of nuisance factors against the person variance. This strategy of assessing the importance of between-site variation in fMRI studies has been used in several studies. Costafreda and colleagues, for example, found that between-subject variance was nearly seven-fold larger than site variance in the region of interest (ROI) they studied (Costafreda et al., 2007). In the Gountouna et al. (2010) study, site variance was virtually zero for several ROIs, with the ratio of subject variance to site variance as large as 44 to 1 in another especially reliable ROI. Suckling and colleagues reported a value of between-subject variance that was slightly more than 10 times greater than the between-site variance (Suckling et al., 2008). Using a fixed-effect ANOVA model, Sutton and colleagues found effect sizes for between-subject differences to be seven to sixteen times larger than the between-site effect for the ROIs studied (Sutton et al., 2008).
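As a minimal sketch of the kind of decomposition these studies report, the person and site variance components for a persons-by-sites table of ROI beta-weights can be estimated from the expected mean squares of a two-way random-effects ANOVA. With one observation per cell, the person-by-site interaction is confounded with residual error; the function name and layout here are illustrative, not the authors' actual pipeline:

```python
import numpy as np

def variance_components(y):
    """Estimate person, site, and residual (person-by-site + error)
    variance components from a persons x sites matrix `y` with one
    observation per cell, using expected mean squares."""
    n_p, n_s = y.shape
    grand = y.mean()
    # Mean squares for the person and site main effects
    ms_person = n_s * ((y.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    ms_site = n_p * ((y.mean(axis=0) - grand) ** 2).sum() / (n_s - 1)
    # Residual after removing both sets of marginal means
    resid = (y - y.mean(axis=1, keepdims=True)
               - y.mean(axis=0, keepdims=True) + grand)
    ms_resid = (resid ** 2).sum() / ((n_p - 1) * (n_s - 1))
    # Method-of-moments estimates, truncated at zero
    var_person = max((ms_person - ms_resid) / n_s, 0.0)
    var_site = max((ms_site - ms_resid) / n_p, 0.0)
    return var_person, var_site, ms_resid
```

A ratio such as `var_person / var_site` then gives the person-to-site comparison used as a standard in the studies above.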
Variance components are also commonly used to calculate intraclass correlations (ICC) that can be compared with the intraclass correlations of other, perhaps more familiar, variables (Brennan, 2001, Dunn, 2004). Between-site intraclass correlations of ROIs have been reported in the 0.20 to 0.70 range for fMRI data, depending on the activation measure, behavioral task, and degree of homogeneity of the magnets studied (Bosnell et al., 2008, Friedman et al., 2008). For clinical ratings of psychiatric symptoms, ICCs in the 0.20 to 0.70 range would be rated as ranging from poor to good (Cicchetti and Sparrow, 1981). The intraclass correlation can also be used to assess the impact of measurement error on the statistical power of between-site studies, providing a third method to assess the importance of VCA results (Bosnell et al., 2008, Suckling et al., 2008).
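Given estimates of the person, site, and residual variance components, the absolute-agreement ICC referred to here is simply the share of total variance attributable to persons. A sketch, with invented component values in the comment:

```python
def icc_absolute(var_person, var_site, var_resid):
    """Absolute-agreement intraclass correlation for single-site
    measurements: person variance over total variance, so consistent
    site offsets count against reliability."""
    total = var_person + var_site + var_resid
    return var_person / total if total > 0 else 0.0

# Example with hypothetical components where person variance is
# 10x site variance: icc_absolute(10.0, 1.0, 4.0) -> 10/15, about 0.67
```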
In the present study, healthy volunteers were scanned once at three sites and twice at a fourth site while performing a working memory task with emotional or neutral distraction. We investigated the between-site reliability of BOLD/fMRI data by calculating variance components for voxelwise data. Voxelwise VCA and ICC maps are presented in order to identify voxel clusters where particular components of variation were most prominent and where between-site reliability was largest. Presenting data for all voxels avoided limitations of generalization associated with the use of specific statistical thresholds. We also present findings obtained by averaging beta-weights over voxels within selected regions of interest (ROI) to simplify the comparison of our results with those of other studies that presented ROI results. The main study hypotheses follow:
- (1) Clusters of voxels will be identified where the between-subject variation will be more than 10-fold the value of the between-site variation. This hypothesis is derived from the VCA studies discussed above.
- (2) Variation attributed to the person-by-site interaction will be greater than variation associated with the site factor. Because person-by-site interactions occur when the rank ordering and/or distance of the BOLD response of subjects differs across sites, this source of variation reflects a broader array of potential sources of error than site variation alone.
- (3) The magnitude of the ICC will depend on whether the functional contrast involved a low-level or high-level control condition, where a low-level control involves only orientation, attention, and basic perceptual processing, and a high-level control condition involves additional cognitive processes (Donaldson and Buckner, 2001). High-level control conditions are likely to share a larger number of neural processes with the experimental condition than low-level control conditions, leading to a larger subtraction of neural activity and a smaller BOLD signal, especially for brain regions that participate in a multiplicity of neurocognitive functions engaged by the experimental task. The smaller magnitude of the BOLD response is likely to restrict the range of person variance, reducing the between-site ICC (Cronbach, 1970, Magnusson, 1966).
- (4) The between-site intraclass correlation will increase as the number of runs averaged increases. Although this hypothesis has been supported by a previous study involving a sensorimotor task, it has not been tested in BOLD data generated by a cognitive task (Friedman et al., 2008).
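The expected gain from averaging runs can be illustrated with the classical Spearman-Brown prophecy formula, which predicts the reliability of the mean of k measurements from single-run reliability. This is a simplification relative to the variance-components models used in the study itself, since it treats runs as parallel forms and folds all non-person variance into error:

```python
def spearman_brown(icc_single, k):
    """Predicted reliability of the average of k parallel runs,
    given the reliability of a single run."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# A single-run ICC of 0.40 (hypothetical) rises toward 0.73
# when four runs are averaged: spearman_brown(0.4, 4)
```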
We also investigated the relationships among between-site reliability, effect size, and sample size. A previous within-site reliability study found correlations greater than 0.95 between the median within-site ICC and the activation-threshold t-test value for both an auditory target detection task and an N-back working memory task (Caceres et al., 2009). We will examine the relationship between reliability and effect size for the between-site case. Although we anticipate that median effect size calculated across voxels will be strongly related to the magnitude of between-site reliability for those voxels, dissociations might be observed. If voxels with large activation effect sizes and poor reliability are observed, we will investigate the possibility that these voxels have poor reliability due to reduced variation among people (Brennan, 2001, Magnusson, 1966). If voxels with small activation effect sizes and good between-site reliability are observed, we will investigate the possibility that activation magnitude within subjects is consistent across sites, yet balanced between negative and positive activation values.
In the present study, the specific form of the ICC we calculated assessed between-site consistency at an absolute level (Brennan, 2001, Friedman et al., 2008, Shrout and Fleiss, 1979). High between-site ICC values, therefore, would support the interchangeability of data and justify the pooling of fMRI values across sites (Friedman et al., 2008, Shavelson et al., 1989). There are, of course, alternative definitions of the ICC (Shrout and Fleiss, 1979), and it is useful here to provide some discussion of the factors that would affect the choice to assess reliability based on absolute or relative agreement of measurements at different sites. The appropriate reliability measure will depend on the type of study being designed and the intended analysis. Suppose that “site” is explicitly considered as a design factor and that as a result “site” is explicitly accounted for in the data analysis. Then it might seem that the site factor will address consistent differences across sites and that an ICC measuring relative agreement would be appropriate. This argument is plausible as long as “site” is orthogonal or independent of other design/analysis factors. For such studies, the Pearson correlation, the generalizability coefficient of Generalizability Theory, or the ICC(3,1) statistic of Shrout and Fleiss (all of which look for relative rather than absolute agreement) would be appropriate statistics to assess reliability (Brennan, 2001, Shavelson et al., 1989, Shrout and Fleiss, 1979). If, on the other hand, there are associations between site and other factors, for example, variation in the patient/control mix across sites or variation in a genotype of interest, then adjusting for site in the analysis is not enough to eliminate all site effects, and it is valuable to consider an ICC measuring absolute consistency.
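The distinction between the two families of coefficients can be made concrete in terms of variance components: a relative-agreement (consistency) coefficient drops the site component from the denominator, while the absolute-agreement form retains it. The component values below are invented for illustration:

```python
def icc_consistency(var_person, var_resid):
    """Relative agreement: a consistent site offset shifts every
    subject equally, so it is not counted as error."""
    return var_person / (var_person + var_resid)

def icc_agreement(var_person, var_site, var_resid):
    """Absolute agreement: a consistent site offset changes the
    measured values themselves, so it counts against reliability."""
    return var_person / (var_person + var_site + var_resid)

# With a large consistent site offset the two diverge:
# icc_consistency(10.0, 4.0)      -> 10/14, about 0.71
# icc_agreement(10.0, 6.0, 4.0)   -> 10/20 = 0.50
```

When the goal is to pool raw values across sites, as in the present study, the lower, absolute-agreement coefficient is the relevant one.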
In these circumstances, having established in a reliability study that site variation contributes only a small amount of variation to the pooled variance would permit the pooling, which in turn should increase the likelihood that important subgroups will be detected and would enhance both statistical power and the generalizability of results. The reliability results of the present study were used to design a large study where the range of genetic variation and relevant symptom subtypes could not be determined a priori. We therefore calculated ICCs to assess the consistency of the absolute magnitude of the fMRI/BOLD response in order to determine whether data pooling would be justified.
Section snippets
Participants
Nine male and nine female, healthy, right-handed volunteers were studied once at each of three magnet sites and twice at a fourth site (mean [range], age: 34.44 [23–53] years; education: 17.06 [12–23] years). The sample size was chosen so that the lower 0.01 confidence interval for an ICC at the lower limits of excellent reliability (0.75) would exceed ICC values at poor levels of reliability (<0.40) (Walter, 1998). All participants were employed, with the largest number of individuals (eight)
Voxelwise maps
To determine whether the BOLD response changed merely by being repeated across the four sites, the effect of session order on the recognition versus scrambled faces contrast was tested with a voxelwise repeated measures analysis. Because the test revealed no significant clusters, session order was ignored in the following analyses.
Voxelwise plots of the site variance component revealed little variation in most brain regions (Fig. 2A). Voxels in the superior sagittal sinus, in the most dorsal
Discussion
Between-site reliability of the BOLD response elicited by working memory conditions can be good to excellent in many brain regions, although the extent of reliability depends on the specific cognitive contrast studied, the number of runs averaged, and the brain area investigated. In five of six regions of interest studied, variance associated with people exceeded site variance by at least 10-fold. There is now evidence from several multisite variance components analyses of BOLD data showing that
References (54)
- Bosnell et al. Reproducibility of fMRI in the clinical setting: implications for trial designs. Neuroimage (2008)
- Caceres et al. Measuring fMRI reliability with the intra-class correlation coefficient. Neuroimage (2009)
- Casey et al. Reproducibility of fMRI results across four institutions using a spatial working memory task. Neuroimage (1998)
- Cohen et al. Hypercapnic normalization of BOLD fMRI: comparison across field strengths and pulse sequences. Neuroimage (2004)
- Cox. AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Comput. Biomed. Res. (1996)
- Friedman et al. Reducing interscanner variability of activation in a multicenter fMRI study: controlling for signal-to-fluctuation-noise-ratio (SFNR) differences. Neuroimage (2006)
- Gountouna et al. Functional magnetic resonance imaging (fMRI) reproducibility and variance components across visits and scanning sites with a finger tapping task. Neuroimage (2010)
- Maldjian et al. An automated method for neuroanatomic and cytoarchitectonic atlas-based interrogation of fMRI data sets. Neuroimage (2003)
- Maldjian et al. Precentral gyrus discrepancy in electronic versions of the Talairach atlas. Neuroimage (2004)
- Mulkern et al. Establishment and results of a magnetic resonance quality assurance program for the pediatric brain tumor consortium. Acad. Radiol. (2008)