Multisite reliability of cognitive BOLD data
Research Highlights
- Person variance was more than 10 times larger than site variance for most ROIs.
- Person-by-site interactions contributed sizable unwanted variance to the total.
- Many voxels showed good to excellent between-site reliability.
- Regions of interest displayed fair to good between-site reliability.
- Acquiring reliable fMRI data requires ongoing quality assurance.
Introduction
Several multi-site functional magnetic resonance imaging (fMRI) studies are currently in progress or being planned (Van Horn and Toga, 2009). The larger samples made possible by multi-site studies can potentially increase statistical power, enhance the generalizability of study results, facilitate the identification of disease risk, increase the odds of finding uncommon genetic variations, make rare disease and subgroup identification possible, help justify multivariate analyses, and support cross-validation designs (Cohen, 1988, Friedman and Glover, 2006a, Jack et al., 2008, Mulkern et al., 2008, Van Horn and Toga, 2009). Yet the potential advantages of multi-site functional imaging studies could be offset by unwanted variation in imaging methods across sites. Even when the same activation task is used at different sites and the same image processing path is employed, potential site differences might arise from differences in stimulus delivery and response recording, head stabilization method, field strength, the geometry of field inhomogeneity, gradient performance, transmit and receive coil configuration, system stability, shimming method, type and details of the image sequence including k-space trajectory, type of k-space filtering, system maintenance, and environmental noise (Friedman and Glover, 2006a, Friedman and Glover, 2006b, Ojemann et al., 1998, Van Horn and Toga, 2009, Voyvodic, 2006). The large number of experimental factors that might differ between sites could introduce unwanted variation related to site and its interactions into a multi-site fMRI study. This unwanted variation may, in turn, undermine the advantages of increased statistical power and enhanced generalizability that would otherwise be associated with large-sample studies.
Given that unwanted between-site variation is itself likely to vary from multi-site study to multi-site study, determining the magnitude of site variation and evaluating its impact on the consistency of results across sites have become a critical component of multi-site fMRI studies (Friedman et al., 2008, Pearlson, 2009).
The consistency of blood oxygen level-dependent (BOLD) fMRI values across sites has been studied for a variety of behavioral activation tasks using several different statistical approaches. One common approach is to measure between-site consistency by assessing the extent of overlap of either observed or latent activation regions (Casey et al., 1998, Gountouna et al., 2010, Vlieger et al., 2003, Zou et al., 2005). These studies find only a modest degree of overlap in the extent of activation, with the number of regions found to be significantly activated varying by fivefold across sites in one study (Casey et al., 1998). Differences in field strength and k-space trajectory have accounted for significant between-site variation in some studies (Cohen et al., 2004, Voyvodic, 2006, Zou et al., 2005). Even when Cartesian k-space trajectories are used at all sites, differences in the type of image acquisition protocol can produce differences in the spatial extent and magnitude of the BOLD signal, as studies comparing gradient-recalled echo protocols with spin echo and asymmetric spin echo protocols show (Cohen et al., 2004, Ojemann et al., 1998).
Methods that measure the overlap of activation extent and volume across MR systems have been criticized for assuming invariant null hypothesis distributions across sites and for the use of a specific threshold to determine statistical significance (Suckling et al., 2008, Voyvodic, 2006). The distributions of the test statistics, however, can be checked, and if differences in distributions are found, methods are available to adjust the activation maps or modify the statistical analysis (Miller, 1986, Voyvodic, 2006). With regard to the second criticism, overlap consistency can be investigated across a range of thresholds, as in Zou et al.'s (2005) study. A more fundamental limitation of overlap methods is that they do not provide a standard against which to judge the importance of a particular degree of overlap. The question of how much overlap is necessary to produce statistically robust and generalizable findings typically remains after percent overlap statistics are presented. In addition, overlap methods do not address the question of how consistently subjects can be rank-ordered by their brain response across sites. For example, in a study involving assessment of the same subjects scanned at multiple sites, a high degree of overlap in regional activation might be observed in group-level mean statistical maps across sites even though the rank-ordering of the brain response of subjects changes randomly from site to site. This possibility underscores the point that cross-site reliability of fMRI measurements cannot be determined solely by examining the consistency of group activation maps across sites using a repeated-measures ANOVA model.
Variance components analysis (VCA), which assesses components of systematic and error variance by random effects models, is another commonly used method to assess cross-site consistency of fMRI data (Bosnell et al., 2008, Costafreda et al., 2007, Dunn, 2004, Friedman et al., 2008, Gountouna et al., 2010, Suckling et al., 2008). VCA provides several useful standards against which to judge the importance of consistency findings. For investigators interested in studying the brain basis of individual differences, one natural standard is to compare variance components of nuisance factors against the person variance. This strategy of assessing the importance of between-site variation in fMRI studies has been used in several studies. Costafreda and colleagues, for example, found that between-subject variance was nearly seven-fold larger than site variance in the region of interest (ROI) they studied (Costafreda et al., 2007). In the Gountouna et al. (2010) study, site variance was virtually zero for several ROIs, with the ratio of subject variance to site variance as large as 44 to 1 in another especially reliable ROI. Suckling and colleagues reported a value of between-subject variance that was slightly more than 10 times greater than the between-site variance (Suckling et al., 2008). Using a fixed-effect ANOVA model, Sutton and colleagues found effect sizes for between-subject differences to be seven to sixteen times larger than the between-site effect for the ROIs studied (Sutton et al., 2008).
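As a minimal sketch of the kind of decomposition these studies report, the person and site variance components for a persons-by-sites table of ROI beta-weights can be estimated from the expected mean squares of a two-way random-effects ANOVA. With one observation per cell, the person-by-site interaction is confounded with residual error; the function name and layout here are illustrative, not the authors' actual pipeline:

```python
import numpy as np

def variance_components(y):
    """Estimate person, site, and residual (person-by-site + error)
    variance components from a persons x sites matrix `y` with one
    observation per cell, using expected mean squares."""
    n_p, n_s = y.shape
    grand = y.mean()
    # Mean squares for the person and site main effects
    ms_person = n_s * ((y.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    ms_site = n_p * ((y.mean(axis=0) - grand) ** 2).sum() / (n_s - 1)
    # Residual after removing both sets of marginal means
    resid = (y - y.mean(axis=1, keepdims=True)
               - y.mean(axis=0, keepdims=True) + grand)
    ms_resid = (resid ** 2).sum() / ((n_p - 1) * (n_s - 1))
    # Method-of-moments estimates, truncated at zero
    var_person = max((ms_person - ms_resid) / n_s, 0.0)
    var_site = max((ms_site - ms_resid) / n_p, 0.0)
    return var_person, var_site, ms_resid
```

A ratio such as `var_person / var_site` then gives the person-to-site comparison used as a standard in the studies above.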
Variance components are also commonly used to calculate intraclass correlations (ICC) that can be compared with the intraclass correlations of other, perhaps more familiar, variables (Brennan, 2001, Dunn, 2004). Between-site intraclass correlations of ROIs have been reported in the 0.20 to 0.70 range for fMRI data, depending on the activation measure, behavioral task, and degree of homogeneity of the magnets studied (Bosnell et al., 2008, Friedman et al., 2008). For clinical ratings of psychiatric symptoms, ICCs in the 0.20 to 0.70 range would be rated as ranging from poor to good (Cicchetti and Sparrow, 1981). The intraclass correlation can also be used to assess the impact of measurement error on the statistical power of between-site studies, providing a third method to assess the importance of VCA results (Bosnell et al., 2008, Suckling et al., 2008).
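Given estimates of the person, site, and residual variance components, the absolute-agreement ICC referred to here is simply the share of total variance attributable to persons. A sketch, with invented component values in the comment:

```python
def icc_absolute(var_person, var_site, var_resid):
    """Absolute-agreement intraclass correlation for single-site
    measurements: person variance over total variance, so consistent
    site offsets count against reliability."""
    total = var_person + var_site + var_resid
    return var_person / total if total > 0 else 0.0

# Example with hypothetical components where person variance is
# 10x site variance: icc_absolute(10.0, 1.0, 4.0) -> 10/15, about 0.67
```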
In the present study, healthy volunteers were scanned once at three sites and twice at a fourth site while performing a working memory task with emotional or neutral distraction. We investigated the between-site reliability of BOLD/fMRI data by calculating variance components for voxelwise data. Voxelwise VCA and ICC maps are presented in order to identify voxel clusters where particular components of variation were most prominent and where between-site reliability was largest. Presenting data for all voxels avoided limitations of generalization associated with the use of specific statistical thresholds. We also present findings obtained by averaging beta-weights over voxels within selected regions of interest (ROI) to simplify the comparison of our results with those of other studies that presented ROI results. The main study hypotheses follow:
- (1) Clusters of voxels will be identified where the between-subject variation will be more than 10-fold the value of the between-site variation. This hypothesis is derived from the VCA studies discussed above.
- (2) Variation attributed to the person-by-site interaction will be greater than variation associated with the site factor. Because person-by-site interactions occur when the rank ordering and/or distance of the BOLD response of subjects differs across sites, this source of variation reflects a broader array of potential sources of error than site variation alone.
- (3) The magnitude of the ICC will depend on whether the functional contrast involved a low-level or high-level control condition, where a low-level control involves only orientation, attention, and basic perceptual processing, and a high-level control condition involves additional cognitive processes (Donaldson and Buckner, 2001). High-level control conditions are likely to share a larger number of neural processes with the experimental condition than low-level control conditions, leading to a larger subtraction of neural activity and a smaller BOLD signal, especially for brain regions that participate in a multiplicity of neurocognitive functions engaged by the experimental task. The smaller magnitude of the BOLD response is likely to restrict the range of person variance, reducing the between-site ICC (Cronbach, 1970, Magnusson, 1966).
- (4) The between-site intraclass correlation will increase as the number of runs averaged increases. Although this hypothesis has been supported by a previous study involving a sensorimotor task, it has not been tested in BOLD data generated by a cognitive task (Friedman et al., 2008).
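The expected gain from averaging runs can be illustrated with the classical Spearman-Brown prophecy formula, which predicts the reliability of the mean of k measurements from single-run reliability. This is a simplification relative to the variance-components models used in the study itself, since it treats runs as parallel forms and folds all non-person variance into error:

```python
def spearman_brown(icc_single, k):
    """Predicted reliability of the average of k parallel runs,
    given the reliability of a single run."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# A single-run ICC of 0.40 (hypothetical) rises toward 0.73
# when four runs are averaged: spearman_brown(0.4, 4)
```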
We also investigated the relationships among between-site reliability, effect size, and sample size. A previous within-site reliability study found correlations greater than 0.95 between the median within-site ICC and the activation-threshold t-test value for both an auditory target detection task and an N-back working memory task (Caceres et al., 2009). We will examine the relationship between reliability and effect size for the between-site case. Although we anticipate that median effect size calculated across voxels will be strongly related to the magnitude of between-site reliability for those voxels, dissociations might be observed. If voxels with large activation effect sizes and poor reliability are observed, we will investigate the possibility that these voxels have poor reliability due to reduced variation among people (Brennan, 2001, Magnusson, 1966). If voxels with small activation effect sizes and good between-site reliability are observed, we will investigate the possibility that activation magnitude within subjects is consistent across sites, yet balanced between negative and positive activation values.
In the present study, the specific form of the ICC we calculated assessed between-site consistency at an absolute level (Brennan, 2001, Friedman et al., 2008, Shrout and Fleiss, 1979). High between-site ICC values, therefore, would support the interchangeability of data and justify the pooling of fMRI values across sites (Friedman et al., 2008, Shavelson et al., 1989). There are, of course, alternative definitions of the ICC (Shrout and Fleiss, 1979), and it is useful here to provide some discussion of the factors that would affect the choice to assess reliability based on absolute or relative agreement of measurements at different sites. The appropriate reliability measure will depend on the type of study being designed and the intended analysis. Suppose that “site” is explicitly considered as a design factor and that as a result “site” is explicitly accounted for in the data analysis. Then it might seem that the site factor will address consistent differences across sites and that an ICC measuring relative agreement would be appropriate. This argument is plausible as long as “site” is orthogonal or independent of other design/analysis factors. For such studies, the Pearson correlation, the generalizability coefficient of Generalizability Theory, or the ICC(3,1) statistic of Shrout and Fleiss (all of which look for relative rather than absolute agreement) would be appropriate statistics to assess reliability (Brennan, 2001, Shavelson et al., 1989, Shrout and Fleiss, 1979). If, on the other hand, there are associations between site and other factors, for example, variation in the patient/control mix across sites or variation in a genotype of interest, then adjusting for site in the analysis is not enough to eliminate all site effects, and it is valuable to consider an ICC measuring absolute consistency.
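The distinction between the two families of coefficients can be made concrete in terms of variance components: a relative-agreement (consistency) coefficient drops the site component from the denominator, while the absolute-agreement form retains it. The component values below are invented for illustration:

```python
def icc_consistency(var_person, var_resid):
    """Relative agreement: a consistent site offset shifts every
    subject equally, so it is not counted as error."""
    return var_person / (var_person + var_resid)

def icc_agreement(var_person, var_site, var_resid):
    """Absolute agreement: a consistent site offset changes the
    measured values themselves, so it counts against reliability."""
    return var_person / (var_person + var_site + var_resid)

# With a large consistent site offset the two diverge:
# icc_consistency(10.0, 4.0)      -> 10/14, about 0.71
# icc_agreement(10.0, 6.0, 4.0)   -> 10/20 = 0.50
```

When the goal is to pool raw values across sites, as in the present study, the lower, absolute-agreement coefficient is the relevant one.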
In these circumstances, having established in a reliability study that site variation contributes only a small amount of variation to the pooled variance would permit the pooling, which in turn should increase the likelihood that important subgroups will be detected and would enhance both statistical power and the generalizability of results. The reliability results of the present study were used to design a large study where the range of genetic variation and relevant symptom subtypes could not be determined a priori. We therefore calculated ICCs to assess the consistency of the absolute magnitude of the fMRI/BOLD response in order to determine whether data pooling would be justified.
Section snippets
Participants
Nine male and nine female, healthy, right-handed volunteers were studied once at each of three magnet sites and twice at a fourth site (mean [range], age: 34.44 [23–53] years; education: 17.06 [12–23] years). The sample size was chosen so that the lower 0.01 confidence interval for an ICC at the lower limits of excellent reliability (0.75) would exceed ICC values at poor levels of reliability (<0.40) (Walter, 1998). All participants were employed, with the largest number of individuals (eight)
Voxelwise maps
To determine whether the BOLD response changed merely by being repeated across the four sites, the effect of session order on the recognition versus scrambled faces contrast was tested with a voxelwise repeated measures analysis. Because the test revealed no significant clusters, session order was ignored in the following analyses.
Voxelwise plots of the site variance component revealed little variation in most brain regions (Fig. 2A). Voxels in the superior sagittal sinus, in the most dorsal
Discussion
Between-site reliability of the BOLD response elicited by working memory conditions can be good to excellent in many brain regions, although the extent of reliability depends on the specific cognitive contrast studied, the number of runs averaged, and the brain area investigated. In five of six regions of interest studied, variance associated with people exceeded site variance by at least 10-fold. There is now evidence from several multisite variance components analyses of BOLD data showing that
References (54)
- Bosnell et al. Reproducibility of fMRI in the clinical setting: implications for trial designs. Neuroimage (2008)
- Caceres et al. Measuring fMRI reliability with the intra-class correlation coefficient. Neuroimage (2009)
- Casey et al. Reproducibility of fMRI results across four institutions using a spatial working memory task. Neuroimage (1998)
- Cohen et al. Hypercapnic normalization of BOLD fMRI: comparison across field strengths and pulse sequences. Neuroimage (2004)
- Cox. AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Comput. Biomed. Res. (1996)
- Friedman et al. Reducing interscanner variability of activation in a multicenter fMRI study: controlling for signal-to-fluctuation-noise-ratio (SFNR) differences. Neuroimage (2006)
- Gountouna et al. Functional magnetic resonance imaging (fMRI) reproducibility and variance components across visits and scanning sites with a finger tapping task. Neuroimage (2010)
- Maldjian et al. An automated method for neuroanatomic and cytoarchitectonic atlas-based interrogation of fMRI data sets. Neuroimage (2003)
- Maldjian et al. Precentral gyrus discrepancy in electronic versions of the Talairach atlas. Neuroimage (2004)
- Mulkern et al. Establishment and results of a magnetic resonance quality assurance program for the pediatric brain tumor consortium. Acad. Radiol. (2008)