Test–retest variability underlying fMRI measurements
Introduction
Functional Magnetic Resonance Imaging (fMRI) is a widely applied method for measuring brain activation in humans. For some purposes of fMRI, such as planning of neurosurgery (Rutten et al., 2002), the definition of phenotypes in genetic studies (Turetsky et al., 2007), or clinical trials predicting the outcome of pharmacological treatment (Chen et al., 2007), a high degree of reliability is required, meaning that differences with retesting should be minimal. However, it is well known that activation maps in the same subjects can contain substantial variation across sessions (McGonigle et al., 2000).
This is not surprising, as the fMRI signal contains not only activation-related signal (i.e. the Blood Oxygen Level Dependent (BOLD) signal) but also noise. This noise is produced both by the scanner and by human physiological processes such as heartbeat and respiration (Kruger and Glover, 2001; van Buuren et al., 2009). Because of this noise, the estimate of the true BOLD signal in a given voxel will fluctuate around the true underlying mean BOLD signal. We postulate that this true underlying BOLD signal would be revealed if one could obtain a number of scans approaching infinity during each experimental session. In regular fMRI experiments, however, the number of obtainable samples is limited, so noise in the fMRI signal is an important factor determining reliability (Bennett and Miller, 2010).
Besides noise, estimates of the underlying BOLD signal can also differ because there are true underlying BOLD signal changes between sessions. This true variation, as opposed to variation due to noise, refers to between-session signal changes that are larger than would be expected from noise alone. More specifically, we define the true variation as the variation in signal that would be measured with a number of scans approaching infinity. In this study we aim to estimate the amount of true variation. An estimate of true variation in the underlying BOLD signal yields a theoretical limit on the reliability of individual fMRI measurements. Such a limit is important not only for assessing the feasibility of future fMRI studies, but also for providing a more elaborate background for the general interpretation of fMRI results.
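To make the distinction between noise and true variation concrete, the following sketch simulates two sessions in which each voxel's estimate equals the underlying signal plus measurement noise. Because the noise contribution to the expected between-session squared difference is known, it can be subtracted to recover the true between-session variation. All values (signal mean, SDs, voxel count) are illustrative assumptions, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

n_voxels = 20_000
true_mean = 1.0      # mean underlying BOLD signal (% change), illustrative
sigma_true = 0.15    # true between-session SD of the underlying signal
sigma_noise = 0.20   # noise SD of each session's estimate

# Underlying signal per session = mean + session-specific true change
underlying_1 = true_mean + rng.normal(0.0, sigma_true, n_voxels)
underlying_2 = true_mean + rng.normal(0.0, sigma_true, n_voxels)

# With a finite number of scans, each estimate is underlying signal + noise
est_1 = underlying_1 + rng.normal(0.0, sigma_noise, n_voxels)
est_2 = underlying_2 + rng.normal(0.0, sigma_noise, n_voxels)

# E[(est_1 - est_2)^2] = 2*sigma_true^2 + 2*sigma_noise^2, so the known
# noise variance can be subtracted to isolate the true variation
msd = np.mean((est_1 - est_2) ** 2)
sigma_true_hat = np.sqrt(max(msd / 2 - sigma_noise**2, 0.0))

# True variation expressed as a percentage of the mean underlying signal
print(round(100 * sigma_true_hat / true_mean, 1))
```

With noiseless, infinitely long sessions the subtraction step would be unnecessary; the point of the sketch is that the noise term must be estimated and removed before between-session differences can be attributed to true variation.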
We also attempt to address the nature of the variability in the underlying BOLD signal by partitioning the between-session variation into two components: global effects and spatial pattern. These components possibly have different sources. First, fMRI signals (i.e. noise and BOLD signal) can vary due to global whole-brain variations, affecting the amplitudes of BOLD responses to a similar extent across the entire brain. This type of variation scales the amplitudes of BOLD responses (and their estimates) by roughly the same factor throughout the brain, but leaves the spatial pattern of activation relatively unchanged (see Fig. 1A for a schematic representation). Second, the underlying signal can also differ because of changes in the spatial pattern of activation. The pattern of activation begins to differ when the amount of activation in one voxel changes relative to that in another, after whole-brain variation in the amplitude of activation is taken into account (see Fig. 1B for a schematic representation). Variation in the spatial pattern of activation will be assessed per brain area.
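The distinction between the two components can be illustrated with a small simulation: a purely global effect rescales every voxel by the same factor and leaves the spatial correlation between sessions at 1, whereas a pattern change lowers it. The amplitudes and noise levels below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Session-1 activation amplitudes across voxels (arbitrary units)
pattern = rng.gamma(2.0, 1.0, 1000)

# (A) Global effect: every amplitude scales by a common factor; the
# spatial pattern (relative activation across voxels) is unchanged.
session_global = 1.3 * pattern

# (B) Pattern change: voxel amplitudes change relative to each other.
session_pattern = pattern + rng.normal(0.0, 0.5, 1000)

# Spatial correlation is insensitive to global scaling but is reduced
# by pattern change
r_global = np.corrcoef(pattern, session_global)[0, 1]
r_pattern = np.corrcoef(pattern, session_pattern)[0, 1]
```

This is why a correlation-style measure isolates pattern variability, while amplitude variability must be assessed from the scaling factor between sessions.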
Subjects performed a visual (blocked design) and a motor inhibition task (event related design) on two occasions separated by one week. We assessed the presence of global changes in the amplitude of the pattern of BOLD activation and changes in the spatial pattern of underlying BOLD activation in individual subjects. Results show that the underlying patterns of activation are relatively stable over sessions for all brain areas, while the whole brain amplitude of activation is more variable.
Background
The purpose of the analysis was to estimate the variability in the true underlying BOLD signal between sessions. We wanted to express this true underlying variability as a percentage of the mean underlying signal. We assumed that the true underlying BOLD signal would be found when we acquired an infinite number of scans during each session. However, as we extrapolate this estimate of variation in underlying BOLD from sessions that are in reality limited in duration, we also assume the ideal
SDparallel and SDorthogonal
The SDparallel and the SDorthogonal were determined for the two tasks and for different brain areas. The results can be seen in Fig. 4. The scatterplots of t-values and the fitted lines that underlie these estimates can be seen for all subjects in Supplement 1 for the visual task, and in Supplement 2 for the motor inhibition task. Results were tested univariately with one-sample t-tests and Bonferroni corrected (p = 0.05) for the number of comparisons (70, equal to the number of cortical segments),
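As a rough sketch of how such a t-value scatterplot could be decomposed into variation along a fitted line (a shared amplitude change) and orthogonal to it (a pattern change), one could project the centered session-1/session-2 t-values onto the first principal axis. The fitting procedure and names below are our illustrative assumptions and may differ from the published definitions of SDparallel and SDorthogonal:

```python
import numpy as np

def parallel_orthogonal_sd(t1, t2):
    """Split between-session scatter of voxelwise t-values into a component
    along (parallel to) and perpendicular to (orthogonal to) a fitted line.

    Illustrative sketch: the line is the first principal axis of the
    centered data, i.e. a total-least-squares fit.
    """
    X = np.column_stack([t1, t2])
    X = X - X.mean(axis=0)
    # First right-singular vector = direction of the fitted line
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    axis = vt[0]
    proj_parallel = X @ axis                         # coordinates along the line
    proj_orth = X @ np.array([-axis[1], axis[0]])    # perpendicular coordinates
    return proj_parallel.std(), proj_orth.std()

rng = np.random.default_rng(2)
t1 = rng.gamma(2.0, 2.0, 2000)             # session-1 t-values (illustrative)
t2 = 1.2 * t1 + rng.normal(0, 0.4, 2000)   # rescaled + small pattern noise

sd_par, sd_orth = parallel_orthogonal_sd(t1, t2)
```

In this synthetic case the pattern is stable but the amplitude is rescaled between sessions, so the orthogonal spread is much smaller than the parallel spread, mirroring the study's qualitative finding.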
Discussion
We estimated differences in underlying BOLD activation between sessions. The amount of variability in the underlying BOLD signal (i.e. true variation) can yield a theoretical limit of fMRI reliability of individual measurements. In this test–retest study, subjects were scanned one week apart while performing a visual and motor inhibition task. We specifically investigated variations in the spatial pattern of activation, and global changes in the amplitude of the spatial pattern of activation.
Conclusions
In summary, results from this study show that underlying patterns of BOLD activation are relatively stable across sessions, while the amplitude of the activation is more variable. The small pattern variability that we observed was caused by a general phenomenon of the most active voxels also showing the most variation, irrespective of brain area. Furthermore, this pattern variability was present mostly on a very local scale (neighboring voxels). The variability in the amplitudes (global
References (35)
- et al. Long-term test–retest reliability of functional MRI in a classification learning task. NeuroImage (2006)
- et al. Measuring fMRI reliability with the intra-class correlation coefficient. NeuroImage (2009)
- et al. Brain imaging correlates of depressive symptom severity and predictors of symptom improvement after antidepressant treatment. Biol. Psychiatry (2007)
- et al. Cortical surface-based analysis. I. Segmentation and surface reconstruction. NeuroImage (1999)
- et al. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage (2006)
- et al. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. NeuroImage (1999)
- et al. Test–retest reliability of event-related functional MRI in a probabilistic reversal learning task. Psychiatry Res. (2009)
- et al. Single-trial discrimination for integrating simultaneous EEG and fMRI: identifying cortical areas contributing to trial-to-trial variability in the auditory oddball task. NeuroImage (2009)
- et al. An event-related fMRI study of the neurobehavioral impact of sleep deprivation on performance of a delayed-match-to-sample task. Brain Res. Cogn. Brain Res. (2004)
- et al. Variability in fMRI: an examination of intersession differences. NeuroImage (2000)
- Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci.
- Test–retest reliability of fMRI activation during prosaccades and antisaccades. NeuroImage
- Investigation of low frequency drift in fMRI signal. NeuroImage
- Head-repositioning does not reduce the reproducibility of fMRI activation in a block-design motor task. NeuroImage
- Striatal dysfunction in schizophrenia and unaffected relatives. Biol. Psychiatry
- Within-subject variation in BOLD-fMRI signal changes across repeated measurements: quantification and implications for sample size. NeuroImage
- How reliable are the results from functional magnetic resonance imaging? Ann. N. Y. Acad. Sci.