Comparing the test–retest reliability of resting‐state functional magnetic resonance imaging metrics across single band and multiband acquisitions in the context of healthy aging

Abstract The identification of meaningful functional magnetic resonance imaging (fMRI) biomarkers requires measures that reliably capture brain performance across different subjects and over multiple scanning sessions. Recent developments in fMRI acquisition, such as the introduction of multiband (MB) protocols and in‐plane acceleration, allow for increased scanning speed and improved temporal resolution. However, they may also lead to reduced temporal signal to noise ratio and increased signal leakage between simultaneously excited slices. These methods have been adopted in several scanning modalities including diffusion weighted imaging and fMRI. To our knowledge, no study has formally compared the reliability of the same resting‐state fMRI (rs‐fMRI) metrics (amplitude of low‐frequency fluctuations; seed‐to‐voxel and region of interest [ROI]‐to‐ROI connectivity) across conventional single‐band fMRI and different MB acquisitions, with and without in‐plane acceleration, across three sessions. In this study, 24 healthy older adults were scanned over three visits, on weeks 0, 1, and 4, and, on each occasion, underwent a conventional single band rs‐fMRI scan and three different rs‐fMRI scans with MB factors 4 and 6, with and without in‐plane acceleration. Across all three rs‐fMRI metrics, the reliability scores were highest with MB factor 4 with no in‐plane acceleration for cortical areas and with conventional single band for subcortical areas. Recommendations for future research studies are discussed.


| INTRODUCTION
Over the past 20 years, resting-state functional magnetic resonance imaging (rs-fMRI) research has proven invaluable for shedding light on intrinsic functional networks in the brain at rest (Fox & Raichle, 2007).
More specifically, studies have suggested that rs-fMRI may be used to explore the neural architecture of developmental, aging, and pathological processes and help establish predictive biomarkers for mental illness (Fox & Raichle, 2007; Greicius, 2008). Identifying robust and trustworthy biomarkers necessitates measures that have satisfactory test-retest reliability over multiple visits. Indeed, reliable measures are one of the cornerstones of scientific progress as they ensure that similar results are observed when the same scan sequence is repeated on the same group of subjects over several visits.
In the last two decades, concerns regarding the reliability of biomedical research have been increasingly expressed (Ioannidis, 2005; Prinz et al., 2011) and recent research has focused on identifying best practices with a view to improving the test-retest reliability of standard rs-fMRI measures (Noble et al., 2019; Zuo & Xing, 2014). Overall, the observed test-retest reliability scores of rs-fMRI metrics vary greatly across studies, with reliability scores ranging between poor (Noble et al., 2019; Shou et al., 2013) and moderate to excellent (Birn et al., 2013; Guo et al., 2012; O'Connor et al., 2017). These discrepancies across studies can be attributed to distinct factors that have previously been highlighted as having a direct impact on test-retest reliability. Indeed, recent literature has emphasized that reliability tends to be higher for brain regions located in the cortex rather than the subcortex (Shah et al., 2016) and for measures focusing on the regional intensity of spontaneous brain activity, also known as amplitude of low-frequency fluctuations (ALFF; Wang et al., 2013), compared with measures of functional connectivity (FC) per se such as seed-to-voxel activation maps (Shou et al., 2013). Nonetheless, while the literature points towards these factors as having a major impact on test-retest reliability, other features remain to be explored, such as those associated with recent developments in magnetic resonance physics.
Echo planar imaging (EPI) was first introduced by Mansfield (1977) and, since the early 1990s, has been mostly used for blood oxygen level-dependent (BOLD) fMRI and diffusion weighted imaging investigations. Whilst conventional EPI can acquire a whole brain image in 2-3 s, multiband (MB) acquisitions, first introduced by Larkman et al. (2001), simultaneously excite multiple slice locations, thus decreasing the time needed to scan a whole brain volume to <1 s (Liao et al., 2013). However, imperfections in the excitation profile of the MB RF pulses can lead to (1) signal leakage between simultaneously excited slices (Todd et al., 2016); (2) signal dropout resulting in decreased temporal signal to noise ratio (tSNR; Chen et al., 2015); and (3) enhanced spatially heterogeneous noise amplification which increases at higher MB factors (Risk et al., 2021).
To overcome issues related to signal dropouts and susceptibility-related distortions, in-plane acceleration has commonly been added in contemporary MB fMRI studies. A recent study has shown that a total acceleration of 4 (i.e., MB factor 2 with in-plane acceleration 2) is optimal with regards to sensitively detecting common rs networks while offering a negligible decrease in signal to noise ratio compared with a total acceleration of 2, 6, and 8 (Preibisch et al., 2015). Other studies have recommended the use of a MB factor of 4 for whole-brain fMRI scanning, while single band acquisitions are reported to be better suited for studies focused on activity in subcortical regions due to differences in signal detection sensitivity (Risk et al., 2021).
In terms of test-retest reliability, however, only a few studies have compared the impact of different TRs and acceleration factors on the reliability of commonly used rs-fMRI metrics (Golestani et al., 2017; Wang et al., 2013). Both studies showed the reliability of the ALFF measure to be higher with shorter TRs compared with a conventional rs-fMRI sequence with a lower sampling rate (Golestani et al., 2017; Wang et al., 2013). Limitations of these studies include a small sample size (n = 8; Golestani et al., 2017), a comparison based solely on sampling rate rather than a clear comparison between MB and single band sequences (Golestani et al., 2017) and the comparison of data acquired from different scanners (Wang et al., 2013). To our knowledge, no studies have yet explored which combination of MB factor and in-plane acceleration yields the best test-retest reliability in comparison with single band sequences all acquired from the same scanner and with a reasonable sample size in the context of three different rs-fMRI measures such as ALFF, seed-to-voxel analysis and region of interest (ROI)-to-ROI analysis. In this study, we aim to address this need.
Based on the studies cited above, particularly those that show altered sensitivity subcortically, we hypothesized that (1) reliability scores would be significantly higher for MB protocols compared with single band for cortical regions, while single band would be best suited for subcortical regions. More specifically, we also hypothesized that, (2) for cortical regions, MB4 with no in-plane acceleration (i.e., a total acceleration of 4) would yield the best reliability scores, while MB4 with in-plane acceleration 2 (i.e., a total acceleration of 8) would be the MB modality associated with the lowest reliability scores. We also hypothesized that (3) reliability scores would be higher with the ALFF measure compared with the seed-to-voxel metrics, and (4) would be higher for cortical regions compared with subcortical regions across all three metrics.

| Participants
In total, 30 healthy right-handed adults aged 52-73 (19 males and 11 females) participated in the study after providing written informed consent. All participants met the following inclusion criteria: being right-handed, aged between 50 and 75, being physically healthy, not having any MRI contraindications (e.g., pacemaker, heart valve, metal in the body, claustrophobia), not suffering from any psychiatric or neurological disorder and not being on any psychoactive medication.
Of the 30 participants who took part, two dropped out before completing all three scans and the data from four further participants were discarded due to technical issues that occurred during the scans. Therefore, the final number of participants included in the analyses was 24 (15 males and 9 females, age = 61.3 ± 7.9 years). The study was approved by the King's College London human Research Ethics Committee (number HR-17/18-5720). After each visit, the researchers visually inspected the scans for artifacts and all scans were reviewed by a qualified radiologist in order to rule out any neurological disorder, in line with the Department of Neuroimaging's standard policies. Unprocessed EPI images of slice 21 in axial view for run 1 for each participant and each rs-fMRI modality can be found in Figure S1.
None of the participants had any history of psychiatric disorder or neurological disease or received any psychoactive treatment during the study.

| Procedure
Each participant was invited to attend three scanning sessions at the Centre for Neuroimaging Sciences (Institute of Psychiatry, Psychology and Neuroscience; King's College London), on Weeks 0, 1, and 4 (±1 day), at the same time of day (±1 h).

| MRI data acquisition
On each of the three visits, all participants were scanned in the same 3 T MRI scanner (Discovery MR750, General Electric, Milwaukee, Wisconsin). All of the images were acquired by experienced qualified radiographers who all rigorously followed the exact same MRI protocol. All participants underwent an anatomical T1-weighted MRI and four rs sequences: (1) standard single band Echo-Planar Imaging, with an in-plane acceleration of 2 (SB-ASSET = 2); (2) MB4, with no in-plane acceleration (MB = 4, ARC = 1); (3) MB4, with an in-plane acceleration of 2 (MB = 4, ARC = 2); and (4) MB6, with no in-plane acceleration (MB = 6, ARC = 1). ASSET, also known as "Array Coil Spatial Sensitivity Encoding," corresponds to the methodology which we employed for the single band sequence and which consists of parallel imaging with in-plane acceleration, combining data in image space as is commonly done when using "Sensitivity Encoding" or "SENSE" acceleration. For MB sequences, we used ARC, also known as "Autocalibrating Reconstruction for Cartesian Imaging," which corresponds to parallel imaging with data combination in k-space, as also performed by "Generalized Auto-calibrating Partial Parallel Acquisition." For this study, we used the Nova 32-channel head coil.
The acquisition parameters for each rs-fMRI sequence are described in Table 1. The order of the four rs-fMRI runs was counterbalanced across imaging sessions and subjects. The anatomical sequence had the following parameters: repetition time = 8.23 ms; echo time = 3.25 ms; flip angle = 12°; field of view = 230 mm²; matrix size = 256 × 256; 1 mm isotropic resolution.
During the acquisition of all four rs-fMRI sequences, the participants were asked to keep their eyes open and fixate on a cross presented on the screen. Additionally, they were provided with headphones and earplugs to reduce any discomfort associated with the noise of the scanner. Each of the four rs-fMRI sequences was 8 min long, with a higher number of images being collected as the MB factor increased, as displayed in Table 1.
During the same data acquisition, a T2 FLAIR, a T2 CUBE, and three Diffusion Tensor Imaging modalities were also collected. However, we did not use them for this study.

| MRI data preprocessing
The data were preprocessed using the Statistical Parametric Mapping

| tSNR analyses
In order to explore how tSNR differs across all four rs-fMRI sequences in cortical regions and in subcortical regions, voxelwise whole-brain tSNR maps were extracted for each participant, each run and each rs-fMRI modality. The tSNR maps were calculated using fMRI images prior to denoising and before the data were demeaned.
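As an illustration of the tSNR measure, the sketch below uses the common definition of tSNR as the temporal mean of a voxel's signal divided by its temporal standard deviation. This definition, the use of the population standard deviation, the function names and the toy values are assumptions made for the example; they are not taken from the study's pipeline.

```python
from statistics import mean, pstdev


def tsnr(timeseries):
    """Temporal SNR of one voxel: temporal mean divided by temporal standard deviation."""
    sd = pstdev(timeseries)
    if sd == 0:
        # Constant voxel (e.g. outside the brain): tSNR is undefined.
        return float("nan")
    return mean(timeseries) / sd


def roi_mean_tsnr(voxel_timeseries):
    """Average the voxel-wise tSNR over the voxels of one ROI, skipping undefined values."""
    values = [tsnr(ts) for ts in voxel_timeseries]
    valid = [v for v in values if v == v]  # NaN is the only value not equal to itself
    return mean(valid)


# Hypothetical ROI with two voxels and four time points each (not demeaned).
toy_roi = [[100, 102, 98, 100], [200, 210, 190, 200]]
toy_roi_tsnr = roi_mean_tsnr(toy_roi)
```

Because the mean is in the numerator, the calculation only makes sense on data that have not been demeaned, which is consistent with the order of operations described above.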
We then used a mask to spatially constrain each tSNR map to each of the 11 ROIs. The resulting maps were then averaged across all three runs and divided into two groups: a cortical group made up of all seven cortical ROIs and a subcortical group comprising all four subcortical ROIs. Within each group, global tSNR was assessed by averaging all the tSNR values across voxels and across all seven ROIs for the cortical group and all four ROIs for the subcortical group. The resulting averaged tSNR values were then submitted to paired t-tests for each group. False discovery rate (FDR) correction (Benjamini & Hochberg, 1995) was used for adjusting for all three contrasts (i.e., SB-ASSET2 compared with MB4-ARC1; SB-ASSET2 compared with MB4-ARC2 and SB-ASSET2 compared with MB6-ARC1).
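The two statistical steps named above can be sketched as follows: a paired t statistic computed from per-participant differences, and the Benjamini-Hochberg step-up adjustment applied to a set of p-values. This is a minimal sketch with hypothetical function names and made-up numbers. Converting the t statistic into a p-value requires the t distribution, which the Python standard library does not provide, so that step is left out here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone (each is capped by the one ranked above it).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for three contrasts (not results from the study).
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

A contrast is then declared significant when its adjusted p-value falls below the chosen FDR level.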

| fMRI analyses
For all three analyses described below, we specifically focused on three rs-fMRI metrics which have previously been used to gain a better understanding of psychiatric disorders: ALFF, seed-to-voxel and ROI-to-ROI measures (Ebisch et al., 2011; Lei et al., 2017; Yang et al., 2021). All fMRI analyses were carried out using MR images after denoising. Eleven

| Intraclass correlation coefficient analysis
The intraclass correlation coefficient (ICC; Landis & Koch, 1977) was used to index the reliability of all three rs-fMRI metrics described below. The ICC score is typically defined as the proportion of within-subject variability in relation to between-subject variability, expressed in terms of MSE_b and MSE_w, the between-subject and within-subject mean squared errors, respectively (Charman et al., 2017). As such, the higher the ICC score, the more similar within-subject measurements are over time. ICC scores typically range between −1 and 1; however, in this article, these values are scaled to a range of −100 to 100 and, in this case, ICC scores are categorized as poor (ICC < 21), fair (20 < ICC < 41), moderate (40 < ICC < 61), substantial (60 < ICC < 81) and almost perfect (ICC > 80; Landis & Koch, 1977).
For all analyses, the reliability scores were calculated across all three visits using the ICC toolbox (Caceres et al., 2009), a MATLAB toolbox designed specifically for voxel-wise ICC analyses of neuroimaging data. ICC analyses were implemented voxel-wise and ICC scores were derived from the median of the full distribution of the ICC values across all voxels in each brain region, a method which has previously been shown to be stable under different conditions of smoothing and cluster size (Caceres et al., 2009). In order to formally compare ICC scores across all four rs-fMRI modalities, F-tests were run for each ROI, testing the null hypothesis that the ICC value of a given rs-fMRI modality was equal to that of another rs-fMRI modality, in line with previous work by McGraw and Wong (1996). FDR (Benjamini & Hochberg, 1995) was applied to adjust for all 11 ROIs and all 6 contrasts (i.e., SB-ASSET2 compared with MB4-ARC1, MB4-ARC2, and MB6-ARC1; MB4-ARC1 compared with MB4-ARC2 and MB6-ARC1; and MB4-ARC2 compared with MB6-ARC1).
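To make the quantities above concrete, the sketch below computes an ICC from between-subject and within-subject mean squares and maps the scaled score onto the categories listed above. It assumes the one-way random-effects form, ICC = (MSE_b − MSE_w) / (MSE_b + (k − 1) MSE_w) for k repeated visits. This specific form is an assumption for illustration; the ICC variant computed in the study may differ. The handling of non-integer scores at the category boundaries is likewise an assumption, and all function names are hypothetical.

```python
from statistics import mean


def icc_one_way(scores):
    """One-way random-effects ICC (assumed form).

    scores: one list per subject, each holding k repeated measurements.
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = mean(v for row in scores for v in row)
    subject_means = [mean(row) for row in scores]
    # Between-subject mean square: spread of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: spread of visits around each subject's own mean.
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(scores, subject_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def icc_category(scaled_icc):
    """Category for an ICC scaled to the -100..100 range (boundary handling assumed)."""
    if scaled_icc <= 20:
        return "poor"
    if scaled_icc <= 40:
        return "fair"
    if scaled_icc <= 60:
        return "moderate"
    if scaled_icc <= 80:
        return "substantial"
    return "almost perfect"


# Hypothetical data: three subjects measured on three visits, perfectly stable.
toy_icc = 100 * icc_one_way([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

With perfectly stable within-subject measurements the within-subject mean square is zero and the score reaches its maximum; when all of the variance is within subjects the score becomes negative.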

| ALFF analysis
The intensity of the brain's spontaneous activity can be examined through the ALFF measure, which has previously been used as a marker for brain diseases (Cheng et al., 2013; Zang et al., 2007). The first step of the ALFF analysis consisted of transforming the timeseries to the frequency domain for each voxel using a fast Fourier transform, and then obtaining the power spectrum. Because ALFF is defined as the averaged square root of the power spectrum within a specific frequency range (Zang et al., 2007), we calculated the ALFF by computing the average square root of the power across 0.008-0.09 Hz for each voxel.
Each ALFF map was then normalized with respect to the global mean ALFF value for standardization purposes as described in Zang et al. (2007). An ALFF map was then generated for each subject and each visit and the median ICC was calculated over all three visits.
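The ALFF steps described above can be sketched for a single voxel as follows. For self-containment the example uses a direct discrete Fourier transform from the standard library instead of a fast Fourier transform (in practice an FFT routine would be used), and the power normalization, the TR of 1 s, the function names and the synthetic signals are all assumptions made for illustration.

```python
import cmath
import math


def alff(timeseries, tr, low=0.008, high=0.09):
    """Average square root of the power spectrum within [low, high] Hz for one voxel."""
    n = len(timeseries)
    mu = sum(timeseries) / n
    x = [v - mu for v in timeseries]  # remove the mean so the 0 Hz term vanishes
    roots = []
    for k in range(1, n // 2 + 1):
        freq = k / (n * tr)  # frequency of the k-th Fourier bin in Hz
        if low <= freq <= high:
            coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            power = abs(coef) ** 2 / n  # one of several common scaling conventions
            roots.append(math.sqrt(power))
    return sum(roots) / len(roots) if roots else 0.0


def normalize_by_global_mean(alff_values):
    """Divide each voxel's ALFF by the mean ALFF over all voxels."""
    global_mean = sum(alff_values) / len(alff_values)
    return [v / global_mean for v in alff_values]


# Synthetic voxels sampled every 1 s (hypothetical TR): one oscillating inside
# the 0.008-0.09 Hz band (0.05 Hz) and one oscillating outside it (0.2 Hz).
toy_tr = 1.0
slow_voxel = [math.sin(2 * math.pi * 0.05 * t * toy_tr) for t in range(200)]
fast_voxel = [math.sin(2 * math.pi * 0.2 * t * toy_tr) for t in range(200)]
```

Only the slow oscillation contributes to ALFF, since the fast one lies outside the band over which the square-rooted power is averaged.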
TABLE 3 Descriptive statistics and paired t-tests for the tSNR measure before denoising (n, number of participants; SD, standard deviation; SE, standard error). Abbreviation: tSNR, temporal signal to noise ratio.

| Seed-to-voxel analyses
For each subject and each visit, we extracted the mean BOLD timeseries from each of the 11 seeds and calculated the Pearson's correlation coefficient between the timeseries of each seed and the timeseries of all other voxels in the brain. Correlation coefficients were then converted to normalized z-scores using Fisher's transform.
Eleven normalized correlations maps were thus obtained for each subject and for each visit.
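The seed-to-voxel computation described above can be sketched as follows: a Pearson correlation between the seed timeseries and each voxel timeseries, followed by Fisher's r-to-z transform. The function names and toy data are hypothetical. Returning 0 for a constant timeseries and clipping correlations of exactly ±1 before the transform are implementation assumptions, added so that the example never divides by zero or takes atanh of ±1.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length timeseries."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant timeseries is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r, eps=1e-12):
    """Fisher's r-to-z transform, clipping r away from exactly +/-1."""
    r = max(min(r, 1.0 - eps), -1.0 + eps)
    return math.atanh(r)


def seed_to_voxel(seed_timeseries, voxel_timeseries):
    """Fisher-transformed correlation between the seed and every voxel."""
    return [fisher_z(pearson(seed_timeseries, ts)) for ts in voxel_timeseries]


# Hypothetical seed (mean timeseries of an ROI) and two voxel timeseries.
toy_seed = [1, 2, 3, 4]
toy_map = seed_to_voxel(toy_seed, [[2, 1, 4, 3], [4, 3, 2, 1]])
```

Repeating this for each of the 11 seeds would give one z-scored correlation map per seed, subject and visit.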
For these analyses, the whole-brain median ICC scores were first calculated across all voxels for each of the 11 correlation maps. We will refer to these analyses as seed-to-voxel in the rest of the article.
Further ICC analyses were subsequently completed and consisted of using a mask to spatially constrain each of the 11 correlation maps to each of the 10 remaining ROIs. The median voxel-wise ICC value was calculated for each ROI. From here on, we will refer to these spatially constrained ICC analyses as ROI-to-ROI.

| ICC analyses
In the results that follow, the observed ICC scores varied across rs-fMRI modalities, rs-fMRI metrics and brain regions (Figures 2-5).

| Test-retest reliability of the seed-to-voxel measure
Table S2.

| Test-retest reliability of the ALFF measure
| Test-retest reliability of the ROI-to-ROI measure

| DISCUSSION
To our knowledge, this study is the first to compare the test-retest reliability of rs-fMRI metrics across single-band and MB fMRI acquisitions, and to explore which combination of acceleration factors yields the best results and which role cortical and subcortical regions play in this context.
In the context of the ALFF measure, the regional intensity of spontaneous brain activity exhibited moderate to almost perfect ICC scores overall, with values ranging between 45.99 and 81.45 across all four rs modalities and all brain regions. This is partially in line with Golestani et al.'s (2017)

have previously been associated with higher ALFF reliability scores compared with longer TRs (2 s; Golestani et al., 2017); however, this is the first study to show a discrepancy in the reliability of spontaneous brain activity between subcortical regions and cortical regions when

With regards to the seed-to-voxel metrics, ICC scores ranged between 9.05 and 53.56, which is in line with Golestani et al.'s (2017) findings showing ICC values ranging from 10 to 50 for seed-based

Our present findings revealed that future studies using MB factor 4 with no in-plane acceleration would help build upon existing psychiatric research aiming to identify biomarkers targeting specific cortical structures. Atypical FC of the AI has previously been associated with autism spectrum disorders (Ebisch et al., 2011), while impaired FC of the medial prefrontal cortex has been proposed as a potential biomarker for alcohol dependence. In contrast, commonly used conventional single band sequences may be preferable for studies focusing on the FC of the nucleus accumbens, which has formerly been examined in the context of anorexia nervosa (Haynos et al., 2019) and major depressive disorder.
For the ROI-to-ROI metric, the reliability scores ranged between

showing higher noise amplification in subcortical regions in the context of shorter TRs (Risk et al., 2021). In particular, it has been suggested that higher sampling rates are associated with noise amplification due to individual slices being recovered from multiple simultaneously excited slices during image reconstruction (Risk et al., 2021).
Additionally, in our study, the MB protocol with the lowest total acceleration (i.e., 4 with no in-plane acceleration) was the MB modality that yielded the best reliability results for cortical areas. These results accord with the negligible decrease in tSNR previously observed by Preibisch et al. (2015) with a total acceleration of 4 (i.e., MB factor 2 with an in-plane acceleration of 2), which is much lower than the loss in tSNR of about 64% they obtained with a total acceleration of 8 (i.e., MB factor 4 with an in-plane acceleration of 2).
In fact, in line with Preibisch et al. (2015), our findings also revealed a negligible difference in tSNR between SB-ASSET2 and MB4-ARC1 (i.e., a total acceleration of 4) and a significant decrease in tSNR between SB-ASSET2 and MB6-ARC1 for both cortical and subcortical ROIs. However, with regards to MB4-ARC2 (i.e., a total acceleration of 8), we found the difference in tSNR between SB-ASSET2 and MB4-ARC2 to be significant for subcortical areas but nonsignificant for cortical areas, which only partially aligns with Preibisch et al.'s (2015) findings. This could be because Preibisch et al. (2015) presented the results of whole-brain analyses while we presented separate findings for cortical and subcortical areas. Because of the differences in tSNR between SB-ASSET2 and some of the MB modalities, future studies might benefit from controlling for the effect of tSNR when exploring test-retest reliability, as it may partially explain some of the differences in ICC scores across rs-fMRI modalities.
Furthermore, physiological and motion artifacts have been shown to affect the quality and the test-retest reliability of FC patterns as well as enhance subject specificity (Birn et al., 2014; Xifra-Porxas et al., 2021). Indeed, head motion and cardiac and breathing variations have been shown to be linked to the variance in the fMRI signal (Power et al., 2017). As such, these could also explain some of the differences in ICC scores across the four rs-fMRI modalities.
Finally, it must be pointed out that multiarray coils exhibit a radially dependent sensitivity which reduces towards their center and which may have influenced some of the differences in ICC reliability between cortical and subcortical regions described in this article.
However, it is worth highlighting that the same coil was used for single band and MB sequences and therefore issues related to reduced coil sensitivity would only play a partial role in the results observed across our measurements.

| Limitations
One limitation of our study is that we focused our analyses on 11 ROIs chosen a priori. Even though the choice of a small number of ROIs allowed for more in-depth analysis of the effect of ROI location on reliability scores in comparison with whole-brain analyses, it also means that we are not able to comment on the reliability scores for other ROIs that might be of interest for other brain diseases.
Furthermore, future studies should consider exploring the effects of in-plane acceleration more systematically and investigating other combinations of acceleration factors that were not examined here due to time constraints, such as single band with no in-plane acceleration and MB factor 6 with in-plane acceleration. Indeed, this would further inform the specific impact of in-plane acceleration on reliability scores.
It is also important to note that our current results apply to data acquired on a 3 T MR750 GE scanner, which does not allow us to generalize our findings to different field strengths or any other manufacturer.
Indeed, the multivariate consistency of rs-fMRI connectivity maps has previously been shown to decrease because of variations in scanning sites and scanner manufacturers (Badhwar et al., 2020). Future research exploring the impact of various acceleration factors across different manufacturers or field strengths would be particularly useful.
Additionally, our analyses involved using anatomically defined ROIs as opposed to specific subregions derived from areas of interest.
Future studies could benefit from exploring the test-retest reliability of segmented subregions using parcellation techniques as these could further inform ROI heterogeneity as well as specific factors at play in the context of test-retest reliability.
Finally, our participants were aged between 52 and 73 and we chose to focus on this age group because this study was carried out as a methods investigation to evaluate the test-retest reliability of a neuroimaging protocol for potential future inclusion in clinical trials of aging-associated diseases. However, this means that our findings cannot be unequivocally generalized to younger age groups. Further research is needed to explore how MB and in-plane acceleration would affect test-retest reliability across the lifespan.

| CONCLUSION
In conclusion, this study provides strong evidence that MB factor 4 with no in-plane acceleration enhances reliability scores for cortical areas, while single band yields the best reliability values for subcortical regions. Based on these findings, we recommend MB4 with no in-plane acceleration for whole brain analyses or analyses focusing on specific cortical regions, and single band for studies aiming to specifically explore subcortical areas.

ACKNOWLEDGMENTS
We would like to thank all the participants who took part in this study, as well as Janssen Research and Development, a division of Janssen

DATA AVAILABILITY STATEMENT
Data will be made available on reasonable request.