Ground-truth resting-state signal provides data-driven estimation and correction for scanner distortion of fMRI time-series dynamics

The fMRI community has made great strides in decoupling neuronal activity from other physiologically induced T2* changes, using sensors that provide a ground-truth with respect to cardiac, respiratory, and head movement dynamics. However, blood oxygenation level-dependent (BOLD) time-series dynamics are confounded by scanner artifacts, in complex ways that can vary not only between scanners but even, for the same scanner, between sessions. The lack of equivalent ground truth has thus far stymied the development of reliable methods for identification and removal of scanner-induced noise. To address this problem, we first designed and built a phantom capable of providing dynamic signals equivalent to that of the resting-state brain. Using the dynamic phantom, we quantified voxel-wise noise by comparing the ground-truth time-series with its measured fMRI data. We derived the following data-quality metrics: Standardized Signal-to-Noise Ratio (ST-SNR) and dynamic fidelity that can be directly compared across scanners. Dynamic phantom data acquired from four scanners showed scanner-instability multiplicative noise contributions of about 8-20\% of the total noise. We further measured strong non-linearity in the fMRI response for all scanners, ranging between 12-22\% of total voxels. To correct scanner distortion of fMRI time-series dynamics at a single-subject level, we trained a convolutional neural network (CNN) on paired sets of measured vs. ground-truth data. Tests on dynamic phantom time-series showed a 3- to 4-fold increase in ST-SNR and about 40-70\% increase in dynamic fidelity after denoising. Critically, we observed that the CNN temporal denoising pushes ST-SNR ST-SNR>1. Denoising human-data with ground-truth-trained CNN, in turn, showed markedly increased detection sensitivity of resting-state networks.


Introduction
Large-scale investments in the identification of fMRI-derived biomarkers for brain-based disorders are a testament to the anticipated promise of fMRI as a neurodiagnostic tool. Yet even once clinical neuroscience establishes reliable biomarkers, a critical rate-limiting factor in the use of fMRI in clinical practice will be fMRI's poor signal/noise profile for single-subject level analyses. The taskfree, "resting-state" paradigms most likely to be utilized in a clinical setting (because of their limited reliance on patient training, engagement, and compliance) only exacerbate this problem. Task-based designs, in principle, clearly delineate between activation in response to a task (signal) and activation during baseline (noise). However, task-free paradigms, by definition, lack the experimental manipulation that would typically be used to distinguish between fluctuations of interest (signal) from fluctuations of nuisance (noise) (DeDora et al. 2016). Without a principled way to distinguish between signal and noise, we lack the feedback necessary to optimize for one while removing the other, thereby limiting our ability to achieve the kind of advances in detection sensitivity required to enhance fMRI's utility in evaluating the single patient.
FMRI's signal is conventionally derived from the blood oxygenation level-dependent (BOLD) contrast. This activity represents regional time-varying changes in the concentration of deoxygenated hemoglobin, following neural-activity induced by exogenous stimuli or spontaneous fluctuations of the resting state. These time-varying changes reflect changes in apparent transverse relaxation time T2 * , an MR parameter sensitive to levels of deoxyhemoglobin, and hence responsible for the observed BOLD contrast. Ideally, the value measured at each voxel at a given time point should only change in response to T2 * changes driven by neural activity (fluctuations of interest, signal). However, in practice, the measurement is dependent on a complex interaction between acquisition parameters (flip-angle: , echo-time: TE, repetition-time: TR), MR parameters (longitudinal relaxation time T1, apparent transverse relaxation time T2 * , proton density within a voxel) and background noise (Lauterbur 2000). Change in any of these parameters introduces variance (fluctuations of the nuisance, noise) in the observed voxel time-series. The difficulty of maintaining fidelity to actual (neuronal) time-series dynamics is made even more acute by the fact that BOLD contrast constitutes only a small fraction (typically, less than 5%) of the total measured signal.
Fluctuations of nuisance in the fMRI time-series originate from two sources: the individual being scanned (physiological noise, due primarily to cardiac, respiratory, and motion effects) as well as the scanner itself. Physiological processes like respiration or cardiac pulsations can cause changes in blood flow (affecting T1 and T2 * ), and thus temporal variations in magnetization that might artifactually appear to be a BOLD effect. Subject head motion causes relative displacement of voxels leading to temporally correlated non-stationary noise and can induce spurious correlations in the resting-state analysis (Power et al. 2012). As significant as these artifacts are, the fact that cardiac, respiratory, and motion variables permit external measurements (e.g., ECG for heart rate) have permitted the field to develop an impressive array of well-validated methods with which to both identify and mitigate their influence. Examples of strategies for targeting physiological and motion confounds include: selecting acquisition parameters designed to permit thermal noise to dominate physiological noise (Wald and Polimeni 2017); techniques to address breathing-related field fluctuations both prospectively (Duerst et al. 2015) and at image reconstruction stage (Bollmann et al. 2017); use of simultaneously recorded measurement of heart-rate, respiration, and motion to retrospectively remove physiological confounds (Caballero-Gaudes and Reynolds 2017); and motioncorrection implemented prospectively (Zaitsev et al. 2017) or retrospectively through registration.
In contrast, the lack of a ground truth for fMRI time-series has not permitted the same strategies for identification and removal of scanner-induced noise, which can vary not only between scanners of the same make and model, but even within the same scanner during different testing sessions. These fluctuations of nuisance originate from imperfections of the instrumentation and the electromagnetic fields used for the measurement and are normally referred to as "scanner instability." This nomenclature is, itself, potentially misleading, since detection-sensitivity of resting-state networks requires simultaneously amplifying fluctuations of interest while suppressing fluctuations of nuisance. Indeed, we have previously shown that typical methods that focus entirely on suppressing fluctuations (optimizing solely for scanner "stability"), such as temporal signal/noise (tSNR), actually deoptimize detection-sensitivity of resting-state networks, because the damped fluctuations include not only suppressed noise but also suppressed signal (DeDora et al. 2016).
Different approaches tackle the problem of minimizing scanner artifacts based upon models of MR-physics. Such methods include reducing the effects of eddy currents by the use of actively shielded gradients and pre-emphasis filters, the use of navigators and calibration echoes, or NMR probes (Kasper et al. 2015) that provide concurrent field monitoring with correction during image reconstruction. Yet modeling-based approaches, while valuable in their own right in terms of contributing to our understanding, can fall short as a practical tool for optimizing resting-state Ground-truth "resting-state" signal provides data-driven estimation and correction for scanner distortion of fMRI timeseries dynamics 4 signal/noise (SNR). The reason for this is that they tend to oversimplify processes that, in an actual testing environment, are fundamentally complex-involving multiple factors, both known and unknown, which interact with one another in nonlinear and nonstationary ways. For example, scanner instabilities may be caused by variation in flip angle over time, imperfections in gradient system, heating, time-varying eddy current effects, or gain changes in transmit and receive chains (Greve et al. 2013;Liu 2016). Time-varying gradients in fast imaging methods, such as interleaved echo-planar imaging (EPI), require high-gradient amplitudes and slew-rates, pushing the scanner to its limits and causes image artifacts due to k-space trajectory deviations. Inhomogeneity in B0 field and perturbations in gradient field cause eddy currents, ghosting, geometric distortions, errors in phase encoding leading to voxel displacement, gain-drifts, and other distortions (Jezzard and Clare 1999).
While the scanner instability is multiplicative, the impact of thermal/background noise on fMRI timeseries is additive and can arise due to a random process like Brownian motion of ions in MR electronics or the human subject, external RF noise sources in the scanner room, or RF spikes dues to intermittent contact between metallic components (Greve et al. 2013;Liu 2016).
In a clinical setting involving decision-making for a single patient, the impact of errors that fluctuate over time and are signal dependent cannot be remedied by increasing sample size, under the assumption that signal amplifies while noise cancels. Longitudinal comparison of scans acquired pre and post treatment cannot be interpreted if both the subject and scanner are changing over time (for example, in using resting-state fMRI in pre-surgical localization, surgical planning in epilepsy, and identifying subjects with Alzheimer's Disease (Lee, Smyser, and Shimony 2013) (Glover et al. 2012) and can lead to systematic confounds in time-series data. In one recent example (Friedman et al. 2008), between-site reliability showed median intra-class correlation of just r=0.22.
In summary, efforts to make the application of resting-state fMRI clinically useful must necessarily address SNR from the perspective of not only physiological, but scanner, artifact-and in ways that make sense given the ubiquity of task-free designs. While efforts to mitigate physiological artifact can and have benefited from external measurements (Caballero-Gaudes and Reynolds 2017), until recently such a strategy has not been available for scanner artifact. Static phantoms optimize purely for general stability (Friedman and Glover 2006), thereby suppressing the fluctuations responsible for resting-state signal. Moreover, the brain (non-static but, by definition, the unknown variable) likewise cannot serve as a calibration device. Finally, physics-based models cannot, in principle, approximate the impact of complex nonstationary distortion on time-series without empirical measurement of that distortion. To address these issues, we approached the problem from the perspective of creating a "brain-like" calibration device, capable of producing a dynamic groundtruth input signal similar to a typical resting-state time-series. Because such a device would provide a ground truth for both fluctuations of interest (signal) as well as fluctuations of nuisance (noise), it could permit optimization for signal-to-noise, rather than simply stability. Because of the consequent ability to obtain, and therefore compare, time-series distortion between true and measured time-series, we could develop a purely data-driven-rather than modeled-distortion correction. Doing so would potentially permit cleaning data of scanner-induced artifact while remaining agnostic with respect to the diversity of known and unknown sources of distortion and their behavior over time.

We designed and engineered a commercial-grade dynamic phantom capable of producing brain-like dynamic signals.
Our previous work (Rǎdulescu and Mujica-Parodi 2014; Mujica-Parodi, Cha, and Gao 2017) and those of others (CIUCIU et al. 2012) shows that healthy resting-state fMRI signals follow 1/f (pink noise) frequency spectra; therefore, our pseudo-brain "input" signal was engineered to achieve equivalent dynamics (custom dynamics can also be easily programmed). To create a dynamic signal, our phantom ( Fig. 1A) uses difference in agarose gel concentration across voxels; the phantom, when rotated in-plane across a voxel during the data acquisition, produces a changing T2 * signal. Rotations occur at the start of each TR of a scan and are limited to around 250 milliseconds. The phantom consists of three distinct parts: a) an agarose gel cylinder assembly, having which is coupled to a pneumatic motor and an optical encoder on the other end. All components remain intact via fastening to an outer frame and go inside the MR scanner with the cylindrical head placed inside the head-coil. The black box shown is the control unit, which interfaces with the optical encoder, pneumatic input from an air compressor and the pneumatic motor. B. Distribution of T2* values across voxels in four quadrants at 3T (Site 1). The agarose gel is prepared using the recipe provided by Friedman et al. (Friedman and Glover 2006). Even though the agarose gel is prepared only at 2.2% and 2.3% concentration, the heterogeneity in T2 * values can be attributed to imperfect agarose network formation, chemical heterogeneity, and polydispersity of gel networks (Djabourov et al. 1989). C. Feedback control system for rotating the inner cylinder. At each trigger from the MR scanner, the PSoC controller compares the current position F(t) with the programmed target position R(t) and opens the solenoid valve proportionally to the magnitude of the error signal E(t) to actuate the pneumatic 7 motor. Here, U(t) is the actuating signal, and M(t) is the manipulated variable. The system uses no braking mechanism, and accurate positioning is achieved through a predetermined linear relationship established between open-state time for solenoid valve and the corresponding rotation achieved at a given pneumatic pressure.
two concentric cylinders; b) a control unit providing control logic for rotation of the inner cylinder; and, c) an air motor assembly with a gearbox and an optical encoder for position tracking. Within the agarose gel cylinder assembly, the inner cylinder rotates during the scan and is coupled to the air motor and the optical encoder, while the outer cylinder contains a reference gel and remains static.
The outer cylinder's reference agarose gel is made at 2.2% concentration by weight, whereas the inner cylinder contains two different gel concentrations at 2.2% and 2.3% by weight, split into four quadrants in a configuration as shown in Fig. 1B. Within each quadrant, a variation in T2 * values exist across voxels because of imperfect agarose network formation, chemical heterogeneity, and polydispersity of gel networks (Djabourov et al. 1989). The control unit for driving the phantom uses a feedback control strategy with control logic implemented in PSoC microcontroller, feedback sensing via an optical encoder, and actuation through solenoid valves. The control unit contains some other custom circuitry for fast valve response time (spike-up voltage circuit), touchscreen userinterface running on raspberry-pi, and UART communication between the raspberry-pi and the PSoC microcontroller. The phantom is MR-compatible (agarose gel cylinder assembly and air motor assembly) and uses polycarbonate (body), delrin (air motor), glass-nylon (ball bearings), and G11 garolite (motor shaft) in construction. The control unit containing electronics and pneumatic compressor for driving the air motor stays outside in the MR control room.

Using ground truth brain-like dynamic signals, we quantified a standardized signal-to-noise ratio (ST-SNR) and dynamic fidelity; these demonstrated wide variance across scanners, even for the "best case scenario" of "high-performance" scanners utilizing acquisition parameters individually optimized by a highly experienced MR physicist.
While the definition of signal-to-noise ratio (SNR) is well defined across the engineering domain, use of the term within the fMRI field has colloquially co-opted its definition in ways that can dilute its meaning and utility. Currently in fMRI, multiple definitions and variants for computing SNR exist (Welvaert and Rosseel 2013), leading to difficulty in interpreting and comparing SNR values. Normally used to optimize for scanner stability with the use of a static phantom, temporal SNR (tSNR) is defined as the ratio of mean signal to standard deviation of a time-series. However, for reasons described above, optimizing for tSNR (i.e., 8 solely for stability) will suppress not only the fluctuations responsible for noise but also the fluctuations responsible for resting-state signal, effectively de-optimizing for detection of restingstate networks (DeDora et al. 2016). Furthermore, mean-signal in tSNR calculation is highly dependent on acquisition parameters, making the interpretation for comparison difficult. For example: tSNR has been reported across two orders of magnitude (between 4.42 and 280 for studies reviewed by Welvaert et al. (Welvaert and Rosseel 2013)). To address both issues, we quantified the accuracy with which fMRI time-series follow the true signal using two data-quality metrics: Standardized Signal-to-noise ratio (ST-SNR) and dynamic fidelity. "ST-SNR" is defined as the ratio of signal power and the background noise power and is calculated accordingly, where power is the sum of the absolute squares of time-domain samples divided by the time-series length. We define "dynamic fidelity" as the accuracy with which an MR scanner tracks changes in the input signal, and calculate it as the Pearson correlation coefficient between the ground-truth signal and fMRI output. were acquired and were averaged voxel-wise to obtain a close approximation to true intensity values. The mean volume was then rotated 600 times synthetically at angles obtained from the optical encoder during the actual run.
This yielded ground-truth volumes, which then were compared to the volumes acquired during the scan.
The programmed rotation of the dynamic phantom, along with the optical encoder feedback, provides a mechanism for rotation control and sensing. The phantom tracks the programmed rotation at an accuracy of 0.2°. With the rotation generating voxel-wise time-series, the feedback sensing provides data on the actual rotation that occurs. This feedback data enables calculation of the groundtruth time-series and the noise estimate for each voxel, as shown in Fig. 2, for quantifying ST-SNR and dynamic fidelity. In Table 1, we show both ST-SNR and dynamic fidelity for four scanners, showing the potential for wide variance across scanners, even for a "best case scenario" of "highperformance" scanners utilizing acquisition parameters individually optimized by a highly experienced MR physicist. Importantly for multi-site or longitudinal applications, these two metrics (ST-SNR and dynamic fidelity) provide a direct assessment and comparison of data-quality over different scanners, as well as the same scanner over time. As ST-SNR and dynamic fidelity have standardized and interpretable range of values, the direct comparison of these metrics longitudinally or across scanners becomes possible. For example, for the same make and model of a scanner (the two Siemens PRISMA scanners, described in Table 1) having equivalent voxel-size, the ST-SNR observed is markedly different. Inspecting further, while one may attribute this difference to the different head-coil arrays used between the two scanners (Table 3), the comparison of ST-SNR with 3T SKYRA (Table 1) at the same site with equivalent voxel-size and head-coil suggests otherwise: that the Site 2 PRISMA scanner is an outlier.

Using the dynamic phantom generated ground-truth, we quantified the ratio of scanner instability to background noise in fMRI time-series, thereby identifying multiplicative versus thermal noise components.
We analyzed time-series of the noise (residual time-series as calculated above, refer to Fig. 2), using power spectral density plots, to identify spectral-features arising from scanner artifacts. Fig. 3 illustrates the mean power spectral density across all voxels for the groundtruth and the estimated noise time-series. Each voxel time-series' power spectral density was normalized by its maximum power before calculating mean at each frequency bin across all voxels.
The power spectral density of the noise closely matches that of the ground-truth signal indicating the presence of multiplicative noise (scanner instability) component alongside thermal/background noise. Multiplicative noise modulates the MR signal, is known to exhibit some temporal and spatial correlation (Greve et al. 2013), and cannot be removed using smoothing or frequency-based temporal filtering. The presence of multiplicative noise diminishes the advantages offered by hardware improvements (increase in signal to thermal noise ratio with higher field strength and more sensitive head-coil arrays) and can exacerbate the false-positives problem (Eklund, Nichols, and Knutsson 2016), alongside spurious correlations and poor reproducibility of functional connectivity. Bandlimited programmed rotation of our phantom produces a band-limited ground-truth signal, and thus the associated multiplicative noise can be directly observed in this narrow band-see Fig. 3 (in the 0-0.1 Hz range). At frequencies higher than 0.1 Hz where the ground-truth signal is absent, scanner noise shows a flat spectrum or white-noise behavior (thermal noise). This multiplicative noise behavior is further corroborated by a linear scaling of noise power (logarithmic scale) with an increase in signal (ground-truth) standard deviation. We observed a moderate correlation between noise power and signal standard deviation for all scanners ( Table 1).
Using the ground-truth dynamic signal and the measured fMRI output, we quantified the ratio of scanner-instability to thermal/background noise using a probabilistic description of the two noisesources. Scanner instability is signal-dependent and thus proportional to the signal intensity, while thermal noise is independent of the MR signal. Background noise and scanner-instability are temporally independent, and therefore, their variances add. With as the standard deviation of the thermal noise and  as the proportionality constant for the multiplicative noise, we can write: 2 = 2 + 2 + 2 2 = 2 + 2 , Eq. 1 where 2 and 2 , are the variances of the observed fMRI output and the ground-truth, respectively. The model for the probability of observing the measured signal, given ground-truth YGT and noise parameters can then be written as: ( | , , ) = ( = , 2 = 2 + 2 2 ), Eq. 2 We estimate the parameters , and by Monte-Carlo simulation using YfMRI and YGT. Specifically, we model Eq.1 to sample from the posterior distribution that is proportional to Eq.2 while assuming constant priors for the parameter distributions. The relative contribution of multiplicative noise to that of the total noise is listed in Table 1 for each scanner. The results indicate that even in modern highperformance scanners with acquisition parameters optimized by a trained MR physicist, the scannerinduced variance due to instability is around 8-20% of the contribution of the total scanner noise.

Using the dynamic phantom generated ground-truth, we quantified scanner-induced non-
linearity in fMRI response. Finally, we observe scanner-induced temporal non-linear distortion of fMRI response using a tree-partition non-linearity estimator (Ljung 2019) (a piece-wise linear 12 function defined by the binary tree over partitions of the regressor space) with ground-truth as the regressor. Non-linearity is detected in the observed fMRI data if a nonlinear function explains significant variance in the observed data beyond the variance explained by the linear function of the ground-truth. Non-linearity estimation was performed using 'isnlarx' function provided in System Identification Toolbox, Matlab (Ljung 2019), which categorizes non-linearity as strong, weak or not significant based on reliability of the nonlinearity detection test. We observed that the 7T scanner showed the highest non-linearity in response, with 22% of voxels exhibiting strong non-linearity (Table 1).

We designed a data-driven temporal filter and observed robust increases in ST-SNR and dynamic fidelity of fMRI time-series after denoising.
We provide a deep-learning framework using a Convolutional Neural Network (CNN) for learning an equivalent of a temporal filter. Given that we now have known dynamic inputs, we developed an end-to-end trainable CNN architecture that uses discriminative denoising to remove noise in the hidden layers. We provided pairs of measured fMRI time-series and known signal to learn a mapping from noisy to clean time-series implicitly. We used batch regularization with small batches of batch-size=8 within CNN to avoid internal covariate shift, accelerate the training process, and reduce dependence on network parameter initialization (Sergey  For evaluating the performance and generalizability of the CNN, we compare the results of CNN denoised fMRI time-series, as shown in Fig. 5, with the original data-quality and temporal de-noising using a standard third-order Butterworth bandpass filter (0.008-0.

Removing scanner-induced variance from human fMRI data increased the detection sensitivity of brain networks, visible even at the single-subject level.
For assessing the effects of CNN denoising on human fMRI data, the detection sensitivity of brain networks engaged in movie watching was calculated as a measure of the ability to preserve fluctuations of interest (signal) while removing scanner confounds (noise) from the time-series, and was quantified using the ratio of mean absolute Z-score inside and outside well-defined resting-state network masks in subject-specific ICA maps. A ratio > 1 indicates that Z-score inside the mask is higher compared to voxels outside. Higher this ratio, the easier it is to detect the brain/resting-state networks. We observed an increase in detection sensitivity at the single-subject level for all three scanners after accounting for scanner-related noise.
As naturalistic conditions are complex and dynamic (Vanderwal et al. 2015), evoking multiple streams of neural responses, we report results for ten well-defined resting-state networks separately, in addition to the detection sensitivity of visual networks (

Discussion
3.1. Why should one use the dynamic phantom over a static phantom? Static phantoms are commonly used for quality assurance (Friedman and Glover 2006) to assess and minimize scanner fluctuations due to background noise and instability. However, the resting-state fMRI or naturalistic paradigms depend not only upon suppressing fluctuations due to noise but equally upon sensitivity towards signal change, which can only be assessed by a phantom that produces a known and changing (dynamic) signal. The importance of a dynamic phantom is that it is the only method, in our knowledge, that can quantifiably assess the most basic assumption underlying all task-free fMRI: fidelity between input (brain) dynamics and output (fMRI time-series) dynamics. We introduced a novel method for generating ground-truth using the dynamic phantom and estimating voxel-wise noise time-series. The dynamic phantom additionally provides an estimate of standardized signal-tonoise ratio (ST-SNR) and non-linearity, quantifying actual measurement error in fMRI response as compared to static-phantom derived temporal stability of the mean signal (tSNR). While static phantoms estimate only flat-spectrum noise (Expert et al. 2011), the dynamic phantom can detect both signal-dependent and background noise. Using Bayesian parameter estimation, we quantified the ratio of instability/multiplicative noise to the background noise. Although fMRI time-series have several sources of confounds and variance contributed by scanner-instability is relatively small, the reliability of the longitudinal data may be seriously affected without proper characterization. Using data metrics introduced, quality assurance protocols can be established for scanner health monitoring.
Any deviations in ST-SNR or fidelity, compared against longitudinally tracked measurements, would indicate scanner problems.

Why did we use a deep-learning approach for temporal denoising? Scanner-instability and
background noise in resting-state data lead to decreased detection-sensitivity of resting-state networks, which have been typically addressed by increasing the amount of data collected or increasing the scan-time per subject. These methods are not only expensive but lead to other problems such as subject-fatigue and increased head-motion, which are especially acute in clinical populations.
In the current report, we propose a fundamentally different approach for removing scanner confounds from fMRI time-series, which may circumvent the need for collecting more data, and which is ideally suited for single-subject level analyses required for clinical and computational modeling applications, as well as large-scale multi-site and longitudinal studies. Our method exploits the availability of paired noisy fMRI and ground-truth data to perform discriminative denoising using CNN. Developing a denoising algorithm for correcting time-series distortions can be framed as a system-identification problem, wherein the goal is to infer a functional relationship between the system input (noisy fMRI data) and the system output (clean fMRI data). Convolution of the noisy signal with the identified filter produces the denoised signal. While dealing with linear systems, this system-identification problem reduces to the characterization of impulse response using delta function or observing the system's frequency response using sinusoids. However, for non-linear systems, there exists no canonical representation of the system that will capture "all possibilities" of mapping inputs to transformed outputs. The convolution integral for linear systems can be extended to convolution-like Volterra series for non-linear systems, which can further be extended to Weiner series where each component of the series is orthogonal to all lower-order components. Lee and Schetzen (Lee and Schetzen 1965) provided a simple method based on cross-correlation for estimating Weiner kernels.
However, the cross-correlation method is fundamentally limited by the fact that inputs must be Gaussian. Further, the kernel estimation suffers in cases of strongly nonlinear systems. To overcome these problems, we used deep learning for performing temporal filtering. Intuitively, the trained CNN can be thought of as a temporal filter (like a bandpass-filter), but with filter parameters estimated in an automated data-driven manner optimized for a specific scanner performing a session. (Greve et al. 2011), causing heteroscedasticity when using ordinary least-squares estimator and skewing the p-values to be smaller than they should be. Scanner-differences can be reduced by dataprocessing techniques before analysis (resting-state data), or scanner-effects can be adjusted statistically (task-based data). Feis et al. (Feis et al. 2015) recently showed the successful application of FMRIB's ICA-based X-noiseifier (FIX) (Salimi-Khorshidi et al. 2014) to remove scanner-specific structured noise components that diminished differences in detected resting-state networks across sites. However, the complexity in re-training the FIX classifier for a dataset from every new scanner is non-trivial and requires manual component labeling using data from multiple subjects by an expert.

Why is the dynamic phantom more useful than ICA-based techniques in mitigating scannereffects for multi-site studies? Different sites generally have very different-levels of scanner-noise
While our measurement of ST-SNR provides a way for statistical adjustment of scanner effects in task-based paradigm using ST-SNR as a covariate in ANCOVA designs, the CNN denoising can remove scanner-induced effects before analysis for resting-state fMRI or naturalistic paradigms in an automated way.

Future directions:
Our work has direct implications in moving towards single-subject imaging, which is necessary for clinical purposes as well as for fMRI driven computational neuroscience.
Ensuring the stability of time-series adds statistical power to draw useful conclusions from a limited amount of data. Although first-level analysis is dominated by physiological noise (Greve et al. 2011;Triantafyllou et al. 2005;Wald and Polimeni 2017), we observed a ~6-15% increase in detection sensitivity of resting-state networks after removal of scanner-related noise. Future studies with a larger sample-size will focus on the effects of removing scanner confounds on reliability estimates of functional connectivity analysis and computational neuroscience circuits. Low reliability causes low reproducibility of functional connectomics (Zuo, Biswal, and Poldrack 2019). Reproducibility across sessions while scanning the same patient affects the clinical decision and is an active concern for the use of resting-state fMRI as a clinical tool (O'Connor and Zeffiro 2019). Additionally, future direction include investigating effects of dynamic phantom estimated ST-SNR on activation effect size in taskbased studies, combining multi-site task-based studies using ST-SNR as a covariate, and using CNN denoising to harmonize data across sites as required for multi-site studies.

Study Design:
We performed imaging at two sites: the SCAN Center at Stony Brook University in Stony Brook, New York (Site 1) and the Athinoula A. Martinos Center for Biomedical Imaging at the Massachusetts General Hospital in Charlestown, Massachusetts (Site 2). We designed and engineered a dynamic phantom for producing ground-truth time series, based on differences in T2 * values of agarose gel across voxels of interest. Controlled rotation of the dynamic phantom produces variation in the T2 * values within a voxel, tuned to generate amplitude changes/signal as observed with BOLD contrast in humans (see Results for a detailed description of the design). At Site 1 (3T Siemens PRISMA scanner), we scanned the phantom during a single session with five acquisition runs, with each successive run separated by a 20-minute interval. Each run had a unique programmed rotation profile as input to the phantom. No human data acquisition occurred at Site 1. At Site 2, we acquired data from three human subjects (two males and one female aged 55, 56, and 47 years, respectively) and the phantom, using three scanners: 3T Siemens SKYRA, 3T Siemens PRISMA, and 7T Siemens MAGNETOM. We acquired data in three imaging sessions: one session per scanner.
During each imaging session, we acquired three phantom scans, each with a unique rotation profile, and six human scans, with two scans per subject. The first phantom scan took place at the beginning of each session. Next, each of the three human subjects were scanned while they viewed a naturalistic movie (no audio, see Supplementary Materials for video) inside the scanner. Afterward, we acquired the second phantom scan, followed by a repeated acquisition for all three human subjects under identical conditions. Finally, we acquired the third phantom scan. The Institutional Review Board at Massachusetts General Hospital (Partner's Healthcare) provided approval for the human study, and all participants provided written informed consent prior to participating in the study.

Data acquisition parameters:
To ensure that results conservatively reflect actual data-quality metrics within the neuroimaging field, we asked each scanner's MR physicist to independently provide the optimal acquisition parameters for modern fMRI studies conducted on that specific scanner. The details of the protocol parameters are as follows.
(1) Site 1 (phantom imaging only): The phantom was scanned on a 3T Siemens PRISMA scanner with a 64-channel head coil. For relaxation rate measurements, multi-echo gradient-echo images were acquired at twelve echo times equally spaced between 5 ms and 60 ms with TR = 70 ms, FOV = 192 mm × 192 mm, flip angle = 20°, slice thickness= 1.5 mm, and readout bandwidth = 320 Hz/px. For the time-series data, standard singleshot gradient-echo EPI data were acquired with the parameters as listed in

Preprocessing of Phantom Data and Training the Convolutional Neural Network (CNN):
Acquisition of phantom EPI data involved acquiring the first 200 volumes without any programmed rotation, followed by 600 rotating volumes with the rotation synchronized to the scanner's TR (repetition time) trigger signal. The phantom rotation was limited to around 250 ms from the start of each TR and was quantified using the optical encoder's feedback (Figure 1a,c). Based on the optical encoder's feedback and scanner's slice timing information, all the slices acquired during in-plane rotation within a TR were discarded from the respective EPI dataset for any further analysis. The remaining slices were manually inspected, and bad slices due to susceptibility artifacts (towards the top and bottom face of the cylinder) were thrown out. The final set of slices then underwent an automated procedure based on contour finding and the Hough transform for generating masks used to select the voxels of interest located in the inner cylinder of each slice. The first 200 volumes of all the remaining slices were averaged voxel-wise to create a mean functional dataset to obtain close approximations to the true voxel intensity. Synthetic rotation of the mean functional dataset, to create ground-truth time-series, involved up-sampling the mean images by a factor of 5 (3rd order spline interpolation), followed by rotation at angles provided by optical encoder's feedback and downsampling by local averaging to original dimensions of the mean functional slice. Subtracting the noisy fMRI output from the corresponding ground-truth time-series yields voxel-wise noise time-series.
Power spectrum density (Fig. 3) was calculated using the periodogram method implemented in SciPy library (Virtanen et al. 2020). Monte-Carlo simulations for parameter estimation to quantify multiplicative-to-thermal noise ratio were carried out in PyMC3 (Salvatier, Wiecki, and Fonnesbeck 2016).
For all voxels, the measured and the ground-truth time-series pairs were used for end-to-end training of the CNN (see Fig. 4 for architecture). Given that multiple phantom scans were acquired for each scanner, CNN training involved combining data acquired with different programmed motion sequences (Supplementary Materials: Suppl. Fig. 2) on a scanner for data-augmentation. Within each training dataset, 33% of data was used as validation split and model weights with lowest validation loss was saved as the trained CNN. For Site 1, three CNNs were trained using data from scans 1 and 3, scans 2 and 4, and scans 3 and 5. For Site 2, three phantom scans were acquired at each scanner, and CNNs were trained using data from scans 1 and 2, scans 2 and 3, and scans 1 and 3. For testing denoising performance, the test data were denoised using a trained CNN which did not use the test data during training (out-of-sample denoising), for example: at Site 2, for denoising scan 2, we used a CNN trained on scans 1 and 3.  corrected and normalized to MNI) underwent denoising (voxels in gray-matter only) using trained scanner-specific CNN, followed by physiological confound removal as in the standard method (CompCor, motion, and bandpass filtering). Finally, datasets obtained from both temporal denoising methods were smoothed with a 4-mm full width at half-maximum Gaussian kernel.

Calculating detection sensitivity of resting-state networks:
To identify functionally connected networks in a data-driven manner, we performed group spatial ICA on the preprocessed data using the GIFT v3.0b fMRI Toolbox (Rachakonda et al. 2010), separately for each scanner and temporal processing scheme (standard method and CNN denoising). For each dataset, 20 independent components were obtained, after ten runs of ICASSO (Himberg, Hyvärinen, and Esposito 2004) procedure for ensuring component stability. Subject-specific spatial maps and associated time courses were estimated using back-reconstruction (GICA) (Erhardt et al. 2011). We used the Infomax algorithm for performing ICA. ICA spatial maps were converted to Z values. Afterward, the subjectspecific ICA maps were spatially matched to ten well-defined resting-state network templates obtained from Smith et al. (Smith et al. 2009) for obtaining each subject's corresponding network ICA maps. The resting-state network templates contained visual-medial, visual-occipital, visuallateral, default mode, cerebellum, sensorimotor, auditory, executive control, right frontal-parietal, and left frontal-parietal networks. Detection sensitivity was then calculated as the ratio of mean absolute Z-score inside and outside of each of the ten resting-state network masks in subject-specific ICA spatial maps. The masks for the resting-state networks were obtained through thresholding the Smith et al.'s original ICA maps at |Z| > 2.3. The mean of detection sensitivity values, across all ten networks or across three visual networks, for each subject, yielded a total of six values (three subjects with two runs) for every scanner. These six values were compared between the standard method and the method with CNN temporal denoising for each scanner using permutation testing (100,000 repetitions).  Variations in programmed rotation sequences were used for data augmentation while training the CNN, as well as testing generalizability of the trained CNN for a given scanner. Sequence 1-3 were used at both sites, while sequence 4,5 were used only at site 1.