neuRosim : An R Package for Generating fMRI Data

Studies that validate statistical methods for functional magnetic resonance imaging (fMRI) data often use simulated data to ensure that the ground truth is known. However, simulated fMRI data are almost always generated using in-house procedures because a well-accepted simulation method is lacking. In this article we describe the R package neuRosim, which is a collection of data generation functions for neuroimaging data. We will demonstrate the possibilities to generate data from simple time series to complete 4D images and the possibilities for the user to create her own data generation method.


Introduction
Despite optimization of experimental designs and significant improvements in scanner technology, functional magnetic resonance imaging (fMRI) data still contain a considerable amount of noise.Statistics are needed to infer information from the data.However, a major problem is that the ground truth of fMRI data (i.e., where and when the activation is located) is unknown and can only be measured with very invasive techniques (i.e., intracranial electroencephalography) that are almost always unethical to perform with humans (David et al. 2008).Therefore, when researchers try to establish the validity of a new statistical method, or when they want to assess the sensitivity and the specificity of an existing method, they need to know the ground truth.As a solution, simulation studies have gained great interest as a validation tool because in these studies, the data themselves are generated under a known model.
Although the necessity of knowing the ground truth is acknowledged, a standard simulation procedure for fMRI data is lacking.In the literature, two major categories of computational simulations can be distinguished, namely (1) generating time series based on an experimental design and (2) simulating the magnetic signal by solving the Bloch equations (Bloch 1946).Unfortunately, the first category in itself has no common method.Most researchers model the activation in the time series as the convolution of a haemodynamic response function and a stimulus vector.Additionally, some noise is added ranging from pure random Gaussian noise (Lei et al. 2010;Liao et al. 2008;Lin et al. 2010), over temporally correlated noise (Grinband et al. 2008;Locascio et al. 1997;Bullmore et al. 1996;Purdon and Weisskoff 1998) to real noise derived from empirically acquired resting state scans (Bianciardi et al. 2004;Lange 1999;Weibull et al. 2008;Lee et al. 2008;Lange et al. 1999;Hansen et al. 2001;Skudlarski et al. 1999).Furthermore, all simulations are done using in-house software routines.As a consequence, convergence of the simulation methods is impossible as long as fMRI simulators are not available.In contrast, the second method (Drobnjak et al. 2006), using the Bloch equations, is embedded in a simulator as part of the software package FSL (Smith et al. 2004).However, the simulator is rarely used for validation studies.Probably, this is due to the fact that solving the Bloch equations is computationally very intensive and it takes, for example, about a month to generate a 4D dataset of 100 scans including all artefacts using a PC with a 3.4 GHz processor.By developing our package neuRosim, we want to respond to the current lack of fMRI simulators.Our package is by no means intended to provide the fMRI data generation method.The aim of the package is to provide a tool for simulating fMRI data that can initiate the search for more established and validated simulation methods for fMRI data such that the results of simulation studies can be generalized.
The package neuRosim for R (R Development Core Team 2011) is created with two types of users in mind.The first type is the practical researcher who uses the fMRI scanner as a tool to acquire data that hopefully support her theory.This researcher normally would not think of generating fMRI data.However, by generating some data before the actual scanning process is started, this researcher can check the effectiveness of her design without almost any cost, both in time and money.In this way, the most effective design for a particular research question can be tested and adjusted.1Secondly, the more theoretical researcher (e.g., a statistician) can validate both existing and new methods based on the generated data.Because the data generation in neuRosim is fairly fast, the generation process can easily be embedded in large simulation studies.fMRI data are in fact the result of a Fourier transformation of the k-space and are, as a result, complex-valued data (Rowe and Logan 2004).However, in most fMRI studies the data analysis is done for the magnitude data and not for the phase data.In the current version of neuRosim, only the generation of fMRI magnitude data is considered.Therefore, all assumptions that are made to model the data apply only to the characteristics of magnitude data.The generation of magnitude fMRI data is seen as an additive source problem (Bellec et al. 2009) in which two main sources are distinguished, namely (1) the activation caused by an experimental design or resting state activation, and (2) the noise.neuRosim contains several functions to model both sources.These functions are regarded as low-level functions, meaning that they generate only a specific part of the data and are mostly used as building blocks to construct higher-level functions.For beginning users, it will be more convenient to start with the high-level functions that are described in Section 3.However, advanced users can use the high-level functions as a basis for their completely customized simulations.In Section 2, we will give an overview of the different models in the low-level functions.
Further, it should be noted that the data generated by neuRosim are considered to be preprocessed data.This implies that several artefacts (e.g., head motion, magnetic field inhomogeneity) that are normally removed during the pre-processing stage of the data are not explicitly modeled.However, it is possible to incorporate some residual effects of these artefacts under the assumption that the artefacts are not completely removed by the pre-processing analysis.For example, neuRosim data can contain task-related noise that can account for residual head movements.

Features and examples of low-level functions 2.1. Experimental activation and design
To generate BOLD (blood oxygen level dependent) activation, neuRosim uses a stimulus function that is part of the experimental design.A BOLD response is only generated if the function indicates the presence of a stimulus.Block designs, as well as event-related designs (or a combination of both) can be defined based on the onsets and the durations of the task as defined by the user.The function stimfunction uses these arguments to generate a 0-1 valued time vector where 1 indicates that the stimulus is present.Note that for a single event, the duration of the stimulus should be defined as 0. For example, to generate a stimulus function for a 20-second ON/OFF block design of 200s with a microtime resolution of 0.1s: R> totaltime <-200 R> onsets <-seq (1,200,40) R> dur <-20 R> s <-stimfunction(totaltime = totaltime, onsets = onsets, + durations = dur, accuracy = 0.1) The resulting stimulus function is shown as a dashed line in Figure 1.To simulate the BOLD signal caused by the task, the stimulus function is convoluted with a haemodynamic response function (HRF).The role of the microtime resolution is to ensure a high-precision convolution with the specified HRF.In the current version of neuRosim, three different response functions are implemented.
1.The stimulus function is convoluted with a gamma-variate HRF as implemented in the function gammaHRF with a user-defined full width at half maximum (FWHM) value (Buxton et al. 2004).The function is defined as with k = 3.To provide the desired FWHM, the time constant τ h is given by τ h = 0.242 • FWHM (Buxton et al. 2004, p. S227).
R> gamma <-specifydesign(totaltime = 200, onsets = list(onsets), + durations = list(dur), effectsize = 1, TR = 2, conv = "gamma") To modulate the strength of the activation in each condition, the argument effectsize in the function specifydesign should be specified.The values, provided in this argument, are used to increase (values larger than 1) or decrease (values smaller than 1) the amplitude of the generated BOLD response.
2. The stimulus function is convoluted with a double-gamma HRF via canonicalHRF, which models an initial dip and an undershoot of the BOLD signal (Friston et al. 1998), where The spatial location of the activation is specified as regions using the function specifyregion.
A region can be modeled in three ways, namely (1) as a cube, (2) as a sphere or (3) manually.
The first two forms can be modeled by defining two arguments, namely the coordinates of the center of the region and the distance from the center to the edge of the region in voxels.For example, to define an activated sphere (the result is displayed in Figure 2) R> a <-specifyregion(dim = c(64, 64), coord = c(20, 20), radius = 10, + form = "sphere", fading = 0.5) To define the form manually, the coordinates of all voxels that are part of the region should by specified as a matrix with columns corresponding to their (x, y)-coordinates.
Additionally, it is possible to differentiate the strength of the measured activation between voxels in the activated region.This can be the case if, for example, the BOLD response to a certain stimulus is of different size in some parts of the activated region.A first method to include this variability is to divide the activated region into separate subregions and specify separate parameters of the HRF for each subregion in specifydesign.The subregions can than be merged together using the high-level function simprepSpatial (see Section 3).Secondly, if the region is defined as a cube or a sphere, the fading option can be used to require that the region has the largest effect in the center and smaller activation towards the edges (Logan and Rowe 2004).This fading of the BOLD response is modeled as an exponential decay depending on the distance of the activated voxel to the center of the region.The decay rate λ can vary between 0 and 1 with 0 meaning no decay and 1 indicating the strongest decay.In 3D this corresponds to where (i , j , k ) are the (x, y, z)-coordinates of the voxel in the center of the region, λ is the decay rate and the activation is scaled to be 1 in the center of the region.An example of an activated sphere with fading (λ = 0.5) is presented in Figure 2.

Noise
The noise present in fMRI data is caused by different sources, such as for example the scanner and the subject.neuRosim offers a bundle of functions to model noise from one of these sources.The noise functions can be divided into four categories, namely (1) white noise, (2) colored noise, (3) temporal noise and (4) spatial noise.The white noise (modeled by the function systemnoise) represents the system noise that is part of the fMRI data.Two types of system noise are considered: (1) system noise that is Rician distributed and ( 2) system noise that is Gaussian distributed.The former is applicable for fMRI magnitude data with low signal-to-noise ratio (SNR), while the latter can be used for higher SNR (about more than 10) (Haacke et al. 1999;Gudbjartsson and Patz 1995) .The standard deviation of the noise is user-defined or can be based on the desired SNR defined by the user.In all noise functions, average SNR is defined as where S represents the average magnitude of the signal, and σ N stands for the standard deviation of the noise (Krüger and Glover 2001).For example (the resulting time series is plotted in Figure 3), Colored noise depends on either the signal, the timing or the location.neuRosim contains three types of signal-dependent noise, (1) low-frequency drift, (2) physiological noise and (3) task-related noise.
Low-frequency drift, generated by lowfreqdrift, is a consequence of system noise (Smith et al. 1999) that can be attributed to slow fluctuations in the scanner hardware (Lazar 2008).The drift is modeled as a basis of discrete cosine functions.The number of functions is determined by the frequency of the drift with a default value of 128 seconds.For example (the resulting time series is plotted in Figure 3), R> n.low <-lowfreqdrift(dim = 1, nscan = 100, TR = 2, freq = 120) Physiological noise (physnoise) is defined as possible cardiac and respiratory artefacts and as such accounts for the variability in the signal that is caused by the heart beat and respiratory rate.These artefacts are often categorized as low-frequency drift.However, we choose to model the physiological noise separately because it is shown that the frequency of these artefacts is often higher than the scanner fluctuations (Smith et al. 1999).The physiological noise is modeled as sine and cosine functions with user-defined frequencies.Default values are 1.17 Hz and 0.2 Hz for heart beat and respiratory rate respectively (Biswal et al. 1996).For example (the resulting time series is plotted in Figure 3), R> n.phys <-physnoise(dim = 1, nscan = 100, sigma = 15, TR = 2) Task-related noise accounts for spontaneous neural activity due to the experimental task (Hyde et al. 2001) and is operationalized by adding random noise only where and when activation is present.The distribution of this noise can be either Gaussian or Rician.Additionally, the task-related noise can be interpreted as residual noise from head motion that is not removed in the pre-processing stage.For example (the resulting time series is plotted in Figure 3), R> n.task <-tasknoise(act.image= s, sigma = 15) Temporal noise accounts for the fact that fMRI data are repeated measurements (Purdon and Weisskoff 1998).The function temporalnoise generates noise based on an auto-regressive model of order p (AR(p)) defined as with χ t ∼ N (0, σ 2 ).For example, the generate temporally correlated noise of order 2 (the resulting time series is plotted in Figure 3), R> n.temp <-temporalnoise(dim = 1, sigma = 15, nscan = 100, + rho = c(0.4,-0.2)) Finally, spatial noise models the spatial dependencies in fMRI data (Logan and Rowe 2004).
Of course, voxels are arbitrary units and neighboring voxels are more likely to be correlated than voxels that are further apart.The function spatialnoise incorporates three types of spatial noise models, namely (1) an autoregressive correlation structure, (2) a Gaussian random field and (3) a Gamma random field.The first structure correlates the voxels with each other based on random Gaussian or Rician noise.The strength of the correlation depends on the value of the auto-correlation coefficient (default value is rho = 0.75) and the distance between the voxels.If spatial correlation based on random fields is chosen, the full-widthhalf-maximum (FWHM) of the kernel, which is used to generate the random field, should be provided (default is FHWM = 4).Additionally, if the method is gammaRF, the shape (default is gamma.shape= 6) and rate (default is gamma.rate= 1) parameter of the Gamma distribution should be defined as additional arguments.For example, to generate spatially correlated noise for a 20×20 slice R> d <-c(20, 20) R> n.corr <-spatialnoise(dim = d, sigma = 15, nscan = 100, + method = "corr", rho = 0.7) R> n.gaus <-spatialnoise(dim = d, sigma = 15, nscan = 100, + method = "gaussRF", FWHM = 4) R> n.gamma <-spatialnoise(dim = d, sigma = 15, nscan = 100, + method = "gammaRF", FWHM = 4, gamma.shape= 3, gamma.rate= 2) Figure 4 displays the correlation matrices for the generated slices.To generate these images, all voxels were ordered and the correlation matrix of the generated time series was calculated.Therefore, the diagonal represents the perfect correlation of each voxel with itself.We see that voxels that are close to this diagonal, representing neighboring voxels, are also highly correlated.The block diagonal structure, which can be observed clearly with the Gaussian random field structure (Figure 4b), is the result of reducing the two-dimensional structure of the slice.
Additionally, all noise functions include the functionality that a template or mask can be provided.As such, the noise is only generated for those voxels that are included in the mask.This would allow the user to make for example a distinction between the noise source in the gray matter, the white matter or in the cerebrospinal fluid.

Examples of high-level functions
The aim of the high-level functions is to allow the user to generate fMRI data efficiently and transparently.The functions are developed such that they can easily be implemented in a simulation environment.Of course, these functions have limits in their use.Therefore, we refer users who desire more functionality to the low-level functions.

Generating fMRI time series
The simTSfmri function generates fMRI time series for a specified design matrix and with an additive noise structure.The field of the design matrix should be prepared with the simprepTemporal function, to ensure that all arguments are in the correct format.As an example, we will generate a time series for a block design with two conditions.The experiment lasts 100 scans with TR = 2 and the first condition has activation blocks of 20s, while the second condition had activation blocks of 7 seconds  R> design <-simprepTemporal(totaltime = total, onsets = os, durations = dur, + effectsize = effect, TR = TR, hrf = "double-gamma") Figure 5 presents the resulting activation from this design in dashed lines.The following arguments should be specified to ensure a complete definition of the design matrix: the total duration of the experiment in seconds (total), the onsets of each condition represented as a list (onsets), the duration of the stimulus in each condition represented as a list (durations), the repetition time in seconds (TR) and the form of the HRF (either gamma, double-gamma or balloon).The noise can be either of the structures described in Section 2, but it is also possible to add a mixture of noise.The different noise components are then weighted with a vector of weights specified by the user.The weights can vary between 0 and 1, however, the weights should sum to one.For example, we will add a mixture of noise to our above specified design.The mixture contains Rician system noise, temporal noise of order 1, low-frequency drift, physiological noise and task-related noise and has a baseline value of 10 R> w <-c(0.3,0.3, 0.01, 0.09, 0.3) R> ts <-simTSfmri(design = design, base = 10, SNR = 2, noise = "mixture", + type = "rician", weights = w, verbose = FALSE) The resulting time series are plotted in Figure 5.

Generating fMRI volumes
The function simVOLfmri is built to generate complete fMRI datasets (i.e., 3D for a slice and 4D for a volume).In this function, some spatial properties of the data are introduced.For this function, not only a design matrix -defining when the activation occurs -has to be specified, but also a region -defining where the activation takes place -should be provided.Similarly as for the design matrix, a preparation function (simprepSpatial) is needed to ensure that all arguments that define the region of activation are in the correct format.Suppose that we wish to simulate 2 activated regions that are part of a small network.We need to call the simprepSpatial function as follows R> regions <-simprepSpatial(regions = 2, coord = list(c(10, 5, 24), + c(53, 29, 24)), radius = c(10, 5), form = "sphere") The arguments that should be provided in the function are: the number of activated regions (regions), a list of coordinates specifying the regions (coord), the radius of the region (radius, not needed if the region is defined manually) and the shape of the region (form) The implemented shapes are cube and sphere.For any other shape, the coordinates of all voxels in the region should be entered manually (see Section 2 for an example).Further, we will generate the activation in both regions following the same design matrix as for the generation of the time series.

Simulating and analyzing a 4D fMRI dataset
To further demonstrate the functionality of the package, we present a more real-life example.Consider the data from a repetition priming experiment performed using event-related fMRI neuRosim: An R Package for Generating fMRI Data (Henson et al. 2002) 2 .The data are the result of a 2 × 2 factorial study with factors fame and repetition where famous and non-famous faces were presented twice against a checkerboard (Henson et al. 2002, for more details, see).An orthographic overview of the measured data is given on the left side of Figure 6.To generate data using neuRosim that are representative for this study, we start by defining the design.First we define some parameters like the dimension of the image space, the number of scans and TR.Then, since we simulate an event-related design, we also assign the onsets for each condition.
Next, we analyzed the simulated data in SPM following the exact description given in the manual (Ashburner et al. 2009, Chapter 29).We considered three contrasts, namely: (1) the overall effect of faces versus baseline checkerboard, (2) the effect of famous faces and (3) the effect of repetition.The results were thresholded with p < 0.05 (uncorrected), just to demonstrate the detection of the activation.Figure 7 shows a comparison between some of the activated regions that are found in the real data (left-hand) and in the simulated data (right-hand).

Conclusions and future work
neuRosim provides a flexible framework for generating fMRI data including a large variety of activation models and noise structures.High-level functions are available to simulate time series or full 4D data in an efficient and transparent way.For more advanced users, the lowlevel functions create the opportunity to build customized simulation functions.Currently, we are working on an extension of a resting state module such that in future updates it will be possible to have the same functionality for the generation of resting state data as for fMRI data.Other future plans are to include more neurobiological models, for example, the metabolic-hemodynamic model (Sotero and Trujillo-Barreto 2008;Sotero et al. 2009) and spatiotemporal BOLD dynamics (Drysdale et al. 2010).To extent the generalizability of the data simulated by neuRosim, we plan to include the generation of complex-valued fMRI data consisting of both magnitude and phase data.To conclude, it is our hope that neuRosim will evolve to a general platform for simulating fMRI data.Simulation studies should be a requisite to publish a statistical validation paper in the field of neuroscience.This will only be possible when standardized and trustworthy simulation methods using validated data generation techniques are available.
1 and a 2 model the delay of the response and the undershoot relative to the onset, b 1 and b 2 model the dispersion of the response and the undershoot, c models the scale of the undershoot, and d 1 and d 2 model the time to peak of the response and the undershoot.The default values of the parameters are d i = a i b i , a 1 = 6, a 2 = 12, b i = 0.9 and c = 0.35 (Glover 1999).R> canonical <-specifydesign(totaltime = 200, onsets = list(onsets), + durations = list(dur), effectsize = 1, TR = 2, + conv = "double-gamma") 3. The stimulus function is used as the input for the balloon model implemented in the balloon function (Buxton et al. 2004).The solving of the differential equations in the model is based on the Runge-Kutta solver in the R package deSolve (Soetaert et al. 2010).The parameters of the model can be modulated via the param argument, which should be a list containing values for all the parameters in the model.If not specified, the default values as described by Buxton et al. (2004) are used.R> balloon <-specifydesign(totaltime = 200, onsets = list(onsets), + durations = list(dur), effectsize = 1, TR = 2, conv = "Balloon")

Figure 1 :
Figure 1: The BOLD signals based on the three convolution functions for a 20-second ON/OFF block design.

Figure 2 :
Figure 2: Example of an activated slice: on the left, the activation is modeled as a sphere, on the right, the activated voxels are defined manually.

Figure 3 :
Figure 3: Time series of the noise structures in neuRosim.

Figure 4 :
Figure 4: Correlation images for (a) an autoregressive correlation structure, (b) a Gaussian random field and (c) a Gamma random field.

Figure 5 :
Figure 5: Generated time series (in blue) based on an experiment with two conditions (dashed lines).

Figure 6 :
Figure 6: Orthographic view of fMRI data for an event-related repetition priming study.On the left, the data measured by Henson et al. (2002) and on the right, the data simulated by neuRosim.

Figure 7 :
Figure 7: Axial slice view of the activated voxels for the real (left) and simulated data (right): faces versus baseline contrast (top), famous versus non-famous contrast (middle), first versus second presentation contrast (bottom).
The first three are general regions that activate when faces are presented, the fourth region is only activated if famous faces are shown, while in the last region adaptation to the repetition of faces is modeled.