Confidence Sets for Cohen’s d effect size images

Highlights
• Confidence Sets (CSs) extend the idea of confidence intervals to fMRI maps.
• For a Cohen's d threshold c, the upper CS asserts where d > c, the lower CS where d < c.
• We demonstrate the CSs method on HCP subject-level Cohen's d data.
• We compare the CSs with results from standard statistical voxelwise inference.
• Unlike traditional cluster tests, CSs precisely quantify spatial uncertainty.


Introduction
Online dating has transformed the love-seeking game forever. Whereas romantic partners would historically first encounter each other face-to-face, brought together by a mutual friend or family member, in recent times these rituals of connection have been largely replaced by social networks and matchmaking websites (Rosenfeld et al., 2019). While it was perhaps inevitable that technologies of the Digital Age would take hold of our pursuit to find a partner, what may be more surprising is the influence the internet has had on the final outcomes of a marriage itself. In an investigation analyzing survey data from over 19,000 married American respondents, it was reported that virtual dating avenues may have helped to improve the prospects of finding a long and happy relationship (Cacioppo et al., 2013). With overwhelming statistical evidence, this study found that spouses who had met their partner online were more likely to be satisfied with their marriage (p < 0.001) and less likely to divorce (p < 0.002). Yet with such a large sample, statistical significance comes cheaply: effects can be declared significant even if they have little practical value or are unlikely to be replicable across repeated analyses (Button et al., 2013). This issue has become topical within the functional magnetic resonance imaging (fMRI) community due to the arrival of population-scale neuroimaging datasets. While fMRI has traditionally been a 'small N' enterprise, with typical sample sizes of 20 to 30 subjects (Poldrack et al., 2017), datasets such as the Human Connectome Project (HCP, N = 1200) and UK Biobank (N = 40,000+) are now giving researchers the opportunity to analyze data acquired from tens of thousands of participants. These projects promise to transform our understanding of brain function, and are already yielding rich results (Miller et al., 2016; Van Essen and Glasser, 2016).
However, in this setting the standard statistical thresholding approach to functional brain data analysis has become obsolete: with ample power to detect all effects, statistical analysis of high-quality fMRI data has been shown to lead to almost universal brain activation, even when stringent thresholding methods are applied (Gonzalez-Castillo et al., 2012).
In a more general analysis context, there are still a number of limitations with traditional fMRI inference techniques. Currently, the most popular method for overcoming the multiple comparisons problem is to threshold the statistical results with cluster-extent based thresholding (Carp, 2012), involving a two-step procedure: first, a primary voxelwise threshold is applied to the statistic map, usually in correspondence with an uncorrected significance level (e.g. α = 0.001), creating clusters of voxels whose statistic values have all surpassed the threshold. Then, in order to control the family-wise error (FWE) rate, a cluster-extent threshold k is determined based on the distribution of cluster sizes obtained under the null hypothesis of no activation, and the final results are computed as all suprathreshold clusters with a spatial extent larger than k.
As the FWE-corrected p-value is determined by cluster size, one of the main drawbacks of this procedure is that the significance of specific voxels cannot be determined, and the most we can assert is that activation has occurred somewhere inside a given cluster (Woo et al., 2014). It is therefore impossible to pinpoint the precise source of the activation when a cluster covers multiple anatomical regions, and the spatial specificity of the inference diminishes the larger a cluster becomes. Another problem with this approach is that no information is provided in regard to the spatial variation of significant clusters. For instance, if a single fMRI study were repeated many times using different groups of participants, there would be variation in the sizes and shapes of the final activation clusters, yet current statistical results have no way to convey this variability.
In our previous work (Bowring et al., 2019, (BTSN)), we helped to address these issues by developing Confidence Sets (CSs) for inference on %BOLD change effect size maps. Unlike traditional hypothesis testing methods, where inference is only provided in terms of the presence of an effect, the CSs made simultaneous confidence statements about the precise brain regions where raw effect sizes had exceeded, and fallen short of, a non-zero %BOLD threshold. Here, we set out to adapt the CSs for application to standardized Cohen's d maps (i.e. %BOLD change divided by population standard deviation) that are more commonly used to provide effect size estimates complementing the statistical results obtained from an fMRI one-sample t-test. For a cluster-forming threshold c and a predetermined confidence level 1 − α, the Cohen's d CSs comprise two sets: the upper CS (denoted Â+_c), containing all voxels declared to have a true Cohen's d effect size greater than c; and the lower CS (Â−_c), for which all voxels outside this set are declared to have a true Cohen's d effect size less than c. The upper CS is smaller and nested inside the lower CS, and the assertion is made with (1 − α)100% confidence holding simultaneously for both regions. Fig. 1 provides a visual schematic of the CSs, with the upper and lower CSs presented in red and blue respectively. Note that t-tests, or any test statistics, are not suitable for building CSs, as they do not estimate a population quantity and become arbitrarily large for increasing sample sizes.
The statistical characteristics of Cohen's d effect size maps are fundamentally different to those of the raw %BOLD images that motivated the CSs in our previous work. Our main contributions with this effort are modifications to the methods used in BTSN to create procedures for obtaining Cohen's d CSs on fMRI data with desirable finite-sample performance. In particular, we apply recent results from the bootstrapping literature and a variance-stabilizing transformation method to ultimately propose three separate algorithms for computing Cohen's d CSs. The first algorithm is motivated by asymptotic properties of the Cohen's d sampling distribution, and provides a framework for the two remaining methods, which employ further adjustments to optimize the finite-sample performance of the CSs. We assess the performance of all three methods on a range of simulated synthetic 2D and 3D signals representative of fMRI clusters, and find that the two latter procedures are effective even when the sample size is modest (N = 60). Finally, we apply the three procedures to Human Connectome Project working memory task data, operating on Cohen's d effect maps, where we obtain CSs for a variety of cluster-forming thresholds. By comparing the CSs with results obtained from a traditional statistical voxelwise inference, we highlight the improvement in activation localization that can be provided with the Confidence Sets.
The remainder of this manuscript is organized as follows: first, we describe the problem of obtaining Confidence Sets for Cohen's d images, exemplifying the key differences which distinguish Cohen's d from the %BOLD effect size. We then derive properties of the Cohen's d estimator, before adapting the methods developed in BTSN to propose three separate algorithms to compute Cohen's d CSs. We assess the empirical coverage performance of each of these methods on 2D and 3D Monte Carlo simulations, and finally, present the CSs obtained from applying each algorithm to the Human Connectome Project working memory task-fMRI dataset.

From %BOLD to Cohen's d
For a compact domain S ⊂ ℝ^D, e.g. D = 3, and for i = 1, …, N, consider the one-sample model at location s ∈ S,

Y_i(s) = μ(s) + ε_i(s),  (1)

where Y_1(s), …, Y_N(s) are the observations at s, μ(s) is the true underlying mean intensity across the observations, and ε_1(s), …, ε_N(s) are i.i.d. mean-zero errors with common variance σ²(s) and some unspecified spatial correlation. We are motivated by the setting of a group-level task-fMRI analysis, where μ(s) represents the true mean %BOLD change across the group, and each observation Y_i(s) is the %BOLD response estimate map obtained by applying a first-level model to the i-th participant's functional data. (Note, while we focus on the one-sample model here, the method may also generalize for application to the general linear model Y(s) = Xβ(s) + ε(s). See the end of Section 5.1 for more details.) We wish to make inference on the Cohen's d effect size, defined as the true mean %BOLD change divided by the population standard deviation,

d(s) = μ(s) / σ(s).

Specifically, we are interested in the brain regions where d(s) has exceeded, and fallen short of, a fixed threshold c, indicated by the noise-free, population cluster defined as:

A_c = {s ∈ S : d(s) ≥ c}.

Since A_c is unknown, we pursue a method for constructing pairs of spatial CSs: an upper set Â+_c and a lower set Â−_c, that we are confident surround the true excursion set A_c (i.e. Â+_c ⊂ A_c ⊂ Â−_c) for a desired confidence level of, for example, 1 − α = 95%. Such a method lets us assert with 95% confidence that all voxels contained in the upper CS Â+_c have a Cohen's d effect size greater than, for example, c = 0.8, and simultaneously, we are 95% confident all voxels outside the lower CS Â−_c have a Cohen's d effect size less than 0.8. Here, we emphasize the classic frequentist connotation of the term 'confidence'; letting ∂A_c denote the boundary of A_c, then precisely, there is a probability of 1 − α that the region Â−_c ∖ Â+_c computed from a future experiment fully encompasses the true set boundary ∂A_c. In this sense, the set difference of the upper and lower CSs, Â−_c ∖ Â+_c, is similar to a standard confidence interval.

Fig. 1. Schematic of the color-coded regions we will use to visually represent the Confidence Sets (CSs) and point estimate set. The upper and lower CSs are presented in red and blue (overlapped by yellow and red) respectively. The yellow set (overlapped by red), Â_c, is the point estimate set: the best guess from the data of the voxels that have a Cohen's d effect size greater than the threshold c = 0.5.
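The nesting relation between the upper CS, the point estimate set, and the lower CS can be sketched numerically. Below is a minimal toy illustration in Python, assuming a hand-picked margin in place of the bootstrap-derived critical value k (all names and values are illustrative):

```python
import numpy as np

# Toy 1D illustration of the nested sets: upper CS inside the point
# estimate set inside the lower CS. The margin is hand-picked for
# illustration, not the bootstrap-derived critical value k.
s = np.linspace(0, 1, 101)        # spatial locations
d_hat = 1.2 * s                   # toy estimated Cohen's d field (a ramp)
c, margin = 0.8, 0.15             # threshold and illustrative half-width

upper_cs = d_hat >= c + margin    # declared confidently above c
point_est = d_hat >= c            # best guess of the excursion set
lower_cs = d_hat >= c - margin    # complement declared confidently below c

# Nesting holds by construction: upper CS ⊂ point estimate ⊂ lower CS.
assert np.all(upper_cs <= point_est) and np.all(point_est <= lower_cs)
```

Shrinking the margin (i.e. a smaller critical value or a larger sample) pulls the two CSs toward the point estimate set.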
In BTSN, we adapted the mathematical theory first proposed in Sommerfeld et al. (2018) (SSS) to obtain CSs for inference on the mean %BOLD change effect μ(s). Let Ȳ(s) = (1/N) Σ_{i=1}^N Y_i(s), the sample mean %BOLD change. Then, subject to continuity of the relevant fields and some basic conditions on the error terms ε_i(s), for the excursion set A_{c,μ} of voxels with a true %BOLD effect size greater than c, we showed that for a critical constant k, the upper and lower CSs constructed as

Â+_{c,μ} = {s : Ȳ(s) ≥ c + k σ̂(s)/√N},  Â−_{c,μ} = {s : Ȳ(s) ≥ c − k σ̂(s)/√N}  (5)

give asymptotic nominal coverage for enveloping the true A_{c,μ}, in terms of the mean %BOLD change effect size. Further to this, we proposed a Wild t-Bootstrap method for determining the critical value k, and demonstrated that on applying this method the CSs were also valid for data with smaller sample sizes. We now seek to develop a similar methodology for the Cohen's d effect size. However, the statistical properties of the Cohen's d estimator d̂(s) = Ȳ(s)/σ̂(s) are considerably different to those of the sample mean Ȳ(s). To provide a visual intuition of this in the case of Gaussian data, in Fig. 2 we display images of both of these fields from a 2D simulation over a square region S of size 100 × 100. For N = 60 subjects, we simulated a toy run of the signal-plus-noise model in (1) where the true underlying signal μ(s) was a linear ramp effect increasing from a magnitude of 0 to 10 in the x-direction while remaining constant in the y-direction (Fig. 2(a)). To the signal we added subject-specific Gaussian noise ε_i(s), smoothed with a 3-voxel FWHM Gaussian kernel and then re-normalized to have a spatially constant standard deviation of σ(s) = 1. Notably, in this setup the true Cohen's d field d(s) was identical to μ(s).
In Fig. 2(b) and (c) we show the sample mean and sample Cohen's d fields from this simulation. While the sample mean image is uniformly smooth across the space, the Cohen's d field becomes more speckled, i.e. more variable, for increasing true d(s). In the following sections we will show that the sample Cohen's d is a biased estimator of the true underlying effect size, and that the sampling variance of Cohen's d changes systematically with d(s), before proposing our theoretical adjustments to the methods presented in BTSN to obtain CSs for Cohen's d effect size images.

Limiting properties of the Cohen's d estimator
Motivated by the example in Fig. 2, we now consider the one-sample model given in (1) with the additional assumption that the error fields are Gaussian. In this case, the data Y_1(s), …, Y_N(s) are i.i.d. N(μ(s), σ²(s)) and the error terms ε_1(s), …, ε_N(s) are i.i.d. from a mean-zero Gaussian random field ε(s) such that for all s, t ∈ S,

Cov(ε(s), ε(t)) = σ(s) σ(t) ρ(s, t),

where ρ(s, t) denotes the population correlation coefficient between points s and t in the error field. The sample mean and sample variance for this model are defined as

Ȳ(s) = (1/N) Σ_{i=1}^N Y_i(s)  and  σ̂²(s) = (1/(N − 1)) Σ_{i=1}^N (Y_i(s) − Ȳ(s))²,

respectively. We wish to understand the limiting structure of the Cohen's d estimator d̂(s) = Ȳ(s)/σ̂(s). Applying the multivariate central limit theorem to the sample moments at s and t yields an asymptotic joint Gaussian distribution for √N [(Ȳ(s), σ̂²(s), Ȳ(t), σ̂²(t)) − (μ(s), σ²(s), μ(t), σ²(t))], with a covariance matrix Σ(s, t) determined by ρ(s, t) and the marginal variances. For the function g(x₁, y₁, x₂, y₂) = (x₁/√y₁, x₂/√y₂), application of the delta method then yields the joint limiting distribution of (d̂(s), d̂(t)). Therefore, the limiting field of the Cohen's d estimator d̂(s) is asymptotically normal with asymptotic variance 1 + d(s)²/2. As alluded to in the previous section, it is notable that unlike the sample mean, the asymptotic variance and spatial correlation of the Cohen's d estimator are dependent on the underlying true effect size. In the upcoming section, we will use these properties to motivate a construction for the Cohen's d Confidence Sets.

Fig. 2. Visualizing the differences between the sample mean and sample Cohen's d field. For N = 60 subjects, we simulated a signal-plus-noise model where the true underlying mean signal μ(s) was a linear ramp increasing from 0 to 10 across the region (a). To each subject we added Gaussian noise with a homogeneous variance, so that the true Cohen's d effect d(s) was equal to the group mean signal μ(s). While the sample mean image Ȳ(s) is uniformly smooth across the region (b), the sample Cohen's d field d̂(s) becomes rougher from left to right (c).
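The delta-method result above can be checked by simulation: for Gaussian data with σ = 1, the variance of √N(d̂(s) − d(s)) should approach 1 + d(s)²/2. A minimal Monte Carlo sketch (sample sizes and seed are arbitrary):

```python
import numpy as np

# Monte Carlo check of the delta-method variance: with sigma = 1 the
# variance of sqrt(N) * (d_hat - d) should be close to 1 + d^2 / 2.
rng = np.random.default_rng(0)
N, d, reps = 200, 1.0, 20000

y = rng.normal(loc=d, scale=1.0, size=(reps, N))   # mu = d since sigma = 1
d_hat = y.mean(axis=1) / y.std(axis=1, ddof=1)     # sample Cohen's d per trial
emp_var = N * d_hat.var()

theory = 1 + d**2 / 2
assert abs(emp_var - theory) < 0.1                 # agrees up to MC/finite-N error
```

Setting d = 0 recovers the familiar unit variance of a standardized mean; larger d inflates the variance, matching the speckling seen in Fig. 2(c).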

Confidence Sets for Cohen's d effect size images
Once again, consider the model outlined at the start of Section 2.1. For clarity, we reiterate that the spatial CSs for the raw %BOLD change field μ(s) of focus in our previous work took the form of (5), where k was determined via a Wild t-Bootstrap procedure. This construction of the CSs was motivated by the limiting properties of the field

G(s) = √N (Ȳ(s) − μ(s)) / σ̂(s).

In particular, letting ∂A_{c,μ} denote the boundary of A_{c,μ} defined in (4), then for a neighbourhood of ∂A_{c,μ}, it is assumed in SSS that G(s) converges weakly to a smooth Gaussian field with mean zero, unit variance, and with the same (unknown) spatial correlation as each of the ε_i.
In the previous section, for the Gaussian one-sample model we derived the convergence in distribution of the function

H(s) = √N (d̂(s) − d(s)) / √(1 + d(s)²/2)

to a Gaussian field with mean zero, unit variance, and covariance structure

Cov(H(s), H(t)) = (ρ(s, t) + (d(s) d(t)/2) ρ(s, t)²) / (√(1 + d(s)²/2) √(1 + d(t)²/2)).  (14)

This suggests a natural analog to the construction of CSs in (5) for the Cohen's d effect size, given by

Â+_{c,d} = {s : d̂(s) ≥ c + k √(1 + d̂(s)²/2)/√N},  Â−_{c,d} = {s : d̂(s) ≥ c − k √(1 + d̂(s)²/2)/√N}.  (15)

Ideally, we wish to apply the same Wild t-Bootstrap procedure described in Section 2.2 of BTSN to approximate the limiting field H in order to determine k. However, we will now show that such an approach is not viable for Cohen's d, before proposing a modified procedure to solve the problem. Going forward our focus will primarily be on the Cohen's d effect size, and thus for brevity, we will drop the d subscript from our notation and refer to the Cohen's d CSs above as Â+_c and Â−_c respectively.

Modified residuals for the Cohen's d wild t -bootstrap
In SSS, it was shown that the limiting coverage of the CSs for the %BOLD effect size μ(s) is governed by the distribution of the maximum of the limiting Gaussian field G(s) on the boundary ∂A_{c,μ}. Since the limiting Gaussian field G(s) is unknown, in BTSN we implemented a Wild t-Bootstrap procedure to approximate G(s) on the boundary ∂A_{c,μ}. Defining the standardized residuals

ε̃_i(s) = (Y_i(s) − Ȳ(s)) / σ̂(s),

the Wild t-Bootstrap approximating field is given by

G̃*(s) = (1/√N) Σ_{i=1}^N r_i* ε̃_i(s) / σ̂*(s),  (18)

where the r_i* are i.i.d. Rademacher variables (i.e. each r_i* takes the value of −1 or 1 with probability 1/2), and σ̂*(s) is the standard deviation of the current realization of bootstrapped residuals r_i* ε̃_i(s). The asterisk (*) indicates that G̃*(s) is one of many bootstrap samples; in practice, we would obtain a large number of bootstrap samples G̃*(s), and approximate k as the (1 − α)100 percentile of the suprema sup_{s ∈ ∂A_{c,μ}} |G̃*(s)|. While this method is valid in regard to %BOLD change, for Cohen's d we demonstrate that asymptotically the correlation structure of the approximating field G̃*(s) is incorrect. Consider again the Gaussian model in Section 2.2. In this instance, the covariance of the approximating field converges to

ρ(s, t),  (19)

where we note that since the standardized residuals ε̃_i(s) are asymptotically Gaussian with unit variance, the bootstrap estimate of the standard deviation of the standardized residuals σ̂*(s) converges almost surely to 1. Since (19) and (14) do not agree except for the complete null case of d(s) = 0 everywhere, we conclude that the covariance of the approximating field does not converge to the true covariance of the limiting field H(s).
To solve this problem, we implement a Taylor expansion transformation recently proposed in Telschow et al. (2020) to construct modified residuals with the desired limiting properties. Motivated by the delta method procedures used in Section 2.2, an estimation of the residual field for a single subject is obtained via the function f(x, y) = x/√y. A first-order Taylor expansion of f(x, y) about the point (Ȳ(s), σ̂²(s)) yields the approximating Cohen's d residuals ε̂_i^d(s), consisting of the standardized residual ε̃_i(s) together with a second term, proportional to d̂(s), that accounts for the estimation of the variance. Normalizing by the estimated standard deviation of the limiting field H(s), we obtain the modified standardized residuals:

ε̃_i^d(s) = ε̂_i^d(s) / √(1 + d̂(s)²/2).  (22)

In Telschow et al. (2020), it is shown that the limiting covariance of the ε̃_i^d(s) is equal to the covariance function of H(s). Therefore, a modification of (18) leads us to the Cohen's d version of the Wild t-Bootstrap approximating field,

H̃*(s) = (1/√N) Σ_{i=1}^N r_i* ε̃_i^d(s) / σ̂*(s),  (23)

where now σ̂*(s) is the standard deviation of the bootstrapped Cohen's d residuals r_i* ε̃_i^d(s). While normalization of the ε̂_i^d(s) by an estimator of the standard deviation of H(s) provides us with residuals that have the correct limiting properties, for application to fMRI data we wish to optimize the bootstrap in smaller sample sizes. In this regard, it may be preferable to standardize the ε̂_i^d(s) using an estimator tailored to the sample. Noting that the sample mean of the approximating residuals, ε̄^d(s) = (1/N) Σ_{i=1}^N ε̂_i^d(s), is equal to zero for all s, and letting σ̂_ε(s) denote the sample standard deviation of the ε̂_i^d(s), an alternative to (22) is to normalize the Cohen's d residuals by their sample standard deviation, so that

ε̃_i^d(s) = ε̂_i^d(s) / σ̂_ε(s).

These standardized residuals can then be used for the Wild t-Bootstrap approximating field given in (23). In this case, the sample standard deviation should also be accounted for in the formation of the CSs, suggesting an alternate construction to (15) given by

Â+_c = {s : d̂(s) ≥ c + k σ̂_ε(s)/√N},  Â−_c = {s : d̂(s) ≥ c − k σ̂_ε(s)/√N}.  (26)

In Section 3.1, we assess the performance of the CSs on synthetic data when the residuals are standardized using either the estimated limiting standard deviation √(1 + d̂(s)²/2) or the sample standard deviation σ̂_ε(s).
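The residual construction and bootstrap loop can be sketched as follows. This is an illustrative Python sketch rather than the paper's implementation: the residual formula is one plausible first-order Taylor form of the delta-method residuals discussed above, the "boundary" is simply all locations of a toy 1D field, and all names are ours:

```python
import numpy as np

# Sketch of a Wild t-Bootstrap critical value for Cohen's d, assuming a
# delta-method residual of the form below (an illustrative choice) and
# taking the supremum over all 50 locations of a toy 1D field.
rng = np.random.default_rng(1)
N, B, alpha = 60, 1000, 0.05
y = rng.normal(0.5, 1.0, size=(N, 50))             # N subjects, 50 locations

ybar, sd = y.mean(0), y.std(0, ddof=1)
d_hat = ybar / sd

# Mean part plus a variance part scaled by -d_hat/2 (Taylor-expansion sketch)
eps = (y - ybar) / sd - 0.5 * d_hat * (((y - ybar) ** 2 - sd**2) / sd**2)
eps_std = eps / np.sqrt(1 + d_hat**2 / 2)          # limiting-sd normalization

sup_stats = np.empty(B)
for b in range(B):
    r = rng.choice([-1.0, 1.0], size=(N, 1))       # Rademacher multipliers
    boot = (r * eps_std).sum(0) / np.sqrt(N)
    boot /= (r * eps_std).std(0, ddof=1)           # t-bootstrap standardization
    sup_stats[b] = np.abs(boot).max()              # supremum over the "boundary"

k = np.quantile(sup_stats, 1 - alpha)              # critical value
```

With smoothed, correlated noise and a genuine boundary estimate, the supremum would run over far fewer, correlated locations, typically giving a smaller k.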

Finite-sample properties of the Cohen's d estimator and a variance-stabilizing transformation
Up to now, we have motivated two possible constructions for Cohen's d CSs ((15) and (26)) using the limiting properties of the Cohen's d estimator. Here, we turn our attention to the distributional properties of d̂(s) for finite samples to make further improvements to these methods, and introduce another novel procedure for obtaining CSs based on Gaussianizing the distribution of d̂(s).
Again assuming the Gaussian model described in Section 2.2, and observing that the Cohen's d estimator can be expressed in the form

√N d̂(s) = (√N Ȳ(s)/σ(s)) / (σ̂(s)/σ(s)),

from the RHS of the equality we deduce that √N d̂(s) is characterized by a noncentral t-distribution with noncentrality parameter √N d(s) and N − 1 degrees of freedom. Letting

α_N = √((N − 1)/2) · Γ((N − 2)/2) / Γ((N − 1)/2),

where Γ denotes the gamma function, the expectation of √N d̂(s) is given by

E[√N d̂(s)] = α_N √N d(s).

Therefore, unlike the sample mean, the Cohen's d estimator is biased.
To improve the performance of the CSs for small sample sizes, we will account for this bias in the formulation of the CSs. A well-known approximation of α_N is

α_N ≈ (1 − 3/(4N − 5))^{−1}.

Therefore, a bias-corrected version of the CS construction in (15) is given by

Â+_c = {s : d̂(s)/α_N ≥ c + k √(1 + d̂(s)²/2)/√N},  Â−_c = {s : d̂(s)/α_N ≥ c − k √(1 + d̂(s)²/2)/√N},  (31)

and similarly, a bias-corrected version of the alternate construction in (26) is given by

Â+_c = {s : d̂(s)/α_N ≥ c + k σ̂_ε(s)/√N},  Â−_c = {s : d̂(s)/α_N ≥ c − k σ̂_ε(s)/√N}.  (32)

In addition to the formation of the CSs, for any application to real data, the Wild t-Bootstrap described in Section 2.4 must be applied over an approximation of the boundary ∂A_c = {s ∈ S : d(s) = c}. Taking into consideration the bias of the Cohen's d estimator, we will use the plug-in boundary:

∂Â_c = {s ∈ S : d̂(s)/α_N = c}.  (33)

The noncentral t-distribution is asymmetric unless the noncentrality parameter is 0; in general, the size of the asymmetry scales with the magnitude of the noncentrality parameter and is inversely proportional to the degrees of freedom. Therefore, we expect the distribution of the Cohen's d estimator to be highly skewed when the true effect size is large and the sample size is small. This conflicts with the symmetric construction of the upper and lower CSs Â+_c and Â−_c given in (31) and (32), suggesting that the coverage performance of these two methods may decline in such situations.
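The bias factor can be computed directly from the gamma-function expression and compared against the standard small-sample approximation. A short sketch (the approximate form is stated here as our assumption about the elided formula):

```python
from math import gamma, sqrt

# Exact gamma-function bias factor alpha_N of the Cohen's d estimator
# under Gaussian errors, E[d_hat] = alpha_N * d, compared with the
# standard approximation alpha_N ~ 1 / (1 - 3 / (4N - 5)).
def alpha_N(N):
    return sqrt((N - 1) / 2) * gamma((N - 2) / 2) / gamma((N - 1) / 2)

for N in (20, 60, 240):
    exact = alpha_N(N)
    approx = 1 / (1 - 3 / (4 * N - 5))
    assert exact > 1                               # d_hat overestimates d
    assert abs(exact - approx) < 1e-3              # agreement to 3 decimals
```

The factor shrinks toward 1 as N grows, so the correction matters chiefly in the small-sample regime the text targets.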
To account for skewness, we adapt a method originally proposed in Laubscher (1960) to stabilize the variance of the noncentral t, transforming to a distribution which is approximately Gaussian, and hence symmetric. For an adjusted degrees-of-freedom term ν* defined in Appendix A, we show in Appendix A that the variance-stabilizing transformation g(d̂) of d̂ is given by (35). Numerical work presented in Laubscher (1960) shows that the 90th percentile value of g(d̂) in (35) closely estimates Φ^{−1}(0.9) for a range of true effect sizes when the sample size is larger than 40, suggesting that, for moderate sample sizes, the distribution of g(d̂) is approximately Gaussian.
By the monotonicity of the mapping d ↦ g(d), the variance-stabilizing transformation provides a further possibility for constructing the CSs in the transformed space. Rearranging (35), the transformed CSs (36) are obtained by applying the CS construction to g(d̂(s)) with the transformed threshold g(c). In this case, the Cohen's d residuals given in (21) for the Wild t-Bootstrap must also be modified. An estimation of the transformed residual field for a single subject is obtained via the composed function h(x, y) = g(x/√y); similarly to the methods applied in Section 2.4, a first-order Taylor expansion of h(x, y) about the point (Ȳ(s), σ̂²(s)) obtains the transformed Cohen's d residuals. In practice, the critical value k in (36) is computed by applying the Wild t-Bootstrap over ∂Â_c using the transformed Cohen's d residuals given above for the bootstrap approximating field in (23). For coherence, we will now formalize the complete procedures to obtain CSs for each of our three proposed CS constructions in (31), (32), and (36).

Three algorithms for computing Cohen's d CSs
Based on our derivations up to this point, we give three algorithms to compute Cohen's d CSs for data modelled within the one-sample model of Section 2.1. While the first two algorithms are similar, the key difference separating these methods is the estimator of the variance used in the formation of the CSs and for standardizing the Cohen's d residuals.
We first describe Algorithm 1, where √(1 + d̂(s)²/2) is used as the estimate of the limiting standard deviation.

Algorithm 1:
1. Compute the Cohen's d estimate d̂(s) and the approximating Cohen's d residuals ε̂_i^d(s) given in (21).
2. Normalize the Cohen's d residuals by the estimated limiting standard deviation of the Cohen's d image to obtain the standardized residuals, ε̃_i^d(s) = ε̂_i^d(s)/√(1 + d̂(s)²/2).
3. Draw i.i.d. Rademacher variables r_1*, …, r_N*, and compute the Wild t-Bootstrap approximating field given in (23), where σ̂*(s) is the bootstrap standard deviation of the bootstrapped residuals r_i* ε̃_i^d(s).
4. Obtain the value ℓ* = sup_{s ∈ ∂Â_c} |H̃*(s)|, using the bias-corrected estimated boundary ∂Â_c given in (33).
5. For a large number of bootstrap replications B, repeat steps 3 and 4, obtaining the empirical distribution of the absolute maximum ℒ = {ℓ*_1, …, ℓ*_B}. Compute k as the (1 − α) percentile of ℒ.
6. Obtain the Cohen's d CSs given in (31).

For Algorithm 2, we use the sample standard deviation of the Cohen's d residuals σ̂_ε(s) as the variance estimator, motivated by our workings in Section 2.4.

Algorithm 2:
1. Compute the Cohen's d estimate d̂(s) and the approximating Cohen's d residuals ε̂_i^d(s) given in (21).
2. Normalize the Cohen's d residuals by their sample standard deviation to obtain the standardized residuals, ε̃_i^d(s) = ε̂_i^d(s)/σ̂_ε(s).
3. Draw i.i.d. Rademacher variables r_1*, …, r_N*, and compute the Wild t-Bootstrap approximating field given in (23), where σ̂*(s) is the bootstrap standard deviation of the bootstrapped residuals r_i* ε̃_i^d(s).
4. Obtain the value ℓ* = sup_{s ∈ ∂Â_c} |H̃*(s)|, using the bias-corrected estimated boundary ∂Â_c given in (33).
5. For a large number of bootstrap replications B, repeat steps 3 and 4, obtaining the empirical distribution of the absolute maximum ℒ = {ℓ*_1, …, ℓ*_B}. Compute k as the (1 − α) percentile of ℒ.
6. Obtain the Cohen's d CSs given in (32).

Finally, Algorithm 3 is based on the derivations in Section 2.5, transforming the estimated Cohen's d image to a field which is approximately Gaussian. This is done to stabilize the variance and remove the skew of the Cohen's d estimator, which may adversely affect the performance of the CSs. By the monotonicity of the transformation, the CSs obtained using this method are valid for inference on the true (un-transformed) Cohen's d effect size.
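To make the final CS-formation step shared by the first two algorithms concrete, the following sketch assembles bias-corrected upper and lower sets from a fixed, hand-chosen critical value k (in practice k comes from the Wild t-Bootstrap; the margin term follows the Algorithm 1 limiting standard deviation, and all names are illustrative):

```python
import numpy as np

# Assembling bias-corrected upper and lower CSs from a fixed critical
# value k (hand-chosen; in practice bootstrapped). The margin uses the
# Algorithm 1 limiting standard deviation sqrt(1 + d^2 / 2).
rng = np.random.default_rng(2)
N, c, k = 60, 0.8, 3.0
y = rng.normal(0.9, 1.0, size=(N, 1000))           # true d = 0.9 everywhere

d_hat = y.mean(0) / y.std(0, ddof=1)
alpha_N = 1 / (1 - 3 / (4 * N - 5))                # approximate bias factor
d_corr = d_hat / alpha_N                           # bias-corrected Cohen's d
margin = k * np.sqrt(1 + d_corr**2 / 2) / np.sqrt(N)

upper_cs = d_corr >= c + margin
lower_cs = d_corr >= c - margin
assert np.all(upper_cs <= lower_cs)                # upper CS nested in lower
```

Swapping the margin for k times the sample standard deviation of the residuals gives the Algorithm 2 variant of this step.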

Simulation setup
In this section we describe the settings used to evaluate the performance of each of the three algorithms for obtaining Cohen's d CSs on synthetic data. For each method, we simulate 3000 independent samples of the Gaussian one-sample model using a range of signals μ(s) and Gaussian noise structures ε(s) with stationary and non-stationary variance σ²(s), in two- and three-dimensional regions S. To compute the critical value k, we apply the given method's Wild t-Bootstrap procedure with B = 5000 bootstrap samples on the estimated boundary ∂Â_c that must be used for application to real data. We obtain the boundary using the linear interpolation method described in Section 2.3 of BTSN. We then compute the empirical coverage using the interpolation assessment method described in Section 2.4 of BTSN, given as the percentage of trials for which the true thresholded signal A_c contains the upper CS Â+_c and is contained within the lower CS Â−_c (i.e. the proportion of trials for which Â+_c ⊂ A_c ⊂ Â−_c). In each simulation, we apply the given method to sample sizes of N = 30, 60, 120, 240 and 480, and for each of the three nominal coverage probability levels 1 − α = 0.80, 0.90 and 0.95.
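The subset checks behind the empirical coverage can be written as a small helper, sketched here with toy boolean masks:

```python
import numpy as np

# Coverage bookkeeping for one simulated trial: success means the upper
# CS sits inside the true excursion set, which sits inside the lower CS.
def covers(upper_cs, true_set, lower_cs):
    return bool(np.all(upper_cs <= true_set) and np.all(true_set <= lower_cs))

true_set = np.array([0, 0, 1, 1, 1], dtype=bool)
good_up  = np.array([0, 0, 0, 1, 1], dtype=bool)
good_lo  = np.array([0, 1, 1, 1, 1], dtype=bool)
bad_up   = np.array([0, 1, 0, 1, 1], dtype=bool)   # spills outside the true set

assert covers(good_up, true_set, good_lo)
assert not covers(bad_up, true_set, good_lo)
```

Empirical coverage is then the mean of covers(...) over the 3000 trials, compared against the nominal 1 − α.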

2D simulations
We analyzed the performance of the three algorithms to obtain Cohen's d CSs on a square region S of size 100 × 100. For the true underlying signal μ(s) we considered two different raw effects: first, a linear ramp that increased from a magnitude of 0 to 1 in the x-direction while remaining constant in the y-direction. Second, a circular effect, created by placing a circular phantom of magnitude 1 and radius 30 in the centre of the search region, which was then smoothed using a 3-voxel FWHM Gaussian kernel. If we were to assume that each voxel had a size of 2 mm³, we note that this would amount to applying smoothing with a 6 mm FWHM kernel, a fairly typical setting used in fMRI analyses.
To each of these signals we added subject-specific Gaussian noise ε_i(s), obtained from smoothing white noise with a 3-voxel FWHM Gaussian kernel, with homogeneous and non-homogeneous variance structures: the first noise field had a spatially constant standard deviation of 1, and therefore in this case the true Cohen's d effect was identical to the underlying signal μ(s). The second field had a linearly increasing standard deviation structure in the y-direction, from √0.5 to √1.5, while remaining constant in the x-direction. Thus, the variance of this noise field spatially increased in the y-direction from 0.5 to 1.5 in a non-linear fashion.
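Generating one such noise realization is straightforward; the sketch below converts the 3-voxel FWHM to the Gaussian kernel's standard deviation via sigma = FWHM/(2√(2 ln 2)) and rescales the realization to unit standard deviation (a per-realization rescaling, used here purely for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# One subject's smoothed noise field: white noise smoothed with a
# 3-voxel FWHM Gaussian kernel, then rescaled to unit standard deviation.
rng = np.random.default_rng(3)
fwhm = 3.0
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))        # FWHM -> kernel sd (~1.274)

noise = gaussian_filter(rng.standard_normal((100, 100)), sigma)
noise /= noise.std()                                # standard deviation of 1
```

Multiplying the field voxelwise by a ramp of standard deviations would produce the heterogeneous variant described above.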
The true Cohen's d fields d(s) for the linear ramp signal with homogeneous and heterogeneous noise are shown in Fig. 3. The corresponding Cohen's d fields for the circular signal are shown in Fig. 4. Altogether, for the three algorithms, the two underlying signals and two noise sources gave us 12 different simulation setups; for all of the simulations, we obtained Cohen's d CSs for the noise-free cluster A_c at a cluster-forming threshold of c = 0.8. In Chapter 2.2 of Cohen (2013), d = 0.8 was classified as a 'large effect'; for group-level analyses of large-sample fMRI data with ample statistical power (such as the HCP or UK Biobank), effect sizes of this magnitude may be used to assess brain areas where practically significant activation has occurred.

Fig. 3. The two Cohen's d effects corresponding to the linear ramp signal μ(s). On the left, the subject-specific Gaussian noise field ε(s) has a spatially constant standard deviation of 1, and therefore d(s) = μ(s). On the right, ε(s) had a spatially increasing standard deviation structure in the y-direction (from top-to-bottom), while remaining constant in the x-direction.

Fig. 4. The two Cohen's d effects corresponding to the circular signal μ(s). On the left, the subject-specific Gaussian noise field ε(s) has a spatially constant standard deviation of 1, and therefore d(s) = μ(s). On the right, ε(s) had a spatially increasing standard deviation structure in the y-direction (from top-to-bottom), while remaining constant in the x-direction.

3D simulations
Four signal types μ(s) were considered to analyze the performance of the three algorithms in three dimensions. The first three of these signals were generated synthetically on a cubic region of size 100 × 100 × 100: first, a small spherical effect, created by placing a spherical phantom of magnitude 1 and radius 5 in the centre of the search region, which was then smoothed using a 3-voxel FWHM Gaussian kernel. Second, a larger spherical effect, generated identically to the first effect with the exception that the spherical phantom had a radius of 30. Lastly, we created a multi-sphere effect by placing four spherical phantoms of magnitude 1 and varying radii in the region and then smoothing the entire image using a 3-voxel FWHM Gaussian kernel.
Each of the images was rescaled after smoothing to have a maximum intensity of 1. For the small and large spherical effects an image-wise rescaling was applied, where all locations in the smoothed map were divided by the maximum intensity across the region. For the final effect, because parts of the four spherical phantoms overlapped after smoothing, the signal intensities in these regions summed to greater than 1. In this case, we reduced the intensities in these areas to have a magnitude of 1 while leaving the rest of the image untouched. This ensured that the signal at the centre of each spherical phantom had a magnitude of 1, coinciding with the previous small and large spherical effect signal types (see Fig. 5; the signal at the centre of each spherical phantom in plot (c) is 1, in correspondence with the small and large spherical effects in plots (a) and (b)).
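The two rescalings can be sketched as follows, using a toy image with two overlapping blocks in place of the smoothed spherical phantoms (sizes and smoothing width are arbitrary):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# The two rescalings: image-wise division by the maximum (single
# phantom), and reducing only the overlap regions to 1 (multi-phantom).
img = np.zeros((40, 40))
img[10:20, 10:20] += 1.0
img[15:25, 15:25] += 1.0                 # overlap sums above 1

smoothed = gaussian_filter(img, 1.3)
imagewise = smoothed / smoothed.max()    # image-wise rescaling to max 1
clipped = np.minimum(smoothed, 1.0)      # clip only where overlap exceeds 1

assert smoothed.max() > 1.0 and imagewise.max() == 1.0 and clipped.max() == 1.0
```

Clipping preserves the unit peak at each phantom centre, whereas image-wise rescaling would shrink the non-overlapping phantoms below 1.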
Similar to the two-dimensional simulations, for the three signals described above we added white noise smoothed using a 3-voxel FWHM Gaussian kernel with homogeneous and heterogeneous variance structures. The first noise field had a spatially constant standard deviation of 1, while the second field had a linearly increasing standard deviation structure. For the final signal type, we took advantage of big data that has been made available through the UK Biobank in an attempt to generate an effect that replicated the true %BOLD change induced during an fMRI task. We randomly selected 4000 subject-level contrast of parameter estimate result maps from the Hariri Faces/Shapes task-fMRI data collected as part of the UK Biobank brain imaging study. Full details on how the data were acquired and processed are given in Miller et al. (2016), Alfaro-Almagro et al. (2018) and the UK Biobank Showcase; information on the task paradigm is given in Hariri et al. (2002). From these contrast maps, we computed a group-level full mean and full standard deviation image, considering all voxels where at least one hundred subjects had data (instead of discarding voxels with any missing data). In the final simulation, we used the group-level Biobank mean image as the true underlying signal μ(s) for each subject, and the full standard deviation image was used for the standard deviation of each simulated subject-specific Gaussian noise field ε_i(s) added to the true signal.

Fig. 5. Plots (a)-(c) show the Cohen's d field for the three different spherical effects μ(s) when Gaussian noise with spatially homogeneous standard deviation was added to the signal. Plot (d) shows the Cohen's d field corresponding to the UK Biobank full mean and standard deviation images. Note that the colormap limits for the first three Cohen's d effect-size images are from 0 to 1, while the colormap limits for the UK Biobank image are from −0.9 to 0.9.
Given the considerable sample size of high-quality data from which these maps were obtained, we anticipate that both images are highly representative of the true underlying fields they approximate. Both images were masked using the intersection of all 4000 subject-level brain masks.
Once again, we smoothed the noise field using a 3-voxel FWHM Gaussian kernel; as the Biobank maps had 2 mm isotropic voxels, this equated to applying 6 mm FWHM smoothing to the noise field of the original data.
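As a sketch of how this smoothing step might look (using `scipy.ndimage.gaussian_filter`; the `smooth_noise` helper and grid size are illustrative assumptions, not the paper's code), note that a FWHM specified in voxels converts to the filter's standard deviation via FWHM = σ · 2√(2 ln 2):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_noise(noise, fwhm_vox):
    """Smooth a noise field with a Gaussian kernel whose FWHM is given
    in voxels, converting FWHM to sigma via FWHM = sigma * 2*sqrt(2*ln 2)."""
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return gaussian_filter(noise, sigma=sigma)

# A 3-voxel FWHM kernel on 2 mm isotropic voxels equates to 6 mm FWHM.
rng = np.random.default_rng(0)
noise = rng.standard_normal((20, 20, 20))
smoothed = smooth_noise(noise, fwhm_vox=3)
```

Note that after smoothing, the field's pointwise variance shrinks, which is why simulation pipelines typically re-standardize smoothed noise before scaling by the target standard deviation image.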
In Fig. 5, we show the true underlying Cohen's d fields for the three synthetic 3D effects with homogeneous noise structure, and the Cohen's d field corresponding to the UK Biobank full mean and standard deviation (a histogram of the UK Biobank Cohen's d field is provided in Fig. B.14). For all four signal types we obtained Cohen's d Confidence Sets for the threshold c = 0.8, and in order to assess whether a change of threshold could affect the performance of the CSs, we also obtained Cohen's d CSs using a threshold of c = 0.5 for the final UK Biobank signal type.

Application to Human Connectome Project data
To provide a real-data demonstration of the three methods proposed in this work, we computed Cohen's d CSs on 80 participants' data from the Unrelated 80 package released as part of the Human Connectome Project (HCP, S1200 Release) using all three algorithms described in Section 2.6. Cohen's d CSs were obtained for the subject-level 2-back vs 0-back contrast maps from the working memory task results included with the HCP dataset. For comparison with standard fMRI inference procedures, we also performed a traditional statistical group-level analysis on the data: a one-sample t-test was carried out in SPM, using a voxelwise FWE-corrected threshold of p < 0.05 obtained via permutation test with SPM's SnPM toolbox.
For the working memory task, participants were presented with pictures of places, tools, faces and body parts in a block design. The task consisted of two runs, where on each run a separate block was designated for each of the image categories, making four blocks in total. Within each run, participants undertook a 2-back memory task for half of the blocks and a 0-back memory task for the other half. Eight EVs were included in the GLM, one for each combination of picture category and memory task (e.g. 2-back Place); we computed CSs on the subject-level contrast images for the 2-back vs 0-back contrast, which compared the four 2-back EVs to the four 0-back EVs.
Imaging was conducted on a 3T Siemens Skyra scanner using a gradient-echo EPI sequence; TR = 720 ms, TE = 33.1 ms, 208 × 180 mm FOV, 2.0 mm slice thickness, 72 slices, 2.0 mm isotropic voxels, and a multi-band acceleration factor of 8. Preprocessing of the subject-level data was carried out using tools from FSL and FreeSurfer following the 'fMRIVolume' HCP Pipeline fully described in Glasser et al. (2013). To summarize, the fundamental steps applied to each individual's functional 4D time-series data were gradient unwarping, motion correction, EPI distortion correction, registration of the functional data to the anatomy, non-linear registration to MNI space (using FSL's Non-linear Image Registration Tool, FNIRT), and global intensity normalization. Spatial smoothing was applied using a Gaussian kernel with a 4 mm FWHM.
Modelling of the subject-level data was conducted with FSL's FMRIB Improved Linear Model (FILM). The eight working memory task EVs were included in the GLM, with temporal derivative terms added as confounds of no interest, and regressors were convolved using FSL's default double-gamma hemodynamic response function. The functional data and GLM were temporally filtered with a high-pass frequency cutoff of 200 s, and time series were prewhitened to remove autocorrelations from the data.
Mirroring the methods used in BTSN, we applied additional smoothing to the final contrast maps to mimic images smoothed with a 6 mm FWHM Gaussian kernel, a more typical degree of smoothing for functional data than the 4 mm kernel used in the original HCP analysis pipeline.
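Because successive Gaussian smoothings compose in quadrature (smoothing at FWHM f1 then f2 is equivalent to a single kernel of √(f1² + f2²)), the additional kernel needed to take 4 mm data to an effective 6 mm is √(6² − 4²) ≈ 4.47 mm. A minimal sketch, with a helper name of our own choosing:

```python
import math

def additional_fwhm(target_fwhm, applied_fwhm):
    """FWHM of the extra Gaussian kernel needed so that data already
    smoothed at `applied_fwhm` reaches an effective `target_fwhm`,
    using the fact that Gaussian FWHMs add in quadrature."""
    if target_fwhm < applied_fwhm:
        raise ValueError("cannot un-smooth: target must be >= applied")
    return math.sqrt(target_fwhm ** 2 - applied_fwhm ** 2)

# Taking the HCP's 4 mm smoothed maps to an effective 6 mm:
print(additional_fwhm(6.0, 4.0))  # ~4.47 mm of extra smoothing
```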

2D simulations
Empirical coverage results for each of the three algorithms are presented for the linear ramp signal in Fig. 6 and for the circular signal in Fig. 7, where in all simulations a Cohen's d threshold of c = 0.8 was applied. In both figures, the top row displays the coverage results obtained when the standard deviation field of the noise was homogeneous across the region (corresponding to Fig. 3(a) for the linear ramp and Fig. 4(a) for the circle), and the bottom row displays the equivalent results when the standard deviation field was spatially heterogeneous (Fig. 3(b) and Fig. 4(b) for the linear ramp and circle respectively).
(Fig. 7 caption: Coverage results for the circular signal, with homogeneous (top row) and heterogeneous (bottom row) Gaussian noise structures. All algorithms performed well, and unlike the linear ramp, empirical coverage for all three methods converged towards the nominal level. For smaller sample sizes there was a larger degree of over-coverage, most noticeably for simulations using the 80% nominal target. Overall, Algorithm 2 performed marginally better than the other two methods, and Algorithm 1 performed the worst.)
For the linear ramp, across all confidence levels 1 − α = 0.80, 0.90, and 0.95, we observed valid over-coverage for all three algorithms when sufficiently large sample sizes (N ≥ 60) were used. In all plots, the coverage rates for the three algorithms appear to converge to the same value, slightly above the nominal target. Specifically, for the nominal target level of 80%, in both the homogeneous and heterogeneous cases all empirical results seem to converge to around 88% (Fig. 6, left-side plots). For the 95% target, the disagreement between the empirical results and the nominal target is smaller; here, all coverage results hover close to 96% for N = 240 and 480 (Fig. 6, right-side plots).
While for larger sample sizes the performance of all three algorithms was similar, there was greater disparity between the methods for simulations using the smaller sample sizes of N = 30 and 60. Here, the empirical coverage results for Algorithm 1 were consistently higher than those of the other two methods, with the degree of over-coverage increasing as the sample size was lowered. On the other hand, the coverage results for Algorithms 2 and 3 both decreased as the sample size was made smaller; while there was only a slight dip in coverage for Algorithm 3, the drop-off for Algorithm 2 was more considerable, with coverage results falling substantially below the nominal target level when N = 30.
For the circular signal, all three methods performed well on the whole. In this instance, almost all empirical results for Algorithms 2 and 3 lay within the 95% confidence interval of the nominal coverage rate (blue and magenta curves sandwiched between black dashed lines for all plots in Fig. 7), with Algorithm 2 performing marginally better. While we observed greater over-coverage for the smaller sample sizes, most substantially in simulations using the 80% nominal target (Fig. 7, left-side plots), empirical coverage converged towards the nominal level for all three algorithms.
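The confidence interval of the nominal coverage rate referred to here can be computed, for example, as a simple Wald interval on the Monte Carlo success proportion. This sketch is an assumption about the construction rather than a statement of the paper's exact procedure; the function name and counts are illustrative:

```python
import math

def coverage_ci(successes, trials, z=1.96):
    """Empirical coverage rate with a Wald 95% interval, of the kind used
    to judge whether Monte Carlo coverage results are consistent with the
    nominal level."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, (p - half, p + half)

# e.g. 2850 of 3000 simulation runs satisfied the subset condition
p, (lo, hi) = coverage_ci(2850, 3000)
print(p, lo, hi)  # empirical rate 0.95 with its interval
```

An empirical curve lying inside this band is then "consistent with" the nominal level at that Monte Carlo resolution.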
Finally, the use of homogeneous or heterogeneous noise in the model had minimal impact on the empirical coverage performance of any of the algorithms for either signal. This is exemplified in Figs. 6 and 7, where in both cases the homogeneous coverage plots presented in the top row are almost identical to the corresponding heterogeneous plots shown below.

3D simulations
Empirical coverage results for each of the three algorithms are presented in Figs. 8, 9, 10 and 11 respectively for each of the four 3D signal types displayed in Fig. 5 (small sphere, large sphere, multiple spheres, UK Biobank), where in all simulations a Cohen's d threshold of c = 0.8 was applied. In Fig. C.15, we present coverage results for the UK Biobank signal type using a smaller Cohen's d threshold of c = 0.5. For the spherical effects (Figs. 8–10), the top row displays the coverage results obtained when the standard deviation field of the noise was homogeneous across the region, and the bottom row displays the equivalent results when the standard deviation field was spatially heterogeneous. For the UK Biobank signal (Fig. 11), the full standard deviation image computed from the UK Biobank data was used for the standard deviation field of the noise, and hence in this case there is only one row of results.
(Fig. 8 caption: Coverage results for the small sphere signal type, with homogeneous (top row) and heterogeneous (bottom row) Gaussian noise structures. In general, empirical coverage remained above the nominal level across all simulations, and for the 95% confidence level (right plots), the results of all three methods fell close to the nominal target (with some over-coverage for N = 30). All methods were robust as to whether the subject-level noise had homogeneous or heterogeneous variance structure; because of this, there are minimal differences between the two rows.)
Across all 3D simulations we observed consistencies between the results obtained with each of the three algorithms: in general, empirical coverage for all methods came above the nominal target, with increasing severity for smaller sample sizes. Similar to the 2D simulations, the extent of over-coverage was smaller when a larger confidence level was used. Comparing the three methods, coverage results for Algorithm 1 were considerably higher than those of the other two methods, particularly when a small sample size and confidence level were used. For the 'large sphere' and 'multiple spheres' signal types, Algorithm 1 suffered from over-coverage of more than 15% in simulations with a sample size of N = 30 and 60 and a nominal target level of 80% (Figs. 9 and 10, left-side plots). For both of these signals, there was still a considerable amount of over-coverage when larger sample sizes of N = 120, 240 and (to a lesser degree) 480 were used. On the other hand, Algorithms 2 and 3 performed similarly for large sample sizes across all simulations, with empirical coverage results coming slightly above the nominal target. Notably, both of these algorithms performed very well for simulations with a 95% nominal target level (all figures, right-side plots). Differences between these two methods were more pronounced for the smaller sample sizes of N = 30 and 60, where we observed a greater degree of over-coverage for Algorithm 3. Consequently, Algorithm 2's results came closer to the nominal target here, although for the 'multiple spheres' and 'UK Biobank' signal types, in some cases Algorithm 2's results fell slightly below the nominal level (Figs. 10 and 11). Overall, empirical coverage for Algorithm 3 was the most uniform of the three methods across moderate and large sample sizes.
Comparing Figs. 8 and 9, we observed a slight deterioration in the performance of all three algorithms when moving from the small sphere signal type to the large sphere. In particular, results obtained from applying the three methods to the large sphere fell further above the nominal target relative to the small sphere. This was most severe for Algorithm 1, where differences between the two sets of results were larger than 10% for the 80% confidence level (Figs. 8 and 9, left-side plots). These differences were comparatively smaller for Algorithms 2 and 3, where we observed only a slight increase in empirical coverage, particularly for simulations using larger sample sizes. This would suggest that both of these methods are fairly robust to changes in the boundary length.
(Fig. 9 caption: Coverage results for the large sphere signal type. Compared with Fig. 8, empirical coverage results were higher for all three methods here. Algorithm 1 suffered from a particularly large degree of over-coverage for simulations with a small sample size. Coverage performance for Algorithms 2 and 3 was closer in resemblance to the corresponding small sphere results, with Algorithm 2 performing slightly better. This suggests that both of these methods are fairly robust to changes in the boundary length.)
We observed a similar degradation in performance for the two simulation results obtained using the UK Biobank signal type, where simulations using the smaller threshold of c = 0.5 (Fig. C.15) fell further above the nominal target relative to the simulations using a threshold of c = 0.8. Once again, this may be attributable to changes in the boundary length between the two simulations (the boundary was longer for the smaller threshold of c = 0.5). Rather than a degradation in the three algorithms' performance per se, we suspect that inaccuracies in the methods used to assess the simulations may have induced a positive bias into the coverage results for signals with a longer boundary (see the end of Section 5.2 for more on this). Finally, the use of homogeneous or heterogeneous noise in the model once again had very little impact on the performance of all three algorithms. Nevertheless, for simulations with small sample sizes, a heterogeneous noise structure led to a slight decrease in the empirical coverage results for Algorithms 2 and 3 (Figs. 8–10, left-side plots).

Human Connectome Project
Cohen's d Confidence Sets obtained by applying Algorithm 3 to 80 subjects' contrast data from the Human Connectome Project are shown in Fig. 12. CSs computed on the same data using Algorithm 1 and Algorithm 2 are displayed in Figs. D.1 and D.2 respectively. In each figure, we display the CSs obtained from applying the specified algorithm with three separate thresholds, c = 0.5, 0.8, and 1.2; these three Cohen's d effect sizes were classified as medium, large, and very large in Cohen (2013).
In the top plot of Fig. 12, the red upper CSs localized brain regions within the frontal cortex that are commonly associated with working memory. These included areas of the superior frontal gyrus (left and right, all slices), middle frontal gyrus (left, coronal slice), paracingulate gyrus (left and right, axial slice) and insular cortex. Other brain areas encapsulated inside the upper CS were the angular gyrus (left and right, axial slice), cerebellum (left and right, sagittal slice) and precuneus (left and right, axial slice). For all of these regions, the method identified clusters of voxels where we can assert with 95% confidence there was a Cohen's d effect size greater than 0.5.

Fig. 11.
Coverage results for the UK Biobank signal type, where the full standard deviation image was used as the standard deviation of the subject-level noise fields. Coverage results here were similar to the results for the multiple spheres signal type shown in Fig. 10. Once again, both Algorithms 2 and 3 performed well for large samples, with empirical coverage rates hovering above the nominal target, while results for Algorithm 1 came further above the nominal level. While for smaller samples the degree of over-coverage became greater for Algorithms 1 and 3, results for Algorithm 2 appear to drop slightly here.

Fig. 12.
Slice views of the Cohen's d Confidence Sets obtained from applying Algorithm 3 to the HCP working memory task data, using three Cohen's d effect size thresholds, c = 0.5, 0.8 and 1.2. The upper CS is displayed in red, and the lower CS in blue. Yellow voxels represent the point estimate set, the best guess from the data of the voxels that have surpassed the Cohen's d threshold. The red upper CS has localized regions in the frontal gyrus, paracingulate gyrus, angular gyrus, cerebellum and precuneus which we can assert with 95% confidence have attained (at least) a 0.5 Cohen's d effect size.
By increasing the threshold to c = 0.8 (Fig. 12, middle plot), there was a shrinking of both the blue lower CSs and red upper CSs. Therefore, while we can confidently declare a medium effect size in all of the brain areas identified above, the quantity of voxels within each region that we can proclaim to have a large effect size is considerably smaller. In the case of the right cerebellar hemisphere (left, sagittal slice) and insular cortex, the upper CS vanished completely, indicating that the method did not locate any voxels in these regions where we can assert a Cohen's d effect size greater than 0.8. The results for the largest threshold assessed, c = 1.2 (Fig. 12, bottom plot), are particularly notable. Here, the yellow point estimate set contains a small but appreciable number of voxels, marking where the observed Cohen's d effect size was greater than 1.2. However, in this case there was no red upper CS (i.e. the upper CS was completely empty). Therefore, contrary to the inference one might come to based on the point estimate set alone, we cannot state with confidence that any voxels have attained a very large effect size. Conversely, the large quantity of (grey background) voxels lying outside the blue lower CS in Fig. 12 implies an effect size of less than 1.2 across the vast majority of the brain.
In Fig. 13, the red upper CSs computed with Algorithm 3 are compared with the thresholded t-statistic map (green-yellow voxels) obtained from applying a one-sample t-test group analysis to the 80 subjects' contrast data, using a voxelwise FWE-corrected threshold of p < 0.05. This figure demonstrates the improved spatial specificity that the CSs can provide in comparison with the traditional approach. Specifically, while the thresholded statistic map contains one large cluster covering a sizeable portion of the parietal lobe across both brain hemispheres, the red upper CSs pinpoint precise areas in the precuneus and angular gyrus where a practically significant medium (or large) Cohen's d effect size can be inferred (Fig. 13, axial slices).

Spatial inference on Cohen's d effect size
To fully appreciate the outcomes of a neuroimaging study, information about the magnitude (as well as the presence) of effects must be reported at the end of an investigation. It is only with this knowledge that one can truly determine the practical relevance (and potential clinical importance) of any discoveries made during the analysis.

Fig. 13.
Comparing the upper CSs (red voxels) computed with Algorithm 3 on the HCP working memory task data (same slice views as Fig. 12) with the thresholded t-statistic results obtained by applying a traditional group-level one-sample t-test, voxelwise p < 0.05 FWE correction (green-yellow voxels). While the thresholded statistic map contains a single cluster covering a sizable portion of the parietal lobe across both hemispheres (axial slices), the upper CSs have localized the precise areas of the precuneus and angular gyrus where we can confidently declare a Cohen's d effect size of at least 0.5. This demonstrates how the CSs can provide improved spatial specificity in determining regions with practically significant activation.

In this work,
we have presented three methods to create Confidence Sets for Cohen's d effect size maps, providing formal confidence statements on regions of the brain where the Cohen's d effect size has exceeded a specified activation threshold, alongside regions where the effect size has not surpassed this threshold. Both of these statements are made simultaneously across the entire brain, enabling researchers to pinpoint the precise regions where meaningful differences have occurred as well as to identify areas that have not responded to the task. This is in contrast to the traditional statistical approach, where a test statistic (e.g. a t-statistic) or a p-value can only quantify the compatibility between the observed data and what would be expected under the null hypothesis of no activation, and no information is provided about the effect size when a finding is deemed to be statistically significant. Since the test statistic is confounded by the sample size of the study, it is not possible to implement a similar framework to obtain Confidence Sets for statistic images; as sample sizes increase, test statistics become arbitrarily large, until ultimately there is enough statistical power to declare even the smallest of effects statistically significant. On the other hand, for larger samples we expect the red upper CSs to converge towards the population parameter A_c, the set of voxels with a true population effect magnitude above a purposeful activation threshold c.
While in our analysis of the HCP working memory task data we primarily focused on activated regions localized by the red upper CSs, it is also important to note the added utility provided by the blue lower CSs, which cannot be obtained with standard statistical inferences. In particular, while the logic of statistical testing means that one can never prove the null hypothesis to be true, the blue lower CSs provide inference on brain areas where effect sizes have not attained a sufficient activation threshold. Therefore, if the Cohen's d threshold is chosen appropriately, users could essentially accept the null of no (practically significant) activation for all voxels lying outside the blue CSs. Another method that allows for evidence in support of either the null or an alternative hypothesis is Bayesian testing (Han and Park, 2018; Rouder et al., 2009). However, this procedure cannot guarantee the same control of false positives that has been demonstrated in terms of coverage for the CSs here, and it requires that users set a log odds threshold (to define the strength of evidence needed to accept the null or alternative hypothesis), which is arguably less intuitive than the pre-determined confidence level needed for the CSs.
The use of CSs for inference on effect size may also help to alleviate issues associated with hypothesis testing in studies with lower statistical power. Specifically, for studies with small sample sizes it has been reported that applying traditional inference procedures can lead to spurious or irreproducible results with considerably inflated observed effect sizes (Cremers et al., 2017; Poldrack et al., 2017). Regarding the latter, this is often caused by a form of selection bias known as the 'winner's curse' (Button et al., 2013; Reddan et al., 2017), whereby voxels whose observed effect size has exceeded their expected performance are intrinsically more likely to be determined as statistically significant. This becomes a problem as magnitudes are commonly reported only for significant voxels, a practice that leads to positively biased effect estimates. Our analysis results from the Human Connectome Project dataset exemplify how the CSs can help to resolve this issue. In Fig. 12, the yellow point estimate 'best guess from the data' clusters identified a number of voxels with a Cohen's d effect size greater than 1.2 that were also included in the thresholded statistic map obtained from applying a one-sample t-test with voxelwise p < 0.05 FWE correction (Fig. 13). However, by synthesizing information about the effect magnitude as well as the reliability of the estimate, the CSs presented in Fig. 12 affirm that there are in fact no voxels that can be confidently declared to have an effect size larger than 1.2. On the contrary, only a handful of brain regions were contained in the red upper CS asserting a Cohen's d effect size exceeding 0.5 for the HCP working memory task data.
In our previous effort we described a method to obtain CSs for unstandardized percentage BOLD change maps, rather than the Cohen's d images that have served as our main focus here. The use of Cohen's d instead of %BOLD is likely to be advantageous due to complications associated with the BOLD effect. As discussed in our previous work, %BOLD effect sizes have been shown to modulate according to acquisition parameters such as the scanner field strength or MRI pulse sequence, and inhomogeneities in vascularity between different brain regions can cause further variation in the BOLD response. At a more rudimentary level, there are also difficulties involved in obtaining percentage BOLD change images. While all of the main fMRI software packages provide contrast of parameter estimate maps, each of the three most widely-used analysis packages (AFNI, FSL and SPM) scales the raw data differently; the parameter estimates are often given in arbitrary units which deviate between packages. Conversion to percentage BOLD change therefore requires a software-dependent normalization, where one must take into consideration how to appropriately scale the data, design matrix and analysis contrasts. While this can be cumbersome and prone to human error, conversion to Cohen's d is relatively simple. Due to the straightforward relationship between the Cohen's d effect size and the one-sample t-statistic (d̂ = t/√N), users can easily generate Cohen's d images from the unthresholded t-statistic maps created by all the main neuroimaging packages. For all of these reasons, the Cohen's d CS maps may be more suitable for comparison between studies.
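The d̂ = t/√N conversion described above is simple enough to sketch directly; the function name and toy t-map are ours for illustration:

```python
import numpy as np

def cohens_d_from_t(t_map, n_subjects):
    """Convert a one-sample t-statistic map to a Cohen's d map
    via d = t / sqrt(N)."""
    return np.asarray(t_map, dtype=float) / np.sqrt(n_subjects)

# Toy example: three voxels' t-statistics from a study with N = 16.
t_map = np.array([4.0, 8.0, -2.0])
d_map = cohens_d_from_t(t_map, n_subjects=16)
print(d_map)  # d values of 1.0, 2.0 and -0.5
```

In practice the same one-liner applies voxelwise to an entire unthresholded t-statistic image loaded from any of the main packages.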
In this work, we have used classifications of the Cohen's d effect size as initially suggested in Cohen (2013), describing 0.5 as a 'medium' effect, 0.8 as 'large', and 1.2 as 'very large'. While these benchmarks provide basic descriptors of effect size, in general we recommend that users take appropriate steps to contextualize what sort of magnitude constitutes a meaningful finding in their own study. In context, this means that for some task-fMRI studies brain regions with a Cohen's d effect size of 0.2 (classified as a 'small' effect by Cohen) may represent an important finding (see Fig. 2 in Noble et al., 2020 for the distributions of ground-truth effect sizes found across the brain for a range of common task-fMRI paradigms). Overall, users should factor in the aims of their investigation, the quality of the study and, if possible, the effect sizes reported in similar previous efforts before choosing a threshold. Obtaining the CSs for the Human Connectome Project contrast data in this work was computationally quick, with each analysis taking less than one minute for all three proposed algorithms. Therefore, one possible strategy is to evaluate a variety of thresholds c on pilot or historical data before fixing a value to use on a study of interest.
While we have developed the Cohen's d CSs for a one-sample model, the methods presented here may also be applied to the general linear model Y(v) = X β(v) + ε(v), where X is the design matrix and β(v) is the vector of unknown coefficients. In this setting, for a contrast vector w, the quantity of interest would be the standardized contrast w⊤β(v)/σ(v) (instead of the Cohen's d effect size μ(v)/σ(v)). The method would be carried out similarly, except, for example, the normalized standard deviation of the contrast estimate, (w⊤(X⊤X)⁻¹w)^(1/2), would need to be considered in the construction of the CSs (replacing the 1/√N term in the one-sample CS constructions for the three algorithms). It is important to note that the mathematical results underpinning this work in Telschow et al. (2020) have as yet only been provided for the one-sample model, which is why this model has been our primary focus here. Nevertheless, further intuition on applying the method to the general linear model can be obtained from BTSN, where we developed the CSs in the general linear model setting for raw effect size images.
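The substitution of (w⊤(X⊤X)⁻¹w)^(1/2) for 1/√N can be illustrated with a small sketch that recovers 1/√N exactly for a one-sample design; the helper name and the contrast symbol w are our own notation:

```python
import numpy as np

def contrast_sd_factor(X, w):
    """Normalized standard deviation factor sqrt(w' (X'X)^{-1} w) for a
    contrast vector w in the GLM Y = X beta + eps; this replaces the
    1/sqrt(N) term of the one-sample model in the CS construction."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return float(np.sqrt(w @ XtX_inv @ w))

# Sanity check: a one-sample design (a column of ones) with contrast
# w = [1] recovers 1/sqrt(N).
N = 25
X = np.ones((N, 1))
w = np.array([1.0])
print(contrast_sd_factor(X, w))  # 0.2, i.e. 1/sqrt(25)
```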

Three algorithms for Cohen's d Confidence Sets
In this work, we have theoretically motivated three algorithms for obtaining Cohen's d CSs. Our simulation results in Sections 4.1 and 4.2 have demonstrated differences in the coverage performance of each of these algorithms. Across all sets of simulation results, empirical coverage for Algorithm 1 came above the nominal level, with particularly severe over-coverage for 3D simulations carried out on large synthetic signals when small sample sizes were used (Figs. 8–10). The cause of such poor performance here is likely to be the variance term used to construct the CSs in Algorithm 1. Recalling the derivations in Section 2.2, the variance term used for Algorithm 1 was chosen as an estimator of the variance of the limiting Gaussian field. Therefore, while we expect this term to correctly approximate the variance of the bootstrap approximating field asymptotically, our theory provides no indication about the accuracy of this term in small samples. The over-coverage seen in our simulation results suggests that this term overestimates the true variance of the approximating field when the sample size is low. While there was some improvement in Algorithm 1's results as N increased, even for the largest sample size we analyzed, N = 480, empirical coverage for the other two methods was consistently closer to the nominal target level.
Overall, Algorithm 2 and Algorithm 3 both performed well across our 2D and 3D simulations. For simulations using a 95% confidence level, the empirical coverage performance of these two methods was remarkably similar for N ≥ 120 (in most cases, slightly above the nominal target). It is therefore difficult to conclude which method should be implemented in practice. For our 3D simulations (Figs. 8–11), Algorithm 2's empirical coverage results fell slightly closer to the nominal level in most cases. However, the results for Algorithm 3 were slightly more robust across moderate and large sample sizes, as observed in the UK Biobank and multiple spheres simulations (Figs. 10 and 11). When smaller sample sizes of N ≤ 60 were used, Algorithm 3 consistently suffered from a higher degree of over-coverage. However, the performance of Algorithm 2 also appeared to drop off in some cases, marginally for the UK Biobank simulation (Fig. 11), and considerably for the ramp simulation (Fig. 6), where for N = 30 Algorithm 2's coverage results fell well below the nominal target level. Therefore, Algorithm 3 may still be preferable for small sample sizes to ensure the inference remains valid (with respect to obtaining at least a 95% coverage rate).
From a theoretical standpoint, the variance-stabilizing transformation approach used in Algorithm 3 assumes that the observations are Gaussian, while this assumption is somewhat relaxed for Algorithm 2, where the bootstrap is applied to estimate the standard deviation directly from the data. While this suggests that Algorithm 2 may be preferable for non-Gaussian data, in our Human Connectome Project analyses the CS maps obtained using both methods (Fig. 12 for Algorithm 3, Fig. D.2 for Algorithm 2) were virtually identical, indicating that both methods could be equally effective for fMRI data with sample sizes on the order of the HCP.
While our simulations have primarily focused on how well the three algorithms maintain a targeted nominal coverage level, we also carried out further analyses to evaluate each method's capacity for identifying voxels with a true effect size greater than the threshold (akin to the power of a statistical test). In Appendix E, we present the upper CS sensitivity of each of the three algorithms across our 2D and 3D simulations. For each simulation type, we computed the sensitivity of the upper CS as the proportion of voxels in the noise-free population cluster A_c that were identified in the upper CS. In other words, upper CS sensitivity is the percentage of voxels with a true Cohen's d effect size greater than c that were correctly identified within the upper CS. Overall, the results for all three methods were similar here (Figs. E.1–E.6), although Algorithm 3 performed marginally better than the two remaining methods. In general, we found that more than 100 subjects were required to achieve a sensitivity above 10%, and that sensitivity could be exceedingly low when N < 60. These small-sample results are similar to corresponding sensitivity figures reported in Noble et al. (2020) for standard fMRI statistical methods, where the mean true positive rate was found to be less than 5% after a cluster-based FWER correction for a range of task-fMRI studies using a sample size of N = 20 (although note that this is not a like-for-like comparison, as the CSs control the inclusion error of individual voxels rather than clusters). While the CSs may not be very sensitive by this measure in small samples, we stress that the inference is still valid, in the sense that the CSs can still provide (at least) a 95% empirical coverage rate.
Therefore, while a traditional statistical test with voxelwise FDR correction may be more suitable for detecting nonzero effects in this setting, the CSs could still prove useful for delineating effect sizes that surpass a practically significant threshold and for quantifying the spatial uncertainty of the thresholded clusters. We are pursuing further work on 'FDR-controlled' CSs, which we hope will lead to a more sensitive version of the method in the future.
Although we have assessed the three algorithms on synthetic data, where the variance of the subject-level errors was either homogeneous or heterogeneous across space, a limitation of this work is that only Gaussian noise fields were considered in our simulations. While various non-Gaussian error fields were included in the simulations conducted in Sommerfeld et al. (2018) , where the CSs were shown to be effective for inference on the sample mean effect size, further assessment of the methods proposed here using noise structures more comparable to real fMRI data would be a valuable addition to future work. As already discussed, the variance-stabilizing transformation utilized by Algorithm 3 makes an assumption of Gaussianity, so a future study may seek to verify whether this method's performance is adversely affected in a non-Gaussian setting. One approach here is an empirical evaluation based on real data, in which a large dataset is split in half, one half defining a ground truth and the other providing many random samples. We investigated such an evaluation with the UK Biobank, but found that even with 4000 subjects set aside we were unable to obtain the highly stable ground-truth set A_c required to accurately assess the coverage rate. As the UK Biobank expands towards making 100,000 subjects' fMRI data available, we look forward to revisiting this approach with yet larger sample sizes.
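The split-half evaluation described here can be sketched as follows. The synthetic data, sample sizes, and variable names are illustrative stand-ins for subject-level fMRI maps, and the CS construction step itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def cohens_d(sample):
    """Voxelwise Cohen's d for a (n_subjects, n_voxels) array."""
    return sample.mean(axis=0) / sample.std(axis=0, ddof=1)

# Hypothetical data standing in for subject-level fMRI effect maps:
# unit-variance Gaussian noise around a smoothly varying true effect.
n_subjects, n_voxels, c = 2000, 500, 0.5
mu = np.linspace(0.0, 1.0, n_voxels)   # true Cohen's d ranges from 0 to 1
data = rng.normal(mu, 1.0, size=(n_subjects, n_voxels))

# One half of the data defines the "ground-truth" excursion set A_c ...
half = n_subjects // 2
truth_set = cohens_d(data[:half]) > c

# ... and repeated small subsamples of the held-out half would then be
# used to construct CSs and tally how often A_c^+ ⊆ A_c ⊆ A_c^- holds.
subsample = rng.choice(half, size=60, replace=False) + half
d_hat = cohens_d(data[subsample])
```

The instability noted in the text arises because `truth_set` itself is an estimate: voxels with true d near c flip in and out of the ground truth across splits, contaminating the coverage assessment.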
It is noticeable that the asymptotic coverage of all three procedures appeared to converge to above the nominal level in nearly all of our simulations. Moreover, the size of the over-coverage varied depending on the signal type and the confidence level (in all cases, the degree of over-coverage was higher for the smaller confidence level of 1 − α = 0.80 than for the larger confidence level of 1 − α = 0.95). We do not believe this is due to changes in the signal type per se, but rather to inaccuracies in the interpolation method used to assess whether coverage was obtained (i.e. whether A_c^+ ⊂ A_c ⊂ A_c^−) caused by the resolution of the lattice, positively biasing results for signals with a longer boundary ∂A_c. The assessment method evaluates whether coverage holds at a discrete set of sub-sampled points on ∂A_c, but as the boundary becomes longer, this set of discrete points becomes relatively less dense within the true, continuous boundary. Violations of the subset condition are therefore more likely to be missed for signals with a longer boundary, which could explain why, for example, the empirical coverage results for the large spherical effect were systematically higher than for the small spherical effect ( Figs. 8 and 9 ), or why the coverage results for the UK Biobank signal type and a threshold of c = 0.5 (longer boundary) were systematically higher than the corresponding results obtained using a threshold of c = 0.8 (shorter boundary; Figs. C.15 and 11 ). This line of reasoning is also consistent with the greater over-coverage seen at the lower confidence level: more violations of coverage should have occurred there, but there was correspondingly a higher chance that violations were missed. We discussed this issue in further detail in Section 5.2 of BTSN .
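The subset condition at the heart of this coverage assessment, A_c^+ ⊆ A_c ⊆ A_c^−, can be checked on the voxel lattice as below. Note that a pure lattice check like this one is exactly the kind of assessment that can miss violations occurring between lattice points on ∂A_c (the paper additionally evaluates interpolated boundary points); the function and mask names are our own.

```python
import numpy as np

def subset_coverage(upper_cs, true_set, lower_cs):
    """Check the nesting A_c^+ ⊆ A_c ⊆ A_c^- on the voxel lattice.

    Coverage fails if the upper CS contains a voxel outside the true
    excursion set, or if the true set contains a voxel outside the
    lower CS.
    """
    upper_ok = not np.any(upper_cs & ~true_set)
    lower_ok = not np.any(true_set & ~lower_cs)
    return upper_ok and lower_ok

# Toy example: three nested squares, so coverage holds.
true_set = np.zeros((10, 10), dtype=bool); true_set[2:8, 2:8] = True
upper_cs = np.zeros((10, 10), dtype=bool); upper_cs[3:7, 3:7] = True
lower_cs = np.zeros((10, 10), dtype=bool); lower_cs[1:9, 1:9] = True
covered = subset_coverage(upper_cs, true_set, lower_cs)  # True
```

The empirical coverage rate over a simulation run is then simply the fraction of realizations for which this check (plus the boundary refinement) returns True.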

Data availability
We have used data from the Human Connectome Project and UK Biobank. All code used for the simulations and analysis of the HCP data is available at: https://github.com/AlexBowring/Confidence_Sets_Manuscript . All Cohen's d Confidence Sets and statistical results maps obtained from our HCP analyses have been made available in the NeuroVault repository: https://neurovault.org/collections/9019/ .

Credit authorship contribution statement

Fig. C1.
Coverage results for the UK Biobank signal type simulation with a Cohen's d threshold of c = 0.5 (instead of the c = 0.8 threshold used for the simulation results presented in Fig. 11 ). For this smaller threshold, we observed valid over-coverage for all three methods across all sample sizes, and on the whole the CSs performed well. In comparison to the results obtained for the larger threshold of c = 0.8 ( Fig. 11 ), it is notable that there is a slightly higher degree of over-coverage across all of the results here. We believe this may be in part due to inaccuracies in the interpolation method used to assess the simulation results, rather than inaccuracies in the method itself; as the boundary ∂A_c is longer for the smaller threshold used here, it is more likely that violations of coverage were missed (because coverage is assessed at only a discrete set of lattice points along ∂A_c), inducing a positive bias in the results. We discuss this issue in more depth at the end of Section 5.2 .

Appendix D. Supplementary Human Connectome Project results
Fig. D1. Slice views of the Cohen's d Confidence Sets obtained from applying Algorithm 1 to the HCP working memory task data, using three Cohen's d effect size thresholds, c = 0.5, 0.8 and 1.2. Compared with Fig. 12 and Fig. D.2 , the CSs presented here are slightly more conservative than the corresponding CSs obtained with Algorithms 2 and 3 (in the sense that the red upper CSs here are smaller, and the blue lower CSs are larger). This is consistent with the simulation results in Sections 4.1 and 4.2 , where the empirical coverage for Algorithm 1 was consistently larger than for the other two methods.

Fig. D2. Slice views of the Cohen's d Confidence Sets obtained from applying Algorithm 2 to the HCP working memory task data, using three Cohen's d effect size thresholds, c = 0.5, 0.8 and 1.2. Compared with Fig. 12 , the upper and lower CSs presented here are almost identical to the corresponding CSs obtained with Algorithm 3 .