The impact of sample size on the reproducibility of voxel-based lesion-deficit mappings

This study investigated how sample size affects the reproducibility of findings from univariate voxel-based lesion-deficit analyses (e.g., voxel-based lesion-symptom mapping and voxel-based morphometry). Our effect of interest was the strength of the mapping between brain damage and speech articulation difficulties, measured as the proportion of variance explained. First, we identified a region of interest by searching on a voxel-by-voxel basis for brain areas where greater lesion load was associated with poorer speech articulation, using a large sample of 360 right-handed English-speaking stroke survivors. We then randomly drew thousands of bootstrap samples of 30, 60, 90, 120, 180, or 360 patients from this data set. For each resample, we recorded effect size estimates and p values after conducting exactly the same lesion-deficit analysis within the previously identified region of interest, holding all procedures constant. The results show (1) how often small effects in a heterogeneous population fail to be detected; (2) how effect size and its statistical significance vary with sample size; (3) how low-powered studies (due to small sample sizes) can greatly over-estimate as well as under-estimate effect sizes; and (4) how large sample sizes (N ≥ 90) can yield highly significant p values even when effect sizes are so small that they become trivial in practical terms. The implications of these findings for interpreting the results of univariate voxel-based lesion-deficit analyses are discussed.


Introduction
There is a great deal of evidence showing how both false positive and false negative results increase as sample size decreases (Bakker et al., 2012; Button et al., 2013a; Chen et al., 2018; Cremers et al., 2017; Ingre, 2013; Ioannidis, 2008) and how inadequate statistical power can lead to replication failures (Anderson et al., 2017; Bakker et al., 2012; Perugini et al., 2014; Simonsohn et al., 2014a; Szucs and Ioannidis, 2017). However, the impact of sample size on false negative and false positive rates has never been quantified in mass-univariate voxel-based lesion-deficit mapping (e.g., voxel-based lesion-symptom mapping and voxel-based morphometry). Using data from a large sample of stroke patients, we first estimated the magnitude of a lesion-deficit mapping of interest and then formally investigated how effect size and its statistical significance vary with sample size. In addition to demonstrating how small samples can result in over- and under-estimations of effect size, we also highlight an issue with large sample sizes whereby high statistical power dramatically increases the likelihood of detecting effects that are so small that they become uninteresting from a scientific viewpoint (i.e. the fallacy of classical inference; Friston et al., 2012). In other words, statistically significant findings when sample sizes are large can hide the fact that the effect under investigation might be of little importance in practical terms, or, even worse, the result of random chance alone and thereby a false positive (Smith and Nichols, 2018).

To investigate the effect of sample size on the results of univariate voxel-based lesion-deficit mapping, we randomly drew thousands of resamples (with a range of sample sizes) from a set of data from 360 stroke survivors who had collectively acquired a wide range of left hemisphere lesions and cognitive impairments. By using a single patient population and holding all procedures and analyses constant, we ensured that variability in the results across thousands of random resamples cannot be explained by methodological confounds, such as the use of dissimilar recruitment strategies and/or behavioural assessments, that are likely to influence the findings of studies that aggregate data from multiple independent sources (e.g., meta-analyses; Müller et al., 2018). Furthermore, by performing our statistical analyses on actual data, rather than running simulations on […]

[…] in size). See Table 1 for demographic and clinical details of the full sample of 360 stroke patients.

Behavioural assessment
All patients recruited to the PLORAS database are assessed on the Comprehensive Aphasia Test (CAT) (Swinburn et al., 2004). The CAT is a fully standardised test battery, which consists of a total of 27 different tasks. For ease of comparison across tasks, the authors of the CAT encourage the conversion (through a non-linear transformation) of raw scores into T-scores, which represent how well the patient performed relative to a reference population of 113 patients with aphasia, 56 of whom were tested more than once. For example, a T-score of 50 indicates the mean of the patient sample used to standardise the CAT, whereas a T-score of 60 represents one standard deviation above the mean. Most people without post-stroke aphasia would therefore be expected to score above the average of the patient standardisation sample on any given task from the CAT. The threshold for impairment is defined relative to a second reference population of 27 neurologically-normal controls. Specifically, it is the point below which the score would place the patient in the bottom 5% of the control population (Swinburn et al., 2004). Lower scores indicate poorer performance. Importantly, the two standardisation samples referred to above (i.e. 113 patients with aphasia and 27 neurologically-normal controls) are completely independent of the data we report in the current paper (for more details on the standardisation samples, see Swinburn et al., 2004).

As stated in the CAT manual (p. 71), the main advantages of converting raw scores into T-scores are that this allows: (i) scores from different tasks to be compared because they have been put on a common scale; and (ii) the use of parametric statistics, given that T-scores are normally distributed with a mean of 50 and a standard deviation of 10.

The current study focused exclusively on five tasks from the CAT. Task 1 used nonword repetition to assess the patient's ability to articulate speech. Task 2 used written picture naming to test the patient's ability to find the names of objects (lexical/phonological retrieval). Tasks 3-5 tested the patient's ability to recognise, process and remember the semantic content of pictures and auditory words. Task details were as follows:

Task 1: The CAT nonword repetition (Rep-N) task aurally presents five nonsense words (e.g., gart), one at a time, with instructions to repeat them aloud. Immediate correct responses were given a score of 2; incorrect responses were given a score of 0; correct responses after a self-correction or a delay (> 5 seconds) were given a score of 1. Articulatory errors (e.g., dysarthric distortions) not affecting the perceptual identity of the target were scored as correct. Verbal, phonemic, neologistic and apraxic errors were scored as incorrect. T-scores equal to or below 51 constitute the impaired range.

Task 2: The CAT written picture naming (Writt-PN) task visually presents five pictures of objects (e.g., tank), one at a time, with instructions to write their names down. Letters in the correct position were given a score of 1 each. Substitutions, omissions and transpositions were given a score of 0. One point was deducted from the total score if one or more letters were added to the target word. T-scores equal to or below 54 constitute the impaired range.
Task 3: The CAT semantic associations (Sem-A) task visually presents five pictures of objects simultaneously. The instructions were to match the picture at the centre (e.g., mitten) with one of four possible alternatives according to the strongest semantic association (e.g., hand, sock, jersey, and lighthouse). The inclusion of a semantically related distractor (e.g., sock) encouraged deeper levels of semantic processing/control. There are a total of ten test trials plus a practice one at the beginning. Correct responses were given a score of 1; incorrect responses were given a score of 0. T-scores equal to or below 47 constitute the impaired range.

Task 4: The CAT recognition memory (Recog-M) task visually presents each of the ten central items from the CAT semantic associations task (one at a time) along with three unrelated distractors. The instructions were to indicate which of the four pictures on display had been seen before. There are a total of ten test trials plus a practice one at the beginning. The scoring system for this task was identical to that used in the semantic associations task. T-scores equal to or below 43 constitute the impaired range.

Task 5: The CAT auditory word-to-picture matching (AW-P) task involves hearing a word produced by the examiner and selecting the picture among four possible alternatives that best matches the meaning of the heard word. There are a total of fifteen test trials plus a practice one at the beginning. Immediate correct responses were given a score of 2; incorrect responses were given a score of 0; correct responses after a self-correction or a delay (> 5 seconds) were given a score of 1. […]

[…] (Deichmann et al., 2004).

The T1-weighted anatomical whole-brain volume of each patient was subsequently analysed with our automated lesion identification toolbox using default parameters (for more details, see Seghier et al., 2008). This converts a scanner-sensitive raw image into a quantitative assessment of structural abnormality that should be independent of the scanner used. The procedure combines a modified segmentation-normalisation routine with an outlier detection algorithm according to the fuzzy logic clustering principle (for more details, see Seghier et al., 2007). The outlier detection algorithm assumes that a lesioned brain is an outlier in relation to normal (control) brains. The output includes two 3D lesion images in standard MNI space, generated at a spatial resolution of 2 × 2 × 2 mm³. The first is a fuzzy lesion image that encodes the degree of structural abnormality on a continuous scale from 0 (completely normal) to 1 (completely abnormal) at each given voxel relative to normative data drawn from a sample of 64 neurologically-normal controls. A voxel with a high degree of abnormality (i.e. a value near to 1 in the fuzzy lesion image) therefore means that its intensity in the segmented grey and white matter deviated markedly from the normal range. The second is a binary lesion image, which is simply a thresholded (i.e. lesion/no lesion) version of the fuzzy lesion image. All our statistical analyses were based on the fuzzy images. The binary images were used to delineate the lesions, to estimate lesion size and to create lesion overlap maps.
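For illustration, the following is a minimal Python/nibabel sketch of how a fuzzy lesion image of the kind described above could be binarised and summarised. The threshold value, file paths and function names are our own assumptions for this sketch, not the toolbox's actual defaults.

# A minimal sketch, not the automated lesion identification toolbox itself.
# The 0.3 cut-off and the file names are illustrative assumptions.
import numpy as np
import nibabel as nib

THRESHOLD = 0.3  # assumed cut-off separating "lesion" from "no lesion"

def binarise_fuzzy_lesion(fuzzy_path, out_path, threshold=THRESHOLD):
    """Threshold a fuzzy (0-1) lesion image into a binary lesion image."""
    img = nib.load(fuzzy_path)
    fuzzy = img.get_fdata()
    binary = (fuzzy >= threshold).astype(np.uint8)
    nib.save(nib.Nifti1Image(binary, img.affine), out_path)
    return binary

def lesion_size_cm3(binary, voxel_dims_mm=(2.0, 2.0, 2.0)):
    """Lesion volume in cm^3; a 2 x 2 x 2 mm voxel occupies 8 mm^3."""
    voxel_vol_mm3 = float(np.prod(voxel_dims_mm))
    return binary.sum() * voxel_vol_mm3 / 1000.0

def lesion_overlap_map(binary_images):
    """Voxel-wise count of patients with damage; thresholding this count
    (e.g., at five patients) gives an inclusive analysis mask of the kind
    described in the lesion-deficit analyses below."""
    return np.sum(np.stack(binary_images), axis=0)

This division of labour mirrors the one described above: the fuzzy image carries the continuous signal used for statistics, while the binary image serves only descriptive purposes (lesion delineation, size and overlap).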

Lesion-deficit analyses
We used voxel-based morphometry (Ashburner and Friston, 2000; Mechelli et al., 2005) to assess lesion-deficit relationships (Mummery et al., 2000; Tyler et al., 2005), performed in SPM12 using the general linear model. The imaging data entered into the voxel-based analysis were the fuzzy (continuous) lesion images produced by our automated lesion identification toolbox.

The most important advantage of using the fuzzy lesion images (as in Price et al., 2010) over alternative methods is that they provide a quantitative measure of the degree of structural abnormality, at each and every voxel of the brain, relative to neurologically-normal controls. In contrast to fuzzy lesion images, (i) binary lesion images do not provide a continuous measure of structural abnormality and will be less sensitive to subtle changes that are below an arbitrary threshold for damage (e.g., Fridriksson et al., 2013; Gajardo-Vidal et al., 2018); (ii) normalised T1 images do not distinguish between typical and atypical (abnormal) variability in brain structure (e.g., Stamatakis and Tyler, 2005); and (iii) segmented grey or white matter probability images, when used in isolation (as in standard VBM routines), do not provide a complete account of the whole of the lesion (e.g., Mehta et al., 2003).

In Analysis 1, the fuzzy lesion images were entered into a voxel-based multiple regression model with 6 different regressors (5 behavioural scores and lesion size); see Fig. 1. The regressor of interest was nonword repetition scores, which are sensitive to difficulties articulating speech. In addition, the following regressors were included to factor out other sources of variance: written picture naming scores (which are sensitive to name retrieval abilities), semantic associations scores (which are sensitive to visual recognition and semantic processing), auditory word-to-picture matching scores (which are sensitive to auditory recognition and lexical-semantic processing), recognition memory scores (which are sensitive to picture recognition and memory) and lesion size (to partial out linear effects of lesion size). For the voxel-based lesion-deficit analysis (with 360 patients), the search volume was restricted to voxels that were damaged in at least five patients (as in Fridriksson et al., 2016; for rationale, see Sperber and Karnath, 2017). For this purpose, a lesion overlap map based on the binary lesion images from all 360 patients was created, thresholded at five, and used as an inclusive mask before estimating the model (see Fig. 2A). Our statistical voxel-level threshold was set at p < 0.05 after family-wise error (FWE) correction for multiple comparisons (using random field theory as implemented in SPM; Flandin and Friston, 2015) across the whole search volume (for alternative approaches, see Mirman et al., 2018).

Having identified a significant lesion-deficit mapping, we quantified the strength of the association between lesion and deficit by: (i) extracting the raw signal (which indexes the degree of structural abnormality) from each statistically significant voxel; (ii) averaging the signal across voxels (i.e. a single value per patient); and, finally, (iii) computing the partial correlation between lesion load in the region of interest and nonword repetition scores, after adjusting for the effect of the covariates of no interest (i.e. 4 behavioural scores and lesion size). Our measure of effect size was the proportion of variance (R²) in nonword repetition scores explained uniquely by lesion load in the region of interest (i.e. the best estimate of the true population effect that we have).

In Analysis 2, we investigated how sample size affected the reproducibility of the lesion-deficit mapping within the region of interest identified in Analysis 1. Specifically, we generated 6000 bootstrap samples of each of the following sizes: 360, 180, 120, 90, 60 and 30 (i.e. 36000 resamples in total). These sample sizes were selected to follow as closely as possible those observed in the vast majority of published voxel-based lesion-deficit mapping studies (e.g., Dressing et al., 2018; Fridriksson et al., 2013, 2016; Halai et al., 2017; Schwartz et al., 2011, 2012). For each iteration of the resampling procedure, individuals were drawn randomly from the full set of 360 patients with replacement, meaning that the probability of being chosen remained constant throughout the selection process (i.e. the procedure satisfied the Markovian, memory-less, property). For each bootstrap sample, the partial correlation between nonword repetition scores and lesion load (averaged across voxels in the region of interest from Analysis 1) was computed. The resulting R² and p values were recorded, after regressing out the variance accounted for by the covariates of no interest. Of note, when we re-ran the resampling procedure outlined above with the replacement feature disabled (i.e. sampling without replacement), virtually the same results were obtained (for more details, see Supplementary Material).
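To make the resampling procedure concrete, here is a minimal sketch of the Analysis 2 logic in Python (the study itself was run in SPM12/Matlab). The variable names (lesion_load, repetition, covariates), the least-squares residualisation, and the t test for the partial correlation (df = n - k - 2 with k covariates) are illustrative assumptions, not the exact implementation used.

# A minimal sketch of the Analysis 2 resampling logic under the assumptions
# stated above. Inputs are numpy arrays: lesion_load (n,), repetition (n,),
# covariates (n x 5: four behavioural scores plus lesion size).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def partial_r2(x, y, covariates):
    """Proportion of variance in y uniquely explained by x after
    regressing the covariates out of both variables."""
    Z = np.column_stack([np.ones(len(x)), covariates])
    x_res = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r, _ = stats.pearsonr(x_res, y_res)
    df = len(x) - covariates.shape[1] - 2  # df for a partial correlation
    t = r * np.sqrt(df / (1.0 - r**2))
    p = 2.0 * stats.t.sf(abs(t), df)
    return r**2, p

def bootstrap_effects(lesion_load, repetition, covariates,
                      sizes=(360, 180, 120, 90, 60, 30), n_boot=6000):
    """Draw n_boot resamples (with replacement) per sample size and
    record the partial R^2 and p value for each, as in Analysis 2."""
    n = len(lesion_load)
    results = {}
    for size in sizes:
        r2s, ps = [], []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size)  # sampling with replacement
            r2, p = partial_r2(lesion_load[idx], repetition[idx],
                               covariates[idx])
            r2s.append(r2)
            ps.append(p)
        results[size] = (np.array(r2s), np.array(ps))
    return results

Calling bootstrap_effects(...) would then return, for each sample size, the distributions of R² and p values of the kind summarised in the Results below.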
In addition, to rule out the possibility that variability in the results could simply be explained by differences in the distribution of damage across the brain, we quantified statistical power in the region of interest from Analysis 1 for a representative subset of bootstrap samples. Specifically, only those resamples that produced an R² value which fell exactly at a particular decile (i.e. 0th, 10th, 20th…100th) of the distribution of effect sizes were considered. This resulted in the selection of a total of 66 bootstrap samples (i.e. 11 for each sample size); see Table 2. Critically, our power calculations show where in the brain there was sufficient statistical power to detect a significant lesion-deficit association at a threshold of p < 0.05 after correction for multiple comparisons. The statistical power maps were generated using the "nii_powermap" function of NiiStat (https://www.nitrc.org/projects/niistat/), which is a set of Matlab scripts for analysing neuroimaging data from clinical populations.

Importantly, we have chosen to assess in-sample effect sizes, i.e. without validating in a separate data set (Friston, 2012). In this context, the effect size provides an estimate of the strength of the particular effect identified by our analysis in our data. It may be that an out-of-sample prediction, on new data, would indicate a smaller effect size. However, this would not invalidate the logic of our reasoning, particularly since the essential point we are making here is that our effect size estimate (i.e. approximately 11% in R² terms) is very small. If there is inflation in this estimate, it could only mean that the out-of-sample effect size would be even smaller. Therefore, we have been able to show that even for a potentially over-estimated effect size, there are serious problems that arise from small sample sizes, the fallacy of classical inference, and publication bias. The impact of these issues on the reliability of the findings would only be worse if the effect size were to come down.

Furthermore, we first statistically selected an ROI in a large sample of patients, with a "left-hemisphere" analysis, and then used smaller and smaller bootstrap samples that focused on the identified ROI. In this sense, we are performing (non-orthogonal) statistical tests in a previously selected ROI, which could potentially inflate false positive rates (Brooks et al., 2017). Consequently, the results derived from the analysis of smaller samples should not be taken as robust findings: they are presented to make important methodological points. Our best statistical estimates of the effect considered are those obtained from the full data set.
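The NiiStat power maps above are computed voxel-wise; what a single such power value means can be sketched with a generic Monte Carlo estimate. This is not the nii_powermap implementation, and all parameter values below are illustrative assumptions.

# A generic Monte Carlo sketch of statistical power: the fraction of
# simulated samples of size n in which a true effect of a given magnitude
# reaches the chosen alpha. Not the NiiStat "nii_powermap" routine.
import numpy as np
from scipy import stats

def simulated_power(r2_true=0.11, n=90, alpha=0.05, n_sims=5000, seed=0):
    """Estimate power to detect a correlation of sqrt(r2_true) at
    sample size n by simulating bivariate normal data."""
    rng = np.random.default_rng(seed)
    rho = np.sqrt(r2_true)
    cov = [[1.0, rho], [rho, 1.0]]
    hits = 0
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        _, p = stats.pearsonr(x, y)
        hits += p < alpha
    return hits / n_sims

# e.g. simulated_power(0.11, 30) is far below simulated_power(0.11, 180)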

Results

Analysis 1: identifying a region of interest
Poorer speech articulation was significantly associated with greater lesion load (after controlling for written picture naming, recognition memory, semantic associations and auditory word-to-picture matching scores in addition to lesion size) in 549 voxels (= 4.4 cm³; see Table 3). These voxels became our region of interest (ROI) for all subsequent analyses. They were located in parts of the left ventral primary motor and somatosensory cortices (i.e. tongue, larynx, head and face regions), anterior supramarginal gyrus, posterior insula and surrounding white matter (see Fig. 2B).

This highly significant lesion-deficit relationship accounted for 11% of the variance (95% credible interval calculated using a flat prior: 0.06-0.18; Morey et al., 2016); see Fig. 3. In the following analyses, we ask how sample size affects the reproducibility of the identified effect.
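The reported interval is a Bayesian credible interval computed with a flat prior (Morey et al., 2016). As a rough nonparametric stand-in, not the method used here, an interval for the ROI effect size could be approximated with a bootstrap percentile approach, reusing the hypothetical partial_r2 helper from the Methods sketch:

# A bootstrap percentile interval for R^2; an approximation offered only
# for illustration, not the flat-prior credible interval the study reports.
# Assumes partial_r2 and the input arrays from the earlier sketch.
import numpy as np

def bootstrap_interval(lesion_load, repetition, covariates,
                       n_boot=10000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n = len(lesion_load)
    r2s = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the full data set
        r2s[b], _ = partial_r2(lesion_load[idx], repetition[idx],
                               covariates[idx])
    lo, hi = np.percentile(r2s, [(1 - level) / 2 * 100,
                                 (1 + level) / 2 * 100])
    return lo, hi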

Analysis 2: effect size variability and replicability
Although the mean/median effect sizes were similar across sample sizes, the mean/median p values changed considerably with sample size (see Fig. 4), because there was wide sample-to-sample variability in the extent to which the original effect was replicated. For instance, less than 40% of the random resamples where N = 30 generated significant p values, while this rose to virtually 100% for the resampled data sets where N ≥ 180. Overall, R² values ranged between 0.00 and 0.79, whereas p values ranged between 6 × 10⁻²⁷ and 1 (see Fig. 5A and B). Additionally, our analyses showed that, as sample size increased, R² values tended to fall closer to the mean of the effect size distribution, although a not inconsiderable degree of uncertainty regarding R² estimation remained (even for N = 180 and 360). In other words, the dispersion of the R² values tended to be larger with smaller sample sizes (see Fig. 5A), resulting in less precision in the estimation of the magnitude of the true population effect.
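As a small illustration of this dispersion point, the bootstrap output could be summarised as follows; `results` is the hypothetical dictionary returned by bootstrap_effects in the Methods sketch.

# Spread of the bootstrap R^2 distribution by sample size: the 90% range
# narrows as N grows, mirroring the precision gain described above.
import numpy as np

for size, (r2s, _) in sorted(results.items()):
    q05, q50, q95 = np.percentile(r2s, [5, 50, 95])
    print(f"N={size:3d}  median R2={q50:.2f}  "
          f"90% range=[{q05:.2f}, {q95:.2f}]")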

Low-powered resamples can inflate effect sizes
Since studies that obtain statistically non-significant results (i.e. typically p ≥ 0.05) are hardly ever published (also known as the file drawer problem or study publication bias), we focused directly on the resampled data sets that produced significant p values. For N = 30, the mean and median effect sizes of these significant resamples (i.e. roughly 37% of the total) were 0.26 and 0.24 (range = 0.16-0.79). Conversely, the mean and median effect sizes for the N = 30 resamples where the lesion-deficit mapping did not reach statistical significance (roughly 63%) were 0.07 and 0.06 (range = 0.00-0.16); see Table 4 for similar findings when N = 60. Critically, using a more stringent statistical threshold would only aggravate the problem (for more details, see Table 4). With larger sample sizes (N ≥ 90), however, effect size inflation is counteracted because both over- and under-estimations of the true effect size surpassed the threshold for statistical significance, resulting in relatively accurate mean estimates (0.13, 0.12, 0.12, and 0.11, respectively).
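The selection effect described above can be made explicit with a short summary over the same hypothetical `results` dictionary: conditioning on significance inflates the mean R², and a stricter alpha inflates it further.

# Mean R^2 conditional on (non-)significance, per sample size. Comparing
# alpha=0.05 with alpha=0.001 shows the inflation worsening under a
# stricter threshold, as reported above.
def significant_summary(results, alpha=0.05):
    for size, (r2s, ps) in sorted(results.items()):
        sig = ps < alpha
        mean_sig = r2s[sig].mean() if sig.any() else float("nan")
        mean_ns = r2s[~sig].mean() if (~sig).any() else float("nan")
        print(f"N={size:3d}  significant: {sig.mean():6.1%}  "
              f"mean R2 if significant: {mean_sig:.2f}  "
              f"if not: {mean_ns:.2f}")

# significant_summary(results, alpha=0.05)
# significant_summary(results, alpha=0.001)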

High-powered resamples are sensitive to trivial/small effects
The frequency with which a significant association was observed between lesion load in the ROI and nonword repetition scores increased dramatically with sample size. For example, whereas roughly 37% of the effects for N = 30 would typically be regarded as statistically significant (i.e. p < 0.05), more than 85% of the lesion-deficit mappings for N ≥ 90 generated equally low or even lower p values (see Table 4). More importantly, effects as small as 0.05 in R² terms (i.e. that only accounted for 5% of the variance) reached statistical significance for N = 90; and this phenomenon was even more pronounced with larger sample sizes: 0.02 for N = 180 (see Table 4 and Fig. 5A). Reporting point and interval estimates of effect sizes is therefore essential for assessing the importance or triviality of the identified lesion-deficit mapping, which is particularly relevant when the study uses large sample sizes.
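As a worked illustration of this point, the p value implied by a given partial R² at a given N can be computed directly, assuming the standard t test for a partial correlation with k = 5 covariates (as in Analysis 1); the helper below is ours, not part of the study's pipeline.

# p value implied by a partial R^2 at sample size n with k covariates,
# assuming t = sqrt(R^2/(1-R^2) * df) with df = n - k - 2.
from scipy import stats

def p_from_r2(r2, n, k=5):
    df = n - k - 2
    t = (r2 / (1 - r2) * df) ** 0.5
    return 2 * stats.t.sf(t, df)

print(p_from_r2(0.05, 30))  # ~0.28: a trivial effect is undetectable
print(p_from_r2(0.05, 90))  # ~0.04: the same effect is now "significant"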

Discussion
The goal of this study was to examine how sample size influences the reproducibility of voxel-based lesion-deficit mappings. First, we identified a significant lesion-deficit association and estimated its magnitude using data from a very large sample of 360 patients who were all right-handed, English-speaking stroke survivors with unilateral left hemisphere damage. By repeating the same analysis on thousands of bootstrap samples of different sizes, we illustrate how the estimated effect size, and its statistical significance, varied across replications. This allowed us to index the degree of uncertainty in the estimation of the true population effect as a function of sample size. As expected, effect sizes were more likely to be over-estimated or under-estimated with small sample sizes (i.e. variability in the results increased as sample size decreased). Conversely, we demonstrate how highly significant lesion-deficit mappings can be driven by a negligible proportion of the variance when the sample size is very large.

Estimating the true effect size
The first part of our investigation identified a region of interest (ROI) where damage was reliably associated with impairments in speech articulation. We then calculated what proportion of the variance in nonword repetition scores could be accounted for by the degree of damage to the identified region, after factoring out confounds from auditory and visual perception, speech recognition, lexical/semantic processing and word retrieval abilities. The ROI included anatomical brain structures that have been associated with speech production in many previous lesion studies. These include the insula (Ogar et al., 2006), the precentral gyrus, the postcentral gyrus, the supramarginal gyrus and surrounding white matter (Baldo et al., 2011; Basilakos et al., 2015). It did not involve the inferior frontal gyrus/frontal operculum reported in Hillis et al. (2004) and Baldo et al. (2011), even though our full sample included many patients with damage to these regions (see Fig. 2A). We do not attempt here to adjudicate whether this discrepancy was a consequence of a false negative in our study or a false positive in prior studies. Our focus was on how well the identified lesion-deficit mapping could be replicated across thousands of bootstrap samples drawn randomly from the original data set of 360 patients. For each resample, we estimated how much of the variance in nonword repetition scores could be accounted for by lesion load in the ROI (after adjusting for the effect of the covariates of no interest). These effect sizes and their statistical significance were then compared to our best estimate of the "true" population effect size, which was found (from our full sample of 360 patients) to be 11%.

Variability in the estimated effect size and its statistical significance
The second part of our investigation showed that the probability of finding a significant lesion-deficit association in the ROI from the first analysis (with 360 participants) depended on the size of the sample. For larger samples (N ≥ 180), the effect of interest was detected in virtually 100% of resamples, whereas for smaller samples (N = 30) it was detected in less than 40% of resamples (see Table 4). We can also show that p values decrease as N increases, even when effect sizes are equated (see Fig. 4 and the 50th percentile in Table 2). This observation is in line with prior reports that p values exhibit wide sample-to-sample variability (Cumming, 2008; Halsey et al., 2015; Vsevolozhskaya et al., 2017), particularly in the presence of small sample sizes (Hentschke and Stüttgen, 2011).

When considering the central tendency of effect size estimates, the difference between larger and smaller resamples is dramatically reduced compared to that seen for p values (see mean/median effect sizes in Fig. 4). Nevertheless, even if p values were completely abandoned (e.g., Trafimow and Marks, 2015), there is still a great deal of uncertainty in the accuracy with which effect sizes can be estimated when small samples are used. This highlights the importance of reaching a better balance between null-hypothesis significance testing and effect size estimation (Chen et al., 2017; Cumming, 2014; Morey et al., 2014). Indeed, p values only indicate the probability of observing an effect at least as extreme as the one obtained, assuming the null hypothesis is true. As such, they cannot convey the same information provided by point and interval estimates of effect sizes (Steward, 2016; Wasserstein and Lazar, 2016), particularly since the relationship between p values and effect sizes is non-linear (Hentschke and Stüttgen, 2011; Simonsohn et al., 2014a, 2014b).

There are several potential reasons why the magnitude and statistical significance of the same effect vary so markedly across resamples. For example, high sample-to-sample variability could reflect (i) sampling error due to heterogeneity in the lesion-deficit association across participants (Button, 2016; Stanley and Spence, 2014), (ii) outliers that are confounding the effects (Rousselet and Pernet, 2012) or (iii) measurement error (Button, 2016; Loken and Gelman, 2017; Stanley and Spence, 2014). In this context, the field needs to adopt informed sampling strategies that ensure representative samples and maximise the probability of identifying generalizable lesion-deficit mappings (Falk et al., 2013; LeWinn et al., 2017; Paus, 2010).
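As a numeric illustration, the hypothetical p_from_r2 helper from the Results section above reproduces this pattern when the effect size is pinned at the full-sample estimate (R² = 0.11):

# p falls monotonically with N while R^2 is held fixed at 0.11 throughout:
# roughly 0.1 at N = 30 down to roughly 1e-10 at N = 360 under the assumed
# partial-correlation test (k = 5 covariates).
for n in (30, 60, 90, 120, 180, 360):
    print(f"N={n:3d}  p={p_from_r2(0.11, n):.2g}")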

Unreliable effect sizes in smaller samples
High variance in the results of our lesion-deficit mappings with smaller samples (N = 30 and 60) demonstrates how effects can be over- as well as under-estimated (e.g., Cremers et al., 2017; Ioannidis, 2008). Indeed, we show that 85% of all significant random data sets for N = 30 yielded effect size estimates that were larger than the upper bound of the credible interval (see Table 5). This is consistent with prior observations that low-powered studies (with small sample sizes) can only consistently detect large deviations from the true population effect (Szucs and Ioannidis, 2017). Put another way, even when effect sizes are accurately estimated from small samples, they are unlikely to attain statistical significance, particularly when the magnitude of the effect under investigation is small or medium. In our data, for example, we found that more than half the analyses with N = 30 that did not reach statistical significance produced effect sizes that fell within the credible interval (i.e. accurate estimations of effect sizes resulted in false negatives). Even worse, analyses of small sample sizes can invert the direction of the effect (Gelman and Carlin, 2014), as seen in our data where we found that 5% of all results for N = 30 were in the wrong direction. Reporting such findings as if they were accurate representations of reality would lead to misleading conclusions (Nissen et al., 2016).

Critically, the problem was not solved but became worse when we adopted a more stringent statistical threshold, contrary to what has been proposed by Johnson (2013) and Benjamin et al. (2018). For example, if we were to tighten the statistical threshold from p < 0.05 to p < 0.001 for the N = 30 resamples, the statistically significant effect sizes would range from 38% to 79% of the variance (compared to 11% in the full sample of 360 patients). Increasing sample size, however, does improve accuracy, with less than 10% of significant p values associated with inflated effect sizes when N ≥ 180 (see Table 5).

Given that results are more likely to be published if they reach statistical significance than if they do not (i.e. the file drawer problem or study publication bias), our findings highlight three important implications for future lesion-deficit mapping studies. First, low-powered studies (due to small sample sizes) could lead a whole research field to over-estimate the magnitude of the true population effect. Second, power calculations based on inflated effect sizes from studies with small samples will inevitably over-estimate the statistical power associated with small sample sizes (Anderson et al., 2017). Third, although the mean effect size measured over many studies with small sample sizes will eventually converge on the true effect size, in reality the same study is seldom replicated exactly and null results are only rarely reported. It has therefore been advocated that, contrary to current practices, it is better to carry out a few well-designed, high-powered studies than it is to assimilate the results from multiple low-powered studies (Bakker et al., 2012; Higginson and Munafò, 2016). In brief, large-scale studies increase the probability that an identified lesion-deficit mapping is correct (Button et al., 2013a; Szucs and Ioannidis, 2017).

Trivial effect sizes in larger samples
Another important observation from the current study is that, when samples are sufficiently large, relatively weak lesion-deficit associations can be deemed statistically significant (i.e. p < 0.05). For instance, effects that accounted for as little as 3% of the variance reached statistical significance when N ≥ 120, an inferential problem known as the fallacy of classical inference (Friston, 2012; Smith and Nichols, 2018). However, our findings are consistent with the view that this issue can be addressed by reporting point and interval estimates of effect sizes (Button et al., 2013b; Lindquist et al., 2013), which allow one to assess the practical significance (as opposed to statistical significance only) of the results. In other words, it can be argued that the fallacy of classical inference is specific to statistical tests (e.g., t, F and/or p values), leaving effect sizes largely unaffected (Reddan et al., 2017). Furthermore, there are two important advantages of conducting high-powered studies: (i) they greatly attenuate the impact of study publication bias because both over- and under-estimations of the true effect size will surpass the threshold for statistical significance; and (ii) the precision with which the magnitude of the true population effect can be estimated is substantially improved (Lakens and Evers, 2014; see Table 5 and Figs. 4 and 5A). Our study also indicates that, even with sample sizes as large as N = 360, a not inconsiderable degree of uncertainty in R² estimation remained, which suggests that increasing sample size beyond this N will continue to bring benefits.

Study limitations
The focus of the current paper has been on establishing the degree to which the replicability of lesion-deficit mappings is influenced by sample size. To illustrate our points, we have (i) searched for brain regions where damage is significantly related to impairments in articulating speech; (ii) estimated the strength of the identified lesion-deficit association; and (iii) run the exact same analysis on thousands of samples of varying size. However, we have not attempted to account for all possible sources of inconsistencies in univariate voxel-based lesion-deficit mapping. Nor have we investigated how our results would change if we selected another function of interest (e.g., word retrieval or phonological processing). Indeed, it has already been pointed out that higher-order functions might be associated with smaller effects than lower-level ones (Poldrack et al., 2017; Yarkoni, 2009).

We also acknowledge that there are many different ways of conducting voxel-based lesion-deficit analyses (for more information, see de Haan and Karnath, 2018; Rorden et al., 2007; Sperber and Karnath, 2018). We have selected one approach, using mass-univariate multiple regression on continuous measures of structural abnormality, behaviour and lesion size. However, we could have used other types of images or other behavioural regressors. For example, several recent studies have adopted dimensionality reduction techniques, such as principal component analysis (PCA), to transform a group of correlated behavioural measures into a smaller number of orthogonal (uncorrelated) factors (e.g., Butler et al., 2014; Corbetta et al., 2015; Mirman et al., 2015a). This PCA approach has made an important contribution to finding coarse-grained explanatory variables (e.g., Halai et al., 2017; Lacey et al., 2017; Mirman et al., 2015b; Ramsey et al., 2017), but some of its limitations are that it: (i) involves an arbitrary criterion for factor extraction; (ii) ignores unexplained variance when selecting a limited number of components; and (iii) necessitates subjective, a posteriori, interpretation as to what the components might mean based on the factor loadings, which is not typically clear cut. Instead, we propose that a better solution for tackling orthogonality issues is to adopt both a rigorous sampling strategy and behavioural measures that offer an optimal sensitivity-specificity balance.

Finally, we have highlighted that the reliance on small samples of patients in the presence of publication bias can undermine the inferential power of univariate voxel-based lesion-deficit analyses. However, we have not attempted to provide guidance on how prospective power calculations that correct for the various forms of bias present in scientific publications can be conducted. Nor have we illustrated how the presence of publication and other reporting biases in the lesion-deficit mapping literature, specifically, can be ascertained.
The reason is simply that others have already devoted considerable effort to developing tools that identify and deal with problems such as: (i) the excess of statistically significant findings (e.g., Ioannidis and Trikalinos, 2007); (ii) the proportion of false positives (e.g., Gronau et al., 2017); (iii) the presence of publication bias and questionable research practices (e.g., Du et al., 2017; Simonsohn et al., 2014a, 2014b); (iv) errors in the estimation of the direction and/or magnitude of a given effect (e.g., Gelman and Carlin, 2014); and (v) sample size calculations that take into account the impact of publication bias and uncertainty on the estimation of reported effect sizes (e.g., Anderson et al., 2017). With respect to statistical power, the situation is further complicated by the fact that, in the context of univariate voxel-based lesion-deficit mapping, it depends not only on the size of the sample, the magnitude of the effect under study and the statistical threshold used (Cremers et al., 2017), but also on the distribution of damage across the brain, which is non-uniform (Inoue et al., 2014; Kimberg et al., 2007; Mah et al., 2014; Sperber and Karnath, 2017). More research on the topic will be required before prospective power calculations can be fully trusted. Until then, the recruitment of representative patient samples in combination with high-powered designs seems to be the best available solution to the issues discussed here.

Interpreting voxel-based lesion-deficit mappings
The strength of the lesion-deficit association that we identified in a large sample of 360 patients illustrates that the majority of the variability in speech articulation abilities was driven by factors other than the degree of damage to the ROI. A clear implication of this is that the field of lesion-deficit mapping still has a long way to go before it can inform current clinical practice, which is arguably one of its most important goals. Future studies will need to control and understand other known sources of variance (apart from lesion site and size), such as time post-stroke, age and education, in order to improve our ability to predict language outcome and recovery after stroke at the individual patient level (Price et al., 2017). Furthermore, to map all the possible ways in which brain damage can affect behaviour, it will in all likelihood be necessary to use increasingly larger samples of patients (e.g., Price et al., 2010; Seghier et al., 2016) and multivariate methods (e.g., Hope et al., 2015; Pustina et al., 2018; Yourganov et al., 2016; Zhang et al., 2014).

Conclusions
This study investigated the impact of sample size on the reproducibility of voxel-based lesion-deficit mappings. We showed that: (i) highly significant lesion-deficit associations can be driven by a relatively small proportion of the variance; (ii) the exact same lesion-deficit mapping can vary widely from sample to sample, even when analyses and behavioural assessments are held constant; (iii) the combination of publication bias and low statistical power can severely affect the reliability of voxel-based lesion-deficit mappings; and, finally, (iv) reporting effect size estimates is essential for assessing the importance or triviality of statistically significant findings. Solutions to the issues highlighted here will, in our view, likely involve the use of: (a) improved reporting standards; (b) increasingly larger samples of patients; (c) multivariate methods; (d) informed sampling strategies; and (e) independent replications. Careful reflection on some deeply-rooted research practices, such as biases in favour of statistically significant findings and against null results, might also be necessary.

Table 2 (caption; table body not recovered). The table shows that in all but one case, more than 80% of the voxels comprising the region of interest from Analysis 1 had sufficient statistical power to detect a significant lesion-deficit association at a threshold of p < 0.05 after correction for multiple comparisons. %tile = percentile of the effect size (R²) distribution; Power = percentage of voxels within the region of interest from Analysis 1 that had sufficient statistical power to detect a significant lesion-deficit association at a statistical threshold of p < 0.05 after correction for multiple comparisons; R² = R² value (at a particular decile); P = p value (at a particular decile).

Table 4 (caption; table body not recovered). For each summary statistic, the upper row indicates the corresponding value when the alpha threshold was set at 0.05, whereas the lower row indicates the corresponding value when the alpha threshold was set at 0.001. Count = the number of resampled data sets that generated significant or non-significant R² values; s = significant (i.e. p < α); ns = not significant (i.e. p ≥ α); M = mean R² value; Mdn = median R² value; Min = minimum R² value; Max = maximum R² value.

Table 5 (caption; table body not recovered). The table shows, for each sample size, the frequency with which effect size estimates reached statistical significance (i.e. p < 0.05) and fell within (=) or outside the 95% credible interval (i.e. 0.06-0.18) of the best estimate of the "true" population effect (i.e. R² = 0.11). 95% CI = 95% credible interval; > = larger than the upper bound of the 95% CI; < = smaller than the lower bound of the 95% CI.