Empirical examination of the replicability of associations between brain structure and psychological variables

Linking interindividual differences in psychological phenotype to variations in brain structure is an old dream for psychology and a crucial question for cognitive neurosciences. Yet, the replicability of previously reported ‘structural brain behavior’ (SBB)-associations has recently been questioned. Here, we conducted an empirical investigation assessing the replicability of SBB among healthy adults. For a wide range of psychological measures, the replicability of associations with gray matter volume was assessed. Our results revealed that, among healthy individuals, 1) finding an association between performance at standard psychological tests and brain morphology is relatively unlikely, 2) significant associations found using an exploratory approach have overestimated effect sizes, and 3) these associations can hardly be replicated in an independent sample. After considering factors such as sample size and comparing our findings with more replicable SBB-associations in a clinical cohort and replicable associations between brain structure and non-psychological phenotype, we discuss the potential causes and consequences of these findings.

The early observations of inter-individual variability in human psychological skills and traits have triggered the search for defining their correlating brain characteristics. Studies using in-vivo neuroimaging have provided compelling evidence of a relationship between human skills and traits and brain morphometry that were further influenced by individuals' years of experience, as well as level of expertise. More subtle changes were also shown following new learning/training (Draganski et al., 2004; Taubert et al., 2011), hence further demonstrating dynamic relationships between behavioral performance and brain structural features. Such observations quickly generated a conceptual basis for a growing number of studies aiming to map subtle inter-individual differences in observed behavior such as personality traits (Nostro et al., 2017), impulsivity traits (Matsuo et al., 2009) or political orientation ; to normal variations in brain morphology (for review see (Genon et  previously reported SBB-associations were also questioned recently. In particular, Boekel et al. (2015), in a purely confirmatory replication study, picked a few specific previously reported SBB-associations. Strikingly, for almost all the findings under scrutiny, they could not find support for the original results in their replication attempt.

In another study we demonstrated a lack of robustness of correlations between cognitive performance and measures of gray matter volume (GMV) in a-priori defined sub-regions of the dorsal premotor cortex in two samples of healthy adults (Genon et al., 2017). Although our study did not primarily aim to address the scientific qualities of SBB, it revealed, in line with Boekel et al. (2015), that a replication issue in SBB-associations could seriously be considered.
However, ringing the warning bell of a replication crisis would be premature since these previous studies have approached replicability questions within very specific contexts and methods and using small sample sizes (Muhlert and Ridgway, 2016). approach. Thus, the null findings of the two questioning studies could be related to the focus and averaging of GMV within specific regions of interest, as suggested by Kanai (2016) and discussed in Genon et al. (2017).

In stark contrast with this argument, in whole-brain exploratory SBB studies, the multitude of statistical tests that are performed (as the associations are tested for each voxel, separately) likely yields many false positives. Directly addressing this limitation, several strategies for multiple comparison correction have been proposed to control the rate of false positives (Eklund et al., 2016). We could hence assume that the high number of multiple tests and general low power 90% of the whole-brain exploratory analyses (i.e. high level of spatial consistency of significant findings). As the size of the subsamples decreased, the shape of the distribution also changed, and the median of the density plots fell around 50% and even 10% for samples consisting of 232 and 138 individuals, respectively.

Similar results, though with a much lower percentage of consistently overlapping voxels, were seen for negative associations of BMI with GMV. The density plots and the spatial maps of Figure 1B show that for the larger samples (consisting of 326 and 232 subjects) few voxels were consistently found in "all" (100%) subsamples as having significant negative association with BMI. For the smaller samples (with 138 participants) the maximum replicable association was found in 93% of the splits and 4 out of 100 exploratory analyses did not result in any significant clusters (Table 1).
Additionally, as Figure 2B shows, the majority of significant voxels had a replicability below 50%.

These results highlight the influence of sample size on the replicability (frequency of overlap) of whole-brain significant associations, even for age and BMI, for which we expected more stable associations with morphological properties of the brain.

Structural associations of the psychological scores: In contrast, for most of the psychological scores, only a few of the 100 discovery subsamples yielded significant clusters. Table 1 and supplementary Table 2 show the number of splits for which the exploratory whole-brain SBB-analysis resulted in at least one significant positively or negatively associated cluster for each score. These results reveal that finding significant SBB-associations using the exploratory approach in healthy individuals is highly unlikely for most of the psychological variables. Furthermore, the significant findings were spatially very diverse, that is, spatially overlapping findings were very rare.

We here retained for further analyses the three psychological scores for which the discovery samples most frequently resulted in at least one significantly associated cluster. These three scores were the Perceptual reasoning score of WASI (Wechsler, 1999), the number of correct responses in the word-context test and the interference time in the color-word interference task. For example, for the discovery samples of 326 adults, in 83 out of 100 randomly generated discovery samples, at least one cluster (not necessarily overlapping) showed a significant positive association between perceptual reasoning and GMV (Table 1). Of note, these more frequently found associations were in the direction linking better task performance with higher GMV.
Yet again, in line with our observations for BMI associations, the probability of finding at least one significant cluster tended to decrease in smaller discovery samples (see Table 1). Likewise, as the discovery sample size decreased, the maximum rate of spatial overlap, as denoted by the height of the density plots, decreased (see Figure 1C-F). The width of these plots shows that the majority (> 50%) of the significant voxels spatially overlapped in less than 10% of the discovery samples. In the same line, the variability depicted by the spatial maps highlights that many voxels are found as significant only in one out of 100 analyses.

These results highlight that finding a significant association between normal variations on behavioral scores and voxel-wise measures of GMV among healthy individuals is highly unlikely, for most of the tested domains. Furthermore, they underscore the extent of spatial inconsistency and the poor replicability of the significant SBB-associations from exploratory analyses.

--------figure1---------
Confirmatory ROI-based SBB-replicability:
Age and BMI effects: Irrespective of the size of the test subsamples and definition used to identify "successful" replication (see Methods), for all ROIs negative age-GMV associations were "successfully" replicated in the matched test samples. Unlike the perfect replication of age-associations, the replication rate of BMI effects depended highly on the test sample size and the criteria used to characterize "successful" replication. Over all three tested sample sizes, in more than 90% of the a-priori defined ROIs, BMI associations were found to be in the same "direction" in the discovery and test samples (i.e. replicated based on "sign" criteria). The examination of replicated findings based on "statistical significance" revealed replicated effects in more than 57% of ROIs. This rate of ROI-based replicability increased from ~57% to 75%, as the test sample size increased from 140 to 328 individuals (see figure 2). Furthermore, as the dark blue segments in the outer layers of figure 2 indicate, Bayesian hypothesis testing revealed moderate-to-strong evidence for H1 in more than 30% of the ROIs. in the color-word interference task).

Despite the structural associations of the perceptual reasoning score being in the same direction (positive SBB-association) for the majority of the ROIs (>85%), less than 31% of all ROIs showed replicated effects based on the "statistical significance" criterion. Finally, less than 4% of the ROIs were identified as "successfully replicated" based on the Bayes factors (Figure 2).

For the three tested sample sizes, associations of the word-context task were in the same direction (positive SBB-association) in the discovery and test pairs in ~75% of ROIs. Nevertheless, again, the rate of statistically "significantly"-replicated ROIs ranged from 17% to 26%.
Furthermore, less than 8% of all ROIs showed replicated effects based on the Bayes factors (moderate-to-strong evidence for H1) (Figure 2).

Finally, negative correlations between interference time of the color-word interference task and average GMV were depicted in ~70% of the ROIs, but significant replication was found in only 11% to 17% of all ROIs, for the three test sample sizes. Along the same line, replication based on the Bayes factors was below 5% (Figure 2E).

In general, these results show the span of replicability of structural associations from highly replicable age-effects to very poorly replicable psychological associations. They also highlight the influence of the sample size, as well as of the criteria that are used to define successful replication, on the rate of replicability of SBB-effects in independent samples.

Effect size in the discovery sample and its link with effect size of the test sample and actual replication:
to the x-axis (discovery samples). Furthermore, for these by-"sign" replicated ROIs, there was no positive relationship between the effect sizes of the behavioral associations in the discovery and test samples (blue lines in each subplot).

For BMI and age, however, the effect sizes of the discovery and test pairs were generally positively correlated, suggesting that the ROIs with greater negative structural association with BMI (or age) in the discovery sample also tended to show stronger negative associations within the matched test sample.

To investigate if the replication power, estimated using the effect size of the discovery samples, was linked to a higher probability of actual replication in the test samples, the ROIs were grouped into replicated and not-replicated, based on the "statistical significance" criterion.
While the estimations of statistical power were generally higher among the replicated compared to not-replicated ROIs for BMI associations (p-value of the Mann-Whitney U tests < 10^-5), for structural associations of the psychological scores, this was not the case. Strikingly, for the structural associations of perceptual reasoning, over all sample sizes, the significantly replicated ROIs tended to have lower estimated power compared to the ROIs that actually were not-replicated (p-value of the Mann-Whitney U tests < 10^-5). These unexpected findings highlight the unreliable aspect of effect size estimations of SBB-associations within the discovery samples among healthy individuals. They also demonstrate that these inflated effect sizes result in flawed and thus uninformative estimated statistical power.

Replicability of "whole brain exploratory associations":
Within the sample of patients from the ADNI cohort, 84 out of the 100 whole-brain exploratory analyses resulted in at least one significant cluster showing a positive association between the immediate-recall score and GMV. In the healthy population, however, the same score resulted in a significant cluster in less than 10% of exploratory analyses, for any of the three discovery sample sizes (supplementary Table 2 and supplementary Figure 1).

As can be seen in the spatial maps of Figure 4, significant associations in the ADNI cohort were found across several brain regions including the bilateral lateral and medial temporal lobe, the lateral occipital cortex, the precuneus, the superior parietal lobule, the orbitofrontal cortex and the thalamus. Although most of the significant voxels were found by less than 10% of the splits, some voxels in the bilateral hippocampus were found to be significantly associated with the recall score in more than 70% of the subsamples (peak of spatial overlap; see Figure 4A, B).
Confirmatory ROI-based SBB-replicability:
Figure 4D shows the rates of "successful replication" of associations between the immediate-recall score and GMV within each ROI in the independent, matched samples. As the innermost layer shows, in more than 94% of ROIs, GMV correlated positively with the recall score in the test subsamples, corroborating the "sign" of the association in the paired discovery samples. These correlations were significant in 72% of all ROIs. Furthermore, in more than 50% of all ROIs the correlations in the test sample supported, at least moderately, the link between higher GMV and higher recall score (using the Bayes factors).

Association between discovery and replication effect size:
The marginal histograms in Figure 4C suggest that overall the effect sizes in the discovery samples are slightly larger than the effect sizes in the paired replication samples. When looking at the ROIs that were successfully replicated (by-sign), there was a positive association between the discovery and replication effect size (Spearman's rho = 0.38, p-value < 10^-11).

Finally, the median replication power was higher among "significantly replicated" ROIs, compared to not replicated (defined using "statistical significance criterion") ROIs (p-value of the Mann-Whitney U test < 10^-3). These results showed the superior, yet not perfect, replicability of SBB-associations within the clinical population (see supplementary Figure 2 for structural associations of immediate recall within the healthy cohort). The relative robustness of the findings in ADNI suggests that, when the population under study shows clear variations in both brain structural markers and psychological measurements, such as the patient group in the ADNI cohort, the associations between brain structure and psychological performance could be relatively reliably characterized. Nevertheless, again, the occurrence of not-replicated results highlights the importance of confirmatory analyses for a robust characterization of brain-behavior associations.

Our empirical investigation of the replicability of SBB in healthy adults showed that significant associations between psychological phenotype and GMV are not frequent when probing a range of psychometric variables with an exploratory approach. Where significant associations were found, these associations showed poor replicability.

In the following, we first discussed implications of the very low rate of significant findings revealed by the exploratory approach.
We then discussed the possible causes of the observed

According to the scientific literature, associations between psychological phenotype (cognitive performance and psychological trait) and local brain structure are not uncommon (Kanai and Rees, 2011). However, in our exploratory analyses, when looking at a range of psychological variables, significant associations with GMV were very rare. It is worth noting that here, by having a-priori fixed analysis design and inference routines, we aimed to avoid "fishing" for

When considering potential impacts of biased SBB-reports on our confidence in psychological measures, as well as our conception and apprehension of brain-behavior relationships and psychological interindividual differences, we would strongly argue for null findings reports. Such reports would contribute to a more accurate and balanced apprehension of associations between differences in psychological phenotype and brain morphometric features, and would also help to progressively disentangle factors that mediate or modulate the relationship between brain structure and behavioral outcomes.

Poor spatial overlap of SBB across resampling: possible causes and recommendations
In addition to the low likelihood of finding "any" significant SBB-association using the exploratory approach, clusters that did survive the significance thresholding did not often overlap in different subsamples. Furthermore, the probability of spatial overlap further dropped as the number of participants in the subsamples decreased (Figure 1). Putting this finding in light of the literature brings two main hypotheses.

First, from the conceptual level, we could hypothesize that the pattern of correlation between a psychological measure and brain structure is by nature spatially diffuse at the brain level. Psychological measures aim to conceptually articulate behavioral functions and processes; thus, in most cases, they have not been developed to identify specific localized brain functions. Following this philosophical segregation between psychological sciences and neurosciences, it is now widely acknowledged that there is no one-to-one mapping between behavioral functions and brain concept to brain features usually result in a diffuse brain spatial pattern with small effect sizes (Bressler, 1995; Poldrack, 2010; Tononi et al., 1998). From this axiom, we can expect that several studies conducted in small samples (specifically after rigorous corrections for multiple comparisons) are likely to each capture a partial and minor aspect of the whole true association pattern, resulting in a poor replication rate for each individual study (i.e. high type II error).

Alternatively, a more parsimonious hypothesis is a methodological one questioning the truth or validity of the found significant associations, hence considering them as spurious (i.e. type I error). Psychological and MRI measurements are both relatively indirect estimations of, respectively, behavioral features and brain structural features and thus are susceptible to noise.
Correlations in small samples in the presence of noise for both types of variables are likely to produce spurious significant results (Loken and Gelman, 2017) by fitting a correlation or regression between random noise in both variables.

Thus, the pattern of poor spatial consistency of SBB findings could result either from factors at the object of study level, i.e. the relationship between brain and behavior, or from factors at the measurement and analysis level. While the latter hypothesis is more parsimonious, one argument for the former hypothesis comes from the relatively substantial replications by-sign observed in our confirmatory analyses of the three top behavioral scores (see figure 2). If the significant SBB findings were purely driven by noise in the data, we would expect them to show purely random signs across resampling, which was not the case (but also see Supplementary figure S2 for example of scores with lower replicability and higher inconsistent associations across resampling). Therefore, it is actually likely that both hypotheses hold true and that the spatial variability of significant SBB findings results from both factors at the analysis level and factors at the object level, potentially interacting together.

It is worth noting that similar complexity and uncertainty have been described for task-based Forstmeier and Schielzeth, 2011)) than in the larger samples.

These factors, added to the complexity of human behavior, render the objective of capturing covariations with psychometric variables in brain structure locally particularly challenging. For that reason, in exploratory studies whose aim is to identify brain structural features correlating with a given (set of) psychological variable(s), a multivariate approach could be advised pattern.
While some authors argue either for one or the other approach, these approaches are far from being mutually exclusive (Moeller and Habeck, 2006). Combining both approaches in small datasets indeed revealed that the results of the univariate approach reflect the "tip of the iceberg" of the behavior's brain correlates, whose spatial extent is more comprehensively captured with the multivariate analysis, but interpretability is facilitated by the use of univariate analyses; e.g. (Genon et al., 2014, 2016).

Thus, to partially address the previously described concerns of small and spatially diffuse effects at the brain level in exploratory whole-brain-behavior studies, we here recommend combining a univariate and a multivariate approach. This solution may help to reduce the false negatives, yet it does not provide any protection against the influence of noise that may affect both approaches.

Confirmatory replication of exploratory SBB findings: importance of out of sample replication
ROI-based analysis further highlighted that significant associations, which have been discovered when starting with a psychological measure and searching within the whole brain for a significant association (i.e. "evidenced in an exploratory study"), show poor replicability (using significance and Bayes factor criteria, but also using similar sign criterion for most

Thus, maybe the conceptual objective itself should be questioned: should we expect the association between normal psychological phenotype, in particular cognitive performance, in healthy population to be substantially driven by local brain macrostructure morphology? Brain structure can certainly not be questioned as the primary substrate of behavior and more than a century of lesion studies recalls this primary principle to our attention (Broca, 1865; Scoville and Milner, 1957), but this does not imply that "normal" variations at standard psychological tests can be related to variations in markers of local brain macrostructure. Our results suggest that a reliable answer to this important question requires substantially big samples (bigger than those used here) and independent replications.

In addition to underscoring the importance of the sample size, our analyses and results further show that the size of the replication sample also matters when examining the replicability of previous SBB findings. This is an obvious factor that has been frequently neglected in the discussions about the replication crisis. Yet, while many replication studies straightforwardly blame the sample size of the original studies, it is important to keep in mind that a replication failure might also come from a too small sample size of the replication study (Muhlert and Ridgway, 2016).

Summary and conclusions
Overall, our work and review of the recent and concomitant replication literature in related fields demonstrate that several improvements could be recommended to get more accurate insight on the relationship between psychological phenotype and brain structure and to progressively answer open questions. Importantly, our recommendations and suggestions concern different levels of SBB research: the dataset level, the analysis level, as well as the post-publication and replication level.

At the dataset level, our study pointed out the need for big data samples to identify robust associations between psychological variables and brain structure, with sample size of at least several hundreds of participants. It should be acknowledged that this conclusion is easier to achieve than to implement in research practice. Nevertheless, large scale cohort datasets from healthy adult populations, such as eNKI used in the current study, human connectome project

Sharing raw data would undoubtedly alleviate the problem of low statistical power, but if not possible, sharing the unthresholded statistical maps (e.g. through platforms such as Neurovault (Gorgolewski et al., 2015)) could also be a significant scientific contribution. In addition to directly contributing to our understanding of brain-behavior relationships, such efforts would open up new possibilities for estimating the correct size and extent of effects by integrating unthresholded statistical maps in the estimation of the effect sizes throughout the brain. Thus, we could hope that sharing initiatives will also contribute indirectly to more valid and insightful SBB studies in the remote future and hence to a better allocation of resources.

Participants:
Healthy adults' data were selected from the enhanced NKI (eNKI) Rockland cohort (Nooner et  We focused only on participants for whom good quality T1-weighted scans were available along with timewise-corresponding psychological data. Exclusion criteria consisted of alcohol or substance dependence or abuse (current or past), psychiatric illnesses (e.g. schizophrenia) and current depression (major or bipolar). Furthermore, we excluded participants with missing information on important confounders (age, gender, education) or bad quality of structural scans after pre-processing, resulting in a total sample of 466 healthy participants (age: 48 ± 19, 153 male). fluency, 20 questions, proverbs and word-context task), the Rey auditory verbal learning task (RAVLT) (Schmidt, 1996) assessing verbal memory performance, as well as the WASI-II intelligence test (Wechsler, 1999). Psychological phenotyping also included anxiety (state and trait) (Spielberger et al., 1970) and personality questionnaires (McCrae and Costa, 2004) in the eNKI cohort. For each test, we used several commonly derived sub-scores to assess the replicability of their structural associations. For each psychological measure, participants whose performance deviated more than 3 standard deviations (SD) from the mean of the whole sample were considered as outliers and thus were excluded from further analysis (see supplementary Table 1).

The list-learning task is a common measure of verbal learning performance and has been implemented using the same standard tool (RAVLT) in both the eNKI and the ADNI cohort. Previous studies have shown that the immediate-recall score (sum of recalled items over the first 5 trials) could be reliably predicted from whole brain MRIs in AD patients (Moradi et al., 2017).
Since this score is a standard measure commonly used in healthy and clinical datasets and its relation to brain structure in clinical data has been previously suggested, in the current work we performed SBB with this score in the ADNI cohort as a "conceptual benchmark".
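
The 3 SD outlier rule stated above for each psychological measure can be illustrated with a minimal Python sketch. The function name `outlier_mask` and the use of the sample SD (`ddof=1`) are assumptions made for illustration; the text does not specify how the SD was computed or how missing values were handled.

```python
import numpy as np

def outlier_mask(scores, n_sd=3.0):
    """Return True for values within n_sd standard deviations of the sample mean.

    Hypothetical helper: missing values (NaN) are ignored when estimating
    the mean and SD and are flagged False (i.e. not retained).
    """
    scores = np.asarray(scores, dtype=float)
    mean = np.nanmean(scores)
    sd = np.nanstd(scores, ddof=1)  # assumption: sample SD
    return np.abs(scores - mean) <= n_sd * sd
```

Participants with a False entry for a given score would be dropped from the analyses of that score only, which matches the per-measure wording of the rule.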

MRI acquisition and preprocessing:
The imaging data of the eNKI cohort were all acquired using a single scanner (Siemens Magnetom TrioTim, 3.0 T). T1-weighted images were obtained using an MPRAGE sequence (TR = 1900 ms; TE = 2.52 ms; voxel size = 1 mm isotropic).

ADNI, on the other hand, is a multisite dataset. Here we selected a subset of this data, which has been acquired on 3.0 T scanners (baseline measurements from the ADNI2 and ADNI GO cohorts) at 39 different sites; see http://adni.loni.usc.edu/methods/documents/ for more information.
Both datasets were preprocessed using the CAT12 toolbox (Gaser and Dahnke, 2016). Briefly, each participant's T1-weighted scan was corrected for bias-field inhomogeneities, then approach, in which a linear model is used to fit interindividual variability in the psychological score to GMV at each voxel. Inference is then usually made at cluster level, in which groups of adjacent voxels that support the link between GMV and the tested score are clustered together.

Replicability of thus-defined associations could be assessed by conducting a similar whole-brain voxel-wise exploratory analysis in another sample of individuals and comparing the spatial location of the significant findings that survive multiple comparison correction between the two samples. Alternatively, replicability could be assessed using a confirmatory approach, in which only regions showing significant SBB-association in the initial exploratory analysis, i.e. regions of interest (ROIs), are considered for testing the existence of the association between brain structure and the same psychological score in an independent sample. The latter procedure commonly focuses on a summary measure of GMV within each ROI and tests for existence of the SBB-association in the direction suggested by the initial exploratory analysis. Thus this approach circumvents the need for multiple comparison correction and therefore increases the power of replication.

Here we assessed replicability of associations between each behavioral measure and gray matter structure, using both approaches: the whole brain replication approach and the ROI replication approach, which are explained in detail in the following sections.

Replicability of whole brain exploratory SBB-associations:
Whole-brain GLM analyses: 100 random subsamples (of same size) were drawn from the main cohort (eNKI or ADNI). Hereafter, each of these subsamples is called a "discovery sample". In each of these samples, SBB-associations were identified using the voxel-wise exploratory approach after controlling for confounders. This was done by using the general linear model as implemented in the "randomise" tool (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Randomise), with 1000 permutations. Age, sex and education were modeled as confounders in the eNKI data. As the ADNI dataset is a multi-site study, we further added site and disease category as dummy-coded confounders to GLMs for the analyses in that dataset. Inference was then made using threshold-free cluster enhancement (TFCE) (Smith and Nichols, 2009), which unlike other cluster-based thresholding approaches does not require an arbitrary a-priori cluster forming threshold. Significance was set at P < 0.05 (extent threshold of 100 voxels).

Spatial consistency maps and density plots: To quantify the spatial overlap of significant SBB associations over 100 subsamples, spatial consistency maps were generated. To do so, the binarized maps of all clusters that showed significant association in the same direction between each psychological score and GMV were generated (i.e. voxels belonging to a significant cluster were given the value "1" and all other voxels were labeled "0") and added over all 100 subsamples. These aggregate maps denote the frequency of finding a significant association between the behavioral score and GMV, at each voxel. Accordingly, a voxel with value of 10 in the aggregate map has been found to be significantly associated with the phenotypical score in 10 out of 100 subsamples.
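
The aggregation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `consistency_map` and `overlap_distribution` are hypothetical, the binarized significance maps are assumed to be already loaded as equally shaped arrays, and restricting the distribution to voxels that were significant at least once is an assumption rather than a detail given in the text.

```python
import numpy as np

def consistency_map(binary_maps):
    """Sum binarized significance maps over subsamples.

    Each map holds 1 for voxels inside a significant cluster and 0 elsewhere;
    the result counts, per voxel, the number of subsamples in which it was significant.
    """
    stack = np.stack([np.asarray(m, dtype=bool) for m in binary_maps])
    return stack.sum(axis=0)

def overlap_distribution(aggregate, n_samples):
    """Percentage of subsamples per voxel, for voxels significant at least once.

    These are the values whose distribution a density plot would summarize.
    """
    values = aggregate[aggregate > 0]
    return 100.0 * values / n_samples
```

With 100 subsamples, a voxel with an aggregate value of 10 maps to 10% in the returned distribution.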
Density plots were also generated to represent the distribution of values within each such map, i.e. the distribution of "frequency of significant finding". Hence, the spatial voxel-wise "significance overlap maps" as well as the density plots of the distribution of values within each map give indications of the replicability of "whole brain exploratory SBB-associations" for each psychological score.

Replicability of SBB-associations using confirmatory ROI-based approach:
ROI-based confirmatory analyses: The replicability of the SBB-associations was also evaluated with the ROI-based confirmatory approach. For each of the 100 discovery subsamples, an age- and sex-matched "test sample" was generated from the remaining participants of the main cohort. In the clinical cohort the discovery and test pairs were additionally matched for "site". In this analysis, for each psychological variable, the significant clusters from the above-mentioned exploratory approach from every "discovery sample" were used as a-priori ROIs. Average GMV over all voxels within the ROI was then calculated for each participant in the respective "discovery" and "test" pair subsamples. Within each subsample, the association between the average GMV and the psychological variable was assessed using ranked-partial correlation, controlling for confounding factors. The correlation coefficient was then compared between each discovery and test pair, providing means to assess "ROI-based SBB replicability" rates for each psychological score. Accordingly, each ROI was examined only once, to identify if associations between average GMV in this ROI and the psychological score from the discovery subsample could be confirmed in the paired test sample.
Replicability rates were quantified according to different indexes (see below) over all ROIs from 100 discovery samples, yielding a percentage of "successfully replicated" ROIs based on each index.
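As an illustration of the ranked-partial correlation used in the ROI-based analysis, the following is a minimal sketch rather than the original implementation. It assumes that all variables, including the confounders, are rank-transformed (ties receive average ranks) and that the partial correlation is then obtained with the standard recursive formula; the function names are hypothetical.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, covariates):
    """Partial correlation of x and y given covariates (recursive formula).

    Fails with a division by zero if a variable is constant or perfectly
    collinear with a covariate; a real analysis would need to guard this.
    """
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[-1], covariates[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def rank_partial_corr(roi_gmv, score, confounders):
    """Rank-transform every variable, then compute the partial correlation."""
    return partial_corr(ranks(roi_gmv), ranks(score),
                        [ranks(c) for c in confounders])
```

Under these assumptions, the same function would be called once on the discovery subsample and once on its matched test subsample for a given ROI, and the two coefficients would then be compared using the indexes described in the next section.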

Indexes of replicability:
Sign: First, we used a lenient definition of replication, in which we compared only the sign of the correlation coefficients of associations within each ROI between the discovery and the matched test sample. Accordingly, any effect that was in the same direction in both samples (even if very close to zero) was defined as a "successful" replication.

Statistical Significance: Another straightforward method for evaluating replication simply defines statistically significant effects (e.g. p-value < 0.05) that are in the same direction as the original effects (from the discovery sample) as "successful" replication. This criterion is consistent with what is commonly used in the psychological sciences to decide whether a replication attempt "worked" (Open Science Collaboration, 2015). Yet, a key weakness of this approach is that it treats the threshold (p < 0.05) as a bright-line criterion between replication success and failure. Furthermore, it does not quantify the decisiveness of the evidence that the data provide for and against the presence of the correlation (Boekel et al., 2015; Wagenmakers et al., 2015). However, such an estimation can be provided by using Bayes factors.

Bayes Factor: To compare the evidence that the "test subsample" provided for or against the presence of an association (H1 and H0, respectively), we additionally quantified SBB-replication within each ROI using Bayes factors (Jeffreys, 1961). Similar to Boekel et al. (2015), here we used the adjusted (one-sided) Jeffreys's test (Jeffreys, 1961) based on a uniform prior distribution for the correlation coefficient. As we intended to confirm the SBB-associations defined in the discovery subsamples, the alternative hypothesis (H1) in this study was considered one-sided (in line with Boekel et al. (2015)).

To facilitate the interpretation, Bayes factors (BF) were summarized into four categories as illustrated in the bar legend of Figure 2. A BF01 lower than 1/3 shows that the data are three times or more likely to have happened under H1 than H0. Accordingly, this value defines the "successful" replication.
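The three indexes can be summarized in a short sketch. It takes the correlation coefficients, the test-sample p-value and the Bayes factor BF01 as already computed inputs, since the Bayes factor computation itself is not reproduced here. The 1/3 cut-off for "successful" replication comes from the text; the remaining category boundaries (1 and 3) are an assumption based on the symmetric convention and on the category names in the figure legends. All function names are hypothetical.

```python
def same_sign(r_discovery, r_test):
    """Lenient index: the effects point in the same direction in both samples."""
    return r_discovery * r_test > 0


def replicated_by_significance(r_discovery, r_test, p_test, alpha=0.05):
    """Same direction as the discovery effect and significant in the test sample."""
    return same_sign(r_discovery, r_test) and p_test < alpha


def bf01_category(bf01):
    """Map BF01 (evidence for H0 over H1) onto four evidence categories.

    Only the 1/3 threshold is stated in the text; the boundaries at 1 and 3
    are assumptions of this sketch.
    """
    if bf01 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf01 < 1 / 3:
        return "moderate-to-strong evidence for H1"
    if bf01 < 1:
        return "anecdotal evidence for H1"
    if bf01 <= 3:
        return "anecdotal evidence for H0"
    return "moderate-to-strong evidence for H0"


def replication_rate(flags):
    """Percentage of ROIs flagged as successfully replicated."""
    return 100.0 * sum(flags) / len(flags)
```

For example, a discovery correlation of 0.30 paired with a test correlation of 0.01 counts as replicated under the sign index, but would typically fail the significance and Bayes factor indexes, which is why the three indexes are reported side by side.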

Investigation of factors influencing replicability of SBB-associations among healthy individuals:
Sample size: In order to study the influence of sample size on the replicability of SBB-associations, for each psychological measure, the healthy sample (eNKI) was divided into discovery and test pairs at three different ratios: 70% discovery and 30% test, 50% discovery and 50% test, and finally 30% discovery and 70% test. As mentioned earlier, in each case, the discovery and test counterparts were randomly generated 100 times in order to quantify the replication rates. For example, to assess the replicability of brain structural associations of age in the case of "70% discovery and 30% test", the entire NKI sample (n = 466) was divided into a discovery group of n = 326 participants and an age- and sex-matched test pair sample of n = 138, and this split procedure was repeated 100 times. Similarly, for generating equal-sized discovery and test subsamples, 100 age- and sex-matched split-half samples were randomly generated from the main NKI cohort.

Due to the multi-site structure of the ADNI cohort, when generating unequal-sized discovery and test samples, we did not achieve a good simultaneous matching of age, sex and site while trying to maintain sample sizes in each subgroup reasonably large. Thus, in this cohort, we did not directly study the influence of the sample size, and the replicability rates were only quantified for equal-sized discovery and test samples (187 participants matched for age, sex and site between discovery and test pairs).

Effect size: Furthermore, to study the influence of the effect size on the replication rates, we focused on the effect sizes within each a-priori ROI in the discovery samples. Here we tested the following two assumptions:
1) ROIs with larger effect sizes in the discovery sample result in larger effect sizes in the test sample pairs (i.e. positive association between effect size in the discovery and test samples).
2) ROIs with larger effect sizes in the discovery sample are more likely to result in a "significant" replication in the independent sample.

To test the first assumption, in the "ROI-based SBB-replicability" analysis the association between effect sizes in the discovery and test pairs was calculated for each psychological measure. These associations were calculated separately for the replicated (defined using the "sign" criterion) and not-replicated ROIs. We expected to find a positive association between discovery and confirmatory effect sizes for the "successfully replicated effects".

To test the second assumption, for each ROI, we calculated its replication statistical power and compared it between replicated and not-replicated ROIs (here replication was defined using the "Statistical Significance" criterion). The statistical power of a test is the probability that it will correctly reject the null hypothesis when the null is false. In a bias-free case, the power of the replication is a function of the replication sample size, the real size of the effect and the nominal type I error rate (α). In this study, the replication power was estimated based on the size of the effects as they were defined in the discovery sample and a significance threshold of 0.05 (one-sided), and was calculated using the "pwr" library in R (https://www.r-project.org).

These analyses were performed for each discovery-test split size separately (i.e. 70%-30%, 50%-50% and 30%-70% discovery-test sample sizes, respectively).

Figure 1. Replicability of exploratory results within healthy cohort. Frequency of spatial overlap (density plots and aggregate maps) of significant findings from exploratory analysis over 100 random subsamples, calculated for three different sample sizes (x-axis). Here, in addition to age and BMI (A, B), which are used as benchmarks, the top three behavioral scores with the highest frequency of overlapping findings are depicted (C-E). Warmer colors on spatial maps denote a higher number of samples with a significant association at the respective voxel. BMI: body mass index; CWI: color-word interference.

Donut plots summarising ROI-based replication rates (% of ROI) using three different criteria for three different sample sizes among healthy participants. The innermost layers depict replication using "sign" only (blue: replicated, orange: not replicated). The middle layers define replication based on similar "sign" as well as "statistical significance" (i.e. p < 0.05) (blue: replicated, orange: not replicated). The outermost layers define replication using "Bayes factor" (blue: moderate-to-strong evidence for H1; light blue: anecdotal evidence for H1; light orange: anecdotal evidence for H0; orange: moderate-to-strong evidence for H0).

Donut plots summarising ROI-based replicability rates using three different criteria. The innermost layer depicts replicability using "sign" only (blue: replicated, orange: not replicated). The middle layer defines replication based on similar "sign" as well as "statistical significance" (i.e. p < 0.05) (blue: replicated, orange: not replicated). The outermost layer defines replicability using "Bayes factor" (blue: moderate-to-strong evidence for H1; light blue: anecdotal evidence for H1; light orange: anecdotal evidence for H0; orange: moderate-to-strong evidence for H0). Discovery and replication samples have equal size (n = 184) and are matched for age, sex and site.

Table S1. Distribution of the raw phenotypical and psychological scores in the whole sample.

Table S2. Summary of the exploratory findings. For each discovery sample size, the number of clusters in which grey matter volume is positively or negatively associated with the tested psychological score is reported. The number of splits (out of 100) in which the clusters were detected is noted in parentheses.

Supplementary Figure legends:
Figure S1. Summary of replication of positive associations between immediate-recall and GMV within healthy cohort. A: Frequency of spatial overlap (density plots and aggregate maps) of significant findings from exploratory analysis over 100 random subsamples, calculated for three different sample sizes (x-axis). B: ROI-based confirmatory replication results. Top row: Donut plots summarising ROI-based replicability rates (% of ROI) using three different criteria for three different sample sizes. The innermost layers depict replicability using "sign" only (blue: replicated, orange: not replicated). The middle layers define replication based on similar "sign" as well as "statistical significance" (i.e. p < 0.05) (blue: replicated, orange: not replicated). The outermost layers define replicability using "Bayes factor" (blue: moderate-to-strong evidence for H1; light blue: anecdotal evidence for H1; light orange: anecdotal evidence for H0; orange: moderate-to-strong evidence for H0). Bottom row: Scatter plots of effect sizes in the discovery versus replication sample for all ROIs from 100 splits within healthy cohort. Points are color-coded based on their replication status (by "sign") and the size of each point is proportional to the estimated statistical power of replication. Regression lines are drawn for the replicated and unreplicated ROIs separately.

Figure S2. ROI-based confirmatory replication results for five personality subscores within healthy cohort. Donut plots summarising ROI-based replication rates (% of ROI) using three different criteria for three different sample sizes among healthy participants. The innermost