Does hippocampal volume explain performance differences on hippocampal-dependent tasks?

Marked disparities exist across healthy individuals in their ability to imagine scenes, recall autobiographical memories, think about the future and navigate in the world. The importance of the hippocampus in supporting these critical cognitive functions has prompted the question of whether differences in hippocampal grey matter volume could be one source of performance variability. Evidence to date has been somewhat mixed. In this study we sought to mitigate issues that commonly affect these types of studies. Data were collected from a large sample of 217 young, healthy adult participants, including whole brain structural MRI data (0.8 mm isotropic voxels) and widely-varying performance on scene imagination, autobiographical memory, future thinking and navigation tasks. We found little evidence that hippocampal grey matter volume was related to task performance in this healthy sample. This was the case using different analysis methods (voxel-based morphometry, partial correlations), when whole brain or hippocampal regions of interest were examined, when comparing different sub-groups (divided by gender, task performance, self-reported ability), and when using latent variables derived from across the cognitive tasks. Hippocampal grey matter volume may not, therefore, significantly influence performance on tasks known to require the hippocampus in healthy people. Perhaps only in extreme situations, as in the case of licensed London taxi drivers, are measurable ability-related hippocampus volume changes consistently exhibited.

Highlights
Evidence is mixed about whether hippocampal volume affects cognitive task performance
This is particularly the case concerning individual differences in healthy people
We collected structural MRI data from 217 healthy people
They also had widely-varying performance on cognitive tasks linked to the hippocampus
In-depth analyses showed little evidence hippocampal volume affected task performance


INTRODUCTION
People vary substantially in their ability to perform tasks related to critical aspects of cognition that enable the smooth functioning of our everyday lives. These include imagining scenes (the process of forming and visualising scene imagery in the absence of visual input), autobiographical memory (the recall of past events from one's life), future thinking (imagining future experiences) and spatial navigation (the process of ascertaining one's position in the environment, and planning and following a route). For example, some individuals can recollect decades-old autobiographical memories with great clarity compared to others who struggle to recall what they did last weekend (e.g. Palombo

Considering first scene imagination, as far as we are aware, no studies involving healthy people have yet investigated the association between hippocampal volume and the ability to construct scene imagery. Recent work documented a relationship between hippocampal grey matter volume and scene imagination performance in patients with the behavioural variant of frontotemporal dementia, a link that was not apparent in patients with Alzheimer's disease (Wilson et al., 2020). However, in that study no direct comparison was made between the two patient groups, or between the patients and healthy controls, precluding interpretations about the association between hippocampal volume and scene imagination ability.

Whether a link exists between hippocampal volume and more naturalistic autobiographical memory ability is, therefore, unclear.

There has been limited work investigating the relationship between future thinking ability and hippocampal volume. A recent study claimed to have observed a positive relationship between participant ratings of sensory perceptual qualities in future thinking (e.g. vividness, the amount of visual details) and hippocampal grey matter volume (Yang, Chen, Zhang, Xu, & Feng, 2020).
However, the peak voxel coordinates (MNI space; 39, -7.5, -10.5) and cluster were located outside of the hippocampus (Figure 1 of Yang et al., 2020). A more precise investigation into the relationship between hippocampal volume and future thinking ability is, therefore, required.

A definitive link has been identified between hippocampal grey matter volume and extreme spatial navigation ability. Licensed London (UK) taxi drivers must memorise the extensive and complex layout of ~25,000 London streets and thousands of landmarks, known colloquially as acquiring "The Knowledge". In several studies they were found to have greater posterior, and decreased anterior, hippocampal grey matter volume compared to healthy controls (Maguire et al., 2000), London bus drivers, who spend an equivalent amount of time driving but on regular routes instead of requiring a complete knowledge of London's layout (Maguire, Woollett, & Spiers, 2006), and medical doctors, who have high levels of expertise but not primarily in the spatial domain (Woollett, Glensman, & Maguire, 2008). Moreover, longitudinal data collected before and after attempts to acquire The Knowledge identified posterior hippocampal grey matter volume enlargement within subjects, but only in those individuals who went on to qualify as London taxi drivers (Woollett & Maguire, 2011). Greater posterior, but less anterior, hippocampal grey matter volume is, therefore, reliably associated with extreme spatial navigation expertise.

In the general population, however, the relationship between hippocampal grey matter volume and navigation ability is less clear.
One study involving navigation in a virtual environment found no association between hippocampal volume and navigation performance in healthy people (Maguire, Spiers, et al., 2003), a finding that has since been replicated in a larger sample of 90 individuals (

and with first person perspective taking ability during navigation (Sherrill, Chrastil, Aselcioglu, Hasselmo, & Stern, 2018). In addition, greater posterior compared to anterior hippocampal grey matter volume has been linked to the use of "map-based" strategies, and consequently with better navigation performance (Brunec et al., 2019). By contrast, greater anterior hippocampal grey matter volume has been related to better topographical memory ability

ability to mentally construct an atemporal visual scene, meaning that the scene is not grounded in the past or the future. Participants construct different scenes of commonplace settings. For each of seven scenes, a short cue is provided (e.g. imagine lying on a beach in a beautiful tropical bay) and the participant is asked to imagine the scene that is evoked and then describe it out loud in as much detail as possible. Recordings are transcribed for later scoring. Participants are explicitly told not to describe a memory, but to create a new atemporal scene that they have never experienced before.

The main outcome measure is the "experiential index" which is calculated for each scene and then averaged. In brief, it is composed of four elements: the content, participant ratings of their sense of presence (how much they felt like they were really there) and perceived vividness, participant ratings of the spatial coherence of the scene, and an experimenter rating of the overall quality of the scene.

For the scene construction sub-measures, we separately investigated the four categories that make up the content score, and also the spatial coherence rating.
To score the content, four categories of statement are identified: spatial references, entity presence, sensory description and thoughts/emotions/actions. The spatial reference category encompasses statements regarding the relative position of entities within the environment or directions relative to the participant's vantage point. The entity category is a count of how many distinct entities (e.g., objects, people, animals) were mentioned. The sensory descriptions category consists of any statements describing (in any modality) properties of an entity or the environment in general. Finally, the thoughts/emotions/actions category covers any introspective thoughts or emotional feelings as well as the thoughts, intentions, and actions of other entities in the scene. The final score of each category is the average across the seven scenes.

The imagined scenes are also examined using the spatial coherence index. After each scene is mentally constructed, participants are presented with a set of 12 statements, each providing a possible qualitative description of the imagined scene. Participants are instructed to indicate which statements they felt accurately described their construction, identifying as many or as few as they thought appropriate. Eight of the statements indicate that aspects of the scene were integrated (e.g. "I could see the whole scene in my mind's eye"), whereas four indicate that aspects of the scene were fragmented (e.g. "It was a collection of separate images"). One point is awarded for each integrated statement selected and one point taken away for each fragmented statement selected. This yields a score between -4 and +8 that is then normalised around zero. Any construction with a negative score is considered to be incoherent and fragmented and is scored at 0 so as not to over-penalise fragmented descriptions.
The final spatial coherence index was the average score of the seven scenes, ranging between 0 (totally fragmented) and +6 (completely integrated).

Double scoring was performed on 20% of the data. We took the most stringent approach to identifying across-experimenter agreement. Intra-class correlation coefficients, with a two-way random effects model looking for absolute agreement, indicated excellent agreement among the experimenters (minimum score of 0.9; see Supplementary Methods Table S1). For reference, a score of 0.8 or above is considered excellent agreement beyond chance.

participants are asked to provide autobiographical memories from a specific time and place over four time periods: early childhood (up to age 11), teenage years (aged from 11-17), adulthood (from age 18 years to 12 months prior to the interview; two memories are requested) and the last year (a memory from the last 12 months). Recordings are transcribed for later scoring.

The AI has two main outcome measures: the number of "internal" and "external" details included in the description of an event. Importantly, these two scores represent different aspects of autobiographical memory recall. Internal details are those describing the event in question (i.e. episodic details), and were of primary interest here. External details describe semantic information concerning the event, or non-event information. Internal details are, therefore, thought to be hippocampal-dependent, while external details are not. The two AI scores are obtained by separately averaging performance for the internal and external details across the five autobiographical memories.

For the AI sub-measures, we examined the five separate categories that comprise the internal details outcome measure, as well as considering AI vividness ratings.
While external details can also be split into component categories, too few details were provided by the participants to assess each of these individually.

Internal details are composed of event, place, time, perceptual, and thoughts/emotions categories. The event category contains details regarding happenings or the unfolding of the story, including individuals present, actions and reactions. The time category refers to any details of the year, season, day or time of day wherein the event occurred. The place category contains details that localise the event, both at the general level (e.g. a city), and more specifically (e.g. to parts of a room). The perceptual category involves descriptions (in any modality) of any aspect of the event. Finally, the thoughts/emotions category includes descriptions of emotional states, thoughts or implications. The final score of each category is the average across the five memories.

AI vividness ratings were examined given a recent finding that responses on various memory questionnaires were associated with AI vividness, suggesting that vividness may be a key process in autobiographical memory recall (Clark & Maguire, 2020). Vividness ratings are collected for each memory in response to the question "How clearly can you visualize this event?" on a 6-point scale from 1 (vague memory, no recollection) to 6 (extremely clear as if it's happening now). An overall vividness rating was the average of the vividness ratings provided for each autobiographical memory.

Double scoring was performed on 20% of the data, and there was excellent agreement across the experimenters (minimum score of 0.81; see Supplementary Methods Table S2).
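For readers who wish to reproduce the spatial coherence scoring described earlier for the scene construction task, the following is a minimal Python sketch rather than the scoring code used in the study. The function names are hypothetical. The normalisation step is an assumption: the text states only that the -4 to +8 score is "normalised around zero", and subtracting the midpoint of that range (2) is one reading that is consistent with the reported final range of 0 to +6.

```python
from statistics import mean

N_INTEGRATED = 8   # statements describing an integrated scene
N_FRAGMENTED = 4   # statements describing a fragmented scene


def scene_coherence(n_integrated, n_fragmented):
    """Score a single scene from the numbers of endorsed statements."""
    if not 0 <= n_integrated <= N_INTEGRATED:
        raise ValueError("n_integrated must be between 0 and 8")
    if not 0 <= n_fragmented <= N_FRAGMENTED:
        raise ValueError("n_fragmented must be between 0 and 4")
    raw = n_integrated - n_fragmented   # ranges from -4 to +8
    centred = raw - 2                   # ASSUMPTION: shift range to -6..+6
    return max(0, centred)              # negative scores are set to 0


def spatial_coherence_index(scenes):
    """Average the per-scene scores (seven scenes in the task)."""
    return mean(scene_coherence(i, f) for i, f in scenes)
```

Here `scenes` is a sequence of `(integrated, fragmented)` count pairs, one per imagined scene.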

Future thinking
The future thinking task (Hassabis, Kumaran, Vann, et al., 2007) follows the same procedure as the scene construction task, but requires participants to imagine three plausible future scenes involving themselves (an event at the weekend; next Christmas; the next time they will meet a friend). There are two main differences between the future thinking task and the scene construction task. First, unlike scenes in the scene construction task, scenes in the future thinking task involve 'mental time travel' to the future, so they have a clear temporal dimension. Second, the cues for the scene construction task are somewhat more specific than those employed in the future thinking task (see Hurley

Navigation
Navigation ability was assessed using the paradigm described by Woollett and Maguire (2010). A participant watches movie clips of two overlapping routes through an unfamiliar real town (Blackrock, Dublin, Ireland) four times.

The main outcome measure for navigation performance is calculated by combining the scores from the five tasks used to assess navigational ability. The sub-measures are the performance scores of each individual task.

The tasks were as follows. The first was a movie clip recognition task: following each viewing of the route movies, participants are shown four short movie clips, two from the actual routes and two distractors, and indicate whether or not they have seen each clip before. Second, after all four route viewings are completed, recognition memory for scenes from the routes is tested. The third task, proximity judgements, involves assessing knowledge of the spatial relationships between landmarks from the routes. The fourth, the route knowledge task, requires participants to place scene photographs from the routes in the correct order as if travelling through the town. Finally, the sketch map task involves participants drawing a sketch map of the two routes including as many landmarks as they can remember.

Statistical analyses of the behavioural data
Data were summarised using means and standard deviations, calculated in SPSS v22. There were no missing data, and no data needed to be removed from any analysis.

Analyses were first carried out voxel-wise across whole brain grey matter using an explicitly defined mask which was generated by averaging the smoothed grey matter probability maps in MNI space across all subjects. Voxels for which the grey matter probability was below 80% were excluded from the analysis. Two-tailed t-tests were used to investigate the relationships between cognitive task performance and grey matter volume, with statistical thresholds applied at p < 0.05 family-wise error (FWE) corrected for the whole brain, and a minimum cluster size of 5 voxels.

As our main focus was on the relationship between cognitive task performance and hippocampal grey matter volume, in the main text we report only findings pertaining to the hippocampus. However, performing analyses at the whole brain level meant that we could apply the recommended statistical threshold to the analyses (Nichols et al., 2017; Poldrack et al., 2008). Consequently, any significant relationships identified would be supported by the strongest evidence. In addition, performing analyses at the whole brain level also allowed us to investigate whether any non-hippocampal brain regions were associated with cognitive task performance; these results are reported in the Supplementary Results, although there were very few.
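The construction of the explicit grey matter mask described above can be illustrated with a short Python sketch. This is not the SPM-based pipeline used in the study: it assumes, purely for illustration, that each participant's smoothed grey matter probability map has been flattened into a list of voxel values of equal length, and the function name `grey_matter_mask` is hypothetical.

```python
def grey_matter_mask(prob_maps, threshold=0.8):
    """Return a boolean mask of voxels to keep in the analysis.

    prob_maps: one flattened probability map (list of floats) per subject.
    A voxel is kept when its group-average grey matter probability is at
    least `threshold`; voxels below 80% are excluded, as in the text.
    """
    n_subjects = len(prob_maps)
    n_voxels = len(prob_maps[0])
    if any(len(m) != n_voxels for m in prob_maps):
        raise ValueError("all maps must contain the same number of voxels")
    mean_map = [
        sum(m[v] for m in prob_maps) / n_subjects for v in range(n_voxels)
    ]
    return [p >= threshold for p in mean_map]
```

Averaging before thresholding means that a voxel is retained on the basis of the group-level probability, so a single participant with a low value at that voxel does not remove it from the mask.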

Hippocampal ROI VBM
Following the whole brain analysis, we focused on the hippocampus using anatomical hippocampal masks. This allowed us to investigate whether any weaker relationships existed between hippocampal grey matter volume and task performance that did not reach the statistical threshold required for the whole brain analyses. The masks were manually delineated on the group-averaged MT saturation map in MNI space (1 mm × 1 mm × 1 mm) using ITK-SNAP

Two-tailed t-tests were used to investigate the relationships between cognitive task performance and grey matter volume within the hippocampal masks. Voxels were regarded as significant when falling below an initial whole brain uncorrected voxel threshold of p < 0.001, and then a small volume correction threshold of p < 0.05 FWE corrected for each mask, with a minimum cluster size of 5 voxels. Given that all significant results observed using the anterior and posterior unilateral masks were also evident using the bilateral masks, we report the results of the bilateral masks. Where significant results within the masks were identified, the whole brain analysis was re-examined with the threshold set to p < 0.001 uncorrected to assess whether the effects identified within the hippocampal masks were due to leakage from adjacent brain regions.

Auxiliary analyses using extracted hippocampal volumes
Following our main analyses, we performed a series of auxiliary investigations to further scrutinise potential relationships between hippocampal grey matter volume and scene imagination, autobiographical memory, future thinking and navigation task performance.

The basis of these auxiliary analyses was the hippocampal grey matter volume for each participant that was extracted using 'spm_summarise'. The whole, anterior and posterior anatomical hippocampal masks (described above) were applied to each participant's smoothed and normalised grey matter volume maps, and the total volume within each mask was extracted. It has been suggested that the posterior:anterior hippocampal volume ratio shows stronger relationships to memory and navigation performance than raw anterior and posterior

in line with these studies, we also calculated each participant's posterior:anterior hippocampal volumetric ratio. We did this by dividing the posterior hippocampal volume by the anterior hippocampal volume, providing a ratio whereby values greater than 1 were indicative of a larger posterior over anterior hippocampus and those less than 1 indicated a larger anterior over posterior hippocampus.

Overall, across all our auxiliary investigations (described in detail below) we examined the relationships between four different measures of hippocampal grey matter volume (whole, anterior, posterior, ratio) for each of our 26 task performance measures. Given the extensive nature of these analyses, we needed to ensure that appropriate correction was made to the statistical thresholds. Typically, FWE corrections (such as Bonferroni) are employed in this regard. However, FWE correction acts to ensure that the probability of a single false positive result remains at the critical alpha (typically p < 0.05); a highly effective method if the need to avoid false positives is high.
However, this level of control comes at a price, especially when controlling for a large number of statistical tests, as it means that the effects of any single variable have to be particularly large to reach the adjusted significance level. Thus, while false positives are avoided, the risk of missing a true effect (a false negative) increases.

An alternative method of statistical correction is to control the false discovery rate (FDR; Benjamini & Hochberg, 1995), which limits the expected proportion of significant results that are false positives. A FDR of p < 0.05 therefore allows, on average, 5% of the results declared significant to be false positives. Controlling for the FDR is, therefore, more lenient than performing FWE corrections, as it accommodates the potential for a greater number of false positive results but, by doing so, also reduces the chances of missing a true effect. Applying our statistical correction in terms of the FDR provided, therefore, a compromise between the need to adjust our statistical thresholds for multiple comparisons, while not removing all possibility of identifying true relationships among the data features.

Across the auxiliary analyses, the FDR was set to p < 0.05 for each family of tests. A family was defined as all the tests performed for each set of auxiliary analyses for one cognitive task. For example, when testing for relationships between hippocampal volume and scene construction task performance, we had 4 measures of hippocampal volume (whole, anterior, posterior, ratio) and 6 measures of performance (experiential index, spatial references, entities present, sensory descriptions, thoughts/emotions/actions, spatial coherence), totalling 24 separate statistical tests. The FDR was thus set to p < 0.05 for these 24 tests.
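To make the FDR correction concrete, the following is a minimal Python sketch of the standard Benjamini-Hochberg step-up procedure applied to one family of tests (for example, the 24 scene construction tests). It is an illustration only, not the resource used in the study; the function name `benjamini_hochberg` and the parameter `q` are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return (adjusted p values, significance flags) for one test family.

    The adjusted value for the p value of rank k (1 = smallest) among m
    tests is min over ranks j >= k of p_(j) * m / j, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so the minimum can accumulate.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    significant = [a <= q for a in adjusted]
    return adjusted, significant
```

Because the adjustment depends on the number of tests `m`, the same raw p value can pass in a small family and fail in a large one, which is why the definition of a family matters.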
The following auxiliary analyses were performed in SPSS v25 with FDR thresholding and Benjamini-Hochberg adjusted p values calculated using the resources provided by McDonald (2014).

2.9.1 Partial correlations using extracted hippocampal volumes
We performed partial correlation analyses between each of the hippocampal volumetric measures (whole, anterior, posterior, ratio) and each of our measures of task performance. As with the primary VBM analyses, age, gender, total intracranial volume and MRI scanner were included as covariates.

2.9.2 Sub-group analyses using extracted hippocampal volumes

Effects of gender
To directly investigate the effects of gender, we divided the sample into male (n = 108) and female (n = 109) participants. We first performed separate partial correlations for the two participant groups for each of the hippocampal volumetric measures (whole, anterior, posterior, ratio) and each of our measures of task performance. Age, total intracranial volume and MRI scanner were included as covariates. If a significant volume-task relationship was identified in either the male or female group, our intention was to then compare the male and female relationships directly. However, this was never required (see section 3.3.2).
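The partial correlations used in these analyses were run in SPSS. As a hedged illustration of what such an analysis computes, the sketch below implements a partial correlation in plain Python using the standard recursive formula, which removes one covariate at a time. The function names are hypothetical, and categorical covariates such as gender or MRI scanner are assumed to have been numerically coded beforehand.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Correlation between x and y after controlling for the covariates.

    Uses the recursion
    r_xy.Zz = (r_xy.Z - r_xz.Z * r_yz.Z) / sqrt((1 - r_xz.Z^2)(1 - r_yz.Z^2)),
    which is adequate for a handful of covariates.
    """
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[-1], covariates[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

A call such as `partial_corr(volume, score, [age, tiv, scanner])` would correspond to one volume-task test within a single gender group, with `volume`, `score` and the covariates being per-participant lists.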

Median split: direct comparison
Participants were first allocated to groups depending on their performance on a task compared to the median score for that task. If their score was less than or equal to the median value they were assigned to the low performing group; scores greater than the median resulted in allocation to the high performing group. Due to the distribution of performance scores, the navigation movie clip recognition test could not be divided into groups (as most participants performed at ceiling) and so was not included in these analyses. Our four measures of hippocampal volume were then compared between the low and high performing groups for each task using univariate ANCOVAs with the hippocampal volume measure as the dependent variable, performance group as the main predictor and age, gender, total intracranial volume and MRI scanner as covariates.
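The group allocation rule described above can be sketched in a few lines of Python. This is an illustration under the stated rule only (scores at or below the median are "low", scores above it are "high"); the function name `median_split` is hypothetical.

```python
from statistics import median


def median_split(scores):
    """Label each score as 'low' (<= median) or 'high' (> median)."""
    cutoff = median(scores)
    return ["low" if s <= cutoff else "high" for s in scores]
```

The sketch also shows why a test with most participants at ceiling cannot be split this way: if the median equals the maximum score, every participant is labelled "low" and the high performing group is empty.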

Median split: partial correlations
We next investigated if different associations between hippocampal volume and cognitive task scores existed in the high and low performing participants. As before, because the navigation movie clip recognition test could not be divided into groups it was not included in these analyses. For each task, partial correlations were performed separately for the two groups between the hippocampal volume measures and the task performance measures, with age, gender, total intracranial volume and MRI scanner included as covariates. If a significant relationship was identified in either the low or high performance group, our intention was to then compare the group correlations directly. However, this was never required (see section

The best and worst performers for each task were identified. This was approximately the top and bottom 10% (n ~ 20 in each group). The exact number of participants allocated to each group varied for each task depending on the distribution of performance scores and to ensure the number of participants in each group was as similar as possible. As the navigation movie clip recognition test could not be split into approximately equal groups of best and worst performers it was not included in these analyses. Our four measures of hippocampal volume were then compared between the best and worst performing participants for each task using univariate ANCOVAs with the hippocampal volume measure as the dependent variable, performance group as the main predictor and age, gender, total intracranial volume and MRI scanner as covariates.

navigation. Our aim here was to investigate whether the interaction between self-reported ability and hippocampal volume was associated with task performance, following the method of He & Brown (2020). As such, eight interaction terms were created: responses on the PSIQ and SBSODS by the four measures of hippocampal grey matter volume. Partial correlation analyses were then performed between the interaction terms and task performance, with age, gender, total intracranial volume and MRI scanner as covariates as in the previous analyses, with the additional covariates of the relevant questionnaire and hippocampal volume measurement to ensure that any identified relationships were due to the interaction of self-report and volume and not either individually.
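The construction of the eight interaction terms can be illustrated as follows. This sketch assumes that each term is the simple per-participant product of a questionnaire score and a volume measure. The text does not state whether the variables were centred or scaled first, so this is an assumption rather than a description of the He & Brown (2020) method, and the function name `interaction_terms` is hypothetical.

```python
def interaction_terms(questionnaires, volumes):
    """Build one product term per (questionnaire, volume measure) pair.

    questionnaires: dict of name -> per-participant scores (e.g. PSIQ, SBSODS)
    volumes: dict of name -> per-participant values
             (e.g. whole, anterior, posterior, ratio)
    """
    terms = {}
    for q_name, q_scores in questionnaires.items():
        for v_name, v_values in volumes.items():
            # ASSUMPTION: interaction = raw score multiplied by raw volume.
            terms[(q_name, v_name)] = [
                q * v for q, v in zip(q_scores, v_values)
            ]
    return terms
```

With two questionnaires and four volume measures this yields 2 × 4 = 8 terms, matching the number reported above. Each term would then be entered into a partial correlation with the relevant questionnaire and volume measure added as covariates.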

Multivariate analyses using latent variables
The analyses detailed above are all univariate and task-based. A multivariate approach using latent variables derived from across the cognitive data may provide additional insights into associations between task performance and hippocampal grey matter volume. For example, this method offered increased statistical power and enabled data driven investigations, eschewing a reliance on conceptually driven cognitive constructs, and so had the potential to increase the generalisability of findings beyond the specific tasks in question.

To identify latent variables across the cognitive data, two Principal Component Analyses (PCA) were performed using SPSS v25 with varimax rotation and an eigenvalue cut-off of 1. The first PCA investigated possible latent variables across the four main outcome measures of the scene imagination, autobiographical memory, future thinking and navigation tasks, and the second PCA examined all 21 task sub-measures. Factor scores for the identified latent variables were extracted using the Anderson-Rubin method. The relationships between the latent variable factor scores and hippocampal grey matter volume were examined using the VBM and partial correlation approaches described above.

3. RESULTS

Cognitive task performance
A summary of the outcome measures for the cognitive tasks is shown in

Whole brain VBM
As our main focus was on the relationship between cognitive task performance and hippocampal grey matter volume, here we report only findings pertaining to the hippocampus; any regions identified outside the hippocampus are reported in the Supplementary Results.

No significant relationships between cognitive task performance and hippocampal grey matter volume were identified for any of the main outcome measures of the tasks assessing scene imagination, autobiographical memory, future thinking or navigation. This was also the case for the sub-measures of these tasks.

Hippocampal ROI VBM
No relationships between cognitive task performance and hippocampal grey matter volume were identified using any of the hippocampal masks for the main outcome measures from the tasks examining scene imagination, autobiographical memory, future thinking or navigation.

No associations were observed between performance on the scene construction task sub-measures and hippocampal grey matter volume using any of the hippocampal masks.

For the AI sub-measures, one potential relationship was identified. When using the posterior hippocampal mask, a cluster of 10 voxels, located towards the anterior portion of the posterior mask in the left hippocampus, was found to be negatively associated with the number of internal place references (Figure 1a; peak coordinates = -34, -23, -12, peak t = 3.28, p FWE posterior hippocampus ROI corrected = 0.046). This relationship, however, was not significant when correcting for the whole hippocampus mask (p FWE whole hippocampus ROI corrected = 0.069). Reducing the statistical threshold to p < 0.001 uncorrected for the whole brain confirmed the cluster was confined to the hippocampus and not due to leakage from adjacent brain regions (Figure 1b). However, visualisation at this lower threshold also demonstrated that the identified relationship was not unique to the hippocampal voxels, with a number of other brain regions also being negatively associated with the AI internal place references at this liberal statistical threshold.

For the future thinking sub-measures, no associations were observed between cognitive task performance and hippocampal grey matter volume.

Considering the navigation sub-measures, one potential relationship was identified. When using the anterior hippocampal mask, a cluster of 17 voxels, located at the edge of the mask in the right hippocampus, was found to be positively associated with performance on the navigation route knowledge sub-measure (Figure 1c; peak coordinates = 27, -8, -29, peak t = 3.31, p FWE anterior hippocampus ROI corrected = 0.028). This relationship, however, was not significant when correcting for the whole hippocampus mask (p FWE whole hippocampus ROI corrected = 0.062). Reducing the statistical threshold to p < 0.001 uncorrected for the whole brain clearly indicated that the identified hippocampal voxels were located on the edge of a larger cluster that was in fact centred on the right anterior fusiform and parahippocampal cortices (Figure 1d, cluster size = 1557, peak coordinates = 29, -5, -37, peak t = 3.87).

Auxiliary analyses using extracted hippocampal volumes
We next performed a series of auxiliary analyses to further investigate possible relationships between cognitive task performance and hippocampal grey matter volume. The latter was extracted for each participant using the predefined anatomical hippocampal masks.
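The partial correlations used in these auxiliary analyses can be illustrated with a minimal sketch. It assumes the standard residualisation approach: both variables are regressed on the covariates, and the residuals are then correlated. The function name `partial_corr` and all of the data are hypothetical and synthetic. The covariate set mirrors the age, gender and ICV covariates named in the text, while the categorical scanner covariate is omitted here and would need dummy coding. This is not the authors' analysis code.

```python
import numpy as np


def partial_corr(x, y, covariates):
    """Pearson correlation between x and y after regressing the covariates out of both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: an intercept column plus one column per covariate.
    design = np.column_stack([np.ones(len(x)), np.asarray(covariates, dtype=float)])
    # Residualise each variable on the covariates with ordinary least squares.
    resid_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    resid_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(resid_x, resid_y)[0, 1])


# Synthetic, standardised data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 217
age = rng.normal(0.0, 1.0, n)
gender = rng.integers(0, 2, n)
icv = rng.normal(0.0, 1.0, n)
# Volume and task score both depend on ICV but not on each other.
volume = 0.9 * icv + rng.normal(0.0, 0.4, n)
score = 0.8 * icv + rng.normal(0.0, 0.6, n)

covariates = np.column_stack([age, gender, icv])
raw_r = float(np.corrcoef(volume, score)[0, 1])
partial_r = partial_corr(volume, score, covariates)
print(f"raw r = {raw_r:.3f}, partial r = {partial_r:.3f}")
```

In this synthetic case the raw correlation is driven entirely by the shared dependence on ICV, so it largely disappears once ICV is partialled out. This is the reason for including such covariates when relating regional volumes to behaviour.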

Partial correlations using extracted hippocampal volumes
No relationships were identified between hippocampal grey matter volume and performance for any of the main or sub-measures of the tasks assessing scene imagination, autobiographical memory, future thinking or navigation (see Supplementary Results Table S5 for details). These partial correlation findings, therefore, support those of the primary VBM analyses and extend them to include the posterior:anterior hippocampal volume ratio.

Sub-group analyses using extracted hippocampal volumes

Effects of gender
We next divided the sample into male and female participants. No relationships between cognitive task performance and hippocampal grey matter volume were observed in either group for any of the main or sub-measures (see Supplementary Results Tables S6 and S7 for details).

No significant relationships between hippocampal grey matter volume and cognitive task main and sub-measures were identified in either the low or high performing groups (see Supplementary Results Tables S10 and S11 for details). As no relationships were detected in either group, direct comparison of the relationships was not relevant.

Best versus worst
Details of the groups that were created when the sample was divided in terms of best and worst performers are provided in the Supplementary Results (Table S12). Taking the best and worst performing participants for each cognitive task main and sub-measure and comparing the hippocampal grey matter volume measurements between the two groups also showed no significant differences between the groups (see Supplementary Results Table S13 for details).

Effects of self-reported ability
The results of these analyses are provided in the Supplementary Results (Table S14).

No significant associations were observed between any of the scene construction task main and sub-measures and the interaction between self-reported ability (determined by the PSIQ) and hippocampal grey matter volume.

For autobiographical memory, a potential relationship was identified for AI vividness, but no other main or sub-measure. For AI vividness, an association was observed between AI vividness ratings and the PSIQ by whole hippocampus volume interaction (r = 0.24, p FDR corrected = 0.016), and with the PSIQ by posterior hippocampal volume interaction (r = 0.22, p FDR corrected = 0.013). This suggests that self-reported imagery ability may be influencing the relationship between hippocampal volume and AI vividness ratings.

To better understand these interactions, simple slopes and region of significance analyses were performed using the interactions toolbox (Long, 2019) in R 3.6.1 (R Core Team, 2017), with regions of significance calculated using the Johnson-Neyman technique (Johnson & Fay, 1950).

First, we focused on the relationship between AI vividness and the PSIQ by whole hippocampus interaction. Repeating the partial correlation analysis as a regression (with AI vividness as the dependent variable, PSIQ, whole hippocampal volume and the PSIQ by whole hippocampal volume interaction as predictors, and age, gender, scanner and ICV as covariates) identified that for the PSIQ by whole hippocampus volume interaction, the unstandardized beta value was 0.0000710. Performing simple slopes to decompose this relationship identified that for individuals who were 1 standard deviation below the mean PSIQ score, the unstandardized beta value was -0.000859 (t = -4.15, p FDR corrected < 0.001).
For individuals with a mean PSIQ score, the unstandardized beta value was -0.000459 (t = -2.78, p FDR corrected = 0.009), while for individuals who were 1 standard deviation above the mean PSIQ score, the unstandardized beta value was -0.0000583 (t = -0.28, p FDR corrected = 0.92). As such, in individuals with low PSIQ scores, there was a negative relationship between whole hippocampal volume and AI vividness that was not evident in individuals with high PSIQ scores. A region of significance analysis found that the negative relationship existed in individuals with PSIQ scores less than 23.65.

We then investigated the relationship between AI vividness and the PSIQ by posterior hippocampus interaction. Repeating the partial correlation analysis as a regression (with AI vividness as the dependent variable, PSIQ, posterior hippocampal volume and the PSIQ by posterior hippocampal volume interaction as predictors, and age, gender, scanner and ICV as covariates) identified that for the PSIQ by posterior hippocampus volume interaction, the unstandardized beta value was 0.000157. Performing simple slopes identified that for individuals who were 1 standard deviation below the mean PSIQ score, the unstandardized beta value was -0.00180 (t = -4.53, p FDR corrected < 0.001). For individuals with a mean PSIQ score, the unstandardized beta value was -0.000919 (t = -3.24, p FDR corrected = 0.003), while for individuals who were 1 standard deviation above the mean PSIQ score, the unstandardized beta value was -0.0000343 (t = -0.098, p FDR corrected = 0.92). As with the whole hippocampal volume interaction, in individuals with low PSIQ scores, there was a negative relationship between posterior hippocampal volume and AI vividness that was not apparent in individuals with high PSIQ scores.
A region of significance analysis found that the negative association existed in individuals with PSIQ scores less than 24.13.

For all of the future thinking and navigation main and sub-measures, no significant findings were observed when investigating the association between cognitive task performance and the interaction between self-reported ability and hippocampal grey matter volume.
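The simple slopes reported above for AI vividness follow from the linear form of an interaction model, in which the slope of hippocampal volume at a given PSIQ score equals the volume coefficient plus the interaction coefficient multiplied by that score. A consequence is that the slopes at one standard deviation below the mean, at the mean and at one standard deviation above the mean should be equally spaced. The short check below uses only the unstandardized betas quoted in the text. The function name `implied_moderator_sd` is hypothetical, and the standard deviation it returns is derived here rather than reported in the source. It assumes PSIQ was entered in raw, unscaled units and ignores rounding in the published values.

```python
# Unstandardized betas as quoted in the text above.
reported = {
    "whole hippocampus": {
        "interaction": 0.0000710,
        "slope_low": -0.000859,    # PSIQ 1 SD below the mean
        "slope_mean": -0.000459,   # PSIQ at the mean
        "slope_high": -0.0000583,  # PSIQ 1 SD above the mean
    },
    "posterior hippocampus": {
        "interaction": 0.000157,
        "slope_low": -0.00180,
        "slope_mean": -0.000919,
        "slope_high": -0.0000343,
    },
}


def implied_moderator_sd(betas):
    """Return the PSIQ SD implied by each one-SD step in the simple slopes.

    In a linear interaction model the slope of volume changes by
    interaction * SD for every one-SD change in the moderator, so
    dividing each step by the interaction beta recovers the SD.
    """
    step_down = betas["slope_mean"] - betas["slope_low"]
    step_up = betas["slope_high"] - betas["slope_mean"]
    return step_down / betas["interaction"], step_up / betas["interaction"]


implied = {name: implied_moderator_sd(b) for name, b in reported.items()}
for name, (sd_down, sd_up) in implied.items():
    print(f"{name}: implied PSIQ SD = {sd_down:.2f} (lower step), {sd_up:.2f} (upper step)")
```

All four steps imply a PSIQ standard deviation of roughly 5.6, so the reported simple slopes are internally consistent with the reported interaction betas under the stated assumptions.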

Multivariate analyses using latent variables
The detailed outcomes of the multivariate analyses can be found in the Supplementary Results. In summary, performing a PCA with the four main outcome measures of the scene imagination, autobiographical memory, future thinking and navigation tasks resulted in a single component that explained 56.05% of the variance (Supplementary Results Table S15). Conducting a VBM analysis with this latent variable identified no significant relationship between hippocampal grey matter volume and the latent variable factor scores, both at the whole brain level and when using the hippocampal ROIs. Performing partial correlations between the extracted hippocampal grey matter volumes and the latent variable factor scores also identified no significant associations (Supplementary Results).

The second PCA using all 21 task sub-measures identified 6 components that explained 65.01% of the variance (Supplementary Results Tables S16 and S17, and Figure S1 were any relationships between these latent constructs and hippocampal grey matter volume. Using VBM (at both the whole brain level and in the anatomical ROIs) or when performing partial correlations using the extracted hippocampal grey matter volumes (Supplementary Results Table S18), no significant relationships between task performance and hippocampal grey matter volume were evident.

however, in a large sample of young, healthy adult participants, we found little evidence that hippocampal grey matter volume was related to task performance. This was the case across different methodologies (VBM and partial correlations), whole brain or hippocampal ROIs, different sub-groups of participants (divided by gender, task performance and self-reported ability) and when examining latent variables derived from across the cognitive tasks.
Hippocampal grey matter volume may not, therefore, significantly influence performance on tasks known to require the hippocampus in healthy people.

Of the 52 main VBM analyses we performed, only two significant associations between hippocampal grey matter volume and task performance were noted. First, a negative association between a cluster of ten voxels and the measure of AI internal place was observed. However, this result was only significant when correcting for the posterior hippocampal mask, and was not replicated when a partial correlation analysis was performed on the extracted hippocampal volume data. Indeed, the correlation coefficient between the posterior hippocampal volume and AI internal place was only -0.053. In the second result, a cluster of seventeen voxels was positively associated with navigation route knowledge when correcting for the anterior hippocampal mask only. However, as shown in Figure 1d on all of the tasks. In addition, hippocampal grey matter volume was heterogeneous across the sample. We were, therefore, well-placed to identify any potential relationships between hippocampal volume and task performance, yet few were found.

Fourth, we investigated possible relationships across the whole brain, but also using anatomically-defined hippocampal ROIs. Despite the reduced level of statistical correction required when constraining analyses to hippocampal anatomical ROIs, we still found no significant relationships. It could be argued that the already liberal statistical thresholds should be reduced even further, but this would simply increase the chances of false positive results, given our large sample size and the considerable number of analyses that were performed.

Fifth, similar outcomes were observed using two different analysis methods.
While our primary analyses involved VBM, by extracting the whole, anterior and posterior hippocampal volumes for each participant and additionally calculating the posterior:anterior volume ratio, we could also perform partial correlations between each of the volume measurements and our measures of task performance. The results, therefore, do not seem to be consequent upon a specific analysis technique, but instead there was a consistent pattern of null results.

Sixth, we also divided the sample into multiple sub-groups, investigating male and female participants, assessing whether there was any effect of cognitive performance (using both a median split, and by examining only the best and worst performers for each task measure), and self-reported ability. No significant associations between hippocampal volume and performance were identified.

Finally, we also conducted multivariate analyses, using PCA to identify latent variables from across the cognitive tasks. This approach offered increased statistical power and enabled data-driven investigations, eschewing a reliance on conceptually driven cognitive constructs, and so had the potential to increase the generalisability of findings beyond the specific tasks in question. To the best of our knowledge, formal PCA analyses using the sub-measures of the tasks employed here have not been reported previously. It was reassuring to find broad correspondence between the components identified by the PCA and the cognitive tasks, and to note that the spatial coherence and the emotional aspects of mental representations cut across cognitive tasks. Of particular relevance for the questions under consideration here, and as with the other analyses, no significant associations between hippocampal grey matter volume and the identified latent variables were observed.
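The latent-variable approach discussed above can be sketched generically as a PCA on standardised task measures, implemented as an eigen-decomposition of their correlation matrix. The sketch below is an illustration only. The function name `pca_correlation` and the synthetic data are hypothetical, the component retention rule and the software used in the study are not restated here, and the unit-variance component scores it returns are a simplification rather than the exact factor-score method used by the authors.

```python
import numpy as np


def pca_correlation(data):
    """PCA via eigen-decomposition of the correlation matrix.

    Returns the proportion of variance explained per component, the
    eigenvectors (one column per component) and unit-variance scores.
    """
    data = np.asarray(data, dtype=float)
    # Standardise each measure so that every task contributes equally.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # eigh returns ascending eigenvalues; reorder to descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Scale component scores to unit variance.
    scores = (z @ eigvecs) / np.sqrt(eigvals)
    return explained, eigvecs, scores


# Synthetic example: four measures sharing one common factor (hypothetical data).
rng = np.random.default_rng(1)
n = 217
common = rng.normal(size=n)
measures = 0.8 * common[:, None] + 0.6 * rng.normal(size=(n, 4))

explained, loadings, scores = pca_correlation(measures)
print("variance explained per component:", np.round(explained, 3))
```

The first column of `scores` plays the role of a single latent variable, which could then be entered into a partial correlation with extracted volumes in place of an individual task measure.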
While we found that hippocampal volume seemed to be unrelated to task performance in the general population, it is likely that some neural variations exist that can, at least in part, explain the marked differences in performance observed in tasks such as those we employed. One possibility is that while volumetric differences for the whole hippocampus, or even its anterior or posterior portions, are not related to performance, effects might be observed for the volume of specific hippocampal subfields. For example, larger CA3 volume has been associated with reduced subjective confusion when recalling highly similar memories ratings. More in-depth analyses using specific ROIs may highlight other relationships (e.g. the positive association between the parahippocampal/fusiform cortices and navigation route knowledge sub-measure we observed when using an uncorrected threshold), but these ROIs would need to be motivated in advance to justify their use, going beyond the remit of the current study.

Advances in neuroimaging mean it is now possible to investigate brain structure in more detail than just grey matter volume. Quantitative neuroimaging techniques (Weiskopf,  structural and functional connectivity measures may, therefore, be associated with task performance, perhaps to a greater extent than hippocampal grey matter volume, and represent important avenues to explore in future studies.

As with functional connectivity, patterns of activity during rest or task-based fMRI may be related to individual differences in performance.

Note. Intraclass correlation coefficients from a two-way random effects model looking for absolute agreement for each content score and for the quality ratings. Four experimenters scored the whole data set (n = 217 participants, 1519 individual scenes) with double scoring performed on 20% of the data (n = 44 participants, 308 scenes) proportionally for each original experimenter.

Note. Intraclass correlation coefficients from a two-way random effects model looking for absolute agreement for each score on the autobiographical interview. Three experimenters scored the whole data set (n = 217 participants, 1085 individual memories) and double scoring was performed on 20% of the data (n = 43 participants, 215 individual memories) proportionally for each original experimenter.

Note. Intraclass correlation coefficients from a two-way random effects model looking for absolute agreement for each content score and for the quality ratings. Four experimenters scored the whole data set (n = 217 participants, 651 individual future scenes) with double scoring performed on 20% of the data (n = 44 participants, 132 future scenes) proportionally for each original experimenter.

Note. Intraclass correlation coefficients from a two-way random effects model looking for absolute agreement for each score on the navigation sketch maps. Three experimenters scored the whole data set (n = 217) and double scoring was performed on 20% of the data (n = 42 participants) proportionally for each original experimenter.

Primary VBM analyses: Results outside of the hippocampus
Only one association was identified between cognitive task performance and grey matter volume outside of the hippocampus (when using a statistical threshold of p < 0.05 FWE whole brain corrected). A negative association was observed between 22 voxels in the right parahippocampal cortex and AI vividness ratings (peak coordinates = 24, -23, -28, peak t = 4.78, p FWE whole brain corrected = 0.025).

Table S14. Partial correlations between task performance and the interaction between hippocampal grey matter volume and self-reported ability (measured by the Plymouth Sensory Imagery Questionnaire for scene construction, autobiographical memory and future thinking and the Santa Barbara Sense of Direction Scale for navigation), with age, gender, total intracranial volume and MRI scanner included as covariates in all analyses, as well as the additional covariates of the relevant questionnaire and hippocampal volume measurement.

Note. The factor scores for each participant were extracted using the Anderson-Rubin method. These were included in a VBM analysis with age, gender, scanner and total intracranial volume as covariates, finding no significant relationships with hippocampal volume both at the whole brain level (p < 0.05 FWE corrected) and when using the anatomical hippocampal masks (p < 0.05 FWE corrected for each mask). Performing partial correlations between the extracted hippocampal grey matter volumes and the factor scores also identified no significant relationships (whole hippocampal volume: r = 0.032, p FDR corrected = 0.85; anterior hippocampal volume: r = 0.073, p FDR corrected = 0.58; posterior hippocampal volume: r = -0.012, p FDR corrected = 0.86; posterior/anterior hippocampal volume ratio: r = -0.091, p FDR corrected = 0.58).