Does hippocampal volume explain performance differences on hippocampal-dependant tasks?

Highlights • Evidence is mixed about whether hippocampal volume affects cognitive task performance.• This is particularly the case concerning individual differences in healthy people.• We collected structural MRI data from 217 healthy people.• They also had widely-varying performance on cognitive tasks linked to the hippocampus.• In-depth analyses showed little evidence hippocampal volume affected task performance.


Introduction
People vary substantially in their ability to perform tasks related to critical aspects of cognition that enable the smooth functioning of our everyday lives. These include imagining scenes (the process of forming and visualising scene imagery in the absence of visual input), autobiographical memory (the recall of past events from one's life), future thinking (imagining future experiences) and spatial navigation (the process of ascertaining one's position in the environment, and planning and following a route). For example, some individuals can recollect decades-old autobiographical memories with great clarity compared to others who struggle to recall what they did last weekend (e.g. Palombo et al., 2018 ). Spatial navigation can be undertaken with ease by some people while others consistently get lost (e.g. Newcombe, 2018 ).
Neuroimaging studies of healthy individuals performing tasks related to these cognitive functions reliably show engagement of the hippocampus ( Addis et al., 2007 ;Barry et al., 2018 ;Brown et al., 2010 ;Buckner and Carroll, 2007 ;Dalton et al., 2018 ;Spiers and Maguire, 2006 ;Svoboda et al., 2006 ). Moreover, damage to the ing interpretations about the association between hippocampal volume and scene imagination ability.
A larger number of studies have investigated the relationship between hippocampal volume and memory ability in healthy individuals, but with mixed results. On the one hand, "recollective " memory ability (the detailed re-experiencing of individual episodes characterized by retrieval of items in their context) has been positively associated with posterior hippocampal grey matter volume ( Poppenk and Moscovitch, 2011 ). Moreover, a single individual identified with "Highly Superior Autobiographical Memory " was reported to have greater posterior hippocampal volume compared to matched controls ( Mazzoni et al., 2019 ). On the other hand, group studies of individuals with superior memory capabilities, including those with Highly Superior Autobiographical Memory or those who take part in the World Memory Championships, observed no differences in hippocampal grey matter volume compared to matched controls ( LePort et al., 2012 ;Maguire et al., 2003 ). Similarly, a meta-analysis of 33 studies found no relationship between hippocampal grey matter volume and performance on laboratory-based memory tasks ( Van Petten, 2004 ). Whether a link exists between hippocampal volume and more naturalistic autobiographical memory ability is, therefore, unclear.
There has been limited work investigating the relationship between future thinking ability and hippocampal volume. A recent study claimed to have observed a positive relationship between participant ratings of sensory perceptual qualities in future thinking (e.g. vividness, the amount of visual details) and hippocampal grey matter volume ( Yang et al., 2020 ). However, the peak voxel coordinates (MNI space; 39, − 7.5, − 10.5) and cluster were located outside of the hippocampus ( Fig. 1 of Yang et al., 2020 ). A more precise investigation into the relationship between hippocampal volume and future thinking ability is, therefore, required.
A definitive link has been identified between hippocampal grey matter volume and extreme spatial navigation ability. Licensed London (UK) taxi drivers must memorise the extensive and complex layout of ∼25,000 London streets and thousands of landmarks, known colloquially as acquiring "The Knowledge ". In several studies they were found to have greater posterior, and decreased anterior, hippocampal grey matter volume compared to healthy controls ( Maguire et al., 2000 ), London bus drivers, who spend an equivalent amount of time driving but on regular routes instead of requiring a complete knowledge of London's layout ( Maguire et al., 2006 ), and medical doctors, who have high levels of expertise but not primarily in the spatial domain ( Woollett et al., 2008 ). Moreover, longitudinal data collected before and after attempts to acquire The Knowledge identified posterior hippocampal grey matter volume enlargement within subjects, but only in those individuals who went on to qualify as London taxi drivers ( Woollett and Maguire, 2011 ). Greater posterior, but less anterior, hippocampal grey matter volume is, therefore, reliably associated with extreme spatial navigation expertise.
In the general population, however, the relationship between hippocampal grey matter volume and navigation ability is less clear. One study involving navigation in a virtual environment found no association between hippocampal volume and navigation performance in healthy people ( Maguire et al., 2003 ), a finding that has since been replicated in a larger sample of 90 individuals ( Weisberg et al., 2019 ). However, a re-analysis of the Weisberg et al. (2019) data purports to have identified a relationship between right posterior hippocampal volume and navigation performance, but one that is dependant on self-reported navigation ability or being grouped according to navigation performance ability ( He and Brown, 2020 ). These interactions suggested that in low performing participants a negative relationship existed between right posterior hippocampal volume and task performance. However, interactions were only observed for one of five performance measures and no multiple comparison corrections were made to the statistical threshold used to determine significance. Given the reasonably large sample size and the number of tests performed on the data (by both Weisberg et al., 2019 andHe andBrown, 2020 ) the re-sults of this re-analysis require replication before firm conclusions can be drawn.
Posterior hippocampal grey matter volume has been positively associated with an individual's ability to learn new environments in the real world ( Schinazi et al., 2013 ), in virtual reality ( Brown et al., 2014 ), and with first person perspective taking ability during navigation ( Sherrill et al., 2018 ). In addition, greater posterior compared to anterior hippocampal grey matter volume has been linked to the use of "mapbased " strategies, and consequently with better navigation performance ( Brunec et al., 2019 ). By contrast, greater anterior hippocampal grey matter volume has been related to better topographical memory ability ( Hartley and Harlow, 2012 ), path integration ( Chrastil et al., 2017 ), and performance on a virtual reality eight arm radial maze ( Bohbot et al., 2007 ). Therefore, the literature on associations between navigational ability and hippocampal grey matter volume is disparate, and lacks consistency.
Overall, across scene imagination, autobiographical memory, future thinking and navigation, the relationship between hippocampal grey matter volume and task performance in healthy individuals is ambiguous. Why might this be the case?
First, the sample sizes used have typically been small, with most studies including 20-30 participants. Recent empirical work has highlighted the non-replicability of brain structure-function associations when using small samples ( Kharabian Masouleh et al., 2019 ). Notably, in the navigation study where a larger sample of 90 participants was utilised, no relationship between hippocampal volume and task performance was identified ( Weisberg et al., 2019 ). The mixed results may, therefore, simply be due to using insufficiently powered samples.
Second, there is a large variability across studies in how analyses were performed. For example, while several analyses included covariates to account for age, gender and total intracranial volume, others included covariates for only some of these variables, and numerous analyses involved no covariates at all. Given that these variables are known to be potential confounds (e.g. Ridgway et al., 2008 ), their inclusion (or not) as covariates could significantly influence the observed results.
Third, studies differ in terms of the level of statistical correction that was applied. For instance, some studies use more lenient statistical thresholds compared to others, and while correction for multiple comparisons is evident when testing for multiple relationships in several studies, in others it is not. Differences in statistical thresholds could not only contribute to contradictory results across the literature, but may also have increased the number of false positive findings.
Fourth, there are differences in how hippocampal anatomy was examined. Many studies investigate the hippocampus as a whole, while others divide the hippocampus into anterior and posterior portions, or focus on the volumetric ratio between the anterior and posterior segments. Different functions have been ascribed to anterior and posterior portions ( Poppenk et al., 2013 ;Strange et al., 2014 ;Zeidman and Maguire, 2016 ), and this could also have influenced extant results.
Finally, in some studies, differing hippocampal volume-task relationships have been identified for sub-groups of participants that were not apparent across the whole sample. However, these participant subgroups are not always directly compared to each other, making it difficult to have confidence in any reported differences. In addition, subgroup analyses are often deemed "exploratory " and "requiring replication ", but it is rare that such follow-up occurs.
In the current study we sought to overcome these issues in a comprehensive investigation of the relationship between hippocampal grey matter volume and scene imagination, autobiographical memory, future thinking and spatial navigation task performance in a healthy sample.
To ensure an adequate sample size, data were collected from 217 individuals with a wide range of ability. All participants performed tasks relating to each of the aforementioned cognitive functions, and hippocampal grey matter volume was derived from structural MRI scans with an isotropic voxel resolution of 0.8 mm. Analyses were first performed on each task's main outcome measure. As concerns could be raised that these four measures are over-general, for completeness, we also examined the sub-measures of each task (an additional 21 performance variables). This allowed us to investigate potential nuances that may exist in hippocampal volume-task performance relationships. As reviewed above, the mixed results in the literature made the formulation of clear hypotheses difficult. Consequently, our aim in the data analyses was to reflect the range of approaches previously pursued in the literature, and to perform a broad and inclusive evaluation of the relationship between hippocampal grey matter volume and task performance. In that vein, we also took the opportunity to implement a multivariate analysis by constructing latent variables from across the cognitive tasks. In previous work we found that, in the presence of other cognitive tasks, performance on the scene imagination, autobiographical memory and future thinking tasks all loaded onto the same latent variable . As such, investigation of the associations between latent variables and hippocampal volume may reveal correlations that are not apparent from assessing individual tasks alone.
We have reported the main outcome measures of the four cognitive tasks in two previous papers that asked different research questions to those under consideration here. As noted above, in Clark et al. (2019) we assessed the relationships between scene imagination, autobiographical memory, future thinking and navigation in terms of task performance. Using mediation analyses, we found that scores on the scene construction task explained the relationships between autobiographical memory, imagining the future and spatial navigation task performance. In Clark and Maguire (2020) , we investigated how subjective questionnaire data were related to task performance. This revealed that imagination and navigation questionnaires reflected performance on their related tasks. Memory questionnaires, on the other hand, were associated with autobiographical memory vividness. In the current study, we examined whether performance on the cognitive tasks, including the main outcome measures but also their sub-measures and latent variables, were related to hippocampal grey matter volume. The structural MRI data have not been published previously.

Participants
Two hundred and seventeen healthy individuals were recruited, including 109 females and 108 males. The age range was restricted to 20-41 years old to limit the possible effects of ageing (mean age = 29.0 years, SD = 5.60). Participants had English as their first language and reported no history of psychological, psychiatric or neurological conditions. Our aim was to assess people from the general population who would not be classed as having extreme expertise in the domains of interest. Consequently, people with vocations such as taxi driving (or those training to be taxi drivers), ship navigators, aeroplane pilots, or those with regular hobbies including orienteering, or taking part in memory sports and competitions, were excluded. Participants were reimbursed £10 per hour for taking part which was paid at study completion. All participants gave written informed consent and the study was approved by the University College London Research Ethics Committee (project ID: 6743/001) and conducted in accordance with the guidelines of the Declaration of Helsinki.

Procedure
Participants completed the study over multiple visits. Structural MRI scans were acquired during the first visit. Cognitive testing was conducted during visits two and three. The scene imagination, autobiographical memory and future thinking tasks were completed during one visit, while the navigation tasks were performed on a separate visit. All participants completed all parts of the study.

Cognitive tasks
All tasks are published and were performed and scored as per their published use. Here, for the reader's convenience, we describe each task briefly.

Scene imagination: the scene construction task
The scene construction task ( Hassabis et al., 2007 ) measures a participant's ability to mentally construct an atemporal visual scene, meaning that the scene is not grounded in the past or the future. Participants construct different scenes of commonplace settings. For each of seven scenes, a short cue is provided (e.g. imagine lying on a beach in a beautiful tropical bay) and the participant is asked to imagine the scene that is evoked and then describe it out loud in as much detail as possible. Recordings are transcribed for later scoring. Participants are explicitly told not to describe a memory, but to create a new atemporal scene that they have never experienced before.
The main outcome measure is the "experiential index " which is calculated for each scene and then averaged. In brief, it is composed of four elements: the content, participant ratings of their sense of presence (how much they felt like they were really there) and perceived vividness, participant ratings of the spatial coherence of the scene, and an experimenter rating of the overall quality of the scene.
For the scene construction sub-measures, we separately investigated the four categories that make up the content score, and also the spatial coherence rating.
To score the content, four categories of statement are identified; spatial references, entity presence, sensory description and thoughts/emotions/actions. The spatial reference category encompasses statements regarding the relative position of entities within the environment or directions relative to the participant's vantage point. The entity category is a count of how many distinct entities (e.g. objects, people, animals) were mentioned. The sensory descriptions category consists of any statements describing (in any modality) properties of an entity or the environment in general. Finally, the thoughts/emotions/actions category covers any introspective thoughts or emotional feelings as well as the thoughts, intentions, and actions of other entities in the scene. The final score of each category is the average across the seven scenes.
The imagined scenes are also examined using the spatial coherence index. After each scene is mentally constructed, participants are presented with a set of 12 statements, each providing a possible qualitative description of the imagined scene. Participants are instructed to indicate which statements they felt accurately described their construction, identifying as many or as few as they thought appropriate. Eight of the statements indicate that aspects of the scene were integrated (e.g. "I could see the whole scene in my mind's eye "), whereas four indicate that aspects of the scene were fragmented (e.g. "It was a collection of separate images "). One point is awarded for each integrated statement selected and one point taken away for each fragmented statement. This yields a score between -4 and + 8 that is then normalised around zero. Any construction with a negative score is considered to be incoherent and fragmented and scored at 0 so as not to over-penalise fragmented descriptions. The final spatial coherence index was the average score of the seven scenes, ranging between 0 (totally fragmented) and + 6 (completely integrated).
Double scoring was performed on 20% of the data. We took the most stringent approach to identifying across-experimenter agreement. Interclass correlation coefficients, with a two-way random effects model looking for absolute agreement indicated excellent agreement amongst the experimenters (minimum score of 0.9; see Supplementary Methods Table S1). For reference, a score of 0.8 or above is considered excellent agreement beyond chance.

Autobiographical memory: the autobiographical interview
In the autobiographical interview (AI; Levine et al., 2002 ) participants are asked to provide autobiographical memories from a specific time and place over four time periods -early childhood (up to age 11), teenage years (aged from 11 to 17), adulthood (from age 18 years to 12 months prior to the interview; two memories are requested) and the last year (a memory from the last 12 months). Recordings are transcribed for later scoring.
The AI has two main outcome measures; the number of "internal " and "external " details included in the description of an event. Importantly, these two scores represent different aspects of autobiographical memory recall. Internal details are those describing the event in question (i.e. episodic details), and were of primary interest here. External details describe semantic information concerning the event, or non-event information. Internal details are, therefore, thought to be hippocampaldependant, while external details are not. The two AI scores are obtained by separately averaging performance for the internal and external details across the five autobiographical memories.
For the AI sub-measures, we examined the five separate categories that comprise the internal details outcome measure, as well as considering AI vividness ratings. While external details can also be split into component categories, too few details were provided by the participants to assess each of these individually.
Internal details are composed of event, place, time, perceptual, and thoughts/emotions categories. The event category contains details regarding happenings or the unfolding of the story, including individuals present, actions and reactions. The time category refers to any details of the year, season, day or time of day wherein the event occurred. The place category contains details that localise the event, both at the general level (e.g. a city), and more specifically (e.g. to parts of a room). The perceptual category involves descriptions (in any modality) of any aspect of the event. Finally, the thoughts/emotions category includes descriptions of emotional states, thoughts or implications. The final score of each category is the average across the five memories.
AI vividness ratings were examined given a recent finding that responses on various memory questionnaires were associated with AI vividness, suggesting that vividness may be a key process in autobiographical memory recall ( Clark and Maguire, 2020 ). Vividness ratings are collected for each memory in response to the question "How clearly can you visualize this event? " on a 6-point scale from 1 (vague memory, no recollection) to 6 (extremely clear as if it's happening now). An overall vividness rating was the average of the vividness ratings provided for each autobiographical memory.
Double scoring was performed on 20% of the data, and there was excellent agreement across the experimenters (minimum score of 0.81; see Supplementary Methods Table S2).

Future thinking
The future thinking task ( Hassabis et al., 2007 ) follows the same procedure as the scene construction task, but requires participants to imagine three plausible future scenes involving themselves (an event at the weekend; next Christmas; the next time they will meet a friend). There are two main differences between the future thinking task and the scene construction task. First, unlike scenes in the scene construction task, scenes in the future thinking task involve 'mental time travel' to the future, so they have a clear temporal dimension. Second, the cues for the scene construction task are somewhat more specific than those employed in the future thinking task (see Hurley et al., 2011 for more on this issue).
The scoring procedures for both the main outcome measure and the sub-measures are the same as for the scene construction task. Double scoring identified excellent agreement across the experimenters (minimum score of 0.88; see Supplementary Methods Table S3).

Navigation
Navigation ability was assessed using the paradigm described by ( Woollett and Maguire, 2010 ). A participant watches movie clips of two overlapping routes through an unfamiliar real town (Blackrock, Dublin, Ireland) four times.
The main outcome measure for navigation performance is calculated by combining the scores from the five tasks used to assess navigational ability. The sub-measures are the performance scores of each individual task.
The tasks were as follows. First, was a movie clip recognition task; following each viewing of the route movies, participants are shown four short movie clips -two from the actual routes, and two distractors. Participants indicate whether they have seen a movie clip before or not. Second, after all four route viewings are completed, recognition memory for scenes from the routes is tested. The third task, proximity judgements, involves assessing knowledge of the spatial relationships between landmarks from the routes. Fourth, the route knowledge task, requires participants to place scene photographs from the routes in the correct order as if travelling through the town. Finally, the sketch map task involves participants drawing a sketch map of the two routes including as many landmarks as they can remember. Sketch maps are scored in terms of the number of road segments, road junctions, correct landmarks, landmark positions, the orientation of the routes and an overall map quality score from the experimenters. Double scoring was performed on 20% of the sketch maps, and there was excellent agreement amongst the experimenters (minimum of 0.89; see Supplementary Methods Table S4).

Self-report measures
2.4.1. Plymouth sensory imagery questionnaire, appearance subscale ( Andrade et al., 2014 ). The PSIQ Appearance subscale assesses the vividness of visual imagery. The subscale requires participants to imagine three scenarios: a bonfire, a sunset, and a cat climbing a tree. They then rate the visual image they generated on an 11-point scale from 0 (no image at all) to 10 (vivid as real life). Scores on the three scenarios are summed to create a total score out of 30. Responses on the PSIQ have previously been associated with performance on the scene construction, autobiographical memory, and future thinking tasks ( Clark and Maguire, 2020 ). ( Hegarty et al., 2002 ). The SBSODS assesses spatial and navigational abilities, preferences and experiences. Fifteen statements are presented, with participants indicating their level of agreement with each statement. Ratings are made on a 7-point scale from 1 (strongly agree) to 7 (strongly disagree). Seven statements are positively coded ( "I am very good at giving directions ") and eight are reverse scored ( "I don't have a very good "mental map" of my environment "). Scores are summed across the 15 statements to create a total score out of 105 (where a low score reflects good navigation ability). Responses on the SBSODS have been associated with multiple measures of navigation, including the spatial navigation task used in the current study ( Clark and Maguire, 2020 ;Hegarty et al., 2002 ).

Statistical analyses of the behavioural data
Data were summarised using means and standard deviations, calculated in SPSS v22. There were no missing data, and no data needed to be removed from any analysis.

MRI data acquisition
Three MRI scanners were used to collect the structural neuroimaging data. All scanners were Siemens Magnetom TIM Trio systems with 32 channel head coils and were located at the same imaging centre, running the same software. The sequences were loaded identically onto the individual scanners. Participant set-up and positioning followed the same protocol for each scanner. Whole brain volumetric images had an isotropic resolution of 800 m × 800 m × 800 m and were obtained using magnetisation transfer (MT) saturation maps derived from a multi-parameter mapping (MPM) quantitative imaging protocol Weiskopf et al., 2013 ). This protocol consisted of the acquisition of three multi-echo gradient acquisitions with either proton density (PD), T1 or MT weighting. Each acquisition had a repetition time, TR, of 25 ms. PD weighting was achieved with an excitation flip angle of 6°, which was increased to 21°to achieve T1 weighting. MT weighting was achieved through the application of a Gaussian RF pulse 2 kHz off resonance with 4 ms duration and a nominal flip angle of 220°. This acquisition had an excitation flip angle of 6°. The field of view was 256 mm head-foot, 224 mm anterior-posterior (AP), and 179 mm right-left (RL). The multiple gradient echoes per contrast were acquired with alternating readout gradient polarity at eight equidistant echo times ranging from 2.34 to 18.44 ms in steps of 2.30 ms using a readout bandwidth of 488 Hz/pixel. Only six echoes were acquired for the MT weighted volume to facilitate the off-resonance pre-saturation pulse within the TR. To accelerate the data acquisition, partially parallel imaging using the GRAPPA algorithm was employed in each phase-encoded direction (AP and RL) with forty reference lines and a speed up factor of two. Calibration data were also acquired at the outset of each session to correct for inhomogeneities in the RF transmit field ( Lutti et al., 2010 ;Lutti et al., 2012 ).

MRI data pre-processing
The data from the MPM protocol were processed for each participant using the hMRI toolbox ( Tabelow et al., 2019 ) within SPM12 ( www.fil.ion.ucl.ac.uk/spm ). The default toolbox configuration settings were used, with the exception that correction for imperfect spoiling was additionally enabled. The output MT saturation map used for the volumetric analyses quantified the degree of saturation of the steady state signal having accounted for spatially varying T 1 times and RF field inhomogeneity ( Helms et al., 2008 ;Weiskopf et al., 2013 ).
Each participant's MT saturation map was then segmented into grey matter probability maps using the unified segmentation approach , but with no bias field correction (since the MT saturation map does not suffer from any bias field modulation). Inter-subject registration was performed using DARTEL, a nonlinear diffeomorphic algorithm ( Ashburner, 2007 ), as implemented in SPM12. This algorithm estimates the deformations that best align the tissue probability maps by iteratively registering them with their average. The resulting DARTEL template and deformations were used to normalize the tissue probability maps to the stereotactic space defined by the Montreal Neurological Institute (MNI) template (at 1 × 1 × 1 mm resolution), with an isotropic Gaussian smoothing kernel of 8 mm full width at half maximum (FWHM).

Primary VBM analyses
Our primary analyses investigating the associations between cognitive task performance and hippocampal grey matter volume were performed using voxel-based morphometry (VBM) ( Ashburner and Friston, 2000 ;Mechelli et al., 2005 ).
First, we investigated the relationship between hippocampal volume and the main outcome measures for each of the cognitive tasks assessing scene imagination, autobiographical memory, future thinking and navigation. We then examined the associations between hippocampal volume and each of the sub-measures from these tasks.
For each performance measure, statistical analyses were carried out using multiple linear regression models. Six regressors were included in each model, the cognitive task performance measure of interest and five covariates: one each for age, gender and total intracranial volume, and two regressors to model the effects of the different scanners (only two regressors are required because one scanner is taken as the baseline, with the two regressors then modelling any effects that are different between the other scanners and the baseline scanner). The dependant variable was the smoothed and normalised grey matter volume.

Whole brain VBM
Analyses were first carried out voxel-wise across whole brain grey matter using an explicitly defined mask which was generated by averaging the smoothed grey matter probability maps in MNI space across all subjects. Voxels for which the grey matter probability was below 80% were excluded from the analysis. Two-tailed t-tests were used to investigate the relationships between cognitive task performance and grey matter volume, with statistical thresholds applied at p < 0.05 familywise error (FWE) corrected for the whole brain, and a minimum cluster size of 5 voxels.
As our main focus was on the relationship between cognitive task performance and hippocampal grey matter volume, in the main text we report only findings pertaining to the hippocampus. However, performing analyses at the whole brain level meant that we could apply the recommended statistical threshold to the analyses ( Nichols et al., 2017 ;Poldrack et al., 2008 ). Consequently any significant relationships identified would be supported by the strongest evidence. In addition, performing analyses at the whole brain level also allowed us to investigate whether any non-hippocampal brain regions were associated with cognitive task performance -these results are reported in the Supplementary Results, although there were very few.

Hippocampal ROI VBM
Following the whole brain analysis, we focused on the hippocampus using anatomical hippocampal masks. This allowed us to investigate whether any weaker relationships existed between hippocampal grey matter volume and task performance that did not reach the statistical threshold required for the whole brain analyses. The masks were manually delineated on the group-averaged MT saturation map in MNI space (1 mm × 1 mm × 1 mm) using ITK-SNAP ( www.itksnap.org ). As in Poppenk & Moscovitch (2011) and Brunec et al. (2019) , the anterior hippocampus was delineated using an anatomical mask that was defined in the coronal plane and proceeded from the first slice where the hippocampus can be observed in its most anterior extent until the final slice of the uncus. The posterior hippocampus was defined from the first slice following the uncus until the final slice of observation in its most posterior extent (see Dalton et al., 2017 for more details). The whole hippocampal mask comprised the combination of the anterior and posterior masks. The masks were examined separately for the left and right hippocampus, and as bilateral masks.
Two-tailed t-tests were used to investigate the relationships between cognitive task performance and grey matter volume within the hippocampal masks. Voxels were regarded as significant when falling below an initial whole brain uncorrected voxel threshold of p < 0.001, and then a small volume correction threshold of p < 0.05 FWE corrected for each mask, with a minimum cluster size of 5 voxels. Given that all significant results observed using the anterior and posterior unilateral masks were also evident using the bilateral masks, we report the results of the bilateral masks. Where significant results within the masks were identified, the whole brain analysis was re-examined with the threshold set to p < 0.001 uncorrected to assess whether the effects identified within the hippocampal masks were due to leakage from adjacent brain regions.

Auxiliary analyses using extracted hippocampal volumes
Following our main analyses, we performed a series of auxiliary investigations to further scrutinise potential relationships between hippocampal grey matter volume and scene imagination, autobiographical memory, future thinking and navigation task performance.
The basis of these auxiliary analyses was the hippocampal grey matter volume for each participant that was extracted using 'spm_summarise'. The whole, anterior and posterior anatomical hippocampal masks (described above) were applied to each participant's smoothed and normalised grey matter volume maps, and the total volume within each mask was extracted. It has been suggested that the posterior:anterior hippocampal volume ratio shows stronger relationships to memory and navigation performance than raw anterior and posterior hippocampal volumes alone ( Brunec et al., 2019 ;Poppenk and Moscovitch, 2011 ). Therefore, in line with these studies, we also calculated each participant's posterior:anterior hippocampal volumetric ratio. We did this by dividing the posterior hippocampal volume by the anterior hippocampal volume, providing a ratio whereby values greater than 1 were indicative of a larger posterior over anterior hippocampus and those less than 1 indicated a larger anterior over posterior hippocampus.
Overall, across all our auxiliary investigations (described in detail below) we examined the relationships between four different measures of hippocampal grey matter volume (whole, anterior, posterior, ratio) for each of our 26 task performance measures. Given the extensive nature of these analyses, we needed to ensure that appropriate correction was made to the statistical thresholds. Typically, FWE corrections (such as Bonferroni) are employed in this regard. However, FWE correction acts to ensure that the probability of a single false positive result remains at the critical alpha (typically p < 0.05); a highly effective method if the need to avoid false positives is high. However, this level of control comes at a price, especially when controlling for a large number of statistical tests, as it means that the effects of any single variable have to be particularly large to reach the adjusted significance level. Thus, while false positives are avoided, the risk of incorrectly rejecting a true result is introduced.
An alternative method of statistical correction is to control the false discovery rate ( Benjamini and Hochberg, 1995 ) which influences the proportion of significant results identified that are false positives. Therefore, a FDR of p < 0.05 allows for 5% false positive results across the tests performed. Controlling for the FDR is, therefore, more lenient than performing FWE corrections, as it accommodates the potential for a greater number of false positive results but, by doing so, also reduces the chances of incorrectly rejecting a true result. Applying our statistical correction in terms of the FDR provided, therefore, a compromise between the need to adjust our statistical thresholds for multiple comparisons, while not removing all possibility of identifying true relationships amongst the data features.
Across the auxiliary analyses, the FDR was set to p < 0.05 for each family of tests. A family was defined as all the tests performed for each set of auxiliary analyses for one cognitive task. For example, when testing for relationships between hippocampal volume and scene construction task performance, we had 4 measures of hippocampal volume (whole, anterior, posterior, ratio) and 6 measures of performance (experiential index, spatial references, entities present, sensory descriptions, thoughts/emotions/actions, spatial coherence), totalling 24 separate statistical tests. The FDR was thus set to p < 0.05 for these 24 tests.
The following auxiliary analyses were performed in SPSS v25 with FDR thresholding and Benjamini-Hochberg adjusted p values calculated using the resources provided by McDonald (2014) .

Partial correlations using extracted hippocampal volumes
We performed partial correlation analyses between each of the hippocampal volumetric measures (whole, anterior, posterior, ratio) and each of our measures of task performance. As with the primary VBM analyses, age, gender, total intracranial volume and MRI scanner were included as covariates.
2.9.2. Sub-group analyses using extracted hippocampal volumes 2.9.2.1. Effects of gender. To directly investigate the effects of gender, we divided the sample into male ( n = 108) and female ( n = 109) participants. We first performed separate partial correlations for the two participant groups for each of the hippocampal volumetric measures (whole, anterior, posterior, ratio) and each of our measures of task performance. Age, total intracranial volume and MRI scanner were included as covariates. If a significant volume-task relationship was identified in either the male or female group, our intention was to then compare the male and female relationships directly. However, this was never required (see Section 3.3.2 ).

Median split: direct comparison.
Participants were first allocated to groups depending on their performance on a task compared to the median score for that task. If their score was less than or equal to the median value they were assigned to the low performing group, scores greater the median resulted in allocation to the high performing group. Due to the distribution of performance scores, the navigation movie clip recognition test could not be divided into groups (as most participants performed at ceiling) and so was not included in these analyses. Our four measures of hippocampal volume were then compared between the low and high performing groups for each task using univariate ANCOVAs with the hippocampal volume measure as the dependant variable, performance group as the main predictor, and age, gender, total intracranial volume and MRI scanner as covariates.
2.9.2.3. Median split: partial correlations. We next investigated if different associations between hippocampal volume and cognitive task scores existed in the high and low performing participants. As before, because the navigation movie clip recognition test could not be divided into groups it was not included in these analyses. For each task, partial correlations were performed separately for the two groups between the hippocampal volume measures and the task performance measures, with age, gender, total intracranial volume and MRI scanner included as covariates. If a significant relationship was identified in either the low or high performance group, our intention was to then compare the group correlations directly. However, this was never required (see section 3.3.3).

Best versus worst.
The best and worst performers for each task were identified. This was approximately the top and bottom 10% (n ∼ 20 in each group). The exact number of participants allocated to each group varied for each task depending on the distribution of performance scores and to ensure the number of participants in each group was as similar as possible. As the navigation movie clip recognition test could not be split into approximately equal groups of best and worst performers it was not included in these analyses. Our four measures of hippocampal volume were then compared between the best and worst performing participants for each task using univariate ANCOVAs with the hippocampal volume measure as the dependant variable, performance group as the main predictor, and age, gender, total intracranial volume and MRI scanner as covariates.
2.9.2.5. Effects of self-reported ability. Self-reported ability was assessed using the PSIQ ( Andrade et al., 2014 ) for scene imagination, autobiographical memory and future thinking, and the SBSODS ( Hegarty et al., 2002 ) for navigation. Our aim here was to investigate whether the interaction between self-reported ability and hippocampal volume was associated with task performance, following the method of He and Brown (2020) . As such, eight interaction terms were created: responses on the PSIQ and SBSODS by the four measures of hippocampal grey matter volume. Partial correlation analyses were then performed between the interaction terms and task performance, with age, gender, total intracranial volume and MRI scanner as covariates as in the previous analyses, with the additional covariates of the relevant questionnaire and hippocampal volume measurement to ensure that any identified relationships were due to the interaction of self-report and volume and not either individually.

Multivariate analyses using latent variables
The analyses detailed above are all univariate and task-based. A multivariate approach using latent variables derived from across the cognitive data may provide additional insights into associations between task performance and hippocampal grey matter volume. For example, this method offered increased statistical power and enabled data driven investigations, eschewing a reliance on conceptually driven cognitive constructs, and so had the potential to increase the generalisability of findings beyond the specific tasks in question.
To identify latent variables across the cognitive data, two Principal Component Analyses (PCA) were performed using SPSS v25 with varimax rotation and an eigenvalue cut-off of 1. The first PCA investigated possible latent variables across the four main outcome measures of the scene imagination, autobiographical memory, future thinking and navigation tasks, and the second PCA examined all 21 task sub-measures. Factor scores for the identified latent variables were extracted using the Anderson-Rubin method. The relationships between the latent variable factor scores and hippocampal grey matter volume were examined using the VBM and partial correlation approaches described above.

Cognitive task performance
A summary of the outcome measures for the cognitive tasks is shown in Table 1 . A wide range of scores was obtained for all variables with the exception of navigation movie clip recognition, where performance was close to ceiling.

Whole brain VBM
As our main focus was on the relationship between cognitive task performance and hippocampal grey matter volume, here we report only findings pertaining to the hippocampus -any regions identified outside the hippocampus are reported in the Supplementary Results. No significant relationships between cognitive task performance and hippocampal grey matter volume were identified for any of the main outcome measures of the tasks assessing scene imagination, autobiographical memory, future thinking or navigation. This was also the case for the sub-measures of these tasks.

Hippocampal ROI VBM
No relationships between cognitive task performance and hippocampal grey matter volume were identified using any of the hippocampal masks for the main outcome measures from the tasks examining scene imagination, autobiographical memory, future thinking or navigation.
No associations were observed between performance on the scene construction task sub-measures and hippocampal grey matter volume using any of the hippocampal masks.
For the AI sub-measures, one potential relationship was identified. When using the posterior hippocampal mask, a cluster of 10 voxels, located towards the anterior portion of the posterior mask in the left hippocampus, was found to be negatively associated with the number of internal place references ( Fig. 1 a; peak coordinates = − 34, − 23, − 12, peak t = 3.28, p FWE posterior hippocampus ROI corrected = 0.046). This relationship, however, was not significant when correcting for the whole hippocampus mask (p FWE whole hippocampus ROI corrected = 0.069). Reducing the statistical threshold to p < 0.001 uncorrected for the whole brain confirmed the cluster was confined to the hippocampus and not due to leakage from adjacent brain regions ( Fig. 1 b). However, visualisation at this lower threshold also demonstrated that the identified relationship was not unique to the hippocampal voxels, with a number of other brain For all panels the results are shown on the averaged brain of the whole sample ( n = 217) and the coloured bar indicates the t-value associated with each voxel. (a) Negative association between the Autobiographical Interview (AI) internal place sub-measure and grey matter volume limited to the bilateral hippocampus mask -10 hippocampal voxels were identified, significant ( p < 0.05) only when correcting for the posterior hippocampal mask. (b) Negative association between the AI internal place sub-measure and grey matter volume at p < 0.001 uncorrected for the whole brain. (c) Positive association between navigation route knowledge and grey matter volume limited to the bilateral hippocampus mask; 17 hippocampal voxels were identified, significant ( p < 0.05) only when correcting for the anterior hippocampal mask. (d) Positive association between navigation route knowledge and grey matter volume at p < 0.001 uncorrected for the whole brain, showing that the hippocampal voxels are in fact located on the edge of a cluster centred on the right anterior fusiform and parahippocampal cortices. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) regions also being negatively associated with the AI internal place references at this liberal statistical threshold.
For the future thinking sub-measures, no associations were observed between cognitive task performance and hippocampal grey matter volume.
Considering the navigation sub-measures, one potential relationship was identified. When using the anterior hippocampal mask, a cluster of 17 voxels, located at the edge of the mask in the right hippocampus, was found to be positively associated with performance on the navigation route knowledge sub-measure ( Fig. 1 c; peak coordinates = 27 − 8 − 29, peak t = 3.31, p FWE anterior hippocampus ROI corrected = 0.028). This relationship, however, was not significant when correcting for the whole hippocampus mask (p FWE whole hippocampus ROI corrected = 0.062). Reducing the statistical threshold to p < 0.001 uncorrected for the whole brain clearly indicated that the identified hippocampal voxels were located on the edge of a larger cluster that was in fact centred on the right anterior fusiform and parahippocampal cortices ( Fig. 1 d, cluster size = 1557, peak coordinates = 29 − 5 − 37, peak t = 3.87).

Auxiliary analyses using extracted hippocampal volumes
We next performed a series of auxiliary analyses to further investigate possible relationships between cognitive task performance and hippocampal grey matter volume. The latter was extracted for each participant using the predefined anatomical hippocampal masks. The mean grey matter volume for the whole hippocampus was 4456.49 mm 3 ( SD = 373.74 mm 3 ), for the anterior mask was 2303.70 mm 3 ( SD = 215.07 mm 3 ) and for the posterior mask was 2152.79 mm 3 ( SD = 182.66 mm 3 ), with a mean posterior/anterior ratio of 0.94 ( SD = 0.057), where a ratio less than 1 indicates greater anterior relative to posterior hippocampal volume.

Partial correlations using extracted hippocampal volumes
No relationships were identified between hippocampal grey matter volume and performance for any of the main or sub-measures of the tasks assessing scene imagination, autobiographical memory, future thinking or navigation (see Supplementary Results Table S5 for details). These partial correlation findings, therefore, support those of the primary VBM analyses, as well as extending this to include the posterior:anterior hippocampal volume ratio.

Sub-group analyses using extracted hippocampal volumes 3.3.2.1. Effects of gender.
We next divided the sample into male and female participants. No relationships between cognitive task performance and hippocampal grey matter volume were observed in either group for any of the main or sub-measures (see Supplementary Results Tables S6 and S7 for details). As no relationships were detected in either group, direct comparison of the relationships was not relevant.

Median split: direct comparison.
Details of the groups that were created when the sample was divided by median splits of performance are provided in the Supplementary Results (Table S8). Comparing hippocampal grey matter volume measurements between the two groups determined by their median performance on each main and sub-measure found no differences between the groups for any task (see Supplementary Results Table S9 for details).

Median split: partial correlations.
No significant relationships between hippocampal grey matter volume and cognitive task main and sub-measures were identified in either the low or high performing groups (see Supplementary Results Tables S10 and S11 for details). As no relationships were detected in either group, direct comparison of the relationships was not relevant.

Best versus worst.
Details of the groups that were created when the sample was divided in terms of best and worst performers are provided in the Supplementary Results (Tables S12). Taking the best and worst performing participants for each cognitive task main and submeasure and comparing the hippocampal grey matter volume measurements between the two groups also showed no significant differences between the groups (see Supplementary Results Table S13 details).

Effects of self-reported ability.
The results of these analysis are provided in the Supplementary Results (Table S14).
No significant associations were observed between any of the scene construction task main and sub-measures and the interaction between self-reported ability (determined by the PSIQ) and hippocampal grey matter volume.
For autobiographical memory, a potential relationship was identified for AI vividness, but no other main or sub-measure. For AI vividness, an association was observed between AI vividness ratings and the PSIQ by whole hippocampus volume interaction ( r = 0.24, p FDR corrected = 0.016), and with the PSIQ by posterior hippocampal volume interaction ( r = 0.22, p FDR corrected = 0.013). This suggests that selfreported imagery ability may be influencing the relationship between hippocampal volume and AI vividness ratings.
To better understand these interactions, simple slopes and region of significance analyses were performed using the interactions toolbox ( Long, 2019 ) in R 3.6.1 ( R Core Team, 2017 ), with regions of significance calculated using the Johnson-Neyman technique ( Johnson and Fay, 1950 ).
First, we focused on the relationship between AI vividness and the PSIQ by whole hippocampus interaction. Repeating the partial correlation analysis as a regression (with AI vividness as the dependant variable, PSIQ, whole hippocampal volume and the PSIQ by whole hippocampal volume interaction as predictors, and age, gender, scanner and ICV as covariates), identified that for the PSIQ by whole hippocampus volume interaction, the unstandardized beta value was 0.0000710. Performing simple slopes to decompose this relationship identified that for individuals who were 1 standard deviation below the mean PSIQ score, the unstandardized beta value was − 0.000859 ( t = − 4.15, p FDR corrected < 0.001). For individuals with a mean PSIQ score, the unstandardized beta value was − 0.000459 ( t = − 2.78, p FDR corrected = 0.009), while for individuals who were 1 standard deviation above the mean PSIQ score, the unstandardized beta value was − 0.0000583 ( t = − 0.28, p FDR corrected = 0.92). As such, in individuals with low PSIQ scores, there was a negative relationship between whole hippocampal volume and AI vividness that was not evident in individuals with high PSIQ scores. A region of significance analysis found that the negative relationship existed in individuals with PSIQ scores less than 23.65.
We then investigated the relationship between AI vividness and the PSIQ by posterior hippocampus interaction. Repeating the partial correlation analysis as a regression (with AI vividness as the dependant variable, PSIQ, posterior hippocampal volume and the PSIQ by posterior hippocampal volume interaction as predictors, and age, gender, scanner and ICV as covariates), identified that for the PSIQ by posterior hippocampus volume interaction, the unstandardized beta value was 0.000157. Performing simple slopes identified that for individuals who were 1 standard deviation below the mean PSIQ score, the unstandardized beta value was − 0.00180 ( t = − 4.53, p FDR corrected < 0.001). For individuals with a mean PSIQ score, the unstandardized beta value was − 0.000919 ( t = − 3.24, p FDR corrected = 0.003), while for individuals who were 1 standard deviation above the mean PSIQ score, the unstandardized beta value was − 0.0000343 ( t = − 0.098, p FDR corrected = 0.92). As with the whole hippocampal volume interaction, in individuals with low PSIQ scores, there was a negative relationship between posterior hippocampal volume and AI vividness that was not apparent in individuals with high PSIQ scores. A region of significance analysis found that the negative association existed in individuals with PSIQ scores less than 24.13.
For all of the future thinking and navigation main and sub-measures, no significant findings were observed when investigating the association between cognitive task performance and the interaction between selfreported ability and hippocampal grey matter volume.

Multivariate analyses using latent variables
The detailed outcomes of the multivariate analyses can be found in the Supplementary Results. In summary, performing a PCA with the four main outcome measures of the scene imagination, autobiographical memory, future thinking and navigation tasks resulted in a single component that explained 56.05% of the variance (Supplementary Results Table S15). Conducting a VBM analysis with this latent variable identified no significant relationship between hippocampal grey matter volume and the latent variable factor scores, both at the whole brain level and when using the hippocampal ROIs. Performing partial correlations between the extracted hippocampal grey matter volumes and the latent variable factor scores also identified no significant associations (Supplementary Results).
The second PCA using all 21 task sub-measures identified 6 components that explained 65.01% of the variance (Supplementary Results Tables S16 and S17, and Figure S1). One reassuring finding of this PCA was the broad correspondence of the components to the cognitive tasks. For example, all of the navigation sub-measures loaded onto Component 2, and all of the autobiographical memory sub-measures, excluding internal time, were placed in Component 5. The scene construction and future thinking sub-measures showed greater interrelatedness, which is not surprising given the close similarity of the task requirements. Component 1 comprised the content sub-measures of both tasks, and Component 3 contained their spatial coherence indices, as well as autobiographical memory vividness ratings. Component 4 encompassed the thoughts and emotions elements of scene construction, future thinking and autobiographical memory. Finally, Component 6 corresponded predominantly to the autobiographical memory sub-measure of internal time. We then examined whether there were any relationships between these latent constructs and hippocampal grey matter volume. Using VBM (at both the whole brain level and in the anatomical ROIs) or when performing partial correlations using the extracted hippocampal grey matter volumes (Supplementary Results Table S18), no significant relationships between task performance and hippocampal grey matter volume were evident.

Discussion
Marked differences exist across healthy individuals in their ability to imagine visual scenes, recall autobiographical memories, think about the future and navigate in the world. One possible source of this variance may be an individual's hippocampal grey matter volume. Here, however, in a large sample of young, healthy adult participants, we found little evidence that hippocampal grey matter volume was related to task performance. This was the case across different methodologies (VBM and partial correlations), whole brain or hippocampal ROIs, different sub-groups of participants (divided by gender, task performance and self-reported ability) and when examining latent variables derived from across the cognitive tasks. Hippocampal grey matter volume may not, therefore, significantly influence performance on tasks known to require the hippocampus in healthy people.
Of the 52 main VBM analyses we performed, only two significant associations between hippocampal grey matter volume and task performance were noted. First, a negative association between a cluster of ten voxels and the measure of AI internal place was observed. However, this result was only significant when correcting for the posterior hippocampal mask, and was not replicated when a partial correlation analysis was performed on the extracted hippocampal volume data. Indeed, the correlation coefficient between the posterior hippocampal volume and AI internal place was only − 0.053. In the second result, a cluster of seventeen voxels was positively associated with navigation route knowledge when correcting for the anterior hippocampal mask only. However, as shown in Fig. 1 d, these voxels are clearly on the edge of a larger cluster that was in fact centred on the anterior fusiform and parahippocampal cortices. Overall, given the number of individual analyses performed and the nature of the findings, neither of these results is particularly convincing.
Of the 712 auxiliary analyses performed, only two were significant -AI vividness ratings and the PSIQ by whole hippocampal volume interaction, and AI vividness ratings and the PSIQ by posterior hippocampal volume interaction. Further examination of these interactions found that in individuals with low PSIQ scores (less than 23.65 and 24.13 respectively), there was a negative relationship between whole and posterior hippocampal volume and AI vividness. In other words, in people with poor visual imagery, the smaller their posterior hippocampal volume the more vivid they rated their autobiographical memories. However, this finding seems to contradict the notion of memory vividness being a key factor in autobiographical memory recall (e.g., Clark and Maguire, 2020 ;Gilboa et al., 2004 ;Richter et al., 2016 ). Additional investigations focusing on individuals with low visual imagery ability may provide further clues about this relationship. However, as it stands, interpretations of these results should be made with caution. In particular, it should be borne in mind that the effect size of the interactions was small (r values of 0.24 and 0.22) and observing only two significant results across the large number of analyses performed is well within the number of false positive results expected, even with statistical correction applied.
The overwhelming number of our analyses produced null results. We are, of course, mindful that null results can be difficult to interpret and that an absence of evidence is not evidence of absence. Nevertheless, several factors motivate confidence in interpreting the current results as showing little evidence of a relationship between hippocampal grey matter volume and performance on our tasks of interest.
First, we had a sample of 217 participants. Samples of this size are much better powered to find true and reliable relationships between brain structure and cognitive ability than the smaller sample sizes (typically between 20 and 30 participants) often tested in previous studies ( Kharabian Masouleh et al., 2019 ;Kong and Francks, 2019 ). Indeed, the only other study to employ a substantial sample was that of Weisberg et al. (2019 ;n = 90) who also found no relationship between hippocampal grey matter volume and navigation task performance.
Second, our investigation was wide-ranging and thorough with 26 different outcome measures, each relating to different aspects of scene imagination, autobiographical memory, future thinking and navigation. It seems unlikely that if a relationship existed between hippocampal grey matter volume and these cognitive functions that it would not have been identified by at least one of these performance measures.
Third, our sample was specifically recruited to provide a wide variance in our measures of interest. With the exception of the navigation movie clip recognition sub-measure (where performance was at, or very close to, ceiling) there were large individual differences in ability on all of the tasks. In addition, hippocampal grey matter volume was heterogeneous across the sample. We were, therefore, well-placed to identify any potential relationships between hippocampal volume and task performance, yet few were found.
Fourth, we investigated possible relationships across the whole brain, but also using anatomically-defined hippocampal ROIs. Despite the reduced level of statistical correction required when constraining analyses to hippocampal anatomical ROIs, we still found no significant relationships. It could be argued that the already liberal statistical thresholds should be reduced even further, but this would simply increase the chances of false positive results, given our large sample size and the considerable number of analyses that were performed.
Fifth, similar outcomes were observed using two different analysis methods. While our primary analyses involved VBM, by extracting the whole, anterior and posterior hippocampal volumes for each participant and additionally calculating the posterior:anterior volume ratio, we could also perform partial correlations between each of the volume measurements and our measures of task performance. The results, therefore, do not seem to be consequent upon a specific analysis technique, but instead there was a consistent pattern of null results.
Sixth, we also divided the sample into multiple sub-groups, investigating male and female participants, assessing whether there was any effect of cognitive performance (using both a median split, and by examining only the best and worst performers for each task measure), and self-reported ability. No significant associations between hippocampal volume and performance were identified.
Finally, we also conducted multivariate analyses, using PCA to identify latent variables from across the cognitive tasks. This approach offered increased statistical power and enabled data driven investigations, eschewing a reliance on conceptually driven cognitive constructs, and so had the potential to increase the generalisability of findings beyond the specific tasks in question. To the best of our knowledge, formal PCA analyses using the sub-measures of the tasks employed here have not been reported previously. It was reassuring to find broad correspondence between the components identified by the PCA and the cognitive tasks, and to note that the spatial coherence and the emotional aspects of mental representations cut across cognitive tasks. Of particular relevance for the questions under consideration here, and as with the other analyses, no significant associations between hippocampal grey matter volume and the identified latent variables were observed.
How do our null results correspond to the existing literature? We are not alone in reporting a lack of association between hippocampal grey matter volume and cognitive ability in healthy adults. No previous studies have investigated the hippocampal volume-scene imagination relationship. Considering memory, a meta-analysis of 33 studies identified no relationship between performance on laboratory-based memory tasks and hippocampal volume ( Van Petten, 2004 ). The one previous study that investigated future thinking found no evidence of an association between hippocampal volume and future thinking performance -with voxels instead being located outside of the hippocampus ( Yang et al., 2020 ). Regarding navigation, two previous studies have found no relationship between hippocampal volume and navigational ability in samples drawn from the general population ( Maguire et al., 2003 ;Weisberg et al., 2019 ). Given the well-known "file drawer " problem ( Franco et al., 2014 ;Rosenthal, 1979 ), the number of null results is likely to be higher than those published in the literature. Our findings, therefore, of no relationship between hippocampal volume and task performance, are not unique to the current sample.
We appreciate that other studies have reported hippocampal volumetask performance relationships. For example, a positive association was identified between posterior hippocampal grey matter volume and recollective memory ability ( Poppenk and Moscovitch, 2011 ), and both posterior and anterior hippocampal grey matter volume have been associated with various measures of spatial navigation, sometimes within specific subgroups (e.g. Bohbot et al., 2007 ;Brown et al., 2014 ;Brunec et al., 2019 ;Chrastil et al., 2017 ;Hartley and Harlow, 2012 ;He and Brown, 2020 ;Schinazi et al., 2013 ;Sherrill et al., 2018 ). These disparate results, however, may have been influenced by methodological factors. As alluded to, the sample sizes used were typically small, with the largest being only a quarter of the current study (see Kharabian Masouleh et al., 2019 for more on this issue). Moreover, previous studies have not always been optimally analysed. For example, in some cases relevant covariates were not included, or appropriate levels of statistical correction were not applied, or results were found only in specific smaller sub-groups. Any of these issues could have affected the reliability of the observed results.
The current results differ from those involving licensed London taxi drivers, where increased posterior, but decreased anterior, hippocampal grey matter volume has been consistently identified compared to a variety of control groups, including the same individuals tested before and after they acquired The Knowledge ( Maguire et al., 2000 ;Maguire et al., 2006 ;Woollett et al., 2008 ;Woollett and Maguire, 2011 ). The training to become a London taxi driver takes 3-4 years and approximately 50% of trainees fail to qualify. Those who successfully qualify as London taxi drivers are, therefore, a unique population with extreme spatial navigation expertise. Consequently, the divergence between our null results and those in London taxi drivers is not surprising. Indeed, interpreting the current null results (and those of Maguire et al., 2003 , andWeisberg et al., 2019 ) together with the London taxi driver findings may help to refine our understanding of the relationship between hippocampal volume and navigation ability. Together these results suggest that measurable hippocampal volume changes associated with ability may only be apparent following extreme spatial training rather than being a marker of ability in the general population.
Why might measureable hippocampal volume changes only be detected in the context of extreme expertise? As we did not address this question in the current study we can only speculate. Taking the example of navigation, the vast majority of humans never need to acquire the huge amount of spatial layout knowledge associated with being a licensed London taxi driver. Hence, the hippocampus has likely evolved to accommodate representations of environments that we use day-today, with sufficient capacity for a wide range of navigation experiences. Moreover, proficiency, especially at the extreme end of the scale, may come at a cost. For instance, in licensed London taxi drivers, grey matter volume is increased in the posterior hippocampus relative to control groups, but this is accompanied by decreased anterior hippocampal volume ( Maguire et al., 2000 ;Maguire et al., 2006 ), and also poorer performance on some anterograde visuospatial memory testes  ). Consequently, hippocampal volume may reflect a balance between sufficient functionality for the activities that a typical human undertakes whilst not compromising other hippocampal operations in the service of extreme expertise.
While we found that hippocampal volume seemed to be unrelated to task performance in the general population, it is likely that some neural variations exist that can, at least in part, explain the marked differences in performance observed in tasks such as those we employed. One possibility is that while volumetric differences for the whole hippocampus, or even its anterior or posterior portions, are not related to performance, effects might be observed for the volume of specific hippocampal subfields. For example, larger CA3 volume has been associated with reduced subjective confusion when recalling highly similar memories ( Chadwick et al., 2014 ). In addition, subiculum volume and combined left dentate gyrus/CA2,3 volume has been positively associated with the number of internal details on the autobiographical interview ( Palombo et al., 2018 ). However, while these studies highlight the potential for relationships between hippocampal subfield volumes and task performance, they also suffer from the limitation of small sample sizes, and so additional work is required to address this issue further.
Another possibility is that the volume of brain regions other than the hippocampus are related to performance on our tasks of interest. Regions including, but not limited to, the parahippocampal, retrosplenial, posterior cingulate, parietal, and medial prefrontal cortices have been associated with scene imagination, autobiographical memory, future thinking and navigation (e.g. Ciaramelli et al., 2017 ;Hassabis et al., 2007 ;Ramanan et al., 2018 ;Schacter et al., 2012 ;Simons et al., 2010 ;Stawarczyk and D'Argembeau, 2015 ). However, as reported in the Supplementary Results, when using a p < 0.05 FWE statistical threshold across the whole brain we observed only one relationship -a negative association between 22 voxels in the right parahippocampal cortex and AI vividness ratings. More in-depth analyses using specific ROIs may highlight other relationships (e.g. the positive association between the parahippocampal/fusiform cortices and navigation route knowledge sub-measure we observed when using an uncorrected threshold), but these ROIs would need to be motivated in advance to justify their use, going beyond the remit of the current study.
Advances in neuroimaging mean it is now possible to investigate brain structure in more detail than just grey matter volume. Quantitative neuroimaging techniques  can be used to model different properties of tissue microstructure, in particular that of myelination and iron. Studies using these techniques have, for example, found relationships between myelination and iron in ageing ( Callaghan et al., 2014 ) and, in older participants, increased myelination and decreased iron tissue content has been associated with verbal memory performance in the ventral striatum ( Steiger et al., 2016 ). Tissue microstructure, both at the whole hippocampus level or within specific subfields, or even in regions outside of the hippocampus, may offer explanations about the variance observed in scene imagination, autobiographical memory, future thinking and navigation ability and will be an interesting target for future work.
The connectivity between brain regions may also be an important factor in explaining individual differences in task performance. Structural measures of connectivity, for example diffusion-weighted imaging of hippocampal white matter pathways, have been associated with both autobiographical memory and navigation ability ( Hodgetts et al., 2017 ;Iaria et al., 2008 ;Metzler-Baddeley et al., 2011 ). Moreover, functional connectivity between the medial temporal lobe and parietal-occipital regions may be related to questionnaire measures of autobiographical memory ( Sheldon et al., 2016 ), while the functional connectivity of the posterior hippocampus and retrosplenial cortex has been linked with responses on navigation questionnaires ( Sulpizio et al., 2016 ). Both structural and functional connectivity measures may, therefore, be associated with task performance, perhaps to a greater extent than hippocampal grey matter volume, and represent important avenues to explore in future studies.
As with functional connectivity, patterns of activity during rest or task-based fMRI may be related to individual differences in perfor-mance. For example, higher vividness ratings when performing mental imagery tasks have been reported to correlate with activity in the fusiform, posterior cingulate, parahippocampal ( Fulford et al., 2018 ) and visual ( Dijkstra et al., 2017 ;Keogh et al., 2020 ) cortices. In relation to autobiographical memory, more vivid recall has been linked with higher levels of hippocampal ( Gilboa et al., 2004 ) and precuneus ( Richter et al., 2016 ) activity, while superior memory precision has been related to increased angular gyrus engagement during task performance ( Richter et al., 2016 ). Better performance during navigation tasks has also been associated with increased hippocampal activity ( Hartley et al., 2003 ;Ohnishi et al., 2006 ). Individual differences in ability may, therefore, be detectable in fMRI activity during task performance although, as with the structural MRI studies outlined in Section 1 , there is a need for large-scale functional neuroimaging studies to examine this further.

Conclusion
In this study we attempted to circumvent issues known to affect studies of healthy people that assess relationships between hippocampal grey matter volume and performance on cognitive tasks that require the hippocampus. Having tested a large sample of healthy participants with a wide range of ability, no credible associations were identified. Consequently, variability in hippocampal grey matter volume seems unlikely to explain individual differences in scene imagination, autobiographical memory, future thinking and spatial navigation performance in the general population. It may be that in healthy participants it is only in extreme situations, as in the case of licensed London taxi drivers, that measurable ability-related hippocampus volume changes are exhibited.

Data availability
Data and materials will be open-access once the construction of a dedicated data-sharing portal has been finalised. In the meantime, requests for data and materials can be sent to e.maguire@ucl.ac.uk.

Funding information
The authors were supported by a Wellcome Principal Research Fellowship to E.A. Maguire (101759/Z/13/Z) and the Centre by a Strategic Award from Wellcome (203147/Z/16/Z). The funding sources had no involvement in study design, the collection, analysis or interpretation of data, in the writing of the report or in the decision to submit the article for publication.