Mental health in the UK Biobank: A roadmap to self‐report measures and neuroimaging correlates

Abstract The UK Biobank (UKB) is a highly promising dataset for brain biomarker research into population mental health due to its unprecedented sample size and extensive phenotypic, imaging, and biological measurements. In this study, we aimed to provide a shared foundation for UKB neuroimaging research into mental health with a focus on anxiety and depression. We compared UKB self‐report measures and revealed important timing effects between scan acquisition and separate online acquisition of some mental health measures. To overcome these timing effects, we introduced and validated the Recent Depressive Symptoms (RDS‐4) score which we recommend for state‐dependent and longitudinal research in the UKB. We furthermore tested univariate and multivariate associations between brain imaging‐derived phenotypes (IDPs) and mental health. Our results showed a significant multivariate relationship between IDPs and mental health, which was replicable. Conversely, effect sizes for individual IDPs were small. Test–retest reliability of IDPs was stronger for measures of brain structure than for measures of brain function. Taken together, these results provide benchmarks and guidelines for future UKB research into brain biomarkers of mental health.


| INTRODUCTION
Over the years, there has been a multitude of neuroimaging studies that aimed to investigate alterations in the brain in relation to affectbased mental health (e.g., anxiety and depression). The Major Depressive Disorder (MDD) literature reports structural changes in the cortico-limbic network (Klauser et al., 2015), insula and hippocampus (Stratmann et al., 2014), as well as functional changes in the Default Mode Network (DMN; Tozzi et al., 2021;Yu et al., 2019), medial temporal gyrus, and caudate (Ma et al., 2012). In Generalized Anxiety Disorder (GAD), similar functional changes are seen in the DMN (Andreescu et al., 2011) and ventromedial prefrontal cortex (Cha et al., 2014), as well as structural changes in the DMN (Wolf et al., 2016) and amygdala (He, Xu, Zhang, & Zuo, 2016). However, the literature on neural correlates of MDD contains some inconsistent findings. For example, some studies report greater functional connectivity in the DMN (Greicius et al., 2007;Sheline, Price, Yan, & Mintun, 2010) while others report lesser functional connectivity in the same network (Bluhm et al., 2009;Tozzi et al., 2021;Yan et al., 2019). A potential reason for inconsistent findings is the small sample size of most of these Rosie Dutt and Kayla Hannon equally shared the joint first authorship. studies. The broader fields of psychology and neuroimaging are recognizing that small sample sizes lead to inflated effect sizes that often result from sampling variability and therefore do not replicate in new data (Button et al., 2013;Grady, Rieck, Nichol, Rodrigue, & Kennedy, 2021;Marek et al., 2020;Poldrack et al., 2017;Yarkoni, 2009). Larger sample sizes are therefore needed to obtain reliable insights into the neural correlates of mental health.
One option to achieve larger sample sizes is to conduct meta-analyses. Meta-analyses use results from prior studies as their input and employ quantitative methods to pool data across studies and test for consensus (Müller et al., 2018). A meta-analysis on resting-state functional connectivity in MDD showed hypo-connectivity in frontoparietal and salience networks and hyper-connectivity in the DMN (Kaiser, Andrews-Hanna, Wager, & Pizzagalli, 2015). Another meta-analysis showed that there are common gray-matter volume changes in MDD which are also seen in bipolar disorder . In GAD, metaanalyses have also been able to confirm consistent dysregulation of affective control related to numerous networks, which provides support for an integrated model of brain network changes (Xu et al., 2019). While these meta-analyses aid to establish consensus on brain correlates of mental health (Wager, Lindquist, & Kaplan, 2007), they can be limited in their scope. This is because the input studies surveyed in meta-analyses often adopt narrow inclusion and exclusion criteria for the patient sample, which limits cross-diagnostic mental health research. Additionally, due to the lack of availability of whole-brain statistical result images from prior studies, coordinate-based meta-analyses are often undertaken which are limited in their spatial precision (Müller et al., 2018). Furthermore, meta-analyses suffer from publication bias (only including effect sizes from published significant studies; Thornton & Lee, 2000), language bias (only including papers written in English; Egger et al., 1997), and selective outcome reporting (input-papers selectively publish only significant variables; Hutton & Williamson, 2000;Kirkham et al., 2010), which can lead to inflated meta-analytical results (Sterne, Egger, & Smith, 2001). These inherent limitations of meta-analyses may explain why disagreement persists within even meta-analytical work, with a recent study showing hypo-connectivity (rather than hyperconnectivity) in the core DMN in patients with depression (Tozzi et al., 2021).
Consequently, in recent years, there has been a move to accrue larger neuroimaging datasets such as the Young Adult and Lifespan Human Connectome Projects (HCP; Harms et al., 2018;Van Essen et al., 2013), Connectomes Related to Human Disease studies (CRHD; Tozzi et al., 2020), UK Biobank (UKB; Miller et al., 2016;Sudlow et al., 2015), Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA; Schmaal et al., 2017), and Adolescent Brain Cognitive Development study (ABCD; Casey et al., 2018). The increased statistical power afforded by these datasets enables studies to approximate the true effect (Marek et al., 2020). Currently, the UKB is the largest neuroimaging dataset, encompassing data from extensive questionnaires, physical and cognitive measures, and biological samples (including genotyping) in addition to multimodal neuroimaging scans (Sudlow et al., 2015). The UKB is a prospective epidemiological study that recruited a cohort of 500,000 participants, of which 100,000 subjects will take part in one round of imaging, and 10,000 of those subjects will undergo a further second round of scanning (Sudlow et al., 2015).
Health outcomes for all participants will be tracked over future years until participants' decease, including full primary health and hospital records. Therefore, the UKB offers a valuable resource to study mental health and other disorders. The goal of our study is to establish a foundation for future mental health biomarker research in the UKB. The UKB includes multiple rich self-report measures of mental health. However, the organization and abundance of this information can make it somewhat challenging for researchers to navigate. For data pertaining to mental health, there are three sources within the UKB.
The first is the assessment center questions (https://biobank.ndph.ox. ac.uk/showcase/label.cgi?id=100060) which participants complete via a touch screen on the day they were scanned. The second is a separately administered online mental health questionnaire (https:// biobank.ndph.ox.ac.uk/showcase/label.cgi?id=136), which is completed by a subset of UKB participants at a time-independent from the scanning date (median absolute number of days between scan 1 and online questionnaire completion: 742, range: À1,185 to +964 days in exploratory sample). The third is the health records available in the UKB which encompass the date of the first experience of specific ICD-10 diagnoses obtained from primary care (https://biobank.ndph.ox.ac.uk/ showcase/label.cgi?id=3000) and hospital inpatient data (https:// biobank.ndph.ox.ac.uk/showcase/label.cgi?id=2000). In this study, we tabulate and compare different mental health measures available in the UKB, with a focus on self-reported symptom scores from the assessment center information and online questionnaire. We test their relationship with brain measures, thereby providing a benchmark for using UKB mental health variables in future research.
This study aims to achieve four key goals. First, we aim to clearly tabulate the different self-report measures of mental health available in the UKB and discern the relationships between summary scores to enable future studies to make an informed decision on which measure is most appropriate to use. Second, we propose and validate a new summary measure (Recent Depressive Symptoms; RDS-4) that uses depression questions which were asked on the day of scanning in the UKB study. The RDS-4 score, therefore, enables research into current depressive symptoms and changes in symptomatology over time.
Third, we aim to establish realistic and robust univariate and multivariate effect sizes of commonly reported brain correlates of mental health based on population data. Finally, we aim to determine the test-retest reliability of imaging variables alongside their effect size as both reliability and sensitivity are critical requirements for biomarker research. Large-scale imaging datasets such as the UKB play a critical role in the long-term goal of finding brain biomarkers of mental health, and we hope to provide a foundation that future studies can build on.

| Dataset
Imaging data from 32,420 UKB participants were available at the time the study was performed. From this, we selected multiple independent test cohorts ( Figure 1 and Table 1). Subjects with a mean head motion greater than 0.2 mm were removed resulting in the exclusion of 5,265 subjects. Subjects with any missing online questionnaire or scan 1 assessment center mental health data were also removed, resulting in the exclusion of additional 10,848 subjects (largely because the online questionnaire was only performed in a subset of UKB participants). From the remaining 16,307 subjects, we selected individuals who had undergone brain scans at two timepoints. These subjects make up the test-retest sample.
Late-onset depression (first episode at age 60+) is associated with different brain correlates (e.g., white matter hyperintensities) and different risk factors (e.g., vascular risk) compared with recurrent early-onset depression (age of first episode before 60; Salo, Scharfen, Wilden, Schubotz, & Holling, 2019). Therefore, we assessed subjects for probable late-onset depression based on selfreported age at the first episode of depression (Data-field: 20433).
Subjects who reported their first episode at 60 or older (N = 418) were excluded.
The majority of individuals within the UKB cohort are expected to have no mental health conditions because it is a population sample. To ensure sufficient power to identify neural correlates of mental health, we wanted to reduce the expected over-representation of healthy individuals and ensure that our samples richly capture mental health variability.
This was achieved by including equal numbers of participants with and without a history of mental health. From the UKB showcase we used: Seen doctor (GP) for nerves, anxiety, tension, or depression (Data-field: 2090) to ensure our samples included an equal number of subjects who experienced mental health issues on at least one occasion, and those who have not. For each subject who had seen a GP for nerves, anxiety, tension, or depression (N = 4,531), we paired a matched subject from those who had never seen a GP for nerves, anxiety, tension, or depression (i.e., subject pairs were identically matched for sex and age, and Note: The "ever seen GP for mental health" and "never seen GP for mental health" subjects were matched, such that the same male-to-female ratio and mean age applies to these groups. minimal difference in head motion). Subsequently, approximately twothirds of the "never seen GP" subjects together with their matched "seen GP" subjects were randomly assigned to the exploratory sample, and the remaining subjects were assigned to the confirmatory sample.
During a subject assignment to groups, we preserved the matched characteristics within each resulting sample ( Figure 1). No subjects overlapped between the exploratory and confirmatory samples.

| Mental health measures
The set of self-report questions related to mental health included in the UKB were informed by standardized measures, but did not simply cover a list of previously validated scales. Seen psychiatrist for nerves, anxiety, tension, and depression 2100 Abbreviations: GAD-7, general anxiety disorder-7; N-12, neuroticism-12; PHQ-9, patient health questionnaire-9; RDS-4, recent depressive symptoms-4. There are a number of important differences between the RDS-4 and the other mental health measures. Compared to PHQ-9, the RDS-4 was obtained on the day of the imaging scan, whereas the PHQ-9 was undertaken at a time point that was independent of the scan date. Compared to probable depression status, the RDS-4 provides a continuous measure of recent symptom severity, whereas probable depression status is a categorical (case-control) measure of lifetime occurrence of depression. Compared to N-12, RDS-4 is a measure of recent (state) depressive symptoms, whereas N-12 is a more general measure of personality (trait). Compared to GAD-7, the RDS-4 focuses on depression and the GAD-7 focuses on anxiety.
Diffusion data was acquired with b-values of 1,000 and 2,000 s/mm 2 , at 2 mm spatial resolution, with a factor 3 multiband acceleration and 50 distinct diffusion-encoding directions. Both tfMRI and rs-fMRI used identical acquisition parameters (spatial resolution = 2.4 mm, TR = 0.735 s, factor = 8 multiband accelerator). Task fMRI used the Hariri faces/shapes "emotion" task as employed in the HCP Hariri, Tessitore, Mattay, Fera, & Weinberger, 2002), with a shorter total length and reduced repeats of the total stimulus block. For further information on UKB imaging, please refer to Miller et al. (2016).
F I G U R E 2 Schematic overview of the acquisition timing of UK Biobank mental health measures in relation to imaging acquisition. Mental health measures in light green were obtained on the day of scanning, whereas mental health measures in light blue were obtained at an independent time point that varied from 1,185 days before to 964 days after scan 1 across participants. The range of possible scores for each mental health measure is included. All five measures were included in neuroimaging and questionnaire comparison analyses in this article

| Imaging derived phenotypes
In addition to raw and processed imaging data, image-derived phenotypes (IDPs) are available for download. IDPs are derived from calculations that combine many images and/or voxels to produce a scalar quantity from the processed imaging data (Miller et al., 2016). Examples of IDPs include regional volumes from structural MRI and "edges" from resting-state functional MRI (i.e., connectivity between a pair of networks).
The IDPs included in this paper are summarized in Table 3, and further information can be found in (Miller et al., 2016) as well as the UKB showcase brain imaging documentation resource (https:// biobank.ndph.ox.ac.uk/showcase/showcase/docs/brain_mri.pdf).
Briefly, resting-state IDPs were obtained using Independent Components Analysis performed at two different dimensionalities (25 and 100), which resulted in 21 and 55 signal networks, respectively.
Subject-specific BOLD time series for each network were calculated using dual regression (Nickerson, Smith, Öngür, & Beckmann, 2017), and the amplitude for each network (temporal standard deviation) and functional connectivity between pairs of networks (full or partial correlation coefficients) were calculated. Resting-state IDPs from both ICA dimensionalities were included as they may offer complementary information at different levels of functional organization.
From T1-weighted images, gray matter volumes were obtained with FSL FIRST and FAST, and cortical area and thickness were calculated with Freesurfer. Total volume of white matter hyperintensities was estimated based on T1-weighted and T2-flair images using FSL's BIANCA algorithm (Griffanti et al., 2016). From the diffusion data, weighted mean fractional anisotropy (FA) and mean diffusivity (MD) were obtained using FSL's DTIFIT tool. Task fMRI IDPs reflect summary measures of activation (the median and 90th percentile for both the percent signal change and the z-statistic) in regions selected from the group-level activation map. Susceptibility weighted IDPs were generated from the signal decay times predicted from the magnitude images at the two TEs such that the IDPs equate to the median signal decay times.

| Confound variables
All analyses were corrected for the "simple" set of confounds Data used to compute RDS-4 and N-12 were collected at the scan date (assessment center information), whereas GAD-7 and PHQ-9 were computed from data obtained from the online questionnaire. The absolute number of days elapsed between the two data collections ranged from 0 to 1,185 days. To investigate the effects of measurement latency on mental health measure correlation, Spearman rank correlations between the RDS-4 and PHQ-9 (both measures of depression) were computed as a function of elapsed time between measurement (see Supplementary Materials Section S1 and Figure S1).
To test whether self-report measures differed significantly based on probable depression status, a two-sample Kolmogorov-Smirnov test was performed to ascertain whether subjects with a positive depression status had different distributions of depression scores than subjects with no depression status.

| Mapping between mental health variables in the UKB
To gain insights into how the different measures of mental health included in the UKB relate to each other, we used equipercentile linking in the exploratory sample. Here, the stepwise percentiles for each measure were calculated, and for each score in one measure, the equivalent percentile rank in a different measure was mapped (Kolen & Brennan, 2014). We further calculated the Cronbach alpha for the newly proposed RDS-4 score to measure internal consistency in the exploratory sample.

| Mechanical Turk study to validate RDS-4
To further validate the proposed RDS-4 score, we performed an independent study using the Amazon Mechanical Turk platform via CloudResearch.com (Litman, Robinson, & Abberbock, 2017 Watson & Clark, 1991). The latter two measures were included because they are commonly used measures of depression that can be considered "gold standard" for self-report. Although these measures are not available in the UKB, our goal was to validate the RDS-4 against these standardized measures.
We undertook multiple steps to avoid low-quality responses, which can be a concern in Mechanical Turk questionnaire research.
First, we adopted premium options in CloudResearch, such as only including "CloudResearch approved participants" who undergo more extensive vetting. Second, we included two questions to assess the attention levels of the participants while performing the study ("If you are still paying attention, please select 'yes'" & "Please answer this question with the 'Most or all of the time' option"). Participants who failed to answer these questions appropriately were excluded. Third, we imposed a minimum duration for questionnaire completion at 172.5 s (which equals 2.5 s per question). Participants who completed the questionnaire in less than 172.5 s were excluded.
Spearman rank correlation was used to compare scores between the RDS-4, PHQ-9, CES-D, and MASQ-30 using data from time point 1. Intraclass correlation coefficient (ICC A,1; also known as criterionreferenced reliability; Koo & Li, 2016;McGraw & Wong, 1996) was used to calculate the test-retest reliability between time point 1 and time point 2 separately for each measure.

| Exploratory brain-mental health analysis
We used Canonical Correlation Analysis (CCA) as a data-driven approach to identify joint multivariate relationships between mental health measures and brain imaging variables (Hotelling, 1936). Following nuisance regression to remove variance explained by nuisance regressors, dimensionality reduction was performed separately for resting state, structural, and task IDPs ( The multivariate CCA results were also replicated in the independent confirmatory sample by projecting the resting state, structural, and task IDPs onto the same PCA subspace (i.e., not repeating the PCA, but using the weights from the exploratory sample), and multiplying brain eigenvectors as well as mental health scores by their respective canonical coefficients (i.e., A & B as estimated from the exploratory sample). The CCA replication was tested based on the correlation between the resulting U and V (i.e., the canonical correlation). We also performed the same post-hoc univariate correlations between averaged UV and individual IDPs as described above to assess the replicability of IDP contributions to the CCA.

| Confirmatory analysis of effect size
The independent confirmatory sample (N = 2,426) was used to test univariate effect sizes of selected brain variables from CCA analysis (i.e., significant IDPs after Bonferroni correction). Specifically, we performed a Cohen's d test based on probable depression status, and calculated the Pearson's r from the correlations between the selected brain variables and each of the four mental health variables (i.e., RDS-4, PHQ-9, N-12, and GAD-7), respectively. These analyses were repeated for each imaging modality including surface area, gray matter volume, cortical thickness, white matter hyperintensity, fractional anisotropy, median T2*, task activity, resting-state network amplitude, and edge connectivity at both dimensionalities (i.e., 25 and 100). We de-confounded both the brain variables and the mental health variables before running the aforementioned analyses. The only exception from de-confounding is the binary grouping based on probable depression status, as de-confounding would result in subjectspecific values that are noncategorical, which is unsuitable for the Cohen's d test.

| Test-retest reliability of imaging measures
To assess the stability of IDPs across time, we performed test-retest reliability analyses using data from N = 624 subjects that were scanned twice at separate time-points, with an inter-scan interval of approximately 2 years (Table 1). Data were de-confounded for this sample using the same approach employed for the exploratory CCA analysis. After data were de-confounded, intra-class correlations were computed between the IDPs collected at each scan time-point using the ICC(A,1) formulation to quantify the agreement between measurements collected at each timepoint (McGraw & Wong, 1996).
Test-retest reliability measures were grouped according to IDP measurement modality (e.g., cortical area, cortical volume, etc.) to allow for assessment of the ICC distributions for different modalities.
We also assessed the effect of inter-scan interval length on the test-retest correlation strengths by computing ICCs for each IDP after including regressing out the inter-scan interval (in days) from each IDP, thus removing any additional variance attributable to intersubject differences in inter-scan interval lengths. Finally, we assessed between the two timepoints available in the UKB, we also performed a separate Mechanical Turk study to test the short-term (7-day) testretest reliability of RDS-4 (see Section 3.3).
We performed a two-sided, two-sample Kolmogorov-Smirnov test on RDS-4, PHQ-9, N-12, and GAD-7 scores over subjects with and without probable depression status. Subjects with probable depression scored significantly higher than subjects with no probable depression status on all measures (KS-statistic χ ≥ 0:19,p ≤ 10 À48 ; Figure 2b).

| Mapping between mental health variables in the UKB
Given that this is a largely healthy sample, as expected, the distributions for PHQ-9, RDS-4, and GAD-7 all reveal a large number of participants with scores on the lower end of the mental health measure, with a sharp decline seen in the number of participants scoring on the upper end of the mental health measures (Figure 4a-d). Notably, the distribution of N-12 is relatively less skewed than PHQ-9, RDS-4, and GAD-7. We calculated Cronbach's internal consistency alpha for RDS-4, which measures the internal consistency. The Cronbach alpha for RDS-4 was .78, which indicates a moderate to strong internal reliability. This was similar to N-12 (Cronbach alpha = .83).

| Mechanical Turk study to validate RDS-4
Out of 134 subjects who completed our separate validation study, three subjects were removed because they failed the attention questions and a further 44 subjects were removed because they completed the surveys too fast, resulting in N = 87 subjects (53 female and 34 male; mean age 66.0 ± 4.8). The results showed that RDS-4 was highly correlated with other depression scales and achieved testretest reliability comparable to other depression scales (Table 4).

| Exploratory brain-mental health analysis
Prior to performing the CCA, the data reduction of resting-state IDPs resulted in 100 components which explained 50.1% of variance. The data reduction of the structural IDPs resulted in 24 components which explained 50.6% of variance. The data reduction of the task IDPs resulted in two components which explained 51.2% of variance.
Therefore, the total number of brain variables input into the CCA was 126 and this was tested against the five mental health variables. The   IDPs that contributed significantly to the CCA were also tested for univariate direct correlations with individual mental health variables in the independent confirmatory sample (see next section for the results). For these follow-up univariate tests, we furthermore supplemented the target IDPs with a literature-curated list (Table S1) that partly overlaps with the data-driven IDP identification.

| Confirmatory analysis of effect size
Our findings showed that univariate effect sizes of the relationship between IDPs and mental health determined in our robust population sample were very low. Overall, effect sizes of the differences in the brain variables (i.e., IDPs), indicated by Cohens' d, based on probable depression status, were larger than the Pearson's r values from correlations between IDPs and continuous mental health measures ( Figure 6). On average, resting-state node amplitude and edge connectivity derived from partial correlation matrices appeared to have the higher effect sizes in most mental health measures, and task activity and fractional anisotropy ranked high in some mental health measures. At the level of individual IDPs, edges derived from both partial and full correlation matrices emerged as the best "predictors" in explaining data variance in all mental health variables except for PHQ-9 where amplitude of a few resting-state nodes ranked at top ( Figures S3-S7). These findings together suggest an overall higher effect size of resting-state in contrast to nonresting state measures on the investigated mental health variables.

| Test-retest reliability of imaging measures
We next assessed the stability of IDPs over time in 624 subjects who had data from two separate scan sessions conducted approximately 2-2.5 years apart. Figure 7a shows the distribution of inter-scan intervals for all 624 subjects. To assess test-retest reliability, ICCs were computed between the scan 1 measurements for each IDP and the corresponding scan 2 measurements for the same IDP. Then, the ICCs were assigned to categories based on the measurement modality of the corresponding IDPs: brain surface area (  F I G U R E 7 Test-retest analyses. (a) The histogram shows the inter-scan interval distribution for the 624 subjects included in these analyses. The x-axis shows days between scans, and the y-axis shows the number of subjects. (b) The boxplots show the ICCs obtained using brain IDPs after standard confound regression (blue) versus ICCs obtained using brain IDPs after standard confound regression plus regressing out effects of inter-scan interval length (orange). IDP measurement modality categories are organized along the x-axis, and the y-axis shows ICC values (see also Figure S8) Figure 7b depicts the distributions of ICCs for each IDP measurement modality obtained using the confound-regressed data from both scan time-points, along with those obtained after additionally regressing out the effects of inter-scan interval length (i.e., days between scans). Notably, ICC distributions were highly similar for both analyses. In general, IDPs corresponding to measures of brain structure had higher ICCs than IDPs corresponding to measures of brain function. The highest ICCs were observed for IDPs corresponding to brain volume/brain area measures and the lowest ICCs were observed for IDPs corresponding to task measures. This pattern of results is not particularly surprising since macro-scale structural properties like regional volume are expected to be relatively stable over time, especially when considering relative between-subject correlations. Macro-scale functional properties like task activation magnitudes or network connectivity patterns exhibit higher variability over time due to influences of factors such as the level of task engagement (during task), cognitive state (during rest), and physiological state (e.g., hungry vs. sated, sleepy vs. alert), and therefore are expected to have somewhat reduced test-retest stability.
Analyses performed for sub-groups of patients that did (n = 288) versus did not (n = 336) exhibit changes in mental health between time-points as determined by the difference between RDS-4 measures obtained at each time point yielded highly similar results, as did those obtained after regressing out the change in RDS-4 score (See Supplementary Material Section S5 and Figure S8). Overall, these results suggest that the test-retest reliability of the IDPs is largely independent of mental health change as indicated by the RDS-4.

| DISCUSSION
In the present study, we aimed to tabulate mental health questionnaires available in the UKB and investigate their neural correlates. We summarize five different UKB measures of mental health: PHQ-9, GAD-7, RDS-4, N-12, and probable depression status. Our results show that all measures were moderately correlated with one another ( Figure 3). CCA analyses to identify multivariate associations between these mental health measures and IDPs indicated a significant CCA mode of covariation which linked brain IDPs to mental health scores ( Figure 5). The multivariate CCA analysis indicated a significant correlation between mental health and imaging that was largely reproducible in the independent confirmatory sample. All mental health measures contributed to the CCA result indicating a "trait-like" multivariate brain-mental health association. In a separate test of univariate effect sizes, modalities with the strongest modality-mean effect sizes included amplitude and edge connectivity of resting-state networks, but univariate effect sizes were generally very low ( Figure 6).
All IDPs showed moderate to high test-retest reliability, with IDPs of brain structure showing higher reliability than IDPs of brain function ( Figure 7). Together, these findings provide the foundation for future biomarkers research into mental health using the UKB.
We highlighted a difference in acquisition timing of mental health questionnaires in the UKB study relative to neuroimaging data acquisition. Two well-validated measures of mental health (GAD-7 and PHQ-9) were obtained as part of the online questionnaire, which is acquired independently of scan days such that they were obtained 742 days apart (median across exploratory subjects) from scan 1 (range À1,185 to +964 days). Because of this time discrepancy (which is highly inconsistent across subjects), the PHQ-9 (which tests recent depressive symptoms over a 2-week period) is not well-suited as a state depression measure for UKB neuroimaging research despite its validity for lifetime depression (Cannon et al., 2007), and its sensitivity to depression in older populations ( Table 4), whereas a lower "trait-level" correlation between RDS-4 and PHQ-9 is observed in the UKB data (0.6; Figure 3a) due to the gap in acquisition times ( Figure S1). Furthermore, RDS-4 has high internal consistency and its scores map closely onto established measures of depression ( Figure 4 and Table 4)further confirming its validity. The RDS-4 questions cover four different depression domains (mood, disinterest, restlessness, and tiredness) that are also considered in other measures such as the Hamilton and Montgomery-Åsberg scales (Hamilton, 1967;Montgomery & Asberg, 1979). Hence, by asking questions in different domains, the RDS-4 inventory reflects overall depression severity relatively well, despite the comparatively small number of items. The Neuroticism-12 index-also obtained on each day of scanning-is a personality trait (Eysenck & Eysenck, 1975) that is strongly related to an increased risk in depression (Hirschfeld et al., 1983;Shaw & Hare, 1969). N-12 items assess generic traits as opposed to recently experienced clinical symptoms (RDS-4 and PHQ-9). Our results confirm that N-12 is more stable over time compared with RDS-4 and PHQ-9 as assessed by the 2-year correlation. We, therefore, suggest that N-12 can be used as a measure of trait-level susceptibility to depression in UKB neuroimaging research.
In terms of neuroimaging correlates of mental health, our findings show that multivariate associations explain more variance in mental health effects than univariate associations, which is supported by previous work (Marek et al., 2020). It should be noted that our estimated effect sizes are derived from a large sample (N > 2,000) and are therefore expected to capture true effect sizes that are uninfluenced by sampling variability (Marek et al., 2020). The literature to date is dominated by underpowered studies which, by design, only report high effect sizes because the significance threshold is itself high due to limited power. We have to adjust our expectations to value realistic effect sizes from well-powered samples, which may be lower but, importantly, reproducible. The observed increase in explained variance when using multivariate methods is consistent with the proposal of complex macroscopic patterns of psychopathology in mental health patients (Williams, 2016;. Future biomarker research will therefore need to focus on multivariate techniques such as CCA, connectome fingerprinting (Finn et al., 2015), topological network properties (Zhu et al., 2017), or machine learning (Dinga et al., 2018).
One reason why multivariate methods may have higher effect sizes than univariate methods could be due to the relatively low signal-to-noise ratio and high measurement noise of individual univariate IDPs and the effective averaging that occurs in multivariate combinations of IDPs and during the dimensionality reduction before CCA, which reduces noise. For example, previous work showed substantial increases in heritability when combining connectivity IDPs with independent component analysis compared with univariate IDPs (Elliott et al., 2018). Given the low SNR of individual IDPs and the risk of overfitting in multivariate methods, robust cross-validation (Poldrack, Huckins, & Varoquaux, 2020) and independent replication of findings (in a split-half group and/or in a fully independently acquired dataset) are essential requirements for future biomarker research Dinga, Schmaal, & Marquand, 2020).
A second potential reason for limited effect sizes (even with the use of multivariate methods like CCA) is between-subject heterogeneity. One type of heterogeneity is diversity in symptoms, such that two patients with depression may present with largely nonoverlapping symptom profiles (Drysdale et al., 2017;Feczko et al., 2019;Feczko & Fair, 2020;Kaczkurkin et al., 2020). Another type of heterogeneity is diversity in psychophysiological disease mechanisms. Here, it is possible that the same symptom may be caused by a number of different patterns of brain changes (Feczko & Fair, 2020), which we refer to as "many-to-one mechanistic mapping". Notably, both types of heterogeneity are potentially more prominent in large-scale population studies such as the UKB compared with smaller studies. This is because studies with smaller samples often implement stricter exclusion criteria in relation to comorbidities and medication to control for known sources of heterogeneity. Reducing the exclusion criteria in the UKB is likely advantageous for mental health research because the UKB and other large-scale studies provide a more accurate representation of "real-life" mental health as it occurs across the population. This makes the findings more likely to be generalizable.
However, gaining a better understanding of both symptom heterogeneity and many-to-one mechanistic heterogeneity is critically important for the effective clinical translation of mental health biomarkers.
Computational methods are available to account for heterogeneity, such as subtyping analyses to reveal any distinct sub-groups (Drysdale et al., 2017;Kaczkurkin et al., 2020) and normative modeling analysis to compare each individual against the normative range (Marquand, Rezek, Buitelaar, & Beckmann, 2016). These models of heterogeneity benefit from the large sample size available in the UKB which enables stringent cross-validation.
In summary, this article provides a guide for future neuroimaging biomarker research into effect-based mental health in the UKB. We recommend using RDS-4 for imaging-based research into state depression (i.e., currently experienced symptoms) and N-12 for imaging-based research into personality traits associated with depression (Lahey, 2009). Our results regarding the brain correlates of mental health show low effect sizes of individual IDPs, but higher effect sizes and replicability of multivariate associations and relatively high test-retest reliability. Therefore, we recommend the use of approaches that capture multivariate patterns and parse patient heterogeneity in combination with stringent out-of-sample replication to avoid overfitting.

ACKNOWLEDGMENTS
We are grateful to UK Biobank and the UK Biobank participants for making the resource data possible, and to the data processing team at Oxford University for producing the shared processed data. This

DATA AVAILABILITY STATEMENT
All analysis code for this article is available at: https://github.com/ PersonomicsLab/MH_in_UKB. UKB data (Miller et al., 2016;Sudlow et al., 2015) are available following an access application process, for more information please see: https://www.ukbiobank.
ac.uk/enable-your-research/apply-for-access. In accordance with the UKB regulations, newly derived variables in this article (e.g., RDS-4) will be made available to other researchers via UKB data access post-publication.