Spectral clustering based on structural magnetic resonance imaging and its relationship with major depressive disorder and cognitive ability

There is increasing interest in using data‐driven unsupervised methods to identify structural underpinnings of common mental illnesses, including major depressive disorder (MDD) and associated traits such as cognition. However, studies are often limited to severe clinical cases with small sample sizes and most do not include replication. Here, we examine two relatively large samples with structural magnetic resonance imaging (MRI), measures of lifetime MDD and cognitive variables: Generation Scotland (GS subsample, N = 980) and UK Biobank (UKB, N = 8,900), for discovery and replication, using an exploratory approach. Regional measures of FreeSurfer derived cortical thickness (CT), cortical surface area (CSA), cortical volume (CV) and subcortical volume (subCV) were input into a clustering process, controlling for common covariates. The main analysis steps involved constructing participant K‐nearest neighbour graphs and graph partitioning with Markov stability to determine optimal clustering of participants. Resultant clusters were (1) checked whether they were replicated in an independent cohort and (2) tested for associations with depression status and cognitive measures. Participants separated into two clusters based on structural brain measurements in GS subsample, with large Cohen's d effect sizes between clusters in higher order cortical regions, commonly associated with executive function and decision making. Clustering was replicated in the UKB sample, with high correlations of cluster effect sizes for CT, CSA, CV and subCV between cohorts across regions. The identified clusters were not significantly different with respect to MDD case–control status in either cohort (GS subsample: pFDR = .2239–.6585; UKB: pFDR = .2003–.7690). Significant differences in general cognitive ability were, however, found between the clusters for both datasets, for CSA, CV and subCV (GS subsample: d = 0.2529–.3490, pFDR < .005; UKB: d = 0.0868–0.1070, pFDR < .005). Our results suggest that there are replicable natural groupings of participants based on cortical and subcortical brain measures, which may be related to differences in cognitive performance, but not to the MDD case–control status.


| INTRODUCTION
Major depressive disorder (MDD) is a heritable and disabling psychiatric condition associated with depressed mood and changes in cognitive function (Malhi & Mann, 2018), resulting in a significant reduction in the quality of life and a substantial burden on the individual, family and society. Many previous studies have reported structural and functional brain alterations associated with depression (Drevets et al., 2008;Jiang et al., 2019). Moreover, psychiatric conditions (including MDD) have been shown to be associated with cognitive alterations . Both psychiatric conditions and cognitive functions are found to have underlying neurobiological mechanisms. With recent advances in brain imaging, computational as well as mathematical techniques, there is increasing interest in developing objective measures that could help classify MDD status and associated traits such as cognition using neuroimaging data. Studies from many different research groups have indicated structural brain differences in MDD using large robust samples. MDD-related cortical thinning was found in orbitalfrontal cortex (Schmaal et al., 2017), medial prefrontal cortex (Treadway et al., 2015), temporal (Zhao et al., 2017), subgenual anterior cingulate cortex (Anderson et al., 2020), lingual gyrus (Suh et al., 2019), precentral (Bos et al., 2018) and par orbitalis (Merz et al., 2018) regions. Some studies also reported lower surface areas in lingual, fusiform, parahippocampal gyrii (Couvy-Duchesne et al., 2018) and subcallosal regions (Wei et al., 2020), as well as cortical volume reduction in prefronal cortex, orbitalfrontal cortex (Grieve et al., 2013), subcallosal regions (Wei et al., 2020), temporal pole, insula lobe (Amidfar et al., 2020) and subgenual anterior cingulate cortex (Niida et al., 2019). Although some research indicated MDD-related reduction in thalamus (Schmaal et al., 2016;Webb et al., 2014;Ye et al., 2020), amygdala (Qi et al., 2018) and hippocampus (Nugent et al., 2013), the MDD case-control volumetric differences in subcortical regions have been found to be insignificant in some other studies (Bos et al., 2018;Shen et al., 2017). Furthermore, white matter microstructure Shen et al., 2017;van Velzen et al., 2020), functional connectivity (Qiao et al., 2020;Ran et al., 2020) and abnormalities were also found in MDD patients. Although MDD-related brain differences were found in several literatures, these studies usually reported small to very small effect sizes. Previous machine learning (ML) studies with structural brain features also show potential for unbiased diagnostic classification (Lebedeva et al., 2017;Patel et al., 2015;Qiu et al., 2014). Features derived from structural magnetic resonance imaging (MRI) have shown promise for MDD case-control classification, with linear or non-linear supported vector machine classifiers achieving accuracies of >70% (Gao et al., 2018). However, the ability of ML to determine case-control status using such features remains uncertain, especially when most existing studies have been conducted on relatively small datasets (N < 100), with limited independent replication. In addition, the majority of existing studies focus on clinically ascertained cases, and therefore the results may not be generalisable to population or community-based samples (Stolicyn et al., 2020).
While supervised learning methods focus on the core question of whether differences in brain measures characteristic of MDD are sufficient to accurately classify MDD cases from healthy controls, unsupervised learning methods focus on determining whether natural groupings based on brain differences are relevant for MDD. We considered this as a potentially useful approach, because results from unsupervised learning methods could in turn help us further refine and better understand the disorder. Moreover, clustering has been shown to be an important tool in other areas of medicine, such as in understanding Alzheimer's disease (Alashwal et al., 2019) and different psychiatric disorders (Marquand et al., 2016). Recently, other studies have also attempted similar unsupervised clustering analysis approaches on structural (Zhou et al., 2019) and functional imaging data (Drysdale et al., 2017;Tokuda et al., 2018) as a way to identify potential imaging-based data-driven depression subtypes.
In the current study, we applied unsupervised spectral clustering, as an exploratory approach, to data from a relatively large sample of well-characterised individuals (MDD cases and controls drawn from a communitybased sample, Generation Scotland (GS) (Smith et al., 2012), with structural imaging measures, depression phenotyping and cognitive data). Our rationale was to explore if the effects are observable using unsupervised spectral clustering. Our aim was to identify natural groupings of individuals, characterised by maximally distinct patterns of structural brain properties. We then attempted replication of the clustering in an independent sample with imaging data (UK Biobank [UKB], Miller et al., 2016), using regional between-cluster effects as a basis for evaluating replication. Finally, we investigated whether these natural imaging-based groupings are related to distinct clinical and cognitive features of the participants, focussing on those phenotypes that are consistent across cohorts.
Participant graphs were constructed for each FreeSurfer-derived structural metric of cortical thickness (CT), cortical surface area (CSA), cortical volume (CV) and subcortical volume (subCV) subsets separately. Firstly, imaging variables were controlled for age, sex, intracranial volume (ICV) and MRI site and then normalised. K-nearest neighbour (k-NN) graphs were then constructed based on pairwise distances between each pair of participants, and finally clustering was conducted using a dynamic graph-based Louvain modularity algorithm (Blondel et al., 2008). This was chosen to optimise the Markov stability (Schaub et al., 2012) as a measure of the clustering quality instead of the standard modularity measure (Newman, 2006), which has been shown to result in over-partitioning for graphs with strong local structure, such as the k-NN graph (Schaub et al., 2012). By optimising Markov stability, large communities can be revealed at longer Markov times, thus solving the problem of overpartitioning. As such, this method can reveal stable natural groupings within a cohort.
Our main aims were (1) to determine whether there was a natural clustering of participants based on structural imaging features and whether these were replicated in an independent cohort and, as an exploratory step, (2) to assess whether the clustering results were associated with depression status or cognitive features in both cohorts.
2 | METHODS AND MATERIALS 2.1 | Data acquisition and preprocessing 2.1.1 | GS dataset GS (subsample) is a community-based dataset with imaging data, reported previously (Habota et al., 2019;Navrady et al., 2018;Romaniuk et al., 2019;Rupprechter et al., 2020;Smith et al., 2012;Stolicyn et al., 2020). Demographic details of these participants and for the replication cohort (UKB) are presented in Table 1. Ethical approval for the GS subsample was obtained from the NHS Tayside committee on research (reference 14/SS/0039). T1 imaging of N = 1070 participants from GS subsample, scanned between June 2015 and May 2019, were performed at two sites (N = 544 from Aberdeen and N = 526 from Dundee). Structural measures were derived from T1 images with FreeSurfer version 5.3 Fischl et al., 1999Fischl et al., , 2004. Mean CT, CSA and CV were derived for 68 cortical regions defined by the Desikan-Killiany atlas (Desikan et al., 2006). Volumes of 21 subcortical structures-including left and right accumbens, amygdala, caudate nucleus, hippocampus, pallidum, putamen, thalamus and four cerebellar regions-were also extracted with FreeSurfer. In total, N = 980 participants remained after quality control-removing participants with any missing values, as well as participants whose ICV measure and global cortical measures, that is, overall cortical volume (sum of regional cortical volumes) and overall surface area (sum of regional surface areas), were more than three standard deviations away from the sample mean (Stolicyn et al., 2020). Details of MRI acquisition and quality control process are described in Sections S1.1.2 and S1.1.3. Participants whose demographic information was missing were also removed. There were 225 FreeSurfer-derived features available for each participant (204 cortical and 21 subcortical features). Standard Z-score normalisation was performed prior to graph construction.
For the GS subsample, there were N = 980 participants in total, of whom N = 302 were cases with lifetime (current or past) MDD. Diagnosis was established using the Structured Clinical Interview for DSM-IV Disorders (SCID) (First, 1997) and was based on criteria from the Diagnostic and Statistical Manual of Mental Disorders (DSM) (American Psychiatric Association, 2000) (Section S1.2.1). Participants were classed as currently depressed if they had an ongoing depressive episode, and as past MDD if they were not depressed at the time of MRI scan but had at least one depressive episode previously (Stolicyn et al., 2020). Participants were classed as recurrent if they had had more than one depressive episode. Data for each participant therefore included MDD status according to the SCID diagnosis described above, single versus recurrent episodes (single: N = 116, recurrent N = 186).
The cognitive measures were derived from the following tasks: Wechsler Adult Intelligence Scale UK -Third Edition (WAIS-III UK ) logical memory (LM) Parts I and II (sum of immediate/delayed recall) , WAIS-III UK digit-symbol coding (DSy) (Wechsler, 1998), phonemic verbal fluency C-F-L (VF) (Lezak, 1995), Mill Hill vocabulary (MHV) (Raven & Raven, 2003) and matrix reasoning (Matrix) tests (Ritchie et al., 1993). Additionally, age, sex, MRI scan site (Aberdeen or Dundee) and ICV were available and controlled for as described below. Table 1a shows the GS subsample participants characteristics.

| UKB dataset
The UKB obtained ethical approval from the NHS Research Ethics Committee (reference11/NW/0382), and our current study was approved by the UKB Access Committee (Project #4844). All participants in both the GS subsample and UKB gave written informed consent.
Data used were the raw T1-weighted volumes were from the second release of UKB MRI data (January 2017). All scans were acquired at the same 3T scanner (Siemens Skyra) at one single site (Cheadle). Information on the acquisition parameters can be found in the UKB online Brain Imaging Documentation (https://biobank.ctsu.ox. ac.uk/crystal/docs/brain_mri.pdf). As with GS subsample, the T1 volumes were processed at the University of Edinburgh with FreeSurfer version 5.3 using default settings, Note: *The ICV here for GS subsample was not standardised for each site. N is the number of participants for whom data is available. Age is in years.
Measures for DSy/VF/ MHV/Matrix/LM and VNR represent raw task scores. The x in RT and pairs match represents raw task scores. For the prospective memory test, 1 means recall at the first attempt and 0 otherwise. Abbreviations: LM, logical memory; Matrix: matrix reasoning; MHV, Mill Hill vocabulary; Pairs Match, pairs matching; RT, reaction time; VF, verbal fluency total score. and brain measures were extracted according to the Desikan-Killiany atlas (Desikan et al., 2006). CT, CSA and CV were computed for the 68 cortical regions, alongside volumes of 21 subcortical structures. FreeSurfer parcellations were visually assessed for a variety of errors (Cox, Lyall, et al., 2019;Stolicyn et al., 2020). Major errors included zero or partial output, substantial skull strip issues or tissue identification errors. Where no major errors were present, parcellations were examined for minor errors including erroneous boundary placement, minor skull stripping issues and minor tissue omission. Participants with missing values, missing demographic information, as well as those who were outliers in ICV and global cortical measures (as above for GS subsample) were removed, resulting in a dataset with N = 8,900 participants in total; see Section S1.1.4. Diagnosis of lifetime depression was based on participant responses in the online version of the Composite International Diagnostic Interview -Short Form (CIDI-SF) (Kessler et al., 1998) and made according to the DSM diagnostic criteria (Stolicyn et al., 2020) Ritchie & Deary, 2020). Although it would have been optimal to match the tasks more closely to those in our GS subsample, Matrix pattern completion and symbol digit substitution tasks were introduced later and therefore were not conducted concurrent with the imaging assessment for the N = 8,900 participants in this study. In a recent investigation, however, we note that he current four cognitive variables that were concurrent with imaging correlated well with other more detailed cognitive tasks within the UKB and with standard validated psychometric indicators of g (Fawns- Ritchie & Deary, 2020). Additionally, age, sex and ICV were available and controlled for as described below. Table 1b shows the UKB participants characteristics.

| Cognitive function g-factor extraction
In addition to the measures from individual tasks, we also derived a measure of general cognitive function-gfactor for participants within each cohort (Deary et al., 2010;Johnson & Bouchard, 2005) and assessed the association between the clusters and the derived g-factor.
The measure of general cognitive ability (g-factor) was a well replicated phenomena in psychological sciences (Deary et al., 2010;Warne & Burningham, 2019). Previous research have shown that the g-factor derived from entirely different sets of cognitive tests correlated well with each other, given that the set of cognitive tasks covers a sufficiently broad cognitive domain (Johnson et al., 2004;Johnson, te Nijenhuis, & Bouchard, 2008).
The g-factor here is the first factor score from factor analysis employed using the factoran function in MATLAB 2020a. For GS subsample, the g-factor was based on measures from the Matrix test (Ritchie et al., 1993), verbal fluency test (Lezak, 1995), MHV test (Raven & Raven, 2003), LM  and digit-symbol coding tests (Wechsler, 1998). Proportion of variance explained by g-factor in GS subsample was 26.0%. For UKB, g-factor was computed using measures from all the available UKB cognitive tasks stated above using the same process as in GS subsample. Proportion of variance explained by the g-factor in UKB was 14.7%. Details of the loadings can be found in Section S1.8, Tables S4a and S4b.

| Correction for covariates
Correction for covariates was performed by residualizing each brain measure with respect to sex, age, MRI site and ICV using linear regression models (Alfaro-Almagro et al., 2021;Becher, 1992;Dukart et al., 2011;Kostro et al., 2014;More et al., 2021;Snoek et al., 2019). We additionally conducted a Kruskal-Wallis (KW) test to confirm that no group differences remained on the basis of these covariates between identified clusters for both GS subsample and UKB (see Section S1.6, Tables S1 and S2).

| Graph construction
We applied a dynamical graph community detection approach to assess clustering of participants, which involved graph construction as the first step. Without known graph geometry, the graph was determined by the type of construction and the distance function chosen for the pairwise distance matrix based on the structural variables and the type of construction.

| Defining distance between participants
The pairwise distance matrix D is defined as D ij = d (x i ,x j ), where x i and x j are vectors of regional measures (CV, CSA, CT or subCV) per participants in the data and d(Á,Á) is the distance function to be specified. We used the standard Euclidean distance: to determine similarity between participants. Euclidean distance was chosen as other distance functions typically have more assumptions and constraints on the dataset. For example, cosine dissimilarity is typically used for non-centred and time-varying data, which were not the case here. As a preprocessing step, we applied standard Z-score normalisation to all measures before calculating the pair-wise Euclidean distances to avoid bias in features with broad value ranges.

| k-NN Graph Construction
We constructed k-NN graphs from the pairwise distance matrices computed above. In the k-NN graph, each data point (in this case, participant) is connected to the k closest other data points, as found in the distance matrix, D. It can be formulated as follows: where d ij is the direct distance from node i to node j, and d i (k) is the distance of the kth nearest neighbour from node i. The resulting graph is binary and undirected. Different values of k were tested in order to determine the optimal model for the respective graph construction by verifying the Markov stability measure on the networks.

| Optimal graph partitioning
We used Markov stability, instead of modularity, as the objective function for the Louvain algorithm for community detection in our graphs (i.e., optimal clustering) (Schaub et al., 2012). Contrary to other common community detection methods (e.g., k-means clustering, hierarchical clustering or graph modularity optimisation), Markov stability adopts a dynamics-based framework to uncover community structure. Graph partitions can be ranked and compared at each optimization time step that helps identify stable, optimal partitions (Delvenne et al., 2013). We briefly describe the Louvain algorithm and the Markov stability measure below. Liu et al.
validated the Markov stability method on several real datasets by comparing with other popular clustering methods in Liu et al., and it had achieved the best normalised mutual information (NMI) values on average (Liu & Barahona, 2020). Moreover, although the number of clusters are required for initialisation for other clustering methods, this clustering technique can perform the clustering in a completely unsupervised manner.

| The Louvain algorithm
The Louvain method is a greedy algorithm for graph community detection, which typically optimises modularity of the graph partitions. The modularity is used widely to measure the strength of division of a network into clusters. Details and formulation of modularity is included in Section S1.3. In the first phase, each node (data point) is assigned its own group, and hence the clusters are defined by individual nodes. Then, for each node i, we evaluate the modularity increment of removing i from its community and putting it into the community of j. At each step, the movement that leads to the largest increase in modularity is chosen. The algorithm repeats the same process until no further movement of nodes can lead to an increase in modularity (Blondel et al., 2008). At this stage, the local maximum is achieved.
The second phase consists of forming a new network from the communities found during the first phase, that is, treating the communities in the original graph as nodes in the new network. The sum of weights of edges, w k , within the same community is represented as a selfloop for that community, and edges between new nodes are defined by the sums of respective weights of intercommunity links. This can be interpreted as a coarse version of the original graph. The process in the first phase is then applied to the new network. The two phases are then repeated until modularity is optimised and a hierarchy of communities is produced, and this marks the end of a Louvain run (Blondel et al., 2008).
In our study, we applied the Louvain algorithm with optimisation of Markov stability measure instead of modularity measure for more optimal community detection. We describe the concept of Markov stability below.

| Markov stability
Markov stability is a measure of quality of graph community structure (Delvenne et al., 2010). Although modularity is the default measure of partitioning quality in the Louvain algorithm, optimising modularity can lead to overpartitioning or underpartitioning of the graph, and detection of less natural groupings (Fortunato & Barthelemy, 2007;Schaub et al., 2012). Compared with modularity, optimising Markov stability takes into account the different time scales within the partitioning algorithm (in our case the Louvain algorithm), with finer communities detected as optimal at earlier partitioning time steps and larger communities at later time stepswhich leads to more natural groupings. Markov stability measure is based on running random walks on the graph and recording which groupings appear most natural for each time scale according to the walk process, with length of each walk determined by the time scale (Delvenne et al., 2013(Delvenne et al., , 2010. Details of the Markov stability calculation are described in (Delvenne et al., 2010;Lambiotte et al., 2014). Further details on relation of Markov stability to modularity, as well as how modularity can be replaced with Markov stability in the Louvain algorithm, can be found in Section S1.4.

| Assessing clustering robustness
Because several runs of the Louvain algorithm are needed to define optimal partitions, we completed 100 runs of the Louvain algorithm for each time step.
Consistency of graph partitioning at each time step between different Louvain algorithm runs was measured by the average variation information (VI) between all pairs of partitions from different runs, evaluated as follows: V I P, P' ð Þ¼ 2HðP, P'Þ À HðPÞ À HðP'Þ HðP, P'Þ ¼ HðP, P'Þ À IðP, P'Þ HðP, P'Þ ð2:3Þ with IðP, P'Þ ¼ HðPÞ þ HðP'Þ À HðP, P'Þ ð2:4Þ where I(P,P') is the mutual information, and H(P), H(P') and H(P,P') are Shannon entropies used to measure the amount of information contained in partitioning P. Division by H(P,P') is for normalisation. In the following sections, we denoted variational information (VI) across different Louvain runs by VI and denoted VI across different time steps by VI(t,t 0 ).For each Louvain run, a different initial condition (i.e., the order of nodes being scanned during each merging step in the first phase) was chosen, so the effect of perturbation on partitioning results could be assessed. We assessed the consistency of partitions at each chosen time point and persistence of the number of communities over the time scale to choose optimal partitions (Delmotte et al., 2012). When more than one partition was considered as stable over a time scale, the clustering partition that remained stable for the longest time period was selected as the most stable.

| Stability postprocessing
Stability postprocessing applied in the current study is conceptually similar to the Convex Hull of Admissible Modularity Partitions (CHAMP) method described in Weir et al. (Weir et al., 2017). The Louvain algorithm with stability optimisation was run 100 times with 500 time-steps on each run (the 500 time steps were logarithmically spaced from 1 to 100), on the k-NN graphs constructed with k = 5, 7, 9 and 11. For each of the 500 time steps, an optimal graph partitioning was defined across the entire 100 Louvain runs. As a final postprocessing step, the defined optimal graph partitions for each time step were updated by considering partitions in all other time steps. Details of the postprocessing function can be found in Section S1.5.

| Assessing clustering consistency
We applied NMI to assess consistency between optimal partitions identified when different k values were used for constructing the graphs. NMI measures the information shared by two partitions, C i and C j . In other words, it measures to what extent knowing about C i reduces the uncertainty about C j . The NMI is defined as (Kvålseth, 2017): where H(P) is again, the Shannon entropy. If the two partitions are independent, the NMI is 0. If the two partitions are exactly the same, then NMI is equal to 1. Another alternative we proposed is the accuracy measure, which is formulated as follows: Accuracy ¼ number of correctly classif ied samples number of samples in data 2.6 | Assessing reproducibility and relation of clusters to cognition and MDD After stability optimisation and testing for robustness and consistency, we assessed the reproducibility of the partitioning results and tested associations of clusters from the stable partitions with variables of interest. To evaluate whether clustering was similar in GS subsample and UKB, we computed Pearson correlations between the cluster effects Cohen's d values in GS subsample and in UKB for regions in each of the four modalities. For computing the Cohen's d values, we took the values of each FreeSurfer region across participants and then calculated the standardised mean differences between the two participant clusters identified for each modality. Cohen's d values indicated the level of contribution of each regional measure to the separation between clusters; hence, a strong correlation of Cohen's d values between GS subsample and UKB would indicate that each measure had a similar contribution to the between-cluster separation in both cohorts; that is, a measure with large Cohen's d in GS subsample would have large Cohen's d in UKB and vice versa.
To assess associations with MDD and cognitive tasks, the KW test was used since the variables were not normally distributed. For cognition, we initially tested association with the general cognitive ability ( g-factor) and then the individual cognitive tasks separately for both cohorts. For individual tasks, in GS subsample, this involved testing associations with LM, DSy, VF, MHV and Matrix tasks and in the UKB association with VNR, RT, Pairs Match and ProsMemory tasks. The Benjamini-Hochberg procedure was used to correct the p values across the tasks for each cohort (Benjamini & Hochberg, 1995).

| RESULTS
3.1 | Participant clusters based on brain measures 3.1.1 | Clustering results in GS subsample As an illustration of how the optimal partitions were chosen, we took the partitioning of the 5-NN graph with regional surface area features in the GS subsample dataset as an example (Figure 1). Figure 1 illustrates participant partitioning throughout the time scale, after controlling for covariates. It shows that partitions with five, three and two clusters, had large plateaus with few VI (t,t 0 ) spikes within the plateaus, whereas partitions with three and two clusters also had low VI (across algorithm runs). This indicates that the key partitions were those with three and two clusters. Similar procedures were used to inspect the other k-NN graphs for different modalities.
F I G U R E 1 Community information, VI(t,t 0 ), together with 2D projections of the graphs, with nodes coloured according to cluster labels and nodes arranged with reference to five-cluster partition, over Markov time for the clustering analysis on 5-NN graph, constructed with residualised FreeSurfer surface area metrics for Generation Scotland (GS) subsample. Clustering results are based on time-dependent Markov Stability optimisation and had undergone post-processing for smoothing stability and obtaining more stable clustering result. The fact that the VI(t,t 0 ) stays zero for most of the time steps indicates robustness of the partition results. Note that the 500 time steps are logarithmically spaced from 1 to 100 Notably, the data were consistently partitioned into two clusters across different k-NN graphs (Figure 2). Strong similarity between partitions from different k-NN graphs for CT and CV is illustrated by high NMI (>0.7) between two-cluster partitions of the 11-NN and two-cluster partitions of either 5-NN, 7-NN or 9-NN graphs ( Table 2). Although we see slightly decreased NMI for CSA and subCV, high accuracies (CSA: ≥89.6%; subCV: ≥94.7%) still indicate strong similarity between partitioning results within each of the four modalities. To show that F I G U R E 2 Number of clusters found (left y-axis) and VI(t,t 0 ) (right y-axis) over Markov time for K-nearest neighbour (k-NN) graphs based on the FreeSurfer surface area metrics in Generation Scotland (GS) subsample. Low variational information (VI) indicates that variations among different Louvain runs are small, which implies that the important clusters are from partitions into two or three modules. The VI(t,t 0 ) is an evaluation of the variational information of the partition at time step tn with that of tn À 1 and the spikes indicate changes in partitioning result. No spikes are present in any of the plots after clustering into 2, implying the two clusters are equally stable with any k. Because the plateau at two-cluster partition was the largest (i.e., the algorithm stays at two-cluster partition for the largest time period), we therefore identified that the two-cluster partition as the most stable partition T A B L E 2 Normalised mutual information (NMI) between two-cluster partitions of the different K-nearest neighbour (k-NN) graphs and the two-cluster partitions of the 11-NN graph in GS subsample for each of cortical thickness (CT), cortical surface area (CSA), cortical volume (CV) and subcortical volume (subCV) measures the resulting clusters based on different metrics were not highly dependent on each other, we computed the NMI between the clusters. The NMI between the clusters based on different modalities was presented in Section S1.9, Table S5a. The low NMI among clusters is also consistent with the increasing numbers of studies reporting low correspondence between these modalities, particularly for area and thickness in terms of genetic influences and associated phenotypes (Cox et al., 2018;Grasby et al., 2020;Panizzon et al., 2009;Winkler et al., 2010). Figure 2 shows the summary of modules merging along the time scale and reaching an equilibrium of two clusters for all four k-NN graphs. The VI across Markov time for different k-NN graphs was low for clustering into 2 and 3, which implied that these partitions were stable (Figure 2). Because the plateau at the two-cluster partition was the largest (i.e., the algorithm stays at the twocluster partition for the largest time period and that is consistent across different k-NN settings), we therefore concluded that the two-cluster partition was the most stable partition to assess for associations with the clinical and cognitive phenotypes.
We computed differences in brain measures between the two clusters identified in the 11-NN graph in Cohen's d effects for GS subsample. The regions with largest Cohen's d for each modality were as follows: CT, right hemisphere (RH) supramarginal (d = 1.662); CSA, left hemisphere (LH) rostral middle frontal (d = 1.387); CV, LH superior frontal (d = 1.461); subCV, RH ventral diencephalic volume (d = 1.762). Overall, regions with large effect sizes included superior, medial and orbitofrontal regions, temporal and parietal cortices, and subcortically in ventral diencephalic volume, as well as thalamus and hippocampus. Full results are reported in Section S2.1.1, Tables S6-S9.

| Replication of clustering results in UKB
Similar to GS subsample, two clusters were identified within each of the feature modalities (CT, CSA, CV and subCV) for UKB data.
The data were again optimally partitioned into two clusters across different k-NN graphs. Strong similarity between the partitions from different k-NN graphs for CT, CSA and CV was found with high NMI (>0.7) between the two-cluster partitions of the 5-NN and the two-cluster partitions of either 7-NN, 9-NN or 11-NN graphs (Table 3). For subCV, we also saw high accuracies (subCV: ≥92.9%). The NMI between the clusters based on different modalities was presented in Section S1.9, Table S5b.
Similar to the GS subsample, we computed differences in brain measures between the two clusters identified in the 5-NN graph in Cohen's d effects for UKB. The regions with largest Cohen's d for each modality were as follows: CT, RH inferior parietal (d = 1.536); CSA, LH superior frontal (d = 1.090); CV, LH precuneus (d = 1.058); subCV, RH ventral diencephalic volume (d = 1.416). Clusterrelated differences were highly correlated between GS subsample and UKB datasets: correlation coefficients were 0.9392, 0.9226, 0.9241 and 0.7931 respectively for CT, CSA, CV and subCV modalities; see Figure 3. The top 20 cortical regions and top 10 subcortical regions driving the clusters' separations as well as the corresponding Cohen's d for each of the four clustering analyses are listed in Table 4. These results indicate that the natural groupings of participants, as well as the regional measures that best separate the identified clusters, were similar across the GS subsample and UKB datasets. Among those top regions, there was at least 70% overlap between the two cohorts. The overlapping regions included ventral diencephalic volume, thalamus and hippocampus for subcortical regions, and superior, medial and orbitofrontal regions, as well as parietal regions for cortical metrics.
Details of effect sizes for the clustering results based on all four modalities for both cohorts are reported in Section S2.1.1, Tables S6-S9. All effect sizes were positive, which implies that across all four feature modalities, regional measures in one cluster were larger compared with the other cluster, independent of sex, age and ICV differences. Figure 4 shows that the effect sizes were positive for all regions when clustering was based on CSAs. Figures for other modalities can be found in Section S2.2, Figures S1-S3.
T A B L E 3 Normalised mutual information (NMI) between two-cluster partitions of the different K-nearest neighbour (k-NN) graphs and the 2-cluster partitions of the 5-NN graph in UKB for each of cortical thickness (CT), cortical surface area (CSA), cortical volume (CV) and subcortical volume (   For both the GS subsample and the UKB, those regions contributing the most to cluster separation included lateral orbitofrontal, post central, precentral, precuneus, rostral middle frontal, superior frontal, superior parietal and supramarginal areas in both hemispheres. Large effect sizes were noted for these regions (GS subsample: d = 0.8682-1.662; UKB: d = 0.7761-1.536), including some regions where d > 1.2, giving confidence in the separation of the clusters (Lakens, 2013;Sullivan & Feinn, 2012). Those regions contributing least to between cluster separation (most consistent across individuals) were the caudal anterior cingulate cortex, entorhinal cortex, frontal pole and temporal pole in both hemispheres, and parahippocampal gyrus in the LH (GS subsample: d = 0.2095-0.6893, UKB: d = 0.2498-0.6389).

| Association between clusters and MDD and cognitive variables in GS subsample and UKB
As stated in the method sections, the KW test was used as the test statistics to determine statistical significance of between-cluster differences for MDD and cognitive

| Association of clusters with MDD in GS subsample
The clusters were found to have no significant association with the presence of an MDD diagnosis in any of the two-cluster results derived from the four different modalities (CT: p FDR = .2239; CSA: p FDR = .3777; CV: p FDR = .2295; subCV: p FDR = .6585). We also tested whether the clusters were associated with the severity of depression in GS subsample by only including recurrent cases (N = 186) in the MDD group and found that this was also not significant (CT: p FDR = .9353; CSA: p FDR = .4020; CV: p FDR = .9184; subCV: p FDR = .6906).
F I G U R E 4 Standardised mean differences in regional cortical surface areas (CSAs) between the two clusters identified in the Generation Scotland (GS) subsample and in UK Biobank (UKB). Higher differences between clusters were found for lateral orbitofrontal, post central, precentral, precuneus, rostral middle frontal, superior frontal, superior parietal and supramarginal areas in both hemispheres. Most of these areas are closely related to executive functions and decision making. These regions are also associated with various diseases that may contribute to larger differences between individuals Information about the effect sizes between clusters as well as the effect sizes between cases and controls is included in Section S2.1.2, Tables S10-S13. Table 5 shows the corrected p values and effect sizes of associations between the cognitive tasks and the clustering result for CT, CSA, CV and subCV modalities. The general cognitive ability ( g-factor) was found to be significantly associated with the clustering based on CSA (d = 0.2771, p FDR = 8.69e-5), CV (d = 0.3490, p FDR = 5.27e-6) and subCV (d = 0.2529, p FDR = .0022) but not CT. As for individual tasks, the digit symbol coding (DSy) and Matrix tests were found to be significantly associated with the clustering based on CSA (DSy: FreeSurfer measures (for all of CT, CSA, CV and subCV) were related to positive effect sizes in cognitive measures, and these results were independent of sex, age and ICV differences. These results suggest that participant clusters defined by larger imaging measures may be characterised by better cognitive performance.

| Associations of clusters with MDD status and cognitive measures in UKB
As in the GS subsample, clusters in UKB were found to have no significant associations with lifetime MDD diagnosis (CT: p FDR = .7690; CSA: p FDR = .3059; CV: p FDR = .2003; subCV: p FDR = .6703). The absence of a direct one-to-one correspondence between the cognitive tasks in GS subsample and the UKB precluded a direct replication of the test-specific cognitive findings in GS subsample using UKB data. However, due to the advantages of cross-battery stability conferred by computing a g-factor (Fawns- Ritchie & Deary, 2020;Johnson et al., 2004;Johnson, te Nijenhuis, & Bouchard, 2008), we replicated the group differences in g-factor. The gfactor were found to be significantly associated with clus- p FDR = 4.48e-5), and also for CT (d = 0.0573, p FDR = .0318). As for individual tasks, the score for VNR (UKBID:20016.2.0, Fluid Intelligence) was also found to be significantly associated with clustering based on CSA (d = 0.0999, p FDR = 5.19e-5), CV (d = 0.1091, p FDR = 1.17e-5), subCV (d = 0.0923, p FDR = 4.48e-5) and CT (d = 0.0502, p FDR = .0318). There are also significant associations of clusters based on subCV with reaction time (d = À0.0716, p FDR = .0011), and clusters based on CV and CT with prospective memory (CV: d = 0.0483, p FDR = .0468, CT: d = 0.0485, p FDR = .0489) (see Table 6).

| Overall summary
In the current study, we employed an exploratory approach and performed unsupervised spectral clustering with k-NN graphs, which were based on pairwise distances in structural brain measures derived with FreeSurfer. We aimed to determine the presence of natural groupings of participants and their relation to lifetime MDD and cognitive ability. The results identified a natural split of the data into two main clusters for each of the four modalities studied, where clustering results for separate modalities were independent of each other. We replicated the natural groupings of participants into two main clusters in each modality in an independent dataset (UKB) based on the highly correlated cluster-related differences between the two cohorts, with correlation coefficients 0.9392, 0.9226, 0.9241 and 0.7931 respectively for CT, CSA, CV and subCV modalities. Moreover, the results were not driven by common covariates, namely, sex, age, MRI site and ICV. It was found that the strongest contributors to the cluster separation were the ventral diencephalic volume, thalamus and hippocampus for subcortical regions (d = 1.0891-1.7621) and superior, medial and orbitofrontal regions, along with temporal and parietal regions for cortical metrics (d = 0.8192-1.6617). The clusters identified were not related to lifetime MDD status in either dataset. We also did not find associations with the more severe MDD cases by taking only those with recurrent MDD in the GS subsample (see Sections 3.2.1 and 3.2.3). Although we found no relationship with MDD, there was however significant relationships with cognition base on the general cognitive ability (gfactor) in both GS subsample and UKB (Johnson, Carothers, & Deary, 2008). The clusters also showed significant relationships with some other specific tests mainly in the domains of reasoning (Matrix in GS subsample and VNR in UKB) and processing speed (WAIS-III UK DSy score in GS subsample and Reaction Time Task in UKB). Results suggest that the participant clusters defined by larger FreeSurfer measures are in general characterised by better cognitive performance. Apart from MDD status and cognitive abilities, assessing associations of clusters with other variables (for example brain age, stress and social economic status) could also be an interesting future research direction.
A key feature of our work is the use of covariates to ensure that the clusters are not driven by important factors such as age, head size and sex. Prior to our study, Zhou et al. employed supervised feature selection on N = 3,297 brain morphometric measures that approximately represented the 3D neuroanatomical integrity of the participants' brains in the UKB as well as N = 4,316 demographic, clinical, biological specimen, imaging, genomic, and questionnaire variables for N = 9,914 subjects (Zhou et al., 2019). Although using different clustering methods (k-means clustering and hierarchical clustering), Zhou et al. also carried out clustering analysis on all derived neuroimaging measures and also obtained two clusters, of which one cluster showed larger values in all of their top 20 neuroimaging variables. Contrary to the results of our study, their resulting clusters did show differences regarding in mental health variables, including depressive symptoms. However, they did not adjust for basic covariates and found a significant association of clusters with sex, so that the significant between-cluster differences found in mental health variables, including depressive symptoms, might be driven by the significant sex disparity between the clusters. That the current study and Zhou et al.'s study show mixed results regarding associations with mental health variable may therefore be related to different methodological approaches.

| Interpretation of between-cluster effect sizes
The calculated effect sizes, that is, the Cohen's d coefficients, represent the degree of separation between the individuals in the two clusters for each brain region. Most regions had medium to high effect sizes, which indicates that the two clusters were clearly separated (Lakens, 2013;Sullivan & Feinn, 2012).
The greatest effect sizes were seen for CT in RH supramarginal area in GS subsample (d = 1.662) and for CT in RH inferior parietal area in UKB (d = 1.536) in cortical measures as well as right ventral diencephalic volume for both GS subsample (d = 1.762) and UKB (d = 1.416) in subcortical measures. The top regions between GS subsample and UKB had a high percentage of overlap, as shown in Table 4. Details of between-cluster effects can be found in Section S2.1.1, Tables S6-S9. In general, for both cohorts, as well as across different cortical metrics, we note that those regions with larger effect sizes tended to be those that are commonly associated with higher cognitive functions such as executive function and decision making (e.g., precuneus, rostral middle frontal gyrus, superior frontal gyrus, lateral orbitofrontal gyrus and superior temporal gyrus) (Barbey et al., 2012;Camilleri et al., 2018). These results are in line with prior work on brain regional correlates of intelligence (Cox et al., 2018; and add further reference-via an unsupervised clustering method-that these higher order cortical regions are related to cognitive ability beyond influences of gross head size, age and sex. We note that Cox et al. found that the frontal pole contributed the most to intelligence , whereas, in the current study, the frontal pole was found to have one of the smallest betweencluster effect sizes. We consider that this difference likely originates from substantial differences in software, anatomical labelling and analysis methods. Cox et al. employed the UKB-processed FSL FIRST and FSL FAST parcellations, whereas this study used FreeSurfer derived metrics. This is important because there is currently no consensus regarding the definitions of the posterior extent of the frontal pole from structural neuroimaging data and both of these methods uses different atlas definitions (see Bohland et al., 2009;Cox et al., 2014). Further, Cox et al. implemented structural equation models (SEMs) targeting the associations between individual ROIs and the g-factor directly, although clustering analysis as used here in general does not directly test associations between ROI measures and other variables. Moreover, the clustering analysis in this study was methodologically driven by structural brain measures and not cognitive ability measures.

| Limitations
A non-hypothesis-driven graph clustering analysis can generally help to discover population subtypes based on non-linear relationships between independent variables. For clinical studies, the drawback is that the identified subtypes (clusters) are not guaranteed to be clinically relevant. In our case, we found that the partitioning results were not associated with MDD status as measured in the current samples. However, we note that our samples were relatively healthy, with few individuals having current depression (most cases met criteria for lifetime MDD rather than current MDD). It is possible that using these diagnostic criteria may have contributed to lack of association of clustering results with MDD status. Previous studies of MDD case-control classification, where high accuracies (i.e., ≥85 %) were achieved, typically had sample size smaller than 100 and involved clinically ascertained current MDD cases, in some cases with severe or treatment-resistant depression (Johnston et al., 2015;Mwangi et al., 2012;Patel et al., 2015).
In terms of the lack of MDD-related differences between the clusters in the context of previous supervised studies, we cannot exclude the possibility that this may be due to sample differences (our samples are relatively well community-based samples) or that our unsupervised method may not be sensitive enough in its current form to detect these brain features of typically small to very small effect sizes.
Moreover, we also performed clustering analysis without controlling for any covariates as an initial testing of our method. We found that the resulting clusters were associated with MDD status based on CT, CSA and CV measures for GS subsample and CT, CSA, CV and subCV measures for UKB (see Section S1.7, Tables S3a and S3b) but they were also strongly driven by sex, age, ICV and scan site. Because we were not specifically interested in sex, age or site effects, these regressed out of the brain measures prior to the main analysis. We note, however, that without residualisation, we do indeed see the expected clustering related to these characteristics. We cannot exclude the possibility however that residualisation may have removed some effects related to MDD.
In spite of the relative invariance of g to cognitive test battery content, we note that sufficient breadth of cognitive domains is an important consideration in deriving a comparable g-factor. For example, it might be considered that GS tests were more nonverbal and fluid when compared with the verbal and crystallised abilities in UKB. Thus, although there were verbal and crystallised elements in the UKB VNR test, that the UKB tests used here were subsequently shown to be relatively good g measures, it is possible that the g measures extracted across the two cohorts showed imperfect correspondence. Nevertheless, our finding that the natural clustering showed g-differences in both cohorts further militates against this as a substantial confounder of our results.
The current study involved using information from MRI scans based on FreeSurfer parcellations according to the Desikan-Killiany atlas. We note that greater information in the form of raw voxel-wise data may be a better representation of the brain structure and may improve the clustering quality. This would, however, significantly increase the computational cost. Future research could also apply clustering analysis on functional MRI data.

| CONCLUSION
We employed a novel unsupervised clustering algorithm to find natural participant groupings within two large independent datasets of brain structural measures. A natural grouping of two clusters was identified in the first dataset (GS subsample) for each of the four studied modalities and was replicated in the second dataset (UKB). The main regions driving cluster separation were ventral diencephalic volume, thalamus and hippocampus, superior, medial and orbitofrontal regions, along with temporal and parietal regions in both GS subsample and UKB datasets. Although the clusters were not related to lifetime MDD, they were found to be associated with general cognitive ability (g-factor, computed based on multiple cognitive tasks) in both cohorts, and also with specific reasoning tasks, namely, the Matrix and DSy tasks in GS subsample and Fluid Intelligence Score in UKB. Regions with relatively high cluster-related effect sizes were the higher order cortical regions, commonly associated with executive function and decision making. Future work could focus both on development and application of ML methods to voxel-wise and multimodal brain imaging data as well as looking at associations with other clinically relevant metrics.
(England) and the UK devolved administrations, and leading medical research charities. Simon R. Cox was supported by Age UK (Disconnected Mind project), the UK Medical Research Council (MRC: MR/R024065/1) and the US National Institutes of Health (NIH) (R01AG054628).

CONFLICT OF INTEREST
AMM has received research support from Eli Lilly, Janssen, and The Sackler Trust. AMM has also received speaker fees from Illumina and Janssen.

DATA AVAILABILITY STATEMENT
The data collected in the STRADL study have been incorporated in the larger Generation Scotland dataset. Nonidentifiable information from the Generation Scotland cohort is available to researchers in the United Kingdom and to international collaborators through application to the Generation Scotland Access Committee (access@ generationscotland.org) and through the Edinburgh Data Vault (https://doi.org/10.7488/8f68f1ae-0329-4b73-b189-c7288ea844d7). Generation Scotland operates a managed data access process including an online application form, and proposals are reviewed by the Generation Scotland Access Committee. Data from the UK Biobank resource are available for health-related research upon registration and application through the UK Biobank Access Management System (https://www.ukbiobank.ac.uk/registerapply/). The code for Markov Stability algorithm for the analysis of undirected weighted graphs can be found on this public repository: https://github.com/ michaelschaub/PartitionStability.