Detecting Clinically Meaningful Shape Clusters in Medical Image Data: Metrics Analysis for Hierarchical Clustering Applied to Healthy and Pathological Aortic Arches

Objective: Today's growing medical image databases call for novel processing tools to structure the bulk of data and extract clinically relevant information. Unsupervised hierarchical clustering may reveal clusters within anatomical shape data of patient populations as required for modern precision medicine strategies. Few studies have applied hierarchical clustering techniques to three-dimensional patient shape data and results depend heavily on the chosen clustering distance metrics and linkage functions. In this study, we sought to assess clustering classification performance of various distance/linkage combinations and of different types of input data to obtain clinically meaningful shape clusters. Methods: We present a processing pipeline combining automatic segmentation, statistical shape modeling, and agglomerative hierarchical clustering to automatically subdivide a set of 60 aortic arch anatomical models into healthy controls, two groups affected by congenital heart disease, and their respective subgroups as defined by clinical diagnosis. Results were compared with traditional morphometrics and principal component analysis of shape features. Results: Our pipeline achieved automatic division of input shape data according to primary clinical diagnosis with high F-score (0.902 ± 0.042) and Matthews correlation coefficient (0.851 ± 0.064) using the correlation/weighted distance/linkage combination. Meaningful subgroups within the three patient groups were obtained and benchmark scores for automatic segmentation and classification performance are reported. Conclusion: Clustering results vary depending on the distance/linkage combination used to divide the data. Yet, clinically relevant shape clusters and subgroups could be found with high specificity and low misclassification rates. Significance:  Detecting disease-specific clusters within medical image data could improve image-based risk assessment, treatment planning, and medical device development in complex disease.


I. INTRODUCTION
Modern medical imaging techniques such as computed tomography (CT) and magnetic resonance (MR) imaging provide detailed and accurate anatomical and functional information of inner body structures and organs, making them widely used tools for diagnosis and treatment planning.Consequently, medical image databases are growing and valuable patient data are accumulating, calling for novel approaches to process and extract clinically relevant information not only on a case-by-case basis, but also considering entire patient populations [1], [2], [3].
Many computational image processing pathways focus on segmentation of body structures [4], [5] or apply classification algorithms to automatically distinguish between healthy and disease [6], [7], [8].Yet, to date few studies have looked at tools that can be applied after those two crucial steps, computational tools that can help understand a disease once anatomical shape information is given and once a diagnosis has been made.Automated clustering techniques from the field of data mining have been widely used in genomics, taxonomy and chemoinformatics to structure large amounts of data into subgroups, thereby revealing previously unknown, yet relevant patterns within a given population [9], [10].We believe that such an approach may prove beneficial as well for the analysis of complex three-dimensional (3D) anatomical models from medical image data in order to close the gap between mere data and useful knowledge, as desired in current Precision Medicine or "Precision Imaging" approaches [3].Clinical image assessment of inner body structures usually reveals a patient's dominant pathology, but it often remains unclear how individual image data relate to other patients with the same disease or primary diagnosis.Grouping patients according to anatomical similarity and taking into account clinical history and other functional or outcome parameters may ultimately allow refined, cluster-adapted treatment and follow-up strategies and could assist in risk-stratification when scanning a new patient with similar diagnosis.Hierarchical clustering techniques seem to be an attractive way to discover anatomical subgroups from medical image data as they are inherently unsupervised, thus do not require any prior information about the study population and, unlike K-means clustering, do not require specifying an expected number of subgroups [11], [12], [13].Furthermore, clustering results can be graphically summarised in a dendrogram that depicts in a tree-like diagram how similar subjects are grouped together, while dissimilar subjects are placed on different branches of the tree.However, evaluation of subject similarity or dissimilarity and clustering results heavily depends on the choice of both similarity or distance metric (with low intersubject distance relating to higher similarity) and linkage function determining how subjects are linked together to form a subgroup [12], [13].Depending on the chosen distance/linkage combination, clustering results may vary substantiallypotentially rendering meaningless results [14], [15].While previous studies have analysed clustering techniques based on generic shapes or two-dimensional (2D) shape data [16], few have assessed hierarchical clustering performance using actual patient data in a realistic setting, i.e. using three-dimensional (3D) anatomical models of healthy and pathological shapes derived from medical images [17], [18].In general, medical image hierarchical clustering performance data including validation against known and clinically relevant clusters are sparse.In this study, we aimed to investigate whether and how hierarchical clustering can be used to automatically divide a bulk of unlabelled clinically acquired cardiovascular magnetic resonance (CMR) image data into clusters and subgroups that could be of clinical relevance.
Specifically, we sought to analyse clustering classification performance of various distance/linkage combinations applied to a population of 60 aortic arch anatomical models, automatically segmented from CMR data, composed of three equally-sized subgroups of healthy aortic arches, arches post aortic coarctation repair (COA) [19] and arches post arterial switch operation (ASO) [20].COA and ASO patients suffer from congenital heart disease (CHD), which manifests itself in abnormalities of cardiovascular structures (here, the aorta, known to present shape patterns abnormal from healthy individuals [19], [21] [22]).Anatomy plays a crucial role in both diagnosis and therapy of CHD, as shape abnormalities often lead to functional impairment, requiring intervention.COA and ASO image data provide an excellent platform to test unsupervised clustering algorithms, as newly found shape clusters or subgroups within those diseases may ultimately impact on novel diagnosis and treatment strategies.
To assure "meaningfulness" (here, clinical relevance) of unsupervised clustering results, we externally validated [15], [14] our results against clinical expert opinion, traditional morphometric parameters and 3D shape analysis via principal component analysis (PCA).We aimed to find the distance metric/linkage function combination that achieved highest classification performance, i.e. that was able to automatically divide the bulk CMR input data into the three clinically meaningful clusters of CTRL, COA and ASO arch shapes with low misclassification rates.
Furthermore, we hypothesised that such clinically meaningful clustering on a macrolevel yields meaningful shape subgroups (i.e."clusters within clusters") on lower-level hierarchies of the clustering tree as well, which may allow the detection of novel disease patterns in future studies.

II. METHODS
The study outline is as follows: all aortic arch shape models were automatically segmented from CMR data and were parameterised within one common mathematical framework using a non-parametric statistical shape modelling (SSM) approach based on non-rigid registration of a computed template shape [23], [24].Based on this shape data, we applied principal component analysis (PCA) for more detailed assessment of 3D shape features prior to cluster analysis.Hierarchical clustering was then performed on both the full, unprocessed shape data and the reduced PCA dataset to determine the input and the distance/linkage combination yielding clustering closest to the clinical expert diagnosis with high classification performance.Lastly, the distance/linkage setting yielding the most meaningful division of the data (with highest F-score and Matthews Correlation Coefficient) was analysed in more detail.

A. Patient Population
A total of 60 patients, who underwent routine CMR examination (whole heart 3D balanced, steady-state free precession acquisition; 1.5T Avanto MR scanner, Siemens Medical Solutions, Erlangen, Germany) at Great Ormond Street Hospital for Children (GOSH, London, UK) were retrospectively included in the study.The cohort was divided into three subgroups according to their clinical primary diagnosis: 20 healthy subjects whose aortic arch shapes were reported as normal at cardiac assessment (control group CTRL, age 15.2±2.03years, 3 female), 20 patients who had undergone surgical aortic arch reconstruction for treatment of coarctation of the aorta (COA, 23.1±7.35years, 4 female) and 20 patients who had their aorta pushed back posteriorly in the Lecompte [20] manoeuvre for arterial switch operation (ASO, 14.4±2.48years, 4 female).Ethical approval was obtained by the Great Ormond Street Institute of Child Health/GOSH Research Ethics Committee and all patients or legal parent or guardian gave informed consent for research use of the image data.

B. Segmentation and Registration
The aorta including the left ventricle (LV) was segmented automatically using a multi-atlas propagation segmentation approach that applies a locally normalised cross correlation (LNCC) based ranking combined with a consensus based region-of-interest selection, which has been successfully applied to whole heart [25] and right ventricle segmentation from CMR data [4].For each group, a leave-one-out strategy was followed, where 19 manually labelled atlases of the respective group were used to segment one unseen subject, and dice similarity coefficients (DSC) were computed to quantify automatic segmentation accuracy following DSC=2AB/(A+B), where A is the obtained segmentation and B the corresponding ground truth.Automatic segmentation results were visually inspected and, if necessary, manually edited (i.e.cleaned up and improved) using ITKSnap [26].
Segmentation labels were exported as 3D computational surface meshes in the Visualization Toolkit (VTK) format [27] and visualised in ParaView [28].All models were cut consistently below the aortic root and at the level of the diaphragm using The Vascular Modeling Toolkit (VMTK, [29]) cutting tools, whilst coronary arteries and head and neck vessels were cut off as close as possible to the arch.All surface meshes were then rigidly registered to one healthy CTRL subject using an Iterative Closest Point (ICP) algorithm [30] prior to template computation (i.e.3D population anatomical mean shape, see section C).In order to remove bias due to misalignment of input shapes, a Generalised Procrustes Analysis (GPA) was adopted, by computing an initial template, realigning the input shapes to the new template via ICP registration and recomputing the template until convergence, as described in [31].

C. Template and Deformation Matrix Computation
The 60 aligned arch surface meshes constituted the input for the template computation using the openly available Deformetrica code framework (www.deformetrica.org)[32].The framework computes the 3D template shape of an input shape population, without assuming any point-to-point correspondence between input meshes.This is achieved by modelling shapes as mathematical currents (surrogate representations of shapes), which characterise a shape as a distribution of shape features rather than its actual point coordinates in space [23], [24].Template and resulting template-dependent shape parameterisations were computed following protocols detailed in [31].Surface meshes were transferred into a vector space of currents W, generated by a Gaussian kernel.The standard deviation of the kernel λW allows control of the currents resolution and was set to 5mm.The template and its transformations φi registering template to each subject shape were computed simultaneously using the large deformation diffeomorphic metric mapping (LDDMM) framework [33].The transformation functions φi were defined within another Gaussian kernel vector space V with standard deviation λV, set to 20mm, controlling the transformation stiffness.All 3D shape features present in the population were thus encoded by subject-specific transformations of the template φi, which are parameterised by a unique set of deformation vectors βi for each patient shape.Setting λW to 5mm and λV to 20mm, resulted in a set of 300 βi per patient.With each βi having an x, y and z entry, a final deformation matrix DFull of size N x n with N=60 included subjects and n=900 deformation momenta comprised all 3D shape information of the input population and was used for further analysis via PCA and hierarchical clustering.

D. Morphometric Analysis and Principal Component Analysis
To investigate whether arch shape characteristics related to size and shape were sufficiently different between the three groups (i.e.whether the three patient groups translate into three shape groups), traditional morphometric analysis was carried out in 2D and in 3D, without controlling for size difference as size itself is a descriptor of pathological paediatric patient arch shape as well.In terms of size, aortic arch model volume V, surface to volume ratio SVol and arch centreline length CLlength were derived automatically using VMTK and Matlab (The MathWorks, Natick, MA).As shape parameters, we considered arch centreline tortuosity CLtort [34], ascending to descending aortic arch diameter ratio Dasc,desc and arch width T, manually measured on the image slices as described in [19], [31].
Further, we performed PCA on the covariance matrix of the combined deformation vectors βi [35] to extract PCA shape modes, each describing a certain amount of 3D population shape variability as a deformation of the template shape.Each subject deformation φi was projected onto each PCA shape mode to obtain the low-dimensional shape vector { , },   [1, 𝑚] [35] for each shape mode k and subject i, whose entries parameterise the subject-specific PCA loadings.The { , } were compared between the three groups CTRL, COA and ASO, and the { , } of the first two PCA shape modes were plotted against each other to visualise potential grouping within the input shape data.The first m=19 shape modes, explaining 90% of the total shape variability (determined by the proportion of sorted eigenvalues) were selected [36] and the respective { , } combined constituted the reduced PCA shape loading matrix DPCA of size N x m, which described 3D population shape features in terms of the lower-dimensional PCA loadings.

E. Hierarchical Clustering
The shape matrices DFull and DPCA constituted the input for the agglomerative hierarchical clustering algorithm (Matlab).Based on a pre-defined distance (i.e.similarity) metric, clusters are formed by grouping subjects with similar features together, while subjects with distinctly different features are placed in other clusters.This unsupervised approach unveils "naturally occurring" subgroups within the data, without depending on prior user input [12], [16].Here, features of interest were 3D aortic arch shape features, parameterised by the entries of DFull and DPCA.The algorithm can be described as follows [37], [38]:

1) Compute distances between every pair of subjects within the input dataset to obtain a metric of pairwise subject similarity (treating each subject as its own cluster). 2) Form binary cluster from two closest (most similar) subjects (using distance metric) or clusters (using linkage function). 3) Recompute distances between newly formed cluster and remaining subjects or clusters. 4) Return to Step 2 until all subjects are included in one large cluster, formed by a tree-like multi-level network of subclusters (dendrogram). At the lowest level, each subject forms its own cluster. 5) Cut off dendrogram branches at a specified level of the hierarchy to assign subjects below each cut to a specific cluster, generating partitions of the data.
To compute pairwise distances between the 60 patient shapes parameterised by deformation row vectors of DFull or PCA shape vectors of DPCA, the following commonly used distance (similarity) metrics dist between the vector pair xs and xt were computed (with D being of size N x n, with N (1-by-n) row vectors   ,   [1, 𝑁]; for DFull with   [1, … , 900] and for DPCA with   [1, … , 19]) [38]: with sj being the standard deviation of the xs and xt over the sample set.After defining a distance metric between pairs of subject shapes, a linkage function then uses the generated distance data to join groups of subjects together into binary clusters and link those to higher level larger clusters, until all subjects are linked together.The linkage function thus defines the similarity or distance between two groups of subjects and is used to generate the dendrogram.The order in which subjects are clustered together is determined by the type of linkage method.For each distance metric, the following commonly used linkage methods were applied to generate a dendrogram.For subjects or clusters s and t joined into cluster  ∪ , the new distance between this cluster and another subject or cluster k is generally defined by the Lance-Williams dissimilarity update formula link( ∪ , k) (8), which defines different types of linkage methods, depending on the choice of the parameters αs, αt, β and γ as follows [10]: Note that dist can be any of the distance metrics defined in (1-7); ns, nk, nt is the number of subjects in cluster s, k, t, respectively.Centroid, Median and Ward linkage methods are appropriate for Euclidean distances only [38].Cutting the dendrogram horizontally at a particular height or level partitions the data into shape subgroups [12].As we first aimed to assess whether the clustering algorithm was able to distinguish between CTRL, COA and ASO groups, dendrograms were cut automatically at a level that yielded three large shape clusters.

F. Clustering Classification Performance Measures
Based on the majority of group members associated with one cluster, each cluster was automatically labelled either CTRL (Class1), COA (Class2) or ASO (Class3) and numbers of assigned subjects from each of the three classes were recorded in a confusion matrix to assess clustering classification performance.All correctly assigned subjects for each class are shown on the diagonal of the matrix.For each of the three classes Classj   [1,3], the total number of true positives (TPj, e.g. in case of the CTRL class, the actual CTRLs that were correctly classified as CTRL), false positives (FPj, e.g.COA and/or ASO that were incorrectly classified as CTRL), false negatives (FNj, e.g.CTRLs that were incorrectly classified as COA and/or ASO) and true negatives (TNj, e.g.all remaining subjects, correctly classified as non-CTRL) were derived from the confusion matrices.
With these values, overall classification performance was computed using macroaveraging (denoted with subscript M) [39] over L=3 classes of the following performance measures: To minimise chance findings and bias associated with those traditional measures, we also computed (macroaveraged) Informedness, which relates to the probability that there has been an informed classification as opposed to mere guessing, and Markedness, defined as [40]: ) F-scoreM and MCCM scores were used to evaluate overall classification performance of the various distance metric and linkage combinations.Note that F-score values range from 0 for worst to 1 for best classification performance, whereas MCC ranges from -1 for total disagreement over 0 for random guessing to +1 for perfect prediction of classes [41].In the following, the qualitative term "best" refers to highest possible classification performance in terms of both F-scoreM and MCCM score being close to the value 1.

G. Validation of Clustering Results
Clustering results were evaluated using 10-fold cross validation (CV), leaving out N/10 randomly selected subjects, and recomputing template, DFull and DPCA, until each subject had been left out once.Classification performance measures were calculated for each of the 10 CV runs, looping through all 49 distance metric/linkage combinations for the two different input matrices DFull and DPCA, respectively.All clustering runs were carried out on a 32GB workstation using one 2.3GHz core.The distance/linkage combination with the best classification performance based on mean F-scoreM and MCCM was chosen for further analyses of the full data matrix, comprising all N=60 subjects.Results of this final clustering were visualised as a dendrogram and compared to PCA results.

H. Statistical Analysis
For all analysed size, shape and PCA shape vector entries, mean and 95% confidence intervals (95CIs) based on the patient cohorts are reported.For classification performance measures, mean and 95CIs are reported based on the CV runs.
To compare distributional differences between the three patient groups CTRL, COA and ASO, independent analysis of variance (ANOVA) was performed.Prior to ANOVA, homogeneity of variance was assessed using Levene's test.In case homogeneity of variance was violated, Welch's test was performed.When ANOVA showed significance, post hoc tests were carried out for pairwise group comparisons and Bonferroni adjusted to control for Type I error rates.Statistical significance was assumed at level p < .05.All statistical tests were carried out using R v3.3.1 (R Foundation for Statistical Computing, Vienna, Austria).

A. Segmentation
Average segmentation runtime was approximately 2 hours per patient (parallel processing on a 24 core, 2.3GHz, 32GB RAM workstation).Average Dice scores (±95CI) for the automatically computed segmentation labels compared to their respective ground truths were 0.917±0.026for the CTRL, 0.944±0.012for the COA and 0.913±0.033for the ASO group.Final automatic segmentation labels required a maximum of 10 minutes manual clean-up.

B. Comparison of Traditional Shape Parameters
In terms of size, significant distributional differences in V (Fig. 1a) were found between the COA and CTRL group (p=2e-07), and the COA and ASO group (p=7e-06).SVol distributions (Fig. 1b) differed significantly between the COA and CTRL group (p=1e-06), and the COA and ASO group (p=3e-03).Distributional differences in CLlength (Fig. 1c) were found between the COA and CTRL group (p=5e-05) and the COA and ASO group (p=1e-06).Overall, COA aortic arches were significantly larger and more compact, whereas arch models from the CTRL and ASO group were of similar size.Following this analysis, we would expect the clustering algorithm to confuse CTRL and ASO shapes, while separating out well the COA group, if it mainly took into account size differences between input shapes.
With regard to measured shape parameters, significant differences between all three groups were found for CLtort following post hoc analyses (p=2e-07 for COA vs CTRL, p=1e-14 for COA vs ASO and p=1e-02 for CTRL vs ASO, Fig. 1d).Similarly, Dasc,desc distributions (p=1e-05 for COA vs CTRL, p=7e-10 for COA vs ASO and p=4e-02 for CTRL vs ASO, Fig. 1e) and T distributions differed significantly (p=6e-08 for COA vs CTRL, p=2e-16 for COA vs ASO and p=3e-09 for CTRL vs ASO, Fig. 1f) between all three groups, with COA arches showing generally more tortuous and wider arch shapes with higher ascending to descending aortic diameter ratios than the other two groups and ASO arches being the least wide, least tortuous with the lowest ascending to descending arch diameter ratios.

C. Principal Component Analysis of 3D Shape Features
The first three shape modes are visualised in Fig. 2. PCA shape mode 1 accounted for 35.4% of shape variability.It described shape change from an overall small and short, ASOlike arch shape with narrow arch width towards a large, COAlike arch shape with high arch width, dilated root and ascending aorta, and more tortuous descending aorta continuation (Fig. 2a).In terms of { ,1 } shape vector entry distributions, COA arches differed significantly from the CTRL group (p=4e-08) and from the ASO group (p=3e-12).CTRL and ASO shape vector entry distributions did not differ significantly (p=.050).
PCA shape mode 2 described shape variability associated with more rounded and wide arches compared to more "gothic" [19] arch shapes with similar arch height but smaller arch width.It accounted for 12.2% of the total shape variability (Fig. 2b).The { ,2 } entry distribution for the ASO group differed significantly from the CTRL group (p=2e-08) and from the COA group (p=1e-06), while there was no significant difference between the CTRL and COA groups (p=.930)PCA shape mode 3 accounted for 7.4% of shape variability.It varied from arch shapes with lower arch height and slightly dilated root to arches with higher arch height but similar arch width and few diameter changes along the arch (Fig. 2c).For this mode, the { ,3 } distribution of the CTRL group was significantly different from the COA group (p=2e-02) and from the ASO group (p=3e-03) but no significant difference was found between the COA and ASO group (p=.999) Following the analysis of traditional shape parameters and the first three PCA shape modes, we concluded that all three selected patient groups were sufficiently different from each other, thus forming three distinct shape groups to be found by the clustering algorithm.Furthermore, plotting the { ,1 } and { ,2 } for PCA shape modes 1 and 2 against each other revealed a good split between the three groups in PCA 3D shape space (Fig. 2d), justifying the assumption of three large shape clusters within our cohort of 60 patients.

D. Determining best performing Input and Distance/Linkage Combination
Macroaveraged classification performance measures F-soreM and MCCM for various distance/linkage combinations and the input datasets DFull,CV and DPCA,CV are shown in Fig. 3.
Fig. 2: Results from principal component analysis (PCA) of the deformation shape data DFull.Graphs show boxplots of subject-specific shape vector entries associated with the first three PCA shape modes accounting for 35.4% (mode 1), 12.2% (mode 2) and 7.4% (mode 3) of total shape variability (ac).For PCA shape mode 1, high shape vector entries were associated with the COA group and with COA-like 3D aortic arch shape features, visualised as a deformation of the computed template shape from -2 to +2 standard deviations (SD) below the graph (a).PCA shape mode 2 related to shape features associated with the ASO group, showing a slightly squeezed, gothic-type arch with dilated aortic root compared to a more rounded arch shape for low shape vector entries (b).PCA shape mode 3 visualised shape changes towards an overall slim aortic arch with relatively constant arch diameter, associated with high shape vector entries and thus the CTRL group (c).The PCA revealed significant differences in 3D arch shape features between the three patient groups.The scatterplot of subject-specific PCA shape vector 1 entries vs PCA shape vector 2 entries (d) revealed grouping among aortic arch input shapes in PCA shape space according to the deformation shape data (* denotes statistical significance at level p<.05; ** at level p<.01.).
This work is licensed under a Creative Commons Attribution 3.0 License.For more information, see http://creativecommons.org/licenses/by/3.0/.This article has been accepted for publication in a future issue of this journal, but has not been fully edited.Content may change prior to final publication.Citation information: DOI 10.1109/TBME.2017.2655364,IEEE Transactions on Biomedical Engineering Note that only the linkage option which achieved highest F-scoreM and MCCM score is shown for each distance metric.Best performing linkages were the same for DFull,CV and DPCA,CV, except in the cases of Cosine and Chebychev distance metrics, where DPCA,CV achieved higher scores using the Average linkage instead of Weighted linkage function.
In a one-to-one comparison, clustering using DFull,CV yielded better classification performance both in terms of F-scoreM and MCCM than achieved with DPCA,CV.Only the Chebychev and Cityblock distance metrics performed better for DPCA,CV, yet scoring on average below 0.7 for F-scoreM and below 0.6 for MCCM.The worst performance was found for the Standardised Euclidean distance, even yielding negative (i.e.highly confused) results in terms of MCCM.
On average, the best performing distance metrics (average F-scoreM above 0.7 and average MCCM above 0.5) were the Spearman, Correlation and Cosine metrics in combination with the Weighted linkage and the Euclidean distance in combination with the Ward linkage.However, particularly MCCM scores revealed weaknesses such as large 95CIs for the Cosine metric, making it the most unreliable distance metric.Instead, Spearman/Weighted, Correlation/Weighted and Euclidean/Ward combinations performed consistently well, with the Correlation/Weighted combination achieving on average the best classification performance with F-scoreM=0.902±0.042and MCCM=0.851±0.064for DFull,CV.Therefore, the Correlation/Weighted distance/linkage combination applied to the full dataset DFull,CV was found to yield the best overall shape clustering results with respect to the three patient groups and was chosen for further analysis.

E. Analysis of Best Performing Distance/Linkage Combination
Looking at individual classification performance metrics, the Correlation/Weighted distance/linkage combination performed consistently well, with average InformednessM, MarkednessM and MCCM scores above 0.8 and SpecificityM, RecallM, PrecisionM, F-scoreM and AccuracyM measures around 0.9 (Fig. 4).Highest scores were achieved for SpecificityM (i.e.proportion of patients correctly identified as not being a member of one of the three groups) with 0.948±0.023.
Detailed analysis of the derived confusion matrices for each CV run using the Correlation/Weighted combination and DFull,CV revealed that on average 83% of CTRL arch shapes were correctly assigned to the CTRL group, while 13% were confused with COA and 4% were confused with ASO arch shapes (Table I).For the COA group, on average 85% were correctly assigned and the remaining 15% were confused with CTRL arch shapes.ASO arch shapes were not confused with any other shape, thus 100% were placed correctly into one ASO cluster.Notably, neither were ASO and CTRL shapes confused with high misclassification rates, nor were COA shapes always assigned correctly as we would have expected in case the clustering algorithm only took into account aortic arch size rather than shape (see section B).

F. Subgroup Analysis -Clusters Within Clusters
Finally, clustering classification performance was assessed using the Correlation/Weighted distance/linkage combination and DFull, including all N=60 patients.In this case, only two COA shapes (10%) were confused with CTRL arch shapes, while 100% of both CTRL and ASO arches were assigned to one respective cluster (Fig. 5a).
In order to reveal more refined shape subgroups within the three larger clusters, which would add novel information about previously unknown patterns within the pathological shape clusters, branches were cut at a lower hierarchy level.Tree branches were cut at a height of 0.72, thus forming a total of 10 subgroups with a varying number of members in each larger cluster (Fig. 5b).The CTRL group was divided into 5 smaller subgroups, the COA group into three and the ASO into two.Interestingly, the two confused COA shapes formed one distinct cluster within the CTRL group by themselves, marking them as being different from the other CTRL shapes.
To evaluate whether the 10 subgroups related to meaningful 3D shape groups within the CTRL, COA and ASO clusters, we produced a scatter plot of the PCA shape space generated by the { ,1 } and { ,2 } associated with PCA shape modes 1 and 2 and symbol-coded the respective members of the 10 subgroups according to their subgroup affiliation (Fig. 6).This plot revealed that novel and meaningful shape subgroups within the three larger (known) shape clusters could be found,since arch shapes that were clustered together by the hierarchical clustering algorithm were also clustered closer together in terms of their 3D aortic arch shapes as described by the PCA loadings.Those findings confirmed that our pipeline can be used to detect to date unknown anatomical subgroups and patterns within pathological arch shape populations, which may prove to differ in terms of clinical outcome in future studies of larger, homogeneous patient cohorts.

IV. DISCUSSION
Comparing different types of input data for the unsupervised clustering pipeline, our results showed that a preceding dimensionality reduction via PCA yielded overall lower macro F-scoreM and MCCM scores than the raw deformation vector data.PCA thus did not yield improved clustering classification performance, which is in accordance with previous studies [42].Using the full deformation vector data as input, the distance/linkage combinations Spearman/Weighted, Correlation/Weighted and Euclidean/Ward showed overall good ability to automatically structure the bulk input data into the three clinically defined groups as measured by average F-scoresM above 0.7 and MCCM scores above 0.5, following 10-fold cross-validation.This is in accordance with early observations from Lance and Williams [43] stating that the Correlation distance is suitable for comparing shapes, while the Euclidean distance is generally compatible with many clustering scenarios, probably due to being invariant under translations of the origin and under rotations of the pattern space [44].The Correlation metric may here have resulted in best classification performance as it predominantly measures interrelationships between features (rather than absolute values or magnitudes)here parameterised by shape deformation vectors of a template shape defined in a common mathematical framework.In accordance with this study, Correlation and Euclidean distance metrics have previously been found appropriate for various hierarchical clustering tasks [15], [43], [45] and so has the Ward linkage, specifically when clustering anatomical structures [5], [46], [47].The Correlation/Weighted combination performed best with average SpecificityM, RecallM, PrecisionM and AccuracyM scores around 0.9 and small confidence intervalsconsiderably higher than previously reported accuracies [48].Averaged over all crossvalidation runs, 17% of CTRL shapes, 15% of COA shapes and 0% of ASO shapes were misclassified.Those were lower misclassification rates than reported earlier for hierarchical clustering by Dalton (21% to 28%) [15], and Brun (13% to 48%) [14].Applied to the full dataset of 60 patients, only two COA arch shapes that showed highly localised deformation of the transverse aortic arch were confused with CTRL shapes.This suggests that some subtle 3D arch shape features may not be taken into account sufficiently when computing intersubject distances.This could be addressed in future studies by a weighting of local 3D shape features, depending on which section of the arch (i.e. which anatomical region) is subject of interest.As expected though, ASO arches seemed to constitute a distinctly different shape cluster, allowing for 0% of misclassification, which is a notable performance for an unsupervised and automated approach.
Furthermore, hierarchical clustering results were compared to results from PCA statistical shape analysis and found that both methods compared well in determining shape clusters and subgroups based on the deformation data.More importantly, apart from distinguishing the three clinically known groups (CTRL, COA, ASO) mostly correctly, the clustering algorithm was able to cluster together subjects with similar 3D arch shape on lower levels of the clustering tree as well.This allowed for detection of previously unknown "clusters within the cluster", i.e. novel anatomical patterns within the pathological COA and ASO clusters.While we refrained from analyzing those subgroups further due to a limited subgroup sample size, such subgroups may be discovered in future studies of larger patient cohorts via the proposed pipeline and may generate novel hypotheses of clinical relevance.
Following these results, we foresee potential application of hierarchical clustering algorithms for medical image analysis in four main areas: research, clinical, technical and commercial applications.In research, such approaches could help better understand diseases by providing a means to derive novel (anatomical, shape) biomarkers and detect yet-to-bediscovered disease patterns.This, in turn, could ultimately assist clinicians in decision-making and risk stratification, particularly in complex or rare diseases.Large, cloud-based image databasesin combination with immediate online clustering following image acquisitioncould allow comparison of a newly scanned patient to individuals with the same clinical history or disease in order to detect "outliers" or similarities [49].On the technical side, hierarchical clustering could be used for shape-retrieval systems and found clusters could be used to compute subtemplates or subatlases (i.e.representatives of a subgroup), which may improve atlasbased image segmentation of highly varying anatomy [7].Following the overall good classification results for our unsupervised pipeline, we further assume that trained, supervised approaches would perform even better in case classification of shapes is desired.Finally, regarding commercial applications, subgroup-based subtemplate anatomical models could allow for more cost-effective "fewsizes-fit-all" rather than patient-specific approaches for device design and development, which may be particularly appealing in complex structural disease.
For such broad application of hierarchical clustering algorithms to become reality, large medical image databases are required, which leads to one of the limitations of our study the relatively small sample size.Nevertheless, we believe this study constitutes a first step showcasing that clinically meaningful clustering of medical image data can be achieved once clustering parameters are set correctly.Future studies should focus on including more patients, different types of anatomy and on automating the pipeline further.Here, we aimed to automate data processing as much as possible.Yet, steps such as isolating the structure of interest after segmentation (here the aortic arch) were performed using manual cutting tools.This is another limitation, which may be addressed by providing segmentation atlases specifically adapted to the structure of interest.Further, some automatic segmentation results had to be edited manually due to insufficient input image quality or artefacts.With sophisticated automatic segmentation algorithms currently being on the rise [50], we foresee drastic improvement in this area in the near future.In this regard, this study reports one of the largest datasets of automatically segmented pathological structures affected by congenital heart disease and the reported Dice Similarity Coefficients could be used as reference values for further algorithm development.

V. CONCLUSION
In this study, we present and evaluate a medical image processing pipeline combining automatic segmentation, statistical shape modelling and unsupervised hierarchical clustering of 3D anatomical models in a cohort of healthy and pathological aortic arches post-surgical repair.By applying a specific set of distance metric and linkage function, clustering classification results yielded clinically meaningful shape clusters and subgroupsautomatically derived without any prior information.To the best of our knowledge, this is the first study evaluating 3D hierarchical shape clustering performance on realistic, clinically acquired cardiovascular image data.The reported clustering classification performance and automatic segmentation scores could be used as benchmark values for future algorithm implementation and improvement.
Apart from yielding a clinically meaningful division of the data according to known clinical diagnosis, our analysis revealed novel subgroups within the known clusters, which offers the potential of providing additional information and insight into yet-to-be unveiled similarities and patterns within a disease, once an initial diagnosis has been made.Therefore, our analytical platform can be an adjunct in moving away from a case-by-case image-based diagnosis towards assessing a patient in the context of a patient population as an integral component of current Precision Medicine or "Precision Imaging" [3] strategies.Such a clinical decision support system may pave the way for moving from mere data towards information and knowledge, which could ultimately impact on improved diagnosis, risk stratification and treatment strategies.
To provide a summary of the above measures, the macroaveraged F-scoreM (weighted harmonic mean of Recall and Precision) and Matthew's Correlation Coefficient (MCCM) (geometric mean of Informedness and Markedness[41]) were computed as follows:

Fig. 1 :
Fig. 1: Boxplots of size (a-b) and shape (d-f) morphometric parameters describing differences in aortic arch shape between the three patient groups CTRL, COA and ASO.The thick line within the box represents the median value, box height represents the interquartile range and whiskers extend to the maximum and minimum value, respectively.* denotes statistical significance at level p<.05; ** at level p<.01.

Fig. 3 :
Fig. 3: Clustering classification performance measures for full input dataset (DFull,CV, right, green) and reduced PCA shape loading dataset (DPCA,CV, left, blue), showing mean and 95CIs of macroaveraged F-scoreM (a) and MCCM (b) for the respective best distance/linkage combinations over 10 CV runs.Overall, the DFull,CV input dataset performed better than the reduced PCA dataset and Spearman/Weighted, Correlation/Weighted and Euclidean/Ward distance/linkage combinations were found to yield good and reliable clustering classification performance.

Fig. 4 :
Fig. 4: Means and 95CIs over 10 CV runs of all computed clustering classification performance measures for the distance/linkage combinationCorrelation/Weighted, applied to the full 3D shape dataset DFull,CV.In particular, high Specificity was achieved.

Fig. 6 :
Fig.6: Scatterplot of 3D shape space described by subject-specific PCA shape vector entries.Individual patients are symbol-coded according to subgroups obtained from cutting the dendrogram at lower levels of the hierarchy (Fig.5b), revealing that patients with similar 3D aortic arch shape are grouped together by both the PCA analysis and the clustering algorithm.The two COA subjects misclassified as CTRL (subjects 27 and 31) are marked in orange.
Detecting Clinically Meaningful Shape Clusters in Medical Image Data: Metrics Analysis for Hierarchical Clustering applied to Healthy and Pathological Aortic Arches Jan L Bruse*, Maria A Zuluaga, Abbas Khushnood, Kristin McLeod, Hopewell N Ntsinjana, Tain-Yen Hsia, Maxime Sermesant, Xavier Pennec, Andrew M Taylor, and Silvia Schievano; for the Modelling of Congenital Hearts Alliance (MOCHA) Collaborative Group