Beyond “Sex Prediction”: Estimating and Interpreting Multivariate Sex Differences and Similarities in the Brain

Previous studies have shown that machine-learning (ML) algorithms can "predict" sex based on brain anatomical/functional features. The high classification accuracy achieved by ML algorithms is often interpreted as revealing large differences between the brains of males and females and as confirming the existence of "male/female brains". However, classification and estimation are quite different concepts, and using classification metrics as surrogate estimates of between-group differences results in major statistical and interpretative distortions. The present study illustrates these distortions and provides a novel and detailed assessment of multivariate sex differences in gray matter volume (GMVOL) that does not rely on classification metrics. Moreover, modeling and clustering techniques and analyses of similarities (ANOSIM) were used to identify the brain areas that contribute the most to these multivariate differences and to empirically assess whether they assemble into two sex-typical profiles. Results revealed that multivariate sex differences in GMVOL: 1) are "large" if not adjusted for total intracranial volume (TIV) variation, but "small" when controlling for this variable; 2) differ in size between individuals and also depend on the ML algorithm used for their calculation; 3) do not stem from two sex-typical profiles, so describing them in terms of "male/female brains" is misleading.


Introduction
Machine-learning (ML) offers new and informatively rich methods for the multivariate exploration of the increasingly large and complex datasets from human neuroimaging studies 1. In the sex differences field, most ML applications have focused on classification tasks. These studies have effectively shown that several predictive algorithms can exploit anatomical and/or functional features to ascertain whether a brain belongs to a male or to a female with ≈80-90% accuracy [2][3][4][5][6][7][8][9][10][11][12][13][14]. Because this kind of "prediction" only informs us about what is already known (to which sex category each particular sampled brain belongs), the interest and relevance of these findings reside more in the brains' ability to be classified than in the classifications obtained. Thus, classification is rarely the true goal of these studies, and the metrics obtained (i.e., the percentage of correctly classified cases, %CC) are employed to indirectly estimate the degree of statistical distinctiveness and/or separateness of the brains of females and males at the multivariate level.
However, replacing a direct and quantitative evaluation of multivariate sex differences with a nominal classification has many statistical and conceptual shortcomings. Most of these weaknesses arise from the fact that classification requires an artificial dichotomization of the continuous posterior classification probabilities provided by ML algorithms. Thus, as described below, dichotomization results in a "loss of information about individual differences as well as havoc with regard to estimation and interpretation of relationships among variables" (15, pages 19-20), and it is rarely justified from either a conceptual or statistical perspective [15][16][17]. Consequently, "classification through forced up-front dichotomization in an attempt to simplify the problem results in arbitrariness and major information loss", and it is always "inferior to probability modeling for driving the development of a predictive instrument or for estimation or for hypothesis testing" (17, page 4). Moreover, dichotomous classification predefines and statistically treats males and females as two mutually exclusive categories with zero within-group variance, hence precluding their empirical description, impeding any direct estimation of the size of their differences, and imposing an unwarranted binary interpretative framework.
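The information loss produced by this dichotomization can be illustrated with a minimal numpy sketch (the probability values below are hypothetical, not taken from the study):

```python
import numpy as np

# Hypothetical posterior probabilities of "male" for six scanned brains.
pcam = np.array([0.51, 0.99, 0.49, 0.01, 0.55, 0.97])

# Forced dichotomization at the conventional 0.5 cutoff.
labels = (pcam >= 0.5).astype(int)

# P = 0.51 and P = 0.99 receive the same label, although the algorithm
# is nearly certain about one case and almost agnostic about the other.
print(labels)  # [1 1 0 0 1 1]

# The continuous scores, by contrast, retain the within-group variation
# that the binary labels discard entirely.
print(pcam.std().round(3))
```

Once the labels are formed, the distinction between a borderline case and a clear-cut case is unrecoverable, which is exactly the "havoc" the quoted sources describe.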
With these limitations in mind, the present study was designed to assess multivariate sex differences in gray matter volume (GMVOL) without resorting to classification metrics. Because the sizes of univariate volumetric sex differences 18,19 and sex-classification accuracy 5,20 are strongly influenced by total intracranial volume (TIV), this assessment was conducted with raw estimates of GMVOL as well as after adjusting these estimates with the well-validated 5,18 power-corrected proportions (PCP) method 21. More specifically, the raw and PCP-adjusted GMVOL estimates of the 116 brain areas defined by the AAL atlas 22 were introduced as features of five different classification algorithms, which were trained and tested in two independent, sex-balanced samples (n = 288 and n = 150 per group, respectively) in order to obtain valid estimates of the probability of being classified as male (PCAM). The PCAM was used in subsequent analyses as a continuous dependent variable in which females and males were treated as empirically-mapped distributions (instead of as nominal categories), and their distributional divergences were thoroughly explored with robust statistical and graphical methods [23][24][25][26] to properly quantify the size of the multivariate sex differences in GMVOL. In a second step, the brain areas that contributed the most to the PCAM scores yielded by each algorithm in each dataset were identified by means of boosted beta regressions 27. These and other complementary analyses made it possible to assess whether or not different algorithms provide similar outcomes and identify the same brain architectures as typical of females and males.
In this regard, a common interpretation of classification studies is that, because distinct ML algorithms are able to very accurately "predict" sex, all these algorithms must be identifying a reproducible constellation of brain differences that assemble into two clearly distinguishable brain types ("male/female brains"; e.g., 2,3,6,8,10,12,13). However, it could also well be that different algorithms rely on different brain features to provide their classification labels and outcomes; if so, these proposed brain types would become method-specific and, therefore, largely devoid of scientific utility 28.

Results And Discussion
Classification accuracy: Results and limitations. Figure 1A displays the classification accuracy (%CC) achieved by each algorithm in the testing subsamples (n = 150 per group) of the raw and PCP datasets. As previously observed 5,20, the proportion of subjects correctly classified was much higher in the raw dataset (%CC average = 88.06) than in the PCP dataset (%CC average = 61.86; Supplementary Tables 2A and 2B). Figure 1 also shows that the %CC varied slightly between algorithms. In the raw dataset, none of these differences achieved statistical significance (Supplementary Table 2C). In the PCP dataset, LDA and LR exhibited lower accuracy than RF and SVM, but the statistical significance of these differences was lost after correcting for multiple comparisons (p adj = 0.06 in all cases; Supplementary Table 2D). Therefore, it can be concluded that prediction accuracy clearly differs between datasets; however, within each dataset, all the algorithms seem to yield very similar %CC values.
As mentioned above, the %CC is regularly used as an estimate of the overall degree of multivariate distinctiveness or separateness of the brains of females and males. However, the calculation of this index requires dichotomizing the continuous output of ML algorithms, and it is well known that dichotomization leads to major statistical and interpretative distortions [15][16][17].
More specifically, dichotomization results in an overestimation of effect sizes 15, a loss of one-fifth to two-thirds of the outcome's variance 29, a reduction in statistical power equivalent to discarding between one-third and two-thirds of the sample 29, and a higher risk of finding false positive results 16. However, the main problem of dichotomization is that, in the absence of any knowledge about the outcome's distribution, there is no a priori reason to suppose that there is an underlying dichotomy, and, if there is one, there is no reason this dichotomy should correspond to a half-split of the outcome 15,16,29. Therefore, although the use of this cutoff might be justified when the main goal is to obtain a binary classification, its use when estimating between-group differences is essentially arbitrary, and it may abet equally arbitrary conclusions [15][16][17]29.
These dichotomization-induced distortions can be illustrated by comparing the results in Fig. 1A to those in Fig. 1B, which, as in other previous studies 14,30, divides the originally continuous outcome into three, instead of two, equal segments. This simple modification reduces the %CC (Supplementary Tables 3A and 3B), thus illustrating how dichotomization leads to larger between-group differences through a suppression of the within-group variation. These %CC changes were non-uniform, given that dichotomization and trichotomization yielded similar results in some cases (e.g., LDA in the raw dataset, %CC difference = ±4.63%; Supplementary Table 3C) but very disparate results in others (e.g., the RF algorithm in the PCP dataset, %CC difference = ±52.67%; Supplementary Table 3D). These discrepancies illustrate that, because dichotomization singularly affects individuals close to but on opposite sides of the cutoff (treating them as totally different when they are quite similar), dichotomization effects on the %CC depend on the underlying distribution of scores and, consequently, are variable and fundamentally unpredictable. This, together with the dichotomization-induced reduction in statistical power, can mask the effects of moderator variables (e.g., the algorithm used). Indeed, whereas the %CC obtained after dichotomization did not significantly differ between algorithms in either dataset, the %CC obtained after trichotomization differed in both datasets (Supplementary Tables 3E and 3F).
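The cutoff dependence described above can be sketched as follows, using simulated, weakly separating PCAM scores as a stand-in for a PCP-like dataset (the distribution parameters are illustrative assumptions, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical PCAM scores: males centred slightly above 0.5, females
# slightly below, mimicking a weakly separating (PCP-like) dataset.
males = np.clip(rng.normal(0.58, 0.12, 150), 0, 1)
females = np.clip(rng.normal(0.42, 0.12, 150), 0, 1)

def pct_correct(scores, is_male, low, high):
    """%CC under a cutoff scheme: males are correct above `high`, females
    correct below `low`; scores in between count as unclassified."""
    if is_male:
        return 100 * np.mean(scores > high)
    return 100 * np.mean(scores < low)

# Dichotomization: a single half-split cutoff.
cc_dicho = (pct_correct(males, True, 0.5, 0.5) +
            pct_correct(females, False, 0.5, 0.5)) / 2

# Trichotomization: the middle third [1/3, 2/3] is "sex-undefined".
cc_tri = (pct_correct(males, True, 1/3, 2/3) +
          pct_correct(females, False, 1/3, 2/3)) / 2

# With weakly separated groups, many cases sit near 0.5, so the
# pre-imposed cutoff scheme drastically changes the apparent accuracy.
print(round(cc_dicho, 1), round(cc_tri, 1))
```

Because most scores pile up near the half-split, moving from two to three segments reassigns a large share of cases, which is why the %CC drop was so much larger for the peaked PCP distributions than for the strongly separated raw ones.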
Although trichotomization has some advantages over dichotomization 16,17,31, the results described above should not be interpreted as advocating for it. What these results show is that any discretization of a continuous outcome introduces a certain degree of arbitrariness in the results obtained, and also that, by pre-imposing a number of categories that may or may not exist in the data, discretization leads to preconditioned and discretionary inferences 15,16,29. Thus, for example, by looking at the PCP-RF outcomes in Fig. 1A, it would be possible to conclude that 66% of individuals have "male/female brains", whereas looking at the PCP-RF results in Fig. 1B (which were obtained from the same subjects, algorithm, and dataset), the conclusion would be that 84.33% of individuals have "sex-undefined brains". These inferences are radically different, and yet, in the absence of additional information, both of them are equally arbitrary.
Assessing multivariate sex differences in GMVOL. Another major limitation of using the %CC as an estimate of multivariate differences is that it is not descriptive. In fact, the %CC is a "very insensitive and statistically inefficient measure" (17, page 258) in which all the cases above/below the predefined threshold are rated the same, regardless of their actual values in the outcome variable (e.g., P = 0.51 is equalized to P = 0.99, and P = 0.01 to P = 0.49). Consequently, %CC scores preclude describing the outcome's distribution, summarizing the individuals' scores, or estimating the actual size of between-group differences. These limitations are overcome when the continuous output of ML algorithms is not discretized and females and males are treated as empirically-established distributions that spread at different probability levels within particular ranges of the outcome's continuous space 14,[32][33][34]. Therefore, we first described how males and females mapped onto the continuous output provided by ML algorithms (the PCAM continuum), and then we assessed their distributional divergences to obtain proper estimates of the size of the male-female multivariate differences in GMVOL.

Figure 2 depicts the kernel density estimates (KDE) of the PCAM distributions yielded by each algorithm in each dataset. Based on these graphical representations, it is apparent that PCAM distributions differed between males and females, but also between datasets and between algorithms within each dataset. More specifically, in the raw dataset, both males and females exhibited non-normal and very skewed distributions, with most of the females accumulating near the lower bound of the PCAM continuum and most of the males accumulating near the upper bound. These distributions were also very long-tailed, with a few scattered individuals spreading beyond their respective sex clusters and virtually occupying the entire PCAM range.
These distributional characteristics varied depending on the algorithm (e.g., LDA vs. RF; Supplementary Tables 4A-4B, Supplementary Fig. 1), which explains why their %CC estimates varied differently in response to the dichotomizing or trichotomizing cutoffs (Fig. 1). On the other hand, in the PCP dataset, all the PCAM distributions were non-skewed, but they also showed major differences between them (Supplementary Tables 4A-4C, Supplementary Fig. 1). Thus, the PCAM distributions obtained with the LDA, LR, and MARS algorithms were rather flat, with females and males spreading at similar rates across almost the entire PCAM range. In contrast, the RF- and SVM-associated distributions were more clearly peaked and extended within a shorter span around the center of the PCAM range. Again, these distributional singularities explain why %CC estimates varied disparately in response to different pre-imposed cutoffs (Fig. 1).
Based on the PCAM distributions, overall estimates of the multivariate sex differences in GMVOL were obtained. These measurements indicated that, despite the previously mentioned between-algorithm variations, sex differences were invariably "large" in the raw dataset. Thus, the overlap between the males' and females' distributions was "small" (≈12-18%; Supplementary Table 4D), and the chance that a randomly chosen male would have a PCAM score higher than that of a randomly chosen female (PS M) exceeded 90% in all cases (Supplementary Table 4E). Conversely, in the PCP dataset, differences were much smaller, with high levels of overlap (range: 55.7-71.3%) and lower PS M scores (range: 0.64-0.7; Supplementary Tables 4D and 4E). Additional estimates of the multivariate sex differences in GMVOL were obtained by comparing location and dispersion measures. No differences in variances or inter-quantile ranges were observed (Supplementary Tables 4F and 4G), suggesting that, when considered on the PCAM continuum, the variability of males and females does not significantly differ. Conversely, median comparisons confirmed "large"/"small" sex differences in the raw and PCP datasets, respectively (see below).
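Both overall estimates, distributional overlap and the probability of superiority (PS M), can be computed directly from the continuous scores. A sketch with simulated, strongly separating scores (the parameters are illustrative assumptions, and the histogram-based overlap is a simple stand-in for a density-based estimate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical PCAM scores for a strongly separating (raw-like) scenario.
males = np.clip(rng.normal(0.85, 0.10, 300), 0, 1)
females = np.clip(rng.normal(0.15, 0.10, 300), 0, 1)

def prob_superiority(a, b):
    """P(random a > random b), counting ties as 1/2 (common-language effect size)."""
    diffs = a[:, None] - b[None, :]
    return np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

def overlap_coefficient(a, b, bins=50):
    """Nonparametric overlap: summed minimum of the two normalized histograms."""
    edges = np.linspace(0, 1, bins + 1)
    pa, _ = np.histogram(a, bins=edges, density=True)
    pb, _ = np.histogram(b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return np.sum(np.minimum(pa, pb)) * width

ps_m = prob_superiority(males, females)
ovl = overlap_coefficient(males, females)

# Well-separated distributions: PS_M near 1, overlap near 0.
print(round(ps_m, 3), round(ovl, 3))
```

Unlike the %CC, both indices are computed from the full distributions, so they respond smoothly to how far apart the groups actually are rather than to which side of a cutoff each case falls on.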
However, all these measures provide a single effect size estimate that may not be very informative, or could even be misleading, with regard to the possibly complex differences between two distributions 23,35,36. To fully represent and compare distributions, robust statistical and informatively rich graphical methods such as the cumulative distribution function (CDF) 25 and the shift function 23,24 are required 23,25,[36][37][38][39]. Consequently, we used these methods to provide two complementary perspectives 23 on the multivariate sex differences in GMVOL.
How do males and females compare to each other? Figure 3 displays the CDFs for the PCAM scores yielded by each algorithm in each dataset, making it possible to compare males and females in three different ways: 1) by directly contrasting the proportion of cases in each group with PCAM scores equal to or lower than any possible cutoff; 2) by estimating how many subjects in one group have PCAM values equal to or lower than a given proportion of cases in the other group 25,35; and 3) by comparing the PCAM values at the deciles of the females' and males' distributions 23,24.
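The first two of these comparisons reduce to evaluations of the empirical CDF; a minimal sketch with simulated scores (illustrative parameters, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative PCAM scores for a strongly separating (raw-like) scenario.
males = np.clip(rng.normal(0.85, 0.10, 300), 0, 1)
females = np.clip(rng.normal(0.15, 0.10, 300), 0, 1)

def ecdf(sample, x):
    """Empirical CDF: proportion of the sample with values <= x."""
    return np.mean(sample <= x)

# Comparison 1: proportion of each group at or below an arbitrary cutoff.
f_below, m_below = ecdf(females, 0.5), ecdf(males, 0.5)

# Comparison 2: how many females sit at or below the males' first decile (D1)?
d1_males = np.quantile(males, 0.10)
f_below_d1m = ecdf(females, d1_males)

print(f_below, m_below, f_below_d1m)
```

With well-separated groups, nearly all females fall at or below even the lowest-scoring males' decile, which is the kind of statement the CDF plots in Fig. 3 make directly readable.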
All these comparisons confirmed that, when PCAM scores are obtained from multivariate composites of raw GMVOL, males and females are quite different. For instance, in the raw dataset, the 10% of males with the lowest scores (D1) had PCAM values that were higher than or equal to those observed in 80% of the females. However, Fig. 3 also shows that the size of these sex differences varied across individuals. Thus, taking the LDA outcomes as an example, sex differences in PCAM scores were already "large" at D1 (30.6% of the maximum possible; POMP), but they were more than twice that size at the medians (D5, POMP = 98.1) and tended to decrease thereafter (D9, POMP = 40.72). These inter-decile variations were statistically significant (Supplementary Tables 5A and 5C) and resulted in clearly non-monotonic shift functions (Supplementary Fig. 2), suggesting that differences at central locations might lead to inappropriate inferences about the differences observed at more distal locations of the same distribution. On the other hand, the estimated size of sex differences also varied between algorithms (LDA>LR=MARS=SVM>RF). Thus, for example, the RF-estimated difference at D5 was 66.5 POMP, which is 31.6% lower than the estimate provided by the LDA algorithm. This and other similar between-algorithm variations were statistically significant (Supplementary Table 5E).
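The decile-by-decile comparison underlying the shift function can be sketched as follows (simulated skewed distributions; POMP here is the decile difference expressed as a percentage of the maximum possible range, which on the bounded [0, 1] PCAM scale is simply the difference times 100):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative skewed, long-tailed PCAM distributions (raw-like dataset).
males = rng.beta(5, 1.5, 300)
females = rng.beta(1.5, 5, 300)

# Shift function: male-female differences at each decile (D1..D9).
deciles = np.arange(0.1, 1.0, 0.1)
shift = np.quantile(males, deciles) - np.quantile(females, deciles)
pomp = 100 * shift
print(np.round(pomp, 1))

# When the shift function is non-monotonic, the median difference (D5)
# misrepresents what happens in the tails (D1, D9).
```

Reading the whole decile profile, rather than a single central effect size, is what reveals that the "large" raw-dataset differences are not uniform across individuals.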
Conversely, when the effects of TIV variation are ruled out, males and females are much more similar to each other. Thus, in the PCP dataset, the 10% of males with the lowest scores (D1) had PCAM values that were higher than or equal to those of just 20-30% of the females. Moreover, and also in contrast to what had been observed in the raw dataset, sex differences were approximately constant across deciles (≈10-20 POMP) and resulted in almost flat shift functions (Supplementary Tables 5B and 5D; Supplementary Fig. 2). In addition, distinct algorithms provided different estimates of the size of these sex differences. The relative magnitudes of these variations were similar to those observed in the raw dataset, but, in this case, they did not reach statistical significance (Supplementary Table 5F).
What is the typical difference between any given female and any given male?
When the interest is not so much to describe and compare males and females, but rather to estimate the size of the typical difference between any given male and any given female, the distribution of all their pairwise differences can be calculated and directly analyzed 22,26. Thus, Fig. 4A depicts the KDE of all the pairwise differences between males and females for each algorithm and dataset, whereas panel B depicts their corresponding CDFs. In this case, the CDFs make it possible to: 1) estimate the empirical probability of finding a pairwise male-female difference whose size is equal to or lower than a given reference value; 2) estimate the size of the pairwise differences for any given proportion of cases; and 3) compare the estimates of these pairwise differences provided by different algorithms in each dataset or by the same algorithm in the raw and PCP datasets.
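This pairwise-difference distribution is straightforward to compute; a sketch with simulated, strongly separating PCAM scores (illustrative parameters, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative raw-like PCAM scores for 150 males and 150 females.
males = np.clip(rng.normal(0.85, 0.10, 150), 0, 1)
females = np.clip(rng.normal(0.15, 0.10, 150), 0, 1)

# All pairwise male-female differences (150 x 150 = 22,500 pairs).
pairwise = (males[:, None] - females[None, :]).ravel()

median_diff = np.median(pairwise)    # typical male-female difference
favor_males = np.mean(pairwise > 0)  # proportion of pairs favoring males
d1 = np.quantile(pairwise, 0.10)     # D1 > 0 means >90% of pairs favor males

print(round(median_diff, 2), round(favor_males, 3), round(d1, 2))
```

The median of this distribution is the "typical difference" referred to above, and its first decile (D1) directly answers whether more than 90% of pairs favor one group.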
As Fig. 4 shows, in the raw dataset, pairwise male-female differences extended over a very wide range, but they were also very asymmetrically distributed and favored males in more than 90% of the cases (i.e., D1 values > 0 for all algorithms; Supplementary Table 6A). The size of these differences depended on the algorithm used to calculate the PCAM scores (medians' range = 0.58-0.93; Supplementary Table 6B), although they were generally "large" as compared to the possible maximum (average POMP difference = 67.2). Consequently, the multivariate estimates of raw GMVOL in a randomly picked male-female pair are expected to clearly differ, leading to PCAM scores substantially (POMP difference > 30%) larger in males than in females in around 80-90% of the cases.
By contrast, when the influence of TIV variation is statistically controlled, pairwise male-female differences show an algorithm-dependent range, but they are always quasi-symmetrically distributed around their median values (Supplementary Table 6C).
These median values differed between algorithms (range = 0.09-0.16; Supplementary Table 6D), although all of them indicated that pairwise male-female differences were "small" as compared to the possible maximum (average POMP difference = 12.93) and significantly smaller than the differences observed in the raw dataset (Supplementary Table 6E). Accordingly, the multivariate estimates of TIV-adjusted GMVOL of randomly picked male-female pairs are expected not to differ much, and the females' PCAM scores should be higher than or equal to the males' scores in 30-40% of the cases.
Interpreting multivariate sex differences in GMVOL. If simplified to the maximum, the results described in the previous section indicate that the multivariate sex differences in raw measures of GMVOL are "large", but also that these differences become "small" when the effects of TIV variation are statistically controlled. This conclusion is similar to the one that could be obtained after examining %CC scores. However, this parallelism should not lead to the interpretation that %CC and PCAM-based measures are equivalent. In fact, within each dataset, %CC scores were uncorrelated or even inversely correlated with the estimated size of the multivariate sex differences in GMVOL (see Supplementary Figs. 3 and 5, respectively). Moreover: 1) whereas the %CC primarily relates to the algorithm's performance, PCAM-based measures describe and compare individuals and groups, making it possible to quantify between- and within-sex variation; 2) whereas the %CC is a "very insensitive and statistically inefficient measure" (17, page 258) that varied only between datasets, PCAM-based measures are sensitive enough to reveal that multivariate sex differences in GMVOL also differ between algorithms and between subjects; 3) whereas %CC scores are obtained from pre-imposed and, to some extent, arbitrary criteria, PCAM-based measures are fully empirical and conceptually unrestricted.
On this last point, previous sex classification studies 4,6,8,10,12,14,40 have reinvigorated sex-binary views according to which human brains can be categorized into two "types", one typical of males and the other typical of females. However, finding two, and only two, brain types is not so much an empirical result as it is a pre-imposed requirement 17 of these sex classification studies. Moreover, the fact that different algorithms are able to correctly identify sex with 80-90% accuracy does not ensure that these algorithms are providing the same outcomes or identifying the same underlying reality 28,41,42. Thus, in the present study, the %CC ranged between 86.3-90% and 58.7-65% in the raw and PCP datasets, respectively, but the actual percentages of subjects correctly classified by all five algorithms were 76.3% and 38.0%. On the other hand, interpreting any %CC score in terms of "brain types" requires a great conceptual leap because %CC scores do not provide any information about the brains of females and males or about which brain features could be considered the hallmarks of these alleged "male/female" brain types 41,42. This limitation is also shared by PCAM-based measures, which quantify how different the brains of males and females are, although they do not directly provide information about where these differences take place or how they group together 41,43. Therefore, in the present study, we used boosted beta regression procedures to identify and quantify the relative importance of the brain features that contribute the most to the PCAM scores yielded by each algorithm in each dataset, and we used hierarchical clustering procedures and ANOSIM analyses to assess how these features assemble.
As could be expected 20, and as Figs. 5 and 6 illustrate, the number, identity, and relative importance of the PCAM predictors identified in the raw and PCP datasets were clearly divergent. Specifically, up to 64 predictors were included across the raw and PCP datasets, but only 22 of them were present in both. Accordingly, poor levels of mutual nominal agreement were observed (D = 0.330 [0.157, 0.502]). In addition, the relative importance of these 64 predictors varied greatly depending on the dataset, resulting in agreement levels that were virtually zero (Supplementary Table 7F). When this comparison was performed only with the 22 predictors included in both datasets, higher but still "poor" levels of agreement were observed (e.g., ICC < 0.5; see other agreement metrics in Supplementary Table 7G). Of note, in the raw dataset, the predictors that consistently showed higher importance were brain areas in which TIV explains a large amount of variance (see r² values in Fig. 5 and Supplementary Table 7H). This relationship was corroborated by statistically significant correlations (rho(30) = 0.592 and rho(30) = 0.571, p < 0.001) between these r² values and two estimates of the predictors' relative importance across algorithms (ranks' averages and a multiplicative "rank of ranks"; Fig. 5 and Supplementary Table 7H). Conversely, in the PCP dataset, TIV variation did not account for any variance in GMVOL (Fig. 6 and Supplementary Table 7H), and the predictors' relative importance was unrelated to these non-statistically-significant r² values (rho(52) = -0.119 and rho(52) = -0.123, p > 0.05; Fig. 6 and Supplementary Table 7I).
Taken together, these results reveal that the brain areas that were most relevant in predicting the PCAM scores in the raw dataset were not the same as in the PCP dataset, and, when they were, they did not have the same relative importance, thus confirming that raw and TIV-adjusted measures of GMVOL provide information about two distinct constructs.
Similarly, major differences in the predictors' number, identity, and relative importance were also observed when comparing different algorithms within each dataset. Thus, in the raw dataset (Fig. 5), a total of 32 brain areas were identified as relevant predictors of PCAM scores. Different models included a different number of predictors (range: 9-19), and only three of them were included in all of them, thus resulting in low levels of absolute nominal agreement (K HR = 0.505; Supplementary Table 7J). In a similar vein, the predictors' relative importance varied considerably between algorithms. In fact, the values of the predictors' coefficients in different models showed agreement/reliability levels that were virtually zero (Supplementary Table 7L). Agreement increased but remained low when the predictors' relative importance was considered at the ordinal level. More specifically, the reliability of these ordinal estimates was larger than 0, but "poor" when based on the results of a single algorithm (ICC single-rating = 0.289 [0.141, 0.478]), and only the average of these ordinal estimates yielded levels of agreement that can be considered "moderate" 44.

In the PCP dataset (Fig. 6), a total of 54 predictors were identified as relevant. As in the raw dataset, different algorithms included a different number of predictors (range: 9-41), and only four of them were included in all the regression models (Supplementary Table 7K). As also observed in the raw dataset, the agreement between the values of the predictors' coefficients in different models did not statistically differ from zero (Supplementary Table 7M). Agreement between predictors increased, but remained low, when their relative importance was considered at the ordinal level.
Thus, again paralleling the results observed in the raw dataset, the ordinal estimates obtained with any single algorithm showed "poor" reliability (ICC single-rating = 0.312 [0.187, 0.457]; Supplementary Table 7M), and only the average of these ordinal estimates yielded agreement levels that can be considered "moderate" 44. These between-algorithm discrepancies explain why different algorithms yield differently-shaped PCAM distributions (Fig. 2) that result in multivariate sex differences that vary in size (Figs. 3 and 4). Thus, because they differ in their statistical assumptions and operations, distinct algorithms rely on distinct brain features (Figs. 5 and 6) and assign different PCAM scores to the same subjects (Fig. 7A, Supplementary Tables 8A-8B; for dyadic between-algorithm comparisons, see Supplementary Tables 8C-8F; for a case-by-case examination, see Interactive Fig. 1). Because these PCAM variations are highly idiosyncratic (Fig. 7B), each individual occupies a different relative position in each PCAM distribution (Figs. 7C and 7D), and these distributions end up spreading dissimilarly within the PCAM range (Fig. 2). In the raw dataset, between-algorithm discrepancies are partially concealed by male-female differences in TIV that push their respective scores towards opposite sides of the PCAM range and boost sex differences and, hence, between-algorithm correlations (Supplementary Fig. 6). However, when TIV effects are statistically controlled (as in the PCP dataset, or by partialling out the TIV effects), between-algorithm discrepancies in PCAM scores become evident (Fig. 7D; Supplementary Fig. 6).
Therefore, it is apparent that, despite working with identical data from the same individuals, the different algorithms tested in the present study do not provide directly exchangeable outcomes or identify a single, coherent, and reproducible subset of brain features as the source of the male-female multivariate differences in GMVOL (in neither the raw dataset nor the PCP dataset). These observations are clearly consistent with the lack of agreement observed between the few studies that have tried to identify the neuroanatomical features that could best distinguish the brains of females and males 4,10,12,14 (for a comparative review, see 41), and with evidence suggesting that ML algorithms rely on different brain features when classifying different subpopulations of females and males 44. Together, these sources of empirical evidence directly challenge the binary sex views of human brains based on a global interpretation of %CC scores. As mentioned above, these views assume that, because distinct ML algorithms are able to correctly "predict" sex from neuroanatomical features in 80-90% of the cases, all these algorithms must be identifying two distinct brain types in the human species, one typical of males and the other typical of females ("male/female brains") 2,3,6,8,10,12,13. However, these universal "brain types" do not seem to really exist, given that different algorithms identify distinct brain features as the landmarks of "male/female brains" in different samples of females and males and when applied to the same subjects.
In fact, even when considering only the outcomes of a single ML algorithm in a single sample, these proposed "male/female" brain types do not seem to exist either, because the same class label, or even virtually identical PCAM scores, can be achieved by individuals exhibiting very different brain profiles (see examples in Figs. 8A and 9A). More specifically, when accumulated differences in raw GMVOL (Euclidean distances) at the brain areas identified as relevant PCAM predictors are considered, the differences between members of different sex categories are larger than those observed between members of the same sex category (ANOSIM R = 0.455 [0.363, 0.540], Supplementary Table 9A), and males and females tend to group into two separate clusters (Fig. 8B). These two clusters are homogeneous and robust, and they are only minimally perturbed when additional partitions are imposed (Fig. 8D). However, these clusters do not correspond to two "brain types" or sex-specific brain profiles. Thus, when the same individuals are partitioned based on the dissimilarity of their brain profiles (Spearman's distances), rather than on the orderless sum of their differences, subjects cluster in a sex-unrelated manner (Figs. 8C and 8E) because the brain-profile similarities observed between members of different sex categories are equivalent to those observed between the members of each single sex category (ANOSIM R = 0.049 [0.006, 0.080], Supplementary Table 9B). This pattern of results was corroborated for each algorithm, and also when using as predictors either the 116 brain areas of the AAL atlas or just the five areas most directly related to the PCAM scores provided by each algorithm (Supplementary Figs. 7-11). Similar results were also observed when TIV variation was statistically controlled (PCP dataset; Fig. 9 and Supplementary Tables 9A-9B).
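The contrast between magnitude-based (Euclidean) and profile-based (Spearman) dissimilarities, and its effect on ANOSIM R, can be sketched with simulated data in which two groups differ only in overall magnitude while profile shapes are idiosyncratic (a standard rank-based ANOSIM implementation; all parameters and the row-wise Spearman construction are illustrative assumptions, not the study's code):

```python
import numpy as np

def rank(a):
    """Simple 1..n ranks (continuous data, so ties are negligible)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def anosim_r(dist, groups):
    """ANOSIM R from a square distance matrix and binary group labels.
    R near 1: between-group distances dominate; R near 0: no group structure."""
    n = dist.shape[0]
    iu = np.triu_indices(n, k=1)
    ranks = rank(dist[iu])
    between = groups[iu[0]] != groups[iu[1]]
    return (ranks[between].mean() - ranks[~between].mean()) / (n * (n - 1) / 4)

rng = np.random.default_rng(5)
n, p = 40, 20
profiles = rng.normal(0, 1, (n, p))  # idiosyncratic profile shapes
offset = np.r_[np.full(n // 2, 5.0), np.zeros(n // 2)]  # group shift in magnitude
data = profiles + offset[:, None]
groups = np.r_[np.zeros(n // 2), np.ones(n // 2)]

# Euclidean distances (accumulated magnitude differences): groups separate.
eucl = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
r_eucl = anosim_r(eucl, groups)

# Rank-correlation (profile-shape) distances: the shared offset cancels,
# so the same individuals show no group structure at all.
ranked = np.apply_along_axis(rank, 1, data)
spear = 1 - np.corrcoef(ranked)  # 1 - Spearman's rho
r_spear = anosim_r(spear, groups)

print(round(r_eucl, 2), round(r_spear, 2))
```

Because rank correlations are invariant to a constant shift of a whole profile, the group offset that dominates the Euclidean distances vanishes in the profile-based distances: the same individuals can separate cleanly by magnitude while showing no sex-related structure in profile shape.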
Therefore, it can be concluded that: 1) as shown throughout the present study, the multivariate sex differences in raw and TIV-adjusted GMVOL are "large" and "small", respectively; 2) regardless of their size, these differences do not arise from divergences between a "typical male" and a "typical female" brain profile, but from divergences between multiple and idiosyncratic brain profiles that seem to be loosely related to sex categories. Accordingly, there are no "male/ female brains" (for similar conclusions, see 9,14,40,41,45 ), and although multivariate male/female differences in the brain can be summarized with a single and continuous score, the brains of females and males are not aligned along a "female-male" continuum either, at least not one that can be univocally translated to their neuroanatomical features (for a similar conclusion and a more extensive discussion, see 42 ).

Conclusions

When the output of ML algorithms is not discretized, multivariate information about the brains of females and males can be condensed in a single continuum 14,[32][33][34] . The present study shows that, by assessing how females and males differentially occupy this empirically-defined unidimensional space, the size of their multivariate differences can be estimated on a standardized [0,1] scale or in ordinal/ distributional terms. Used in this way, the PCAM continuum (and other similar ones) offers a more informative, nuanced, and accurate way to investigate multivariate relationships between sex and brain features than the classification approach.
Our results also reveal that, at least at the neuroanatomical level, there are not two brain "types", and overall brain sex differences in GMVOL do not stem from a specific pattern of differences in a few "key" brain areas. Accordingly, the PCAM-based estimates of these multivariate group differences are probably better interpreted as a summary of heterogeneous patterns of differences between several subsets of males and females that diverge in distinct brain features. Therefore, like other multivariate effect size indexes 37 , PCAM-based measures might be more useful when summarizing a reduced, coherent, and theoretically-justified set of variables (whose within-profile differences can be interpreted) than when calculated from a large number of loosely related or arbitrarily chosen brain features. By implication, we conclude that the PCAM continuum (and other similar measures) provides a reduced metric space that is useful for comparing females and males and estimating their brain differences, but a single continuum is clearly insufficient to properly describe and adequately conceptualize the complex and highly idiosyncratic sex-associated effects in the brains of females and males (for a similar conclusion, see 41,43 ).
Finally, our results also show that, because they differ in their a priori assumptions and internal operations, different ML algorithms may produce different outcomes. Therefore, comparing or combining the results of several algorithms should lead to more reliable and valid conclusions than those extracted from just one. Moreover, the choice of algorithm(s) becomes a critical methodological decision that should be reported in detail and carefully considered when summarizing the results of different studies.

Participants
The present study was conducted using data from the 1,200 Subject Release of the Human Connectome Project (HCP), which includes structural Magnetic Resonance Imaging (MRI) data from 1,113 healthy young adult participants 47 . The HCP dataset contains an unequal number of females (n = 606) and males (n = 507) who differ in age (Mean females = 29.56, Mean males = 27.90, t 1111 = 7.63, p < 4.94 × 10⁻¹⁴). Therefore, we used a self-built algorithm to randomly select a sex-balanced sample of participants (438 females, 438 males) who did not differ in age. This sample was subsequently split into the so-called training and testing subsamples (see below).

Imaging and data preprocessing

MRI acquisition and image preprocessing
The MRI acquisition details for the HCP-sample can be found in the reference manual of the S1200 release of the HCP (https://www.humanconnectome.org/storage/app/media/documentation/s1200/HCP_S1200_Release_Reference_Manual.pdf).
Images were preprocessed with the CAT12 toolbox (http://www.neuro.uni-jena.de/cat/, version r1184) of the SPM12 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/, version 6906). CAT12 preprocessing was conducted following the standard default procedure suggested in the manual. Briefly, it includes the following steps: (1) segmentation of the images into gray matter, white matter, and cerebrospinal fluid; (2) registration to a standard template provided by the International Consortium for Brain Mapping (ICBM); (3) DARTEL normalization of the gray matter segments to the MNI template; (4) modulation of the normalized data via the "affine + non-linear" algorithm; and (5) a data quality check (in which no outliers or incorrectly aligned cases were detected). Images were not smoothed because we were only interested in the modulated images.
After applying this procedure, which does not include any correction for overall head size, voxels were mapped onto the 116 regions of the Automated Anatomical Labeling atlas (AAL, 22 ) by calculating the total gray matter volume of each volume of interest (VOI) for each participant via a MATLAB script (https://www0.cs.ucl.ac.uk/staff/g.ridgway/vbm/get_totals.m). TIV was estimated using the native-space tissue maps obtained in the segmentation step. More specifically, TIV was calculated as the sum of the GM, WM, and CSF total values, multiplied by the voxel size and divided by 1,000 to obtain a measurement in milliliters (ml). The estimates of GMVOL in these 116 regions were employed as the predictors for the machine-learning algorithms described below.
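For illustration, the TIV computation just described (sum of the three tissue totals, scaled by voxel volume and converted to milliliters) can be sketched as follows. The function name and the numbers are illustrative only; they are not taken from the study's MATLAB/CAT12 code:

```python
# Sketch of the TIV computation described above (assumption: tissue totals
# are available as native-space voxel counts per tissue class).
def compute_tiv_ml(gm_voxels, wm_voxels, csf_voxels, voxel_volume_mm3):
    """Return total intracranial volume in milliliters.

    TIV = (GM + WM + CSF voxel counts) * voxel volume (mm^3) / 1000.
    """
    total_mm3 = (gm_voxels + wm_voxels + csf_voxels) * voxel_volume_mm3
    return total_mm3 / 1000.0

# Example with 1.5 mm isotropic voxels (volume = 1.5**3 = 3.375 mm^3)
tiv = compute_tiv_ml(400_000, 350_000, 150_000, 1.5 ** 3)
```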
Training/ testing subsets and data standardization

Following current recommendations 48,49 , classification algorithms were fitted and tested in two separate groups of participants randomly extracted from the previously described main sample. The training subsample included 288 females and an equal number of males, thereby avoiding distortions due to between-class imbalance 50,51 . In both subsets, females and males were very similar in age (means' range = 28.4-28.6) and showed similar differences in TIV (Cohen's d = 1.8 in both cases; see further details in Supplementary Tables 1A and 1B).
Before being used as predictors, all volumetric variables were transformed into z-scores to avoid distortions due to their different ranges 48,52 . Standardization was initially performed in the training subset, and the exact same scaling parameters were subsequently used to standardize the testing subset.
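The key point of this scaling step (deriving the z-score parameters from the training subset only and reusing them, unchanged, on the testing subset) can be sketched as follows; the data and function names are illustrative:

```python
import numpy as np

# Illustrative sketch: derive mean/SD from the TRAINING data only, then
# apply those same parameters to the TESTING data (no refitting).
def fit_scaler(train):
    return train.mean(axis=0), train.std(axis=0, ddof=1)

def apply_scaler(data, mu, sd):
    return (data - mu) / sd

train = np.array([[100., 50.], [120., 55.], [110., 60.]])   # subjects x regions
test = np.array([[105., 52.]])
mu, sd = fit_scaler(train)
train_z = apply_scaler(train, mu, sd)
test_z = apply_scaler(test, mu, sd)    # scaled with training parameters only
```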
TIV adjustment: The raw and the PCP datasets.
Previous studies have shown that the estimates of univariate and multivariate sex differences are largely dependent on TIV variation, but also that not all the currently used methods are equally effective and valid for removing TIV-variation 5,18 . Therefore, in the present study, all analyses were conducted twice in the same subjects, without introducing any TIV adjustment ("raw" dataset) and after removing TIV variation with the well-validated power-corrected proportions (PCP) method 21 . The PCP method improves the traditional proportions approach by introducing an exponential correcting parameter in the denominator.
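A minimal sketch of the general idea behind such a power-corrected proportion: a correcting exponent b is estimated from a log-log regression of regional volume on TIV, and each volume is then divided by TIV raised to that exponent. This is an assumption-laden simplification written for illustration; see reference 21 for the exact published procedure:

```python
import numpy as np

# Hedged sketch of a power-corrected proportion (PCP) style adjustment:
# estimate the exponent b from a log-log regression of volume on TIV,
# then divide each volume by TIV**b (illustrative simulated data).
def pcp_adjust(vol, tiv):
    b, _log_a = np.polyfit(np.log(tiv), np.log(vol), 1)  # slope = exponent b
    return vol / tiv ** b

rng = np.random.default_rng(0)
tiv = rng.uniform(1200, 1700, 200)                       # TIV in ml
vol = 0.05 * tiv ** 0.8 * np.exp(rng.normal(0, 0.01, 200))
adj = pcp_adjust(vol, tiv)
# Raw volumes correlate strongly with TIV; adjusted volumes should not.
```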

Machine-learning algorithms
We report and compare the outcomes of five classification algorithms that differ in their assumptions (Supplementary Table 1C) and that provide an adequate representation of the principal "families" of machine-learning classifiers.
Testing several ML algorithms is important because the algorithms' internal operations are very much dependent on these assumptions 48,53 and may potentially lead to different outcomes. Moreover, it is necessary to compare the outcomes of different ML algorithms (see the PCAM variation subheading) in order to ensure that the results obtained describe method-independent findings and that the conclusions drawn are truly generalizable. The outcomes considered included common classification metrics (such as the percentage of correctly classified cases, %CC) but also novel and alternative ones that were obtained by using the posterior classification probabilities yielded by the ML algorithms (in this case, operationalized as the probability of being classified as male, PCAM) as a continuous variable (see details in the Assessing multivariate sex differences in GMVOL subheading).
All the ML classifiers were implemented and cross-validated (5 folds; 10 repeats) using the interface provided by the caret package for R. In alphabetical order, the predictive algorithms used in the present study were:
- Random Forest (RF): Implemented by the rf function of the randomForest package 57 , built by aggregating 500 classification trees, each of them using 10 randomly selected predictors.
- Support Vector Machine with a radial kernel (SVM): Implemented using the svmRadial function of the kernlab package for R 58 . The tune function (tenfold cross-validation) was used to automatically select the optimal values for the regularization and kernel-width hyper-parameters.
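The cross-validation scheme and two of the algorithms listed above can be sketched in Python, with scikit-learn standing in for caret/randomForest/kernlab. The RF hyper-parameters mirror those reported (500 trees, 10 candidate predictors per split); the data are simulated and purely illustrative:

```python
# Python sketch of the 5-fold x 10-repeat cross-validation scheme described
# above (the study used R's caret interface; sklearn stands in here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
svm = SVC(kernel="rbf")  # radial kernel; posterior probabilities (PCAM)
                         # would additionally require probability calibration

rf_acc = cross_val_score(rf, X, y, cv=cv).mean()
svm_acc = cross_val_score(svm, X, y, cv=cv).mean()
```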

Statistical analyses
All statistical analyses were conducted in the testing subsamples of the raw and the PCP datasets using different packages for R 55 . Statistical analyses focused on description and effect-size estimation rather than merely testing statistical significance 59 . All effect-size estimates were accompanied by 95% confidence intervals (CI), and, when appropriate, these effects were also reported in terms of their percentage of the maximum possible (POMP) score (see 35,60 ). Moreover, when statistical significance was tested, p values were corrected for multiple comparisons with the FDR 61 method or, when comparing deciles (see below), with the Hochberg 62 method.
Algorithms' performance was initially measured as the percentage of correctly classified cases (%CC) and its 95% CI. These %CC scores were initially compared with the chance-expected value of 0.5 with one-sided binomial tests, and with each other by means of McNemar's test. Classification bias (whether females or males had higher chances of being misclassified) was also assessed using McNemar's test. The details of these analyses are presented in the Supplementary Material.
As discussed in the main text, the calculation of %CC scores requires a dichotomization of the continuous outcome provided by ML algorithms. To illustrate the statistical and interpretative distortions [15][16][17] introduced by this dichotomization, we calculated new %CC scores after dividing this continuous outcome (PCAM) into thirds (as in some previous studies 14,30 ) instead of into halves. These scores were compared to the chance-expected value of 0.33 by means of chi-squared tests. McNemar's test was used to compare the dichotomization- and trichotomization-associated %CC scores, as well as the trichotomization %CC scores yielded by each algorithm in each dataset.
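The two discretization schemes can be sketched as follows; the 0.5 and tertile cutoffs are illustrative stand-ins for the halves/thirds splits described above:

```python
import numpy as np

# Sketch of the two discretization schemes applied to continuous PCAM
# scores (illustrative cutoffs: 0.5 for halves; 1/3 and 2/3 for thirds).
def dichotomize(pcam):
    return (np.asarray(pcam) >= 0.5).astype(int)   # 0 = female, 1 = male

def trichotomize(pcam):
    return np.digitize(pcam, [1 / 3, 2 / 3])       # 0, 1, 2 (middle third)

pcam = np.array([0.10, 0.45, 0.55, 0.90])
halves = dichotomize(pcam)    # two classes from one continuous score
thirds = trichotomize(pcam)   # three classes from the same score
```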
Assessing multivariate sex differences in GMVOL

As in some previous studies 14,32-34 , we used the a posteriori probabilities yielded by ML algorithms (in this case, operationalized as the probability of being classified as male, PCAM) as a continuous and informatively rich dependent variable. The males' and females' PCAM distributions yielded by each algorithm in each dataset were first described through bootstrap estimates of appropriate statistics (skewness, kurtosis, deciles, inter-quantile range, and variance; repetitions = 10,000). In a second step, PCAM scores were used to quantify the multivariate sex differences in GMVOL at different levels. More specifically: Possible sex differences in PCAM dispersion measures (variances and inter-quantile ranges, IQR) were assessed through the original version and a customized version of the comvar2 function included in the freely accessible Rallfun-v38 file (https://dornsife.usc.edu/labs/rwilcox/software/).
The overall degree of similarity between the PCAM density distributions of males and females was quantified using the overlap index η̂. This index measures the area intersected by two probability density functions, and it is conceptually related to other measures of overlap, such as the Kullback-Leibler divergence and Bhattacharyya's distance. However, unlike these overlap metrics, η̂ can be estimated in the absence of symmetry, unimodality, or any other distributional assumption 63 . In the present study, η̂ was obtained via kernel density estimation (KDE) through the boot.overlap function (10,000 repetitions) of the overlapping package for R 64 . A second and complementary estimate of these sex differences at the distribution level was obtained by calculating the probability of superiority (PS). The PS is defined as the probability that a randomly sampled member of group A will have a higher score than that attained by a randomly sampled member of group B. More specifically, the probability that males' PCAM scores would be higher than (PSM), equal to, or lower than (PSF) those of females, along with Cliff's d statistic 65 and its 95% CI, was obtained through the cidv2 function of the rogme package 23 .
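Both distribution-level estimates have simple definitions that can be sketched in Python. The study itself used the overlapping and rogme R packages; this version omits the bootstrap CIs and runs on simulated data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sketch of the two distribution-level estimates described above: the
# overlap index integrates the minimum of two kernel density estimates,
# and Cliff's d is computed from all pairwise comparisons.
def overlap_index(a, b, grid_size=2048):
    lo = min(a.min(), b.min()) - 1.0
    hi = max(a.max(), b.max()) + 1.0
    grid = np.linspace(lo, hi, grid_size)
    da, db = gaussian_kde(a)(grid), gaussian_kde(b)(grid)
    step = grid[1] - grid[0]
    return float((np.minimum(da, db) * step).sum())   # area of intersection

def cliffs_d(a, b):
    diff = np.subtract.outer(a, b)                    # all pairwise differences
    return float((diff > 0).mean() - (diff < 0).mean())

rng = np.random.default_rng(1)
males = rng.normal(0.6, 0.2, 300)      # simulated PCAM-like scores
females = rng.normal(0.4, 0.2, 300)
h = overlap_index(males, females)
d = cliffs_d(males, females)
```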
Because no single score can properly summarize the differences between two distributions 23,25,[35][36][37][38][39] , male-female differences in the PCAM continuum were characterized by comparing their cumulative distribution functions (CDF; 25,35 ). CDFs make it possible to directly estimate the proportion of cases in each group with PCAM values equal to or lower than any possible cutoff, but also the proportion of subjects in one group that have PCAM values equal to or lower than a given proportion of cases in the other group 25,35 . Within each CDF, sex-based comparisons were conducted at each decile with the shifthd_pbci function (bootstrap: 10,000 repetitions) of the rogme package 23 . The shifthd_pbci function (and the other functions of this package described below) uses the Harrell-Davis quantile estimator in conjunction with a percentile bootstrap approach to calculate the deciles and the between-groups differences at those deciles. Unlike traditional parametric methods, this approach ensures that the estimates fall within the bounds of the PCAM distribution [0,1], thus preventing inappropriate inferences. Moreover, during the calculation of these decile differences, the corresponding 95% CIs are computed, and the significance level is adjusted for multiple comparisons using the Hochberg method. Thus, when one of these CIs does not include the zero value, the difference can be declared statistically significant at α < 0.05 without being concerned about Type I error 23 .
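The core decile computation (without rogme's percentile-bootstrap CIs) can be sketched with SciPy's implementation of the Harrell-Davis estimator; the simulated data are illustrative:

```python
import numpy as np
from scipy.stats.mstats import hdquantiles

# Sketch of the decile comparison described above: Harrell-Davis estimates
# of the nine deciles in each group, and the between-groups difference at
# each decile (rogme additionally bootstraps a percentile CI per difference).
probs = np.linspace(0.1, 0.9, 9)
rng = np.random.default_rng(2)
males = rng.normal(0.6, 0.2, 150)      # simulated PCAM-like scores
females = rng.normal(0.4, 0.2, 150)

dec_m = hdquantiles(males, prob=probs)
dec_f = hdquantiles(females, prob=probs)
decile_diffs = np.asarray(dec_m) - np.asarray(dec_f)  # one value per decile
```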
With the decile estimates obtained, the so-called shift functions 24 were also calculated. The shift function plots the between-groups decile differences against the deciles of one group, thus providing a complete picture of how, and by how much, the score distribution of one group should be re-arranged to match the score distribution of the other group (for a detailed description, see 23 ). Finally, we also compared whether the estimated size of the female-male differences at D5 (the median) differed between algorithms and within the deciles of the PCAM distributions obtained with each algorithm. These comparisons were conducted with the original bwquantile function (see acknowledgements section) and with a customized version of this function, respectively.
Following current recommendations 23,24 , we also estimated the size of the typical difference between any given male and any given female in each PCAM distribution of the raw and PCP datasets. These bootstrapped estimations were conducted using the allpdiff_hdpbci function (bootstrap: 10,000 repetitions) of the rogme package 23 , which computes, through the Harrell-Davis estimator, the deciles (and their 95% CI) of the empirical distribution of all (in this case, 22,500) pair-wise differences between the members of two independent groups. We also calculated the CDFs of these pair-wise differences in the raw and PCP datasets, and between-datasets decile-based comparisons were then conducted with the shiftdhd_pbci function (bootstrap: 10,000 repetitions) of the rogme package 23 . Finally, we employed the Dqcomhd function (bootstrap: 50,000 repetitions) of the WRS2 package for R to ascertain whether the deciles of these male-female pair-wise differences significantly varied between algorithms in each dataset.
Interpreting multivariate sex differences in GMVOL

For obvious reasons, interpretability has become a major issue in ML applications 66 . In the particular case of the study of multivariate sex differences, knowledge about the brains of females and males is only gained when the complex numerical output of ML algorithms is decomposed and the brain features that contribute the most to the groups' distinguishability are identified. To provide this information, we extracted global, post-hoc, model-agnostic explanations of the five ML algorithms tested in this study by modeling their outputs through interpretable surrogate models (for a discussion of the different types of interpretability and their associated methods, see 66-68 ). More specifically, we employed boosted beta regression procedures to identify the brain features that best predicted the PCAM scores observed with each algorithm in each dataset.
Boosted beta regression analyses and between-algorithms' agreement.
In the statistical literature, beta regression has been established as a powerful and readily interpretable procedure for modeling bounded (0,1) distributions 69 . However, because fitting classical beta regression models might be challenging when using a large number of predictors, boosted beta regression models have been developed 27 . Boosted beta regression is based on the gamboostLSS algorithm, which performs reliable variable selection during the iterative fitting process (for a comprehensive description of boosted beta regression, see 27 ). In the present study, boosted beta regressions were implemented through the betaboost package for R 70 , using the PCAM scores observed with each algorithm in each dataset as the response variable and the volumetric scores of the testing subsample in the raw/PCP datasets as predictors. The number of iterations that most reduced the risk was established through cross-validation, and the contribution of each predictor was estimated by using the obtained mu-coefficient values and by constructing a relative importance measure: relative importance = 100 × (accumulated risk reduction attributable to a predictor / total risk reduction in the model).
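The relative importance measure defined above reduces to a simple normalization. In this sketch, the per-predictor risk reductions (which, in the study, would come from the fitted betaboost model) and the AAL region names are purely illustrative:

```python
# Sketch of the relative-importance measure defined above: each predictor's
# share of the total risk reduction across boosting iterations, scaled to
# sum to 100. The region names and values below are illustrative only.
def relative_importance(risk_reduction):
    total = sum(risk_reduction.values())
    return {region: 100 * rr / total for region, rr in risk_reduction.items()}

imp = relative_importance(
    {"Precentral_L": 0.30, "Insula_R": 0.15, "Vermis_9": 0.05}
)
# The resulting importances sum to 100 (here: 60, 30, and 10).
```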
In a second step, the degree of agreement between the boosted beta regression models obtained was assessed. These comparisons were conducted between datasets and between algorithms within each dataset. For each of these two sets of comparisons, we first assessed whether the boosted beta regression models included the same brain features as relevant predictors. More specifically, R-wise agreement (coincidence between all models) was estimated by means of Hubert's Kappa index 71 and the multi-rater delta index (which, unlike other more commonly used agreement metrics, is not affected by the ratings' marginal distributions 72,73 ). These two agreement indexes were calculated using software specifically developed for this purpose and freely available at https://www.ugr.es/~bioest/software/cmd.php?seccion=agreement.
We also assessed the degree of agreement on the relative importance attributed to each predictor by using Lin's concordance correlation coefficient 74 , Kendall's W agreement coefficient 75 , the mean of bivariate Spearman's rho rank correlations 76 , and the intraclass correlation coefficient (two-way ANOVA, random effects; single and average ratings 77,78 ). In the case of comparisons between algorithms within each dataset, agreement was assessed at the interval and at the ordinal level by inputting the obtained coefficients' values and their ordinal rank positions, respectively. In the case of between-dataset comparisons, agreement was assessed by using each predictor's ranking mean (RM) or its position in a multiplicative "rank of ranks" (RRP, 79 ) across algorithms. These two measures were also used in a correlational analysis assessing whether the relative importance of these predictors was associated with TIV variation. Thus, as in 5,18 , linear regression analyses were conducted to obtain an estimate (r 2 ) of the TIV-explained variance in each brain region, and the r 2 scores corresponding to the brain regions identified as relevant predictors in each dataset were correlated with their corresponding RM and RRP values.

Between-algorithm PCAM variation
To assess the degree of similarity between the PCAM scores obtained with the different algorithms, three complementary approaches were used. First, for each individual, the minimum PCAM score was subtracted from the maximum PCAM score within each dataset, and the CDFs depicting this maximal PCAM variation in each dataset were built and described by several summary statistics (minimum, average, deciles, and maximum). Second, the same statistics were used to describe the degree of PCAM variation for each pair of algorithms within the raw and the PCP datasets. Finally, to assess the degree of ordinal similarity between the PCAM scores yielded by the different algorithms, zero-order Spearman's rho between-algorithms correlation matrices were calculated in the raw and PCP datasets. Because we corroborated a significant contribution of TIV to PCAM scores in the raw dataset, this correlational analysis was also conducted using partial (-TIV) Spearman correlations.
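The first and third of these approaches can be sketched as follows; five simulated, mutually correlated "algorithm" score columns stand in for the real PCAM outputs:

```python
import numpy as np
from scipy.stats import spearmanr

# Sketch of two between-algorithm PCAM variation measures: the per-subject
# range (max - min across algorithms) and the Spearman rank-correlation
# matrix between algorithms. `pcam` is subjects x algorithms (simulated).
rng = np.random.default_rng(3)
base = rng.uniform(0, 1, 300)
pcam = np.column_stack(
    [np.clip(base + rng.normal(0, 0.05, 300), 0, 1) for _ in range(5)]
)

max_variation = pcam.max(axis=1) - pcam.min(axis=1)   # one value per subject
rho_matrix, _pvals = spearmanr(pcam)                  # 5 x 5 rank correlations
```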

Hierarchical clustering
As described above, the high accuracy observed in previous sex classification studies has often been interpreted as showing the ability of ML algorithms to identify two clearly distinguishable brain types, one typical of males and the other typical of females. However, proving that these brain types actually exist requires confirming that the brains of females and males substantially differ in a specific and reproducible pattern of brain features. To assess whether or not these distinctive brain profiles could be found in our data, agglomerative hierarchical clustering methods (average linkage) were used.
Specifically, hierarchical clustering analyses were performed with the hclust function of the stats package 55 . Initially, the included features were the volumetric z-scores of the brain areas identified as relevant predictors of the PCAM scores yielded by each ML algorithm in each dataset (see the boosted beta regression section). Dissimilarity was measured in terms of Euclidean and Spearman distances; Euclidean distances served to quantify the individuals' disparity in terms of the magnitude of their accumulated differences in GMVOL, whereas Spearman distances measured the discordance in the shape of their brain profiles. Each resulting dendrogram was cut at appropriate heights to obtain 2 to 10 clusters, and in each of these alternative partitions the size (number of subjects) and composition (proportion of females) of the resulting clusters were assessed. The robustness of the obtained results was corroborated by repeating the same analyses with the volumetric z-scores of the 116 brain areas of the AAL atlas and also with the top five predictors of the PCAM scores yielded by each ML algorithm in each dataset.
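A Python sketch of this clustering setup, with SciPy standing in for R's hclust; the z-scores are simulated, and the Spearman distance is implemented as 1 minus the rank correlation between subjects' profiles:

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage
from scipy.spatial.distance import pdist

# Sketch of the two clustering analyses described above: average-linkage
# dendrograms built from Euclidean distances (magnitude of accumulated
# differences) and from Spearman-based distances (profile shape), each
# cut to yield k clusters. Data are illustrative (subjects x z-scores).
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))

d_euc = pdist(X, metric="euclidean")

# Spearman distance: Pearson's "correlation" distance computed on ranks
# equals 1 - Spearman's rho between subjects' profiles.
ranks = np.argsort(np.argsort(X, axis=1), axis=1)
d_spe = pdist(ranks, metric="correlation")

k = 3
labels_euc = cut_tree(linkage(d_euc, method="average"), n_clusters=k).ravel()
labels_spe = cut_tree(linkage(d_spe, method="average"), n_clusters=k).ravel()
```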
Complementarily, a series of analyses of similarity (ANOSIM) were conducted. ANOSIM is an ANOVA-like, non-parametric test that operates on distance matrices and assesses the null hypothesis that the distances between members of two or more predefined groups (in this case, males/females) are the same as the distances between members of the same group 80 . Because in the present study this assessment involved a large number of instances (44,850), statistical significance was almost guaranteed and, consequently, not truly informative 81 . Therefore, we focused on estimating the value of the R statistic (and its 95% CI), which compares the mean of ranked dissimilarities between groups to the mean of ranked dissimilarities within groups, and whose meaningful range lies between 0 (when the similarity within groups is the same as between groups) and 1 (when all samples within groups are less dissimilar to each other than to any pair of samples from different groups) 80 . More specifically, Euclidean and Spearman distance matrices were calculated in 10,000 bootstrap samples using the get_dist function of the factoextra 82 package. In each of these distance matrices, an ANOSIM test was conducted with the anosim function of the vegan package 83 and the corresponding R value was obtained. In a second step, the 95% confidence intervals of these estimates (normal approximation and percentile method) were obtained through the boot.ci function of the boot package 84 . Again, these calculations were performed using as features the volumetric z-scores of the 116 brain areas of the AAL atlas, and of all or just the top five predictors of the PCAM scores yielded by each ML algorithm in each dataset.

The PCAM continuum. Plots depict the strip charts (bottom) and the non-scaled density functions (top) of the PCAM scores of females (red) and males (blue) yielded by each algorithm in each dataset. The thickness of the lines is directly proportional to the scaled density of each distribution.
Plots also include the medians and inter-quantile ranges of each group (vertical bars) and estimates of their similarities/differences at the distribution level (overlap/Cliff's delta, respectively). For a complete statistical output, see Supplementary Tables 4A-4E.

Comparing females and males on the PCAM continuum. Plots depict the cumulative density functions of the PCAM scores for females (red) and males (blue). Horizontal color bars depict the tenths of the females' (top) and males' (bottom) PCAM distributions. As the provided examples illustrate, these plots allow one to compare the PCAM values at the deciles of the females'/males' distributions, but also to compare the proportion of cases in each group with PCAM scores equal to or lower than any possible cutoff, and how many subjects in one group have PCAM values equal to or lower than a given proportion of cases in the other group. See further details and analyses in Supplementary Tables 5A-5F and in the accompanying Supplementary Figures 2 and 3.

What is the typical difference between any female and any male? A) Estimated density functions of all pairwise differences between males and females in each algorithm and dataset. B) CDFs of these pair-wise differences allow one to: 1) estimate the empirical probability of finding a pairwise male-female difference whose size is equal to or lower than any predesignated value; 2) estimate the size of the pairwise differences for any given proportion of cases; 3) compare the estimates of these pairwise differences provided by different algorithms in each dataset or by a single algorithm in the raw and PCP datasets. See further details in Supplementary Tables 6A-6E.

Relative importance of PCAM predictors in the raw dataset. Plots depict the brain areas identified as relevant predictors of the PCAM scores yielded by each algorithm and their relative importance.
The figure also depicts the top 10 predictors across all five algorithms according to their average rank values (the only predictor also found in the top 10 of the PCP dataset is highlighted in green). The right side of the plot depicts the proportion of variance (r2 value) explained by TIV in each of these brain regions. The correlation between the two sets of data and additional parameters of the boosted beta regressions employed to identify these predictors are also included. Additional details are provided in Supplementary Tables 7A, 7C, 7E-7G, 7H, 7J, and 7L.

Figure 6
Relative importance of PCAM predictors in the PCP dataset. Plots illustrate the relative importance of the brain areas identified as relevant predictors of the PCAM scores yielded by each algorithm. The figure also depicts the top 10 predictors across all five algorithms according to their average rank values (the only predictor also found in the top 10 of the raw dataset is highlighted in purple). The proportion of variance (r2 value) explained by TIV in each of these brain regions is depicted by the bars on the right side of the plot. The correlation between the two sets of data as well as other parameters of the boosted beta regressions employed to identify these predictors are also included. Further details are provided in Supplementary Tables 7B, 7D, 7E-7G, 7I, 7K, and 7M.

Brain profiles in the raw dataset. A) In the background, a "spaghetti" plot displays the individual values (females in red, males in blue) in all the brain areas identified as relevant predictors of PCAM scores (see Figure 6). Two cases (green/yellow dots) are highlighted to illustrate how the same classification category (and even virtually identical PCAM scores) can be achieved by different brain profiles. B-C) Hierarchical agglomerative clustering based on Euclidean and Spearman distances, respectively. Branches are colored according to the sex composition of the emerging aggregations (red = only females, blue = only males, black = males and females). D-E) Clusters' size and composition. The dendrograms displayed in panels B and C were cut at appropriate heights to obtain 2-10 clusters. The composition of these clusters (K2-K10, horizontal axis) was analyzed in terms of the proportion of females (rectangles' color; large numbers) and the clusters' size (rectangle area; small black numbers). Similar results were obtained when using the other algorithms and with a larger (116)/smaller (5) number of predictors (Supplementary Figures 7-11).

Figure 9
Brain profiles in the PCP dataset. A) In the background, a "spaghetti" plot displays the individual values (females in red, males in blue) in all the brain areas identified as relevant predictors of PCAM scores (see Figure 7; grey labels highlight predictors with negative regression weights, see Supplementary Table 7B). To illustrate how the same classification category and similar PCAM scores can be achieved by different brain profiles, two cases (green/yellow dots) are highlighted. B-C) Hierarchical agglomerative clustering based on Euclidean and Spearman distances, respectively. Branches are colored according to the sex composition of the emerging aggregations (red = only females, blue = only males, black = males and females). D-E) Clusters' size and composition. The dendrograms displayed in panels B and C were cut at appropriate heights to obtain 2-10 clusters. The composition of these clusters (K2-K10, horizontal axis) was analyzed in terms of the proportion of females (rectangles' color; large numbers) and the clusters' size (rectangle area; small black numbers). Similar results were obtained when using the other algorithms and with a larger (116)/smaller (5) number of predictors (Supplementary Figures 8-12).