A Dirichlet Model of Alignment Cost in Mixed-Membership Unsupervised Clustering

Abstract Mixed-membership unsupervised clustering is widely used to extract informative patterns from data in many application areas. For a shared dataset, the stochasticity and unsupervised nature of clustering algorithms can cause difficulties in comparing clustering results produced by different algorithms, or even multiple runs of the same algorithm, as outcomes can differ owing to permutation of the cluster labels or genuine differences in clustering results. Here, with a focus on inference of individual genetic ancestry in population-genetic studies, we study the cost of misalignment of mixed-membership unsupervised clustering replicates under a theoretical model of cluster memberships. Using Dirichlet distributions to model membership coefficient vectors, we provide theoretical results quantifying the alignment cost as a function of the Dirichlet parameters and the Hamming permutation difference between replicates. For fixed Dirichlet parameters, the alignment cost is seen to increase with the Hamming distance between permutations. Datasets with low variance across individuals of membership coefficients for specific clusters generally produce high misalignment costs—so that a single optimal permutation has far lower cost than suboptimal permutations. Higher variability in data, as represented by greater variance of membership coefficients, generally results in alignment costs that are similar between the optimal permutation and suboptimal permutations. We demonstrate the application of the theoretical results to data simulated under the Dirichlet model, as well as to membership estimates from inference of human-genetic ancestry. The results can contribute to improving cluster alignment algorithms that seek to find optimal permutations of replicates. Supplementary materials for this article are available online.


Introduction
In mixed-membership unsupervised clustering, statistical models of a set of clusters and a set of entities are considered, so that the total "membership" of an entity is distributed across the clusters (Airoldi et al. 2015a).Using the models, patterns inferred among the entities are interpreted by examining their co-clustering, and a cluster itself is interpreted by examining the entities that possess large membership fractions for the cluster.
Mixed-membership unsupervised clustering has found diverse applications in such areas as document and text classification, statistics of networks, and medical diagnostics (Airoldi et al. 2015b).In one of the most prominent areas of applicationthe field of population genetics-it has long been a central technique for recovering information about genetic relationships of individuals and populations.In typical populationgenetic studies, researchers collect genotypes from individuals within a species, measure features of genetic variation among the individuals, and infer evolutionary processes that have generated those features.Mixed-membership unsupervised clustering techniques designed specifically for populationgenetic data-Structure (Pritchard, Stephens, and Donnelly 2000), Admixture (Alexander, Novembre, and Lange 2009), and Baps (Corander, Waldmann, and Sillanpää 2003), for CONTACT Noah A. Rosenberg noahr@stanford.eduDepartment of Biology, Stanford University, Stanford, CA.Supplementary materials for this article are available online.Please go to www.tandfonline.com/r/JCGS.example-use stochastic iterative clustering algorithms to infer membership fractions for individuals in clusters.The membership fraction for an individual in a cluster is interpreted, depending on the setting, in one of two ways.In some settings, it represents the proportion of the individual's genome originating in the cluster, or the probability that within the individual, an observation of a specific site in the genome originates from that cluster, with different sites having independent and identical probabilities.In others, it gives the probability that the individual's entire genome originates from the cluster, so that different sites are identically distributed but fully dependent.
In unsupervised clustering, two challenges to data analysis have long been recognized: label-switching and genuine multimodality (Stephens 2000;Jasra, Holmes, and Stephens 2005;Jakobsson and Rosenberg 2007;Airoldi et al. 2015a).Labelswitching describes the fact that because the methods include stochastic steps, if K clusters are labeled 1, 2, . . ., K, then K! distinct permutations of the cluster labels have equivalent meaning.For example, Figure 1(A) shows two permutations of the clusters for a single set of cluster memberships; the panels differ only in that the cluster labels, each of which is represented by a color, differ between the permutations.When label-switching is present, clustering replicates can be aligned by identifying the unique permutation that makes the replicates equivalent (Figure 1(D)).Genuine multimodality arises if no permutation exists that makes replicates equivalent, such as in Figure 1(C).Replicates using the same data can fail to produce identical memberships, even after they are permuted to align in many features (Figure 1(F)).A common scenario is that in which replicates are not strictly equivalent (Figure 1(B)), but a permutation makes them extremely similar (Figure 1(E)); this situation is informally described as possessing label-switching rather than genuine multimodality.
To make use of cluster analyses from multiple datasets, algorithms, or settings, a method is needed for identifying the permutations that eliminate label-switching and reveal genuine multimodality.In the population-genetic context, early attempts at permutation proceeded informally, as the particular features of datasets often rendered the optimal permutations relatively easy to identify (e.g., Rosenberg et al. 2001;Rosenberg 2004).To advance on this situation, several algorithms, including Clumpp (Jakobsson and Rosenberg 2007), Clumpak (Kopelman et al. 2015), and Pong (Behr et al. 2016), have been introduced for identifying optimal alignments, where an optimal alignment is one that minimizes a cost function or maximizes a similarity function.These algorithms are now widely used with unsupervised clustering methods to clarify the results that the methods produce.
The alignment algorithms are generally seen to perform well in identifying permutations that visually align replicates (Jakobsson and Rosenberg 2007; Kopelman et al. 2015;Behr et al. 2016).However, despite the widespread use of alignment algorithms in population genetics, formal evaluations of their success at identifying optimal permutations have not been performed.Further, relatively little understanding has been available regarding the alignment cost difference of suboptimal permutations in relation to minimal-cost permutations; thus, when permutations are suboptimal, the potential for reducing the alignment cost from that achieved by existing algorithms remains unclear.
In this study, we introduce a model for evaluating the cost difference of optimal and suboptimal permutations.We treat individual memberships as drawn from a Dirichlet distribution with specified parameters.Under the Dirichlet model, we explore the cost of suboptimal permutations as a function of the number of misaligned clusters-the Hamming distance between permutations.We find that cost generally increases with the number of misaligned clusters.For examples in which Dirichlet parameters assign each individual primarily to a single cluster, the alignment cost for suboptimal permutations is generally substantially higher than for the optimum.For "noisy" data, as represented by Dirichlet parameters with similar mean values of membership components for different clusters, suboptimal permutations can possess cost similar to the optimum.The model can help in understanding challenges for algorithms that seek to produce minimal-cost permutations.

Terminology
Model-based unsupervised clustering algorithms in population genetics produce a vector of membership coefficients for each individual.For N individuals and K clusters, the output of a clustering algorithm is an estimated N × K membership coefficient matrix Q, where qik is the estimated coefficient for individual i in cluster k.In models in which each individual is treated as belonging to a single cluster, qik represents the estimated probability that individual i is a member of cluster k.In mixed-membership models, in which an individual possesses membership in multiple clusters, qik is the estimated fraction of the data from individual i that originates from cluster k.
For convenience, we use the language of mixed-membership unsupervised clustering in population genetics, but our analysis can also apply to cases of population-genetic clustering in which membership coefficients are interpreted as probabilities rather than ancestry fractions, as well as to related applications outside population genetics.Note that in the population-genetic context, we distinguish "populations, " representing predetermined groups of individuals, from "clusters, " the K groups for which membership is estimated.
Hence, each individual has an estimated membership vector qi = (q i1 , qi2 , . . ., qiK ), for which the sum across clusters is K k=1 qik = 1.The estimated membership matrix Q is a rightstochastic matrix, and each column vector characterizes a cluster by the list of associated memberships of the N individuals.
In population genetics, unsupervised cluster analyses study the patterns of genetic variation of individuals from multiple populations.They infer a matrix that contains the estimated membership proportions of the individuals in the K clusters, where clusters correspond either to supervised ancestry groups or emergent groups appearing in specific data analyses.For instance, in a supervised analysis with three clusters 1, 2, and 3, representing ancestry in three distant populations, respectively, an individual with estimated membership vector (0.6, 0.3, 0.1) has 60% of its genome estimated to originate from population 1, 30% from population 2, and 10% from population 3.

Dirichlet Model
We consider membership coefficients drawn from a theoretical model.Each of a set of predetermined populations is assumed to have its own characteristic distribution of membership coefficients for a series of K clusters.For an analysis with K clusters, a natural choice to model the individual membership coefficients of a population is a Dirichlet distribution of order K (Kotz, Balakrishnan, and Johnson 2004, chap. 49).
In this model, for a given predefined population, the expected membership proportion of an individual in cluster k is the mean of the kth random variable in a Dirichlet-distributed random vector.Suppose a random vector q is drawn from the Dirichlet distribution of order K with parameters a = (a 1 , a 2 . . ., a K ), where a k > 0 for all k.Writing a 0 = K k=1 a k , this q ∼ Dir(a) has probability density function where (•) is the gamma function (Kotz, Balakrishnan, and Johnson 2004).
It is convenient to convert Dirichlet parameters a = (a 1 , a 2 , . . ., a K ) to expected memberships, (E[q 1 ], E[q 2 ], . . ., E[q K ]), as the expected memberships are often more easily understood than the Dirichlet parameters.We model a set of populations using Dirichlet distributions, with the Dirichlet parameter vector a chosen so that each mean value E[q k ] = a k /a 0 corresponds to the assumed mean proportion of cluster k in a population; the sum a 0 controls the variance.Memberships of different individuals from the same population are modeled as independent random vectors sampled from the Dirichlet model with a shared set of parameter values.Memberships of individuals from different populations follow distributions with different Dirichlet parameters.
We use superscript • ( ) on the Dirichlet parameters a to distinguish the parameter vector for population .Thus, for example, for a set I 1 of individuals from population 1 and a set I 2 of individuals from population 2, q i ∼ Dir(a (1) ) for all i ∈ I 1 and q i ∼ Dir(a (2) ) for all i ∈ I 2 , where a (1) /a (1) 0 denotes the population-wise parametric mean membership proportions for population 1 and a (2) /a (2) 0 denotes those proportions for population 2.

Distance Function
To quantify the alignment cost for a pair of replicate analyses, we will need a dissimilarity measure for pairs of membership coefficients on the same samples.Consider two replicate N × K membership coefficient matrices, P and Q.We follow Jakobsson and Rosenberg (2007) in relying on the Frobenius norm of their difference (see also Rosenberg et al. 2002).
In particular, using • F to denote the Frobenius norm, the distance between the membership matrices P and Q can be calculated as ( 2 ) The choice of the Frobenius norm to measure the difference between membership matrices not only accords with past studies, it is also mathematically convenient in our framework, as the sum of squares that it entails facilitates the computation of integrals with respect to the Dirichlet distribution.

Overview
With our Dirichlet model and distance function established, we now describe the computation of the alignment cost associated with a pair of replicate clusterings and a single individual.Under the model, the membership coefficient vector of an individual is a random vector drawn from the Dirichlet distribution with specified parameters.
Consider two replicate draws from the Dirichlet model, representing outcomes of two cluster analyses.Both replicates have K clusters.However, owing to label-switching, multimodality, or both, the clusters are not necessarily aligned.In the most general case, in one replicate, an individual has membership vector p drawn from a Dirichlet distribution Dir(a), and the same individual has membership vector q drawn from Dir(b) in the second replicate.Both p and q are random vectors; when two replicates are aligned, p and q are drawn from the same distribution, and when the replicates are not aligned, the random vectors are drawn from different distributions.
The contribution of the individual to the distance between replicates 1 and 2 is the random variable K i=1 (p i − q i ) 2 = p − q 2 2 .Denote the mean value of this random variable p − q 2 2 by A a,b , where p ∼ Dir(a) and q ∼ Dir(b).Using the probability density function of the Dirichlet distribution in Equation (1), this value can be computed as (3) Here, we have made use of the fact that p K = 1− K−1 i=1 p i and q K = 1 − K−1 i=1 q i .It is often convenient to consider the special case for label-switching, in which the entries of b represent a permutation of the entries of a.In other words, denote the permutation between two replicates by φ, so that cluster φ(i) gives the number of the cluster in replicate 2 that corresponds to cluster i in replicate 1.
For the label-switching case, in replicate 1, the parameters associated with the K cluster memberships are (a 1 , a 2 , . . ., a K ).In replicate 2, corresponding parameters are (b 1 , b 2 , . . ., b K ) = (a φ(1) , a φ(2) , . . ., a φ(K) ).For simplicity, we let b i = a φ(i) .Only for the identity permutation φ 0 , for which φ(i) = i for all i = 1, 2, . . ., K, are the two replicates aligned.Because the a i and b i are the same set of items, permuted, We are interested in the mean contribution of an individual to the alignment cost, C a,b , which can be calculated as the difference between the contribution of a random individual to the distance between a pair of replicates aligned by permutation b (misaligned for b = a) and the contribution to the distance between correctly aligned replicates: Here, we include a factor of 1 2 based on the fact that K k=1 (p ik − q ik ) 2 lies in [0, 2], so that A a,b − A a,a is bounded above by 2; the cost C a,b ranges from 0 to 1.
We now evaluate Equation (3) to obtain the individual mean contribution to distance between replicates.First, we consider K = 2.We next examine K = 3, and we generalize to arbitrary K.We explore the effect of the Dirichlet parameters on these mean contributions.
Theorem 3.1.Consider a population of individuals with membership coefficients in K = 2 clusters.Suppose that in one replicate, the membership coefficients of the individuals follow a Dirichlet model with parameters a = (a 1 , a 2 ), and in a second replicate, they follow a Dirichlet model with parameters b = (b 1 , b 2 ).The mean contribution of a randomly chosen individual to the distance between replicates is Proof of Theorem 3.1.When K = 2, Equation (3), representing the mean contribution of a randomly chosen individual to the distance between two replicates, becomes We compute this integral in Appendix B to obtain the result.
We can then apply Theorem 3.1 to obtain the contribution of an individual in the special case that the two replicates are aligned.In other words, we calculate A a,a : For misaligned replicates that differ by label-switching, consider a permutation φ with (b 1 , b 2 ) = (a 2 , a 1 ).We have Applying Equations ( 7), (6), and (4), the mean contribution of an individual to alignment cost is Figure 2 plots Equation ( 8) as a function of a 1 and a 2 .For a 2 = a 1 , the two replicates have the same parameters, and the mean contribution of an individual to the cost is 0. In each replicate, the mean membership coefficient for cluster 1 is 1 2 , and the mean for cluster 2 is also 1 2 .Starting from the a 2 = a 1 line in the a 1 a 2 -plane, as a 2 increases while holding a 1 constant, or as a 1 increases while holding a 2 constant, the cost increases.These parameter changes make the permuted replicate with parameters (a 2 , a 1 ) quite different from the unpermuted replicate with parameters (a 1 , a 2 ), so that a replicate is increasingly distinguishable from its label-switching permutation.The cost approaches 1 for a replicate with high a 1 and low a 2 , or vice versa; in these cases, nearly all of the membership lies in one of the two clusters, so that the alignment cost of switching the two clusters is nearly 1. Table 1.Alignment cost for each of the six label-switching cases with K = 3, as functions of the Dirichlet parameters a 1 , a 2 , a 3 .
NOTE: Each cost is obtained by evaluating Equation ( 9), and then applying Equation (4).

K = 3
Considering two replicates, the number of permutations possible for the clusters is K! = 6.As in the K = 2 case, for each permutation φ, we can obtain the mean individual contribution to the distance between two replicates.We compute A a,b from Equation (3): For each φ, we compute the alignment cost for φ from Equation (4).
The costs for the six permutations appear in Table 1.We omit their derivations, as each can be obtained from the general result that we present for arbitrary K (Section 3.4).In the table, the cost of 0 for the identity permutation appears in the first row.The next three rows show the costs associated with each of the permutations with Hamming distance 2 from the initial permutation (1, 2, 3), where the Hamming distance tabulates the number of clusters that are misaligned between a pair of replicates.The last two rows show the costs for the two permutations with Hamming distance 3 from (1, 2, 3).
Figure 3 plots the alignment cost from Table 1 for labelswitching with a specific permutation (2, 3, 1) as a function of a 1 and a 2 , for three fixed values of a 3 .In each panel, varying the three parameters in [1, 25], for a 1 = a 2 = a 3 , the alignment cost has the minimum value of 0. In Figure 3(A), the largest alignment cost is reached when one of the three parameters has value 25 and the other two are equal to 1; in Figure 3(C), the maximum occurs when two parameters equal 1 and the third is 25.In the intermediate Figure 3(B), large values occur both in the case that two values equal 1 and the third is 13, and in the case that one value is 1, one is 25, and the third is 13.These maxima reflect the intuition that when memberships differ substantially across clusters, the identity permutation has substantially lower cost than do permutations that represent misalignments.

Arbitrary K
We now generalize the calculation of the individual mean contribtion to distance between replicates (Equation ( 3)) and alignment cost (Equation ( 4)) from the K = 2 case to arbitrary K. Theorem 3.2.Consider a population of individuals with membership coefficients in K ≥ 2 clusters.Suppose that in one replicate, the membership coefficients of the individuals follow a Dirichlet model with parameters a = (a 1 , a 2 , . . ., a K ), and in a second replicate, they follow a Dirichlet model with parameters The mean contribution of a randomly chosen individual to the distance between replicates is Recall that a 0 = K i=1 a i and b 0 = K i=1 b i ; the proof appears in Appendix C. Note that in the case of K = 2, Equation (10) reduces to Equation (5).If two replicates are aligned, then we can derive the mean contribution of an individual to the distance between replicates by substituting b i = a i into Equation ( 10) for all i.
Corollary 3.3.The mean contribution to the distance between two aligned replicates of an individual whose membership coefficients follow a Dir(a) distribution, where a = (a 1 , a 2 , . . ., a K ), is We obtain this result by applying Theorem 3.2 to the identity permutation φ 0 .For K = 2, Equation ( 11) reduces to Equation (6).For a general permutation φ, supposing that the two replicates are not necessarily correctly aligned, we calculate the contribution of a randomly chosen individual to the cost using Equations ( 10) and ( 11), following Equation (4).(a 1 , a 2 , . . ., a K ).The mean contribution of an individual to the alignment cost for a second replicate whose parameters follow a Dirichlet model with permutation φ(a) is Once again, for K = 2, Equation ( 12) reduces to Equation (8).For K = 3, Equation ( 12) gives Table 1.With these general results on the alignment cost under label-switching now established, we proceed to analyze the effects of the parameters on the alignment cost.

Effect of the Dirichlet Parameters with Fixed Permutation
For label-switching with a fixed permutation, the value of C a,φ(a) in Equation ( 12) is only affected by the values of the Dirichlet parameters.In general, as the difference between the parameter of a cluster and its corresponding parameter under permutation-the difference between a i and a φ(i) -increases, C a,φ(a) also increases.
We have already examined the effect of the Dirichlet parameters in cases with K = 2 and K = 3.In Figure 2, we examined the relationship between C a,φ(a) and the a i for the only nonidentical permutation with K = 2, φ(1, 2) = (2, 1).The alignment cost was equal to 0 for a 1 = a 2 , that is, when both clusters have mean membership 0.5.The cost increases as a 1 and a 2 become increasingly different.
For K = 3, similar relationships appear in Figure 3 for the permutation (2, 3, 1).For a 1 = a 2 = a 3 , the alignment cost reaches the minimum of zero.In panels A and C, with domain [1, 25] for each of the three parameters, the alignment cost is maximal when one of the three is 25 and the other two are 1; in panel B, with a 3 fixed, the maximum occurs at (a 1 , a 2 , a 3 ) = (1, 1, 13).The more diverged the values of three parameters, the higher the alignment cost.This result corresponds to the intuition that when clusters have distinct membership patterns, they are less easily mistaken for each other.

Effect of the Permutation with Fixed Dirichlet Parameters
The cost for a permutation increases as its parameters increasingly diverge from the starting permutation.Suppose now that we consider the effect only of the permutation.The minimum possible value of C a,φ(a) in Equation ( 12) is 0. Clearly, the cost is 0 for two replicates in which the second replicate is unpermuted in relation to the first.Consider a permutation cycle PC: a subset of elements in the permutation φ that are permuted among themselves, so that φ(i) ∈ PC for all i ∈ PC and φ(i) ∈ PC for all i ∈ PC.We interpret a permutation cycle as minimal in the sense that none of its proper subsets is a permutation cycle.Suppose φ is decomposed into N PC permutation cycles.The set of all elements I = {1, 2, . . ., K} is the disjoint union of all permutation cycles: I = ˙ N PC h=1 PC h .We have the following result.
Proposition 4.1.Consider a permutation φ.C a,φ(a) = 0 if and only if for each permutation cycle PC h in φ, there exists a constant c h > 0 such that a i = c h for all i ∈ PC h .
The proposition states that the cost associated with φ is zero if and only if for all permutation cycles, all clusters in the permutation cycle have the same Dirichlet parameter.
Proof of Proposition 4.1.We prove the "if " direction first.Index the permutation cycles by h.Suppose for each h that a i = c h for all i ∈ PC h .For each i, suppose that when φ is decomposed into permutation cycles, φ(i) is in permutation cycle PC h .Then a i = c h .Because cluster i and cluster φ(i) are in the same permutation cycle, and all the Dirichlet parameters are positive.By the Cauchy-Schwarz inequality, with equality if and only if a = αb for some constant α.Because K i=1 a i = K i=1 b i , the only value α can take is 1, so that a i = b i = a φ(i) for all i = 1, 2, . . ., K. Note that i and φ(i) are in the same permutation cycle, say, PC h , by definition.We can denote by c h the value a i = a φ(i) ; the value c h applies for all i in PC h .Thus, assuming C a,φ(a) = 0 produces the conclusion that a i = c h for all i ∈ PC h .
Note that the proposition applies to the identity permutation φ = φ 0 .The identity permutation places each cluster in its own permutation cycle, so that φ(i) = i, b i = a φ(i) = a i , and Consider an example of Proposition 4.1 with permutation φ 1 (1, 2, 3, 4, 5) = (2, 1, 4, 5, 3).This permutation has two permutation cycles: {1, 2} and {3, 4, 5}.If a 1 = a 2 and a 3 = a 4 = a 5 , then C a,φ 1 (a) = 0.This can be easily seen from the fact that in Equation ( 12), C a,φ(a) = 0 if a i = b i for all i = 1, 2, . . ., K. When two clusters have the same parameter, they assign the same mean membership value, so that they are indistinguishable.Although φ is not the identity, it produces cost 0 because it only permutes indistinguishable clusters.
We also report the maximum possible cost as a function of the Dirichlet parameters, together with the permutation that gives this cost.This upper bound on the cost provides information on the worst-case misalignment possible given two replicates.Examining the form of Equation ( 12), for fixed (a 1 , a 2 , . . ., a K ), this maximum can be calculated by minimizing The proposition states that the maximal value of C a,φ(a) is attained when φ matches the largest parameter in a to the smallest value in b, the second-largest value in a to the secondsmallest value in b, and so on.
Proof of Proposition 4.2.We make use of the rearrangement inequality (Steele 2004, p. 78) , and write r = σ −1 (i), denoting the rank order of a i in the list (a 1 , a 2 , . . ., a K ), with 1 smallest and K largest.
By the rearrangement inequality, the minimum of With the permutations that produce minimal and maximal cost established, we can examine the effect of the permutation on cost more generally.Figure 4 shows the cost contributed by individuals in each of four populations, for all 24 permutations of four clusters with fixed Dirichlet parameters.With the parameters fixed, the contribution to alignment cost in general increases as more clusters are misaligned.This result can be seen in the fact that permutations with 3 or 4 misaligned clusters tend to lie toward the right side of the figure, which is ordered left to right by increasing cost; permutations with only two misaligned clusters tend to lie near the left side.The maximal cost follows Proposition 4.2: in panels A-C, the highest-cost permutation reverses the order of the mean memberships, as do the four highest-cost permutations with equal cost in panel D.

Multiple Individuals
The theoretical contributions to distance between replicates (Theorem 3.2) and to alignment cost (Corollary 3.4) are both derived as expectations of random variables for a single individual in one population.We provide a simple extension to multiple individuals from multiple populations.
For multiple individuals from multiple populations, we can obtain an expected total contribution to distance between replicates and an expected total contribution to the alignment cost.We treat all individuals in a population as independent and identically distributed draws from the population.The expected mean total contribution of multiple individuals can be calculated by the linearity of expectation.
Proposition 5.1.Consider L populations, in which population has N individuals and membership coefficients a ( ) = (a ( )  1 , a ( ) 2 , . . ., a ( ) K ) that follow a Dir(a ( ) ) distribution, with a k .The total distance between two replicates under label-switching with permutation φ, which maps a ( )  i to a and the total alignment cost is Proof of Proposition 5.1.The proof is trivial.For a population with N individuals and Dirichlet parameters a ( ) , suppose two replicates follow a permutation φ for the second replicate in relation to the first.The membership coefficients of these individuals are independently drawn from the same Dirichlet distribution.Hence, by the linearity of expectation, the expected total contributions to distance and cost are sums across individuals, N A a ( ) ,φ(a ( ) ) and N C a ( ) ,φ(a ( ) ) , respectively.For multiple populations, we simply sum expectations across populations.

Example
We use data from the human-genetic ancestry inference study of Fortier, Kim, and Rosenberg (2020) to illustrate the alignment cost in a practical setting.For the data, we assume that membership coefficients follow the Dirichlet model, whose parameters we then estimate.We then measure empirical alignment costs between pairs of replicates, comparing them to theoretical costs that result from using the estimated parameters of the Dirichlet distribution.

Data
Fortier, Kim, and Rosenberg (2020) conducted clustering using STRUCTURE applied to 978 sampled individuals from L = 53 human populations, with K = 4.They performed analyses using a larger dataset of 791 loci genotyped in the individuals and a less informative smaller subset containing, among the 791 loci, only 13 that are used in forensic genetics.The 53 populations vary in sample size, from 1 to 51 individuals.
For each analysis, 10 clustering replicates were performed, so that we have two sets of 10 replicates from Fortier, Kim, and Rosenberg (2020), each with a 978 × 4 membership coefficient matrix.The individual membership coefficients of all replicates appear in Figure 5(A) (all 791 loci) and Figure 6(A) (13-locus subset).Fortier, Kim, and Rosenberg (2020) summarized these replicates; we show all 10.With all 791 loci, most individuals are placed predominantly in one cluster; with the 13-locus subset, membership is more evenly distributed across clusters.We use the 791-locus analysis as an example of replicates with lower variability across individuals in membership coefficients within populations, and the 13-locus analysis as an example with greater variability, interpreted in this case as more "noise" in membership estimates.

Maximum Likelihood Estimation of Dirichlet Parameters
Consider a membership matrix from population of sample size N .The matrix has size N × K, and it can be written Q ( ) = (q ( ) 1 , q ( ) 2 , . . ., q ( ) N ) T , where q ( ) 1 , q ( ) 2 , . . ., q ( ) N denote membership vectors for the N individuals.If we assume that each of the N vectors represents an independent multivariate draw from an underlying Dirichlet distribution with parameter vector a, then we can obtain a maximum likelihood estimate of a by maximizing log-likelihood L(a).Taking the likelihood as a product of Equation ( 1) across individuals, we have  19) is used for the computation.(D) Relative difference between empirical and theoretical alignment cost for pairs of replicates, evaluated using Equation ( 20).(E) Theoretical alignment costs for all possible permutations of replicate 1 and empirical alignment costs for replicates 2 to 10 in relation to replicate 1 (blue lines).The theoretical computation uses Equation ( 18) and the empirical computation uses Equation ( 17).19) is used for the computation.(D) Relative difference between empirical and theoretical alignment cost for pairs of replicates, evaluated using Equation ( 20).(E) Theoretical alignment costs for all possible permutations of replicate 1 and empirical alignment costs for replicates 2 to 10 in relation to replicate 1 (blue lines).The theoretical computation uses Equation ( 18) and the empirical computation uses Equation ( 17).

L(a) = log
where log q( ) k = 1 N N i=1 log q ik .This objective of maximizing L(a) is equivalent to minimizing −L(a), a convex function in a.The minimization problem has no closed-form solution, but can be solved numerically.We use fixed-point iteration (Minka 2000).The update step in the iteration is where (x) = d log (x)/dx is the digamma function.The algorithm is guaranteed to converge to the maximizing a for the Dirichlet distribution (Minka 2000, sec. 1).
To use the fixed-point iteration method to numerically find the maximum likelihood estimate of a = (a 1 , a 2 , . . ., a K ), we follow the method of Minka (2000, eqs. 19-21) to start the iteration from an initial guess for a; this method relies on empirical computations of the means and variances of the q ( ) ik across individuals i.To obtain the update in Equation ( 16), we apply Newton's method for solving (x) = y, following Minka (2000, Appendix C).

Empirical and Theoretical Alignment Cost Calculations
For both the 791-locus and 13-locus cases, we estimated the Dirichlet parameters for each of the 53 populations and each of the 10 replicates.For the single-individual group, because no variance among individuals is available, we cannot estimate the Dirichlet parameters, and we simply used the membership coefficients of the individual as the parameter estimates.
To examine the performance of the Dirichlet model in measuring alignment costs, we computed empirical and theoretical alignment costs between pairs of replicates.For each pair of replicates, we computed the total empirical alignment cost using Equation ( 2) to obtain the sum of squared differences between their membership matrices.For individual i and cluster k in a population with sample size N individuals, denote by q (R 1 , ) ik and q For the theoretical computation, we first used the inferred permutation between replicate 1 and subsequent replicates, as provided by CLUMPP and reported by Fortier, Kim, and Rosenberg (2020), as the "correct" alignments.We next computed the theoretical contribution to alignment cost for each of the 53 populations based on the inferred pairwise permutation (Equa-tion ( 12)), aggregating the contributions from all populations following Equation ( 14).
More precisely, suppose that for a pair of replicates (R 1 , R 2 ), the inferred Dirichlet parameters for the L populations are {a (R 1 , ) } =1,2,...,L and {a (R 2 , ) } =1,2,...,L .First, we choose R 1 as the "base" replicate, and denote the permutation in replicate R 2 with respect to R 1 as φ R 1 →R 2 .The total theoretical cost in this situation is Next, we use R 2 as the base, and φ R 2 →R 1 is the permutation in replicate R 1 with respect to R 2 .The cost is ) , because the inferred Dirichlet parameters differ for R 1 and R 2 .To account for this asymmetry, we take as the theoretical alignment cost the mean of the two values:

Data Analysis
The greater variability of membership coefficients in the 13locus case compared to the 791-locus case is depicted in  For another assessment of the agreement of theoretical and empirical alignment costs, using replicate 1 as the base, we computed the theoretical alignment cost for all 24 permutations of the K = 4 clusters (Figure 5(E)).If a permutation was observed among replicates 2-10, then its empirical cost with respect to replicate 1 is also shown; if multiple replicates possess the same permutation, then we take their mean cost.This analysis finds that empirical and theoretical costs agree across permutations with a wide range of cost values.
Comparing Figures 6 and 5, Figure 6 reports corresponding quantities for the 13-locus case.The empirical (Figure 6  in the 791-locus case.In Figure 6(A), in which individuals possess high variability within populations, comparing to the lowvariability replicates of Figure 5, it is less easily discerned that an alignment is suboptimal; the lower alignment costs for the high-variability 13-locus case compared to the low-variability 791-locus case reflect this observation.
The agreement between theoretical and empirical alignment costs is reduced for Figure 6 compared to Figure 5, with a substantial difference between the theoretical costs in Figure 6(C) and the empirical costs in Figure 6(B).The relative difference is high in Figure 6(D), and the empirical costs differ from the theoretical costs for many permutations in Figure 6(E).The greater disagreement between theoretical and empirical costs suggests that for the high-variability 13-locus case, the Dirichlet model provides a poorer fit to the replicates than in the lowvariability 791-locus case.

Discussion
We have used a Dirichlet model to study the membership coefficients produced by mixed-membership unsupervised clustering algorithms (Section 2.2).Under the Dirichlet model, using a theoretical measure for the alignment cost between clustering replicates (Equation ( 4)), we have evaluated the alignment cost for a pair of clustering replicates as a function of the model parameters.The model provides tools for use in evaluating clustering replicates, both in analyses of specific datasets and in assessing the performance of clustering algorithms.
Under the model, Corollary 3.4 describes the cost of one replicate in relation to another, making use of the general Theorem 3.2.A replicate with N individuals and K clusters-and hence, NK data entries-is summarized with K parameters, one for each cluster.Theorem 3.2 and Corollary 3.4 provide relatively simple expressions in terms of the K parameter values for each of two replicates.We have evaluated these expressions for the special cases of K = 2 (Section 3.2) and K = 3 (Section 3.3), for which they reduce further.
In analyzing the properties of the theoretical cost as a function of the Dirichlet parameters, we have seen that for a fixed permutation between a pair of replicates, the cost increases as the Dirichlet parameters of the two replicates diverge (Section 4.1).We have also seen that when the Dirichlet parameters are fixed, the cost increases with the number of misaligned clusters (Section 4.2).However, this result depends in part on the specific permutation, as certain permutations might produce lower cost than others with fewer misaligned clusters.When all clusters within the same permutation cycle share common Dirichlet parameters, none of these clusters are "misaligned, " and the cost is zero (Proposition 4.1).We have also described the maximal cost across permutations (Proposition 4.2), potentially enabling cost functions to be normalized by the maximum across permutations.
In an example data analysis, we have found that the Dirichlet model closely fits data with low variability in estimated cluster memberships across individuals within populations (Section 6.4).The fit is not as close for data with high variability in cluster memberships, and hence with more "noise." However, alignment costs are smaller for such cases; in noisy data, the model is poorer but the distinction between properly aligned and misaligned replicates is less consequential.
We envision several applications for the Dirichlet model and its associated results.First, the model can be used to provide summary statistics for replicate mixed-membership cluster analyses.The model would first be used to estimate Dirichlet parameters for replicates.Theoretical alignment costs of permutations of those replicates could then be calculated from the parameters, measuring the cost difference between the optimal permutation and suboptimal permutations.As functions of the estimated parameters, the cost distribution of permutations, the cost difference between optimal and suboptimal permutations, and the maximal cost across permutations can all provide informative summaries.
Such summary statistics could potentially be applied in diagnostics for alignments.The cost difference between the optimal and least suboptimal permutation can measure the extent to which the optimal permutation of a replicate is evident-the "noise" in the replicate-guiding computational decisions for identifying optimal alignments.In particular, if noise is low and the cost for suboptimal permutations is high, then the optimal replicate is likely to be relatively easy to identify, and choices that prioritize speed rather than comprehensive searches in existing alignment algorithms might be suitable.
Next, using the model, clustering alignment methods could adopt heuristic threshold values to decide when to stop the search for a better alignment, or to decide if two replicates represent substantially different modes or merely represent labelswitching (Jakobsson and Rosenberg 2007;Kopelman et al. 2015).Such threshold values could potentially be tuned prior to application of the alignment methods, employing our maximal cost computation.The theoretical alignment cost can thus provide an automated method of choosing threshold values suited to particular datasets, as the theoretical calculation of costs associated with label-switching would be performed in place of more computationally intensive empirical calculations.
Finally, and potentially most significantly, methods based on the model have the potential to contribute to new alignment algorithms.Existing algorithms rely on empirical cost calculations between pairs of replicates.Using the model, however, once the Dirichlet parameters have been estimated, theoretical alignment costs calculated from the estimated parameters potentially reduce computation time.In particular, when it makes sense to treat individual members of predefined populations as identically distributed, the membership coefficients for the N members of a population can be summarized with K parameters in place of N K data points.Alignment can then proceed based on computations involving K theoretical values rather than N K empirical values, reducing computation time compared to use of empirical costs.Notably, as the problem of identifying optimal permutations of replicates can be understood as an example of a general class of assignment problems (Burkard, Dell' Amico, and Martello 2012) seen in combinatorial optimization and operations research, this use of the model can potentially contribute to alignment problems beyond the genetics context.
We note a number of limitations.First, the utility of the Dirichlet model is more limited in cases in which the model provides a poor fit to the data.However, we have seen that the fit can be assessed by comparing empirical and theoretical alignment costs, so that applicability of the model can be assessed for a particular dataset.
A second limitation is that the theoretical cost measure does not consider the scenario in which replicates possess different numbers of clusters.The approach, however, can be extended.Consider two replicates, replicate 1 with K 1 clusters and replicate 2 with K 2 > K 1 clusters.In principle, it is possible to sum membership coefficients in each of K 1 disjoint subsets of the K 2 clusters and to then evaluate alignment cost between the K 1 clusters of replicate 1 and the K 1 subsets for replicate 2. This computation can be performed in principle for each way of distributing the K 2 clusters over K 1 subsets.The Stirling number of the second kind, S 2 (K 2 , K 1 ), counts the partitions of K 2 labeled objects into K 1 unlabeled classes, in such a way that each of the K 1 classes contains at least one of the K 2 objects; the number of scenarios that must be considered is the number of ways of distributing the K 2 labeled clusters over K 1 labeled subsets, or S 2 (K 2 , K 1 ) K 1 !.
A third limitation comes from our choice of distance measure.The derivation of the theoretical alignment cost relied on the squared 2-norm of the difference between membership vectors as the distance between replicates.Various distances have previously been used to compare pairs of replicates (Rosenberg et al. 2002;Jakobsson and Rosenberg 2007;Kopelman et al. 2015;Behr et al. 2016).Other measures, including other pnorms, could be used in Equation (3) to enlarge small differences between vectors (lower p) or to reduce them (higher p); a new cost computation under the Dirichlet model would then be required.
Although our study has been motivated by the setting of unsupervised clustering in population genetics, the Dirichlet model applies to mixed-membership clustering more generally.Hence, our analysis of the model and its performance can contribute to other fields where cluster analysis-and particularly unsupervised cluster analysis-is used.

Appendix C. Proof of Theorem 3.2
The proof entails a calculation of the multiple integral in Equation (3).This integral can be rearranged in a nested way: We make use of the mean and variance of the Dirichlet distribution, so that for (p 1 , p 2 , . . ., p K ) Dirichlet-distributed with parameters (a 1 , a 2 , . . ., a K ) and (q 1 , q 2 , . . ., q K ) Dirichlet-distributed with parameters (b 1 , b 2 , . . ., b K ), with a 0 = K i=1 a i and b 0 = K i=1 b i , we have (Kotz, Balakrishnan, and Johnson 2004, eq. 49.9)Using these results, we have a i (a i + 1) a 0 (a 0 + 1) (C.8) a 0 (a 0 + 1) , i = j (C.10) E[q i q j ] = cov(q i , q j ) + E[q i ]E[q j ] = b i b j b 0 (b 0 + 1) , i = j.(C.11) Because p and q are independent, we also have (C.12) In the integral in Equation (C.1), each term in the sum K−1 i=1 p 2 i + K−1 i=1 q 2 i − K−1 i=1 K−1 j=1 p i q j − K−1 i=1 p i q i + K−2 i=1 K−1 j=i+1 p i p j + K−2 i=1 K−1 j=i+1 q i q j is integrated with respect to the Dirichlet density over two simplices, one for p and one for q.Hence, each term can be integrated by one of Equations (C.2)-(C.12):Equation (C.8) for terms p 2 i , Equation (C.9) for terms q 2 i , Equation (C.12) for terms p i q j (j = i and j = i), Equation (C.10) for terms p i p j (j = i), and Equation (C.11) for terms q i q j (j = i).
The integral becomes:

Figure 1 .
Figure 1.Label-switching and genuine multimodality.(A)-(C) Pairs of replicates.(D)-(F) Optimal permutations of replicate 2 to align with replicate 1. Replicates are simulated using the Dirichlet parameter values in Appendix A. (A) Label-switching only.(B) Label-switching with two independent replicates simulated from the same parameter values.(C) Genuine multimodality.(D) Optimal permutation for (A).(E) Optimal permutation for (B).(F) Optimal permutation for (C).The number of individuals per population is 50.
Corollary 3.4.Suppose the membership coefficients of the individuals in a Dirichlet model follow Dir(a), where a =

Figure 4 .
Figure 4. Alignment cost as a function of permutations.The figure considers an example cluster analysis in which four populations are placed into four clusters; each population has a different cluster in which membership predominates.In each panel, the 24 permutations of the four clusters are ordered by the alignment cost associated with an individual from a specific population.(A) Population 1. (B) Population 2. (C) Population 3. (D) Population 4. Simulation parameters appear in Appendix A. Permutations are labeled with respect to the original clusters (1, 2, 3, 4), representing φ(1, 2, 3, 4) for each of the 24 possible choices of φ.In panel D, clusters 2 and 3 have the same parameter values; we do not count as misaligned clusters that map within a permutation cycle to a cluster with the same parameter value.The number of individuals per population is 100.

Figure 5 .
Figure 5. Application of the Dirichlet model to data from 791 loci.(A) 10 clustering replicates.(B) Empirical alignment cost between pairs of replicates, following Equation (17), divided by the total number of individuals N = L =1 N .(C) Theoretical alignment cost between pairs of replicates, divided by the total number of individuals.The symmetric Equation (19) is used for the computation.(D) Relative difference between empirical and theoretical alignment cost for pairs of replicates, evaluated using Equation (20).(E) Theoretical alignment costs for all possible permutations of replicate 1 and empirical alignment costs for replicates 2 to 10 in relation to replicate 1 (blue lines).The theoretical computation uses Equation (18) and the empirical computation uses Equation (17).

Figure 6 .
Figure 6.Application of the Dirichlet model to data from 13 loci.(A) 10 clustering replicates.(B) Empirical alignment cost between pairs of replicates, following Equation (17), divided by the total number of individuals N = L =1 N .(C) Theoretical alignment cost between pairs of replicates, divided by the total number of individuals.The symmetric Equation (19) is used for the computation.(D) Relative difference between empirical and theoretical alignment cost for pairs of replicates, evaluated using Equation (20).(E) Theoretical alignment costs for all possible permutations of replicate 1 and empirical alignment costs for replicates 2 to 10 in relation to replicate 1 (blue lines).The theoretical computation uses Equation (18) and the empirical computation uses Equation (17).
Figure 7, both on the basis of the empirical variance in membership coefficients (Figure 7(A) and (C)) and using the theoretical variance computed from the estimated Dirichlet parameters (Figure 7(B) and (D)).We interpret the alignment costs in relation to this observation concerning variability in the two cases.For the 791-locus case, averaging across individuals, Figure 5(B) reports pairwise empirical costs between replicates and Figure 5(C) reports theoretical costs.The relative difference between empirical and theoretical costs, computed with respect to the theoretical cost as appears in Figure 5(D).The theoretical cost in Figure 5(C) generally accords with the empirical cost in Figure 5(B).The relative difference in Figure 5(D) is small for most pairs of replicates.
(B)) and theoretical (Figure 6(C)) alignment costs have lower values than

Figure 7 .
Figure 7. Mean and standard deviation of membership coefficients in 53 populations.(A) Empirical, 791 loci.(B) Theoretical, 791 loci.(C) Empirical, 13 loci.(D) Theoretical, 13 loci.Panels A and B consider replicate 1 from Figure 5; panels C and D consider replicate 1 from Figure 6.Standard deviations appear as error bars.Population 34 has only one individual and no empirical standard deviation; a theoretical standard deviation is induced by the choice to equate Dirichlet parameters a with the membership coefficient vector for the individual.
With the four means specified and the variance specified for the membership coefficient of the first cluster, the Dirichlet parameters are uniquely specified.The parameters and the corresponding mean and variance of the Dirichlet distributions appear in TableA.1.

Table A . 1 .
Model parameters for simulating clustering examples in Figures1 and 4.