FSTruct: An F ST‐based tool for measuring ancestry variation in inference of population structure

Abstract In model‐based inference of population structure from individual‐level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Distinct patterns of variability in membership coefficients can be produced for different groups of individuals, for example, representing different predefined populations, sampling sites or time periods. Such variability can be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across predefined groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector and does not depend on its mean. We apply the approach, which makes use of a normalized F ST statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more predefined groups in their level of membership coefficient variability. Our methods are implemented in the r package FSTruct.


| INTRODUC TI ON
In the past two decades, computational methods for inference of population structure from individual-level genetic data have contributed a rich and informative set of approaches for the analysis of genetic variation. Model-based clustering methods such as admixture (Alexander et al., 2009;Alexander & Lange, 2011), baps (Corander et al., 2004(Corander et al., , 2008 and structure (Falush et al., 2003(Falush et al., , 2007Hubisz et al., 2009;Pritchard et al., 2000) are now routinely used to generate insights into population structure and evolutionary history in be difficult to capture in a single numerical value; membership coefficient vectors are multivariate and potentially incommensurable across predefined groups, as the number of clusters over which individuals are distributed can vary among groups of interest. Further, two groups might share few clusters in common, so that membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector and does not depend on its mean. We apply the approach, which makes use of a normalized F ST statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more predefined groups in their level of membership coefficient variability. Our methods are implemented in the r package FSTruct.

K E Y W O R D S
F ST , admixture, population structure diverse species of interest in ecology, evolution, conservation biology and agriculture (Guillot & Orlando, 2017).
In model-based inference of population structure, individuals are clustered based on their multilocus genotypes into a series of statistical clusters, such that each individual possesses a membership coefficient for each cluster. Each membership coefficient represents the proportion of an individual's ancestry that is derived from the associated cluster. Interpreting the membership coefficients of individuals from various predefined populations, sampling sites or other groups of biological interest can illuminate patterns of genetic variation and population structure. Researchers often investigate variability of membership patterns within predefined groups, as well as similarities and differences in the membership patterns of distinct groups.
One type of comparison that is frequently of interest is an assessment of relative levels of variation in membership coefficients among the individuals belonging to two or more predefined groups.
This type of comparison arises in many contexts, such as when exploring differences in membership variability between admixed and nonadmixed populations, between populations from different time periods or between different types of data from the same sampled individuals.
For example, in a study of ancient human DNA samples dating over a period of thousands of years, Antonio et al. (2019) sought to examine whether the population of Rome possessed greater diversity in ancestry during certain periods of the Roman Empire. They estimated membership coefficients using admixture and interpreted the inferred coefficients to claim that during the Imperial Rome period, when the Roman Empire was at its peak, ancestry was more variable than during earlier periods, when Rome was more isolated ( Figure 1 of Antonio et al., 2019).
Interpretations of inferred membership coefficients to make relative claims about membership variability have generally relied on visual assessment of population structure diagrams rather than on statistical hypothesis testing. In particular, as in Antonio et al. (2019), researchers seeking to quantify variability in membership coefficients across individuals or to compare this variability between two or more groups often do so visually or informally.
Here, we introduce a statistical method to measure variability in membership coefficients inferred by model-based clustering and to compare this variability across populations. We apply the method to examples from real and simulated data. The method is implemented in the r package FSTruct.

| Overview
The output of population structure inference software programs such as structure and admixture is a representation of individual membership coefficients in matrix form. The matrix, often denoted Q and termed a 'Q matrix', has I rows, corresponding to I individuals, and K columns, corresponding to the total number of clusters ( Figure 1b). The entry in row i and column k, q (i) k , represents the membership coefficient of individual i in cluster k: the proportion of the ancestry of individual i that is assigned to cluster k. Each row sums to 1, or ∑ K k=1 q (i) k = 1 for each i. We seek to compute a measure of variability among ancestry vectors for individuals: among rows of Q. We wish for the measure to be comparable across different data sets, possibly representing different samples. This problem is complicated by the fact that different Q matrices might include different numbers of clusters; furthermore, column entries for some clusters might vary greatly across individuals, while other columns are more uniform.
We approach the problem by modifying the population differentiation statistic F ST to fit this ancestry scenario. F ST measures allele frequency variability among subpopulations, and it is computed using a set of allele frequency vectors that each sum to 1. This setting is mathematically analogous to Q matrices, in which vectors of membership coefficients for each individual sum to 1. In the analogy, each individual represents a 'population', and its cluster membership is analogous to an 'allele frequency' (Figure 1).
calculation among population vectors of allele frequencies calculation among individual vectors of membership coefficients Compute among rows et al., 2013). The constrained maximum is relatively low when I, the number of individuals in a Q matrix, is small (analogous to a small number of populations), or when M, the mean membership of the highest-membership ancestry cluster, is close to its minimum, 1 K , or its maximum, 1 (analogous to an extreme value for the frequency of the most frequent allele). Denoting this maximum F max ST , we normalize F ST by its maximum, using the ratio F ST ∕ F max ST as a measure of variability that is comparable across Q matrices of different size. This measure ranges between 0 and 1, equalling 0 when members of a population have identical membership and equalling 1 when vectors of membership coefficients are maximally variable.

| The
Consider a scenario with I subpopulations and K distinct alleles.
Allele k has frequency q (i) The total heterozygosity H T is the expected frequency of heterozygotes under Hardy-Weinberg equilibrium in a population whose allele frequencies equal the mean allele frequencies across subpopulations: The quantity 1 gives the mean frequency of allele k across subpopulations.
With the total population assumed to be polymorphic so that H T > 0, for the setting of I subpopulations and K alleles, with K possibly arbitrarily large, Alcala and Rosenberg (2022)  In the language of our analogy, I is the number of individuals-the number of rows in the Q matrix; M is the sample mean membership coefficient for the most frequent ancestral cluster across all I individuals; and 1 = IM is the largest entry in the vector that sums column entries of the Q matrix across rows. The latter case of Equation 1, with 1 < 1 < I, is generally more relevant in the setting of population clustering, as I is typically larger than K, so that 1 > I K > 1.
The ratio F ST ∕ F max ST , which represents a normalized measure of variability that can be compared among different groups of individuals with different values of I or K, or both, ranges between 0 and 1, taking a value of 0 when all individuals in a group have identical membership coefficients. It has a value of 1 when they are as variable as possible given M. Alcala and Rosenberg (2022) showed that for 0 < 1 ≤ 1, the maximum is realized when each ancestry cluster is found in only a single individual and each individual has exactly J ancestry clusters with coefficients greater than zero: J − 1 clusters with coefficients of σ 1 , one cluster with a coefficient of 1 − (J − 1) 1 ≤ 1 and all others with coefficients of 0. Note that in the scenario 0 < 1 ≤ 1, the number of clusters K is larger than the number of individuals I; at the maximum, multiple clusters are tied with the same mean membership coefficient M.
For 1 < 1 < I, the maximum is realized when only the ancestry cluster of greatest membership is shared among individuals, and at most a single individual contains ancestry from multiple sources.
More formally, this scenario occurs when ⌊ 1 ⌋ individuals possess all of their membership in the cluster of greatest membership (i.e. q (i) 1 = 1 for these individuals), a single individual has membership coefficient {σ 1 } for the cluster of greatest membership and coefficient

| Statistical test to compare values of F ST ∕F max ST
In applications, we may wish not only to compute F ST ∕ F max ST for a single population but also to compare this ratio between two or more populations using a statistical test. We accomplish this task by bootstrap resampling of rows to generate replicate Q matrices for each population. We then compute the F ST ∕ F max ST statistic for each of these replicate matrices. This process generates a bootstrap distribution of the statistic for each population. We then use a Wilcoxon rank-sum test to determine whether pairs of bootstrap distributions of the statistic for different sets of individuals are significantly different; we use a Kruskal-Wallis test to compare three or more sets of individuals.

| Software availability
We have implemented our method in the r package FSTruct (pronounced 'F-struct'), which is available for download from github. com/Maike Morri son/FSTruct. This package includes functions that compute F ST ∕ F max ST from a Q matrix such as those produced by admixture or structure, generate bootstrap samples and distributions for arbitrarily many Q matrices and visualize Q matrices.

| Dirichlet model
To illustrate our method, we used individual membership coefficient vectors drawn from a Dirichlet distribution (Kotz et al., 2000).
This distribution is suited for use as the underlying model for finite vectors of nonnegative numbers q 1 , q 2 , … , q K that sum to one, ∑ K k=1 q k = 1, and it has appeared in previous studies of membership coefficient vectors (Huelsenbeck & Andolfatto, 2007;Pritchard et al., 2000).
We treat individual membership coefficient vectors in a population as following a Dirichlet distribution with parameter vec- We denote this distribution by Dir 1 , 2 , … , K . Here, λ is a vector of length K whose elements determine the parametric mean membership coefficient for each ancestral cluster. The value of α controls the variance of q k , the individual membership coefficient in cluster k: Var q k = k 1 − k ∕ ( + 1) . Thus, an increase in α lowers the variances of membership coefficients.
To generate a random Q matrix with I individuals and K ancestry clusters, we draw I independent and identically distributed Dir 1 , 2 , … , K vectors, q 1 , … , q K , which each comprise a set of membership coefficients for a single individual. Each vector is a row of the simulated Q matrix and is a draw from a Dirichlet distribution with mean membership coefficients 1 , 2 , … , K .
Variability of membership coefficients across individuals is controlled by α. Hence, we proceed by (1) using the Dirichlet distribution to simulate Q matrices with specified parametric membership coefficient means and variances, (2) computing F ST ∕ F max ST for each Q matrix and (3) examining the relationship between the value of F ST ∕ F max ST for each Q matrix and the parametric variance of the Dirichlet distribution used to simulate it.

| Dirichlet simulations
To investigate the behaviour of F ST ∕ F max ST in relation to a measure of variability in membership coefficients, we used the Dirichlet distribution to simulate Q matrices with known variability. We simulated Q matrices with I = 50 individuals and K = 2 clusters. Each simulation replicate thus drew I = 50 ancestry vectors from a Dir 1 , 2 distribution.
We fixed 1 , 2 = 2 3 , 1 3 , so that membership in cluster 1 has parametric mean 2 3 across individuals in a population and membership in cluster 2 has parametric mean 1 3 . The parametric variance of the membership coefficient for a specific cluster, across sampled individuals, then equals 2 = Var q 1 = Var q 2 = 2 3 × 1 3 ∕ ( + 1); both coefficients have the same variance. As α ranges in (0, ∞ ), the variance ranges in 0, 2 9 . We performed 500 replicate simulations of samples of 50 indi-  Figure 4 and found that nearly all pairs of distributions were significantly different (Wilcoxon rank-sum tests, p < 10 −6 ). Interestingly, the only two pairs that were not significantly different were pairs with the same α but different means: the left-hand α 1 distribution with mean 2 3 , 1 3 in Figure 4b and the right-hand α 1 distribution with mean 9 10 , 1 10 in Figure 4c (Wilcoxon rank-sum test, p = .270), and the left-hand α 3 distribution with mean We confirm in Figure S1 that the variability in the mean membership coefficients q 1 , q 2 of simulated Q matrices increases as the α

| Data examples
To illustrate the application of FSTruct, we apply the method to data examples that represent each of three distinct scenarios in which ancestry variability is of interest: (1) ancestry comparisons of admixed F I G U R E 4 Dependence of bootstrap distributions of F ST ∕ F max ST for simulated Q matrices on the Dirichlet variance parameter α, rather than the Dirichlet mean λ. (a, d) Q matrices simulated using specified Dir(αλ) distributions. (b, c) Bootstrap distributions of F ST ∕ F max ST for Q matrices from (a) and (d), plotted directly below or above the corresponding matrix. In both (a) and (d), eight matrices were simulated, two for each of four values of α selected to span the range of parametric variances: 1 = 21901 ∕ 99, 2 = 341 ∕ 99, 3 = 101 ∕ 99 and 4 = 1 ∕ 99. Matrices are annotated by associated parametric variances 2 = 1 2 ∕ ( + 1). In (a), matrices are simulated with parametric mean = 2 3 , 1 3 and are taken from matrices plotted in Figure 3. In (d), matrices are simulated with a more extreme parametric mean, = 9 10 , 1 10 . Each vertical bar represents an individual membership coefficient vector (q 1 , q 2 ); the proportion of each bar coloured a darker shade represents q 1 and the proportion in a lighter shade corresponds to q 2 . The parametric variance of a Q matrix, 2 = 1 2 ∕ ( + 1), ranges in 0, 2 9 for = 2 3 , 1 3 and in (0, 0.09) for = 9 10 , 1 10 . The empirical variance s 2 is computed for each matrix using the sample mean q = Rome and Late Antiquity periods. They lend increased granularity to these claims, suggesting that ancestry variability was steadily increasing during these three periods, with a maximum achieved during Late Antiquity. To explore patterns of ancestry variability in data sets of different size, we evaluated F ST ∕ F max ST using results from a structure analysis conducted by Algee-Hewitt et al. (2016). This study focused on 13 tetranucleotide loci commonly used for individual identification in forensic applications, the 'codis loci'. In a worldwide human sample, the study compared analyses with the codis loci to analyses with a larger set of 779 noncodis loci and to analyses with sets of 13 noncodis tetranucleotide loci. The study claimed that the codis loci have similar ancestry information to sets of 13 non-codis tetranucleotide loci.

| DISCUSS ION
We have introduced a measure for quantifying variability across vectors of individual membership coefficients, as produced by population structure inference programs such as structure and admixture.
Our measure is based on a mathematical analogy with the population  The new measure, which we have implemented in the r package FSTruct, contributes to a body of methods for quantitative analysis of inferred membership coefficients. This collection of methods includes computations useful for analysing the level of support observed for different numbers of clusters K (Alexander & Lange, 2011;Evanno et al., 2005) and methods of aligning the clustering solutions observed in replicate analyses (Behr et al., 2016;Jakobsson & Rosenberg, 2007;Kopelman et al., 2015), as well as software for graphical display (Ramasamy et al., 2014;Rosenberg, 2004) and for managing files and workflows associated with the analysis (Earl & VonHoldt, 2012;Francis, 2017). We note that in addition to analysing the Q-matrices produced by population structure inference programs such as structure and admixture, FSTruct can quantify variability in any matrix whose rows sum to 1. Applications are potentially numerous. where the proportion of mutations belonging to a mutational type is analogous to a cluster membership (Alexandrov et al., 2013;Rahbari et al., 2016).

AUTH O R CO NTR I B UTI O N S
MLM, NA and NAR designed the study and performed the theoretical analysis. MLM conducted the simulations, analysed the data and wrote the software. NAR supervised the study. All authors wrote the manuscript.

ACK N OWLED G EM ENTS
We acknowledge support from NIH grant R01 HG005855 and NSF This assumption is equivalent to setting F p. 354, exercise 2), so the sum of two terms that converge almost surely to 0 also converges almost surely to 0. Hence, it suffices to separately prove almost sure convergence to 0 of the two terms summed in Equation A7.

U N D E R TH E D I R I CH LE T M O D EL
Recalling that q (i) 2 = 1 − q (i) 1 , we observe that the integrand q (i) 1 1 −1 q (i) 2 2 −1 ∕ B 1 , 2 is simply the Beta probability density function, which integrates to one. Hence, the product evaluates to 1 and L * simply equals a constant: We next evaluate the L i terms. For each i in 1, 2, … , I, As was the case for L * , the integrand of the integral inside the product is the Beta probability density function, so the product evaluates to one. Thus, The remaining integral can be evaluated by noting that ∫ 1 0 x a−1 (1 − x) b−1 dx = B(a, b). We employ this identity to simplify L i , obtaining By Equation A2 and the property of gamma functions Γ(z + 1) = zΓ(z),

this expression simplifies to
We now combine Equations A9 and A10 to complete the calculation in Equation A8, noting that L i does not depend on i, so that each L i follows Equation A10. (A6) (A9) L * = − I 2 1 I 1 1 − 1 = − 1 1 − 1 .