Phylogenetic Clustering by Linear Integer Programming (PhyCLIP)

Abstract Subspecies nomenclature systems of pathogens are increasingly based on sequence data. The use of phylogenetics to identify and differentiate between clusters of genetically similar pathogens is particularly prevalent in virology from the nomenclature of human papillomaviruses to highly pathogenic avian influenza (HPAI) H5Nx viruses. These nomenclature systems rely on absolute genetic distance thresholds to define the maximum genetic divergence tolerated between viruses designated as closely related. However, the phylogenetic clustering methods used in these nomenclature systems are limited by the arbitrariness of setting intra and intercluster diversity thresholds. The lack of a consensus ground truth to define well-delineated, meaningful phylogenetic subpopulations amplifies the difficulties in identifying an informative distance threshold. Consequently, phylogenetic clustering often becomes an exploratory, ad hoc exercise. Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) was developed to provide a statistically principled phylogenetic clustering framework that negates the need for an arbitrarily defined distance threshold. Using the pairwise patristic distance distributions of an input phylogeny, PhyCLIP parameterizes the intra and intercluster divergence limits as statistical bounds in an integer linear programming model which is subsequently optimized to cluster as many sequences as possible. When applied to the hemagglutinin phylogeny of HPAI H5Nx viruses, PhyCLIP was not only able to recapitulate the current WHO/OIE/FAO H5 nomenclature system but also further delineated informative higher resolution clusters that capture geographically distinct subpopulations of viruses. PhyCLIP is pathogen-agnostic and can be generalized to a wide variety of research questions concerning the identification of biologically informative clusters in pathogen phylogenies. PhyCLIP is freely available at http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019.


Introduction
Advances in high-throughput sequencing technology and computational approaches in molecular epidemiology have seen sequence data increasingly integrated into clinical care, surveillance systems, and epidemiological studies (Gardy and Loman 2017). Based on the growing number of available pathogen sequences genomic epidemiology has yielded a wealth of information on epidemiological and evolutionary questions ranging from transmission dynamics to genotypephenotype correlations. Central to all of these questions is the need for robust and consistent nomenclature systems to describe and partition the genetic diversity of pathogens to meaningfully relate to epidemiological, evolutionary, or ecological processes. Increasingly, nomenclature systems for pathogens below the species level are based on sequence information, supplementing, or even displacing conventional biological properties such as serology or host range (Simmonds et al. 2010;McIntyre et al. 2013). However, existing sequence-based nomenclature frameworks for defining lineages, clades or clusters in pathogen phylogenies are mostly based on arbitrary and inconsistent criteria.
Standardizing the definition of a phylogenetic cluster or lineage across pathogens is complicated by differences in characteristics such as genome organization and maintenance ecology. Cluster definitions vary widely even between studies of the same pathogen, limiting generalization, and interpretation between studies as designated clusters, clades, and/or lineages carry inconsistent information in the larger evolutionary context (Grabowski et al. 1904;Dennis et al. 2014;Hassan et al. 2017).
In virology, nomenclature systems are largely reliant on absolute distance thresholds that define the maximum genetic divergence tolerated between viruses designated as closely related Van Doorslaer et al. 2011;Lauber and Gorbalenya 2012;Donald et al. 2013;Kroneman et al. 2013;Poon et al. 2015Poon et al. , 2016Smith et al. 2015;Valastro et al. 2016). Groups of closely related viruses are inferred to be phylogenetic clusters when the genetic distance between them is lower than the limit set on within-cluster divergence. Nonparametric distance-based clustering approaches have defined the distance between sequences using pairwise genetic distances calculated directly from sequence data (WHO/OIE/FAO H5N1 Evolution Working Group 2008; Aldous et al. 2012;Ragonnet-Cronin et al. 2013) or pairwise patristic distances calculated from inferred phylogenetic trees (Hu e et al. 2004;Prosperi et al. 2011;Poon et al. 2015;Pu et al. 2015;Ortiz and Neuzil 2017). Within-cluster limits on tolerated divergence have been set using mean (WHO/OIE/FAO H5N1 Evolution Working Group 2008), median (Prosperi et al. 2011), or maximum within-cluster pairwise genetic or patristic distance (Ragonnet-Cronin et al. 2013). Some methods incorporate additional criteria, such as the statistical support for subtrees under consideration or minimum/ maximum cluster size (Hu e et al. 2004;Prosperi et al. 2010Prosperi et al. , 2011Ragonnet-Cronin et al. 2013). These genetic distancebased clustering approaches are convenient, as they are rulebased and scalable, allowing for relatively easy nomenclature updates. Arguably, flexibility in the distance thresholds allows researchers to curate clusters based on consistency of the geographic or temporal metadata.
The central limitation of approaches based on pairwise genetic or patristic distance is that thresholds to define meaningful within-and between-cluster diversity are arbitrary. For most pathogens, there is no clear definition of a welldelineated phylogenetic unit to underlie nomenclature designation or suggest what additional information would be informative to delineate subpopulations, for example, information on antigenicity or geography or host range. Resultantly, there is no ground truth to optimize distance thresholds when developing a nomenclature system for most pathogens. Partitioning phylogenetic trees into meaningful subsets is therefore complex and is mostly performed ad hoc through exploratory analyses with uninformative sensitivity analyses across thresholds. As expected, cluster membership is highly sensitive to the threshold applied and therefore results can be unstable across different cluster definitions (Rose et al. 2017).
There is a need for a consistent, automated and robust statistical framework for determining cluster-defining criteria in nomenclature frameworks. Here, we describe a statistically principled phylogenetic clustering approach called Phylogenetic Clustering by Linear Integer Programming (PhyCLIP). PhyCLIP is based on integer linear programming (ILP) optimization, with the objective to assign statistically principled cluster membership to as many sequences as possible. We apply PhyCLIP to the hemagglutinin (HA) phylogeny of the highly pathogenic avian influenza (HPAI) A/goose/ Guangdong/1/1996 (Gs/GD)-like lineage of the H5Nx subtype viruses, which underlies the most prominent nomenclature system for avian influenza viruses and which itself is based on a genetic distance approach (WHO/OIE/FAO H5N1 Evolution Working Group 2008).

New Approach
PhyCLIP requires an input phylogeny and three user-provided parameters: i. Minimum number of sequences (S) that should be considered a cluster. ii. Multiple of deviations ðcÞ from the grand median of the mean pairwise sequence patristic distance that defines the within-cluster divergence limit (WCL) iii. False discovery rate (FDR) to infer that the diversity observed for every combinatorial pair of output clusters is significantly distinct from one another. Figure 1A shows the workflow of PhyCLIP which is further elaborated here. First, PhyCLIP considers the input phylogenetic tree as an ensemble of N monophyletic subtrees (including the root) that could potentially be clustered as a single phylogenetic cluster, each defined by an internal node i subtending a set of sequences L i (fig. 1B, see "Materials and Methods" section). Consequently, as the topological structure of the phylogenetic tree is incorporated in the cluster structure, it is possible to infer the evolutionary trajectory of the output clusters of PhyCLIP if the tree is appropriately rooted. For clarity, we use the term subtree to refer to the set of sequences subtended under the same node that could potentially be clustered and the term cluster to refer to sequences that are clustered by PhyCLIP within the same subtree.
The within-cluster internal diversity of subtree i is measured by its mean pairwise sequence patristic distance (l i ). PhyCLIP calculates the WCL, an upper bound to the internal diversity of a cluster, as: where l is the grand median of the mean pairwise patristic distance distribution l 1 ; l 2 ; . . . ; l i ; . . . ; l N f and r is any robust estimator of scale (e.g., median absolute deviation MAD ð Þ or Qn, see "Materials and Methods" section) that quantifies the statistical dispersion of the mean pairwise patristic distance distribution for the ensemble of N subtrees. In other words, only subtrees with l i WCL will be considered for clustering by PhyCLIP ( fig. 1B).

Distal Dissociation
The assumption that a cluster must be monophyletic can lead to incorrect assignment of cluster membership to undersampled, distantly related outlying sequences that have diverged considerably from the rest of the cluster (e.g., sequence j9 in fig. 1C). These exceedingly distant outlying sequences can also inflate l i of the subtree they subtend, leading to inaccurate overestimation of the internal divergence of the putative Phylogenetic Clustering by Linear Integer Programming . doi:10.1093/molbev/msz053 MBE subtree. Similarly, distantly related descendant subtrees can artificially inflate l i of their ancestral trunk nodes (e.g., nodes o and q in fig. 1C). In turn, historical sequences immediately descending from a trunk node i will not be clustered if its l i exceeds WCL ( fig. 1C).
PhyCLIP dissociates any distal subtrees and/or outlying sequences from their ancestral lineage prior to implementing the ILP model. For any subtree i with l i > WCL, starting from the most distant sequence to i, PhyCLIP applies a leave-oneout strategy dissociating sequences, and the whole descendant subtree if every sequence subtended by it was dissociated, until the recalculated l i without the distantly related sequences falls below WCL. For each subtree, PhyCLIP also tests and dissociates any outlying sequences present. An outlying sequence is defined as any sequence whose patristic distance to the node in question is > 3Â the estimator of scale away from the median sequence patristic distance to the node. l i is recalculated for any node with changes to its sequence membership L i after dissociating these distantly related sequences. These distal dissociation steps effectively offer PhyCLIP greater flexibility in its clustering construct allowing the identification of paraphyletic clusters on top of monophyletic ones that may better reflect the phylogenetic relationships of these sequences.

Integer Linear Programming Optimization
The full formulation of the ILP model is detailed in "Materials and Methods" section. Here, we broadly describe how the optimization algorithm proceeds to delineate the input phylogeny. The primary objective of PhyCLIP is to cluster as many sequences in the phylogeny as possible subject to the following constraints: i. All output clusters must contain !S number of sequences. ii. All output clusters must satisfy l i WCL. iii. The pairwise sequence patristic distance distribution of every combinatorial pair of output clusters must be significantly distinct from the resultant cluster if sequences from the pair of clusters were to combine. This is the intercluster divergence constraint and herein, statistical significance is inferred if the multiple-testing corrected P value for the cluster pair is <FDR (see "Materials and Methods" section). Apart from an appropriately rooted phylogenetic tree, users only need to provide S, c; and FDR as the inputs for PhyCLIP. After determining the within-cluster WCL, PhyCLIP dissociates distantly related subtrees and outlying sequences that inflate the mean patristic distance (l i ) of ancestral subtrees. The ILP model is then implemented and optimized to assign cluster membership to as many sequences as possible. If a prior of cluster membership is given, this is followed by a secondary optimization to retain as much of the prior membership as is statistically supportable within the limits of PhyCLIP. Post-ILP optimization clean-up steps are taken before yielding finalized clustering output. (B) PhyCLIP considers the phylogeny as an ensemble of monophyletic subtrees, each defined by an internal node (circled numbers) subtended by a set of sequences (letters encapsulated within shaded region of the same color as the circled number). In this example, only subtrees with ! 3 sequences (S ¼ 3) are considered for clustering by the ILP model but WCL is determined from l i of all subtrees, including the unshaded subtrees 6-8. Only subtrees where l i WCL are eligible for clustering. (C) Subtrees o and q, as well as sequence j9 are dissociated from subtree i as they are exceedingly distant from i. If sequences j1, j2, j4 and j5 are clustered under subtree h whereas j3 is clustered under subtree i by ILP optimization, a post-ILP clean up step will remove j3 from cluster i. iv. If a descendant subtree satisfies (i)-(iii) for clustering (e.g., subtree 5 in fig. 1B) and so does its ancestor, which also subtends the sequences descending from the descendant, (e.g., subtree 3 in fig. 1B), the leaves subtended by the descendant will be clustered under the descendant node (e.g., sequences E-G will be clustered under cluster 5 in fig. 1B) whereas the direct progeny of the ancestor subtree will cluster amongst themselves (e.g., sequences H and I will be clustered under cluster 3 in fig. 1B).
The ILP model is implemented in a third-party linear programming solver fully integrated within PhyCLIP, to obtain the global optimal solution. At the time of this publication, PhyCLIP supports two-third-party solvers: ( Based on a recent independent benchmark (http://plato. asu.edu/talks/informs2018.pdf; last accessed March 15, 2019), Gurobi outperformed GLPK in both performance and speed (Gurobi solved all 40 Simplex LP test problems whereas GLPK could only solve 31 of them with a geometric mean runtime that was 52 times longer than Gurobi). As such, it is highly recommended that any users with Internet access via an academic domain run PhyCLIP with the Gurobi solver. All clustering results presented in this manuscript were obtained using Gurobi.

Post-ILP Clean-Up
Although distal dissociation prior to ILP optimization works well for dissociating distantly related subtrees and sequences, it is ineffective in identifying spurious singletons such as sequence j3 in figure 1C. Here, in terms of sequence patristic distance, sequence j3 is an outlying sequence to the descendant node h but not so to the ancestral node i. If taxa subtended by subtree h (i.e., j1, j2, j4, and j5) were to be clustered without j3 which itself is clustered under cluster i, PhyCLIP performs a post-ILP optimization clean-up step that removes j3 from output cluster i. This is because j3 is clearly a topologically outlying taxon to i and if unremoved, would imply that sequences clustered under cluster h (i.e., j1, j2, j4, and j5) can belong to cluster i as well.
PhyCLIP also offers the user an optional clean-up step that subsumes subclusters into their parent clusters if sequences in the descendant subcluster are still associated with the parent cluster (i.e., not removed by distal dissociation) and that coalescing with the parent clusters does not lead to violation of the statistical bounds that define the clustering result. This may be useful if the user prefers a relatively more coarsegrained clustering (e.g., nomenclature building). As mentioned earlier, so long as a statistically significant distinction could be made between a descendant subtree and its ancestral lineage, the ILP model enforces the progeny sequences of the descendant subtree to cluster in the descendant cluster. In turn, PhyCLIP is sensitive to the detection of clusters of highly related or identical sequences that minimally satisfies the minimum cluster size (S), as their distributions are statistically distinct from the rest of the population. This sensitivity may lead to over-delineation of the tree and/or multiple nested clusters. Notably, these sensitivity-induced subclusters are not false-positive clusters and meet the same statistical criteria as all other clusters. However, some users may want to subsume these subclusters into parent clusters to facilitate higher level interpretation.

Optimization Criteria
PhyCLIP's user-defined parameters can be calibrated across a range of input values, optimizing the global statistical properties of the clustering results to select an optimal parameter set. The optimization criteria are prioritized by the research question, as the clustering resolution and cluster definition are dependent on the question, and therefore the degree of information required to capture ecological, epidemiological, and/or evolutionary processes of interest. Users may want a high-resolution clustering result, with the phylogenetic tree delineated into a large number of small, high confidence clusters with very low internal divergence, tolerating a higher number of unclustered sequences. Other users may want a more intermediate resolution, with more broadly defined clusters that are still well-separated but encompass the majority of data in the tree (supplementary fig. S1A, Supplementary Material online).
PhyCLIP's optimization criteria are agnostic to the metadata of the data set and include: 1) The grand mean of the pairwise patristic distance distribution and its standard deviation (SD). The grand mean is a measure of the within-cluster divergence and can be optimized to select a clustering configuration with the lowest global internal divergence. 2) The mean of the intercluster distance to all other clusters and its SD. This can be optimized to select a clustering configuration with well-separated clusters.
3) The percentage of sequences clustered, which can be optimized to minimize the number of unclustered sequences. 4) The total number of clusters and 5) mean or median cluster size, which can be optimized to select a tolerable level of stratification of the tree.
The ranges of input parameters considered are also dependent on the characteristics of the data set. The minimum cluster size range considered should be a factor of the size of the phylogenetic tree, whereas the multiple of deviation (c) considered should be a factor of the intra and intercluster distance related to the research question.
Metadata can be incorporated to validate PhyCLIP's optimization. The spatiotemporal structure of phylogenies can inform clustering results if within-cluster variation in metadata such as collection times or geographic origin is used as a post hoc optimization criterion. Within-cluster pairwise geographic distance between the origins of sequences can act as an incomplete ground truth to determine whether a Phylogenetic Clustering by Linear Integer Programming . doi:10.1093/molbev/msz053 MBE clustering result delineates meaningful clusters if there is a reasonable expectation that clusters are defined by spatial factors. The within-cluster deviation in collection dates can also be included as an optimization criterion if clusters are expected to be temporally structured.

Results
To evaluate the utility of PhyCLIP we compared its clustering of the global HPAI H5Nx virus data against the WHO/OIE/ FAO nomenclature (WHO/OIE/FAO HN Evolution Working Gr 2009; Smith et al. 2015). The WHO/OIE/FAO H5 nomenclature has been updated progressively since its development in 2007 as new sequences are added to the global phylogeny including updates in 2009 and 2015. The primary analysis of PhyCLIP's performance was assessed with the full data set of H5N1 HA sequences included in the WHO/OIE/FAO H5 nomenclature update of 2015 (n ¼ 4,357), with comparison with the WHO/OIE/FAO clade designation. PhyCLIP was run with different combinations of the parameters varied over the following ranges: a minimum cluster size of 2-10, a multiple of deviation (c) of 1-3, and an FDR of 0.05, 0.1, 0.15, or 0.2. The optimization criteria were prioritized as follows: 1) percentage of sequences clustered, 2) grand mean of withincluster patristic distance distribution, 3) mean within-cluster geographic distance, and 4) mean of the intercluster distances.
The percentage of sequences clustered was prioritized as the primary optimization criterion to ensure that the maximum number of sequences was assigned a nomenclature identifier. Mean within-cluster geographic distance was included as a post hoc optimization criterion as many avian influenza viruses cluster with high spatial consistency owing to their transmission dynamics in localized avian populations. For influenza viruses endemic to poultry such as H5Nx, this is likely owing to increased local transmission during outbreaks in large poultry populations, as well as the associated sampling biases (Smith et al. 2015). Within-cluster genetic divergence was optimized with higher priority than within-cluster mean geographic distance, as the use of phylogenetic geographic structure as a ground truth for avian influenza viruses is restricted by the long-distance dissemination of related viruses through mechanisms such as the poultry trade or migration of wild birds (WHO/OIE/FAO H5N1 Evolution Working Group 2014; Smith et al. 2015). The within-cluster geographic distance was calculated for each cluster in each clustering result as the mean within-cluster pairwise Vicenty distance in miles.
The temporal consistency of clusters can also be used as optimization criteria for measurably evolving viruses such as Influenza A virus (Drummond et al. 2003). Results ranking the grand mean within-cluster SD in sampling dates of each clustering result as the fourth optimization criterium, with mean of the intercluster distance in fifth, were identical to those only including the aforementioned four optimization criteria.
As PhyCLIP incorporates topological information of the phylogeny into the clustering construct, nonterminal internal nodes with zero branch lengths can lead to erroneous clustering and over-delineation (supplementary fig. S1B, Supplementary Material online). Such internal nodes are usually found in bifurcating trees as representations of polytomies, arising from a lack of phylogenetic signal among the sequences subtended by the node to resolve them into dichotomies. As such, prior to implementing PhyCLIP, all nonterminal, zero branch length nodes in the input phylogenetic trees were collapsed into polytomies, which more accurately depicts the relationship between identical/indiscernible sequences and/or ancestral states. In the H5Nx analysis, all subclusters were subsumed if the statistical requisites of the parent clade were maintained, to aid in easing the interpretation of the nomenclature designation (as discussed in the "New Approach" section).

Influence of the Parameters
The influence of the parameters on PhyCLIP's clustering properties was assessed with the 2015-update H5 phylogeny. Lower multiples of deviation (c) define a more conservative expected range for tolerated within-cluster divergence, informed by the global pairwise patristic distance distribution (supplementary fig. S2, Supplementary Material online). As a result, clusters designated at a c of 1 have the lowest internal divergence, measured by the grand mean of the pairwise patristic distance distribution ( fig. 2C). These clusters are expected to be highly related, with low variation in clustered sequence spatiotemporal metadata ( fig. 2E and F). More conservative ranges of tolerated within-cluster divergence result in a higher clustering resolution with a greater number of clusters, lower mean cluster sizes and a higher percentage of sequences unclustered ( fig. 2A and B). A higher c increases the limit of tolerated within-cluster divergence, resulting in a lower clustering resolution that coalesces smaller clusters into larger, more internally divergent clusters. The collapsing of the smaller clusters decreases the total number of clusters while concurrently increasing the percentage of sequences clustered and mean cluster size. The influence of c is less pronounced for the mean intercluster distance, with no apparent distinction between c ¼ 1 and 2. The total number of clusters decreases approximately linearly as the minimum cluster size ðSÞ increases from two to ten ( fig. 2A). Lower FDRs are more conservative in designating the pairwise patristic distance distributions of two clusters as statistically distinct. A higher or less conservative FDR therefore designates more similar distributions as distinct from one another, increasing the number of clusters ( fig. 2A). The effect of FDR is muted at a higher minimum cluster size or higher c, as these parameters designate larger clusters, which limits the number of clustering configurations available.

Optimal PhyCLIP Clustering Result for HPAI Avian H5 Viruses
For the full phylogeny of Gs/GD-like H5 viruses from the 2015 nomenclature update, the optimal parameter set combined a minimum cluster size of 7, an FDR of 0.15, and a c of 3. The optimal clustering configuration clustered 98% of the sequences into a total of 89 clusters with a median cluster size of 21 Principally, pathogen nomenclature systems should delineate population structure, highlighting the underlying population dynamics that may be informative about the Phylogenetic Clustering by Linear Integer Programming . doi:10.1093/molbev/msz053 MBE evolutionary trajectory of pathogen variants. The distal dissociation approach of PhyCLIP produces a clustering topology where divergent subclusters nest within a larger cluster structure termed a supercluster, as exemplified with WHO/OIE/ FAO clade 2.1x viruses in figure 3. Sufficiently diverse subclusters are dissociated from the ancestral trunk node of a putative cluster. This enables the remaining sequences that meet the statistical criteria to cluster with the ancestral node based on their pairwise patristic distance, as the divergent subcluster is no longer inflating the ancestral node's mean pairwise patristic distance above the within-cluster limit. Cluster A in figure 3 depicts the supercluster topology: the source population viruses (tips in yellow) are annotated as A, and the divergent descendant subclusters are annotated as A.1, A.2, and A.3, respectively. This approach captures sourcesink ecological dynamics: the supercluster acts as a putative source population to its subclusters, reflecting the clear evolutionary divergence and trajectory of descendants of the source population (sub-lineages). The nomenclature system algorithmically imposed on PhyCLIP's clustering for avian influenza is designed to enhance the evolutionary information in the clustering (see "Materials and Methods" section).
PhyCLIP's optimal cluster designation delineated the spatiotemporal structure of the phylogeny at high resolution In addition to source-sink dynamics, distal dissociation also identifies probable outlying sequences, defined as sequences more than three times the estimator of scale away from the median patristic distance to the node. For example, PhyCLIP identifies seven outliers in its delineation of WHO/OIE/FAO clade 2.3.2.1c in the 2015 phylogeny (indicated by red tip-points in fig. 4). These sequences may represent under-sampled populations with unobserved diversity, introductions from otherwise unsampled populations or lower quality sequences introducing error into phylogenetic reconstruction.

Comparison with the WHO/OIE/FAO H5 Nomenclature
The current WHO/OIE/FAO nomenclature system designates 43 different clades and 7 clade-like groupings for the full H5 phylogeny as of the 2015 update (Smith et al. 2015) (supplementary table S2, Supplementary Material online). PhyCLIP recovers the current WHO/OIE/FAO H5 nomenclature with varying degrees of agreement across parameter sets, as measured by the variation of information (VI) between the clustering partitions (supplementary fig. S4, Supplementary Material online). VI is an information theoretic criterion for comparing partitions of the same data set, based on the information lost and gained when moving between partitions (Meil a 2007). A lower VI indicates more similar partitions. Parameter sets with a c of 3 consistently had the lowest VI compared with the WHO/OIE/ FAO system, indicating that the WHO/OIE/FAO nomenclature system has the highest agreement with PhyCLIP clustering results that tolerate higher within-cluster divergence.
In the optimal clustering result, PhyCLIP delineates the spatiotemporal structure of the phylogeny with a higher resolution than the WHO/OIE/  1.3.1.1.2.2.2) compared to the optimal 2015 clustering, with substantial overlap between the source populations identified (100% and 83% for source populations for southeast Asia wave and global wave, respectively).
Changes in the clustering topology between the 2009 and 2015 update phylogenies are expected as the underlying data sets are substantially different. More than 3,000 viruses were added to the tree in the 6 years between nomenclature updates. The Gs/GD-like H5 viruses evolved significantly in the intervening period owing to genetic drift and reassortment. The addition of a large number of divergent viruses to MBE the underlying data set fundamentally alters the ensemble statistical properties of the tree, driving changes in the clustering configuration by changes in the global patristic distance distribution, topology, and statistical power between data sets. As a result, the ecological inferences drawn from the 2015 clustering topology are different from that of the 2009 phylogeny (supplementary table S1, Supplementary Material online).
Primarily, the addition of a set of highly divergent sequences increases the spread of the global pairwise patristic distance distribution (supplementary fig. S2, Supplementary Material online). The within-cluster limit it informs increases concurrently, increasing the tolerance of allowable withincluster divergence. In the distal dissociation approach, increased tolerance of divergence would allow for the incorporation of more distant trunk viruses into supercluster source populations if the enclosed viruses are sufficiently distinct to be dissociated as independent clusters (supplementary fig. S6, Supplementary Material online). If the within-cluster limit is lowered, inclusion of the considered trunk viruses will violate the within-cluster limit. Resultantly, these trunk viruses and their descendants will be assessed for clustering as independent subtrees. Clustering

Comparison of Optimal to Suboptimal Clustering Results
So far, we have focused our interpretation on the optimal PhyCLIP clustering. To ensure that our results were robust across similarly optimal PhyCLIP parameter sets we compared the optimal set against the next four similarly optimal sets. Comparing the top five clustering results ranked by the optimality criterion (in order of greatest number of sequences clustered, lowest internal genetic and geographic divergence, and greatest average between-cluster distance), the clustering result from the optimal parameters set of the 2015 phylogeny was generally consistent with those generated from the four highest-ranked suboptimal parameter sets (see supplementary fig. S8, Supplementary Material online). Each of the top four suboptimal clustering was found to have low VI (0.817-0.984) relative to the optimal clustering, with a large proportion (74.4-82.7%) of viruses clustered in the same corresponding clusters. The supercluster source populations leading to the early 2000 expansion into east and southeast Asia as well as the global expansion in 2005 were similarly found in all suboptimal results.
However, changes to parameter sets fundamentally changed the statistical constraints defining the clustering solution space and in turn, altered the partitions between resultant clusters. Specifically, in this case, where c ¼ 3 in all five optimal/suboptimal parameter sets, varying minimum cluster size not only changed the distribution of putative subtrees for clustering but the distribution of intercluster divergence P values for multiple-testing correction as well. As such, while the global superclusters were largely recapitulated in the suboptimal results, local partitions of co-circulating viruses descending from these supercluster sources, and consequently the inferences of source-sink dynamics, varied amongst the different parameter sets.

PhyCLIP Clustering of the 1996-2018 H5Nx Phylogeny
In recent years the Gs/GD-lineage of H5 viruses has undergone substantial evolution, with viruses from WHO/OIE/FAO clade 2.3.4.4 reassorting with co-circulating viruses to give rise to multiple H5Nx subtypes including H5N2, H5N5, H5N6, and H5N8. We applied PhyCLIP to a phylogeny representing the Gs/GD-lineage up to and including early 2018 to investigate how the global expansion of the H5Nx viruses changes clustering inference (n ¼ 7,898) (supplementary figs. S9 and S10, Supplementary Material online). Applying the same optimization approach described above, the optimal parameter set for the 2018 phylogeny combines a minimum cluster size of 4, a FDR of 0.2, and a c of 3. This parameter set clustered 97% of the viruses into 135 clusters, with a median cluster size of 23 (supplementary fig. S11, Supplementary Material online).
The addition of the H5Nx viruses collected from 2014 to 2018 to the 2015 phylogeny changed the distribution in two ways: 1) it added diversity to the right tail of the distribution, owing to the increased divergence of the H5Nx viruses compared with the H5N1 viruses; 2) it increased the number of putative clusters with low internal divergence, as a large amount of the H5Nx viruses possess highly similar HA genes Phylogenetic Clustering by Linear Integer Programming . doi:10.1093/molbev/msz053 MBE owing to both sampling biases during outbreaks and the relative short circulation time following their emergence. This shift in the distribution reduced the within-cluster limit compared to that of the 2015 data set (supplementary fig. S2, Supplementary Material online).
Filtering the 2015-update and 2018 data sets (see "Materials and Methods" section) resulted in changes in tree topology and overall sequence diversity, and consequently altered the ecological inference of source-sink clusters circulating from 1997 to 2005 (supplementary table S1, Supplementary Material online). However, the ecological inferences of the second major wave of expansion, the post-2005 global expansion characterized by cluster 1. 2.1.1.1.3.2 and its descendants 1.2.1.1.1.3.2.1-8

Benchmarking Against Other Phylogenetic Clustering Tools
PhyCLIP was benchmarked for performance against two open-source nonparametric clustering tools, PhyloPart (Prosperi et al. 2011) andClusterPicker (Ragonnet-Cronin et al. 2013). Both tools require a phylogenetic tree as input, as well as a user-specified distance threshold and minimum statistical node-support level. In addition, both algorithms carry out a depth-first traversal of the tree, considering subtrees as putative clusters if the node support is above the user-defined level. In PhyloPart, the user specifies a percentile of the global pairwise patristic distance distribution as a threshold. If the median of the pairwise patristic distances of the putative cluster is below the percentile threshold, a cluster is designated. ClusterPicker requires a user-defined maximum pairwise genetic distance (calculated as p-distance directly from the sequences) threshold for cluster designation. In both tools, a subtree is designated as a cluster if it meets the respective clustering criteria. If the subtree violates the clustering criteria, the algorithm tests the children of the subtree as potential clusters until a leaf is reached, when no cluster is designated in the path.
In contrast, traversal order has no bearing on the clustering outcomes of PhyCLIP. Although PhyCLIP parses the input phylogeny by level-order, prior to ILP optimization, PhyCLIP dissociates outlying taxa if l i < WCL and proceeds with full distal dissociation heuristics described in the "New Approach" section if otherwise for every internal node i in the input tree. In both cases, tip dissociation is performed by ranking taxa based on their patristic distance to node i (i.e., the common ancestor) without consideration of their topological placement. Finally, all putative subtrees (i.e., tree nodes) after distal dissociation are given equal consideration by ILP optimization to maximally assign cluster membership to all tips (see "New Approach" section). In doing so, not only does PhyCLIP allow for paraphyletic clustering, tree traversal order does not affect clustering results.
Accepted practice for these tools is to incorporate previous knowledge of sequence divergence into a distance threshold or to calibrate the threshold over a tolerable range with metadata or expert consensus. The two methods were applied to the 2009-update phylogeny (n ¼ 1,224 sequences) with thresholds ranging from 0.005 to 0.05 substitutions/site. For PhyloPart, the respective percentile of the global pairwise patristic distance distribution was chosen to match the distance threshold. Required bootstrap support level was set to 0 in both methods to make it comparable with PhyCLIP, which lacks node-support criteria. The optimal threshold was selected by maximization of the mean silhouette index across the clustering partitions (see "Materials and Methods" section). All programs were run on the Ubuntu 16.04 LTS operating system with an Intel Core i7-4790 3.60 GHz CPU.
The optimal thresholds and clustering statistics for each of the methods are reported in supplementary table S4, Supplementary Material online. A direct comparison of cluster inference between PhyCLIP and the other methods is difficult owing to notable differences in cluster definitions, as these methods were largely designed to detect highly related clusters of sequences linked by direct transmission events. The optimal clustering result for ClusterPicker by silhouette maximization had a very low maximum genetic distance threshold at 0.5% (supplementary fig. S12, Supplementary Material online). This resulted in a highly stratified tree with 246 small, highly related clusters and 33.8% outliers, compared with PhyCLIP's 39 clusters and 2% outliers (VI to PhyCLIP of 2.7) (supplementary fig. S13  The nomenclature scheme developed for PhyCLIP was applied to PhyloPart's optimal clustering result for a more meaningful comparison. PhyCLIP's distal dissociation approach allows for the identification of paraphyletic clusters, forming supercluster topologies throughout the tree (as discussed above). Notably, PhyloPart's depth-first approach and monophyletic cluster criteria prevent it from designating paraphyletic clusters, obscuring the suggestive source-sink dynamics of H5N1's expansion wave identified by PhyCLIP's distal dissociation approach (supplementary table S5, Supplementary Material online). Concurrently, PhyloPart is unable to identify hierarchical clusters, which PhyCLIP identifies as divergent trajectories nested in larger clusters (supplementary fig. S13, Supplementary Material online).
PhyCLIP is appreciably more computationally intensive than PhyloPart and ClusterPicker as it not only parses the global pairwise patristic distance distribution of the phylogeny but recursively recalculates the distribution for subtrees in the distal dissociation approach, performs hypothesis testing across every combinatorial pair of subtrees to test their intercluster divergence, as well as optimize the ILP model. To relieve some of the computational cost, PhyCLIP is written in Python 2.7 employing multiprocessing modules to parallelize the computational tasks involved resulting in $3.2Â times speedup with 8 CPU cores relative to a single core run (table 1).
Despite the differences in computation time, the principal advantage of PhyCLIP is its use of the background genetic diversity to inform its within-cluster limit without the need to arbitrarily define it or calibrate it over a range of thresholds. This is especially helpful as there is typically a lack of prior knowledge on meaningful delineation of phylogenetic units for most pathogens to recommend a range of distance thresholds. In addition, PhyCLIP's distal dissociation and outlier detection approaches are capable of identifying informative paraphyletic and hierarchical clusters, unlike the other tools.

Discussion
PhyCLIP provides a statistically principled, phylogenyinformed framework to assign cluster membership to taxa in phylogenetic trees without the introduction of arbitrary distance thresholds for cluster designation. PhyCLIP uses the pairwise patristic distance distribution of the entire tree to inform its limit on within-cluster internal divergence against the background genetic diversity of the population included in the phylogeny. Testing against the global background genetic diversity indicates whether the putative clustered sequences are sufficiently more related to one another than to the rest of the data set to be designated a distinct cluster.
PhyCLIP's cluster assignment is agnostic to metadata but is capable of capturing the geographic and temporal structure of the H5 phylogeny informatively. PhyCLIP recovers the overall structure of the current WHO/OIE/FAOH5 nomenclature developed on a sequence divergence threshold but delineates more informative, higher resolution clusters that capture geographically distinct subpopulations. PhyCLIP therefore plausibly provides the foundation for an alternative nomenclature that minimizes the limitations of currently employed approaches.
PhyCLIP's clustering is expected to improve with the addition of new sequences to the tree as new information about the genetic diversity and evolutionary trajectory of the pathogen becomes known and can be incorporated into the background diversity of the tree that informs the algorithm. In addition, topological information that captures how sequences are related by common ancestors is inherently incorporated in PhyCLIP owing to its distal dissociation approach. The distal dissociation approach also does not assume all clusters are monophyletic as the most recent common ancestor of all tips in a cluster is not assumed to have any descendants. As such, PhyCLIP can identify nested clusters both as clusters with sufficiently high information content to meet the statistical requirements of cluster designation or sufficiently diverse clusters that are dissociated from their ancestral nodes. The designation of divergent descendant clusters nested within a supercluster suggestively captures source-sink population dynamics that may be informative about the evolutionary trajectory of the clustered sequences. At the same time, users could also opt for PhyCLIP to subsume subclusters that do not violate the statistical criteria of the parent clusters into the latter, aiding higher level interpretation. Importantly, the distal dissociation approach also identifies highly divergent outlying sequences that may be indicative of under-sampled diversity.
For pathogens that evolve more rapidly than they spread geographically, it is expected that clusters of related sequences would be temporally structured. However, it is important to consider the distribution of sampling times, which can drive clustering artificially. This is especially pertinent for transmission dynamic studies, where clustering is often driven by heterogeneity in sampling rates across subpopulations rather than heterogeneity in transmission rates (Poon 2016;McCloskey and Poon 2017). PhyCLIP can be applied to timeresolved phylogenies in heterochronous data sets. However, molecular clock analyses make strong biological assumptions and require sufficient temporal signal to inform the model reconstructing the statistical relationship between genetic divergence and time (Rambaut et al. 2016). These models rely on high-quality sampling dates and alignments free of MBE sequence error and laboratory-altered strains or recombinant viruses to reconstruct valid and unbiased time-scaled phylogenies (Rambaut et al. 2016). As PhyCLIP centrally operates on the branch lengths of the phylogeny, we recommend it is only applied to robust time-resolved phylogenies after a thorough investigation of the temporal signal as well as a rigorous assessment of model and prior assumptions (Boskova et al. 2018). PhyCLIP's methodology has limitations. Notably, PhyCLIP is tree-based and is therefore subject to error in phylogenetic reconstruction. PhyCLIP does not include criteria for the statistical support of nodes under consideration, which omits uncertainty in phylogenetic reconstruction. However, high statistical support for a node does not necessarily indicate that all sequences subtended by it are highly related but merely reflects the statistical support of the bipartition to the exclusion of other sequences. In addition, the relationship between the statistical significance of internal nodes and population dynamics is unresolved as is an appropriate definition of a robustly supported node (Zharkikh and Li 1992;Susko 2009;Anisimova et al. 2011;Kumar et al. 2012;Volz et al. 2012). There is often less phylogenetic signal to resolve internal nodes subtending small subtrees in measurably evolving populations, increasing uncertainty in the arrangement of the internal structure of smaller subtrees. If a statistical support threshold is set for nodes, these viruses will consistently be left unclustered or will be forced to coalesce with more ancestral nodes subtending larger clusters, which would violate PhyCLIP's statistical framework.
As with any phylogenetic clustering methods, PhyCLIP is also sensitive to variation in sampling rates (Volz et al. 2012). There is a significant surveillance bias toward certain pathogens (e.g., HPAI H5) owing to their consequences for animal and human health. The evolution and divergence of these pathogens are currently captured in surveillance data as a more accurate approximation to a continuum of evolution. PhyCLIP's clustering is strongly influenced by the diversity in the input population it tests against and will perform best when the background diversity of the phylogeny is complete or representative.
Clusters identified by PhyCLIP should not be interpreted as sequences linked by rapid direct transmission events. Transmission dynamic studies aim to integrate epidemiological clustering with phylogenetic clusters to study transmission chains or local outbreak networks by assuming putative transmission links between highly related sequences (Hassan et al. 2017). Data sets from transmission dynamic studies are likely to be sampled from localized outbreaks over a very specific period of time. The global distribution generated from the resulting phylogenetic trees will not contain sufficient information or power to meaningfully compare subpopulations to identify high confidence transmission clusters.
In conclusion, PhyCLIP provides an automated, statistically principled framework for phylogenetic clustering that can be generalized to research questions concerning the identification of biologically informative clusters in pathogen phylogenies.

Robust Estimator of Scale (Deviation)
PhyCLIP computes the robust estimator of scale (r) either as the MAD or Qn. Note that MAD may not suitably account for any potential skewness of the pairwise sequence patristic distance distribution as it inherently assumes symmetry about the median ( l). On the contrary, Qn, an alternative estimator of scale proposed by Rousseeuw and Croux (1993), is as robust as MAD (i.e., 50% breakdown point), calculated solely using the differences between the values in the distribution without needing a location estimate, and has been proven to be statistically more efficient in both Gaussian and non-Gaussian distributions relative to MAD.

Integer Linear Programming Model
Here, we fully elaborate the ILP model underlying PhyCLIP. Let n 1 ; n 2 ; . . . ; n i ; . . . ; n N be the set of binary variables indicating if subtree i satisfies the conditions for clustering as a clade (n i ¼ 1 if it does and n i ¼ 0 vice versa, fig. 2C). Each sequence j subtended by subtree i is also assigned a binary variable l j;i indicating if the sequence is clustered under subtree i (l j;i ¼ 1 if j is clustered under node i and l j;i ¼ 0 vice versa, fig. 2C). PhyCLIP then formulates the phylogenetic clustering problem as an ILP model with the objective to maximize the number of sequences assigned with cluster membership: subject to the following constraints: Constraint (3) stipulates that sequence j can be clustered under subtree i if and only if subtree i is a potential clade (n i ¼ 1). l j;i 2 À n i À n k 8 j 2 fL i ; L k g; k; i < k : (4) If sequence j is subtended by subtrees i and k, wherein i is ancestral to k and both nodes are potential clusters (n i ¼ n k ¼ 1), constraints (3) and (4) stipulate sequence j will not be clustered under the ancestor node i. Implementing these constraints across all pairwise combinations of subtrees subtending sequence j in turn constrains j to be clustered under the most descendant node k possible. X i l j;i 1 8 j : Constraint (5) stipulates that each sequence can only be clustered under a single subtree, hence abrogating any fuzzy clustering.
Cðn i À 1Þ X j l j;i À S 8 i ; where C is any arbitrarily large positive constant. Constraint (6) requires all clusters to contain at least S number of taxa as defined by the user ( fig. 1B and C). WCL À l i 8 i : Constraint (7) ensures that l i of all clades fall below the stipulated WCL limit.
Cð2 À n i À n k Þ ! q i;k À FDR 8 i; k 6 ¼ i ; where q i;k is the Benjamini-Hochberg corrected p value testing if subtrees i and k are significantly divergent from one and another under the user-defined significance level, FDR. Constraint (8) is the intercluster divergence constraint.
Intercluster divergence between subtrees i and k is tested under the null hypothesis that the pairwise sequence distance distributions of i and k are empirically equivalent to that if the two subtrees were clustered together. This can be done either by the putative Kolmogorov-Smirnov (KS) test or Kuiper's test.
Although both tests are nonparametric, the Kuiper's test statistic incorporates both the greatest positive and negative deviations between the two distributions whereas the KS test statistic is defined only by their maximum difference. As a result, the Kuiper's test becomes equally sensitive to differences to the tails as well as the median of the distributions but the KS test works best when the distributions differ mostly at the median. In other words, the KS test is good at detecting shifts between the distributions but lacks the sensitivity to uncover spreads between the distributions characterized by changes in their tails. Kuiper's test is, however, sensitive to detect both types of changes in distributions.
There are two scenarios under which q i;k may be calculated: (i) Subtree i is ancestral to k. The hypothesis test assumes the null hypothesis that the pairwise sequence patristic distance distribution of subtree k is statistically identical to the pairwise sequence patristic distance distribution of its ancestor i. (ii) Neither subtree i nor k is an ancestor of the other. In this case, two hypothesis tests are carried out comparing the distribution of each subtree to the distribution of pairwise sequence patristic distance should both subtrees be combined as a single cluster and we take the more conservative q i;k ¼ maxfq i; combined ; q k; combined g.

Nomenclature
Traversing the output clusters of PhyCLIP by preorder of the input phylogeny, a unique number is assigned to any cluster with no immediate ancestral supercluster precursor to it (i.e., parent node of the cluster node is not part of any PhyCLIP clusters). Otherwise, the descendant cluster in question is designated as a child cluster should its membership size be >25th percentile of PhyCLIP's output cluster size distribution (i.e., for having proliferated in numbers substantial enough to be deemed a progeny cluster). Every child cluster of a supercluster is assigned a progeny number separated by a decimal point (e.g., 1.2 refers to the second child cluster of supercluster 1). However, descendant clusters that fall below the cluster size cut-off are distinguished from child clusters as nested clusters, each assigned an address in the form of a parenthesized letter, alphabetized by tree traversal order, prefixed by its parent supercluster nomenclature (e.g., 1.1 [c] refers to the third nested cluster of supercluster 1.1). Nested clusters in superclusters fundamentally have different properties from the sensitivity-induced nested clusters discussed in "New Approach" section and cannot be subsumed as it will violate the within-cluster limit of the parent supercluster. The structure of the resultant clustering topology is highlighted in figure 3.

Phylogenetic Analyses
PhyCLIP's performance was evaluated on an empirical data set. The sequence data sets used to construct the HA gene phylogenetic trees underlying the WHO/OIE/FAO nomenclature for the A/goose/Guangdong/1/1996 (Gs/GD/96)-like H5 avian influenza viruses were downloaded from GISAID (WHO/OIE/FAO H5N1 Evolution Working Group 2008; WHO/OIE/FAO H5N1 Evolution Working Group 2012; WHO/OIE/FAO H5N1 Evolution Working Group 2014; Smith et al. 2015). The primary analysis is based on the full data set included in the 2009 (n ¼ 1,224) and 2015 (n ¼ 4,357) nomenclature updates. Viruses that were inconsistently included across WHO/OIE/FAO updates were followed up and included (WHO/OIE/FAO HN Evolution Working Group 2009; Smith et al. 2015). Sequences were curated based on criteria defined by the H5 nomenclature: sequences with more than 5 ambiguous nucleotides, with a sequence length shorter than 60% of the alignment, or with frameshifts or duplicated by name were removed. For the 2018 phylogeny, all avian and human viruses from the Gs/GD-like H5 lineage were downloaded from GISAID up to April 2018, including H5Nx subtypes H5N2, H5N3, H5N5, H5N6, and H5N8. An alternative filtering approach compared to the published WHO nomenclature approach was applied to ensure a data set of high-quality sequences that would be robust to error in phylogenetic reconstruction as PhyCLIP is inherently sensitive to topological information. In this approach, duplicate sequences and sequences with a length below 95% of the full HA sequence or more than 1% ambiguous nucleotides were discarded. Sequences were aligned with MAFFT v7.397 and trimmed to the start of the mature protein (Katoh et al. 2002). Each sequence set was annotated with the WHO/OIE/ FAOH5 nomenclature using LABEL(v0.5.2), and the version of the module corresponding to the nomenclature update of the data set (e.g., H5v2015 module for the full tree from the nomenclature update in 2015) (Shepard et al. 2014). Maximum likelihood phylogenetic trees were constructed for each data set with RAxML 8.2.12 under the GTRþGAMMA substitution model, and rooted to Gs/GD/ 96 (Stamatakis 2014). Phylogenetic trees were visualized using Figtree (http://tree.bio.ed.ac.uk/software/figtree/; last accessed March 15, 2019) and ggtree (Yu et al. 2017).

Silhouette Index
The silhouette index is based on the distance, here patristic distance, of each cluster member to other cluster members compared with the distance to its nearest neighbors (Rousseeuw 1987). Silhouette values approaching one Phylogenetic Clustering by Linear Integer Programming . doi:10.1093/molbev/msz053 MBE indicate that the cluster member is correctly assigned, whereas values close to zero indicate that the sequence is equally matched to its neighboring cluster. A negative Silhouette index indicates that the sequence is more closely related to the neighboring cluster than to its fellow cluster members. Calculation of the silhouette index was performed in R (R Core Team 2016).

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.