K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets

The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2.Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.


Introduction
The growth in microbial genome sequence data, driven by decreasing sequencing costs and the integration of sequencing into routine clinical microbiology (Köser et al., 2012;Reuter et al., 2013), has begun to revolutionize our understanding of microbial evolution and spread. However, the pace of sequence accumulation generally exceeds the pace of experimental studies of protein function.
This relationship holds not only for recently emerged pathogens (Cotten et al., 2013;Gire et al., 2014), but also for intensively studied pathogens, such as influenza (Gong & Bloom, 2014;Worobey et al., 2014). Tools to analyse such large datasets and provide targeted guidance in inferring phenotypically meaningful groups can therefore be useful to identify amino acid sites and proteins that play critical roles in pathogen biology and evolution. These sites are potential targets for diagnostics, therapeutics and vaccines.
Large sequence alignments are challenging to dissect using standard tools from phylogenetics and multivariate statistics. When the datasets comprise hundreds to thousands of sequences, trees become increasingly crowded and identifying meaningful information is difficult. In contrast, basic statistical procedures such as principal components analysis, hierarchical clustering or k-means (Hastie et al., 2009) can provide a compressed view into the data with relative ease. However, the use of such unfocused methods for extracting information is problematic when the biologically relevant signals are masked by noise introduced due to sequencing errors or functionally neutral variation. This is the situation for fast-evolving organisms where many changes rapidly accumulate across proteins but only a subset of them actually show signs of selection.
Model-based statistical methods have a clear advantage over the generic approaches when the model is structured to infer biologically relevant information. For microbial proteins one important question is which isolates or strains constitute phenotypically distinct groups, distinguished by specific amino acids fixed by selection. It is also useful to know which positions and amino acids are probably directly under selection. Statistically, these questions correspond to the task of simultaneously clustering a protein sequence alignment in two ways, by the rows to identify the relevant groups of strains and by the columns to identify which amino acid positions define the clusters. As both the number of groups and the relevant sequence positions are often unknown, statistical inference is required. Bayesian modelling is particularly well suited for such model selection problems, as by specifying probabilistic prior information for the unknowns in the model, one can efficiently focus the search and avoid overfitting.
Previous studies (Aguas & Ferguson, 2013;Meroz et al., 2011) have partially solved the above-discussed problem by supervised machine learning techniques. Within this related setting, genetic determinants are identified conditional on a known classification of the sequences. To our knowledge, no statistical machine learning method has yet addressed the problem of identifying most relevant sites and amino acids without knowing a priori how the sequences are grouped.
We introduce here a Bayesian method (K-Pax2) that can handle thousands of sequences with minimal subjective input from the user. Our approach is based on a twoway clustering model inspired by an earlier method (K-Pax) for clustering single protein sequence alignments from distant homologues to identify substructure within a protein superfamily (Marttinen et al., 2006). Our current method possesses two significant improvements over the original K-Pax, one related to accuracy and the other to the technical specifications of the priors and model. These changes permit the method to be used to study a large number of closely related sequences as well as several proteins simultaneously. A useful feature of our model definition is that it enables an analytically obtainable Bayesian score of model fitness. This feature permits the use of parallel computation in model optimization, as the scores are directly comparable from independent optimization runs without approximation errors caused by, for example, Monte Carlo methods.
The haemagglutinin (HA) of influenza A/H3N2 possesses features that make it an ideal test case to demonstrate the function and applicability of K-Pax2 to large alignments. Thousands of A/H3N2 HA sequences are available in public databases (Bao et al., 2008;Benson et al., 2005;Bogner et al., 2006;Squires et al., 2012). In addition, the detailed structure and evolution of HA have been investigated by phylogenetic inference and direct experiments (Bedford et al., 2014;Bizebard et al., 1995;Fleury et al., 1999;Knossow et al., 2002;Koel et al., 2013;Smith et al., 2004;Suzuki, 2006;Wolf et al., 2006).
HA is a homotrimeric integral membrane protein on the surface of the influenza virion and the primary target of the neutralizing immune response against influenza. HA binds sialic acid receptors on the surface of cells and, once bound, promotes viral entry by fusion of the viral envelope with the endosome membrane. The tertiary structure of HA indicates that there are two main domains: a variable globular head (HA1) that contains the sialic acid binding sites and a conserved stalk region (HA2) involved in membrane fusion (Skehel & Wiley, 2000).
Since its introduction in 1968, the A/H3N2 HA has undergone rapid evolution that is associated with short coalescent times, a ladder-like phylogeny and regular antigenic

Impact Statement
Large sequence databases have introduced new opportunities to explore patterns of microbial evolution. This paper introduces the first fast modelbased machine learning method targeted to identify genomic positions that are likely to display nonsynonymous variation due to selection pressure. The method is widely applicable to aid in generation of hypotheses for experimental work and to pinpoint plausible candidates for further study and data acquisition. Results on influenza A/H3N2 highlight the potential to significantly advance the process towards understanding the mechanisms linked to the success of major pathogens.
A. Pessia and others change (Bedford et al., 2014;Fitch et al., 1991;Smith et al., 2004). The HA1 domain is the predominant site of influenza's antigenic evolution. Mutations in exposed epitopes demonstrate strong selective pressure to escape antibodies (Fitch et al., 1991;Suzuki, 2006), and tend to predominate along the trunk of the phylogenetic tree. However, there is also evidence of positive selection at CD4+ and CD8+Tcell epitopes (Suzuki, 2006) and for the addition of Nlinked glycosylation sites (Suzuki, 2011). Here, we use the K-Pax2 method to analyse thousands of influenza A/ H3N2 HA sequences to evaluate the success of the algorithm in identifying amino acid positions known to play key roles in the function of HA.

Theory and Implementation
A two-way clustering model for identifying groups of viral strains under diversifying or directional selection Let S5(s 1 , . . . , s n ) denote a multiple sequence alignment of concatenated amino acid sequences for the coding regions extracted from n virus samples. Each alignment element thus belongs to the alphabet A representing the set of amino acids, including the gap symbol. The length of the aligned sequences is denoted by L. For the purpose of obtaining a model family and an inference algorithm that can efficiently capture signals of diversifying and directional selection from S, we transform the multiple sequence alignment into an n £ LjAj binary matrix, where each column corresponds to an indicator variable of a particular element in A being observed at position l. Prior to any inference, all columns with exclusively zero elements are removed from the analysis because they are uninformative for the statistical model introduced here. The resulting binary matrix X is assumed to be of dimension n|m.
In a set notation, let N ¼ {1; . . . ; n} denote the collection of integer labels for the n virus strains. Let W5{w 1 , . . . , w K } denote an assignment of the n strains into K mutually disjoint non-empty clusters, where w K represents the set of labels of the units associated with cluster k. Formally, the K non-empty subsets w 1 , . . . ,w K define a partition of the sequences such that k <w k ¼ N and w k > w k9 ¼ {B}, ;k -k9. In our model formulation each of the K clusters is assumed to correspond to a group of strains that has evolved under diversifying or directional selection pressure and consequently proliferated given the fitness improvements induced by nonsynonymous changes that are of functional importance at the protein level. The sequence locations of such changes, the number of groups K and the explicit assignment of strains into the groups are all unknown parameters of our model to be inferred from the matrix X.
Non-synonymous changes in viral strains that are free from diversifying selection pressure will fluctuate in frequency in the population due to drift, but they are not in general expected to be rapidly driven to fixation unless they are tightly linked to other sites that are under selection. We assume that the non-synonymous mutations that do not induce fitness changes will occur at a constant rate throughout the population. This can be translated into the statistical approximation that for the n sampled strains, functional neutrality corresponds to a fixed probability of observing a particular residue in a given sequence position across all the K clusters: and for all k51, . . . , K. Thus, from the clustering perspective, any column j ( j51, . . . , m) in X is considered as 'noise' if the above probability is constant across groups. Conversely, we define a column j to represent a putative 'selection signal' if there are at least two groups for which the corresponding probability is different: for all i [ w k , and i9 [ w k9 and for some k ? k9. Such signals are only putative, as random drift could still explain a difference in the residue composition between two clusters. In addition, more rigid probabilistic restrictions must be imposed on the model structure to ensure that the grouping W and the identities of the selected sites become jointly identifiable and convey a biologically meaningful extraction of information from the alignment S. Note that residues that remain unchanged in the whole virus population over long periods of time ostensibly due to strict functional constraints on the protein structure are also uninformative for the purpose of identification of sequence clusters, as they correspond to fully conserved sites in the alignment S.
Under relatively strong selection pressure, non-synonymous changes that are associated with an increase in fitness should rapidly rise in frequency, leading to the formation of a novel group of strains. Similar to the neutral changes considered previously, this assumption can be translated into a statistical approximation that implies that we expect for each cluster k at least a single column j to be present in X such that These residues are defined as 'characteristic' for cluster k and represent significant signals of selection. In summary, a site-amino acid pair (column of X) can then be considered either noise (h51), a weak signal (h52) or a strong signal (h53). It can be further classified as of no particular status (r51) or characteristic (r52) for a cluster.
Column classification can accordingly be encoded by a collection of binary variables z jhrk attaining value 1 if and only if column j ( j51, . . . ,m) has property h (h51, 2, 3) with status r (r51, 2) in cluster k (k51, . . ., K), and attaining value 0 otherwise. Let the array Z represent the collection of binary variables z jhrk over all the index values. The pair (W,Z) then contains all the main parameters of inter-est in our model. However, its full probabilistic characterization requires a set of additional nuisance parameters that are defined below.

Likelihood function
Assuming conditional independence of the elements of X given both the main and the nuisance parameters of the model, we obtain the following expressions: where X k is the binary data matrix associated with cluster k of size n k , and subsequently where x kj is the binary vector for cluster k at column j, while Z jk is a 3 £ 2 binary matrix such that h P r P z jhrk ¼ 1. Defining the columns as statistically independent may be interpreted as a very strong assumption. However, note that their stochastic nature is already addressed through the prior distribution. Concern could arise for phenomena such as hitchhiking, where sites could be genetically linked and thus present at similar frequencies. Such cases can be easily addressed when postprocessing the results from model optimization.
We define the (prior) predictive probability where gðxjhÞ is the Bernoulli distribution and pðhjZ jk Þ is the conjugate Beta distribution for its parameter h, which is explicitly conditioned on the property and status of column j in cluster k. All these Bernoulli parameters are nuisance parameters in the model, as their explicit values are not a target of inference. Hence, in accordance with standard conventions in Bayesian statistics, they are integrated out from the likelihood to obtain the marginal posterior distribution for the parameters of interest. Note that according to this formulation sequences belonging to the same cluster are not statistically independent, whereas sequences belonging to different groups are. For z jhrk ¼ 1, standard Bayesian calculation (Bernardo & Smith, 2000) shows that equation 2 is equal to the ratio of Beta functions where a jhrk and b jhrk are the hyperparameters of the Beta distribution and y kj ¼ x ij is the number of values equal to unity observed in cluster k at column j.
To simplify the notation, we denote the probability in equation 2 as p jhrk . The likelihood function can now be compactly rewritten as Prior distributions Let pðW ; ZÞ ¼ p W ð ÞpðZjW Þ be the joint prior distribution for the partition W and the column classification Z. For computational simplicity, similar to Marttinen et al. (2006Marttinen et al. ( , 2009, we define the prior distribution for W as the uniform distribution for which pðW Þ / 1 There are alternative prior distributions for data partitions that directly penalize an increase in the number of clusters, such as a uniform distribution for the number of clusters K used in the hierBAPS software (Cheng et al., 2013) or the Dirichlet process prior (Jain & Neal, 2007;Neal, 2000). However, because we use a strongly informative prior distribution for the parameters in Z, which penalizes spurious clusters, the uniform prior on W does not lead to problems with overestimation of K, as illustrated for a related clustering model by Marttinen et al. (2009).
To define the conditional prior distribution for Z, we follow a hierarchical approach. Let c5(c 1 , c 2 , c 3 ) T denote our prior probabilities for a column to represent noise, weak signal or strong signal, respectively. Then, c h ¢ 0 and h P c h ¼ 1ðh ¼ 1; 2; 3Þ. Note that these properties are column-specific and they are not affected by any particular partition under consideration. Also, note that the array Z satisfies where Z jh is a 2 £ K binary matrix satisfying equation 5. The matrix Z jh is then modelled by K independent multi-A. Pessia and others nomial distributions where v hr is the prior probability of observing status r when a column has property h. Inserting equation 7 into equation 6, we finally obtain where we used the equality z jh:: z jhrk ¼ Kz jhrk .

Posterior inference
By multiplying the right-hand side of equation 4 and equation 8, we obtain the joint posterior distribution of the main parameters up to a normalizing constant We estimate the pair (W, Z) using the mode of the posterior distribution which is equivalently obtained by maximizing the log posterior while ignoring the constant term: Let W represent the set of all the possible partitions of N and let Z W denote the set of all the possible classifications of the columns (conditional on the underlying partition). The cardinality of the parameter space is easily determined, as W j j is equal to the Bell number B n , whereas Z W j j ¼ ð2 þ 2 K Þ m . For a discrete posterior distribution over a space of such high cardinality and complex topology, it is unlikely that any standard Markov chain Monte Carlo approach would be able to efficiently explore the distribution and estimate the mode using a reasonable amount of computational time. Therefore, we have developed a greedy optimization algorithm for fitting the model to a multiple sequence alignment. An advantage of the analytical tractability of the model is that any two model structures can be compared using the difference in log posterior, and hence estimates from multiple independent parallel or sequential algorithm runs can be ranked in a straightforward manner. Similarly, posterior uncertainty around the mode estimate can be easily numerically summarized, for example using Bayes factors against neighbouring model configurations.
An explanation of how to obtain default values for the prior hyperparameters and a description of the greedy algorithm can be found in Supplementary Text S1.

Data collection
Data collection followed a multi-stage approach. First, 12 295 A/H3N2 HA protein sequences were downloaded from three different search engines: NCBI's Influenza Virus Resource (Bao et al., 2008), GISAID EpiFlu Database (Bogner et al., 2006) and Influenza Research Database (Squires et al., 2012). Our search query consisted of fulllength A/H3N2 HA proteins, collected from human hosts in any country, excluding laboratory strains and mixed subtypes. In the second stage, we scanned the data for duplicates and low-quality reads and, after removing them from the collection, we aligned the data using MUSCLE (Edgar, 2004). After again removing duplicates, the dataset consisted of 4898 unique strains of 567 amino acids. The complete list of accession numbers is given in Table S1.

3-D mapping of characteristic amino acid changes
The amino acid positions that correspond to characteristic amino acid changes were mapped to the crystal structure of the influenza virus HA (PDB ID 1HA0). Structurally relevant mutations occurring between two consecutive clusters are shown as yellow spheres. The resulting sequence of mapping images were rendered in PyMol and the image sequence was then encoded into a video file using MEncoder v.4.8.3 and the H.264 compression format.

Broad overview of K-Pax2 output
To obtain a reliable estimate of the model parameters, we ran the optimization algorithm 100 times from different starting points and chose the solution with the highest posterior probability. The starting points were created by randomly modifying, through merging and splitting operations, a common k-medoids partition (Hastie et al., 2009). The value for k was chosen according to the highest posterior probability score. This procedure generated initial partitions lying in a neighbourhood of the optimal solution and allowed the algorithm to converge in less than 6 h (2.6 GHz processor with 2 GB RAM). The optimal model allocated the 4898 sequences into 57 different groups while simultaneously detecting 117 (out of 567 possible) cluster-defining sites. As a comparison, the adjusted Rand index between our solution and the kmedoids partition with the same number of clusters is 0.824. The two partitions are very similar and their discrepancy is completely explained by a small rearrangement of the units. This result can be interpreted by noting that Kpax2 gives different weights to matrix columns, whereas standard clustering techniques do not make any distinction between noise and signal sites.
To understand the groups' chronologies, we first selected, within each group, only those strains possessing the whole set of characteristic amino acids. We will call these strains the 'consensus sequences' of the cluster, as they represent the molecular variation most relevant for selection. Based on the earliest year in which the consensus sequence was identified, we ordered the groups according to their appearance. Fig. 1 summarizes the temporal distribution within each cluster, showing a clear relationship between cluster associations and sampling time. A similar temporal pattern can be observed by overlaying the clusters on a maximum-likelihood phylogenetic tree (Fig. 2). Because more samples are available from the recent past, we achieve higher resolution clustering of samples from the past several years compared with, for example, samples from 1968 to 1972. Table S2, each virus group is associated with a particular subset of the 117 sites. The cluster-defining amino acids can be interpreted as a fingerprint of the fitness change that did lead to proliferation of the lineage represented by the cluster.

Core evolution of the HA protein
To facilitate comparison with HA evolution, we performed a phylogenetic analysis of the fingerprint amino acid 1968 1973 1980 1990 1996 2001 2005  A. Pessia and others change patterns discovered by the method. There can exist, in particular with densely sampled data from cocirculating groups of strains, multiple clusters of which only one successfully seeds the next cluster. Therefore, we identified a parsimonious 'core' set of groups defined to have the following characteristics. First, their age or time of emergence is determined by the first sampling date of their consensus sequence (as previously defined). Second, a core cluster can have only a single ancestral core cluster but potentially multiple descendant clusters, some of   Identification of cluster-defining positions in protein sequences which may not be core clusters themselves. Third, a core cluster can descend only from an ancestral core cluster that precedes it by at least 1 year. In addition, we assumed that no recombination has occurred.
The above criteria led to the discovery of 23 core clusters among the 57 clusters present in the K-Pax2 output.
We computed the genetic distance between clusters as the average distance between their consensus sequences using the corrected distance proposed by Tamura & Kumar (2002) and the usual p-distance (Nei & Kumar, 2000). Both measures agreed. The tree in Fig. 3 was reconstructed by choosing, for each group, the ancestor associated with the minimum distance. The core clusters can be interpreted as the backbone clades of the A/H3N2 HA phylogeny, connecting the viruses observed in 1968 to the most recent ones. The classical ladder shape of the phylogenetic tree is conserved when only one consensus sequence per core cluster is used (Fig. 4). These cluster transitions closely resemble those reported by Smith et al. (2004) based on a carefully curated set of sequences, which represents less than 10 % of the data analysed here. The evolutionary relationships among all the 57 clusters are shown in Fig. S1. Figure 5 shows how the characteristic sites of the core clusters have evolved over time. This reflects the dominant role of the B-cell epitopes in contrast to T-cell epitopes (Suzuki, 2006). To quantify the distribution of these changes over time, we calculated unadjusted estimates of mutation A. Pessia and others rates in each epitope and elsewhere in HA1 (Table 1). Antigenic drift is thought to occur when an average of four amino acid changes accumulates over time (Koel et al., 2013). Many of the cluster transitions in Fig. 5 agree with this definition, but some carry fewer substitutions, which illustrates the usefulness of more flexible, statistical Rates have been estimated as y=ðltÞ, where y is the total number of amino acid changes, l is the length of the region and t is the time difference in years between two clusters. Independence between sites and homogeneous rates per region are assumed.
Year   Identification of cluster-defining positions in protein sequences model-based rules to pinpoint potential targets for further attention and experimental work. The inferred changes are not uniformly distributed over the five epitopes (chisquared test, x 2 519.665, df54, Pv0.001); instead changes in epitopes B and A are over-represented (in decreasing order), which matches well with current understanding of their relative functional importance (Koel et al., 2013). Figure 6 shows how the core clusters relate to each other in antigenic space, based on haemagglutination inhibition assays (Bedford et al., 2014). Many clusters are clearly distinct from each other, supporting the conclusion that K-Pax2 successfully identifies meaningful phenotypes. The pairs of clusters where an overlap occurs represent core clusters that arise in succession in Fig. 4. This suggests that our method has high sensitivity to detect changes that relate to early antigenic separation of strains, making it potentially also useful for continuous semi-automated screening of novel antigenic types from sequenced strains.
While K-Pax2 can generate hypotheses about which amino acids are under selection simply on the basis of sequence data, the integration of K-Pax2 output and other data can yield additional hypotheses. Video S1 displays the characteristic amino acid changes in core clusters mapped to the 3D structure of HA. The most comprehensive transition occurs in the 1996 group where changes occurred in all five epitopes, shown as a pronounced jump in antigenic space (Fig. 6). Interestingly, sequential changes in core characteristic sites rarely occur in close proximity, even when within the same epitope. This raises the possibility that selection tends to favour alternation across the protein surface, even within a single domain. Such patterns are consistent with the idea that HA evades immunity through sequential mutations that enable escape from different subpopulations (Linderman et al., 2014;Sato et al., 2004).

Conclusions
There is a widening gap between the number of experimentally validated evolutionary mechanisms and the abundance of sequence data. Hence, there is demand for computational tools that can aid in harvesting biologically meaningful signals from data to guide further research.
Using thousands of publicly available HA sequences from A/H3N2 since 1968, we demonstrated that a Bayesian modelling approach can identify patterns of sequence variation that reflect known existing drivers of A/H3N2 evolution. These results suggest the power of K-Pax2 to extract evolutionary signals from microbial sequence collections and to provide a critically needed tool to guide studies of protein function and evolution.
Despite the demonstrated ability of our method to successfully explore sequence variation without imposing an explicit dynamic evolutionary model, there are caveats to be aware of. Like most statistical methods, the modelbased clustering can be affected by sampling biases of various kinds. Highly uneven sampling over space and time will both reduce the power to detect novel variants and inflate the false positive rate of functionally critical residue changes. Furthermore, certain evolutionary processes such as episodic selection can create a pattern that resembles those implied by positive selection, and hence the inferred clusters may lack meaningful interpretation in phenotype space. Furthermore, hitchhiking phenomena due to genetic linkage may confound the identification of the causal variants as characteristic sites.