Detecting negative selection on recurrent mutations using gene genealogy

Background: Whether or not a mutant allele in a population is under selection is an important issue in population genetics, and various neutrality tests have been devised to detect selection. However, detecting negative selection has been notoriously difficult, partly because negatively selected alleles are usually rare in the population and have little impact on either population dynamics or the shape of the gene genealogy. Recently, through studies of genetic disorders and genome-wide analyses, many structural variations were shown to occur recurrently in the population. Such "recurrent mutations" might be revealed as deleterious by exploiting the signal of negative selection in the gene genealogy, enhanced by their recurrence.

Results: Motivated by this idea, we devised two new test statistics. One is the total number of mutants at a recurrently mutating locus among sampled sequences, tested conditionally on the number of forward mutations mapped on the sequence genealogy. The other is the size of the most common class of identical-by-descent mutants in the sample, again tested conditionally on the number of forward mutations mapped on the sequence genealogy. To examine the performance of these two tests, we simulated recurrently mutated loci, each flanked by sites with neutral single nucleotide polymorphisms (SNPs), with no recombination. Using neutral recurrent mutations as null models, we attempted to detect deleterious recurrent mutations. Our analyses demonstrated the high power of our new tests under constant population size, as well as their moderate power to detect selection in expanding populations. We also devised a new maximum parsimony algorithm that, given the states of the sampled sequences at a recurrently mutating locus and an incompletely resolved genealogy, enumerates mutation histories with a minimum number of mutations, while partially resolving genealogical relationships when necessary.

Conclusions: With their considerably high power to detect negative selection, our new neutrality tests may open new avenues for dealing with the population genetics of recurrent mutations, as well as help identify some types of genetic disorders that may have escaped identification by currently existing methods.

Other types of penalty matrices (or schemes) will be implemented in the future.
In the following subsections, we will describe the algorithm: first its overall structure, and then the details.

Overall workflow of the new parsimony algorithm
Supplementary figure SS3 shows schematic images of the broad processes involved in the new parsimony algorithm. The input data consist of the states of the sampled sequences at a focal site/locus and an incompletely resolved genealogy of the sequences (panel A of SS3). The algorithm then consists of two broad procedures. The first proceeds from the bottom up, and calculates the minimum possible penalties for the sub-tree under each interior node, conditional on each possible state at the node (nearby mini table), while inserting additional nodes (red-rimmed circles) and branches (red segments) if necessary (panel B of SS3). The second proceeds from the top down, choosing the states (and accompanying additional branches) that give the minimum penalty at each node, and finally enumerates all possible mutation scenarios (red lightning bolts) providing the minimum total penalty (panel C of SS3).
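The two-pass structure described above is an extension of Sankoff's dynamic programming. As a minimal sketch of that underlying idea on an already resolved toy tree (this is an illustration, not the authors' Perl implementation; the tree, the state set, and the unit JC-type penalty are assumptions made here for concreteness):

```python
# Minimal two-pass parsimony sketch (illustrative only).
# Bottom-up: for each node and each state, tabulate the minimum penalty of the
# sub-tree under the node. Top-down: pick minimum-penalty states per node.

STATES = "ACGT"
INF = float("inf")

def penalty(a, b):
    # JC-type penalty: any change costs 1, no change costs 0
    return 0 if a == b else 1

def bottom_up(node, tips, children, table_of):
    """Fill table_of[node] = {state: min penalty of the sub-tree under node}."""
    if node in tips:  # exterior node (tip): its observed state is fixed
        table_of[node] = {s: (0 if s == tips[node] else INF) for s in STATES}
        return
    for c in children[node]:
        bottom_up(c, tips, children, table_of)
    table_of[node] = {
        s: sum(min(table_of[c][x] + penalty(s, x) for x in STATES)
               for c in children[node])
        for s in STATES
    }

def top_down(node, state, children, table_of, assignment):
    """Trace back one minimum-penalty state assignment (one scenario)."""
    assignment[node] = state
    for c in children.get(node, []):
        t = table_of[c]
        best = min(t[x] + penalty(state, x) for x in STATES)
        pick = next(x for x in STATES if t[x] + penalty(state, x) == best)
        top_down(c, pick, children, table_of, assignment)

# Toy genealogy: root R -> internal node X and tip c; X -> tips a and b.
children = {"R": ["X", "c"], "X": ["a", "b"]}
tips = {"a": "A", "b": "A", "c": "G"}
table_of, assignment = {}, {}
bottom_up("R", tips, children, table_of)
min_pen = min(table_of["R"].values())  # minimum total number of mutations
root_state = min(STATES, key=lambda s: table_of["R"][s])
top_down("R", root_state, children, table_of, assignment)
```

In this toy example a single mutation (on the branch leading to tip c) suffices, so `min_pen` is 1. The new algorithm departs from this classic scheme at multi-furcated nodes, where additional nodes and branches may be inserted.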

(I) Bottom-up part of the new parsimony algorithm
The first large block of procedures starts from the exterior nodes (i.e., tips), and goes up step by step until it reaches the top node (i.e., the root). At each node, the minimum possible penalty of the sub-tree under it is calculated, conditional on each possible state at the node. [Although there could be numerous other topologies connecting mediating nodes that provide the same minimum penalty, dwelling on them is unproductive without other information. We employ the 'least resolved' topology providing the minimum penalty because it is the most flexible.] When the penalty matrix is not of JC-type, we will have to find the least-penalized topology connecting new mediating nodes under the node n. An efficient algorithm to achieve this job is now under development. Hereafter, we will thus focus on the parsimonious scenarios under the JC-type penalty.
In general, the child nodes do not necessarily all have their own unique character states. In such cases, there may or may not be a unique parsimonious scenario, depending on the situation. Although parsimonious scenarios could be enumerated by exhaustively examining all the combinations of possible states at the child nodes n′ (∈ Ch(n)), this could be prohibitively time-consuming. For example, if there are ten child nodes and four alternative character states for each node, we would have to examine 4^10 ≈ 10^6 combinations. We therefore devised an algorithm that efficiently enumerates the parsimonious scenarios for the node n and its children n′ (∈ Ch(n)) under the JC-type penalty. The algorithm will be explained in the next section. Because the processes at a bifurcated node are identical to those of Sankoff's parsimony algorithm, we will mainly describe the processes at a multi-furcated node.
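The size of that brute-force search space can be checked directly: the number of joint state assignments to the children scales as (number of states)^(number of children). A small illustrative check (not part of the paper's algorithm):

```python
# Why exhaustive enumeration explodes at a multi-furcated node: with four
# alternative character states per child, the number of joint assignments
# over k children is 4**k.
from itertools import islice, product

STATES = "ACGT"
n_children = 10
n_combinations = len(STATES) ** n_children  # 4**10 = 1,048,576

# Listing even the first few joint assignments shows the shape of the space
# (product() is lazy, so the full space is never materialized here):
first_three = list(islice(product(STATES, repeat=n_children), 3))
```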

(I-1) Components of the bottom-up part
Supplementary figure SS6 shows the overall workflow of the component processes at each node, which make up the recurrent procedure for the bottom-up part of the algorithm. (i) Classify the child nodes into those with their own "quasi-unique character states" and those without them (step i). The quasi-unique character states are then tentatively assigned to the former class of child nodes. The character states that are (quasi-)unique to some child node(s), as well as the current subject state, serve as the "primary characters"; the child nodes without their own quasi-unique character states will be referred to as "remaining" child nodes.
(ii) Tentatively cluster together the child nodes sharing the same quasi-unique character state, under a newly added node (step ii). This is done for each of the primary characters. (iii) Now the tough part begins, in which we will assign character states to the remaining child nodes in such a way that the total penalty is the smallest. For this purpose, we first identify the 'best' primary character (step iii). (iv) Next, we examine whether each secondary character s_S can improve the total penalty. Such characters will be called "(penalty-)improving secondary characters", and will be used for the next procedure (step iv).
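Steps (i) and (ii) can be sketched as follows. Since the precise criterion for a "quasi-unique" state is given in the supplementary figures, the sketch below uses a simplified stand-in (a child whose conditional-penalty table has a single minimum-penalty state); the data structures and the example tables are assumptions made here for illustration:

```python
# Simplified sketch of steps (i)-(ii): classify children by whether they have
# a single best (here, "quasi-unique") state, then cluster children sharing
# the same quasi-unique state. The real criterion is more refined.

def classify_children(child_tables):
    """Split children into {child: quasi-unique state} and 'remaining' ones."""
    quasi_unique, remaining = {}, []
    for child, table in child_tables.items():
        best = min(table.values())
        best_states = [s for s, p in table.items() if p == best]
        if len(best_states) == 1:
            quasi_unique[child] = best_states[0]  # tentative assignment
        else:
            remaining.append(child)               # handled in steps (iii)+
    return quasi_unique, remaining

def cluster_by_state(quasi_unique):
    """Step (ii): group children sharing the same quasi-unique state."""
    clusters = {}
    for child, state in quasi_unique.items():
        clusters.setdefault(state, []).append(child)
    return clusters

# Toy conditional-penalty tables (as produced by the bottom-up pass):
tables = {
    "n1": {"A": 0, "C": 2, "G": 2, "T": 2},  # quasi-unique state: A
    "n2": {"A": 0, "C": 2, "G": 2, "T": 2},  # quasi-unique state: A
    "n3": {"A": 1, "C": 1, "G": 2, "T": 2},  # tie -> "remaining" child
}
qu, rem = classify_children(tables)
clusters = cluster_by_state(qu)
```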

Secondary characters whose penalty improvement equals zero will be referred to as "(penalty-)keeping secondary characters", and will also be kept for later use, because they might keep the minimum total penalty and provide alternative parsimonious scenarios (step iv).
(v) Then, for each secondary character s_S categorized as "improving" or "keeping" at some node, the remaining child nodes {n″} will be sorted into three classes: "key" nodes, "marginal" nodes, and the rest. The "rest" will be dismissed because they will never improve the penalty (step v).
(vi) Using only the improving secondary characters and placing them solely into some of their key nodes, we will find "core" combinations of the characters that minimize the total penalty (step vi). In consequence, the minimum total conditional penalty, Pen[n | s], will also be determined. Details on this process will be described in the next subsection (and illustrated in Supplementary figures SS7 and SS8).
(vii) Based on each core combination of the improving secondary characters, try to make "fleshed-out" alternative combinations by adding other improving secondary characters onto some of their key nodes (step vii and Supplementary figure SS9). The characters will be tried one by one, recursively, from the least penalized one, and all promising combinations will be exhausted. (viii) Then, from each fleshed-out combination, try to make further alternative combinations by also employing the keeping secondary characters. Basically, each such character s′_S will be examined at its key nodes. At each key node n″, the character will tentatively replace the 'current best' character(s), in a manner similar to that for (vii). All promising combinations will be tried.
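The recursive "fleshing-out" of step (vii) can be sketched very loosely as follows. This is a heavily simplified stand-in: the penalty function, the combination structure, and the pruning criterion are placeholders assumed here for illustration, not the paper's data structures:

```python
# Loose sketch of step (vii): from a core combination, try characters one at
# a time (ordered from the least penalized), each only at its own key nodes,
# and keep every extension that still attains the running minimum penalty;
# branches that worsen the penalty are pruned immediately.

def flesh_out(combo, candidates, penalty_of, best):
    """Yield `combo` and every promising extension of it, recursively."""
    yield combo
    for i, ch in enumerate(candidates):
        extended = combo | {ch}
        if penalty_of(extended) <= best:  # still promising: recurse deeper
            yield from flesh_out(extended, candidates[i + 1:], penalty_of, best)

# Toy run: with penalty = combination size and a budget of 2, all subsets of
# {"a", "b"} are enumerated (the empty core plus three extensions).
combos = list(flesh_out(set(), ["a", "b"], len, 2))
```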
(ix) Then, from each further alternative combination, try to make yet further alternative combinations by putting each of the employed improving and keeping secondary characters into their marginal nodes that are not yet occupied by (an)other secondary character(s) (step ix). All such combinations will be exhausted. Under a JC-type penalty matrix, this topology will genuinely minimize the total penalty, while leaving the genealogical relationship least resolved.
(xii) As a follow-up, examine the clusters each consisting only of a single child node n′ (∈ Ch(n)) bearing its own quasi-unique character. For each such cluster, we first remove the newly introduced "mediating" node, because it is not necessary (step xii). Then, for each alternative character s′ employed in the current character mapping onto the child nodes, we also employ an alternative mapping whose only difference from the original one is the character at the node n′; that is, we incorporate the node n′ into the cluster whose assigned character is s′ (step xii and Supplementary figure SS11 C). For each character mapping (and the resulting minimally resolved genealogy), we perform this process iteratively on each single child node bearing its own quasi-unique character, to exhaust all such alternative mappings (and therefore minimally resolved genealogies).

(I-2) Components of the core process (step (vi) in part I)
Here we will detail the core process in the bottom-up part of the algorithm, which searches for the "core" combinations of "improving secondary characters" that minimize the total penalty, and which was briefly described in step (vi) of the last subsection (step vi of Supplementary figure SS6). The workflow of this core process is illustrated in Supplementary figure SS7 (henceforth only step IDs will be referred to).
(o) First, the improving secondary characters will be numbered, from the least penalized one, as s_S1, s_S2, … (step (o) and Supplementary figure SS8 A). The process starts at the "best" character s_S1, and proceeds downwards, looping over this order. (c) To decide whether we should actually try the downhill process or not, we monitor the potentially achievable penalty, similarly to the process (a). Then, the downhill process will be performed similarly to the process (b) (step (c) and Supplementary figure SS8 C). After the downhill process ends, the best combination in the process replaces the set of previous best combinations if the former is better than the latter, or is added to the latter if they tie. The resulting improved combination will be subjected to further similar improvement attempts, and the whole process goes on until the basis, s_Si, is finally removed.
(d) The best combination in the i-th cycle will be compared to the current best combination in the whole process (step (d)). If the former improves the penalty, it replaces the latter. If they tie, the former will be added to the set of current bests.
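The replace-on-improvement, accumulate-on-tie bookkeeping of steps (c) and (d) can be sketched as a small helper. The names and the toy candidate list below are illustrative assumptions, not taken from the paper's code:

```python
# Sketch of the best-combination bookkeeping: the running set of bests is
# replaced when a strictly better penalty appears, extended on a tie, and
# left untouched when the new combination is worse.

def update_bests(best_penalty, best_combos, new_penalty, new_combo):
    if new_penalty < best_penalty:
        return new_penalty, [new_combo]                   # strictly better
    if new_penalty == best_penalty:
        return best_penalty, best_combos + [new_combo]    # tie: accumulate
    return best_penalty, best_combos                      # worse: discard

pen, bests = float("inf"), []
for candidate, p in [("c1", 5), ("c2", 4), ("c3", 4), ("c4", 6)]:
    pen, bests = update_bests(pen, bests, p, candidate)
```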

(II) Top-down part of the new parsimony algorithm
The second large block of procedures starts at the top node (i.e., the root), and goes down until it has passed through all the nodes. The block is an extension of the top-down part of Sankoff's parsimony algorithm [2], which enumerates all the possible parsimonious mutation scenarios.
When the input tree is completely resolved (i.e., bifurcated), the whole process is exactly that of Sankoff's algorithm [2]: when there is more than one minimum-penalty character at a node, as many parsimonious scenarios as there are minimum-penalty characters are created from the parsimonious scenario constructed down to the immediately preceding node.
Even when the tree is incompletely resolved (i.e., multi-furcated), step (i) remains the same. Step (ii), however, works quite differently at a multi-furcated node than at a bifurcated node (given above), though the two are equivalent in philosophy: once the state of the multi-furcated node is specified, the states of its child nodes (and their genealogical relationships) follow automatically from the parsimonious scenarios already enumerated, conditional on that state, in the bottom-up part.
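For the bifurcated case, the fork-on-ties enumeration can be sketched as below. This sketch is an assumption-laden illustration (unit JC-type penalty, toy penalty tables); multi-furcated nodes would instead reuse the scenarios already enumerated in the bottom-up part:

```python
# Sketch of the top-down enumeration on a fully bifurcated tree: whenever
# several states tie for the minimum penalty at a node, the scenario built so
# far is forked, one copy per tying state, exhausting all parsimonious
# scenarios.
from itertools import product

STATES = "ACGT"

def enumerate_scenarios(node, parent_state, table_of, children):
    """Yield {node: state} maps for every parsimonious assignment below node."""
    t = table_of[node]
    def cost(s):  # sub-tree penalty plus the branch penalty from the parent
        step = 0 if parent_state is None or s == parent_state else 1
        return t[s] + step
    best = min(cost(s) for s in STATES)
    for s in (x for x in STATES if cost(x) == best):  # fork on every tie
        subs = [list(enumerate_scenarios(c, s, table_of, children))
                for c in children.get(node, [])]
        if not subs:
            yield {node: s}
            continue
        for combo in product(*subs):
            merged = {node: s}
            for part in combo:
                merged.update(part)
            yield merged

# Toy tables (as produced by a bottom-up pass): tips a = 'A', b = 'G', so the
# root state ties between 'A' and 'G', giving two parsimonious scenarios.
INF = float("inf")
table_of = {
    "a": {"A": 0, "C": INF, "G": INF, "T": INF},
    "b": {"A": INF, "C": INF, "G": 0, "T": INF},
    "R": {"A": 1, "C": 2, "G": 1, "T": 2},
}
children = {"R": ["a", "b"]}
scenarios = list(enumerate_scenarios("R", None, table_of, children))
```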

Output format
This algorithm enumerates the most parsimonious scenarios, whose mutations incur the minimum total penalty and which are consistent with the input incompletely resolved (i.e., multi-furcated) genealogy of the sequences, as well as with their current states. Each scenario is accompanied by the additional branches and nodes necessary to achieve the minimum penalty.
It is not uncommon for there to be hundreds to thousands of parsimonious scenarios, which would be hard to store in memory in the normal format of states mapped onto the interior nodes, especially given that the tree topology could differ from scenario to scenario. To alleviate the possibly enormous memory consumption, we introduced the following output format: (1) each scenario is represented in terms of changes (mutations) mapped onto the branches, instead of (ancestral) states mapped onto the (interior) nodes; (2) the scenarios are factorized into components; and (3) instead of storing the set of whole scenarios, the sets of components of scenarios, factorized as in feature (2), are stored.

Also, as mentioned earlier, although the current implementation correctly handles JC-type penalty matrices, this may not be the case with non-JC-type penalties. To be applicable to such penalties, we will have to consider connecting two or more nodes to which secondary characters are assigned. This will complicate the comparison of mutation penalties, and will be tackled in the future.
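The memory saving from that format can be illustrated with a minimal sketch. The structures below (mutation tuples, component lists) are assumptions made here for illustration and do not reflect the actual Perl data layout:

```python
# Sketch of the factorized output format: each scenario is a set of mutations
# mapped onto branches (feature 1), and the collection of scenarios is kept as
# sets of shared components (features 2-3) whose cross-product encodes all
# whole scenarios without materializing them.
from itertools import product

# One mutation = (parent_node, child_node, from_state, to_state)
component_X = [("R", "c", "A", "G")]    # component shared by all scenarios
component_Y1 = [("X", "b", "A", "C")]   # alternative sub-scenarios ...
component_Y2 = [("X", "a", "A", "C")]   # ... below another node

# Factorized store: 1 * 2 = 2 whole scenarios encoded by two small lists.
factorized = [[component_X], [component_Y1, component_Y2]]

def expand(store):
    """Materialize whole scenarios on demand from the factorized store."""
    for combo in product(*store):
        yield [mut for comp in combo for mut in comp]

scenarios = list(expand(factorized))
```

With many independently varying regions of the tree, the number of whole scenarios is the product of the component-set sizes, while the factorized store only grows with their sum.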
Another possibly useful modification would be to extend the algorithm so that it can accept a set of inconsistent partitions, instead of a (possibly multi-furcated) genealogy. This will give a firm foundation for handling sequences under recombination.

Program availability
The new parsimony algorithm was implemented in a home-made Perl module, which is called by a home-made Perl script that applies our new neutrality tests to an output file of the coalescent simulator 'msms' [4] (under a particular set of options).
A package containing the Perl module and the Perl script is available as Additional File 3, or on request to the first author at <kezawa dot ezawa3 at gmal dot com> (replace ' dot ' with '.', and ' at ' with '@').

Legends to supplementary figures
To schematically illustrate what our new parsimony algorithm does, a multi-furcated node and its child nodes are shown as circles in each panel. When every child node has its own unique state
ABBREVIATIONS: B.P., "best primary" (character); I.S., "improving secondary" (character); K.S., "keeping secondary" (character); R.S., "remaining secondary" (character); Curr_tot_min_penalty, current total minimum penalty; tot_pen, total penalty.

A.
(A) To the top node (thick-rimmed circle at the top), states with the smallest total penalty are assigned, just as in Sankoff's (1975) algorithm. (B) To each child (numbered circle) of a bifurcated node (circle labeled 'P'), assign a state giving the smallest penalty conditional on the state of the node 'P', just as in Sankoff's (1975) algorithm. (C) For a multi-furcated node (circle labeled 'P'), parsimonious scenarios down to its children (numbered circles), conditional on its states, are already enumerated in the bottom-up part of the new algorithm (in the dashed box at the top). Therefore, once the state of node 'P' is specified in the previous step of the top-down part, the states are automatically assigned to the child nodes, and their genealogical relationships are automatically resolved, as part of each fixed parsimonious scenario.
A. Recording state changes on branches.