A phylogenomics approach for selecting robust sets of phylogenetic markers

Reconstructing the evolutionary relationships of species is a major goal in biology. Despite the increasing number of completely sequenced genomes, a large number of phylogenetic projects rely on targeted sequencing and analysis of a relatively small sample of marker genes. The selection of these phylogenetic markers should ideally be based on accurate predictions of their combined, rather than individual, potential to accurately resolve the phylogeny of interest. Here we present and validate a new phylogenomics strategy to efficiently select a minimal set of stable markers able to reconstruct the underlying species phylogeny. In contrast to previous approaches, our methodology does not only rely on the ability of individual genes to reconstruct a known phylogeny, but it also explores the combined power of sets of concatenated genes to accurately infer phylogenetic relationships of species not previously analyzed. We applied our approach to two broad sets of cyanobacterial and ascomycetous fungal species, and provide two minimal sets of six and four genes, respectively, necessary to fully resolve the target phylogenies. This approach paves the way for the informed selection of phylogenetic markers in the effort of reconstructing the tree of life.

List of 63 completely sequenced Cyanobacterial genomes used in the analyses. Columns indicate, in this order, the taxonomic group, the species, the source from where proteomes were obtained, and date of acquisition. Gray boxes indicate which species were used as query in the orthology search. Yellow boxes indicate which species were used for the validation phase.

Group
Scientific Name Source As of    Table 4.
Results obtained when one-genome-at-the-time test was applied to the 19 cyanobacterial genomes belonging to the Testing set. Columns indicate, in this order, the scientific name of the studied species; the number of selected markers found in each species (out of 6); the number of all marker genes found (out of 203 initial set); the Robinson & Foulds distance (RF) (18), which measures topological differences, between the tree inferred using the selected markers and all available marker genes when each species is considered; percentage of wrong splits as a normalized measure of topological differences between the trees inferred using the selected markers or all available markers for a given species.

Scientific Name Selected
Markers ( Table 5. Results for the concatenation of traditionally used marker genes in Cyanobacteria (22, 23). Only coding-protein genes were considered for the comparison in order to use the same methodology regarding the BLAST search, multiple sequence alignment and phylogenetic tree reconstruction.
Since nifD was absent, present in multiple copies or with a low coverage for many species, we constructed a second set of marker genes without it. Comparisons in terms of topological differences, were performed against the different reference species trees for the training, testing and all cyanobacterial species.   Table 7.

Combination of marker
Results obtained when one-genome at the time test was applied to the 28 fungal genomes from Ascomycota belonging to the testing set. In this order, the scientific name of the studied species, the number of selected markers found in each species (out of 4); the number of all marker genes found (out of 169 initial set); the Robinson & Foulds distance (RF) (18), which measures topological differences, between the tree inferred using the selected markers and all available marker genes when each species is considered; percentage of wrong splits as a normalized measure of topological differences between the trees inferred using the selected markers or all available markers for a given species.

Scientific Name Selected
Markers ( Table 8. Results for the concatenation of marker genes for the fungal Species Tree. On this comparison only coding-protein genes were considered in order to use the same methodology for the BLAST search, Multiple Sequence Alignment and phylogenetic tree reconstruction as for our set of marker genes. Two datasets were used on this analysis, one containing 4 broadly used markers and the other one containing the two markers proposed by (3). Tree topologies comparisons were performed for the training, testing and all sets of species considered on this study. (1): R&F distance
Sets of selected marker genes for the two additional rounds performed on Ascomycota. Protein information was retrieved from UNIPROT using either Saccharomyces cerevisiae (Ascomycotaround 2) or Candida glabrata (Ascomycota -round 3).   genes are needed to recover the same topology as the reference species tree. The reference species tree was reconstructed after concatenating 203 sets of single-copy widespread genes across 43 cyanobacterial species from the training set. Figure S2.

Uniprot Id Length (AA) Annotation
Percentage of wrong splits of the progressive concatenation of 167 gene-sets compared to the reference species tree. Progressive concatenation was performed following the order of the distance of individual markers to the reference topology using different alternative measurements: Robinson and Foulds distance (blue line), Tree Certainty (red line), K-tree score (green line) and Likelihood ratio (cyan line). On the figure is marked how many genes are needed, for each ranking approximation, to concatenate for recovering the reference tree topology. The reference species tree was reconstructed after concatenating 169 sets of singlecopy widespread genes across 55 ascomycotal fungal species from the training set. Figure S3.
Percentage of wrong splits for 615 concatenated gene-sets. Gene-sets were constructed as random combinations of the initial marker genes sets (6 gene-sets). The combination yielding the same topology as the reference which was used for downstream analyses is marked on the figure with stars.

Figure 4.
Percentage of wrong splits, as compared with the reference tree, for 1,000 different combinations of 4 marker genes selected among the available ones for the Ascomycota -Round 1 experiment. The idea behind this experiment is to measure how often a combination of 4 marker genes, the size of the one selected in our experiment, are found. Considering the average (8.9440) and the standard deviation (+/-3.7538) of the distribution, having a value of 0% (same reference topology) is further than (avg + 2std) which covers the 95% of the expected values.

Figure S5.
Basidiomycota fungal species tree comprising 28 species used as an additional validation step for the selection of 4 gene markers set on Ascomycota. The same tree topology was recovered using either the concatenation of the 4 gene markers set or the concatenation of 313 single-copy widespread genes present in all species. Gray boxes indicate which species were used as query to perform the BLAST search of orthologous genes.

Pipeline description in pseudo-code.
1) Divide the initial set of species into non-overlapping training (T-set) and validation (V-set) sets.
Size of each set is optional, by default 2:1 (T:V) size ratio is used.
2) Identify marker genes as genes with single copy orthologs in 100% (by default) of the species in the T-set. 100% can be optionally lowered.
3) For each marker gene identified in step 2, reconstruct a multiple sequence alignment (MSA).
4) Generate the reference topology for the T-set by concatenating all alignments generated in step 3. 5) Rank individual markers by their distance to the reference topology using any desired metric (e.g. Robinson and Foulds distance, Tree Certainty, Likelihood ratio) 6) Generate, progressively, concatenated alignments from the most similar to the most dissimilar markers up to reaching a given threshold. By default, the threshold is set to recover exactly the same reference topology (distance = 0). Even when the threshold is reached, the process can be extended to explore all progressively concatenated markers.
7) The set of concatenated markers for which, for the first time, a given threshold is reached constitutes the initial markers set. Depending on the size of such set, it could be tested directly (go to step 9) or start random concatenation of markers to reduce the size of this set. 8) In order to reduce the initial markers set size, smaller sets can be randomly generated using either only markers from the initial set or from the whole set of markers. The number of randomly concatenated sets can be set to a given number, e.g. 100 or 1000, or explored completely when the number of combinations is affordable. For instance, there are 1,024 possible combinations of sizes 2 to 9 for an initial marker set of 10 genes, conversely, for an initial set of 20 markers there are 1,048,576 possible combinations. 9) Identify which previously identified markers are on the V-set of species. Markers are required to be single copy but not widespread across all species.
10) Reference topologies are reconstructed following two approximations: 10A) a reference tree is reconstructed only for species in the V-set by concatenating the markers found in step 9.
10B) a tree for each individual species in the validation set is reconstructed which includes species in the training set plus each individual species. 11) Following a similar strategy to step 10, trees using only markers identified in steps 7 or 8 are reconstructed. Then, tree topologies recovered for the selected markers are compared against the one rendered using all markers found in V-set. In this way, the phylogenetic signal carried by the set of markers is evaluated. It is also evaluated the presence or not of individual markers on the newly set of species.
12) Depending on the results in the previous step, a set of markers can be proposed as the outcome of the experiment, if satisfactory. If new sets of markers are needed, go to step 8 exploring more random combination or increasing the upper limit of the marker genes set size.
13) A final tree including all species used in the study is reconstructed using either all available marker or only the selected ones. Comparison of both topologies gives an idea about the potential of the selected set of marker genes for resolving large species trees.