Gene-Specific Signatures of Elevated Non-Synonymous Substitution Rates Correlate Poorly across the Plasmodium Genus

Background Comparative genome analyses of parasites allow large scale investigation of selective pressures shaping their evolution. An acute limitation to such analysis of Plasmodium falciparum is that there is only very partial low-coverage genome sequence of the most closely related species, the chimpanzee parasite P. reichenowi. However, if orthologous genes have been under similar selective pressures throughout the Plasmodium genus then positive selection on the P. falciparum lineage might be predicted to some extent by analysis of other lineages. Principal Findings Here, three independent pairs of closely related species in different sub-generic clades (P. falciparum and P. reichenowi; P. vivax and P. knowlesi; P. yoelii and P. berghei) were compared for a set of 43 candidate ligand genes considered likely to be under positive directional selection and a set of 102 control genes for which there was no selective hypothesis. The ratios of non-synonymous to synonymous substitutions (dN/dS) were significantly elevated in the candidate ligand genes compared to control genes in each of the three clades. However, the rank order correlation of dN/dS ratios for individual candidate genes was very low, less than the correlation for the control genes. Significance The inability to predict positive selection on a gene in one lineage by identifying elevated dN/dS ratios in the orthologue within another lineage needs to be noted, as it reflects that adaptive mutations are generally rare events that lead to fixation in individual lineages. Thus it is essential to complete the genome sequences of particular species of phylogenetic importance, such as P. reichenowi.


Introduction
Identifying genes under positive directional selection can help understand how parasites adapt to new survival or reproductive challenges. The dN/dS ratio (non-synonymous substitutions per non-synonymous site divided by synonymous substitutions per synonymous site) is commonly applied to scan for evidence of positive selection in comparative genomic analysis [1,2]. Analyses of polymorphism among genome sequences of the human malaria parasite P. falciparum [3][4][5], and divergence between P. falciparum and the partially available genome sequence of the chimpanzee parasite P. reichenowi [3] show elevated dN/dS ratios in genes encoding membrane and exported proteins (considered to be under positive selection), as well as genes that are expressed at low abundance or at only one stage of the life cycle (considered to be under relaxed negative selection). However, the incompleteness of the P. reichenowi genome sequence (available sequence reads aligned to only , 42% of the P. falciparum 3D7 genome sequence) means that most loci could not be effectively analysed for inter-specific divergence [3], so most signatures of positive directional selection have not yet been discriminated.
Pairwise analyses with other malaria parasite species may also identify loci under positive selection. However, given the great evolutionary distance between many of the species, such as between P. falciparum and the rodent parasite P. yoelii [6], studies of pairwise dN/dS suffer from too high a sequence divergence, causing synonymous substitutions to be saturated and making estimates of dN/dS rate ratios unreliable. Analyses of closely related species are preferable, and pairwise dN/dS analysis among the genomes of the rodent malaria parasites, P. yoelii, P. berghei and P. chabaudi [7], showed a similar overall trend to the falciparumreichenowi analysis, with putative membrane proteins displaying higher dN/dS values than other genes. Could the results of that analysis (or analysis of other closely related species pairs such as P. vivax and P. knowlesi) be extrapolated to P. falciparum genes for which P. reichenowi orthologous sequences are not available? This study tests whether signatures from one clade of the Plasmodium genus can be used to predict those in other clades. The distributions of dN/dS values are compared for sets of orthologous loci in three phylogenetically independent species pairs, investigating a set of 43 candidate genes that are considered likely to be under positive selection and a set of 102 control genes for which there is no selective hypothesis.

Results and Discussion
For each of the 43 candidate ligand genes analysed, interspecific dN/dS ratios are shown for each of the three closely related species pairs, P. falciparum / P. reichenowi, P. vivax / P. knowlesi, and P. yoelii / P. berghei (Table 1, further details in table  S1). To test whether this candidate ligand gene dataset is enriched in genes under positive selection, dN/dS values were compared with the control gene dataset (table S2) for each species pair (Fig. 1A) using Wilcoxon's rank sum test. For all three species pairs the median dN/dS ratio was significantly greater in the candidate ligand gene set than in the control set (falciparum-reichenowi, P = 0.0084; vivax-knowlesi, P = 0.0175; yoelii-berghei, P = 0.0003) (Fig. 1B). This was also seen for dN values (Fig. 1C), though not for dS (Fig. 1D), indicating a signature of positive selection on nonsynonymous mutations leading to elevated dN/dS values in a proportion of the candidate ligand genes. Relaxed selective constraint could also result in elevated dN/dS, although there is no reason to expect that the candidate ligand genes should be under any less selective constraint than the control set of genes to maintain protein structure and function.
It should be noted that analysis of any single one of these genes in isolation would not lead to a strong conclusion of positive selection, since none showed a dN/dS value .1. Inter-specific dN/dS values for whole genes are hardly ever .1 even when positive selection occurs, due to the effect of negative background selection on many sites within most genes [1,2], so comparison of relative dN/dS values across sets of genes is a more sensitive way of scanning for evidence of positive selection than searching for individual values above 1 or any other arbitrary cut off.
To assess the predictive power of dN/dS across the Plasmodium genus, rank correlations (Spearman's r = rSp) were applied to test whether similar relative selective forces operate on orthologous genes in different species. Table 2 shows the correlation of dN/dS, dN and dS indices for all genes among the three different species pairs. Pairwise scatterplots of dN/dS values are shown in Fig. 2. The predictive power is quantified by r 2 Sp which represents the amount of variability in one axis which can be explained by variability in the other. dN/dS was significantly, though poorly, positively correlated between independent species pairs for both candidate ligand genes and control genes. Correlations were greater for dN than for dS, supporting the idea that selection affects the correlations while synonymous substitutions are mostly stochastic. However, the predictive power of dN/dS for one species pair on another is lower for candidate ligand genes (25 %, 21 % and 31 % for Pf/Pr versus Pv/Pk, Pf/Pr versus Py/Pb, and Pv/Pk versus Py/Pb respectively) than for control genes (55 %, 35 % and 44 % for the respective three comparisons). This indicates that the correlation is not improved by positive selection but is actually made worse. Discrete processes of positive selection will have occurred in different species lineages, against a background of selective constraint that varies among genes in a manner that is apparently more homogeneous between different lineages.
Thus, although broadly similar signatures indicating positive selection on distinct classes of genes may be seen in different parts of the Plasmodium phylogeny, predictions about positive selection on individual genes for which sequence data are currently missing in particular species cannot be reliably extrapolated from orthologues in other parts of the phylogeny. To detect loci that have undergone positive directional selection in the lineage of a particular species, sequences must be directly compared with orthologues of a closely related species. As P. falciparum is currently the most important human parasite, completion of the closely related P. reichenowi genome sequence should now have particularly high priority [3].

Sets of candidate genes and controls
A set of 55 single-locus genes encoding surface proteins that are putatively ligands at various life cycle stages was first defined. These genes are candidates to display signatures of positive selection due to their likely role in host-parasite interaction, and of these, 43 could be included in comparative dN/dS analyses as noted in the following section. Loci in this candidate gene dataset were compared with loci from a control dataset chosen to represent an unbiased sample of genes not hypothesised to be under positive selection. The control set was of loci on P. falciparum chromosome 3 that contained one or more nucleotide difference among the sequences of five isolates as published [8] with data searchable on PlasmoDB (www.plasmodb.org) [9]. Of the 104 such loci identified, two (PFC0210c and PFC0420w) were already included in the candidate ligand gene dataset and were thus excluded from the control dataset, which therefore consisted of 102 genes.

Defining orthologous genes for analysis of sequence divergence between species
Pairwise nucleotide divergence was estimated for 3 pairs of closely related species: P. falciparum and P. reichenowi; P. vivax and P. knowlesi; and P. yoelii and P. berghei (Fig. 1A). Two other species for which genome sequence data are available, P. gallinaceum and P. chabaudi, were not included in the present analysis as the former is not very closely related to any other species [10][11][12][13], and the latter would add little extra information to the yoelii-berghei pair [7]. Protein-coding gene sequences in the P. falciparum 3D7 genome sequence (release date 11/02/2005), produced by a consortium of the Wellcome Trust Sanger Institute (WTSI), the Institute for Genomic Research (TIGR) and Stanford University [14], were downloaded from the PlasmoDB website (http://www.plasmodb. org/common/downloads/) [9]; sequences from P. vivax (release date 03/11/2005) and P. yoelii (23/07/2004) [15], produced by the Institute for Genomic Research (TIGR), were downloaded from the TIGR website (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/); shotgun sequences from P. reichenowi (11/03/2004) [3], and gene sequences from P. knowlesi (06/01/2006) and P. berghei (08/ 06/2004) [7] were produced by the Wellcome Trust Sanger Institute and were downloaded from the WTSI website (ftp://ftp. sanger.ac.uk/pub/pathogens/). Orthologues to P. falciparum predicted protein sequences were defined by BLASTp (protein vs. protein) searches against databases of P. yoelii, P. berghei, P. vivax and P. knowlesi predicted proteins, and required a reciprocal best match against the P. falciparum predicted protein database. For added stringency, each pair of putative orthologues (yoelii-berghei, vivax-knowlesi, falciparumreichenowi) were BLASTed against the database of the other species of the pair to ensure that the best matches to the P. falciparum sequences in each species were also reciprocal best matches to each other. Where this was not the case the pair was not analysed (detailed results of BLAST searches are shown in tables S1 and S2). No database of predicted proteins existed for P. reichenowi, so P. falciparum predicted protein sequences were used to search the P. reichenowi genomic contig database using tBLASTn (protein versus DNA translated in all 6 possible reading frames). In a number of cases where P. reichenowi orthologues could not be identified in the contig data, published P. reichenowi sequences were obtained from GenBank or sequences built from shotgun sequencing reads were used (table S3). For each gene, the P. falciparum coding sequence (introns excluded) was aligned to the best matching P. reichenowi contig using the SeqMan II program (DNASTAR, Madison, WI) to define the start and end of the coding sequence and the intron-exon boundaries. P. reichenowi contig sequences contained some regions of single-read coverage, so nucleotide mismatches in regions of single-read coverage were edited to match the P. falciparum sequence, and only well supported nucleotide mismatches in regions of multiple-read coverage were used for analysis. If a P. reichenowi gene sequence contained apparent frameshifts supported by multiple-read coverage, it was considered to be a pseudogene and not analysed.
Forty three of the 55 candidate ligand gene loci examined could be analysed for pairwise divergence between orthologues. Twelve loci (msp1, msp2, msp3, msp6, msp7, eba175, eba165, ebl1, rh1, rh2a, rh2b, rh3) were not analysed, because (i) unambiguous orthologues could not be defined, or (ii) molecular evolution appeared complex such that dN and dS may not represent the accumulation of substitutions between species (some genes had dimorphic alleles that were more divergent than the paired species sequence, and others showed evidence of gene conversion with paralogues), or (iii) an orthologue appeared to be a pseudogene. Of the 102 control gene loci, all orthologous pairs identified were analysed unless they contained a pseudogene. The relatively low number of falciparum-reichenowi gene pairs analysed (37 of 102 control genes, compared to 92 and 70 for the other species pairs) reflects the low sequence coverage of the P. reichenowi genome to date.

Analysis of pairwise between-species dN and dS values for individual genes
Orthologous protein sequence pairs were aligned using clustalW [16] and the protein alignments imposed upon the nucleotide sequences using the program pal2nal [17]. For each sequence pair, pairwise dN, dS and dN/dS indices were estimated by maximum likelihood using the codeml program [18]. Maximum likelihood estimates of dN/dS were used since they are more accurate than approximate methods such as the Nei-Gojobori method when transition/transversion rate biases and nucleotide composition or codon frequency biases exist [19]. Three independent runs of codeml were made with different initial estimates for the transition/transversion rate ratio (k) and the dN/dS ratio (run 1: k = 1, dN/dS = 1; run 2: k = 0.1, dN/dS = 10; run 3: k = 10, dN/ dS = 0.1) so that each run began at a different point in the likelihood space. If different final results were obtained between runs, those with the highest log likelihood value (lnL) were used, lower lnL values being assumed to represent local likelihood optima. Non-parametric statistical tests, Wilcoxon's rank sum test and Spearman's rank correlation, were carried out using STATA 9 (StatCorp LP, Texas, USA) as the indices of divergence were not assumed to be normally distributed.

Supporting Information
Table S1 Results of BLAST sequence similarity searches to determine orthologuous pairs of loci in six Plasmodium genomes for 43 candidate ligand genes  Table 2