Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.478 Research Article

The phylogenetics of the genus Alphavirus have historically been characterized using partial gene, single gene or partial proteomic data. We have mined cDNA and amino acid sequences from GenBank for all fully sequenced and some partially sequenced alphaviruses and generated phylogenomic analyses of the genus Alphavirus, employing capsid encoding structural regions, non-structural coding regions and complete viral genomes. Our studies support the presence of the previously reported recombination event that produced the Western Equine Encephalitis clade, and confirm many of the patterns of geographic radiation and divergence of the multiple species. Our data suggest that the Salmon Pancreatic Disease Virus and Sleeping Disease Virus are sufficiently divergent to form a separate clade from the other alphaviruses. Also, unlike previously reported studies employing limited sequence data for correlation of phylogeny, our results indicate that the Barmah Forest Virus and Middelburg Virus appear to be members of the Semliki Forest clade. Additionally, our analysis indicates that the Southern Elephant Seal Virus is part of the Semliki Forest clade, although still phylogenetically distant from all known members of the genus Alphavirus. Finally, we demonstrate that the whole Rubella viral genome provides an ideal outgroup for phylogenomic studies of the genus Alphavirus.


Introduction
Alphaviruses comprise a genus of arboviruses of the Family Togaviridae that infect many different vertebrate hosts and are transmitted by a number of invertebrate vectors. Nearly 30 different alphaviruses have been isolated worldwide and classified into one of seven serocomplexes (Powers et al., 2001). These species cause disease in a wide range of animals and humans. Based on the geographic location in which the specific alphaviruses have been isolated, each species has been described as Old World (Asia, Australia, Europe, and Africa) or New World (North America and South America) (Strauss and Strauss, 1994). The Alphavirus genome is a positive-strand RNA molecule, approximately 11-12 kb in length. The 5′ end of the genome encodes the non-structural proteins nsP1-nsP4. The 3′ terminal region encodes the structural proteins: capsid, 6K and envelope genes 1-3 (Strauss and Strauss, 1994).
Large-scale genomic recombination events between an Eastern Equine Encephalitis Virus (EEEV) and a Sindbis-like ancestor are hypothesized to have resulted in the production of a 'Western Equine Encephalitis Virus' (WEEV) (Hahn et al., 1988; Levinson et al., 1990; Weaver et al., 1993; Strauss and Strauss, 1997) (Table 1). In turn, this virus speciated into a number of discrete alphaviruses, including the Highlands J, Buggy Creek, and Fort Morgan Viruses (Calisher et al., 1980; Strauss and Strauss, 1997). These alphaviruses, along with Sindbis, have been classified into the 'WEE complex', based upon serologic analyses (Calisher et al., 1980) and confirmed by sequence alignments (Hahn et al., 1988) and limited phylogenetic analysis (Levinson et al., 1990; Weaver et al., 1997; Powers et al., 2001). The WEEV descendants contain the E1 and E2 structural proteins from the Sindbis-like progenitor; the remainder of the genome retains EEEV sequence similarities.
Current alphaviral study includes molecular characterization of the genes, development of alphaviral-based vectors and vaccines, and phylogenetic characterization of the lineage of the alphaviruses (Strauss and Strauss, 1994; Frolov et al., 1996; Schlesinger and Dubensky, 1999; Powers et al., 2001). Recent advances in automated sequencing have produced a plethora of sequence information that allows review and revision of the earlier immunological results that have categorized alphaviruses into the seven antigenic complexes. Calisher et al. (1980) identified Western Equine Encephalitis complex viruses, which led to an antigenic classification of the WEE complex viruses that has remained unaltered (Strauss and Strauss, 1994; Powers et al., 2001). A subsequent re-evaluation of neutralization testing further distinguished what would come to be known as 'WEE complex recombinants' from other complex members, including Sindbis, its subtypes, and Aura (Powers et al., 2001).
The first phylogenetic analyses comparing the regions of recombination of Sindbis and Eastern Equine Encephalitis Viruses were performed by Hahn et al. (1988). The study compared amino acid identity of the nsP4 carboxyterminus, capsid, E1, E2, E3 and 6K proteins of the EEEV, SIN, Western Equine Encephalitis and Venezuelan Equine Encephalitis viruses.
Analysing non-structural protein differences, Weaver et al. (1993) found the presence of two major alphaviral groups: the Old World viruses (Sindbis, O'Nyong Nyong, Middelburg, Ross River, and Semliki Forest viruses) and the New World viruses (EEEV, VEEV, WEEV). Later, Weaver et al. (1997) determined alphaviral phylogenetic relationships for the WEE complex based on 500 nt-length portions of the C-terminal regions of both the E1 and nsP4 genes.
Salmon Pancreatic Disease Virus (SPDV) and Sleeping Disease Virus (SDV) are two of the more recently discovered alphaviruses (Boucher et al., 1994; Boucher and Laurencin, 1996; Christie et al., 1998). Antigenic similarity of these species was determined through immune cross-protection (Boucher and Laurencin, 1996), neutralization and histopathological testing (Weston et al., 2002). Limited phylogenetic analysis was performed on these species (Villoing et al., 2000; Weston et al., 2002). Another new virus, the poorly understood Southern Elephant Seal Virus (SESV), was phylogenetically compared to the viruses of the Semliki Forest (SF) complex and shown to antigenically cross-react with other members from this geographic region. Powers et al. (2001) produced a comprehensive phylogram that included almost all known alphaviral strains and many subtypes. This phylogram was generated on the basis of a portion of the E1 glycoprotein sequence and grouped the alphaviruses into their antigenic complexes. The phylogram also indicated points at which the viruses may have been geographically translocated between the Old and New Worlds.
Previous phylogenetic studies have been limited by the methodologies and sequences employed. For example, the antigenic complexes formulated are based upon E1, E2, and capsid glycoprotein immunological relationships between the viruses, but ignore the rest of the genome. The partial gene sequences used to produce identity and cladistic data do not reflect whole viral genome similarity (Hahn et al., 1988; Levinson et al., 1990; Weaver et al., 1993, 1994, 1997; Powers et al., 2001). Therefore, the antigenic complexes used to categorize the relationships within the genus Alphavirus are restricted in their ability to fully identify phylogenomic relationships.
Building upon the work of Levinson et al. (1990), we deduced phylogenomic relationships between all sequenced alphaviruses, using complete non-structural, structural and whole genomic cDNA and amino acid data. These data were used to generate phylograms that could be compared to the published phylogenetic trees of Levinson et al. (1990), Weaver et al. (1993, 1997) and Powers et al. (2001). Additionally, we have proposed the appropriate phylogenetic positions of the three newly recognized alphaviruses (SPDV, SDV and SESV).

Sources of sequence data
All Alphavirus and Rubella virus (Accession No. 9 790 308) sequences were obtained from the National Center for Biotechnology Information (NCBI) GenBank database (www.ncbi.nlm.nih.gov), using the 'Nucleotide' search option (Table 1). The sequences were converted to FASTA format using the 'FASTA' display option and saved as individual text files. All sequences were combined into a single text file for alignment procedures. This process was repeated for the structural polyprotein cDNA sequences of the alphaviruses of interest (Table 1).
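The step of combining individual FASTA files into a single alignment input can be sketched in a few lines of Python. This is only an illustration of the file handling described above, not the procedure actually used; the function name `merge_fasta` and the file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def merge_fasta(paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate FASTA files into one multi-FASTA file.

    Blank lines are dropped so that records follow each other directly.
    Returns the number of records (header lines) written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for path in paths:
            for line in Path(path).read_text().splitlines():
                line = line.strip()
                if not line:
                    continue  # skip blank lines inside or between records
                if line.startswith(">"):
                    n_records += 1  # each '>' header starts a new record
                out.write(line + "\n")
    return n_records
```

For example, `merge_fasta([Path("sindbis.fasta"), Path("rubella.fasta")], Path("all_genomes.fasta"))` would write both records to one file and return 2.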
The cDNA sequences of the non-structural polyprotein coding region were identified by aligning the nsP4 cDNA region of the Alphavirus in question to the complete genome. Alignments were performed using the BLAST pairwise alignment (www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). Alignment analysis revealed the appropriate nucleotide at which the non-structural region ended for each species. The promoter and structural sequences were removed from the overall sequence by hand and the derived non-structural regions combined into a single FASTA file.
Amino acid viral sequences were obtained from the NCBI GenBank database (www.ncbi.nlm.nih.gov) (Table 1). The structural and non-structural polyprotein amino acid sequences were available for both the alphaviruses and Rubella virus. These were copied and pasted into a FASTA file. The structural and non-structural protein sequences, after data mining, were divided into separate files. For each species, full-length genomic amino acid data were produced by hand, joining the amino terminal of the non-structural polyprotein sequence to the carboxyterminal amino acid of the structural polyprotein sequence. These complete amino acid sequences were combined into a FASTA file.
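The manual joining of polyprotein sequences could also be scripted. The following Python sketch is an assumption-laden illustration, not the procedure used here: it places the non-structural polyprotein first and the structural polyprotein second, following the 5′ to 3′ genome order given in the Introduction, and it matches records by the first word of the FASTA header. The names `parse_fasta` and `join_polyproteins` are hypothetical.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}.

    The record name is taken to be the first word after '>'.
    """
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def join_polyproteins(nonstructural: Dict[str, str],
                      structural: Dict[str, str]) -> Dict[str, str]:
    """Join non-structural and structural polyproteins per species.

    Assumption: genome order (non-structural first). A trailing '*'
    stop symbol on the non-structural sequence is removed before joining.
    Species missing from either input are skipped.
    """
    joined: Dict[str, str] = {}
    for name, ns_seq in nonstructural.items():
        if name in structural:
            joined[name] = ns_seq.rstrip("*") + structural[name]
    return joined
```

Skipping species that lack one of the two polyproteins keeps partially sequenced viruses out of the full-length data set rather than producing truncated entries.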

Sequence alignments and phylogenetic tree construction
Multiple alignments of all DNA and amino acid sequences were constructed using the Clustal X v1.81 software (Thompson et al., 1997). All alignments were performed using the default values of the Clustal X program. The DNA and protein sequences of Rubella virus were employed as outgroups in all studies.
Phylogenetic trees were generated from the distances provided by the Clustal X analysis using the neighbour-joining method (Saitou and Nei, 1987). Bootstrap analyses (Felsenstein, 1985) consisted of 1000 replicates. The neighbour-joining trees were visualized with the TREEVIEW program (Page, 1996). Bootstrap values below 500 are not shown on the phylograms.
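For readers unfamiliar with the neighbour-joining method, a minimal Python sketch of the Saitou and Nei (1987) procedure is given here. It is not the Clustal X implementation and performs no bootstrapping; the function name `neighbour_joining` and the Newick formatting choices are illustrative assumptions. It takes a symmetric distance matrix and returns an unrooted tree as a Newick string.

```python
from typing import List


def neighbour_joining(names: List[str], dist: List[List[float]]) -> str:
    """Return an unrooted neighbour-joining tree as a Newick string."""
    if len(names) < 3:
        raise ValueError("neighbour joining needs at least three taxa")
    labels = list(names)
    d = [list(row) for row in dist]  # work on a copy of the matrix

    while len(labels) > 3:
        n = len(labels)
        totals = [sum(row) for row in d]
        # Choose the pair (i, j) that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the new internal node to i and to j.
        li = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        keep = [k for k in range(n) if k not in (i, j)]
        # Distances from the new node to each remaining taxon.
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        new_label = f"({labels[i]}:{li:.4f},{labels[j]}:{lj:.4f})"
        # Rebuild the matrix without i and j, then append the new node.
        d = [[d[a][b] for b in keep] for a in keep]
        for row, x in zip(d, new_row):
            row.append(x)
        d.append(new_row + [0.0])
        labels = [labels[k] for k in keep] + [new_label]

    # Three nodes remain: join them at a single central node.
    a, b, c = labels
    la = 0.5 * (d[0][1] + d[0][2] - d[1][2])
    lb = 0.5 * (d[0][1] + d[1][2] - d[0][2])
    lc = 0.5 * (d[0][2] + d[1][2] - d[0][1])
    return f"({a}:{la:.4f},{b}:{lb:.4f},{c}:{lc:.4f});"
```

In a bootstrap analysis, this construction would be repeated on resampled alignments and the frequency of each grouping recorded; with 1000 replicates, the display threshold of 500 corresponds to 50% support.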

Pairwise alignment
The Matrix Global Alignment Tool (MatGAT) v. 2.01 was used to compare the viral cDNA and protein sequences in pairwise analyses (Campanella et al., 2003). All cDNA sequences were evaluated using a first gap penalty of '70' and an extending gap penalty of '1', while all protein sequences were compared using the program defaults.
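To illustrate what a pairwise identity value represents, the Python sketch below computes percent identity over a Needleman-Wunsch global alignment. It is a simplification and not MatGAT's algorithm: it uses a single linear gap penalty rather than separate first-gap and extending-gap penalties, a plain match/mismatch score rather than a substitution matrix, and it assumes identity is the number of identical columns divided by the alignment length. The function name and default scores are hypothetical.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> float:
    """Percent identity of a and b over one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting alignment
    # columns and the columns in which both residues are identical.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += int(a[i - 1] == b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * identical / columns if columns else 0.0
```

Applying such a function to every pair of sequences yields an identity matrix of the kind reported in the supplementary figures, although the absolute values depend on the scoring scheme chosen.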

GenDistance analysis
The complete genomic sequences for each Alphavirus and Rubella virus were converted to FASTA files and analysed using the default settings of GenDistance (Chen et al., 2000). The output file was bootstrapped using the neighbour-joining function of the PHYLIP package (Felsenstein, 1985) and visualized as a rectangular cladogram using TreeView.

Results
Comparison of complete genomic cDNA and amino acid sequences
Rubella virus acts as a distant, but viable, outgroup because it stems from a separate node in the phylograms and its genes share similar functions to those of the genus Alphavirus (Frey, 1994) (Figure 1). Additionally, the identity matrix analysis reveals that Rubella virus is equally genetically distant from all alphaviruses. The genomic identity values for Rubella virus averaged 28.7%, while the amino acid values averaged 17.9% (on-line supplementary materials, Figure A), supporting the use of the Rubella virus as a practical outgroup in phylogenomic studies.
The members of the AV clade also demonstrate an 'outgroup-like' distance from the other members of the genus Alphavirus (Figure 1). Compared to the other alphaviruses, the Aquatic viruses (Salmon Pancreatic Disease Virus and Sleeping Disease Virus) demonstrate average cDNA and protein identity values of 43.8% and 38.5%, respectively (on-line supplementary materials, Figure A).
There are two subclades with values that indicate close genetic distance between clade members. All members of the Sindbis subclade of the SEE clade show an average cDNA identity of 72.2% and amino acid identity of 76.5% (on-line supplementary materials, Figure A). Individual protein identity values for the SEE clade are all greater than 50%, suggesting sufficient genetic distance for member inclusion in this clade. The members of the SF clade have cDNA identity values ranging from 53% to 97% with an average value of 63% (on-line supplementary materials, Figure A, upper matrix). The average protein identity value for the SF clade members is 65.5% (on-line supplementary materials, Figure A, lower matrix).
Figure 1. Phylograms examining the complete genomes of the alphaviruses. These trees were generated using Clustal X and neighbour-joining analysis (Thompson et al., 1997; Saitou and Nei, 1987). Each of the alphaviral clades is identified in bold. Rubella, Salmon Pancreatitis, and Sleeping Disease Viruses are in bold to denote their placement as outgroups. A) Whole cDNA alphaviral genomic sequences. B) Whole amino acid alphaviral genomic sequences.
As an alternative method of cDNA sequence analysis, the GenDistance program (Chen et al., 2000) was employed to compare the genomes of viral species. GenDistance utilizes data compression as a tool to retrieve information in genetic sequences. Genetic distance values are determined using shared information between compressed data sequences and calculated relatedness. The GenDistance-generated data were used to produce a phylogram that exhibits the same three clades (SF, Sindbis-Equine Encephalitis and Aquatic Virus) as were found using the sequence alignment assay of Clustal X (on-line supplementary materials, Figure B). This result provides independent support that our standard alignment analysis has generated phylogenies that are reliable and credible. Additionally, Rubella virus again acts as an appropriate outgroup in the GenDistance cladogram. Barmah Forest Virus is located within the SF clade of the GenDistance cladogram, confirming its membership in the SF clade. Finally, Igbo-Ora Virus and O'Nyong Nyong Virus diverge from Chikungunya.
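The idea of measuring relatedness through compression can be illustrated with a short Python sketch. It does not reproduce GenDistance: as a stand-in, it uses the general-purpose zlib compressor and the normalized compression distance, both of which are assumptions made for illustration, and the function names are hypothetical. Two sequences that share information compress better together than separately, which gives a smaller distance.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_distance(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 indicate that one sequence adds little new information
    to the other; values near 1 indicate little shared information.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A matrix of such distances for all genome pairs could then be passed to a distance-based tree method such as neighbour joining. Note that zlib only finds repeats within a 32 kB window, which is adequate for genomes of 11-12 kb but not for much longer sequences.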

Comparison of alphaviral structural protein sequences
The structural phylograms comparing cDNA and amino acid sequences indicate the same major three clades seen previously (Figure 2). Both phylograms (Figure 2)

Comparison of alphaviral non-structural protein sequences
Comparisons of the non-structural polyproteins demonstrate somewhat different results from the genomic analysis (Figure 4).
The cDNA identity values are comparable for all the alphaviruses to which Barmah Forest Virus is compared (on-line supplementary materials, Figure D). Western Equine Encephalitis Virus is closer to Venezuelan Equine Encephalitis Virus and Eastern Equine Encephalitis Virus than any other Alphavirus compared. The amino acid matrix results reveal highest identity values when BFV is compared to Igbo-Ora Virus (on-line supplementary materials, Figure D). WEEV demonstrates the closest cDNA identity values to VEEV and EEEV, while it also shares amino acid identity with most of the alphaviruses compared. This supports the Hahn et al. (1988) assertion that the non-structural polyprotein region of the WEEV genome retains homology with its EEEV progenitor.

Discussion
The antigenic complexes currently used within the alphaviral literature do not elucidate the extent to which these species are related genetically, nor do they provide a taxonomical designation for the most recently isolated alphaviruses. Though the study of Powers et al. (2001) incorporated only portions of genes, it is the most comprehensive alphaviral phylogenetic project to date. This work, however, did not appropriately assign the viruses into descriptive clades, nor did it use a sufficiently divergent virus as an outgroup to root phylogenetic analyses. In the present study, we have addressed these issues and recommended changes to alphaviral phylogenetics. We have revised the phylogenetic relationships of the Western Equine Encephalitis Complex, Barmah Forest, Middelburg, Salmon Pancreatic Disease, Sleeping Disease, and Southern Elephant Seal Viruses (Table 2). The complete genomic cDNA and amino acid analyses described represent the first time that whole genomic sequences have been utilized for phylogenomic analysis of alphaviruses. Our analyses address how to describe overall genetic relatedness of a genus of viruses that utilize myriad hosts and vectors within their various replication cycles and have speciated through recombination, divergence, and radiation. Complete genomic analysis widens the methodological focus with which alphaviruses are presently examined.
Our 'complete genomic' phylograms separate the alphaviruses studied into three major phylogenomic clades: the Semliki Forest (SF), the Sindbis-Equine Encephalitis (SEE), and the Aquatic Virus (AV) Clades (Figure 1). These phylograms allow conclusive classification of Salmon Pancreatic Disease and Sleeping Disease Viruses into their own clade. These results are internally consistent throughout our studies (Figs. 1, 2, and 4; on-line supplementary materials Figs. A-D), supporting our conclusions. The classification of SPDV and SDV into their own clade is underscored by the unique pathophysiological manifestations of these diseases (Boucher and Laurencin, 1996).
The pairwise alignment of the complete genomic sequences concurs with the neighbour-joining analyses. We replicated an E1 alignment by Weaver et al. (1997). The results produced by MatGAT are comparable to those generated by Weaver et al. (1997) using PAUP (data not shown). The identity values produced by MatGAT are within 1-2% of those produced by PAUP (Hahn et al., 1988; Levinson et al., 1990; Weaver et al., 1993; Weaver et al., 1994; Weaver et al., 1997; Powers et al., 2001), thereby rendering our comparisons to their phylogenetic work informative.
Rubella virus, a member of the Rubivirus genus and the only other member of Family Togaviridae, acts as a sufficient outgroup for each of the complete genomic phylograms, as well as for the structural and non-structural phylograms (Figs. 1-4).
The bootstrap values indicate a high degree of statistical significance in the relationships between the species.
Our assertions are strengthened by the results of the GenDistance analysis of the complete genomic cDNA. This program was designed not to align sequences, but to recognize patterns within alphabetical data strings. Our GenDistance cladogram shows the same three Alphavirus clades and identifies the presence of subclades within the SEE Clade. Once again, Rubella virus appears as an appropriate outgroup for rooting the phylogram.
Additionally, we have performed parsimony analysis (Felsenstein, 1989) with the complete genomic cDNA sequence. Our results (data not shown) mirror the relationships previously demonstrated by neighbour-joining analyses.
The anomalous Southern Elephant Seal Virus appeared to associate with the Equine Encephalitis Subclade in the cDNA phylogram (separating EEEV from VEEV), and the SF Clade in the amino acid phylogram. However, pairwise alignment data for every Alphavirus tested against SESV were in the average ranges of 23 ± 3% for the nucleotide sequence and 26 ± 5% for the amino acid sequence (on-line supplementary materials, Figure C). These results suggest significant genetic distance between SESV and the Equine Encephalitis Subclade. Because of these conflicting data, the procedure was repeated with the addition of EEEV 'South America' strain, in order to enhance the relative weight of the 'North America' strain (Figure 2). Moreover, we developed a complete structural polyprotein phylogram to further elucidate the relative distance of Barmah Forest and Southern Elephant Seal Viruses from the other members of the Semliki Forest Clade (Figure 3). We also included Middelburg Virus in this analysis. Through limited E1 partial protein phylogenetic analysis, MIDV has been shown to be a phylogenetically close relative to Semliki Forest Virus (Powers et al., 2001). However, it has also been demonstrated to be antigenically different from this group. Our structural polyprotein phylogram shows high bootstrap values for all nodes and a branching pattern indicating that, while distant, BFV and MIDV are members of the SF Clade (Figure 2).
The reticulate speciation that produced the recombinant members of the SEE Clade also requires separate structural, as well as non-structural, polyprotein analysis for informative phylogenomic review. Based upon previously published works, the Western Equine Encephalitis Complex includes Aura, Sindbis, Ockelbo, WEEV, Highlands J, Fort Morgan, and Buggy Creek Viruses (Hahn et al., 1988;Levinson et al., 1990;Weaver et al., 1997;Powers et al., 2001). Our complete structural and non-structural polyprotein cDNA and amino acid phylogenomic analyses support the conclusions of the previously published literature.
Our structural cDNA and amino acid phylograms place FMV, BCV, HJ, and WEEV into their own subclade, distinct from the Venezuelan and Eastern Equine Encephalitis Viruses and from the Sindbis-like Viruses (Figure 2). The EEEV origin of the non-structural portion of the WEEV genome is indicated by the classification of WEEV into a subclade with VEEV and EEEV in the multiple pairwise alignment. These assertions are validated by the consistent bootstrap values of 1000 for the divergence of the Sindbis, Recombinant, and Equine Encephalitis subclades in both the structural and non-structural phylograms (Figs. 2 and 4).
There are two alphaviral evolutionary scenarios presently considered most parsimonious (Powers et al., 2001). The 'New World Origin' hypothesis consists of three movements between the Old and New World: 1) an ancestor of the SF Clade was relocated from the New to the Old World; 2) an ancestor of SIN was relocated from the Old to the New World; 3) an ancestor of MAYV was relocated from the Old to the New World. The 'Old World Origin' hypothesis disagrees only with the first point of the New World Origin hypothesis; relocations two and three remain the same. The Old World Origin hypothesis posits that an ancestor of the Equine Encephalitis Viruses relocated from the Old to the New World.
Mayaro and Aura Viruses (both geographically New World viruses) appear within Old World clades or subclades in each of our phylograms; therefore, we concur with events two and three of the hypotheses. Though the work of Powers et al. (2001) addresses Salmon Pancreatic Disease, Sleeping Disease, and Southern Elephant Seal Viruses, it makes little attempt to describe these species within either of these hypotheses.
In light of our conclusions, we have revised the theoretical cladogram (Weaver et al., 1997) of alphaviral evolution to include SPDV, SDV, and SESV, showing divergence of these viruses prior to the events forming the Semliki Forest and Sindbis-Equine Encephalitis Clades (Figure 5). Based upon our complete genomic sequence analyses, we have modified the phylogenetic relationships in the Semliki Forest Complex to add Mayaro and Barmah Forest Viruses and show divergence of Chikungunya Virus, to form O'nyong nyong and Igbo-Ora Viruses (Figure 5). Lastly, we have removed subtypes of the major viruses from our cladogram to clarify our phylogenomic relationships.
Our three-clade taxonomical designation takes the methodology of alphaviral phylogenetic study in the direction of phylogenomics. Complete genomic, structural polyprotein, and non-structural polyprotein sequence analyses are able to elucidate alphaviral phylogenomics on a broad scale. These investigations provide a basis for more detailed substrain explorations, constrained only by availability of sequence data. As more Alphavirus sequencing projects are completed, and more alphaviruses are isolated, a more conclusive picture of alphaviral phylogeny may be drawn.