Molecular Analysis of the Complete Genome of a Simian Foamy Virus Infecting Hylobates pileatus (pileated gibbon) Reveals Ancient Co-Evolution with Lesser Apes

Foamy viruses (FVs) are complex retroviruses present in many mammals, including nonhuman primates, where they are called simian foamy viruses (SFVs). SFVs can zoonotically infect humans, but very few complete SFV genomes are available, hampering the design of diagnostic assays. Gibbons are lesser apes widespread across Southeast Asia that can be infected with SFV, but only two partial SFV sequences are currently available. We used a metagenomics approach with next-generation sequencing of nucleic acid extracted from the cell culture of a blood specimen from a lesser ape, the pileated gibbon (Hylobates pileatus), to obtain the complete SFVhpi_SAM106 genome. We used Bayesian analysis to co-infer phylogenetic relationships and divergence dates. SFVhpi_SAM106 is ancestral to other ape SFVs with a divergence date of ~20.6 million years ago, reflecting ancient co-evolution of the host and SFVhpi_SAM106. Analysis of the complete SFVhpi_SAM106 genome shows that it has the same genetic architecture as other SFVs but has the longest recorded genome (13,885-nt) due to a longer long terminal repeat region (2,071 bp). The complete sequence of the SFVhpi_SAM106 genome fills an important knowledge gap in SFV genetics and will facilitate future studies of FV infection, transmission, and evolutionary history.


Introduction
Foamy viruses (FVs) belong to the Retroviridae subfamily of Spumaretrovirinae, which have fundamentally different replication strategies compared to other complex retroviruses, such as human immunodeficiency virus (HIV). For example, FVs differ from other retroviruses in how they initiate infection. Like complex DNA viruses, FVs use two promoters for gene expression, one in the long terminal repeat (LTR) and one in the envelope (env) gene [1]. In addition, FVs complete reverse transcription within the virion before infection of a new host cell, such that the SFV genome can be double-stranded DNA or single-stranded RNA [2,3]. FVs infect a wide range of mammals, including cows, horses, cats, and nonhuman primates (NHPs) in which they are called simian foamy viruses (SFVs). SFVs have been identified in nearly every primate species examined across all continents where NHPs exist. Phylogenetic analyses show each of these NHPs to be infected with species-specific variants, reflecting a generally ancient co-evolution of SFVs and their hosts [3][4][5][6][7][8][9][10][11][12][13][14][15]. Like other simian retroviruses, SFVs can recombine and cross-species infections occur, both of which can complicate their deep-sequencing [43], a technique that uses random priming instead of relying on sequence-specific approaches, thus allowing molecular characterization of divergent viral sequences. In addition to in silico characterization of the full-length SFVhpi_SAM106 genome, we also analyzed evolutionary relationships to other complete monkey and ape SFVs by using non-simian FVs as outgroups.

Blood Sample Processing, Co-Culture, and PCR Identification of a Novel Divergent SFV in Gibbons
SFVhpi_SAM106 was previously isolated from peripheral blood mononuclear cells (PBMCs) prepared from whole blood from a H. pileatus (SAM106) co-cultured with canine thymocyte Cf2Th cells as described in detail elsewhere [42]. Briefly, frozen PBMCs were thawed, stimulated with interleukin-2 for three days at 37 • C, washed with media, and incubated with an equal number of Cf2Th cells. Cultures were monitored every 3 to 4 days for syncytial cytopathic effect (CPE) typical of FV. CPE was visible on day 38, whereupon infected Cf2Th cells and viral supernatants were collected and stored frozen in liquid nitrogen until use. This captive gibbon was wild-caught and was about 30 years old at the time of specimen collection. Previous PCR and sequence analysis of integrase sequences within the pol gene from this gibbon confirmed the presence of a highly divergent SFV distinct from other ape SFVs [42]. During that same study, we similarly isolated SFVhle from a H. leucogenys but were unable to recover virus from the stored tissue culture supernatants. As described in the initial publication, NHP blood specimens were obtained opportunistically in accordance with the animal care use committees at each participating institution [42].

Next Generation Sequencing and SFVhpi_SAM106 Genome Assembly
In total, 1 mL of tissue culture supernatant was centrifuged at 5000× g at 4 • C for 5 min with subsequent filtration through a 0.45 µm filter (Millipore, Billerica, MA, USA) to remove any residual host cells. Viral nucleic acids were then isolated using the Qiagen QIAamp MinElute virus spin kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions, except that carrier RNA was omitted. The eluted nucleic acids were then treated with DNase I (DNA-free, Ambion, Austin, TX, USA) and cDNA was generated by priming with random hexamers using the Superscript double-stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA). cDNA was purified using the Agencourt Ampure XP system (Beckman Coulter, Brea, CA, USA) and approximately 1 ng of cDNA was subjected to simultaneous fragmentation and adaptor ligation ("tagmentation") using the Nextera XT DNA Sample Prep Kit (Illumina Systems, San Diego, CA, USA). Briefly, fragmented DNA was PCR-amplified with Nextera index primers (15 cycles) and purified using the Agencourt Ampure XP system. The resulting DNA library was sequenced using an Illumina MiSeq (MiSeq Reagent Kit V3, 600 cycle, Illumina Systems, San Diego, CA, USA). To analyze sequence data, raw sequences were de-multiplexed and converted to FastQ format using the CASAVA v1.8.2 software (Illumina). The processed reads were then imported into the CLC Genomics Workbench v7 (CLC bio, Aarhus, Denmark), trimmed to remove Nextera-specific transposon sequences as well as short and low quality reads, and assembled using the CLC de novo assembler. Both singleton and assembled contiguous sequences (contigs) were queried against the GenBank database (http://www.ncbi.nlm.nih.gov/GenBank) using the basic local alignment search tools blastn and blastx [12], with a high e-value cut-off of 10 and word sizes of 11 and three for blastn and blastx queries, respectively. Contigs with significant homology to SFV were then mapped to SFVmcy (macaque SFV; GenBank accession number NC_010819) as the reference sequence to generate the consensus SFVhpi_SAM106 sequence.

Sanger Sequencing of LTR Region
We amplified the LTR region using primers specific to SFVhpi_SAM106 in order to confirm the LTR length. DNA was extracted from the tissue culture cells and 500 ng was used as template in two separate nested PCRs using SFVhpi_SAM106-specific primers. One assay spanned the 5 LTR We performed an initial denaturation at 94 • C for 2 min. followed by 40 cycles of 94 • C for 1 min, 45 • C for 1 min; 72 • C for 2.5 min, with a final extension at 72 • C for 5 min. For both nested PCRs, we used 3 µL of the primary PCR product as template. Nested PCR products were electrophoresed in 1.0% agarose gels and visualized by ethidium bromide staining. For sequence analysis, the PCR products were purified using the Qiaquick gel extraction kit (Qiagen Inc., Valencia, CA) and then sequenced in both directions using a Big Dye terminator cycle kit (ThermoFisher Scientific, Waltham, MA) and additional internal SFVhpi_SAM106-specific sequencing primers to ensure sufficient coverage.
Using Geneious v7.0.6, we extracted the five coding regions, gag, pol, env, tas, and bet, and aligned them with representative SFVs with complete genomes from four other apes, two OWMs, four NWMs, one prosimian, and one FV each from equine, bovine, and feline hosts (Table 1). We also created a concatamer of the major FV coding regions (group-specific antigen (gag), polymerase (pol), and envelope (env)) to enable maximally robust phylogenetic analysis. Concatamers of major coding regions of other slowly evolving cell-associated retroviruses are commonly used for evolutionary analyses [45][46][47]. Finally, since recombination was reported recently in the surface protein (SU) region of env, we also prepared an alignment of the complete SFV env sequences of species in Table 1 and those from chimpanzee, gorilla, humans, and macaques used in the analyses by Galvin et al. and Richard et al. consisting of 48 taxa [19,48] All alignments were checked for evidence of potential recombination events using first a 400-nt window and a 40-nt step, and then a 200-nt window and 20-nt step, using the Kimura-2 parameter nucleotide substitution model with gap stripping in Simplot v 3.5.1 [49]. We also checked for recombination by using the Recombination Detection Program (RDP) v4 with the default parameters [50]. We performed codon-based nucleotide alignments using MAFFT v 7.017 [51], followed by manual adjustments and gap stripping. We used the model selection algorithm in MEGA v6 [52] to determine the best fitting nucleotide substitution model, which was inferred to be the general time reversible (GTR) model with gamma (Γ) distribution and invariable sites (GTR+Γ+I). Likelihood mapping of quartet topologies implemented in IQ-TREE v1.6.8 was used to check for evidence of good phylogenetic signal in the alignment [53]. We also checked the phylogenetic signal and evidence of nucleotide substitution saturation in the alignments with the program DAMBE v7.0.35 (http://dambe.bio.uottawa.ca/DAMBE/dambe.aspx).
Phylogenies and divergence dating were simultaneously inferred using Bayesian inference with the program BEAST v.1.8.4 [54]. For BEAST analysis, we created six taxon groups, including Hominoidae (great apes), Pan/Gorilla, NWMs, OWMs, OWM/Hominoidae, and all simians. We used the MAFFT alignments and enforced monophyly for both simian and non-simian taxon groups in the analyses. To evaluate the potential effect of nucleotide heterogeneity sometimes observed at third codon positions of RNA viruses on the phylogeny, we also conducted phylogenetic analysis after stripping the third codon positions (cdp) from the alignment. We used an uncorrelated lognormal relaxed molecular clock, a birth-death speciation tree prior, and 100 million Markov Chain Monte Carlo (MCMC) iterations with a 10% burn-in. To more accurately estimate divergence dates, we set priors for the time to the most recent ancestor (TMRCA) dates across the FV phylogeny using normal distribution priors and nuclear DNA split estimates for NHPs and other matching non-simian placental mammals from www.timetree.org [55,56]  We used Tracer v1.6 to ensure all parameters converged with effective sampling size (ESS) values >250. Trees were logged every 10,000 generations. Two independent BEAST runs were performed to ensure convergence and reliability of the results. We used TreeAnnotator v1.8.3 to choose the maximum clade credibility tree from the posterior distribution of 10,001 sampled trees with a burn-in value of 1000 trees. The inferred trees were visualized using FigTree v1.4.2 http://tree.bio.ed.ac.uk/software/figtree/).

Comparison of FV tRNA Binding Motifs
tRNA primer binding site sequences for SFVhpi_SAM106 were identified using tRNAdb (http: //trnadb.bioinf.uni-leipzig.de), a database that can be queried for tRNA binding motifs and outputs consensus and features of conservation for any selected set of tRNAs. FV tRNA motifs were compared to investigate potentially divergent primer binding sites.

Nucleotide Sequence Accession Number
The complete genome of SFVhpi_SAM106 has been deposited in GenBank with the accession number M621235.

SFVhpi_SAM106 Genome Assembly
Assembly of 38,000 paired-end reads yielded the complete SFVhpi_SAM106 coding genome with 380× coverage. The longest contig obtained by de novo assembly was 11,815 nt. The read lengths ranged from 175 to 250 bp. The average read length was 203.59 bp. The sequence of the complete genome was determined by manual alignment of overlapping 5 and 3 LTR regions to give a final length of 13,885 nt.

Organization of the SFVhpi_SAM106 Genome and Comparison with Other Ape SFVs
The SFVhpi_SAM106 genome consists of all expected structural, enzymatic and auxillary gene coding regions, including gag, pol, env, tas, and bet, together flanked by two LTRs (Figure 1). with a burn-in value of 1000 trees. The inferred trees were visualized using FigTree v1.4.2 http://tree.bio.ed.ac.uk/software/figtree/).

Comparison of FV tRNA Binding Motifs
tRNA primer binding site sequences for SFVhpi_SAM106 were identified using tRNAdb (http://trnadb.bioinf.uni-leipzig.de), a database that can be queried for tRNA binding motifs and outputs consensus and features of conservation for any selected set of tRNAs. FV tRNA motifs were compared to investigate potentially divergent primer binding sites.

Nucleotide Sequence Accession Number
The complete genome of SFVhpi_SAM106 has been deposited in GenBank with the accession number M621235.

SFVhpi_SAM106 Genome Assembly
Assembly of 38,000 paired-end reads yielded the complete SFVhpi_SAM106 coding genome with 380× coverage. The longest contig obtained by de novo assembly was 11,815 nt. The read lengths ranged from 175 to 250 bp. The average read length was 203.59 bp. The sequence of the complete genome was determined by manual alignment of overlapping 5′ and 3′ LTR regions to give a final length of 13,885 nt.

Organization of the SFVhpi_SAM106 Genome and Comparison with Other Ape SFVs
The SFVhpi_SAM106 genome consists of all expected structural, enzymatic and auxillary gene coding regions, including gag, pol, env, tas, and bet, together flanked by two LTRs (Figure 1).

Figure 1.
Genomic structure of SFVhpi_SAM106. Ets-1, ETS proto-oncogene 1 transcription factor motif; CAP, catabolite activator protein transcription motif; TATA box, promoter region motif; LTR, long terminal repeat; IP, internal promoter; gag, group-specific antigen; pol, polymerase; env, envelope; tas, transactivator gene; bet, between env and tas genes; U3, unique 3′ region of the LTR; R, repeat region of the LTR; U5, unique 5′ region of the LTR. The Bet protein is translated from a spliced RNA and residues from the 5′ part of tas indicated by the speckled region and dotted line.
Gene and genome length comparisons with other ape SFVs are provided in Table 2. SFVhpi_SAM106 has the longest recorded genome among ape SFVs owing to its longer LTRs, whereas the lengths of the five coding regions are comparable in size to those from other ape SFVs.  Figure 1. Genomic structure of SFVhpi_SAM106. Ets-1, ETS proto-oncogene 1 transcription factor motif; CAP, catabolite activator protein transcription motif; TATA box, promoter region motif; LTR, long terminal repeat; IP, internal promoter; gag, group-specific antigen; pol, polymerase; env, envelope; tas, transactivator gene; bet, between env and tas genes; U3, unique 3 region of the LTR; R, repeat region of the LTR; U5, unique 5 region of the LTR. The Bet protein is translated from a spliced RNA and residues from the 5 part of tas indicated by the speckled region and dotted line.
Gene and genome length comparisons with other ape SFVs are provided in Table 2. SFVhpi_SAM106 has the longest recorded genome among ape SFVs owing to its longer LTRs, whereas the lengths of the five coding regions are comparable in size to those from other ape SFVs. Nucleotide and amino acid identity comparisons of the major genes and proteins, respectively, of SFVhpi_SAM106 to those of other ape FVs are provided in Table 3. Sequence analysis showed the SFVhpi_SAM106 genome was nearly equidistant from other ape SFVs, sharing approximately 65% nucleotide identity across the gag-pol-env region of the coding genome. The highest gene and protein identities were seen with pol/Pol followed by env/Env, gag/Gag, tas/Tas, and bet/Bet. The LTR, at 2071 nt (positions 1-2071), was found to be the longest among the ape SFVs by about 300 to 800 nucleotides. We confirmed the LTR length by PCR using SFVhpi_SAM106 PCR primers and Sanger sequencing. Both the SFVhpi_SAM106_LTR-RU5 and bet-LTR fragments were 100% identical to the LTR obtained by NGS and were 1968 bp and 2029 bp in length after removing the primer sequences. The U3 region extends from positions 1 to 1704, which is about 300 nt longer than that of other ape SFVs; followed by the R region (positions 1705 to 1911), which is about 20 nt longer than that of other ape SFVs; and the U5 region (positions 1912 to 2071), which is about the same length as that in other ape SFVs. Three TATA box motifs were found at nucleotide positions 76, 239, and 396 upstream of the primer binding site (PBS), which is in turn 104 nt upstream of the start codon for gag. The poly A motif (AATAAA) is located at position 1889, the conserved 3 and central polypurine tracts (AGGAGAGGG) are located at positions 7973 and 11,813, respectively. The first two dimerization signals (DS) were highly conserved and are located at positions 2081 and 2140, but a potential third DS was not strictly conserved and consisted of AAAAGTC instead of AAAATGG found in other SFVs [57]. There are two Ets-1 transcription factor binding domains in the 5 LTR at positions 402 and 766. There are also three CAP (catabolite activator protein transcription motif) sites at positions 1647, 1692, and 1808. We also identified a polypurine tract (PPT) at positions 11,795, just upstream of the start of the 3 LTR. The conserved tRNAlys PBS motif 5 -TGG CGC CCA ACG TGG GGC-3 at positions 2074 to 2091 in SFVhpi_SAM106 is present in all available complete ape SFV genomes. The relatively conserved SFV Env/Orf-2 splice donor (AGTTGˆGTAATTT) and acceptor (TTTTAAGˆATAAT) sites are located at positions 10,292 and 10,411, respectively, and were predicted by NetGene2. The nucleotide identity of the SFVhpi_SAM106 LTR to other ape SFV LTRs ranged from 30% to 45%, with the closest identity to LTRs from chimpanzee SFVs.
The internal promoter (IP) was identified by comparison with other FVs and is located at env nucleotide positions 10,142 to 10,198 (5 -CAA GAG AA CATAAA AGA TCA AAT CGA GAG AGC AAC CGC AGA GC-3 ) (Figure 1). However, unlike the TATAAA box consensus motif of exogenous FV IPs, the SFVhpi_SAM106 TATA promoter box is more similar to that of two endogenous FVs (sloth and coelacanth) in that it starts with a cytosine (highlighted in the IP sequence in bold and italics) instead of a thymine [58]. A potential CAP site was identified 48 nucleotides downstream of the IP. Fifteen N-linked glycosylation sites were identified, of which one is in the LP region, 11 are in SU, and three are in TM. Ten of these are NXT variants and five are NXS. In comparison, SFVppy has 13 N-linked glycosylation sites of which one is in LP, nine are in SU, and three are in TM. Nine of these are NXT variants and four are NXS. The number, position, and composition of potential tas-response elements (TREs) varies among SFVs and requires in vivo experiments for confirmation [58]. By comparison with other SFVs, we identified a potential TRE upstream of the IP at position 10,021 (CTTAAAGGCAGAAGAGAAA). TREs in the LTR region could not be readily identified.
We observed that gag (1974 nt; positions 2178-4151), pol (3420 nt; positions 4102-7521), env (2966 nt; positions 7481-10,446), and tas (897 nt; positions 10,420-11,316) ORF lengths are similar to other ape SFVs (Tables 2 and 3). To identify the potential splice/acceptor sites for the bet gene, we used an online neural network prediction tool. Although we found evidence of a splice donor site in SFVhpi_SAM106 at positions 10,693 to 10,702 (5 -GAGGAATGAˆTAAGTTAAT-3 ) and a splice acceptor site at positions 10,991 to 11,010 (5 -CTCCTATTAGˆGTACACTGGG-3 ) indicating possible bet splicing, support was not strong (0.54 and 0.30, respectively). We also found a splice acceptor site within the bet ORF (positions 11,375-11,395, 5 -AATTCTCAGˆATGATGAGGAT-3 ) with strong support (0.96), which could potentially give rise to an alternate tas-bet fusion transcript 1065 nucleotides and 355 amino acids in length.
Notable motifs identified in silico in Gag include the cytoplasmic retention and targeting signal (CTRS) (GEWGFGDRYNVVQIVLQD) located at aa positions 39 to 56 with the highly conserved arginine (R) at position 46 essential for intracytoplasmic particle formation. The YXXL motif involved in particle assembly was highly conserved at positions 77 to 80. The P3-cleavage site (RSFN/TVSQ) is located at aa positions 624 to 631 in the carboxyl terminus. Analysis of the Gag sequence did not identify conserved P(T/S)AP late domain motifs in the center of the protein, but we did find two PPAP motifs at aa positions 269 to 272 and 296 to 299. These motifs are similar in number and location to those in SFVpve and SFVggo, but not SFVppy, in which the single PSAP motif is located near the end of Gag. We also identified an assembly domain (YEMLGL) at aa positions 462 to 467. Examination of Gag for glycine-arginine (GR) rich boxes involved in viral replication identified four potential GR-boxes at aa positions 485 to 509 (GGRGRGRNNRNAASGNTQGGNQRQSR), 515 to 534 (GRQSQGGRGRGSNNNTNSRQ), 538 to 562 (QNSSGYNLRPRTYNQRYGGGQGRR), and 595 to 612 (RGDQPRRSGAGRGQGGNR) compared to the three such motifs found in other SFVs. The highly conserved chromatin binding motif was located in the third GR at AA positions 542 to 548 (underlined above), which for other FVs is in GRII [59]. However, a nuclear localization signal (NLS) present in GRII of SFVpsc was not identified in the SFVhpi_SAM106 GRIII by using PSORTII and NucPred [59]. An NES at the N-terminus of Gag has been shown to be critical for the late stages of virus replication of SFVpsc (formerly HFV) and is partially conserved in SFVmcy and SFVcae [60]. Comparison with other SFVs identified a potential NES at aa positions 91 to 108 (LAFNGIGPAEGALRFGPL); however, NetNES predicted in the SFVhpi_SAM106 Gag a leucine-rich NES around aa positions 8 to 20 (LDVQELVLLMQDL), which is relatively conserved in other ape SFV Gag proteins. NetNes correctly predicted the reported NES in SFVpsc, SFVpve, and SFVggo, but not in SFVppy, SFVmcy, SFVcae, SFVocr, SFVcja, SFVaxx, SFVssc, SFVsxa, BFVbta, and EFVeca. An NES similar to that in SFVhpi_SAM106 was predicted for FFVfca.
Within the Pro-Pol polypeptide, the highly conserved protease catalytic center (DTGA) and reverse transcriptase (RT) catalytic site (YVDD) were located at aa positions 24 to 26 and 312 to 315, respectively. The DSF motif required for RNAse H activity located at positions 670 to 672 was also conserved. conserved (YVNN/XNXX) and located at aa positions 749 to 756, potentially coding for a 388 aa IN peptide. The IN catalytic center (DD35E) was found at D 898 D 936 E 972 and the zinc-finger motif at H 813 H 817 C 847 C 850 . Interestingly, we also identified a highly conserved TATA box motif at positions 4,840 to 4,846 in pol present in the RT region of all FVs with the consensus sequence (T(A/T)(T/C)AA(A/G) (Figure 2). In SFVhpi_SAM106, this TATAAA motif is 80 nucleotides upstream of the RT active site. For the 989-aa Env protein, the highly conserved WXXW motif required for Gag interaction and budding is located at aa positions 10 to 13 (WLIW). The cellular furin protease cleavage sites, which cleave the N-terminal leader peptide (LP) from the SU and the SU from the transmembrane (TM) protein, are located at aa positions 125 to 131 (RLAR/RSLR) and 570 to 577 (RKRA/TSSN) to generate three potential Env proteins of lengths 127 (LP), 446 (SU), and 416 (TM), respectively. The highly conserved hydrophobic WXXW motif in the LP required for Env incorporation and particle release is located at aa positions 10 to 13. The membrane-spanning domain located at aa positions 945 to 980 (AKGIFGTAFSLVAYVKPILIGIGVIILLVVIFKIIS) is partially conserved. The consensus KKK endoplasmic reticulum retrieval signal located at aa positions 694 to 696 of the TM (or positions 985 to 987 of Env) is partially conserved and has the sequence KAK, whereas SFVppy, SFVaxx, SFVggo_BAK74, and SFVssc motifs have the sequence KRK. A putative receptor-binding domain (RBD) is located in SU at aa positions 227 to 552 and the relatively conserved fusion peptide (LRSMGYALTGGIQTVSQI) is located in the TM at aa positions 581 to 598 of Env [61].
The ape SFV Tas proteins are highly divergent (Table 3) and hence identification of the poorly defined acidic activation and DNA binding domains was not possible. Tas is a nuclear protein involved in transcriptional transactivation with a partially conserved NLS at the 3′ end of the protein.
In SFVhpi_SAM106, the NLS GTGRKRRTN is located at aa positions 216 to 224 with a strong NucPred score of 0.87, whereas PSORT II identified RKRR in this motif as an NLS, but with a low score (−0.16). Neither prediction program found a bipartite NLS in SFVhpi_SAM106 or other ape SFV Tas proteins. PSORT II predicted with high reliability (94.1) that the SFVhpi_SAM106 Tas was a nuclear protein, similar to the Tas protein of SFVppy (reliability score = 94.1), but more likely than those from SFVggo and SFVpve/psc (reliability score = 89). The PSORT II predictor uses a heuristic algorithm, including neural networks, to identify nuclear proteins are rich in basic residues. The For the 989-aa Env protein, the highly conserved WXXW motif required for Gag interaction and budding is located at aa positions 10 to 13 (WLIW). The cellular furin protease cleavage sites, which cleave the N-terminal leader peptide (LP) from the SU and the SU from the transmembrane (TM) protein, are located at aa positions 125 to 131 (RLAR/RSLR) and 570 to 577 (RKRA/TSSN) to generate three potential Env proteins of lengths 127 (LP), 446 (SU), and 416 (TM), respectively. The highly conserved hydrophobic WXXW motif in the LP required for Env incorporation and particle release is located at aa positions 10 to 13. The membrane-spanning domain located at aa positions 945 to 980 (AKGIFGTAFSLVAYVKPILIGIGVIILLVVIFKIIS) is partially conserved. The consensus KKK endoplasmic reticulum retrieval signal located at aa positions 694 to 696 of the TM (or positions 985 to 987 of Env) is partially conserved and has the sequence KAK, whereas SFVppy, SFVaxx, SFVggo_BAK74, and SFVssc motifs have the sequence KRK. A putative receptor-binding domain (RBD) is located in SU at aa positions 227 to 552 and the relatively conserved fusion peptide (LRSMGYALTGGIQTVSQI) is located in the TM at aa positions 581 to 598 of Env [61].
The ape SFV Tas proteins are highly divergent (Table 3) and hence identification of the poorly defined acidic activation and DNA binding domains was not possible. Tas is a nuclear protein involved in transcriptional transactivation with a partially conserved NLS at the 3 end of the protein.
In SFVhpi_SAM106, the NLS GTGRKRRTN is located at aa positions 216 to 224 with a strong NucPred score of 0.87, whereas PSORT II identified RKRR in this motif as an NLS, but with a low score (−0.16). Neither prediction program found a bipartite NLS in SFVhpi_SAM106 or other ape SFV Tas proteins. PSORT II predicted with high reliability (94.1) that the SFVhpi_SAM106 Tas was a nuclear protein, similar to the Tas protein of SFVppy (reliability score = 94.1), but more likely than those from SFVggo and SFVpve/psc (reliability score = 89). The PSORT II predictor uses a heuristic algorithm, including neural networks, to identify nuclear proteins are rich in basic residues. The SFVhpi_SAM106 Tas  Although the bet splice acceptor site described above was not strong, the SFVhpi_SAM106 Bet protein length of 483 aa was similar to that of other ape SFVs (SFVpve = 490 aa, SFVpsc = 482 aa, SFVggo = 481 aa, SFVppy = 464). Comparison with these other ape SFV Bet proteins identified a potential integrin-binding motif (K/RGD) at aa positions 306 to 308 that was partially conserved (KGT), with SFVpve, SFVggo, and SFVppy having a KGD motif and SFVpsc having an RGD motif. The tripeptide RGD domain has been shown to be required for cell membrane binding, so it will be important to examine the functionality of the D > T mutation at the third aa position in the SFVhpi_SAM106 Bet. One study has proposed the SFVpve Bet is secreted in both the cytoplasm and nucleus of infected cells and contains a bipartite NLS in the C-terminus [62]. PSORT II does detect the bipartite NLS RKIRTLTEMTQDEIRKR at aa positions 463 to 479 of SFVpsc Bet, but with a cytoplasmic protein reliability prediction of 70.6% rather than a nuclear one. NucPred did not predict a NLS in the SFVpsc Bet. Neither NucPred nor PSORT II identified any putative NLS in the SFVhpi_SAM106, SFVpve, SFVggo, and SFVppy Bet proteins.

Absence of Evidence of Genetic Recombination in the SFVhpi_SAM106 Genome
Simplot analyses on the gag-pol-env concatamer alignment did not show any evidence of genetic recombination at a threshold of 70% of permuted trees, a cutoff commonly used for the analysis of other retroviruses (data not shown). These results were consistent across two window and step sizes in the analysis. We also found no significant evidence of recombination using the recombination detection program RDP4 using eight methods (RDP, GENCONV, Chimaera, MaxChi, BootScan, SiScan, 3Seq, and LARD). We did not find any evidence of recombination of the SFVhpi_SAM106 env using the 48 taxa dataset by phylogenetic and RDP analysis, but we did confirm the two different SU RBD SFV clades as previously reported (Supplementary Figure S1) [19,48]. Phylogenetic analysis of the two alignments encompassing the complete env and the region without the RBD in SU showed the typical co-evolutionary history of FV with SFVhpi_SAM106 ancestral to SFVppy, SFV OWMs, and then chimpanzees and gorillas with strong support. In the analysis of only the BD region of SU, SFVhpi_SAM106 was ancestral to the chimpanzee and gorilla SFVs, but with no support. In addition, SFVppy clustered FFVfca with good support between the NWM SFV and the other OWMA SFVs. Combined, these results suggest an absence of recombination in SFVhpi_SAM106 and that genetic recombination did not affect our phylogenetic results when using the gag-pol-env concatamer.

Evolutionary Relationships and Divergence Dating of SFVhpi_SAM106 and Other FVs
Likelihood mapping of the 15 taxa 7,412 position gag-pol-env concatamer alignment showed an equal distribution of the majority of possible quartets across the tree of which 98.4% were fully resolved, i.e., tree-like, with only 1.68% of unresolved quartets. These results suggest an overall dataset with very good phylogenetic signal and very little "noise" and hence suitable for phylogenetic reconstruction. Excellent phylogenetic signal was also found in both gag-pol-env alignments with scores >99.3 using DAMBE. Little nucleotide substitution saturation was found in the alignments using the method of Xia in DAMBE [63]. Together, our results indicate the alignments were satisfactory for phylogenetic analysis.
Phylogenetic trees generated using Bayesian inference of the gag-pol-env concatamer showed that FV sequences from a broad range of genetically diverse NHPs and non-simians formed monophyletic lineages and distinct clusters that mirrored host taxonomic relationships (Figure 3). SFVhpi_SAM106 clustered with other ape SFVs with strong posterior probability (PP > 1) support (Figure 3). The FV phylogeny was similar to that seen in our previous study, where SFVhpi_SAM106 is a sister taxa of but ancestral to the great ape SFVs, mirroring the phylogeny of the host mitochondrial and nuclear sequences in which the lesser apes are an outgroup to the great apes [6,42]. An identical phylogeny was obtained with the gag-pol-env first and second cdp alignment, indicating the absence of substitution saturation at the third cdp in the analysis of the unstripped alignment ( Figure 3). As expected, the representative prosimian FV sequence from a galago (SFVocr) was ancestral to all other SFVs.
The three non-simian FVs from equine, feline, and bovine formed a clade separate from the SFVs, with BFV and EFV clustering together with strong support (PP = 1). All BEAST analyses had standard deviation values of the uncorrelated lognormal relaxed clock (ucld.stdev) greater than zero but less than one, indicating that variation in substitution rates across branches was not consistent with a strict molecular clock but also was not so great as to bias the analyses ( Table 4). The absence of site-to-site variation was also supported by alpha parameters of the gamma distribution above 1.0 (Table 4).
Viruses 2017, 9, x FOR PEER REVIEW 11 of 20 the absence of substitution saturation at the third cdp in the analysis of the unstripped alignment ( Figure 3). As expected, the representative prosimian FV sequence from a galago (SFVocr) was ancestral to all other SFVs. The three non-simian FVs from equine, feline, and bovine formed a clade separate from the SFVs, with BFV and EFV clustering together with strong support (PP = 1). All BEAST analyses had standard deviation values of the uncorrelated lognormal relaxed clock (ucld.stdev) greater than zero but less than one, indicating that variation in substitution rates across branches was not consistent with a strict molecular clock but also was not so great as to bias the analyses ( Table 4). The absence of site-to-site variation was also supported by alpha parameters of the gamma distribution above 1.0 (Table 4).
(A)   Bayesian dating analyses showed that for the gag-pol-env concatamer alignment, the TMRCA for SFVhpi_SAM106 and the Hominoidea was 20.69 mya with a 95% highest posterior density (HPD) interval of 19.13 to 22.19 mya (Table 4). This divergence date occurs during the Miocene epoch ( Figure 3). The TMRCA for the SFV OWM/ape split (Crown Catarrhini) was 27.28 mya with a 95% HPD interval of 27.28 to 30.09 mya. The SFVpsc, SFVpve/SFVggo split (Crown Homininae) had a TMRCA of 9.18 mya with a 95% HPD interval of 8.54 to 9.8 mya. For the OWM SFV (Crown Cercopithecinae), a TMRCA of 18.05 mya was inferred with a 95% HPD interval of 11.29 to 24.7 mya. In contrast, TRMCAs for the non-simian FVs were much older, estimated to be 92.4 mya (95% HPD interval of 85.03-98.69 mya). The TMRCA for the Boreoutherian (placental mammals) at the root of the FV phylogeny was estimated to be 95.46 mya (95%HPD interval of 90.08 to 101.19 mya) during the Upper epoch and Mesozoic era. Similar TMRCAs were inferred for the 12 cdp alignment (Table 4). Our inferred FV and host divergence dates were similar to those inferred by others using different methods, supporting the robustness of our dating methods ( Table 4). The inferred mean substitution rates ranged from 5.05 × 10 −9 to 9.82 × 10 −9 nucleotides/site/year across the gag-pol-env coding region (Table 4).

Discussion
Using a metagenomics approach, we obtained the first complete genome of an SFV from a lesser ape, the pileated gibbon (Hylobates pileatus). Next generation sequencing is a powerful molecular method and has been used to obtain complete genomes of other SFV isolates recently [13,67,68]. We characterized the SFVhpi_SAM106 genome by detailed sequence analysis and provided a deeper understanding of the evolutionary history of FVs, especially within the Hominoidea. Most of the important genomic structures and functional domains were conserved in SFVhpi_SAM106 with some exceptions. While the organization of the SFVhpi_SAM106 genome is similar to that of other FVs, it has several unique features, including an LTR at~2 kb that is 1.2 to 1.6 times longer than that in other FVs, attributable mostly to longer U3 and R regions. Given that the 5 LTR U3 and R regions of FVs contain transcriptional start control elements, including the tas response elements (TREs) and cellular transcription factors, it will be important to conduct transcription-mapping experiments to determine the functional significance of the longer LTRs of SFVhpi_SAM106 for replication and regulation. Since the number and location of FV TREs are not conserved, we were unable to locate the SFVhpi_SAM106 TREs within the LTRs or env sequences in silico. Others have shown that tissue culture of chimpanzee SFV in diploid human fibroblasts isolated from an infected human selects for isolates with substantial nonrandom nucleotide deletions in the U3 region can increase viral replication [69]. The U3 deletion variant was also present as the majority variant in the infected human but was absent from naturally-infected chimpanzees [69]. However, SFVhpi_SAM106 was isolated from infected PBMCs using canine thymocyte cells [42], which have not been shown to impact SFV LTR length or functionality in this dog cell line or in other non-human cells, including SFVggo and SFVcpz isolates from zoonotically-infected humans grown in baby hamster kidney cells (BHK-21) [57]. Comparisons of SFVhpi_SAM106 LTRs directly from PBMCs from the pileated gibbon and those obtained by culture may be needed to further evaluate this unusually long LTR.
Additional differences from other FV prototypes included four instead of three potential GR boxes in Gag and a TATA box promoter motif in the IP in env, which is more similar to those of two endogenous FVs from the sloth and coelacanth by starting with a cytosine instead of a thymine present in other FVs [58]. Since the GR and TATA box motifs are important for viral replication, studies are needed to determine the effect of these genetic differences on SFVhpi_SAM106 replication and growth. Some subtypes of human and simian immunodeficiency viruses (HIVs and SIVs) have the CATAAA TATA box sequence in their LTRs, which have not been shown to decrease LTR promoter activity by the HIV-1 transcription activator (Tat) protein in vitro [70]. Examination of the SFVhpi_SAM106 genome for additional TATA promoters identified a highly conserved TATA box upstream of the RT active site promoter in the FV genome and its possible effect on viral replication is needed.
FVs have been shown to have co-evolved with their hosts for millions of years as vertebrates diversified in the Paleozoic Era [5,6,64]. Our results expand knowledge of the evolutionary history of FVs and show for the first time that SFVhpi_SAM106 from the lesser ape, Hylobates pileatus, follows this same ancient co-speciation trajectory. The phylogenetic position of SFVhpi_SAM106 mirrors that of the gibbon host, with a divergence date of about 21 mya (95% HPD 19.13-22.19 mya) during the Miocene Epoch of the Cenozoic Era. The FV divergence dates inferred with our methods are strongly congruent with those of Aiewsakun and Katzourakis, who used Bayesian phylogeny and a time-dependent rate power-law decay function that is independent of archaeological calibrators [64]. Our Hylobatidae/Hominidae divergence dates are also consistent with those of Matsudaira [38]. In addition, our inferred SFV divergence dates are also highly consistent with those reported for the evolutionary histories of NHPs and other placental mammals examined in our study and had very close divergence dates with overlapping confidence intervals (Table 4). Together, these results show that our use of multiple host divergence calibration points provided a robust inference of FV evolutionary histories. As with all evolutionary studies, the inferred divergence dates represent minimum ages that are younger than true ages, because the fossil record is incomplete.
Given the strong evidence of FVs co-diverging with their hosts, both FVs and their mammalian hosts [73] should have very similar evolutionary rates. Indeed, FVs have extremely low rates of evolution, similar to rates observed for mitochondrial protein-coding genomes of about 1 × 10 −8 substitutions/site/year [4,6,11]. However, our estimates were a log lower and are more similar to those of the mammalian neutral substitution rate for nuclear genes and for endogenous retroviruses of 1 × 10 −9 substitutions/site/year [74,75]. Our lower inferred FV nucleotide substitution rates may reflect the use of older calibration dates for the Equs caballus/Bos Taurus split and for Boreoutheria in the phylogenetic analyses.
Genetic recombination did not appear to affect our phylogenetic results. Phylogenetic analysis of only the FV env region with additional SFV taxa from the gorilla, chimpanzee, and macaque with evidence of a variant RBD from a potential recombination event showed that SFVhpi_SAM106 was ancestral to SFVppy and then all other OWMA SFVs instead of ancestral to only the apes as we found for the gag-pol-env concatamer alignments. Similar results were inferred even after removal of the putative recombination region in the RBD of SU, whereas in the RBD-only alignment, placement of the SFVhpi_SAM106 and SFVppy taxa were not resolved. Similar genetic relationships were observed by Richard et al. in SFVppy when examining complete gorilla and chimpanzee env sequences [19]. One exception in their analysis of the variant RBD region is that SFVppy clustered ancestral to chimpanzee and gorilla clade 1 instead of with FFVfca as in our analysis since non-simian FVs were not included in their analysis [19,48]. Such nonconforming FV co-evolutionary phylogenetic relationships may reflect the phenomenon called long branch attraction (LBA), which can occur when highly divergent lineages are included in the phylogenetic analysis [73]. LBA can potentially be resolved by the addition of additional sequence information [73], which we have done by using the gag-pol-env concatamer, or by inclusion of more SFV sequences from gibbons and orangutans when they become available.
As with other NHPs, gibbons are threatened by habitat encroachment for forest clearance, road construction, and changes in agriculture resources. Gibbons are also hunted for food, medicine, or for exportation for the pet trade (see also the International Union for Conservation of Nature's Red List at www.iucnredlist.org) [40,76]. Gibbons are also common members of zoological exhibits. All of these activities can lead to human exposures to gibbons and their microbial communities, including SFVs, which have crossed into humans from many NHP species [12]. Until now, only two short SFV integrase sequences were available to inform the design of sensitive PCR assays for the detection of human infection [42]. More sensitive molecular assays are needed to detect the low copies of integrated genomes typically found in SFV infection. The small number of complete SFV genomes from all available NHPs has also limited the design of generic SFV PCR primers. These limitations may help explain the inability to detect SFV sequences in seropositive pileated gibbons from Cambodia using generic PCR assays [28]. The availability of the complete SFVhpi_SAM106 genome from our study will facilitate the design of more sensitive and specific PCR assays for the detection of gibbon SFVs and fills an important knowledge gap in the SFV database, facilitating future studies of FV infection, transmission, and the evolutionary history of FVs.

Conclusions
By using a metagenomics approach, we obtained the complete SFVhpi_SAM106 genome from a pileated gibbon (Hylobates pileatus). Bayesian analysis showed that SFVhpi_SAM106 is ancestral to other ape SFVs with a divergence date of~20.6 million years ago, reflecting ancient co-evolution of the host and virus. Our molecular analysis also showed that the SFVhpi_SAM106 genome has the longest genome (13,885 nt) of all SFVs with complete genomes available, due to a much longer LTR (2071 bp). The complete sequence of the SFVhpi_SAM106 genome will provide invaluable information for further understanding the epidemiology and evolutionary history of SFVs.