Insights into Circovirus Host Range from the Genomic Fossil Record

Advances in DNA sequencing have dramatically increased the rate at which new viruses are being identified. However, the host species associations of most virus sequences identified in metagenomic samples are difficult to determine. Our analysis indicates that viruses proposed to infect vertebrates (in some cases being linked to human disease) may in fact be restricted to arthropod hosts. The detection of these sequences in vertebrate samples may reflect their widespread presence in the environment as viruses of parasitic arthropods.

circovirus 2 (PCV-2), which causes postweaning multisystemic wasting syndrome in swine. The genus Cyclovirus, in contrast, is comprised entirely of viruses that have been identified only via sequencing and for which host species associations are less clear. Nevertheless, cycloviruses have frequently been associated with pathogenic conditions in humans and domestic mammals. For example, cyclovirus sequences have been detected in the cerebrospinal fluid of humans suffering from acute central nervous system disease in Vietnam and Malawi (3,4). Cyclovirus sequences have also been reported in association with numerous other outbreaks of disease in humans and domestic mammals (5)(6)(7).
Sequences derived from circoviruses have been shown to be present in the genomes of many eukaryotic species (8,9). These endogenous circoviral elements (CVe) are thought to be derived from the genomes of ancient circoviruses that were, by one means or another, ancestrally integrated into the nuclear genome of germ line cells (10). CVe can provide unique information about the long-term coevolutionary relationships between viruses and hosts; for example, the identification of ancient CVe in vertebrate genomes shows that viruses in the genus Circovirus have been coevolving with vertebrate hosts for millions of years (11).
We recently reported the results of a study in which we systematically screened vertebrate whole-genome sequence (WGS) data for CVe (11). Here, we expanded this screen to include a total of 675 animal genomes, including 307 invertebrate species. Via screening, we identified novel examples of sequences derived from circoviruses, cycloviruses, and the more divergent circular Rep-encoding single-stranded DNA (CRESS-DNA) group. We examine the phylogenetic relationships between these sequences, well-studied circovirus isolates, and circovirus-related sequences recovered via metagenomic sequencing of environmental samples or animal tissues. Our analysis raises important questions about the origins of cyclovirus sequences in samples derived from humans and other mammals and about their role in causing disease in these hosts.
(This article was submitted to an online preprint archive [12].)

Identification of CVe in animal genomes.
We screened WGS data of 675 animal species (see Table S1 in the supplemental material) in silico to identify sequences related to circoviruses. We identified 300 circovirus-related sequences in total, 76 of which have not been reported previously (Tables 1 and S3). To investigate the novel sequences identified in our screen, each sequence was virtually translated and incorporated into a multiple-sequence alignment that included a representative set of previously reported circoviruses and CVe (Table S2). Incorporation of CVe sequences into an alignment provided a basis for determining their genetic structures and investigating their phylogenetic relationships to circoviruses (Fig. 1).
All of the newly identified sequences were derived from rep; no novel sequences derived from circovirus cap genes were detected. We identified two novel CVe derived from viruses in the genus Circovirus in fish genomes ( Table 1). One of these, identified in the tomato clownfish (Amphiprion frenatus), appeared to be an ortholog of a CVe locus previously identified in other perciform fish (11). The other, identified in a mormyrid fish, was clearly related to CVe previously identified in ray-finned fish (11,13), but as it comprised a relatively short fragment of the rep gene, its more precise phylogenetic relationship to these CVe could not be determined with confidence.
We identified 93 circovirus-related sequences in invertebrate genome assemblies, 71 of which have not been reported previously (Tables 1 and S3). Of these, a relatively high proportion exhibited coding potential. Some occurred on short contigs and could potentially have been derived from contaminating virus. However, we found that, in many cases, at least one of the circovirus-related sequences identified in a WGS assembly was incorporated into a contig that was easily large enough to contain an entire circovirus genome and was flanked by Ͼ3 kb of genomic sequence and thus likely to represent CVe. On this basis, we estimate that 60 of the 93 sequences we identified in invertebrate genomes are likely to be derived from CVe. Sequences that occurred on short contigs, particularly those that lacked any in-frame stop codons or frameshifts (Table 1), might instead be derived from contaminating virus.
Maximum likelihood (ML) phylogenies were reconstructed using an alignment of Rep proteins and disclosed two robustly supported, monophyletic clades corresponding to the Circovirus and Cyclovirus genera (1). In line with our previous investigations (11), we found that all Rep-related sequences from vertebrate WGSs grouped with circoviruses, with the exception of a highly divergent sequence identified in the genome of a jawless vertebrate (the hagfish Eptatretus burgeri). All sequences derived from invertebrate WGSs grouped with cycloviruses or with divergent CRESS-DNA viruses (e.g., Avon-Heathcote Estuary-associated circular virus 24 [14]). CRESS-DNA virus-like sequences from distinct species tended to emerge on relatively long branches, and bootstrap support for branching patterns in this region of the phylogeny were generally quite low. The low resolution in this part of the phylogeny likely reflects the lack of adequate sampling of viruses from invertebrate species.
Some sequences from invertebrate WGSs were observed to cluster with cycloviruses in phylogenies, including CVe detected in two parasitic mite species (Varroa destructor and Tropilaelaps mercedesae) and a third detected in the genome of the elongate twig ant (Pseudomyrmex gracilis). The last element, here referred to as CVe-Pseudomyrmex, appeared to be no more distantly related to contemporary cycloviruses than many of them are to one another, including some that are associated with vertebrates (at least superficially) (Fig. 1). Because this seemed a little surprising, we sought to confirm the presence of CVe-Pseudomyrmex in the twig ant germ line. We obtained genomic DNA from four species of ant within the Pseudomyrmex genus (P. gracilis, Pseudomyrmex elongatus, Pseudomyrmex spinicola, and Pseudomyrmex oculatus), including three distinct populations of P. gracilis. We then used PCR to amplify a region encompassing part of the CVe and part of the genomic flanking sequence. We obtained an amplicon of the expected size in all three DNA samples of P. gracilis; all other samples were negative CVe Yes a Number of distinct sequences disclosing similarity to circovirus proteins that were identified in species WGS data. b Species genomes that were confirmed as containing CVe are indicated, based on the presence in WGS assemblies of at least one contig containing regions of circovirus homology flanked by Ͼ3 kb of genomic sequence. c Data indicate which species contained circovirus-derived sequences in which protein-coding potential was maintained across the entire length of the detected region of circovirus homology, and this region was at least 200 nucleotides in length.
( Fig. 2). Sequencing of the amplicon confirmed that it was derived the genomic locus containing the CVe and contained both a portion of the CVe and a region of genomic flanking sequence.
Mapping the host associations of circoviruses and cycloviruses. In phylogenies based on Rep, clades corresponding to the Circovirus and Cyclovirus genera contained a mixture of (i) CVe from WGS assemblies, (ii) sequences obtained from virus isolates, and (iii) sequences obtained from metagenomic samples ( Fig. 1 and Table S1). Among circoviruses (genus Circovirus), host associations at the level of class appear relatively stable. For example, beak and feather disease virus (BFDV) groups robustly with a CVe that entered the germ line of birds of the order Passeriformes ϳ38 million years ago (mya) (11), while barbel circovirus (BarbCV) groups robustly with CVe from the genome of the golden-line barbel in a well-supported clade containing numerous CVe from ray-finned fish. The only sequence that superficially seems to contradict this pattern is "chimpanzee" circovirus, which groups robustly with avian viruses. However, the name of this sequence is misleading: it was recovered from chimpanzee feces, but no host association is known. Indeed, the possibility that it might derive from an avian circovirus was noted at the time it was reported (15).
We observed three well-supported sublineages in the clade corresponding to the Cyclovirus genus, here termed cycloviruses 1 to 3 (Fig. 1). For cycloviruses, the only confirmed host associations come from the CVe-Pseudomyrmex sequence reported above. Many of the cyclovirus sequences that have been identified via metagenomic sequencing are associated with arthropod species, such as dragonflies (16). However, others are associated with vertebrates and have names such as bat cyclovirus. The cyclovirus 1 group is exclusively comprised of viruses from vertebrate samples. In cyclovirus groups 2 and 3, however, sequences from vertebrate and invertebrate samples are extensively intermingled (Fig. 1), and clade structure does not reflect these host associations in any obvious way. Sequences from each host group appear to be dispersed randomly, and the branch lengths separating vertebrate from invertebrate viruses (and CVe) are relatively short in many cases.

DISCUSSION
In this study, we screened in silico 675 animal genomes and identified numerous sequences related to circoviruses, including many that have not been reported previously. We examined the phylogenetic relationships between these sequences, wellstudied circovirus isolates, and circovirus sequences recovered via metagenomic sequencing of environmental samples or animal tissues.
Most of the novel circovirus sequences reported here were identified in invertebrate genome assemblies. Many were highly divergent and are likely derived from uncharacterized CRESS-DNA virus lineages that infect invertebrate species. All of the newly identified sequences identified in our study were derived from rep genes; we did not detect any novel CVe with homology to circovirus cap genes. The preponderance of CVe derived from rep versus those derived from cap might reflect the greater heterogeneity of capsid sequences in general, which might lead to these sequences being generally harder to detect. Certainly, in the case of the more divergent invertebrate viruses, it is possible that the cap genes found in some lineages might not share any sequence homology with those sequenced previously. However, we note that even among CVe derived from viruses in the genus Circovirus, within which capsid sequences are comparatively conserved, sequences derived from rep are approximately twice as common as those derived from cap.
Among the factors that may have influenced the structures and types of CVe that we observe in animal germ lines are selection pressures that have led to these sequences being co-opted or exapted by host species. Interestingly, several of the confirmed CVe in our study lacked frameshifting mutations or in-frame stop codons (Table 1), indicating that they have been evolving under purifying selection relatively recently. We confirmed the presence of one such CVe (CVe-Pseudomyrmex) in three populations of Pseudomyrmex gracilis (Fig. 2). The fact that we did not detect the CVe-Pseudomyrmex sequence in other members of the genus suggests that it was incorporated into the P. gracilis germ line after this species diverged from P. elongatus, P. spinicola, and P. oculatus in the Miocene epoch (17). However, we cannot completely rule out that Pseudomyrmex is actually older since the failure to obtain an amplicon in other Pseudomyrmex species could be accounted for in other ways (e.g., sequence divergence in the regions targeted by PCR primers). Nevertheless, the occurrence of an apparently fixed, intact, and expressed circovirus rep gene in an ant genome provides further evidence that these genes have been co-opted or exapted by host species for as yet unknown functions. Functional genomic studies in insects indicate that endogenous viral element (EVE) sequences have been co-opted into RNA-based systems of antiviral immunity (18). Thus, one possible explanation accounting for the conservation of this sequence in CVe-Pseudomyrmex is that it is involved in immune defense although this would not necessarily require maintenance of an intact coding sequence.
Our study allowed the host associations of circoviruses and CVe to be examined in the context of their evolutionary relationships. With respect to this, the grouping of sequences for which the host associations are well established (i.e., CVe and viruses that have been investigated using methods in addition to sequencing) relative to sequences recovered from metagenomic samples was revealing. Prior to this study, the only host associations that had been robustly demonstrated were within the genus Circovirus. Circoviruses have been isolated from vertebrates, and in phylogenies based on Rep proteins, these isolates group together with vertebrate CVe in a well-supported clade. Furthermore, the host associations of circoviruses appear quite stable, with ancient CVe from particular host groups (e.g., orders or classes) sometimes seen grouping together with contemporary viruses from the same host groups (Fig. 1). Within the Circovirus clade there is only one sequence that seems to contradict this pattern. This sequence was recovered from a stool sample and, thus, as was observed when it was first reported (15), is likely to reflect environmental contamination.
The limited evidence available regarding the zoonotic potential of circoviruses suggests that they lack the capacity to be transmitted between relatively distantly related hosts (i.e., hosts in distinct classes or orders). For example, during the 1990s and early 2000s, porcine circovirus 1 (PCV-1) was inadvertently introduced into batches of live attenuated rotavirus vaccine as an adventitious agent. These vaccines were administered to millions of people (19), yet PCV-1 is not thought to have infected any humans as a result, indicating that powerful barriers to cross-species transmission are probably in effect.
Nevertheless, recent studies have identified some surprising cases wherein phylogenetic trees indicate apparent transmission of viruses between vertebrate classes (20). We see evidence for potential interclass transmission within one Circovirus subclade that contains sequences obtained from waterfowl and mammals (including mink, bats, dogs, and pigs) as well as CVe from reptile genomes (Fig. 1). Within this subclade, viruses of mink are robustly separated from porcine, canine, and waterfowl viruses by a CVe that was incorporated into the serpentine germ line Ͼ72 million years ago (11). Thus, the phylogeny indicates that at least one interclass transmission event is likely to have occurred within this clade. However, it should be noted that while the clustering patterns observed in this clade do suggest potential transmission of circoviruses between vertebrate classes, they also indicate that such events have occurred relatively infrequently during evolution since the CVe in snake genomes provide a minimum for the entire clade (assuming the root of this clade is as depicted in Fig. 1). Moreover, since clustering patterns that superficially appear to indicate host switches can be accounted for by multiple alternative evolutionary scenarios (21,22), caution is advisable when phylogenetic approaches are used to infer the relationships between parasites and their hosts, especially when sampling is limited.
If cross-species transmission of circoviruses between distinct mammalian orders does not occur readily, then transmission between arthropod and vertebrate hosts appears extremely unlikely. However, if we take the reported host species associations of cycloviruses at face value, we might conclude that transmission between distantly related species groups occurs frequently, particularly within the cyclovirus 3= lineage (Fig. 1). Importantly, CVe-Pseudomyrmex groups robustly within this lineage, and since all other taxa within the Cyclovirus clade have been recovered via metagenomic screening, this provides the first unambiguous evidence of a host association for cycloviruses, establishing with a high degree of confidence that they do indeed infect arthropods or, at the very least, have done so in the past. Furthermore, since the only proven associations for cycloviruses so far are with arthropods, contamination of vertebrate samples with viruses derived from arthropods is perhaps the most parsimonious explanation for the host associations observed here. Contamination from arthropod sources such as dust mites can presumably occur fairly easily, given their ubiquity (23). Intriguingly, with respect to this we identified putative cyclovirus CVe in the genomes of two distinct mite species ( Fig. 1 and Table 1).
Since there is always a risk of being misled by contamination when viruses are identified via sequencing-based approaches, we propose that host associations of circoviruses identified via sequencing should be viewed with caution where they are found to strongly contradict established host associations within well-defined clades, particularly at higher taxonomic levels (e.g., phylum, class, or order). Whereas the weight of evidence may favor cyclovirus groups 2 and 3 being exclusively arthropod viruses that frequently contaminate vertebrate samples, the status of the cyclovirus 1 lineage is perhaps more equivocal. This basal lineage is comprised exclusively of sequences obtained from mammalian samples and includes cycloviruses proposed to cause disease in humans (cyclovirus VN and human cyclovirus VS5700009). Conceivably, these sequences could represent a mammal-or vertebrate-specific lineage of cycloviruses that is distinct from arthropod-infecting lineages. Notably, however, falsepositive detection of human cyclovirus VS5700009 has been reported (24).
Virus sequences recovered from metagenomic samples can be investigated by examining their phylogenetic relationships to other viruses for which host associations have been established. The work performed here demonstrates the utility of endogenous virus sequences in this process. This approach can be generalized to inform metagenomics-based virus discovery and diversity mapping efforts for any virus group that has generated endogenous sequences.

MATERIALS AND METHODS
Sequence data. Whole-genome sequence (WGS) assemblies of 675 species (see Table S1 in the supplemental material) were downloaded from the National Center for Biotechnology Information (NCBI) website. We obtained a representative set of sequences for the genus Circovirus and a nonredundant set of vertebrate CVe sequences from an openly accessible data set we compiled in our previous work (11). This data set was expanded to include a broader range of sequences in the family Circoviridae, including representative species in the Cyclovirus genus and the more distantly related CRESS-DNA viruses (Table  S2). We used GLUE, an open, data-centric software environment specialized in capturing and processing virus genome sequence data sets (25), to collate the sequences, alignments, and associated data used in this investigation. These data are available in the publicly accessible DIGS-for-EVEs (database-integrated genome screening for EVEs) online repository (https://github.com/giffordlabcvr/DIGS-for-EVEs).
Genome screening in silico. Genome screening in silico was performed using the DIGS tool (26). The DIGS procedure comprises two steps. In the first step, the basic local alignment search tool (BLAST) program (27) is used to search a genome assembly file for similar to a particular probe (i.e., a circovirus Rep or Cap polypeptide sequence). In the second, sequences that produce statistically significant matches to the probe are extracted and classified by BLAST-based comparison to a set of virus reference genomes (Table S2). Results are captured in a MySQL database.
Newly identified CVe identified in this study were assigned a unique identifier (ID) according to a convention we established previously (11). The first component of the ID is the classifier CVe. The second is a composite of two distinct subcomponents separated by a period: the name of the CVe group (usually derived from the host group in which the element occurs, e.g., Carnivora) and a numeric ID that uniquely identifies the insertion. Orthologous copies in different species are given the same number but are differentiated using the third component of the ID that uniquely identifies the species from which the sequence was obtained. Unique numeric IDs were assigned to novel CVe with reference to those used in the previously assembled data set (11).
Amplification and sequencing. Genomic DNA was extracted from ant tissue samples following the Moreau protocol (37) and a DNeasy blood and tissue kit (Qiagen). PCR amplification of CVe-Pseudomyrmex was performed using two sets of primer pairs designed with Primer3 (http://bioinfo.ut .ee/primer3-0.4.0/), each comprising one primer anchored in the CVe sequence and another anchored in the genomic flanking sequence. Primer pair 1 amplified a sequence that was 694 bp long, and primer pair 2 amplified a sequence that was 286 bp long. Primers were tested using illustra PuReTaq Ready-To-Go PCR beads (GE Healthcare). A temperature gradient PCR was performed to assess the optimum annealing temperature for the specific primer pairs. PCR was then performed using the genomic DNA ant extractions. The PCR conditions for this run were the following: an initial denaturation stage of 5 min at 95°C, 30 cycles of 30 s of denaturing at 95°C, 30 s of annealing at 49.7°C for primer pair 1 and at 62°C for primer pair 2, and extension at 72°C for 1 min, with a final extension after this program at 72°C for 5 min. Each run included a negative control. Amplification products (800 to 1,000 bp) for each PCR were excised and run on agarose gels. Bands of the expected size were excised, purified, and sequenced via Sanger sequencing (38).