The evolution, distribution and diversity of endogenous circoviral elements

Circoviruses (family Circoviridae) are small, non-enveloped viruses that have short, single-stranded DNA genomes. Circovirus sequences are frequently recovered in metagenomic investigations, indicating that these viruses are widespread, yet they remain relatively poorly understood. Endogenous circoviral elements (CVe) are DNA sequences derived from circoviruses that occur in vertebrate genomes. CVe can provide unique, retrospective information about the biology and evolution of circoviruses. In this study, we screened 362 vertebrate genome assemblies in silico to generate a catalog of CVe loci. We identified a total of 179 CVe sequences, most of which have not been reported previously. We show that these CVe loci reflect at least 19 distinct germline integration events. We determine the structure of CVe loci, identifying some that show evidence of potential functionalization. We also identify orthologous copies of CVe in snakes, fish, birds, and mammals, allowing us to add new calibrations to the timeline of circovirus evolution. Finally, we observed that some ancient CVe group robustly with contemporary circoviruses in phylogenies, with all sequences within these groups being derived from the same host class or order, implying a hitherto underappreciated stability in circovirus-host relationships. The openly available dataset constructed in this investigation provides new insights into circovirus evolution, and can be used to facilitate further studies of circoviruses and CVe.

Abbreviations
CVe: endogenous circoviral element
ORF: open reading frame
Cap: capsid
Rep: replicase
ssDNA: single-stranded DNA


Introduction
Circoviruses (family Circoviridae, genus Circovirus) are small, non-enveloped viruses with single-stranded DNA (ssDNA) genomes. Circovirus genomes are typically ~2 kilobases (kb) in length and contain only two open reading frames (ORFs): one encoding a non-structural replicase (Rep) protein, and a second encoding the viral capsid (Cap). The family contains two genera: Circovirus and Cyclovirus [1]. Over recent years, sequencing-based virus discovery efforts have identified many new members of these two genera [2]. However, little or nothing is known about most of the novel viruses that have been identified using these approaches. Only a handful of circoviruses have been investigated at a level beyond sequencing: porcine circoviruses 1 and 2 (PCV-1 and PCV-2), which infect swine, and beak and feather disease virus (BFDV), which infects various avian species [3]. Circovirus sequences are frequently recovered from tissue and environmental samples in metagenomic investigations, indicating that these viruses are widespread, yet they remain relatively poorly understood [4]. Endogenous circoviral elements (CVe) provide an unconventional but useful source of information about circovirus distribution, diversity and evolution. These sequences are derived from the genomes of circoviruses that circulated millions of years ago, and became integrated into the host germline [5,6]. Relatively robust minimum age estimates can be obtained for CVe via the identification of orthologous copies in distinct host lineages. On this basis, we now know that the association between circoviruses and vertebrates extends back millions of years before the present day [7,8].

In this study, we screened vertebrate genomes in silico to generate a comprehensive catalog of CVe.
We used these data to: (i) extract information about the long-term evolution of circoviruses; and (ii) generate an openly accessible data resource that can facilitate the further investigation of CVe and circoviruses.

Identification and analysis of CVe sequences
We used similarity searches to systematically screen genome assemblies of 362 chordate species (Table S1) for sequences homologous to circovirus proteins. Vertebrate genome assemblies and circovirus reference genomes were obtained from the NCBI genomes resource. Screening in silico was performed using the database-integrated genome-screening (DIGS) tool. The DIGS procedure used to identify CVe comprises two steps. In the first, a circovirus probe sequence (e.g. a Cap or Rep protein sequence) is used to search a particular genome assembly file using the basic local alignment search tool (BLAST) program [9]. In the second, sequences that produce statistically significant matches to the probe are extracted and classified by BLAST-based comparison to a set of virus reference genomes (see Table S2). Results are captured in a MySQL database.

We inferred the ancestral ORFs of CVe (and the number of stop codons and frameshifts interrupting these ORFs) via a combination of automated alignment and manual adjustment. Multiple sequence alignments were constructed using MUSCLE [10] and PAL2NAL [11]. Manual inspection and adjustment of alignments was performed in Se-Al [12]. Phylogenies were constructed using maximum likelihood as implemented in RAxML [13], with the VT protein substitution model [14] as selected using ProtTest [15].

2.2. Construction of CVe sequence data resource

We used GLUE, an open, data-centric software environment specialized in capturing and processing virus genome sequence datasets, to collate the sequences, alignments and associated data used in this investigation. The aim was to create a standardized CVe data resource that would be openly accessible, and would facilitate the further use and development of the dataset assembled here. The project includes all the CVe identified by our in silico screen, as well as a set of representative reference sequences for the Circovirus genus (Table S2).
All of these sequences are linked to the appropriate auxiliary data; for the virus sequences, this includes information about the sample from which the sequence was obtained; for CVe, it includes the name of the genome assembly and contig in which the CVe sequence was identified, and its coordinates and orientation within that contig.

The project also includes the key alignments constructed in this study, linked together using the GLUE 'alignment tree' data structure. These include: (i) 'tip' alignments in which all taxa are CVe that are known or putative orthologs of one another; (ii) a 'root' alignment constructed to represent proposed homologies between the genomes of representative viruses in the genus Circovirus and the CVe recovered by our screen. Because each of these alignments is constrained to a standard reference sequence, all alignments are linked to one another.

We applied a systematic approach to naming CVe. Each element was assigned a unique identifier (ID) constructed from a defined set of components. The first component is the classifier 'CVe'. The second is a composite of two distinct subcomponents separated by a period: the first is the name of the CVe group (usually derived from the host group in which the element occurs, e.g. Carnivora), and the second is a numeric ID that uniquely identifies the insertion. Orthologous copies in different species are given the same number, but are differentiated using the third component of the ID, which uniquely identifies the species from which the sequence was obtained. An additional unique numeric ID may be added to this component in cases where a CVe has expanded via duplication.

3. Results

Identification and phylogenetic analysis of vertebrate CVe
We systematically screened 362 vertebrate genome assemblies for CVe, and identified a total of 179 CVe sequences (Table S3), in 52 distinct species (Table 2). For each CVe sequence, we determined the regions of the circovirus genome represented, and attempted to identify genomic flanks. Where genomic flanks were present, we compared these with one another to identify potentially orthologous CVe loci. In several cases, it was not possible to determine whether multiple CVe loci [...]

We only identified four cases where CVe encoding both rep and cap were present in the same species or species group. In most cases, only rep-derived sequences appear to have been incorporated or retained, and in one case only cap (Table 1). We constructed a multiple sequence alignment (MSA) that spanned the entire circovirus genome and contained both reference sequences for CVe (these could be based on individual loci, or a consensus), and representative circovirus reference taxa (Table S2). We used this 'root' MSA (see section 2.2) to infer which regions of the circovirus genome had been incorporated as CVe. Where CVe spanned coding sequence, we inferred the putative ancestral reading frame by comparing CVe and circovirus sequences, and attempting to identify likely frameshifting mutations. Most CVe represent only fragments of the genome (Figure 1), and many are relatively degraded, containing multiple frameshifting indels and stop codons.

Where we identified several CVe from the same species, we compared genomic regions to search for evidence of homology and thereby identify orthologs. Where we were able to identify orthologous CVe insertions, we used these data to create a timeline of circovirus evolution (Figure 2).
In addition, we identified sets of 'potentially orthologous' CVe, where sequence similarity and phylogenetic relationships were consistent with orthology, but this could not be confirmed or ruled out based on flanking sequences.

A range of distinct partitions were derived from the virtually translated root MSA (with frameshifts removed), and used to construct bootstrapped maximum likelihood (ML) phylogenies (Figure 3). In general, support for the deeper branching relationships between CVe and circoviruses was weak, irrespective of which genomic region was used to construct trees. This reflects the fact that most CVe are short and/or highly degraded, and these sequences tend to group distantly from other taxa. However, in phylogenies based on Rep (Figure 3), several robustly supported subgroupings were observed, three of which, referred to here as mammal 1, cyprinid 1, and avian 1, [...]

[...] vertebrates and includes the hagfishes (myxinoids) and lampreys (petromyzontids). We identified seven sequences exhibiting homology to rep in the genome assembly of the inshore hagfish (Eptatretus burgeri). These sequences are relatively distinct from other circoviruses, and also showed relatively high genetic diversity relative to one another, forming three distinct groups in phylogenetic trees (Figure S1). Notably, the putative Rep polypeptides encoded by these sequences contained several in-frame indels relative to one another. Because such a pattern of variation is unlikely to arise through neutral accumulation of mutations in the germline, this suggests the occurrence of at least three distinct genome incorporation events, each involving distinct, but relatively closely related viruses. However, since we were unable to identify unambiguous genomic flanking sequences for any of these loci, their classification as CVe should for now be considered tentative.

CVe in ray-finned fish (class Actinopterygii)
Circoviruses are thought to infect barbel fish (Barbus barbus) and European catfish (Silurus glanis), based on (i) the observation of viral particles in tissues, and (ii) the recovery of circovirus sequences from these tissues via nested PCR [17,18]. In addition, CVe have been reported in one fish species, the Indian rohu (Labeo rohita) [19]. We identified numerous additional CVe sequences in the genome assemblies of ray-finned fishes (class Actinopterygii) (Table 2, Table S3). We established that at least two of these CVe, occurring in the common carp (Cyprinus carpio) and golden-line barbel (Sinocyclocheilus grahami) genomes, were orthologs of one another, indicating they were incorporated into the germline of cyprinid fish more than 39 million years ago [28,29]. These CVe comprised multiple complete circovirus genomes arranged in tandem and, intriguingly, were observed to group as sister taxa to barbel circovirus (BarbCV) in phylogenetic trees, sharing ~70% nucleotide identity (across 1654 nucleotides) with the BarbCV genome.

We also identified matches to rep in eight other species of ray-finned fish (Table 2). We could not determine with certainty how many integration events these CVe represented. Interestingly, however, all of these sequences group together in phylogenies (Figure 3), and the phylogeny constructed for these elements, when rooted on the CVe from the most basal host, the European eel (Anguilla anguilla), approximately follows that of the host species, consistent with a single ancestral integration event >200 million years ago (Figure 2). Alternatively, the CVe observed in distinct orders might represent distinct incorporation events. This is supported by the placement of CVe.anura in phylogenies, in which it splits the fish CVe from one another, albeit with weak support (Figure S1).
In addition, the observation that CVe in fish of the order Cypriniformes (golden-line barbel and carp) occur as full-length tandem genomes, whereas those in Perciformes are derived from more divergent fragments of rep, is suggestive of at least two separate incorporation events. Notably, one CVe in the mangrove rivulus (Kryptolebias marmoratus) encoded a complete, intact rep gene (Figure 1) that is predicted to be expressed, suggesting it may have been functionalized in some manner.
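
As a minimal sketch of what an "intact" reading frame means in practice, the following Python snippet checks a nucleotide sequence for a start codon, a length divisible by three, and the absence of internal stop codons. It assumes the standard genetic code. The function names and the toy sequences are hypothetical illustrations and are not part of the DIGS or GLUE tooling; the ancestral ORFs in this study were inferred by alignment and manual adjustment, not by a check of this kind.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_internal_stops(seq, frame=0):
    """Count stop codons in the given frame, ignoring one terminal stop codon."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps, if any
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # a stop at the very end terminates the ORF normally
    return sum(1 for codon in codons if codon in STOP_CODONS)


def is_intact_orf(seq, frame=0):
    """True if the frame starts with ATG, has a length divisible by 3, and has no internal stops."""
    coding = seq.upper().replace("-", "")[frame:]
    return (
        len(coding) % 3 == 0
        and coding.startswith("ATG")
        and count_internal_stops(coding) == 0
    )


# Toy sequences (hypothetical, not taken from the CVe dataset).
print(is_intact_orf("ATGAAATAA"))      # intact: ATG AAA TAA
print(is_intact_orf("ATGTAAAAATAA"))   # interrupted by an internal TAA
print(is_intact_orf("ATGAAAATAA"))     # length not divisible by 3 (frameshift-like)
```

A degraded CVe of the kind described in this study would typically fail such a check through internal stop codons, a disrupted frame, or both.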

CVe in amphibians
Sequences homologous to circovirus rep genes have previously been identified in the Western clawed frog (Xenopus tropicalis) [20]. We identified a CVe in the genome of the American bullfrog (Rana catesbeiana) that partially overlaps that identified in Xenopus. Potentially, these sequences could be orthologs of one another, which would imply a minimum age of ~204 million years [21,22] (Figure 2). However, we were unable to confirm this based on analysis of flanking genomic sequences.

3.5. CVe in reptiles

A pair of orthologous CVe, each covering about 75% of the circovirus genome, has previously been recovered from rattlesnake genomes (Crotalus spp.) [23]. We identified CVe in four additional snake species (Table 2). Examination of aligned snake CVe sequences indicated that all are likely to be orthologs of those previously reported in rattlesnakes (see Figure S2), implying that this CVe integrated into the serpentine germline ~72-90 million years ago (Mya) (Figure 2).
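
The minimum-age estimates quoted in these sections follow a simple rule: if orthologous copies of an insertion occur in two host lineages, the insertion must predate the divergence of those lineages. A minimal Python sketch of that rule is given below. The function name and data layout are hypothetical. The two divergence values are the approximate figures quoted in the text (39 Mya for the two cyprinid hosts; ~204 Mya for Xenopus and Rana, where orthology remains unconfirmed) and are not independent estimates.

```python
from itertools import combinations

# Approximate host divergence times in millions of years (Mya), as quoted in the text.
DIVERGENCE_MYA = {
    frozenset(("Cyprinus carpio", "Sinocyclocheilus grahami")): 39.0,
    frozenset(("Xenopus tropicalis", "Rana catesbeiana")): 204.0,
}


def minimum_age(hosts, divergence_mya=DIVERGENCE_MYA):
    """Return the oldest known pairwise divergence among hosts sharing an ortholog.

    Returns None when no pair of hosts has a known divergence time, for example
    when the insertion has been found in a single species only.
    """
    ages = [
        divergence_mya[frozenset(pair)]
        for pair in combinations(set(hosts), 2)
        if frozenset(pair) in divergence_mya
    ]
    return max(ages) if ages else None


print(minimum_age(["Cyprinus carpio", "Sinocyclocheilus grahami"]))  # 39.0
print(minimum_age(["Cyprinus carpio"]))                              # None
```

The result is a lower bound only: the insertion could be considerably older than the deepest divergence among the hosts in which orthologs have so far been found.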

CVe in birds (class Aves)
CVe have previously been reported in the genomes of several avian species: the little egret (Egretta garzetta), white-throated tinamou (Tinamus guttatus), medium ground finch (Geospiza fortis), and kea (Nestor notabilis) [16,20]. We identified CVe in eight additional species. Some of these appeared likely to be orthologs of CVe reported previously. For example, we identified CVe in two species of psittacine bird that appeared to represent orthologs of one another, and possibly of those previously identified in the kea (Nestor notabilis) [16] (Table 2), which would imply integration into the psittacine germline prior to the divergence of the major extant lineages within the order Psittaciformes (estimated to have occurred 30-60 Mya [13,24]) (Figure 2).

We also identified orthologs of the rep-derived insertion previously described in the medium ground finch in several additional species in the avian order Passeriformes (songbirds) (Table 2). Identification of these orthologs demonstrates that this particular CVe predates the radiation of the avian sub-order Passeroida ~38 Mya [13,25] (Figure 2).

In addition to identifying the previously reported CVe in the genomes of the white-throated tinamou (Tinamus guttatus) and little egret (Egretta garzetta) [16], we identified previously unreported CVe in the Japanese rail (Gallirallus okinawae: order Gruiformes) and downy woodpecker (Picoides pubescens: order Piciformes) (Table 2). Both these sequences were relatively short and divergent, and consequently we could not determine their relationships to other CVe and circoviruses with confidence.

CVe in mammals (class Mammalia)
The majority of CVe identified in our screen were recovered from carnivore genome assemblies. As far as we are able to discern from phylogenetic and comparative analysis, all of these CVe derive from 1-4 germline incorporation events involving an ancient carnivore rep gene. However, the copy number of these elements has expanded subsequent to their incorporation into the germline, in some cases quite dramatically. The grouping of carnivore CVe in phylogenies (Figure 4) indicates that at least four CVe insertions were present in the carnivore germline prior to the divergence of extant families within this order. The copy number of one particular element (referred to here as CVe-Carnivora-4) has expanded in some carnivore lineages. As shown in Figure 4, the phylogenetic relationships between duplicates in the group CVe-Carnivora-4 indicate that these expansions have occurred independently in ursids (bears), pinnipeds (seals and walruses), and mustelids. CVe in this lineage are flanked by sequences that show homology to non-LTR retrotransposons. Thus, one plausible explanation for the elevated copy number in certain carnivore lineages is that CVe have become embedded in retroelements and are copied along with these sequences when they undergo transposition.

A novel, relatively well-preserved rep-derived CVe was identified in the genome of the Ryukyu mouse (Mus caroli) that grouped closely with circovirus genomes recovered from dogs [26,27]. This element presumably arose after this species diverged from the house mouse (Mus musculus) ~6-7 Mya, since it is absent from the latter species.

In the Cape golden mole (Chrysochloris asiaticus), matches to both cap and rep were identified. However, these occurred on distinct contigs and did not overlap. Furthermore, both CVe were relatively short and degraded, and were highly divergent relative to other CVe.
CVe derived from cap were also identified in the genome of Hoffmann's two-toed sloth (Choloepus hoffmanni) (Figure 1).

CVe have previously been identified in the genome of the short-tailed opossum (Monodelphis domestica), an American marsupial [7]. In phylogenies based on rep, this sequence groups together with the porcine circoviruses, canine circovirus, and the CVe we identified in Mus caroli. We identified the first examples of CVe from the genomes of Australian marsupial species: the Tasmanian devil (Sarcophilus harrisii) and the koala (Phascolarctos cinereus). Both these sequences derived from circovirus rep genes, and grouped together in phylogenetic trees (Figure S1). However, their placement relative to other taxa was not supported with confidence, reflecting their short and degraded nature. Several other short and degraded matches to Rep probes were identified in other mammalian species (Table 1, Table 2, Figure 1). These sequences were relatively distantly related to one another and to contemporary circoviruses.

4. Discussion

4.1. CVe provide retrospective information about circovirus evolution

In this study, we recovered CVe from published vertebrate genomes, determined their genomic structures, and examined their phylogenetic relationships to contemporary circoviruses. Our analysis is the first to examine such a large set of CVe sequences, and to screen so widely within vertebrates. We show that CVe are relatively widespread in vertebrate genomes, though it appears they are absent from some lineages (e.g. primates, in which genome coverage is relatively high).

Several of the CVe loci identified here have been reported previously [7,16,19,20], and the majority of novel CVe sequences recovered by our screen were orthologs or duplicates of these loci. Nevertheless, we identified 17 CVe loci that have not been reported before (Table 1, Table S3).
These sequences provide the first evidence of (ancestral) circovirus infection for several species (Table 2). In addition, the identification and characterization of novel orthologs allowed us to establish the first minimum age estimates for some CVe loci, and to markedly extend those of others. Thus, we were able to derive a more accurately calibrated timeline of evolution for the Circovirus genus, spanning multiple geological eras (Figure 2). Furthermore, we observed that CVe in fish, birds and mammals cluster phylogenetically with exogenous circoviruses identified from the same host class. This implies a stability in circovirus-host relationships, at least at higher taxonomic levels.

4.2. Impact of CVe on host genome evolution

The majority of CVe are derived from rep genes. To the extent that CVe have been exapted or co-opted, the predominance of CVe derived from rep might reflect that these sequences are more readily functionalized than those derived from cap. Notably, one CVe in the mangrove rivulus (Kryptolebias marmoratus) encoded an intact rep gene that is predicted to express mRNA, suggesting it may have been functionalized in some manner (Figure 1).

Several examples have now been described of endogenous viral elements (EVEs) that are derived from replicase genes, are expressed, and encode intact ORFs [7,30,31]. These elements are derived from a range of different viruses, and have clearly arisen in distinct events, suggesting there might be a common mechanism causing EVEs derived from the replicases of distinct viruses to be selected and maintained in different species. Alternatively, it is possible that the discrepancy in numbers simply reflects that cap-derived sequences are less conserved and therefore harder to detect.
Curiously, it is rare for more than one CVe to occur in the germline of any jawed vertebrate lineage. Carnivores are an obvious exception, since CVe have been amplified to relatively high copy number (10-20 copies) in several carnivore lineages (Figure 4), apparently via retrotransposon-mediated duplication. Further investigation of how these CVe have been amplified may reveal if their presence within an actively replicating retrotransposon lineage has impacted on the fixation of transposable elements derived from that lineage.