Genome-Wide Characterization of Zebrafish Endogenous Retroviruses Reveals Unexpected Diversity in Genetic Organizations and Functional Potentials

ABSTRACT Endogenous retroviruses (ERVs) occupy a substantial fraction of mammalian genomes. However, whether ERVs extensively exist in ancient vertebrates remains unexplored. Here, we performed a genome-wide characterization of ERVs in a zebrafish (Danio rerio) model. Approximately 3,315 ERV-like elements (DrERVs) were identified as Gypsy, Copia, Bel, and class I−III groups. DrERVs accounted for approximately 2.3% of zebrafish genome and were distributed in all 25 chromosomes, with a remarkable bias on chromosome 4. Gypsy and class I are the two most abundant groups with earlier insertion times. The vast majority of the DrERVs have varied structural defects. A total of 509 gag and 71 env genes with coding potentials were detected. The env-coding elements were well-characterized and classified into four subgroups. A ERV-E4.8.43-DanRer element shows high similarity with HERV9NC-int in humans and analogous sequences were detected in species spanning from fish to mammals. RNA-seq data showed that hundreds of DrERVs were expressed in embryos and tissues under physiological conditions, and most of them exhibited stage and tissue specificity. Additionally, 421 DrERVs showed strong responsiveness to virus infection. A unique group of DrERVs with immune-relevant genes, such as fga, ddx41, ftr35, igl1c3, and tbk1, instead of intrinsic viral genes were identified. These DrERVs are regulated by transcriptional factors binding at the long terminal repeats. This study provided a survey of the composition, phylogeny, and potential functions of ERVs in a fish model, which benefits the understanding of the evolutionary history of ERVs from fish to mammals. IMPORTANCE Endogenous retroviruses (ERVs) are relics of past infection that constitute up to 8% of the human genome. Understanding the genetic evolution of the ERV family and the interplay of ERVs and encoded RNAs and proteins with host function has become a new frontier in biology. Fish, as the most primitive vertebrate host for retroviruses, is an indispensable integral part for such investigations. In the present study, we report the genome-wide characterization of ERVs in zebrafish, an attractive model organism of ancient vertebrates from multiple perspectives, including composition, genomic organization, chromosome distribution, classification, phylogeny, insertion time, characterization of gag and env genes, and expression profiles in embryos and tissues. The result helps uncover the evolutionarily conserved and fish-specific ERVs, as well as the immune-relevant ERVs in response to virus infection. This study demonstrates the previously unrecognized abundance, diversification, and extensive activity of ERVs at the early stage of ERV evolution.

ERVs in embryo development and adult tissues under physiological and virus infection conditions in an ancient vertebrate organism. This study is anticipated to improve the current understanding of the molecular and functional evolutionary history of ERVs from fish to mammals throughout vertebrate evolution.

RESULTS
Genome distribution of DrERVs. A total of 3,315 DrERV-like elements were identified from zebrafish genome by RetroTector prediction with scores ranging from 250 to 1,247, among which 1,453 have an empirically high score of over 300 (36) (Fig. S1 to S4, Table S1). These DrERVs account for approximately 2.3% of the zebrafish genome and are distributed in all 25 chromosomes with different numbers. Approximately 3.1% of the DrERVs possess complete structure with gag, pol, and env genes and LTRs at both ends; whereas the remaining 96.9% of the DrERVs share partial sequences that are structurally incomplete in varying degrees (Fig. 1A). The most abundant DrERV structure is the LTR-LTR (36.5%), which means that all the gag, pol, and env genes are lost in this case. Elements with LTRs on both ends account for 83.3% in total and 46.8% when LTR-LTR type is excluded. These outcomes showed the prevalence of DrERVs with both LTRs in the genome. Among the DrERVs with recognizable remains of protein-coding genes (PCGs), the LTR-pol-LTR accounted for 15.1%. In fact, pol is the most detected PCG; hence, the pol portions of DrERVs are retained more frequently than gag and env portions. The lengths of 59-LTR and 39-LTR are comparable and most concentrated in approximately 500 bp (Fig. 1B). The lengths of gag, env, and pol genes are 551 to 4,902 bp, 218 to 4,634 bp, and 709 to 4,810 bp, respectively, and concentrated in 1,000 to 2,000 bp (for gag and env) and 2,000 to 3,000 bp (for pol). The distribution of LTRs and gag, env, and pol genes on positive and negative strands shows no remarkable bias. The number of DrERVs showed an overall high correlation with the length (R = 0.807, P , 0.001) and GC content (R = 0.797, P , 0.001) of zebrafish chromosomes. The correlation between DrERV abundance and GC content could be associated with high epigenetic modification in DrERV sequences as observed in other species (37). However, DrERVs are not always randomly distributed on zebrafish chromosomes. DrERVs are remarkably enriched in chromosomes 1 and 4 than expected, especially in chromosome 4 (Fig. 1C, Table 1). The densities of DrERVs located on chromosomes 1 and 4 are 2.87 and 4.38 elements/Mb, which are considerably higher than the average density (2.31 elements/Mb) of DrERVs distributed on the whole genomes (Table 1). Interestingly, more than 400 duplicated Nod-like receptor (NLR) genes or fragments were observed in zebrafish genome (38,39). These NLR elements are also highly accumulated on chromosome 4 as detected by the central nucleotide oligomerization domain (NACHT), Fish-specific NACHT associated domain (FISNA), leucine-rich repeat domain (LRR), SPla and the RYanodine Receptor domain (SPRY) independently. We found through comparative analysis that most PCGs are distributed in the short arm of chromosome 4, whereas NLR elements and DrERVs have a distribution bias on the long arm of chromosome 4 (Fig. 1D). This finding suggests the existence of a close evolutionary correlation and functional association between DrERVs and NLRs.
Classification of DrERVs. Given that the concatenated sequences of DrERVs cannot be properly aligned due to the high sequence diversity, the reverse transcriptase (RT) regions of pol genes were applied for DrERVs classification as described in many other ERV researches (27,(40)(41)(42)(43). A phylogenetic tree with 968 DrERVs containing predictable RT region was constructed ( Fig. 2A). These DrERVs were identified to be Gypsy (830), Bel (41), Copia (6), class I ERV (44), class II ERV (11), and class III ERV (9) (Table S2). Gypsy, Bel, and Copia are sometimes classified as retrotransposon elements, a proposed ancestor of retrovirus that had not acquired the env gene during evolution. However, these elements are also considered ERV-like elements in many investigations, because the env gene is occasionally found in Gypsy, and infectious Gypsy has been reported (44)(45)(46). Therefore, we used the generalized concept of ERV-like elements in this study, and all these elements are considered DrERVs. Gypsy occupies the largest proportion of DrERVs and includes three main subgroups (Gypsy1-3) with high divergences. In comparison, Copia and Bel are more convergent compared with Gypsy. Four DrERVs related to Snake-head retrovirus (SnRV) were classified to class III DrERV, this group is sometimes classified as a separate type (32,47). As the second most abundant group next to Gypsy, most of class I DrERVs are more like ancestor type of epsilonretrovirus and gammaretrovirus, and only four are directly related to epsilonretrovirus (Fig. 2B). Only four fish XRVs have been reported, namely, Walleye epidermal hyperplasia virus (WEHV), Walleye dermal sarcoma virus (WDSV), Salmon swim bladder sarcoma virus (SSSV) and SnRV. WEHV, WDSV, and SSSV belong to epsilonretrovirus, which indicating that epsilonretrovirus is prominent in both exogenous and endogenous retroviruses in fish. All 11 class II DrERVs were clustered with the lentivirus group, and none was clustered with alpha-, beta-, or deltaretroviruses.
Most of the DrERVs in the genome are structurally incomplete; thus, over 2,000 DrERVs that lack the RT region were not classified by this method. Although gag and env genes are usually not used as phylogenetic markers, we applied the relatively conserved retro-trans_gag domain of gag and the TLV_coat (HR1-HR2) of env to further explore the phylogeny of DrERVs. Considering that gag and env are evolutionarily not conserved, the DrERVs classified by RT region but not exogenous retroviral sequences were used as references in classification. A total of 454 retrotrans_gag domains were identified from 1,119 gag genes and form seven subclades. Beyond our expectation, all these 454 elements were related to Gypsy (Fig. 2C), except for ERV-E4.7b.16-DanRer, which has been identified as a class I DrERV. This outcome could be a result of recombination or transposition. By comparison with the tree constructed by RT region, 34 additional Gypsy-like DrERVs were newly identified (Table S2). The env tree shows that all the 71 elements with TLV_coat (HR1 to HR2) domains are related to class I (Fig. 2D); 51 additional elements were newly identified by this method (Table S2). Thus, 104 class I DrERVs were identified in total and form the second most abundant group next to Gypsy. All env elements were further classified to four groups (named DrEnv1 to 4). Among which, DrEnv1 and DrEnv4 show high degrees of homogeneity, and the elements in these two groups seem to be copies from the same ancestor. By contrast, DrEnv2 and DrEnv3 show heterogeneity and seem to be correlated with XRVs and mammalian ERVs.  Insertion time of DrERVs. Identical 59-LTR and 39-LTR are believed to start mutating separately at the very beginning of the insertion of retroviruses into host genomes. In this case, the divergence of LTRs at both ends of ERVs are usually used to make a rough estimation of the insertion time. Thus, this method was applied here to reveal the age of DrERVs, by which a neutral evolutionary rate of 1.46 Â 10 28 in fish was used (29). A total of 890 DrERVs with a score of .300 and LTR length of .100 bp were selected for analysis. The result showed that 27.87% (248/890) of the DrERVs examined possess identical LTRs at both ends. The high proportion of identical LTRs indicates that these elements could be inserted recently or under a strong positive selection pressure. In addition, some DrERVs might still keep their transposition activity. Overall, the integration times are highly concentrated in 0-1 Mya, with a proportion of 68.54% (610/890) of the DrERVs. The result implies a recent explosive integration wave in zebrafish. For other intervals, integration waves are observed at about 7 Mya, 9 Mya, 15 Mya, and 17 to 23 Mya (Fig. S5). In addition, we found that the DrERVs with 100 to 200 bp LTR are much older than others (Fig. 3A). The median insertion time of the former group is 6.10 Mya; whereas it ranges from 0.11 to 0.38 Mya in other groups. This result indicates that the relatively short identifiable LTRs are more likely to reflect the longer-term mutation rather than its inherent characteristics. When focusing on different structures of DrERVs, the LTR-LTR type shows overall earlier insertion time (median = 2.55 Mya) than other types (median = 0.19-0.75 Mya, Fig. 3B). Actually, the loss of all three PCGs in LTR-LTR type has suggested that it is ancient. Surprisingly, all the DrERVs integrated over 10 Mya belong to Gypsy (except for the unclassified ones). Gypsy1 possess the highest median insertion time (0.67 Mya) in Gypsy1 to 3; whereas Gypsy2 takes the longest time frame, which possesses the oldest DrERV (42.43 Mya) in this research (Fig. 3C). In class I-III DrERVs, class II possess the highest median insertion time (0.86 Mya); whereas the oldest DrERV (6.68 Mya) in class I-III belongs to class I. The earlier insertion times of Gypsy and class I DrERVs may have caused the abundance of these two groups.
Characterization of gag genes in DrERVs. Given that multiple endogenous gag-and env-derived genes from humans and other species are involved in various diseases, immune responses, and other physiological processes (48)(49)(50)(51)(52), we next evaluated the coding potential and gene structure of gag and env in DrERVs. According to the phylogenetic analysis of retrotrans_gag, we found that all retrotrans_gag-containing gag are Gypsy-like. Besides, several putative gag-derived encoding sequences for SRE-ZBP, CTfin51, AW-1, and Number 18 cDNA (SCAN), paraneoplastic Ma antigens (PNMA), and zinc finger protein (Znf) domains were observed at these loci. Among the 1119 gag genes, 97 SCAN, 71 PNMA, and 45 Znf domain-encoding sequences were predicted. Interestingly, the predicted domain proteins were all part of or related with zinc finger protein; SCAN and PNMA domains have been proposed to be derived from gag (53,54). We identified 55 ORFs at the typical gag positions of class I-III DrERVs classified by RT to further understand the characteristics of gag. Although no domain was predicted in these gag loci by the National Center for Biotechnology Information's Conserved Domain Database (CDD), we annotated five potential gag domains (assigned as DrGD) by sequence comparison (Fig. S6). In the phylogenetic tree, class I gag has two main branches, and gag from class II DrERVs formed an independent clade; however, four class III gag formed a mixed clade with some class I gag genes, which could be a result of recombination or transposition (Fig. 4). Most XRVs and ERVs from other species, which were selected by BLAST hits of gag, formed an independent branch (main virus group); only Ovine progressive pneumonia virus (OPPV) and SnRV retrovirus clustered with the mixed clade of class III/class I DrERVs. We found that DrGD1 is the most representative domain in DrERVs and kept by all the eight XRVs and ERVs. All the four viruses and two ERVs in the main virus group belong to class I, which comprise two epsilonretroviruses, namely, WEHV-1 and WDSV, and four gammaretroviruses. The epsilonretroviruses possess only DrGD1, but the gammaretroviruses also hold DrGD2, which is a domain found only in the branches of class I gag. Although endogenous gammaretrovirus was not identified in zebrafish, this finding suggests that a number of gag genes in class I DrERVs have a close relationship with gammaretroviruses. By contrast, the epsilonretroviruses may have experienced a loss of DrGD2 during evolution. Both OPPV and SnRV belong to class III retrovirus, only DrGD1 was found in these two retrovirus and the clustered class III and class I gag. The solo DrGD1 in class III could be a result of the loss of other domains. DrGD3 is a unique domain of class II, DrGD4 and DrGD5 are representative in the top and bottom branches of class III gag. These results suggest that the evolution of gag is accompanied by obvious domain changes, the zebrafish gag genes may be the ancient type that retains more ancestral characteristics. Five domains may be used as markers to identify fish-derived ERVs, especially DrGD1 and DrGD2. Notably, DrGD1 and DrGD2 are retained in XRVs and mammalian ERVs.
Characterization of env genes in DrERVs. A total of 71 env genes were classified into four groups (DrEnv1-4, Fig. 5A). Particularly, DrEnv1-3 belong to class I DrERVs.  Most of the DrEnv4-containing elements lost the RT region, except for three DrERVs, namely, ERV-E24.2.36-DanRer (classified to class I), ERV-S3.7.27-DanRer (classified to class III), and ERV-E3.1.25-DanRer (classified to Gypsy). Considering the close phylogenetic relationship between DrEnv3 and DrEnv4, as well as the classification of ERV-E24.2.36-DanRer, we propose that the exception of ERV-S3.7.27-DanRer and ERV-E3.1.25-DanRer env genes comes from the transposition of RT region or env genes; therefore, we tend to classify DrEnv4 to class I DrERV as well. The env genes in each branch show high sequence similarity. The insertions, deletions, or nonsynonymous mutations are found at various positions in most of the env genes. However, none was found to interrupt the putative short ORFs containing the 17 aa immunosuppressive domain (ISD) in DrEnv1-4, and some premature stop codons were found right after the ISDs of DrEnv4. In addition, the ISD sequences show high consistency in each branch but high heterogeneity across branches (Fig. 5A), among which positions 2 to 4 and 6 to 7 are extremely conserved. The 14th amino acid of ISD was considered to play a principal role in the immunosuppressive function in mammalian Env proteins. Glutamine or arginine/lysine at this site controls the "on" or "off" state of immunosuppressive activity. In DrERVs, glutamine was found in DrEnv2 and DrEnv4 at this site, whereas arginine/lysine was found in DrEnv1 and DrEnv3 at this site.
Four DrEnv1, two DrEnv2, eight DrEnv3, and four DrEnv4 possess putative ORFs with more than 1,000 bp in length. The structure of representative env genes of each group were analyzed to evaluate gene conservation and the potential function of Env proteins (Fig. 5B). Signal peptides were predicted at the N-terminus of DrEnv2 and DrEnv3 proteins, indicating the potential of membrane location. However, the signal peptides were absent in DrEnv1 and DrEnv4 proteins. All of the eight DrEnv3 proteins possess a conserved C-X-X-C motif, which was predicted to be involved in surface domain (SU)2 transmembrane domain (TM) interaction in the ERVs of other species. However, none was found in the other three DrEnv groups. The SU and TM domains of mammalian Env proteins are cleaved by a furin cleavage site with a consensus motif of R/K-X-R/K-R. This motif-encoding sequence was also found in zebrafish DrEnv1, DrEnv3, and DrEnv4 groups, except for DrEnv2. The ISD was found in all the four groups, followed by a conserved C-X7-CC motif in DrEnv1 and a C-X6-CC motif in DrEnv2 and DrEnv3. However, this C-X5/6/7-CC motif is absent in DrEnv4. The transmembrane region is located at the C-terminus of DrEnv1-3 but lost in DrEnv4. In conclusion, DrEnv4 lost most motifs, but all the eight env genes in DrEnv3 kept all the coding potential for conserved motifs. The relative position of the motif-encoding sequences in all groups are consistent with env genes identified in other species (Fig. 5C) (10,17,18). Considering the high integrity of DrEnv3, we hope to further evaluate the coding potential and transcription feature of this group. We performed rapid amplification of cDNA ends (RACE) analysis and identified some full-length group members, including an env-encoding DrERV on chromosome 5 (ERV-E5.1.38-DanRer), which is a member of DrEnv3. Similar to HERV and other retroviruses, the transcription of ERV-E5.1.38-DanRer starts from the 59 end of the R region in 59-LTR and ends at the 39 end of the R region in 39-LTR (Fig. 5D). The majority of sequences between 59-LTR and env gene was spliced, including the ORFs of gag and pol. As a result, the mature transcript retained R and U5 regions at the 59 end and U3 and R regions at the 39 end, in which only the ORF of env and two noncoding regions at both ends of env were kept. This result showed the transcription activity and potential function of the env gene in a DrEnv3 group member.
Comparative analysis between DrERV and HERV. Besides the rapid evolution of the virus itself, the host's defense mechanism accelerates the mutation of ERVs and makes ERVs highly heterogeneous even among those from the same ancestor. For this reason, understanding the phylogenetic relationship of ERVs across species has received much attention. Here, we performed a comparative analysis between DrERVs and HERVs to explore the evolutionary history of ERVs from teleost fish to mammals throughout vertebrate evolution. The RT-encoding region of pol gene is regarded as the most conserved part of ERVs; therefore, 665 nonredundant RT sequences were used as queries to BLAST the human genome. The highest ERV density was found in zebrafish chromosome 4. In addition, a large number of the BLAST hits (18.1%) in the human genome are generated by the queries of DrERVs from zebrafish chromosome 4 (Fig. 6A). These results implied that the ERV reservoir of zebrafish chromosome 4 could be primitive and has a profound impact on evolution. Among the 8,919 BLAST hits in the human genome, 2,802 unique targets were found when repeat items were excluded. Most of these targets are intergenic, only 321 targets overlap with annotated genes, among which 56% are PCGs and 38% are lncRNAs (Fig. 6B). We also found that DrERVs from different taxonomy generate different hit numbers and E-value (Fig. 6C). Although Gypsy comprises the majority of DrERVs, this group generates only a few hits in the human genome with relatively high E-value. In comparison, most of class I DrERVs generate a large number of BLAST hits and low E-value in the human genome. Therefore, this group could be quite ancient and may have the most possibility of being evolutionarily preserved ERVs. Furthermore, the BLAST target loci in human genome were examined in ascending order of E-value. Among the top 50 BLAST target sites with the lowest E-values, 25 targets were annotated as different copies of HERV9NC-int; and all these 25 HERV9NC-int elements were generated by class I DrERVs. We next searched a HERV9NC-int element and its corresponding ERV-E4.8.43-DanRer in other nine species (Fig. 6D). As expected, the BLAST targets with high similarity (Evalue , 1e-30) were detected in all the nine species ranging from aquatic fish species to terrestrial mammalian organisms. This outcome implied the potential evolutionary correlation between HERV9NC-int elements and DrERVs. Next, we compared the other domains of HERV9NC-int and ERV-E4.8.43-DanRer besides RT (Fig. 6E). We found that TLV_coat in the envelope shows a relatively high similarity (E-value = 1e-6), which provides another evidence of the homology of HERV9NC-int and ERV-E4.8.43-DanRer.
In addition, the env and gag genes of DrERVs were also applied to BLAST human genome. The target with highest score (E-value = 2e-87) is located in the intron-4 of Ras and Rab interactor 3 (RIN3) of human and chimpanzee, which shows a high similarity with ERV-E19.1.80-DanRer-env in a reverse direction (Fig. S7A). The target sequence has a coding potential of a 666 aa protein. The putative TM domain of this protein is substantially more conserved than the SU domain. The TM and SU domains have a genetic distance of 0.8 and 1.2, respectively. Through further analysis, we identified a previously unrecognized HERV element (HERV-14q32.12) at this position (Fig. S8). HERV-14q32.12 contains gag, pol, and env genes and LTRs at both ends. A conserved Gag_p30 domain, a Rve domain, and a TLV_coat encoding sequence were predicted in the gag, pol and env gene, respectively; however, the usually more conserved RT domain encoding sequence was absent, and gag and pol were truncated with premature stop codons. The 147 retrotrans_gag domain-containing gag genes generate only three nonredundant hits, namely, Retrotransposon Gag-like 1 and 5 and Paternally expressed 10. The gag genes that lack retrotrans_gag domain were also used to search the human genome. These elements generate more hits with more substantial E-value, among which ERV-AB4.6.45-DanRer-gag and HERVE-int-19p12-gag showed highest similarity (E-value = 3e-46) (Fig. S7B).
Expression of DrERVs in embryos and adult tissues. The transcriptional expression profile of DrERVs in zebrafish embryos and adult tissues were analyzed at whole genomewide level to evaluate the functional activity of DrERVs in physiological process. The expression pattern of DrERVs in embryos was analyzed using the RNA-seq data extracted from European Nucleotide Archive. Data of four important developmental stages of segmentation (bud), pharyngula (28-h postfertilization [hpf]), hatching (2-days postfertilization [dpf]) and larval (5 dpf) were chosen. A total of 319 DrERVs with relatively high expression level (TPM . 100) were identified at four development stages (Fig. 7A, Table S3). The expression patterns of DrERVs between bud and 5 dpf stages exhibit substantial difference because the early expressed DrERVs were hardly expressed in the later period (Fig. 7B). The expression levels of DrERVs at 28 hpf and 2 dpf stages showed a decrease in bud-specific DrERVs and an increase of 5 dpf-specific DrERVs (Fig. 7A). The results indicated the existence of a strict expression regulation between different stage-specific DrERVs. Compared with the large fraction of LTR-LTR type in the total DrERVs, a relatively low ratio (18.2%) of this type was highly expressed in embryos, which means that most of the active DrERVs in embryo have coding potential. Among the DrERVs expressed in embryos, 10 class I, two class II, and three class III DrERVs were identified. Among them, six class I and one class III were downregulated at 5 dpf, whereas the others were upregulated. Furthermore, the expression of DrEnv1-4 was analyzed. The result showed that some of the DrEnv1 and all of the DrEnv2 were expressed at the bud stage, whereas most of the rest DrEnv1 began to be highly expressed at 5 dpf (Fig. S9). The DrEnv3 has broad and diverse expression patterns in embryos, from 28 hpf to 5 dpf, and most of them reached the peak expression at 28 hpf or 2 dpf except ERV-E15.5.82-DanRer, ERV-E4.7b.6E-DanRer, and ERV-E19.1.80-DanRer, which reached the peak expression at 5 dpf. The findings suggest the functional diversity of DrEnv3 group members. By contrast, the expression pattern of DrEnv4 showed high homogeneity; all of the members reached relatively high expression levels at a late period of embryo development (5 dpf) and exhibited a gradual increase in expression pattern during early development. Given that some previous studies suggested that ERVs could act as regulatory elements for adjacent genes in development, we next analyzed the adjacent genes (within 10 Kb) of bud-specific DrERVs. Expectedly, 110 known genes were found, 49 of which were expressed at the early stage and downregulated before 5 dpf (Table S4). Interestingly, 22 of the 49 genes were correlated with nucleic acid binding, and nine genes were correlated with ion binding. Further study is needed to elucidate the potential mutual regulation between DrERVs and their adjacent genes.
Next, the expression pattern of DrERVs in seven tissues, namely, heart, spleen, head kidney, liver, gut, brain, and muscle, were further evaluated by RNA-seq. Ninety-six highly expressed (FPKM . 1) DrERVs were identified in the seven tissues (Fig. 7C, Table S5). We found by comparing the seven tissues that only a few DrERVs are expressed in the liver, whereas considerable DrERVs are present in other tissues. We found 53 DrERVs in the heart, 31 in the spleen, four in the head kidney, 31 in the gut, 29 in the brain, and 13 in the muscle. Among these tissue-expressed DrERVs, 15 DrERVs were ubiquitously expressed in more than three tissues (Fig. 7D), in which four DrERVs possess pol gene, two DrERVs possess gag and pol genes, and nine DrERVs only have LTR structures at both ends. This result indicates that these LTR-LTR elements are the origin of DrERV-derived functional noncoding RNAs. By contrast, 47 DrERVs are tissue specific, that is, 25 DrERVs are expressed in the heart, 10 in the spleen, four in the head kidney, eight in the gut, two in the brain, and three in the muscle (Fig. 7D). Additionally, 34.0% of the tissue-expressed DrERVs are class I DrERVs, 14.4% are Gypsy DrERVs, and 4.2% are class II and Bel DrERVs. Notably, the heart is the most active DrERV-expressing tissue, and most of the heart-expressed DrERVs (27/53) were class I DrERVs that contain the DrEnv4 genes and account for 87% (27/31) of the total DrEnv4-containing DrERVs identified in zebrafish. This finding suggests that DrEnv4 members may play particularly important roles in cardiac function and metabolism. In this respect, we performed a trans-analysis of the 27 heart-specific DrEnv4-containing DrERVs. We found by enrichment analysis that the genes co-expressed with these DrEnv4-containing DrERVs were remarkably enriched in ATP metabolism and cellular respiration (Fig. 7E). This outcome is consistent with the functional characteristics that energy production is particularly prominent in the heart, which is closely associated with mitochondrial function. In addition, three DrEnv3-containing DrERVs (ERV-E19.1.80-DanRer, ERV-E15.5.82-DanRer and ERV-E4.7b.36E-DanRer) were specifically expressed in the brain, and the encoded DrEnv3 proteins have the most intact structure of a typical envelope protein. Notably, among the 96 DrERVs expressed in seven tissues, 39 were also expressed in embryo (Fig. 7F), and 33 of   (Table S6). This result suggests that few DrERVs expressed in adult tissues were expressed in embryos at early stages (bud and 28 hpf, Fig. 7F). Furthermore, 26 of the 33 DrERVs have env-coding ability, 24 and two of which were expressed in adult heart and brain. These observations implied the potential functional activity of envelope proteins in adult tissues of zebrafish.
Expression of DrERVs in response to spring viremia of carp virus infection. We examined the expression profile of DrERVs in immune-relevant tissues under spring viremia of carp virus (SVCV) challenge to explore whether DrERVs are involved in host antiviral immunity. For this purpose, the head kidney, spleen, and gut, which are the main immune organs that are representatives of central, peripheral, and mucosal immune organs, were selectively examined. The induced-expression of type I interferon (IFNw 1 and IFNw 3) and interferon-stimulated genes (ISG15 and mxb) in the head kidney, spleen, and gut was determined as indicators for the successful initiation of antiviral immunity in response to SVCV infection (Fig. 8A). A total of 192, 130, and 64 DrERVs were substantially upregulated (Fig. 8B, Table S7). Three and four class I DrERVs were substantially upregulated in the head kidney and gut, whereas no DrERVs of the other groups were upregulated. We performed cis-and trans-analyses between DrERVs and PCGs to preliminarily uncover the potential function of these responsive DrERVs. Fifteen potential cis-acting genes were found by cis-analysis, and a ERVL-3.4.71-DanRer element was closely correlated with immunoglobulin heavy variable (ighv) 4-9 (R = 0.95) and ighv4-8 (R = 0.88). ERVL-3.4.71-DanRer and ighv4-9 have a more significant correlation but a longer distance than that between ERVL-3.4.71-DanRer and ighv4-8, which indicates a possibility that DrERVs are correlated with the selection of variable regions in immunoglobulin heavy chain. The trans-acting genes were filtered by correlation coefficient (20.99 . R . 0.99), and a set of 963 genes was generated. A group of G-protein coupled receptors (GPCRs) were found to be functionally correlated with DrERVs by gene ontology (GO) enrichment analysis (Fig. S10). Given that GPCRs play important roles in various physiological processes, including immune and nervous activities, the discovery of the correlation between DrERVs and GPCRs may provide new insights into the involvement of DrERVs in cellular activities through association with GPCRs.
We believe that ERVs participate in cellular functions through three ways: providing transcription factor (TF)-binding sites for downstream genes and encoding regulatory noncoding-RNAs (ncRNAs) and functional proteins. We examined the remarkably upregulated immune-relevant genes in response to SVCV infection to verify the first mechanism. We found five potential DrERV-aid genes, namely, fga, ddx41, ftr35, igl1c3, and tbk1, which are located inside five DrERVs, namely, ERVL-1.2.11-DanRer, ERVL-9.1.84-DanRer, ERVL-2.7.54-DanRer, ERVL-3.7.49-DanRer, and ERVL-4.1.2-DanRer. The expression of fga and ddx41 was upregulated in the gut, and the expression of ftr35, igl1c3, and tbk1 was upregulated in the head kidney and spleen. These genes are potentially regulated by various TFs via association with cis-acting elements (such as promoters and enhancers) in the 59-LTRs and/or 39-LTRs of the five DrERVs. The binding sites of RelA, STAT4, NF-k B, and IRF1 were found in the 59-LTR of ERVL-1.2.11-DanRer, which is distributed upstream of the fga gene. An IRF2 binding site was found in the 59-LTR of ERVL-9.1.84-DanRer, which is located upstream of the ddx41 gene. The binding sites of STAT4, STAT5a, STAT5b, IRF1, RelA, and NF-k B were found at the 59-LTR of ERVL-2.7.54-DanRer, which is distributed upstream of the ftr35 gene. RelA, IRF1, IRF2, and NF-k B binding sites were located at the 59-LTR of ERVL-3.7.49-DanRer, which is upstream of the igl1c3 gene. Interestingly, numerous binding sites for STAT4, STAT5a, STAT5b, STAT1b, NF-k B, RelA, IRF1, and IRF2 were found at the non-coding region between the 59-LTR and 39-LTR of ERVL-4.1.2-DanRer, which is distributed upstream of the tbk1 gene (Fig. 9A). The association of IRF1 and RelA with the LTR upstream fga and igl1c3 genes in ERVL-1.2.11-DanRer and ERVL-3.7.49-DanRer were selectively examined by chromatin immunoprecipitation (ChIP) assay to confirm the binding activity of TFs to the LTRs of DrERVs. We initially verified the upregulated expression of fga and igl1c3 in gut and head kidney tissues under SVCV infection (Fig. 9B). ChIP analysis showed that IRF1 and RelA are considerably enriched at the desired LTR regions as expected (Fig. 9C). Thus, the LTRs of DrERVs inserted with cellular genes could facilitate gene expression by providing various TF-binding sites. Consistent with this notion, the 59-and 39-LTRs of these DrERVs showed higher diversity than those of other DrERVs (P , 0.01), such as DrERVs that encode ncRNAs and PCGs (Fig. 9D).
Finally, we found that 41 DrERVs with protein-coding potential were regulated in the spleen, kidney, and gut tissues in response to SVCV infection (Table S9). Almost all the 41 protein-coding DrERVs potentially encode entire or partial Pol; 12 of these DrERVs encode Gag, and two of these DrERVs encode Env. Both of the two Env proteins belong to the DrEnv3 group. Besides, five of the 41 DrERVs encode a SCAN domain-containing protein.
The SCAN domain is usually found in zinc-finger proteins and thought to be related to transcription regulation. The sequences that encode the SCAN domain-containing proteins are located between 59-LTRs and pol genes in the five DrERVs; such a genetic locus is similar to that of gag gene, and this domain is proposed to be derived from gag (53).

DISCUSSION
We systematically identified ERVs in a zebrafish model to understand the origin, evolution, and potential function of ERVs in an ancient vertebrate. Approximately 3,315 DrERV elements were identified from all the 25 chromosomes of zebrafish and accounted for 2.3% of the zebrafish genome. The vast majority of the DrERVs are incomplete in structure in varying degrees and the LTR-LTR elements are the most abundant class of DrERVs. This finding suggests that most DrERVs are deficient probably because of the autonomous suppression by the host or homologous recombination (1). The DrERV elements were classified into Gypsy, Bel, Copia, and class I-III groups. Gypsy occupies the largest proportion of DrERVs with high divergences. The oldest DrERV, which belongs to Gypsy, was predicted to be integrated into zebrafish genome over 40 Mya. Besides Gypsy, class I DrERVs were the most abundant group in zebrafish. This group was predicted to be inserted into the genome much older than classes II and III. The early insertion time of class I DrERVs may cause the abundance of this group. Despite its prevalence, almost all of the class I DrERVs are an ancestor type of gamma-and epsilon-like retroviruses; only four are directly related to epsilonretroviruses. These ancient class I DrERVs may promote the monophyletic origination of gammaretroviruses and epsilonretroviruses and provide an epitome of the ancestral features of class I retroviruses. In addition, a total of 11 DrERVs were clustered with exogenous lentiviruses in polbased phylogenetic tree; however, the regions of gag and env genes between these DrERVs and exogenous lentiviruses cannot be properly aligned due to the high sequence diversity. Given that endogenous lentiviruses have not been detected before, the existence of endogenous lentiviruses in fish remains to be further clarified by developing new strategies. In this respect, an improved method for phylogenomics of gammaretroviruses can provide a valuable reference for our research (47,55). In the prediction of insertion time, we found that considerable DrERVs possess identical LTR elements at both ends. The cause of this phenomenon could be due to the newly insertion event. Another explanation is that these elements are under strong purifying selection and are therefore involved in physiological events. In addition, these LTRs are theoretically capable of transposition. Retrotector has been designed with algorithms that recognize some retrovirus signature (including LTRs) and their respective distance. It is worth noting that RetroTector does not properly assess solo LTRs, thus this highly defective type of DrERV is almost absent in this research. Additionally, the accurate prediction of LTRs is still challenging due to the high divergence. Although the SweepDNA and LTRID modules in RetroTector use various strategies to avoid false positive identification of LTRs, we believe that there may still be a certain misrecognition rate at the present stage. Clarification on these issues also depends on methodological innovation in the future.
Interestingly, DrERVs have a remarkable distribution bias on chromosome 4, especially on the long arm of this chromosome. ERVs are highly enriched on sex-determination chromosomes in humans and other mammalian species, and some previous studies have suggested that chromosome 4 might be related with sex determination in zebrafish (35,56); thus, our findings may provide new insights into the hypothesis that chromosome 4 links sex determination in zebrafish based on the ERV distribution bias. Additionally, quite a few NLR element-encoding genes densely accumulated on the long arm of chromosome 4. Considering that active ERVs have transposition ability, these NLR-coding sequences may be amplified and fragmented by the surrounding DrERVs and seem to be isolated from other PCGs in the short arm. This occurrence makes the long arm of chromosome 4 entirely heterochromatic and these NLR elements are under precise controls. The NLR family is an integral part of innate immunity, but we are not clear whether the stacked DrERVs and NLRs in the long arm region is a result of the game of the host and the intruder. However, the transposition ability is very likely to be the driving force of the duplication of NLRs. The evolutionary correlation and functional interplay among DrERVs, sex determination, and innate immunity are an interesting topic that remains to be further explained as a whole.
Gag and env genes are recognized as multifunctional regardless of physiological or pathological conditions; therefore, we next focused on these two genes. The retrotrans_gag is the only characteristic domain found in zebrafish gag elements. However, all detected ret-rotrans_gag domains belong to gag in Gypsy. This finding indicates that this domain could originate from Gypsy. In this case, no domain could be predicted by CDD in the gag of class I-III DrERVs. Many long ORFs were found in these gag position of class I-III DrERVs; thus, we aligned and recognized five potential gag domains (DrGD1-5) in these elements. DrGD1 and DrGD2 were detected in DrERVs, XRVs, and mammalian ERVs. Therefore, these two domains are relatively conserved in evolution and could become new functional cores. In addition, these domains could also be applied as new markers for ERV identification in lower vertebrates. Since the discovery of syncytin in mammals, endogenous Env protein has received much attention (57,58). All the 71 newly identified env in zebrafish belong to class I DrERVs, because no env was found in the other groups. Retroviruses were generally proposed to have originated from transposons by acquiring env-like genes and thus gained interhost transposition ability, although some LTR retrotransposons seem to be originated from retroviruses via passive loss of the env genes (59,60). Based on this theory, we could speculate that the env genes in class I retroviruses were captured in fish genome or even earlier. The abundance and early insertion time of class I DrERVs are also in accordance with this hypothesis. In humans, gamma-and epsilon-related HERVs have been detected. Different from those of zebrafish, gamma-related HERVs are the most abundant class I HERVs (7). However, only the ancestors of class I and epsilon related DrERVs were detected in zebrafish, and no gamma-related DrERVs were found. Coincidentally, almost all the known fish XRVs belong to epsilonretrovirus, although most fish XRVs may go undetected. This finding hints the profound effect of epsilonretrovirus on fish genome, or reverse.
More than 88% of the BLAST hits are located in the intergenic regions when the RT elements of DrERVs were used to BLAST the human genome. This finding indicates that ERV elements are an important resource of non-coding regions in vertebrate genome. A quite large number of BLAST hits in the human genome were generated by DrERVs on zebrafish chromosome 4. This result proved again that the DrERVs at the long arm of chromosome 4 could be functional and seem to be evolutionarily conserved and even vital in species evolution. We found through the comparison between DrERVs and HERVs that HERV9NC-int has the highest homology to DrERVs, particularly ERV-E4.8.43-DanRer. This HERV possesses coding potential for gag, pol, and env genes and is responsive to HIV infection (61); thus, it seems to be an undiscovered important HERV in the human genome. This finding is consistent with our usual understanding that elements with vital functions are evolutionarily conserved. Numerous ERVs with high identity were found by searching HERV9NC-int and ERV-E4.8.43-DanRer in other species. Phylogenetic analysis showed that the BLAST hits from the same species are clustered together, regardless if the hits were generated by HERV9NC-int or ERV-E4.8.43-DanRer. In addition, when the gag and env elements of DrERVs were applied to BLAST the human genome, hits with high E-value (3e-46 and 2e-87, respectively) were detected. The results suggest that ERV elements could also be preserved during vertebrate evolution like normal genes and these ERV elements are very likely to play fundamental roles in biological activities. Besides that, these analogous elements are also possibly derived from large-scale cross-species transmissions.
A total of 665 DrERVs were actively expressed in embryos and adult tissues under physiological and viral infection conditions. These DrERVs account for 20.06% of the total DrERVs in zebrafish. The results suggest the extensively involvement of DrERVs in the life activities of zebrafish, including embryonic development, cellular metabolisms, tissue homeostasis, and immune responses. Among the 665 actively expressed DrERVs, 319 DrERVs were detected in embryos at four developmental stages, 96 DrERVs were found in seven adult tissues under normal physiological conditions, and 421 DrERVs were strongly induced in three tissues upon viral infection. The majority of embryo-specific DrERVs were differentially expressed at the four developmental stages, namely, bud, 28 hpf, 2 dpf, and 5 dpf. These DrERVs show distinct stage-specific transcriptional patterns during zebrafish development. The stage-specific genes or elements are closely related to cell lineage specification for ongoing development; therefore, the identification of lineage-specific DrERVs from each developmental stage needs to be further explored. A clarification on this issue would provide new insights into the mechanisms underlying ERV-based regulation of embryonic development. The 96 tissue-expressed DrERVs were mostly enriched in the heart, followed by the spleen, gut, brain, muscle, head kidney, and liver. Among these 96 DrERVs, 15 DrERVs were ubiquitously expressed in different tissues, whereas 48 DrERVs were expressed in specific tissues. The tissue-specific DrERVs were most abundant in the heart, followed by spleen, gut, head kidney, muscle, and liver. The tissue-expressed DrERVs exhibited a wider encoding potential for Env, Pol, and Gag proteins and noncoding RNAs. This encoding potential meets the requirement for the functional diversification of different tissues. In addition, 38 DrERVs expressed in adult tissues were also transcribed in embryos, and most of them were detected at 5 dpf. This finding suggests that these DrERVs play important roles in fundamental cellular activities throughout the lifetime of zebrafish from the late-developmental embryos to adult tissues. Notably, considerable numbers of embryo-and tissue-expressed DrERVs belong to class I DrERVs. Hence, this type of DrERV has stronger functional activities in cellular development and metabolism than those of other types.
Importantly, 421 DrERVs were remarkably induced in head kidney, spleen, and gut tissues in response to SVCV infection. Most of these DrERVs were not transcribed in tissues at steady state under normal conditions without SVCV stimulation; hence, they are viral-responsive DrERVs, which are crucial for antiviral immunity in zebrafish. Given that the head kidney, spleen, and gut are three major immune-relevant tissues that represent the central, peripheral, and mucosal immune systems in fish, the vastly induced expression of DrERVs in these tissues suggests that DrERVs play extensive roles in a wide spectrum of immunities, which potentially rang from the activities in the early hemopoietic regulation, proliferation, and differentiation of immune cells in systemic immunity to local defense reactions in mucosal immunity. The correlation between ERVs and exogenous viruses has long been a challenging topic that remains to be explored. Some previous investigations have focused on the interplay between ERVs and XRVs, such as HIV, spleen necrosis virus, MuLV, and friend virus (62). However, the association between ERVs and a non-retroviral virus remains poorly understood. In the present study, we demonstrated that numerous DrERVs are highly responsive to SVCV, a negative-stranded RNA virus and a member of family Rhabdoviridae. Thus, zebrafish is expected to be an attractive model organism for studying the complex relationships among ERVs, exogenous invading viruses, and host immunity. Such investigations may include the identification of the regulators (such as stimulators, restriction factors, and interactomes) of ERV expression, the sensing and control of ERVs by innate-immune signaling pathways and underlying mechanisms, the regulation of innate immunity by ERVs, the evolutionary history of ERVs, and the co-evolution mechanism between ERVs and host immunity.
The ERV elements occur in four broad classes in humans and mouse models, including elements that are relatively intact with potentially infectious retrovirus members; elements that lack partial coding sequences, typically env, but potentially autonomous; nonautonomous elements that lack coding sequences but retain essential cis-acting sequences for transcription, packaging, and primer binding; and solo LTR products of recombination between LTRs and the associated loss of the internal domain and one LTR copy (2,3,63). All these four ERV classes were identified in zebrafish and the most abundant class is that of LTR-LTR elements. Importantly, a fifth class of ERVs was identified in zebrafish. In the fifth class, the internal viral elements within the LTR ends are completely replaced by some cellular functional genes, especially immune-relevant genes, such as fga, ddx41, ftr35, igl1c3, and tbk1. This finding reflects the structural and functional bias of DrERVs that were repurposed for host gene expression under strong selection pressure during evolution. Architecturally, the LTR structures are a rich source of transcriptional regulatory cis-elements, which provide numerous TF-binding sites in the promoter and enhancer sequences of LTRs. For example, the LTRs with fga, ddx41, ftr35, igl1c3, and tbk1 genes in zebrafish contain perfect or nearly perfect binding sites for the RelA, IRF1, IRF2, NF-k B, STAT1b, STAT4, STAT5a, and STAT5b TFs. Thus, this class of DrERVs may potentially mediate an immune transcription network in response to viral infection. In addition, we noticed that the heavy chain (ighv4-8/9) and light chain (igl1c3) of zebrafish immunoglobin-coding sequences are closely correlated with DrERVs; and a considerable number of DrERVs regulate the expression of adjacent genes. These observations suggest the long-distance modulatory effect of DrERVs on remote target genes probably through regulatory elements, such as enhancers in LTRs. Thus, the vast majority of LTR structures in zebrafish genome may provide a large reservoir of regulatory elements for host utilization in various cellular activities and even long-distance genomic recombination, which leads to increased mutation frequency and thus contributes to species evolution (64). The regulatory elements included in LTRs seem to be expandable with evolving host control over invading species, because increasing number of LTR elements were emerged in humans and other mammals (5,7,65). In addition, LTRs are known as the focus of epigenetic silencing. In this case, the flanking LTRs of functional genes or elements could undergo reprogramming by epigenetic modifications from repressive modules to active status. Hence, the LTR elements of DrERVs would become an attractive model platform for understanding the mechanisms underlying epigenetic regulation at genome-wide LTR structural levels.
In conclusion, we comprehensively analyzed the composition, phylogeny, and potential functions of ERVs in zebrafish, an attractive model organism of ancient vertebrates, from multiple perspectives. The results may provide a solid foundation for further investigation on the molecular origin, genetic shift, and functional evolution of vertebrate ERV family from fish to mammals. The ubiquitous existence and high fraction of ERVs in vertebrates determine their important position in the genome, but at present, ERV research still needs to be expounded.

MATERIALS AND METHODS
Experimental fish and virus. Wild-type AB zebrafish (Danio rerio) were bred and maintained in circulating water at 28°C under standard conditions as previously described (66,67). Only healthy fish, as determined by general appearance and activity level, were used. Zebrafish embryos were collected by natural spawning and kept at 28.5°C in incubator with a 14:10-h light/dark photoperiod. All animal care and experimental procedures were approved by the Committee on Animal Care and Use and the Committee on the Ethic of Animal Experiments of Zhejiang University. SVCV was a gift from Prof. Yibing Zhang (Institute of Hydrobiology, Chinese Academy of Sciences) and propagated in epithelioma papulosum cyprini (EPC) cells as previously described (68).
Detection of DrERVs in the genome. Sequences of 25 zebrafish chromosomes (GRCz11) were downloaded from NCBI Genome database (https://www.ncbi.nlm.nih.gov/genome/?term=zebrafish). The sequences of each chromosome were segmented into 9 Mb fragments with 2 Kb overlapping regions between the fragments and submitted to RetroTector (http://retrotector.neuro.uu.se/) for prediction of DrERVs with default settings. RetroTector contains three modules for the recognition of candidate LTRs, the recognition of retroviral conserved motifs that meet distance thresholds and the reconstruction of original protein sequences. In addition, RetroTector can give a variety of possible ORFs and score values based on how well the sequences match the infectious retroviruses (36).
Nomenclature of DrERVs. DrERVs are named following the rules described previously (69). Briefly, the names are composed of three parts, "ERV" and "DanRer" (Danio rerio) constitute the first and third parts, respectively. In this research, only the DrERVs related to class I-class III are initialed with "ERV", whereas the others are initialed with "ERVL" (ERV-like). The second part indicates the locus. For example, in "ERV-E5.1.38-DanRer," "5" represents chromosome 5, "1" represents the first 9 Mb fragments of chromosome 5, "38" represents the 38th DrERV in this fragment. In the DrERVs related to class I-class III, the second parts are initialed with "E" (epsilon-related), "AB" (alpha/beta-related), or "S" (spuma-related). Some DrERV genes that encode for Gag, Pol, and Env proteins are located beyond the expected DrERV regions (from 59-LTR to 39-LTR); these genes are annotated independently with additional "G," "P," or "E" at the end of the second part of the name, for example, "ERV-21.2.63E-DanRer." Distribution analysis of DrERVs. The correlation among element number, chromosome length, and GC content was analyzed using SPSS 17.0. x 2 test was performed using expected DrERV number per megabase (ratios of total DrERV number to total genome sequence length) and observed numbers per megabase (x 2 = P [(observed 2 expected) 2 /expected]) to investigate the preference of DrERV distribution on chromosomes. The locations of NLR elements on chromosome 4 were indicated by the BLAST hits of FISNA, NACHT, LRR, and SPYR in the Pfam database (http://pfam.xfam.org/).
Molecular dating. The insertion time of DrERVs were estimated using the formula T = (D/R)/2, where T is the invasion time (million years), D is the 59-and 39-LTR divergence given as the number of differences per nucleotide per site (overall nucleotide divergence), and R is the genomic substitution rate per site per year. D was calculated using MEGA 6.06, the neutral rate of fish genomic evolution of 1.46 Â 10 28 was applied (29).
Characterization of gag and env genes. The ORFs of gag and env genes were predicted by SeqBuilder in Lasergene 7.1 software (DNASTAR Inc., USA). Conserved domains in these ORFs of gag and env were further identified by CDD. Considering all retrotrans_gag-containing gag are Gypsy-like, we identified 55 ORFs at the gag loci in DrERVs without recognizable retrotrans_gag domains. The ML tree was constructed based on amino acid sequences of these ORFs as described in classification and phylogenetic analysis of DrERVs. The best-fit model of VT1I1G4 was applied. Additional sequences of XRVs and ERVs were selected by BLAST results, including EGV97139.1, AII72209.1, AAO62318.1, AGV92852.1, AAO46144.1, AAC78248.1, ADC92307.1, NC_043194.1, NC_001867.1, and NC_001724.1. For the gag genes without recognizable domains, we identified undiscovered conserved domains with MEME Suite (74), and these new domains were annotated by itol (https://itol.embl.de/). The ML tree of env genes constructed in classification and phylogenetic analysis of DrERVs was reused here for subsequent analysis. The conserved motifs in Env proteins were identified by comparing with the well-characterized Env proteins as reported previously (17,18).
Rapid amplification of cDNA ends. Ten zebrafish were injected with 5 mL SVCV (10 7 50% tissue culture-infective dose [TCID 50 ]/mL) per fish, and spleens were collected 12 hpi. Total RNAs were extracted using TRIzol Reagent (Invitrogen). The first strand was synthesized according to manufacturer's protocol of the Single Cell Full Length mRNA-Amplification Kit (Vazyme, Nanjing). The ERV-E5.1.38-DanRer-env sequence was amplified by using specific primers and nested primers as shown in Table S11. The PCR products were purified from 1.2% agarose gel by using gel extraction kit (Omega) and inserted into pGEM-T EASY vector (Promega), then sequenced. Then, sequences were assembled manually and compared with genomic sequence.
Comparative analysis between DrERVs and HERVs. After removing the identical RTs, a total of 665 non-redundant RT sequences were used as queries to BLAST the human genome (GRCh38) in Ensembl (http://www.ensembl.org) database. The similarity between the queries and the hits were evaluated by scores and E-values provided by Ensembl. The RT sequences of ERV-E4.8.38-DanRer and HERV9NC-int were used to BLAST the genomes of platypus (mOrnAna1.p.v1), shark (Callorhinchus_milii-6.1.3), lizard (AnoCar2.0), mouse (GRCm39), duck (CAU_duck1.0), chicken (GRCg6a), koala (phaCin_unsw_v4.1), dog (CanFam3.1), and cow (ARS-UCD1.2) in Ensembl. The BLAST hits with highest similarity (evaluated by score and E-value, only hits with E-value , 1e-30 were selected) with ERV-E4.8.38-DanRer or HERV9NC-int were selected for phylogenetic analysis. A ML tree was constructed based on the amino acid sequences as described in Classification and phylogenetic analysis of DrERVs, with the best-fit model of LG1G4.
RNA-seq analysis of DrERVs. RNA-seq data of zebrafish embryo developmental stages was extracted from European Nucleotide Archive (PRJNA154389) (75). The data of Bud, 28 hpf, 2 dpf, and 5 dpf were selected for subsequent analysis. Tissues of heart, head kidney, spleen, gut, liver, brain, and muscle of healthy adult zebrafish were collected for subsequent experiment. Total RNAs were extracted using TRIzol Reagent (Invitrogen), and genomic DNA was removed using DNase I (TaKara). The spleen, head kidney, and gut tissues were collected 12 hpi with 5 mL SVCV (10 7 TCID 50 /mL) or mock PBS per fish, and RNAs were extracted as mentioned above. RNA-seq transcriptome libraries were prepared following TruSeq RNA sample preparation Kit from Illumina (San Diego, CA), and mRNAs was isolated by oligo(dT) beads and then fragmented by fragmentation buffer. Double-stranded cDNA was synthesized using a SuperScript double-stranded cDNA synthesis kit (Invitrogen, CA, USA) with random hexamer primers (Illumina). Libraries were sequenced with the Illumina NovaSeq 6000 sequencer (2 Â 150 bp read length). Then, reads were mapped to zebrafish reference genome (GRCz11) using TopHat; and quality control was performed using RseQC. The pattern recognition and clustering analyses were performed using scripts of R; differential expression was analyzed using DESeq, scripts of R and perl. Transacting analysis for ERV-like elements was performed using R script. The Pearson correlation coefficient and P value of expression level of ERV-like elements and mRNAs were assessed, and absolute value of Pearson correlation coefficient greater than 0.85 and P value less than 0.01 is considered transacting. GO enrichment was performed using the R topGO package. The mapping between Ensembl IDs and GO terms was retrieved from the Ensembl database using a custom Perl script (get_ensembl_go_terms.pl) from the topgo-wrapper repository (https://github.com/iansealy/topgo-wrapper).
Molecular cloning. The primers for zebrafish IRF1 and RelA were designed according to sequences in Ensembl (http://www.ensembl.org). Total RNAs were extracted from zebrafish spleen by using TRIzol Reagent (Invitrogen) and reverse-transcribed into cDNAs, then IRF1 and RelA cDNAs were generated through RT-PCR. The PCR products were purified from 1.2% agarose gel by using gel extraction kit (Omega), and inserted into the pcDNA6 (Invitrogen) and pEGFP-N1 (BD Biosciences) vectors. The plasmids were transformed into competent Escherichia coli DH5a (Invitrogen), and the positive plasmids were purified by using endo-free plasmid minikit II (Omega Bio-tek).
Quantitative RT-PCR for expression analysis. The transcriptional expression of fga, igl1c3, ERVL-7.2.48-DanRer, and ERVL-3.1.72-DanRer upon SVCV infection were determined by quantitative RT-PCR (Q-RT-PCR) on a Mastercycler ep realplex machine (Eppendorf). SVCV infection and RNA preparation were performed as described in RACE, and RNAs were reverse-transcribed into cDNAs. PCR experiments were performed in a total volume of 10 mL by using a SYBR Premix Ex Taq kit (TaKaRa Bio). The reaction mixtures were incubated for 2 min at 95°C, then subjected to 40 cycles of 15 s at 95°C, 15 s at 60°C, and 20 s at 72°C. The relative expression levels were calculated using the 2 2DCt and 2 2DDCt method with b-actin for normalization. Each PCR trial was run in triplicate parallel reactions and repeated three times.
Chromatin immunoprecipitation-qPCR. Promoters were predicted for DrERVs surrounding immune-related genes with PROMO (76,77). ChIP assays were applied to investigate whether DrERVs would serve as transcriptional factor binding sites for downstream genes. Zebrafish embryos were acquired as described above. Plasmids of pcDNA6-IRF1 and pEGFP-RelA were microinjected into the embryos with amount of 200 ng/embryo at 0.5 hpf, followed by injection of 2 nL PBS or SVCV (10 7 TCID 50 /mL) per embryo at 24 hpf. At 36 hpf, 30 embryos were collected for each replicate in both PBS and SVCV administered groups (three replicates in each group). ChIP assays were carried out according to the manufacturer's protocol of the ChIP assay kit (Beyotime, Beijing). Samples with normal mice IgG and the input were used as negative and positive controls. Pull-down levels of target promoter sequences were determined by qPCR and normalized to the corresponding abundance in the input chromatin. The promoter-specific primers of indicated DrERVs are listed in Table S11.
Statistical analysis. Statistical analysis and graphical presentation were carried out using SPSS 21.0 and GraphPad Prism 6.0. Quantitative data from the three independent experiments were expressed as mean 6 SD. The groups were compared statistically using Student's t test for paired samples. The P values *, P , 0.05, **, P , 0.01, and ***, P , 0.001 were considered statistically significant.
Data availability. Raw data of RNA-Seq have been deposited into (Sequence Read Archive (SRA)), with the accession number PRJNA690124 (for expression profile of 7 tissues) and PRJNA690234 (for expression profile in response to SVCV infection).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, XLSX file, 7.6 MB. SUPPLEMENTAL FILE 2, PDF file, 3.3 MB.