An immune-suppressing protein in human endogenous retroviruses

Abstract
Motivation: Retroviruses are important contributors to disease and evolution in vertebrates. Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV). Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions.
Results: Some primate ERVs appear to encode an overlooked protein. This protein is homologous to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now. The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4. We found homologs of MC132 in ERVs of apes, monkeys and bushbaby, but not tarsiers, lemurs or non-primates. This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses.
Contact: mcfrith@edu.k.u-tokyo.ac.jp


Introduction
Retroviruses cause significant disease, such as acquired immune deficiency syndrome and adult T-cell leukemia. They have RNA genomes, which undergo reverse transcription into DNA, which is inserted into the host cell's genome. Occasionally, they infect germline cells, in which case the insertion may be inherited by future generations of the host organism: this is termed an endogenous retrovirus (ERV). Vertebrate genomes contain many retrovirus-derived fragments: for example, they comprise ~8% of the human genome. Probably most ERVs decay by neutral evolution; however, many ERV fragments have been co-opted by the host to function as protein-coding genes or regulatory elements (Johnson, 2019; Thompson et al., 2016; Wang and Han, 2020). Thus, retroviruses are important contributors to disease and evolution.
Retroviruses encode three main genes, in the following order: 5′-gag-pol-env-3′. Each gene produces several proteins, including viral structural proteins, a protease, reverse transcriptase and viral envelope proteins. Some retroviruses also encode small 'accessory' proteins near the 3′ end. In short, retroviruses have a largely consistent genome organization.
We report that some primate ERVs encode an extra protein upstream of the gag gene (or possibly fused to the gag gene). This protein is homologous to protein MC132 of the human poxvirus Molluscum contagiosum. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now (Brady et al., 2015).

MC132 protein fossils in human ERVs
We discovered this ERV protein by chance, while searching for protein fossils in the human genome. Protein fossils are segments of formerly protein-coding DNA, and we recently developed a sensitive method to find them by comparing DNA to protein sequences to find homologous segments (Frith, 2022; Yao and Frith, 2022). We found a few dozen segments of the human genome with homology to protein MC132 from Molluscum contagiosum virus (specifically, proteins Q98298 and A0A1S7DLX6 from UniProt release 2022_03). These segments lie in ERVs. There are many types of human ERV, and these segments lie in a few specific types. We used ERV annotation from RepeatMasker, which finds ERVs by comparing the genome to ERV models in the Dfam database (Storer et al., 2021). These models and annotation are not perfect (see below), but according to them, the protein has homology to HERV30 (Fig. 1), HERV17, HERV9 and perhaps a few others. These ERV types are closely related: they are in Dfam's ERV1 class.

Reconstructing the ancestral sequence
These ERVs presumably inserted into the genome millions of years ago and underwent random mutations, degrading the original sequence. We attempted to reconstruct an ancestral sequence, from 15 DNA segments whose alignments to MC132 cover most of the protein (<10 amino acids missing from the start and <30 missing from the end). We fed these segments, plus 200 bp flanks, to Refiner (Hubley et al., 2022). Refiner inferred an ancestral DNA sequence, which remarkably has an 879-bp open reading frame (ORF) encompassing the MC132 homology. This ORF encodes the following protein: It is easy to verify (e.g. by NCBI BLAST) that this protein has significant similarity to MC132. The DNA from Refiner is 99% identical to Dfam's HERV30 consensus sequence, but the latter has two frame shifts disrupting the ORF.
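The check for an open reading frame in a reconstructed consensus can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes an ORF is defined as an ATG start codon followed in frame by the first stop codon, and the function names (`longest_orf`, `longest_orf_both_strands`) are hypothetical. Coordinates are 0-based and half-open, on whichever strand is reported.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def longest_orf(dna: str) -> Tuple[int, int]:
    """Longest ATG-to-stop ORF in the three forward frames.

    Returns (start, end), half-open and including the stop codon,
    or (0, 0) if no complete ORF is found.
    """
    seq = dna.upper()
    best = (0, 0)
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


def longest_orf_both_strands(dna: str) -> Tuple[str, int, int]:
    """Longest ORF on either strand, as (strand, start, end).

    For strand "-", coordinates refer to the reverse-complemented sequence.
    """
    fwd = longest_orf(dna)
    rev = longest_orf(reverse_complement(dna))
    if rev[1] - rev[0] > fwd[1] - fwd[0]:
        return ("-",) + rev
    return ("+",) + fwd
```

Under this definition, an 879-bp ORF spans 879 / 3 = 293 codons. Whether that count includes the stop codon depends on the convention used to report ORF length.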
We then sought human genome segments homologous to this new protein, using the same DNA-versus-protein homology search method as above (see Methods). We found hundreds of hits (Table 1), mostly in HERV17 annotations (Table 2). The hits are consistently upstream of the gag gene (Fig. 2). Remarkably, there is one HERV30 in chromosome 4 where the ORF is intact, with no frame disruptions (Fig. 2).

The protein homology overlaps gaps in HERV17 annotations
We noticed that, in HERV17, the DNA region homologous to the new protein overlaps a consistent gap in RepeatMasker's HERV17 annotation (Fig. 3). This gap indicates an imperfection in the ERV models used by RepeatMasker. Either the HERV17 model is inaccurate in this region or we have a new HERV17-like subfamily that is not yet represented in RepeatMasker's models. We made a new model, by feeding HERV17 sequences from the human genome to Refiner. The new consensus sequence is 99% identical to Dfam's HERV17 over most of its length but has an extra 270 bp in the gap region. This suggests that we should update the model rather than add a new subfamily. On the other hand, a previous study suggested two HERV17 subgroups (Grandi et al., 2016): the extra 270 bp was not discussed but is actually present in one subgroup. In any case, the gap region is immediately upstream of a GA tandem repeat, which may evolve quickly and cause variation in these models.
Our new consensus sequence has frameshifts in the region homologous to the new protein. So do the previous subgroups. Thus, HERV17 may have proliferated in the genome after disruption of the reading frame. HERV17 (also known as HERV-W) is interesting because it has many copies that were retrotransposed by LINE enzymes (Pavlíček et al., 2002), and its env gene was co-opted as the human syncytin gene ERVW-1 involved in placental development (Mi et al., 2000).

Extent of protein homology in ERV families
To better understand this protein homology in each ERV family, we took the genome segments aligned to the Refiner protein and mapped them to Dfam's consensus DNA sequence for each ERV (Fig. 4). HERV17 consistently has a partial match, shorter than the HERV30 match, while HERV9 and HERV9N have even shorter matches.
In some of these cases, the RepeatMasker ERV annotations are fragmented and suggest ambiguity about which type of ERV1 is really present. It is possible that some of the Dfam consensus sequences incorrectly combine different ERV subfamilies or that some ERVs are actually chimeric. Careful reconstruction of these ERVs would help us to understand the evolution of this protein.

The new ERV protein in non-human primates
Finally, we searched several mammal genomes (Table 3) for homologies to MC132 and the new protein from Refiner. Similarly to human, there are hundreds of hits in apes, old-world monkeys and new-world monkeys (Table 1). Among other primates, there are a few hits in bushbaby, but none in tarsier or mouse lemur. This is surprising, because it is usually thought that bushbaby and lemurs are related as Strepsirrhini, whereas tarsiers and simians are related as Haplorhini. We found no hits in other mammals (e.g. rat, pika, dolphin). As expected, the homologous segments of these primate genomes lie in ERVs. In apes and old-world monkeys, these are the same types of ERV as in human, according to RepeatMasker annotations. In a more distantly related new-world monkey (marmoset), we find the highest number of homologous segments, which almost all lie in ERV annotations of type ERV1-1_CJa-I (also in the ERV1 category). In bushbaby, the hits overlap gaps in ERV annotations (Fig. 5). It is likely that the ERV models used by RepeatMasker become less accurate for primates more distantly related to human.
Fig. 1. Alignment between Dfam's HERV30 consensus DNA sequence and MC132 protein (Q98298). The DNA's translation is shown above it. ||| indicates a match, ::: a positive substitution score, and ... a zero substitution score (Fig. 7). The alignment has 36% identity. Lowercase regions were deemed to be simple repeats by the alignment tool (LAST).
We tried to infer the evolutionary tree of DNA segments homologous to the new protein (Fig. 6). The tree largely groups the DNA segments according to their three main primate clades: Catarrhini (apes and old-world monkeys), new-world monkeys and bushbaby. On the other hand, the tree does not separate apes (e.g. human) from old-world monkeys (e.g. rhesus). This indicates that these ERVs proliferated in the three main clades more recently than their last common ancestors, and in Catarrhini before the last common ancestor of apes and old-world monkeys.

Discussion
Some primate retroviruses used to encode an additional protein, homologous to an immune-suppressing protein in a human poxvirus. It is plausible that this retrovirus protein also had an immune-suppressing function. Perhaps some extant retroviruses still encode such a protein.
We found one intact ORF for this protein in human chromosome 4, so it is possible that this protein is present in humans. The ORF might have been co-opted as a gene that (down-)regulates immune responses.
One puzzle is that the ORF upstream of gag in a retrovirus would be expected to hinder the translation of gag. So, it is possible that the ORF's stop codon inferred by Refiner is incorrect, and it was actually one long ORF fused to gag.
Since retroviruses and poxviruses are not closely related, DNA encoding this protein must have been horizontally transferred between these types of virus. The direction of transfer is unknown and could be indirect, e.g. from an unknown third source. In any case, a retrovirus encoding this protein infected ancient primates. The inferred evolutionary tree (Fig. 6) suggests that there were independent infections of Catarrhini, new-world monkeys and bushbaby ancestors. Homologs of this protein may lurk elsewhere, and finding them should clarify its evolutionary history.

DNA-versus-protein homology search
DNA-versus-protein homology searches were done with LAST version 1411, essentially as described previously (Frith, 2022):

    lastdb -q -c myDB proteins.fasta
    lastal -D1e9 -K1 -p my.train myDB genome.fasta > out.maf

This requires a file 'my.train' specifying rates of substitution, deletion, and insertion. These rates can be inferred by finding homologies between DNA and protein sequences using last-train (Frith, 2022; Yao and Frith, 2022). However, it is not obvious which sequences to use for this inference: they must have extensive-enough homology to infer the parameters of a 64 × 21 substitution matrix (Fig. 7). We used human pseudogene DNA and non-human proteins: the idea is that they have diverged by a combination of protein-coding and noncoding evolution. Specifically, we used retrogene DNA in human genome hg38 according to ucscRetroInfo9 (Baertsch et al., 2008) and chicken proteins from UniProt release 2022_03 (proteome UP000000539) (The UniProt Consortium, 2021). Instead of chicken, we also tried mouse and zebrafish, but it did not seem to make much difference. The rates were inferred like this:

    lastdb -q -c db UP000000539_9031.fasta
    last-train --codon --pid=50 -m100 db retro.fasta > my.train

Fig. 6. Evolutionary tree of protein fossils homologous to MC132. Blue indicates fossils from Platyrrhini (new-world monkey) genomes, red Catarrhini (apes and old-world monkeys), and green bushbaby. Pink circles mark branches with medium-to-high confidence (bootstrap value >70%).
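The MAF output of the lastal search in this section can be post-processed in many ways. The following Python sketch is one possibility, not the pipeline used for this paper. It assumes each alignment block holds two `s` lines with the standard MAF fields (name, start, aligned size, strand, sequence length, alignment text), and that the protein (database) row comes first and the DNA (query) row second. The helper names `maf_blocks`, `covers_most` and `count_hits` are hypothetical. `covers_most` mirrors the criterion stated earlier for choosing segments for consensus building: fewer than 10 amino acids missing from the protein start and fewer than 30 from its end.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class Row(NamedTuple):
    """Coordinate fields of one MAF 's' line (alignment text is ignored)."""
    name: str
    start: int     # 0-based start of the aligned region
    size: int      # number of aligned residues or bases, excluding gaps
    strand: str
    src_size: int  # full length of the source sequence


def maf_blocks(lines: Iterable[str]) -> Iterator[List[Row]]:
    """Yield the 's' rows of each alignment block in a MAF file."""
    block: List[Row] = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "s":
            block.append(Row(fields[1], int(fields[2]), int(fields[3]),
                             fields[4], int(fields[5])))
        elif not fields or fields[0] == "a":
            # A blank line or a new 'a' line closes the current block.
            if block:
                yield block
            block = []
    if block:
        yield block


def covers_most(protein: Row, max_start: int = 10, max_end: int = 30) -> bool:
    """True if fewer than max_start residues are unaligned at the protein
    start and fewer than max_end at its end (assumes '+' strand protein)."""
    missing_end = protein.src_size - (protein.start + protein.size)
    return protein.start < max_start and missing_end < max_end


def count_hits(blocks: Iterable[List[Row]]) -> Counter:
    """Count alignments per DNA sequence name (assumed to be the second row)."""
    return Counter(b[1].name for b in blocks if len(b) >= 2)
```

For example, `count_hits(maf_blocks(open("out.maf")))` would tally hits per chromosome, and filtering the blocks with `covers_most(block[0])` beforehand would keep only near-full-length matches.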

DNA consensus sequences
Refiner, from RepeatModeler version 2.0.3, was run with default parameters.