Evolutionary fate and implications of retrocopies in the African coelacanth genome

The coelacanth is known as a “living fossil” because of its morphological resemblance to its fossil ancestors. Thus, it serves as a useful model that provides insight into the fish that first walked on land. Retrocopies are a type of novel genetic element that are likely to contribute to genome or phenotype innovations. Thus, investigating retrocopies in the coelacanth genome can determine the role of retrocopies in coelacanth genome innovations and perhaps even water-to-land adaptations. We determined the dS values, dN/dS ratios, expression patterns, and enrichment of functional categories for 472 retrocopies in the African coelacanth genome. Of the retrocopies, 85–355 were shown to be potentially functional (i.e., retrogenes). The distribution of retrocopies based on their dS values revealed a burst pattern of young retrocopies in the genome. The retrocopy birth pattern was shown to be more similar to that in tetrapods than ray-finned fish, which indicates a genomic transformation that accompanied vertebrate evolution from water to land. Among these retrocopies, retrogenes were more prevalent in old than young retrocopies, which indicates that most retrocopies may have been eliminated during evolution, even though some retrocopies survived, attained biological function as retrogenes, and became old. Transcriptome data revealed that many retrocopies showed a biased expression pattern in the testis, although the expression was not specifically associated with a particular retrocopy age range. We identified 225 Ensembl genes that overlapped with the coelacanth genome retrocopies. GO enrichment analysis revealed different overrepresented GO (gene ontology) terms between these “retrocopy-overlapped genes” and the retrocopy parent genes, which indicates potential genomic functional organization produced by retrotranspositions. Among the 225 retrocopy-overlapped genes, we also identified 46 that were coelacanth-specific, which could represent a potential molecular basis for coelacanth evolution. Our study identified 472 retrocopies in the coelacanth genome. Sequence analysis of these retrocopies and their parent genes, transcriptome data, and GO annotation information revealed novel insight about the potential role of genomic retrocopies in coelacanth evolution and vertebrate adaptations during the evolutionary transition from water to land.


Background
Genes that are unnecessary for existence in a new environment are often eliminated during the course of evolution, but genomes also acquire novel genetic elements as a source of functional and phenotypic diversity [1]. Retrocopies, one type of novel genetic element, are genome segments that are reverse-transcribed from intronless mRNA and then inserted into new positions in the genome (i.e., RNA-based duplication) [1]. Retrocopies have long been considered evolutionary dead ends, because it was expected that these segments lack regulatory elements and originate from RNA-based duplication [2,3]. However, many recent studies revealed that retrocopies can be used to generate new genes, called retrogenes, by fusion with other genes [4][5][6] or via acquisition of new exons or introns from de novo genome sequences [4]. Furthermore, retrogenes, which are functional retrocopies, have often evolved functional roles in male germ lines [7,8]. Thus, "dead on arrival" copies of parental genes are currently considered to serve as "seeds of evolution" [9]. Despite the substantial contribution to molecular evolution revealed by studies on retrocopies in humans and fruit flies [10][11][12][13], the patterns of retrocopy formation and effects on genomic dynamics of lower vertebrates remain undetermined.
The African coelacanth is the second closest living fish relative of tetrapods. Thus, genomic data on this species, which have recently been made available, are useful for researching successful land adaptations during vertebrate evolution [14]. The African coelacanth is well known for its morphological resemblance to its fossil ancestors; as anticipated, protein-coding sequence evolution in the coelacanth genome was shown to be relatively slow [14]. Because retrogenes can serve as "seeds of evolution," investigation into the pattern and rate of retrogene formation and implications of the role of these retrogenes in the African coelacanth genome can provide novel insight into coelacanth evolution and water-to-land vertebrate adaptations.
In this study, we identified and investigated 472 parentretrocopy pairs in the African coelacanth genome. Based on the obtained results, we suggest that retrocopies attained biological relevance and were subject to natural selection following fusion with existing genes or acquisition of new gene elements, although many retrocopies have ultimately been eliminated from the genome during coelacanth evolution. Analysis of these potentially functional retrocopies (or retrogenes) provides new insight into the dynamics of functional organization of the genome during coelacanth evolution and vertebrate transition from water to land.

Retrocopies and potential retrogenes in the African coelacanth genome
Although retrocopies were long thought to be pseudogenes, it is now known that many are functional and considered retrogenes following fusion with other genes [5] or acquisition of regulatory elements and exons from de novo sequences [4,15]. To evaluate whether a retrocopy is functional, multiple criteria, such as a dN/dS ratio <0.5 or transcriptional evidence, were developed in previous studies [13,[16][17][18].
In this study, 472 one-to-one parent-retrocopy pairs were identified in the African coelacanth genome (Additional file 1: Table S1). To assess retrocopy functionality, we first compared the sequences of each retrocopy and its parent gene, which revealed a total of 235 intact retrocopies with no frameshift mutations or premature stop codons (Fig. 1, Additional file 1: Table S1). These retrocopies possess intact ORFs (open reading frames) and thus have the potential to become functional protein sequences [19]. Additionally, 155 retrocopies with a dN/ dS ratio significantly <0.5 were also identified ( Fig. 1, Additional file 1: Table S1). dN/dS <0.5 is indicative of purifying selection, which reflects potential for retrocopy functionality [13]. Second, we mapped retrocopies to Ensembl genes (http://www.ensembl.org) to detect retrocopies showed overlap with Ensembl genes. In this step, 224 retrocopies were detected, which included 152 exon-overlapping retrocopies and 72 non-exon-overlapping retrocopies (Fig. 1, Additional file 1: Table S2). These retrocopies may represent the main direct way by which retrocopies can influnce the genome, fusing with an existing gene, or acquiring new gene elements from a de novo sequence [4,15,20]. Finally, we mapped the RNA-seq data (gills, kidneys, pectoral fins, pelvic fins, pharynx, tail muscle, and testis; SRA (sequence read archive) accessions DRP000627 and SRX189186; see Methods) to the African coelacanth genome and identified 219 retrocopies with transcriptional evidence (i.e., retrocopies with FPKM (fragments per kilobaseof exon per million fragments mapped) values, Additional file 1: Table S3).
The common and unique retrocopies identified by the previously discussed different criteria are summarized in Fig. 1. Pearson Chi-square tests revealed whether an expressed retrocopy (i.e., based on FPKM value) was intact, showed a dN/dS value <0.5, or overlapped with an Ensembl gene (p-values <0.001 respectively). These results indicate that the analyzed retrocopies met the criteria for functional retrocopies. Moreover, 355 retrocopies were identified by at least one of the four criteria, and 85 retrocopies were identified by all criteria; this indicates that 85-355 retrocopies are potential retrogenes (Fig. 1).

Retrocopy age distribution
To investigate the origin and evolution of these retrocopies, we assessed the retrocopy age distribution based on the increase in dS values, which were estimated by comparing the parent genes and retrocopies. The distribution revealed a burst pattern of young retrocopies (dS <0.6; Fig. 2). Notably, similar age distribution patterns of retrocopies were revealed in humans, platypuses, and western clawed frogs in a previous study [21]. However, this pattern was not generated in ray-finned fish, including sticklebacks, zebrafish, medaka, tetraodon, and fugu [21]. Moreover, the associated dN/dS ratios with the retrocopies tended to decrease as dS increased (Fig. 3), which indicates a greater constraint of selection on older retrocopies.
To assess the proportions of retrogenes with different age ranks, we subdivided the retrocopies into bins based on dS intervals and then estimated the percentage of functional retrocopies for each bin. Regardless of the frequency burst of young retrocopies, the retrocopy bins with lower dS values (i.e., younger retrocopies) tended to contain a smaller percentage of retrocopies that exhibited intact ORFs, dN/dS ratios significantly <0.5, overlap with Ensembl genes, or expression compared with those that had elevated dS values (i.e., potentially functional retrocopies) (Fig. 4). Additionally, estimation of the mean log 10 (FPKM + 1) values for the retrocopies in Fig. 1 Venn diagram of the unique and common retrocopies among different categories. Red refers to retrocopies that overlap with Ensembl genes. Yellow refers to retrocopies with intact ORFs. Green refers to retrocopies with evidence of transcription. Blue refers to retrocopies with dN/dS ratios significantly <0.5  each dS bin showed that elevated mean values tended to be associated with relatively old retrocopies (i.e., dS >0.6; Fig. 5). Finally, the "expression frequency" was considered the number of tissues that expressed a specific retrocopy, and an elevated expression frequency tended to be associated with relatively old retrocopies (Fig. 6).

Expression patterns
A total of 219 retrocopies showed evidence of expression when we mapped the RNA-seq data to the African coelacanth genome. (Additional file 1: Table S3). In this study, we refer to the "expression frequency" as the number of tissues showing expression of a retrocopy; retrocopies with higher expression frequencies (frequency >3) tended to exhibit higher log(FPKM + 1) values for each tissue (p <0.01 Kruskal-Wallis test) and higher dS values (p <0.01 Kruskal-Wallis test), but lower dN/dS ratios (p <0.01 Kruskal-Wallis test) than the retrocopies with lower expression frequencies (Table 1). Young retrocopies were more likely than old retrocopies to be expressed with lower FPKM values and lower expression frequencies (Figs. 5 and 6).
In previous studies that examined retrocopies in humans and fruit flies [13,22,23], the preferential retrogene expression showed bias in the testis. In this study, 139 of the  Table S4). Additionally, the retrocopies expressed in the testis was also the most overrepresented of retrocopies that were uniquely expressed in one tissue (Additional file 1: Table S5), which indicates that this expression pattern bias also occurs in the coelacanth. However, despite the overrepresentation of retrocopy expression in the testis, no dS value or dN/dS ratio bias was detected in the retrocopy expression patterns in any tissue (Additional file 1: Tables S4 and S5).

Retrocopy-overlapped genes
The primary mechanisms by which a retrocopy can be transcribed and become functional are fusion (or insertion) with an existing gene or acquisition of new gene elements from a de novo sequence [4,15,20]. In this study, 224 retrocopies overlapped with 225 coelacanth Ensembl genes. We defined the 225 Ensembl genes as "retrocopy-overlapped genes" (Additional file 1: Table S2), and these Ensembl genes might harbor inserted retrocopies or have originated from the de novo exon/intron acquisition of retrocopies (Additional file 2: Figure S1). These retrocopyoverlapped genes (i.e., Ensembl genes that overlapped with retrocopies) might represent the most direct pathway for retrocopies to influence the genome dynamics of the coelacanth. To further examine this direct influence, we compared the GO enrichment results for the retrocopy parent genes with those of the 225 retrocopy-overlapped genes.
The results revealed that the gene functions that were overrepresented in the parent genes were most likely related to the synthesis and metabolism of biological molecules, whereas those in the retrocopy-overlapped genes were most likely related to transmembrane transport (Table 2). This result might indicate functional organization of the genome during coelacanth evolution, which was most likely produced by retrotransposition.
To further determine retrocopy influence on coelacanth and lower vertebrate evolution, we investigated potential orthologous sequences of the retrocopy-overlapped genes in ray-finned fish, including elephant sharks, fugu, zebrafish, gar, and some tetrapods, such as clawed frogs, anole  lizards, zebra finches, mice, and humans. In particular, we focused on two retrocopy groups: those that were coelacanth-specific and those that were lost in tetrapods. A total of 46 retrocopy-overlapped genes were identified as new retrogenes or lineage-specific genes for the coelacanth, and no orthologous sequence was found in any of the species investigated. These new genes most likely originated recently through the de novo acquisition of gene elements from retrocopies and not from host gene-retrocopy fusion. Thus, within the approximately 400 MY coelacanth evolutionary history [24], the rate of new retrogene formation has been relatively slow compared with that in humans (46/400 vs. 57/63, respectively) [13]. This result is consistent with the morphological resemblance and slow pace of evolution of protein-coding sequences in coelacanth genome. Comparison of new retrogenes with the remainder of retrocopy-overlapped genes showed that these genes tended to contain fewer exons, were more likely overlap at an exon site, and were more underexpressed (Additional file 1: Table S6). Additionally, the retrocopies that were mapped to these genes tended to be younger, have a lower expression frequency, and were less constrained by natural selection (Additional file 1: Table S6). The GO analysis indicated that the overrepresented functional categories were most likely related to cell response, modification of biomolecules, and the immune function (Additional file 1: Table S7).
In addition to the 46 coelacanth-specific retrocopyoverlapped genes, we identified 23 genes that were specifically lost in tetrapods (there were orthologous sequences in the fish but not the tetrapods; Additional file 1: Table S8). This indicates that these genes might be unnecessary for living on land. In previous studies on these genes using zebrafish data (gene expression, knockdown, and knockout), some associations with important developmental categories were reported (Additional file 1: Table S8), including caudal fin development, eye morphology, vasculature development, and nervous system development [25]. Compared with the retrocopy-overlapped genes that still exist in tetrapod lineages, the 23 specifically lost genes did not show any difference in genetic structure or expression. However, the "lost" retrocopy-overlapped genes had lower dN/dS values than those that were retained, which indicates different selection pressures (Additional file 1: Table S9). Comparison of the GO enrichment results showed that the overrepresented functional categories for the lost genes were most likely to be related to the nervous system, whereas those in the "retained" gene set were most likely related to circadian rhythms and ionic transmembrane transport (Additional file 1: Table S10).

Discussion
We explored the evolutionary fate and implications of retrocopies in the coelacanth genome and provided a novel perspective to understand the evolution of lower vertebrates and their adaptations in the transition from water to land. In total, we screened 472 retrocopies in the African coelacanth genome. The age distribution of the retrocopies based on obtained dS values showed a burst pattern of young retrocopies that accumulated in the genome (Fig. 1). Notably, similar patterns for retrocopy age distribution were found in humans, platypuses, and western clawed frogs, but not ray-finned fish [21]. This finding shows that the coelacanth's birth pattern of retrocopies is more similar to that of tetrapods than that of ray-finned fish, which indicates that this pattern change might be related to the vertebrate evolutionary transition from water to land. Because retropositions require RNA as a mediator, reverse transcriptase that stems from retrotransposons might be also necessary for retropositions. Reverse transcriptase encoded by long interspersed nuclear elements (LINEs), which possess endonucleolytic activity that can recognize any polyadenylated mRNA, seems to be responsible for retrotranspositions in mammals [1]. Additionally, as previously demonstrated [26,27], the LINE-1 (L1) element subfamily of LINEs can generate processed genes, which indicates that L1 retrotransposon activity has generated retroposed gene copies in mammals [1]. As in non-mammalian chordates, it was also reported that the number of retrocopies correlated with the number of L1 copies, but not with any other type of LINE element, such as L2 [21]. Additionally, the difference between retrocopy birth pattern ray-finned fish and that in tetrapods coincides with the dramatic difference in L1 retrotransposon diversity between mammals and fish [28]. These results indicate that L1 may also be responsible for the retrotranspositions in non-mammalian chordates and represent a genetic signal that reflects the vertebrate evolutionary transition from water to land.
Despite the burst pattern of young retrocopies that accumulated in the genome, retrocopies with indications of function (dN/dS significantly <0.5, overlap with Ensembl genes, or evidence of transcription) were much more overrepresented among the old than young retrocopies (Figs. 4, 5 and 6). The overrepresentation of the retrogenes among the old retrocopies and decrease in frequency of relatively old retrocopies indicated that many of the retrocopies might have been eliminated during evolution, which was consistent with our predictions based on the expected lack of regulatory elements and introns in the retrocopies. Such features clearly indicate a course of rapid diversity within the evolutionary fate of the retrocopies before their elimination or incorporation into a pathway. However, the retrocopies could be affected by natural selection associated with the host genes in which the retrocopies are inserted, or the retrocopies could be constrained by natural selection after the acquisition of exons/new regulatory elements (Fig. 3). Such retrocopies might attain biological relevance and become evolutionarily stable and, thus, ultimately be overrepresented as other unstable retrocopies are eliminated. This hypothesis was also supported by our analysis of retrocopy-overlapped genes. The newly originated coelacanth-specific retrocopy-overlapped genes were younger than the other retrocopy-overlapped genes. As indicated by their lower expression level (Additional file 1: Table S6), the younger genes formed from the retrocopies might be less evolutionarily stable compared with the other retrocopy-overlapped genes. Moreover, these newly formed genes tended to contain fewer exons (Additional file 1: Table S6), which indicated that the new retrogenes might continue to recruit exons or introns during their course of evolution, or, as a complementary mechanism, these retrocopies might be preferentially inserted into host genes with more exons.
As previously reported, the retrogenes might be preferentially expressed in the testis. In this study, more retrocopies were expressed in the testis than in other tissues. However, no dS or dN/dS bias, which reflects retrocopy age range, was associated with any tissuespecific retrocopies (Additional file 1: Tables S4 and S5). Moreover, expression pattern analysis revealed that older retrocopies tended to be expressed with higher FPKM values and in more tissues than the young retrocopies ( Fig. 6 and Table 1). The association of increased retrocopy expression distribution among tissues with increased age indicated that retrocopy biological functions might evolve to become essential over time [29].
The most direct ways in which a retrocopy can affect the genome are through retrocopy insertion into a gene and acquisition of new gene elements by the retrocopy. We identified 225 retrocopy-overlapped genes. GO enrichment analysis revealed that the overrepresented GO terms differed between the 225 retrocopy-overlapped genes and the parent genes from which those retrocopies originated. The gene functions that were overrepresented in the parent genes were most likely related to the synthesis and metabolism of biological molecules, whereas those of the 225 retrocopy-overlapped genes were most likely related to transmembrane transport. By retrotransposition of a father gene's mRNA copy insertion into a gene, these father genes may contribute to the evolution of retrocopy-overlapped genes, whose functional organization differs from that of the father's gene. This result indicates that retrotranspositions may contribute to functional organization dynamics of the genome.
Among the 225 retrocopy-overlapped genes, 23 genes that were specifically lost in the tetrapod lineage were identified, including myh10, which is related to eye morphology; NTN1, which is related to olfactory processes; and esrrga, which is related to caudal fin malformations (Additional file 1: Table S8). Comparisons between these genes and the retrocopy-overlapped genes with orthologous sequences in all of the investigated tetrapods revealed no differences in gene structure or expression level; however, as predicted from the retrocopies with which they overlapped, these lost genes tended to be more constrained by natural selection in the coelacanth genome. This result indicates that the genes were not lost in the tetrapod lineage because of any structural defect, but because during the adaptation of vertebrates from water to land, these genes were not necessary. Furthermore, the overrepresentation of functional categories related to the nervous system in these genes might indicate that, unlike the system of ionic homeostasis regulation (Additional file 1: Table S10), the sensory system of tetrapods may have experienced remodeling, which modified it from the system observed in fish.
The birth of new genes can contribute to the formation of lineage-or species-specific genes [30,31], as found in this study. We identified 46 retrocopy-overlapped genes that were specific to coelacanths. The overrepresented gene functions related to the immune function indicated that the immune system was reinforced during coelacanth evolution (Additional file 1: Table S7). Additionally, the slow rate of new retrogene formation corresponded to the slow evolution of protein-coding genes in the coelacanth genome [14], which provided novel insight into the morphological resemblance of the coelacanth to its fossil ancestors. Together, these results might indicate stability of the deep sea environment in which the coelacanth evolved.

Conclusions
Our study revealed a burst pattern of young retrocopies in the coelacanth genome. This pattern is similar to that in tetrapods rather than that in ray-finned fish, which indicates a possible genomic change related to water-to-land adaptations. Many retrocopies may have been eliminated during coelacanth genome evolution because of disrupted genetic structure defects. However, some might have been randomly inserted into existing genes or acquired regulatory elements, exons, or introns from de novo genetic sequences that facilitated overcoming of these defects and acquiring functions (i.e., to become retrogenes). These retrogenes were revealed to have an effect on functional organization of the genome, which provides novel insight into coelacanth evolution and the transformations involved in the transition from water to land. However, the results of this study were only sufficient for revealing some of the effects of retrotransposition on the genome; the response of the host gene to retrocopy insertion remains obscure, and retrocopies may function as noncoding RNAs that do not demand protein coding gene's structures (such as ORF) [32]. Thus, our studies in the future will further explore these issues.

Retrocopy screening
First, to obtain potential single-exon ORFs in the Africa coelacanth (Latimeria chalumnae) genome scaffold as candidate retrocopies, we mapped all of the peptide sequences that were retrieved from Ensembl (http://www.ensembl.org) to the repeat-masked genome sequence (Ensembl release 78) using tBLASTn [33]. Then, a series of steps was applied to the tBLASTn output to screen for reliable results: 1) when the distance between the hits was <40 bp, adjacent homology hits were merged; 2) merged hit sequences with aligned regions of >50 amino acids and an amino acid identity >30 % were selected; and 3) among the overlapped target sequences, the longest sequence was selected. Second, we performed searches for similarity of the merged hit sequences against all peptide sequences (including single-exon proteins, to identify DNA-based duplication of intron-containing genes [34]) using FASTA. The multiple-exon peptide sequences with the closet hits were selected for subsequent pairwise comparisons in GENEWISE [35]. Before the analysis, the hit sequences were expanded by 10,000 bp on each flank. Finally, based on the GENEWISE results, any retrocopy candidates showing alignments of ≤50 amino acids, an amino acid identity ≤70 % or multiple exons were first excluded, after which we confirmed the absence of at least two intros in the retrocopy candidates. Finally, parent-retro pairs with a common parent peptide sequence were not included in our analyses.

dN and dS estimation and dN/dS ratio test
The dN (nonsynonymous substitutions) and dS (synonymous substitutions) values were estimated for each retrocopy with its parent using the YN00 program of PAML4.8 [36]. To evaluate whether the dN/dS ratio between parent-retro pairs was significantly different from 0.5, we conducted a likelihood ratio test (LRT) using the codeml program of PAML4.8 in a pair-wise model. In the test, the null model was run with a fixed dN/dS = 0.5, and the alternative model was run to estimate dN/dS. Multiple testing was corrected via the false discovery rate method (FDR) [37] in R (http://www.R-project.org).

Mapping to the Ensembl annotation
We mapped all of the retrocopies to Ensembl (release 79) genes using coordinate information. A gene that was mapped with a retrocopy was referred to as a "retrocopyoverlapped gene". For each retrocopy that mapped to a gene, we regarded the retrocopy as an exon-overlapped retrocopy when it overlapped with any exon of that gene. Otherwise, the retrocopy was regarded as a non-exonoverlapped retrocopy.

Gene expression analysis
Paired-end RNA-seq data from Latimeria chalumnae tissues, including the gills, kidneys, pectoral fins, pelvic fins, pharynx and tail muscle, were obtained from SRA accession DRP000627. Because of the 99.73 % identity between the testis transcriptome of L. menadoensis and the genome of L. chalumnae [14], we also included the RNAseq data from the L. menadoensis testis (SRX189186). These reads were aligned against the African coelacanth annotated genome sequences using TopHat-2.0.13 [38] with a "-max-multihits 1" setting, which searched for the distinct best hit for each read. We estimated expression abundances using Cufflinks-2.2.1 [38] and measured the abundances in FPKM (fragments per kilobase of transcript per million fragments mapped). Both programs were run with the default settings.

Gene ontology (GO) analysis
The GO annotations for the African coelacanth were downloaded from the Ensembl BioMart database (Ensembl genes 79, http://www.ensembl.org). Gene enrichment tests were implemented in the TopGO package from Bioconductor (http://www.bioconductor.org). In the tests, the number of occurrences for the tested and reference genes in one functional category was compared, and the comparisons were assessed based on a significance index using Fisher's exact test. The total Ensembl-annotated coelacanth genes were used as the reference genes for all GO analyses in this study. Gene functional categories showing p < 0.01 were included as significantly enriched categories.