Genome-wide identification and analysis of A-to-I RNA editing events in bovine by transcriptome sequencing

RNA editing increases the diversity of the transcriptome and proteome. Adenosine-to-inosine (A-to-I) editing is the predominant type of RNA editing in mammals and it is catalyzed by the adenosine deaminases acting on RNA (ADARs) family. Here, we used a largescale computational analysis of transcriptomic data from brain, heart, colon, lung, spleen, kidney, testes, skeletal muscle and liver, from three adult animals in order to identify RNA editing sites in bovine. We developed a computational pipeline and used a rigorous strategy to identify novel editing sites from RNA-Seq data in the absence of corresponding DNA sequence information. Our methods take into account sequencing errors, mapping bias, as well as biological replication to reduce the probability of obtaining a false-positive result. We conducted a detailed characterization of sequence and structural features related to novel candidate sites and found 1,600 novel canonical A-to-I editing sites in the nine bovine tissues analyzed. Results show that these sites 1) occur frequently in clusters and short interspersed nuclear elements (SINE) repeats, 2) have a preference for guanines depletion/enrichment in the flanking 5′/3′ nucleotide, 3) occur less often in coding sequences than other regions of the genome, and 4) have low evolutionary conservation. Further, we found that a positive correlation exists between expression of ADAR family members and tissue-specific RNA editing. Most of the genes with predicted A-to-I editing in each tissue were significantly enriched in biological terms relevant to the function of the corresponding tissue. Lastly, the results highlight the importance of the RNA editome in nervous system regulation. The present study extends the list of RNA editing sites in bovine and provides pipelines that may be used to investigate the editome in other organisms.


Introduction
The biology of the mammalian transcriptome is far more complex than once thought. Previous studies have shed light on the dynamic nature of the mammalian transcriptome, where different molecular processes interact to fine-tune gene expression [1]. For instance, large-scale projects based on high throughput cDNA sequencing technology (RNA-Seq), such as ENCODE [2] and GENCODE [3] have clearly shown that RNA transcripts undergo a host of diverse processing mechanisms. One such mechanism is RNA editing, which is defined as any post-transcriptional or co-transcriptional mechanism that alters the nucleotide composition of a transcript. Therefore, this phenomenon leads to differences between the final transcript sequence and the DNA region it was transcribed from [4]. Since its discovery in 1986 in trypanosomes [5], RNA editing has been reported to occur in a broad range of species ranging from bacteria [6] to mammals [7,8]. As inosines are read as guanosine by the translation and splicing machineries [4,9], RNA editing can influence alternative splicing [10], recoding of open reading frames [4] and can affect miRNA-regulated post-transcriptional gene silencing [11]. RNA editing plays vital roles in the development and maintenance of the metazoan nervous system [12], marking RNAs for degradation, modulating nuclear retention of RNAs [9] and when deregulated, this mechanism is associated with various diseases [13] and cancers [14]. There are two known types of RNA editing in mammals namely, cytosine-to-uracil (C-to-U) and adenosine-to-inosine (A-to-I) [4]. C-to-U editing is catalyzed by the apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) family [15], while A-to-I editing is catalyzed by the adenosine deaminases acting on RNA (ADARs) family. The latter is the predominant type of RNA editing in mammals [9]. ADAR enzymes bind to double-stranded RNAs (dsRNAs) through their double-stranded RNA-binding domains and deaminate adenosine to inosine. Therefore, RNA editing preferentially occurs within inverted repeat sequences such as LINE (long interspersed nuclear elements) and SINE (short interspersed nuclear elements) retrotransposons (like the primate Alu repeats), because of the dsRNA structures formed by these sequences. For instance, it has been demonstrated that most of the A-to-I editing sites in the human transcriptome are clustered within Alu repeats, which mostly reside in introns and UTRs regions of genes [4].
Novel RNA-editing sites can be discovered by direct comparison between cDNA sequences and their corresponding genomic position. Several recent next generation sequencing (NGS) based studies, have reported RNA editing sites in different vertebrates including human [8,12,14], mouse [16], pig [17], chicken [18] and bovine [19]. There are several challenges for identifying RNA editome using RNA-Seq data including the discrimination of true RNA editing sites from single-nucleotide polymorphisms (SNPs), somatic mutations, systematic sequencing errors and mapping errors [20]. Recently, different bioinformatics methods have been developed to discover RNA editing events by comparing RNA and DNA sequencing data collected from single individuals. As RNA-Seq and DNA-Seq datasets are not always available for the same individual, different methods have been developed to identify RNA editing sites using RNA-Seq data alone [8,[12][13][14][19][20][21][22][23][24][25]. These methods have allowed the identification of more than 1.3 million potential RNA editing sites in human and more than 7,000 for the mouse [16]. Contrary to human and mouse, only two very recent studies have provided information about RNA editing in bovine, the first study identified 671 putative A-to-I editing sites in four fetal tissues [26] while the second study focused on one tissue (brain) to identify RNAsequencing reads containing clusters of editing sites in 21 diverse organisms including bovine [27]. As RNA editing is developmental stage specific [28] as well as tissue-specific [29], much more work is required in order to identify the extent of this post-transcriptional mechanism in this species. The comprehensive documentation of the bovine RNA editome will provide a valuable resource for the characterization of cellular and physiologic outcomes involving this modification in this agriculturally important species.
In this study, we used a largescale computational analysis of RNA-Seq data across brain, heart, colon, lung, spleen, kidney, testes, skeletal muscle and liver, from three adult animals. We developed a computational pipeline and used a rigorous strategy in order to identify novel RNA editing sites, taking into account sequencing errors, mapping bias (misalignments around repetitive regions and splice junctions), as well as biological replicates, to control the false positive rate. We also conducted a detailed characterization of sequence and structural features related to novel candidate A-to-I editing sites and found 1,600 novel canonical A-to-I editing sites. Most of the genes with predicted A-to-I editing were significantly enriched in biological terms relevant to the function of the corresponding tissue. In addition, we found frequent occurrence in clusters and in SINE repeats, preference for strong guanines depletion/ enrichment in the nucleotide 5 0 /3 0 to the edited A-to-I sites, lower editing sites in coding sequences than other regions, low number of evolutionary conserved sites, concordance of ADARs expression in different tissues with editing levels and tissue specificity of editing sites.

Datasets
To identify the RNA editome in the bovine transcriptome, we retrieved 27 publicly available paired-end strand-specific RNA-Seq samples (Gene Expression Omnibus database, accession number GSE41637). The libraries were developed from three adult animals and each contained reads from nine tissues, namely brain, heart, colon, lung, spleen, kidney, testes, skeletal muscle and liver. This dataset [30] was generated using the Illumina HiSeq 2000 platform. In this dataset, only the samples of one animal were sequenced at 76 bp read length and at high coverage (average = 111.2 million reads per sample) while the samples of the other animals had moderate coverage (average of 28.5 million reads per sample with 36 bp read length). Quality control and read mapping. Raw read quality check was performed with FastQC v0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Sequencer adapter removal and quality trimming was done with Trimmomatic [31] (parameters of Trailing 20, Maxinfo 60:0.90 and minimum length 60 for 76 bp; and Trailing 20, Maxinfo 30:0.90 and Minimum length 30 for 36 bp datasets). The trimmed reads were aligned to the Ensembl UMD3.1 bovine reference genome using Hisat2 version 2.0.5 [32]. A list of exon-exon junctions extracted from the known gene model annotation (Ensembl release 84) was used to guide the read mapping. Notable features of the Hisat2 program are that, 1) it prevents reads from being aligned to pseudogenes, which results in improved alignment accuracy [32] and 2) it is more efficient at providing editing prediction from RNA-Seq data than other programs [33]. We considered only uniquely and concordantly paired-end mapped reads, to reduce the potential bias caused by short read alignment. Also, in order to avoid potential PCR or sequencing optical artefacts influencing editing frequencies, duplicated reads that mapped to the same location were removed by MarkDuplicates tool from Picard (http://picard.sourceforge.net/), except those with the highest mapping quality score [34]. The remaining reads were locally realigned (using GATK tool) around putative insertions and deletions using Ensembl bovine SNP database version 146. Then, the base quality values were recalibrated by GATK (version 3.5, https://www.broadinstitute.org/gatk) tool.

Pipeline for identifying novel RNA editing sites
Identification of candidate RNA editing. RNA editing is defined as a single nucleotide base change between DNA and RNA. RNA-Seq single nucleotide variants (SNVs) calling was performed with HaplotypeCaller [35] using GATK tool on each sample with a stand_call_conf and stand_emit_conf value of 30 and mbq of 25. Initially identified SNVs were removed from further analysis if they corresponded to known bovine SNPs found in Ensembl bovine SNP database version 146. Then, the SNVs were filtered by the GATK standard filters (Homopoly-merRun >5, Total depth of coverage <10, RMSMappingQuality <40, QualitybyDepth < 2, MappingQualityRankSum < -12.5 and ReadPosRankSum <-8). These filtering parameters were used to eliminate: 1) sites with less than 10 supporting reads, 2) variant distance bias, 3) mapping quality bias, and 4) homopolymer bias [36]. The remaining SNVs were further filtered by several quality-aware steps. First, we discarded the sites with more than one nonreference type and the sites that were homozygous for the alternative allele. Second, the read coverage was restricted to at least 10 reads with at least three reads supporting the SNV site and only those sites, which were present in at least 10% of the reads, were kept for further analysis. Further, the SNVs sites with extreme degree of variation (100%) were removed under the assumption that 100% editing efficiency is unrealistic. Third, SNVs that occurred in regions with bidirectional transcription (sense and antisense gene pairs) were removed. Fourth, simple sequence repeats patterns (SSR [i.e. motifs ranging in length from 1 to 8 nucleotides with a minimum length of pattern of 6 bases]) were identified using GMATo software [37] and removed from further analysis. SNVs located in SSR regions were considered as biased with an offset of ±3 bases. Fifth, to exclude potential false positives resulting from poor mapping of reads at splice junctions, all SNVs located within 5 bp intronic flanking region were discarded. Sixth, to ensure that the reads supporting the SNVs were uniquely aligned, we filtered out SNVs in paralogs or repetitive regions by retrieving and aligning 100 bp of flanking sequence (50 upstream and 50 downstream of the SNV) using BLAT [38]. Finally, since RNA editing is tissue-specific and should be similarly present in animals within a species, while low frequency SNPs should not, the pipeline considered the number of times that the SNVs were detected in more than one individual to output a final set of predicted editing sites. Hence, all SNVs detected in at least two of the three individual bovines were considered candidate RNA editing sites, thus minimizing false positive results.
Annotation of the candidate RNA editing sites and enrichment analysis. As dsRNAs produced from inverted retrotransposon repeats such as LINE and SINE families are important ADAR substrate RNAs [4]. We investigated whether candidate RNA editing sites were enriched in these regions. For this, all interspersed repeats sequences (UCSC database [http:// genome.ucsc.edu]) detected by RepeatMasker program [39] were screened to identify the candidate RNA editing sites.
The functional annotation and genomic location of each candidate RNA editing in each gene was identified using SnpEff (v 4.3) [40]. In order to identify the biological functions associated with edited genes, the enriched gene ontology (GO) terms (BP, biological process) were investigated using Enrichr web-application (threshold false discovery rate (FDR)< = 0.05) [41].

EST analysis
We used the public bovine EST sequences (ftp://ftp.ncbi.nih.gov/repository/UniGene/) to investigate whether the editing events identified by our pipeline were also present in these sequences. For this, 50 bp upstream and 50 bp downstream flanking regions were extracted and queried against the bovine EST sequences using BLAST. Alignments with e-values less than 10 −5 were considered as significant and counted.

Conservation analysis
We performed cross-species transcriptome comparisons between editing sites identified in this study and those reported in human in order to identify highly conserved A-to-I editing events. Fifty base pairs flanking regions of the bovine candidate sites were BLAST against 50 bp flanking regions of the known human sites. The known editing sites in human (hg19) were retrieved from the RADAR database (http://rnaedit.com). The hits with e-values <0.001, >0.85% identity and >50 bp alignment length were considered as conserved editing sites.

Analysis of neighbor preferences
Previous studies reported that ADAR enzymes have sequence preference for the edited site neighbor nucleotide [4]. Base preference around the identified candidate RNA editing sites was investigated by extracting 10 bp upstream and 10 bp downstream of the edited sites. Sequence logo was then generated by WebLogo software [42].

Gene expression quantification
The clean reads samples were loaded into Salmon (0.8.1) pseudo-alignment transcript quantification [43], with sequence and GC bias correction enabled and using k = 25 for indexing, to quantify transcript-level abundances. Transcripts per kilobase million (TPM), which normalizes for transcript length and sequencing depth was used as an estimate for relative expression level. To improve reliability of quantification, the transcript-level quantifications were merged to the gene level by summing the corresponding transcript-level TPM estimates.

Statistical analysis
Statistical significance for differences between tissue means for editing ratio were assessed by Student's unpaired t-test and were implemented with R. FDR corrected P-values were used to control false positives resulting from multiple testing (corrected P-value < 0.05).

Identification of RNA editing in bovine
Computational identification of RNA editing is dependent on the stringency levels used to remove false positives. This is especially true when solely using transcriptome data [44]. To capture and characterize the complexity of the editome in the bovine transcriptome, we exploited existing strand-specific RNA-Seq datasets from nine tissues of three individual animals. We designed a computational approach that implements multiple filters with stringent thresholds (Fig 1).
After quality trimming, >88% of the 1,337 million reads contained in those datasets aligned to the bovine reference genome (UMD3.1) and~76% (i.e. 1 billion reads) were uniquely aligned (S1 File). Initial analysis of the high coverage samples identified 809,332 SNVs. The average number of variants was 195,574 (range in 9 tissues = 53,577 to 299,526). A total of 247,251 SNVs were detected in the 18 datasets of moderate coverage. The average number of SNV per tissue was 39,701 (range in 18 tissue samples = 14,251 to 58,300). In the high coverage group, 23% of variants in all tissues were A-to-I while in the moderate coverage group this range was 19%. This initial SNV list was further subjected to stringent filtering to control for false-positives. This step led to the identification of 15,541 SNVs in the high coverage sample and 1,128 SNVs in the moderate coverage samples of which 0.90 and 0.69 were A-to-I type, respectively (Fig 1).
A second filtering step was used to obtain a comprehensive map of potentially editable sites in the bovine transcriptome and identified 794 common SNVs in at least two bulls of which 88% were considered to be potential A-to-I RNA editing sites (Fig 2A, S2 File). These were located in unique genomic positions and were not close to any splice junction, bidirectional transcription or low complexity regions (such as SSRs). Of these,~88% (n = 697) corresponded to A-to-I differences and~3% (21 sites) were C-to-U (Fig 2A, S2 File). We focused further analysis on the canonical A-to-I editing sites.

RNA editing distribution in different tissues
The RNA editome (A-to-I events) varied per tissue. The tissue with the highest number of edited sites was the liver (n = 325; Fig 2B) while the lowest number of sites were detected in the muscle (n = 21; Fig 2B). The distribution of RNA editing ratio, however, was similar across tissues (Figs 3 and 4, S4 File). Tissue profiling identified the brain as the tissue with the highest editing ratio with 23% of edited sites having an editing ratio >0.50%. However, the mean editing ratio were not significantly different (corrected P-value <0.05) among tissues, except between brain and kidney (S4 File). Specifically, for A-to-I sites, we identified 320 genes In the first filtering step, we removed known SNPs, low-quality base calling, homozygous sites, sites with low read coverage, lowly or highly edited sites, sites with strand bias, variant distance bias, mapping quality bias, homopolymer bias, also variations which were located in bidirectional, exon-intron boundaries and paralogs or repetitive regions. In the second filtering step, we kept the sites which were observed in at least two biological replicates of cattle. B) The final distribution of all 12 mismatch types in different tissues. The identified variants were required to be present in least two or three biological replicates.  The bovine genome is shown as a circle. For each chromosome, the position of the editing sites along with their average editing ratio (heatmap) is shown in different tissues. The red, blue and black dots in the inner light gray circle indicate the A-to-I, Cto-U and other non-canonical editing sites, respectively. Also, the vertical yellow lines in the inner light gray circle indicate the position of inverted repeats. Tissues are shown in concentric circles and ordered as follow from the outside: brain, colon, heart, kidney, liver, lung, skeletal muscle, spleen and testes.
landscape of all RNA-DNA mismatches identified in this study across bovine chromosomes. A comprehensive list of all the edited genes in different tissues with number of editing sites in each gene is presented in S5 File.  Identification of A-to-I RNA editing events in bovine transcriptome Whole genome distribution of editing sites. Next, we assessed the genomic locations of A-to-I editing sites in the bovine genome. These sites were not evenly distributed across the genome. For example, transcripts from genes in chromosomes 3, 28 and 29 were not observed to be edited in the dataset we analyzed while several (n = 23) of those in chromosome 11 were (Fig 3). All together 678 of 697 edited events were observed in non-coding regions such as introns, intergenic and untranslated regions. Of those, 285 intergenic associated sites were located within 5 kb of annotated genes and some may represent extended 3 0 UTR regions. Furthermore, in agreement with previous study, most of the edited genes were edited only in one of the genic regions such as intron, coding sequence (CDS) or 3 0 UTR. The results of this analysis showed that only AJUBA and EAPP were edited simultaneously in the 3'UTR and the downstream. In addition, CD99, GK5 and ENSBTAG00000038619 were edited simultaneously in an intron and downstream regions (S3 File, Fig 5) [45]. Distribution of genomic location of A-to-I, C-to-U and non-canonical editing sites are shown in Fig 6. Only 17 A-to-I editing sites were located in CDSs, of which 15 editing sites changed the amino acids of 14 genes (either non-synonymous or missence; S3 File, Table 1). Of those, the brain had the largest number of CDS edited genes namely, CADPS, CYFIP2, GABRA3, GIPC1, GRIK2, KCNA1, LEMD2, TMEM63B, SON and ENSBTAG00000047954 while skeletal muscle had none (S5 File).
To further evaluate whether the identified A-to-I editing sites in our study are potential editing events, we investigated different sequence and structural features that have been shown to be consistent with the known properties of A-to-I editing.

Tissue-specific A-to-I editing sites
We first focused on tissue-specific A-to-I sites. We found 107, 6, 12, 34, 103, 15, 2, 7 and 3 tissuespecific sites, which were dispersed among 53, 5, 6, 12, 50, 9, 0, 4 and 2 genes in brain, colon, Identification of A-to-I RNA editing events in bovine transcriptome heart, kidney, liver, lung, skeletal muscle, spleen and testes, respectively. To investigate whether tissue specificity of edited genes was involved in functional processes relevant to that tissue, functional enrichment analysis was performed for each tissue. Results showed that these genes were significantly enriched in various biological processes associated with function of the associated tissue. For example, some of the significantly enriched biological processes terms were included "regulation of short-term neuronal synaptic plasticity" in brain (adjusted P-value = 0.05), "striated muscle contraction" in heart (adjusted P-value = 0.01), "kidney development" in kidney (adjusted P-value = 0.03), "positive regulation of lipid storage" in liver (adjusted P-value = 0.005) and "respiratory gaseous exchange" in lung (adjusted P-value = 0.02). The complete list of enrichment analysis of tissue-specific edited genes in each tissue along with the distribution of tissue-specific editing sites and their editing ratio are provided in S4 File.
We also analyzed the number of A-to-I editing sites that were shared among all the tissues. Results showed that the most commonly edited genes were GTF3C4 and C4A, which were edited in at least eight tissues, 23 genes were found to be edited in at least seven tissues, and 72 were in common in at least five tissues. In addition, we found that the lung and kidney shared 163 sites (S2 File). These findings indicated that the recurrence of editing sites across tissues was low.

ADARs expression and RNA editing levels
In order to explain differential or tissue-specific RNA editing, we first determined transcript abundance across tissues for genes that were edited in a tissue-specific manner. The rationale behind this analysis was to investigate whether tissue-specific RNA editing is correlated with the sole expression of those transcript in the tissue in which editing was observed. Our results show that tissue-specific edited genes were expressed in at least four of the nine somatic tissues examined (S6 File). Moreover, tissue-specific edited genes in lung and spleen were expressed in all other tissues. Also, more than 94% of tissue-specific edited genes in brain were expressed in at least eight other tissues.
Next, we analyzed if the editing pattern observed is related to the tissue-specific expression levels of ADAR enzymes, namely ADAR (or ADAR1), ADARB1 (or ADAR2), and ADARB2 (or ADAR3). Our results showed that ADAR had highest expression in the brain, which is consistent with higher number of tissue-specific RNA editing in this tissue. Interestingly, ADAR expression was the lowest in heart and skeletal muscle, which is in accordance with the lowest number of editing sites in these tissues (Fig 7). As shown in Fig 7, there is a relative relationship between ADAR expression and RNA editing pattern in different tissues. Overall, we found a trend for a positive correlation (Pearson correlation = 0.59, P-value = 0.1) between expression of ADAR family members and tissue-specific RNA editing. Identification of A-to-I RNA editing events in bovine transcriptome

Enrichment of editing sites in inverted repeat sequences
We analyzed the enrichment of A-to-I editing sites in based on genomic context. Results showed that out of 697 A-to-I editing sites, 588 sites were found in repetitive regions. Of the 588, 535 were enriched in SINE regions (specifically in the Bov-tA SINE family [n = 406]), 41 in LINEs and 12 were presented in other inverted repeat regions such as DNA transposons. In contrast, only 10 of 97 non-A-to-I sites were observed in these regions (Fig 3, S3 File).

Motif sequence
We queried whether sequence could affect editing events. Our results show that the nucleotides 5 0 and 3 0 to the editing sites had a strong preference for T and G enrichment, respectively (Fig 8).

Conservation analysis
To examine whether the editing sites we identified in bovine could also be detected in human orthologous regions, we performed cross-species comparisons between the bovine and human editomes. After applying stringent thresholds on the results (e-values <0.001, identity >0.85% and alignment length >50 bp), we observed 16 conserved A-to-I editing sites in 12 protein coding genes namely, TMEM63B, SON, KCNA1, GRIK2, GABRA3, FLNA, COG3, CADPS, AZIN1, GIPC1, DDX6 and ADCYAP1 (Table 2 and S3 File).

Validation of A-to-I editing sites by EST sequences
In order to further confirm that the identified A-to-I editing sites represented bona fide RNA editing events rather than technical artifacts, we looked for evidence of editing sites in the Identification of A-to-I RNA editing events in bovine transcriptome public bovine EST database by BLAST alignment. We found that 66% (458/697) of the A-to-I edited sites identified in this study were corroborated in at least one EST clone and 269 sites were supported by more than five edited EST sequences (S3 File). We also investigated the Cto-U and non-canonical editing sites by EST sequences. Results showed that 14 of 21 C-to-U (~66%) were supported by at least one EST clone. However, only 27 of 76 (~0.35%) of the non-canonical sites were confirmed, indicating that these may be considered as false positives.

Clustered editing sites
A large number of A-to-I editing sites are often clustered together due to promiscuous simultaneous editing of multiple adenosines by ADAR proteins [21]. Here, we found that most of the identified editing sites were spatially isolated from each other and only 241 editing sites were detected in 105 clusters (!2 editing sites within a 100 bp window) which corresponded to 67 known genes (S3 File). Clustered edited sites could be considered as an appropriate feature to distinguish true editing sites from false positive sites [46]. Therefore, we extended the editing sites based on the fact that A-to-I editing sites tend to be clustered together within specific genomic regions. To do this, all of the SNVs that remained after applying the GATK standard filters and removing known SNPs in all the 27 samples were used for extending the editing sites. Then, any SNV within 100 bp from a high-confidence A-to-I RNA editing site (initial candidate editing sites) were added to the pool of editing sites. Approximately 97% of the SNVs identified using this method (903 of 931 SNVs) were A-to-I substitutions and the number of A-to-I RNA editing sites was increased to 1,600. By extending the editing sites with distances smaller than 100 bp surrounding the high-confidence A-to-I editing sites, we identified a total of 1,424 editing sites in 373 clusters and in 223 annotated genes with four editing sites on each cluster on average, as well as 176 editing sites at single sites. We also found 14 edited genes with more than 10 editing sites. We next assessed the different characteristics of the novel A-to-I editing sites. Results showed that these editing events shared the same characteristics with the high-confidence Ato-I RNA editing site and were less likely to result from sequencing errors. Distribution of the novel A-to-I editing sites in different tissues were 318, 202, 44, 273, 335, 251, 7, 207 and 64 in brain, colon, heart, kidney, liver, lung, skeletal muscle, spleen and testes, respectively, which were similar to high-confidence editing sites. EST analysis revealed that 649 (~72%) and 417 (~46%) of 903 new editing sites were supported by at least one and more than five edited EST sequences, respectively. Similar to known RNA editing sites, we observed a sequence preference for strong G depletion and enrichment in 5 0 and 3 0 of the new editing sites, respectively. In addition, distribution of the new editing sites in genomic regions revealed that, like highconfidence editing sites, most of these sites were located in non-coding regions (57 sites in upstream, 5 sites in CDSs, 281 sites in intron, 21 sites in 3 0 UTR, 327 sites in downstream, 211 sites in intergenic and one site in pseudogene). Our results show that the clusters were largely from inverted repeats regions (329 clusters), however, we found 44 editing clusters from nonrepetitive regions. Also, of the 329 clusters in inverted repeats regions, 307 clusters were found in SINE repetitive regions. Thus, these features demonstrated the applicability of this approach in calling new extended RNA editing sites. Detailed characteristics of the new A-to-I editing sites along with high-confidence editing sites are provided in S7 File.

Discussion
RNA editing increases the proteome and transcriptome diversity in eukaryotic genomes [8,12,46,47]. While in human [8,12,14] and mouse [16], the A-to-I RNA editome is well characterized, in bovine only two studies have reported on this type of RNA posttranscriptional modification [26,27]. Hence, the comprehensive identification of the bovine editome is needed for the understanding of post-transcriptional gene regulation in this agriculturally important species.
The accurate identification of the editome solely from transcriptomic data is technically challenging and may suffer from methodical artifacts such as mapping and sequencing errors [48]. However, recent studies have showed that using efficient computational methods along with appropriate filtering can significantly improve the accuracy of identifying the RNA editing events [4,49,50]. Here, we used a RNA-Seq based computational approach for the de novo identification of RNA editing in the absence of the related DNA information. Similar approaches based on only RNA-Seq datasets have been used to characterize human and drosophila editing sites in various tissues [20,21]. Our approach involved three main steps: alignment of reads to the reference bovine genome, SNVs identification with basic filters, and RNA editing discovery with more stringent filters. On the first step, we made an improvement in the alignment procedure using an accurate aligner and a series of stringent filters. This rigorous filtering strategy mitigate potentially ambiguous mappings to similar genomic regions and mapping bias due to sequencing errors. For the next two steps, a series of filters with stringent cutoffs were applied to reduce obvious sources of false positives such a sequencing/mapping errors, known genomic SNPs and misaligned reads. In present study, beyond the comprehensive filters previously applied to remove potentially erroneous editing events, we used an additional filter and kept the SNVs, which were observed in at least two biological replicates. We hypothesized that functionally relevant RNA editing sites tend to be recurrently edited across individuals, whereas SNPs or false positives are unlikely to be common to unrelated individuals. The usefulness of considering biological replicates to identify biologically meaningful RNA editing sites, has been highlighted in a previous study [18]. Furthermore, we used strand-specific RNA-Seq data, which enable us to characterize the strand of the identified variants and minimize the false positive results [46]. Taken together, as we discussed above, these activities ensured the discovery of an accurate list of editing events in bovine. The two canonical RNA editing events namely A-to-I and C-to-U together accounted for 91% of the candidate editing sites in our study. Consistent with previous studies, about 88% (697) of these sites corresponded to A-to-I differences, mediated by ADAR enzymes, including 555 changes not previously reported in bovine. This finding supported the existing knowledge that A-to-I RNA editing is the dominant type of RNA editing in mammals [4]. Comparison of our results before and after filtering processes showed a clear enrichment in canonical A-to-I editing sites throughout successive filters, which is an indication of the reliability of our approach. Notably, non-canonical changes have been found by previous studies to be the most common sequencing error [51]. Consistently, our EST analysis also confirmed that the noncanonical mismatches might be false positive as only 35% of non-canonical sites were validated by EST sequences. In addition to A-to-I editing sites, about 3% of the identified DNA-RNA differences (21 sites) were C-to-U editing events, which has also been documented in mammals [15,22]. This type of modification had not yet been reported in cattle. Identification of this type of RNA editing in our datasets can be assigned to the activity of members of the APO-BEC enzyme family [22].
In the present study, to reduce false positive results and to obtain a high-confidence set of A-to-I editing sites, we used a stringent approach along with a series of filters with stringent cutoffs. However, incorporation of such filters often limits the results and it is undeniable that some of the true editing sites could be removed. Moreover, we focused on editing sites which were edited in at least two biological repeats by using a rigorous filtering approach, it would cause the omission of some true editing sites. For example, one of the applied filters was removing SNVs with very high or low editing ratio, which in addition to removing false positive sites can be lead to filter out some true editing sites. In addition, we used RNA-Seq samples with different coverage, which led us to identify a larger number of SNVs in high coverage samples than in the moderate coverage ones, a finding in agreement with previous studies [22,46]. It is well known that the number of RNA editing sites depends markedly on sequencing depth and increase with additional reads [22,46]. Hence, a large portion of this variability could be due to differences in the sequence depth of each sample rather than biological differences between cattle. Therefore, it can be suggested that many sites are edited with low editing levels and sufficient sequencing depth is required to detect these edited sites. In fact, most of the potential editing sites in the high coverage samples were discarded due to lack of supportive evidence from the two other moderate coverage samples. Therefore, we should note that there are likely many more RNA editing sites than those identified by our method. For example, the extent of the clustered A-to-I edited sites for our data set (697 sites) were considerably lower than what has been found in C. elegans and human [6,21,52].
We successfully applied a method to extend the identified RNA editing sites. By this method, which queried 100 bp of flanking region, an additional 903 A-to-I editing sites were discovered, including 791 novel ones. Notably, features of the editing sites called by the extending method were the same to those from our multifilter pipeline (high-confidence editing sites). Overall, we identified 1,725 variants (794 variants by the first method and 931 variants by the extending method) of which 1600 (697 sites from the first method and 903 sites by the extending method) and 25 variants (21 sites from the first method and 4 sites by the extending method) were A-to-I and C-to-U editing sites candidates, respectively. Therefore, if we assume that all the variations except A-to-I and C-to-U reflect sequencing errors, then the false discovery rate in this study is~6%, in agreement with the recent studies that have demonstrated that the non-canonical editing sites are false positives [51]. These noncanonical variants are most likely to be derived from sequencing or mapping errors as well as undetected SNPs in the bovine genome. Also, results of extending the editing sites revealed that most of predicted Ato-I editing sites were ain clusters and were enriched in repetitive regions. The extent of clustered edited sites after extending the editing sites (~89% of 1600 editing sites) is consistent with what is reported in the DARNED database (85.02%) [47]. Therefore, our results revealed that the RNA editing sites tended to reside in clusters instead of being randomly distributed over the genes.
A-to-I editing sites identified in this study yielded a total of 1,338, 86 and 176 editing sites in SINE regions, repetitive non-SINE regions and non-repetitive regions, respectively, an expansion in the list of bovine RNA editing sites. These findings are completely in agreement with known properties of known RNA editing sites and suggests that promiscuous A-to-I editing events can possibly occur in non-repetitive regions [21]. The bovine genome does not have Alu repeats, however, there are ral related active SINEs such as Bov-A2 and Bov-tA family, which are well known in bovine [53]. Similar to findings in human where most of the A-to-I editing sites are clustered in Alu elements [4,13], our results show that editing sites were enriched in Bov-tA family of SINEs (1,069 of 1,338 editing sites where located in SINE regions). On the other hand, nearly all of the editing sites (~99%) targeted non-coding regions (introns, intergenics or untranslated region), similar to what has been documented for human [4]. Also, it is reported that both intronic and non-coding sequences exhibit secondary structure, which can potentially enable the activity of RNA editing enzymes [54]. Moreover, coding regions have fewer inverted repeat sequences, which explains why RNA editing sites mainly reside in non-coding regions [4]. Thus, our findings support the notion that ADAR enzymes' activity is unspecifically influenced by dsRNA [18].
In line with previous studies, a large number of editing sites in introns (478 of 1600,~30%) and in 3 0 UTR or downstream regions (602 of 1600,~38%) indicates that RNA editing may play fundamental roles in regulation of splicing and miRNA regulation, respectively [12,55]. Human genome-wide projects, such as ENCODE (Encyclopedia of DNA Elements), have clearly revealed that a considerable fraction of genome are transcribed [56]. Thus, identifying RNA editing sites in intergenic regions may be due to an incomplete or poorly annotated bovine genome. On the other hand, three of the editing sites occurred in known pseudogenes. RNA editing in pseudogenes has been already reported in human tissues [52]. Recently it has been revealed that pseudogenes are functional and can be transcribed. Moreover, they can regulate gene expression by different mechanismes such as miRNA sponge [57,58]. Hence, in accordance with previous study [52], it can be hypothesizes that RNA editing may be an important mechanism for gene expression regulation by pseudogenes.
The results of this study indicate that the function of the edited genes with amino acid recoding potential are associated with the sepcific function of the tissue. For example, functional analysis based on the nine edited genes in brain revealed that these genes are involved in biological processes such as "chemical synaptic transmission", "regulation of synaptic plasticity" and "modulation of synaptic transmission", which is consistent with the known functions of RNA editing in animal nervous systems [59]. We also found that COG3 is edited in colon, heart, kidney, liver, lung, spleen and testes but not in brain and skeletal muscle. This RNA editing resulted in a codon change from AUU to GUU, and an amino acid recoding from isoleucine to valine. This gene is one of eight subunits of the COG tethering complex, which have an important cellular function associated with the structure and function of the Golgi. Interstingly, RNA editing in COG3 is conserved in human, mouse, rat [60], bovine [26], pig [55] and chicken [61].
Our results show that the number of editing sites are different among tissues, which can be ascribed to tissue-specific roles of RNA editing and sequencing depth variation. The fewer number of identified editing sites in skeletal muscle compared to other tissues suggests lower RNA editing activity in skeletal muscle and is in agreement with previous studies in pig [55] and rhesus macaque [50]. Furthermore, after skeletal muscle, the heart showed the lowest number of Ato-I editing sites (76 sites) which is in accordance to what has been observed in humans [52]. Notably, annotation of the edited genes in each tissue uncovered that most of them were associated with the function of that tissue. For instance, one of the most highly edited genes in brain was ATCAY, with seven A-to-I editing sites in its downstream (or its extended 3 0 UTR) region. In addition, this gene was edited only in brain tissue. Previous studies show that ATCAY is preferentially expressed in the human brain, is a brain-specific member of the BNIP-2 family and encodes the brain-specific protein BNIP-H or Caytaxin [62]. This gene affect glutamate synthesis at synapses during neurotransmission by decreasing the steady-state levels of glutamate [63]. Hence, extensive RNA editing of this gene may reflect a functional need for fine-tuning gene expression associated with brain functions. Another gene with a high number of editing sites compared to other genes was EIF2AK2. EIF2AK2 encodes the dsRNA-activated protein kinase R (PKR), which is an integral component of the innate immune response and also plays a more general role in cellular physiology such as regulation of protein synthesis, cell proliferation and differentiation, signal transduction and apoptosis [64]. It is also reported that this gene is ubiquitously expressed in all cells at low abundance [65]. Consistently, in our study, EIF2AK2 was edited in all tissues (except heart and skeletal muscle). This editing ranged from seven sites in lung to one site in testes. Therefore, ubiquitous editing of this gene may represent a functional mechanism for post-transcriptional regulation of this gene.
Our findings of tissue specificity of editing sites are in consistent with previous studies as reported that RNA editing profiles are highly tissue-specific [52,66]. Brain showed the highest number of tissue-specific edited genes which is consistent with what has been documented for mammals [4,52]. In addition, existence of tissue-specific RNA editing in other tissues (such as liver) suggests that RNA editing likely plays important roles in non-brain tissues. There are two main hypotheses to explain differential or tissue-specific RNA editing. The first one states that there is a difference in the expression levels of the transcript being edited. In other words, the transcript being edited in one tissue, is not available or at sufficient abundance to be edited in another tissue. To investigate this hypothesis, we assessed the expression of tissue-specific edited genes. In line with a previous study [52], our findings showed that RNA editing pattern and tissue-specific editing are not a consequence of the tissue-specific expression of the edited genes. A second hypothesis is that the editing patterns observed in our samples are related to the expression levels observed for the corresponding ADAR enzymes. Positive correlation between ADARs enzymes expression and RNA editing levels in mammalian tissues is reported in previous studies [50,52]. Deaminase activity of ADAR and ADARB1 and their highest expression in the brain have been demonstrated in previous studies [67]. Here, our results showed a positive correlation between ADAR expression and editing levels in different tissues. Nevertheless, even though transcripts in the liver and brain had a large number edited sites, the transcripts of the ADAR proteins were not as abundant in the liver as they were in the brain. Hence, biologically it is possible that there are some unknown mechanisms, which mediates the editing levels in different tissues which has also been suggested by others [67,68].
In bovine, like other mammals, the ADAR family is composed of three independent genes, ADAR (or ADAR1), ADARB1 (or ADAR2), and ADARB2 (or ADAR3) [69]. In our study, ADARB1 showed highest expression in brain and lung and a very low expression in other tissues. Interestingly, high expression of ADARB1 in brain and lung has been reported in human [52] and rhesus macaque tissues [50]. It is well known that ADARB2 is a brain specific enzyme with unknown catalytic activity [52]. In this context, our results also confirmed that ADARB2 was expressed exclusively in brain, which indicates conserved expression patterns of these genes. Generally, our results suggest that instead of gene expression of edited genes, editing enzyme expression may play an essential role in regulating tissue-specific editing levels in bovine.
Here, ADAR1 enzyme preferentially targets adenosines when the 5 0 nearest neighbor is A % U > C > G, a finding in agreement with the previous study in human [21]. Further, previous studies have reported that a guanosine directly adjacent in the 3 0 and a uridine directly adjacent in the 5 0 position of an adenosine increase the chance of the adenosine being edited as a potential cis-regulatory mechanism [4], which is in accordance to our findings and may be functional in the formation of a proper ADAR substrate structure [70].
We found 16 conserved A-to-I editing sites in 12 protein coding genes. These genes represent the same human orthologous genes and are significantly enriched in the functional category of "chemical synaptic transmission", which is associated with neural system functions. Of note, seven of the 12 genes were edited in brain and nine of them have amino acid recoding potential, a finding in accordance to others [71]. A-to-I editing of all of these genes except GIPC1, DDX6 and ADCYAP1 have been reported as conserved between human and mouse [71]. Most of the conserved edited genes were associated with neural-specific functions [71]. The observed low overlap between editing sites identified in this study and human editome is not surprising, as it is well documented that only very few editing sites are known to be conserved across large evolutionary distances [71,72] [66,71].
We compared our results to RNA editing sites predicted by Chen et al. [26]. This comparison revealed 254 common A-to-I editing sites, which reside in 91 different genes. Five of the 91 genes (including GABRA3, GRIK2, COG3, CADPS and CYFIP2) have editing sites with non-synonymous shifts. Thus, we report six novel putative functional coding changing editing sites in the genes AZIN1, GIPC1, KCNA1, LEMD2, TMEM63B and SON in bovine transcriptome. Five of these genes have been reported as genes with functional coding changing editing sites in mammals species including KCNA1 in brain tissue of mouse [73]; TMEM63B, AZIN1 and SON in human [8,74] and mouse [16] and GIPC1 in human [75], thus further demonstrating the efficacy of our A-to-I prediction pipelines and reveals that some edited sites are conserved throughout mammalian evolution. Among the novel edited genes with non-synonymous changes, only five genes (including CADPS, CYFIP2, GRIK2, KCNA1 and TMEM63B) showed tissue-specific editing. A certain proportion of the observed differences in the editing patterns between our results and [26] could be attributed to variability in number and age of tissues, computational method for editing discovery and differences in the stringency of applied filters, differences in experimental design and variability in the sequence depth of samples. This also might indicate diversity of editome among tissues or individuals. Overall, we discovered 1,346 novel A-to-I editing sites not yet reported in bovine transcriptome.

Conclusion
Here, we used a large number of RNA-Seq samples along with a computational method with multiple filters and stringent thresholds to facilitate unbiased detection of bona fide RNA editing sites in the bovine genome in the absence of corresponding DNA information. The present study extends the list of RNA editing sites in bovine and provides pipelines that may be used to investigate the editome in other organisms.