The impact of alternative splicing on RNA subcellular localization

Background: Alternative splicing, a ubiquitous phenomenon in eukaryotes, provides a regulatory mechanism for the biological diversity of individual genes. Most studies have focused on the e↵ects of alternative splicing for protein synthesis. However, the influence of alternative splicing on the RNA subcellular localization has rarely been studied. Results: By analyzing RNA-seq data from subcellular fractions across thirteen human cell lines, we observed that splicing is apparent to promote cytoplasmic localization. We also discovered that intron retention is preferred by transcripts localized in the nucleus. Short and structurally stable introns show a positive correlation with nuclear localization. Such introns are predicted to be preferentially bound by MBNL1, an RNA-binding protein that contains two nuclear localization signals. Conclusions: Our findings reveal that alternative splicing plays an important role in regulating RNA subcellular localization. This study provides valuable clues for understanding the biological mechanisms of alternative splicing.

Introduction throughput sequencing. The goal of this research was to perform a comprehensive 1 and genome-wide study of the impact of alternative splicing on RNA subcellular 2 localization. Therefore, we analyzed RNA-sequencing (RNA-seq) data from subcel-3 lular (cytoplasmic and nuclear) fractions and investigated whether alternative splic-4 ing causes an imbalance of RNA expression between the cytoplasm and the nucleus. 5 Briefly, we found that RNA splicing appeared to promote cytoplasmic localization. 6 We also observed that transcripts with intron retentions preferred to localize in the 7 nucleus. A further meta-analysis of retained introns indicated that short and struc-8 tured intronic sequences were more likely to appear in nuclear transcripts. Notably, 9 we also found that the CUG repeat sequence was enriched in the retained introns. 10 Our analysis suggested that MBNL, an RNA-binding protein (RBP), recognized 11 CUG-containing transcripts and led to nuclear localization by its nuclear signal se-12 quence (NLS). The above results are consistently observed across thirteen cell lines, 13 suggesting the reliability and robustness of the investigations. To our knowledge, 14 this is the first genome-wide study to analyze and evaluate the e↵ect of alternative 15 splicing on RNA subcellular localization across multiple cell lines. We believe that 16 this research may provide valuable clues for further understanding of the biological 17 function and mechanism of alternative splicing.

19
Thousands of transcript switches between cytoplasm and nucleus were identified 20 across thirteen human cell lines 21 To assess the influence of alternative splicing on RNA subcellular localization, we 22 analyzed RNA-seq datasets that cover the nuclear and cytoplasmic fractions in 23 thirteen human cell lines (Additional file 1). Using these datasets, we calculated 24 transcript usage changes ( T U) to identify transcript switches for genes between 25 the cytoplasm and the nucleus. For a transcript, the transcript usage (T U) means 26 the percentage of its expression in all transcript variants, and the T U assesses 27 the extent of di↵erential transcript usage between two conditions (i.e., nuclear and 28 cytoplasmic fractions). Thus, a transcript switch involves two transcript variants in 29 a gene, one of which is predominantly expressed in the cytoplasm ( T U > 0), while 30 the other one is mainly expressed in the nucleus ( T U < 0). The gene VPS4A, for 31 instance, contains four transcript variants (Figure 1A,upper). In HeLa cells, we 32 T U values, while non-coding transcripts prefer to locate in the nucleus and have 23 more negative T U values (Figure 2A). This result agrees with our understanding 24 that the protein-coding transcript needs to be transported into the cytoplasm to 25 produce proteins, while a large number of non-coding transcripts have been reported 26 to localize in the nucleus to participate in transcriptional and post-transcriptional 27 gene expression and chromatin organization, among other functions [28,29]. Next, 28 we compared the ribosome density on cytoplasmic transcripts ( T U > 0) and nu-29 clear transcripts ( T U < 0) in HeLa cells. Ribosome density was calculated from 30 ribosome profiling data (see "Methods"). Comparing this data with nuclear tran-31 scripts, we predicted that cytoplasmic transcripts interact more frequently with 32 ribosomes resulting in a higher ribosome density. Since non-coding transcripts are considered to be a group of RNA molecules that are not involved in protein transla-1 tion, they are suitable for the comparison of ribosome density. Indeed, we observed 2 that those cytoplasmic non-coding transcripts tend to associate with more ribo-3 somes relative to those of the nucleus ( Figure 2B). However, due to the limited 4 number of non-coding transcripts available for comparison, we did not observe a 5 significant di↵erence.

6
Taken together, we applied T U to screen for a pair of transcripts for each gene, 7 and the two transcripts were enriched in the cytoplasm (termed as "cytoplasmic 8 transcripts") and the nucleus (termed as "nuclear transcripts"), respectively. Thus, 9 we obtained a collection of transcript variants generated by alternative splicing, and 10 the subcellular localization was di↵erent between them as well.

11
A transcript switch is associated with RNA splicing rather than RNA degradation 12 We first investigated the post-transcriptional influence of the Nonsense-Mediated 13 RNA Decay (NMD) pathway on the transcript switch. Initially, the two transcripts 14 in a transcript switch are equally distributed in the cytoplasm and nucleus. One 15 reason for the transcript switch is that the rate of degradation of the two transcripts 16 in the cytoplasm and nucleus is di↵erent. To test this hypothesis, we calculated the 17 sensitivity of transcripts to NMD based on changes in RNA-seq data [30] before 18 and after the knockout of UPF1, a core factor of NMD. However, after compar-19 ing the sensitivity of NMD to cytoplasmic and nuclear transcripts, we did not find 20 significant di↵erences ( Figure 3A). This observation led us to conclude that NMD 21 may not significantly a↵ect transcript switches. The cause of the transcript switch 22 should be mainly due to intrinsic (sequence features) and transcriptional (e.g., splic-23 ing) e↵ects.

24
Considering that the longer the transcript the higher the splicing frequency (or 25 exon number)should be, we next exterminated the association between the length 26 of the transcript and its subcellular localization. We evaluated the relationship be-27 tween length and subcellular localization by calculating the length ratio (logarithmic 28 scale) between the cytoplasmic transcript and the nuclear transcript in each tran-29 script switch. We divided the transcript switches into three categories based on the 30 length ratio: positive (ratio > 1), negative (ratio < 1), and neutral (other). A 31 positive category indicates that the longer the transcript, the more enriched in the 32 cytoplasm, and vice versa. A neutral category means that there is no significant 1 correlation between transcript length and its subcellular localization. We observed 2 that the number of transcript switches in the positive category is higher than that 3 in the negative category, which implies that the longer the transcript is, the more 4 likely it is to be transported into the cytoplasm ( Figure 3B). To verify whether 5 the splicing frequency of the transcript is positively correlated with its cytoplasmic 6 location, we further compared the distribution of exon numbers (i.e., splicing fre-7 quency) in the cytoplasmic transcripts and nuclear transcripts for the positive and 8 negative categories, respectively ( Figure 3B, inset). We found that the exon num-9 ber of the cytoplasmic transcripts in the positive category is indeed higher than the 10 nuclear transcripts. Based on the above observations, we speculate that there are 11 significant di↵erences in subcellular localization between transcripts with or with-12 out splicing events. To confirm this hypothesis, we divided the transcript switches 13 into the mono-exonic (unspliced) and multi-exonic (spliced) groups and then com-14 pared the distribution of T U values between them. As expected, the T U value of 15 multi-exonic transcripts was positive (indicating cytoplasmic localization) and was 16 significantly and consistently higher than the negative T U value of mono-exonic 17 transcripts (representing nuclear localization) in all cell lines ( Figure 3C).

18
In brief, we first compared the NMD sensitivity between cytoplasmic transcripts 19 and nuclear transcripts, and found that transcript switches are not caused by NMD-20 induced imbalanced RNA levels between the cytoplasm and nucleus. Second, by 21 comparing the length between cytoplasmic transcripts and nuclear transcripts, we 22 found that longer transcripts with more splicing events are more likely to be enriched 23 in the cytoplasm. Finally, by comparing the T U values between spliced transcripts 24 and unspliced transcripts, we found that those unspliced transcripts were enriched 25 in the nucleus.

26
Enrichment and characterization of retained introns in the nuclear transcripts 27 Next, we asked whether there was a specific kind of splicing pattern associated 28 with subcellular localization. Seven di↵erent splicing patterns were considered [31].

29
Comparing to an original transcript, each type of splicing pattern inserts, deletes, 30 or replaces a partial sequence that may include sequence elements, which have 31 important or decisive e↵ects on subcellular localization (e.g., protein-binding sites, RNA structures, etc.). We used the , which ranges from -1 to 1, to measure the 1 preference for a type of splicing pattern between the nucleus and the cytoplasm (see 2 "Methods" for details). The smaller the , the more the corresponding splicing 3 pattern prefers the nucleus. Interestingly, in HeLa cells, we observed that retained 4 introns were obviously biased toward the nucleus, while other splicing patterns had 5 no significant preference ( Figure 4A). Consistent results were also observed in the 6 other twelve cell lines (Additional file 5). In conclusion, we have observed in all 7 thirteen cell lines that transcripts with retained introns prefer to be localized in the 8 nucleus.

9
To further characterize the retained introns associated with nuclear localization, 10 we first selected a set of retained introns enriched in the nucleus ( < 0 and 11 p < 0.05, termed as nuclear RIs). Compared with all retained introns, we should 12 be able to observe some di↵erent features of the nuclear RIs, which are likely to 13 be the determining factors for nuclear localization. We first investigated whether 14 the nuclear RIs have a specific splicing signal. By applying this splicing signal, 15 it is possible to regulate the intron retention specifically, thereby monitoring the 16 subcellular localization of the transcript. Unfortunately, we found no significant 17 di↵erence in the splicing signal between the nuclear RIs and all RIs ( Figure 4B). 18 Then, we considered whether the structure of the nuclear RIs (including the length 19 and RNA secondary structure) have a specific signature. Surprisingly, the length 20 of the nuclear RIs is significantly shorter than the overall length level ( Figure 4C, 21 upper). We also found that the average stem probability of the nuclear RIs is higher 22 than that of all RIs ( Figure 4C, bottom). Thus, we conclude that nuclear RIs 23 maintain a more compact and stronger structure when compared with the overall 24 RIs.

25
The preferences of RNA-binding proteins on retained introns suggest their role in 26 nuclear localization 27 We sought to further investigate whether the nuclear RIs mentioned above are asso-28 ciated with RBPs. We calculated the frequency of each dinucleotide in the nuclear 29 RIs across thirteen cell lines and normalized it with the background frequency of all 30 intronic sequences in the genome. The reason we examined the dinucleotide compo-31 sition of nuclear nucleotides was due to the reported specific contact between amino 32 acids and dinucleotides [32]. The normalized dinucleotide frequency represents the 1 extent to which the corresponding dinucleotide is preferred in the nuclear RIs. We 2 found the overexpression of the GC-rich sequence occurred in the nuclear RIs ( Fig-3 ure 5A). This observation also provides us an intuitive hypothesis of whether RBPs 4 that preferentially bound to RNA containing GC-rich sequences directly or indi-5 rectly a↵ect subcellular localization.

6
To further explore which RBPs preferentially interact with nuclear RIs, we pre-7 dicted the binding preference between an RBP and an intronic sequence using 8 RBPmap [33], a motif-based approach. We found that across thirteen cell lines, 9 fourteen RBPs consistently (more than 80%) preferred attaching to the nuclear RIs We emphasize that T U measures the change in the proportion of gene expression 27 for a transcript between the nucleus and the cytoplasm. An increased T U does 28 not imply that the transcript abundance in the cytoplasm is higher than that in 29 the nucleus. Instead, a significantly increased T U indicates that the transcript is 30 stored dominantly for the corresponding gene in the cytoplasm, but is expressed as 31 a minor isoform in the nucleus. Therefore, T U measures the dynamic changes of a representative (dominantly expressed) transcript of a gene between the nucleus 1 and the cytoplasm. By using T U, we defined a transcript switch, which indicates 2 that a single gene has two di↵erent representative transcripts in the nucleus and 3 the cytoplasm, respectively. A gene containing a transcript switch is termed as a 4 switching gene in this study.

5
We hypothesize that a switching gene can function separately in the nucleus and 6 cytoplasm (hereafter "bifunctional gene"), respectively. Previous studies have dis-7 covered several bifunctional mRNAs, which generate coding and non-coding iso- the switching genes. Indeed, we have observed 768⇠2,777 switching genes (Fig-20 ure 1C) in thirteen cell lines, suggesting that bifunctional genes or GCR are more 21 widespread in the human genome than expected. A total of 8,720 switching genes 22 ( Figure 1D) were identified in this study. We believe that this analysis promises 23 to provide a valuable resource for studying bifunctional or GCR-associated genes. 24 Additionally, we observed that most of the switching genes were cell-specific (Fig-25 ure 1D), suggesting that such genes may be related to genetic adaption to the 26 environmental changes (stress, development, disease, etc.).

27
Based on the observation that multi-exonic transcripts are preferentially exported 28 to the cytoplasm, we argue that splicing supports cytoplasmic localization. One 29 possible reason is that the exon-exon junction complex ( Studying the e↵ects of alternative splicing on RNA localization in healthy tissues 7 or other species would be another challenging work. In summary, this research pro-8 vides important clues for studying the mechanism of alternative splicing on gene 9 expression regulation. We also believe that the switching genes defined in this study 10 (Additional file 3) will provide a valuable resource for studying gene functions. and then subjected to high-throughput RNA sequencing. H1-hESC and NCI-H460 20 samples without replicates were discarded. See additional file 1 for accessions of 21 RNA-seq data.

22
The human genome (GRCh38) and comprehensive gene annotation were obtained 23 from GENCODE (v29) [58]. RNA-seq reads were mapped with STAR (2.7.1a) [59]  To investigate the inconsistency of subcellular localization of transcript variants, 30 we calculated the T U from the RNA-seq data to quantify this bias. For each 31 transcript variant, T U indicates the change in the proportion of expression level 32 in a gene, which is represented as indicates that it is enriched in the nucleus. We utilized SUPPA2 to compute T U 8 and assess the significance of di↵erential transcript usage between the cytoplasm 9 and the nucleus. For a gene, we selected a pair of transcript variants T c and T n that where P i is the significance level of the i-th transcript variant.

13
To study whether the di↵erent splicing patterns a↵ect the subcellular localization defined in previous studies [60,63]. For each alternative splicing site (e.g. RI), The datasets and materials supporting the findings of this article are included within Additional files 1-5. 27 Competing interests 28 The authors declare that they have no competing interests.   Comparison of T U between mono-exonic (black) and multi-exonic (red) transcripts across 13 cell lines. T U shows a significant positive correlation with splicing, indicating that splicing appears to be a dominant factor for RNA export from the nucleus. Sequence logos show intron-exon (the left) and exon-intron (the right) splice boundaries.