Abstract
Advances in sequencing technologies have empowered epitranscriptomic profiling at the single-base resolution. Putative RNA modification sites identified from a single high-throughput experiment may contain one type of modification deposited by different writers or different types of modifications, along with false positive results because of the challenge of distinguishing signals from noise. However, current tools are insufficient for subtyping, visualization, and denoising these signals. Here, we present iMVP, which is an interactive framework for epitranscriptomic analysis with a nonlinear dimension reduction technique and density-based partition. As exemplified by the analysis of mRNA m5C and ModTect variant data, we show that iMVP allows the identification of previously unknown RNA modification motifs and writers and the discovery of false positives that are undetectable by traditional methods. Using putative m6A/m6Am sites called from 8 profiling approaches, we illustrate that iMVP enables comprehensive comparison of different approaches and advances our understanding of the difference and pattern of true positives and artifacts in these methods. Finally, we demonstrate the ability of iMVP to analyze an extremely large human A-to-I editing dataset that was previously unmanageable. Our work provides a general framework for the visualization and interpretation of epitranscriptomic data.
Similar content being viewed by others
Introduction
The recent discovery of various RNA modifications in the transcriptome has given rise to the fast-growing field of epitranscriptomics. Individual RNA modification sites have been found to regulate nearly all aspects of RNA metabolism and are involved in a wide range of biological processes1,2. Accordingly, RNA modification writers are spatiotemporally controlled, and their misregulation leads to a number of pathologies. For example, the writers of m6A and A-to-I editing are misregulated in multiple types of cancers and have been identified as promising therapeutic targets of cancers3,4,5, with related cancer treatments in pre-clinic or pre-investigational new drug (IND) stages. Catalytic inhibitor of METTL3 STM2457 leads to impaired engraftment and prolonged survival in various mouse models of AML5. ADAR1p150 inhibitor Rebecsinib prevents malignant A-to-I editing-mediated leukemia stem cell self-renewal in completed pre-IND studies6.
Methods for transcriptome-wide RNA modification detection at the single-base resolution are undergoing rapid development. The common strategies are to convert RNA modification signals into next-generation sequencing-detectable signals, such as mutation and truncation (e.g.7,8,9,10,11). Putative RNA modification sites called by these strategies from a single high-throughput experiment contain one type of modification deposited by different writers (e.g., m5C called from RNA BS-seq approach or pseudouridine called from CMC-related methods12,13,14) or different types of modifications (e.g., A-to-I editing and m1A called from RNA-seq11,15 or m6A and m6Am called from m6A antibody-dependent methods16). Meanwhile, various levels of noise are present in such putative RNA modification sites because most methods have limited accuracy and specificity17. Despite the importance of the precise classification of modification types and their corresponding writers, the currently developed tools are insufficient to address this issue.
Most types of RNA modifications occur within a specific sequence and/or structural context18,19,20,21,22,23. Thus, we may assign the writers of each authentic modification site even without experimental validation, as long as prior knowledge of writer target motifs is provided. Typically, we process the putative modification sites and flanking sequences by using motif finders24, which utilize probability statistics, graph theory, deep learning, or other approaches, to obtain a set of enriched motifs, and then determine the writers of some modification sites based on their motifs. However, such a process is unable to estimate the levels of noise within the dataset. Moreover, current motif finders are slow or fail to manage inputs with over 10,000 records, such as RNA modification sites called from RNA-seq data or Nanopore direct RNA-seq data25,26. We realize that the goal of analyzing RNA modifications and noise based on sequence features is, in some ways, similar to the high-resolution dissection of tissue compositions using single-cell RNA-seq technologies. Hence, we may take inspiration from the strategies used to analyze the large numbers of parameters generated in single-cell studies for RNA modification analysis.
Here we developed a framework termed interactive epitranscriptomic Motif Visualization and Subtype Partition (iMVP) for epitranscriptomic subtyping, visualization, and denoising, with demonstrations of its utility on various kinds of high-throughput experiments.
Results
Establishment of iMVP
Our strategy is based on two principles. (1) Most RNA modifications are deposited within a specific sequence context, which is determined by the biophysical interactions between writers and their substrates. This is presumably also true for most artifacts introduced by the enzymic and chemical processes in modification detection. (2) The input sequences only account for a tiny subset of the k-mer space. Accordingly, our iMVP framework consists of five steps for the analysis and visualization of the topological distribution of RNA modifications, as follows (Fig. 1a): (1) the k-mer patterns of the modification sites were extracted and encoded into a computer-readable format-here, we one-hot encoded the 21-mer sequences surrounding the modification sites; (2) a dimension reduction algorithm was used to project the extracted patterns onto a 2-D plane, where the similarities of the sequences were therefore approximated by the Euclidean distances in the projections; (3) an unsupervised clustering algorithm was applied to group the enriched patterns in the projections; (4) clusters were extracted automatically or manually based on the contours via our interactive interface with the help of visualization; and (5) further analyses of the clusters, including (but not limited to) drawing the logos with WebLogo27 or discovering motifs by canonical methods (e.g. MEME28), were performed.
We started with dimension reduction tools that are widely used in single-cell analyses because their properties and performances have been well evaluated29,30, especially t-distributed Stochastic Neighborhood Embedding (t-SNE)31 and Uniform Manifold Approximation and Projection (UMAP)30. To benchmark the candidate algorithms, we selected m5C profiling data in fly embryos, which consisted of experimentally validated NSUN2-dependent (Type I, 90.9%) sites and NSUN6-dependent (Type II, 9.1%) sites with clear motif preferences (Supplementary Data 1)21. As expected, in all tested algorithms, RNA modification sites with a stringent sequence requirement, i.e., Type II sites with a strong 5’-CUCCA-3’ motif21, were highly condensed (Fig. 1b); in contrast, sites with a less stringent sequence requirement, i.e., Type I sites19,32, were projected into a broad area with a continuous density distribution (Fig. 1b). To quantitatively evaluate the candidate algorithms, we developed two metrics, the outgroup-ingroup score and boundary score (see ”Methods” section), to measure their condensability and discernibility, respectively. We found that t-SNE, Open t-SNE33, and UMAP had better condensabilities (i.e., higher outgroup-ingroup scores) than PCA, Isomap34, and DensMAP35, generating projections much closer to their corresponding centroids. Compared with t-SNE and Open t-SNE, UMAP had a better discernibility (i.e., a lower boundary score), making the boundaries of Type I and Type II sites much clearer (Fig. 1b and Supplementary Fig. 1a). These findings were also true for more complicated m5C data from nocodazole (Noc) treated HeLa cells, in which at least 4 motifs, including two minor motifs, are present (Supplementary Fig. 1b). We also tested UMAP with different sequence encoding methods, such as PCA preprocessing (a regular step in single-cell analysis36) and numeric labeling. No other methods with better performance than that of the one-hot encoding method were found (Supplementary Fig. 1c). Additionally, we tested UMAP and t-SNE on different initialization methods with a simulation dataset containing 109,850 sequences (Supplementary Note 1). We found that most motifs can be correctly clustered by UMAP and t-SNE, no matter which method is used (Supplementary Fig. 1d). Last, we tested UMAP with different random seeds or metric functions (Supplementary Fig. 1e, f), and it demonstrated consistent robustness across different parameter settings in general. Since UMAP and t-SNE exhibited comparable performance, we opted for UMAP for the subsequent analyses.
We next sought a suitable unsupervised clustering algorithm based on the UMAP output. Among the tested algorithms, we found that both density-based algorithms (e.g., HDBSCAN37, DBSCAN38, and OPTICS39) and graph-based algorithms (e.g., Spectral clustering40, Louvain41, and Leiden42) were able to recognize distinct clusters from UMAP projections (Fig. 1d and Supplementary Fig. 2a). Among these methods, HDBSCAN, Leiden, and Louvain demonstrated acceptable computational cost (Fig. 1e) and exhibited superior performance in clustering, as indicated by precision, recall, and Adjusted Rand Index (ARI) metrics (Fig. 1f, Supplementary Fig. 2b, and Supplementary Note 1). While Louvain and Leiden achieved higher ARI scores in fly dataset (Supplementary Fig. 2b), HDBSCAN outperformed them when handling noisy datasets that contain randomly generated sequences (Supplementary Fig. 2c-e). Moreover. Louvain and Leiden were found to be less flexible with only one adjustable parameter and tended to generate interleaved small clusters (Supplementary Note 2 and Supplementary Fig. 2f). In contrast, HDBSCAN enables both standard clustering and “soft clustering” (complete assignment), which makes it more flexible in different scenarios.
Since the majority of reported RNA modification motifs are within a range of 15 nucleotides (nt)1, we chose to analyze 21-nt sequences in our computations. Furthermore, we conducted tests using k-mers of varying lengths and observed that iMVP demonstrated robustness in handling k-mers ranging from 11 to 51 nt (Supplementary Fig. 3). In combination with UMAP and HDBSCAN, iMVP can process >8,000 21-mer sequences within 3 min. For ease of use, we also developed an interactive interface for iMVP (Supplementary Movie 1), which allows us to combine the automatic partitioning results with manually selected clusters. Both the cookbooks for iMVP and the interactive interface are available on GitHub (https://github.com/SYSU-zhanglab/iMVP).
Comparison of iMVP with traditional tools
To compare the performance of iMVP with other traditional tools in motif search, we analyzed two simulation datasets that comprised 100,000 and 200 sequences with 12 and 5 heterogeneous motifs, respectively (Supplementary Note 1). Our initial comparison was between iMVP and MEME. In the larger dataset, iMVP demonstrated superior ability to identify less abundant motifs compared to MEME, whereas both methods performed similarly in the smaller dataset (Supplementary Fig. 4–6 and Supplementary Note 3).
Next, to ensure a fair comparison between iMVP and traditional tools (including MEME, STREME, and HOMER), we extensively optimized the parameters of the traditional tools for the larger dataset, resulting in the generation of 200 parameter sets (see ”Methods” section). We evaluated the performance of different methods by analyzing their computational time and accuracy in motif searching. Overall, our findings revealed that iMVP surpassed existing tools in terms of both computational time and precision (Supplementary Fig. 7).
Relaxation of position-dependency of iMVP to broaden its applications
In addition to the scenario in which positions with modification-induced signals matched the coordinates of RNA modification sites themselves, signals generated due to the modified bases can be shifted to adjacent positions. For example, such an issue is common in signals from Nanopore direct RNA-seq, which shows promise for discriminating and identifying different RNA modifications in native RNA. Several methods based on base-calling errors, altered current intensities, or trace profiles of Nanopore direct RNA-seq data have been developed to identify signatures that represent RNA modifications (e.g43,44,45,46.). With these methods, modification caused errors and disturbances have been found to occur from the −3 to +3 positions relative to the modification sites43,46. Because of the position-aware nature of iMVP, we proposed a phase-matching strategy to solve such a “phase mismatching” issue, thus broadening the applications of iMVP.
To demonstrate this strategy, we analyzed the RNA modification sites called by xPore based on the differential Gaussian distribution of Nanopore signal parameters between wild type and METTL3 knockout HEK293T cells26 (Fig. 2a). In this dataset, RNA modification signals tended to locate in the −1 to +1 positions adjacent to the modified bases (three phases)26 (Supplementary Fig. 8a). To perform phase matching, we first applied iMVP to the unprocessed xPore variant data and grouped the sites into 9 clusters (Supplementary Data 2). Clusters #1 to #3, which accounted for 47.9% of the sites (3,446 sites), were within RRACH (R = A/G, H = A/C/U) motifs in different phases (Fig. 2b). To match the phases of m6A motifs, we recentered the sequences of cluster #1 to cluster #3 onto an “A” if possible (Fig. 2c). Other small clusters were centered by non-A bases (Supplementary Fig. 8b–e) and were less likely to be m6A signals; thus, they were not included in the analysis. After phase matching, we defined 95% (3,175) of sites as authentic m6A sites within RRACH motifs (Fig. 2d). The remaining 5% of sites were within the CAR motif (Fig. 2d). Despite the CAR motif resembles previously reported m6Am motif (BCA (B = C/G/U)47,48 or BBCABW49, as METTL3 is exclusively associated with m6A modification, these sites are likely to be false positives. Consistently, when examining the locations of CAR sites relative to transcription start site (TSS), no enrichment around TSS was observed (Supplementary Fig. 8f).
To evaluate the reliability of putative modification sites defined using iMVP, we applied a different method, i.e., m6A-seq approach, to identify m6A/m6Am in the same cell line (see ”Methods” section). A winscore-based method50 was used to quantify the m6A/m6Am peaks in our m6A-seq dataset. Winscore quantifies the enrichment of modified reads within a 50-nt window relative to gene expression and modified reads across the entire gene. As a positive control, we used m6A/m6Am sites identified by m6ACE-seq (Supplementary Fig. 9)51. We found that m6A clusters in both xPore and m6ACE-seq datasets had significantly higher winscores than the background winscores, indicative of authentic m6A sites (Fig. 2e). However, m6Am-like clusters in m6ACE-seq dataset but not xPore dataset had significantly higher winscores (Fig. 2e). This result is consistent with the role of METTL3 in regulating only m6A sites and suggests that the previously reported xPore sites with CAR motifs were likely false positives. Our analysis highlights a phase-matching strategy to expand the applications of iMVP.
To further relax the position-dependency of iMVP to broaden its applications, we introduced a sliding window strategy. This strategy consists of three steps: (1) generating sliding windows on input sequences; (2) performing iMVP on all the windows to extract clusters; and (3) conducting motif alignment and comparison on the extracted clusters to identify the patterns. To test this strategy, we utilized a simulation dataset comprising 500 PDX1 motifs randomly distributed in a 50-bp sequence, along with 500 random noise sequences. We first generated 20-bp windows with a 1-bp step, resulting in a total of 30,000 windows. Then we applied iMVP to these 30,000 windows (Supplementary Fig. 10a, b) and extracted enriched patterns (clusters #1 to #11) (Supplementary Fig. 10c). Among these patterns, 10 exhibited distinct PDX1 motifs. These motifs may be further aligned with Tomtom to generate a consensus motif (Supplementary Fig. 10d). By following all three steps, we identified between 48 to 483 sequences containing PDX1 motifs. The false discovery rate (FDR) ranged from 7.7% to 45.5%, depending on the specific thresholds applied (Supplementary Fig. 10e–g).
iMVP assists in the discovery of previously unknown mRNA m5C writers and known writer divergence across species
For the well-studied RNA modifications, such as m6A, m5C and pseudouridine, it is known that multiple writers have evolved to target unique subsets of the modification sites based on the differential protein-RNA biophysical interactions. The map of a specific RNA modification type is built upon this principle, and in turn, we can infer the writers of this map by visualizing the sequence features of all modification sites of this map. Foreseeing the needs in novel writer mining, we employed iMVP to investigate the most comprehensive m5C profiles from various samples in humans, mice, frogs, zebrafish, and flies19,32, where new writers may be hidden. Intriguingly, in addition to two clusters that represent known Type I and Type II sites, two more clusters appeared in early developmental stage samples (Fig. 3a, Supplementary Fig. 11, and Supplementary Data 3) and 48-hour Noc-treated HeLa cells (Fig. 3b and Supplementary Data 4), which are known to be enriched with mRNA m5C32. The first cluster (termed Type III sites) was found in both early developmental stages in vertebrates (Fig. 3a and Supplementary Fig. 11) and Noc-treated HeLa cells (Fig. 3b). This cluster contained a 5′-GUNGCCANNUG-3′ motif in a less structured sequence and accounted for 0.67% to 2.8% of the sites. The second cluster (termed Type IV sites) was only found in Noc-treated HeLa cells (Fig. 3a, b and Supplementary Fig. 11). This cluster contained a 5′-UUCGANGU-3′ motif and accounted for 11.04% of the sites. Notably, these 4 types of m5C sites cannot always be detected by MEME and HOMER (Supplementary Fig. 12).
The sequence contexts of Type III and IV sites resembled the C2278 (Fig. 3c) and C2870 (Fig. 3d) sites in yeast 25 S rRNA methylated by Rcm1 (NSUN5) and Nop252, suggesting that these two proteins may be new mRNA m5C writers. To validate this inference, we knocked out NSUN5 (Supplementary Fig. 13a) and knocked down Nop2 via siRNA in HeLa cells (Supplementary Fig. 13b). As a control, NSUN2 and NSUN6 were knocked out separately. As expected, loss or decreased levels of m5C methylation in each cluster was observed in the corresponding knockout or knockdown cells (Fig. 3e and Supplementary Fig. 13c). Consistently, we found that sites with a Type III motif had increased m5C levels (Supplementary Fig. 13d-e and Supplementary Data 4) when re-analyzing previously published NSUN5 overexpression data from LN229 cells53. Interestingly, upon the overexpression of NSUN5, a minor Type III sub-motif, 5’-GUNNNCCAKHUG-3’ (K = A/C), was also found (Supplementary Fig. 13f). Taken together, these findings indicate that NSUN5 and Nop2 are previously unknown mRNA m5C writers, suggesting that our global motif analysis strategy enables new motif and writer discovery.
Global visualization may also facilitate an intuitive comparison of differences in motifs among different biological contexts. To determine whether iMVP can assist with such a comparison, we examined the motif divergence of different types of m5C sites across species with iMVP. We first aggregated m5C sites (145,144 sites in total) from six species to generate a combined iMVP map. Next, we projected the sites of individual species onto the combined map for comparison. Interestingly, despite all three types of sites being highly consistent among all six species (Fig. 3f, g), a motif preference divergence between fly and vertebrate Type I sites was observed, wherein fly NSUN2 had a stronger preference for the CGRRG and CNGGAG motifs (Fig. 3g, h). Compared with the other two sub-types of motifs found in fly, vertebrate NSUN2 motifs exhibited a higher degree of similarity to the CRGNR sub-motif, as indicated by lower KL divergences (Fig. 3i). To investigate whether this difference was due to the preference of NSUN2, we overexpressed NSUN2 from human, mouse, zebrafish, and fly in NSUN2 knockout HeLa cells. The motifs found in this assay closely aligned with the finding in iMVP (Fig. 3j). Taken together, iMVP assists in the identification of subtle motif preference differences between vertebrate and fly NSUN2 proteins.
iMVP facilitates the identification of false positives in RNA BS-seq data
One of the major challenges in epitranscriptomic studies is how to obtain a reliable map of RNA modifications. Of the methods developed for transcriptome-wide RNA modification mapping, most have limited accuracy and specificity, thus leading to considerable levels of false positives. Many artifacts introduced by the enzymic and chemical processes in modification detection, as well as errors generated during high-throughput sequencing, preferentially occur within certain sequence contexts. Thus, global motif visualization of the inputs may facilitate the identification of false positives.
To determine whether this is true, we applied iMVP to the m5C profiling data in adult human and mouse tissues. mRNA m5C sites in mammalian adult tissues appear to be much less frequent19 and thus tend to be enriched in false positives, making accurately identifying real sites more challenging. Intriguingly, of the m5C profiles in human and mouse tissues, we indeed identified 4 (#4 to #7) and 2 (#4 and #5) minor clusters that were likely false positive, respectively (Fig. 4a, b). Both human and mouse clusters #4 were embedded in a homopolymer run of As that is prone to sequencing errors. The remaining human clusters (#5 to #7) were in primate-specific repeat element Alus. The second mouse cluster (#5) was in a GC-rich repetitive region that is difficult to be converted with BS-treatment. An examination of the raw mapping data revealed that these sites were clustered (Fig. 4c, d). Based on our finding that clustered sites in bisulfite sequencing tend to be false positives from conversion failure19, these sites were believed to be artifacts that escaped from our filters due to close-to-threshold statistical parameters. These observations highlight the ability of iMVP to pinpoint possible false positives that were unaddressed by traditional filtering steps. Notably, iMVP itself can only identified enriched clusters, and additional experiment and/or knowledge is required to determine whether these clusters are false positives or not.
Spiked iMVP distinguishes RNA-seq variants caused by RNA modifications from noise
Recent studies found that RNA modifications could introduce mismatches during reverse transcription; thus, variants called from RNA-seq may represent DNA mutations, RNA editing sites, or RNA modifications15,54. Based on these findings, algorithms, such as ModTect11, have been developed to analyze RNA-seq data to identify putative RNA modification sites. Due to their complexity, it is expected that a high-level of noise is present in putative RNA modification sites inferred from RNA-seq data. To assist in the analysis of such datasets, we developed a modified iMVP strategy, termed spiked iMVP, and applied it to the ModTect dataset.
In spiked iMVP, k-mers containing known modification signals (m1A, m1acp3Ψ, m3C, m4C, and m22G) were spiked into the putative RNA modification-induced variants to label the motif preferences of known RNA modifications (Fig. 5a and Supplementary Data 5). Next, variants with different reference bases were separately analyzed with iMVP and clusters were visualized (Fig. 5b–e). For example, in the A base group, we identified cluster #1, which contained 22 novel m1A-like sites and 5 known m1A spike-in sequences (Fig. 5b). In the U base group, we observed a GUG preference labeled by m1acp3Ψ(cluster #2) (Fig. 5c). In the G base group, which contained the most abundant variants called by ModTect, variants were grouped into three major clusters (cluster #4, cluster #5, and cluster #6) (Fig. 5e). Of the three clusters, the m22G spiked cluster #5 was enriched with a 5’U signature (Fig. 5e). These high-confidence m1A sites, and the novel sequence feature and narrowed-down list of m1acp3Ψand m22G identified with iMVP will facilitate future validation and functional studies. Note that the clusters containing known RNA modifications only accounted for a small fraction of the maps; and many of the remaining clusters were likely false positives (Fig. 5b, d and Supplementary Fig. 14): some were surrounded by Cs; some were within a tract of A/U dinucleotide repeats; and some were within a strong poly(A)-tract. Such variants were likely artifacts introduced by enzymatic and optical processes during library construction and sequencing. In summary, Spiked iMVP adds a new layer of filter and helps pinpoint more reliable RNA modification candidates from datasets with substantial fractions of false positives.
iMVP advances our understanding of m6A/m6Am profiles
A number of biochemical methods were recently developed to map transcriptome-wide m6A/m6Am at single-base resolution10,26,51,55,56,57,58. A systematical evaluation and comparison of sites called from different methods may shed light on our understanding of the comprehensive methylomes and the pattern of true positives and artifacts in these methods. However, such a comparison is missing due to the lack of tools. To bridge this knowledge gap, we collected m6A/m6Am sites from 7 studies with 8 different m6A/m6Am profiling methods (Fig. 6a and Supplementary Data 2): CIMS10, CITS10, m6ACE-seq51, m6A-label-seq55, MAZTER-seq56, m6A-REF-seq57, xPore26, and DART-seq58. Globally, we found that only a small set of m6A and m6Am sites were overlapped among methods (Supplementary Fig. 15a), although the same cell type was used in all studies. This observation suggests that each method may only captured a subset of the methylated sites. Furthermore, the potential variability in false positive rates among different methods may also contribute to the observed low overlap. Thus, approaches based on site overlaps are not able to compare different methods and iMVP may be qualified for this task.
We next performed iMVP on the aggregated 74,179 m6A/m6Am sites. Seven clusters were identified based on the patterns of density (Fig. 6b and Supplementary Fig. 15b). Clusters #1 to #4 were in canonical RRACH motifs, where cluster #1 had a strong 5′ A preference, and cluster #4 had no 5’ preference. Cluster #5 was a CAB motif, similar to previously reported m6Am motifs. Clusters #6 and #7 were in an A/G-rich context. In line with previous observations, the m6Am-like sites identified in Cluster #5 had the strongest enrichment around TSS; in contrast, the m6A sites in Cluster #1 to Cluster #4 had significantly weaker enrichments (Supplementary Fig. 15c). With global visualization, the extent of sequence preference in different methods was identified (Fig. 6c, d). Overall, CITS, m6ACE-seq, m6A-label-seq, and xPore had a broad range of k-mer types, while CIMS, MAZTER-seq, m6A-REF-seq, and DART-seq were biased towards certain types of k-mers. More specifically, CITS and m6ACE-seq, the two immunoprecipitation-based methods, had a similar k-mer distribution, despite CITS having a stronger enrichment in cluster #1(AAACA). m6A-label-seq, which is based on metabolic incorporation, had a unique enrichment in cluster #6. Although both MAZTER-seq and m6A-REF-seq utilized the MazF endoribonuclease strategy, sites identified by MAZTER-seq were more enriched in cluster #1. DART-seq, which is based on the hitchhiking of APOBEC C-to-U editing, had a strong bias in cluster #4 (NNACU).
Since antibody-based methods represent an indirect way to infer m6A positions, antibody-independent approaches have become the recent focus of method development. Because the sensitivities of different methods vary and are dependent on the local sequence features of the modified bases, we evaluated the reliability of m6A-antibody-independent approaches by comparing the winscores of individual clusters with the background winscores using m6A-seq peak data (see ”Methods” section). As a positive control, we used clusters identified by the m6A-antibody-dependent m6ACE-seq. DART-seq had the lowest m6A-peak enrichment among all methods (Fig. 6e), suggesting that many false positive signals may be introduced by APOBEC background editing. Compared with MAZTER-seq, signals from m6A-REF-seq had lower m6A-peak enrichment (Fig. 6e), suggesting that the addition of FTO treatment might introduce noise. Overall, m6A-label-seq and MAZTER-seq had the highest m6A-peak enrichment for nearly all clusters, suggesting that they are currently the most reliable methods. These results provide valuable insights for the selection of m6A-antibody-independent approaches for m6A and m6Am profiles at single-base resolution.
iMVP enables global motif discovery in an extremely large dataset
In canonical motif discovery tools, the input size is typically <10,000 sequences due to the high computational complexity of handling large datasets. Both UMAP and HDBSCAN are designed for large datasets, and they are currently implemented in the NVIDIA RAPIDS library, which allows us to scale up and speed up iMVP analysis with GPU acceleration (Supplementary Fig. 16a, b). Thus, our framework may be capable of handling extremely large numbers of RNA modification sites that were previously unmanageable, such as millions of A-to-I RNA variants in the human genome.
To challenge iMVP, we analyzed the human A-to-I RNA editing list from REDIportal25, which contains 15.6 million sites (5 million different 21-mers) (Fig. 7a). After dimension reduction, we reannotated the site list with UMAP coordinates to recover the densities. Globally, we observed distinct patterns of A-to-I RNA editing sites in Alu, repetitive non-Alu, and nonrepetitive regions (Supplementary Fig. 16c–e). Repetitive non-Alu and nonrepetitive sites were more centralized in k-mer patterns, although repetitive non-Alu sites had a broader range of k-mer distribution compared with nonrepetitive sites. In contrast, Alu sites gathered into small clusters far away from the centroid of the projections.
We then separately applied iMVP to the three types of A-to-I editing sites to achieve a high-resolution view. We found a consistent “galaxy-like” sparse view of Alu sites, a “mitosis-like” two-centroid k-mer distribution in repetitive non-Alu sites, and an “oocyte-like” pattern in nonrepetitive sites with a single centroid (Fig. 7b). To examine the clusters containing large numbers of data points in-depth, we applied an approximate clustering strategy based on a density histogram that mimicked the density sketching step in HDBSCAN (Supplementary Fig. 17a). With this strategy, we extracted 143, 52, and 22 large clusters from the projections. (Fig. 7c–e, Supplementary Fig. 17b–d, and Supplementary Fig. 18 and19). Interestingly, we found that the two centroids of repetitive non-Alu sites were located in poly(U)-tract and poly(A)-tract, respectively (Fig. 7d), which were mainly from LINE and LTR repeats (Fig. 7f). As homopolymer regions are prone to high sequencing error rates, we further validated these two groups of sites using ADAR1 knockout HEK293 cells59. We found that both sets of sites had decreased editing levels (Fig. 7g), confirming that these sites were authentic. For nonrepetitive sites, the centroid cluster was constituted by the canonical Position −1 G-depletion motif (Fig. 7e). Taken together, these analyses indicate that iMVP can handle extremely large datasets and expand our understanding of the sequence features of A-to-I RNA editing sites.
Discussion
To reveal the biological meanings of RNA modifications, it is vital to identify authentic modification sites and correlate each of them to the corresponding writers. However, tools for addressing this issue are lacking. To solve this problem, we developed an exploratory data analysis workflow, iMVP, to subtype and visualize the compositions of putative RNA modification sites. With the help of dimension reduction and density-based clustering, we converted the motif discovery workflow, which traditionally occurs in a black box, into a transparent visualization issue. We systematically and quantitatively benchmarked various decomposition and cluster algorithms and selected the best combination for use in our framework development. Moreover, we developed an easy to use interactive version of our tool, which supports both automatic subtype clustering and intuitive subtype partitioning. Notably, although iMVP is not primarily designed for the search of gapped motifs, it is still possible to identify such motifs if the input dataset contains a sufficient number of sequences with gapped patterns. This is exemplified by the identification of two subtypes of Type III motifs that were found to be gapped motifs (Supplementary Fig. 13e, f). We demonstrated that iMVP analysis expands our understanding of the motifs and writers of RNA modifications and pinpoints noise that was previously difficult to identify. With the aid of GPU acceleration, iMVP is able to handle millions of sites that were previously unmanageable.
It is worthy to mention that the combination of UMAP and HDBSCAN is one of the solutions for iMVP. For decomposition, t-SNE also has excellent performance. For clustering, Leiden and Louvain may outperform HDBSCAN when the background is not noisy. Moreover, although 2-D spaces are used in both decomposition and clustering in this study, for some complicated datasets, it is possible to project the k-mer patterns onto a 3 or higher dimension space for further clustering.
It should be noted that iMVP has some limitations. First, iMVP is not suitable for application to datasets containing too few sites or sites without sequence preferences. Second, iMVP is a position-sensitive strategy, and even with a phase-matching strategy, the current version may have poor performance when dealing with datasets with high numbers of phases. Future improvements, such as the use of parametric UMAP60 for dimension reduction, may improve its performance on such datasets and broaden the usage of iMVP (e.g., global analysis in iCLIP datasets of RNA binding proteins). Third, in the clustering step, the cluster output may vary with different parameters, especially when extracting clusters from the condensed density tree. Although this issue may be partially solved by introducing a hyperparameter optimization strategy with DBCV scores61, prior knowledge is still required for the selection of the proper parameters.
Given the recent rapid progress in the use of Nanopore direct RNA-seq to discover RNA modifications, it is expected that more tools will be developed to compare native and unmodified RNAs and identify differential sites, i.e., putative modification sites deposited by various writers. Hundreds of thousands to millions of sites may be found in a single Nanopore run. As we have shown that iMVP can handle the extremely large number of A-to-G variant sites, we expect iMVP to greatly aid in the visualization and interpretation of the Nanopore direct RNA-seq data in future studies.
Taken together, we anticipate that iMVP will be a valuable tool that can be rapidly adopted by the epitranscriptomics community and facilitate epitranscriptomic research in the future.
Methods
Cell culture
HeLa cells were purchased from Cell Bank, Type Culture Collection, Chinese Academy of Sciences (CBTCCCAS). HeLa cells have been identity verified using short tandem repeat (STR) analysis by CBTCCCAS. Cells were maintained in DMEM (Gibco, 11965118) supplemented with 10% FBS (CLARK, FB15015). Cells have been checked for mycoplasma contamination by CBTCCCAS and are routinely tested for mycoplasma by PCR detection of conditioned medium.
For Nocodazole treatment, wild type or knockout HeLa cells were first seeded and grown to 50% confluency. For Nop2 knockdown cells, cells were treated with siRNA for 48 hours before nocodazole treatment. Otherwise, cells were treated with 0.1 μg/ml nocodazole (Sigma, M1404) for 48 hours and then collected for further experiments.
CRISPR/Cas9-induced mutagenesis
NSUN5 (gRNA, GTATGAGTTGTTGTTGGGAA) knockout cells were generated via CRISPR/Cas9-induced mutagenesis19. In brief, a gRNA sequence was designed using CRISPR-ERA (http://CRISPR-ERA.stanford.edu). The sgRNA template oligonucleotide was synthesized and cloned into lentiCRISPR v.2 plasmid (Addgene no. 52961). The plasmid was transfected into the cells using Lipofectamine 3000 (Thermo, L3000015) following the manufacturer’s instructions. Transfected cells were selected using puromycin. Western blot with NSUN5 antibody (Proteintech, 15449-1-AP) was used to verify the loss of NSUN5 protein. NSUN2 and NSUN6 knockout cells generated previously were used in this study19,21.
RNA interference
For NOP2 knockdown, NOP2 siRNAs (siNOP2-1, GGAGUUCUUAGAAGCUAAU; siNOP2-2, GAUCCAGCCGUGAAGACUATT) were transfected using Lipofectamine RNAiMAX (Thermo, 13778150) following the manufacturer’s instruction. 48 hours after the transfection, the knockdown efficiency was examined by RT-qPCR with NOP2 primers (NOP2-F, GCACCCCAGGAACATGAG; NOP2-R, CACCACCTTGGGGAACTG) and GAPDH (as control) primers (GAPDH-F, TCAAGGCTGAGAACGGGAAG; GAPDH-R, GGACTCCACGACGTACTCAG).
mRNA BS-seq
mRNA BS-seq library construction was performed19. In brief, total RNA was isolated with TRIzol reagent and Direct-zol RNA MiniPrep kit. poly(A) + RNA was separated from total RNA using Oligo dT Magnetic Beads (Vazyme, N403). 100 ng–1 μg of poly(A) + RNA was converted using the EZ RNA methylation kit (Zymo Research, R5002) with a modified high-stringency conversion condition (Sulfonation: 3 cycles, (1) 70 °C, 10 min; (2) 64 °C, 45 min; Desulfonation: 25 °C for 30 min). The converted RNA was fragmented into 150 − 200 nt fragments in fragmentation buffer (NEB, E6150) at 94 °C for 8 min. The fragmented RNA was then used for library construction using NEBNext Ultra II Directional RNA Library Prep Kit. Libraries were sequenced on Hiseq X10 (Illumina) to produce paired-end 150 bp reads (Supplementary Data 6).
m6A-seq
Polyadenylated RNA, separated from total RNA using Oligo dT Magnetic Beads (Vazyme, N403), was fragmented in 1X NEB Next Magnesium RNA Fragmentation Buffer at 94 °C for 5 min. 10 ng of fragmented RNA was saved as input. 10 μg fragmented RNA was further incubated with 5 μg rabbit anti-m6A polyclonal antibody (Synaptic Systems, catalog number 202003) in IPP buffer (10 mM Tris-HCl pH 7.4, 150 mM NaCl and 0.1% Igepal CA-630) overnight at 4 °C. The m6A-Ab mixture was then immunoprecipitated by incubation with protein-G magnetic beads at 4 °C for another 2 hours. RNA was then eluted from the beads with IPP buffer containing 0.5 mg/ml N6-methyladenosine (Sigma-Aldrich, M2780). VAHTS stranded mRNA-seq library prep kit (Vazyme, NR601) was used for library construction. Libraries were sequenced on HiSeq X (Illumina) to produce paired-end 150 bp reads (Supplementary Data 6).
Data pre-processing
A summary of the source of data is in Supplementary Data 6.
m5C analysis
All m5C sites analyzed were obtained with standard BS-seq library preparation and analysis workflow19,32. These m5C sites were in GRCh37 or GRCm38 coordinates.
xPore analysis
The processed data of HEK293T (in GRCh38 coordinates) were obtained from the supplementary table of xPore26. We retrieved the strand information of the sites based on the provided gene ID.
m6ACE-seq data analysis
We downloaded the site list from the supplementary table of Koh et al51.
ModTect analysis
The site lists of ModTect in hg19 (GRCh37) coordinates were obtained from the supplementary table. Only sites with no ambiguous strand direction were used in our analysis. We spiked 14 known modified sequences (Supplementary Data 5) into the ModTect dataset for analysis.
m6A method comparison
We performed the m6A method comparison analyses using Ensembl GRCh38 coordinates. We first downloaded the site list from the supplementary tables of Linder et al10. (CITS/CIMS), Shu et al55. (m6A-label-seq), Pandey et al56. (MAZTER-seq), Zhang et al57. (m6A-REF-seq), Meyer58 (DART-seq), and Pratanwanich et al26. (xPore). For m6A-label-seq data, we retrieved the genomic coordinates based on the RefSeq id and sequences. For DART-seq, we assigned the m6A sites adjacent to the C-to-U signals. For sites in hg19 format, we used UCSC liftOver tool to convert them to Ensembl GRCh38 format. The summarized site list is in Supplementary Data 2.
RNA editing analysis
The site list (hg38) was downloaded from REDIportal25. We grouped the sites into Alu, repetitive non-Alu, and nonrepetitive sites based on the annotations in the original table.
The iMVP framework
For each analysis, we provided the codes, example inputs, and example outputs in detail in Juypter-Notebook (https://github.com/sysu-zhanglab/iMVP). Our package (including helper functions and the interactive interface) is available on Python Package Index (Pypi): iMVP-utils. Documents of our analysis and the package can be found on https://imvp.readthedocs.io/. In brief, we encoded the 10 nt flanking sequences of the sites into one-hot encoded format, and then projected the encoded sequences onto a 2-D plane with UMAP. We used HDBSCAN to split the projections based on the density contours. If stubborn clusters (e.g., two high-density clusters connect with a linkage) encountered, we further split those clusters with HDBSCAN and merged the small clusters manually based on the density contours. Density Based Cluster Validity (DBCV), which is implemented in the HDBSCAN API, may be used to search for suitable combination of the parameters. For Leiden, the Significance of Hierarchical Clustering (SHC) may be used to calculate the quality of clustering62.
In particular, for phase matching, we first built the 21-mer sequences for the −2, −1, 0, +1, +2 phases of the signals, then we performed phase-matching based on the patterns of clusters. Coordinates of phase-matched sites were re-computed to match the positions of the “A”s. For RNA editing site analysis, we used a single RTX2080Ti GPU for UMAP analysis. In the test run, we found a 46-fold boost in UMAP, allowing us to project up to 5 million 21-mer sequences within 15 min. However, CPU-based HDBSCAN is still the best practice because of the shorter runtime and the more flexible memory allocation. The computation of UMAP of k-mers was finished within 1 hour, where the majority of the time was spent on I/O and encoding process rather than UMAP computation (less than 15 min).
We set up our analysis (CPU-based) with Python 3.7, and the following core packages were used: Pandas (v1.3.4), Numpy (v1.20.0), Scipy (v1.5.1), Scikit-learn (v0.23.1), biopython (v1.77), hdbscan (v0.8.27), umap-learn (v0.5.2), openTSEN (v0.6.1), louvain (v0.7.1), leidenalg (v0.8.8), dash (v2.2.0), dahs-bio (v0.9.0), imageio (v2.13.5), weblogo (v3.7.0), and opencv-python (v4.5.5). For GPU-based analysis, we used the Docker container provided by NVIDIA: RAPIDS release 22.02 in Ubuntu 18.04, Python 3.8, and CUDA 11.5. The workflow was run on a RTX2080Ti with 11 Gb memory. The parameters and operations of each analysis were summarized in Supplementary Data 7.
Calculation of outgroup-ingroup score and boundary score
Outgroup-ingroup score was defined as the coefficient of variation (CV) ratio between outgroup distances and ingroup distances. Ingroup and outgroup distances were defined as the Euclidean distances between a data point to the centroid of a group: \(D=\sqrt{{(x-\bar{x})}^{2}+{(y-\bar{{{\mbox{y}}}})}^{2}},\) where x and y are the X-Y coordinate of a data point, and \(\bar{x}\) and \(\bar{y}\) are the mean of all data points of a group. Then the standard deviation of ingroup distances (SDingroup) and outgroup distances (SDoutgroup) and the means of them (Meaningroup and Meanoutgroup) were calculated. Outgroup-ingroup score was then obtained via the following formula: \({{\mbox{Score}}}=\frac{{{{\mbox{CV}}}}_{{{\mbox{outgroup}}}}}{{{{\mbox{CV}}}}_{{{\mbox{ingroup}}}}}=\frac{{{{\mbox{SD}}}}_{{{\mbox{outgroup}}}}/{{{\mbox{Mean}}}}_{{{\mbox{outgroup}}}}}{{{{\mbox{SD}}}}_{{{\mbox{ingroup}}}}/{{{\mbox{Mean}}}}_{{{\mbox{ingroup}}}}}\). If sites in a group are more condensed, a lower ingroup CV is expected; if the sites in different groups are better separated, a higher outgroup CV is expected. Hence, algorithms with better performance will have higher outgroup-ingroup scores.
Boundary score was calculated as below. We first searched for the top 50 nearest neighbors for each data point. Next, we calculated the proportions (P) of the neighbors that were not in the same group of the data point. Last, all proportions were aggregated to draw a cumulative distribution (Supplementary Fig. 1a). Boundary score was defined as the ratio of the sites whose P > 0.1. Hence, higher boundary scores mean that more sites were projected to form ambiguous boundaries. This score is complementary to the outgroup-ingroup score to describe the discernibility of the algorithm.
Calculation of ARI
ARI was calculated by sklearn.metrics.adjusted_rand_score function.
Parameter optimization for MEME, STREME and HOMER
For MEME, we fine-tuned parameters including maxw, mods, markov_order, and objfuns. Given that our simulation dataset comprised 10 motifs, each with a length exceeding 6, we established specific parameter values to optimize the performance of MEME. We set the minimum motif width (minw) to 6 and the number of motifs (nmotifs) to 10. Additionally, we explored various values for the maximum motif width (maxw), including 8, 12, and 15. We also varied the Markov model orders (markov_order) from 0 to 3. The mod parameter encompassed zoops, anr, and oops schemas, while the objfuns parameter encompassed the classic, de, se, cd, nc, and ce algorithms. This exploration resulted in the generation of 192 parameter sets for comprehensive testing.
For HOMER, to address the constraints of HOMER, which is limited to searching for motifs of a single specific length, we performed an analysis using a set of seven length parameters, spanning from 6 to 12. We also set the number of motifs to 10. The mean values were used to represent the performance of HOMER.
For STREME, we adjusted the two parameters: maxw and objfun. The maxw parameter was varied across three values: 8, 12, and 15. while the objfun parameter included two options: cd or ce. This led to 6 parameter sets being tested.
E-value calculation
E-values of the motifs were calculated by Simple Enrichment Analysis (SEA) of MEME (v5.0.0). All E-values, p-values, and q-values of the motifs presented in the study were shown in Supplementary Data 8.
KL divergence
KL divergences were calculated by MotifSuite63. A customized script based on Biopython.motifs API was used to generate the PWM format input for MotifSuite.
Motif drawing and motif finder analysis
We used Weblogo (v3.7.0) to plot the motifs of selected sequences. MEME/ STREME (v5.3.3) and HOMER (v4.11) were used in motif discovery.
mRNA BS-seq data analysis
mRNA m5C sites were called as we previously described19. We first trimmed adapters, the first 10 bp of the reads, the last 6 bp of the reads, and the low-quality bases using Cutadapt (-e 0.25 -q 25 -trim-n)64 and Trimmomatic65. Then clean reads were mapped to the in silico converted genome by HISAT2 (-k 10,–fr,–rna-strandness FR,–no-mixed)66 to obtain unique alignments. The remaining unmapped and multiple mapped reads were further mapped to the in silico converted transcriptome by Bowtie2 (-end-to-end,–fr,–gbar 5,–mp 5, -k 10, -R 2, -D 5)67. Alignment results were merged, and only bases with high quality (Q ≥ 30) were used for the variant calling. Last, the sites were called using a series of filters as previously described19. The methylation level (mismatch frequency) is defined as the number of reads with C divided by the number of reads with C or T.
RNA structure prediction and illustration
RNA secondary structure was predicted using RNAfold (2.4.12)68 with default settings. The m5C sites and 50 nt flanking sequences were extracted from the genome. The folding frequencies of each base of the sites with their 25 nt flanking sequences were shown.
m6A-seq analysis
We first trimmed adapters using Cutadapt64 (-e 0.1 -q 20 -m 20 --trim-n). Cleaned reads were then mapped to GRCh38 (Ensembl release 104) genome and transcriptome by HISAT2 with default settings in strand-specific mode66. Only unique alignments were transformed into bedGraph for further analysis. m6A peak analysis was performed as previously described with some modification50. In brief, each gene was split into 50-nucleotide sliding windows and an enrichment fold (winscore) was calculated for each window:\(\,{{\mbox{winscore}}}=\log 2(\frac{{{\mbox{MeanWinIP}}}/{{\mbox{MedianGeneIP}}}}{{{\mbox{MeanWinControl}}}/{{\mbox{MedianGeneControl}}}})\). MeanWinIP and MeanWinControl are the mean coverage for each window for immunoprecipitation and input control, respectively. MedianGeneIP and MedianGeneControl are gene median coverages for immunoprecipitation and input control, respectively. To examine the enrichments of each cluster, we compared the cumulative distributions of the winscores of each cluster with background winscores.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data supporting the findings of this study are available from the corresponding authors upon reasonable request. The sequencing data generated in this study have been deposited in the GEO database: BS-seq of HeLa cells with NSUN protein deficiency: GSE197650; control HeLa cells: GSM5319029; m6A-seq data generated in this study: GSE198955. Source data for the figures and supplementary figures are provided as a Source Data file. Source data are provided in this paper.
Code availability
The source code is published at https://github.com/SYSU-zhanglab/iMVP. https://doi.org/10.5281/zenodo.8286943.
References
Wiener, D. & Schwartz, S. The epitranscriptome beyond m6A. Nat. Rev. Genet. 22, 119–131 (2021).
Frye, M., Harada, B. T., Behm, M. & He, C. RNA modifications modulate gene expression during development. Science 361, 1346–1349 (2018).
Ishizuka, J. J. et al. Loss of ADAR1 in tumours overcomes resistance to immune checkpoint blockade. Nature 565, 43–48 (2019).
Mehdipour, P. et al. Epigenetic therapy induces transcription of inverted SINEs and ADAR1 dependency. Nature 588, 169–173 (2020).
Yankova, E. et al. Small-molecule inhibition of METTL3 as a strategy against myeloid leukaemia. Nature 593, 597–601 (2021).
Crews, L. A. et al. Reversal of malignant ADAR1 splice isoform switching with Rebecsinib. Cell Stem Cell 30, 250–263.e256 (2023).
Helm, M. & Motorin, Y. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat. Rev. Genet. 18, 275–291 (2017).
Khoddami, V. et al. Transcriptome-wide profiling of multiple RNA modifications simultaneously at single-base resolution. Proc. Natl Acad. Sci. USA 116, 6784–6789 (2019).
Motorin, Y., Lyko, F. & Helm, M. 5-methylcytosine in RNA: detection, enzymatic formation and biological functions. Nucleic Acids Res. 38, 1415–1430 (2010).
Linder, B. et al. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat. Methods 12, 767–772 (2015).
Tan, K. T., Ding, L. W., Wu, C. S., Tenen, D. G. & Yang, H. Repurposing RNA sequencing for discovery of RNA modifications in clinical cohorts. Sci. Adv. 7, eabd2605 (2021).
Blanco, S. & Frye, M. Role of RNA methyltransferases in tissue renewal and pathology. Curr. Opin. Cell Biol. 31, 1–7 (2014).
Karijolich, J., Yi, C. & Yu, Y. T. Transcriptome-wide dynamics of RNA pseudouridylation. Nat. Rev. Mol. Cell Biol. 16, 581–585 (2015).
Schwartz, S. et al. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA. Cell 159, 148–162 (2014).
Hauenschild, R. et al. The reverse transcription signature of N-1-methyladenosine in RNA-Seq is sequence dependent. Nucleic Acids Res. 43, 9950–9964 (2015).
Dominissini, D. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 485, 201–206 (2012).
Grozhik, A. V. & Jaffrey, S. R. Distinguishing RNA modifications from noise in epitranscriptome maps. Nat. Chem. Biol. 14, 215–225 (2018).
Grozhik, A. V. & Jaffrey, S. R. Epitranscriptomics: shrinking maps of RNA modifications. Nature 551, 174–176 (2017).
Huang, T., Chen, W. Y., Liu, J. H., Gu, N. N. & Zhang, R. Genome-wide identification of mRNA 5-methylcytosine in mammals. Nat. Struct. Mol. Biol. 26, 380–388 (2019).
Selmi, T. et al. Sequence- and structure-specific cytosine-5 mRNA methylation by NSUN6. Nucleic Acids Res. 49, 1006–1022 (2021).
Liu, J. et al. Sequence- and structure-selective mRNA m(5)C methylation by NSUN6 in animals. Natl Sci. Rev. 8, nwaa273 (2021).
Safra, M. et al. The m1A landscape on cytosolic and mitochondrial mRNA at single-base resolution. Nature 551, 251–255 (2017).
Safra, M., Nir, R., Farouq, D., Slutzkin, I. V. & Schwartz, S. TRUB1 is the predominant pseudouridine synthase acting on mammalian mRNA via a predictable and conserved code. Genome Res. 27, 393–406 (2017).
Hashim, F. A., Mabrouk, M. S. & Al-Atabany, W. Review of different sequence motif finding algorithms. Avicenna J. Med. Biotechnol. 11, 130–148 (2019).
Mansi, L. et al. REDIportal: millions of novel A-to-I RNA editing events from thousands of RNAseq experiments. Nucleic Acids Res. 49, D1012–D1019 (2021).
Pratanwanich, P. N. et al. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. Nat. Biotechnol. 39, 1394–1402 (2021).
Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).
Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME Suite. Nucleic Acids Res. 43, W39–W49 (2015).
Yang, Y. et al. Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data. Cell Rep. 36, 109442 (2021).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
Maaten, L. V. D. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Liu, J. et al. Developmental mRNA m5C landscape and regulatory innovations of massive m5C modification of maternal mRNAs in animals. Nat. Commun. 13, 2484 (2022).
Poličar, P. G., Stražar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. bioRxiv https://doi.org/10.1101/731877 (2019).
Tenenbaum, J. B., de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
Narayan, A., Berger, B. & Cho, H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nat. Biotechnol. 39, 765–774 (2021).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
McInnes, L., Healy, J. & Astels, S. hdbscan: hierarchical density based clustering. J. Open Source Softw. 2, 205–206 (2017).
Hahsler, M., Piekenbrock, M. & Doran, D. dbscan: fast density-based clustering with R. J. Stat. Softw. 91, 1–30 (2019).
Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. in Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 49–60 (Association for Computing Machinery, Philadelphia, Pennsylvania, USA; 1999).
Ng, A. Y., Jordan, M. I. & Weiss, Y. in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 849–856 (MIT Press, Vancouver, British Columbia, Canada; 2001).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).
Liu, H. et al. Accurate detection of m(6)A RNA modifications in native RNA sequences. Nat. Commun. 10, 4079 (2019).
Qin, H. et al. DENA: training an authentic neural network model using Nanopore sequencing data of Arabidopsis transcripts for detection and quantification of N6-methyladenosine on RNA. Genome Biol. 23, 25 (2022).
Begik, O. et al. Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing. Nat. Biotechnol. 39, 1278–1291 (2021).
Boulias, K. et al. Identification of the m(6)Am Methyltransferase PCIF1 Reveals the Location and Functions of m(6)Am in the Transcriptome. Mol. Cell 75, 631–643.e638 (2019).
Sun, H., Zhang, M., Li, K., Bai, D. & Yi, C. Cap-specific, terminal N6-methylation by a mammalian m6Am methyltransferase. Cell Res. 29, 80–82 (2019).
Koh, C. W. Q., Goh, Y. T. & Goh, W. S. S. Atlas of quantitative single-base-resolution N6-methyl-adenine methylomes. Nat. Commun. 10, 5636 (2019).
Batista, P. J. et al. m6A RNA modification controls cell fate transition in mammalian embryonic stem cells. Cell Stem Cell 15, 707–719 (2014).
Koh, C. W. Q., Goh, Y. T. & Goh, W. S. S. Atlas of quantitative single-base-resolution N(6)-methyl-adenine methylomes. Nat. Commun. 10, 5636 (2019).
Sharma, S., Yang, J., Watzinger, P., Kotter, P. & Entian, K. D. Yeast Nop2 and Rcm1 methylate C2870 and C2278 of the 25S rRNA, respectively. Nucleic Acids Res. 41, 9062–9076 (2013).
Janin, M. et al. Epigenetic loss of RNA-methyltransferase NSUN5 in glioma targets ribosomes to drive a stress adaptive translational program. Acta Neuropathol. 138, 1053–1074 (2019).
Ryvkin, P. et al. HAMR: high-throughput annotation of modified ribonucleotides. RNA 19, 1684–1692 (2013).
Shu, X. et al. A metabolic labeling method detects m(6)A transcriptome-wide at single base resolution. Nat. Chem. Biol. 16, 887–895 (2020).
Pandey, R. R. & Pillai, R. S. Counting the Cuts: MAZTER-Seq Quantifies m(6)A Levels Using a Methylation-Sensitive Ribonuclease. Cell 178, 515–517 (2019).
Zhang, Z. et al. Single-base mapping of m(6)A by an antibody-independent method. Sci. Adv. 5, eaax0250 (2019).
Meyer, K. D. DART-seq: an antibody-free method for global m(6)A detection. Nat. Methods 16, 1275–1280 (2019).
Song, Y. et al. Sense-antisense miRNA pairs constitute an elaborate reciprocal regulatory circuit. Genome Res. 30, 661–672 (2020).
Sainburg, T. A., Leland, M., & Timothy Q. G. Parametric UMAP embeddings for representation and semi-supervised learning. arXiv https://doi.org/10.48550/arXiv.2009.12981 (2020).
Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A. & Sander, J. in Proceedings of the 2014 SIAM International Conference on Data Mining. p. 839–847 (2014).
Grabski, I. N., Street, K. & Irizarry, R. A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat. Methods 20, 1196–1202 (2023).
Claeys, M., Storms, V., Sun, H., Michoel, T. & Marchal, K. MotifSuite: workflow for probabilistic motif detection and assessment. Bioinformatics 28, 1931–1932 (2012).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
Cannone, J. J. et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinform. 3, 2 (2002).
Acknowledgements
This study was supported by grants from National Key R&D Program of China (2020YFA0509400 to R.Z.), Guangdong Innovative and Entrepreneurial Research Team Program (2016ZT06S638 to R.Z.), the National Natural Science Foundation of China (82173050 to T.H.), Guangdong Natural Science Foundation (2021A1515010667 to T.H.), and SUMC Scientific Research Initiation Grant (510858061 to T.H.).
Author information
Authors and Affiliations
Contributions
J.L. and R.Z. contributed to the study design. J.L. and J.Y. performed bioinformatics analysis. T.H., T.Z., and Y.Z. performed all experiments. J.L. and R.Z. wrote the manuscript with input from all authors.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Weixu Wang and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, J., Huang, T., Yao, J. et al. Epitranscriptomic subtyping, visualization, and denoising by global motif visualization. Nat Commun 14, 5944 (2023). https://doi.org/10.1038/s41467-023-41653-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-023-41653-4
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.