isomiRTar: a comprehensive portal of pan-cancer 5′-isomiR targeting

Inaccurate cleavage of pri- and pre-miRNA hairpins by Drosha and Dicer results in the generation of miRNA isoforms known as isomiRs. isomiRs with 5′-end variations (5′-isomiRs) create a new dimension in miRNA research since they have different seed regions and distinct targetomes. We developed isomiRTar (https://isomirtar.hse.ru)—a comprehensive portal that allows one to analyze expression profiles and targeting activity of 5′-isomiRs in cancer. Using the Cancer Genome Atlas sequencing data, we compiled the list of 1022 5′-isomiRs expressed in 9282 tumor samples across 31 cancer types. Sequences of these isomiRs were used to predict target genes with miRDB and TargetScan. The putative interactions were then subjected to the co-expression analysis in each cancer type to identify isomiR-target pairs supported by significant negative correlations. Downstream analysis of the data deposited in isomiRTar revealed both cancer-specific and cancer-conserved 5′-isomiR expression landscapes. Pairs of isomiRs differing in one nucleotide shift from 5′-end had poorly overlapping targetomes with the median Jaccard index of 0.06. The analysis of colorectal cancer 5′-isomiR-mediated regulatory networks revealed promising candidate tumor suppressor isomiRs: hsa-miR-203a-3p—+1, hsa-miR-192-5p—+1 and hsa-miR-148a-3p—0. In summary, we believe that isomiRTar will help researchers find novel mechanisms of isomiR-mediated gene silencing in different types of cancer.


INTRODUCTION
MiRNAs are short non-coding molecules that post-transcriptionally regulate gene expression (Hobert, 2008). A mature miRNA associated with a member of the Argonaute proteins family directly binds to a target mRNA, which results in degradation of the mRNA or translation repression (Nilsen, 2007). It is known that most of such interactions require base pairing of a miRNA seed region (six nucleotides located at positions 2-7 from the miRNA 5 -end) with a target mRNA (Bartel, 2009;Agarwal et al., 2015). Aberrant miRNA expression and targeting were shown to play an essential role in various pathologies, including different types of cancer (Garzon, Calin & Croce, 2009;Visone & Croce, 2009).
Previously analyzed miRNA-seq reads revealed the existence of nucleotide sequence variability at 5 -and 3 -ends of mature miRNAs; these variants of miRNAs were termed isomiRs (Morin et al., 2008). The main pathway of isomiR biogenesis is related to the miRNA hairpin cleavage by Drosha and Dicer enzymes: because of imprecise cleavage, mature miRNAs could differ from each other by the presence or absence of several nucleotides at their 5 -or 3 -termini (Neilsen, Goodall & Bracken, 2012). We recently showed that almost all miRNAs are heterogeneous at their 3 -ends, while inaccurate Dicer cleavage was observed at 42% of miRNA 5 -ends, and inaccurate Drosha 5 -end cleavage was the rarest case (14% of miRNAs) (Zhiyanov, Nersisyan & Tonevitsky, 2021).
Since a seed region of a miRNA is located at positions 2-7, even a single nucleotide length variation at the 5 -end could alter the seed sequence, hence altering the targetome of the miRNA. While this is easy to confirm with bioinformatics sequence-based target prediction tools, only a few studies supported this phenomenon by an experiment. Namely, 5 -isomiR overexpression followed by the luciferase reporter assay analysis validated differential targeting for different 5 -isomiRs of miR-101 (Llorens et al., 2013), miR-9 (Tan et al., 2014), miR-183 (Telonis et al., 2015), miR-34/449 (Mercey et al., 2017) and miR-411 (van der Kwast et al., 2020). However, the role of isomiRs in cancer remains to be uncovered.
To the best of our knowledge, no publicly available databases summarize the results of computational isomiR target prediction. The situation is additionally complicated because most of the existing bioinformatics tools are not suited for large-scale custom target prediction required for non-canonical isomiRs. Given that, we developed isomiRTar (https://isomirtar.hse.ru)-the web portal which allows one to explore 5 -isomiR targeting in 31 cancer types from the Cancer Genome Atlas (TCGA). First, we predicted targets of all 5 -isomiRs by two widely used bioinformatics tools: miRDB (Chen & Wang, 2020) and TargetScan (Agarwal et al., 2015). Then, the interactions between 5 -isomiRs and putative targets were subjected to co-expression analysis in each cancer type. This resulted in the interactions supported by a significant negative correlation in cancer samples. Aside from the web portal implementation, we also analyzed general properties of 5 -isomiR targeting (such as the impact of one nucleotide variation on a targetome) and explored the colorectal cancer 5 -isomiR-gene regulatory network in depth.

TCGA data collection and processing
RNA-seq and miRNA-seq read count tables for primary solid tumor samples were downloaded from the GDC portal (https://portal.gdc.cancer.gov/). Thirty-one TCGA projects were selected for the downstream analysis (Table S1). First, library size normalization of RNA-seq data was independently conducted for each cancer type using edgeR TMM algorithm implementation (Robinson, McCarthy & Smyth, 2010), and the default low-expressed gene removal procedure was applied. After normalizing for transcript length, TMM-normalized fragments per kilobase of transcript per million mapped reads (TMM-FPKM) tables were generated. Finally, TMM-FPKM tables were ommllog2-transformed.
Processing of miRNA-seq data was started from the 5 -isomiR annotation from *.mirbase21.isoforms.quantification.txt files. For that, pri-miRNA genomic coordinates and local coordinates of canonical mature miRNA sequences were extracted from miRBase v21 (Kozomara, Birgaoanu & Griffiths-Jones, 2019). The standard 5 -isomiR nomenclature was utilized: a number after the ''|'' character denotes a shift from the canonical 5 -end in the 5 -3 direction. For example, hsa-miR-192-5p|+1 differs from the canonical hsa-miR-192-5p miRNA by the absence of the first nucleotide on its 5 -end. As in the case of mRNA expression data, the edgeR package was used to filter out low expressed isomiRs and generate ommllog2-transformed TMM-normalized reads per million mapped reads (TMM-RPM) tables. It is well known that the overwhelming majority of miRNA sequencing reads correspond to the dozens of the most highly expressed isomiRs (Shao et al., 2010;Nersisyan et al., 2020a). Given that we marked a minimal set of 5 -isomiRs accounting for 95% of sequencing reads in a TCGA project as highly expressed (Table S1). A typical read count distribution is presented in Fig. S1 for the TCGA-COAD project.
The custom prediction mode of the miRDB web portal (http://mirdb.org) was programmatically called for each isomiR sequence using the Requests module of Python. The default 80 threshold value was set for miRDB target scores to select reliable interactions. Due to the technical limitations in miRDB, target prediction was performed only for isomiRs with a sequence length between 17 and 29 nucleotides.
Publicly available TargetScan Perl scripts and UTR data were obtained from the official web portal (https://www.targetscan.org/vert_72/). Dr. George Bell kindly provided Perl scripts for calculating cumulative weighted context++ scores (CWCS). TargetScan predictions almost always covered miRDB predictions (see Results). Because of that, and since there is no conventional CWCS cut-off value, we equalized the numbers of miRDB/TargetScan predicted targets for each 5 -isomiR. Specifically, for each 5 -isomiR we sorted the predictions according to the CWCS (lowest to highest) and then selected top n entries of the list, where n is the number of targets of this isomiR predicted by miRDB. Finally, filtered miRDB and TargetScan predictions were merged.
For a pair of 5 -isomiR and predicted target, we calculated Spearman's correlation coefficient r between corresponding expression levels in each cancer type. The following thresholds were used to define the significance of interaction: r < −0.3, adjusted p-value < 0.05. The same standard criteria for considering miRNA-target correlation significant was previously used in many works (Hu et al., 2016;Paquette et al., 2019;Nersisyan et al., 2020b). Multiple testing correction was performed differently depending on the analysis purpose: 1. Analysis of a cancer-specific isomiR-target regulatory network. In this case, the Benjamini-Hochberg procedure was applied to the list of p-values corresponding to all putative isomiR-target interactions (referred to as FDR global in the isomiRTar interface).
2. Analysis of specific 5 -isomiR targets or specific target gene regulators in a fixed cancer type (out of other interactions context). In this case, the Benjamini-Hochberg procedure was applied only to the narrow list of corresponding p-values (referred to as FDR local in the isomiRTar interface).

Functional annotation analysis
Functional annotation analysis of gene sets was performed with DAVID 2021 (Sherman et al., 2022). The default 0.05 cut-off value for FDR was used to select significantly enriched KEGG pathways.

Network and statistical analyses, web portal implementation
Networks were constructed and analyzed using the NetworkX v2.8 Python module (Hagberg, Schult & Swart, 2008). Maximal cliques of isomiR-isomiR co-expression networks were found with the find_cliques function. SciPy v1.5 Python module (Virtanen et al., 2020) was used for the statistical analysis. Implementation of the isomiRTar website was based on Flask v2.0 Python framework and PostgreSQL database. All isomiRTar source codes have been made available on GitHub (https://github.com/s-a-nersisyan/isomiRTar).

Large-scale sequence-based prediction of 5 -isomiR target genes
We applied two state-of-the-art tools (miRDB and TargetScan) to predict target genes for all considered 5 -isomiRs. The tools are based on different ideas: the miRDB machine learning algorithm was trained on RNA-seq data of HeLa cells transfected with 25 miRNA mimics to predict miRNA targets based on binding site sequence and structure (Liu & Wang, 2019). TargetScan algorithm also scores miRNA bindings sites according to their cross-species conservation (Agarwal et al., 2015). TargetScan systematically yielded a significantly higher number of isomiR targets than miRDB, and the majority of miRDB targets were included in TargetScan predictions: median 95.5% with 95% confidence interval (CI) 85.7-100.0%.
To harmonize these data, we equalized the numbers of predicted targets for each 5 -isomiR by selecting the respective number of the best-ranked targets according to TargetScan (see Materials and Methods for details). After such a harmonization, the predictions of miRDB and TargetScan had a median of 26.5% common targets, 95% CI [0.0-48.9]% ( Fig. 2A).

Pan-cancer correlation analysis of 5 -isomiRs and their target genes
The conducted sequence-based target prediction allowed us to compare the putative non-cancer-specific targetomes of 5 -isomiRs. To determine which predicted targets could be downregulated by the corresponding 5 -isomiRs on transcriptome level, we selected isomiR-gene interactions with a significant negative correlation of corresponding expression levels in a set of TCGA samples (the analysis was separately performed for each cancer type). As a result, we could construct the 5 -isomiR-gene interaction networks for each cancer from TCGA. For a given 5 -isomiR and cancer type, we denote the number of significantly anti-correlated 5 -isomiR targets (i.e., the out-degree) as 5 -isomiR targeting activity (ITA).
ITA values substantially varied between cancer types, but generally, the median ITA did not exceed 10 (Fig. 3A), which was an order of magnitude lower than the median number of sequence-based predicted targets. Nevertheless, there were several isomiRs with broad functional activity (ITA >100) in almost all cancers. The number of sequence-based predicted targets was the primary determinant of ITA: the corresponding Spearman's correlation coefficients were all positive and significantly different from zero for all cancers (see the vertical axis labels in Fig. 3A). Interestingly, these coefficients strongly varied between cancer types from 0.29 (TCGA-CHOL) to 0.79 (TCGA-TGCT) and were significantly correlated with the median ITA values (Spearman's correlation = 0.8, p = 4.71 × 10 −8 ).
Expression levels of 5 -isomiRs also positively and statistically significantly correlated with ITA for all cancer types (p < 10 −3 ), though correlation coefficients were not high (between 0.2 and 0.3, Fig. 3B). We hypothesized that some low-expressed 5 -isomiRs had high ITA because of a co-expression with other active molecules. Given this, we first identified 5 -isomiR co-expression modules in each cancer type. Modules were defined as maximal cliques in corresponding co-expression graphs, where two isomiRs were connected with an undirected edge if absolute Spearman's correlation between their expression levels exceeded 0.5. Then, a representative 5 -isomiR with the highest median expression level was selected from each module. Limiting the analysis to these representative isomiRs systematically increased the correlation values between ITA and average expression in all except two cancer types (Fig. 3B). Notably, there was a complete absence of correlation between ITA and expression in the set of highly expressed 5 -isomiRs in each cancer, indicating that high expression is a necessary but insufficient condition for the downregulation of multiple target genes. The same holds for 18 miRNAs with high expression in all 31 cancer types: ITA values varied from less than ten anti-correlated targets (hsa-miR-100-5p, hsa-miR-99b-5p, hsa-miR-10b-5p) to dozens and hundreds (Fig. S2). Indeed, low values of ITA do not imply the absence of isomiR activity, e.g., in the case of very target-specific regulation.

Overview of isomiRTar portal
To make the 5 -isomiR pan-cancer expression profiles and the results of targeting analysis easily accessible, we developed the isomiRTar web portal (https://isomirtar.hse.ru). The portal's main page contains the forms for the five query modes: by cancer type, by 5 -isomiR, by gene, by 5 -isomiR, gene, and cancer triple. The main feature of the ''query by cancer'' page is an interaction network between 5 -isomiRs and their target genes in a specified cancer type. As the network is highly interactive, a user can retrieve information about a 5 -isomiR, target gene, or their interaction by simply hovering over corresponding nodes or edges (Fig. 4A). Aside from the graphical representation, the page contains summary statistics for 5 -isomiRs in a selected cancer: median expression, number of predicted targets, number of predicted targets supported by the significant negative correlation. The second query type allows one to explore a pan-cancer expression profile for all 5 -isomiRs of a specified miRNA (Fig. 4B). The analysis of targets for a given 5 -isomiR or 5 -isomiR regulators of a given mRNA can be performed in two modes: cancer-specific or pan-cancer. In the first case, the portal page contains a table with the results of correlation analysis between a selected 5 -isomiR and its predicted targets or predicted 5 -isomiR regulators and a selected target gene (Fig. 4C). By clicking to interaction, a user can see the scatterplot of corresponding expression values (Fig. 4D).
In the second case (pan-cancer mode), isomiR-target interactions are sorted according to the number of cancers, where we observed the significant anti-correlation between corresponding expression levels. Thus, ''universal'' (not cancer-specific) isomiR-gene interactions can be inferred from isomiRTar. To analyze the landscape of such ''universal'' interactions, we selected abundantly expressed 5 -isomiRs (highly expressed in 10 or more cancer types) and incident interactions, which were supported by significant negative correlations in at least half of cancers where the corresponding 5 -isomiR was highly expressed. The resulting network is illustrated in Fig. 4E. Aside from crucial and extensively studied interactions such as the downregulation of ZEB1 and ZEB2 by miR-200 (Hill, Browne & Tulchinsky, 2013;Nersisyan et al., 2021b), the network had several regulatory interactions mediated by non-canonical 5 -isomiRs: hsa-miR-29a-3p|-1, hsa-miR-101-3p|-1 and hsa-miR-192-5p|+1. The regulation of centrosomal protein 55 (CEP55) by hsa-miR-101-3p|-1 was the most pan-cancer conserved: a significant negative correlation was observed in 15 cancer types. Since CEP55 is a well-studied oncogene (Jeffery et al., 2016), this 5 -isomiR of hsa-miR-101-3p acted as a potential tumor suppressor, which agrees with the tumor suppressor role of the canonical hsa-miR-101-3p form (Wang et al., 2018).

DISCUSSION
We developed isomiRTar-the user-friendly web portal containing information about the pan-cancer 5 -isomiRs activity. With the use of isomiRTar, researchers can promptly retrieve sequence-based target/regulator predictions and analyze co-expression profiles of putative isomiR-gene interactions in 31 types of cancers. Furthermore, the implemented pan-cancer integration of targeting information allows one to reveal ''cancer-conserved'' To the best of our knowledge, two alternative isomiR web portals exist. Tumor IsomiR Encyclopedia (TIE) contains the results of pan-cancer TCGA analysis of 5 -and 3 -isomiR expression data in a web portal with multiple query and visualization modes (Bofill-De Ros et al., 2021). However, TIE lacks the possibility of isomiR target prediction and analysis. Another database, IsomiR Bank (Zhang et al., 2016), summarizes 308919 isomiRs detected from 2,727 samples with miRNA-seq, including expression data and target predictions with the miRanda tool (Betel et al., 2008). Nevertheless, the predictions available in IsomiR Bank are sequence-based and do not account for tissue or cancer specificity. Thus, we believe the developed isomiRTar portal will bring many new possibilities for the isomiR community.
The data generated by TCGA isomiR expression analysis and miRDB/TargetScan target predictions allowed us to infer several features of 5 -isomiR expression and targeting. Out of 1,022 unique 5 -isomiRs detected in the union of analyzed samples, 168 were marked as highly expressed in at least one cancer. More than half (62%) of these isomiRs had cancer-specific expression patterns, while the rest were abundantly expressed in ten or more cancer types. The high percentage of cancer-specific 5 -isomiRs agrees well with the recent study by Telonis et al. They showed that different TCGA cancers could be accurately discriminated based on the information about the ''high'' or ''low'' expression of each isomiR in a sample (Telonis et al., 2017).
Our work's median number of sequence-based 5 -isomiR targets was 209, 95% CI [6-1082] (Fig. 2B). The previous studies reported that a typical miRNA has from 200 to 606 targets on average, depending on the target prediction strategy (Lewis, Burge & Bartel, 2005;Krek et al., 2005;Chen & Wang, 2020), which is comparable with our estimates for 5 -isomiRs. After applying cancer-specific correlation analysis the median number of genes regulated by a 5 -isomiR decreased to dozens, though this quantity termed ITA varied dramatically between cancer types (Fig. 3A). Together with the weak (though positive and statistically significant) correlation between isomiR expression and ITA, these observations justified the use of pan-cancer correlation analysis and showed the insufficiency of non-specific sequence-based target prediction.
The large-scale prediction of 5 -isomiR targets allowed us to assess the impact of 1 nucleotide variation at a 5 -end of an isomiR on its targetome. The median Jaccard index for the targetomes of the adjacent 5 -isomiRs was equal to 0.06, 95% CI [0.00-0.31] (Fig.  2C). Although we did not find a similar large-scale analysis in the literature, the authors of small-scale studies also reported poor overlaps between 5 -isomiRs with one nucleotide difference: JI = 0.16 for miR-411 (van der Kwast et al., 2020), JI = 0.19 for miR-9, JI = 0.28 for miR-302a (Tan et al., 2014). Note that differences between predicted targetome overlaps could be due to target prediction tools and their internal thresholds.
Our study has several limitations. First, the used target prediction tools consider miRNA binding sites only in 3 -UTR regions of putative mRNAs. However, strong experimental evidence of miRNAs binding within coding sequences and 5 -UTRs exists (Forman & Coller, 2010;Fang & Rajewsky, 2011;Clark et al., 2014). Therefore, incorporating alternative target prediction tools that cover mRNA regions beyond 3 -UTR, e.g., RNA22 (Miranda et al., 2006), is an essential direction for future isomiRTar development. Next, we note that some of the analyzed samples are heterogeneous (e.g., breast tumors), and it is not possible yet to analyze different cancer subtypes separately. Inclusion of major subtypes for some cancers is also a priority feature for new versions of isomiRTar. Finally, we covered only 5 -isomiRs in our analysis since 3 -end variability does not affect the seed region of miRNAs. Yu and co-authors showed that miR-222 isomiRs differing by four nucleotides at the 3 -end play a distinct role in breast cancer (Yu et al., 2017). Though we acknowledged the existence of such studies, we decided to give preference to the conciseness of isomiRTar and did not consider 3 -end variations.

CONCLUSIONS
In the present work, we introduce a new interactive database, isomiRTar (https: //isomirtar.hse.ru), allowing researchers to analyze the interplay between 5 -isomiRs and their target genes in human cancer. The database contains the results of comprehensive bioinformatics analysis, including sequence-based 5 -isomiR target prediction and pancancer TCGA analysis.