Reassessment of miRNA variant (isomiRs) composition by small RNA sequencing

Summary IsomiRs, sequence variants of mature microRNAs, are usually detected and quantified using high-throughput sequencing. Many examples of their biological relevance have been reported, but sequencing artifacts identified as artificial variants might bias biological inference and therefore need to be ideally avoided. We conducted a comprehensive evaluation of 10 different small RNA sequencing protocols, exploring both a theoretically isomiR-free pool of synthetic miRNAs and HEK293T cells. We calculated that, with the exception of two protocols, less than 5% of miRNA reads can be attributed to library preparation artifacts. Randomized-end adapter protocols showed superior accuracy, with 40% of true biological isomiRs. Nevertheless, we demonstrate concordance across protocols for selected miRNAs in non-templated uridyl additions. Notably, NTA-U calling and isomiR target prediction can be inaccurate when using protocols with poor single-nucleotide resolution. Our results highlight the relevance of protocol choice for biological isomiRs detection and annotation, which has key potential implications for biomedical applications.


In brief
Gó mez-Martín et al. demonstrate that isomiRs are non-artifactual variants of miRNAs and that most small RNA sequencing protocols, especially those with randomized adapters, detect them with high accuracy. Single-nucleotide accuracy is crucial for making predictions about biological function. The work provides a reference for researchers interested in small ncRNA.

INTRODUCTION
MicroRNAs (miRNAs) are small non-coding RNA transcripts widely studied for their role in gene expression regulation, which they exert by imperfectly pairing to a target messenger RNA (mRNA). 1 Expression profiles of the 2,600 reported human miR-NAs over different tissues and pathophysiological conditions are a first step to understand the underlying biology and are useful for diagnostics 2 and therapeutics. 3 To generate comprehensive miRNA expression profiles, specific next-generation sequencing (NGS) protocols were developed. Initially, protocols were based on fixed (invariant) sequence adapters that introduced bias due to preferential ligation affinity to certain sequences, resulting in a high variability of less informative ncRNAs (Y-RNA, tRNA fragments). 4,5 To overcome this, recent sequencing protocols make use of random nucleotides at one or both sequencing adapters, thereby reducing detection bias of canonical miRNAs. [5][6][7][8] However, accurate detection of post-transcriptional modifications remains challenging, in particular for low-input applications such as extracellular miRNA profiling used for minimally invasive diagnostics.
Early NGS studies revealed that a significant proportion of miRNA reads in biological samples deviate in length and sequence from the 20-21 nucleotide canonical miRNAs. 9 MOTIVATION Sequence variants (isomiRs) of mature microRNAs can be detected by high-throughput sequencing methods. However, because isomiRs differ from the canonical (mature) miRNA sequence often by only a single nucleotide, most protocols introduce errors that may negatively influence biological interpretation. To investigate how sequencing protocols impact isomiR calling, we conducted a comprehensive comparison of 10 small RNA-seq protocols, providing a comprehensive reference guide for miRNA/isomiR interpretation by sequencing.
Among these variants, termed ''isomiRs,'' 10 one type is generated by post-transcriptional non-templated nucleotide additions (NTAs) by terminal nucleotidyl transferases. 11 These enzymes add up to 3 nucleotides to the 3 0 end, leading to slightly elongated mature miRNAs that will thus deviate from the genomic sequence. Functional implications of such post-transcriptional modifications are altered miRNA stability 12 and/or different mRNA targets compared with the canonical sequence. 13 A recent study describes that dysfunctional isomiRs accumulate in cancer cells, and that enzymatically restoring their canonical function may have therapeutic value. 14 Indeed, modified miR-NAs act as independent functional molecules that may support or counteract canonical miRNAs with diagnostic and prognostic implications. For instance, the targetome of one miR-411 isomiR that is upregulated in chronically ischemic human blood vessels has a very small overlap with the canonical miR-411 targetome. 15 Due to their functional relevance, accurate discrimination between true isomiRs and sequencing or alignment artifacts is crucial to avoid misinterpretation of their biological function and diagnostic significance. 16 Therefore we analyzed and compared the presence of isomiRs in a large body of sequencing data generated by independent laboratories using 10 different protocols to evaluate protocol differences in detecting true biological isomiRs. To assess comparable samples, we selected sequencing datasets from the miRXplore Universal Reference, a pool consisting of 963 different chemically synthesized miR-NAs (thus theoretically free of isomiRs) and libraries obtained from the HEK293T cell line. To avoid any bioinformatics processing bias, all sample data were uniformly processed using sRNAbench, 17 a broadly used miRNA analysis software that can also accommodate virtually any sequencing protocol. We also included samples from ''IsoSeek,'' our thoroughly optimized 7 in-house 5N protocol that aims to account for all technical bias by using 5 0 and 3 0 randomized adapters, each of them with 5 random nucleotide unique molecular identifiers (UMIs) (5N) based on commercial adapters. 18 Our results show that regardless of the adapter strategy, the majority of protocols generate only minor library preparation artifacts and detect biological variants with relevant overlap, mostly 3 0 end NTAs. Moreover, 4N and 5N protocols appear to be highly suitable for isomiR profiling of biological samples, whereas some protocols suffer from biased quantification and/or generation of artificial isomiRs.

RESULTS
Analysis of the 963-miRNA miRXplore reference pool reveals that the majority of small RNA-seq protocols generate low levels of false isomiRs A recent study cautioned that small RNA-seq protocols generate a large proportion of non-biological miRNA variants that should be considered as artifacts. 8 We first scrutinized this claim by analyzing the canonical miRNA and isomiR profiles from 10 different protocols generated furthermore by different laboratories. Specifically, we analyzed data from a synthetic 963-miRNA reference pool (miRXplore Universal, Miltenyi Biotec) and HEK293T cellular RNA using commercial and non-commercial randomized-end adapter protocols (AQ-seq, 6 4N-G and 4N-X, 5,6 NEXTflex, and our own 5N ''IsoSeek'' method 7 ), one custom 2N adapter protocol (AQRNA-seq 8 ), and three commercial invariant (fixed) adapter protocols (NEBNext, 13,19 TruSeq, 20 QIAseq, 21,22 and Clean-Tag 22 ). The main characteristics of the different protocols are summarized in Table S1.
Mapping statistics of the 963-miRNA pool revealed that 7 out of the 9 protocols showed 75%-80% recovery of exact matches, i.e., canonical sequences ( Figure 1A). AQRNA-seq and 4N-G protocols have a consistently lower percentage of canonical sequences but, in contrast, a higher percentage of 5 0 end length variants (lv5p, see isomiR classification used in Figure S1) than the other protocols. NTA isomiRs (see Figure S1) should theoretically be absent in the reference pool as these samples were never in contact with nucleotidyl transferases. Indeed, this subclass is almost not detected by any protocol (1% for each NTA class), with the exception of NEBNext, which appears to generate the most artifacts (see also Table 1A). However, more prominent were internal single mismatches (NucVar) that were present in 10%-12% of the miRNA-mapped reads in all protocols, except for 2 samples from the 4N-G protocol with only 4% of NucVar. This could be explained by a massive number of extensively trimmed 5 0 length variants that drastically reduce the percentage of canonical sequences ( Figure 1A and Table 1A). In fact, all other protocols have only 2% length variants ( Figure 1A and Table 1A), often only lacking a single nucleotide at 3 0 or rarely at the 5 0 end (Table 1A).
In contrast to the reference pool, the analysis of the isomiR distribution of the 23 HEK293T libraries revealed a different pattern. All protocols showed that 40%-50% of the miRNA-mapped reads represented the exact mature miRNA sequence (canonical) as annotated in miRBase ( Figure 1B). The most prominent isomiR class is the 3 0 end length variants (lv3p), which are presumably generated by ''sloppy'' Drosha or Dicer cleavage or the action of exonucleases. 1 Finally, approximately 10%-20% of the miRNA reads belong to NTA isomiRs, similarly to previously found in other studies. 11 It was unexpected that the percentage of NucVar isomiRs was 3-to 4-fold smaller in cells (4%) compared with the reference pool (12%). This could indicate that, rather than artifacts of library preparation, artificial NucVar isomiRs represent errors already introduced during oligosynthesis of the 963-miRNAs pool. 23 To investigate this possibility, we measured the presence of artificial NucVar isomiRs in a customized panel of 26 synthetic isomiRs, and we then compared the resulting percentage to the 963-miRNA pool results ( Figure S2A). Interestingly, we found a significantly lower percentage of NucVar in our customized spike-in set. Taking into account that our spike-in set was generated with the highest quality control standards, this suggests the presence of oligosynthesis errors in the commercial 963-miRNA pool.
Moreover, we directly compared the performance of IsoSeek with the fixed adapter NEBNext protocol using the same 26 synthetic isomiRs that were sequenced without any biological background ( Figure S2B). The distribution of the different isomiRs was considerably more equal in the IsoSeek samples, with 0.88 coefficient of variation (CoV), compared with NEBNext with 1.82 CoV, indicating a major reduction in bias.
Analysis of length variant isomiRs in the 963-miRNA set and HEK293T cells reveals 3 0 end length variants outnumber 5 0 end length variants We next performed an in-depth analysis of the four different length variant classes, i.e., extended or trimmed at the 5 0 or 3 0 end ( Figure S1). As can be observed in Table 1A, the majority of the protocols showed that samples from the 963-miRNA pool have a relatively low percentage of length variants (5%). This contrasts with the HEK293T cells (Table 1B) where the percentage of length variants is much (6-fold) higher (30%), suggesting a biological origin. In fact, in the case of the 963-miRNA pool samples, the most prominent length variants are at the 5 0 end, while the HEK293T cells had a high presence of lv3p isomiRs, which are consistent with ''sloppy'' Drosha or Dicer cleavage or the action of exonucleases. 12 Therefore, most 5 0 truncated miRNAs (lv5pT) and other 5 0 length variants are very likely due to library preparation artifacts.
A high proportion of the NucVar in HEK293T cell isomiRs point to biological origin We next looked in more detail into the different classes of singlenucleotide changes (NucVar, see Figure S1) in HEK293T cells. A high percentage of T>C and A>G single-nucleotide changes (in red) were consistently detected among the 4 protocols (NEBNext, QIAseq, AQ-seq, and IsoSeek) compared to all the other potential changes, which showed a much lower percentage ( Figure 2A).
We then compared the percentage of these two NucVar types, T>C ( Figure 2B) and A>G ( Figure 2C), in HEK293T cells to the percentage in the 963-miRNA pool samples sequenced with the same protocol. The percentage of both variants in the HEK293T cells was significantly higher in both cases (p value < 0.01), regardless of the sequencing protocol used. Interestingly, these two variants specifically may be caused by enzymatic activity. Specifically, the A>G conversion can be a result of the ADAR enzyme activity 24,25 and, in the case of T>C, a result of APOBEC3C action. 26 4 different small RNA protocols robustly detect NTA-U/A isomiRs in HEK293T cells that are absent in the artificial reference pool We observed that the percentage of NTA isomiRs in the 963-miRNA pool was consistently lower (1% per NTA class) than the percentage in HEK293T cells (20% total NTA), as  Table 2). In HEK293T cells, approximately 10%-30% of the total miRNA reads belonged to NTA isomiRs, similar to previous studies, 11 which is up to 20-fold higher than in the reference pool libraries. The majority are NTA-U and NTA-A isomiRs, while NTA-C and NTA-G are hardly detected using AQ-seq and IsoSeek ( Figure 1B and Table 2B). Moreover, in addition to a lower NTA percentage detected, a more equal and possibly random distribution of the percentage of the different NTA classes was observed (Table 2A). Overall, randomized adapter protocols reveal the more equal percentage of NTA distribution and the lowest percentage of artificial NTA in the 963-miRNA pool samples in agreement with prior observations. 11 Despite minor differences, we conclude that the majority of NTA-A and NTA-U classes that are detected in HEK293T cells represent true biological isomiRs. TUT4/7 knockout HEK293 cells reveal concordance and differences between protocols in the detection of uridylated isomiRs (NTA-U) To assign specificity of the protocols for biological isomiR detection, we studied the biological activity of nucleotidyl (uridyl) transferases TUT4 and TUT7 in HEK293T CRISPR-Casmediated double knockout (DKO) HEK293T cells. TUT4 and TUT7 enzymes synthetize the addition of uridine nucleotides at the 3 0 end of miRNAs. 11 To fully establish that small RNA sequencing data obtained by various protocols contain biologically generated isomiRs, and because uridylation may be cell type specific, 27 we analyzed 11 libraries from TUT4/7 DKO HEK293T cells 13 prepared with 4 different protocols in 6 different laboratories (see striped bars in Figure 1B). In agreement with knockdown studies using a NanoString-based detection in a different cell type, 11 all small RNA sequencing protocols showed reduced percentages of NTA-U in the TUT4/7 DKO cell lines compared with wild-type (WT) counterparts ( Figure 1B).
To establish the reproducibility of NTA-U isomiR detection by small RNA sequencing protocols, we assessed the detection of miRNAs with an uridylation ratio drop of at least 2-fold in TUT4/7 DKO HEK293T cells compared to the parental cells. We used data from 6 independent studies that employed 4 different protocols (NEBNext, QIAseq, AQ-seq, and IsoSeek). We identified 20 miRNAs (6% of all differentially uridylated miRNAs by at least 1 of the protocols) that were consistently detected as differentially uridylated due to the lack of TUT4/7 enzymes in all studies (Figure 3). We validated this overlap by comparison to a random selection from the total list of differentially uridylated miR-NAs (see more details in STAR Methods), yielding a highly significant Z score of 28.5. These analyses indicate that the observed differences in NTA-Us between WT and DKO cell lines with multiple protocols are not random but are driven by the depletion of TUT4/7 enzymes and thus biologically motivated.
Differences across protocols in detection of TUT4/7dependent miRNA uridylation In order to estimate potential biases in the detection of NTA iso-miRs, we performed a deep analysis into the NTA class distribution in parental and TUT4/7 DKO HEK293T cells. Using AQ-seq, IsoSeek, and QIAseq, we observed an overall reduction of NTA-U of 25%, 24%, and 34% respectively in TUT4/7 DKO cells as compared to parental cells ( Figure 4A). However, with the fixed adapter protocol NEBNext, this reduction was only 4% ( Figure 4A). Interestingly, using randomized adapter protocols, AQ-seq and IsoSeek, the global NTA-U reduction observed in DKO cells (24%) was complemented by a similar increase of the NTA-A fraction (23%). This phenomenon was recently described in cancer cells, 27 and this subtle but significant correlation was not seen in the fixed adapter protocols ( Figure 4A). On the other hand, the QIAseq protocol detected a significant decrease on uridylation but not a clear increase in NTA-A isomiRs. Instead, NTA-C isomiRs were increased. Taking into account that such isomiRs have not been described as enzymatically driven, we cannot rule out the possibility of technical artifacts presence.
We next investigated those miRNAs that displayed a clear difference in uridylation percentage in the DKO compared with the WT cells based on IsoSeek protocol. In most of the miRNAs where 3 0 end uridylation was reduced in the DKO, we observed an increase in 3 0 end adenylation ( Figure 4B). However, this was not the case for guadinylation and cytidylation ( Figures 4C  and 4D respectively). Taken together, these results suggest that a specific increase in adenylation of selected miRNAs in DKO cells occurs in the absence of urydilation. These A very low percentage (1%) of all NTAs is observed in the pool, with the highest value from the samples of the NEBNext protocol (3%). In contrast, a high percentage of NTA-U/A is observed in the HEK293T (15% with the exception of QIAseq protocol), while NTA-C/G (with no biological function underlined) percentages remain low. Only one fixed adapter protocol (NEBNext) showed a more significant higher percentage of NTA-C/G isomiRs. observations indicate that 3 0 end NTA modification is a nonrandom biological process caused by competing NTA (uridylase vs. adynelase) nuclease activities. We also evaluated the uridylation of 3p and 5p arm-derived miRNAs and observed that TUT4/ 7-dependent uridylation occurs predominantly on 3p miRNAs. This preference is especially pronounced when using randomized adapter protocols with reduced bias while being much less evident with the NEBNext fixed adapter protocol ( Figure 4E).

Biased detection of miRNA uridylation can affect target prediction
To investigate whether differences in isomiR detection between protocols influence target prediction strategies, 13 we compared the top 10 most uridylated miRNAs identified per protocol (Figure 5A). miRNA hsa-miR-30e-3p, which has a role in cancer, 28,29 is heavily mono-uridylated according to randomized adapter protocols, while with NEBNext, the exact mature sequence is more abundant ( Figure 5B). Importantly, only IsoSeek and QIAseq protocols show a clear decrease in uridylation in TUT4/7 DKO cells and an increase in adenylation ( Figure 5B). Having identified a biological isomiR, we performed target prediction for both the canonical hsa-miR-30e-3p and the 3 0 end mono-uridylated isoform. We observed many unique potential target genes of both the canonical miR-30e-3p and the mono-uridylated isomiR. This is in agreement with the recently described tail-U-mediated repression (TUMR) as described by Yang et al. 13 In fact, we found many more isomiR unique targets than overlapping targets ( Figure 5C). These observations may have critical implications for the biological function of these closely related isomiRs with a highly divergent targetome. Indeed, pathway enrichment analysis of the canonical targetome ( Figure 5D) and the TUMR targetome ( Figure 5E) showed that different pathways are enriched, leading to physiological changes. This analysis exemplifies the potential consequences of inaccurate isomiR detection derived from suboptimal choice of a sequencing protocol.

DISCUSSION
The advent of NGS has revealed that miRNAs, apart from the 21/22 nucleotide canonical (mature) genomic miRNA sequences, are comprised of many variants known as isomiRs. We show here that distinguishing true isomiRs from technical artifacts and sequencing errors is of high importance, as it may result in misinterpretation of their biological function (i.e., targetome), 13 and diagnostic 16 and potential therapeutic significance. 14 To demonstrate this, we thoroughly compared sequencing data from HEK293T cells and a 963-synthetic miRNA pool that should, theoretically, be free of any isomiRs. 95 samples were analyzed and sequenced with in total 10 different protocols including widely used fixed (or invariant) commercial adapters and newer protocols that make use of randomized adapters and other customized library preparation procedures.
We report several novel findings, some of special relevance when studying the biology of isomiRs. First, when benchmarking protocols using a standard reference set, it is of key importance that the quality of all oligos (reference sets, primers,  (Figures 1 and S2 and Tables 1 and 2). Second, some protocols may generate high numbers of library preparation artifacts (Figure 1 and Tables 1 and 2). Third, randomization of adapter sequences at the 3 0 and 5 0 ends strongly reduces ligation bias. Fourth, UMI-correction can reduce amplification bias but appears less essential when RNA input is not limiting. Fifth, to determine the biological relevance of isomiRs, analysis of control cells that lack relevant NTA-modifying enzymes is advised. Sixth, we determined higher percentages of T>C and A>G NucVar changes in cells when compared with the reference pool. This is in line with a biological role as these modifications are the only mutations known caused by RNA-editing enzymes. [24][25][26] Seventh, uridylation on 3p-arm miRNAs is much more frequent than 5p miRNAs ( Figure 4E), confirming the idea that uridylation generally occurs after Drosha cleavage but before Dicer processing. 30 Finally, only randomized adapter protocols detect NTA miRNA substrate competition (Figure 4), a recently discovered biological phenomenon in cancer cells. 27 What is the current experimental evidence for a biological role of isomiRs? In cartilage, a miR-140-3p isomiR is the dominant and active form, with a completely different seed and targetome. 31 In addition, there is evidence that shRNA-induced liver toxicity is solely related to competition with a major isoform of miR-122. 32 Moreover, some classes of isomiRs are preferentially sorted into communicating extracellular vesicles, 33,34 which could have important biological and diagnostic implications. Most strikingly, Qi et al. recently reported that expression of a plant nucleotidyl-transferase (RDR1) in human cancer cell lines blocks proliferation by targeting cell cycle genes and that these effects were dependent on the miRNA pathway. 14 It appeared that miRNA duplex isoforms with 1-nt-shorter 3 0 ends cannot be efficiently loaded onto the Argonaute complex and accumulate in human tumors. RDR1 expression restored the regular 2-nt overhang structure of miRNA duplexes, thereby rescuing the defective miRNA pathway in mouse xenograft models, which suppressed tumor growth. 14 In another study, an isomiR sequence of miR-411 was 5-fold more abundant than the canonical form in primary human vascular cells and in venous tissue samples from patients with peripheral artery disease. The authors show that iso-miR-411 and canonical miR-411 expression is differentially regulated under ischemic conditions in a murine hindlimb ischemia model. 15 Finally, isomiRs may act as cancer biomarkers and function as either allies or antagonists of their canonical counterparts. 35 By using cell, and not tissue, data, McCall et al. observed many cell-specific differences in the iso-miR composition of miRNAs. 36 Notably of 205 common miR-NAs, the most abundant sequence did not match the reported sequence in mirBase.org. 36 Since a large body of studies shows that isomiRs have a biological function, it is thus critical to choose reliable library protocols and mapping strategies. We observed that 4-5N randomized adapters approaches resulted in the most unbiased protocols, showing the lowest amount of NTA percentage in the 963-miRNA reference pool samples. This is in agreement with previous observations only considering the exact miRNA sequences. 11 IsoSeek and AQ-seq detect profound global changes in the miRNA uridylome in HEK293T cells, while this was not observed using the fixed adapter NEBNext protocol. Nevertheless, all protocols determined that hsa-miR-760, which has been correlated with breast cancer progression, 37 is a heavily uridylated miRNA ( Figure 5A). Importantly, exploiting CRISPR-Cas9-generated TUT4/7 DKO cells, we could identify miRNAs that are specifically uridylated by the combined action of these TuTases. We identified with multiple protocols that at least 20 miRNAs (6%) are selectively uridylated by these enzymes (Figure 3), a biologically relevant observation that was also highly significant (Z score = 28.5). NTAs by TUT4 and TUT7 triggers ''armswitching,'' changing the repressive activity of modified miR-NAs. 30 The TUMR is abolished in cells lacking the uridylation enzymes TUT4 and TUT7. 13 We show that the canonical and mono-uridylated miR-30e-3p targetome differ dramatically ( Figure 5), which highlights that the correct interpretation of this biological phenomenon relies on appropriate protocol selection.
In summary, this study unveils the importance of an appropriate sequencing method for isomiR analysis. Randomized adapter-based protocols outperformed those with fixed adapters on this end. This is of crucial importance for the detection of non-templated additions (especially NTA-Us and NTA-As), which are miRNA modifications of high biological relevance that are underestimated or overestimated by certain protocols due to their different biases. Accurate Resource ll profiling of isomiRs is mandatory to achieve detailed interpretations of miRNA biological functions. Given the growing body of evidence for the functional importance of certain classes of isomiRs, [13][14][15]19,[31][32][33][34][35][36] we propose that miRNA expression profiling with sequencing is ideally analyzed at the individual isomiR level, rather than aggregating all functional variants

Limitations of the study
The main limitation of the study is that, despite conducting a comprehensive comparison of several available miRNA-seq protocols, other less-known or future protocols not included in this study have the potential to outperform those described here. However, by following the same analysis pipeline described here, any other potential protocol should be easily compared with the others. It is important to note that the choice of the bioinformatics pipeline is crucial, especially for isomiR analysis, and that the analysis with a different bioinformatics pipeline can lead to different results. In this study, we used the widely used and validated miRNA analysis software sRNAbench, which is also optimized for isomiR profiling.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: