Interactive Web-based Annotation of Plant MicroRNAs with iwa-miRNA

MicroRNAs (miRNAs) are important regulators of gene expression. The large-scale detection and profiling of miRNAs have been accelerated with the development of high-throughput small RNA sequencing (sRNA-Seq) techniques and bioinformatics tools. However, generating high-quality comprehensive miRNA annotations remains challenging due to the intrinsic complexity of sRNA-Seq data and inherent limitations of existing miRNA prediction tools. Here, we present iwa-miRNA, a Galaxy-based framework that can facilitate miRNA annotation in plant species by combining computational analysis and manual curation. iwa-miRNA is specifically designed to generate a comprehensive list of miRNA candidates, bridging the gap between already annotated miRNAs provided by public miRNA databases and new predictions from sRNA-Seq datasets. It can also assist users in selecting promising miRNA candidates in an interactive mode, contributing to the accessibility and reproducibility of genome-wide miRNA annotation. iwa-miRNA is user-friendly and can be easily deployed as a web application for researchers without programming experience. With flexible, interactive, and easy-to-use features, iwa-miRNA is a valuable tool for the annotation of miRNAs in plant species with reference genomes. We also illustrate the application of iwa-miRNA for miRNA annotation using data from plant species with varying genomic complexity. The source codes and web server of iwa-miRNA are freely accessible at http://iwa-miRNA.omicstudio.cloud/.


Introduction
MicroRNAs (miRNAs) are a class of small non-coding RNAs that are widespread in eukaryotes and play roles in a variety of biological processes, including plant growth, development, and stress responses [1][2][3]. In plants, miRNA genes are transcribed into primary transcripts, which are processed by the DICER-LIKE 1 (DCL1), SERRATE (SE), and HYPONASTIC LEAVES 1 (HYL1) proteins to generate a stem-loopstructured miRNA precursor, followed by trimming into a mature miRNA/miRNA* duplex [4]. Recently, some miRNAs have also been associated with agriculturally important traits, emerging as potential targets for crop improvement and protection [5]. Due to their biological and agricultural importance, miRNAs have become essential elements annotated in genome sequences, particularly for plant species.
Genome-wide miRNA identification is generally accomplished using bioinformatics methods, such as homology search, machine learning-based prediction, and nextgeneration sequencing (NGS) data mining [6][7][8][9][10][11][12][13][14]. Such computationally identified miRNAs have been deposited into public data repositories, such as miRBase [15], PmiREN [16], sRNAanno [17], and Plant small RNA genes (PsRNA) [18] by multiple research groups, providing valuable resources for life scientists interested in miRNA research -from singlegene to genome-wide scale, basic molecular biology to population genetics, and bioinformatics to experimental biology. Despite these advances, present-day annotations remain riddled with false positives and have a limited degree of comprehensiveness (the fraction of all bona fide miRNA genes that are included), exhaustiveness (the fraction of all mature miRNAs from each miRNA gene), and completeness (the fraction of pri/pre-miRNA sequences that cover the entire length) [19,20].
There are multiple factors that make the computational identification of miRNAs challenging. First, the tissue-/cell type-/developmental stage-specific expression and/or low expression properties of some miRNAs mean that they are often poorly identified from traditional low-throughput experimental studies and NGS experiments with limited samples and/or low sequencing depth. Second, the imperfect criteria were defined for the identification of miRNAs from NGS data. Although high-throughput criteria were established years ago [21] and have been continuously updated in response to studies of mechanisms [12,19], they cannot fully capture the species to species variation of miRNA characteristics. Third, the unsatisfactory level of accuracy of automatic miRNA annotation methods. Homology-based strategies fail to identify speciesspecific miRNAs. While machine learning-based tools have been designed for genome-wide miRNA prediction [7,8,[22][23][24], most are trained with limited experimentally validated miRNA data and have markedly lower prediction accuracy in cross-species prediction [25]. Since the introduction of small RNA sequencing (sRNA-Seq), many sequencing-based tools have been developed that vary in their characterization of miRNAs [10][11][12][13][14]; however, only a few tools have kept pace with updated miRNA identification criteria, and they continue to suffer from trade-offs between quality and quantity [26]. Given these differences in the use of sRNA-Seq data, automatic annotation approaches, and miRNA identification criteria, inconsistency often arises in existing plant miRNA annotations. For example, at the beginning of this project in December 2019, we observed that, in Arabidopsis thaliana, there were 326, 221, 163, and 142 miRNA precursors annotated in miRBase (v22.1), PmiREN (v1.0), sRNAanno (v1.0), and PsRNA (v1.0) databases, respectively, with an overlap of only 120 miRNA precursors among these four databases. These inconsistencies indicate a proportion of false positives and false negatives within existing plant miRNA annotations, which may have serious consequences for downstream studies, such as expression quantification, differential expression analysis, targetome analysis, and functional screening.
A straightforward way to improve the quality of miRNA annotations is to develop bioinformatics methods with sophisticated miRNA identification algorithms and criteria. In addition, a combination of automatic annotation and manual annotation would also be effective. The power of manual annotation has been demonstrated in human protein-coding gene annotation, where human annotators inspect automatically annotated transcripts and create relatively confident annotation databases, including GENCODE [27] and RefSeq [28]. These manually annotated databases are often free from many of the artifacts resulting from automated approaches and have been adopted by most large-scale genomics projects, including the Encyclopedia of DNA Elements (ENCODE) [29] and the Genotype-Tissue Expression (GTEx) project [30]. In recent years, manual inspection has also been advocated and performed to compile high-quality miRNA datasets from the genomes of Citrus sinensis [26] and human [31]. These pioneer investigations will provoke a wider interest among scientists in the research field of manual inspection of genome annotation, accompanied by an increased demand for effective interactive annotation tools to manage and analyze genome-wide miRNA annotations.
Here, we present iwa-miRNA, a web-based framework for interactive annotation of miRNAs from plant species with reference genomes. iwa-miRNA not only provides functions for automatically incorporating miRNA annotations from four representative databases (i.e., miRBase, PmiREN, sRNAanno, and PsRNA) but also builds a bioinformatics pipeline designed specifically to handle large-scale sRNA-Seq data for candidate miRNA prediction. Both annotated and predicted miRNAs are aggregated into a comprehensive list of miRNA candidates. Two miRNA selection approaches, high-throughput criteria and machine learning-based, are provided to assist the selection of promising miRNA candidates based on the sequence-, structure-, and expression-based features. To enhance the accessibility of miRNA annotation, iwa-miRNA generates a report page with detailed information customized by feature types for each selected miRNA, facilitating convenient miRNA refinement during manual curation. The source codes of iwa-miRNA have been combined into a Galaxy platform, organized with user-friendly web interfaces, and supported with extensive user documents. With these flexible, interactive, and easy-to-use features, iwa-miRNA can generate a comprehensive collection of miRNA candidates and allows users to interrogate miRNA annotations in a straightforward way, without the need for computational skills. We provide examples of the application of iwa-miRNA for miRNA annotation of A. thaliana, maize (Zea mays L.), and hexaploid bread wheat (Triticum aestivum L.).

Method
The iwa-miRNA comprises three modules: MiRNA Compilation, MiRNA Selection, and Manual Curation ( Figure 1; Table 1). The source codes of the modules and their dependencies are fully organized within the Galaxy framework. iwa-miRNA can be implemented through a user-friendly web interface and summarizes the results into HTML pages, using Rmarkdown for easy visualization, interpretation, and sharing.

MiRNA Compilation (Module I)
This module generates a comprehensive collection of miRNA candidates by aggregating already annotated miRNAs from four plant miRNA databases (i.e., miRBase, PmiREN, sRNAanno, and PsRNA) and predicted miRNAs from usersubmitted sRNA-Seq data ( Figure 1A). For a plant species of interest, all miRNA annotations (e.g., name, sequence, and genomic coordinate) provided by the four miRNA databases are automatically retrieved using the ''miRNARetrival" function. miRNA prediction is accomplished using the ''miRNAPredict" function, which is specifically designed for parallel analysis of large-volume sRNA-Seq data. This function is based on a series of bioinformatics tools and custom scripts required for read cleaning (FASTX-Toolkit v0.0.14; http://hannonlab.cshl.edu/fastx_toolkit), genome mapping of reads (Bowtie v1.2.3 [32]), and miRNA prediction (miRDeep-P2 [12] and miRCat2 [33]). iwa-miRNA accepts the inputs of target genome sequences in FASTA format and corresponding annotations in GFF3/GTF format, which can be directly submitted by users or automatically fetched from the Ensembl Plants (https://plants.ensembl.org) database using the 'genomePrepare' tool. For miRNA annotations from different versions of the genome, the 'miRNATranslate' function can be used to translate annotated miRNAs into the genomic coordinate system of the target genome by performing miRNA precursor-to-genomic alignment using GMAP (v2019.09.12) [34]. All miRNA candidates are finally organized using a uniform naming scheme and genomic coordinates.

MiRNA Selection (Module II)
This module selects a subset of miRNA candidates that are regarded as promising miRNAs, according to the highthroughput criteria and/or using a machine learning-based approach ( Figure 1B; File S1). For the latter miRNA selection approach, iwa-miRNA builds a one-class support vector machine (SVM) classifier [35] to predict if tested miRNA candidates are potentially real miRNAs or not (File S1; Table S1). iwa-miRNA is user-friendly in that users can tune parameters according to the sRNA-Seq data at hand. A set of default parameters derived from our own analysis experience are also provided to assist non-expert users within their analyses.

Manual Curation (Module III)
This module provides the information for all miRNA candidates generated during the compilation and selection processes and creates a summary page for rapid curation of the quality of selected miRNAs ( Figure 1C). For miRNAs of interest, users can further inspect them by entering into the corresponding report pages, which provide more detailed information customized by feature types and visualized using various plot styles. A secondary structure plot is generated to display the location of a mature miRNA within the precursor sequence and quality-profiling results. Read stacks are plotted to show the read-support of identified miRNAs. A boxplot is used to visualize miRNA expression patterns and arm selection events across different samples. A bipartite network is constructed to depict miRNA-target interactions predicted by psRNAtarget [36]. Users can update the list of selected miRNAs in a dynamic manner through adjusting criteria thresholds or by direct deletion from the summary page. Selected miRNAs are finally exported into table, GFF3/GTF, and FASTA format files for downstream exploratory analyses ( Figure 1D).

Results
We illustrate the efficiency of iwa-miRNA for miRNA annotation of three plant genomes of different complexity. Among these applications, four databases (miRBase, PmiREN, sRNAanno, and PsRNA) and a set of publicly available sRNA-Seq datasets were used to generate candidate miRNAs. Both high-throughput criteria and one-class SVM with default parameters ( Figure 1B) were used to identify promising miRNA candidates.
Case 1: application of iwa-miRNA for miRNA annotation in A. thaliana As an initial demonstration of our framework, we looked at the long-studied and extensively annotated miRNAs of the model plant species, A. thaliana, which has a relatively small genome of $ 135 Mb. In A. thaliana, miRNAs have been studied for over 18 years [37][38][39] and explored using more than 2000 sRNA-Seq datasets [40]. Using iwa-miRNA, we obtained a total of 365 miRNA precursors corresponding to 625 mature forms from the four databases (miRBase, PmiREN, sRNAanno, and PsRNA; Figure 2A and Table S2). Using 1063 sRNA-Seq datasets (File S1) for the Columbia ecotype of A. thaliana as inputs, iwa-miRNA predicted 435 miRNA precursors, 302 of which were not previously annotated in any of the four plant miRNA databases. This resulted in the identification of 667 miRNA precursor candidates, corresponding to 1190 mature miRNA candidates (Table S2). Newly predicted miRNA precursors were expressed at relatively low expression levels and with less breadth than those that were already annotated ( Figure 2B and C), indicating the potential importance of iwa-miRNA in identifying miRNA precursors with tissue-/cell type-/developmental stage-specific expression and/or low expression levels. There were 330 miRNA precursor candidates that passed the highthroughput criteria (denoted as Ara-Set1), 203 of which were annotated in at least one of the four databases. Using iwa-miRNA, we were able to characterize these 667 candidate miRNA precursors using 219 sequence features, 382 structural features, and 1063 expression features [i.e., transcripts per million (TPM) values across all samples], providing an opportunity to predict miRNAs using machine learning approaches (File S1). Using 365 already annotated miRNA precursors as positive samples, iwa-miRNA built a one-class SVM classifier to predict 308 miRNA precursor candidates (Ara-Set2) as true positives. There were 208 candidate miRNA precursors in common between Ara-Set1 and Ara-Set2 ( Figure 2D). For newly predicted miRNA precursors, 15 candidates (five from each region of the Venn diagram at the bottom of Figure 2D) were randomly selected for validation using quantitative realtime polymerase chain reaction (qRT-PCR) experiments, in which mature sequences were amplified with specifically designed primers (Table S3). Three miRNA precursor candidates were excluded because their mature sequences were not unique in the Arabidopsis reference genome (TAIR10). These wet laboratory experiments validated 11 candidates as expressed in a mixed sample of A. thaliana roots, shoots, leaves, and flowers ( Figure 2E; File S1). These results provide evidence to confidently annotate A. thaliana miRNAs using iwa-miRNA, although further validation experiments at a large scale are desirable.
For each miRNA precursor candidate, iwa-miRNA assigns an identifier via a uniform naming scheme, and the corresponding uniform resource identifier (URI) is hyperlinked to an HTML web page reporting a detailed description of feature information for manual inspection ( Figure 3A). Figure 3B shows the report page of a representative example, ''ath-MIR156b", which regulates vegetative phase change and recurring environmental stress by repressing squamosa promoter binding protein-like (SPL) transcription factors [41,42]. The precursor of miR156b produces two mature miRNAs of different lengths: a 20-nt miRNA from the 5 0 arm (miR156b-5p) and a 23-nt miRNA from the 3 0 arm (miR156b-3p). The former is present at high levels in the root, leaf, and seed tissues, while the latter is mainly found in the root.

Case 2: application of iwa-miRNA for miRNA annotation in maize
The successful application of iwa-miRNA to miRNA annotation in A. thaliana prompted us to evaluate its value for the analysis of plants with larger, more complex genomes. Here we focused on the model crop, maize, specifically the B73 inbred line, which has a reference genome of 2.3 Gb, more than 80% of which comprises transposable elements and other repeat sequences [43]. iwa-miRNA obtained a total of 619 miRNA precursors, which correspond to 893 mature forms, from the four databases ( Figure 4A). For each database, the proportion of uniquely annotated maize miRNA precursors was markedly different from that observed in A. thaliana (Table S4). This difference underscores the importance of performing an aggregation analysis and manual inspection of miRNA annotations from different sources. Furthermore, an integrative analysis of 195 sRNA-Seq datasets obtained from previously reported work [44] yielded a total of 1241 miRNA precursor candidates, 622 of which were previously un-annotated in any of the four plant miRNA databases (Table S5; File S1).
Using iwa-miRNA, 886 miRNA precursors were selected for downstream analysis: 588 passed with the highthroughput criteria (maize-Set1) and 704 predicted by the one-class SVM classifier (maize-Set2) ( Figure 4B). One of the SVM-predicted miRNAs, zma-miR_N85a-5p, had already been validated by qRT-PCR in our recently published paper [44] and exhibited a seed-specific expression pattern. Of the 886 miRNA precursors, 381 exhibited broad expression patterns, having ! 1 TPM in more than 50% of 195 samples ( Figure 4C).
Preliminary statistical analysis showed that most novel miRNAs were 21-nt and 22-nt in length (Table S5). Novel miRNAs are predominantly from intergenic regions and transposable elements ( Figure 4D). Some newly predicted miRNAs could be statistically associated with maize traits. As shown in Figure 4E, in the tail region of zma-miR_N221a-5p, there is a single nucleotide polymorphism (SNP; rs#chr6: 22134053) that has been reported to be significantly associated with the metabolic trait ''Np.Npp_Feruloyl.caffeoyl_spermidine_derivative_ E1" (P = 4.7EÀ6) [45]. This genetic variant (AA/GG) may influence target gene selection (56 vs. 51 genes; 30 overlapped).
These preliminary results indicate the efficiency and power of interactive annotation of miRNAs in crop plants with complex genomes.

Case 3: application of iwa-miRNA for miRNA annotation in wheat
Finally, we showcase the application of iwa-miRNA to miRNA annotation in a more complex crop plant, hexaploid (AABBDD) bread wheat (T. aestivum), which has an even larger genome ($ 17 Gb), with a more complex repeat landscape than that of maize [46]. In addition to this, the wheat genome also presents other specific challenges, such as the composition of three closely related and independently maintained subgenomes. Given that miRNAs were previously annotated based on different versions of the wheat reference genome from the four plant miRNA databases, iwa-miRNA first unified the miRNA annotations by mapping miRNA precursor sequences to the latest wheat reference genome (IWGSC RefSeq v1.0; https://plants.ensembl.org/Triticum_aestivum) using GMAP (v2019.09.12). Then, it was applied to predict miRNA precursors from 95 sRNA-Seq datasets (File S1; Table S6), resulting in a list of 2857 miRNA precursor candidates in wheat ( Figure 5A; Table S7). Subsequently, 2030 miRNA precursor candidates were selected based on high-throughput criteria (wheat-Set1; 1617 miRNA precursor candidates) and oneclass SVM prediction (wheat-Set2; 1410 miRNA precursor candidates) ( Figure 5B). Finally, these 2030 selected miRNA precursors, corresponding to 2163 mature miRNAs, were organized into a summary page for future experimental validation and functional exploration.
Of 2030 miRNA precursors, 1926 could be clearly located on the three subgenomes, among which the D subgenome contained significantly fewer miRNA precursors than the A and B subgenomes (571 vs. 611 and 744, respectively; v 2 test, P < 0.001). This result suggests that there may be subgenome bias within these annotated miRNAs.  Table S8). Further, 7.54% (118/1565) of groups of homologous miRNA genes contained triads, with a single gene copy per subgenome. Among these 118 triads, 17 groups exhibited differences in mature sequences and expression levels ( Figure 5D). A representative example is Tae-MIR408a/c/f, which has been linked to the salt stress Venn diagrams comparing miRNA precursor candidates between the two miRNA selection approaches. Ara-Set1 and Ara-Set2 represent A. thaliana miRNA precursors identified based on high-throughput criteria and one-class SVM, respectively. The upper and lower Venn diagrams indicate the overlap between high-throughput criteria and one-class SVM for already annotated miRNA precursors and newly predicted miRNA precursors, respectively. E. qRT-PCR results for 12 candidates randomly selected from three regions of the lower Venn diagram in (D). DCt, the difference in cycle threshold (Ct) between the miRNA of interest and U6 small RNA (i.e., Ct miRNA À Ct U6 ). The presented data were average DCt of three replicates, and standard errors were plotted on the graph. SVM, support vector machine.
response and leaf polarity in wheat [5,47,48]. Among the A, B, and D subgenomes, there are nucleotide differences in the 5 0 mature Tae-MIR408a/c/f sequence (A/B vs. D) and the 3 0 mature sequence (A/D vs. B), resulting in identification of four mature sequences: Tae-miR408a-5p, Tae-miR408c/f-5p, Tae-miR408a/c-3p, and Tae-miR408f-3p ( Figure 5E). These four sequences exhibit different expression patterns in leaf and grain tissues ( Figure 5F). Tae-miR408a-5p and Figure 3 Summary and report pages generated by iwa-miRNA A. Screenshot of summary page reporting the information of some features for A. thaliana miRNAs. B. The report page for ath-MIR156b. The reported information includes RNA secondary structure, read support of miRNA, miRNA arm expression patterns, arm selection events across different samples, miRNA-target interactions, and expression levels of different miRNA isoforms. HT, high-throughput; MFE, minimal free energy; AMFE, adjusted minimal free energy; add5, 5 0 template addition; sub5, 5 0 template deletion; add3, 3 0 template addition; sub3, 3 0 template deletion.
Tae-miR408c/f-5p are highly expressed in grain and leaf, respectively. Further, Tae-miR408a/c-3p is mainly expressed in leaf, while Tae-miR408f-3p has no expression in any tissues tested. Nucleotide differences in the four mature sequences also result in the diversity of target genes ( Figure 5G). These results indicate that the comprehensive annotation of miRNAs provides opportunities to explore the evolution and functional diversification of miRNAs in polyploid plants, including hexaploid bread wheat.
Discussion miRNAs have been the subject of extensive research over the past 20 years [39,49]; however, annotating miRNAs at the genome-scale is not straightforward, not only because of the experimental difficulty in capturing some miRNAs with low-or context-specific expression patterns but also owing to the computational difficulties in accurately distinguishing signals from noise within genomic sequences and sRNA-Seq data. To facilitate miRNA annotation, we have developed an interactive annotation framework, iwa-miRNA, with a userfriendly interface to compile, select, and manually curate miRNAs from plant species with reference genomes.
Compared to existing miRNA-related bioinformatics annotation databases and tools, iwa-miRNA has several distinguishing features.
First, iwa-miRNA bridges the gap between existing annotations provided by public miRNA databases and new predictions from sRNA-Seq datasets. Most miRNA databases are periodically updated; however, they cannot integrate new miRNAs predicted from the rapidly accumulating sRNA-Seq data in a timely manner. In contrast, many bioinformatics tools have been designed with the sole purpose of predicting miRNAs from sRNA-Seq data, with less consideration of miRNA annotations provided by different databases. This issue was recently addressed by miRCarta [50], which collects novel human miRNA candidates and augments the annotation information provided by miRBase. Unlike miRCarta, iwa-miRNA tackles this deficiency with an emphasis on plant miRNAs. We suggest that, in the future, more attention should be paid to bridging the gap between miRNA annotations and predictions in plants, human, and other species.
Second, iwa-miRNA provides a new way to interactively select promising miRNAs. The aggregation of already annotated and newly predicted miRNAs generates a comprehensive collection of miRNA candidates, which certainly contain false positive hits, as well as interesting candidates (especially those expressed in tissue-/cell type-/developmental stagespecific patterns) for further validation. In previous studies, miRNA candidates were selected according to pre-defined and often un-transparent rules. The reliability of computationally annotated miRNAs is difficult to judge, particularly for scientists who have no experience in the process of miRNA prediction and annotation. iwa-miRNA provides a visualization of the sequence-, structure-, and expression-based features used in miRNA selection. In this way, researchers (both annotators and bench scientists) can manage miRNA selection in a dynamic manner with full control over criteria, thus conveniently selecting promising candidates for further exploratory analysis and experimental validation.
Third, iwa-miRNA is user-friendly and can be deployed as a web application for easy accessibility and public/private data analysis. To facilitate the application of iwa-miRNA, all source codes and dependencies have been integrated into the Galaxy system, followed by packaging an independent runtime environment into a Docker image. This enables compatibility and portability: users can launch iwa-miRNA using a simple command, regardless of which operating system (Windows, Linux, or Macintosh) is used. Through a simple graphical interface, users can use this 'one-stop' platform to mine available sRNA-Seq datasets and miRNA annotations. iwa-miRNA is also integrated with Rmarkdown-based HTML reports to return interactive tables, publication-grade plots, and reproducible operations.
iwa-miRNA also suffers from some limitations. It cannot handle sRNA-Seq data from species without genome sequences. Recent RNA-Seq data analysis revealed that unmapped reads could serve as a valuable resource for new gene discovery [51]. In the current version, iwa-miRNA ignores sequencing reads that do not map to the reference genome. Since the main purpose of this study was to develop a platform that facilitates integrative annotation of miRNAs, iwa-miRNA has a limited ability to perform downstream analysis of selected miRNAs, such as enrichment analysis (e.g., microRNA gene ontology annotation and miRNA set enrichment analysis) and comparative analysis between different species [52][53][54]. It should also be noted that there may be some false positives contained in the list of predicted miRNA candidates. Further ''wet" experiments should be performed to validate miRNA candidates of particular interest before any functional investigation.

Future work
The iwa-miRNA project continues to be developed and improved. In future versions of iwa-miRNA, we plan to * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * develop new functional modules to analyze sRNA-Seq data for species without a reference genome, identify candidate miRNAs from unmapped reads, and provide more downstream exploratory analyses. We invite researchers to use the iwa-miRNA platform to carry out large-scale sRNA-Seq data analysis on their local computers. Research collaborations are welcome, in particular for researchers without highthroughput computational resources.

Data availability
The prototype web server of iwa-miRNA is accessible at http://iwa-miRNA.omicstudio.cloud.
CRediT author statement