DNABERT-based explainable lncRNA identification in plant genome assemblies

Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs, however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, revealing that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction with an average of 63.1% accuracy using target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, becoming a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.


Introduction
Regulatory non-coding ribonucleic acids (ncRNAs) can modulate gene expression, with many ncRNAs reported to regulate flowering time and development [1,2], abiotic stress tolerance [3,4] and interaction beneficial or pathogenic microorganisms [5][6][7].Long non-coding ribonucleic acids (lncRNA), a subset of ncRNAs, are defined as RNA transcripts longer than 200 nucleotides that are not translated into functional proteins but can sometimes generate small peptides [8,9].LncRNA expression is often specific to the plant developmental stage and tissue or cell type [10][11][12][13], with some transcripts presenting a remarkably short life [14].Once considered an evolutionary oddity, lncRNAs have demonstrated roles in essential biological processes.They may act by regulating the expression of neighbouring genes or across multiple loci in the genome, and are also involved in epigenetic regulation [15][16][17].For example in rice, the overexpression of lncRNA LAIR can increase grain yield by upregulating the expression of the LRK (leucine-rich repeat receptor kinase) gene cluster, producing larger primary panicles and more panicles per plant [18].The lncRNA Ef-cd, transcribed from the antisense strand of the flowering activator OsSOC1 locus, positively regulates the expression of this gene, shortening time to maturity in late-maturing rice by 7-20 days, without causing a decrease in yield usually associated with early maturing varieties [19].Other lncRNAs are associated with heat responsive genes in Brassica rapa [20], osmotic and salt stress in Medicago truncatula [21], drought stress response in Zea mays and Oryza sativa [22,23], stress response in Glycine max [24], sexual reproduction in O. sativa [25], pathogen response in Triticum aestivum and Arabidopsis thaliana [26][27][28], and vernalization-mediated epigenetic silencing in A. thaliana [15].Hence the identification of novel lncRNAs may support the breeding of improved crop varieties.
Multiple approaches are employed for lncRNA detection and characterisation.The majority of lncRNA prediction bioinformatics pipelines use a combination of in-house scripts to filter potential lncRNAs from transcriptome datasets based on length, open read frame (ORF) size and comparison to known protein-coding sequences, followed by a tool to estimate coding potential, such as CPC2 [29], LncMachine [30], CPAT [31], PlncPRO [32] and PLEK [33] (reviewed in [34]).For the coding potential models, a common approach is to manually define features to be extracted from transcripts and to use these features as input for machine and deep learning models, such as random forest, to categorise transcripts into mRNA or lncRNA [35,36].However, handcrafted features used for machine learning can introduce bias and reduce classification performance, depending on the relevance of the chosen features [37,38].Another drawback is that current models rely on transcriptome data or computationally predicted lncRNAs available on CANTATAdb [39], PLncDB [40], GreeNC [41,42] or AlnC [43] to train their models and output a prediction.Since lncRNAs have a low expression rate that can be highly specific to the plant developmental stage or tissue, identifying lncRNAs from transcriptome datasets may underestimate lncRNA abundance or fail to detect lncRNA presence [24].A model that could take advantage of publicly available genomic references rather than transcriptomic data to identify lncRNAs would potentially identify all lncRNAs, particularly if trained with learned features, avoiding bias introduced by handcrafted feature extraction.
Natural Language Processing (NLP) models, such as Bidirectional Encoder Representations from Transformers (BERT) [44] and Universal Language Model Fine-tuning (ULMFiT) [45], are deep learning models developed to deal with large sequences of words or text data.Transformers employ an attention-based mechanism to train their encoder-decoder architecture [46], offering an opportunity to initially pre-train a model on genomic data without labelling the dataset beforehand (unsupervised training), avoiding the introduction of annotation bias.DNABERT, GeneBERT and Genomic-ULMFiT are the genomic versions of the NLP frameworks (https://github.com/kheyer/Genomic-ULMFiT)[47,48].These NLP models have been fine-tuned for the identification of N4-methylcytosine DNA methylation sites in the Caenorhabditis elegans genome [49], identification of gene promoters, splice sites and transcription factor binding sites from the human genome data [47], and disease risk estimation based on genomic sequences and RNA-splicing [48].One of the advantages of using a pre-trained genomic model is that it can be fine-tuned to a specific classification task using a smaller dataset with high quality annotations.NLP offers an opportunity to achieve direct genome prediction of lncRNAs from genomic data.An advantage of generating models from genomic data is that models trained with high accuracy on high quality data will likely have cross-species applicability, with previous evidence in transcriptomics prediction indicating machine learning models trained on well studied crops can predict across species [50].The NLP models could then be interpreted to determine important motifs in lncRNA prediction within and across plant species to determine evolutionary conserved regulatory motifs.Visualisation and interpretation techniques have been used by DNABERT [47] to pre-train models for genomic input and to visualise important motifs for promoter identification through attention scores [51] for universal applicability to any NLP model that uses transformers.Packages such as these make the leap from predicting promoter regions to lncRNA transcripts feasible with pre-trained BERT language models for interpretation and visualisation.
This study serves as a foundation for performing a genome-wide search and discovery for lncRNAs using an NLP deep learning model.Here, we trained NLP models to predict lncRNA from genomic data of seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) with accuracy between 83.4% and 57.9%.The cross-applicability of these NLP models was tested by comparing each species' model's performance with lncRNA data from other species.In addition, we further investigate the model features to visualise the most important motifs for lncRNA prediction.Significant motif conservation across species was also investigated, with the majority of the significant motifs located at the edges of the lncRNA sequence of the species analysed.

LncRNA prediction performance of for seven crop species
Multiple NLP model instances were fine-tuned separately for each species using genomic sequences, the sequences were split into different k-mers sizes from 3-mers to 6-mers.The models were evaluated using accuracy, AUROC, F1, MCC, precision and recall.The accuracy reported for the highest performing model in each species ranged from 83.4% and 57.9% (73.5% median).The value for MCC varied substantially, ranging between 0.16 and 0.67 (0.4 median), with 1 being considered a perfect prediction.AUROC varied between 61% and 90%, while F1 ranged between 58% and 83%, the precision score ranged between 58% and 84% and a recall score between 58% and 84% (Table 1).Across all classifiers, we observed a 30% performance difference, with the Z. mays model consistently showing the highest prediction performance, whereas the B. rapa model reported the lowest prediction performance.Assessing the impact of k-mers size on model performance, most species achieved the highest performance using the fine-tuned 3-mers model, except for B. rapa and Z. mays, which had the best accuracy using the 4mers and 6-mers models respectively.The lncRNA prediction confidence values are available at Supplementary Table 1.
Confusion matrix analysis is commonly used in machine learning classification tasks to assess which classes are most often misclassified (or "confused") by the model, it helps identify whether the misclassified samples share similarities and if either class is more often wrongly predicted.Here, the models accurately identified lncRNAs 59.3-88.6% of the time and detected random stretches of DNA between 55.6% and 78.1% of the time, showing the model was equally accurate in predicting either class.The Z. mays model had the largest dataset with 14030 genomic sequences and achieved the highest classification performance to identify both lncRNAs and no lncRNAs.The two models trained with G. max and O. sativa had smaller datasets with 3414 and 3716 sequences, respectively, but performed on average 18.7% better than the B. rapa model with a medium-size set (9725 genomic sequences).

Identification of lncRNA from external datasets across six species
Machine learning models are often impacted by lack of variability within a training dataset, which might lead to lower performance when faced with external data.In this section, we evaluated the robustness of the proposed methodology to detect lncRNAs across seven datasets dedicated to lncRNA curation.Including a wide-range of lncRNAs databases in the model evaluation phase challenges to the model's capacity to generalise and sustain consistent performance across a spectrum of conditions, demonstrating its capacity to classify transcripts extracted from multiple sequencing experiments and a different genome sequence than the one employed for constructing the training and validation dataset in the previous section.
65% of the lncRNAs were accurately classified.The models exhibited a consistent 65% for both Recall and Precision with an average F1 score of 0.78, which is compatible with the results observed in the previous section.Regarding the identification of known lncRNAs, the A. thaliana model trained with 3-mer was able to identify three of the known lncRNAs (IPS1, COLDAIR and APOLO) solely based on their genomic sequences without the added 2 kbp flanks.The ELENA1 lncRNA was only identified with the 2 kbp flanks added, and COOLAIR was misclassified.Prediction results are detailed in Supplementary Table 2.
A substantial variation in prediction performance is observed across the species as shown in Table 2.The highest prediction performance is observed in the identification of transcripts from O. sativa and G. max models, which showed 67.4% and 76.3% accuracy respectively.Both species models outperformed the F1 score by at least 0.10 in comparison to the previous performance observed with the Cantata dataset.The Z. mays prediction model showed the lowest classification performance observed, correctly classifying 54% of the lncRNAs which is 30% less than the lncRNA prediction accuracy on the Cantata database.Moreover, the B. napus specific model presented a 16% drop in prediction accuracy, precision and recall for the identification of lncRNAs in the external datasets in comparison to the results obtained with Cantata.There were 4563 transcripts identified as long intergenic non-coding RNAs (lincRNAs) with the majority of these being from A. thaliana (4536) and the remaining from O. sativa (14), G. max (8) and Z. mays (5).The prediction performance of lincRNAs showed a similar performance to the species specific models described in Table 2.A detailed list of the lncRNA sequence, prediction label, confidence level, chromosomal position and attributes is available in Supplementary Table 3.

Cross-species prediction of lncRNAs using the single species finetuned NLP model
NLP models with the highest accuracy for each crop were used to predict the lncRNAs in another species.For example, the 3-mers model trained with A. thaliana genomic k-mers was employed to predict lncRNAs in the remaining six species without further fine-tuning.Most of the models performed similarly or slightly worse at predicting lncRNAs for a new species in comparison to identifying lncRNAs in their original species, except for the model trained on Z. mays data that performed significantly worse at classifying other species' lncRNAs (Table 3, Fig. 2).In all cases, the highest lncRNA classification performance was achieved by models trained in that species instead of crossspecies prediction models.
In addition, we assessed how a model trained on a monocot (Z.mays and O. sativa) or dicot (A.thaliana, B. rapa, B. oleracea, B. napus and G. max) species would perform when predicting lncRNAs from the other group.Here, the models performed better when predicting lncRNAs from the same group, with an average accuracy of 66.2% for monocotmonocot prediction, and 62.7% for dicot-dicot prediction.Models performing lncRNA prediction in a different group (monocot-dicot and dicot-monocot) presented an average accuracy of 60.3%.

Motif identification and frequency of motifs appearing per genomic location
The characterisation of lncRNAs is a challenging task as lncRNAs share many similarities with other RNA types.Thus, to uncover potentially conserved patterns in lncRNA sequences, we extracted and ranked the most relevant genomic features employed by the best performing models for the classification of lncRNA.The best performing model for each crop (ranked by accuracy) was interpreted using the transformers interpret package [51] using the lncRNA sequences from the holdout dataset.The 50 most important genomic motifs for lncRNA identification were subjected to a chi-squared test against non-lncRNA motifs to determine the significant motifs (p-value < 0.05).The resulting 158 unique motifs were compared across species according to sequence similarity (Fig. 2, Supplementary Visualising the location of significant motifs in lncRNA divided into 25 bp bins of upstream & downstream flanking regions and bins representing 5% lncRNA sequence, showed that significant motifs important to lncRNA appeared more frequently in the upstream and downstream flanking regions than in the lncRNA sequences themselves (Fig. 4, Supplementary Figure 1).

Discussion
In this study, we fine-tuned multiple NLP instances to successfully identify lncRNAs within the genome sequence of seven plant species without the need for transcriptome data, and used these models for cross-species prediction of lncRNAs, excluding structural lncRNAs (rRNA and tRNA) which were absent in the CANTATAdb and PLncDB datasets.The code used to develop this project is available on Github (https://github.com/AppliedBioinformatics/lncRNA_Prediction_Interpretation) along with the trained model weights, allowing other researchers to implement the model with their datasets.Identifying lncRNA transcripts is challenging as lncRNAs tend to have lower expression than other transcripts, and their expression is often specific to the development stage, tissue or cell type [10][11][12][13] causing lncRNAs to go undetected when using RNA-seq based models [12].To address this limitation, we identified lncRNAs directly from the genomic sequence using known lncRNAs and genome assemblies.Identifying lncRNAs from genomic sequences can further our understanding of their sequence conservation across plant species [24], providing a foundation to investigate lncRNA evolution and identify novel transcripts [58].
The NLP models fine-tuned per species presented a median accuracy of 73.5%, with the model trained with Z. mays performing 25.6% better than the model for B. rapa.Several factors may have contributed to the higher performance of the Z. mays model, including the high quality of genome assembly.The Z. mays AGP RefGen 4 v genome assembly was constructed using advanced single-molecule technology, improving contig length and supporting the accurate assembly of intergenic spaces and centromeres [59], providing higher quality sequences for model training.Centromeric regions are reported to encode actively transcribed lncRNAs [24] associated with centromere maintenance and cellular division [60].In our study, Z. mays also had the largest available dataset of predicted lncRNAs after filtering [61], which supports the accurate training of machine learning models as they may better represent transcript diversity [62].The lncRNAs from B. rapa and O. sativa were consistently misclassified across the different models, which may indicate that different features were more important for these species.The difficulty in accurately identifying lncRNAs from these species may also be associated with genome assembly and annotation quality.B. rapa was the first species sequenced among the Brassica genus in 2011 [63], and the O. sativa genome was annotated before RNA sequencing was widely available, so these genome assemblies and annotations might be missing lncRNA regions [25,64].It was previously reported in the O. sativa genome annotation that many lncRNA loci had been incorrectly classed as putative protein-coding genes [41].Further research will measure the impact of genome assembly on model performance to assess the importance of assembly methods on lncRNA identification.The algorithm architecture applied here is susceptible to dataset errors, as it learns directly from the data presented, thus re-training the model as more validated lncRNA sequences are released will lead to higher model accuracy.Since the model presented here is trained using known lncRNAs identified through bioinformatics pipelines, it is difficult to assess its ability to identify novel lncRNAs, though the predicted set may provide a foundation for subsequent validation.
Regarding the identification of known lncRNAs such as APOLO [52,53], ELENA1 [54,55], COOLAIR [56], COLDAIR [15], IPS1 [57], the model faced challenges in identifying specific lncRNAs, such as ELENA1 and COOLAIR.This difficulty may be linked to their proximity to coding regions, due to involvement in nearby plant gene regulation.For instance, COOLAIR lncRNAs references a group of antisense lncRNAs involved in silencing the A. thaliana FLOWERING LOCUS C (FLC) during vernalization [65].Identifying antisense lncRNAs is particularly challenging as they are highly related to coding genomic regions.Moreover, the COLDAIR lncRNA is located within the FLC locus and transcribed in the same direction as FLC starting from the VRE element (RNA central ID URS000075325B), reported to act on the FLC chromatin "cis" regulation (Kim et al., 2017), indicating its proximity to the FLC loci that would be captured by the 2 kbp region included in our dataset.
Although lncRNAs do not show the same degree of interspecies sequence conservation as protein-coding genes [66], a study using ten different monocot and dicot species showed that the majority of lncRNAs have some sequence conservation at the intra-and sub-species level, and the number of lncRNAs identified correlated with the number of genes in each species [67].There is also evidence of structural and positional conservation across animal lncRNAs showing similar expression patterns suggesting transcript functional conservation [68,69].In our study, the models trained on monocot or dicot species had, on average, 3-6% higher accuracy than when predicting lncRNAs from other species in the same group, suggesting differences in the features employed for predicting these groups.The trained NLP models identified multiple genomic motifs significantly associated with lncRNA prediction.For the majority of species, the genomic motifs for lncRNA identification were more frequently located at the boundaries of the lncRNA sequence (Fig. 3), and presented a low GC content in comparison to the lncRNA sequences, with several being composed of poly-A or poly-T stretches.Characterising the role of these conserved motifs in lncRNA expression is beyond the scope of this study and the datasets, however, these motifs may contribute to identifying lncRNA expression regulation in future research.In addition, as more conserved motifs are identified, these may act as novel genomic markers allowing more precise identification of lncRNAs from other plant species.Fig. 1.Confusion matrices of the predicted labels from each species' highest performing models.The matrix is coloured based on the percentage of k-mers labelled in each class from the total test set of the species indicated in the title (i.e. A. thaliana had 1387 k-mers in total, with 488 of these (above 35%) being correctly predicted as lncRNAs).

Conclusion
We present the first study using an NLP model to successfully identify lncRNAs in plant species from genomic sequences instead of transcriptomic data.This method might allow for the detection of lncRNAs extracted directly from the genome reference regardless of their expression and can contribute to our understanding of lncRNA conservation and function across plant species.The proposed method could possibly be used for the identification of lncRNA across diverse species, including animal species.However more research is required to confirm its efficiency for lncRNA detection in human and animal species.The models developed here are able to identify the majority of lncRNAs from the species used for training, and across different species, with the Z. mays model presenting the highest accuracy.Further research may include a wider diversity of genomic sequences to increase model robustness and achieve better classification performance in the genomewide identification of lncRNAs.

BlastX filtering
The lncRNAs fasta sequences were extracted using bedtools getfasta query [74] using the reference genome fasta for the respective species.A BLAST database for each species was built using makeblastdb query and the genome fasta file [75].Next, the lncRNA fasta sequences for each species was compared to the respective genome database through blastn (outfmt 6, evalue 1e-10) to filter out sequences with high similarity to known mRNAs.The extract_blast.py python script was used to extract lncRNAs with over 90% of its length matched to a mRNA sequence.The blastx_pipeling.shscript can be used to complete these steps efficiently.

Adding flanking sequence & generating a control set
The lncRNA genomic context was incorporated in the sequence using the add_flanks.pyscript, which adds the 500 bp flanking regions upstream and downstream of the gtf file.The script ensured the flanking regions did not exceed chromosome or scaffold boundaries using the index file generated with samtools faidx [76].The control dataset, labelled "non-lncRNA", was composed from random stretches of genomic.After filtering the lncRNA plus flanking regions from the genome file using bedtools shuffle (-excl lncRNA gtf file), the control sequences were extracted from the reference genome running bedtools getfasta with random intervals of equivalent length to the flanked lncRNA fasta sequences.
For cross species prediction, the best performing k-mers model for each crop was used to predict the separate test set of every other plant with the same evaluation metrics.

Evaluation metrics
Model performance was assessed using accuracy, area under receiver operating characteristics curve (AUROC), F-score (F1), Mathew's correlation coefficient (MCC), Precision and Recall as the evaluation metrics calculated using the scikit-learn library [78].AUROC is estimated by plotting true positive rate vs false positive rate.The other metrics are defined below, where TP, FP, TN and TP mean true positive, false positive, true negative and true positive: Precision TP TP + FP (4) Recall TP TP + FN (5)

Interpretation of models
The transformers interpret repository (https://github.com/cdpierse/transformers-interpret) was used to analyse the model classification results.The scripts atributions.pyand explainer.pyfrom this repository were adapted to increase the maximum sequence length of the input from 512 to 2048.
Each lncRNA labelled sequence from the test dataset was iterated over with the transformers interpret sequence classification explainer object.This object returned attribution scores for each k-mer in the sequence.If more than 7 successive k-mers had positive attribution scores for lncRNA prediction, these k-mers were merged and their attribution scores summed until a k-mer with a negative attribution score was present.This merged k-mer would be annotated as a motif important for lncRNA prediction.After iterating over every lncRNA sequence for a given plant species, the frequency of a motif occurring as a positive prediction factor was multiplied by the sum of a motif's

Visualisation of significant motif location
The motifs contained in the lncRNA (plus flanking regions) and non-lncRNA sequences were extracted and counted using grep and awk commands.Then, a custom R script (chi_script.R) was used to run a chisquare test to assess whether the motif appeared in lncRNA sequences significantly more than non-lncRNA sequences.Motifs with a p-value less than 0.05 were kept whereas every other motif was excluded.
The parse_fasta function from Genomic_ULMFiT was used to count the locations of significant motifs in lncRNA and non-lncRNA sequences.Each lnRNA sequence was divided into 60 bins.Bins 1-20 were 5% intervals of the upstream flanking region, bins 21-40 were 5% intervals of the lncRNA sequence, and bins 41-60 were 5% intervals of the downstream flanking region.The counts bins 21-40 that refer to the lncRNA sequence, were normalised to counts of bins 1-20 and bins 41-60, as the lncRNA sequence was of variable length whereas the flanks were a consistent 500 bp.This ensured motif counts on longer lncRNAs sequences were not biased by the sequence length.The bins and their counts were analysed with the tidyverse suite [79] and plotted with ggplot2 [80] in R.

Declaration of Competing Interest
None.

Fig. 2 .
Fig. 2. AUROC of each fine-tuned model when performing cross-species prediction of lncRNAs.The black line indicates model performance when predicting lncRNAs in the species it was trained, whereas other species are represented by the colours indicated in the plot legend.

Fig. 3 .
Fig. 3. Shared genomic motifs across different plant species.The main plot indicates the number of significant genomic motifs (p-value < 0.05) shared or unique to the species.The number of significant genomic motifs identified per species is indicated in the lateral plot in blue.

Fig. 4 .
Fig. 4. Significant (p-value < 0.05) genomic motif sequence location for identifying lncRNAs in each species based on the highest performance single species classification model.

Table 1
The performance metrics of different k-mers length NLP models are fine-tuned for each species.The best metric model for each species is presented in bold text.

Table 2
Evaluation metrics of species specific models using the best fine-tuned models presented in the previous section for the identification of lncRNAs from PLncDB, RNAcentral, EVLncRNAs, Liu, NCBI, Li, and Cameron datasets.

Table 4 -
5).Apart from O. sativa, which shared no significant motifs with any of the other species, all species shared two lncRNA motifs (AAAAAAA and AAAAAAAA) belonging to the same motif cluster.The Brassica species shared five motifs, whereas B. napus and B. oleracea shared twelve motifs.Z. mays and G. max shared eleven motifs, the second highest number shared between two species.

Table 3
Performance metrics of cross-species lncRNA prediction.