LncSTPred: a predictive model of lncRNA subcellular localization and decipherment of the biological determinants influencing localization

Introduction Long non-coding RNAs (lncRNAs) play crucial roles as genetic markers and in genome rearrangement, chromatin modification, and other biological processes. Increasing evidence suggests that lncRNA functions are closely related to their subcellular localization. However, the distribution of lncRNAs across subcellular localizations is imbalanced: the number of lncRNAs located in the nucleus is more than ten times that in the exosome. Methods In this study, we propose a new oversampling method to construct a predictive dataset and develop a predictive model called LncSTPred. The model improves the Adaboost algorithm and predicts subcellular localization from 3-mer, 3-RF sequence, and minimum free energy structure features. Results and Discussion With our improved Adaboost algorithm, better prediction accuracy for lncRNA subcellular localization was obtained. In addition, we evaluated feature importance using the F-score and analyzed the influence of the most relevant features on lncRNAs. Our study shows that ANA-type triplet features may be a key factor for predicting lncRNA subcellular localization, correlating with the composition of stems and loops in lncRNA secondary structure.

In recent decades, several predictive models for lncRNA subcellular localization have been proposed. For instance, lncLocation incorporates sequential, physicochemical, and structural features in its model construction, leveraging the SVM algorithm together with binomial distribution and iterative feature selection techniques (Feng et al., 2020). In iLoc-lncRNA 2.0, the predictive model is built from 8-mer features refined by the mRMR method, which yields 1407 features that are subsequently submitted to the SVM algorithm for model construction (Zhang et al., 2022). DeepLncLoc transforms sequence information into a matrix using word2vec, then uses a CNN to construct the predictive model (Zeng et al., 2022).
In this paper, we present LncSTPred, an Adaboost-based model for predicting and interpreting lncRNA subcellular localization from primary RNA sequences in the RNALocate database. Our model supports five localization types, including nucleus, cytoplasm, cytosol, ribosome, and exosome, and accommodates both sequence and structure features. To address class imbalance, we employed Borderline-SMOTE, ADASYN, and UNCERTAIN WEIGHT on the training set. We enhanced Adaboost using maximum likelihood estimation and selected key features through the F-score. Subsequently, we conducted bioinformatics analyses of the sequence and structure distributions of these features. An overview of the research process is depicted in Figure 1.

Collection and preprocessing of dataset
The data used in this study relate to the subcellular localization of lncRNAs in mammals, including Homo sapiens and Mus musculus, and were extracted from the RNALocate database (https://www.rnalocate.org/download) (Cui et al., 2022). Researchers can download all raw lncRNAs from the "Download and API" page. We retained lncRNAs related to H. sapiens and M. musculus, and excluded lncRNAs without sequence annotation information and those with multiple subcellular localizations. Redundant sequences were then removed using the CD-HIT program (Huang et al., 2010; Li and Godzik, 2006) with a threshold of 80%. Subsequently, categories with sample sizes below 10 were excluded, resulting in 1342 lncRNAs across five distinct categories: 673 lncRNAs located in the nucleus, 407 in the cytoplasm, 152 associated with ribosomal localization, 94 within the cytosol, and 16 originating from exosomes, as depicted in Table 1.
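As an illustration of the final filtering step, the category-size threshold can be sketched as follows (the `records` tuple format here is hypothetical; the actual RNALocate download uses its own schema):

```python
from collections import Counter

def filter_rare_classes(records, min_size=10):
    """Drop localization categories with fewer than min_size sequences.

    records: list of (sequence, localization) tuples -- a hypothetical
    simplified format for illustration only.
    """
    counts = Counter(loc for _, loc in records)
    kept = [(seq, loc) for seq, loc in records if counts[loc] >= min_size]
    sizes = {loc: n for loc, n in counts.items() if n >= min_size}
    return kept, sizes

# Toy example: one category passes the threshold, one is dropped.
records = [("AUGC", "nucleus")] * 12 + [("GCGC", "mitochondrion")] * 3
kept, sizes = filter_rare_classes(records)
```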

Nucleotide composition features
The nucleotide compositional features of lncRNAs are significant features used to characterize the biological function of RNAs and their species. Sequences are represented by Equation 1. The traditional method of sequence feature extraction is the k-mer. Additionally, to extract more information from the sequences, we used reading frame (RF) features to further characterize them, following the methodology of Rainey and Repka (2013). For convenience of description, we defined the 3-RF by Equation 2.
where the sequence describes a lncRNA of length M base pairs and N_i denotes the type of nucleotide at position i. {N_1, N_2, N_3, ..., N_k} is called a k-mer, and there are 4^k combinations in total. 3-RF_x^y represents a trinucleotide read in one of the three reading frames with different starting points, where x ∈ {1, 2, 3} indicates the first, second, or third starting position, and y ∈ {1, 2, 3, ..., 64} indexes the 64 possible 3-mer combinations.
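Under one plausible reading of these definitions, the 3-mer and 3-RF features can be sketched in Python. Note that treating each reading frame as non-overlapping triplets is an assumption for illustration, not necessarily the paper's exact Equation 2:

```python
from itertools import product

BASES = "ACGU"
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 combinations

def kmer_freq(seq, k=3):
    """Sliding-window k-mer frequencies (4**k values)."""
    total = max(len(seq) - k + 1, 1)
    counts = {t: 0 for t in TRIPLETS}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return [counts[t] / total for t in TRIPLETS]

def rf_freq(seq, frame):
    """Non-overlapping triplet frequencies for reading frame 1, 2, or 3."""
    words = [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]
    total = max(len(words), 1)
    counts = {t: 0 for t in TRIPLETS}
    for w in words:
        if w in counts:
            counts[w] += 1
    return [counts[t] / total for t in TRIPLETS]

# 64 3-mer frequencies followed by 3 x 64 reading-frame frequencies
features = kmer_freq("AUAUGCA") + sum((rf_freq("AUAUGCA", f) for f in (1, 2, 3)), [])
```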

Minimum free energy
The formation of base pairs reduces the energy of an RNA molecule and makes the structure more stable. Therefore, based on the core idea of the minimum free energy (MFE) and the Zuker algorithm, we defined the global minimum of the overall energy as Equation 3 (Zuker and Stiegler, 1981; Zuker and Sankoff, 1984).
where α_ij represents the stacking energy when bases i and j are paired, and β_k, γ_k, ε_k, and δ_k describe the energies of the bulge loop, interior loop, multi-branched loop, and hairpin loop, respectively. In the actual calculation, Zuker's algorithm uses four free-energy functions and five dynamic-programming matrices. Recovering the minimum-free-energy structure of an RNA sequence is similar to the backtracking step of the Nussinov base-pair maximization algorithm.
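The full Zuker recursion is too involved for a short example, but the underlying dynamic-programming idea can be illustrated with the simpler Nussinov base-pair maximization mentioned above. This is a sketch of the pair-counting recursion only, not the energy model LncSTPred actually uses:

```python
def nussinov(seq, min_loop=3):
    """Nussinov DP: maximum number of nested base pairs.

    min_loop enforces a minimal hairpin-loop size; real MFE folding
    (Zuker) replaces the pair count with stacking and loop free energies.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # j left unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:        # j pairs with k
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0
```

For a simple hairpin such as GGGAAAACCC, the recursion recovers the three nested G-C pairs of the stem.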

Feature importance
The F-score is a simple and effective feature selection method that measures the discriminative power of features across categories (Xie et al., 2010). The F-score of the ith feature in a multi-classification problem is defined in Equation 4, where l denotes the total number of categories, n_j is the number of samples in the jth class, x̄_i and x̄_i^(j) are the average values of the ith feature over the entire dataset and over the jth data subset, respectively, and x_{k,i}^{(j)} is the kth observation of the ith feature in the jth class.
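A minimal implementation of this statistic, assuming the common multi-class F-score form (between-class scatter over within-class scatter); the paper's exact Equation 4 may normalize terms slightly differently:

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score: between-class over within-class scatter.

    X: (n_samples, n_features) array-like; y: class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    grand_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        num += (class_mean - grand_mean) ** 2
        den += ((Xc - class_mean) ** 2).sum(axis=0) / max(len(Xc) - 1, 1)
    # guard against zero within-class variance
    return num / np.where(den == 0, np.finfo(float).eps, den)
```

A feature that separates the classes well receives a much larger score than a constant or noisy feature.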

Sample balance process
Machine learning algorithms generally assume that positive and negative datasets are balanced; however, when the ratio of positive to negative samples exceeds 1:3, the results are affected. To address this imbalance, two main approaches can be employed. One is adjusting sample weights in the algorithm, such as using the class-weight parameter in the XGBoost algorithm to balance the categories. The other is oversampling the minority categories to equalize sample numbers across categories. In this study, the ratio of ribosome to nuclear samples exceeds 1:4, and the ratio of exosome to nuclear samples is more than 1:40, making it challenging to find a suitable class-weight to characterize this complex distribution. Therefore, we constructed a predictive dataset using two oversampling methods, Borderline-SMOTE (Han et al., 2005) and ADASYN (He et al., 2008), for sample balancing. By focusing on borderline instances, Borderline-SMOTE generally produces better classification results than SMOTE, particularly when the minority class is at high risk of misclassification; it also reduces the risk of overfitting by concentrating on the most informative samples. ADASYN focuses on the more difficult minority-class samples, ensuring that the classifier is better trained on challenging examples and thereby improving its robustness and generalizability. Finally, we filtered the oversampled data with the UNCERTAIN WEIGHT method (Kendall et al., 2018).
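Production code would typically call `BorderlineSMOTE` and `ADASYN` from the `imblearn` package. The interpolation idea shared by both can be sketched as follows; this is a deliberate simplification, not either algorithm in full:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style interpolation for one minority class.

    Picks a random minority sample, one of its k nearest minority
    neighbours, and a point on the segment between them. The real
    Borderline-SMOTE additionally restricts seeds to borderline
    samples, and ADASYN weights seeds by local classification
    difficulty.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the sample itself
        j = rng.choice(nbrs)
        lam = rng.random()                     # position on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because each synthetic point is a convex combination of two real minority samples, all generated points stay inside the minority class's bounding region.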

Predictive model algorithm
The Boosting strategy constructs a high-accuracy classifier by combining several base classifiers, each with only moderate accuracy. Adaboost exemplifies this strategy and is known for its high accuracy and its ability to model complex decision boundaries through nonlinear combination. In the Adaboost algorithm, each base classifier generates a predicted classification result and a self-correction factor that estimates the reliability of the classification (Schapire and Singer, 1999).
In the original binary classification problem solved by Adaboost, the coefficient α_t that corrects the base classifier during iteration t is given by Equation 5, α_t = (1/2) ln((1 − ε_t)/ε_t), where ε_t is the classification error rate of the base classifier on the training dataset. This coefficient ensures that, in each round, the classifier's accuracy is at least greater than random guessing, which is more than 1/2 in a binary classification problem. To extend this to a multi-classification problem, after ensuring the training data are balanced, the accuracy of the base classifier must be at least 1/k, where k is the number of categories; in this paper, k = 5. To ensure that each round prioritizes minimizing the classification error of the base classifier with the highest weight in the final classifier, we refined the coefficient as Equation 6. In multi-classification Adaboost, we then update the sample weights, decreasing the influence of samples the previous base classifier handled correctly: the weight update for correctly classified samples is given by Equation 7 and that for wrongly classified samples by Equation 8. To ensure the weights sum to 1, the weights ω_{t+1,i} of round t+1 on the training set are normalized after round t as shown in Equation 9.
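The paper's refined coefficient (its Equation 6) is not reproduced here. As a reference point, the standard SAMME multi-class extension implements the same 1/k-accuracy requirement: its coefficient stays positive exactly when the base classifier beats random guessing among k classes:

```python
import numpy as np

def samme_round(sample_w, y_true, y_pred, k):
    """One boosting round in the SAMME multi-class scheme (a standard
    reference extension, not LncSTPred's refined coefficient)."""
    wrong = (y_true != y_pred)
    eps = np.average(wrong, weights=sample_w)        # weighted error rate
    alpha = np.log((1 - eps) / eps) + np.log(k - 1)  # > 0 iff acc > 1/k
    new_w = sample_w * np.exp(alpha * wrong)         # up-weight mistakes
    return alpha, new_w / new_w.sum()                # renormalize to 1
```

With four uniformly weighted samples and one mistake among k = 3 classes, ε_t = 0.25 and α_t = ln 2 + ln 2 · ... = ln 6, and the misclassified sample's weight rises from 1/4 to 2/3 after renormalization.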

Performance evaluation
We use the Specificity (Sp), Sensitivity (Sn), Accuracy (ACC), and Matthews Correlation Coefficient (MCC) to measure the performance of the predictive model. The evaluation indicators can be written as Equation 10: Sn = TP/(TP + FN), Sp = TN/(TN + FP), ACC = (TP + TN)/(TP + FN + FP + TN), and MCC = (TP × TN − FP × FN)/sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). In this classification context, TP represents the number of correctly recognized positives, FN the number of positives recognized as negatives, FP the number of negatives recognized as positives, and TN the number of correctly recognized negatives. Additionally, the ROC (Receiver Operating Characteristic) curve and the area under it (AUC) are used to assess overall discriminative ability.

Results

Description of predictive dataset
In our original dataset, the ratios of ribosome, cytosol, and exosome samples to nucleus samples exceeded 1:3; the distribution of the original dataset is shown in Figure 2A. To address this imbalance, we oversampled the cytoplasm, ribosome, and exosome categories. The balanced dataset was analyzed using K-means clustering on all features, including k-mer, k-RF, and MFE. Figure 2B illustrates the K-means clustering results after data balancing: clusters 3 and 5 corresponded to nuclear and cytoplasmic localization, while clusters 1, 2, and 4 represented ribosome, cytosol, and exosome localizations, respectively. After preprocessing, which involved removing outliers and de-linearizing the dataset, the revised K-means clustering results are shown in Figure 2C. In this updated dataset, clusters 1 to 5 denoted ribosome, cytoplasm, exosome, cytosol, and nuclear localizations, respectively; the processing of the dataset is summarized in Table 2.
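As a minimal stand-in for the clustering step, plain Lloyd's k-means can be sketched as follows; the paper's exact implementation, feature scaling, and PCA projection are not specified, so this is an illustration only:

```python
import numpy as np

def kmeans(X, k, n_iter=50, rng=0):
    """Plain Lloyd's k-means: alternate assignment and center updates."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # initialize centers from k distinct random samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

On well-separated data, the assignments recover the underlying groups regardless of the random initialization.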

Predictive modelling process
The predicted results, both before and after sample balancing, are shown in Table 3. From Table 3, we found that exosomes, cytosols, and ribosomes were well identified in the new predictive dataset. Notably, despite the limited data for these categories, they exhibited improved predictive results compared to the model trained without oversampling. This suggests that oversampling may enhance the model's ability to capture more biological characteristics, thus improving prediction accuracy, which aligns with the concept of biological diversity. Furthermore, ROC curve analysis was conducted, and the results are shown in Figure 2D: the AUC values for exosome, ribosome, cytoplasm, cytosol, and nucleus were 0.91, 0.89, 0.99, 0.93, and 0.77, respectively. These predictive outcomes indicate that the model has commendable generalization ability across diverse subcellular localization categories.
To identify which feature categories most influence the predictive model's results, we conducted separate experiments on 3-mer, 3-RF, MFE, and various combinations of these features. Table 4 shows the predictive results of the different feature combinations in the LncSTPred model. We observed that the model's performance using sequential features alone closely resembles that of other theoretical models; however, a substantial improvement in efficacy was achieved by adding the MFE feature.

Comparison with other researchers' methods
Over the past few decades, numerous lncRNA predictive models have emerged. In this study, we conducted a comparative analysis between LncSTPred and existing theoretical algorithms, as well as other scholars' predictive models, to assess LncSTPred's performance. Our comparison included Support Vector Machine (SVM), Random Forest, and XGBoost, with results presented in Table 5. MCC is a comprehensive performance metric that effectively reflects the classification ability of a model, especially on unbalanced datasets. Although both XGBoost and LncSTPred employ a boosting integration strategy, LncSTPred displayed higher accuracy and generalization ability than XGBoost with class-weight parameters when predicting lncRNA subcellular localization. This improvement is attributed to optimized error recognition and the adjustment of error-recognition-point weights within LncSTPred. We also compared LncSTPred with machine learning-based models, specifically iLoc-lncRNA 2.0 and LightGBM-LNCLOC, using the datasets of these models for performance validation (Table 6). Both iLoc-lncRNA 2.0 and LncSTPred demonstrated high precision in the ribosome and exosome categories, with superior sensitivity for predicted cytoplasmic and nuclear localizations. In contrast, lncLocation, which did not use oversampling, showed comparable accuracy in the nuclear and cytoplasmic categories but lower performance for ribosomal and exosomal predictions. This indicates that oversampling enhances lncRNA subcellular localization prediction, particularly the sensitivity of the cytoplasmic category compared with lncLocator. The analysis across classification datasets highlights LncSTPred's robust adaptability to various datasets, affirming its reliability as a classification model.

Analysis of feature importance
We computed the feature importance in LncSTPred using the F-score described in Section 2.4, and the results are depicted in Figure 3. The top 10 combinations were "ACA" and "AUA" in the third RF; "AUA", "AGA", and "UCA" in the second RF; "GGU" in the first RF; and "AUA", "UAU", "CGG", and "GGC" in the 3-mer features. The triplex nucleotide pattern "ANA" had a significant impact on predictive modeling. To explore the potential effects of these 10 triplexes on the subcellular localization of lncRNAs, we analyzed their distribution in lncRNA sequences and in secondary-structure substructures.
Figure 4A displays the frequency of triplex nucleotides in the original dataset. Interestingly, the frequencies of ACA and AGA exceeded random probability, whereas those of AUA and UAU fell below random probability in the lncRNA sequences.
We used the RNAfold software to predict lncRNA secondary structures (Gruber et al., 2008). The output is in dot-bracket format, where '...' denotes loop structures, '(((' or ')))' represent stem structures, and mixed windows such as '..(', '..)', ')..', or '(..' indicate junction structures. Figure 4B displays the three distinct substructures observed in the secondary structure. A frequency analysis of these substructure types was conducted, and the results are presented in Table 7, which revealed specific enrichments: ACA in junctions, and AGA and TAT in stem structures. Although the frequency of ATA in the predictive dataset was lower than that of random combinations, ATA, ACA, and AGA predominantly appeared at critical positions within the junction and stem substructures of the lncRNA secondary structure. These findings suggest a significant contributory role of ANA triplets in the construction of RNA secondary structures.
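The substructure tally described above can be sketched directly from a dot-bracket string. This simplified classification assumes any mixed three-character window counts as a junction, mirroring the three substructure types in the text:

```python
def classify_triplets(seq, db):
    """Tally which substructure each trinucleotide falls in.

    seq and db (a dot-bracket string, e.g. from RNAfold) must be the
    same length. A window of three dots is a loop, three identical
    brackets a stem, and any mixed window a junction.
    """
    tally = {"loop": {}, "stem": {}, "junction": {}}
    for i in range(len(seq) - 2):
        word, struct = seq[i:i + 3], db[i:i + 3]
        if struct == "...":
            kind = "loop"
        elif struct in ("(((", ")))"):
            kind = "stem"
        else:
            kind = "junction"
        tally[kind][word] = tally[kind].get(word, 0) + 1
    return tally

# A hairpin: GGG/CCC stems enclosing an AAAA loop.
tally = classify_triplets("GGGAAAACCC", "(((....)))")
```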

Discussion
In recent years, the recognition of lncRNA subcellular localization has garnered increasing attention, as researchers have realized its potential for discovering lncRNA function. In this paper, we propose an improved algorithm for predicting lncRNA subcellular localization, called LncSTPred.
During the establishment of the predictive model, we recognized the significant impact of the predictive dataset on the results. To address this, we utilized three sample-balancing methods, Borderline-SMOTE, ADASYN, and UNCERTAIN WEIGHT, to balance the sample sets. This ensured that the predictive model could be sufficiently trained on each category of samples to achieve the best predictive results. Constructing LncSTPred with the improved Adaboost algorithm, we achieved 94.14% accuracy on the 5-categorical dataset and 90.76% accuracy on Lin's 4-categorical dataset. This demonstrates that our improvements to the Adaboost algorithm, combined with data balancing, can provide better results than using the class-weight parameter in other algorithms.
Several studies have explored the impact of specific nucleotide combinations on RNA secondary structure. For example, the predictions of Smith et al. accurately identified mascRNA and a conserved hairpin upstream of Evolutionarily Conserved Structures (ECS), and they observed that "ANA" triplex nucleotides predominantly appear at stem-loop junctions in ECS (Smith et al., 2013). Novikova et al., through biochemical probing, delineated a complex two-dimensional structure comprising distinct subdomains, including helical segments, terminal loops, internal loops, and linker regions; their study underscores that purine-rich sequences are highly conserved and often situated in single-stranded regions such as terminal and internal loops (Novikova et al., 2012). These findings corroborate the involvement of "ANA" triplex nucleotide composition in lncRNA secondary structure. Additionally, we performed a quantitative analysis of feature importance to identify the most significant features. By analyzing the frequency of triplex nucleotides and the stem-loop structures of lncRNAs, we aimed to understand the relationship between significant features and lncRNA subcellular localization. Our analysis revealed a bias in the frequency of ANA nucleotide combinations within triplex nucleotides and in the substructural frequency of stem-loop structures. These findings suggest that ANA nucleotide combinations play key roles in the composition of lncRNA secondary structures. In previous studies, Constanty et al. found that conserved U-rich and A-rich motifs were associated with specific processing and localization functions of lncRNAs such as NEAT1 and MALAT1 (Constanty and Shkumatava, 2021). Furthermore, Cai et al. provided evidence, by analyzing various sequences, that specific triplexes including ACA, ATA, and AGA significantly influence localization patterns (Cai et al., 2023). Lyu et al. emphasized the relevance of trinucleotide propensity and position-specific features in recognizing lncRNA subcellular localization, demonstrating that specific triplexes like UAU could play a role in these predictions (Lyu et al., 2023).
Although LncSTPred has achieved good results in predicting lncRNA subcellular localization, some challenges remain. On the one hand, the training process of AdaBoost is relatively complex, leading to significantly higher computing time compared to other prediction models. On the other hand, LncSTPred currently predicts only lncRNAs localized to a single subcellular compartment, whereas many lncRNAs are localized to multiple compartments. Therefore, future work will focus on further enhancing the accuracy of lncRNA subcellular localization prediction and on predicting multi-localized lncRNAs using deep learning algorithms.

FIGURE 2
FIGURE 2 Demonstration of sample distribution at various stages of the balancing process and of the predictive model results. (A) The distribution of the original dataset through K-means clustering; the X-axis and Y-axis are projections of the data onto the leading dimensions of a reduced-dimensional space obtained with PCA (Principal Component Analysis). (B) The distribution of the balanced dataset through K-means clustering. (C) The distribution of the predictive dataset through K-means clustering. (D) ROC curves of all categories and the averaged case.

FIGURE 3
FIGURE 3 Top 10 features of the LncSTPred model.

FIGURE 4
FIGURE 4 Biological analysis of the 10 most important features. (A) Frequency distribution of triplex nucleotides in lncRNA sequences; the X-axis represents the 64 types of triplex nucleotides, and the Y-axis represents their frequencies. (B) Three types of substructures in the secondary structure of lncRNAs.

TABLE 1
Number of lncRNA in each subcellular localization.

TABLE 3
The predictive results of each subcellular localization.

TABLE 6
Comparison with previous state-of-the-art methods.