PAI: Predicting adenosine to inosine editing sites by using pseudo nucleotide compositions

Adenosine-to-inosine (A-to-I) editing is the most prevalent kind of RNA editing and is involved in many biological processes. Accurate identification of A-to-I editing sites is invaluable for better understanding their biological functions. Because experimental methods have limitations, the present study proposes a support vector machine based model, called PAI, to identify A-to-I editing sites in D. melanogaster. In this model, RNA sequences are encoded by “pseudo dinucleotide composition”, into which six RNA physicochemical properties are incorporated. PAI achieves promising performance in the jackknife test and the independent dataset test, indicating that it has high potential to become a useful tool for identifying A-to-I editing sites. For the convenience of experimental scientists, a web-server was constructed for PAI and is freely accessible at http://lin.uestc.edu.cn/server/PAI.

Comparison with other classifiers. Since no freely accessible predictor or webserver is available for identifying A-to-I editing sites, PAI could not be compared with existing counterparts in this study. To assess its performance, we instead compared the predictive results of PAI with those of other commonly used classifiers, i.e., Naïve Bayes, BayesNet and J48 Tree, as implemented in WEKA 15 . The jackknife test results of the different classifiers for identifying A-to-I editing sites in the benchmark dataset are reported in Table 1.
The sensitivity, specificity, accuracy and MCC of PAI are all higher than those of the other three classifiers. These results suggest that the proposed SVM based model can be effectively used to identify A-to-I editing sites.
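The jackknife test referred to above is a leave-one-out procedure: each sample is held out in turn, the classifier is trained on the remaining samples, and the held-out sample is then predicted. The Python sketch below illustrates the loop only. The function names, the toy sequences and the nearest-neighbour stand-in classifier are assumptions made for illustration; they are not the classifiers compared in Table 1.

```python
from typing import Callable, List, Sequence, Tuple

# (sequence, label): label 1 = editing site, 0 = non-editing site
Sample = Tuple[str, int]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(train: Sequence[Sample], query: str) -> int:
    """Toy stand-in classifier (hypothetical): label of the closest training sequence."""
    return min(train, key=lambda s: hamming(s[0], query))[1]


def jackknife(samples: Sequence[Sample],
              classify: Callable[[Sequence[Sample], str], int]) -> List[int]:
    """Hold each sample out once, train on the rest, and predict the held-out one."""
    predictions = []
    for i, (seq, _) in enumerate(samples):
        train = list(samples[:i]) + list(samples[i + 1:])
        predictions.append(classify(train, seq))
    return predictions


# Made-up toy data, far shorter than the 51-nt benchmark sequences.
toy = [("AAAAA", 1), ("AAAAC", 1), ("GGGGG", 0), ("GGGGU", 0)]
preds = jackknife(toy, nearest_neighbour)
print(preds)
```

Any classifier with the same call signature, for example a wrapper around an SVM, could be passed in place of `nearest_neighbour`.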

Webserver.
To enable applications of the proposed method and for the convenience of the scientific community, a freely accessible online webserver was established. The user guide is as follows.
Step 1. Open the web server at http://lin.uestc.edu.cn/server/PAI, and the top page of PAI will be shown as in Fig. 3.
Step 2. Either type or copy/paste the query RNA sequences into the input box at the center of Fig. 3.
Step 3. Click on the 'Submit' button to see the predicted result. For example, if the query RNA sequences in the 'Example' window are used as the input, the outcome is as follows: the adenosines (A) at position 26 in all four query sequences can be edited to inosine (I). These results are consistent with the experimental observations.

Conclusions
RNA-seq analyses have demonstrated that A-to-I editing is associated with a number of key biological processes and plays important roles ranging from codon changes to the regulation of mRNA splicing. Therefore, genome-wide detection of A-to-I editing sites will facilitate our understanding of its biological functions.
In the present study, we proposed a support vector machine based model for predicting A-to-I editing sites using pseudo dinucleotide composition. The model is promising, as reflected by the high success rates obtained in the rigorous jackknife test and the independent dataset test.
For the convenience of researchers in the scientific community, a web-server for the proposed model, called PAI, is provided. We hope that it will provide novel insights into the distribution and function of A-to-I editing. As the current method is only applicable to D. melanogaster, future work will extend it to other species once high-quality experimental data suitable for training the model become available.

Materials and Methods
Dataset. The benchmark dataset used to train and test the proposed method was built based on the work of Laurent et al. 14 . Using a single-molecule sequencing method, they sequenced the RNAs and DNAs of wild-type D. melanogaster and the RNAs of ADAR-deficient D. melanogaster, and obtained a training dataset of 127 A-to-I editing site-containing sequences and 127 non-A-to-I editing site-containing sequences. After removing the redundant samples in their dataset, we obtained a benchmark dataset of 125 A-to-I editing site-containing sequences and 119 non-A-to-I editing site-containing sequences. Preliminary trials showed that the predictive results were most promising when the sequences in the benchmark dataset were 51 nt long, with the A that can be edited to inosine at the center. Accordingly, all the sequences in the training dataset are 51 nt long and are available at http://lin.uestc.edu.cn/server/PAI.
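As an illustration of the window construction described above, the following Python sketch extracts 51-nt windows centred on each adenosine of a longer transcript. The function name, the `FLANK` constant, the T-to-U conversion and the demo sequence are assumptions made for this example; the actual benchmark sequences are the ones provided on the PAI server.

```python
from typing import List, Tuple

FLANK = 25  # 25 nt on each side of the central A gives a 51-nt window


def windows_around_adenosine(transcript: str, flank: int = FLANK) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every A that has a full flank on both sides."""
    # Assumption: input may be written as DNA, so T is converted to U.
    seq = transcript.upper().replace("T", "U")
    hits = []
    for i in range(flank, len(seq) - flank):
        if seq[i] == "A":
            hits.append((i + 1, seq[i - flank:i + flank + 1]))
    return hits


# Made-up demo transcript: one A with full flanks and one A too close to the 3' end.
demo = "G" * 25 + "A" + "C" * 25 + "A"
hits = windows_around_adenosine(demo)
for position, window in hits:
    print(position, len(window), window[FLANK])
```

Adenosines closer than 25 nt to either end are skipped here rather than padded; whether the original dataset handled such cases differently is not stated in the text.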
To further verify the power of the proposed method, we also built an independent dataset by harvesting the A-to-I editing site-containing sequences of D. melanogaster from the work of Yu and colleagues 16 . After removing the sequences with more than 75% sequence similarity using CD-HIT 17 , we obtained 300 A-to-I editing site-containing sequences. These sequences are also 51 nt long and are available at http://lin.uestc.edu.cn/server/PAI.

Pseudo nucleotide composition. In order to include the global sequence order information, the pseudo nucleotide composition was proposed to represent genomic sequences 18 . Since its introduction, pseudo nucleotide composition has been successfully applied in many branches of computational genomics [19][20][21][22] . Owing to its excellent performance, a series of flexible web-servers were developed to generate pseudo nucleotide compositions [23][24][25][26] . Therefore, in the current work, the pseudo nucleotide composition was also used to represent RNA samples. A brief description of how to encode RNA sequences using pseudo nucleotide composition is given below. For more details of pseudo nucleotide composition, see a recent review 27 .
Suppose an RNA sequence with $L$ nucleic acid residues; its pseudo nucleotide composition can be defined as

$$\mathbf{D} = [d_1, d_2, \ldots, d_{4^k}, d_{4^k+1}, \ldots, d_{4^k+\lambda}]^{\mathrm{T}} \tag{1}$$

where

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{4^k} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le 4^k \\[2ex] \dfrac{w\,\theta_{u-4^k}}{\sum_{i=1}^{4^k} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 4^k < u \le 4^k+\lambda \end{cases} \tag{2}$$

In Eq. 2, $f_u$ ($u = 1, 2, \ldots, 4^k$) is the normalized occurrence frequency of the non-overlapping k-tuple nucleotides in the RNA sequence, $\lambda$ is the number of the total counted ranks of the correlations along an RNA sequence, and $w$ is the weight factor. It is through the $\lambda$ correlation factors that not only considerable global sequence-order effects can be incorporated but the RNA sequences in the benchmark dataset with extreme difference in length can also be converted into a set of feature vectors with the same dimension. The correlation factor $\theta_j$ represents the j-tier structural correlation factor between all the j-th most contiguous k-tuple nucleotides $T_i = R_i R_{i+1} \ldots R_{i+k-1}$ and is defined as

$$\theta_j = \frac{1}{L-k-j+1}\sum_{i=1}^{L-k-j+1}\Theta(T_i, T_{i+j}), \quad j = 1, 2, \ldots, \lambda \tag{3}$$

For example, $\theta_1$ is called the first-tier correlation factor that reflects the sequence order correlation between all the most contiguous k-tuple nucleotides along an RNA sequence; $\theta_2$, the second-tier correlation factor between all the second most contiguous k-tuple nucleotides; $\theta_3$, the third-tier correlation factor between all the third most contiguous k-tuple nucleotides; and so forth. The correlation function $\Theta(T_i, T_j)$ is given by

$$\Theta(T_i, T_j) = \frac{1}{v}\sum_{u=1}^{v}\left[P_u(T_i) - P_u(T_j)\right]^2 \tag{4}$$

where $v$ is the number of RNA physicochemical properties, $P_u(T_i)$ is the numerical value of the u-th ($u = 1, 2, \ldots, 6$) property for the dinucleotide $T_i$ at position $i$, and $P_u(T_j)$ is the corresponding value for the dinucleotide $T_j$ at position $j$.
Before substituting them into Eq. 4, all the original values $P_u(T_i)$ ($u = 1, 2, \ldots, 6$) were subjected to a standard conversion as described by the following equation,

$$P_u(T_i) \Leftarrow \frac{P_u(T_i) - \langle P_u \rangle}{\mathrm{SD}(P_u)} \tag{5}$$

where the symbol $\langle\ \rangle$ means taking the average of the quantity therein over the 16 different dinucleotides, and SD means the corresponding standard deviation. The converted values obtained by Eq. 5 will have a zero mean value over the 16 different dinucleotides.
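The encoding described above can be sketched in Python for the dinucleotide case (k = 2). This is a minimal illustration under stated assumptions, not the authors' implementation: the property values are random placeholders (the real values are those of Table 2), the choices λ = 3 and w = 0.5 are arbitrary, the population standard deviation is assumed for the standard conversion, and dinucleotides are read with a sliding window of step 1 to match the $T_i$ indexing of the correlation factors. All function names are hypothetical.

```python
import itertools
import math
import random
from typing import Dict, List

DINUCS = ["".join(p) for p in itertools.product("ACGU", repeat=2)]  # 16 dinucleotides


def standardize(raw: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Standard conversion (Eq. 5): zero mean and unit SD over the 16 dinucleotides."""
    n_props = len(raw[DINUCS[0]])
    out: Dict[str, List[float]] = {d: [] for d in DINUCS}
    for u in range(n_props):
        column = [raw[d][u] for d in DINUCS]
        mean = sum(column) / len(column)
        # Assumption: population SD (divide by 16, not 15).
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        for d in DINUCS:
            out[d].append((raw[d][u] - mean) / sd)
    return out


def dinucleotides(seq: str) -> List[str]:
    """All dinucleotides T_i = R_i R_(i+1), read with a sliding window of step 1."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def theta(tuples: List[str], props: Dict[str, List[float]], j: int) -> float:
    """j-tier correlation factor: mean correlation over the L - j - 1 pairs (T_i, T_(i+j))."""
    n_pairs = len(tuples) - j
    total = 0.0
    for i in range(n_pairs):
        a, b = props[tuples[i]], props[tuples[i + j]]
        # Correlation function (Eq. 4): mean squared difference over the v properties.
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total / n_pairs


def psednc(seq: str, props: Dict[str, List[float]], lam: int, w: float) -> List[float]:
    """Pseudo dinucleotide composition: 16 frequency terms followed by lam correlation terms."""
    tuples = dinucleotides(seq)
    freqs = [tuples.count(d) / len(tuples) for d in DINUCS]
    thetas = [theta(tuples, props, j) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Placeholder property table: 6 random values per dinucleotide, NOT the values of Table 2.
rng = random.Random(0)
props = standardize({d: [rng.random() for _ in range(6)] for d in DINUCS})

demo = ("GCUUCGGAC" * 6)[:51]  # made-up 51-nt sequence with an A at the centre
vec = psednc(demo, props, lam=3, w=0.5)
print(len(vec), round(sum(vec), 6))
```

With these settings the vector has 16 + 3 = 19 components, and they sum to 1 because every component shares the same denominator.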

RNA physicochemical properties.
It has been reported that A-to-I editing is correlated with RNA structures 2 . Since RNA structure is determined by the complex pattern of base-base interactions [28][29][30][31] , six RNA local structural properties were used to define the pseudo nucleotide composition, of which three are local translational parameters (Shift, Slide, Rise) and the other three are local angular parameters (Twist, Tilt, Roll). The detailed values for the six local structural property parameters are given in Table 2. Therefore, k is equal to 2, meaning that the pseudo dinucleotide composition (PseDNC) was used, and v is equal to 6, reflecting the number of RNA physicochemical properties considered.

Support Vector Machine.
As a supervised machine learning algorithm, the support vector machine (SVM) has been widely employed to build classifiers in computational genomics and proteomics [32][33][34][35][36] . Its basic idea is to transform the input data into a high-dimensional feature space and then determine the optimal separating hyperplane. In the current study, the LibSVM package 3.18 (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) was used to perform the predictions. The radial basis function (RBF) was chosen as the kernel of the SVM, and the regularization parameter C and kernel parameter γ were optimized using a grid search approach.

Performance evaluation. The performance of the proposed method was evaluated using four widely used metrics, namely sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC), which are expressed as

$$\begin{aligned} \mathrm{Sn} &= \frac{TP}{TP + FN} \times 100\% \\ \mathrm{Sp} &= \frac{TN}{TN + FP} \times 100\% \\ \mathrm{Acc} &= \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \\ \mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \end{aligned}$$

where TP represents the number of correctly recognized A-to-I editing site-containing sequences, TN represents the number of correctly recognized non-A-to-I editing site-containing sequences, FP represents the number of non-A-to-I editing site-containing sequences recognized as A-to-I editing site-containing sequences, and FN represents the number of A-to-I editing site-containing sequences recognized as non-A-to-I editing site-containing sequences.
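The four metrics can be computed directly from the confusion counts, as in the short Python sketch below. The function name and the example counts are made up for illustration and are not taken from Table 1. Returning 0 for MCC when the denominator vanishes is a convention assumed here, not something stated in the text.

```python
import math
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sn, Sp and Acc in percent, and MCC in [-1, 1], from confusion counts."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # Assumed convention: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts chosen only to exercise the formulas.
scores = evaluate(tp=8, tn=6, fp=4, fn=2)
print(scores)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative sets differ in size.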