Prediction of High-Risk Types of Human Papillomaviruses Using Reduced Amino Acid Modes

A human papillomavirus type plays an important role in the early diagnosis of cervical cancer. Most of the prediction methods use protein sequence and structure information, but the reduced amino acid modes have not been used until now. In this paper, we introduced the modes of reduced amino acids to predict high-risk HPV. We first reduced 20 amino acids into several nonoverlapping groups and calculated their structure and physicochemical modes for high-risk HPV prediction, which was tested and compared with the existing methods on 68 samples of known HPV types. The experiment result indicates that the proposed method achieved better performance with an accuracy of 96.49%, indicating that the reduced amino acid modes might be used to improve the prediction of high-risk HPV types.


Introduction
Cervical cancer is a cancer with a higher morbidity and mortality rate among women worldwide [1]. There are about 500,000 new cases of cervical cancer each year, with 280,000 deaths [2], which has become the second largest female cancer [3,4]. Studies have indicated that human papillomavirus (HPV) infection is closely related to the occurrence and development of cervical cancer, and certain types of HPV cause abnormal tissue growth in the form of papilloma [5][6][7].
Human papillomavirus belongs to the papillomavirus family. It is an icosahedral, uncoated particle composed of double-stranded DNA of approximately 8,000 nucleotide base pairs [8,9]. The circular DNA is about 55 nm in diameter [10][11][12][13]. To date, there are more than 150 types of human papillomavirus (HPV), and some new HPV types will be found when there are significant homologous differences between some new HPV types and defined HPV types [14][15][16]. Epidemiological studies have shown a strong correlation between genital HPV and cervical cancer. Genital HPV can be divided into three types according to its relative malignancy: low-risk type, intermediate-risk type, and high-risk type. The clinical association studies usually use two types of HPV: high-risk and low-risk. Low-risk types are associated with low-grade lesions, while high-risk viral types are more closely related to high-grade cervical lesions and cancer [17]. High-risk types included HPV-16, HPV-18, HPV-26, HPV-31, HPV-33, HPV-35, HPV-39, HPV-45, HPV-51-53, HPV-56, HPV-58, HPV-59, HPV-66, HPV-68, HPV-70, HPV-73, HPV-82, and HPV-85 [18]. HPV-16 and HPV-18 accounted for 62.6% and 15.7% of cervical cancers [19], respectively. Therefore, the identification of high-risk HPV has become an important part of the diagnosis and treatment of cervical cancer.
Up to now, many epidemiological and experimental methods can identify HPV types [5,[20][21][22], mainly using polymerase chain reaction (PCR) technology, and be applied to rapid detection of clinical samples. With the rapid growth of human papillomavirus (HPV) data and sensitivity requirements, we need a reliable and effective calculation method to predict the high-risk types of HPV directly.
In recent years, several computational models have been proposed to predict high-risk HPV types. Eom et al. studied the sequence fragments and introduced genetic algorithms to predict the HPV types [23]. Joung et al. used support vector machines to predict the HPV types based on the hidden Markov model [24,25]. Park et al. proposed to use decision trees to predict human papillomavirus types [26]. Kim and Zhang calculated the distance of amino acid pairs and further predict the risk types of HPV based on E6 proteins [7,9]. Kim et al. proposed a set of support vector machines (GSVM) for the classification of HPV types using the differential molecular sequence of protein secondary structure [13]. Esmaeili et al. used ROC to classify HPV types based on Chou's pseudo amino acid composition [27]. Alemi et al. compared the physicochemical properties between the highand low-risk HPV types, and they used support vector machines to predict the high-risk HPV types [28].
These methods have performed well in the prediction of high-risk HPV types, but the challenge of extracting HPV information remains. The information widely used in the prediction of high-risk types of HPV is based on sequence information, but the information limited to the characteristics of 20 AAs and their reduction groups has not been explored so far. In this paper, we proposed a novel method to predict high-risk types of HPVs based on the reduced amino acid modes. We classified 20 amino acids into several groups and extract their structure and chemical properties. These extracted features were used to predict the high-risk type of HPVs based on a support vector machine. Through some experiments and comparative analysis, we want to evaluate the efficiency of the proposed method, as well as the efficiency of various reduced amino acid modes.

Datasets.
There are eight open reading frames that encode early and late genes of the HPVs [11]. The early and late genes have polyA signal 1 and polyA signal 2. The produce of the late genes are L1 and L2 proteins which affect the viral capsid structure [12], while early genes are transformed into E1-E7 proteins. We constructed seven protein databases of the HPVs whose sequences are downloaded from the Los Alamos National Laboratory (LANL). Each protein has 72 HPV types. If a certain type of protein lacks the sequences of HPVs, we downloaded the missing sequence from the National Biotechnology Information Center. Since the E4 protein cannot be found in the National Biotechnology Information Center, its total number is 71. According to an HPV compendium, seventeen HPV types are classified as high-risk types (HPV-16, HPV-18, HPV-31, HPV-33, HPV-35, HPV-39, HPV-45, HPV-51, HPV-52, HPV-56, HPV-58, HPV-59, HPV-61, HPV-66, HPV-67, HPV-68, and HPV-72), and the remaining is low-risk type [13].

Reduced
Amino Acids (RedAAs). 20 amino acids have subtle differences, but some of them have similar basic structures and functions. AAindex is a database of physical and biochemical indicators of amino acids established by Tomii and Kanehisa [29]. It mainly includes three parts: AAindex 1, AAindex 2, and AAindex 3. AAindex 1 is a database that describes the physicochemical and biological properties of amino acids. AAindex 2 is the matrix of amino acid muta-tion, and AAindex 3 is the protein contact potential statistics. These data are from published articles. We mainly used AAindex 1 to calculate the correlation coefficient as the distance between the two indicators. AAindex 1 currently contains 544 indexes, and this article selected 522 indexes. These 522 characteristics are further divided into 7 categories: (A)-alpha and turn propensities, (B)-beta propensity, (C)-composition, (H)-hydrophobicity, (P)-physicochemical properties, and (O)-other properties [29].
Here, we introduced BLOSUM62 to classify amino acids to simplify sequence analysis [30]. We denote the ith group as X i and denote its jth amino acid as X i ðjÞ. Using BLOSUM62, we calculated the similarity score SðX i ðjÞ, R k Þ between X i ðjÞ and the kth amino acid R k as follows: where BlosumðX i ðjÞ, R k Þ denotes the substitution value between X i ðjÞ and R k . Then, we summed up all scores of different groups as the score between Seq s and Seq 0 : where g 0 ðiÞ is the ith group size of Seq 0 , g s ðiÞ is the ith group size of Seq s , m i ðkÞ is the total number of R k occurrences in Seq 0 , and N is the group size. S measures the degree of retention of parent sequence information. Given a size N group, we analyzed all amino acid groups and calculated the similarity score between the parent sequence and the reduced sequence. 2.3. Reduced Amino Acid Modes (RedAA Modes). 20 amino acids were divided into the following nonoverlapping groups according to their physicochemical properties in AAindex, and four types of the reduced amino acid modes were calculated as protein structural and physicochemical features.

Content Modes.
The first mode is associated with the content-specific features, including the distribution of the RedAA and RedAA pattern in protein sequences.
(1) K-mer. Protein sequences and peptides can be seen as a collection of symbols, and their characteristics can be analyzed by the frequency of their small fragments. k-mers are k consecutive characters in reduced proteins, and a sliding window of length m can be used to calculate their frequencies [31][32][33], moving from position 1 to m − k + 1 with one base at a time. It allows the overlaps of the k-mers and is calculated as where Count w RedAA is the occurrence number of the k-mer w RedAA and R is k-mer set of the RedAAs.
(2) RCTD. "Composition (C)," "Transition (T)," and "Distribution (D)" are three descriptors of RedAAs, which are defined as follows [34,35]: Composition: it can be regarded as a single monomer of the reduced sequence, and the sequence components are described by calculating the percentage of each RedAA.
Transition: it can be used as the conversion of RedAA I and A by calculating the frequency of I followed by A: where Count IA and Count AI are the "IA" and "AI" numbers, respectively, in the reduced sequence with length N.
Type I PRseAAC was proposed by Kuo-Chen Chou, which is defined as follows: where f i is the RedAA frequency and w is the weighting factor. θ i is calculated as where H i ðRedAAÞ is the RedAAs' property and R is the RedAA size. Type II PRseAAC can be calculated as where f i is the RedAA frequency, w is the weighting factor, SH i ðRedAAÞ is the RedAAs' property, R is the RedAA size, and N is the sequence length.

Correlation
Mode. The second RedAA mode is based on the characteristics of correlation, which describes the correlation among the RedAAs. In the proposed RedAA mode, three different autocorrelation features are implemented: normalized Moreau-Broto autocorrelation (NMB) [39], Moran autocorrelation (M) [40], and Geary autocorrelation (G) [41].
(1) NMB. The RedAA NMB is defined as where P RedAA i denotes the RedAA property at position i of the sequence, d is the autocorrelation lag, and N is the sequence length.
(2) M. The RedAA M can be calculated as where P RedAA i denotes the RedAA property at position i of the sequence, d is the autocorrelation lag, and N is the sequence length.
(3) G. The RedAA G is defined as where P RedAA i denotes the RedAA property at position i of the sequence, d is the autocorrelation lag, and N is the sequence length.

Order Mode.
The order mode reflects the physical and chemical interaction among the RedAA pairs. There are two kinds of order modes: sequence coupling score and quasisequence score [42].
(1) Sequence Coupling Score. The sequence coupling score is calculated: where d RedAA is the Schneider-Wrede physicochemical distance or Grantham chemical distance between the RedAAs at positions i and i + d and 1 ≤ d ≤ N.
(2) Quasi-Sequence Score. The quasi-sequence score of the RedAA is defined: where f RAA i is the RedAA frequency and w denotes the weighting factor.
The quasi-sequence score can be calculated as where τ is the sequence coupling score, f RAA i is the RedAA frequency, and w denotes the weighting factor.

Position
Mode. The position mode represents the distribution of RedAA positions of protein sequences based on the coefficient of variations [32,43]. First, we converted the protein sequence into a digital sequence NðRedAAÞ and calculated the probabilities P RedAA ðξÞ of the separation distance ζ between two adjacent RedAAs. The mean E ðRedAAÞ ðξÞ and variance D ðRedAAÞ ðξÞ are defined: We then calculated the positional information C ðRedAAÞ ðξÞ: where C ðRedAAÞ ðξÞ is the reciprocal of the coefficient of variation (CV) which compares the degree of change between two datasets, even if there are large differences between their means. In this paper, it was denoted as the RedAA position characteristics.

Prediction
Algorithm. Y = ½y 1 , y 2 , ⋯, y n T is an HPV label set, y i = 1 is from the high-risk type, and y i = 2 is from the low-risk type. We used x ij to represent the jth features of the RedAA modes of the ith HPV sample, where j = 1, 2, ⋯, m. All of the features of the RedAA modes for all HPV samples are denoted as We used a support vector machine (SVM) to predict the HPV type, which is expressed as follows: where w is a linear combination of a set of nonlinear data conversion: where b denotes the bias term, C denotes some regularization parameters, and ξ i is the training error. The above problem can be expressed: Here, the Gaussian kernel function is used to calculate φ ðx i Þ T φðx j Þ instead of φðx i Þ and φðx j Þ. The separation problem can be expressed: The training model can predict the risk type of the test sample x ∈ R m according to the following formula: yðxÞ = 1 indicates that the sample x belongs to the highrisk type; otherwise, it belongs to the low-risk type. In order to obtain a better model, we used a simple grid search strategy based on 10-fold cross-validation to find the optimal model for each dataset.

Evaluation Measures.
There are three popular methods to evaluate the efficiency of prediction models: subsampling test, independent test, and jackknife test. Since the jackknife test can evaluate the efficiency of various predictor variables, we used it to evaluate the efficiency of the proposed method and calculated the class accuracies and overall accuracies: specificity accuracy of high-risk type ð Þ = a a + c , sensitivity accuracy of low-risk type where a denotes true positives, c denotes false positives, d denotes true negatives, and b denotes false negatives.

HPV Classification.
We used the jackknife test to evaluate the performance of the proposed RedAA modes. We divided the 20 amino acids into 5 to 19 groups and calculated their RedAA modes as protein features and then input them into the support vector machine to predict the HPV type. Table 1 shows the tagged HPV types and the predicted results. It can be seen from Table 1 that the 65 HPV types predicted by our method are consistent with the actual types and have better performance. However, HPV-72 is predicted to be low-risk but is actually high-risk, and HPV-30 is predicted to be high-risk but is actually low-risk. For further comparison, we compared our results with Kim et al.'s results [13]. For Kim et al.'s prediction, HPV-56 was predicted to be potentially high-risk, and we predicted it to be high-risk; HPV-53 and HPV-73 were predicted to be potentially highrisk, but in our results, they were low-risk. Phylogenetic analysis showed that HPV-30 was closely related to the established oncogenic type HPV-56, suggesting that HPV-30 was more likely to be a high-risk type. The results show that the proposed method is more consistent with the actual risk type.
We further compared our method with the following method: SVM based on the mismatch [24], SVM classifier based on the linear kernel [13], SVM based on the gap spectral kernel (Gap) [7], BLAST model [13] and integrated SVM (Ensemble) [13], and two text prediction methods based on AdaCost [26] and naive Bayes [26]. The accuracy of our method reaches 96.49%, while the accuracy of the integrated SVM is 94.12%, the accuracy of the SVM based on the unmatched kernel is 92.70%, the accuracy of the SVM based on the linear kernel is 90.28%, and the accuracy of BLAST reaches 91.18%. As for the text prediction method, AdaCost [26] has an accuracy rate of 93.05%, while naive Bayes [26] has an accuracy rate of 81.94%. The comparison also shows that the RedAA model is more effective in classifying the risk types of human papillomaviruses.

The Performance of the Early and Late Proteins in HPV
Type Prediction. Early HPV proteins contain E1, E2, E4, E5, E6, and E7, and late proteins include L1 and L2 [3,5]. Information commonly used for high-risk and low-risk HPV prediction includes information on protein sequences, secondary structure, and pseudoamino acid composition, in which most of them use E6, E7, or L1 protein [23][24][25][26][27][28]. In this paper, we used seven protein datasets of early and late proteins in HPV type prediction and compared their performance. Figure 1 compares the accuracy of each category and the overall accuracy based on early and late proteins. Figure 1 shows that the prediction accuracy of low-risk types is higher than that of high-risk types, except for E5 protein. L1 protein outperforms other HPV proteins in the prediction of low-risk types. L2 protein performs best in highrisk type predictions. The above research shows that E6, E7, L1, and L2 proteins are closely related to high-risk HPV and play an important role in the occurrence and development of diseases [14]. The function of L1 protein in low-risk and high-risk types is not exactly the same. L1 protein in the high-risk type exists in the form of integration, and L1 gene product self-assembly efficiency is low. L1 protein in the low-risk type exists in the form of free tissue, with high self-assembly efficiency. In high-risk typing, if L1 protein mutates, L1 protein cannot combine with L2 protein to form capsid protein and then cannot assemble HPV-infected virus particles. When HPV enters the host cell, the viral DNA replicates in large quantities and can integrate with the host cell DNA, resulting in host cell infection, infinite value addition, and cell immortalization. The results show that L1 protein performs better in the prediction of high-risk HPV types, while L2 protein is more suitable for low-risk HPV types.

Influence of the Physicochemical Properties of Amino
Acids. The proposed method reduced 20 AAs into several nonoverlapping groups, which relies heavily on the physical and biochemical indices of amino acids. The 522 characteristics of AAindex are divided into seven categories according to their physical and biochemical features [29]. The largest   group is hydrophobicity and the second largest group is alpha and turn propensities, and the sizes of the other four groups are relatively small. For each HPV protein, we used 522 physicochemical properties to calculate six kinds of reduced AA modes. For each class of the physicochemical properties of amino acids, we calculated their mean of the overall accuracies of HPV type prediction. The comparison of different physicochemical property classes and the RedAA modes is shown in Figure 2. From Figure 2, it can be found that the proposed prediction has no obvious preference among 7 classes of physicochemical properties for E1 proteins. As for E2 proteins, composition is the best of the six reduced AA modes. For E4 proteins, the physicochemical properties of beta and composition are better. For the reduced AA mode position and RCTD, the physicochemical properties of beta are better in prediction, but composition is better for the other four modes. The results of E5, E6, E7, L1, and L2 proteins are similar to those of E2 proteins, and the six reduced AA modes show better performance in beta physicochemical properties. These results indicate that E5, E6, E7, L1, and L2 proteins have a preference for beta physicochemical properties to reduce amino acids and calculate the six reduced AA modes in HPV type prediction.
3.5. Comparison of the Reduced Amino Acid Modes. In order to evaluate the performance of different modes, we used 522 physicochemical properties to calculate the RedAA modes of all the early and late proteins and calculated their average of the overall accuracies of HPV type prediction, which is shown in Figure 2. Figure 2 shows that six RedAA modes have the same preference trend among seven classifications of the physicochemical properties. As for E1, E2, E4, E5, and E7 proteins, PRseAAC is better than the other RedAA  Figure 3. Figure 3 shows the accuracy of HPV type prediction with the increase in reduced amino acids when combining the PRseAAC and physicochemical properties of amino acids for E1 proteins, and the best-performing PRseAAC achieves 95.378% accuracy with 19 reduced amino acids. For E2 proteins, the prediction model achieves the best performance with the PRseAAC and the physical and physicochemical properties of the composition class when amino acids are reduced to 14 classes. As for E5 and E7, PRseAAC achieves 87.18% and 75.07% accuracies when 20 amino acids are reduced to 7 and 12 classes, respectively. For E6, L1, and L2 proteins, the combination of the RCTD and beta physicochemical properties achieves best performances with 8, 15, and 11 reduced amino acids, respectively.

Conclusion
Genital papillomavirus is closely related to cervical cancer, especially high-risk HPV. Therefore, the identification of the HPV risk type is of great significance for the cervical cancer. We proposed a computational method for the prediction of the high-risk HPV based on the RedAA modes. With the help of the physicochemical properties of the amino acids, we reduced 20 amino acids into several nonoverlapping groups and calculated the structure and physicochemical characteristics of reduced AAs (RedAA) as the RedAA modes. We used reduced sequence information to predict high-risk types of HPV. Experiments with 68 known HPV types show that the proposed method has better performance than previous methods.
The first contribution is that L1 protein performs better in the prediction of high-risk HPV types, while L2 protein is more suitable for low-risk HPV types. The second contribution can be indicated from the influence of the physicochemical properties of amino acids; we noticed that E5, E6, E7, L1, and L2 proteins have a preference for beta physicochemical properties to reduce amino acids. The third contribution can be deduced from the comparison of the reduced amino acid modes; we found that the PRseAAC and RTCD outperform the other four RedAA modes and show better performance in beta physicochemical properties of the amino acids. The final contribution can be seen from the influence of the number of reduced amino acids; we noticed that the combination of the RCTD and beta physicochemical properties achieves the best performances with 8, 15, and 11 reduced amino acids for E6, L1, and L2 proteins, respectively.

Data Availability
All the data used to support the findings of this study are available from the Los Alamos National Laboratory (https://pave.niaid.nih.gov/lanl-archives).

Conflicts of Interest
The authors declare that they have no conflicts of interest.