Classification of Hepatitis Viruses from Sequencing Chromatograms Using Multiscale Permutation Entropy and Support Vector Machines

Classifying nucleic acid trace files is an important issue in molecular biology research. To obtain better classification performance, the choice of which features are used and which classifier is implemented to best represent the properties of nucleic acid trace files plays a vital role. In this study, different feature extraction methods based on statistical and entropy theory are utilized to discriminate deoxyribonucleic acid (DNA) chromatograms whose signals are almost impossible to distinguish visually. The extracted features are used as the input feature set for Support Vector Machine (SVM) classifiers with different kernel functions. The proposed framework is applied to a total of 200 hepatitis nucleic acid trace files consisting of Hepatitis B Virus (HBV) and Hepatitis C Virus (HCV) samples. While statistical-based feature extraction methods represent the properties of hepatitis nucleic acid trace files with descriptive measures such as mean, median and standard deviation, entropy-based feature extraction methods, including permutation entropy and multiscale permutation entropy, quantify the complexity of these files. The results indicate that using statistical and entropy-based features produces exceptionally high performance, with accuracies reaching nearly 99% in classifying HBV and HCV.


Introduction
Investigating the sequencing of nucleotides from deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) is an important research area in the field of molecular genetics. Although next-generation sequencing platforms have recently become more widely applicable than capillary electrophoresis, capillary electrophoresis studies are still required for the verification of next-generation sequencing results. Since assessing a huge number of subjects with capillary electrophoresis is time-consuming and cost-intensive, it is mainly used in small-sized projects. In order to determine the sequence of the nucleic acid (DNA/RNA) regions of interest, millions of copies are amplified with the process named polymerase chain reaction (PCR). In PCR, an RNA region of interest is first converted to DNA copies. After that, the PCR product is prepared for capillary electrophoresis. As a result, base-calling signals (trace files) are obtained from the bases of DNA, namely Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), which are labeled with four different fluorescent dyes. Different analyses (e.g., mutation analysis, identification of virus subtypes known as genotyping, and determination of species) can be carried out from the resulting chromatogram, which includes the related sequences for the specific purpose.
Sequential data modeling for the purpose of discriminating and classifying DNA chromatograms has become very popular with the rapid development of sequencing techniques in molecular genetics and bioinformatics [1][2][3][4]. While some types of chromatograms can be manually recognized by an expert, it is hard to classify many of them without using any special software. Hepatitis B Virus (HBV) and Hepatitis C Virus (HCV) base-calling signals are two types of hepatitis DNA chromatograms, and distinguishing these signals visually is impossible. Therefore, the classification of hepatitis DNA trace files is an important issue in utilizing resources efficiently. Illustrations of HBV and HCV trace file samples are given in Figures 1 and 2, respectively. These figures show the peaks of bases A, C, G, and T with different colors.

This study deals with the classification of HBV and HCV trace files with support vector machines (SVM) using statistical and entropy-based feature extraction methods. Trace files are treated in a time-series manner and exhibit complex characteristics. In order to measure complexity, approximate entropy (ApEn) was suggested by Pincus with an application to electroencephalogram (EEG) series [5]. ApEn depends on the length of the series, and it takes a lower value than expected when the length is short. Since the sample entropy (SampEn) proposed by [6] is not affected by the length, it is more consistent than ApEn [7]. In addition, the calculation of SampEn is easier than that of ApEn. Permutation entropy (PE) [8] estimates the complexity of non-stationary, noisy and non-linear series by comparing neighboring values. These traditional entropy measures have been utilized for different purposes, especially in fault diagnosis and in vibroarthographic (VAG) and electroencephalogram (EEG) signal-processing studies. However, none of these entropy measures is applicable to systems which show structures on multiple spatial and temporal scales.
In order to estimate multiscale complexity, multiscale entropy (MSE) was first suggested by Costa, Goldberger and Peng for physiologic time series data [9]. The superiority of MSE was then shown on different time series data such as cardiac inter-beat (RR) intervals [10,11] and human gait [12]. MSE uses single-scale SampEn in order to quantify the complexity of the coarse-grained series, and different studies showed that it has some limitations based on the characteristics (e.g., existence of outliers, stationarity) and length of the series [13][14][15]. A modification of MSE, namely multiscale permutation entropy (MPE), uses PE instead of SampEn, and the procedure is more robust to artifacts and observational noise in the time series data [16]. Apart from techniques based on statistical theory, various researchers have offered suggestions with regard to using single and multiscale entropy measures as a feature extraction technique for the classification of sequential data. While some studies investigate the performance of different sophisticated classifiers with features extracted using ApEn, SampEn and/or PE [17][18][19][20][21][22], others handle multiscale-based techniques such as MPE [23,24].
These entropy measures have been used on biological time series data for the purpose of both quantifying complexity and extracting features for classification. However, there is no work available that uses entropy-based feature extraction methods for DNA trace files, especially for hepatitis DNA trace files. On the other hand, sophisticated classifiers within the concept of machine learning have been investigated in terms of their classification ability in studies of DNA sequencing [25][26][27][28]. However, SVM [29,30] has been reported as a powerful classification tool compared with other supervised algorithms in recent years [31], and to the best of our knowledge, none of the hepatitis DNA studies have examined SVM as a classifier.
In this study, a new framework for the classification of HBV and HCV trace files based on features extracted from the four bases (i.e., A, C, G, T) of hepatitis DNA chromatograms is presented. Statistical-based and entropy-based features are extracted from the hepatitis DNA trace files. The statistical-based feature extraction method is intended to capture the statistical properties of the four bases belonging to HBV and HCV by computing the values of mean, median and standard deviation. On the other hand, an entropy-based feature extraction method based on PE and MPE is utilized for the purpose of quantifying the complexity of these bases. Therefore, 24 computationally efficient features are extracted, and their different combinations are later fed to SVM with different kernel functions, namely linear, polynomial (Poly.) and radial basis function (RBF).
The rest of this study is organized as follows. Section 2 includes materials and methods of the study. The proposed framework is also given in this section. Model comparison results are presented in Section 3. A discussion and some concluding remarks are provided in Sections 4 and 5, respectively.

Dataset
Hepatitis DNA trace files are obtained by "Phred" [32], a base-calling software widely used in academic and commercial laboratories, embedded in an ABI-3730 capillary sequencer device (Applied Biosystems, Foster City, CA, USA) for DNA sequence traces. The data consist of 200 trace files, of which 96 are HBV and 104 are HCV. The type of hepatitis is taken as the dependent variable of the constructed SVM models, which has a binary form. Therefore, the hepatitis type is labeled as +1 if the trace file represents HBV; otherwise it is labeled as -1. Each trace file consists of four base-calling signal time series shaped like Gaussian peaks (A, C, G, T bases). A typical segment of a DNA trace file is illustrated in Figure 3. Each base-calling signal in the trace file is converted to an array using the "scfread" function of MATLAB 2017a software [33].

Feature Extraction
Correctly identifying the features extracted from the raw data plays a vital role in achieving better classification. Since the intensities of the four base-calling signals differ from each other, the raw trace files cannot be directly used as an input for the classification process. For this reason, the raw data should be converted to a mathematical representation that yields fixed-length numerical values. Different methods can be used to represent the raw data. Two types of extraction methods for the arrays obtained from hepatitis DNA trace files are introduced in this study: (1) statistical-based feature extraction and (2) entropy-based feature extraction. The following subsections provide the formulations of how features are extracted from a given base-calling signal based on statistical and entropy theory. All calculations are carried out using MATLAB 2017a software [33].

Statistical-Based Feature Extraction Method
Three statistical features based on descriptive statistical theory, including central tendency measures (mean and median) and a central dispersion measure (standard deviation), are used. These are frequently used statistics that reflect the property of DNA trace files [26,27].
Let $N$ denote the length of each base-calling signal. The signal intensities (located on the Y-axis in Figure 3) observed at data points $1, 2, \ldots, N$ (located on the X-axis in Figure 3) for base-calling signals A, C, G, and T can be expressed as $y_j(1), y_j(2), \ldots, y_j(N)$ for $j = A, C, G, T$. The mean and standard deviation for each base-calling signal are given as follows, where $j = A, C, G, T$:

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} y_j(i), \qquad (1)$$

$$\sigma_j = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(y_j(i) - \mu_j\right)^2}. \qquad (2)$$

The intensities of base-calling signal $j$ are ordered in ascending order, and the middle value is then found by Equation (3) as $\mathrm{median}_j$, where $j = A, C, G, T$:

$$\mathrm{median}_j = \begin{cases} y_j\!\left(\tfrac{N+1}{2}\right), & N \text{ odd}, \\[4pt] \tfrac{1}{2}\left[y_j\!\left(\tfrac{N}{2}\right) + y_j\!\left(\tfrac{N}{2}+1\right)\right], & N \text{ even}. \end{cases} \qquad (3)$$
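The statistical features above are computed per base-calling signal. The paper performs these calculations in MATLAB; purely as an illustration, a minimal Python sketch (the toy intensities below are made up, since real trace arrays are not shown here) could look like:

```python
import numpy as np

def statistical_features(signal):
    """Mean, median and sample standard deviation of one base-calling signal."""
    y = np.asarray(signal, dtype=float)
    return {"mean": y.mean(), "median": float(np.median(y)), "std": y.std(ddof=1)}

# Hypothetical intensities for the four bases of one trace file.
trace = {
    "A": [10, 250, 30, 5, 12],
    "C": [8, 14, 260, 22, 9],
    "G": [240, 18, 11, 7, 16],
    "T": [6, 9, 13, 255, 20],
}

# Twelve statistical features: three descriptive measures per base.
features = {f"{name}_{base}": value
            for base, y in trace.items()
            for name, value in statistical_features(y).items()}
print(len(features))  # 12
```

Applying the same three measures to each of the four bases yields the 12 statistical features used later in the SVM models.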

Entropy-Based Feature Extraction Method
Two entropy-based feature extraction methods including PE and MPE are given in this section. The procedures of obtaining PE and MPE for a given base calling signal are presented below.

• Permutation Entropy
The procedure of measuring the PE of a given time series is a process of calculating Shannon entropy (ShEn) after mapping the original series to ordinal patterns. Using ordinal patterns has numerous advantages from different aspects [34]. For a given base-calling signal $j$ ($j = A, C, G, T$) whose intensities exhibit the characteristics of a time series $Y_j = \{y_j(i)\}_{i=1,2,\ldots,N}$ with length $N$, the $m$-dimensional embedded vectors can be expressed as:

$$Y_j^m(i) = \left[\,y_j(i),\; y_j(i+\tau),\; \ldots,\; y_j\big(i+(m-1)\tau\big)\,\right], \qquad (4)$$

where the embedding dimension is denoted by $m$ ($m \geq 2$) and the time lag is denoted by $\tau$ ($\tau \ll N$). Here, the $Y_j^m(i)$ denote overlapping segments of length $m$. According to the parameter $m$, the number of possible permutations is $m!$, with permutation patterns $\pi_p$, where $p = 1, 2, \ldots, m!$. Each $Y_j^m(i)$ is mapped to the permutation pattern obtained by arranging its components in ascending order. Let the probability distribution over the permutation patterns be $P(\pi_1), P(\pi_2), \ldots, P(\pi_k)$, where $k \leq m!$ and the condition $\sum_{l=1}^{k} P(\pi_l) = 1$ is satisfied. Based on ShEn, the PE of order $m$ is now obtained as:

$$H_{PE_j}(m) = -\sum_{l=1}^{k} P(\pi_l)\,\ln P(\pi_l). \qquad (5)$$

When the relative frequencies of all permutation patterns are equal, the probabilities take the value $\frac{1}{m!}$, and the maximum value of $H_{PE_j}(m)$, namely $\ln(m!)$, is obtained [35,36]. To make $H_{PE_j}(m)$ scale-independent and comparable among different $m$, the normalized PE $H_{NPE_j} \in [0,1]$ is calculated by the following equation:

$$H_{NPE_j} = \frac{H_{PE_j}(m)}{\ln(m!)}. \qquad (6)$$

• Multiscale Permutation Entropy

The procedure of measuring the MPE of a given base-calling signal $j$ ($j = A, C, G, T$) with length $N$ starts with creating a coarse-grained structure. The coarse-graining method introduced by Costa, Goldberger and Peng divides the original time series into non-overlapping windows of increasing length $s$, also called the scale parameter [9]. The $z$-th element of the coarse-grained time series is obtained by:

$$c_j(z) = \frac{1}{s}\sum_{i=(z-1)s+1}^{zs} y_j(i), \qquad 1 \leq z \leq \frac{N}{s}, \qquad (7)$$

where $\frac{N}{s}$ is the length of the constructed coarse-grained time series. After determining the coarse-grained time series, SampEn is then calculated in the original MSE procedure. Instead of SampEn, Aziz and Arif suggested using PE (given in Equations (5) and (6)) to calculate the complexity of each coarse-grained series $C_j = \{c_j(z)\}_{z=1,2,\ldots,N/s}$ [16]:

$$MPE_j(s) = H_{NPE}\big(C_j\big). \qquad (8)$$

It should be noted that $MPE_j$ reduces to $H_{NPE_j}$ when the scale parameter is equal to 1. $H_{NPE_j}$ and $MPE_j$ are the entropy values of the base-calling signals' intensities and are calculated for all bases $j = A, C, G, T$. These entropy measures are used as features in the SVM classification models.
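The PE and MPE procedures above (ordinal-pattern counting, Shannon entropy, coarse-graining) can be sketched in a few lines of Python. This is an illustrative, dependency-light implementation, not the authors' MATLAB code:

```python
from math import factorial, log
import numpy as np

def permutation_entropy(y, m=3, tau=1):
    """Normalized PE: Shannon entropy of the ordinal-pattern
    distribution, divided by its maximum ln(m!)."""
    y = np.asarray(y, dtype=float)
    n_vec = len(y) - (m - 1) * tau            # number of embedded vectors
    counts = {}
    for i in range(n_vec):
        # Ordinal pattern of the m-dimensional embedded vector.
        pattern = tuple(np.argsort(y[i:i + (m - 1) * tau + 1:tau]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / n_vec
    return float(-(p * np.log(p)).sum() / log(factorial(m)))

def coarse_grain(y, s):
    """Non-overlapping window means of length s (Costa et al. coarse-graining)."""
    y = np.asarray(y, dtype=float)
    n = len(y) // s
    return y[:n * s].reshape(n, s).mean(axis=1)

def multiscale_permutation_entropy(y, m=3, tau=1, scales=(1, 2, 3)):
    """PE of each coarse-grained series; scale s = 1 reduces to plain PE."""
    return [permutation_entropy(coarse_grain(y, s), m, tau) for s in scales]

# A monotone signal has a single ordinal pattern, hence zero entropy,
# while white noise approaches the maximum (normalized value near 1).
rng = np.random.default_rng(42)
print(permutation_entropy(np.arange(100)))
print(permutation_entropy(rng.normal(size=5000)))
```

Note how the normalization by $\ln(m!)$ keeps the result in $[0, 1]$, so values for different signals are directly comparable.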

Support Vector Machines
Binary-class SVM aims to find the most appropriate hyperplane that separates the two classes. The training set $X$ with $n$ samples has the form $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$ denote the input vectors and the $y_i$ are the corresponding labels, which have a binary form [37]. The purpose is to estimate the parameters $w$ and $b$, which define the optimal hyperplane obtained from the decision function $\operatorname{sign}(f(x))$. Here, $f(x)$ is the discriminant function used as the separating hyperplane and can be defined as follows:

$$f(x) = w^{T}x + b, \qquad (11)$$

where the following constraint should be satisfied for this hyperplane:

$$y_i\left(w^{T}x_i + b\right) \geq 1, \qquad i = 1, \ldots, n. \qquad (12)$$

A quadratic optimization problem with the objective function $\min \tfrac{1}{2}\|w\|^{2}$ subject to the linear constraints given in Equation (12) is defined in order to obtain a maximum-margin band. Using Lagrange multipliers and the Karush-Kuhn-Tucker conditions, the following dual problem can be obtained:

$$\max_{\alpha} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j \, x_i^{T}x_j \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; \alpha_i \geq 0, \qquad (13)$$

where the inputs $x_i$ corresponding to non-zero $\alpha_i$ are named support vectors, and the values of the $\alpha_i$ are found by applying one of the quadratic optimization methods to Equation (13). After that, the unknown parameters $w$ and $b$ are determined (for more details, see [38]). A slack variable ($\xi_i$) is added to the problem in the case of linearly non-separable data; the $\xi_i$ quantify the extent of the misclassifications. When the data are linearly separable, the linear SVM mentioned above is applied; otherwise a non-linear SVM should be preferred. The non-linear SVM outperforms the linear SVM when a complex-structured time series has many features. In non-linear SVM, the inputs are transformed from a non-linear to a linear space with a specific kernel function. The aim is to find the hyperplane with the highest margin in the new space, where the transformation is achieved by kernels [39]. In this problem, the penalty parameter of the error term is denoted by $C$, and the term $C\sum_{i=1}^{n}\xi_i$ is added to the objective function [40].

After the transformation of the inputs, a linear SVM problem can be formulated for the new space [41]. Also, depending on the kernel, Equation (13) is revised as the new dual optimization problem for the non-linear SVM given below:

$$\max_{\alpha} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j \, K(x_i, x_j), \qquad (14)$$

where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the kernel function. Linear, RBF and Poly. kernels are frequently used in SVM, and the preferability of one kernel over the others is based on expert knowledge and the data structure. Table 1 shows the formulations of the kernels used in this study, where $\gamma$ and $d$ are the kernel parameters. While only the $C$ parameter can be tuned in linear SVM, $\gamma$ and $d$ can be tuned in addition to $C$ in RBF and Poly. kernel SVM, respectively.
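The paper builds its SVM models with the R "caret" and "kernlab" packages. Purely to illustrate the three kernels and the $C$, $\gamma$ and $d$ tuning parameters described above, here is a scikit-learn sketch on synthetic stand-in features (the 24-dimensional feature matrix and labels below are made up; they are not the paper's data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 24-dimensional feature matrix;
# labels +1 (HBV) / -1 (HCV), as in the paper's encoding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 24)), rng.normal(1.5, 1.0, (100, 24))])
y = np.array([1] * 100 + [-1] * 100)

# A 30% training split, mirroring one of the paper's splitting proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.3, random_state=0, stratify=y)

kernels = {
    "linear": SVC(kernel="linear", C=1.0),                    # only C is tuned
    "poly": SVC(kernel="poly", degree=3, gamma="scale", C=1.0),  # C, gamma, d
    "rbf": SVC(kernel="rbf", gamma="scale", C=1.0),              # C, gamma
}
for name, clf in kernels.items():
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3), "nSV:", clf.n_support_.sum())
```

The `n_support_` attribute corresponds to the number of support vectors (nSV) reported in the paper's evaluation.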

Table 1. Kernel functions used in the SVM models ($\gamma$ and $d$ are the kernel parameters).

Kernel    $K(x_i, x_j)$
Linear    $x_i^{T}x_j$
Poly.     $\left(\gamma\, x_i^{T}x_j + 1\right)^{d}$
RBF       $\exp\!\left(-\gamma \|x_i - x_j\|^{2}\right)$

Performance Evaluation
Different measures are used to evaluate the performance of the SVM models with different kernel functions. Most of these can be derived from a confusion matrix, a 2 × 2 table that holds information about the predicted versus actual class of each observation. A typical confusion matrix is given in Table 2. In the confusion matrix, TP and TN denote the numbers of correctly classified HBV and HCV trace files, respectively. Sensitivity (Se), sometimes called the TP rate, indicates the proportion of correctly classified HBV trace files. Analogously, specificity (Sp), also called the TN rate, shows the proportion of correctly classified HCV trace files. Accuracy (Acc) gives the proportion of all trace files that are classified correctly. The kappa (κ) statistic is an important agreement measure for assessing the discriminative power of the relevant SVM model. The κ statistic lies in the range [-1, +1], and perfect classification between HBV and HCV trace files is achieved when κ equals 1.
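The measures above follow directly from the confusion-matrix cells. As a small sketch (the TP/FN/FP/TN counts in the example call are made up):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Acc, Se, Sp and Cohen's kappa from a 2x2 confusion matrix
    (positive class = HBV, negative class = HCV)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    se = tp / (tp + fn)          # TP rate: correctly classified HBV
    sp = tn / (tn + fp)          # TN rate: correctly classified HCV
    # Expected agreement under chance, from the row/column marginals.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return acc, se, sp, kappa

print(confusion_metrics(tp=94, fn=2, fp=1, tn=103))
```

A κ near 1 indicates agreement well beyond chance, while κ = 0 means the classifier does no better than random labeling with the same marginals.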

Proposed Framework
The experimental setup of the proposed framework is described in the following steps:

Step 1: Preparing the Dataset and Extracting Features

Two hundred trace files belonging to hepatitis DNA are obtained with the Phred software. Hepatitis types (96 traces for HBV and 104 traces for HCV) are labeled as +1 and -1 if the related trace represents HBV and HCV, respectively. In order to extract features for the classification process, all trace files, each containing four base-calling signals, are converted to arrays.
In total, 24 features are extracted by the two feature extraction methods, statistical-based and entropy-based, for the SVM classification. Twelve features are obtained with the statistical-based extraction and are given in Table 3, where µ_j, σ_j and median_j denote the mean, standard deviation and median of base-calling signal j (= A, C, G, T), respectively. The remaining 12 features are extracted with the entropy-based method; four of them with single-scale PE and eight with multiscale PE. For the calculation of PE and MPE, Bandt and Pompe recommended embedding dimensions m = 3, . . . , 7 and time lag τ = 1 [8]. Also, Nalband, Prince and Agrawal followed this suggestion, using the same values [19]. Thus, these parameters are chosen as m = 3 and τ = 1. All calculations are carried out using MATLAB 2017a software [33].
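The 24-feature layout described above can be made concrete with a short sketch (the feature names below are hypothetical labels introduced for illustration, not the paper's notation):

```python
bases = ["A", "C", "G", "T"]

# 12 statistical features (Table 3): mean, median and std per base.
stat_features = [f"{stat}_{b}" for b in bases for stat in ("mean", "median", "std")]

# 12 entropy features: single-scale PE (= MPE at s = 1) plus MPE at s = 2, 3 per base.
entropy_features = [f"MPE_s{s}_{b}" for b in bases for s in (1, 2, 3)]

all_features = stat_features + entropy_features
print(len(stat_features), len(entropy_features), len(all_features))  # 12 12 24
```

The SVM models are then fed either a single group (e.g., only the means, or only MPE at s = 2), or the full combined set.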

Step 3: Performing Classification Process and Evaluating Results
The classification process of the hepatitis DNA trace files is performed using SVM with three different kernel functions. In this step, the features extracted with the statistical-based method are used first separately, and then together. Likewise, classification is performed using the PE (i.e., MPE at s = 1), MPE at s = 2 and MPE at s = 3 features separately, and then together. SVM models using the mentioned features are built for each splitting proportion. Then the performance evaluation measures Acc, Se, Sp, and κ are obtained for the training and testing pairs. In addition, the number of support vectors (nSV) generated in the training phase of the relevant SVM model is recorded. This process is run a total of 10 times in order to avoid the effect of the random selection process. Thus, the performance evaluation measures and nSVs are calculated 10 times for each model, and their mean values are reported as Acc, Se, Sp, κ, and nSV. With the training and testing errors defined as ε_training = 1 − Acc_training and ε_testing = 1 − Acc_testing, the error difference of the relevant SVM model is calculated by ε_diff = ε_training − ε_testing.
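The 10-run split-and-average protocol, together with the ε_diff check, can be sketched as follows. To keep the sketch self-contained, a simple 1-nearest-neighbour classifier stands in for the paper's SVM, and the 24-dimensional features are synthetic:

```python
import numpy as np

def nn_predict(X_tr, y_tr, X_q):
    """1-nearest-neighbour prediction (hypothetical stand-in for the SVM)."""
    d = np.linalg.norm(X_tr[None, :, :] - X_q[:, None, :], axis=2)
    return y_tr[np.argmin(d, axis=1)]

def repeated_split_eval(X, y, train_frac=0.3, n_runs=10, seed=0):
    """Average training/testing accuracy over n_runs random splits and
    report eps_diff = eps_training - eps_testing."""
    rng = np.random.default_rng(seed)
    tr_accs, te_accs = [], []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        n_tr = int(train_frac * len(y))
        tr, te = idx[:n_tr], idx[n_tr:]
        tr_accs.append(np.mean(nn_predict(X[tr], y[tr], X[tr]) == y[tr]))
        te_accs.append(np.mean(nn_predict(X[tr], y[tr], X[te]) == y[te]))
    eps_diff = (1 - np.mean(tr_accs)) - (1 - np.mean(te_accs))
    return float(np.mean(tr_accs)), float(np.mean(te_accs)), float(eps_diff)

# Synthetic 24-dimensional stand-in features for two well-separated classes
# with the paper's class sizes (96 vs. 104).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (96, 24)), rng.normal(2, 1, (104, 24))])
y = np.array([1] * 96 + [-1] * 104)
print(repeated_split_eval(X, y, train_frac=0.3))
```

An ε_diff close to zero over the averaged runs is the paper's indicator that the model is not over-fitting.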
"Caret" and "Kernlab" libraries in R studio (version 1.2.1335, RStudio, Inc., Boston, MA, USA) programming language [42,43] are used in step 2 and step 3. Table 4 reports the classification performance of SVM models using statistical-based features for 10%, 20%, 30%, 40%, and 50% training sets and their corresponding testing sets.  When the statistical-based features are taken into account for 10%, 20%, 30%, 40%, and 50% training samples, the SVM-RBF kernel classifier with mean and all statistics features produces better classification performances in terms of both Acc and κ. Additionally, the SVM models built with mean and all statistics features in all proportions of training samples indicate high classification accuracies ranging from nearly 93% to 99%. All the SVM models (with linear, Poly. and RBF kernels) for each training sample using median have the lowest classification performances among other statistical-based features. When the difference of error value between training and testing sets approaches to zero, it may be indicated that the model is not suffering from the over-fitting problem. The last column of Table 4 provides ε di f f , and these values are close to zero in general. On the other hand, Han and Jiang pointed out that the over-fitting problem in classification can be detected by using the expected values of sensitivity and specifity [44]. When these values are complementary, it can be said that the model has an over-fitting problem. It is shown in Table 4 that Se and Sp take on non-complementary values.

Classification with Entropy-Based Features
The classification performance of the SVM models with entropy-based features for 10%, 20%, 30%, 40%, and 50% training sets and their corresponding testing sets is given in Table 5. For the 10% training samples, the SVM-RBF kernel classifier with MPE at s = 2 features has the highest performance in terms of Acc (95.6%) and κ (0.911). For the same training proportion, the SVM-RBF kernel classifiers with MPE at s = 3 and with all entropies have the same values of Acc = 95.5% and κ = 0.909. In the cases of 20% and 30% training samples, the highest values of Acc are obtained with the SVM-Poly. kernel classifier using all entropies, as 96.6% and 98.3%, respectively. This classifier also produces the highest value of κ for the 20% and 30% training samples. The results for the 40% training samples show that the SVM-RBF kernel classifier using all entropy-based features achieves the best classification performance in terms of Acc and κ (98.9% and 0.978, respectively). Besides, the SVM-Poly. kernel classifier with all entropy-based features takes the highest values of Acc and κ (98.1% and 0.962, respectively) for the 50% training samples. Additionally, the SVM models using entropy-based features in all training proportions achieve substantial classification performance, with accuracies ranging from nearly 93% to 99%. According to the ε_diff, Se and Sp values, it can be concluded that the over-fitting problem does not appear in the SVM models for the 10%, 20%, 30%, 40%, and 50% training samples. The SVM models using entropy-based features indicate very low ε_diff values, ranging from 0.000 to 0.050.

Discussion
Sequential data exhibit a complex structure. Due to the difficulty of distinguishing this type of data visually, the classification of sequential data has attracted notable attention from researchers in different areas. Most recent studies have dealt with the complexity of the system and, therefore, used various types of entropy to extract features from the raw data. Features which truthfully reflect the behaviour of the data not only reduce the dimensionality of the feature space, but also improve the classification quality.
Recent studies on biological systems have offered novel approaches to extract features based on single and multiscale entropy measures in order to achieve high classification accuracy. In particular, entropy-based features extracted from EEG signals help researchers in the early diagnosis of epilepsy, different types of sleep disorders, and brain-related disorders such as Alzheimer's [45]. Acharya et al. [46] extracted features from EEG signals by using ApEn, SampEn, and Phase Entropies (S1 and S2) for the purpose of detecting epilepsy. After applying different machine learning classification algorithms, it was shown that a fuzzy classifier produced better classification performance (98%) in terms of the performance measures used in the study. Collected EEG signals from the brain were also discriminated with various classifiers after an extraction process including entropy-based methods (i.e., ApEn and SampEn) in another important study [47]. Average ShEn, Renyi's entropy (RE), ApEn, SampEn, and the S1 and S2 entropies were utilized to extract features from focal and non-focal epilepsy EEG signals in the study of Sharma, Pachori and Acharya [48]. It was reported that the least squares SVM with a Morlet wavelet kernel function reached an 87% accuracy rate in classifying the signals. For the classification of focal and non-focal EEG, Arunkumar et al. [49] proposed a methodology based on ApEn, SampEn and RE. The extracted features were fed into different classifiers such as Naïve Bayes (NBC), SVM, k-nearest neighbor (KNN), and non-nested generalized exemplars (NNGe). The results demonstrated that NNGe has the best classification performance with 98% accuracy. Also, a review of entropy-based feature extraction methods for the diagnosis of epilepsy was presented in [50]. To detect epileptic seizures, MSE was utilized as the feature extraction method, and an SVM classifier was applied in [51].
The classification accuracies in distinguishing seizure, seizure-free and normal EEG signals were found to be higher than 98%. In sleep scoring classification, features were extracted from EEG signals using MSE and SVM-based classifiers were applied in [52]; the overall accuracy rate was 91.4%. To classify sleep stages accurately, Rodríguez-Sotelo et al. [53] proposed a method based on J-means clustering with EEG features extracted by fractal dimension, detrended fluctuation analysis, ShEn, ApEn, SampEn, and MSE. The extracted features were optimized with the Q−α method and then fed to a J-means classifier, which achieved an average accuracy rate of 80%. Another important study on sleep disorders used 22 different EEG features, including ApEn, SampEn and PE [54]; the extracted features were then fed to wavelet-transform and SVM classifiers. Recent studies have also shown that features extracted with entropy measures produce high classification performance in classifying human sleep EEG signals with different supervised and unsupervised machine learning methods [55][56][57]. To detect Alzheimer's disease, various EEG features including entropy (e.g., ApEn, SampEn, PE) and statistical (e.g., mean, variance, standard deviation) measures were extracted in [58] and then fed to six classifiers including SVM, artificial neural networks, KNN, NBC, and random forests. The proposed method achieved high classification accuracy, ranging from nearly 89% to 97%.
Some of the important studies presented above can be seen as pioneers in classifying sequential data obtained from biological systems, and they demonstrate the usefulness of entropy-based feature extraction methods. In addition, an increasing number of studies in recent years have investigated the classification abilities of machine learning-based methods for genomic data [59,60]. Genomics is one of the most important domains in bioinformatics [59], where computational methods must be applied carefully in order to discover useful but hidden information in biological systems. As in all classification paradigms, extracting a set of features from the bases of DNA and then feeding them to a supervised classifier for the purpose of labeling DNA trace files (e.g., high/low quality, genotyping of viruses, species identification) is an essential step toward high classification accuracy. To the best of our knowledge, no previous work has dealt with entropy-based feature extraction methods for gene sequencing data. In this study, a new framework is proposed to classify hepatitis DNA trace files with SVM using extraction methods based on both statistical and entropy (i.e., PE and MPE) measures. The mathematical formulations of the two extraction methods are introduced. The proposed extraction methods are applied to the hepatitis DNA trace files, and the classification of the files as HBV or HCV is then performed via SVM with three different kernel functions.
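To illustrate the entropy-based extraction step, the following sketch computes standard Bandt-Pompe permutation entropy and a coarse-grained multiscale variant from a one-dimensional signal such as a base-calling trace. The function names, the embedding dimension m = 3, and the scale set are illustrative choices for this sketch, not the study's exact parameter settings.

```python
import math

import numpy as np


def permutation_entropy(x, m=3, tau=1):
    """Bandt-Pompe permutation entropy, normalized to [0, 1] by log(m!)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    # Count the ordinal pattern of each embedded window of length m.
    counts = {}
    for i in range(n):
        window = x[i : i + m * tau : tau]
        pattern = tuple(np.argsort(window, kind="stable"))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / n
    # Shannon entropy of the pattern distribution, normalized by log(m!).
    return float(-(p * np.log(p)).sum() / math.log(math.factorial(m)))


def coarse_grain(x, s):
    """Non-overlapping window averages of x at scale s."""
    x = np.asarray(x, dtype=float)
    n = len(x) // s
    return x[: n * s].reshape(n, s).mean(axis=1)


def multiscale_permutation_entropy(x, m=3, tau=1, scales=(1, 2, 3)):
    """PE of the coarse-grained series at each scale (scale 1 = plain PE)."""
    return [permutation_entropy(coarse_grain(x, s), m, tau) for s in scales]
```

A strictly monotone signal contains a single ordinal pattern and therefore has zero permutation entropy, while an irregular signal approaches the maximum of 1; the MPE values at scales 1, 2 and 3 then serve directly as classifier inputs.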
SVM models built with the median feature have low accuracies compared to models with other statistical-based features. In general, the SVM-RBF kernel classifier using the mean and all statistical features outperforms SVM models with other statistical-based features. On the other hand, SVM-RBF and SVM-Poly. kernel classifiers using all entropies achieve higher classification performance than the SVM-linear classifier for all training proportions except 10%. SVM models using statistical and entropy-based features exhibit very similar classification performance in terms of accuracy.
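The three-kernel comparison described above can be sketched as follows. This is a minimal illustration on synthetic data: the feature matrix, labels and hyperparameter values are placeholders rather than the study's actual hepatitis features, and scikit-learn's SVC is used as a stand-in implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder data: 200 "trace files" described by 6 extracted features
# (e.g., mean, standard deviation, PE, MPE values); labels 0 = HBV, 1 = HCV.
X = rng.normal(size=(200, 6))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# 50% training proportion, stratified so both classes appear in each split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.5, random_state=0, stratify=y
)

results = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_tr, y_tr)
    # Held-out accuracy and the number of support vectors (nSVs),
    # the latter being a proxy for model complexity.
    results[kernel] = (clf.score(X_te, y_te), int(clf.n_support_.sum()))
```

Repeating the loop over several training proportions (10% to 50%) and feature subsets reproduces the kind of kernel-by-feature comparison reported in the tables.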
When the best-performing SVM models for each training proportion are compared, the models with entropy-based features produce fewer nSVs than the models with statistical-based features and consequently yield lower complexity in the decision process.
According to Table 5, SVM-RBF kernel classifiers with entropy-based features have a higher percentage of nSVs than SVM-linear and SVM-Poly. kernel classifiers for all training proportions, so an over-fitting problem could in principle appear. However, for each training proportion, the ε_diff values are close to 0, the Se values are close to 1, and the Sp values are above 0.90; moreover, Se and Sp do not take complementary values. According to these values, an over-fitting problem is therefore not expected. On the other hand, SVM models using all entropies have fewer nSVs than models using PE, MPE at s = 2 and MPE at s = 3 separately for training proportions from 30% to 50%. Thus, fewer parameters suffice to define the separating hyperplanes of this problem when the SVM uses all entropies. Moreover, the cross-validation used in the training phase also helps to overcome the over-fitting problem.
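The over-fitting checks discussed above can be computed directly from a confusion table and the train/test error rates. The plain-Python sketch below uses helper names of our own; reading ε_diff as the absolute gap between training and test error rates is our interpretation of the metric, not a definition taken from the paper.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Se = TP / (TP + FN), Sp = TN / (TN + FP) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def epsilon_diff(train_error, test_error):
    # Assumed reading of the paper's metric: the absolute gap between
    # training and test error rates; values near 0 argue against over-fitting.
    return abs(test_error - train_error)
```

A model with Se near 1, Sp above 0.90 and ε_diff near 0 generalizes well despite a high nSV percentage, which is the argument made for the entropy-based SVM-RBF models.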

Conclusions
The results demonstrate that the proposed framework produces remarkable classification performance with both statistical and entropy-based features. By integrating this framework into DNA sequencing devices, autonomous classification of DNA trace files, especially hepatitis DNA trace files that cannot be distinguished visually, can be achieved successfully.
The proposed framework, which offers two different feature extraction methods, demonstrates that SVM models with statistical-based features perform as well as models with entropy-based features. Hence, entropies can be effectively used to extract features from DNA trace files, which produce non-stationary, noisy and non-linear signals. Entropy-based feature extraction can be used either alone or in combination with other extraction methods to obtain higher classification performance.
Although this study is designed for the classification of two classes of trace files (HBV and HCV), further studies can focus on multi-class trace files, such as the genotypes (sub-types) of hepatitis and the DNA trace files of other viruses and bacteria. Different supervised machine learning methods can also be implemented and compared in terms of their classification ability.