Comparison of Support Vector Machine Recursive Feature Elimination and Kernel Function as feature selection methods using Support Vector Machine for lung cancer classification

Cancer is the uncontrolled growth of abnormal cells that requires proper treatment. According to the World Health Organization, cancer was the second leading cause of death in 2018. There are more than 120 types of cancer; one of them is lung cancer. Cancer classification has made it possible to improve the diagnosis, treatment, and management of cancer. Many studies have examined cancer classification using microarray data. Microarray data consist of thousands of features (genes) but only dozens or hundreds of samples. This can reduce classification accuracy, so feature selection is needed before the classification process. In this research, the feature selection methods are Support Vector Machine Recursive Feature Elimination (SVM-RFE) and the Kernel Function, and the classification method is the Support Vector Machine (SVM). The results show that SVM using SVM-RFE as feature selection is better than the SVM method without feature selection and with Gaussian Kernel Function feature selection.


Introduction
Cancer is a term for diseases in which cells divide abnormally without control and can invade surrounding tissue. Each year more than 1.1 million people die of lung cancer [1]. Lung cancer belongs to the solid tumor category, i.e., abnormal tissue in solid form [2]. One of the main causes of lung cancer is prolonged exposure to, or inhalation of, carcinogenic substances [3]. Lung cancer is divided into two types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). The division is based on growth patterns and differing medical treatments. NSCLC is subdivided into three subtypes: adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. SCLC is the most aggressive and most rapidly developing type of lung cancer. SCLC is closely related to smoking, with only 1% of all cases occurring in nonsmokers. The medical treatment for the two types of cancer differs: NSCLC is usually treated surgically, while SCLC generally responds better to chemotherapy and radiotherapy.
Lung cancer can be detected in various ways, namely history taking, physical examination, anatomic pathology examination, laboratory examination, imaging examination, special examinations, and other examinations. In general, detection of lung cancer is carried out using radiological imaging techniques. However, these techniques still yield low survival rates because they detect malignant cells only at a late stage of lung cancer [4]. Another approach to lung cancer classification uses a test image, usually known as image segmentation. Research on cancer classification using image segmentation has previously been done with Possibilistic Fuzzy C-Means clustering, K-Nearest Neighbor, and Fuzzy Clustering with Support Vector Machine. Cancer classification has improved over many years to detect cancer at an early stage; one approach is to use microarray data for lung cancer classification. Advances in molecular biology can be applied to detect the formation of cancer at earlier stages using information from RNA, DNA, and proteins. Several cancer studies using microarray data have also been done with the Fuzzy C-Means and Possibilistic C-Means methods [5]. Microarray data represent human gene expression in a specific part of the body numerically. Given some microarray data already classified manually from patient data, a machine can build a model from those data to classify other microarray data on the same problem. An important problem in classification using microarray data is that there are many features (in this case, genes) but few samples: the number of genes can reach tens of thousands while the number of samples is only in the tens or hundreds [6]. This increases data processing costs, and irrelevant features can reduce classification accuracy. Therefore, feature selection is performed before the classification process.
Feature selection methods are divided into two categories: classifier-independent and classifier-dependent. Classifier-independent methods are also called filter methods. A filter method does not depend on any classifier; that is, it does not use a classifier to determine which features to select, but assesses the relevance of features by looking only at the intrinsic properties of the data. Classifier-dependent methods are divided into two kinds, wrapper and embedded methods. The basic idea of the wrapper method is to search through the space of feature subsets, taking all features of the dataset as input. The wrapper method uses the overall performance of the classifier, measured by its prediction accuracy, to find possible interactions between variables. In this paper, feature selection uses Support Vector Machine Recursive Feature Elimination (SVM-RFE) and the Kernel Function, and classification uses the Support Vector Machine (SVM).
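As an illustration of the filter idea, the following minimal sketch ranks genes by their absolute Pearson correlation with the class label, independently of any classifier. The scoring function and the toy data are assumptions for illustration only, not a method used in this paper:

```python
import numpy as np

def filter_select(X, y, k):
    """Filter method sketch: rank features by |Pearson correlation|
    with the class label, without involving any classifier."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, 1, denom)
    return np.argsort(scores)[::-1][:k]   # indices of the k top-scoring genes

# toy data: 8 samples, 5 "genes"; gene 0 tracks the label, the rest are noise
rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
X = rng.normal(size=(8, 5))
X[:, 0] = y + 0.05 * rng.normal(size=8)   # informative feature
print(filter_select(X, y, 2))
```

A wrapper method would instead re-train the classifier on each candidate subset and keep the subset with the best prediction accuracy.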

Microarray data
This paper uses lung cancer microarray data. Microarray data contain gene expression values, called features. The data were obtained from the Kent Ridge Bio-medical Dataset, Michigan (http://datam.i2r.astar.edu.sg/datasets/krbd/). The lung cancer data to be tested have 7129 features and are divided into two classes: cancer, with 86 samples, and non-cancer, with 10 samples.

Feature selection based on Kernel Function
The kernel function was invented by Vapnik [7] and developed by Scholkopf et al. [8] and Cristianini and Shawe-Taylor [9]. A kernel function allows a linear classifier to solve a linear or non-linear problem by mapping the data into a higher-dimensional space. By Mercer's theorem, a kernel is the dot product of such a mapping φ in that space [10]:

K(x_i, x_j) = φ(x_i) · φ(x_j)    (1)

Kernel functions come in several types, such as the Gaussian and polynomial kernels. The Euclidean distance can be calculated as follows:

d(x_i, x_j) = √( Σ_{k=1}^{p} (x_{ik} − x_{jk})² )    (2)

To measure dissimilarity in feature selection using a Gaussian kernel, the kernel built on the distance in equation (2) is applied:

K(x_i, x_j) = exp( −‖x_i − x_j‖² / 2σ² )    (3)

The main idea of the kernel function as feature selection is to find a weight for each feature that optimizes an objective function J. The dataset X ∈ ℝ^{n×p}, where n is the number of samples and p is the number of genes, is first taken sample by sample as x_i, i = 1, 2, …, n. The dataset must of course be labeled to be classified, y_i ∈ C, i = 1, 2, …, n. Each class in C is treated as a cluster, so the cluster center c_j = [c_{j1}, c_{j2}, …, c_{jp}] can be calculated using equation (4):

c_j = (1/n_j) Σ_{x_i ∈ C_j} x_i    (4)

where n_j is the number of samples contained in class C_j. The dissimilarity between a sample and the center of its cluster, measured through the kernel, is written in equation (5):

D(x_i, c_j) = K(x_i, x_i) − 2 K(x_i, c_j) + K(c_j, c_j)    (5)

Using the Gaussian kernel of equation (3), for which K(x, x) = 1, the distance in equation (5) can be written as:

D(x_i, c_j) = 2 (1 − K(x_i, c_j))    (6)

The objective function is similar to the objective function of the SCAD feature selection algorithm: a sum of the per-feature dissimilarities weighted by the feature weights v_k, plus a penalty term on the weights, shown in equation (7). The objective function is minimized by making J as low as possible with respect to v_k and c_j, so the values of v_k and c_j must be updated; the update equations, obtained by minimizing J with respect to each in turn, are shown in equations (9) and (10).

Algorithm 1. Kernel function as feature selection
Step 1. Use equation (4) to find c_j and equation (5) to find the distance.
Step 2. Use equation (10) to find the value of c_j.
Step 3. Update the weight of each gene using equation (9).
Step 4. Use equation (7) to find J.
Step 5. Determine the stopping criterion from the centroids of the current iteration (t) and the previous iteration (t − 1): Δ = ‖c^{(t)} − c^{(t−1)}‖. If Δ < ε, the iteration stops; otherwise, go to Step 2.
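The idea above can be sketched as follows. This is a simplified illustration under assumptions not stated in the paper: class centers are fixed at the class means, the weight penalty is quadratic with hypothetical parameters sigma and delta, and a single closed-form weight update replaces the full iteration:

```python
import numpy as np

def gaussian_kernel_weights(X, y, sigma=1.0, delta=1.0):
    """Sketch of Gaussian-kernel feature weighting: aggregate each
    feature's Gaussian dissimilarity to its class center, then take the
    weights v minimizing sum_k v_k * D_k + delta * sum_k v_k^2 subject
    to sum_k v_k = 1, which has a closed form."""
    n, p = X.shape
    D = np.zeros(p)
    for c in np.unique(y):
        Xc = X[y == c]
        center = Xc.mean(axis=0)                    # class mean as cluster center
        diff2 = (Xc - center) ** 2                  # per-feature squared distances
        D += (2.0 * (1.0 - np.exp(-diff2 / (2 * sigma**2)))).sum(axis=0)
    v = 1.0 / p + (D.mean() - D) / (2.0 * delta)    # closed-form minimizer
    v = np.clip(v, 0.0, None)                       # keep weights non-negative
    return v / v.sum()                              # renormalize onto the simplex

# gene 0 is tight within each class, gene 1 is noisy, so gene 0 gets more weight
y = np.array([0, 0, 0, 1, 1, 1])
X = np.array([[0.0, 5.0], [0.1, -3.0], [-0.1, 4.0],
              [1.0, 0.0], [0.9, 6.0], [1.1, -2.0]])
w = gaussian_kernel_weights(X, y, sigma=1.0, delta=1.0)
print(w)
```

Genes whose values cluster tightly around their class center receive high weight; genes with large within-class spread are driven toward zero.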

Feature selection based on SVM-RFE
SVM-RFE was introduced for feature selection by Guyon in 2002.
The SVM-RFE method is a recursive feature elimination application that uses SVM weights as the ranking criterion. SVM-RFE is classified as an embedded method. SVM-RFE is widely used for gene selection, and several improvements have recently been suggested [11][12][13]. The algorithm for SVM-RFE is shown in Algorithm 2 [11]. The main principle of SVM-RFE is to eliminate the features with the lowest squared weights in each iteration. The procedure for recursive feature elimination in general is:
a. Perform classifier training to find the weight vector w
b. Calculate the ranking criteria for all features
c. Eliminate the features with the smallest ranking criterion value
Features are eliminated iteratively by backward feature elimination. The ranking score for feature k is given by the components of the weight vector w of the SVM:

c_k = (w_k)²

Support Vector Machine
The SVM was introduced by Vapnik [7]. The purpose of SVM is to find an optimal hyperplane, that is, the hyperplane with maximum margin [14]. There can be more than one hyperplane that divides the data into two classes; SVM chooses the hyperplane with the maximum margin. The primal optimization problem can be written as:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i
subject to y_i (w · x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, …, n

where the parameter C > 0 controls the trade-off between the amount of data in the wrong class (narrower margin) and generalization capability (wider margin). The value of C is not obtained in the learning process but must be determined before learning. ξ_i is the slack variable, which allows misclassification at some distance; it is formulated as ξ_i = max(0, 1 − y_i(w · x_i + b)). If ξ_i = 0 then x_i lies on or outside the margin and is classified correctly; if 0 < ξ_i ≤ 1 then x_i lies inside the margin but is still classified in the correct class; whereas if ξ_i > 1 then x_i is classified in the wrong class.
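The elimination loop can be sketched as follows, using a tiny subgradient-descent linear SVM as a stand-in for a full SVM solver (the training routine, learning rate, and epoch count are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Tiny linear SVM via subgradient descent on the primal objective
    (1/2)||w||^2 + C * sum of hinge losses; labels y must be +/-1."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # margin violators drive the gradient
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_rfe(X, y, n_keep):
    """Backward elimination: retrain, then drop the feature with the
    smallest ranking criterion c_k = w_k^2, until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w, _ = train_linear_svm(X[:, remaining], y)
        remaining.pop(int(np.argmin(w ** 2)))
    return remaining

rng = np.random.default_rng(1)
y = np.repeat([-1.0, 1.0], 20)
X = rng.normal(size=(40, 6))
X[:, 2] += 2.0 * y                             # feature 2 carries the class signal
print(svm_rfe(X, y, 2))
```

The original SVM-RFE formulation can remove chunks of features per iteration for speed; removing one at a time, as here, is the simplest variant.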

Performance measures
The ability of a classification model is measured by the amount of data classified correctly or incorrectly, which can be presented in more detail using a confusion matrix. The confusion matrix for this research is shown in table 1 [15].
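A small sketch of how accuracy is read off a 2x2 confusion matrix; the layout [[TP, FN], [FP, TN]] and the label convention 1 = cancer are assumptions for illustration:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix [[TP, FN], [FP, TN]], with 1 = cancer."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return np.array([[tp, fn], [fp, tn]])

y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1])
cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()     # (TP + TN) / total = 4/6 here
print(cm, accuracy)
```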

Data overview
The data consist of 7129 genes, each with 96 samples, of which 86 are cancer and 10 are non-cancer. The columns of the data represent the features, or genes, associated with lung cancer. The rows represent the patient samples: rows 1-86 are cancer patients and the remaining rows are non-cancer patients. The form of the data used in the cancer classification process is shown in table 2.

Results
In this research, we first create new datasets by selecting the best 10, 15, 25, 50, 75, 100, 150, 200, 500, and 1000 features and classify each dataset. After that, classification without feature selection, i.e., using all genes in the data, is done. The SVM is evaluated by k-fold cross validation with k = 6. The results are shown in table 3.
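The 6-fold evaluation can be sketched as follows; a 1-nearest-neighbour predictor stands in for the SVM classifier here purely for illustration:

```python
import numpy as np

def kfold_accuracy(X, y, train_and_predict, k=6, seed=0):
    """Plain k-fold cross validation: shuffle sample indices, split
    into k folds, train on k-1 folds, test on the held-out fold, and
    average the per-fold accuracies."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = train_and_predict(X[train], y[train], X[test])
        accs.append(np.mean(y_pred == y[test]))
    return float(np.mean(accs))

# hypothetical classifier: 1-nearest-neighbour stand-in for the SVM
def nn_predict(Xtr, ytr, Xte):
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    return ytr[np.argmin(d, axis=1)]

y = np.repeat([0, 1], 12)
X = y[:, None] + 0.1 * np.random.default_rng(1).normal(size=(24, 1))
print(kfold_accuracy(X, y, nn_predict, k=6))
```

Each sample is used for testing exactly once, so the averaged accuracy is less sensitive to any single train/test split.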

Conclusions
From the results and testing, it can be seen that SVM using SVM-RFE feature selection reached 96.7899% accuracy when using the 5 best features; this is slightly superior to the SVM method without feature selection and to feature selection based on the Kernel Function, in particular the Gaussian kernel. Thus, SVM with SVM-RFE feature selection can be said to be better than SVM without feature selection and SVM with Gaussian Kernel feature selection. In addition, the average accuracy using the Gaussian kernel is below the average accuracy using SVM-RFE. The authors suggest that other classification methods and other kernel types could also be investigated in future work.