KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA

Kernel Logistic Regression (KLR) is one of the statistical models that has been proposed for classification in the machine learning and data mining communities, and one of the effective methodologies among kernel-machine techniques. Basically, KLR is a kernelized version of linear Logistic Regression (LR). Unlike LR, KLR can classify data with a nonlinear boundary and can accommodate data with very high dimensionality and very few instances. In this research, we study the use of the Linear Kernel in KLR in order to increase the accuracy of leukemia classification. Leukemia is one of the cancer types that cause mortality, and improving the accuracy of leukemia classification is essential for more effective diagnosis and treatment of the disease. The leukemia data set consists of DNA microarray measurements of 7129 genes (very high dimensional) for 72 patient samples (very few instances), labelled by leukemia type. In leukemia classification based on gene expression, monitoring data from DNA microarrays offer hope of achieving an objective and highly accurate classification. We demonstrate that using the Linear Kernel in Kernel Logistic Regression (KLR-Linear) improves performance in classifying leukemia patient samples, and that KLR-Linear has better accuracy than KLR-Polynomial and Penalized Logistic Regression.

In the last decade, it was found that the use of a classifier system is one of the most important factors in cancer diagnosis and treatment, alongside the evaluation of data taken from the patient and the decisions of medical experts [1]. A classification system can achieve objective and highly accurate cancer classification by minimizing errors due to fatigued or inexperienced experts.
As many authors have pointed out, problem domains such as medical diagnosis require transparent (interpretable) reasoning as well as an accurate classification method [2]. The KLR approach is particularly well suited to this type of situation. Kernel Logistic Regression (KLR) is one of the classification methods in the machine learning and data mining communities that can explain the reasoning behind a classification decision, because it provides the probability of class membership.
Trust in a system is developed by a clear description of how its results were derived (transparency/interpretability) and by the quality of those results (accuracy). In this research, we propose to use the Linear Kernel in Kernel Logistic Regression (KLR) in order to improve on the accuracy of KLR-Polynomial in classifying leukemia patient samples. Hsu et al. [4] suggest using the Linear Kernel when the number of features is very large. If the number of features is large, one may not need to map the data to a higher dimensional space; that is, a nonlinear mapping (such as the Polynomial Kernel) does not improve the performance. Hence, using the Linear Kernel is good enough.
This paper is organized as follows. In Section 2, we describe KLR, the underlying theory, and the design of the experiment. Section 3 reports the numerical results of the experiment, and we conclude in Section 4.

KERNEL LOGISTIC REGRESSION
Kernel Logistic Regression (KLR), a nonlinear form of Logistic Regression (LR), can be achieved via the so-called "kernel trick", whereby a familiar LR model is developed in a high-dimensional feature space induced by a Mercer kernel.

Logistic Regression
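Suppose we have a classification problem with $c$ classes ($c \geq 2$) and a training set $\{(x_i, y_i)\}_{i=1}^{n}$ of $n$ independent and identically distributed (i.i.d.) input samples $x_i \in \mathbb{R}^d$ with corresponding labels $y_i$. The problem of classification consists of assigning an input vector $x$ to one of the $c$ class labels. In Logistic Regression, we define a linear discriminant function, or logit model, for class $k$ as [5]

$$g_k(x_i) = \beta_k^T x_i, \qquad k = 1, \dots, c. \tag{1}$$

The conditional or posterior probability that $x_i$ belongs to class $k$ via the linear discriminant function is written as

$$P(y_i = k \mid x_i) = \frac{\exp\big(g_k(x_i)\big)}{\sum_{j=1}^{c} \exp\big(g_j(x_i)\big)}. \tag{2}$$

The class membership of a new point $x$ is given by the classification rule that assigns $x$ to the class with the largest posterior probability.

Consider a binary, or two-class, problem with labels $y_i \in \{0, 1\}$. The success probability of the sample $x_i$ belonging to class 1 ($y_i = 1$) is $P(y = 1 \mid x)$, and the probability that it belongs to class 0 ($y_i = 0$) is $P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$. Then, based on equation (1), we define a linear discriminant function (logit model) for the two-class problem as

$$g(x) = \ln \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta^T x, \tag{3}$$

where $\beta$ denotes the weight vector of size $(d + 1) \times 1$ including the intercept, and each input vector is augmented so that its first element is 1. Via the logit model in equation (3), we can write the posterior probabilities of class membership as

$$P(y = 1 \mid x) = \pi(x) = \frac{1}{1 + \exp(-\beta^T x)} \tag{4}$$

and

$$P(y = 0 \mid x) = 1 - \pi(x) = \frac{1}{1 + \exp(\beta^T x)}. \tag{5}$$

The logit link function constrains the output of the model to lie in the interval $(0, 1)$. Assuming the labels $y_i$ represent an i.i.d. sample drawn from a Bernoulli distribution conditioned on the input vectors, the likelihood of the data is given by

$$p(y_1, \dots, y_n \mid x_1, \dots, x_n; \beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} \big(1 - \pi(x_i)\big)^{1 - y_i}. \tag{6}$$

The optimal model parameters $\beta$ are then determined by maximizing the conditional log-likelihood or, equivalently, by minimizing the negative logarithm of the likelihood,

$$L(\beta) = -\sum_{i=1}^{n} \Big[ y_i \ln \pi(x_i) + (1 - y_i) \ln\big(1 - \pi(x_i)\big) \Big]. \tag{7}$$

We wish to solve the equation system $\partial L(\beta) / \partial \beta_j = 0$ in order to find the optimal weight vector $\beta$. Since the $\pi(x_i)$ depend nonlinearly on $\beta$, this system cannot be solved analytically and an iterative technique must be applied. The optimal model parameters can be found using Newton's method or, equivalently, an iteratively re-weighted least squares procedure [6].

The Newton-Raphson Method
Let $X$ denote the $n \times (d + 1)$ matrix whose $i$-th row is $x_i^T$, let $y = (y_1, \dots, y_n)^T$, and let $\pi^{(t)}$ be the vector of probabilities $\pi(x_i)$ evaluated at the current estimate $\beta^{(t)}$. The Newton-Raphson update is

$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} \nabla L(\beta^{(t)}), \tag{8}$$

where the Hessian is

$$H = \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = X^T W^{(t)} X, \tag{9}$$

and the gradient of $L$ is

$$\nabla L(\beta^{(t)}) = X^T \big(\pi^{(t)} - y\big). \tag{10}$$

Newton's method can be restated as an Iteratively Re-weighted Least Squares (IRLS) problem [7].

Iteratively Re-weighted Least Squares Procedure
Forming a variable that states a generalized linear model [8],

$$Z^{(t)} = X \beta^{(t)} + \big(W^{(t)}\big)^{-1} \big(y - \pi^{(t)}\big), \tag{11}$$

the normal equations of the least squares problem with input matrix $(W^{(t)})^{1/2} X$ and dependent variables $(W^{(t)})^{1/2} Z^{(t)}$ can be written as

$$\big(X^T W^{(t)} X\big) \beta^{(t+1)} = X^T W^{(t)} Z^{(t)}. \tag{12}$$

At each iteration, the model parameters are given by the solution of a weighted least-squares problem, such that

$$\beta^{(t+1)} = \big(X^T W^{(t)} X\big)^{-1} X^T W^{(t)} Z^{(t)}, \tag{13}$$

with

$$W^{(t)} = \mathrm{diag}(w_1, \dots, w_n), \qquad w_i = \pi(x_i)\big(1 - \pi(x_i)\big), \tag{14}$$

and

$$z_i = x_i^T \beta^{(t)} + \frac{y_i - \pi(x_i)}{w_i}. \tag{15}$$

The algorithm proceeds iteratively, updating the weights according to (13) and then updating $W$ and $Z$ according to (14) and (15), until convergence is achieved.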
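The IRLS procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the convergence tolerance, the clipping of the weights, and the toy data are all hypothetical. Note that when the number of features exceeds the number of samples the unpenalized system is singular, so this plain form only applies to small, non-separable problems.

```python
import numpy as np

def sigmoid(a):
    # Logistic function 1 / (1 + exp(-a)), written via tanh so that it
    # stays finite for large |a|.
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def fit_logistic_irls(X, y, max_iter=50, tol=1e-10):
    """Binary logistic regression by IRLS.

    X: (n, d+1) design matrix whose first column is all ones.
    y: (n,) array of 0/1 labels.
    Returns the weight vector beta of length d+1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                              # linear predictor X beta
        pi = sigmoid(eta)                           # current probabilities
        w = np.clip(pi * (1.0 - pi), 1e-12, None)   # diagonal of W
        z = eta + (y - pi) / w                      # working response Z
        # Weighted least squares step: (X^T W X) beta = X^T W Z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Toy, non-separable data (hypothetical, for illustration only).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logistic_irls(X, y)
# Gradient of the negative log-likelihood; it should vanish at the optimum.
grad = X.T @ (sigmoid(X @ beta_hat) - y)
print(beta_hat, grad)
```

At convergence the gradient of the negative log-likelihood is numerically zero, which is a convenient sanity check on any implementation of the update.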

Kernelized Logistic Regression
Consider Kernel Logistic Regression, a nonlinear form of Logistic Regression. Logistic Regression is a linear classifier, a well-known classification method in statistical learning, and the model of choice in many problem domains. However, it is limited in its ability to classify data with nonlinear boundaries [8]. Kernel Logistic Regression can overcome this limitation via the so-called "kernel trick". The kernel trick [9,10,11] provides a general mechanism for constructing nonlinear generalizations of the familiar linear Logistic Regression by mapping the original data into a high-dimensional Hilbert space $F$, usually called the feature space, and then using linear pattern analysis to detect relations in the feature space. The mapping is performed implicitly by specifying the inner product between each pair of data points. The inner product in the feature space is often much more easily computed than the coordinates of the points (notably when the dimensionality of the feature space is high). The linear nature of the underlying model means that the parameters of a kernel model are typically given by the solution of a convex optimization problem [12] with a single global optimum, for which efficient algorithms exist. Given an input space $\mathcal{X}$ and a feature space $F$, we consider a function $\phi : \mathcal{X} \to F$. The Kernel Logistic Regression model implements the well-known linear Logistic Regression model in the feature space (where it appears as a nonlinear model in the input space). Following the logit model in equation (3), after mapping into the feature space the logit model can be written as

$$g(x) = \beta^T \phi(x), \tag{16}$$

where $\phi(\cdot)$ represents a nonlinear mapping of the original data into the feature space.
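To make the remark about inner products concrete, the following small sketch compares an explicit feature map with the corresponding kernel evaluation. The homogeneous degree-2 polynomial kernel on two-dimensional inputs is chosen here purely for illustration; it is not a kernel used in this paper, and the names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel
    # on 2-D inputs: phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # The same inner product, computed directly in the input space.
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in feature space
implicit = poly2_kernel(x, z)              # kernel evaluation in input space
print(explicit, implicit)
```

Both routes give the same number up to rounding, but the kernel evaluation never forms the feature vectors, which is what makes high-dimensional feature spaces tractable.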

Kernel Function
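Rather than defining the feature space explicitly, it is instead defined by a kernel function that evaluates the inner product between the images of input vectors in the feature space,

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \phi(x_i)^T \phi(x_j). \tag{17}$$

For the interpretation of the kernel function as an inner product in a fixed feature space to be valid, the kernel must obey Mercer's condition [13], that is, the kernel must be positive (semi-)definite. The usual choices of kernel function are

$$K(x_i, x_j) = x_i^T x_j \ \text{(linear)}, \qquad K(x_i, x_j) = \big(x_i^T x_j + 1\big)^p \ \text{(polynomial of degree } p\text{)}, \qquad K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \ \text{(radial basis function, RBF)}, \tag{18}$$

where $\sigma$ is a tuning parameter. In this work, we use the Linear Kernel, as suggested by Hsu et al. [4].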
In general, there are two situations in which the Linear Kernel is used. The first is when the number of features is much larger than the number of instances. The second is when both the number of features and the number of instances are large.
In other situations, Hsu et al. [4] suggest that the RBF Kernel is a reasonable first choice. The RBF Kernel has fewer numerical difficulties and, unlike the Linear Kernel, can handle the case where the relation between class labels and attributes is nonlinear.
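The three kernel families discussed in this section can be sketched as functions that return a kernel (Gram) matrix. The parameterizations below (a unit offset in the polynomial kernel and the $2\sigma^2$ scaling in the RBF kernel) are common conventions assumed for illustration, and the random data are hypothetical. With far fewer samples than features, the kernel matrix is only $n \times n$ regardless of the number of features.

```python
import numpy as np

def linear_kernel(A, B):
    # K[i, j] = <a_i, b_j>
    return A @ B.T

def polynomial_kernel(A, B, degree=2, offset=1.0):
    # K[i, j] = (<a_i, b_j> + offset) ** degree  (assumed parameterization)
    return (A @ B.T + offset) ** degree

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))  (assumed scaling)
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

# Hypothetical data with few samples and many features (n << d).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))

K_lin = linear_kernel(X, X)
K_poly = polynomial_kernel(X, X, degree=2)
K_rbf = rbf_kernel(X, X, sigma=10.0)
print(K_lin.shape, K_poly.shape, K_rbf.shape)
```

Each matrix is symmetric and positive semi-definite, as Mercer's condition requires.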

Kernel Logistic Regression Modelling
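When constructing a statistical model in a high-dimensional space, it is necessary to take steps to avoid overfitting the training data, that is, to impose a penalty on large fluctuations of the estimated parameters $\beta$. The most popular method is the ridge penalty $\frac{\lambda}{2} \lVert \beta \rVert^2$ introduced by [14]. As a result, the Kernel Logistic Regression model is trained by adding a quadratic regularizer to the negative log-likelihood,

$$L(\beta)_{ridge} = -\sum_{i=1}^{n} \Big[ y_i \ln \pi(x_i) + (1 - y_i) \ln\big(1 - \pi(x_i)\big) \Big] + \frac{\lambda}{2} \lVert \beta \rVert^2, \tag{19}$$

where $\pi(x_i) = 1 / \big(1 + \exp(-\beta^T \phi(x_i))\big)$ and $\lambda$ is a regularization parameter that must be set in order to obtain a good bias-variance trade-off and avoid overfitting [6,15]. Furthermore, minimizing $L(\beta)_{ridge}$ is a convex optimization problem. The representer theorem [10,16] states that the solution of the optimization problem (19) can be written in the form of an expansion over the training patterns (with $x_i$ replaced by $\phi(x_i)$),

$$\beta = \sum_{i=1}^{n} \alpha_i \phi(x_i), \tag{20}$$

and so from (17) we have the so-called kernel machine,

$$g(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x). \tag{21}$$

With the usual kernel trick, the inner product can be substituted by kernel functions satisfying Mercer's condition. Substituting the expansion of $\beta$ in (20) into the normal equations (12), with the ridge penalty of (19) included, leads to the nonlinear generalization of Logistic Regression in kernel feature spaces, which we call Kernel Logistic Regression [8]. We can write

$$\big(K + \lambda W^{-1}\big) \alpha^{(t+1)} = Z^{(t)}, \qquad Z^{(t)} = K \alpha^{(t)} + W^{-1} \big(y - \pi^{(t)}\big), \tag{22}$$

where $K$ is the $n \times n$ kernel matrix with entries $K_{ij} = K(x_i, x_j)$ and $W = \mathrm{diag}(w_1, w_2, \dots, w_n)$ is a diagonal weight matrix with non-zero elements given by $w_i = \pi(x_i)\big(1 - \pi(x_i)\big)$. Like LR, KLR also produces the posterior probability of class membership. Bartlett and Tewari (2004) [17] proved that KLR can be used to estimate all conditional probabilities.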
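A minimal sketch of ridge-penalized KLR trained by IRLS in the expansion coefficients $\alpha$ is given below. It assumes the commonly used dual update $(K + \lambda W^{-1})\alpha = Z$ with $Z = K\alpha + W^{-1}(y - \pi)$, fits no separate intercept, and uses synthetic data with an arbitrary $\lambda$; whether this matches the authors' procedure in every detail is an assumption. The names `fit_klr` and `predict_proba` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    # Logistic function 1 / (1 + exp(-a)), stable for large |a|.
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def fit_klr(K, y, lam, max_iter=100, tol=1e-8):
    """Kernel logistic regression by IRLS in the dual coefficients alpha.

    K: (n, n) kernel matrix, y: (n,) array of 0/1 labels,
    lam: ridge parameter lambda. No intercept term is fitted.
    """
    alpha = np.zeros(K.shape[0])
    for _ in range(max_iter):
        eta = K @ alpha                              # g(x_i) for all training points
        pi = sigmoid(eta)
        w = np.clip(pi * (1.0 - pi), 1e-10, None)    # diagonal of W
        z = eta + (y - pi) / w                       # working response Z
        # Dual IRLS step: (K + lam * W^{-1}) alpha = Z
        alpha_new = np.linalg.solve(K + lam * np.diag(1.0 / w), z)
        converged = np.max(np.abs(alpha_new - alpha)) < tol
        alpha = alpha_new
        if converged:
            break
    return alpha

def predict_proba(K_new, alpha):
    # K_new[i, j] = K(new point i, training point j); returns P(y = 1 | x).
    return sigmoid(K_new @ alpha)

# Synthetic data with more features than samples (hypothetical).
rng = np.random.default_rng(0)
n, d = 20, 50
y = np.repeat([0.0, 1.0], n // 2)
X = rng.standard_normal((n, d)) + 0.5 * y[:, None]

lam = 1.0                       # arbitrary value for this illustration
K = X @ X.T                     # linear kernel
alpha = fit_klr(K, y, lam)
p = predict_proba(K, alpha)
# Gradient of the penalized objective with respect to alpha.
grad = K @ (p - y) + lam * (K @ alpha)
print(np.max(np.abs(grad)))
```

Because only the $n \times n$ kernel matrix enters the update, the cost per iteration depends on the number of samples rather than the number of features, which is the regime of the leukemia data.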

Experiment Data Description
The leukemia data set used in this study comes from http://www.genome.wi.mit.edu/cancer. It consists of 72 samples of two types of acute leukemia, Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL) [16]. Each sample is a vector of expression values for 7129 genes. The leukemia patients (1 = AML, 2 = ALL) are classified according to those genes. Of the 72 samples, 25 are AML and the remaining 47 are ALL.

Methodology
The goal of this experiment is to study the classification performance of KLR in classifying leukemia patient samples. To achieve this goal, the data set was evaluated with the k-fold cross validation (cv) method [1]. k-fold cross validation is one way to improve on the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k − 1 subsets are put together to form the training set. Then the average error across all k trials is computed. In this work, we use 10-fold cross validation.
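The splitting scheme described above can be sketched as follows. The source does not say whether the folds were shuffled or stratified by class, so the random shuffle, the seed, and the callback `fit_and_score` are assumptions made for this illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    # Shuffle the sample indices once, then cut them into k nearly equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average a per-fold error over k train/test splits.

    fit_and_score(train_idx, test_idx) is a user-supplied (hypothetical)
    callback that trains on train_idx and returns the error on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errors.append(fit_and_score(train_idx, test_idx))
    return float(np.mean(errors))

folds = k_fold_indices(72, k=10)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)
```

With 72 samples and k = 10 the folds cannot all be the same size, so each test set holds seven or eight samples.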
In this experiment, we preprocessed the data set so that the mean is 0 and the standard deviation is 1, and used the linear kernel with λ = 0.06 [3] to perform the classification task. Then, in order to assess the performance of KLR-Linear, we compared its results with those of KLR-Polynomial [2] and Penalized Logistic Regression with RFE, PLR-RFE [3].
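The preprocessing step can be sketched as a per-feature standardization. The source does not state whether the statistics were computed per gene or per sample, or on the full data set or on each training fold; the version below assumes per-feature statistics estimated on the training portion only, and the function name and data are hypothetical.

```python
import numpy as np

def standardize(train, test):
    # Estimate mean and standard deviation per feature on the training data
    # and apply the same transformation to the test data.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std > 0, std, 1.0)   # constant features are mapped to zero
    return (train - mean) / std, (test - mean) / std

# Hypothetical data: 30 training and 6 test samples with 4 features.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(30, 4))
train[:, 3] = 2.0                       # a constant feature
test = rng.normal(loc=5.0, scale=3.0, size=(6, 4))

train_s, test_s = standardize(train, test)
print(train_s.mean(axis=0), train_s.std(axis=0))
```

Using training-fold statistics for the test fold keeps information from the held-out samples out of the fitted model.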

Performance Evaluation Method
We used four indicators to evaluate the classification performance of leukemia diagnosis. Three of them (accuracy, sensitivity, and specificity) are based on the confusion matrix, and the fourth is the area under the Receiver Operating Characteristic (ROC) curve.

Confusion Matrix
A confusion matrix [1] contains information about the actual and predicted classifications made by a classification system. Table 1 shows the confusion matrix for a two-class problem. Taking AML as the positive class, let TP and FN denote the numbers of AML cases classified as AML and as ALL, respectively, and let TN and FP denote the numbers of ALL cases classified as ALL and as AML, respectively.
i) The accuracy is the proportion of the total number of predictions that were correct. It is determined using the equation

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

ii) The sensitivity is the proportion of AML cases that were correctly identified. It is calculated using the equation

$$\text{sensitivity} = \frac{TP}{TP + FN}.$$

iii) The specificity is the proportion of ALL cases that were correctly classified as ALL. It is calculated using the equation

$$\text{specificity} = \frac{TN}{TN + FP}.$$

Receiver Operating Characteristics Curve
A Receiver Operating Characteristic (ROC) curve [18] shows the relationship between False Positives (FP) and True Positives (TP). In the ROC curve, the horizontal axis is the false positive rate (the percentage of FP) and the vertical axis is the true positive rate (the percentage of TP) for a database sample. The final performance in this work is assessed using the area under the ROC curve (AUROC).
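The indicators described above can be computed as in the following sketch. It takes AML (label 1) as the positive class, as in the text, and computes the area under the ROC curve through the equivalent rank statistic (the fraction of positive/negative pairs ordered correctly, with ties counted as one half) rather than by drawing the curve. The function names and the small example are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false negatives, false positives, true negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def auroc(y_true, scores, positive=1):
    # Probability that a random positive scores higher than a random negative.
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = AML (positive), 2 = ALL (negative).
y_true = [1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 2, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.7]   # predicted P(AML)

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fn, fp, tn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
auc = auroc(y_true, scores)
print(tp, fn, fp, tn, acc, sens, spec, auc)
```

The rank-based form gives the same value as integrating the empirical ROC curve with the trapezoidal rule, and it needs only the predicted probabilities that KLR already provides.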

RESULT
In this experiment, we create a confusion matrix based on the classification predictions obtained by applying KLR-Linear to the leukemia patient samples, and calculate the total accuracy, sensitivity, and specificity of the predictions. Then, we draw the ROC curve and calculate the area under it (AUROC). The confusion matrix is shown in Table 2. At the same time, we compare the results of KLR-Linear with previous research that used KLR-Polynomial [2]. According to Table 2, the ALL cases were perfectly identified (47/47; 100%) by both KLR-Linear and KLR-Polynomial, while 92% (23/25) of the AML cases were correctly classified by KLR-Linear. The resulting ROC curve is shown in Figure 1.
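As a quick check on the counts quoted above, and assuming they are pooled over all 72 samples with AML taken as the positive class (so that TP = 23, FN = 2, TN = 47, FP = 0), the three confusion-matrix indicators for KLR-Linear work out as follows. This is only arithmetic on the quoted counts; the values reported in Table 3 remain the reference.

```latex
\begin{align*}
\text{accuracy}    &= \frac{23 + 47}{72} = \frac{70}{72} \approx 0.972, \\
\text{sensitivity} &= \frac{23}{23 + 2} = 0.92, \\
\text{specificity} &= \frac{47}{47 + 0} = 1.
\end{align*}
```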
Table 3 summarizes the classification performance indicators obtained by applying KLR (Polynomial and Linear) to the leukemia patient samples, according to the confusion matrix and ROC curve above.
All indicators (accuracy, sensitivity, specificity, and AUROC) of KLR-Linear take high values. The comparison in Table 3 indicates that the classification performance of KLR-Linear is better than that of KLR-Polynomial.
In addition, we compared the accuracy of KLR (Linear and Polynomial) in classifying leukemia patients with that of PLR-RFE. The result shows that the accuracy of KLR-Linear is higher than that of KLR-Polynomial and PLR-RFE.

CONCLUSION
We have proposed Kernel Logistic Regression with the Linear Kernel (KLR-Linear) for classifying leukemia patient samples, a high-dimensional data problem. The results show that KLR-Linear has better classification performance than KLR-Polynomial and PLR-RFE.

Table 1: Representation of Confusion Matrix

Table 3: The result of performance evaluation

Table 4: Comparison of leukemia classification methods