An Intrusion Detection System based on Support Vector Machine using Hierarchical Clustering and Genetic Algorithm

ISSN: 2321-2381 © 2015 | Published by The Standard International Journals (The SIJ)

Abstract: This study proposes an SVM-based IDS that combines a genetic algorithm (GA), hierarchical clustering and the support vector machine (SVM) technique. GA is used to preprocess the KDD Cup 1999 data set before SVM training. The proposed system reduces the training time and also achieves better classification of various types of attacks. GA selects the important features, and the hierarchical clustering algorithm provides a high-quality, abstracted and reduced data set to the SVM for training. The system aims to increase the detection accuracy for probe and U2R attacks. The system is implemented in MATLAB.


I. INTRODUCTION
As the use of the Internet grows day by day, its security has become a focus of current research. Much attention has been paid to the Intrusion Detection System (IDS), which is closely linked to the safe use of network services. A Network Intrusion Detection System (NIDS), an important link in the network security infrastructure, aims to detect malicious activities such as denial-of-service attacks, port scans, or attempts to break into computers, by monitoring network traffic. A common problem of NIDS is that it detects only known service or network attacks by using pattern-matching approaches; this is called misuse detection. An anomaly detection system, on the other hand, first builds profiles of normal behavior and then identifies potential attacks when observed behavior deviates significantly from those profiles.
Many researchers have applied data mining techniques in the design of NIDS. One promising technique is the Support Vector Machine (SVM), whose solid mathematical foundations have provided satisfying results [Khan et al.,7]. An SVM separates data into classes (at least two) by a hyperplane, simultaneously minimizing the empirical classification error and maximizing the geometric margin. It is therefore also known as a maximum margin classifier. A hierarchical clustering algorithm can be used to produce fewer, significant instances from a very large dataset. With fewer significant instances, the SVM can achieve shorter training time and better classification performance.
An intrusion is unauthorized access to or use of computer system resources. Intrusion detection systems are software that detects, identifies and responds to unauthorized or abnormal activities on a target system. The major functions performed by intrusion detection systems are: (1) monitor and analyze user and system activities, (2) assess the integrity of critical system and data files, (3) recognize activity patterns reflecting known attacks, (4) respond automatically to detected activities, and (5) report the outcome of the detection process [Burbeck & Simmin, 3]. Intrusion detection can broadly be classified into misuse detection and anomaly detection. Misuse detection identifies intrusions by using well-known attack patterns or vulnerable spots in the system. Anomaly detection recognizes as intrusions those activities that deviate from the established normal pattern. Misuse detection achieves a low false positive rate but cannot detect even minor variations of known attacks, whereas anomaly detection can detect novel attacks at the cost of a high false positive rate. An ideal intrusion detection system would have a high attack detection rate along with a 0% false positive rate. In practice, such a low false positive rate is achieved only at the expense of ignoring minor malicious activity. Because the two approaches are complementary, many systems combine them, using misuse detection as the first line of defence and anomaly detection as the second.
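The contrast between the two detection styles can be illustrated with a minimal Python sketch. It is not part of the proposed system: the signature strings, the traffic values and the threshold `k` are hypothetical choices made only for illustration.

```python
from statistics import mean, stdev

# Hypothetical signatures of known attacks (illustrative only).
KNOWN_SIGNATURES = ("/etc/passwd", "' OR 1=1")


def misuse_detect(payload: str) -> bool:
    """Flag a payload only if it contains a known attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def build_profile(normal_values):
    """Summarise normal behaviour as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def anomaly_detect(value, profile, k=3.0):
    """Flag a value that lies more than k standard deviations from normal."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


# Connections per second seen during normal operation (made-up numbers).
profile = build_profile([10, 12, 11, 9, 13, 10, 11])

print(misuse_detect("GET /index.html"))        # no known pattern present
print(misuse_detect("GET /../../etc/passwd"))  # contains a known signature
print(anomaly_detect(11, profile))             # close to the normal profile
print(anomaly_detect(250, profile))            # far from the normal profile
```

The signature check never flags an attack that is missing from its list, while the profile check flags anything unusual, including harmless outliers. This is the trade-off between false negatives and false positives described above.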
Most intrusion detection systems take either a network-based or a host-based approach to recognizing and detecting attacks. A network-based intrusion detection system performs traffic analysis on a local area network. A host-based intrusion detection system places its reference monitor in the kernel/user layer and watches for anomalies in the system call patterns. The advantages of network-based intrusion detection systems are that they have no processing impact on the monitored hosts, can observe network-level events, and can monitor an entire segment at once. However, as the complexity and capacity of networks increase, the performance requirements for probes can become prohibitive [Chen et al.,21]. Host-based intrusion detection systems can analyze all activities on the host, including its own network activities. Unfortunately, this approach implies a performance impact on every monitored system [Verwoerd & Hunt,20].
This study proposes an intrusion detection system based on SVM, hierarchical clustering and a genetic algorithm. The genetic algorithm is used to eliminate unimportant features from the training set so that the resulting SVM model can classify the network traffic data more accurately. The hierarchical clustering algorithm represents the KDD Cup 1999 data set with fewer abstracted data points than the whole data set contains. The system can therefore greatly reduce the training time and achieve better detection performance in the resulting SVM classifier. The rest of this paper is organized as follows. Section 2 presents hierarchical clustering, the genetic algorithm and SVM. Section 3 describes the proposed system. Section 4 presents the experimental results. Finally, Section 5 gives the conclusion.

II. RELATED WORKS
Fuzzy Rough C-Means (FRCM) exploits the advantages of fuzzy set theory and rough set theory for network intrusion detection [Chimphlee et al.,4]. Another study analysed the performance of a comprehensive set of pattern recognition and machine learning algorithms; the resulting system, which combined several classifiers, each designated for one type of attack in the KDD Cup 1999 dataset, outperformed the KDD Cup 1999 winner's system [Patcha & Park,12]. Another fuzzy approach combined a neuro-fuzzy network, a fuzzy inference approach and genetic algorithms to design a NIDS, which was evaluated on the KDD Cup 1999 dataset [Chen et al.,21]. Some researchers proposed a security vulnerability evaluation and patch framework, which enables evaluation of computer programs installed on hosts to detect known vulnerabilities. Intruders can bypass preventive security tools; thus, a second level of defence is necessary, constituted by tools such as anti-virus software and Intrusion Detection Systems (IDS) [Fei et al.,6]. The implementation of SVMs requires the specification of the trade-off constant C as well as the type of the kernel function K. The choice of these parameters depends on the training data, and consequently the set of independent variables (attributes) that enter the analysis is also an issue. The methodology of [Satsiou et al.,2] provides a framework to specify these parameters in an integrated context using GAs. An ideal intrusion detection system is one that has a high attack detection rate along with a 0% false positive rate. However, such a low rate of false positives is achieved only at the expense of ignoring minor malicious activity. This provides an attacker with a small window of opportunity to perform arbitrary behaviors, giving them insight into the type of intrusion detection system in use [Toosi & Kahani,19]. Many recent approaches to intrusion detection systems utilize data mining techniques [Lam et al.,9].
These approaches build detection models by applying data mining techniques to large data sets of audit trails collected by a system [Helmer & Liepins,10]. At present, data mining algorithms applied to intrusion detection mainly follow four basic patterns: association, sequence, classification and clustering [Sulaimana & Muhsinb,11]. Building an IDS with a small number of false positives is an extremely difficult task. Two orthogonal and complementary approaches have been presented to reduce the number of false positives in intrusion detection by using alert postprocessing. The basic idea is to use existing IDSs as an alert source and then apply either off-line (using data mining) or on-line (using machine learning) alert processing to reduce the number of false positives [Pietraszek & Tanner,18].

Support Vector Machine
An SVM is a supervised learning method which performs classification by constructing an N-dimensional hyperplane that optimally separates the data into different categories. In the basic case, an SVM classifies the data into two categories. Given a training set of labeled pairs {(x, y)}, where y is the label of instance x, the SVM works by maximizing the margin to obtain the best classification performance [Patcha & Park,12]. The Support Vector Machine is a popular learning technique due to its high accuracy and performance in solving both regression and classification tasks. However, training an SVM is computationally expensive, because most of the time is spent solving the underlying optimization problem. Much research has been carried out to reduce the training time, for example by chunking the problem or by decomposition approaches using iterative methods. SVM originates from the structural risk minimization (SRM) principle, which minimizes the generalization error, i.e., the true error on unseen examples. An SVM separates the data of two classes by a hyperplane defined by a number of support vectors. These support vectors are the subset of the training data used to define the boundary between the two classes. If the SVM cannot separate the data into two classes in the original space, it projects the data into a high-dimensional feature space by using a kernel function. In this high-dimensional feature space, a hyperplane can be found that separates the data linearly. The kernel function is very important in SVM, as it helps in finding the hyperplane and the support vectors. Various kernel functions can be used, such as linear, polynomial or Gaussian.
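The three kernel families named above, and the way a trained SVM uses its support vectors to classify a new point, can be sketched in plain Python. This is an illustrative sketch and not the authors' MATLAB implementation; the parameter names (`degree`, `coef0`, `gamma`) and the two-point toy model are assumptions.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (dot(x, z) + coef0) ** degree


def gaussian_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b; the class is the sign of f(x)."""
    return sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias


# Toy model: two support vectors, one per class (hand-picked, not trained).
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [+1, -1]
alphas = [0.25, 0.25]

print(decision_value((2.0, 0.0), svs, ys, alphas, 0.0, linear_kernel))   # positive side
print(decision_value((-3.0, 1.0), svs, ys, alphas, 0.0, linear_kernel))  # negative side
```

Only the kernel evaluations against the support vectors are needed at prediction time. This is why reducing the training set to fewer, representative instances can shorten training without changing how the classifier is applied.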

Genetic Algorithm
The implementation of SVMs requires the specification of the kernel function K. The choice of this parameter depends on the training data, and consequently the set of independent variables (attributes) that enter the analysis is also an issue. The proposed methodology provides a framework to specify these parameters in an integrated context using GAs. The first step in implementing a GA is to specify an appropriate coding for each possible solution. In the context considered in this study, each solution is defined by the attributes used for model development and the parameter required to define the kernel function. GA is used to select the appropriate features from the large data set.
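One possible coding of a solution, given here as a hedged sketch, is a bit mask over the attributes followed by one gene for the kernel parameter. The paper does not give its exact encoding, operators or fitness function, so everything below is assumed: the candidate values in `GAMMA_CHOICES`, the truncation selection, and `toy_fitness`, which stands in for the validation accuracy of an SVM trained on the selected features.

```python
import random

N_FEATURES = 8                    # KDD Cup 1999 records have 41 features; 8 keeps the toy small
GAMMA_CHOICES = (0.01, 0.1, 1.0)  # hypothetical candidate kernel parameters


def random_chromosome(rng):
    """Feature mask (one bit per attribute) followed by one kernel-parameter gene."""
    mask = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    return mask + [rng.randrange(len(GAMMA_CHOICES))]


def decode(chromosome):
    """Return (indices of selected features, kernel parameter)."""
    selected = [i for i, bit in enumerate(chromosome[:N_FEATURES]) if bit]
    return selected, GAMMA_CHOICES[chromosome[N_FEATURES]]


def toy_fitness(chromosome):
    """Stand-in for SVM accuracy: rewards features 0, 2, 5 and gamma = 0.1,
    and penalises every selected feature slightly."""
    selected, gamma = decode(chromosome)
    useful = len({0, 2, 5} & set(selected))
    return useful - 0.1 * len(selected) + (0.5 if gamma == 0.1 else 0.0)


def crossover(a, b, rng):
    point = rng.randrange(1, len(a))  # single-point crossover
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.1):
    child = list(chromosome)
    for i in range(N_FEATURES):
        if rng.random() < rate:
            child[i] = 1 - child[i]   # flip a feature bit
    if rng.random() < rate:
        child[N_FEATURES] = rng.randrange(len(GAMMA_CHOICES))
    return child


def run_ga(fitness, generations=40, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]   # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children         # parents survive unchanged
    return max(population, key=fitness)


best = run_ga(toy_fitness)
print(decode(best), toy_fitness(best))
```

In a real setting the fitness call would train and validate an SVM for each chromosome, which is the expensive step; the GA loop itself is unchanged.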

BIRCH Hierarchical Clustering Algorithm
The BIRCH hierarchical clustering algorithm applied in this system is the one described in [Horng et al.,16]. BIRCH differs from other clustering techniques such as CURE, ROCK and Chameleon in that it stores a small number of abstracted data points rather than the whole dataset. Each abstracted point represents the centroid of a cluster of data points. Compared to CURE, ROCK and Chameleon, the BIRCH clustering algorithm can achieve high-quality clustering with lower processing cost. The advantages of BIRCH are as follows:
1. It constructs a tree, called a Clustering Feature (CF) tree, in a single scan of the dataset using an incremental clustering technique.
2. It handles noise effectively.
3. It is memory-efficient, because it stores only a few abstracted data points instead of the whole dataset.

Clustering Feature (CF)
The concept of a Clustering Feature (CF) tree is at the core of BIRCH's incremental clustering algorithm. Nodes in the CF tree are composed of clustering features. A CF is a triplet (N, LS, SS) that summarizes the information of a cluster: N is the number of data points in the cluster, LS is their linear sum, and SS is their square sum.
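A minimal Python sketch of a clustering feature follows, assuming the standard BIRCH definition CF = (N, LS, SS), where N is the number of points, LS their linear sum and SS the sum of their squared norms. The class and method names are illustrative and not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CF:
    n: int      # number of points in the cluster
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @staticmethod
    def from_point(p):
        return CF(1, tuple(p), sum(v * v for v in p))

    def __add__(self, other):
        """Additivity: the CF of a merged cluster is the component-wise sum."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors


cf = CF.from_point((0, 0)) + CF.from_point((2, 0))
print(cf, cf.centroid(), cf.radius())
```

Because the triplet is additive, a cluster's centroid and radius can be updated when a point arrives without revisiting the points already absorbed, which is what allows BIRCH to work in a single scan.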

CF Tree
A CF tree is a height-balanced tree with two parameters, branching factor B and radius threshold T. Each non-leaf node in a CF tree contains at most B entries of the form (CFi, childi), where 1 <= i <= B, childi is a pointer to its ith child node, and CFi is the CF of the cluster pointed to by childi. A CF tree is a compact representation of a dataset: each entry in a leaf node represents a cluster that absorbs many data points within a radius of T or less. A CF tree can be built dynamically as new data points are inserted. The insertion procedure is similar to inserting a new record at its correct position in a B+-tree.
Redundancy in the KDD99 data set is very high, which affects the usefulness of the data. Deleting repeated records reduces the size of the data set from 494,021 to 145,586. Furthermore, to make the data set more efficient, hierarchical clustering using BIRCH is used to reduce the dataset. Hierarchical clustering is a popular clustering approach that partitions data samples into clusters by evaluating the smallest distance between data points and clusters.
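The two reduction steps, removing repeated records and absorbing points into clusters bounded by the threshold T, can be sketched as follows. This is a deliberately simplified, flat version: it keeps a single list of leaf entries (N, LS, SS) and omits the tree structure, the branching factor B and node splitting, so it is not a full CF tree. The sample records and the threshold value are made up.

```python
import math


def deduplicate(records):
    """Drop exact repeats while keeping first-seen order."""
    return list(dict.fromkeys(records))


def radius(n, ls, ss):
    centroid_sq = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - centroid_sq, 0.0))


def absorb(entries, point, threshold):
    """Merge point into the nearest entry if the merged radius stays within
    threshold; otherwise start a new entry."""
    p_ss = sum(v * v for v in point)
    if entries:
        def dist_sq(entry):
            n, ls, _ = entry
            return sum((l / n - v) ** 2 for l, v in zip(ls, point))

        idx = min(range(len(entries)), key=lambda i: dist_sq(entries[i]))
        n, ls, ss = entries[idx]
        merged = (n + 1, tuple(l + v for l, v in zip(ls, point)), ss + p_ss)
        if radius(*merged) <= threshold:
            entries[idx] = merged
            return
    entries.append((1, tuple(point), p_ss))


records = [(0.0, 0.0), (0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.2), (5.0, 5.0)]
unique = deduplicate(records)

entries = []
for record in unique:
    absorb(entries, record, threshold=0.5)

# The centroids are the abstracted points that would be passed on for training.
centroids = [tuple(v / n for v in ls) for n, ls, _ in entries]
print(len(records), len(unique), centroids)
```

A smaller threshold yields more, tighter clusters and therefore a larger reduced data set; a larger threshold compresses the data more aggressively at the risk of merging dissimilar records.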
In the proposed system, BIRCH and GA are first applied to the data set to obtain a preprocessed, optimal data set. The SVM is then applied to this reduced data set to classify the network traffic data [Patcha & Park,12]. The results of the proposed system on the KDD data set are as follows. The false positive (FP) rate for U2R and probe attacks is low, so the accuracy for these attacks is high; in addition, out of about 1000 instances, about 18.5% are probe attacks and 1.9% are U2R attacks.
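Per-class detection and false positive rates of the kind discussed here can be computed from true and predicted labels as in the following sketch. The label lists are made up for illustration and are not results from this study; the function name is likewise hypothetical.

```python
def per_class_rates(y_true, y_pred, positive):
    """Return (detection rate, false positive rate) for one attack class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate


# Made-up labels for illustration only.
y_true = ["normal", "probe", "probe", "u2r", "dos", "normal", "probe", "dos"]
y_pred = ["normal", "probe", "normal", "u2r", "dos", "probe", "probe", "dos"]

for attack in ("probe", "u2r"):
    print(attack, per_class_rates(y_true, y_pred, attack))
```

Reporting both rates per class matters for rare classes such as U2R, where overall accuracy can look high even if the class is never detected.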

V. CONCLUSION
This study proposed an SVM-based intrusion detection system that combines a genetic algorithm, a hierarchical clustering algorithm and the SVM technique. The genetic algorithm and the BIRCH hierarchical clustering technique are used for data preprocessing. BIRCH hierarchical clustering provides a high-quality, abstracted and reduced data set for SVM training. The well-known KDD Cup 1999 data set was used to evaluate the proposed system. Compared with other intrusion detection systems, this system showed better performance in the detection of various attacks. Future work is to apply the SVM with other data preprocessing techniques.