A Novel DNA Sequence Approach for Network Intrusion Detection System Based on Cryptography Encoding Method

Abstract: A novel method for a Network Intrusion Detection System (NIDS) is proposed, based on the way DNA sequences are used to detect disease, since both domains share a similar concept of detection. Three steps are proposed to apply DNA sequences to NIDS: convert the network traffic data into a DNA sequence using a cryptography encoding method; discover Short Tandem Repeat (STR) patterns for each network traffic attack using the Teiresias algorithm; and classify the traffic according to the STR sequences using the Horspool algorithm. The 10% KDD Cup 1999 data set is used for the training phase, and the corrected KDD Cup 1999 data set is used for the testing phase to evaluate the proposed method. The experimental results show a detection rate of 86.36%, a false alarm rate of 49.69%, and an accuracy of 77.65%. These results are considered better than those of earlier basic algorithms. It is concluded that DNA sequences have potential as an NIDS solution, and that a better encoding method could improve the results further.


I. INTRODUCTION
Intrusion detection is used to detect and prevent attacks [1] and to secure the network, much as Deoxyribonucleic Acid (DNA) is used to detect disease in the human body. The concept of detecting abnormal functioning of tissues through DNA sequences can be applied to computing systems, where the normal functioning of the system is represented by a DNA sequence that differs from the DNA sequence of an attack. DNA encoding methods have been used in many reliable computer system techniques, such as cryptography, steganography, and digital signatures.
A system to encrypt and generate digital signatures has been built on DNA cryptography; it handles combinations of all characters with high accuracy [2]. To hide secret data in an image, [3] established a system that transfers the image into two security layers: a DNA sequence layer and a cover layer. DNA steganography and RGB colors have been used to create a cryptographic system [4]; this system converts cipher text to color by supplying a cipher DNA sequence.
A DNA sequence can be converted into a fingerprint image using a wavelet transform function [5]. DNA decoding is applied to extract the signature from the watermarked image, in which it is invisible. Any type of data, such as text, image, audio, or video, can be encrypted via DNA cryptography [6]. Image encryption based on DNA cryptography and the Hill cipher is established in [7]. In addition, an encryption system based on DNA and a key length of 256 bits is provided in [8].
Intrusion can be detected by the system provided in [9], which uses two levels of detection granularity: the coarse-grained level detects the intrusion, and the fine-grained level determines the details. An IDS for IEEE 802.11 wireless networks is suggested by [10]; it is built on behavioural analysis and applies sequential machine learning techniques. The patterns in the protocol are modelled and characterized by probabilities. The IDS proposed in [11] improves the MCLP classifier, which is based on multiple criteria linear programming, using swarm optimization.
The K-means method, the cuttlefish algorithm, and five rule algorithms are applied to various numbers of clusters in order to reduce the number of features, achieve a high detection rate, and decrease false detections of attacks [12], [13]. To improve the efficiency of an IDS based on the extreme learning machine, a framework that combines the outputs of simple learners is established by [14]. The hybrid IDS proposed by [15] for sensor networks reduces communication costs and relies on the support vector machine algorithm and signature rules. Good performance and maximized productivity of the IDS are achieved by [16] through a system that applies both negative and exclusive pattern matching techniques. A fast heuristic clustering method has been applied to establish a novel intrusion detection system based on data mining techniques [17].
To discover unknown attacks, [18] proposed an adaptive method based on ant colony clustering. The method focuses on the clustering process of ant colony movement, and the structure of the intrusion detection system is designed around it. This not only improves the detection rate but also significantly reduces the false positive rate, and it can automatically detect various kinds of attacks. It is important to protect sensitive data from attack, for example to preserve privacy in personal communication. A system was therefore implemented that enables the data owner to detect and prevent data leaks; the results show that the approach provides accurate detection with a small number of false alarms [19]. In [20], a step to reshape policy is suggested in order to develop data protection that builds greater user confidence; a client survey questionnaire methodology was adopted.
An Extreme Learning Machine (ELM)-based intrusion detection method for Advanced Metering Infrastructure is presented in [21]. First, the method filters and partitions the malicious data, and the different types of intrusion are extracted. Then, the ELM is used to detect the various attack types in the malicious data. ELM tends to scale better and achieves much better generalization performance at a much faster learning speed than a traditional SVM. Xing-zhu [22] improved the neural network model for network intrusion detection. The network feature subset and the parameters of the Radial Basis Function (RBF) neural network are regarded as a particle. Then, to establish the optimal network intrusion detection model, the optimal feature subset and RBF neural network parameters are found through collaboration and information exchange between particles. The simulation results showed that this system reduced the feature dimensions and obtained the best RBF neural network parameters, yielding a network intrusion detection model with high detection accuracy and high speed.
Promod and Jacob [23] applied the Random Forest to measure the intrusion of unauthorized personnel into designated areas of an organization. The time attendance system acts as a security system, since it controls the doors and barriers through which only authorized personnel should pass. The Random Forest classifier is also used to build a model for an intrusion detection system [24]. The Random Forest is an ensemble classifier and performs well compared with other traditional classifiers for the classification of attacks. The empirical results indicated that the presented model is efficient, with a low false alarm rate and a high detection rate.
This paper presents a new procedure for detecting intrusions based on a cryptography DNA encoding approach. The results are also compared with previously published results in this field.

II. MATERIAL AND METHODS
DNA is the genetic material found in most organisms, including human beings. It has the advantage of storing information over the long term. This information is stored as a code made up of four chemical bases, Adenine, Cytosine, Thymine, and Guanine, referred to as A, C, T, and G. The bases form base pairs, and each base is attached to a sugar molecule and a phosphate molecule; the DNA structure is shown in Fig. 1 [25]. Each base, together with its sugar and phosphate, is called a nucleotide, and the nucleotides are arranged in two long spiral strands connected through the base pairs. There are about 3 billion bases; 99% of them are the same in all people, and only 1% is unique. The genetic information in human cells is carried in chromosomes; there are 46 in total, 23 from the father and 23 from the mother. Offspring share 99.7% with their parents, and only 0.3% is the unique code (repetitive coding) that allows DNA to be used as a biometric. In DNA, when a pattern of two or more nucleotides is repeated and the repeats are directly adjacent to each other, it is called a Short Tandem Repeat (STR) [26]. For example, the sequence ACTT is repeated three times in A-A-A-C-T-T-A-C-T-T-A-C-T-T-A-G. Such repeats are used in investigations to target particular areas of DNA, which makes the search much easier than examining the whole DNA sequence.
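
The tandem-repeat idea described above can be illustrated with a short Python sketch. It is not the extraction procedure used in this paper (that role is played by the Teiresias algorithm); it is only a brute-force scan, and the function name `find_tandem_repeats` and its parameters are hypothetical choices made for illustration.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_repeats=2):
    """Return (start, unit, count) for runs of directly adjacent copies of a unit.

    Brute-force illustration only: for each unit length, scan left to right and
    count how many identical copies of the unit follow each other.
    """
    hits = []
    n = len(seq)
    for unit_len in range(min_unit, max_unit + 1):
        i = 0
        while i + unit_len <= n:
            unit = seq[i:i + unit_len]
            count = 1
            # Extend the run while the next block equals the unit.
            while seq[i + count * unit_len:i + (count + 1) * unit_len] == unit:
                count += 1
            if count >= min_repeats:
                hits.append((i, unit, count))
                i += count * unit_len  # jump past the whole run
            else:
                i += 1
    return hits


# The example sequence from the text, written without separators.
print(find_tandem_repeats("AAACTTACTTACTTAG"))
```

For the example sequence, the scan reports the unit ACTT starting at index 2 with three adjacent copies. A unit that is itself a repetition (for example ACAC inside ACACACAC) would be reported in addition to the shorter unit, which a real STR tool would normally filter out.
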
These special areas of the DNA are believed to be parts that do not code for any genes (non-coding sequences), but they vary between people. Identical repeats of the same pattern have a length of 2 to 6 base pairs, and they can be found anywhere from 1 to 50 times in a row. For example, the sequence "A-C-C-A-C-C-A-C-C-A…

Cryptography is a branch of study within the field of cryptology. The original message is called the plaintext, while the coded message is called the cipher text. Encryption is the process of converting plaintext to cipher text, and decryption is the reverse process [28]. Encryption algorithms can be grouped into two types: stream ciphers and block ciphers. In stream ciphers, the image pixels or text characters are encrypted consecutively; in block ciphers, blocks of bits or blocks of characters are encrypted [29]. This work is carried out using the Teiresias and Horspool algorithms.
The Teiresias algorithm detects and reports all existing patterns in a set of input sequences without using alignment. Let Σ be the alphabet of residues (e.g., the four DNA bases). A pattern is defined by a regular expression of the form Σ(Σ ∪ {'.'})*Σ, where the symbol '.' (a "don't care") marks a position that can hold an arbitrary residue. Every pattern P defines a language G(P) consisting of all strings that can be obtained from P by replacing each don't care with an arbitrary residue from Σ. For example, for the pattern "T.AA..C", the following strings are elements of G("T.AA..C"): TAAAGCC, TCAAGTC, TTAATGC. Any substring of P that is itself a pattern is called a subpattern of P; for example, "A..C" is a subpattern of the pattern "T.AA..C". A pattern P is called an <L,W> pattern (with L <= W) if every subpattern of P with length W or more contains at least L residues [30].
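
The definitions above (membership in G(P), subpatterns, and the <L,W> condition) can be checked with a few lines of Python. This is a minimal sketch of the definitions only, not of the Teiresias discovery algorithm itself; the function names `in_language`, `is_pattern`, and `is_lw_pattern` are hypothetical.

```python
WILDCARD = "."  # the "don't care" symbol


def in_language(pattern, s):
    """True if string s belongs to G(pattern)."""
    return len(s) == len(pattern) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, s)
    )


def is_pattern(p):
    """A pattern starts and ends with a residue (never with a don't care)."""
    return len(p) >= 2 and p[0] != WILDCARD and p[-1] != WILDCARD


def is_lw_pattern(pattern, L, W):
    """True if every subpattern of length >= W contains at least L residues."""
    if not is_pattern(pattern):
        return False
    n = len(pattern)
    for start in range(n):
        for end in range(start + W, n + 1):  # substrings of length >= W
            sub = pattern[start:end]
            residues = sum(ch != WILDCARD for ch in sub)
            if is_pattern(sub) and residues < L:
                return False
    return True


for s in ("TAAAGCC", "TCAAGTC", "TTAATGC"):
    print(s, in_language("T.AA..C", s))
print(is_lw_pattern("T.AA..C", 3, 5), is_lw_pattern("T.AA..C", 4, 5))
```

With L = 3 and W = 5, the pattern "T.AA..C" satisfies the condition because its only subpatterns of length 5 or more are "T.AA..C" (4 residues) and "AA..C" (3 residues); with L = 4 the subpattern "AA..C" violates it.
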
The Horspool matching algorithm is used in the present research as an efficient string searching algorithm to classify the data as attack or normal. The algorithm pre-processes the target string (key) being searched for. It does not need to check every character of the text; it skips some of them, and it generally becomes faster as the key becomes longer. Its efficiency derives from the fact that after each unsuccessful attempt to match the key against the text, it uses the information gained from that attempt to skip as many positions as possible where the key cannot match [31]. Table 1 shows the bad-character table used by the Horspool algorithm for the example in Table 2, which illustrates the search for the key "GCAGAGAG" in the sequence "GCATCGCAGAGAGTATACAGTACG".
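
The search described above can be sketched in Python as follows. This is an illustrative implementation of the standard Horspool algorithm, not the authors' code; the names `bad_character_table` and `horspool_search` are hypothetical, and the window comparison uses a string slice for brevity where the classical formulation compares character by character from right to left.

```python
def bad_character_table(key):
    """Shift for each character: distance from its last occurrence
    (excluding the final position of the key) to the end of the key."""
    m = len(key)
    # Later positions overwrite earlier ones, so the rightmost occurrence wins.
    return {ch: m - 1 - i for i, ch in enumerate(key[:-1])}


def horspool_search(key, text):
    """Return the index of the first occurrence of key in text, or -1."""
    m, n = len(key), len(text)
    if m == 0:
        return 0
    shift = bad_character_table(key)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == key:
            return pos
        # Shift by the table entry of the text character aligned with the
        # last key position; characters absent from the key shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


key = "GCAGAGAG"
text = "GCATCGCAGAGAGTATACAGTACG"
print(bad_character_table(key))
print(horspool_search(key, text))
```

For the key "GCAGAGAG" the computed shifts are 1 for A, 6 for C, and 2 for G, while T, which does not occur in the key, shifts by the full key length of 8. The key is found at index 5 of the example sequence after only three window alignments (positions 0, 1, and 3 are tried before the match).
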

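TABLE I BAD-CHARACTER TABLE USED BY HORSPOOL ALGORITHM (COLUMNS: A, C, G, T)

TABLE II APPLICATION OF HORSPOOL ALGORITHM (TEXT: G C A T C G C A G ...)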
The proposed system consists of three steps: generating the DNA sequence for NIDS, STR extraction, and the matching process.

• DNA sequence for NIDS based on encoding

To perform the first step, DNA encoding is applied to the intrusion detection data based on the DNA encoding table [32], as shown in Table 3. The example below illustrates how the DNA sequence is generated for the following network traffic:

(0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00)

KDD Cup 99 is the data source used in the current research. It consists of thousands of records, and each record in the data set has 42 features; 22 of these features describe the connection, and 19 describe the properties of connections to the same host in the last two seconds [33]. These features are shown in Table 4.

The performance of the present system is determined by three measures: detection rate (DR), false alarm rate (FAR), and accuracy [34]. In the equations below, TP and FN denote the numbers of attacks that are correctly detected and missed, respectively, and TN and FP denote the numbers of normal connections that are correctly classified and misclassified as attacks, respectively.

The DR is the ratio of the number of correctly detected attacks to the total number of attacks, as shown in equation (1):

$$DR = \frac{TP}{TP + FN} \qquad (1)$$

The FAR is the ratio of the number of normal connections that are misclassified as attacks to the total number of normal connections, as shown in equation (2):

$$FAR = \frac{FP}{FP + TN} \qquad (2)$$

Accuracy is the ratio of the number of correctly classified connections to the total number of connections, as shown in equation (3):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

III. RESULTS AND DISCUSSIONS
Table 7 shows the detection rate, false alarm rate, and accuracy obtained by the present system, compared with two intrusion detection systems reported by Duque & Omar [12] and Yu et al. [18]. The published values come from these two systems, which apply data mining techniques to intrusion detection. The table shows that the detection rate and accuracy obtained by the present system are quite good. The present system gives a better detection rate than the two previous systems, at 86.36%. The false alarm rates of the two published systems are not reported. Finally, the accuracy of the present system, 77.65%, is lower than that of the second system. The detection rate results for the proposed and published systems are shown in Fig. 3, the accuracy results are shown in Fig. 4, and the results of the proposed system are illustrated in Fig. 5.

This paper has shown how the concepts of DNA sequence disease detection can be used for a network intrusion detection system. The method is simple, consisting of five steps to conduct the detection process. Although its performance is weak compared with the state of the art in IDS, the proposed method has shown relatively good results. The performance of the DNA sequence approach relies heavily on the DNA encoding technique. The cryptography encoding method may not be suitable for network data. Therefore, future work should build a DNA encoding method suited to network IDS.
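
The three evaluation measures used in this paper (DR, FAR, and accuracy) can be computed directly from four counts: attacks detected (TP), attacks missed (FN), normal connections classified correctly (TN), and normal connections flagged as attacks (FP). The following Python sketch shows the arithmetic; the function name `detection_metrics` and the counts in the usage example are hypothetical and are not results from this paper.

```python
def detection_metrics(tp, fn, tn, fp):
    """Return (detection rate, false alarm rate, accuracy) as fractions."""
    dr = tp / (tp + fn)                      # detected attacks / all attacks
    far = fp / (fp + tn)                     # false alarms / all normal connections
    acc = (tp + tn) / (tp + tn + fp + fn)    # correct decisions / all connections
    return dr, far, acc


# Hypothetical counts, for illustration only.
dr, far, acc = detection_metrics(tp=80, fn=20, tn=150, fp=50)
print(f"DR={dr:.2%}  FAR={far:.2%}  Accuracy={acc:.2%}")
```

Because DR and FAR are computed over different denominators (attacks and normal connections, respectively), a system can combine a high detection rate with a high false alarm rate, which is the situation reported for the proposed method.
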

Fig. 2 An example of Short Tandem Repeats

TABLE IV LIST OF VARIOUS FEATURES OF KDD-CUP99 TASK DESCRIPTION

The 10% KDD Cup 99 dataset, which includes 22 types of attacks, is used for training as shown in Table 5, and the corrected KDD Cup 99 dataset, which includes 37 types of attacks, is used for the testing phase as shown in Table 6.

TABLE VI CLASS LABELS THAT APPEAR IN "CORRECTED KDD" DATASET

TABLE VII COMPARISON BETWEEN THE DR, FAR AND ACCURACY OF THE PROPOSED SYSTEM WITH THE PUBLISHED ONES

Fig. 3 Detection rate results