A Negative Selection Algorithm Based on Hierarchical Clustering of Self Set and its Application in Anomaly Detection

International Journal of Computational Intelligence Systems, Vol. 4, No. 4 (June 2011), 410-419

A negative selection algorithm based on the hierarchical clustering of the self set, HC-RNSA, is introduced in this paper. Several strategies are applied to improve the algorithm's performance. First, the self data set is replaced by the self cluster centers for comparison with the detector candidates at each cluster level. As the number of self clusters is much smaller than the self set size, detector generation efficiency is improved. Second, during the detector generation process, the detector candidates are restricted to the lower-coverage space to reduce detector redundancy. The problem that distances between antigens converge to a constant value in high-dimensional space is analyzed; accordingly, the Principal Component Analysis (PCA) method is used to reduce the data dimension, and a fractional distance function is employed to enhance the distinctiveness between self and non-self antigens. The detector generation procedure terminates when the expected non-self coverage is reached. Theoretical analysis and experimental results demonstrate that the detection rate of HC-RNSA is higher than that of traditional negative selection algorithms while the false alarm rate and time cost are reduced.


Introduction
Negative selection is a biological process by which the immune system generates non-self detectors that do not match self structures. The biological negative selection process can be mapped to the computational domain as a two-class pattern classification problem in the artificial immune system (AIS) 1, in which normal states correspond to self antigens while abnormal states correspond to non-self antigens. In AIS, the negative selection algorithm (NSA) is an important method for the generation of detectors. NSA is designed by modeling the biological process in which T-cells mature in the thymus by being censored against self cells 2. After negative selection, the remaining mature (valid) detectors are used for further applications such as anomaly detection 3, machine learning 4, pattern recognition 5, intrusion detection 6, etc.
The native negative selection algorithm (NNSA) defines the self/non-self discrimination problem using binary representations and calculates the affinities between binary strings by the r-contiguous-bits method 2,7,8. Li and T. Stibor pointed out that the efficiency of NNSA is too low for practical application 9,10,11: under a given failure rate P_f ≈ e^(−P_m|D|), where P_m is the match probability between a random detector and an antigen, the least number of detector candidates is N_0 = −ln(P_f)/(P_m·(1 − P_m)^N_s), which means N_0 is exponentially related to the self set size N_s, and the time complexity of NNSA is O(N_0·N_s) 2. Thus, the time cost of NNSA is unacceptable when the self set is large 10.
Gonzalez and Dasgupta introduced the real-valued negative selection algorithm RNSA 12,13, which normalizes detectors and antigens into [0, 1]^d. Ji and Dasgupta then improved RNSA with a variable detector radius in the V-Detector algorithm, which sets each detector's radius by the distance to the nearest self element, enlarging the non-self coverage with a small number of detectors 14,15.
T. Stibor indicated that RNSA and V-detector also suffer from the curse of dimensionality 16,17. On the one hand, the distances between antigens in a high-dimensional data space converge to a constant value. Therefore, there is little distinctiveness between self and non-self antigens, resulting in a higher false alarm rate. On the other hand, the algorithms terminate with only a very small number of large-radius detectors (hyperspheres) which cover a limited number of spikes. As a result, a large proportion of the volume of the hypercube [0, 1]^d does not lie within the hyperspheres; it lies in the remaining (high-volume) spikes. Thus the detection rate is lower.
Additionally, for most pattern recognition algorithms, distance calculation is the main source of time consumption 18,19. However, NNSA, RNSA and V-detector did not take any measure to reduce the cost of distance calculation: the distances from every detector candidate to the whole self set have to be calculated, resulting in low efficiency 9. Furthermore, as there are many overlapping detection regions, the reduction of detector redundancy must also be taken into consideration.
A real-valued negative selection algorithm based on the hierarchical clustering of the self set (HC-RNSA) is presented in this article. The underlying idea is as follows: first, the self data set is preprocessed using the Principal Component Analysis (PCA) method to reduce the data dimension, and then the self set is hierarchically clustered. During the detector generation process, the detector candidates, restricted to the lower-coverage space, are compared with the cluster centers using a fractional distance function to eliminate self-reactive detectors. The detector generation process continues recursively from the higher cluster level to the lower level until the cluster radius is less than the self radius; at each cluster level, the exit condition is reaching the expected non-self coverage.

Basic Definition
In AIS, antibodies are modeled as detectors which are used to recognize non-self elements 2. Therefore, the accuracy of the detection result is determined by the quality of the detectors. Randomly generated detector candidates may match self elements, resulting in self-reactivity 2,12. The negative selection algorithm, inspired by the censoring process of antibody cells in the biological body, was designed to eliminate such self-reactive detectors. The basic concepts are defined as follows:

Def 1. All the character strings abstracted from the sample space constitute the antigen set U = {⟨f_1, f_2, …, f_n⟩}, where n is the data dimension and f_i is the ith normalized attribute.
Def 2. The self set S ⊂ U consists of the character strings abstracted from the normal samples; r_s ∈ R^+ is the variability threshold of the self points. The non-self set N = U − S consists of the character strings abstracted from the abnormal samples, with S ∪ N = U and S ∩ N = Φ.
Def 3. A detector is d = ⟨c, r⟩, where c ∈ N is the central vector representing the location of d in the sample space and r ∈ R^+ is the detector radius. Antigens whose distance to c is less than r are identified as non-self elements.

Def 4. The non-self coverage of the detectors is defined as the ratio of the volume of the non-self space that can be recognized by any detector to the volume of the entire non-self space 14.
Def 5. Anomaly detection is to find a functional mapping f using training data samples generated according to an unknown probability distribution P(x, y), where C_0 is the set of normal samples and C_1 is the set of abnormal samples, such that f will correctly classify unknown examples (x, y). For AIS, the training set only contains normal samples (x, y ∈ C_0) and the task is to detect abnormal samples (x, y ∈ C_1) with the function f trained on the normal samples. As described in Ref. 17, abstracting these principles and modeling immune components according to the AIS framework yields a technique for anomaly detection. Input: a set of points sampled from the normal behavior of a system. Output: D = a set of hyperspheres recognizing a proportion of the total space [0, 1]^n, except the normal points.
Detector generation: while the expected non-self coverage of the detectors is not reached, generate hyperspheres.
Classification: If unknown point lies within a hypersphere, it does not belong to the normal behavior of the system and is classified as an anomaly.
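As a minimal sketch (in Python, with illustrative names and a (center, radius) representation that are assumptions, not the paper's pseudocode), the classification rule above reduces to a point-in-hypersphere test:

```python
import math

def is_anomaly(point, detectors):
    """Return True if `point` lies inside any detector hypersphere.

    `detectors` is a list of (center, radius) pairs; this representation is
    assumed for illustration only."""
    return any(math.dist(point, center) < radius
               for center, radius in detectors)

# One detector covering part of the unit square.
detectors = [((0.8, 0.8), 0.15)]
print(is_anomaly((0.82, 0.78), detectors))  # -> True (covered: anomaly)
print(is_anomaly((0.2, 0.2), detectors))    # -> False (normal behavior)
```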

The estimation of non-self coverage
As Eq. (1) is hard to calculate exactly, we select a fixed number of samples in the non-self space and then estimate the coverage p by statistical inference: the event that a random sample is recognized by the detector set D obeys a binomial distribution 15, with P{x = 1, x is covered} = p and P{x = 0, x is uncovered} = 1 − p. According to the Neyman-Pearson theorem, there exists a most powerful test for the hypothesis testing problem in Eq. (4), where H_0 is the hypothesis that the expected non-self coverage is not reached and H_1 is the contrary. The rejection region of Eq. (4) is the same as that of Eq. (5) 20, where p_1 is a random value greater than p_exp. The likelihood ratio of Eq. (5) is calculated over a random sample set {x_1, x_2, …, x_n}.
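The Monte-Carlo estimate and a rejection rule can be sketched as follows; the threshold here uses the usual normal approximation to the binomial rather than the paper's exact likelihood-ratio form, and the function names and z-value are illustrative assumptions:

```python
import math
import random

def estimate_coverage(detectors, n_samples, rng):
    """Fraction of uniform samples from the unit square that fall inside
    some detector -- a binomial estimate of non-self coverage."""
    covered = 0
    for _ in range(n_samples):
        x = (rng.random(), rng.random())
        if any(math.dist(x, c) < r for c, r in detectors):
            covered += 1
    return covered / n_samples

def coverage_reached(covered, n, p_exp, z=1.645):
    """Reject H0 ('expected coverage p_exp not reached') when the covered
    count exceeds the normal-approximation threshold (level ~0.05)."""
    threshold = n * p_exp + z * math.sqrt(n * p_exp * (1 - p_exp))
    return covered > threshold

rng = random.Random(42)
# A radius-1 detector at the center covers the whole unit square
# (the farthest corner is only ~0.707 away), so the estimate is 1.0.
full = [((0.5, 0.5), 1.0)]
print(estimate_coverage(full, 200, rng))  # -> 1.0
```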

The value ranges of detector candidates
The random value ranges (RVR) are a set of d-dimensional hypercubes: ranges = {hypercube | hypercube = ([low_1, low_2, …, low_d], [high_1, high_2, …, high_d])}, where high_i and low_i are the upper and lower bounds of the ith attribute of a detector's central vector. The RVR of the ith cluster of cluster level l is defined in Eq. (10) as the hypercube ([c_i1 − 2r_l, …, c_id − 2r_l], [c_i1 + 2r_l, …, c_id + 2r_l]), where c_i is the cluster center and r_l is the cluster radius.
During the detector generation procedure of level l + 1, the non-self space outside the RVR of level l has already been covered by detectors, so newly generated detector candidates should be located inside the RVR of level l to reduce detector redundancy. As Fig. 1 shows, during the detector generation process of level 2, the non-self space outside the RVR of level 1 has been covered by detectors; therefore the detector candidates of level 2 are generated within the RVR of level 1. The same applies to the detector generation process of level 3.
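Under the 4r edge-length convention used below in Sec. 3 (bounds c_i ± 2r per dimension), an RVR, optionally intersected with the parent level's range, can be sketched as follows; the function name and tuple representation are illustrative assumptions:

```python
def random_value_range(center, cluster_radius, parent_range=None):
    """Hypercube of edge length 4r centered on a cluster center; intersecting
    with the parent level's RVR keeps candidates out of space that earlier
    levels already covered."""
    low = [c - 2 * cluster_radius for c in center]
    high = [c + 2 * cluster_radius for c in center]
    if parent_range is not None:
        p_low, p_high = parent_range
        low = [max(a, b) for a, b in zip(low, p_low)]
        high = [min(a, b) for a, b in zip(high, p_high)]
    return low, high

low, high = random_value_range((0.5, 0.5), 0.1)
print(low, high)  # approximately [0.3, 0.3] [0.7, 0.7]
```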

The probability of generating invalid detectors
Owing to the randomness of the detector generation procedure, many invalid detectors (detectors that cover self samples) are generated, which lowers efficiency. Let p represent the probability of generating an invalid detector. For RNSA 12 and V-detector 14, as the central vectors of the detector candidates are randomly sampled from the unit hypercube [0, 1]^d, p is the ratio of the self hyperspheres' volume to the volume of the unit hypercube: p = n·V_self / V_cube, where n is the self set size, V_self is the volume of a single self hypersphere, and V_cube is the volume of the unit hypercube.
In HC-RNSA, samples are drawn from the random value range, which is a hypercube with edge length 4r, r being the cluster radius, so p is the ratio of the cluster hyperspheres' volume to the volume of the 4r-hypercube: p = m·V_clu / V_random_cube, where m is the number of clusters, V_clu is the volume of a cluster hypersphere, and V_random_cube is the volume of the 4r-hypercube (RVR).
To compare the efficiency of RNSA, V-detector and HC-RNSA, we define the coefficient ρ in Eq. (13) as the ratio of the two probabilities above; when ρ is greater than 1, the efficiency of HC-RNSA is higher than that of the traditional algorithms. As Fig. 2 shows, when the data dimension is lower than 20 and the self radius is bigger than 0.05, ρ > 1; otherwise the efficiency of HC-RNSA is lower than that of the traditional NSAs. Therefore, a data pretreatment process is needed to reduce the data dimension when dealing with high-dimensional data. In this article, the Principal Component Analysis (PCA) method is employed: first, n antigen samples (1 ≤ i ≤ n) are selected to calculate the correlation coefficient matrix R.

Fig. 1. The rectangles are random value ranges, the circles are clusters and the shadows are regions covered by detectors. From the higher cluster level to the lower level, the cluster radius is halved.
Then the eigenvectors of R are calculated and ordered by their eigenvalues. The first m eigenvectors, whose accumulative contribution rate exceeds the threshold, are selected as the principal components.
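A minimal sketch of this step: power iteration recovers the leading eigenvector of the sample covariance matrix, standing in for the full eigen-decomposition of the correlation matrix R. This is a pure-Python illustration under that simplifying assumption, not the paper's implementation:

```python
import math

def leading_component(data, iters=100):
    """First principal direction of `data` (a list of equal-length rows)
    via power iteration on the sample covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    centered = [[row[i] - means[i] for i in range(d)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Perfectly correlated 2-D data: the leading direction is along (1, 1).
v = leading_component([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```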

The fractional distance function
Theorem 1. Let F be the distribution from which the antigen attributes are drawn. For the L_k metric, lim_{d→∞} E[(D_max − D_min)/d^(1/k − 1/2)] = C_k, where d is the data dimension, C_k is a constant depending on the norm k, and D_max and D_min are the maximum and minimum distances from the antigens to the origin under the L_k metric.
Proof sketch. The d attributes of two antigens A and B are drawn from the distribution F with mean μ and standard deviation σ 21. The comparison is between two random antigens, so from Eq. (16) each per-dimension term is an independent random variable with finite mean and variance. According to Slutsky's theorem, substituting Eq. (15) into the denominator of Eq. (17) gives Eq. (19). Combining the results of Eq. (18) and Eq. (19) yields the stated limit for E[D_max − D_min]/d^(1/k − 1/2), where C_k is some constant depending on k.
From Fig. 3 we can see that D_max − D_min increases with d^((1/k)−(1/2)), which motivates the use of the fractional-norm distance function in Eq. (21) to enhance the distinctiveness between self and non-self antigens.
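The fractional L_k distance (k < 1) referred to by Eq. (21) has the standard form Σ|x_i − y_i|^k raised to the power 1/k; a one-function Python sketch:

```python
def fractional_distance(a, b, k=0.5):
    """L_k distance with a fractional norm (k < 1), which keeps more contrast
    between pairwise distances in high-dimensional space (cf. Theorem 1)."""
    return sum(abs(x - y) ** k for x, y in zip(a, b)) ** (1.0 / k)

# With k = 0.5, two unit coordinate differences give (1 + 1)^2 = 4.
print(fractional_distance((0.0, 0.0), (1.0, 1.0), k=0.5))  # -> 4.0
```

For k = 2 the same function reduces to the ordinary Euclidean distance, so it can replace the integer-norm metric throughout without special-casing.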

The negative selection algorithm HC-RNSA
Steps 1-3 are the data pretreatment stages, in which the data dimension is reduced and the self set is hierarchically clustered. Steps 4-8 are the detector generation process. In step 4, the detector candidates are restricted to the random value ranges to reduce detector redundancy. In step 5, the self data are replaced by the cluster centers for comparison with the detector candidates. As the number of cluster centers is far less than the self set size, the efficiency of the negative selection process is much enhanced. In step 6, the termination criterion is based on the hypothesis test of the non-self coverage: if the hypothesis H_0 in Eq. (4) is rejected, the generation procedure can be terminated.
Step 7 tests whether the non-self sample x is covered by the detectors, to accumulate the cover count m. In step 8, the non-self sample x is reused to generate a detector d = ⟨x, r⟩, where r is the nearest distance between x and the cluster ranges; r is a byproduct of step 5. As the non-self coverage of the detectors varies when new detectors are put into D, the detector set is kept unchanged until the number of non-self samples equals the predefined size N.

Theorem 2. The time complexity of the detector generation process of HC-RNSA is independent of the self set size.

Proof. In HC-RNSA, the time consumption of steps 4, 6 and 8 is a constant t that can be ignored. In step 5, the distances from the sample x to the cluster centers are calculated; the time complexity of this step is O(|C_l|), where C_l is the set of cluster centers in level l. Assuming the number of samples in cluster level l is N_l, the number of non-self samples reaching step 7 is N_l·(1 − p), where p in Eq. (12) is the probability of generating self samples. The distances between the non-self sample x and the detector set D are calculated in step 7, so the time complexity of this step is O(N_l·(1 − p)·|D|). Therefore, the time complexity of the detector generation process of HC-RNSA is independent of the self set size. As Table 1 shows, the time complexity of the traditional NSAs is exponentially related to the self set size, so their time consumption grows dramatically as the number of self data increases. For HC-RNSA, however, the time complexity of the detector generation process does not depend on the self set size, which makes HC-RNSA suitable for detector generation over a large self set.

Experiment
To test the anomaly detection performance of HC-RNSA and compare it with the traditional NSAs (NNSA 2, RNSA 12 and V-detector 14), comparison experiments are designed on several classic UCI (University of California, Irvine) data sets 22, which have been widely used in the fields of anomaly detection, disease diagnosis, equipment monitoring, etc. 22
The detection rate (DR), false alarm rate (FA) and time cost are the three evaluation criteria for NSAs: DR = TP/(TP + FN) and FA = FP/(FP + TN), where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives respectively.
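These are the standard confusion-matrix forms; as a quick executable sketch:

```python
def detection_rate(tp, fn):
    """DR = TP / (TP + FN): fraction of true anomalies that are detected."""
    return tp / (tp + fn)

def false_alarm_rate(fp, tn):
    """FA = FP / (FP + TN): fraction of normal samples flagged as anomalous."""
    return fp / (fp + tn)

print(detection_rate(90, 10))   # -> 0.9
print(false_alarm_rate(5, 95))  # -> 0.05
```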
The experimental data properties are described in Table 2. In the Ball-bearing and Delft pump data sets, all records with a normal equipment state are taken as self data and the others as non-self data. In the other data sets, the records collected from healthy people are self data, and the records from unhealthy people constitute the non-self data set. The data records are first normalized into [0, 1]^d, and then the NSAs (HC-RNSA, NNSA 2, RNSA 12 and V-detector 14) are employed to generate detectors on these data sets. The parameters are shown in Table 3. The receiver operating characteristic (ROC) curves are generated by repeated experiments under different expected coverages. In Fig. 4, the horizontal axis represents the false alarm rate and the vertical axis the detection rate; the ideal curve is therefore the vertical axis, meaning the false alarm rate would be zero at any detection rate.
Fig. 4 shows that the four NSAs obtain similar results on the Biomed data set; on the other data sets their results differ considerably, and HC-RNSA always obtains better results than the others. In particular, on Ball-bearing, Delft pump and B.Cancer, the detection results of HC-RNSA are much better than those of the traditional NSAs. Combining Table 2 and Fig. 4, we can see that a small self set size results in poor performance of the traditional NSAs. On the one hand, this is because the traditional NSAs rely on the self set to train detector candidates, so a lack of self elements leads to the generation of self-reactive detectors and poor performance. For HC-RNSA, the radius of a detector is decided by the nearest self cluster margin; therefore, the absence of some self elements does not affect the training of detectors. On the other hand, the distribution of the self data is not taken into consideration in the traditional NSAs, whereas HC-RNSA discriminates the self and non-self regions by the self cluster ranges and generates mature detectors based on that discrimination, which reduces the false alarm rate. The time cost of the detector generation process is shown in Table 4. From the table we can see that the time cost of HC-RNSA on each data set is less than that of the traditional NSAs. As discussed in Sec. 3.2, during the detector generation process the self data are replaced by the self cluster centers for comparison with the detector candidates. Since the number of cluster centers is usually much smaller than the self set size, the efficiency of the detector generation process is much improved.
The HC-RNSA algorithm without the PCA pretreatment stage is called HC_1, and the HC-RNSA variant using an integer-norm distance function is called HC_2. The detection results of HC-RNSA, HC_1 and HC_2 on the same UCI data sets are shown in Table 5.
As Table 5 shows, on the Ball-bearing, Delft pump and Arrhythmia data sets, the detection rate of HC-RNSA is higher than that of HC_1 and HC_2 while its false alarm rate is lower. The results demonstrate that on higher-dimensional data sets, the PCA pretreatment and the fractional distance function are essential to the improvement of the algorithm's performance. However, on the lower-dimensional data sets Biomed and Diabetes, HC_1 and HC_2 perform better. On the one hand, this is because the PCA pretreatment reduces the data dimension but at the same time loses some useful discriminative information; on the other hand, as discussed in Sec. 3.1.4, D_max − D_min does not converge in lower-dimensional space, so the fractional distance function is not needed there. Therefore, when choosing a negative selection algorithm to generate immune detectors, the self data distribution, data dimension, self set size, self radius, etc. must be taken into consideration.

Conclusions
Artificial immune theory is an intelligent soft-computing technique with the adaptive ability to learn from training data. However, the application of artificial immune systems has been limited by the low efficiency of the detector generation procedure. Therefore, a new negative selection algorithm, HC-RNSA, based on the hierarchical clustering of the self set is proposed in this article. In HC-RNSA, the self data are replaced by the cluster centers for comparison with the detector candidates to reduce the cost of distance calculations; the PCA method and a fractional distance function are employed to improve the detection performance in high-dimensional space. The non-self coverage is estimated by a statistical inference method to dynamically terminate the detector generation procedure, which is more reasonable than the traditional exit condition based on a given detector set size. Theoretical analysis and experimental results demonstrate that HC-RNSA is an effective algorithm for generating artificial immune detectors for anomaly detection.
Fig. 1 panels: (a) original data; (b) detectors in level 1; (c) detectors in level 2; (d) detectors in level 3.

In the proof of Theorem 1, each per-dimension term is a random variable with zero mean and finite variance σ′^2, where σ′ is the standard deviation of P_i^k. By the central limit theorem, the sum of the values of R_i over the d dimensions converges to a normal distribution, so the expected value of the numerator is a constant C.

Fig. 2. The relationship between ρ, r_s and d.

Fig. 3. The relationship between d and D_max − D_min under different distance norms.

Here, p denotes the average probability of generating self samples and |C| the average number of clusters in each level.

Fig. 4. The receiver operating characteristic curves.

Step 4. Sample non-self data x from the random value range of the ith cluster level.
Step 5. Calculate the distance dis(x, c) between x and each center c in C_i by Eq. (21); if dis(x, c) is less than the cluster radius r_i, drop x and go to step 4; otherwise increase n.
Step 6. If n equals N, calculate the rejection range of H_0 by Eq. (9). If H_0 is rejected, increase i, reset m and n, and go to step 4; otherwise put the detectors from Td into D.
Step 7. If x is covered by detectors in D, increase the cover count m.
Step 8. Generate detector d = ⟨x, r⟩, put it into the temporary set Td, and go to step 4.
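Steps 4, 5 and 8 can be sketched in Python as a single cluster-level pass. This is an illustrative assumption-laden sketch: the coverage test of steps 6-7 is omitted, the names are invented, and Euclidean distance stands in for the fractional metric of Eq. (21):

```python
import math
import random

def one_level_pass(centers, cluster_radius, rvr, n_candidates, rng):
    """Sample candidates inside the random value range (step 4), drop
    self-reactive ones closer than the cluster radius to any center
    (step 5), and give each survivor a radius reaching to the nearest
    cluster margin (step 8)."""
    low, high = rvr
    detectors = []
    for _ in range(n_candidates):
        x = tuple(rng.uniform(a, b) for a, b in zip(low, high))   # step 4
        nearest = min(math.dist(x, c) for c in centers)           # step 5
        if nearest < cluster_radius:
            continue                     # self-reactive candidate: dropped
        detectors.append((x, nearest - cluster_radius))           # step 8
    return detectors

rng = random.Random(0)
ds = one_level_pass([(0.5, 0.5)], 0.1, ([0.0, 0.0], [1.0, 1.0]), 50, rng)
```

Every surviving detector is guaranteed not to overlap the self cluster, since its radius stops exactly at the cluster margin.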

Table 1. The time complexity of the detector generation process.

Table 2. The data properties of the UCI data sets.
*d is the data dimension

Table 4. The time cost of the detector generation process (h).