Introduction

The “curse of dimensionality” brings three major challenges for anomaly mining. The first is the computational cost of detection methods: a high-dimensional space may contain many subspaces, and anomalies may exist in any of them, so a brute-force method is computationally prohibitive because it must search an exponential number of subspaces. The second challenge is that the relative contrast between data objects becomes more and more similar in a high-dimensional space [1, 2]. Most existing anomaly detection methods implicitly or explicitly rely on the distance between data objects. In a low-dimensional space, distance metrics make it easy to mine anomalies, as in the distance-based methods in [3, 4]; unfortunately, in a high-dimensional space, such distance metrics may no longer measure the similarity between data [5,6,7]. The third challenge is that the data distribution becomes sparse in a high-dimensional space, making it hard to provide sufficient information to distinguish anomalies from normal instances [8,9,10]. Overall, mining anomalies in a high-dimensional space is a tough task.

Currently, anomaly detection methods are divided into the following categories: (I) Distance-based, such as K-Nearest Neighbor (KNN) [3, 4]. Although such methods neither assume a data distribution nor require training samples, the distance between data in a high-dimensional space is not easy to calculate. (II) Cluster-based, e.g., the model in [11]; such methods require assuming a data distribution. (III) Reconstruction error-based, for which a threshold is set in advance [12], such as Matrix Factorization (MF) [13]; when the reconstruction error is larger than the set threshold, the instance is considered an anomaly, and if the threshold is not set properly, the precision of the detection method suffers seriously. (IV) Classification-based, e.g., One-Class Support Vector Machine (OC-SVM) [14]; here, anomalies are linearly separated from normal instances by the SVM, but the curse of dimensionality limits the ability of the SVM to linearly separate features [15, 16]. (V) Deep network architecture-based, e.g., Deep Autoencoder (DAE) [17] and Generative Adversarial Networks (GANs) [18]; deep networks can capture the layered features used to identify anomalies and normal instances from the background space [19, 20]. For instance, GANs exhibit an excellent ability for anomaly detection even if the reconstructed anomaly instances are very poor [21]; nevertheless, GANs are prone to mode collapse during training. In addition, Goh et al. use recurrent neural networks (RNNs) [22] for anomaly detection in cyber-physical systems. (VI) Hybrid method-based, i.e., consisting of deep networks and traditional detection methods, such as Deep Neural Networks based K-classification (DNN-K) [23], Deep Neural Network–Support Vector Machine (DNN–SVM) [24], and Deep Neural Network–Support Vector Data Description (DNN–SVDD) [25]. Hybrid methods are much more extensible, although their computational complexity grows as the depth of the hybrid architecture increases [26].

In addition to the above methods, hyperspheres are also commonly used for anomaly detection. For example, in [27, 28], a hypersphere is used for binary classification of normal and anomaly classes. Hyperspheres are usually sensitive to a lack of data and, moreover, have no advantage in learning complex invariants. However, in a good feature space, hyperspheres can exhibit an attractive capability of separating data.

Given the complementary advantages of a hypersphere and a deep network, it is very attractive to study a hybrid method of both for anomaly detection. In this work, our motivation is to mine the limited number of potential anomalies existing in a high-dimensional space. In addition, we explore the efficiency of searching the subspaces of a high-dimensional space, aiming to give the computational complexity of anomaly detection algorithms. Hence, we develop a hybrid model consisting of a hypersphere and a deep neural network to meet these targets. First, the deep neural network in the proposed model captures layered low-dimensional features from high-dimensional data. To promote the ability of the deep neural network to learn these layered low-dimensional features, we fuse the probability approach of sample binary-classification into the loss function, thereby forming the probability deep neural network. Then, the hypersphere linearly separates the captured low-dimensional features. Finally, the proposed method is verified on synthetic and real-world data sets.

We summarize the main contributions of this work as follows.

  1. High dimensionality increases the complexity of the data space. In the face of the curse of dimensionality, a deep neural network that fuses the probabilistic method of sample multi-classification can capture the desired low-dimensional features from high-dimensional data; moreover, these captured features exhibit more pronounced layered characteristics.

  2. As long as the extracted features can represent even a few anomalous instances, they are sufficient to distinguish anomalies from normal instances.

  3. In a high-dimensional space, a hybrid approach composed of a deep neural network and a traditional detection method has stronger mining power than deep detection approaches or traditional detection approaches alone.

Methodology

Layered feature extraction

Usually, the background space is high-dimensional, which is not conducive to anomaly mining, so we first capture low-dimensional features from the data in the background space using a deep neural network. The purpose is to reduce the dimensionality of the search space for anomaly mining.

The loss function is one of the critical hyper-parameters of a deep neural network because it affects the network's learning capability [29,30,31]. Anomaly detection can be treated as binary-classification of samples, i.e., into anomaly and normal classes; based on this, we construct the loss function from the perspective of calculating the sample classification probability.

Given a sample set x = {x1,x2,…,xl} with l > 0, let P(Ci) denote the probability that the point xl belongs to class Ci \(\in\) class list C, with i = 1,2,…,m. In addition, let us assume that the class assignments are mutually independent, i.e., conditionally independent. The result of conditional independence is given in the following equation:

$$ P(C|s_{j} ) = P(C_{1} ,C_{2} , \ldots ,C_{m} |s_{j} ) = \mathop \Pi \limits_{i = 1}^{m} P(C_{i} |s_{j} ), $$
(1)

where sj is the jth class in the class set. The posterior probability for labeling x is as follows:

$$ P(s_{j} |C) = \frac{{P(s_{j} )P(C|s_{j} )}}{{P(C)}} = \frac{{P(s_{j} )\mathop \Pi \nolimits_{i = 1}^{m} P(C_{i} |s_{j} )}}{{P(C)}},\quad j = 1,2, \ldots $$
(2)

Since the denominator in Eq. (2) does not depend on sj, it can be ignored [32]. Therefore, sj can be determined using the following equation:

$$ \Omega (x) \propto P(s_{j} )\mathop \Pi \limits_{i = 1}^{m} P(C_{i} |s_{j} ). $$
(3)

According to [32], the probability estimate P(Ci|sj) and the prior probability for class Cj are replaced by \(\frac{{M_{i} (j,u_{i} )}}{{D_{j} }}\) and \(\frac{{N_{j} }}{N}\), respectively. Hence, Eq. (3) is modified as follows:

$$ \Omega (x) \propto \left\{ {\mathop \Pi \limits_{i = 1}^{m} \frac{{M_{i} (k,u) + \frac{1}{C}}}{{D_{k} + 1}}} \right\}^{B} , $$
(4)

where C is the number of classes, Mi(k,ui) is the number of data set elements that have class sk and are assigned to class su, Dk denotes the number of elements in class su, and B is a constant. Regarding the value of B, Titterington [32] proposes 1, 0.8, or 0.5 as reference values. For the detailed proof of Eqs. (3) and (4), see [32]. Let C1 and C2 be the anomaly class and the normal class, respectively. Then Eq. (4) simplifies to

$$ \Omega (x) \propto \left\{ {\mathop \Pi \limits_{i = 1}^{2} \frac{{M_{i} (1,2) + \frac{1}{2}}}{3}} \right\}^{B} . $$
(5)

Equation (5) gives the probability \(\Omega (x)\) that the point xl is classified into class Ci.
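To make the counting in Eqs. (4)–(5) concrete, the following Python sketch computes the smoothed score \(\Omega (x)\) from a count vector and a class size; the function name, the count values, and the default B = 1 are illustrative assumptions under our reading of [32], not values from the paper.

```python
import numpy as np

def omega(M_k, D_k, C=2, B=1.0):
    """Minimal sketch of the smoothed score in Eq. (4).

    M_k : hypothetical vector of counts M_i(k, u) for i = 1..m
    D_k : size of class s_k
    C   : number of classes
    B   : exponent; Titterington [32] suggests 1, 0.8, or 0.5
    """
    smoothed = (np.asarray(M_k, float) + 1.0 / C) / (D_k + 1.0)
    return float(np.prod(smoothed) ** B)

# Binary case of Eq. (5): C1 = anomaly, C2 = normal, so C = 2 and D_k + 1 = 3
print(omega(M_k=[1.0, 2.0], D_k=2.0, C=2, B=1.0))
```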

Sparsity can be encouraged by adding a regularization term that takes a large value when the average activation value \(\hat{\rho }_{i}\) of a neuron i and its desired value \(\rho\) are not close in value [33]. One such sparsity regularization term is the KL divergence:

$$ KL_{{{\text{sparse}}}} = \sum\limits_{i} {\left[ {\rho \log \left( {\frac{\rho }{{\hat{\rho }_{i} }}} \right) + (1 - \rho )\log \left( {\frac{1 - \rho }{{1 - \hat{\rho }_{i} }}} \right)} \right]} . $$
(6)

We calculate the average output activation of neuron i in a probabilistic manner:

$$ \hat{\rho }_{i} = \frac{1}{n}\sum\limits_{{j = 1}}^{n} {h\left( {{\mathbf{w}}_{i}^{{(1){\mathbf{T}}}} x_{j} *\Omega (x) + {\mathbf{b}}_{i}^{{(1)}} } \right),} $$
(7)

where n is the total number of training examples. \({\mathbf{w}}_{i}^{{(1){\mathbf{T}}}}\) is the ith row of the weight matrix \({\mathbf{W}}^{{{\mathbf{(1)}}}}\), and \({\mathbf{b}}_{i}^{(1)}\) is the ith entry of the bias vector \({\mathbf{b}}^{(1)}\). The loss function L(w,b) is given in the following equation:

$$ L({w},{\mathbf{b}}) = ||e - \hat{e}|| + KL_{{{\text{sparse}}}} , $$
(8)

where e and \(\hat{e}\) are the input and the reconstructed input, respectively. Equation (8) encodes the probability that anomalous points and normal points in a given sample belong to classes C1 and C2, respectively.
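As a rough illustration of Eqs. (6)–(8), the PyTorch sketch below combines the reconstruction error with the KL sparsity term, weighting the first-layer pre-activations by \(\Omega (x)\) as in Eq. (7). The tensor shapes, the sigmoid choice for h, and the sparsity target `rho` are assumptions made for the sketch.

```python
import torch

def kl_sparse(rho_hat, rho=0.05):
    """KL sparsity penalty of Eq. (6); rho_hat holds the average activation of each neuron."""
    rho_hat = rho_hat.clamp(1e-6, 1.0 - 1e-6)          # avoid log(0)
    return (rho * torch.log(rho / rho_hat)
            + (1.0 - rho) * torch.log((1.0 - rho) / (1.0 - rho_hat))).sum()

def dnnh_loss(e, e_hat, preact, bias, omega_x, rho=0.05):
    """Loss of Eq. (8): ||e - e_hat|| + KL_sparse.

    preact  : W^(1) x_j for all samples, shape (n, hidden_dim)
    bias    : b^(1), shape (hidden_dim,)
    omega_x : Omega(x) of Eq. (5), shape (n, 1), used as in Eq. (7)
    """
    rho_hat = torch.sigmoid(preact * omega_x + bias).mean(dim=0)   # Eq. (7)
    return torch.norm(e - e_hat) + kl_sparse(rho_hat, rho)
```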

Anomaly separation

A hypersphere is defined by a center a and a radius R. Given a data set y = {y1,y2,…,yi,…}, i = 1, 2,…, the error function used by the hypersphere to learn a compact space around the yi can be defined as follows [25]:

$$ \left\{ \begin{gathered} \Delta ({\mathbf{a}},r) = R^{2} + \lambda \sum\limits_{i} {\xi_{i} } \hfill \\ \left\| {y_{i} - {\mathbf{a}}} \right\|^{2} \le R^{2} + \xi_{i} , \quad \forall i = 1,2, \ldots ,\;\xi_{i} \ge 0 \hfill \\ \end{gathered} \right., $$
(9)

where \(\xi_{i}\) is a slack variable that allows some data points to fall outside the hypersphere; \(\xi_{i}\) is an auxiliary variable rather than part of the parameter set. \(\lambda\) is a user-defined trade-off on the proportion of data allowed outside the hypersphere. The distance between a sample and the center of the sphere can be calculated as follows (see [25] for the detailed derivation):

$$ \left\| {y - {\mathbf{a}}} \right\|^{2} = (y^{{\text{T}}} y) - 2\sum\limits_{i} {\alpha_{i} (y^{{\text{T}}} y_{i} )} + \sum\limits_{i,j} {\alpha_{i} \alpha_{j} (y_{i}^{{\text{T}}} y_{j} )} , $$
(10)

where \(\alpha_{i}\) and \(\alpha_{j}\) are Lagrange multipliers, and j = 1, 2, … Using the mapping transformation, Eq. (10) can be converted into the following equation:

$$ \left\| {\varphi (y) - {\mathbf{a}}} \right\|^{2} = K(y,y) - 2\sum\limits_{i} {\alpha_{i} K(y,y_{i} )} + \sum\limits_{i,j} {\alpha_{i} \alpha_{j} K(y_{i} ,y_{j} )} , $$
(11)

where \(\varphi ( \, )\) is a mapping function and \(K( \, )\) is a kernel function satisfying Mercer's theorem. Many kernel functions satisfy Mercer's theorem; in this work, we select the Matern52 kernel in [34]:

$$ K(y,y_{i} ) = \theta_{0} \left( {1 + \sqrt {C_{r} \,r^{2} (y,y_{i} )} + A_{r} \,r^{2} (y,y_{i} )} \right)\exp \left\{ { - \sqrt {B_{r} \,r^{2} (y,y_{i} )} } \right\}, $$
(12)

where Cr, Ar, Br are constant coefficients.

There are two reasons for selecting the Matern52 kernel as our kernel function: (i) the Matern52 kernel can make the radius warping concave and non-decreasing [34, 35], so it tends to focus more on regions with small radii; (ii) the Matern52 kernel, which is a continuous positive-definite kernel, can flexibly control searches in the normal data region because it is non-stationary [36]. This is very conducive to improving the accuracy of separating anomaly and normal features.
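For reference, a Python sketch of the kernel of Eq. (12) follows. The coefficients Cr, Ar, and Br are left as parameters; setting Cr = Br = 5 and Ar = 5/3 recovers the standard Matern 5/2 kernel, which is our assumption about the intended constants rather than values stated above.

```python
import numpy as np

def matern52(y, yi, theta0=1.0, Cr=5.0, Ar=5.0 / 3.0, Br=5.0):
    """Matern52-style kernel of Eq. (12); default coefficients give the usual Matern 5/2."""
    r2 = float(np.sum((np.asarray(y, float) - np.asarray(yi, float)) ** 2))  # r^2(y, y_i)
    return theta0 * (1.0 + np.sqrt(Cr * r2) + Ar * r2) * np.exp(-np.sqrt(Br * r2))
```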

The output of the hypersphere is calculated using the following equation:

$$ R^{2} - \left\| {\varphi (y) - {\mathbf{a}}} \right\|^{2} . $$
(13)

In Eq. (13), a sample is normal if the output is positive; otherwise, the sample is anomalous.
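Putting Eqs. (11) and (13) together, the decision value for a new sample can be sketched as below; `support`, `alpha`, and `R2` are assumed to come from an already-trained hypersphere (e.g., an SVDD-style solver), and `matern52` is the kernel sketch above.

```python
import numpy as np

def hypersphere_score(y, support, alpha, R2, kernel=matern52):
    """Eq. (13): R^2 - ||phi(y) - a||^2. Positive -> normal, negative -> anomaly."""
    alpha = np.asarray(alpha, float)
    k_yy = kernel(y, y)
    k_ys = np.array([kernel(y, yi) for yi in support])
    k_ss = np.array([[kernel(yi, yj) for yj in support] for yi in support])
    dist2 = k_yy - 2.0 * alpha @ k_ys + alpha @ k_ss @ alpha       # Eq. (11)
    return R2 - dist2
```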

The proposed model is composed of the deep neural network and the hypersphere, so the final learning function \(\nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))\) of our model combines the loss function L(w,b) of the probability deep neural network in Eq. (8) and the error function \(\Delta ({\mathbf{a}},r)\) of the hypersphere in Eq. (9), as follows:

$$ \nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))=\min (L({\mathbf{w}},{\mathbf{b}}) + \Delta ({\mathbf{a}},r)). $$
(14)
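A minimal sketch of how the two terms of Eq. (14) might be combined during optimisation is given below. The soft-boundary form of \(\Delta ({\mathbf{a}},r)\), with slack \(\xi_{i} = \max (0,\;\|y_{i} - \mathbf{a}\|^{2} - R^{2})\), is our reading of Eq. (9), and the trade-off `lam` is a user-chosen value.

```python
import torch

def delta_sphere(dist2, R2, lam=0.1):
    """Delta(a, r) of Eq. (9): R^2 + lambda * sum_i xi_i, with xi_i = max(0, dist2_i - R^2)."""
    return R2 + lam * torch.clamp(dist2 - R2, min=0.0).sum()

def total_objective(loss_net, dist2, R2, lam=0.1):
    """Eq. (14): minimise L(w, b) + Delta(a, r)."""
    return loss_net + delta_sphere(dist2, R2, lam)
```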

Model

In this section, we explain the rationale of the proposed model and describe its architecture. In addition, several hyper-parameters are configured, and the model training is presented.

Rationality

Since the background space is high-dimensional, anomaly detection is performed in the low-dimensional feature space instead of the background space. Based on this fact, we opt for a deep neural network to capture low-dimensional features from high-dimensional data. To make these captured low-dimensional features more layered, the binary-classification probability of samples in Eq. (5) is fused into the loss function in Eq. (8). Our idea originates from the fact that anomaly detection can be treated as binary-classification of samples, so we adopt the probability approach of sample binary-classification.

The performance of the proposed model relies not only on the captured low-dimensional features but also on the kernel in the hypersphere. In view of this, the kernel in Eq. (12) helps promote the hypersphere's ability to separate anomaly features. The hypersphere is trained using the error function in Eq. (9) and then outputs the separated features. As such, the proposed model exhibits an outstanding ability to identify anomalies in the input samples.

Model architecture

The proposed model, which consists of the probability deep neural network and the hypersphere (namely, DNNH), has three modules: an encoding module, a hypersphere module, and a decoding module, as shown in Fig. 1. The encoding module in Fig. 1 has two hidden layers; the ith hidden layer is denoted Hi(e), i = 1, 2. Given an input sample Z = {z1,z2,…,zn}, Z is mapped onto the input layer of the encoding module. Then, Hi(e) captures the low-dimensional features F = {f1, f2,…, fm} from Z, where F contains anomaly features and normal features and m < n. Equation (8) ensures that F is well extracted by Hi(e). The captured F is then sent to the hypersphere module.

Fig. 1

DNNH architecture. The model consists of an encoding module, a hypersphere module and a decoding module

For the hypersphere module in Fig. 1, based on the captured F, the kernel in Eq. (12) separates anomaly features from normal features. Through iterative learning of the error function in Eq. (9), the hypersphere is trained. Once training is completed, the hypersphere outputs the separated low-dimensional features Fs = {Fa, Fn}, where Fa = {fa1,fa2, …, fau}, Fn = {fn1, fn2, …, fnv}, and u + v = m.

For the decoding module in Fig. 1, similar to the encoding module, there are two hidden layers; the jth hidden layer is denoted Hj(d), j = 1, 2. After receiving Fs, Hj(d) reconstructs the input Z. Finally, the output layer of the decoding module outputs the learned normal and anomaly classes.
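The PyTorch sketch below mirrors the three modules of Fig. 1. The layer widths, the learned center/radius parametrisation, and the use of a plain Euclidean distance in feature space (instead of the kernelised distance of Eq. (11)) are simplifying assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DNNH(nn.Module):
    """Sketch of the DNNH architecture in Fig. 1: encoder -> hypersphere -> decoder."""

    def __init__(self, n_in, n_hidden=256, n_feat=32):
        super().__init__()
        # Encoding module: two hidden layers H1(e), H2(e)
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_feat), nn.Sigmoid(),
        )
        # Hypersphere module: learned center a and radius R over the feature space F
        self.center = nn.Parameter(torch.zeros(n_feat))
        self.radius = nn.Parameter(torch.tensor(1.0))
        # Decoding module: two hidden layers H1(d), H2(d)
        self.decoder = nn.Sequential(
            nn.Linear(n_feat, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_in), nn.Sigmoid(),
        )

    def forward(self, z):
        f = self.encoder(z)                              # low-dimensional features F
        dist2 = ((f - self.center) ** 2).sum(dim=1)      # distance to the sphere center
        score = self.radius ** 2 - dist2                 # Eq. (13): >0 normal, <0 anomaly
        return self.decoder(f), score
```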

Hyper parameter configuration

Regarding the hyper-parameters of DNNH, we carefully studied some of them, e.g., the optimizer, the learning rate, and the activation function. Because the other hyper-parameters have no substantial effect on the results, their default values are adopted.

Optimizer. Adam is used as the optimizer of DNNH. The reasons are that (i) Adam inherits the capability of AdaGrad to deal with sparse gradients [37]; (ii) Adam handles sparse gradients better than existing optimizers such as RMSprop, SGD, Momentum, and Nesterov; and (iii) Adam provides different learning rates for different parameters.

Learning rate. With Adam as the optimizer of DNNH, there is no need to initialize the learning rate for DNNH.

Activation function. Sigmoid is used as the activation function. Compared with other activation functions, e.g., tanh, ReLU, and ELU, the output of Sigmoid lies between 0 and 1, which is well suited to judging anomalies versus normal points.

In addition, we also need to consider the B value in Eq. (5); B adopts the reference value in [32], i.e., B = 1.

Training and testing

For model training, we dynamically adjust the number of iteration epochs according to the observed training precision; training finishes once DNNH converges. For model testing, the testing set is used to verify the performance of DNNH.
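A schematic training loop, reusing the `DNNH` and `delta_sphere` sketches above, could look like the following; `train_loader` is an assumed `torch.utils.data.DataLoader` of input samples, and the simple loss-plateau test stands in for the convergence check described here.

```python
import torch

model = DNNH(n_in=5000)                               # architecture sketch from "Model architecture"
optimizer = torch.optim.Adam(model.parameters())      # Adam with its default learning rate
max_epochs, tol, prev = 500, 1e-4, float("inf")

for epoch in range(max_epochs):
    for z in train_loader:                            # assumed DataLoader yielding input batches
        z_hat, score = model(z)
        dist2 = model.radius ** 2 - score             # recover ||f - a||^2 from Eq. (13)
        loss = torch.norm(z - z_hat) + delta_sphere(dist2, model.radius ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if abs(prev - loss.item()) < tol:                 # stop once training stops improving
        break
    prev = loss.item()
```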

Experimental settings

In “Data sets”, the experimental data sets are described, including 15 synthetic data sets and 6 real-world high-dimensional data sets. In “Competing methods and their parameters”, the compared approaches and their parameters are described. The experimental assessment metrics are given in “Assessment metrics”.

Data sets

We generated three types of synthetic data sets using the manner in [38], as shown in Fig. 2; each type contains 5 sub data sets. For the sub data sets of the same type, the data dimensionality gradually increases from 1000 to 5000. The first type in Fig. 2a, denoted T1, has normal data and anomalies randomly mixed. The second type in Fig. 2b, denoted T2, has anomalies lying outside the normal data. The third type in Fig. 2c, denoted T3, has normal data surrounding the anomalies. For the 15 synthetic data sets, we consider a limited number of anomalies, i.e., the anomaly ratio is 3%. The detailed description of the 15 synthetic data sets is listed in Table 1 of Appendix A.

Fig. 2

Synthetic data sets. Anomalies are marked as red forks. Normal data are marked in blue

Six real-world data sets are adopted, whose data dimensionalities are greater than 1000. In addition, we use two benchmark data sets for cross-validation with regard to data division. Since the eight real-world high-dimensional data sets are usually used for classification or clustering tasks, we converted them into anomaly detection data sets using the manner in [38]. Table 2 of Appendix A gives a detailed description of the eight real-world data sets.

Competing methods and their parameters

We selected nine state-of-the-art approaches from five types of anomaly detection methods: (I) distance-based, KNN in [3]; (II) reconstruction error-based, MF in [14]; (III) classification-based, OC-SVM in [15]; (IV) deep network-based, DAE in [17], GANs in [18], and RNNs in [22]; (V) hybrid architecture-based, DNN-K in [23], DNN–SVM in [24], and DNN–SVDD in [25].

For the nine competitors, the optimal parameters reported in the corresponding literature were used. Unless otherwise stated, all experiments were run under the same experimental settings. In addition, parameters that are not stated adopt their default values.

Assessment metrics

In this work, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to assess the accuracy of anomaly detection. In addition, the mean square error (mse) and standard deviation (sd) are applied to assess the detection results of the methods. The formulas for mse and sd are given in the following equation:

$$ \left\{ \begin{gathered} {\text{mse}} = \frac{1}{D}\sum\limits_{i = 1}^{D} {(y_{i} - p_{i} )^{2} } \hfill \\ {\text{sd}} = \sqrt {\frac{1}{D}\sum\limits_{i = 1}^{D} {\left( {p_{i} - \frac{1}{D}\sum\limits_{i = 1}^{D} {p_{i} } } \right)^{2} } } \hfill \\ \end{gathered} \right., $$
(15)

where yi is the actual value, pi is the predicted value, and D is the input data volume. To obtain fair results, all experiments were run independently 100 times. We then analyze the statistical significance of the results on the synthetic data sets using a t test (p value < 0.05 for mse).
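For completeness, the two metrics of Eq. (15) can be computed as in the following sketch, where `y_true` and `y_pred` stand for the actual and predicted values over the D inputs.

```python
import numpy as np

def mse_sd(y_true, y_pred):
    """mse and sd of Eq. (15); sd is the population standard deviation of the predictions."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    mse = np.mean((y_true - y_pred) ** 2)
    sd = np.sqrt(np.mean((y_pred - y_pred.mean()) ** 2))
    return mse, sd
```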

Results

In this section, all experimental results are presented, including mse, sd, and detection accuracy, aiming to demonstrate the capabilities of DNNH for anomaly detection and to give some insights into anomaly detection in a high-dimensional space.

All results show that the detection performance of DNNH is significantly better than that of the nine competitors in the considered cases. Moreover, the experimental results do not, in general, show a difference between DNNH and the nine competitors in the ability to detect anomalies in a high-dimensional space.

Cross-validation

Since the division of the experimental data sets affects the training accuracy of DNNH, the division of the data sets needs to be tested so as to obtain the optimal proportion between the training set and the testing set. Using benchmark data sets B1 and B2, we configured the following training/testing proportions: 0.9/0.1, 0.8/0.2, 0.7/0.3, 0.6/0.4, and 0.5/0.5.

The results are shown in Fig. 7 of Appendix B. When the training/testing proportion is 0.8/0.2, DNNH achieves the best performance, i.e., an AUC of 90.80% and 92.25% on benchmark data sets B1 and B2, respectively. Therefore, all subsequent experiments adopt the 0.8/0.2 split of the training/testing set.
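The 0.8/0.2 division can be reproduced with a standard stratified split, as in the sketch below; the placeholder data and labels are hypothetical and only illustrate the proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5000)                       # placeholder high-dimensional samples
y = (np.random.rand(1000) < 0.03).astype(int)         # ~3% anomaly labels, as in the synthetic sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # training/testing = 0.8/0.2
```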

Experiments on synthetic data sets

The results on the two metrics mse and sd show that DNNH has lower errors in the mined anomalies than all nine competitors, as shown in Table 3 of Appendix C. The statistical results in Table 3 of Appendix C indicate that there are, in general, no differences between DNNH and the competitors in mining accuracy.

Mining accuracy

The mining-accuracy results in Fig. 3 show that the capability of anomaly detection methods decreases as the dimensionality of the input data increases. For different data distributions, i.e., the three types of data sets in Fig. 3a–c, DNNH maintains a high mining accuracy and also outperforms the nine competitors. In particular, when the dimensionality of the input data is 5000, DNNH achieves an accuracy of over 93%, whereas the traditional methods, e.g., [3], [14] and [15], remain below 71%, and the deep and hybrid methods remain below 85%. Together, these results imply that DNNH is not sensitive to the data distribution and has outstanding advantages for high-dimensional anomaly detection.

Fig. 3

Mining performance for different dimensionalities of the input data

Anomaly clustering

Figure 4 displays the visualized results of anomaly clustering on synthetic data sets T1(5), T2(5), and T3(5), where the regions surrounded by a black line represent the anomaly clusters detected by DNNH and the nine competitors. Figure 4a shows that the quantity and quality of the anomaly clusters detected by DNNH exceed those of the nine competitors (please observe the area surrounded by a black line), while the traditional methods, such as OC-SVM in [15], MF in [14], and KNN in [3], obtain the poorest results. Similarly, in Fig. 4b, c, DNNH not only obtains advanced anomaly-clustering results but is also better than the nine competitors. As such, for high-dimensional data with different distributions, DNNH outperforms the competitors with regard to the detected anomaly clusters.

Fig. 4

Clustering results. Anomalies are marked as red forks. Normal data are marked as blue circles. Black lines outline the clustered regions of the detected anomalies

Time complexity

Figure 5 displays the execution time of DNNH and the nine competitors. The results show that the execution time of the traditional methods, e.g., the methods in [3], [14] and [15], is lower than that of the deep and hybrid methods on most data sets, while the execution time of DNNH is lower than that of the other hybrid methods, such as the methods in [23], [24] and [25].

Fig. 5

Running time of each method

In computing Eq. (9), DNNH performs multiple iterations. The number of iterations T depends on the input data dimension D and the data volume V, i.e., T = I1D + I2V, where I1 and I2 are constants. For a fixed data volume, T mainly depends on the data dimension and increases with it. Hence, the time complexity of DNNH is on the order of I1D + I2V, i.e., between O(n) and O(n2), and far below O(n3).

Experiments on real-world data sets

The results on real-world data sets show that the mining accuracy of DNNH exceeds that of the nine competitors, as shown in Fig. 6. In particular, on the ultra-high-dimensional data set R1 (data dimension equal to 10,000), DNNH reaches a mining accuracy above 71%, while the traditional methods, e.g., KNN [3], MF [14] and OC-SVM [15], almost fail on R1, i.e., their mining accuracy is less than 20%. Obviously, the hybrid and deep approaches are superior to the traditional methods in mining performance. This is because deep architectures can capture low-dimensional features from high-dimensional data, thereby decreasing the dimensionality of the input data.

Fig. 6

Mining performance on real-world data sets

According to the experimental results on the synthetic and real-world data sets, several observations can be made from Figs. 3, 4, 5 and 6.

  i. High dimensionality of the input data increases the complexity of the data space; by fusing the probability method of sample binary-classification into the deep neural network, the hidden layers can extract low-dimensional layered features that are used to distinguish anomalies from normal instances.

  ii. If the extracted features can represent even a few anomaly instances, this is enough to identify anomalies from the normal classes.

  iii. In a high-dimensional space, hybrid methods consisting of deep networks and traditional methods show excellent anomaly-mining capability.

  iv. The time complexity of deep detection algorithms depends on the depth of the network architecture and the dimensionality of the input data. Usually, data lying in a high-dimensional space requires a deeper network architecture to learn more meaningful features, so the computational complexity of deep detection algorithms increases with the number of network layers, while the computational complexity of hybrid algorithms combines that of deep detection algorithms and traditional detection algorithms.

Discussion

DNNH has an outstanding mining capability for high-dimensional anomalies; we give a detailed explanation as follows. First, Eq. (8) ensures the layered characteristic of the captured low-dimensional anomaly features and normal features, while Eq. (5) reduces the probability that anomalous points are classified into the C2 (normal) class during anomaly mining. Then, the kernel in Eq. (12) achieves a linear separation of the two types of captured low-dimensional features. Moreover, by learning the error function \(\Delta ({\mathbf{a}},r)\) in Eq. (9), the hypersphere allows the captured low-dimensional features to lie far from the center a of the sphere. Finally, by learning the final learning function \(\nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))\) in Eq. (14), DNNH obtains advanced anomaly-mining results in a high-dimensional space.

Conclusion

In this work, a hybrid method is proposed for anomaly mining in a high-dimensional space. In the proposed method, the probability deep neural network first captures the low-dimensional features from the background space. Then, the captured low-dimensional features are separated using the hypersphere so as to distinguish the anomaly and normal classes. Experimental results show that the proposed method outperforms the advanced anomaly detection methods in mining ability. We demonstrate that deep neural networks can capture the desired low-dimensional layered features by fusing the probability method of sample multi-classification. Moreover, the extracted features that represent even a few anomaly instances are sufficient to identify anomalies from normal instances. In future work, we will explore anomaly detection methods that address the interference of irrelevant attributes in a high-dimensional space, i.e., how to identify anomalies masked by irrelevant attributes in a high-dimensional space.