Introduction

The “curse of dimensionality” brings three major challenges for anomaly mining. The first is the computational cost of detection methods: a high-dimensional space may contain many subspaces, and anomalies may exist in any of them, so a brute-force method is computationally prohibitive because it must search an exponential number of subspaces. The second challenge is that the relative contrast between data objects becomes more and more similar in a high-dimensional space [1, 2]. Most existing anomaly detection methods implicitly or explicitly rely on the distance between data objects. In a low-dimensional space, distance metrics make it easy to mine anomalies, as in the distance-based methods in [3, 4]; unfortunately, in a high-dimensional space, such distance metrics may no longer measure the similarity between data [5,6,7]. The third challenge is that the data distribution becomes sparse in a high-dimensional space, making it hard to provide sufficient information to distinguish anomalies from normal instances [8,9,10]. Overall, mining anomalies in a high-dimensional space is a tough task.

Currently, anomaly detection methods are divided into the following categories: (I) Distance-based, such as K-Nearest Neighbor (KNN) [3, 4]. Although such methods neither assume a data distribution nor require training samples, the distance between data in a high-dimensional space is not easy to calculate. (II) Cluster-based, e.g., the model in [11]; such methods require assuming a data distribution. (III) Reconstruction error-based, for which a threshold is set in advance [12], such as Matrix Factorization (MF) [13]; when the reconstruction error is larger than the set threshold, the instance is considered an anomaly, and if the threshold is not set properly, the precision of the detection method suffers seriously. (IV) Classification-based, e.g., One-Class Support Vector Machine (OC-SVM) [14]; here, anomalies are linearly separated from normal instances by the SVM, but the curse of dimensionality limits the ability of the SVM to linearly separate features [15, 16]. (V) Deep network architecture-based, e.g., Deep Autoencoder (DAE) [17] and Generative Adversarial Networks (GANs) [18]; deep networks can capture the layered features used to identify anomalies and normal instances from the background space [19, 20]. For instance, GANs exhibit an excellent ability for anomaly detection even if the reconstructed anomaly instances are very poor [21]; nevertheless, GANs are prone to mode collapse during training. In addition, Goh et al. use recurrent neural networks (RNNs) [22] for anomaly detection in cyber-physical systems. (VI) Hybrid method-based, i.e., consisting of deep networks and traditional detection methods, such as Deep Neural Networks based K-classification (DNN-K) [23], Deep Neural Network–Support Vector Machine (DNN–SVM) [24], and Deep Neural Network–Support Vector Data Description (DNN–SVDD) [25]. Hybrid methods are much more extensible, although their computational complexity grows as the depth of the hybrid architecture increases [26].

In addition to the above methods, hyperspheres are also commonly used for anomaly detection. For example, in [27, 28], a hypersphere is used for binary classification of normal and anomaly classes. Hyperspheres are usually sensitive to a lack of data and, moreover, have no advantage in learning complex invariants. However, in a good feature space, hyperspheres can exhibit an attractive capability of separating data.

Given the complementary advantages of a hypersphere and a deep network, it is very attractive to study a hybrid method of both for anomaly detection. In this work, our motivation is to mine the limited number of potential anomalies existing in a high-dimensional space. In addition, we explore the efficiency of searching the subspaces of a high-dimensional space, aiming to give the computational complexity of anomaly detection algorithms. Hence, we develop a hybrid model consisting of a hypersphere and a deep neural network to meet these targets. First, the deep neural network in the proposed model captures layered low-dimensional features from high-dimensional data. To promote the ability of the deep neural network to learn these layered low-dimensional features, we fuse the probability approach of sample binary-classification into the loss function, thereby forming the probability deep neural network. Then, the hypersphere linearly separates the captured low-dimensional features. Finally, the proposed method is verified on synthetic and real-world data sets.

We summarize the main contributions of this work as follows.

  1. High dimensionality increases the complexity of the data space. In the face of the curse of dimensionality, a deep neural network that fuses the probabilistic method of sample multi-classification can capture the desired low-dimensional features from high-dimensional data; moreover, these captured features exhibit more pronounced layered characteristics.

  2. As long as the extracted features can represent even a few anomalous instances, they are sufficient to distinguish anomalies from normal instances.

  3. In a high-dimensional space, a hybrid approach composed of a deep neural network and a traditional detection method has stronger mining power than deep detection approaches or traditional detection approaches alone.

Methodology

Layered feature extraction

Usually, the background space is high-dimensional, which is not conducive to anomaly mining, so we first capture low-dimensional features from the data in the background space using a deep neural network. The purpose is to reduce the dimensionality of the search space for anomaly mining.

The loss function is one of the critical hyper-parameters of a deep neural network because it affects the network's learning capability [29,30,31]. Anomaly detection can be treated as binary-classification of samples, i.e., into anomaly and normal classes; based on this, we construct the loss function from the perspective of calculating the sample classification probability.

Given a sample set x = {x1,x2,…,xl} with l > 0, let P(Ci) denote the probability that the point xl belongs to class Ci \(\in\) class list C, with i = 1,2,…,m. In addition, let us assume that the class assignments are mutually independent, i.e., conditionally independent. The result of conditional independence is given in the following equation:

$$ P(C|s_{j} ) = P(C_{1} ,C_{2} , \ldots ,C_{m} |s_{j} ) = \mathop \Pi \limits_{i = 1}^{m} P(C_{i} |s_{j} ), $$
(1)

where sj is the jth class in the class set. The posterior probability for labeling x is as follows:

$$ P(s_{j} |C) = \frac{{P(s_{j} )P(C|s_{j} )}}{{P(C)}} = \frac{{P(s_{j} )\mathop \Pi \nolimits_{i = 1}^{m} P(C_{i} |s_{j} )}}{{P(C)}},\quad j = 1,2, \ldots $$
(2)

Since the denominator in Eq. (2) does not depend on sj, it can be ignored [32]. Therefore, sj can be determined using the following equation:

$$ \Omega (x) \propto P(s_{j} )\mathop \Pi \limits_{i = 1}^{m} P(C_{i} |s_{j} ). $$
(3)

According to [32], the probability estimate P(Ci|sj) and the prior probability for class Cj are replaced by \(\frac{{M_{i} (j,u_{i} )}}{{D_{j} }}\) and \(\frac{{N_{j} }}{N}\), respectively. Hence, Eq. (3) is modified as follows:

$$ \Omega (x) \propto \left\{ {\mathop \Pi \limits_{i = 1}^{m} \frac{{M_{i} (k,u) + \frac{1}{C}}}{{D_{k} + 1}}} \right\}^{B} , $$
(4)

where C is the number of classes, Mi(k,ui) is the number of data set elements that have class sk and are assigned to class su, Dk denotes the number of elements in class su, and B is a constant. Regarding the value of B, Titterington [32] proposes 1, 0.8, or 0.5 as reference values. For the detailed proof of Eqs. (3) and (4), see [32]. Let C1 and C2 be the anomaly class and the normal class, respectively. Then Eq. (4) simplifies to

$$ \Omega (x) \propto \left\{ {\mathop \Pi \limits_{i = 1}^{2} \frac{{M_{i} (1,2) + \frac{1}{2}}}{3}} \right\}^{B} . $$
(5)

Equation (5) gives the probability \(\Omega (x)\) that the point xl is classified into class Ci.
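To make the counting in Eqs. (4)–(5) concrete, the following Python sketch computes the smoothed score \(\Omega (x)\) from a count vector and a class size; the function name, the count values, and the default B = 1 are illustrative assumptions under our reading of [32], not values from the paper.

```python
import numpy as np

def omega(M_k, D_k, C=2, B=1.0):
    """Minimal sketch of the smoothed score in Eq. (4).

    M_k : hypothetical vector of counts M_i(k, u) for i = 1..m
    D_k : size of class s_k
    C   : number of classes
    B   : exponent; Titterington [32] suggests 1, 0.8, or 0.5
    """
    smoothed = (np.asarray(M_k, float) + 1.0 / C) / (D_k + 1.0)
    return float(np.prod(smoothed) ** B)

# Binary case of Eq. (5): C1 = anomaly, C2 = normal, so C = 2 and D_k + 1 = 3
print(omega(M_k=[1.0, 2.0], D_k=2.0, C=2, B=1.0))
```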

Sparsity can be encouraged by adding a regularization term that takes a large value when the average activation value \(\hat{\rho }_{i}\) of a neuron i and its desired value \(\rho\) are not close in value [33]. One such sparsity regularization term is the KL divergence:

$$ KL_{{{\text{sparse}}}} = \sum\limits_{i} {\left[ {\rho \log \left( {\frac{\rho }{{\hat{\rho }_{i} }}} \right) + (1 - \rho )\log \left( {\frac{1 - \rho }{{1 - \hat{\rho }_{i} }}} \right)} \right]} . $$
(6)

We calculate the average output activation of neuron i in a probabilistic manner:

$$ \hat{\rho }_{i} = \frac{1}{n}\sum\limits_{{j = 1}}^{n} {h\left( {{\mathbf{w}}_{i}^{{(1){\mathbf{T}}}} x_{j} *\Omega (x) + {\mathbf{b}}_{i}^{{(1)}} } \right),} $$
(7)

where n is the total number of training examples. \({\mathbf{w}}_{i}^{{(1){\mathbf{T}}}}\) is the ith row of the weight matrix \({\mathbf{W}}^{{{\mathbf{(1)}}}}\), and \({\mathbf{b}}_{i}^{(1)}\) is the ith entry of the bias vector \({\mathbf{b}}^{(1)}\). The loss function L(w,b) is given in the following equation:

$$ L({w},{\mathbf{b}}) = ||e - \hat{e}|| + KL_{{{\text{sparse}}}} , $$
(8)

where e and \(\hat{e}\) are the input and the reconstructed input, respectively. Equation (8) encodes the probability that anomalous points and normal points in a given sample belong to classes C1 and C2, respectively.
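As a rough illustration of Eqs. (6)–(8), the PyTorch sketch below combines the reconstruction error with the KL sparsity term, weighting the first-layer pre-activations by \(\Omega (x)\) as in Eq. (7). The tensor shapes, the sigmoid choice for h, and the sparsity target `rho` are assumptions made for the sketch.

```python
import torch

def kl_sparse(rho_hat, rho=0.05):
    """KL sparsity penalty of Eq. (6); rho_hat holds the average activation of each neuron."""
    rho_hat = rho_hat.clamp(1e-6, 1.0 - 1e-6)          # avoid log(0)
    return (rho * torch.log(rho / rho_hat)
            + (1.0 - rho) * torch.log((1.0 - rho) / (1.0 - rho_hat))).sum()

def dnnh_loss(e, e_hat, preact, bias, omega_x, rho=0.05):
    """Loss of Eq. (8): ||e - e_hat|| + KL_sparse.

    preact  : W^(1) x_j for all samples, shape (n, hidden_dim)
    bias    : b^(1), shape (hidden_dim,)
    omega_x : Omega(x) of Eq. (5), shape (n, 1), used as in Eq. (7)
    """
    rho_hat = torch.sigmoid(preact * omega_x + bias).mean(dim=0)   # Eq. (7)
    return torch.norm(e - e_hat) + kl_sparse(rho_hat, rho)
```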

Anomaly separation

A hypersphere is defined by a center a and a radius R. Given a data set y = {y1,y2,…,yi,…}, i = 1, 2,…, the error function used by the hypersphere to learn a compact space around the yi can be defined as follows [25]:

$$ \left\{ \begin{gathered} \Delta ({\mathbf{a}},r) = R^{2} + \lambda \sum\limits_{i} {\xi_{i} } \hfill \\ \left\| {y_{i} - {\mathbf{a}}} \right\|^{2} \le R^{2} + \xi_{i} , \quad \forall i = 1,2, \ldots ,\;\xi_{i} \ge 0 \hfill \\ \end{gathered} \right., $$
(9)

where \(\xi_{i}\) is a slack variable that allows some data points to fall outside the hypersphere; \(\xi_{i}\) is an auxiliary variable rather than part of the parameter set. \(\lambda\) is a user-defined trade-off on the proportion of data allowed outside the hypersphere. The distance between a sample and the center of the sphere can be calculated as follows (see [25] for the detailed derivation):

$$ \left\| {y - {\mathbf{a}}} \right\|^{2} = (y^{{\text{T}}} y) - 2\sum\limits_{i} {\alpha_{i} (y^{{\text{T}}} y_{i} )} + \sum\limits_{i,j} {\alpha_{i} \alpha_{j} (y_{i}^{{\text{T}}} y_{j} )} , $$
(10)

where \(\alpha_{i}\) and \(\alpha_{j}\) are Lagrange multipliers, and j = 1, 2, … Using the mapping transformation, Eq. (10) can be converted into the following equation:

$$ \left\| {\varphi (y) - {\mathbf{a}}} \right\|^{2} = K(y,y) - 2\sum\limits_{i} {\alpha_{i} K(y,y_{i} )} + \sum\limits_{i,j} {\alpha_{i} \alpha_{j} K(y_{i} ,y_{j} )} , $$
(11)

where \(\varphi ( \, )\) is a mapping function and \(K( \, )\) is a kernel function satisfying Mercer's theorem. Many kernel functions satisfy Mercer's theorem; in this work, we select the Matern52 kernel in [34]:

$$ K(y,y_{i} ) = \theta_{0} \left( {1 + \sqrt {C_{r} \,r^{2} (y,y_{i} )} + A_{r} \,r^{2} (y,y_{i} )} \right)\exp \left\{ { - \sqrt {B_{r} \,r^{2} (y,y_{i} )} } \right\}, $$
(12)

where Cr, Ar, Br are constant coefficients.

There are two reasons for selecting the Matern52 kernel as our kernel function: (i) the Matern52 kernel can make the radius warping concave and non-decreasing [34, 35], so it tends to focus more on regions with small radii; (ii) the Matern52 kernel, which is a continuous positive-definite kernel, can flexibly control searches in the normal data region because it is non-stationary [36]. This is very conducive to improving the accuracy of separating anomaly and normal features.
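For reference, a Python sketch of the kernel of Eq. (12) follows. The coefficients Cr, Ar, and Br are left as parameters; setting Cr = Br = 5 and Ar = 5/3 recovers the standard Matern 5/2 kernel, which is our assumption about the intended constants rather than values stated above.

```python
import numpy as np

def matern52(y, yi, theta0=1.0, Cr=5.0, Ar=5.0 / 3.0, Br=5.0):
    """Matern52-style kernel of Eq. (12); default coefficients give the usual Matern 5/2."""
    r2 = float(np.sum((np.asarray(y, float) - np.asarray(yi, float)) ** 2))  # r^2(y, y_i)
    return theta0 * (1.0 + np.sqrt(Cr * r2) + Ar * r2) * np.exp(-np.sqrt(Br * r2))
```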

The output of the hypersphere is calculated using the following equation:

$$ R^{2} - \left\| {\varphi (y) - {\mathbf{a}}} \right\|^{2} . $$
(13)

In Eq. (13), a sample is normal if the output is positive; otherwise, the sample is anomalous.
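Putting Eqs. (11) and (13) together, the decision value for a new sample can be sketched as below; `support`, `alpha`, and `R2` are assumed to come from an already-trained hypersphere (e.g., an SVDD-style solver), and `matern52` is the kernel sketch above.

```python
import numpy as np

def hypersphere_score(y, support, alpha, R2, kernel=matern52):
    """Eq. (13): R^2 - ||phi(y) - a||^2. Positive -> normal, negative -> anomaly."""
    alpha = np.asarray(alpha, float)
    k_yy = kernel(y, y)
    k_ys = np.array([kernel(y, yi) for yi in support])
    k_ss = np.array([[kernel(yi, yj) for yj in support] for yi in support])
    dist2 = k_yy - 2.0 * alpha @ k_ys + alpha @ k_ss @ alpha       # Eq. (11)
    return R2 - dist2
```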

The proposed model is composed of the deep neural network and the hypersphere, so the final learning function \(\nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))\) of our model combines the loss function L(w,b) of the probability deep neural network in Eq. (8) and the error function \(\Delta ({\mathbf{a}},r)\) of the hypersphere in Eq. (9), as follows:

$$ \nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))=\min (L({\mathbf{w}},{\mathbf{b}}) + \Delta ({\mathbf{a}},r)). $$
(14)
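A minimal sketch of how the two terms of Eq. (14) might be combined during optimisation is given below. The soft-boundary form of \(\Delta ({\mathbf{a}},r)\), with slack \(\xi_{i} = \max (0,\;\|y_{i} - \mathbf{a}\|^{2} - R^{2})\), is our reading of Eq. (9), and the trade-off `lam` is a user-chosen value.

```python
import torch

def delta_sphere(dist2, R2, lam=0.1):
    """Delta(a, r) of Eq. (9): R^2 + lambda * sum_i xi_i, with xi_i = max(0, dist2_i - R^2)."""
    return R2 + lam * torch.clamp(dist2 - R2, min=0.0).sum()

def total_objective(loss_net, dist2, R2, lam=0.1):
    """Eq. (14): minimise L(w, b) + Delta(a, r)."""
    return loss_net + delta_sphere(dist2, R2, lam)
```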

Model

In this section, we explain the rationale of the proposed model and describe its architecture. In addition, several hyper-parameters are configured, and the model training is presented.

Rationality

Since the background space is high-dimensional, anomaly detection is performed in the low-dimensional feature space instead of the background space. Based on this fact, we opt for a deep neural network to capture low-dimensional features from high-dimensional data. To make these captured low-dimensional features more layered, the binary-classification probability of samples in Eq. (5) is fused into the loss function in Eq. (8). Our idea originates from the fact that anomaly detection can be treated as binary-classification of samples, so we adopt the probability approach of sample binary-classification.

The performance of the proposed model relies not only on the captured low-dimensional features but also on the kernel in the hypersphere. In view of this, the kernel in Eq. (12) helps promote the hypersphere's ability to separate anomaly features. The hypersphere is trained using the error function in Eq. (9) and then outputs the separated features. As such, the proposed model exhibits an outstanding ability to identify anomalies in the input samples.

Model architecture

The proposed model, which consists of the probability deep neural network and the hypersphere (namely, DNNH), has three modules: an encoding module, a hypersphere module, and a decoding module, as shown in Fig. 1. The encoding module in Fig. 1 has two hidden layers; the ith hidden layer is denoted Hi(e), i = 1, 2. Given an input sample Z = {z1,z2,…,zn}, Z is mapped onto the input layer of the encoding module. Then, Hi(e) captures the low-dimensional features F = {f1, f2,…, fm} from Z, where F contains anomaly features and normal features and m < n. Equation (8) ensures that F is well extracted by Hi(e). The captured F is then sent to the hypersphere module.

Fig. 1

DNNH architecture. The model consists of an encoding module, a hypersphere module and a decoding module

For the hypersphere module in Fig. 1, based on the captured F, the kernel in Eq. (12) separates anomaly features from normal features. Through iterative learning of the error function in Eq. (9), the hypersphere is trained. Once training is completed, the hypersphere outputs the separated low-dimensional features Fs = {Fa, Fn}, where Fa = {fa1,fa2, …, fau}, Fn = {fn1, fn2, …, fnv}, and u + v = m.

For the decoding module in Fig. 1, similar to the encoding module, there are two hidden layers; the jth hidden layer is denoted Hj(d), j = 1, 2. After receiving Fs, Hj(d) reconstructs the input Z. Finally, the output layer of the decoding module outputs the learned normal and anomaly classes.
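The PyTorch sketch below mirrors the three modules of Fig. 1. The layer widths, the learned center/radius parametrisation, and the use of a plain Euclidean distance in feature space (instead of the kernelised distance of Eq. (11)) are simplifying assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DNNH(nn.Module):
    """Sketch of the DNNH architecture in Fig. 1: encoder -> hypersphere -> decoder."""

    def __init__(self, n_in, n_hidden=256, n_feat=32):
        super().__init__()
        # Encoding module: two hidden layers H1(e), H2(e)
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_feat), nn.Sigmoid(),
        )
        # Hypersphere module: learned center a and radius R over the feature space F
        self.center = nn.Parameter(torch.zeros(n_feat))
        self.radius = nn.Parameter(torch.tensor(1.0))
        # Decoding module: two hidden layers H1(d), H2(d)
        self.decoder = nn.Sequential(
            nn.Linear(n_feat, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_in), nn.Sigmoid(),
        )

    def forward(self, z):
        f = self.encoder(z)                              # low-dimensional features F
        dist2 = ((f - self.center) ** 2).sum(dim=1)      # distance to the sphere center
        score = self.radius ** 2 - dist2                 # Eq. (13): >0 normal, <0 anomaly
        return self.decoder(f), score
```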

Hyper parameter configuration

Regarding the hyper-parameters of DNNH, we carefully studied some of them, e.g., the optimizer, the learning rate, and the activation function. Because the other hyper-parameters have no substantial effect on the results, their default values are adopted.

Optimizer. Adam is used as the optimizer of DNNH. The reasons are that (i) Adam inherits the capability of AdaGrad to deal with sparse gradients [37]; (ii) Adam handles sparse gradients better than existing optimizers such as RMSprop, SGD, Momentum, and Nesterov; and (iii) Adam provides different learning rates for different parameters.

Learning rate. With Adam as the optimizer of DNNH, there is no need to initialize the learning rate for DNNH.

Activation function. Sigmoid is used as the activation function. Compared with other activation functions, e.g., tanh, ReLU, and ELU, the output of Sigmoid lies between 0 and 1, which is well suited to judging anomalies versus normal points.

In addition, we also need to consider the B value in Eq. (5); B adopts the reference value in [32], i.e., B = 1.

Training and testing

For model training, we dynamically adjust the number of iteration epochs according to the observed training precision; training finishes once DNNH converges. For model testing, the testing set is used to verify the performance of DNNH.
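A schematic training loop, reusing the `DNNH` and `delta_sphere` sketches above, could look like the following; `train_loader` is an assumed `torch.utils.data.DataLoader` of input samples, and the simple loss-plateau test stands in for the convergence check described here.

```python
import torch

model = DNNH(n_in=5000)                               # architecture sketch from "Model architecture"
optimizer = torch.optim.Adam(model.parameters())      # Adam with its default learning rate
max_epochs, tol, prev = 500, 1e-4, float("inf")

for epoch in range(max_epochs):
    for z in train_loader:                            # assumed DataLoader yielding input batches
        z_hat, score = model(z)
        dist2 = model.radius ** 2 - score             # recover ||f - a||^2 from Eq. (13)
        loss = torch.norm(z - z_hat) + delta_sphere(dist2, model.radius ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if abs(prev - loss.item()) < tol:                 # stop once training stops improving
        break
    prev = loss.item()
```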

Experimental settings

In “Data sets”, the experimental data sets are described, including 15 synthetic data sets and 6 real-world high-dimensional data sets. In “Competing methods and their parameters”, the compared approaches and their parameters are described. The experimental assessment metrics are given in “Assessment metrics”.

Data sets

We generated three types of synthetic data sets using the manner in [38], as shown in Fig. 2; each type contains 5 sub data sets. For the sub data sets of the same type, the data dimensionality gradually increases from 1000 to 5000. The first type in Fig. 2a, denoted T1, has normal data and anomalies randomly mixed. The second type in Fig. 2b, denoted T2, has anomalies lying outside the normal data. The third type in Fig. 2c, denoted T3, has normal data surrounding the anomalies. For the 15 synthetic data sets, we consider a limited number of anomalies, i.e., the anomaly ratio is 3%. The detailed description of the 15 synthetic data sets is listed in Table 1 of Appendix A.

Fig. 2

Synthetic data sets. Anomalies are marked as red forks. Normal data are marked in blue

Six real-world data sets are adopted, whose data dimensionalities are greater than 1000. In addition, we use two benchmark data sets for cross-validation with regard to data division. Since the eight real-world high-dimensional data sets are usually used for classification or clustering tasks, we converted them into anomaly detection data sets using the manner in [38]. Table 2 of Appendix A gives a detailed description of the eight real-world data sets.

Competing methods and their parameters

We selected nine state-of-the-art approaches from five types of anomaly detection methods: (I) distance-based, KNN in [3]; (II) reconstruction error-based, MF in [14]; (III) classification-based, OC-SVM in [15]; (IV) deep network-based, DAE in [17], GANs in [18], and RNNs in [22]; (V) hybrid architecture-based, DNN-K in [23], DNN–SVM in [24], and DNN–SVDD in [25].

For the nine competitors, the optimal parameters reported in the corresponding literature were used. Unless otherwise stated, all experiments were run under the same experimental settings. In addition, parameters that are not stated adopt their default values.

Assessment metrics

In this work, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to assess the accuracy of anomaly detection. In addition, the mean square error (mse) and standard deviation (sd) are applied to assess the detection results of the methods. The formulas for mse and sd are given in the following equation:

$$ \left\{ \begin{gathered} {\text{mse}} = \frac{1}{D}\sum\limits_{i = 1}^{D} {(y_{i} - p_{i} )^{2} } \hfill \\ {\text{sd}} = \sqrt {\frac{1}{D}\sum\limits_{i = 1}^{D} {\left( {p_{i} - \frac{1}{D}\sum\limits_{i = 1}^{D} {p_{i} } } \right)^{2} } } \hfill \\ \end{gathered} \right., $$
(15)

where yi is the actual value, pi is the predicted value, and D is the input data volume. To obtain fair results, all experiments were run independently 100 times. We then analyze the statistical significance of the results on the synthetic data sets using a t test (p value < 0.05 for mse).
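For completeness, the two metrics of Eq. (15) can be computed as in the following sketch, where `y_true` and `y_pred` stand for the actual and predicted values over the D inputs.

```python
import numpy as np

def mse_sd(y_true, y_pred):
    """mse and sd of Eq. (15); sd is the population standard deviation of the predictions."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    mse = np.mean((y_true - y_pred) ** 2)
    sd = np.sqrt(np.mean((y_pred - y_pred.mean()) ** 2))
    return mse, sd
```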

Results

In this section, all experimental results are presented, including mse, sd, and detection accuracy, aiming to demonstrate the capabilities of DNNH for anomaly detection and to give some insights into anomaly detection in a high-dimensional space.

All results show that the detection performance of DNNH is significantly better than that of the nine competitors in the considered cases. Moreover, the experimental results do not, in general, show a difference between DNNH and the nine competitors in the ability to detect anomalies in a high-dimensional space.

Cross-validation

Since the division of the experimental data sets affects the training accuracy of DNNH, the division of the data sets needs to be tested so as to obtain the optimal proportion between the training set and the testing set. Using benchmark data sets B1 and B2, we configured the following training/testing proportions: 0.9/0.1, 0.8/0.2, 0.7/0.3, 0.6/0.4, and 0.5/0.5.

The results are shown in Fig. 7 of Appendix B. When the training/testing proportion is 0.8/0.2, DNNH achieves the best performance, i.e., an AUC of 90.80% and 92.25% on benchmark data sets B1 and B2, respectively. Therefore, all subsequent experiments adopt the 0.8/0.2 split of the training/testing set.
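The 0.8/0.2 division can be reproduced with a standard stratified split, as in the sketch below; the placeholder data and labels are hypothetical and only illustrate the proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5000)                       # placeholder high-dimensional samples
y = (np.random.rand(1000) < 0.03).astype(int)         # ~3% anomaly labels, as in the synthetic sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # training/testing = 0.8/0.2
```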

Experiments on synthetic data sets

The results on the two metrics mse and sd show that DNNH has lower errors in the mined anomalies than all nine competitors, as shown in Table 3 of Appendix C. The statistical results in Table 3 of Appendix C indicate that there are, in general, no differences between DNNH and the competitors in mining accuracy.

Mining accuracy

The mining-accuracy results in Fig. 3 show that the capability of anomaly detection methods decreases as the dimensionality of the input data increases. For different data distributions, i.e., the three types of data sets in Fig. 3a–c, DNNH maintains a high mining accuracy and also outperforms the nine competitors. In particular, when the dimensionality of the input data is 5000, DNNH achieves an accuracy of over 93%, whereas the traditional methods, e.g., [3], [14] and [15], remain below 71%, and the deep and hybrid methods remain below 85%. Together, these results imply that DNNH is not sensitive to the data distribution and has outstanding advantages for high-dimensional anomaly detection.

Fig. 3

Mining performance for different dimensionalities of the input data

Anomaly clustering

Figure 4 displays the visualized results of anomaly clustering on synthetic data sets T1(5), T2(5), and T3(5), where the regions surrounded by a black line represent the anomaly clusters detected by DNNH and the nine competitors. Figure 4a shows that the quantity and quality of the anomaly clusters detected by DNNH exceed those of the nine competitors (please observe the area surrounded by a black line), while the traditional methods, such as OC-SVM in [15], MF in [14], and KNN in [3], obtain the poorest results. Similarly, in Fig. 4b, c, DNNH not only obtains advanced anomaly-clustering results but is also better than the nine competitors. As such, for high-dimensional data with different distributions, DNNH outperforms the competitors with regard to the detected anomaly clusters.

Fig. 4

Clustering results. Anomalies are marked as red forks. Normal data are marked as blue circles. Black lines outline the clustered regions of the detected anomalies

Time complexity

Figure 5 displays the execution time of DNNH and the nine competitors. The results show that the execution time of the traditional methods, e.g., the methods in [3], [14] and [15], is lower than that of the deep and hybrid methods on most data sets, while the execution time of DNNH is lower than that of the other hybrid methods, such as the methods in [23], [24] and [25].

Fig. 5

Running time of each method

In computing Eq. (9), DNNH performs multiple iterations. The number of iterations T depends on the input data dimension D and the data volume V, i.e., T = I1D + I2V, where I1 and I2 are constants. For a fixed data volume, T mainly depends on the data dimension and increases with it. Hence, the time complexity of DNNH is on the order of I1D + I2V, i.e., between O(n) and O(n2), and far below O(n3).

Experiments on real-world data sets

The results on real-world data sets show that the mining accuracy of DNNH exceeds that of the nine competitors, as shown in Fig. 6. In particular, on the ultra-high-dimensional data set R1 (data dimension equal to 10,000), DNNH reaches a mining accuracy above 71%, while the traditional methods, e.g., KNN [3], MF [14] and OC-SVM [15], almost fail on R1, i.e., their mining accuracy is less than 20%. Obviously, the hybrid and deep approaches are superior to the traditional methods in mining performance. This is because deep architectures can capture low-dimensional features from high-dimensional data, thereby decreasing the dimensionality of the input data.

Fig. 6

Mining performance on real-world data sets

According to the experimental results on the synthetic and real-world data sets, several observations can be made from Figs. 3, 4, 5 and 6.

  i. High dimensionality of the input data increases the complexity of the data space; by fusing the probability method of sample binary-classification into the deep neural network, the hidden layers can extract low-dimensional layered features that are used to distinguish anomalies from normal instances.

  ii. If the extracted features can represent even a few anomaly instances, this is enough to identify anomalies from the normal classes.

  iii. In a high-dimensional space, hybrid methods consisting of deep networks and traditional methods show excellent anomaly-mining capability.

  iv. The time complexity of deep detection algorithms depends on the depth of the network architecture and the dimensionality of the input data. Usually, data lying in a high-dimensional space requires a deeper network architecture to learn more meaningful features, so the computational complexity of deep detection algorithms increases with the number of network layers, while the computational complexity of hybrid algorithms combines that of deep detection algorithms and traditional detection algorithms.

Discussion

DNNH has an outstanding mining capability for high-dimensional anomalies; we give a detailed explanation as follows. First, Eq. (8) ensures the layered characteristic of the captured low-dimensional anomaly features and normal features, while Eq. (5) reduces the probability that anomalous points are classified into the C2 (normal) class during anomaly mining. Then, the kernel in Eq. (12) achieves a linear separation of the two types of captured low-dimensional features. Moreover, by learning the error function \(\Delta ({\mathbf{a}},r)\) in Eq. (9), the hypersphere allows the captured low-dimensional features to lie far from the center a of the sphere. Finally, by learning the final learning function \(\nabla (L({\mathbf{w}},{\mathbf{b}}),\Delta ({\mathbf{a}},r))\) in Eq. (14), DNNH obtains advanced anomaly-mining results in a high-dimensional space.

Conclusion

In this work, a hybrid method is proposed for anomaly mining in a high-dimensional space. In the proposed method, the probability deep neural network first captures the low-dimensional features from the background space. Then, the captured low-dimensional features are separated using the hypersphere so as to distinguish the anomaly and normal classes. Experimental results show that the proposed method outperforms the advanced anomaly detection methods in mining ability. We demonstrate that deep neural networks can capture the desired low-dimensional layered features by fusing the probability method of sample multi-classification. Moreover, the extracted features that represent even a few anomaly instances are sufficient to identify anomalies from normal instances. In future work, we will explore anomaly detection methods that address the interference of irrelevant attributes in a high-dimensional space, i.e., how to identify anomalies masked by irrelevant attributes in a high-dimensional space.