1 Introduction

Outliers (a.k.a. anomalies) are data points that show dramatically different behavior from the remaining data points in the dataset. The process of finding such data points is known as Outlier Detection (OD). In the era of big data, OD is considered one of the vital tasks of data mining, with a wide range of application domains [21], e.g., (i) fraud detection—here an outlier is a fraudulent activity such as credit card fraud [6] or insurance claim fraud [4]; (ii) medical and public health—here an outlier is an unusual health condition of a patient, arising from instrumental error or disease symptoms [14].

Recently, researchers have become interested in explaining why a data point is considered an outlier. The problem of finding such explanations is known as Outlying Aspect Mining (OAM) [8, 22, 27, 28]. OAM is the task of identifying the feature subset(s) in which a given data point is dramatically inconsistent with the rest of the data. In the literature, the problem of OAM is also referred to as outlying subspace detection [31], outlier explanation [9, 17, 18], outlier interpretation [7, 16, 29], outlying property detection [1] and outlying aspect mining [8, 22, 23, 26,27,28, 30].

In many application scenarios, one needs to find out in which set of features a given point differs from the others. For example, in a bank, a fraud analyst collects information about various aspects of credit card fraud and wants to know in which aspects a fraudulent transaction does not conform with the remainder of the data. Similarly, when evaluating job applications, a panel member wants to know a job applicant's unique features. Another interesting application of OAM is in the medical domain [20]: a doctor treating a specific patient wants to know how this patient differs from other patients. Existing OD methods cannot answer these questions.

To detect outlying aspects, OAM algorithms require a scoring measure to rank subspaces based on the outlying degree of the given query. Existing OAM algorithms such as HOSMiner [31], OAMiner [8], Density Z-Score [27] and sGrid [28] use a traditional distance- or density-based outlier score as the ranking measure. Because distance- and density-based outlier scores depend on the dimensionality of subspaces, they cannot be compared directly to rank subspaces. [27] proposed Z-Score normalization to make them comparable, which requires computing the outlier scores of all data points in each subspace. This adds significant computational overhead, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. Moreover, we discover that Z-Score normalization is not appropriate for OAM in some cases.

In this paper, we focus on two issues of existing scores used in OAM: (i) dimensionality unbiasedness, and (ii) computational complexity. It is worth noting that another computational issue in OAM is dealing with the exponentially large number of subspaces. Current OAM methods perform a systematic search, which is computationally prohibitive when the number of dimensions is high. This paper does not deal with that issue; it still uses the existing systematic search approach but computes the score in each subspace efficiently.

This paper makes the following contributions:

  • Identify an issue of using Z-Score normalization of density-based outlier scores to rank subspaces, and show that it is biased towards subspaces having high density variance.

  • Propose a new simple measure called the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is useful both for detecting outliers in a dataset and for finding the outlying aspects of given outlier points.

  • Provide an objective measure to assess the quality of discovered outlying subspaces.

  • Validate the effectiveness and efficiency of SiNNE in OAM. Our empirical results show that SiNNE detects more interesting outlying aspects than the existing scores, and it allows the OAM algorithm to run orders of magnitude faster than the existing scoring measures.

The rest of the paper is organized as follows. Section 2 provides a summary of previous work on outlying aspect mining. The proposed outlier detector scoring measure is presented in Sect. 3. Experimental settings are provided in Sect. 4, and empirical evaluation results are provided in Sect. 5. Finally, conclusions are provided in Sect. 6.

2 Related Works

In this section, we first fix some notation for the rest of the paper, provide some basic definitions, and then discuss recent outlying aspect mining methods. The high-level process pipeline of OAM is shown in Fig. 1.

Fig. 1 The high-level process pipeline of OAM

2.1 Basic Notations and Definitions

Let \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n\}\) be a collection of n data points in a d-dimensional real space \(\Re ^d\). Each data point x is represented as a d-dimensional vector \(\big \langle x^{(1)}, x^{(2)}, \ldots , x^{(d)} \big \rangle\). Let \({\mathcal {F}}\) be the full feature space and \(\mathbb {S} = \{{\mathcal {S}}_1, {\mathcal {S}}_2, \ldots , {\mathcal {S}}_\eth \}\) be the set of all possible subspaces, where \(\eth = 2^d - 1\) is the number of possible subspaces. The key symbols and notations used in this paper are provided in Table 1.

Table 1 Key symbols and notations

The problem of outlier detection is to identify all points \(x_i\) that remarkably deviate from the others in the full feature space \({\mathcal {F}}\), whereas the problem of outlying aspect mining is to identify the subspaces \({\mathcal {S}}_i \in \mathbb {S}\) in which a given data point \(x_i \in {\mathcal {X}}\) is significantly different from the rest of the data. That given data point is referred to as the query \(\mathbf{q}\).

Definition 1

(Outlier) An outlier is a data instance that significantly deviates from others in the full feature set \({\mathcal {F}}\).

Definition 2

(Subspace) A subspace is a subset of the d dimensions of dataset \({\mathcal {X}}\).

Definition 3

(Query point) A query \(\mathbf{q}\) is a data point of interest for which outlying aspects are sought.

Definition 4

(Problem definition) Given a set of n instances \({\mathcal {X}}\) (\(|{\mathcal {X}}| = n\)) in d-dimensional space and a query \(\mathbf{q} \in {\mathcal {X}}\), a subspace \({\mathcal {S}}\) is called an outlying aspect of \(\mathbf{q}\) iff

  • the outlying degree of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) is higher than in any other subspace, i.e., no other subspace has the same or a higher outlying degree.

2.2 Outlying Aspect Mining

To the best of our knowledge, [31] is the earliest work that defines the problem of OAM. The authors introduced a framework called HOS-Miner (High-dimensional Outlying Subspace Miner) to detect outlying subspaces. It uses a distance-based measure called the Outlying Degree (OutD in short). The OutD of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\) is computed as:

$$\begin{aligned} OutD_{\mathcal {S}}(\mathbf{q}) = \sum \limits _{x\in \aleph _{\mathcal {S}}^k(\mathbf{q})} d_{\mathcal {S}}(\mathbf{q}, x) \end{aligned}$$

where \(\aleph _{\mathcal {S}}^k(\mathbf{q})\) is the set of k-nearest neighbors of \(\mathbf{q}\) in subspace \({\mathcal {S}}\), and \(d_{\mathcal {S}}(a, b)\) is the Euclidean distance between a and b in subspace \({\mathcal {S}}\), computed as \(d_{\mathcal {S}}(a,b) = \sqrt{\sum _{i \in {\mathcal {S}}} (a_i - b_i)^2 }\).
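To make the computation concrete, a minimal Python sketch of OutD follows; the function name and the brute-force nearest-neighbor search are our own choices for illustration:

```python
# A minimal sketch of HOS-Miner's OutD score (equation above); brute-force
# k-NN search is used for clarity, not efficiency.
import numpy as np

def outd(X, q, subspace, k=5):
    """Sum of distances from query q to its k nearest neighbors in `subspace`.

    X: (n, d) data array; q: (d,) query; subspace: list of attribute indices.
    """
    Xs, qs = X[:, subspace], q[subspace]
    dists = np.sort(np.sqrt(((Xs - qs) ** 2).sum(axis=1)))
    # Skip the zero distance to q itself when q is part of X.
    return dists[1:k + 1].sum() if dists[0] == 0 else dists[:k].sum()
```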

In 2015, [8] introduced the Outlying Aspect Miner (OAMiner in short). Instead of distance, the authors employed a kernel density estimation (KDE) [24] based scoring measure to compute the outlyingness of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\):

$$\begin{aligned} \tilde{f}_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{n(2 \pi )^{\frac{m}{2}} \prod_{i \in {\mathcal {S}}} h_{i}} \sum \limits _{{x} \in {\mathcal {X}}} e^ {- \sum_{i \in {\mathcal {S}}} \frac{(\mathbf{q} _i - x_i)^2}{2 h^2_{i}}} \end{aligned}$$

where \(\tilde{f}_{{\mathcal {S}}}(\mathbf{q})\) is the kernel density estimate of \(\mathbf{q}\) in subspace \({\mathcal {S}}\), m is the dimensionality of subspace \({\mathcal {S}}\) (\(m = |{\mathcal {S}}|\)), and \(h_{i}\) is the kernel bandwidth in dimension i.
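The following is a minimal sketch of this subspace KDE; how the per-dimension bandwidths \(h_i\) are chosen (e.g., by a standard bandwidth selection rule) is an assumption on our part:

```python
# A sketch of OAMiner's subspace Gaussian KDE (equation above).
import numpy as np

def kde_density(X, q, subspace, h):
    """Density of q in `subspace`; h maps dimension index -> bandwidth h_i."""
    n, m = len(X), len(subspace)
    norm = n * (2 * np.pi) ** (m / 2) * np.prod([h[i] for i in subspace])
    expo = sum((q[i] - X[:, i]) ** 2 / (2 * h[i] ** 2) for i in subspace)
    return np.exp(-expo).sum() / norm
```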

[8] stated that \(\tilde{f}_{{\mathcal {S}}}\) is biased towards high-dimensional subspaces—density tends to decrease as dimensionality increases. Thus, to remove the effect of this dimensionality bias, they proposed to use the density rank of the query as its outlyingness measure.

[27] proposed two outlying scoring metrics: (i) the density Z-Score and (ii) the iPath score (short for isolation path).

Therein, the density Z-Score is defined as follows:

$$\begin{aligned} \hbox {Z-Score} (\tilde{f}_{{\mathcal {S}}}(\mathbf{q})) \triangleq \frac{\tilde{f}_{{\mathcal {S}}}(\mathbf{q}) -\mu _{\tilde{f}_{{\mathcal {S}}}}}{\sigma _{\tilde{f}_{{\mathcal {S}}}}} \end{aligned}$$

where \(\mu _{\tilde{f}_{{\mathcal {S}}}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}}}\) are the mean and standard deviation of the density of all data instances in subspace \({\mathcal {S}}\), respectively.
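A sketch of this normalization follows, reusing kde_density from the previous sketch; note that it must evaluate the density of every data point in the subspace, which is exactly the computational overhead discussed in Sect. 1:

```python
import numpy as np

def z_score(X, q, subspace, h):
    # Densities of all n points are needed to get the subspace mean and std.
    dens = np.array([kde_density(X, x, subspace, h) for x in X])
    return (kde_density(X, q, subspace, h) - dens.mean()) / dens.std()
```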

The iPath score is motivated by the Isolation Forest (iForest) anomaly detection approach [15]. The iPath score of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\) w.r.t. sub-samples of size \(\psi\) of the data is computed as:

$$\begin{aligned} iPath_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{t} \sum \limits _{i=1}^t l_{{\mathcal {S}}}^i(\mathbf{q}) \end{aligned}$$

where t is the number of trees and \(l_{{\mathcal {S}}}^i(\mathbf{q})\) is the path length of \(\mathbf{q}\) in the \(i\)th tree built in subspace \({\mathcal {S}}\).

[27] were the first to coin the term dimensionality unbiasedness, i.e., “A dimensionality unbiased outlyingness measure (OM) is a measure of which the baseline value, i.e., average value for any data sample \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n \}\) drawn from a uniform distribution, is a quantity independent of the dimension of the subspace \({\mathcal {S}}\).”

[28] introduced a simple grid-based density estimator called sGrid, a smoothed variant of the grid-based density estimator of [24]. Let \({\mathcal {X}}\) be a collection of n data objects in d-dimensional space and \(x.{\mathcal {S}}\) be the projection of a data object \(x \in {\mathcal {X}}\) onto subspace \({\mathcal {S}}\). The sGrid density of a point \(\mathbf{q}\) is computed as the number of points falling into the bin that covers \(\mathbf{q}\) and its surrounding neighbors. By replacing the kernel density estimator with sGrid, the authors showed that it has advantages over KDE in outlying aspect mining.

In recent work, [30] proposed a reconstruction-based method using completely random trees (RecForest in short). Reconstruction is done using the intersection of the bounding boxes in the completely random forest for each data point. The outlying score OS of each feature \(i = 1, 2, \ldots , d\) for query \(\mathbf{q}\) is defined as:

$$\begin{aligned} OS_i = \frac{\exp \big ((\mathbf{q}_i - \mathbf{q}^{rec}_i)^2\big )}{\sum _{j=1}^d \exp \big ((\mathbf{q}_j - \mathbf{q}^{rec}_j)^2\big )} \end{aligned}$$

where \(\mathbf{q}^{rec}\) is a reconstructed sample of \(\mathbf{q}\).
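Under this reading, OS is a softmax over the squared per-feature reconstruction errors. A small sketch follows; producing \(\mathbf{q}^{rec}\) from the completely random trees is out of scope here, so it is taken as an input:

```python
# Sketch of RecForest's per-feature outlying score (equation above).
import numpy as np

def recforest_os(q, q_rec):
    err = (q - q_rec) ** 2          # squared reconstruction error per feature
    e = np.exp(err)                 # may overflow for very large errors
    return e / e.sum()              # OS_i for each feature i
```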

[29] proposed an Attention-guided Triplet deviation network for Outlier interpretatioN (ATON). Instead of searching subspaces, ATON learns an embedding space and how much each dimension contributes to the outlyingness of the query.

3 The Framework

We first outline the motivation for our method, followed by the details of SiNNE. Figure 2 presents the flowchart of the complete framework.

Fig. 2 The flowchart of the complete framework

3.1 Issue of Using Z-Score

Because Z-Score normalization uses the mean and standard deviation of the density values of all data instances in a subspace (\(\mu _{\tilde{f}_{{{\mathcal {S}}}_i}}\) and \(\sigma _{\tilde{f}_{{{\mathcal {S}}}_i}}\)), it can be biased towards subspaces having high variation of density values (i.e., high \(\sigma _{\tilde{f}_{{{\mathcal {S}}}_i}}\)).

Let us demonstrate this with a simple example. Let \({{\mathcal {S}}}_i\) and \({{\mathcal {S}}}_j\) (\(i \ne j\)) be two subspaces of the same dimensionality (i.e., \(|{\mathcal {S}}_i| = |{\mathcal {S}}_j|\)). Intuitively, because they have the same dimensionality, they can be ranked directly based on the raw (unnormalized) density values of a query \(\mathbf{q}\). However, because \(Z(\tilde{f}_{{\mathcal {S}}}(\mathbf{q}))\) depends on \(\mu _{\tilde{f}_{{\mathcal {S}}}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}}}\), we can have \(Z(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q})) < Z(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q}))\) even though \(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q}) = \tilde{f}_{{\mathcal {S}}_j}(\mathbf{q})\): a subspace whose density values are widely spread (high \(\mu _{\tilde{f}_{{\mathcal {S}}_i}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}_i}}\)) produces Z-Scores of large magnitude, so \({\mathcal {S}}_i\) can be ranked higher than \({\mathcal {S}}_j\) by the density Z-Score merely because of its density distribution, not because \(\mathbf{q}\) is more outlying there.

To show this effect on a real-world dataset, consider the pendigits dataset (\(n=9868\), \(d=16\)). Figure 3 shows the distribution of the data in two three-dimensional subspaces, \({\mathcal {S}}_i=\{7, 8, 13\}\) and \({\mathcal {S}}_j=\{2, 10, 13\}\). Visually, the query \(\mathbf{q}\) (the red square) appears to be more outlying in \({\mathcal {S}}_j\) than in \({\mathcal {S}}_i\). This is consistent with its raw density values in the two subspaces, \(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q})=1.20 < \tilde{f}_{{\mathcal {S}}_i}(\mathbf{q})=21.30\). However, the ranking is reversed after Z-Score normalization (\(Z(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q}))=-1.25 > Z(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q}))=-2.10\)), owing to the higher standard deviation \(\sigma _{\tilde{f}_{{\mathcal {S}}_i}}=57.3 > \sigma _{\tilde{f}_{{\mathcal {S}}_j}}=34.2\).

Fig. 3 Data distribution in two three-dimensional subspaces of the pendigits dataset. a \(\tilde{f}_{{{\mathcal {S}}}_i}(\mathbf{q})=21.30\), \(Z(\tilde{f}_{{{\mathcal {S}}}_i}(\mathbf{q}))=-2.10\); b \(\tilde{f}_{{{\mathcal {S}}}_j}(\mathbf{q})=1.20\), \(Z(\tilde{f}_{{{\mathcal {S}}}_j}(\mathbf{q}))=-1.25\)
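The numbers above can be sanity-checked with simple arithmetic. The means \(\mu _{\tilde{f}_{{\mathcal {S}}_i}}\) and \(\mu _{\tilde{f}_{{\mathcal {S}}_j}}\) are not reported, so the values below are our back-calculated assumptions, chosen to be consistent with the reported densities, Z-Scores and standard deviations:

```python
# Back-calculated check of the pendigits example. Densities, Z-Scores and
# standard deviations are from the text; the means mu_i, mu_j are our own
# assumptions, reverse-engineered so that all reported numbers are consistent.
f_i, sigma_i, mu_i = 21.30, 57.3, 141.6   # subspace S_i = {7, 8, 13}
f_j, sigma_j, mu_j = 1.20, 34.2, 44.0     # subspace S_j = {2, 10, 13}

z_i = (f_i - mu_i) / sigma_i              # approx. -2.10
z_j = (f_j - mu_j) / sigma_j              # approx. -1.25
# Raw densities say S_j is more outlying (f_j < f_i), yet the Z-Scores say
# S_i is (z_i < z_j): the wide density distribution of S_i flips the ranking.
print(round(z_i, 2), round(z_j, 2))
```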

Apart from this issue, existing OAM scoring measures have two limitations:

  • they are dimensionally biased in their raw form and hence require normalization; and

  • they are expensive to compute in each subspace.

Motivated by these limitations of density-based scores in OAM, we introduce a new measure that is dimensionally unbiased in its raw form and can be computed efficiently.

3.2 Outlierness Computation

We now introduce a new scoring measure called the Simple Isolation score using Nearest-Neighbor Ensembles (SiNNE in short). This scoring function is inspired by isolation-based anomaly detection using nearest-neighbor ensembles [2, 3].

The proposed scoring function has two major steps:

  • Building hyperspheres: hyperspheres are built in each subspace using nearest neighbors.

  • Scoring query: The current model is used to score the query.

3.2.1 Build Model

Let \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n\}\), \(x_i \in \Re ^d\), be a dataset, where n is the number of data points and d is the number of dimensions. In each subspace, we randomly choose \(\psi\) data samples from \({\mathcal {X}}\), t times.

Our proposed scoring function follows the same procedure as iNNE [2] to build an ensemble of hyperspheres. In the context of OAM, however, the difference is that we create the ensembles in subspaces instead of the full feature space.

SiNNE creates an ensemble of hyperspheres: t sets of hyperspheres, where each set consists of \(\psi\) hyperspheres.

Definition 5

(Hypersphere) Given a data subset \({\mathcal {D}}^{(\psi )}_i\), a hypersphere \(\mathsf {H}(c)\) centered at c with radius \(\tau (c) = ||c - \eta _c||\) is defined as {\(x: ||x - c|| \le \tau (c)\)}, where \(x \in \Re ^d\) and \(c, \eta _c \in {\mathcal {D}}^{(\psi )}_i\); \(\eta _c\) is the nearest neighbor of c in \({\mathcal {D}}^{(\psi )}_i\).

Definition 6

Given sub-samples of size \(\psi\), an ensemble \({\mathcal {H}}\) contains t sets, each consisting of \(\psi\) hyperspheres. \({\mathcal {H}}\) is defined as:

$$\begin{aligned} {\mathcal {H}} = \{\{\mathsf {H}(c): c \in {\mathcal {D}}^{(\psi )}_i\}: i = 1, 2, \ldots , t \} \end{aligned}$$

Note that the training processes of SiNNE and iNNE are the same; however, they differ in the computation of the outlier score (cf. Sect. 3.5 for further differences).

Definition 7

(Simple isolation score) The simple isolation score of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) based on sub-sample \({\mathcal {D}}\) is defined as:

$$\begin{aligned} \text{ SI}_{{\mathcal {S}}}(\mathbf{q}) = {\mathbb{I}}[\mathbf{q} \in \bigcup \limits _{c \in {\mathcal {D}}} \mathsf {H}(c)] \end{aligned}$$
(1)

where \({\mathbb{I}}[B]\) denotes the indicator function, which outputs 0 if B is true and 1 otherwise.

SI takes the value 0 or 1. When \(\mathbf{q}\) is covered by any of the hyperspheres, it is assigned 0; if it is not covered by any hypersphere, SiNNE assumes the point is far away from the data and assigns 1.

Definition 8

The outlier score of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) based on SiNNE is defined as the average of the simple isolation scores over the t sets:

$$\begin{aligned} \text{ SiNNE}_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{t} \sum \limits _{i=1}^{t} \text{ SI}_{{\mathcal {S}}}^i(\mathbf{q}) \end{aligned}$$
(2)

As SI takes only the values 0 or 1, \(\text{SiNNE}(\mathbf{q})\) lies in the range [0, 1].

The volume covered by each hypersphere decreases as the dimensionality of the space increases, and so does the actual data space covered by normal instances. Therefore, SiNNE is independent of the dimensionality of the space in its raw form, without any normalization, making it ideal for OAM. It adapts to the local data density because the sizes of the hyperspheres depend on the local density. It can also be computed much faster than a k-NN distance or density. Moreover, it does not require computing the outlier scores of all n instances in each subspace (which existing scores require for Z-Score normalization), giving it a significant run time advantage.

The procedures for building an ensemble of models and for using them to compute the outlyingness of a given query in subspace \({\mathcal {S}}\) are provided in Algorithms 1 and 2.

Algorithm 1 (building the ensemble of hyperspheres)
Algorithm 2 (scoring the query)
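For readers who prefer code, the following is a minimal Python sketch of our reading of Algorithms 1 and 2 (Definitions 5–8). Function and variable names are ours, the parameter defaults follow our experimental setup (\(\psi = 8\), \(t = 100\)), and brute-force distance computations are used for clarity rather than efficiency:

```python
import numpy as np

def build_model(Xs, psi=8, t=100, rng=None):
    """Algorithm 1 sketch: t sets of psi hyperspheres on subspace-projected Xs."""
    rng = rng or np.random.default_rng()
    model = []
    for _ in range(t):
        D = Xs[rng.choice(len(Xs), size=psi, replace=False)]  # sub-sample
        # Radius of each hypersphere = distance from its centre c to c's
        # nearest neighbour within the sub-sample (Definition 5).
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        model.append((D, dist.min(axis=1)))                   # (centres, radii)
    return model

def sinne_score(model, q):
    """Algorithm 2 sketch: average simple isolation score over the t sets."""
    hits = 0
    for centres, radii in model:
        covered = (np.linalg.norm(centres - q, axis=1) <= radii).any()
        hits += 0 if covered else 1   # SI = 0 if q is covered, else 1 (Eq. (1))
    return hits / len(model)          # Eq. (2), a score in [0, 1]
```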

Time complexity The time complexity of creating the SiNNE model is \(O(t \psi ^2)\). In the scoring stage, for a query data point, we need to check whether it falls in any hypersphere, which takes \(O(t\psi )\). The total time complexity of SiNNE is therefore \(O(t\psi ^2 + t\psi )\).

3.3 Subspace Search

Apart from a scoring measure, the OAM framework requires a subspace search method. In this work, we use the Beam search method [27], because it is the most recent search method used in the literature. We replicate the procedure of beam search in Algorithm 3 for ease of reference. The overall time complexity of beam search is \(O(d^2 + W \cdot d \cdot \ell )\), where W is the beam width and \(\ell\) is the maximum dimensionality of subspaces.

Algorithm 3 (beam search)
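The sketch below gives a simplified version of the beam search loop as we understand it from [27]; exact tie-breaking and level handling may differ from Algorithm 3. The defaults follow our setup (\(W = 100\), \(\ell = 3\)), and score(S) is any outlyingness measure where higher means more outlying (e.g., SiNNE; a raw density would be negated):

```python
import itertools

def beam_search(q, score, d, W=100, ell=3):
    """Return the top-ranked subspace (a tuple of attribute indices) for q."""
    pairs = [tuple(p) for p in itertools.combinations(range(d), 2)]
    cands = [(i,) for i in range(d)] + pairs     # levels 1 and 2: exhaustive
    best = max(cands, key=score)
    beam = sorted(pairs, key=score, reverse=True)[:W]
    for _ in range(3, ell + 1):
        # Extend every subspace kept in the beam by one unused attribute.
        nxt = {tuple(sorted(S + (i,))) for S in beam
               for i in range(d) if i not in S}
        if not nxt:
            break
        best = max([best, *nxt], key=score)
        beam = sorted(nxt, key=score, reverse=True)[:W]
    return best
```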

3.4 An Example of Proposed Method

In this section, we present an illustrative example of the proposed method. Figure 4a shows 8 randomly selected sub-samples (highlighted in black) from a dataset with \(n =50\) in a 2-d subspace. Figure 4b shows how a hypersphere \(\mathsf {H}(c)\) is built, centered at c with radius \(\tau (c)\). Figure 4c shows all 8 hyperspheres created from the 8 sub-samples, which are used to compute the outlying degree of a data point. As shown in Fig. 4d, to compute the outlying degree of a point x, we determine whether any hypersphere covers x. Here, \(\text{ SI }(x) = 0\) because x falls in a hypersphere, while data point y does not fall in any hypersphere, so its outlying degree is 1.

Fig. 4 a Randomly selected sub-samples \({\mathcal {D}}\) of size \(\psi = 8\); b building the hypersphere for data point c; c the set of hyperspheres from \({\mathcal {D}}\); d simple isolation scores for data points x and y using the isolation model

3.5 Key Differences with Closely Related Work

In this subsection, we discuss the differences between SiNNE and iNNE.

Although they share the same training process, SiNNE and iNNE employ different scoring mechanisms. Specifically, iNNE employs a local isolation-based score, computed as follows:

$$\begin{aligned} I_i(\mathbf{q}) = \left\{ \begin{array}{ll} 1 - \frac{\tau (\eta _{cnn(\mathbf{q})})}{\tau (cnn(\mathbf{q}))}, & {\text{ if }} \; \mathbf{q} \in \bigcup _{c\in {\mathcal {D}}_i} {\mathcal {H}}(c) \\ 1, & {\text{ otherwise }} \end{array}\right. \end{aligned}$$
(3)

where \(cnn(\mathbf{q}) = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{c \in {\mathcal {D}}} \{ \tau (c) : \mathbf{q} \in {\mathcal {H}}(c)\}\), \({\mathcal {D}}\) is a set of sub-samples randomly selected without replacement (\(|{\mathcal {D}}| = \psi\)), and \({\mathcal {H}}(c)\) is the hypersphere centered at c with radius \(\tau (c) = d_{\mathcal {S}}(c, \eta _c)\), where \(\eta _c\) is the nearest neighbor of c.
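To make the contrast concrete, here is a sketch of iNNE's per-set ratio score in the style of the SiNNE sketch in Sect. 3.2; the nn_radii argument (the radius \(\tau (\eta _c)\) of the hypersphere of each centre's nearest neighbor) is assumed to be precomputed during model building:

```python
import numpy as np

def inne_set_score(centres, radii, nn_radii, q):
    """One term of iNNE's score (Eq. (3)); the final score averages over t sets."""
    dist = np.linalg.norm(centres - q, axis=1)
    covering = np.where(dist <= radii)[0]        # hyperspheres covering q
    if covering.size == 0:
        return 1.0                               # not covered: fully isolated
    cnn = covering[np.argmin(radii[covering])]   # smallest covering hypersphere
    return 1.0 - nn_radii[cnn] / radii[cnn]      # local ratio score
```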

In contrast, SiNNE uses the new simple isolation-based score (cf. Eq. (1)), which assigns 0 if the point falls in any hypersphere and 1 otherwise.

Apart from this, iNNE creates its model in the full feature space, since its sole purpose is to detect outliers in the full feature space \({\mathcal {F}}\), while the purpose of SiNNE is to detect outlying subspaces for a given data point; thus it creates a model in each subspace. Although iNNE [2] was previously used as an outlier detector, its use in the OAM context is new.

Theorem 1

The isolation score using iNNE with sub-sample size \(\psi = 2\) is equivalent to SiNNE.

Proof

Given an iNNE model \({\mathcal {H}}\) with sub-sample size \(\psi = 2\), each set contains two hyperspheres with the same radius (cf. Definition 5). Thus, \(\tau (\eta _{cnn(\mathbf{q})}) = \tau (cnn(\mathbf{q}))\). For \(\psi = 2\), the isolation score becomes:

$$\begin{aligned} I_i(\mathbf{q}) = \left\{ \begin{array}{ll} 0, &\quad {\text{ if }} \; \mathbf{q} \in \bigcup _{c\in {\mathcal {D}}_i} {\mathcal {H}}(c), \\ 1, &\quad {\text{ otherwise }} \end{array}\right. \end{aligned}$$
(4)

which is the same as Eq. (1). \(\square\)

In terms of performance, SiNNE detects the ground truth for every query, while iNNE detects the ground truth for only 11 out of 15 queries (details are presented in the “Appendix”). In addition, SiNNE is faster than iNNE because it does not need to find the smallest covering hypersphere and its neighboring hypersphere to compute the score.

4 Experimental Setting

4.1 Datasets

In this study, we used two types of datasets: synthetic and real-world. For the synthetic datasets, we adopted five datasets from [13]: synth_10D, synth_20D, synth_50D, synth_75D, and synth_100D.

For the real-world datasets, we adopted six datasets from [5]: wilt, pageblocks, mnist, u2r, mulcross and covertype.

The characteristics of datasets in terms of data size and the dimensionality of the original input space are provided in Table 2.

Table 2 Dataset statistics

4.2 Contenders and Parameters

We compare SiNNE (SiBeam) with three contenders: (a) kernel density rank (RBeam), (b) Z-Score normalized kernel density (Beam), and (c) Z-Score normalized sGrid density (sGBeam).

We used the default parameters suggested in the respective papers unless specified otherwise. For SiBeam, we set \(\psi = 8\) and \(t = 100\). Beam and RBeam employ KDE (kernel density estimation) using the Gaussian kernel with the default bandwidth; to calculate the Gaussian kernel, we use Euclidean distance. The block size w for the bit set operation in sGBeam was set to 64, as suggested by the authors [28]. The beam width (W) and maximum dimensionality of subspace (\(\ell\)) in the Beam search procedure were set to 100 and 3, respectively, as done in [27].

4.3 Evaluation Metric

As far as we know, there is no publicly available real-world dataset that provides ground truth to verify the quality of discovered subspaces. Therefore, in the absence of a better evaluation measure, we propose to use kernel mean embedding [19] to evaluate the quality of discovered subspaces. The intuition is that, in the most outlying aspect, the query is far away from the distribution of the data, i.e., it has the minimum average similarity with the rest of the data. The quality of a discovered subspace \({\mathcal {S}}\) for a query \(\mathbf{q}\) using the kernel mean embedding method [19] is computed as follows:

$$\begin{aligned} f_{\mathcal {S}}(\mathbf{q}, {\mathcal {X}}) = \frac{1}{n} \sum \limits _{x \in {\mathcal {X}}} K_{\mathcal {S}}(\mathbf{q}, x) \end{aligned}$$
(5)

where \(K_{\mathcal {S}}(\mathbf{q}, x)\) is a kernel similarity of \(\mathbf{q}\) and x in subspace \({\mathcal {S}}\).

We use the Chi-square kernel [32] because it is parameter-free and widely used in the computer vision research community. The Chi-square kernel \(K_{\mathcal {S}}(\mathbf{q}, x)\) is computed as follows:

$$\begin{aligned} K_{\mathcal {S}}(\mathbf{q}, x) = 1 - \sum \limits _{i \in {\mathcal {S}}} 2\frac{(\mathbf{q}_i - x_i)^2 }{ (\mathbf{q}_i + x_i)} \end{aligned}$$

In OAM, \(\mathbf{q}\) is considered more outlying in \({\mathcal {S}}_i\) than in \({\mathcal {S}}_j\) if \(f_{{\mathcal {S}}_i}(\mathbf{q}, {\mathcal {X}}) < f_{{\mathcal {S}}_j}(\mathbf{q}, {\mathcal {X}})\).
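A minimal sketch of this quality measure follows; it assumes non-negative feature values so that the denominators \(\mathbf{q}_i + x_i\) are positive (the Chi-square kernel is normally applied to such data, e.g., histograms):

```python
import numpy as np

def chi2_kernel(q, x, subspace):
    num = (q[subspace] - x[subspace]) ** 2
    den = q[subspace] + x[subspace]
    return 1.0 - 2.0 * np.sum(num / den)

def subspace_quality(X, q, subspace):
    """f_S(q, X) of Eq. (5): average similarity of q to the data; lower = better."""
    return np.mean([chi2_kernel(q, x, subspace) for x in X])
```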

4.4 Implementation

All measures and the experimental setup were implemented in Java using the WEKA platform [10]. We made the required changes in the authors' Java implementation of iNNE to implement SiNNE, and we used the Java implementation of sGrid made available by its authors [28].

All experiments were conducted on a machine with an 8-core Intel i9 CPU and 16 GB of main memory, running macOS Monterey version 12.0.1.

We ran each job on a single CPU thread, using GNU parallel [25] to execute jobs in parallel. Each job was allowed to run for up to 24 h; incomplete jobs were killed and marked as ‘\(\blacklozenge\)’.

5 Empirical Evaluation

In this section, we compare SiNNE and its three contenders in four sets of experiments: (a) Experiment 1—dimensionality unbiasedness; (b) Experiment 2—performance on synthetic datasets; (c) Experiment 3—performance on real-world datasets; and (d) Experiment 4—run-time comparisons.

5.1 Experiment 1: Dimensionality Unbiasedness

We generated 19 synthetic datasets using the NumPy library [12]. Each dataset contains 1000 data points drawn from the uniform distribution \({\mathcal {U}}\)([0,1]\(^d)\), where d varies from 2 to 20. We computed the average score of all instances using SiNNE and KDE. The results are presented in Fig. 5. The flat line for SiNNE shows that it is dimensionality unbiased, whereas KDE (without Z-Score normalization) is not. Note that [27] showed that rank and Z-Score normalization make any score dimensionally unbiased; hence, we did not include them in this experiment.

Fig. 5 Dimensionality unbiasedness
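The experiment can be reproduced in spirit with a few lines that reuse build_model and sinne_score from the sketch in Sect. 3.2; the random seed and the per-dimension loop are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in range(2, 21):                       # 19 datasets, d = 2, ..., 20
    X = rng.uniform(size=(1000, d))          # 1000 points from U([0, 1]^d)
    model = build_model(X, psi=8, t=100, rng=rng)
    avg = np.mean([sinne_score(model, x) for x in X])
    print(d, round(avg, 3))                  # should stay roughly flat in d
```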

5.2 Experiment 2: Performance on Synthetic Datasets

[13] provided several synthetic datasets, which have been used in previous studies [8, 22, 27, 28]. Each of these synthetic datasets has 1000 data points, with dimensionality 10, 20, 50, 75, or 100. Each dataset has a fixed number of outliers for which the outlying subspaces are known (ground truth).

synth_10D has 19 outliers; we passed each outlier in turn as a query. Table 3 summarizes the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam for all 19 queries. In terms of exact matches, SiBeam detects the ground truth as the top outlying aspect of every query. Beam and sGBeam perform similarly, also producing 19 exact matches. RBeam is the worst performing measure, producing only five exact matches.

Table 3 Comparison of SiBeam, RBeam, Beam, and sGBeam in terms of exact matches on synth_10D. Discovered subspaces that exactly match the ground truth are bold-faced. \(\mathbf{q}\)-id represents the query point index; GT represents the ground truth; the numbers in brackets (subspaces) are attribute indices

Table 4 summarizes the mining results of SiBeam, RBeam, Beam, and sGBeam on the four remaining synthetic datasets: synth_20D, synth_50D, synth_75D and synth_100D. SiBeam finds the ground truth as the top outlying subspace for every query (ten queries from each dataset). Beam and sGBeam perform similarly, producing 39 exact matches out of 40. RBeam is the worst performing measure, producing exact matches for only 5 out of 40 queries.

Table 4 Comparison of outlying aspects discovered by SiBeam, RBeam, Beam, and sGBeam on four synthetic datasets, with the average run time over 10 queries from each dataset. Discovered subspaces that exactly match the ground truth are bold-faced. \(\mathbf{q}\)-id represents the query point index; GT represents the ground truth; the numbers in brackets (subspaces) are attribute indices

5.3 Experiment 3: Performance on Real-World Datasets

In real-world datasets, outliers and their outlying aspects are not available. Thus, we used the state-of-the-art outlier detector iForest [15] to find the top k (\(k = 5\)) outliers, which were then used as queries. We then use the \(f_{{\mathcal {S}}}\) score (cf. Eq. 5) of the top-ranked subspace to measure the quality of the discovered subspace—the lower the value, the more likely the subspace is an outlying aspect of the given query.

It is worth noting that SiBeam and sGBeam are the only methods able to finish the process for every query, while RBeam and Beam finished the process for only 10 queries.

Table 5 shows subspaces discovered by four OAM methods (i.e., SiBeam, RBeam, Beam, and sGBeam) on six real-world datasets.

Table 5 Comparison of outlying aspects discovered by SiBeam, RBeam, Beam, and sGBeam on six real-world datasets, with the average run time over five queries from each dataset. \(\mathbf{q}\)-id represents the query point index; the numbers in brackets (subspaces) are attribute indices

Table 6 shows the quality of the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam. The highest-quality subspace for each query is highlighted in bold. SiBeam is the best performer on 28 out of 30 queries according to the proposed quality measure. sGBeam discovered the highest-quality subspace for only 5 out of 30 queries. RBeam discovered the highest-quality subspace for only one query out of ten, whereas Beam was unable to detect the highest-quality subspace for even a single query.

Table 6 Comparison of SiBeam, RBeam, Beam, and sGBeam on six real-world datasets in terms of quality of discovered subspace

The average run time over five queries for each dataset is presented in Table 5. Next, we visually compare the subspaces discovered by each measure for the top query from each dataset.

Tables 7, 8, 9, 10, 11 and 12 show the subspaces discovered by SiBeam and the contending measures on wilt, pageblock, mnist, u2r, mulcross, and covertype, respectively. Visually, SiBeam detects better subspaces than its three contenders.

Table 7 Visualization of discovered subspaces by SiBeam, RBeam, Beam, and sGBeam in the wilt dataset
Table 8 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the pageblock dataset
Table 9 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the mnist dataset
Table 10 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the u2r dataset
Table 11 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the mulcross dataset
Table 12 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the covertype dataset

5.4 Experiment 4: Run-Time Comparison

Table 13 shows the average run time over 10 randomly chosen queries from each real-world dataset for SiBeam and its three contending measures. SiBeam and sGBeam were able to finish on all datasets, whereas RBeam and Beam were only able to finish on the wilt and pageblock datasets within 24 h. These results show that the proposed scoring measure enables the existing beam-search-based OAM approach to run orders of magnitude faster on large datasets. Specifically, SiBeam runs at least two and three orders of magnitude faster than RBeam and Beam, respectively, on the wilt and pageblocks datasets, and at least two orders of magnitude faster than sGBeam on large datasets (\(n>50\)K).

Table 13 Average run time (in CPU seconds) for 10 queries of SiBeam, RBeam, Beam, and sGBeam on six real-world datasets

6 Conclusion

In this paper, we have introduced an efficient and effective scoring measure, the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is dimensionally unbiased. Replacing the existing scoring measures with the proposed one yields three benefits. First, SiNNE is a dimensionally unbiased measure that does not rely on any normalization, which means it can be used directly to compare subspaces of different dimensionality. Second, SiNNE allows the existing OAM approach (i.e., Beam) to run orders of magnitude faster than the three state-of-the-art scoring measures, making it more suitable for mining huge datasets with thousands of dimensions. Third, it identifies more interesting outlying subspaces for a given query, as confirmed by its considerably better performance in our empirical evaluation. In addition, we introduced a new performance measure for outlying aspect mining. Our experimental results on real-world datasets show that SiNNE performs better than the state-of-the-art measures.