1 Introduction

Outliers (a.k.a. anomalies) are data points that show dramatically different behavior from the remaining data points in the dataset. The process of finding such data points is known as Outlier Detection (OD). In the era of big data, OD is considered one of the vital tasks of data mining, with a wide range of application domains [21], e.g., (i) fraud detection—here an outlier is a fraudulent activity such as credit card fraud [6] or insurance claim fraud [4]; (ii) medical and public health—here an outlier is an unusual health condition of a patient, arising from instrumental error or disease symptoms [14].

Recently, researchers have become interested in explaining why a data point is considered an outlier. The problem of finding such explanations is known as Outlying Aspect Mining (OAM) [8, 22, 27, 28]. OAM is the task of identifying the feature subset(s) in which a given data point is dramatically inconsistent with the rest of the data. In the literature, the problem of OAM is also referred to as outlying subspace detection [31], outlier explanation [9, 17, 18], outlier interpretation [7, 16, 29], outlying property detection [1] and outlying aspect mining [8, 22, 23, 26,27,28, 30].

In many application scenarios, one needs to find out in which set of features a given point differs from the others. For example, in a bank, a fraud analyst collects information about various aspects of credit card fraud and wants to know in which aspects a fraudulent transaction does not conform with the remainder of the data. Similarly, when evaluating job applications, a panel member wants to know a job applicant's unique features. Another interesting application of OAM is in the medical domain [20]: a doctor treating a specific patient wants to know how this patient differs from other patients. Existing OD methods cannot answer these questions.

To detect outlying aspects, OAM algorithms require a scoring measure to rank subspaces based on the outlying degree of the given query. Existing OAM algorithms such as HOSMiner [31], OAMiner [8], Density Z-Score [27] and sGrid [28] use a traditional distance- or density-based outlier score as the ranking measure. Because distance- and density-based outlier scores depend on the dimensionality of subspaces, they cannot be compared directly to rank subspaces. [27] proposed Z-Score normalization to make them comparable, which requires computing the outlier scores of all data points in each subspace. This adds significant computational overhead, making OAM algorithms infeasible to run on large and/or high-dimensional datasets. Moreover, we discover that Z-Score normalization is not appropriate for OAM in some cases.

In this paper, we focus on two issues of existing scores used in OAM: (i) dimensionality unbiasedness, and (ii) computational complexity. It is worth noting that another computational issue in OAM is dealing with the exponentially large number of subspaces. Current OAM methods perform a systematic search, which is computationally prohibitive when the number of dimensions is high. This paper does not deal with that issue; it still uses the existing systematic search approach but computes the score in each subspace efficiently.

This paper makes the following contributions:

  • Identify an issue of using Z-Score normalization of density-based outlier scores to rank subspaces, and show that it is biased towards subspaces having high density variance.

  • Propose a new simple measure called the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is useful both for detecting outliers in a dataset and for finding the outlying aspects of given outlier points.

  • Provide an objective measure to assess the quality of discovered outlying subspaces.

  • Validate the effectiveness and efficiency of SiNNE in OAM. Our empirical results show that SiNNE detects more interesting outlying aspects than the existing scores, and it allows the OAM algorithm to run orders of magnitude faster than the existing scoring measures.

The rest of the paper is organized as follows. Section 2 provides a summary of previous work on outlying aspect mining. The proposed outlier detector scoring measure is presented in Sect. 3. Experimental settings are provided in Sect. 4, and empirical evaluation results are provided in Sect. 5. Finally, conclusions are provided in Sect. 6.

2 Related Works

In this section, we first fix some notation for the rest of the paper, provide some basic definitions, and then discuss recent outlying aspect mining methods. The high-level process pipeline of OAM is shown in Fig. 1.

Fig. 1 The high-level process pipeline of OAM

2.1 Basic Notations and Definitions

Let \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n\}\) be a collection of n data points in a d-dimensional real space \(\Re ^d\). Each data point x is represented as a d-dimensional vector \(\big \langle x^{(1)}, x^{(2)}, \ldots , x^{(d)} \big \rangle\). Let \({\mathcal {F}}\) be the full feature space and \(\mathbb {S} = \{{\mathcal {S}}_1, {\mathcal {S}}_2, \ldots , {\mathcal {S}}_\eth \}\) be the set of all possible subspaces, where \(\eth = 2^d - 1\) is the number of possible subspaces. The key symbols and notations used in this paper are provided in Table 1.

Table 1 Key symbols and notations

The problem of outlier detection is to identify all points \(x_i\) that remarkably deviate from the others in the full feature space \({\mathcal {F}}\), whereas the problem of outlying aspect mining is to identify the subspaces \({\mathcal {S}}_i \in \mathbb {S}\) in which a given data point \(x_i \in {\mathcal {X}}\) is significantly different from the rest of the data. That given data point is referred to as the query \(\mathbf{q}\).

Definition 1

(Outlier) An outlier is a data instance that significantly deviates from others in the full feature set \({\mathcal {F}}\).

Definition 2

(Subspace) A subspace is a subset of the d dimensions of dataset \({\mathcal {X}}\).

Definition 3

(Query point) A query \(\mathbf{q}\) is a data point of interest for which outlying aspects are sought.

Definition 4

(Problem definition) Given a set of n instances \({\mathcal {X}}\) (\(|{\mathcal {X}}| = n\)) in d-dimensional space and a query \(\mathbf{q} \in {\mathcal {X}}\), a subspace \({\mathcal {S}}\) is called an outlying aspect of \(\mathbf{q}\) iff

  • the outlying degree of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) is higher than in any other subspace, i.e., no other subspace has the same or a higher outlying degree.

2.2 Outlying Aspect Mining

To the best of our knowledge, [31] is the earliest work that defines the problem of OAM. The authors introduced a framework called HOS-Miner (High-dimensional Outlying Subspace Miner) to detect outlying subspaces. It uses a distance-based measure called the Outlying Degree (OutD in short). The OutD of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\) is computed as:

$$\begin{aligned} OutD_{\mathcal {S}}(\mathbf{q}) = \sum \limits _{x\in \aleph _{\mathcal {S}}^k(\mathbf{q})} d_{\mathcal {S}}(\mathbf{q}, x) \end{aligned}$$

where \(\aleph _{\mathcal {S}}^k(\mathbf{q})\) is the set of k-nearest neighbors of \(\mathbf{q}\) in subspace \({\mathcal {S}}\), and \(d_{\mathcal {S}}(a, b)\) is the Euclidean distance between a and b in subspace \({\mathcal {S}}\), computed as \(d_{\mathcal {S}}(a,b) = \sqrt{\sum _{i \in {\mathcal {S}}} (a_i - b_i)^2 }\).
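To make the computation concrete, a minimal Python sketch of OutD follows; the function name and the brute-force nearest-neighbor search are our own choices for illustration:

```python
# A minimal sketch of HOS-Miner's OutD score (equation above); brute-force
# k-NN search is used for clarity, not efficiency.
import numpy as np

def outd(X, q, subspace, k=5):
    """Sum of distances from query q to its k nearest neighbors in `subspace`.

    X: (n, d) data array; q: (d,) query; subspace: list of attribute indices.
    """
    Xs, qs = X[:, subspace], q[subspace]
    dists = np.sort(np.sqrt(((Xs - qs) ** 2).sum(axis=1)))
    # Skip the zero distance to q itself when q is part of X.
    return dists[1:k + 1].sum() if dists[0] == 0 else dists[:k].sum()
```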

In 2015, [8] introduced the Outlying Aspect Miner (OAMiner in short). Instead of distance, the authors employed a kernel density estimation (KDE) [24] based scoring measure to compute the outlyingness of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\):

$$\begin{aligned} \tilde{f}_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{n(2 \pi )^{\frac{m}{2}} \prod_{i \in {\mathcal {S}}} h_{i}} \sum \limits _{{x} \in {\mathcal {X}}} e^ {- \sum_{i \in {\mathcal {S}}} \frac{(\mathbf{q} _i - x_i)^2}{2 h^2_{i}}} \end{aligned}$$

where \(\tilde{f}_{{\mathcal {S}}}(\mathbf{q})\) is the kernel density estimate of \(\mathbf{q}\) in subspace \({\mathcal {S}}\), m is the dimensionality of subspace \({\mathcal {S}}\) (\(m = |{\mathcal {S}}|\)), and \(h_{i}\) is the kernel bandwidth in dimension i.
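The following is a minimal sketch of this subspace KDE; how the per-dimension bandwidths \(h_i\) are chosen (e.g., by a standard bandwidth selection rule) is an assumption on our part:

```python
# A sketch of OAMiner's subspace Gaussian KDE (equation above).
import numpy as np

def kde_density(X, q, subspace, h):
    """Density of q in `subspace`; h maps dimension index -> bandwidth h_i."""
    n, m = len(X), len(subspace)
    norm = n * (2 * np.pi) ** (m / 2) * np.prod([h[i] for i in subspace])
    expo = sum((q[i] - X[:, i]) ** 2 / (2 * h[i] ** 2) for i in subspace)
    return np.exp(-expo).sum() / norm
```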

[8] stated that \(\tilde{f}_{{\mathcal {S}}}\) is biased towards high-dimensional subspaces—density tends to decrease as dimensionality increases. Thus, to remove the effect of this dimensionality bias, they proposed to use the density rank of the query as its outlyingness measure.

[27] proposed two outlying scoring metrics: (i) the density Z-Score and (ii) the iPath score (short for isolation path).

Therein, the density Z-Score is defined as follows:

$$\begin{aligned} \hbox {Z-Score} (\tilde{f}_{{\mathcal {S}}}(\mathbf{q})) \triangleq \frac{\tilde{f}_{{\mathcal {S}}}(\mathbf{q}) -\mu _{\tilde{f}_{{\mathcal {S}}}}}{\sigma _{\tilde{f}_{{\mathcal {S}}}}} \end{aligned}$$

where \(\mu _{\tilde{f}_{{\mathcal {S}}}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}}}\) are the mean and standard deviation of the density of all data instances in subspace \({\mathcal {S}}\), respectively.
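A sketch of this normalization follows, reusing kde_density from the previous sketch; note that it must evaluate the density of every data point in the subspace, which is exactly the computational overhead discussed in Sect. 1:

```python
import numpy as np

def z_score(X, q, subspace, h):
    # Densities of all n points are needed to get the subspace mean and std.
    dens = np.array([kde_density(X, x, subspace, h) for x in X])
    return (kde_density(X, q, subspace, h) - dens.mean()) / dens.std()
```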

The iPath score is motivated by the Isolation Forest (iForest) anomaly detection approach [15]. The iPath score of query \(\mathbf{q}\) in subspace \({\mathcal {S}}\) w.r.t. sub-samples of size \(\psi\) of the data is computed as:

$$\begin{aligned} iPath_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{t} \sum \limits _{i=1}^t l_{{\mathcal {S}}}^i(\mathbf{q}) \end{aligned}$$

where t is the number of trees and \(l_{{\mathcal {S}}}^i(\mathbf{q})\) is the path length of \(\mathbf{q}\) in the \(i\)th tree built in subspace \({\mathcal {S}}\).

[27] were the first to coin the term dimensionality unbiasedness, i.e., “A dimensionality unbiased outlyingness measure (OM) is a measure of which the baseline value, i.e., average value for any data sample \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n \}\) drawn from a uniform distribution, is a quantity independent of the dimension of the subspace \({\mathcal {S}}\).”

[28] introduced a simple grid-based density estimator called sGrid, a smoothed variant of the grid-based density estimator of [24]. Let \({\mathcal {X}}\) be a collection of n data objects in d-dimensional space and \(x.{\mathcal {S}}\) be the projection of a data object \(x \in {\mathcal {X}}\) onto subspace \({\mathcal {S}}\). The sGrid density of a point \(\mathbf{q}\) is computed as the number of points falling into the bin that covers \(\mathbf{q}\) and its surrounding neighbors. By replacing the kernel density estimator with sGrid, the authors showed that it has advantages over KDE in outlying aspect mining.

In recent work, [30] proposed a reconstruction-based method using completely random trees (RecForest in short). Reconstruction is done using the intersection of the bounding boxes in the completely random forest for each data point. The outlying score OS of each feature \(i = 1, 2, \ldots , d\) for query \(\mathbf{q}\) is defined as:

$$\begin{aligned} OS_i = \frac{\exp \big ((\mathbf{q}_i - \mathbf{q}^{rec}_i)^2\big )}{\sum _{j=1}^d \exp \big ((\mathbf{q}_j - \mathbf{q}^{rec}_j)^2\big )} \end{aligned}$$

where \(\mathbf{q}^{rec}\) is a reconstructed sample of \(\mathbf{q}\).
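Under this reading, OS is a softmax over the squared per-feature reconstruction errors. A small sketch follows; producing \(\mathbf{q}^{rec}\) from the completely random trees is out of scope here, so it is taken as an input:

```python
# Sketch of RecForest's per-feature outlying score (equation above).
import numpy as np

def recforest_os(q, q_rec):
    err = (q - q_rec) ** 2          # squared reconstruction error per feature
    e = np.exp(err)                 # may overflow for very large errors
    return e / e.sum()              # OS_i for each feature i
```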

[29] proposed an Attention-guided Triplet deviation network for Outlier interpretatioN (ATON). Instead of searching subspaces, ATON learns an embedding space and how much each dimension contributes to the outlyingness of the query.

3 The Framework

We first outline the motivation for our method, followed by the details of SiNNE. Figure 2 presents the flowchart of the complete framework.

Fig. 2 The flowchart of the complete framework

3.1 Issue of Using Z-Score

Because Z-Score normalization uses the mean and standard deviation of the density values of all data instances in a subspace (\(\mu _{\tilde{f}_{{{\mathcal {S}}}_i}}\) and \(\sigma _{\tilde{f}_{{{\mathcal {S}}}_i}}\)), it can be biased towards subspaces having high variation of density values (i.e., high \(\sigma _{\tilde{f}_{{{\mathcal {S}}}_i}}\)).

Let us demonstrate this with a simple example. Let \({{\mathcal {S}}}_i\) and \({{\mathcal {S}}}_j\) (\(i \ne j\)) be two subspaces of the same dimensionality (i.e., \(|{\mathcal {S}}_i| = |{\mathcal {S}}_j|\)). Intuitively, because they have the same dimensionality, they can be ranked directly based on the raw (unnormalized) density values of a query \(\mathbf{q}\). However, because \(Z(\tilde{f}_{{\mathcal {S}}}(\mathbf{q}))\) depends on \(\mu _{\tilde{f}_{{\mathcal {S}}}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}}}\), we can have \(Z(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q})) < Z(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q}))\) even though \(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q}) = \tilde{f}_{{\mathcal {S}}_j}(\mathbf{q})\): a subspace whose density values are widely spread (high \(\mu _{\tilde{f}_{{\mathcal {S}}_i}}\) and \(\sigma _{\tilde{f}_{{\mathcal {S}}_i}}\)) produces Z-Scores of large magnitude, so \({\mathcal {S}}_i\) can be ranked higher than \({\mathcal {S}}_j\) by the density Z-Score merely because of its density distribution, not because \(\mathbf{q}\) is more outlying there.

To show this effect on a real-world dataset, consider the pendigits dataset (\(n=9868\), \(d=16\)). Figure 3 shows the distribution of the data in two three-dimensional subspaces, \({\mathcal {S}}_i=\{7, 8, 13\}\) and \({\mathcal {S}}_j=\{2, 10, 13\}\). Visually, the query \(\mathbf{q}\) (the red square) appears to be more outlying in \({\mathcal {S}}_j\) than in \({\mathcal {S}}_i\). This is consistent with its raw density values in the two subspaces, \(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q})=1.20 < \tilde{f}_{{\mathcal {S}}_i}(\mathbf{q})=21.30\). However, the ranking is reversed after Z-Score normalization (\(Z(\tilde{f}_{{\mathcal {S}}_j}(\mathbf{q}))=-1.25 > Z(\tilde{f}_{{\mathcal {S}}_i}(\mathbf{q}))=-2.10\)), owing to the higher standard deviation \(\sigma _{\tilde{f}_{{\mathcal {S}}_i}}=57.3 > \sigma _{\tilde{f}_{{\mathcal {S}}_j}}=34.2\).

Fig. 3 Data distribution in two three-dimensional subspaces of the pendigits dataset. a \(\tilde{f}_{{{\mathcal {S}}}_i}(\mathbf{q})=21.30\), \(Z(\tilde{f}_{{{\mathcal {S}}}_i}(\mathbf{q}))=-2.10\); b \(\tilde{f}_{{{\mathcal {S}}}_j}(\mathbf{q})=1.20\), \(Z(\tilde{f}_{{{\mathcal {S}}}_j}(\mathbf{q}))=-1.25\)
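The numbers above can be sanity-checked with simple arithmetic. The means \(\mu _{\tilde{f}_{{\mathcal {S}}_i}}\) and \(\mu _{\tilde{f}_{{\mathcal {S}}_j}}\) are not reported, so the values below are our back-calculated assumptions, chosen to be consistent with the reported densities, Z-Scores and standard deviations:

```python
# Back-calculated check of the pendigits example. Densities, Z-Scores and
# standard deviations are from the text; the means mu_i, mu_j are our own
# assumptions, reverse-engineered so that all reported numbers are consistent.
f_i, sigma_i, mu_i = 21.30, 57.3, 141.6   # subspace S_i = {7, 8, 13}
f_j, sigma_j, mu_j = 1.20, 34.2, 44.0     # subspace S_j = {2, 10, 13}

z_i = (f_i - mu_i) / sigma_i              # approx. -2.10
z_j = (f_j - mu_j) / sigma_j              # approx. -1.25
# Raw densities say S_j is more outlying (f_j < f_i), yet the Z-Scores say
# S_i is (z_i < z_j): the wide density distribution of S_i flips the ranking.
print(round(z_i, 2), round(z_j, 2))
```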

Apart from this issue, existing OAM scoring measures have two limitations:

  • they are dimensionally biased in their raw form and hence require normalization; and

  • they are expensive to compute in each subspace.

Motivated by these limitations of density-based scores in OAM, we introduce a new measure that is dimensionally unbiased in its raw form and can be computed efficiently.

3.2 Outlierness Computation

We now introduce a new scoring measure called the Simple Isolation score using Nearest-Neighbor Ensembles (SiNNE in short). This scoring function is inspired by isolation-based anomaly detection using nearest-neighbor ensembles [2, 3].

The proposed scoring function has two major steps:

  • Building hyperspheres: hyperspheres are built in each subspace using nearest neighbors.

  • Scoring query: The current model is used to score the query.

3.2.1 Build Model

Let \({\mathcal {X}} = \{x_1, x_2, \ldots , x_n\}\), \(x_i \in \Re ^d\), be a dataset, where n is the number of data points and d is the number of dimensions. In each subspace, we randomly choose \(\psi\) data samples from \({\mathcal {X}}\), t times.

Our proposed scoring function follows the same procedure as iNNE [2] to build an ensemble of hyperspheres. In the context of OAM, however, the difference is that we create the ensembles in subspaces instead of the full feature space.

SiNNE creates an ensemble of hyperspheres: t sets of hyperspheres, where each set consists of \(\psi\) hyperspheres.

Definition 5

(Hypersphere) Given a data subset \({\mathcal {D}}^{(\psi )}_i\), a hypersphere \(\mathsf {H}(c)\) centered at c with radius \(\tau (c) = ||c - \eta _c||\) is defined as {\(x: ||x - c|| \le \tau (c)\)}, where \(x \in \Re ^d\) and \(c, \eta _c \in {\mathcal {D}}^{(\psi )}_i\); \(\eta _c\) is the nearest neighbor of c in \({\mathcal {D}}^{(\psi )}_i\).

Definition 6

Given sub-samples of size \(\psi\), an ensemble \({\mathcal {H}}\) contains t sets, each consisting of \(\psi\) hyperspheres. \({\mathcal {H}}\) is defined as:

$$\begin{aligned} {\mathcal {H}} = \{\{\mathsf {H}(c): c \in {\mathcal {D}}^{(\psi )}_i\}: i = 1, 2, \ldots , t \} \end{aligned}$$

Note that the training processes of SiNNE and iNNE are the same; however, they differ in the computation of the outlier score (cf. Sect. 3.5 for further differences).

Definition 7

(Simple isolation score) The simple isolation score of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) based on sub-sample \({\mathcal {D}}\) is defined as:

$$\begin{aligned} \text{ SI}_{{\mathcal {S}}}(\mathbf{q}) = {\mathbb{I}}[\mathbf{q} \in \bigcup \limits _{c \in {\mathcal {D}}} \mathsf {H}(c)] \end{aligned}$$
(1)

where \({\mathbb{I}}[B]\) denotes the indicator function, which outputs 0 if B is true and 1 otherwise.

SI takes the value 0 or 1. When \(\mathbf{q}\) is covered by any of the hyperspheres, it is assigned 0; if it is not covered by any hypersphere, SiNNE assumes the point is far away from the data and assigns 1.

Definition 8

The outlier score of \(\mathbf{q}\) in subspace \({\mathcal {S}}\) based on SiNNE is defined as the average of the simple isolation scores over the t sets:

$$\begin{aligned} \text{ SiNNE}_{{\mathcal {S}}}(\mathbf{q}) = \frac{1}{t} \sum \limits _{i=1}^{t} \text{ SI}_{{\mathcal {S}}}^i(\mathbf{q}) \end{aligned}$$
(2)

As SI takes only the values 0 or 1, \(\text{SiNNE}(\mathbf{q})\) lies in the range [0, 1].

The volume covered by each hypersphere decreases as the dimensionality of the space increases, and so does the actual data space covered by normal instances. Therefore, SiNNE is independent of the dimensionality of the space in its raw form, without any normalization, making it ideal for OAM. It adapts to the local data density because the sizes of the hyperspheres depend on the local density. It can also be computed much faster than a k-NN distance or density. Moreover, it does not require computing the outlier scores of all n instances in each subspace (which existing scores require for Z-Score normalization), giving it a significant run time advantage.

The procedures for building an ensemble of models and for using them to compute the outlyingness of a given query in subspace \({\mathcal {S}}\) are provided in Algorithms 1 and 2.

Algorithm 1 (building the ensemble of hyperspheres)
Algorithm 2 (scoring the query)
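For readers who prefer code, the following is a minimal Python sketch of our reading of Algorithms 1 and 2 (Definitions 5–8). Function and variable names are ours, the parameter defaults follow our experimental setup (\(\psi = 8\), \(t = 100\)), and brute-force distance computations are used for clarity rather than efficiency:

```python
import numpy as np

def build_model(Xs, psi=8, t=100, rng=None):
    """Algorithm 1 sketch: t sets of psi hyperspheres on subspace-projected Xs."""
    rng = rng or np.random.default_rng()
    model = []
    for _ in range(t):
        D = Xs[rng.choice(len(Xs), size=psi, replace=False)]  # sub-sample
        # Radius of each hypersphere = distance from its centre c to c's
        # nearest neighbour within the sub-sample (Definition 5).
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        model.append((D, dist.min(axis=1)))                   # (centres, radii)
    return model

def sinne_score(model, q):
    """Algorithm 2 sketch: average simple isolation score over the t sets."""
    hits = 0
    for centres, radii in model:
        covered = (np.linalg.norm(centres - q, axis=1) <= radii).any()
        hits += 0 if covered else 1   # SI = 0 if q is covered, else 1 (Eq. (1))
    return hits / len(model)          # Eq. (2), a score in [0, 1]
```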

Time complexity The time complexity of creating the SiNNE model is \(O(t \psi ^2)\). In the scoring stage, for a query data point, we need to check whether it falls in any hypersphere, which takes \(O(t\psi )\). The total time complexity of SiNNE is therefore \(O(t\psi ^2 + t\psi )\).

3.3 Subspace Search

Apart from a scoring measure, the OAM framework requires a subspace search method. In this work, we use the Beam search method [27], because it is the most recent search method used in the literature. We replicate the procedure of beam search in Algorithm 3 for ease of reference. The overall time complexity of beam search is \(O(d^2 + W \cdot d \cdot \ell )\), where W is the beam width and \(\ell\) is the maximum dimensionality of subspaces.

Algorithm 3 (beam search)
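The sketch below gives a simplified version of the beam search loop as we understand it from [27]; exact tie-breaking and level handling may differ from Algorithm 3. The defaults follow our setup (\(W = 100\), \(\ell = 3\)), and score(S) is any outlyingness measure where higher means more outlying (e.g., SiNNE; a raw density would be negated):

```python
import itertools

def beam_search(q, score, d, W=100, ell=3):
    """Return the top-ranked subspace (a tuple of attribute indices) for q."""
    pairs = [tuple(p) for p in itertools.combinations(range(d), 2)]
    cands = [(i,) for i in range(d)] + pairs     # levels 1 and 2: exhaustive
    best = max(cands, key=score)
    beam = sorted(pairs, key=score, reverse=True)[:W]
    for _ in range(3, ell + 1):
        # Extend every subspace kept in the beam by one unused attribute.
        nxt = {tuple(sorted(S + (i,))) for S in beam
               for i in range(d) if i not in S}
        if not nxt:
            break
        best = max([best, *nxt], key=score)
        beam = sorted(nxt, key=score, reverse=True)[:W]
    return best
```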

3.4 An Example of Proposed Method

In this section, we present an illustrative example of the proposed method. Figure 4a shows 8 randomly selected sub-samples (highlighted in black) from a dataset with \(n =50\) in a 2-d subspace. Figure 4b shows how a hypersphere \(\mathsf {H}(c)\) is built, centered at c with radius \(\tau (c)\). Figure 4c shows all 8 hyperspheres created from the 8 sub-samples, which are used to compute the outlying degree of a data point. As shown in Fig. 4d, to compute the outlying degree of a point x, we determine whether any hypersphere covers x. Here, \(\text{ SI }(x) = 0\) because x falls in a hypersphere, while data point y does not fall in any hypersphere, so its outlying degree is 1.

Fig. 4 a Randomly selected sub-samples \({\mathcal {D}}\) of size \(\psi = 8\); b building the hypersphere for data point c; c the set of hyperspheres from \({\mathcal {D}}\); d simple isolation scores for data points x and y using the isolation model

3.5 Key Differences with Closely Related Work

In this subsection, we discuss the differences between SiNNE and iNNE.

Although they share the same training process, SiNNE and iNNE employ different scoring mechanisms. Specifically, iNNE employs a local isolation-based score, computed as follows:

$$\begin{aligned} I_i(\mathbf{q}) = \left\{ \begin{array}{ll} 1 - \frac{\tau (\eta _{cnn(\mathbf{q})})}{\tau (cnn(\mathbf{q}))}, & {\text{ if }} \; \mathbf{q} \in \bigcup _{c\in {\mathcal {D}}_i} {\mathcal {H}}(c) \\ 1, & {\text{ otherwise }} \end{array}\right. \end{aligned}$$
(3)

where \(cnn(\mathbf{q}) = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{c \in {\mathcal {D}}} \{ \tau (c) : \mathbf{q} \in {\mathcal {H}}(c)\}\), \({\mathcal {D}}\) is a set of sub-samples randomly selected without replacement (\(|{\mathcal {D}}| = \psi\)), and \({\mathcal {H}}(c)\) is the hypersphere centered at c with radius \(\tau (c) = d_{\mathcal {S}}(c, \eta _c)\), where \(\eta _c\) is the nearest neighbor of c.
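To make the contrast concrete, here is a sketch of iNNE's per-set ratio score in the style of the SiNNE sketch in Sect. 3.2; the nn_radii argument (the radius \(\tau (\eta _c)\) of the hypersphere of each centre's nearest neighbor) is assumed to be precomputed during model building:

```python
import numpy as np

def inne_set_score(centres, radii, nn_radii, q):
    """One term of iNNE's score (Eq. (3)); the final score averages over t sets."""
    dist = np.linalg.norm(centres - q, axis=1)
    covering = np.where(dist <= radii)[0]        # hyperspheres covering q
    if covering.size == 0:
        return 1.0                               # not covered: fully isolated
    cnn = covering[np.argmin(radii[covering])]   # smallest covering hypersphere
    return 1.0 - nn_radii[cnn] / radii[cnn]      # local ratio score
```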

In contrast, SiNNE uses the new simple isolation-based score (cf. Eq. (1)), which assigns 0 if the point falls in any hypersphere and 1 otherwise.

Apart from this, iNNE creates its model in the full feature space, since its sole purpose is to detect outliers in the full feature space \({\mathcal {F}}\), while the purpose of SiNNE is to detect outlying subspaces for a given data point; thus it creates a model in each subspace. Although iNNE [2] was previously used as an outlier detector, its use in the OAM context is new.

Theorem 1

The isolation score using iNNE with sub-sample size \(\psi = 2\) is equivalent to SiNNE.

Proof

Given an iNNE model \({\mathcal {H}}\) with sub-sample size \(\psi = 2\), each set contains two hyperspheres with the same radius (cf. Definition 5). Thus, \(\tau (\eta _{cnn(\mathbf{q})}) = \tau (cnn(\mathbf{q}))\). For \(\psi = 2\), the isolation score becomes:

$$\begin{aligned} I_i(\mathbf{q}) = \left\{ \begin{array}{ll} 0, &\quad {\text{ if }} \; \mathbf{q} \in \bigcup _{c\in {\mathcal {D}}_i} {\mathcal {H}}(c), \\ 1, &\quad {\text{ otherwise }} \end{array}\right. \end{aligned}$$
(4)

which is the same as Eq. (1). \(\square\)

In terms of performance, SiNNE detects the ground truth for every query, while iNNE detects the ground truth for only 11 out of 15 queries (details are presented in the “Appendix”). In addition, SiNNE is faster than iNNE because it does not need to find the smallest covering hypersphere and its neighboring hypersphere to compute the score.

4 Experimental Setting

4.1 Datasets

In this study, we used two types of datasets: synthetic and real-world. For the synthetic datasets, we adopted five datasets from [13]: synth_10D, synth_20D, synth_50D, synth_75D, and synth_100D.

For the real-world datasets, we adopted six datasets from [5]: wilt, pageblocks, mnist, u2r, mulcross and covertype.

The characteristics of datasets in terms of data size and the dimensionality of the original input space are provided in Table 2.

Table 2 Dataset statistics

4.2 Contenders and Parameters

We compare SiNNE (SiBeam) with three contenders: (a) kernel density rank (RBeam), (b) Z-Score normalized kernel density (Beam), and (c) Z-Score normalized sGrid density (sGBeam).

We used the default parameters suggested in the respective papers unless specified otherwise. For SiBeam, we set \(\psi = 8\) and \(t = 100\). Beam and RBeam employ KDE (kernel density estimation) using the Gaussian kernel with the default bandwidth; to calculate the Gaussian kernel, we use Euclidean distance. The block size w for the bit set operation in sGBeam was set to 64, as suggested by the authors [28]. The beam width (W) and maximum dimensionality of subspace (\(\ell\)) in the Beam search procedure were set to 100 and 3, respectively, as done in [27].

4.3 Evaluation Metric

As far as we know, there is no publicly available real-world dataset that provides ground truth to verify the quality of discovered subspaces. Therefore, in the absence of a better evaluation measure, we propose to use kernel mean embedding [19] to evaluate the quality of discovered subspaces. The intuition is that, in the most outlying aspect, the query is far away from the distribution of the data, i.e., it has the minimum average similarity with the rest of the data. The quality of a discovered subspace \({\mathcal {S}}\) for a query \(\mathbf{q}\) using the kernel mean embedding method [19] is computed as follows:

$$\begin{aligned} f_{\mathcal {S}}(\mathbf{q}, {\mathcal {X}}) = \frac{1}{n} \sum \limits _{x \in {\mathcal {X}}} K_{\mathcal {S}}(\mathbf{q}, x) \end{aligned}$$
(5)

where \(K_{\mathcal {S}}(\mathbf{q}, x)\) is a kernel similarity of \(\mathbf{q}\) and x in subspace \({\mathcal {S}}\).

We use the Chi-square kernel [32] because it is parameter-free and widely used in the computer vision research community. The Chi-square kernel \(K_{\mathcal {S}}(\mathbf{q}, x)\) is computed as follows:

$$\begin{aligned} K_{\mathcal {S}}(\mathbf{q}, x) = 1 - \sum \limits _{i \in {\mathcal {S}}} 2\frac{(\mathbf{q}_i - x_i)^2 }{ (\mathbf{q}_i + x_i)} \end{aligned}$$

In OAM, \(\mathbf{q}\) is considered more outlying in \({\mathcal {S}}_i\) than in \({\mathcal {S}}_j\) if \(f_{{\mathcal {S}}_i}(\mathbf{q}, {\mathcal {X}}) < f_{{\mathcal {S}}_j}(\mathbf{q}, {\mathcal {X}})\).
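A minimal sketch of this quality measure follows; it assumes non-negative feature values so that the denominators \(\mathbf{q}_i + x_i\) are positive (the Chi-square kernel is normally applied to such data, e.g., histograms):

```python
import numpy as np

def chi2_kernel(q, x, subspace):
    num = (q[subspace] - x[subspace]) ** 2
    den = q[subspace] + x[subspace]
    return 1.0 - 2.0 * np.sum(num / den)

def subspace_quality(X, q, subspace):
    """f_S(q, X) of Eq. (5): average similarity of q to the data; lower = better."""
    return np.mean([chi2_kernel(q, x, subspace) for x in X])
```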

4.4 Implementation

All measures and the experimental setup were implemented in Java using the WEKA platform [10]. We made the required changes in the authors' Java implementation of iNNE to implement SiNNE, and we used the Java implementation of sGrid made available by its authors [28].

All experiments were conducted on a machine with an 8-core Intel i9 CPU and 16 GB of main memory, running macOS Monterey version 12.0.1.

We ran each job on a single CPU thread, using GNU parallel [25] to execute jobs in parallel. Each job was allowed to run for up to 24 h; incomplete jobs were killed and marked as ‘\(\blacklozenge\)’.

5 Empirical Evaluation

In this section, we compare SiNNE and its three contenders in four sets of experiments: (a) Experiment 1—dimensionality unbiasedness; (b) Experiment 2—performance on synthetic datasets; (c) Experiment 3—performance on real-world datasets; and (d) Experiment 4—run-time comparisons.

5.1 Experiment 1: Dimensionality Unbiasedness

We generated 19 synthetic datasets using the NumPy library [12]. Each dataset contains 1000 data points drawn from the uniform distribution \({\mathcal {U}}\)([0,1]\(^d)\), where d varies from 2 to 20. We computed the average score of all instances using SiNNE and KDE. The results are presented in Fig. 5. The flat line for SiNNE shows that it is dimensionality unbiased, whereas KDE (without Z-Score normalization) is not. Note that [27] showed that rank and Z-Score normalization make any score dimensionally unbiased; hence, we did not include them in this experiment.

Fig. 5 Dimensionality unbiasedness
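The experiment can be reproduced in spirit with a few lines that reuse build_model and sinne_score from the sketch in Sect. 3.2; the random seed and the per-dimension loop are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in range(2, 21):                       # 19 datasets, d = 2, ..., 20
    X = rng.uniform(size=(1000, d))          # 1000 points from U([0, 1]^d)
    model = build_model(X, psi=8, t=100, rng=rng)
    avg = np.mean([sinne_score(model, x) for x in X])
    print(d, round(avg, 3))                  # should stay roughly flat in d
```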

5.2 Experiment 2: Performance on Synthetic Datasets

[13] provided several synthetic datasets, which have been used in previous studies [8, 22, 27, 28]. Each of these synthetic datasets has 1000 data points, with dimensionality 10, 20, 50, 75, or 100. Each dataset has a fixed number of outliers for which the outlying subspaces are known (ground truth).

synth_10D has 19 outliers; we passed each outlier in turn as a query. Table 3 summarizes the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam for all 19 queries. In terms of exact matches, SiBeam detects the ground truth as the top outlying aspect of every query. Beam and sGBeam perform similarly, also producing 19 exact matches. RBeam is the worst performing measure, producing only five exact matches.

Table 3 Comparison of SiBeam, RBeam, Beam, and sGBeam in terms of exact matches on synth_10D. Discovered subspaces that exactly match the ground truth are bold-faced. \(\mathbf{q}\)-id represents the query point index; GT represents the ground truth; the numbers in brackets (subspaces) are attribute indices

Table 4 summarizes the mining results of SiBeam, RBeam, Beam, and sGBeam on the four remaining synthetic datasets: synth_20D, synth_50D, synth_75D and synth_100D. SiBeam finds the ground truth as the top outlying subspace for every query (ten queries from each dataset). Beam and sGBeam perform similarly, producing 39 exact matches out of 40. RBeam is the worst performing measure, producing exact matches for only 5 out of 40 queries.

Table 4 Comparison of outlying aspects discovered by SiBeam, RBeam, Beam, and sGBeam on four synthetic datasets, with the average run time over 10 queries from each dataset. Discovered subspaces that exactly match the ground truth are bold-faced. \(\mathbf{q}\)-id represents the query point index; GT represents the ground truth; the numbers in brackets (subspaces) are attribute indices

5.3 Experiment 3: Performance on Real-World Datasets

In real-world datasets, outliers and their outlying aspects are not available. Thus, we used the state-of-the-art outlier detector iForest [15] to find the top k (\(k = 5\)) outliers, which were then used as queries. We then use the \(f_{{\mathcal {S}}}\) score (cf. Eq. 5) of the top-ranked subspace to measure the quality of the discovered subspace—the lower the value, the more likely the subspace is an outlying aspect of the given query.

It is worth noting that SiBeam and sGBeam are the only methods able to finish the process for every query, while RBeam and Beam finished the process for only 10 queries.

Table 5 shows subspaces discovered by four OAM methods (i.e., SiBeam, RBeam, Beam, and sGBeam) on six real-world datasets.

Table 5 Comparison of outlying aspects discovered by SiBeam, RBeam, Beam, and sGBeam on six real-world datasets, with the average run time over five queries from each dataset. \(\mathbf{q}\)-id represents the query point index; the numbers in brackets (subspaces) are attribute indices

Table 6 shows the quality of the subspaces discovered by SiBeam, RBeam, Beam, and sGBeam. The highest-quality subspace for each query is highlighted in bold. SiBeam is the best performer on 28 out of 30 queries according to the proposed quality measure. sGBeam discovered the highest-quality subspace for only 5 out of 30 queries. RBeam discovered the highest-quality subspace for only one query out of ten, whereas Beam was unable to detect the highest-quality subspace for even a single query.

Table 6 Comparison of SiBeam, RBeam, Beam, and sGBeam on six real-world datasets in terms of quality of discovered subspace

The average run time over five queries for each dataset is presented in Table 5. Next, we visually compare the subspaces discovered by each measure for the top query from each dataset.

Tables 7, 8, 9, 10, 11 and 12 show the subspaces discovered by SiBeam and the contending measures on wilt, pageblock, mnist, u2r, mulcross, and covertype, respectively. Visually, SiBeam detects better subspaces than its three contenders.

Table 7 Visualization of discovered subspaces by SiBeam, RBeam, Beam, and sGBeam in the wilt dataset
Table 8 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the pageblock dataset
Table 9 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the mnist dataset
Table 10 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the u2r dataset
Table 11 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the mulcross dataset
Table 12 Visualization of discovered subspaces by SiBeam, RBeam, Beam and sGBeam in the covertype dataset

5.4 Experiment 4: Run-Time Comparison

Table 13 shows the average run time over 10 randomly chosen queries from each real-world dataset for SiBeam and its three contending measures. SiBeam and sGBeam were able to finish on all datasets, whereas RBeam and Beam were only able to finish on the wilt and pageblock datasets within 24 h. These results show that the proposed scoring measure enables the existing beam-search-based OAM approach to run orders of magnitude faster on large datasets. Specifically, SiBeam runs at least two and three orders of magnitude faster than RBeam and Beam, respectively, on the wilt and pageblocks datasets, and at least two orders of magnitude faster than sGBeam on large datasets (\(n>50\)K).

Table 13 Average run time (in CPU seconds) for 10 queries of SiBeam, RBeam, Beam, and sGBeam on six real-world datasets

6 Conclusion

In this paper, we have introduced an efficient and effective scoring measure, the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which is dimensionally unbiased. Replacing the existing scoring measures with the proposed one yields three benefits. First, SiNNE is a dimensionally unbiased measure that does not rely on any normalization, which means it can be used directly to compare subspaces of different dimensionality. Second, SiNNE allows the existing OAM approach (i.e., Beam) to run orders of magnitude faster than the three state-of-the-art scoring measures, making it more suitable for mining huge datasets with thousands of dimensions. Third, it identifies more interesting outlying subspaces for a given query, as confirmed by its considerably better performance in our empirical evaluation. In addition, we introduced a new performance measure for outlying aspect mining. Our experimental results on real-world datasets show that SiNNE performs better than the state-of-the-art measures.