Outlier mining in high-dimensional data using the Jensen–Shannon divergence and graph structure analysis

Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets in which it is possible to define a meaningful distance between elements of the dataset. The methodology is based on defining a fully connected, undirected graph, where the nodes are the elements of the dataset and the links have weights that are the distances between the nodes. Outlier scores are defined by analyzing the structure of the graph, in particular, by using the Jensen–Shannon (JS) divergence to compare the distributions of weights of different nodes. We demonstrate the method using a publicly available database of credit-card transactions, where some of the transactions are labeled as frauds. We compare with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance but increases the computational cost.


Introduction
The terms anomaly and outlier generally refer to observations of a given process that seem to be generated by a mechanism different from the one that governs the process. While outlier and anomaly are often used as synonyms, they have also been distinguished in the following terms: an outlier is a legitimate data point that is far from the center of a distribution that characterizes a process, and an anomaly is a data point that cannot be explained given current knowledge of the process generating the data. Anomalies have been classified in three types: point anomalies (a data point that is anomalous with respect to the rest of the data), contextual anomalies (a data point that is anomalous in a specific context) and collective anomalies (a set of data points that are not anomalies by themselves, but whose collective occurrence is anomalous) [1].
The identification of these observations is important for different purposes (failure detection, novelty detection, intruder detection, etc.) and, in particular, in the context of artificial intelligence systems, to clean the data used for training the algorithms.
Many methods for outlier detection have been proposed in the literature (see, e.g. [2][3][4][5][6] and references therein), some of them based on distances that can be computed between elements of the dataset [7][8][9][10][11]. In outlier detection via graph methods, distance-based outlier mining relies on a fully connected graph in which the nodes represent the elements of the dataset and the connections between them are quantified by a distance measure. With N elements in the dataset, each described by an M-dimensional vector of features $x_i$ ($i = 1, \dots, N$), using an appropriate distance to quantify differences between the feature vectors yields an N × N distance matrix. The vector of distances of node i, $d_i = \{d_{i1}, d_{i2}, \dots, d_{iN}\}$, can be considered a new set of features that can be used for training a binary classification algorithm (outlier/normal element). In large datasets the new feature vector is very long and this approach becomes computationally demanding. An alternative approach is to use a dimensionality-reduction strategy and extract informative features from the vectors of distances. In this work we show that the average distance and the shape of the distance distribution can be used for outlier mining.
A popular measure for comparing different distributions is the Jensen–Shannon (JS) divergence [12]. We define new outlier scores (OSs) using the JS divergence computed from the distributions of Euclidean distances between nodes. The method does not have any free parameter and thus does not require training. We demonstrate the method using a publicly available database of credit-card transactions where some of the transactions are labeled as frauds [13,14]. We quantify the performance with two well-known measures, the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR) [15,16]. We also compare with the performance of the 'graph-percolation' method proposed in [10], which is also parameter-free. In the percolation method, the links with the longest distances are gradually removed, and an OS is assigned to each element according to the order in which the elements become disconnected from the giant component.
The work is organized as follows. Section 2 describes the proposed methodology, section 3 describes the dataset used, section 4 describes the performance quantifiers, section 5 presents the results and section 6 presents the discussion and the conclusions.

Method
We consider a set of N elements (nodes), each with an associated vector of M features (for the credit card database described in the next section, M = 28). The Euclidean distance between two elements, i and j, whose feature vectors are $x_i = \{x_{i1}, \dots, x_{iM}\}$ and $x_j = \{x_{j1}, \dots, x_{jM}\}$, is

$$d_{ij} = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}. \tag{1}$$

With the vector of distances of node i, we define the first 'outlier score' of node i, $\mathrm{OS1}_i$, as the average distance,

$$\mathrm{OS1}_i = \frac{1}{N} \sum_{j=1}^{N} d_{ij}. \tag{2}$$

Elements that have high values of OS1 have, on average, large distances to other nodes. More information can be obtained by inspecting not the average, but the shape of the distribution of distances. Given two nodes i and j, if the shapes of the distributions of Euclidean distances $\{d_{il}\}$ and $\{d_{jl}\}$ (with $l = 1, \dots, N$) are similar, then the elements can be considered similar; otherwise, they are different. Therefore, a weight can be assigned to the link between nodes i and j by calculating the distance between the distributions of the $\{d_{il}\}$ and $\{d_{jl}\}$ values, $P_i$ and $P_j$ respectively. Different distance measures can be used to compare two distributions and here we use the popular JS divergence,

$$D_{ij} = \mathrm{JS}(P_i, P_j) = H\!\left[\frac{P_i + P_j}{2}\right] - \frac{H[P_i] + H[P_j]}{2}, \tag{3}$$

where the Shannon entropy, H[P], is defined as follows [17]: given a discrete random variable X which takes values in the alphabet Ξ and is distributed according to p : Ξ → [0, 1],

$$H[P] = -\sum_{x \in \Xi} p(x) \log p(x). \tag{4}$$

The second outlier score of node i, $\mathrm{OS2}_i$, is then defined as the average JS divergence,

$$\mathrm{OS2}_i = \frac{1}{N} \sum_{j=1}^{N} D_{ij}. \tag{5}$$

Figure 1. Schematic representation of the procedure. OS1 and OS2 are defined by equations (2) and (5); OP1 and OP2 refer to the order in which a node becomes disconnected from the giant component, as the longest links are removed one by one, when using Euclidean distances (OP1) or when using JS divergences (OP2).
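The two score definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the use of equal-width histograms on a common grid, the default of 10 bins, and the inclusion of the zero self-distance in each node's distribution are all assumptions made here, since the text does not specify how the distance distributions are estimated.

```python
import math

def euclidean_matrix(X):
    """N x N matrix of Euclidean distances between feature vectors."""
    n = len(X)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(X[i], X[j])
    return d

def histogram(values, lo, hi, bins):
    """Normalized histogram on a common grid [lo, hi] (assumed estimator)."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # put v == hi in last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def entropy(p):
    """Shannon entropy H[P], with 0 log 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def js_divergence(p, q):
    """JS divergence: H[(P+Q)/2] - (H[P] + H[Q]) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

def outlier_scores(X, bins=10):
    """Return (OS1, OS2): mean Euclidean distance and mean JS divergence."""
    n = len(X)
    d = euclidean_matrix(X)
    os1 = [sum(row) / n for row in d]
    hi = max(max(row) for row in d)
    # Assumption: each row of d (self-distance included) is one distribution.
    P = [histogram(row, 0.0, hi, bins) for row in d]
    D = [[js_divergence(P[i], P[j]) for j in range(n)] for i in range(n)]
    os2 = [sum(row) / n for row in D]
    return os1, os2
```

For example, `outlier_scores([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], bins=5)` assigns the largest OS1 and OS2 to the last point. For real datasets a vectorized implementation would be preferable, because the pure-Python loops are slow.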
In addition, we calculate OSs using the graph percolation procedure proposed in [10]. In this procedure, the connections with the longest distances are removed one by one, and an OS is assigned to each element according to the order in which the element becomes disconnected from the giant component. For example, if node number 3 is the first node to become disconnected from the giant component, and then nodes 5 and 11 become disconnected, node number 3 is assigned OP = 1, node number 5 is assigned OP = 2 and node number 11 is assigned OP = 3 (a video showing an example is available in [10], supplementary information). We refer to the OSs defined in this way as OP1 when using Euclidean distances and as OP2 when using JS divergences.
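The percolation ordering can be illustrated with the following sketch, which takes any symmetric N × N distance matrix (Euclidean or JS). It is a naive illustration rather than the implementation of [10]: the function names are hypothetical, the giant component is recomputed by breadth-first search after every link removal (far too slow for large N), and two conventions are assumptions made here: nodes that disconnect at the same step are ranked by node index, and the node that remains at the end receives the last rank.

```python
from collections import deque

def largest_component(nodes, adj):
    """Largest connected component among `nodes`, by breadth-first search."""
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def percolation_order(dist):
    """Map each node to the rank at which it leaves the giant component."""
    n = len(dist)
    adj = [set(range(n)) - {i} for i in range(n)]  # fully connected graph
    links = sorted(((dist[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    giant = set(range(n))
    order = {}
    for _, i, j in links:  # longest links first
        if i not in giant or j not in giant:
            continue  # link is outside the giant component
        adj[i].discard(j)
        adj[j].discard(i)
        new_giant = largest_component(sorted(giant), adj)
        for node in sorted(giant - new_giant):  # assumption: ties by index
            order[node] = len(order) + 1
        giant = new_giant
        if len(giant) == 1:
            break
    for node in giant:
        order[node] = len(order) + 1  # assumption: survivor is ranked last
    return order
```

With points at positions 0, 1, 2, 3 and 100 on a line, the point at 100 is the first to lose all its links to the rest and therefore receives rank 1.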
The whole procedure is schematically summarized in figure 1: we start with the N × N matrix of Euclidean distances (d). Using this matrix, we define OS1 as the average Euclidean distance to all the other nodes, and OP1 as the order in which each element becomes disconnected from the giant component. From the probability distributions of the distances of each node, the JS divergence is computed for each pair of nodes, and a new N × N distance matrix (D) is obtained. With this matrix, we define OS2 as the average JS divergence, and OP2 as the order in which each element becomes disconnected from the giant component when using JS divergences.

Data
The dataset used in this study contains 284 807 European credit card transactions carried out over two days in September 2013 [14]. Transactions are anonymous and labeled in two classes: fraud (1) or non-fraud (0). The dataset is very unbalanced, with 492 frauds and 284 315 non-frauds, so that frauds represent only 0.172% of the dataset. The dataset is anonymized by a principal component analysis (PCA) with 28 dimensions.
The data structure contains 31 numeric attributes. The 'Time' attribute gives the seconds elapsed between each transaction and the first transaction. The 'Amount' attribute gives the value associated with each transaction. The 'Class' attribute specifies the label of each transaction, where the value 1 represents fraud and the value 0 represents non-fraud. The other 28 attributes have values taken from the PCA.
In [10] no improvement in average precision was found when including the attributes 'Time' and 'Amount' in the feature set and therefore, here we only use the 28 PCA features in the analysis.
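As an illustration of this feature selection, the sketch below reads the CSV file distributed at [14] and keeps only the 28 PCA columns and the label. The column names `V1` to `V28` and `Class` follow the header of that file; the function names are hypothetical. The `subsample` helper builds a subset with a prescribed fraction of frauds, such as the 10% used in the experiments reported in this paper; drawing the subset uniformly at random is an assumption made here, as the selection procedure is not described in the text.

```python
import csv
import random

FEATURES = [f"V{k}" for k in range(1, 29)]  # the 28 PCA attributes

def load_features(handle):
    """Read rows from an open CSV file; ignore 'Time' and 'Amount'."""
    X, y = [], []
    for row in csv.DictReader(handle):
        X.append([float(row[name]) for name in FEATURES])
        y.append(int(row["Class"]))
    return X, y

def subsample(X, y, n, fraud_fraction=0.1, seed=0):
    """Draw n elements at random with the given fraction of frauds."""
    rng = random.Random(seed)
    frauds = [i for i, label in enumerate(y) if label == 1]
    regular = [i for i, label in enumerate(y) if label == 0]
    n_fraud = round(n * fraud_fraction)
    chosen = rng.sample(frauds, n_fraud) + rng.sample(regular, n - n_fraud)
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]
```

A typical call would be `with open("creditcard.csv", newline="") as f: X, y = load_features(f)`, where the file name is whatever the downloaded file is saved as.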

Performance measures
Selecting an appropriate performance measure is as important as selecting the data attributes (features) used to mine outliers. Two well-known measures are the AUC-ROC and the AUC-PR [15,16]. The ROC curve plots the true positive rate (TPR), TP/(TP+FN), against the false positive rate (FPR), FP/(FP+TN), as the detection threshold is varied. The area under the ROC curve is a measure of the quality of a binary classifier: random guessing gives a diagonal line (AUC-ROC = 0.5), while a perfect classifier has one (or more) thresholds that perfectly separate the two classes, in which case AUC-ROC = 1.
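The area under the ROC curve equals the probability that a randomly chosen positive element receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this equivalence directly; the function name is hypothetical, and the pairwise loop is meant for small examples, since a sort-based computation is preferable for large datasets.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Scores that perfectly separate the classes give 1, constant scores give 0.5, and a perfectly inverted ranking gives 0.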
ROC curves can present a very optimistic view of a classifier's performance if there is a significant class imbalance, and in this case PR curves have been used as an alternative to ROC curves [15,16]. The PR curve is more informative because it does not depend on the number of true negatives. The precision is the ratio of correct positive detections over all positive detections, TP/(TP+FP), and the PR curve is obtained by plotting the precision vs. the recall (i.e. the TPR). The average precision is the area under the PR curve (AUC-PR).
Figure 2 displays the ROC and PR curves obtained with the four OSs, considering datasets with different sizes. In all cases, the elements of the datasets were selected such that 10% were fraud transactions and 90% were regular ones. We note that OS1 and OS2 have very similar performance, which is higher than the performance of OP1 and OP2. In [10] the average performance achieved with the percolation method was lower than 0.6, while here the performance of OS1 and OS2 typically exceeds 0.8. We have verified that the value of AUC-PR remains similar when a smaller number of outliers is included in the dataset (figure 3).
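The average precision can be computed from the ranking induced by an outlier score. The sketch below uses the common step-wise estimate, in which the precision is evaluated at the rank of each true positive and averaged over all positives. The function name is hypothetical, and elements with tied scores are kept in their input order, which is an assumption; other tie conventions give slightly different values.

```python
def average_precision(scores, labels):
    """Step-wise area under the PR curve for binary labels (1 = outlier)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / k  # precision TP/(TP+FP) at this recall level
    return total / n_pos
```

For scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]`, the precisions at the two positives are 1 and 2/3, so the average precision is 5/6.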

Results
Results with different N are summarized in table 1, where it can be seen that OS1 and OS2 clearly outperform the percolation-based methods, OP1 and OP2. However, one might wonder whether this increase in performance comes at the cost of a large increase in computational time. Figure 4 shows that, as could be expected, for the four methods the computational time increases as $N^2$ due to the calculation of the N × N distance matrices. Figure 4 also indicates some typical times when the algorithm was run on an iMac Core i7 computer with 64 GB RAM.
In table 1 we also notice that for OP1, the mean of the AUC-PR seems to decrease and then to saturate as N increases. We do not yet know the origin of this variation; additional studies using other datasets are planned to determine whether it is generic or specific to the dataset analyzed.
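The quadratic scaling of the computational time follows from counting the pairs of nodes. As a rough worked example, with M the number of features and B a hypothetical number of histogram bins used to estimate the distance distributions (B is an assumption introduced here for illustration):

```latex
\begin{align*}
\text{number of pairs} &= \binom{N}{2} = \frac{N(N-1)}{2}, \\
N = 1000 &\;\Rightarrow\; 499\,500 \text{ pairs}, \qquad
N = 2000 \;\Rightarrow\; 1\,999\,000 \text{ pairs } (\approx 4\times), \\
\text{cost of the Euclidean matrix} &\propto M\,\frac{N(N-1)}{2}, \qquad
\text{cost of the JS matrix} \propto B\,\frac{N(N-1)}{2}.
\end{align*}
```

Doubling N therefore roughly quadruples the work for each matrix. The JS matrix is computed in addition to the Euclidean one, which is consistent with OS2 being more expensive than OS1.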
Taken together, the results presented in table 1 and figure 4 show that OS1 is a cost-effective method for mining outliers in this dataset, because it has a performance that is only slightly lower than OS2, while it avoids the calculation of the matrix of JS divergences, which is computationally demanding when the datasets are large.
We remark that detecting credit card frauds is an active research field, and the publicly available dataset that we have used [13,14] has also been used by other authors. A natural next step is to perform a critical comparison with other techniques; however, this is left for future work, as our goal here is to present a general method that can be used to mine outliers in generic datasets, whenever appropriate distances can be defined between elements of the datasets. We use the credit card database just as an example, and we do not claim that our method outperforms other methods for detecting credit card frauds.

Conclusion
We have proposed a new methodology for mining outliers in high-dimensional datasets. The method is based on the analysis of the fully connected graph, where the distances between the elements of the dataset are defined by using the Euclidean distance or the JS divergence. At least for the dataset analyzed, we conclude that the analysis of the average distance improves precision with respect to the graph percolation method proposed in [10]. Future work will be devoted to testing these methods in databases of crime occurrences in Minas Gerais, Brazil [19].
As the JS divergence complements the information obtained from the mean Euclidean distances (because it compares the shape of two distributions), we speculate that combining OS1 and OS2 can further improve performance. The method proposed here is parameter-free and therefore does not require any training; however, it could be refined by replacing the Euclidean distance with a functional form whose parameters are optimally tuned to the particular data under inspection by using machine learning. For example, weights can be assigned to different features, to pay particular attention to the most informative ones.
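One simple functional form of this kind is a weighted Euclidean distance, sketched below. The function name and the weight vector `w` are hypothetical; how the weights would be tuned is not addressed here and is left open, as in the text.

```python
import math

def weighted_euclidean(x, y, w):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))
```

With all weights equal to 1 this reduces to the ordinary Euclidean distance, and a weight of 0 removes the corresponding feature from the comparison.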

Data availability statement
The dataset used in this study can be downloaded from [14]. The data that support the findings of this study are openly available at the following URL/DOI: www.kaggle.com/datasets/mlg-ulb/creditcardfraud.