Exceptional in so Many Ways—Discovering Descriptors That Display Exceptional Behavior on Contrasting Scenarios

The current state of the art in supervised descriptive pattern mining is very good at automatically finding subsets of the dataset at hand that are exceptional in some sense. The most common form, subgroup discovery, generally finds subgroups where a single target variable has an unusual distribution. Exceptional model mining (EMM) typically finds subgroups where a pair of target variables displays an unusual interaction. What these methods have in common is that one specific exceptionality is enough to flag up a subgroup as exceptional. This, however, naturally leads to the question: can we also find multiple instances of exceptional behaviour simultaneously in the same subgroup? This paper provides a first, affirmative answer to that question in the form of the SPEC (Subsets of Pairwise Exceptional Correlations) model class for EMM. Given a set of predefined numeric target variables, SPEC will flag up subgroups as interesting if multiple target pairs display an unusual rank correlation. This is a fundamental extension of the EMM toolbox, which comes with additional algorithmic challenges. To address these challenges, we provide a series of algorithmic solutions whose strengths and flaws are empirically analysed.


I. INTRODUCTION
We are living in a Golden Age of data science, where data mining techniques designed to discover valuable insights from a collection of records [1] are employed to transform tons of facts into useful information in fields as diverse as education [2], health care [3], and the Internet of Things [4]. Nowadays, the quantity of data gathered in different domains is so high that it is common practice not only to provide specific algorithms for such huge quantities of data [5], but also to reduce the enormous amount of data in order to be able to process it. In this regard, identifying subsets of a dataset that are somehow of great interest to researchers in a specific field is a key point for different discovery and filtering tasks [6]. Traditional pattern mining methods discover coherent nuggets of information that somehow deviate from the norm, i.e., where something interesting is going on. This deviation is quantified according to different measures: in terms of a relatively high/low occurrence, which is known as frequent/infrequent itemset mining [7]; an unusual distribution for a specific target variable, known as Subgroup Discovery (SD) [8]; or even patterns of high utility for a specific aim [9].
Exceptional model mining (EMM) [10], [11] was proposed as a supervised descriptive pattern mining framework [3] to encompass different forms of interesting behaviour on a pair of target attributes. Unlike SD [12], which typically seeks unusual target distributions, EMM typically looks for unusual pairwise interactions between two targets (expressed as a model class). This framework allows users to define the model class they are interested in, and to search for interesting data subsets according to such a model, pointing out reasons to understand why a specific subset causes such an unusual interaction. Taking the widely used EMM example of analysing the housing price per square meter [11], the common wisdom is that a larger lot size coincides with a higher sales price. At this point, an investor might wonder whether it is possible to find specific data subsets where the price of an additional square meter is significantly less than the norm, or even zero. Finding such subsets may support investment decisions and increase profits thanks to the knowledge provided by EMM, e.g., the price of a house in the higher segments of the market is mostly determined by its location and facilities. The desirable location may place a natural limit on the lot size, such that lot size is not a factor in the pricing.
While EMM is quite successful in automatically determining which subgroups of the dataset at hand feature unusual interactions between the predefined pair of target variables, one does need to preset these target variables before running the algorithm. A typical EMM instance does not allow the algorithm to determine itself which target variables are the relevant ones. Perhaps there are multiple unusual interactions at play in a larger set of target variables simultaneously. As an example, let us go back to the analysis of the housing price, where the aim now is to discover subsets that present an unusual interaction on multiple models at the same time, e.g. housing price and any other target variable. As we will describe in the experimental section, houses with a recreational room cause exceptional pairwise interactions between the sales price and multiple other targets: (1) supplementary square meters are no longer related to a rise in sales price; (2) a higher number of full bathrooms does not cause an increment in the sales price; (3) additional stories excluding basement do not imply less affordable houses. Finding such a subset (houses with a recreational room) reveals various exceptional interactions with regard to the housing price that should be known by an investor in order to apply corrective actions that increase the financial benefits, e.g., adding a recreational room is a luxury that has such a huge impact on the sales price that neither additional square meters, nor full bathrooms, nor stories excluding basement will increase the sales price. All in all, a typical EMM instance requires a pair of fixed target variables to look for unusual interactions but, in many application fields, the target variables are not predefined since there is no previous knowledge about the problem at hand, which requires looking for several unusual pairwise interactions between sets of targets.
Searching for exceptional interactions between arbitrary combinations of variables is a challenging research agenda; this paper sets a specific step on that path. A new model class for EMM is proposed, assessing the rank correlation between each pair of numeric targets from a set. In other words, the EMM rank correlation model class [13] is extended to highlight subgroups where several pairwise correlations are exceptional. The concept of a pairwise exceptional pattern is therefore introduced as an extension of the already known concept of an exceptional pattern. Hence, SPEC fundamentally extends the typical EMM toolbox and, as such, requires fundamental algorithmic contributions as well, which this paper will provide. In this regard, the contributions of this research work can be summarized as follows: 1) SPEC describes reasons to understand the cause of unusual interactions among multiple targets in data. It looks for good descriptors extracting interesting subsets of data on contrasting scenarios. 2) A typical EMM instance does not allow the algorithm to determine itself which target variables are the relevant ones; conversely, SPEC can determine exactly which targets are relevant for the subgroup at hand, and subgroups may be interesting only if they display exceptionality in several such pairwise settings simultaneously.

3) SPEC moves towards unsupervised learning tasks where no knowledge about the data is required in terms of targets (data subsets to be analysed). Instead, users search for useful information among a wide set of attributes at the same time. The final aim of this paper is to set the bases for further research on the ideas presented here, so the presented algorithmic solutions are just adaptations of well-known algorithms demonstrating that all the proposed ideas are feasible. Several datasets are analysed to demonstrate the usefulness of the proposal and how interesting the discovery of exceptional data subsets on many different pairs of targets is. The rest of the paper is organized as follows. Section II provides some key concepts and descriptions to understand the EMM problem. Section III describes the SPEC model class as well as some algorithmic solutions. Section IV includes some experimental analysis. Finally, a lesson learned is described in Section V and some concluding remarks are outlined in Section VI.

II. PRELIMINARIES
In general terms, a pattern (itemset) is the key element in any process of eliciting useful knowledge [14] since it defines subsequences or substructures representing any type of homogeneity and regularity in data [7]. Formally, given a set of items I = {i_1, i_2, ..., i_l} in a database Ω, a pattern P is defined as a subset of I, i.e., P = {i_j, ..., i_k} ⊆ I : 1 ≤ j ≤ k ≤ l. Pattern mining [15] is a broad subtask of data mining [1] that aims at describing intrinsic and important properties of data by finding novel, significant, unexpected, nontrivial and actionable elements hidden in data. This task identifies and describes chunks of data [11] that are of great interest to researchers in a specific field.
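As an illustration of these definitions, the following minimal Python sketch (with made-up item names) checks which records of a toy database are covered by a pattern, i.e., which records contain all of its items:

```python
# Sketch: a pattern P is a subset of the item universe I; a record is
# "covered" when it contains every item of P. Names are illustrative.
def covers(pattern, record):
    """True iff the pattern (a set of items) is a subset of the record."""
    return pattern <= record

database = [
    {"a1", "a2", "a3"},
    {"a1", "a3"},
    {"a2", "a3"},
]
pattern = {"a1", "a3"}
# Support of the pattern: how many records it covers.
support = sum(covers(pattern, r) for r in database)
```

Here, the pattern {a1, a3} covers the first two records only, so its support is 2.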
Nowadays, trending supervised local pattern mining [3] frameworks such as exceptional model mining (EMM) [10], which looks for unusual interactions on a pair of targets and describes reasons to understand the cause of such unusual interactions, are being considered to capture different forms of interesting behaviour. EMM [10] is defined as a multi-target generalization of subgroup discovery [8]. Rather than denoting the unusual distribution of a single target variable t, as subgroup discovery does, EMM considers the unexpected interaction between a pair of target variables t_x, t_y [11]. EMM is highly related to the discovery of more actionable insights by finding coherent subsets that behave differently when compared either to the whole dataset (the interest is focused on deviations from a possibly inhomogeneous norm) or to the complement of such subsets (the attention is paid to dichotomies). Formally, let us assume a dataset Ω consisting of a bag of records r ∈ Ω of the form r = (a_1, ..., a_k, t_1, t_2) : k ∈ N+. Here, A = {a_1, ..., a_k} denotes the descriptive attributes or descriptors, and T = {t_1, t_2} denotes the target variables or targets. EMM aims at discovering a subset of data G_D ⊂ Ω corresponding to a description given by a set of descriptors D(a_1, ..., a_k), satisfying G_D = {∀r^i ∈ Ω : D(a^i_1, ..., a^i_k) = 1}, with G_D showing an unusual interaction on two specific target variables t_x and t_y.
In one of the three original EMM model classes, the concept of interest was based on Pearson's standard correlation coefficient ρ between t_x and t_y, computed both for the subset (ρ_{G_D}) and for the whole dataset (ρ_Ω), or for the complement G_D^C ≡ Ω \ G_D (denoted ρ_{G_D^C}). Thus, a quality measure for this model class was determined by the difference between these coefficients [10]. However, G_D ⊆ Ω, so it is not statistically sound [13] to compare the exceptionality in terms of ϕ(G_D) = |ρ_{G_D} − ρ_Ω|; the complement should be used instead. Besides, Pearson's standard correlation coefficient ρ has some drawbacks that should be considered in the EMM problem [13]: (1) it is sensitive to non-normality, and without the normality assumption (which is often violated in real-life examples) many statistical tests on ρ become meaningless or at least hard to interpret; (2) it is easily affected by outliers; and (3) it assumes a linear relationship between targets. In order to overcome these drawbacks, two different rank-based correlation coefficients were already considered. One of them is Spearman's rank correlation coefficient [16], a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables t_x and t_y); it assesses how well the relationship between the two variables can be described using a monotonic function. The other is Kendall's τ coefficient [17], which measures statistical dependence by determining the similarity of the orderings of the data when ranked by each of the quantities. In general terms, both coefficients usually produce almost the same solutions and, therefore, can be used interchangeably in this problem, as already pointed out in [13].
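To make the rank-correlation machinery concrete, the following sketch implements Kendall's τ directly from its definition (concordant minus discordant pairs over all pairs). Ties are ignored for simplicity, so this is the τ-a variant rather than the paper's exact implementation:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    No tie correction; xs and ys must have the same length n >= 2."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A perfectly monotone increasing relation yields τ = 1, a perfectly reversed one yields τ = −1, which is exactly the behaviour exploited when contrasting a subgroup against its complement.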
Recently, to allow subgroups to be exceptional in subsets of a predefined set of target columns at hand, two approaches to EMM were introduced; as such, these works are the closest related to the current paper. Reference [18] introduced Exceptional Preferences Mining, where a subgroup is interesting if it features unusual preference relations between a predefined set of labels. These relations can be gauged over the whole set, but also in a labelwise or even pairwise manner: in the latter case, a subgroup is deemed interesting if a single pair of labels is ranked in an unusual way compared to the overall dataset. Reference [19] defined a local pattern mining task on an olfactory dataset. Since it is generally not well understood exactly which properties influence our olfactory perceptions, the method allows for the discovery of rules where any subset of the predefined label set is relevant. Two facts set these related works apart from the current paper: neither of them allows for numeric targets, and neither explicitly rewards multiple simultaneous exceptional pairwise interactions.

III. SUBSETS OF PAIR-WISE EXCEPTIONAL CORRELATIONS
The SPEC (Subsets of Pairwise Exceptional Correlations) model class fundamentally extends the exceptional model mining (EMM) toolbox. Unusual interactions between target variables are not measured on a single model (one pair of targets) but on multiple models (several pairs of targets). Hence, SPEC solves a much broader task that looks for good descriptors extracting interesting subsets of data on contrasting scenarios: a pattern or itemset (set of descriptive attributes) might be deemed exceptional if it describes an exceptional behaviour on disparate models in the same dataset Ω.
Definition 1 (descriptors and targets): Suppose a dataset Ω comprising a bag of N records r ∈ Ω of the form r = (a_1, ..., a_k, t_1, t_2), where k is a positive integer, i.e., k ∈ N+. We call the first k attributes of the dataset the descriptors, whose set we denote by A and whose collective domain (which is a Cartesian product of k single-attribute domains, each of which can be binary, categorical, or numerical) we denote by 𝒜. The last two attributes of the dataset, conversely, are denoted as a set by T, and their collective domain in this paper must be R^2: all targets are numeric. If necessary, we will refer to the i-th record of the dataset and its components by superscript i.

Definition 4 (exceptional pattern): A description D and its associated subgroup G_D are deemed exceptional in a dataset Ω if the subset G_D ⊂ Ω obtained by D describes an unusual interaction between a pair of targets. This exceptionality is governed by a quality measure ϕ. Even though technically a quality measure has access to the whole description, the spirit of EMM suggests that it assesses the exceptionality of the interaction between the targets in the induced subgroup. Hence, a description is defined in terms of the descriptors, it selects a subset of the dataset known as a subgroup, and the quality measure evaluates the subgroup by contrasting the interaction behaviour of the selected two targets with the target interaction behaviour outside of the subgroup (its complement).
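The way a description induces a subgroup and its complement can be sketched as follows; the attribute names and the dict-based record encoding are illustrative assumptions, not the paper's data model:

```python
# Sketch: a description D is a conjunction of conditions on descriptors;
# it partitions the dataset into the subgroup G_D and its complement.
def split_by_description(dataset, description):
    """Partition records into (subgroup, complement) under a description
    given as {attribute: required value}."""
    subgroup, complement = [], []
    for record in dataset:
        if all(record.get(attr) == val for attr, val in description.items()):
            subgroup.append(record)
        else:
            complement.append(record)
    return subgroup, complement

# Toy housing-style records (hypothetical fields).
data = [
    {"rec_room": 1, "price": 300, "sqm": 120},
    {"rec_room": 1, "price": 310, "sqm": 200},
    {"rec_room": 0, "price": 150, "sqm": 90},
]
g_d, g_c = split_by_description(data, {"rec_room": 1})
```

The quality measure ϕ would then contrast the target interaction in `g_d` against that in `g_c`.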
Definition 5 (pairwise exceptional pattern): A description D and its associated subgroup G_D are considered pairwise exceptional in a dataset Ω if the subset G_D ⊂ Ω obtained by D describes an unusual interaction between multiple pairs of targets. Similarly to exceptional patterns, the exceptionality of a pair is governed by a quality measure ϕ. In general terms, a pairwise exceptional pattern P_1 is better than another pairwise exceptional pattern P_2 if the number of target pairs in which P_1 is exceptional is higher than that of P_2.

A. TASK COMPLEXITY
The traditional exceptional model mining task has a nontrivial computational cost [11], and it is even higher for the proposed SPEC model class. Let us, for the moment, fix two specific target variables t_x and t_y, as exceptional model mining does. At least 2^k − 1 different descriptions exist for this model (assuming that all descriptors are binary; if descriptors are not binary, the number increases, so this is the best-case scenario, and it is already exponential). Each of these subsets (and their complements) requires quantifying the rank correlation coefficient for the targets t_x and t_y, so a brute-force approach requires a total of 2 × (2^k − 1) = 2^{k+1} − 2 different evaluations. Let us now consider the proposed SPEC model class with a number of targets m larger than two. Here, this procedure needs to be applied to each of the m(m − 1)/2 pairs of target variables in T. Thus, when mining pairwise exceptional patterns, a total of (1/2) m(m − 1)(2^{k+1} − 2) different evaluations are required; the complexity is exponential in the number of descriptors and quadratic in the number of targets. Additionally, Kendall's τ rank correlation requires a total of (n × (n − 1))/2 comparisons in a subgroup of n records, so the τ rank correlation for each pair of targets requires (|G_D| × (|G_D| − 1))/2 operations for the subgroup G_D and (|G_D^C| × (|G_D^C| − 1))/2 for its complement. Hence, the final computational cost amounts to (1/2) m(m − 1)(2^k − 1) [(|G_D| × (|G_D| − 1))/2 + (|G_D^C| × (|G_D^C| − 1))/2] operations. In the case m = 2, the complexity collapses to that of traditional exceptional model mining with the rank correlation model class.
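The evaluation count derived above can be expressed as a small helper (binary descriptors assumed, as in the best-case analysis):

```python
def spec_evaluations(k, m):
    """Worst-case number of rank-correlation evaluations for k binary
    descriptors and m numeric targets: (m*(m-1)/2) * (2**(k+1) - 2)."""
    target_pairs = m * (m - 1) // 2     # quadratic in the number of targets
    evaluations = 2 ** (k + 1) - 2      # subgroup + complement per description
    return target_pairs * evaluations
```

For m = 2 this collapses to 2^{k+1} − 2, the cost of the traditional EMM rank correlation model class; already for k = 3 descriptors and m = 4 targets, 84 evaluations are needed instead of 14.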

B. PROPOSED APPROACHES
To demonstrate the usefulness of the proposed ideas and how they can be accomplished, disparate methodologies including exhaustive search, random search and evolutionary approaches are proposed. All of these approaches are just adaptations of well-known and widely recognised techniques, with the aim of demonstrating the usefulness of the provided foundations for future research works as well as for more efficient and specifically designed algorithms. The adapted algorithms belong to two different methodologies (exhaustive search and heuristic-based approaches). Due to space restrictions, a more detailed description of these algorithms is available at http://www.uco.es/kdis/spec together with the source code (all of them were implemented in Python).

Algorithm 1 Exhaustive Search Approach
Require: Ω, α, β          ▷ Dataset Ω, minimum threshold values α and β
Ensure: P
 1: P ← ∅
 2: F ← Apply a FIM algorithm(Ω)          ▷ Generate all the frequent itemsets
 3: P ← GetExceptionalPatterns(F, Ω, α, β)          ▷ Calculate exceptionality of ∀D ∈ F
 4: return P
 5: procedure GetExceptionalPatterns(F, Ω, α, β)
 6:   for all D ∈ F do
 7:     G_D ← ∅
 8:     G_D^C ← ∅
 9:     S_{G_D} ← ∅
10:     ▷ Save records in G_D or its complement G_D^C
11:     for all records r ∈ Ω do
12:       if D ⊆ r then
13:         G_D ← G_D ∪ {r}
14:       else
15:         G_D^C ← G_D^C ∪ {r}
16:       end if
17:     end for
18:     for all t_x, t_y ∈ T do          ▷ Obtain z_τ^{t_x t_y}(G_D) for all the pairs of targets
19:       calculate ϕ_τ^{t_x t_y}(G_D)
20:       if ϕ_τ^{t_x t_y}(G_D) ≥ α then          ▷ An unusual interaction is discovered
21:         S_{G_D} ← S_{G_D} ∪ {p-value}
22:       end if
23:     end for
24:     if quality(S_{G_D}) ≥ β then          ▷ More than β unusual interactions
25:       P ← P ∪ {D}
26:     end if
27:   end for
28:   return P
29: end procedure

1) EXHAUSTIVE SEARCH
Algorithm 1 illustrates the pseudo-code of a simple exhaustive search approach. The proposal works in two different phases. In the first one (see Line 2, Algorithm 1), a frequent itemset mining (FIM) algorithm is applied to obtain all the frequent solutions (considering only descriptive attributes) in Ω. Here, any of the most widely used FIM algorithms [20] may be used: Apriori, FP-Growth, Eclat, LCM, Sam, Relim. In the second phase (see Line 3, Algorithm 1), the exceptionality of each of the previous solutions is computed (see Lines 5 to 29, Algorithm 1). Here, each description D is analyzed to obtain the subset of data G_D ⊂ Ω and its complement G_D^C (see Lines 11 to 17, Algorithm 1). Once G_D and G_D^C are obtained, the procedure calculates the unusual interaction for each pair of targets t_x, t_y ∈ T, measured by ϕ_τ^{t_x t_y}(G_D), i.e., one minus the p-value of z_τ^{t_x t_y}(G_D). This value denotes the probability that there are statistical differences in the Kendall rank correlation under the null hypothesis H_0 that the rank correlations in G_D and G_D^C are equal. Hence, according to a minimum threshold value α, D denotes an unusual interaction on t_x, t_y ∈ T if and only if ϕ_τ^{t_x t_y}(G_D) is greater than α (see Line 20, Algorithm 1). Finally, a description D (pattern or itemset) is considered interesting if it denotes an unusual interaction in more than β pairs of targets t_x, t_y ∈ T (see Line 24, Algorithm 1). The algorithm ends by returning the set P of exceptional patterns from the whole set F of feasible patterns or descriptions. Considering the Apriori algorithm as baseline, it is widely known [21] that its time complexity is exponential, that is, O(2^{d+1}), and it runs slowly with regard to the number of attributes d.
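The quality computation of Lines 18 to 23 can be sketched as follows. The exact form of the paper's z_τ^{t_x t_y} statistic is not reproduced here, so this sketch uses the standard normal approximation for Kendall's τ under independence as a stand-in; the `quality` value is one minus a two-sided p-value, as in the text:

```python
from math import sqrt, erf

def tau_variance(n):
    """Variance of Kendall's tau for n records under the null hypothesis
    of independence (standard normal approximation)."""
    return 2 * (2 * n + 5) / (9 * n * (n - 1))

def z_statistic(tau_gd, n_gd, tau_gc, n_gc):
    """Approximate z for the difference between the tau of the subgroup and
    the tau of its complement (a simplification of the paper's statistic)."""
    return (tau_gd - tau_gc) / sqrt(tau_variance(n_gd) + tau_variance(n_gc))

def quality(z):
    """One minus the two-sided p-value of z under a standard normal."""
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return 1 - p_value
```

A description would then be flagged on a target pair whenever `quality(z)` exceeds the threshold α.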

2) RANDOM SEARCH
The proposed algorithm (see Algorithm 2) works over ite iterations, and each iteration is responsible for generating M random solutions (descriptions). To generate a random description (see Lines 4 to 14, Algorithm 2), the algorithm first produces a random number l between 1 and k (the number of descriptors in the data) to determine the length of the solution (number of items in the description). Then, l random descriptors are taken to form a chromosome of length l. Each gene within the chromosome represents a chosen descriptor. Once the description D is randomly formed (the chromosome), a procedure to calculate whether the description is exceptional or not (see Lines 15 to 27, Algorithm 2) is performed. This procedure is exactly the same as the one shown in Algorithm 1, i.e., it obtains the subset of data G_D ⊂ Ω (and its complement G_D^C) given the description D, and it calculates the unusual interaction for each pair of targets t_x, t_y ∈ T, measured on the basis of the p-value of z_τ^{t_x t_y}(G_D). Since this algorithm can work with continuous attributes, descriptions that include the same descriptors but different ranges of values should be considered equal. Hence, a procedure to avoid repeated subsets is included (see Line 29, Algorithm 2), which checks whether the records covered are the same even when the ranges of values differ. Finally, we also propose a variant of the random search approach in which attributes that have not been selected yet are more likely to be selected than those already chosen. The computational complexity of the random search algorithm depends on the number of solutions m to be found, that is, O(m).
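The random description generation (Lines 4 to 14 of Algorithm 2) can be sketched as follows; the descriptor names and the fixed seed are illustrative only:

```python
import random

def random_description(descriptors, rng=random.Random(42)):
    """Draw a random description: pick a length l in [1, k], then l distinct
    descriptors (a simplified sketch of Lines 4-14 of Algorithm 2)."""
    l = rng.randint(1, len(descriptors))
    return set(rng.sample(descriptors, l))

candidates = ["a1", "a2", "a3", "a4"]
description = random_description(candidates)
```

The weighted variant mentioned above would simply bias `rng.sample` towards descriptors chosen less often so far, e.g. by maintaining a selection counter per attribute.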

3) EVOLUTIONARY COMPUTATION
The proposal follows a well-known generational schema where, in each generation of the evolutionary process, solutions are crossed and mutated, and new offspring are obtained. The algorithm (see Algorithm 3) starts by encoding patterns (descriptions) through a process similar to the already described random search approach (see Lines 4 to 31, Algorithm 2). Then, the generational schema is performed over G generations (see Lines 6 to 30, Algorithm 3), returning the set E comprising the best solutions found so far. A fitness function is proposed to define how promising a solution s_1 is: it is defined as the average of one minus the p-value of z_τ^{t_x t_y}(s_1) over all pairs of targets t_x and t_y. In order to obtain new solutions along the evolutionary process, two genetic operators have been proposed and applied (see Line 8, Algorithm 3). The crossover genetic operator is based on the assertion that extremely high or low values of |G_D| tend to produce a high exceptionality, since |G_D| and |G_D^C| are then dissimilar [13]. Thus, from a solution whose |G_D| value (frequency of the pattern, in per-unit basis) is close to 0.5, the operator tends to obtain a new solution whose |G_D| value is far from 0.5. Here, given two patterns (solutions, i.e., sets of descriptors) p_1 and p_2, the item having the lowest frequency in the pattern with the highest frequency is swapped with the item having the highest frequency in the pattern with the lowest frequency. In other words, if the frequency of p_1 is 0.7 and the frequency of p_2 is 0.3, the aim is to increase the frequency of p_1 and reduce the frequency of p_2, so that both solutions move away from 0.5 (the |G_D| value to be avoided). As for the mutation genetic operator, it follows the same idea as the crossover operator, but it replaces the worst item within the pattern by a random one.
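A minimal sketch of the frequency-driven crossover described above follows; `item_freq` and the toy subgroup-frequency estimate are assumptions for illustration, not the paper's exact implementation:

```python
# Sketch of the frequency-driven crossover: the rarest item of the
# higher-frequency parent is swapped with the commonest item of the
# lower-frequency parent, pushing both subgroup frequencies away from 0.5.
def spec_crossover(p1, p2, item_freq, subgroup_freq):
    high, low = (p1, p2) if subgroup_freq(p1) >= subgroup_freq(p2) else (p2, p1)
    rare = min(high, key=item_freq.get)    # lowest-frequency item of the frequent pattern
    common = max(low, key=item_freq.get)   # highest-frequency item of the rare pattern
    return (high - {rare}) | {common}, (low - {common}) | {rare}

# Hypothetical relative supports per item.
item_freq = {"a1": 0.9, "a2": 0.5, "a3": 0.1}
# Toy estimate: a conjunction is at most as frequent as its rarest item.
freq = lambda p: min(item_freq[i] for i in p)
c1, c2 = spec_crossover({"a1", "a2"}, {"a3"}, item_freq, freq)
```

After the swap, the high-frequency child keeps only common items and the low-frequency child keeps only rare ones, so both drift away from a subgroup frequency of 0.5.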
It is widely known [21] that the time complexity of evolutionary algorithms grows with the product of instances and attributes, being equal to O(N × d) for d attributes and N instances or transactions. Last but not least, this proposal cannot be compared to existing high-performance evolutionary solutions for mining frequent patterns [22], [23], since the goal is completely different. In a proposal for mining exceptional patterns, the aim is not to find frequent data subsets, so neither the fitness function nor the genetic operators are similar to those provided by the authors in [22], [23]. Hence, the proposals cannot be fairly compared.

IV. EXPERIMENTAL ANALYSIS
The aim of this experimental analysis is threefold: demonstrating the usefulness of the genetic operators for this problem when evolutionary algorithms are considered; describing the performance of the proposed algorithms when data dimensionality varies; demonstrating the importance of using SPEC by analysing different solutions. The reader should consider that insights discovered by SPEC Follow the same procedure as the one shown in Lines 4 to 31, Algorithm 2 5: end for 6: for g from 1 to G do Iterate G times seeking solutions 7: parents ← apply parent selector on P Typical tournament selector (the size of tournament is 3) 8: offspring ← apply crossover and mutation on parents 9: for all ind ∈ offspring do Evaluate all the new individuals 10: for all record r ∈ do Save records in G D or its complement G C D 11: if D ∈ ind ⊆ r then 12: end if 16: end for 17: for all t x , t y ∈ do Obtain z t x t y τ (G D ) for all the pairs of targets 18: calculate ϕ An unusual interaction is discovered 20: S G D ← S G D ∪ p-value 21: end if 22: end for 23: if ϕ SPECi (S G D )≥ ϕ then i ∈ {one, some, all} 24: Check and update range of values of D ∈ ind 25: Update set P with ind considering the population size 26: end if 27: end for 28: P ← update the general population considering the set offspring 29: E ← update the elite population considering the sets P and offspring 30: end for 31: return E on real-word data should be analysed in collaboration with experts in the specific application field. Three different methodologies, implemented in Python programming language, are given and all their variants are freely available to be downloaded at http://www.uco.es/kdis/spec. It should be noted that all the experiments presented in this section were run on an Intel(R) Core(TM) i7 CPU at 2.67GHz with 12GB main memory and running CentOS 5.4. 
To carry out this experimental analysis, a collection of datasets is considered and described, comprising a varied number of transactions, targets and descriptors.
As with any evolutionary approach, the proposed evolutionary model should be configured with a set of adjustable parameters. All these parameters require a previous study to determine the values considered optimal, that is, those that allow us to obtain the best global results. It is worth mentioning that no single combination of parameter values performs best on all datasets; sometimes, it depends on the problem under study. In this regard, the best results for the evolutionary approach are described in the following subsection.

A. DATASETS AND EXPERIMENTAL SET-UP
The experimental analysis has been carried out by considering a varied set of datasets (see Table 1), publicly available at http://www.uco.es/kdis/spec. These datasets were selected to be as varied as possible, comprising both continuous and discrete attributes, including a varied number of target attributes, and containing a diverse number of transactions. As for the target variables, the λ value (the number of feasible pairs of targets to be considered) is also provided. Finally, since exhaustive search algorithms cannot handle continuous descriptors, any descriptor variable defined in a continuous domain is discretized into 3 and 5 bins of equal width and equal frequency. Datasets with no descriptor defined in a continuous domain (Iris and Housing) are not discretized.
Exhaustive search approaches have been run considering different support thresholds (0.05, 0.10 and 0.15), so any pattern (G_D) that overcomes these thresholds is analysed in terms of exceptionality (its ϕ_τ^{t_i t_j}(G_D) value). Even though three manners of gauging exceptionality were described in this work (ϕ_SPECone, ϕ_SPECsome, and ϕ_SPECall), the experimental analysis aims to compare the runtime and performance of the proposed algorithms and, therefore, the behaviour is measured regardless of these three metrics. Hence, the average ϕ_τ^{t_i t_j}(G_D) over all t_i, t_j ∈ T is considered to quantify the solutions. Random search approaches consider 2,000 iterations, from which the best 20 solutions are returned. Finally, as for the evolutionary computation approach, it considers a population size of 100 individuals, and the algorithm runs until there are 150 generations without any improvement in the average results (considering the best 20 solutions found so far). As for the genetic operators' probabilities, the algorithm self-adapts these values, that is, the values increase or decrease according to the average value of the 20 best solutions (elite population). In the beginning, both probabilities (mutation and crossover) are the same (a value of 0.5 is considered). The average value of the elite population is analysed every 5 iterations and, if there is an improvement, the crossover probability is increased by 0.05 and the mutation probability is decreased by 0.05. On the contrary, if no improvement is achieved after 5 iterations, the crossover probability is decreased by 0.05 and the mutation probability is increased by 0.05.
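The self-adaptive update of the operator probabilities can be sketched as follows; the clamping to [0, 1] is an assumption, since the text does not state how boundary values are handled:

```python
def adapt_probabilities(p_cross, p_mut, improved, step=0.05):
    """Every 5 generations: shift probability mass towards crossover when the
    elite average improved, and towards mutation otherwise."""
    delta = step if improved else -step
    p_cross = min(1.0, max(0.0, p_cross + delta))  # clamp to [0, 1] (assumption)
    p_mut = min(1.0, max(0.0, p_mut - delta))
    return p_cross, p_mut
```

Starting from (0.5, 0.5), an improving run drifts towards more crossover, while a stagnating run drifts towards more mutation.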

B. ANALYSIS OF THE GENETIC OPERATORS
This section aims to demonstrate that the proposed genetic operators are well-suited to this problem, so a comparison with random genetic operators is performed. By random genetic operator, we mean that items within a solution are randomly selected to be swapped with those of another solution (crossover) or to be replaced by new ones (mutation). In this analysis, the Iris dataset has been discarded since it only includes a single descriptor with three different values and, therefore, only three solutions are available (one for each of the three values). Five different datasets (with a different number of descriptors) have been considered in this analysis, comprising few descriptors (Pollution and Emotions datasets) as well as a high number of descriptors (Yeast dataset).
First, each genetic operator is analysed in isolation, studying the number of new solutions obtained that are better than their predecessors. In other words, how many times each genetic operator can produce new solutions whose ϕ values are higher than their predecessors. The average results obtained after running 30 times each genetic operator and dataset, and considering the same number of individuals to be obtained, are shown in Table 2. Second, we combine both versions of crossover and mutation genetic operators (random and proposed) and run the whole evolutionary proposal 30 times per dataset. Results are shown in Table 3, considering the 20 best solutions found so far. The aim is to demonstrate which combination is better in terms of average ϕ t i t j τ (G D ) for each solution provided by the evolutionary proposal.
The results show that the proposed crossover and mutation operators are well suited to this problem. Given a description D comprising a set of descriptors a_i, ..., a_j, it is much more effective to modify those descriptors that produce a frequency of G_D close to 0.5 than to choose a descriptor within D at random. This heuristic was formally described in the previous section, following the preliminary studies reported in [13].
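The heuristic can be sketched as follows: among the descriptors of D, prefer the one whose associated subgroup frequency is closest to 0.5. The function and variable names are hypothetical, and the exact selection rule of the proposed operator may differ.

```python
def pick_descriptor_to_mutate(description, frequency):
    """Choose which descriptor of a description to mutate.

    `description` is a list of descriptor names, and `frequency` maps
    each name to the frequency of G_D associated with it. Following the
    heuristic above, the descriptor with frequency closest to 0.5 is
    chosen (a deterministic sketch; the real operator may randomise).
    """
    return min(description, key=lambda a: abs(frequency[a] - 0.5))

# a2 yields a subgroup frequency closest to 0.5, so it is the one mutated.
freq = {"a1": 0.08, "a2": 0.47, "a3": 0.91}
print(pick_descriptor_to_mutate(["a1", "a2", "a3"], freq))  # a2
```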

C. ANALYSIS OF THE PERFORMANCE
The goal of this analysis is twofold. First, it describes the performance in terms of runtime. Second, it studies the results in terms of the average ϕ^τ_{ti,tj}(G_D) of the set of solutions returned by each algorithm. To carry out a fair comparison, this second analysis takes the best 20 solutions provided by the exhaustive search algorithms and compares them to the set of 20 solutions provided by the heuristic-based approaches. It is important to remark that any other comparison would be unfair, since exhaustive search approaches return the whole set of solutions and the average would therefore be biased.
Focusing first on the runtime analysis (see Figure 1), let us examine how the exhaustive search approaches perform as data dimensionality increases (in terms of attributes/descriptors), that is, from 10^4 to 10^10 items. These exhaustive search algorithms are based on well-known FIM algorithms (LCM, ECLAT, Relim, Sam, FP-Growth, and Apriori) [24], and they all return exactly the same set of solutions. In general terms, and in line with studies already carried out by the FIM community [20], LCM performs best, while Sam also obtains very good runtimes. Apriori, on the contrary, performs worst; this behaviour was expected, since it was the first algorithm proposed to obtain frequent itemsets.
Considering now the whole set of datasets in the experimental set-up, the runtime is analysed for all the exhaustive search approaches (LCM, ECLAT, Relim, Sam, FP-Growth, and Apriori), two versions of a random search (RS, a traditional random search, and RSWeights, a version in which the probability of an attribute being chosen is inversely proportional to the number of times it has already been chosen), and an evolutionary computation approach (Evol.). Three different support threshold values were also considered. Additionally, because the exhaustive search algorithms cannot deal with continuous descriptors/attributes, two discretization methods (equal-width, EW, and equal-frequency, EF) were applied to the datasets having continuous features, considering 3 and 5 bins for both EW and EF. In total, 26 different scenarios were considered (6 of the 8 datasets combined with the 4 discretization settings, plus Iris and Housing, which did not require any discretization). To analyse and validate the results, a series of nonparametric statistical tests was applied: Friedman's test [25] to evaluate whether there are significant differences among the algorithms and, where it indicated significant differences, the Rom post-hoc test [26] to perform multiple comparisons among all methods. The latter is a modification of Hochberg's procedure [27] that increases its power, and it is well studied and highly recommended [28] for multiple comparisons in experimental studies. Friedman's test rejected the null hypothesis in all cases analysed, considering a significance level of α = 0.05, so the Rom post-hoc test was performed for all pairwise comparisons; the results are illustrated in Figure 2.
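The two discretization methods can be illustrated with a minimal, pure-Python sketch; the implementation actually used in the experiments may handle ties and bin boundaries differently.

```python
def equal_width_bins(values, k):
    """Equal-width (EW) discretisation: split the value range into k
    intervals of identical width and assign each value to its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division trouble if all values equal
    # Clamp so the maximum value falls into the last bin, k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Equal-frequency (EF) discretisation: each bin receives roughly
    len(values) / k records (a naive sketch; ties are split arbitrarily)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(equal_width_bins(data, 3))      # the wide gap empties the middle bin
print(equal_frequency_bins(data, 3))  # two records per bin regardless of gaps
```

The example shows why the two methods can yield very different candidate descriptors on skewed data: EW leaves its middle bin empty, while EF balances the bin counts.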
LCM and the evolutionary computation approach emerged as the fastest algorithms, the former being better when the search space shrinks (i.e., a higher support threshold value). Apriori, on the contrary, was the slowest. In general, there is no statistically significant difference in runtime among LCM, Evol., RS, RSWeights, Relim, Sam, and ECLAT.
Focusing now on the analysis of the results in terms of the average ϕ^τ_{ti,tj}(G_D) of the set of solutions returned by each algorithm, the best 20 solutions provided by each algorithm were analysed. It is important to remark again that any other comparison would be unfair, since exhaustive search approaches return the whole set of solutions and the average would therefore be biased. Similarly to the previous analysis, three studies based on different support threshold values were performed. To analyse and validate the results, a series of nonparametric statistical tests was again considered, applying Friedman's test [25] to evaluate whether there are significant differences among the algorithms. Since all the exhaustive search approaches return exactly the same set of results, we gathered them under the label Exh.Search. Additionally, the two versions of random search (RS and RSWeights, as described above) and the evolutionary computation approach (Evol.) were considered. Friedman's test rejected the null hypothesis in all cases analysed, considering a significance level of α = 0.05. A Rom post-hoc test [26] was therefore performed for all pairwise comparisons, and the results are illustrated in Figure 3. As expected, the exhaustive search approaches obtain the best results, since they mine the whole search space and, therefore, the 20 best solutions analysed are always those with the maximum ϕ^τ_{ti,tj}(G_D) value. The evolutionary computation approach is the second-best, outperforming RS and RSWeights. There is no statistically significant difference between RSWeights and RS, although the former obtains better average results. It is important to remark that the differences between Exh.Search and Evol. shrink as the search space is reduced (i.e., with higher support threshold values). Finally, let us remark that all these analyses were performed on the same 26 scenarios described above.

D. ANALYSIS OF THE SOLUTIONS
This subsection illustrates and analyses some insights obtained in the experimental analysis, demonstrating the utility of SPEC in real scenarios. To keep the study concise, the two datasets with the lowest number of targets have been considered. It is important to remark that this kind of analysis should be done in collaboration with domain experts to fully understand the extracted insights.

1) IRIS DATASET
In the first real-world experiment, we analyse the Iris dataset, which is perhaps the best-known dataset in the data mining literature. Iris consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample, in centimetres: the length and the width of the sepals and petals. These four attributes are taken as target variables (sepal-length, petal-length, sepal-width, and petal-width), and an additional attribute, the species of Iris, defines candidate subgroups. In this analysis, the total number of pairs of targets is λ = C(4,2) = 6. Since only one attribute is used to generate candidate subgroups, the search space equals the number of values of this single attribute, i.e., three subgroups, one per species. The subgroup providing the highest ϕ^τ_{ti,tj}(G_D) is Iris-setosa (value 0.8108), which is much higher than Iris-versicolor (value 0.4238) and Iris-virginica (value 0.3946). Considering, therefore, Iris-setosa as the baseline, Figure 4 shows the ϕ^τ_{ti,tj}(G_D) for each of the 6 pairs of targets, revealing that the general trend between most pairs of targets is completely different when setosa is considered. It is possible to statistically assert, at a confidence level of 0.95, that the trend between petal-length and any other feature is affected by the setosa class. In other words, the lengths of the petal and sepal denote an exceptional correlation with a probability of 0.9999; the length of the petal and the width of the sepal correlate exceptionally with a probability of 0.9751; and the length and width of the petal also denote an exceptional correlation with a probability of 0.9999. Hence, D = {class = Iris-setosa} is an exceptional pattern that denotes exceptional correlations between different pairs of measures.
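The enumeration of target pairs can be reproduced directly; the target names follow the dataset description above.

```python
import math
from itertools import combinations

# The four Iris target variables named in the text.
targets = ["sepal-length", "petal-length", "sepal-width", "petal-width"]

# Every unordered pair of targets is tested for exceptional rank correlation.
pairs = list(combinations(targets, 2))
print(len(pairs), math.comb(len(targets), 2))  # lambda = C(4,2) = 6
```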
For clarification, let us focus on the lengths of the petal and sepal, which denoted an exceptional correlation with a probability of 0.9999. This behaviour is clearly illustrated in Figure 5: overall, the lengths of the sepal and the petal appear positively correlated (an increment in petal length implies an increment in sepal length). This behaviour, however, changes when only the setosa class is considered, since there an increment in petal length does not imply an increment in sepal length.
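The rank-correlation contrast can be sketched with a naive Kendall tau-a estimator on made-up measurements; SPEC's exact estimator, its tie handling, and the real Iris values may differ.

```python
def kendall_tau(x, y):
    """Kendall's rank correlation (tau-a, naive O(n^2) pair counting;
    ties contribute to neither the concordant nor the discordant count)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy illustration of the setosa effect (invented numbers): globally the two
# lengths move together, but inside the first-three "subgroup" the
# association is much weaker.
petal = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
sepal = [5.1, 4.8, 5.0, 6.4, 6.3, 6.9]
print(kendall_tau(petal, sepal))           # strong positive overall
print(kendall_tau(petal[:3], sepal[:3]))   # weaker within the subgroup
```

Comparing the subgroup's correlation with that of its complement (or of the whole dataset) is exactly the kind of contrast that ϕ^τ_{ti,tj}(G_D) quantifies.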
The strength of SPEC is its ability to obtain subsets that denote exceptional behaviour on more than a single pair of targets. The example provided in this section illustrates the capability of SPEC to provide the user with more general, and thus more rewarding, knowledge than that obtained by EMM. For example, any algorithm for mining exceptional models [11] can find that the subset given by the descriptor D = {class = Iris-virginica} is interesting for the targets sepal-length and petal-width, since with probability p = 0.9999 it is possible to statistically assert that its correlation is exceptional. While this assertion is correct, the knowledge is partial and does not reveal all the information that can be obtained: the descriptor D = {class = Iris-setosa} is more general and provides additional unusual interactions among target variables.

2) WINDSOR HOUSING DATASET
In this second analysis on a real-world dataset, we analyse the Windsor Housing dataset. It contains information on 546 houses that were sold in Windsor, Canada, in the summer of 1987. The information for each house includes 5 attributes of interest that are taken as target variables (the price of the house, the lot size of the property in square feet, the number of bedrooms, the number of full bathrooms, and the number of stories excluding the basement), and 7 additional attributes to define candidate subgroups: Driveway (does the house include a driveway?); Recroom (does the house have a recreational room?); Fullbase (does the house have a full finished basement?); Gashw (does the house use gas for hot water heating?); Airco (does the house have central air conditioning?); Garagepl (number of garage places: 0, 1, 2, or 3); and, finally, Prefarea (is the house located in a preferred neighbourhood of the city?). Due to space limitations, the whole set of obtained results is available at http://www.uco.es/kdis/spec; the most important results are described below.
The solution with the highest ϕ^τ_{ti,tj}(G_D) value corresponds to the descriptor D = {recroom = no, fullbase = yes, gashw = no, prefarea = no}, where the subset given by D includes 77 records and the complement 469. Analysing each pair of targets (see Figure 6), the sales price exhibits statistically different behaviour with respect to each of the following variables: lot size, number of bedrooms, number of full bathrooms, and number of stories excluding the basement. This is quite interesting, since it statistically shows that the correlation between price and the aforementioned variables changes when D comes into play. In other words, when a house includes a full finished basement, even if it is not located in a preferred neighbourhood and lacks some extras (a recreational room and gas for hot water heating), the price is not related to the lot size: extra square feet are meaningless to the purchaser, and the full finished basement itself is a strong motivation to invest in the house. The same behaviour is observed for additional bedrooms, bathrooms, or stories. According to the general trend, all of them are positively correlated with the sales price; however, this correlation disappears under D, possibly because a full finished basement is considered a luxury in some neighbourhoods. This knowledge is valuable for any real estate agency, which could take corrective actions to counter this unusual correlation with the sales price.
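Extracting the cover of such a conjunctive descriptor can be sketched as follows; only the attribute names mirror the Windsor Housing descriptor above, while the record values are invented for illustration.

```python
def cover(records, description):
    """Return the records satisfying every condition of a conjunctive
    description. `records` is a list of dicts; `description` maps
    attribute names to required values."""
    return [r for r in records
            if all(r.get(a) == v for a, v in description.items())]

# The best descriptor reported in the text.
D = {"recroom": "no", "fullbase": "yes", "gashw": "no", "prefarea": "no"}

# Two invented houses: the first matches D, the second does not.
houses = [
    {"recroom": "no", "fullbase": "yes", "gashw": "no",
     "prefarea": "no", "price": 52000},
    {"recroom": "yes", "fullbase": "no", "gashw": "no",
     "prefarea": "yes", "price": 87000},
]

subgroup = cover(houses, D)
complement = [r for r in houses if r not in subgroup]
print(len(subgroup), len(complement))  # 1 1
```

On the real dataset the same call would split the 546 records into the 77-record subgroup and its 469-record complement, on which the rank correlations of each target pair are then contrasted.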
Since the Windsor Housing dataset has been previously used in EMM, some of the descriptors provided in [11] are analysed again under the proposed SPEC model class. Specifically, the variables driveway and recroom, previously identified by EMM [11] as good descriptors with respect to the targets price and lot size, are analysed. The aim is to provide more general knowledge by studying whether these descriptors yield exceptional behaviour on the whole set of target variables. The results obtained by SPEC (see Figure 7) show that the descriptor D = {driveway = yes, recroom = yes} affects the price only with respect to a single target variable, i.e., lot size. In other words, the price is not positively correlated with lot size (as the general trend would suggest) for houses having a driveway and a recreational room. This exceptional behaviour is perhaps caused by these features being considered luxury extras, so adding extra square feet does not heavily affect the sales price. Additionally, it should be noted that the price increases (as expected from the general trend of the data) with the number of bedrooms, bathrooms, and stories excluding the basement, even when the house includes a driveway and a recreational room.
Finally, we analyse each of the attributes within the aforementioned descriptor D = {driveway = yes, recroom = yes} separately. Figure 8(a) shows the p-values for each pair of targets when considering the subset given by driveway = yes. As shown, this subset does not deviate from the general data trend, and it is difficult to find exceptional behaviour on any pair of targets. On the contrary, when considering recroom = yes, unusual interactions between pairs of targets are more likely (see Figure 8(b)). For example, it is highly probable that the price of houses with a recreational room does not increase with additional square feet (with a probability of 97.67%) nor with additional bathrooms (with a probability of 95.96%). In other words, having a recreational room is enough for many purchasers, who do not consider the lot size or the number of bathrooms as extra features worth paying for.

V. LESSONS LEARNT
Unlike traditional EMM, which aims at discovering interesting data subsets that denote some unusual interaction between a fixed pair of target variables, SPEC mines data subsets over any combination of target variables. Hence, the SPEC model class for EMM is computationally harder than traditional EMM. At the same time, since SPEC extracts data subsets with unusual interactions between any pair of variables, the knowledge it provides is potentially more powerful, looking beyond where EMM looks.
The experimental analysis has demonstrated the good performance of the proposed algorithms for different data dimensionalities, while also obtaining a diverse set of solutions. Results obtained on various real-world datasets have demonstrated the usefulness of this new model class. On the dataset including house sales prices, houses with a recreational room turned out to be quite interesting, since they denote exceptional behaviour in multiple scenarios. For example, the price of a house usually increases with the lot size, and with the number of bathrooms (extra bathrooms are a luxury that people pay for). However, when the house includes a recreational room, the price is, with high probability (more than 95%), unaffected by the lot size or the number of bathrooms. A recreational room is a luxury by itself, so extra square feet or additional bathrooms are meaningless for the price; moreover, houses with a recreational room are often located in the best districts of the city, so the price is not affected so much by the aforementioned variables.
As demonstrated, the aforementioned knowledge cannot be provided by traditional EMM tasks, since they are focused on a single pair of targets; thus, any direct comparison with EMM algorithms would be unfair. It is also important to highlight that a major drawback of SPEC is its computational cost, which is higher than that of traditional EMM due to the huge number of combinations of targets that must be analysed. In this sense, some proposals based on heuristic search methodologies were provided to reduce the computational time. These algorithms include additional features, such as the ability to extract subsets on continuous domains (exhaustive search approaches require a preprocessing step to transform continuous variables into discrete ones). On purely numerical domains, this feature implies that the whole set of features can be used as tentative target variables, so it is not required to predefine a set of descriptive attributes and a set of target variables. A major drawback of the heuristic approaches, however, is the lack of guarantee that all feasible solutions are analysed, so better solutions may remain hidden from the user.
To sum up, the proposed approaches have their own advantages and flaws. If runtime is not a concern and the user needs to analyse the whole search space to obtain every existing solution, then the exhaustive search approaches are more appropriate and, among them, LCM appears to be the fastest. On the contrary, if runtime is paramount, then heuristic-based solutions should be used, especially the evolutionary computation approach; the user should be aware, however, that the obtained solutions might not be the best ones.

VI. CONCLUSION
This paper presents the Subsets of Pairwise Exceptional Correlations (SPEC) model class for EMM. The proposed model class shares the basic concepts of exceptional model mining, but it also includes additional features that enrich the knowledge delivered to the user. In any case, any direct comparison of SPEC with EMM algorithms would be unfair, since their aims are different (a single pair of targets versus a wide set of targets). Additionally, it is important to remark that SPEC is a supervised local pattern mining task that describes, and helps understand, the causes of unusual interactions among any combination of target variables in the data. When the set of target variables is wide enough, SPEC moves towards an unsupervised learning task. Finally, since SPEC extracts data subsets with unusual interactions between any pair of variables, the knowledge provided is more useful: it does not only look for exceptional subsets in a particular case (a predefined set of targets, as exceptional model mining does) but across the whole set of targets, denoting exceptional behaviour on the dataset at large.
The formal definitions and major features of this novel framework are described, and multiple algorithmic solutions are presented (several exhaustive search and heuristic-based approaches). All these approaches are adaptations of well-known and widely recognised techniques, intended to demonstrate the usefulness of the provided foundations for future research as well as for more efficient, specifically designed algorithms.
JOSÉ MARÍA LUNA received the Ph.D. degree in computer science from the University of Granada, Spain, in 2014. He is currently an Assistant Professor with the Department of Computer Science and Numerical Analysis, University of Córdoba, Spain. He is the author of two books related to pattern mining, published by Springer, and of two book chapters. He has published more than 30 articles in top-ranked journals and international scientific conferences. He has been involved in four national and regional research projects and has contributed to three international projects. His research is focused on evolutionary computation and pattern mining.

MYKOLA PECHENIZKIY is currently a Full Professor and the Chair of the Data Mining Research Group, part of the Data and Artificial Intelligence cluster within the Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands. His core expertise and research interests include predictive analytics and knowledge discovery from evolving data, and their application to real-world problems in industry, medicine, and education. He has been a Principal Investigator of several nationally funded and industry-funded projects that, inspired by the challenges of real-world applications, aim at developing foundations for the next generation of informed and responsible predictive analytics. Over the past decade, he has coauthored more than 100 peer-reviewed publications. He serves on several program committees and editorial boards of leading data mining and AI conferences (AAAI, ECMLPKDD, and IJCAI) and journals (Data Mining and Knowledge Discovery and Machine Learning).
WOUTER DUIVESTEIJN received the B.Sc. degrees in mathematics and computer science, and the M.Sc. degrees in fundamental mathematics and applied computing science from Universiteit Utrecht, The Netherlands, in 2005, 2007, and 2008, respectively, and the Ph.D. degree from Universiteit Leiden, The Netherlands, in 2013, with a thesis titled Exceptional Model Mining. He spent the subsequent three years as a Postdoctoral Researcher at Technische Universität Dortmund, Germany, the University of Bristol, U.K., and Universiteit Gent, Belgium, before moving to his current position in 2016. He is currently an Assistant Professor of Data Mining with Technische Universiteit Eindhoven, The Netherlands. His research revolves around all aspects of Exceptional Model Mining: finding subgroups in datasets that are interpretable and display some kind of unusual behavior. Lately, he has started working on some fundamental problems in clustering. He has contributed to the organization of seven conferences and workshops, four of them as General (co-)Chair, and he has been providing reviews as a member of more than 30 program committees and for ten journals, including participating in the DAMI Guest Editorial Board for the ECMLPKDD journal track.
SEBASTIÁN VENTURA (Senior Member, IEEE) received the B.Sc. and Ph.D. degrees in sciences from the University of Córdoba, Spain, in 1989 and 1996, respectively. He is currently a Full Professor with the Department of Computer Science and Numerical Analysis, University of Córdoba, where he also heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He has published three books and about 300 articles in journals and scientific conferences, and he has edited three books and several special issues in international journals. He has also been involved in 15 research projects (being the coordinator of seven of them) supported by the Spanish and Andalusian governments and the European Union. His main research interests include data science, computational intelligence, and their applications. He is a Senior Member of the IEEE Computer, the IEEE Computational Intelligence, and the IEEE Systems, Man, and Cybernetics Societies, as well as the Association of Computing Machinery (ACM).