New upper bounds for tight and fast approximation of Fisher’s exact test in dependency rule mining
Introduction
Dependency rules are simple data mining patterns which can be used for a thorough analysis of statistical dependencies in categorical data sets. A dependency rule expresses a positive dependency between a set of attributes, X, and a single binary attribute, A. It can be read as "If the factors in X occur, then A is more likely to occur than otherwise". Similarly, a negative dependency between X and A can be expressed by rule X → ¬A. In dependency rule mining, the goal is to search for the best or all sufficiently good dependency rules with the selected goodness measure. Because the patterns are very simple, the search can be done quite efficiently, and dependency rule algorithms can handle even millions of rows of data containing thousands or tens of thousands of binary attributes without any suboptimal heuristics (see e.g. Hämäläinen, 2012).
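As a minimal illustration of the definition, the following sketch checks whether observed frequencies indicate a positive dependency X → A, i.e. whether P(X, A) > P(X)P(A). The function names and the frequency parametrization (n, fr_x, fr_a, fr_xa) are illustrative shorthand, not notation from the paper.

```python
# Sketch: does a candidate rule X -> A express a positive dependency?
# A positive dependency means P(X, A) > P(X) * P(A); the difference of the
# two sides is often called leverage. All names here are illustrative.

def leverage(n, fr_x, fr_a, fr_xa):
    """Difference P(X, A) - P(X)P(A); positive for a positive dependency."""
    return fr_xa / n - (fr_x / n) * (fr_a / n)

def is_positive_dependency(n, fr_x, fr_a, fr_xa):
    return leverage(n, fr_x, fr_a, fr_xa) > 0

# Toy data: n = 1000 rows, fr(X) = 200, fr(A) = 300, fr(X, A) = 90.
# Under independence we would expect 200 * 300 / 1000 = 60 co-occurrences,
# so observing 90 indicates a positive dependency X -> A.
print(is_positive_dependency(1000, 200, 300, 90))  # True
```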
A classical application of dependency rules (and related association rules (Agrawal et al., 1993), which, however, do not necessarily express statistical dependencies) is market basket analysis, which aims to find dependencies describing shopping habits. A rule may reveal, for example, that if a market basket contains coffee and cream, then it is also more likely to contain sugar than other baskets. This kind of information has several applications in marketing, from optimal product arrangements to individual recommendations in web stores. Medical science is another application field which contains huge binary or easily binarized data sets. In this context, dependency rules can be used to analyze which gene alleles, habits, and environmental and phenotypic factors predispose to or prevent diseases. This is an important application for dependency rule mining, because disease mechanisms are often very complex, involving dozens of factors. An extra difficulty is that statistical dependence (as well as statistical significance) is not a monotonic property. This means that an extended rule can express a strong and highly significant dependency even if all of its simpler subrules express statistical independence. Therefore, it is not possible to find the most important dependencies without efficient search algorithms.

Example 1 Let us consider a medical database which contains information on patients, their diseases (Alzheimer disease, stroke, heart attack, and coronary heart disease (CHD)), medical measurements (blood pressure and HDL and LDL cholesterol), and occurrence of certain gene alleles (ABCA1-R219K, ACE-D, ApoE-e2, -e3 and -e4). Among the discovered dependency rules, simple rules tend to be frequent but weaker, while more specific rules are rarer but stronger. For example, carriers of the ApoE-e2 allele tend to have low LDL cholesterol, in general, but the dependence is not strong.
However, if the carrier belongs to some special group, like exercising men, the dependence is much stronger. These rules also demonstrate the non-monotonic nature of statistical dependence. For example, allele ABCA1-R219K has no effect on Alzheimer disease, when considered alone, but among women, it surprisingly increases the risk. Similarly, ApoE-e4 is a risk factor for Alzheimer disease, but with modest smoking, it seems to protect from the same disease.
The applications of dependency rules are not restricted to categorical (easily binarized) data, although numerical variables require some kind of discretization. This can still be beneficial as a preliminary data analysis, because dependency rules can reveal even complex dependency structures without any assumptions on their form. This information on dependencies is anyway needed before one can select suitable methods for a more detailed analysis.
The main dilemma in dependency rule analysis (like most data mining) is how to perform the search efficiently without trading off the quality of results. As one can expect, the problem is computationally very demanding, because the number of possible patterns is exponential in k, the number of binary attributes. Even a simpler problem, searching for the optimal classification rules (dependency rules with a fixed consequence attribute), is known to be NP-hard with common statistical measures like the χ²-measure (Morishita and Sese, 2000), and no polynomial-time solutions are known. The search algorithms try to prune the search space as much as possible without explicit testing, but it is still necessary to test millions of potentially promising patterns. For this reason, all algorithm steps and implementation details have to be polished to be as fast as possible.
Concerning quality, the main concern is that the discovered patterns should be genuine dependencies which are likely to hold also in future data. In practice, this means statistical significance testing: the algorithm should find those dependencies which are least likely to have occurred by chance. In dependency rule mining and related pattern discovery, the most commonly used statistical significance measure is the χ²-measure (Morishita and Sese, 2000, Nijssen et al., 2009, Hämäläinen, 2011), but also other measures like Pearson's correlation coefficient (Antonie and Zaïane, 2004), the z-score (Hämäläinen, 2010), mutual information (practically, the log likelihood ratio) (Nijssen et al., 2009), and the odds ratio (Li et al., 2013) have been used. Fisher's exact test, which is the focus of this paper, has been used only rarely. In Kingfisher (Hämäläinen, 2012), Fisher's p-value, p_F, is the main search measure. In MagnumOpus (Webb and Zhang, 2005), the rules are searched for with other measures, but Fisher's exact test is used to test the improvement of a rule against its simplifications. In addition, there is a new graph mining algorithm (Sugiyama et al., 2015) which uses p_F to evaluate dependencies between subgraphs and class values. It seems that the interest in Fisher's exact test is rising, especially among bioinformaticians, but efficient tools are still lacking.
Asymptotic measures have been preferred in data mining, because they are fast to evaluate and therefore suitable for exhaustive search. It has been implicitly assumed that since the data sets are large, asymptotic measures can be safely used. However, the most significant (non-trivial) dependency rules may be relatively infrequent and the corresponding distributions too skewed to meet the requirements of asymptotic tests. In particular, the popular χ²-measure can produce very unreliable results, where the discovered dependency rules do not hold in future data. This was shown in extensive cross-validation studies, where the accuracy of the best rules discovered by the χ²-measure and Fisher's exact test (p_F) was compared (Hämäläinen, 2012). In these experiments, the χ²-measure often selected rules which expressed much weaker dependence or even independence in the test set, while the rules found by p_F always held well in test sets. This was not surprising, because Fisher's exact test is known for its robustness (Lydersen et al., 2009). A more surprising result was the inefficiency of the search with the χ²-measure, due to its weaker pruning ability. Thus, p_F turned out to be a superior search measure in terms of both accuracy and efficiency of pruning.
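For reference, the asymptotic χ²-based p-value discussed above can be sketched as follows for the 2×2 table of X and A. The parametrization (n, fr_x, fr_a, fr_xa) and the function names are illustrative; with one degree of freedom, the upper-tail probability of the χ² distribution can be computed with the complementary error function.

```python
from math import erfc, sqrt

# Sketch of the asymptotic chi-squared test for a 2x2 table. For one degree
# of freedom, P(Chi2_1 >= x) = erfc(sqrt(x / 2)). Illustrative names only.

def chi2_statistic(n, fr_x, fr_a, fr_xa):
    """Pearson's chi-squared statistic for the 2x2 table of X and A,
    via the classical shortcut n * leverage^2 / (P(X)P(A)P(!X)P(!A))."""
    p_x, p_a = fr_x / n, fr_a / n
    lev = fr_xa / n - p_x * p_a
    return n * lev ** 2 / (p_x * p_a * (1 - p_x) * (1 - p_a))

def chi2_p(n, fr_x, fr_a, fr_xa):
    """Asymptotic p-value; unreliable when expected counts are small."""
    return erfc(sqrt(chi2_statistic(n, fr_x, fr_a, fr_xa) / 2))
```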
The only problem with Fisher's exact test is that it is computationally laborious. The p-value p_F is the sum of the probabilities of the observed and all more extreme contingency tables. In the worst case, the sum may contain O(n) terms, which means that the worst-case time complexity is O(n), where n is the data size. For example, if the frequencies of X, A, and their combination XA are in the hundreds of thousands, one may have to evaluate some 200 000 terms. In addition, each term involves binomial coefficients, but they can be evaluated in constant time, if all factorials have been tabulated.
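To make the summation concrete, here is a minimal sketch of the exact one-sided p_F under the hypergeometric null model. The parametrization (n, fr_x, fr_a, fr_xa) and function names are illustrative shorthand, and lgamma stands in for the tabulated log-factorials mentioned above; each term costs O(1), but the number of terms still grows with the data.

```python
from math import lgamma, exp

def log_binom(n, k):
    # log C(n, k) via lgamma; equivalent to a tabulated log-factorial lookup
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_p(n, fr_x, fr_a, fr_xa):
    """P(fr(XA) >= observed value) under the hypergeometric null model:
    sum over the observed table and all more extreme tables."""
    denom = log_binom(n, fr_x)
    p = 0.0
    for i in range(fr_xa, min(fr_x, fr_a) + 1):
        p += exp(log_binom(fr_a, i) + log_binom(n - fr_a, fr_x - i) - denom)
    return p

print(fisher_p(1000, 200, 300, 90))  # one-sided p-value for a toy table
```

Note how the loop length, min(fr(X), fr(A)) − fr(XA) + 1, is what makes the exact computation O(n) in the worst case.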
In this paper, we introduce a family of tight approximations for the exact p_F, which can still be calculated in constant time. The approximations are actually upper bounds for p_F, but when the dependency is sufficiently strong, they give tight approximations to the exact values. In practice, they give identical results with the exact p-values when used for rule ranking.
The main idea of the new approximations is to calculate only the first terms of p_F exactly and estimate an upper bound for the remaining terms. The simplest upper bound evaluates only the first term exactly. It is also intuitively appealing as a goodness measure, because it is reminiscent of existing dependency measures like the odds ratio. When the dependencies are sufficiently strong (a typical data mining application), the results are also highly accurate. However, if the data set contains only weak and relatively insignificant dependencies, the simplest upper bound may produce too inaccurate results. In this case, one can use tighter upper bounds, which can be adjusted to be arbitrarily accurate. Nevertheless, there is always a trade-off between speed and accuracy: the more accurate p-values are wanted, the more terms have to be calculated exactly. Fortunately, the largest terms of p_F are always the first ones, and in practice it is sufficient to calculate only a small number (say, 10) of them exactly.
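The general idea can be sketched as follows. This is not the paper's exact bound but an illustration of the same scheme: sum the first l terms exactly, then bound the tail by a geometric series, which is valid because the hypergeometric pmf is log-concave, so successive term ratios only shrink. All names and the (n, fr_x, fr_a, fr_xa) parametrization are illustrative.

```python
from math import lgamma, exp

def log_binom(n, k):
    # log C(n, k) via lgamma; stands in for tabulated log-factorials
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_p_upper_bound(n, fr_x, fr_a, fr_xa, l=10):
    """Sum the first l terms of p_F exactly, then bound the tail."""
    i_max = min(fr_x, fr_a)
    # first (largest) term of the p-value sum
    t = exp(log_binom(fr_a, fr_xa) + log_binom(n - fr_a, fr_x - fr_xa)
            - log_binom(n, fr_x))
    p, i = 0.0, fr_xa
    for _ in range(l):
        p += t
        if i == i_max:
            return p  # all terms were summed: the result is exact
        # ratio of successive hypergeometric terms t_{i+1} / t_i
        r = ((fr_a - i) * (fr_x - i)) / ((i + 1) * (n - fr_a - fr_x + i + 1))
        t *= r
        i += 1
    # Tail bound: the ratios shrink, so the remaining terms are dominated by
    # the geometric series t * (1 + r + r^2 + ...) = t / (1 - r) when r < 1.
    r = ((fr_a - i) * (fr_x - i)) / ((i + 1) * (n - fr_a - fr_x + i + 1))
    if r < 1:
        return min(1.0, p + t / (1 - r))
    return 1.0  # weak fallback; occurs only for very weak dependencies
```

For a strong dependency the first terms dominate, so already l = 10 exact terms make the bound practically indistinguishable from the exact p_F, while the cost stays constant in n.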
As far as we know, no other computationally fast but accurate approximations of p_F exist. The likely reason is that statisticians have totally different speed requirements than data miners. If a measure is evaluated just a couple of times, then a second or two is not too long. However, if one has to evaluate the measure a million times, then a millisecond is already too much. In addition, the users of data mining software are accustomed to short execution times and reluctant to switch to slower but more accurate methods.
The related research has mostly concentrated on developing efficient network algorithms for Fisher's exact test in the 2×c (Mehta and Patel, 1980, Requena and Ciudad, 2006) and r×c (Mehta and Patel, 1983) cases. However, these algorithms do not offer any solution to the summation problem. There is also an interesting method (Wu, 1993) for improving the accuracy of p_F calculation by simplifying the binomial coefficients, but it is computationally far too demanding for data mining applications. In practice, the old technique of tabulating factorials (Verbeek and Kroonenberg, 1985) is still the fastest way to evaluate p_F.
The rest of the paper is organized as follows. In Section 2, the basic concepts and notations are defined. In Section 3, we introduce the new upper bounds and give theoretical error bounds for the resulting approximations. In Section 4, we evaluate the accuracy and computational efficiency of the new upper bounds experimentally. The final conclusions are drawn in Section 5.
Preliminaries
Dependency rules are data mining patterns which describe statistical dependencies between sets of attribute–value combinations in data. Given R, the set of all binary attributes in the data, a dependency rule can be expressed as X → A = a, where X ⊊ R is a set of attributes, A ∈ R∖X is a single attribute, a ∈ {0, 1} is a single truth value, and X is a short-hand notation for the truth value assignment of X's attributes. So, in terms of propositional logic, the rule antecedent (condition part) is a
Upper bounds
In this section, we will first introduce two simple upper bounds and analyze their error bounds. After that we generalize the idea and introduce a family of adjustable upper bounds.
Experimental evaluation
The goal of the experimental evaluation was to evaluate both the accuracy and the computational efficiency of the new upper bounds. For this purpose, several experiments were done simulating the requirements of typical pattern mining algorithms. In addition, we evaluated the performance of mining dependency rules from real-world data sets using the simplest upper bound instead of the exact p_F.
All experiments were run on a 2.7 GHz Intel i7-2620M processor having 8 GB RAM and using the Linux operating system.
Conclusions
We have introduced a family of upper bounds which can be used to estimate the p-value of Fisher's exact test fast but accurately. Unlike the χ²-based approximation, these upper bounds are not sensitive to the data size, distribution, or small expected counts.
All new approximations can be evaluated in constant time (asymptotic complexity O(1)), while the exact calculation of p_F depends on the data size (complexity O(n)). In practice, the execution time of the new approximations depends on
Acknowledgments
This research was partially supported by the Academy of Finland, grant 258589. A part of the work was done when the author worked in the University of Eastern Finland.
References (23)
- et al. A major improvement to the network algorithm for Fisher's exact test in 2×c contingency tables. Comput. Statist. Data Anal. (2006)
- et al. A survey of algorithms for exact distributions of test statistics in r×c contingency tables with fixed margins. Comput. Statist. Data Anal. (1985)
- et al. Mining association rules between sets of items in large databases
- A survey of exact inference for contingency tables. Statist. Sci. (1992)
- et al. Comparing the asymptotic power of exact tests in 2×2 tables. Comput. Statist. Data Anal. (2004)
- et al. Mining positive and negative association rules: an approach for confined rules
- erfc (software). ECE44 Laboratory, University of Illinois Urbana-Champaign. Retrieved 1.6.2014.
- FIMI Repository (data collection). Administered by B. Goethals. Retrieved April 2010.
- Gnu profiler (gprof, software). Copyright 2009 Free Software Foundation, Inc.
- StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl. Inf. Syst. (2010)
- Efficient search methods for statistical dependency rules. Fund. Inform.