New upper bounds for tight and fast approximation of Fisher’s exact test in dependency rule mining

https://doi.org/10.1016/j.csda.2015.08.002

Highlights

  • A family of new tight upper bounds to approximate Fisher’s p is introduced.

  • The new approximations are well suited to data mining, because they are much faster to evaluate than the exact p-value.

  • Theoretical analysis and empirical evaluation show that the new approximations are very accurate for all practical purposes.

  • The approximations are not sensitive to the data size, distribution or small expected counts.

Abstract

In dependency rule mining, the goal is to discover the most significant statistical dependencies among all possible collapsed 2×2 contingency tables. Fisher’s exact test is a robust method to estimate the significance, and it enables efficient pruning of the search space. The problem is that evaluating the required p-value can be very laborious, with a worst-case time complexity of O(n), where n is the data size. The traditional solution is to approximate the significance with the χ2-measure, which can be estimated in constant time. However, the χ2-measure can produce unreliable results (discovering spurious dependencies while missing the most significant ones). Furthermore, it does not support efficient pruning of the search space. As a solution, a family of tight upper bounds for Fisher’s p is introduced. The new upper bounds are fast to calculate and approximate Fisher’s p-value accurately. In addition, unlike the χ2-based approximation, the new approximations are not sensitive to the data size, distribution, or small expected counts. In practice, the execution time depends on the desired accuracy level. According to the experimental evaluation, the simplest upper bounds are already sufficiently accurate for dependency rule mining, and they can be estimated in 0.004–0.1% of the time needed for exact calculation. For other purposes (testing very weak dependencies), more accurate approximations may be needed, but even these can be calculated in less than 1% of the exact calculation time.

Introduction

Dependency rules are simple data mining patterns which can be used for a thorough analysis of statistical dependencies in categorical data sets. A dependency rule X → A expresses a positive dependency between a set of attributes, X, and a single binary attribute, A. It can be read as “If factors X occur, then A is more likely to occur than otherwise”. Similarly, a negative dependency between X and A can be expressed by rule X → ¬A. In dependency rule mining, the goal is to search for the best or all sufficiently good dependency rules under the selected goodness measure. Because the patterns are very simple, the search can be done quite efficiently, and dependency rule algorithms can handle even millions of rows of data containing thousands or tens of thousands of binary attributes without any suboptimal heuristics (see e.g. Hämäläinen, 2012).
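To make the definition concrete, the following minimal sketch (Python, with purely hypothetical frequencies) shows the standard check behind a rule X → A: the rule expresses a positive dependency when the observed joint frequency fr(XA) exceeds the count expected under independence, fr(X)fr(A)/n.

    # Hypothetical frequencies for a rule X -> A in a data set of n rows.
    n, fx, fa, fxa = 1000, 200, 300, 90

    # Under independence we would expect fr(X) * fr(A) / n co-occurrences.
    expected = fx * fa / n                         # 60.0
    print(fxa > expected)                          # True: X -> A is a positive dependency

    # Equivalently, the lift P(XA) / (P(X) * P(A)) exceeds 1.
    print((fxa / n) / ((fx / n) * (fa / n)))       # 1.5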

A classical application of dependency rules (and of related association rules (Agrawal et al., 1993), which, however, do not necessarily express statistical dependencies) is market basket analysis, which aims to find dependencies describing shopping habits. A rule may reveal, for example, that if a market basket contains coffee and cream, then it is also more likely to contain sugar than other baskets. This kind of information has several applications in marketing, from optimal product arrangements to individual recommendations in web stores. Medical science is another application field which contains huge binary or easily binarized data sets. In this context, dependency rules can be used to analyze which gene alleles, habits, and environmental and phenotypic factors predispose to or prevent diseases. This is an important application for dependency rule mining, because disease mechanisms are often very complex, involving dozens of factors. An extra difficulty is that statistical dependence (as well as statistical significance) is not a monotonic property. This means that rule A,B → C can express a strong and highly significant dependency, even if A and C, as well as B and C, are statistically independent. Therefore, it is not possible to find the most important dependencies without efficient search algorithms.
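The classic exclusive-or construction makes this non-monotonicity concrete. The following sketch (synthetic Python data, not from the paper) builds a data set where C = A XOR B, so that A and C, as well as B and C, are pairwise independent, while the rule A,B → C is an exact functional dependency.

    import itertools

    # Synthetic data: 250 copies of each (A, B) combination, with C = A XOR B.
    rows = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)
            for _ in range(250)]
    n = len(rows)

    def prob(pred):
        return sum(1 for r in rows if pred(r)) / n

    # Pairwise independence: P(A=1, C=1) = P(A=1) * P(C=1) = 0.25.
    print(prob(lambda r: r[0] == 1 and r[2] == 1),
          prob(lambda r: r[0] == 1) * prob(lambda r: r[2] == 1))

    # Joint dependence: P(C=1 | A=1, B=0) = 1, a maximally strong rule.
    print(prob(lambda r: r == (1, 0, 1)) /
          prob(lambda r: r[0] == 1 and r[1] == 0))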

Example 1

Let us consider a medical database which contains information on patients, their diseases (Alzheimer disease, stroke, heart attack, and coronary artery disease (CAD)), medical measurements (blood pressure and HDL and LDL cholesterol), and the occurrence of certain gene alleles (ABCA1-R219K, ACE-D, ApoE-e2, -e3, and -e4). Here are some examples of discovered dependency rules:

Here, simple rules tend to be frequent but weaker, while more specific rules are rarer but stronger. For example, carriers of the ApoE-e2 allele tend to have low LDL cholesterol, in general, but the dependence is not strong. However, if the carrier belongs to some special group, like exercising men, the dependence is much stronger. These rules also demonstrate the non-monotonic nature of statistical dependence. For example, allele ABCA1-R219K has no effect on Alzheimer disease, when considered alone, but among women, it surprisingly increases the risk. Similarly, ApoE-e4 is a risk factor for Alzheimer disease, but with modest smoking, it seems to protect from the same disease.

The applications of dependency rules are not restricted to categorical (easily binarized) data, although numerical variables require some kind of discretization. This can still be beneficial as a preliminary data analysis, because dependency rules can reveal even complex dependency structures without any assumptions on their form. This information on dependencies is needed anyway before one can select suitable methods for a more detailed analysis.

The main dilemma in dependency rule analysis (as in most data mining) is how to perform the search efficiently without trading off the quality of results. As one can expect, the problem is computationally very demanding, because the number of possible patterns is exponential, O(k·2^k), where k is the number of binary attributes. Even a simpler problem, searching for optimal classification rules (dependency rules with a fixed consequence attribute), is known to be NP-hard with common statistical measures like the χ2-measure (Morishita and Sese, 2000), and no polynomial-time solutions are known. Search algorithms try to prune the search space as much as possible without explicit testing, but it is still necessary to test millions of potentially promising patterns. For this reason, all algorithm steps and implementation details have to be polished to be as fast as possible.

Concerning quality, the main concern is that the discovered patterns should be genuine dependencies which are likely to hold also in future data. In practice, this means statistical significance testing: the algorithm should find those dependencies which are least likely to have occurred by chance. In dependency rule mining and related pattern discovery, the most commonly used statistical significance measure is the χ2-measure (Morishita and Sese, 2000, Nijssen et al., 2009, Hämäläinen, 2011), but other measures like Pearson’s correlation coefficient (Antonie and Zaïane, 2004), the z-score (Hämäläinen, 2010), mutual information (in practice, the log-likelihood ratio) (Nijssen et al., 2009), and the odds ratio (Li et al., 2013) have also been used. Fisher’s exact test, which is the focus of this paper, has been used only rarely. In Kingfisher (Hämäläinen, 2012), Fisher’s p-value, pF, is the main search measure. In MagnumOpus (Webb and Zhang, 2005), the rules are searched for with other measures, but Fisher’s exact test is used to test the improvement of a rule against its simplifications. In addition, there is a recent graph mining algorithm (Sugiyama et al., 2015) which uses pF to evaluate dependencies between subgraphs and class values. It seems that interest in Fisher’s exact test is rising, especially among bioinformaticians, but efficient tools are still lacking.

Asymptotic measures have been preferred in data mining, because they are fast to evaluate and therefore suitable for exhaustive search. It has been implicitly assumed that since the data sets are large, asymptotic measures can be safely used. However, the most significant (non-trivial) dependency rules may be relatively infrequent and the corresponding distributions too skewed to meet the requirements of asymptotic tests. In particular, the popular χ2-measure can produce very unreliable results, where the discovered dependency rules do not hold in future data. This was shown in extensive cross-validation studies, where the accuracy of the best rules discovered by the χ2-measure and by Fisher’s exact test (pF) was compared (Hämäläinen, 2012). In these experiments, the χ2-measure often selected rules which expressed a much weaker dependence or even independence in the test set, while the rules found by pF always held well in the test sets. This was not surprising, because Fisher’s exact test is known for its robustness (Lydersen et al., 2009). A more surprising result was the inefficiency of the search with the χ2-measure, due to its weaker pruning ability. Thus, pF turned out to be a superior search measure in terms of both accuracy and efficiency of pruning.

The only problem with Fisher’s exact test is that it is computationally laborious. The pF-value is the sum of the probabilities of the observed contingency table and of all more extreme tables. In the worst case, the sum may contain n/4 terms, which means that the worst-case time complexity is O(n), where n is the data size. For example, if X and A have frequencies fr(X) = fr(A) = 500 000 and the frequency of combination XA is fr(XA) = 300 001, then one has to evaluate 200 000 terms. In addition, each term involves binomial coefficients, but these can be evaluated in constant time if all factorials have been tabulated.
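To illustrate, here is a minimal Python sketch of this computation for a 2×2 table with margins fr(X) and fr(A) in a data set of size n (an illustration of the standard tabulation technique, not code from the paper). The log-factorials are tabulated once, so each hypergeometric term costs O(1), but the summation itself may still require O(n) terms.

    import math

    def fisher_p(n, fx, fa, fxa):
        # Tabulated log-factorials: lf[i] = ln(i!). In a mining algorithm this
        # table would be built once and shared across all significance tests.
        lf = [0.0] * (n + 1)
        for i in range(2, n + 1):
            lf[i] = lf[i - 1] + math.log(i)

        def log_term(i):          # ln P(fr(XA) = i) under the null hypothesis
            return (lf[fx] + lf[n - fx] + lf[fa] + lf[n - fa] - lf[n]
                    - lf[i] - lf[fx - i] - lf[fa - i] - lf[n - fx - fa + i])

        # Sum the observed table and all more extreme ones (larger fr(XA)).
        return sum(math.exp(log_term(i)) for i in range(fxa, min(fx, fa) + 1))

    # The example from the text: 200 000 terms must be summed. (The result
    # underflows to 0.0, since such a strong dependency has an astronomically
    # small p-value.)
    print(fisher_p(1_000_000, 500_000, 500_000, 300_001))
    # A weaker dependency with a p-value in a printable range (roughly 0.007):
    print(fisher_p(1000, 500, 500, 270))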

In this paper, we introduce a family of tight approximations for the exact pF-value, which can still be calculated in a constant time. The approximations are actually upper bounds for pF, but when the dependency is sufficiently strong, they give tight approximations to the exact values. In practice, they give identical results with the exact pF-values, when used for rule ranking.

The main idea of the new approximations is to calculate only the first terms of pF exactly and to estimate an upper bound for the remaining terms. The simplest upper bound evaluates only the first term exactly. It is also intuitively appealing as a goodness measure, because it is reminiscent of existing dependency measures like the odds ratio. When the dependencies are sufficiently strong (a typical data mining application), the results are also highly accurate. However, if the data set contains only weak and relatively insignificant dependencies, the simplest upper bound may produce too inaccurate results. In this case, one can use tighter upper bounds, which can be made arbitrarily accurate. Nevertheless, there is always a trade-off between speed and accuracy: the more accurate the desired p-value, the more terms have to be calculated exactly. Fortunately, the largest terms of pF are always the first ones, and in practice it is sufficient to calculate only a small number (say, 10) of them exactly.
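The following Python sketch illustrates the scheme in its general form (a reconstruction of the idea, not the paper’s exact bounds): the first t terms are evaluated exactly, and because for a positive dependency the ratio of successive hypergeometric terms is below one and decreasing, the remaining tail is bounded by a geometric series.

    import math

    def fisher_p_upper_bound(n, fx, fa, fxa, t=10):
        # Tabulated log-factorials; built once and reusable across all tests.
        lf = [0.0] * (n + 1)
        for i in range(2, n + 1):
            lf[i] = lf[i - 1] + math.log(i)

        def ratio(i):             # P(i+1) / P(i) for the hypergeometric terms
            return (fa - i) * (fx - i) / ((i + 1) * (n - fa - fx + i + 1))

        # First term P(fr(XA) = fxa), computed exactly from the log-factorials.
        term = math.exp(lf[fx] + lf[n - fx] + lf[fa] + lf[n - fa] - lf[n]
                        - lf[fxa] - lf[fx - fxa] - lf[fa - fxa]
                        - lf[n - fx - fa + fxa])
        total, i, last = term, fxa, min(fx, fa)

        # Exact part: the first t terms, generated with the ratio recurrence.
        while i < last and i - fxa < t - 1:
            term *= ratio(i)
            total += term
            i += 1

        # Tail: for a positive dependency (fxa above the independence
        # expectation) ratio(i) < 1 and decreases in i, so the tail is at most
        # the geometric series P(i) * (r + r^2 + ...) = P(i) * r / (1 - r).
        if i < last:
            r = ratio(i)
            total += term * r / (1 - r)
        return total

With the factorial table shared across tests, each evaluation costs only O(t) arithmetic operations, which is the constant-time behavior that makes such bounds usable inside an exhaustive rule search.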

As far as we know, no other computationally fast yet accurate approximations of pF exist. The likely reason is that statisticians have entirely different speed requirements than data miners. If a measure is evaluated just a couple of times, then a second or two is not too long. However, if one has to evaluate the measure a million times, then a millisecond is already too much. In addition, the users of data mining software have become used to short execution times and are reluctant to switch to slower but more accurate methods.

Related research has mostly concentrated on developing efficient network algorithms for Fisher’s exact test in the 2×c (Mehta and Patel, 1980, Requena and Ciudad, 2006) and r×c (Mehta and Patel, 1983) cases. However, these algorithms do not offer any solution to the summation problem. There is also an interesting method (Wu, 1993) for improving the accuracy of calculation by simplifying the binomial coefficients, but it is computationally far too demanding for data mining applications. In practice, the old technique of tabulating factorials (Verbeek and Kroonenberg, 1985) is still the fastest way to evaluate pF.

The rest of the paper is organized as follows. In Section 2, the basic concepts and notations are defined. In Section 3, we introduce the new upper bounds and give theoretical error bounds for the resulting approximations. In Section 4, we evaluate the accuracy and computational efficiency of the new upper bounds experimentally. The final conclusions are drawn in Section 5.

Section snippets

Preliminaries

Dependency rules are data mining patterns which describe statistical dependencies between sets of attribute–value combinations in data. Given the set of all binary attributes in data, R, a dependency rule can be expressed as (X=x) → (A=v), where X ⊆ R is a set of attributes, A ∈ R∖X is a single attribute, v ∈ {0,1} is a single truth value, and X=x is a short-hand notation for the truth value assignment of X’s attributes. So, in terms of propositional logic, the rule antecedent (condition part) is a

Upper bounds

In this section, we first introduce two simple upper bounds and analyze their error bounds. After that, we generalize the idea and introduce a family of adjustable upper bounds.

Experimental evaluation

The goal of the experimental evaluation was to assess both the accuracy and the computational efficiency of the new upper bounds. For this purpose, several experiments were performed, simulating the requirements of typical pattern mining algorithms. In addition, we evaluated the performance of mining dependency rules from real-world data sets using the simplest upper bound, ub2, instead of the exact pF.

All experiments were run on a 2.7 GHz Intel i7-2620M processor with 8 GB RAM, using a Linux operating

Conclusions

We have introduced a family of upper bounds which can be used to estimate the pF-value of Fisher’s exact test quickly but accurately. Unlike the χ2-based approximation, these upper bounds are not sensitive to the data size, distribution, or small expected counts.

All new approximations can be evaluated in constant time (asymptotic complexity O(1)), while the exact calculation of pF depends on the data size (complexity O(n)). In practice, the execution time of the new approximations depends on

Acknowledgments

This research was partially supported by the Academy of Finland, grant 258589. A part of the work was done when the author worked at the University of Eastern Finland.

References (23)

  • F. Requena et al. A major improvement to the network algorithm for Fisher’s exact test in 2×c contingency tables. Comput. Statist. Data Anal. (2006)

  • A. Verbeek et al. A survey of algorithms for exact distributions of test statistics in r×c contingency tables with fixed margins. Comput. Statist. Data Anal. (1985)

  • R. Agrawal et al. Mining association rules between sets of items in large databases

  • A. Agresti. A survey of exact inference for contingency tables. Statist. Sci. (1992)

  • A.M. Andrés et al. Comparing the asymptotic power of exact tests in 2×2 tables. Comput. Statist. Data Anal. (2004)

  • M.-L. Antonie et al. Mining positive and negative association rules: an approach for confined rules

  • erfc (software). ECE44 Laboratory, University of Illinois Urbana-Champaign. Retrieved 1.6.2014.

  • FIMI Repository (data collection). Administered by B. Goethals. Retrieved April 2010.

  • Gnu profiler (gprof, software). Copyright 2009 Free Software Foundation, Inc.

  • W. Hämäläinen. Statapriori: an efficient algorithm for searching statistically significant association rules. Knowl. Inf. Syst. (2010)

  • W. Hämäläinen. Efficient search methods for statistical dependency rules. Fund. Inform. (2011)