A discussion on significance indices for contingency tables under small sample sizes

  • Natalia L. Oliveira ,

    Contributed equally to this work with: Natalia L. Oliveira, Carlos A. de B. Pereira, Marcio A. Diniz, Adriano Polpo

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, United States of America

  • Carlos A. de B. Pereira ,

    Contributed equally to this work with: Natalia L. Oliveira, Carlos A. de B. Pereira, Marcio A. Diniz, Adriano Polpo

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics, University of Sao Paulo, Sao Paulo, Brazil

  • Marcio A. Diniz ,

    Contributed equally to this work with: Natalia L. Oliveira, Carlos A. de B. Pereira, Marcio A. Diniz, Adriano Polpo

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics, Federal University of Sao Carlos, Sao Carlos, Brazil

  • Adriano Polpo

    Contributed equally to this work with: Natalia L. Oliveira, Carlos A. de B. Pereira, Marcio A. Diniz, Adriano Polpo

    Roles Conceptualization, Formal analysis, Methodology, Supervision, Writing – original draft, Writing – review & editing

    polpo@ufscar.br

    Affiliation Department of Statistics, Federal University of Sao Carlos, Sao Carlos, Brazil

Abstract

Hypothesis testing in contingency tables is usually based on asymptotic results, thereby restricting its proper use to large samples. To study these tests in small samples, we consider the likelihood ratio test (LRT) and define an accurate index for the celebrated hypotheses of homogeneity, independence, and Hardy-Weinberg equilibrium. The aim is to understand the use of the asymptotic results of the frequentist Likelihood Ratio Test and the Bayesian FBST (Full Bayesian Significance Test) under small-sample scenarios. The proposed exact LRT p-value is used as a benchmark to understand the other indices. We perform analysis in different scenarios, considering different sample sizes and different table dimensions. The conditional Fisher’s exact test for 2 × 2 tables and Barnard’s exact test are also discussed. The main message of this paper is that all indices have very similar behavior, except for the Fisher and Barnard tests, which have a discrete behavior. The most powerful test was the asymptotic p-value from the likelihood ratio test, suggesting that it is a good alternative for small sample sizes.

Introduction

We discuss indices for homogeneity, independence, and Hardy-Weinberg equilibrium hypotheses [1, 2] in contingency tables. We propose an exact evaluation of the Likelihood Ratio Test (LRT) as a benchmark significance index. Based on the work of [3], its idea is to evaluate the probability distribution of all possible tables on the sample space under the null hypothesis. Once the distribution for sampling contingency tables under the hypothesis is known, we are able to compute the exact distribution of the Likelihood Ratio Test (LRT) statistics. The main difficulty for this procedure is that it is computationally time-consuming, being only feasible for small sample sizes and/or for tables of small dimension.

The exact LRT p-value is presented as a way to do exact inference. The aim is to compare the behavior of the frequentist LRT asymptotic p-value [4], the exact LRT p-value, Fisher’s exact test p-value [5], the Chi-Square test asymptotic p-value [6, 7] and Barnard’s exact test p-value [8–11]. These frequentist indices are also compared to the e-value from the Full Bayesian Significance Test (FBST) [12, 13]. We consider both the asymptotic e-value and an approximation of the exact e-value based on a Markov Chain Monte Carlo procedure. The choice of adding a Bayesian index to the comparison study originates from the known asymptotic relationship between the LRT and the FBST [14]. Moreover, the FBST and its e-value can be viewed as a Bayesian counterpart of the p-value, and therefore it is interesting to understand this Bayesian method when compared to frequentist methods. It is important to point out that we are mainly interested in the values of the indices, not in the acceptance or rejection of the hypothesis; that is, our focus is on the significance test, which consists of the evaluation of the p-(e-)values. In an applied setting, researchers can make their own decisions about their applications based on the indices. We are not interested in comparing the values of the indices with some fixed significance value (generally 5%) to decide whether the hypothesis should be accepted or rejected. With this goal in mind, all significance indices considered here are in agreement with the ASA’s statement on significance indices [15].

From a historical perspective, hypothesis testing has been the most widely used statistical tool in many fields of science [16–18]. For categorical data, [19] discusses some exact procedures to perform inference and [20] presents methodological procedures for hypothesis testing for contingency tables. Tests for the homogeneity hypothesis in contingency tables have been compared by [21], who compared the conditional and unconditional tests, and by [22], who compares, under an asymptotic perspective, two tests for equality of two proportions considering Goodman’s Y² and χ² statistics. Regarding tests for the independence of two classifiers in contingency tables, [23] presents an algorithm for finding the exact permutation significance level for r × c contingency tables, and [24] studies a simple way to compare two correlated proportions. More recently, [25] presents the exact likelihood ratio test for equality of two normal populations, and [26] discusses exact unconditional tests for the homogeneity hypothesis in 2 × 2 tables.

One important aspect that differentiates the test procedures is how each one deals with the elimination of the nuisance parameter. Basu [27] lists several methods but focuses on marginalization and conditioning. He defines marginalization as every procedure that replaces the observed sample x by the observed value of a suitable statistic T(x) = t. Therefore, instead of working with the original experiment and data x, one should use the marginal experiment and the recorded value T(x), since the marginal statistical model would depend only on the parameter of interest. To justify these procedures, Basu adds that researchers usually resort to invariance or partial sufficiency arguments.

By conditioning, Basu defines methods of elimination that also consist of choosing a suitable statistic, but such that the conditional distribution of the observed sample, x, given the observed value of the statistic depends on the full parameter space only through the parameter of interest. Another commonly used approach that Basu describes is the one he calls maximization. In this case the nuisance parameter is eliminated from the risk function by some sort of maximization (or minimax) principle or directly from the likelihood, usually by maximizing it with respect to the nuisance parameters.

A final important strategy mentioned by Basu is the one he called Bayesian solution. In this case, one should derive the full posterior and integrate out the nuisance parameters, obtaining the posterior marginal distribution necessary to perform the required statistical inference. It is important to point out that the FBST does not follow this Bayesian strategy, since its evidence value is computed considering the full posterior. The proposed exact LRT p-value is based on the idea of integrating out the nuisance parameter, which is in some way related to Basu’s Bayesian solution [26]. The methods for elimination of nuisance parameters, maximization and Bayesian solution can be considered as unconditional methods.

The Likelihood Ratio Test (LRT) asymptotic p-value [28], the Chi-Square test asymptotic p-value [29], Fisher’s homogeneity exact test [29, 30], Barnard’s exact test [8], and the Full Bayesian Significance Test (FBST) asymptotic and exact e-value [12, 13] are presented in detail for the case of 2 × 2 contingency tables considering homogeneity hypothesis (Section 1.1). The theoretical results for homogeneity and independence hypotheses for tables of any dimension and Hardy-Weinberg equilibrium hypothesis are presented in sections 1.2, 1.3 and 1.4.

We study the relationship between indices in Section 2.1. [14] perform a similar study; however, they consider continuous random variables using the e-value and the LRT p-value and show that these indices share an asymptotic relationship. In our case, the asymptotic LRT p-value, the exact LRT p-value and the Chi-Square p-value have similar behavior, including in small sample size scenarios. Both Fisher’s exact test and Barnard’s exact test have a discrete behavior for their p-values, which is clearer for Barnard’s exact test p-value. All tests are unconditional, except for Fisher’s, which is a conditional test. It is important to draw attention to the fact that the present results are not based on a simulation study: we compute the indices for all possible tables in the sample space.

In addition to our focus on the study of significance indices, we also provide, for the frequentist indices, a study of the power functions to compare the tests considering the homogeneity hypothesis (2 × 2 tables) and the Hardy-Weinberg equilibrium hypothesis (Section 2.2). Fisher’s exact test was the least powerful, followed by Barnard’s exact test, the Chi-Square test, the exact LRT and the asymptotic LRT, the most powerful one. We did not evaluate the power function for the FBST: first, because it is not the aim of the Bayesian paradigm, and second, because doing so would require defining a decision rule for the FBST, which is not in the scope of this paper. We also note that, under the null hypothesis, considering the significance level 5%, all frequentist indices achieved 5% rejection as expected.

1 Methods

1.1 Homogeneity test for 2 × 2 contingency tables

Let X1 and X2 be two random variables, representing the rows (1 and 2) of Table 1, x11 and x21 being their observed values, and n1⋅ and n2⋅ fixed sample sizes. Consider X1 distributed as Binomial(n1⋅, θ11) and X2 as Binomial(n2⋅, θ21), describing the chance that a subject belongs to category (column) C1 in two distinct populations. Both populations are partitioned into two categories (columns) C1 and C2 and the objective is to test homogeneity between the two unknown population frequencies, H: θ11 = θ21 = θ. This hypothesis is geometrically represented by a diagonal line of the unit square.

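The likelihood function is specified by

$$L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21}) = \binom{n_{1\cdot}}{x_{11}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{n_{1\cdot} - x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{n_{2\cdot} - x_{21}} \tag{1}$$

where 0 ≤ θi1 ≤ 1, i = 1, 2. Under H, the likelihood function simplifies to

$$L(\theta \mid x_{11}, x_{21}, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}} \tag{2}$$

and the LRT statistic is

$$\lambda(x_{11}, x_{21}) = \frac{\sup_{\theta \in \Theta_H} L(\theta \mid x_{11}, x_{21}, H)}{\sup_{(\theta_{11}, \theta_{21}) \in \Theta} L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21})} \tag{3}$$

in which ΘH is the parametric set defined by the hypothesis.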

  1. Exact LRT p-value:

To define this p-value, we use the predictive distributions of X1 and X2 before any data were observed. The proposed p-value is an alternative way to calculate an exact p-value for the LRT. The goal is to find a distribution for the contingency table under H that is not a function of θ. We consider θ a nuisance parameter in the likelihood function in (2) and integrate it over θ in order to eliminate it, as suggested by [27]. The idea is to incorporate the concept of the Bayesian solution nuisance parameter elimination approach but in a frequentist setting, which means using the likelihood function instead of a posterior distribution. That is,

$$h(x_{11}, x_{21}) = \int_0^1 L(\theta \mid x_{11}, x_{21}, H) \, d\theta = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} B(x_{11} + x_{21} + 1, \; n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21} + 1) \tag{4}$$

where B(⋅, ⋅) is the Beta function.

To obtain the probability function Pr(X1 = x11, X2 = x21 ∣ H), one needs to find a normalization constant:

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid H) = \frac{h(x_{11}, x_{21})}{\sum_{i=0}^{n_{1\cdot}} \sum_{j=0}^{n_{2\cdot}} h(i, j)} \tag{5}$$

Note that to calculate (5), we evaluate h(⋅, ⋅) for all possible tables. In the case of a homogeneity hypothesis for 2 × 2 contingency tables, ∑i ∑j h(i, j) = 1. We present the table’s probability in terms of this sum to obtain a general formula for all hypotheses and table dimensions considered here, since in other scenarios this quantity does not sum up to 1 (for example, the sum of h for all possible 2 × 2 tables considering independence hypothesis with n = 2 is 2304). The exact p-value calculation follows directly from the test statistic distribution:

$$p = \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid H) \tag{6}$$

in which R is the set of all pairs (i, j) such that λ(i, j) ≤ λ(x11, x21), and λ(x11, x21) is the observed test statistic, as in (3).
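The exact LRT p-value for the 2 × 2 homogeneity case can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes h(x11, x21) to be the Binomial likelihood under H integrated over θ ∈ (0, 1), which gives a product of two binomial coefficients and a Beta function; the function names (`h`, `log_lambda`, `exact_lrt_pvalue`) and the tolerance `tol` used to handle floating-point ties are hypothetical choices, not taken from the paper.

```python
from math import comb, exp, lgamma, log


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_lambda(x1, x2, n1, n2):
    """Log of the LRT statistic: sup of L under H over sup of L on the full space.

    The binomial coefficients cancel in the ratio, so they are omitted.
    """
    def loglik(x, n, p):
        return xlogy(x, p) + xlogy(n - x, 1.0 - p)

    pooled = (x1 + x2) / (n1 + n2)  # MLE of the common theta under H
    under_h = loglik(x1, n1, pooled) + loglik(x2, n2, pooled)
    unrestricted = loglik(x1, n1, x1 / n1) + loglik(x2, n2, x2 / n2)
    return under_h - unrestricted


def h(x1, x2, n1, n2):
    """Likelihood under H integrated over theta in (0, 1) (assumed closed form)."""
    s = x1 + x2
    n = n1 + n2
    log_beta = lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2)
    return comb(n1, x1) * comb(n2, x2) * exp(log_beta)


def exact_lrt_pvalue(x1, x2, n1, n2, tol=1e-12):
    """Sum the table probabilities over all tables at least as extreme in lambda."""
    observed = log_lambda(x1, x2, n1, n2)
    tables = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)]
    total = sum(h(i, j, n1, n2) for i, j in tables)
    tail = sum(h(i, j, n1, n2) for i, j in tables
               if log_lambda(i, j, n1, n2) <= observed + tol)
    return tail / total
```

For instance, `exact_lrt_pvalue(5, 5, 10, 10)` enumerates all tables with marginals (10, 10); because the cost grows with the number of tables, this direct enumeration is only practical for small samples, as noted in the Introduction.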

  1. Barnard’s Exact Test:

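Consider that n1⋅ and n2⋅ are fixed in Table 1. The random variables X1 and X2 are independent and Binomial distributed with parameters θ11 and θ21. The probability of a sample {x11, x21} being drawn is

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta_{11}, \theta_{21}) = \binom{n_{1\cdot}}{x_{11}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{n_{1\cdot} - x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{n_{2\cdot} - x_{21}} \tag{7}$$

and, under hypothesis H,

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}} \tag{8}$$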

We define the critical region as R = {λ(X1, X2) ≤ λ(x11, x21)}; then Barnard’s exact p-value is obtained by

$$p = \sup_{\theta \in (0, 1)} \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid \theta, H) \tag{9}$$

That is, Barnard’s exact test considers the p-values for all possible points of the parameter space under H, and takes the maximum p-value. In this test, the chosen approach for nuisance parameter elimination among the ones presented by Basu is maximization.
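The maximization over the nuisance parameter can be sketched as follows. The code orders tables by the LRT statistic λ, as in the critical region R above, and approximates the supremum over θ by a grid search; the grid size `grid`, the tolerance `tol`, and the function names are assumptions made for illustration only.

```python
from math import comb, log


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_lambda(x1, x2, n1, n2):
    """Log of the LRT statistic for homogeneity in a 2 x 2 table."""
    def loglik(x, n, p):
        return xlogy(x, p) + xlogy(n - x, 1.0 - p)

    pooled = (x1 + x2) / (n1 + n2)
    under_h = loglik(x1, n1, pooled) + loglik(x2, n2, pooled)
    unrestricted = loglik(x1, n1, x1 / n1) + loglik(x2, n2, x2 / n2)
    return under_h - unrestricted


def barnard_pvalue(x1, x2, n1, n2, grid=1000, tol=1e-12):
    """Maximum over a theta grid of Pr(table in R | theta, H)."""
    observed = log_lambda(x1, x2, n1, n2)
    # Critical region: tables whose lambda is no larger than the observed one.
    region = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)
              if log_lambda(i, j, n1, n2) <= observed + tol]

    def prob_region(theta):
        return sum(comb(n1, i) * comb(n2, j)
                   * theta ** (i + j) * (1.0 - theta) ** (n1 + n2 - i - j)
                   for i, j in region)

    # Grid search is an approximation to the supremum over theta in (0, 1).
    return max(prob_region(k / grid) for k in range(grid + 1))
```

A finer grid or a numerical optimizer could replace the grid search; the grid is used here only to keep the sketch dependency-free.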

  1. Full Bayesian Significance Test:

The Bayesian approach considered is based on the FBST (Full Bayesian Significance Test) [12, 13].

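Definition 1 Let π(θ ∣ x) be the posterior density function of θ given the observed sample and $T(x) = \{\theta \in \Theta : \pi(\theta \mid x) > \sup_{\theta \in \Theta_H} \pi(\theta \mid x)\}$. The supporting evidence measure for the hypothesis θ ∈ ΘH is defined as Ev(ΘH, x) = 1 − Pr(θ ∈ T(x) ∣ x).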

Consider that, a priori, θ11 and θ21 are independent and both follow a Uniform(0, 1) distribution. Uniform priors are chosen to avoid a subjective prior and to allow a fair comparison with the frequentist indices. Recall that X1 and X2 given θ11 and θ21 are Binomial distributed. Hence, the posterior distributions for θ11 and θ21 are independent Beta(x11 + 1, n1⋅ − x11 + 1) and Beta(x21 + 1, n2⋅ − x21 + 1). Under the hypothesis H, the posterior distribution is (10) and by maximizing it in θ we obtain supθ∈(0,1) π(θ ∣ x11, x21, n1⋅, n2⋅, H), where B(⋅, ⋅) is the Beta function. Since x11, x21, n1⋅ and n2⋅ are integers, (11) (12) the hypothesis’ tangent set, T, is (13) and (14)

To calculate the approximate e-value, we use the following algorithm:

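  1. A random sample of size k is generated from the posterior distribution of (θ11, θ21), obtaining $(\theta_{11}^{(i)}, \theta_{21}^{(i)})$, i = 1, …, k.
  2. The e-value is calculated by $Ev \approx 1 - \frac{1}{k} \sum_{i=1}^{k} I\big((\theta_{11}^{(i)}, \theta_{21}^{(i)}) \in T\big)$, in which I(A) is the indicator function of set A.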
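The two steps above can be sketched in Python for the 2 × 2 homogeneity case. The sketch assumes the independent Beta posteriors stated earlier (uniform priors) and uses the fact that, on the diagonal θ11 = θ21 = θ, the joint posterior density is maximized at θ = (x11 + x21)/(n1⋅ + n2⋅). It draws independent posterior samples with the standard library rather than running a Markov chain; the function names, the default sample size `k`, and the seed are illustrative assumptions.

```python
import random
from math import lgamma, log


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_beta_pdf(t, a, b):
    """Log density of a Beta(a, b) distribution at t."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return xlogy(a - 1, t) + xlogy(b - 1, 1.0 - t) - log_norm


def fbst_evalue(x1, x2, n1, n2, k=20000, seed=0):
    """Monte Carlo approximation of the e-value for H: theta11 = theta21."""
    rng = random.Random(seed)
    a1, b1 = x1 + 1, n1 - x1 + 1  # posterior of theta11 under a Uniform(0, 1) prior
    a2, b2 = x2 + 1, n2 - x2 + 1  # posterior of theta21 under a Uniform(0, 1) prior

    def log_posterior(t1, t2):
        return log_beta_pdf(t1, a1, b1) + log_beta_pdf(t2, a2, b2)

    # Supremum of the posterior density over the hypothesis (the diagonal).
    theta_star = (x1 + x2) / (n1 + n2)
    log_sup_h = log_posterior(theta_star, theta_star)

    # Step 1: sample from the posterior. Step 2: count draws in the tangent set T.
    in_tangent = 0
    for _ in range(k):
        t1 = rng.betavariate(a1, b1)
        t2 = rng.betavariate(a2, b2)
        if log_posterior(t1, t2) > log_sup_h:
            in_tangent += 1
    return 1.0 - in_tangent / k
```

Working on the log scale avoids underflow of the posterior density when the sample sizes grow.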
  1. Other indices:

For the LRT, the statistic −2 ln[λ(X1, X2)] has asymptotically a chi-square distribution with 1 degree of freedom, which is dim(Θ) − dim(ΘH) [28]. The FBST uses the same statistic; however, its asymptotic distribution is a chi-square with 2 degrees of freedom [13], which is dim(Θ). For the chi-square test and Fisher’s exact test for homogeneity see [29].
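As a small illustration of the two asymptotic indices in the 2 × 2 homogeneity case, the upper-tail chi-square probabilities have closed forms for 1 and 2 degrees of freedom, so no statistics library is needed. The function name `asymptotic_indices` and its argument `stat` (the observed value of −2 ln λ) are hypothetical; the closed forms used here are valid only for these two degrees of freedom.

```python
from math import erfc, exp, sqrt


def asymptotic_indices(stat):
    """Return (asymptotic LRT p-value, asymptotic e-value) for stat = -2 ln(lambda).

    Pr(chi2 with 1 df >= s) = erfc(sqrt(s / 2))
    Pr(chi2 with 2 df >= s) = exp(-s / 2)
    """
    p_value = erfc(sqrt(stat / 2.0))
    e_value = exp(-stat / 2.0)
    return p_value, e_value


# Example: the 95th percentile of the chi-square distribution with 1 df.
p, e = asymptotic_indices(3.841458820694124)
```

Note that with 2 degrees of freedom the upper-tail probability exp(−s/2) evaluated at s = −2 ln λ equals λ itself, so in this particular case the asymptotic e-value coincides with the observed likelihood ratio.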

1.2 Homogeneity hypothesis for ℓ × c contingency tables

Let Xi, i = 1, …, ℓ, be random variables that are represented by the rows of Table 2 and n1⋅, n2⋅, …, nℓ⋅ are known constants.

Assuming that Xi, i = 1, …, ℓ, follows a Multinomial(ni⋅, θi1, …, θic) distribution, we are interested in testing if their distributions are homogeneous with respect to categories (columns) Cj, j = 1, …, c. That is, H: θ1k = θ2k = ⋯ = θℓk = θk, in which ∑k θk = 1, 0 ≤ θk ≤ 1, ∀k = 1, …, c.

Let x be all observed values presented in Table 2 and θ all the parameters. The likelihood function is (15) and under the hypothesis H, (16) The LRT λ statistic is (17)

  1. Exact LRT p-value:

To obtain the exact LRT p-value, we need the function h(x). In this scenario, (18) and the p-value’s calculation follows as in Subsection 1.1.

  1. FBST:

Assuming a Dirichlet(1, 1, …, 1) prior for {θi1, …, θic}, and since Xi follows a Multinomial(ni⋅, θi1, …, θic) distribution, then the posterior distribution is a Dirichlet(xi1 + 1, xi2 + 1, …, xic + 1), i = 1, …, ℓ.

In this setting, (19) and we can obtain the e-value from Definition 1.

  1. Other indices:

Both the asymptotic LRT p-value and the asymptotic e-value are calculated as Pr[−2 ln(λ(X)) ≥ −2 ln(λ(x))], but while the LRT considers that this statistic follows a chi-square distribution with (ℓ − 1)(c − 1) degrees of freedom, the FBST considers that it follows a chi-square distribution with ℓ(c − 1) degrees of freedom. The Chi-Square homogeneity test is also obtained.

1.3 Independence hypothesis for ℓ × c contingency tables

Consider that θij is the probability of observing a sample in the cell at row i and column j, θi⋅ is the probability of observing a sample in row i, θ⋅j is the probability of observing a sample in column j, 0 ≤ θij ≤ 1, 0 ≤ θi⋅ ≤ 1, 0 ≤ θ⋅j ≤ 1, i = 1, …, ℓ, j = 1, …, c, ∑i ∑j θij = 1, ∑i θi⋅ = 1, and ∑j θ⋅j = 1.

For the independence hypothesis, our interest is to test H: θij = θi⋅ × θ⋅j, ∀i, j. For the case of a 2 × 2 table, the independence hypothesis is geometrically represented as in Fig 1.

Fig 1. Geometric representation of the independence hypothesis (gray surface) for 2 × 2 tables.

The parametric space is the three-dimensional simplex (regular tetrahedron).

https://doi.org/10.1371/journal.pone.0199102.g001

Considering that n⋅⋅ is known, we assume that the outcomes of Table 2 follow a Multinomial(n⋅⋅, θ) distribution, θ = {θ11, …, θ1(c−1), …, θℓ1, …, θℓ(c−1)}, and , i = 1, …, ℓ. The likelihood function is (20) The likelihood function under H is (21) and the LRT λ statistic is (22)

  1. Exact LRT p-value:

As shown in Subsection 1.1, this p-value is obtained the same way but with a different h(x). In this case, (23)

  1. FBST:

Assuming a Dirichlet(1, …, 1) as prior distribution for θ and that the outcomes of Table 2 follow a Multinomial(n⋅⋅, θ11, …, θ1c, …, θℓ1, …, θℓc) distribution, then the posterior distribution is a Dirichlet(x11 + 1, …, x1c + 1, …, xℓ1 + 1, …, xℓc + 1). The e-value is obtained from Definition 1 and (24)

  1. Other indices:

We obtained the asymptotic LRT p-value and e-value, considering that −2 ln(λ(X)) follows a chi-square distribution with (ℓ − 1)(c − 1) and (ℓc − 1) degrees of freedom, respectively. We also obtained the p-value for the Chi-Square independence test.

1.4 Hardy-Weinberg equilibrium

An individual’s genotype is formed by a combination of alleles. If there are two possible alleles for one characteristic (say A and a), the possible genotypes are AA, Aa or aa. Considering a few premises true [31], the principle says that the allele probability in a population does not change from generation to generation. It is a fundamental principle for the Mendelian mating allelic model. If the probabilities of alleles are θ and 1 − θ, the expected genotype probabilities are (θ², 2θ(1 − θ), (1 − θ)²), 0 ≤ θ ≤ 1.

Considering the Hardy-Weinberg equilibrium, the aim is to verify whether a population follows these genotype proportions. Therefore, the equilibrium hypothesis is H: θ1 = θ², θ2 = 2θ(1 − θ), θ3 = (1 − θ)², in which θ1, θ2, θ3 are the proportions of AA, Aa, and aa, respectively. This hypothesis is geometrically represented in Fig 2.

Fig 2. Geometric representation of the Hardy-Weinberg equilibrium hypothesis (black line), and the parametric space (gray shading).

https://doi.org/10.1371/journal.pone.0199102.g002

Let X be a random vector. Table 3 represents the genotype frequencies for the population in question. Considering n known, we assume that X follows a Trinomial(n, θ1, θ2, θ3) distribution. The likelihood function for this model is (25) in which x = {x1, x2, x3}, θ1 + θ2 + θ3 = 1 and θi > 0, i = 1, 2, 3. Under the hypothesis H, (26)

The maximum likelihood estimator for θ under H is $\hat{\theta} = (2x_1 + x_2)/(2n)$ and the LRT λ statistic is (27)

  1. Exact LRT p-value:

Calculations follow as for the other indices and in this scenario (28)
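A minimal Python sketch of the exact LRT p-value for the Hardy-Weinberg case follows. It rests on two assumptions that are standard but should be checked against the paper's own formulas: the MLE of θ under H is (2x1 + x2)/(2n), and h(x) is the Trinomial likelihood under H integrated over θ ∈ (0, 1), which gives a multinomial coefficient times 2^x2 times a Beta function. All function names and the tie tolerance `tol` are illustrative.

```python
from math import exp, factorial, lgamma, log


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def hw_log_lambda(x1, x2, x3):
    """Log of the LRT statistic for Hardy-Weinberg equilibrium."""
    n = x1 + x2 + x3
    theta = (2 * x1 + x2) / (2 * n)  # MLE of the allele probability under H
    probs = (theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2)
    counts = (x1, x2, x3)
    under_h = sum(xlogy(x, p) for x, p in zip(counts, probs))
    unrestricted = sum(xlogy(x, x / n) for x in counts)
    return under_h - unrestricted


def hw_h(x1, x2, x3):
    """Trinomial likelihood under H integrated over theta (assumed closed form)."""
    n = x1 + x2 + x3
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    a, b = 2 * x1 + x2 + 1, x2 + 2 * x3 + 1
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return coef * 2 ** x2 * exp(log_beta)


def hw_exact_pvalue(x1, x2, x3, tol=1e-12):
    """Exact LRT p-value: enumerate every genotype table with the same n."""
    n = x1 + x2 + x3
    tables = [(i, j, n - i - j) for i in range(n + 1) for j in range(n - i + 1)]
    observed = hw_log_lambda(x1, x2, x3)
    total = sum(hw_h(*t) for t in tables)
    tail = sum(hw_h(*t) for t in tables if hw_log_lambda(*t) <= observed + tol)
    return tail / total
```

A table whose observed proportions match the equilibrium proportions exactly, such as (1, 2, 1), has λ = 1 and therefore an exact p-value of 1.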

  1. Barnard’s Exact Test:

The critical region is R = {λ(X) ≤ λ(x)}, and the Barnard’s exact p-value is obtained by (29)

  1. FBST:

Assuming a Dirichlet(1, 1, 1) prior for θ and that X follows a Trinomial(n, θ1, θ2, θ3) distribution, the posterior distribution is θ ∣ x ~ Dirichlet(x1 + 1, x2 + 1, x3 + 1). In this setting, (30)

  1. Other indices:

Both the asymptotic LRT p-value and the asymptotic e-value are obtained, the p-value considering that −2 ln(λ(X)) follows a chi-square distribution with 1 degree of freedom and the FBST considering that it follows a chi-square distribution with 2 degrees of freedom.

2 Results

2.1 Relations between the indices

In many practical situations, mainly in biological studies, asymptotic distributions are used to evaluate indices even for small samples. With that in mind, one of our interests is to understand how the use of asymptotic results for small sample size settings compares to the use of an exact index. Surprisingly, the values of exact and asymptotic indices do not diverge considerably.

As our objective is to compare the indices, we consider different scenarios for each hypothesis. For each scenario, we evaluate the significance indices of all test procedures presented here. Note that this is not a simulation study; for each sample size, we evaluate the indices for all possible contingency tables of a fixed dimension and size. For example, considering the homogeneity hypothesis in a 2 × 2 table with marginals (10, 10), there are 121 possible tables, and considering the independence hypothesis in a 2 × 3 table with marginal 15, there are 15504 possible tables. We evaluated the indices for all tables that fit into each specification. For the e-value computation, non-informative priors for the parameters are considered (that is, π(θ) ∝ 1). This way, no extra information is added besides the data, allowing fair comparisons between frequentist and Bayesian indices.
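The table counts quoted above can be checked with a short stars-and-bars computation. The sketch assumes that, for homogeneity, each row with fixed total n is split into c cells independently, and that, for independence, the grand total n is split over all cells of the table; the function names are illustrative.

```python
from math import comb, prod


def n_tables_homogeneity(row_totals, c):
    """Number of tables when every row total is fixed and each row has c cells."""
    return prod(comb(n + c - 1, c - 1) for n in row_totals)


def n_tables_independence(n, rows, cols):
    """Number of tables when only the grand total n is fixed."""
    cells = rows * cols
    return comb(n + cells - 1, cells - 1)


print(n_tables_homogeneity([10, 10], 2))   # 2 x 2 table, marginals (10, 10)
print(n_tables_independence(15, 2, 3))     # 2 x 3 table, total 15
```

The two calls reproduce the counts of 121 and 15504 tables, respectively.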

For each scenario, plots are drawn to illustrate differences between the indices’ values. The indices studied are the exact LRT p-value, asymptotic p-value for the LRT, asymptotic p-value for the chi-square test, e-value and asymptotic e-value. For the homogeneity hypothesis in 2 × 2 tables, Fisher and Barnard exact tests were also considered, and for the Hardy-Weinberg equilibrium hypothesis Barnard’s exact test was also obtained. We considered many different scenarios; however, since the aim is to understand the indices in small sample sizes, only the scenarios listed in Table 4 are presented here.

Figs 3, 4 and 5 illustrate the results of the discussion above. For all hypotheses, exact and asymptotic e-values are very similar for both large and small sample sizes. Looking into the frequentist indices, exact LRT p-values and asymptotic p-values, both LRT and Chi-Square, are also very similar to each other. The difference found between e-values when compared to the asymptotic LRT p-value happens as a result of the way these indices are formulated: while e-values consider the full dimension of the parameter space, p-values consider the complementary dimension of the set corresponding to hypothesis H. This is expected from the asymptotic relationship between the e-value and the p-value from the LRT [13, 14]. Since the exact LRT p-value is directly related to the asymptotic LRT p-value, we observe the same behavior of the differences between e-values and the asymptotic LRT p-value. Fisher’s exact test was only calculated for the homogeneity hypothesis in 2 × 2 tables, and Barnard’s exact test was calculated for the homogeneity hypothesis in 2 × 2 tables and for the Hardy-Weinberg equilibrium hypothesis. Both indices behave differently from the other indices considered. They have a discrete behavior, which is not surprising since Fisher’s exact test is a conditional test and Barnard’s exact test eliminates the nuisance parameter by maximization. Looking at the plots, their values do not form a continuous curve like the other indices’ values do, and their points are quite far from those of all the other indices.

Fig 3. Scatterplots for the significance indices of homogeneity hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g003

Fig 4. Scatterplots for the significance indices of independence hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g004

Fig 5. Scatterplots for the significance indices of Hardy-Weinberg hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g005

2.2 Power function

Power functions are a useful tool to compare hypothesis tests. For each θ ∈ Θ, the power function gives the probability of rejecting the hypothesis when θ is the true parameter value. Ideally, a test rarely rejects the hypothesis for θ ∈ ΘH, and its probability of rejection increases as θ moves further from the hypothesis.

The power functions presented are the ones that we are able to represent in ℝ3, which are the power functions for the homogeneity hypothesis in 2 × 2 contingency tables and for the Hardy-Weinberg equilibrium hypothesis.

We rejected the hypothesis when the index was at most 0.05. This choice reflects the decision rule widely used in most fields of science. In this case, Power(θ1, θ2) = P(reject H ∣ (θ1, θ2)), with H rejected if index ≤ 0.05.

We obtain the power function for all tests but the FBST. The FBST is a Bayesian significance test and in order to obtain a power function, one would need a decision rule. Since its construction differs from that of the p-values, we cannot use the same decision rule, and constructing a decision rule is not in the scope of this paper.

We used a Monte Carlo procedure to evaluate the power function of these tests. We consider a grid for the unit square with 100 × 100 points on the axes (θ1, θ2). For each point in the grid we generated 1000 tables. From these 1000 tables we evaluate the proportion of rejections, which is an approximation of the power function.
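The Monte Carlo procedure described above can be sketched as follows for the asymptotic LRT p-value in the 2 × 2 homogeneity case. The defaults (a 5 × 5 grid and 200 tables per point) are deliberately smaller than the 100 × 100 grid and 1000 tables used in the paper so that the example runs quickly; the function names, the seed, and the use of the chi-square closed form for 1 degree of freedom are illustrative assumptions.

```python
import random
from math import erfc, log, sqrt


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def lrt_asymptotic_pvalue(x1, x2, n1, n2):
    """Asymptotic LRT p-value for homogeneity (chi-square with 1 df)."""
    def loglik(x, n, p):
        return xlogy(x, p) + xlogy(n - x, 1.0 - p)

    pooled = (x1 + x2) / (n1 + n2)
    under_h = loglik(x1, n1, pooled) + loglik(x2, n2, pooled)
    unrestricted = loglik(x1, n1, x1 / n1) + loglik(x2, n2, x2 / n2)
    stat = max(-2.0 * (under_h - unrestricted), 0.0)  # guard against rounding
    return erfc(sqrt(stat / 2.0))


def power(theta1, theta2, n1, n2, reps, rng, alpha=0.05):
    """Proportion of simulated tables for which H is rejected (index <= alpha)."""
    rejections = 0
    for _ in range(reps):
        # Draw each Binomial count as a sum of Bernoulli trials.
        x1 = sum(rng.random() < theta1 for _ in range(n1))
        x2 = sum(rng.random() < theta2 for _ in range(n2))
        if lrt_asymptotic_pvalue(x1, x2, n1, n2) <= alpha:
            rejections += 1
    return rejections / reps


def power_grid(n1, n2, points=5, reps=200, seed=0):
    """Approximate the power function on a points x points grid of the unit square."""
    rng = random.Random(seed)
    axis = [i / (points - 1) for i in range(points)]
    return {(t1, t2): power(t1, t2, n1, n2, reps, rng)
            for t1 in axis for t2 in axis}
```

Swapping `lrt_asymptotic_pvalue` for another index (for example an exact p-value) gives the corresponding power surface under the same decision rule.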

We plot pairs of power functions to illustrate and compare their shapes. For the homogeneity hypothesis in a table with marginals (10, 10), Fig 6 shows that Fisher’s exact test is less powerful than Barnard’s exact test, Barnard’s exact test has similar power to the Chi-square test, and the Chi-square test is less powerful than the proposed exact LRT p-value, which in turn is less powerful than the asymptotic p-value for the LRT. To have a clear picture, we plot the power functions from different tests against each other. Fig 7a consists of the power functions for tables with marginals equal to (10, 10). It shows that the use of the asymptotic p-value for the LRT results in a more powerful test than the other indices. When comparing the proposed exact p-value to other indices, it is more powerful than the Chi-square test and Fisher’s exact test. Between the Chi-square and Fisher’s exact test, the Chi-square test is more powerful.

Fig 6. Power function for homogeneity hypothesis in 2 × 2 contingency tables with n1⋅ = n2⋅ = 10.

https://doi.org/10.1371/journal.pone.0199102.g006

Fig 7. Plots of power function values for the homogeneity test.

Each graph presents one index versus another, each dot representing a point in the considered parametric space (in this case, 100 × 100 = 10000 points), and if a dot is on top of the gray identity line, the power functions assume the same value for that point in the parametric space. The scenario is 2 × 2 with marginals n1⋅ = n2⋅ = 10 in (a) and n1⋅ = n2⋅ = 100 in (b).

https://doi.org/10.1371/journal.pone.0199102.g007

For tables with marginals equal to (100, 100), the graphs are more concentrated near the identity line (Fig 7b), showing that all indices are more alike. The ordering still exists, but it is less severe. It is interesting to point out that, as expected, the Chi-square test works better with larger samples.

For the Hardy-Weinberg hypothesis, the results are similar to the ones obtained for the homogeneity hypothesis and are shown in Figs 8 and 9. In this case, the most powerful test was the asymptotic p-value for the LRT, followed by the exact p-value for the LRT, which is more powerful than the Chi-square test, which in turn is similar to Barnard’s exact test. We call attention to the fact that, under hypothesis H, the power function achieves the value of 0.05, as expected, since this is the significance level chosen to build the power functions.

Fig 8. Power function for Hardy-Weinberg equilibrium hypothesis with n = 10.

https://doi.org/10.1371/journal.pone.0199102.g008

Fig 9. Plots of power function values for the Hardy-Weinberg equilibrium test.

Each graph presents one index versus another, each dot representing a point in the considered parametric space (in this case, 100 × 100 = 10000 points), and if a dot is on top of the gray identity line, the power functions assume the same value for that point in the parametric space. The scenarios are marginals n = 10 (a) and n = 100 (b).

https://doi.org/10.1371/journal.pone.0199102.g009

3 Conclusion

After evaluating the indices for tables in different scenarios, we noticed that all of them had very similar behaviors, independently of the perspective (Bayesian or frequentist), sample size and table dimension. The exceptions are the p-values for Fisher and Barnard’s exact tests for the homogeneity hypothesis in 2 × 2 tables, and Barnard’s exact test for Hardy-Weinberg equilibrium, which show a discretized behavior. Studying the power functions considering homogeneity hypothesis in 2 × 2 tables and Hardy-Weinberg equilibrium hypothesis, the LRT presented itself as a powerful test when considering small sample sizes, while Fisher’s exact test was the least powerful one for the homogeneity hypothesis and the Barnard’s exact test was the least powerful for the Hardy-Weinberg equilibrium hypothesis. By enlarging sample sizes, the power of these tests increases accordingly.

Finally, we finish this paper listing our main conclusions:

  • The LRT asymptotic p-value seems to be a good frequentist alternative for small sample sizes.
  • Since there is an asymptotic relationship between the p-value for the LRT and the e-value (FBST), we consider that both indices are equivalent in the explored settings.
  • In cases where there is available information besides the data that is to be taken into account, represented by informative priors, we consider the e-value a more appropriate index than a frequentist one, since the e-value offers a mechanism to incorporate that information.

Acknowledgments

This work was partially supported by the Brazilian agencies FAPESP grant 2012/16669-4, and CNPq grants 302767/2017-7 and 308776/2014-3. The agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  1. Emigh TH. A Comparison of Tests for Hardy-Weinberg Equilibrium. Biometrics. 1980;36(4):627–642. pmid:25856832
  2. Montoya-Delgado LE, Z IT, Pereira CAB, Whittle MR. An unconditional exact test for the Hardy-Weinberg Equilibrium Law: Sample space ordering using the Bayes Factor. Genetics. 2001;158(2):875–83. pmid:11404348
  3. Pereira CAB, Wechsler S. On the Concept of P-value. Brazilian Journal of Probability and Statistics. 1993;7:159–177.
  4. Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics. 1938;9:60–62.
  5. Fisher RA. Statistical Methods for Research Workers. 5th ed. Biological Monographs and Manuals. Edinburgh: Oliver and Boyd; 1934.
  6. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, Series 5. 1900;50(302):157–175.
  7. Fisher RA. On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85(1):87–94.
  8. Barnard GA. A New Test for 2x2 Tables. Nature. 1945;156:177.
  9. Fisher RA. A New Test for 2x2 Tables. Nature. 1945;156:388.
  10. Barnard GA. A New Test for 2x2 Tables. Nature. 1945;156:783–784.
  11. Barnard GA. Statistical Inference. Journal of the Royal Statistical Society, Series B (Methodological). 1949;11(2):115–149.
  12. Pereira CAB, Stern JM. Evidence and Credibility: a Full Bayesian Test of Precise Hypothesis. Entropy. 1999;1:104–115.
  13. Pereira CAB, Stern JM, Wechsler S. Can a Significance Test Be Genuinely Bayesian? Bayesian Analysis. 2008;3(1):19–100.
  14. Diniz MA, Pereira CAB, Polpo A, Stern J, Wechsler S. Relationship Between Bayesian and Frequentist Significance Indices. International Journal for Uncertainty Quantification. 2012;2(2):161–172.
  15. Wasserstein RL, Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129–133.
  16. Lawson AE, Clark B, Cramer-Meldrum E, Falconer KA, Sequist JM, Kwon Y. Development of Scientific Reasoning in College Biology: Do Two Levels of General Hypothesis-Testing Skills Exist? Journal of Research in Science Teaching. 2000;37(1):81–101.
  17. Herrmann E, Call J, Hernandez-Lloreda MV, Hare B, Tomasello M. Humans Have Evolved Specialized Skills of Social Cognition: The Cultural Intelligence Hypothesis. Science. 2007;317(5843):1360–1366. pmid:17823346
  18. 18. Montgomery DD, Runger GC. Applied Statistics and Probability for Engineers. John Wiley & Sons; 2010.
  19. 19. Agresti A. Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine. 2001;20:2709–2722. pmid:11523078
  20. 20. Agresti A. Categorical Data Analysis. 2nd ed. John Wiley & Sons; 2002.
  21. 21. Mehta CR, F HJ. Exact Power of Conditional and Unconditional Tests: Going beyond the 2x2 Contingency Table. The American Statistician. 1993;47(2):91–98.
  22. 22. Eberhardt KR, Fligner MA. A Comparison of Two Tests for Equality of Two Proportions. The American Statistician. 1977;31(4):151–155.
  23. 23. Pagano M, Halvorsen KT. An Algorithm for Finding the Exact Significance Levels of r × c Contingency Tables. Journal of the American Statistical Association. 1981;76(376):931–934.
  24. 24. Irony TZ, Pereira CAB, Tiwari RC. Analysis of Opinion Swing: Comparison of two correlated proportions. The American Statistician. 2000;54(1):57–62.
  25. 25. Zhang L, Xinzhong Xu, Chen G. The Exact Likelihood Ratio Test for Equality of Two Normal Populations. The American Statistician. 2012;66(3):180–184.
  26. 26. Shan G, Wilding GE. Powerful Exact Unconditional Tests for Agreement Between Two Raters with Binary Endpoints. PLoS ONE. 2014;9(5):e97386. pmid:24837970
  27. 27. Basu D. On the Elimination of Nuisance Parameters. Journal of the American Statistical Association. 1977;72(358):355–366.
  28. 28. Casella G, Berger R. Statistical Inference. 2nd ed. Duxbury Press; 2001.
  29. 29. Agresti A. An Introduction to Categorical Data Analysis. 2nd ed. John Wiley & Sons; 2007.
  30. 30. Irony TZ, Pereira CAB. Exact tests for equality of two proportions: Fisher vs. Bayes. Journal of Statistical Computation and Simulation. 1986;25:93–114.
  31. 31. Hartl DL, Clark AG. Principles of Population Genetics. 4th ed. Sinauer Associates, Inc. Publishers; 2007.