A discussion on significance indices for contingency tables under small sample sizes

Natalia L. Oliveira; Carlos A. de B. Pereira; Marcio A. Diniz; Adriano Polpo

doi:10.1371/journal.pone.0199102

Abstract

Hypothesis testing in contingency tables is usually based on asymptotic results, thereby restricting its proper use to large samples. To study these tests in small samples, we consider the likelihood ratio test (LRT) and define an accurate index for the celebrated hypotheses of homogeneity, independence, and Hardy-Weinberg equilibrium. The aim is to understand the use of the asymptotic results of the frequentist Likelihood Ratio Test and the Bayesian FBST (Full Bayesian Significance Test) under small-sample scenarios. The proposed exact LRT p-value is used as a benchmark to understand the other indices. We perform the analysis in different scenarios, considering different sample sizes and different table dimensions. The conditional Fisher’s exact test for 2 × 2 tables and Barnard’s exact test are also discussed. The main message of this paper is that all indices have very similar behavior, except for the Fisher and Barnard tests, which have a discrete behavior. The most powerful test was the one based on the asymptotic p-value from the likelihood ratio test, suggesting that it is a good alternative for small sample sizes.

Citation: Oliveira NL, Pereira CAdB, Diniz MA, Polpo A (2018) A discussion on significance indices for contingency tables under small sample sizes. PLoS ONE 13(8): e0199102. https://doi.org/10.1371/journal.pone.0199102

Editor: Mauro Gasparini, Politecnico di Torino, ITALY

Received: November 15, 2017; Accepted: May 31, 2018; Published: August 2, 2018

Copyright: © 2018 Oliveira et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The computer code that was used to generate all analyses in the paper is available from GitHub (https://github.com/adrianopolpo/contingencytables).

Funding: This work was partially supported by the Brazilian agencies FAPESP grant 2012/16669-4, and CNPq grants 302767/2017-7 and 308776/2014-3. The agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

We discuss indices for homogeneity, independence, and Hardy-Weinberg equilibrium hypotheses [1, 2] in contingency tables. We propose an exact evaluation of the Likelihood Ratio Test (LRT) as a benchmark significance index. Based on the work of [3], the idea is to evaluate the probability distribution of all possible tables in the sample space under the null hypothesis. Once the distribution for sampling contingency tables under the hypothesis is known, we are able to compute the exact distribution of the LRT statistic. The main difficulty with this procedure is that it is computationally time-consuming, being feasible only for small sample sizes and/or for tables of small dimension.

The exact LRT p-value is presented as a way to do exact inference. The aim is to compare the behavior of the frequentist LRT asymptotic p-value [4], the exact LRT p-value, the Fisher’s exact test p-value [5], the Chi-Square test asymptotic p-value [6, 7] and the Barnard’s exact test p-value [8–11]. These frequentist indices are also compared to the e-value from the Full Bayesian Significance Test (FBST) [12, 13]. We considered the asymptotic e-value and an approximation (based on a Markov Chain Monte Carlo procedure) of the exact e-value. The choice of adding a Bayesian index to the comparison study originates from the known asymptotic relationship between the LRT and the FBST [14]. Moreover, the FBST and its e-value can be viewed as a Bayesian counterpart of the p-value, and therefore it is interesting to understand this Bayesian method when compared to frequentist methods. It is important to point out that we are mainly interested in the values of the indices, not in the acceptance or rejection of the hypothesis; that is, our focus is on the significance test, which consists of the evaluation of the p-(e-)values. In an applied setting, researchers can, based on the indices, make their own decisions about their applications. We are not interested in comparing the values of the indices with some fixed significance level (generally 5%) to decide whether the hypothesis should be accepted or rejected. With this goal in mind, all significance indices considered here are in agreement with the ASA’s statement on significance indices [15].

From a historical perspective, hypothesis testing has been the most widely used statistical tool in many fields of science [16–18]. For categorical data, [19] discusses some exact procedures to perform inference and [20] presents methodological procedures for hypothesis testing for contingency tables. Tests for the homogeneity hypothesis in contingency tables have been compared by [21], who compared conditional and unconditional tests, and by [22], who compares, from an asymptotic perspective, two tests for equality of two proportions considering Goodman’s Y² and χ² statistics. Regarding tests for the independence of two classifiers in contingency tables, [23] presents an algorithm for finding the exact permutation significance level for r × c contingency tables, and [24] studies a simple way to compare two correlated proportions. More recently, [25] presents the exact likelihood ratio test for equality of two normal populations, and [26] discusses exact unconditional tests for the homogeneity hypothesis in 2 × 2 tables.

One important aspect that differentiates the test procedures is how each one deals with the elimination of the nuisance parameter. Basu [27] lists several methods but focuses on marginalization and conditioning. He defines marginalization as every procedure that replaces the observed sample x by the observed value of a suitable statistic T(x) = t. Therefore, instead of working with the original experiment and data x, one should use the marginal experiment and the recorded value T(x), since the marginal statistical model would depend only on the parameter of interest. To justify these procedures, Basu adds that researchers usually resort to invariance or partial sufficiency arguments.

By conditioning, Basu means methods of elimination that also consist of choosing a suitable statistic, but such that the conditional distribution of the observed sample, x, given the observed value of the statistic depends on the full parameter space only through the parameter of interest. Another commonly used approach that Basu describes is the one he calls maximization. In this case the nuisance parameter is eliminated from the risk function by some sort of maximization (or minimax) principle or directly from the likelihood, usually by maximizing it with respect to the nuisance parameters.

A final important strategy mentioned by Basu is the one he called Bayesian solution. In this case, one should derive the full posterior and integrate out the nuisance parameters, obtaining the posterior marginal distribution necessary to perform the required statistical inference. It is important to point out that the FBST does not follow this Bayesian strategy, since its evidence value is computed considering the full posterior. The proposed exact LRT p-value is based on the idea of integrating out the nuisance parameter, which is in some way related to Basu’s Bayesian solution [26]. The methods for elimination of nuisance parameters, maximization and Bayesian solution can be considered as unconditional methods.

The Likelihood Ratio Test (LRT) asymptotic p-value [28], the Chi-Square test asymptotic p-value [29], Fisher’s homogeneity exact test [29, 30], Barnard’s exact test [8], and the Full Bayesian Significance Test (FBST) asymptotic and exact e-value [12, 13] are presented in detail for the case of 2 × 2 contingency tables considering homogeneity hypothesis (Section 1.1). The theoretical results for homogeneity and independence hypotheses for tables of any dimension and Hardy-Weinberg equilibrium hypothesis are presented in sections 1.2, 1.3 and 1.4.

We study the relationship between indices in Section 2.1. [14] perform a similar study; however, they consider continuous random variables using the e-value and the LRT p-value and show that these indices share an asymptotic relationship. In our case, the asymptotic LRT p-value, the exact LRT p-value and the Chi-Square p-value have similar behavior, including in small sample size scenarios. Both Fisher’s exact test and Barnard’s exact test have a discrete behavior for their p-values, which is clearer for the Barnard’s exact test p-value. All tests are unconditional tests, except for Fisher’s, which is a conditional test. It is important to draw attention to the fact that the present results are not based on a simulation study: we compute the indices for all possible tables in the sample space.

In addition to our focus on the study of significance indices, we also provide, for the frequentist indices, a study of the power functions to compare the tests considering the homogeneity hypothesis (2 × 2 tables) and the Hardy-Weinberg equilibrium hypothesis (Section 2.2). Fisher’s exact test was the least powerful, followed by Barnard’s exact test, the Chi-Square test, the exact LRT and the asymptotic LRT, the most powerful one. We did not evaluate the power function for the FBST; firstly, because it is not the aim of the Bayesian paradigm, and secondly, because doing so would require defining a decision rule for the FBST, which is not in the scope of this paper. We also note that, under the null hypothesis, considering a significance level of 5%, all frequentist indices achieved 5% rejection, as expected.

1 Methods

1.1 Homogeneity test for 2 × 2 contingency tables

Let X₁ and X₂ be two random variables, representing the rows (1 and 2) of Table 1, x₁₁ and x₂₁ being their observed values, and n_1⋅ and n_2⋅ fixed sample sizes. Consider X₁ distributed as Binomial(n_1⋅, θ₁₁) and X₂ as Binomial(n_2⋅, θ₂₁), describing the chances of a subject belonging to category (column) C₁ in two distinct populations. Both populations are partitioned into two categories (columns) C₁ and C₂ and the objective is to test homogeneity among the two unknown population frequencies, H: θ₁₁ = θ₂₁ = θ. This hypothesis is geometrically represented by the diagonal line of the unit square.

Table 1. Contingency table 2 × 2.

https://doi.org/10.1371/journal.pone.0199102.t001

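The likelihood function is specified by

$$L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}) = \binom{n_{1\cdot}}{x_{11}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{n_{1\cdot} - x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{n_{2\cdot} - x_{21}}, \tag{1}$$

where 0 ≤ θ_i1 ≤ 1, i = 1, 2. Under H, the likelihood function simplifies to

$$L(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}}, \tag{2}$$

and the LRT statistic is

$$\lambda(x_{11}, x_{21}) = \frac{\sup_{\theta \in \Theta_H} L(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H)}{\sup_{(\theta_{11}, \theta_{21}) \in \Theta} L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot})}, \tag{3}$$

in which Θ_H is the parametric set defined by the hypothesis.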

• Exact LRT p-value:

To define this p-value, we use the predictive distributions of X₁ and X₂ before any data were observed. The proposed p-value is an alternative way to calculate an exact p-value for the LRT. The goal is to find a distribution for the contingency table under H that is not a function of θ. We consider θ a nuisance parameter in the likelihood function in (2) and integrate it over θ in order to eliminate it, as suggested by [27]. The idea is to incorporate the concept of the Bayesian solution nuisance parameter elimination approach but in a frequentist setting, which means using the likelihood function instead of a posterior distribution. That is,

$$h(x_{11}, x_{21}) = \int_0^1 L(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H)\, d\theta = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} B(x_{11} + x_{21} + 1,\; n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21} + 1), \tag{4}$$

where B(⋅, ⋅) is the Beta function.

To obtain the probability function Pr(X₁ = x₁₁, X₂ = x₂₁ ∣ H), one needs to find a normalization constant:

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid H) = \frac{h(x_{11}, x_{21})}{\sum_{i=0}^{n_{1\cdot}} \sum_{j=0}^{n_{2\cdot}} h(i, j)}. \tag{5}$$

Note that to calculate (5), we evaluate h(⋅, ⋅) for all possible tables. In the case of a homogeneity hypothesis for 2 × 2 contingency tables, ∑_i ∑_j h(i, j) = 1. We present the table’s probability in terms of this sum to obtain a general formula for all hypotheses and table dimensions considered here, since in other scenarios this quantity does not sum up to 1 (for example, the sum of h for all possible 2 × 2 tables considering independence hypothesis with n = 2 is 2304). The exact p-value calculation follows directly from the test statistic distribution:

$$\text{p-value} = \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid H), \tag{6}$$

in which R is the set of all pairs (i, j) such that λ(i, j) ≤ λ(x₁₁, x₂₁), and λ(x₁₁, x₂₁) is the observed test statistic, as in (3).
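The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names (`xlogy`, `lrt_lambda`, `h`, `exact_lrt_pvalue`) are hypothetical, h is taken to be the likelihood under H integrated over θ in (0, 1), and ties in λ are detected with a small relative tolerance.

```python
from math import comb, exp, lgamma, log


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def lrt_lambda(x1, x2, n1, n2):
    """LRT statistic: maximum likelihood under H over the unrestricted maximum."""
    pooled = (x1 + x2) / (n1 + n2)  # MLE of the common theta under H
    log_num = xlogy(x1 + x2, pooled) + xlogy(n1 + n2 - x1 - x2, 1 - pooled)
    log_den = (xlogy(x1, x1 / n1) + xlogy(n1 - x1, 1 - x1 / n1)
               + xlogy(x2, x2 / n2) + xlogy(n2 - x2, 1 - x2 / n2))
    return exp(log_num - log_den)


def h(x1, x2, n1, n2):
    """Likelihood under H integrated over theta in (0, 1) (assumed form of h)."""
    a, b = x1 + x2 + 1, n1 + n2 - x1 - x2 + 1
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return comb(n1, x1) * comb(n2, x2) * exp(log_beta)


def exact_lrt_pvalue(x1, x2, n1, n2, rel_tol=1e-9):
    """Sum the normalized h over all tables whose lambda is <= the observed one."""
    lam_obs = lrt_lambda(x1, x2, n1, n2)
    total = extreme = 0.0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            weight = h(i, j, n1, n2)
            total += weight
            # rel_tol keeps tables that tie with the observed lambda up to rounding
            if lrt_lambda(i, j, n1, n2) <= lam_obs * (1 + rel_tol):
                extreme += weight
    return extreme / total


print(exact_lrt_pvalue(2, 8, 10, 10))
```

Dividing by the running total keeps the sketch valid even for choices of h that do not sum to 1, which is the reason the text gives for writing the probability as a ratio.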

• Barnard’s Exact Test:

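Consider that n_1⋅ and n_2⋅ are fixed in Table 1. The random variables X₁ and X₂ are independent and Binomially distributed with parameters θ₁₁ and θ₂₁. The probability of a sample {x₁₁, x₂₁} being drawn is

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta_{11}, \theta_{21}) = \binom{n_{1\cdot}}{x_{11}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{n_{1\cdot} - x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{n_{2\cdot} - x_{21}} \tag{7}$$

and, under hypothesis H,

$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}}. \tag{8}$$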

We define the critical region as R = {λ(X₁, X₂) ≤ λ(x₁₁, x₂₁)}; then Barnard’s exact p-value is obtained by

$$\text{p-value} = \sup_{\theta \in (0, 1)} \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid \theta, H). \tag{9}$$

That is, Barnard’s exact test considers the p-values for all possible points of the parameter space under H, and takes the maximum p-value. In this test, the chosen approach for nuisance parameter elimination among the ones presented by Basu is maximization.
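As a rough sketch of the maximization step, the supremum over θ can be approximated on a finite grid. The grid search is an assumption made here for illustration (the text does not say how the supremum is computed), and the names `barnard_lrt_pvalue` and `grid` are hypothetical.

```python
from math import comb, exp, log


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def lrt_lambda(x1, x2, n1, n2):
    """LRT statistic used to order the tables in the sample space."""
    pooled = (x1 + x2) / (n1 + n2)
    log_num = xlogy(x1 + x2, pooled) + xlogy(n1 + n2 - x1 - x2, 1 - pooled)
    log_den = (xlogy(x1, x1 / n1) + xlogy(n1 - x1, 1 - x1 / n1)
               + xlogy(x2, x2 / n2) + xlogy(n2 - x2, 1 - x2 / n2))
    return exp(log_num - log_den)


def barnard_lrt_pvalue(x1, x2, n1, n2, grid=1000, rel_tol=1e-9):
    """Maximize, over a grid of theta values, the probability of the critical region."""
    lam_obs = lrt_lambda(x1, x2, n1, n2)
    region = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)
              if lrt_lambda(i, j, n1, n2) <= lam_obs * (1 + rel_tol)]
    best = 0.0
    for k in range(grid + 1):
        t = k / grid  # candidate value of the common nuisance parameter
        prob = sum(comb(n1, i) * comb(n2, j)
                   * t ** (i + j) * (1 - t) ** (n1 + n2 - i - j)
                   for i, j in region)
        best = max(best, prob)
    return best


print(barnard_lrt_pvalue(2, 8, 10, 10))
```

A finer grid, or a numerical optimizer started from the best grid point, would tighten the approximation at the cost of more evaluations.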

• Full Bayesian Significance Test:

The Bayesian approach considered is based on the FBST (Full Bayesian Significance Test) [12, 13].

Definition 1 Let π(θ ∣ x) be the posterior density function of θ given the observed sample and T(x) = {θ ∈ Θ : π(θ ∣ x) > sup_{θ ∈ Θ_H} π(θ ∣ x)}. The supporting evidence measure for the hypothesis θ ∈ Θ_H is defined as Ev(Θ_H, x) = 1 − Pr(θ ∈ T(x) ∣ x).

Consider that, a priori, θ₁₁ and θ₂₁ are independent and both follow a Uniform(0, 1) distribution. Uniform priors are chosen to avoid a subjective prior and to allow a fair comparison with the frequentist indices. Recall that X₁ and X₂ given θ₁₁ and θ₂₁ are Binomial distributed. Hence, the posterior distributions for θ₁₁ and θ₂₁ are independent Beta(x₁₁ + 1, n_1⋅ − x₁₁ + 1) and Beta(x₂₁ + 1, n_2⋅ − x₂₁ + 1). Under the hypothesis H, the posterior distribution is (10) and by maximizing it in θ we obtain sup_θ∈(0,1) π(θ ∣ x₁₁, x₂₁, n_1⋅, n_2⋅, H), where B(⋅, ⋅) is the Beta function. Since x₁₁, x₂₁, n_1⋅ and n_2⋅ are integers, (11) (12) the hypothesis’ tangent set, T, is (13) and (14)

To calculate the approximate e-value, we use the following algorithm:

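1. A random sample of size k is generated from the posterior distribution of (θ₁₁, θ₂₁), obtaining (θ₁₁^(1), θ₂₁^(1)), …, (θ₁₁^(k), θ₂₁^(k)).
2. The e-value is calculated by Ev ≈ 1 − (1/k) ∑_{i=1}^{k} I((θ₁₁^(i), θ₂₁^(i)) ∈ T), in which I(A) is the indicator function of set A.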
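The two steps of this algorithm might look as follows in Python. This is a sketch under the uniform priors stated in the text; the function names, the default sample size k = 20000 and the fixed seed are illustrative assumptions. Membership in the tangent set is checked by comparing the log posterior density of each draw with its supremum on the line θ₁₁ = θ₂₁, which is attained at the pooled proportion.

```python
import random
from math import lgamma, log


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_posterior(t1, t2, x1, x2, n1, n2):
    """Log density of independent Beta(x1+1, n1-x1+1) and Beta(x2+1, n2-x2+1)."""
    return (xlogy(x1, t1) + xlogy(n1 - x1, 1 - t1) - log_beta_fn(x1 + 1, n1 - x1 + 1)
            + xlogy(x2, t2) + xlogy(n2 - x2, 1 - t2) - log_beta_fn(x2 + 1, n2 - x2 + 1))


def fbst_evalue(x1, x2, n1, n2, k=20000, seed=0):
    """Monte Carlo e-value: 1 minus the share of posterior draws in the tangent set."""
    rng = random.Random(seed)
    pooled = (x1 + x2) / (n1 + n2)  # maximizes the posterior on theta11 = theta21
    log_sup_h = log_posterior(pooled, pooled, x1, x2, n1, n2)
    in_tangent = 0
    for _ in range(k):
        t1 = rng.betavariate(x1 + 1, n1 - x1 + 1)
        t2 = rng.betavariate(x2 + 1, n2 - x2 + 1)
        if log_posterior(t1, t2, x1, x2, n1, n2) > log_sup_h:
            in_tangent += 1
    return 1 - in_tangent / k


print(fbst_evalue(2, 8, 10, 10))
```

Because the posterior factorizes into two independent Beta densities, the draws here are exact and independent, so no burn-in or thinning is involved in this particular 2 × 2 case.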

• Other indices:

For the LRT, the statistic −2 ln[λ(X₁, X₂)] has asymptotically a chi-square distribution with 1 degree of freedom, which is dim(Θ) − dim(Θ_H) [28]. The FBST uses the same statistic; however, its asymptotic distribution is a chi-square with 2 degrees of freedom [13], which is dim(Θ). For the chi-square test and Fisher’s exact test for homogeneity, see [29].
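For the 2 × 2 homogeneity case, both asymptotic indices reduce to upper-tail chi-square probabilities that have closed forms, so they can be sketched with the standard library alone. The function names are hypothetical, and the closed forms used are the standard survival functions of the chi-square distribution with 1 and 2 degrees of freedom.

```python
from math import erfc, exp, log, sqrt


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def neg2_log_lambda(x1, x2, n1, n2):
    """Statistic -2 ln(lambda) for the homogeneity hypothesis in a 2 x 2 table."""
    pooled = (x1 + x2) / (n1 + n2)
    log_num = xlogy(x1 + x2, pooled) + xlogy(n1 + n2 - x1 - x2, 1 - pooled)
    log_den = (xlogy(x1, x1 / n1) + xlogy(n1 - x1, 1 - x1 / n1)
               + xlogy(x2, x2 / n2) + xlogy(n2 - x2, 1 - x2 / n2))
    return max(0.0, -2.0 * (log_num - log_den))  # guard against tiny negative rounding


def chi2_sf_df1(s):
    """Pr(chi-square with 1 degree of freedom >= s)."""
    return erfc(sqrt(s / 2.0))


def chi2_sf_df2(s):
    """Pr(chi-square with 2 degrees of freedom >= s)."""
    return exp(-s / 2.0)


def asymptotic_indices(x1, x2, n1, n2):
    """Return (asymptotic LRT p-value, asymptotic e-value) for one table."""
    s = neg2_log_lambda(x1, x2, n1, n2)
    return chi2_sf_df1(s), chi2_sf_df2(s)


print(asymptotic_indices(2, 8, 10, 10))
```

Since both indices are computed from the same statistic and differ only in the degrees of freedom, the asymptotic e-value is never smaller than the asymptotic p-value for the same table.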

1.2 Homogeneity hypothesis for ℓ × c contingency tables

Let X_i, i = 1, …, ℓ, be random variables that are represented by the rows of Table 2 and n_1⋅, n_2⋅, …, n_ℓ⋅ are known constants.

Table 2. Contingency table ℓ × c.

https://doi.org/10.1371/journal.pone.0199102.t002

Assuming that X_i, i = 1, …, ℓ, follows a Multinomial(n_i⋅, θ_i1, …, θ_ic) distribution, we are interested in testing if their distributions are homogeneous with respect to categories (columns) C_j, j = 1, …, c. That is, H: θ_1j = θ_2j = ⋯ = θ_ℓj = θ_j, j = 1, …, c, in which ∑_{k=1}^{c} θ_k = 1, 0 ≤ θ_k ≤ 1, ∀k = 1, …, c.

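Let x be all observed values presented in Table 2 and θ all the parameters. The likelihood function is

$$L(\theta \mid x) = \prod_{i=1}^{\ell} \left[ \frac{n_{i\cdot}!}{\prod_{j=1}^{c} x_{ij}!} \prod_{j=1}^{c} \theta_{ij}^{x_{ij}} \right] \tag{15}$$

and under the hypothesis H,

$$L(\theta \mid x, H) = \left[ \prod_{i=1}^{\ell} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} x_{ij}!} \right] \prod_{j=1}^{c} \theta_j^{x_{\cdot j}}, \qquad x_{\cdot j} = \sum_{i=1}^{\ell} x_{ij}. \tag{16}$$

The LRT λ statistic is

$$\lambda(x) = \frac{\sup_{\theta \in \Theta_H} L(\theta \mid x, H)}{\sup_{\theta \in \Theta} L(\theta \mid x)}. \tag{17}$$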

• Exact LRT p-value:

To obtain the exact LRT p-value, we need the function h(x). In this scenario, (18) and the p-value’s calculation follows as in Subsection 1.1.

• FBST:

Assuming a Dirichlet(1, 1, …, 1) prior for {θ_i1, …, θ_ic}, and since X_i follows a Multinomial(n_i⋅, θ_i1, …, θ_ic) distribution, the posterior distribution is a Dirichlet(x_i1 + 1, x_i2 + 1, …, x_ic + 1), i = 1, …, ℓ.

In this setting, (19) and we can obtain the e-value from Definition 1.

• Other indices:

Both the asymptotic LRT p-value and the asymptotic e-value are calculated as Pr[−2 ln(λ(X)) ≥ −2 ln(λ(x))], but while the LRT considers that this statistic follows a chi-square distribution with (ℓ − 1)(c − 1) degrees of freedom, the FBST considers that it follows a chi-square distribution with ℓ(c − 1) degrees of freedom. The Chi-Square homogeneity test is also obtained.

1.3 Independence hypothesis for ℓ × c contingency tables

Consider that θ_ij is the probability of observing a sample in the cell at row i and column j, θ_i⋅ is the probability of observing a sample in row i, θ_⋅j is the probability of observing a sample in column j, 0 ≤ θ_ij ≤ 1, 0 ≤ θ_i⋅ ≤ 1, 0 ≤ θ_⋅j ≤ 1, i = 1, …, ℓ, j = 1, …, c, ∑_{j=1}^{c} θ_ij = θ_i⋅, ∑_{i=1}^{ℓ} θ_ij = θ_⋅j, and ∑_{i=1}^{ℓ} ∑_{j=1}^{c} θ_ij = 1.

For the independence hypothesis, our interest is to test H: θ_ij = θ_i⋅ × θ_⋅j, ∀i, j. For the case of 2 × 2 table, the independence hypothesis is geometrically represented as Fig 1.

Fig 1. Geometric representation of the independence hypothesis (gray surface) for 2 × 2 tables.

The parametric space is the three-dimensional simplex (regular tetrahedron).

https://doi.org/10.1371/journal.pone.0199102.g001

Considering that n_⋅⋅ is known, we assume that the outcomes of Table 2 follow a Multinomial(n.., θ) distribution, θ = {θ₁₁, …, θ_1(c−1), …, θ_ℓ1, …, θ_ℓ(c−1)}, and , i = 1, …, ℓ. The likelihood function is (20) The likelihood function under H is (21) and the LRT λ statistic is (22)

• Exact LRT p-value:

As shown in Subsection 1.1, this p-value is obtained the same way but with a different h(x). In this case, (23)

• FBST:

Assuming a Dirichlet(1, …, 1) as prior distribution for θ and that the outcomes of Table 2 follow a Multinomial(n_⋅⋅, θ₁₁, …, θ_1c, …, θ_ℓ1, …, θ_ℓc) distribution, then the posterior distribution is a Dirichlet(x₁₁ + 1, …, x_1c + 1, …, x_ℓ1 + 1, …, x_ℓc + 1). The e-value is obtained from Definition 1 and (24)

• Other indices:

We obtained the asymptotic LRT p-value and e-value, considering that −2 ln(λ(X)) follows a chi-square distribution with (ℓ − 1)(c − 1) and (ℓc − 1) degrees of freedom, respectively. We also obtained the p-value for the Chi-Square independence test.

1.4 Hardy-Weinberg equilibrium

An individual’s genotype is formed by a combination of alleles. If there are two possible alleles for one characteristic (say A and a), the possible genotypes are AA, Aa or aa. Assuming that a few premises hold [31], the Hardy-Weinberg principle says that the allele probabilities in a population do not change from generation to generation. It is a fundamental principle for the Mendelian mating allelic model. If the probabilities of the alleles are θ and 1 − θ, the expected genotype probabilities are (θ², 2θ(1 − θ), (1 − θ)²), 0 ≤ θ ≤ 1.

Considering the Hardy-Weinberg equilibrium, the aim is to verify if a population follows these genotype proportions. Therefore, the equilibrium hypothesis is H: θ₁ = θ², θ₂ = 2θ(1 − θ), θ₃ = (1 − θ)², 0 ≤ θ ≤ 1, in which θ₁, θ₂, θ₃ are the proportions of AA, Aa, and aa, respectively. This hypothesis is geometrically represented in Fig 2.

Fig 2. Geometric representation of the Hardy-Weinberg equilibrium hypothesis (black line), and the parametric space (gray shading).

https://doi.org/10.1371/journal.pone.0199102.g002

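Let X be a random vector. Table 3 represents the genotype frequencies for the population in question. Considering n known, we assume that X follows a Trinomial(n, θ₁, θ₂, θ₃) distribution. The likelihood function for this model is

$$L(\theta \mid x) = \frac{n!}{x_1!\, x_2!\, x_3!}\, \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}, \tag{25}$$

in which x = {x₁, x₂, x₃}, θ₁ + θ₂ + θ₃ = 1 and θ_i > 0, i = 1, 2, 3. Under the hypothesis H,

$$L(\theta \mid x, H) = \frac{n!}{x_1!\, x_2!\, x_3!}\, 2^{x_2}\, \theta^{2 x_1 + x_2} (1 - \theta)^{x_2 + 2 x_3}. \tag{26}$$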

Table 3. Genotype frequency.

https://doi.org/10.1371/journal.pone.0199102.t003

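The maximum likelihood estimator for θ under H is θ̂ = (2x₁ + x₂)/(2n) and the LRT λ statistic is

$$\lambda(x) = \frac{L(\hat{\theta} \mid x, H)}{\sup_{\theta \in \Theta} L(\theta \mid x)}. \tag{27}$$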

• Exact LRT p-value:

Calculations follow as for the other indices and in this scenario (28)
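A sketch of the same enumeration for the Hardy-Weinberg hypothesis is given below. It assumes, by analogy with Subsection 1.1, that h(x) is the trinomial likelihood under H integrated over θ in (0, 1); this form is an assumption made for illustration, as are all function names.

```python
from math import exp, factorial, lgamma, log


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def hw_lambda(x1, x2, x3):
    """LRT statistic for Hardy-Weinberg equilibrium with counts (AA, Aa, aa)."""
    n = x1 + x2 + x3
    t = (2 * x1 + x2) / (2 * n)  # MLE of the allele probability under H
    log_num = x2 * log(2) + xlogy(2 * x1 + x2, t) + xlogy(x2 + 2 * x3, 1 - t)
    log_den = sum(xlogy(x, x / n) for x in (x1, x2, x3))
    return exp(log_num - log_den)


def hw_h(x1, x2, x3):
    """Trinomial likelihood under H integrated over theta in (0, 1) (assumed form)."""
    n = x1 + x2 + x3
    a, b = 2 * x1 + x2 + 1, x2 + 2 * x3 + 1
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
    return coef * 2 ** x2 * exp(log_beta)


def hw_exact_lrt_pvalue(x1, x2, x3, rel_tol=1e-9):
    """Sum the normalized h over all genotype tables with lambda <= observed lambda."""
    n = x1 + x2 + x3
    lam_obs = hw_lambda(x1, x2, x3)
    total = extreme = 0.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            k = n - i - j
            weight = hw_h(i, j, k)
            total += weight
            if hw_lambda(i, j, k) <= lam_obs * (1 + rel_tol):
                extreme += weight
    return extreme / total


print(hw_exact_lrt_pvalue(5, 0, 5))
```

Any factor of h that does not depend on the table cancels in the normalization, so only the table-dependent part of the assumed form affects the resulting p-value.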

• Barnard’s Exact Test:

The critical region is R = {λ(X) ≤ λ(x)}, and Barnard’s exact p-value is obtained by

$$\text{p-value} = \sup_{\theta \in (0, 1)} \Pr(X \in R \mid \theta, H). \tag{29}$$

• FBST:

Assuming a Dirichlet(1, 1, 1) prior for θ and that X follows a Trinomial(n, θ₁, θ₂, θ₃) distribution, the posterior distribution is θ ∣ x ~ Dirichlet(x₁ + 1, x₂ + 1, x₃ + 1). In this setting, (30)

• Other indices:

Both the asymptotic LRT p-value and the asymptotic e-value are obtained, the p-value considering that −2 ln(λ(X)) follows a chi-square distribution with 1 degree of freedom and the e-value considering that it follows a chi-square distribution with 2 degrees of freedom.

2 Results

2.1 Relations between the indices

In many practical situations, mainly in biological studies, asymptotic distributions are used to evaluate indices even for small samples. With that in mind, one of our interests is to understand how the use of asymptotic results for small sample size settings compares to the use of an exact index. Surprisingly, the values of exact and asymptotic indices do not diverge considerably.

As our objective is to compare the indices, we consider different scenarios for each hypothesis. For each scenario, we evaluate the significance indices of all test procedures presented here. Note that this is not a simulation study; for each sample size, we evaluate the indices for all possible contingency tables of a fixed dimension and size. For example, considering the homogeneity hypothesis in a 2 × 2 table with marginals (10, 10), there are 121 possible tables; considering the independence hypothesis in a 2 × 3 table with total sample size 15, there are 15504 possible tables. We evaluated the indices for all tables that fit into each specification. For the e-value computation, non-informative priors for the parameters are considered (that is, π(θ) ∝ 1). This way, no extra information is added besides the data, allowing fair comparisons between frequentist and Bayesian indices.
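The two table counts quoted above follow from a stars-and-bars argument and can be checked with a short sketch; the function names are hypothetical. With fixed row totals, each row contributes the number of ways to split its total among c columns; with only the grand total fixed, the total is split among all ℓ × c cells.

```python
from math import comb, prod


def n_tables_homogeneity(row_totals, n_cols):
    """Number of tables with fixed row totals: one composition count per row."""
    return prod(comb(n + n_cols - 1, n_cols - 1) for n in row_totals)


def n_tables_independence(n_total, n_rows, n_cols):
    """Number of tables with fixed grand total spread over all cells."""
    cells = n_rows * n_cols
    return comb(n_total + cells - 1, cells - 1)


print(n_tables_homogeneity([10, 10], 2))   # 2 x 2 table with marginals (10, 10)
print(n_tables_independence(15, 2, 3))     # 2 x 3 table with total 15
```

These counts grow quickly with the sample size and the table dimension, which is why full enumeration is only practical in small scenarios.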

For each scenario, plots are drawn to illustrate differences between the indices’ values. The indices studied are the exact LRT p-value, asymptotic p-value for the LRT, asymptotic p-value for the chi-square test, e-value and asymptotic e-value. For the homogeneity hypothesis in 2 × 2 tables, Fisher and Barnard exact tests were also considered, and for the Hardy-Weinberg equilibrium hypothesis Barnard’s exact test was also obtained. We considered many different scenarios; however, since the aim is to understand the indices in small sample sizes, only the scenarios listed in Table 4 are presented here.

Table 4. Considered scenarios.

https://doi.org/10.1371/journal.pone.0199102.t004

Figs 3, 4 and 5 illustrate the results of the discussion above. For all hypotheses, exact and asymptotic e-values are very similar for both large and small sample sizes. Looking into the frequentist indices, exact LRT p-values and asymptotic p-values, both LRT and Chi-Square, are also very similar to each other. The difference found between e-values and the asymptotic LRT p-value is a result of the way these indices are formulated: while e-values consider the full dimension of the parameter space, p-values consider the complementary dimension of the set corresponding to hypothesis H. This is expected from the asymptotic relationship between the e-value and the p-value from the LRT [13, 14]. Since the exact LRT p-value is directly related to the asymptotic LRT p-value, we observe the same behavior of the differences between e-values and the asymptotic LRT p-value. Fisher’s exact test was only calculated for the homogeneity hypothesis in 2 × 2 tables, and Barnard’s exact test was calculated for the homogeneity hypothesis in 2 × 2 tables and for the Hardy-Weinberg equilibrium hypothesis. Both indices behave differently from the other indices considered. They have a discrete behavior, which is not surprising since Fisher’s exact test is a conditional test and Barnard’s exact test eliminates the nuisance parameter by maximization. Looking at the plots, their values do not form a continuous curve like the other indices’ values do, and their points are quite far from those of all the other indices.

Fig 3. Scatterplots for the significance indices of homogeneity hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g003

Fig 4. Scatterplots for the significance indices of independence hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g004

Fig 5. Scatterplots for the significance indices of Hardy-Weinberg hypothesis considering different sample sizes and different table dimensions.

The indices were evaluated for all possible samples in the sample space. The label in the top box of each column gives the index on the x-axis, and the label in the left box of each row gives the index on the y-axis. The table dimensions and sample sizes are given in the sublabels.

https://doi.org/10.1371/journal.pone.0199102.g005

2.2 Power function

Power functions are a useful tool to compare hypothesis tests. For all θ ∈ Θ, the power function provides the probability of rejecting the hypothesis for a given θ. Ideally, we look for a test that does not reject the hypothesis for θ ∈ Θ_H and whose probability of rejection increases the further the θ value is from the hypothesis.

The power functions presented are the ones that we are able to represent in ℝ³, which are the power functions for the homogeneity hypothesis in 2 × 2 contingency tables and for the Hardy-Weinberg equilibrium hypothesis.

We used p-values of at most 0.05 as a decision rule to reject the hypothesis. This choice is based on what is widely used in most fields of science as a decision rule. In this case, Power(θ₁, θ₂) = P(reject H ∣ (θ₁, θ₂)), and H is rejected if the index is ≤ 0.05.

We obtain the power function for all tests but the FBST. The FBST is a Bayesian significance test and in order to obtain a power function, one would need a decision rule. Since its construction differs from that of the p-values, we cannot use the same decision rule, and constructing a decision rule is not in the scope of this paper.

We used a Monte Carlo procedure to evaluate the power function of these tests. We consider a grid for the unit square with 100 × 100 points on the axes (θ₁, θ₂). For each point in the grid we generated 1000 tables. From these 1000 tables we evaluate the proportion of rejections, which is an approximation of the power function.
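The Monte Carlo procedure for a single grid point can be sketched as follows, here for the asymptotic LRT p-value in a 2 × 2 table with marginals (10, 10). The function names, the fixed seed and the Bernoulli-sum sampler are illustrative assumptions rather than the authors' implementation.

```python
import random
from math import erfc, log, sqrt


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def asymptotic_lrt_pvalue(x1, x2, n1, n2):
    """Upper-tail chi-square (1 df) probability of -2 ln(lambda)."""
    pooled = (x1 + x2) / (n1 + n2)
    log_num = xlogy(x1 + x2, pooled) + xlogy(n1 + n2 - x1 - x2, 1 - pooled)
    log_den = (xlogy(x1, x1 / n1) + xlogy(n1 - x1, 1 - x1 / n1)
               + xlogy(x2, x2 / n2) + xlogy(n2 - x2, 1 - x2 / n2))
    stat = max(0.0, -2.0 * (log_num - log_den))
    return erfc(sqrt(stat / 2.0))


def mc_power(theta1, theta2, n1=10, n2=10, reps=1000, alpha=0.05, seed=0):
    """Proportion of simulated tables for which the hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Each binomial count is drawn as a sum of independent Bernoulli trials
        x1 = sum(rng.random() < theta1 for _trial in range(n1))
        x2 = sum(rng.random() < theta2 for _trial in range(n2))
        if asymptotic_lrt_pvalue(x1, x2, n1, n2) <= alpha:
            rejections += 1
    return rejections / reps


print(mc_power(0.5, 0.5), mc_power(0.2, 0.8))
```

Repeating the call over a 100 × 100 grid of (θ₁, θ₂) values gives the surface described above; with 1000 tables per point, each estimate carries a Monte Carlo standard error of at most about 0.016.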

We plot pairs of power functions to illustrate and compare their shapes. For the homogeneity hypothesis in a table with marginals (10, 10), Fig 6 shows that Fisher’s exact test is less powerful than Barnard’s exact test, that Barnard’s exact test has similar power to the Chi-square test, and that the Chi-square test is less powerful than the proposed exact LRT p-value, which in turn is less powerful than the asymptotic p-value for the LRT. To have a clear picture, we plot the power functions from different tests against each other. Fig 7a consists of the power functions for tables with marginals equal to (10, 10). It shows that the use of the asymptotic p-value for the LRT results in a more powerful test than the other indices. When comparing the proposed exact p-value to other indices, it is more powerful than the Chi-square test and Fisher’s exact test. Between the Chi-square test and Fisher’s exact test, the Chi-square test is more powerful.

Fig 6. Power function for homogeneity hypothesis in 2 × 2 contingency tables with n_1⋅ = n_2⋅ = 10.

https://doi.org/10.1371/journal.pone.0199102.g006

Fig 7. Plots of power function values for the homogeneity test.

Each graph presents one index versus another, each dot representing a point in the considered parametric space (in this case, 100 × 100 = 10000 points), and if a dot is on top of the gray identity line, the power functions assume the same value for that point in the parametric space. The scenario is 2 × 2 with marginals n_1⋅ = n_2⋅ = 10 in (a) and n_1⋅ = n_2⋅ = 100 in (b).

https://doi.org/10.1371/journal.pone.0199102.g007

For tables with marginals equal to (100, 100), the graphs are more concentrated near the identity line (Fig 7b), showing that all indices are more alike. The ordering still exists, but it is less pronounced. It is interesting to point out that, as expected, the Chi-square test works better with larger samples.

For the Hardy-Weinberg hypothesis, the results are similar to the ones obtained for the homogeneity hypothesis and are shown in Figs 8 and 9. In this case, the most powerful test was the asymptotic p-value for the LRT, followed by the exact p-value for the LRT, which is more powerful than the Chi-square test, which in turn is similar to Barnard’s exact test. We call attention to the fact that, under hypothesis H, the power function achieves the value of 0.05, as expected, since this is the significance level chosen to build the power functions.

Fig 8. Power function for Hardy-Weinberg equilibrium hypothesis with n = 10.

https://doi.org/10.1371/journal.pone.0199102.g008

Fig 9. Plots of power functions values for the Hardy-Weinberg equilibrium test.

Each graph presents one index versus another, each dot representing a point in the considered parametric space (in this case, 100 × 100 = 10000 points), and if a dot is on top of the gray identity line, the power functions assume the same value for that point in the parametric space. The scenarios are marginals n = 10 (a) and n = 100 (b).

https://doi.org/10.1371/journal.pone.0199102.g009

3 Conclusion

After evaluating the indices for tables in different scenarios, we noticed that all of them had very similar behaviors, independently of the perspective (Bayesian or frequentist), sample size and table dimension. The exceptions are the p-values for Fisher and Barnard’s exact tests for the homogeneity hypothesis in 2 × 2 tables, and Barnard’s exact test for Hardy-Weinberg equilibrium, which show a discretized behavior. Studying the power functions considering homogeneity hypothesis in 2 × 2 tables and Hardy-Weinberg equilibrium hypothesis, the LRT presented itself as a powerful test when considering small sample sizes, while Fisher’s exact test was the least powerful one for the homogeneity hypothesis and the Barnard’s exact test was the least powerful for the Hardy-Weinberg equilibrium hypothesis. By enlarging sample sizes, the power of these tests increases accordingly.

Finally, we close this paper by listing our main conclusions:

The LRT asymptotic p-value seems to be a good frequentist alternative for small sample sizes.
Since there is an asymptotic relationship between the p-value for the LRT and the e-value (FBST), we consider that both indices are equivalent in the explored settings.
In cases where there is available information besides the data that is to be taken into account, represented by informative priors, we consider the e-value a more appropriate index than a frequentist one, since the e-value offers a mechanism to incorporate that information.

Acknowledgments

This work was partially supported by the Brazilian agencies FAPESP grant 2012/16669-4, and CNPq grants 302767/2017-7 and 308776/2014-3. The agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Emigh TH. A Comparison of Tests for Hardy-Weinberg Equilibrium. Biometrics. 1980;36(4):627–642. pmid:25856832
2. Montoya-Delgado LE, Irony TZ, Pereira CAB, Whittle MR. An unconditional exact test for the Hardy-Weinberg Equilibrium Law: Sample space ordering using the Bayes Factor. Genetics. 2001;158(2):875–83. pmid:11404348
3. Pereira CAB, Wechsler S. On the Concept of P-value. Brazilian Journal of Probability and Statistics. 1993;7:159–177.
4. Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics. 1938;9:60–62.
5. Fisher RA. Statistical Methods for Research Workers. 5th ed. Biological Monographs and Manuals. Edinburgh: Oliver and Boyd; 1934.
6. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, Series 5. 1900;50(302):157–175.
7. Fisher RA. On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85(1):87–94.
8. Barnard GA. A New Test for 2x2 Tables. Nature. 1945;156:177.
9. Fisher RA. A New Test for 2x2 Tables. Nature. 1945;156:388.
10. Barnard GA. A New Test for 2x2 Tables. Nature. 1945;156:783–784.
11. Barnard GA. Statistical Inference. Journal of the Royal Statistical Society, Series B (Methodological). 1949;11(2):115–149.
12. Pereira CAB, Stern JM. Evidence and Credibility: a Full Bayesian Test of Precise Hypothesis. Entropy. 1999;1:104–115.
13. Pereira CAB, Stern JM, Wechsler S. Can a Significance Test Be Genuinely Bayesian? Bayesian Analysis. 2008;3(1):19–100.
14. Diniz MA, Pereira CAB, Polpo A, Stern J, Wechsler S. Relationship Between Bayesian and Frequentist Significance Indices. International Journal for Uncertainty Quantification. 2012;2(2):161–172.
15. Wasserstein RL, Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129–133.
16. Lawson AE, Clark B, Cramer-Meldrum E, Falconer KA, Sequist JM, Kwon Y. Development of Scientific Reasoning in College Biology: Do Two Levels of General Hypothesis-Testing Skills Exist? Journal of Research in Science Teaching. 2000;37(1):81–101.
17. Herrmann E, Call J, Hernandez-Lloreda MV, Hare B, Tomasello M. Humans Have Evolved Specialized Skills of Social Cognition: The Cultural Intelligence Hypothesis. Science. 2007;317(5843):1360–1366. pmid:17823346
18. Montgomery DC, Runger GC. Applied Statistics and Probability for Engineers. John Wiley & Sons; 2010.
19. Agresti A. Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine. 2001;20:2709–2722. pmid:11523078
20. Agresti A. Categorical Data Analysis. 2nd ed. John Wiley & Sons; 2002.
21. Mehta CR, Hilton JF. Exact Power of Conditional and Unconditional Tests: Going beyond the 2x2 Contingency Table. The American Statistician. 1993;47(2):91–98.
22. Eberhardt KR, Fligner MA. A Comparison of Two Tests for Equality of Two Proportions. The American Statistician. 1977;31(4):151–155.
23. Pagano M, Halvorsen KT. An Algorithm for Finding the Exact Significance Levels of r × c Contingency Tables. Journal of the American Statistical Association. 1981;76(376):931–934.
24. Irony TZ, Pereira CAB, Tiwari RC. Analysis of Opinion Swing: Comparison of two correlated proportions. The American Statistician. 2000;54(1):57–62.
25. Zhang L, Xu X, Chen G. The Exact Likelihood Ratio Test for Equality of Two Normal Populations. The American Statistician. 2012;66(3):180–184.
26. Shan G, Wilding GE. Powerful Exact Unconditional Tests for Agreement Between Two Raters with Binary Endpoints. PLoS ONE. 2014;9(5):e97386. pmid:24837970
27. Basu D. On the Elimination of Nuisance Parameters. Journal of the American Statistical Association. 1977;72(358):355–366.
28. Casella G, Berger R. Statistical Inference. 2nd ed. Duxbury Press; 2001.
29. Agresti A. An Introduction to Categorical Data Analysis. 2nd ed. John Wiley & Sons; 2007.
30. Irony TZ, Pereira CAB. Exact tests for equality of two proportions: Fisher vs. Bayes. Journal of Statistical Computation and Simulation. 1986;25:93–114.
31. Hartl DL, Clark AG. Principles of Population Genetics. 4th ed. Sinauer Associates, Inc. Publishers; 2007.
