A Graphical Diagnostic Test for Two-Way Contingency Tables

We propose and illustrate a new graphical method to perform diagnostic analyses in two-way contingency tables. In this method, one observation at a time is added to or removed from each cell, whilst the other cells are held constant, and the change in a test statistic of interest is graphically represented. The method provides a very simple way of determining how robust our model (and hence our conclusions) is to small changes introduced to the data. We illustrate how the method works with three examples from real-world applications and three simulated ones.


Introduction
In applied statistics, the analysis of contingency tables (CTs) is one of the most common tasks performed by statisticians on a daily basis. CTs display the frequency distribution of two or more categorical variables and, depending on how many variables are considered for analysis, they are called two-way CTs or multi-way CTs. Information presented as CTs is common in the literature, and includes applications in a wide variety of areas such as psychology (Iossifova & Marmolejo-Ramos 2013), genetics (Kamish 1988, Clarke, Anderson, Pettersson, Cardon, Morris & Zondervan 2011, Dickhaus, Straßburger, Schunk, Morcillo-Suarez, Illig & Navarro 2012), demography (Carlier & Ewing 1992, Cung 2013) and the social sciences (Wickens 1969). In the context of two-way CTs, categorical data are traditionally displayed as shown in Table 1. In order to analyse this type of data, it is necessary to use a statistical model $M$ that describes how the categorical variables $X$ and $Y$ are stochastically related. Once model $M$ is fitted, statistical inference begins, and we are then able to draw conclusions about the data at hand (Harrell 2001, Chapter 5).
As an example, let us assume that gender and the presence/absence of a particular disease are our two variables of interest. In this case, we have a two-way contingency table displaying how many males and females developed the disease, as well as how many of them did not (i.e., a 2 × 2 CT). Using Table 1 as a reference, the number of categories of $X$ and $Y$ is $I$ and $J$, respectively; $I \times J$ is the number of possible combinations (also called cells) that result from crossing $X$ and $Y$; $n_{ij}$ is the number of individuals in the $i$th category of $X$ and the $j$th category of $Y$ ($i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$); and $n = n_{11} + n_{12} + \cdots + n_{IJ}$ is the total number of observations (or individuals) in the sample.
Models for CTs often include log-linear (Agresti 2002, Chapters 8-9), logistic regression (Hosmer & Lemeshow 1989), and GSK (Grizzle, Starmer & Koch 1969) models. After fitting model $M$ to an observed CT, it is crucial to validate the model and determine how robust the results are (Agresti 2002, MacCullagh 2002). Although several approaches have been proposed in the literature (Snee 1977, Marcus & Elias 1998, Kleijnen 1999, Geweke 2007), in this paper we focus on validation via diagnostics.
Diagnostics is an area in statistics that started with the seminal work in regression analysis by Belsley, Kuh & Welsch (1980), which has received a lot of attention over the past three decades and has rapidly extended to many other areas. For CTs in particular, several diagnostic methods emulating those in log-linear models have been proposed (Lustbader & Moolgavkar 1985, Tsujitani & Koch 1991, Andersen 1992), as well as graphical methods (Genest & Green 1987, Friendly 1994, Friendly 1995, Friendly 1999) and methods for detecting outlier cells (Fuchs & Kennet 1980, Simonoff 1988). Unfortunately, graphical methods are not very popular because they have not been fully integrated into any statistical software, are not easy to understand, and/or because the observed CT contains so few observations that the user feels it is unnecessary to go further into an exploratory (graphical) analysis.
Revista Colombiana de Estadística 39 (2016) 97-108
In the CT literature, one of the most frequently used diagnostic tests is the addition (or elimination) of one observation to (or from) each of the $I \times J$ cells, one cell at a time, followed by the calculation of a test statistic under model $M$ (Andersen 1992). The change between the test statistic calculated with the full data and the one calculated after adding (or eliminating) observations tells us how important those observations are to the fitted model. If removing some observations from one cell makes us reject the model, it suggests that the model is not robust enough. Although the process of adding (or eliminating) observations can be extended to more than one cell at a time, some challenges still emerge.
Here, we propose a graphical diagnostic test (GDT) that allows us to determine the quality and robustness of the fitted model $M$ without the aforementioned complexities, and that emulates the tests used for linear models. Furthermore, we provide an easy-to-use implementation in R (R Core Team 2015) to perform the test and visualise the effect that adding/eliminating $k$ observations to/from the $(ij)$-th cell of the original CT has on the model (see Appendix A). Three real and three simulated examples are presented for illustration purposes.

Proposed Method
We propose the following procedure to study the contribution of a single observation to the model: remove (or add) one observation at a time from (or to) a given cell of the CT, whilst the other cells are held constant. This process is then repeated for each cell. If the model is rejected or accepted, and this contradicts the conclusion reached when no modification was introduced, it suggests that our model is not robust enough. However, if we eliminate many observations from a single cell and our conclusion does not change, we can conclude that our model is robust to (not necessarily small) changes in that cell.

Notation
From now on, we will use the notation in Table 2. Although the test statistic is denoted by $T$ and no particular distribution has been assumed, the notation generalises directly to any test statistic applicable to two-way CTs when a hypothetical model $M$ is considered¹.
To illustrate the proposed notation, suppose we are interested in testing independence between two categorical variables using $N$. In this case, $T$ can be defined as the Pearson $\chi^2$ statistic given by
$$\chi^2_M = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - E^M_{ij})^2}{E^M_{ij}},$$
with $E^M_{ij}$ being the expected value of the $(ij)$-th cell of $N$ under model $M$ (independence). The $\chi^2_{M(ij)-k}$ and $\chi^2_{M(ij)+k}$ statistics are calculated similarly, and this is left as an exercise for the reader². Under the null hypothesis, $\chi^2_M$ follows a $\chi^2$ distribution with $(I-1)(J-1)$ degrees of freedom, i.e., $\chi^2_M \sim \chi^2_{(I-1)(J-1)}$.
Another possibility when working with CTs is to use a likelihood ratio test (LRT), denoted as $G^2$, to determine whether two categorical variables are independent. Under model $M$, this test uses the likelihood of the data under the null hypothesis relative to the maximum likelihood, i.e.,
$$G^2_M = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} N_{ij} \log\left(\frac{N_{ij}}{E^M_{ij}}\right),$$
with $N_{ij}$ and $E^M_{ij}$ as previously defined. Under the null hypothesis of independence between the two categorical variables, $G^2_M \sim \chi^2_{(I-1)(J-1)}$.
An important observation when testing independence is that the degrees of freedom of the test statistic under the null hypothesis are not affected by the test statistic that is being used, by the number $k$ of observations added to the $(ij)$-th cell of $N$, or by removing the same number of observations from that cell. This holds for any CT since the degrees of freedom depend only on the dimension of the table.
¹ Although model selection is not in the scope of this paper, one approach to determine whether model $M$ is appropriate for the contingency table is to use statistics such as the likelihood ratio test (LRT) to compare two models $M_1$ and $M_2$, or to use the log-likelihood function and select the model with the highest value.
² Although there is no specific criterion for selecting the value of $k$, when observations are being removed it can be at most equal to the count in the cell being modified. This guarantees that no negative entries are present in the transformed contingency table.
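As a rough illustration of the two statistics discussed in this section, the following Python sketch computes the Pearson $\chi^2$ and $G^2$ statistics under independence, together with their p-values. It is not the authors' R implementation: the function names (`expected_counts`, `pearson_chi2`, `g2`, `chi2_sf`, `independence_test`) and the example table are hypothetical, and the chi-square tail probability is computed with a closed-form series that assumes integer degrees of freedom.

```python
import math

def expected_counts(table):
    """Expected counts under independence: E_ij = (row total x column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson statistic: sum of (N_ij - E_ij)^2 / E_ij (all margins must be positive)."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def g2(table):
    """Likelihood ratio statistic: 2 * sum of N_ij * log(N_ij / E_ij); empty cells add 0."""
    expected = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, expected)
                     for o, e in zip(obs_row, exp_row) if o > 0)

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum_{i < df/2} z^i / i!
        term = math.exp(-z)
        total = term
        for i in range(1, df // 2):
            term *= z / i
            total += term
        return total
    # Odd df: erfc(sqrt(z)) + exp(-z) * sum_{i=1}^{(df-1)/2} z^(i - 1/2) / Gamma(i + 1/2)
    total = math.erfc(math.sqrt(z))
    term = math.sqrt(z) * math.exp(-z) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= z / (i + 0.5)
    return total

def independence_test(table, statistic=pearson_chi2):
    """Return (statistic, p-value) using (I - 1)(J - 1) degrees of freedom."""
    df = (len(table) - 1) * (len(table[0]) - 1)
    t = statistic(table)
    return t, chi2_sf(t, df)

# Hypothetical 2 x 2 table, for illustration only (not taken from the paper).
table = [[10, 20], [30, 40]]
for name, stat in (("Pearson", pearson_chi2), ("G2", g2)):
    t, p = independence_test(table, stat)
    print(f"{name}: statistic = {t:.4f}, p-value = {p:.4f}")
```

Because the degrees of freedom depend only on the table dimensions, the same `independence_test` call can be reused unchanged on a table in which one cell has been modified.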

Procedure
Beginning with the information in the $(ij)$-th cell, we propose the following procedure to perform graphical diagnostics on the CT:
1. Calculate the test statistic $T$ after adding $k$ observations to, or removing $k$ observations from, the $(ij)$-th cell (see Table 2 for more details).
2. Determine the p-value of the test as
$$p_M = 1 - F(T),$$
with $F$ as the cumulative distribution function of the test statistic under the null hypothesis.
3. Plot $p_M$ against the number of added/removed observations $k$.
Steps 1-3 are repeated as many times as the number of cells in the CT. For instance, when the CT is a 2 × 2 table, and it is of interest to evaluate how robust the model $M$ is by adding or removing $k$ observations from each cell of the original data, the resulting plot for each condition (adding or removing up to $k$ observations) will have four lines. Although a gradual change in both the p-value and the test statistic is to be expected as observations are added and removed one at a time, it might be the case that adding (or removing) just one observation completely changes the result obtained with the original data set.
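The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R code referred to in Appendix A: it uses the $G^2$ statistic under independence, the helper names (`chi2_sf`, `g2_pvalue`, `gdt_curves`) are hypothetical, and plotting is left out so that the sketch depends only on the standard library. Each returned list can be drawn as one line of the diagnostic plot, with $k$ on the horizontal axis.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        term = math.exp(-z)
        total = term
        for i in range(1, df // 2):
            term *= z / i
            total += term
        return total
    total = math.erfc(math.sqrt(z))
    term = math.sqrt(z) * math.exp(-z) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= z / (i + 0.5)
    return total

def g2_pvalue(table):
    """p-value of the G^2 test of independence (steps 1 and 2 for a given table)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # Cells with a zero count contribute nothing to G^2.
    stat = 2.0 * sum(o * math.log(o * n / (rows[i] * cols[j]))
                     for i, r in enumerate(table)
                     for j, o in enumerate(r) if o > 0)
    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2_sf(stat, df)

def gdt_curves(table, max_k, sign=+1):
    """For every cell, p-values after changing that cell by sign * k, k = 0..max_k.

    sign=+1 adds observations, sign=-1 removes them; when removing, k is capped
    at the cell count so that no entry becomes negative.
    """
    curves = {}
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            limit = max_k if sign > 0 else min(max_k, count)
            pvals = []
            for k in range(limit + 1):
                modified = [list(r) for r in table]  # other cells held constant
                modified[i][j] = count + sign * k
                pvals.append(g2_pvalue(modified))
            curves[(i + 1, j + 1)] = pvals  # 1-based cell labels, as in the text
    return curves

# Hypothetical balanced 2 x 2 table, used only to show the shape of the output.
removed = gdt_curves([[5, 5], [5, 5]], max_k=5, sign=-1)
for cell, pvals in removed.items():
    print(cell, [round(p, 3) for p in pvals])
```

Returning the curves as a dictionary keyed by cell keeps the computation separate from the graphics, so any plotting library can be used for step 3.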

Examples
In this section we will illustrate the proposed GDT with six examples, three of them from the literature.

Real Data
Example 1. Hand ability and gender. Table 3(a) shows the number of left- and right-handed children, by gender, in a sample of 476 elementary school children whose teachers were asked which hand their students used more skilfully (Correa 2002). The question of interest is whether hand ability and gender are independent. For these data, the $\chi^2$-based test of independence gives $\chi^2_M = 0.0263$ and $p_M = 0.8712$, from which we conclude that these variables can be considered independent. As shown in Figure 1(a), adding slightly more than 20 observations to cell (2,1) (i.e., left-handed girls), whilst leaving the other cells intact, would have shown that gender and handedness are not independent. A similar result would have been obtained if slightly more than 20 observations had been removed from that very same cell, or from the cell representing the left-handed boys exclusively (see Figure 1(b)). Since such large changes are needed to alter the conclusion, the independence model is robust enough and our conclusion reliable.
Example 2. Polygraph evaluation. Simonoff (2003, p. 221) presented the results of a polygraph evaluation applied to twenty individuals, as shown in Table 3(b). The LRT for testing independence produces $G^2_M = 3.4539$ and $p_M = 0.0631$. Using a type I error probability of 5%, the hypothesis of independence is not rejected. But how robust is this result? Figures 1(c)-(d) show how fragile the independence model is. In fact, adding just one observation to cells (1,1) or (2,2), or removing one observation from cells (1,2) or (2,1), would have changed our conclusion.
Example 3. Hand preference and brain injury. The brain pathology hypothesis argues that brain injury leads to hand preference. Vlachos et al. (2013) found no evidence to support such an association (see Table 3(c)). This finding is corroborated using the $G^2$ statistic, which gives $G^2_M = 3.031$ and $p_M = 0.21969$. However, if three more cases of mixed-handed participants with brain injury had been added (i.e., cell (2,2)) whilst leaving the other cells intact, an association between brain injury and handedness would have been found. Similarly, if three or four participants had been removed from cell (2,1), other things being equal, a significant result would have been found. Notice that adding more than 10 observations to cell (2,1) would also have led to a significant result (see Figure 1(e)-(f)).
These three cases exemplify how valuable our GDT is when testing hypotheses in the light of the specific theories and concepts under investigation. That said, the method only outputs the p-value associated with the test statistic in use (e.g., $G^2$ and $\chi^2$); it is up to the researcher to seek interpretations based on the data at hand and the underlying theories (i.e., those specific to the phenomena being studied).

Simulated Data
Simulating two-way CT situations that can occur in actual research is a rather cumbersome task since many factors would need to be considered. For instance, one can take into consideration the number of levels in each of the categories, the total number of observations, the percentage of observations assigned to each cell, the test statistic used to analyse the CT and the effect of missing data on the test statistic (Correa & Vélez 2014). Thus, in order to give the researcher a feel for what our method can offer, we only present the case of 2 × 2 CTs in which the percentage of observations per cell was kept equal (i.e., each cell had the same proportion of observations) and only the total number of observations varied ($n$ = 20, 40 and 100). Also, we assumed the data to be independent and used the $G^2$ statistic.
In a 2 × 2 CT with just five observations per cell, many more than five observations need to be added to any cell, or one cell needs to be emptied, in order to reject the independence model (first column, Figure 2). When a 2 × 2 CT has 10 observations per cell, leaving one cell with just one observation would suffice to reject independence between the categories. Alternatively, adding more than 10 observations to any of the cells would show an association (second column, Figure 2). When $n$ = 100, there seems to be more freedom in the number of observations that need to be added or removed in each cell in order to find an association between the categories. That is, just above 25 observations would need to be added to any cell, or just 15 or 16 observations would need to be removed from any cell, in order to reject independence (third column, Figure 2). These results simply confirm that it is more advantageous to have CTs with a large number of observations, so that the number of observations removed or added is proportionally small in relation to the cell size. Hence, this approach shows how much a model of perfect independence needs to change in order to show an association between the two categories, and that the CT's sample size plays a key role in this process.
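The balanced 2 × 2 scenario described above can be explored with a short Python sketch. It assumes the $G^2$ statistic with one degree of freedom and a 5% significance level; the function names (`g2_pvalue_2x2`, `smallest_rejecting_k`) are hypothetical, and the sketch is not the code used to produce Figure 2. For each sample size it searches for the smallest number of observations that must be added to, or removed from, a single cell before independence is rejected.

```python
import math

def g2_pvalue_2x2(table):
    """p-value of the G^2 test of independence for a 2 x 2 table (1 degree of freedom)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # Empty cells contribute 0 to the statistic.
    stat = 2.0 * sum(o * math.log(o * n / (rows[i] * cols[j]))
                     for i, r in enumerate(table)
                     for j, o in enumerate(r) if o > 0)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def smallest_rejecting_k(table, cell, sign, alpha=0.05, max_k=200):
    """Smallest k such that changing `cell` by sign * k gives p < alpha, else None."""
    i, j = cell
    limit = max_k if sign > 0 else min(max_k, table[i][j])
    for k in range(1, limit + 1):
        modified = [list(r) for r in table]  # other cells held constant
        modified[i][j] += sign * k
        if g2_pvalue_2x2(modified) < alpha:
            return k
    return None

# Balanced tables with 25% of the observations in each cell (n = 20, 40, 100).
for per_cell in (5, 10, 25):
    balanced = [[per_cell, per_cell], [per_cell, per_cell]]
    k_add = smallest_rejecting_k(balanced, (0, 0), +1)
    k_remove = smallest_rejecting_k(balanced, (0, 0), -1)
    print(f"n = {4 * per_cell}: add {k_add}, remove {k_remove}")
```

Because the tables are balanced, the choice of cell `(0, 0)` is arbitrary; by symmetry every cell gives the same thresholds.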

Discussion
One of the main tasks of any researcher consists in validating the models built to account for data, and models fitted to two-way contingency tables are no exception. In this paper, we have presented a graphical method to verify the robustness of a model fitted to a two-way contingency table. We strongly believe that this method is useful in model assessment and data analysis, easy to interpret and fully adaptable to any data represented in the form of two-way CTs. The GDT proposed here is useful for checking the robustness of a CT model that is suitable for categorical data analysis methods. As shown in the simulated CTs, going from an independence to an association model requires adding or removing a number of observations that becomes increasingly disproportionate to the cell size as the sample size of the CT shrinks.
As mentioned in the simulation section, many two-way CT cases could be considered for a simulation study; however, given the multiplicity of factors and levels, that is a rather complex task. Having said this, we believe the options presented in that section need to be tackled to better understand how the proposed method behaves under different circumstances. However, we reiterate that those simulations are valid merely for statistical purposes. In other words, researchers almost never have CTs that resemble a fully independent model; on the contrary, most, if not all, real CTs seek to demonstrate an association between the categories of interest. Precisely for this reason, we believe the proposed graphical test is helpful in assessing potential explanatory models underlying the categorical data under study.
Despite the virtues of our graphical test, some of its weaknesses need to be acknowledged. First, in its current form, our implementation of the method can only deal with two-way CTs, so further work is needed to extend it to multi-way CTs. In the hypothetical case of two-way CTs stratified by a third variable with $S$ categories (e.g., socioeconomic status), performing our GDT for each of the resulting $S$ two-way CTs and plotting them side by side is straightforward. Another possibility would be to split the CT and perform the GDT on each resulting two-way CT using different aesthetics to distinguish each cell (i.e., each cell in the CT would be represented by $S$ lines).
Second, the current implementation of the GDT removes or adds one observation at a time from each cell in the CT. Thus, the total number of lines being displayed will be $c = I \times J$, and for large values of $I$ and $J$ the resulting plot might be especially crowded. Even though the user is free to make the necessary changes to obtain an aesthetically appealing statistical graphic, we suggest using the test when no more than ten cells are present in the two-way CT. However, if the data are presented in a multi-way CT, then the two possibilities previously described could be a starting point. A further improvement would require that sets of pairs, triplets or $p$-tuples of cells are simultaneously evaluated whilst observations are added to and/or removed from them. Although this is not computationally challenging, its graphical representation is. The main difficulty is how to represent the $c!/\{(c-p)!\,p!\}$ possible $p$-tuples. When $p = 2$, a 3D graphical representation is suitable, but for $p > 2$ the alternatives are scarcer. We plan to tackle these issues in the near future, and provide an easy-to-use implementation in R.
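To give a sense of how quickly the number of $p$-tuples of cells grows, the count $c!/\{(c-p)!\,p!\}$ can be evaluated with a few lines of Python. The function name `n_cell_tuples` is hypothetical and the sketch is for illustration only.

```python
import math
from itertools import combinations

def n_cell_tuples(n_rows, n_cols, p):
    """Number of distinct p-tuples of cells in an n_rows x n_cols table: c! / ((c - p)! p!)."""
    c = n_rows * n_cols
    return math.comb(c, p)

# Pairs of cells in a 2 x 2 table, listed explicitly with 1-based (row, column) labels.
cells = [(i, j) for i in (1, 2) for j in (1, 2)]
pairs = list(combinations(cells, 2))
print(len(pairs), n_cell_tuples(2, 2, 2))

# Growth of the count for a 3 x 3 table as p increases.
for p in (1, 2, 3, 4):
    print(p, n_cell_tuples(3, 3, p))
```

Even for a 3 × 3 table the number of triplets is already in the dozens, which illustrates why a single plot with one line per tuple becomes hard to read.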

Figure 1: Graphical diagnostic test for Examples 1, 2, and 3. In Example 1, more than 20 observations need to be (a) added or (b) removed from two cells in order to reach statistical significance. In Example 2, (c) adding or (d) removing a couple of observations in a couple of cells is enough to obtain a significant result. In the last example, three observations would need to be (e) added to a cell or (f) removed from another cell to reach a significant result. The grey horizontal line corresponds to a type I error probability of 5%.

Figure 2: Graphical diagnostic test for three simulated 2 × 2 CTs varying the sample size $n$. The percentage of observations assigned to each cell is 25% in all cases. The grey horizontal line corresponds to a type I error probability of 5%.

Table 1: A two-way contingency table in which a total of $I \times J$ cells are present.

Table 2: Notation used in our alternative test for contingency tables.

Table 3: Contingency tables to exemplify the graphical diagnostic test.