Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms

This manuscript is concerned with relating two approaches that can be used to explore complex dependence structures between categorical variables, namely Bayesian partitioning of the covariate space incorporating a variable selection procedure that highlights the covariates that drive the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss if they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making the exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not informative on the existence of interactions in a consistent manner. This work is of interest to those who utilize log-linear models, as well as practitioners such as epidemiologists that use clustering models to reduce the dimensionality in the data and to reveal interesting patterns on how covariates combine.

cell is expanded so that, µ(l) = a⊆P ξ a (l a ).
Here, l a = (l p , p ∈ a) indicates a marginal cell defined by the levels of a subset a of P. In the above sum, each term represents parameters ξ a , one for each combination of levels of the factors in a. As pointed out in Dellaportas and Forster (1999), the ξ a are equivalent to the parameters commonly used when a specific log-linear model is defined. When a = ∅, the ξ a is 1 a constant. When |a| = 1, ξ a is called a 'main effect'. When |a| = m, ξ a is an interaction of order m − 1. To ensure identifiability, certain ξ a parameters are apriori set equal to zero. If the number of individuals is fixed, say equal to n, the conditional distribution of the cell counts becomes multinomial and a constraint is added to the model to ensure that the total of the cell counts is n.

S2. Calculation of the T γ matrices
In this and the next Section, we demonstrate in detail some technical aspects of our methodology.
Throughout Sections S2 and S3 we use as a working example the real data analysis presented in Section 5.1 of the main manuscript, where six risk factors for coronary heart disease (CHD) were investigated.
To build the T Real data (CHD) γ matrix we follow the algorithm proposed in Section 3.3 of the main manuscript. Before reweighting so that the maximum element is one, we obtain, We reweight and obtain, . A working example for the selection of candidate edges Assume the currently accepted model in the reversible jump MCMC chain is 'ABC+DEF', and consider the T Real data (CHD) γ matrix.
The aforementioned elements of T Real data (CHD) γ can be arranged as, We divide each element with the sum of all nine elements, obtaining, Now, the elements above add up to one. We sample an edge using them as probabilities associated with each one of the candidate edges. It is then proposed to add the selected edge to the currently accepted graphical model.

Selecting an edge for possible removal
Elements (1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), of matrix T Real data (CHD) γ correspond to edges (AB), (AC), (BC), (DE), (DF ), (EF ). These edges form the currently accepted model, and one of them will become a candidate edge for removal. The aforementioned elements of T Real data (CHD) γ can be arranged as, To randomly select an edge for possible removal, we divide each element with the sum of all six elements, obtaining, Now, the elements above add up to one. We sample an edge using these elements as probabilities associated with each one of the candidate edges. It is then proposed to remove the selected edge from the currently accepted graphical model.

Selecting two edges to swap
To attempt to swap one edge with another, we select a candidate edge for addition and one for simultaneous removal exactly as described above.

S4. Parameter coefficients and design matrices for simulated data
We present the parameter coefficients and design matrices for the five simulated data sets.
Simulation 3 . .    Figures S1,S2 and S3 show the rate of accumulated posterior model probability for Simulations 1,2 and 3 respectively, with respect to the top ten models. We run four replications for each analysis, averaging over the results. The analyses are based on 110000 iterations of the reversible jump MCMC sampler, thinned to 11000 iterations. The top ten models and associated posterior probabilities were calculated using the search algorithm that combines PDV and the cluster specific approach in a balanced manner. We see that this algorithm performs very well in Simulation 1, and also in Simulation 3. It is not performing as well for Simulation 2, although differences in Simulations 2 and 3 are not very pronounced. The PDV algorithm performs worse than the thee other algorithms that employ cluster specific information, over the first 2000 iterations or so. As a note of caution, we must stress that the limited number of replications (4) for each analysis, suggests that increased variability is associated with the information presented in Figures S1,S2 and S3.

S6. Additional analysis based on pairwise odds-ratios
To gain some intuition on the performance of the different approaches, we conducted an additional investigation using the two real data sets analyzed in the submitted manuscript. Instead of the proposed T γ matrix, the evidence for the presence or absence of an edge is assessed using pairwise odds-ratios, and results are compared with results from the strategy we adopt in this manuscript. The odds-ratios were calculated by fitting a logistic regression between the two factors, where the preceding letter in alphabetical order was modelled as the outcome. The further the odds-ratio from 1, the stronger the pairwise relation between the two factors. If 1 is not contained in the 95% CI for the odds-ratio, we interpret this as a strong indication that the edge between the two factors is present. Note that quantifying the strength of the evidence for the existence of an edge provided by the pairwise strategy is not straightforward, as the calculated odds-ratio cannot be easily converted into a probability.
In the first real data illustration (Table S1, corresponding to matrix T Real data (CHD) γ in Section 5.1 in the main manuscript) the two methods (our strategy based on the T γ matrix and the one based on odds-ratios) perform comparably. We note that the pairwise strategy does not clearly indicate that 'F' has only a main effect (since the odds-ratio for the 'BF' edge does not include 1), thus creating a potential False Positive compared to our strategy.
The difference in the results from the two approaches is illustrated clearly in the second real data illustration. In Table S2 (corresponding to matrix T Real data (GE) (2nd run) γ in Section 5.2 in the main manuscript), we see that the pairwise strategy wrongly indicates that the 'DF' edge is absent, and thus would fail to provide evidence for a 'DEF' three-way interaction. (The p-value for the slope coefficient for the logistic regression between 'D' and 'F' is 0.23). Note that in all analyses we have performed using our strategy and the T γ matrices, we never encountered the case where the T γ matrix fails to capture the existence of an edge, either in a pairwise or a higher order interaction, failure which would be detrimental to a model search algorithm. We also note a prominent False Positive edge ('AB') indicated by the pairwise strategy.
This limited comparison was carried out to confirm empirically what could be anticipated by design, namely that an approach based on clustering will give different evidence for the existence of links compared to an approach based on pairwise testing, and that our approach has potential benefits for detecting higher order interactions.   iterations, thinned to 11000 iterations.