Optimal Gaussian approximations to the posterior for log-linear models with Diaconis-Ylvisaker priors

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. Here we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.


1 Introduction
Contingency table analysis routinely relies on log-linear models, which represent the logarithm of cell probabilities as an additive model [Agresti, 2002]. With the standard choice of Multinomial or Poisson likelihood, these are exponential family models, and are routinely fit through maximum likelihood estimation [Fienberg & Rinaldo, 2007]. However, sparsity in the observed cell counts often makes maximum likelihood estimation infeasible in practical applications (see Haberman [1974] and Bishop et al. [2007]). In such cases, regularization is often used to obtain unique parameter estimates [Park & Hastie, 2007, Zou & Hastie, 2005].
A common Bayesian approach to inference in high-dimensional contingency tables is to place a conjugate prior on the parameters of a graphical or hierarchical log-linear model, and an independent prior over the space of all such models (see e.g. Massam et al. [2009]). This leads to a standard model-averaged posterior [Hoeting et al., 1998], where all possible sparse log-linear models in the chosen class are weighted by their posterior evidence. Use of non-conjugate (e.g. Gaussian) priors with computation by Markov chain Monte Carlo [Gelfand & Smith, 1990] has also been proposed [Dellaportas & Forster, 1999]. Although model averaging is generally considered ideal in high-dimensional settings, computational algorithms for posterior inference scale exceedingly poorly in $p$. The smallest contingency table corresponding to cross-classification of $p$ categorical variables has $2^p$ cells, so the corresponding saturated log-linear model has $2^p - 1$ free parameters and the model space grows super-exponentially in $p$. Accordingly, posterior computation is essentially infeasible for $p > 15$, the largest case demonstrated to date in the literature [Dobra & Massam, 2010] to the best of our knowledge.
Alternatively, one can place a Gaussian prior on the parameters of a saturated log-linear model to induce Tikhonov type regularization, and then perform computation by Markov chain Monte Carlo. This approach is well-suited to situations in which the sample size is not tiny relative to the table dimension, but where zero counts nonetheless exist in some cells. In this case, data augmentation Gibbs samplers such as that proposed by Polson et al. [2013] provide for conditionally conjugate updates. However, this by itself is computationally intensive relative to alternatives such as elastic net [Zou & Hastie, 2005], and can suffer from poor mixing. In principle, a more scalable Bayesian approach for producing Tikhonov regularized point estimates would be to utilize the Diaconis-Ylvisaker conjugate prior [Diaconis & Ylvisaker, 1979] on the parameters of the log-linear model, which is essentially computation free. The main drawback is that the resulting posterior distribution is difficult to work with, lacking closed form expressions for even marginal credible intervals or fast algorithms for sampling from the posterior. An accurate and more tractable approximation to this posterior is therefore of practical interest.
Approximations to the posterior distribution have a long history in Bayesian statistics, with the Laplace approximation perhaps the most common and simple alternative [Tierney & Kadane, 1986, Shun & McCullagh, 1995]. More sophisticated approximations, such as those obtained using variational methods [Attias, 1999], may in some cases be more accurate, but require computation similar to that for generic EM algorithms. Moreover, there exist no theoretical guarantees on the approximation error in finite samples, and these approximations are known to be inadequate even in relatively simple models [Wang & Titterington, 2004].
In this article, we propose a Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors. The approximation is shown to be the optimal Gaussian approximation to the posterior in the Kullback-Leibler divergence, and convergence rates to the exact posterior and a finite-sample Kullback-Leibler error bound are provided. The approximation is shown empirically to be accurate even for modest sample sizes; effectively, the empirical results suggest that the approximation is accurate enough to be used in place of the exact posterior within the range of sample sizes for which the posterior is sufficiently concentrated to be statistically useful. We also show how the approximation can be used to perform model selection using the penalized credible region method [Bondell & Reich, 2012]. In a real data application, the method performs favorably in model selection for graphical log-linear models compared to methods requiring vastly greater computational resources.

2 Background
We first provide a brief review of exponential families, then describe the family of conjugate priors for the natural parameter of an exponential family, referred to as Diaconis-Ylvisaker priors, and finally provide more detailed background on log-linear models for multinomial likelihoods and the associated Diaconis-Ylvisaker prior.

2.1 Exponential families
Following Diaconis & Ylvisaker [1979], let $\mu$ be a $\sigma$-finite measure defined on $(\mathbb{R}^p, \mathcal{B})$, where $\mathcal{B}$ denotes the Borel sets of $\mathbb{R}^p$. Let $\mathrm{supp}(\mu) = \{y \in \mathbb{R}^p : d\mu(y) > 0\}$ be the support of $\mu$, and define $\mathcal{Y}$ as the interior of the convex hull of $\mathrm{supp}(\mu)$. For $\theta \in \mathbb{R}^p$, define $M(\theta) = \log \int_{\mathcal{Y}} e^{\theta^T y}\, d\mu(y)$, and let $\Theta = \{\theta \in \mathbb{R}^p : M(\theta) < \infty\}$, which we assume is an open set. We refer to $\Theta$ as the natural parameter space. The exponential family of probability measures $\{P(\cdot\,; \theta)\}$ indexed by a parameter $\theta \in \Theta$ is defined by
$$dP(y; \theta) = e^{\theta^T y - M(\theta)}\, d\mu(y), \qquad \theta \in \Theta.$$
This family includes many of the probability distributions commonly used as sampling models in likelihood-based statistics. Diaconis & Ylvisaker [1979] develop the family of conjugate priors for the parameter $\theta$ of regular exponential family likelihoods. These Diaconis-Ylvisaker priors are given by
$$d\pi(\theta; n_0, y_0) \propto e^{\theta^T y_0 - n_0 M(\theta)}\, d\theta.$$
On observing data $y$ consisting of $n$ observations with sufficient statistic $\bar{y}$, the posterior is then also Diaconis-Ylvisaker, with parameters $n_0 + n$, $y_0 + \bar{y}$, i.e. $d\pi(\theta \mid y) = d\pi(\theta; n_0 + n, y_0 + \bar{y})$. In the sequel we focus on one member of the exponential family, the multinomial. In the natural parametrization, the multinomial likelihood gives rise to the log-linear model and the closely related multinomial logit model, which we now describe.

2.2 Log-linear models
Let S d " tpx 1 , . . . , x d q P r0, 1s d : ř d j"1 x j ď 1u denote the d-dimensional unit simplex. Consider N independent samples from a categorical variable with pd`1q levels. We denote the levels of the variable by 0, 1, . . . d, without loss of generality. Let y j denote the number of times the jth level is observed in the N samples and set y " py 0 , y 1 , . . . , y d q T ; clearly ř d j"0 y j " N . The joint distribution of y is given by a multinomial distribution, denoted y " Multinomial pN, πq, which is parametrized by π " pπ 1 , . . . , π d q T P S d , where π j is the probability of observing the jth level for j " 1, . . . , d.
The log-linear model is a generalized linear model for multinomial likelihoods obtained by choosing the logistic link function, which also results in the natural exponential family parametrization. Define the logistic transformation $\ell : \mathbb{R}^d \to S^d$ and its inverse log-ratio transformation $\ell^{-1} : S^d \to \mathbb{R}^d$ as
$$\ell(\theta)_j = \frac{e^{\theta_j}}{\sum_{l=0}^d e^{\theta_l}}, \qquad \ell^{-1}(\pi)_j = \log\frac{\pi_j}{\pi_0}, \qquad j = 1, \ldots, d, \tag{3}$$
where $\pi_0 = 1 - \sum_{j=1}^d \pi_j$ and $\theta_0 = 0$. We shall write $\pi = \ell(\theta)$ and $\theta = \ell^{-1}(\pi) = \log(\pi/\pi_0)$, respectively, to denote the transformations in (3). Using (3), the multinomial likelihood in the log-linear parametrization can be expressed as
$$p(y \mid \theta) \propto \exp\bigg\{ \sum_{j=1}^d y_j \theta_j - N \log\Big(1 + \sum_{l=1}^d e^{\theta_l}\Big) \bigg\}. \tag{4}$$
An important motivating case is when $y = \mathrm{vec}(n)$, with $n$ a contingency table arising from cross-classification of $N$ independent observations on $p$ categorical variables $y_1, \ldots, y_p$. Suppose that the $v$th variable $y_v$ has $d_v$ many levels, so that the contingency table has $\prod_{v=1}^p d_v$ many cells, and $y$ is a $(d+1)$-dimensional vector of counts with $d = \prod_{v=1}^p d_v - 1$. We refer to the parametrization $\theta = \log(\pi/\pi_0)$ in the contingency table setting as the identity parametrization. Also of particular interest in this setting are reparametrizations of (3) that represent $\log(\pi/\pi_0)$ as an additive model involving parameters that correspond to interactions among $y_1, \ldots, y_p$. Every identified parametrization of the log-linear model for the multinomial likelihood can be represented as
$$\log(\pi/\pi_0) = X\theta^*, \tag{5}$$
where $X$ is a $d \times d$ non-singular binary matrix and $\theta^* \in \mathbb{R}^d$. In the simulations and application, we make a specific choice of $X$ that corresponds to the corner parametrization of the log-linear model [Massam et al., 2009]. We illustrate the identity and corner parametrizations through a $2^3$ contingency table in Example 2.1 below. Details for the general case can be found in the Appendix.
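Example 2.1 (sketch, assuming three binary variables $y_1, y_2, y_3$, so that $d + 1 = 2^3$). Index cells by $i = (i_1, i_2, i_3) \in \{0,1\}^3$ with base cell $i^* = (0,0,0)$. The identity parametrization assigns one parameter $\theta_i = \log\{\pi(i)/\pi(i^*)\}$ to each of the seven non-base cells. The corner parametrization instead writes
$$\log\frac{\pi(i)}{\pi(i^*)} = \theta^*_1 i_1 + \theta^*_2 i_2 + \theta^*_3 i_3 + \theta^*_{12}\, i_1 i_2 + \theta^*_{13}\, i_1 i_3 + \theta^*_{23}\, i_2 i_3 + \theta^*_{123}\, i_1 i_2 i_3,$$
so that, stacking the seven non-base cells, $\theta = X\theta^*$ with $X$ the $7 \times 7$ binary matrix whose $(i, F)$ entry equals one exactly when the interaction set $F$ is contained in the support of $i$.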

2.3 Conjugate priors for log-linear models
We now present the Diaconis-Ylvisaker prior for the multinomial likelihood (4) and derive an optimal Gaussian approximation to the corresponding posterior in Kullback-Leibler divergence. Extensions to log-linear models with a non-identity parametrization (i.e., $X \ne I_d$ in (5)) are straightforward by invariance properties of the Kullback-Leibler divergence and are discussed subsequently. All proofs are deferred to the Appendix. For the multinomial likelihood (4), the Diaconis-Ylvisaker prior is obtained by applying the inverse logistic transformation $\ell^{-1}$ to a Dirichlet distribution, which, not surprisingly, is the conjugate prior for $\pi$. Recall that $\pi_0 = 1 - \sum_{j=1}^d \pi_j$. The Dirichlet distribution $\mathrm{D}(\alpha)$ on $S^d$ with parameter vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_d)^T$ has density
$$q(\pi; \alpha) = \frac{\Gamma\big(\sum_{j=0}^d \alpha_j\big)}{\prod_{j=0}^d \Gamma(\alpha_j)} \prod_{j=0}^d \pi_j^{\alpha_j - 1} \tag{6}$$
and corresponding probability measure $\mathrm{Q}(\cdot\,; \alpha)$ with $\mathrm{Q}(A; \alpha) = \int_A q(\pi; \alpha)\, d\pi$ for Borel subsets $A$ of $S^d$.
Proposition 2.2. Suppose $\pi \sim \mathrm{D}(\alpha)$ and let $\theta = \log(\pi/\pi_0) \in \mathbb{R}^d$. Define $A = \sum_{j=0}^d \alpha_j$. Then $\theta$ has a density on $\mathbb{R}^d$ given by
$$p(\theta; \alpha) = \frac{\Gamma(A)}{\prod_{j=0}^d \Gamma(\alpha_j)} \exp\bigg\{ \sum_{j=1}^d \alpha_j \theta_j - A \log\Big(1 + \sum_{l=1}^d e^{\theta_l}\Big) \bigg\}. \tag{7}$$
We write $\theta \sim \mathrm{LD}(\alpha)$ and use $\mathrm{P}(\cdot\,; \alpha)$ to denote the probability measure associated with the density (7), with $\mathrm{P}(B; \alpha) = \int_B p(\theta; \alpha)\, d\theta$ for Borel subsets $B$ of $\mathbb{R}^d$. If a non-identity parametrization $\theta = X\theta^*$ as in (5) is employed, then we denote the induced distribution on $\theta^* = X^{-1}\theta$ by $\mathrm{P}_X(\cdot\,; \alpha)$ and the corresponding density by $p_X(\cdot\,; \alpha)$.
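Since Proposition 2.2 characterizes $\mathrm{LD}(\alpha)$ as the push-forward of a Dirichlet under the log-ratio map, sampling from it is immediate; a minimal Python sketch (the function name is ours, not from any accompanying package):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_LD(alpha, size):
    """Draw from LD(alpha): sample pi ~ Dirichlet(alpha) on the simplex,
    then apply the log-ratio map theta_j = log(pi_j / pi_0), j = 1, ..., d."""
    pi = rng.dirichlet(alpha, size=size)      # columns pi_0, pi_1, ..., pi_d
    return np.log(pi[:, 1:] / pi[:, :1])      # (size, d) array of theta draws

theta = sample_LD(np.ones(5), size=10_000)    # toy example with d = 4
```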
Proposition 2.3. Suppose $\theta \sim \mathrm{LD}(\alpha)$. Then $\mathrm{E}(\theta_j) = \psi(\alpha_j) - \psi(\alpha_0)$ and $\mathrm{cov}(\theta_j, \theta_{j'}) = \delta_{jj'}\psi_1(\alpha_j) + \psi_1(\alpha_0)$ for $j, j' = 1, \ldots, d$, where $\psi$ and $\psi_1$ denote the digamma and trigamma functions.
The proof of Proposition 2.3 is established within the proof of Theorem 3.1 in the Appendix. Assume the data $y$ are generated from a $\mathrm{Multinomial}(N, \pi^0)$ distribution and let $\theta^0 = \log(\pi^0/\pi^0_0)$ be the true log-linear parameter, where $\pi^0_0 = 1 - \sum_{j=1}^d \pi^0_j$. If a $\mathrm{LD}(\alpha)$ prior is placed on $\theta$, one can use Proposition 2.3 to show that the posterior mean $\mathrm{E}(\theta \mid y)$ converges almost surely to $\theta^0$ with increasing sample size, and the posterior covariance $\mathrm{cov}(\theta \mid y)$ converges to the inverse Fisher information matrix, as long as the entries of the prior hyperparameter $\alpha$ are suitably bounded. In fact, a Bernstein-von Mises type result can be established, showing that the posterior distribution approaches a Gaussian distribution, centered at the true parameter value and with covariance the inverse Fisher information matrix, in the total variation metric. We do not pursue such frequentist asymptotic validations further in this paper. Our goal rather is to provide a Gaussian approximation to the posterior distribution that can be used in practice, and to provide finite-sample bounds on the approximation error.

3 Main results
In this section, we provide the optimal Gaussian approximation to a $\mathrm{LD}(\beta)$ distribution (7) in the Kullback-Leibler divergence, i.e., we exhibit a vector $\mu^* \in \mathbb{R}^d$ and a positive definite matrix $\Sigma^*$ such that the Kullback-Leibler divergence between $\mathrm{LD}(\beta)$ and $\mathrm{N}(\mu^*, \Sigma^*)$ is the minimum among all Gaussian distributions. This result provides a readily available Gaussian approximation to the posterior distribution $\mathrm{LD}(\beta = \alpha + y)$ of the log-linear parameter $\theta$ in (4) under a Diaconis-Ylvisaker prior $\mathrm{LD}(\alpha)$. We also provide a non-asymptotic error bound for the Kullback-Leibler approximation; using Pinsker's inequality, the approximation error in the total variation distance can then be bounded in finite samples.

Theorem 3.1. Let $\theta \sim \mathrm{LD}(\beta)$ for $\beta = (\beta_0, \ldots, \beta_d)^T$ with $\beta_j > 0$ for all $j$, and define
$$\mu^*_j = \psi(\beta_j) - \psi(\beta_0), \qquad \Sigma^*_{jj'} = \delta_{jj'}\psi_1(\beta_j) + \psi_1(\beta_0), \qquad j, j' = 1, \ldots, d. \tag{8}$$
Then $D\{\mathrm{LD}(\beta) \,\|\, \mathrm{N}(\mu^*, \Sigma^*)\} \le D\{\mathrm{LD}(\beta) \,\|\, \mathrm{N}(\mu, \Sigma)\}$ for every $\mu \in \mathbb{R}^d$ and positive definite $\Sigma$. Moreover, when $\beta_j \ge 1/2$ for all $j$, the minimum value admits an explicit finite-sample upper bound (10) of order $\sum_{j=0}^d \beta_j^{-1}$.
The matrix $\Sigma^*$ has a compound-symmetry structure and is therefore positive definite. From Proposition 2.3, the parameters $\mu^*$ and $\Sigma^*$ of the optimal Gaussian approximation are precisely the mean and covariance matrix of the $\mathrm{LD}(\beta)$ distribution. Equation (10) provides an upper bound on the approximation error. In the posterior, $\beta_j = \alpha_j + y_j$ and $B = \sum_{j=0}^d \alpha_j + N$. The condition $\beta_j \ge 1/2$ is therefore satisfied whenever every category has at least one observation. The approximation error is approximately of the order of $\sum_{j=0}^d (\pi^0_j N)^{-1}$, where as before $\pi^0_j$ denotes the true probability of category $j$. In the best case where all the categories receive approximately equal probability, i.e., $\pi^0_j \approx (d+1)^{-1}$, the approximation error is $O(d^2/N)$. However, the convergence rate in $N$ can be slower if some of the $\pi^0_j$ are very small. In other words, the lower the entropy of the data-generating distribution, the worse the approximation, although our simulations suggest that the approximation is practicable even for moderate sample sizes and unbalanced category probabilities. When one considers that the eigenvalues of the covariance matrix enter into the constant in Berry-Esseen convergence rates, and that here the covariance of the data is given by $\mathrm{diag}(\pi^0) - \pi^0(\pi^0)^T$, it appears that a similar phenomenon is at work here.
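In practice the approximation requires only digamma and trigamma evaluations; a minimal Python sketch of (8), with illustrative function names:

```python
import numpy as np
from scipy.special import digamma, polygamma

def optimal_normal(alpha, y):
    """Parameters (mu*, Sigma*) of the oN approximation in (8) for the
    LD(beta) posterior with beta = alpha + y (identity parametrization)."""
    beta = np.asarray(alpha, float) + np.asarray(y, float)
    b0, b = beta[0], beta[1:]
    mu = digamma(b) - digamma(b0)
    # Compound-symmetry covariance: diagonal trigammas plus a constant.
    Sigma = np.diag(polygamma(1, b)) + polygamma(1, b0)
    return mu, Sigma

# Toy 4-cell table with a uniform Diaconis-Ylvisaker prior.
mu_star, Sigma_star = optimal_normal(np.ones(4), [12, 5, 0, 3])
```

Approximate marginal credible intervals are then immediate, e.g. $\mu^*_j \pm z_{0.975}\sqrt{\Sigma^*_{jj}}$.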
The main idea behind our proof is to exploit the invariance of the Kullback-Leibler divergence under bijective transformations and transfer the domain of the problem from $\mathbb{R}^d$ to $S^d$. Since an $\mathrm{LD}(\beta)$ distribution is obtained from a Dirichlet $\mathrm{D}(\beta)$ distribution via the log-ratio transform $\ell^{-1}$, the problem of finding the best Gaussian approximation to $\mathrm{LD}(\beta)$ is equivalent to finding the best approximation to $\mathrm{D}(\beta)$ among the class of distributions obtained by applying the logistic transform to Gaussian random variables. If $\theta \sim \mathrm{N}(\mu, \Sigma)$, the induced distribution on $\pi = \ell(\theta)$ is called a logistic normal distribution, denoted $\mathrm{L}(\mu, \Sigma)$, and has density on $S^d$ given by
$$\tilde q(\pi; \mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \Big(\prod_{j=0}^d \pi_j\Big)^{-1} \exp\Big[ -\frac{1}{2}\big\{\log(\pi/\pi_0) - \mu\big\}^T \Sigma^{-1} \big\{\log(\pi/\pi_0) - \mu\big\} \Big]. \tag{11}$$
The problem therefore boils down to calculating the Kullback-Leibler divergence between a Dirichlet density $q(\cdot\,; \beta)$ and a logistic normal density $\tilde q(\cdot\,; \mu, \Sigma)$ and optimizing the resulting expression with respect to $\mu$ and $\Sigma$. The details are deferred to the Appendix.
Once the approximation is derived in the identity parametrization, we appeal to the invariance of the Kullback-Leibler divergence under one-to-one transformations to obtain the corresponding approximation in a non-identity parametrization $\theta = X\theta^*$ as in (5) for any non-singular $X$. The result is stated below.
Corollary 3.2. Let $\mu^*$ and $\Sigma^*$ be as in Theorem 3.1 and define $\mu^*_X = X^{-1}\mu^*$ and $\Sigma^*_X = X^{-1}\Sigma^*(X^{-1})^T$ for any full-rank $d \times d$ matrix $X$. Then
$$D\{\mathrm{P}_X(\cdot\,; \beta) \,\|\, \mathrm{N}(\cdot\,; \mu^*_X, \Sigma^*_X)\} \le D\{\mathrm{P}_X(\cdot\,; \beta) \,\|\, \mathrm{N}(\cdot\,; \mu, \Sigma)\}$$
for every $\mu \in \mathbb{R}^d$ and positive definite $\Sigma$. Moreover, the bound on the Kullback-Leibler divergence as a function of $\beta$ in (10) is attained for $D\{\mathrm{P}_X(\cdot\,; \beta) \,\|\, \mathrm{N}(\cdot\,; \mu^*_X, \Sigma^*_X)\}$. Thus, the best Gaussian approximation to the posterior (in the Kullback-Leibler sense) under the Diaconis-Ylvisaker prior transforms consistently under any one-to-one linear reparametrization $\theta = X\theta^*$. We refer to this as the optimal Normal (oN) approximation.
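A sketch of the corresponding computation for a non-identity parametrization, continuing the snippet above:

```python
import numpy as np

def optimal_normal_X(mu_star, Sigma_star, X):
    """oN approximation for theta* = X^{-1} theta when log(pi/pi_0) = X theta*."""
    Xinv = np.linalg.inv(X)
    return Xinv @ mu_star, Xinv @ Sigma_star @ Xinv.T
```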

4 Simulations
We conducted several simulation studies to assess the performance of the approximation in Theorem 3.1 and Corollary 3.2. In each study, we simulated 100 realizations from $\pi \sim \mathrm{D}(a, \ldots, a)$, $y \sim \mathrm{Multinomial}(N, \pi)$, with the posterior of $\pi$ under a Dirichlet $\mathrm{D}(a, \ldots, a)$ prior being $\mathrm{D}(y_0 + a, \ldots, y_d + a)$. We chose $d + 1 = 2^8$, corresponding to a $p = 8$-way contingency table for binary variables. To obtain a simulation-based approximation to the posterior for $\theta = \log(\pi/\pi_0)$ under the Diaconis-Ylvisaker prior, we sampled $mc$ many $\pi$ values from the $\mathrm{D}(y_0 + a, \ldots, y_d + a)$ posterior and then transformed to $\theta = \ell^{-1}(\pi)$ to obtain posterior samples of $\theta$; we refer to this procedure as the Monte Carlo approximation. We also computed a Laplace approximation to the posterior under the Diaconis-Ylvisaker prior, which is given by $\mathrm{Normal}\big(\hat\theta_{MAP}, I(\hat\theta_{MAP})^{-1}\big)$, where $\hat\theta_{MAP}$ is the maximum a posteriori estimate of $\theta$ and $I(\theta)$ is the Fisher information matrix evaluated at $\theta$. The maximum a posteriori estimate $\hat\theta_{MAP}$ was computed by the Newton-Raphson method.
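A sketch of this Laplace benchmark, assuming the log-posterior $\sum_{j=1}^d \beta_j\theta_j - B\log(1 + \sum_l e^{\theta_l})$ with $\beta = \alpha + y$ and $B = \sum_{j=0}^d \beta_j$ (the function name is illustrative):

```python
import numpy as np

def laplace_approximation(alpha, y, iters=100, tol=1e-10):
    """Laplace benchmark Normal(theta_MAP, H^{-1}), with theta_MAP found by
    Newton-Raphson on the DY log-posterior (identity parametrization)."""
    beta = np.asarray(alpha, float) + np.asarray(y, float)
    B, b = beta.sum(), beta[1:]
    theta = np.zeros(len(b))
    for _ in range(iters):
        e = np.exp(theta)
        pi = e / (1.0 + e.sum())                  # pi_1, ..., pi_d
        grad = b - B * pi
        H = B * (np.diag(pi) - np.outer(pi, pi))  # negative Hessian
        theta += np.linalg.solve(H, grad)         # Newton step
        if np.max(np.abs(grad)) < tol:
            break
    e = np.exp(theta)
    pi = e / (1.0 + e.sum())
    H = B * (np.diag(pi) - np.outer(pi, pi))
    return theta, np.linalg.inv(H)
```

The curvature used here is the negative Hessian of the log-posterior, $B\{\mathrm{diag}(\pi) - \pi\pi^T\}$; with a small prior this is close to the Fisher information $N\{\mathrm{diag}(\pi) - \pi\pi^T\}$ referenced in the text.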
We compare the accuracy of the proposed Gaussian approximation to the Monte Carlo procedure and the Laplace approximation. In addition to the identity parametrization, i.e., $X = I_d$ in (5), we also consider the corner parametrization given by $\log(\pi/\pi_0) = X\theta^*$ for an appropriate $X$ matrix; see the Appendix for more details. For the Monte Carlo samples, each sample of $\theta$ is transformed to $\theta^*$ via $\theta^* = X^{-1}\theta$. For the normal approximations $\theta \sim \mathrm{Normal}(\mu, \Sigma)$, the corresponding approximate posterior is $\theta^* \sim \mathrm{Normal}\big(X^{-1}\mu, X^{-1}\Sigma (X^{-1})^T\big)$.
We conduct simulations for different values of $N$ (250, 1,000, and 10,000) and $a$ (1 and $1/d$). We then assess performance in several ways.
• Coverage of 95 percent posterior credible intervals for $\theta$ or $\theta^*$.
• The standardized loss in the Frobenius norm for estimates of $\Sigma$, the posterior covariance, given by $\|\hat\Sigma - \Sigma\|_F / \|\Sigma\|_F$, where $\|S\|_F$ is the Frobenius norm of $S$. Note that the covariance in Theorem 3.1 is exactly the posterior covariance, so this measure is computed only for the Monte Carlo and Laplace approximations.
• The value of the Kolmogorov-Smirnov statistic for comparing the Monte Carlo empirical measure $\frac{1}{mc}\sum_{t=1}^{mc} \delta_{\theta_t}$ to the normal approximation from Theorem 3.1, $\mathrm{Normal}(\mu, \Sigma)$.
• The computation time required to compute each posterior approximation.

Table 1 shows unexplained variation for the Laplace approximation, the Monte Carlo approximation for $mc = 10^3, 10^4, 10^5$, and $10^6$, and the optimal normal approximation. As expected, the optimal normal approximation outperforms the Laplace approximation. Moreover, it is comparable to the Monte Carlo approximation at every sample size and for all of the values of $mc$ considered. Performance for all approximations is noticeably better in the corner parametrization than in the identity parametrization.

Table 2 shows coverage of approximate 95 percent credible intervals for the Laplace approximation, the optimal normal approximation, and the Monte Carlo approximation. The intervals derived using the Laplace approximation are universally too wide. Nominal coverage for the Monte Carlo approximation is insensitive to the value of $mc$ in the range tested, and is slightly high at the two smaller sample sizes. The optimal normal approximation has the best coverage; in all cases it is between 0.94 and 0.96, and for $N = 10{,}000$ the coverage is 0.95 in both parametrizations.

Table 3 shows the dependence of $\|\hat\Sigma - \Sigma\|_F / \|\Sigma\|_F$ on $mc$ for the two parametrizations and three sample sizes considered. Note that $\Sigma$ is known exactly since $\Sigma = \Sigma^*$, the posterior covariance under the Diaconis-Ylvisaker prior. The main point of this table is to demonstrate the relatively large number of Monte Carlo samples required to obtain reasonably small error in estimation of the posterior covariance. Even with $10^5$ samples the relative error is in the 1 percent range. Thus, compound linear hypothesis testing and computation of credible regions are very inefficient using the Monte Carlo method.

Table 4 shows the computation time in seconds for each of the three approximations. The Laplace approximation is fast, requiring about 0.03-0.04 seconds at all sample sizes. The optimal normal approximation is about an order of magnitude faster still, with the computation time arising mainly from the polygamma functions and matrix multiplications. The Monte Carlo approximation is about four orders of magnitude slower than the optimal normal approximation. Here, only $mc = 10^6$ is considered because of the non-negligible error in the posterior covariance for smaller Monte Carlo sample sizes; the algorithm scales linearly in $mc$, so for $mc = 10^5$ the required time would be approximately 3 seconds. Only about 100 Monte Carlo samples could be obtained in the 0.003 seconds required to compute the optimal normal approximation.

The results in the previous tables make clear that the optimal normal approximation is superior to the other approximations considered in terms of point estimation, estimation of 95 percent credible intervals, covariance estimation, and computation time. However, it is possible that differences between the optimal normal approximation and the exact posterior exist in the tails of the distribution. To assess this, we compare the empirical measure of the Monte Carlo approximation using $mc = 10^6$ samples to the optimal normal approximation by computing the Kolmogorov-Smirnov (KS) statistic for the marginal distributions of 20 randomly selected entries of $\theta$. The entries considered were re-selected for each of the 100 replicate simulations and for each of the three sample sizes. Figure 1 shows histograms of these KS statistics in the corner and identity parametrizations. Most are less than 0.02, and none are greater than 0.07.
Considering that the KS statistic estimates the maximum discrepancy between the two marginal distribution functions, a lower bound on the total variation distance, this indicates that the optimal normal approximation is an excellent approximation to the posterior marginals. Moreover, we cannot rule out the possibility of residual Monte Carlo error in the marginals from the Monte Carlo approximation, which may account for part of the observed discrepancy.
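The marginal comparison can be reproduced in outline as follows; a sketch assuming posterior draws are stored row-wise (function name is ours):

```python
import numpy as np
from scipy import stats

def ks_marginals(theta_mc, mu, Sigma, n_entries=20, seed=0):
    """KS statistics between Monte Carlo marginals of theta (rows = draws)
    and the corresponding oN marginals N(mu_j, Sigma_jj)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(theta_mc.shape[1], size=n_entries, replace=False)
    sd = np.sqrt(np.diag(Sigma))
    return [stats.kstest(theta_mc[:, j], 'norm', args=(mu[j], sd[j])).statistic
            for j in idx]
```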

Figure 1: Distribution of Kolmogorov-Smirnov statistics comparing $\frac{1}{mc}\sum_{t=1}^{mc}\delta_{\theta_t}$ to the oN approximation for 20 randomly selected entries of $\theta$ over 100 replicate simulations (entries of $\theta$ were re-selected for each replicate). Left panel: identity parametrization; right panel: corner parametrization.

5 Real Data Example
We consider the Rochdale data, a $2^8$ contingency table with $N = 665$ that is over 50 percent sparse, and for which the top ten cell counts all exceed 20. This dataset is described at length in Dobra & Lenkoski [2011]. We first assess the accuracy of the approximation to the full posterior under the Diaconis-Ylvisaker prior in the same manner as in §4, by comparing marginal posteriors computed using the approximation to those obtained from large Monte Carlo samples from the exact Dirichlet posterior transformed to the log-linear parametrization. For the log-linear model in the corner parametrization, the distribution of Kolmogorov-Smirnov statistics for the 255 entries of $\theta^*$, obtained by comparing $10^6$ Monte Carlo samples from the exact posterior to the optimal Gaussian approximation, is shown in Fig. 2. The distribution is very similar to that observed for the simulations in §4. Undoubtedly, the Diaconis-Ylvisaker prior is less well-suited to inference on important variable interactions in this dataset than the more sophisticated methods of Dobra & Lenkoski [2011] and Bhattacharya & Dunson [2012]. However, our approximation has the advantage of being essentially computation-free, whereas the methods of Dobra & Lenkoski [2011] and Bhattacharya & Dunson [2012] are computationally intensive even at this small scale. In many settings, particularly with modern large-scale problems, some loss of performance may be acceptable in order to obtain useful inferences instantaneously. Thus, we are interested in the extent to which our method can replicate the results of Dobra & Lenkoski [2011], which were similar to those of Bhattacharya & Dunson [2012] in many respects. We analyze performance in testing conditional independence hypotheses (i.e. learning an interaction graph).
A sparse $\theta^*$ is a set of measure zero with respect to the posterior under the Diaconis-Ylvisaker prior. To obtain a sparse point estimate of the interaction graph, we employ the penalized credible region approach of Bondell & Reich [2012]. This method produces a point estimate by finding the sparsest $\theta^*$ within a $1-\alpha$ credible region for $\theta^*$. Although the exact solution to this problem is intractable, Bondell & Reich [2012] show that it can be approximated using a lasso path, and provide software in the BayesPen R package [Wilson et al., 2015]. Using the lasso path from BayesPen, the selected model corresponding to any value of $\alpha \in (0,1)$ can be obtained as follows.
1. For the selected value of $\alpha$, find the $1-\alpha$ quantile of a $\chi^2$ distribution with $d-1$ degrees of freedom. Label this $\delta_{max}$.
2. For each candidate $\theta_0$ on the lasso path, compute its squared Mahalanobis distance from the posterior mean, $\delta(\theta_0) = (\theta_0 - \mu^*)^T (\Sigma^*)^{-1} (\theta_0 - \mu^*)$.
3. Find the sparsest model in the lasso path having $\delta(\theta_0) \le \delta_{max}$. This is the sparse point estimate; a code sketch follows the list.
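A minimal sketch of this selection rule, assuming the lasso path (e.g. from BayesPen) is supplied as an array whose rows are candidate coefficient vectors; the helper name is ours:

```python
import numpy as np
from scipy.stats import chi2
from scipy.linalg import solve

def pcr_select(path, mu, Sigma, alpha=0.1):
    """Steps 1-3 above: among candidate coefficient vectors on a lasso path
    (rows of `path`), return the sparsest one inside the 1-alpha credible
    region centred at posterior mean `mu` with covariance `Sigma`."""
    d = len(mu)
    delta_max = chi2.ppf(1 - alpha, df=d - 1)                      # step 1
    diffs = path - mu
    delta = np.einsum('ij,ij->i', diffs, solve(Sigma, diffs.T).T)  # step 2
    ok = np.flatnonzero(delta <= delta_max)
    assert ok.size, "no path point falls inside the credible region"
    nnz = np.count_nonzero(path[ok], axis=1)                       # step 3
    return path[ok[np.argmin(nnz)]]
```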
With 256 cells and 665 observations, the posterior under the saturated model with the Diaconis-Ylvisaker prior is very diffuse. To make a reasonable comparison, we obtain the posterior under the Diaconis-Ylvisaker prior for the marginal tables corresponding to all $\binom{8}{4} = 70$ unique subsets of four variables. For each of these marginal tables, we then utilize the penalized credible region procedure of Bondell & Reich [2012] to obtain a sparse model. For comparison, we utilize the median probability graphical model from Dobra & Lenkoski [2011], which is shown in Table 5. Specifically, for every subset of four variables, we obtain the marginal graph corresponding to the median probability model of Dobra & Lenkoski [2011] by removing the complement of the subset of nodes under consideration and moralizing, i.e. placing an edge between nodes that (1) have an edge between them in the full graph or (2) are connected solely by a path through nodes that were removed. We treat the graph obtained in this way as the standard for assessing performance of the penalized credible region applied to our Gaussian posterior approximation. We compute the true (false) negative and positive counts for the penalized credible region procedure applied to our Gaussian posterior approximation for all 70 marginal graphs, treating the corresponding marginal median probability graph from Dobra & Lenkoski [2011] as the truth. This produces a total of $70 \cdot \binom{4}{2} = 420$ dependent pseudo hypothesis tests. The results for $\alpha = 0.1$ in the penalized credible region procedure are shown in Table 5. We obtain a false discovery rate of 0.02 and an $F_1$ score of 0.89, indicating that for marginal tables of size $2^4$, the posterior approximation is useful for model selection on the Rochdale data.

Table 5: Left (CGGM Results): marginal posterior inclusion probabilities of edges (above the main diagonal) and indicators of edge inclusion in the median probability model (below the main diagonal) from the copula Gaussian graphical model estimated on the Rochdale data in Dobra & Lenkoski [2011]. Rows and columns correspond to the eight binary variables, labeled a-h. Right (Comparison to oN): edge classifications for all marginal tables of size $2^4$ from the copula Gaussian graphical model median probability model (columns, labeled CGGM) and the penalized credible region for the Gaussian approximation to the posterior under the DY prior (rows, labeled oN-PCR).

6 Discussion
Outside of linear models, conjugate priors are often non-standard or their multivariate generalizations are difficult to work with. This hampers uncertainty quantification because it is difficult to obtain posterior credible regions for parameters under such priors. Given that automatic and coherent quantification of uncertainty through the posterior is one of the chief advantages of a fully Bayesian approach, this limitation is a significant problem. The optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker conjugate priors derived here offers a highly accurate and essentially computation-free approximation to posterior credible regions for this important class of models. Interestingly, this Gaussian approximation is not the Laplace approximation, and it is both faster to compute and a better approximation to the posterior than the Laplace approximation. If similar results could be obtained for the posterior in other models, it would suggest that the Laplace approximation may not be an appropriate default Gaussian approximation to the posterior. The theoretical result provided here extends easily to cases where some categories cannot co-occur, i.e. structural zeros in contingency tables. Extensions to model selection using our approximation are also available via the penalized credible region approach. It seems reasonable that the strategy used here to obtain optimality and convergence rate guarantees could be extended to a larger class of generalized linear models by studying the properties of multivariate Gaussian distributions under inverse link transformations. This may also suggest a strategy for obtaining approximate credible intervals for parameters in the Bayesian model averaging context for generalized linear models with conjugate priors.

A Log-linear model details
The discussion here largely follows Massam et al. [2009] and Lauritzen [1996]. Let $V$ be the set of variables collected into a contingency table. For $\gamma \in V$, let $\mathcal{I}_\gamma$ denote the set of possible levels of $\gamma$; without loss of generality, we take this set to be a finite collection of sequential nonnegative integers. Let $\mathcal{I} = \prod_{\gamma \in V} \mathcal{I}_\gamma$ be the set of all possible combinations of levels of the variables in $V$. Every cell $i$ of the contingency table corresponds to an element of $\mathcal{I}$; thus $|\mathcal{I}| = d + 1$, where $d$ is defined as in the main text.
Following Lauritzen [1996], define a cell of the contingency table as $i = (i_\gamma, \gamma \in V)$, and let $\pi(i) = \mathrm{pr}[y_1 = i_1, \ldots, y_p = i_p]$. For any $E \subset V$, let $i_E = (i_\gamma, \gamma \in E)$ be the cell of the $E$-marginal table corresponding to the values in $i$ of the variables in $E$. Finally, designate the "base" cell $i^* = (0, 0, \ldots, 0)$. Thus, every $i$ can be written as $i = (i_E, i_{E^c})$, where $E$ is the subset of $V$ on which $i$ is nonzero. Then, the log-linear model in the corner parametrization is given by
$$\log\frac{\pi(i)}{\pi(i^*)} = \sum_{\emptyset \ne F \subseteq E} \theta_F(i_F),$$
where for any $F \subset V$, $\theta_F(i_F)$ is a parameter corresponding to the variables in $F$ taking the values in $i_F$, and the sum ranges over all nonempty subsets $F$ of $E$. Refer to Proposition 2.1 in Letac & Massam [2012] for a result showing how the model can be expressed in the form (5).
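For binary variables, the matrix $X$ in (5) implied by this parametrization can be constructed directly from the subset structure; a Python sketch (our own construction, assuming $d_v = 2$ for all $v$):

```python
import numpy as np
from itertools import product

def corner_design(p):
    """X for a 2^p table: rows index the non-base cells i in {0,1}^p, columns
    index nonempty interaction sets F; X[i, F] = 1 iff F is contained in the
    support of i, so that log(pi_i / pi_base) = (X theta*)_i as in (5)."""
    cells = [c for c in product((0, 1), repeat=p) if any(c)]
    supports = [frozenset(v for v in range(p) if c[v]) for c in cells]
    effects = supports  # for binary variables, one effect per non-base cell
    return np.array([[int(F <= S) for F in effects] for S in supports])

X = corner_design(3)   # the 7 x 7 matrix of Example 2.1
```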

B Proof of Proposition 2.2
This is readily seen by the change of variables theorem; one only needs some work to calculate the Jacobian term. The matrix of partial derivatives $J = (\partial\theta_j/\partial\pi_r)_{jr}$ is given by
$$J_{jr} = \frac{\delta_{jr}}{\pi_j} + \frac{1}{\pi_0}, \qquad j, r = 1, \ldots, d.$$
Write $J = U + uu^T$, where $u = (1 - \sum_{l=1}^d \pi_l)^{-1/2}(1, 1, \ldots, 1)^T$ and $U = \mathrm{Diag}(1/\pi_1, \ldots, 1/\pi_d)$. We then have $|J| = |U|(1 + u^T U^{-1} u)$ and therefore
$$|J| = \Big(\prod_{j=1}^d \pi_j^{-1}\Big)\Big(1 + \frac{1}{\pi_0}\sum_{l=1}^d \pi_l\Big) = \Big(\prod_{j=0}^d \pi_j\Big)^{-1}.$$
The proof is concluded by noting that $p(\theta; \alpha) = q(\ell(\theta); \alpha)\, |J|^{-1}$.

C Proof of main results
We first state some preparatory results that are used to prove the main results.
The trigamma function $\psi_1(z) = \frac{d}{dz}\psi(z)$ is the derivative of the digamma function $\psi$. We derive a simple bound for the trigamma function that is used in the sequel.

Lemma C.1. For any $z > 1/3$, the two-sided bound on $\psi_1(z)$ holds. The condition $z > 1/3$ is only required for the upper bound.
Finally, we state a useful result in Lemma C.2.

Lemma C.2. Let $X$ be a random vector on $\mathbb{R}^d$ with mean $\mu_X$ and positive definite covariance $\Sigma_X$, and let $\phi(\cdot\,; \mu, \Sigma)$ denote the $\mathrm{N}(\mu, \Sigma)$ density. Then $\mathrm{E}\{\log\phi(X; \mu, \Sigma)\}$ is maximized over $\mu \in \mathbb{R}^d$ and positive definite $\Sigma$ at $\mu = \mu_X$, $\Sigma = \Sigma_X$.

For the proof, observe that
$$2\,\mathrm{E}\{\log\phi(X; \mu_X, \Sigma_X)\} - 2\,\mathrm{E}\{\log\phi(X; \mu, \Sigma)\} = \log\frac{|\Sigma|}{|\Sigma_X|} + \mathrm{tr}(\Sigma^{-1}\Sigma_X) + (\mu_X - \mu)^T\Sigma^{-1}(\mu_X - \mu) - d.$$
The above quantity is non-negative since it equals $2D\{\mathrm{N}(\mu_X, \Sigma_X) \,\|\, \mathrm{N}(\mu, \Sigma)\}$, i.e., twice the Kullback-Leibler divergence between $\mathrm{N}(\mu_X, \Sigma_X)$ and $\mathrm{N}(\mu, \Sigma)$. Since $\mu$ and $\Sigma$ were arbitrary, the first part is proved. The second part has already been proved at the beginning.

C.2 Proof of Theorem 3.1 and Corollary 3.2
We can now give a proof of Theorem 3.1. Recall the Dirichlet density $q$ from (6) and the logistic normal density $\tilde q$ from (11). We shall write $q(\pi)$ and $\tilde q(\pi)$ in place of $q(\pi; \beta)$ and $\tilde q(\pi; \mu, \Sigma)$ henceforth for brevity. From (6) and (11),
$$\log\frac{q(\pi)}{\tilde q(\pi)} = \log B(\beta) + \sum_{j=0}^d \beta_j\log\pi_j + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\big\{\log(\pi/\pi_0) - \mu\big\}^T \Sigma^{-1} \big\{\log(\pi/\pi_0) - \mu\big\},$$
where $B(\beta) = \Gamma(\beta)\big/\prod_{j=0}^d \Gamma(\beta_j)$ and, in a slight abuse of notation, $\beta = \sum_{j=0}^d \beta_j$. Observe that $\mu$ and $\Sigma$ appear only in the last two terms on the right-hand side of the above display. Invoking Lemma C.2, it is therefore evident that $D(q \,\|\, \tilde q) = \mathrm{E}_q \log(q/\tilde q)$ is minimized when $\mu^* = \mathrm{E}_q \log(\pi/\pi_0)$ and $\Sigma^* = \mathrm{var}_q\{\log(\pi/\pi_0)\}$, and the minimum value of the Kullback-Leibler divergence is
$$D(q \,\|\, \tilde q^*) = \underbrace{\log B(\beta)}_{T_1} + \underbrace{\sum_{j=0}^d \beta_j\,\mathrm{E}_q\log\pi_j}_{T_2} + \underbrace{\frac{d}{2}\{1 + \log(2\pi)\}}_{T_3} + \underbrace{\frac{1}{2}\log|\Sigma^*|}_{T_4}. \tag{18}$$
Using standard properties of the Dirichlet distribution, or exponential family differential identities,
$$\mathrm{E}_q \log\pi_j = \psi(\beta_j) - \psi(\beta), \qquad \mathrm{cov}_q(\log\pi_j, \log\pi_l) = \psi_1(\beta_j)\delta_{jl} - \psi_1(\beta), \qquad j, l = 0, 1, \ldots, d. \tag{19}$$
Therefore, µj " E q log π j´Eq log π 0 " ψpβ j q´ψpβ 0 q for j " 1, . . . d. Next, σj j 1 " cov q plog π jĺ og π 0 , log π j 1´log π 0 q " δ jj 1 ψ 1 pβ j q`ψ 1 pβ 0 q for j, j 1 " 1, . . . , d. The expressions for µ˚and Σ˚are identical to (8), proving the first part of the theorem. Note this also establishes Proposition 2.3. We now proceed to bound each term in the expression for the minimum Kullback-Leibler divergence in (18); refer to them by T 1 , T 2 , T 3 and T 4 respectively. First, we have, T 1 :" log B β " log Γpβq´d In the above display, we used (14) to bound log Γpβq from above and log Γpβ j qs from below. The p´βq term in upper bound to log Γpβq cancels out the p´ř d j"0 β j q contribution from the lower bounds to the log Γpβ j qs. Next, " d ÿ j"0 β j ψpβ j`1 q´βψpβq *´d ăˆd ÿ j"0 β j log β j´β log β˙´d 2`1 12β .
In the first equality of the above display, we used (19). For the second equality, we used the identity $\psi(z+1) = \psi(z) + 1/z$ together with $\sum_{j=0}^d \beta_j = \beta$. The final bound uses the digamma bounds (15): from the upper bound in (15), $\beta_j\psi(\beta_j + 1) < \beta_j\log\beta_j + 1/2$ and hence $\sum_{j=0}^d \beta_j\psi(\beta_j+1) < \sum_{j=0}^d \beta_j\log\beta_j + (d+1)/2$; from the lower bound in (15), $\beta\psi(\beta+1) > \beta\log\beta + 1/2 - 1/(12\beta)$.