Goodness of fit for the logistic regression model using relative belief

A logistic regression model is a specialized model for product-binomial data. When a proper, noninformative prior is placed on the unrestricted model for the product-binomial model, the hypothesis H 0 of a logistic regression model holding can then be assessed by comparing the concentration of the posterior distribution about H 0 with the concentration of the prior about H 0. This comparison is effected via a relative belief ratio, a measure of the evidence that H 0 is true, together with a measure of the strength of the evidence that H 0 is either true or false. This gives an effective goodness of fit test for logistic regression.

( 1 ) and x = (x 1 , . . . , x k ), β = (β 1 , . . . , β k ) ∈ R k . While the use of this model is quite common, the question concerning whether or not the model actually holds has not been fully dealt with in the literature. It is our purpose here to develop a Bayesian approach to this problem. It is to be noted that, irrespective of whether or not (1) holds, Y | X 1 = x 1 , . . . , X k = x k ∼ Bernoulli(θ(x)) for some θ(x) ∈ [ 0, 1] and, if a sample of n(x) is taken at these settings of the predictors, then s(x) = So in this case the logistic regression is just a reparameterization of the product-binomial model. Goodness of fit beyond the runs tests is then not relevant.
The case when there is at least some quantitative predictors is thus the one of interest. It is to be noted that in any well-designed study there should always be replication, namely, n(x) > 1 for some of the x, for precisely the reason that model checking is a necessary part of any statistical analysis in a scientific context. It may be, however, that the data was collected in a somewhat haphazard way, and while model checking is still a requirement, it can't be expected that this will be as effective as in the designed context.
The general approach taken here can be described as follows. First a noninformative prior is placed on the θ(x). A specific meaning is applied to the word noninformative here, namely, we require that there is no possibility of there being prior-data conflict. Prior-data conflict loosely means that the prior places the bulk of its mass in a region of the parameter space which the data identifies as being unlikely to contain the true value. Prior-data conflict can be measured in a number of ways and this is discussed in Section 2.
Suppose that there are m values for x so the full space for the θ(x) is [ 0, 1] m . Let H 0 ⊂ [ 0, 1] m be the subset that corresponds to θ(x) = p(x β) for some β ∈ R k . The prior on the θ(x) leads to a posterior distribution for these quantities. Intuitively, if the posterior is more concentrated about H 0 than the prior, then this is evidence in favor of H 0 with the opposite holding when the posterior is less concentrated about H 0 than the prior. Once a method of measuring concentration about H 0 is selected, this evidence can be measured via a relative belief ratio. While a relative belief ratio is somewhat like a Bayes factor, it will be seen to differ in some key ways. Furthermore, a measure of the strength of this evidence, whether for or against H 0 , is presented. Several natural measures of concentration are considered. This is all discussed in Section 3. In Section 4 aspects of the computations are considered including application to a number of examples. Tsutukawa and Lin (1986) and Bedrick et al. (1996Bedrick et al. ( , 1997 are concerned with the Bayesian analysis of logistic regression models although not with goodness of fit. With a p-value based on asympotics, a commonly used goodness of fit statistic for logistic regression is the deviance statistic which is twice the difference between the maximized log-likelihood with no constraints and the maximized log-likelihood, assuming the logistic regression model holds. Chen and Chen (2004) also propose a frequentist asymptotic goodness of fit test in the context of case-control studies. It is shown here that a Bayesian goodness of fit test arises very naturally and that it has a number of advantages. In particular, evidence can be obtained in favor of the logistic regression model, as opposed to only evidence against as with p-values, and there is no appeal to asymptotics.

The Model and prior
Intuitively, a noninformative prior for the θ(X m ) is then given by the uniform distribution on [ 0, 1] m as this allows for any of the possible values and they are all weighted equally. Of course, other definitions can be provided for noninformativity but this definition seems most suitable for this context as it possesses a key property.
To see this suppose that we have a statistical model {f θ : θ ∈ } for the data x, and a prior π . Let T denote a minimal sufficient statistic for the model with marginal model {f θ ,T : θ ∈ }. Then m T (t) = f θ,T (t) (dθ) is the prior predictive density of T. In Evans and Moshonov (2006) a basic check on the prior is to compute the tail probability M T (m T (t) ≤ m T (T(x))) and conclude that prior-data conflict exists whenever this probability is small as this implies that the observed data is a priori surprising. The consistency of this check, under quite general conditions, is established in Evans and Jang (2011). If, for example, M T (m T (t) ≤ m T (T(x))) ≡ 1 for all possible T(x), then it is never the case that there is prior-data conflict and the prior can be called noninformative. Clearly this criterion for noninformativity can be weakened but this is all that is required here.
For the product-binomial, T can be taken equal to s(X). Since the counts are independent and the priors on the θ(x i ) are independent and uniform, the prior predictives of the s(x i ) are independent. The prior predictive for s(x i ) is easily seen to equal 1/(n(x i ) + 1), namely, it is uniform on {0, 1, . . . , n(x i )}. As such, for the product-binomial with a uniform prior, M T (m T (t) ≤ m T (T(x))) ≡ 1 and our criterion for noninformativity is satisfied. The posterior on θ(X) induced by this prior and the observed data s(X) is a product of beta distributions where θ( . Note that is easy to generate from both the prior and posterior of θ(X).

Hypothesis assessment via relative belief and concentration
Suppose for the model {f θ : θ ∈ } it is desired to assess the hypothesis H 0 ⊂ . In many cases there is a parameter of interest ψ = (θ) with H 0 = −1 {ψ 0 }. The evidence for or against H 0 can then be assessed via a relative belief ratio RB (ψ 0 | x) = π (ψ 0 | x)/π (ψ 0 ), the ratio of the posterior to prior densities of evaluated at the hypothesized value. If RB (ψ 0 | x) > 1, then there is evidence in favor of H 0 , as the posterior probability for ψ 0 is greater than the prior probability for ψ 0 , while RB (ψ 0 | x) < 1 implies there is evidence against H 0 and when RB (ψ 0 | x) = 1 there isn't evidence either way. The strength of the evidence is measured by the posterior probability For when RB (ψ 0 | x) < 1 and (2) is small, then there is a strong belief that the true value of has a larger relative belief ratio than ψ 0 and so the evidence against ψ 0 is strong. When RB (ψ 0 | x) < 1 and (2) is large, then there is only weak evidence against H 0 as this says that there is a large belief that the true value of ψ has the value of its relative belief ratio no greater than RB (ψ 0 | x). When RB (ψ 0 | x) > 1 and (2) is large, then there is a weak belief that the true value of has a larger relative belief ratio than ψ 0 and so the evidence in favor of ψ 0 is strong. Note that in the set {ψ : RB (ψ | x) ≤ RB (ψ 0 | x)} the value ψ 0 has the most evidence in its favor when RB (ψ 0 | x) > 1. When RB (ψ 0 | x) > 1 and (2) is small, then there is only weak evidence in favor H 0 as this says that there is a large belief that the true value of ψ has the value of its relative belief ratio greater than RB (ψ 0 | x). The relative belief ratio is discussed in Baskurt and Evans (2013) and a full development of a theory of inference based on this is presented in Evans (2015).
The interpretation of the relative belief ratio as the evidence demands that a relative belief ratio greater than 1 be interpreted as evidence in favour of the hypothesis no matter how much greater it is than 1, assuming it is computed exactly. This is because RB (ψ 0 | x) > 1 occurs only when the posterior probability is greater than the prior probability of the hypothesis and that is the basic criterion for saying the data has led to evidence in favor. A similar comment applies to evidence against, namely, when RB (ψ 0 | x) < 1. To understand what RB (ψ 0 | x) = 1 means consider a discrete context as then this occurs iff the posterior probability of {ψ 0 } equals the prior probability of {ψ 0 } and this occurs iff the events {ψ 0 } and {x} are statistically independent in the joint probability model for (θ, x). In other words RB (ψ 0 | x) = 1 iff the actual observed data x tells us nothing about the hypothesis that the true value of ψ = (θ) is ψ 0 . This is clearly a very unusual circumstance but one can easily construct such situations generally when the model contains nonidentifiability. In essence the relative belief ratio is giving the correct assessment of evidence in such a case.
The size of RB (ψ 0 | x) does not necessarily reflect the strength of the evidence. Note that in the discrete case RB (ψ 0 | x) = π (ψ 0 | x)/π (ψ 0 ) ≤ 1/π (ψ 0 ) so there is an upper bound on this value. From this it is seen that relative belief ratios do not measure evidence on an absolute scale but they need to be calibrated in each context and that is the role of the strength (2). So even if RB (ψ 0 | x) = 1.000005 the strength could be high when (2) is close to 1 as this says our belief that the true value of ψ has a larger relative belief ratio is small. Note that the strength is playing the role of the standard error here as it measures how reliable we believe our assessment of the evidence is. Also, it can happen that even though RB (ψ 0 | x) is very high, (2) can be very small and so the evidence is only weak evidence in favor. This phenomenon is associated with the Jeffreys-Lindley paradox as is discussed in Evans (2015). In short, relative belief ratios need to be calibrated and the calibration depends on the context. The issues concerning measuring strength are a somewhat more involved than the measure of the evidence itself and additional discussion can be found in Evans (2015).
In a number of situations H 0 does not arise via H 0 = −1 {ψ 0 } for some in an obvious way and also (H 0 ) = 0. The prior nullity may arise because H 0 is a lower dimensional subset of and not because there is no belief that H 0 is true. This is the case with logistic regression when k < m and is the uniform prior on =[ 0, 1] m . In such a context it is reasonable to choose = d H 0 where d H 0 (θ(X)) is a measure of the distance of θ(X) to H 0 . So with ψ 0 = 0 then H 0 = −1 {ψ 0 } and H 0 can be assessed using relative belief. Note that it is clear that in assessing H 0 a comparison is being made between the concentrations of the prior and posterior about H 0 . If RB (ψ 0 | x) > 1, then the data has lead to the posterior being more concentrated about H 0 than the prior. If RB (ψ 0 | x) < 1, then the data has lead to the posterior being less concentrated about H 0 than the prior. The method of concentration, with d H 0 equal to squared Euclidean distance as discussed in Example 1, was developed for some specific inference problems in Evans et al. (1993Evans et al. ( , 1997.
While there are many possible choices for d H 0 , two are considered here.
) denote the logit associated with x and note that the logistic regression model holds iff μ(x) = x β for some β ∈ R k for every x ∈ R k . The logistic regression model thus implies that μ(X) = Xβ for some β ∈ R k . If a probability distribution is placed on θ(X), then this induces a probability distribution on μ(X) which in turn induces a probability distri- where it is assumed that X is of full rank. The reason for dividing by the dimension m of μ(X) will become apparent in Example 3 although this clearly has no effect on the optimization. Note that d H 0 (θ(X)) = 0 iff the logistic regression model holds for the observed s(X). So it is natural to measure the concentration of the probability distribution placed on θ(X) about H 0 by seeing how concentrated the induced distribution on d H 0 (θ(X)) is about 0.
Example 2 Kullback-Leibler (KL) distance. The KL distance between the Bernoulli(θ) and the Bernoulli(p) distribution is given by KL(θ , p) p)). It follows that the KL distance between the product-Bernoulli(θ 1 , . . . , θ m ) and and the reason for dividing by m is discussed in Example 4. Given an efficient algorithm for finding the optimal β, it is straightforward to generate from the prior and posterior distributions of d H 0 (θ(X)).
After a choice is made of d H 0 , the methodology proceeds via simulation from the prior and posterior distributions of d H 0 (θ(X)), then computing RB d H 0 (0 | s(X)) and its strength. A significant aspect of this computation is that the prior and posterior densities of d H 0 typically both vanish at 0 and so the ratio cannot be directly computed. This is a support measure issue arising due to continuity and is dealt with theoretically by defining the relative belief ratio at a point as the limiting ratio of the posterior to prior probabilities of shrinking neighborhoods. Practically this is dealt with by the choice of δ since d H 0 (p) ∈[ 0, δ) implies that H 0 holds to the accuracy required in the application. In other words, H 0 holds whenever the difference between the true model and the logistic regression model is of no practical consequence as measured by d H 0 . The range of the prior distribution of d H 0 is then discretized via the partition where k is chosen so that the effective range of this prior distribution is covered. The prior and posterior probability contents of these intervals are estimated by generating large samples from the prior and the posterior distributions of θ, computing d H 0 (θ) for each sampled value, which gives samples from the prior and posterior distributions of d H 0 (θ), and then using the approximate contents for the approximation of the relative belief ratios of the intervals. The relevant relative belief ratio for assessing H 0 is then RB d H 0 ([ 0, δ) | s(X)) and the strength of this evidence is assessed by comparing this relative belief ratio against the other values by computing the posterior probability that where the Monte Carlo estimates were used for these computations. It is proved in Evans (2015) that this procedure is consistent as the amount of data increases in the sense that the relative belief ratio converges to the maximum possible value (always greater than 1) and the strength converges to 1 when H 0 is true, and the relative belief ratio converges to 0 and the strength converges to 0 when H 0 is false.
The choice of δ is application dependent as it represents the deviation from the precise null that is just of practical consequence. The two distance measures considered here lead to very natural choices for δ.
Example 3 Squared Euclidean distance and absolute error. For this distance measure let δ equal the maximum squared distance between two logits such that any difference smaller than δ is practically speaking immaterial. In other words, if max i (log(θ(x i )/(1 − θ(x i ))) − x i β) 2 < δ, then this difference is irrelevant from the point of view of the application. It is clear then that (μ i (X) − x i β) 2 < δ for i = 1, . . . , m implies ||μ(X) − Xβ|| 2 /m< δ while ||μ(X) − Xβ|| 2 /m< δ implies that the average squared absolute error between individual logits is less than δ. So in practice we proceed by selecting δ and discretizing the prior and posterior distributions of d H 0 as previously described. Note that it is also reasonable to choose a discretization parameter δ * < δ for d H 0 . For example, if δ * = δ/m then ||μ(X) − Xβ|| 2 /m < δ * implies max i (μ i (X) − x i β) 2 < δ but this might be deemed overly rigorous.
While the interpretation of error in the value of x β is straightforward in linear regression, this is more difficult in logistic regression and it then seems clearer to state bounds on the probabilities. Note, however, for probabilities θ and p, the logits satisfy to reflect what is considered a meaningful absolute squared difference in the probabilities, then the logits satisfying this error bound implies that the probabilities also satisfy this, at least when δ is small.

Example 4 Kullback-Leibler (KL) distance and relative error. For this distance measure let δ equal the maximum relative error in the probabilities. So it is desired that max
(1−δ) and the lower bound can be replaced by 0 since the KL distance is always nonnegative. Using log(1 + x) ≈ x when x is small, a small relative error of δ on the probabilities then implies the approximate bounds 0 ≤ m i=1 KL(θ i , p i )/m < δ. Conversely, m i=1 KL(θ i , p i )/m < δ implies that the average relative error in the probabilities is bounded by δ. This gives the discretization for the prior and posterior distributions for d H 0 (θ(X)) in this case. Again a discretization parameter δ * < δ can be used for d H 0 (θ(X)) if a bound on the average relative error on the individual probabilities is not felt to be rigorous enough.
It is emphasized that the choice of the distance measure and the discretization parameter are application dependent. Given that the concern is with model checking, and there are often many ways in which a model can be checked, the choice of the distance measure is perhaps not important. On the other hand, when choosing between the distance measures suggested here, this could be determined by the choice of absolute or relative error as the criterion of accuracy. When the probabilities in question are not too small or  too large, then absolute error seems like the appropriate error criterion to use, and hence use squared Euclidean distance, while when probabilities are felt to be close to 0 or 1, then relative error seems like the appropriate error criterion and so use Kullback-Liebler distance. Some may object to the need to discretize. In our view the choice of a δ to specify practically relevant deviations is a necessary aspect of any meaningful inference problem. It seems realistic to say that a logistic regression model is never strictly correct as there is no reason to suppose that the probabilities are exactly given by (1) for any x. What is more relevant is whether or not the logistic regression model is approximately correct and to make the notion of approximation precise one has to specify a δ. For example, if a logistic regression model provided two or three decimal accuracy for the relevant probabilities, then it could be that this is sufficient accuracy but this depends on the application as sometimes greater accuracy may be required. Examples 3 and 4 provide prescriptions for how δ can be chosen to reflect the accuracy desired in a problem. It would seem very odd that an individual familiar with the application couldn't specify such an accuracy as it cannot be true that any deviation whatsoever is significant as this contradicts the approximate nature of the logistic regression model. Provided the prior distribution is relatively smooth, as is the case here, the results will not change much by making small changes in δ, as changes in the prior and posterior probabilities will also be small. Also, as is well known, p-values can detect deviations from hypotheses that are not practically meaningful when sample sizes are large. The way to avoid this behavior is to build the relevant deviation directly into the inference methodology and that is what is done here.

Examples
Implementation of the computations is relatively straight-forward via simulation once d H 0 and δ have been selected, although clearly using squared Euclidean distance is somewhat easier. For the optimization with KL distance, the R routine optim was used. In all the examples the prior and posterior distributions of d H 0 were approximated using a Monte Carlo sample of size of 10 5 and these distributions were then discretized as previously discussed.  Some simulated examples are now considered where each distance is applied when the logistic regression model holds and when it doesn't.

Example 5 Simulated examples when logistic regression is correct.
Consider the situation where k = 2 with X 1 ≡ 1 and X 2 is a nonconstant quantitative predictor, so p(x β) = exp{β 1 + β 2 x 2 }/ (1 + exp{β 1 + β 2 x 2 }) . Various choices are considered for n = n(x 1 ) = · · · = n(x m ) and for δ, the squared absolute error in the respective probabilities when using Euclidean distance and the relative error in the respective probabilities when using KL distance. Note that in practice δ and the n(x i ) are fixed in an application. Here m = 3 with x 2 ∈ {0, 1, 2} and β 1 = 0.5, β 2 = −1.0 so p(X) = (0.62, 0.38, 0.18) gives the true probabilities for the corresponding Bernoulli distributions. Table 1 gives the results of some simulations when using squared Euclidean distance as the basis for the measure of concentration. When n = 1 the data s(X) = (1, 0, 0) was obtained, when n = 5 the data s(X) = (4, 2, 1) was obtained and when n = 10 the data s(X) = (7, 6, 1) was obtained. Notice that the relative belief ratio is always greater than 1 which says there is evidence in favour of the logistic regression model being true. The strength of this evidence depends on δ but not greatly and it is to be noted that this is determined by how much squared error is acceptable in the probabilities provided by the model. The p-value based on the deviance gave the values 1.0, 0.74, 0.22 when n = 1, 5, 10 respectively, and so this test would also not reject the logistic regression model in these simulations. It is to be noted, however, that the p-value is asymptotic and it is not clear how large n has to be for this approximation to be accurate. Also, in contrast to the relative belief ratio, a large p-value does not provide evidence in favor of H 0 . It is of interest that, while it is expected that both the relative belief ratio and its strength will increase as n increases, this did not happen when comparing n = 5 with n = 10. This is undoubtedly due to sampling variability as there is no guarantee that increasing sample size increases accuracy. Table 2 gives the results of the analysis of the data when using KL distance as the basis for the measure of concentration. Again the relative belief ratio is always greater than 1 and so gives the correct inference. The strength of the evidence in favor is always at least as great as when using squared Euclidean distance. This underscores a reasonable conjecture that Table 5 The values of RB together with the (strength) of the evidence in Example 6 when m = 5 using squared Euclidean distance. The effective range of the prior is [ 0, 3.0) Setting β 1 = 0.5, β 2 = 1.0 leads to the following probabilities for the corresponding Bernoulli distributions p(X) = (0.206, 0.210, 0.296, 0.316, 0.356, 0.364, 0.392, 0.417, 0.443, 0.463, 0.494, 0.504, 0.513, 0.543, 0.604, 0.663, 0.750, 0.760, 0.858, 0.882).
When n = 1, s(X) = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1), when n = 5, s(X) = (0, 0, 3, 2, 3, 2, 3, 1, 3, 2, 1, 3, 1, 1, 4, 3, 2, 3, 3, 5) and when n = 10, s(X) = (1, 3, 4, 3, 2, 6, 4, 5, 3, 0, 7, 7, 8, 7, 6, 3, 8, 8, 8, 7) are the corresponding generated data sets. Table 3 gives the results when using squared Euclidean distance and Table 4 gives the results when using KL distance, respectively, as the basis for the measure of concentration. Notice that the relative belief ratio is always greater than 1 which says correctly that there is evidence in favour of the logistic regression model being true. The strength of this evidence depends on δ and it is to be noted that this is determined by how much squared error is acceptable in the probabilities provided by the model. Of some interest is the fact that the strength drops in this example as δ increases and this is undoubtedly due to the relative belief ratio RB d H 0 ([ 0, δ) | s(X)) decreasing as greater tolerance is permitted for the approximation given by the logistic regression model. The posterior probability of the interval [ 0, δ) increases with δ but for this dataset, it does not increase as fast as the prior probability so the evidence, as measured by change in belief, is not as great for larger values of δ. This underlines the importance of choosing δ to reflect desired accuracy. The p-value based on the deviance gave the values 0.40, 0.22, 0.01 when n = 1, 5, 10 respectively. So in fact the deviance test leads to a p-value that would indicate that the model is not true when n = 10 and this is incorrect.   (1, x 2 ) is wrong. Here m = 5 and the values x 2 ∈ {1, 3, 5, 7, 9} were chosen with the true probabilities given by θ(X) = (0.875, 0.327, 0.107, 0.198, 0.908). The average squared Euclidean distance between these product-Bernoulli probabilities and the best fitting logistic regression with the corresponding values for x 2 is 0.117, so the logistic regression model is definitely false. The following data sets were generated from the true model: when n = 1, then s(X) = (1, 0, 0, 0, 1), when n = 5 then s(X) = (5, 2, 0, 1, 5) and when n = 10 then s(X) = (9, 3, 1, 2, 9). Table 5 records the results of the goodness of fit test based on squared Euclidean distance. In every case the relative belief ratio is less than 1, so there is evidence against H 0 , and the strength of this evidence is strong. The p-values based on the deviance statistic are respectively 0.08,0.00,0.00. So, excepting the n = 1 case, this approach also clearly rejects H 0 . Table 6 records the results of the goodness of fit test based on KL distance. In every case the relative belief ratio is less than 1 and the strength of this evidence is definitive with the possible exception of two cases when n = 1.
The following example presents an application to a real data set.
Example 7 Bioassay experiment. In studying drugs and other chemical compounds, acute toxicity tests or bioassay experiments are commonly implemented on animals. Such experiments proceed by controlling various dose levels of the compound to groups of animals. The animals responses are typically characterized as alive or dead, tumor or no tumor, etc. The logistic regression model with x = (1, x 2 ) is considered where x 2 is the log of the dosage in g/ml of a toxin. The data is provided in Table 9 and comes from a real data set analyzed in Racine et al. (1986) where m = 4, n(x 1 ) = · · · = n(x 4 ) = 5 and s(x i ) is the number of deaths at the i-th dosage.
Since the authors were not part of this application the value of δ cannot be strictly determined by what the goals of the study are. As such, the results are considered for a fairly wide range of possible values for δ in Table 10 and it is seen that the results are robust to this choice. In all cases the relative belief ratios show that there is evidence in favor of the logistic regression model holding and the strength of this evidence is universally very strong especially when using KL distance. The p-value based on deviance is 0.97 so the logistic regression model is not rejected but again it is to be noted that, following the logic of p-values, this is not to be interpreted as support for this model.

Conclusions
A Bayesian goodness of fit test has been developed for logistic regression models based on a measure of evidence. A definite advantage of this approach is that evidence can be obtained in favor of the model holding. Also, there is no need to appeal to asymptotics in the interpretation of the results as in the case of classical goodness of fit tests. Since every product-Bernoulli distribution is treated equally in the priors there is no bias towards accepting or rejecting the logistic regression model. The choice of which distance measure to use is dependent on whether relative or absolute error is the appropriate criterion to apply when considering the approximation a logistic regression model supplies to the true probabilities. The approach developed in this paper can also be used for goodness of fit tests for other models such as probit regression with only minor changes.