Bayes Factors, relations to Minimum Description Length, and overlapping model classes
Introduction
This article is a largely non-technical perspective on methods to choose among candidate models purported to explain observed data. There are numerous goals of modeling, such as prediction of more data collected under the same conditions, generalization to similar but different situations, elegance, simplicity, approximating unknown truth, gaining understanding of complex situations, maximizing utility based on applications of the model, and others. No method can simultaneously accomplish all these goals. This article will cover the two methods that have emerged as best compromises in satisfying many of these goals. We will not try to lay out the precise way in which the methods satisfy the different goals, because that is a highly technical matter well beyond the scope of this paper, but it will become clear that they are aimed at providing a good fit to observed data with as simple a model as possible, while taking prior knowledge into account.
One method is Bayesian Model Selection (BMS), instantiated in the form of Bayes factors. The other is the Minimum Description Length principle (MDL), instantiated by a variant of Normalized Maximum Likelihood (NML) that takes prior probabilities into account; we term this NML*. NML and NML* are the same for uniform priors. We show a surprisingly simple relation of BMS to NML* that helps explain why the two methods often give qualitatively similar results. One situation in which the two methods differ, apparently to the detriment of NML, occurs when the model classes under comparison overlap. We suggest that in most such cases it is better to change the model comparison so that the classes do not overlap; generally this may be accomplished by deleting the shared model instances from the larger class.
Throughout the article we illustrate the observations and suggestions with a simple example borrowed from Heck, Wagenmakers, and Morey (2015). They compare Bayes factors (BF) and NML for Bernoulli model classes inferring the probability of success, θ, under flat (i.e., uniform) priors: one model class posits θ to lie in the full range [0, 1], and the other posits θ to lie in the restricted range [0.5, 1]; the instances in the restricted range are common to the two classes. Certain of their results seem to favor the use of BF over NML*. We compare their findings with those arising when the comparison of model classes is altered so that the classes do not overlap: θ in the restricted range vs. θ in the complementary range [0, 0.5). We also consider a variant of their example with geometric priors. In all cases that compare the restricted class against its complement, BMS and NML* produce qualitatively similar results.
Section snippets
Terminology, definitions, and an exposition using tables
We assume that all measurements and all probabilities in this article are discretized into suitably small intervals. Using discretized measures to approximate continuous distributions not only simplifies our exposition but also accords with actual practice, and matches computational approaches used in model selection. In addition, discretization implicitly recognizes that all models are approximations to reality, reflects the fact that no measurements are infinitely precise, and limits …
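The discretization described above can be made concrete with a small sketch. Here (our illustrative choice, not the article's exact grid) the Bernoulli parameter θ is discretized into eleven instances, and each instance's predicted distribution over outcomes y (number of successes in n trials) is tabulated as one row:

```python
from math import comb

# Hypothetical setup: discretize the Bernoulli parameter theta into a small
# grid of instances, and tabulate each instance's predicted distribution
# over outcomes y = number of successes in n trials.
n = 5
grid = [i / 10 for i in range(11)]          # theta in {0.0, 0.1, ..., 1.0}

def outcome_dist(theta, n):
    """One row of the table: P(y | theta) for y = 0..n (binomial likelihood)."""
    return [comb(n, y) * theta**y * (1 - theta)**(n - y) for y in range(n + 1)]

table = {theta: outcome_dist(theta, n) for theta in grid}

# Each row is a full probability distribution over outcomes, so it sums to 1.
for theta, row in table.items():
    assert abs(sum(row) - 1.0) < 1e-12
```

Each instance is thus identified with the distribution in its row, matching the one-to-one mapping between instances and distributions described in the Summary.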
Applications of Bayes theorem to model instances
Bayes theorem can be used to modify Tables 1a–1d to show the results of Bayesian inference, based on the observed data outcome y. These tables of posteriors could serve as tables of priors for a continuation/replication of the same study. We show only posterior versions of Tables 1a and 1b: Tables 2a and 2b differ from Tables 1a and 1b only in the change from prior probabilities to posterior probabilities conditional on observing outcome y, as indicated by …
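A minimal sketch of this table update, with hypothetical numbers: a uniform prior over a grid of Bernoulli instances is converted into a posterior after observing y successes in n trials.

```python
from math import comb

n, y_obs = 5, 4
grid = [i / 10 for i in range(11)]
prior = {theta: 1 / len(grid) for theta in grid}

def likelihood(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Bayes theorem: posterior(theta) is proportional to prior(theta) * P(y_obs | theta).
unnorm = {theta: prior[theta] * likelihood(theta, y_obs, n) for theta in grid}
evidence = sum(unnorm.values())
posterior = {theta: p / evidence for theta, p in unnorm.items()}

# The posterior column again sums to 1, so it could serve directly as the
# prior column for a continuation or replication of the same study.
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```

After observing 4 successes in 5 trials, the posterior mass shifts toward the high-θ instances, exactly as reading down the posterior column of such a table would show.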
Geometric and uniform priors in the binomial example
There are many reasons to include prior knowledge in inference, and many cases where one would want to do so, both for BMS and MDL, although for other reasons one most often sees in practice the use of uninformative, uniform, or transformation-invariant priors. Thus, in the framework of our binomial example, we analyze both a uniform prior and a geometric prior.
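The two within-class priors can be sketched as follows, over a discretized θ grid (our illustrative grid and decay ratio, not necessarily the article's values): a uniform prior, and a normalized, truncated geometric prior that down-weights successive instances by a constant ratio r.

```python
grid = [i / 10 for i in range(11)]

uniform = {theta: 1 / len(grid) for theta in grid}

r = 0.5  # illustrative decay ratio; the article's actual value may differ
weights = [r**k for k in range(len(grid))]
total = sum(weights)
geometric = {theta: w / total for theta, w in zip(grid, weights)}

# Both are proper priors over the same set of instances.
assert abs(sum(uniform.values()) - 1.0) < 1e-12
assert abs(sum(geometric.values()) - 1.0) < 1e-12
# The geometric prior concentrates its mass on the earliest instances.
assert geometric[0.0] > geometric[0.5] > geometric[1.0]
```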
Bayesian model selection
Given some observed data, y, we have described how Bayes theorem can be applied to Tables 1a–1d to produce posterior probabilities for the instances in the model classes, and thereby for the classes themselves, as depicted in Tables 2a and 2b. In general it is undesirable to prefer model classes so large they can predict any possible outcome, and desirable to favor a model class that predicts a small range of outcomes when that range includes the observed …
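The simplicity preference described here can be seen in a sketch with hypothetical classes: a "large" class covering a full θ grid and a "small" class covering only θ ≥ 0.5 (the cutoff is our illustrative choice). Each class spreads a uniform within-class prior over its own instances, and the class marginal likelihood averages the instance likelihoods under that prior.

```python
from math import comb

n, y_obs = 10, 8
grid = [i / 10 for i in range(11)]
large = grid
small = [t for t in grid if t >= 0.5]

def lik(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def marginal(instances, y, n):
    w = 1 / len(instances)                   # uniform within-class prior
    return sum(w * lik(t, y, n) for t in instances)

# With equal class priors, the Bayes factor is the ratio of class marginals.
bf_small_vs_large = marginal(small, y_obs, n) / marginal(large, y_obs, n)

# The observed rate 0.8 lies inside the small class's range, so the small
# class, which wagered all its prior mass on that region, is favored.
assert bf_small_vs_large > 1
```

The large class is penalized automatically: it spreads prior mass over low-θ instances that fit the data poorly, diluting its average.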
Minimum Description Length and normalized maximum likelihood
A chief aim of this article is the comparison of BMS to another major method for model comparison, one based on the Minimum Description Length principle (MDL; see Grünwald, 2007). At the core of MDL is the idea that one can characterize the regularities in data by a code that compresses the description to an optimal degree. Thus the MDL approach begins with a preference for simplicity. We say no more here about MDL but give a somewhat more detailed overview in Appendix B.
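A minimal sketch of plain NML (the uniform-prior special case of NML*) for a discretized Bernoulli class on θ in [0, 1], under our illustrative grid: each outcome is scored by its best-fitting instance, and the scores are then normalized over all possible outcomes so that they form a probability distribution.

```python
from math import comb

n = 5
grid = [i / 10 for i in range(11)]

def lik(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Best achievable fit to each possible outcome y, maximizing over instances.
best_fit = [max(lik(t, y, n) for t in grid) for y in range(n + 1)]

# The normalizer measures class complexity: a flexible class fits many
# outcomes well, so the sum (and hence the per-outcome penalty) is larger.
complexity = sum(best_fit)
nml = [b / complexity for b in best_fit]

assert abs(sum(nml) - 1.0) < 1e-12
assert complexity > 1
```

A larger normalizer means each outcome receives a smaller NML score, which is how the approach's built-in preference for simplicity enters.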
The examples we use
Non-uniform priors, within-class and between-class
The typical application of BMS uses uniform priors within class and equal priors for the two classes. As seen in Eq. (3) this produces a simplicity preference in proportion to the class sizes. When one has prior information that some instances and classes are more probable than others, and decides to incorporate that information into BMS, considerable care is needed in deciding how to represent that information within and between classes, and the decisions made affect the degree of preference …
BMS and NML* for the binomial example
We now show model selection results for the binomial example, for the Bayes factor and NML* (NML* reduces to NML for flat priors, so we shall refer to NML* only in what follows). The Bayes factor and the ratio of NML* scores give the odds for one model class over the other. For comparisons of the restricted class against the full class or the complement, the odds are p/(1 − p) because the probabilities of the two classes must add to 1.0. It helps us show the results to display them as probabilities rather than as a ratio, and that …
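The qualitative agreement of the two scores can be checked on the non-overlapping comparison in a sketch, with illustrative choices of grid and cutoff: θ ≥ 0.5 (restricted) vs. θ < 0.5 (complement), both with uniform within-class priors, so NML* reduces to NML here.

```python
from math import comb

n = 10
grid = [i / 10 for i in range(11)]
restricted = [t for t in grid if t >= 0.5]
complement = [t for t in grid if t < 0.5]

def lik(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def bayes_marginal(instances, y):
    return sum(lik(t, y, n) for t in instances) / len(instances)

def nml_score(instances, y):
    best = [max(lik(t, yy, n) for t in instances) for yy in range(n + 1)]
    return max(lik(t, y, n) for t in instances) / sum(best)

for y_obs in range(n + 1):
    bf = bayes_marginal(restricted, y_obs) / bayes_marginal(complement, y_obs)
    nml_ratio = nml_score(restricted, y_obs) / nml_score(complement, y_obs)
    # Qualitative agreement: both methods favor the same class for each outcome.
    assert (bf > 1) == (nml_ratio > 1)
```

The exact odds differ between the two methods, but for every possible outcome the favored class is the same, illustrating the qualitative similarity reported in the article.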
Relation of BMS to NML*
The analyses of the binomial examples demonstrate qualitative similarity of BMS and NML* for all comparisons of the restricted class vs. its complement. There is an alternative but equivalent characterization of the Bayes factor that helps explain why this might be the case. This characterization is illustrated in Tables 3a, 3b, and 1c, which give one table for each model class, each table based on prior probabilities that add to 1.0; Tables 3a and 3b are just Table 1c truncated, with normalized priors. The Bayes factor is the …
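The truncate-and-renormalize characterization can be verified numerically in a sketch, under illustrative choices: start from one full-class table with a uniform prior, truncate it into the two subclasses (θ ≥ 0.5 and θ < 0.5), and renormalize the priors within each truncated table. The Bayes factor computed from the two truncated tables equals the posterior odds of the two ranges in the full table divided by their prior odds.

```python
from math import comb

n, y_obs = 10, 7
grid = [i / 10 for i in range(11)]
prior = {t: 1 / len(grid) for t in grid}

def lik(t):
    return comb(n, y_obs) * t**y_obs * (1 - t)**(n - y_obs)

hi = [t for t in grid if t >= 0.5]
lo = [t for t in grid if t < 0.5]

# Bayes factor from the two truncated tables with renormalized priors.
def marginal(block):
    mass = sum(prior[t] for t in block)
    return sum((prior[t] / mass) * lik(t) for t in block)

bf = marginal(hi) / marginal(lo)

# The same number read off the single full table: posterior odds / prior odds.
post = {t: prior[t] * lik(t) for t in grid}
post_odds = sum(post[t] for t in hi) / sum(post[t] for t in lo)
prior_odds = sum(prior[t] for t in hi) / sum(prior[t] for t in lo)

assert abs(bf - post_odds / prior_odds) < 1e-9
```

This identity holds for any within-class priors, which is why the truncated tables give the same Bayes factor as the full table.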
An alternative approach to model comparison when classes share instances
In our view the simpler and more coherent approach to model comparison in most cases when classes overlap is the one we have utilized in our binomial example: removal of shared instances from the model classes being compared. This is most easily and sensibly accomplished by subtracting the shared instances from the larger class prior to comparison. This approach can be justified for both BMS and NML*, and the fact that it produced qualitatively similar results for BMS and NML* …
Overview of the proposal to delete shared instances
Deleting shared instances from the larger class prior to model comparison is non-traditional but not as radical as it might appear. In most cases, such as comparisons of a class to a smaller one with certain parameter values specified, the larger class is enough larger that the deletion does not alter the model comparison. In the few cases where the classes are reasonably close in size, we think a good case can be made that deletion produces a comparison that better conforms to the goals of …
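The deletion step itself is mechanical, as this sketch with hypothetical classes shows: the full grid on [0, 1] and the restricted range θ ≥ 0.5 share the instances in [0.5, 1], so before comparison the shared instances are removed from the larger class and its within-class prior is renormalized over what remains.

```python
grid = [i / 10 for i in range(11)]
larger = {t: 1 / len(grid) for t in grid}            # uniform prior on [0, 1]
smaller = [t for t in grid if t >= 0.5]              # restricted class

# Identify and delete the shared instances from the larger class.
shared = [t for t in larger if t in smaller]
reduced = {t: p for t, p in larger.items() if t not in shared}

# Renormalize the surviving prior so it is again a proper distribution.
mass = sum(reduced.values())
reduced = {t: p / mass for t, p in reduced.items()}

# The comparison is now between disjoint classes, each with a proper prior.
assert not set(reduced) & set(smaller)
assert abs(sum(reduced.values()) - 1.0) < 1e-12
```

After deletion the reduced class is exactly the complement class used in our binomial example.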
Situations justifying retention of shared instances
There are settings where it is natural to use the probabilistic rules of Bayes theorem to discriminate model classes that have the same instances but differ in the priors assigned to them. We regard these situations as ones better characterized as probabilistic inference rather than model selection. These situations are discussed in Appendix D.
Proper management of priors
For both BMS and MDL, and any other potential basis for inference and induction, incorporating prior knowledge is an important consideration, particularly when highly relevant knowledge exists. When an apparently well-designed study claims a demonstration of ESP, we doubt the conclusion not just because a theoretical justification is absent but because the prior probability is low. In most scientific model selection problems such relevant knowledge does exist. It may be vague and hard to …
Priors, posteriors, complexity, and precision of observed data
Subtle issues arise due to the power and precision of measurement. Two model instances (from say two different model classes) might have different functional forms, but predict very similar distributions of outcomes. We are recommending that shared instances be removed from model class comparisons, so it is essential that we define in consistent fashion what makes instances the same. This issue is discussed in Appendix E.
Goals of inference
The methods we have described in this article are considered the present state of the art not because they simultaneously satisfy every goal of inference; that is probably impossible for any method of model selection. Rather, they represent a good compromise that emphasizes a balance of good fit to data with simplicity. When one has limited and noisy data, an overly complex model class will have instances that appear highly likely because they fit not only the underlying true generating processes but …
Summary
This article explains in non-technical manner two chief methods for deciding which model classes better explain observed data. A model class is a collection of model instances. An instance is characterized by the distribution of experimental data outcomes it predicts—there is a one-to-one mapping between instances and associated distributions. We show such distributions in the form of tables and explain the model selection methods by simple arithmetic applied to the entries in the tables.
The …
References (22)
- Chandramouli, S. H., & Shiffrin, R. M. (2016). Extending Bayesian induction. Journal of Mathematical Psychology.
- Heck, D. W., Wagenmakers, E.-J., & Morey, R. D. (2015). Testing order constraints: Qualitative differences between Bayes factors and Normalized Maximum Likelihood. Statistics & Probability Letters.
- Mulder, J., et al. (2010). Equality and inequality constrained multivariate linear models: Objective model selection using constrained posterior priors. Journal of Statistical Planning and Inference.
- Rissanen, J. (1978). Modeling by shortest data description. Automatica.
- de Rooij, S., & Grünwald, P. Luckiness and regret in minimum description length inference.
- Bartlett, P., Grünwald, P., Harremoes, P., Hedayati, F., & Kotlowski, W. (2013). Horizon-independent optimal prediction...
- Barlow, R. E., et al. (1972). Statistical inference under order restrictions.
- Bartlema, A., et al. (2014). A Bayesian hierarchical mixture approach to individual differences: Case studies in selective attention and representation in category learning. Journal of Mathematical Psychology.
- Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association.
- Clarke, B., & Dawid, A. P. (Unpublished manuscript, 1999). Online prediction with experts under a log-scoring...
- Grünwald, P. A tutorial introduction to the minimum description length principle.