Bayes Factors, relations to Minimum Description Length, and overlapping model classes

https://doi.org/10.1016/j.jmp.2015.11.002

Highlights

  • Contrast and exposition of BMS and NML using simple tables.

  • BMS is a normalized mean probability of instance posteriors, NML a normalized maximum.

  • Subtracting shared instances from a larger class has benefits and aligns NML to BMS.

  • Discussion on management of priors and goals of inference.

Abstract

This article presents a non-technical perspective on two prominent methods for analyzing experimental data in order to select among model classes. Each class consists of model instances; each instance predicts a unique distribution of data outcomes. One method is Bayesian Model Selection (BMS), instantiated with the Bayes factor. The other is based on the Minimum Description Length principle (MDL), instantiated by a variant of Normalized Maximum Likelihood (NML): the variant is termed NML* and takes prior probabilities into account. The methods are closely related. The Bayes factor is a ratio of two values: V1 for model class M1, and V2 for M2. Each Vj is the sum, over the instances of Mj, of the joint probabilities (prior times likelihood) for the observed data, normalized by a sum of such sums over all possible data outcomes. NML* is qualitatively similar: the value it assigns to each class is the maximum, over the instances of that class, of the joint probability for the observed data, normalized by a sum of such maxima over all possible data outcomes. The similarity of BMS to NML* is particularly close when model classes do not have instances that overlap, a way of comparing model classes that we advocate generally. These observations and suggestions are illustrated throughout with a simple example borrowed from Heck, Wagenmakers, and Morey (2015) in which the instances predict a binomial distribution of the number of successes in N trials. The model classes posit the binomial probability of success to lie in various regions of the interval [0,1]. We illustrate the theory and the example not with equations but with tables coupled with simple arithmetic. Using the binomial example we carry out comparisons of BMS and NML* that do and do not involve model classes that overlap, and do and do not have uniform priors. When the classes do not overlap, BMS and NML* produce qualitatively similar results.

Introduction

This article is a largely non-technical perspective on methods to choose among candidate models purported to explain observed data. There are numerous goals of modeling, such as prediction of more data collected under the same conditions, generalization to similar but different situations, elegance, simplicity, approximating unknown truth, gaining understanding of complex situations, maximizing utility based on applications of the model, and others. No method can simultaneously accomplish all these goals. This article will cover the two methods that have emerged as best compromises in satisfying many of these goals. We will not try to lay out the precise way in which the methods satisfy the different goals, because that is a highly technical matter well beyond the scope of this paper, but it will become clear that they are aimed at providing a good fit to observed data with as simple a model as possible, while taking prior knowledge into account.

One method is Bayesian Model Selection (BMS), instantiated in the form of Bayes factors. The other is the Minimum Description Length principle (MDL), instantiated by a variant of Normalized Maximum Likelihood (NML) that takes prior probabilities into account; we term this NML*. NML and NML* are the same for uniform priors. We show a surprisingly simple relation of BMS to NML* that helps explain why the two methods often give qualitatively similar results. One situation in which the two methods differ, apparently to the detriment of NML, occurs when the model classes under comparison overlap. We suggest that in most such cases it is better to change the model comparison so that the classes do not overlap; generally this may be accomplished by deleting the shared model instances from the larger class.

Throughout the article we illustrate the observations and suggestions with a simple example borrowed from Heck, Wagenmakers, and Morey (2015). They compare Bayes factors (BF) and NML for Bernoulli model classes inferring the probability of success, θ, with flat (i.e., uniform) priors: one model class posits θ to lie in the range [0, 1], which we term M3, and the other posits θ to lie in a restricted range [0,z], which we term M1; the instances in the range [0,z] are common to the two classes. Certain results seem to favor the use of BF over NML*. We compare their findings with those arising when the comparison of model classes is altered to compare θ in [0,z] vs. θ in (z,1], the latter of which we term M2. We also consider a variant of their example with geometric priors. In all cases that compare M1 vs. M2, BMS and NML* produce qualitatively similar results.
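To make the setup concrete, a minimal Python sketch of the binomial example follows. The particular numbers (N = 10 trials, boundary z = 0.5, a θ grid in steps of 0.01) and the variable names are illustrative assumptions, not values taken from Heck et al. (2015) or from this article's tables.

    # Minimal sketch of the binomial example (illustrative values only).
    from math import comb

    N, z = 10, 0.5
    thetas = [k / 100 for k in range(101)]               # instances theta = 0.00, 0.01, ..., 1.00
    M3 = thetas                                          # theta in [0, 1]
    M1 = [t for t in thetas if t <= z]                   # theta in [0, z]
    M2 = [t for t in thetas if t > z]                    # theta in (z, 1]; shares no instance with M1

    def lik(y, t):
        """Probability of y successes in N trials for the instance theta = t."""
        return comb(N, y) * t ** y * (1 - t) ** (N - y)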

Section snippets

Terminology, definitions, and an exposition using tables

We assume that all measurements and all probabilities in this article are discretized into suitably small intervals. Using discretized measures to approximate continuous distributions not only simplifies our exposition but also accords with actual practice, and matches computational approaches used in model selection. In addition, discretization implicitly recognizes that all models are approximations to reality, reflects the fact that no measurements are infinitely precise, and limits
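As one hypothetical rendering of such a discretized exposition, the sketch below assigns to each discretized instance its predicted distribution over the N + 1 possible outcomes. The layout (rows as instances, columns as outcomes) is an assumption and may not match the article's Tables 1a-1d exactly.

    from math import comb

    N = 10
    thetas = [k / 100 for k in range(101)]
    def lik(y, t): return comb(N, y) * t ** y * (1 - t) ** (N - y)

    # Rows are instances (discretized theta), columns are outcomes y = 0..N;
    # each cell holds p(y | theta).
    table = {t: [lik(y, t) for y in range(N + 1)] for t in thetas}

    # Every row sums to 1: each instance predicts a full distribution over outcomes.
    assert all(abs(sum(row) - 1.0) < 1e-9 for row in table.values())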

Applications of Bayes theorem to model instances

Bayes theorem can be used to modify Tables 1a–1d to show the results of Bayesian inference, based on the observed data outcome y. These tables of posteriors could serve as tables of priors for a continuation/replication of the same study. We show only posterior versions of Tables 1a and 1b: Tables 2a and 2b differ from Tables 1a and 1b only in the change from prior probabilities to posterior probabilities conditional on observing outcome y, as indicated by
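A sketch of that updating step, assuming a uniform prior over the discretized instances and an arbitrary observed outcome (y = 7 of N = 10 trials); the article's Tables 2a and 2b would show the analogous posterior entries for its own priors.

    from math import comb

    N, y_obs = 10, 7                                     # y_obs = 7 is an arbitrary illustration
    thetas = [k / 100 for k in range(101)]
    def lik(y, t): return comb(N, y) * t ** y * (1 - t) ** (N - y)

    prior = {t: 1 / len(thetas) for t in thetas}         # uniform prior over the instances
    joint = {t: prior[t] * lik(y_obs, t) for t in thetas}
    evidence = sum(joint.values())                       # p(y_obs), the normalizer
    posterior = {t: joint[t] / evidence for t in thetas}
    # These posteriors could serve as the priors for a replication of the study.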

Geometric and uniform priors in the binomial example

There are many reasons to include prior knowledge in inference, and many cases where one would want to do so, both for BMS and MDL, although other considerations explain why uninformative, uniform, or transformation-invariant priors are most often seen in practice. Thus, in the framework of our binomial example, we analyze both a uniform prior and a geometric prior.3
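Footnote 3 defines the geometric prior precisely; as a stand-in, the sketch below uses one plausible reading, mass proportional to r^k over the ordered grid points with an arbitrary ratio r = 0.9. This should be read as an assumption rather than the article's exact construction.

    thetas = [k / 100 for k in range(101)]
    r = 0.9                                              # arbitrary ratio; the article's footnote 3 may differ
    weights = [r ** k for k in range(len(thetas))]
    geometric_prior = {t: w / sum(weights) for t, w in zip(thetas, weights)}
    uniform_prior = {t: 1 / len(thetas) for t in thetas}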

Bayesian model selection

Given some observed data, y, we have described how Bayes theorem can be applied to Tables 1a–1d to produce posterior probabilities for the instances in the model classes, and thereby for the classes themselves, as depicted in Tables 2a and 2b. In general it is undesirable to prefer model classes so large they can predict any possible outcome, and desirable to favor a model class that predicts a small range of outcomes when that range includes the observed data.
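A sketch of the resulting Bayes factor for M1 versus M2, assuming uniform within-class priors, equal prior odds for the two classes, and the same illustrative numbers as above (N = 10, z = 0.5, observed y = 7).

    from math import comb

    N, z, y_obs = 10, 0.5, 7
    thetas = [k / 100 for k in range(101)]
    def lik(y, t): return comb(N, y) * t ** y * (1 - t) ** (N - y)
    M1 = [t for t in thetas if t <= z]
    M2 = [t for t in thetas if t > z]

    def marginal(model, y):
        """Prior-weighted sum of likelihoods over the instances (uniform within-class prior)."""
        return sum(lik(y, t) / len(model) for t in model)

    bayes_factor = marginal(M1, y_obs) / marginal(M2, y_obs)   # odds for M1 over M2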

Minimum Description Length and normalized maximum likelihood

A chief aim of this article is the comparison of BMS to another major method for model comparison, one based on the Minimum Description Length principle (MDL; see Grünwald, 2007). At the core of MDL is the idea that one can characterize the regularities in data by a code that compresses the description to an optimal degree. Thus the MDL approach begins with a preference for simplicity. We say no more here about MDL but give a somewhat more detailed overview in Appendix B.
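Appendix B gives the fuller account; the sketch below shows only the NML* score as described in the Abstract: the maximum of prior times likelihood at the observed outcome, normalized by the sum of such maxima over all possible outcomes (with a uniform within-class prior this reduces to ordinary NML). The numbers are again illustrative assumptions.

    from math import comb

    N, z, y_obs = 10, 0.5, 7
    thetas = [k / 100 for k in range(101)]
    def lik(y, t): return comb(N, y) * t ** y * (1 - t) ** (N - y)
    M1 = [t for t in thetas if t <= z]
    M2 = [t for t in thetas if t > z]

    def nml_star(model, y, prior=None):
        """Max of prior * likelihood at y, normalized by the sum of such maxima over all outcomes."""
        prior = prior or {t: 1 / len(model) for t in model}       # uniform prior gives ordinary NML
        best = lambda yy: max(prior[t] * lik(yy, t) for t in model)
        return best(y) / sum(best(yy) for yy in range(N + 1))

    odds = nml_star(M1, y_obs) / nml_star(M2, y_obs)              # analogue of the Bayes factor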

The examples we use

Non-uniform priors, within-class and between-class

The typical application of BMS uses uniform priors within class and equal priors for the two classes. As seen in Eq. (3) this produces a simplicity preference in proportion to the class sizes. When one has prior information that some instances and classes are more probable than others, and decides to incorporate that information into BMS, considerable care is needed in deciding how to represent that information within and between classes, and the decisions made affect the degree of preference for simplicity.
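As a sketch of why the representation matters, the snippet below takes a single prior over all instances and converts it into class and within-class priors in two non-equivalent ways; the geometric weights and the boundary z are illustrative assumptions.

    thetas = [k / 100 for k in range(101)]
    z, r = 0.5, 0.9
    weights = {t: r ** k for k, t in enumerate(thetas)}          # illustrative geometric weights
    total = sum(weights.values())
    overall_prior = {t: w / total for t, w in weights.items()}

    M1 = [t for t in thetas if t <= z]
    M2 = [t for t in thetas if t > z]

    # (a) Classes inherit their total mass from the overall prior.
    class_prior_a = {"M1": sum(overall_prior[t] for t in M1),
                     "M2": sum(overall_prior[t] for t in M2)}
    # (b) Equal class priors, with the within-class priors renormalized.
    class_prior_b = {"M1": 0.5, "M2": 0.5}
    # Under either choice the within-class priors are the overall prior renormalized
    # within the class; only the class priors (and hence the degree of preference) differ.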

BMS and NML* for the binomial example

We now show model selection results for the binomial example, for the Bayes Factor and NML* (NML* reduces to NML for flat priors, so we shall refer only to NML* in what follows). The Bayes Factor and the ratio of NML* scores give the odds for one model class over the other. For comparisons of M1 against M2 or M3 the odds are p(M1)/[1 − p(M1)] because the probabilities of the two classes must add to 1.0. It is easier to display the results as p(M1) rather than as a ratio, and that is how we present them.
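Converting an odds value into p(M1) is simple arithmetic; a two-line sketch:

    def p_m1(odds):
        """Convert odds for M1 over the competing class into p(M1) = odds / (1 + odds)."""
        return odds / (1.0 + odds)

    assert abs(p_m1(3.0) - 0.75) < 1e-12                 # odds of 3:1 for M1 give p(M1) = 0.75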

Relation of BMS to NML*

The analyses of the binomial examples demonstrate qualitative similarity of BMS and NML* for all comparisons of M1 vs. M2. There is an alternative but equivalent characterization of the Bayes Factor that helps explain why this might be the case. This characterization is illustrated in Tables 3a, 3b, and 1c. These give one table for each model class, each table based on prior probabilities that add to 1.0: Tables 3a and 3b are just Table 1c truncated, with normalized priors. The Bayes Factor is the ratio of the two values obtained in this way.
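A sketch of that equivalent characterization, under the same illustrative assumptions as before: each class value is a prior-weighted sum of likelihoods at the observed outcome divided by that quantity summed over all possible outcomes, so the form parallels NML*'s normalized maximum while the ratio still equals the Bayes factor.

    from math import comb

    N, z, y_obs = 10, 0.5, 7
    thetas = [k / 100 for k in range(101)]
    def lik(y, t): return comb(N, y) * t ** y * (1 - t) ** (N - y)
    M1 = [t for t in thetas if t <= z]
    M2 = [t for t in thetas if t > z]

    def normalized_sum(model, y):
        """Prior-weighted sum of likelihoods at y, normalized over all possible outcomes."""
        s = lambda yy: sum(lik(yy, t) / len(model) for t in model)
        return s(y) / sum(s(yy) for yy in range(N + 1))

    # Because the within-class priors sum to 1, the denominator is 1 and this ratio
    # equals the ordinary Bayes factor, while its form mirrors NML*'s normalized maximum.
    bf = normalized_sum(M1, y_obs) / normalized_sum(M2, y_obs)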

An alternative approach to model comparison when classes share instances

In our view the simpler and more coherent approach to model comparison, in most cases when classes overlap, is the one we have utilized in our binomial example: removal of shared instances from the model classes being compared. This is most easily and sensibly accomplished by subtracting the shared instances from the larger class prior to comparison. This approach can be justified for both BMS and NML*, and the fact that it produced qualitatively similar results for BMS and NML* in our binomial example supports it.
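A sketch of the deletion step for the binomial example, assuming the uniform grid used above: remove from the larger class M3 the instances it shares with M1, leaving M2, and renormalize the prior over the remaining instances.

    thetas = [k / 100 for k in range(101)]
    z = 0.5
    M1 = [t for t in thetas if t <= z]                   # the smaller class, theta in [0, z]
    M3 = thetas                                          # the larger class, theta in [0, 1]
    M2 = [t for t in M3 if t not in set(M1)]             # larger class minus the shared instances
    prior_M2 = {t: 1 / len(M2) for t in M2}              # renormalized uniform prior on what remains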

Overview of the proposal to delete shared instances

Deleting shared instances from the larger class prior to model comparison is non-traditional but not as radical as it might appear. In most cases, such as comparisons of a class to a smaller one with certain parameter values specified, the larger class is sufficiently larger that the deletion does not alter the model comparison. In the few cases where the classes are reasonably close in size we think a good case can be made that deletion produces a comparison that better conforms to the goals of inference.

Situations justifying retention of shared instances

There are settings where it is natural to use the probabilistic rules of Bayes theorem to discriminate model classes that have the same instances but differ in the priors assigned to them. We regard these situations as ones better characterized as probabilistic inference rather than model selection. These situations are discussed in Appendix D.

Proper management of priors

For both BMS and MDL, and any other potential basis for inference and induction, incorporating prior knowledge is an important consideration, particularly when highly relevant knowledge exists. When an apparently well-designed study claims a demonstration of ESP, we doubt the conclusion not just because a theoretical justification is absent but because the prior probability is low. In most scientific model selection problems such relevant knowledge does exist. It may be vague and hard to quantify.

Priors, posteriors, complexity, and precision of observed data

Subtle issues arise due to the power and precision of measurement. Two model instances (from say two different model classes) might have different functional forms, but predict very similar distributions of outcomes. We are recommending that shared instances be removed from model class comparisons, so it is essential that we define in consistent fashion what makes instances the same. This issue is discussed in Appendix E.

Goals of inference

The methods we have described in this article are considered the present state of the art not because they simultaneously satisfy every goal of inference; that is probably impossible for any method of model selection. They do represent a good compromise that emphasizes a balance of good fit to data with simplicity. When one has limited and noisy data, an overly complex model class will have instances that appear highly likely because they fit not only the underlying true generating processes but also the noise in the data.

Summary

This article explains in a non-technical manner two chief methods for deciding which model classes better explain observed data. A model class is a collection of model instances. An instance is characterized by the distribution of experimental data outcomes it predicts; there is a one-to-one mapping between instances and associated distributions. We show such distributions in the form of tables and explain the model selection methods by simple arithmetic applied to the entries in the tables.


References (22)

  • P. Grünwald, A tutorial introduction to the minimum description length principle
