Valid inferential models for prediction in supervised learning problems

https://doi.org/10.1016/j.ijar.2022.08.001

Abstract

Prediction, where observed data is used to quantify uncertainty about a future observation, is a fundamental problem in statistics. Prediction sets with coverage probability guarantees are a common solution, but these do not provide probabilistic uncertainty quantification in the sense of assigning beliefs to relevant assertions about the future observable. Alternatively, we recommend the use of a probabilistic predictor, a data-dependent (imprecise) probability distribution for the to-be-predicted observation given the observed data. It is essential that the probabilistic predictor be reliable or valid, and here we offer a notion of validity and explore its behavioral and statistical implications. In particular, we show that valid probabilistic predictors must be imprecise, that they avoid sure loss, and that they lead to prediction procedures with desirable frequentist error rate control properties. We provide a general construction of a provably valid probabilistic predictor, which has close connections to the powerful conformal prediction machinery, and we illustrate this construction in regression and classification applications.

Introduction

Data-driven prediction of future observations is a fundamental problem. Here our focus is on applications where the data Z=(X,Y) consists of explanatory variables $X \in \mathbb{X} \subseteq \mathbb{R}^d$, for some $d \ge 1$, and a response variable $Y \in \mathbb{Y}$. That is, we observe a collection $Z^n = \{Z_i = (X_i, Y_i) : i = 1, \ldots, n\}$ of $n$ pairs from an exchangeable process. The two most common examples are regression and classification, where $\mathbb{Y}$ is an open subset and a finite subset of $\mathbb{R}$, respectively. We consider both cases in what follows. The prediction problem corresponds to a case where we are given a value $x_{n+1}$ of the next explanatory variable $X_{n+1}$, and the goal is to predict the corresponding future response $Y_{n+1} \in \mathbb{Y}$.

By “prediction” we mean quantifying uncertainty about $Y_{n+1}$ in a data-dependent way, i.e., depending on the observed data $Z^n$ and the given value $x_{n+1}$ of $X_{n+1}$. One perspective on prediction uncertainty quantification is the construction of a suitable family of prediction sets representing collections of sufficiently plausible values for $Y_{n+1}$; see, e.g., [44], [5], [28], and Equation (19) below. While prediction sets are practically useful, there are prediction-related tasks that they cannot perform; in particular, they cannot assign degrees of belief (or betting odds, etc.) to all relevant assertions or hypotheses “$Y_{n+1} \in A$,” for $A \subseteq \mathbb{Y}$. An alternative approach is to develop what we refer to here as a probabilistic predictor, i.e., a probability-like structure (a precise or imprecise probability) defined on $\mathbb{Y}$, depending on $Z^n$ and $x_{n+1}$, designed to quantify uncertainty about $Y_{n+1}$ by directly assigning degrees of belief to relevant assertions. The most common approach to probabilistic prediction is Bayesian, where a prior distribution for the model is specified and uncertainty is quantified by the posterior predictive distribution of $Y_{n+1}$, given $Z^n$ and $X_{n+1} = x_{n+1}$. Other, non-Bayesian approaches leading to predictive distributions include [29], [10], [49], [46].

Before moving forward, it is important to distinguish between uncertainty quantification with prediction sets and with probabilistic predictors. One does not need a full (precise or imprecise) probability distribution to construct prediction sets and, moreover, sets derived from a probabilistic predictor are not guaranteed to satisfy the frequentist coverage probability property that warrants calling them genuine “prediction sets.” Therefore, the motivation for going through the trouble of constructing a probabilistic predictor, Bayesian or otherwise, must be that there are important prediction-related tasks that prediction sets cannot satisfactorily handle. In other words, the belief assignments provided by a (precise or imprecise) probability must be a high priority. Strangely, however, the reliability of probabilistic predictors is only ever assessed in terms of the (asymptotic) coverage probability properties of their corresponding prediction sets. Our unique perspective is that, since belief assignments are a priority, there ought to be a way to directly assess the reliability of a probabilistic predictor's belief assignments.

For prediction problems where only the (response) variables $Y_1, \ldots, Y_n$ are observed, Cella and Martin [8] introduced a notion of validity for probabilistic predictors. Roughly, their validity condition requires that the subsets $A \subseteq \mathbb{Y}$ to which the probabilistic predictor tends to assign large numerical degrees of belief are the same as those that tend to contain $Y_{n+1}$. The point is that such a constraint ensures that the belief assignments made by the probabilistic predictor are not systematically misleading. Here we extend their notion of validity to the case where explanatory variables are present; the precise definitions are given below in Definitions 1 and 2. It turns out that these notions of validity have some important consequences, imposing certain constraints on the mathematical structure of the probabilistic predictor. Indeed, we argue in Section 3 (see also Corollary 1 in Section 4) that validity can only be achieved by probabilistic predictors that take the form of an imprecise probability distribution. Section 2 provides a preview of the formal definition of validity and offers empirical support for the claim that precise probabilistic predictors cannot be valid.
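
For orientation, and only as a hedged paraphrase (the exact statements are in Definitions 1 and 2, which are not reproduced in this preview), the validity condition of [8], extended to the supervised setting, takes roughly the form
\[
\sup_{P \in \mathcal{P}} P\bigl\{ \overline{\Pi}_{X_{n+1}}^{n}(A) \le \alpha \ \text{and} \ Y_{n+1} \in A \bigr\} \le \alpha
\quad \text{for all } \alpha \in [0,1] \text{ and all } A \subseteq \mathbb{Y},
\]
where $\overline{\Pi}_{x}^{n}$ denotes the (upper) probabilistic predictor based on $Z^n$ and $X_{n+1} = x$; that is, assertions that turn out to be true should only rarely receive small upper probability.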

After formally introducing these notions of validity in Section 3, we explore their behavioral and statistical consequences. First, we show that even the weaker validity property in Definition 1 implies that the probabilistic predictor avoids (a property stronger than) the familiar sure loss property in the imprecise probability literature, hence is not internally irrational from a behavioral point of view. We go on to show that prediction-related procedures, e.g., tests and prediction regions, derived from (uniformly) valid probabilistic predictors control frequentist error probabilities. The take-away message is that a (uniformly) valid probabilistic predictor provides the “best of both worlds”: it simultaneously achieves desirable behavioral and statistical properties.

Given the desirable properties of a valid probabilistic predictor, the natural question is how to construct one. The probabilistic predictor we construct here is largely based on the general theory of valid inferential models (IMs) as described in Martin and Liu [37], [39]. Martin and Liu's construction usually assumes a parametric model, but here we aim to avoid such strong assumptions. For this, we use a particular extension of the so-called generalized IM approach developed in [32], [33]. The basic idea is that a link/association between observable data, quantities of interest, and an unobservable auxiliary variable with known distribution can be made without fully specifying the data-generating process. In Section 5, we develop a valid IM construction that assumes only exchangeability of the observed data process; no parametric model assumptions are required. There, in Theorem 1, we establish that this general IM-based probabilistic predictor construction achieves the (uniform) validity property. The specifics of this construction are presented in Section 6, in the context of regression. Section 7 considers the classification problem, and we show that the discreteness of $\mathbb{Y}$ in classification problems may cause the IM random set output, from which the probabilistic predictor is derived, to be empty with positive probability. Two possible adjustments are provided, with the one based on suitably “stretching” the random set being the most efficient.
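
Schematically, and following the general three-step IM recipe of Martin and Liu [37], [39], the construction of Section 5 can be read as follows; this is only a high-level sketch, with the specific choices (of $\phi_n$ and of the random set) spelled out in Sections 5 and 6.

  • A-step: associate the observable data and the to-be-predicted quantity with an unobservable auxiliary variable, i.e., choose a real-valued $\phi_n$ such that $U = \phi_n(Z^n, Z_{n+1})$ has a known distribution under exchangeability alone.

  • P-step: predict the unobserved value of $U$ with a random set $\mathcal{S}$ having a suitable known distribution.

  • C-step: combine, mapping $\mathcal{S}$ back to a random subset of $\mathbb{Y}$; its containment and hitting probabilities give the lower and upper probabilistic predictors for $Y_{n+1}$, respectively.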

An important observation is that parallels can be drawn between our proposed IM construction and the conformal prediction approach put forward in [44] and elsewhere. This is interesting for at least two reasons.

  • It demonstrates that one does not necessarily need “new methods” to construct probabilistic predictors that achieve the desired (uniform) validity property; an appropriate re-interpretation of the output returned by certain existing methods suffices. In particular, our proposed IM construction returns a possibility measure whose contour function is the transducer derived from an appropriate conformal prediction algorithm. Consequently, all we need is the corresponding conformal prediction algorithm to achieve our goals.

  • However, there are a variety of ways in which a conformal prediction algorithm could be re-interpreted as a probabilistic predictor, e.g., as a precise probability distribution or as one of several different imprecise probability distributions. Our developments here reveal that the appropriate re-interpretation, the one that leads to (uniform) validity, treats the conformal transducer as the contour function that defines a possibility measure (see the illustrative sketch following this list).

These points, along with some other concluding remarks, are given in Section 8.
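
To make the re-interpretation concrete, here is a minimal, hypothetical sketch (not the authors' code) for the covariate-free setting: the full-conformal transducer, evaluated over a grid of candidate values for $Y_{n+1}$, is treated as the contour function of a possibility measure, so the upper probability of an assertion $A$ is the supremum of the contour over $A$ and the lower probability is one minus the supremum over the complement. The nonconformity score below (absolute deviation from the leave-one-out mean) is just one simple choice, and the grid evaluation is only an approximation.

```python
import numpy as np

def conformal_contour(y_obs, y_grid):
    """Full-conformal transducer ("plausibility contour") for the next
    observation, using |y - leave-one-out mean| as the nonconformity score.
    Returns pi(y) for each candidate y in y_grid."""
    n = len(y_obs)
    pi = np.empty(len(y_grid))
    for j, y_cand in enumerate(y_grid):
        aug = np.append(y_obs, y_cand)           # augmented sample of size n+1
        scores = np.empty(n + 1)
        for i in range(n + 1):
            loo_mean = (aug.sum() - aug[i]) / n  # mean of the other n points
            scores[i] = abs(aug[i] - loo_mean)   # nonconformity of point i
        # proportion of scores at least as large as the candidate's own score
        pi[j] = np.mean(scores >= scores[-1])
    return pi

def upper_lower(pi, A):
    """Possibility-measure style upper/lower probabilities of the assertion
    'Y_{n+1} in A', where A is a boolean mask over the candidate grid."""
    upper = pi[A].max() if A.any() else 0.0
    lower = 1.0 - (pi[~A].max() if (~A).any() else 0.0)
    return lower, upper

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_obs = rng.normal(size=50)
    y_grid = np.linspace(-4.0, 4.0, 401)
    pi = conformal_contour(y_obs, y_grid)
    A = y_grid > 1.0                             # assertion: Y_{n+1} > 1
    lower, upper = upper_lower(pi, A)
    print(f"belief(Y>1) = {lower:.3f}, plausibility(Y>1) = {upper:.3f}")
```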

Section snippets

Prediction validity: a preview

To help clarify the difference between the traditional notions of uncertainty quantification in (probabilistic) prediction and the notions we have in mind here, we consider a relatively simple example for illustration, one in which there are no covariates. That is, suppose we have a sequence of real-valued observables $Y_1, Y_2, \ldots$ and, based on the observations $Y^n = y^n$, the goal is to predict $Y_{n+1}$ in a probabilistic way. One standard way to approach this is to construct a Bayesian predictive
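
As a hedged sketch of the standard construction alluded to above (under a working parametric model, which is an assumption of this illustration, not of the paper): if the $Y_i$'s are modeled as i.i.d. from a density $f_\theta$ with prior $\pi$ on $\theta$, the Bayesian predictive density for $Y_{n+1}$ is
\[
\hat{f}_n(y_{n+1}) = \int f_\theta(y_{n+1}) \, \pi(\theta \mid y^n) \, d\theta,
\qquad
\pi(\theta \mid y^n) \propto \pi(\theta) \prod_{i=1}^{n} f_\theta(y_i),
\]
a precise probabilistic predictor; the point of this preview is that such precise predictors cannot achieve the validity property introduced below.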

Setup

The goal here is to formalize the ideas discussed in Section 2 above. Recall that the present paper is concerned with prediction in supervised learning problems, so we assume there is an exchangeable process $Z = (Z_1, Z_2, \ldots)$ with distribution $P$, where each $Z_i$ is a pair $(X_i, Y_i) \in \mathbb{Z} = \mathbb{X} \times \mathbb{Y}$. As is customary, “$P(Z^n \in B)$” is understood to mean the marginal probability of the event “$Z^n \in B$” derived from the joint distribution of $Z$ under $P$. The distribution $P$ is completely unknown, except that it belongs to the

Behavioral

Despite our focus on frequentist-style properties, validity has some important behavioral consequences, à la de Finetti, Walley, and others. Towards this, define
\[
\underline{\gamma}_n(A) = \inf_{(z^n, x) \in \mathbb{Z}^n \times \mathbb{X}} \underline{\Pi}_{x}^{n}(A)
\quad\text{and}\quad
\overline{\gamma}_n(A) = \sup_{(z^n, x) \in \mathbb{Z}^n \times \mathbb{X}} \overline{\Pi}_{x}^{n}(A),
\]
the lower/upper probabilistic predictor evaluated at $A$, optimized over all of its data inputs; recall that $\underline{\Pi}_{x}^{n}$ and $\overline{\Pi}_{x}^{n}$ depend implicitly on an argument $z^n$. An especially poor specification of prediction probabilities is a situation in which, for some $A \subseteq \mathbb{Y}$, $\underline{\gamma}_n(A) > \inf_{P \in \mathcal{P}} P(Y_{n+1} \in A)$.
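
A hedged gloss on why the displayed situation is behaviorally poor, and why avoiding it is stronger than avoiding sure loss: a buyer who pays the price $\underline{\gamma}_n(A)$ for the gamble $\mathbf{1}\{Y_{n+1} \in A\}$ has expected net payoff
\[
P(Y_{n+1} \in A) - \underline{\gamma}_n(A) < 0 \quad \text{for at least one candidate } P \in \mathcal{P}
\]
whenever $\underline{\gamma}_n(A) > \inf_{P \in \mathcal{P}} P(Y_{n+1} \in A)$; a sure-loss-type condition relative to $\mathcal{P}$ would instead require a negative expected payoff under every $P \in \mathcal{P}$, i.e., $\underline{\gamma}_n(A) > \sup_{P \in \mathcal{P}} P(Y_{n+1} \in A)$.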

Inferential models

A relevant question is how to construct a probabilistic predictor that achieves the (uniform) validity condition. One strategy would be through a generalized Bayes approach, as advocated in, e.g., [47, Sec. 6.4]. That is, if $\mathcal{P}$ is the set of candidate joint distributions for the observables, the generalized Bayes rule would define an upper prediction probability as
\[
\overline{\Pi}_{x}^{n}(A) = \sup_{P \in \mathcal{P}} P(Y_{n+1} \in A \mid Z^n, X_{n+1} = x), \quad A \in \mathcal{A},
\]
and the corresponding lower probability by replacing sup by inf. That this satisfies validity
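
As a toy, covariate-free illustration of the generalized-Bayes rule above (a hypothetical sketch, not the construction ultimately adopted in the paper): take $\mathcal{P}$ to be the exchangeable joint distributions induced by a conjugate normal model with prior means ranging over an interval; the upper and lower prediction probabilities of the assertion $\{Y_{n+1} > c\}$ are then the sup and inf of the ordinary posterior predictive probabilities over that class.

```python
import numpy as np
from scipy.stats import norm

def predictive_prob(y_obs, prior_mean, c, prior_var=1.0, sigma2=1.0):
    """Posterior predictive P(Y_{n+1} > c | y_obs) under the conjugate model
    Y_i | mu ~ N(mu, sigma2) i.i.d., with prior mu ~ N(prior_mean, prior_var)."""
    n = len(y_obs)
    post_var = 1.0 / (n / sigma2 + 1.0 / prior_var)
    post_mean = post_var * (y_obs.sum() / sigma2 + prior_mean / prior_var)
    pred_sd = np.sqrt(post_var + sigma2)     # predictive std dev for Y_{n+1}
    return 1.0 - norm.cdf(c, loc=post_mean, scale=pred_sd)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y_obs = rng.normal(loc=0.5, size=25)
    prior_means = np.linspace(-3.0, 3.0, 61)  # indexes the class of candidate priors
    probs = [predictive_prob(y_obs, m, c=1.0) for m in prior_means]
    # generalized-Bayes lower and upper prediction probabilities of {Y_{n+1} > 1}
    print(f"lower = {min(probs):.3f}, upper = {max(probs):.3f}")
```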

Probabilistic prediction in regression

Recall that the A-step requires the specification of a real-valued function $\phi_n$ such that the distribution of $\phi_n(Z^n, Z_{n+1})$ is known. Towards this, given $Z^{n+1} = (Z^n, Z_{n+1})$, consisting of the observable $(Z^n, X_{n+1})$ and the yet-to-be-observed $Y_{n+1}$, consider first a transformation $Z^{n+1} \mapsto T^{n+1}$, defined by
\[
T_i = \Psi(Z^{n+1}_{-i}, Z_i), \quad i \in I_{n+1},
\]
where $Z^{n+1}_{-i} = Z^{n+1} \setminus \{(X_i, Y_i)\}$, and $\Psi$ is a suitable real-valued function that compares $Y_i$ to a prediction derived from $Z^{n+1}_{-i}$ at $X_i$, being small if they agree and large if they disagree.
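
For concreteness, one hypothetical choice of $\Psi$ in the regression setting (the absolute residual from an ordinary least-squares fit to the held-out data; the paper's own choice may differ) is sketched below. The $T_i$'s in the display above are then obtained by applying this $\Psi$ to each pair $Z_i$ in the augmented sample, with the remaining $n$ pairs playing the role of $Z^{n+1}_{-i}$.

```python
import numpy as np

def psi_abs_residual(D_minus_i, z_i):
    """Nonconformity-style Psi for regression: fit OLS (with intercept) to the
    (x, y) pairs in D_minus_i, then return |y_i - yhat(x_i)| for the held-out
    pair z_i = (x_i, y_i)."""
    X = np.array([np.append(1.0, x) for x, _ in D_minus_i])  # design matrix
    y = np.array([resp for _, resp in D_minus_i])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)             # OLS coefficients
    x_i, y_i = z_i
    return abs(y_i - np.append(1.0, x_i) @ beta)             # absolute residual
```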

Probabilistic prediction in classification

In Section 6, we found that the A-step boils down to the specification of a suitable real-valued, exchangeability-preserving function $\Psi$, which Vovk et al. [44] refer to as a non-conformity measure. In binary classification problems, a $\Psi$ function like the one in (26) can also be used here, by encoding the binary labels as distinct real numbers. However, if there are more than two labels, and they are not on an ordinal scale where assigning distinct numbers to them is justified, there is no natural way to
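
The snippet above is cut before the paper's own proposal, so the following is only a common choice from the conformal classification literature, offered as a hypothetical illustration: score a candidate pair by the distance to its nearest same-label neighbor relative to the distance to its nearest different-label neighbor, which requires no numeric encoding of the labels.

```python
import numpy as np

def psi_nearest_neighbor(D_minus_i, z_i):
    """1-nearest-neighbor nonconformity score for classification: distance to
    the nearest same-label point divided by the distance to the nearest
    different-label point; large values flag an implausible (x, label) pair."""
    x_i, label_i = z_i
    same, diff = [], []
    for x, label in D_minus_i:
        d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(x_i, dtype=float))
        (same if label == label_i else diff).append(d)
    if not same:
        return np.inf            # candidate label unseen among the neighbors
    if not diff:
        return 0.0               # all neighbors share the candidate label
    return min(same) / min(diff)
```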

Conclusion

Here we focused on the important problem of prediction in supervised learning applications with no model assumptions (except exchangeability). We presented a notion of prediction validity, one that goes beyond the usual coverage probability guarantees of prediction sets. This condition assures the reliability of the degrees of belief, obtained from an imprecise probability distribution, assigned to all relevant assertions about the yet-to-be-observed quantity of interest. We also showed that, by

CRediT authorship contribution statement

Both authors contributed equally to the paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors thank the reviewers from the conference proceedings and journal submissions for their valuable feedback, and the IJAR guest editors—Andrés Cano, Jasper De Bock, and Enrique Miranda—for the invitation to contribute to the special journal issue. This work is partially supported by the U.S. National Science Foundation, grants DMS–1811802 and SES–2051225.

References (49)

  • P. Walley, Reconciling frequentist properties with the likelihood principle, J. Stat. Plan. Inference (2002)

  • C.M. Wang et al., Fiducial prediction intervals, J. Stat. Plan. Inference (2012)

  • A. Agresti, Categorical Data Analysis (2003)

  • M.S. Balch et al., Satellite conjunction analysis and the false confidence theorem, Proc. R. Soc. A, Math. Phys. Eng. Sci. (2019)

  • J. Cahoon et al., Generalized inferential models for meta-analyses based on few studies, Stat. Appl. (2020)

  • L. Cella et al., Approximately valid and model-free possibilistic inference

  • L. Cella et al., Valid inferential models for prediction in supervised learning problems

  • V. Chernozhukov et al., Distributional conformal prediction

  • F.P.A. Coolen, On nonparametric predictive inference and objective Bayesianism, J. Log. Lang. Inf. (2006)

  • A.P. Dempster, Upper and lower probabilities induced by a multivalued mapping, Ann. Math. Stat. (1967)

  • A.P. Dempster, A generalization of Bayesian inference, J. R. Stat. Soc., Ser. B, Stat. Methodol. (1968)

  • A.P. Dempster, Statistical inference from a Dempster–Shafer perspective

  • D. Dua et al., UCI Machine Learning Repository (2017)

  • D. Dubois et al., Possibility Theory (1988)
This is an extended version of the 2021 International Symposium on Imprecise Probability Theory and Applications (ISIPTA) proceedings paper, [7].
