Valid inferential models for prediction in supervised learning problems
Introduction
Data-driven prediction of future observations is a fundamental problem. Here our focus is on applications where the data consist of explanatory variables X, taking values in a space 𝒳, and a response variable Y, taking values in a space 𝒴. That is, we observe a collection of n pairs (X_1, Y_1), …, (X_n, Y_n) from an exchangeable process. The two most common examples are regression and classification, where 𝒴 is an open subset and a finite subset of ℝ, respectively. We consider both cases in what follows. The prediction problem corresponds to a case where we are given a value x_{n+1} of the next explanatory variable X_{n+1}, and the goal is to predict the corresponding future response Y_{n+1}.
By “prediction” we mean quantifying uncertainty about Y_{n+1} in a data-dependent way, i.e., depending on the observed data and the given value x_{n+1} of X_{n+1}. One perspective on prediction uncertainty quantification is the construction of a suitable family of prediction sets representing collections of sufficiently plausible values for Y_{n+1}; see, e.g., [44], [5], [28], and Equation (19) below. While prediction sets are practically useful, there are prediction-related tasks that they cannot perform; in particular, they cannot assign degrees of belief (or betting odds, etc.) to all relevant assertions or hypotheses “Y_{n+1} ∈ A,” for A ⊆ 𝒴. An alternative approach is to develop what we refer to here as a probabilistic predictor, i.e., a probability-like structure (precise or imprecise probability) defined on 𝒴, depending on the data and on x_{n+1}, designed to quantify uncertainty about Y_{n+1} by directly assigning degrees of belief to relevant assertions. The most common approach to probabilistic prediction is Bayesian, where a prior distribution for the model is specified and uncertainty is quantified by the posterior predictive distribution of Y_{n+1}, given the data and x_{n+1}. Other non-Bayesian approaches leading to predictive distributions include [29], [10], [49], [46].
Before moving forward, it is important to distinguish between uncertainty quantification with prediction sets and with probabilistic predictors. One does not need a full (precise or imprecise) probability distribution to construct prediction sets and, moreover, sets derived from a probabilistic predictor are not guaranteed to satisfy the frequentist coverage probability property that warrants calling them genuine “prediction sets.” Therefore, the motivation for going through the trouble of constructing a probabilistic predictor, Bayesian or otherwise, must be that there are important prediction-related tasks that prediction sets cannot satisfactorily handle. In other words, the belief assignments provided by a (precise or imprecise) probability must be a high priority. Strangely, however, the reliability of probabilistic predictors is only ever assessed in terms of (asymptotic) coverage probability properties of their corresponding prediction sets. Our unique perspective is that, since belief assignments are a priority, there ought to be a way to directly assess the reliability of a probabilistic predictor's belief assignments.
For prediction problems where only the response variables are observed, Cella and Martin [8] introduced a notion of validity for probabilistic predictors. Roughly, their validity condition requires that the subsets to which the probabilistic predictor tends to assign large numerical degrees of belief are the same as those that tend to contain Y_{n+1}. The point is that such a constraint ensures that the belief assignments made by the probabilistic predictor are not systematically misleading. Here we extend their notion of validity to the case where explanatory variables are present; the precise definitions are given below in Definitions 1 and 2. It turns out these notions of validity have some important consequences, imposing certain constraints on the mathematical structure of the probabilistic predictor. Indeed, we argue in Section 3 (see, also, Corollary 1 in Section 4) that validity can only be achieved by probabilistic predictors that take the form of an imprecise probability distribution. Section 2 provides a preview of the formal definition of validity and offers empirical support for the claim that precise probabilistic predictors cannot be valid.
After formally introducing these notions of validity in Section 3, we explore their behavioral and statistical consequences. First, we show that even the weaker validity property in Definition 1 implies that the probabilistic predictor avoids (a property stronger than) the familiar sure loss property in the imprecise probability literature, hence is not internally irrational from a behavioral point of view. We go on to show that prediction-related procedures, e.g., tests and prediction regions, derived from (uniformly) valid probabilistic predictors control frequentist error probability. The take-away message is that a (uniformly) valid probabilistic predictor provides the “best of both worlds”—it simultaneously achieves desirable behavioral and statistical properties.
Given the desirable properties of a valid probabilistic predictor, the natural question is how to construct one. The probabilistic predictor we construct here is largely based on the general theory of valid inferential models (IMs) as described in Martin and Liu [37], [39]. Martin and Liu's construction usually assumes a parametric model but, here, we aim to avoid such strong assumptions. For this, we use a particular extension of the so-called generalized IM approach developed in [32], [33]. The basic idea is that a link/association between observable data, quantities of interest, and an unobservable auxiliary variable with known distribution can be made without fully specifying the data-generating process. In Section 5, we develop a valid IM construction that assumes only exchangeability of the observed data process, with no parametric model assumptions required. There, in Theorem 1, we establish that this general IM-based probabilistic predictor construction achieves the (uniform) validity property. The specifics of this construction are presented in Section 6, in the context of regression. Section 7 considers the classification problem, and we show that the discreteness of Y in classification problems may cause the IM random set output, from which the probabilistic predictor is derived, to be empty with positive probability. Two possible adjustments are provided, with the one based on suitably “stretching” the random set being most efficient.
An important observation is that parallels can be drawn between our proposed IM construction and the conformal prediction approach put forward in [44] and elsewhere. This is interesting for at least two reasons.
- It demonstrates that one does not necessarily need “new methods” to construct probabilistic predictors that achieve the desired (uniform) validity property, just an appropriate re-interpretation of the output returned by certain existing methods. In particular, our proposed IM construction returns a possibility measure whose contour function is the transducer derived from an appropriate conformal prediction algorithm. Consequently, all we need is the corresponding conformal prediction algorithm to achieve our goals.
- However, there are a variety of ways in which the conformal prediction algorithm could be re-interpreted as a probabilistic predictor, e.g., as a precise probability distribution or as one of several different imprecise probability distributions. Our developments here reveal that the appropriate re-interpretation, the one that leads to (uniform) validity, is to treat the conformal transducer as the contour function that defines a possibility measure.
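To make the possibility-measure re-interpretation concrete, here is a minimal sketch (our own illustration, not code from the paper) of a conformal transducer in the covariate-free case, with the nonconformity score taken to be the distance to the mean of the augmented data; the contour it returns defines a possibility measure whose upper probability for an assertion {Y_{n+1} ∈ A} is the maximum of the contour over A.

```python
import numpy as np

def conformal_contour(data, y_grid):
    """Conformal transducer / plausibility contour pl(y) for the next
    observation, with nonconformity score |z - mean of augmented data|.
    pl(y) is the proportion of augmented scores at least as large as
    the candidate's own score."""
    data = np.asarray(data, dtype=float)
    pl = np.empty(len(y_grid))
    for j, y in enumerate(y_grid):
        aug = np.append(data, y)               # augment data with candidate y
        scores = np.abs(aug - aug.mean())      # nonconformity scores
        pl[j] = np.mean(scores >= scores[-1])  # rank-based plausibility of y
    return pl

def upper_probability(data, y_grid, in_A):
    """Possibility-measure upper probability of {Y_{n+1} in A}: the
    maximum of the contour over the grid points flagged as lying in A."""
    pl = conformal_contour(data, y_grid)
    return pl[np.asarray(in_A)].max()
```

Central candidate values receive plausibility near 1 and remote ones near 1/(n+1), which is exactly the consonant (possibilistic) structure described above.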
Prediction validity: a preview
To help clarify the difference between the traditional notions of uncertainty quantification in (probabilistic) prediction and the notions we have in mind here, we consider a relatively simple example for illustration, one in which there are no covariates. That is, suppose we have a sequence of real-valued observables Y_1, Y_2, … and, based on the observations Y_1, …, Y_n, the goal is to predict Y_{n+1} in a probabilistic way. One standard way to approach this is to construct a Bayesian predictive distribution.
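A Bayesian predictive distribution of the kind alluded to above can be sketched as follows; the normal model, conjugate prior, and all parameter values are our own toy assumptions, chosen only to make the construction concrete.

```python
import math

def bayes_predictive_cdf(y, data, prior_mean=0.0, prior_var=100.0, sigma2=1.0):
    """Posterior predictive CDF P(Y_{n+1} <= y | Y_1,...,Y_n) under the
    toy model Y_i ~ N(theta, sigma2) i.i.d., with conjugate prior
    theta ~ N(prior_mean, prior_var); the resulting predictive
    distribution is N(post_mean, post_var + sigma2)."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + sum(data) / sigma2)
    pred_sd = math.sqrt(post_var + sigma2)
    z = (y - post_mean) / pred_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note that this is a single, precise predictive distribution; the point of this section is that precise probabilistic predictors of this kind cannot achieve the validity property defined below.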
Setup
The goal here is to formalize the ideas discussed in Section 2 above. Recall that the present paper is concerned with prediction in supervised learning problems, so we assume there is an exchangeable process with distribution P, where each observation Z_i is a pair (X_i, Y_i). As is customary, the probability of “Y_{n+1} ∈ A” is understood to mean the marginal probability for the event “Y_{n+1} ∈ A” derived from the joint distribution of the process under P. The distribution P is completely unknown, except that it belongs to the
Behavioral
Despite our focus on frequentist-style properties, validity has some important behavioral consequences, à la de Finetti, Walley, and others. Towards this, define the lower/upper probabilistic predictor evaluated at A, optimized over all of its data inputs; recall that the lower and upper predictors depend implicitly on an argument x_{n+1}. An especially poor specification of prediction probabilities is a situation in which, for some assertion A,
Inferential models
A relevant question is how to construct a probabilistic predictor that achieves the (uniform) validity condition. One strategy would be a generalized Bayes approach, as advocated in, e.g., [47, Sec. 6.4]. That is, if 𝒫 is the set of candidate joint distributions for the observables, the generalized Bayes rule would define an upper prediction probability as the supremum over 𝒫, and the corresponding lower probability by replacing the supremum with an infimum. That this satisfies validity
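In symbols (our reconstruction of the notation; 𝒫 denotes the set of candidate joint distributions for the observables), the generalized Bayes rule referred to above would set

```latex
% Generalized Bayes upper/lower prediction probabilities
% (notation reconstructed from the surrounding text).
\overline{\Pi}_{z^n,\,x_{n+1}}(A)
  = \sup_{\mathsf{P} \in \mathscr{P}}
    \mathsf{P}\bigl\{ Y_{n+1} \in A \mid Z^n = z^n,\ X_{n+1} = x_{n+1} \bigr\},
\qquad
\underline{\Pi}_{z^n,\,x_{n+1}}(A)
  = \inf_{\mathsf{P} \in \mathscr{P}}
    \mathsf{P}\bigl\{ Y_{n+1} \in A \mid Z^n = z^n,\ X_{n+1} = x_{n+1} \bigr\}.
```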
Probabilistic prediction in regression
Recall that the A-step requires the specification of a real-valued function of the data whose distribution is known. Towards this, given the data consisting of the observable Z^n and the yet-to-be-observed Z_{n+1}, consider first a transformation defined componentwise via a suitable real-valued function Ψ that compares each response to a prediction derived from the remaining data at the corresponding explanatory variable, being small if they agree and large if they disagree.
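As an illustration of this kind of construction in simple linear regression, the following sketch (our own; the least-squares fit and absolute-residual Ψ are just one conventional choice) computes the conformal plausibility contour for Y_{n+1} at a new x value.

```python
import numpy as np

def regression_contour(x, y, x_new, y_grid):
    """Conformal plausibility contour for Y_{n+1} at x_new in simple
    linear regression: for each candidate y value, refit least squares
    on the augmented data and use Psi = |residual| as nonconformity."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    pl = np.empty(len(y_grid))
    for j, y_cand in enumerate(y_grid):
        xa = np.append(x, x_new)                       # augmented x's
        ya = np.append(y, y_cand)                      # augmented y's
        A = np.column_stack([np.ones_like(xa), xa])    # design matrix
        beta, *_ = np.linalg.lstsq(A, ya, rcond=None)  # least-squares fit
        scores = np.abs(ya - A @ beta)                 # Psi: absolute residuals
        pl[j] = np.mean(scores >= scores[-1])          # conformal transducer
    return pl
```

Candidate values of Y_{n+1} that conform to the fitted linear trend receive high plausibility, and outlying candidates receive low plausibility.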
Probabilistic prediction in classification
In Section 6, we found that the A-step boils down to the specification of a suitable real-valued, exchangeability-preserving function Ψ, which Vovk et al. [44] refer to as a non-conformity measure. In binary classification problems, a Ψ function like that in (26) can also be used here by encoding the binary labels as distinct real numbers. However, if there are more than two labels, and they are not on an ordinal scale where the assignment of different numbers to them is justified, there is no natural way to
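For unordered labels, a nonconformity measure can act on the labels directly, with no real-number encoding. The sketch below (our own illustration; the 1-nearest-neighbour score is just one standard choice) computes a conformal plausibility value for each candidate label of the next observation.

```python
import numpy as np

def classification_contour(X, labels, x_new, label_set):
    """Conformal plausibility for each candidate label of the next point,
    using a 1-nearest-neighbour nonconformity score: the distance to the
    closest *other* point carrying the same label (infinite if none)."""
    X = np.asarray(X, float)
    pl = {}
    for y_cand in label_set:
        Xa = np.vstack([X, x_new])       # augment features with the new point
        la = list(labels) + [y_cand]     # augment labels with the candidate
        scores = []
        for i in range(len(Xa)):
            same = [np.linalg.norm(Xa[i] - Xa[j])
                    for j in range(len(Xa)) if j != i and la[j] == la[i]]
            scores.append(min(same) if same else np.inf)
        scores = np.array(scores)
        pl[y_cand] = np.mean(scores >= scores[-1])
    return pl
```

With discrete labels, several candidates can receive ties in plausibility, which is related to the empty-random-set issue discussed in this section.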
Conclusion
Here we focused on the important problem of prediction in supervised learning applications with no model assumptions (except exchangeability). We presented a notion of prediction validity, one that goes beyond the usual coverage probability guarantees of prediction sets. This condition assures the reliability of the degrees of belief, obtained from an imprecise probability distribution, assigned to all relevant assertions about the yet-to-be-observed quantity of interest. We also showed that, by
CRediT authorship contribution statement
Both authors contributed equally to the paper.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors thank the reviewers from the conference proceedings and journal submissions for their valuable feedback, and the IJAR guest editors—Andrés Cano, Jasper De Bock, and Enrique Miranda—for the invitation to contribute to the special journal issue. This work is partially supported by the U.S. National Science Foundation, grants DMS–1811802 and SES–2051225.
References (49)
- Generalized inferential models for censored data, Int. J. Approx. Reason. (2021)
- Interval predictor models: identification and reliability, Automatica (2009)
- Validity, consonant plausibility measures, and conformal prediction, Int. J. Approx. Reason. (2022)
- The Dempster–Shafer calculus for statisticians, Int. J. Approx. Reason. (2008)
- Constructing belief functions from sample data using multinomial confidence regions, Int. J. Approx. Reason. (2006)
- Likelihood-based belief function: justification and some extensions to low-quality data, Int. J. Approx. Reason. (2014)
- Frequency-calibrated belief functions: review and new insights, Int. J. Approx. Reason. (2018)
- Inference about constrained parameters using the elastic belief method, Int. J. Approx. Reason. (2012)
- On an inferential model construction using generalized associations, J. Stat. Plan. Inference (2018)
- False confidence, non-additive beliefs, and valid statistical inference, Int. J. Approx. Reason. (2019)
- Reconciling frequentist properties with the likelihood principle, J. Stat. Plan. Inference
- Fiducial prediction intervals, J. Stat. Plan. Inference
- Categorical Data Analysis
- Satellite conjunction analysis and the false confidence theorem, Proc. R. Soc. A, Math. Phys. Eng. Sci.
- Generalized inferential models for meta-analyses based on few studies, Stat. Appl.
- Approximately valid and model-free possibilistic inference
- Valid inferential models for prediction in supervised learning problems
- Distributional conformal prediction
- On nonparametric predictive inference and objective Bayesianism, J. Log. Lang. Inf.
- Upper and lower probabilities induced by a multivalued mapping, Ann. Math. Stat.
- A generalization of Bayesian inference, J. R. Stat. Soc., Ser. B, Stat. Methodol.
- Statistical inference from a Dempster–Shafer perspective
- UCI Machine Learning Repository
- Possibility Theory