Data-driven nonlinear expectations for statistical uncertainty in decisions

In stochastic decision problems, one often wants to estimate the underlying probability measure statistically, and then to use this estimate as a basis for decisions. We shall consider how the uncertainty in this estimation can be explicitly and consistently incorporated in the valuation of decisions, using the theory of nonlinear expectations.

Example 1. Consider the following problem. Let $\{X_n\}_{n\in\mathbb{N}}$ be independent, identically distributed Bernoulli random variables with unknown parameter $p = P(X_n = 1) = 1 - P(X_n = 0)$, that is, independent tosses of the same, possibly unfair, coin. You observe $\{X_n\}_{n=1}^N$, and then need to draw a conclusion about the likely behaviour of a further iid trial $X$.
In a classical frequentist framework, this is straightforward: the estimator of $p$ (either from MLE or moment matching) is given by $\hat p = S_N/N$, where $S_N = \sum_{n=1}^N X_n$; this estimate has sampling variance $p(1-p)/N \approx \hat p(1-\hat p)/N$. Suppose we need to evaluate a wager on $X$. Given a loss function $\phi$, we would then usually calculate the expected loss $E[\phi(X)]$, where the expectation is based on the estimated parameters. Without loss of generality, we can assume $\phi(0) = 0$, so the inferred expectation is simply given by $\hat E[\phi(X)] = \hat p\,\phi(1)$. This leads to a surprising conclusion: the precision of the estimate of $p$ has no impact on our assessment of the wager. To see this, consider a sample based on $N' \gg N$ observations, but with the same value of $\hat p$. Then the precision of the estimate (as indicated by the reciprocal of the sampling variance) is much higher, but the expected loss of the wager remains identical. Consequently, when considering this wager, this approach concludes that you are indifferent between the settings when $p$ is known precisely or imprecisely. For example, suppose there were two coins: the first was thrown 3 times with 2 heads, the second 3000 times with 2000 heads. The estimated-expected-loss criterion then states that you are indifferent in choosing which coin to bet on, which is contrary to experience. Note that this conclusion is not changed by the presence of the loss function $\phi$.

Now, some may argue that this is a particular flaw in the frequentist point-estimate approach, as the error of the estimate of $p$ is not part of the probabilistic framework we use when calculating the expectation. So, let us take a Bayesian approach and put a prior on $p$, for example the (conjugate) Beta distribution $B(\alpha, \beta)$. The posterior distribution is then $B(\alpha + S_N, \beta + N - S_N)$; this has mean $\mu_p := (\alpha + S_N)/(\alpha + \beta + N)$ and variance $\mu_p(1 - \mu_p)/(\alpha + \beta + N + 1)$. The posterior expected loss is then $\mu_p\,\phi(1)$; again this does not depend on the precision of the estimate.
The choice of prior used is immaterial, as the behaviour is determined by (writing $\mathcal{F}_N$ for the σ-algebra generated by our observations)
$$E[\phi(X)\mid\mathcal{F}_N] = E\big[E[\phi(X)\mid p]\,\big|\,\mathcal{F}_N\big] = \phi(1)\,E[p\mid\mathcal{F}_N], \tag{1}$$
so only the posterior mean value of $p$ has any impact, not its posterior variance (or any other measure of uncertainty). Even if we extend beyond taking an expected payoff, for example to considering a posterior mean-variance criterion, we would find that the posterior variance of $\phi(X)$ is $\phi(1)^2\,E[p\mid\mathcal{F}_N]\,(1 - E[p\mid\mathcal{F}_N])$, which still only depends on the posterior mean of $p$. The same conclusion will be reached for any criterion which depends only on the posterior law of $\phi(X)$.
From this, we can conclude that both the frequentist and Bayesian expected loss approaches fail to incorporate uncertainty in $p$ in our decision making, in this simple setting.
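The indifference described in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names are illustrative, the loss is normalised so that $\phi(0)=0$ and $\phi(1)=1$, and the Beta(2, 1) prior is a hypothetical choice made only so that the two posterior means coincide exactly.

```python
def frequentist_summary(heads, tosses, loss_on_heads=1.0):
    """Point estimate, its estimated sampling variance, and the plug-in expected loss."""
    p_hat = heads / tosses
    sampling_var = p_hat * (1.0 - p_hat) / tosses
    expected_loss = p_hat * loss_on_heads  # uses the normalisation phi(0) = 0
    return p_hat, sampling_var, expected_loss


def bayes_summary(heads, tosses, a, b, loss_on_heads=1.0):
    """Posterior mean, posterior variance and posterior expected loss for a Beta(a, b) prior."""
    post_a = a + heads
    post_b = b + tosses - heads
    mean = post_a / (post_a + post_b)
    var = mean * (1.0 - mean) / (post_a + post_b + 1.0)
    return mean, var, mean * loss_on_heads


# The two coins of Example 1: 2 heads in 3 tosses, and 2000 heads in 3000 tosses.
freq_small = frequentist_summary(2, 3)
freq_large = frequentist_summary(2000, 3000)

# Hypothetical Beta(2, 1) prior: both posteriors then have mean exactly 2/3.
bayes_small = bayes_summary(2, 3, a=2.0, b=1.0)
bayes_large = bayes_summary(2000, 3000, a=2.0, b=1.0)

print("frequentist:", freq_small, freq_large)
print("bayesian:   ", bayes_small, bayes_large)
```

In both frameworks the variance entries differ by orders of magnitude between the two coins, while the expected-loss entries agree, which is the behaviour the example criticises.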
The unusual behaviour of this type of example has been noticed before. For example, Keynes remarks (using the term 'evidential weight' to indicate a concept similar to the precision of probabilities):

"For in deciding on a course of action, it seems plausible to suppose that we ought to take account of the weight as well as the probability of different expectations." (J.M. Keynes, A Treatise on Probability)

Knight [14] argues that ignoring this uncertainty is not descriptive of people's actions: we do, generally, have a strict preference for knowledge of the probabilities of outcomes (see also the more general criticism of Allais [1]). This leads him to distinguish between the concepts of 'risk', which is associated with the outcome of $X$ given $p$, and 'uncertainty', which is associated with our lack of knowledge of $p$.
Within either of the two classical frameworks considered above, there is a natural and classical way to deal with this issue. For a frequentist, instead of using the point estimate $\hat p$, one could consider building a confidence interval for $p$, and then comparing wagers by their worst expectation among parameters within the confidence interval. As the sample size increases, the confidence interval shrinks, and so (for a fixed value of $\hat p$) the value of the wager increases. Similarly for a Bayesian, using a credible interval in the place of the confidence interval. While well known and sensible, this is (at least on the surface) an ad hoc fix, and needs to be defended philosophically: for example, in the Bayesian setting, the uncertainty in $p$ should already have been included in the assessment of $E[\phi(X)\mid\mathcal{F}_N]$, so this approach seems to be double-counting the uncertainty. In more complex settings, where the parameter $p$ is replaced by a multidimensional parameter and we are interested in comparing the values of a variety of random outcomes (whose expectations are generally nonlinear functions of the parameters), confidence sets become less natural, so a more general and rigorous approach seems to be needed.
In this paper, we will give one such approach. As Example 1 shows, to fully incorporate our statistical uncertainty, we cannot simply estimate the (posterior) distribution of the outcome. Instead, we need to retain some knowledge of how accurate that estimate is, and feed that additional knowledge into our decision making. We shall do this by making a general suggestion of a method, proving some of its general properties, and giving a selection of pertinent examples. This is our key philosophical claim: When evaluating outcomes in the presence of estimation and model uncertainty, it is not enough to depend simply on the distribution of the outcome under a fitted model; the evaluation should also depend on how well other parameter choices and models would have fitted the observations on which we are basing our evaluation.
Instead of simply dealing with a single probability, we will study the effect of using the likelihood function (which indicates how well a model fits our observations) to generate a 'convex expectation', closely related to the risk measures often studied in mathematical finance. The theory of these nonlinear expectations is explored in detail in Föllmer and Schied [8] (up to some changes of sign), and gives a mathematically rigorous way to deal with 'Knightian uncertainty'. In economics, this is closely linked to Gilboa and Schmeidler's model of multiple priors [10]. However, little work has been done on connecting nonlinear expectations with statistics.
For Example 1 above, our proposal amounts to the following. Instead of working with the expected loss $E[\phi(X)]$ under one particular estimated measure, consider the quantity
$$\mathcal{E}(\xi) := \sup_{p\in[0,1]}\Big\{E_p[\xi] - \big(k^{-1}\alpha(p)\big)^{\gamma}\Big\}, \qquad \xi = \phi(X),$$
for a fixed uncertainty aversion parameter $k > 0$ and exponent $\gamma \ge 1$, where $E_p$ denotes the expectation when the parameter takes the value $p$, and $\alpha$ is the negative log-likelihood of our observations, shifted to have minimal value zero, that is (for $\hat p = S_N/N$ as above),
$$\alpha(p) = -S_N\log\frac{p}{\hat p} - (N - S_N)\log\frac{1-p}{1-\hat p} \approx \frac{N}{2\hat p(1-\hat p)}(p - \hat p)^2,$$
where the approximation is for large $N$, in a sense to be explored later (it is essentially a form of the central limit theorem, see Section 3.2). As $\alpha(\hat p) = 0$, we have $\mathcal{E}(\xi) \ge E_{\hat p}[\xi]$, and basic calculation gives a (rather inelegant) formula for $\mathcal{E}(\xi)$ in terms of $\hat p$, $N$, $k$, $\phi(1)$ and $\phi(0)$. The operator $\mathcal{E}$ gives an 'upper' expectation for the loss, depending on the certainty of our parameter estimate given the sample. In effect, we are considering all possible values for $p$, and using our data to determine how reasonable we think they are (as indicated by $-(k^{-1}\alpha(p))^{\gamma}$). In particular, we do not attempt to give any point-estimate of $p$, or assume that we can treat $p$ as a random variable with known distribution. If we were to use $\mathcal{E}$ to choose between a family of wagers $\phi_i$, we would obtain a classical minimax or 'robust optimization' problem (see for example Ben-Tal, El Ghaoui and Nemirovski [4]),
$$\min_i \mathcal{E}(\phi_i(X)) = \min_i \sup_{p\in[0,1]}\Big\{E_p[\phi_i(X)] - \big(k^{-1}\alpha(p)\big)^{\gamma}\Big\}.$$
The expectation $\mathcal{E}$ can be thought of as an 'upper' expectation, and is convex. The corresponding 'lower' expectation $-\mathcal{E}(-\xi)$ can also be defined, and is concave. This leads naturally to $[-\mathcal{E}(-\xi),\ \mathcal{E}(\xi)]$ as an interval prediction for $\xi$. Comparing with more familiar quantities, such as (frequentist) confidence intervals, (Bayesian) credible intervals and upper and lower probabilities in Dempster-Shafer theory, we see that an interval estimate is a natural object to study when describing uncertainty in parameters. We shall see that confidence intervals (in particular, likelihood intervals) arise as a special case of our approach. Remark 1.
The approach taken here is specifically tailored to consider 'uncertainty' (lack of knowledge of probabilities), rather than 'risk' (lack of knowledge of outcomes, but with known probabilities). In particular, if we have sufficient data that we know the probability measure exactly, then our expectation is simply the classical expected value, and does not involve any loss-aversion. A loss or utility function can be used to incorporate these effects, or our approach can be extended to allow a wider class of evaluations (see Remark 9).

This article proceeds as follows. First, we give a summary of some of the basic properties of nonlinear expectations. Secondly, we consider the effect of using the log-likelihood as the basis for a penalty function and the corresponding "divergence-robust nonlinear expectations", and their connection to relative entropy. Using this, we tease out generic large-sample approximations, in both parametric and non-parametric settings. Finally, we consider the connection between divergence-robust expectations and robust statistics (in particular M-estimates).
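The proposal for Example 1 can be explored numerically. The sketch below is an illustration under stated assumptions rather than the author's implementation: it evaluates the supremum over $p$ by a plain grid search (the grid size and function names are arbitrary choices), uses the shifted negative log-likelihood of the Bernoulli model as the penalty, and treats $\gamma=\infty$ with the convention that the penalty is zero when $\alpha(p)\le k$ and infinite otherwise.

```python
import math


def shifted_neg_loglik(p, heads, tosses):
    """alpha(p): negative log-likelihood, shifted so that alpha(p_hat) = 0."""
    def loglik(q):
        out = 0.0
        if heads > 0:
            out += heads * math.log(q)
        if tosses - heads > 0:
            out += (tosses - heads) * math.log(1.0 - q)
        return out
    return loglik(heads / tosses) - loglik(p)


def dr_expectation(phi1, phi0, heads, tosses, k=1.0, gamma=1.0, grid=20001):
    """Grid approximation of sup_p { E_p[phi(X)] - (alpha(p)/k)**gamma }."""
    best = -math.inf
    for i in range(1, grid):  # interior points p = i / grid
        p = i / grid
        a = max(shifted_neg_loglik(p, heads, tosses) / k, 0.0)  # clamp rounding noise
        if math.isinf(gamma):
            penalty = 0.0 if a <= 1.0 else math.inf
        else:
            penalty = a ** gamma
        best = max(best, p * phi1 + (1.0 - p) * phi0 - penalty)
    return best


# Upper and lower expectations of the wager phi(X) = X for the two coins of Example 1.
upper_small = dr_expectation(1.0, 0.0, 2, 3)
upper_large = dr_expectation(1.0, 0.0, 2000, 3000)
lower_small = -dr_expectation(-1.0, 0.0, 2, 3)
lower_large = -dr_expectation(-1.0, 0.0, 2000, 3000)

print("3 tosses:   ", (lower_small, upper_small))
print("3000 tosses:", (lower_large, upper_large))
```

Both intervals contain the plug-in value 2/3, but the interval for the coin thrown 3 times is visibly wider, so the two coins are no longer valued identically.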

Nonlinear expectations
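In this section we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied [8] and Frittelli and Rosazza-Gianin [9], among many others. We here present, without proof, the key details of this theory as needed for our analysis.

Definition 1. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $L^\infty(\mathcal{F})$ denote the space of $P$-essentially bounded $\mathcal{F}$-measurable random variables. A nonlinear expectation on $L^\infty(\mathcal{F})$ is a mapping $\mathcal{E}: L^\infty(\mathcal{F}) \to \mathbb{R}$ satisfying
• Strict monotonicity: for any $\xi_1, \xi_2 \in L^\infty(\mathcal{F})$, if $\xi_1 \ge \xi_2$ a.s. then $\mathcal{E}(\xi_1) \ge \mathcal{E}(\xi_2)$, and if in addition $\mathcal{E}(\xi_1) = \mathcal{E}(\xi_2)$ then $\xi_1 = \xi_2$ a.s.
• Constant triviality: for any constant $c \in \mathbb{R}$, $\mathcal{E}(c) = c$.
• Translation equivariance: for any $c \in \mathbb{R}$ and $\xi \in L^\infty(\mathcal{F})$, $\mathcal{E}(\xi + c) = \mathcal{E}(\xi) + c$.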
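A 'convex' expectation in addition satisfies
• Convexity: for any $\lambda \in [0, 1]$ and $\xi_1, \xi_2 \in L^\infty(\mathcal{F})$, $\mathcal{E}(\lambda\xi_1 + (1-\lambda)\xi_2) \le \lambda\mathcal{E}(\xi_1) + (1-\lambda)\mathcal{E}(\xi_2)$.
If $\mathcal{E}$ is a convex expectation, then the operator defined by $\rho(\xi) = \mathcal{E}(-\xi)$ is called a convex risk measure. A particularly nice class of convex expectations is those which satisfy
• Lower semicontinuity: for a sequence $\{\xi_n\}_{n\in\mathbb{N}} \subset L^\infty(\mathcal{F})$ with $\xi_n \uparrow \xi \in L^\infty(\mathcal{F})$ pointwise, $\mathcal{E}(\xi_n) \uparrow \mathcal{E}(\xi)$.
The following theorem (which was expressed in the language of risk measures) is due to Föllmer and Schied [8] and Frittelli and Rosazza-Gianin [9].

Theorem 1. Let $\mathcal{M}_1$ denote the space of all probability measures on $(\Omega, \mathcal{F})$ absolutely continuous with respect to $P$. Suppose $\mathcal{E}$ is a lower semicontinuous convex expectation. Then there exists a 'penalty' function $\alpha: \mathcal{M}_1 \to [0, \infty]$ such that
$$\mathcal{E}(\xi) = \sup_{Q\in\mathcal{M}_1}\big\{E_Q[\xi] - \alpha(Q)\big\}.$$
In addition, there is a minimal such function, given by
$$\alpha_{\min}(Q) = \sup_{\xi\in L^\infty(\mathcal{F})}\big\{E_Q[\xi] - \mathcal{E}(\xi)\big\}.$$
Provided $\alpha(Q) < \infty$ for some $Q$ equivalent to $P$, we can restrict our attention to measures in $\mathcal{M}_1$ equivalent to $P$ without loss of generality.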
Classic convex analysis shows that α min is the Fenchel-Legendre conjugate of E, and is also a convex function and weak-* lower semicontinuous. It is also clear that, as E(0) = 0, we have the identity inf Q∈M1 {α min (Q)} = 0. Remark 2. As discussed above in the context of our example, this result gives some intuition as to how a convex expectation can model 'Knightian' uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on how plausible it is considered. As convexity of E is a natural requirement of an 'uncertainty averse' assessment of outcomes, Theorem 1 shows that this is the only way to construct an 'expectation' E which penalizes uncertainty, while preserving monotonicity, translation equivariance and constant triviality.
In particular, if (and only if) $\mathcal{E}$ is positively homogeneous, that is, it satisfies "for any $\lambda \ge 0$, $\mathcal{E}(\lambda\xi) = \lambda\mathcal{E}(\xi)$", then $\alpha_{\min}$ only takes the values $\{0, \infty\}$, and we can rewrite our representation as
$$\mathcal{E}(\xi) = \sup_{Q\in\mathcal{M}^*} E_Q[\xi],$$
where $\mathcal{M}^* \subset \mathcal{M}_1$ is the set of measures for which $\alpha(Q) = 0$. In this case, we see that our convex expectation corresponds to taking the maximum of the expectations under a range of possible models for the random system.

Remark 3. The convex expectation $\mathcal{E}$ is defined above as an operator on $L^\infty$. However, given the equivalent representation $\mathcal{E}(\xi) = \sup_{Q\in\mathcal{M}_1}\{E_Q[\xi] - \alpha(Q)\}$, we can clearly define $\mathcal{E}(\xi)$ for a wider class of random variables. In particular, $\mathcal{E}(\xi)$ is well defined (but may be infinite) for all random variables ξ such that

Given a convex nonlinear expectation $\mathcal{E}$, there is a natural class of 'acceptable' random variables for a decision problem, namely (given we evaluate losses) the convex level set $A = \{\xi : \mathcal{E}(\xi) \le 0\}$.
One can also use a nonlinear expectation as a value to be optimized; in this setting the convexity of the operator is of significant interest. Finally, one can use a nonlinear expectation to give a robust point estimate of $\xi$, given a loss function $\phi$, by choosing the value $\hat\xi \in \mathbb{R}$ which minimizes the loss $\mathcal{E}(\phi(\xi - \hat\xi))$ (cf. Wald [22]).
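The penalty representation can be made concrete on a toy example. The sketch below assumes a three-point sample space and a hypothetical finite list of candidate measures with hand-picked penalties (with minimum penalty zero); it then checks constant triviality, translation equivariance and convexity on a few inputs, and tests membership of the acceptance set $A$. All numbers are illustrative.

```python
def convex_expectation(xi, models):
    """E(xi) = max over models of { E_Q[xi] - alpha(Q) } on a finite sample space."""
    return max(
        sum(q * x for q, x in zip(probs, xi)) - penalty
        for probs, penalty in models
    )


# Hypothetical models: (probability vector on three outcomes, penalty alpha(Q)).
models = [
    ((0.5, 0.3, 0.2), 0.0),  # the most plausible model carries zero penalty
    ((0.2, 0.3, 0.5), 0.1),
    ((0.1, 0.1, 0.8), 0.4),
]

xi1 = (1.0, 0.0, -2.0)
xi2 = (-1.0, 0.5, 3.0)
lam = 0.3
mix = tuple(lam * a + (1.0 - lam) * b for a, b in zip(xi1, xi2))
shifted = tuple(a + 5.0 for a in xi1)

e1 = convex_expectation(xi1, models)
e2 = convex_expectation(xi2, models)
e_mix = convex_expectation(mix, models)
e_shifted = convex_expectation(shifted, models)
e_const = convex_expectation((2.0, 2.0, 2.0), models)

# A loss xi is 'acceptable' when its upper expectation is non-positive.
acceptable = e1 <= 0.0

print(e1, e2, e_mix, e_shifted, e_const, acceptable)
```

Because the operator is a maximum of affine functions of $\xi$, convexity holds automatically; the numerical checks are only a sanity test of the construction.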

Penalties and likelihood
The general framework of nonlinear expectations is well suited to modelling Knightian uncertainty, but is not usually connected with statistical estimation. We would like to have a general principle for treating our uncertainty, which is closely tied to classical statistics. Our aim is to have a nonlinear expectation which uses observations to derive estimates of real-world probabilities, and uses these estimates and their uncertainty to give robust average values for a wide range of random outcomes. Rather than continuing to take an abstract axiomatic approach, we shall consider the following concrete proposal:

Definition 2. Suppose we have an observation vector $x$ taking values in $\mathbb{R}^N$. For a model $Q \in \mathcal{M}_1$, let $L(Q|x)$ denote the likelihood of $x$ under $Q$, that is, the density of $x$ with respect to a reference measure (which we shall take to be Lebesgue measure on $\mathbb{R}^N$ for simplicity). Let $\mathcal{Q} \subseteq \mathcal{M}_1$ be a set of models under consideration (for example, a parametric set of distributions). We then define the "$\mathcal{Q}|x$-divergence" to be the negative log-likelihood ratio
$$\alpha_{\mathcal{Q}|x}(Q) := -\log L(Q|x) + \sup_{\tilde Q\in\mathcal{Q}}\log L(\tilde Q|x).$$
The right hand side is well defined whether or not a maximum likelihood estimator exists. Given a $\mathcal{Q}$-MLE $\hat Q$, we would have the simpler representation
$$\alpha_{\mathcal{Q}|x}(Q) = -\log\frac{L(Q|x)}{L(\hat Q|x)}.$$
Given $\alpha_{\mathcal{Q}|x}$, for an uncertainty aversion parameter $k > 0$ and exponent $\gamma \in [1, \infty]$, we obtain the corresponding convex expectation
$$\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(\xi) := \sup_{Q\in\mathcal{Q}}\Big\{E_Q[\xi\mid x] - \big(k^{-1}\alpha_{\mathcal{Q}|x}(Q)\big)^{\gamma}\Big\},$$
where we adopt the convention $x^\infty = 0$ for $x \in [0, 1]$ and $+\infty$ otherwise. We call $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}$ the "$\mathcal{Q}|x$-divergence robust expectation" (with parameters $k, \gamma$), or simply the "DR-expectation".
Our attention will mainly be on the two extremal cases $\gamma = 1$ and $\gamma = \infty$; however, the intervening cases are natural interpolations between them. The statement $1^\infty = 0$ is natural from a convex analytic perspective, as it implies $|x|^q$ is proportional to the convex dual of $|x|^p$, whenever $p^{-1} + q^{-1} = 1$.

We shall focus our attention on the special case where $x = \{X_n\}_{n=1}^N$ and, under each $Q \in \mathcal{Q}$, we know $X, \{X_n\}_{n\in\mathbb{N}}$ are iid random variables. This allows analytically tractable results; however, our approach is applicable much more widely.

Remark 4. In our example above, $\mathcal{Q}$ corresponds to the set of measures such that $X, \{X_n\}_{n=1}^N$ are iid Bernoulli with parameter $p \in [0, 1]$. In this example, we did not consider all measures in $\mathcal{M}_1$ (this would include, for example, models where $\{X_n\}_{n=1}^N$ and $X$ come from completely unrelated distributions), but neither did we restrict our attention to a single $Q \in \mathcal{Q}$.
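Typically the operator $\mathcal{E}$ cannot be evaluated by hand; instead, numerical optimization or approximation is needed. In the setting of Example 1 above, if $\gamma = 1$ then a closed form representation can be obtained, but it is quite inelegant (the optimal $q$ is the solution to a quadratic equation, but the resulting equation for $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(\xi)$ does not simplify). A simple example where closed form quantities can be derived is the classic setting where the data are assumed to be Gaussian with unknown mean (and as alluded to above, these are unsurprisingly very similar for large $N$).

Example 2. Suppose $x = (X_1, X_2, ..., X_N)$ and $\mathcal{Q}$ corresponds to those measures under which $X, \{X_n\}_{n=1}^N$ are iid $N(\mu, 1)$ random variables, where $\mu$ is unknown. The divergence is then $\alpha_{\mathcal{Q}|x}(\mu) = \frac{N}{2}(\mu - \bar X)^2$, where $\bar X = N^{-1}\sum_{n=1}^N X_n$ denotes the sample mean, and for any constant $\beta > 0$, simple calculus can be used to derive
$$\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(\beta X) = \beta\bar X + \Big(1 - \frac{1}{2\gamma}\Big)(2\gamma)^{-\frac{1}{2\gamma-1}}\Big(\frac{2k}{N}\Big)^{\frac{\gamma}{2\gamma-1}}\beta^{\frac{2\gamma}{2\gamma-1}}.$$
In particular, when $\gamma = 1$, we have
$$\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\beta X) = \beta\bar X + \frac{k}{2N}\beta^2,$$
and, taking the limit $\gamma \to \infty$ (or directly from the definition),
$$\mathcal{E}^{k,\infty}_{\mathcal{Q}|x}(\beta X) = \beta\bar X + \beta\sqrt{\frac{2k}{N}}.$$
In this latter case, taking $k \approx 2$, we obtain the upper bound of the classical 95% confidence interval for the mean of $\beta X$.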
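The corresponding lower expectations are given by the symmetric quantities
$$-\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(-\beta X) = \beta\bar X - \Big(1 - \frac{1}{2\gamma}\Big)(2\gamma)^{-\frac{1}{2\gamma-1}}\Big(\frac{2k}{N}\Big)^{\frac{\gamma}{2\gamma-1}}\beta^{\frac{2\gamma}{2\gamma-1}}.$$
From this example we can observe a few phenomena, which we will discuss more generally below. First, for $\gamma < \infty$, the DR-expectation is not positively homogeneous, that is, $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(\beta X) \neq \beta\,\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(X)$ in general. The larger (in absolute terms) the random variable considered, the more the uncertainty affects our assessment. On the other hand, positive homogeneity is satisfied when $\gamma = \infty$, and there is a close relationship between $\mathcal{E}^{k,\infty}_{\mathcal{Q}|x}$ and the classical confidence interval for $E[X]$. Secondly, for any $\gamma$, as the ratio of the uncertainty parameter and the sample size $k/N \to 0$, the DR-expectation converges to the value $\beta\bar X$ given by the (unique) $\mathcal{Q}$-MLE (that is, the parameter corresponding to the measure in $\mathcal{Q}$ with the largest likelihood). This convergence is of the order $(k/N)^{\frac{\gamma}{2\gamma-1}}$.

In this setting we can also calculate, for $\beta > 0$,
$$\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\beta X^2) = \beta\Big(1 + \frac{\bar X^2}{1 - 2k\beta/N}\Big) \ \text{ if } \beta < \frac{N}{2k}, \qquad \mathcal{E}^{k,1}_{\mathcal{Q}|x}(\beta X^2) = +\infty \ \text{ if } \beta > \frac{N}{2k},$$
and
$$\mathcal{E}^{k,\infty}_{\mathcal{Q}|x}(\beta X^2) = \beta\Big(1 + \Big(|\bar X| + \sqrt{2k/N}\Big)^2\Big),$$
which is always finite. This explosion in $\mathcal{E}^{k,1}_{\mathcal{Q}|x}$ will be considered in more detail in Section 4. Notice that again, as $k/N \to 0$, both quantities converge to $\beta(1 + \bar X^2)$, the value under the $\mathcal{Q}$-MLE.

Remark 5. We will focus on the use of the likelihood for estimation; however, it is clear that other quantities could also be considered. In particular, if we have a family of parametric distributions $\mathcal{Q}$ with varying numbers of parameters, then it would be reasonable to penalize by Akaike's information criterion or the Bayesian information criterion, rather than simply by the likelihood. In some settings, the use of a quasi-likelihood, rather than the true likelihood, may also be of interest, particularly if this renders the problem more computationally efficient. Furthermore, including terms relating to the log-density of a 'prior' penalty may be of interest, so the penalty will be taken using the log-density of the posterior distribution, rather than with the likelihood (see Example 5 for this). For the sake of simplicity, we will not pursue these variants in detail here; however, their behaviour should be qualitatively similar to what we consider.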
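The Gaussian calculation of Example 2 can be cross-checked numerically. The sketch below assumes the divergence $\alpha(\mu) = \frac{N}{2}(\mu-\bar X)^2$ for the $N(\mu,1)$ model and compares a closed-form expression obtained by elementary calculus with a direct ternary search over the shift $\delta = \mu - \bar X \ge 0$. Function names and the numerical inputs are illustrative.

```python
import math


def dr_gaussian_closed_form(beta, xbar, n, k, gamma):
    """Closed form for sup_mu { beta*mu - ((n/(2k)) (mu - xbar)^2)^gamma }, beta > 0."""
    if math.isinf(gamma):
        # Penalty is 0 on |mu - xbar| <= sqrt(2k/n) and infinite outside.
        return beta * xbar + beta * math.sqrt(2.0 * k / n)
    e = 1.0 / (2.0 * gamma - 1.0)
    correction = (
        (1.0 - 1.0 / (2.0 * gamma))
        * (2.0 * gamma) ** (-e)
        * (2.0 * k / n) ** (gamma * e)
        * beta ** (2.0 * gamma * e)
    )
    return beta * xbar + correction


def dr_gaussian_numeric(beta, xbar, n, k, gamma, iters=200):
    """Ternary search for the same supremum (finite gamma, beta > 0)."""
    c = (n / (2.0 * k)) ** gamma

    def gain(delta):  # concave in delta >= 0
        return beta * delta - c * delta ** (2.0 * gamma)

    hi = 1.0
    while gain(hi) > 0.0:  # enlarge the bracket until the gain is negative
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if gain(m1) < gain(m2):
            lo = m1
        else:
            hi = m2
    return beta * xbar + gain(0.5 * (lo + hi))


for g in (1.0, 1.5, 3.0):
    print(g, dr_gaussian_closed_form(1.5, 0.3, 50, 2.0, g),
          dr_gaussian_numeric(1.5, 0.3, 50, 2.0, g))

# With k = 2 and gamma = infinity the half-width is 2/sqrt(n), close to 1.96/sqrt(n).
print(dr_gaussian_closed_form(1.0, 0.0, 100, 2.0, math.inf))
```

The last line illustrates the remark about the 95% confidence interval: for $N = 100$ and $k = 2$ the half-width is $0.2$, against $0.196$ for the classical interval.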
We have noticed above, in the Gaussian case, that our nonlinear expectation is positively homogeneous only in the case $\gamma = \infty$. This is a general fact, as shown by the following proposition.

Proposition 1. In the case $\gamma = \infty$ (and only in this case, provided the likelihood is finite and varying for a nontrivial subset of $\mathcal{Q}$), our nonlinear expectation is positively homogeneous, that is, $\mathcal{E}^{k,\infty}_{\mathcal{Q}|x}(\lambda\xi) = \lambda\,\mathcal{E}^{k,\infty}_{\mathcal{Q}|x}(\xi)$ for all $\lambda \ge 0$.

Proof. It is classical (see for example Föllmer and Schied [8]) that a convex nonlinear expectation is positively homogeneous if and only if the penalty takes only the values $\{0, \infty\}$. Given the likelihood is finite and varying on a nontrivial subset of $\mathcal{Q}$, this is not the case for any $\gamma < \infty$, but is the case for $\gamma = \infty$ by definition.
Remark 6. For a $\mathcal{Q}$-MLE $\hat Q$, by definition we have $L(Q|x) \le L(\hat Q|x)$, so $\alpha_{\mathcal{Q}|x}(Q) \ge 0$, with equality only if $Q$ is a maximum likelihood model. In general we cannot say whether $\alpha_{\mathcal{Q}|x}$ is convex (indeed, we have not assumed that its domain $\mathcal{Q}$ is a convex set), so it is not generally the case that $(k^{-1}\alpha_{\mathcal{Q}|x})^{\gamma}$ is the minimal penalty for $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}$.

Dynamic consistency
Within the theory of nonlinear expectations, much attention has been paid to questions of dynamic consistency. If we have a family {E s } s≥0 of 'conditional' nonlinear expectations relative to a filtration {F t } t≥0 , then dynamic consistency requires, for every ξ and all s ≤ t, that we have (i) the recursivity relationship This concept is generally not appropriate for our approach, as the expectations we define are typically not consistent. This can be seen from the following easy extension of Example 2.
Example 3. In the context of Example 2, write So in either case, the nonlinear expectation $\{\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}\}_{N\in\mathbb{N}}$ is not recursive.

In effect, our problem differs from the dynamically consistent one in the following (closely related) key ways:
• In a dynamically consistent setting, the penalty is prescribed (and may be taken to be constant through time, as discussed in [6]) while the observations lead to conditional expectations appearing in the nonlinear expectation. In our setting, the penalty is determined by the observations (through the $\mathcal{Q}|x$-divergence), and so only the family of models $\mathcal{Q}$ and the constants $k, \gamma$ need to be specified. In this way, the observations will inform our understanding of the real-world probabilities directly, rather than simply replacing them with conditional probabilities.
• In a dynamically consistent setting, decisions are typically thought of as being made at each time point, and one needs to ensure that they satisfy a dynamic programming principle (so it is reasonable to make plans for future decisions). In our setting, we naturally consider making a single decision after repeated observations, and seek an empirical basis on which to do so.
• In a dynamically consistent setting, one typically has that the limit as the number of observations increases is the true value of the outcome, that is, $\mathcal{E}(\xi\mid\mathcal{F}_t) \to \xi$ as $t \to T$. In our setting, we will instead typically have that $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}(\xi) \to E_P[\xi]$ as $N \to \infty$ in $P$-probability for all $P \in \mathcal{Q}$, that is, our expectation converges to the expectation under the 'true' measure.
• In a dynamically consistent setting, if we assume our observations are independent of X (under all Q ∈ Q), we will typically learn nothing about the value of X, so E(φ(X)|F N ) is constant. In our setting, independent observations are needed to teach us about the distribution of X, and hence give useful information, which is incorporated into our expectations.
• In a dynamically consistent setting, the underlying models are typically required to be stable under pasting through time. Conceptually, this implies that there is no significant link assumed between the 'true' model governing our observations at different times. Conversely, in our setting, we typically assume that the underlying model is constant through time (i.e. our observations are iid), and hence repeated observations can inform our view of the 'true' model.
In [6], a filtering problem was considered, where $\xi$ was taken to depend on a hidden time-varying process, and the observations were used to filter the expected value of $\xi$. This was coupled with a nonlinear expectation in two ways. First, a DR-expectation approach was used in an initial calibration phase to estimate the underlying probabilistic structures of the model and their uncertainty, that is, the dynamics of the hidden process and the observation process. Secondly, the penalty function obtained from the DR-expectation approach was used to build a nonlinear expectation with good dynamic properties for new realizations of these processes, associated with an on-line filter. In this second setting, new information is incorporated into the risk assessment, but one does not recalibrate the estimation of the underlying probabilistic dynamics. Nevertheless, as we shall see, some 'dynamic' properties of our nonlinear expectation are available. In particular, in Section 3 we shall consider the large-sample asymptotic behaviour of a DR-expectation.

Exponentials and Entropy
It will not be a surprise that there is a connection between the convex expectation we propose and a more traditional quantity in risk-averse decision making, namely the certainty equivalent under exponential utility.
Definition 3. For a random variable $\xi$, under a reference measure $P$, the certainty equivalent under exponential utility has definition where $k > 0$ is a risk-aversion parameter. Defining the relative entropy (or Kullback-Leibler divergence) $D_{KL}(Q\|P) := E_Q[\log(dQ/dP)]$, we have the representation (see, for example, [8]) Replacing expectations by conditional expectations, we obtain the conditional certainty equivalent.
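The entropic representation referred to in Definition 3 can be illustrated on a finite sample space. The sketch below assumes the common convention $\frac{1}{k}\log E_P[e^{k\xi}] = \sup_Q\{E_Q[\xi] - \frac{1}{k}D_{KL}(Q\|P)\}$; the scaling of $k$ is an assumption made here for illustration and may differ from the one used in the text. The supremum is attained at the exponentially tilted measure $dQ/dP \propto e^{k\xi}$, which the code verifies. All probabilities and payoffs are illustrative.

```python
import math


def certainty_equivalent(xi, p, k):
    """(1/k) * log E_P[exp(k * xi)] on a finite sample space (assumed convention)."""
    return math.log(sum(pi * math.exp(k * x) for pi, x in zip(p, xi))) / k


def kl_divergence(q, p):
    """Relative entropy D_KL(q || p) = sum_i q_i log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def penalised_value(xi, q, p, k):
    """E_Q[xi] - (1/k) D_KL(Q || P): the quantity maximised in the representation."""
    return sum(qi * x for qi, x in zip(q, xi)) - kl_divergence(q, p) / k


def tilted_measure(xi, p, k):
    """Exponentially tilted measure dQ/dP proportional to exp(k * xi)."""
    weights = [pi * math.exp(k * x) for pi, x in zip(p, xi)]
    total = sum(weights)
    return [w / total for w in weights]


p = [0.2, 0.5, 0.3]    # reference measure P
xi = [-1.0, 0.5, 2.0]  # outcome on each atom
k = 1.5                # risk-aversion parameter

ce = certainty_equivalent(xi, p, k)
q_star = tilted_measure(xi, p, k)
value_at_optimum = penalised_value(xi, q_star, p, k)
value_elsewhere = penalised_value(xi, [0.9, 0.05, 0.05], p, k)

print(ce, value_at_optimum, value_elsewhere)
# Relative entropy is not symmetric in its arguments:
print(kl_divergence([0.9, 0.05, 0.05], p), kl_divergence(p, [0.9, 0.05, 0.05]))
```

The final line also shows numerically that relative entropy depends on the order of its arguments, which matters when comparing it with a likelihood-based penalty.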
Remark 7. It is useful to consider the relative entropy of the law of $X$ separately from the other observations $\{X_n\}_{n\in\mathbb{N}}$. We therefore define Assuming $\hat Q \approx P$, where $P$ is the 'real world' probability measure, in light of the law of large numbers we hope for a simple connection, at least asymptotically, between the scaled deviance and the penalty in the exponential utility, that is, $D_{KL}(Q\|P)$. In general, this is made more difficult by the fact we have an infinite family of measures $\mathcal{Q}$, and by the lack of symmetry in the relative entropy, as $D_{KL}(Q\|P) \neq D_{KL}(P\|Q)$ in general.
We shall pursue this connection in the coming section.
Remark 8. Consider the case of an uncertain symmetric location parameter, that is when Q is parameterized by an unknown quantity θ, and under each Q θ , {X n } N n=1 are iid with density f (|x − θ|). Then one can show that Therefore, in these cases, one could try and equate the nonlinear expectation and the exponential certainty equivalent under the MLE measureQ. However, this requires that the measure Q ∈ M 1 which maximizes is within the class Q. That is, the optimizer differs fromQ only through a change of the parameter θ. In general, this is only the case when the models considered are Gaussian with uncertain mean. Then, with P =Q, we have (cf. Var(X).
Remark 9. One extension of our approach is to change the penalty function to include an entropy term taken in the 'other' direction, that is, to use the penalty for some $\beta > 0$. This is particularly of interest where $\mathcal{Q}$ is a parametric family (and we think our model should be similar, if not precisely the same, as a parametric model) or where we wish to include risk aversion (as measured using exponential utility). This is well defined for all measures $Q \in \mathcal{M}_1$, and gives the expectation:

Large-sample theory
In this section, we shall seek to study the large-sample theory of the nonlinear expectation E k,γ Q|x . In practice, this is particularly useful to give approximations and qualitative descriptions of its behaviour.
Throughout this section, we shall assume that we have observations $\{X_n\}_{n\in\mathbb{N}}$, and a family of measures $\mathcal{Q}$ under which $X, \{X_n\}_{n\in\mathbb{N}}$ are iid random variables with corresponding densities $f(x; Q)\,dx$. We write $x_N = (X_1, ..., X_N)$. We shall be interested in determining the behaviour, for large $N$, of $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}(\phi(X))$, where $\phi$ is a bounded function. For simplicity, we shall assume that the MLE exists (however our results can be extended to remove this assumption, with an increase in notational complexity). We write $\hat Q_N$ for the $\mathcal{Q}$-MLE based on observations $x_N$.
Given the lack of positive homogeneity, it is interesting to consider the behaviour of $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}(c_N\xi)$, where $c_N$ has prescribed growth in $N$. The following lemma allows us to instead vary the uncertainty parameter $k$.

Lemma 1. For any $k > 0$, any $\gamma < \infty$, any random variable $\xi$, and any constant $c > 0$,
$$c^{-1}\,\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x}(c\,\xi) = \mathcal{E}^{k c^{1/\gamma},\gamma}_{\mathcal{Q}|x}(\xi).$$

To enable a simple description of our asymptotic results, we recall the following definition: we write $f(N) = O_P(g(N))$ whenever $f(N)/g(N)$ is stochastically bounded (that is, $P(|f(N)/g(N)| > M) \to 0$ as $M \to \infty$ for each $N$) and $f(N) = o_P(g(N))$ whenever $P\text{-}\lim_{N\to\infty}|f(N)/g(N)| = 0$. Note that this depends on the choice of measure $P$. The subscript $P$ is omitted in the classical case (i.e. when the convergence is not in probability).

Nonparametric results
We now give some results when we do not assume Q comes from a 'nice' parametric family. Given we will take a supremum over a family of densities, we need a uniform version of the law of large numbers. For this reason, we make the following definition. It is easy to show, given consistency of the MLE and some integrability, that a finite family Q is always a GCD class.
Lemma 2. Suppose $\mathcal{Q}$ is a family of measures such that $\{X_n\}_{n\in\mathbb{N}}$ are iid with respective densities $\{f(\cdot; Q)\}_{Q\in\mathcal{Q}}$ which satisfy i) there is a compact set $K$, such that for every $P \in \mathcal{Q}$, $P(X \in K) = 1$, ii) there is $\epsilon > 0$ such that $\epsilon < \inf_{Q\in\mathcal{Q}}\min_{x\in K} f(x, Q)$, iii) there is $C < \infty$ and $\rho > 1/2$ such that, for all $P, Q \in \mathcal{Q}$, the likelihood ratios $f(\cdot, Q)/f(\cdot, P)$ take values in $[C^{-1}, C]$ and are uniformly ρ-Hölder continuous with norm $C$, that is, writing Then $\mathcal{Q}$ is a GCD class of measures.
Proof. See Appendix.
We can now prove the following versions of the law of large numbers and the central limit theorem. We begin with the case γ = 1.
Q|xN is a consistent estimator, that is (ii) We have the asymptotic behaviour (as N → ∞, for each P ∈ Q) with equality whenever P is such that, for all N sufficiently large, the measureP with density is chosen to ensure this is a probability density) is also in Q. (This can be thought of as related to the central limit theorem, cf. Example 2.)

Remark 11. Given the error of the expectation based on the $\mathcal{Q}$-MLE is asymptotically of the order of $1/\sqrt{N}$, the requirement implied by (ii) that $k_N$ grows faster than $\sqrt{N}$ is unsurprising, as this is what is needed to ensure that the risk aversion term $\frac{k_N}{2N}\mathrm{Var}(\xi)$ asymptotically dominates the statistical error of the estimation of $E_P[\xi]$.
Proof. We begin by proving (ii). As Q is a GCD class, we know that, with error bounded independently of Q. Hence, uniformly in Q, Calculating E kN ,1 Q|xN (ξ), we have We shall now focus on solving the problem under the assumption that the penalty is given by (N/k N )D KL (P ||Q).
For fixed $N$, we can try to solve this simplified problem directly. Assuming the optimum will be attained with a measure denoted $Q_g$, this corresponds to finding the density $g = f(\cdot, Q_g)$. Calculus of variations yields
$$g(x) = f(x; P)\,\Big(\lambda - \frac{k_N}{N}\phi(x)\Big)^{-1},$$
where $\lambda$ is chosen to ensure $g$ is a density, that is, $E_P[(\lambda - \frac{k_N}{N}\xi)^{-1}] = 1$. This requires $\lambda > \frac{k_N}{N}\sup_x\phi(x)$ (this is the reason we have assumed $\xi = \phi(X)$ is bounded). As the map $\lambda \mapsto (\lambda - \frac{k_N}{N}\phi(x))^{-1}$ is monotone, we also know that the corresponding value of $\lambda$ is unique and
$$1 + \frac{k_N}{N}\inf_x\phi(x) \le \lambda \le 1 + \frac{k_N}{N}\sup_x\phi(x).$$
This avoids inconsistency with the requirement $\lambda > \frac{k_N}{N}\sup_x\phi(x)$ whenever $N$ is large enough that $\frac{N}{k_N} > 2\sup_x|\phi(x)|$. For every fixed large $N$, we have a compact set of values for $\lambda$. Therefore, we can assume $(\lambda - \frac{k_N}{N}\xi)^{-1}$ is uniformly approximated by its Taylor series in $\lambda$ around $\lambda = 1 + \frac{k_N}{N}E_P[\xi]$. Furthermore, we immediately see the first approximation Expanding the Taylor series of $(\lambda - \frac{k_N}{N}\xi)^{-1}$, we have Substituting our first approximation of $\lambda$ on the right hand side of (2), we have Substituting this second approximation back into (2), we observe that the error can be taken to be $O((k_N/N)^3)$, rather than $O((k_N/N)^2)$. We can now approximate our convex expectation. We know that and similarly Hence we can calculate the desired approximation with equality whenever $Q_g \in \mathcal{Q}$, as stated in (ii).

We now seek to reduce to the assumptions of (i). As increasing $k_N$ will only increase the (nonnegative) differences and we know that $E_{\hat Q_N}[\xi]$ is consistent, we can assume that $N^{1/2}/k_N \to 0$ without loss of generality. Under this assumption, the right hand side of (3) converges to $E_P[\xi]$, and hence we verify that $\mathcal{E}^{k_N,1}_{\mathcal{Q}|x_N}(\xi) \to_P E_P[\xi]$ as desired.
Remark 12. Assuming that $\mathcal{Q}$ is sufficiently rich and $k_N/\sqrt{N} \to \infty$ (so the approximation of (ii) is useful) this result implies that, if we have simple estimators of the mean and variance of $\xi = \phi(X)$, for example the classical sample mean $\hat\mu_N$ and sample variance $\hat\sigma_N^2$ of $\{\phi(X_n)\}_{n=1}^N$ (which have error $O_P(N^{-1/2})$), then we have the asymptotic approximation
$$\mathcal{E}^{k_N,1}_{\mathcal{Q}|x_N}(\xi) \approx \hat\mu_N + \frac{k_N}{2N}\hat\sigma_N^2.$$
It is well known that a mean-variance criterion is not a convex expectation in general; however, it retains convexity for Gaussian distributions (see, for example, [8]). In this setting, we can see that the central limit theorem renders our uncertainty approximately Gaussian, so no contradiction arises.

We will now consider the case $\gamma = \infty$. It is easy to check that the interval $\big[-\mathcal{E}^{k,\infty}_{\mathcal{Q}|x_N}(-\xi),\ \mathcal{E}^{k,\infty}_{\mathcal{Q}|x_N}(\xi)\big]$ is a likelihood interval for $E[\xi]$, that is, it corresponds to the range of expectations under the measures in $\mathcal{Q}$ with likelihood at least $e^{-k}$ relative to the maximum. Such intervals are commonly used as generalizations of confidence intervals (see for example Hudson [12], drawing on the well known results of Neyman and Pearson [18]). In this context, we shall see that a stronger property holds, as the confidence region is uniform in $\phi$. (See also Theorem 6.)

Theorem 3. Suppose $\mathcal{Q}$ is a GCD family and $X, \{X_n\}_{n\in\mathbb{N}}$ are iid under each $Q \in \mathcal{Q}$. Then if $k_N = o(N)$, the nonlinear expectation with $\gamma = \infty$ is a uniformly consistent estimator, that is,

Proof. Observe that As $\mathcal{Q}$ is a GCD class, we know that for any $P \in \mathcal{Q}$, with the terminal error uniform in $Q$. From Pinsker's inequality, looking only at the marginal law of $X$, we know that the total variation norm satisfies Therefore, It follows that the nonlinear expectation is a uniformly consistent estimator.
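The mean-variance approximation discussed in Remarks 11 and 12 can be checked for the Bernoulli model of Example 1, where every law of $X$ on $\{0,1\}$ belongs to the family. The sketch below is a deterministic check for fixed data, not a simulation of the limit theorem: it assumes $0 < S_N < N$, takes $\gamma = 1$, and maximises the concave objective by ternary search. The values $N = 3000$, $S_N = 2000$ and $k = 5$ are illustrative.

```python
import math


def dr_bernoulli_gamma1(phi1, phi0, heads, tosses, k, iters=200):
    """sup_p { E_p[phi(X)] - alpha(p)/k } for Bernoulli data (requires 0 < heads < tosses)."""
    p_hat = heads / tosses
    max_loglik = heads * math.log(p_hat) + (tosses - heads) * math.log(1.0 - p_hat)

    def objective(p):  # concave in p: linear term minus a convex penalty
        alpha = max_loglik - heads * math.log(p) - (tosses - heads) * math.log(1.0 - p)
        return p * phi1 + (1.0 - p) * phi0 - alpha / k

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))


def mean_variance_approx(phi1, phi0, heads, tosses, k):
    """Plug-in mean plus (k / 2N) times the plug-in variance of phi(X)."""
    p_hat = heads / tosses
    mean = p_hat * phi1 + (1.0 - p_hat) * phi0
    var = p_hat * (1.0 - p_hat) * (phi1 - phi0) ** 2
    return mean + k / (2.0 * tosses) * var


heads, tosses, k = 2000, 3000, 5.0
plug_in = heads / tosses  # expected value of X under the MLE
exact = dr_bernoulli_gamma1(1.0, 0.0, heads, tosses, k)
approx = mean_variance_approx(1.0, 0.0, heads, tosses, k)

print(plug_in, exact, approx)
```

For these inputs the uncertainty premium above the plug-in value is of order $10^{-4}$, and the approximation error is smaller again by roughly a factor of $k/N$.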
By a simple comparison, we also obtain consistency for all other γ ∈ [1, ∞].

Corollary 1.
If $\mathcal{Q}$ is a GCD class, $k_N = o(N)$ and $\gamma \in [1, \infty]$, the nonlinear expectation $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}(\phi(\xi))$ is a consistent estimator of $E_P[\phi(\xi)]$. Proof. We know that the two extreme cases $\gamma = 1$ and $\gamma = \infty$ are both consistent, as is the MLE $E_{\hat Q_N}[\phi(\xi)]$ (this follows, for example, from the fact $E_{\hat Q_N}[\xi] \in I_N$, where $I_N$ is as in Theorem 3). Furthermore, for any $\gamma$, as $|x|^\gamma \ge \min\{|x|, |x|^\infty\}$, it is easy to check from the definition that . These types of perturbation are rarely considered in other settings, so such results appear to be of purely technical interest.

Parametric results
We now suppose that Q is a class of measures coming from a 'nice' parametric family. In this setting, we can obtain more precise asymptotics by considering the divergence as a function of the parameter, rather than as a function of the abstract space of probability measures. For simplicity, we shall consider an exponential family of measures, which is general enough for many applications, but gives sufficient structure to obtain tight results. We shall also assume throughout that, for every Q ∈ Q, X, {X n } n∈N are iid with density f (·; Q).

Definition 6.
A distribution is said to come from the exponential family (in natural parameters) if the density can be written $f(x; Q) = h(x)\exp\big(\langle\theta, T(x)\rangle - A(\theta)\big)$. Here $\theta$ is the parameter of $Q$, and is in an open subset $\Theta$ of $\mathbb{R}^d$ for some $d$, $T$ is a (vector of) sufficient statistics, $h$ is a normalization function and $A$ is the log-partition function. We assume that $\mathcal{Q}$ corresponds to all those measures with parameters in $\Theta$, and write $\theta_Q$ for the parameters of $Q$, $Q_\theta$ for the measure associated with $\theta$, $E_\theta$ for $E_{Q_\theta}$, and so on.
The key result we shall use is that $A$ is convex and smooth (in particular has a continuous third derivative). We shall in fact use the following, slightly stronger condition. (ii) The $\mathcal{Q}$-MLE exists and is consistent, with probability tending to 1 as $N \to \infty$ (that is, for every $Q \in \mathcal{Q}$, a maximizer $\hat Q_N$ exists with $Q$-probability approaching 1 and $\hat\theta_N = \theta_{\hat Q_N} \to_Q \theta_Q$). These assumptions can be justified using weak assumptions on the family considered, see for example Berk [5, Theorem 3.1], Silvey [20] or the more general discussion of Lehmann [15] (see also [16]). For more advanced discussion of the theory of likelihood in exponential families, see Barndorff-Nielsen [2].
Observe that, whenever the $\mathcal{Q}$-MLE $\hat\theta$ exists, the divergence is given by using the natural abuse of notation $\alpha_{\mathcal{Q}|x_N}(\theta) := \alpha_{\mathcal{Q}|x_N}(Q_\theta)$. Given that a first order condition will hold at the MLE, we can simplify to remove dependence on the observations (except through the MLE) The following result will allow us to get a tight asymptotic approximation of the penalty, as it will allow us to focus our attention on a small ball around the MLE. Lemma 3. Let $\rho > 0$ be a constant and let $\hat\theta_N$ denote the MLE of $\theta$. Then, for each $P \in \mathcal{Q}$, there exist constants $c_1, c_2$ independent of $N$ such that, writing In other words, with high probability, we know $\alpha_{\mathcal{Q}|x} > \rho$ whenever $\|\theta - \hat\theta\| > R = O(N^{-1/2})$.
Proof. See Appendix.
Remark 14. The previous result will mainly be used to show that, when we consider bounded random variables, for any $P \in \mathcal{Q}$ we can approximate the divergence by This is itself an interesting and useful result, particularly when we use the DR-expectation approach as a first step in a larger problem. For example, when we use a DR-expectation to capture the uncertainty in calibration of a model, which we then wish to use in a variety of settings, this result shows that it is enough (to first order) to penalize using the observed information matrix, rather than repeatedly calculating the likelihood function. This is the approximation we made in (1). As the approximation is a quadratic, the optimization needed to calculate $\mathcal{E}^{k,\gamma}_{\mathcal{Q}|x_N}$ is straightforward (particularly for linear or quadratic functionals of the parameters), which can have significant numerical advantages (see for example Ben-Tal and Nemirovski [3]).
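To illustrate Remark 14, the sketch below compares three versions of the penalty for a one-parameter exponential family. It assumes the divergence is the log-likelihood at the MLE minus the log-likelihood at $\theta$, and uses the Poisson family in natural parameters ($T(x) = x$, $A(\theta) = e^\theta$) with a small made-up data set; the function names are ours. Under the first order condition $A'(\hat\theta) = \bar x$, the divergence reduces to $N[A(\theta) - A(\hat\theta) - (\theta - \hat\theta)A'(\hat\theta)]$, and near the MLE this is close to the quadratic $\tfrac{N}{2}A''(\hat\theta)(\theta - \hat\theta)^2$ built from the observed information.

```python
import math

counts = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]   # illustrative data, not from the paper
N = len(counts)
theta_hat = math.log(sum(counts) / N)     # MLE of the natural parameter

def log_lik(theta):
    """Poisson log-likelihood in the natural parameter theta = log(rate)."""
    return sum(x * theta - math.exp(theta) - math.lgamma(x + 1) for x in counts)

def alpha_direct(theta):
    """Divergence as a difference of log-likelihoods (assumed definition)."""
    return log_lik(theta_hat) - log_lik(theta)

def alpha_bregman(theta):
    """Same quantity after using the first order condition A'(theta_hat) = mean."""
    A = math.exp   # log-partition function; here A = A' = A''
    return N * (A(theta) - A(theta_hat) - (theta - theta_hat) * A(theta_hat))

def alpha_quadratic(theta):
    """Quadratic approximation using the observed information N * A''(theta_hat)."""
    return 0.5 * N * math.exp(theta_hat) * (theta - theta_hat) ** 2

for d in (0.01, 0.1, 0.5):
    t = theta_hat + d
    print(d, alpha_direct(t), alpha_bregman(t), alpha_quadratic(t))
```

The first two columns should agree up to rounding for every offset, while the quadratic column is only accurate for small offsets from the MLE, which is why the localisation provided by Lemma 3 matters.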
We now use this approximation to give asymptotic estimates for the DR-expectation. This can be seen as an analogue to the central limit theorem (cf. Example 2). Note that, unlike in the nonparametric case, we do not need to scale the risk aversion parameter k as N → ∞. It is convenient to make the following definition.
Definition 7. Let φ be a bounded function such that the mapφ : . Remark 15. Observe that, by classical arguments, if φ can be written as a linear function of the sufficient statistics then V (φ,θ) = Varθ(φ(X)).
Ifθ N has the variance appearing in the central limit theorem, that is 9 , Var(θ N ) ≈ N −1 I −1 θP , then (given an appropriate array of integrability and continuity assumptions) we have the approximate variance of the MLE-expectation Theorem 4. Let φ be a bounded function such that the mapφ : is twice differentiable. Then for all P ∈ Q, Proof. Fix P ∈ Q. For simplicity, we writeθ forθ N . To begin, observe that and as φ is bounded, we need only consider those measures From Lemma 3, we know that We knowθ → P θ P andφ is twice differentiable at θ P , so for θ ∈ Θ N , We also know that α Q|xN (θ) is smooth, convex and minimized atθ, so for θ ∈ Θ N , Substituting these, we have the approximate DR-expectation The term in braces has optimizer where we know that, asθ → θ P and I θ P is positive definite, with P -probability approaching 1 the matrix Iθ + O P (N −1/2 ) is nonsingular. Substituting, we have the desired approximation We now consider the case γ = ∞.
Theorem 5. Let φ be a bounded function such that the mapφ : is twice differentiable. Then for all $P \in \mathcal{Q}$, Proof. The proof follows in much the same way as the case $\gamma = 1$, and we use the same notation. We know that We see that and from Lemma 3, with probability approaching 1, it is enough to consider $\Theta_N = \{\theta : \|\theta - \hat\theta\| < O_P(N^{-1/2})\}$. Standard optimization then yields The result follows.
Remark 16. The cases γ ∈ (1, ∞) can also be treated using the approximation implied by Lemma 3 (in the way suggested by Remark 14), and are left as a tedious exercise for the reader.
The following result can be shown to hold under the assumption that Q is of the exponential family we consider here, or more generally. It is of particular interest as it is naturally a 'uniform' result over the space of outcomes (which do not need to be bounded or independent of the observations). This is of importance in decision making, as we will often wish to choose between a range of outcomes ξ, and wish to be confident that our comparison method is valid for all choices simultaneously. Theorem 6. Suppose the MLE is consistent and Wilks' theorem holds under every P ∈ Q (that is, 1 2 α Q|xN (P ) is asymptotically χ 2 d distributed under P , where d is a known parameter). Then, for a random variable ξ, is a likelihood interval for E[ξ], with the uniform asymptotic property . Proof. That I N (ξ) corresponds to a likelihood interval is trivial, as γ = ∞ implies we are considering expectations under those measures where the log likelihood (relative to the MLE) is at least −k. Wilks' theorem then determines the asymptotic behaviour of the relative log likelihood, in particular, we know where F χ 2 d is the cdf of the χ 2 d -distribution. Clearly α Q|xN (P ) ≤ k implies E P [ξ|x N ] ∈ I N (ξ) for all ξ. We then obtain the desired result, Remark 17. Conditions for Wilks' theorem are closely related to those for the central limit theorem, and are typically based on integrability and continuity assumptions on the densities. The result is that, where d is the dimension of the parameter space and P -Dist refers to convergence, in distribution, under P . See Lehmann [15, Section 7.7] for details.
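The two-coin comparison from Example 1 gives a concrete instance of such a likelihood interval. The sketch below assumes the Bernoulli family, takes $\xi = X$ so that $E_Q[\xi] = p$, and computes the set $\{p : \alpha(p) \le k\}$ by bisection, where $\alpha$ is the log-likelihood at the MLE minus the log-likelihood at $p$. The value $k = 2$ and the helper names are illustrative choices of ours.

```python
import math

def alpha(p, heads, n):
    """Negative log-likelihood of Bernoulli(p) relative to the MLE heads / n.

    Assumes 0 < heads < n so that the MLE lies strictly inside (0, 1).
    """
    p_hat = heads / n
    return n * (p_hat * math.log(p_hat / p)
                + (1 - p_hat) * math.log((1 - p_hat) / (1 - p)))

def _boundary(inside, outside, heads, n, k, iters=200):
    """Bisection for the point between inside and outside where alpha crosses k."""
    for _ in range(iters):
        mid = 0.5 * (inside + outside)
        if alpha(mid, heads, n) <= k:
            inside = mid
        else:
            outside = mid
    return inside

def likelihood_interval(heads, n, k):
    """Range of p = E_Q[X] over Bernoulli measures with alpha(p) <= k."""
    p_hat = heads / n
    eps = 1e-12   # keeps the search away from the endpoints 0 and 1
    return (_boundary(p_hat, eps, heads, n, k),
            _boundary(p_hat, 1 - eps, heads, n, k))

small = likelihood_interval(2, 3, k=2.0)         # 3 tosses, 2 heads
large = likelihood_interval(2000, 3000, k=2.0)   # 3000 tosses, 2000 heads
print(small, large)
```

Both intervals contain the common point estimate 2/3, but the interval for the coin tossed 3000 times is far narrower, which is the distinction that the point-estimate criterion of Example 1 fails to register.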

Robustness and models
In this section, we shall consider the behaviour of the divergence-robust expectation for unbounded random variables, and its relationship with 'robust' statistical estimates. We shall regard the sample size N as fixed. The following theorem complements our earlier asymptotic results (which were generally for bounded outcomes), to demonstrate that without any parametric structure most unbounded random variables do not have finite DR-expectations.
Proof. For any $\epsilon > 0$, let $Q_\epsilon \in \mathcal{Q}$ be a measure such that $E_{Q_\epsilon}[\xi] > \epsilon^{-2}$. For any measure $P \in \mathcal{Q}$, we define the mixture distribution $P(\epsilon) = (1 - \epsilon)P + \epsilon Q_\epsilon$. It follows that $P(\epsilon) \in \mathcal{Q}$ and, provided $E_P[\xi] > -\infty$, , so (assuming for notational simplicity that the $\mathcal{Q}$-MLE $\hat Q$ exists) It follows that, as $\epsilon \to 0$, Remark 18. The above assumes $\mathcal{Q}$ is closed under finite mixtures of measures. If we assume that $\mathcal{Q}$ is such that $\{X_n\}_{n\in\mathbb{N}}$ are iid, then this is not the case. However, for $N < \infty$, an almost identical proof holds whenever $\mathcal{Q}$ is associated with a family of densities $f(\cdot; Q)$, and this family of densities is closed under taking finite mixtures. (The only significant change is that we obtain the inequality $L(P(\epsilon)|x_N) > (1 - \epsilon)^N L(P|x)$, where $P(\epsilon)$ corresponds to the measure with a mixture density.) This result highlights the importance of parametric structure for estimation of unbounded random variables, in terms of restricting the class of probability measures that can be considered. This restriction can be thought of in terms of restricting the probabilities of very large (positive or negative) values of $\xi$, and hence ensuring enough integrability that finite expectations arise. Without these restrictions, unlikely events (which by their very nature will generally not be seen in the data, so are not penalized) result in unbounded expectations. Remark 19. While we have not considered the numerical aspects of this problem in great detail, it is often the case that parametric models are also needed to reduce our problem to a finite-dimensional setting, rather than needing to solve optimization problems on the infinite-dimensional space of measures.
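The mechanism in this proof can be made concrete with a few numbers. The sketch below is a toy calculation under stated assumptions: it takes the $\gamma = 1$ objective to be $E_Q[\xi] - \alpha(Q)/k$, uses the bound $E_{P(\epsilon)}[\xi] > (1 - \epsilon)E_P[\xi] + \epsilon^{-1}$ implied by $E_{Q_\epsilon}[\xi] > \epsilon^{-2}$, and uses the likelihood inequality of Remark 18 in the form $\alpha(P(\epsilon)) \le \alpha(P) - N\log(1 - \epsilon)$. The inputs `mean_p`, `alpha_p`, `n` and `k` are arbitrary illustrative values.

```python
import math

def objective_lower_bound(eps, mean_p, alpha_p, n, k):
    """Lower bound on E[xi] - alpha / k under the mixture (1 - eps) P + eps Q_eps."""
    mean_bound = (1 - eps) * mean_p + 1.0 / eps      # expectation grows like 1 / eps
    alpha_bound = alpha_p - n * math.log(1 - eps)    # penalty grows by at most this much
    return mean_bound - alpha_bound / k

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(eps, objective_lower_bound(eps, mean_p=0.0, alpha_p=1.0, n=100, k=1.0))
```

As $\epsilon$ shrinks, the extra penalty $-N\log(1 - \epsilon)$ vanishes while the expectation bound diverges, so the supremum over such a mixture-closed family cannot be finite.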
Given the importance of parametric families, it is then of interest to consider how the 'statistical robustness' of the parametric estimation problem interacts with the 'robustness' of the expectations considered. Given our use of likelihood theory, there is a natural connection to M -estimates, which correspond to estimates obtained by maximizing some function. Before giving general results, we consider a simple setting.
Example 4. Consider $X, \{X_n\}_{n=1}^N$ iid observations from a Laplace (or double exponential) distribution with known scale 1 and unknown mean $\mu$. That is, $X_n$ has density $f(x; \mu) = \tfrac12\exp(-|x - \mu|)$. Let $\mathcal{Q}$ denote the corresponding family of measures and write $Q_\mu$ for the measure with mean $\mu$. For simplicity, assume $N$ is odd, so the MLE is uniquely given by $Q_m$, where $m$ is the sample median. This is known to be 'statistically robust', see Huber and Ronchetti [11], as it does not depend on extreme observations, and is therefore unaffected by outliers. The $\mathcal{Q}|x_N$-divergence is then given by For $X$ an iid observation from the same distribution as $X_n$ and $\beta > 0$ (the case $\beta < 0$ is symmetric) we have A first observation which can be drawn is that $\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\beta X)$ is generally infinite, unless $\beta < N/k$. To see this, observe that if $\beta > N/k$, then the function to maximize is linear and increasing for $\mu > \max_{n \le N} X_n$.
Assuming that $\beta < N/k$, the function to maximize is piecewise linear, concave and asymptotically decreasing (for both positive and negative $\mu$), so a finite solution exists. Except at points where $\mu = X_n$ for some $n$, we can differentiate to obtain the equation where $\bowtie$ indicates that either the statement is an equality, or that the left and right limits of the right hand side differ in sign (if $X_n = \mu$ for some $n$). As we are looking for the maximal solution, we can generally state that the solution will be We can also write where $F(y) = \frac{1}{N}\sum_{n=1}^N I_{\{X_n \le y\}}$ is the empirical cdf of our observations. Assuming $N$ is moderately large, this is well approximated by a continuous increasing function (so all quantiles are uniquely defined), and we will obtain It follows that the optimizing choice of $\mu^*$ is given by the empirical $\tfrac12 + \tfrac{\beta k}{2N}$ quantile.
Introducing this back into our equation for $\mathcal{E}^{k,1}_{\mathcal{Q}|x_N}(\beta X)$, we obtain We see that the divergence-robust estimate depends on a weighted combination of the median $\beta m$, an upper quantile $\beta\mu^*$, and the mean value taken between these two bounds. Therefore, this quantity can still be robustly estimated, as it still does not depend on the tail behaviour beyond the $\tfrac12 + \tfrac{\beta k}{2N}$ quantile. More formally, the breakdown point of this estimator (the proportion of the data which can be made arbitrarily large without affecting the estimate) is $\tfrac12(1 - \beta\tfrac{k}{N})$. It is easy to see (as $E_{Q_\mu}[\beta X^2] = \beta\mu^2 + 2\beta$) that $\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\beta X^2) = \infty$ for all $N, k, \beta > 0$. For negative $\beta$, a finite answer can be obtained, but even its approximate closed-form representation is inelegant.
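The quantile characterisation can be checked by brute force. The sketch below assumes the $\gamma = 1$ DR-expectation takes the form $\sup_\mu\{\beta\mu - \alpha(\mu)/k\}$ with $\alpha(\mu) = \sum_n(|X_n - \mu| - |X_n - m|)$, which is our reading of the Laplace divergence and is consistent with the threshold $\beta < N/k$ above. Because the objective is concave and piecewise linear with kinks at the data, it is enough to search over the observations. The function names, the simulated sample and the parameter values are illustrative.

```python
import math
import random

def dr_expectation_laplace(xs, beta, k):
    """Return (value, mu_star) for sup over mu of beta * mu - alpha(mu) / k."""
    n = len(xs)
    if n % 2 == 0:
        raise ValueError("N must be odd so that the median is unique")
    if not 0 < beta < n / k:
        raise ValueError("need 0 < beta < N / k for a finite value")
    srt = sorted(xs)
    m = srt[n // 2]   # sample median, the MLE of the location

    def objective(mu):
        # Penalty relative to the MLE, summed term by term to limit cancellation.
        alpha = sum(abs(x - mu) - abs(x - m) for x in xs)
        return beta * mu - alpha / k

    mu_star = max(srt, key=objective)   # the maximum is attained at a data point
    return objective(mu_star), mu_star

def quantile_order_statistic(xs, beta, k):
    """Smallest observation whose empirical cdf is at least 1/2 + beta * k / (2 N)."""
    n = len(xs)
    rank = math.ceil((n + beta * k) / 2)   # 1-based rank
    return sorted(xs)[rank - 1]

random.seed(1)
# The difference of two independent Exp(1) variables is Laplace with scale 1.
xs = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(101)]
value, mu_star = dr_expectation_laplace(xs, beta=1.0, k=10.0)
print(value, mu_star, quantile_order_statistic(xs, 1.0, 10.0))
```

Replacing a handful of the largest observations by arbitrarily large values leaves both the maximiser and the value unchanged (up to rounding), as long as fewer than the breakdown proportion of points are altered.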
Comparing this example with the normal example (Example 2), we can see that, when considering a likelihood model, there is a delicate relationship between the 'statistical' robustness in the classical estimation problem and the 'parameter uncertainty' robustness embedded in E k,1 Q|x . The following theorem makes this behaviour more precise.
Theorem 8. Consider a sequence of iid random variables $X, \{X_n\}_{n=1}^N$, and a family of measures $\mathcal{Q}$ describing an uncertain 'location parameter'. In other words, under $Q \in \mathcal{Q}$, suppose $X, \{X_n\}_{n=1}^N$ are iid observations from a distribution with density $\exp(\Psi(x - \mu_Q))$, and so $\mathcal{Q}$ is parameterized by $\mu_Q = E_Q[X] \in \mathbb{R}$. Suppose $\Psi$ has monotone increasing derivative $\psi$ (which may be discontinuous) and −∞ ≤ lim Note that the MLE parameter (assuming it exists, for simplicity) is given by the solution $\mu$ to $\sum_n \psi(X_n - \mu) = 0$.
The following are then equivalent.
(i) The MLE parameter has a breakdown point above zero (that is, some fraction of the observations can be made arbitrarily large or small without making the MLE arbitrarily large or small),
(ii) The MLE parameter is weakly continuous with respect to the empirical cdf of observations, for any empirical cdf where the MLE parameter is uniquely defined,
(iii) $\psi$ is bounded,
(iv) For any fixed $k, N$, for all $\beta \in \mathbb{R}$ sufficiently large (in absolute value), $\mathcal{E}^{k,1}_{\mathcal{Q}|x_N}(\beta\xi) \notin \mathbb{R}$, where $\xi$ is an iid copy of the observations.
Proof. The equivalence of (i)-(iii) is given by Chapter 3 of Huber and Ronchetti [11], in particular Theorem 3.6. We seek to show (iii) and (iv) are equivalent. First, if (iii) holds, then $\Psi$ is of linear growth. Let $\beta > N\sup_x|\psi(x)|$. We can then calculate As $\beta$ is larger than the maximal derivative of $\sum_n \Psi(X_n - \mu_Q)$, we can see that the term in brackets is unbounded, so (iv) holds. A similar result holds if $\beta < -N\sup_x|\psi(x)|$.
To show (iv) implies (iii), we first observe that (iv) implies that for all β sufficiently large, In other words, Ψ is bounded above by a linear function. As ψ = Ψ ′ is monotone increasing, this implies that ψ is bounded above. A similar argument shows that ψ is bounded below.
Remark 20. This behaviour may seem pathological, but it has a natural interpretation. Suppose one has an estimation technique which does not depend on some property of the data, for example does not depend on some extreme quantile. Then, given the results of fitting this method, one should not be able to make strong statements about the behaviour of future observations at this quantile.
In some sense, this is what is being captured by our approach. In the Laplace distributed case, our MLE does not depend on the behaviour of the data in the tails of the distribution (particularly not in a 'linear' way, unlike the sample mean), so it is unsurprising that we at times cannot say much about the mean of linear functions of future observations. At the same time, this interpretation of the non-existence of moments is imperfect, as we still do have finite moments for sufficiently small multiples of X.
Conversely, we have the following. If ψ(−∞) < ±β/N < ψ(∞), then the nonlinear expectation E k,1 Q|xN (βX) is finite, and has breakdown fraction at least That is, for m/N < δ, at least m observations can be made arbitrarily large or small while E k,1 Q|xN (βX) remains bounded.
Proof. Without loss of generality, suppose $\beta > 0$ (we shall prove the result for both $\beta$ and $-\beta$ simultaneously). Consider the functions From a first order condition, the value of $\pm\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\pm\beta X)$ is given by where $\mu_\pm(x)$ is the solution to $\lambda_\pm(\mu, x) \bowtie 0$ (where again $\bowtie$ indicates either equality or a change of sign) and $\mu_{MLE}(x)$ is the MLE based on $x$. Now observe that $\lambda_\pm$ is monotone increasing with respect to $\mu$ and, as $\psi(-\infty) < \beta/N < \psi(\infty)$, we know $\lim_{\mu\to\infty}\lambda_\pm(\mu, x) > 0$ and $\lim_{\mu\to-\infty}\lambda_\pm(\mu, x) < 0$. Therefore, there is exactly one (finite) solution to $\lambda_\pm(\mu, x) \bowtie 0$. It follows that $\mathcal{E}^{k,1}_{\mathcal{Q}|x}(\pm\beta X)$ exists and is real. We now need to determine the breakdown fraction. For $M$ a set of indices, let $x(M, y)$ denote the set of observations, with $X_i$ replaced by $y_i$ for $i \in M$. Suppose $|M| = m$ and $m/N < \delta$. We wish to show that there is a bound on $\mathcal{E}^{k,1}_{\mathcal{Q}|x_N}(\pm\beta X)$ which is uniform in $y$. From the definition and nonnegativity of the penalty function, it is easy to see that it follows that it is enough for us to prove that $\mu_\pm(x(M, y))$ is uniformly bounded in $y$.
As ψ is monotone, we observe that By monotonicity, it is enough to show that the terms on the right and left have roots for finite values of µ (as these will not depend on y). Considering the lower bound first, we see that as µ → ∞, we obtain and as µ → −∞, Therefore, there is a finite root for the lower bound on λ ± (µ, x(M, y)). A similar argument applies to the upper bound. By monotonicity, we conclude that µ ± (x(M, y)) and hence E k,1 Q|xN (±βX) are uniformly bounded in y, as desired. Remark 21. Given the close relationship between entropy and extreme value theory, these results suggest a further relationship between the extremes of a class of models, the statistical robustness of estimators, and the existence of divergence-robust estimates. The development of this theory may be of future interest.
To conclude, we observe that the non-existence of finite values for E k,1 Q|x (X) can also manifest itself in surprising ways, as we can see from the following extension of Example 2. Example 5. Consider the case where X, {X i } N i=1 are iid N (µ, σ 2 ), where both µ and σ 2 are unknown. The divergence penalty is then (writingσ 2 = 1 If we attempt to calculate E k,1 Q|x (βX), we obtain This causes a problem, as the term on the right is unbounded above with respect to σ 2 . Looking more closely, this function typically has a local maximum for σ 2 ≈σ 2 , but for very large values of σ the β 2 k 2N σ 2 term will dominate. Therefore, there is no way that, even for large samples, a finite value of E k,1 Q|xN (βX) can be obtained.
One possible way to deal with this is to modify our approach slightly, either by including a prior distribution, or by adding an additional regularizing term to ensure the supremum chooses values close to the statistical parameters. For example, taking the penalty for some $\epsilon > 0$, results in a finite value for $\mathcal{E}^{k,1}_{\mathcal{Q}|x_N}(\beta X)$ whenever $\tfrac{\beta^2 k}{2N} < \tfrac{\epsilon}{k}$, in particular one obtains consistent estimates as $N \to \infty$.
There are innumerable other applications and examples of this approach, and extensions to other settings (for example where likelihoods are replaced by more general objects) may also be of interest. While our results have focussed on the (analytically simpler) setting of independent observations, the approach naturally extends to where models include correlated observations, as is common in time-series models.
It is enough, therefore, to prove a uniform convergence rate for functions in F ρ .
We can now appeal to Corollary 17.3.3 and the proof of Theorem 17.3.1 of Shorack and Wellner [19, p.633] (itself based on Strassen and Dudley [21]) to see that, writing we know that for any $\eta > 0$, there is $M$ sufficiently large (independent of $N$) that $P\big(\sup_{g \in F_\rho} Z_N(g) > M\big) \le \eta$.
(The usual purpose of this is as a step towards showing that Z N converges weakly to a Gaussian process, which is a form of Donsker's theorem.) By rearranging, it follows that In particular, we know thatQ N takes values in Q, so, The result then follows from using (4), (5) and (6) with the triangle inequality.
Proof of Lemma 3. Our proof depends on three facts: that $\alpha$ is locally a quadratic to second order (via Taylor's theorem), that the MLE is consistent (allowing us to bound the third derivative with high probability), and that $\alpha$ is convex (which controls its global behaviour). We write $\hat\theta$ for $\hat\theta_N$ for notational simplicity. As the MLE is consistent (and exists with high probability), as $N \to \infty$, for any radius $C > 0$, we know $P(\|\hat\theta - \theta_P\| < C/2) \to 1$.
We also know that, for some constant $k$ (which will in general depend on $P$ and on $C$ being sufficiently small, but is independent of $N$), we have the bound $\|\partial^3 A(\theta)\| \le k$ for all $\theta$ with $\|\theta - \theta_P\| < C$. Combining these, for all $\theta$ with $\|\theta - \hat\theta\| < C/2$, from Taylor's theorem As we know that $I_{\hat\theta}$ is not degenerate (uniformly in a neighbourhood of $\theta_P$), we can also assume that (making $k$ sufficiently large) Therefore, taking $C \le k^{-2}$, on the set $\|\theta - \hat\theta\| \le C/2$ we have Note that $k$ and $C$ do not depend on $N$, so (7) remains valid. We now need to extend the bound of (8) to all $\theta$. We know that $\alpha_{\mathcal{Q}|x}$ is convex and $\alpha_{\mathcal{Q}|x}(\hat\theta) = 0$. For any point $\theta$ such that $\|\theta - \hat\theta\| > C/2$, its projection on the ball of radius $C/2$ around $\hat\theta$ is given by $\theta_\pi = \hat\theta + \lambda(\theta - \hat\theta) := \hat\theta + \frac{C}{2\|\theta - \hat\theta\|}(\theta - \hat\theta)$.