Likelihood decision functions

In both classical and Bayesian approaches, statistical inference is unified and generalized by the corresponding decision theory. This is not the case for the likelihood approach to statistical inference, in spite of the manifest success of the likelihood methods in statistics. The goal of the present work is to fill this gap, by extending the likelihood approach in order to cover decision making as well. The resulting decision functions, called likelihood decision functions, generalize the usual likelihood methods (such as ML estimators and LR tests), in the sense that these methods appear as the likelihood decision functions in particular decision problems. In general, the likelihood decision functions maintain some key properties of the usual 
likelihood methods, such as equivariance and asymptotic optimality. By unifying and generalizing the likelihood approach to statistical inference, the present work offers a new perspective on statistical methodology and on the connections among likelihood methods.


Introduction
Wald [67] tried to unify statistics in his theory of decision functions. However, many of the most appreciated statistical methods do not fit well in this setting. In particular, the likelihood methods (such as the maximum likelihood estimators and the likelihood ratio tests) are usually suboptimal in corresponding (finite-sample) decision problems, from the standpoint of minimax risk. In fact, the post-data nature of likelihood methods is at variance with the pre-data evaluation of decision functions.
Since statistical methods based on the likelihood function are extremely successful as regards estimation and testing, it is natural to try extending the likelihood approach to more general decision problems. The present paper introduces criteria for basing decisions on the likelihood function alone. The optimal decisions resulting from such criteria generalize the usual likelihood methods, in the sense that these methods are optimal in corresponding (finite-sample) decision problems. In other decision problems, new likelihood methods are obtained.
Surprisingly, only a few authors have studied extensions of the likelihood approach to cover decision making. Besides the author [18,19], only Lehmann and Romano [45, Section 1.7 (substantially unchanged since the first edition in 1959)], Diehl and Sprott [25], and Giang and Shenoy [33] seem to have worked in this direction. However, the latter three approaches are not directly applicable to general statistical decision problems in the sense of Wald [67], and their properties have not been investigated.
Many authors (such as Fisher [30], Barnard [6], Birnbaum [14], Hacking [35], Kalbfleisch [40], Sprott [61], Edwards [28], Lindsey [48], Azzalini [3], Royall [57], Reid [55], Pawitan [53], Hills [37]) consider the likelihood function as a description of uncertain knowledge about the parameters of the statistical model. More precisely, the likelihood function can be interpreted as a description of a kind of relative plausibility of the possible values of the parameters in the light of the observed data. The uncertainty in this description is non-probabilistic, and therefore the likelihood approach to decision making clearly differs from the Bayesian approach.
In particular, prior (uncertain) knowledge about the parameters is not needed in the likelihood approach to decision making: this is a fundamental advantage over the Bayesian approach. However, the fact that likelihood functions induced by independent data are combined by (pointwise) multiplication suggests the possibility of describing prior uncertain knowledge by a prior likelihood function, which can then be (pointwise) multiplied with the likelihood function induced by the data (this idea is implicitly or explicitly considered for example in [5,7,8,14,26,27,28,38,46,48,49,53]).
That is, when prior information is available, it can be used in the likelihood approach to decision making. But prior information is not necessary, because complete ignorance about the values of the parameters can be described by a constant (prior) likelihood function, which contains no information for discrimination between the possible values of the parameters (in the sense of Kullback and Leibler [44]). The possibility of describing ignorance distinguishes likelihood functions from probability measures (as descriptions of uncertain knowledge) and leads to the above fundamental advantage over the Bayesian approach.
Despite the different descriptions of uncertain knowledge about the parameters of the statistical model, the likelihood and Bayesian approaches to decision making share a basic property: they both satisfy the (strong) likelihood principle (see for example [14,10,39,13,29,50]). This principle gives theoretical reasons for the likelihood approach to decision making, in addition to the pragmatic reasons mentioned above (that is, the success of the likelihood methods).
In particular, the likelihood approach to decision making can be applied post-data, avoiding the severe difficulties of pre-data evaluations (see for instance [41,56,11,34]), and drastically reducing the complexity of the decision problems. However, in the tradition of Wald [67], the pre-data properties of the resulting decision functions, called likelihood decision functions, will be studied in the present paper.
The paper is organized as follows. In the next section, basic definitions and notations are introduced. Section 3 presents criteria for basing decisions on the likelihood function alone. Pre-data properties of the resulting likelihood decision functions are the subject of Section 4 (the proofs of the theorems are in Appendix A). The final section is devoted to conclusions and directions for further research.

Settings
Let (Ω, F ) be a measurable space, and for each θ ∈ Θ, let P θ be a probability measure on (Ω, F ). Random variables on Ω are denoted by X or X n (with n ∈ N), and their codomains by X and X n , respectively (it is assumed that all singleton subsets of X and X n are measurable). The only assumption about Θ is that it is not empty. In particular, the statistical model {P θ : θ ∈ Θ} can be parametric (in this case, θ describes the parameters of the statistical model) or nonparametric (in this case, θ simply indexes the probability measures).
If x ∈ X satisfies P θ (X = x) > 0 for some θ ∈ Θ, then the likelihood function given X = x is denoted by λ x . This definition is not applicable when the random variable X is continuous for all θ ∈ Θ. In fact, it can be argued that the realization of a continuous random variable can never be observed with infinite precision: it is only possible to observe X ∈ N for a suitable neighborhood N of x. The likelihood function λ N given X ∈ N is then usually well-defined. If for each θ ∈ Θ the density f θ of X with respect to a fixed σ-finite measure µ on X exists and satisfies sup θ∈Θ f θ (x) ∈ R >0 , then λ N can be well approximated by the unique function f ∈ Λ such that f (θ) ∝ f θ (x).
The likelihood function given X = x is often simply defined as this function f , but the definition of likelihood in terms of probability (and the consequent interpretation of f as a mere approximation of λ N ) seems to be preferred by most authors who consider likelihood functions as descriptions of uncertain knowledge (see for example [28,40,9,48,49,61,53,52]). The reasons are that the post-data interpretation of the function f can be problematic, since the densities f θ are not unique (but only unique µ-a.e.), and that f is not well-defined when f θ (x) is an unbounded function of θ. However, for the pre-data properties studied in Section 4 the nonuniqueness of the densities is not a problem, and therefore, in order to simplify the results, f will then be called the likelihood function given X = x and denoted by λ x (when it is well-defined).
If the random variables X 1 , X 2 are independent for all θ ∈ Θ, then the likelihood function given (X 1 , X 2 ) = (x 1 , x 2 ) satisfies λ (x1,x2) (θ) ∝ λ x1 (θ) λ x2 (θ) (when these three functions are well-defined). As noted in Section 1, this suggests the possibility of describing prior uncertain knowledge by a prior likelihood function: when X 2 = x 2 is observed, the prior (likelihood function) λ x1 is updated to the posterior λ (x1,x2) . The prior λ x1 is simply interpreted as the likelihood function given X 1 = x 1 , regardless of whether the observation X 1 = x 1 is real or imagined.
The choice of a prior likelihood function seems better supported by intuition than the choice of a prior probability measure: in particular, the likelihood function constantly equal to 1 describes complete ignorance about the value of θ ∈ Θ (see also [19, Subsection 3.1.2]). In fact, this is the likelihood function obtained when observing no data (that is, when observing the event Ω), and using it as a prior is equivalent to using no prior likelihood function (since it is the neutral element of pointwise multiplication). On the other hand, the penalty term in penalized likelihood methods can often be formally interpreted as a prior likelihood function (see for example [46]).
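As a small numerical illustration of this combination rule (a sketch using a hypothetical Bernoulli model on a parameter grid, with likelihood functions normalized to supremum 1 in the relative-plausibility reading; the grid and the data are arbitrary):

```python
import numpy as np

# Hypothetical Bernoulli model: likelihood functions on a grid of parameter
# values, normalized so that the supremum is 1 (relative plausibility).
theta = np.linspace(0.01, 0.99, 99)

def bernoulli_likelihood(successes, trials):
    lik = theta**successes * (1 - theta)**(trials - successes)
    return lik / lik.max()

lam1 = bernoulli_likelihood(3, 10)   # likelihood from a first data set
lam2 = bernoulli_likelihood(7, 10)   # likelihood from an independent second set
combined = lam1 * lam2               # pointwise multiplication
combined /= combined.max()

# The constant likelihood function is the neutral element of this combination,
# so multiplying by it (as a prior describing ignorance) changes nothing:
prior = np.ones_like(theta)
with_prior = prior * lam1
```

Here `combined` is proportional to θ^10 (1 − θ)^10, as expected for the pooled data, and is maximized at θ = 0.5.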

Decision problem
A statistical decision problem is described by a loss (or weight) function W : Θ × D → R ≥0 , where D is the (nonempty) set of all possible decisions, one or more of which must be chosen. For each pair (θ, d) ∈ Θ × D, the value W (θ, d) represents the loss suffered by choosing the decision d when P θ is the correct probability measure. It is assumed that the function W summarizes all important aspects of the decision problem. In particular, if randomized decisions are allowed, then they should already be contained in D, and the corresponding loss described by W .
Let W be the set of all functions w : Θ → R ≥0 . To each decision d ∈ D can be associated the function w d ∈ W such that w d (θ) = W (θ, d) for all θ ∈ Θ.
The decision problem can be restated as the problem of choosing one or more functions w from the subset {w d : d ∈ D} of W, where the loss (as a function of θ) suffered by choosing w is represented by the function w itself. To each function w can correspond more than one decision d ∈ D, but these decisions are equivalent from the standpoint of the decision problem.
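In a finite setting this reformulation is immediate to write down; the following minimal sketch (the losses are invented for illustration) builds the functions w_d from a loss table W:

```python
# A finite decision problem: the loss function W as a table, and the induced
# functions w_d : Theta -> R>=0 with w_d(theta) = W(theta, d).
# The numbers are made up for illustration.
Theta = ["theta1", "theta2"]
D = ["d1", "d2"]
W = {("theta1", "d1"): 0.0, ("theta2", "d1"): 2.0,
     ("theta1", "d2"): 1.0, ("theta2", "d2"): 0.5}

# w[d] represents the function w_d; choosing a decision amounts to choosing
# one of these functions as the loss (as a function of theta)
w = {d: {theta: W[(theta, d)] for theta in Theta} for d in D}
```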
When X = x is observed, the likelihood function λ x can be interpreted as a description of a kind of relative plausibility of the possible values of θ, and can thus be useful for choosing a decision d ∈ D. Possible criteria for this kind of post-data decision making are the subject of Section 3. Some pre-data properties of these decision criteria are then studied in Section 4. In order to do this, the chosen decision must be considered as a function of the observed realization of X. Such a function δ : X → D, describing a whole decision strategy, is called decision function.

Likelihood decision criteria
Let λ ∈ Λ be the likelihood function given the data (possibly including prior information), and let the loss function W on Θ × D describe a decision problem. This section introduces criteria for choosing, on the basis of λ and W , one or more decisions d ∈ D, or equivalently, on the basis of λ, one or more functions w ∈ {w d : d ∈ D}.
For instance, when the maximum likelihood estimate θ̂ is well-defined, a simple criterion for choosing d consists in minimizing W (θ̂, d), or equivalently, for choosing w, in minimizing w(θ̂). That is, the criterion consists in minimizing the loss under the assumption that P θ̂ is the correct probability measure. This simple criterion is often used in practical applications: for example when in the portfolio selection problem of Markowitz [51] the parameters of the model are estimated by maximum likelihood (see for instance [47,15]). The criterion was also formally, though hesitantly, considered by Diehl and Sprott [25]. However, besides being perhaps too optimistic about the quality of maximum likelihood estimates, this simple criterion is not always well-defined. Before considering some alternative criteria, in the next subsection a general definition of likelihood decision criteria is introduced.

General definition
A likelihood decision criterion for choosing one or more decisions d ∈ D consists in minimizing a certain evaluation V (w d , λ) of the corresponding loss w d on the basis of the likelihood function λ, where the functional V : W × Λ → R must satisfy the following three properties.
(P1) If w ≤ w ′ , then V (w, λ) ≤ V (w ′ , λ) must hold for all functions w, w ′ ∈ W and all λ ∈ Λ.
(P2) V (w ∘ b, λ ∘ b) = V (w, λ) must hold for all bijections b : Θ → Θ and all pairs of functions (w, λ) ∈ W × Λ.
(P3) If the subset H of Θ and the sequence of functions λ n ∈ Λ (with n ∈ N) satisfy lim n→∞ λ n (Θ \ H) = 0, then lim n→∞ V (c I H + c ′ I Θ\H , λ n ) = c must hold for all constants c, c ′ ∈ R ≥0 .
Before analyzing these properties, it is important to clarify what is meant by minimization of V (w d , λ). If there is a decision d ∈ D such that V (w d , λ) = inf d ′ ∈D V (w d ′ , λ), then d is optimal according to the likelihood decision criterion described by the functional V . When there is no optimal decision, the criterion suggests the choice of a decision d ∈ D such that V (w d , λ) is sufficiently close to inf d ′ ∈D V (w d ′ , λ).
(P1) can be interpreted as a property of monotonicity of the functional V , following directly from the assumption that the loss function W summarizes all important aspects of the decision problem. In fact, if the decisions d, d ′ ∈ D satisfy w d ≤ w d ′ , then choosing d cannot be worse than choosing d ′ , and thus V (w d , λ) ≤ V (w d ′ , λ) should hold.
(P2) can be interpreted as a property of parametrization invariance, typical of the likelihood methods. This invariance is a consequence of the idea that everything important about θ is described by the loss function W and the likelihood function λ. In particular, (P2) excludes the Bayesian criteria when Θ is infinite. In fact, with some additional measurability restrictions, the Bayesian criterion with prior π is described by the functional V π : (w, λ) → ∫ w λ dπ. Hence, (P2) implies in particular the invariance π ∘ b −1 = π for all measurable bijections b, since V π (I H , I Θ ) = π(H) for all measurable subsets H of Θ. This invariance can be satisfied only if Θ is finite (when π is the uniform prior) or if the measurability conditions are very restrictive.
(P3) can be interpreted as a minimal consistency property, implying that some information provided by the likelihood function is actually used by the likelihood decision criterion. In particular, it excludes the minimax (loss) criterion, described by the functional (w, λ) → sup θ∈Θ w(θ). Moreover, (P3) with H = Θ implies the following calibration property: V (c I Θ , λ) = c for all constants c ∈ R ≥0 and all likelihood functions λ ∈ Λ. This property and (P1) imply in particular that inf θ∈Θ w(θ) ≤ V (w, λ) ≤ sup θ∈Θ w(θ) holds for all pairs of functions (w, λ) ∈ W × Λ.
A simple example of likelihood decision criterion can be obtained by modifying the minimax (loss) criterion in order to satisfy (P3). It suffices to reduce Θ to the likelihood confidence region consisting of all θ whose (normalized) likelihood exceeds a certain threshold β ∈ ]0, 1[, before applying the minimax (loss) criterion. The resulting likelihood decision criterion is called Likelihood-based Region Minimax (LRM) criterion and is described by the functional
V LRM,β : (w, λ) → sup θ∈Θ : λ({θ})>β w(θ).
It has been applied for example in the problem of regression with imprecisely observed data (see for instance [22]).
If the maximum likelihood estimate θ̂ ∈ Θ is well-defined and there is a topology on Θ such that w ∈ W is continuous at θ̂ and λ(Θ \ N ) < 1 holds for all neighborhoods N of θ̂, then lim β↑1 V LRM,β (w, λ) = w(θ̂). Hence, the likelihood decision criterion described by the (pointwise) limit of V LRM,β when β tends to 1 is strictly related to the idea considered at the beginning of the present section, but has the advantage of being always well-defined. It is called Maximum Likelihood Decision (MLD) criterion and is described by the functional
V MLD : (w, λ) → lim β↑1 V LRM,β (w, λ).
The MLD criterion clearly generalizes maximum likelihood estimation, while the LRM criterion can be seen as a generalization of likelihood ratio testing. In the next subsection, a likelihood decision criterion generalizing both these very successful components of the likelihood approach to statistics is considered in more detail.
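On a finite parameter grid, with the likelihood normalized to supremum 1, the LRM and MLD evaluations can be sketched in a few lines (the grid, the likelihood values, and the squared error losses are illustrative choices, not taken from the text):

```python
import numpy as np

def V_LRM(w, lam, beta):
    # worst-case loss over the likelihood confidence region {theta : lam > beta}
    return w[lam > beta].max()

def V_MLD(w, lam):
    # worst-case loss over the maximum likelihood points
    # (the pointwise limit of V_LRM as beta tends to 1)
    return w[lam == lam.max()].max()

theta = np.array([0.2, 0.5, 0.8])
lam = np.array([0.1, 1.0, 0.4])              # normalized likelihood, ML point 0.5
losses = {d: (theta - d)**2 for d in theta}  # candidate estimates d with loss w_d

best_mld = min(losses, key=lambda d: V_MLD(losses[d], lam))
best_lrm = min(losses, key=lambda d: V_LRM(losses[d], lam, beta=0.5))
```

Both criteria select the maximum likelihood point 0.5 here, in line with the discussion of Example 3.1.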

MPL criterion
An alternative way of modifying the minimax (loss) criterion in order to satisfy (P3) consists in applying it after having weighted the loss associated to θ by means of the likelihood of θ (raised to a certain power α ∈ R >0 ). The resulting likelihood decision criterion is called Minimax Plausibility-weighted Loss (MPL) criterion and is described by the functional
V MPL,α : (w, λ) → sup θ∈Θ λ({θ}) α w(θ).
It can be characterized among the likelihood decision criteria by a few basic decision-theoretic properties, but this goes beyond the scope of the present paper (see [19, Subsection 4]). The exponent α ∈ R >0 plays a similar role for the MPL criterion as the threshold β ∈ ]0, 1[ does for the LRM criterion. In fact, lim α↓0 V MPL,α (w, λ) = lim β↓0 V LRM,β (w, λ) holds for all pairs of functions (w, λ) ∈ W × Λ, while lim α↑∞ V MPL,α (w, λ) = V MLD (w, λ) holds at least for all bounded w ∈ W and all λ ∈ Λ. The simple choice α = 1 for the exponent of the likelihood function is supported by the analogy with the Bayesian criterion: the integral with respect to π in the functional V π is replaced by the supremum with respect to θ in the functional V MPL,1 .
The analogy of the Bayesian and MPL criteria (with α = 1) emerges also when considering the likelihood ratio test statistic λ(H) as a function of H ⊆ Θ. This set function is a completely maxitive measure in the terminology of Shilkret [60], who also introduced the corresponding theory of integration: the integral of w ∈ W with respect to the completely maxitive measure λ is V MPL,1 (w, λ). Hence, the MPL criterion with α = 1 corresponds to minimizing the integral of the loss with respect to the likelihood ratio test statistic, interpreted as a completely maxitive measure describing the posterior information about θ.
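On a finite parameter grid with the likelihood normalized to supremum 1, the MPL evaluation is a one-liner, and the calibration property V (c I Θ , λ) = c is visible directly (the loss values below are invented for illustration):

```python
import numpy as np

def V_MPL(w, lam, alpha=1.0):
    # supremum of the loss weighted by the likelihood raised to the power alpha
    # (for alpha = 1, Shilkret's integral of w with respect to lam)
    return (lam**alpha * w).max()

lam = np.array([0.1, 1.0, 0.4])   # normalized likelihood (sup = 1)
w = np.array([3.0, 0.5, 2.0])     # illustrative loss as a function of theta

v = V_MPL(w, lam)                 # sup of (0.3, 0.5, 0.8)
c = V_MPL(np.full(3, 2.0), lam)   # calibration: a constant loss 2 evaluates to 2
```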
Example 3.1 (maximum likelihood estimation). The estimation of θ can be described as a decision problem with D = Θ. When Θ is finite, it makes sense to employ the simple loss function W such that w d = I Θ\{d} for all d ∈ D. In this case, if the maximum likelihood estimate θ̂ is well-defined, then it is the unique optimal decision according to the MLD and MPL criteria (independently of the exponent α), while for the LRM criterion this holds only if the threshold β is sufficiently large.
These results can be generalized to the case with infinite Θ, for example when a metric on Θ is considered. For a suitably small ε ∈ R >0 , it makes then sense to employ the simple loss function W such that w d = I Θ\B(d,ε) for all d ∈ D, where B(d, ε) denotes the closed ball with center d and radius ε. It can be easily proved that in this case, if the maximum likelihood estimate θ̂ is well-defined, B(θ̂, ε) is compact, and λ(Θ \ B(θ̂, ε)) < 1, then for the MLD and MPL criteria (independently of the exponent α) optimal decisions exist and, even when they are not unique, they all lie in B(θ̂, ε), while for the LRM criterion this holds only if the threshold β is sufficiently large.
Hence, the MPL and MLD criteria lead practically to maximum likelihood estimates in this simple decision-theoretic description of estimation, and therefore they can be interpreted as generalizations of maximum likelihood estimation (while this is not true for the LRM criterion).
Example 3.2 (likelihood ratio testing). For each subset H of Θ, testing for the null hypothesis H 0 : θ ∈ H against the alternative H 1 : θ ∈ Θ \ H can be described as a decision problem with D = {1, 0}, where 1 and 0 represent the rejection and the acceptance (or non-rejection) of H 0 , respectively. When constant losses c 1 , c 2 ∈ R >0 (with c 1 > c 2 ) are assigned to errors of the first and of the second kind, respectively, the resulting loss function W satisfies w 1 = c 1 I H and w 0 = c 2 I Θ\H .
In this case, according to the MPL criterion with exponent α, rejection is the unique optimal decision if and only if λ(H) < (c 2 /c 1 ) 1/α , while acceptance is the unique optimal decision if and only if λ(H) > (c 2 /c 1 ) 1/α . Similarly, according to the LRM criterion with threshold β, rejection is the unique optimal decision if and only if λ(H) ≤ β, while acceptance is the unique optimal decision if and only if λ(H) > β. Finally, according to the MLD criterion, rejection is the unique optimal decision if and only if λ(H) < 1, while acceptance is the unique optimal decision if and only if λ(H) = 1.
Hence, the MPL and LRM criteria lead practically to likelihood ratio tests in this simple decision-theoretic description of hypothesis testing, and therefore they can be interpreted as generalizations of likelihood ratio testing (while this is not true for the MLD criterion).
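The threshold behaviour described in Example 3.2 can be packaged into a few lines (a sketch; λ(H) denotes the likelihood ratio statistic, that is, the normalized supremum of the likelihood over H, and the constants in the usage example are arbitrary):

```python
def mpl_test(lam_H, c1, c2, alpha=1.0):
    # MPL criterion: reject H0 (decision 1) iff lam(H) < (c2/c1)**(1/alpha),
    # accept (decision 0) iff lam(H) > (c2/c1)**(1/alpha); at the threshold
    # both decisions are optimal
    t = (c2 / c1)**(1 / alpha)
    return 1 if lam_H < t else (0 if lam_H > t else None)

def mld_test(lam_H):
    # MLD criterion: reject iff lam(H) < 1, accept iff lam(H) = 1
    return 1 if lam_H < 1 else 0
```

For example, with c1 = 2 and c2 = 1 the MPL threshold (with α = 1) is 1/2, so λ(H) = 0.05 leads to rejection and λ(H) = 0.9 to acceptance, while the MLD criterion rejects for any λ(H) < 1.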
The likelihood decision functions resulting from the MPL criterion thus generalize the usual likelihood methods, in the sense that these methods are optimal according to the MPL criterion in simple decision-theoretic descriptions of the corresponding inference problems. In other decision problems (as in the examples of the next section), this criterion can lead to new likelihood methods.

Properties
Likelihood decision criteria were introduced in Section 3 as criteria for post-data decision making. The present section studies pre-data properties of the resulting likelihood decision functions. Before considering some asymptotic results, in the next subsection finite-sample invariance properties are presented.

Invariances
Let X be a random variable such that the likelihood function λ x ∈ Λ is well-defined for all x ∈ X , and let the loss function W on Θ × D describe a decision problem. A likelihood decision criterion described by the functional V can be applied for each possible realization x of X, by minimizing the evaluation V (w d , λ x ) over all decisions d ∈ D. In this subsection, in order to simplify the results, it is assumed that for each possible realization x of X there is a unique optimal decision δ(x) according to the likelihood decision criterion. The resulting likelihood decision function δ : X → D is then uniquely defined.
Some basic invariance properties follow directly from the fact that the likelihood approach to decision making satisfies the likelihood principle. In particular, if s(X) is a sufficient statistic for θ, then δ(x) = δ(x ′ ) holds for all x, x ′ ∈ X such that s(x) = s(x ′ ), since in this case λ x = λ x ′ (see for instance [58, Theorem 2.21 and Proposition 2.23]). That is, the likelihood decision function δ is completely described by a function δ ′ : S → D such that δ = δ ′ • s, where S is the codomain of s.
As noted in Subsection 3.1, a certain kind of parametrization invariance is implied by (P2). In fact, a bijection b : Θ → Θ can be interpreted as the description of a reparametrization of the statistical model, in which θ ∈ Θ is replaced by ϑ ∈ Θ, with b(ϑ) = θ. For the reparametrized statistical model, the likelihood function given X = x is λ x • b, and the decision problem is described by the loss function (ϑ, d) → W (b(ϑ), d). Hence, (P2) implies that the likelihood decision function δ is left invariant by this reparametrization of the statistical model.
Another direct consequence of (P2) is the following important invariance property. Given three bijections g : X → X , b : Θ → Θ, and h : D → D, if for each x ∈ X the likelihood function given X = g(x) satisfies λ g(x) ∘ b = λ x , and the loss function satisfies W (b(ϑ), h(d)) = W (ϑ, d) for all (ϑ, d) ∈ Θ × D, then δ ∘ g = h ∘ δ. That is, if the decision problem is invariant, then δ is equivariant (see for example [12, Section 6.2], [58, Subsection 6.2.1]). In particular, it is not even necessary to identify the symmetries of the decision problem: the likelihood decision functions are guaranteed to respect them anyway. Among the invariance properties considered in the present subsection, this is the only one that does not necessarily hold when a prior likelihood function is used. In fact, prior information can destroy the symmetries of the decision problem.
Example 4.1 (variance components). Let X 1 , . . . , X m be independent and n-variate normally distributed random variables (with n ≥ 2) such that for all i ∈ {1, . . . , m}, each component of X i has expected value µ and variance τ 2 + σ 2 , and each pair of different components of X i has covariance τ 2 , where θ = (µ, τ, σ) and Θ = R × R >0 × R >0 . That is, each vector X i represents the n observations in one of the m groups of a balanced one-way random effect model under normality assumptions (see for example [59]). In order to simplify the results, assume that the model is conditioned on the (a.s.) event that no vector X i has all its components equal, and so X 1 = · · · = X m = R n \ {(y 1 , . . . , y n ) ∈ R n : y 1 = · · · = y n }.
The problem of estimating the variance component τ 2 is particularly interesting, because the analysis of variance estimate can be negative. For this problem, Portnoy [54] suggested the following location and scale invariant version of the squared error loss function (with D = R):
W (θ, d) = (d − τ 2 ) 2 / (τ 2 + σ 2 ) 2 .
For each i ∈ {1, . . . , m}, let X̄ i and S i be the mean and the sum of squared deviations from the mean, respectively, for the components of X i . Furthermore, let X̄ and S be the mean and the sum of squared deviations from the mean, respectively, for the sample X̄ 1 , . . . , X̄ m . That is, X̄ is the grand mean, while n S and Σ m i=1 S i are the sums of squares due to differences between groups and within groups, respectively. Finally, define the ratio
R = n S / (n S + Σ m i=1 S i ).
Since (X̄, n S, Σ m i=1 S i ) is a sufficient statistic for (µ, τ, σ), and the decision problem described by the loss function W is location and scale invariant, when a likelihood decision function δ : X 1 × · · · × X m → R is uniquely defined, it satisfies δ(X 1 , . . . , X m ) = (n S + Σ m i=1 S i ) δ ′ (R) for some function δ ′ : [0, 1[ → R. This holds in particular for the likelihood decision function resulting from the MPL criterion with exponent α = 1: for each r ∈ [0, 1[, the value δ ′ (r) can be easily obtained numerically as the unique minimizer of the corresponding evaluation V MPL,1 . The resulting function δ ′ in the case m = n = 3 is plotted in the left panel of Figure 1, together with the functions δ ′ corresponding to some other decision criteria or estimation methods. The right panel of Figure 1 shows the expected loss (that is, the risk) of these estimators as a function of ρ = τ 2 /(τ 2 + σ 2 ). Besides the MPL criterion, the methods considered are the analysis of variance, maximum likelihood (corresponding to the MLD criterion), restricted maximum likelihood (the function δ ′ is the pointwise maximum of the ones corresponding to analysis of variance and maximum likelihood, see for example [63]), and the Bayesian criterion with the Jeffreys (improper) prior proposed by Tiao and Tan [64].
The results are qualitatively similar for other values of m and n. Hence, the MPL criterion leads to a new estimator, which uniformly improves upon the maximum likelihood, restricted maximum likelihood, and analysis of variance estimators from the standpoint of risk. Moreover, unlike these three estimators, the new likelihood method resulting from the MPL criterion does not face the problem of negative or zero estimates of the positive variance component τ 2 .
Since the pre-data choice of a decision function (an estimator) is much more complicated than the post-data choice of a decision (an estimate), a minimax (risk) estimator is not known. However, Portnoy [54] showed that the estimator resulting from the Bayesian criterion with the Jeffreys' (improper) prior proposed by Tiao and Tan [64] is nearly minimax (from the standpoint of risk). Therefore, the new estimator resulting from the MPL criterion is nearly minimax as well, and has the fundamental advantage of avoiding the problematic choice of a prior probability measure (see for instance [36,64,62,43,54]).

Consistency
Let the loss function W on Θ × D describe a decision problem, and consider a sequence of random variables X n (with n ∈ N). A sequence of decision functions δ n : X 1 × · · · × X n → D (with n ∈ N) is said to be (strongly) consistent at θ 0 ∈ Θ if lim n→∞ W (θ 0 , δ n (X 1 , . . . , X n )) = inf d∈D W (θ 0 , d) holds P θ0 -a.s. That is, consistency at θ 0 means that when P θ0 is the correct probability measure, the sequence of decisions δ n (X 1 , . . . , X n ) tends to minimize the loss (almost surely). For example, if D = Θ, and W is a metric on Θ, then the decision problem describes the estimation of θ, and the sequence of estimators δ n is (strongly) consistent in the usual sense if and only if it is consistent in the above sense at each θ ∈ Θ.
A sequence of decision functions δ n : X 1 × · · · × X n → D (with n ∈ N) is said to be optimal according to the likelihood decision criterion described by a functional V if V (w δn(x1,...,xn) , λ (x1,...,xn) ) < inf d∈D V (w d , λ (x1,...,xn) ) + 2 −n holds for all n ∈ N and all (x 1 , . . . , x n ) ∈ X 1 × · · · × X n such that the likelihood function λ (x1,...,xn) ∈ Λ is well-defined. Hence, for each likelihood decision criterion an optimal sequence of decision functions δ n always exists, though in general it is not unique and no single decision δ n (x 1 , . . . , x n ) needs to be optimal. However, this weak definition of optimality of a sequence of decision functions is strong enough to warrant important asymptotic results.
In general, a sequence of decision functions that is optimal according to a likelihood decision criterion can be consistent at θ 0 ∈ Θ only if the likelihood tends to concentrate on θ 0 , in the following sense. Given a topology on Θ, the likelihood is said to tend to concentrate on θ 0 if P θ0 -a.s. the likelihood function λ (X1,...,Xn) ∈ Λ is well-defined for sufficiently large n, and lim n→∞ λ (X1,...,Xn) (Θ \ H) = 0 holds P θ0 -a.s. for all neighborhoods H of θ 0 .
Sufficient conditions for the likelihood to tend to concentrate on θ 0 are well-known: see for example [66,Theorem 1], [42, (2.12)], or [4, (xxvii)]. The tendency of the likelihood to concentrate on θ 0 is not affected by the use of a prior likelihood function bounded away from 0 in a neighborhood of θ 0 .
As noted in Subsection 3.1, some kind of minimal consistency is implied by (P3). In fact, a simple consequence of (P3) and (P1) is that lim n→∞ V (w, λ (X1,...,Xn) ) = w(θ 0 ) holds P θ0 -a.s. when the function w ∈ W is bounded and there is a topology on Θ such that w is continuous at θ 0 and the likelihood tends to concentrate on θ 0 . This implies in particular the consistency at θ 0 of all sequences of decision functions that are optimal according to some likelihood decision criterion, when D is finite and for each d ∈ D the loss w d is bounded and there is a topology on Θ such that w d is continuous at θ 0 and the likelihood tends to concentrate on θ 0 . The following theorem shows that in the case of infinite D it suffices to replace the assumptions of continuity at θ 0 of the functions w d with the stronger assumption of their equicontinuity at θ 0 .
Theorem 4.1. If the loss w d is bounded for each decision d ∈ D, the sequence of decision functions δ n : X 1 × · · · × X n → D (with n ∈ N) is optimal according to a likelihood decision criterion, and there are a θ 0 ∈ Θ and a topology on Θ such that the likelihood tends to concentrate on θ 0 and the set of functions {w d : d ∈ D} is equicontinuous at θ 0 , then the sequence of decision functions δ n is consistent at θ 0 .
Example 4.2 (hypothesis testing). In the decision problem of Example 3.2, if there is a topology on Θ such that for each θ 0 ∈ Θ the likelihood tends to concentrate on θ 0 , then Theorem 4.1 implies the consistency at each θ ∈ Θ \ ∂H (where ∂H denotes the boundary of H) of all sequences of decision functions that are optimal according to some likelihood decision criterion. That is, each likelihood decision criterion will P θ -a.s. give the correct test result for sufficiently large n, for all θ ∈ Θ \ ∂H.
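Example 4.2 can be illustrated by a small simulation (a sketch with arbitrary choices: a Bernoulli model on a grid, null hypothesis H 0 : θ ≤ 0.5, true parameter θ 0 = 0.7 outside H, and a fixed random seed): as the likelihood concentrates on θ 0 , the statistic λ(H) becomes negligible, so every likelihood decision criterion eventually rejects H 0 .

```python
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.01, 0.99, 99)
H = theta_grid <= 0.5                  # null hypothesis H0: theta <= 0.5
theta0, n = 0.7, 500                   # true parameter lies outside H

x = rng.binomial(1, theta0, size=n)    # n Bernoulli observations under theta0
k = x.sum()
loglik = k * np.log(theta_grid) + (n - k) * np.log(1 - theta_grid)
lam = np.exp(loglik - loglik.max())    # normalized likelihood function
lam_H = lam[H].max()                   # likelihood ratio statistic lam(H)
```

With n = 500 observations the likelihood is sharply peaked near θ 0 and λ(H) is vanishingly small, so the MPL, LRM, and MLD criteria all reject H 0 for any reasonable threshold.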
In Theorem 4.1, it is assumed that the functions w d are bounded and equicontinuous at θ 0 . As noted by Wald [67, Subsection 3.1.2], such assumptions are not seriously restrictive from a practical point of view. However, they are not satisfied in many standard formulations of statistical decision problems, such as for example the estimation of θ when Θ is a Euclidean space and W represents squared error. In order to prove the consistency of sequences of likelihood decision functions in such standard decision problems as well, the assumptions of Theorem 4.1 can be replaced by the weaker, but more complex ones of the following theorem.
Theorem 4.2. If the sequence of decision functions δ_n : X_1 × ... × X_n → D (with n ∈ N) is optimal according to the likelihood decision criterion described by a functional V, and there are a θ_0 ∈ Θ, a topology on Θ, a constant c ∈ R_{>0} with c > inf_{d ∈ D} W(θ_0, d), and a neighborhood H of θ_0 such that the following three conditions are satisfied: (i) the likelihood tends to concentrate on θ_0, (ii) the set of functions {w_d : d ∈ D, inf_{θ ∈ H} W(θ, d) < c} is equicontinuous at θ_0, (iii) lim_{m→∞} lim sup_{n→∞} V(w_d, λ_{(X_1,...,X_n)}) − V(w_d ∧ m, λ_{(X_1,...,X_n)}) = 0 (where w_d ∧ m denotes the pointwise minimum of w_d and m ∈ N) holds P_{θ_0}-a.s. for all d ∈ D such that W(θ_0, d) < c, then the sequence of decision functions δ_n is consistent at θ_0.

Example 4.3 (uniform distribution). Let the sequence of random variables X_n (with n ∈ N and X_n = R_{>0}) be independent and uniformly distributed on the interval ]0, θ[, where Θ = R_{>0}, and consider the problem of estimating θ with a scale-invariant loss function, where D = R_{>0}. For each n ∈ N, since the maximum X_(n) is a sufficient statistic of X_1, ..., X_n for θ, and the decision problem is scale invariant, when a likelihood decision function δ_n : R_{>0}^n → R_{>0} is uniquely defined, it satisfies δ_n(X_1, ..., X_n) = κ_n X_(n) for some constant κ_n ∈ R_{>0}. More generally, for each likelihood decision criterion an optimal sequence of decision functions of the form δ_n(X_1, ..., X_n) = κ_n X_(n) always exists.
For each θ_0 ∈ R_{>0}, the likelihood tends to concentrate on θ_0 with respect to the Euclidean topology, since lim_{n→∞} X_(n) = θ_0 holds P_{θ_0}-a.s., and λ_{(X_1,...,X_n)} : θ ↦ (X_(n)/θ)^n I_{]X_(n),∞[}(θ) for all n ∈ N. Moreover, it can be easily checked that for any c ∈ R_{>0} and any bounded neighborhood H of θ_0, condition (ii) of Theorem 4.2 is satisfied, while condition (iii) holds for instance when the functional V satisfies V(w, λ) = V(w I_{λ^{−1}(]0,1])}, λ) for all pairs of functions (w, λ) ∈ W × Λ. That is, Theorem 4.2 implies the (strong) consistency of all sequences of estimators δ_n resulting from likelihood decision criteria with the property that each evaluation V(w_d, λ) does not depend on the loss associated with values of θ with zero likelihood.
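A quick numerical check of this concentration, sketched under the assumption of the uniform model on ]0, θ[ (names are illustrative): the relative likelihood (X_(n)/θ)^n for θ > X_(n) collapses onto θ_0 as n grows.

```python
import random

def likelihood(theta, x_max, n):
    """Relative likelihood (X_(n)/theta)^n of a uniform ]0, theta[ sample
    with observed maximum x_max; zero for theta <= x_max."""
    return (x_max / theta) ** n if theta > x_max else 0.0

random.seed(0)
theta0 = 2.0
for n in (10, 100, 1000):
    x_max = max(random.uniform(0, theta0) for _ in range(n))
    # at parameter values bounded away from theta0 the likelihood vanishes,
    # while near theta0 it stays non-negligible: it "tends to concentrate"
    print(n, round(x_max, 4), likelihood(1.1 * theta0, x_max, n))
```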
The three examples of likelihood decision criteria explicitly considered in Section 3 have this property, and for each n ∈ N they lead to uniquely defined likelihood decision functions δ_n : (x_1, ..., x_n) ↦ κ_n x_(n). Therefore, Theorem 4.2 implies lim_{n→∞} κ_n = 1. In fact, for the MLD criterion κ_n = 1 holds for all n ∈ N, while κ_n = κ(αn) and κ_n = κ′(β^{1/n}) hold for the MPL criterion with exponent α and the LRM criterion with threshold β, respectively, where κ : R_{>0} → ]1, 2[ and κ′ : ]0, 1[ → ]1, 2[ are decreasing bijections. More precisely, κ′ : y ↦ 2/(y + 1), while κ assigns to each y ∈ R_{>0} the unique solution s > 1 of the equation (s − 1) s^y = y^y/(y + 1)^{y+1}. Though all sequences of estimators κ_n X_(n) resulting from the MPL, LRM, and MLD criteria are (strongly) consistent, the estimators κ(n) X_(n) and κ′(2^{−1/n}) X_(n) resulting from the MPL criterion with α = 1 and the LRM criterion with β = 1/2, respectively, have smaller expected losses than the (maximum likelihood) estimators X_(n) resulting from the MLD criterion (for all n ∈ N and all θ ∈ Θ). For instance, when n = 1, the estimators (√2 + 1)/2 X_1 and 4/3 X_1 resulting from the MPL criterion with α = 1 and the LRM criterion with β = 1/2 have expected losses smaller than that of the maximum likelihood estimator X_1 by factors of approximately 1.16 and 1.20, respectively (independently of θ). These factors are slightly smaller than the largest possible one for a scale equivariant estimator, which is approximately 1.21 (independently of θ) and is obtained with Pitman's estimator √2 X_1 (that is, the estimator resulting from the Bayesian criterion with Jeffreys' improper prior).
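The constants κ_n can be checked numerically; the following sketch solves the equation (s − 1) s^y = y^y/(y + 1)^{y+1} by bisection and confirms the two n = 1 estimators quoted above (function names are illustrative).

```python
def kappa(y):
    """Unique solution s in ]1, 2[ of (s - 1) * s**y = y**y / (y + 1)**(y + 1),
    found by bisection (the left-hand side is increasing in s for s > 1)."""
    target = y ** y / (y + 1) ** (y + 1)
    lo, hi = 1.0, 2.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (mid - 1) * mid ** y < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def kappa_prime(y):
    """kappa'(y) = 2 / (y + 1) for the LRM criterion."""
    return 2 / (y + 1)

print(kappa(1))          # MPL with alpha = 1, n = 1: (sqrt(2) + 1) / 2
print(kappa_prime(0.5))  # LRM with beta = 1/2, n = 1: 4/3
print(kappa(100))        # kappa_n tends to 1, in accordance with consistency
```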

Efficiency
Stronger assumptions about the statistical model, the loss function, and the likelihood decision criterion allow the sequences of likelihood decision functions to have asymptotic properties stronger than consistency. For example, in a parametric estimation problem, the following theorem gives simple sufficient conditions for a sequence of likelihood decision functions to be an asymptotically efficient sequence of estimators. Its proof uses the result (strictly related to the Bernstein–von Mises theorem) that, under some regularity conditions, the likelihood function tends to a normal density around the maximum likelihood estimate (see for example [16, 32], [58, Subsection 7.4.2], [65, Section 10.2]).
For simplicity, the statement of the theorem is restricted to the estimation of the natural parameter of a minimal regular exponential family (see for example [17]) under a power loss function, and to the three examples of likelihood decision criteria explicitly considered in Section 3 (for a version of the theorem with weaker, but more complex, assumptions see [19, Subsection 5.1.1]). The theorem still holds when a continuous prior likelihood function taking only positive values is used. An example of a likelihood decision criterion for which the result does not hold is the minimin version of the LRM criterion, described by the functional (w, λ) ↦ inf_{θ ∈ Θ : λ(θ) > β} w(θ) (for some threshold β ∈ ]0, 1[).
Theorem 4.3. Let the sequence of random variables X_n (with n ∈ N and X_n = X) be independent and identically distributed according to a minimal regular exponential family with natural parameter space Θ ⊆ R^k. Let W be the loss function (θ, d) ↦ |θ − d|^γ, where D = Θ and γ ∈ R_{>0}. If the sequence of decision functions δ_n : X^n → Θ (with n ∈ N) is optimal according to the MPL criterion (for some exponent α ∈ R_{>0}), the LRM criterion (for some threshold β ∈ ]0, 1[), or the MLD criterion, then it is asymptotically efficient.
Example 4.4 (normal distribution). Let the sequence of random variables X_n (with n ∈ N and X_n = R) be independent and normally distributed with expected value θ and variance 1, where Θ = R. Consider the problem of estimating θ with the power loss function W : (θ, d) ↦ |θ − d|^γ, where D = R and γ ∈ R_{>0}. For each n ∈ N, let X̄_n denote the mean of the sample X_1, ..., X_n. From (P2) with the reflection with respect to X̄_n as bijection b : R → R it follows that for each n ∈ N, when a likelihood decision function δ_n : R^n → R is uniquely defined, it satisfies δ_n(X_1, ..., X_n) = X̄_n. This holds in particular for the likelihood decision functions resulting from the MPL, LRM, and MLD criteria (independently of the exponent α and the threshold β), which is in accordance with Theorem 4.3, since the sequence of estimators X̄_n is asymptotically efficient.
On the other hand, asymptotic efficiency is not necessarily a desirable property when the loss function is asymmetric. Consider for instance the so-called pinball (or check) loss function W : (θ, d) ↦ (θ − d) (τ − I_{]θ,∞[}(d)), where τ ∈ ]0, 1[. This loss function penalizes the overestimation of θ more than its underestimation when τ < 1/2, and vice versa when τ > 1/2. For each n ∈ N, since the mean X̄_n is a sufficient statistic of X_1, ..., X_n for θ, and the decision problem is location invariant, when a likelihood decision function δ_n : R^n → R is uniquely defined, it satisfies δ_n(X_1, ..., X_n) = X̄_n + κ_n for some constant κ_n ∈ R. More generally, for each likelihood decision criterion an optimal sequence of decision functions of the form δ_n(X_1, ..., X_n) = X̄_n + κ_n always exists. Such a sequence of estimators is asymptotically efficient if and only if lim_{n→∞} √n κ_n = 0. However, when τ ≠ 1/2, a sequence of estimators with lim_{n→∞} √n κ_n ≠ 0 can have expected loss smaller than that of X̄_n by a factor of up to exp(z_τ^2/2) (where z_τ denotes the τ-quantile of the standard normal distribution), independently of θ and n.
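The bound exp(z_τ^2/2) is easy to evaluate; this sketch uses Python's standard-library NormalDist for the quantile (the function name is illustrative).

```python
from math import exp
from statistics import NormalDist

def max_factor(tau):
    """Largest possible improvement factor exp(z_tau**2 / 2) over the mean
    under the pinball loss, where z_tau is the standard normal tau-quantile."""
    z = NormalDist().inv_cdf(tau)
    return exp(z * z / 2)

for tau in (0.5, 0.25, 0.1):
    print(tau, round(max_factor(tau), 2))
```

For τ = 1/2 the factor is 1 (no improvement is possible), while for τ = 1/10 it is approximately 2.27, the value quoted in the text.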
In particular, if the likelihood decision function δ_1 : x_1 ↦ x_1 + κ_1 is uniquely defined, and the likelihood decision criterion is described by a functional V such that V(c w, λ) = c V(w, λ) for all pairs of functions (w, λ) ∈ W × Λ and all constants c ∈ R_{>0} (that is, the evaluation of the loss is scale equivariant), then δ_n : (x_1, ..., x_n) ↦ x̄_n + κ_1/√n is the uniquely defined likelihood decision function for each n ∈ N. This follows from (P2) with the scaling by √n as bijection b : R → R, and is true in particular for the likelihood decision functions resulting from the MPL, LRM, and MLD criteria. More precisely, κ_1 = 0 and κ_1 = √(−2 ln β) (2τ − 1) hold for the MLD criterion and the LRM criterion with threshold β, respectively, while κ_1 = s/√α holds for the MPL criterion with exponent α, where s is the unique real solution of an equation depending only on τ. Therefore, the sequence of (maximum likelihood) estimators X̄_n resulting from the MLD criterion is asymptotically efficient for all τ ∈ ]0, 1[, while the sequences of estimators resulting from the MPL and LRM criteria are asymptotically efficient if and only if τ = 1/2. However, when τ ≠ 1/2, the estimators resulting from the MPL and LRM criteria have smaller expected losses than X̄_n. For instance, when τ = 1/10, the estimators resulting from the MPL criterion with α = 1 and the LRM criterion with β = 1/2 have expected losses smaller than that of X̄_n by factors of approximately 2.21 and 2.13, respectively (independently of θ and n). These factors are slightly smaller than the largest possible one for a location equivariant estimator, which is approximately 2.27 (independently of θ and n) and is obtained with Pitman's estimators (that is, the estimators resulting from the Bayesian criterion with Jeffreys' improper prior).
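Two of these factors can be reproduced in closed form. Under the standard normal model, a short calculation (an assumption of this sketch, not stated in the text) gives the expected pinball loss of the estimator X_1 + κ as φ(κ) + κ Φ(κ) − τκ, where φ and Φ are the standard normal density and distribution function; the ratios for the LRM estimator (κ = √(−2 ln β) (2τ − 1)) and Pitman's estimator (κ = z_τ) then match the 2.13 and 2.27 quoted above.

```python
from math import log, sqrt
from statistics import NormalDist

ND = NormalDist()

def expected_pinball_loss(kappa, tau):
    """Expected pinball loss of the estimator X_1 + kappa when X_1 ~ N(theta, 1);
    it equals phi(kappa) + kappa * Phi(kappa) - tau * kappa, independently of
    theta (closed form derived here, assumed rather than taken from the text)."""
    return ND.pdf(kappa) + kappa * ND.cdf(kappa) - tau * kappa

tau = 0.1
kappa_lrm = sqrt(-2 * log(0.5)) * (2 * tau - 1)  # LRM criterion, beta = 1/2
kappa_pitman = ND.inv_cdf(tau)                   # Pitman's estimator
base = expected_pinball_loss(0.0, tau)           # maximum likelihood estimator
print(round(base / expected_pinball_loss(kappa_lrm, tau), 2))     # approx. 2.13
print(round(base / expected_pinball_loss(kappa_pitman, tau), 2))  # approx. 2.27
```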

Conclusion
In the present paper, the likelihood approach to statistics is extended and unified by the concept of likelihood decision function. Such a decision function is obtained by a post-data evaluation of the possible decisions on the basis of the likelihood function (interpreted as a description of uncertain knowledge about the parameters of the statistical model, and possibly including prior information).
In particular decision problems, the likelihood decision functions correspond to the usual likelihood methods (such as maximum likelihood estimators and likelihood ratio tests), which are among the most successful statistical methods. In other decision problems, the likelihood decision functions describe new likelihood methods, which maintain some key properties of the usual ones (such as invariance or asymptotic properties). The likelihood approach to decision making thus offers a new perspective on statistical methodology and on the connections among likelihood methods.
The extended likelihood approach presented here can be applied to any decision problem. In the examples of this paper, standard problems of estimation and testing are considered, in order to allow the comparison of likelihood decision functions and usual statistical methods. However, decision problems go well beyond estimation and testing, and even in problems of estimation and testing the use of more elaborate loss functions can be of interest. The likelihood approach to decision making has been applied for instance to the problems of (distribution-free) regression with imprecisely observed data (see [22,23]) and supervised classification (see [20,1]).
The likelihood approach to decision making can be applied post-data: this is the main advantage over classical decision theory. It is an advantage from both the standpoint of interpretation (the goal is the actual decision, not the decision function) and the standpoint of practical application (the post-data choice of a decision is possible also in situations in which the pre-data choice of a decision function is hopeless). However, pre-data properties of decision functions can be important, and some have been studied in the present paper. In particular, likelihood decision functions have several invariance properties, and also (under regularity conditions) asymptotic properties such as consistency and efficiency.
Another advantage of the likelihood approach to decision making over classical decision theory is the possibility of using prior information. However, prior information is not necessary in the likelihood approach, and this is the main advantage over the Bayesian approach. These two approaches are otherwise strictly related, since both satisfy the (strong) likelihood principle. The advantages of the Bayesian approach over the likelihood one are some additional invariance properties, and in particular the essential equivalence of pre-data and post-data evaluations.

Besides further applications of the likelihood approach to decision making, future work will include a detailed analysis of the decision-theoretic properties characterizing the MPL criterion (see [19, Sections 3.1 and 4.1]), in connection with the theories of risk measurement (see for instance [31, 2]) and of nonadditive measures and integrals (see for example [21, 24]).
In order to complete the proof, it suffices to show that from this result and the properties (a) and (e) it follows that P_{θ_0}-a.s. the inequality w_{δ_n(X_1,...,X_n)}(θ_0) < i_0 + 7ε holds for sufficiently large n. In particular, it suffices to show that for any decision d ∈ D and any likelihood function λ ∈ Λ, the inequality w_d(θ_0) < i_0 + 7ε is implied by the inequalities V(w_d, λ) < i_0 + 5ε and V((i_0 + 6ε) I_{H′}, λ) > i_0 + 5ε.