Estimation of Kullback-Leibler losses for noisy recovery problems within the exponential family

We address the question of estimating Kullback-Leibler losses rather than squared losses in recovery problems where the noise is distributed within the exponential family. Inspired by Stein unbiased risk estimator (SURE), we exhibit conditions under which these losses can be unbiasedly estimated or estimated with a controlled bias. Simulations on parameter selection problems in applications to image denoising and variable selection with Gamma and Poisson noises illustrate the interest of Kullback-Leibler losses and the proposed estimators.


Introduction
We consider the problem of predicting an unknown d-dimensional vector μ ∈ R d from its noisy measurements V ∈ R d . Given a collection of parametric predictors of μ, we focus on the selection of the predictorμ that minimizes the discrepancy with the unknown vector μ. For instance, this includes the problem of selecting the best predictors from the set of Least Absolute Shrinkage and Selection Operator (LASSO) solutions [44] obtained for all possible choices of regularization parameters. To this end, the common approach is to selectμ that minimizes an unbiased estimate of the expected squared loss E||μ −μ|| 2 , typically, with the Stein unbiased risk estimator (SURE) [43]. Such estimators are classically built on some statistical modeling of the noise, e.g., as being distributed within the exponential family. In this context, we investigate the interest of going beyond squared losses by rather estimating a loss function grounded on an information based criterion, namely, the Kullback-Leibler divergence. We will first recall some basic properties of the exponential family, give a quick review on risk estimation and motivate the use of the Kullback-Leibler divergence.
Exponential family. We assume that in the aforementioned recovery problem the noise distribution belongs to the exponential family. Formally, the recovery problem can be reparametrized using two one-to-one mappings ψ : R d → R d and φ : R d → R d such that Y = ψ(V ) has a probability measure P θ characterized by a probability density or mass function with respect to the Lebesgue measure dy of the following form p(y; θ) = h(y) exp ( y, θ − A(θ)) (1.1) where θ = φ(μ) ∈ R d . The distribution P θ is said to be within the natural exponential family. We call θ the natural parameter, Y a sufficient statistic for θ, h : R d → R + the base measure, and A : R d → R the log-partition function. Classical and important properties of the exponential family include A that is convex, E[Y ] = ∇A(θ) and Var[Y ] = ∇∇ t A(θ) (see, e.g., [3]). Here and in the following, E[Y ] = Y dP θ denotes the expectation of the random vector Y with respect to the measure dP θ , and Var ) t ] is its so-called variance-covariance matrix. Without loss of generality, we consider that Y is a minimal sufficient statistic. As a consequence, ∇A is one-to-one and we can choose φ as the canonical link function satisfying φ = (∇A) −1 (as coined in the language of generalized linear models). An immediate consequence is that Y has expectation E[Y ] = μ and its variance is a function of μ given by Var[Y ] = Λ(μ) where Λ = (∇∇ t A) • φ. The function Λ : R d → R d×d is the so-called variance function (see, e.g., [36]), also known as the noise level function (in the language of signal processing). Table 1 gives five examples of univariate distributions of the exponential family -two of them are defined in a continuous domain, the other three are defined in a discrete domain.
Risk estimation. We now assume that the predictorμ of μ is a function of Y only, hence, we write itμ(Y ), and we focus on estimating the loss associated toμ(Y ) with respect to μ. When the noise has a Gaussian distribution with independent entries, SURE [43] can be used to estimate the mean squared error (MSE), or in short the risk, defined as: MSE μ = E||μ −μ(Y )|| 2 . The resulting estimator, being independent on the unknown predictor μ, can serve in practice as an objective for parameter selection. Eldar [15] builds on Stein's lemma [43], a generalization of SURE valid for some continuous distributions of the exponential family. It provides an unbiased estimate of the "natural" risk, defined as: i.e., the risk with respect to θ = φ(μ). In the same vein, when the distribution is discrete, Hudson [26] provides another result for estimating the "exp-natural" risk: the risk with respect to η = exp θ, where exp : R d → R d is the entry-wise exponential. As φ is assumed one-to-one, there is no doubt that if such loss functions cancel thenμ(Y ) = μ. In this sense, they provide good objectives for selectinĝ μ(Y ). However, within a family of parametric predictors and without strong assumptions on μ, such a loss function might never cancel. In such a case, it becomes unclear what its minimization leads it to select, all the more when φ or exp • φ are non-linear. Furthermore, even when they are linear (e.g., exp • φ = id for Poisson noise), minimizing MSE μ = E||μ−μ(Y )|| 2 might not even be relevant as it does not compensate for the heteroscedasticity of the noise (this will be made clear in our experiments). Estimating the reweighted or Mahanalobis risk given by E||Λ(μ) −1/2 (μ −μ(Y ))|| 2 could be more relevant in this case, but its estimation is more intricate.
Kullback-Leibler divergence. The Kullback-Leibler (KL) divergence [27] is a measure of information loss when an alternative distribution P 1 is used to approximate the underlying one P 0 . Its formal definition is given by D(P 0 P 1 ) = dP 0 log dP0 dP1 . Unlike squared losses, it does not measure the discrepancy between an unknown parameter and its estimate, but between the unknown distribution P 0 of Y and its estimate P 1 . As a consequence, it is invariant with one-to-one reparametrization of the parameters and, hence, becomes a serious competitor to squared losses. Remark that it is also invariant under one-toone transformations of Y because such transforms do not affect the quantity of information carried by Y . Interestingly, provided P 0 and P 1 belongs to the same member of the natural exponential family respectively with parameters θ 0 and θ 1 , the KL divergence can be written in terms of the Bregman divergence associated with A for points θ 0 and θ 1 , i.e., While squared losses are defined irrespective of the noise distribution, the KL divergence adjusts its penalty with respect to the scales and the shapes of the deviations. In particular, it accounts for heteroscedasticity.

Contributions.
In this paper, we address the problem of estimating KL losses, i.e., losses based on the KL divergence. As it is a non symmetric discrepancy measure, we can define two KL loss functions. The first one will be referred to as the mean KL analysis loss as it can be given the following interpretation: "how well might Pθ (Y ) explain independent copies of Y ". The mean KL analysis loss is inherent to many statistical problems as it takes as reference the true underlying distribution. It is at the heart of the maximum likelihood estimator and is typically involved in non-parametric density estimation, oracle inequalities, mini-max control, etc. (see, e.g., [22,17,41]). The second one will be referred to as the mean KL synthesis loss given by which can be given the following interpretation: "how well might Pθ (Y ) generate independent copies of Y ". The mean KL synthesis loss has also been considered in different statistical studies. For instance, the authors of [48] consider this loss function to design a James Stein-like shrinkage predictor. Hannig and Lee address a very similar problem to ours, by designing a consistent estimator of MKLS used as an objective for bandwidth selection in kernel smoothing
problems subject to Gamma [24] and Poisson noise [25]. Table 2 gives a summary of our contributions. It highlights which loss can be estimated and under which conditions of the exponential family. The main contributions of our paper are: 1. provided y →μ(y) and the base measure h are both weakly differentiable, MKLS can be unbiasedly estimated (Theorem 4.1), 2. for any mapping y →μ(y), MKLA can be unbiasedly estimated for Poisson variates (Theorem 4.2), 3. provided y →μ(y) is k ≥ 3 times differentiable with bounded k-th derivative, MKLA can be estimated with vanishing bias when Y results from a large sample mean of independent random vectors with finite k-th order moments (Theorem 4.3).
It is worth mentioning that a symmetrized version of the mean Kullback-Leibler loss: MKLA + MKLS, can be estimated as soon as MKLA and MKLS can both be estimated (e.g., for continuous distributions according to Table 2).

Risk estimation under Gaussian noise
This section recalls important properties of the MSE and the definition of SURE under additive noise models of the form Y = μ + Z where Z ∼ N (0, σ 2 Id d ) and Id d denotes the d × d identity matrix. Before turning to the unbiased estimation of MSE μ , it is important to recall that for any additive models and zero-mean noise with variance σ 2 Id d , provided the following quantities exists, we have is the cross-covariance matrix between Y andμ(Y ). Equation (2.1) gives a variational interpretation of the minimization of the MSE as the optimization of a trade-off between overfitting (first term) and complexity (second term). In fact, σ −2 tr Cov(Y,μ(Y )) is a classical measure of the complexity of a statistical modeling procedure, known as the degrees of freedom (DOF), see, e.g., [13]. The DOF plays an important role in model validation and model selection rules, such as, Akaike information criteria (AIC) [1], Bayesian information criteria (BIC) [42], and the generalized cross-validation (GCV) [20].
For linear predictors of the formμ(y) = W y, W ∈ R d×d (think of least-square or ridge regression), the DOF boils down to tr W . As a consequence, the random quantity ||Y −μ(Y )|| 2 − dσ 2 + 2 tr W becomes an unbiased estimator of MSE μ , that depends solely on Y without prior knowledge of μ. If W is a projector, the DOF corresponds to the dimension of the target space, and we retrieve the well known Mallows' C p statistic [35] as well as the aforementioned AIC. The SURE provides a generalization of these results that is not only restricted to linear predictors but can be applied to weakly differentiable mappings. A comprehensive account on weak differentiability can be found in e.g., [16,18]. Let us now recall Stein's lemma [43].

Lemma 1 (Stein lemma). Assume f is weakly differentiable with essentially bounded weak partial derivatives on
A direct consequence of Stein's Lemma, providedμ fulfills the assumptions of Lemma 1, is that satisfies ESURE = MSE μ . Applications of SURE emerged for choosing the smoothing parameters in families of linear predictors [30] such as for model selection, ridge regression, smoothing splines, etc. After its introduction in the wavelet community with the SURE-Shrink algorithm [11], it has been widely used to various image restoration problems, e.g., with sparse regularizations [2,38,6,37,5,32,39,45] or with non-local filters [46,12,9,47].

Risk estimation for the exponential family and beyond
In this section, we recall how SURE has been extended beyond Gaussian noises towards noises distributed within the natural exponential family.
Continuous exponential family. We first consider continuous noise models, e.g., Gamma noise. To begin, we recall a well known result derived by Eldar [14], that can be traced back to Hudson 1 in the case of independent entries [26], and that can be seen as a generalization of Stein's lemma.

Lemma 2 (Generalized Stein's lemma).
Assume f is weakly differentiable with essentially bounded weak partial derivatives on R d and Y follows a distribution of the natural exponential family with natural parameter θ, provided h is also weakly differentiable on R d , we have Lemma 2, whose proof can be found in [14], provides an estimator of the dot product E θ, f (Y ) that solely depends on Y without reference to θ. As a consequence, the Generalized SURE (as coined by [14]) defined by is an unbiased estimator of MSE θ , i.e., EGSURE = MSE θ , providedθ, h and ∇h are weakly differentiable 2 . Note that omitting the last term in (3.1) leads to the seminal definition of GSURE given in [14] which provides an unbiased estimate of MSE θ − ||θ|| 2 , even though ∇h is not weakly differentiable. The GSURE can be specified for Gaussian noise, and in this case GSURE = σ −4 SURE and the "natural" risk boils down to the risk as In general, such a linear relationship between the "natural" risk and the risk of interest might not be met. For instance, under Gamma noise 3 with scale parameter L (see Table 1), with expectation μ and independent entries, the GSURE reads as which, as soon as L > 2 andμ fulfills the assumptions of Lemma 2, unbiasedly We will see in practice that minima of MSE θ can strongly depart from those of interest. As the GSURE can only measure discrepancy in the "natural" parameter space, its applicability in real scenarios can thus be seriously limited.
follows a Gamma distribution with scale parameter L if it results from the mean of L independent and identically distributed exponential random variables. For this reason, L is often referred to as the number of looks and controls the spread of the distribution as Var[Y ] = Λ(μ) = μ 2 L . This distribution is widely used to describe fluctuations of speckle in coherent laser imagery [21]. 4 L > 2 implies that h and ∇h are weakly differentiable. By omitting the last term of GSURE, an unbiased estimate of Discrete exponential family. We now consider discrete noises distributed within the natural exponential family, e.g., Poisson or binomial. Before turning to the general result, let us focus on Poisson noise with mean μ and independent entries for which the Poisson unbiased risk estimator (PURE) defined as unbiasedly estimates MSE μ , see, e.g., [7,26]. The vector e i is defined as (e i ) i = 1 and (e i ) j = 0 for j = i. The PURE is in fact the consequence of the following lemma also due to Hudson [26].

Lemma 3 (Hudson's lemma).
Assume Y follows a discrete distribution on Z d of the natural exponential family with natural parameter θ, then holds for every mapping f : Hudson's lemma provides an estimator of the dot product E exp θ, f (Y ) that solely depends on Y without reference to the parameter η = exp θ. As a consequence, we can define a Generalized PURE (GPURE) as which unbiasedly estimates MSE η for the discrete natural exponential family 5 . As for GSURE, GPURE cannot in general measure discrepancy in the parameter space of interest, and for this reason, its applicability in real scenarios can also be limited. However, under Poisson noise, the "exp-natural" space coincides with the parameter space of interest as η = exp(φ(μ)) = μ, hence, leading to the PURE. Another interesting case, already investigated in [26], is the one of noise with a negative binomial distribution with mean μ and independent entries, for which the "exp-natural" space does not match with the one of μ but with the one of the underlying probability vector p ∈ [0, 1] d as defined in Table  1 (we have θ i = log p i ). In such a case, GPURE reads, for and is an unbiased estimator of E ||p(Y ) − p|| 2 . Other related works. It is worth mentioning that there have been several works focusing on estimating mean squared errors in other scenarios. For instance, when Y has an elliptical-contoured distribution with a finite known covariance matrix Σ, the works of [28,23] provide a generalization of Stein's lemma that can also be used to estimate the risk associated to μ. In [40], the authors provide a versatile approach that provides unbiased risk estimators in many cases, including, all members of the exponential family (continuous or discrete), the Cauchy distribution, the Laplace distribution, and the uniform distribution [40]. The authors of [33] use a similar approach to design such an estimator in the case of the non-centered χ 2 distribution [33].

Kullback-Leibler loss estimation for the exponential family
We now turn to our first contribution that provides, for continuous distributions of the natural exponential family, an unbiased estimator of the Kullback-Leibler synthesis loss.
which concludes the proof.

Example 1.
Under Gamma noise with expectation μ, shape parameter L (as defined in Table 1) and independent entries, SUKLS reads as which, up to a constant, and provided L > 1, unbiasedly estimates In our experiments, we will see that minimizing MKLS (or its SUKLS estimate) leads to relevant selections, unlike minimizing MSE θ (or its GSURE estimate).
Note that the authors of [24] have proposed a consistent estimator of MKLS when L = 1 (they did not study the case where L > 1), their estimator has been however designed only for kernel smoothing problems.
is an unbiased estimator of MKLA and log is the entry-wise logarithm.
Proof. The expression of MKLA follows directly from Table 1 and Equation which concludes the proof as h ↓ (y)/h(y) = y andθ ↓ (Y ) = logμ ↓ (Y ).
With such results at hand, only the Poisson distribution admits an unbiased estimator of the mean Kullback-Leibler analysis loss. In order to design an estimator of MKLA for a larger class of natural exponential distributions, we will make use of the following proposition. Proposition 1. For any probability density or mass function y → p(y; θ) of the natural exponential family of parameter θ, the Kullback-Leibler analysis loss associated to y →θ(y) can be decomposed as follows Proof. Subtracting and adding Y,θ(Y ) − θ in the MKLA definition leads to , this concludes the proof.
In the same vein as for the decomposition (2.1), Proposition 1 provides a variational interpretation of the minimization of MKLA, valid for noise distributions within the exponential family. Minimizing MKLA leads to a maximum a posteriori selection promoting faithful models with low complexity. It boils down to (2.1) when specified for Gaussian noise. As for the MSE, the fidelity term can always be unbiasedly estimated, up to an additive constant, without knowledge of θ. Only the complexity term tr Cov(θ(Y ), Y ), which generalizes the notion of degrees of freedom, is required to be estimated. Except for the Poisson distribution, none of the previous lemmas can be applied to unbiasedly estimate this term. However, we will show that it can be biasedly estimated, with vanishing bias depending on both the "smoothness" ofθ and the behavior of the moments of Y . Towards this goal, let us first recall the Delta method.

is an infinite sequence of independent and identically distributed random vectors in R d with EZ i = μ, Var[Z i ] = Σ and finite moments up to order
Lemma 4 is a direct d-dimensional extension of [29] (Theorem 5.1a, page 109), that allows us to introduce our biased estimator of MKLA.

. is an infinite sequence of independent random vectors in R d
identically distributed within the natural exponential family with natural parameter θ, log-partition function A, expectation μ, variance function Λ and finite moments up to order k ≥ 3. As a result, the distribution of Y n is also in the natural exponential family parametrized by θ n = nθ with log-partition function A n (θ n ) = nA(θ n /n), expectation μ and variance function Λ n = Λ/n. Provided θ n reads asθ n = nθ, andθ : R d → R d is k times totally differentiable with bounded k-th derivative, then where MKLA n is the KL analysis loss associated toθ n with respect to θ n .
It is worth mentioning that Theorem 4.3 can be applied to Gaussian noise, with DKLA boiling down to SURE, as DKLA = (2σ 2 ) −1 (SURE − ||Y || 2 + dσ 2 ). However, the conclusion is not as strong, as by virtue of Lemma 1, DKLA would be in fact an unbiased estimator provided only thatμ is weakly differentiable. More interestingly, consider the two following examples.

Example 2.
Gamma random vectors Y n with expectation μ ∈ (R + * ) d and shape parameter L n = n (as defined in Table 1) results from the sample mean of n independent exponential random vectors with expectation μ (entries of the vectors are supposed to be independent). As exponential random vectors have finite moments, providedμ is sufficiently smooth and since φ is continuously differentiable in (R + * ) d , Theorem 4.3 applies and we get

Example 3.
Consider Y n the sample mean of n independent Poisson random vectors with expectation μ ∈ (R + * ) d . We have that Y n , for all n, belongs to the natural exponential family with A n (θ n ) = n exp(θ n /n) and θ n = n log μ (entries of the vectors are supposed to be independent). As Poisson random vectors have finite moments, providedμ is sufficiently smooth and since φ is continuously where MKLA n Interestingly, remark that PUKLA(μ, Y ) ≈ DKLA(μ, Y ), as soon as we have

Reliability study
In this section, we aim at studying and comparing the sensitivity of the previously studied risk estimators. Little is known about the variance of SURE: It is in general an intricate problem and some studies [37,31] Proof. This is a straightforward consequence of Cauchy-Schwartz's inequality.
Proposition 2 allows us to compare the relative sensitivities of the different estimators. Comparing GSURE and SUKLS, one can notice that the bounds are similar but the first one is controlled byθ(Y ) while the second one is controlled byμ(Y ). While it is difficult to make a general statement, we believe SUKLS estimates might be more stable than GSURE sinceμ(Y ) has usually better control thanθ(Y ), given the non-linearity of the canonical link function φ.

Implementation details for the proposed estimators
In this section, we explain how the proposed risk estimators can be evaluated in practice within a reasonable computation time.
All risk estimators designed for continuous distributions rely on the computation of tr g(y) ∂f (y) ∂y y for some mappings g : For instance, SURE requires to compute such a a quantity with g(y) = Id d and f =μ (see eq. (2.2)). In general, the computation of these terms requires at least O(d 2 ) operations and thus prevents the use of such risk estimators in practice. Fortunately, following [19,38], we can approximate such terms by using Monte-Carlo simulations, thanks to the following relation tr g(y) ∂f (y) ∂y y = E ζ, g(y) ∂f (y) ∂y y ζ for ζ ∼ N (0, Id d ), (6.1) where the directional derivatives in the direction ζ ∈ R d can be computed by using finite differences or algorithm differentiations as described in [10]. This leads in general to a much faster evaluation in O(d) operations.
In the Poisson setting, risk estimators rely on the computation of y, f ↓ (y) for some mapping f : R d → R d . For instance, PUKLA requires to compute such a quantity with f = − logμ (see Theorem (4.2)). Again, the computation of such terms requires at least O(d 2 ) operations in general. Based on first order expansions, we have empirically chosen to perform Monte-Carlo simulations on the following approximation where ζ ∈ {−1, +1} d is Bernoulli distributed with p = 0.5. In our numerical experiments, this approximation led to O(d) operations and satisfactory results even though f was chosen to be non-linear. This approximation clearly deserves more attention but is considered here to be beyond the scope of this study.

Numerical experiments
In this section, we will perform numerical experiments showing the interest of the proposed Kullback-Leibler risk estimators in two different applications.

Application to image denoising
We first consider that Y and μ are d dimensional vectors representing images on a discrete grid of d pixels, such that entries with index i are located at pixel location δ i ∈ Δ ⊂ Z 2 . A realization y of Y represents a noisy observation of the image μ. The estimateμ of μ is a denoised version of y.
Performance evaluation. In order to evaluate the proposed loss functions and their estimates, visual inspection will be considered to assess the image quality in terms of noise variance reduction and image content preservation. In order to provide an objective measure of performance, taking into account heteroscedasticity and tails of the noise, we will evaluate the mean normalized absolute deviation error defined as The MNAE measures to which extentμ(Y ) might belong to a confident interval around μ with dispersion related to Λ(μ). The MNAE is expected to be 1 when μ(Y ) ∼ N (μ, Λ(μ)), and should get closer to 0 whenμ(Y ) improves on Y itself.
Simulations in linear filtering. We consider here thatμ is the linear filter where W ∈ R d×d is a circulant matrix encoding a discrete convolution with a Gaussian kernel of bandwidth τ > 0. In this context, we will evaluate the relevance of the different proposed loss functions and their estimates as objectives to select a bandwidth τ offering a satisfying denoising. Figure 1 gives an example of a noisy observation y of an image μ representing fingerprints whose pixel values are independently corrupted by Gamma noise with shape parameter L = 3. We have evaluated the relevance of the natural risk MSE θ given by ||μ −1 −μ(Y ) −1 || 2 , MKLS and MKLA in selecting the bandwidth τ . Visual inspection of the results obtained at the optimal bandwidth for each criterion shows that the natural risk MSE θ fails in selecting a relevant bandwidth while MKLS and MKLA both provide a better trade-off. The natural risk strongly penalizes small discrepancies at the lowest intensities while not being sensitive enough for discrepancies at higher intensities. As the noisy image has several isolated pixel values approaching 0, the natural risk will strongly penalize smoothing effects of such isolated structures preventing satisfying noise variance reduction. The Kullback-Leibler loss functions take into account that Gamma noise has a constant signal to noise ratio. Hence, it does not favor the restoration of either bright or dark structures more, allowing satisfying smoothing for both, as assessed by the MNAE. Finally, estimators of these loss functions with respectively GSURE, SUKLS and DKLA are given. Note that for L = 3, the Gamma distribution is far from reaching the asymptotic conditions of Theorem 4.3. As a result, bias is not negligible (it becomes obvious for the lowest values of Fig 2. (a,b,c) Risks and their estimates as a function of the bandwidth in the same setting as in Figure 1 but for Gamma noise with L = 100. The optimal bandwidth τ and the MNAE are indicated. Red shows unbiased estimation and blue biased estimation. τ in Figure 1.h). Nevertheless, minimizing DKLA can still provide an accurate location of the optimal parameter for MKLA. Figure 2 reproduces the same experiment but with Gamma noise with L = 100, i.e., with a much better signal to noise ratio. Interestingly, the bias of DKLA gets much smaller than in the previous experiment. This was indeed expected as with L = 100, the Gamma distribution fulfills much better the asymptotic conditions of Theorem 4.3. Remark that MNAE values are still in favor of Kullback-Leibler objectives, but the gains are much smaller. In fact, all MNAE values get closer to 1 since noise reduction with signal preservation using linear filtering becomes much harder in such a low signal to noise ratio setting.
Simulations in non-linear filtering. We consider here thatμ is the nonlocal filter [4] defined bŷ μ(y) = W (y)y with W i,j (y) = exp − d(P i y, P j y) τ (7.2) where P i ∈ R p×n is a linear operator extracting a patch (a small window of fixed size) at location δ i , d : R p × R p → R + is a dissimilarity measure (infinitely differentiable and adapted to the exponential family following [8]) and τ > 0 a bandwidth parameter. Remark as W (y) ∈ R d×d depends on y,μ(y) is nonlinear. In this context, we will evaluate again the relevance of the proposed loss functions and their estimates as objectives to select the bandwidth τ . Figure 3 gives an example of a noisy observation y of an image μ representing a bright two dimensional chirp signal shaded gradually into a dark homogeneous region. The noisy observation y is contaminated by noise following a Gamma distribution with shape parameter L = 3. We have again evaluated the relevance of the natural risk MSE θ given by ||μ −1 −μ(Y ) −1 || 2 , MKLS and MKLA in selecting the bandwidth parameter. Visual inspection of the results obtained at the optimal bandwidth for each criterion shows that the natural risk fails in selecting a relevant bandwidth while MKLS and MKLA both provide more satisfying results. As the image μ is very smooth in the darker region, the natural risk favors strong variance reduction leading to a strong smoothing of the texture in the brightest area. Again, the Kullback-Leibler loss functions find a good trade-off preserving simultaneously the bright texture and reducing the noise in the dark homogeneous region, as assessed by the MNAE. Finally, estimators of these loss functions with respectively GSURE, SUKLS and DKLA are given. Figure 4 gives a similar example where the image μ represents a two dimensional chirp signal shaded gradually into a bright homogeneous region. The image is displayed in log-scale to better assess the variations of the texture in the darkest region. The noisy observation y is corrupted by independent noise following a Poisson distribution. We have evaluated the relevance of the risks MSE μ , MKLS and MKLA in selecting the bandwidth parameter. Visual inspection of y shows that darker regions are more affected by noise than brighter ones. This is due to the fact that Poisson corruptions lead to a signal to noise ratio evolving as √ μ.

Application to variable selection
We now consider the problem of variable selection in linear regression problems, i.e., in finding the non-zero components of a vector β ∈ R q that fulfills the assumption that an observed vector y ∈ R d has expectation μ = Xβ where X ∈ R d×q is the so-called design matrix. To this aim, we consider the Least Absolute Shrinkage and Selection Operator (LASSO) [44] given, for λ > 0, bŷ β(y) ∈ argmin β∈R q − log p(y; θ = φ(Xβ)) + λ||β|| 1 .
In this case the predictorμ is given byμ(y) = Xβ(y). The LASSO is known to promote sparse solutions, i.e., such that the number of non-zero entries of β is small compared to q. The level of sparsity is indirectly controlled by the regularization parameter λ, the larger λ is, the sparserβ will be. Finding the optimal parameter λ, and then selecting the relevant variables (columns of X) explaining y, is a challenging problem that can be addressed by minimizing an estimator of the risk. In this context, we will evaluate again the relevance of the different proposed loss functions and their estimates as objectives to select a regularization parameter λ offering a relevant selection of variables. Figure 5 and Table 3 provide results obtained on such a linear regression problem where X is an orthogonal matrix and d = q = 16, 384. The vector β was chosen such that 28% of its entries are non-zero. We have generated 200  independent realizations y of Y using a Gamma distribution model with scale parameter L = 8. We have again evaluated the relevance of the natural risk MSE θ given by ||μ −1 −μ(Y ) −1 || 2 , MKLS and MKLA in selecting the regularization parameter. Figure 5 shows the evolution of these objectives as a function of λ. It shows that KL objectives lead to selecting a larger λ parameter than with the natural risk. Performance in terms of average percentages of false negatives (FN:β i = 0 and β i = 0), false positives (FP:β i = 0 and β i = 0) and errors (FP or FN) are reported in Table 3. It shows that tuning the parameter λ with respect to KL objectives leads to lower numbers of errors than with the natural risk. One can observe that the subsequent LASSO estimators work at different trade-offs: KL objectives favor FN over FP, while the natural risk favors FP over FN. Finally, performances with estimators of MSE θ with GSURE, MKLS with SUKLS, and MKLA with DKLA are also given. It can be observed that risk estimators offer in average comparable results than their oracle counterparts but have higher variance. Note that the LASSO is not differentiable, such that DKLA is not guaranteed to be asymptotically unbiased (as the conditions of Theorem 3 are not fulfilled), which explains the large discrepancies observed between the results obtained by MKLA and DKLA. Nevertheless, even though DKLA is not asymptotically unbiased in this case variable selections with the LASSO guided by DKLA still provides a good objective for variable selection, with similar results as if it was guided by the oracle MKLA objective. A last important question is to know whether our risk estimators are robust against model misspecification, i.e., when the generative model (1.1) is only approximately known. Indeed, Lv and Liu [34] demonstrated the advantage of using KL divergence principle for model selection problems in both correctly specified and misspecified models. Along these lines, we have also shown in Table 3 the results obtained under misspecification. We have chosen to evaluate the performance of the LASSO guided by the aforementioned estimators of the risk when the shape parameter L of the Gamma distribution is misestimated by a factor 0.1%: |1 −L/L| = 0.1. We found that the performance of all estimators drop in this case. Nevertheless, their relative performance are preserved: KL objectives lead to lower numbers of errors than with the natural risk.

Conclusion
We addressed the problem of using and estimating Kullback-Leibler losses for model selection in recovery problems involving noise distributed within the exponential family. Our conclusions are threefold: 1) Kullback-Leibler losses have shown empirically to be more relevant than squared losses for model selection in the considered scenarios; 2) Kullback-Leibler losses can be estimated in many cases unbiasedly or with controlled bias depending on the regularity of both the predictor and the noise; 3) Even though the estimation is subject to variance and bias, the subsequent selection has shown empirically to be close to the optimal one associated to the loss being estimated. Future works should focus on understanding under which conditions such a behavior can be guaranteed. This includes establishing tighter bounds on the reliability, consistency with respect to the data dimension d and asymptotic optimality results for some given class of predictors. Estimation of Kullback-Leibler losses and other discrepancies (e.g., Battacharyya, Hellinger, Mahanalobis, Rényi or Wasserstein distances/divergences) beyond the exponential family and requiring less regularity on the predictor should also be investigated.