Upper and lower bounds for the Bregman divergence

In this paper we study upper and lower bounds on the Bregman divergence Δ_F^ξ(y, x) := F(y) − F(x) − ⟨ξ, y − x⟩ for some convex functional F on a normed space X, with subgradient ξ ∈ ∂F(x). We give a considerably simpler new proof of the inequalities by Xu and Roach for the special case F(x) = ‖x‖^p, p > 1.
The results can be transferred to more general functions as well.


Introduction
In recent times the Bregman divergence (or Bregman distance) Δ_F^{x*}(y, x), introduced by Bregman in [1], has been used as a generalized distance measure in various branches of applied mathematics, for example optimization, inverse problems, statistics and computational mathematics, especially machine learning. For an overview of the Bregman divergence and its possible applications in optimization and inverse problems we refer to [2][3][4]. In particular, the Bregman divergence has been used for various algorithms in numerical analysis and also for the convergence analysis of numerical methods and algorithms.
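A simple illustrative special case (not from the original discussion, but standard): in a Hilbert space X with F(x) = ½‖x‖², the subgradient at x is x* = x, and the Bregman divergence reduces to the squared distance:

```latex
\Delta_{\mathcal{F}}^{x^*}(y,x)
  = \tfrac12\|y\|^2 - \tfrac12\|x\|^2 - \langle x,\, y - x\rangle
  = \tfrac12\|y\|^2 + \tfrac12\|x\|^2 - \langle x, y\rangle
  = \tfrac12\|x - y\|^2 .
```

The bounds studied in this paper quantify how far norm powers on a general Banach space deviate from this Hilbert space identity.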
Especially when doing convergence analysis it is often crucial to have lower and upper bounds on the Bregman divergence in terms of norms. In [5] the authors prove upper and lower bounds for expressions of the form

‖x + y‖^p − ‖x‖^p − p⟨j_p(x), y⟩,    (1)

where j_p : X → X* is a duality mapping, under certain assumptions on the Banach space X. As it turns out that (1) is the Bregman divergence corresponding to the functional F = ‖·‖^p, these results have since been used in many papers working with the Bregman divergence. However, from the proofs of [5] it seems difficult to transfer the results to other functions F. Thus we develop in this work a simple framework to find such bounds and in fact can apply it to give a short new proof of the results from [5] for F(x) = ‖x‖^p, p > 1.
Our approach is as follows: proving upper bounds is rather simple if one sufficiently understands the smoothness of F, as the Bregman divergence is basically a linearization error, and linearization errors are related to differentiability by definition. In particular we will show that one can obtain upper bounds for the Bregman divergence corresponding to F = φ(‖·‖) if φ : R → R is convex and sufficiently smooth.
Regarding lower bounds we will make use of F*, the convex conjugate of F. Actually it can be shown that lower bounds for Δ_F^{x*}(y, x) correspond to upper bounds for Δ_{F*}^{x}(y*, x*). Note that this idea is not at all new; already in [6, 7] this kind of connection between F and F* was discussed in depth. So again one can just make use of the smoothness of F* to conclude lower bounds for Δ_F^{x*}(y, x). One might argue that convex conjugates can be rather complicated functions and that expecting differentiability is too optimistic. This is true to some extent, but actually reasonable lower bounds on Δ_F^{x*}(y, x) already imply differentiability of F* at x* (see [6, Theorem 2.1]). So if one has any hope of finding lower bounds then one might as well work with the convex conjugate.
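This correspondence can be made precise by a short computation (a standard identity, included here for convenience): if F is convex and lower-semicontinuous, x* ∈ ∂F(x) and y* ∈ ∂F(y), then y ∈ ∂F*(y*), and Young's equality F(x) + F*(x*) = ⟨x*, x⟩, applied at both points, gives

```latex
\begin{aligned}
\Delta_{\mathcal{F}^*}^{\,y}(x^*, y^*)
  &= \mathcal{F}^*(x^*) - \mathcal{F}^*(y^*) - \langle x^* - y^*,\, y\rangle \\
  &= \bigl(\langle x^*, x\rangle - \mathcal{F}(x)\bigr)
     - \bigl(\langle y^*, y\rangle - \mathcal{F}(y)\bigr)
     - \langle x^* - y^*,\, y\rangle \\
  &= \mathcal{F}(y) - \mathcal{F}(x) - \langle x^*,\, y - x\rangle
   \;=\; \Delta_{\mathcal{F}}^{x^*}(y, x).
\end{aligned}
```

Hence an upper bound on the Bregman divergence of F* immediately yields a lower bound on the Bregman divergence of F, with the roles of points and subgradients interchanged.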
One reason why our proof is simpler than that of [5] is that there the argument runs the other way round: the authors first proved lower bounds, with quite some effort, and then used the convex conjugate to obtain upper bounds.
We will focus mainly on asymptotic bounds for Δ_F^{x*}(y, x) as ‖x − y‖ → 0. This is the more interesting case for applications, as for example in convergence analysis one will be interested in the Bregman divergence of x_n and x, where x_n → x. Also theoretically it is the more challenging case, since for ‖x − y‖ → ∞ the Bregman divergence Δ_F^{x*}(y, x) will mostly depend on the behavior of F(y) as ‖y‖ tends to infinity, and it should be easy to find lower and upper bounds. In particular we will show at the end of the paper how one can deduce uniform bounds for all x, y ∈ X from the asymptotic bounds for the case F = ‖·‖^p.
The paper consists of four sections. In Sect. 2 we recall some basic notions of convex analysis. In Sect. 3 we define moduli of smoothness and convexity corresponding to a general functional F and develop some of their properties. Finally, in Sect. 4 we apply the theory from Sect. 3 to the functional F = (1/p)‖·‖^p for p > 1 and find lower and upper bounds for the corresponding Bregman divergence, given by the smoothness, respectively the convexity, of the space X, as shown in [5].

Tools from convex analysis
In this work X will always be a real Banach space with dim X ≥ 2, X* denotes its dual space, S_X = {x ∈ X : ‖x‖ = 1} the unit sphere, and F : X → R̄ := R ∪ {∞} some function. We will need some basic concepts from convex analysis, so we shortly recall them in this chapter.
An element x* ∈ X* is called a subgradient of a convex function F at x ∈ X if

F(y) ≥ F(x) + ⟨x*, y − x⟩ for all y ∈ X.

The set of all subgradients of F at x is called the subdifferential of F at x and denoted by ∂F(x). The convex conjugate F* : X* → R̄ of F is defined by

F*(x*) := sup_{x ∈ X} (⟨x*, x⟩ − F(x)).

From these two definitions one can directly conclude the following generalized Young (in)equality: for all x ∈ X, x* ∈ X* we have

F(x) + F*(x*) ≥ ⟨x*, x⟩.

Equality holds true if and only if x* ∈ ∂F(x). Further we have F** ≤ F on X, where equality holds if and only if F is convex and lower-semicontinuous. Finally, we define the object of interest of this work: for F(x) < ∞ and x* ∈ ∂F(x) the Bregman divergence Δ_F^{x*}(y, x) is given by

Δ_F^{x*}(y, x) := F(y) − F(x) − ⟨x*, y − x⟩

for all y ∈ X. We will be especially interested in functionals F(x) = (1/p)‖x‖^p for some p ≥ 1 and need to understand their subdifferentials, so finally we have the following: for some p ≥ 1 the set-valued mapping J_p : X → 2^{X*} given by

J_p(x) := {x* ∈ X* : ⟨x*, x⟩ = ‖x‖ ‖x*‖, ‖x*‖ = ‖x‖^{p−1}}

is called the duality mapping with respect to p of X. The sets J_p(x) are always non-empty. A mapping j_p : X → X* with j_p(x) ∈ J_p(x) for all x ∈ X is called a selection of J_p.
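For the norm powers used below the conjugate can be computed explicitly; this is a standard calculation (the supremum over X reduces to a supremum over t = ‖x‖ ≥ 0). For F = (1/p)‖·‖^p with p > 1 and 1/p + 1/q = 1 one finds

```latex
\mathcal{F}^*(x^*)
  = \sup_{x \in \mathcal{X}} \Bigl( \langle x^*, x\rangle - \tfrac{1}{p}\|x\|^p \Bigr)
  = \sup_{t \ge 0} \Bigl( t\,\|x^*\|_{\mathcal{X}^*} - \tfrac{1}{p}\,t^p \Bigr)
  = \tfrac{1}{q}\,\|x^*\|_{\mathcal{X}^*}^{q} .
```

Moreover, by a classical result due to Asplund, ∂((1/p)‖·‖^p)(x) = J_p(x), so the subgradients appearing in the Bregman divergence for norm powers are exactly the selections j_p.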

Moduli of smoothness and convexity
Finding upper bounds for (1) is related to the smoothness of the norm of X, whereas lower bounds are related to convexity. Thus it is necessary to understand the moduli of smoothness and convexity of the space X, and we shortly recall their definitions (see e.g. [9]):

ρ_X(τ) := sup{ (‖x + τy‖ + ‖x − τy‖)/2 − 1 : x, y ∈ S_X }, τ > 0,

δ_X(ε) := inf{ 1 − ‖(x + y)/2‖ : x, y ∈ S_X, ‖x − y‖ ≥ ε }, ε ∈ (0, 2].

These two moduli have a well-developed theory, which has been known in the literature for a long time, and we will not discuss all their properties. However, for our proofs we will need the following properties.

2. For all τ > 0 there exists a constant C_τ such that for all Banach spaces X the stated bound holds.

For our purposes it will be more natural to introduce new definitions of the moduli of smoothness and convexity, related to functionals instead of spaces.
The quantities ρ^ξ_{F,x}, δ^ξ_{F,x} give us a reformulation of our basic problem: we want to find upper bounds for ρ^ξ_{F,x}(τ) and lower bounds for δ^ξ_{F,x}(τ). Before we show some properties of these functions we should state some simple facts for their interpretation.
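The displayed formulas of Definition 3.3 are lost in this copy of the text; consistent with Remark 3.4 (the absolute value and the linearization error interpretation) and with the way the moduli are used in the proofs below, they presumably take the following form:

```latex
\rho^{\xi}_{\mathcal{F},x}(\tau)
  := \sup_{\|y - x\| = \tau}
     \bigl| \mathcal{F}(y) - \mathcal{F}(x) - \langle \xi,\, y - x\rangle \bigr| ,
\qquad
\delta^{\xi}_{\mathcal{F},x}(\tau)
  := \inf_{\|y - x\| = \tau}
     \bigl( \mathcal{F}(y) - \mathcal{F}(x) - \langle \xi,\, y - x\rangle \bigr) .
```

For convex F with ξ ∈ ∂F(x) both quantities are nonnegative and bound the Bregman divergence Δ_F^ξ(y, x) from above, respectively below, for ‖y − x‖ = τ.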
Remark 3.4 We will mostly consider convex functions F with ξ ∈ ∂F (x) so that the linearization error functional is a Bregman divergence and one can neglect the absolute value.
F is Fréchet-differentiable at x if and only if there exists ξ ∈ X* such that ρ^ξ_{F,x}(τ)/τ → 0 as τ → 0. F being s-smooth at x, with s ∈ (1, 2], can then be seen as a stronger form of differentiability, comparable to fractional derivatives; however, F being 2-smooth is not equivalent to twice-differentiability but rather to the notion of strong smoothness. If there exists a selection j : X → X* of the subdifferential of F, i.e. for every x there exists j(x) ∈ ∂F(x), then this already implies that F is convex. δ^ξ_{F,x}(τ) > 0 for all x, τ implies strict convexity. As before, r-convexity is an even stronger notion of convexity, and 2-convexity is connected to strong convexity. In [10] the modulus of local (or total) convexity of F, ν_F(x, τ), was introduced; it is basically given by δ^ξ_{F,x}(τ), just that ⟨ξ, y − x⟩ is replaced by the right-hand side derivative of F at x in direction y − x. If F is convex and Gâteaux-differentiable, then ν_F(x, τ) coincides with δ^ξ_{F,x}(τ), where ξ = F′(x). The modulus of total convexity has been studied in several papers, see e.g. [11]. There exist further definitions of moduli of convexity and smoothness related to functions (e.g. [4, 12]), but giving a complete overview of all such definitions goes beyond the scope of this work.
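For orientation, a case where everything can be computed exactly (assuming the moduli of Definition 3.3 are the linearization errors over the sphere ‖y − x‖ = τ): in a Hilbert space with F = ½‖·‖² and ξ = x we have F(y) − F(x) − ⟨ξ, y − x⟩ = ½‖y − x‖², hence

```latex
\rho^{\xi}_{\mathcal{F},x}(\tau) \;=\; \delta^{\xi}_{\mathcal{F},x}(\tau) \;=\; \tfrac12\,\tau^2 ,
```

so this F is 2-smooth and 2-convex at every point, in line with the identification of 2-smoothness with strong smoothness and of 2-convexity with strong convexity.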
It turns out that for functionals F that originate from the norm of X the moduli of the space and of the functional are closely related. Proposition 3.5 Let F = ‖·‖_X and for all x ∈ X let ξ_x ∈ ∂F(x) be arbitrary. Then for all x ∈ S_X we have ρ^{ξ_x}_{F,x}(τ) ≤ 2ρ_X(τ). Proof If we replace y by −y in the definition of ρ^{ξ_x}_{F,x} we see that the supremum may be taken over ±y, and for all x, y ∈ S_X we have, by the definition of the subdifferential and as F(x) = ‖x‖_X = 1, the required estimate. So this already gives us an upper bound for ρ^ξ_{‖·‖_X,x}(τ) if x ∈ S_X, ξ ∈ ∂F(x). To generalize this to all x ∈ X we use the following.

Proposition 3.6
If the functional F is positively q-homogeneous then for all x ∈ X \ {0}, ξ ∈ X* we have the scaling relation (2), so that the first claim follows from Definition 3.3. The second claim follows from multiplying (2) either by ‖x‖^q or by ‖x‖^{−q}.
For convex functions F one can show that both moduli are nondecreasing.
Proposition 3.7 Let F be convex, x ∈ X and ξ ∈ ∂F(x). Then for λ ≥ 1 one has ρ^ξ_{F,x}(λτ) ≥ λ ρ^ξ_{F,x}(τ) and δ^ξ_{F,x}(λτ) ≥ λ δ^ξ_{F,x}(τ). In particular δ^ξ_{F,x}, ρ^ξ_{F,x} are nondecreasing, as are the quotients δ^ξ_{F,x}(τ)/τ and ρ^ξ_{F,x}(τ)/τ.
We also have a chain rule.
Proof Let s = F(x) and define the functions R and r as the corresponding linearization errors. Then for τ > 0 and y ∈ S_X we obtain the stated decomposition. Now the claim follows from R(τy) ≤ ρ^ξ_{F,x}(τ) and r(h) ≤ ρ^t_{f,F(x)}(|h|), together with the assumption that ρ^t_{f,F(x)} is a nondecreasing function.
Propositions 3.5, 3.7 and 3.8 are already sufficient to find upper bounds on ρ^ξ_{F,x} for F = f(‖x‖_X) if f is convex and if we sufficiently understand the smoothness of f and of the space X. Regarding lower bounds, the following proposition will be our key instrument.

Proposition 3.9
Let F be convex and let x be such that there exists ξ ∈ ∂F(x). We have:

Further we see that F is p-convex in x w.r.t. ξ if and only if
Proof By Young's equality (3) we then have the first identity. The second statement follows from (6), which gives the corresponding estimate, together with the fact that by Proposition 3.7 the quotient ρ^ξ_{F,x}(τ)/τ is nondecreasing, so that for τ′ > τ the bound persists. Thus one can just insert the corresponding lower or upper bound and calculate the maximum, which completes the proof.

Application to norm powers
In this section we will consider F = (1/p)‖·‖^p for some p > 1 and use the theory from the last chapter to reproduce the main results from [5]. Note that in the light of Proposition 3.6 it is sufficient to understand δ_{F,x} and ρ_{F,x} for x ∈ S_X.
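As a concrete family of examples (standard facts about the moduli of L^p spaces, see e.g. [9]; the constants are not optimized here): for X = L^p with 1 < p < ∞ one has, up to constants,

```latex
\rho_X(\tau) = O\!\bigl(\tau^{\min(p,2)}\bigr) \quad (\tau \to 0),
\qquad
\delta_X(\varepsilon) \ge c\,\varepsilon^{\max(p,2)} \quad (0 < \varepsilon \le 2),
```

so L^p is min(p, 2)-smooth and max(p, 2)-convex, and the results of this section apply with s = min(p, 2) and r = max(p, 2).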

3. If we have, for all x ∈ S_X and τ ≤ τ̄, the stated bound, where C_{τ̄,p} is the constant from item 1.
4. If there exists τ̄ > 0 such that we have δ_{F,x}(τ) ≥ φ(τ) for all x ∈ S_X and τ ≤ τ̄, where φ : R_+ → R_+ is nondecreasing and φ(τ) > 0 for τ > 0, then X is uniformly convex.
We have by Taylor's theorem the stated expansion, where the inequality holds with a constant depending on p and τ̄, as ρ^1_{f,1} is always finite and so is the remainder r. We have j_p(x) ∈ ∂‖·‖(x) for x ∈ S_X, so by Proposition 3.5 we have ρ_{‖·‖,x}(τ) ≤ 2ρ_X(τ), and one can easily see that ρ_X(τ) ≤ τ. So we obtain the claimed bound, where the second inequality follows from item 2 of Lemma 3.2.
Then by item 1 of Lemma 3.2 we have τ̄ r ≤ C_{τ̄,p} ρ_{X*}(r) for r ≥ τ̄ and thus find the corresponding estimate. So we conclude by items 3 and 4 of Lemma 3.2. Finally, note that C_{τ̄,p} ≥ ((p − 1)/2) C_τ̄^{−1}, with C_τ̄ from item 2 of Lemma 3.2, and thus C_{τ̄,p} ρ_{X*}(τ)/τ ≥ ((p − 1)/2) τ. Claim 4: By assumption we have δ_{F,x}(τ) ≥ φ(τ) for τ ≤ τ̄, and by Proposition 3.7 we have, for τ > τ̄, δ_{F,x}(τ)/τ ≥ δ_{F,x}(τ̄)/τ̄ and thus δ_{F,x}(τ) ≥ φ(τ̄)τ/τ̄. So by Proposition 3.9 we have, for all τ, the stated lower bound, as φ is nondecreasing. So by part 2 of the theorem we see that X* is uniformly smooth, from which it follows that X is uniformly convex [9, Prop. 1.e.2].
Remark 4.2 One can see from the above proof that in the asymptotic case τ → 0 one can choose the constant C_{τ,p} accordingly. These constants are not sharp for every space X, but at least in the asymptotic case the constants are much simpler than the ones given in [5]. For the best known constants with respect to L^p spaces we refer to [13] and [14].
The above theorem combined with Proposition 3.6 gives us upper and lower bounds on the Bregman divergence for ‖x − y‖ ≤ τ‖x‖. However, as for large ‖x − y‖ the Bregman divergence will be dominated by the term (1/p)‖y‖^p, it is not difficult to also find bounds that hold for all x, y ∈ X. Further, one can additionally conclude bounds for the symmetric Bregman divergence,

Δ_F^{sym}(x, y) := Δ_F^{j_p(x)}(y, x) + Δ_F^{j_p(y)}(x, y) = ⟨j_p(x) − j_p(y), x − y⟩,

from our theorem. These two claims are shown in the following two propositions.

Proposition 4.3
For some fixed p > 1 let F = (1/p)‖·‖^p and let φ : R_+ → R_+ be nondecreasing. Let V = (X \ {0}) × X and define the statements (a), (b) and (c):

Proof We only show that (a) implies (b), as (b) ⇒ (c) follows trivially. Without loss of generality let c ≤ 1. First of all assume ‖x − y‖/‖x‖ > c. Then one can see that, regardless of whether we have ‖x‖/‖y‖ > 1/2 or ‖x‖/‖y‖ ≤ 1/2, one always obtains the required bound. Now consider the case ‖x − y‖/‖x‖ ≤ c ≤ 1. In this range we can conclude (b) from (a), as ‖y‖ ≤ 2‖x‖, so that the claim follows.

Proposition 4.4
For some fixed p > 1 let F = (1/p)‖·‖^p, and let φ : R_+ → R_+ be nondecreasing with φ(τ) > 0 for τ > 0. Let V = (X \ {0}) × X and define the statements (d), (e) and (f):

Proof The proof is very similar to the previous proof, so we just sketch it. We look at three different cases. By Proposition 3.7 we know that δ^{j_p(x)}_{F,x} is nondecreasing, so (d) also gives, for ‖x − y‖/‖x‖ ≥ c,

Δ_F^{j_p(x)}(y, x) ≥ C‖x‖^p φ(c),

and thus the corresponding bound for sufficiently large N > 3, where the last step follows from the fact that ‖x − y‖/‖x‖ ≥ N implies ‖y‖ ≥ (N − 1)‖x‖, which implies ‖x − y‖/‖y‖ ≥ 1 − (N − 1)^{−1} ≥ 1/2, and thus it suffices to see that the Bregman divergence will be dominated by ‖y‖^p for sufficiently large N. To conclude (e) one then basically has to redefine the constants. Statement (f) follows trivially.
To conclude this chapter we combine the results and summarize the most important inequalities.

Corollary 4.5
Let X be a Banach space and F(x) = (1/p)‖x‖^p for p > 1. Then there exist constants C_1, C_2 > 0 such that for all x, y ∈ X we have the stated two-sided bounds on the Bregman divergence Δ_F^{j_p(x)}(y, x). If the space X is s-smooth, then there exists C > 0 and for all τ > 0 also C_τ > 0 such that

Δ_F^{j_p(x)}(y, x) ≤ C‖x − y‖^s, if p = s,
Δ_F^{j_p(x)}(y, x) ≤ C_τ ‖x‖^{p−s}‖x − y‖^s, for ‖x − y‖/‖x‖ ≤ τ, if p ≠ s.    (9)
If the space X is r-convex, then there exists C̃ > 0 and for all τ > 0 also C̃_τ > 0 such that

Δ_F^{j_p(x)}(y, x) ≥ C̃‖x − y‖^r, if p = r,
Δ_F^{j_p(x)}(y, x) ≥ C̃_τ ‖x‖^{p−r}‖x − y‖^r, for ‖x − y‖/‖x‖ ≤ τ, if p ≠ r.    (10)
Proof By Proposition 3.6 it suffices to consider x ∈ S_X. Thus item 1 of Theorem 4.1 and the s-smoothness show the bound (9) for x ∈ X and y such that ‖x − y‖ ≤ τ‖x‖. Similarly, item 3 of Theorem 4.1 and the r-convexity show (10) for x ∈ X and y such that ‖x − y‖ ≤ ((p − 1)/2)τ‖x‖. As this holds true for all τ > 0, one can just substitute τ̄ = 2τ/(p − 1). Apply Proposition 4.3 and Proposition 4.4 to conclude from this the uniform bounds for all x, y ∈ X.
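As a sanity check on the corollary (a direct computation, not part of the original proof): in a Hilbert space one may take p = s = r = 2 and j_2 = Id; then the Bregman divergence equals ½‖x − y‖², and the symmetric Bregman divergence ⟨j_p(x) − j_p(y), x − y⟩ from Proposition 4.4 becomes

```latex
\Delta_{\mathcal{F}}^{\mathrm{sym}}(x, y)
  = \langle j_2(x) - j_2(y),\, x - y\rangle
  = \|x - y\|^2 ,
```

so both the upper bound (9) and the lower bound (10) hold with constants C = C̃ = ½, independently of τ, and the symmetric version is exactly twice the one-sided divergence.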