A continuous-time approach to online optimization

We consider a family of learning strategies for online optimization problems that evolve in continuous time and we show that they lead to no regret. From a more traditional, discrete-time viewpoint, this continuous-time approach allows us to derive the no-regret properties of a large class of discrete-time algorithms including as special cases the exponential weight algorithm, online mirror descent, smooth fictitious play and vanishingly smooth fictitious play. In so doing, we obtain a unified view of many classical regret bounds, and we show that they can be decomposed into a term stemming from continuous-time considerations and a term which measures the disparity between discrete and continuous time. As a result, we obtain a general class of infinite horizon learning strategies that guarantee an $\mathcal{O}(n^{-1/2})$ average regret bound without having to resort to a doubling trick.


Introduction
Online optimization focuses on decision-making in sequentially changing environments (the weather, the stock market, etc.). More precisely, at each stage of a repeated decision process, the agent/decision-maker obtains a payoff (or incurs a loss) based on the environment and his decision, and his long-term objective is to maximize his cumulative payoff via the use of past observations. The worst-case scenario for the agent (and one which has attracted considerable interest in the literature) is when he has no Bayesian-like prior belief on the environment. In this context, the cumulative payoff difference between an oracle-like device (a decision rule which prescribes an action based on knowledge of the future) and a learning strategy (a rule which only relies on past observations) can become arbitrarily large, even in very simple problems.

Acknowledgments. The authors are greatly indebted to Vianney Perchet for his invaluable help in improving all aspects of this work, and to Rida Laraki and Sylvain Sorin for their insightful suggestions and careful reading of the manuscript. The authors would also like to express their gratitude to Pierre Coucheney, Bruno Gaujal and Guillaume Vigeral for many helpful discussions and remarks. Part of this work was done during the authors' visit at the Hausdorff Research Institute for Mathematics at the University of Bonn in the framework of the Trimester Program "Stochastic Dynamics in Economics and Finance". This work was supported by the European Commission in the framework of the FP7 Network of Excellence in Wireless COMmunications NEWCOM# (contract no. 318306) and the French National Research Agency (ANR) projects GAGA (grant no. ANR-13-JS01-0004-01) and NETLEARN (grant no. ANR-13-INFR-004).
As a result, in the absence of absolute payoff guarantees, the most widely used online optimization criterion is that of regret minimization, a notion which was first introduced by Hannan [15] and has since given rise to a vigorous literature at the interface of optimization, statistics and theoretical computer science; see e.g. Cesa-Bianchi and Lugosi [10], Shalev-Shwartz [28] for a survey. Specifically, the cumulative regret of a strategy compares the payoff obtained by an agent that follows it to the payoff that he would have obtained by constantly choosing one action; accordingly, one of the main goals in online optimization is to devise strategies that lead to (vanishingly) small average regret against any fixed action, and irrespective of how the agent's environment evolves over time.
In this paper, we take a continuous-time approach to online optimization and we consider a class of strategies that lead to no regret in continuous time. From a more traditional, discrete-time viewpoint, the importance of this approach lies in that it provides a unifying view of the regret properties of a broad class of well-known online optimization algorithms. In particular, the discrete-time version of our family of strategies is an extension of the general class of online mirror descent (OMD) algorithms (themselves equivalent to "Following the Regularized Leader" (FtRL) in the case of linear payoffs; see e.g. Shalev-Shwartz [28], Bubeck [7], Hazan [16]) with a time-varying parameter. As such, our analysis contains as special cases a) the exponential weight (EW) algorithm (Littlestone and Warmuth [19], Vovk [30]) and its decreasing parameter variant (Auer et al. [1]); b) smooth fictitious play (SFP) (Fudenberg and Levine [13], Benaïm et al. [4]) and vanishingly smooth fictitious play (VSFP) (Benaïm and Faure [3]); and c) the method of online gradient descent (OGD) introduced by Zinkevich [32] (the Euclidean predecessor of OMD).
With regards to the OMD/FtRL family of algorithms, the vanishing regret bounds that we derive by using a time-varying parameter are not particularly new: bounds of the same order can be obtained by taking existing guarantees for learning with a finite horizon and then using the so-called "doubling trick" (Cesa-Bianchi et al. [9], Vovk [31]). That said, the introduction of a time-varying parameter has several advantages: a) it allows us to integrate SFP and VSFP into the fold and to derive explicit bounds for their regret; b) it provides a unified any-time analysis without needing to reboot the algorithm every so often (to the best of our knowledge, such an analysis only exists for the EW algorithm with a time-varying parameter (Bubeck [7], Auer et al. [1])); and c) in the case of ordinary convex optimization problems with an open-ended termination criterion (as opposed to a fixed number of steps), a variable parameter leads to more efficient value convergence bounds than a variable step-size.
Building on an idea that was introduced by Sorin [29] in the study of the exponential weight algorithm, the key ingredient of our analysis is the descent from continuous to discrete time. More precisely, given an online optimization problem in discrete time, we construct a continuous-time interpolation where our continuoustime dynamics lead to no regret; then, by comparing the agent's payoffs in discrete and continuous time, we are able to deduce a bound for the agent's regret in the original discrete-time framework.
A first contribution of this approach is that it leads to a unified derivation of several existing regret bounds with disparate proofs; a second is that it allows us to decompose many classical bounds into two components: a term coming from continuous-time considerations and a comparison term which measures the disparity between discrete and continuous time (see also Mannor and Perchet [20] for an alternative interpretation of such a decomposition). Each of these terms can be made arbitrarily small by itself, but their sum is coupled in a nontrivial way that induces a trade-off between continuous- and discrete-time considerations: in a sense, faster decay rates in continuous time lead to greater discrepancies in the discrete/continuous comparison, and hence to slower regret decay bounds in discrete time.
Finally, we also give a brief account of how the derived regret bounds are related to classical convergence results for certain convex optimization and stochastic convex optimization algorithms, including the projected subgradient (PSG) method, mirror descent (MD), and their stochastic variants (Nemirovski and Yudin [23], Nemirovski et al. [22]), and we illustrate a (somewhat surprising) performance gap incurred by using an optimization algorithm with a decreasing parameter instead of a decreasing step-size.
1.1. Paper Outline. In Section 2, we present some basics of online optimization to fix notation and terminology; then, in Section 3, we define regularizer functions, choice maps and the class of variable-parameter OMD/FtRL strategies that we will focus on. The core of our paper consists of Sections 4 and 5: we first show that the corresponding class of continuous-time strategies leads to no regret in Section 4; this analysis is then translated to discrete time in Section 5, where we derive the no-regret properties of the class of algorithms under consideration. Finally, in Section 6, we establish several links with existing online learning and convex optimization algorithms, and we show how their properties can be derived as corollaries of our results.

Notation and Preliminaries.
Let d be a positive integer and let V = R^d be equipped with an arbitrary norm ‖·‖. The dual of V will be denoted by V* and the induced dual norm on V* will be given by the familiar expression

$\|y\|_* = \max\{\langle y|x\rangle : x \in V, \ \|x\| \le 1\}$,

where ⟨y|x⟩ denotes the canonical pairing between y ∈ V* and x ∈ V. For a nonempty subset U ⊂ V, we will use the notation ‖U‖ = sup_{x∈U} ‖x‖.
In the rest of our paper, C will denote a nonempty compact convex subset of V; moreover, given a convex function f : V → R ∪ {+∞}, its effective domain will be the convex set dom f = {x ∈ V : f(x) < ∞}. For convenience, if f : C → R is convex, we will treat f as a convex function on V by setting f(x) = +∞ for x ∈ V \ C; conversely, if f : V → R ∪ {+∞} has domain dom f = C, we will also treat f as a real-valued function on C (in all cases, the ambient space V will be clear from the context). We will then say that v ∈ V* is a subgradient of f at x ∈ dom f if f(x′) ≥ f(x) + ⟨v|x′ − x⟩ for all x′ ∈ C; likewise, the set ∂f(x) = {v ∈ V* : v is a subgradient of f at x} will be called the subdifferential of f at x, and f will be called subdifferentiable if ∂f(x) is nonempty for all x ∈ dom f.
If it exists, the minimum (resp. maximum) of a function f : V → R ∪ {+∞} will be denoted by f_min (resp. f_max). Moreover, if A = {a_1, …, a_d} is a finite set, the set ∆(A) of probability measures on A will be identified with the standard simplex ∆_d = {x ∈ R^d : x_i ≥ 0 and Σ_{i=1}^d x_i = 1} of R^d; also, the elements of A will be identified with the corresponding vertices of ∆(A), i.e. the canonical basis vectors e_1, …, e_d of R^d. Finally, for x, y ∈ R, we will let ⌊x⌋ = max{k ∈ Z : k ≤ x} and ⌈x⌉ = min{k ∈ Z : k ≥ x}, and we will write x ∨ y = max{x, y} and x ∧ y = min{x, y}.

The Model
The heart of the online optimization model that we consider is as follows: at every discrete time instance n ≥ 1, an agent (decision-maker) chooses an action from a nonempty convex action set C ⊂ V and gains a payoff (or incurs a loss) determined by some time-dependent function. Information about this function is only revealed to the agent after he picks his action, and the agent's objective is to maximize his long-term payoff in an adaptive manner.
2.1. The Core Model. Let C ⊂ V denote the agent's action space. Then, at each stage n ≥ 1, the process of play is as follows: 1. The agent chooses an action x_n ∈ C. 2. Nature chooses and reveals the payoff vector u_n ∈ V* of the n-th stage and the agent receives a payoff of ⟨u_n|x_n⟩. 3. The agent uses some decision rule to pick a new action x_{n+1} ∈ C and the process is repeated ad infinitum.
More precisely, define a strategy to be a sequence of maps σ_n : (V*)^{n−1} → C, n ≥ 1, such that σ_{n+1} determines the player's action at stage n + 1 in terms of the payoff vectors u_1, …, u_n ∈ V* that have been revealed up to stage n (in a slight abuse of notation, σ_1 will be regarded as an element of C). Then, given a sequence of payoff vectors u = (u_n)_{n≥1} in V*, the sequence of actions generated by σ will be

$x_n = \sigma_n(u_1, \dots, u_{n-1})$, (2.1)

and the agent's cumulative regret with respect to x ∈ C is defined as:

$\operatorname{Reg}_n(x) = \sum_{k=1}^{n} \langle u_k | x\rangle - \sum_{k=1}^{n} \langle u_k | \sigma_k(u_1, \dots, u_{k-1})\rangle$. (2.2)
In what follows, we focus on strategies that lead to no (or, at worst, small) regret (note that nature may be adversarial, i.e. u_n may be chosen as a function of x_1, …, x_n):
Definition 2.1. A strategy σ leads to ε-regret (ε ≥ 0) if, for every sequence of payoff vectors (u_n)_{n≥1} in V* such that ‖u_n‖_* ≤ 1:

$\limsup_{n\to\infty} \frac{1}{n} \max_{x\in C} \operatorname{Reg}_n(x) \le \varepsilon$. (2.3)

In particular, if (2.3) holds with ε = 0, we will say that σ leads to no regret.
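To make the bookkeeping behind (2.2) concrete, here is a minimal Python sketch (not from the paper; the helper names are ours) that computes the cumulative regret of a played sequence against a fixed comparator action, the linear payoff ⟨u|x⟩ being a plain dot product:

```python
def cumulative_regret(payoff_vectors, actions, x):
    """Cumulative regret (2.2): payoff of the fixed action x minus the payoff
    actually obtained along the played sequence of actions."""
    pay = lambda u, z: sum(ui * zi for ui, zi in zip(u, z))
    return sum(pay(u, x) - pay(u, xk) for u, xk in zip(payoff_vectors, actions))

# Three stages on the simplex of R^2: the agent stubbornly plays the vertex e_2.
us = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
played = [[0.0, 1.0]] * 3
print(cumulative_regret(us, played, [1.0, 0.0]))  # regret against e_1: 1.0
```

Averaging this quantity over n and taking the worst comparator x gives the criterion (2.3).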
Remark 1. The definition of an ε-regret strategy depends on the dual norm · * of V * (and hence, on the original norm · on V ); on the other hand, the definition of "no regret" is independent of the norm.
Remark 2. In our framework, we can easily see that a strategy leading to ε-regret against "any sequence" is equivalent to leading to ε-regret against "any strategy of nature". However, this may not be true in the randomized setting we present in the following paragraph.
Despite its simplicity, this online linear optimization model may be used to analyze more general online optimization models. In what follows, we summarize some examples of this kind.

2.2. The Case of the Simplex and Mixed Actions. Consider a discrete decision process where, at each stage n ≥ 1, the agent chooses an action a_n from a finite set of pure actions A = {1, …, d}. To do so, the agent draws a_n according to some probability distribution x_n ∈ ∆(A); then, once a_n is drawn, the payoff vector u_n ∈ [−1, 1]^d which prescribes the payoff u_{n,a} of each action a ∈ A is revealed, and the agent receives the payoff u_{n,a_n} that corresponds to his choice of action.
In this setting, a strategy is still defined as in the core model of Section 2.1, with the agent's action set replaced by the set of mixed actions ∆(A). The agent's realized regret with respect to a pure action a ∈ A will then be

$\sum_{k=1}^{n} (u_{k,a} - u_{k,a_k})$, (2.4)

and we will say that a strategy σ leads to ε-realized-regret (resp. to no realized regret for ε = 0) if

$\limsup_{n\to\infty} \frac{1}{n} \max_{a\in A} \sum_{k=1}^{n} (u_{k,a} - u_{k,a_k}) \le \varepsilon$ (a.s.), (2.5)

for every sequence of payoff vectors (u_n)_{n≥1} in R^d such that ‖u_n‖_∞ ≤ 1. On the other hand, the agent's expected payoff at stage n is E[u_{n,a_n}] = ⟨u_n|x_n⟩; thus, if we interpret ⟨u_n|x_n⟩ as the payoff of the mixed action x_n ∈ ∆_d, we will have:

$\operatorname{Reg}_n(e_a) = \sum_{k=1}^{n} \big(u_{k,a} - \langle u_k | x_k\rangle\big)$, (2.6)

where the basis vector e_a ∈ ∆(A) is identified here with the Dirac point mass δ_a on a ∈ A. By a classical argument based on Hoeffding's inequality and the Borel–Cantelli lemma, the minimization of (2.4) is then reduced to the core model of Section 2.1: Proposition 2.2 (Cesa-Bianchi and Lugosi [10], Corollary 4.3). If a strategy σ leads to ε-regret with respect to the uniform norm on V*, it also leads to ε-realized-regret.
2.3. Online Convex Optimization. We briefly discuss here a more general online convex optimization model where losses are determined by a sequence of convex functions. Formally, the only change from Section 2.1 is that at each stage n ≥ 1, the agent incurs a loss ℓ_n(x_n) determined by a subdifferentiable convex loss function ℓ_n : C → R. In this nonlinear setting, the information revealed to the agent after playing includes a (negative) subgradient u_n ∈ −∂ℓ_n(x_n) ⊂ V* of ℓ_n at x_n, so the incurred cumulative regret with respect to a fixed action x ∈ C is:

$\sum_{k=1}^{n} \big(\ell_k(x_k) - \ell_k(x)\big)$. (2.7)

Since each ℓ_n is convex and u_n ∈ −∂ℓ_n(x_n), we have ℓ_n(x) ≥ ℓ_n(x_n) − ⟨u_n|x − x_n⟩ for all x ∈ C; in this way, (2.7) readily yields:

$\sum_{k=1}^{n} \big(\ell_k(x_k) - \ell_k(x)\big) \le \sum_{k=1}^{n} \langle u_k | x - x_k\rangle$. (2.8)

This last expression can obviously be interpreted as the regret incurred by an agent facing a sequence of payoff vectors u_n ∈ V* (cf. the core model of Section 2.1), so a strategy which guarantees a bound on the right-hand side of (2.8) will guarantee the same for (2.7). Consequently, when the loss functions ℓ_n are uniformly Lipschitz continuous, results for the core model can be directly translated into this one.
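The linearization step behind (2.8) can be sanity-checked numerically. The sketch below (our own illustration, with the scalar loss ℓ(x) = |x − a| chosen arbitrarily) verifies the subgradient inequality ℓ(x_played) − ℓ(x_fixed) ≤ ⟨u | x_fixed − x_played⟩ with u = −∂ℓ(x_played):

```python
def subgradient_abs(x, a):
    """A subgradient of ℓ(x) = |x - a| at x (any value in [-1, 1] is valid at x = a)."""
    return 1.0 if x > a else (-1.0 if x < a else 0.0)

def linearized_gap_bound(x_played, x_fixed, a):
    """Check ℓ(x_played) - ℓ(x_fixed) <= u * (x_fixed - x_played)
    with u = -subgradient of ℓ at x_played, as in (2.8)."""
    u = -subgradient_abs(x_played, a)
    lhs = abs(x_played - a) - abs(x_fixed - a)
    rhs = u * (x_fixed - x_played)
    return lhs <= rhs + 1e-12

print(all(linearized_gap_bound(xp, xf, 0.3)
          for xp in [-1.0, 0.0, 0.3, 1.0] for xf in [-1.0, 0.5, 2.0]))  # True
```

The inequality holds pointwise for every pair of actions, which is exactly why a bound on the linear regret (2.8) transfers to the convex regret (2.7).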

3.1. Regularizer Functions and Choice Maps.
We begin with the concept of a regularizer function: Definition 3.1. A convex function h : V → R ∪ {+∞} will be called a regularizer function on C if dom h = C and h| C is strictly convex and continuous.
Remark 3. This definition is intimately related to the notion of a Legendre-type function (see e.g. Rockafellar [25,Section 26]); however, as was recently noted by Shalev-Shwartz [27] (and in contrast to the analysis of e.g. Benaïm and Faure [3], Bubeck [7] and Benaïm et al. [4]), we will not require any differentiability or steepness assumptions.
A key tool in our analysis will be the convex conjugate h* : V* → R of h, defined as

$h^*(y) = \max_{x\in C} \{\langle y|x\rangle - h(x)\}$. (3.1)

Since h is equal to +∞ on V \ C and h|_C is continuous and strictly convex, the supremum in (3.1) is attained at a unique point of C. This unique maximizer then defines our choice map as follows:

Definition 3.2. The choice map associated to a regularizer function h on C will be the map Q_h : V* → C defined as

$Q_h(y) = \arg\max_{x\in C} \{\langle y|x\rangle - h(x)\}$. (3.2)

Example 3.3 (Entropy and logit choice). In the case of the simplex (C = ∆_d), a classical example of a choice map is generated by the entropy function

$h(x) = \sum_{i=1}^{d} x_i \log x_i$. (3.3)

A standard calculation then yields the so-called logit choice map:

$Q_h(y) = \frac{\big(\exp(y_1), \dots, \exp(y_d)\big)}{\sum_{i=1}^{d} \exp(y_i)}$. (3.4)

This map is used to define the exponential weight algorithm (cf. Section 6), and its importance stems from the well-known fact that it leads to the optimal regret bound for C = ∆_d (Cesa-Bianchi and Lugosi [10, Theorems 2.2 and 3.7]).
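The logit choice map (3.4) is straightforward to implement; a minimal sketch (our own, with the standard max-shift for numerical stability, which leaves the ratio unchanged):

```python
import math

def logit_choice(y, eta=1.0):
    """Logit choice map Q_h(η y): component i proportional to exp(η y_i)."""
    m = max(eta * yi for yi in y)                    # max-shift: avoids overflow
    w = [math.exp(eta * yi - m) for yi in y]
    s = sum(w)
    return [wi / s for wi in w]

x = logit_choice([1.0, 2.0, 3.0])
print(abs(sum(x) - 1.0) < 1e-12, x[2] > x[1] > x[0])  # lies in ∆_3, order-preserving
```

Higher scores receive exponentially larger weight, and scaling η interpolates between uniform play (η → 0) and best response (η → ∞).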
Example 3.4 (Euclidean projection). Another important example arises by taking the squared Euclidean distance as a regularizer function; more precisely, we define the Euclidean regularizer on C as

$h(x) = \tfrac{1}{2}\|x\|_2^2$. (3.5)

The associated choice map Q_h : R^d → C corresponds to taking the orthogonal projection onto C:

$Q_h(y) = \arg\max_{x\in C}\big\{\langle y|x\rangle - \tfrac{1}{2}\|x\|_2^2\big\} = \arg\min_{x\in C} \|y - x\|_2$. (3.6)

Example 3.5 (Bregman projections). The Euclidean example above is a special case of a class of projection mappings known as Bregman projections (Bregman [5]). Let F : V → R ∪ {+∞} be a proper convex function, differentiable on its domain D = dom F, and for x, x′ ∈ D define the Bregman divergence

$D_F(x', x) = F(x') - F(x) - \langle \nabla F(x) | x' - x\rangle$. (3.7)

Hence, given a compact set C ⊂ D, the associated Bregman projection of a point x_0 ∈ D onto C is given by

$P_C(x_0) = \arg\min_{x\in C} D_F(x, x_0)$. (3.8)

Now assume that F* is also differentiable on its domain, which we will denote D*. It is easy to check that for y ∈ D*, ∇F*(y) ∈ D and ∇F(∇F*(y)) = y. Then, the process of mapping y ∈ D* to ∇F*(y) and then projecting to C can be written as a choice map in the sense of (3.2):

$P_C(\nabla F^*(y)) = Q_h(y)$,

where h|_C = F|_C and h(x) = +∞ for x ∈ R^d \ C.

[5] In this setting, choice maps are more commonly known as smooth best reply maps (Fudenberg and Levine [12], Hofbauer and Sandholm [17], Benaïm et al. [4], Benaïm and Faure [3]).
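For the special case C = ∆_d, the orthogonal projection (3.6) admits a classical sort-and-threshold computation (this algorithm is standard in the literature, not specific to the paper); a self-contained sketch:

```python
def project_simplex(y):
    """Euclidean projection of y in R^d onto the unit simplex ∆_d:
    subtract a common threshold θ and clip at zero."""
    d = len(y)
    u = sorted(y, reverse=True)
    css, theta = 0.0, 0.0
    for i in range(d):
        css += u[i]
        t = (css - 1.0) / (i + 1)
        if u[i] - t > 0:        # index i is still in the support
            theta = t
    return [max(yi - theta, 0.0) for yi in y]

print(project_simplex([0.5, 0.5, 0.5]))   # -> [1/3, 1/3, 1/3]
print(project_simplex([2.0, 0.0, 0.0]))   # -> [1.0, 0.0, 0.0]
```

The threshold θ is the Lagrange multiplier of the constraint Σ_i x_i = 1; coordinates pushed below zero leave the support.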

3.2. Strategies Generated by Regularizer Functions.
The class of strategies that we will consider in the rest of this paper is a variable-parameter extension of the so-called online mirror descent (OMD) method, itself equivalent to the family of algorithms known as Follow the Regularized Leader (FtRL) in the case of linear payoffs (see e.g. Shalev-Shwartz [28] and Hazan [16]). In a nutshell, this class of strategies may be described as follows: the agent aggregates his payoffs over time into a score vector y ∈ V* and then uses a choice map to turn these scores into actions and continue playing. Formally, if h is a regularizer function on the agent's action space C and (η_n)_{n≥1} is a positive nonincreasing sequence, the strategy σ ≡ (σ_n^{h,η_n})_{n≥1} generated by h with parameter η_n is defined as

$\sigma_{n+1}(u_1, \dots, u_n) = Q_h\Big(\eta_n \sum_{k=1}^{n} u_k\Big)$. (3.9)

The corresponding sequence of play x_{n+1} = σ_{n+1}(u_1, …, u_n) will then be given by the recursion:

$y_n = y_{n-1} + u_n, \qquad x_{n+1} = Q_h(\eta_n y_n)$, (3.10)

with y_0 = 0. In addition to the standard variants of OMD/FtRL, a list of examples of strategies and algorithms that can be expressed in this general form is given in Table 1. A more detailed analysis (including the regret properties of each algorithm) will also be provided in Section 6; we only mention here that the variability of η_n will be key for the no-regret properties of σ: when η_n is constant, the strategy (3.10) does not guarantee a sublinear regret bound (see e.g. Shalev-Shwartz [28] and Bubeck [7]).
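The score-aggregation recursion (3.10) can be sketched generically; the loop below (an illustration of ours, with the logit map of Example 3.3 as the choice map and η_n = n^{-1/2} as an arbitrary nonincreasing parameter) is all the machinery the strategy needs:

```python
import math

def softmax(y):
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]

def ftrl_play(payoffs, choice_map, eta):
    """Variable-parameter FtRL/lazy OMD (3.10): aggregate the scores
    y_n = sum_{k<=n} u_k and play x_{n+1} = Q_h(eta(n) * y_n)."""
    d = len(payoffs[0])
    y = [0.0] * d
    actions = [choice_map([0.0] * d)]               # x_1 = Q_h(0)
    for n, u in enumerate(payoffs, start=1):
        y = [yi + ui for yi, ui in zip(y, u)]       # score update
        actions.append(choice_map([eta(n) * yi for yi in y]))
    return actions

us = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
xs = ftrl_play(us, softmax, lambda n: 1.0 / math.sqrt(n))
print(all(abs(sum(x) - 1.0) < 1e-12 for x in xs))  # every action lies in ∆_2
```

Swapping `softmax` for a Euclidean projection turns the same loop into lazy online gradient descent (cf. Section 6).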

3.3. Regularity of the Choice Map and the Role of Strong Convexity. In this section, we derive some regularity properties of the choice map Q_h that will be needed in the analysis of the subsequent sections. We begin by showing that Q_h is continuous and equal to the gradient of h*:

Proposition 3.6. Let h be a regularizer function on C. Then h* is continuously differentiable on V* and ∇h*(y) = Q_h(y) for all y ∈ V*.
Proof. For y ∈ V*, standard conjugacy results give ∂h*(y) = arg max_{x∈C} {⟨y|x⟩ − h(x)}. However, since the latter set only consists of Q_h(y), h* will be differentiable with ∇h*(y) = Q_h(y) for all y ∈ V*. The continuity of ∇h* then follows from Rockafellar [25, Corollary 25.5.1].
In the discrete-time analysis of Section 5, (3.10) will be shown to guarantee a regret bound of a simple form when Q_h is Lipschitz continuous. This last requirement is equivalent to h being strongly convex:

Definition 3.7. Let K > 0. A proper convex function f : V → R ∪ {+∞} is K-strongly convex w.r.t. ‖·‖ if, for all w_1, w_2 ∈ dom f and for all λ ∈ [0, 1]:

$f(\lambda w_1 + (1-\lambda) w_2) \le \lambda f(w_1) + (1-\lambda) f(w_2) - \tfrac{K}{2}\lambda(1-\lambda)\|w_1 - w_2\|^2$.

Strong convexity of a function was shown in Kakade et al. [18] to be equivalent to strong smoothness of its conjugate. In turn, this equivalence yields the following characterization of Lipschitz continuity:

Proposition 3.8. Let f : V → R ∪ {+∞} be proper and lower semi-continuous. Then, for K > 0, the following are equivalent: (i) f is K-strongly convex with respect to ‖·‖. (ii) f* is (1/K)-strongly smooth with respect to ‖·‖_*. (iii) f* is differentiable and ∇f* is (1/K)-Lipschitz continuous, i.e. ‖∇f*(y′) − ∇f*(y)‖ ≤ (1/K)‖y′ − y‖_* for all y, y′ ∈ V*.

Hence, given that regularizer functions are proper and lower semi-continuous by definition, Proposition 3.8 leads to the following characterization:

Corollary 3.9. A regularizer function h on C is K-strongly convex with respect to ‖·‖ if and only if Q_h is (1/K)-Lipschitz continuous with respect to ‖·‖_*.

This characterization of the Lipschitz continuity of ∇f* (which will be of particular interest to us) is a classical result in the case of the Euclidean norm; see e.g. Rockafellar and Wets [26, Proposition 12.60]. On the other hand, the implication (ii) ⟹ (iii) appears to be new in the case of an arbitrary norm (though the proof technique is fairly standard).
(iii) =⇒ (i). Since f is proper and lower semi-continuous, it will also be closed. Our assertion then follows from e.g. Kakade et al. [18,Theorem 3].
Proposition 3.10. The entropic regularizer (3.3) is 1-strongly convex on ∆_d with respect to ‖·‖_1, and the Euclidean regularizer (3.5) is 1-strongly convex with respect to ‖·‖_2.

Proof. The strong convexity of the Euclidean regularizer is trivial; for the strong convexity of the entropy with respect to ‖·‖_1, see e.g. Beck and Teboulle [2, Proposition 5.1].
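Corollary 3.9 together with Proposition 3.10 predicts that the logit map is 1-Lipschitz from (R^d, ‖·‖_∞) to (∆_d, ‖·‖_1); this is easy to probe numerically (a sanity check of ours, with randomly drawn score vectors):

```python
import math
import random

def softmax(y):
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]

random.seed(0)
ok = True
for _ in range(1000):
    y  = [random.uniform(-5, 5) for _ in range(4)]
    yp = [random.uniform(-5, 5) for _ in range(4)]
    l1   = sum(abs(a - b) for a, b in zip(softmax(y), softmax(yp)))   # ||Q(y)-Q(y')||_1
    linf = max(abs(a - b) for a, b in zip(y, yp))                     # ||y - y'||_inf
    ok = ok and (l1 <= linf + 1e-12)
print(ok)  # True: no random pair violates the 1-Lipschitz bound
```

A random search of course proves nothing, but it is a quick way to catch a wrong constant before relying on it in the regret bounds of Section 5.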

The Continuous-Time Analysis
Motivated by a technique introduced by Sorin [29] in the context of the exponential weight (EW) algorithm, we present in this section a continuous-time version of the class of strategies of Section 3 and we derive a bound for the induced regret in continuous time. This will then enable us to bound the actual discrete-time regret by comparing the continuous-time process of this section with its discrete-time counterpart from Section 3.
In continuous time, instead of a sequence of payoff vectors (u_n)_{n≥1} in V*, the agent will be facing a measurable and locally integrable stream of payoff vectors (u_t)_{t∈R_+} in V*. Hence, extending (3.10) to continuous time, we will consider the process:

$x_t^c = Q_h\Big(\eta_t \int_0^t u_s\, ds\Big)$, (4.1)

where (η_t)_{t∈R_+} is a positive, nonincreasing and piecewise continuous parameter, while x_t^c ∈ C denotes the agent's action at time t given the history of payoff vectors u_s, 0 ≤ s < t (in the rest of the paper, we will consistently use n and k for discrete indices and s, t, … for continuous ones). Our main result in this section is the following regret bound for (4.1):

Theorem 4.1. If h is a regularizer function on C and (η_t)_{t∈R_+} is a positive, nonincreasing and piecewise continuous parameter, then, for every locally integrable payoff stream (u_t)_{t∈R_+} in V*, we have:

$\int_0^t \langle u_s | x\rangle\, ds - \int_0^t \langle u_s | x_s^c\rangle\, ds \le \frac{h_{\max} - h_{\min}}{\eta_t}$ for all x ∈ C and all t ≥ 0. (4.2)

Proof. Assume first that η_t is of class C¹ and let y_t = η_t ∫_0^t u_s ds. Then, for all x ∈ C and for all t ≥ 0, Fenchel's inequality gives:

$\int_0^t \langle u_s | x\rangle\, ds = \frac{\langle y_t | x\rangle}{\eta_t} \le \frac{h(x) + h^*(y_t)}{\eta_t}$. (4.3)

On the other hand, with x_t^c = Q_h(y_t) = ∇h*(y_t), we will also have by definition:

$\frac{d}{dt}\frac{h^*(y_t)}{\eta_t} = \frac{\langle \dot y_t | x_t^c\rangle}{\eta_t} - \frac{\dot\eta_t}{\eta_t^2}\, h^*(y_t) = \langle u_t | x_t^c\rangle + \frac{\dot\eta_t}{\eta_t^2}\, h(x_t^c) \le \langle u_t | x_t^c\rangle + \frac{\dot\eta_t}{\eta_t^2}\, h_{\min}$, (4.5)
where we used the fact that, by assumption, η̇_t ≤ 0. Integrating (4.5) then yields

$\frac{h^*(y_t)}{\eta_t} \le \frac{h^*(y_0)}{\eta_0} + \int_0^t \langle u_s | x_s^c\rangle\, ds + h_{\min}\Big(\frac{1}{\eta_0} - \frac{1}{\eta_t}\Big) = \int_0^t \langle u_s | x_s^c\rangle\, ds - \frac{h_{\min}}{\eta_t}$,

where we have used the fact that h*(y_0) = h*(0) = −h_min in the second step. Hence, by combining this last equation with (4.3), we finally obtain:

$\int_0^t \langle u_s | x\rangle\, ds \le \frac{h(x) + h^*(y_t)}{\eta_t} \le \frac{h_{\max} - h_{\min}}{\eta_t} + \int_0^t \langle u_s | x_s^c\rangle\, ds$.

The general case where η_t is only piecewise continuous then follows by applying the same argument on each interval of continuity.

Remark 4. We should note here that the quantity δ_h = h_max − h_min in (4.2) can be taken arbitrarily small, so there is no "optimal" regret bound in continuous time. That said, we shall see in the following section that smaller values of δ_h result in greater disparities between continuous and discrete time, thus leading to a trade-off for the regret in discrete time.
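The continuous-time guarantee (4.2) can be visualized by integrating the dynamics (4.1) on a fine grid (our own numerical experiment: the entropic regularizer on ∆_2, a smooth bounded payoff stream, and the arbitrary parameter η_t = (1+t)^{-1/2}):

```python
import math

def softmax(y):
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]

T, dt, d = 5.0, 1e-3, 2
eta = lambda t: (1.0 + t) ** -0.5           # positive and nonincreasing
U = [0.0] * d                               # running integral of the payoff stream
cum_play, cum_fixed = 0.0, [0.0] * d
t = 0.0
while t < T:
    u = [math.sin(t), math.cos(t)]          # payoff stream with ||u_t||_inf <= 1
    x = softmax([eta(t) * Ui for Ui in U])  # x_t^c = Q_h(eta_t * int_0^t u_s ds)
    cum_play += dt * sum(ui * xi for ui, xi in zip(u, x))
    cum_fixed = [c + dt * ui for c, ui in zip(cum_fixed, u)]
    U = [Ui + dt * ui for Ui, ui in zip(U, u)]
    t += dt

regret = max(cum_fixed) - cum_play
print(regret <= math.log(d) / eta(T) + 0.1)  # (4.2) up to discretization error
```

Here δ_h = log d for the entropy on ∆_d, so (4.2) predicts a regret of at most (log 2)·√6 ≈ 1.70 at t = 5; the 0.1 slack absorbs the Riemann-sum error of the grid.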

Regret Minimization in Discrete Time
In this section, our aim will be to provide a bound for the regret incurred by the discrete-time strategy (3.10). To that end, our approach will be as follows: first, given a positive nonincreasing parameter (η_n)_{n≥1} and a sequence of payoff vectors (u_n)_{n≥1}, we construct their continuous-time counterparts by setting

$u_t = u_{\lceil t\rceil}$ (5.1a) and $\eta_t = \eta_{\lfloor t\rfloor \vee 1}$ (5.1b)

for all t ∈ R_+ (i.e. η_t = η_{⌊t⌋} if t ≥ 1 and η_t = η_1 otherwise). Then, given a regularizer h : C → R, we will compare the cumulative payoffs of the processes (x_n)_{n≥1} and (x_t^c)_{t∈R_+} that are generated by (3.10) and (4.1) in discrete and continuous time respectively. In this way, the derived regret bound will consist of two terms: one coming from the continuous-time bound (4.2), and a term coming from the discrete/continuous comparison. Formally:

Theorem 5.1. Let h be a K-strongly convex regularizer on C and let (η_n)_{n≥1} be a positive nonincreasing parameter. Then, for every sequence of payoff vectors (u_n)_{n≥1} in V*, the sequence of play generated by the strategy σ = (σ_n^{h,η_n})_{n≥1} of (3.10) guarantees the bound

$\max_{x\in C} \operatorname{Reg}_n(x) \le \frac{h_{\max} - h_{\min}}{\eta_n} + \frac{1}{K}\sum_{k=1}^{n} \eta_{k-1}\|u_k\|_*^2$, (5.3)

where we have set η_0 = η_1. In particular, if ‖u_n‖_* ≤ M for some M > 0, then:

$\max_{x\in C} \operatorname{Reg}_n(x) \le \frac{h_{\max} - h_{\min}}{\eta_n} + \frac{M^2}{K}\sum_{k=1}^{n} \eta_{k-1}$. (5.4)

Proof. Define the continuous-time interpolations of u_n and η_n as in (5.1) and let y_t = η_t ∫_0^t u_s ds. Then, for the continuous-time process x_t^c = Q_h(y_t) generated by (4.1), we will have:

$x_k = Q_h(y_{k-1})$ for all k ≥ 1,

and hence, for k ≥ 1 and t ∈ (k − 1, k), the payoffs corresponding to x_t^c and x_k will differ by at most

$|\langle u_k | x_t^c - x_k\rangle| \le \|u_k\|_*\, \|Q_h(y_t) - Q_h(y_{k-1})\| \le \frac{1}{K}\|u_k\|_*\, \|y_t - y_{k-1}\|_*$,

where the last inequality follows from the (1/K)-Lipschitz continuity of Q_h (Corollary 3.9). On the other hand, the definition of y_t gives

$y_t = y_{k-1} + \eta_{k-1}(t - k + 1)\, u_k$ for t ∈ (k − 1, k),

which leads to the estimate:

$|\langle u_k | x_t^c - x_k\rangle| \le \frac{\eta_{k-1}}{K}\|u_k\|_*^2$. (5.8)

In view of this discrete/continuous comparison, we thus obtain:

$\operatorname{Reg}_n(x) = \int_0^n \langle u_t | x\rangle\, dt - \int_0^n \langle u_t | x_t^c\rangle\, dt + \sum_{k=1}^{n}\int_{k-1}^{k} \langle u_k | x_t^c - x_k\rangle\, dt \le \frac{h_{\max}-h_{\min}}{\eta_n} + \frac{1}{K}\sum_{k=1}^{n}\eta_{k-1}\|u_k\|_*^2$, (5.9)

where the first inequality follows from Theorem 4.1 and the last one from (5.8); the bounds (5.3) and (5.4) are then immediate.
To get the optimal dependence of the bound (5.4) on n, both terms should scale as √n (otherwise, one would grow faster than the other). In this case, we get a bound for the average regret which vanishes as O(n^{−1/2}):

Corollary 5.2. Let (u_n)_{n≥1} be a sequence of payoff vectors in V* with ‖u_n‖_* ≤ M. Then, with notation as in Theorem 5.1, the sequence of play generated by (3.10) with parameter

$\eta_n = \frac{1}{M}\sqrt{\frac{K\delta_h}{n}}$ (5.10)

guarantees the regret bound:

$\max_{x\in C}\operatorname{Reg}_n(x) \le (3\sqrt{n} + 1)\, M\sqrt{\delta_h/K}$. (5.11)

Indeed, with this choice of parameter we have $\sum_{k=1}^{n}\eta_{k-1} \le \frac{\sqrt{K\delta_h}}{M}\big(1 + 2\sqrt{n}\big)$, so the bound (5.4) becomes:

$\max_{x\in C}\operatorname{Reg}_n(x) \le M\sqrt{\delta_h/K}\,\sqrt{n} + M\sqrt{\delta_h/K}\,\big(1 + 2\sqrt{n}\big) = (3\sqrt{n} + 1)\, M\sqrt{\delta_h/K}$.

Remark 5. We should stress here that regret guarantees of the same order as (5.11) can be obtained for the OMD/FtRL family of algorithms by optimizing the choice of parameter over a finite learning horizon and then restarting the algorithm every so often, using the doubling trick (Cesa-Bianchi et al. [9], Vovk [31]) to guarantee a sublinear regret bound in the long run. The doubling trick may thus be seen as a special case of a nonincreasing parameter; for the general case, the bounds (5.3)/(5.4) describe in a precise way the impact of the variability of η_n on the method's regret guarantees (see also Section 6 for a more detailed discussion).
Remark 6. The dependence of η on δ h , K and M in (5.11) has been chosen precisely so as to minimize the expression δ h /η + M 2 η/K over all η > 0.
Remark 7 (On the dependence on K and the choice of optimal h). The dependence of the bound (5.11) on K is clearly artificial: (5.11) remains invariant if h is rescaled by a positive constant, so it suffices to consider regularizer functions that are 1-strongly convex over C. This then leads to the following question: given a norm · on V and a compact convex subset C ⊂ V , which 1-strongly convex function minimizes h max − h min ? With the exception of the Euclidean norm, this question does not seem to admit a trivial answer (cf. Section 7.1 for a more detailed discussion).
By expressing the cumulative payoff gap between discrete-and continuous-time exactly, Theorem 5.1 can be extended further to regularizer functions that are not strongly convex over C. The only thing that changes in this case is that the comparison term of the bound (5.4) is replaced by a term involving the Bregman divergence associated with the convex conjugate h * of h.
The following result is a variable-parameter extension of Theorem 5.6 in Bubeck and Cesa-Bianchi [6].

Theorem 5.3. Let h be a regularizer function on C. Then, with notation as in Theorem 5.1, the strategy σ = (σ_n^{h,η_n})_{n≥1} of (3.10) guarantees the regret bound:

$\max_{x\in C}\operatorname{Reg}_n(x) \le \frac{h_{\max} - h_{\min}}{\eta_n} + \sum_{k=1}^{n}\frac{1}{\eta_{k-1}}\, D_{h^*}\big(y_k^-, y_{k-1}^+\big)$, (5.13)

where we have set $y_n^+ = \eta_n \sum_{k=1}^{n} u_k$, $y_n^- = \eta_{n-1}\sum_{k=1}^{n} u_k$ and η_0 = η_1.
Proof. With notation as in the proof of Theorem 5.1, the variables y_n^± in the statement of the theorem may be expressed more concisely as:

$y_n^- = \lim_{t\to n^-} y_t$ and $y_n^+ = y_n$, (5.14)

and hence, with η_t right-continuous, we get x_n = Q_h(y_{n−1}) = Q_h(y_{n−1}^+). Accordingly, if x_t^c = Q_h(y_t) denotes the continuous-time process generated by (4.1), then, for all k ≥ 1 and for all t ∈ (k − 1, k), we will have:

$y_t = y_{k-1}^+ + \eta_{k-1}(t - k + 1)\, u_k$, and hence $\frac{d}{dt}h^*(y_t) = \eta_{k-1}\langle u_k | x_t^c\rangle$.

Integrating the latter, we obtain the following comparison over (k − 1, k):

$\int_{k-1}^{k}\langle u_k | x_t^c - x_k\rangle\, dt = \frac{1}{\eta_{k-1}}\Big[h^*(y_k^-) - h^*(y_{k-1}^+) - \langle y_k^- - y_{k-1}^+ \,|\, Q_h(y_{k-1}^+)\rangle\Big] = \frac{1}{\eta_{k-1}}\, D_{h^*}\big(y_k^-, y_{k-1}^+\big)$.

In view of the above, the claim follows by summing this bound over k = 1, …, n and plugging the resulting expression in the first inequality of (5.9), which holds independently of any assumptions on h.
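For the entropic regularizer, h* is the log-sum-exp function and ∇h* = Q_h is the logit map, so the per-stage comparison term above can be checked numerically. The sketch below (our own, with arbitrarily chosen score and payoff vectors) compares a midpoint-rule quadrature of the continuous payoff against the Bregman divergence of h*:

```python
import math

def lse(y):
    """h*(y) = log sum_i exp(y_i): the conjugate of the entropic regularizer."""
    m = max(y)
    return m + math.log(sum(math.exp(v - m) for v in y))

def softmax(y):
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]

def d_conj(yp, y):
    """Bregman divergence D_{h*}(y', y) = h*(y') - h*(y) - <y' - y | grad h*(y)>."""
    g = softmax(y)
    return lse(yp) - lse(y) - sum((a - b) * gi for a, b, gi in zip(yp, y, g))

# Quadrature of int_0^1 <u | Q_h(y + eta*s*u)> ds minus <u | Q_h(y)>:
y, u, eta, N = [0.3, -1.0, 0.7], [1.0, -0.5, 0.2], 0.5, 20000
integral = 0.0
for k in range(N):
    x = softmax([yi + eta * ((k + 0.5) / N) * ui for yi, ui in zip(y, u)])
    integral += sum(ui * xi for ui, xi in zip(u, x)) / N
gap = integral - sum(ui * xi for ui, xi in zip(u, softmax(y)))
bregman = d_conj([yi + eta * ui for yi, ui in zip(y, u)], y) / eta
print(abs(gap - bregman) < 1e-6)  # the per-stage comparison term, verified
```

The two quantities agree up to quadrature error, which is exactly the identity established in the proof with y = y_{k−1}^+ and η = η_{k−1}.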

Links with Existing Results
In this section, we discuss how certain existing results in online optimization and (stochastic) convex programming can be obtained as corollaries of the general analysis of the previous sections.

Links with Known Online Optimization Algorithms.
6.1.1. The Exponential Weight Algorithm. The exponential weight (EW) algorithm was introduced independently by Littlestone and Warmuth [19] and Vovk [30] as a learning strategy in discrete time. Motivated by the approach of Sorin [29] who used a continuous-time variant to retrieve the algorithm's classical regret bounds, we show here how the same bounds can be obtained directly from Theorem 5.1.
The framework of the EW algorithm is that of randomized action selection as in Section 2.2. Specifically, let A = {1, …, d} be a finite set of pure actions, and let the agent's action set be the unit simplex C = ∆_d of R^d, the latter being endowed with the ℓ_1 norm ‖·‖_1. In this context, the EW algorithm is defined as:

$x_{n+1,i} = \frac{\exp\big(\eta\sum_{k=1}^{n} u_{k,i}\big)}{\sum_{j=1}^{d}\exp\big(\eta\sum_{k=1}^{n} u_{k,j}\big)}$, (EW)

where η > 0 is a (fixed) parameter and (u_n)_{n≥1} is a sequence of payoff vectors in [−1, 1]^d (so that ‖u_n‖_∞ ≤ 1 in the induced dual norm). Example 3.3 in Section 3.1 shows that (EW) corresponds to (3.10) with η_n = η and h(x) = Σ_{i=1}^d x_i log x_i. Since h_max − h_min = log d and h is 1-strongly convex with respect to ‖·‖_1 (cf. Proposition 3.10), Theorem 5.1 readily yields the fixed-parameter bound

$\max_{x\in\Delta_d}\operatorname{Reg}_n(x) \le \frac{\log d}{\eta} + \eta n$, (6.2)

while the decreasing parameter η_n = √(log d / n) of Corollary 5.2 yields

$\max_{x\in\Delta_d}\operatorname{Reg}_n(x) \le (3\sqrt{n} + 1)\sqrt{\log d}$,

a bound which, unlike (6.2), has the advantage of holding uniformly in time.
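A minimal runnable sketch of EW with the decreasing parameter η_n = √(log d / n) (our own illustration of the tuning suggested by Corollary 5.2 with δ_h = log d and K = M = 1; not the authors' code):

```python
import math

def ew_decreasing(payoffs):
    """Exponential weights (3.10) with the entropic regularizer and the
    decreasing parameter eta_n = sqrt(log d / n)."""
    d = len(payoffs[0])
    scores = [0.0] * d
    played = []
    for n, u in enumerate(payoffs, start=1):
        eta = math.sqrt(math.log(d) / max(n - 1, 1))   # eta_{n-1}, with eta_0 = eta_1
        m = max(eta * s for s in scores)
        w = [math.exp(eta * s - m) for s in scores]
        tot = sum(w)
        played.append([wi / tot for wi in w])
        scores = [s + ui for s, ui in zip(scores, u)]
    return played

# Alternating payoff vectors in [-1, 1]^2; the bound holds uniformly in n.
n, d = 2000, 2
us = [[1.0, -1.0] if k % 2 else [-1.0, 1.0] for k in range(n)]
xs = ew_decreasing(us)
regret = max(sum(u[i] for u in us) for i in range(d)) \
    - sum(sum(ui * xi for ui, xi in zip(u, x)) for u, x in zip(us, xs))
print(regret <= (3 * math.sqrt(n) + 1) * math.sqrt(math.log(d)))  # True
```

No restart or doubling is needed: the same run satisfies the guarantee at every intermediate horizon.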

6.1.3. Smooth Fictitious Play. The smooth fictitious play (SFP) process was introduced by Fudenberg and Levine [11] (see also Fudenberg and Levine [12, 13]), and its regret properties were examined further by Benaïm et al. [4] using the theory of stochastic approximation, but without providing any quantitative bounds for the regret. Just like the EW algorithm, SFP falls within the randomized actions framework of Section 2.2. In particular, SFP corresponds to the sequence of play generated by (3.10) for an arbitrary regularizer on ∆_d and with parameter η_n = η/n for some η > 0; specifically:

$x_{n+1} = Q_h\Big(\frac{\eta}{n}\sum_{k=1}^{n} u_k\Big)$. (SFP)

With regards to the regret induced by (SFP), Benaïm et al. [4, Theorem 6.6] show that for every ε > 0, there exists some η* ≡ η*(ε) such that the strategy (SFP) with parameter η ≥ η* leads to ε-realized-regret. On the other hand, combining Proposition 2.2 with Theorem 5.1 yields the following more precise statement:

Proposition 6.1. Let h be a K-strongly convex regularizer on the unit simplex ∆_d ⊂ R^d endowed with the ℓ_1 norm. Then, for every sequence of payoff vectors (u_n)_{n≥1} in [−1, 1]^d, the strategy (SFP) with parameter η > 0 guarantees

$\max_{x\in\Delta_d}\operatorname{Reg}_n(x) \le \frac{(h_{\max} - h_{\min})\, n}{\eta} + \frac{\eta}{K}\,(2 + \log n)$.

In particular, (SFP) with parameter η leads to (h_max − h_min)/η (realized) regret.
Proof. Simply combine the logarithmic growth estimate Σ_{k=1}^n k^{−1} < 1 + log n for the harmonic series with Theorem 5.1 applied with η_n = η/n; the claim for the realized regret then follows from Proposition 2.2.
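The guarantee of Proposition 6.1 is easy to observe in simulation. The sketch below (our own experiment: logit choice map, random payoffs, and a sampled pure action alongside the mixed play) tracks both the expected regret, which obeys the bound deterministically, and the realized regret of the sampled actions:

```python
import math
import random

def softmax(y):
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]

random.seed(42)
d, n, eta = 3, 4000, 5.0
us = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

scores = [0.0] * d
exp_pay, realized_pay = 0.0, 0.0
for k, u in enumerate(us, start=1):
    m = max(k - 1, 1)                               # eta_{k-1} = eta/(k-1), eta_0 = eta_1
    x = softmax([(eta / m) * s for s in scores])    # (SFP) play
    exp_pay += sum(ui * xi for ui, xi in zip(u, x))
    a = random.choices(range(d), weights=x)[0]      # sampled pure action
    realized_pay += u[a]
    scores = [s + ui for s, ui in zip(scores, u)]

best = max(sum(u[i] for u in us) for i in range(d))
avg_regret = (best - exp_pay) / n
print(avg_regret <= math.log(d) / eta + eta * (2 + math.log(n)) / n + 1e-9)
print((best - realized_pay) / n)   # realized regret tracks the expected one (a.s.)
```

Note that the average regret does not vanish here: it only drops below the floor δ_h/η = (log d)/η, which is exactly the ε-regret behavior of SFP.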
Remark 9. It should be noted here that the qualitative analysis of Benaïm et al. [4] does not require h to be strongly convex; that said, if h is strongly convex, Proposition 6.1 gives a quantitative bound on the regret.

6.1.4. Vanishingly Smooth Fictitious Play. The variant of SFP known as vanishingly smooth fictitious play (VSFP) was introduced by Benaïm and Faure [3], and its regret properties were established using sophisticated tools from the theory of differential inclusions and stochastic approximation, but, again, without providing explicit regret bounds.
Using the same notation as before, VSFP corresponds to the sequence of play generated by (3.10), where $h$ is a strongly convex regularizer on $\Delta^d$ and the parameter sequence $\eta_n$ satisfies:
(A1) $\lim_{n\to\infty} n\eta_n = +\infty$;
(A2) $\eta_n = O(n^{-\alpha})$ for some $\alpha > 0$.
Under these assumptions, the main result of Benaïm and Faure [3] is that (VSFP) leads to no realized regret; in our framework, this follows directly from Proposition 2.2 and Theorem 5.1 (which also gives a quantitative regret guarantee):

Proposition 6.2. With notation as in Proposition 6.1, the strategy (VSFP) with $\eta_n$ satisfying assumptions (A1) and (A2) guarantees the regret bound
$$\max_{x\in\Delta^d} \sum_{k=1}^{n} \langle u_k \mid x - x_k\rangle \;\leq\; \frac{h_{\max}-h_{\min}}{\eta_n} + \frac{1}{2K} \sum_{k=1}^{n} \eta_k.$$
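To trace the no-regret claim through the two assumptions, the following sketch assumes (as is standard for mirror-descent-type estimates) that the Theorem 5.1 bound splits into a $(h_{\max}-h_{\min})/\eta_n$ term and a $\tfrac{1}{2K}\sum_{k\leq n}\eta_k$ term:

```latex
% (A1) makes the first term sublinear:
\frac{h_{\max}-h_{\min}}{\eta_n}
  = \frac{(h_{\max}-h_{\min})\,n}{n\eta_n} = o(n)
  \quad\text{since } n\eta_n \to \infty;
% (A2) does the same for the second term:
\frac{1}{2K}\sum_{k=1}^{n}\eta_k
  = O\bigl(\max\{n^{1-\alpha},\,\log n\}\bigr) = o(n).
% Hence the cumulative regret is o(n), i.e., (VSFP) leads to no regret.
```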
6.1.5. Online Gradient Descent. The online gradient descent (OGD) algorithm was introduced by Zinkevich [32] in the context of the online convex optimization framework that we described in Section 2.3 (see also Bubeck [7, Section 4.1]). Here, we focus on a so-called lazy variant (Shalev-Shwartz [28, p. 144]) defined by means of the recursion
$$U_n \in U_{n-1} - \eta\,\partial\ell_n(x_n), \tag{OGD-L}$$
where $\ell_n \colon C \to \mathbb{R}$ is a sequence of $M$-Lipschitz loss functions, $\eta > 0$ is a constant parameter, and the algorithm is initialized with $U_0 = 0$. In view of Example 3.4, (OGD-L) corresponds to the strategy $\sigma = (\sigma_n^{h,\eta})_{n\geq 1}$ generated by the Euclidean regularizer $h$ on $C$ (itself defined as in (3.5)). Theorem 5.1 thus yields the regret bound
$$\sum_{k=1}^{n} \ell_k(x_k) - \sum_{k=1}^{n} \ell_k(x) \;\leq\; \frac{\delta_C^2}{2\eta} + \frac{\eta M^2 n}{2} \quad\text{for all } x \in C.$$
Accordingly, if the time horizon $n$ is known in advance, the optimal choice for $\eta$ is $\eta = \delta_C/(M\sqrt{n})$, leading to a cumulative regret guarantee of $M\delta_C\sqrt{n}$, which is essentially the bound derived by Shalev-Shwartz [28, Corollary 2.7] (see also Bubeck [7, Theorem 3.1] for the greedy variant).

6.1.6. Online Mirror Descent. The family of (lazy) online mirror descent (OMD) algorithms studied by Shalev-Shwartz [27, 28] is the most general family of strategies that we discuss in this section (see also Bubeck [7] for a greedy version). In particular, the OMD class of strategies contains EW and OGD as special cases, and it is also equivalent to the family of Follow the Regularized Leader (FtRL) algorithms in the case of linear payoffs (Shalev-Shwartz [28], Hazan [16]).
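As a minimal illustration of these lazy schemes, the following sketch implements the Euclidean special case (OGD-L): the gradient is subtracted from the unprojected score $U_n$, and the next iterate is recovered by projecting $U_n$ back onto $C$. The feasible set (a Euclidean ball), the loss, and the parameter values are illustrative assumptions, not choices made in the text:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of given radius centered at 0."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def lazy_ogd(grad, x0, eta, n_steps, radius=1.0):
    """Sketch of lazy online gradient descent (OGD-L).

    Score update: U_n = U_{n-1} - eta * grad(x_n); the played iterate is
    x_{n+1} = proj_C(U_n), i.e. the Euclidean mirror step applied to the
    unprojected score rather than to the last iterate.
    """
    U = np.zeros_like(x0)
    x = x0
    for _ in range(n_steps):
        U = U - eta * grad(x)        # lazy score update
        x = project_ball(U, radius)  # project only to produce the next play
    return x
```

Run against the fixed loss $\ell(x) = \tfrac12\|x - t\|^2$ with $t$ inside the ball, the iterates settle at $t$, consistent with the no-regret behavior described above.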
Following Shalev-Shwartz [28] (and with notation as in Section 2.3), let $\ell_n \colon C \to \mathbb{R}$ be a sequence of convex functions which are $M$-Lipschitz with respect to some norm $\|\cdot\|$ on $\mathbb{R}^d$. Then, given a regularizer function $h$ on $C$, the lazy OMD algorithm is defined by means of the recursion
$$U_n \in U_{n-1} - \eta\,\partial\ell_n(x_n), \qquad x_{n+1} = Q_h(U_n),$$
where $\eta > 0$ is a fixed parameter and the algorithm is initialized with $U_0 = 0$. As a result, if $h$ is taken $K$-strongly convex with respect to $\|\cdot\|$, Theorem 5.1 immediately yields the known regret bound for OMD:
$$\sum_{k=1}^{n} \ell_k(x_k) - \sum_{k=1}^{n} \ell_k(x) \;\leq\; \frac{h_{\max}-h_{\min}}{\eta} + \frac{\eta M^2 n}{2K} \quad\text{for all } x \in C.$$

6.2. Links with Convex Optimization. Ordinary convex programs can be seen as online optimization problems where the loss function remains constant over time and the agent seeks to attain its minimum value. In what follows, we outline how regret-minimizing strategies can be used for this purpose and we describe the performance gap incurred by using a method with a variable step size instead of a variable parameter.
Let $f \colon C \to \mathbb{R}$ be a convex real-valued function on $C$ and let $(\gamma_n)_{n\geq 1}$ be a positive sequence (which we will later interpret as a sequence of step sizes); also, given a sequence $(x_n)_{n\geq 1}$ in $C$, let
$$x_n^{\min} \in \arg\min_{1\leq k\leq n} f(x_k), \qquad x_n^{\gamma} = \frac{\sum_{k=1}^{n} \gamma_k x_k}{\sum_{k=1}^{n} \gamma_k}. \tag{6.10}$$
If we use the notation $x'_n \in \{x_n^{\min}, x_n^{\gamma}\}$ to refer interchangeably to either $x_n^{\min}$ or $x_n^{\gamma}$, Jensen's inequality readily gives
$$f(x'_n) - \min f \;\leq\; \frac{\sum_{k=1}^{n} \gamma_k \bigl(f(x_k) - \min f\bigr)}{\sum_{k=1}^{n} \gamma_k}. \tag{6.11}$$
Now consider the algorithm
$$U_n \in U_{n-1} - \gamma_n\,\partial f(x_n), \qquad x_{n+1} = Q_h(\eta_n U_n), \tag{6.12}$$
where $\gamma_n$ is a sequence of step sizes and $\eta_n$ is a sequence of parameters. In the case of a constant parameter $\eta_n = 1$, (6.12) then becomes
$$U_n \in U_{n-1} - \gamma_n\,\partial f(x_n), \qquad x_{n+1} = Q_h(U_n), \tag{MD-L}$$
which is a lazy variant of the mirror descent (MD) algorithm (Nemirovski and Yudin [23]). In particular, if $h$ is the Euclidean regularizer on $C$, the algorithm boils down to a lazy version of the standard projected subgradient (PSG) method:
$$U_n \in U_{n-1} - \gamma_n\,\partial f(x_n), \qquad x_{n+1} = \operatorname{pr}_C(U_n). \tag{PSG-L}$$
The following corollary shows that these lazy versions guarantee the same value convergence bounds as the corresponding greedy variants; see e.g. Beck and Teboulle [2, Theorem 4.1].

Corollary 6.3 (Constant parameter, variable step size). Let $f \colon C \to \mathbb{R}$ be an $M$-Lipschitz convex function and let $(x_n)_{n\geq 1}$ be the sequence of play generated by (MD-L) for some $K$-strongly convex regularizer $h$ on $C$. Then, the adjusted iterates $x'_n \in \{x_n^{\min}, x_n^{\gamma}\}$ of $x_n$ satisfy:
$$f(x'_n) - \min f \;\leq\; \frac{h_{\max} - h_{\min} + \frac{M^2}{2K} \sum_{k=1}^{n} \gamma_k^2}{\sum_{k=1}^{n} \gamma_k}. \tag{6.13}$$

Proof. With $\sigma = (\sigma_n^{h,\eta_n})_{n\geq 1}$, $u_k \in -\gamma_k\,\partial f(x_k)$ and $x'_n \in \{x_n^{\min}, x_n^{\gamma}\}$, we have
$$f(x'_n) - f(x) \;\leq\; \frac{\sum_{k=1}^{n}\gamma_k\bigl(f(x_k) - f(x)\bigr)}{\sum_{k=1}^{n}\gamma_k} \;\leq\; \frac{\sum_{k=1}^{n}\langle u_k \mid x - x_k\rangle}{\sum_{k=1}^{n}\gamma_k},$$
where the first step follows from (6.11) and the second from the convexity of $f$. By taking $x \in \arg\min f$, we then obtain
$$f(x'_n) - \min f \;\leq\; \frac{\max_{x\in C}\sum_{k=1}^{n}\langle u_k \mid x - x_k\rangle}{\sum_{k=1}^{n}\gamma_k}.$$
The result then follows by applying Theorem 5.1 and using the fact that $\|u_k\|_* \leq \gamma_k \|\partial f(x_k)\|_* \leq \gamma_k M$ (recall that $f$ is $M$-Lipschitz continuous).
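A quick numerical sanity check of this value-convergence behavior: lazy PSG with step sizes $\gamma_k = k^{-1/2}$, run on the toy $1$-Lipschitz objective $f(x) = |x - 0.7|$ over $C = [0, 1]$ and returning the $\gamma$-weighted average iterate $x_n^\gamma$. The objective, feasible set, and horizon are illustrative choices, not taken from the text:

```python
import numpy as np

def lazy_psg_average(n_steps=2000, target=0.7):
    """Sketch of the lazy projected subgradient method (PSG-L) with variable
    step sizes gamma_k = 1/sqrt(k) and a gamma-weighted average iterate,
    for the toy objective f(x) = |x - target| on C = [0, 1]."""
    U, x = 0.0, 0.0                      # U_0 = 0, x_1 = proj(U_0) = 0
    weighted_sum, weight = 0.0, 0.0
    for k in range(1, n_steps + 1):
        gamma = 1.0 / np.sqrt(k)
        g = np.sign(x - target)          # subgradient of f at x_k
        U -= gamma * g                   # lazy score update
        weighted_sum += gamma * x        # accumulate x_k with weight gamma_k
        weight += gamma
        x = min(max(U, 0.0), 1.0)        # projection onto C = [0, 1]
    return weighted_sum / weight         # the adjusted iterate x_n^gamma
```

The resulting optimality gap is within the $O(\log n/\sqrt{n})$-type guarantee of the corollary for this horizon.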
One can see that the best convergence rate we get with constant $\eta$ and step sizes of the form $\gamma_n \propto n^{-\alpha}$ is $O(\log n/\sqrt{n})$, attained for $\alpha = 1/2$ (and there is no straightforward choice of $\gamma_n$ leading to a better convergence rate). On the other hand, by taking a constant step size $\gamma_n = 1$ and varying the algorithm's parameter as $\eta_n \propto n^{-1/2}$, we do achieve an $O(n^{-1/2})$ rate of convergence.

Corollary 6.4 (Constant step size, variable parameter). With notation as in Corollary 6.3, let $(x_n)_{n\geq 1}$ be the sequence of play generated by (6.12) with $\eta_n$ given by (6.16) and constant step size $\gamma_n = 1$. Then, the adjusted iterates $x'_n \in \{x_n^{\min}, x_n^{\gamma}\}$ of $x_n$ satisfy $f(x'_n) - \min f = O(n^{-1/2})$.

Proof. Similar to the proof of Corollary 6.3.
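To make the rate comparison explicit, one can plug the two schedules into the standard mirror-descent form of the bound (with depth $\delta_h = h_{\max}-h_{\min}$), together with the elementary estimates $\sum_{k=1}^{n} k^{-1} \leq 1+\log n$ and $\sum_{k=1}^{n} k^{-1/2} \geq 2(\sqrt{n+1}-1)$; this is a sketch under those standard forms:

```latex
% Variable step size gamma_n = n^{-1/2}, constant eta:
\frac{\delta_h + \frac{M^2}{2K}\sum_{k=1}^{n}\gamma_k^2}{\sum_{k=1}^{n}\gamma_k}
  \;\leq\; \frac{\delta_h + \frac{M^2}{2K}\,(1+\log n)}{2(\sqrt{n+1}-1)}
  \;=\; O\!\left(\frac{\log n}{\sqrt{n}}\right).
% Constant step size gamma_n = 1, variable parameter eta_n = c n^{-1/2}:
\frac{1}{n}\left(\frac{\delta_h}{\eta_n}
  + \frac{M^2}{2K}\sum_{k=1}^{n}\eta_k\right)
  \;\leq\; \frac{1}{n}\left(\frac{\delta_h\sqrt{n}}{c}
  + \frac{M^2 c\,\sqrt{n}}{K}\right)
  \;=\; O\!\left(n^{-1/2}\right).
```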

6.3. Noisy Observations and Links with Stochastic Convex Optimization.
Assume that at every stage $n = 1, 2, \ldots$ of the decision process, the agent does not observe the actual payoff vector $u_n \in V^*$, but the realization of a random vector $\hat u_n$ with $\mathbb{E}[\hat u_n \mid \hat u_{n-1}, \ldots, \hat u_1] = u_n$. In this case, a learning strategy $\sigma$ can be used with the observed vectors $\hat u_n$, thus leading to a (random) sequence of play $x_{n+1} = \sigma_{n+1}(\hat u_1, \ldots, \hat u_n)$; see e.g. Shalev-Shwartz [28, Section 4.1] for a model of this kind. In this framework, the agent's (maximal) cumulative regret will be given by
$$\max_{x\in C} \sum_{k=1}^{n} \langle u_k \mid x - x_k \rangle.$$
On the other hand, $\sum_{k=1}^{n} \langle \hat u_k \mid x - x_k\rangle$ can be interpreted as the agent's cumulative regret against the observed payoff sequence $(\hat u_n)_{n\geq 1}$. Thus, if $h$ is a $K$-strongly convex regularizer on $C$ and $\|\hat u_k\|_* \leq M$ (a.s.), Theorem 5.1 yields
$$\mathbb{E}\Bigl[\max_{x\in C}\sum_{k=1}^{n} \langle u_k \mid x - x_k\rangle\Bigr] \;\leq\; R_n,$$
where $R_n$ is the regret guarantee of (5.4) and we have used the easily verifiable fact that $\mathbb{E}[\langle \hat u_k - u_k \mid x_k - x\rangle] = 0$ (recall that $\mathbb{E}[\hat u_k - u_k \mid \hat u_{k-1}, \ldots, \hat u_1] = 0$ and that $x_k$ only depends on $\hat u_{k-1}, \ldots, \hat u_1$). This bound is of the same form as that of e.g. Shalev-Shwartz [28, Theorem 4.1]; furthermore, by the strong law of large numbers for martingale differences (Hall and Heyde [14, Theorem 2.18]), we also obtain the stronger statement that
$$\max_{x\in C} \sum_{k=1}^{n} \langle u_k \mid x - x_k\rangle \;\leq\; R_n + o(n) \quad \text{(a.s.)}, \tag{6.20}$$
i.e., if the parameter $\eta_n$ is suitably chosen, then the strategy (3.10) with noisy observations still leads to no regret (a.s.).

The above can be adapted to the framework of stochastic convex optimization as follows: let $f \colon C \to \mathbb{R}$ be a Lipschitz convex function on $C$, let $(\gamma_n)_{n\geq 1}$ be a positive sequence of step sizes, and consider the strategy $\sigma$ generated by (3.10) with $\eta = 1$ and $h$ a $K$-strongly convex regularizer on $C$. Then, the sequence of play, where $\hat g_n$ is a random vector with $\mathbb{E}[\hat g_n \mid \hat g_{n-1}, \ldots, \hat g_1] \in \partial f(\hat x_n)$, may be written recursively as
$$\hat U_n = \hat U_{n-1} - \gamma_n \hat g_n, \qquad \hat x_{n+1} = Q_h(\hat U_n).$$
This algorithm may be seen as a lazy version of the so-called mirror descent stochastic approximation (MDSA) process of Nemirovski et al. [22]; in particular, using the Euclidean regularizer leads to the lazy stochastic projected subgradient (SPSG) method. Setting $u_n = -\gamma_n g_n$ and $\hat u_n = -\gamma_n \hat g_n$, and taking $\hat x'_n \in \{\hat x_n^{\min}, \hat x_n^{\gamma}\}$ as before, Corollary 6.3 combined with our previous discussion then gives
$$\mathbb{E}\bigl[f(\hat x'_n)\bigr] - \min f \;\leq\; \frac{h_{\max} - h_{\min} + \frac{M^2}{2K}\sum_{k=1}^{n} \gamma_k^2}{\sum_{k=1}^{n} \gamma_k}.$$

Table 1. Summary of the algorithms discussed in Section 6. The suffix "L" indicates a "lazy" variant; the input column stands for the stream of payoff vectors which is used as input for the algorithm and the norm column specifies the norm of the ambient space; finally, $\xi_n$ represents a zero-mean stochastic process with values in $\mathbb{R}^d$.

7. Discussion

7.1. On the optimal choice of h. As mentioned in the discussion after Corollary 5.2, the following open question arises: given a norm $\|\cdot\|$ on $V$ and a compact, convex subset $C \subset V$, which 1-strongly convex regularizer $h \colon C \to \mathbb{R}$ has minimal depth $\delta_h = h_{\max} - h_{\min}$? As the following proposition shows, in the case of the Euclidean norm on $V$, this minimal depth is half the radius squared of the smallest enclosing sphere of $C$:

Proposition 7.1. Let $h \colon C \to \mathbb{R}$ be a 1-strongly convex regularizer function on $C$ with respect to the $\ell^2$ norm $\|\cdot\|_2$ on $V$. Then
$$h_{\max} - h_{\min} \;\geq\; \tfrac{1}{2}\,\min_{x'\in C}\,\max_{x\in C}\,\|x' - x\|_2^2, \tag{7.1}$$
and the bound is attained by the regularizer
$$h(x) = \begin{cases} \tfrac{1}{2}\,\|x - x_0\|_2^2 & \text{if } x \in C, \\ +\infty & \text{otherwise}, \end{cases} \tag{7.2}$$
where $x_0 \in \arg\min_{x'\in C} \max_{x\in C} \|x' - x\|_2^2$ is the center of the smallest enclosing sphere of $C$.
Proof. Letting $x_1 \in \arg\min_{x\in C} h(x)$ and $x_2 \in \arg\max_{x\in C} \|x - x_1\|_2^2$, we readily get
$$h_{\max} - h_{\min} \;\geq\; h(x_2) - h(x_1) \;\geq\; \tfrac{1}{2}\,\|x_2 - x_1\|_2^2 \;\geq\; \tfrac{1}{2}\,\min_{x'\in C}\,\max_{x\in C}\,\|x' - x\|_2^2,$$
where the second inequality follows from the strong convexity of $h$ and the fact that $\partial h(x_1) \ni 0$. That (7.2) attains the bound (7.1) is then a trivial consequence of its definition, as is its geometric characterization.
Despite the simplicity of the bound (7.1), this analysis does not work for an arbitrary norm because $\tfrac{1}{2}\|x - x_0\|^2$ might fail to be 1-strongly convex with respect to $\|\cdot\|$; for instance, $\|x - x_0\|_1^2$ is not even strictly convex.

7.2. Greedy versus Lazy. To illustrate the difference between lazy and greedy variants, we first focus on the PSG method run with constant step $\gamma = 1$ for a smooth function $f \colon C \to \mathbb{R}$. The two variants may then be expressed by means of the recursions
$$a_n = x_n - \nabla f(x_n), \qquad x_{n+1} = \arg\min_{x\in C} \|x - a_n\|^2 \tag{7.4a}$$
for the greedy version and
$$y_n = y_{n-1} - \nabla f(x_n), \qquad x_{n+1} = \arg\min_{x\in C} \|x - y_n\|^2 \tag{7.4b}$$
for the lazy one. As can be seen in Fig. 1, the greedy variant is based on the classical idea of gradient descent, i.e. adding $-\nabla f(x_n)$ to $x_n$ and projecting back to $C$ if needed. On the other hand, in the lazy variant, the gradient term $-\nabla f(x_n)$ is not added to $x_n$, but to the "unprojected" iterate $y_n$; we only project to $C$ in order to obtain the algorithm's next iterate. Owing to this modification, the lazy variant is thus driven by the sum $y_n = -\sum_{k=1}^{n} \nabla f(x_k)$.
In the case of mirror descent with an arbitrary regularizer function $h$, the lazy version has an implementation advantage over its greedy counterpart. Specifically, given a proper convex function $F$ such that $F = h$ on $C$ (cf. Example 3.5), greedy mirror descent is defined as
$$a_n = \nabla F^*\bigl(\nabla F(x_n) - \nabla f(x_n)\bigr), \qquad x_{n+1} = \operatorname{pr}^F_C(a_n), \tag{7.5a}$$
where the Bregman projection $\operatorname{pr}^F_C(a_n)$ is given by (3.8); on the other hand, lazy MD is defined as
$$y_n = y_{n-1} - \nabla f(x_n), \qquad x_{n+1} = Q_h(y_n). \tag{7.5b}$$
The computation steps for each variant are represented in Figure 2. The first step in the greedy version, which consists in computing $\nabla F$, has no equivalent in the lazy version, which is thus computationally more lightweight.
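The two recursions can be contrasted side by side in code. The sketch below implements the greedy and lazy PSG variants of (7.4a)/(7.4b) on a Euclidean ball; the quadratic objective and the feasible set are illustrative assumptions. Note how the lazy scores $y_n$ accumulate the full gradient sum while only the projected iterates are ever played:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of given radius centered at 0."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def greedy_psg(grad, x0, n_steps, gamma=1.0):
    """Greedy PSG (7.4a): step from the current iterate, then project."""
    x = x0
    for _ in range(n_steps):
        x = project_ball(x - gamma * grad(x))
    return x

def lazy_psg(grad, x0, n_steps, gamma=1.0):
    """Lazy PSG (7.4b): the gradient is added to the unprojected iterate y_n;
    projection is used only to produce the next played point."""
    x, y = x0, np.zeros_like(x0)
    for _ in range(n_steps):
        y = y - gamma * grad(x)   # y_n = y_{n-1} - grad f(x_n)
        x = project_ball(y)       # x_{n+1} = argmin_{x in C} ||x - y_n||^2
    return x
```

For $f(x) = \tfrac12\|x - t\|^2$ with $t = (2, 0)$ outside the unit ball, both variants settle at the constrained minimizer $(1, 0)$, even though the lazy scores $y_n$ keep drifting along the gradient sum.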