Scale-invariant unconstrained online learning

We consider a variant of online convex optimization in which both the instances (input vectors) and the comparator (weight vector) are unconstrained. We exploit a natural scale invariance symmetry in our unconstrained setting: the predictions of the optimal comparator are invariant under any linear transformation of the instances. Our goal is to design online algorithms which also enjoy this property, i.e. are scale-invariant. We start with the case of coordinate-wise invariance, in which the individual coordinates (features) can be arbitrarily rescaled. We give an algorithm which achieves an essentially optimal regret bound in this setup, expressed by means of a coordinate-wise scale-invariant norm of the comparator. We then study general invariance with respect to arbitrary linear transformations. We first give a negative result, showing that no algorithm can achieve a meaningful bound in terms of a scale-invariant norm of the comparator in the worst case. Next, we complement this result with a positive one, providing an algorithm which "almost" achieves the desired bound, incurring only a logarithmic overhead in terms of the norm of the instances.


Introduction
We consider the following variant of online convex optimization (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2011; Hazan, 2015). In trials t = 1, . . . , T, the algorithm receives an instance x_t ∈ R^d, on which it predicts ŷ_t = x_t^⊤ w_t by means of a weight vector w_t ∈ R^d. Then, the true label y_t is revealed and the algorithm suffers loss ℓ(y_t, ŷ_t), convex in ŷ_t. The algorithm's performance is evaluated by means of regret, the difference between the algorithm's cumulative loss and the cumulative loss of the prediction sequence produced by a fixed comparator (weight vector) u ∈ R^d. The goal of the algorithm is to minimize its regret for every data sequence {(x_t, y_t)}_{t=1}^T and every comparator u. This framework includes numerous machine learning scenarios, such as linear classification (with convex surrogate losses) and regression.
Most of the work in online convex optimization assumes that the instances and the comparator are constrained to some bounded convex sets, often known to the algorithm in advance. In practice, however, such boundedness assumptions are often unjustified: the learner has little prior knowledge on the potential magnitude of the instances, while prior knowledge of an upper bound on the comparator seems even less realistic. Therefore, much work has recently been dedicated to relaxing some of these prior assumptions (Streeter and McMahan, 2012; Orabona, 2013; McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona and Pál, 2015, 2016; Luo et al., 2016). Here, we go a step further, dropping these assumptions entirely and treating the instances, the comparator, as well as the comparator's predictions as unconstrained.
In this paper, we exploit a natural scale invariance symmetry of the unconstrained setting: if we transform all instances by any invertible linear transformation A, x → Ax, and simultaneously transform the comparator by the (transposed) inverse of A, u → A^{-⊤}u, the predictions, and hence the comparator's loss, will not change. This means that the predictions of the optimal (loss-minimizing) comparator (if it exists) are invariant under any linear transformation of the instances, so that the scale of the weight vector is only relative to the scale of the instances. Our goal is to design online algorithms which also enjoy this property, i.e. whose predictions are invariant under any rescaling of the instances.
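This invariance is straightforward to verify numerically. The following self-contained snippet (illustrative, not part of the paper) transforms the instances by an invertible matrix A and the comparator by A^{-⊤}, and checks that the comparator's predictions are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 5
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)  # invertible with overwhelming probability
xs = rng.normal(size=(T, d))                   # instances x_1, ..., x_T (as rows)
u = rng.normal(size=d)                         # comparator weight vector

xs_new = xs @ A.T                # x -> A x, applied to every instance
u_new = np.linalg.inv(A).T @ u   # u -> A^{-T} u

# The comparator's predictions u^T x_t are unchanged by the transformation.
assert np.allclose(xs @ u, xs_new @ u_new)
```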
Since in the absence of any constraints, the adversary can inflict arbitrarily large regret in just one trial by choosing the instance and/or comparator sufficiently large, the regret can only be bounded by a data-dependent function Ψ(u, {x_t}_{t=1}^T), which can be thought of as a penalty for the adversary for having played with the sequence {x_t}_{t=1}^T and the comparator u. We incorporate the scale invariance into this framework by working with Ψ which depends on the data and the comparator only through the predictions of u. As we will see, designing the online algorithms to have their regret bounded by such Ψ will automatically lead to scale-invariant methods.
We first consider a specific form of scale invariance, which we call coordinate-wise invariance, in which the individual instance coordinates ("features") can be arbitrarily rescaled (which corresponds to choosing a transformation A which is diagonal). One can think of such rescaling as a change of the units in which the coordinates are expressed. Inspired by the work of Ross et al. (2013), we choose the penalty function to capture the coordinate-wise invariance in the following decomposable form: Ψ_T(u) = ∑_{i=1}^d f(|u_i| s_{T,i}), where s_{T,i} = (∑_{t=1}^T x_{t,i}²)^{1/2} are "standard deviations"¹ of individual coordinates (so that |u_i| s_{T,i} measures the scale of the i-th coordinate relative to the comparator's weight) and f(x) = x√(log(1 + x²)). This particular choice of f is motivated by a lower bound of Streeter and McMahan (2012), which indicates that such a dependency is the best we can hope for. The main result of Section 3 is a scale-invariant algorithm which achieves this bound up to an O(log T) factor. The algorithm is a first-order method and runs in O(d) time per trial. We note that when the Euclidean norms of the instances and the comparator are bounded by X and U, respectively, our bound reduces to the Online Gradient Descent bound of O(U X √T) (Zinkevich, 2003) up to a logarithmic factor.
We then turn to a general setup in which the instances can be rescaled by arbitrary linear transformations. A natural and analytically tractable choice is to parameterize the bound by means of a sum of squared predictions: Ψ_T(u) = f(‖u‖_{S_T}), where ‖u‖²_{S_T} = u^⊤ S_T u = ∑_t (u^⊤ x_t)², S_T = ∑_t x_t x_t^⊤ is the empirical "covariance" matrix, and f(x) = x√(log(1 + x²)) as before. Our first result is a negative one: any algorithm can be forced by an adversary to have regret at least Ω(‖u‖_{S_T} √T) already for d = 2 dimensional inputs. It turns out that such a bound is meaningless, as the trivial algorithm which always predicts zero has its regret bounded by O(‖u‖_{S_T} √T). Is this then the end of the story? While the above result suggests that the adversary has too much power and every algorithm fails in this setting, we show that this view is too pessimistic, complementing the negative result with a positive one. In Section 4 we derive a scale-invariant algorithm that is capable of almost achieving the bound expressed by Ψ above, with only a logarithmic dependence on the norm of the instances. The algorithm is a second-order method and runs in O(d²) time per trial.

Related work
A standard setup in online convex optimization (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2011; Hazan, 2015) assumes that both the instances² and the comparator are constrained to some bounded convex sets, known to the learner in advance. A recent series of papers has explored a setting in which the comparator is unconstrained and the learner needs to adapt to an unknown comparator norm (Streeter and McMahan, 2012; Orabona, 2013; McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014; Orabona and Pál, 2016). Most of these papers (the exception being Orabona (2013)), however, assume that the loss gradients (and thus the instances in our setup) are bounded. Moreover, none of these papers concerns scale invariance.
Scale-invariant online algorithms were studied by Ross et al. (2013), who consider a setup similar to our coordinate-wise case. They, however, make a strong assumption that all individual feature constituents of the comparator predictions are bounded: |u_i x_{t,i}| ≤ C for all i = 1, . . . , d and t = 1, . . . , T, where C is known to the learner. Their algorithm has a bound which depends on C/b_{T,i}, i = 1, . . . , d (where b_{t,i} = max_{q=1,...,t} |x_{q,i}|), which is in fact the worst-case upper bound on |u_i|; furthermore, their bound also depends on the ratios b_{T,i}/|x_{t_i,i}| (t_i being the first trial in which the i-th feature x_{t_i,i} is non-zero), which can be made arbitrarily large in the worst case. Orabona et al. (2015) study a similar setup, giving a bound in terms of the quantities C |x_{t,i}|/b_{t,i}, and their algorithm still requires knowledge of C to tune its learning rate. In contrast, we do not make any assumptions on the predictions of u, and our bound depends on the actual values of u_i, solely by means of u_i² ∑_t x_{t,i}², i = 1, . . . , d. Luo et al. (2016) consider a setup similar to our full scale-invariant case, but they require an additional constraint |u^⊤ x_t| ≤ C for all t, which we avoid in this work. Finally, Orabona and Pál (2015) consider a different notion of invariance, unrelated to our setup.

Problem Setup
We consider a variant of online convex optimization, summarized in Figure 1. In each trial t = 1, . . . , T, an instance x_t ∈ R^d is presented to the learner, which produces a weight

2. Most papers assume a bound on the (sub)gradient of the loss with respect to w, which translates to a bound on the instances, as ∇_w ℓ(y, x^⊤w) = ∂_ŷ ℓ(y, ŷ) · x.
Figure 1: At trial t = 1, . . . , T:
Instance x_t ∈ R^d is revealed to the learner.
Learner predicts with ŷ_t = x_t^⊤ w_t for some w_t ∈ R^d.
Adversary reveals label y_t ∈ R.
Learner suffers loss ℓ(y_t, ŷ_t).

vector w_t ∈ R^d (possibly depending on x_t) and a prediction ŷ_t = x_t^⊤ w_t. Then, the true label y_t is revealed, and the learner suffers loss ℓ(y_t, ŷ_t). We assume the loss is convex in its second argument and L-Lipschitz (where L is known to the learner), i.e. the subderivatives of ℓ are bounded: |∂_ŷ ℓ(y, ŷ)| ≤ L for all y, ŷ. Two popular loss functions which fall into this framework (with L = 1) are the logistic loss ℓ(y, ŷ) = log(1 + exp(−y ŷ)) and the hinge loss ℓ(y, ŷ) = (1 − y ŷ)₊. Throughout the rest of the paper, we assume L = 1 without loss of generality.
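As a quick numerical check of the Lipschitz claim (illustrative only, not part of the formal development), the snippet below implements both losses and verifies that the subderivative with respect to the prediction is bounded by 1 in absolute value for labels y ∈ {−1, +1}:

```python
import numpy as np

def logistic_loss(y, yhat):
    return np.log1p(np.exp(-y * yhat))

def logistic_grad(y, yhat):
    # derivative of the logistic loss w.r.t. the prediction yhat
    return -y / (1.0 + np.exp(y * yhat))

def hinge_loss(y, yhat):
    return max(0.0, 1.0 - y * yhat)

def hinge_subgrad(y, yhat):
    # a subderivative of the hinge loss w.r.t. the prediction yhat
    return -y if y * yhat < 1 else 0.0

# Both losses are 1-Lipschitz in the prediction: |g_t| <= 1.
for y in (-1.0, 1.0):
    for yhat in np.linspace(-10, 10, 101):
        assert abs(logistic_grad(y, yhat)) <= 1.0
        assert abs(hinge_subgrad(y, yhat)) <= 1.0
```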
The performance of the learner is evaluated by means of regret:

regret_T(u) = ∑_{t=1}^T ℓ(y_t, ŷ_t) − ∑_{t=1}^T ℓ(y_t, u^⊤ x_t),

where u ∈ R^d is a fixed comparator weight vector, and the dependence on the data sequence has been omitted on the left-hand side as clear from the context. The goal of the learner is to minimize its regret for every data sequence {(x_t, y_t)}_{t=1}^T and every comparator vector u. We use the "gradient trick" (Kivinen and Warmuth, 1997; Shalev-Shwartz, 2011), which exploits the convexity of ℓ to bound ℓ(y_t, ŷ'_t) ≥ ℓ(y_t, ŷ_t) + ∂_{ŷ_t} ℓ(y_t, ŷ_t)(ŷ'_t − ŷ_t) for any subderivative ∂_{ŷ_t} ℓ(y_t, ŷ_t) at ŷ_t. Using this inequality in each trial with ŷ'_t = u^⊤ x_t, we get:

regret_T(u) ≤ ∑_{t=1}^T g_t ŷ_t − ∑_{t=1}^T g_t u^⊤ x_t,    (1)

where we denoted the subderivative by g_t. Throughout the rest of the paper, we will only be concerned with bounding the right-hand side of (1), i.e. we will treat the loss as linear in the prediction, ŷ ↦ g_t ŷ, with |g_t| ≤ 1 (which follows from the 1-Lipschitzness of the loss).
In this paper, contrary to previous work, we do not impose any constraints on the instances x_t or the comparator u, nor on the predictions x_t^⊤ u. Since in the absence of any constraints the adversary can inflict arbitrarily large regret in just one trial, the regret can only be bounded by a data-dependent function Ψ(u, {x_t}_{t=1}^T), which we henceforth concisely denote by Ψ_T(u), dropping the dependence on the data as clear from the context. An alternative view, which will turn out to be useful, is to study the penalized regret, regret_T(u) − Ψ_T(u), i.e. the regret offset by Ψ_T(u), where the latter can now be treated as a penalty for the adversary (a related quantity was called the benchmark by McMahan and Abernethy (2013)). We will design online learning algorithms which aim at minimizing the penalized regret, and this will immediately imply a data-dependent regret bound expressed by Ψ_T(u).
Since in the unconstrained setup the predictions of the optimal comparator are invariant under any linear transformation of the instances, our goal will be to design online learning algorithms which also enjoy this property, i.e. whose predictions do not change under linear transformations of the instances. As we will see, the invariance of the learning algorithms will follow from an appropriate choice of the penalty function. In Section 3, we consider algorithms invariant with respect to coordinate-wise transformations. We then move to full scale invariance (for arbitrary linear transformations) in Section 4.

Coordinate-wise Scale Invariance
In this section we consider algorithms which are invariant under any rescaling of individual features: if we apply a coordinate-wise transformation x_{t,i} → a_i x_{t,i} for some a_i > 0, i = 1, . . . , d, t = 1, . . . , T, the predictions of the algorithm should remain the same. Such a transformation has a natural interpretation as a change of the units in which the instances are measured on each coordinate. The key element is the right choice of the penalty function Ψ_T(u), which translates into the desired bound on the regret: the penalty function should be invariant under any feature scaling, offset by the corresponding rescaling of the comparator. Inspired by Ross et al. (2013), we consider the following function, which has such a property: Ψ_T(u) = ∑_{i=1}^d f(|u_i| s_{T,i}). Since ‖x_t‖ = 1 for all t, we have s_{T,1} = √T, and the theorem suggests that the best dependence on |u_i| s_{T,i} one can hope for is f(x) = x√(log x). This motivates us to study a function of the form (McMahan and Orabona, 2014; Orabona and Pál, 2016):

f(x) = x √(α log(1 + α β² x²)),    (2)

for some α, β > 0. This particular choice of parameterization will simplify the forthcoming analysis. Ross et al. (2013) have shown that if the learner knew the comparator and the standard deviations of each feature in hindsight, the optimal tuning of the learning rate would result in a regret bound of order ∑_i |u_i| s_{T,i} (for g_t ∈ {−1, 1}). We will show that without any such prior knowledge, we will be able to essentially (up to a log(T) factor) achieve a bound of ∑_i f(|u_i| s_{T,i}), incurring only a logarithmic overhead for not knowing the scale of the instances and the comparator.
Note that the problem decomposes coordinate-wise into d one-dimensional problems, as:

∑_{t=1}^T g_t ŷ_t − ∑_{t=1}^T g_t u^⊤ x_t − Ψ_T(u) = ∑_{i=1}^d ( ∑_{t=1}^T g_t x_{t,i} w_{t,i} − ∑_{t=1}^T g_t x_{t,i} u_i − f(|u_i| s_{T,i}) ).

Thus, it suffices to separately analyze each such one-dimensional problem, and the final bound will be obtained by summing the individual bounds over the coordinates.

Motivation
Fix i ∈ {1, . . . , d} and let us temporarily drop the index i for the sake of clarity. Our goal is to design an algorithm which minimizes the one-dimensional penalized regret:

∑_{t=1}^T g_t x_t w_t − ∑_{t=1}^T g_t x_t u − f(|u| s_T),

where s_T² = ∑_t x_t² and f is given by (2). If we denote h_t = −∑_{q≤t} x_q g_q, we can upper bound the penalized regret, simultaneously for all u, by:

∑_{t=1}^T g_t x_t w_t + sup_{u ≥ 0} { |h_T| u − f(u s_T) },

where we observed that the worst-case u has the same sign as h_T. We now use some simple facts on Fenchel duality (Boyd and Vandenberghe, 2004). Given a function f : X → R, its Fenchel conjugate is f*(θ) = sup_{x ∈ X} { θx − f(x) }; if g(x) = f(ax) for some a > 0, then g*(θ) = f*(θ/a). Choosing X = [0, ∞) and a = s_T, we get that:³

sup_{u ≥ 0} { |h_T| u − f(u s_T) } = f*(|h_T|/s_T).    (3)

We now use Lemma 18 by Orabona and Pál (2016) (modified to our needs):

Lemma 2 (Orabona and Pál, 2016) Let f(x) = x √(α log(1 + α β² x²)) for α, β > 0 and x ≥ 0. Then, for all θ ≥ 0, f*(θ) ≤ (1/β) exp(θ²/(2α)).

Applying Lemma 2 to (3) results in:

∑_{t=1}^T g_t x_t w_t − ∑_{t=1}^T g_t x_t u − f(|u| s_T) ≤ ∑_{t=1}^T g_t x_t w_t + (1/β) exp( h_T² / (2α s_T²) ).    (4)

The main advantage of this bound is the elimination of the unknown comparator u. We can now design the learning algorithm to directly minimize the right-hand side of (4) over the worst-case choice of the data. What we derived here is essentially a variant of "regret-reward duality" (Streeter and McMahan, 2012; McMahan and Orabona, 2014; Orabona and Pál, 2016).
3. In the excluded case s_T = 0, the regret is trivially zero, as x_t = 0 for all t.
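The conjugate scaling rule g(x) = f(ax) ⇒ g*(θ) = f*(θ/a) can be sanity-checked numerically. The sketch below (illustrative only; the grid bounds and tolerances are arbitrary choices, not from the paper) approximates both conjugates on a grid for the penalty shape f(x) = x√(log(1 + x²)):

```python
import numpy as np

# Approximate Fenchel conjugate on X = [0, infinity), truncated to a grid:
# f*(theta) = sup_{x >= 0} (theta * x - f(x)).
xs = np.linspace(0.0, 50.0, 200001)

def conjugate(f, theta):
    return np.max(theta * xs - f(xs))

f = lambda x: x * np.sqrt(np.log1p(x ** 2))   # penalty shape used in the paper
a = 3.0                                        # plays the role of s_T
g = lambda x: f(a * x)                         # g(x) = f(a x)

# Scaling rule: g*(theta) = f*(theta / a).
for theta in (0.5, 1.0, 2.0):
    assert np.isclose(conjugate(g, theta), conjugate(f, theta / a), rtol=1e-3)
```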
Algorithm 1 (pseudo-code, abridged): for t = 1, . . . , T: receive x_t and update s_{t,i}; predict with w_t as in (5); receive y_t and suffer loss ℓ(y_t, ŷ_t).

The algorithm
We now describe an algorithm which aims at minimizing (4) for each coordinate i = 1, . . . , d. The algorithm maintains the negative past cumulative (linearized) losses h_{t,i} = −∑_{q≤t} g_q x_{q,i}, as well as the variances s_{t,i}² for all i. At the beginning of trial t, after observing x_t (and updating s_{t,i}²), the algorithm predicts with the weight vector w_t defined in (5). Our algorithm resembles two previously considered methods in online convex optimization, AdaptiveNormal (McMahan and Orabona, 2014) and PiSTOL (Orabona, 2014). Similarly to these methods, we also use a step size which is exponential in the square of the gradient (which is directly related to the shape of the regret bound (2) we are aiming for). However, we counterweight the total gradient by dividing it by the variance s_{t,i}², whereas AdaptiveNormal uses to this end the number of trials t, while PiSTOL uses the sum of absolute values ∑_{q≤t} |g_q x_{q,i}|. Only our choice leads to a scale-invariant algorithm, which is easiest to understand by thinking in terms of physical units: if we imagine that the i-th coordinate of the instances has unit [x_i], the term in the exponent in (5) is unitless, while the weight w_i has unit 1/[x_i], so that the prediction ŷ_t also becomes unitless. Thus, rescaling the i-th coordinate (or, equivalently, changing its unit) does not affect the prediction. Note that our algorithm uses a separate "learning rate" η_{t,i} for each coordinate, similarly to the methods of McMahan and Streeter (2010) and Duchi et al. (2011). The pseudo-code is presented as Algorithm 1.
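As an illustration (not part of the formal development), the sketch below assembles the ingredients described above into runnable Python. Since the display defining the weights (equation (5)) did not survive into this text, the exact scaling constants in `predict` are placeholders, not the paper's; only the structure is faithful: per-coordinate state (h_{t,i}, s²_{t,i}), a weight exponential in h²_{t,i}/(2α s²_{t,i}), and the exp(x²_{t,i}/(2α s²_{t,i})) correction. The point of the sketch is the unit analysis: predictions are unchanged under any per-coordinate rescaling of the features.

```python
import numpy as np

class CoordinatewiseScaleInvariant:
    """Illustrative sketch (NOT the paper's exact update (5)). The exponents
    are unitless and the weight w_i carries unit 1/[x_i], so the prediction
    x^T w is invariant under per-coordinate rescaling of the features."""

    def __init__(self, d, alpha=1.5, beta=1.0):
        self.h = np.zeros(d)    # h_{t,i} = -sum_{q<=t} g_q x_{q,i}
        self.s2 = np.zeros(d)   # s2_{t,i} = sum_{q<=t} x_{q,i}^2
        self.alpha, self.beta = alpha, beta

    def predict(self, x):
        # Call once per trial: updates the variances, then forms the weights.
        self.s2 = self.s2 + x ** 2
        a = self.alpha
        s2 = np.where(self.s2 > 0, self.s2, 1.0)  # coords never seen have h = 0, w = 0
        # Gradient sum counterweighted by the variance, with an exponential
        # step size and the exp(x^2 / (2 a s2)) correction described in the text.
        w = (self.beta * self.h / (a * s2)
             * np.exp(self.h ** 2 / (2 * a * s2))
             * np.exp(x ** 2 / (2 * a * s2)))
        return w, float(x @ w)

    def update(self, x, g):
        # g is a subderivative of the loss at the prediction, |g| <= 1.
        self.h -= g * x
```

Rescaling coordinate i by c_i > 0 multiplies h_i by c_i and s²_i by c_i², so every exponent is unchanged and w_i scales by 1/c_i, leaving x^⊤w invariant.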
We now show that the algorithm maintains small penalized regret (4). To simplify the notation, define the potential function:

ψ_{t,i} = (1/β) exp( h_{t,i}² / (2α s_{t,i}²) ).

Lemma 3 Let α₀ = 9/8 and define κ(α) = exp( 1 / (2(α − α₀)) ). In each trial t = 1, . . . , T, for all i = 1, . . . , d, Algorithm 1 satisfies:

g_t x_{t,i} w_{t,i} ≤ ψ_{t−1,i} − ψ_{t,i} + κ(α)/(β t).

The proof is given in Appendix A. Lemma 3 can be thought of as a motivation behind the particular form of the weight vector used by the algorithm: the algorithm's predictions are set to keep its loss bounded by the drop of the potential. Note, however, that the algorithm does not play with the negative gradient of the potential (which is how many online learning algorithms can be motivated), as there is an additional, necessary correction of exp(x_{t,i}²/(2α s_{t,i}²)) in the weight expression. Applying Lemma 3 to each t = 1, . . . , T and summing over trials gives:

∑_{t=1}^T g_t x_{t,i} w_{t,i} ≤ ψ_{0,i} − ψ_{T,i} + κ(α)(1 + log T)/β,

where we bound ∑_{t=1}^T 1/t ≤ 1 + log T. Identifying the left-hand side of the above with the right-hand side of (4) for β = Td, and following the line of reasoning in Section 3.1, we obtain the bound on the penalized regret for the i-th coordinate. Summing over i = 1, . . . , d results in the following regret bound for the algorithm:

Theorem 4 For any comparator u and any sequence of outcomes {(x_t, y_t)}_{t=1}^T, Algorithm 1 achieves regret bounded by ∑_{i=1}^d f(|u_i| s_{T,i}) plus a term logarithmic in T, with f as in (2) and β = Td.

We finish this section by comparing the obtained bound with the standard bound of Online Gradient Descent, O(U X √T), when the instances and the comparator are bounded, ‖x_t‖ ≤ X, ‖u‖ ≤ U. By the Cauchy-Schwarz inequality we have ∑_i |u_i| s_{T,i} ≤ ‖u‖ (∑_i s_{T,i}²)^{1/2} ≤ U X √T, so that our bound is O(U X √T √(log(1 + d² U² X² T³))), incurring only a logarithmic overhead for not knowing the bound on the instances and on the comparator in hindsight.

Full Scale Invariance
In this section we consider algorithms which are invariant under general linear transformations of the form x_t → A x_t for all t = 1, . . . , T. As we will see, imposing such a general symmetry will lead to a second-order algorithm, i.e. the algorithm will maintain the full covariance matrix S_t = ∑_{q≤t} x_q x_q^⊤. To incorporate the scale invariance into the problem, we choose the penalty Ψ_T(u) to depend only on the predictions generated by u. A natural and analytically tractable choice is to parameterize Ψ_T(u) by means of the sum of squared predictions:

Ψ_T(u) = f(‖u‖_{S_T}),  where  ‖u‖²_{S_T} = ∑_{t=1}^T (u^⊤ x_t)².

As before, taking into account the lower bound from Lemma 1, we choose f(x) as defined in (2), i.e. f(x) = x √(α log(1 + α β² x²)), for some α, β > 0. Our goal is thus to design a scale-invariant algorithm which maintains a small penalized regret:

∑_{t=1}^T g_t ŷ_t + sup_u { h_T^⊤ u − f(‖u‖_{S_T}) },

where we defined h_t = −∑_{q≤t} g_q x_q. We will make use of the following general result, proven in Appendix B. For any positive semi-definite matrix A ∈ R^{d×d}, let ‖u‖_A := √(u^⊤ A u) denote the semi-norm of u ∈ R^d induced by A. We have:

Lemma 5 For any f : [0, ∞) → R, any positive semi-definite matrix A, and any vector y ∈ range(A),

sup_u { y^⊤ u − f(‖u‖_A) } = sup_{β ≥ 0} { β ‖y‖_{A†} − f(β) } = f*(‖y‖_{A†}),

where A† denotes the pseudoinverse of A. Application of Lemma 5 together with Lemma 2 gives:

∑_{t=1}^T g_t ŷ_t + sup_u { h_T^⊤ u − f(‖u‖_{S_T}) } ≤ ∑_{t=1}^T g_t ŷ_t + (1/β) exp( ‖h_T‖²_{S_T†} / (2α) ).    (6)

As before, we have eliminated the unknown comparator from the equation, and we will design the algorithm to directly minimize the right-hand side of (6) over the worst-case choice of the data.

Lower bound
We start with a negative result. It turns out that the full scale invariance setting is significantly harder than the coordinate-wise one already for d = 2. We will show that any algorithm will suffer at least Ω(‖u‖_{S_T} √T) regret in the worst case, and this bound has a matching upper bound achieved by the trivial algorithm which predicts 0 all the time.
Theorem 6 Let d ≥ 2. For any algorithm and any nonnegative number β ∈ R₊, there exists a sequence of outcomes and a comparator u such that ‖u‖_{S_T} = β and the regret is at least Ω(‖u‖_{S_T} √T).

On the other hand, consider an algorithm which predicts 0 all the time. In this case,

regret_T(u) = ∑_{t=1}^T ( ℓ(y_t, 0) − ℓ(y_t, u^⊤ x_t) ) ≤ ∑_{t=1}^T |u^⊤ x_t| ≤ ‖u‖_{S_T} √T,

where the first inequality uses the 1-Lipschitzness of the loss and the second inequality is from the Cauchy-Schwarz inequality. Thus, the lower bound is trivially achieved, and we conclude that it is not possible for any online algorithm to obtain a meaningful bound that depends only on ‖u‖_{S_T} in the worst case.

The algorithm
While it is not possible to get a meaningful bound in terms of ‖u‖_{S_T} in the worst case, here we provide a scale-invariant algorithm which almost achieves that. Precisely, we derive an algorithm with a regret bound expressed by f(‖u‖_{S_T}), with f defined as in (2), with only a logarithmic dependence on the size of the instances hidden in the constant β.
The algorithm is designed to minimize the right-hand side of (6). It maintains the negative past cumulative (linearized) loss vector h_t = −∑_{q≤t} g_q x_q, as well as the covariance matrix S_t. Furthermore, the algorithm also keeps track of a quantity Γ_t ≥ 0, recursively defined as:

Γ_t = Γ_{t−1} + g_t² x_t^⊤ S_t† x_t,  with Γ_0 = 0.

At the beginning of trial t, after observing x_t (and updating S_t), the algorithm predicts with the weight vector w_t defined in Algorithm 2. This choice of the update leads to the invariance of the algorithm's predictions under transformations of the form x_t → A x_t, t = 1, . . . , T, for any invertible matrix A (shown in Appendix D). The algorithm is a second-order method, reminiscent of the Online Newton Step algorithm (Hazan et al., 2007; Luo et al., 2016). Our algorithm, however, adaptively chooses the step size η_t ("learning rate") in each trial. Moreover, no projections are performed, which lets us reduce the runtime of the algorithm to O(d²) per trial (an efficient implementation is discussed at the end of this section). The pseudo-code is presented as Algorithm 2.
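As with the coordinate-wise case, the exact weight formula is not reproduced in this text, so the following sketch is only an illustration of the second-order state (h_t, S_t, Γ_t) with a placeholder learning rate, not the paper's Algorithm 2. Its point is the invariance mechanism: every scalar entering the weights is a quadratic form v^⊤ S_t† v with v in the range of S_t, so the predictions are unchanged under x_t → A x_t for any invertible A.

```python
import numpy as np

class FullScaleInvariant:
    """Illustrative sketch (NOT the paper's exact update). All scalars used
    in the weights are invariant quadratic forms v^T S^+ v with v in
    range(S), so predictions are invariant under x -> A x, A invertible."""

    def __init__(self, d, alpha=1.5):
        self.h = np.zeros(d)        # h_t = -sum_{q<=t} g_q x_q
        self.S = np.zeros((d, d))   # S_t = sum_{q<=t} x_q x_q^T
        self.Gamma = 0.0            # Gamma_t = sum_{q<=t} g_q^2 x_q^T S_q^+ x_q
        self.alpha = alpha

    def predict(self, x):
        # Call once per trial, before update(): folds x into S, forms weights.
        self.S = self.S + np.outer(x, x)
        Sp = np.linalg.pinv(self.S)
        # Placeholder unitless learning rate built from invariant scalars.
        eta = np.exp((self.h @ Sp @ self.h - self.Gamma) / (2 * self.alpha))
        w = eta * (Sp @ self.h)
        self._Sp = Sp               # cached for the Gamma update
        return w, float(x @ w)

    def update(self, x, g):
        # g is a subderivative of the loss at the prediction, |g| <= 1.
        self.Gamma += g ** 2 * float(x @ self._Sp @ x)
        self.h -= g * x
```

Under x → Ax we have S → A S A^⊤ and h → A h, so w transforms as A^{-⊤} w and the prediction x^⊤ w is unchanged.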
Algorithm 2 (pseudo-code, abridged): for t = 1, . . . , T: receive x_t, update S_t and Γ_t; predict ŷ_t = x_t^⊤ w_t; receive y_t and suffer loss ℓ(y_t, ŷ_t).

We now bound the regret of the algorithm. Define the potential function as:

ψ_t = exp( (h_t^⊤ S_t† h_t − Γ_t) / (2α) ).

We have the following result:

Lemma 7 In each trial t = 1, . . . , T, Algorithm 2 satisfies g_t ŷ_t ≤ ψ_{t−1} − ψ_t.

The proof is given in Appendix E. The choice of w_t in Algorithm 2 can be motivated as the one that guarantees bounding the loss of the algorithm by the drop of the potential function (note, however, that as in the coordinate-wise case, the weight vector is not equal to the negative gradient of the potential). Compared to Lemma 3, there is no overhead on the right-hand side; the overhead is instead hidden in the definition of ψ_t, in the quantity Γ_t. Applying Lemma 7 to each t = 1, . . . , T and summing over trials gives:

∑_{t=1}^T g_t ŷ_t ≤ ψ_0 − ψ_T = 1 − e^{−Γ_T/(2α)} exp( ‖h_T‖²_{S_T†} / (2α) ).

Identifying the left-hand side of the above with the right-hand side of (6) for β = e^{Γ_T/(2α)}, we obtain the following bound on the regret:

regret_T(u) ≤ f(‖u‖_{S_T}) + 1 ≤ ‖u‖_{S_T} √( Γ_T + α log(1 + α ‖u‖²_{S_T}) ) + 1,

where we used log(1 + ab) ≤ log(a + ab) = log a + log(1 + b) for a ≥ 1, applied to a = e^{Γ_T/α} ≥ 1. Thus, the algorithm achieves an essentially optimal (up to a logarithmic factor) scale-invariant bound expressed in terms of ‖u‖_{S_T}, with an additional overhead hidden in Γ_T.
How large can Γ_T be? By definition, Γ_T = ∑_t g_t² x_t^⊤ S_t† x_t; as g_t² x_t^⊤ S_t† x_t ≤ g_t² ≤ 1, Γ_T is at most T in the worst case, and the bound becomes Õ(‖u‖_{S_T} √T) (logarithmic factors dropped), which is what we expected given the negative result in Theorem 6. However, Γ_T can be much smaller in most practical cases, as it can be shown to grow only logarithmically with the size of the instances (Luo et al., 2016, notation translated to our setup):

Lemma 8 (Luo et al., 2016, Theorem 4) Let λ* be the minimum among the smallest nonzero eigenvalues of S_t (t = 1, . . . , T) and let r be the rank of S_T. We have: (1 + r)rλ*.
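The per-trial terms of Γ_T can be checked numerically: since x_t lies in the range of S_t (which includes x_t x_t^⊤), the quadratic form x_t^⊤ S_t† x_t never exceeds 1, no matter how wildly the instance scales vary. A small NumPy check (illustrative; the scale factors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 30
S = np.zeros((d, d))
Gamma = 0.0
for t in range(T):
    # Instances with wildly varying scales (units), chosen arbitrarily.
    x = rng.normal(size=d) * rng.choice([0.01, 1.0, 100.0])
    S += np.outer(x, x)
    # x_t lies in range(S_t), so the quadratic form is at most 1.
    q = float(x @ np.linalg.pinv(S) @ x)
    assert q <= 1.0 + 1e-6
    Gamma += q   # with |g_t| <= 1, Gamma_T is at most the sum of these terms
assert Gamma <= T
```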
Combining the above results, we thus get:

Theorem 9 For any comparator u and any sequence of outcomes {(x_t, y_t)}_{t=1}^T, Algorithm 2 achieves a regret bound of order ‖u‖_{S_T} √( Γ_T + α log(1 + α ‖u‖²_{S_T}) ), where Γ_T is bounded as in Lemma 8, with λ* being the minimum among the smallest nonzero eigenvalues of S_t (t = 1, . . . , T) and r being the rank of S_T.
We finally note that the dependence on the dimension d in the bound (through the dependence on the rank r in Γ_T) cannot be eliminated, as Luo et al. (2016, Theorem 1) show that in a setting in which the predictions of u are constrained to be at most C, any algorithm will suffer regret at least Ω(C √(dT)).⁴

Efficient implementation. The dominating cost in Algorithm 2 is the computation of the pseudoinverse S_t† in each trial after performing the update S_t = S_{t−1} + x_t x_t^⊤, which can take O(d³) time. However, we can improve the computational cost per trial to O(d²) by noticing that S_t itself is never used by the algorithm, so it suffices to store and directly update S_t† using a rank-one update procedure in the spirit of the Sherman-Morrison formula, which takes O(d²). The procedure is highlighted in the proof of Lemma 7 in Appendix E.
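For the invertible case, the rank-one inverse update can be sketched as follows (the algorithm itself starts from a singular S and maintains the pseudoinverse, which needs the extra range bookkeeping from Appendix E; the identity initialization below is only for the demonstration):

```python
import numpy as np

def sherman_morrison_update(S_inv, x):
    """O(d^2) update of S^{-1} after the rank-one update S <- S + x x^T.
    Valid when S is invertible; the pseudoinverse case needs extra care."""
    Sx = S_inv @ x
    return S_inv - np.outer(Sx, Sx) / (1.0 + x @ Sx)

rng = np.random.default_rng(3)
d = 5
S = np.eye(d)               # start from an invertible matrix (demo only)
S_inv = np.linalg.inv(S)
for _ in range(20):
    x = rng.normal(size=d)
    S += np.outer(x, x)
    S_inv = sherman_morrison_update(S_inv, x)

# The incrementally maintained inverse matches the directly computed one.
assert np.allclose(S_inv, np.linalg.inv(S))
```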

Conclusions
We considered unconstrained online convex optimization, exploiting a natural scale invariance symmetry: the predictions of the optimal comparator (weight vector) are invariant under any linear transformation of the instances (input vectors).

4. We can, however, improve the dependence on d to O(√d) by modifying the algorithm to play with S̃_t = εI + S_t for ε > 0, and applying the bound on ∑_t x_t^⊤ S̃_t^{−1} x_t from Cesa-Bianchi and Lugosi (2006), Theorem 11.7. This would, however, come at the price of losing the scale invariance of the algorithm.

Thus, the scale of the
weight vector is only relative to the scale of the instances, and we aimed at designing online algorithms which also enjoy this property, i.e. are scale-invariant. We first considered the case of coordinate-wise invariance, in which the individual coordinates (features) can be arbitrarily rescaled. We gave an algorithm which achieves an essentially optimal (up to a logarithmic factor) regret bound in this setup, expressed by means of a coordinate-wise scale-invariant norm of the comparator. We then moved to general (full) invariance with respect to arbitrary linear transformations. We first gave a negative result, showing that no algorithm can achieve a meaningful bound in terms of a scale-invariant norm of the comparator in the worst case. Next, we complemented this result with a positive one, providing an algorithm which "almost" achieves the desired bound, incurring only a logarithmic overhead in terms of the norm of the instances.
In future research, we plan to test the introduced algorithms in computational experiments, to verify how their performance relates to existing online methods from past work (Zinkevich, 2003; Ross et al., 2013; Orabona et al., 2015; Luo et al., 2016).
Thus, we have shown the lemma for trials t = 1, . . . , t₀. We can now assume that s_{t−1,i} > 0 and prove the lemma for the remaining trials t > t₀. By using the definition of w_{t,i} from (5), we need to show:

where we recall that h_{t,i} = h_{t−1,i} − g_t x_{t,i} and s_{t,i}² = s_{t−1,i}² + x_{t,i}². First note that the left-hand side is convex in g_t, and hence it is maximized for g_t ∈ {−1, 1}. As the right-hand side does not depend on g_t, it suffices to show that the inequality holds for g_t ∈ {−1, 1}. Furthermore, as the inequality depends on g_t only through the product g_t x_{t,i}, we assume without loss of generality that x_{t,i} ≥ 0 (the sign can always be incorporated into g_t). We now simplify the notation. Define:

Note that by definition γ ∈ (0, 1]. In this notation, we have:

where we used g_t² = 1. Using the new notation in (8) and multiplying both sides by βt, we equivalently get (9). Let us denote the left-hand side of (9) as A. We have:

We need the following result, which is proved in Appendix F:

Lemma 10 Let α₀ = 9/8. For all x ∈ R it holds that x + e^{−x} ≤ e^{α₀ x²/2}.
We thus jointly upper bound B(γ) ≤ max{v², 1/(1 − a)}, which results in a bound on A that verifies (9) and finishes the proof.

Appendix B. Proof of Lemma 5
Without loss of generality assume A has k ≥ 1 strictly positive eigenvalues (the remaining eigenvalues being zero), as otherwise (for k = 0) range(A) = {0} and there is nothing to show. Let A = V Σ V^⊤, with V ∈ R^{d×k} and Σ = diag(λ₁, . . . , λ_k), be the 'thin' eigenvalue decomposition of A (i.e., the eigendecomposition without the explicit appearance of eigenvectors with zero eigenvalues). Since y ∈ range(A), there exists ỹ ∈ R^d such that y = Aỹ, and therefore y = V Σ V^⊤ ỹ = V Σ^{1/2} z, where z = Σ^{1/2} V^⊤ ỹ ∈ R^k.
We have:

sup_u { y^⊤ u − f(‖u‖_A) } = sup_{ū ∈ R^k} { ū^⊤ z − f(‖ū‖) },

where we reparameterized Σ^{1/2} V^⊤ u as ū ∈ R^k. Now, note that keeping the norm ‖ū‖ fixed, the supremum is achieved by ū in the direction of z; therefore, without loss of generality assume ū = β z/‖z‖ for some β ≥ 0. This means that:

sup_u { y^⊤ u − f(‖u‖_A) } = sup_{β ≥ 0} { β ‖z‖ − f(β) },

which finishes the proof of the first part of the lemma. For the second part, we only need h_T ∈ range(S_T), a well-known fact, which we show below for completeness. Let v be any eigenvector of S_T associated with a zero eigenvalue, so that v^⊤ S_T v = 0. Using the definition of S_T, we get 0 = v^⊤ S_T v = ∑_{t=1}^T (x_t^⊤ v)², so that x_t^⊤ v = 0 for all t. But since h_T = −∑_{t=1}^T g_t x_t, this also means h_T^⊤ v = 0. Let v₁, . . . , v_k be the eigenvectors of S_T associated with all non-zero eigenvalues λ₁, . . . , λ_k. By the previous argument, h_T ∈ span{v₁, . . . , v_k}, i.e. h_T = ∑_{i=1}^k α_i v_i. Choosing the vector z̃ = ∑_{i=1}^k (α_i/λ_i) v_i reveals that S_T z̃ = ∑_{i=1}^k α_i v_i = h_T, which shows that h_T ∈ range(S_T). This finishes the proof.
Note: If we chose α₀ = 2, the proof of the lemma would simplify dramatically, as we would only need to separately bound x ∈ (−∞, −1), x ∈ [−1, 0], and x ∈ (0, ∞). However, we opted for the smallest α₀, as a smaller α₀ translates to a smaller achievable constant in the regret bound. Our choice α₀ = 9/8 was obtained by performing numerical tests, which showed that this value is very close to the smallest α₀ for which the inequality still holds.