Online Convex Optimization Using Coordinate Descent Algorithms

This paper considers the problem of online optimization where the objective function is time-varying. In particular, we extend coordinate descent algorithms to the online case, where the objective function varies after a finite number of iterations of the algorithm. Instead of solving the problem exactly at each time step, we apply only a finite number of iterations per time step. Commonly used notions of regret are employed to measure the performance of the online algorithms. Moreover, coordinate descent algorithms with different updating rules are considered, including both deterministic and stochastic rules developed in the literature on classical offline optimization. A thorough regret analysis is given for each case. Finally, numerical simulations are provided to illustrate the theoretical results.


Introduction
Online learning [29], resource allocation [9], demand response in power systems [14], and localization of moving targets [2] are just a few examples where online convex optimization (OCO) has been applied. In the problem setup of OCO, the objective functions are time-varying and are not available to the decision maker a priori. At each time instance, after an update of the decision variable, new information on the latest cost function is made available to the decision maker. The objective of the decision maker is to minimize the objective function over time. One commonly used performance measure of an online optimization algorithm is the notion of regret, which measures the suboptimality of the algorithm compared to the outcome generated by the best decision at each time step.
In the seminal work of [41], the method of online gradient descent is proposed for OCO problems, where at each time step the decision maker performs one gradient descent step using the latest available information. A static regret upper bound that is sublinear in $T$ is proved, where $T$ is the length of the horizon. Under stronger assumptions on the cost functions, such as strong convexity, an improved logarithmic regret bound can be achieved [10,11,23]. If future information is available, it can be used to further improve the performance of the online optimization algorithm in terms of regret bounds. The work [15] introduces an additional predictive step following the algorithm developed in [41] if certain conditions on the estimated gradient and descent direction are met. Similar algorithms have also been extended to cases where zeroth order [6,26,30,32,36] and second order [16] oracles are used instead of (sub)gradients. The works [6,36] on bandit feedback consider the situation where there are time-varying inequality constraints. In such cases the algorithms proposed in [41] are hard to implement because of the high computational demand of the projection operation. This motivates recent research on online optimization algorithms with time-varying constraints, including the primal-dual algorithm proposed in [13,37] and a modified saddle-point method given in [9]. Other algorithms have been proposed to handle stochastic constraints [38] and to cover continuous-time applications [27]. In the case where only the values of the cost function, rather than its exact form, are revealed to the decision maker, bandit-feedback-based online algorithms [6,36] can be used to solve the problem; other methods such as the forward gradient [22] have also been proposed recently to deal with this issue. The need for applications in large-scale systems has also led to extensive research on distributed OCO. Distributed online algorithms that achieve sublinear regret bounds for convex optimization problems with static constraints can be found in [12,28,33]. For instance, [28] proposes a distributed version of the dynamic mirror descent algorithm, which is a generalization of the classical gradient descent methods suitable for high-dimensional optimization problems. The work [19] proposes distributed online primal-dual algorithms for optimization problems with static coupled inequality constraints, while the work [35] studies distributed online convex optimization with time-varying inequality constraints in the discrete-time setting. For a more detailed documentation of recent advances in online optimization, we refer the readers to the survey paper [17].
To the best of our knowledge, coordinate descent [31], an important class of optimization algorithms, has not been sufficiently analyzed in the online optimization community. In coordinate descent algorithms, most components of the decision variable are fixed during one iteration while the cost function is minimized with respect to the remaining components. The resulting problem is lower-dimensional and often much easier to solve. Thus, coordinate descent algorithms have great potential in applications such as machine learning, where an iteration with full gradient information is computationally expensive. In [24], it is shown that coordinate descent can be very efficient for huge-scale problems. Another situation where coordinate descent is useful is dual-decomposition-based methods for distributed optimization; see [21] and references therein. Specifically, the dual problem of multiagent optimal consensus results in a sum of functions with very loose coupling between the dual variables. Calculation of one component of the gradient of the dual function only involves computations and communications of a pair of agents (or processors). Moreover, it can also be implemented in a parallel fashion, as shown in [3]. Therefore, considerable effort has been devoted recently to developing theoretical performance guarantees for various coordinate descent algorithms [31]. In this paper, we aim to extend this appealing class of algorithms to OCO problems by providing an in-depth regret analysis for different types of online coordinate descent algorithms.
The main contributions of the paper can be summarized as follows. First, we extend the coordinate descent algorithms considered in [31] to the online case and provide their regret analysis. To the best of our knowledge, this is the first attempt to use coordinate descent methods to solve OCO problems. Second, we provide an in-depth regret analysis of coordinate descent algorithms with different updating rules, such as cyclic and random updating rules. Specifically, we consider both random and deterministic online coordinate descent algorithms under assumptions commonly used in the literature. In particular, most existing work on OCO is based on extensions of offline algorithms that monotonically reduce the distance from the decision variable to the set of solutions at each iteration; an example is the well-known online gradient descent [11,41]. However, the offline deterministic coordinate descent algorithm, although it has provable convergence to the set of solutions, does not necessarily produce an iterate that is closer to the set of solutions at each iteration. We overcome this issue by using predictive-like updates at each time step, detailed in Section 5. Lastly, we show that the regret bounds achieved by our online coordinate descent algorithms are comparable to those achieved in the literature by centralized full-gradient-based online algorithms.
We summarize the theoretical regret upper bounds proved in Theorems 1 to 7 for Algorithms 1, 4 and 5 in Table 1, where C stands for convex cases and SC stands for strongly convex cases.
The regret bounds summarized in Table 1 are consistent with the regret bounds of full-gradient-based online optimization algorithms proved in the existing literature [7,11,23,29] under similar settings. Our dynamic regret bounds for strongly convex functions proved in Theorems 6 and 7 might need multiple updates at each time $t$. This setup is also adopted in some existing works, including [8,40], to achieve less conservative regret bounds. It should also be noted that, although the online algorithms with different update rules share regret bounds of the same order, the exact coefficients of the regret bounds may still differ.
The rest of the paper is organized as follows. The problem formulation is presented in Section 2. The online coordinate descent algorithms considered in this paper are given in Section 3. Regret bounds for random online coordinate descent algorithms are given in Section 4, followed by regret bounds for deterministic online coordinate descent algorithms in Section 5. The numerical simulation is given in Section 6. Finally, the results presented in this paper are summarized in Section 7.
Notation: Let $\mathbb{R}$ be the set of real numbers, $\mathbb{R}^n$ the $n$-dimensional Euclidean space, and $\mathbb{R}_{\ge 0}$ (resp. $\mathbb{R}_{>0}$) the set of non-negative (resp. positive) real numbers. For $x, y \in \mathbb{R}^n$, $\langle x, y \rangle$ denotes the inner product in $\mathbb{R}^n$. The set of non-negative (resp. positive) integers is denoted by $\mathbb{Z}_{\ge 0}$ (resp. $\mathbb{Z}_{>0}$). Furthermore, $|\cdot|$ denotes the Euclidean norm of a vector in $\mathbb{R}^n$. The matrix $I_n$ denotes the $n$-dimensional identity matrix, and $n$ is omitted when the dimension is clear. For a given vector $x$, $x_{(i)}$ denotes the $i$-th component of $x$, $x_i$ denotes the value of $x$ at time $i$, and $x_{(i),j}$ denotes the value of the $i$-th component of $x$ at time $j$. All random variables considered in this paper are defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the sample space, the $\sigma$-field $\mathcal{F}$ is the set of events, and $\mathbb{P}$ is the probability measure defined on $(\Omega, \mathcal{F})$.

Problem formulation
We consider the following optimization problem:
$$\min_{x \in \Theta} f_t(x), \quad t \in \mathbb{Z}_{\ge 0}, \qquad (1)$$
where $f_t : \mathbb{R}^n \to \mathbb{R}$ is the convex cost function at time $t$, $x \in \mathbb{R}^n$ is the decision variable, and $\Theta \subseteq \mathbb{R}^n$ is the non-empty closed convex constraint set. For simplicity, we assume $f_t$ is continuously differentiable for any $t \in \mathbb{Z}_{\ge 0}$. Moreover, let the decision variable $x_t$ at any given time $t$ be partitioned as $x_t = [x_{(1),t}^T, x_{(2),t}^T, \ldots, x_{(P),t}^T]^T$, with $x_{(p),t} \in \mathbb{R}^{n_p}$ and $\sum_{p=1}^P n_p = n$. The $P$ components of the vector are assigned to $P$ individual processors. For any integer $p \in \{1, \ldots, P\}$, processor $p$ is responsible for updating its own "local" decision variable $x_{(p),t}$ [3].
Denote the minimizer of $f_t(x)$ at time $t$ by $x^*_t \in \Theta$. We use a notion of regret to measure the processors' ability to track $x^*_t$. Regret refers to a quantity that measures the overall difference between the cost incurred by an algorithm and the cost at the best possible point from an offline view up to a horizon $T$. Two notions of regret commonly considered in the literature are static regret and dynamic regret. The static regret is defined as
$$R^s_T := \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*), \qquad (2)$$
where $x^* := \arg\min_{x \in \Theta} \sum_{t=1}^T f_t(x)$ for a given $T$; the subscript $T$ of $x^*$ is omitted for convenience. On the other hand, the dynamic regret is defined as
$$R^d_T := \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*_t). \qquad (3)$$
Remark 1 Static regret is a useful performance metric in applications such as static parameter estimation, where the variable of interest is static. However, if the variable of interest evolves over time (e.g. tracking moving targets), the notion of dynamic regret makes more sense than its static counterpart. It can be seen from (2) and (3) that regret captures the accumulation of errors due to the fact that the optimization problems are not solved exactly at each step. If the regret is sublinear in $T$, then the average accumulated error $R^s_T/T$ (or $R^d_T/T$) converges to 0 as $T \to \infty$. This further implies that $x_t$ converges to $x^*$ (or $x^*_t$). Although it is the more appropriate performance metric in most applications, dynamic regret does not have a provable bound sublinear in $T$ in general. To obtain a sublinear regret bound, additional regularity assumptions such as bounded variation of the environment are typically required; see [28] for example.
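As a concrete illustration of the two definitions, the following sketch computes both regrets for a toy sequence of quadratic costs; the drifting-minimizer setup and all names here are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 50, 3

# Hypothetical time-varying quadratic costs f_t(x) = 0.5*|x - c_t|^2,
# whose per-step minimizers x_t^* = c_t drift over time.
c = np.cumsum(0.1 * rng.standard_normal((T, n)), axis=0)
f = lambda x, t: 0.5 * np.sum((x - c[t]) ** 2)

# Placeholder decisions produced by "some" online algorithm.
x_hist = c + rng.standard_normal((T, n))

# Static regret: compare against the single best fixed point in hindsight,
# which for these quadratics is the mean of the c_t.
x_star = c.mean(axis=0)
static_regret = sum(f(x_hist[t], t) - f(x_star, t) for t in range(T))

# Dynamic regret: compare against the per-step minimizer x_t^* = c_t.
dynamic_regret = sum(f(x_hist[t], t) - f(c[t], t) for t in range(T))
```

Since the per-step comparator is optimal at every step, the dynamic regret dominates the static one, matching the discussion in Remark 1.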
If stochastic algorithms are considered, similar notions of regret can be defined via expectations. With some abuse of notation, $R^s_T$ and $R^d_T$ are used for both random and deterministic cases. This does not lead to confusion, since it will be clear from the context whether an algorithm is stochastic or not.

Online coordinate descent algorithms
We construct our online block coordinate descent algorithms following the setup in [31]. At time $t$, we select one component of $x_t$ to update; the updating component at time $t$ is denoted by $i_t$. If the $i$-th component is updating, then the update for component $i$ is given by
$$x_{(i),t+1} = x_{(i),t} - \alpha_t \nabla_{(i)} f_t(x_t),$$
where $\alpha_t$ is the stepsize at time $t$ and $\nabla_{(i)} f_t(x_t)$ is the $i$-th component of the gradient of $f_t$ evaluated at $x_t$. For any $p \ne i_t$, we have $x_{(p),t+1} = x_{(p),t}$. Then $x_{t+1}$ is projected onto $\Theta$. We define the matrix $U_t = \mathrm{diag}(U_{(1),t}, U_{(2),t}, \ldots, U_{(P),t})$, where for any $t \in \mathbb{Z}_{\ge 0}$ and $1 \le p \le P$, $U_{(p),t} \in \{I_{n_p}, 0_{n_p}\}$, with $I_{n_p}$ and $0_{n_p}$ denoting the identity and zero matrices of dimension $n_p$, respectively. Then the updates of the coordinate descent algorithm at time $t$ can be written as
$$x_{t+1} = \Pi_\Theta\big(x_t - \alpha_t U_t \nabla f_t(x_t)\big),$$
where $\Pi_\Theta(\cdot)$ denotes the projection onto $\Theta$, which is well defined by the closedness and convexity of $\Theta$. The following non-expansive property of the projection operator will be used extensively throughout the paper.
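A minimal sketch of this projected coordinate step, assuming scalar blocks and taking a box constraint as an illustrative choice of $\Theta$ (the function names and numbers are assumptions for the example):

```python
import numpy as np

def ocd_step(x, grad, i_t, alpha, project):
    """One online coordinate descent step: move coordinate i_t, then project."""
    x_new = x.copy()
    x_new[i_t] = x[i_t] - alpha * grad(x)[i_t]   # update selected coordinate only
    return project(x_new)                        # projection Pi_Theta

# Example: Theta = [-1, 1]^3 and f_t(x) = 0.5*|x - c|^2 (time-invariant here).
project_box = lambda y: np.clip(y, -1.0, 1.0)
c = np.array([2.0, -0.5, 0.3])
grad = lambda x: x - c

x = np.zeros(3)
for t in range(100):
    x = ocd_step(x, grad, t % 3, 0.5, project_box)  # cyclic selection
# x approaches the projected minimizer clip(c, -1, 1) = [1.0, -0.5, 0.3]
```

The first coordinate is driven toward 2 but is clipped at the boundary of the box, which is exactly the role of the projection in the update above.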
Lemma 1 [4, Proposition 3.2.1] Let $\Theta$ be a non-empty closed convex set. Then, for any $x, y \in \mathbb{R}^n$,
$$|\Pi_\Theta(x) - \Pi_\Theta(y)| \le |x - y|.$$
In this paper, we consider the following three commonly used schemes for selecting the updating component $i_t$.
• The random coordinate descent algorithm. In this case, $i_t$ is selected randomly with equal probability, independently of the selections made at previous iterations.
• The cyclic coordinate descent algorithm. In this case, the updating component $i_t$ is selected in a predetermined cyclic fashion: $i_{t+1} = (i_t \bmod P) + 1$.
• The coordinate descent algorithm with the Gauss-Southwell rule. In this case, the updating component is chosen as $i_t \in \arg\max_i |\nabla_{(i)} f_t(x_t)|$.
In this paper, we consider the case where only one coordinate is allowed to change per iteration. The results on random coordinate descent can be extended to other cases where the selection probabilities are unequal, with potentially overlapping components [20], such as the random sleep scheme [34]. Intuitively speaking, if for any $t$ the expectation of the update direction takes the form $\Gamma \nabla f_t(x)$ for a given positive definite diagonal matrix $\Gamma$, then the analysis in this work can be applied with mild modifications. Our analysis for the deterministic cases cannot be trivially extended to cover overlapping components. We aim to address this topic in future research.
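The three selection rules can be sketched as follows, assuming scalar blocks and 0-based indices (the paper's cyclic rule $i_{t+1} = (i_t \bmod P) + 1$ is stated with 1-based indices):

```python
import numpy as np

def select_random(P, rng):
    """Random rule: uniform over {0, ..., P-1}, independent across iterations."""
    return int(rng.integers(P))

def select_cyclic(i_prev, P):
    """Cyclic rule: predetermined order, 0-based version of (i mod P) + 1."""
    return (i_prev + 1) % P

def select_gauss_southwell(grad_t):
    """Gauss-Southwell rule: coordinate with the largest gradient magnitude."""
    return int(np.argmax(np.abs(grad_t)))
```

Note that the Gauss-Southwell rule needs the full gradient at $x_t$ just to pick the coordinate, which is the extra cost mentioned in Remark 2.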
Remark 2 A substantial review of variants of coordinate descent algorithms can be found in [4, Section 6.5.1]. The cyclic selection of coordinates is normally assumed to ensure convergence of the algorithm. On the other hand, the use of an irregular order has been considered by researchers to accelerate convergence. In particular, it is shown in [31] that randomization leads to faster convergence in expectation; obviously, this is not guaranteed for each instance of the algorithm. The Gauss-Southwell method leads to faster convergence at the cost of extra computations and evaluations of gradients during the selection of coordinates, which can be an issue in large-scale problems [25].

Regret bounds for online coordinate descent algorithms with random coordinate selection rules
The online random coordinate descent algorithm considered in this section is given in Algorithm 1.

Static regret for convex functions
Before we state the main results, we first list the assumptions.
Assumption 1 For any given $t \in \mathbb{Z}_{>0}$, $x \in \Theta$ and $x^* := \arg\min_{x \in \Theta} \sum_{t=1}^T f_t(x)$, the following inequalities hold uniformly in $t$ and $T$ for problem (1), where $x_t$ is the decision variable at time $t$:
(i) $|\nabla f_t(x)| \le G$;
(ii) $|x_t - x^*| \le R$.
Remark 3 Item (ii) of Assumption 1 can be ensured if the constraint set $\Theta$ is bounded. Moreover, when $\Theta$ is bounded, item (i) of Assumption 1 holds in many cases, including linear regression and logistic regression [5]. By [29, Lemma 2.6], Assumption 1 (i) implies that $f_t$ is also Lipschitz continuous uniformly in $t$ over $\Theta$. These assumptions are quite standard in the literature on online optimization [5-7, 15, 17, 18, 28, 37, 39].
Before we state the first main result of the paper, we present the so-called doubling trick stepsize rule, which is introduced in [29, Section 2.3.1].

Definition 1 (Doubling trick stepsize rule) For $q = 0, 1, \ldots, \lceil \log_2 T \rceil$, set $\alpha_t = 1/\sqrt{2^q}$ for all $t$ in the period $t = 2^q, 2^q + 1, \ldots, 2^{q+1} - 1$.

Remark 4
The doubling trick scheme of stepsize choices is particularly useful when stepsizes of the type $1/\sqrt{T}$ are needed to achieve a desirable regret bound. Since in applications it is often unrealistic to know the horizon $T$ in advance, by using the doubling trick scheme, regret bounds of the same order can be achieved without explicit knowledge of $T$. The same trick will be used extensively throughout the paper to derive regret bounds. It should also be noted that the doubling trick scheme in general does not result in a regret bound of better order compared to other stepsize rules [41].
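A sketch of the schedule: within the $q$-th period the algorithm simply runs with the constant stepsize it would use if the horizon were $2^q$ (the function name is an illustrative assumption):

```python
import math

def doubling_stepsize(t):
    """Stepsize 1/sqrt(2^q) for t in the q-th period {2^q, ..., 2^(q+1)-1}."""
    q = int(math.floor(math.log2(t)))   # period index, valid for t >= 1
    return 1.0 / math.sqrt(2 ** q)

# Periods: {1}, {2,3}, {4,...,7}, {8,...,15}, ... — the stepsize is constant
# within each period and halves its "effective horizon" rule at each doubling.
schedule = [doubling_stepsize(t) for t in range(1, 9)]
```

Since $t \ge 2^q$ inside the $q$-th period, this schedule satisfies $\alpha_t \ge 1/\sqrt{t}$, the property used in the proof of Theorem 1, and requires no knowledge of the true horizon $T$.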
The following result states that, under Assumption 1, if the stepsize at each iteration is chosen by the doubling trick scheme, then there is an upper bound on the static regret defined in (4). Moreover, the upper bound is of order $O(\sqrt{T})$ for convex costs.
Theorem 1 Suppose Assumption 1 holds and the stepsize is chosen according to Definition 1. Then the static regret (4) achieved by Algorithm 1 satisfies an upper bound of order $O(\sqrt{T})$.

PROOF. From (8) and Lemma 1, we obtain the corresponding bound for any $x \in \Theta$. Given $x_t$, denote the $\sigma$-field containing the past data of Algorithm 1 up to time $t$ by $\mathcal{F}_t$. Then, by convexity of $f_t$, we obtain (10). Substituting $x^* := \arg\min_{x \in \Theta} \sum_{t=1}^T f_t(x)$ into (10), taking total expectations and rearranging (10) leads to (12). By Assumption 1, the gradient of $f_t$ is uniformly bounded by $G$ and $|x_t - x^*|$ is uniformly upper bounded by $R$. Consequently, (13) follows. In the last inequality of (13), the property $\alpha_T \ge 1/\sqrt{T}$ is used, which is a direct consequence of the definition of the stepsize rule. On the other hand, the remaining term in (12) can be upper bounded as in (14). To derive the regret bound, we again use the properties of the doubling trick scheme. First, we set the horizon to be some known constant $T^*$ and choose the constant stepsize $1/\sqrt{T^*}$. The claimed bound is then immediate for each period, since for $q = 0, 1, \ldots, \lceil \log_2 T \rceil$ we have $\alpha_t = 1/\sqrt{2^q}$ in each period of $2^q$ iterations $t = 2^q, 2^q + 1, \ldots, 2^{q+1} - 1$, that is, $T^* = 2^q$. Summing (14) over these periods for any given $T$ completes the proof. ✷

Remark 5
In Theorem 1, the differentiability of $f_t$ is in fact never used. Thus the result applies to the case where subgradients are used when $f_t$ is not differentiable for some $t$, as long as item (i) of Assumption 1 is satisfied by the subgradients used in the iterations. Moreover, the regret bound established in Theorem 1 is of the same order as the one established in [29] for online gradient descent under the same assumptions.

Static regret for strongly convex functions
In this part, we show that if the cost functions are strongly convex, then we can achieve an improved static regret bound for the online random coordinate descent algorithms.
Assumption 2 For any $t \in \mathbb{Z}_{\ge 0}$, the function $f_t$ is uniformly $\mu$-strongly convex, i.e., there exists $\mu > 0$ such that for any $x, y \in \Theta$,
$$f_t(y) \ge f_t(x) + \langle \nabla f_t(x), y - x \rangle + \frac{\mu}{2} |y - x|^2.$$

Theorem 2 Suppose Assumptions 1 (i) and 2 hold, and the stepsize is chosen as $\alpha_t = \frac{P}{\mu t}$. Then the static regret (4) achieved by Algorithm 1 satisfies an upper bound of order $O(\log T)$.

PROOF. Similar to (10), given $x_t$, we have the corresponding relationship (18) for any $x \in \Theta$. Substituting $x^*$ into (18) and rearranging the terms, the resulting inequality can be bounded using the fact that the gradient of $f_t$ is uniformly bounded by $G$. Choosing $\alpha_t = \frac{P}{\mu t}$ and expanding the sum then yields the claimed bound. Thus, the proof is complete. ✷

By making the extra assumption of strong convexity, the sublinear static regret bound of order $O(\sqrt{T})$ established in Theorem 1 is improved to order $O(\log T)$, which is consistent with the regret bound for online gradient descent algorithms under the same assumptions established in [11].
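The logarithmic order ultimately comes from the harmonic sum generated by the stepsize $\alpha_t = P/(\mu t)$; a quick numerical check of the standard bounds $\log(T+1) \le \sum_{t=1}^T 1/t \le 1 + \log T$ used when expanding the sum:

```python
import math

def harmonic(T):
    """Partial harmonic sum sum_{t=1}^T 1/t."""
    return sum(1.0 / t for t in range(1, T + 1))

# With alpha_t = P/(mu*t), the accumulated stepsize term scales with
# harmonic(T), so the static regret grows like log(T) rather than sqrt(T).
```

The same harmonic-sum bound reappears in Corollaries 1 and 2 for the deterministic algorithms.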

Dynamic regret for convex functions
Now, we provide an upper bound on the dynamic regret of the online coordinate descent algorithm. First, we define the following measure of the variation of problem (1), which is commonly used in the literature [23,28]:
$$C_T := \sum_{t=1}^T |x^*_t - x^*_{t-1}|,$$
where $x^*_t$ is a solution to the optimization problem at time $t$ and $x^*_0$ can be an arbitrary constant. The term $C_T$ captures the accumulated variation of the optimal points at two consecutive time instances over time. If the cost functions are general convex functions and an upper bound on $C_T$ is known, the following results can be stated.
Theorem 3 Suppose Assumption 1 holds, and the stepsize is chosen as $\alpha_t = \sqrt{C_T/T}$. Then the dynamic regret (5) achieved by Algorithm 1 satisfies an upper bound of order $O(\sqrt{C_T T})$.

PROOF. Substituting $x^*_t$ into (10) for any $t \ge 2$ yields the corresponding inequality. Summing over $t = 1, \ldots, T$ and choosing the stepsize $\sqrt{C_T/T}$ gives the stated bound. ✷

Remark 6 As discussed in Remark 4, the choice of stepsize $\sqrt{C_T/T}$ can be replaced by the doubling trick scheme (see Definition 1) to derive a regret bound of the same order. Moreover, the exact value of $C_T$ is not necessary to derive the regret bound given in Theorem 3. If $C_T$ is not available, a stepsize of the same order as $\sqrt{C_T/T}$ can be used at the cost of constant-factor errors. We refer readers to [7, Theorem 1] for a more detailed discussion on how to implement stepsizes of the form $\sqrt{C_T/T}$ and obtain an upper bound on $C_T$ in practice using limited information. The inclusion of $C_T$ in the dynamic regret bound is common in the existing literature [7,28,37]. As argued in [7], if $C_T$ is sublinear in $T$, then the overall dynamic regret bound in Theorem 3 is sublinear in $T$, which is desired and consistent with the dynamic regret bounds established in [7] for full-gradient-based online algorithms. To see this, if the variation of the minimizers decreases with order $1/t$, then $C_T = O(\log T)$. Similarly, if the variation of the minimizers decreases with order $1/t^q$ with $0 < q < 1$, then $C_T = O(T^{1-q})$. In the worst case, by Assumption 1, $C_T = O(T)$, which means the corresponding online algorithm incurs a steady-state tracking error. Moreover, this error decreases with the constant bound on the variation of the minimizers.
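The growth of the path variation $C_T = \sum_{t=1}^T |x_t^* - x_{t-1}^*|$ under the drift rates just discussed can be checked numerically (the minimizer trajectories below are illustrative assumptions):

```python
import numpy as np

def path_variation(traj):
    """C_T for a trajectory of minimizers (rows = x_0^*, x_1^*, ..., x_T^*)."""
    return float(np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1)))

T = 1000
# Drift |x_t^* - x_{t-1}^*| = 1/t gives C_T = O(log T) ...
traj_slow = np.cumsum(np.concatenate([[0.0], 1.0 / np.arange(1, T + 1)]))[:, None]
# ... while drift 1/sqrt(t) (q = 1/2) gives C_T = O(sqrt(T)).
traj_fast = np.cumsum(np.concatenate([[0.0], 1.0 / np.sqrt(np.arange(1, T + 1))]))[:, None]

C_slow = path_variation(traj_slow)
C_fast = path_variation(traj_fast)
```

In both cases $C_T$ is sublinear in $T$, so the $O(\sqrt{C_T T})$ bound of Theorem 3 is sublinear as well.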

Dynamic regret for strongly convex functions
Now, we consider the dynamic regret of the online random coordinate descent algorithm for strongly convex functions. As before, we consider $\mu$-strongly convex functions. In addition, we make the following assumption, commonly seen in the online optimization literature [18,23].
Assumption 3 For any $t \in \mathbb{Z}_{\ge 0}$, the gradient of $f_t$ is uniformly Lipschitz continuous, i.e., there exists $L > 0$ such that for any $x, y \in \Theta$,
$$|\nabla f_t(x) - \nabla f_t(y)| \le L |x - y|.$$
It is shown in [4, Proposition 6.1.2] that under the conditions stated in Assumption 3, (23) is equivalent to
$$f_t(y) \le f_t(x) + \langle \nabla f_t(x), y - x \rangle + \frac{L}{2} |y - x|^2.$$
The main result regarding the dynamic regret bound of Algorithm 1 for strongly convex functions with Lipschitz gradients is stated as follows.
Assumption 1 (i) implies $|f_t(x) - f_t(y)| \le G|x - y|$ for any $t \in \mathbb{Z}_{\ge 0}$ and any $x, y \in \Theta$. As a result, the stochastic dynamic regret is bounded accordingly. Moreover, we have the relationship (30). Now, taking total expectations, using (29), and summing up both sides of (30) from $t = 1$ to $t = T - 1$ yields (31). Rearranging (31) leads to the stated bound.

✷
Remark 7 By assuming uniform strong convexity and uniform Lipschitz continuity of the gradients of $f_t$, Theorem 4 improves the order of the theoretical dynamic regret bound from $O(\sqrt{C_T T})$ in Theorem 3 to $O(C_T)$, which is consistent with the results shown in [23] for full-gradient-based online algorithms. This means that if problem (1) does not change over time, i.e., $f_t = f$ for all $t$, then $C_T = 0$ and the regret grows at rate $O(1)$, which is consistent with the convergence results for coordinate descent algorithms in the offline setting [31]. As item (ii) of Assumption 1 is not required for proving Theorem 4, the regret bound is applicable to unconstrained problems with bounded gradients.

Regret bounds for online coordinate descent algorithms with deterministic coordinate selection rules
We consider two deterministic online coordinate descent algorithms in this section and they are given in Algorithms 2 and 3, respectively.
Algorithm 2 Online cyclic coordinate descent algorithm
1: Initialization: $x_0$, $i_0$.
2: Select coordinate: $i_t \leftarrow (i_{t-1} \bmod P) + 1$.

Note that both Algorithms 2 and 3 are deterministic and update one component of the decision variable per iteration. We can therefore relate the regrets achievable by Algorithms 2 and 3 to the regret achievable by the online projected gradient descent algorithm, which takes the form
$$x_{t+1} = \Pi_\Theta\big(x_t - \alpha_t \nabla f_t(x_t)\big). \qquad (32)$$
It can easily be seen that Algorithms 2 and 3 take the following form:
$$x_{t+1} = \Pi_\Theta\big(x_t - \alpha_t \nabla f_t(x_t) + \alpha_t e_t\big), \qquad (33)$$
where the vector $e_t$ has entries equal to $\nabla_{(i)} f_t(x_t)$ for the components $i$ that are not updating at time $t$, while all other entries are 0; it captures the effect of the components of the gradient that are not updating. We use $\bar{R}^s_T$ and $\bar{R}^d_T$ to denote the static and dynamic regrets of the online projected gradient descent algorithm, respectively. Then, we have the following result.
Proposition 1 Suppose Assumption 1 holds. Then the static regret $R^s_T$ and the dynamic regret $R^d_T$ of the iterations (33) satisfy the following two relationships.

PROOF. By the definitions of the regrets in (2) and (3), it is obvious that $R^s_T - \bar{R}^s_T = R^d_T - \bar{R}^d_T$. Thus, if one of the two items is proved, so is the other. Since Assumption 1 holds, the claimed bound follows from (2) and Lemma 1. Thus the proof is complete. ✷ By using Proposition 1, we can establish regret bounds for Algorithms 2 and 3 from known regret bounds for online gradient descent algorithms. Moreover, if the regret bounds for online gradient descent are sublinear in $T$ and $\sum_{t=1}^T \alpha_t$ is also sublinear in $T$, then the established regret bounds for Algorithms 2 and 3 are sublinear.
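The decomposition behind (33) can be sketched directly: a coordinate step equals a full projected gradient step plus a correction $\alpha_t e_t$ carrying the frozen gradient components (the box projection and all numbers are illustrative assumptions):

```python
import numpy as np

def coord_as_full_step(x, grad_x, i_t, alpha, project):
    """Write the coordinate step as full step (32) plus correction alpha*e_t."""
    e = grad_x.copy()
    e[i_t] = 0.0                        # e_t: gradient components NOT updating
    full = x - alpha * grad_x           # full online gradient step, as in (32)
    return project(full + alpha * e), e # equals (33)

project = lambda y: np.clip(y, -1.0, 1.0)
x = np.array([0.5, -0.2, 0.9])
g = np.array([1.0, 2.0, -0.5])
x_next, e = coord_as_full_step(x, g, 1, 0.1, project)

# Sanity check: the result matches updating only coordinate i_t = 1 directly.
direct = x.copy()
direct[1] -= 0.1 * g[1]
```

Only the selected coordinate moves, so the extra regret relative to full-gradient descent is controlled by the size of $e_t$ and the stepsizes, which is the content of Proposition 1.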

Static regret for convex functions
The following static regret bound for the online projected gradient descent algorithm is proved in [41, Theorem 1].
Lemma 2 Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t = 1/t$, then the static regret of the online projected gradient descent iterations (32) satisfies a sublinear (in $T$) upper bound. The following result on static regret bounds for Algorithms 2 and 3 is a direct corollary of Proposition 1 and Lemma 2.
Corollary 1 Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t = 1/t$, then the static regrets of Algorithms 2 and 3 satisfy corresponding sublinear bounds. PROOF. Since $\alpha_t = 1/t$, we have $\sum_{t=1}^T \alpha_t \le 1 + \log T$. The conclusion follows as a result of Proposition 1 and Lemma 2. ✷

Static regret for strongly convex functions
The static regret bound in Lemma 2 is improved in [11, Theorem 1] under the assumption of strong convexity.
Lemma 3 Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t = \frac{1}{\mu t}$, then the static regret of the online projected gradient descent algorithm (32) satisfies an upper bound of order $O(\log T)$. The following result on static regret bounds for Algorithms 2 and 3 for strongly convex functions is a direct corollary of Proposition 1 and Lemma 3.
Corollary 2 Suppose Assumptions 1 and 2 hold. Furthermore, if the stepsize is chosen as $\alpha_t = \frac{1}{\mu t}$, then the static regrets of Algorithms 2 and 3 satisfy corresponding logarithmic bounds. PROOF. Since $\alpha_t = \frac{1}{\mu t}$, we have $\sum_{t=1}^T \alpha_t \le \frac{1}{\mu}(1 + \log T)$. The conclusion follows as a result of Proposition 1 and Lemma 3. ✷

Dynamic regret for convex functions
We can adapt the proof of Theorem 3 to the deterministic case to derive a dynamic regret bound for convex functions, which is given in the following result.
Proposition 2 Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t = \sqrt{C_T/T}$, then the dynamic regret achieved by the online gradient descent algorithm (32) satisfies an upper bound of order $O(\sqrt{C_T T})$. PROOF. From (32) and Lemma 1, we have the corresponding inequality for any $x \in \Theta$. Substituting $x^*_t$ into (34) results in the corresponding inequality for any $t \ge 2$. Note that the inequality (22) still holds in this case. Thus, summing over $t$ and setting the stepsize to $\sqrt{C_T/T}$ yields the stated bound.

✷
Using Propositions 1 and 2, we can state the following result on dynamic regret bounds of Algorithms 2 and 3.
Corollary 3 Suppose Assumption 1 holds. Furthermore, if the stepsize is chosen as $\alpha_t = \sqrt{C_T/T}$, then the dynamic regrets of Algorithms 2 and 3 satisfy corresponding bounds of order $O(\sqrt{C_T T})$.

PROOF. The conclusion follows as a result of Propositions 1 and 2. ✷
As discussed in Remark 6, the stepsize choice $\alpha_t = \sqrt{C_T/T}$ can be made independent of $T$ by using the doubling trick scheme.

Dynamic regret for strongly convex functions
Note that when the cost functions are strongly convex and constant stepsizes are used, the result in Proposition 1 only gives a dynamic regret bound that is linear in $T$. Therefore, we aim to establish better dynamic regret bounds for Algorithms 2 and 3. In this subsection, we make the following assumption on problem (1).

Assumption 4
The gradient of $f_t$ is block-wise Lipschitz continuous uniformly in $t$, i.e., for any $1 \le i \le P$, there exists $L_i > 0$ such that for any $t \in \mathbb{Z}_{\ge 0}$, $x \in \mathbb{R}^n$, and $u \in \mathbb{R}^{n_i}$,
$$|\nabla_{(i)} f_t(x + U_i u) - \nabla_{(i)} f_t(x)| \le L_i |u|,$$
where $U_i$ embeds the $i$-th block into $\mathbb{R}^n$. We denote the largest $L_i$ for $1 \le i \le P$ by $L_{\max} := \max\{L_1, \ldots, L_P\}$. By further assuming $\nabla f_t(x^*_t) = 0$ for any $t \in \mathbb{Z}_{\ge 0}$, the dynamic regret bounds can be derived without Assumption 1; therefore, they are applicable to unconstrained problems with unbounded gradients. Before analyzing the regrets, we define the following measure of the variation of problem (1), which is first introduced in [40]:
$$C_{T,2} := \sum_{t=1}^T |x^*_t - x^*_{t-1}|^2.$$

Remark 8 Compared to $C_T$, $C_{T,2}$ can be significantly smaller if the variation of the minimizers is small. Following the discussion in Remark 6, if the variation of the minimizers decreases with order $1/t^q$ with $1/2 < q \le 1$, then $C_{T,2}$ is finite while $C_T$ grows to $\infty$ at a rate sublinear in $T$.

Theorem 5 Suppose $\nabla f_t(x^*_t) = 0$ for any $t \in \mathbb{Z}_{\ge 0}$ and Assumptions 2-4 hold. Furthermore, assume the number of blocks $P$ satisfies $P < \frac{1}{B_3 L_{\max}}$ and the stepsize $\alpha$ satisfies $L_{\max}\alpha^2 - 2\alpha + B_3 P < 0$. Then the dynamic regret (5) achieved by Algorithm 3 satisfies an upper bound of order $O(C_{T,2})$.

PROOF. Let $\{y_t\}$ denote the sequence of real vectors storing the value of the decision variable before projection, i.e., $x_t = \Pi_\Theta(y_t)$, where the value of $U_t$ follows the coordinate selection rule in Algorithm 3. By the block descent lemma [1, Lemma 3.2] and the fact that $L_{\max} \ge L_i$ for all $i$, we obtain (37). From (37) and $i_t \in \arg\max_i |\nabla_{(i)} f_t(x_t)|$, (38) follows. Since $\nabla f_t(x^*_t) = 0$ for any $t \in \mathbb{Z}_{\ge 0}$, minimizing both sides of (16) with respect to $y$ gives, for any $x_t \in \Theta$, the Polyak-Lojasiewicz condition. Then, by (38), we obtain (41).

Since $\bar{A} = \frac{1}{P}\left(\alpha - \frac{\alpha^2 L_{\max}}{2}\right)$, it can be shown that $\frac{L}{\mu}(1 - 2\mu \bar{A}) < \frac{1}{2}$ is equivalent to $L_{\max} \alpha^2 - 2\alpha + B_3 P < 0$ after some algebraic manipulations. When $P < \frac{1}{B_3 L_{\max}}$, we have $4 - 4 B_3 P L_{\max} > 0$, and the set of solutions of the inequality $L_{\max} \alpha^2 - 2\alpha + B_3 P < 0$ with respect to $\alpha$ is non-empty; the solutions are given by $\alpha \in \left(\frac{1 - \sqrt{1 - B_3 P L_{\max}}}{L_{\max}}, \frac{1 + \sqrt{1 - B_3 P L_{\max}}}{L_{\max}}\right)$. Therefore, the conditions listed in the statement of the theorem ensure that $\frac{L}{\mu}(1 - 2\mu \bar{A}) < \frac{1}{2}$. Thus, (41) implies (42). The deterministic dynamic regret (3) can then be upper bounded as in (43), and the regret bound follows from (42) by summing both sides of (43). Note that in Theorem 5, we introduced an upper bound on the number of components $P$, which increases to $\infty$ as $B_3$ decreases to 0; this is mainly a result of the conservatism introduced in the derivation of (41). However, we manage to improve the regret bound of Theorem 4 from $O(C_T)$ to $O(C_{T,2})$, following the discussion in Remark 8. The bound is also valid for unconstrained OCO problems that are not covered by Theorems 1-4.
The following two theorems give dynamic regret bounds without assuming an upper bound on the number of components $P$. However, at each time $t$, potentially multiple offline steps are needed to guarantee the desired regret bounds.
The modified algorithms are given in Algorithms 4 and 5.
It can be seen that in Algorithms 4 and 5, at each time $t$, $k$ updates are performed, where $k$ is an integer to be chosen. In Algorithms 2 and 3, however, only one update is performed at each time $t$.
dynamic regret (5) achieved by Algorithm 4 satisfies the stated bound.

PROOF. By the block descent lemma [1, Lemma 3.2] and the fact that L_max ≥ L_i for all i, we have a per-block descent bound. Summing over all P blocks in a full round, in which every component is updated exactly once, leads to the round-wise bound. By the update equation and the Lipschitz continuity of the gradient, we have a complementary estimate; summing it over the P blocks and then, by Assumption 2, minimizing both sides of (16) with respect to y, we obtain (46). Then, by (46), we have a per-round contraction, which further implies, by Assumption 3, the corresponding contraction in distance. For any k ∈ Z_{≥0}, there exist k_1 ∈ Z_{≥0} and k_2 ∈ {1, . . . , P − 1} such that k = k_1 P + k_2. Then, by Assumptions 2 and 3, the contraction holds for any x ∈ R^n. Consequently, (47) and (48) yield the contraction at any t ∈ Z_{≥0}. Since α < 2/L_max, we have A > 0 and 0 < 1 − 2µA < 1. Hence there exists k such that (50) holds. By (44), we have (51). Summing both sides of (51) from t = 1 to t = T − 1 and using (50) completes the proof. ✷

If (1 + αL_max)^{2P−2} < 1/2 holds, then the dynamic regret (5) achieved by Algorithm 5 satisfies the stated bound.

PROOF. Following similar steps as in the proof of Theorem 6 and using (39) and (48) yields the contraction with Ā = (1/P)(α − α²L_max/2), as in the proof of Theorem 5. Since Ā > 0, there exists k such that the required condition holds. Hence, we have (53). From (53), the claimed inequality follows by summing both sides of (51). ✷

Numerical Simulations

First, we study the following unconstrained quadratic problem

min_{x ∈ R^20} f_t(x) = (1/2) x^⊤ Q_t x − b^⊤ x,   (54)

where b ∈ R^20 is a randomly generated constant vector and Q_t is a time-varying matrix that is positive definite for all t ≥ 1. Moreover, all elements of Q_t lie in the closed interval [1, 2]. Since b is uniformly bounded, the expectation of x_t remains bounded too; this means that x_t and the gradient Q_t x_t − b are bounded almost surely. We choose P = 20, and for each 1 ≤ i ≤ 20, x^(i) is a scalar. The horizon length is T = 5000, and the constant stepsize is chosen as α_t = α = 0.001. Since the static regret is always upper bounded by the dynamic regret, we only show the plots for the dynamic regrets and their time-averages R^d_T/T. It can be seen from Fig. 1 that when the constant stepsize is chosen sufficiently small, the regrets in all cases grow sublinearly, and therefore their time-averages go to 0 when T is sufficiently large. This is consistent with our theoretical results from Theorems 4-7 on strongly convex functions. In order to see how the variation of the problem impacts the performance of the algorithms, we add an extra term of 100I_20 so that the matrix Q_t is diagonally dominant and therefore less sensitive to t. We test the online algorithms in this case, and the results are shown in Fig. 2. It can be seen that when problem (54) varies slowly with respect to time, the curves of the regrets in Fig. 2 have a lower growth rate compared to the regrets shown in Fig. 1.
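The quadratic experiment can be reproduced along the following lines. This is a minimal sketch in which the construction of Q_t (in particular the diagonal shift used to guarantee positive definiteness) and the regret bookkeeping are illustrative assumptions; the random-coordinate rule is shown:

```python
import numpy as np

n, T, alpha = 20, 2000, 0.001
rng = np.random.default_rng(0)
b = rng.uniform(-1.0, 1.0, n)          # fixed random vector b

def Q(t):
    # Symmetric matrix with random entries in [1, 2] that vary with t;
    # a diagonal shift guarantees positive definiteness via Gershgorin.
    # The paper's exact time-varying construction is not reproduced here.
    rng_t = np.random.default_rng(t)
    A = rng_t.uniform(1.0, 2.0, (n, n))
    A = 0.5 * (A + A.T)                # symmetrize; entries stay in [1, 2]
    return A + 2 * n * np.eye(n)

x = np.zeros(n)
regret = 0.0
for t in range(1, T + 1):
    Qt = Q(t)
    i = int(rng.integers(n))           # random coordinate rule
    x[i] -= alpha * (Qt @ x - b)[i]    # one coordinate step at time t
    x_star = np.linalg.solve(Qt, b)    # instantaneous minimizer of f_t
    f = lambda z: 0.5 * z @ Qt @ z - b @ z
    regret += f(x) - f(x_star)         # dynamic regret increment
print(regret / T)                      # time-averaged dynamic regret
```

Since each f_t is strictly convex, every regret increment is nonnegative, and with a sufficiently small stepsize the time-average decays as in Fig. 1.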
As expected, the algorithm using the full gradient has the best performance in terms of minimizing the dynamic regret. Yet, it is worth mentioning that among the three coordinate descent algorithms considered in this numerical example, the Gauss-Southwell rule gives the best performance, which is consistent with Remark 2. The extra information of the component-wise gradient norms enables a better selection of the coordinate to update. An in-depth theoretical analysis of this phenomenon in the online setting is left for future work.
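The Gauss-Southwell rule referred to above simply selects the coordinate with the largest partial-derivative magnitude; a one-function sketch:

```python
import numpy as np

def gauss_southwell(grad):
    """Return the index of the component with the largest |partial derivative|."""
    return int(np.argmax(np.abs(grad)))

# Usage: the second component has the largest gradient magnitude
g = np.array([0.1, -0.7, 0.3])
print(gauss_southwell(g))   # -> 1
```

Compared with the cyclic and random rules, this selection requires evaluating (or bounding) all component-wise gradient norms at each step, which is the extra information mentioned above.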
Next, we consider the problem of minimizing entropy functions online, given in (55). The variable x ∈ Θ is decomposed into 5 scalar components with the compact constraint set Θ = {x ∈ R^5 | 0.001 ≤ x^(i) ≤ 1000, i = 1, 2, 3, 4, 5}. The values p_{i,t} are chosen such that each p_{i,1} is independently and randomly selected from [1, 5]. For t ≥ 2, we set p_{i,t} = p_{i,t−1} + 1/(t−1) for all i. It can be verified that this selection ensures |x*_{t+1} − x*_t| = 1/t and C_T = O(log T). Note that the cost function in (55) is convex but not strongly convex, and hence only Theorem 1 and Theorem 3 apply. We again show the plots of the dynamic regrets of Algorithms 1-3 and the full-gradient-based algorithm in Fig. 3, with constant stepsize α = 0.05 and T = 5000. Moreover, the plot for random coordinate descent is averaged over 1000 runs.
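A sketch of the constrained experiment is below. The specific entropy cost f_t(x) = Σ_i x^(i) log(x^(i)/p_{i,t}) is an assumption made for illustration (the paper's exact cost (55) is not reproduced here); the box projection and the drift of p_{i,t} follow the description above:

```python
import numpy as np

lo, hi = 0.001, 1000.0                 # box constraint set Theta
n, T, alpha = 5, 1000, 0.05

rng = np.random.default_rng(1)
p = rng.uniform(1.0, 5.0, n)           # p_{i,1} drawn from [1, 5]

def grad(x, p):
    # Gradient of f_t(x) = sum_i x_i * log(x_i / p_i). This particular
    # entropy form is an assumption, not necessarily the paper's (55).
    return np.log(x / p) + 1.0

x = np.full(n, 1.0)
for t in range(1, T + 1):
    i = (t - 1) % n                    # cyclic coordinate rule
    x[i] -= alpha * grad(x, p)[i]      # coordinate gradient step
    x[i] = min(max(x[i], lo), hi)      # projection onto the box
    p = p + 1.0 / t                    # p_{i,t+1} = p_{i,t} + 1/t
print(x)
```

For a scalar box constraint, the Euclidean projection reduces to clipping, which keeps every iterate feasible even as the minimizers drift with p_{i,t}.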
It can be seen from Figs. 1-3 that, for the quadratic problem, the dynamic regrets flatten considerably when t is large. On the other hand, the dynamic regrets in Fig. 3 still exhibit significant growth at t = T = 5000, even though we select the time-varying parameters to ensure that C_T = O(log T). These findings are consistent with the improved regret bounds shown in Theorems 4-7 for uniformly strongly convex functions with uniformly Lipschitz gradients.

Summary
In this work, we have proposed online coordinate descent algorithms to deal with optimization problems that may change over time. Three widely used update rules of coordinate descent were considered. Under different assumptions, we have provided upper bounds on the regrets of these online algorithms. In particular, we have verified that the established regret bounds of these coordinate descent algorithms are of similar orders as those of online gradient descent methods under the same settings. The regret bounds proved in this paper are summarized in Table 1. Lastly, numerical examples were given to illustrate our main results. The possibility of using coordinate blocks with overlapping components is an interesting future research direction, especially for the deterministic case. Another topic of interest is the use of inexact gradient information in online coordinate descent algorithms.

Fig. 1. Plots of the dynamic regrets R^d_T and their time-averages R^d_T/T.

Fig. 2. Plots of the dynamic regrets R^d_T and their time-averages R^d_T/T with slow variation.

Fig. 3. Plots of the dynamic regrets R^d_T and their time-averages R^d_T/T in the non-strongly convex case.

Table 1. Regret bounds proved in the paper with C_T.

Algorithm 4/5, remaining steps:
4: Update: For i = i_κ, such that κ ≤ k.
5: Set t ← t + 1 and go to Step 2.

Theorem 6 (fragment): ... ∈ Z_{≥0} and Assumptions 2-4 hold, and the stepsize is chosen such that α_t = α < 2/L_max. Let k be an integer such that B_6 := ... O(C_{T,2} + C_2) follows. ✷

Theorem 7 (fragment): Suppose ∇f_t(x*_t) = 0 for any t ∈ Z_{≥0} ...

... [1, 2]. As a result, (54) satisfies Assumptions 2-4. Next, we discuss Assumption 1 (i) made in Theorem 4. The constant G from Assumption 1 (i) is only used to show R^d_T ≤ G E[Σ_{t=1}^T |x_t − x*_t|] so that (31) holds. All other arguments made in the proof of Theorem 4 remain true even if G = ∞. Thus, every iteration of Algorithm 1 moves x_t closer to x*_t.