Optimization in High Dimensions via Accelerated, Parallel, and Proximal Coordinate Descent

Abstract. We propose a new randomized coordinate descent method for minimizing the sum of convex functions, each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel, and PROXimal; this is the first time such a method has been proposed. In the special case when the number of processors is equal to the number of coordinates, the method converges at the rate 2ω̄L̄R²/(k+1)², where k is the iteration counter, ω̄ is a data-weighted average degree of separability of the loss function, L̄ is the average of the Lipschitz constants associated with the coordinates and the individual functions in the sum, and R is the distance of the initial point from the minimizer. We show that the method can be implemented without the need to perform full-dimensional vector operations, which are the major bottleneck of accelerated coordinate descent and would otherwise render it impractical. The fact that the method depends on the average degree of separability, and not on the maximum degree, can be attributed to the use of new safe large stepsizes, leading to an improved expected separable overapproximation (ESO). These stepsizes are of independent interest and can be utilized in all existing parallel randomized coordinate descent algorithms based on the concept of ESO. In special cases, our method recovers several classical and recent algorithms, such as simple and accelerated proximal gradient descent, as well as serial, parallel, and distributed versions of randomized block coordinate descent. Due to this flexibility, APPROX has been used successfully by the authors in a graduate class setting as a modern introduction to deterministic and randomized proximal gradient methods. Our bounds match or improve upon the best known bounds for each of the methods APPROX specializes to. Our method has applications in a number of areas, including machine learning, submodular optimization, and linear and semidefinite programming.

1. Introduction. Many modern applications lead to optimization problems of extremely big sizes. Applications can be found in all areas of human endeavor where data is available, including the Internet, machine learning, data science, and scientific computing. The size of these problems is so large that it is necessary to decompose the problem into smaller, more manageable pieces. Traditional approaches, where it is possible to rely on full-vector operations in the design of an iterative scheme, must be revisited. Coordinate descent methods [23,32] are a very popular class of algorithms for such problems, as they can break the problem down into smaller pieces and can take advantage of sparsity patterns in the data. With big data problems it is necessary to design algorithms capable of utilizing modern parallel computing architectures. This has resulted in an interest in parallel [33,44,18,11,29,19,26] and distributed [30,17] coordinate descent methods.
In this work we focus on the solution of convex optimization problems with a huge number of variables of the form
(1) min_{x ∈ R^N} F(x) := f(x) + ψ(x).
Here x = (x^(1), ..., x^(n)) ∈ R^N is a decision vector composed of n blocks, with x^(i) ∈ R^{N_i} and N = Σ_i N_i. We assume that f : R^N → R is of the form
(2) f(x) = Σ_{j=1}^m f_j(x),
where the f_j are smooth convex functions, and that ψ : R^N → R ∪ {+∞} is a convex (and lower semicontinuous) block-separable regularizer (e.g., the L1 norm). We now summarize the main contributions of this work.

Combination of Good Features.
We design and analyze the first randomized block coordinate descent method (APPROX) which is simultaneously accelerated, parallel, and proximal. In fact, we are not aware of any published results on accelerated coordinate descent which would be either proximal or parallel. Our method is accelerated in the sense that it achieves an O(1/k²) convergence rate, where k is the iteration counter. The first gradient method with this convergence rate is due to Nesterov [21]; see also [47,3]. An accelerated randomized coordinate descent method, for convex minimization without constraints, was originally proposed in 2010 by Nesterov [23].
Several variants of proximal and parallel (but nonaccelerated) randomized coordinate descent methods have been proposed [5,33,11,30]. In Table 1 we provide a list of some recent research papers proposing and analyzing randomized coordinate descent methods. The table substantiates our observation that while the block ("Blck" column) and proximal ("Prx" column) setup is relatively common in the literature, parallel methods ("Par" column) are much less studied, and there is just a handful of papers dealing with accelerated variants ("Acc" column). Moreover, existing accelerated methods, with the exception of [13], are not efficient ("Eff" column), a point of crucial importance that we discuss next.

Efficient Iterations.
We identify a large subclass of problems of the form (1) for which the full-vector operations inherent in accelerated methods can be eliminated. This contrasts with Nesterov's accelerated coordinate descent scheme [23], which requires full-dimensional vector operations in each iteration.

Table 1 An overview of selected recent papers proposing and analyzing the iteration complexity of randomized coordinate descent methods. The years correspond to the time the papers were first posted online (e.g., onto arXiv), and not the eventual publication time. "Eff" = the cost of each iteration is low (in particular, independent of the problem dimension N); "Blck" = works with blocks of coordinates; "Prx" = can handle the proximal setup (has a ψ term); "Par" = can update more blocks per iteration; "Acc" = accelerated, i.e., achieving the optimal O(1/k²) rate for nonstrongly convex objectives. Our algorithm has all these desirable properties. In the last column we highlight a single notable feature, necessarily chosen subjectively, of each work.
Subsequently, in part due to these issues, the work of the community focused on simple methods as opposed to accelerated variants. For instance, Richtárik and Takáč [32] use Nesterov's observation to justify their focus on nonaccelerated methods in their work on coordinate descent in the proximal/composite setting.
Recently, by a careful modification of Nesterov's method, Lee and Sidford [13] were able to avoid full-dimensional operations in the case of minimizing a convex quadratic without constraints. This was achieved by introducing an extra sequence of iterates and observing that for quadratic functions it is possible to compute a partial derivative of f evaluated at a linear combination of full-dimensional vectors without ever forming the combination. We extend the ideas of Lee and Sidford [13] to our general setting (1) in the case when f_j(x) = φ_j(a_j^T x), where the φ_j are scalar convex functions with Lipschitz derivative and the vectors a_j are block-sparse.

Flexibility.
APPROX is a remarkably versatile method, encoding several classical, recently developed, and new optimization methods as special cases. These variants are obtained by combining four design elements (see Table 2): the presence and form of the proximal term ψ in the problem formulation ("Prx"), the number of blocks n we decide to split the variable x ∈ R^N into ("Blck"), the choice of the block sampling Ŝ, and the choice of the stepsize parameter θ_k. In particular, by choosing to group all coordinates into a single block (n = 1), the only sensible sampling is to pick this block with probability 1, which makes the method deterministic. This corresponds to the first four methods in Table 2. To obtain the first three, we need to modify the stepsizes in APPROX (Algorithm 2) so that θ_k = θ_0 for all k. Doing this, we obtain simple (i.e., nonaccelerated) gradient descent in three varieties, depending on the choice of the proximal term: gradient descent (GD, no proximal term), projected GD (indicator function of a convex constraint set), and proximal GD. If we instead decrease the stepsizes as prescribed by APPROX, we recover Tseng's accelerated proximal gradient method. Let us now look at the last four methods in the table, all of which correspond to a setting with a nontrivial block decomposition and a general (but block-separable) proximal term. If we set θ_k = θ_0 for all k, we recover existing (nonaccelerated) serial (UCDC [32]), parallel (PCDM [33]), and distributed (Hydra [30]) coordinate descent methods, depending on the choice of the sampling. Finally, a follow-up paper to our work looks at APPROX specialized to a distributed sampling (Hydra2 [10]). This last method was applied to solving a problem involving 50 billion variables.

New Stepsizes.
We propose new stepsizes for parallel coordinate descent methods, based on a new expected separable overapproximation (ESO). For some classes of problems (e.g., when the f_j are quadratics), these stepsizes can be much larger than the stepsizes proposed for the (nonaccelerated) parallel coordinate descent method (PCDM) in [33]. Let ω_j be the number of blocks function f_j depends on. The stepsizes, and hence the resulting complexity, of PCDM depend on the quantity ω = max_j ω_j. Our stepsizes, in contrast, take all the values ω_j into consideration, and the result is a complexity that depends on a data-weighted average ω̄ of the values ω_j. Since ω̄ can be much smaller than ω, our stepsizes can result in dramatic acceleration of our method and of other methods whose analysis is based on an ESO [33,11,30].

Contents.
The rest of the paper is organized as follows. In section 2 we outline a number of notable applications our method has found since it was first published [12]. We then describe new long stepsizes for parallel coordinate descent methods (applying more widely than to APPROX), based on novel assumptions, and compare them with existing stepsizes (section 3). The APPROX algorithm and the main complexity result are described in section 4. Subsequently, we give a proof of the result (section 5). We then describe an efficient implementation of our method, one that does not require full-vector operations (section 6), and finally comment on our numerical experiments (section 7).
1.6. Notation. It will be convenient to define natural operators acting between the spaces R^N and R^{N_i}. In particular, we will often wish to lift a block x^(i) from R^{N_i} to R^N, filling the coordinates corresponding to the remaining blocks with zeros. Likewise, we will project x ∈ R^N back onto R^{N_i}. We will now formalize these operations.
Let U be the N × N identity matrix, and let U = [U_1, U_2, ..., U_n] be its decomposition into column submatrices U_i ∈ R^{N×N_i}. For x ∈ R^N, let x^(i) be the block of variables corresponding to the columns of U_i, that is, x^(i) := U_i^T x ∈ R^{N_i}. For S ⊆ [n] := {1, 2, ..., n} and h ∈ R^N we write
(3) h_[S] := Σ_{i∈S} U_i h^(i).
In other words, h_[S] is the vector in R^N obtained from h ∈ R^N by zeroing out the blocks that do not belong to S. For convenience, we will also write
(4) ∇_i f(x) := U_i^T ∇f(x)
for the vector of partial derivatives with respect to the coordinates belonging to block i.
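For concreteness, the following minimal sketch illustrates the lift/projection/restriction operations in code. The contiguous block layout and all names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Blocks as contiguous index ranges: n = 2 blocks, N = 5.
blocks = [np.arange(0, 3), np.arange(3, 5)]

def project(x, i):
    """x^(i) = U_i^T x: extract the coordinates of block i."""
    return x[blocks[i]]

def lift(t, i, N):
    """U_i t: place a block vector t into R^N, zeros elsewhere."""
    h = np.zeros(N)
    h[blocks[i]] = t
    return h

def restrict(h, S):
    """h_[S]: zero out all blocks that do not belong to S."""
    out = np.zeros_like(h)
    for i in S:
        out[blocks[i]] = h[blocks[i]]
    return out

x = np.arange(5.0)
print(project(x, 1))       # [3. 4.]
print(restrict(x, {0}))    # [0. 1. 2. 0. 0.]
```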
With each block i ∈ [n] we associate a positive definite matrix B_i ∈ R^{N_i×N_i} and a scalar v_i > 0, and equip R^{N_i} and R^N with the norms
(5) ‖x^(i)‖_(i) := ⟨B_i x^(i), x^(i)⟩^{1/2},   ‖x‖_v := (Σ_{i=1}^n v_i ‖x^(i)‖²_(i))^{1/2}.
The corresponding conjugate norms (defined by ‖s‖* = max{⟨s, x⟩ : ‖x‖ ≤ 1}) are
(6) ‖s^(i)‖*_(i) = ⟨B_i^{-1} s^(i), s^(i)⟩^{1/2},   ‖s‖*_v = (Σ_{i=1}^n v_i^{-1} (‖s^(i)‖*_(i))²)^{1/2}.
Example 1 (blocks). We now illustrate the above notation in two extreme situations:
1. Blocks correspond to coordinates. That is, n = N, and hence N_i = 1 for all i. In this case, U_i = e_i is the ith unit coordinate vector, the block norm is ‖x^(i)‖_(i) = B_i^{1/2}|x^(i)| for some positive scalar B_i, and the primal norm in R^N is a weighted Euclidean norm: ‖x‖_v = (Σ_{i=1}^N v_i B_i (x^(i))²)^{1/2}. The dual norms have an analogous meaning.
2. All coordinates belong to a single block. That is, n = 1, and hence N_1 = N. In this case, U_1 = I is the identity matrix, and hence x^(1) = x. Further, ∇_1 f(x) = ∇f(x) is the gradient of f at x. The primal block norm ‖x^(1)‖_(1) is simply ⟨B_1 x, x⟩^{1/2}, and the primal norm in R^N is a weighted version thereof: ‖x‖_v = v_1^{1/2} ⟨B_1 x, x⟩^{1/2}.

2. Applications. In this section we describe four application areas for the APPROX algorithm, all motivated by and building on the developments in the earlier version of this paper where the APPROX method was first developed [12] (see Table 3). This section therefore contains new material not present in [12]. It is not necessary for the reader at this point to know any details about APPROX beyond what was mentioned in the introduction; the purpose here is to stress that the method has a wide array of applications.

Empirical Risk Minimization.
Empirical risk minimization (ERM) is a powerful and immensely popular paradigm for training statistical (machine) learning models [37]. In statistical learning, one wishes to "learn" an unknown function h* : X → Y, where X (the set of samples) and Y (the set of labels) are arbitrary domains. Roughly speaking, the goal of statistical learning is to find a function (predictor, hypothesis) h : X → Y from some predefined set (hypothesis class) H of predictors which in some statistical sense is the best approximation of h*. In particular, we assume that there is an unknown distribution D over ξ ∈ X. Given a loss function ℓ : Y × Y → R, we define the risk (generalization error) associated with predictor h ∈ H to be
(7) L_D(h) = E_{ξ∼D} ℓ(h(ξ), h*(ξ)).
The goal of statistical learning is to find h ∈ H of minimal risk:
(8) min_{h∈H} L_D(h).
A natural, albeit in general intractable, choice of loss function in some applications is ℓ(y, y') = 0 if y = y' and ℓ(y, y') = 1 otherwise. For example, let X be a collection of images, let Y = {−1, 1}, and let h*(ξ) be 1 if image ξ contains an image of a cat, and h*(ξ) = −1 otherwise. If we are able to learn h*, we will be able to detect images of cats. Problems where Y consists of two elements are called classification problems. The domain set can instead represent a video collection, a text corpus, a collection of emails, or any other collection of objects which we can represent mathematically. If Y is a finite set consisting of more than two elements, we speak of multiclass classification. If Y = R, we speak of regression.
One of the fundamental issues making (8) difficult to solve is the fact that the distribution D is not known. ERM is a paradigm for overcoming this obstacle, assuming that we have access to independent samples from D. In ERM, we first collect a training set of independent and identically distributed samples and their labels, S = {(ξ_j, y_j) ∈ X × Y : j = 1, 2, ..., m}, where y_j = h*(ξ_j). Subsequently, we replace the expectation in (7) defining the risk by a sample average approximation, which defines the empirical risk:
L_S(h) := (1/m) Σ_{j=1}^m ℓ(h(ξ_j), y_j).
The ERM paradigm is to solve the empirical risk minimization problem
(9) min_{h∈H} L_S(h)
instead of the harder risk minimization problem (8). In practice, H is often chosen to be a parametric class of functions described by a parameter x ∈ R^d. For instance, let X ⊆ R^d (d = number of features) and Y = R, and consider the class of linear predictors H = {h : h(ξ) = x^T ξ, x ∈ R^d}. Clearly, h is uniquely defined by x ∈ R^d. Defining ℓ_j : R → R via ℓ_j(t) := (1/m) ℓ(t, y_j), and setting f_j(x) := ℓ_j(ξ_j^T x), we have L_S(h) = Σ_{j=1}^m f_j(x). Hence, the ERM problem fits our framework (1)+(2), with ψ ≡ 0. In practice, however, one often uses a nonzero ψ, which is interpreted as a regularizer and is included in order to prevent overfitting and hence allow the estimator to generalize to future, unobserved samples.
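To make the connection with the framework (1)+(2) concrete, here is a small sketch of an ERM objective with linear predictors, squared loss, and an L1 regularizer. The particular loss, the regularizer, and all names are illustrative assumptions of the example.

```python
import numpy as np

# Sketch: ERM with linear predictors h(xi) = <x, xi>, squared loss, and
# psi(x) = lam * ||x||_1.  Each f_j depends on x only through xi_j^T x, so f_j
# depends on as many coordinates as xi_j has nonzeros (this is omega_j).
rng = np.random.default_rng(0)
m, d = 50, 200                      # m examples, d features (d >> m regime)
Xi = rng.standard_normal((m, d))    # rows are the samples xi_j
y = rng.standard_normal(m)          # labels
lam = 0.1

def f_j(x, j):
    # f_j(x) = ell_j(xi_j^T x) with ell_j(t) = (t - y_j)^2 / (2m)
    return ((Xi[j] @ x - y[j]) ** 2) / (2 * m)

def F(x):
    # F(x) = f(x) + psi(x) = sum_j f_j(x) + lam * ||x||_1, cf. (1)+(2)
    return sum(f_j(x, j) for j in range(m)) + lam * np.abs(x).sum()

print(F(np.zeros(d)))
```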
If the number of features is larger than the number of examples (d ≫ m), randomized coordinate descent is an efficient algorithm for solving (1) [32,33,40]. If the training set S is so large that it does not fit in the memory (or disk space) of a single machine, one needs to employ a distributed computing system and solve ERM via a distributed optimization algorithm. One option is the use of distributed coordinate descent [31,17], known as Hydra. APPROX has been successfully applied in the distributed setting, leading to the Hydra2 method [10]. In that work, the authors solve an ERM problem involving a training set of several terabytes in size and 50 billion features.
If the number of examples in the training set is larger than the number of features (m ≫ d), it is typically not efficient to apply randomized coordinate descent, or APPROX, to the ERM problem directly. Instead, the state-of-the-art methods are variants of randomized coordinate descent applied to the dual problem. The (Fenchel) dual of the regularized ERM problem for linear predictors considered above has the form
min_{y∈R^m} ψ*((1/m) Σ_{j=1}^m y_j ξ_j) + (1/m) Σ_{j=1}^m ℓ_j*(−y_j),
where ψ* (resp., ℓ_j*) is the Fenchel conjugate of ψ (resp., ℓ_j). The function y ↦ ψ*((1/m) Σ_j y_j ξ_j) has Lipschitz gradient if we assume that ψ is strongly convex, and y ↦ (1/m) Σ_{j=1}^m ℓ_j*(−y_j) is separable. This also fits the framework (1)+(2), with f corresponding to the first part of the objective (and consisting of a single summand), and ψ corresponding to the second part of the objective (block separability is implied by separability).
We developed APPROX with ERM as an application in mind, and hence our numerical experiments in section 7 consider two key ERM problems: the Lasso problem and the support vector machine (SVM) problem.
Following our paper, Lin, Lu, and Xiao [15] proposed a version of APPROX designed for strongly convex problems. Their motivation was that practitioners often choose regularizers that are simultaneously separable and strongly convex. This leads to problems for which we have a good lower bound on the strong convexity parameter. With this additional knowledge, they showed that the rate of convergence of a properly modified APPROX algorithm, applied to the dual problem, leads to state-of-the-art complexity for a class of ERM problems.

Submodular Optimization.
Ene and Nguyen [9] showed how the APPROX algorithm leads to a state-of-the-art method for minimizing decomposable submodular functions. Submodular minimization has a vast and growing array of applications, including image segmentation [36,9], graphical model structure learning, experimental design, Bayesian variable selection, and minimizing matroid rank functions [2].
We now briefly introduce the notion of submodularity. Let V = {1, 2, ..., d} be a finite ground set. A real-valued set function φ : 2^V → R is submodular if
φ(A) + φ(B) ≥ φ(A ∪ B) + φ(A ∩ B)
for any two sets A, B ⊆ V. An equivalent and often more intuitive characterization of submodularity is the following "diminishing returns" property: φ is submodular if and only if for all A ⊆ B ⊆ V and k ∈ V \ B we have φ(A ∪ {k}) − φ(A) ≥ φ(B ∪ {k}) − φ(B). Ene and Nguyen [9] consider the decomposable submodular minimization problem
(10) min_{A⊆V} Σ_{i=1}^n φ_i(A),
where the φ_i : 2^V → R are simple submodular functions (simplicity refers to the assumption that it is simple to minimize φ_i plus a modular function). Instead of solving (10) directly, one can focus on solving the unconstrained convex minimization problem
(11) min_{z∈R^d} Σ_{i=1}^n φ̂_i(z) + ½‖z‖²,
where ‖·‖ is the standard Euclidean norm, and φ̂_i : R^d → R is the Lovász extension of φ_i (i.e., the support function of the base polytope P_i ⊂ R^d of φ_i). Given a solution z of (11), one recovers a solution of (10) by thresholding its entries at zero:
(12) A* = {k ∈ V : z_k > 0}.
Further, instead of solving (11), one focuses on its (Fenchel) dual:
(13) min { ½‖Σ_{i=1}^n x^(i)‖² : x^(i) ∈ P_i, i = 1, ..., n }.
It can be shown that if x = (x^(1), ..., x^(n)) solves (13), then
(14) z = −Σ_{i=1}^n x^(i)
solves (11). Note that f(x) := ½‖Σ_i x^(i)‖² is a convex quadratic function. If we let ψ be the indicator function of the set P_1 × ⋯ × P_n, then (13) is of the form (1), where N_i = d for all i. It remains to apply the APPROX method to this problem and to transform the solution back via (14) and then (12) to obtain a solution of the original problem (10).

Packing and Covering Linear Programs. Packing and covering problems are a pair of mutually dual linear programming problems of the form
Packing LP: max_{x≥0} { 1^T x : Ax ≤ 1 },     Covering LP: min_{y≥0} { 1^T y : A^T y ≥ 1 },
where A is a real matrix, and 1 denotes the vector of appropriate dimension with all entries equal to 1. These problems become more difficult as the size of A grows, since each iteration of interior-point solvers becomes more expensive.
Allen-Zhu and Orecchia [1] developed algorithms for these problems whose complexity is O(N_A log(N_A) log(ε⁻¹)/ε) for packing LP and O(N_A log(N_A) log(ε⁻¹) ε^{−1.5}) for covering LP, respectively. This complexity is nearly linear in the size of the problem N_A = nnz(A) (the number of nonzero entries in A), does not depend on the magnitude of the elements of A, and has a better dependence on the accuracy ε than other nearly-linear-time approaches. The improvement in the complexity is due to the use of accelerated proximal coordinate descent techniques such as those developed in our paper, combined with extra techniques, such as the use of an exponential penalty.

Least Squares Semidefinite Programming.
Semidefinite programming has a very important role in optimization due to its ability to model and efficiently solve a wide array of problems appearing in fields such as control, network science, signal processing, and computer science [48,46,4,49].In semidefinite programming, one aims to minimize a linear function in a matrix variable, subject to linear equality and inequality constraints and the additional requirement that the matrix variable be positive semidefinite.
Sun, Toh, and Yang [43] consider the canonical semidefinite program (SDP) of minimizing ⟨C, X⟩, where ⟨·,·⟩ is the trace inner product, over X ∈ S^n_+ subject to linear equality constraints specified by the map A_E, two-sided linear inequality constraints l ≤ A_I(X) ≤ u, and bound constraints L ≤ X ≤ U. Here S^n_+ is the cone of n × n symmetric positive semidefinite matrices, A_E : S^n_+ → R^{m_E} and A_I : S^n_+ → R^{m_I} are linear maps, L ≤ U are given positive semidefinite matrices, and l ≤ u are given vectors in R^{m_I}.
The above SDP can be solved by the proximal point algorithm (PPA) of Rockafellar [34,35]. In each iteration of PPA, one needs to solve a least-squares semidefinite program (LS-SDP): given the previous iterate (X_k, s_k) and a regularization parameter σ_k > 0, the next iterate (X_{k+1}, s_{k+1}) minimizes the SDP objective augmented by a proximal term that penalizes, with weight 1/(2σ_k), the squared distance to (X_k, s_k). Sun, Toh, and Yang [43] observe that the dual of LS-SDP has block-separable constraints and can hence be written in the form (1), with either 2 or 4 blocks (n = 2 or n = 4). They used this observation as a starting point to propose new algorithms for LS-SDP that combine advanced linear algebra techniques, a fine study of the errors made by each inner solver, and coordinate descent ideas. They consider APPROX and block coordinate descent as natural competitors to their specialized methods. They implemented the methods using 4 blocks, each equipped with a nontrivial norm defined by a well-chosen positive semidefinite matrix B_i ∈ R^{N_i×N_i}, and with approximate solutions to the proximity operators. Finally, Sun, Toh, and Yang conducted extensive experiments on 616 SDP instances coming from relaxations of combinatorial problems.
It is worth noting that on these instances, APPROX is vastly faster than standard (nonaccelerated) block coordinate descent.

3. Stepsizes for Parallel Coordinate Descent Methods. The framework for designing and analyzing (nonaccelerated) PCDMs, developed by Richtárik and Takáč [33], is based on the notions of block sampling and expected separable overapproximation (ESO). We now briefly review this framework, as our accelerated method is cast in it, too. Informally, a block sampling is the random law describing the selection of blocks in each iteration. An ESO is an inequality, involving f and Ŝ, which defines the stepsize/ESO constants v_1, ..., v_n that are in turn used to compute the updates to the selected blocks. The complexity analysis in our paper is based on the following generic assumption.
Assumption 1.
1. f is convex and differentiable.
2. Ŝ is a uniform block sampling; that is, Ŝ is a random subset of [n] = {1, 2, ..., n} with the property that P(i ∈ Ŝ) = P(j ∈ Ŝ) for all i, j ∈ [n]. (It is easy to see that if Ŝ is a uniform sampling, then, necessarily, P(i ∈ Ŝ) = E[|Ŝ|]/n.)
3. The inequality
(15) E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|]/n) (⟨∇f(x), h⟩ + ½ ‖h‖²_v)
holds for all x, h ∈ R^N. If the above inequality holds, for brevity we will write (f, Ŝ) ∼ ESO(v), indicating that v depends both on f and on Ŝ.
In the context of PCDMs, and for uniform samplings, the ESO inequality (15) was introduced and systematically studied by Richtárik and Takáč [33]. An ESO inequality for a distributed sampling was developed in [30] and further refined in [10]. A PCDM with a nonuniform sampling, and the associated nonuniform ESO inequality, was proposed in [29].
Note that part 1 of the above assumption involves f only; that is, it is an assumption on the smooth part of the objective function describing the problem. On the other hand, part 2 of the assumption is fully in the hands of the practitioner and is in principle independent of the problem itself. That is, Ŝ is a parameter of the method, and hence one can decide how to choose it. Note that we restrict our attention to uniform samplings. However, the results of this paper were extended after the publication of [12], and, in particular, the uniformity assumption was lifted. This also necessitated a change in the definition of the ESO inequality (15) and in the APPROX algorithm (the extended algorithm is referred to by the name ALPHA [26]). (A remark on notation: in [33], the authors write (β/2)‖h‖²_w instead of ½‖h‖²_v in (15). This is because they study families of samplings Ŝ, parameterized by τ, for which w is fixed and all changes are captured in the constant β. Clearly, the two definitions are interchangeable, as one can choose v = βw. Here we will need to compare weights which are not linearly dependent, hence the simplified notation.)
A detailed explanation of why (15) is a reasonable assumption is given in [33,11]; here we shall only provide a brief commentary. Recall that the modeler can choose how the space R^N is decomposed into n blocks. If we choose n = 1, then all coordinates belong to a single block, all randomness is removed from (15), and APPROX specializes to one of the variants of gradient descent from Table 2. The ESO inequality then simply requires the gradient of f to be Lipschitz, with constant v_1, with respect to the norm ‖·‖_(1) (compare this with Theorem 1(ii)). Inequality (15) is the natural extension of this requirement to the case when one is only allowed to move in a random subspace of R^N. Note that the assumption is always satisfiable if the gradient of f is Lipschitz: it suffices to take the constants {v_i} large enough. These constants determine the stepsizes in our method, and hence we need easy-to-compute formulas for them. Moreover, the complexity bound of APPROX improves as these constants get smaller, which means that tighter bounds are preferable.
Fercoq and Richtárik [11, Theorem 10] observed that inequality (15) is equivalent to requiring that the gradient of the function h ↦ E[f(x + h_[Ŝ])] be Lipschitz at h = 0, uniformly in x, with constant τ/n, with respect to the norm ‖·‖_v. Equivalently, the Lipschitz constant is L_f with respect to the norm ‖·‖_ṽ, where ṽ := nv/‖v‖_1. The change of norms is done so as to enforce the weights in the norm to add up to n, which roughly enables us to compare different ESOs via the constants L_f.
The above observations are useful in understanding what the ESO inequality encodes. By moving from x to x + h_[Ŝ], one takes a step in the random subspace of R^N spanned by the blocks belonging to Ŝ. If τ ≪ n, which is often the case in big data problems, the step is confined to a low-dimensional subspace of R^N. It turns out that for many classes of functions arising in applications, for instance, functions exhibiting certain sparsity or partial separability patterns, the gradient of f varies much more slowly in such subspaces, on average, than it does in R^N. This in turn implies that updates h based on minimizing the right-hand side of (15) produce larger steps and eventually lead to faster convergence.
Assumption 2. The functions {f_j} have block-Lipschitz gradients with constants L_ji ≥ 0. That is, for all j = 1, 2, ..., m, i = 1, 2, ..., n, x ∈ R^N, and t ∈ R^{N_i},
(16) ‖∇_i f_j(x + U_i t) − ∇_i f_j(x)‖*_(i) ≤ L_ji ‖t‖_(i).
Note that, under the above assumption, we necessarily have
(17) ‖∇_i f(x + U_i t) − ∇_i f(x)‖*_(i) ≤ (Σ_{j=1}^m L_ji) ‖t‖_(i).
Assumption 2 is stronger than the assumption considered in [33]. Indeed, in [33] the authors only assumed that the sum f, as opposed to the individual functions f_j, has a block-Lipschitz gradient, with constants L_1, ..., L_n:
‖∇_i f(x + U_i t) − ∇_i f(x)‖*_(i) ≤ L_i ‖t‖_(i).
It is easy to see that if the stronger condition is satisfied, then the weaker one is also satisfied with L_i ≤ Σ_{j=1}^m L_ji.
3.2. New ESO. The main result of this section is Theorem 1, in which we derive an ESO inequality for functions satisfying Assumption 2 and the τ-nice sampling (for some τ ∈ [n]). A sampling is called τ-nice if it picks a subset of [n] of size τ uniformly at random from all subsets of size τ. It is, however, possible to derive similar bounds for all uniform samplings considered in [33] using the same approach. For an alternative approach to deriving ESO inequalities, one relying on a different assumption on f and capable of handling arbitrary samplings, we refer the reader to [27].
In the proof we will refer to two identities established in [33]. First, for the τ-nice sampling and any set J ⊆ [n], we have the identity E[|J ∩ Ŝ|] = (τ/n)|J|. Second, if θ_1, ..., θ_n are arbitrary real scalars and P(|J ∩ Ŝ| = k) > 0, then E[Σ_{i∈J∩Ŝ} θ_i | |J ∩ Ŝ| = k] = (k/|J|) Σ_{i∈J} θ_i. We are now ready to state and prove our result.
Theorem 1. Let f satisfy Assumption 2 and let Ŝ be the τ-nice sampling.
(i) Then (f, Ŝ) ∼ ESO(v), i.e.,
(20) E[f(x + h_[Ŝ])] ≤ f(x) + (τ/n)(⟨∇f(x), h⟩ + ½‖h‖²_v),
where v_i = Σ_{j=1}^m β_j L_ji and β_j := 1 + ((ω_j − 1)(τ − 1))/max(1, n − 1).
(ii) In particular, for τ = n the gradient of f is Lipschitz with constant ω̄L̄ with respect to the norm ‖·‖_w:
(23) f(x + h) ≤ f(x) + ⟨∇f(x), h⟩ + (ω̄L̄/2)‖h‖²_w for all x, h ∈ R^N,
where v̄_i := Σ_{j=1}^m ω_j L_ji, and ω̄, L̄, and w = (w_1, ..., w_n) are defined by
L_j := Σ_{i=1}^n L_ji,   L̄ := (1/n) Σ_{j=1}^m L_j,   ω̄ := (Σ_{j=1}^m ω_j L_j)/(Σ_{j=1}^m L_j),   w_i := v̄_i/(ω̄L̄).
Note that v̄ = ω̄L̄w, that Σ_i w_i = n, and that ω̄ is a data-weighted average of the values {ω_j}.
Proof. Statement (ii) is a special case of (i) for τ = n (notice that for the n-nice sampling we have v = v̄ and ω̄L̄w = v̄). We hence only need to prove (i). A well-known consequence of (16) is the inequality
(24) f_j(x + U_i t) ≤ f_j(x) + ⟨∇_i f_j(x), t⟩ + (L_ji/2)‖t‖²_(i),   x ∈ R^N, t ∈ R^{N_i}.
We first claim that for all j,
(25) E[f_j(x + h_[Ŝ])] ≤ f_j(x) + (τ/n)(⟨∇f_j(x), h⟩ + (β_j/2)‖h‖²_{L_j:}),
where ‖h‖²_{L_j:} := Σ_{i∈C_j} L_ji ‖h^(i)‖²_(i). Inequality (20) then follows by adding up the inequalities (25) for all j. Let us now prove the claim. We fix x and adopt the convention that expectation conditional on an event which happens with probability 0 is equal to 0. Letting η_j := |C_j ∩ Ŝ|, we decompose E[f_j(x + h_[Ŝ])] according to the value of η_j. For any k ≥ 1 for which P(η_j = k) > 0, we use the convexity of f_j, inequality (24), and the two identities for the τ-nice sampling recalled above to bound the conditional expectation E[f_j(x + h_[Ŝ]) | η_j = k]. Combining these bounds and taking the expectation over η_j, we obtain
E[f_j(x + h_[Ŝ])] ≤ f_j(x) + (τ/n)⟨∇f_j(x), h⟩ + (τβ_j/(2n))‖h‖²_{L_j:},
which is (25). This concludes the proof.
Note that (23) says that the gradient of f is Lipschitz, with constant ω̄L̄, with respect to the norm ‖·‖_w. We write (23) in terms of the norm ‖·‖_w since the weights w_i add up to n, and hence the norm is in some sense comparable in scale to the standard Euclidean norm. The quantities β_j have a natural interpretation. It can be inferred from the identities established in [33, section 4] that β_j = E[|C_j ∩ Ŝ|²]/E[|C_j ∩ Ŝ|]. Alternatively, β_j can be seen as the expected size of C_j ∩ Ŝ, conditioned on the event that a given block i ∈ C_j is selected by Ŝ.
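As an illustration, the following sketch computes the stepsize parameters v_i = Σ_j β_j L_ji for the τ-nice sampling using the formula for β_j from Theorem 1 above. The assumption that ω_j can be read off from the nonzero pattern of the matrix of constants L_ji, and all names, are specific to this example.

```python
import numpy as np

def eso_stepsizes(L, tau):
    """Stepsize parameters v_i = sum_j beta_j * L_ji for the tau-nice sampling.

    L is an (m x n) array of block Lipschitz constants L_ji (Assumption 2);
    omega_j is taken to be the number of nonzeros in row j of L.
    """
    m, n = L.shape
    omega = (L > 0).sum(axis=1)                        # omega_j, one per row
    beta = 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)
    return beta @ L                                    # v_i = sum_j beta_j L_ji

L = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0]])
print(eso_stepsizes(L, tau=1))   # tau = 1: v_i = sum_j L_ji
print(eso_stepsizes(L, tau=3))   # tau = n: v_i = sum_j omega_j L_ji
```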

Computation of L_ji.
We now give a formula for the constants L_ji in the case when f_j arises as the composition of a scalar function φ_j, whose derivative has a known Lipschitz constant L_φj (this is often easy to compute), and a linear functional. Let A be an m × N real matrix, and for j ∈ {1, 2, ..., m} and i ∈ [n] define
(30) A_ji := e_j^T A U_i.
That is, A_ji is the row vector composed of the elements of row j of A corresponding to block i.
Theorem 2. Let f_j(x) = φ_j(Σ_{i=1}^n A_ji x^(i)), where φ_j : R → R is convex with L_φj-Lipschitz derivative. Then, for all x ∈ R^N, t ∈ R^{N_i}, and i ∈ [n],
‖∇_i f_j(x + U_i t) − ∇_i f_j(x)‖*_(i) ≤ L_φj (‖A_ji^T‖*_(i))² ‖t‖_(i) = L_φj ⟨B_i^{-1} A_ji^T, A_ji^T⟩ ‖t‖_(i).
In other words, f j satisfies (16) with constants L ji given above.
Proof. For any x ∈ R^N, t ∈ R^{N_i}, and i we have ∇_i f_j(x) = A_ji^T φ_j'(Σ_l A_jl x^(l)), and hence
‖∇_i f_j(x + U_i t) − ∇_i f_j(x)‖*_(i) = |φ_j'(Σ_l A_jl x^(l) + A_ji t) − φ_j'(Σ_l A_jl x^(l))| ‖A_ji^T‖*_(i) ≤ L_φj |A_ji t| ‖A_ji^T‖*_(i) ≤ L_φj (‖A_ji^T‖*_(i))² ‖t‖_(i),
where the last step follows by applying the Cauchy-Schwarz inequality.
Example 2 (computing L_ji). Let f_j be as in Theorem 2. We consider three block setups.
(i) Consider the block setup with N_i = 1 (all blocks are of size 1) and B_i = 1 for all i. Then L_ji = L_φj A_ji², where A_ji is now simply the (j, i) entry of A.
(ii) Consider the other extreme, n = 1; that is, all coordinates belong to a single block only. Further, let B_1 (recall that B_i is the positive definite matrix defining the norm associated with block i) be the N × N identity matrix. Then L_j1 = L_φj ‖A_j:‖², where A_j: denotes the jth row of A and ‖·‖ is the standard Euclidean norm.
(iii) Choose nontrivial block sizes and define data-driven block norms with B_i = A_i^T A_i, where A_i := AU_i, assuming that the matrices A_i^T A_i are positive definite (necessarily, N_i ≤ m). The idea here is that data-driven norms better capture the curvature of the function in the subspaces spanned by the blocks. Then L_ji = L_φj (M_i)_{jj}, where M_i := A_i (A_i^T A_i)^{-1} A_i^T. Since M_i is a projection matrix, all its eigenvalues are either 0 or 1, and tr(M_i) = rank(A_i) = N_i. In particular, its diagonal elements satisfy 0 ≤ (M_i)_{jj} ≤ 1 and Σ_j (M_i)_{jj} = N_i.
Table 4 lists the constants L_φ for selected scalar loss functions φ popular in machine learning. In Table 5 we list the stepsizes for coordinate descent methods proposed in the literature. For simplicity of comparison, this is done for the setup described in case (i) of the above example. It can be seen that our stepsizes are better than those proposed by Richtárik and Takáč [33] and those proposed by Necoara and Clipici [19]. Indeed, v_i^rt ≥ v_i^fr for all i; the difference grows as τ grows, and there is equality for τ = 1. We also have ‖v^nc‖_1 ≥ ‖v^fr‖_1, but here the difference decreases with τ, and there is equality for τ = n.
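A small sketch, under the assumptions of case (i) above (blocks of size 1 and B_i = 1), computing L_ji = L_φj A_ji² from a sparse data matrix. The SciPy-based setup and the choice of quadratic losses φ_j(s) = s²/2 (so that L_φj = 1) are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def block_lipschitz_constants(A, L_phi):
    """Case (i): N_i = 1, B_i = 1, so L_ji = L_phi_j * A_ji^2."""
    A = sp.csr_matrix(A)
    L = A.multiply(A)                 # elementwise square, keeps sparsity
    return sp.diags(L_phi) @ L        # scale row j by L_phi_j

A = sp.random(5, 8, density=0.3, random_state=0, format="csr")
L_phi = np.ones(5)                    # phi_j(s) = s^2/2 has 1-Lipschitz derivative
L = block_lipschitz_constants(A, L_phi)
print(L.toarray().round(3))
```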

4. Accelerated Parallel and Proximal Coordinate Descent.
We are interested in solving the regularized optimization problem
(33) minimize_{x∈R^N} F(x) := f(x) + ψ(x),
where ψ : R^N → R ∪ {+∞} is a (possibly nonsmooth) convex regularizer that is separable in the blocks x^(i):
(34) ψ(x) = Σ_{i=1}^n ψ_i(x^(i)).
The functions ψ_i : R^{N_i} → R ∪ {+∞} are assumed to be convex and closed (i.e., lower semicontinuous).

The Algorithm.
We now describe our method (Algorithm 1). It is presented here in a form that facilitates analysis and comparison with existing methods. In section 6 we rewrite the method in a different (equivalent) form, one that is geared toward practical efficiency.
Algorithm 1 APPROX
1: Choose x_0 ∈ dom ψ; set z_0 = x_0 and θ_0 = τ/n
2: for k ≥ 0 do
3:   y_k = (1 − θ_k) x_k + θ_k z_k
4:   Generate a random set of blocks S_k ∼ Ŝ
5:   z_{k+1} = z_k
6:   for i ∈ S_k do
7:     z_{k+1}^(i) = arg min_{z∈R^{N_i}} { ⟨∇_i f(y_k), z − y_k^(i)⟩ + (nθ_k v_i)/(2τ) ‖z − z_k^(i)‖²_(i) + ψ_i(z) }
8:   end for
9:   x_{k+1} = y_k + (nθ_k/τ)(z_{k+1} − z_k)
10:  θ_{k+1} = (√(θ_k⁴ + 4θ_k²) − θ_k²)/2
11: end for
The method starts with x_0 ∈ R^N and generates three vector sequences denoted {x_k, y_k, z_k}_{k≥0}. In step 3, y_k is defined as a convex combination of x_k and z_k, which may in general be full-dimensional vectors. This is not efficient, but we ignore this issue for now; in section 6 we show that it is possible to implement the method in such a way that it is never necessary to form y_k. In step 4 we generate a random block sampling S_k and then perform steps 5-8 in parallel. The assignment z_{k+1} ← z_k is not necessary in practice; the vector z_k should be overwritten in place. Instead, steps 5-8 should be read as saying that we update the blocks i ∈ S_k of z_k by solving |S_k| proximal problems in parallel, and we call the resulting vector z_{k+1}. Note that in step 9, x_{k+1} should also be computed in parallel. Indeed, x_{k+1} is obtained from y_k by changing only the blocks of y_k that belong to S_k; this is because z_{k+1} and z_k differ in those blocks only. Note that gradients are evaluated only at y_k. We show in section 6 how this can be done efficiently for some problems, without the need to form y_k.
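For readers who prefer code, here is a compact sketch of the iteration above for the special case ψ = λ‖·‖_1, blocks of size 1, dense NumPy data, and the τ-nice sampling. All names are illustrative, and the sketch deliberately performs the full-dimensional operations (forming y_k) that section 6 shows how to avoid.

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def approx_lasso(A, b, lam, v, tau, iters, rng):
    """Sketch of the APPROX iteration for f(x) = 0.5*||Ax - b||^2, psi = lam*||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    z = x.copy()
    theta = tau / n
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z                        # step 3
        S = rng.choice(n, size=tau, replace=False)               # step 4 (tau-nice)
        grad_S = A[:, S].T @ (A @ y - b)                         # nabla_i f(y), i in S
        a_S = n * theta * v[S] / tau
        z_new = z.copy()
        z_new[S] = soft_threshold(z[S] - grad_S / a_S, lam / a_S)  # steps 5-8
        x = y + (n * theta / tau) * (z_new - z)                  # step 9
        z = z_new
        theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2  # step 10
    return x

rng = np.random.default_rng(0)
m, n, tau = 40, 100, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
omega = np.count_nonzero(A, axis=1)                 # omega_j (here = n, A is dense)
beta = 1 + (omega - 1) * (tau - 1) / max(1, n - 1)
v = beta @ (A * A)                                  # stepsizes as in Theorem 1
x = approx_lasso(A, b, lam=0.1, v=v, tau=tau, iters=300, rng=rng)
print(0.5 * np.sum((A @ x - b) ** 2) + 0.1 * np.abs(x).sum())
```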
We now formulate the main result of this paper; its proof is in section 5.
Theorem 3. Let Assumption 1 be satisfied, with (f, Ŝ) ∼ ESO(v), where τ = E[|Ŝ|] > 0. Let x_0 ∈ dom ψ, and assume that the random sets S_k in Algorithm 1 are chosen independently, following the distribution of Ŝ. Let x_* be any optimal point of problem (33). Then the iterates {x_k}_{k≥1} of APPROX satisfy
(35) E[F(x_k) − F(x_*)] ≤ (4n² / ((k − 1)τ + 2n)²) C_*,
where
(36) C_* := (1 − τ/n)(F(x_0) − F(x_*)) + ½‖x_0 − x_*‖²_v.
In other words, for any 0 < ε ≤ C_*, the number of iterations needed to obtain an ε-solution in expectation (i.e., E[F(x_k) − F(x_*)] ≤ ε) does not exceed
(37) (2n/τ)(√(C_*/ε) − 1) + 1.
Let us now comment on the result.
Assumptions. For the complexity result to hold, we do not assume that f is of the form (2); all that is needed is Assumption 1.
All Coordinates Belong to a Single Block. If we choose n = 1 (single block), then the only reasonable sampling is to pick this block with probability 1 (P(Ŝ = {1}) = 1), and the method becomes deterministic. Let B_1 be the N × N identity matrix, so that ‖·‖_(1) is the standard Euclidean norm (and hence ‖x‖²_v = v‖x‖²). In this case we recover Tseng's accelerated proximal gradient descent [47], and the complexity bound (35) takes the form
(38) F(x_k) − F(x_*) ≤ 2v‖x_0 − x_*‖² / (k + 1)²,
where v is the Lipschitz constant of the gradient of f (this is what the assumption (f, Ŝ) ∼ ESO(v) means for this sampling). Note that Theorem 1 gives a bound on this Lipschitz constant for f of the form (2) satisfying Assumption 2.
Updating All Blocks. In the case when we update all blocks in each iteration (τ = n), the method is deterministic, and the bound (35) simplifies to
(39) F(x_k) − F(x_*) ≤ (2‖v‖_1/n) ‖x_0 − x_*‖²_ṽ / (k + 1)²,
where, as before, ṽ = nv/‖v‖_1. Note that ‖·‖_ṽ is a weighted norm with weights adding up to n, which means it is "comparable" to the standard Euclidean norm (all weights of which are equal to 1 and hence also sum up to n). If we use the stepsizes v̄ proposed in Theorem 1, then, in view of part (ii) of that theorem, bound (39) takes the form
(40) F(x_k) − F(x_*) ≤ 2ω̄L̄‖x_0 − x_*‖²_w / (k + 1)²,
as advertised in the abstract. Recall that ω̄ is a data-weighted average of the values {ω_j} and that Σ_i w_i = n. In contrast, using the stepsizes proposed by Richtárik and Takáč [33] (see Table 5), we get the analogous bound
(41) F(x_k) − F(x_*) ≤ 2ωL̄'‖x_0 − x_*‖²_{w'} / (k + 1)²,
where L̄' := (1/n)Σ_{i=1}^n L_i, w'_i := L_i/L̄', and L_1, ..., L_n are the block Lipschitz constants of ∇f. Note that in the case when the functions f_j are convex quadratics (f_j(x) = ½(a_j^T x − b_j)², for instance), we have L_i = Σ_j L_ji, and hence the new stepsizes lead to a vast improvement in the complexity whenever ω̄ ≪ ω. On the other hand, in cases where L_i ≪ Σ_j L_ji (which can happen with logistic regression, for instance), the result based on the Richtárik-Takáč stepsizes [33] may be better.

Four Lemmas.
In the first lemma we summarize well-known properties of the sequence θ_k used in Algorithm 1.
Lemma 1 (Tseng [47]). The sequence {θ_k}_{k≥0} defined in Algorithm 1 is decreasing and satisfies 0 < θ_k ≤ 2/(k + 2n/τ) for all k ≥ 0, as well as the identity
(43) (1 − θ_{k+1})/θ_{k+1}² = 1/θ_k².
We now give an explicit characterization of x_k as a convex combination of the vectors z_0, ..., z_k.
Lemma 2. Let {x_k, z_k}_{k≥0} be the iterates of Algorithm 1. Then, for all k ≥ 0,
(44) x_k = Σ_{l=0}^k γ_k^l z_l,
where the coefficients γ_k^0, γ_k^1, ..., γ_k^k are nonnegative and sum to 1. That is, x_k is a convex combination of the vectors z_0, z_1, ..., z_k. In particular, the constants are defined recursively in k by setting γ_0^0 = 1, γ_1^0 = 0, γ_1^1 = 1 and, for k ≥ 1,
(45) γ_{k+1}^l = (1 − θ_k)γ_k^l for l = 0, ..., k − 1,   γ_{k+1}^k = (1 − θ_k)γ_k^k + θ_k − nθ_k/τ.
Moreover, for all k ≥ 0, the following identity holds:
(46) γ_{k+1}^{k+1} = nθ_k/τ.
Proof. We proceed by induction. First, notice that x_0 = z_0 = γ_0^0 z_0. This implies that y_0 = z_0, which in turn, together with θ_0 = τ/n, gives x_1 = y_0 + (nθ_0/τ)(z_1 − z_0) = z_1 = γ_1^0 z_0 + γ_1^1 z_1. Assuming now that (44) holds for some k ≥ 1, we obtain
(47) x_{k+1} = y_k + (nθ_k/τ)(z_{k+1} − z_k) = (1 − θ_k)x_k + (θ_k − nθ_k/τ)z_k + (nθ_k/τ)z_{k+1},
which yields the recursion (45)-(46). By applying Lemma 1, together with the inductive assumption that γ_k^l ≥ 0 for all l, we observe that γ_{k+1}^l ≥ 0 for all l. It remains to show that the constants sum to 1. This is true since x_k is a convex combination of z_0, ..., z_k, and, by (47), x_{k+1} is an affine combination of x_k, z_k, and z_{k+1} whose coefficients sum to 1.
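As a quick sanity check on Lemma 1, the following short calculation (using the θ-update rule of Algorithm 1 as stated above) verifies the identity (43):

```latex
% theta_{k+1} is the positive root of  t^2 + \theta_k^2 t - \theta_k^2 = 0:
\[
  \theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}
  \quad\Longrightarrow\quad
  \theta_{k+1}^2 = (1 - \theta_{k+1})\,\theta_k^2
  \quad\Longleftrightarrow\quad
  \frac{1-\theta_{k+1}}{\theta_{k+1}^2} = \frac{1}{\theta_k^2}.
\]
% Since 0 < \theta_{k+1} < 1 and \theta_{k+1}^2 < \theta_k^2, the sequence is positive and decreasing.
```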

Define
z̃_{k+1} := arg min_{z∈R^N} { ⟨∇f(y_k), z − y_k⟩ + (nθ_k)/(2τ) ‖z − z_k‖²_v + ψ(z) }.
Since the objective above is block separable, the minimization decomposes over the blocks. From this and the definition of z_{k+1} in Algorithm 1 we see that
(48) z_{k+1}^(i) = z̃_{k+1}^(i) for i ∈ S_k,   z_{k+1}^(i) = z_k^(i) for i ∉ S_k.
The next lemma is an application, to a specific function, of a well-known result that can be found, for instance, in [47, Property 1]. The result was used by Tseng to construct a simplified complexity proof for a proximal gradient descent method.
Lemma 3 (see [47]). Let ξ(u) := ⟨∇f(y_k), u − y_k⟩ + ψ(u). Then for all u ∈ R^N,
ξ(u) + (nθ_k)/(2τ)‖u − z_k‖²_v ≥ ξ(z̃_{k+1}) + (nθ_k)/(2τ)‖z̃_{k+1} − z_k‖²_v + (nθ_k)/(2τ)‖u − z̃_{k+1}‖²_v.
Our next lemma is a technical result connecting the gradient mapping (producing z̃_{k+1}) and the randomized block gradient mapping (producing the random vector z_{k+1}). The lemma reduces to a trivial identity in the case of a single block (n = 1). From now on, by E_k we denote the expectation with respect to S_k, keeping everything else fixed.
Lemma 4. For any x ∈ R^N and k ≥ 0,
(50) E_k[⟨∇f(y_k), z_{k+1} − x⟩] = (1 − τ/n)⟨∇f(y_k), z_k − x⟩ + (τ/n)⟨∇f(y_k), z̃_{k+1} − x⟩   and   E_k[‖z_{k+1} − z_k‖²_v] = (τ/n)‖z̃_{k+1} − z_k‖²_v.
Moreover,
(51) E_k[ψ(z_{k+1})] ≤ (1 − τ/n)ψ(z_k) + (τ/n)ψ(z̃_{k+1}).
Proof. Let Ŝ be any uniform sampling with E[|Ŝ|] = τ, and let a, h ∈ R^N. By Theorem 4 in [33] and the uniformity of Ŝ,
(52) E[⟨a, h_[Ŝ]⟩] = (τ/n)⟨a, h⟩,   E[‖h_[Ŝ]‖²_v] = (τ/n)‖h‖²_v,   E[ψ(a + h_[Ŝ])] ≤ (1 − τ/n)ψ(a) + (τ/n)ψ(a + h).
In view of (3) and (48), we can write z_{k+1} = z_k + (z̃_{k+1} − z_k)_[S_k], and (50) follows by applying the first two relations in (52) with h = z̃_{k+1} − z_k. The remaining statement follows from the last relation in (52) used with a = z_k.

Proof of Theorem 3.
Using Lemma 2 and the convexity of ψ, we have ψ(x_k) ≤ Σ_{l=0}^k γ_k^l ψ(z_l) =: ψ̂_k, which holds for all k ≥ 0. From this we get F(x_k) = f(x_k) + ψ(x_k) ≤ f(x_k) + ψ̂_k. Note that from the definition of y_k in the algorithm we have y_k − x_k = θ_k(z_k − x_k). For all k ≥ 0 we define the upper bound F̂_k := f(x_k) + ψ̂_k ≥ F(x_k) and bound the expectation of F̂_{k+1} with respect to S_k by combining the ESO inequality (15), applied to the step from y_k to x_{k+1}, with Lemmas 3 and 4. Dividing both sides of the resulting inequality by θ_k², using (43), and rearranging the terms, we obtain a recurrence which, after taking expectations and unrolling, yields, for k ≥ 1,
E[F(x_k) − F(x_*)] ≤ θ_{k−1}² [ ((1 − θ_0)/θ_0²)(F(x_0) − F(x_*)) + ½‖x_0 − x_*‖²_v ],
where we have used the facts that F̂_0 = F(x_0), x_0 = z_0, and θ_0 = τ/n. The bound (35) now follows from the estimate θ_{k−1} ≤ 2/(k − 1 + 2n/τ) from Lemma 1.
6. Implementation without Full-Dimensional Vector Operations. Algorithm 1, as presented, performs full-dimensional vector operations. Indeed, y_k is defined as a convex combination of x_k and z_k. Also, x_{k+1} is obtained from y_k by changing |S_k| coordinates; however, if |S_k| is small, the latter operation is not costly. In any case, the vectors x_k and z_k will in general be dense, and hence the computation of y_k may cost O(N) arithmetic operations. However, simple (i.e., nonaccelerated) coordinate descent methods are successful and popular precisely because they can avoid such operations. Adapting ideas from Lee and Sidford [13], we rewrite Algorithm 1 in a new form, incarnated as Algorithm 2.
Algorithm 2 APPROX (written in a form facilitating efficient implementation)
1: Choose x_0 ∈ dom ψ and set z_0 = x_0, u_0 = 0
2: Set θ_0 = τ/n
3: for k ≥ 0 do
4:   Generate a random set of blocks S_k ∼ Ŝ
5:   for i ∈ S_k do
6:     t_k^(i) = arg min_{t∈R^{N_i}} { ⟨∇_i f(θ_k² u_k + z_k), t⟩ + (nθ_k v_i)/(2τ) ‖t‖²_(i) + ψ_i(z_k^(i) + t) }
7:     z_{k+1}^(i) = z_k^(i) + t_k^(i)
8:     u_{k+1}^(i) = u_k^(i) − ((1 − nθ_k/τ)/θ_k²) t_k^(i)
9:   end for
10:  θ_{k+1} = (√(θ_k⁴ + 4θ_k²) − θ_k²)/2
11: end for
12: OUTPUT: x_{k+1} = θ_k² u_{k+1} + z_{k+1}
(Blocks i ∉ S_k are left unchanged: z_{k+1}^(i) = z_k^(i) and u_{k+1}^(i) = u_k^(i).) Note that if, instead of updating the constants θ_k as in line 10, we keep them constant throughout, θ_k = τ/n, then u_k = 0 for all k. The resulting method is precisely the PCDM algorithm (nonaccelerated parallel block coordinate descent method) proposed and analyzed in [33].
As it is not immediately obvious that the two methods (Algorithms 1 and 2) are equivalent, we include the following result. Its proof can be found in the appendix.
Proposition 1 (equivalence). Algorithms 1 and 2 are equivalent. In particular, if we run Algorithm 2 with z_0 = x_0, where x_0 ∈ dom ψ is the starting point of Algorithm 1, and define x_{k+1} := θ_k² u_{k+1} + z_{k+1} and y_k := θ_k² u_k + z_k for all k ≥ 0, then the resulting sequences coincide with the sequences {x_k}, {y_k}, {z_k} produced by Algorithm 1.
In Algorithm 2 we never need to form x_k during the iterations; the only time this is needed is when producing the output, x_{k+1} = θ_k² u_{k+1} + z_{k+1}. More importantly, the method does not need to explicitly compute y_k. Instead, we introduce a new vector, u_k, and express y_k as y_k = θ_k² u_k + z_k. The method accesses y_k only via the block gradients ∇_i f(y_k) for i ∈ S_k. Hence, if it is possible to cheaply compute these gradients without actually forming y_k, we can avoid full-dimensional operations.
We now show that this can be done for functions f of the form (2), where each f_j is as in Theorem 2:
(63) f(x) = Σ_{j=1}^m φ_j(Σ_{i=1}^n A_ji x^(i)).
Let D_i be the set of indices j for which A_ji ≠ 0. If we write r_{u_k} := A u_k and r_{z_k} := A z_k, then, using (63) and y_k = θ_k² u_k + z_k, we can write
(64) ∇_i f(y_k) = Σ_{j∈D_i} A_ji^T φ_j'(θ_k² (r_{u_k})_j + (r_{z_k})_j).
Assuming we store and maintain the residuals r_{u_k} and r_{z_k}, the computation of the product A_ji^T φ_j'(·) costs O(N_i) (we assume that the evaluation of the scalar derivative φ_j' takes O(1) time), and hence the computation of the block derivative (64) requires O(|D_i| N_i) arithmetic operations. Hence, on average, computing all block gradients for i ∈ S_k costs
C := (τ/n) Σ_{i=1}^n |D_i| N_i.
This will be small if the |D_i| are small and τ is small. For simplicity, assume all blocks are of equal size, N_i = b := N/n. Then C = (τ/n) b Σ_{i=1}^n |D_i| = (τ/n) b Σ_{j=1}^m ω_j = τ b (m/n) (1/m)Σ_{j=1}^m ω_j. It can be easily shown that the maintenance of the residual vectors r_{u_k} and r_{z_k} takes the same amount of time (C), and hence the total work per iteration is O(C). In many practical situations m ≤ n, and often m ≪ n (we focus on this case in the paper since usually this corresponds to f not being strongly convex), and (1/m)Σ_j ω_j = O(1). This then means that C = τ × O(b); that is, each of the τ processors does work proportional to the size of a single block per iteration.
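The following sketch illustrates the residual-based gradient computation (64) together with the residual updates, for blocks of size 1. The concrete data structures (a SciPy CSC matrix so that the nonzeros of column i are cheap to access) and all names are assumptions made for the illustration.

```python
import numpy as np
import scipy.sparse as sp

# We maintain r_u = A u and r_z = A z; then A y = theta^2 * r_u + r_z, so each
# partial derivative needs only the nonzeros of column i of A (the set D_i).

def partial_gradients(A_csc, r_u, r_z, theta, S, phi_prime):
    """Return {i: grad_i f(y)} for i in S, where y = theta^2 * u + z."""
    grads = {}
    for i in S:
        start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
        rows = A_csc.indices[start:end]          # D_i
        vals = A_csc.data[start:end]             # A_ji for j in D_i
        t = theta**2 * r_u[rows] + r_z[rows]     # (A y)_j for j in D_i only
        grads[i] = np.dot(vals, phi_prime(t))    # sum_j A_ji * phi_j'(...)
    return grads

def update_residuals(A_csc, r_u, r_z, delta_u, delta_z, S):
    """After u, z change on the coordinates in S, refresh r_u = A u, r_z = A z."""
    for i in S:
        start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
        rows = A_csc.indices[start:end]
        vals = A_csc.data[start:end]
        r_u[rows] += vals * delta_u[i]
        r_z[rows] += vals * delta_z[i]

A = sp.random(1000, 500, density=0.01, random_state=0, format="csc")
u, z = np.zeros(500), np.zeros(500)
r_u, r_z = A @ u, A @ z
phi_prime = lambda t: t                          # phi_j(s) = s^2 / 2
print(partial_gradients(A, r_u, r_z, theta=0.1, S=[3, 7], phi_prime=phi_prime))
update_residuals(A, r_u, r_z, {3: 0.0, 7: 0.0}, {3: 0.5, 7: -0.2}, S=[3, 7])
```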
The favorable situation described above is a consequence of the block sparsity of the data matrix A and does not depend on φ_j, insofar as the evaluation of its derivative takes O(1) work. Hence, it applies to convex quadratics (φ_j(s) = s²), to logistic regression (φ_j(s) = log(1 + exp(s))), and also to smooth approximations of nonsmooth losses with smoothing parameter μ > 0, as considered in [22,11].

7. Numerical Experiments.
In all tests we used a shared-memory workstation with 4 Intel Xeon X5670 processors (24 cores in total) at 2.93 GHz and 192 GB RAM.
In the experiments, we departed from the theory in two ways. (i) Our implementation of APPROX is asynchronous in order to limit communication costs. For example, on the problem of section 7.2, the asynchronous implementation is about 5 times faster than the synchronous implementation, in which each processor waits until the others terminate their update of the variable before proceeding. (ii) We approximated the τ-nice sampling by a τ-independent sampling as in [33] (the latter is very easy to generate in parallel; note that our analysis can easily be extended to cover the τ-independent sampling). For simplicity, in all tests all blocks are of size 1 (N_i = 1 for all i); further speedups can be obtained by working with larger block sizes, as then each processor is better utilized. For the problems we consider, coordinate descent methods are the state of the art. Hence, all methods we compare are coordinate descent methods of some variety. These methods share many similar components (e.g., computation of a partial derivative, update of a coordinate, sampling); wherever possible, in our comparisons we have built all methods using identical components. Hence, while we often report runtime in the experiments, all differences are a genuine reflection of differences between the algorithms.
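For clarity, here is a minimal sketch contrasting the two samplings mentioned above; a τ-independent sampling only requires τ independent uniform draws (duplicates are simply merged), which is why it is easy to generate in parallel. The NumPy-based setup is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def tau_nice(n, tau):
    """Uniformly random subset of {0,...,n-1} of size exactly tau."""
    return set(rng.choice(n, size=tau, replace=False))

def tau_independent(n, tau):
    """tau independent uniform draws; the resulting set has size at most tau."""
    return set(rng.integers(0, n, size=tau))

print(tau_nice(10, 4), tau_independent(10, 4))
```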

The Effect of New Stepsizes.
In this experiment, we compare the performance of the new stepsizes (introduced in section 3.2) with those proposed in [33] (see Table 5). We generated random LASSO (L1-regularized least-squares) instances with various distributions of the separability degrees ω_j (= the number of nonzero elements in the jth row of A) and studied the weighted distance to the optimum, ‖x_* − x_0‖_v, for the initial point x_0 = 0. This quantity appears in the complexity estimate (37) and depends on τ (the number of processors). We chose a random matrix of small size, N = m = 1000, as this is sufficient to make our point, and consider τ ∈ {10, 100, 1000}. In particular, we consider three different distributions of {ω_j}: uniform, intermediate, and extreme. The results are summarized in Table 7. First, we generated a uniformly sparse matrix with ω_j = 30 for all j. In this case, v^fr = v^rt, and hence the results are the same. We then generated an intermediate instance, with ω_j = 1 + 30j²/m². This matrix has many rows with a few nonzero elements and some rows with up to 30 nonzero elements. Looking at the table, the new stepsizes are clearly better. The improvement is moderate when there are few processors, but for τ = 1000 the complexity is 25% better. Finally, we generated a rather extreme matrix with ω_1 = 500 and ω_j = 3 for j > 1. We can see that the new stepsizes are much better, even with few processors, and can lead to a 5× speedup.
In the experiments above, we first fixed a sparsity pattern and then generated a random matrix A based on it. However, much larger differences can be seen for special matrices A. Consider the case τ = n. In view of (39), the complexity of APPROX is proportional to ‖v‖_1. Fix ω and ω_1, ..., ω_m and let us ask: for what data matrix A is the ratio θ := ‖v^rt‖_1/‖v^fr‖_1 maximized? Since ‖v^rt‖_1 = ω Σ_j ‖A_j:‖² and ‖v^fr‖_1 = Σ_j ω_j ‖A_j:‖², the ratio is largest when the rows with small ω_j carry nearly all of the weight Σ_j ‖A_j:‖². The extreme case is attained by a matrix with at least one dense row (ω_j = n) of negligible norm and one maximally sparse row (ω_j = 1), leading to θ approaching n. So, there are instances for which the new stepsizes can lead to an up to n× speedup for APPROX when compared to the stepsizes v^rt. Needless to say, these extreme instances are artificially constructed.
In Figure 1, we give the value of the SVM dual objective when minimized by serial randomized coordinate descent [32] (see section 7.4 for details on this problem). A similar plot would be obtained with APPROX. The dataset is rcv1 [24]; it consists of a matrix A with m = 47,236 and N = 20,242 and a vector b. The new stepsizes are very useful for this problem, since ω = max_j ω_j = 8,551 and ω̄ ≈ 488 < ω/17 (see Theorem 1 for the definition of ω̄). In all subsequent experiments we use these new stepsizes, be it for accelerated or nonaccelerated parallel coordinate descent.

L1-Regularized L1 Regression.
We wish to find x ∈ R^N that minimizes ‖Ax − b‖_1 + λ‖x‖_1 with λ = 1. Because the objective is nonsmooth and nonseparable, we apply the smoothing technique presented in [22] to the first part of the objective and use the smoothed parallel coordinate descent method (SPCDM) proposed in [11]. The level of smoothing depends on the expected accuracy: we chose ε = 0.1 (0.0125% of the initial value obtained at x_0 = 0), so the smoothing parameter defined in [11] is μ = 4.2 × 10⁻⁶. We consider the dorothea dataset [24]: a sparse, moderate-sized feature matrix A with m = 800, N = 100,000, ω = 6,061, and ω̄ ≈ 1,104.1, together with a vector b ∈ R^m.
We compared 4 algorithms, all run with 4 processors (τ = 4): gradient descent, the accelerated gradient method of [22], SPCDM [11] with stepsizes v^fr, and APPROX with stepsizes v^fr (see Figure 2). As one can see, coordinate descent methods are much more efficient on this problem than both gradient descent and accelerated gradient descent. Also, APPROX is better than SPCDM. As the problem is of small size, we could compute the optimal solution using an interior point method for linear programming and compare the value at each iteration to the optimal value (Table 8). Each line of the table gives the time needed by APPROX and SPCDM to reach a given accuracy target. In the beginning (until F(x_k) − F(x_*) < 6.4), the algorithms are in a transitional phase. Then, each time one runs the algorithm twice as long, F(x_k) − F(x_*) is divided by 2 for SPCDM and by 4 for APPROX. This highlights the difference in convergence speeds: O(1/k) compared to O(1/k²). As a result, APPROX reaches an ε-solution in 186.5 seconds, while SPCDM has still not finished after 2,000 seconds.

LASSO.
We now consider L1-regularized least-squares regression on the KDDB dataset [42]. It consists of a large sparse matrix A ∈ R^{m×N} with m = 29,890,095 and N = 19,264,097 (with ω = 75 and ω̄ ≈ 31.87), and a vector b ∈ R^m. The comparison of APPROX with SPCDM on this instance is summarized in Table 9. An important feature of the L1 regularization is that it promotes sparsity in the optimization variable x. As APPROX only involves proximal steps on the z variable, only z_k is encouraged to be sparse, and not x_k, y_k, or u_k. A possible way to obtain a sparse solution is to first compute x_k and then postprocess it with a few iterations of a sparsity-oriented method (such as iterative hard thresholding, proximal gradient descent, or cyclic/randomized coordinate descent).
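A minimal sketch of the postprocessing idea mentioned above, using a few proximal gradient (ISTA) steps to sparsify the returned point. The stepsize choice 1/‖A‖², the random stand-in data, and all names are assumptions of the example.

```python
import numpy as np

def sparsify_by_prox_grad(A, b, lam, x, steps=10):
    """A few ISTA steps on 0.5*||Ax-b||^2 + lam*||x||_1, started from the
    (dense) APPROX output x, to obtain a sparse point of similar quality."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(steps):
        g = A.T @ (A @ x - b)
        x = x - g / L
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)   # soft threshold
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 80)), rng.standard_normal(30)
x_dense = rng.standard_normal(80) * 0.01   # stand-in for the APPROX output
x_sparse = sparsify_by_prox_grad(A, b, lam=0.5, x=x_dense)
print(np.count_nonzero(x_sparse))
```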

7.4. Training Linear Support Vector Machines.
Our last experiment concerns the dual of the SVM problem [38]. For the dual SVM, the coordinates correspond to examples. We use the Malicious URL dataset [24] with a data matrix A of size m = 3,231,961 by N = 2,396,130 and a vector b ∈ R^N; we adapted b so that b_i ∈ {−1, 1}. Here ω = N = n, but the matrix is still sparse: nnz(A) = 277,058,644 ≪ mn, and (1/m)Σ_{j=1}^m ω_j ≈ 85.7. Our goal is to find x ∈ [0, 1]^N that minimizes the dual SVM objective (see [38]) with λ = 1/N. We compare APPROX (Algorithm 2) with stochastic dual coordinate ascent (SDCA [32,40,44]); the results are in Figure 3 (right). We used a single processor only (τ = 1) because ω = N and ω̄ ≈ 1.6 × 10⁶ ≈ 2N/3 are quite large, making parallelization not so efficient. For this problem, one can recover the primal solution [38], and thus we can compare the decrease in the duality gap, summarized in Table 10. APPROX is two to three times as fast as SDCA on this instance. An alternative algorithm is accelerated SDCA [41], but to apply it one needs to smooth the L1 norm, which makes the problem ill conditioned. In a similar fashion to LASSO (see Table 6), the complexity bound of accelerated SDCA involves squared logarithmic terms, of order ≈ 200 in the present case, that are not negligible and make this algorithm not competitive with SDCA or APPROX.

3.1. New Model.
Consider f of the form (2), i.e., f(x) = Σ_{j=1}^m f_j(x), and let C_j ≠ ∅ be the set of blocks that function f_j depends on. Define ω_j := |C_j| and ω := max_j ω_j. Clearly, any function f is of this form: it suffices to choose m = 1 and C_1 = {1, 2, ..., n}. However, many functions appearing in applications, notably in machine learning and statistics, have a natural representation of this form with m large and ω_j ≪ n for some, most, or all j.
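As a small illustration, the following sketch computes ω_j and ω from a sparse data matrix, together with a data-weighted average ω̄ for the quadratic case f_j(x) = ½(A_j:x − b_j)², where the natural row weights are L_j = ‖A_j:‖² (see Theorem 1). The quadratic weighting and all names are assumptions specific to this example.

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 200, density=0.02, random_state=0, format="csr")

omega_j = np.diff(A.indptr)                            # nonzero blocks per row (blocks of size 1)
omega = omega_j.max()                                  # maximum degree of separability
L_j = np.asarray(A.multiply(A).sum(axis=1)).ravel()    # ||A_{j,:}||^2, the weights for quadratic f_j
omega_bar = (omega_j @ L_j) / L_j.sum()                # data-weighted average of the omega_j

print(omega, round(float(omega_bar), 2))
```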

Fig. 1
Comparison of new (dash-dotted line) and old (dashed line) stepsizes for the dual SVM problem on the rcv1 dataset. We used only τ = 8 processors and the new stepsizes already lead to a two times speedup.

Table 2
The methods in this table all arise as special cases of APPROX by varying four elements: the presence and form of the proximal term ψ ("Prx"), the number of blocks n ("Blck"), the choice of the block sampling Ŝ, and the choice of the stepsize parameter θ_k (GD = gradient descent; BCD = block coordinate descent). Selected rows:
  Accelerated proximal GD [47,3]: Prx any; Blck 1; sampling Ŝ = {1} w.p. 1; θ_k as in APPROX.
  Serial BCD [32]: Prx separable; Blck any; sampling serial uniform; θ_k constant.
  Parallel BCD [33]: Prx separable; Blck any; sampling any uniform; θ_k constant.
  Distributed BCD [30]: Prx separable; Blck any; sampling distributed; θ_k constant.
  Accelerated distributed BCD [10]: Prx separable; Blck any; sampling distributed; θ_k as in APPROX.

Table 3
Selected applications of APPROX, developed by others after the publication of [12].

Table 4
Lipschitz constants of the derivative of selected scalar loss functions.

Table 5
ESO stepsizes for coordinate descent methods suggested in the literature, in the case of a quadratic f(x) = ½‖Ax − b‖². For simplicity, we consider the setup with elementary block sizes (N_i = 1) and the absolute value norm (this corresponds to B_i = 1).

Table 6
Complexity of coordinate descent, accelerated stochastic dual coordinate ascent (SDCA), and selected variants of APPROX, when applied to the LASSO problem min_{x∈R^N} ½‖Ax − b‖² + λ‖x‖_1, where A ∈ R^{m×N}. APPROX is superior to both coordinate descent and accelerated SDCA. For simplicity, we assume the starting point x_0 = 0 and define C_* as in Theorem 3. The complexity bounds for APPROX are exact; we omit constant terms in the complexity bounds for standard coordinate descent and accelerated SDCA, and in the formulas for the cost of one iteration. The notation ‖·‖_v means a weighted Euclidean norm with weights defined by the vector v = (v_1, ..., v_n). Each algorithm depends on a norm defined in the last column; note that the norms can differ a lot. The vector w* is as defined in [11], and the corresponding norm is a weighted norm in R^m.

Table 7
Comparison of ESOs in the uniform case.

Table 9
Comparison of the LASSO problem's duality gap after 2,000 s of computation for APPROX and smoothed parallel coordinate descent (SPCDM) on the KDDB dataset. We give the value as a percentage of the initial duality gap obtained at x_0 = 0. We obtained a sparse solution with the postprocessing described in the last paragraph of this section.

Table 10
Decrease of the duality gap for APPROX and stochastic dual coordinate ascent (SDCA).