On the oracle complexity of smooth strongly convex minimization

We construct a family of functions suitable for establishing lower bounds on the oracle complexity of first-order minimization of smooth strongly-convex functions. Based on this construction, we derive new lower bounds on the complexity of strongly-convex minimization under various inaccuracy criteria. The new bounds match the known upper bounds up to a constant factor, and when the inaccuracy of a solution is measured by its distance to the solution set, the new lower bound exactly matches the upper bound obtained by the recent Information-Theoretic Exact Method by the same authors, thereby establishing the exact oracle complexity for this class of problems.


Introduction
In this paper, we study the performance of deterministic first-order methods for approximating the solution of unconstrained strongly-convex minimization problems, i.e., problems of the form min_{x ∈ R^d} f(x) for some d ∈ N, where f is a strongly-convex function.
This problem setting has the attractive property that first-order methods for solving it are simple, scalable, and easy to implement; nevertheless, they enjoy 'fast' rates of convergence [11], making highly-accurate solutions relatively easy to attain. As a result, these problems are considered 'tractable' and play a key role in a wide range of applications, including machine learning, parameter estimation, and computer vision. These types of problems also appear as steps in methods for solving more complex problems, a property which makes their efficient solution even more important.
There has been significant progress in recent years in devising efficient methods for solving strongly-convex problems. Suppose f is a µ-strongly convex function in C^{1,1}_L(R^d) (the set of continuously differentiable functions that have an L-Lipschitz gradient); then the classical gradient method with an appropriately chosen step size attains after N iterations an approximate solution x_N such that [11, Theorem 2.1.15]

‖x_N − x*‖ ≤ ((L − µ)/(L + µ))^N ‖x_0 − x*‖,

where here and throughout the rest of the paper ‖·‖ stands for the Euclidean norm. This rate of convergence has been improved by the celebrated Accelerated Gradient Method [11], which generates a sequence of iterates converging to an optimal point at a rate of order O((1 − √(µ/L))^{N/2}). Recently, the Triple Momentum Method [16] improved this rate even further, attaining after N iterations an approximate solution x_N whose distance to the optimal point decreases at a rate of order O((1 − √(µ/L))^N). An additional improvement is due to the very recent Information-Theoretic Exact Method [14], which further improves the leading constant in the bound above. This progress naturally raises the question: can we do even better?
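As a rough numerical illustration of the rates discussed above, the following sketch compares per-iteration contraction factors for the distance to the optimal point. The specific factors used here, ((L − µ)/(L + µ)) for the gradient method and (1 − √(µ/L))-type factors for the momentum methods, are assumptions based on the standard rates in the literature, not values taken from this paper:

```python
import math

def iters_to_accuracy(factor, eps):
    """Iterations N needed so that factor**N <= eps, for a per-step
    contraction factor in (0, 1) on the distance to the optimum."""
    return math.ceil(math.log(eps) / math.log(factor))

L, mu, eps = 1.0, 0.01, 1e-6
q = mu / L  # inverse condition number

gm  = iters_to_accuracy((L - mu) / (L + mu), eps)          # classical gradient method
agm = iters_to_accuracy(math.sqrt(1 - math.sqrt(q)), eps)  # accelerated: distance^2 ~ (1 - sqrt(q))^N
tmm = iters_to_accuracy(1 - math.sqrt(q), eps)             # triple momentum: distance ~ (1 - sqrt(q))^N

print(gm, agm, tmm)
```

For a condition number of 100, the momentum-type factors already reduce the iteration count by roughly the square root of the condition number, which is the improvement the acceleration literature predicts.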
A framework for formalizing this question has been pioneered by Nemirovsky and Yudin in their seminal book [9]. Under their approach, we assume that the optimization method has access to the objective only via a first-order oracle, O_f, that is, a subroutine which, given a point in R^d, returns the value of the objective and its gradient at that point. In addition, the method is provided with a starting point x_0 ∈ dom(f) that is assumed to be "not too far" from an optimal solution. We call the pair (O_f, x_0) a problem instance, and for a first-order method A we denote the approximate solution generated by the method when applied to this problem instance by A(O_f, x_0). The cost of the method is then measured by the number of calls it makes to the oracle to obtain its output.
In this work, we consider two criteria for measuring the inaccuracy of an approximate solution: the absolute inaccuracy, which quantifies the inaccuracy of an approximate solution ξ for a problem instance (O_f, x_0) by the value of f(ξ) − f*, and the distance to the solution set, which measures the inaccuracy of an approximate solution ξ by inf_{x* ∈ X*(f)} ‖ξ − x*‖, where X*(f) denotes the set of optimal solutions of f. The efficiency estimate of a first-order method A over a given set of problem instances I is then defined as the worst-case value of the chosen inaccuracy measure. For the absolute inaccuracy measure we denote

ε(A; I) := sup_{(O_f, x_0) ∈ I} f(A(O_f, x_0)) − f*,

and for the distance to the solution set, we denote

δ(A; I) := sup_{(O_f, x_0) ∈ I} inf_{x* ∈ X*(f)} ‖A(O_f, x_0) − x*‖.

We can now put the main concepts addressed in this paper in formal terms: denoting by A_N the set of all deterministic first-order methods that perform at most N calls to their first-order oracle, the minimax risk [7] associated with I is defined as the infimal efficiency estimate that a method in A_N can attain over I, as a function of the computational effort N:

R_I(N) := inf_{A ∈ A_N} ε(A; I).

Similarly, for the distance to the solution set inaccuracy measure, we define the minimax error by

X_I(N) := inf_{A ∈ A_N} δ(A; I).

The classical notion of oracle (or information-based) complexity of the set I can now be identified as the inverse of the functions defined above, i.e., the minimal computational effort needed by a first-order method in order to reach a given worst-case accuracy level. For example, under the absolute inaccuracy measure, the oracle complexity of I is given by

N_I(ε) := inf{N ∈ N : R_I(N) ≤ ε}.

In the following, we express our results using the minimax risk and minimax error functions, as they prove to be better suited for expressing the dependence of the results on the dimension of the domain.
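To make the worst-case efficiency estimate ε(A; I) concrete, the following toy sketch approximates it for fixed-step gradient descent over a small family of one-dimensional quadratic instances. The family, the step size, and the sampling grid are illustrative assumptions, not part of the paper's construction:

```python
import numpy as np

L, mu, R, N = 1.0, 0.1, 1.0, 20

def gradient_descent(curv, x0, n_steps, step):
    """Fixed-step gradient descent on f(x) = (curv/2) * x**2."""
    x = x0
    for _ in range(n_steps):
        x -= step * curv * x  # gradient of (curv/2) x^2 is curv * x
    return x

# Empirical efficiency estimate: worst final objective gap f(x_N) - f*
# over the sampled instances (f* = 0 for every member of this family).
gaps = [0.5 * c * gradient_descent(c, R, N, 1.0 / L) ** 2
        for c in np.linspace(mu, L, 1001)]
print(max(gaps))
```

Note that the supremum over this family is attained at the least-curved instance, c = µ: well-conditioned members are solved quickly, and the worst case is what the minimax quantities above are designed to capture.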
In this paper we focus on the class F^{R_x}_{µ,L} of strongly convex functions with Lipschitz-continuous gradient, where the initial point is assumed to be at a bounded distance R_x from an optimal point. Lower bounds on the minimax risk and minimax error for this class of problems were derived by Nemirovski [8, 9] and Nesterov [11, Theorem 2.1.13], and were shown to be optimal in the sense that the number of steps required to reach a given accuracy matches the upper bound up to a constant. Nevertheless, a gap remains between the convergence rates of the upper and lower bounds, which it is the purpose of this paper to close. Note that the bounds above were constructed using quadratic functions, making them also applicable to the subclass of strongly convex quadratic functions, which partially explains this gap.

Contribution
The main contribution of this work is as follows: 1. We present a general technique for establishing lower bounds on the oracle complexity of smooth minimization problems (i.e., problems where the gradient of the objective is Lipschitz-continuous, but the objective is not necessarily convex). The technique is based on a construction that allows us to smoothly extend a set of function values and gradients over the entire domain in such a way that the resulting function possesses properties making it "hard" to optimize by first-order methods, while allowing standard lower-complexity proof schemes to be applied.
2. We derive a lower bound on the value of R_{F^{R_x}_{µ,L}(R^{2N+1})}(N) that matches the upper bound up to a constant, and derive the exact value of X_{F^{R_x}_{µ,L}(R^{2N+1})}(N).
3. We present a shorter and easier-to-follow proof for establishing the exact minimax risk for non-strongly-convex smooth convex minimization problems, derived in [6]. Notably, the proof is based on the standard set of arguments commonly used for establishing lower bounds and does not take advantage of special properties of the construction.

Strongly-convex extensions
In the problem of strongly-convex extension (also referred to as interpolation), given a finite or infinite set of triplets T = {(x_i, g_i, f_i)}_{i∈I}, the goal is to find a µ-strongly convex function f with an L-Lipschitz gradient satisfying f(x_i) = f_i and ∇f(x_i) = g_i for all i ∈ I. Note that unless stated otherwise we allow µ to be negative (where f is µ-strongly convex, as in the µ ≥ 0 case, if f(·) − (µ/2)‖·‖² is convex). Necessary and sufficient conditions for the existence of functions f satisfying the conditions above were presented by Taylor, Hendrickx and Glineur in [15] for the case where I and d are finite, and independently by Azagra and Mudarra [2] for arbitrary sets in Hilbert spaces. In addition, several explicit constructions of extension functions were presented in [5, 6, 15]. In this section, we describe a different construction, which will form a main building block for this paper.
The construction is as follows.
A strongly-convex extension. Given L > 0, µ < L (possibly negative), and a set of triplets T = {(x_i, g_i, f_i)}_{i∈I}, denote by v_T(y, α) the inner objective of the construction, and by V_T(y) : R^d → R the strongly-convex extension

V_T(y) := max_{α ∈ Δ_I} v_T(y, α),

where Δ_I denotes the |I|-dimensional unit simplex. Before establishing the main properties of V_T, let us first state the standard first-order optimality conditions for the optimization problem defining V_T. For a proof see, e.g., [3, Example 2.1.2].

Proposition 1. Let L > 0, µ < L (possibly negative), and let T = {(x_i, g_i, f_i)}_{i∈I} for some finite index set I. Fix any y ∈ R^d and suppose α ∈ Δ_I. Then α is optimal for V_T(y) if and only if condition (2) holds for any k ∈ I such that α_k > 0 and any j ∈ I.

The main properties of V_T(y) now follow. The theorem guarantees that V_T is always µ-strongly convex, has an L-Lipschitz gradient, and under certain conditions forms an extension of the given set of points.
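Whether a finite set of triplets admits such an extension can be checked numerically. The sketch below implements the pairwise inequality form of the (µ, L)-interpolation conditions of Taylor, Hendrickx and Glineur [15]; the exact arrangement of constants is quoted from memory and should be checked against [15]:

```python
import numpy as np

def interpolable(T, mu, L, tol=1e-9):
    """Check the pairwise (mu, L)-interpolation conditions of Taylor,
    Hendrickx and Glineur for a finite list T of triplets (x_i, g_i, f_i)."""
    for (xi, gi, fi) in T:
        for (xj, gj, fj) in T:
            lhs = fi - fj - gj @ (xi - xj)
            rhs = (1.0 / (2.0 * (1.0 - mu / L))) * (
                np.dot(gi - gj, gi - gj) / L
                + mu * np.dot(xi - xj, xi - xj)
                - (2.0 * mu / L) * np.dot(gj - gi, xj - xi))
            if lhs < rhs - tol:
                return False
    return True

# Triplets sampled from f(x) = 0.5 * ||x||^2, which lies strictly inside the
# class with mu = 0.5 and L = 2, are interpolable...
pts = [np.zeros(2), np.array([1.0, 0.0]), np.array([0.0, -2.0])]
T_good = [(x, x.copy(), 0.5 * x @ x) for x in pts]
print(interpolable(T_good, 0.5, 2.0))  # → True

# ...while corrupting one gradient breaks the conditions.
T_bad = list(T_good)
T_bad[1] = (T_bad[1][0], -T_bad[1][1], T_bad[1][2])
print(interpolable(T_bad, 0.5, 2.0))   # → False
```

When the check passes, Theorem 1 guarantees (under condition (3)) that V_T provides one concrete extension of the data.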
Furthermore, for any i ∈ I such that condition (3) holds, V_T extends T at the i-th triplet, i.e., V_T(x_i) = f_i and ∇V_T(x_i) = g_i.

Proof. The result can be derived from [6, Theorem 1] by observing that the function (L−µ)/2 ‖y‖² − V_T(y) has the same form as the one considered by that theorem. For the sake of completeness, we now give a direct proof.
To establish the smoothness and strong convexity properties, we show that both V_T(y) − (µ/2)‖y‖² and (L/2)‖y‖² − V_T(y) are convex. Indeed, V_T(y) − (µ/2)‖y‖² is a maximum of linear functions and is therefore convex, and (L/2)‖y‖² − V_T(y) is convex by the convexity of (L/2)‖y‖² − v_T(y, α) and a well-known property of the infimum operator (see, e.g., [13, Proposition 2.22]). Since V_T is bounded between two tangent quadratic/linear functions at every point of its domain, its differentiability follows.
Next, noting that α = e_i is feasible for the maximization problem defining V_T, we get V_T(y) ≥ v_T(y, e_i) for all y ∈ R^d. Finally, suppose (3) holds; then, in order to establish that V_T(x_i) = f_i and ∇V_T(x_i) = g_i, we first show that α = e_i is optimal for V_T(x_i). Indeed, it is straightforward to verify that when α = e_i and y = x_i, conditions (2) and (3) are equivalent, thus V_T(x_i) = v_T(x_i, e_i) = f_i, and in addition, ∇V_T(x_i) = g_i follows from the properties of the max operator. Note that condition (3) necessarily holds for all i ∈ I for every µ-strongly convex function f that extends T over the entire linear space (see, e.g., [15, Theorem 4]). As a result, if a finite set T has a strongly convex extension, then V_T provides one such possible extension.

Zero-chain functions and lower bounds
In this section, we briefly review a standard technique for establishing lower complexity bounds. The technique was first introduced by Nemirovsky [8,9] and has been extensively used to derive lower complexity results, see [1,4,6,11,12] to name a few. Here we follow the convenient presentation of the technique due to Carmon et al. [4]. The presentation, for the special case of deterministic first-order minimization, is based on the following definition.
Note that zero-chain functions are typically defined with respect to the canonical unit vectors; however, in the context of this paper it is useful to consider an arbitrary set of vectors, which will be chosen according to the iterates and gradient values produced by the algorithm.
Zero-chain functions are well-suited for establishing lower complexity bounds. The standard argument proceeds as follows. As a first step, we consider only the special class of first-order methods which make their first query to the oracle at zero and choose the following query points only from the subspace spanned by the previous gradients seen by the method (this class of algorithms is referred to in [4] as zero-respecting algorithms). This assumption plays well with the zero-chain property, as together they constrain the location of each iterate to a known subspace. By showing that all points in the final subspace are 'bad' under the chosen inaccuracy measure, a bound on the performance of any algorithm in the class can be established.
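The mechanics can be seen on the classical chain-like quadratic of the lower-bound literature (used here purely as an illustration; it is not the V_T construction of this paper): starting from zero, a zero-respecting method can extend the set of reachable coordinates by at most one per oracle call.

```python
import numpy as np

d = 10
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)  # chain-like tridiagonal
e1 = np.eye(d)[0]
grad = lambda x: A @ x - e1  # gradient of f(x) = 0.5 x^T A x - x_1

x = np.zeros(d)
supports = []
for _ in range(5):
    x = x - 0.25 * grad(x)  # any zero-respecting update exhibits the same behavior
    supports.append(int(np.max(np.nonzero(x)[0])) + 1 if np.any(x) else 0)
print(supports)  # → [1, 2, 3, 4, 5]
```

Each oracle call reveals exactly one new coordinate direction, so after N calls every reachable point lies in an N-dimensional subspace, on which the function can be made uniformly 'bad'.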
The next step of the argument is to extend the bound to arbitrary deterministic first-order methods. This is done via the "resisting oracle" technique pioneered by Nemirovsky and Yudin in their seminal book [10]. This procedure requires that the problem class satisfies the following properties (see [4,Proposition 2]): 1. The problem class and inaccuracy measure must be invariant under orthogonal transformations.
2. The domain of the function class must be embeddable in a domain whose dimension is arbitrarily larger.
Suppose the conditions above hold, and assume a given algorithm queries a point x_k (k ≥ 0) outside the span of the previous gradients; denote the component of the query point orthogonal to the span by v. Then, by applying an orthogonal transformation to the objective, it is possible to find a transformed objective that has the same first-order information at the points already queried by the algorithm while being non-informative in the direction v (e.g., having a simple separable quadratic behavior along that direction). Since the algorithm cannot distinguish between the two functions, it must choose the same query point x_k when given the transformed function; thereby, in the worst case, it gains no additional information by querying in the direction v.
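The orthogonal transformation invoked here can be realized, for instance, by a Householder reflection that fixes the explored span pointwise while reversing the unexplored direction v. This is a generic linear-algebra sketch, not the specific transformation used in [4]:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
S = rng.standard_normal((d, 3))       # columns: "gradients seen so far"
Q, _ = np.linalg.qr(S)                # orthonormal basis of their span

v = rng.standard_normal(d)
v -= Q @ (Q.T @ v)                    # project out the explored span
v /= np.linalg.norm(v)                # unit direction orthogonal to the span

H = np.eye(d) - 2.0 * np.outer(v, v)  # Householder reflection along v

print(np.allclose(H @ H.T, np.eye(d)),  # orthogonal
      np.allclose(H @ S, S),            # fixes every explored gradient
      np.allclose(H @ v, -v))           # flips the unexplored direction
# → True True True
```

Composing the hard function with such a reflection leaves all first-order information at previously queried points unchanged, which is exactly the property the resisting-oracle argument needs.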
Since the problem settings which are the focus of this paper satisfy the conditions above, we have the following result.
For further details and proofs, we refer the reader to [4, Section 3].

A family of zero-chain functions
In this section, we present a set of easily-verifiable conditions under which V T , defined in (1), is a zero-chain. Note that the results of this section apply to both strongly-convex and weakly-convex functions.
We start with the following technical lemma.
Lemma 1. Suppose (4) holds for some j, k ∈ I, and suppose there exists some y such that (5) holds; then for any optimal solution α* of V_T(y) we have α*_k = 0.

Proof. Suppose α*_k > 0. Since Σ_i α*_i = 1, the first-order optimality condition (2) can be rewritten in a form which yields a contradiction in view of (4), the nonnegativity of α*_i, and the assumption on y.

The main result of this section now follows: a set of conditions that guarantees that V_T is a zero-chain function. The construction is based on a given set of n + m triplets, (x_i, g_i, f_i), where the first n triplets will be used to form the 'zero-chain space', and the latter m triplets are available for enforcing additional properties on V_T (as will be done in the next section).
Theorem 3. Let I = {0, . . . , n + m − 1}, and suppose the following conditions hold, where α* is optimal for V_T(y). Then V_T is a zero-chain.

Proof. In order to establish that V_T is a zero-chain, it is sufficient to show that α*_k = 0 for all k ∈ K_j. Indeed, from (8) it follows that there exists a vector v such that, setting y′ := y + εv for some arbitrary ε > 0, the hypothesis of Lemma 1 is satisfied for all k ∈ K_j, where the second equality follows from (6) and from the assumption on y. As a result, by Lemma 1, any optimal solution for V_T(y′) satisfies α*_k = 0. Finally, taking ε → 0, it follows from the continuity of ∇V_T (Theorem 1) that the same conclusion holds at y, establishing the zero-chain property.
Note that conditions (3) were not included in the requirements of the theorem above. As a result, although the theorem guarantees that V T is a zero-chain, the function is not guaranteed to be an extension of T . Of course, if conditions (3) do hold for T , then V T will also form an extension.

Lower complexity bounds on strongly convex minimization
Theorem 3 provides a convenient building block for establishing zero-chain based lower complexity bounds. To complete the standard argument, it remains to find a way of bounding the chosen inaccuracy measure over the span of the zero-chain base (cf. Theorem 2). For this purpose, in addition to the N points x_0, . . . , x_{N−1} that form the 'zero-chain space' part of the function, two additional points x_N and x* are added, together with constraints ensuring that the chosen inaccuracy measure attains its minimum over the span of the zero-chain base vectors at the pair of points x_N, x*. This construction is summarized by the following corollary. Note that hereafter we assume convexity: µ ≥ 0.

1. If in addition conditions (15) and (16) hold,
it follows that the stated bound on the minimax risk holds.

2. If in addition µ > 0 and conditions (17) and (18) hold, it follows that the stated bound on the minimax error holds.

Proof. Since (11) holds, Theorem 1 implies that ∇V_T(x*) = g* = 0 and V_T(x*) = f*. We conclude that x* ∈ X*(V_T) and V_T^* = f*. Assumptions (12)–(14) imply by Theorem 3 that V_T is a zero-chain on {g_i − µx_i}_{0≤i≤N−1}; therefore, Theorem 2 applies. To establish the first claim, note that by Theorem 1, it then follows from (15) and (16) that the desired bound on R_{F^{R_x}_{µ,L}(R^{d+N})}(N) holds.
The second bound follows similarly, by noting that for strongly convex functions (µ > 0) the minimizer x* is unique, and therefore conditions (17) and (18) imply the desired bound.

Remark 1. An identical argument could be used to derive bounds when the inaccuracy of an approximate solution is measured by its gradient norm. However, it is not clear how to add constraints ensuring that the gradient norm of V_T becomes bounded over the relevant subspace. We therefore leave the analysis of this case for future work.
We are now ready to present the main result of this section: a simple criterion for establishing lower complexity bounds on convex and strongly-convex minimization problems.

Theorem 4.
Let R_x ∈ R_+ and 0 ≤ µ < L. Suppose the sequences {γ_i}_{0≤i≤N} and {δ_i}_{0≤i≤N} satisfy conditions (19) and (20); then the stated lower bound on the minimax risk holds. Furthermore, if µ > 0, the corresponding lower bound on the minimax error holds. The proof proceeds by exhibiting a set T which satisfies the conditions of Corollary 1. As the proof is rather technical, we postpone it to Appendix A.
Based on the theorem, we now derive several lower complexity bounds. As a first example, the next result establishes a simple lower bound for the non-strongly convex case, µ = 0.
Proof. We show that the choice of sequences below satisfies the conditions of Theorem 4; a direct computation then yields the stated bound, which completes the proof.
The exact lower bound for the non-strongly convex case, derived in [6], can also be attained via Theorem 4, although with some additional effort; see an outline of the proof in Appendix B. Note, however, that the lower bound derived in [6] is slightly stronger, as it makes weaker requirements on the dimension d of the domain of the problem, i.e., it establishes the results for problems over R^d with d ≥ N + 1 instead of d ≥ 2N + 1 as guaranteed by Theorem 4 (the proof in [6] takes advantage of properties of the constructed function that do not hold for general zero-chain functions).
Turning to the strongly-convex case (µ > 0), a simple bound can be obtained via Theorem 4 as follows.
Corollary 3. Let L, R_x > 0, 0 < µ < L and N ∈ N; then the stated bound holds.

Proof. Consider the sequences defined below; then (19) holds since µ/L ≤ √(µ/L) ≤ 1, and (20) holds with equality, hence the conditions of Theorem 4 are satisfied. Finally, a direct computation completes the proof.
As in the non-strongly convex case, the previous bounds can be further improved at the cost of greater proof complexity. In the following result, we give such an improvement for the distance to the solution set inaccuracy measure. The bound exactly matches the worst-case performance attained by the Information-Theoretic Exact Method [14]; thus, as the upper and lower bounds on the complexity of the class are equal, it follows that both bounds are exact and cannot be further improved.

Corollary 4.
Let R_x > 0, 0 < µ < L and N ∈ N, and define the sequence {λ_i} by the recursion given in (22). The proof establishing the corollary is detailed in Appendix C.

Conclusion
In this work, we presented a construction of a family of zero-chain functions suitable for establishing lower complexity bounds for smooth problems, and showed how the construction can be used to derive lower complexity bounds on the class of strongly-convex minimization problems. Based on this result, we obtained a bound on the minimax risk for any 0 ≤ µ < L, with θ_N defined as in (23), and a bound on the minimax error for 0 < µ < L, where λ_N is defined in (22). These bounds were then shown to exhibit the optimal rate of convergence for this class, and in the case of the distance to the solution set inaccuracy measure the bound is exact. Note that the results only apply to the absolute inaccuracy and distance to the solution set measures (and possibly other measures solely based on these two). An open question that remains is the possibility of using the same construction to obtain lower bounds when the inaccuracy measure is chosen to be the norm of the gradient. This question naturally arises when attempting to generalize the above results to the class of weakly-convex functions. We leave the analysis of this case for future research.
Another open question that remains is the exact oracle complexity of strongly-convex minimization under the absolute inaccuracy measure. Preliminary numerical experiments suggest that the upper bound derived in [14] does not match the best bound attainable by Theorem 4, leading us to conjecture that the best bound attainable by Theorem 4 under the absolute inaccuracy performance measure is not tight for µ > 0, and that a more refined approach is required to attain the exact value of the minimax risk function in this case.

Appendix A Proof of Theorem 4
The proof proceeds by showing that the requirements of Corollary 1 are satisfied by the set {(x̂_i, ĝ_i, f̂_i)} defined by x̂_j := Σ_t δ_t e_t for j = 0, . . . , N, ĝ_j := Lγ_j e_j for j = 0, . . . , N, x̂_* := Σ_t δ_t e_t, and ĝ_* := 0. To establish the requirements of Corollary 1, first note that (9), (12), and (15)–(18) are immediate from the construction, and (10) follows directly from (21). Next, to establish (11), we substitute the choice of x̂_i, ĝ_i above, which gives the required identity by the definition of the sequences. Linear separability of ĝ_j − µx̂_j and {ĝ_k − µx̂_k}_{k∈K*_j}, i.e., condition (14), can be established by considering the following two cases. First, if Lγ_j = µδ_j, taking v = e_j gives ⟨ĝ_j − µx̂_j, v⟩ = Lγ_j, while for k > j and k = * we have ⟨ĝ_k − µx̂_k, v⟩ = µδ_j, establishing separability. Otherwise, when Lγ_j ≠ µδ_j, the separating quantity must be positive, since γ_k and {µδ_i}_{j<i<k} (or {µδ_i}_{j<i≤N} when k = *) are nonnegative and not all zero, as the definition of K*_j implies that ĝ_k − µx̂_k ≠ ĝ_j − µx̂_j. We complete the proof of Theorem 4 by establishing that the choice for {(x̂_i, ĝ_i, f̂_i)} satisfies (13). Note that from the symmetry in (13), it is sufficient to verify it for j = 0, . . . , N − 1, k = j + 1 and for j = N, k = *, for all i ∈ I*_N.
A.1 Case 1: k = j + 1

We start by considering the case 0 ≤ j ≤ N − 1, k = j + 1. By the choice of x̂_i, ĝ_i it follows that ⟨ĝ_i, x̂_i⟩ = 0 for all i ∈ I*_N, thus (13) can be simplified, where for ease of notation we set γ_* := 0. Then, from (20) and the assumption µ < L, we get that in order to establish the expression above it is sufficient to verify that the following inequality holds. We consider four cases. In the first case, the left-hand side of the inequality is zero, and the inequality follows directly from the non-negativity constraints (19).
2. i = j. For this case, the inequality above holds trivially with equality.
3. i = j + 1. Here, the required inequality reduces to one that follows from the nonnegativity assumptions (19).
A.2 Case 2: j = N, k = *

For this case, γ_* = 0 and therefore (13) can be simplified. Substituting x̂_i, ĝ_i and f̂_i with the values chosen above, we obtain the inequality we need to establish. We consider three cases. In the first case, the right-hand side of the expression is zero and the inequality follows from (19).
2. i = N. In this case, the required inequality follows from (19).
To complete the proof, it is sufficient to establish two auxiliary inequalities, which are proven as part of the proof of [6, Lemma 3].

C Proof of Corollary 4
For the sake of conciseness, let us denote by q := µ/L the inverse condition number, which we assume to be in the open interval (0, 1). We begin the proof by showing that the sequence λ_i is well-defined. This follows by a simple inductive argument: suppose 0 ≤ λ_i ≤ √q; then q² ≤ q − (1 − q)λ_i² ≤ q. In particular, q − (1 − q)λ_i² ≥ 0 for all i, and thus the sequence λ_i is well-defined.

The main part of the proof also proceeds by induction. Assume without loss of generality that R_x = 1, and suppose the inductive hypothesis holds. We proceed to show that the sequences {γ̂_i}_{0≤i≤N+1}, {δ̂_i}_{0≤i≤N+1} defined below satisfy the required conditions. We show that the expressions are well-defined in §C.1 and establish the required relations in §C.2–§C.4. For the base of the induction, we take γ̂_0 = q^{−1}, δ̂_0 = 1. Finally, the claim of Corollary 4 follows directly from Theorem 4.

C.1 α_{N+1}, γ̂_N and δ̂_N are well-defined

The well-definedness of α_{N+1} follows immediately from 0 ≤ λ²_N, λ²_{N+1} ≤ q. Next, we show (25), and along the way establish convenient expressions for γ̂_N and δ̂_N. First, we establish an auxiliary identity; in view of this identity and the definitions of γ̂²_N and δ̂²_N, we immediately get (26).

C.2 0 ≤ qδ̂_i ≤ γ̂_i ≤ δ̂_i, i = 0, . . . , N + 1

For i = 0, . . . , N − 1 the inequalities are immediate from the assumptions on γ_i, δ_i. For i = N, 0 ≤ qδ̂_N ≤ γ̂_N is immediate, and γ̂_N ≤ δ̂_N follows from (25) and (26), where the leftmost inequality follows from the upper bound on λ_{i+1} in (24). For i = N + 1 the bound follows directly from the definition.
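The inequality q² ≤ q − (1 − q)λ_i² ≤ q used in the inductive step above can be sanity-checked numerically over randomly sampled admissible pairs; this is a direct check of the text's claim, with q and λ_i drawn at random:

```python
import math
import random

random.seed(0)
for _ in range(100000):
    q = random.uniform(1e-9, 1.0 - 1e-9)     # inverse condition number mu/L in (0, 1)
    lam = random.uniform(0.0, math.sqrt(q))  # inductive hypothesis: 0 <= lam_i <= sqrt(q)
    val = q - (1.0 - q) * lam ** 2
    # the two bounds used to show the sequence stays well-defined
    assert q ** 2 - 1e-12 <= val <= q + 1e-12
print("ok")
```

The lower bound follows since (1 − q)λ² ≤ (1 − q)q = q − q², and the upper bound since (1 − q)λ² ≥ 0, which is exactly the argument of the induction.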