Abstract
We consider the problem of minimizing the sum of a linear function and a composition of a strongly convex function with a linear transformation over a compact polyhedral set. Jaggi and Lacoste-Julien (An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends, 2014) show that the conditional gradient method with away steps, employed on the aforementioned problem without the additional linear term, has a linear rate of convergence, depending on the so-called pyramidal width of the feasible set. We revisit this result and provide a variant of the algorithm and an analysis based on simple linear programming duality arguments, as well as corresponding error bounds. This new analysis (a) enables the incorporation of the additional linear term, and (b) depends on a new constant that is explicitly expressed in terms of the problem's parameters and the geometry of the feasible set. This constant replaces the pyramidal width, which is difficult to evaluate.
References
Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59(2), 235–247 (2004)
Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In: Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 139–162. Cambridge University Press, Cambridge (2009)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, MA, USA (1999)
Bertsimas, D., Tsitsiklis, J.N.: Introduction to Linear Optimization, vol. 6. Athena Scientific, Belmont (1997)
Canon, M.D., Cullum, C.D.: A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control 6(4), 509–516 (1968)
Dunn, J., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)
Epelman, M., Freund, R.M.: Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program. 88(3), 451–485 (2000)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Quart. 3(1–2), 95–110 (1956)
Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank-Wolfe method with “in-face” directions, and its application to low-rank matrix completion. arXiv preprint arXiv:1511.02204 (2015)
Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666 (2013)
Goldfarb, D., Todd, M.J.: Chapter II: Linear programming. In: Nemhauser, G., Kan, A.R., Todd, M. (eds.) Optimization, volume 1 of Handbooks in Operations Research and Management Science, pp. 73–170. Elsevier, Amsterdam (1989)
Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. 35(1), 110–119 (1986)
Güler, O.: Foundations of Optimization. Graduate Texts in Mathematics, vol. 258. Springer, New York (2010)
Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)
Jaggi, M.: Sparse Convex Optimization Methods for Machine Learning. Ph.D. thesis, ETH Zurich (2011)
Lacoste-Julien, S., Jaggi, M.: An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In: NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends (2014)
Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank-Wolfe optimization variants. In: Advances in Neural Information Processing Systems, pp. 496–504 (2015)
Levitin, E., Polyak, B.T.: Constrained minimization methods. USSR Comput. Math. Math. Phys. 6(5), 787–823 (1966)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)
Pena, J., Rodriguez, D.: Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint arXiv:1512.06142 (2015)
Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46–47(1), 157–178 (1993)
Rockafellar, R.T.: Convex Analysis, 2nd edn. Princeton University Press, Princeton (1970)
Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
Wolfe, P.: Chapter 1: Convergence theory in nonlinear programming. In: Abadie, J. (ed.) Integer and Nonlinear Programming, pp. 1–36. North-Holland Publishing Company, Amsterdam (1970)
Acknowledgments
The research of Amir Beck was partially supported by the Israel Science Foundation grant 1821/16.
Appendix: Incremental representation reduction using the Carathéodory theorem
In this section we show how to implement the constructive proof of the Carathéodory theorem efficiently and incrementally, as part of the VRU scheme, at each iteration of the ASCG algorithm. We note that this reduction procedure does not have to be employed; instead, one can use the trivial procedure, which leaves the representation unchanged. In that case, the upper bound on the number of extreme points in the representation is simply the number of extreme points of the feasible set X.
The implementation described in this section allows maintaining a vertex representation set \(U^k\), with cardinality at most \(n+1\), at a computational cost of \(O(n^2)\) operations per iteration. For this purpose, we assume that at the beginning of iteration k, \(\mathbf{x}^{k}\) has a representation with vertex set \(U^{k}=\left\{ {\mathbf{v}^1,\ldots ,\mathbf{v}^{L}}\right\} \subseteq V\) such that the vectors in the set are affinely independent. Moreover, we assume that at the beginning of iteration k, we have at our disposal two matrices \(\mathbf{T}^k\in \mathbb {R}^{n\times n}\) and \({\mathbf{W}}^k\in \mathbb {R}^{n\times (L-1)}\). We define \(\mathbf{V}^k\in \mathbb {R}^{n\times (L-1)}\) to be the matrix whose ith column is the vector \(\mathbf{w}^i=\mathbf{v}^{i+1}-\mathbf{v}^1\) for \(i=1, \ldots , L-1\), where \(\mathbf{v}^1\) is called the reference vertex. The matrix \(\mathbf{T}^k\) is a product of elementary matrices which ensures that the matrix \({\mathbf{W}}^k=\mathbf{T}^k\mathbf{V}^k\) is in row echelon form. The implementation does not require storing the matrix \(\mathbf{V}^k\), and so at each iteration only the matrices \(\mathbf{T}^k\) and \(\mathbf{W}^k\) are updated.
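The factorization above can be illustrated with a minimal NumPy sketch. The function names `echelonize` and `affine_rank` are ours, the incremental update across iterations is omitted, and a simple tolerance-based pivot rule stands in for whatever pivoting a careful implementation would use:

```python
import numpy as np

def echelonize(V, tol=1e-12):
    """Reduce V to row echelon form W = T @ V, returning (T, W).

    T is built as a product of elementary (row-switch and row-addition)
    operations applied to the identity, so W = T @ V holds exactly.
    """
    n, m = V.shape
    T = np.eye(n)
    W = V.astype(float)
    row = 0
    for col in range(m):
        # find a pivot at or below `row` in the current column
        pivots = np.flatnonzero(np.abs(W[row:, col]) > tol)
        if pivots.size == 0:
            continue
        p = row + pivots[0]
        if p != row:                              # row switching
            W[[row, p]] = W[[p, row]]
            T[[row, p]] = T[[p, row]]
        factors = W[row + 1:, col] / W[row, col]
        W[row + 1:] -= np.outer(factors, W[row])  # row additions
        T[row + 1:] -= np.outer(factors, T[row])
        row += 1
        if row == n:
            break
    return T, W

def affine_rank(W, tol=1e-12):
    """Row rank of an echelon-form matrix: count its nonzero rows."""
    return int(np.sum(np.any(np.abs(W) > tol, axis=1)))
```

The second function is exactly the rank computation used in step 6(d): once the matrix is in row echelon form, the rank is read off by counting nonzero rows.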
Let \(U^{k+1}\) be the vertex set and \({\varvec{\mu }}^{k+1}\) the coefficient vector at the end of iteration k, before applying the rank reduction procedure. Updating the matrices \(\mathbf{W}^{k+1}\) and \(\mathbf{T}^{k+1}\), as well as \(U^{k+1}\) and \({\varvec{\mu }}^{k+1}\), is done according to the following Incremental Representation Reduction (IRR) scheme, which is partially based on the proof of the Carathéodory theorem presented in [22, Sect. 17].
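The constructive Carathéodory argument underlying the scheme can be sketched as follows. This is a simplified, non-incremental version: the function name is ours, and the affine dependence is found with an SVD rather than by reusing the row echelon factorization as the scheme itself does:

```python
import numpy as np

def caratheodory_step(U, mu, tol=1e-12):
    """One step of the constructive Caratheodory proof.

    Given x = U @ mu with mu > 0 and sum(mu) = 1, if the columns of U
    are affinely dependent, remove at least one vertex while still
    representing the same point x as a convex combination.
    """
    n, L = U.shape
    # An affine dependence is a nonzero lam with U @ lam = 0, sum(lam) = 0.
    A = np.vstack([U, np.ones(L)])
    _, _, Vt = np.linalg.svd(A)
    lam = Vt[-1]                          # right singular vector for the
    if np.linalg.norm(A @ lam) > tol:     # smallest singular value
        return U, mu                      # affinely independent: done
    if lam.max() <= tol:
        lam = -lam                        # ensure some positive entry
    pos = lam > tol
    # Largest alpha keeping mu - alpha*lam >= 0; it zeroes some entry.
    alpha = np.min(mu[pos] / lam[pos])
    mu_new = mu - alpha * lam             # still sums to 1, since sum(lam)=0
    keep = mu_new > tol
    return U[:, keep], mu_new[keep]
```

Repeating this step until the remaining vertices are affinely independent yields a representation with at most \(n+1\) vertices, which is the bound maintained by the scheme.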
Notice that in order to compute the row rank of the matrix \(\mathbf{W}^{k+1}\) in step 6(d), we may simply convert the matrix to row echelon form and then count the number of nonzero rows. This is done similarly to step 7, and requires ranking at most one column. We need to rerank the matrix in step 7 only if \(L>M\), in which case at least one column is removed in step 6(e)vi.
The IRR scheme may reduce the size of the input \(U^{k+1}\) only in the case of a forward step, since otherwise the vertices in \(U^{k+1}\) are all affinely independent. Nonetheless, the IRR scheme must be applied at each iteration in order to maintain the matrices \({\mathbf{W}}^k\) and \(\mathbf{T}^k\).
The efficiency of the scheme relies on the fact that only a small number of vertices are either added to or removed from the representation. The potentially expensive steps are: step 5(b), replacing the reference vertex; step 6(d), finding the row rank of \(\mathbf{W}^{k+1}\); step 6(e)i, solving the system of linear equations; step 6(e)vi, removing the columns corresponding to the vertices eliminated from the representation; and step 7, the ranking of the resulting matrix \({\mathbf{W}}^{k+1}\). Step 5(b) can be implemented without explicitly using matrix multiplication and therefore has a computational cost of \(O(n^2)\). Since \({\mathbf{W}}^k\) was in row echelon form, step 6(d) requires a row elimination procedure, similar to step 7, to be conducted only on the last column of \({\mathbf{W}}^{k+1}\), which involves at most O(n) operations and an additional \(O(n^2)\) operations for updating \(\mathbf{T}^{k+1}\). Moreover, since \({\mathbf{W}}^k\) had full column rank, the IRR scheme guarantees that the system in step 6(e)i has a unique solution \({\varvec{\lambda }}\), and since \({\mathbf{W}}^{k+1}\) is in row echelon form, it can be found in \(O(n^2)\) operations. In addition, the specific choice of \(\alpha \) in step 6(e)i ensures that the reference vertex \(\mathbf{v}^1\) is not eliminated from the representation, and so there is no need to change the reference vertex at this stage. Furthermore, it is reasonable to assume that the set I satisfies \(|I|=O(1)\), since otherwise the vector \(\mathbf{x}^{k+1}\), produced by a forward step, can be represented by significantly fewer vertices than \(\mathbf{x}^k\), which, although possible, is numerically unlikely. Therefore, assuming that indeed \(|I|=O(1)\), the matrix \(\tilde{\mathbf{T}}\), calculated in step 7, applies a row elimination procedure to at most O(1) rows (one for each column removed from \(\mathbf{W}^{k+1}\)) or to one column (if a column was added to \(\mathbf{W}^{k+1}\)). Conducting such an elimination on either a row or a column takes at most \(O(n^2)\) operations, which may include row switching and at most n row additions and multiplications. Therefore, the total computational cost of the IRR scheme amounts to \(O(n^2)\).
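To illustrate the per-iteration cost, the following sketch appends a single difference column and restores row echelon form while updating the transformation matrix. The function name and the pivoting rule are our choices, and the column-removal branch (step 6(e)vi) is omitted; the only \(O(n^2)\) work is the product T @ w and the update of at most n rows of T:

```python
import numpy as np

def append_column(T, W, w, tol=1e-12):
    """Append difference column w to W = T @ V and restore echelon form.

    Since the existing columns of W are zero below the current rank r,
    the elimination touches only the new column of W (and the rows of T).
    """
    c = T @ w                                             # O(n^2)
    r = int(np.sum(np.any(np.abs(W) > tol, axis=1)))      # current row rank
    W = np.hstack([W, c[:, None]])
    below = np.abs(c[r:])
    if below.size == 0 or below.max() <= tol:
        return T, W              # dependent column: rank unchanged
    p = r + int(np.argmax(below))
    if p != r:                   # row switching to bring the pivot up
        W[[r, p]] = W[[p, r]]
        T[[r, p]] = T[[p, r]]
    f = W[r + 1:, -1] / W[r, -1]
    W[r + 1:, -1] = 0.0          # eliminate below the pivot; older columns
    T[r + 1:] -= np.outer(f, T[r])   # are already zero there.  O(n^2)
    return T, W
```

Starting from \(T = I\) and an empty W and feeding the difference vectors one at a time reproduces the echelon factorization column by column, at \(O(n^2)\) per column, matching the bound stated above.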
Beck, A., Shtern, S. Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164, 1–27 (2017). https://doi.org/10.1007/s10107-016-1069-4