
Optimized first-order methods for smooth convex minimization

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

We introduce new optimized first-order methods for smooth unconstrained convex minimization. Drori and Teboulle (Math Program 145(1–2):451–482, 2014. doi:10.1007/s10107-013-0653-0) recently described a numerical method for computing the N-iteration optimal step coefficients in a class of first-order algorithms that includes gradient methods, heavy-ball methods (Polyak in USSR Comput Math Math Phys 4(5):1–17, 1964. doi:10.1016/0041-5553(64)90137-5), and Nesterov’s fast gradient methods (Nesterov in Sov Math Dokl 27(2):372–376, 1983; Math Program 103(1):127–152, 2005. doi:10.1007/s10107-004-0552-5). However, the numerical method in Drori and Teboulle (2014) is computationally expensive for large N, and the corresponding numerically optimized first-order algorithm in Drori and Teboulle (2014) requires impractical memory and computation for large-scale optimization problems. In this paper, we propose optimized first-order algorithms that achieve a convergence bound that is two times smaller than for Nesterov’s fast gradient methods; our bound is found analytically and refines the numerical bound in Drori and Teboulle (2014). Furthermore, the proposed optimized first-order methods have efficient forms that are remarkably similar to Nesterov’s fast gradient methods.
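To make the last claim concrete, the sketch below implements a fast-gradient-style loop whose update adds one extra momentum-like correction term to Nesterov's step. The coefficient choices \((\theta_i-1)/\theta_{i+1}\) and \(\theta_i/\theta_{i+1}\), and the modified final-step rule for \(\theta_N\), are written from the efficient OGM1 form developed in Section 7, which is not reproduced on this page; treat them as an illustrative assumption rather than a quotation.

```python
import numpy as np

def ogm1_sketch(grad, L, x0, N):
    """Sketch of an OGM1-style method (coefficients assumed, see lead-in).

    Each iteration takes a gradient step and then a momentum step that adds
    one correction term, (theta_i / theta_{i+1}) * (y_{i+1} - x_i), beyond
    Nesterov's fast gradient update.
    """
    # theta recursion: theta_0 = 1; the final step uses the "+8" variant.
    theta = np.ones(N + 1)
    for i in range(1, N + 1):
        c = 8.0 if i == N else 4.0
        theta[i] = (1.0 + np.sqrt(1.0 + c * theta[i - 1] ** 2)) / 2.0

    x, y = np.array(x0, dtype=float), np.array(x0, dtype=float)
    for i in range(N):
        y_new = x - grad(x) / L                                  # gradient step
        x = (y_new
             + (theta[i] - 1.0) / theta[i + 1] * (y_new - y)     # Nesterov momentum
             + theta[i] / theta[i + 1] * (y_new - x))            # extra OGM-style term
        y = y_new
    return x  # final iterate x_N

# Toy usage: f(x) = 0.5 x^T A x, so grad(x) = A x and L = ||A||_2.
A = np.diag([1.0, 10.0])
x_N = ogm1_sketch(lambda x: A @ x, L=10.0, x0=[5.0, 5.0], N=50)
```

Dropping the last correction term recovers the usual fast gradient update, which is what makes the efficient forms so similar to Nesterov's methods.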


Notes

  1. The problem \(\mathcal {B}_{\mathrm {P}}(\varvec{h},N,d,L,R)\) was shown to be independent of d in [17]; thus this paper’s results are independent of d.

  2. Substituting \(\varvec{x}' = \frac{1}{R}\varvec{x}\) and \(\breve{f} (\varvec{x}') = \frac{1}{LR^2}f(R\varvec{x}')\in \mathcal {F}_1(\mathbb {R} ^d)\) in problem (P), we get \(\mathcal {B}_{\mathrm {P}}(\varvec{h},N,L,R) = LR^2\mathcal {B}_{\mathrm {P}}(\varvec{h},N,1,1)\). This leads to \(\varvec{\hat{h}}_{\mathrm {P}} = \hbox {arg min}_{\varvec{h}} \mathcal {B}_{\mathrm {P}}(\varvec{h},N,L,R) = \hbox {arg min}_{\varvec{h}} \mathcal {B}_{\mathrm {P}}(\varvec{h},N,1,1)\).

  3. Using the term 'best' or 'optimal' here for DT [5] may be too strong, since DT [5] relaxed (HP) to a solvable form. Because we use the same relaxations, we use the term 'optimized' for our proposed algorithms.

  4. If coefficients \(\varvec{h}\) in Algorithm FO have a special recursive form, it is possible to find an equivalent efficient form, as discussed in Sects. 3 and 7.

  5. The equivalence of two of Nesterov’s fast gradient methods for smooth unconstrained convex minimization was previously mentioned without details in [18].

  6. The fast gradient method in [12] was originally developed to generalize FGM1 to the constrained case. Here, this second form is introduced for use in later proofs.

  7. The second inequality of (3.5) is the more widely known, since it provides a simpler interpretation of the convergence bound than the first inequality of (3.5).

  8. The vector \(\varvec{e}_{N,i}^{ }\) is the \(i\hbox {th}\) standard basis vector in \(\mathbb {R} ^{N}\), having 1 for the \(i\hbox {th}\) entry and zero for all other elements.

  9. Equation (5.2) in [5, Theorem 3], which is derived from (6.1), has typos that we fix in (6.3).

References

  1. Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent (2015). arXiv:1407.1537

  2. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). doi:10.1137/080716542

  3. CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (2012). http://cvxr.com/cvx

  4. Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. thesis, Tel-Aviv University, Israel (2014)

  5. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014). doi:10.1007/s10107-013-0653-0

  6. Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008). http://stanford.edu/~boyd/graph_dcp.html

  7. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization (2014). arXiv:1406.5468v1

  8. Kim, D., Fessler, J.A.: Optimized momentum steps for accelerating X-ray CT ordered subsets image reconstruction. In: Proceedings of 3rd International Meeting on Image Formation in X-ray CT, pp. 103–106 (2014)

  9. Kim, D., Fessler, J.A.: An optimized first-order method for image restoration. In: Proceedings of IEEE International Conference on Image Processing (2015). (to appear)

  10. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)

  11. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Dordrecht (2004)

  12. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). doi:10.1007/s10107-004-0552-5

  13. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013). doi:10.1007/s10107-012-0629-5

  14. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015). doi:10.1007/s10208-013-9150-3

  15. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). doi:10.1016/0041-5553(64)90137-5

  16. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights (2015). arXiv:1503.01243

  17. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods (2015). arXiv:1502.05666

  18. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010). doi:10.1007/s10107-010-0394-2

Author information

Correspondence to Donghwan Kim.

Additional information

This research was supported in part by NIH Grants R01-HL-098686 and U01-EB-018753.

Appendix

1.1 Proof of Lemma 2

We prove that the choice in (6.9), (6.10), (6.11) and (6.12) satisfies the feasibility conditions (6.14) of (RD1).

Using the definition of \(\varvec{\breve{Q}}(\varvec{r},\varvec{\lambda },\varvec{{\tau }})\) in (6.7), and considering the first two conditions of (6.14), we get

$$\begin{aligned} \lambda _{i+1}&= \breve{Q}_{i,i}(\varvec{r},\varvec{\lambda },\varvec{{\tau }}) = 2\breve{q}_i^2(\varvec{r},\varvec{\lambda },\varvec{{\tau }}) = \frac{1}{2\tau _N^2}\tau _i^2 \\&= {\left\{ \begin{array}{ll} \frac{1}{2(1-\lambda _N)^2}\lambda _1^2, &{} i = 0 \\ \frac{1}{2(1-\lambda _N)^2} (\lambda _{i+1} - \lambda _i)^2, &{} i = 1,\ldots ,N-1, \end{array}\right. } \end{aligned}$$

where the last equality comes from \((\varvec{\lambda },\varvec{{\tau }})\in {\varLambda }\), and this reduces to the following recursion:

$$\begin{aligned} {\left\{ \begin{array}{ll} \lambda _1 = 2(1 - \lambda _N)^2, &{} \\ (\lambda _i - \lambda _{i-1})^2 - \lambda _1\lambda _i = 0, \quad i=2,\ldots ,N, &{} \end{array}\right. } \end{aligned}$$
(10.1)

We use induction to prove that the solution of (10.1) is

$$\begin{aligned} \lambda _i = {\left\{ \begin{array}{ll} \frac{2}{\theta _N^2}, &{} i=1, \\ \theta _{i-1}^2\lambda _1, &{} i=2,\ldots ,N, \end{array}\right. } \end{aligned}$$

which is equivalent to \(\varvec{\hat{\lambda }}\) (6.10). It is obvious that \(\lambda _1 = \theta _0^2\lambda _1\) since \(\theta _0 = 1\), and for \(i=2\) in (10.1), we get

$$\begin{aligned} \lambda _2 = \frac{3\lambda _1 + \sqrt{9\lambda _1^2 - 4\lambda _1^2}}{2} = \frac{3+\sqrt{5}}{2}\lambda _1 = \theta _1^2\lambda _1 . \end{aligned}$$

Then, assuming \(\lambda _i = \theta _{i-1}^2\lambda _1\) for \(i=1,\ldots ,n\) and \(n\le N-1\), and using the second equality in (10.1) for \(i=n+1\), we get

$$\begin{aligned} \lambda _{n+1}&= \frac{\lambda _1 + 2\lambda _n + \sqrt{(\lambda _1 + 2\lambda _n)^2 - 4\lambda _n^2}}{2} = \frac{1 + 2\theta _{n-1}^2 + \sqrt{1 + 4\theta _{n-1}^2}}{2}\lambda _1 \\&= \left( \theta _{n-1}^2 + \frac{1 + \sqrt{1 + 4\theta _{n-1}^2}}{2}\right) \lambda _1 = \theta _n^2\lambda _1, \end{aligned}$$

where the last equality uses (3.2). Then we use the first equality in (10.1) to find the value of \(\lambda _1\) as

$$\begin{aligned}&\lambda _1 = 2(1 - \theta _{N-1}^2\lambda _1)^2 \\&\theta _{N-1}^4\lambda _1^2 - 2\left( \theta _{N-1}^2 + \frac{1}{4}\right) \lambda _1 + 1 = 0 \\&\lambda _1 = \frac{\theta _{N-1}^2 + \frac{1}{4} - \sqrt{\left( \theta _{N-1}^2 + \frac{1}{4}\right) ^2 - \theta _{N-1}^4}}{\theta _{N-1}^4}\\&\quad = \frac{1}{\theta _{N-1}^2 + \frac{1}{4} + \sqrt{\frac{\theta _{N-1}^2}{2} + \frac{1}{16}}} \\&\quad = \frac{8}{\left( 1 + \sqrt{1 + 8\theta _{N-1}^2}\right) ^2} = \frac{2}{\theta _N^2} \end{aligned}$$

with \(\theta _N\) in (6.13).

So far, we have derived \(\varvec{\hat{\lambda }}\) (6.10) using some of the conditions in (6.14). Next, using the last two conditions in (6.14) together with (3.2) and (6.15), we can easily derive the following:

$$\begin{aligned} \tau _i&= {\left\{ \begin{array}{ll} \hat{\lambda } _1 = \frac{2}{\theta _N^2}, &{} i = 0, \\ \hat{\lambda } _{i+1} - \hat{\lambda } _i = \frac{2\theta _i^2}{\theta _N^2} - \frac{2\theta _{i-1}^2}{\theta _N^2} = \frac{2\theta _i}{\theta _N^2}, &{} i=1,\ldots ,N-1, \\ 1 - \hat{\lambda } _N = 1 - \frac{2\theta _{N-1}^2}{\theta _N^2} = \frac{1}{\theta _N}, &{} i = N, \end{array}\right. } \\ \gamma&= \tau _N^2 = \frac{1}{\theta _N^2}, \end{aligned}$$

which are equivalent to \(\varvec{{\hat{\tau }}}\) (6.11) and \(\hat{\gamma } \) (6.12).
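As a numerical sanity check of the closed forms above (a verification sketch only, not part of the proof), the snippet below builds \(\theta _0,\ldots ,\theta _N\) from (3.2) with \(\theta _0 = 1\) and the final-step rule (6.13), forms \(\varvec{\hat{\lambda }}\), \(\varvec{{\hat{\tau }}}\) and \(\hat{\gamma }\) exactly as displayed, and confirms that the recursion (10.1) and the closed forms (6.11)–(6.12) hold to machine precision.

```python
import numpy as np

N = 10
theta = np.ones(N + 1)                              # theta_0 = 1
for i in range(1, N + 1):
    c = 8.0 if i == N else 4.0                      # final step uses (6.13)
    theta[i] = (1.0 + np.sqrt(1.0 + c * theta[i - 1] ** 2)) / 2.0

# hat-lambda from the closed form derived above: lambda_i = theta_{i-1}^2 * lambda_1
lam = np.zeros(N + 1)                               # lam[i] holds lambda_i, i = 1..N
lam[1] = 2.0 / theta[N] ** 2
lam[2:] = theta[1:N] ** 2 * lam[1]

# recursion (10.1)
assert np.isclose(lam[1], 2.0 * (1.0 - lam[N]) ** 2)
assert np.allclose((lam[2:] - lam[1:N]) ** 2, lam[1] * lam[2:])

# hat-tau and hat-gamma exactly as displayed above
tau = np.zeros(N + 1)
tau[0] = lam[1]                                     # tau_0 = lambda_1
tau[1:N] = lam[2:] - lam[1:N]                       # tau_i = lambda_{i+1} - lambda_i
tau[N] = 1.0 - lam[N]                               # tau_N = 1 - lambda_N
gamma = tau[N] ** 2

# closed forms (6.11) and (6.12)
assert np.allclose(tau[1:N], 2.0 * theta[1:N] / theta[N] ** 2)
assert np.isclose(tau[N], 1.0 / theta[N])
assert np.isclose(gamma, 1.0 / theta[N] ** 2)
```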

Next, we derive \(\varvec{\hat{r}}\) (6.9) for the given \(\varvec{\hat{\lambda }}\) (6.10) and \(\varvec{{\hat{\tau }}}\) (6.11). Inserting \(\varvec{{\hat{\tau }}}\) (6.11) into the first two conditions of (6.14), we get

(10.2)

for \(i,k=0,\ldots ,N-1\), and considering (6.5) and (10.2), we get

(10.3)

Finally, using the two equivalent forms (6.2) and (10.3), we get

(10.4)

and this can be easily converted to the choice \(\hat{r}_{i,k}\) in (6.9).

For these choices, we can easily verify that

(10.5)

for \(\varvec{\check{\theta }}= \left( \theta _0,\ldots ,\theta _{N-1},\frac{\theta _N}{2}\right) ^\top \), showing that the choice is feasible in both (RD) and (RD1). \(\square \)

1.2 Proof of (8.2)

We prove that (8.2) holds for the coefficients \(\varvec{\hat{h}}\) (7.1) of OGM1 and OGM2.

We first show the following property using induction:

$$\begin{aligned} \sum _{k=0}^{j-1}\hat{h}_{j,k} = {\left\{ \begin{array}{ll} \theta _j, &{} j = 1,\ldots ,N-1, \\ \frac{1}{2}(\theta _N+1), &{} j = N. \end{array}\right. } \end{aligned}$$

Clearly, \(\hat{h}_{1,0} = 1 + \frac{2\theta _0 - 1}{\theta _1} = \theta _1\) using (3.2). Assuming \(\sum _{k=0}^{j-1}\hat{h}_{j,k} = \theta _j\) for \(j=1,\ldots ,n\) and \(n\le N-1\), we get

$$\begin{aligned} \sum _{k=0}^n\hat{h}_{n+1,k}&= 1 + \frac{2\theta _n - 1}{\theta _{n+1}} + \frac{\theta _n - 1}{\theta _{n+1}}(\hat{h}_{n,n-1}- 1) + \frac{\theta _n - 1}{\theta _{n+1}}\sum _{k=0}^{n-2}\hat{h}_{n,k} \\&= 1 + \frac{\theta _n}{\theta _{n+1}} + \frac{\theta _n - 1}{\theta _{n+1}}\sum _{k=0}^{n-1}\hat{h}_{n,k} = \frac{\theta _{n+1} + \theta _n^2}{\theta _{n+1}} \\&= {\left\{ \begin{array}{ll} \theta _{n+1}, &{} n = 1,\ldots ,N-2, \\ \frac{1}{2}(\theta _N + 1), &{} n = N-1, \end{array}\right. } \end{aligned}$$

where the last equality uses (3.2) and (6.15).

Then, (8.2) can be easily derived using (3.2) and (6.15); in particular, since (3.2) gives \(\theta _j = \theta _j^2 - \theta _{j-1}^2\) with \(\theta _0 = 1\), the sums telescope:

$$\begin{aligned} \sum _{j=1}^i\sum _{k=0}^{j-1}\hat{h}_{j,k}&= {\left\{ \begin{array}{ll} \sum \nolimits _{j=1}^i\theta _j, &{} i = 1,\ldots ,N-1, \\ \sum \nolimits _{j=1}^{N-1}\theta _j + \frac{1}{2}(\theta _N+1), &{} i = N, \end{array}\right. } \\&= {\left\{ \begin{array}{ll} \theta _i^2 - 1, &{} i = 1,\ldots ,N-1, \\ \frac{1}{2}(\theta _N^2 - 1), &{} i = N. \end{array}\right. } \end{aligned}$$

\(\square \)
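The partial-sum property and (8.2) can be checked numerically as well. The sketch below rebuilds \(\varvec{\hat{h}}\) from the recursion that the induction step above implicitly uses (we assume this matches (7.1), which is not reproduced on this page) and verifies both displayed identities.

```python
import numpy as np

N = 8
theta = np.ones(N + 1)                              # theta_0 = 1
for i in range(1, N + 1):
    c = 8.0 if i == N else 4.0                      # final step uses (6.13)
    theta[i] = (1.0 + np.sqrt(1.0 + c * theta[i - 1] ** 2)) / 2.0

# Build h_{j,k} for j = 1..N, k = 0..j-1 from the recursion implicit in the
# induction step above (assumed to match (7.1)).
h = np.zeros((N + 1, N))
h[1, 0] = 1.0 + (2.0 * theta[0] - 1.0) / theta[1]
for i in range(1, N):
    h[i + 1, :i - 1] = (theta[i] - 1.0) / theta[i + 1] * h[i, :i - 1]
    h[i + 1, i - 1] = (theta[i] - 1.0) / theta[i + 1] * (h[i, i - 1] - 1.0)
    h[i + 1, i] = 1.0 + (2.0 * theta[i] - 1.0) / theta[i + 1]

row = h.sum(axis=1)                 # row[j] = sum_{k=0}^{j-1} h_{j,k}
assert np.allclose(row[1:N], theta[1:N])            # equals theta_j, j < N
assert np.isclose(row[N], 0.5 * (theta[N] + 1.0))   # equals (theta_N + 1)/2

cum = np.cumsum(row)                # cum[i] = sum_{j=1}^{i} sum_k h_{j,k}
assert np.allclose(cum[1:N], theta[1:N] ** 2 - 1.0)       # (8.2), i < N
assert np.isclose(cum[N], 0.5 * (theta[N] ** 2 - 1.0))    # (8.2), i = N
```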


About this article

Cite this article

Kim, D., Fessler, J.A. Optimized first-order methods for smooth convex minimization. Math. Program. 159, 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3
