Abstract
We consider the problem of minimizing a sum of functions having Lipschitz p-th order derivatives with different Lipschitz constants. To accelerate optimization in this case, we propose a general framework that allows one to obtain near-optimal oracle complexity for each function in the sum separately, meaning, in particular, that the oracle for a function with a smaller Lipschitz constant is called fewer times. As a building block, we extend the current theory of tensor methods and show how to generalize near-optimal tensor methods to work with an inexact tensor step. Further, we investigate the situation when the functions in the sum have Lipschitz derivatives of different orders. For this situation, we propose a generic way to separate the oracle complexity between the parts of the sum. Our method is not optimal, which leads to an open problem of the optimal combination of oracles of different orders.
The work of D. Kamzolov in Sects. 1–4 is funded by RFBR, project number 19-31-27001. The work of A. Gasnikov and P. Dvurechensky in Sects. 1–4 of the paper is supported by RFBR grant 18-29-03071 mk. The work in Sect. 5 is supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) No. 075-00337-20-03, project No. 0714-2020-0005.
References
Agarwal, N., Hazan, E.: Lower bounds for higher-order convex optimization. In: Conference on Learning Theory, PMLR (2018)
Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization. Math. Program. 178(1–2), 327–360 (2019)
Beznosikov, A., Gorbunov, E., Gasnikov, A.: Derivative-free method for decentralized distributed non-smooth optimization. arXiv preprint arXiv:1911.10645 (2019)
Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Near-optimal method for highly smooth convex optimization. In: Conference on Learning Theory, pp. 492–507 (2019)
Chebyshev, P.: Collected Works, vol. 5. Strelbytskyy Multimedia Publishing, Kyiv (2018)
Doikov, N., Nesterov, Y.: Local convergence of tensor methods. arXiv preprint arXiv:1912.02516 (2019)
Doikov, N., Nesterov, Y.: Minimizing uniformly convex functions by cubic regularization of Newton method. arXiv preprint arXiv:1905.02671 (2019)
Doikov, N., Richtárik, P.: Randomized block cubic Newton method. In: International Conference on Machine Learning, pp. 1290–1298 (2018)
Dvinskikh, D., Gasnikov, A.: Decentralized and parallelized primal and dual accelerated methods for stochastic convex programming problems. arXiv preprint arXiv:1904.09015 (2019)
Dvinskikh, D., Omelchenko, S., Tiurin, A., Gasnikov, A.: Accelerated gradient sliding and variance reduction. arXiv preprint arXiv:1912.11632 (2019)
Dvurechensky, P., Gasnikov, A., Ostroukhov, P., Uribe, C.A., Ivanova, A.: Near-optimal tensor methods for minimizing the gradient norm of convex function. arXiv preprint arXiv:1912.03381 (2019)
Gasnikov, A., Dvurechensky, P., Gorbunov, E., Vorontsova, E., Selikhanovych, D., Uribe, C.A.: Optimal tensor methods in smooth convex and uniformly convex optimization. In: Conference on Learning Theory, pp. 1374–1391 (2019)
Gasnikov, A., et al.: Near optimal methods for minimizing convex functions with Lipschitz \(p\)-th derivatives. In: Conference on Learning Theory, pp. 1392–1393 (2019)
Gorbunov, E., Dvinskikh, D., Gasnikov, A.: Optimal decentralized distributed algorithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363 (2019)
Grapiglia, G.N., Nesterov, Y.: On inexact solution of auxiliary problems in tensor methods for convex optimization. arXiv preprint arXiv:1907.13023 (2019)
Grapiglia, G.N., Nesterov, Y.: Tensor methods for minimizing functions with Hölder continuous higher-order derivatives. arXiv preprint arXiv:1904.12559 (2019)
Jiang, B., Wang, H., Zhang, S.: An optimal high-order tensor method for convex optimization. In: Conference on Learning Theory, pp. 1799–1801 (2019)
Kantorovich, L.V.: On Newton’s method. Trudy Matematicheskogo Instituta imeni VA Steklova 28, 104–144 (1949)
Lan, G.: Lectures on Optimization Methods for Machine Learning. H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA (2019)
Lan, G.: Gradient sliding for composite optimization. Math. Program. 159(1–2), 201–235 (2016)
Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180(1), 237–284 (2018). https://doi.org/10.1007/s10107-018-1355-4
Lan, G., Ouyang, Y.: Accelerated gradient sliding for structured convex optimization. arXiv preprint arXiv:1609.04905 (2016)
Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y.: Lectures on Convex Optimization. SOIA, vol. 137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91578-4
Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Math. Program., 1–27 (2019). https://doi.org/10.1007/s10107-019-01449-1
Rogozin, A., Gasnikov, A.: Projected gradient method for decentralized optimization over time-varying networks. arXiv preprint arXiv:1911.08527 (2019)
Song, C., Ma, Y.: Towards unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. arXiv preprint arXiv:1906.00582 (2019)
Acknowledgements
We would like to thank Yu. Nesterov for fruitful discussions on the inexact solution of the tensor subproblem.
Appendices
A Proof of Composite Accelerated Taylor Descent
This section reworks the proof from [4], adding the composite part. The next theorem is based on Theorem 2.1 from [4].
Theorem 6
Let \((y_k)_{k \ge 1}\) be a sequence of points in \(\mathbb {R}^d\) and \((\lambda _k)_{k \ge 1}\) a sequence in \(\mathbb {R}_+\). Define \((a_k)_{k \ge 1}\) such that \(\lambda _k A_k = a_k^2\), where \(A_k = \sum _{i=1}^k a_i\). Define also, for any \(k\ge 0\), \(x_k = x_0 - \sum _{i=1}^k a_i (\nabla f(y_i)+g'(y_i))\) and \(\tilde{x}_k := \frac{a_{k+1}}{A_{k+1}} x_{k} + \frac{A_k}{A_{k+1}} y_k\). Finally, assume that for some \(\sigma \in [0, 1]\)
\[\bigl \Vert y_{k+1} - \tilde{x}_k + \lambda _{k+1}\bigl (\nabla f(y_{k+1}) + g'(y_{k+1})\bigr )\bigr \Vert \le \sigma \cdot \Vert y_{k+1} - \tilde{x}_k\Vert . \qquad (24)\]
Then one has, for any \(x \in \mathbb {R}^d\),
\[F(y_k) - F(x) \le \frac{\Vert x - x_0\Vert ^2}{2 A_k}\]
and
\[\sum _{i=1}^{k} \frac{A_i}{\lambda _i}\,\Vert y_i - \tilde{x}_{i-1}\Vert ^2 \le \frac{\Vert x_0 - x^{*}\Vert ^2}{1-\sigma ^2}.\]
To prove this theorem, we introduce auxiliary lemmas based on Lemmas 2.2–2.5 and 3.1 from [4]; Lemmas 2.6 and 3.3 can be taken directly from [4] without any changes.
Lemma 3
Let \(\psi _0(x) = \frac{1}{2} \Vert x-x_0\Vert ^2\) and define by induction \(\psi _{k}(x) = \psi _{k-1}(x) + a_{k} \varOmega _1(F, y_{k}, x)\). Then \(x_k =x_0 - \sum _{i=1}^k a_i (\nabla f(y_i) + g'(y_i))\) is the minimizer of \(\psi _k\), and \(\psi _k(x) \le A_k F(x) + \frac{1}{2} \Vert x-x_0\Vert ^2\) where \(A_k = \sum _{i=1}^k a_i\).
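As a quick check, assuming \(\varOmega _1(F, y, x) = F(y) + \langle \nabla f(y) + g'(y), x - y\rangle \) (the first-order lower model of F, consistent with the definition of \(x_k\) above), both claims follow directly: the first-order optimality condition
\[\nabla \psi _k(x) = x - x_0 + \sum _{i=1}^k a_i\bigl (\nabla f(y_i) + g'(y_i)\bigr ) = 0\]
is satisfied exactly at \(x = x_k\), and, since \(\varOmega _1(F, y_i, x) \le F(x)\) by convexity of f and g,
\[\psi _k(x) \le \frac{1}{2}\Vert x - x_0\Vert ^2 + \sum _{i=1}^k a_i F(x) = A_k F(x) + \frac{1}{2}\Vert x - x_0\Vert ^2 .\]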
Lemma 4
Let \((z_k)\) be a sequence such that
\[\psi _k(x_k) \ge A_k F(z_k).\]
Then one has, for any x,
\[F(z_k) - F(x) \le \frac{\Vert x - x_0\Vert ^2}{2 A_k}.\]
Proof
One has (recall Lemma 3):
\[A_k F(z_k) \le \psi _k(x_k) \le \psi _k(x) \le A_k F(x) + \frac{1}{2}\Vert x - x_0\Vert ^2 .\]
Lemma 5
One has for any x,
\[\psi _{k+1}(x) \ge \psi _k(x_k) + A_{k+1} F(z_{k+1}) - A_k F(z_k) + A_{k+1}\bigl (\varOmega _1(F, y_{k+1}, \tilde{x}) - F(z_{k+1})\bigr ) + \frac{1}{2}\Vert x - x_k\Vert ^2 ,\]
where \(\tilde{x} := \frac{a_{k+1}}{A_{k+1}} x + \frac{A_k}{A_{k+1}} z_k\).
Proof
Firstly, by a simple calculation we note that
\[\psi _k(x) = \psi _k(x_k) + \frac{1}{2}\Vert x - x_k\Vert ^2 ,\]
so that
\[\psi _{k+1}(x) = \psi _k(x_k) + \frac{1}{2}\Vert x - x_k\Vert ^2 + a_{k+1} \varOmega _1(F, y_{k+1}, x). \qquad (29)\]
Now we want to make the term \(A_{k+1} F(z_{k+1}) - A_k F(z_k)\) appear as a lower bound on the right-hand side of (29). Using the inequality \(\varOmega _1(F, y_{k+1}, z_k) \le F(z_k)\) and the linearity of \(\varOmega _1(F, y_{k+1}, \cdot )\), we have:
\[a_{k+1} \varOmega _1(F, y_{k+1}, x) \ge A_{k+1} \varOmega _1(F, y_{k+1}, \tilde{x}) - A_k F(z_k) = A_{k+1} F(z_{k+1}) - A_k F(z_k) + A_{k+1}\bigl (\varOmega _1(F, y_{k+1}, \tilde{x}) - F(z_{k+1})\bigr ),\]
which concludes the proof.
Lemma 6
Denoting \(\lambda _{k+1} := \frac{a_{k+1}^2}{A_{k+1}}\) and \(\tilde{x}_k := \frac{a_{k+1}}{A_{k+1}} x_{k} + \frac{A_k}{A_{k+1}} y_k\) one has:
\[\psi _{k+1}(x_{k+1}) - A_{k+1} F(y_{k+1}) \ge \psi _k(x_k) - A_k F(y_k) + \frac{A_{k+1}}{2\lambda _{k+1}}\Bigl (\Vert y_{k+1} - \tilde{x}_k\Vert ^2 - \bigl \Vert y_{k+1} - \tilde{x}_k + \lambda _{k+1}\bigl (\nabla f(y_{k+1}) + g'(y_{k+1})\bigr )\bigr \Vert ^2\Bigr ).\]
In particular, we have in light of (24)
\[\psi _{k+1}(x_{k+1}) - A_{k+1} F(y_{k+1}) \ge \psi _k(x_k) - A_k F(y_k) + \frac{(1-\sigma ^2) A_{k+1}}{2\lambda _{k+1}}\,\Vert y_{k+1} - \tilde{x}_k\Vert ^2 .\]
Proof
We apply Lemma 5 with \(z_k = y_k\) and \(x = x_{k+1}\), and note that (with \(\tilde{x} := \frac{a_{k+1}}{A_{k+1}} x + \frac{A_k}{A_{k+1}} y_k\)):
\[\tilde{x} - \tilde{x}_k = \frac{a_{k+1}}{A_{k+1}}\,(x - x_k), \quad \text{so that} \quad \frac{1}{2}\Vert x - x_k\Vert ^2 = \frac{A_{k+1}}{2\lambda _{k+1}}\,\Vert \tilde{x} - \tilde{x}_k\Vert ^2 .\]
This yields:
\[\psi _{k+1}(x_{k+1}) - A_{k+1} F(y_{k+1}) \ge \psi _k(x_k) - A_k F(y_k) + \min _{\tilde{x} \in \mathbb {R}^d}\Bigl \{ A_{k+1}\bigl \langle \nabla f(y_{k+1}) + g'(y_{k+1}), \tilde{x} - y_{k+1}\bigr \rangle + \frac{A_{k+1}}{2\lambda _{k+1}}\Vert \tilde{x} - \tilde{x}_k\Vert ^2 \Bigr \}.\]
The value of the minimum is easy to compute.
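Indeed, the minimum is attained at \(\tilde{x} = \tilde{x}_k - \lambda _{k+1}\bigl (\nabla f(y_{k+1}) + g'(y_{k+1})\bigr )\), and substituting this point back gives
\[A_{k+1}\bigl \langle \nabla f(y_{k+1}) + g'(y_{k+1}), \tilde{x}_k - y_{k+1}\bigr \rangle - \frac{\lambda _{k+1} A_{k+1}}{2}\bigl \Vert \nabla f(y_{k+1}) + g'(y_{k+1})\bigr \Vert ^2 = \frac{A_{k+1}}{2\lambda _{k+1}}\Bigl (\Vert y_{k+1} - \tilde{x}_k\Vert ^2 - \bigl \Vert y_{k+1} - \tilde{x}_k + \lambda _{k+1}\bigl (\nabla f(y_{k+1}) + g'(y_{k+1})\bigr )\bigr \Vert ^2\Bigr ),\]
which is exactly the bound stated in the lemma.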
For the first conclusion in Theorem 6, it suffices to combine Lemma 6 with Lemma 4, and Lemma 2.5 from [4]. The second conclusion in Theorem 6 follows from Lemma 6 and Lemma 3.
The following lemma shows that minimizing the \(p\)-th order Taylor expansion (4) can be viewed as an implicit gradient step with some “large” step size:
Lemma 7
Equation (24) holds true with \(\sigma = 1/2\) for (4), provided that one has:
\[\frac{1}{2} \le \lambda _{k+1}\,\frac{L_p \cdot \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}}{(p-1)!} \le \frac{p}{p+1}. \qquad (30)\]
Proof
Observe that the optimality condition for (4) gives:
\[\nabla \varOmega _p(f, \tilde{x}_k, y_{k+1}) + \frac{(p+1) L_p}{p!}\,\Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}(y_{k+1} - \tilde{x}_k) + g'(y_{k+1}) = 0. \qquad (31)\]
In particular we get:
\[\nabla f(y_{k+1}) + g'(y_{k+1}) = \nabla f(y_{k+1}) - \nabla \varOmega _p(f, \tilde{x}_k, y_{k+1}) - \frac{(p+1) L_p}{p!}\,\Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}(y_{k+1} - \tilde{x}_k).\]
By doing a Taylor expansion of the gradient function one obtains:
\[\bigl \Vert \nabla f(y_{k+1}) - \nabla \varOmega _p(f, \tilde{x}_k, y_{k+1})\bigr \Vert \le \frac{L_p}{p!}\,\Vert y_{k+1} - \tilde{x}_k\Vert ^{p},\]
so that we find:
\[\bigl \Vert y_{k+1} - \tilde{x}_k + \lambda _{k+1}\bigl (\nabla f(y_{k+1}) + g'(y_{k+1})\bigr )\bigr \Vert \le \lambda _{k+1}\frac{L_p \Vert y_{k+1} - \tilde{x}_k\Vert ^{p}}{p!} + \Bigl |1 - \lambda _{k+1}\frac{(p+1) L_p \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}}{p!}\Bigr |\cdot \Vert y_{k+1} - \tilde{x}_k\Vert = \Bigl (\frac{\eta }{p} + \Bigl |1 - \frac{p+1}{p}\,\eta \Bigr |\Bigr )\Vert y_{k+1} - \tilde{x}_k\Vert ,\]
where we used (31) in the second last equation and we let \(\eta := \lambda _{k+1} \frac{L_p \cdot \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}}{(p-1)!}\) in the last equation. The result follows from the assumption \(1/2 \le \eta \le p/(p+1)\) in (30).
Finally, if we replace \(\Vert x^{*}\Vert \) by \(\Vert x_0-x^{*}\Vert \) in Lemma 3.3 and use Lemma 3.4 from [4], we prove Theorem 6.
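To make the role of condition (30) concrete, the following is a minimal sketch (not the implementation from the paper) of one accelerated step with a bisection over \(\lambda _{k+1}\) on the logarithmic scale, in the spirit of the line search in [4]. Here taylor_step (an oracle solving subproblem (4) around a given point) and grad_F (an oracle returning \(\nabla f(\cdot ) + g'(\cdot )\)) are hypothetical callbacks supplied by the user.

    import numpy as np
    from math import factorial

    def catd_step(x_k, y_k, A_k, taylor_step, grad_F, L_p, p,
                  lam_lo=1e-12, lam_hi=1e12, tol=1e-10):
        """One ATD-style step: bisect over lambda until condition (30) holds,
        i.e. 1/2 <= lam * L_p * ||y_next - x_tilde||^(p-1) / (p-1)! <= p/(p+1)."""
        while lam_hi / lam_lo > 1.0 + tol:
            lam = np.sqrt(lam_lo * lam_hi)   # midpoint on the logarithmic scale
            # a_{k+1} is the positive root of a^2 = lam * (A_k + a)
            a = (lam + np.sqrt(lam ** 2 + 4.0 * lam * A_k)) / 2.0
            A_next = A_k + a
            x_tilde = (a / A_next) * x_k + (A_k / A_next) * y_k
            y_next = taylor_step(x_tilde)    # solve subproblem (4) around x_tilde
            eta = lam * L_p * np.linalg.norm(y_next - x_tilde) ** (p - 1) \
                / factorial(p - 1)
            if eta < 0.5:                    # step too cautious: increase lambda
                lam_lo = lam
            elif eta > p / (p + 1):          # step too aggressive: decrease lambda
                lam_hi = lam
            else:                            # condition (30) holds: accept the step
                x_next = x_k - a * grad_F(y_next)
                return y_next, x_next, A_next
        raise RuntimeError("no lambda satisfying (30) was found")

Each trial value of \(\lambda \) costs one call to the tensor-step oracle, so the search adds only a logarithmic factor to the per-iteration cost.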
B Inexact solution of the subproblem
Suppose that (4) cannot be solved exactly. Assume that we can find only an inexact solution \(\tilde{y}_{k+1}\) satisfying
In this case, Lemma 7 should be corrected as follows.
Lemma 8
Equation (24) holds true with \(\sigma = 3/4\) for (32), provided that one has:
Proof
Let us introduce
The main difference from the proof of Lemma 7 is in the following line:
To complete the proof, it remains to notice that, due to (32),
Based on (32), we now relate the accuracy \(\tilde{\varepsilon }\) to which we need to solve the auxiliary problem to the desired accuracy \(\varepsilon \) for problem (1). For this we use Lemma 2.1 from [15]. This lemma guarantees that if
then (32) holds true. So it is sufficient to solve the auxiliary problem in the sense of (33).
Assume that F(x) is an r-uniformly convex function with constant \(\sigma _r\) (\(r\ge 2\), \(\sigma _r > 0\), see Definition 1). Then from Lemma 2 of [7] we have
\[F(x) - F(x^{*}) \le \frac{r-1}{r}\,\Bigl (\frac{1}{\sigma _r}\Bigr )^{\frac{1}{r-1}} \Vert \nabla F(x)\Vert ^{\frac{r}{r-1}}. \qquad (34)\]
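For instance, for \(r = 2\) (strong convexity) the bound (34) specializes to the familiar inequality
\[F(x) - F(x^{*}) \le \frac{1}{2\sigma _2}\,\Vert \nabla F(x)\Vert ^{2}.\]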
Inequalities (33) and (34) guarantee that it is sufficient to solve the auxiliary problem with the accuracy
in terms of criterion (33). Since the auxiliary problem is itself r-uniformly convex every time, we can apply (34) to the auxiliary problem to estimate the required accuracy in terms of the function value gap. In any case, there is no need to track this accuracy carefully, since its effect on the complexity is only logarithmic. The only restrictive assumption we have made is that F(x) is r-uniformly convex. If this is not the case, as in Sect. 4, we may use regularization tricks [11]. This leads to \(\sigma _2 \sim \varepsilon \). The dependence on \(\tilde{\varepsilon }\) then becomes worse, but this does not change the main conclusion: one may skip the details concerning the accuracy of the solution of the auxiliary problem.
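As an illustration of the regularization trick (a sketch; the exact constants in [11] may differ), for a convex F and \(R \ge \Vert x_0 - x^{*}\Vert \) one can minimize
\[\hat{F}(x) := F(x) + \frac{\varepsilon }{2R^2}\,\Vert x - x_0\Vert ^2 ,\]
which is 2-uniformly convex with \(\sigma _2 = \varepsilon /R^2\); any \(\hat{x}\) with \(\hat{F}(\hat{x}) - \min _x \hat{F}(x) \le \varepsilon /2\) satisfies \(F(\hat{x}) - F(x^{*}) \le \varepsilon \).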
C CATD with restarts
Proof of Theorem 2.
Proof
As F is an r-uniformly convex function, we get
Now we compute the total number of CATD steps.
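For orientation, here is a sketch of this computation in the case \(r = p+1\), assuming the CATD rate from [4] in the form \(F(y_N) - F(x^{*}) \le c_p L_p \Vert x_0 - x^{*}\Vert ^{p+1} / N^{(3p+1)/2}\) for some constant \(c_p > 0\) (the constant and the case analysis in the paper may differ). If restart j starts from a point with gap \(\varDelta _j\), then uniform convexity gives \(\Vert x_j - x^{*}\Vert ^{p+1} \le \frac{p+1}{\sigma _{p+1}}\,\varDelta _j\), so
\[N_j = \Bigl \lceil \Bigl (\frac{2 c_p (p+1) L_p}{\sigma _{p+1}}\Bigr )^{\frac{2}{3p+1}} \Bigr \rceil \]
CATD steps halve the gap, \(\varDelta _{j+1} \le \varDelta _j / 2\). Hence \(O\bigl (\bigl (L_p/\sigma _{p+1}\bigr )^{2/(3p+1)} \log _2 (\varDelta _0/\varepsilon )\bigr )\) CATD steps in total suffice to reach accuracy \(\varepsilon \).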