
A Newton Frank–Wolfe method for constrained self-concordant minimization

Journal of Global Optimization

Abstract

We develop a new Newton Frank–Wolfe algorithm to solve a class of constrained self-concordant minimization problems using linear minimization oracles (LMOs). Unlike L-smooth convex functions, whose gradients are globally Lipschitz continuous, self-concordant functions only admit local bounds, making it difficult to estimate the number of LMO calls required by the underlying optimization algorithm. Fortunately, we can still prove that the number of LMO calls of our method is nearly the same as that of the standard Frank–Wolfe method in the L-smooth case. Specifically, our method requires at most \({\mathcal {O}}\big (\varepsilon ^{-(1 + \nu )}\big )\) LMO calls, where \(\varepsilon \) is the desired accuracy and \(\nu \in (0, 0.139)\) is a given constant depending on the chosen initial point of the proposed algorithm. Our numerical experiments on three applications, portfolio design with the competitive ratio, D-optimal experimental design, and logistic regression with an elastic-net regularizer, show that the proposed Newton Frank–Wolfe method outperforms several state-of-the-art competitors.


Notes

  1. The smoothness of f is only defined on \(\mathrm {dom}(f)\), an open set.

  2. The differentiability of \(\varphi \) is only defined on \(\mathrm {dom}(\varphi )\), an open set.

  3. Notice that Theorem 2 is proven under the assumption that \({\mathbf {x}}^0\) is sufficiently close to \({\mathbf {x}}^{\star }\) (the optimal solution of (1)) so that the damped step is never invoked.

  4. One can see from the proof leading to [43, equation (72)] that the relation holds more generally when \({\mathbf {z}}_+\) and \({\mathbf {z}}\) are replaced by any two vectors \({\mathbf {y}}\) and \({\mathbf {x}}\) satisfying \(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{\mathbf {x}}\le 1\).

  5. In fact, by (23), we have \(\nabla ^2 f({\mathbf {y}}) \preceq \frac{\nabla ^2 f({\mathbf {x}})}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\), which is equivalent to \(\nabla ^2 f({\mathbf {x}})^{-1} \preceq \frac{\nabla ^2 f({\mathbf {y}})^{-1}}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\). Therefore, we have \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\).

References

  1. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)

  2. Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer (2017)

  3. Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59(2), 235–247 (2004)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)

  6. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)

  7. Chen, Y., Ye, X.: Projection onto a simplex. Preprint arXiv:1101.6081 (2011)

  8. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  9. Damla, S.A., Sun, P., Todd, M.J.: Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optim. Methods Softw. 23(1), 5–19 (2008)

  10. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)

  11. de Oliveira, F.R., Ferreira, O.P., Silva, G.N.: Newton’s method with feasible inexact projections for solving constrained generalized equations. Comput. Optim. Appl. 72(1), 159–177 (2019)

  12. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 272–279. ACM, New York (2008)

  13. Dvurechensky, P., Ostroukhov, P., Safin, K., Shtern, S., Staudigl, M.: Self-concordant analysis of Frank–Wolfe algorithms. In: International Conference on Machine Learning (ICML), pp. 2814–2824. PMLR (2020)

  14. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)

  15. Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. Preprint arXiv:1301.4666 (2013)

  16. Garber, D., Hazan, E.: Faster rates for the Frank–Wolfe method over strongly-convex sets. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 541–549 (2015)

  17. Gonçalves, M.L.N., Melo, J.G.: A Newton conditional gradient method for constrained nonlinear systems. J. Comput. Appl. Math. 311, 473–483 (2017)

  18. Gonçalves, D.S., Gonçalves, M.L.N., Menezes, T.C.: Inexact variable metric method for convex-constrained optimization problems. Optimization (online first), 1–19 (2021)

  19. Gonçalves, D.S., Gonçalves, M.L.N., Oliveira, F.R.: Levenberg–Marquardt methods with inexact projections for constrained nonlinear systems. Preprint arXiv:1908.06118 (2019)

  20. Gonçalves, M.L.N., Oliveira, F.R.: On the global convergence of an inexact quasi-Newton conditional gradient method for constrained nonlinear systems. Numer. Algorithms 84(2), 606–631 (2020)

  21. Gross, D., Liu, Y.-K., Flammia, S., Becker, S., Eisert, J.: Quantum state tomography via compressed sensing. Phys. Rev. Lett. 105(15), 150401 (2010)

  22. Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. 35(1), 110–119 (1986)

  23. Harman, R., Trnovská, M.: Approximate D-optimal designs of experiments on the convex hull of a finite set of information matrices. Math. Slov. 59(6), 693–704 (2009)

  24. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009)

  25. Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American Symposium on Theoretical Informatics, pp. 306–316. Springer (2008)

  26. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)

  27. Khachiyan, L.G.: Rounding of polytopes in the real number model of computation. Math. Oper. Res. 21(2), 307–320 (1996)

  28. Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank–Wolfe optimization variants. In: Advances in Neural Information Processing Systems (NIPS), pp. 496–504 (2015)

  29. Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. SIAM J. Optim. 26(2), 1379–1409 (2016)

  30. Lan, G., Ouyang, Y.: Accelerated gradient sliding for structured convex optimization. Preprint arXiv:1609.04905 (2016)

  31. Lu, Z., Pong, T.K.: Computing optimal experimental designs via interior point method. SIAM J. Matrix Anal. Appl. 34(4), 1556–1580 (2013)

  32. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers (2004)

  33. Nesterov, Y., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM (1994)

  34. Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank–Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)

  35. Ostrovskii, D.M., Bach, F.: Finite-sample analysis of M-estimators using self-concordance. Electron. J. Stat. 15(1), 326–391 (2021)

  36. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  37. Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)

  38. Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)

  39. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)

  40. Sun, T., Tran-Dinh, Q.: Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2019)

  41. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)

  42. Tran-Dinh, Q., Ling, L., Toh, K.-C.: A new homotopy proximal variable-metric framework for composite convex minimization. Math. Oper. Res. (online first), 1–28 (2021)

  43. Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Math. Program. 177(1–2), 173–223 (2019)

  44. Yurtsever, A., Fercoq, O., Cevher, V.: A conditional-gradient-based augmented Lagrangian framework. In: International Conference on Machine Learning (ICML), pp. 7272–7281 (2019)

  45. Yurtsever, A., Tran-Dinh, Q., Cevher, V.: A universal primal-dual convex optimization framework. In: Advances in Neural Information Processing Systems (NIPS), pp. 1–9 (2015)


Acknowledgements

Q. Tran-Dinh was partly supported by the National Science Foundation (NSF), Grant No. DMS-1619884, and the Office of Naval Research (ONR), Grant No. N00014-20-1-2088. V. Cevher was partly supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 725594 - time-data) and by a 2019 Google Faculty Research Award.

Author information


Corresponding author

Correspondence to Quoc Tran-Dinh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs of technical results

Let us recall the following key properties of standard self-concordant functions. Let f be standard self-concordant and \({\mathbf {x}},{\mathbf {y}}\in \mathrm {dom}(f)\) such that \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\). Then

$$\begin{aligned} \begin{array}{lclclcll} \big (\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}\big )^2:= & {} {\mathbf {u}}^{\top }\nabla ^2f({\mathbf {y}}){\mathbf {u}}\le & {} {\mathbf {u}}^{\top }\frac{\nabla ^2f({\mathbf {x}})}{(1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})^2}{\mathbf {u}}= & {} \left( \frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}}}\right) ^2,&\quad \forall {\mathbf {u}}\in {\mathbb {R}}^p. \end{array} \end{aligned}$$
(22)

Similarly, if \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {y}}} < 1\), then

$$\begin{aligned} \begin{array}{lclclcll} \big (\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}\big )^2:= & {} {\mathbf {u}}^{\top }\nabla ^2f({\mathbf {y}}){\mathbf {u}}\le & {} {\mathbf {u}}^{\top }\frac{\nabla ^2f({\mathbf {x}})}{(1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}{\mathbf {u}}= & {} \left( \frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}}\right) ^2,&\quad \forall {\mathbf {u}}\in {\mathbb {R}}^p. \end{array} \end{aligned}$$
(23)

These inequalities can be found in [32, Theorem 4.1.6]. In addition, from [43, equation (72)], we have

$$\begin{aligned} \Vert \nabla f({\mathbf {y}}) - \nabla f({\mathbf {x}}) - \nabla ^2 f({\mathbf {x}})({\mathbf {y}}- {\mathbf {x}})\Vert _{{\mathbf {x}}}^*\le \frac{\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}}^2}{1 - \Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}}}, \end{aligned}$$
(24)

provided that \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\) (see Footnote 4). These inequalities will be used repeatedly in the proofs below.
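As a quick illustration of how (22) and (24) behave, the following Python snippet checks both inequalities numerically for the scalar self-concordant function \(f(x) = -\log (x)\) on \((0,\infty )\) (our own toy example, not used elsewhere in the paper), where \(\Vert u\Vert _x = |u|\sqrt{f''(x)}\) and \(\Vert v\Vert _x^{*} = |v|/\sqrt{f''(x)}\).

```python
import numpy as np

# Numerical sanity check of (22) and (24) for the scalar self-concordant
# function f(x) = -log(x), dom(f) = (0, inf), where ||u||_x = |u| sqrt(f''(x))
# and the dual norm is ||v||_x^* = |v| / sqrt(f''(x)).
rng = np.random.default_rng(0)

def grad(x):
    return -1.0 / x

def hess(x):
    return 1.0 / x**2

for _ in range(10_000):
    x, y = rng.uniform(0.1, 5.0, size=2)
    r = abs(y - x) * np.sqrt(hess(x))              # r = ||y - x||_x
    if r >= 1.0:
        continue                                   # (22) and (24) require r < 1
    u = rng.normal()
    # (22): ||u||_y <= ||u||_x / (1 - r)
    assert abs(u) * np.sqrt(hess(y)) <= abs(u) * np.sqrt(hess(x)) / (1.0 - r) + 1e-12
    # (24): ||grad f(y) - grad f(x) - hess f(x)(y - x)||_x^* <= r^2 / (1 - r)
    lhs = abs(grad(y) - grad(x) - hess(x) * (y - x)) / np.sqrt(hess(x))
    assert lhs <= r**2 / (1.0 - r) + 1e-12
print("(22) and (24) hold on all sampled pairs")
```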

1.1 Two key lemmas for proving Theorem 1

We need the following two lemmas to prove Theorem 1. The first lemma quantifies the decrease of the objective value produced by a damped-step iteration.

Lemma 3

Let \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) be the local distance between \({\mathbf {z}}^k\) and \({\mathbf {x}}^k\), where \({\mathbf {z}}^k\) is the output of Algorithm 2 at \({\mathbf {x}}^k\) with \(\eta = \eta _k^2\). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If we choose \(\alpha \in (0, 1)\) such that \(\alpha \gamma _k < 1\) and update \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + \alpha ({\mathbf {z}}^k - {\mathbf {x}}^k)\), then we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^k) - \left[ \alpha ( \gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)\right] . \end{aligned}$$
(25)

Assume \(\gamma _k > \eta _k\). If \(\delta \in (0,1)\) and the step size is \(\alpha _k := \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\) then we have \(\alpha _k\gamma _k< \delta < 1\). Moreover, it also holds that

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^k) - \delta \omega \Big (\frac{\gamma _k^2 - \eta _k^2}{\gamma _k} \Big ), \end{aligned}$$
(26)

where \(\omega (\tau ) := \tau - \log (1 + \tau )\) and \(\omega _{*}(\tau ) := -\tau - \log (1 - \tau )\) are two nonnegative and convex functions.

Proof

From (7) and the stopping criterion of Algorithm 2, \({\mathbf {z}}^k\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\). In particular, \({\mathbf {z}}^k\) satisfies

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {z}}^k -{\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\rangle \le \eta _k^2. \end{aligned}$$

This inequality leads to

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\rangle \le \eta _k^2 - \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2. \end{aligned}$$
(27)

Therefore, using the self-concordance of f [32, Theorem 4.1.8], we can derive

$$\begin{aligned} \begin{array}{lcl} f({\mathbf {x}}^{k+1}) &{}\le &{} f({\mathbf {x}}^k) + \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} = &{} f({\mathbf {x}}^k) + \alpha \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\right\rangle + \omega _{*}(\alpha \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} \overset{(27)}{\le } &{} f({\mathbf {x}}^k) + \alpha \left( \eta _k^2 - \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2\right) + \omega _{*}(\alpha \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} = &{} f({\mathbf {x}}^k) - [\alpha (\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)]. \end{array} \end{aligned}$$
(28)

This is exactly (25).

Assume that \(\gamma _k^2 > \eta _k^2\) and define \(\psi (\alpha ) := \alpha (\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)\). Plugging \(\alpha _k = \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\) into \(\psi (\alpha )\), we arrive at

$$\begin{aligned} \begin{array}{lcl} \psi (\alpha _k) &{}= &{} \alpha _k(\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha _k\gamma _k) \\ &{} = &{} \alpha _k(\gamma _k^2 - \eta _k^2 + \gamma _k) + \log (1 - \alpha _k\gamma _k) \\ &{} = &{} \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k} + \log \left( 1 - \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k^2 - \eta _k^2 + \gamma _k}\right) \\ &{} \ge &{} \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k} + \delta \log \left( 1 - \frac{(\gamma _k^2 - \eta _k^2)}{\gamma _k^2 - \eta _k^2 + \gamma _k}\right) \\ &{} = &{} \delta \omega (\frac{\gamma _k^2 - \eta _k^2}{\gamma _k}), \end{array} \end{aligned}$$
(29)

where we use \(\log (1 - \delta s) \ge \delta \log (1 - s)\) for \(s \in (0,1)\) to obtain the inequality. Combining (28) and (29) proves (26). \(\square \)
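For intuition, the following Python sketch evaluates the damped step size \(\alpha _k\) of Lemma 3 and the guaranteed decrease in (26) from given values of \(\gamma _k\), \(\eta _k\), and \(\delta \). It is purely illustrative; the actual iterate update \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + \alpha _k({\mathbf {z}}^k - {\mathbf {x}}^k)\) of course requires the output \({\mathbf {z}}^k\) of Algorithm 2.

```python
import math

def damped_step(gamma_k: float, eta_k: float, delta: float = 0.99):
    """Damped step size from Lemma 3 and the guaranteed decrease in (26).

    gamma_k = ||z^k - x^k||_{x^k}, eta_k = inner accuracy (gamma_k > eta_k),
    delta in (0, 1).  Returns (alpha_k, decrease) with
    f(x^{k+1}) <= f(x^k) - decrease for x^{k+1} = x^k + alpha_k (z^k - x^k).
    """
    assert 0.0 <= eta_k < gamma_k and 0.0 < delta < 1.0
    alpha_k = delta * (gamma_k**2 - eta_k**2) / (gamma_k * (gamma_k**2 + gamma_k - eta_k**2))
    omega = lambda t: t - math.log1p(t)            # omega(tau) = tau - log(1 + tau)
    decrease = delta * omega((gamma_k**2 - eta_k**2) / gamma_k)
    assert alpha_k * gamma_k < delta < 1.0         # so the damped iterate stays in dom(f)
    return alpha_k, decrease

print(damped_step(gamma_k=0.8, eta_k=0.1))         # e.g. alpha_k ~ 0.545, decrease ~ 0.204
```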

The following lemma shows that the residual \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) can be bounded by the projected Newton decrement \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\).

Lemma 4

Let \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\), \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), and h be defined by (8). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If \(\gamma _k + \eta _k \in (0, C_2)\), then we have

$$\begin{aligned} {\bar{\lambda }}_k \le h({\bar{\gamma }}_k) \le h(\gamma _k + \eta _k). \end{aligned}$$
(30)

Proof

First, we write down the optimality conditions of (4) and (1), respectively, as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], {\mathbf {x}}- T({\mathbf {x}}^k)\right\rangle \ge 0,~~~\forall {\mathbf {x}}\in {\mathcal {X}}, \\ \left\langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}- {\mathbf {x}}^{\star }\right\rangle \ge 0,~~~\forall {\mathbf {x}}\in {\mathcal {X}}. \end{array} \right. \end{aligned}$$

Substituting \({\mathbf {x}}^{\star }\) for \({\mathbf {x}}\) in the first inequality and \(T({\mathbf {x}}^k)\) for \({\mathbf {x}}\) in the second inequality, we get

$$\begin{aligned} \left\{ \begin{array}{l} \left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], {\mathbf {x}}^{\star } - T({\mathbf {x}}^k)\right\rangle \ge 0, \\ \left\langle \nabla f({\mathbf {x}}^{\star }), T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\right\rangle \ge 0. \end{array} \right. \end{aligned}$$

Adding up both inequalities yields

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle \ge 0, \end{aligned}$$

which is equivalent to

$$\begin{aligned}&\langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle \\&\quad \ge \langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^{\star }) , T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle . \end{aligned}$$

Since f is self-concordant, by [32, Theorem 4.1.7], we have

$$\begin{aligned} \left\langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^{\star }) , T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\right\rangle \ge \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}^2}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}. \end{aligned}$$

By the Cauchy-Schwarz inequality, this estimate leads to

$$\begin{aligned} \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}} \le \Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{T({\mathbf {x}}^k)}^*. \end{aligned}$$
(31)

Now, we can bound the right-hand side of the above inequality as

$$\begin{aligned} \begin{array}{lcl} {\mathcal {R}} &{}:= &{} \Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{T({\mathbf {x}}^k)}^* \\ &{} \le &{} \frac{\Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{{\mathbf {x}}^k}^*}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}} \\ &{} \overset{(24)}{\le } &{} \left( \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\right) ^2, \end{array} \end{aligned}$$
(32)

where the first inequality comes from the dual form of (23), i.e., \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\) (see Footnote 5), and the last inequality holds since \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 0.5\).

From (31) and (32), we have

$$\begin{aligned} \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}} \le \left( \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\right) ^2, \end{aligned}$$

which can be reformulated as

$$\begin{aligned} \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)} \le \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - 2\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}. \end{aligned}$$
(33)

Next, to bound \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\) in terms of \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), we derive

$$\begin{aligned} \begin{array}{lcl} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} &{}\le &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\overset{(23)}{\le } &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 - \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}} \\ &{}\overset{(33)}{\le } &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^2}{\left( 1 - 2\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}\right) \left( 1 - \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}\right) } \\ &{}= &{} {\bar{\gamma }}_k + \frac{{\bar{\gamma }}_k^2}{(1 - 2{\bar{\gamma }}_k)(1 - {\bar{\gamma }}_k)}. \end{array} \end{aligned}$$
(34)

Notice that the step using (23) in the above chain is valid because \(\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 1\), where \(C_2\) is the constant defined right after (8). Since h is monotonically increasing and \({\bar{\gamma }}_k \le \gamma _k + \eta _k\), we finally get

$$\begin{aligned} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \overset{(22)}{\le } \frac{\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}} \overset{(34)}{\le } \frac{{\bar{\gamma }}_k(1 -2{\bar{\gamma }}_k + 2{\bar{\gamma }}_k^2)}{(1 - 2{\bar{\gamma }}_k)(1 - {\bar{\gamma }}_k)^2 - {\bar{\gamma }}_k^2} = h({\bar{\gamma }}_k) \le h(\gamma _k + \eta _k), \end{aligned}$$

which proves (30). Notice that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} < 1\), which justifies the use of (22) in the last chain, also follows from (34) and \({\bar{\gamma }}_k \le C_2\). \(\square \)
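The next snippet is a small numerical sketch of the function h appearing in Lemma 4 (the expression obtained at the end of the proof above). We do not reproduce the exact constant \(C_2\) from the main text here; we simply restrict to a subinterval on which the denominator stays positive and check monotonicity, which is the property used to pass from \(h({\bar{\gamma }}_k)\) to \(h(\gamma _k + \eta _k)\) in (30).

```python
import numpy as np

def h(gamma):
    """The function h from Lemma 4: an upper bound on ||x^k - x*||_{x*}
    in terms of gamma (see the last display of the proof above)."""
    num = gamma * (1.0 - 2.0 * gamma + 2.0 * gamma**2)
    den = (1.0 - 2.0 * gamma) * (1.0 - gamma)**2 - gamma**2
    return num / den

g = np.linspace(1e-4, 0.30, 1000)                  # a subinterval where the denominator > 0
den = (1.0 - 2.0 * g) * (1.0 - g)**2 - g**2
assert np.all(den > 0)                             # h is well defined on this interval
assert np.all(np.diff(h(g)) > 0)                   # h is monotonically increasing here
print(h(np.array([0.05, 0.10, 0.20])))
```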

1.2 Key bounds for proving Theorem 2

The following lemma shows that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) and \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) can both be bounded by \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) when \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) is sufficiently small.

Lemma 5

Suppose that \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \), where \(\beta \in (0, 0.5)\) is chosen by Algorithm 1. Then, we have

$$\begin{aligned} {\bar{\lambda }}_{k+1} \le \frac{\eta _k}{1-{\bar{\lambda }}_k} + \frac{{\bar{\lambda }}_k^2}{(1 - {\bar{\lambda }}_k)^2(1-2{\bar{\lambda }}_k)}. \end{aligned}$$
(35)

In addition, we can also bound \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) as follows:

$$\begin{aligned} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k}. \end{aligned}$$
(36)

Proof

Since the full step \(\alpha _k = 1\) is always taken in this regime, we have \({\mathbf {x}}^{k+1} = {\mathbf {z}}^{k}\). Therefore, \(\Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\), which leads to

$$\begin{aligned} \begin{array}{lcl} {\bar{\lambda }}_{k+1} &{}= &{} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^{\star }} + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \\ &{}\overset{(23)}{\le } &{} \frac{\Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} \\ &{} \le &{} \frac{\eta _k}{1 - {\bar{\lambda }}_k} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - {\bar{\lambda }}_k}. \end{array} \end{aligned}$$
(37)

Now, we bound \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\) as follows. Firstly, the optimality conditions of (4) and (1) can be written as

$$\begin{aligned} \left\{ \begin{array}{ll} \langle \nabla f({\mathbf {x}}^k) + \nabla ^2f({\mathbf {x}}^k)(T({\mathbf {x}}^k) - {\mathbf {x}}^k), {\mathbf {x}}- T({\mathbf {x}}^k)\rangle \ge 0, ~~\forall {\mathbf {x}}\in {\mathcal {X}}, \\ \langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}- {\mathbf {x}}^{\star }\rangle \ge 0,~~\forall {\mathbf {x}}\in {\mathcal {X}}. \\ \end{array} \right. \end{aligned}$$

These conditions can equivalently be rewritten as

$$\begin{aligned} \left\{ \begin{array}{ll} \langle \nabla ^2f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - ({\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k))], {\mathbf {x}}- T({\mathbf {x}}^k)\rangle \ge 0, \quad \forall {\mathbf {x}}\in {\mathcal {X}}, \\ \langle \nabla ^2f({\mathbf {x}}^k)[{\mathbf {x}}^{\star } - ({\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star }))], {\mathbf {x}}- {\mathbf {x}}^{\star }\rangle \ge 0, \quad \forall {\mathbf {x}}\in {\mathcal {X}}. \end{array} \right. \end{aligned}$$
(38)

Similar to the proof of [2, Theorem 3.14], we can show that (38) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{lcl} T({\mathbf {x}}^k) &{}= &{} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k)\right) , \\ {\mathbf {x}}^{\star } &{}= &{} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star })\right) . \\ \end{array} \right. \end{aligned}$$
(39)

Using the nonexpansiveness of the projection operator [2, Chapter 4], we can derive

$$\begin{aligned} \begin{array}{lcl} \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} &{}\overset{(39)}{=} &{} \Big \Vert \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k)\right) \\ &{}&{} \quad - {~} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star })\right) \Big \Vert _{{\mathbf {x}}^k} \\ &{}\le &{} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}(\nabla f({\mathbf {x}}^k) - \nabla f({\mathbf {x}}^{\star }))\Vert _{{\mathbf {x}}^k} \\ &{} = &{} \Vert \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^k) - \nabla ^2f({\mathbf {x}}^k)({\mathbf {x}}^{\star } - {\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^{*} \\ &{} \overset{(24)}{\le } &{} \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}} \\ &{} \overset{(22)}{\le } &{} \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}^2}{(1 - 2\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }})(1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }})} \\ &{} = &{} \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)}. \end{array} \end{aligned}$$
(40)

We make the following two remarks about (40):

  • In the second inequality of (40), the factor \(1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}}\) in the denominator is justified by \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}} < 1\), which follows directly from (22) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} \le \beta < 0.5\) stated at the beginning of this lemma.

  • For the last inequality of (40), we first have \(0< \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}} < 1\) by (22) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} < 0.5\). Since \(\frac{t^2}{1 - t}\) is increasing for \(t \in (0,1)\), we can replace \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) by \(\frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}\) to get the last inequality of (40).

Plugging (40) into (37), we get (35).

Finally, we note that

$$\begin{aligned} \begin{array}{lcl} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} &{}\le &{} \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^{\star } - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\\ &{}\overset{(40)}{\le } &{} \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\\ &{}\overset{(22)}{\le } &{} \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} \\ &{}= &{} \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k}, \end{array} \end{aligned}$$

which proves (36). \(\square \)
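The key tool in the proof above is the nonexpansiveness of the weighted projection \(\mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\) in the norm \(\Vert \cdot \Vert _{{\mathbf {x}}^k}\) [2, Chapter 4], used in the first inequality of (40). The following sketch illustrates this property in a toy setting of our own choosing (a diagonal metric and a box constraint, so the weighted projection reduces to a componentwise clip); it is not the setting of the paper, only a sanity check of the projection property.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
H = np.diag(rng.uniform(0.5, 3.0, size=p))     # stand-in for nabla^2 f(x^k): diagonal, PD
lo, hi = -np.ones(p), np.ones(p)               # X = [-1, 1]^p, a simple compact convex set

def proj_H(z):
    # With a diagonal metric H, the H-weighted projection onto a box is the
    # componentwise clip, since (x - z)^T H (x - z) separates across coordinates.
    return np.clip(z, lo, hi)

def norm_H(v):
    return float(np.sqrt(v @ H @ v))

# Nonexpansiveness of proj_X^H in ||.||_H, the property invoked in (40).
for _ in range(1000):
    z1 = 2.0 * rng.normal(size=p)
    z2 = 2.0 * rng.normal(size=p)
    assert norm_H(proj_H(z1) - proj_H(z2)) <= norm_H(z1 - z2) + 1e-10
print("proj_X^H is nonexpansive in ||.||_H on all sampled pairs")
```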

1.3 An intermediate lemma for proving Theorem 3

The following lemma establishes a sublinear convergence rate for the Frank–Wolfe gap within each outer iteration.

Lemma 6

At the k-th outer iteration of Algorithm 1, if we run the Frank–Wolfe subroutine (7) to update \({\mathbf {u}}^t\), then, after \(T_k\) iterations, we have

$$\begin{aligned} \min _{t = 1,\cdots , T_k} V_k({\mathbf {u}}^t) \le \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{T_k + 1}, \end{aligned}$$
(41)

where \(V_k({\mathbf {u}}^t) := \max _{{\mathbf {u}}\in {\mathcal {X}}}\left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2f({\mathbf {x}}^k)({\mathbf {u}}^t-{\mathbf {x}}^k), {\mathbf {u}}^t - {\mathbf {u}}\right\rangle \). As a result, the number of LMO calls at the k-th outer iteration of Algorithm 1 is at most \(O_k := \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{\eta _k^2}\).

Proof

Let \(\phi _k({\mathbf {u}}) = \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle + 1/2\left\langle \nabla ^2 f({\mathbf {x}}^k)({\mathbf {u}}- {\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle \) and \(\left\{ {\mathbf {u}}^t\right\} \) be generated by the Frank–Wolfe subroutine (7). Then, it is well-known that (see [26, Theorem 1]):

$$\begin{aligned} \phi _k({\mathbf {u}}^t) - \phi _k^{\star } \le \frac{2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{t + 1}. \end{aligned}$$
(42)

Let \({\mathbf {v}}^t := \arg \min _{{\mathbf {u}}\in {\mathcal {X}}}\{\left\langle \nabla \phi _k({\mathbf {u}}^t), {\mathbf {u}}\right\rangle \}\). Notice that

$$\begin{aligned} \begin{array}{lcl} \phi _k({\mathbf {u}}^{t+1}) &{}= &{} \min _{\tau \in [0,1]}\{\phi _k((1-\tau ){\mathbf {u}}^t + \tau {\mathbf {v}}^t)\} \le \phi _k\left( \left( 1 - \frac{2}{t+1}\right) {\mathbf {u}}^t + \frac{2}{t+1}{\mathbf {v}}^t\right) \\ &{}\le &{} \phi _k({\mathbf {u}}^t) + \frac{2}{t+1}\left\langle \nabla \phi _k({\mathbf {u}}^t), ({\mathbf {v}}^t - {\mathbf {u}}^t)\right\rangle + \frac{\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{2}\left( \frac{2}{t+1}\right) ^2\Vert {\mathbf {v}}^t - {\mathbf {u}}^t\Vert ^2 \\ &{}\le &{} \phi _k({\mathbf {u}}^t) - \frac{2}{t+1}V_k({\mathbf {u}}^t) + \frac{2\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{(t+1)^2}D_{{\mathcal {X}}}^2. \end{array} \end{aligned}$$

This is equivalent to

$$\begin{aligned} tV_k({\mathbf {u}}^t) \le \frac{t(t+1)}{2}\left( \phi _k({\mathbf {u}}^t) - \phi _k({\mathbf {u}}^{t+1})\right) + \frac{t\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{t+1}D_{{\mathcal {X}}}^2. \end{aligned}$$
(43)

Summing up this inequality from \(t = 1\) to \(T_k\), we get

$$\begin{aligned} \begin{array}{lcl} \displaystyle \frac{T_k(T_k+1)}{2}\min _{t = 1,\cdots , T_k}\left\{ V_k({\mathbf {u}}^t)\right\} &{}\le &{} \displaystyle \sum _{t = 1}^{T_k} tV_k({\mathbf {u}}^t) \\ &{}\overset{(43)}{\le } &{} \displaystyle \sum _{t = 1}^{T_k} t\phi _k({\mathbf {u}}^t)\\ &{}&{} - \frac{T_k(T_k + 1)}{2}\phi _k({\mathbf {u}}^{T_k+1}) + T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2 \\ &{}\le &{} \displaystyle \sum _{t = 1}^{T_k} t(\phi _k({\mathbf {u}}^t) - \phi _k^{\star }) + T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2 \\ &{} \overset{(42)}{\le } &{} 3T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2, \end{array} \end{aligned}$$

which implies (41). \(\square \)
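For concreteness, here is a hedged Python sketch of the inner Frank–Wolfe loop analyzed in Lemma 6, run on the quadratic model \(\phi _k\) over the unit simplex (our stand-in for \({\mathcal {X}}\)). It uses the classical step size \(\tau _t = 2/(t+1)\) instead of the exact line search in (7), and stops once the Frank–Wolfe gap \(V_k({\mathbf {u}}^t)\) drops below \(\eta ^2\); all names and the toy data are ours.

```python
import numpy as np

def fw_subroutine(g, H, x, eta, max_iter=100_000):
    """Hedged sketch of the inner Frank-Wolfe loop analyzed in Lemma 6.

    Minimizes phi_k(u) = <g, u - x> + 0.5 (u - x)^T H (u - x) over the unit
    simplex (our stand-in for X).  The LMO returns the vertex e_i with
    i = argmin_i of the model gradient.  Stops once the FW gap V_k(u^t) <= eta^2,
    so the returned point plays the role of z^k, an eta-solution of (4).
    """
    p = len(g)
    u = np.zeros(p)
    u[0] = 1.0                                     # start from any vertex of the simplex
    for t in range(1, max_iter + 1):
        grad_phi = g + H @ (u - x)                 # gradient of the quadratic model at u^t
        i = int(np.argmin(grad_phi))               # one LMO call over the simplex
        v = np.zeros(p)
        v[i] = 1.0
        gap = float(grad_phi @ (u - v))            # V_k(u^t) = <grad phi_k(u^t), u^t - v^t>
        if gap <= eta**2:
            return u, t
        u = u + (2.0 / (t + 1)) * (v - u)          # step 2/(t+1); the iterate stays feasible
    return u, max_iter

# Toy instance (all data is ours, for illustration only).
rng = np.random.default_rng(2)
p = 50
A = rng.normal(size=(p, p))
H = A @ A.T / p + np.eye(p)                        # positive definite "Hessian"
x = np.full(p, 1.0 / p)                            # current iterate, inside the simplex
g = rng.normal(size=p)
z, n_lmo = fw_subroutine(g, H, x, eta=0.1)
print(n_lmo, "LMO calls")
```

By (41), this loop is guaranteed to terminate after at most roughly \(6\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2/\eta ^2\) LMO calls; in practice it typically stops much earlier.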

1.4 An intermediate lemma for proving Theorem 4

The following lemma shows that \(f({\mathbf {x}}^{k+1}) - f^{\star }\) can be bounded in terms of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Therefore, from the convergence rates of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) in Theorem 2, we can obtain a convergence rate for \(\left\{ f({\mathbf {x}}^k) - f^{\star }\right\} \).

Lemma 7

Let \(\gamma _k := \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Suppose that \({\mathbf {x}}^0\in \mathrm {dom}(f)\cap {\mathcal {X}}\). If \(0< \gamma _k,{\bar{\lambda }}_k, {\bar{\lambda }}_{k+1} < 1\), then we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k} + \eta _k^2 + \omega _{*}({\bar{\lambda }}_{k+1}), \end{aligned}$$
(44)

where \(\omega _{*}(\tau ) := -\tau - \log (1-\tau )\).

Proof

Firstly, from [32, Theorem 4.1.8], we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \left\langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}), \end{aligned}$$

provided that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} < 1\). Next, using \(\left\langle \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \le 0\), we can further derive

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}). \end{aligned}$$
(45)

Now, we bound \(\left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \) as follows. We first notice that this term can be decomposed as

$$\begin{aligned} \begin{array}{lcl} \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle &{}= &{} \underbrace{\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle }_{{\mathcal {T}}_1}\\ &{}&{}+ {~} \underbrace{\langle \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle }_{{\mathcal {T}}_2}. \\ \end{array} \end{aligned}$$

Since \({\mathbf {x}}^{k+1}\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\), we have

$$\begin{aligned} {\mathcal {T}}_1 = \langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle \le \eta _k^2. \end{aligned}$$
(46)

Using the Cauchy-Schwarz inequality and the triangle inequality, \({\mathcal {T}}_2\) can also be bounded as

$$\begin{aligned} \begin{array}{lcl} {\mathcal {T}}_2 &{}= &{} \left\langle \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \\ &{}\le &{} \Vert \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^*\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\overset{(24)}{\le } &{} \frac{\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\le &{} \frac{\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\left[ \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\right] \\ &{} = &{} \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k}. \end{array} \end{aligned}$$
(47)

Finally, we can bound \(f({\mathbf {x}}^{k+1}) - f^{\star }\) as

$$\begin{aligned} \begin{array}{lcl} f({\mathbf {x}}^{k+1}) - f({\mathbf {x}}^{\star }) &{} \overset{(45)}{\le } &{} \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}({\bar{\lambda }}_{k+1}) \\ &{} = &{} {\mathcal {T}}_1 + {\mathcal {T}}_2 + \omega _{*}({\bar{\lambda }}_{k+1}) \\ &{} \overset{(46)(47)}{\le } &{} \eta _k^2 + \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k} + \omega _{*}({\bar{\lambda }}_{k+1}), \end{array} \end{aligned}$$

which proves (44). \(\square \)
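As a final illustration, the small helper below evaluates the right-hand side of (44) from given values of \(\gamma _k\), \({\bar{\lambda }}_k\), \({\bar{\lambda }}_{k+1}\), and \(\eta _k\); it is purely illustrative and only mirrors the arithmetic of Lemma 7.

```python
import math

def suboptimality_bound(gamma_k, lam_k, lam_k1, eta_k):
    """Right-hand side of (44): an upper bound on f(x^{k+1}) - f(x*).
    All arguments must lie in (0, 1); purely illustrative."""
    omega_star = lambda t: -t - math.log(1.0 - t)  # omega_*(tau) = -tau - log(1 - tau)
    return (gamma_k**2 * (gamma_k + lam_k) / (1.0 - gamma_k)
            + eta_k**2 + omega_star(lam_k1))

print(suboptimality_bound(gamma_k=0.1, lam_k=0.2, lam_k1=0.15, eta_k=0.05))
```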


Cite this article

Liu, D., Cevher, V. & Tran-Dinh, Q. A Newton Frank–Wolfe method for constrained self-concordant minimization. J Glob Optim 83, 273–299 (2022). https://doi.org/10.1007/s10898-021-01105-z

