
A Newton Frank–Wolfe method for constrained self-concordant minimization

Journal of Global Optimization

Abstract

We develop a new Newton Frank–Wolfe algorithm to solve a class of constrained self-concordant minimization problems using linear minimization oracles (LMOs). Unlike L-smooth convex functions, whose gradients are globally Lipschitz continuous, self-concordant functions only admit local bounds, making it difficult to estimate the number of LMO calls required by the underlying optimization algorithm. Fortunately, we can still prove that the number of LMO calls of our method is nearly the same as that of the standard Frank–Wolfe method in the L-smooth case. Specifically, our method requires at most \({\mathcal {O}}\big (\varepsilon ^{-(1 + \nu )}\big )\) LMO calls, where \(\varepsilon \) is the desired accuracy and \(\nu \in (0, 0.139)\) is a given constant depending on the chosen initial point of the proposed algorithm. Our numerical experiments on three applications, portfolio design with the competitive ratio, D-optimal experimental design, and logistic regression with an elastic-net regularizer, show that the proposed Newton Frank–Wolfe method outperforms several state-of-the-art competitors.


Notes

  1. The smoothness of f is only defined on \(\mathrm {dom}(f)\), an open set.

  2. The differentiability of \(\varphi \) is only defined on \(\mathrm {dom}(\varphi )\), an open set.

  3. Notice that Theorem 2 is proven under the assumption that \({\mathbf {x}}^0\) is sufficiently close to \({\mathbf {x}}^{\star }\) (the optimal solution of (1)) so that the damped step is never invoked.

  4. One can see from the proof leading to [43, equation (72)] that the relation holds more generally when \({\mathbf {z}}_+\) and \({\mathbf {z}}\) are replaced by any two vectors \({\mathbf {y}}\) and \({\mathbf {x}}\) satisfying \(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{\mathbf {x}}\le 1\).

  5. In fact, by (23), we have \(\nabla ^2 f({\mathbf {y}}) \preceq \frac{\nabla ^2 f({\mathbf {x}})}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\), which is equivalent to \(\nabla ^2 f({\mathbf {x}})^{-1} \preceq \frac{\nabla ^2 f({\mathbf {y}})^{-1}}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\). Therefore, we have \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\).

References

  1. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)

  2. Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer (2017)

  3. Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59(2), 235–247 (2004)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)

  6. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)

  7. Chen, Y., Ye, X.: Projection onto a simplex. Preprint arXiv:1101.6081 (2011)

  8. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  9. Damla, S.A., Sun, P., Todd, M.J.: Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optim. Methods Softw. 23(1), 5–19 (2008)

  10. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)

  11. de Oliveira, F.R., Ferreira, O.P., Silva, G.N.: Newton’s method with feasible inexact projections for solving constrained generalized equations. Comput. Optim. Appl. 72(1), 159–177 (2019)

  12. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 272–279. ACM, New York (2008)

  13. Dvurechensky, P., Ostroukhov, P., Safin, K., Shtern, S., Staudigl, M.: Self-concordant analysis of Frank–Wolfe algorithms. In: International Conference on Machine Learning (ICML), pp. 2814–2824. PMLR (2020)

  14. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)

  15. Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. Preprint arXiv:1301.4666 (2013)

  16. Garber, D., Hazan, E.: Faster rates for the Frank–Wolfe method over strongly-convex sets. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 541–549 (2015)

  17. Gonçalves, M.L.N., Melo, J.G.: A Newton conditional gradient method for constrained nonlinear systems. J. Comput. Appl. Math. 311, 473–483 (2017)

  18. Gonçalves, D.S., Gonçalves, M.L.N., Menezes, T.C.: Inexact variable metric method for convex-constrained optimization problems. Optimization (online first), 1–19 (2021)

  19. Gonçalves, D.S., Gonçalves, M.L.N., Oliveira, F.R.: Levenberg–Marquardt methods with inexact projections for constrained nonlinear systems. Preprint arXiv:1908.06118 (2019)

  20. Gonçalves, M.L.N., Oliveira, F.R.: On the global convergence of an inexact quasi-Newton conditional gradient method for constrained nonlinear systems. Numer. Algorithms 84(2), 606–631 (2020)

  21. Gross, D., Liu, Y.-K., Flammia, S., Becker, S., Eisert, J.: Quantum state tomography via compressed sensing. Phys. Rev. Lett. 105(15), 150401 (2010)

  22. Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. 35(1), 110–119 (1986)

  23. Harman, R., Trnovská, M.: Approximate D-optimal designs of experiments on the convex hull of a finite set of information matrices. Math. Slov. 59(6), 693–704 (2009)

  24. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009)

  25. Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American Symposium on Theoretical Informatics, pp. 306–316. Springer (2008)

  26. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)

  27. Khachiyan, L.G.: Rounding of polytopes in the real number model of computation. Math. Oper. Res. 21(2), 307–320 (1996)

  28. Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank–Wolfe optimization variants. In: Advances in Neural Information Processing Systems (NIPS), pp. 496–504 (2015)

  29. Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. SIAM J. Optim. 26(2), 1379–1409 (2016)

  30. Lan, G., Ouyang, Y.: Accelerated gradient sliding for structured convex optimization. Preprint arXiv:1609.04905 (2016)

  31. Lu, Z., Pong, T.K.: Computing optimal experimental designs via interior point method. SIAM J. Matrix Anal. Appl. 34(4), 1556–1580 (2013)

  32. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers (2004)

  33. Nesterov, Y., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM (1994)

  34. Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank–Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)

  35. Ostrovskii, D.M., Bach, F.: Finite-sample analysis of M-estimators using self-concordance. Electron. J. Stat. 15(1), 326–391 (2021)

  36. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  37. Ryu, E.K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)

  38. Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)

  39. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)

  40. Sun, T., Tran-Dinh, Q.: Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2019)

  41. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)

  42. Tran-Dinh, Q., Ling, L., Toh, K.-C.: A new homotopy proximal variable-metric framework for composite convex minimization. Math. Oper. Res. (online first), 1–28 (2021)

  43. Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Math. Program. 177(1–2), 173–223 (2019)

  44. Yurtsever, A., Fercoq, O., Cevher, V.: A conditional-gradient-based augmented Lagrangian framework. In: International Conference on Machine Learning (ICML), pp. 7272–7281 (2019)

  45. Yurtsever, A., Tran-Dinh, Q., Cevher, V.: A universal primal-dual convex optimization framework. In: Advances in Neural Information Processing Systems (NIPS), pp. 1–9 (2015)


Acknowledgements

Q. Tran-Dinh was partly supported by the National Science Foundation (NSF), Grant No. DMS-1619884, and the Office of Naval Research (ONR), Grant No. N00014-20-1-2088. V. Cevher was partly supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 725594 - time-data) and by a 2019 Google Faculty Research Award.

Author information


Corresponding author

Correspondence to Quoc Tran-Dinh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs of technical results

Let us recall the following key properties of standard self-concordant functions. Let f be standard self-concordant and \({\mathbf {x}},{\mathbf {y}}\in \mathrm {dom}(f)\) such that \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\). Then

$$\begin{aligned} \begin{array}{lclclcll} \big (\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}\big )^2:= & {} {\mathbf {u}}^{\top }\nabla ^2f({\mathbf {y}}){\mathbf {u}}\le & {} {\mathbf {u}}^{\top }\frac{\nabla ^2f({\mathbf {x}})}{(1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})^2}{\mathbf {u}}= & {} \left( \frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}}}\right) ^2,&\quad \forall {\mathbf {u}}\in {\mathbb {R}}^p. \end{array} \end{aligned}$$
(22)

Similarly, if \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {y}}} < 1\), then

$$\begin{aligned} \begin{array}{lclclcll} \big (\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}\big )^2:= & {} {\mathbf {u}}^{\top }\nabla ^2f({\mathbf {y}}){\mathbf {u}}\le & {} {\mathbf {u}}^{\top }\frac{\nabla ^2f({\mathbf {x}})}{(1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}{\mathbf {u}}= & {} \left( \frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}}\right) ^2,&\quad \forall {\mathbf {u}}\in {\mathbb {R}}^p. \end{array} \end{aligned}$$
(23)

These inequalities can be found in [32, Theorem 4.1.6]. In addition, from [43, equation (72)], we have

$$\begin{aligned} \Vert \nabla f({\mathbf {y}}) - \nabla f({\mathbf {x}}) - \nabla ^2 f({\mathbf {x}})({\mathbf {y}}- {\mathbf {x}})\Vert _{{\mathbf {x}}}^*\le \frac{\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}}^2}{1 - \Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}}}, \end{aligned}$$
(24)

provided that \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\) (see Footnote 4). These inequalities will be used repeatedly in the proofs below.
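As a quick illustration of how (22) and (24) behave, the following Python snippet checks both inequalities numerically for the scalar self-concordant function \(f(x) = -\log (x)\) on \((0,\infty )\) (our own toy example, not used elsewhere in the paper), where \(\Vert u\Vert _x = |u|\sqrt{f''(x)}\) and \(\Vert v\Vert _x^{*} = |v|/\sqrt{f''(x)}\).

```python
import numpy as np

# Numerical sanity check of (22) and (24) for the scalar self-concordant
# function f(x) = -log(x), dom(f) = (0, inf), where ||u||_x = |u| sqrt(f''(x))
# and the dual norm is ||v||_x^* = |v| / sqrt(f''(x)).
rng = np.random.default_rng(0)

def grad(x):
    return -1.0 / x

def hess(x):
    return 1.0 / x**2

for _ in range(10_000):
    x, y = rng.uniform(0.1, 5.0, size=2)
    r = abs(y - x) * np.sqrt(hess(x))              # r = ||y - x||_x
    if r >= 1.0:
        continue                                   # (22) and (24) require r < 1
    u = rng.normal()
    # (22): ||u||_y <= ||u||_x / (1 - r)
    assert abs(u) * np.sqrt(hess(y)) <= abs(u) * np.sqrt(hess(x)) / (1.0 - r) + 1e-12
    # (24): ||grad f(y) - grad f(x) - hess f(x)(y - x)||_x^* <= r^2 / (1 - r)
    lhs = abs(grad(y) - grad(x) - hess(x) * (y - x)) / np.sqrt(hess(x))
    assert lhs <= r**2 / (1.0 - r) + 1e-12
print("(22) and (24) hold on all sampled pairs")
```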

1.1 Two key lemmas for proving Theorem 1

We need the following two lemmas to prove Theorem 1. The first lemma quantifies the decrease of the objective value produced by a damped-step iteration.

Lemma 3

Let \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) be the local distance between \({\mathbf {z}}^k\) and \({\mathbf {x}}^k\), where \({\mathbf {z}}^k\) is the output of Algorithm 2 at \({\mathbf {x}}^k\) with \(\eta = \eta _k^2\). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If we choose \(\alpha \in (0, 1)\) such that \(\alpha \gamma _k < 1\) and update \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + \alpha ({\mathbf {z}}^k - {\mathbf {x}}^k)\), then we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^k) - \left[ \alpha ( \gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)\right] . \end{aligned}$$
(25)

Assume \(\gamma _k > \eta _k\). If \(\delta \in (0,1)\) and the step size is \(\alpha _k := \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\) then we have \(\alpha _k\gamma _k< \delta < 1\). Moreover, it also holds that

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^k) - \delta \omega \Big (\frac{\gamma _k^2 - \eta _k^2}{\gamma _k} \Big ), \end{aligned}$$
(26)

where \(\omega (\tau ) := \tau - \log (1 + \tau )\) and \(\omega _{*}(\tau ) := -\tau - \log (1 - \tau )\) are two nonnegative and convex functions.

Proof

From (7) and the stopping criterion of Algorithm 2, \({\mathbf {z}}^k\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\). In particular, \({\mathbf {z}}^k\) satisfies

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {z}}^k -{\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\rangle \le \eta _k^2. \end{aligned}$$

This inequality leads to

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\rangle \le \eta _k^2 - \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2. \end{aligned}$$
(27)

Therefore, using the self-concordance of f [32, Theorem 4.1.8], we can derive

$$\begin{aligned} \begin{array}{lcl} f({\mathbf {x}}^{k+1}) &{}\le &{} f({\mathbf {x}}^k) + \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} = &{} f({\mathbf {x}}^k) + \alpha \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {z}}^k - {\mathbf {x}}^k\right\rangle + \omega _{*}(\alpha \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} \overset{(27)}{\le } &{} f({\mathbf {x}}^k) + \alpha \left( \eta _k^2 - \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2\right) + \omega _{*}(\alpha \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}) \\ &{} = &{} f({\mathbf {x}}^k) - [\alpha (\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)]. \end{array} \end{aligned}$$
(28)

This is exactly (25).

Assume that \(\gamma _k^2 > \eta _k^2\) and define \(\psi (\alpha ) := \alpha (\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)\). Plugging \(\alpha _k = \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\) into \(\psi (\alpha )\), we arrive at

$$\begin{aligned} \begin{array}{lcl} \psi (\alpha _k) &{}= &{} \alpha _k(\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha _k\gamma _k) \\ &{} = &{} \alpha _k(\gamma _k^2 - \eta _k^2 + \gamma _k) + \log (1 - \alpha _k\gamma _k) \\ &{} = &{} \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k} + \log \left( 1 - \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k^2 - \eta _k^2 + \gamma _k}\right) \\ &{} \ge &{} \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k} + \delta \log \left( 1 - \frac{(\gamma _k^2 - \eta _k^2)}{\gamma _k^2 - \eta _k^2 + \gamma _k}\right) \\ &{} = &{} \delta \omega (\frac{\gamma _k^2 - \eta _k^2}{\gamma _k}), \end{array} \end{aligned}$$
(29)

where we use \(\log (1 - \delta s) \ge \delta \log (1 - s)\) for \(s \in (0,1)\) to obtain the inequality. Combining (28) and (29) proves (26). \(\square \)
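For intuition, the following Python sketch evaluates the damped step size \(\alpha _k\) of Lemma 3 and the guaranteed decrease in (26) from given values of \(\gamma _k\), \(\eta _k\), and \(\delta \). It is purely illustrative; the actual iterate update \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + \alpha _k({\mathbf {z}}^k - {\mathbf {x}}^k)\) of course requires the output \({\mathbf {z}}^k\) of Algorithm 2.

```python
import math

def damped_step(gamma_k: float, eta_k: float, delta: float = 0.99):
    """Damped step size from Lemma 3 and the guaranteed decrease in (26).

    gamma_k = ||z^k - x^k||_{x^k}, eta_k = inner accuracy (gamma_k > eta_k),
    delta in (0, 1).  Returns (alpha_k, decrease) with
    f(x^{k+1}) <= f(x^k) - decrease for x^{k+1} = x^k + alpha_k (z^k - x^k).
    """
    assert 0.0 <= eta_k < gamma_k and 0.0 < delta < 1.0
    alpha_k = delta * (gamma_k**2 - eta_k**2) / (gamma_k * (gamma_k**2 + gamma_k - eta_k**2))
    omega = lambda t: t - math.log1p(t)            # omega(tau) = tau - log(1 + tau)
    decrease = delta * omega((gamma_k**2 - eta_k**2) / gamma_k)
    assert alpha_k * gamma_k < delta < 1.0         # so the damped iterate stays in dom(f)
    return alpha_k, decrease

print(damped_step(gamma_k=0.8, eta_k=0.1))         # e.g. alpha_k ~ 0.545, decrease ~ 0.204
```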

The following lemma shows that the residual \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) can be bounded by the projected Newton decrement \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\).

Lemma 4

Let \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\), \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), and h be defined by (8). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If \(\gamma _k + \eta _k \in (0, C_2)\), then we have

$$\begin{aligned} {\bar{\lambda }}_k \le h({\bar{\gamma }}_k) \le h(\gamma _k + \eta _k). \end{aligned}$$
(30)

Proof

First, we write down the optimality conditions of (4) and (1), respectively, as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], {\mathbf {x}}- T({\mathbf {x}}^k)\right\rangle \ge 0,~~~\forall {\mathbf {x}}\in {\mathcal {X}}, \\ \left\langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}- {\mathbf {x}}^{\star }\right\rangle \ge 0,~~~\forall {\mathbf {x}}\in {\mathcal {X}}. \end{array} \right. \end{aligned}$$

Substituting \({\mathbf {x}}^{\star }\) for \({\mathbf {x}}\) in the first inequality and \(T({\mathbf {x}}^k)\) for \({\mathbf {x}}\) in the second inequality, we get

$$\begin{aligned} \left\{ \begin{array}{l} \left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], {\mathbf {x}}^{\star } - T({\mathbf {x}}^k)\right\rangle \ge 0, \\ \left\langle \nabla f({\mathbf {x}}^{\star }), T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\right\rangle \ge 0. \end{array} \right. \end{aligned}$$

Adding up both inequalities yields

$$\begin{aligned} \langle \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle \ge 0, \end{aligned}$$

which is equivalent to

$$\begin{aligned}&\langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k], T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle \\&\quad \ge \langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^{\star }) , T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\rangle . \end{aligned}$$

Since f is self-concordant, by [32, Theorem 4.1.7], we have

$$\begin{aligned} \left\langle \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^{\star }) , T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\right\rangle \ge \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}^2}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}. \end{aligned}$$

By the Cauchy-Schwarz inequality, this estimate leads to

$$\begin{aligned} \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}} \le \Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{T({\mathbf {x}}^k)}^*. \end{aligned}$$
(31)

Now, we can bound the right-hand side of the above inequality as

$$\begin{aligned} \begin{array}{lcl} {\mathcal {R}} &{}:= &{} \Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{T({\mathbf {x}}^k)}^* \\ &{} \le &{} \frac{\Vert \nabla f(T({\mathbf {x}}^k)) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - {\mathbf {x}}^k]\Vert _{{\mathbf {x}}^k}^*}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}} \\ &{} \overset{(24)}{\le } &{} \left( \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\right) ^2, \end{array} \end{aligned}$$
(32)

where the first inequality comes from the dual form of (23), i.e., \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\) (see Footnote 5), and the last inequality holds since \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 0.5\).

From (31) and (32), we have

$$\begin{aligned} \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}} \le \left( \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}{1 - \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\right) ^2, \end{aligned}$$

which can be reformulated as

$$\begin{aligned} \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)} \le \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - 2\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}. \end{aligned}$$
(33)

Next, to bound \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\) in terms of \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), we derive

$$\begin{aligned} \begin{array}{lcl} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} &{}\le &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\overset{(23)}{\le } &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{T({\mathbf {x}}^k)}}{1 - \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}} \\ &{}\overset{(33)}{\le } &{} \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^2}{\left( 1 - 2\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}\right) \left( 1 - \Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}\right) } \\ &{}= &{} {\bar{\gamma }}_k + \frac{{\bar{\gamma }}_k^2}{(1 - 2{\bar{\gamma }}_k)(1 - {\bar{\gamma }}_k)}. \end{array} \end{aligned}$$
(34)

Notice that the step using (23) in the above chain is valid because \(\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 1\), where \(C_2\) is the constant defined right after (8). Since h is monotonically increasing and \({\bar{\gamma }}_k \le \gamma _k + \eta _k\), we finally get

$$\begin{aligned} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \overset{(22)}{\le } \frac{\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}} \overset{(34)}{\le } \frac{{\bar{\gamma }}_k(1 -2{\bar{\gamma }}_k + 2{\bar{\gamma }}_k^2)}{(1 - 2{\bar{\gamma }}_k)(1 - {\bar{\gamma }}_k)^2 - {\bar{\gamma }}_k^2} = h({\bar{\gamma }}_k) \le h(\gamma _k + \eta _k), \end{aligned}$$

which proves (30). Notice that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} < 1\), which justifies the use of (22) in the last chain, also follows from (34) and \({\bar{\gamma }}_k \le C_2\). \(\square \)
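The next snippet is a small numerical sketch of the function h appearing in Lemma 4 (the expression obtained at the end of the proof above). We do not reproduce the exact constant \(C_2\) from the main text here; we simply restrict to a subinterval on which the denominator stays positive and check monotonicity, which is the property used to pass from \(h({\bar{\gamma }}_k)\) to \(h(\gamma _k + \eta _k)\) in (30).

```python
import numpy as np

def h(gamma):
    """The function h from Lemma 4: an upper bound on ||x^k - x*||_{x*}
    in terms of gamma (see the last display of the proof above)."""
    num = gamma * (1.0 - 2.0 * gamma + 2.0 * gamma**2)
    den = (1.0 - 2.0 * gamma) * (1.0 - gamma)**2 - gamma**2
    return num / den

g = np.linspace(1e-4, 0.30, 1000)                  # a subinterval where the denominator > 0
den = (1.0 - 2.0 * g) * (1.0 - g)**2 - g**2
assert np.all(den > 0)                             # h is well defined on this interval
assert np.all(np.diff(h(g)) > 0)                   # h is monotonically increasing here
print(h(np.array([0.05, 0.10, 0.20])))
```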

1.2 Key bounds for proving Theorem 2

The following lemma shows that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) and \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) can both be bounded by \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) when \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) is sufficiently small.

Lemma 5

Suppose that \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \), where \(\beta \in (0, 0.5)\) is chosen by Algorithm 1. Then, we have

$$\begin{aligned} {\bar{\lambda }}_{k+1} \le \frac{\eta _k}{1-{\bar{\lambda }}_k} + \frac{{\bar{\lambda }}_k^2}{(1 - {\bar{\lambda }}_k)^2(1-2{\bar{\lambda }}_k)}. \end{aligned}$$
(35)

In addition, we can also bound \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) as follows:

$$\begin{aligned} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k}. \end{aligned}$$
(36)

Proof

Since the full step \(\alpha _k = 1\) is always taken in this regime, we have \({\mathbf {x}}^{k+1} = {\mathbf {z}}^{k}\). Therefore, \(\Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\), which leads to

$$\begin{aligned} \begin{array}{lcl} {\bar{\lambda }}_{k+1} &{}= &{} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^{\star }} + \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \\ &{}\overset{(23)}{\le } &{} \frac{\Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} \\ &{} \le &{} \frac{\eta _k}{1 - {\bar{\lambda }}_k} + \frac{\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}}{1 - {\bar{\lambda }}_k}. \end{array} \end{aligned}$$
(37)

Now, we bound \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\) as follows. Firstly, the optimality conditions of (4) and (1) can be written as

$$\begin{aligned} \left\{ \begin{array}{ll} \langle \nabla f({\mathbf {x}}^k) + \nabla ^2f({\mathbf {x}}^k)(T({\mathbf {x}}^k) - {\mathbf {x}}^k), {\mathbf {x}}- T({\mathbf {x}}^k)\rangle \ge 0, ~~\forall {\mathbf {x}}\in {\mathcal {X}}, \\ \langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}- {\mathbf {x}}^{\star }\rangle \ge 0,~~\forall {\mathbf {x}}\in {\mathcal {X}}. \\ \end{array} \right. \end{aligned}$$

These conditions can equivalently be rewritten as

$$\begin{aligned} \left\{ \begin{array}{ll} \langle \nabla ^2f({\mathbf {x}}^k)[T({\mathbf {x}}^k) - ({\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k))], {\mathbf {x}}- T({\mathbf {x}}^k)\rangle \ge 0, \quad \forall {\mathbf {x}}\in {\mathcal {X}}, \\ \langle \nabla ^2f({\mathbf {x}}^k)[{\mathbf {x}}^{\star } - ({\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star }))], {\mathbf {x}}- {\mathbf {x}}^{\star }\rangle \ge 0, \quad \forall {\mathbf {x}}\in {\mathcal {X}}. \end{array} \right. \end{aligned}$$
(38)

Similar to the proof of [2, Theorem 3.14], we can show that (38) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{lcl} T({\mathbf {x}}^k) &{}= &{} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k)\right) , \\ {\mathbf {x}}^{\star } &{}= &{} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star })\right) . \\ \end{array} \right. \end{aligned}$$
(39)

Using the nonexpansiveness of the projection operator [2, Chapter 4], we can derive

$$\begin{aligned} \begin{array}{lcl} \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} &{}\overset{(39)}{=} &{} \Big \Vert \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^k - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^k)\right) \\ &{}&{} \quad - {~} \mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\left( {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}\nabla f({\mathbf {x}}^{\star })\right) \Big \Vert _{{\mathbf {x}}^k} \\ &{}\le &{} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star } - \nabla ^2f({\mathbf {x}}^k)^{-1}(\nabla f({\mathbf {x}}^k) - \nabla f({\mathbf {x}}^{\star }))\Vert _{{\mathbf {x}}^k} \\ &{} = &{} \Vert \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^k) - \nabla ^2f({\mathbf {x}}^k)({\mathbf {x}}^{\star } - {\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^{*} \\ &{} \overset{(24)}{\le } &{} \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}} \\ &{} \overset{(22)}{\le } &{} \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}^2}{(1 - 2\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }})(1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }})} \\ &{} = &{} \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)}. \end{array} \end{aligned}$$
(40)

We make the following two remarks about (40):

  • In the second inequality of (40), the factor \(1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}}\) in the denominator is justified by \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}} < 1\), which follows directly from (22) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} \le \beta < 0.5\) stated at the beginning of this lemma.

  • For the last inequality of (40), we first have \(0< \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}} < 1\) by (22) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} < 0.5\). Since \(\frac{t^2}{1 - t}\) is increasing for \(t \in (0,1)\), we can replace \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) by \(\frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}\) to get the last inequality of (40).

Plugging (40) into (37), we get (35).

Finally, we note that

$$\begin{aligned} \begin{array}{lcl} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} &{}\le &{} \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^{\star } - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\\ &{}\overset{(40)}{\le } &{} \Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\\ &{}\overset{(22)}{\le } &{} \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}} \\ &{}= &{} \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k}, \end{array} \end{aligned}$$

which proves (36). \(\square \)
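The key tool in the proof above is the nonexpansiveness of the weighted projection \(\mathrm {proj}_{{\mathcal {X}}}^{\nabla ^2f({\mathbf {x}}^k)}\) in the norm \(\Vert \cdot \Vert _{{\mathbf {x}}^k}\) [2, Chapter 4], used in the first inequality of (40). The following sketch illustrates this property in a toy setting of our own choosing (a diagonal metric and a box constraint, so the weighted projection reduces to a componentwise clip); it is not the setting of the paper, only a sanity check of the projection property.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
H = np.diag(rng.uniform(0.5, 3.0, size=p))     # stand-in for nabla^2 f(x^k): diagonal, PD
lo, hi = -np.ones(p), np.ones(p)               # X = [-1, 1]^p, a simple compact convex set

def proj_H(z):
    # With a diagonal metric H, the H-weighted projection onto a box is the
    # componentwise clip, since (x - z)^T H (x - z) separates across coordinates.
    return np.clip(z, lo, hi)

def norm_H(v):
    return float(np.sqrt(v @ H @ v))

# Nonexpansiveness of proj_X^H in ||.||_H, the property invoked in (40).
for _ in range(1000):
    z1 = 2.0 * rng.normal(size=p)
    z2 = 2.0 * rng.normal(size=p)
    assert norm_H(proj_H(z1) - proj_H(z2)) <= norm_H(z1 - z2) + 1e-10
print("proj_X^H is nonexpansive in ||.||_H on all sampled pairs")
```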

1.3 An intermediate lemma for proving Theorem 3

The following lemma establishes a sublinear convergence rate for the Frank–Wolfe gap within each outer iteration.

Lemma 6

At the k-th outer iteration of Algorithm 1, if we run the Frank–Wolfe subroutine (7) to update \({\mathbf {u}}^t\), then, after \(T_k\) iterations, we have

$$\begin{aligned} \min _{t = 1,\cdots , T_k} V_k({\mathbf {u}}^t) \le \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{T_k + 1}, \end{aligned}$$
(41)

where \(V_k({\mathbf {u}}^t) := \max _{{\mathbf {u}}\in {\mathcal {X}}}\left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2f({\mathbf {x}}^k)({\mathbf {u}}^t-{\mathbf {x}}^k), {\mathbf {u}}^t - {\mathbf {u}}\right\rangle \). As a result, the number of LMO calls at the k-th outer iteration of Algorithm 1 is at most \(O_k := \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{\eta _k^2}\).

Proof

Let \(\phi _k({\mathbf {u}}) = \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle + 1/2\left\langle \nabla ^2 f({\mathbf {x}}^k)({\mathbf {u}}- {\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle \) and \(\left\{ {\mathbf {u}}^t\right\} \) be generated by the Frank–Wolfe subroutine (7). Then, it is well-known that (see [26, Theorem 1]):

$$\begin{aligned} \phi _k({\mathbf {u}}^t) - \phi _k^{\star } \le \frac{2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{t + 1}. \end{aligned}$$
(42)

Let \({\mathbf {v}}^t := \arg \min _{{\mathbf {u}}\in {\mathcal {X}}}\{\left\langle \nabla \phi _k({\mathbf {u}}^t), {\mathbf {u}}\right\rangle \}\). Notice that

$$\begin{aligned} \begin{array}{lcl} \phi _k({\mathbf {u}}^{t+1}) &{}= &{} \min _{\tau \in [0,1]}\{\phi _k((1-\tau ){\mathbf {u}}^t + \tau {\mathbf {v}}^t)\} \le \phi _k\left( \left( 1 - \frac{2}{t+1}\right) {\mathbf {u}}^t + \frac{2}{t+1}{\mathbf {v}}^t\right) \\ &{}\le &{} \phi _k({\mathbf {u}}^t) + \frac{2}{t+1}\left\langle \nabla \phi _k({\mathbf {u}}^t), ({\mathbf {v}}^t - {\mathbf {u}}^t)\right\rangle + \frac{\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{2}\left( \frac{2}{t+1}\right) ^2\Vert {\mathbf {v}}^t - {\mathbf {u}}^t\Vert ^2 \\ &{}\le &{} \phi _k({\mathbf {u}}^t) - \frac{2}{t+1}V_k({\mathbf {u}}^t) + \frac{2\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{(t+1)^2}D_{{\mathcal {X}}}^2. \end{array} \end{aligned}$$

This is equivalent to

$$\begin{aligned} tV_k({\mathbf {u}}^t) \le \frac{t(t+1)}{2}\left( \phi _k({\mathbf {u}}^t) - \phi _k({\mathbf {u}}^{t+1})\right) + \frac{t\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))}{t+1}D_{{\mathcal {X}}}^2. \end{aligned}$$
(43)

Summing up this inequality from \(t = 1\) to \(T_k\), we get

$$\begin{aligned} \begin{array}{lcl} \displaystyle \frac{T_k(T_k+1)}{2}\min _{t = 1,\cdots , T_k}\left\{ V_k({\mathbf {u}}^t)\right\} &{}\le &{} \displaystyle \sum _{t = 1}^{T_k} tV_k({\mathbf {u}}^t) \\ &{}\overset{(43)}{\le } &{} \displaystyle \sum _{t = 1}^{T_k} t\phi _k({\mathbf {u}}^t)\\ &{}&{} - \frac{T_k(T_k + 1)}{2}\phi _k({\mathbf {u}}^{T_k+1}) + T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2 \\ &{}\le &{} \displaystyle \sum _{t = 1}^{T_k} t(\phi _k({\mathbf {u}}^t) - \phi _k^{\star }) + T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2 \\ &{} \overset{(42)}{\le } &{} 3T_k \lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2, \end{array} \end{aligned}$$

which implies (41). \(\square \)
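For concreteness, here is a hedged Python sketch of the inner Frank–Wolfe loop analyzed in Lemma 6, run on the quadratic model \(\phi _k\) over the unit simplex (our stand-in for \({\mathcal {X}}\)). It uses the classical step size \(\tau _t = 2/(t+1)\) instead of the exact line search in (7), and stops once the Frank–Wolfe gap \(V_k({\mathbf {u}}^t)\) drops below \(\eta ^2\); all names and the toy data are ours.

```python
import numpy as np

def fw_subroutine(g, H, x, eta, max_iter=100_000):
    """Hedged sketch of the inner Frank-Wolfe loop analyzed in Lemma 6.

    Minimizes phi_k(u) = <g, u - x> + 0.5 (u - x)^T H (u - x) over the unit
    simplex (our stand-in for X).  The LMO returns the vertex e_i with
    i = argmin_i of the model gradient.  Stops once the FW gap V_k(u^t) <= eta^2,
    so the returned point plays the role of z^k, an eta-solution of (4).
    """
    p = len(g)
    u = np.zeros(p)
    u[0] = 1.0                                     # start from any vertex of the simplex
    for t in range(1, max_iter + 1):
        grad_phi = g + H @ (u - x)                 # gradient of the quadratic model at u^t
        i = int(np.argmin(grad_phi))               # one LMO call over the simplex
        v = np.zeros(p)
        v[i] = 1.0
        gap = float(grad_phi @ (u - v))            # V_k(u^t) = <grad phi_k(u^t), u^t - v^t>
        if gap <= eta**2:
            return u, t
        u = u + (2.0 / (t + 1)) * (v - u)          # step 2/(t+1); the iterate stays feasible
    return u, max_iter

# Toy instance (all data is ours, for illustration only).
rng = np.random.default_rng(2)
p = 50
A = rng.normal(size=(p, p))
H = A @ A.T / p + np.eye(p)                        # positive definite "Hessian"
x = np.full(p, 1.0 / p)                            # current iterate, inside the simplex
g = rng.normal(size=p)
z, n_lmo = fw_subroutine(g, H, x, eta=0.1)
print(n_lmo, "LMO calls")
```

By (41), this loop is guaranteed to terminate after at most roughly \(6\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2/\eta ^2\) LMO calls; in practice it typically stops much earlier.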

1.4 An intermediate lemma for proving Theorem 4

The following lemma shows that \(f({\mathbf {x}}^{k+1}) - f^{\star }\) can be bounded in terms of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Therefore, from the convergence rates of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) in Theorem 2, we can obtain a convergence rate for \(\left\{ f({\mathbf {x}}^k) - f^{\star }\right\} \).

Lemma 7

Let \(\gamma _k := \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Suppose that \({\mathbf {x}}^0\in \mathrm {dom}(f)\cap {\mathcal {X}}\). If \(0< \gamma _k,{\bar{\lambda }}_k, {\bar{\lambda }}_{k+1} < 1\), then we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k} + \eta _k^2 + \omega _{*}({\bar{\lambda }}_{k+1}), \end{aligned}$$
(44)

where \(\omega _{*}(\tau ) := -\tau - \log (1-\tau )\).

Proof

Firstly, from [32, Theorem 4.1.8], we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \left\langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}), \end{aligned}$$

provided that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} < 1\). Next, using \(\left\langle \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \le 0\), we can further derive

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \le f({\mathbf {x}}^{\star }) + \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}). \end{aligned}$$
(45)

Now, we bound \(\left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \) as follows. We first notice that this term can be decomposed as

$$\begin{aligned} \begin{array}{lcl} \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle &{}= &{} \underbrace{\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle }_{{\mathcal {T}}_1}\\ &{}&{}+ {~} \underbrace{\langle \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle }_{{\mathcal {T}}_2}. \\ \end{array} \end{aligned}$$

Since \({\mathbf {x}}^{k+1}\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\), we have

$$\begin{aligned} {\mathcal {T}}_1 = \langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\rangle \le \eta _k^2. \end{aligned}$$
(46)

Using the Cauchy-Schwarz inequality and the triangle inequality, \({\mathcal {T}}_2\) can also be bounded as

$$\begin{aligned} \begin{array}{lcl} {\mathcal {T}}_2 &{}= &{} \left\langle \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \\ &{}\le &{} \Vert \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^k) - \nabla ^2 f({\mathbf {x}}^k)({\mathbf {x}}^{k+1} - {\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k}^*\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\overset{(24)}{\le } &{} \frac{\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} \\ &{}\le &{} \frac{\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}^2}{1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}}\left[ \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\right] \\ &{} = &{} \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k}. \end{array} \end{aligned}$$
(47)

Finally, we can bound \(f({\mathbf {x}}^{k+1}) - f^{\star }\) as

$$\begin{aligned} \begin{array}{lcl} f({\mathbf {x}}^{k+1}) - f({\mathbf {x}}^{\star }) &{} \overset{(45)}{\le } &{} \left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle + \omega _{*}({\bar{\lambda }}_{k+1}) \\ &{} = &{} {\mathcal {T}}_1 + {\mathcal {T}}_2 + \omega _{*}({\bar{\lambda }}_{k+1}) \\ &{} \overset{(46)(47)}{\le } &{} \eta _k^2 + \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k} + \omega _{*}({\bar{\lambda }}_{k+1}), \end{array} \end{aligned}$$

which proves (44). \(\square \)
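As a final illustration, the small helper below evaluates the right-hand side of (44) from given values of \(\gamma _k\), \({\bar{\lambda }}_k\), \({\bar{\lambda }}_{k+1}\), and \(\eta _k\); it is purely illustrative and only mirrors the arithmetic of Lemma 7.

```python
import math

def suboptimality_bound(gamma_k, lam_k, lam_k1, eta_k):
    """Right-hand side of (44): an upper bound on f(x^{k+1}) - f(x*).
    All arguments must lie in (0, 1); purely illustrative."""
    omega_star = lambda t: -t - math.log(1.0 - t)  # omega_*(tau) = -tau - log(1 - tau)
    return (gamma_k**2 * (gamma_k + lam_k) / (1.0 - gamma_k)
            + eta_k**2 + omega_star(lam_k1))

print(suboptimality_bound(gamma_k=0.1, lam_k=0.2, lam_k1=0.15, eta_k=0.05))
```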


Cite this article

Liu, D., Cevher, V. & Tran-Dinh, Q. A Newton Frank–Wolfe method for constrained self-concordant minimization. J Glob Optim 83, 273–299 (2022). https://doi.org/10.1007/s10898-021-01105-z

