Abstract
We develop a new Newton Frank–Wolfe algorithm to solve a class of constrained self-concordant minimization problems using linear minimization oracles (LMO). Unlike L-smooth convex functions, whose gradients are Lipschitz continuous globally, self-concordant functions admit only local bounds, making it difficult to estimate the number of LMO calls required by the underlying optimization algorithm. Fortunately, we can still prove that the number of LMO calls of our method is nearly the same as that of the standard Frank–Wolfe method in the L-smooth case. Specifically, our method requires at most \({\mathcal {O}}\big (\varepsilon ^{-(1 + \nu )}\big )\) LMO calls, where \(\varepsilon \) is the desired accuracy and \(\nu \in (0, 0.139)\) is a given constant depending on the chosen initial point of the proposed algorithm. Intensive numerical experiments on three applications, namely portfolio design with the competitive ratio, D-optimal experimental design, and logistic regression with an elastic-net regularizer, show that the proposed Newton Frank–Wolfe method outperforms several state-of-the-art competitors.
Notes
The smoothness of f is only defined on \(\mathrm {dom}(f)\), an open set.
The differentiability of \(\varphi \) is only defined on \(\mathrm {dom}(\varphi )\), an open set.
One can see from the proof leading to [43, equation (72)] that the relation holds more generally when \({\mathbf {z}}_+\) and \({\mathbf {z}}\) are replaced by any two vectors \({\mathbf {y}}\) and \({\mathbf {x}}\) satisfying \(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{\mathbf {x}}\le 1\).
In fact, by (23), we have \(\nabla ^2 f({\mathbf {y}}) \preceq \frac{\nabla ^2 f({\mathbf {x}})}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\), which is equivalent to \(\nabla ^2 f({\mathbf {x}})^{-1} \preceq \frac{\nabla ^2 f({\mathbf {y}})^{-1}}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}})^2}\). Therefore, we have \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\).
References
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer (2017)
Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59(2), 235–247 (2004)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Chen, Y., Ye, X.: Projection onto a simplex. Preprint arXiv:1101.6081 (2011)
Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Damla, S.A., Sun, P., Todd, M.J.: Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optim. Methods Softw. 23(1), 5–19 (2008)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)
de Oliveira, F.R., Ferreira, O.P., Silva, G.N.: Newton’s method with feasible inexact projections for solving constrained generalized equations. Comput. Optim. Appl. 72(1), 159–177 (2019)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 272–279, New York, NY, USA, ACM (2008)
Dvurechensky, P., Ostroukhov, P., Safin, K., Shtern, S., Staudigl, M.: Self-concordant analysis of Frank-Wolfe algorithms. In International Conference on Machine Learning, pp. 2814–2824. PMLR, (2020)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)
Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. Preprint arXiv:1301.4666 (2013)
Garber, D., Hazan, E.: Faster rates for the Frank–Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 541–549 (2015)
Gonçalves, M.L.N., Melo, J.G.: A Newton conditional gradient method for constrained nonlinear systems. J. Comput. Appl. Math. 311, 473–483 (2017)
Gonçalves, D. S., Gonçalves, M. L. N., Menezes, T. C.: Inexact variable metric method for convex-constrained optimization problems. Optimization, 1–19, (online first) (2021)
Gonçalves, D. S., Gonçalves, M. L. N., Oliveira, F. R.: Levenberg–Marquardt methods with inexact projections for constrained nonlinear systems. Preprint arXiv:1908.06118 (2019)
Gonçalves, M.L.N., Oliveira, F.R.: On the global convergence of an inexact quasi-Newton conditional gradient method for constrained nonlinear systems. Numer. Algorithms 84(2), 606–631 (2020)
Gross, D., Liu, Y.-K., Flammia, S., Becker, S., Eisert, J.: Quantum state tomography via compressed sensing. Phys. Rev. Lett. 105(15), 150401 (2010)
Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. 35(1), 110–119 (1986)
Harman, R., Trnovská, M.: Approximate D-optimal designs of experiments on the convex hull of a finite set of information matrices. Math. Slov. 59(6), 693–704 (2009)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009)
Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American Symposium on Theoretical Informatics, pp. 306–316. Springer (2008)
Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)
Khachiyan, L.G.: Rounding of polytopes in the real number model of computation. Math. Oper. Res. 21(2), 307–320 (1996)
Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank–Wolfe optimization variants. In Advances in Neural Information Processing Systems (NIPS), pp. 496–504 (2015)
Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. SIAM J. Optim. 26(2), 1379–1409 (2016)
Lan, G., Ouyang, Y.: Accelerated gradient sliding for structured convex optimization. Preprint arXiv:1609.04905 (2016)
Lu, Z., Pong, T.K.: Computing optimal experimental designs via interior point method. SIAM J. Matrix Anal. Appl. 34(4), 1556–1580 (2013)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers (2004)
Nesterov, Y., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El-Halabi, M., Cevher, V.: Frank-Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. IEEE (2016)
Ostrovskii, D.M., Bach, F.: Finite-sample analysis of M-estimators using self-concordance. Electron. J. Stat. 15(1), 326–391 (2021)
Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)
Ryu, E. K., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft (2014)
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Su, W., Boyd, S., Candès, E.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)
Sun, T., Tran-Dinh, Q.: Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2019)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
Tran-Dinh, Q., Ling, L., Toh, K.-C.: A new homotopy proximal variable-metric framework for composite convex minimization. Math. Oper. Res., 1–28, (online first) (2021)
Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Math. Program. 177(1–2), 173–223 (2019)
Yurtsever, A., Fercoq, O., Cevher, V.: A conditional-gradient-based augmented lagrangian framework. In International Conference on Machine Learning (ICML), pp. 7272–7281 (2019)
Yurtsever, A., Tran-Dinh, Q., Cevher, V.: A universal primal-dual convex optimization framework. Advances in Neural Information Processing Systems (NIPS), pp. 1–9 (2015)
Acknowledgements
Q. Tran-Dinh was partly supported by the National Science Foundation (NSF), Grant No. DMS-1619884, and the Office of Naval Research (ONR), Grant No. N00014-20-1-2088. V. Cevher was partly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant agreement no. 725594 - time-data) and by a 2019 Google Faculty Research Award.
Appendix: The proof of technical results
Let us recall the following key properties of standard self-concordant functions. Let f be standard self-concordant and \({\mathbf {x}},{\mathbf {y}}\in \mathrm {dom}(f)\) such that \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\). Then
Similarly, if \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {y}}} < 1\), then
These inequalities can be found in [32, Theorem 4.1.6]. In addition, from [43, equation (72)], we have
if \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\) (see footnote 4 in the Notes). These inequalities will be used repeatedly in our proofs below.
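For the reader's convenience, we restate the standard bounds being recalled here, since the displayed equations do not reproduce in this version. The following is our reconstruction from [32, Theorem 4.1.6] and may not match the paper's exact numbering of (22)–(24). If \(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\), then
\[ (1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})^2\, \nabla ^2 f({\mathbf {x}}) \;\preceq \; \nabla ^2 f({\mathbf {y}}) \;\preceq \; \frac{\nabla ^2 f({\mathbf {x}})}{(1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})^2}, \]
which implies, for any \({\mathbf {u}}\in {\mathbb {R}}^p\),
\[ (1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})\, \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}} \;\le \; \Vert {\mathbf {u}}\Vert _{{\mathbf {y}}} \;\le \; \frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}}{1 - \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}}}. \]
Analogous bounds hold with the roles of \({\mathbf {x}}\) and \({\mathbf {y}}\) swapped under the condition \(\Vert {\mathbf {y}}- {\mathbf {x}}\Vert _{{\mathbf {y}}} < 1\).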
1.1 Two key lemmas for proving theorem 1
We need the following two lemmas to prove Theorem 1. The first lemma quantifies the decrease of the objective value under damped-step iterations.
Lemma 3
Let \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) be the local distance from \({\mathbf {z}}^k\) to \({\mathbf {x}}^k\), where \({\mathbf {z}}^k\) is the output of Algorithm 2 at \({\mathbf {x}}^k\) with \(\eta = \eta _k^2\). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If we choose \(\alpha \in (0, 1)\) such that \(\alpha \gamma _k < 1\) and update \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + \alpha ({\mathbf {z}}^k - {\mathbf {x}}^k)\), then we have
Assume \(\gamma _k > \eta _k\). If \(\delta \in (0,1)\) and the step size is \(\alpha _k := \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\), then we have \(\alpha _k\gamma _k< \delta < 1\). Moreover, it also holds that
where \(\omega (\tau ) := \tau - \log (1 + \tau )\) and \(\omega _{*}(\tau ) := -\tau - \log (1 - \tau )\) are two nonnegative and convex functions.
Proof
From (7) and the stopping criterion of Algorithm 2, \({\mathbf {z}}^k\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\). It is clear that \({\mathbf {z}}^k\) satisfies
This inequality leads to
Therefore, using the self-concordance of f [32, Theorem 4.1.8], we can derive
This is exactly (25).
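(For reference, the bound from [32, Theorem 4.1.8] invoked here is \(f({\mathbf {y}}) \le f({\mathbf {x}}) + \left\langle \nabla f({\mathbf {x}}), {\mathbf {y}}-{\mathbf {x}}\right\rangle + \omega _{*}(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}})\) for \(\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}} < 1\); this is our restatement, included because the displays are omitted in this version.)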
Assume that \(\gamma _k^2 > \eta _k^2\). Define \(\psi (\alpha ) := \alpha (\gamma _k^2 - \eta _k^2) - \omega _{*}(\alpha \gamma _k)\). Plugging \(\alpha _k = \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k(\gamma _k^2 + \gamma _k - \eta _k^2)}\) into \(\psi (\alpha )\), we arrive at
where we use \(\log (1 - \delta s) \ge \delta \log (1 - s)\) for \(s \in (0,1)\) in the inequality. Combining (28) and (29) proves (26). \(\square \)
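As a quick numerical sanity check of the step-size claim \(\alpha _k\gamma _k< \delta < 1\) in Lemma 3 (this check is ours, not the paper's; the sampling ranges are arbitrary), one can verify the inequality over random instances:

```python
# Sanity check for the step size in Lemma 3: verify alpha_k * gamma_k < delta < 1
# for sampled gamma_k > eta_k >= 0 and delta in (0, 1). Illustrative only.
import random

random.seed(0)
for _ in range(10000):
    gamma = random.uniform(1e-3, 10.0)        # gamma_k > 0
    eta = random.uniform(0.0, 0.999 * gamma)  # eta_k < gamma_k
    delta = random.uniform(1e-3, 0.999)       # delta in (0, 1)
    alpha = delta * (gamma**2 - eta**2) / (gamma * (gamma**2 + gamma - eta**2))
    # alpha * gamma = delta * (gamma^2 - eta^2) / (gamma^2 + gamma - eta^2) < delta
    assert 0.0 < alpha * gamma < delta < 1.0
print("alpha_k * gamma_k < delta held in all sampled cases")
```

The inequality also follows analytically: since \(\gamma _k > 0\), we have \(\gamma _k^2 - \eta _k^2 < \gamma _k^2 + \gamma _k - \eta _k^2\), so \(\alpha _k\gamma _k = \delta \cdot \frac{\gamma _k^2 - \eta _k^2}{\gamma _k^2 + \gamma _k - \eta _k^2} < \delta \).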
The following lemma shows that the residual \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) can be bounded by the projected Newton decrement \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\).
Lemma 4
Let \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\), \({\bar{\gamma }}_k := \Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), \(\gamma _k := \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\), and h be defined by (8). Recall that \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). If \(\gamma _k + \eta _k \in (0, C_2)\), then we have
Proof
Firstly, we can write down the optimality conditions of (4) and (1), respectively, as follows:
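(The displays are omitted in this version. A minimal reconstruction, assuming (4) is the projected Newton subproblem with solution \(T({\mathbf {x}}^k)\), is the pair of variational inequalities \(\left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2 f({\mathbf {x}}^k)(T({\mathbf {x}}^k)-{\mathbf {x}}^k), {\mathbf {x}}- T({\mathbf {x}}^k)\right\rangle \ge 0\) and \(\left\langle \nabla f({\mathbf {x}}^{\star }), {\mathbf {x}}- {\mathbf {x}}^{\star }\right\rangle \ge 0\) for all \({\mathbf {x}}\in {\mathcal {X}}\).)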
Substituting \({\mathbf {x}}^{\star }\) for \({\mathbf {x}}\) into the first inequality and \(T({\mathbf {x}}^k)\) for \({\mathbf {x}}\) into the second, respectively, we get
Adding up both inequalities yields
which is equivalent to
Since f is self-concordant, by [32, Theorem 4.1.7], we have
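(The display is omitted in this version. The bound being invoked is, for any \({\mathbf {x}}, {\mathbf {y}}\in \mathrm {dom}(f)\), \(\left\langle \nabla f({\mathbf {y}}) - \nabla f({\mathbf {x}}), {\mathbf {y}}- {\mathbf {x}}\right\rangle \ge \frac{\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}}^2}{1 + \Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {x}}}}\); this is our restatement of [32, Theorem 4.1.7], not the paper's display.)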
By the Cauchy-Schwarz inequality, this estimate leads to
Now, we can bound the right-hand side of the above inequality as
where the first inequality comes from the dual form of (23), i.e., \(\frac{\Vert {\mathbf {u}}\Vert _{{\mathbf {y}}}^{*}}{1-\Vert {\mathbf {y}}-{\mathbf {x}}\Vert _{{\mathbf {y}}}} \ge \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*}\) for \({\mathbf {u}}\in {\mathbb {R}}^p\) (see footnote 5 in the Notes), and the last inequality holds since \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 0.5\).
which can be reformulated as
Next, since we want to use \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) to bound \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\), we can derive
Notice that the application of (23) in the above derivation is valid because \(\Vert {\mathbf {x}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \gamma _k + \eta _k \le C_2 < 1\), where \(C_2\) is the constant defined right after (8). Since h is monotonically increasing and \({\bar{\gamma }}_k \le \gamma _k + \eta _k\), we finally get
which proves (30). Notice that we can also prove \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} < 1\), which justifies the application of (22) above, by using (34) and \({\bar{\gamma }}_k \le C_2\). \(\square \)
1.2 Key bounds for proving theorem 2
The following lemma shows that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) and \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) can both be bounded by \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) when \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) is sufficiently small.
Lemma 5
Suppose that \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \), where \(\beta \in (0, 0.5)\) is chosen by Algorithm 1. Then, we have
In addition, we can also bound \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) as follows:
Proof
Since we always choose the full step \(\alpha _k = 1\) in this regime, we have \({\mathbf {x}}^{k+1} = {\mathbf {z}}^{k}\). Therefore, \(\Vert {\mathbf {x}}^{k+1} - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\), which leads to
Now, we bound \(\Vert T({\mathbf {x}}^k) - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\) as follows. Firstly, the optimality conditions of (4) and (1) can be written as
This can be rewritten equivalently to
Similar to the proof of [2, Theorem 3.14], we can show that (38) is equivalent to
Using the nonexpansiveness of the projection operator [2, Chapter 4], we can derive
We make the following two remarks on (40):
- In the second inequality of (40), the factor \(1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}}\) in the denominator is justified by \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{k}} < 1\), which follows directly from (24) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} \le \beta < 0.5\) stated at the beginning of this lemma.
- For the last inequality of (40), we first have \(0< \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}} < 1\) by (22) and our assumption that \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }} < 0.5\). Since \(\frac{t^2}{1 - t}\) is increasing for \(t \in (0,1)\), we can replace \(\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) by \(\frac{\Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}{1 - \Vert {\mathbf {x}}^{\star } - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^{\star }}}\) to get the last inequality of (40).
Plugging (40) into (37), we get (35).
Finally, we note that
which proves (36). \(\square \)
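(The display omitted after "we note that" is presumably the triangle inequality in the local norm: \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k} + \Vert {\mathbf {x}}^{k} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^k}\), after which each term on the right is converted to the \(\Vert \cdot \Vert _{{\mathbf {x}}^{\star }}\)-norm and bounded via (35); this reading is our reconstruction.)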
1.3 An intermediate lemma for proving theorem 3
The following lemma establishes the sublinear convergence rate of the Frank–Wolfe gap within each outer iteration.
Lemma 6
At the k-th outer iteration of Algorithm 1, if we run the Frank–Wolfe subroutine (7) to update \({\mathbf {u}}^t\), then, after \(T_k\) iterations, we have
where \(V_k({\mathbf {u}}^t) := \max _{{\mathbf {u}}\in {\mathcal {X}}}\left\langle \nabla f({\mathbf {x}}^k) + \nabla ^2f({\mathbf {x}}^k)({\mathbf {u}}^t-{\mathbf {x}}^k), {\mathbf {u}}^t - {\mathbf {u}}\right\rangle \). As a result, the number of LMO calls at the k-th outer iteration of Algorithm 1 is at most \(O_k := \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{\eta _k^2}\).
Proof
Let \(\phi _k({\mathbf {u}}) := \left\langle \nabla f({\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle + \frac{1}{2}\left\langle \nabla ^2 f({\mathbf {x}}^k)({\mathbf {u}}- {\mathbf {x}}^k), {\mathbf {u}}- {\mathbf {x}}^k\right\rangle \) and let \(\left\{ {\mathbf {u}}^t\right\} \) be generated by the Frank–Wolfe subroutine (7). Then, it is well known that (see [26, Theorem 1]):
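(The display is omitted in this version. From [26, Theorem 1], applied to \(\phi _k\), whose gradient is Lipschitz continuous with constant \(\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))\) over the set \({\mathcal {X}}\) of diameter \(D_{{\mathcal {X}}}\), the bound should read, up to the constant used in the paper, \(\phi _k({\mathbf {u}}^t) - \min _{{\mathbf {u}}\in {\mathcal {X}}}\phi _k({\mathbf {u}}) \le \frac{2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k)) D_{{\mathcal {X}}}^2}{t + 2}\).)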
Let \({\mathbf {v}}^t := \arg \min _{{\mathbf {u}}\in {\mathcal {X}}}\{\left\langle \nabla \phi _k({\mathbf {u}}^t), {\mathbf {u}}\right\rangle \}\). Notice that
This is equivalent to
Summing up this inequality from \(t = 1\) to \(T_k\), we get
which implies (41). \(\square \)
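To make the inner loop concrete, the following is a minimal sketch of the Frank–Wolfe subroutine (7) applied to the quadratic model \(\phi _k\). It is illustrative only: we assume for simplicity that \({\mathcal {X}}\) is the standard simplex (so the LMO returns a vertex), and the function and variable names are ours, not the paper's.

```python
import numpy as np

def fw_subroutine(g, H, x, eta, max_iters=1000):
    """Sketch of the Frank-Wolfe subroutine (7): approximately minimize the
    quadratic model phi(u) = <g, u - x> + 0.5 <H (u - x), u - x> over the
    standard simplex (illustrative choice of X), stopping once the
    Frank-Wolfe gap V(u^t) drops below eta**2, as in Lemma 6."""
    u = x.copy()  # assumes the starting point x lies in the simplex
    for t in range(max_iters):
        grad = g + H @ (u - x)        # gradient of the model at u^t
        v = np.zeros_like(u)          # LMO over the simplex returns a vertex:
        v[np.argmin(grad)] = 1.0      # the coordinate with the smallest gradient
        gap = grad @ (u - v)          # Frank-Wolfe gap V(u^t)
        if gap <= eta**2:
            break
        u += (2.0 / (t + 2.0)) * (v - u)  # standard step size 2/(t+2)
    return u
```

Each iteration costs one Hessian-vector product and one LMO call; the LMO call is the operation counted in the complexity bound \(O_k\) of Lemma 6.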
1.4 An intermediate lemma for proving theorem 4
The following lemma states that we can bound \(f({\mathbf {x}}^k) - f^{\star }\) by \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Therefore, from the convergence rate of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) in Theorem 2, we can obtain a convergence rate of \(\left\{ f({\mathbf {x}}^k) - f^{\star }\right\} \).
Lemma 7
Let \(\gamma _k := \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} = \Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) and \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\). Suppose that \({\mathbf {x}}^0\in \mathrm {dom}(f)\cap {\mathcal {X}}\). If \(0< \gamma _k,{\bar{\lambda }}_k, {\bar{\lambda }}_{k+1} < 1\), then we have
where \(\omega _{*}(\tau ) := -\tau - \log (1-\tau )\).
Proof
Firstly, from [32, Theorem 4.1.8], we have
provided that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} < 1\). Next, using \(\left\langle \nabla f({\mathbf {x}}^{\star }) - \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \le 0\), we can further derive
Now, we bound \(\left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\right\rangle \) as follows. We first notice that this term can be decomposed as
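(The display is omitted in this version. A plausible reconstruction of the decomposition, matching the two bounds derived next, is \({\mathcal {T}}_1 := \left\langle \nabla f({\mathbf {x}}^{k}) + \nabla ^2 f({\mathbf {x}}^{k})({\mathbf {x}}^{k+1}-{\mathbf {x}}^{k}), {\mathbf {x}}^{k+1}-{\mathbf {x}}^{\star }\right\rangle \) and \({\mathcal {T}}_2 := \left\langle \nabla f({\mathbf {x}}^{k+1}) - \nabla f({\mathbf {x}}^{k}) - \nabla ^2 f({\mathbf {x}}^{k})({\mathbf {x}}^{k+1}-{\mathbf {x}}^{k}), {\mathbf {x}}^{k+1}-{\mathbf {x}}^{\star }\right\rangle \), whose sum is \(\left\langle \nabla f({\mathbf {x}}^{k+1}), {\mathbf {x}}^{k+1}-{\mathbf {x}}^{\star }\right\rangle \); this labeling is our guess.)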
Since \({\mathbf {x}}^{k+1}\) is an \(\eta _k\)-solution of (4) at \({\mathbf {x}}= {\mathbf {x}}^k\), we have
Using the Cauchy-Schwarz inequality and the triangle inequality, \({\mathcal {T}}_2\) can also be bounded as
Finally, we can bound \(f({\mathbf {x}}^{k+1}) - f^{\star }\) as
which proves (44). \(\square \)
Cite this article
Liu, D., Cevher, V., Tran-Dinh, Q.: A Newton Frank–Wolfe method for constrained self-concordant minimization. J. Glob. Optim. 83, 273–299 (2022). https://doi.org/10.1007/s10898-021-01105-z
Keywords
- Frank–Wolfe method
- Inexact projected Newton scheme
- Self-concordant function
- Constrained convex optimization
- Oracle complexity