Abstract
Iterative regularization exploits the implicit bias of optimization algorithms to regularize ill-posed problems. Constructing algorithms with such built-in regularization mechanisms is a classic challenge in inverse problems, and also in modern machine learning, where it provides both a new perspective on algorithm analysis and significant speed-ups compared to explicit regularization. In this work, we propose and study the first iterative regularization procedure with explicit computational steps able to handle biases described by nonsmooth and non-strongly convex functionals, which are prominent in low-complexity regularization. Our approach is based on a primal-dual algorithm, of which we analyze convergence and stability properties, even in the case where the original problem is infeasible. The general results are illustrated considering the special case of sparse recovery with the \(\ell _1\) penalty. Our theoretical results are complemented by experiments showing the computational benefits of our approach.
Notes
This is an abuse of language since, when R is not strictly convex, \(D_R\) may not be a divergence; meaning that, in general, \(D_{R}(x,x')=0\) does not imply \(x'=x\).
If initialized at 0 and provided \(\gamma < 2 / \left\| A \right\| _{\text {op}}^2\).
References
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Structured sparsity through convex optimization. Stat. Sci. 27(4), 450–468 (2012)
Bachmayr, M., Burger, M.: Iterative total variation schemes for nonlinear inverse problems. Inverse Probl. 25(10), 105004 (2009)
Bahraoui, M., Lemaire, B.: Convergence of diagonally stationary sequences in convex optimization. Set-Valued Anal. 2, 49–61 (1994)
Barré, M., Taylor, A., Bach, F.: Principled analyses and design of first-order methods with inexact proximal operators. arXiv preprint arXiv:2006.06041 (2020)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numer. 27, 1–111 (2018)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Bredies, K., Zhariy, M.: A discrepancy-based parameter adaptation and stopping rule for minimization algorithms aiming at Tikhonov-type regularization. Inverse Probl. 29(2), 025008 (2013)
Brianzi, P., Di Benedetto, F., Estatico, C.: Preconditioned iterative regularization in Banach spaces. Comput. Optim. Appl. 54(2), 263–282 (2013)
Burger, M., Osher, S.: Convergence rates of convex variational regularization. Inverse Probl. 20(5), 1411 (2004)
Burger, M., Osher, S., Xu, J., Gilboa, G.: Nonlinear inverse scale space methods for image restoration. In: International Workshop on Variational, Geometric, and Level Set Methods in Computer Vision, pp. 25–36. Springer (2005)
Burger, M., Gilboa, G., Osher, S., Xu, J.: Nonlinear inverse scale space methods. Commun. Math. Sci. 4(1), 179–212 (2006)
Burger, M., Resmerita, E., He, L.: Error estimation for Bregman iterations and inverse scale space methods in image restoration. Computing 81(2–3), 109–135 (2007)
Burger, M., Möller, M., Benning, M., Osher, S.: An adaptive inverse scale space method for compressed sensing. Math. Comput. 82(281), 269–299 (2013)
Cai, J.-F., Osher, S., Shen, Z.: Convergence of the linearized Bregman iteration for \(\ell _1\)-norm minimization. Math. Comput. 78(268), 2127–2136 (2009)
Cai, J.-F., Osher, S., Shen, Z.: Linearized Bregman iterations for compressed sensing. Math. Comput. 78(267), 1515–1536 (2009)
Calatroni, L., Garrigos, G., Rosasco, L., Villa, S.: Accelerated iterative regularization via dual diagonal descent. SIAM J. Optim. 31(1), 754–784 (2021)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)
Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. A J. Issued Courant Inst. Math. Sci. 57(11), 1413–1457 (2004)
Deledalle, C., Vaiter, S., Peyré, G., Fadili, J.: Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM J. Imaging Sci. 7(4), 2448–2487 (2014)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer, New York (1996)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)
Figueiredo, M., Nowak, R.: Ordered weighted \(\ell _1\) regularized regression with strongly correlated covariates: theoretical aspects. In: AISTATS, pp. 930–938. PMLR (2016)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, New York (2013)
Friedlander, M.P., Tseng, P.: Exact regularization of convex programs. SIAM J. Optim. 18(4), 1326–1350 (2008)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Garrigos, G., Rosasco, L., Villa, S.: Iterative regularization via dual diagonal descent. J. Math. Imaging Vis. 60(2), 189–215 (2018)
Ghorbani, B., Mei, S., Misiakiewicz, T., Montanari, A.: Linearized two-layers neural networks in high dimension. Ann. Stat. 49(2), 1029–1054 (2021)
Goldenshluger, A., Pereverzev, S.: On adaptive inverse estimation of linear functionals in Hilbert scales. Bernoulli 9(5), 783–807 (2003)
Grasmair, M., Scherzer, O., Haltmeier, M.: Necessary and sufficient conditions for linear convergence of \(\ell _1\)-regularization. Commun. Pure Appl. Math. 64(2), 161–182 (2011)
Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B., Srebro, N.: Implicit regularization in matrix factorization. In: NeurIPS, pp. 6151–6159 (2017)
Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1832–1841 (2018)
Hastie, T.J., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
Huang, B., Ma, S., Goldfarb, D.: Accelerated linearized Bregman method. J. Sci. Comput. 54(2–3), 428–453 (2013)
Iutzeler, F., Malick, J.: Nonsmoothness in machine learning: specific structure, proximal identification, and applications. Set-Valued Var. Anal. 28(4), 661–678 (2020)
Kaltenbacher, B., Neubauer, A., Scherzer, O.: Iterative Regularization Methods for Nonlinear Ill-Posed Problems, vol. 6. Walter de Gruyter, Berlin (2008)
Lanza, A., Morigi, S., Selesnick, I.W., Sgallari, F.: Sparsity-inducing nonconvex nonseparable regularization for convex image processing. SIAM J. Imaging Sci. 12(2), 1099–1134 (2019)
Lorenz, D., Wenger, S., Schöpfer, F., Magnor, M.: A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing. In: 2014 IEEE international conference on image processing (ICIP), pp. 1347–1351. IEEE (2014)
Lorenz, D.A., Schöpfer, F., Wenger, S.: The linearized Bregman method via split feasibility problems: analysis and generalizations. SIAM J. Imaging Sci. 7(2), 1237–1262 (2014)
Massias, M., Vaiter, S., Gramfort, A., Salmon, J.: Dual extrapolation for sparse generalized linear models. J. Mach. Learn. Res. (2020)
Molinari, C., Peypouquet, J.: Lagrangian penalization scheme with parallel forward-backward splitting. J. Optim. Theory Appl. 177, 413–447 (2018)
Molinari, C., Peypouquet, J., Roldan, F.: Alternating forward-backward splitting for linearly constrained optimization problems. Optim. Lett. 14, 1071–1088 (2020)
Molinari, C., Massias, M., Rosasco, L., Villa, S.: Iterative regularization for convex regularizers. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp 1684–1692 (2021)
Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.A., Charlier, B., Dagréou, M., Dupre la Tour, T., Durif, G., Dantas, C.F., Klopfenstein, Q.: Benchopt: reproducible, efficient and collaborative optimization benchmarks. NeurIPS 35, 25404–25421 (2022)
Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regularization with proximal methods. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 418–433. Springer (2010)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: NeurIPS, pp. 451–459 (2011)
Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
Neubauer, A.: On Nesterov acceleration for Landweber iteration of linear ill-posed problems. J. Inverse Ill-posed Probl. 25(3), 381–390 (2017)
Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20(2), 231–252 (2010)
Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. SIAM Multiscale Model. Simul. 4, 460–489 (2005)
Osher, S., Ruan, F., Xiong, J., Yao, Y., Yin, W.: Sparse recovery via differential inclusions. Appl. Comput. Harmon. Anal. 41(2), 436–469 (2016)
Pagliana, N., Rosasco, L.: Implicit regularization of accelerated methods in Hilbert spaces. In: NeurIPS, pp. 14454–14464 (2019)
Pock, T., Chambolle, A.: Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In: 2011 International Conference on Computer Vision, pp. 1762–1769 (2011)
Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Raskutti, G., Wainwright, M.J., Yu, B.: Early stopping and non-parametric regression: an optimal data-dependent stopping rule. J. Mach. Learn. Res. 15(1), 335–366 (2014)
Rosasco, L., Villa, S.: Learning with incremental iterative regularization. In: NeurIPS, pp. 1630–1638 (2015)
Rosasco, L., Santoro, M., Mosci, S., Verri, A., Villa, S.: A regularization approach to nonlinear variable selection. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Volume 9 of Proceedings of Machine Learning Research, pp. 653–660 (2010)
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)
Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. J. Convex Anal. 19(4), 1167–1192 (2012)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: NeurIPS, pp. 1458–1466 (2011)
Schöpfer, F.: Exact regularization of polyhedral norms. SIAM J. Optim. 22(4), 1206–1223 (2012)
Schöpfer, F., Lorenz, D.: Linear convergence of the randomized sparse Kaczmarz method. Math. Program. 173(1), 509–536 (2019)
Schöpfer, F., Louis, A., Schuster, T.: Nonlinear iterative methods for linear ill-posed problems in Banach spaces. Inverse Probl. 22(1), 311 (2006)
Simon, N., Friedman, J., Hastie, T.J., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph. Stat. 22(2), 231–245 (2013)
Steinwart, I., Christmann, A.: Support Vector Machines. Information Science and Statistics. Springer, New York (2008)
Teboulle, M., Beck, A.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
Tikhonov, A.N., Arsenin, V.Y.: Solutions of ill-posed problems. Wiley, New York, V. H. Winston & Sons, Washington (1977). Translated from the Russian, Preface by translation editor Fritz John, Scripta Series in Mathematics
Vaiter, S., Peyré, G., Fadili, J.: Low complexity regularization of linear inverse problems. Sampling Theory, a Renaissance: Compressive Sensing and Other Developments, pp. 103–153 (2015)
Vaškevičius, T., Kanade, V., Rebeschini, P.: Implicit regularization for optimal sparse recovery. In: NeurIPS, pp. 2968–2979 (2019)
Vaškevičius, T., Kanade, V., Rebeschini, P.: The statistical complexity of early stopped mirror descent. In: NeurIPS, pp. 253–264 (2020)
Villa, S., Salzo, S., Baldassarre, L., Verri, A.: Accelerated and inexact forward-backward algorithms. SIAM J. Optim. 23(3), 1607–1633 (2013)
Villa, S., Matet, S., Vu, B.C., Rosasco, L.: Implicit regularization with strongly convex bias: stability and acceleration. Anal. Appl. (2022). https://doi.org/10.1142/S0219530522400139
Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38(3), 667–681 (2013)
Wei, Y., Yang, F., Wainwright, M.J.: Early stopping for kernel boosting algorithms: a general analysis with localized complexities. In: Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/a081cab429ff7a3b96e0a07319f1049e-Paper.pdf
Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)
Yin, W.: Analysis and generalizations of the linearized Bregman method. SIAM J. Imaging Sci. 3(4), 856–877 (2010)
Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for \(\ell _1\)-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)
Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46, 20–46 (2011)
Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 37(6A), 3468–3497 (2009)
Acknowledgements
L.R. acknowledges the financial support of the European Research Council (Grant SLING 819789), the AFOSR project FA9550-18-1-7009 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS-DLV-777826, and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. S.V. and L.R. acknowledge the support of the AFOSR project FA8655-22-1-7034 and of the H2020-MSCA-ITN Project Trade-OPT 2019. S.V., L.R. and C.M. acknowledge the support of MIUR-PRIN 202244A7YL project Gradient Flows and Non-Smooth Geometric Structures with Applications to Optimization and Machine Learning. The research by C.M. and S.V. has been supported by the MIUR Excellence Department Project awarded to Dipartimento di Matematica, Università di Genova, CUP D33C23001110001. S.V. and C.M. are part of the INdAM research group "Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni". C.M. was supported by the Programma Operativo Nazionale (PON) "Ricerca e Innovazione" 2014-2020.
Appendices
Preliminary lemmas
Lemma 27
[71, Lemma 2] Assume that \((u_j)\) is a non-negative sequence, \((S_j)\) is a non-decreasing sequence with \(S_0\ge u_0^2\), and \(\lambda \ge 0\) is such that, for every \(j\in \mathbb {N}\),
Then, for every \(j\in \mathbb {N}\),
Lemma 28
(Descent lemma, [5, Thm 18.15 (iii)]) Let \(f: \mathcal {X}\rightarrow \mathbb {R}\) be Fréchet differentiable with L-Lipschitz continuous gradient. Then, for every x and \(y\in \mathcal {X}\), \(f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\left\| y - x \right\| ^2\).
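As a quick numerical sanity check of the descent lemma, one can verify the inequality on a quadratic function, for which the Lipschitz constant of the gradient is the operator norm of the Hessian. This is an illustrative sketch (the quadratic \(f\) and the random test points are assumptions for the example, not from the paper):

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T Q x has L-Lipschitz gradient
# with L = ||Q||_op (largest eigenvalue of the PSD matrix Q).
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
Q = Q.T @ Q                      # symmetric positive semidefinite
L = np.linalg.norm(Q, 2)         # operator norm = Lipschitz constant of grad f

f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

# Check f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2 at random pairs.
violations = []
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    bound = f(x) + grad_f(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    violations.append(f(y) - bound)
max_violation = max(violations)  # should be <= 0 up to rounding
```

For a quadratic the check is exact: the second-order Taylor expansion shows the gap between the two sides is \(\frac{1}{2}(y-x)^\top (L\,\textrm{Id}-Q)(y-x)\ge 0\).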
Lemma 29
Let \(\mathcal {Z}\) denote \(\mathcal {X}\) or \(\mathcal {Y}\) and U denote T or \(\Sigma \) accordingly. Let \(f \in \Gamma _0(\mathcal {Z})\) and \(\varepsilon \ge 0\). It follows easily from the definition of the \(\varepsilon \)-subdifferential that if \(a, b \in \mathcal {Z}\) satisfy
then, for every \(c \in \mathcal {Z}\),
1.1 Primal-dual estimates
Lemma 30
(One step estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Then, for any \(z=(x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), with \(V(z):= \frac{1}{2}\left\| x \right\| _T^2 + \frac{1}{2} \left\| y \right\| _\Sigma ^2\),
Proof
Let \((x, y) \in \mathcal {X}\times \mathcal {Y}\). Applying Lemma 29 to the definition of \(x_{k+1}\) yields
For the dual update, similarly,
Recall that \(z:= (x, y)\) and the definition of V. Sum Eqs. (45) and (46):
From Lemma 28,
while from the convexity of F,
Summing the last two equations, one obtains the three-point descent lemma:
Now compute
Notice that
Finally,
\(\square \)
Lemma 31
(First cumulating estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Define \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\). Then, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),
Proof
We start from the inequality in Lemma 30, switching the index from k to j. Recall that \({\tilde{y}}_j:= 2 y_j - y_{j - 1}\), to get
Now focus on the term
where we used Cauchy-Schwarz and Young inequalities. Then, using the definition of \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\), we have
Imposing \(y_{-1}=y_0\) and summing Eq. (52) from \(j=0\) to \(j=k-1\):
where in the last inequality we used the Cauchy–Schwarz and Young inequalities again, as before. Reordering, we obtain the claim. \(\square \)
Lemma 32
(Second cumulative estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Given \(\xi >0\) and \(\eta >0\), define \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\) and \(\rho :=\sigma _m(\eta -1)-\sigma _M\xi \eta \). Then, for any \(z = (x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),
Proof
In a similar fashion as in the previous proof, we start again from the main inequality in Lemma 30, switching the index from k to j. Since \({\tilde{y}}_j = y_j + (y_j - y_{j - 1}) = y_j + \Sigma (A x_j - b^\delta )\) and \(y_{j+1} - y_j = \Sigma (A x_{j+1} - b^\delta )\), we get
Now estimate
So,
In the last inequality we used the Cauchy–Schwarz inequality three times and the Young inequality twice, with parameters \(\xi >0\) and \(\eta >0\). Then, reordering and recalling the definition of \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\), we obtain
Summing the latter from \(j=0\) to \(j=k-1\), we get
By trivial manipulations, we get the claim. \(\square \)
Proofs of main results
1.1 Proof of Proposition 10
Proposition 10
Assume that Assumptions 4 and 5 hold. Let \((x_k, y_k)\) be the sequence generated by iterations (12) applied to \(b^{\delta } = {b}^\star \) under Assumptions 7 and 8. Let also \(\varepsilon _{k}=0\) for every \(k\in \mathbb {N}\). Then \((x_k, y_k)\) weakly converges to a pair in \(\mathcal {S}^\star \). In particular, \((x_k)\) weakly converges to a point in \(\mathcal {P}^{\star }\).
Proof
Up to a change of initialization and offset of index, the steps of algorithm (12) when \(\varepsilon _k = 0\) correspond to
We now show that the previous iterations correspond to Algorithm 3.2 in [26], setting \(\sigma =\tau =1\) and applying it in the metrics defined by the preconditioning operators; namely, in the primal and dual spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), respectively. Comparing problem (15) with (1) in [26], their notation in our setting reads as \(F=F,\ G=R,\ H=\iota _{\left\{ {b}^\star \right\} }\) and \(K=A\). The Fenchel conjugate of H in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) is
and its proximal-point operator, again in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), is
The gradient of F in \((\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\) is denoted by \(\nabla _{T} F(x)\) and satisfies, for x and v in \(\mathcal {X}\),
It is easy to see that one has \(\nabla _T F(x)=T \nabla F(x)\).
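The identity \(\nabla _T F(x)=T\nabla F(x)\) can be verified numerically in finite dimensions: the gradient in the metric \(\langle T^{-1}\cdot ,\cdot \rangle \) must reproduce the Euclidean directional derivative. A small sketch, with an illustrative quadratic F and diagonal preconditioner T (both assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
T = np.diag(rng.uniform(0.5, 2.0, n))   # positive definite preconditioner
Q = rng.standard_normal((n, n))
Q = Q.T @ Q
grad_F = lambda x: Q @ x                # Euclidean gradient of F(x) = 0.5 x^T Q x

x = rng.standard_normal(n)
v = rng.standard_normal(n)

# In the metric <T^{-1} . , .>, the gradient nabla_T F(x) = T grad F(x)
# satisfies <T^{-1} nabla_T F(x), v> = <grad F(x), v> for every direction v.
lhs = np.linalg.solve(T, T @ grad_F(x)) @ v   # <T^{-1} (T grad F(x)), v>
rhs = grad_F(x) @ v                           # Euclidean directional derivative
gap = abs(lhs - rhs)
```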
The adjoint operator of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) satisfies, for every \((x,y)\in \mathcal {X}\times \mathcal {Y}\),
implying that \(T^{-1}K^*=A^*\Sigma \) and so that \(K^*=TA^*\Sigma \). Then Algorithm 3.2 in [26] (with \(\sigma =\tau =1\), \(\rho _k=1\) for every \(k\in \mathbb {N}\) and no errors involved) is:
and becomes, applied to our setting in the spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\),
Define the variable \(\bar{z}_k=\Sigma \bar{y}_k\) and multiply the first line by \(\Sigma \). Then,
Comparing the previous with (54), we get that they are indeed the same algorithm. To conclude, we want to use Theorem 3.1 in [26], which ensures the weak convergence of the sequence generated by the algorithm to a saddle-point. It remains to check that, under our assumptions, the hypotheses of the above result are indeed satisfied; namely, that
where \(\left\| K \right\| _{ }\) represents the operator norm of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) and \(L_T\) is the Lipschitz constant of \(\nabla _T F\). Notice that
Moreover, \(L_T\le \tau _M L\). Indeed, for every x and \(x'\in \mathcal {X}\),
Then, by Assumption 8 and the previous considerations,
In particular, (58) is satisfied and the claim is proved. \(\square \)
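For concreteness, a scalar-step instance of the primal-dual scheme analyzed above can be sketched in a few lines. The sketch below assumes \(T=\tau {{\,\textrm{Id}\,}}\), \(\Sigma =\sigma {{\,\textrm{Id}\,}}\), \(F=0\), \(R=\left\| \cdot \right\| _1\) and exact proximal steps; it is an illustrative implementation of the iteration with dual extrapolation \(\tilde{y}_k = 2y_k - y_{k-1}\), not the general preconditioned algorithm (12):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (componentwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def primal_dual_l1(A, b, tau, sigma, n_iter=10000):
    """Primal-dual iteration for min ||x||_1  s.t.  Ax = b, with dual
    extrapolation y_tilde = 2 y_k - y_{k-1} and averaged primal iterates.
    Assumes tau * sigma * ||A||_op^2 < 1."""
    m, n = A.shape
    x = np.zeros(n)
    y = np.zeros(m)
    y_prev = np.zeros(m)
    x_avg = np.zeros(n)
    for k in range(n_iter):
        y_tilde = 2.0 * y - y_prev                          # dual extrapolation
        x = soft_threshold(x - tau * (A.T @ y_tilde), tau)  # primal prox step (F = 0)
        y_prev, y = y, y + sigma * (A @ x - b)              # dual ascent on Ax = b
        x_avg += (x - x_avg) / (k + 1)                      # running average
    return x, x_avg
```

Early stopping of this iteration, rather than running it to convergence, is what provides the regularization studied in the paper.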
1.2 Proof of Proposition 11
Proposition 11
Let \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) and \((x, y) \in \mathcal {X}\times \mathcal {Y}\) such that \(\mathcal {L}^{\star }(x, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y) = 0\) and \(A x = {b}^\star \). Then \((x, {y}^\star ) \in \mathcal {S}^\star \).
Proof
For simplicity, denote \(J:=R+F\). First notice that, for our problem, the Lagrangian gap is equal to the Bregman divergence. Indeed, using \(-A^* {y}^\star \in \partial J({x}^\star )\) and \(A {x}^\star = {b}^\star \):
We then show that if \(v \in \partial J({x}^\star )\) and \(D_{J}^{v}(x, {x}^\star ) = 0\), then \(v \in \partial J(x)\). Indeed, \(J(x) - J({x}^\star ) - \langle v, x - {x}^\star \rangle = 0\) and so, for all \(x' \in \mathcal {X}\),
Proposition 11 follows by taking \(v = -A^* {y}^\star \). \(\square \)
1.3 Proof of Theorem 13
Theorem 13
Let Assumptions 4 and 5 hold and \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) be a saddle-point of the exact problem. Let \((x_k, y_k)\) be generated by (12) under Assumptions 7 and 8 with inexact data \(b^\delta \) such that \(\left\| b^\delta - {b}^\star \right\| \le \delta \) and error \(\vert \varepsilon _k \vert \le C_0 \delta \) in the proximal operator for all \(k \in \mathbb {N}\). Denote by \((\hat{x}_k, \hat{y}_k)\) the averaged iterates \((\tfrac{1}{k}\sum _{j=1}^{k}x_j, \tfrac{1}{k} \sum _{j=1}^{k} y_j)\). Then there exist constants \(C_1, C_2\), \(C_3\) and \(C_4\) such that, for every \(k \in \mathbb {N}\),
Let also Assumption 9 hold. Then there exist constants \(C_5, C_6\), \(C_7\), \(C_8\) and \(C_9\) such that, for every \(k\in \mathbb {N}\),
Proof
Recall that we denote \(z = (x, y) \in \mathcal {X}\times \mathcal {Y}\) a primal-dual pair, and define
Use Lemma 31 at \(x={x}^\star \) and \(y={y}^\star \), to get
Notice that
Then,
Recall that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). Moreover, \(\omega \ge 0\) by Assumption 8 and so \(1-\tau _M\sigma _M\left\| A \right\| _{ }^2 \ge 0\). Then, for every \(j\in \mathbb {N}\), we have that
and so
Apply Lemma 27 to Eq. (66) with \(u_j=\left\| y_j - {y}^\star \right\| \), \(S_j=2 \sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{j}\varepsilon _{i} \right] \) and \(\lambda =2\delta \sigma _M\). We get, for \(1\le j\le k\),
Insert the latter in Eq. (64), to obtain
where the last line uses \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\). By Jensen’s inequality, we get the first claim.
For the second result, apply Lemma 32 at \(x={x}^\star \) and \(y={y}^\star \):
Using Eqs. (63) and (67), we have
Recall that \(\theta \ge 0\) and that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). By Jensen’s inequality, rearranging the terms and using \(\sum _{i=1}^k \varepsilon _i \le C_0 k \delta \), we get the claim. The exact values of the constants of Theorem 13 are therefore:
\(\square \)
1.4 Example of divergence in absence of noisy solution (see Remark 15)
We present an example in which the exact primal problem has a solution, but the noisy one does not, and the averaged primal iterates generated by Algorithm (12) indeed diverge. First note that, if the function R has bounded domain, the primal iterates remain bounded. So, to exhibit a case of divergence of the primal iterates, we consider a function R with full domain: set \(R (\cdot )= \frac{1}{2} \left\| \cdot \right\| ^2\) (and \(F=0\)). The exact problem is then
Now consider a noisy datum \(b^{\delta }\) such that \(Ax=b^\delta \) does not have a solution. If the associated normal equation, namely \(A^*Ax=A^*b^{\delta }\), is feasible, in Sect. 7 we prove not only boundedness of the iterates but also convergence to a normal solution. On the contrary, to get divergence of the iterates, here we consider a classic scenario in which the perturbation of the exact data generates an infeasible constraint, even for the associated normal equation. We recall that this may happen only in the infinite-dimensional setting, as when the range of A is finite dimensional it is also closed, and a solution to the normal equation always exists. As a prototype of an ill-posed problem, let \(\mathcal {X}=\mathcal {Y}=\ell ^2\) and let A be defined by, for every \(x\in \ell ^2\) and for every \(i\in \mathbb {N}\),
where, for every \(i\in \mathbb {N}\), \(a^i\in (0,M)\) for a fixed constant \(M>0\), and \(\inf _{i\in \mathbb {N}} a^i =0\). Note that \(A:\ell ^2\rightarrow \ell ^2\) is well-defined, linear, continuous, self-adjoint and compact. Let \(b^\star \) be in the range of A and denote by \(x^\star \) the unique solution to \(\mathcal {P}^{\star }\) defined in Eq. (71); namely, \((x^\star )^i:= (b^\star )^i/a^i\) for every \(i\in \mathbb {N}\). In particular, the \((b^\star )^i\) are such that \(x^\star \) belongs to \(\ell ^2\). Let also \(b^{\delta }\in \ell ^2\) with \(\Vert b^\delta -b^\star \Vert \le \delta \), but such that the noisy equation does not admit a normal solution. Defining, for every \(i\in \mathbb {N}\),
the previous means that \(x^\delta \) does not belong to \(\ell ^2\). For an explicit example, consider \(a^i=1/i, (b^\star )^i=1/i^2\) and \((b^\delta )^i=(b^\star )^i+C/i\), with \(C={\delta /}{ \sqrt{\sum _{j=1}^{+\infty } 1/j^2}}\).
Apply the algorithm with step-sizes \(\sigma >0\) and \(\tau >0\) such that \(\sigma \tau < 1/\left\| A \right\| _{ }^2\), and notice that this implies \(\sigma \tau < 1/(a^i)^2\) for every \(i\in \mathbb {N}\). As \(a^i>0\) for every \(i\in \mathbb {N}\), the coordinates of the averaged sequence \((\hat{x}_k^i)\) converge to a solution of the following (one-dimensional) optimization problem:
Hence, for the primal-dual algorithm, if \(x^{\delta }\notin \ell ^2\), then \((\hat{x}_k)\) diverges. Indeed, suppose by contradiction that \((\hat{x}_k)\) is bounded. Since it is bounded and converges coordinate-wise to \(x^{\delta }\), it weakly converges to \(x^{\delta }\). But this is not possible, since \(x^{\delta }\) is not in \(\ell ^2\).
Note that the problem considered in this example can also be treated by the Landweber method; it is well known that the iterates generated by that method, while different from those of the primal-dual algorithm, diverge as well.
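The divergence can be observed numerically on a finite truncation of the diagonal example. The sketch below uses the explicit choices \(a^i=1/i\), \((b^\star )^i=1/i^2\), \((b^\delta )^i=(b^\star )^i+C/i\) from the text; the truncation level, noise level and step sizes are illustrative assumptions:

```python
import numpy as np

def averaged_norms(n=2000, n_iter=500, delta=0.1):
    """Norms of the averaged primal iterates for the truncated diagonal
    example: a^i = 1/i, (b*)^i = 1/i^2, (b^delta)^i = (b*)^i + C/i, so the
    noisy 'solution' x^delta, with coordinates 1/i + C, is not in ell^2."""
    i = np.arange(1, n + 1)
    a = 1.0 / i
    b_star = 1.0 / i ** 2
    C = delta / np.sqrt(np.sum(1.0 / i ** 2))
    b_delta = b_star + C / i
    tau = sigma = 0.9 / a.max()          # tau * sigma * ||A||^2 < 1
    x = np.zeros(n)
    y = np.zeros(n)
    y_prev = np.zeros(n)
    x_avg = np.zeros(n)
    norms = []
    for k in range(n_iter):
        y_tilde = 2.0 * y - y_prev
        # prox of tau * 0.5 ||.||^2 is v -> v / (1 + tau); F = 0 here
        x = (x - tau * a * y_tilde) / (1.0 + tau)
        y_prev, y = y, y + sigma * (a * x - b_delta)
        x_avg += (x - x_avg) / (k + 1)
        norms.append(np.linalg.norm(x_avg))
    return norms
```

As the iterations proceed, \(\Vert \hat{x}_k\Vert \) climbs toward the norm of the truncated \(x^\delta \), which itself grows like \(C\sqrt{n}\) with the truncation level; in the untruncated \(\ell ^2\) problem the averaged iterates are unbounded.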
Sparse recovery
1.1 Proof of Proposition 16
Proposition 16
Fix a primal-dual solution \(\left( {x}^\star , {y}^\star \right) \in \mathcal {S}^\star \). Let the extended support be \(\Gamma := \{i \in \mathbb {N}: | \left( A^*{y}^\star \right) _i | = 1 \}\) and the saturation gap be \(m:= \sup \left\{ | \left( A^*{y}^\star \right) _i |: | \left( A^*{y}^\star \right) _i | < 1 \right\} \). Then \(\Gamma \) is finite, and \(m < 1\). Moreover, for every \(x\in \mathcal {X}\), with \(\Gamma _C:= \mathbb {N}{\setminus } \Gamma \),
Proof
Recall that \({x}^\star , {y}^\star \) is a primal-dual solution, hence \(-A^* {y}^\star \in \partial \left\| {x}^\star \right\| _1\). For every \(i \in \mathbb {N}\) we have that \(\left[ \partial \left\| \cdot \right\| _{ 1 }\right] _i({x}^\star ) \subseteq \left[ -1,1\right] \) and so \(| \left( A^*{y}^\star \right) _i |\le 1\). Recall that \(\Gamma _C:=\mathbb {N}{\setminus } \Gamma \). As \(A^*{y}^\star \) belongs to \(\mathcal {X}=\ell ^2(\mathbb {N}; \mathbb {R})\), we have
Indeed, \(m\le 1\) by definition and, from Eq. (73), the coefficients \(| \left( A^*{y}^\star \right) _i |\) converge to 0 (and so they cannot accumulate at 1). We also have that
\(\square \)
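In finite dimensions, the extended support \(\Gamma \) and the saturation gap m of a given dual certificate \(\eta = A^*{y}^\star \) are straightforward to compute; the helper below is an illustrative sketch (the function name, tolerance, and example certificate are assumptions):

```python
import numpy as np

def extended_support_and_gap(eta, tol=1e-10):
    """Given a dual certificate eta = A^T y_star with ||eta||_inf <= 1,
    return the extended support Gamma = {i : |eta_i| = 1} and the
    saturation gap m = max{|eta_i| : |eta_i| < 1} (0 if no such entry)."""
    abs_eta = np.abs(eta)
    gamma = np.flatnonzero(abs_eta >= 1.0 - tol)       # saturated coordinates
    off = abs_eta[abs_eta < 1.0 - tol]                 # non-saturated ones
    m = off.max() if off.size else 0.0
    return gamma, m

# Illustrative certificate, saturated on coordinates 0 and 3:
eta = np.array([1.0, 0.2, -0.5, -1.0, 0.7])
gamma, m = extended_support_and_gap(eta)
# gamma -> array([0, 3]), m -> 0.7
```

In the infinite-dimensional statement, finiteness of \(\Gamma \) and \(m<1\) come precisely from \(A^*{y}^\star \in \ell ^2\), which forces the coefficients to vanish.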
1.2 Tikhonov regularization: Lasso
For Tikhonov regularization, the results in terms of Bregman divergence and feasibility are the following.
Lemma 33
([40], Lemma 3.5) Let \(A{x}^\star ={b}^\star \), \(-A^*{y}^\star \in \partial \left\| \cdot \right\| _1({x}^\star )\) and, for \(\alpha >0\),
Then it holds that
The previous bounds, combined with Assumption 17 and the last inequality in Lemma 18, lead naturally to the following corollary.
Corollary 34
([40], Theorem 5.6) Suppose Assumption 17 holds. Then, for \(x_\alpha \) defined as in Lemma 33 and \(C:=\alpha /\delta \),
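For comparison with the iterative approach, the Tikhonov-regularized estimator \(x_\alpha \) can be computed by any Lasso solver. The sketch below assumes the standard Lasso form \(x_\alpha \in {{\,\textrm{argmin}\,}}_x \tfrac{1}{2}\Vert Ax-b^\delta \Vert ^2+\alpha \Vert x\Vert _1\) and uses a generic proximal-gradient (ISTA) scheme, not an algorithm from the paper:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, alpha, n_iter=5000):
    """ISTA for x_alpha in argmin_x 0.5 ||A x - b||^2 + alpha ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / L, with L = ||A||_op^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # forward (gradient) step on the data-fit term, then prox of alpha||.||_1
        x = soft_threshold(x - step * (A.T @ (A @ x - b)), step * alpha)
    return x
```

Note the practical contrast with iterative regularization: here each value of the regularization parameter \(\alpha \) requires solving a new optimization problem, whereas early stopping explores the whole regularization path along a single run.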
Proofs of Sect. 7
1.1 Proof of Corollary 21
Corollary 21
Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by Eq. (12) with data b under Assumption 7, Assumption 8 and summable error (\((\varepsilon _k)\in \ell ^1\)). Denote by \((\hat{x}_k,\hat{y}_k)\) the averaged iterates. Then, every weak cluster point of \((\hat{x}_k,\hat{y}_k)\) belongs to \(\mathcal {S}\). In particular, if \(\mathcal {S}=\emptyset \), then the primal-dual sequence \((\hat{x}_k,\hat{y}_k)\) diverges: \(\left\| (\hat{x}_k,\hat{y}_k) \right\| _{ }\rightarrow +\infty \).
Proof
From Lemma 31, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), we have
where \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2) \ge 0\) by Assumption 8. Using Jensen’s inequality, we get
Let \(\left( x_{\infty },y_{\infty }\right) \) be a weak cluster point of \((\hat{x}_k,\hat{y}_k)\); namely, there exists a subsequence \((\hat{x}_{k_j},\hat{y}_{k_j})\subseteq (\hat{x}_k,\hat{y}_k)\) such that \((\hat{x}_{k_j},\hat{y}_{k_j})\rightharpoonup (x_{\infty },y_{\infty })\). By weak lower-semicontinuity of R and F, for every \((x, y)\in \mathcal {X}\times \mathcal {Y}\),
Thus \((x_{\infty },y_{\infty })\) is a saddle-point for the Lagrangian.
Now suppose that the set of saddle-points of \(\mathcal {L}\) is empty. Assume also, for contradiction, that \((\hat{x}_k,\hat{y}_k)\) does not diverge. Then we can extract a bounded subsequence, which in turn admits a weakly convergent subsequence. But then the limit is a saddle-point, contradicting the assumption. \(\square \)
1.2 Proof of Lemma 22
Lemma 22
Let Assumption 4 hold. Assume that \(\tilde{\mathcal {C}}\ne \emptyset \). Let \(\left( x_k\right) \) be the primal sequence generated by algorithm (35); namely, with \(T=\tau {{\,\textrm{Id}\,}}\), \(\Sigma = \sigma {{\,\textrm{Id}\,}}\), \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\) and \(y_0=y_{-1}+\sigma (Ax_0-b)\). Then, there exists a primal sequence \(\left( u_k\right) \) generated by the same procedure but applied to problem \(\tilde{\mathcal {P}}\) (as stated in (36)) such that \(x_k=u_k\) for every \(k\in \mathbb {N}\).
Proof
As \(\tilde{\mathcal {C}}\ne \emptyset \), there exists \(x^b\in \mathcal {X}\) such that \(A^*Ax^b=A^*b\). First consider the algorithm in (35). Note that, for every \(k\in \mathbb {N}\), \(\tilde{y}_k=y_k+\sigma \left( Ax_k-b\right) \) and multiply the last step by \(A^*\). We get, for every \(k\in \mathbb {N}\),
Recall that \(S:=(A^*A)^{\frac{1}{2}}\) and introduce \(p_k:=A^*y_k\). Then the primal sequence \(\left( x_k\right) \) is equivalently defined by the following recursion: given \(x_0\) and \(p_{0}=A^*y_{0}\), for every \(k\in \mathbb {N}\),
As \(A^*y_{-1}\) belongs to \(R(A^*)\) and \(R(A^*)=R(S)\) ([30], Prop 2.18), there exists \(v_{-1}\) such that \(Sv_{-1}=A^*y_{-1}\). Now consider the primal-dual algorithm applied to problem (36) starting at \(u_0=x_0\), \(v_{-1}\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^b)\). It reads as: for every \(k\in \mathbb {N}\),
Then, noticing that \(\tilde{v}_k=v_k+\sigma \left( Su_k-Sx^b\right) \) and multiplying the last step by S,
Define the change of variable \(q_k:=Sv_k\), so that \(q_{-1}=Sv_{-1}=A^*y_{-1}\) and
Then the primal sequence \(\left( u_k\right) \) is alternatively defined by the following recursion: for every \(k\in \mathbb {N}\),
Comparing Eq. (78) with Eq. (79), with \((u_0,q_0)=(x_0,p_0)\), we get the claim.
\(\square \)
1.3 Proof of Theorem 23
Theorem 23
Let Assumption 4 hold. Assume that \(\tilde{\mathcal {P}}\) (as stated in 32) admits a saddle-point; namely, that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that
Let \((x_k,y_k)\) be the sequence generated by Eq. (35), namely with initialization \(y_0=y_{-1}+\sigma (Ax_0-b)\), and under Assumption 8. Denote by \((\hat{x}_k)\) the averaged primal iterates. Then there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{x}_k\rightharpoonup \tilde{x}_{\infty }\). Moreover, if \(\mathcal {P}= \emptyset \), then \(\hat{y}_k\) diverges.
Proof
From Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) generated by Eq. (35) coincides with the primal iterate of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem (36). Notice that \(\left\| S \right\| = \Vert (A^*A)^\frac{1}{2} \Vert = \left\| A \right\| \) and so, if Assumption 8 holds, the analogue also holds for problem (36): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\). The same is true for Assumption 5. Indeed, defining \(\bar{v}=S\tilde{v}\), \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\). Moreover, we have seen already that \(A^*Ax=A^*b\) if and only if \(Sx=Sx^b\), where \(x^b\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^b=A^*b\). Then, \(S\tilde{x}=Sx^b\) and \((\tilde{x},\bar{v})\) is a saddle-point for (36). So, by Proposition 10, we know that the averaged primal-dual sequence \((\hat{u}_k,\hat{v}_k)\) weakly converges to a saddle-point for (36). In particular, there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{u}_k\rightharpoonup \tilde{x}_{\infty }\) and so the same holds for \((\hat{x}_k)\). For the second claim, by assumption we have that \(\mathcal {P}=\emptyset \), which implies that \(\mathcal {S}=\emptyset \). All the assumptions of Corollary 21 are verified, so \((\hat{x}_k, \hat{y}_k)\) diverges. As \(\left( \hat{x}_k\right) \) is weakly convergent and so bounded, we conclude that \(\left( \hat{y}_k\right) \) has to diverge. \(\square \)
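The two linear-algebra facts used in the proof, \(\left\| S \right\| = \left\| A \right\| \) for \(S = (A^*A)^{1/2}\) and the equivalence \(A^*Ax = A^*b \Leftrightarrow Sx = Sx^b\), can be sanity-checked numerically in finite dimensions. A minimal sketch, computing the matrix square root via an eigendecomposition (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))    # underdetermined: A has a nontrivial null space
b = rng.standard_normal(5)

# S = (A^T A)^{1/2} via the eigendecomposition of the PSD matrix A^T A.
w, V = np.linalg.eigh(A.T @ A)
S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Fact 1: ||S||_op = ||A||_op (both equal the largest singular value of A).
print(np.linalg.norm(S, 2) - np.linalg.norm(A, 2))          # numerically zero

# Fact 2: A^T A x = A^T b  <=>  S x = S x_b, for any x_b solving A^T A x_b = A^T b.
x_b = np.linalg.pinv(A) @ b                                  # one such solution
n = (np.eye(8) - np.linalg.pinv(A) @ A) @ rng.standard_normal(8)  # null-space vector
x = x_b + n                                                  # another solution
print(np.linalg.norm(A.T @ A @ x - A.T @ b))                 # numerically zero
print(np.linalg.norm(S @ (x - x_b)))                         # numerically zero
```

The last two prints illustrate that perturbing a solution of the normal equations inside \(N(A)=N(S)\) changes neither \(A^*Ax\) nor \(Sx\), which is the content of the equivalence.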
1.4 Proof of Theorem 24
Theorem 24
Let Assumption 4 hold and suppose that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that
(namely, a saddle-point for the normal exact problem \(\tilde{\mathcal {P}}^{\star }\)). Let \(b^{\delta }\in \mathcal {Y}\) be noisy data such that \(\left\| b^{\delta }-b^{\star } \right\| _{ }\le \delta \) for some \(\delta \ge 0\). Moreover, suppose that \(\tilde{\mathcal {C}}^\delta \ne \emptyset \); namely, that there exists \(x^{\delta }\in \mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). Let Assumption 8 and Assumption 9 hold and let \((x_k,y_k)\) be the sequence generated by the algorithm in Eq. (35) on the noisy data \(b^{\delta }\); namely, with the initialization \(y_0=y_{-1}+\sigma (Ax_0-b^{\delta })\),
Denote by \((\hat{x}_k)\) the averaged primal iterates. Then,
and
where the constants involved in the bounds are specified in the proof.
Proof
From the assumption \(\tilde{\mathcal {C}}^{\delta }\ne \emptyset \) and Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) coincides with the primal iterate of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem
where \(x^{\delta }\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). As in the proof of the previous theorem, notice that \(\left\| S \right\| = \left\| A \right\| \) and so, as Assumption 8 and Assumption 9 hold by hypothesis, the analogue also holds for problem (80): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\), \(\xi - \tau (\xi L+ \sigma \left\| S \right\| ^2)\ge 0\) and \(\sigma (\eta - 1) - \sigma \xi \eta >0\). The same is true for Assumption 5. Indeed, define \(\bar{v}=S\tilde{v}\). Then, from Eq. (38), \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\) and \((\tilde{x},\bar{v})\) is a saddle-point for
In particular, we can apply Theorem 13 to \(\left( \hat{u}_k, \hat{v}_k\right) \), the averaged primal-dual sequence generated on the noisy problem in (80), with respect to \((\tilde{x},\bar{v})\), a saddle-point for the exact problem in (81), to get that
and
The constants in the previous bounds are the same as in (70) with \(z_0=(u_0,v_0)\), \(z^{\star }=(\tilde{x},\bar{v})\), \(C_3=C_7=0\) (because \(C_0=0\) as we suppose \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\)), \(\sigma _m=\sigma _M=\sigma \) and
From Lemma 22, we recall also that \(u_0=x_0\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^{\delta })\), where \(v_{-1}\) is any element in \(\mathcal {X}\) such that \(Sv_{-1}=A^*y_{-1}\) (\(v_{-1}\) exists due to \(R(A^*)=R(S)\)). Now it remains to show that \(\tilde{\delta }\le \delta \). Denote by \((\mu _i, f_i,g_i )_{i\in \mathbb {N}} \subseteq \mathbb {R}_+ \times \mathcal {X}\times \mathcal {Y}\) the singular value decomposition of the operator A. First, notice that \(S^2(x^{\delta }-\tilde{x})=A^*(b^{\delta }-b^{\star })\) and so that, for every \(i\in \mathbb {N}\),
Then, for every \(i\in \mathbb {N}\) such that \(\mu _i\ne 0\), \(\mu _i \langle x^{\delta }-\tilde{x},f_i\rangle =\langle b^{\delta }-b^{\star }, g_i \rangle \) and so
We conclude the claim simply by noticing that
and
\(\square \)
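The inequality \(\tilde{\delta }\le \delta \) established in the proof can be checked directly in finite dimensions, using \(\left\| Sz \right\| = \left\| Az \right\| \) (since \(S^2 = A^*A\)). A minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 10))
x_star = rng.standard_normal(10)
b_star = A @ x_star                                  # exact data b*
b_delta = b_star + 0.05 * rng.standard_normal(6)     # noisy data b^delta

# Solutions of the exact and noisy normal equations A^T A x = A^T b.
x_tilde = np.linalg.pinv(A) @ b_star
x_delta = np.linalg.pinv(A) @ b_delta

# ||S (x_delta - x_tilde)|| = ||A (x_delta - x_tilde)||, since S^2 = A^T A.
delta_tilde = np.linalg.norm(A @ (x_delta - x_tilde))
delta = np.linalg.norm(b_delta - b_star)
print(delta_tilde, delta)
```

Indeed \(A(x^{\delta }-\tilde{x})\) is the orthogonal projection of \(b^{\delta }-b^{\star }\) onto \(\overline{R(A)}\), so its norm never exceeds \(\delta \), matching the singular-value argument above.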
A dual view on the implicit bias of gradient descent on least squares
Here we provide an interesting view on why the “implicit” bias of gradient descent on least squares is not so implicit. Recall that these iterations,
converge, for \(\gamma < 2 / \left\| A \right\| ^2_{\textrm{op}}\), to the minimal Euclidean norm solution of \(Ax = b\):
provided that Problem (83) is feasible and \(x_0=0\).
It turns out that the iterations (82) correspond, up to multiplication by \(-A^*\), to the iterates of gradient descent on the dual of (83), namely:
By setting \(x_{k+1} = -A^* y_{k+1}\) one recovers the iterates of gradient descent on least squares (82). Therefore the “implicit bias” of gradient descent on least squares is not so implicit: its iterates \(x_k\) are dual to the iterates \(y_k\) on Problem (84), which is itself the dual of Problem (83), in which the bias appears explicitly.
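This implicit bias is easy to verify numerically: gradient descent on least squares, started at zero with an admissible step size, recovers the minimal Euclidean norm solution computed by the pseudoinverse. A minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 12))   # underdetermined, so A x = b has many solutions
b = rng.standard_normal(5)

# Gradient descent on 0.5 * ||A x - b||^2, initialized at x_0 = 0.
gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # gamma < 2 / ||A||_op^2
x = np.zeros(12)
for _ in range(30000):
    x = x - gamma * A.T @ (A @ x - b)

x_min_norm = np.linalg.pinv(A) @ b           # minimal Euclidean norm solution
print(np.linalg.norm(x - x_min_norm))        # numerically zero
```

Every iterate stays in \(R(A^*)\) (since \(x_0 = 0\) and each update adds a vector of the form \(A^*(\cdot )\)), which is exactly the dual correspondence \(x_k = -A^*y_k\) described above.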
Molinari, C., Massias, M., Rosasco, L. et al. Iterative regularization for low complexity regularizers. Numer. Math. 156, 641–689 (2024). https://doi.org/10.1007/s00211-023-01390-8