Abstract
Iterative regularization exploits the implicit bias of optimization algorithms to regularize ill-posed problems. Constructing algorithms with such built-in regularization mechanisms is a classic challenge in inverse problems, and also in modern machine learning, where it provides both a new perspective on algorithm analysis and significant speed-ups compared to explicit regularization. In this work, we propose and study the first iterative regularization procedure with explicit computational steps able to handle biases described by nonsmooth and non-strongly convex functionals, which are prominent in low-complexity regularization. Our approach is based on a primal-dual algorithm, of which we analyze convergence and stability properties, even in the case where the original problem is infeasible. The general results are illustrated considering the special case of sparse recovery with the \(\ell _1\) penalty. Our theoretical results are complemented by experiments showing the computational benefits of our approach.
Notes
This is an abuse of language since, when R is not strictly convex, \(D_R\) may not be a divergence; meaning that, in general, \(D_{R}(x,x')=0\) does not imply \(x'=x\).
If initialized at 0 and provided \(\gamma < 2 / \left\| A \right\| _{\text {op}}^2\).
References
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Structured sparsity through convex optimization. Stat. Sci. 27(4), 450–468 (2012)
Bachmayr, M., Burger, M.: Iterative total variation schemes for nonlinear inverse problems. Inverse Probl. 25(10), 105004 (2009)
Bahraoui, M., Lemaire, B.: Convergence of diagonally stationary sequences in convex optimization. Set-Valued Anal. 2, 49–61 (1994)
Barré, M., Taylor, A., Bach, F.: Principled analyses and design of first-order methods with inexact proximal operators. arXiv preprint arXiv:2006.06041 (2020)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numer. 27, 1–111 (2018)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Bredies, K., Zhariy, M.: A discrepancy-based parameter adaptation and stopping rule for minimization algorithms aiming at Tikhonov-type regularization. Inverse Probl. 29(2), 025008 (2013)
Brianzi, P., Di Benedetto, F., Estatico, C.: Preconditioned iterative regularization in Banach spaces. Comput. Optim. Appl. 54(2), 263–282 (2013)
Burger, M., Osher, S.: Convergence rates of convex variational regularization. Inverse Probl. 20(5), 1411 (2004)
Burger, M., Osher, S., Xu, J., Gilboa, G.: Nonlinear inverse scale space methods for image restoration. In: International Workshop on Variational, Geometric, and Level Set Methods in Computer Vision, pp. 25–36. Springer (2005)
Burger, M., Gilboa, G., Osher, S., Xu, J.: Nonlinear inverse scale space methods. Commun. Math. Sci. 4(1), 179–212 (2006)
Burger, M., Resmerita, E., He, L.: Error estimation for Bregman iterations and inverse scale space methods in image restoration. Computing 81(2–3), 109–135 (2007)
Burger, M., Möller, M., Benning, M., Osher, S.: An adaptive inverse scale space method for compressed sensing. Math. Comput. 82(281), 269–299 (2013)
Cai, J.-F., Osher, S., Shen, Z.: Convergence of the linearized Bregman iteration for \(\ell _1\)-norm minimization. Math. Comput. 78(268), 2127–2136 (2009)
Cai, J.-F., Osher, S., Shen, Z.: Linearized Bregman iterations for compressed sensing. Math. Comput. 78(267), 1515–1536 (2009)
Calatroni, L., Garrigos, G., Rosasco, L., Villa, S.: Accelerated iterative regularization via dual diagonal descent. SIAM J. Optim. 31(1), 754–784 (2021)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)
Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. A J. Issued Courant Inst. Math. Sci. 57(11), 1413–1457 (2004)
Deledalle, C., Vaiter, S., Peyré, G., Fadili, J.: Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM J. Imaging Sci. 7(4), 2448–2487 (2014)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer, New York (1996)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)
Figueiredo, M., Nowak, R.: Ordered weighted \(\ell _1\) regularized regression with strongly correlated covariates: theoretical aspects. In: AISTATS, pp. 930–938. PMLR (2016)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, New York (2013)
Friedlander, M.P., Tseng, P.: Exact regularization of convex programs. SIAM J. Optim. 18(4), 1326–1350 (2008)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Garrigos, G., Rosasco, L., Villa, S.: Iterative regularization via dual diagonal descent. J. Math. Imaging Vis. 60(2), 189–215 (2018)
Ghorbani, B., Mei, S., Misiakiewicz, T., Montanari, A.: Linearized two-layers neural networks in high dimension. Ann. Stat. 49(2), 1029–1054 (2021)
Goldenshluger, A., Pereverzev, S.: On adaptive inverse estimation of linear functionals in Hilbert scales. Bernoulli 9(5), 783–807 (2003)
Grasmair, M., Scherzer, O., Haltmeier, M.: Necessary and sufficient conditions for linear convergence of \(\ell _1\)-regularization. Commun. Pure Appl. Math. 64(2), 161–182 (2011)
Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B., Srebro, N.: Implicit regularization in matrix factorization. In: NeurIPS, pp. 6151–6159 (2017)
Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1832–1841 (2018)
Hastie, T.J., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
Huang, B., Ma, S., Goldfarb, D.: Accelerated linearized Bregman method. J. Sci. Comput. 54(2–3), 428–453 (2013)
Iutzeler, F., Malick, J.: Nonsmoothness in machine learning: specific structure, proximal identification, and applications. Set-Valued Var. Anal. 28(4), 661–678 (2020)
Kaltenbacher, B., Neubauer, A., Scherzer, O.: Iterative Regularization Methods for Nonlinear Ill-Posed Problems, vol. 6. Walter de Gruyter, Berlin (2008)
Lanza, A., Morigi, S., Selesnick, I.W., Sgallari, F.: Sparsity-inducing nonconvex nonseparable regularization for convex image processing. SIAM J. Imaging Sci. 12(2), 1099–1134 (2019)
Lorenz, D., Wenger, S., Schöpfer, F., Magnor, M.: A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing. In: 2014 IEEE international conference on image processing (ICIP), pp. 1347–1351. IEEE (2014)
Lorenz, D.A., Schöpfer, F., Wenger, S.: The linearized Bregman method via split feasibility problems: analysis and generalizations. SIAM J. Imaging Sci. 7(2), 1237–1262 (2014)
Massias, M., Vaiter, S., Gramfort, A., Salmon, J.: Dual extrapolation for sparse generalized linear models. J. Mach. Learn. Res. (2020)
Molinari, C., Peypouquet, J.: Lagrangian penalization scheme with parallel forward-backward splitting. J. Optim. Theory Appl. 177, 413–447 (2018)
Molinari, C., Peypouquet, J., Roldan, F.: Alternating forward-backward splitting for linearly constrained optimization problems. Optim. Lett. 14, 1071–1088 (2020)
Molinari, C., Massias, M., Rosasco, L., Villa, S.: Iterative regularization for convex regularizers. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp 1684–1692 (2021)
Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.A., Charlier, B., Dagréou, M., Dupre la Tour, T., Durif, G., Dantas, C.F., Klopfenstein, Q.: Benchopt: reproducible, efficient and collaborative optimization benchmarks. NeurIPS 35, 25404–25421 (2022)
Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regularization with proximal methods. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 418–433. Springer (2010)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: NeurIPS, pp. 451–459 (2011)
Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
Neubauer, A.: On Nesterov acceleration for Landweber iteration of linear ill-posed problems. J. Inverse Ill-posed Probl. 25(3), 381–390 (2017)
Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20(2), 231–252 (2010)
Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. SIAM Multiscale Model. Simul. 4, 460–489 (2005)
Osher, S., Ruan, F., Xiong, J., Yao, Y., Yin, W.: Sparse recovery via differential inclusions. Appl. Comput. Harmon. Anal. 41(2), 436–469 (2016)
Pagliana, N., Rosasco, L.: Implicit regularization of accelerated methods in Hilbert spaces. In: NeurIPS, pp. 14454–14464 (2019)
Pock, T., Chambolle, A.: Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In: 2011 International Conference on Computer Vision, pp. 1762–1769 (2011)
Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Raskutti, G., Wainwright, M.J., Yu, B.: Early stopping and non-parametric regression: an optimal data-dependent stopping rule. J. Mach. Learn. Res. 15(1), 335–366 (2014)
Rosasco, L., Villa, S.: Learning with incremental iterative regularization. In: NeurIPS, pp. 1630–1638 (2015)
Rosasco, L., Santoro, M., Mosci, S., Verri, A., Villa, S.: A regularization approach to nonlinear variable selection. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Volume 9 of Proceedings of Machine Learning Research, pp. 653–660 (2010)
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)
Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. J. Convex Anal. 19(4), 1167–1192 (2012)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: NeurIPS, pp. 1458–1466 (2011)
Schöpfer, F.: Exact regularization of polyhedral norms. SIAM J. Optim. 22(4), 1206–1223 (2012)
Schöpfer, F., Lorenz, D.: Linear convergence of the randomized sparse Kaczmarz method. Math. Program. 173(1), 509–536 (2019)
Schöpfer, F., Louis, A., Schuster, T.: Nonlinear iterative methods for linear ill-posed problems in Banach spaces. Inverse Probl. 22(1), 311 (2006)
Simon, N., Friedman, J., Hastie, T.J., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph. Stat. 22(2), 231–245 (2013)
Steinwart, I., Christmann, A.: Support Vector Machines. Information Science and Statistics. Springer, New York (2008)
Teboulle, M., Beck, A.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
Tikhonov, A.N., Arsenin, V.Y.: Solutions of ill-posed problems. Wiley, New York, V. H. Winston & Sons, Washington (1977). Translated from the Russian, Preface by translation editor Fritz John, Scripta Series in Mathematics
Vaiter, S., Peyré, G., Fadili, J.: Low complexity regularization of linear inverse problems. Sampling Theory, a Renaissance: Compressive Sensing and Other Developments, pp. 103–153 (2015)
Vaškevičius, T., Kanade, V., Rebeschini, P.: Implicit regularization for optimal sparse recovery. In: NeurIPS, pp. 2968–2979 (2019)
Vaškevičius, T., Kanade, V., Rebeschini, P.: The statistical complexity of early stopped mirror descent. In: NeurIPS, pp. 253–264 (2020)
Villa, S., Salzo, S., Baldassarre, L., Verri, A.: Accelerated and inexact forward-backward algorithms. SIAM J. Optim. 23(3), 1607–1633 (2013)
Villa, S., Matet, S., Vu, B.C., Rosasco, L.: Implicit regularization with strongly convex bias: stability and acceleration. Anal. Appl. (2022). https://doi.org/10.1142/S0219530522400139
Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38(3), 667–681 (2013)
Wei, Y., Yang, F., Wainwright, M.J.: Early stopping for kernel boosting algorithms: a general analysis with localized complexities. In: Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/a081cab429ff7a3b96e0a07319f1049e-Paper.pdf
Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)
Yin, W.: Analysis and generalizations of the linearized Bregman method. SIAM J. Imaging Sci. 3(4), 856–877 (2010)
Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for \(\ell _1\)-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)
Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46, 20–46 (2011)
Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 37(6A), 3468–3497 (2009)
Acknowledgements
L.R. acknowledges the financial support of the European Research Council (Grant SLING 819789), the AFOSR project FA9550-18-1-7009 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS-DLV-777826, and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. S.V. and L.R. acknowledge the support of the AFOSR project FA8655-22-1-7034 and of the H2020-MSCA-ITN Project Trade-OPT 2019. S.V., L.R. and C.M. acknowledge the support of MIUR-PRIN 202244A7YL project Gradient Flows and Non-Smooth Geometric Structures with Applications to Optimization and Machine Learning. The research by C.M. and S.V. has been supported by the MIUR Excellence Department Project awarded to Dipartimento di Matematica, Università di Genova, CUP D33C23001110001. S.V. and C.M. are part of the INdAM research group "Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni". C.M. was supported by the Programma Operativo Nazionale (PON) "Ricerca e Innovazione" 2014-2020.
Appendices
Preliminary lemmas
Lemma 27
[71, Lemma 2] Assume that \((u_j)\) is a non-negative sequence, \((S_j)\) is a non-decreasing sequence with \(S_0\ge u_0^2\), and \(\lambda \ge 0\) is such that, for every \(j\in \mathbb {N}\),
Then, for every \(j\in \mathbb {N}\),
Lemma 28
(Descent lemma, [5, Thm 18.15 (iii)]) Let \(f: \mathcal {X}\rightarrow \mathbb {R}\) be Fréchet differentiable with L-Lipschitz continuous gradient. Then, for every x and \(y\in \mathcal {X}\), \(f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\left\| y - x \right\| ^2\).
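As a quick numerical sanity check of the descent lemma, one can verify the inequality on a quadratic function, for which the Lipschitz constant of the gradient is the operator norm of the Hessian. This is an illustrative sketch (the quadratic \(f\) and the random test points are assumptions for the example, not from the paper):

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T Q x has L-Lipschitz gradient
# with L = ||Q||_op (largest eigenvalue of the PSD matrix Q).
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
Q = Q.T @ Q                      # symmetric positive semidefinite
L = np.linalg.norm(Q, 2)         # operator norm = Lipschitz constant of grad f

f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

# Check f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2 at random pairs.
violations = []
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    bound = f(x) + grad_f(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    violations.append(f(y) - bound)
max_violation = max(violations)  # should be <= 0 up to rounding
```

For a quadratic the check is exact: the second-order Taylor expansion shows the gap between the two sides is \(\frac{1}{2}(y-x)^\top (L\,\textrm{Id}-Q)(y-x)\ge 0\).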
Lemma 29
Let \(\mathcal {Z}\) denote \(\mathcal {X}\) or \(\mathcal {Y}\) and U denote T or \(\Sigma \) accordingly. Let \(f \in \Gamma _0(\mathcal {Z})\) and \(\varepsilon \ge 0\). It follows easily from the definition of the \(\varepsilon \)-subdifferential that if \(a, b \in \mathcal {Z}\) satisfy
then, for every \(c \in \mathcal {Z}\),
1.1 Primal-dual estimates
Lemma 30
(One step estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Then, for any \(z=(x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), with \(V(z):= \frac{1}{2}\left\| x \right\| _T^2 + \frac{1}{2} \left\| y \right\| _\Sigma ^2\),
Proof
Let \((x, y) \in \mathcal {X}\times \mathcal {Y}\). Applying Lemma 29 to the definition of \(x_{k+1}\) yields
For the dual update, similarly,
Recall that \(z:= (x, y)\) and the definition of V. Sum Eqs. (45) and (46):
From Lemma 28,
while from the convexity of F,
Summing the last two equations, one obtains the three-point descent lemma:
Now compute
Notice that
Finally,
\(\square \)
Lemma 31
(First cumulating estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Define \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\). Then, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),
Proof
We start from the inequality in Lemma 30, switching the index from k to j. Recall that \({\tilde{y}}_j:= 2 y_j - y_{j - 1}\), to get
Now focus on the term
where we used Cauchy-Schwarz and Young inequalities. Then, using the definition of \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\), we have
Imposing \(y_{-1}=y_0\) and summing Eq. (52) from \(j=0\) to \(j=k-1\):
where in the last inequality we used the Cauchy–Schwarz and Young inequalities again, as before. Reordering, we obtain the claim. \(\square \)
Lemma 32
(Second cumulative estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Given \(\xi >0\) and \(\eta >0\), define \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\) and \(\rho :=\sigma _m(\eta -1)-\sigma _M\xi \eta \). Then, for any \(z = (x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),
Proof
In a similar fashion as in the previous proof, we start again from the main inequality in Lemma 30, switching the index from k to j. Since \({\tilde{y}}_j = y_j + (y_j - y_{j - 1}) = y_j + \Sigma (A x_j - b^\delta )\) and \(y_{j+1} - y_j = \Sigma (A x_{j+1} - b^\delta )\), we get
Now estimate
So,
In the last inequality we used the Cauchy–Schwarz inequality three times and the Young inequality twice, with parameters \(\xi >0\) and \(\eta >0\). Then, reordering and recalling the definition of \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\), we obtain
Summing the latter from \(j=0\) to \(j=k-1\), we get
By trivial manipulations, we get the claim. \(\square \)
Proofs of main results
1.1 Proof of Proposition 10
Proposition 10
Assume that Assumptions 4 and 5 hold. Let \((x_k, y_k)\) be the sequence generated by iterations (12) applied to \(b^{\delta } = {b}^\star \) under Assumptions 7 and 8. Let also \(\varepsilon _{k}=0\) for every \(k\in \mathbb {N}\). Then \((x_k, y_k)\) weakly converges to a pair in \(\mathcal {S}^\star \). In particular, \((x_k)\) weakly converges to a point in \(\mathcal {P}^{\star }\).
Proof
Up to a change of initialization and offset of index, the steps of algorithm (12) when \(\varepsilon _k = 0\) correspond to
We now show that the previous iterations correspond to Algorithm 3.2 in [26], setting \(\sigma =\tau =1\) and applying it in the metrics defined by the preconditioning operators; namely, in the primal and dual spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), respectively. Comparing problem (15) with (1) in [26], their notation in our setting reads as \(F=F,\ G=R,\ H=\iota _{\left\{ {b}^\star \right\} }\) and \(K=A\). The Fenchel conjugate of H in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) is
and its proximal-point operator, again in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), is
The gradient of F in \((\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\) is denoted by \(\nabla _{T} F(x)\) and satisfies, for x and v in \(\mathcal {X}\),
It is easy to see that one has \(\nabla _T F(x)=T \nabla F(x)\).
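The identity \(\nabla _T F(x)=T\nabla F(x)\) can be verified numerically in finite dimensions: the gradient in the metric \(\langle T^{-1}\cdot ,\cdot \rangle \) must reproduce the Euclidean directional derivative. A small sketch, with an illustrative quadratic F and diagonal preconditioner T (both assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
T = np.diag(rng.uniform(0.5, 2.0, n))   # positive definite preconditioner
Q = rng.standard_normal((n, n))
Q = Q.T @ Q
grad_F = lambda x: Q @ x                # Euclidean gradient of F(x) = 0.5 x^T Q x

x = rng.standard_normal(n)
v = rng.standard_normal(n)

# In the metric <T^{-1} . , .>, the gradient nabla_T F(x) = T grad F(x)
# satisfies <T^{-1} nabla_T F(x), v> = <grad F(x), v> for every direction v.
lhs = np.linalg.solve(T, T @ grad_F(x)) @ v   # <T^{-1} (T grad F(x)), v>
rhs = grad_F(x) @ v                           # Euclidean directional derivative
gap = abs(lhs - rhs)
```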
The adjoint operator of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) satisfies, for every \((x,y)\in \mathcal {X}\times \mathcal {Y}\),
implying that \(T^{-1}K^*=A^*\Sigma \) and so that \(K^*=TA^*\Sigma \). Then Algorithm 3.2 in [26] (with \(\sigma =\tau =1\), \(\rho _k=1\) for every \(k\in \mathbb {N}\) and no errors involved) is:
and becomes, applied to our setting in the spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\),
Define the variable \(\bar{z}_k=\Sigma \bar{y}_k\) and multiply the first line by \(\Sigma \). Then,
Comparing the previous with (54), we get that they are indeed the same algorithm. To conclude, we want to use Theorem 3.1 in [26], which ensures the weak convergence of the sequence generated by the algorithm to a saddle-point. It remains to check that, under our assumptions, the hypotheses of the above result are indeed satisfied; namely, that
where \(\left\| K \right\| _{ }\) represents the operator norm of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) and \(L_T\) is the Lipschitz constant of \(\nabla _T F\). Notice that
Moreover, \(L_T\le \tau _M L\). Indeed, for every x and \(x'\in \mathcal {X}\),
Then, by Assumption 8 and the previous considerations,
In particular, (58) is satisfied and the claim is proved. \(\square \)
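For concreteness, a scalar-step instance of the primal-dual scheme analyzed above can be sketched in a few lines. The sketch below assumes \(T=\tau {{\,\textrm{Id}\,}}\), \(\Sigma =\sigma {{\,\textrm{Id}\,}}\), \(F=0\), \(R=\left\| \cdot \right\| _1\) and exact proximal steps; it is an illustrative implementation of the iteration with dual extrapolation \(\tilde{y}_k = 2y_k - y_{k-1}\), not the general preconditioned algorithm (12):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (componentwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def primal_dual_l1(A, b, tau, sigma, n_iter=10000):
    """Primal-dual iteration for min ||x||_1  s.t.  Ax = b, with dual
    extrapolation y_tilde = 2 y_k - y_{k-1} and averaged primal iterates.
    Assumes tau * sigma * ||A||_op^2 < 1."""
    m, n = A.shape
    x = np.zeros(n)
    y = np.zeros(m)
    y_prev = np.zeros(m)
    x_avg = np.zeros(n)
    for k in range(n_iter):
        y_tilde = 2.0 * y - y_prev                          # dual extrapolation
        x = soft_threshold(x - tau * (A.T @ y_tilde), tau)  # primal prox step (F = 0)
        y_prev, y = y, y + sigma * (A @ x - b)              # dual ascent on Ax = b
        x_avg += (x - x_avg) / (k + 1)                      # running average
    return x, x_avg
```

Early stopping of this iteration, rather than running it to convergence, is what provides the regularization studied in the paper.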
1.2 Proof of Proposition 11
Proposition 11
Let \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) and \((x, y) \in \mathcal {X}\times \mathcal {Y}\) such that \(\mathcal {L}^{\star }(x, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y) = 0\) and \(A x = {b}^\star \). Then \((x, {y}^\star ) \in \mathcal {S}^\star \).
Proof
For simplicity, denote \(J:=R+F\). First notice that, for our problem, the Lagrangian gap is equal to the Bregman divergence. Indeed, using \(-A^* {y}^\star \in \partial J({x}^\star )\) and \(A {x}^\star = {b}^\star \):
We then show that if \(v \in \partial J({x}^\star )\) and \(D_{J}^{v}(x, {x}^\star ) = 0\), then \(v \in \partial J(x)\). Indeed, \(J(x) - J({x}^\star ) - \langle v, x - {x}^\star \rangle = 0\) and so, for all \(x' \in \mathcal {X}\),
Proposition 11 follows by taking \(v = -A^* {y}^\star \). \(\square \)
1.3 Proof of Theorem 13
Theorem 13
Let Assumptions 4 and 5 hold and \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) be a saddle-point of the exact problem. Let \((x_k, y_k)\) be generated by (12) under Assumptions 7 and 8 with inexact data \(b^\delta \) such that \(\left\| b^\delta - {b}^\star \right\| \le \delta \) and error \(\vert \varepsilon _k \vert \le C_0 \delta \) in the proximal operator for all \(k \in \mathbb {N}\). Denote by \((\hat{x}_k, \hat{y}_k)\) the averaged iterates \((\tfrac{1}{k}\sum _{j=1}^{k}x_j, \tfrac{1}{k} \sum _{j=1}^{k} y_j)\). Then there exist constants \(C_1, C_2\), \(C_3\) and \(C_4\) such that, for every \(k \in \mathbb {N}\),
Let also Assumption 9 hold. Then there exist constants \(C_5, C_6\), \(C_7\), \(C_8\) and \(C_9\) such that, for every \(k\in \mathbb {N}\),
Proof
Recall that we denote \(z = (x, y) \in \mathcal {X}\times \mathcal {Y}\) a primal-dual pair, and define
Use Lemma 31 at \(x={x}^\star \) and \(y={y}^\star \), to get
Notice that
Then,
Recall that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). Moreover, \(\omega \ge 0\) by Assumption 8 and so \(1-\tau _M\sigma _M\left\| A \right\| _{ }^2 \ge 0\). Then, for every \(j\in \mathbb {N}\), we have that
and so
Apply Lemma 27 to Eq. (66) with \(u_j=\left\| y_j - {y}^\star \right\| \), \(S_j=2 \sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{j}\varepsilon _{i} \right] \) and \(\lambda =2\delta \sigma _M\). We get, for \(1\le j\le k\),
Insert the latter in Eq. (64), to obtain
where the last line uses \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\). By Jensen’s inequality, we get the first claim.
For the second result, apply Lemma 32 at \(x={x}^\star \) and \(y={y}^\star \):
Using Eqs. (63) and (67), we have
Recall that \(\theta \ge 0\) and that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). By Jensen’s inequality, rearranging the terms and using \(\sum _{i=1}^k \varepsilon _i \le C_0 k \delta \), we get the claim. The exact values of the constants of Theorem 13 are therefore:
\(\square \)
1.4 Example of divergence in absence of noisy solution (see Remark 15)
We present an example in which the exact primal problem has a solution, but the noisy one does not, and the averaged primal iterates generated by Algorithm (12) indeed diverge. First note that, if the function R has bounded domain, the primal iterates remain bounded. So, to exhibit a case of divergence of the primal iterates, we consider a function R with full domain: set \(R (\cdot )= \frac{1}{2} \left\| \cdot \right\| ^2\) (and \(F=0\)). The exact problem is then
Now consider a noisy datum \(b^{\delta }\) such that \(Ax=b^\delta \) does not have a solution. If the associated normal equation, namely \(A^*Ax=A^*b^{\delta }\), is feasible, in Sect. 7 we prove not only boundedness of the iterates but also convergence to a normal solution. On the contrary, to get divergence of the iterates, here we consider a classic scenario in which the perturbation of the exact data generates an infeasible constraint, even for the associated normal equation. We recall that this may happen only in the infinite-dimensional setting, as when the range of A is finite dimensional it is also closed, and a solution to the normal equation always exists. As a prototype of an ill-posed problem, let \(\mathcal {X}=\mathcal {Y}=\ell ^2\) and let A be defined by, for every \(x\in \ell ^2\) and for every \(i\in \mathbb {N}\),
where, for every \(i\in \mathbb {N}\), \(a^i\in (0,M)\) for a fixed constant \(M>0\), and \(\inf _{i\in \mathbb {N}} a^i =0\). Note that \(A:\ell ^2\rightarrow \ell ^2\) is well-defined, linear, continuous, self-adjoint and compact. Let \(b^\star \) be in the range of A and denote by \(x^\star \) the unique solution to \(\mathcal {P}^{\star }\) defined in Eq. (71); namely, \((x^\star )^i:= (b^\star )^i/a^i\) for every \(i\in \mathbb {N}\). In particular, the \((b^\star )^i\) are such that \(x^\star \) belongs to \(\ell ^2\). Let also \(b^{\delta }\in \ell ^2\) with \(\Vert b^\delta -b^\star \Vert \le \delta \), but such that the noisy equation does not admit a normal solution. Defining, for every \(i\in \mathbb {N}\),
the previous means that \(x^\delta \) does not belong to \(\ell ^2\). For an explicit example, consider \(a^i=1/i, (b^\star )^i=1/i^2\) and \((b^\delta )^i=(b^\star )^i+C/i\), with \(C={\delta /}{ \sqrt{\sum _{j=1}^{+\infty } 1/j^2}}\).
Apply the algorithm with step-sizes \(\sigma >0\) and \(\tau >0\) such that \(\sigma \tau < 1/\left\| A \right\| _{ }^2\), and notice that this implies \(\sigma \tau < 1/(a^i)^2\) for every \(i\in \mathbb {N}\). As \(a^i>0\) for every \(i\in \mathbb {N}\), the coordinates of the averaged sequence \((\hat{x}_k^i)\) converge to a solution of the following (one-dimensional) optimization problem:
Hence, for the primal-dual algorithm, if \(x^{\delta }\notin \ell ^2\), then \((\hat{x}_k)\) diverges. Indeed, suppose by contradiction that \((\hat{x}_k)\) is bounded. Since it is bounded and converges coordinate-wise to \(x^{\delta }\), it weakly converges to \(x^{\delta }\). But this is not possible, since \(x^{\delta }\) is not in \(\ell ^2\).
Note that the problem considered in this example can also be treated by the Landweber method; it is well known that the iterates generated by that method, while different from those of the primal-dual algorithm, diverge as well.
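The divergence can be observed numerically on a finite truncation of the diagonal example. The sketch below uses the explicit choices \(a^i=1/i\), \((b^\star )^i=1/i^2\), \((b^\delta )^i=(b^\star )^i+C/i\) from the text; the truncation level, noise level and step sizes are illustrative assumptions:

```python
import numpy as np

def averaged_norms(n=2000, n_iter=500, delta=0.1):
    """Norms of the averaged primal iterates for the truncated diagonal
    example: a^i = 1/i, (b*)^i = 1/i^2, (b^delta)^i = (b*)^i + C/i, so the
    noisy 'solution' x^delta, with coordinates 1/i + C, is not in ell^2."""
    i = np.arange(1, n + 1)
    a = 1.0 / i
    b_star = 1.0 / i ** 2
    C = delta / np.sqrt(np.sum(1.0 / i ** 2))
    b_delta = b_star + C / i
    tau = sigma = 0.9 / a.max()          # tau * sigma * ||A||^2 < 1
    x = np.zeros(n)
    y = np.zeros(n)
    y_prev = np.zeros(n)
    x_avg = np.zeros(n)
    norms = []
    for k in range(n_iter):
        y_tilde = 2.0 * y - y_prev
        # prox of tau * 0.5 ||.||^2 is v -> v / (1 + tau); F = 0 here
        x = (x - tau * a * y_tilde) / (1.0 + tau)
        y_prev, y = y, y + sigma * (a * x - b_delta)
        x_avg += (x - x_avg) / (k + 1)
        norms.append(np.linalg.norm(x_avg))
    return norms
```

As the iterations proceed, \(\Vert \hat{x}_k\Vert \) climbs toward the norm of the truncated \(x^\delta \), which itself grows like \(C\sqrt{n}\) with the truncation level; in the untruncated \(\ell ^2\) problem the averaged iterates are unbounded.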
Sparse recovery
1.1 Proof of Proposition 16
Proposition 16
Fix a primal-dual solution \(\left( {x}^\star , {y}^\star \right) \in \mathcal {S}^\star \). Let the extended support be \(\Gamma := \{i \in \mathbb {N}: | \left( A^*{y}^\star \right) _i | = 1 \}\) and the saturation gap be \(m:= \sup \left\{ | \left( A^*{y}^\star \right) _i |: | \left( A^*{y}^\star \right) _i | < 1 \right\} \). Then \(\Gamma \) is finite, and \(m < 1\). Moreover, for every \(x\in \mathcal {X}\), with \(\Gamma _C:= \mathbb {N}{\setminus } \Gamma \),
Proof
Recall that \({x}^\star , {y}^\star \) is a primal-dual solution, hence \(-A^* {y}^\star \in \partial \left\| {x}^\star \right\| _1\). For every \(i \in \mathbb {N}\) we have that \(\left[ \partial \left\| \cdot \right\| _{ 1 }\right] _i({x}^\star ) \subseteq \left[ -1,1\right] \) and so \(| \left( A^*{y}^\star \right) _i |\le 1\). Recall that \(\Gamma _C:=\mathbb {N}{\setminus } \Gamma \). As \(A^*{y}^\star \) belongs to \(\mathcal {X}=\ell ^2(\mathbb {N}; \mathbb {R})\), we have
Indeed, \(m\le 1\) by definition and, from Eq. (73), the coefficients \(| \left( A^*{y}^\star \right) _i |\) converge to 0 (and so they cannot accumulate at 1). We also have that
\(\square \)
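In finite dimensions, the extended support \(\Gamma \) and the saturation gap m of a given dual certificate \(\eta = A^*{y}^\star \) are straightforward to compute; the helper below is an illustrative sketch (the function name, tolerance, and example certificate are assumptions):

```python
import numpy as np

def extended_support_and_gap(eta, tol=1e-10):
    """Given a dual certificate eta = A^T y_star with ||eta||_inf <= 1,
    return the extended support Gamma = {i : |eta_i| = 1} and the
    saturation gap m = max{|eta_i| : |eta_i| < 1} (0 if no such entry)."""
    abs_eta = np.abs(eta)
    gamma = np.flatnonzero(abs_eta >= 1.0 - tol)       # saturated coordinates
    off = abs_eta[abs_eta < 1.0 - tol]                 # non-saturated ones
    m = off.max() if off.size else 0.0
    return gamma, m

# Illustrative certificate, saturated on coordinates 0 and 3:
eta = np.array([1.0, 0.2, -0.5, -1.0, 0.7])
gamma, m = extended_support_and_gap(eta)
# gamma -> array([0, 3]), m -> 0.7
```

In the infinite-dimensional statement, finiteness of \(\Gamma \) and \(m<1\) come precisely from \(A^*{y}^\star \in \ell ^2\), which forces the coefficients to vanish.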
1.2 Tikhonov regularization: Lasso
For Tikhonov regularization, the results in terms of Bregman divergence and feasibility are the following.
Lemma 33
([40], Lemma 3.5) Let \(A{x}^\star ={b}^\star \), \(-A^*{y}^\star \in \partial \left\| \cdot \right\| _1({x}^\star )\) and, for \(\alpha >0\),
Then it holds that
The previous bounds, combined with Assumption 17 and the last inequality in Lemma 18, lead naturally to the following corollary.
Corollary 34
([40], Theorem 5.6) Suppose Assumption 17 holds. Then, for \(x_\alpha \) defined as in Lemma 33 and \(C:=\alpha /\delta \),
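For comparison with the iterative approach, the Tikhonov-regularized estimator \(x_\alpha \) can be computed by any Lasso solver. The sketch below assumes the standard Lasso form \(x_\alpha \in {{\,\textrm{argmin}\,}}_x \tfrac{1}{2}\Vert Ax-b^\delta \Vert ^2+\alpha \Vert x\Vert _1\) and uses a generic proximal-gradient (ISTA) scheme, not an algorithm from the paper:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, alpha, n_iter=5000):
    """ISTA for x_alpha in argmin_x 0.5 ||A x - b||^2 + alpha ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / L, with L = ||A||_op^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # forward (gradient) step on the data-fit term, then prox of alpha||.||_1
        x = soft_threshold(x - step * (A.T @ (A @ x - b)), step * alpha)
    return x
```

Note the practical contrast with iterative regularization: here each value of the regularization parameter \(\alpha \) requires solving a new optimization problem, whereas early stopping explores the whole regularization path along a single run.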
Proofs of Sect. 7
1.1 Proof of Corollary 21
Corollary 21
Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by Eq. (12) with data b under Assumption 7, Assumption 8 and summable error (\((\varepsilon _k)\in \ell ^1\)). Denote by \((\hat{x}_k,\hat{y}_k)\) the averaged iterates. Then, every weak cluster point of \((\hat{x}_k,\hat{y}_k)\) belongs to \(\mathcal {S}\). In particular, if \(\mathcal {S}=\emptyset \), then the primal-dual sequence \((\hat{x}_k,\hat{y}_k)\) diverges: \(\left\| (\hat{x}_k,\hat{y}_k) \right\| _{ }\rightarrow +\infty \).
Proof
From Lemma 31, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), we have
where \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2) \ge 0\) by Assumption 8. Using Jensen’s inequality, we get
Let \(\left( x_{\infty },y_{\infty }\right) \) be a weak cluster point of \((\hat{x}_k,\hat{y}_k)\); namely, there exists a subsequence \((\hat{x}_{k_j},\hat{y}_{k_j})\subseteq (\hat{x}_k,\hat{y}_k)\) such that \((\hat{x}_{k_j},\hat{y}_{k_j})\rightharpoonup (x_{\infty },y_{\infty })\). By weak lower-semicontinuity of R and F, for every \((x, y)\in \mathcal {X}\times \mathcal {Y}\),
Thus \((x_{\infty },y_{\infty })\) is a saddle-point for the Lagrangian.
Now suppose that the set of saddle-points of \(\mathcal {L}\) is empty. Assume also, for contradiction, that \((\hat{x}_k,\hat{y}_k)\) does not diverge. Then we can extract a bounded subsequence, which in turn admits a weakly convergent subsequence. But then the limit is a saddle-point, contradicting the assumption. \(\square \)
1.2 Proof of Lemma 22
Lemma 22
Let Assumption 4 hold. Assume that \(\tilde{\mathcal {C}}\ne \emptyset \). Let \(\left( x_k\right) \) be the primal sequence generated by algorithm (35); namely, with \(T=\tau {{\,\textrm{Id}\,}}\), \(\Sigma = \sigma {{\,\textrm{Id}\,}}\), \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\) and \(y_0=y_{-1}+\sigma (Ax_0-b)\). Then, there exists a primal sequence \(\left( u_k\right) \) generated by the same procedure but applied to problem \(\tilde{\mathcal {P}}\) (as stated in (36)) such that \(x_k=u_k\) for every \(k\in \mathbb {N}\).
Proof
As \(\tilde{\mathcal {C}}\ne \emptyset \), there exists \(x^b\in \mathcal {X}\) such that \(A^*Ax^b=A^*b\). First consider the algorithm in (35). Note that, for every \(k\in \mathbb {N}\), \(\tilde{y}_k=y_k+\sigma \left( Ax_k-b\right) \) and multiply the last step by \(A^*\). We get, for every \(k\in \mathbb {N}\),
Recall that \(S:=(A^*A)^{\frac{1}{2}}\) and introduce \(p_k:=A^*y_k\). Then the primal sequence \(\left( x_k\right) \) is equivalently defined by the following recursion: given \(x_0\) and \(p_{0}=A^*y_{0}\), for every \(k\in \mathbb {N}\),
As \(A^*y_{-1}\) belongs to \(R(A^*)\) and \(R(A^*)=R(S)\) ([30], Prop 2.18), there exists \(v_{-1}\) such that \(Sv_{-1}=A^*y_{-1}\). Now consider the primal-dual algorithm applied to problem (36) starting at \(u_0=x_0\), \(v_{-1}\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^b)\). It reads as: for every \(k\in \mathbb {N}\),
Then, noticing that \(\tilde{v}_k=v_k+\sigma \left( Su_k-Sx^b\right) \) and multiplying the last step by S,
Define the change of variable \(q_k:=Sv_k\), so that \(q_{-1}=Sv_{-1}=A^*y_{-1}\) and
Then the primal sequence \(\left( u_k\right) \) is alternatively defined by the following recursion: for every \(k\in \mathbb {N}\),
Comparing Eq. (78) with Eq. (79), with \((u_0,q_0)=(x_0,p_0)\), we get the claim.
\(\square \)
1.3 Proof of Theorem 23
Theorem 23
Let Assumption 4 hold. Assume that \(\tilde{\mathcal {P}}\) (as stated in 32) admits a saddle-point; namely, that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that
Let \((x_k,y_k)\) be the sequence generated by Eq. (35), namely with initialization \(y_0=y_{-1}+\sigma (Ax_0-b)\), and under Assumption 8. Denote by \((\hat{x}_k)\) the averaged primal iterates. Then there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{x}_k\rightharpoonup \tilde{x}_{\infty }\). Moreover, if \(\mathcal {P}= \emptyset \), then \(\hat{y}_k\) diverges.
Proof
From Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) generated by Eq. (35) coincides with the primal iterate of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem (36). Notice that \(\left\| S \right\| = \Vert (A^*A)^\frac{1}{2} \Vert = \left\| A \right\| \) and so, if Assumption 8 holds, the analogue also holds for problem (36): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\). The same is true for Assumption 5. Indeed, defining \(\bar{v}=S\tilde{v}\), \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\). Moreover, we have seen already that \(A^*Ax=A^*b\) if and only if \(Sx=Sx^b\), where \(x^b\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^b=A^*b\). Then, \(S\tilde{x}=Sx^b\) and \((\tilde{x},\bar{v})\) is a saddle-point for (36). So, by Proposition 10, we know that the averaged primal-dual sequence \((\hat{u}_k,\hat{v}_k)\) weakly converges to a saddle-point for (36). In particular, there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{u}_k\rightharpoonup \tilde{x}_{\infty }\) and so the same holds for \((\hat{x}_k)\). For the second claim, by assumption we have that \(\mathcal {P}=\emptyset \), which implies that \(\mathcal {S}=\emptyset \). All the assumptions of Corollary 21 are verified, so \((\hat{x}_k, \hat{y}_k)\) diverges. As \(\left( \hat{x}_k\right) \) is weakly convergent and so bounded, we conclude that \(\left( \hat{y}_k\right) \) has to diverge. \(\square \)
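The two linear-algebra facts used in the proof, \(\left\| S \right\| = \left\| A \right\| \) for \(S = (A^*A)^{1/2}\) and the equivalence \(A^*Ax = A^*b \Leftrightarrow Sx = Sx^b\), can be sanity-checked numerically in finite dimensions. A minimal sketch, computing the matrix square root via an eigendecomposition (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))    # underdetermined: A has a nontrivial null space
b = rng.standard_normal(5)

# S = (A^T A)^{1/2} via the eigendecomposition of the PSD matrix A^T A.
w, V = np.linalg.eigh(A.T @ A)
S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Fact 1: ||S||_op = ||A||_op (both equal the largest singular value of A).
print(np.linalg.norm(S, 2) - np.linalg.norm(A, 2))          # numerically zero

# Fact 2: A^T A x = A^T b  <=>  S x = S x_b, for any x_b solving A^T A x_b = A^T b.
x_b = np.linalg.pinv(A) @ b                                  # one such solution
n = (np.eye(8) - np.linalg.pinv(A) @ A) @ rng.standard_normal(8)  # null-space vector
x = x_b + n                                                  # another solution
print(np.linalg.norm(A.T @ A @ x - A.T @ b))                 # numerically zero
print(np.linalg.norm(S @ (x - x_b)))                         # numerically zero
```

The last two prints illustrate that perturbing a solution of the normal equations inside \(N(A)=N(S)\) changes neither \(A^*Ax\) nor \(Sx\), which is the content of the equivalence.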
1.4 Proof of Theorem 24
Theorem 24
Let Assumption 4 hold and suppose that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that
(namely, a saddle-point for the normal exact problem \(\tilde{\mathcal {P}}^{\star }\)). Let \(b^{\delta }\in \mathcal {Y}\) be noisy data such that \(\left\| b^{\delta }-b^{\star } \right\| _{ }\le \delta \) for some \(\delta \ge 0\). Moreover, suppose that \(\tilde{\mathcal {C}}^\delta \ne \emptyset \); namely, that there exists \(x^{\delta }\in \mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). Let Assumption 8 and Assumption 9 hold and let \((x_k,y_k)\) be the sequence generated by the algorithm in Eq. (35) on the noisy data \(b^{\delta }\); namely, with the initialization \(y_0=y_{-1}+\sigma (Ax_0-b^{\delta })\),
Denote by \((\hat{x}_k)\) the averaged primal iterates. Then,
and
where the constants involved in the bounds are specified in the proof.
Proof
From the assumption \(\tilde{\mathcal {C}}^{\delta }\ne \emptyset \) and Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) coincides with the primal iterate of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem
where \(x^{\delta }\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). As in the proof of the previous theorem, notice that \(\left\| S \right\| = \left\| A \right\| \) and so, as Assumption 8 and Assumption 9 hold by hypothesis, the analogue also holds for problem (80): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\), \(\xi - \tau (\xi L+ \sigma \left\| S \right\| ^2)\ge 0\) and \(\sigma (\eta - 1) - \sigma \xi \eta >0\). The same is true for Assumption 5. Indeed, define \(\bar{v}=S\tilde{v}\). Then, from Eq. (38), \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\) and \((\tilde{x},\bar{v})\) is a saddle-point for
In particular, we can apply Theorem 13 to \(\left( \hat{u}_k, \hat{v}_k\right) \), the averaged primal-dual sequence generated on the noisy problem in (80), with respect to \((\tilde{x},\bar{v})\), a saddle-point for the exact problem in (81), to get that
and
The constants in the previous bounds are the same as in (70) with \(z_0=(u_0,v_0)\), \(z^{\star }=(\tilde{x},\bar{v})\), \(C_3=C_7=0\) (because \(C_0=0\) as we suppose \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\)), \(\sigma _m=\sigma _M=\sigma \) and
From Lemma 22, we recall also that \(u_0=x_0\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^{\delta })\), where \(v_{-1}\) is any element in \(\mathcal {X}\) such that \(Sv_{-1}=A^*y_{-1}\) (\(v_{-1}\) exists due to \(R(A^*)=R(S)\)). Now it remains to show that \(\tilde{\delta }\le \delta \). Denote by \((\mu _i, f_i,g_i )_{i\in \mathbb {N}} \subseteq \mathbb {R}_+ \times \mathcal {X}\times \mathcal {Y}\) the singular value decomposition of the operator A. First, notice that \(S^2(x^{\delta }-\tilde{x})=A^*(b^{\delta }-b^{\star })\) and so that, for every \(i\in \mathbb {N}\),
Then, for every \(i\in \mathbb {N}\) such that \(\mu _i\ne 0\), \(\mu _i \langle x^{\delta }-\tilde{x},f_i\rangle =\langle b^{\delta }-b^{\star }, g_i \rangle \) and so
We conclude the claim simply by noticing that
and
\(\square \)
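The inequality \(\tilde{\delta }\le \delta \) established in the proof can be checked directly in finite dimensions, using \(\left\| Sz \right\| = \left\| Az \right\| \) (since \(S^2 = A^*A\)). A minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 10))
x_star = rng.standard_normal(10)
b_star = A @ x_star                                  # exact data b*
b_delta = b_star + 0.05 * rng.standard_normal(6)     # noisy data b^delta

# Solutions of the exact and noisy normal equations A^T A x = A^T b.
x_tilde = np.linalg.pinv(A) @ b_star
x_delta = np.linalg.pinv(A) @ b_delta

# ||S (x_delta - x_tilde)|| = ||A (x_delta - x_tilde)||, since S^2 = A^T A.
delta_tilde = np.linalg.norm(A @ (x_delta - x_tilde))
delta = np.linalg.norm(b_delta - b_star)
print(delta_tilde, delta)
```

Indeed \(A(x^{\delta }-\tilde{x})\) is the orthogonal projection of \(b^{\delta }-b^{\star }\) onto \(\overline{R(A)}\), so its norm never exceeds \(\delta \), matching the singular-value argument above.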
A dual view on the implicit bias of gradient descent on least squares
Here we provide an interesting view on why the “implicit” bias of gradient descent on least squares is not so implicit. Recall that these iterations,
converge, for \(\gamma < 2 / \left\| A \right\| ^2_{\textrm{op}}\), to the minimal Euclidean norm solution of \(Ax = b\):
provided that Problem (83) is feasible and \(x_0=0\).
It turns out that the iterations (82) correspond, up to multiplication by \(-A^*\), to the iterates of gradient descent on the dual of (83), namely:
By setting \(x_{k+1} = -A^* y_{k+1}\) one recovers the iterates of gradient descent on least squares (82). Therefore the “implicit bias” of gradient descent on least squares is not so implicit: its iterates \(x_k\) are dual to the iterates \(y_k\) on Problem (84), which is itself the dual of Problem (83), in which the bias appears explicitly.
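This implicit bias is easy to verify numerically: gradient descent on least squares, started at zero with an admissible step size, recovers the minimal Euclidean norm solution computed by the pseudoinverse. A minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 12))   # underdetermined, so A x = b has many solutions
b = rng.standard_normal(5)

# Gradient descent on 0.5 * ||A x - b||^2, initialized at x_0 = 0.
gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # gamma < 2 / ||A||_op^2
x = np.zeros(12)
for _ in range(30000):
    x = x - gamma * A.T @ (A @ x - b)

x_min_norm = np.linalg.pinv(A) @ b           # minimal Euclidean norm solution
print(np.linalg.norm(x - x_min_norm))        # numerically zero
```

Every iterate stays in \(R(A^*)\) (since \(x_0 = 0\) and each update adds a vector of the form \(A^*(\cdot )\)), which is exactly the dual correspondence \(x_k = -A^*y_k\) described above.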
Molinari, C., Massias, M., Rosasco, L. et al. Iterative regularization for low complexity regularizers. Numer. Math. 156, 641–689 (2024). https://doi.org/10.1007/s00211-023-01390-8