Abstract
One of the key assumptions in the stability and convergence analysis of variational regularization is the ability to find global minimizers. However, such an assumption is often not feasible when the regularizer is a black box or non-convex, making the search for global minimizers of the involved Tikhonov functional a challenging task. This is in particular the case for the emerging class of learned regularizers defined by neural networks. Instead, standard minimization schemes are applied, which typically only guarantee that a critical point is found. To address this issue, in this paper we study stability and convergence properties of critical points of Tikhonov functionals with a possibly non-convex regularizer. To this end, we introduce the concept of relative sub-differentiability and study its basic properties. Based on this concept, we develop a convergence analysis assuming relative sub-differentiability of the regularizer. The rationale behind the proposed concept is that critical points of the Tikhonov functional are also relative critical points and that for the latter a convergence theory can be developed. For the case where the noise level tends to zero, we derive a limiting problem representing first-order optimality conditions of a related restricted optimization problem. Besides this, we also give a comparison with classical methods and show that the class of ReLU networks is an appropriate choice for the regularization functional. Finally, we provide numerical simulations that support our theoretical findings and the need for the sort of analysis provided in this paper.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
In various scientific fields and applications, such as medical imaging or remote sensing, it is often not possible to obtain the desired quantity of interest directly. Assuming a linear measurement model, recovering the quantity of interest requires solving an inverse problem of the form
where is a linear operator between Hilbert spaces modeling the forward problem, ηδ is the data perturbation, is the noisy data and is the sought signal. In many cases these problems are ill-posed, meaning that no continuous right inverse of the operator K exists. To overcome this issue, several established approaches for the stable approximation of solutions of inverse problems exist.
1.1. Regularization with non-convex penalties
Particularly popular regularization techniques are variational methods [10, 22]. These methods recover regularized solutions as global minimizers of the Tikhonov functional
Here, is a regularizer which encodes prior information about the desired solution and plays the role of a data-discrepancy measure. Classically, regularizers have been hand-crafted, including L2-penalties, sparse regularization techniques or total variation [2, 10, 12]. While such hand-crafted regularizers are often convex and hence global minima can be computed by classical convex optimization, hand-crafted priors typically lack adaptability to available data.
In more recent years, there has been a shift to learned and potentially non-convex priors [4, 14–16, 18]. It has been observed that these methods often outperform classical methods. Moreover, a full convergence analysis has been provided [14, 18]. However, such an analysis assumes that minimizers of the Tikhonov functional are known, or at least given to within a certain accuracy. For non-convex regularizers such an assumption is unrealistic, as global minimizers are challenging to compute. Instead, when trying to find a regularized solution one often employs minimization algorithms such as gradient descent or variations thereof, which converge to critical points (such as local minimizers close to the initial guess) rather than to global minimizers of the Tikhonov functional. While one could constrain the learned regularizers to only include convex functionals [3, 16], this might result in suboptimal reconstructions when the underlying signal class is inherently non-convex. For such classes, non-convexity of the regularizer can be a highly desirable property and as such a convergence analysis for this case is needed. Importantly, such an analysis should not rely on the strict assumption that the regularized solutions are global minimizers of the underlying Tikhonov functional.
We briefly mention here that there exist other interesting cases where the Tikhonov functional is non-convex such as for example in the case of a nonlinear forward operator. However, in this paper we only consider the linear case. Besides this we also mention that there are potentially different ways to deal with non-convexity of Tikhonov functionals, for example, by use of convexification [13]. However, for the learned regularizers we have in mind (see results in section 4), the modification of the involved functionals is nontrivial in general. Besides, the modification of the learned functionals can change the original properties of the learned functional in an unfavorable way.
1.2. Proposed critical point regularization
In this paper we present a convergence analysis of critical points of the Tikhonov functional for the stable solution of inverse problems of the form (1.1). We refer to any such method, which recovers critical points as regularized solutions, as critical point regularization. In fact, we show stability and convergence for a relaxed notion of critical points. More precisely, we study stability and convergence of φ-critical points, namely elements satisfying . Here is the φ-relative sub-differential, a novel concept that we introduce and study in this paper. Whenever the classical norm-discrepancy is used to measure similarity, we show that, as the noise level tends to zero, regularized elements converge to elements with
resembling first-order conditions of the constrained optimization problem defining -minimizing solutions.
We give our analysis for more general data discrepancy measures, of which is only a special case. Further, we mention that in [9] a stability analysis for the case of local minima has been carried out. In contrast to our work, the authors of [9] restrict themselves to the finite-dimensional setting and do not provide convergence results for the case that the noise level tends to zero. Allowing the underlying space to be a general Hilbert space without any restriction on the dimension has the advantage that the analysis is dimension-independent and as such applies to any discretization used in practical applications. A precise analysis of the discretization is beyond the scope of this paper and we refer to the corresponding work in conventional variational regularization [4, 20].
Note that whenever the Tikhonov functional is relatively sub-differentiable, then critical points of the Tikhonov functional are also relative critical points if φ is constructed accordingly, and the proposed concept yields a convergent regularization for critical points. We will show that this is indeed the case, for example, for a class of learned regularizers defined by neural networks. We are not aware of any other study covering stability and convergence of critical points and, to the best of our knowledge, the present analysis is the first to attempt this.
1.3. Main contributions
In this paper we introduce the concept of relative sub-differentiability as a generalization of sub-differentiability of convex functions to the non-convex case. We develop theory for relative sub-differentiability and show that corresponding φ-critical points can be found by employing a generalized gradient descent method. From the viewpoint of regularization theory we give existence, stability and convergence results for φ-critical points and derive the limiting problem for critical point regularization. As opposed to the convex case, where the solutions one obtains are -minimizing solutions, we obtain as a limiting problem a related first-order optimality condition. As a special case of our analysis we derive stability and convergence results for critical points of differentiable Tikhonov functionals. For example, in this case, we get that is in the normal cone of the set of all solutions.
Finally, we provide numerical simulations which support our theoretical findings, in particular the stability, the convergence and the limiting problem. Moreover, the results of our numerical simulations show that even in simple cases of non-convex regularizers the assumption of obtaining global minima or even local minima is infeasible, thus further emphasizing the need for the analysis we provide in this paper. Besides, the numerical results show that the solutions we obtain cannot be expected to be -minimizing solutions and may even be local maxima of the regularizer whenever the initialization is chosen inappropriately and the algorithm does not guarantee that local minima are obtained.
1.4. Overview
The rest of the paper is organized as follows. In section 2 we motivate and introduce the concept of relative sub-differentiability and corresponding φ-critical points. Moreover, we study basic properties of relative sub-differentiability and show that φ-critical points can be obtained by employing a generalized gradient descent method. Section 3 builds on the concept of relative sub-differentiability and gives a convergence analysis for critical point regularization. Moreover, we take a closer look at the differentiable case and identify the limiting problem in this case. Section 4 gives a comparison with classical methods and shows that ReLU networks are appropriate choices for the regularization functional. In section 5 we provide numerical experiments which support our theoretical findings such as stability and convergence. Finally, we conclude the paper with a brief summary and outlook in section 6.
2. Relative sub-differentiability
In this section and in the rest of the paper, unless stated otherwise, we assume that is a Banach space, denote by its dual and by the dual pairing of and , i.e. for and we have . Moreover, we denote by the derivative of any differentiable function and for any similarity measure we denote by the derivative with respect to its first argument.
Before giving the crucial definition of relative sub-differentiability we recall the importance of classical sub-differentiability in the context of convex functions. Recall that is called subgradient of some functional at if for all and that is sub-differentiable whenever the set of subgradients is non-empty for all . Minimizers x of are characterized by the optimality condition where denotes the set of all subgradients at point x. However, sub-differentiability implies convexity. We will therefore develop a relaxed concept of sub-differentiability relative to some functional by replacing the right-hand side in the definition of subgradients by .
2.1. Definition and basic properties
The following concept, generalizing sub-differentiability, is also applicable to non-convex functions.
Definition 2.1 (Relative sub-differentiability). Let and .
- (a)is called φ-relative subgradient of at if
- (b)The set of all φ-relative subgradients at x is denoted by and called the φ-relative sub-differential of at x.
- (c)The functional is called φ-relative sub-differentiable if for all .
Some remarks about definition 2.1 are in order.
- We call any such function φ a bound. It is clear that such a bound cannot be unique, since whenever is a relatively sub-differentiable function with bound φ then it is also relatively sub-differentiable with bound for any .
- Choosing φ = 0 we see that any convex and sub-differentiable function is relatively sub-differentiable, i.e. the class of all relative sub-differentiable functions includes the set of convex sub-differentiable functions.
- The relative subgradients depend on the function φ. In particular, whenever we choose a larger φ we generally also enlarge the set of possible relative subgradients, i.e. if then .
- Similar to the concept of subgradients for convex functions, the concept of relative subgradients is a global property since the defining inequality has to hold for any point .
Another approach to generalizing convexity and subgradients (and, as a consequence, critical points) is given in [11, 24], where convexity with respect to a set of functions W is defined. In such a setting is a subgradient of at x whenever for any . As a consequence any critical point, i.e. a point where 0 is a subgradient, will also be a global minimizer, and hence such a generalization cannot be used for our purposes. In [7] another concept of generalized gradients is discussed. In this setting the definition of the gradient depends only on neighborhoods around the point of interest. As a consequence we cannot expect the critical points to have any global properties, which are necessary for the analysis in section 3, making this generalization unfit for our analysis. However, it should be noted that, whenever convenient, one might substitute any differentiability assumption on the involved functionals with Clarke's generalized gradient concept in any of the following discussions.
In what follows we will assume that is φ-relatively sub-differentiable for some fixed φ. Based on this definition we generalize the concept of critical points as follows.
Definition 2.3 (φ-critical points). We call a φ-critical point of if . Moreover, we denote by the set of all φ-critical points of .
It should be noted that the definition of φ-critical points depends on φ and in practical applications one might not have access to φ. In such cases evaluating or finding relative subgradients might be infeasible. Nevertheless, the concept of φ-critical points is general enough to include an important class of points as the following remark illustrates.
Remark 2.4 (Critical points of differentiable functions). Let us assume that is a differentiable function which satisfies the inequality for any and some . Then we have . This shows that in this special case we have access to at least one element of . In particular, any critical point of , i.e. a point with , will always yield a φ-critical point of in the sense of definition 2.3 and hence definition 2.3 is a generalization of the classical concept of critical points for differentiable functions satisfying above inequality.
This shows that for a class of functions we have access to at least one element of the relative subgradient of . More importantly, for this class of functions we can make assertions about the points where holds, i.e. points which are reachable by use of a (minimization) algorithm which guarantees to find a critical point.
Before we move on, we briefly give a prototypical example of a non-convex function for which a bound φ can be chosen, such that .
Remark 2.5 (Examples of relative sub-differentiability). We start by giving a simple example of a function which is non-convex but relatively sub-differentiable. To this end, let be given and define . It is readily seen that is a polynomial of degree 4 with negative leading coefficient. Hence, this function is bounded from above and the relative sub-differentiability immediately follows by, for example, choosing . Clearly then, the function is also sub-differentiable for c > 0. The function g is plotted in figure 1 on the left side for different parameters on a semi-logarithmic scale to emphasize the non-convexity.
Now let us consider the function , see figure 1 on the right. Then, due to the coercivity and the existence of critical points 'at infinity', the derivative of cannot be in the relative subgradient of for any φ. This example illustrates which types of functions are not covered by the concept of relatively sub-differentiable functions when the derivative is supposed to lie in the relative subgradient. In particular, the concept of relative sub-differentiability excludes coercive functionals which have critical points 'at infinity'.
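To make the defining inequality concrete, the following sketch checks it numerically for an illustrative choice of our own (not the quartic from the remark): g = sin is non-convex, and Taylor's theorem with |sin''| ≤ 1 shows that its derivative is a φ-relative subgradient for the quadratic bound φ(v, x) = ½(v − x)².

```python
import numpy as np

# Illustrative choice (not from the remark): g = sin is non-convex, but
# Taylor's theorem with |sin''| <= 1 gives
#   sin(v) >= sin(x) + cos(x)*(v - x) - 0.5*(v - x)**2,
# i.e. cos(x) is a phi-relative subgradient with phi(v, x) = 0.5*(v - x)**2.
g, dg = np.sin, np.cos
phi = lambda v, x: 0.5 * (v - x) ** 2

rng = np.random.default_rng(0)
v, x = rng.uniform(-10.0, 10.0, size=(2, 100_000))

# Defining inequality of a phi-relative subgradient, checked pointwise
# (small tolerance for floating-point round-off).
ok = g(v) >= g(x) + dg(x) * (v - x) - phi(v, x) - 1e-12
print(ok.all())
```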
Before discussing how one might obtain φ-critical points of relatively sub-differentiable functions, we list some useful properties of which we make constant use during the rest of the paper.
Lemma 2.6 (Basic properties of relative subgradients). Let and be bounds of for and w > 0. Moreover, set . Then the following hold
- (1)
- (2)
- (3)If is convex then for any
- (4)is convex and (weak*) closed
- (5)If then
- (6)
- (7)If is Lipschitz and φ bounded on bounded subsets, then is bounded. In particular, in this case the set is weak*-compact.
- (8)Let where and with . Assume that xk converge weakly to and gk converge to g and that is weakly lower semi-continuous. Then . If, instead, xk converge strongly to , gk converge weakly to g and is lower semi-continuous then we also have .
Proof. (1) Let and define . Then we have
and hence the claim follows.
(2) Assume that . Then we have by non-negativity of w and hence . Now let then we define and it follows
which shows that .
(3) This is an immediate consequence of by non-negativity of φ.
(4) Let and . Then we have
which proves the convexity of . Now let us assume that with pk (weak*) converging to p. By (weak*) convergence we have and hence p is also a φ-relative subgradient.
(5) This is also a consequence of the non-negativity of φ and the assumption that is a global minimizer.
(6) Let . Then by definition we have for any and hence also . On the other hand, if then we have for any and hence .
(7) Let and set with . Using the defining inequality we find
and thus by taking the supremum over v we find that is bounded. Using the Banach–Alaoglu theorem we see that must be weak*-compact.
(8) By assumption xk is bounded. Thus, we have
which proves the claim.
Lemma 2.6 gives us a characterization of φ-critical points as points x for which is an upper bound of . This characterization in particular implies that for any differentiable and relatively sub-differentiable function we have that the points x with must have bounded value independent of x. Comparing this to the convex case we have that x is a critical point of the function if and only if x is a global minimizer. In some sense, the definition of φ-critical points allows for some error to be made and guarantees that φ-critical points cannot have arbitrarily large -value. Moreover, whenever is coercive then all φ-critical points must be inside some ball for some r > 0.
2.2. Computation of φ-critical points
We next answer the question of how to obtain φ-critical points, at least for the case where is a Hilbert space. Clearly, if is differentiable then one could consider classical gradient descent methods. Since we are also interested in non-differentiable functions, gradient descent in its classical form may not be applicable. Below we show that a generalized gradient method using relative subgradients instead of gradients yields φ-critical points in the sense of definition 2.3. This shows that the algorithm is a natural extension of subgradient descent [6, 23].
Algorithm 1. Relative subgradient descent.
Require: Starting point , stepsizes
while do
  Choose and such that
end while
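A minimal numerical sketch of the scheme, under simplifying assumptions of our own: the functional is differentiable, its gradient is used as the relative subgradient, and the step sizes t_k = 1/(k+1) are square-summable but not summable, matching the step-size condition used in theorem 2.7.

```python
import numpy as np

def relative_subgradient_descent(subgrad, x0, steps=5000):
    """Sketch of the relative subgradient method: the update
    x_{k+1} = x_k - t_k * p_k with p_k a relative subgradient and
    step sizes t_k = 1/(k+1) (square-summable but not summable)."""
    x = x0
    for k in range(steps):
        x = x - subgrad(x) / (k + 1)
    return x

# Illustrative smooth non-convex functional (our own choice); its
# gradient serves as a relative subgradient as in remark 2.4.
F = lambda x: x**2 + 4 * np.sin(x)
dF = lambda x: 2 * x + 4 * np.cos(x)

x_star = relative_subgradient_descent(dF, x0=3.0)
print(abs(dF(x_star)) < 1e-3)   # the iterates approach a critical point
```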
The following result shows that the algorithm converges to a φ-critical point of the function . The given proof closely follows the one given in [6], but does not assume a finite-dimensional setting and considers relatively sub-differentiable functionals instead of sub-differentiable functions.
Theorem 2.7 (Convergence of Algorithm). Assume that is a Hilbert space and that is relatively sub-differentiable with bound φ. Moreover, choose in algorithm with such that for all . Then for any point and any step we have
Proof. Let . After rescaling of according to assumption we may assume that . Then by definition of we have
Applying this inequality recursively and using the fact that we find
which together with the inequalities and shows the desired result.
Theorem 2.7 shows that, under the assumption that the sequence of step-sizes is square-summable but not summable, in the limit we have for any . Note that the analysis and the proofs heavily rely on the functional φ; however, at no point during the algorithm do we need explicit knowledge of φ, but only access to elements of . In particular, in the case of remark 2.4, when using the gradient of as the update direction, the generated sequence will yield a φ-critical point.
Finally, assume that we have a functional of the form where each term is relatively sub-differentiable with bounds and . Then lemma 2.6 shows that for and we have . This implies that the algorithm can be applied in the case where we are looking for a φ-critical point of the sum of two relatively sub-differentiable functionals and only have access to elements of and .
3. Regularizing properties of φ -critical points
In this section we present a convergence analysis for φ-critical points of Tikhonov-type functionals, extending the existing analysis for global minima [22]. At this point we want to emphasize again that the assumption of being able to obtain global minima of can be extremely restrictive when is non-convex, and the main goal of our analysis is to dispense with this assumption. Instead we focus only on φ-critical points of , which may include local minimizers, saddle points or even local maxima.
Recall that we are interested in Tikhonov-type functionals of the form
for given and . Here, is a similarity measure between x and yδ and a standard situation we are interested in is where is the forward operator of the inverse problem of interest. Instead of working with global minima of the functional (3.1) we consider regularized solutions as αφ-critical points of , meaning
We will analyze stability and convergence of such critical points.
For the analysis we make the following assumptions.
Condition 3.1 (Critical point regularization)
- is a reflexive Banach space and is a metric space with metric
- is weakly sequentially lower semi-continuous
- is relatively sub-differentiable with bound φ
- is weakly sequentially lower semi-continuous, convex in its first argument and continuous in its second argument
- and the functional is coercive, i.e. for
Most of the assumptions in condition 3.1 are classical assumptions (or generalizations thereof) made for the analysis of variational methods, see e.g. [14, 18, 22]. For example, the coercivity assumption only poses a condition on the involved functionals 'at infinity' and as such does not restrict, say, the behaviour in a ball around 0. This means that the function can be highly non-convex as long as it grows fast enough outside bounded sets. The major differences in the analysis provided here are that is relatively sub-differentiable, as motivated in section 2, and that in general the regularized solutions are not global minima but only φ-critical points.
One of the simplest and most commonly used examples of a similarity measure satisfying Assumptions (C4) and (C5) is given by , whenever is the linear forward operator of the underlying inverse problem and is a Banach space. In general, any similarity measure of the form satisfies these assumptions if is a linear and bounded operator, e.g. a reweighting of the residual .
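As an illustration of the regularized solutions considered here, the following self-contained sketch (the operator, data, regularizer and all parameters are hypothetical choices of our own) pairs the norm discrepancy with a smooth non-convex regularizer and runs gradient descent on the Tikhonov functional; the returned point satisfies the first-order condition approximately, without any claim of being a global minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy inverse problem (all sizes and choices are our own):
K = rng.normal(size=(20, 40))                 # underdetermined forward operator
x_true = np.zeros(40); x_true[[3, 17]] = 1.0  # sparse ground truth
y = K @ x_true + 0.01 * rng.normal(size=20)   # noisy data y^delta

alpha = 0.1
dD = lambda x: K.T @ (K @ x - y)     # gradient of D(Kx, y) = 0.5*||Kx - y||^2
dR = lambda x: 2 * x / (1 + x**2)    # gradient of the smooth non-convex
                                     # regularizer R(x) = sum(log(1 + x_i^2))

x = np.zeros(40)
tau = 1.0 / np.linalg.norm(K, 2) ** 2    # step size below 1/L of the data term
for _ in range(50_000):
    x = x - tau * (dD(x) + alpha * dR(x))

# The iterate satisfies the first-order (critical point) condition of the
# Tikhonov functional approximately; for a non-convex R it need not be a
# global minimizer.
grad_norm = np.linalg.norm(dD(x) + alpha * dR(x))
print(grad_norm < 1e-4)
```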
We now turn our focus to the stability and convergence analysis of the considered method, i.e. . We start with existence and stability results.
3.1. Existence and stability
Theorem 3.2 (Existence). Under condition 3.1 the problem is well-posed, i.e. for every α > 0 and the set is non-empty.
Proof. This is an immediate consequence of the existence of minimizers of which follows from the coercivity and the continuity assumptions on the functional . A more detailed proof can be found in [22].
Clearly, αφ-critical points may exist under weaker assumptions than a coercivity assumption. However, the coercivity is an important property in the following analysis which guarantees the existence of a weakly convergent subsequence whenever the sequence is bounded. As such we have also derived existence of φ-critical points using the coercivity. Extending the current analysis to the case of non-coercive functionals is subject to future work.
Another advantage of using φ-critical points as opposed to global minima, besides being numerically and hence practically more tractable for non-convex functionals, is that we have a simple way of talking about 'inexact' critical points, i.e. points where the gradient is small but not necessarily 0. As it turns out, the following analysis can be performed under the even weaker assumption that the stabilized solutions are 'inexact' critical points instead of exact critical points.
Theorem 3.3 (Stability). Let and and assume that is such that with and . Then the sequence has a weakly convergent subsequence and the limit of every weakly convergent subsequence is an αφ-critical point of .
Proof. To show the existence of a weakly convergent subsequence, using the reflexivity of , it is enough to show that is a bounded sequence. By coercivity of it is enough to show that is bounded. We have for any
and using it follows
By assumption on we have which yields
for any . By assumption we have and so the right hand side is bounded for k large enough. This shows that there exists some weakly convergent subsequence.
Let now denote such a subsequence and denote by its limit. Using the weak lower semi-continuity of the involved functionals it follows for any
where the last equality follows from continuity of in its second argument. This shows that .
Clearly, whenever , i.e. xk is an αφ-critical point, then the assumptions on zk in theorem 3.3 are satisfied. It follows that αφ-critical points are stable in the above sense. However, theorem 3.3 also shows that we do not need access to exact αφ-critical points but rather points which are in some sense close to an αφ-critical point.
Remark 3.4 (Inexact critical points obtained by use of minimization schemes). Consider once again the case of remark 2.4 and assume that the αφ-critical points are obtained by using gradient descent or any other algorithm which finds zeros of the gradient. Then we have and, whenever and , we have that the considered points have a weakly convergent subsequence. For practical applications this means that we have an easily verifiable condition which can be used as a kind of stopping criterion when searching for critical points. As a consequence, we do not have to guarantee that the regularized solutions are critical points, but only that they are 'close' to being critical points.
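The stopping rule suggested by the remark can be sketched as follows; the objective, step size and tolerance below are hypothetical choices of our own, and the criterion is simply that the gradient norm drops below a prescribed ε.

```python
import numpy as np

def find_inexact_critical_point(grad, x0, eps, tau, max_iter=100_000):
    """Run gradient descent until the easily verifiable stopping
    criterion ||grad T(x)|| <= eps holds, i.e. until the iterate
    is an 'inexact' critical point in the sense of remark 3.4."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return x, True
        x = x - tau * g
    return x, False

# Hypothetical smooth non-convex objective for illustration.
T = lambda x: 0.5 * np.sum(x**2) + np.sum(np.sin(x))
dT = lambda x: x + np.cos(x)

x, converged = find_inexact_critical_point(dT, np.full(5, 3.0),
                                           eps=1e-4, tau=0.4)
print(converged, np.linalg.norm(dT(x)) <= 1e-4)
```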
3.2. Convergence
The next goal is to show the convergence of the regularized solutions to a solution of the original problem as the noise level δ tends to 0. Here, we call an -solution of if . As in the case of theorem 3.3, the proof can be carried out under the weaker assumption of only having access to 'inexact' φ-critical points (see remark 3.4).
Theorem 3.5 (Convergence). Let and assume it has an -solution. Further, let with with . Choose such that for we have . Assume that the regularized solutions are such that with and .
Then the sequence has a weakly convergent subsequence and the limit of any such sequence is an -solution of y. Moreover, we have for any -solution u. Finally, whenever the -solution is unique then converges weakly to this solution.
Proof. Similar to the stability proof we show that is bounded by using the coercivity of the functionals . Following the above proof we find for any
and by choosing u such that we find which implies
Since both and are non-negative it then follows
where we have used the assumptions . This shows that is a bounded sequence and using once again we find that for the sequence is bounded. Using the coercivity of we get that the sequence is bounded and hence has a weakly convergent subsequence.
Finally, using the weak lower-semicontinuity of and we have that for any such weakly convergent subsequence with limit
for any with .
Whenever the solution is unique, then every subsequence of has a subsequence converging to this solution. This shows that converges weakly to the unique solution.
At this point, we want to emphasize once again, that the assumptions on the choice of points xk in theorem 3.5 are weaker than the assumption that xk is an -critical point and that in particular the analysis also holds for these points.
Since in this section we only assume that is relatively sub-differentiable, without explicit knowledge of the bound φ, theorem 3.5 gives a somewhat intangible condition on the type of solutions we obtain in the limit δ → 0. A more tangible condition, and more importantly one independent of φ, is given by the next theorem, where we assume a separability condition on the gradients . This separability assumption is satisfied in many cases, e.g. when and are (relatively sub-)differentiable and the φ-critical points arise from an algorithm such as gradient descent. Using these algorithms we are often in the situation that , where sk is a (sub-)gradient of and rk is a (relative sub-)gradient of . Assuming that the gradients of have a cluster point, we obtain the following additional property of these cluster points.
Theorem 3.6 (Normality property of the solution). Let the same assumptions as in theorem 3.5 hold and denote by a weakly convergent subsequence with limit . Let where and .
Then any cluster point r of the sequence satisfies , where is the normal cone of the convex set of all -solutions of y at .
Proof. Let r be a cluster point of the sequence . Then by weak lower semi-continuity of and by assumption on rk we have
which shows that .
Now assume that is such that . Then we have and it follows
where we have used the convexity of in its first argument and the assumption on the limits of the sequences and .
Theorem 3.6 shows that the solution obtained by critical point regularization satisfies some form of first-order optimality condition, see e.g. [21].
Also note that in the case where is convex and we choose φ = 0, both properties in theorems 3.5 and 3.6 reduce to the common property that is an -minimizing solution, i.e. for any with .
Remark 3.7 (Convex regularizers). Clearly, any sub-differentiable convex function is relatively sub-differentiable with the choice φ = 0. Nevertheless, one could also choose . With this choice we see that the results in theorem 3.5 roughly state that the solutions we approximate by using critical point regularization are -minimizing solutions up to an error ε whenever the regularized solutions are minimizers up to an error of ε.
This result, as opposed to classical variational regularization theory, e.g. [22], has the advantage that at no point do we require exact global minimizers of the functionals, but only approximate minimizers, which may be more easily reachable in practical applications. Consider, for example, the case where we employ an iterative algorithm with convergence guarantees of the form for the n-th iterate, being a minimizer of . Applying this algorithm to and requiring that , in order to ensure that xn is an αφ-critical point, we see that the above theory allows one to stop the iterative algorithm after a finite number of steps, i.e. we do not necessarily need to run the algorithm until convergence and we still obtain a stable and convergent regularization method.
At first it might seem a disadvantage that we do not obtain an -minimizing solution in the limit. However, this can be circumvented by considering a variable ε. To be more precise, following the proof of theorem 3.5 with and the condition as δ → 0, it is easy to see that, to obtain a sequence weakly converging to an -minimizing solution, it is enough to run the iterative algorithm for a number of iteration steps nk such that .
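The bookkeeping in this argument can be made concrete with a small sketch (the rate constant C and the coupling of the accuracy to the noise level are hypothetical choices of our own): given an inner solver with a guarantee of the form T(x_n) − min T ≤ C/n, stopping after n = ⌈C/ε⌉ steps certifies an ε-inexact minimizer, and letting ε tend to 0 with the noise level recovers the variable-ε argument above.

```python
import math

def iterations_needed(C, eps):
    """Smallest n with C / n <= eps, for a solver with the
    (hypothetical) rate guarantee T(x_n) - min T <= C / n."""
    return math.ceil(C / eps)

C = 50.0
for delta in [1e-1, 1e-2, 1e-3]:
    eps = delta                  # couple the accuracy to the noise level
    n = iterations_needed(C, eps)
    assert C / n <= eps          # the rate bound certifies eps-accuracy
    print(delta, n)
```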
We next discuss another special case of our analysis which pertains to functionals such as the one in remark 2.5.
3.3. Differentiable regularizers and classical critical points
In this subsection we consider the important special case where the φ-critical points are given by classical critical points, i.e. by points x for which , and we give stability and convergence results for this case. To this end, we assume that the bound φ can be constructed in such a way that , see e.g. remark 2.4. Then, if is differentiable in its first argument, by convexity of , we have . This shows that whenever we employ some algorithm which finds a classical critical point, we also obtain an αφ-critical point in the sense of definition 2.1 which satisfies the separability assumption necessary for theorem 3.6. In particular, these points are amenable to the analysis above.
Nevertheless, the analysis relies on an abstract concept of φ-critical points and even in the case where the involved functionals are differentiable we cannot guarantee that the limits will again be φ-critical points without any additional assumptions. In order to guarantee this we need the assumption that and are weakly (sequentially) continuous. Combining the above theorems we then get the following result.
Proposition 3.8 (Existence, stability and convergence for classical critical points). Assume that and are differentiable with weakly continuous derivatives and let condition 3.1 hold. Moreover, let and α > 0 and assume that y has an -solution. Then the following hold
- 1.Existence: has at least one φ-critical point.
- 2.Stability: If is a sequence converging to yδ and xk is such that as and , then has a weakly convergent subsequence and any weak cluster point of is a critical point of .
- 3.Convergence: Let be a sequence with and be such that for we have . Then, if we choose such that satisfies and the sequence has at least one weak cluster point and any such cluster point is an -solution of y with the following additional properties
- (a)
- (b)for any with , i.e. .
Proof. This follows immediately by applying theorems 3.2, 3.3, 3.5 and 3.6.
For the differentiable case this identifies the limiting problem we solve by regularizing the inverse problem with critical points, i.e. in the limit we find solutions which satisfy a first-order optimality condition of the constrained optimization problem
We now briefly discuss the special case where is given as the norm-discrepancy, e.g. in the case where is a Hilbert space.
Lemma 3.9 (Solution for norm discrepancy). Let the same assumptions as in proposition 3.8 hold and assume that for some p > 1 where is a linear and bounded forward operator between Banach spaces and is differentiable. Furthermore, denote by a solution according to proposition 3.8.
Then we have .
Proof. Any solution z can be written as where . By using x0 and , proposition 3.8 shows that for any and hence the claim follows.
4. Examples and comparison
In this section we compare the proposed regularization concept using αφ-critical points to standard Tikhonov regularization, its convex relaxation and discuss the influence of different choices for φ. Moreover, we discuss ReLU-regularizers as a class of non-convex regularizers for which the presented theory is applicable.
4.1. Dependence on the choice of φ
We begin this section with a comparison between classical Tikhonov regularization and regularization with φ-critical points using different choices for φ. For the sake of simplicity we consider a convex example, where the data-fidelity is chosen as and the regularizer is given by . We assume that α > 0 and are fixed and denote by the unique minimizer of the Tikhonov functional . Note that in the convex case, standard Tikhonov regularization corresponds to regularization with αφ-critical points for the choice .
Consider now the case where is constant. Then, according to the general theory, any αφ-critical point x of is characterized by
Writing we find after some rearrangements that this is in turn equivalent to
where and . Since is positive definite, this shows that x0 has to be chosen in an ellipsoid around 0 and thus is contained in some ellipsoid around . Here, the size of the ellipsoid depends on the choice of ε and, for example, choosing ε = 0 leads to the singleton .
Let now φ be an arbitrary non-negative function such that minimization of is well-posed with minimizer xφ . Then, following the steps above, we find that αφ-critical points of are characterized by
where is as above and . That is, the set of αφ-critical points is an ellipsoid around the point xφ where the size of the ellipsoid depends on the value . Depending on the choice of φ this value can even be 0. To see this, consider for example the case where and for some β > 0, where the maximum is taken pointwise. Then, if pointwise, we find that and we arrive at and (4.1) collapses to . In a similar fashion, the choice for some β > 0 leads to a point estimate whenever is non-negative and to an ellipsoid if has at least one negative entry.
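The ellipsoid structure can be checked numerically in the quadratic case. The sketch below (with made-up toy data, not the paper's example) verifies that for a quadratic Tikhonov functional the excess value over the minimum equals the Hessian-weighted squared distance to the minimizer, so the approximate minimizers up to a slack form an ellipsoid around the Tikhonov minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((8, 5))     # toy forward operator
y = rng.standard_normal(8)          # toy data
alpha = 0.3

H = K.T @ K + alpha * np.eye(5)         # Hessian of the Tikhonov functional
x_star = np.linalg.solve(H, K.T @ y)    # unique Tikhonov minimizer

def T(x):
    return 0.5 * np.linalg.norm(K @ x - y)**2 + 0.5 * alpha * np.linalg.norm(x)**2

# For a quadratic functional the excess over the minimum is exactly the
# H-weighted squared distance, so {x : T(x) <= T(x_star) + alpha*eps}
# is an ellipsoid centered at x_star.
x = x_star + rng.standard_normal(5)
excess = T(x) - T(x_star)
quad = 0.5 * (x - x_star) @ H @ (x - x_star)
```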
While we allow for the whole set of solutions defined as αφ-critical points of , in practical applications one typically only chooses a specific subset. More precisely, as shown in section 2.2 one can apply variants of gradient descent to construct critical points. In the convex situation this might lead to the same solution independent of the choice of φ since the classical gradient can be used as an αφ subgradient for any choice of φ. However, in the non-convex case using such algorithms will not find solutions which are global minima but only critical points as we discuss in the following subsection.
4.2. Relation to convex relaxation
Next we investigate the relation between critical point regularization and regularization with the convex relaxation by means of a non-convex example. Consider and . As non-convex regularizer we use a slightly perturbed double-well potential , where for and we define
Here, q controls the perturbation away from 0 of the local minimum at wi . An illustration of ri for and two different values of q is depicted in figure 2. While at this point the choice of the regularizer might seem somewhat arbitrary and contrived, we will argue later that the regularizer chosen here is a simplified model for a reasonable class of learned regularizers; see the discussion on ReLU-networks in section 4.4 and in particular remark 4.6.
For given we consider (classical) critical points of the Tikhonov functional defined by
Although ri is not differentiable at , with slight abuse of notation we will denote by the derivative of ri , which we define as at . In the following the gradient will be understood with this convention.
Remark 4.1. One can construct a specific φ such that . It is then guaranteed that any classical critical point is an αφ-critical point and hence a regularized solution consistent with the theory presented in this paper. Such a φ can be constructed by considering the defining inequality pointwise and constructing φi for ri . The calculations for φi are relatively simple but tedious. Since the exact form of φ is irrelevant for our purpose, we refrain from explicitly defining φ. Instead, we focus on solutions that arise when solving equation (4.2). Further note that the algorithm presented in section 2.2 with as relative subgradient is the same as classical gradient descent, which further substantiates the assumption that the constructed regularized solution is a solution of (4.2).
By definition of the operator and the regularizer, solutions of (4.2) can be computed component-wise via . Hence
An interesting observation is that for we have a choice between 0 and wi where the choice will, in general, only lead to a local minimizer instead of a global one; compare with figure 2 for q = 0.75. In essence, this shows that the regularized solutions one obtains are a subset of the solutions of equation (4.2) and need not be global minimizers.
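The phenomenon of gradient descent ending in a lifted local well can be reproduced with a simple stand-in potential. The double well r(t) = min(t², (t − w)² + q) below is our own simplified surrogate for the perturbed double-well potential ri, not its exact definition from the paper; it has a global minimizer at 0 and a local minimizer at w lifted by q.

```python
import numpy as np

# Stand-in perturbed double well: wells at 0 and w, the one at w lifted
# by q > 0, so t = 0 is the global minimizer and t = w only a local one.
w, q = 2.0, 0.75

def r(t):
    return np.minimum(t**2, (t - w)**2 + q)

def r_prime(t):
    # piecewise derivative; at the kink we pick the left branch
    return np.where(t**2 <= (t - w)**2 + q, 2.0 * t, 2.0 * (t - w))

def descend(t0, step=0.1, iters=500):
    """Plain gradient descent: it ends in whichever well it starts near."""
    t = t0
    for _ in range(iters):
        t = t - step * r_prime(t)
    return t

t_left = descend(-1.0)    # converges to the global minimizer 0
t_right = descend(3.0)    # converges to the lifted local minimizer w
```

Both limits are critical points, but only the first is a global minimizer, mirroring the discussion above.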
Next we consider regularization with the convex relaxation of , i.e. the convex hull . After some lengthy calculations, which we do not present here for the sake of brevity, we derive for the convex hull of ri the form
A comparison of ri and for two choices of the parameter q is given in figure 2 where the convex hull in both cases is visualized with a dashed line.
Solving the critical point equation (4.2) with replaced by its convex hull, we find that the solutions differ considerably, at least for the case . In this case, two regimes arise. First, whenever , the convex hull allows for arbitrary solutions . On the other hand, if , the convex hull forces the choice . This means that even the slightest perturbation of the value of will lead to a convex hull which loses this information. Comparing this to the solutions above, we see that in the non-convex case we can always choose independent of the value of q. This shows that there is a difference between the regularized solutions obtained from the convex hull of the regularizer and those obtained from the original regularizer .
4.3. Non-equivalence to Tikhonov regularization
One might conjecture that the proposed regularization with αφ-critical points of is equivalent to Tikhonov regularization for some other modified choice of the regularizer. While we cannot give a definite answer to this question at this point, we conjecture that this is not the case.
To support our hypothesis, let us analyze what would happen if the construction of αφ-critical points of were equivalent to the Tikhonov regularization with some regularizer . In this case, the limiting problems would also coincide and thus for any and we have
Denoting by xφ a solution of we have . Clearly, this is the case if and only if is minimal among all possible solutions. This essentially means that .
While such a choice can theoretically be used to characterize the limiting problem, we do not have access to xφ and therefore cannot work with in practice. A slightly more subtle problem is that depends on the exact data and as such cannot be used in the noisy case where α > 0. Thus, we conjecture that the proposed regularization is not equivalent to Tikhonov regularization independent of the choice .
4.4. ReLU-Networks as class of possible regularizers
Next, we demonstrate that ReLU networks form a class of non-convex, relatively subdifferentiable regularizers that fit within the theory presented in this paper. As discussed in [14, 15], such regularizers are a powerful tool in the context of classical variational regularization.
Let now , be further Hilbert spaces.
Definition 4.2 (Quasi-homogeneity). A function is quasi-homogeneous if there exists such that and . We call the quasi-derivative of f.
In definition 4.2 and below denotes the space of all bounded linear mappings from to . Quasi-homogeneity satisfies the following elementary rules.
Lemma 4.3 (Quasi-homogeneity and relative sub-differentiability). Let and be quasi-homogeneous, W and . Moreover, let and let be convex and sub-differentiable with for some C > 0, and subgradient selection . Then for some the following hold:
- (1)is quasi-homogeneous with .
- (2)is quasi-homogeneous with .
- (3)W is quasi-homogeneous.
- (4)is φ1-relative sub-differentiable with .
- (5)is φ2-relatively sub-differentiable with and .
Proof. These properties follow immediately from the triangle inequality and the defining properties of quasi-homogeneity and relative sub-differentiability.
Theorem 4.4 (Learned regularizers). Let be convex and sub-differentiable with for some C > 0 and . Let be quasi-homogeneous and A be affine and continuous for . Then
is quasi-homogeneous. Additionally, is relatively sub-differentiable.
Proof. Follows from repeated application of lemma 4.3.
The crucial assumption in theorem 4.4 is that the activation functions are quasi-homogeneous. This property is, for example, satisfied when the ReLU is chosen as the activation function, as we discuss in the following example.
Example 4.5 (ReLU regularizer). Consider the case defined by where max is to be understood pointwise. Then , where denotes pointwise multiplication with if and if x > 0 and g is again understood pointwise. The ReLU function is then quasi-homogeneous whenever the space has the following property: For any we have and . Examples of such spaces are for some at most countable set Λ and parameters . In particular, this also holds in the finite dimensional case . Thus, theorem 4.4 shows that ReLU networks are an appropriate choice to construct regularizers. We note here that the same also holds true when the ReLU activation functions are replaced by the more general class of parametric ReLU activation functions.
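A quick numerical sanity check of the quasi-homogeneity identity for the ReLU, under the assumption (suggested by the example) that quasi-homogeneity means f(x) = (Df(x))x with Df(x) the pointwise activation mask:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_quasi_derivative(x):
    # pointwise multiplication by 1 where x > 0 and by 0 where x <= 0;
    # the convention at x = 0 does not affect the identity below
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
lhs = relu(x)                          # f(x)
rhs = relu_quasi_derivative(x) * x     # (Df(x)) x
```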
Theorem 4.4 and lemma 4.3 imply that a relative sub-gradient of any ReLU regularizer can be evaluated with the chain rule. Since deep-learning frameworks such as PyTorch [19] and Tensorflow [1] are built on formal application of the chain rule, calculating elements G(x) with can be done by using backpropagation. Thus, the backpropagation procedure is an appropriate choice for any form of gradient descent used to find critical points of the given functional satisfying (4.2). This is for example of interest for learned regularizers [14–16, 18].
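To illustrate how a relative subgradient of a ReLU regularizer is obtained by the formal chain rule, the sketch below hand-codes backpropagation through a tiny one-hidden-layer ReLU network. The weights A, b, c and the regularizer R are illustration-only values, not a learned regularizer from the paper.

```python
import numpy as np

# Tiny ReLU regularizer R(x) = c^T relu(A x + b); A, b, c are made-up.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([0.5, -0.5, 0.2])
c = np.array([1.0, 2.0, 3.0])

def regularizer(x):
    return c @ np.maximum(A @ x + b, 0.0)

def regularizer_subgrad(x):
    """Formal chain rule / backpropagation: the inner quasi-derivative is
    pointwise multiplication by the ReLU activation pattern."""
    mask = (A @ x + b > 0).astype(float)
    return A.T @ (c * mask)

x = np.array([1.0, 0.3])
g = regularizer_subgrad(x)      # relative subgradient candidate
# sanity check against a directional finite difference (the activation
# pattern does not change for this x and direction, so this is exact)
v = np.array([1.0, 1.0])
eps = 1e-6
fd = (regularizer(x + eps * v) - regularizer(x - eps * v)) / (2 * eps)
```

In a deep-learning framework the same quantity would be produced by automatic differentiation, which is the point made in the paragraph above.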
Remark 4.6 (Learned regularizers as multi-well potentials). Using (parametric) ReLU activation functions, the network (4.3) is a composition of piecewise affine operators and as such is itself a piecewise affine operator. This means that the regularizer of theorem 4.4, as for example considered in [14], behaves like a 'multi-well potential' similar to the one considered in figure 2. That is, it behaves like a function with multiple local minima where ideally each local minimum is located at a desirable solution.
A reasonable strategy to find such a regularizer is to train a network to have local minima which are located at the desired solutions. However, due to various difficulties during this process (e.g. training terminating in a local minimum of the loss function, non-ideal network architectures) one would also expect the regularizer to take slightly different values at the desired solutions. This means that even if the local minima are located at the desired solutions, one cannot expect all of these local minima to have the same value, much less expect each of them to be a global minimum of the regularizer. In other words, one should expect slight perturbations as in figure 2.
5. Numerical simulations
The goal of this section is not to show that non-convex regularizers can improve the reconstructions, but rather to test the theory derived in the previous sections and to show what may happen when non-convex regularizers are chosen.
To this end, we consider discretized versions of two toy problems in 1D. We first consider an inpainting problem where around of the signal entries were randomly removed. In this case the kernel of the forward operator K is simple to compute, and by using a separable prior we can easily study the properties of the solution obtained in the limit. This makes the first toy problem ideal for testing whether the properties of the limiting solution (as described in the theory section) hold true.
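The kernel structure exploited here can be made concrete. The following sketch (with an arbitrary removal pattern and signal length, not the paper's setup) builds an inpainting operator as a row-selection matrix and confirms that its kernel is spanned by the standard basis vectors at the removed indices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                                  # illustrative length
keep = np.sort(rng.choice(N, size=12, replace=False))   # observed indices
K = np.eye(N)[keep]      # inpainting operator: selects the kept entries

removed = np.setdiff1d(np.arange(N), keep)
# K e_i = 0 exactly for every removed index i, so the kernel of K is
# spanned by the standard basis vectors at the removed positions.
kernel_ok = all(np.allclose(K @ np.eye(N)[i], 0.0) for i in removed)
```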
Further, we consider recovering a signal from its cumulative sum. Since this matrix is invertible, there is a unique solution and, following theorem 3.5, we should observe convergence to this solution in the limit δ → 0. This toy problem is therefore well suited to study whether the given φ-critical points actually converge to the unique solution.
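A minimal sketch of the cumulative-sum forward operator (the size is illustrative): it is a unit lower-triangular matrix of ones, hence invertible, and the unique solution is recovered by differencing:

```python
import numpy as np

N = 8
K = np.tril(np.ones((N, N)))   # cumulative sum: (K x)_i = x_1 + ... + x_i

x = np.arange(1.0, N + 1)      # toy signal
y = K @ x                      # exact data
# K is unit lower-triangular, hence invertible with det(K) = 1, so the
# inverse problem has a unique solution, recovered by differencing:
x_rec = np.diff(y, prepend=0.0)
```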
For both problems we consider as the signal to recover the discretization of the function on using N = 512 equidistant sample points. We denote this signal by and the true underlying data by where K is the forward operator of the corresponding problem.
For each problem we consider the similarity measure given by and we construct a regularizer by . Here, we define for . The function is constructed in such a way that it is non-convex but relatively sub-differentiable, see remark 2.5. Figure 5 shows the function with parameters ρ = 2 and for where the y-axis is plotted on a logarithmic scale in order to emphasize the non-convexity. We can see that this function has a global minimum at around , a local minimum close to and another critical point in the interval . The parameters ρ = 2 and are used for all the following simulations.
As a separable sum of relatively sub-differentiable and non-convex terms the regularizer as defined above is relatively sub-differentiable and non-convex. By definition of it is further coercive and hence the functional is coercive. This shows that condition 3.1 is satisfied and we consider the stability and convergence of the φ-critical points according to theorems 3.3 and 3.5.
To simulate noisy data we consider the data where , ξ is a normally distributed random variable and for .
Since a bound φ for can be chosen such that (see remark 2.5), we can simply search for a classical critical point of in order to obtain φ-critical points. To achieve this, we apply Newton's method, e.g. [8], and we find an initial guess for Newton's method by applying Nesterov accelerated gradient descent [17] to the starting point . In the following we denote by a critical point of and by xα a critical point of . Here, xα is considered as the limit point for the stability considerations, for which we consider the choices . In order to test for convergence we chose for . For the convergence simulations we consider as the limit point the signal for the cumulative sum problem, as in this case the solution is unique, and we construct an approximate solution for the inpainting problem by finding a critical point of the function for ; we denote this solution by . Implementation details and code are publicly available 1 .
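The two-stage procedure (accelerated gradient descent for a rough critical point, then Newton's method on the gradient to polish it) can be sketched on a smooth one-dimensional stand-in functional; the double well f(t) = (t² − 1)² below is illustration-only and not the paper's Tikhonov functional:

```python
import numpy as np

# Smooth stand-in functional with several critical points (not the
# paper's Tikhonov functional): f has minima at t = -1 and t = 1.
f = lambda t: (t**2 - 1.0)**2
fp = lambda t: 4.0 * t * (t**2 - 1.0)     # f'
fpp = lambda t: 12.0 * t**2 - 4.0         # f''

def nesterov(t0, step=0.01, iters=200):
    """Nesterov accelerated gradient descent for a rough critical point."""
    t, t_prev = t0, t0
    for k in range(1, iters + 1):
        v = t + (k - 1) / (k + 2) * (t - t_prev)   # momentum extrapolation
        t_prev, t = t, v - step * fp(v)            # gradient step at v
    return t

def newton_polish(t, iters=20):
    """Newton's method applied to f' = 0 refines the critical point."""
    for _ in range(iters):
        t = t - fp(t) / fpp(t)
    return t

t = newton_polish(nesterov(2.0))   # lands at the nearby minimizer t = 1
```

As in the simulations, which critical point is found depends on the starting point of the first stage.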
5.1. Results
Figure 3 depicts the value for different values of α > 0 (left), (middle) and (right). Each of these values is plotted against δ on a log-log scale. The plot on the left shows that for any chosen α we can observe convergence of the sequence to the critical point xα as the noise level tends to 0. The plots in the middle and on the right show the convergence behaviour for the different choices specified above. All of the sequences can be observed to converge, i.e. and as the noise level δ tends to 0.
Figure 4 shows the same stability and convergence behaviour for the inpainting problem as figure 3 in the limit δ → 0. In particular, convergence to a solution of the problem can be observed.
A closer look at the inpainting problem reveals that the limit point is, however, not an -minimizing solution. This can easily be checked due to the separability of the regularizer and the simple representation of the kernel of the inpainting problem. The orange dot in figure 5(left) is the -value of where i is chosen as an index in the kernel of the inpainting matrix K, i.e. such that where ei is the ith standard basis vector. Due to the separability of the regularizer we clearly see that is not an -minimizing solution; this arises due to the non-convexity of the regularizer.
Moreover, we have observed that if we initialize the values in the kernel close to −1 or 2, then the limit will have entries at these φ-critical points of . This shows that in such cases the solution obtained in the limit heavily depends on the chosen initialization and that, depending on this initialization, the recovered solution may not be an -minimizing solution and may potentially even be a local maximum or a saddle point.
Finally, figure 5(right) shows the values where is a basis of the kernel of K. Up to numerical accuracy we see that for each such index i, which shows that as in lemma 3.9.
6. Conclusion and outlook
We have introduced and studied the concept of critical point regularization which, as opposed to classical variational regularization, considers (φ-)critical points of Tikhonov functionals as regularized solutions. The advantage of this approach is that it completely discards the strong and typically unrealistic assumption of being able to find global minimizers of these functionals. Our theory shows that, under reasonable assumptions on the involved functionals, the resulting method is nevertheless a stable and convergent regularization method. Further, we have shown that the solutions in the limit δ → 0 satisfy some form of first-order optimality conditions of the constrained optimization problem subject to the constraint . Besides this, the theory presented here extends the theory of convex functionals by showing that at no point does one require global minimizers, but only points which are close to a global minimum in some sense. For practical applications this means that minimization algorithms do not need to be run until convergence but may be stopped early if easily verifiable conditions are met. Additionally, under suitable assumptions on the regularizer, this theory is directly applicable to regularized solutions which are classical critical points of the involved functionals. As such, our theory gives stability and convergence results for critical points of potentially non-convex functionals.
Finally, we have provided numerical simulations which support our theoretical findings, i.e. the stability and convergence of critical point regularization. Depending on the algorithm used for obtaining critical points, these numerical examples show that one cannot expect to find global or even local minima, which further supports the need for a theory based on (φ-)critical points as developed in this paper.
As the main concern of this paper was to introduce the concept of using (φ-)critical points as regularized solutions, we have not derived any stability or convergence rates; studying such rates is subject to future work. Besides this, deriving conditions under which learned regularizers, e.g. [5, 16, 18], give rise to relatively sub-differentiable functions is also a subject of future work.
Data availability statement
No new data were created or analysed in this study.