Convergence analysis of critical point regularization with non-convex regularizers

Daniel Obmann and Markus Haltmeier

Published 22 June 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Daniel Obmann and Markus Haltmeier 2023 Inverse Problems 39 085004, DOI 10.1088/1361-6420/acdd8d

0266-5611/39/8/085004

Abstract

One of the key assumptions in the stability and convergence analysis of variational regularization is the ability to find global minimizers. However, such an assumption is often not feasible when the regularizer is a black box or non-convex, which makes the search for global minimizers of the involved Tikhonov functional a challenging task. This is in particular the case for the emerging class of learned regularizers defined by neural networks. Instead, standard minimization schemes are applied which typically only guarantee that a critical point is found. To address this issue, in this paper we study stability and convergence properties of critical points of Tikhonov functionals with a possibly non-convex regularizer. To this end, we introduce the concept of relative sub-differentiability and study its basic properties. Based on this concept, we develop a convergence analysis assuming relative sub-differentiability of the regularizer. The rationale behind the proposed concept is that critical points of the Tikhonov functional are also relative critical points and that for the latter a convergence theory can be developed. For the case where the noise level tends to zero, we derive a limiting problem representing first-order optimality conditions of a related restricted optimization problem. Besides this, we also give a comparison with classical methods and show that the class of ReLU-networks is an appropriate choice for the regularization functional. Finally, we provide numerical simulations that support our theoretical findings and the need for the sort of analysis that we provide in this paper.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

In various scientific fields and applications, such as medical imaging or remote sensing, it is often not possible to obtain the desired quantity of interest directly. Assuming a linear measurement model, recovering the quantity of interest requires solving an inverse problem of the form

$$y^\delta = \mathbf{K} x + \eta^\delta \tag{1.1}$$

where $\mathbf{K} \colon \mathbb X \to {\mathbb{Y}}$ is a linear operator between Hilbert spaces modeling the forward problem, $\eta^\delta$ is the data perturbation, $y^\delta \in {\mathbb{Y}}$ is the noisy data and $x \in \mathbb X$ is the sought-for signal. In many cases these problems are ill-posed, meaning that no continuous right inverse of the operator $\mathbf{K}$ exists. To overcome such issues, several established approaches for the stable approximation of solutions of inverse problems exist.

1.1. Regularization with non-convex penalties

Particularly popular regularization techniques are variational methods [10, 22]. These methods recover regularized solutions $x_\alpha^\delta$ as global minimizers of the Tikhonov functional

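$$\mathcal{T}_{\alpha, y^\delta}(x) = \frac{1}{2} \lVert \mathbf{K} x - y^\delta \rVert^2 + \alpha \mathcal R(x) \tag{1.2}$$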

Here, $\mathcal R$ is a regularizer which encodes prior information about the desired solution and $\frac{1}{2} \lVert \mathbf{K} x - y^\delta \rVert^2$ plays the role of a data-discrepancy measure. Classically, regularizers have been hand-crafted, including $L^2$-penalties, sparse regularization techniques or total variation [2, 10, 12]. While such hand-crafted regularizers are often convex and hence global minima can be computed by classical convex optimization, hand-crafted priors typically lack adaptability to available data.

In more recent years, there has been a shift to learned and potentially non-convex priors [4, 14-16, 18]. It has been observed that these methods often outperform classical methods. Moreover, a full convergence analysis has been provided [14, 18]. However, such an analysis assumes minimizers of the Tikhonov functional to be known or at least to be given within a certain accuracy. For non-convex regularizers such an assumption is unrealistic and global minimizers are challenging to compute. Instead, when trying to find a regularized solution one often employs minimization algorithms such as gradient descent or variations thereof, which converge to critical points (such as local minimizers close to the initial guess) rather than to global minimizers of the Tikhonov functional. While one could constrain the learned regularizers to only include convex functionals [3, 16], this might result in suboptimal reconstructions when the underlying signal class is inherently non-convex. For such classes non-convexity of the regularizer can be a highly desirable property and as such a convergence analysis for this case is needed. Importantly, such an analysis should not rely on the strict assumption that the regularized solutions are global minimizers of the underlying Tikhonov functional.
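The gap between critical points and global minimizers is visible already in one dimension. The following minimal sketch is not taken from the paper: it assumes $\mathbb X = \mathbb Y = \mathbb R$, $\mathbf K = 1$, the hypothetical double-well regularizer $\mathcal R(x) = (x^2 - 1)^2$, and illustrative values for $y^\delta$, $\alpha$ and the step size. Plain gradient descent started from two different initial guesses ends in two different critical points of the Tikhonov functional.

```python
# Illustrative 1D setting (all choices below are assumptions, not from the paper):
# K = 1, R(x) = (x^2 - 1)^2, so T(x) = 0.5 * (x - y)^2 + alpha * (x^2 - 1)^2.

def tikhonov(x, y, alpha):
    """Tikhonov functional with quadratic data term and double-well regularizer."""
    return 0.5 * (x - y) ** 2 + alpha * (x ** 2 - 1.0) ** 2

def tikhonov_grad(x, y, alpha):
    """Derivative of the Tikhonov functional with respect to x."""
    return (x - y) + alpha * 4.0 * x * (x ** 2 - 1.0)

def gradient_descent(x0, y, alpha, step=0.05, n_iter=2000):
    """Plain gradient descent; it can only be expected to reach a critical point."""
    x = float(x0)
    for _ in range(n_iter):
        x -= step * tikhonov_grad(x, y, alpha)
    return x

y_delta, alpha = 0.2, 1.0
x_left = gradient_descent(-1.0, y_delta, alpha)   # initial guess near the left well
x_right = gradient_descent(1.0, y_delta, alpha)   # initial guess near the right well

print("critical points:", x_left, x_right)
print("functional values:", tikhonov(x_left, y_delta, alpha),
      tikhonov(x_right, y_delta, alpha))
```

With these assumed values both runs stop at points where the derivative vanishes, but the run started at $-1$ ends at a local minimizer with a larger value of the Tikhonov functional than the run started at $1$.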

We briefly mention here that there exist other interesting cases where the Tikhonov functional is non-convex such as for example in the case of a nonlinear forward operator. However, in this paper we only consider the linear case. Besides this we also mention that there are potentially different ways to deal with non-convexity of Tikhonov functionals, for example, by use of convexification [13]. However, for the learned regularizers we have in mind (see results in section 4), the modification of the involved functionals is nontrivial in general. Besides, the modification of the learned functionals can change the original properties of the learned functional in an unfavorable way.

1.2. Proposed critical point regularization

In this paper we present a convergence analysis of critical points of the Tikhonov functional $\mathcal{T}_{\alpha, y^\delta}$ for the stable solution of inverse problems of the form (1.1). We refer to any method that recovers a critical point as a regularized solution as critical point regularization. In fact, we show stability and convergence for a relaxed notion of critical points. More precisely, we study stability and convergence of φ-critical points, namely elements satisfying $0 \in \partial_\phi \mathcal{T}_{\alpha, y^\delta} (x_\alpha^\delta)$. Here $\partial_\phi$ is the φ-relative sub-differential, a novel concept that we introduce and study in this paper. Whenever the classical norm-discrepancy is used to measure similarity, we show that, as the noise level tends to zero, regularized elements converge to elements $x_{\boldsymbol{\texttt{+}}} \in \mathbb X$ with

Equation (1.3)

resembling first-order conditions of the constrained optimization problem $\operatorname{arg\,min}\{\mathcal R(x) \mid \mathbf{K} x = y\}$ defining $\mathcal R$-minimizing solutions.

We give our analysis for more general data discrepancy measures $\mathcal{S}(x, y^\delta)$ for which $\lVert \mathbf{K} x - y^\delta \rVert^2/2$ is only a special case. Further, we mention that in [9] an analysis of stability for the case of local minima has been done. In contrast to our work, the authors of [9] restrict themselves to the finite dimensional setting and do not provide convergence results for the case that the noise level tends to zero. Allowing the underlying space to be a general Hilbert space without any restrictions on the dimension has the advantage that the analysis is independent of the dimension and as such applies to any discretization used for practical applications. The precise analysis of the discretization is beyond the scope of this paper and we refer to the corresponding work in conventional variational regularization [4, 20].

Note that whenever the Tikhonov functional is relatively subdifferentiable, then critical points of the Tikhonov functional are also relatively critical points if φ is constructed accordingly, and the proposed concept yields a convergent regularization for critical points. We will show that this is actually the case, for example, for a class of learned regularizers defined by neural networks. We are not aware of any other study which includes stability and convergence of critical points and to the best of our knowledge the present analysis is the first to attempt this.

1.3. Main contributions

In this paper we introduce the concept of relative sub-differentiability as a generalization of sub-differentiability of convex functions to the non-convex case. We develop theory for relative sub-differentiability and show that corresponding φ-critical points can be found by employing a generalized gradient descent method. From the viewpoint of regularization theory we give existence, stability and convergence results for φ-critical points and derive the limiting problem for critical point regularization. As opposed to the convex case where the solutions one obtains are $\mathcal R$-minimizing solutions we get as a limiting problem a related first order optimality condition. As a special case of our analysis we derive stability and convergence results for critical points of differentiable Tikhonov functionals. For example, in this case, we get that $-\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}})$ is in the normal cone of the set of all solutions.

Finally, we provide numerical simulations which support our theoretical findings, in particular the stability, convergence and the limiting problem. Moreover, the results of our numerical simulations show that even in simple cases of non-convex regularizers the assumption of obtaining global minima or even local minima is infeasible thus further emphasizing the need for the analysis we provide in this paper. Besides, the numerical results show that the solutions we obtain cannot be expected to be $\mathcal R$-minimizing solutions and may even be local maxima of the regularizer whenever the initialization is chosen inappropriately and the algorithm does not guarantee that local minima are obtained.

1.4. Overview

The rest of the paper is organized as follows. In section 2 we motivate and introduce the concept of relative sub-differentiability and corresponding φ-critical points. Moreover, we study basic properties of relative sub-differentiability and show that φ-critical points can be achieved by employing a generalized gradient descent method. Section 3 builds on this concept of relative sub-differentiability and gives a convergence analysis for critical point regularization. Moreover, we take a closer look at the differentiable case and identify the limiting problem in this case. In section 5 we provide numerical experiments which support our theoretical findings such as stability and convergence. Finally, we conclude the paper by giving a brief summary and outlook in section 6.

2. Relative sub-differentiability

In this section and in the rest of the paper, unless stated otherwise, we assume that $\mathbb X$ is a Banach space, denote by $\mathbb X^*$ its dual and by $\langle \cdot, \cdot \rangle$ the dual pairing of $\mathbb X$ and $\mathbb X^*$, i.e. for $\varphi \in \mathbb X^*$ and $x \in \mathbb X$ we have $\langle \varphi, x \rangle = \varphi(x)$. Moreover, we denote by $\mathcal R^{^{\prime}}$ the derivative of any differentiable function $\mathcal R \colon \mathbb X \to \mathbb{R}$ and for any similarity measure $\mathcal{S} \colon \mathbb X \times {\mathbb{Y}} \to \mathbb{R}$ we denote by $\mathcal{S}^{^{\prime}}$ the derivative with respect to its first argument.

Before giving the crucial definition of relative sub-differentiability we recall the importance of classical sub-differentiability in the context of convex functions. Recall that $r \in \mathbb X^*$ is called a subgradient of some functional $\mathcal F \colon \mathbb X \to \mathbb{R}$ at $x \in \mathbb X$ if $\mathcal F(x) + \langle r, u - x \rangle \leqslant \mathcal F(u)$ for all $u \in \mathbb X$ and that $\mathcal F$ is sub-differentiable whenever the set of subgradients is non-empty for all $x \in \mathbb X$. Minimizers $x$ of $\mathcal F$ are characterized by the optimality condition $0 \in \partial_0 \mathcal F(x)$ where $\partial_0 \mathcal F(x)$ denotes the set of all subgradients at the point $x$. However, sub-differentiability implies convexity. We will therefore develop a relaxed concept of sub-differentiability relative to some functional $\phi \colon \mathbb X \to [0, \infty)$ by replacing the right hand side in the definition of subgradients by $\mathcal F(u) + \phi(u)$.

2.1. Definition and basic properties

The following concept generalizing sub-differentiability is also applicable to non-convex functions.

Definition 2.1 (Relative sub-differentiability). Let $\mathcal F \colon \mathbb X \to \mathbb{R}$ and $\phi \colon \mathbb X \to [0, \infty)$.

  • (a)  
    $r \in \mathbb X^*$ is called φ-relative subgradient of $\mathcal F$ at $x \in \mathbb X$ if
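    $$\mathcal F(x) + \langle r, u - x \rangle \leqslant \mathcal F(u) + \phi(u) \quad \text{for all } u \in \mathbb X. \tag{2.1}$$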
  • (b)  
    The set of all φ-relative subgradients at x is denoted by $\partial_\phi \mathcal F(x)$ and called the φ-relative sub-differential of $\mathcal F$ at x.
  • (c)  
    The functional $\mathcal F$ is called φ-relative sub-differentiable if $\partial_\phi \mathcal F(x) \neq \emptyset$ for all $x \in \mathbb X$.

Some remarks about definition 2.1 are in order.

Remark 2.2. 

  • We call any such function φ a bound. It is clear that such a bound cannot be unique, since whenever $\mathcal F$ is a relatively sub-differentiable function with bound φ then it is also relatively sub-differentiable with bound $\phi + c$ for any $c \in [0, \infty)$.
  • Choosing φ = 0 we see that any convex and sub-differentiable function $\mathcal F$ is relatively sub-differentiable, i.e. the class of all relative sub-differentiable functions includes the set of convex sub-differentiable functions.
  • The relative subgradients depend on the function φ. This shows that whenever we choose a larger φ then we generally also increase the set of possible relative subgradients, i.e. if $\phi_1 \leqslant \phi_2$ then $\partial_{\phi_1} \mathcal F \subseteq \partial_{\phi_2} \mathcal F$.
  • Similar to the concept of subgradients for convex functions, the concept of relative subgradients is a global property since the defining inequality has to hold for any point $u \in \mathbb X$.
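The remarks above can be probed numerically. The following minimal sketch, which is not part of the paper, tests the defining inequality of definition 2.1 on a finite grid in $\mathbb X = \mathbb R$. The helper name `is_relative_subgradient`, the test function $\mathcal F = \sin$, the constant bounds and the grid are all assumptions made for illustration. Since the inequality has to hold for every $u \in \mathbb X$, a grid check can only refute membership in $\partial_\phi \mathcal F(x)$ or make it plausible, never prove it.

```python
import numpy as np

def is_relative_subgradient(F, phi, r, x, grid, tol=1e-12):
    """Check F(x) + r * (u - x) <= F(u) + phi(u) for all u in a finite grid."""
    lhs = F(x) + r * (grid - x)
    rhs = F(grid) + phi(grid)
    return bool(np.all(lhs <= rhs + tol))

F = np.sin                                   # non-convex, bounded test function
phi_zero = lambda u: np.zeros_like(u)        # classical subgradient case, phi = 0
phi_two = lambda u: 2.0 * np.ones_like(u)    # larger constant bound, phi = 2
grid = np.linspace(-10.0, 10.0, 2001)

# r = 0 at x = 1: rejected for phi = 0, accepted for the larger bound phi = 2.
zero_with_phi_zero = is_relative_subgradient(F, phi_zero, 0.0, 1.0, grid)
zero_with_phi_two = is_relative_subgradient(F, phi_two, 0.0, 1.0, grid)

# r = 0.5 at x = 0 fails far away from x, which reflects the global nature
# of the defining inequality.
half_with_phi_two = is_relative_subgradient(F, phi_two, 0.5, 0.0, grid)

print(zero_with_phi_zero, zero_with_phi_two, half_with_phi_two)
```

Enlarging φ from 0 to 2 enlarges the set of relative subgradients in this example, in line with the monotonicity noted in the remark.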

Another approach of generalizing convexity and subgradients (and as a consequence critical points) is given in [11, 24] where convexity with respect to a set of functions W is defined. In such a setting $w \in W$ is a subgradient of $\mathcal F$ at x whenever $\mathcal F(u) \geqslant \mathcal F(x) + w(u) - w(x)$ for any $u \in \mathbb X$. As a consequence any critical point, i.e. a point where 0 is a subgradient, will also be a global minimizer and hence such a generalization cannot be used for our purposes. In [7] another concept of generalized gradients is discussed. In this setting the definition of the gradient depends only on neighborhoods around the point of interest. As a consequence we cannot expect the critical points to have any global properties which are necessary for the analysis in section 3 hence making this generalization unfit for our analysis. However, it should be noted, that whenever convenient one might substitute any differentiability assumption on the involved functionals with Clarke's generalized gradient concept in any of the following discussions.

In what follows we will assume that $\mathcal F \colon \mathbb X \to \mathbb{R}$ is φ-relatively sub-differentiable for some fixed φ. Based on this definition we generalize the concept of critical points as follows.

Definition 2.3 (φ-critical points). We call $x \in \mathbb X$ a φ-critical point of $\mathcal F$ if $0 \in \partial_\phi \mathcal F(x)$. Moreover, we denote by $\operatorname{crit}_\phi \mathcal F$ the set of all φ-critical points of $\mathcal F$.

It should be noted that the definition of φ-critical points depends on φ and in practical applications one might not have access to φ. In such cases evaluating or finding relative subgradients might be infeasible. Nevertheless, the concept of φ-critical points is general enough to include an important class of points as the following remark illustrates.

Remark 2.4 (Critical points of differentiable functions). Let us assume that $\mathcal F \colon \mathbb X \to \mathbb{R}$ is a differentiable function which satisfies the inequality $\mathcal F(x) + \langle \mathcal F^{^{\prime}}(x), u - x \rangle \leqslant \mathcal F(u) + \phi(u)$ for any $x, u \in \mathbb X$ and some $\phi \colon \mathbb X \to [0, \infty)$. Then we have $\mathcal F^{^{\prime}}(x) \in \partial_\phi \mathcal F (x)$. This shows that in this special case we have access to at least one element of $\partial_\phi \mathcal F$. In particular, any critical point of $\mathcal F$, i.e. a point $x \in \mathbb X$ with $\mathcal F^{^{\prime}}(x) = 0$, will always yield a φ-critical point of $\mathcal F$ in the sense of definition 2.3 and hence definition 2.3 is a generalization of the classical concept of critical points for differentiable functions satisfying above inequality.

This shows that for a class of functions we have access to at least one element of the relative subgradient of $\mathcal F$. More importantly, for this class of functions we can make assertions about the points $x \in \mathbb X$ where $\mathcal F^{^{\prime}}(x) = 0$ holds, i.e. points which are reachable by use of a (minimization) algorithm which guarantees to find a critical point.

Before we move on, we briefly give a prototypical example of a non-convex function for which a bound φ can be chosen, such that $\mathcal F^{^{\prime}}(x) \in \partial_\phi \mathcal F(x)$.

Remark 2.5 (Examples of relative sub-differentiability). We start by giving a simple example of a function which is non-convex, but relatively sub-differentiable. To this end, let $a, b \in \mathbb{R}$ be given and define $\mathcal F(t) = (t + a)^2 (t + b)^2$. It is readily seen that, for fixed $s$, the map $t \mapsto \mathcal F(t) + \mathcal F^{^{\prime}}(t)(s - t)$ is a polynomial of degree 4 with negative leading coefficient. Hence, this function is bounded from above and the relative sub-differentiability immediately follows by, for example, choosing $\phi(s) = \sup_t \left( \mathcal F(t) + \mathcal F^{^{\prime}}(t)(s - t) \right)$. Clearly then, by lemma 2.6, the function $g(t) = \mathcal F(t) + c t^2$ is also relatively sub-differentiable for $c > 0$. The function g is plotted in figure 1 on the left side for different parameters $a, b, c$ on a semi-logarithmic scale to emphasize the non-convexity.

Now let us consider the function $\mathcal F(t) = \cos(t) + \lvert t \rvert$, see figure 1 on the right. Then, due to the coercivity and the existence of critical points 'at infinity', the derivative of $\mathcal F$ cannot be in the relative sub-gradient of $\mathcal F$ for any φ. This example illustrates what types of functions are not included in the concept of relatively sub-differentiable functions for which the derivative is supposed to lie in the relative subgradient. In particular, the concept of relative sub-differentiability excludes coercive functionals which have critical points 'at infinity'.
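Both examples of the remark can be checked numerically. The sketch below is an illustration under assumptions that are not from the paper: the parameters $a = 1$, $b = -1$, the finite grids used in place of the supremum over all $t$, and the variable names. For the polynomial it approximates the bound $\phi(s) = \sup_t \left( \mathcal F(t) + \mathcal F^{\prime}(t)(s - t) \right)$ and confirms the inequality of remark 2.4 on the grid. For $\cos(t) + \lvert t \rvert$ it lists critical points whose function values grow without bound, which is incompatible with item (6) of lemma 2.6.

```python
import numpy as np

a, b = 1.0, -1.0  # illustrative parameters (an assumption)

def F(t):
    return (t + a) ** 2 * (t + b) ** 2

def dF(t):
    return 2.0 * (t + a) * (t + b) ** 2 + 2.0 * (t + a) ** 2 * (t + b)

t = np.linspace(-10.0, 10.0, 20001)  # base points of the tangents
s = np.linspace(-3.0, 3.0, 61)       # points at which phi is evaluated

# tangent[i, j] = F(t_i) + F'(t_i) * (s_j - t_i)
tangent = F(t)[:, None] + dF(t)[:, None] * (s[None, :] - t[:, None])

# Grid approximation of phi(s) = sup_t tangent(t, s); finite because the
# tangent values are a quartic in t with negative leading coefficient.
phi = tangent.max(axis=0)

# Inequality of remark 2.4 on the grid: F(t) + F'(t)(s - t) <= F(s) + phi(s).
holds = bool(np.all(tangent <= F(s)[None, :] + phi[None, :] + 1e-12))

# Second example: G(t) = cos(t) + |t| has critical points t_k = pi/2 + 2*pi*k
# for k >= 0, and G(t_k) grows linearly in k.
G = lambda u: np.cos(u) + np.abs(u)
dG = lambda u: -np.sin(u) + np.sign(u)
t_crit = np.pi / 2 + 2 * np.pi * np.arange(5)
crit_values = G(t_crit)

print(holds, crit_values)
```

The first check holds by construction of φ and mainly shows that the supremum is finite, while the second makes the critical points 'at infinity' concrete.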

Figure 1. Left: Example of relatively sub-differentiable functions where the classical gradient is contained in the relative sub-gradient. Right: A function for which the classical derivative cannot be in the relative sub-gradient.

Before discussing how one might obtain φ-critical points of relatively sub-differentiable functions, we list some useful properties of which we make constant use during the rest of the paper.

Lemma 2.6 (Basic properties of relative subgradients). Let $\mathcal F, \mathcal F_i \colon \mathbb X \to \mathbb{R}$ and let $\phi, \phi_i \colon \mathbb X \to [0, \infty)$ be bounds of $\mathcal F, \mathcal F_i$ for $i = 1, \dots, n$ and $w > 0$. Moreover, set $c := \inf_{u \in \mathbb X} \left( \mathcal F(u) + \phi(u) \right)$. Then the following hold:

  • (1)  
    $\sum \partial_{\phi_i} \mathcal F_i \subseteq \partial_{\sum \phi_i} \sum \mathcal F_i $
  • (2)  
    $w \partial_\phi \mathcal F = \partial_{w \phi} (w \mathcal F)$
  • (3)  
    If $\mathcal F$ is convex then $\partial_0 \mathcal F (x) \subseteq \partial_\phi \mathcal F(x)$ for any $x \in \mathbb X$
  • (4)  
    $\partial_\phi \mathcal F (x)$ is convex and (weak*) closed
  • (5)  
    If $x_{\boldsymbol{\texttt{+}}} \in \operatorname{arg\,min} \mathcal F(x)$ then $0 \in \partial_\phi \mathcal F (x_{\boldsymbol{\texttt{+}}})$
  • (6)  
    $0 \in \partial_\phi \mathcal F (x) \Longleftrightarrow \mathcal F(x) \leqslant c$
  • (7)  
    If $\mathcal F$ is Lipschitz and φ bounded on bounded subsets, then $\partial_\phi \mathcal F$ is bounded. In particular, in this case the set $\partial_\phi \mathcal F$ is weak*-compact.
  • (8)  
    Let $p_k = g_k + z_k$ where $g_k \in \partial_\phi \mathcal F(x_k)$ and $\lVert z_k \rVert \leqslant \varepsilon_k$ with $\varepsilon_k \to 0$. Assume that $x_k$ converge weakly to $x_{\boldsymbol{\texttt{+}}}$ and $g_k$ converge to $g$ and that $\mathcal F$ is weakly lower semi-continuous. Then $g \in \partial_\phi \mathcal F(x_{\boldsymbol{\texttt{+}}})$. If, instead, $x_k$ converge strongly to $x_{\boldsymbol{\texttt{+}}}$, $g_k$ converge weakly to $g$ and $\mathcal F$ is lower semi-continuous then we also have $g \in \partial_\phi \mathcal F(x_{\boldsymbol{\texttt{+}}})$.

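Proof. (1) Let $p_i \in \partial_{\phi_i} \mathcal F_i(x)$ and define $p = \sum_{i} p_i$. Then for any $u \in \mathbb X$ we have

$$\sum_i \mathcal F_i(x) + \langle p, u - x \rangle = \sum_i \left( \mathcal F_i(x) + \langle p_i, u - x \rangle \right) \leqslant \sum_i \left( \mathcal F_i(u) + \phi_i(u) \right)$$

and hence the claim follows.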

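(2) Assume that $p \in \partial_\phi \mathcal F(x)$. Then we have $w \mathcal F(x) + w \langle p, u - x \rangle \leqslant w \left( \mathcal F(u) + \phi(u) \right)$ by positivity of $w$ and hence $w \partial_\phi \mathcal F \subseteq \partial_{w \phi} (w \mathcal F)$. Now let $p \in \partial_{w \phi} (w \mathcal F)(x)$. Then we define $q = \frac{p}{w}$ and it follows that

$$\mathcal F(x) + \langle q, u - x \rangle = \frac{1}{w} \left( w \mathcal F(x) + \langle p, u - x \rangle \right) \leqslant \frac{1}{w} \left( w \mathcal F(u) + w \phi(u) \right) = \mathcal F(u) + \phi(u)$$

which shows that $q \in \partial_\phi \mathcal F(x)$ and hence $p = w q \in w \partial_\phi \mathcal F(x)$.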

(3) This is an immediate consequence of $\mathcal F(u) \leqslant \mathcal F(u) + \phi(u)$ by non-negativity of φ.

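(4) Let $p_1, p_2 \in \partial_\phi \mathcal F(x)$ and $\lambda \in (0, 1)$. Then we have

$$\mathcal F(x) + \langle \lambda p_1 + (1 - \lambda) p_2, u - x \rangle = \lambda \left( \mathcal F(x) + \langle p_1, u - x \rangle \right) + (1 - \lambda) \left( \mathcal F(x) + \langle p_2, u - x \rangle \right) \leqslant \mathcal F(u) + \phi(u)$$

which proves the convexity of $\partial_\phi \mathcal F(x)$. Now let us assume that $p_k \in \partial_\phi \mathcal F (x)$ and that $p_k$ (weak*) converges to $p$. By (weak*) convergence we have $\langle p_k, u - x \rangle \to \langle p, u - x \rangle$ and hence $p$ is also a φ-relative subgradient of $\mathcal F$ at $x$.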

(5) This is also a consequence of the non-negativity of φ and the assumption that $x_{\boldsymbol{\texttt{+}}}$ is a global minimizer.

(6) Let $0 \in \partial_\phi \mathcal F(x)$. Then by definition we have $\mathcal F(x) \leqslant \mathcal F(u) + \phi(u)$ for any $u \in \mathbb X$ and hence also $\mathcal F(x) \leqslant c$. On the other hand, if $\mathcal F(x) \leqslant c$ then we have $\mathcal F(x) \leqslant \mathcal F(u) + \phi(u)$ for any $u \in \mathbb X$ and hence $0 \in \partial_\phi \mathcal F(x)$.

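(7) Let $p \in \partial_\phi \mathcal F(x)$ and set $u = x + v$ with $\lVert v \rVert = 1$. Using the defining inequality and denoting by $L$ a Lipschitz constant of $\mathcal F$ we find

$$\langle p, v \rangle \leqslant \mathcal F(x + v) - \mathcal F(x) + \phi(x + v) \leqslant L + \phi(x + v)$$

and thus by taking the supremum over v we find that $\lVert p \rVert$ is bounded. Using Banach-Alaoglu we see that $\partial_\phi \mathcal F$ must be weak*-compact.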

(8) By assumption $(x_k)_k$ is bounded. Thus, we have

which proves the claim.

Lemma 2.6 gives us a characterization of φ-critical points as points x for which $\mathcal F + \phi$ is an upper bound of $\mathcal F(x)$. This characterization in particular implies that for any differentiable and relatively sub-differentiable function $\mathcal F$ we have that the points x with $\mathcal F^{^{\prime}}(x) = 0$ must have bounded value independent of x. Comparing this to the convex case we have that x is a critical point of the function $\mathcal F$ if and only if x is a global minimizer. In some sense, the definition of φ-critical points allows for some error to be made and guarantees that φ-critical points cannot have arbitrarily large $\mathcal F$-value. Moreover, whenever $\mathcal F$ is coercive then all φ-critical points must be inside some ball $B_r(0)$ for some r > 0.

2.2. Computation of φ-critical points

We next answer the question of how to obtain φ-critical points, at least for the case where $\mathbb X$ is a Hilbert space. Clearly, if $\mathcal F$ is differentiable then one could consider classical gradient descent methods. Since we are also interested in non-differentiable functions, gradient descent in its classical form may not be applicable. Below we show that a generalized gradient method using relative subgradients instead of gradients will yield φ-critical points in the sense of definition 2.3. In this sense, algorithm 1 is a natural extension of subgradient descent [6, 23].

Algorithm 1. Relative subgradient descent.
Require: Starting point $x_0 \in \mathbb X$, stepsizes $\eta_n \gt 0$
$n \gets 0$
while $0 \notin \partial_\phi \mathcal F (x_n)$ do
   Choose $g^*_n \in \partial_\phi \mathcal F (x_n)$ and $g_n \in \mathbb X$ such that $\langle g_n^*, g_n \rangle \gt 0$
   $x_{n+1} = x_n - \eta_n g_n$
   $n \gets n + 1$
end while
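Algorithm 1 can be sketched in a few lines of Python. The version below is a minimal illustration and not the authors' implementation: it assumes $\mathbb X = \mathbb R^d$, uses the gradient as relative subgradient (the setting of remark 2.4), rescales it so that $\lVert g_n \rVert \leqslant C$, and replaces the test $0 \notin \partial_\phi \mathcal F(x_n)$, which would require knowledge of φ, by a gradient-norm tolerance. The function name, the test functional $\mathcal F(x) = \sum_i (x_i^2 - 1)^2$ and the step sizes $\eta_n = 0.5/(n+1)$ are assumptions.

```python
import numpy as np

def relative_subgradient_descent(F, subgrad, x0, stepsize, n_iter=200, C=1.0, tol=1e-12):
    """Sketch of relative subgradient descent that tracks the best iterate.

    subgrad(x) is assumed to return an element of the phi-relative
    sub-differential of F at x; phi itself is never evaluated.
    """
    x = np.asarray(x0, dtype=float).copy()
    best_x, best_val = x.copy(), F(x)
    for n in range(n_iter):
        g_star = subgrad(x)
        norm = np.linalg.norm(g_star)
        if norm <= tol:
            # Surrogate stopping rule (assumption): a vanishing gradient is a
            # phi-critical point in the setting of remark 2.4.
            break
        # g_n = lambda_n * g_n^* with lambda_n > 0, so <g_n^*, g_n> > 0 and ||g_n|| <= C.
        g = g_star * min(1.0, C / norm)
        x = x - stepsize(n) * g
        val = F(x)
        if val < best_val:
            best_x, best_val = x.copy(), val
    return best_x, best_val

# Non-convex test functional (an assumption) and its gradient.
F = lambda x: float(np.sum((x ** 2 - 1.0) ** 2))
dF = lambda x: 4.0 * x * (x ** 2 - 1.0)

# Step sizes that are square-summable but not summable.
x_best, f_best = relative_subgradient_descent(F, dF, np.array([2.0]),
                                              stepsize=lambda n: 0.5 / (n + 1))
print(x_best, f_best)
```

Tracking the best iterate mirrors the quantity $\min_{i} \mathcal F(x_i)$ that appears in the convergence analysis, since the functional values need not decrease monotonically.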

The following result shows that algorithm 1 converges to a φ-critical point of the function $\mathcal F$. The given proof closely follows the one given in [6] but does not assume a finite dimensional setting and considers relatively sub-differentiable functionals instead of sub-differentiable functions.

Theorem 2.7 (Convergence of algorithm 1). Assume that $\mathbb X$ is a Hilbert space and that $\mathcal F$ is relatively sub-differentiable with bound φ. Moreover, choose $g_n = \lambda_n g_n^*$ in algorithm 1 with $\lambda_n > 0$ such that $\lVert g_n \rVert \leqslant C$ for all $n \in \mathbb{N}$. Then for any point $u \in \mathbb X$ and any step $N \in \mathbb{N}$ we have

$$\min_{i = 1, \dots, N} \mathcal F(x_i) \leqslant \mathcal F(u) + \phi(u) + \frac{\lVert x_1 - u \rVert^2 + C^2 \sum_{i = 1}^N \eta_i^2}{2 \sum_{i = 1}^N \eta_i}.$$

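Proof. Let $u \in \mathbb X$. After rescaling of $g_n^*$ according to the assumption we may assume that $g_n = g_n^*$. Then by definition of $x_{n+1}$ and of the relative subgradient we have

$$\lVert x_{n+1} - u \rVert^2 = \lVert x_n - u \rVert^2 - 2 \eta_n \langle g_n, x_n - u \rangle + \eta_n^2 \lVert g_n \rVert^2 \leqslant \lVert x_n - u \rVert^2 - 2 \eta_n \left( \mathcal F(x_n) - \mathcal F(u) - \phi(u) \right) + \eta_n^2 \lVert g_n \rVert^2.$$

Applying this inequality recursively and using the fact that $\lVert x_{n+1} - u \rVert^2 \geqslant 0$ we find

$$2 \sum_{i = 1}^n \eta_i \left( \mathcal F(x_i) - \mathcal F(u) - \phi(u) \right) \leqslant \lVert x_1 - u \rVert^2 + \sum_{i = 1}^n \eta_i^2 \lVert g_i \rVert^2$$

which together with the inequalities $\lVert g_i \rVert \leqslant C$ and $\sum_{i = 1}^n \eta_i \mathcal F(x_i) \geqslant \min_{i = 1,\dots,n} \mathcal F(x_i) \sum_{i = 1}^n \eta_i$ shows the desired result.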

Theorem 2.7 shows that if the sequence of step-sizes $(\eta_n)_{n \in \mathbb{N}}$ is square-summable but not summable, then in the limit we have $\lim_{n \to \infty} \min_{i = 1, \dots, n} \mathcal F(x_i) \leqslant \mathcal F(u) + \phi(u)$ for any $u \in \mathbb X$. Note that the analysis and the proofs rely heavily on the functional φ, but at no point during algorithm 1 do we need explicit knowledge of φ, only access to elements of $\partial_\phi \mathcal F$. In particular, in the case of remark 2.4, when using the gradient of $\mathcal F$ as the update direction the generated sequence will yield a φ-critical point.

Finally, assume that we have a functional of the form $\mathcal F(x) = \mathcal{S}(x) + \alpha \mathcal R(x)$ where each term is relatively sub-differentiable with bounds $\phi_\mathcal{S}$ and $\phi_\mathcal R$. Then lemma 2.6 shows that for $s \in \partial_{\phi_\mathcal{S}} \mathcal{S}$ and $r \in \partial_{\phi_\mathcal R} \mathcal R$ we have $s + \alpha r \in \partial_{\phi_\mathcal{S} + \alpha \phi_\mathcal R} \left( \mathcal{S} + \alpha \mathcal R \right)$. This implies that algorithm 1 can be applied in the case where we are looking for a φ-critical point of the sum of two relatively sub-differentiable functionals and only have access to elements of $\partial_{\phi_\mathcal{S}} \mathcal{S}$ and $\partial_{\phi_\mathcal R} \mathcal R$.

3. Regularizing properties of φ-critical points

In this section we present a convergence analysis for φ-critical points of Tikhonov-type functionals $\mathcal{T}_{\alpha, y^\delta}$ extending the existing analysis for global minima [22]. At this point we want to emphasize again that the assumption of being able to obtain global minima of $\mathcal{T}_{\alpha, y^\delta}$ can be extremely restrictive when $\mathcal R$ is non-convex and the main goal of our analysis is to discard this assumption. Instead we focus only on φ-critical points of $\mathcal{T}_{\alpha, y^\delta}$ which may include local minimizers, saddle points or even local maxima.

Recall that we are interested in Tikhonov-type functionals $\mathcal{T}_{\alpha, y^\delta} \colon \mathbb X \to [0, \infty)$ of the form

$$\mathcal{T}_{\alpha, y^\delta}(x) = \mathcal{S}(x, y^\delta) + \alpha \mathcal R(x) \tag{3.1}$$

for given $\mathcal{S} \colon \mathbb X \times {\mathbb{Y}} \to [0, \infty)$ and $\mathcal R \colon \mathbb X \to [0, \infty)$. Here, $\mathcal{S}$ is a similarity measure between $x$ and $y^\delta$ and a standard situation we are interested in is $\mathcal{S}(x, y^\delta) = \frac{1}{2} \lVert \mathbf{K} (x) - y^\delta \rVert^2$ where $\mathbf{K} \colon \mathbb X \to {\mathbb{Y}}$ is the forward operator of the inverse problem of interest. Instead of working with global minima of the functional (3.1) we consider regularized solutions $x_\alpha^\delta$ as αφ-critical points of $\mathcal{T}_{\alpha, y^\delta}$, meaning

$$0 \in \partial_{\alpha \phi} \mathcal{T}_{\alpha, y^\delta}(x_\alpha^\delta). \tag{3.2}$$

We will analyze stability and convergence of such critical points.
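To make the notion of stability concrete, the following minimal sketch computes critical points of a Tikhonov functional for perturbed data and compares them with the critical point for the unperturbed data. Everything in it is an assumption made for illustration and not taken from the paper: $\mathbb X = \mathbb Y = \mathbb R^2$, the diagonal operator `K`, the regularizer $\mathcal R(x) = \sum_i (x_i^2 - 1)^2$, the data, the perturbations, and the use of plain gradient descent from a fixed initial guess as the critical point solver.

```python
import numpy as np

K = np.diag([1.0, 0.5])  # illustrative forward operator (an assumption)
alpha = 1.0

def grad_T(x, y):
    """Gradient of T(x) = 0.5 * ||K x - y||^2 + alpha * sum((x_i^2 - 1)^2)."""
    return K.T @ (K @ x - y) + alpha * 4.0 * x * (x ** 2 - 1.0)

def critical_point(y, x0, step=0.05, n_iter=3000):
    """Gradient descent from x0; returns an (approximate) critical point."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x = x - step * grad_T(x, y)
    return x

y_delta = np.array([0.2, -0.4])
x0 = np.array([-1.0, -1.0])
x_limit = critical_point(y_delta, x0)

# Data y_k -> y_delta: the same shift is added to both data components.
perturbations = [1e-1, 1e-2, 1e-3]
errors = [float(np.linalg.norm(critical_point(y_delta + p, x0) - x_limit))
          for p in perturbations]
print(errors)
```

In this example the computed critical points approach the critical point for $y^\delta$ as the data perturbation shrinks, although none of them is claimed to be a global minimizer.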

For the analysis we make the following assumptions.

Condition 3.1 (Critical point regularization) 

  • (C1) $\mathbb X$ is a reflexive Banach space and ${\mathbb{Y}}$ is a metric space with metric $\mathcal{D}$
  • (C2) $\mathcal R$ is weakly sequentially lower semi-continuous
  • (C3) $\mathcal R$ is relatively sub-differentiable with bound φ
  • (C4) $\mathcal{S}$ is weakly sequentially lower semi-continuous, convex in its first argument and continuous in its second argument
  • (C5) $\exists C \gt 0~\exists p \geqslant 1~\forall z \in \mathbb X~\forall y, y^\delta \in {\mathbb{Y}} \colon \mathcal{S}(z, y) \leqslant C \left( \mathcal{S}(z, y^\delta) + \mathcal{D}(y, y^\delta)^p \right)$
  • (C6) $\forall \alpha \gt 0$ and $\forall y^\delta \in {\mathbb{Y}}$ the functional $\mathcal{T}_{\alpha, y^\delta}$ is coercive, i.e. $\mathcal{T}_{\alpha, y^\delta}(x) \to \infty$ for $\lVert x \rVert \to \infty$

Most of the assumptions in condition 3.1 are classical assumptions (or generalizations thereof), e.g. [14, 18, 22], made for the analysis of variational methods. For example, the coercivity assumption only poses a condition on the involved functionals 'at infinity' and as such does not pose any form of condition, say for example, on the behaviour in a ball around 0. This means, that the function $\mathcal{T}$ can be highly non-convex as long as it is growing fast enough outside bounded sets. The major difference in the analysis provided here is that $\mathcal R$ is relatively sub-differentiable, which we have motivated in section 2, and the assumption that in general the regularized solutions are not global minima but only φ-critical points.

One of the simplest and most commonly used examples of a similarity measure which satisfies conditions (C4) and (C5) is given by $\mathcal{S}(x, y^\delta) = \lVert \mathbf{K}\, x - y^\delta \rVert^p$ whenever $\mathbf{K} \colon \mathbb X \to {\mathbb{Y}}$ is the linear forward operator of the underlying inverse problem and ${\mathbb{Y}}$ is a Banach space. In general, any similarity measure of the form $\lVert \mathbf L (\mathbf{K} x - y^\delta) \rVert^p$ satisfies these assumptions, if $\mathbf L$ is a linear and bounded operator, e.g. a reweighting of the residual $(\mathbf{K} x - y^\delta)$.
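For the norm-based similarity measure the inequality in condition 3.1 can be verified directly. The short derivation below is a sketch under the assumption that the metric is $\mathcal{D}(y, y^\delta) = \lVert y - y^\delta \rVert$. It uses the triangle inequality and the convexity of $t \mapsto t^p$ for $p \geqslant 1$, which gives $(a + b)^p \leqslant 2^{p-1}(a^p + b^p)$ for $a, b \geqslant 0$.

```latex
\begin{aligned}
\lVert \mathbf{K} z - y \rVert^p
  &\leqslant \left( \lVert \mathbf{K} z - y^\delta \rVert + \lVert y^\delta - y \rVert \right)^p \\
  &\leqslant 2^{p-1} \left( \lVert \mathbf{K} z - y^\delta \rVert^p + \lVert y - y^\delta \rVert^p \right).
\end{aligned}
```

Hence the inequality holds with $C = 2^{p-1}$. For the reweighted variant $\lVert \mathbf L (\mathbf{K} z - y) \rVert^p$ the same steps together with $\lVert \mathbf L (y^\delta - y) \rVert \leqslant \lVert \mathbf L \rVert \lVert y^\delta - y \rVert$ give the constant $C = 2^{p-1} \max\{1, \lVert \mathbf L \rVert^p\}$.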

We now turn our focus to the stability and convergence analysis of the considered method, i.e. $x_\alpha^\delta \in \operatorname{crit}_{\alpha \phi} \mathcal{T}_{\alpha, y^\delta}$. We start with existence and stability results.

3.1. Existence and stability

Theorem 3.2 (Existence). Under condition 3.1 the problem is well-posed, i.e. for every $\alpha > 0$ and $y^\delta \in {\mathbb{Y}}$ the set $\operatorname{crit}_{\alpha \phi} \mathcal{T}_{\alpha, y^\delta}$ is non-empty.

Proof. This is an immediate consequence of the existence of minimizers of $\mathcal{T}_{\alpha, y^\delta}$ which follows from the coercivity and the continuity assumptions on the functional $\mathcal{T}_{\alpha, y^\delta}$. A more detailed proof can be found in [22].

Clearly, αφ-critical points may exist under weaker assumptions than a coercivity assumption. However, the coercivity is an important property in the following analysis which guarantees the existence of a weakly convergent subsequence whenever the sequence is bounded. As such we have also derived existence of φ-critical points using the coercivity. Extending the current analysis to the case of non-coercive functionals $\mathcal{T}_{\alpha, y^\delta}$ is subject to future work.

Another advantage of using φ-critical points as opposed to global minima, besides being numerically and hence practically more tractable for non-convex functionals, is that we have a simple way of talking about 'inexact' critical points, i.e. points where the gradient is small but not necessarily 0. As it turns out, the following analysis can be performed under the even weaker assumption that the regularized solutions are 'inexact' critical points instead of exact critical points.

Theorem 3.3 (Stability). Let $y^\delta \in {\mathbb{Y}}, \alpha \gt 0$ and $y_k \to y^\delta$ and assume that $x_k \in \mathbb X$ is such that $z_k \in \partial_{\alpha \phi} \left( \mathcal{S}(\cdot, y_k) + \alpha \mathcal R(\cdot) \right) (x_k)$ with $\lVert z_k \rVert \to 0$ and $\langle z_k, x_k \rangle \leqslant 0$. Then the sequence $(x_k)_k$ has a weakly convergent subsequence and the limit $x_{\boldsymbol{\texttt{+}}}$ of every weakly convergent subsequence is an αφ-critical point of $\mathcal{T}_{\alpha, y^\delta}$.

Proof. To show the existence of a weakly convergent subsequence, using the reflexivity of $\mathbb X$, it is enough to show that $(x_k)_k$ is a bounded sequence. By coercivity of $\mathcal{T}_{\alpha, y^\delta}$ it is enough to show that $(\mathcal{T}_{\alpha, y^\delta}(x_k))_k$ is bounded. We have for any $u \in \mathbb X$

and using $\langle z_k, x_k \rangle \leqslant 0$ it follows

By assumption on $\mathcal{S}$ we have $\mathcal{S}(x_k, y^\delta) \leqslant C(\mathcal{S}(x_k, y_k) + \mathcal{D}(y_k, y^\delta)^p)$ which yields

for any $u \in \mathbb X$. By assumption we have $\lVert z_k \rVert \to 0$ and $\mathcal{D}(y_k, y^\delta) \to 0$ so the right hand side is bounded for k large enough. This shows that there exists some weakly convergent subsequence.

Let now $(x_k)_k$ denote such a subsequence and denote by $x_{\boldsymbol{\texttt{+}}}$ its limit. Using the weak lower semi-continuity of the involved functionals it follows for any $u \in \mathbb X$

where the last equality follows from continuity of $\mathcal{S}$ in its second argument. This shows that $0 \in \partial_{\alpha \phi} \left( \mathcal{S}(\cdot, y^\delta) + \alpha \mathcal R(\cdot) \right) (x_{\boldsymbol{\texttt{+}}})$.

Clearly, whenever $z_k = 0$, i.e. $x_k$ is an αφ-critical point, then the assumptions on $z_k$ in theorem 3.3 are satisfied. It follows that αφ-critical points are stable in the above sense. However, theorem 3.3 also shows that we do not need access to exact αφ-critical points but only to points which are in some sense close to an αφ-critical point.

Remark 3.4 (Inexact critical points obtained by use of minimization schemes). Consider once again the case of remark 2.4 and assume that the αφ-critical points are obtained by using gradient descent or any other algorithm which finds zeros of the gradient. Then we have $z_k = \mathcal{S}^{^{\prime}}(x_k, y_k) + \alpha \mathcal R^{^{\prime}}(x_k)$ and whenever $\lVert z_k \rVert \to 0$ and $\langle z_k, x_k \rangle \leqslant 0$ we have that the considered points have a weakly convergent subsequence. For practical applications this means that we have an easily verifiable condition which can be used as a kind of stopping criterion when searching for critical points. As a consequence, we do not have to guarantee that the regularized solutions are critical points but only that they are 'close' to being a critical point.
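The stopping criterion of remark 3.4 can be sketched as follows. This is a minimal illustration, not the scheme used in the paper: the diagonal operator `k`, the double-well regularizer $\mathcal R(x) = \tfrac14 \sum_i (x_i^2 - 1)^2$, the step size, the tolerance and the initialization at the origin are all assumptions made for this example.

```python
import math

def grad_T(x, y, k, alpha):
    """Gradient z = S'(x, y) + alpha * R'(x) for a diagonal operator.

    S(x, y) = 0.5 * sum((k_i x_i - y_i)^2), R(x) = 0.25 * sum((x_i^2 - 1)^2);
    the double-well R is a hypothetical non-convex regularizer.
    """
    return [ki * (ki * xi - yi) + alpha * (xi ** 3 - xi)
            for xi, yi, ki in zip(x, y, k)]

def is_inexact_critical(z, x, tol):
    """Stopping rule of remark 3.4: ||z|| <= tol and <z, x> <= 0."""
    norm_z = math.sqrt(sum(zi * zi for zi in z))
    inner = sum(zi * xi for zi, xi in zip(z, x))
    return norm_z <= tol and inner <= 0.0

def gradient_descent(y, k, alpha, step=0.5, tol=1e-8, max_iter=10_000):
    """Plain gradient descent that stops once the rule above is met."""
    x = [0.0] * len(y)  # start at the origin (an assumption of this sketch)
    for _ in range(max_iter):
        z = grad_T(x, y, k, alpha)
        if is_inexact_critical(z, x, tol):
            return x, True
        x = [xi - step * zi for xi, zi in zip(x, z)]
    return x, False

x_hat, accepted = gradient_descent(y=[1.0, 0.0], k=[1.0, 0.5], alpha=0.1)
```

In this toy configuration the iterates approach the critical point from the origin, so the sign condition $\langle z_k, x_k \rangle \leqslant 0$ holds along the way; for other initializations it has to be checked explicitly, which is why the sketch returns a flag alongside the point.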

3.2. Convergence

The next goal is to show the convergence of the regularized solutions to a solution of the original problem in the case that the noise-level δ tends to 0. Here, we call $z \in \mathbb X$ an $\mathcal{S}$-solution of $y \in {\mathbb{Y}}$ if $\mathcal{S}(z, y) = 0$. Like in the case of theorem 3.3, the proof can be done under the weaker assumption of only having access to 'inexact' φ-critical points (see remark 3.4).

Theorem 3.5 (Convergence). Let $y \in {\mathbb{Y}}$ and assume it has an $\mathcal{S}$-solution. Further, let $y_k \in {\mathbb{Y}}$ with $\mathcal{D}(y_k, y) \leqslant \delta_k$ with $\delta_k \to 0$. Choose $\alpha = \alpha(\delta)$ such that for $\alpha_k = \alpha(\delta_k)$ we have $\lim_k \alpha_k = \lim_k \delta_k^p / \alpha_k = 0$. Assume that the regularized solutions $x_k \in \mathbb X$ are such that $z_k \in \partial_{\alpha_k \phi} \left( \mathcal{S}(\cdot, y_k) + \alpha_k \mathcal R(\cdot) \right) (x_k)$ with $\lVert z_k \rVert / \alpha_k \to 0$ and $\langle z_k, x_k \rangle \leqslant 0$.

Then the sequence $(x_k)_k$ has a weakly convergent subsequence and the limit $x_{\boldsymbol{\texttt{+}}}$ of any such subsequence is an $\mathcal{S}$-solution of y. Moreover, we have $\mathcal R(x_{\boldsymbol{\texttt{+}}}) \leqslant \mathcal R(u) + \phi(u)$ for any $\mathcal{S}$-solution u. Finally, whenever the $\mathcal{S}$-solution is unique then $(x_k)_k$ converges weakly to this solution.

Proof. Similar to the stability proof we show that $(x_k)_k$ is bounded by using the coercivity of the functionals $\mathcal{T}_{\alpha, y^\delta}$. Following the above proof we find for any $u \in \mathbb X$

and by choosing u such that $\mathcal{S}(u, y) = 0$ we find $\mathcal{S}(u, y_k) \leqslant C \delta_k^p$ which implies

Since both $\mathcal{S}$ and $\mathcal R$ are non-negative it then follows

where we have used the assumptions $\lim_k \delta_k^p / \alpha_k = \lim_k \lVert z_k \rVert/ \alpha_k = 0$. This shows that $(\mathcal R(x_k))_k$ is a bounded sequence and using once again $\mathcal{S}(x_k, y) \leqslant C \left( \mathcal{S}(x_k, y_k) + \delta_k^p \right)$ we find that for $\alpha^{\boldsymbol{\texttt{+}}} = \max \{\alpha_k \colon k \in \mathbb{N}\}$ the sequence $(\mathcal{S}(x_k, y) + \alpha^{\boldsymbol{\texttt{+}}} \mathcal R(x_k))_k$ is bounded. Using the coercivity of $\mathcal{S}(\cdot, y) + \alpha^{\boldsymbol{\texttt{+}}} \mathcal R(\cdot)$ we get that the sequence $(x_k)_k$ is bounded and hence has a weakly convergent subsequence.

Finally, using the weak lower-semicontinuity of $\mathcal{S}$ and $\mathcal R$ we have that for any such weakly convergent subsequence with limit $x_{\boldsymbol{\texttt{+}}}$

for any $u \in \mathbb X$ with $\mathcal{S}(u, y) = 0$.

Whenever the solution is unique, then every subsequence of $(x_k)_k$ has a subsequence converging to this solution. This shows that $(x_k)_k$ converges weakly to the unique solution.

At this point, we want to emphasize once again that the assumptions on the choice of points $x_k$ in theorem 3.5 are weaker than the assumption that $x_k$ is an $\alpha_k \phi$-critical point and that in particular the analysis also holds for these points.
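As a worked instance of the parameter choice in theorem 3.5, consider a power-type rule. The exponent $s$ and the tolerance rule for $\lVert z_k \rVert$ below are illustrative assumptions and not choices made in the paper; they merely show that the required limits can be met simultaneously.

```latex
\alpha(\delta) = \delta^{s}, \quad 0 < s < p
\;\Longrightarrow\;
\lim_{k} \alpha_k = \lim_{k} \delta_k^{s} = 0,
\qquad
\lim_{k} \frac{\delta_k^{p}}{\alpha_k} = \lim_{k} \delta_k^{\,p - s} = 0,
\\[1ex]
\lVert z_k \rVert \leqslant \alpha_k^{2}
\;\Longrightarrow\;
\frac{\lVert z_k \rVert}{\alpha_k} \leqslant \alpha_k \to 0 .
```

In other words, the regularization parameter has to decay more slowly than $\delta^p$, and the gradient tolerance has to decay faster than $\alpha_k$.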

Since for this section we only assume that $\mathcal R$ is relatively sub-differentiable without explicit knowledge of the bound φ, theorem 3.5 gives a somewhat intangible condition on the type of solutions we obtain in the limit δ → 0. A more tangible condition, and more importantly one independent of φ, is given by the next theorem, where we assume a separability condition on the gradients $z_k \in \partial_{\alpha_k \phi} \left( \mathcal{S}(\cdot, y_k) + \alpha_k \mathcal R(\cdot) \right) (x_k)$. This separability assumption can be satisfied in many cases, e.g. when $\mathcal{S}$ and $\mathcal R$ are (relatively sub-)differentiable and the φ-critical points arise due to some algorithm such as gradient descent. Using these algorithms we are often in the situation that $z_k = s_k + \alpha_k r_k$ where $s_k$ is a (sub-)gradient of $\mathcal{S}$ and $r_k$ is a (relative sub-)gradient of $\mathcal R$. Assuming that the gradients $(r_k)_k$ of $\mathcal R$ have a cluster point, we get the following additional property of these cluster points.

Theorem 3.6 (Normality property of the solution). Let the same assumptions as in theorem 3.5 hold and denote by $(x_k)_k$ a weakly convergent subsequence with limit $x_{\boldsymbol{\texttt{+}}}$. Let $z_k = s_k + \alpha_k r_k$ where $s_k \in \partial_0 \mathcal{S}(\cdot, y_k) (x_k)$ and $r_k \in \partial_\phi \mathcal R (x_k)$.

Then any cluster point r of the sequence $(r_k)_k$ satisfies $-r \in -\partial_\phi \mathcal R (x_{\boldsymbol{\texttt{+}}}) \cap N_{L(y)}(x_{\boldsymbol{\texttt{+}}})$, where $N_{L(y)}(x_{\boldsymbol{\texttt{+}}})$ is the normal cone of the convex set of all $\mathcal{S}$-solutions of y at $x_{\boldsymbol{\texttt{+}}}$.

Proof. Let r be a cluster point of the sequence $(r_k)_k$. Then by weak lower semi-continuity of $\mathcal R$ and by assumption on rk we have

which shows that $r \in \partial_\phi \mathcal R (x_{\boldsymbol{\texttt{+}}})$.

Now assume that $u \in \mathbb X$ is such that $\mathcal{S}(u, y) = 0$. Then we have $\mathcal{S}(u, y_k) \leqslant C \delta_k^p$ and it follows

where we have used the convexity of $\mathcal{S}$ in its first argument and the assumption on the limits of the sequences $(\alpha_k)_k$ and $\left( \lVert z_k \rVert / \alpha_k \right)_k$.

Theorem 3.6 shows that the solution $x_{\boldsymbol{\texttt{+}}}$ obtained by critical point regularization satisfies some form of first order optimality conditions, see e.g. [21].

Also note that in the case where $\mathcal R$ is convex and we choose φ = 0, both properties in theorems 3.5 and 3.6 reduce to the common property that $x_{\boldsymbol{\texttt{+}}}$ is an $\mathcal R$-minimizing solution, i.e. $\mathcal R(x_{\boldsymbol{\texttt{+}}}) \leqslant \mathcal R(u)$ for any $u \in \mathbb X$ with $\mathcal{S}(u, y) = 0$.

Remark 3.7 (Convex regularizers). Clearly, any sub-differentiable convex function is relatively sub-differentiable with the choice φ = 0. Nevertheless, one could also choose $\phi = \varepsilon \gt 0$. With this choice we see that the results in theorem 3.5 roughly state that the solutions $x_{\boldsymbol{\texttt{+}}}$ we approximate by using critical point regularization are $\mathcal R$-minimizing solutions up to an error ε whenever the regularized solutions are minimizers up to an error of ε.

This result, as opposed to classical variational regularization theory e.g. [22], has the advantage that at no point do we require exact global minimizers of the functionals $\mathcal{T}_{\alpha, y}$ but only approximate minimizers, which may be more easily reachable in practical applications. Consider for example the case where we employ an iterative algorithm which has convergence guarantees of the form $\mathcal F(x_n) - \mathcal F(x_*) \leqslant C / n^r$ for the n-th iterate and $x_*$ being a minimizer of $\mathcal F$. Applying this algorithm to $\mathcal F = \mathcal{T}_{\alpha, y}$ and requiring that $C / n^r \leqslant \alpha \varepsilon$ in order to get that $x_n$ is an αφ-critical point, we see that the above theory shows that one might stop the iterative algorithm after a finite number of steps, i.e. we do not necessarily need to run the algorithm until it converges and we still get a stable and convergent regularization method.

At first it might seem that a disadvantage of this is that we do not achieve an $\mathcal R$-minimizing solution in the limit. However, this can also be circumvented by considering a variable ε. To be more precise, following the proof of theorem 3.5 with $\varepsilon = \varepsilon(\delta)$ and the condition $\varepsilon(\delta) \to 0$ as δ → 0, it is easy to see that to obtain a sequence $(x_k)_k$ weakly converging to an $\mathcal R$-minimizing solution it is enough to run the iterative algorithm for a number of iteration steps $n_k$ such that $C / n_k^r \leqslant \alpha_k \varepsilon_k$.
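The iteration budget implied by the condition $C / n_k^r \leqslant \alpha_k \varepsilon_k$ can be computed explicitly as $n_k \geqslant (C / (\alpha_k \varepsilon_k))^{1/r}$. The following sketch does this for a hypothetical schedule; the constants `C` and `r`, the choices $\alpha_k = \delta_k$ and $\varepsilon_k = \sqrt{\delta_k}$ (admissible for $p = 2$), and the function name are assumptions made for illustration.

```python
import math

def iterations_needed(C, r, alpha, eps):
    """Smallest n >= 1 with C / n**r <= alpha * eps."""
    target = alpha * eps
    n = max(1, math.ceil((C / target) ** (1.0 / r)))
    # Guard against floating-point rounding in the closed-form estimate.
    while C / n ** r > target:
        n += 1
    while n > 1 and C / (n - 1) ** r <= target:
        n -= 1
    return n

# Hypothetical schedule: delta_k -> 0, alpha_k = delta_k, eps_k = sqrt(delta_k).
C, r = 1.0, 2.0
schedule = []
for delta in (1e-1, 1e-2, 1e-3):
    alpha, eps = delta, math.sqrt(delta)
    schedule.append((alpha, eps, iterations_needed(C, r, alpha, eps)))
```

The number of iterations grows as the noise level decreases, but it stays finite for every fixed $\delta_k$, which is the point of the remark.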

We next discuss another special case of our analysis which pertains to functionals such as the one in remark 2.5.

3.3. Differentiable regularizers and classical critical points

In this subsection we consider the important special case where the φ-critical points are given by classical critical points, i.e. by points x for which $\mathcal{T}_{\alpha, y}^{^{\prime}}(x) = 0$, and we give stability and convergence results for this case. To this end, we assume that the bound φ can be constructed in such a way that $\mathcal R(x) + \langle \mathcal R^{^{\prime}}(x), u - x \rangle \leqslant \mathcal R(u) + \phi(u)$, see e.g. remark 2.4. Then, if $\mathcal{S}$ is differentiable in its first argument, by convexity of $\mathcal{S}$, we have $\mathcal{S}^{^{\prime}}(x, y^\delta) + \alpha \mathcal R^{^{\prime}}(x) \in \partial_{\alpha \phi} (\mathcal{S}(\cdot, y^\delta) + \alpha \mathcal R(\cdot))(x)$. This shows that whenever we employ some algorithm which finds a classical critical point, we also obtain an αφ-critical point in the sense of definition 2.1 which satisfies the separability assumption necessary for theorem 3.6. In particular, these points are amenable to the analysis above.

Nevertheless, the analysis relies on an abstract concept of φ-critical points and even in the case where the involved functionals are differentiable we cannot guarantee that the limits will again be φ-critical points without any additional assumptions. In order to guarantee this we need the assumption that $\mathcal{S}^{^{\prime}}$ and $\mathcal R^{^{\prime}}$ are weakly (sequentially) continuous. Combining the above theorems we then get the following result.

Proposition 3.8 (Existence, stability and convergence for classical critical points). Assume that $\mathcal{S}$ and $\mathcal R$ are differentiable with weakly continuous derivatives and let condition 3.1 hold. Moreover, let $y, y^\delta \in {\mathbb{Y}}$ and α > 0 and assume that y has an $\mathcal{S}$-solution. Then the following hold

  • 1.  
    Existence: $\mathcal{T}_{\alpha, y^\delta}$ has at least one φ-critical point.
  • 2.  
    Stability: If $(y_k)_k \subseteq {\mathbb{Y}}$ is a sequence converging to $y^\delta$ and $x_k$ is such that $z_k = \mathcal{S}^{^{\prime}}(x_k, y_k) + \alpha \mathcal R^{^{\prime}}(x_k) \to 0$ as $k \to \infty$ and $\langle z_k, x_k \rangle \leqslant 0$, then $(x_k)_k$ has a weakly convergent subsequence and any weak cluster point $x_{\boldsymbol{\texttt{+}}}$ of $(x_k)_k$ is a critical point of $\mathcal{T}_{\alpha, y^\delta}$.
  • 3.  
    Convergence: Let $(y_k)_k \subseteq {\mathbb{Y}}$ be a sequence with $\mathcal{D}(y_k, y) \leqslant \delta_k$ and $\alpha = \alpha(\delta)$ be such that for $\alpha_k = \alpha(\delta_k)$ we have $\lim_k \alpha_k = \lim_k \delta_k^p / \alpha_k = 0$. Then, if we choose $x_k \in \mathbb X$ such that $z_k = \mathcal{S}^{^{\prime}}(x_k, y_k) + \alpha_k \mathcal R^{^{\prime}}(x_k)$ satisfies $\lim_k \lVert z_k \rVert/ \alpha_k = 0$ and $\langle z_k, x_k \rangle \leqslant 0$ the sequence $(x_k)_k$ has at least one weak clusterpoint and any such clusterpoint $x_{\boldsymbol{\texttt{+}}}$ is an $\mathcal{S}$-solution of y with the following additional properties
    • (a)  
      $\mathcal R(x_{\boldsymbol{\texttt{+}}}) \leqslant \inf_{\mathcal{S}(u, y) = 0} \mathcal R(u) + \phi(u)$
    • (b)  
      $\langle -\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}), z - x_{\boldsymbol{\texttt{+}}} \rangle \leqslant 0$ for any $z \in \mathbb X$ with $\mathcal{S}(z, y) = 0$, i.e. $-\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}) \in N_{L(y)}(x_{\boldsymbol{\texttt{+}}})$.
    Finally, whenever the $\mathcal{S}$-solution is unique then $(x_k)_k$ converges weakly to this solution.

Proof. This follows immediately by applying theorems 3.2, 3.3, 3.5 and 3.6.

For the differentiable case this identifies the limiting problem we solve by regularizing the inverse problem with critical points, i.e. in the limit we find solutions which satisfy a first order optimality condition of the constrained optimization problem $\min_{x \in \mathbb X} \mathcal R(x)$ subject to $\mathcal{S}(x, y) = 0$.

We now briefly discuss the special case where $\mathcal{S}$ is given as the norm-discrepancy, e.g. in the case where ${\mathbb{Y}}$ is a Hilbert space.

Lemma 3.9 (Solution for norm discrepancy). Let the same assumptions as in proposition 3.8 hold and assume that $\mathcal{S}(x, y^\delta) = \frac{1}{p} \lVert \mathbf{K} x - y^\delta \rVert_{\mathbb{Y}}^p$ for some p > 1 where $\mathbf{K} \colon \mathbb X \to {\mathbb{Y}}$ is a linear and bounded forward operator between Banach spaces and $\lVert \cdot \rVert_{\mathbb{Y}}^p$ is differentiable. Furthermore, denote by $x_{\boldsymbol{\texttt{+}}}$ a solution according to proposition 3.8.

Then we have $-\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}) \in \ker(\mathbf{K})^\perp = \{p \in \mathbb X^\ast \colon \forall x_0 \in \ker(\mathbf{K}) \colon \langle \,p, x_0 \rangle = 0\}$.

Proof. Any solution z can be written as $z = x_{\boldsymbol{\texttt{+}}} + x_0$ where $x_0 \in \ker(\mathbf{K})$. By using $x_0$ and $-x_0$, proposition 3.8 shows that $\langle -\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}), x_0 \rangle = 0$ for any $x_0 \in \ker(\mathbf{K})$ and hence the claim follows.
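Lemma 3.9 can be illustrated in the simplest convex setting, which is an assumption of this sketch rather than a case singled out by the paper: for $\mathcal R(x) = \lVert x \rVert^2 / 2$ with φ = 0 we have $\mathcal R^{^{\prime}}(x) = x$, and the $\mathcal R$-minimizing solution is the minimum-norm solution, which should be orthogonal to $\ker(\mathbf{K})$. The matrix `K`, its dimensions and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((3, 6))        # hypothetical underdetermined operator
y = K @ rng.standard_normal(6)         # exact data in the range of K

x_plus = np.linalg.pinv(K) @ y         # minimum-norm (R-minimizing) solution
grad_R = x_plus                        # R'(x) = x for R(x) = ||x||^2 / 2

# Orthonormal basis of ker(K) from the right singular vectors.
_, s, Vt = np.linalg.svd(K)
rank = int(np.sum(s > 1e-12))
kernel_basis = Vt[rank:].T             # shape (6, 6 - rank)

residual = np.linalg.norm(K @ x_plus - y)
max_inner = np.max(np.abs(kernel_basis.T @ (-grad_R)))
```

Up to rounding, `residual` and `max_inner` both vanish, i.e. $x_{\boldsymbol{\texttt{+}}}$ solves the equation and $-\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}})$ annihilates every kernel direction.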

4. Examples and comparison

In this section we compare the proposed regularization concept using αφ-critical points to standard Tikhonov regularization and to its convex relaxation, and we discuss the influence of different choices for φ. Moreover, we discuss ReLU-regularizers as a class of non-convex regularizers for which the presented theory is applicable.

4.1. Dependence on the choice of φ

We begin this section with a comparison between classical Tikhonov regularization and regularization with φ-critical points using different choices for φ. For the sake of simplicity we consider a convex example, where the data-fidelity is chosen as $\lVert \mathbf{K} x - y_\delta \rVert^2/2$ and the regularizer is given by $\mathcal R(x) = \lVert x \rVert^2/2$. We assume that α > 0 and $y_\delta \in {\mathbb{Y}}$ are fixed and denote by $x_{\alpha, \delta}$ the unique minimizer of the Tikhonov functional $ \mathcal{T}_{\alpha, y_\delta}(x) = \lVert \mathbf{K} x - y_\delta \rVert^2/2 + \alpha \lVert x \rVert^2/2$. Note that in the convex case, standard Tikhonov regularization corresponds to regularization with αφ-critical points for the choice $\phi \equiv 0$.

Consider now the case where $\phi \equiv \varepsilon \gt 0$ is constant. Then, according to the general theory, any αφ-critical point x of $\mathcal{T}_{\alpha, y_\delta}$ is characterized by

Writing $x = x_{\alpha, \delta} + x_0$ we find after some rearrangements that this is in turn equivalent to

where $L(\alpha) : = (\mathbf{K}^* \mathbf{K} + \alpha I)$ and $r(\alpha) : = L(\alpha) x_{\alpha, \delta} - \mathbf{K}^* y_\delta$. Since $L(\alpha)$ is positive definite, this shows that $x_0$ has to be chosen in an ellipsoid around 0 and thus $x = x_{\alpha, \delta} + x_0$ is contained in some ellipsoid around $x_{\alpha, \delta}$. Here, the size of the ellipsoid depends on the choice of ε and, for example, choosing ε = 0 leads to the singleton $\{x_{\alpha, \delta}\}$.
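The ellipsoid can be made explicit under the following reading, which is an assumption based on the inequality used in section 3.3: $0 \in \partial_{\alpha \varepsilon} \mathcal{T}_{\alpha, y_\delta}(x)$ means $\mathcal{T}_{\alpha, y_\delta}(x) \leqslant \mathcal{T}_{\alpha, y_\delta}(u) + \alpha \varepsilon$ for all u, i.e. $\mathcal{T}_{\alpha, y_\delta}(x) - \mathcal{T}_{\alpha, y_\delta}(x_{\alpha, \delta}) \leqslant \alpha \varepsilon$. Expanding the quadratic functional around $x_{\alpha, \delta}$ gives $\tfrac12 \langle L(\alpha) x_0, x_0 \rangle + \langle r(\alpha), x_0 \rangle \leqslant \alpha \varepsilon$, where $r(\alpha)$ vanishes at the minimizer. The sketch below checks this expansion numerically; the matrix `K`, the data `y` and the values of `alpha` and `eps` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
K = rng.standard_normal((4, 3))   # hypothetical forward operator
y = rng.standard_normal(4)        # hypothetical noisy data
alpha, eps = 0.5, 0.2

def T(x):
    """Tikhonov functional ||K x - y||^2 / 2 + alpha * ||x||^2 / 2."""
    return 0.5 * np.sum((K @ x - y) ** 2) + 0.5 * alpha * np.sum(x ** 2)

L = K.T @ K + alpha * np.eye(3)               # L(alpha)
x_min = np.linalg.solve(L, K.T @ y)           # Tikhonov minimizer
r = L @ x_min - K.T @ y                       # r(alpha), zero at the minimizer

# Excess energy equals the quadratic form for an arbitrary offset x0.
x0 = rng.standard_normal(3)
excess = T(x_min + x0) - T(x_min)
quad = 0.5 * x0 @ L @ x0 + r @ x0

# Rescale x0 onto the boundary of {x0 : x0^T L x0 / 2 <= alpha * eps}.
x0_boundary = x0 * np.sqrt(alpha * eps / (0.5 * x0 @ L @ x0))
boundary_excess = T(x_min + x0_boundary) - T(x_min)
```

Points on the boundary of the ellipsoid have excess energy exactly $\alpha \varepsilon$, so shrinking ε shrinks the admissible set towards $\{x_{\alpha, \delta}\}$.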

Let now φ be an arbitrary non-negative function such that minimization of $\mathcal{T}_{\alpha, y_\delta} + \alpha \phi $ is well-posed with minimizer $x_\phi$. Then, following the steps above, we find that αφ-critical points $x = x_\phi + x_0$ of $\mathcal{T}_{\alpha, y_\delta} $ are characterized by

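$\tfrac{1}{2} \langle L(\alpha) x_0, x_0 \rangle + \langle r(\alpha), x_0 \rangle \leqslant \alpha \phi(x_\phi) \qquad (4.1)$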

where $L(\alpha)$ is as above and $r(\alpha) = L(\alpha) x_\phi - \mathbf{K}^* y_\delta$. That is, the set of αφ-critical points is an ellipsoid around the point $x_\phi$ where the size of the ellipsoid depends on the value $\phi(x_\phi)$. Depending on the choice of φ this value can even be 0. To see this, consider for example the case where $\mathbb X = \ell^2(\mathbb{N})$ and $\phi(x) = \beta \, \lVert \max \{0, x\} \rVert^2/2$ for some β > 0, where the maximum is taken pointwise. Then, if $x_{\alpha, \delta} \leqslant 0$ pointwise, we find that $\phi(x_{\alpha, \delta}) = 0$ and we arrive at $x_\phi = x_{\alpha, \delta}$ and (4.1) collapses to $x_0 = 0$. In a similar fashion, the choice $\phi(x) = \beta \, \lVert \max(0, -x) \rVert^2 / 2$ for some β > 0 leads to a point estimate whenever $x_{\alpha, \delta}$ is non-negative and to an ellipsoid if $x_{\alpha, \delta}$ has at least one negative entry.

While we allow for the whole set of solutions defined as αφ-critical points of $\mathcal{T}_{\alpha, y_\delta}$, in practical applications one typically only chooses a specific subset. More precisely, as shown in section 2.2 one can apply variants of gradient descent to construct critical points. In the convex situation this might lead to the same solution independent of the choice of φ since the classical gradient $\partial_0 \mathcal{T}_{\alpha, y_\delta}$ can be used as an αφ subgradient for any choice of φ. However, in the non-convex case using such algorithms will not find solutions which are global minima but only critical points as we discuss in the following subsection.

4.2. Relation to convex relaxation

Next we investigate the relation between critical point regularization and regularization with the convex relaxation of the regularizer for a non-convex example. Consider $\mathbb X = {\mathbb{Y}} = \ell^2(\mathbb{N})$ and $\mathbf{K} x = (k_i x_i)_{i\in \mathbb{N}}$. As non-convex regularizer we use a slightly perturbed double-well potential $\mathcal R(x) = \sum_{i} r_i(x_i)$, where for $q \in [1/2, 1)$ and $w_i\gt0$ we define

Here, q controls the perturbation away from 0 of the local minimum at $w_i$. An illustration of $r_i$ for $w_i = 1$ and two different values of q is depicted in figure 2. While at this point the choice of the regularizer might seem somewhat arbitrary and contrived, we will argue later that the regularizer chosen here is a simplified model for a reasonable class of learned regularizers; see the discussion on ReLU-networks in section 4.4 and in particular remark 4.6.

Figure 2. Illustration of the non-convex double-well regularizer (solid lines) for two different values of q and the corresponding convex hull (dashed lines).

For given $\alpha \gt 0, y_\delta \in {\mathbb{Y}}$ we consider (classical) critical points of the Tikhonov functional $\mathcal{T}_{\alpha, y_\delta}$ defined by

$\mathbf{K}^* (\mathbf{K} x - y_\delta) + \alpha \nabla \mathcal R(x) = 0. \qquad (4.2)$

Although $r_i$ is not differentiable at $t = q w_i$, with slight abuse of notation we will denote by $r_i^{^{\prime}}(t)$ the derivative of $r_i$, which we define as $q w_i$ at $t = q w_i$. In the following the gradient $\nabla \mathcal R (x) = (r_i^{^{\prime}}(x_i))_{i \in \mathbb{N}} $ will be understood with this convention.

Remark 4.1. One can construct a specific φ such that $\nabla \mathcal R(x) \in \partial_{\alpha\phi} \mathcal R (x)$. It is then guaranteed that any classical critical point is an αφ-critical point and hence a regularized solution fitting to the theory presented in this paper. Such a φ can be constructed by considering the defining inequality pointwise and constructing $\phi_i$ for $r_i$. The calculations for $\phi_i$ are relatively simple but tedious. Since the exact form of φ is irrelevant for our purpose, we refrain from explicitly defining φ. Instead, we focus on solutions that arise when solving equation (4.2). Further note that the algorithm presented in section 2.2 with $\nabla \mathcal R$ as relative subgradient is the same as classical gradient descent, thus further substantiating the assumption that the constructed regularized solution is a solution of (4.2).

By definition of the operator and the regularizer, solutions of (4.2) can be computed component-wise via $\forall i \colon k_i (k _i x_i - y_i) + \alpha r_i^{^{\prime}}(x_i) = 0$. Hence

An interesting observation is that for $k_i = 0$ we have a choice between 0 and $w_i$ where the choice $x_i = w_i$ will, in general, only lead to a local minimizer instead of a global one; compare with figure 2 for q = 0.75. In essence, this shows that the regularized solutions one obtains are a subset of the solutions of equation (4.2) and need not be global minimizers.

Next we consider regularization with the convex relaxation of $\mathcal R$, i.e. the convex hull $\mathcal R^\textrm{conv}$. After some lengthy calculations, which we do not present here for the sake of brevity, we derive for the convex hull of ri the form

A comparison of $r_i$ and $r_i^\text{conv}$ for two choices of the parameter q is given in figure 2 where the convex hull in both cases is visualized with a dashed line.

Solving the critical point equation (4.2) where $\mathcal R$ is replaced by its convex hull we find that the solutions differ quite a bit, at least for the case $k_i = 0$. Considering this case, we find that there are two different cases. First, whenever $q = 1/2$ the convex hull allows for arbitrary solutions $x_i \in [0, w_i]$. On the other hand if $q \gt 1/2$ the convex hull forces the choice $x_i = 0$. This means that even the slightest perturbation of the value of $r_i(w_i)$ will lead to a convex hull which loses this information. Comparing this to the solutions above we see that in the non-convex case we can always choose $x_i \in \{0, w_i\}$ independent of the value of q. This shows that there is a difference between the regularized solutions one obtains when considering the convex hull of the regularizer $\mathcal R^\textrm{conv}$ and the original regularizer $\mathcal R$.

4.3. Non-equivalence to Tikhonov regularization

One might conjecture that the proposed regularization with αφ-critical points of $\mathcal{T}_{\alpha, y_\delta}$ is equivalent to Tikhonov regularization for some other modified choice of the regularizer. While we cannot give a definite answer to this question at this point, we conjecture that this is not the case.

To support our hypothesis, let us analyze what would happen if the construction of αφ-critical points of $\mathcal{T}_{\alpha, y_\delta}$ were equivalent to Tikhonov regularization with some regularizer $\mathcal R_\phi$. In this case, the limiting problems would also coincide and thus for any $y \in\, \mathrm{ran}(\mathbf{K})$ and $x \in \mathbb X$ we have

Denoting by $x_\phi$ a solution of $\min_{\mathbf{K} u = y} \mathcal R(u) + \phi(u)$ we have $\mathcal R(x) \leqslant \mathcal R(x_\phi) + \phi(x_\phi)$. Clearly, this is the case if and only if $\max \{\mathcal R , \mathcal R(x_\phi) + \phi(x_\phi)\}$ is minimal among all possible solutions. This essentially means that $\mathcal R_\phi = \max \{\mathcal R , \mathcal R(x_\phi) + \phi(x_\phi)\}$.

While such a choice can theoretically be used to characterize the limiting problem, we do not have access to $x_\phi$ and therefore cannot work with $\mathcal R_\phi$ in practice. A slightly more subtle problem is that $\mathcal R_\phi$ depends on the exact data and as such cannot be used in the noisy case where α > 0. Thus, we conjecture that the proposed regularization is not equivalent to Tikhonov regularization independent of the choice $\mathcal R_\phi$.

4.4. ReLU-Networks as class of possible regularizers

Next, we demonstrate that ReLU networks form a class of non-convex, relatively subdifferentiable regularizers that fit within the theory presented in this paper. As discussed in [14, 15], such regularizers are a powerful tool in the context of classical variational regularization.

Let now $\mathbb V$, $\mathbb U$ be further Hilbert spaces.

Definition 4.2 (Quasi-homogeneity). A function $f \colon \mathbb X \to \mathbb V$ is quasi-homogeneous, if there exists $\mathcal{L}_f \colon \mathbb X \to L(\mathbb X, \mathbb V)$ such that $\sup_{x \in \mathbb X} \lVert \mathcal{L}_f(x) \rVert \lt \infty$ and $\sup_{x \in \mathbb X} \lVert f(x) - \mathcal{L}_f(x) x \rVert \lt \infty$. We call $\mathcal{L}_f$ the quasi-derivative of f.

In definition 4.2 and below $L(\mathbb X, \mathbb V)$ denotes the space of all bounded linear mappings from $\mathbb X$ to $\mathbb V$. Quasi-homogeneity satisfies the following elementary rules.

Lemma 4.3 (Quasi-homogeneity and relative sub-differentiability). Let $f,h \colon \mathbb X \to \mathbb V$ and $g \colon \mathbb V \to \mathbb U$ be quasi-homogeneous, $\mathbf{W} \in L(\mathbb X, \mathbb V)$ and $c \in \mathbb{R}$. Moreover, let $v \in \mathbb V$ and let $\psi \colon \mathbb V \to [0, \infty)$ be convex and sub-differentiable with $\psi(v) \leqslant C \lVert v \rVert^p$ for some C > 0, $p \geqslant 1$ and subgradient selection $\psi^{^{\prime}}(v) \in (\partial_0 \psi)(v)$. Then for some $\phi^1, \phi^2$ the following hold:

  • (1)  
    $g \circ f$ is quasi-homogeneous with $\mathcal{L}_{g \circ f}(x) = \mathcal{L}_g(f(x)) \circ \mathcal{L}_f(x)$.
  • (2)  
    $f + c h$ is quasi-homogeneous with $\mathcal{L}_{f + c h} = \mathcal{L}_f + c \mathcal{L}_h$.
  • (3)  
    $x \mapsto \mathbf{W} x + b$ is quasi-homogeneous for any fixed $b \in \mathbb V$.
  • (4)  
    $\langle v, f \rangle$ is $\phi^1$-relatively sub-differentiable with $\langle v, \mathcal{L}_f(x) (\cdot) \rangle \in \partial_{\phi^1} (\langle v, f \rangle)(x)$.
  • (5)  
    $\psi \circ f$ is $\phi^2$-relatively sub-differentiable with $\psi^{^{\prime}}(f(x)) \mathcal{L}_f(x) \in \partial_{\phi^2}(\psi \circ f)(x)$.

Proof. These properties follow immediately from the triangle inequality and the defining properties of quasi-homogeneity and relative sub-differentiability.

Theorem 4.4 (Learned regularizers). Let $\psi \colon \mathbb V \to [0, \infty)$ be convex and sub-differentiable with $\psi(v) \leqslant C \lVert v \rVert^p$ for some C > 0 and $p \geqslant 1$. Let $\sigma_\ell$ be quasi-homogeneous and $\mathbf{A}_\ell$ be affine and continuous for $\ell \in \{1, \dots, L\}$. Then

Equation (4.3)

is quasi-homogeneous. Additionally, $\mathcal R = \psi \circ \mathcal{N} $ is relatively sub-differentiable.

Proof. Follows from repeated application of lemma 4.3.

The crucial assumption in theorem 4.4 is that the activation functions $\sigma_\ell$ are quasi-homogeneous. This property is, for example, satisfied in the case of the ReLU as the choice for the activation function as we discuss in the following example.

Example 4.5 (ReLU regularizer). Consider the case $\sigma_\ell = \operatorname{ReLU}$ defined by $\operatorname{ReLU}(x) = \max \{0, x\}$ where max is to be understood pointwise. Then $\mathcal{L}_{\operatorname{ReLU}}(x) = M_{g(x)}$, where $M_{g(x)}$ denotes pointwise multiplication with $g(x) = 0$ if $x \leqslant 0$ and $g(x) = 1$ if x > 0 and g is again understood pointwise. The ReLU function is then quasi-homogeneous whenever the space $\mathbb X$ has the following property: For any $x, v \in \mathbb X$ we have $\operatorname{ReLU}(x) \in \mathbb X, g(x) v \in \mathbb X$ and $\lVert g(x) v \rVert \leqslant \lVert g(x) \rVert_\infty \lVert v \rVert$. Examples of such spaces are $\mathbb X = \ell^r(\Lambda, L^p(\Omega, \mu))$ for some at most countable set Λ and parameters $p, r \geqslant 1$. In particular, this also holds in the finite dimensional case $\mathbb X = \mathbb{R}^n$. Thus, theorem 4.4 shows that ReLU networks are an appropriate choice to construct regularizers. We note here that the same also holds true when the ReLU activation functions are replaced by the more general class of parametric ReLU activation functions.
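In the finite dimensional case $\mathbb X = \mathbb{R}^n$ the two defining bounds of definition 4.2 can be checked directly for the ReLU: since $\operatorname{ReLU}(x) = M_{g(x)} x$, the defect $\lVert f(x) - \mathcal{L}_f(x) x \rVert$ is zero, and the operator norm of $M_{g(x)}$ is at most 1. The sketch below verifies this on random vectors of very different magnitudes; the dimension, the scales and the function names are illustrative assumptions.

```python
import numpy as np

def relu(x):
    """Pointwise ReLU(x) = max{0, x}."""
    return np.maximum(0.0, x)

def relu_quasi_derivative(x):
    """Diagonal matrix M_{g(x)} with g(x) = 1 where x > 0 and 0 elsewhere."""
    return np.diag((x > 0).astype(float))

rng = np.random.default_rng(3)
max_defect = 0.0   # sup of ||ReLU(x) - M_{g(x)} x|| over the samples
max_opnorm = 0.0   # sup of the spectral norm of M_{g(x)} over the samples
for scale in (1e-2, 1.0, 1e2, 1e4):
    x = scale * rng.standard_normal(10)
    M = relu_quasi_derivative(x)
    max_defect = max(max_defect, np.linalg.norm(relu(x) - M @ x))
    max_opnorm = max(max_opnorm, np.linalg.norm(M, 2))
```

Both quantities stay bounded independently of the scale of x, which is exactly what quasi-homogeneity asks for.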

Theorem 4.4 and lemma 4.3 imply that a relative sub-gradient of any ReLU regularizer can be evaluated with the chain rule. Since deep-learning frameworks such as PyTorch [19] and TensorFlow [1] are built on formal application of the chain rule, calculating elements G(x) with $G(x) \in \partial_\phi \mathcal R(x)$ can be done by using backpropagation. Thus, the backpropagation procedure is an appropriate choice for any form of gradient descent used to find critical points of the given functional satisfying (4.2). This is for example of interest for learned regularizers [14–16, 18].
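The chain-rule evaluation can be sketched by hand for a small two-layer ReLU network with $\psi = \lVert \cdot \rVert^2$, so that $G(x) = \mathcal{L}_{\mathcal N}(x)^* \, 2 \mathcal N(x)$ with $\mathcal{L}_{\mathcal N}(x) = M_2 \mathbf{W}_2 M_1 \mathbf{W}_1$ according to lemma 4.3. The architecture, the random weights and the finite-difference check are assumptions of this sketch; it uses plain NumPy instead of a deep-learning framework so that every step of the backward pass is visible.

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.standard_normal((6, 4)), rng.standard_normal(6)  # hypothetical A_1
W2, b2 = rng.standard_normal((3, 6)), rng.standard_normal(3)  # hypothetical A_2

def network(x):
    """N(x) = ReLU(A_2(ReLU(A_1 x))) with affine A_l(x) = W_l x + b_l."""
    a1 = W1 @ x + b1
    a2 = W2 @ np.maximum(0.0, a1) + b2
    return np.maximum(0.0, a2), a1, a2

def regularizer(x):
    """R(x) = ||N(x)||^2."""
    return float(np.sum(network(x)[0] ** 2))

def relative_subgradient(x):
    """G(x) = L_N(x)^T psi'(N(x)) with psi = ||.||^2, via the chain rule."""
    out, a1, a2 = network(x)
    v = 2.0 * out                      # psi'(N(x))
    v = W2.T @ (v * (a2 > 0))          # back through ReLU and A_2
    v = W1.T @ (v * (a1 > 0))          # back through ReLU and A_1
    return v

x = rng.standard_normal(4)
G = relative_subgradient(x)

# Central finite differences as an independent check at a generic point.
h = 1e-6
fd = np.array([(regularizer(x + h * e) - regularizer(x - h * e)) / (2 * h)
               for e in np.eye(4)])
max_err = float(np.max(np.abs(G - fd)))
```

Away from the kinks of the ReLU the result coincides with the classical gradient, which is what a framework's backpropagation would return; at the kinks the mask convention $g(0) = 0$ selects one particular element.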

Remark 4.6 (Learned regularizers as multi-well potentials). Using (parametric) ReLU activation functions, the network (4.3) is a composition of piecewise affine operators and as such itself a piecewise affine operator. This means that the regularizer $\mathcal R = \lVert \mathcal{N}(\cdot) \rVert^2 $ of theorem 4.4, as for example considered in [14], behaves like a 'multi-well potential' similar to the one considered in figure 2. That is, it behaves like a function with multiple local minima where ideally each local minimum is located at a desirable solution.

A reasonable strategy to find such a regularizer is to train a network to have local minima which are located at the desired solutions. However, due to various difficulties during this process (e.g. the regularizer itself being only a local minimum of the loss function used for training, non-ideal network architectures) one would also expect the regularizer to have slightly different values at the desired solutions. This means that even if the local minima are located at the desired solutions, one cannot expect all of these local minima to have the same value, much less expect each of them to be a global minimum of the regularizer. In other words, one should expect slight perturbations as in figure 2.

5. Numerical simulations

The goal of this section is not to show that non-convex regularizers can improve the reconstructions, but rather to test the theory derived in the previous sections and to show what may happen when non-convex regularizers are chosen.

To this end, we consider the discretized versions of two toy problems in 1D. First, we consider an inpainting problem where around $50 \%$ of the signal entries were randomly removed. In this case the kernel of the forward operator $\mathbf{K}$ is simple to compute, and by using a separable prior we can easily study the properties of the solution we obtain in the limit. This makes the first toy problem ideal for testing whether the properties of the limiting solution described in the theory section hold true.

Second, we consider recovering a signal from its cumulative sum. Since the corresponding matrix is invertible there is a unique solution, and following theorem 3.5 we should observe convergence to this solution in the limit δ → 0. This toy problem is therefore well suited to study whether the given φ-critical points actually converge to the unique solution.

For both problems we consider as the signal to recover the discretization of the function $f(t) = \exp(-t ^2) \cdot \cos(t) \cdot (t - 0.5) ^ 2 + \sin(t ^ 2)$ on $t \in [-1, 1]$ using N = 512 equidistant sample points. We denote this signal by $x_\mathrm{true}$ and the true underlying data by $y_\mathrm{true} = \mathbf{K} x_\mathrm{true}$ where K is the forward operator of the corresponding problem.
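The setup of both toy problems can be sketched in a few lines. The snippet below is an illustration, not the authors' implementation: it assumes NumPy, that the $N = 512$ sample points include both endpoints of $[-1, 1]$, that the inpainting operator is a diagonal $0/1$ mask, and that each entry is kept independently with probability $1/2$.

```python
import numpy as np

N = 512
t = np.linspace(-1.0, 1.0, N)          # assumption: both endpoints are included
x_true = np.exp(-t**2) * np.cos(t) * (t - 0.5) ** 2 + np.sin(t**2)

# Cumulative sum: lower-triangular matrix of ones, (K x)_i = x_1 + ... + x_i.
K_cumsum = np.tril(np.ones((N, N)))

# Inpainting: keep each entry with probability 1/2 (assumed sampling scheme).
rng = np.random.default_rng(0)
keep = rng.random(N) < 0.5
K_inpaint = np.diag(keep.astype(float))  # removed entries are mapped to zero

y_true_cumsum = K_cumsum @ x_true
y_true_inpaint = K_inpaint @ x_true

# Standard basis vectors e_i with keep[i] == False span the kernel of K_inpaint.
kernel_indices = np.flatnonzero(~keep)
```

With this representation the kernel of the inpainting operator can be read off directly from the mask, which is what makes the later check of the limiting solution simple.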

For each problem we consider the similarity measure given by $\mathcal{S}(x, y) = \frac{1}{2} \lVert \mathbf{K} x - y \rVert^2$ and we construct a regularizer by $\mathcal R(x) = \sum_{i = 1}^N \psi_{\rho, \beta}(x_i)$. Here, we define $\psi_{\rho, \beta}(t) = (t - \rho)^2 \cdot (t + \frac{\rho}{2})^2 + \frac{\beta}{2} t^2$ for $\rho, \beta > 0$. The function $\psi_{\rho, \beta}$ is constructed in such a way that it is non-convex but relatively sub-differentiable, see remark 2.5. Figure 5 shows the function $\psi_{\rho, \beta}(t)$ with parameters $\rho = 2$ and $\beta = 10^{-1}$ for $t \in [-3, 3]$, where the y-axis is plotted on a logarithmic scale in order to emphasize the non-convexity. We can see that this function has a global minimum near $t = -\frac{\rho}{2}$, a local minimum close to $t = \rho$ and another critical point in the interval $[0, 1]$. The parameters $\rho = 2$ and $\beta = 10^{-1}$ are used for all the following simulations.
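The critical points of $\psi_{\rho, \beta}$ described above can be located exactly, since $\psi_{\rho, \beta}'$ is a cubic polynomial. The following sketch, assuming NumPy's `poly1d` polynomial class, builds $\psi_{2, 10^{-1}}$, computes the roots of its derivative and classifies them by the sign of the second derivative.

```python
import numpy as np

rho, beta = 2.0, 1e-1

# psi(t) = (t - rho)^2 (t + rho/2)^2 + (beta/2) t^2 as a polynomial in t.
psi = (np.poly1d([1.0, -rho]) ** 2 * np.poly1d([1.0, rho / 2]) ** 2
       + np.poly1d([beta / 2, 0.0, 0.0]))
dpsi = psi.deriv()        # psi'
d2psi = psi.deriv(2)      # psi''

# Critical points are the real roots of the cubic psi'.
critical_points = np.sort(dpsi.roots.real)
for c in critical_points:
    kind = "local min" if d2psi(c) > 0 else "local max"
    print(f"t = {c:+.4f}, psi = {psi(c):.4f}, {kind}")
```

The smallest root lies close to $-\rho/2 = -1$ and the largest close to $\rho = 2$; comparing the values of $\psi_{\rho, \beta}$ at these two points shows which of them is the global minimum.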

As a separable sum of relatively sub-differentiable and non-convex terms the regularizer $\mathcal R$ as defined above is relatively sub-differentiable and non-convex. By definition of $\mathcal R$ it is further coercive and hence the functional $\mathcal{T}_{\alpha, y^\delta}$ is coercive. This shows that condition 3.1 is satisfied and we consider the stability and convergence of the φ-critical points according to theorems 3.3 and 3.5.

To simulate noisy data we consider the data $y_k = y_\mathrm{true} + \delta_k \cdot n$ where $n = \frac{\xi}{\lVert \xi \rVert}$, ξ is a normally distributed random variable and $\delta_k = 10^{-k}$ for $k \in \{4, \dots, 14\}$.
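A possible realization of this noise model is sketched below. It assumes NumPy, a single noise direction $n$ shared by all noise levels, and placeholder data in place of $y_\mathrm{true}$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.cumsum(np.ones(16))        # placeholder data, for illustration only

xi = rng.standard_normal(y_true.shape)
n = xi / np.linalg.norm(xi)            # unit-norm noise direction, shared by all k

deltas = {k: 10.0 ** (-k) for k in range(4, 15)}
noisy = {k: y_true + delta * n for k, delta in deltas.items()}

# By construction ||y_k - y_true|| = delta_k, up to floating-point rounding.
noise_norms = {k: np.linalg.norm(noisy[k] - y_true) for k in deltas}
```

For the smallest values of $\delta_k$ the perturbation approaches the rounding error of double-precision arithmetic when the data entries are of order one or larger, so the realized noise norm may deviate noticeably from $\delta_k$ there.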

Since a bound φ for $\mathcal R$ can be chosen such that $\mathcal R^{^{\prime}}(x) \in \partial_\phi \mathcal R(x)$ (see remark 2.5), we can simply search for a classical critical point of $\mathcal{T}_{\alpha, y^\delta}$ in order to obtain φ-critical points. To achieve this, we apply Newton's method, e.g. [8], and we find an initial guess for Newton's method by applying Nesterov accelerated gradient descent [17] to the starting point $x_0 = 0$. In the following we denote by $x_\alpha^\delta$ a critical point of $\mathcal{T}_{\alpha, y^\delta}$ and by $x_\alpha$ a critical point of $\mathcal{T}_{\alpha, y_\mathrm{true}}$. Here, $x_\alpha$ is considered as the limit point for the stability considerations, for which we consider the choices $\alpha \in \{10^{-2}, 10^{-3}, 10^{-4}\}$. In order to test for convergence we choose $\alpha = \alpha(\delta) = \delta^q$ for $q \in \{1, \frac{3}{2}\}$. For the convergence simulations we consider as the limit point the signal $x_\mathrm{true}$ for the cumulative sum problem, as in this case the solution is unique. For the inpainting problem we construct an approximate solution by finding a critical point of the function $\mathcal{T}_{\alpha(\delta), \delta}$ for $\delta = 10^{-16}$, and we denote this solution by $x_{\boldsymbol{\texttt{+}}}$. Implementation details and code are publicly available$^1$.
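The two-stage search for a classical critical point can be sketched as follows. This is not the authors' implementation: the step-size heuristic, the momentum weights $i/(i+3)$, the iteration counts, the tolerance and the reduced problem size $N = 16$ are all assumptions chosen to keep the example short, and the function names are hypothetical.

```python
import numpy as np

rho, beta = 2.0, 1e-1
psi = (np.poly1d([1.0, -rho]) ** 2 * np.poly1d([1.0, rho / 2]) ** 2
       + np.poly1d([beta / 2, 0.0, 0.0]))
dpsi, d2psi = psi.deriv(), psi.deriv(2)

def grad_T(x, K, y, alpha):
    """Gradient of T(x) = 0.5 ||K x - y||^2 + alpha * sum_i psi(x_i)."""
    return K.T @ (K @ x - y) + alpha * dpsi(x)

def hess_T(x, K, alpha):
    """Hessian of T: K^T K + alpha * diag(psi''(x_i))."""
    return K.T @ K + alpha * np.diag(d2psi(x))

def critical_point(K, y, alpha, warm_iters=500, newton_iters=50, tol=1e-10):
    x = np.zeros(K.shape[1])                       # starting point x_0 = 0
    z = x.copy()
    # Heuristic step size: 1 / (||K||^2 + alpha * max |psi''| on [-3, 3]).
    L = (np.linalg.norm(K, 2) ** 2
         + alpha * np.max(np.abs(d2psi(np.linspace(-3.0, 3.0, 601)))))
    for i in range(warm_iters):                    # Nesterov accelerated gradient descent
        x_new = z - grad_T(z, K, y, alpha) / L
        z = x_new + (i / (i + 3)) * (x_new - x)
        x = x_new
    for _ in range(newton_iters):                  # Newton's method on grad T = 0
        g = grad_T(x, K, y, alpha)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess_T(x, K, alpha), g)
    return x

# Small cumulative sum example (N = 16 instead of 512 to keep it quick).
N = 16
t = np.linspace(-1.0, 1.0, N)
x_true = np.exp(-t**2) * np.cos(t) * (t - 0.5) ** 2 + np.sin(t**2)
K = np.tril(np.ones((N, N)))
y_true = K @ x_true
x_alpha = critical_point(K, y_true, alpha=1e-2)
residual = np.linalg.norm(grad_T(x_alpha, K, y_true, 1e-2))
```

Newton's method only solves $\nabla \mathcal{T} = 0$ and does not distinguish minima from saddle points or maxima, which is consistent with treating the output as a critical point rather than a minimizer.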

5.1. Results

Figure 3 depicts the value $\lVert x_\alpha^\delta - x_\alpha \rVert$ for different values of $\alpha > 0$ (left), $\lVert \mathbf{K} x_{\alpha(\delta), \delta} - y_\mathrm{true} \rVert$ (middle) and $\lVert x_{\alpha(\delta), \delta} - x_{\boldsymbol{\texttt{+}}} \rVert$ (right). Each of these values is plotted against δ on a log-log scale. The plot on the left shows that for any chosen α we can observe the convergence of the sequence $x_\alpha^\delta$ to the critical point $x_\alpha$ as the noise level tends to 0. The plots in the middle and on the right show the convergence behaviour for the choices $\alpha(\delta) = \delta^q$ specified above. All of the sequences can be observed to converge, i.e. $\lVert \mathbf{K} x_{\alpha(\delta), \delta} - y_\mathrm{true} \rVert \to 0$ and $\lVert x_{\alpha(\delta), \delta} - x_{\boldsymbol{\texttt{+}}} \rVert \to 0$ as the noise level δ tends to 0.

Figure 3. Stability and convergence for the cumulative sum problem. Each value is plotted in dependence on δ. Left: $\lVert x_\alpha^\delta - x_\alpha \rVert$ for different but fixed values of α. Middle: $\lVert \mathbf{K} x_{\alpha(\delta), \delta} - y \rVert$ for different $\alpha(\delta)$. Right: $\lVert x_{\alpha(\delta), \delta} - x_{\boldsymbol{\texttt{+}}} \rVert$ for different $\alpha(\delta)$.

Figure 4 shows the same behaviour for the stability and convergence plots for the inpainting problem as figure 3 in the limit δ → 0. In particular convergence to a solution $x_{\boldsymbol{\texttt{+}}}$ of the problem $\mathbf{K} x = y_\mathrm{true}$ can be observed.

Figure 4. Stability and convergence for the inpainting problem. Each value is plotted in dependence on δ. Left: $\lVert x_\alpha^\delta - x_\alpha \rVert$ for different but fixed values of α. Middle: $\lVert \mathbf{K} x_{\alpha(\delta), \delta} - y \rVert$ for different $\alpha(\delta)$. Right: $\lVert x_{\alpha(\delta), \delta} - x_{\boldsymbol{\texttt{+}}} \rVert$ for different $\alpha(\delta)$.

A closer look at the inpainting problem reveals that the limit point $x_{\boldsymbol{\texttt{+}}}$ is, however, not an $\mathcal R$-minimizing solution. This can easily be checked due to the separability of the regularizer and the simple representation of the kernel of the inpainting problem. The orange dot in figure 5 (left) is the $\psi_{\rho, \beta}$-value of $(x_{\boldsymbol{\texttt{+}}})_i$, where $i$ is chosen as an index in the kernel of the inpainting matrix $\mathbf{K}$, i.e. such that $\mathbf{K} e_i = 0$, where $e_i$ is the $i$th standard basis vector. Since the regularizer is separable and this entry does not sit at the global minimum of $\psi_{\rho, \beta}$, the solution $x_{\boldsymbol{\texttt{+}}}$ is clearly not $\mathcal R$-minimizing. This is a consequence of the non-convexity of the regularizer.

Figure 5. Regularizer and properties of the inpainting solution. Left: $\psi_{2, 10^{-1}}(t)$ for $t \in [-3, 3]$ on a logarithmic scale to emphasize the local minimum. The dot is the value of $(x_{\boldsymbol{\texttt{+}}})_i$ where $x_{\boldsymbol{\texttt{+}}}$ is the solution of the inpainting problem and $i$ is an index chosen such that $e_i$ is in the kernel of the inpainting problem. Right: the values $\lvert \langle \mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}), e_i \rangle \rvert$ where $(e_i)_{i \in I}$ is a basis of the kernel of the inpainting problem.

Moreover, we have observed that if we initialize the values in the kernel close to −1 or 2 then the limit $x_{\boldsymbol{\texttt{+}}}$ will have entries at these φ-critical points of $\psi_{\rho, \beta}$. This shows that in such cases the solution we obtain in the limit heavily depends on the initialization we choose and that, depending on this initialization, the recovered solution may not be an $\mathcal R$-minimizing solution and potentially even a local maximum or a saddle point.

Finally, figure 5(right) shows the values $\lvert \langle \mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}), e_i \rangle \rvert$ where $(e_i)_i$ is a basis of the kernel of K. Up to numerical accuracy we see that we have $\langle \mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}), e_i \rangle = 0$ for each such index i which shows that $-\mathcal R^{^{\prime}}(x_{\boldsymbol{\texttt{+}}}) \in \ker(\mathbf{K})^\perp$ as in lemma 3.9.
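Because the regularizer is separable, both checks discussed above reduce to scalar evaluations of $\psi_{\rho, \beta}$ and its derivative at the kernel entries. The sketch below illustrates this on a made-up six-entry example; the mask, the observed entries and the choice of placing one critical point of $\psi_{\rho, \beta}$ in each kernel entry are assumptions for illustration and do not reproduce the paper's $x_{\boldsymbol{\texttt{+}}}$.

```python
import numpy as np

rho, beta = 2.0, 1e-1
psi = (np.poly1d([1.0, -rho]) ** 2 * np.poly1d([1.0, rho / 2]) ** 2
       + np.poly1d([beta / 2, 0.0, 0.0]))
dpsi = psi.deriv()

keep = np.array([True, False, True, False, False, True])   # toy inpainting mask
kernel_indices = np.flatnonzero(~keep)                      # K e_i = 0 for these i

critical_points = np.sort(dpsi.roots.real)                  # zeros of psi'
x_plus = np.array([0.3, 0.0, 1.1, 0.0, 0.0, 0.7])           # made-up observed entries
x_plus[kernel_indices] = critical_points                    # one critical point per kernel entry

# <R'(x), e_i> = psi'(x_i) because R is separable.
kernel_gradient = np.abs(dpsi(x_plus[kernel_indices]))

# x_plus is R-minimizing among solutions only if every kernel entry sits at the
# global minimizer of psi; entries at the other critical points violate this.
t_star = critical_points[np.argmin(psi(critical_points))]
is_R_minimizing = bool(np.allclose(x_plus[kernel_indices], t_star))
```

In this toy example the kernel components of $\mathcal R^{^{\prime}}$ vanish up to rounding, while the solution is not $\mathcal R$-minimizing, mirroring the behaviour reported for the inpainting problem.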

6. Conclusion and outlook

We have introduced and studied the concept of critical point regularization, which, as opposed to classical variational regularization, considers (φ-)critical points of Tikhonov functionals as regularized solutions. The advantage of this approach is that it completely discards the strong and typically unrealistic assumption of being able to compute global minimizers of these functionals. Our theory shows that under reasonable assumptions on the involved functionals the resulting method is nevertheless a stable and convergent regularization method. Further, we have shown that the solutions in the limit δ → 0 satisfy some form of first-order optimality conditions of the constrained optimization problem $\inf_x \mathcal R(x)$ subject to the constraint $\mathcal{S}(x, y) = 0$. Besides this, the theory presented here extends the theory of convex functionals by showing that at no point does one require global minimizers, but only points which are close to a global minimum in some sense. For practical applications this means that minimization algorithms do not need to be run until convergence but may be stopped early if easily verifiable conditions are met. Additionally, under assumptions on the regularizer $\mathcal R$ this theory is directly applicable to regularized solutions which are classical critical points of the involved functionals. As such, our theory gives stability and convergence results for critical points of potentially non-convex functionals.

Finally, we have provided numerical simulations which support our theoretical findings, i.e. the stability and convergence of critical point regularization. These numerical examples show that, depending on the algorithm used for obtaining critical points, one cannot expect to find global or even local minima. This further supports the need for a theory based on (φ-)critical points, which we have developed in this paper.

As the main concern of this paper was to introduce the concept of using (φ-)critical points as regularized solutions, we have not derived any stability or convergence rates; studying such rates is the subject of future work. Besides this, deriving conditions under which learned regularizers, e.g. [5, 16, 18], give rise to relatively sub-differentiable functions is also the subject of future work.

Data availability statement

No new data were created or analysed in this study.

Footnotes
