Abstract
One of the key assumptions in the stability and convergence analysis of variational regularization is the ability to find global minimizers. However, such an assumption is often not feasible when the regularizer is a black box or non-convex, making the search for global minimizers of the involved Tikhonov functional a challenging task. This is in particular the case for the emerging class of learned regularizers defined by neural networks. Instead, standard minimization schemes are applied, which typically only guarantee that a critical point is found. To address this issue, in this paper we study stability and convergence properties of critical points of Tikhonov functionals with a possibly non-convex regularizer. To this end, we introduce the concept of relative sub-differentiability and study its basic properties. Based on this concept, we develop a convergence analysis assuming relative sub-differentiability of the regularizer. The rationale behind the proposed concept is that critical points of the Tikhonov functional are also relative critical points and that for the latter a convergence theory can be developed. For the case where the noise level tends to zero, we derive a limiting problem representing first-order optimality conditions of a related restricted optimization problem. Besides this, we also give a comparison with classical methods and show that the class of ReLU networks is an appropriate choice for the regularization functional. Finally, we provide numerical simulations that support our theoretical findings and the need for the sort of analysis provided in this paper.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
In various scientific fields and applications, such as medical imaging or remote sensing, it is often not possible to obtain the desired quantity of interest directly. Assuming a linear measurement model, recovering the quantity of interest requires solving an inverse problem of the form
where is a linear operator between Hilbert spaces modeling the forward problem, ηδ is the data perturbation, is the noisy data and is the sought signal. In many cases these problems are ill-posed, meaning that no continuous right inverse of the operator K exists. To overcome this issue, several established approaches for the stable approximation of solutions of inverse problems exist.
1.1. Regularization with non-convex penalties
Particularly popular regularization techniques are variational methods [10, 22]. These methods recover regularized solutions as global minimizers of the Tikhonov functional
Here, is a regularizer which encodes prior information about the desired solution and plays the role of a data-discrepancy measure. Classically, regularizers have been hand-crafted, including L2-penalties, sparse regularization techniques or total variation [2, 10, 12]. While such hand-crafted regularizers are often convex and hence global minima can be computed by classical convex optimization, hand-crafted priors typically lack adaptability to available data.
In more recent years, there has been a shift to learned and potentially non-convex priors [4, 14–16, 18]. It has been observed that these methods often outperform classical methods. Moreover, a full convergence analysis has been provided [14, 18]. However, such an analysis assumes that minimizers of the Tikhonov functional are known, or at least given to within a certain accuracy. For non-convex regularizers such an assumption is unrealistic, as global minimizers are challenging to compute. Instead, when trying to find a regularized solution one often employs minimization algorithms such as gradient descent or variations thereof, which converge to critical points (such as local minimizers close to the initial guess) rather than to global minimizers of the Tikhonov functional. While one could constrain the learned regularizers to only include convex functionals [3, 16], this might result in suboptimal reconstructions when the underlying signal class is inherently non-convex. For such classes, non-convexity of the regularizer can be a highly desirable property and as such a convergence analysis for this case is needed. Importantly, such an analysis should not rely on the strict assumption that the regularized solutions are global minimizers of the underlying Tikhonov functional.
We briefly mention here that there exist other interesting cases where the Tikhonov functional is non-convex such as for example in the case of a nonlinear forward operator. However, in this paper we only consider the linear case. Besides this we also mention that there are potentially different ways to deal with non-convexity of Tikhonov functionals, for example, by use of convexification [13]. However, for the learned regularizers we have in mind (see results in section 4), the modification of the involved functionals is nontrivial in general. Besides, the modification of the learned functionals can change the original properties of the learned functional in an unfavorable way.
1.2. Proposed critical point regularization
In this paper we present a convergence analysis of critical points of the Tikhonov functional for the stable solution of inverse problems of the form (1.1). We refer to any such method, which recovers critical points as regularized solutions, as critical point regularization. In fact, we show stability and convergence for a relaxed notion of critical points. More precisely, we study stability and convergence of φ-critical points, namely elements satisfying . Here is the φ-relative sub-differential, a novel concept that we introduce and study in this paper. Whenever the classical norm-discrepancy is used to measure similarity, we show that, as the noise level tends to zero, regularized elements converge to elements with
resembling first-order conditions of the constrained optimization problem defining -minimizing solutions.
We give our analysis for more general data discrepancy measures, of which is only a special case. Further, we mention that in [9] a stability analysis for the case of local minima has been carried out. In contrast to our work, the authors of [9] restrict themselves to the finite-dimensional setting and do not provide convergence results for the case that the noise level tends to zero. Allowing the underlying space to be a general Hilbert space without any restriction on the dimension has the advantage that the analysis is dimension-independent and as such applies to any discretization used in practical applications. A precise analysis of the discretization is beyond the scope of this paper and we refer to the corresponding work in conventional variational regularization [4, 20].
Note that whenever the Tikhonov functional is relatively sub-differentiable, then critical points of the Tikhonov functional are also relative critical points if φ is constructed accordingly, and the proposed concept yields a convergent regularization for critical points. We will show that this is indeed the case, for example, for a class of learned regularizers defined by neural networks. We are not aware of any other study covering stability and convergence of critical points and, to the best of our knowledge, the present analysis is the first to attempt this.
1.3. Main contributions
In this paper we introduce the concept of relative sub-differentiability as a generalization of sub-differentiability of convex functions to the non-convex case. We develop theory for relative sub-differentiability and show that corresponding φ-critical points can be found by employing a generalized gradient descent method. From the viewpoint of regularization theory we give existence, stability and convergence results for φ-critical points and derive the limiting problem for critical point regularization. As opposed to the convex case, where the solutions one obtains are -minimizing solutions, we obtain as a limiting problem a related first-order optimality condition. As a special case of our analysis we derive stability and convergence results for critical points of differentiable Tikhonov functionals. For example, in this case, we get that is in the normal cone of the set of all solutions.
Finally, we provide numerical simulations which support our theoretical findings, in particular the stability, the convergence and the limiting problem. Moreover, the results of our numerical simulations show that even in simple cases of non-convex regularizers the assumption of obtaining global minima or even local minima is infeasible, thus further emphasizing the need for the analysis we provide in this paper. Besides, the numerical results show that the solutions we obtain cannot be expected to be -minimizing solutions and may even be local maxima of the regularizer whenever the initialization is chosen inappropriately and the algorithm does not guarantee that local minima are obtained.
1.4. Overview
The rest of the paper is organized as follows. In section 2 we motivate and introduce the concept of relative sub-differentiability and corresponding φ-critical points. Moreover, we study basic properties of relative sub-differentiability and show that φ-critical points can be obtained by employing a generalized gradient descent method. Section 3 builds on the concept of relative sub-differentiability and gives a convergence analysis for critical point regularization. Moreover, we take a closer look at the differentiable case and identify the limiting problem in this case. Section 4 gives a comparison with classical methods and shows that ReLU networks are appropriate choices for the regularization functional. In section 5 we provide numerical experiments which support our theoretical findings such as stability and convergence. Finally, we conclude the paper with a brief summary and outlook in section 6.
2. Relative sub-differentiability
In this section and in the rest of the paper, unless stated otherwise, we assume that is a Banach space, denote by its dual and by the dual pairing of and , i.e. for and we have . Moreover, we denote by the derivative of any differentiable function and for any similarity measure we denote by the derivative with respect to its first argument.
Before giving the crucial definition of relative sub-differentiability we recall the importance of classical sub-differentiability in the context of convex functions. Recall that is called subgradient of some functional at if for all and that is sub-differentiable whenever the set of subgradients is non-empty for all . Minimizers x of are characterized by the optimality condition where denotes the set of all subgradients at point x. However, sub-differentiability implies convexity. We will therefore develop a relaxed concept of sub-differentiability relative to some functional by replacing the right-hand side in the definition of subgradients by .
2.1. Definition and basic properties
The following concept, generalizing sub-differentiability, is also applicable to non-convex functions.
Definition 2.1 (Relative sub-differentiability). Let and .
- (a)is called φ-relative subgradient of at if
- (b)The set of all φ-relative subgradients at x is denoted by and called the φ-relative sub-differential of at x.
- (c)The functional is called φ-relative sub-differentiable if for all .
Some remarks about definition 2.1 are in order.
- We call any such function φ a bound. It is clear that such a bound cannot be unique, since whenever is a relatively sub-differentiable function with bound φ then it is also relatively sub-differentiable with bound for any .
- Choosing φ = 0 we see that any convex and sub-differentiable function is relatively sub-differentiable, i.e. the class of all relative sub-differentiable functions includes the set of convex sub-differentiable functions.
- The relative subgradients depend on the function φ. In particular, whenever we choose a larger φ we generally also enlarge the set of possible relative subgradients, i.e. if then .
- Similar to the concept of subgradients for convex functions, the concept of relative subgradients is a global property since the defining inequality has to hold for any point .
Another approach to generalizing convexity and subgradients (and, as a consequence, critical points) is given in [11, 24], where convexity with respect to a set of functions W is defined. In such a setting is a subgradient of at x whenever for any . As a consequence any critical point, i.e. a point where 0 is a subgradient, will also be a global minimizer, and hence such a generalization cannot be used for our purposes. In [7] another concept of generalized gradients is discussed. In this setting the definition of the gradient depends only on neighborhoods around the point of interest. As a consequence we cannot expect the critical points to have any global properties, which are necessary for the analysis in section 3, making this generalization unfit for our analysis. However, it should be noted that, whenever convenient, one might substitute any differentiability assumption on the involved functionals with Clarke's generalized gradient concept in any of the following discussions.
In what follows we will assume that is φ-relatively sub-differentiable for some fixed φ. Based on this definition we generalize the concept of critical points as follows.
Definition 2.3 (φ-critical points). We call a φ-critical point of if . Moreover, we denote by the set of all φ-critical points of .
It should be noted that the definition of φ-critical points depends on φ and in practical applications one might not have access to φ. In such cases evaluating or finding relative subgradients might be infeasible. Nevertheless, the concept of φ-critical points is general enough to include an important class of points as the following remark illustrates.
Remark 2.4 (Critical points of differentiable functions). Let us assume that is a differentiable function which satisfies the inequality for any and some . Then we have . This shows that in this special case we have access to at least one element of . In particular, any critical point of , i.e. a point with , will always yield a φ-critical point of in the sense of definition 2.3 and hence definition 2.3 is a generalization of the classical concept of critical points for differentiable functions satisfying above inequality.
This shows that for a class of functions we have access to at least one element of the relative subgradient of . More importantly, for this class of functions we can make assertions about the points where holds, i.e. points which are reachable by use of a (minimization) algorithm which guarantees to find a critical point.
Before we move on, we briefly give a prototypical example of a non-convex function for which a bound φ can be chosen, such that .
Remark 2.5 (Examples of relative sub-differentiability). We start by giving a simple example of a function which is non-convex but relatively sub-differentiable. To this end, let be given and define . It is readily seen that is a polynomial of degree 4 with negative leading coefficient. Hence, this function is bounded from above and the relative sub-differentiability immediately follows by, for example, choosing . Clearly then, the function is also sub-differentiable for c > 0. The function g is plotted in figure 1 on the left side for different parameters on a semi-logarithmic scale to emphasize the non-convexity.
Now let us consider the function , see figure 1 on the right. Then, due to the coercivity and the existence of critical points 'at infinity', the derivative of cannot be in the relative subgradient of for any φ. This example illustrates which types of functions are not covered by the concept of relatively sub-differentiable functions when the derivative is supposed to lie in the relative subgradient. In particular, the concept of relative sub-differentiability excludes coercive functionals which have critical points 'at infinity'.
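To make the defining inequality concrete, the following sketch checks it numerically for an illustrative choice of our own (not the quartic from the remark): g = sin is non-convex, and Taylor's theorem with |sin''| ≤ 1 shows that its derivative is a φ-relative subgradient for the quadratic bound φ(v, x) = ½(v − x)².

```python
import numpy as np

# Illustrative choice (not from the remark): g = sin is non-convex, but
# Taylor's theorem with |sin''| <= 1 gives
#   sin(v) >= sin(x) + cos(x)*(v - x) - 0.5*(v - x)**2,
# i.e. cos(x) is a phi-relative subgradient with phi(v, x) = 0.5*(v - x)**2.
g, dg = np.sin, np.cos
phi = lambda v, x: 0.5 * (v - x) ** 2

rng = np.random.default_rng(0)
v, x = rng.uniform(-10.0, 10.0, size=(2, 100_000))

# Defining inequality of a phi-relative subgradient, checked pointwise
# (small tolerance for floating-point round-off).
ok = g(v) >= g(x) + dg(x) * (v - x) - phi(v, x) - 1e-12
print(ok.all())
```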
Before discussing how one might obtain φ-critical points of relatively sub-differentiable functions, we list some useful properties of which we make constant use during the rest of the paper.
Lemma 2.6 (Basic properties of relative subgradients). Let and be bounds of for and w > 0. Moreover, set . Then the following hold
- (1)
- (2)
- (3)If is convex then for any
- (4)is convex and (weak*) closed
- (5)If then
- (6)
- (7)If is Lipschitz and φ bounded on bounded subsets, then is bounded. In particular, in this case the set is weak*-compact.
- (8)Let where and with . Assume that xk converge weakly to and gk converge to g and that is weakly lower semi-continuous. Then . If, instead, xk converge strongly to , gk converge weakly to g and is lower semi-continuous then we also have .
Proof. (1) Let and define . Then we have
and hence the claim follows.
(2) Assume that . Then we have by non-negativity of w and hence . Now let then we define and it follows
which shows that .
(3) This is an immediate consequence of by non-negativity of φ.
(4) Let and . Then we have
which proves the convexity of . Now let us assume that with pk (weak*) converging to p. By (weak*) convergence we have and hence p is also a φ-relative subgradient.
(5) This is also a consequence of the non-negativity of φ and the assumption that is a global minimizer.
(6) Let . Then by definition we have for any and hence also . On the other hand, if then we have for any and hence .
(7) Let and set with . Using the defining inequality we find
and thus by taking the supremum over v we find that is bounded. Using the Banach–Alaoglu theorem we see that must be weak*-compact.
(8) By assumption xk is bounded. Thus, we have
which proves the claim.
Lemma 2.6 gives us a characterization of φ-critical points as points x for which is an upper bound of . This characterization in particular implies that for any differentiable and relatively sub-differentiable function we have that the points x with must have bounded value independent of x. Comparing this to the convex case we have that x is a critical point of the function if and only if x is a global minimizer. In some sense, the definition of φ-critical points allows for some error to be made and guarantees that φ-critical points cannot have arbitrarily large -value. Moreover, whenever is coercive then all φ-critical points must be inside some ball for some r > 0.
2.2. Computation of φ-critical points
We next answer the question of how to obtain φ-critical points, at least for the case where is a Hilbert space. Clearly, if is differentiable then one could consider classical gradient descent methods. Since we are also interested in non-differentiable functions, gradient descent in its classical form may not be applicable. Below we show that a generalized gradient method using relative subgradients instead of gradients yields φ-critical points in the sense of definition 2.3. This shows that the algorithm is a natural extension of subgradient descent [6, 23].
Algorithm 1. Relative subgradient descent.
Require: Starting point , stepsizes
while do
  Choose and such that
end while
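A minimal numerical sketch of the scheme, under simplifying assumptions of our own: the functional is differentiable, its gradient is used as the relative subgradient, and the step sizes t_k = 1/(k+1) are square-summable but not summable, matching the step-size condition used in theorem 2.7.

```python
import numpy as np

def relative_subgradient_descent(subgrad, x0, steps=5000):
    """Sketch of the relative subgradient method: the update
    x_{k+1} = x_k - t_k * p_k with p_k a relative subgradient and
    step sizes t_k = 1/(k+1) (square-summable but not summable)."""
    x = x0
    for k in range(steps):
        x = x - subgrad(x) / (k + 1)
    return x

# Illustrative smooth non-convex functional (our own choice); its
# gradient serves as a relative subgradient as in remark 2.4.
F = lambda x: x**2 + 4 * np.sin(x)
dF = lambda x: 2 * x + 4 * np.cos(x)

x_star = relative_subgradient_descent(dF, x0=3.0)
print(abs(dF(x_star)) < 1e-3)   # the iterates approach a critical point
```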
The following result shows that the algorithm converges to a φ-critical point of the function . The given proof closely follows the one given in [6], but does not assume a finite-dimensional setting and considers relatively sub-differentiable functionals instead of sub-differentiable functions.
Theorem 2.7 (Convergence of Algorithm). Assume that is a Hilbert space and that is relatively sub-differentiable with bound φ. Moreover, choose in algorithm with such that for all . Then for any point and any step we have
Proof. Let . After rescaling of according to assumption we may assume that . Then by definition of we have
Applying this inequality recursively and using the fact that we find
which together with the inequalities and shows the desired result.
Theorem 2.7 shows that, under the assumption that the sequence of step-sizes is square-summable but not summable, in the limit we have for any . Note that the analysis and the proofs heavily rely on the functional φ; however, at no point during the algorithm do we need explicit knowledge of φ, but only access to elements of . In particular, in the case of remark 2.4, when using the gradient of as the update direction, the generated sequence will yield a φ-critical point.
Finally, assume that we have a functional of the form where each term is relatively sub-differentiable with bounds and . Then lemma 2.6 shows that for and we have . This implies that the algorithm can be applied in the case where we are looking for a φ-critical point of the sum of two relatively sub-differentiable functionals and only have access to elements of and .
3. Regularizing properties of φ -critical points
In this section we present a convergence analysis for φ-critical points of Tikhonov-type functionals, extending the existing analysis for global minima [22]. At this point we want to emphasize again that the assumption of being able to obtain global minima of can be extremely restrictive when is non-convex, and the main goal of our analysis is to dispense with this assumption. Instead we focus only on φ-critical points of , which may include local minimizers, saddle points or even local maxima.
Recall that we are interested in Tikhonov-type functionals of the form
for given and . Here, is a similarity measure between x and yδ and a standard situation we are interested in is where is the forward operator of the inverse problem of interest. Instead of working with global minima of the functional (3.1) we consider regularized solutions as αφ-critical points of , meaning
We will analyze stability and convergence of such critical points.
For the analysis we make the following assumptions.
Condition 3.1 (Critical point regularization)
- is a reflexive Banach space and is a metric space with metric
- is weakly sequentially lower semi-continuous
- is relatively sub-differentiable with bound φ
- is weakly sequentially lower semi-continuous, convex in its first argument and continuous in its second argument
- and the functional is coercive, i.e. for
Most of the assumptions in condition 3.1 are classical assumptions (or generalizations thereof) made for the analysis of variational methods, see e.g. [14, 18, 22]. For example, the coercivity assumption only poses a condition on the involved functionals 'at infinity' and as such does not restrict, say, the behaviour in a ball around 0. This means that the function can be highly non-convex as long as it grows fast enough outside bounded sets. The major differences in the analysis provided here are that is relatively sub-differentiable, as motivated in section 2, and that in general the regularized solutions are not global minima but only φ-critical points.
One of the simplest and most commonly used examples of a similarity measure satisfying Assumptions (C4) and (C5) is given by , whenever is the linear forward operator of the underlying inverse problem and is a Banach space. In general, any similarity measure of the form satisfies these assumptions if is a linear and bounded operator, e.g. a reweighting of the residual .
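As an illustration of the regularized solutions considered here, the following self-contained sketch (the operator, data, regularizer and all parameters are hypothetical choices of our own) pairs the norm discrepancy with a smooth non-convex regularizer and runs gradient descent on the Tikhonov functional; the returned point satisfies the first-order condition approximately, without any claim of being a global minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy inverse problem (all sizes and choices are our own):
K = rng.normal(size=(20, 40))                 # underdetermined forward operator
x_true = np.zeros(40); x_true[[3, 17]] = 1.0  # sparse ground truth
y = K @ x_true + 0.01 * rng.normal(size=20)   # noisy data y^delta

alpha = 0.1
dD = lambda x: K.T @ (K @ x - y)     # gradient of D(Kx, y) = 0.5*||Kx - y||^2
dR = lambda x: 2 * x / (1 + x**2)    # gradient of the smooth non-convex
                                     # regularizer R(x) = sum(log(1 + x_i^2))

x = np.zeros(40)
tau = 1.0 / np.linalg.norm(K, 2) ** 2    # step size below 1/L of the data term
for _ in range(50_000):
    x = x - tau * (dD(x) + alpha * dR(x))

# The iterate satisfies the first-order (critical point) condition of the
# Tikhonov functional approximately; for a non-convex R it need not be a
# global minimizer.
grad_norm = np.linalg.norm(dD(x) + alpha * dR(x))
print(grad_norm < 1e-4)
```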
We now turn our focus to the stability and convergence analysis of the considered method, i.e. . We start with existence and stability results.
3.1. Existence and stability
Theorem 3.2 (Existence). Under condition 3.1 the problem is well-posed, i.e. for every α > 0 and the set is non-empty.
Proof. This is an immediate consequence of the existence of minimizers of which follows from the coercivity and the continuity assumptions on the functional . A more detailed proof can be found in [22].
Clearly, αφ-critical points may exist under weaker assumptions than a coercivity assumption. However, the coercivity is an important property in the following analysis which guarantees the existence of a weakly convergent subsequence whenever the sequence is bounded. As such we have also derived existence of φ-critical points using the coercivity. Extending the current analysis to the case of non-coercive functionals is subject to future work.
Another advantage of using φ-critical points as opposed to global minima, besides being numerically and hence practically more tractable for non-convex functionals, is that we have a simple way of talking about 'inexact' critical points, i.e. points where the gradient is small but not necessarily 0. As it turns out, the following analysis can be performed under the even weaker assumption that the stabilized solutions are 'inexact' critical points instead of exact critical points.
Theorem 3.3 (Stability). Let and and assume that is such that with and . Then the sequence has a weakly convergent subsequence and the limit of every weakly convergent subsequence is an αφ-critical point of .
Proof. To show the existence of a weakly convergent subsequence, using the reflexivity of , it is enough to show that is a bounded sequence. By coercivity of it is enough to show that is bounded. We have for any
and using it follows
By assumption on we have which yields
for any . By assumption we have and so the right hand side is bounded for k large enough. This shows that there exists some weakly convergent subsequence.
Let now denote such a subsequence and denote by its limit. Using the weak lower semi-continuity of the involved functionals it follows for any
where the last equality follows from continuity of in its second argument. This shows that .
Clearly, whenever , i.e. xk is an αφ-critical point, then the assumptions on zk in theorem 3.3 are satisfied. It follows that αφ-critical points are stable in the above sense. However, theorem 3.3 also shows that we do not need access to exact αφ-critical points but rather points which are in some sense close to an αφ-critical point.
Remark 3.4 (Inexact critical points obtained by use of minimization schemes). Consider once again the case of remark 2.4 and assume that the αφ-critical points are obtained by using gradient descent or any other algorithm which finds zeros of the gradient. Then we have and, whenever and , we have that the considered points have a weakly convergent subsequence. For practical applications this means that we have an easily verifiable condition which can be used as a kind of stopping criterion when searching for critical points. As a consequence, we do not have to guarantee that the regularized solutions are critical points, but only that they are 'close' to being critical points.
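The stopping rule suggested by the remark can be sketched as follows; the objective, step size and tolerance below are hypothetical choices of our own, and the criterion is simply that the gradient norm drops below a prescribed ε.

```python
import numpy as np

def find_inexact_critical_point(grad, x0, eps, tau, max_iter=100_000):
    """Run gradient descent until the easily verifiable stopping
    criterion ||grad T(x)|| <= eps holds, i.e. until the iterate
    is an 'inexact' critical point in the sense of remark 3.4."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return x, True
        x = x - tau * g
    return x, False

# Hypothetical smooth non-convex objective for illustration.
T = lambda x: 0.5 * np.sum(x**2) + np.sum(np.sin(x))
dT = lambda x: x + np.cos(x)

x, converged = find_inexact_critical_point(dT, np.full(5, 3.0),
                                           eps=1e-4, tau=0.4)
print(converged, np.linalg.norm(dT(x)) <= 1e-4)
```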
3.2. Convergence
The next goal is to show the convergence of the regularized solutions to a solution of the original problem as the noise level δ tends to 0. Here, we call an -solution of if . As in the case of theorem 3.3, the proof can be carried out under the weaker assumption of only having access to 'inexact' φ-critical points (see remark 3.4).
Theorem 3.5 (Convergence). Let and assume it has an -solution. Further, let with with . Choose such that for we have . Assume that the regularized solutions are such that with and .
Then the sequence has a weakly convergent subsequence and the limit of any such sequence is an -solution of y. Moreover, we have for any -solution u. Finally, whenever the -solution is unique then converges weakly to this solution.
Proof. Similar to the stability proof we show that is bounded by using the coercivity of the functionals . Following the above proof we find for any
and by choosing u such that we find which implies
Since both and are non-negative it then follows
where we have used the assumptions . This shows that is a bounded sequence and using once again we find that for the sequence is bounded. Using the coercivity of we get that the sequence is bounded and hence has a weakly convergent subsequence.
Finally, using the weak lower-semicontinuity of and we have that for any such weakly convergent subsequence with limit
for any with .
Whenever the solution is unique, then every subsequence of has a subsequence converging to this solution. This shows that converges weakly to the unique solution.
At this point, we want to emphasize once again, that the assumptions on the choice of points xk in theorem 3.5 are weaker than the assumption that xk is an -critical point and that in particular the analysis also holds for these points.
Since in this section we only assume that is relatively sub-differentiable, without explicit knowledge of the bound φ, theorem 3.5 gives a somewhat intangible condition on the type of solutions we obtain in the limit δ → 0. A more tangible condition, and more importantly one independent of φ, is given by the next theorem, where we assume a separability condition on the gradients . This separability assumption is satisfied in many cases, e.g. when and are (relatively sub-)differentiable and the φ-critical points arise from an algorithm such as gradient descent. Using these algorithms we are often in the situation that , where sk is a (sub-)gradient of and rk is a (relative sub-)gradient of . Assuming that the gradients of have a cluster point, we obtain the following additional property of these cluster points.
Theorem 3.6 (Normality property of the solution). Let the same assumptions as in theorem 3.5 hold and denote by a weakly convergent subsequence with limit . Let where and .
Then any cluster point r of the sequence satisfies , where is the normal cone of the convex set of all -solutions of y at .
Proof. Let r be a cluster point of the sequence . Then by weak lower semi-continuity of and by assumption on rk we have
which shows that .
Now assume that is such that . Then we have and it follows
where we have used the convexity of in its first argument and the assumption on the limits of the sequences and .
Theorem 3.6 shows that the solution obtained by critical point regularization satisfies some form of first-order optimality condition, see e.g. [21].
Also note that in the case where is convex and we choose φ = 0, both properties in theorems 3.5 and 3.6 reduce to the common property that is an -minimizing solution, i.e. for any with .
Remark 3.7 (Convex regularizers). Clearly, any sub-differentiable convex function is relatively sub-differentiable with the choice φ = 0. Nevertheless, one could also choose . With this choice we see that the results in theorem 3.5 roughly state that the solutions we approximate by using critical point regularization are -minimizing solutions up to an error ε whenever the regularized solutions are minimizers up to an error of ε.
This result, as opposed to classical variational regularization theory, e.g. [22], has the advantage that at no point do we require exact global minimizers of the functionals, but only approximate minimizers, which may be more easily reachable in practical applications. Consider, for example, the case where we employ an iterative algorithm with convergence guarantees of the form for the n-th iterate, being a minimizer of . Applying this algorithm to and requiring that , in order to ensure that xn is an αφ-critical point, we see that the above theory allows one to stop the iterative algorithm after a finite number of steps, i.e. we do not necessarily need to run the algorithm until convergence and we still obtain a stable and convergent regularization method.
At first it might seem a disadvantage that we do not obtain an -minimizing solution in the limit. However, this can be circumvented by considering a variable ε. To be more precise, following the proof of theorem 3.5 with and the condition as δ → 0, it is easy to see that, to obtain a sequence weakly converging to an -minimizing solution, it is enough to run the iterative algorithm for a number of iteration steps nk such that .
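The bookkeeping in this argument can be made concrete with a small sketch (the rate constant C and the coupling of the accuracy to the noise level are hypothetical choices of our own): given an inner solver with a guarantee of the form T(x_n) − min T ≤ C/n, stopping after n = ⌈C/ε⌉ steps certifies an ε-inexact minimizer, and letting ε tend to 0 with the noise level recovers the variable-ε argument above.

```python
import math

def iterations_needed(C, eps):
    """Smallest n with C / n <= eps, for a solver with the
    (hypothetical) rate guarantee T(x_n) - min T <= C / n."""
    return math.ceil(C / eps)

C = 50.0
for delta in [1e-1, 1e-2, 1e-3]:
    eps = delta                  # couple the accuracy to the noise level
    n = iterations_needed(C, eps)
    assert C / n <= eps          # the rate bound certifies eps-accuracy
    print(delta, n)
```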
We next discuss another special case of our analysis which pertains to functionals such as the one in remark 2.5.
3.3. Differentiable regularizers and classical critical points
In this subsection we consider the important special case where the φ-critical points are given by classical critical points, i.e. by points x for which , and we give stability and convergence results for this case. To this end, we assume that the bound φ can be constructed in such a way that , see e.g. remark 2.4. Then, if is differentiable in its first argument, by convexity of , we have . This shows that whenever we employ some algorithm which finds a classical critical point, we also obtain an αφ-critical point in the sense of definition 2.1 which satisfies the separability assumption necessary for theorem 3.6. In particular, these points are amenable to the analysis above.
Nevertheless, the analysis relies on an abstract concept of φ-critical points and even in the case where the involved functionals are differentiable we cannot guarantee that the limits will again be φ-critical points without any additional assumptions. In order to guarantee this we need the assumption that and are weakly (sequentially) continuous. Combining the above theorems we then get the following result.
Proposition 3.8 (Existence, stability and convergence for classical critical points). Assume that and are differentiable with weakly continuous derivatives and let condition 3.1 hold. Moreover, let and α > 0 and assume that y has an -solution. Then the following hold
- 1.Existence: has at least one φ-critical point.
- 2.Stability: If is a sequence converging to yδ and xk is such that as and , then has a weakly convergent subsequence and any weak cluster point of is a critical point of .
- 3.Convergence: Let be a sequence with and be such that for we have . Then, if we choose such that satisfies and the sequence has at least one weak cluster point and any such cluster point is an -solution of y with the following additional properties
- (a)
- (b)for any with , i.e. .
Proof. This follows immediately by applying theorems 3.2, 3.3, 3.5 and 3.6.
For the differentiable case this identifies the limiting problem we solve by regularizing the inverse problem with critical points, i.e. in the limit we find solutions which satisfy a first-order optimality condition of the constrained optimization problem
We now briefly discuss the special case where is given as the norm-discrepancy, e.g. in the case where is a Hilbert space.
Lemma 3.9 (Solution for norm discrepancy). Let the same assumptions as in proposition 3.8 hold and assume that for some p > 1 where is a linear and bounded forward operator between Banach spaces and is differentiable. Furthermore, denote by a solution according to proposition 3.8.
Then we have .
Proof. Any solution z can be written as where . By using x0 and , proposition 3.8 shows that for any and hence the claim follows.
4. Examples and comparison
In this section we compare the proposed regularization concept using αφ-critical points to standard Tikhonov regularization, its convex relaxation and discuss the influence of different choices for φ. Moreover, we discuss ReLU-regularizers as a class of non-convex regularizers for which the presented theory is applicable.
4.1. Dependence on the choice of φ
We begin this section with a comparison between classical Tikhonov regularization and regularization with φ-critical points using different choices for φ. For the sake of simplicity we consider a convex example, where the data-fidelity is chosen as and the regularizer is given by . We assume that α > 0 and are fixed and denote by the unique minimizer of the Tikhonov functional . Note that in the convex case, standard Tikhonov regularization corresponds to regularization with αφ-critical points for the choice .
Consider now the case where is constant. Then, according to the general theory, any αφ-critical point x of is characterized by
Writing we find after some rearrangements that this is in turn equivalent to
where and . Since is positive definite, this shows that x0 has to be chosen in an ellipsoid around 0 and thus is contained in some ellipsoid around . Here, the size of the ellipsoid depends on the choice of ε and, for example, choosing ε = 0 leads to the singleton .
Let now φ be an arbitrary non-negative function such that minimization of is well-posed with minimizer xφ . Then, following the steps above, we find that αφ-critical points of are characterized by
where is as above and . That is, the set of αφ-critical points is an ellipsoid around the point xφ where the size of the ellipsoid depends on the value . Depending on the choice of φ this value can even be 0. To see this, consider for example the case where and for some β > 0, where the maximum is taken pointwise. Then, if pointwise, we find that and we arrive at and (4.1) collapses to . In a similar fashion, the choice for some β > 0 leads to a point estimate whenever is non-negative and to an ellipsoid if has at least one negative entry.
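The ellipsoid structure can be checked numerically in the quadratic case. The sketch below (with made-up toy data, not the paper's example) verifies that for a quadratic Tikhonov functional the excess value over the minimum equals the Hessian-weighted squared distance to the minimizer, so the approximate minimizers up to a slack form an ellipsoid around the Tikhonov minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((8, 5))     # toy forward operator
y = rng.standard_normal(8)          # toy data
alpha = 0.3

H = K.T @ K + alpha * np.eye(5)         # Hessian of the Tikhonov functional
x_star = np.linalg.solve(H, K.T @ y)    # unique Tikhonov minimizer

def T(x):
    return 0.5 * np.linalg.norm(K @ x - y)**2 + 0.5 * alpha * np.linalg.norm(x)**2

# For a quadratic functional the excess over the minimum is exactly the
# H-weighted squared distance, so {x : T(x) <= T(x_star) + alpha*eps}
# is an ellipsoid centered at x_star.
x = x_star + rng.standard_normal(5)
excess = T(x) - T(x_star)
quad = 0.5 * (x - x_star) @ H @ (x - x_star)
```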
While we allow for the whole set of solutions defined as αφ-critical points of , in practical applications one typically only chooses a specific subset. More precisely, as shown in section 2.2 one can apply variants of gradient descent to construct critical points. In the convex situation this might lead to the same solution independent of the choice of φ since the classical gradient can be used as an αφ subgradient for any choice of φ. However, in the non-convex case using such algorithms will not find solutions which are global minima but only critical points as we discuss in the following subsection.
4.2. Relation to convex relaxation
Next we investigate the relation between critical point regularization and regularization with the convex relaxation by means of a non-convex example. Consider and . As non-convex regularizer we use a slightly perturbed double-well potential , where for and we define
Here, q controls the perturbation away from 0 of the local minimum at wi . An illustration of ri for and two different values of q is depicted in figure 2. While at this point the choice of the regularizer might seem somewhat arbitrary and contrived, we will argue later that the regularizer chosen here is a simplified model for a reasonable class of learned regularizers; see the discussion on ReLU-networks in section 4.4 and in particular remark 4.6.
For given we consider (classical) critical points of the Tikhonov functional defined by
Although ri is not differentiable at , with slight abuse of notation we will denote by the derivative of ri , which we define as at . In the following the gradient will be understood with this convention.
Remark 4.1. One can construct a specific φ such that . It is then guaranteed that any classical critical point is an αφ-critical point and hence a regularized solution consistent with the theory presented in this paper. Such a φ can be constructed by considering the defining inequality pointwise and constructing φi for ri . The calculations for φi are relatively simple but tedious. Since the exact form of φ is irrelevant for our purpose, we refrain from explicitly defining φ. Instead, we focus on solutions that arise when solving equation (4.2). Further note that the algorithm presented in section 2.2 with as relative subgradient is the same as classical gradient descent, which further substantiates the assumption that the constructed regularized solution is a solution of (4.2).
By definition of the operator and the regularizer, solutions of (4.2) can be computed component-wise via . Hence
An interesting observation is that for we have a choice between 0 and wi where the choice will, in general, only lead to a local minimizer instead of a global one; compare with figure 2 for q = 0.75. In essence, this shows that the regularized solutions one obtains are a subset of the solutions of equation (4.2) and need not be global minimizers.
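The phenomenon of gradient descent ending in a lifted local well can be reproduced with a simple stand-in potential. The double well r(t) = min(t², (t − w)² + q) below is our own simplified surrogate for the perturbed double-well potential ri, not its exact definition from the paper; it has a global minimizer at 0 and a local minimizer at w lifted by q.

```python
import numpy as np

# Stand-in perturbed double well: wells at 0 and w, the one at w lifted
# by q > 0, so t = 0 is the global minimizer and t = w only a local one.
w, q = 2.0, 0.75

def r(t):
    return np.minimum(t**2, (t - w)**2 + q)

def r_prime(t):
    # piecewise derivative; at the kink we pick the left branch
    return np.where(t**2 <= (t - w)**2 + q, 2.0 * t, 2.0 * (t - w))

def descend(t0, step=0.1, iters=500):
    """Plain gradient descent: it ends in whichever well it starts near."""
    t = t0
    for _ in range(iters):
        t = t - step * r_prime(t)
    return t

t_left = descend(-1.0)    # converges to the global minimizer 0
t_right = descend(3.0)    # converges to the lifted local minimizer w
```

Both limits are critical points, but only the first is a global minimizer, mirroring the discussion above.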
Next we consider regularization with the convex relaxation of , i.e. the convex hull . After some lengthy calculations, which we do not present here for the sake of brevity, we derive for the convex hull of ri the form
A comparison of ri and for two choices of the parameter q is given in figure 2 where the convex hull in both cases is visualized with a dashed line.
Solving the critical point equation (4.2) with replaced by its convex hull, we find that the solutions differ considerably, at least for the case . In this case, two regimes arise. First, whenever , the convex hull allows for arbitrary solutions . On the other hand, if , the convex hull forces the choice . This means that even the slightest perturbation of the value of will lead to a convex hull which loses this information. Comparing this to the solutions above, we see that in the non-convex case we can always choose independent of the value of q. This shows that there is a difference between the regularized solutions obtained from the convex hull of the regularizer and those obtained from the original regularizer .
4.3. Non-equivalence to Tikhonov regularization
One might conjecture that the proposed regularization with αφ-critical points of is equivalent to Tikhonov regularization for some other modified choice of the regularizer. While we cannot give a definite answer to this question at this point, we conjecture that this is not the case.
To support our hypothesis, let us analyze what would happen if the construction of αφ-critical points of were equivalent to the Tikhonov regularization with some regularizer . In this case, the limiting problems would also coincide and thus for any and we have
Denoting by xφ a solution of we have . Clearly, this is the case if and only if is minimal among all possible solutions. This essentially means that .
While such a choice can theoretically be used to characterize the limiting problem, we do not have access to xφ and therefore cannot work with in practice. A slightly more subtle problem is that depends on the exact data and as such cannot be used in the noisy case where α > 0. Thus, we conjecture that the proposed regularization is not equivalent to Tikhonov regularization independent of the choice .
4.4. ReLU-Networks as class of possible regularizers
Next, we demonstrate that ReLU networks form a class of non-convex, relatively subdifferentiable regularizers that fit within the theory presented in this paper. As discussed in [14, 15], such regularizers are a powerful tool in the context of classical variational regularization.
Let now , be further Hilbert spaces.
Definition 4.2 (Quasi-homogeneity). A function is quasi-homogeneous if there exists such that and . We call the quasi-derivative of f.
In definition 4.2 and below denotes the space of all bounded linear mappings from to . Quasi-homogeneity satisfies the following elementary rules.
Lemma 4.3 (Quasi-homogeneity and relative sub-differentiability). Let and be quasi-homogeneous, W and . Moreover, let and let be convex and sub-differentiable with for some C > 0, and subgradient selection . Then for some the following hold:
- (1)is quasi-homogeneous with .
- (2)is quasi-homogeneous with .
- (3)W is quasi-homogeneous.
- (4)is φ1-relative sub-differentiable with .
- (5)is φ2-relatively sub-differentiable with and .
Proof. These properties follow immediately from the triangle inequality and the defining properties of quasi-homogeneity and relative sub-differentiability.
Theorem 4.4 (Learned regularizers). Let be convex and sub-differentiable with for some C > 0 and . Let be quasi-homogeneous and A be affine and continuous for . Then
is quasi-homogeneous. Additionally, is relatively sub-differentiable.
Proof. Follows from repeated application of lemma 4.3.
The crucial assumption in theorem 4.4 is that the activation functions are quasi-homogeneous. This property is, for example, satisfied when the ReLU is chosen as the activation function, as we discuss in the following example.
Example 4.5 (ReLU regularizer). Consider the case defined by where max is to be understood pointwise. Then , where denotes pointwise multiplication with if and if x > 0 and g is again understood pointwise. The ReLU function is then quasi-homogeneous whenever the space has the following property: For any we have and . Examples of such spaces are for some at most countable set Λ and parameters . In particular, this also holds in the finite dimensional case . Thus, theorem 4.4 shows that ReLU networks are an appropriate choice to construct regularizers. We note here that the same also holds true when the ReLU activation functions are replaced by the more general class of parametric ReLU activation functions.
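A quick numerical sanity check of the quasi-homogeneity identity for the ReLU, under the assumption (suggested by the example) that quasi-homogeneity means f(x) = (Df(x))x with Df(x) the pointwise activation mask:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_quasi_derivative(x):
    # pointwise multiplication by 1 where x > 0 and by 0 where x <= 0;
    # the convention at x = 0 does not affect the identity below
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
lhs = relu(x)                          # f(x)
rhs = relu_quasi_derivative(x) * x     # (Df(x)) x
```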
Theorem 4.4 and lemma 4.3 imply that a relative sub-gradient of any ReLU regularizer can be evaluated with the chain rule. Since deep-learning frameworks such as PyTorch [19] and Tensorflow [1] are built on formal application of the chain rule, calculating elements G(x) with can be done by using backpropagation. Thus, the backpropagation procedure is an appropriate choice for any form of gradient descent used to find critical points of the given functional satisfying (4.2). This is for example of interest for learned regularizers [14–16, 18].
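To illustrate how a relative subgradient of a ReLU regularizer is obtained by the formal chain rule, the sketch below hand-codes backpropagation through a tiny one-hidden-layer ReLU network. The weights A, b, c and the regularizer R are illustration-only values, not a learned regularizer from the paper.

```python
import numpy as np

# Tiny ReLU regularizer R(x) = c^T relu(A x + b); A, b, c are made-up.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([0.5, -0.5, 0.2])
c = np.array([1.0, 2.0, 3.0])

def regularizer(x):
    return c @ np.maximum(A @ x + b, 0.0)

def regularizer_subgrad(x):
    """Formal chain rule / backpropagation: the inner quasi-derivative is
    pointwise multiplication by the ReLU activation pattern."""
    mask = (A @ x + b > 0).astype(float)
    return A.T @ (c * mask)

x = np.array([1.0, 0.3])
g = regularizer_subgrad(x)      # relative subgradient candidate
# sanity check against a directional finite difference (the activation
# pattern does not change for this x and direction, so this is exact)
v = np.array([1.0, 1.0])
eps = 1e-6
fd = (regularizer(x + eps * v) - regularizer(x - eps * v)) / (2 * eps)
```

In a deep-learning framework the same quantity would be produced by automatic differentiation, which is the point made in the paragraph above.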
Remark 4.6 (Learned regularizers as multi-well potentials). Using (parametric) ReLU activation functions, the network (4.3) is a composition of piecewise affine operators and as such is itself a piecewise affine operator. This means that the regularizer of theorem 4.4, as for example considered in [14], behaves like a 'multi-well potential' similar to the one considered in figure 2. That is, it behaves like a function with multiple local minima where ideally each local minimum is located at a desirable solution.
A reasonable strategy to find such a regularizer is to train a network to have local minima which are located at the desired solutions. However, due to various difficulties during this process (e.g. training terminating in a local minimum of the loss function, non-ideal network architectures) one would also expect the regularizer to take slightly different values at the desired solutions. This means that even if the local minima are located at the desired solutions, one cannot expect all of these local minima to have the same value, much less expect each of them to be a global minimum of the regularizer. In other words, one should expect slight perturbations as in figure 2.
5. Numerical simulations
The goal of this section is not to show that non-convex regularizers can improve the reconstructions, but rather to test the theory derived in the previous sections and to show what may happen when non-convex regularizers are chosen.
To this end, we consider discretized versions of two toy problems in 1D. We first consider an inpainting problem where around of the signal entries were randomly removed. In this case the kernel of the forward operator K is simple to compute, and by using a separable prior we can easily study the properties of the solution obtained in the limit. This makes the first toy problem ideal for testing whether the properties of the limiting solution (as described in the theory section) hold true.
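The kernel structure exploited here can be made concrete. The following sketch (with an arbitrary removal pattern and signal length, not the paper's setup) builds an inpainting operator as a row-selection matrix and confirms that its kernel is spanned by the standard basis vectors at the removed indices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                                  # illustrative length
keep = np.sort(rng.choice(N, size=12, replace=False))   # observed indices
K = np.eye(N)[keep]      # inpainting operator: selects the kept entries

removed = np.setdiff1d(np.arange(N), keep)
# K e_i = 0 exactly for every removed index i, so the kernel of K is
# spanned by the standard basis vectors at the removed positions.
kernel_ok = all(np.allclose(K @ np.eye(N)[i], 0.0) for i in removed)
```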
Further, we consider recovering a signal from its cumulative sum. Since this matrix is invertible, there is a unique solution and, following theorem 3.5, we should observe convergence to this solution in the limit δ → 0. This toy problem is therefore well suited to study whether the given φ-critical points actually converge to the unique solution.
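A minimal sketch of the cumulative-sum forward operator (the size is illustrative): it is a unit lower-triangular matrix of ones, hence invertible, and the unique solution is recovered by differencing:

```python
import numpy as np

N = 8
K = np.tril(np.ones((N, N)))   # cumulative sum: (K x)_i = x_1 + ... + x_i

x = np.arange(1.0, N + 1)      # toy signal
y = K @ x                      # exact data
# K is unit lower-triangular, hence invertible with det(K) = 1, so the
# inverse problem has a unique solution, recovered by differencing:
x_rec = np.diff(y, prepend=0.0)
```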
For both problems we consider as the signal to recover the discretization of the function on using N = 512 equidistant sample points. We denote this signal by and the true underlying data by where K is the forward operator of the corresponding problem.
For each problem we consider the similarity measure given by and we construct a regularizer by . Here, we define for . The function is constructed in such a way that it is non-convex but relatively sub-differentiable, see remark 2.5. Figure 5 shows the function with parameters ρ = 2 and for where the y-axis is plotted on a logarithmic scale in order to emphasize the non-convexity. We can see that this function has a global minimum at around , a local minimum close to and another critical point in the interval . The parameters ρ = 2 and are used for all the following simulations.
As a separable sum of relatively sub-differentiable and non-convex terms the regularizer as defined above is relatively sub-differentiable and non-convex. By definition of it is further coercive and hence the functional is coercive. This shows that condition 3.1 is satisfied and we consider the stability and convergence of the φ-critical points according to theorems 3.3 and 3.5.
To simulate noisy data we consider the data where , ξ is a normally distributed random variable and for .
Since a bound φ for can be chosen such that (see remark 2.5), we can simply search for a classical critical point of in order to obtain φ-critical points. To achieve this, we apply Newton's method, e.g. [8], and we find an initial guess for Newton's method by applying Nesterov accelerated gradient descent [17] to the starting point . In the following we denote by a critical point of and by xα a critical point of . Here, xα is considered as the limit point for the stability considerations, for which we consider the choices . In order to test for convergence we chose for . For the convergence simulations we consider as the limit point the signal for the cumulative sum problem, as in this case the solution is unique, and we construct an approximate solution for the inpainting problem by finding a critical point of the function for ; we denote this solution by . Implementation details and code are publicly available 1 .
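The two-stage procedure (accelerated gradient descent for a rough critical point, then Newton's method on the gradient to polish it) can be sketched on a smooth one-dimensional stand-in functional; the double well f(t) = (t² − 1)² below is illustration-only and not the paper's Tikhonov functional:

```python
import numpy as np

# Smooth stand-in functional with several critical points (not the
# paper's Tikhonov functional): f has minima at t = -1 and t = 1.
f = lambda t: (t**2 - 1.0)**2
fp = lambda t: 4.0 * t * (t**2 - 1.0)     # f'
fpp = lambda t: 12.0 * t**2 - 4.0         # f''

def nesterov(t0, step=0.01, iters=200):
    """Nesterov accelerated gradient descent for a rough critical point."""
    t, t_prev = t0, t0
    for k in range(1, iters + 1):
        v = t + (k - 1) / (k + 2) * (t - t_prev)   # momentum extrapolation
        t_prev, t = t, v - step * fp(v)            # gradient step at v
    return t

def newton_polish(t, iters=20):
    """Newton's method applied to f' = 0 refines the critical point."""
    for _ in range(iters):
        t = t - fp(t) / fpp(t)
    return t

t = newton_polish(nesterov(2.0))   # lands at the nearby minimizer t = 1
```

As in the simulations, which critical point is found depends on the starting point of the first stage.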
5.1. Results
Figure 3 depicts the value for different values of α > 0 (left), (middle) and (right). Each of these values is plotted against δ on a log-log scale. The plot on the left shows that for any chosen α we can observe convergence of the sequence to the critical point xα as the noise level tends to 0. The plots in the middle and on the right show the convergence behaviour for the different choices specified above. All of the sequences can be observed to converge, i.e. and as the noise level δ tends to 0.
Figure 4 shows the same stability and convergence behaviour for the inpainting problem as figure 3 in the limit δ → 0. In particular, convergence to a solution of the problem can be observed.
A closer look at the inpainting problem reveals that the limit point is, however, not an -minimizing solution. This can easily be checked due to the separability of the regularizer and the simple representation of the kernel of the inpainting problem. The orange dot in figure 5(left) is the -value of where i is chosen as an index in the kernel of the inpainting matrix K, i.e. such that where ei is the ith standard basis vector. Due to the separability of the regularizer we clearly see that is not an -minimizing solution; this arises due to the non-convexity of the regularizer.
Moreover, we have observed that if we initialize the values in the kernel close to −1 or 2, then the limit will have entries at these φ-critical points of . This shows that in such cases the solution obtained in the limit heavily depends on the chosen initialization and that, depending on this initialization, the recovered solution may not be an -minimizing solution and may potentially even be a local maximum or a saddle point.
Finally, figure 5(right) shows the values where is a basis of the kernel of K. Up to numerical accuracy we see that for each such index i, which shows that as in lemma 3.9.
6. Conclusion and outlook
We have introduced and studied the concept of critical point regularization which, as opposed to classical variational regularization, considers (φ-)critical points of Tikhonov functionals as regularized solutions. The advantage of this approach is that it completely discards the strong and typically unrealistic assumption of being able to find global minimizers of these functionals. Our theory shows that, under reasonable assumptions on the involved functionals, the resulting method is nevertheless a stable and convergent regularization method. Further, we have shown that the solutions in the limit δ → 0 satisfy some form of first-order optimality conditions of the constrained optimization problem subject to the constraint . Besides this, the theory presented here extends the theory of convex functionals by showing that at no point does one require global minimizers, but only points which are close to a global minimum in some sense. For practical applications this means that minimization algorithms do not need to be run until convergence but may be stopped early if easily verifiable conditions are met. Additionally, under suitable assumptions on the regularizer, this theory is directly applicable to regularized solutions which are classical critical points of the involved functionals. As such, our theory gives stability and convergence results for critical points of potentially non-convex functionals.
Finally, we have provided numerical simulations which support our theoretical findings, i.e. the stability and convergence of critical point regularization. Depending on the algorithm used for obtaining critical points, these numerical examples show that one cannot expect to find global or even local minima, which further supports the need for a theory based on (φ-)critical points as developed in this paper.
As the main concern of this paper was to introduce the concept of using (φ-)critical points as regularized solutions, we have not derived any stability or convergence rates; studying such rates is subject to future work. Besides this, deriving conditions under which learned regularizers, e.g. [5, 16, 18], give rise to relatively sub-differentiable functions is also a subject of future work.
Data availability statement
No new data were created or analysed in this study.