Approximal operator with application to audio inpainting

In their recent evaluation of time-frequency representations and structured sparsity approaches to audio inpainting, Lieb and Stark (2018) used a particular mapping as a proximal operator. This operator serves as the fundamental part of an iterative numerical solver. However, their mapping is not properly justified. The present article proves that their mapping is indeed a proximal operator, and also derives its proper counterpart. Furthermore, it is argued that Lieb and Stark's operator can be understood as an approximation of the proper mapping. Surprisingly, in most cases, such an approximation (referred to as the approximal operator) is shown to provide even better numerical results in audio inpainting than its proper counterpart, while being computationally much more efficient.


Introduction
In signal and image processing, the so-called proximal splitting algorithms represent an effective way of finding numerical solutions to various problems [1]. However, despite their conceptual simplicity, proximal algorithms are not always computationally tractable. For instance, in the area of audio signal restoration, it is often necessary to handle the proximal operator of a composition of a linear (or affine) transform and a functional. This happens in many variations of audio inpainting tasks [2,3,4] (further discussed in Sec. 3), audio declipping [5,6,7,8], and audio dequantization [9,10,11].
To introduce the concept more specifically, let L : W → V be a linear mapping between two vector spaces, and let g : V → R be a convex lower semi-continuous functional. We are interested in computing the proximal operator prox_f, where f = g ∘ L (the symbol ∘ shall denote the composition of two functions), i.e. the mapping L is applied first, followed by g.
In some cases, explicit formulas for prox_(g∘L) are available. For instance, [1] provides an explicit formula for the case of a real-valued operator L between finite-dimensional spaces such that LL⊤ = αId. A similar result is presented in [12], where L is assumed to be complex-valued and to satisfy the condition that LL* is diagonal (the asterisk denotes the adjoint operator). However, it is limited to the case when prox_g is the operator of projecting onto a box-type set.
In the present contribution, we provide an analysis of the case when L⊤L = αId, which is closely related to [1] and is motivated by a recent signal processing application [4]. In Sec. 2, we derive a formula for prox_(g∘L) in such a scenario. Furthermore, an approximation of the derived proximal operator is introduced and analyzed, provided that an explicit formula for prox_αg exists. Sec. 3 demonstrates its usefulness in sparsity-based audio inpainting, where the error arising from computing the proximal step of an iterative algorithm only approximately is evaluated.
Throughout the whole article, the symbol ‖·‖ may denote a general norm on a Hilbert space V, the common ℓ2 norm on a finite-dimensional real or complex vector space, or the induced operator norm. The particular case should be clear from the context. The inner product inducing ‖·‖ will be denoted ⟨·, ·⟩. Should any other (pseudo)norm appear, it will be identified using the lower index notation, e.g. ‖·‖₀, ‖·‖₁.
2. Proximal operator of a composition of a proper convex function and an affine mapping and its approximation

Theoretical proposition
It has already been stated that the novel proposition of the paper is related to the known formula for prox_(g∘L) in the case of semi-orthogonal L, i.e. LL⊤ = αId. To build upon this relation, we start by quoting the corresponding lemma.
Lemma 1 (the original lemma from [13, p. 140]). Let g : R^m → R ∪ {∞} be a proper convex function, and let f(x) = g(Lx + b), where b ∈ R^m and L : V → R^m is a linear transformation satisfying LL⊤ = αId for some constant α > 0. Then for any x ∈ V,

    prox_f(x) = x + α⁻¹ L⊤ ( prox_αg(Lx + b) − Lx − b ).    (1)

Note that when prox_αg is known explicitly, Eq. (1) provides an explicit form of prox_f. As an example, take f = ‖L·‖₁ = ‖·‖₁ ∘ L. Then g is the ℓ1 norm and the corresponding prox_αg is the soft thresholding operator with the threshold α. Lemma 1 may also be used when g is the indicator function of a closed convex set C ⊂ R^m,

    ι_C(x) = 0 for x ∈ C,    ι_C(x) = +∞ for x ∉ C.    (2)

In such a case, the proximal operator of αg is the operator of projecting onto C, denoted prox_αg(x) = proj_C(x). Additionally, frame theory provides an example of L that fits the lemma: it may be the synthesis operator of a tight frame [14,15,16,17,18]. The question is: if we want to employ the analysis operator instead, how will formula (1) be affected? The following lemma answers this question.
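As a quick numerical sanity check, Eq. (1) can be verified with a small tight-row operator and g = ℓ1 (so that prox_αg is soft thresholding with threshold α). The following Python sketch is illustrative only; the operator L, the dimensions and the constant α are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 8, 5, 2.0

# Build L with L L^T = alpha * Id: scaled rows of an orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
L = np.sqrt(alpha) * Q[:m, :]                      # shape (m, n)

def soft(z, t):
    """Soft thresholding = prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_f(x):
    """Eq. (1) with g = ||.||_1 and b = 0."""
    Lx = L @ x
    return x + (L.T @ (soft(Lx, alpha) - Lx)) / alpha

def F(u, x):
    """The prox objective: g(Lu) + 0.5 * ||u - x||^2."""
    return np.sum(np.abs(L @ u)) + 0.5 * np.sum((u - x) ** 2)

x = rng.standard_normal(n)
u = prox_f(x)
```

Since F(·, x) is convex, u is the proximal point exactly when no perturbation decreases F; checking F(u, x) ≤ F(u + εd, x) over random directions d gives a cheap certificate of correctness.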
Lemma 2 (the proposed lemma). Let g : R^m → R ∪ {∞} be a proper convex function, and let f(x) = g(Ax + b), where b ∈ R^m and A : V → R^m is a linear transformation satisfying A⊤A = αId for some constant α > 0. Then for any x ∈ V,

    prox_f(x) = α⁻¹ A⊤ ( prox_(αg+ι_(R(A)+b))(Ax + b) − b ),    (3)

where ι_(R(A)+b) is the indicator function of the affine space which we obtain by shifting the range space of A by the vector b.
Although the proof is given in Appendix A including the vector b, we will for simplicity further assume b = 0.
From the viewpoint of frame theory, the operator A in Lemma 2 is the analysis operator of a tight frame. Suppose that V = R^n, i.e. A : R^n → R^m. It is a straightforward consequence of the assumption A⊤A = αId that in such a case, it must hold n ≤ m, due to the property

    n = rank(A⊤A) = rank(A) ≤ min(m, n).    (4)

Observe that the crucial difference between the two lemmas is that in the latter case, the image of the proximal operator of αg is forced to lie in the subspace R(A). Omitting this condition, Eq. (3) would be obtained simply by using Eq. (1) and plugging in the property A⊤A = αId. Such a step is correct when A is surjective, in which case m = n (as a consequence of Eq. (4)) and the indicator function ι_R(A) equals zero on the whole R^n, and therefore prox_(αg+ι_R(A)) = prox_αg. Indeed, when A is full-rank with m = n, it satisfies both A⊤A = αId and AA⊤ = αId (i.e. α^(−1/2)A is unitary), and by taking L = A, Lemmas 1 and 2 coincide.
Finally, observe that Lemma 1 provides a constructive way to compute prox_(g∘L) when prox_αg is known explicitly. In contrast, Lemma 2 is not constructive in the case of n < m, since an explicit form of prox_αg does not guarantee the existence of an explicit form of prox_(αg+ι_R(A)).

Explicit approximation
The strength of Lemma 1 is that it offers an explicit form of prox_f when the explicit form of prox_αg is available. This is possible, for example, in the aforementioned cases of the (weighted) ℓ1 norm or the indicator function in the role of g. However, a proximal operator including an additional restriction, which is the case of prox_(αg+ι_R(A)) in Lemma 2, is seldom known explicitly, resulting in the need for a reasonable approximation of prox_(αg+ι_R(A)).
Since the orthogonal projection onto R(A) in the case of A⊤A = αId is expressed easily as [14]

    proj_R(A)(z) = α⁻¹ A A⊤ z,    (5)

a natural possibility is to study the approximation

    prox_(αg+ι_R(A)) ≈ proj_R(A) ∘ prox_αg,    (6)

i.e. the composition of two known operators. Such an approximation will thus take the form

    approx_f(x) = α⁻¹ A⊤ proj_R(A)( prox_αg(Ax) ),    (7)

where we introduce the notation 'approx' for the approximal operator.
The most important property of approx_f is pointed out in the following lemma.

Lemma 3. Under the conditions of Lemma 2 (with b = 0), the approximal operator is

    approx_f(x) = α⁻¹ A⊤ prox_αg(Ax),    (8)
and it is a proximal operator of some convex lower semi-continuous function.
Note that the shortened form of approx_f in Eq. (8) was obtained by plugging (5) into (7) and using the above-mentioned property A⊤A = αId.
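The identity A⊤ proj_R(A) = A⊤ behind this simplification is easy to confirm numerically. In the following sketch (illustrative dimensions and α, not taken from the paper), the long form combining Eqs. (5) and (7) agrees with the short form (8) to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 5, 9, 3.0

# Analysis-type operator with A^T A = alpha * Id (scaled orthonormal columns).
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = np.sqrt(alpha) * Q[:, :n]                      # shape (m, n)

def soft(z, t):
    """Soft thresholding = prox of t * ||.||_1 (example choice of g)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

proj_range = (A @ A.T) / alpha        # Eq. (5): orthogonal projector onto R(A)
x = rng.standard_normal(n)

long_form = A.T @ (proj_range @ soft(A @ x, alpha)) / alpha   # Eqs. (5)+(7)
short_form = A.T @ soft(A @ x, alpha) / alpha                 # Eq. (8)
```

The agreement reflects A⊤(AA⊤)/α = (A⊤A)A⊤/α = A⊤, so the explicit projection step can be dropped.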
The lemma is proved in Appendix A by verifying two properties of proximal operators, one of which is the non-expansivity of approx_f. As mentioned in [19, p. 7], non-expansivity plays a role in the convergence analysis of iterative proximal algorithms. However, it should be emphasized that the lemma is not constructive, and it remains unclear which (convex lower semi-continuous) function in particular this proximal operator corresponds to.
In the following section, an example from the field of audio processing demonstrates the suitability of approx_f as an approximation of prox_f.

Sparsity-based audio inpainting
Audio inpainting is a rather modern term for the task of filling highly degraded or missing samples of digital audio [3]. The same task is referred to as the interpolation of missing samples [2,20] or packet-loss concealment [21,22].
Popular audio inpainting methods are based on the assumption that musical audio signals are sparse with respect to a suitable time-frequency (TF) transform. To proceed with the formalization of this assumption, denote y ∈ C^n the observed (i.e., degraded) signal and let M : C^n → C^n be the operator which fills the missing information in the signal with zeros. Denote Γ the set of signals consistent with the observed signal y,

    Γ = { s ∈ C^n | Ms = My }.    (9)

The inpainting task is then formulated as the following optimization task:

    Find the signal from Γ with the corresponding TF coefficients as sparse as possible.    (10)
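In implementations, both M and the projection onto Γ reduce to simple index masking. A minimal sketch follows; the mask, signal length and sample values are hypothetical stand-ins for real audio:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
reliable = rng.random(n) > 0.8          # True where samples are observed
y = rng.standard_normal(n) * reliable   # degraded signal: zeros at the gaps

def M(s):
    """Masking operator: zero out the missing samples."""
    return s * reliable

def proj_Gamma(s):
    """Projection onto Gamma = {s : M s = M y}:
    replace the reliable samples with the observed ones, keep the rest."""
    return np.where(reliable, y, s)

s = rng.standard_normal(n)
p = proj_Gamma(s)
```

By construction, p agrees with y on the reliable positions (so Mp = My) and with s everywhere else, which is exactly the nearest point of Γ in the ℓ2 sense.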
Let us recall two operators: let A : C^n → C^m, m ≥ n, be the analysis operator, expanding the time-domain signal into the vector of TF coefficients, and let A* : C^m → C^n be its synthesis counterpart, producing a signal given the coefficients. With this notation, (10) can be understood in two ways:

    arg min_c ‖c‖₀  s.t.  A*c ∈ Γ,    (11a)
    arg min_s ‖As‖₀  s.t.  s ∈ Γ.    (11b)

The symbol ‖·‖₀ denotes sparsity, i.e. the pseudonorm that counts the non-zero entries of its argument. Since Eq. (11a) includes the synthesis operator, it is referred to as the synthesis formulation; by the same reasoning, Eq. (11b) is called the analysis formulation [23]. Note that the two formulations are equivalent only when the operator A is unitary, i.e. when A* = A⁻¹. However, this is not the case for the commonly used Gabor, wavelet or ERBlet transforms [15,18].

Convex relaxation
Formulations (11) are problematic in that they include the ℓ0 pseudonorm, which makes the task NP-hard. Two classes of methods exist to solve (11), in general only approximately. Either a non-convex heuristic algorithm is employed to tackle (11) directly, e.g. OMP [3] or SPAIN [24], or the task is relaxed to a convex optimization problem [25,26]. Denoting S : C^m → R a convex sparsity-related penalty, the convex relaxations of (11a) and (11b) attain the form

    arg min_c S(c) + ι_Γ(A*c),    (12a)
    arg min_s S(As) + ι_Γ(s).    (12b)

The constraints from (11) are now included in the objective function using the indicator function ι_Γ. Note that Γ is a convex set by design. Thus, in order to obtain an overall convex problem, S has to be a convex function, and it shall promote sparsity. As an example of a suitable and widely used penalty S, we mention the (weighted) ℓ1 norm.

Inconsistent reformulation
In (12), the solution is forced to equal the observed signal in its reliable (non-distorted) parts. However, this assumption may be too strong, for instance when the observed signal y is noisy. In such a case, the alternative reformulations to solve are the so-called inconsistent problems

    arg min_c λS(c) + ½ ‖M(A*c) − My‖²,    (13a)
    arg min_s λS(As) + ½ ‖Ms − My‖²,    (13b)

where the parameter λ > 0 balances the consistency with the data and the sparsity.

Solving the task
Formulations (12) and (13) consist of sums of lower semi-continuous convex functions. Such a scenario allows using proximal algorithms to solve the tasks numerically [1]. To find the minimum of the sum of lower semi-continuous convex functions f1 and f2 using the proximal splitting approach, one must be able to evaluate both prox_f1 and prox_f2, or, in the case of a differentiable f1 or f2, the corresponding gradient. These are evaluated in every iteration, meaning that simple, computationally cheap formulas are preferred.
To choose suitable algorithms for solving (12) and (13), assume that the explicit form of prox_S is available. Note that in practice, this assumption is not too restrictive. Sometimes the situation is even the opposite, meaning that a suitable operator is used in place of prox_S although an explicit form of S is not available, for instance in the case of the (persistent) empirical Wiener and similar operators [27,19]. Moreover, assume that a tight frame is used as the TF transform, i.e. A*A = αId. To solve the synthesis model (12a), Lemma 1 is used (putting L = A*) to compute the prox of the second term, which enables the use of the Douglas-Rachford (DR) algorithm [1, Sec. IV]. The reason is that for any time-domain signal s ∈ C^n, the projection proj_Γ(s) is computed simply entry-by-entry, by setting the samples at the reliable positions equal to the observed ones and keeping the rest unchanged.
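A compact sketch of this setup: a random real tight frame stands in for the complex Gabor transform, the prox of S = ℓ1 is soft thresholding, and the prox of ι_Γ(A*·) follows Lemma 1 with L = A*. The dimensions, the DR step size γ and the iteration count are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 16, 24, 1.5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = np.sqrt(alpha) * Q[:, :n]      # analysis operator, A^T A = alpha * Id
syn = A.T                          # synthesis (real stand-in for A*)

reliable = rng.random(n) > 0.5
y = rng.standard_normal(n) * reliable

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proj_Gamma(s):
    return np.where(reliable, y, s)

def prox_consistency(c):
    """Lemma 1 with L = A*: prox of iota_Gamma(A* c)."""
    s = syn @ c
    return c + (A @ (proj_Gamma(s) - s)) / alpha

gamma = 0.1                        # hypothetical DR step size
z = np.zeros(m)
for _ in range(300):
    c = prox_consistency(z)        # prox of the indicator term
    z = z + soft(2 * c - z, gamma) - c   # DR update with prox of the l1 term

restored = syn @ prox_consistency(z)
```

Note that the final synthesis of prox_consistency(z) equals proj_Γ(A*z), so the restored signal matches the observations on the reliable samples by construction, regardless of how far the iteration has converged.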
It should be pointed out that Lemma 1 is used here to handle a complex-valued operator. This is treated in Appendix B, where one can find the argumentation why using complex variables in place of the real variables does not affect the proposed theoretical results of both lemmas.
To solve the analysis model (12b), prox_(S∘A) has to be known in order to be used in the DR algorithm; formally, it is constructed using Lemma 2 as

    prox_(S∘A)(x) = α⁻¹ A* prox_(αS+ι_R(A))(Ax).    (15)

The numerical treatment of (15), however, requires choosing one of the following three options:
• prox_(αS+ι_R(A)) is only approximated, using the operator approx_(S∘A) defined by Eq. (8);
• prox_(S∘A)(x) = arg min_u S(Au) + ½‖u − x‖² is computed using a nested iterative subroutine, since it is by definition nothing but another optimization task;
• finally, the Chambolle-Pock (CP) algorithm [28] can be used instead of the DR algorithm to solve (12b) without nested iterations, since it allows composing a convex functional with a linear mapping.
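The third option can be sketched as follows. CP solves min_s f(As) + g(s) with f = S = ℓ1 (whose convex conjugate has a clipping prox) and g = ι_Γ; convergence requires στ‖A‖² ≤ 1 with ‖A‖² = α. All sizes and iteration counts below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, alpha = 16, 24, 2.0
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = np.sqrt(alpha) * Q[:, :n]      # analysis operator, A^T A = alpha * Id

reliable = rng.random(n) > 0.5
y = rng.standard_normal(n) * reliable

def proj_Gamma(s):
    return np.where(reliable, y, s)

sigma = tau = 1.0 / np.sqrt(alpha)   # sigma * tau * ||A||^2 = 1
s = np.zeros(n)
s_bar = s.copy()
v = np.zeros(m)                      # dual variable (TF-coefficient domain)
for _ in range(500):
    # prox of sigma * S^*: projection onto the l-infinity unit ball
    v = np.clip(v + sigma * (A @ s_bar), -1.0, 1.0)
    # prox of tau * iota_Gamma: projection onto Gamma
    s_new = proj_Gamma(s - tau * (A.T @ v))
    s_bar = 2 * s_new - s            # over-relaxation with theta = 1
    s = s_new
```

In contrast to the DR variant, no prox of a composed function is ever needed; the linear operator A only enters through forward and adjoint applications.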
In analogy to the above, solving (13a) is not a problem, but for (13b), the approximation of prox_(S∘A) is needed. For both inconsistent cases, a suitable iterative algorithm to produce the numerical result is ISTA or FISTA (the fast iterative shrinkage-thresholding algorithm) [29].
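For the inconsistent synthesis problem (13a), FISTA only needs the gradient of the smooth data term and the prox of λS. A hedged sketch with a toy real tight frame (dimensions, λ and iteration count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, alpha = 16, 24, 2.0
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = np.sqrt(alpha) * Q[:, :n]      # analysis operator, A^T A = alpha * Id
reliable = rng.random(n) > 0.5
Mask = reliable.astype(float)
y = rng.standard_normal(n) * Mask
lam = 0.05

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def grad(c):
    """Gradient of 0.5 * ||M(A*c) - My||^2 w.r.t. the coefficients c."""
    return A @ (Mask * (A.T @ c - y))

def obj(c):
    return 0.5 * np.sum((Mask * (A.T @ c - y)) ** 2) + lam * np.sum(np.abs(c))

step = 1.0 / alpha                 # 1 / Lipschitz constant of grad
c = np.zeros(m)
c_prev = c.copy()
t = 1.0
for _ in range(300):
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    w = c + ((t - 1.0) / t_next) * (c - c_prev)        # momentum step
    c_prev, c, t = c, soft(w - step * grad(w), lam * step), t_next
```

The step size 1/α is valid because ‖A diag(M) A*‖ ≤ ‖A‖² = α bounds the Lipschitz constant of the gradient.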

Evaluation setting
Our experiments basically follow and extend the experiment from [4, Sec. 3]. The motivation is that [4] applies Lemma 1 instead of Lemma 2 to compute prox_(S∘A). As a result, [4] actually (unintentionally) uses approx_(S∘A). This is clear from plugging A in place of L in Eq. (1), which (recalling A⊤A = αId) directly produces approx_(S∘A). The MATLAB codes and data from [4] are available at https://github.com/flieb/AudioInpainting. We recomputed their results using the shared code, and on top of that, we aimed at answering the question: does the more accurate (but more complicated) use of prox_(S∘A) produce better results in the audio inpainting task, compared to the use of approx_(S∘A)?
Hence, our contribution concerns the analysis model. There, we decided to use the CP algorithm instead of the DR algorithm in the consistent case, while in the inconsistent case, the CP algorithm approximates prox_(S∘A) in each iteration of FISTA.
Exactly as in [4], the algorithms are tested on four musical signals from [30], sampled at 44.1 kHz, that are distorted by dropping out 80 % of the samples at random positions. As the convex sparsity-related penalty, the ℓ1 norm is used, i.e. S(·) = ‖·‖₁. Three different transforms are used in the role of A:
• Gabor transform (DGT),
• ERBlet transform (ERB) with qvar = 0.08 and bins = 18,
• wavelet transform (WAV) with fmin = 100 Hz, bw = 3 Hz and bins = 120.
The reconstruction performance is evaluated using the SNR, computed in agreement with [4] as

    SNR(s, ŝ) = 20 log₁₀ ( σ(s) / σ(s − ŝ) ),    (16)

where s is the original (undistorted) signal, ŝ is the reconstructed signal and σ denotes the standard deviation. It is worth noting that the reliable parts of s and ŝ are not taken into account in Eq. (16).

Table 1 is a slightly reordered reproduction of the table provided in [4, Sec. 4], based on the four test signals, the three TF dictionaries, and both the consistent and the inconsistent approach, corresponding to problems (12) and (13). As explained above, the results based on the synthesis model correspond to (12a) and (13a), whereas the analysis-based results only approximate the solutions to Eqs. (12b) and (13b), since the operator approx_(S∘A) is used.

Since our modifications stem from the use of approx_(S∘A) in the analysis-based model, Table 2 presents only the results of this model. Also, for better readability, only the differences between the values of SNR are shown.

Table 2 shows two remarkable results. On the right side, the SNR for FISTA is almost independent of whether prox_(S∘A) is computed accurately or not. The other note concerns the left side, showing the results of the DR and CP algorithms, representing the use of approx_(S∘A) and prox_(S∘A), respectively. Here, the more accurate approach with prox_(S∘A) outperforms the approximate approach with approx_(S∘A) only with the Gabor dictionary, whereas for wavelets and ERBlets it performs slightly worse in most cases.
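For reference, Eq. (16) in code. Note that, unlike the evaluation above, this minimal version does not restrict the computation to the missing parts of the signals; the test signals are synthetic placeholders.

```python
import numpy as np

def snr(s, s_hat):
    """Eq. (16): SNR in dB via standard deviations of signal and error."""
    return 20.0 * np.log10(np.std(s) / np.std(s - s_hat))

rng = np.random.default_rng(6)
s = np.sin(np.linspace(0.0, 10.0, 1000))        # placeholder "clean" signal
slightly_off = s + 0.01 * rng.standard_normal(s.size)
badly_off = s + 0.10 * rng.standard_normal(s.size)
```

To mimic Eq. (16) as used here, one would pass only the samples at the missing positions, e.g. snr(s[gaps], s_hat[gaps]) for a boolean gap mask.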

Results and discussion
Recall that the inconsistent analysis model employs a nested iterative CP algorithm within FISTA, which results in the computational cost being remarkably higher compared to the synthesis model.
Furthermore, we examine the performance of the DR (original) and the CP (new) algorithm in the analysis model while choosing stricter convergence criteria than in the previously described experiment. In the implementation of [4], the parameters indicating convergence are the maximum number of iterations (param.maxit) and the relative norm of the solutions in subsequent iterations (param.tol). The algorithm stops when either of the criteria is reached. The settings of these parameters in the experiments are as follows:

    criterion      setting for Tables 1 and 2    setting for Tables 3 and 4
    param.maxit    200                           500
    param.tol      10⁻³                          10⁻⁵

In Table 3, the results of the synthesis-based model using the DR algorithm are also recomputed with the new choice of parameters. Note that the inconsistent approach is omitted, since even with the less strict convergence criteria, the results of the two approaches are almost indistinguishable. For the purpose of direct comparison, Table 4 presents the differences in SNR shown in Table 3. As mentioned above, positive values indicate a better performance of the implementation based on prox_(S∘A).

It can be seen from Table 4 that letting the algorithms converge closer to the actual solution of the corresponding optimization tasks reduces the difference between the more and the less accurate approach. Nonetheless, it can still be seen that not only the results in general, but also the inaccuracy of the DR algorithm in the analysis-based approach, depend on the choice of the (redundant) time-frequency representation of the audio signal.

Software and reproducible research
All the experimental data were generated using MATLAB R2017b, with the LTFAT toolbox [31].

Conclusions
The article presents the proximal operator of a composition of a convex function with a linear mapping A that satisfies A⊤A = αId, which is the case of A being the (redundant) analysis operator of a tight frame. So far, a compact formula for such an operator has been known only for the composition with A⊤ (in the current notation). The theoretical derivation does not yield a truly effective way to apply the proximal operator in practice. Nevertheless, it is shown that using the fast explicit approximation from [4], the approximal operator, is not only meaningful, but also reasonably close to the proper solution obtained with the exact proximal operator. The practical experiment was limited to the audio inpainting scenarios originally performed in [4], yet the theoretical result has a straightforward application to the problems of audio declipping or dequantization as well, or even to related problems in image and video processing.
Since A⊤A = αId, applying the synthesis A⊤ to both sides of Eq. (A.7) leads to

    A⊤(z̃ − b) = α ũ.    (A.8)

The important observation here is that (A.8) is equivalent to (A.7) only when z̃ − b ∈ R(A), which is enforced by the left-hand side of the relation (A.7). Substituting this result into (A.6) leads to an explicit expression for ũ in terms of z̃ (still limited to z̃ − b ∈ R(A)):

    ũ = α⁻¹ A⊤(z̃ − b).

Plugging this result into the minimization problem (A.2), we obtain that z̃ is given by

    z̃ = arg min_z  g(z) + ½ ‖α⁻¹A⊤(z − b) − x‖² + ι_(R(A)+b)(z),    (A.10)

where the relation z̃ − b ∈ R(A) is enforced by the indicator function ι_(R(A)+b). Now recall two useful properties:
1. Viewing A from the perspective of frame theory, it is an analysis operator corresponding to a tight frame, for which it holds ‖Az‖² = α‖z‖² for all z ∈ V.
2. Since we require z̃ − b ∈ R(A), there exists w ∈ V such that z̃ − b = Aw, and consequently α⁻¹A⊤(z̃ − b) = α⁻¹A⊤Aw = w.

Using the first property, we can rewrite (A.10) as

    z̃ = arg min_z  g(z) + (2α)⁻¹ ‖z − (Ax + b)‖² + ι_(R(A)+b)(z) = prox_(αg+ι_(R(A)+b))(Ax + b).

Proof of Lemma 3. To prove that a mapping F : V → V is a proximal operator of a convex lower semi-continuous function, it is sufficient to show two properties [32, Corollary 10.c]:
1. there exists a convex lower semi-continuous function ψ such that F(y) ∈ ∂ψ(y) for any y ∈ V,
2. F is non-expansive, i.e.

    ‖F(y) − F(y′)‖ ≤ ‖y − y′‖ for all y, y′ ∈ V.    (A.14)

The symbol ∂ψ(y) denotes the subdifferential of the function ψ at the point y. Observe that the first property is necessarily satisfied for F = prox_αg, since it is a proximal operator. This means that there exists a convex lower semi-continuous function η such that

    ∀x ∈ R^m:  prox_αg(x) ∈ ∂η(x).    (A.15)

If this relation holds for all x, it must also hold for vectors of the form x = Az, leading to

    ∀z ∈ V:  prox_αg(Az) ∈ ∂η(Az) = { s | η(y) ≥ η(Az) + ⟨s, y − Az⟩ ∀y ∈ R^m }.    (A.16)

Then also

    ∀z ∈ V:  approx_f(z) = α⁻¹A⊤ prox_αg(Az) ∈ α⁻¹A⊤ ∂η(Az) ⊆ ∂(α⁻¹ η ∘ A)(z).

Since η is a lower semi-continuous and convex function and A is a linear operator, η ∘ A and α⁻¹(η ∘ A) are also lower semi-continuous and convex. Property 1 is thus satisfied for ψ = α⁻¹(η ∘ A).

The non-expansivity can be shown similarly. Substituting (8) into (A.14) and using the fundamental property of the operator norm leads to

    ‖α⁻¹A⊤( prox_αg(Ay) − prox_αg(Ay′) )‖ ≤ ‖α⁻¹A⊤‖ · ‖prox_αg(Ay) − prox_αg(Ay′)‖.    (A.20)

First, let us compute the operator norm of α⁻¹A⊤. The property ‖Ax‖² = α‖x‖² for all x implies ‖Ax‖ = √α ‖x‖, i.e. ‖A‖ = ‖A⊤‖ = √α. Thus ‖α⁻¹A⊤‖ = 1/√α. Now, since prox_αg itself satisfies (A.14), it holds

    ‖prox_αg(Ay) − prox_αg(Ay′)‖ ≤ ‖Ay − Ay′‖ = ‖A(y − y′)‖ = √α ‖y − y′‖.

Combining the two estimates with (A.20) yields ‖approx_f(y) − approx_f(y′)‖ ≤ (1/√α) · √α ‖y − y′‖ = ‖y − y′‖, which completes the proof.
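The non-expansivity established above can also be probed numerically on a random tight frame (illustrative sizes; g = ℓ1, so that prox_αg is soft thresholding):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, alpha = 6, 10, 2.5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = np.sqrt(alpha) * Q[:, :n]                      # A^T A = alpha * Id

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def approx_f(x):
    """Eq. (8) with g = ||.||_1."""
    return A.T @ soft(A @ x, alpha) / alpha

# Sample pairs of points and record the observed Lipschitz ratios.
ratios = []
for _ in range(200):
    y1, y2 = rng.standard_normal(n), rng.standard_normal(n)
    ratios.append(np.linalg.norm(approx_f(y1) - approx_f(y2))
                  / np.linalg.norm(y1 - y2))
```

All observed ratios stay at or below 1, in agreement with Lemma 3 (equality can occur when both arguments fall into the same linear region of the soft thresholding).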

Appendix B. Comments on the use of complex-valued operators
In the experimental part, the complex-valued analysis operator A : C^n → C^m and its synthesis counterpart A* (instead of A⊤) are used. Nonetheless, both Lemma 1 and Lemma 2 remain valid in such a case. The only place in the proof that could cause trouble in the complex case is the formulation of the Lagrangian in Eq. (A.4), since it should be real-valued; the rest of the manipulations hold in the complex case as well.
To deal with the Lagrangian for complex variables, observe that the complex mapping u ↦ Au + c can be represented in real variables as Âû + ĉ, where Â : R^{2n} → R^{2m}, û ∈ R^{2n} and ĉ ∈ R^{2m} are obtained by stacking the real and imaginary parts. Rewriting x ∈ C^n in a similar way into a real vector x̂ ∈ R^{2n} leads to the Lagrangian

    L(û, ẑ, ŷ) = g(ẑ) + ½ ‖û − x̂‖² + ⟨ŷ, Âû + ĉ − ẑ⟩,    (B.3)

where ŷ ∈ R^{2m}. The unique minimizer of (B.3) with respect to û is

    û = x̂ − Â⊤ŷ.    (B.4)

It can be easily shown that the real and imaginary parts of the complex solution ũ = x − A*y obtained using Eq. (A.6) match this result in the real case.
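The real representation Â can be made concrete: for a complex matrix A, stacking real and imaginary parts gives a real block matrix whose transpose plays the role of A* under the real inner product. The matrix and sizes below are arbitrary illustrations, not the Gabor operator from the experiments.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 4, 7
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

def stack(v):
    """Represent a complex vector as the real vector [Re(v); Im(v)]."""
    return np.concatenate([v.real, v.imag])

# Real block matrix representing u -> A u on stacked vectors.
A_hat = np.block([[A.real, -A.imag],
                  [A.imag,  A.real]])              # shape (2m, 2n)

u = rng.standard_normal(n) + 1j * rng.standard_normal(n)
w = rng.standard_normal(m) + 1j * rng.standard_normal(m)
```

Checking A_hat @ stack(u) against stack(A @ u), and A_hat.T @ stack(w) against stack(A.conj().T @ w), confirms that Â represents A and Â⊤ represents the adjoint A*.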
Note that in the case of the complex Gabor transform presented earlier, the mapping prox_f should map a real signal x to a real signal ũ. This is ensured by a particular complex-conjugate structure of A, A* and also y in such a case. For a more detailed description, see the discussion in [12, p. 9] and the references therein.
A similar argumentation applies to the proof of Lemma 3 and the concept of subdifferentials, where, once again, we would define a real-valued inner product for complex vectors.