MEGS: A Penalty for Mutually Exclusive Group Sparsity

Penalty functions or regularization terms that promote structured solutions to optimization problems are of great interest in many fields. We introduce MEGS, a nonconvex structured sparsity penalty that promotes mutual exclusivity between components in solutions to optimization problems. This enforces, or promotes, 1-sparsity within arbitrary overlapping groups in a vector. The mutual exclusivity structure is represented by a matrix S. We discuss the design of S from engineering principles and show example use cases, including the modeling of occlusions in 3D imaging and a total variation variant with uses in image restoration. We also demonstrate synergy between MEGS and other regularizers and propose an algorithm to efficiently solve problems regularized or constrained by MEGS.


I. INTRODUCTION
Sparsity-promoting regularization is an integral component of modern-day signal processing, optimization, and machine learning. The most prevalent method, the least absolute shrinkage and selection operator (LASSO) [1], uses the ℓ1 norm to promote solutions to optimization problems that have few nonzero coefficients. More recently, there has been significant interest in structured sparsity. For instance, group sparsity aims to set entire groups of components within a vector to zero and allow the components in other groups to take on any value [2]. This is often achieved by regularizing with the ℓ2,1 norm over groups, called group LASSO (G-LASSO) [3]. Conversely, there is interest in sparsity penalties that promote intra-group sparsity, that is, ensuring groups within a vector are themselves sparse. Such penalties have been used in areas such as deep learning [4], computer vision [5], [6], and medicine [7], [8]. Previous work in this area includes the convex elitist LASSO [9], [10], or exclusive LASSO (E-LASSO) [11], and the nonconvex sparsity within and across groups (SWAG) [12], plus extensions that allow for overlapping groups of components [13].
In this article, we introduce the MEGS penalty to promote sparsity structure that is specified as mutual exclusivity between pairs of entries, meaning pairs that should not simultaneously be nonzero. When used as a constraint or regularizer, the penalty does not induce the shrinkage bias that is typical of sparsity-promoting regularization. The most closely related penalty is the extended SWAG [13]. The similarities and differences among MEGS and other exclusive sparsity penalties are discussed in Section II-C; for a survey of other sparsity penalties we direct the reader to reviews of LASSO variants [14], sparsity for feature selection [15], and nonconvex sparsity penalties [16].
The most important feature of MEGS is its exclusivity structure matrix S, which defines which components in a solution vector are mutually exclusive with one another. In addition to introducing MEGS and exploring its properties, our aim in this article is to provide evidence of the effectiveness of MEGS when S is designed for particular use cases using intuitive engineering principles. The paper is structured as follows. In addition to introducing the MEGS penalty, Section II develops its properties, comparisons to other structured sparsity penalties, and algorithms for regularizing with MEGS. Section III presents some common designs for the exclusivity structure matrix S. Section IV demonstrates the utility of MEGS in a number of settings, including a novel total variation-style denoising method in one and two dimensions and the modeling of occlusions in 3D imaging problems. Section V concludes with a summary. Code for this article can be found at https://github.com/Goyal-STIR-Group/Mutually-Exclusive-Group-Sparsity-MEGS-.

II. THE MEGS PENALTY

A. INTRODUCTION TO MEGS
We define the MEGS penalty for x ∈ R^M as

MEGS(x) = h(x)^T S h(x),   (1)

where the function h : R^M → R^M and symmetric matrix S ∈ R^{M×M} satisfy the following conditions:

S_{i,j} = S_{j,i} ∈ [0, 1] and S_{i,i} = 0 for all i, j;   (2)
h(x) ≥ 0 elementwise for all x ∈ R^M.   (3)

Conditions (2) and (3) ensure that the penalty (1) is bounded from below. In particular, the nonlinear function h makes it unnecessary to constrain S to be positive semidefinite to have the desired boundedness from below. Except for a discussion of extensions in Appendix A and an example in Section IV-B, we assume h(x) is the elementwise absolute value |x|, which satisfies Condition (3). Thus, most of the following is written with

MEGS(x) = |x|^T S |x|.   (4)

The matrix S is key, as it encodes the sparsity structure. The entry S_{i,j} (which equals S_{j,i}) represents the strength of exclusivity between the pair of components x_i and x_j (hence the diagonal is always zero). When S_{i,j} = 0, there is no preference for exclusivity between the two components. When S_{i,j} > 0, it indicates that the two components should preferably not be nonzero simultaneously (with the strength of preference increasing with S_{i,j}). We bound S_{i,j} in the [0, 1] range for convenience and note that any scaling can be absorbed into a constraint or a multiplicative regularization parameter. Since the diagonal of S is zero, S will typically be indefinite, resulting in nonconvexity (see Appendix B). While a few classes of examples for S are given in Section III, to build intuition we include one simple, illustrative example here.
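As a preliminary illustration (ours, not part of the original presentation), evaluating (4) takes only a few lines of code; the sketch below assumes NumPy, and the name megs_penalty is our own.

```python
import numpy as np

def megs_penalty(x, S, h=np.abs):
    """Evaluate the MEGS penalty h(x)^T S h(x) of Eq. (1)/(4).

    S should be symmetric with zero diagonal and entries in [0, 1];
    h should map R^M to the nonnegative orthant (default: |x|).
    """
    hx = h(np.asarray(x, dtype=float))
    return hx @ S @ hx
```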
Example: Suppose we want the first and third components of x ∈ R^3 to be mutually exclusive and we have no other preferences. Hence, we select S_{1,3} = S_{3,1} = 1 and set the rest of S to zero:

S = [0 0 1
     0 0 0
     1 0 0].   (5)

For a vector x = [1 1 0]^T in the preferred form, the penalty is zero:

|x|^T S |x| = 2 S_{1,3} |x_1| |x_3| = 2 · 1 · 1 · 0 = 0.

However, for a vector x = [1 0 1]^T that breaks the mutual exclusivity rule, the penalty is strictly positive:

|x|^T S |x| = 2 S_{1,3} |x_1| |x_3| = 2 · 1 · 1 · 1 = 2.

Consider using the MEGS penalty to constrain a least-squares problem:

x̂ = arg min_x ‖y − x‖_2^2   subject to   |x|^T S |x| = 0.   (6)

The constraint restricts x to a union of subspaces:

W = {x ∈ R^3 : x_1 = 0} ∪ {x ∈ R^3 : x_3 = 0}.

The solution of (6) is the orthogonal projection of y onto W. In general, the lack of convexity of W could require an enumeration of the subspaces. Here, the closer of the two subspaces is determined by comparing |y_1| and |y_3|, so

x̂ = [0, y_2, y_3]^T if |y_1| ≤ |y_3|;   x̂ = [y_1, y_2, 0]^T otherwise.   (7)

The MEGS penalty could instead be used in a regularization term:

x̂ = arg min_x ‖y − x‖_2^2 + λ |x|^T S |x|   (8)

for some nonnegative λ. To simplify the following, let us assume |y_1| < |y_3|; modifications for the opposite case simply reverse the roles of y_1 and y_3. The solution to (8) is

x̂_1 = sign(y_1) (|y_1| − λ|y_3|) / (1 − λ^2),   x̂_3 = sign(y_3) (|y_3| − λ|y_1|) / (1 − λ^2),   x̂_2 = y_2,   for λ < |y_1|/|y_3|;
x̂ = [0, y_2, y_3]^T   for λ ≥ |y_1|/|y_3|.   (9)

As illustrated in Fig. 1, x̂ = y when λ = 0, and |x̂_1| shrinks as λ is increased, reaching zero at λ = |y_1|/|y_3| and staying there. In the meantime, |x̂_3| initially shrinks, but the shrinkage eventually reverses, so the selected components of the sparse solution have no shrinkage when λ ≥ |y_1|/|y_3|. In some settings this corresponds to a lack of bias. It contrasts with the behavior of regularization with λ‖x‖_1 (i.e., LASSO).
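To sanity-check the closed form in (9), one can compare it against a brute-force grid minimization of (8). This small script is our own (not from the paper) and assumes NumPy:

```python
import numpy as np

y, lam = np.array([0.4, 1.0, 0.9]), 0.3   # |y_1| < |y_3| and lam < |y_1|/|y_3|
S = np.zeros((3, 3)); S[0, 2] = S[2, 0] = 1.0

# Closed form from Eq. (9), interior case.
x_hat = y.copy()
x_hat[0] = np.sign(y[0]) * (abs(y[0]) - lam * abs(y[2])) / (1 - lam**2)
x_hat[2] = np.sign(y[2]) * (abs(y[2]) - lam * abs(y[0])) / (1 - lam**2)

# Brute-force minimization of Eq. (8) over a grid in (x_1, x_3); x_2 = y_2.
grid = np.linspace(-2, 2, 2001)
X1, X3 = np.meshgrid(grid, grid, indexing="ij")
cost = (y[0] - X1)**2 + (y[2] - X3)**2 + lam * 2 * np.abs(X1) * np.abs(X3)
i, j = np.unravel_index(np.argmin(cost), cost.shape)
print(x_hat[[0, 2]], "vs grid:", grid[i], grid[j])  # agree to grid precision
```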
The case of |y_1| = |y_3| yields a solution not useful from the perspective of sparsity or mutual exclusion: for λ ≤ 1, the minimizer is

x̂ = [y_1/(1+λ), y_2, y_3/(1+λ)]^T,

in which neither of the mutually exclusive components is set to zero. It demonstrates that for fixed λ, the solution to (8) can be discontinuous in y. Nevertheless, the solution remains stable in the sense of ‖x̂‖_2 ≤ ‖y‖_2.

B. PENALTY FOLLOWING LINEAR TRANSFORMATION
As with other sparsity penalties, it can be interesting to extend the use of MEGS from the canonical basis by first multiplying the vector of interest x by some matrix B:

MEGS(Bx) = |Bx|^T S |Bx|.   (10)

Possible prototypes for B include a finite difference matrix or a transform to a different basis; see Section IV. A non-square B can also be used to encode mutual exclusivity structure in lower- or higher-dimensional spaces.
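For instance (our own sketch), with B a first-difference matrix the penalty acts on the changes in x rather than on x itself:

```python
import numpy as np

M = 8
B = np.eye(M - 1, M, k=1) - np.eye(M - 1, M)   # first differences, (M-1) x M
x = np.array([0., 0., 2., 2., 2., 5., 5., 5.])
S = np.zeros((M - 1, M - 1))                   # exclusivity between difference pairs
S[1, 4] = S[4, 1] = 1.0                        # jumps at indexes 1 and 4 are exclusive
Bx = np.abs(B @ x)
print(Bx @ S @ Bx)                             # positive: both jumps are present
```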

C. RELATIONSHIPS WITH OTHER EXCLUSIVE SPARSITY PENALTIES
MEGS has many properties that are shared with some other structured sparsity penalties. Key properties are as follows:
• Encodes mutual exclusivity between components, with no restrictions on overlapping groups or regularization strength.
• Nonconvex formulation does not underestimate (shrink) or otherwise bias nonzero components.
• Choices of nonlinearity h(·) lead to different behaviors.
A discussion of similar penalties in the literature follows.

1) E-LASSO
The E-LASSO penalty is convex and promotes intra-group sparsity. It uses the squared ℓ1 norm to do so:

E-LASSO(x) = Σ_{k=1}^K ( Σ_{i ∈ G_k} |x_i| )^2,   (11)

where G_k ⊂ {1, . . . , n} is the kth of K groups of indexes of components of x ∈ R^n. We can relate MEGS to the E-LASSO penalty by relaxing the requirement of a zero diagonal in S. Consider S_η = 11^T − ηI, where 1 is a column vector of ones of appropriate dimension. The scalar parameter η ∈ [0, 1] then gives us E-LASSO (with one group) for η = 0 and an instance of MEGS that promotes 1-sparsity for η = 1 (see Section III-A):

|x|^T S_η |x| = ‖x‖_1^2 − η ‖x‖_2^2.   (12)

We notice the −‖x‖_2^2 term (present in MEGS) de-biases the penalty by ensuring it equals zero when x is 1-sparse.
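The identity in (12) follows directly from expanding the quadratic form; a short derivation (ours) in LaTeX:

```latex
\begin{align*}
|\mathbf{x}|^T \mathbf{S}_\eta |\mathbf{x}|
  &= |\mathbf{x}|^T \left(\mathbf{1}\mathbf{1}^T - \eta \mathbf{I}\right) |\mathbf{x}| \\
  &= \left(\mathbf{1}^T |\mathbf{x}|\right)^2 - \eta\, |\mathbf{x}|^T |\mathbf{x}|
   = \|\mathbf{x}\|_1^2 - \eta \|\mathbf{x}\|_2^2 .
\end{align*}
% For 1-sparse x with single nonzero x_i and eta = 1:
% ||x||_1^2 - ||x||_2^2 = x_i^2 - x_i^2 = 0, so the penalty is unbiased.
```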

2) SWAG
The extended SWAG [13] penalty is

SWAG(x) = ‖x‖_1 + (1/2) |x|^T W |x|,   (13)

where W is referred to as a weight matrix. We see that by equating S with (1/2)W, SWAG is nearly a special case of MEGS, but with an additional ℓ1 term included to induce unstructured sparsity.
The descent algorithm used to solve SWAG-regularized problems requires that (I + λW) be positive semidefinite, to ensure the threshold operator, which is applied iteratively during the optimization, is well-defined. In contrast, MEGS does not require constraints of this type. Fig. 2 shows surface and contour plots of LASSO, E-LASSO, SWAG, and MEGS with S = 11^T − I. All four contour plots show why these penalties are sparsifying: if one minimizes the penalty while constrained to an affine subspace, the solution will be sparse. One key distinction of MEGS, shown in the surface plot, is that the penalty is zero on the coordinate axes, which is what leads to unbiased 1-sparse solutions. In the contour plot for MEGS, this property is shown by the contours not touching the coordinate axes.
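A Fig. 2-style comparison is easy to reproduce numerically; the snippet below is our sketch (penalty definitions as given above, with SWAG weight W = 2S):

```python
import numpy as np

g = np.linspace(-1, 1, 201)
X1, X2 = np.meshgrid(g, g)
A1, A2 = np.abs(X1), np.abs(X2)

lasso  = A1 + A2                      # ||x||_1
elasso = (A1 + A2) ** 2               # ||x||_1^2 (one group)
megs   = 2 * A1 * A2                  # |x|^T (11^T - I)|x| = ||x||_1^2 - ||x||_2^2
swag   = (A1 + A2) + 2 * A1 * A2      # ||x||_1 + (1/2)|x|^T W |x|, W = 2S

# MEGS is zero wherever A1 * A2 = 0, i.e., on the coordinate axes,
# unlike the others; this is what yields unbiased 1-sparse solutions.
```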

D. ALGORITHMS FOR REGULARIZING WITH MEGS
We aim to efficiently solve any problem of the form

x̂ = arg min_x f(x)   subject to   |x|^T S |x| = 0,   (14)

using a proximal subgradient method (Algorithm 1), assuming that f(x) is a proper function with Lipschitz continuous gradients (e.g., the least squares loss). There is no easily computable proximal operator for the penalty as written in (4), so instead we look to write it in a different form. First, notice that

|x|^T 11^T |x| = ‖x‖_1^2   and   S = 11^T − I − (11^T − S − I).

We can therefore express the penalty (4) as

|x|^T S |x| = ‖x‖_1^2 − ‖x‖_2^2 − |x|^T (11^T − S − I) |x|.   (15)

The algorithm to solve (14) then simply requires us to evaluate the gradient of f(x) − ‖x‖_2^2, the subgradient of −|x|^T (11^T − S − I) |x|, and the proximal operator of ‖x‖_1^2, for which an efficient algorithm is available (see [17]). In practice, we find the convergence speed of this algorithm is not negatively affected by the use of a subgradient of −|x|^T (11^T − S − I) |x|, and in cases with no overlapping groups, this term equals zero.

Algorithm 1: Proximal Subgradient Method for MEGS Constrained Problems.
Algorithm 1 outlines the proximal subgradient scheme used to approximately solve the problem in (14) by introducing a Lagrange multiplier ν.
In Algorithm 1, P_{αν_k} is the proximal operator associated with the ‖·‖_1^2 term, and α is a step size that can be fixed or computed dynamically. As gradient methods are not generally guaranteed to converge to a global minimum when the cost function is nonconvex, it is prudent to start with an initial estimate of x that minimizes the data fidelity term. There also exists an accelerated proximal gradient algorithm that is guaranteed to converge to a stationary point even for nonconvex problems [18], which can be used here to improve convergence speed. This algorithm is outlined in Appendix C.
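Since the listing of Algorithm 1 is not reproduced above, the following is our plausible sketch consistent with its description: a forward (sub)gradient step on the smooth and concave parts of (15), a proximal step for the squared ℓ1 term (one standard sort-based evaluation, cf. [17]), and a dual-ascent update of the multiplier ν. All names and step-size choices are our own assumptions.

```python
import numpy as np

def prox_sq_l1(v, tau):
    """Proximal operator of tau * ||x||_1^2 (sort-based scheme, cf. [17]).

    Solves min_x 0.5 * ||x - v||_2^2 + tau * ||x||_1^2.
    """
    if tau <= 0:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                   # magnitudes, descending
    k = np.arange(1, len(u) + 1)
    s = np.cumsum(u) / (1.0 + 2.0 * tau * k)       # candidate values of sum(|x|)
    active = u > 2.0 * tau * s
    if not np.any(active):
        return np.zeros_like(v)
    thr = 2.0 * tau * s[np.nonzero(active)[0].max()]
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def megs_constrained(f_grad, S, x0, alpha=1e-3, alpha_nu=1e-2, iters=5000):
    """Our sketch of Algorithm 1: approximately solve
    min f(x) s.t. |x|^T S |x| = 0, using the splitting in (15)."""
    M = len(x0)
    C = np.ones((M, M)) - S - np.eye(M)            # overlap term in (15)
    x, nu = x0.astype(float).copy(), 0.0
    for _ in range(iters):
        # (sub)gradient of f(x) - nu*||x||_2^2 - nu*|x|^T C |x|
        g = f_grad(x) - 2.0 * nu * x - 2.0 * nu * np.sign(x) * (C @ np.abs(x))
        x = prox_sq_l1(x - alpha * g, alpha * nu)  # the P_{alpha nu_k} step
        nu += alpha_nu * (np.abs(x) @ S @ np.abs(x))  # dual ascent on constraint
    return x
```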
If we wish to introduce a transform of the vector of interest, we can instead solve a problem of the form

x̂ = arg min_x f(x)   subject to   |Bx|^T S |Bx| = 0,

using the alternating direction method of multipliers (ADMM) [19], which also uses Algorithm 1 or an accelerated variation thereof at each iteration.
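One standard ADMM splitting for the transformed problem (our notation; the paper's exact formulation is not reproduced here) introduces z = Bx:

```latex
\begin{aligned}
&\min_{\mathbf{x},\,\mathbf{z}}\; f(\mathbf{x}) + \iota_{\mathcal{C}}(\mathbf{z})
  \quad \text{s.t.}\quad \mathbf{z} = \mathbf{B}\mathbf{x},
  \qquad \mathcal{C} = \{\mathbf{z} : |\mathbf{z}|^T \mathbf{S}\, |\mathbf{z}| = 0\},\\[2pt]
&\mathbf{x}^{k+1} = \arg\min_{\mathbf{x}}\; f(\mathbf{x})
   + \tfrac{\rho}{2}\|\mathbf{B}\mathbf{x} - \mathbf{z}^k + \mathbf{u}^k\|_2^2,\\
&\mathbf{z}^{k+1} = \arg\min_{\mathbf{z}\,:\,|\mathbf{z}|^T \mathbf{S} |\mathbf{z}| = 0}\;
   \tfrac{\rho}{2}\|\mathbf{B}\mathbf{x}^{k+1} - \mathbf{z} + \mathbf{u}^k\|_2^2
   \quad \text{(a MEGS-constrained least squares: Algorithm 1)},\\
&\mathbf{u}^{k+1} = \mathbf{u}^k + \mathbf{B}\mathbf{x}^{k+1} - \mathbf{z}^{k+1}.
\end{aligned}
```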

III. DESIGNING S FROM ENGINEERING PRINCIPLES
In this section, we show some choices for the design of S and explain the implications.

A. CANONICAL-BASIS 1-SPARSITY
If we choose S = 11^T − I, then S is an all-ones matrix except for a zero diagonal; see Fig. 3(a) for a graphical depiction. Then, as shown in (15), the penalty (4) becomes

|x|^T (11^T − I) |x| = ‖x‖_1^2 − ‖x‖_2^2,

which is equal to zero for any 1-sparse x. Hence, this can be used to promote 1-sparse solutions to optimization problems.

B. INTRA-GROUP 1-SPARSITY
Consider a vector x ∈ R^{gn}, where g is the number of groups and n is the number of components in each group. Then, we can promote 1-sparsity within each group using S ∈ R^{gn×gn} defined in blocks:

S = diag(S̄, . . . , S̄),   where   S̄ = 11^T − I ∈ R^{n×n}

(see Fig. 3(b)). This is now a block-separable problem, equivalent to applying the penalty defined in Section III-A to different sections of x, i.e., Σ_{k=1}^g |x_k|^T S̄ |x_k|, where x_k denotes the kth block of n components. One use case for this formulation is in modeling occlusions in 3D imaging problems, which we demonstrate with an example in Section IV-C.

C. LOCAL NEIGHBORHOOD 1-SPARSITY
In settings with a natural ordering of indexes, enforcing or promoting a minimum distance between nonzero entries is a form of local neighborhood sparsity. MEGS expresses this with a symmetric Toeplitz S with nonzero entries in a band around the diagonal. For instance, an S with neighbor distance n = 2 is depicted in Fig. 3(c). For a general neighbor distance n, S is given by

S_{i,j} = 1 if 1 ≤ |i − j| ≤ n;   S_{i,j} = 0 otherwise.

This imposes n-wide local-neighborhood 1-sparsity, which ensures nonzeros in a solution occur with a minimum separation distance of n + 1 indices. This S is used in Section IV-B. More generally, a Toeplitz S is applicable to a shift-invariant sparsity structure.
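The three designs of Sections III-A to III-C are straightforward to construct programmatically; a NumPy/SciPy sketch (function names are ours):

```python
import numpy as np
from scipy.linalg import block_diag, toeplitz

def S_canonical(M):
    """Section III-A: all ones off the diagonal; promotes global 1-sparsity."""
    return np.ones((M, M)) - np.eye(M)

def S_groups(g, n):
    """Section III-B: block diagonal; promotes 1-sparsity within each of g groups."""
    return block_diag(*[S_canonical(n) for _ in range(g)])

def S_local(M, n):
    """Section III-C: Toeplitz band; nonzeros at least n + 1 indices apart."""
    col = np.zeros(M)
    col[1:n + 1] = 1.0
    return toeplitz(col)
```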

IV. APPLICATIONS WITH DESIGNED S

A. RANDOM COMPRESSIVE MEASUREMENTS
Three experiments were performed in MATLAB [20], where measurements y were generated using y = Ax, with white Gaussian noise added to result in a signal-to-noise ratio (SNR) of 25 dB. Matrix A ∈ R^{25×60} has random Gaussian entries, realized anew for each trial. The vector x has nonzero entries that are independent and uniformly distributed on [−3/2, −1/2] ∪ [1/2, 3/2], with sparsity structure such that |x|^T S |x| ≈ 0 for some S. Details of how this is achieved are presented in Appendix D.
In the first experiment, S is fixed and enforces intra-group 1-sparsity in ten groups of six components (see Section III-B), with each diagonal block S̄ ∈ R^{6×6}. In the second experiment, a local neighborhood S (see Section III-C) is used with n = 4. In the third experiment, a new binary S is generated per trial, with each off-diagonal pair S_{i,j} = S_{j,i} set to 0 or 1 with equal probability. We compare the performance of the MEGS penalty with other typical sparsity regularizers, both structured and unstructured. The problems solved are of the form

x̂ = arg min_x ‖y − Ax‖_2^2 + λ R(x),

where R is the sparsity penalty under comparison. The E-LASSO penalty uses a sum of the squared ℓ1 term over groups. We create one group per row of S, where each group contains the indexes of the nonzero components in that row (and the diagonal), i.e., g_i = { j : S_{i,j} + δ_{i−j} = 1 }. The pseudoinverse is used in the initialization for the nonconvex penalties, x_0 = A^+ y = A^T (AA^T)^{−1} y. Table 1 shows the results from the three experiments. The metrics evaluated are the percentage of components correctly identified as nonzero, the Jaccard index (the size of the intersection of the true support and the estimated support divided by the size of the union), and the mean-squared error (MSE) for the components that are nonzero in the ground truth. We see that MEGS outperforms the other algorithms in almost every case. It is unsurprising that LASSO performs especially poorly in the group S and local neighborhood settings, as the true support is not especially sparse overall, whereas MEGS is agnostic to the overall sparsity level. We also see that the MSE within the correct support is particularly low with MEGS, as it implicitly de-biases the solutions, whereas both SWAG and E-LASSO exhibit a shrinkage bias. Of particular interest are the results where λ is tuned to result in the correct sparsity level. The MEGS results are particularly favorable in this setting, and this is the most typical use case; it is roughly equivalent to solving the constrained MEGS problem in (14) and hence requires no tuning of the regularization strength by the user.
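A minimal reproduction of this setup (our sketch, reusing S_groups and megs_constrained from the earlier sketches; generate_structured_x is the hypothetical generator sketched in Appendix D):

```python
import numpy as np

rng = np.random.default_rng(0)
S = S_groups(g=10, n=6)                    # ten groups of six (Section III-B)
A = rng.standard_normal((25, 60))
x_true = generate_structured_x(S, rng)     # hypothetical helper; see Appendix D sketch
y_clean = A @ x_true
sigma = np.linalg.norm(y_clean) / np.sqrt(len(y_clean)) * 10 ** (-25 / 20)
y = y_clean + sigma * rng.standard_normal(len(y_clean))   # 25 dB SNR

f_grad = lambda x: 2 * A.T @ (A @ x - y)   # gradient of ||y - Ax||_2^2
x0 = np.linalg.pinv(A) @ y                 # pseudoinverse initialization
x_hat = megs_constrained(f_grad, S, x0)
```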

B. LOCAL NEIGHBORHOOD TOTAL VARIATION
The well-established total variation (TV) penalty ‖Dx‖_1, where multiplying by D takes finite differences, aims to sparsify the changes along the solution vector x. We propose here a local neighborhood total variation (LN-TV) penalty

|Dx|^T S |Dx|.

LN-TV is MEGS with the transform B = D (see Section II-B) and an S that promotes sparsity in local neighborhoods (see Section III-C). LN-TV simply ensures that changes in the vector occur at least some minimum distance apart, while allowing changes of any amplitude. This is a prior that suggests the vector is piecewise-constant with a certain minimum segment length.

In Fig. 4, we show the results of an experiment where an arbitrary noisy piecewise-constant signal y is denoised by solving

x̂ = arg min_x ‖y − x‖_2^2 + λ R(x),

where R(x) is either ‖Dx‖_1 (for TV) or |Dx|^T S |Dx| (for LN-TV), and S is chosen to promote sparsity within five indices on either side of any nonzero component. We also compare with Moreau-enhanced TV denoising (a convex-nonconvex variation) [21]. The parameter λ is tuned in each case to provide the lowest sum of absolute error. We see that if a typical minimum separation distance between significant changes in a vector is known, LN-TV can outperform TV and Moreau-enhanced TV in a denoising setting.

For more general problems, where a piecewise approximation is helpful but the segments do not necessarily have a minimum length well known a priori, the LN-TV penalty can be combined with traditional TV by solving a problem of the form

x̂ = arg min_x ‖y − Ax‖_2^2 + λ_1 |Dx|^T S |Dx| + λ_2 ‖Dx‖_1,   (21)

where the λs are parameters that control the strength of the regularization. If A = I, this acts to denoise the input y, but different choices of A can be used to solve deconvolution, deblurring, and other problems. In Fig. 5, we show the results of an experiment where a noisy piecewise-constant signal is denoised by solving (21) with either λ_1 = 0 and λ_2 optimized for the lowest absolute error (i.e., TV only), or λ_1, λ_2 > 0 and both optimized for the lowest absolute error (LN-TV and TV). Here, we use the MEGS term as a regularizer with a strength prescribed by λ_1. This allows some more nuance: a linear ramp from 1 to 0.5 in the band around the diagonal of S makes changes close to one another less likely than changes a greater distance apart. We find that the addition of the LN-TV term can help avoid the 'staircasing' artifacts that often arise when using TV, and allows for weaker TV regularization overall, thus reducing the bias in the result.

We can extend this formulation to a 2D setting as well. The penalty formulation in (22) extends LN-TV to two dimensions by promoting horizontal and vertical differences that occur only with some minimal separation distance:

h(D_v x)^T S_v h(D_v x) + h(D_h x)^T S_h h(D_h x),   (22)

where D_v and D_h are matrices that take finite differences along the vertical columns and horizontal rows, respectively, and S_v and S_h impose local neighborhood sparsity in the vertical and horizontal directions (the exact form of these depends on how one chooses to vectorize the image into the column vector x). Here, we choose h(·) = |·|^κ, which introduces another parameter κ as discussed in Appendix A.
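A minimal setup for the 1D objective in (21) is compact (our sketch; S_local as defined in the Section III builders sketch):

```python
import numpy as np

N = 100
D = np.eye(N - 1, N, k=1) - np.eye(N - 1, N)      # first-difference operator
S = S_local(N - 1, n=5)                           # changes >= 6 indices apart preferred

def ln_tv(x):
    d = np.abs(D @ x)
    return d @ S @ d                              # |Dx|^T S |Dx|

def combined_objective(x, y, lam1, lam2):
    # Eq. (21) with A = I: data fidelity + LN-TV + TV
    return np.sum((y - x) ** 2) + lam1 * ln_tv(x) + lam2 * np.sum(np.abs(D @ x))
```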
In Fig. 6, we show the results of an experiment where a 70 × 70 pixel image containing hard edges is blurred with a 10 × 10 pixel Gaussian kernel with a standard deviation of 4 pixels, and white Gaussian noise is then added to achieve 45 dB SNR. We then solve (21) with A as the blurring operator and with the regularization terms replaced by their 2D equivalents, to achieve deblurring using either TV, LN-TV, or a combination of both. The regularization strength is set to optimize the PSNR in each case, and κ = 0.75. Whilst LN-TV alone does not provide a better result than TV, we see again that there is synergy between TV and 2D LN-TV, which results in a PSNR improvement.

C. MODELING MUTUAL SURFACE OCCLUSIONS IN NON-LINE-OF-SIGHT IMAGING
The aim of non-line-of-sight (NLOS) imaging is to form reconstructions of scenes hidden from the direct line of sight of the observer. Typical methods to achieve this rely on the use of ultra-fast pulsed lasers and time-resolved single photon counting to infer the hidden geometry by measuring roundtrip distances [22], [23], [24], [25], [26]. Often, to recover a 3D scene, the hidden area will be discretized into patches or surfaces in some manner. Some of these patches may occlude others: if a surface is actually present in the scene, light coming from others behind it may have no paths to the measurement device. As the discretization is prescribed beforehand, the surfaces occluded by each other are known a priori, and they should be mutually exclusive. Therefore, we can use MEGS to ensure that groups of surfaces that occlude one another will be 1-sparse in the recovered output. This ensures that the recovered output will be physically plausible, as surfaces that cannot contribute any light to the measurements will not be present in the solution.
This idea applies more generally to any situation in which an ideal discretization of some physical area, field, etc., involves some mutual exclusivity between elements of the discretization. Typically, to deal with this, one may use a discretization that is less desirable but avoids or reduces the exclusivity requirement, use post-processing to make the recovery fit the expected structure, or simply ignore it. Using MEGS instead ensures that reconstructions both fit the structure prescribed by the underlying physics of the problem and enjoy improved estimates, because knowledge of the structure is used within the optimization procedure.
One NLOS imaging system, Edge-Resolved Transient Imaging [27], [28], uses a laser to illuminate the floor at numerous positions along a semi-circle centered on a vertical edge, such as a doorway, to form a 3D reconstruction of the room beyond. For each measurement position, a histogram of the arrival times of photons reaching a single photon avalanche diode detector is accumulated using time-correlated single photon counting. Each subsequent laser position illuminates more of the hidden scene, and by taking differences between measurement histograms one can recover a measurement containing photons originating mostly from surfaces within a single wedge in the hidden area. One can attempt to reconstruct the hidden area by fitting a model of the surfaces within a wedge to the corresponding measured difference histogram. It is assumed that surfaces always start at the ground, extend to a certain height, and are a certain distance from the corner. We formalize this reconstruction by solving the following problem:

x̂ = arg min_x ‖y − Ax‖_2^2 + λ‖x‖_1   subject to   |x|^T S |x| = 0,

where y is a difference histogram measurement and A is a matrix governing the light transport to and from surfaces at different distances from the edge and with different heights. Fig. 7(b) and (d) show reconstructions of a wedge using experimental data (2019_09_10_erti_nlos_3_Histograms_30.mat from https://github.com/Computational-Periscopy/ERTI/) [27]. The LASSO-only reconstructions use only the ‖x‖_1 term and no constraints. We show four solutions obtained with a range of regularization strengths λ as marked, none of which recover the ground truth scene, and most of which contain physically impossible combinations of surfaces. In contrast, the LASSO + MEGS reconstruction includes the MEGS constraint, and the result is close to the ground truth scene and physically plausible. In [27], a Markov chain Monte Carlo (MCMC) method was used to form the reconstructions, as it could explicitly model occlusions and needed to estimate parameters for only a small number of surfaces, rather than use a full scene discretization. This is, however, quite slow to converge, and it does not always converge to the same solution given the same initial condition. Here, we use MEGS and a full scene discretization to ensure that surfaces which should occlude those behind them, and vice versa, are accurately modeled. MEGS also permits us to use a discretization with mutually exclusive elements (surfaces in the same position but with different heights) where only one should be present in the solution. As such, the result is a physically plausible reconstruction, which is quite accurate in support, height, and reflectivity.
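As a simplified illustration (ours; the full model also couples surfaces at different distances), the same-position exclusivity alone reduces to the block design of Section III-B: one group per distance bin, containing all candidate heights at that distance.

```python
import numpy as np

D_bins, H_bins = 20, 8            # distance bins from the edge, candidate heights
# Component k = d * H_bins + h is the surface at distance bin d with height h.
# All heights at one distance are mutually exclusive: at most one can be present.
S = S_groups(g=D_bins, n=H_bins)  # builder from the Section III sketch
```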

V. SUMMARY
Using MEGS as a constraint allows one to recover solutions to problems that obey prescribed mutual exclusivity rules between components or a transform thereof. This is useful in modeling a wide variety of phenomena, such as occlusions in 3D imaging. We illustrate the usefulness of the proposed penalty for some example use cases where, by simply constructing a sparsity structure matrix S, one can ensure estimates fit known mutual exclusivity constraints derived from domain knowledge of the problem. This results in solutions that are both physically plausible and of reduced error. Similarly, MEGS can be used as a regularization term with some strength, to promote solutions that fit a sparsity structure whilst allowing more nuance. This, along with the intuition that a Toeplitz S can encode local neighborhood mutual exclusivity, gave rise to our proposed local neighborhood total variation. MEGS can be readily combined with other priors, regularization terms, and constraints to improve estimates. A proximal (sub)gradient algorithm and an accelerated variant have been proposed to quickly solve MEGS-constrained and MEGS-regularized problems.
APPENDIX

A. ALTERNATIVE NONLINEARITIES

In a case where there can be a large disparity between the sizes of components within groups in the solution vector, one may wish to impose a nonlinear scaling and use h(x) = |x|^κ with κ < 1. This reduces the effect of large but possibly incorrect components on other, potentially correct components. This is used in the two-dimensional local neighborhood total variation example in Section IV-B.

B. NONCONVEXITY
When h(·) = |·|, the Hessian of the penalty at any x with no zero components is given by

∇^2 (|x|^T S |x|) = 2 Λ_x S Λ_x,   where Λ_x = diag(sign(x)).

The sparsity structure matrix S is indefinite, as it has at least one positive eigenvalue (see [29]) and the sum of its eigenvalues is zero (since the trace of a symmetric matrix is equal to the sum of its eigenvalues and Tr(S) = 0). The Hessian has the same eigenvalue structure and can therefore not be positive semidefinite outside of the trivial case where x = 0. Thus, the penalty is always nonconvex for practical purposes. More analysis of the eigenvalues of matrices of this kind can be found in [30].
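A quick numerical confirmation of the indefiniteness (our snippet):

```python
import numpy as np

M = 6
S = np.ones((M, M)) - np.eye(M)   # canonical-basis 1-sparsity design
print(np.linalg.eigvalsh(S))      # [-1, ..., -1, M-1]: trace 0, indefinite
```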

C. ACCELERATED MEGS ALGORITHM
In [18], a monotone accelerated proximal gradient method is outlined that is guaranteed to converge to a stationary point for nonconvex problems. Algorithm 2 applies this acceleration to Algorithm 1, where α_x, α_y, and α_λ are step sizes that can be fixed or computed dynamically with a backtracking scheme. The convergence proof requires that f(x) is a proper function with Lipschitz continuous gradients (e.g., the least squares loss) and that the MEGS penalty is proper and lower semicontinuous, which is the case for h(x) = |x|.
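Algorithm 2's listing is not reproduced here; the skeleton below is our sketch of the monotone APG of [18], with grad_fs a (sub)gradient of the smooth/concave terms (as in Algorithm 1) and prox_g the proximal operator of the ‖·‖_1^2 term. The monotone acceptance test uses the full objective F.

```python
import numpy as np

def monotone_apg(grad_fs, prox_g, F, x0, alpha, iters=500):
    """Our sketch of the monotone APG of [18] (cf. Algorithm 2)."""
    x_prev = x = z = x0.copy()
    t_prev, t = 0.0, 1.0
    for _ in range(iters):
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        z = prox_g(y - alpha * grad_fs(y), alpha)   # accelerated candidate
        v = prox_g(x - alpha * grad_fs(x), alpha)   # plain proximal candidate
        x_prev, x = x, (z if F(z) <= F(v) else v)   # keep the better: monotone
        t_prev, t = t, 0.5 * (np.sqrt(4.0 * t * t + 1.0) + 1.0)
    return x
```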

Algorithm 3: Generate a Vector x_S Such That |x_S|^T S |x_S| ≈ 0.

D. GENERATING TEST VECTORS
For the experiments in Section IV-A, we must generate a test vector x for each trial that fits the sparsity structure described by a specific S matrix. To do so, we use Algorithm 3, where in Step 4, u_j is drawn from the continuous uniform distribution on [0.5, 1.5].
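Since the listing of Algorithm 3 is not reproduced above, the following is our plausible sketch of such a generator: greedily build a support with no two elements linked by S, assigning each admitted entry a random sign and a magnitude u_j uniform on [0.5, 1.5] (matching the ±[1/2, 3/2] distribution in Section IV-A).

```python
import numpy as np

def generate_structured_x(S, rng):
    """Hypothetical sketch of Algorithm 3: draw x with |x|^T S |x| ≈ 0.

    Greedily admits indexes with no exclusivity link (S_ij > 0) to the
    support built so far; each admitted entry gets a random sign and a
    magnitude u_j ~ Uniform[0.5, 1.5], as in Step 4.
    """
    M = S.shape[0]
    x = np.zeros(M)
    allowed = set(range(M))
    while allowed:
        i = int(rng.choice(sorted(allowed)))                    # random admissible index
        x[i] = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.5)  # Step 4 analogue
        allowed -= {i} | {j for j in range(M) if S[i, j] > 0}   # drop excluded indexes
    return x
```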