Bayesian Compressive Sensing of Sparse Signals with Unknown Clustering Patterns

We consider the sparse recovery problem of signals with an unknown clustering pattern in the context of multiple measurement vectors (MMVs) using the compressive sensing (CS) technique. For many MMVs in practice, the solution matrix exhibits some sort of clustered sparsity pattern, or clumpy behavior, along each column, as well as joint sparsity across the columns. In this paper, we propose a new sparse Bayesian learning (SBL) method that incorporates a total variation-like prior as a measure of the overall clustering pattern in the solution. We further incorporate a parameter in this prior to account for the emphasis on the amount of clumpiness in the supports of the solution to improve the recovery performance of sparse signals with an unknown clustering pattern. This parameter does not exist in the other existing algorithms and is learned via our hierarchical SBL algorithm. While the proposed algorithm is constructed for the MMVs, it can also be applied to the single measurement vector (SMV) problems. Simulation results show the effectiveness of our algorithm compared to other algorithms for both SMV and MMVs.


Background and Introduction
Single and multiple measurement vector (SMV and MMV) problems are computational inverse problems in the compressive sensing (CS) area. CS provides the possibility of representing a sparse or compressible signal using a small set of non-adaptive linear measurements [1,2]. In linear CS, the P-dimensional signal x ∈ R P is modeled by the linear equation y = Φx, where y ∈ R M is the measurement vector (with M P) and Φ ∈ R M×P is a wide sensing matrix. The sensing matrix Φ is usually constructed from a Gaussian or Bernoulli random operator. In [3], it was shown that Φ also can be constructed from a class of circulant matrices based on deterministic sequences such as the Golay sequence [3]. In the CS context, it is further assumed that x is sparse under some proper basis Ψ, i.e., x = Ψx s , where x s denotes a sparse vector. A sparse vector contains few non-zero components. Combining the two above equations, we obtain y = Ax s , where A = ΦΨ [4]. Since A is wide, the model is underdetermined, and CS looks for a sparse (if not the most sparse) solutionx s such that y = Ax s [5,6]. The SMV is a CS problem when A is known and the measurements are contaminated with noise e, i.e., y = Ax s + e. The case where Y and X s are matrices is called the MMV problem, i.e., Y= AX s + E. In the basic MMV model, it is assumed that all the columns of the solution matrix X s share joint sparsity, meaning that they have the same unknown non-zero locations. Figure 1 shows an example of the MMV structure. The problem we address in this paper is for the recovery of sparse signals with an unknown clustering pattern via either SMV or MMVs. After providing an extensive survey of existing techniques in this area, we present a new hierarchical sparse Bayesian learning model and compare it to the existing algorithms. Though the formulation and modeling will be presented for the basic MMV, the proposed model is also applicable to SMV by simply considering vector cases rather than matrices in the model.

Literature Review on SMV and MMVs
Finding a sparse representationx s in the SMV problem can be achieved using greedy algorithms such as matching pursuit (MP) and orthogonal matching pursuit (OMP) [5,7] or relaxed-to-be-convex methods (such as basis pursuit de-noising (BPDN) and the in-crowd algorithm) [8,9], the class of iterative shrinkage-thresholding algorithms (ISTA), and their variations such as fast iterative shrinkage-thresholding algorithm (FISTA), NESTA (a shorthand for Nesterov's algorithm), and ISTA-NET [10][11][12], and sparse Bayesian learning (SBL) algorithms [13][14][15][16]. Similarly, there exist three main approaches for solving MMVs. The first approach is the extended version of the greedy-based SMV solvers such as MMV basic matching pursuit (M-BMP), MMV order recursive matching pursuit (M-ORMP), and MMV orthogonal matching pursuit (M-OMP) [17][18][19]. The second approach is relaxed-to-be-convex algorithms such as the joint l 2,0 approximation algorithm (JLZA) [20]. The third approach includes the SBL algorithms that are more flexible for incorporating prior knowledge on the structure of the solution compared to the greedy-based algorithms [21][22][23].
In some practical applications, the non-zero components of the sparse signal appear in clusters. For the MMV case, this means that in addition to the joint sparsity structure, the non-zeros also appear in clusters in each column of X s in Y = AX s + E. This feature has been referred to as the clustered structure or block-sparsity pattern in the literature [7,24,25]. Applications of clustered sparsity for the SMV cases arise in problems such as gene expression analysis [26], image reconstruction of hand-written digits [27], and audio signals using the discrete cosine transform (DCT) basis [28]. Applications of MMVs can be found in neuromagnetic imaging [17], the reconstruction stage of Xampling (compressed-sensing of analog signals) for multi-band signals [7,24], and the direction of arrival (DOA) estimation problem [29]. For example, in magnetoencephalography (MEG), the goal is to investigate the locations where most brain activities are produced. The brain activities exhibit contiguity, meaning that they occur in localized regions [25]. Therefore, the measured signal at each snapshot can be modeled as a block-sparse SMV problem.
When taking successive and almost simultaneous snapshots from the phenomena, one expects the block-sparsity structure to be preserved. Hence, it is possible to model these activities with a block-sparse MMV problem where the block partitions are unknown a priori.
During the last decade, several greedy-based algorithms have been proposed to solve clustered pattern SMVs such as reduce MMV and boost (ReMBo) algorithm [7], block-OMP [30], structured OMP (StructOMP) algorithm [31], and group LASSO [32]. However, these algorithms need prior knowledge on the block sizes or the cluster pattern.
Bayesian learning models incorporate prior knowledge on the characteristics of the underlying signal. Starting with prior knowledge, these algorithms update their belief about the underlying features of interest in an unsupervised manner using the observations. Regarding the sparse recovery of SMV and MMVs, existing SBL algorithms can be mainly categorized into the two following approaches. The first and most common approach to impose sparsity on the solution is achieved by modeling each component of the solution with a zero-mean Gaussian prior accompanied with a Gamma distribution on the precision (inverse of variance) of the corresponding component [13,[33][34][35]. In order to promote the clustering pattern, as well as sparsity in these models, different priors have been introduced on the variance of each component of the signal [15,28,36]. For example, in the case of clustered SMV using SBL with zero-mean Gaussian priors, Zhang and Rao incorporated the intra-block correlation structure (correlation structure in each block) [15]. In order to simplify the model, reduce the complexity, and suppress the over-fitting of the parameters in the model, they considered the same, but uncorrelated underlying covariance matrix for each possible block of the solution. The covariance matrix is updated via the expectation-maximization (EM) algorithm. In another work, Fang et al. used a zero-mean Gaussian prior where the precision on each component is statistically dependent on the precisions of the corresponding component and its two immediate neighbors [28].
The second SBL approach for the clustered sparse signal reconstruction uses a spike-and-slab prior [27,[37][38][39][40]. These models have been applied to the SMV problem. Hernandez et al. proposed the generalized spike-and-slab prior, which is suitable for situations where prior information on the groups of components in the solution (that are expected to be jointly zero or jointly non-zero) is available [27]. Yu et al. made the spike and slab probabilities for each solution component depend on three possible patterns of the neighbor supports [37,40]. The patterns depend on whether the two immediate neighbors of each component are active, inactive, or only one of them is active. In [38], a Gaussian process prior was imposed on the spike-and-slab probabilities.

Idea Behind the Proposed Algorithm
In this paper, we present a new hierarchical Bayesian learning algorithm to solve the MMV problem for sparse signals with an unknown cluster pattern. We first establish a simple hierarchical Bayesian model for solving the general form of the MMV problem. In this initial model, we use the Bernoulli-Gaussian prior, which approximates the spike-and-slab prior, but in the sense that instead of employing spikes in the model, we have a binary vector that is to be learned to determine the supports of the sparse solution. Related algorithms can be found in [37,41,42]. In terms of Gaussian-Bernoulli modeling of the sparse signal, our initial model is close to the simplified form of [37,42], and in terms implementation, it uses the MCMC implementation with Gibbs sampler technique as in [37,41]. A binary matrix was used as a part of the Bayesian modeling in [41] for separating the foreground (sparse) component and the background (low-rank) component from the collection of noisy frames of a video recording. The difference between our initial model and [41] is that here, we use a binary vector to learn the supports of the solution for the MMV problem with the joint sparsity structure. However, this initial model only favors sparse solutions without any feature to promote the clustering pattern. The main reason for defining this initial model appears later in this paper where we modify the model not only to promote sparsity, but also to account for the clustered structure that may exist in the solution. The main contribution in this paper is related to the modified model, where we impose a prior based on the l 1 -norm of the discrete gradient of the support vector s to promote clustering. This prior incorporates a parameter to account for the measure of contiguity, or "clumpiness," in the supports of the solution. Assigning a large value to the hyperparameter represented by this parameter encourages the overall supports of the solution to have fewer on/off transitions, that is more contiguity of clustering in the supports. Similar to the other parameters, this parameter is also to be learned via our hierarchical SBL algorithm. Previously-developed algorithms do not have this control parameter for learning the pattern via the measure of overall clumpiness over the solution.
The proposed algorithm learns the supports and the corresponding values of the active components in the solution simultaneously. This task is accomplished by defining a Bernoulli-Gaussianinverse Gamma distribution on the solution. Based on the built-in prior modeling, the solution tends to become sparse. In contrast, existing algorithms that use Gaussian-inverse Gamma prior modeling may need to perform some post-processing to remove the non-dominant components from the estimated solution. Existing algorithms in this area evaluate the performance based on the MSE reconstruction performance, but there are applications for which the support recovery is of equal or more importance. Examples can be found in compressive sampling and reconstruction of blind multi-narrowband signals in the continuous-to-finite (CTF) reconstruction stage, where the goal is to seek the edges of the sparsely-scattered narrowband signals to estimate the carrier frequencies [7,24]. In [7,24], the authors assumed that the number of narrowband signals is known, which yields the use of a modified OMP. The modified OMP is fed with the true sparsity level and the support block sizes of two. However, if the number of such signals is unknown, then we need an algorithm such as our proposed algorithm to learn the actual sparsity level, as well. For the problems raised in [7,24], it turns out that the supports of the solution in the CTF stage tend to clump together, which is representative of clustered pattern supports. For this problem, the supports of the solution are needed in order to figure out the locations of the actual carriers of the signals in the spectrum. Since our algorithm learns the support vector, it can naturally estimate the location of the carriers. In contrast, the other existing algorithms may tend to provide many supports, including non-dominant supports, which need to be post-processed to estimate the locations of the carriers. Another example is in [43], where the goal is to figure out the location of the sparsely-scattered multi-narrowband signals in the spectrum and then use this information to be able to fill out the empty spaces in the spectrum.
The novelty of our algorithm can be described as follows. Prior works in this area usually consider three hyperpriors commonly modeled by Gamma distributions, either on the precision of the Gaussian modeling on the solution or on the probabilities associated with Bernoulli modeling of the support vector elements, to promote clustered pattern solutions. These Gamma hyperpriors have different parameters to decide on each component of the solution based on the active/inactive status of its immediate components. By contrast, our model incorporates a total variation-like hyperprior, as a measure of the overall clustering pattern, on the support vector of the solution. Our model includes one control parameter in this prior to account for the emphasis on the amount of clumpiness in the support vector. This control parameter is learned in a Bayesian fashion, rather than having three different hyperpriors. Learning this parameter and the use of total variation-like prior on the support vector are novel.
Our proposed algorithm differs from [28] in two aspects. First, our model can be readily applied to either SMV or MMV problems, while the original PC-SBL algorithm proposed in [28] solves for the clustered pattern SMVs. Although it has been recently extended via generalized approximate message passing (GAMP) to solve for 2D problems [44], it needs some extra modifications to be used for the MMVs. Secondly, our model uses the Bernoulli-Gaussian prior, and it promotes the clustering pattern by adding hyperpriors on the supports of the solution, while in [28], this task is performed on the variances of the solution components. Yu et al. [37,40] used the spike-and-slab prior model and forced each mixing weight to depend on one of the three different possible active/inactive patterns of its immediate neighboring supports. This idea comes from the k-nearest neighbor approach in the clustering problems. Our approach differs from [37,40] due to the first reason we provided earlier for [28]. Tibshirani et al. presented an algorithm for SMVs referred to as the fused-lassoalgorithm, which promotes both sparsity and smoothness in the solution [26]. In this algorithm, the smoothness is promoted by using the absolute value of the difference between the estimated values of the successive components of the solution. In contrast, our proposed algorithm promotes sparsity and is able to learn the clustering pattern that may exist in the supports of the solution. The clustering pattern is learned by using the summed absolute value of the differences in the supports (our sigma-delta function), which distinguishes these algorithms. More specifically, we incorporate a total variation-like prior on the support vector of the solution rather than using such a prior on the solution vector itself. Finally, there are some SMV solvers that use the spike and slab prior model where the slab is modeled by a Gaussian scaled mixture model instead of just one Gaussian distribution [45]. It turns out that using this model can provide a better estimate of the underlying distribution of the non-zero elements.
However, using this model increases the number of parameters to be learned even when we have only one mixing probability parameter to learn the supports. In [45], for the purpose of reducing the complexity of the algorithm, the expectation-maximization algorithm using approximate message passing was employed. Our measure of clumpiness can also be incorporated in this model to promote the clustering pattern.
Early development of this work was presented in [46], following which significant changes have been made. We have completely changed the update rule of some of the parameters, i.e., γ p in (15) and the controlling parameter α in (20) to learn the overall clumpiness of the supports. The convergence issue in the MCMC algorithm is addressed, and we explain how to track and monitor the convergence of the posterior distributions and make decisions on the supports based on the collected samples. Additionally, we compare the performance of our algorithm with other algorithms for both the SMV and MMVs, both on synthetic and real data.
This paper is organized as follows. In Section 2, we construct a basic hierarchical SBL model for solving the MMVs when the solution shares joint sparsity. Section 3 describes our main proposed algorithm, which extends this basic model to account for both joint sparsity and the unknown clustering pattern that may exist in the solution. In Section 4, we illustrate the performance of our proposed work compared to the other algorithms. Finally, Section 5 discusses the convergence diagnostic of the MCMC technique for the proposed algorithm. Section 6 presents conclusions.

Initial Model: Sparse Bayesian Learning for MMVs
As an initial model, here we construct a hierarchical SBL algorithm for the sparse recovery of basic MMVs with the joint sparsity structure that is expected to occur across the columns of the solution matrix. As discussed earlier, this model serves as the initial model, which we will modify in Section 3 for the clustered pattern sparse signals. We refer to this SBL algorithm as ordinary-SBL (O-SBL).
In this model, the supports of the solution are modeled by the binary vector s. Therefore, the sparse solution is described by s • X, where s and X account for the support and the solution-values, respectively, and • denotes element-by-element multiplication (Hadamard product) applied across the columns of X. The model for the MMV problem is: where Y ∈ R M×N , A ∈ R M×P , s ∈ {0, 1} P×1 , X ∈ R P×N , and E ∈ R M×N . The matrix Y contains N columns of observed noisy data; A denotes the known sensing matrix; s is an unknown binary support-learning vector; X is an unknown solution-values matrix; and E represents the measurement noise. In the product s • X, when N > 1, the support vector s deals across the columns of X. The term s • X is simply equivalent to diag{s} · X, where "·" is the regular matrix product and diag{·} creates a diagonal matrix from its argument vector. A representation of a hierarchical Bayesian graphical model of the problem used in the development of our algorithm is portrayed in Figure 2.
The shaded node Y shows the observations, and the small solid nodes represent the hyperparameters. Each unshaded node denotes a random variable (or a group of random variables) [47]. The support-learning component s in (1) is a binary vector, and we model the elements of s as Bernoulli random variables, whose probabilities are governed by the prior γ = [γ 1 , γ 2 , . . . , γ P ] T ; that is, In order to favor the sparsity structure in s, on the basis of experimentation, we set α 0 = 1 P and β 0 = 1−α 0 , as suggested in [41].
The columns of the solution-value matrix X = [x 1 , . . . , x N ] in (1) are assumed to be drawn i.i.d. according to the normal-gammadistribution: where a 0 and b 0 denote the shape and rate of the Gamma distribution, respectively. For the purpose of reducing the model complexity, we use the same precision τ for all the components of X. Moreover, due to the lack of prior knowledge on the entries of X, we experimentally set the hyper-parameters to a 0 = b 0 = 10 −3 , endowing X a priori with a fairly high variance. The entries of the noise component E are assumed to be drawn i.i.d. from a Gaussian distribution with the precision ε −1 . In our model, the precision ε −1 is unknown and will be inferred, so that: e mn ∼ N (0, ε −1 ), m = 1, . . . , M, n = 1, . . . , N, ε ∼ Gamma(θ 0 , θ 1 ). ( The hyper-parameters in (3) are set to θ 0 = θ 1 = 10 −3 . Referring to the graphical model in Figure 2, the joint probability distribution of the model can be written as: In the descriptions of the marginalized posterior distributions below, conditioning on −, as in (s p |−), is used to denote the inference on s p conditioning upon all relevant variables (including the observations). where Here,ỹ The Appendix provides more details.
In our implementation, we draw from the conditional posterior densities via Markov chain Monte Carlo (MCMC) using Gibbs sampling [41], as shown in the following pseudocode.

O-SBL Algorithm:
In the above algorithm, Θ 0 represents the set of initial values of the parameters of interest, drawn initially from the prior distributions defined in (2)-(3). We then run the O-SBL algorithm for the number of N burn-in iterations. Samples are not collected during the burn-in period. Then, N collect more iterations are performed to collect the set of samples. For example, the estimate of the solution matrix X is computed from the sample mean, i.e., [n] , whereX [n] denotes the collected samples for the solution matrix obtained from the corresponding approximated posterior distribution at the n th iteration. As an alternative, one may use the samples obtained from the last iteration of the collection period as the estimate of the variable of interest. For example,X =X [N collect ] . The MCMC convergence for the algorithm is discussed in Section 4.

Clustered Sparse Bayesian Learning
In this section, we modify the initial SBL model described in Section 2 to improve the support recovery performance of MMVs for the clustered sparse signals. We assume that the columns of the solution matrix are jointly sparse and that each of the vectors might have groups of clumps, i.e., groups of adjacent non-zero terms. Following [48], we measure the amount of clumpiness in the support-learning vector s by the absolute sum of the differences between successive elements of s, There exist fewer transitions in s, corresponding to a smaller Σ∆, when the supports of the solution have a clustered pattern compared to an unstructured distribution of supports. The Σ∆ measure is used to establish a prior probability, which encourages clustered pattern supports by setting the prior for the support-learning vector s proportional to e −α(Σ∆)(s) for some α > 0. The parameter α specifies the significance of the clumpiness in the supports. Large values of α encourage more contiguity in the supports of s. However, the measure of Σ∆ (total variation on the support vector) itself is not sufficient to promote sparse clustered pattern supports, but rather only promotes clustered pattern solutions.
Here, we provide a motivational example to clarify the reasoning for this. In Figure 3, we have two signals (one sparse and the other non-sparse), where both signals are of length 100. According to Figure 3, both signals have the same measure of Σ∆ = 2. As we can see in this figure, Signal 1 is sparse, while Signal 2 is non-sparse. Therefore, the measure of total variation is not sufficient to promote sparse clustered pattern solutions. Let us also investigate how the measure of total variation is affected by forcing a specific component in these two signals, active and inactive. The original active locations of Signal 1 are 45:50 (with MATLAB notation). Now, if we set the 44 th component in Signal 1 to become active, the measure of total variation does not change. Similarly, setting the 44 th component on Signal 2 to zero does not change the measure of total variation. This suggests that we need to further modify our model to promote sparsity, as well. For this purpose, we incorporate a binomial distribution into our prior model for s. This distribution contains the effect of the sparsifying hyperprior γ p and the sum over all the support components of the support vector (for both active and inactive status of s p ). Referring back to the example illustrated in Figure 3, for Signal 1, the sum over the supports of the signal (the number of active components) is six, while this measure is 94 for Signal 2. Therefore, the solution with the lower value of summation over the supports is sparser and of more interest.
We model the prior on the elements of the support-learning vector s as follows: where Ω p := ω 1,p ω 0,p +ω 1,p and: and where Σ k,p , k ∈ {0, 1}, is defined as: that is, it is the sum over all the elements of s for the case of forcing s p = k. In other words, Σ k,p is the number of active elements in s when the p th component of s is set to be one (active, via k = 1) or zero (inactive, via k = 0). For example, Σ 1,5 = 1+∑ P p =5 s p . Furthermore, in (8), the term (Σ∆) k,p is the measure of clumpiness evaluated via (6) for the case of forcing the p th component of s equal to k. This measures how the status of s p affects the contiguity over the supports of the solution. For example, (Σ∆) 0,5 = ∑ P p=2 |s p − s p−1 |, when we force s 5 = 0. This measures how the inactive status of s 5 affects the contiguity over the supports of the solution. The term Ω p in the prior on the support learning vector s (7) can be further simplified into: where: and: Roughly speaking, this distribution favors drawing s p = 1 if this draw reduces (Σ∆) 1,p . Definē in Ω p , then the prior on s p would be only governed byc p . In this case, Ω p in (9) will be simplified into Ω p = γ p , which is the same prior as we had earlier in (2). This prior only tends to promote sparsity without favoring the clustered pattern supports. By contrast, settingc p = 1 in Ω p , for a sufficiently large value of α and (Σ∆) p > 0, favors clustered pattern supports, which may lead to non-sparse solutions. Therefore, by incorporating both c p and the exponential term in the prior on s p , defined in (9), the supports exhibit a trade off between sparsity and clustering.
In Figure 4, we demonstrate the effect of α defined in (8) and its role in the prior (9) on the support learning vector s. The figure shows draws of s according to (9) using (10) and (11).  Larger values of α result in lower Σ∆(s), meaning that the support vector s will tend to have fewer transitions along its components, promoting clustering.
We employ a Gamma prior α ∼ Gamma(a 1 , b 1 ) on α, where a 1 and b 1 are hyperparameters denoting the shape and rate of the Gamma distribution, respectively. In order to promote the clustering pattern as a prior knowledge, we set a 1 = 2 × 10 −3 and b 1 = 10 −3 , meaning that on average, we expect to have α = 2 before incorporating the measurements. By setting these parameters to small values, as we have here, the learning process of α will not be biased by the prior, once the measurements are incorporated.
With these priors, the joint probability distribution for the complete model becomes: The graphical model for the clustered pattern MMVs is shown in Figure 5. Figure 5. Graphical model of the Bayesian formulation (12).
Below, we describe the inference on the variables that are modified by (12) compared to (4). The inference on the other variables and parameters in the model is the same as those we described in Section 2.
• The posterior for s p is given by: where Ω p was defined in (7). Using the fact that s p is a binary random variable, the posterior inference on s p can be simplified to ( and: The posterior for s p can be further simplified into: where c p and (Σ∆) p were defined in (10) and (11), respectively, and: • The posterior for γ p is given by • The update rule for α is determined as follows. Using the joint probability distribution of the complete model (12) and discarding the terms that are unrelated to α, we have: Due to the complicated nature of (16), we estimate α based on the current value of all the other variables and parameters of the model in the implemented MCMC approach. In other words, we compute:α where L α is obtained by taking the logarithm of (16) and is defined as: and may be further simplified into: Taking the derivative of L α with respect to α and equating the resulting equation to zero, we obtain: The maximum a posteriori point estimate of α conditioned on the measurements and the current estimate of hidden variables and the parameters of the model can be obtained by iteratively solving: This update is computed at each iteration of the MCMC approach.
As an alternative approach, one can set α to a fixed predefined value. If under some prior knowledge, no clustering pattern is expected in the solution, one can set α to a small value close to zero. In case of expecting a highly clustered solution, we recommend setting α 0.

Remark 1.
In [46], the marginal posterior inference on γ p was estimated by (γ p |−) ∼ Beta(α 0 +s p , β 0 − s p ), ∀p = 1, . . . , P. The idea behind the above update rule was the assumption of having a directed link from the node γ p to the node s p in Figure 5. However, in (15), we have removed this assumption and directly found the inference based on the relationship between the variables that we have in Figure 5. Furthermore, we have provided an analytical approach for the update rule of α, rather than the empirical approach discussed in [46].
The pseudocode below describes our algorithm, referred to as C-SBL, which works for the clustered patterns SMV or MMVs.

End For {Iter}
Similar to the O-SBL algorithm, we perform MCMC inference in the implementation of the C-SBL using Gibbs sampling for all the variables and parameters of the model.

Simulation Results
Here, we compare the performance of our algorithm against other algorithms on both synthetic/simulated and real-world data for both the SMV and MMV problems.

Simulations on Synthetic Data
We first compare our proposed algorithm with other algorithms for the SMV problem, i.e., N = 1 defined in (1). We then consider the MMV problem for the case where the number of columns in the solution matrix X is N = 2, 5.

Performance for the SMV Problem
Each independently-generated trial is constructed as follows. We consider the solution-value vector x ∈ R 100 , i.e., P = 100 and N = 1 in (1). The supports of the true solution are drawn randomly so that the support vector s exhibits a random clustered sparsity pattern. The total number of non-zeros in the sparse solution (x s = s • x) is set to 25 for all the trials. The nonzero elements of x at the active supports of s are drawn i.i.d. from a zero-mean Gaussian distribution with variance σ 2 x = 1. For each trial, the entries of the sensing matrix A ∈ R M×100 are drawn i.i.d. from a zero-mean Gaussian distribution with variance one, and then, we normalize the columns of A. We vary the number of measurements M to show the performance as the ratio λ = M/P changes. The elements of the noise component are drawn i.i.d. from a Gaussian distribution e m ∼ N (0, σ 2 ). The SNR for all trials was 25 dB and is defined as SNR = 20 log 10 (σ x /σ). The measurement y is computed from y = Ax s + e. The data described above were generated for 200 trials.
The recovery performance of the algorithms is demonstrated using both probability of support recovery and mean squared error. The probability of correct detection of a support location (probability of detection) P D and the probability of (erroneously) detecting a support location where there is none (probability of false alarm) P FA are respectively defined as P D = (#Correct detections)/(#Possible correct detections), P FA = (#False detections)/ ((#Possible detections) − (#Correct detections)). A successful reconstruction is reported when all the supports of the true solution are recovered. The normalized mean-squared error is defined as: wherex s is the estimated solution.    Figure 6a shows the performance of C-SBL using receiver operating characteristic (ROC) curves as the number of measurements, and equivalently the ratio λ = M/P varies. For λ > 0.4 (M > 40, P = 100), C-SBL successfully finds all the supports of the true solution with generally a low false alarm rate. The algorithm exhibits high performance when the number of measurements is almost twice the number of true non-zeros in the solution (the true number of non-zeros was set to 25). In Figure 6b-d, we also illustrate the performance of C-SBL vs. the threshold, where the threshold is defined as follows. We average over all of the N collect collected samples of the support learning vector, where each component belongs to {0, 1}. In other words, we computeŝ ave = 1/N collect ∑ N collect n=N burn-in +1ŝ [n] . Then, those indices in the resulting vectorŝ ave that contain values greater than the threshold are chosen as the estimated supports of the solution (setting the threshold to 0.5 results in the sample mean of the collected samples). In Figure 6b, we illustrate the detection rate vs. the threshold as the ratio M/P changes (if we decide on the supports based on all the samples obtained in both burn-in and collected periods, then the detection rate would become zero for the threshold of one for all λ). Figure 6c shows the difference between detection rate and false alarm rate of C-SBL vs. the threshold. The higher values of P D −P FA indicates the higher overall support recovery of the algorithm. In Figure 6c, there is a threshold of around 0.5, where P D − P FA has a peak for λ ≤ 0.4. For the case of λ > 0.4, C-SBL reaches its highest performance with a wide range of threshold of approximately [0.1, 0.9]. This verifies that estimating the support learning vector based on the sample mean (threshold of 0.5) provides a high performance for all λ. Finally, Figure 6d shows the C-SBL behavior in terms of normalized mean-squared error, defined in (21), vs. the threshold. In Figure 6d, the error term for each λ remains almost constant regardless of the threshold. This means that one can take a threshold that leads to a very high detection rate, even for a very low number of measurements, without any major change in terms of the error. However, according to Figure 6a and Figure 6c, different choices of the threshold result in different false alarm rates. According to Figure 6d, as the number of measurements becomes around twice the sparsity level (λ ≥ 0.5), the error becomes almost negligible.
We now compare the performance of C-SBL and O-SBL with other algorithms, specifically: CLUSS-MCMC algorithm for solving clustered structure compressive sensing using Markov chain Monte Carlo method [37], orthogonal matching pursuit (OMP) algorithm [5,49], MMV focal underdetermined system solver (MFOCUSS) [17], block-sparse Bayesian learning algorithm (BSBL) [15,23], MMV sparse Bayesian learning algorithm (MSBL) [22], basis pursuit denoising algorithm for group sparsity (BPDN-group) using the spectral projected gradient for l 1 minimization (SPGL1) solver [50], the single-task version of multi-task compressive sensing algorithm (MTCS) [34,51], and PC-SBL [28,52]. In all of the algorithms, we discarded those estimated elements in the solution with an amplitude less than 0.01 from the support set. In Figure 7a-d, the results for the O-SBL and C-SBL algorithms are based on the sample mean of the collected samples.
In Figure 7a, we demonstrate the empirical results of detection rate vs. the ratio M/P.  Figure 7a shows that C-SBL provides the best performance in terms of detecting the true supports of the solution. In Figure 7b, the false alarm rate in support recovery is illustrated, where we see that for M/P < 0.35, our algorithm has a higher false alarm rate in support recovery at the cost of providing a higher detection rate within the same range for M/P. In contrast, the rates for C-SBL, O-SBL, CLUSS-MCMC, and MTCS become almost the same and have the lowest values. Figure 7c compares the performance algorithms in terms of the trade off between the detection rate and false alarm rate in support recovery, in which C-SBL and PC-SBL show almost the same performance for M/P < 0.35. However, C-SBL outperforms all the other algorithms for M/P > 0. 35. Figure 7d illustrates the comparison of NMSE between the true and estimated solution. We observe that C-SBL provides lower error among the other algorithms for a wide range of M/P.

Remark 2.
The MATLAB codes for BSBL, MSBL, and MFOCUSS were obtained from [53]. For BSBL, the noise flag was set to two (small noise). and the block-size of h = 2 was considered. For MSBL, we activated the option for learning the tuning parameter and initialized it by the true noise variance. For the MFOCUSS algorithm, the regularization parameter was set to 10 −3 . Based on some initial experiments, we decided to use the default settings for CLUSS-MCMC [54] and PCSBL [52]. The parameters of the Gamma prior on the noise variance for MTCS [51] were both set to one.
It has been very common in the literature to demonstrate the performance primarily on NMSE, so that successful recovery is reported when the NMSE becomes lower than some pre-defined value. In that sense and by referring to Figure 7d, we see that our algorithm provides the lowest error rate for a wide range of M/P. In addition, our algorithm demonstrates good performance on the ROC plot, showing high detection against the false alarm rate. Finally, in Figure 8, we show the average run-time comparison of the respective algorithms for 500 randomly-generated SMVs in the same way stated earlier. In Figure 8, the legend C-SBL(learning α) denotes the execution time of the C-SBL algorithm for the case where the algorithm learns α from (20), while in C-SBL(α = 1), we do not use (20) and instead experimentally set α = 1. According to Figure 8, for low sampling ratios, which is of more interest in CS problems, the lowest execution times belong to THE PBDN, MFOCUSS, MTCS, and CLUSS-MCMC algorithms, respectively. However, as we showed in Figure 7, these algorithms do not provide good accuracy in signal reconstruction compared to the other algorithms. Furthermore, these algorithms do not account for learning the unknown clustering pattern. Comparing the execution time of the PC-SBL, B-SBL, and C-SBL algorithms for low sampling ratios, we observe that C-SBL demonstrates reasonably low execution time. Furthermore, notice that the C-SBL and CLUSS-MCMC algorithms are the only algorithms that are implemented using the MCMC method, and based on Figure 8, we observe that they both show the same behavior with respect to the change of the sampling ratio. The reason for the higher execution time of C-SBL against CLUSS-MCMC is again due to the extra computations that C-SBL requires to account for the clustering pattern.

Performance for the MMV Problem with N = 2
We first demonstrate the performance of C-SBL in terms of detection rate and false alarm rate in support recovery as the ratio M/P and NMSE. Figure 9a displays ROC curves. Comparing the results demonstrated in Figure 6a with Figure 9a, increasing the number of columns in the solution from N = 1 to N = 2 provides considerable improvement in the support recovery. Figure 9b illustrates the detection rate of C-SBL for the MMVs (with N = 2) vs. different threshold values. Once M/P ≥ 0.4 (over 40% compression and the sparsity of 25), almost full success in support recovery is attained regardless of the selected threshold value. The difference between the detection rate and false alarm rate of C-SBL vs. the threshold is shown in Figure 9c. There is a threshold of around 0.5, where P D −P FA has a peak for λ ≤ 0.4. For λ > 0.4 in the MMV case, C-SBL reaches its highest performance almost regardless of the chosen threshold value. Therefore, we make a decision on the supports based on computing the sample mean, i.e., setting the threshold equal to 0.5. Figure 9d shows the NMSE vs. the threshold for the clustered MMV problem using the C-SBL algorithm. Finally, Figure 10a-d compares the performance of C-SBL against the other algorithms. We compare the results of C-SBL with the following algorithms: MFOCUSS [17], MSBL [22], T-MSBL [55,56], and MTCS [34]. Notice that T-MSBL is devised for correlated signals, while our model does not account for this feature. Although MTCS does not promote the clustering pattern, we use it as a baseline for our comparisons. We consider two sets of simulations. Our setup for the simulations is similar to the SMV case, as was described earlier. The only difference is in generating the true solution matrices. In the first case, we generate uncorrelated columns for the solution matrices. In the legend of Figure 10a-d, the uncorrelated cases are denoted by ρ = 0. In the second case, the columns of the solution matrix for each trial are correlated with the correlation factor of ρ = 0.85. Figure 10a illustrates the detection rate in support recovery for clustered pattern MMVs (with N = 2), in which we observe that C-SBL has the highest performance among the other algorithms for the uncorrelated case. Furthermore, we observe that C-SBL competes with T-MSBL for the correlated case. In Figure 10b, it is clear that C-SBL provides a lower false alarm rate in terms of support recovery compared to the MSBL and T-MSBL algorithms. The best performance belongs to MTCS, but it provided the lowest performance in terms of detection rate, as was shown in Figure 10a. For overall comparison, Figure 10c shows the simulation results in terms of the difference between the detection rate and false alarm rate in support recovery when varying the ratio M/P. According to Figure 10c, the overall performance of C-SBL is higher than the other algorithms. The NMSE comparisons are illustrated in Figure 10d, where we see that C-SBL provided the lowest error. According to Figure 10a-d, C-SBL is more successful than the compared algorithms in terms of both support recovery and estimating the non-zero values of the true solution. Here, we perform simulations on synthetically-generated data in the same way explained previously except setting N = 5. The burn-in and collection periods of C-SBL were set to 2000 and 1000, respectively. Figure 11 illustrates the performance comparison results for both the uncorrelated case (ρ = 0) and the correlated case (with ρ = 0.85).
According to Figure 11a, the best performance in terms of the difference between the detection rate and false alarm rate belongs to the C-SBL algorithm. The lowest performance belongs to FOCUSS, where as the sampling ratio becomes greater than 0.4, FOCUSS starts to activate more components in s. As a result, the false alarm rate increases, and this yields the smaller P D −P FA . For a sampling ratio of one, all the components of s are active, resulting in P D −P FA = 0. The performance of FOCUSS would be still acceptable if the NMSE for high sampling ratios would have became very low, meaning that the wrongly-determined supports had very low amplitudes. However, according to the error curve of FOCUSS in Figure 11b, this does not happen, meaning that MFOCUSS crashed for moderately high and high sampling ratios. For sampling ratios λ > 0.5, the best performance in terms of P D − P FA belongs to both C-SBL and MTCS. Notice that they were both able to provide almost full support recovery. However, for low sampling ratios like λ ≤ 0.4, the best support recovery belongs to C-SBL. For example, for λ = 0.2, the C-SBL provided P D − P FA of around 0.7, while the other algorithms provided values less than 0.4. This should justify the merit of the C-SBL algorithm. In summary, C-SBL demonstrates the best performance in support recovery for the uncorrelated case.

Interpretation of the results for the correlated case
For the correlated case, Figure 11a shows that C-SBL performed a little bit better than the two other compared algorithms in terms of support recovery, even though the C-SBL model does not account for the correlation that may exist across the columns of X. According to Figure 11b, MTCS, T-MSBL, and C-SBL have almost the same overall performance in terms of error.

Experiments on Real Data
In this section, we compare the performance of C-SBL against other algorithms on real data. Specifically, we consider the image reconstruction problem for the SMV case using the MNISTdataset. For the comparisons of the MMV case, the problem of blind sub-Nyquist sampling and reconstruction of multi-narrowband signals is considered.

Performance for the SMV Case (Experiments on MNIST Data)
Here, we evaluate the performance of the algorithms in reconstructing images of hand-written digits, using the well-known MNIST dataset [57]. MNIST consists of 70,000 gray-scale images of 28 × 28 hand-written digits. Experiments are conducted on a randomly-chosen set of hand-written digits from "0"-"9" from this dataset. Due to space considerations, we show only some of the results. The original images are upsampled to size 100 × 100 pixels, and the pixel values were normalized to be within [0, 1]. Then, the pixel values were subtracted from one, and those with a value of less than 0.3 were set to zero, similar to the binary pixel values in [58,59]. The threshold of 0.3 was obtained based on the average threshold of all the images using Otsu thresholding (the actual average threshold was 0.268). The corresponding matrix of pixel values of each image is then treated as the true sparse signal of interest, denoted by the true solution matrix X.
For the SMV case, we solve each column of X for each digit one at a time. The number of measurements for each column of X is set to 55 and x n ∈ R 100 , ∀n = 1, . . . , 100, i.e., we consider a compression of 55% measurements for each vector x n . We randomly generated the sensing matrix A in the same way we described earlier. Then, each column of the matrix A is normalized to have a unit norm. The hand-written images of MNIST are already naturally sparse since most pixels in these images are inactive, i.e., they have a small number of non-zero pixels. The measurements for each column of the digits are computed by y n = Ax n + e n with SNR = 25 dB, where e is a Gaussian noise accounting for the measurement noise. This setting follows some of the other recent work in this area [60][61][62][63][64]. We feed all the algorithms with the measurement vector y and the same sensing matrix A. The generated measurement noise matrix is the same for the digits. Oncex n , ∀n = 1, . . . , 100 is known, we collect the results and then stack them all together and reconstruct the digits. For the purpose of demonstrating the support recovery performance, in Figure 12, we illustrate how successful the algorithms were in finding the non-zero pixel locations. Since in the compared algorithms, except our proposed C-SBL, the models do not have the support-leaning vector s, we performed the following.
In the reconstruction, we set the estimated pixel values less than 0.3 to zero, the same way as we treated the actual images. This means that we zeroed out the brightest pixels with the normalized value of lower than the threshold 0.3 and set the non-zero survival pixel values to one. However, since the proposed C-SBL algorithm already can provide the estimates on the active locations via s, the thresholding process is not required. The first column of images in Figure 12 shows the true hand-written digits, and the other columns show the results of processing with other algorithms. We compare the performance of the C-SBL algorithm against other algorithms in the reconstruction of the images via both the support and pattern recovery. In Table 1, we evaluate the reconstruction based on the difference between the detection rate and false alarm rate (P D − P FA ) in terms of support recovery. In Table 2, the performance is evaluated based on the success in pattern recovery. For this purpose, we stack all the columns of the true matrix X for each digit into a single column. We then construct the corresponding support vector, where the index of pixels with non-zero value will be replaced by "1" in the corresponding support vector. The true measure of clumpiness for each digit will then be computed via (6). We do the same procedure for computing the estimated measure of clumpiness in the reconstructed digits and provide the results in Table 2. In Table 3, the error between the true and the reconstructed images is represented.   According to the results shown in Table 1, the best results in support recovery belong to the C-SBL algorithm. Furthermore, Table 2 shows that C-SBL was more successful in learning the underlying clustering pattern of the digits. In terms of the reconstruction error, Table 3 shows that the C-SBL and BSBL algorithms compete with each other, with some results being better for C-SBL and some being better for BSBL.

Performance for the MMV Case (Experiments on Blind Multi-Narrowband Signal Sampling and Reconstruction
In this section, we consider the problem of blind sub-Nyquist sampling and reconstruction of multi-narrowband signals. The notion of blindness here means that the frequency support is unknown, and it occupies only a small portion of a wide spectrum [65]. In the original problem, it is assumed that the number of sparsely-scattered bands and their bandwidth are known, while the carrier locations are unknown at the receiver. The sub-Nyquist sampling in [65] is performed in the modulated wideband converter (MWC) stage, which multiplies the analog signal by a bank of M ch periodic waveforms followed by low-pass filtering and then sampling the outputs uniformly at a low rate far less than the Nyquist rate. The periodic waveforms intentionally alias the spectrum such that a portion of each band appears in the baseband. The technique used here is referred to as Xampling for the compressive sensing of analog signals [24,65]. The MMV problem appears in the reconstruction stage of the Xampling framework, referred to as the continuous-to-finite (CTF) stage. In this stage, the low rate samples, obtained from the MWC stage, are fed to the CTF stage to estimate the support vector. The estimation problem here involves solving for the sparse solution of an underdetermined system of linear equations, which has the structure of the MMV problem. The active bands (spectrum slices) of the signal are reconstructed based on the estimated supports. For more detail, refer to [24,[65][66][67]. In our example, the signal of interest is a multi-band signal containing two pairs of bands, i.e., N 0 = 4, where each band is of width B = 50 MHz, and it is assumed that the Nyquist rate of the multi-band signal is as high as f Nyq = 10 GHz. The true carriers are set to f c 1 = 2.4956 MHz and f c 2 = 4.4086 MHz, which are assumed to be unknown for the simulation purposes. The number of channels are set to M ch = 70, and the energies of the bands are set to E 1 = 1 and E 2 = 2, respectively. All the other settings of the signal model and the sampling parameters are defined the same as the implementation used in [68,69], which do not appear here due to the space consideration. In the simulations, the MMV problem that all algorithms need to solve is of the form Y = AX, where P = 195, M = 70, and N = 8. The actual sparsity level is k = 8, which here we assume to be unknown. Having no information on k here means that we assume that the number of active bands in the signal is also unknown. In Figure 13, we illustrate an example of a multi-narrowband signal, its spectrum, and the signal when contaminated with noise, which is used for the simulation purposes. Furthermore, Figure 14 shows the comparison of the OMP, our proposed algorithm (C-SBL), MFOCUSS, MSBL, and MTCS in estimating the spectrum of the signal to reconstruct the signal. Notice that the OMP algorithm is fed with the actual sparsity level (k = 8), while this information is assumed unknown to the other algorithms. The results shown for the OMP serve as a baseline in our comparisons, but the OMP requires more information to solve the problem, so the comparison is not exactly fair. According to the results obtained in Figure 14, the C-SBL algorithm performed better than the alternatives in estimating the spectrum supports (closest support recovery to the OMP), which resulted in better signal reconstruction. Again, the OMP in these simulations is fed with the true sparsity level with the true support block sizes.
In Table 4, the reconstruction error in terms of NMSE (dB) is represented. According to the results, we see that the C-SBL algorithm outperformed the alternatives in reconstructing the signal. The reason that the alternative algorithms did not provide low reconstruction error is due to the fact that their estimated spectrum supports of the signal became non-sparse, as illustrated in Figure 14.

Convergence Diagnostics of the MCMC Implementation
Convergence is an important issue in order to determine the burn-in period of MCMC algorithms [37]. There is work on the convergence of iterative simulations and their inference using multiple sequences via potential scale reduction factor (PSRF) and multiple-PSRF (MPSRF) in [70,71]. For example, in [37], the convergence issue of the CLUSS-MCMC algorithm was resolved via studying the evolution of MPSRF in the collected samples for the sparse signal and its corresponding variances and the measure of PSRF for the noise variance. Following the same approach, in Figure 15, we provide some examples demonstrating the evolution of the PSRF for 20 independent chains of our proposed C-SBL algorithm. In these examples, we generated random clustered pattern sparse signals of length N = 100, a sparsity level of 25, and with the number of measurements M = 20, 50, and 90.
In Figure 15a, we demonstrate the PSRF and MPSRF for the variables and parameters of interest in our model for an example with the low sampling ratio of 0.2. According to the plot, the Gibbs sampler converges very quickly for ε, γ, and x, i.e., the convergence measure of PSRF became close to one with few iterations. However, the convergence of the distribution on s is slower. The convergence of the precision on X was the slowest. According to Gelman's discussion in [72], one may prefer to set the burn-in period based on the PSRF close to 1.2. Based on this criterion, we can set the burn-in period to approximately 2000 iterations for the example made in Figure 15a. In Figure 15b, we provide another example for the case with the moderate sampling ratio of 0.5. As can be seen in this plot, the distributions of all the variables and parameters of interest converged faster than when the sampling ratio was 0.2. For this example, a burn-in period of around 600 is satisfactory. Figure 15c illustrates another example for the high sampling ratio of 0.9. In this plot, we observe that a burn-in period of around 200 suffices. Since in Figure 7 and Figures 9-11, we wanted to show the average of the overall performance of our algorithm, we first performed the following and then set the burn-in period based on the obtained experimental results. For each sampling ratio, we generated 100 random trials for solving the SMV and MMV problems. In these trials, we assessed the average PSRF measure for all the variables and parameters of interest based on Gelman's criterion in [72]. We monitored the average number of elapsed iterations until the variations on the outcomes of the estimated s became negligible for a fair number of iterations (this is easy to monitor since the outcome of the posterior on s p is Bernoulli). This is equivalent to monitoring the trace plots, as suggested by Neal [72]. More specifically, we monitored the posterior distribution on the support learning vector s, using (5) for O-SBL and (13) for C-SBL, based on the samples obtained from MCMC implementation. In Figure 16, we illustrate some examples of support learning vector s using the C-SBL algorithm with the number of measurements M = 55. Each plot shows the learning process of s ∈ R 100 , represented by the samples drawn from (13), as a function of the number of iterations. Using the experimental results based on both the PSRF evaluation and monitoring the outcomes of s, we then set fixed burn-in periods for the simulations illustrated in Section 4. In other words, since we wanted show the average performance, we preferred not to assess the PSRF of each simulated example, but rather using a fixed experimental burn-in period. Below, we provide the details on the burn-in and the collection periods of both the C-SBL and O-SBL algorithms for the simulation results illustrated in Section 4. In simulations on the synthetic data and the MNIST for the MMVs, we set the burn-in period to 500 followed by 500 iterations for the collection period. The same settings were used for the SMV case on the MNIST. In the experiments on synthetic data for the SMV, we set N burn-in = 2000 followed by N collect = 1000 iterations for the collection period for the sampling ratios of M/P ≤ 0.4. For M/P > 0.4, we set N burn-in = 1000 and N collect = 1000. Thus, it might be the case that the burn-in period is required to be more than what we set. The effect of the need for a longer burn-in period can be observed in the results of Figures  7, 10, and 11 for low sampling ratios. The convergence diagnostic and the effect of burn-in period can also be detected in Figures 6 and 9. It should be clear from Figure 6c that for sampling ratios over 0.4, the average detection performance is almost independent of the threshold. The performance reveals that the approximated posterior distribution on s has already been stabilized. However, we see a different behavior for lower sampling ratios in Figure 6c. The posterior distribution on s is Bernoulli, and the variations on the supports in the iterative samples directly affect the performance in the support recovery. We see in Figure 6c that the average sample mean of the collected samples occurred around the threshold of 0.5. This can be interpreted as follows. The burn-in period may have been required to be larger than our setting, but the posterior distributions have been almost stabilized. Therefore, even for a lower burn-in period and sufficient iterations for the collection period, we could still extract the information required for estimating the supports via computing the sample mean of the collected samples.
Computing the MSPRF for s needed some modifications. The estimated posterior variance of s is assessed based on the mixture of all the simulated sequences divided by the average of the variances within each sequence [72]. The main issue with the MPSRF for s occurred in our simulations when computingR. In fact, this matrix became ill-conditioned, and the issue was with the fact that sequences on s were either zero or one. We dealt with this issue by adding random draws from a zero-mean Gaussian with the variance of 10 −8 to the samples of s and then measured the MPSRF.

Conclusions
The O-SBL algorithm simultaneously learns both the supports and solution-value matrix for the MMVs with the joint sparsity structure. As the main contribution of this paper, the method was then extended to account for the case where the solution also exhibits an unknown clustered sparsity pattern. For this purpose, we introduced the C-SBL algorithm, which incorporates a total variation-based prior on the supports of the solution to learn the underlying clustered pattern. Based on simulations, we observed that C-SBL provides competitive performance for both the SMV and MMVs compared to the other algorithms.
Although C-SBL provides encouraging results, the MCMC implementation is computationally expensive. In future work, we will consider alternative approaches to MCMC implementation such as the variational Bayes inference technique [40,73].

Conflicts of Interest:
The authors declare no conflict of interest.