A hierarchical Bayesian perspective on majorization-minimization for non-convex sparse regression: application to M/EEG source imaging

Majorization-minimization (MM) is a standard iterative optimization technique which consists in minimizing a sequence of convex surrogate functionals. MM approaches have been particularly successful to tackle inverse problems and statistical machine learning problems where the regularization term is a sparsity-promoting concave function. However, due to non-convexity, the solution found by MM depends on its initialization. Uniform initialization is the most natural and often employed strategy as it boils down to penalizing all coefficients equally in the first MM iteration. Yet, this arbitrary choice can lead to unsatisfactory results in severely under-determined inverse problems such as source imaging with magneto- and electro-encephalography (M/EEG). The framework of hierarchical Bayesian modeling (HBM) is an alternative approach to encode sparsity. This work shows that for certain hierarchical models, a simple alternating scheme to compute fully Bayesian maximum a posteriori (MAP) estimates leads to the exact same sequence of updates as a standard MM strategy (cf. the Adaptive Lasso). With this parallel outlined, we show how to improve upon these MM techniques by probing the multimodal posterior density using Markov Chain Monte-Carlo (MCMC) techniques. Firstly, we show that these samples can provide well-informed initializations that help MM schemes to reach better local minima. Secondly, we demonstrate how it can reveal the different modes of the posterior distribution in order to explore and quantify the inherent uncertainty and ambiguity of such ill-posed inference procedure. In the context of M/EEG, each mode corresponds to a plausible configuration of neural sources, which is crucial for data interpretation, especially in clinical contexts. Results on both simulations and real datasets show how the number or the type of sensors affect the uncertainties on the estimates.


Introduction
Over the last two decades, sparsity has emerged as a key concept to solve inverse problems such as tomographic image reconstruction, deconvolution or inpainting, but also to regularize high dimensional regression problems in the field of machine learning. There are mainly two routes to introduce sparsity in such problems.
The first route, embraced by the optimization community and frequentist statisticians, is to promote sparsity using convex optimization theory. This line of work has led to now mature theoretical guarantees (Foucart & Rauhut 2013) when using regularization functions based on 1 -norm and other convex variants (Tibshirani 1996). In particular, it has been popularized in the signal processing community under the name of compressed sensing  when combined with incoherent measurements. There are however some limitations of sparsity-promoting convex penalties based on the 1 -norm. All the features (also called regressors, atoms or sources depending on the terminology of the community) involved in the solution form what is called the support of the solution. Convex penalties can fail to identify the correct support in the presence of highly noisy data, but also in low noise setups if the forward operator (referred to as design matrix in statistics) is poorly conditioned. Convex regularizations also lead to a systematic underestimation bias in the amplitude of the coefficients (Osher et al. 2006, Chartrand 2007, Saab et al. 2008, Chzhen et al. 2017. To address these limitations of 1 -type models, reweighted schemes have been proposed , Gasso et al. 2009, Rakotomamonjy 2011, Zhang & Rao 2011, Strohmeier et al. 2016, of which the Adaptive Lasso (Zou 2006) is the most commonly used in the statistics community: starting from the Lasso estimator, which amounts to regressing with a standard 1 -norm as a regularizer (this estimator is sometimes referred to as Basis Pursuit Denoising (BPDN) (Chen et al. 1998) in signal processing), the Adaptive Lasso solves a sequence of weighted Lasso problems, where at each iteration the weights are chosen such that the strongest coefficients are less and less penalized. From the optimization point of view, such an iterative scheme can be derived from so-called majorization-minimization (MM) strategies (Lange et al. 2000, Schifano et al. 2010. The idea behind MM is to minimize the objective function by successively minimizing upper bounds that are easier to optimize. Many well-known optimization approaches can be interpreted as instances of MM, e.g., simple gradient descent or proximal algorithms (Combettes & Pesquet 2011), expectation-maximization (EM) (Dempster et al. 1977), and difference-of-convex (DC) programming techniques (Horst & Thoai 1999). More recently, re-weighted 1 -norm schemes based on MM principle have been particularly popular to handle concave, hence non-convex regularizations such as 0.5 -quasi-norms or logarithmic functions. As such, these schemes are prone to converging to a local minimum determined by the initial, uniformly weighted 1 -norm solution (i.e., the Lasso estimator) that constitutes the first iterate.
The second route to introduce sparsity formulates the regression problem in a Bayesian framework and uses hierarchical Bayesian models (HBM) (MacKay 2003) for the inference. The common way to formulate HBMs is to consider the variance parameters of Gaussian prior models as additional random variables which have to be estimated from the data as well. Their prior distributions are referred to as hyper-priors. Plausible solutions to the regression problem that both fit data and the a priori assumption of sparsity are explicitly characterized as multiple distinct modes of the posterior distribution. This characterization is the Bayesian analogue to local minima in variational regression approaches when working with non-convex functionals. Different strategies to infer a point estimate for the parameters of interest from the a posteriori distribution then lead to different algorithmic frameworks, for instance Variational Bayesian approaches (MacKay 2003, Jordan et al. 1999, Sato et al. 2004, Friston et al. 2008, Shervashidze & Bach 2015, Sparse Bayesian Learning (SBL) approaches (also referred to as type-I or type-II maximum likelihood estimates) (Tipping 2001, Wipf & Rao 2004, Wipf & Nagarajan 2009, Zhang & Rao 2011) and fully-Bayesian strategies (Calvetti et al. 2009. In this work, we focus on the latter one for a non-standard type of HBM examined in (Lucka 2014) that combines a non-Gaussian prior with an 1 -type energy function with a specific Gamma hyper-prior. For this HBM, a simple alternating scheme to compute full maximum a posteriori (MAP) estimates leads to exactly the same sequence of problems solved by MM applied to 1/2 -type regularizations. With this observation made, it is natural to revisit and improve these MM schemes by leveraging the ability of the Bayesian framework to explore the modes of the posterior distribution by Markov chain Monte-Carlo (MCMC) schemes (Robert & Casella 2005, Kaipio & Somersalo 2005. This can not only mitigate the aforementioned initialization-dependence of MM, but more importantly, it can offer insights into the structure and importance of potentially multiple plausible sparse solutions. Yet, the benefit comes at the cost of additional computational efforts. Magnetoencephalography (MEG) and electroencephalography (EEG). M/EEG are technologies that allow to measure the electromagnetic fields produced by active neurons in a non-invasive way. Localization of foci of neural activations from M/EEG recordings is a high impact problem both for cognitive neuroscience and clinical neuroscience, with applications in pathologies such as epilepsy, sleep research or neurodegenerative disorders. Despite the linearity of the forward problem, this inverse problem is particularly challenging as the forward operator is both under-determined and strongly ill-conditioned. As such, both non-convex optimization strategies with reweighted schemes (Strohmeier et al. 2016) and hierarchical Bayesian approaches have been proposed (Sato et al. 2004, Calvetti et al. 2009, Wipf & Nagarajan 2009, Sorrentino et al. 2009 for M/EEG source localization. For this reason, it is an ideal application for our examinations. The manuscript is organized as follows: first, we present a unified perspective on both routes to sparsity, i.e., reweighted 1 MM schemes and specific HBMs. We show that a particular optimization-based inference strategy recovers the MM algorithm. We then describe a HBM inference strategy based upon an MCMC sampling and show on simulated and experimental M/EEG datasets how these stochastic MCMCbased techniques can not only help to improve upon deterministic approaches but also help to reveal multiple plausible solutions to the inverse problem. This analysis leads to an uncertainty quantification (UQ) of the support recovery of non-convex sparse regression problems that provides very useful complementary information, in particular for very ill-conditioned and under-determined applications like M/EEG source localization.

Notation
We consider here the following linear model, known in machine learning as multitask regression and in signal processing as the multiple measurement vector (MMV) model (Cotter et al. 2005): where M ∈ R m×t . In machine learning m corresponds to the number of samples and t to the number of tasks, and in our application m corresponds to the number of sensors and t to the number of measurements over time. The matrix G ∈ R m×q is the design matrix, a known instantaneous mixing matrix also referred to as the forward, gain or system matrix. It relates the unknown coefficients X ∈ R q×t , which correspond to amplitudes of neural sources, to the measurements M. In our application q = dn where n is the number of source locations and d is the number of oriented sources per location (d = 1 or d = 3 corresponding to estimating a scalar field or a 3D vector field of sources). The matrix E models the measurement noise, which is assumed to be an additive, white Gaussian noise (AWGN ). This is a reasonable assumption after a proper spatial whitening of the data using an estimate of the noise covariance (Engemann & Gramfort 2015). By X (i,j) we refer to the entry in the i-th row and j-th column in X, while X (i,:) ∈ R t and X (:,j) ∈ R q refer to the complete i-th row and j-th column, respectively. In addition, we denote by X [i] , i = 1, . . . , n the d × t sub-matrix of X corresponding to the i-th group: for d = 1, this coincides with :) ] . Note that thereby, the group size is dt. Negative indices are used to exclude certain entries, rows, columns or groups, i.e., , X (:,−j) ∈ R q×(t−1) refers to the matrix obtained by deleting the j-th column of X. Furthermore, let I m denote the identity matrix of size m ∈ N. For any matrix A ∈ R n×m , the Frobenius norm is given by A 2 F = i,j A 2 (i,j) .

Methods
We start this section by recalling how majorization-minimization works when addressing variational formulations with concave, non-convex, regularization. It is followed by an introduction to hierarchical Bayesian models with Gamma hyper-priors. Then, we explain how these seemingly different approaches can lead to the exact same optimization algorithm. From this, we detail how different Bayesian inference strategies using MCMC sampling can more precisely explore the landscape of the posterior distribution of the HBM model and provide multiple possible solutions to the sparse regression problem compared to MM.

Majorization-Minimization
Majorization-Minimization (MM) strategies consist in replacing a difficult optimization problem with a series of easier ones that are obtained by upper bounding the objective function, often by a convex majorant. In the context of inverse problems or high-dimensional statistics using sparsity constraints, MM has been successfully applied to address non-convex regularization terms. An example is the regression model with 2,p -quasi-norms regularization over the groups when 0 < p < 1: The desired estimateX is defined as one of potentially multiple minimizers of where λ > 0 is the regularization parameter balancing the data fit and the penalty term. One possible MM approach to solve Eq. (2) with p = 1/2 would consist in minimizing a sequence of non-smooth convex surrogate functions where the non-convex regularization is replaced by a weighted 2,1 norm (Strohmeier et al. 2016). In each iteration, the weights are derived from the current estimate of X.
Due to the concavity of the non-decreasing function X → X F , it is upper bounded by its tangent and a first order Taylor expansion at the current estimate X [i] provides an upper bound that can be used to construct the non-smooth convex surrogate problem. By solving this sequence of surrogate problems, the value of the non-convex objective function is guaranteed to decrease. However, due to the non-convexity, only convergence towards a local minimum can be guaranteed. For the problem in Eq. (2) with p = 1/2, the k-th iteration of the MM scheme reads: [i] F , sources with high amplitudes in one iteration will be less penalized in the next iteration and can better explain the data M. Sources for which X (k) [i] F = 0 at a certain iteration k are effectively pruned from the model for all following iterations. Using MM therefore leads to a solution that explains the data with fewer active locations i compared to a standard 2,1 norm regularized solution. Note that a default initialization consists in setting w To exploit existing fast solvers for the 2,1 regularized problems (Strohmeier et al. 2016, Ndiaye et al. 2015, we reformulate the weighted subproblem and apply the weights by scaling the matrix G with a diagonal matrix W (k) ∈ R dn×dn given by: where w (k) ∈ R n , 1 d ∈ R d is a vector of ones and ⊗ is the Kronecker product. Defining G (k) = GW (k−1) , the reformulated problem reads: After convergence, we reapply the scaling toX to obtainX: The reformulation through Eq. (5) and (6) avoids any division by zero when X (k−1) = 0. The above procedure, which matches the strategy of the Adaptive Lasso estimator (Zou 2006), is expressed as pseudo-code in Algorithm 1. More technical details can be found in (Strohmeier et al. 2016, Algorithm 3).

Hierarchical Bayesian Modeling
In this section, we formulate the inference problem given by Eq.
(1) and the regularization strategy with 2,p -quasi-norms from a Bayesian perspective (Kaipio & Somersalo 2005, Lucka 2014): the Bayesian approach incorporates prior beliefs about the model parameters in terms of probability distributions. Under the AWGN assumption the likelihood of the model is given by: Algorithm 1: 2,p MM algorithm with p = 1/2 (Adaptive Lasso) (2) we can construct the 2,p group prior as: which leads to the following posterior probability density using Bayes rule: To extend Eq. (8) to a hierarchical prior model (MacKay 2003), we replace the scalar λ by a vector of hyper-parameters γ ∈ R n + and for any p ≥ 1 we write the conditional 2,p prior as: where the logarithmic term accounts for the terms of the normalization that depend on γ (Lucka 2014). A common choice for a hyper-prior on each γ i is given by a Gamma distribution (MacKay 2003, Kaipio & Somersalo 2005, Calvetti et al. 2009) with shape and scale parameters α and β: Then, the full posterior over both X and γ becomes: The question of how to best derive parameter estimates, in particular how to treat the two different types of parameters X and γ, distinguishes different HBM-based inference strategies. Variational Bayesian approaches (MacKay 2003, Jordan et al. 1999, Sato et al. 2004, Friston et al. 2008, Shervashidze & Bach 2015 and Sparse Bayesian Learning (Tipping 2001, Wipf & Rao 2004, Wipf & Nagarajan 2009, Zhang & Rao 2011) approaches rely on approximating or marginalizing the full, joint posterior distribution (12). In contrast, fully-Bayesian strategies (Calvetti et al. 2009) work with it directly. The most popular one is the full maximum-aposteriori (full-MAP ) estimate which is defined as A common strategy to compute it is to minimize the negative log posterior − log p post (X, γ|M) by alternating minimization over X and γ (known as block coordinate descent in optimization): Other fully-Bayesian estimates are defined as integrals of functions of X and γ with respect to the posterior distribution, e.g., first or second moment estimates. To compute these high dimensional integrals efficiently, only Markov chain Monte Carlo (MCMC) methods that draw correlated samples from the posterior distribution can be used (Robert & Casella 2005, Kaipio & Somersalo 2005. A commonly used MCMC scheme for HBM is given by blocked Gibbs sampling which alternates as: In this study, however, we are not interested in sampling the posterior distribution to compute the integral-based estimators but we rather want to explore the different modes of this multi-modal distribution, each of which corresponds to parameters that are both sparse and likely to explain the data. One can notice similar structures in (14)- (15) and (16)- (17): In each step, we make use of the conditional structure of the posterior: for γ fixed, we have to solve one qtdimensional 2,p optimization/sampling problem, while for X fixed, we have to solve n 1-dimensional optimization/sampling problems. We will describe these two steps in more detail in the next two sections.

HBM Optimization
The optimization problem defined in Eq. (14) reduces to an 2,p -norm regularized regression problem that can be solved as described in Section 2.1. To solve Eq. (15), we compute the first order optimality condition for each i: For α dt/p + 1, the problem in Eq. (15) is convex, and the positive root of Eq. (18) is given by: Note that similar rules to update the noise level were considered in the Bayesian Lasso (Park & Casella 2008, Kyung et al. 2010) and the Scaled Lasso (see for instance (Städler et al. 2010, Dalalyan 2012. A difference though is that the update we performed here is on the penalty term, whereas in the mentioned references, it was rather performed on the data-fitting term. If we furthermore choose α = dt/p + 1, then ν = 0 and most terms disappear; Eq. (14) and (15) hence read: which can be combined to the fixed point iteration: If we compare Eq. (22) with Eq. (3) we see that we re-derived the MM algorithm for p = 1 as an alternating optimization scheme to compute the full-MAP estimate for a specific HBM, namely using a conditional 2,1 group prior and a Gamma hyper-prior with α = dt + 1 and β = 4/λ 2 . Using w (0) i := 1 in the MM scheme corresponds to starting with γ (0) i := 1/λ = √ β/2. From previous work (Strohmeier et al. 2016) we know that due to the non-convexity, a good initialization of the weights w (0) i in the MM algorithm is crucial for its performance, but aside from uniform initialization, only heuristic initialization strategies were used, e.g., using the same re-weighting as in the sLORETA method (Pascual-Marqui 2002). In this work, we leverage the reinterpretation of the MM algorithm through the HBM framework to obtain multiple initializations in a systematic fashion, namely as samples drawn from the full posterior. This way, we can not only reach better local minima but more importantly, we can identify and characterize multiple possible sparse solutions. Such plausible solutions to the sparse regression problem in Eq. (1) are the modes of the posterior distribution (12) with different relative probability masses.

HBM Sampling
As outlined in Eq. (16) and (17) in Section 2.2, we sample the full posterior p post (X, γ|M) by blocked Gibbs sampling, i.e., we alternate between sampling the conditional distributions p post (X|M, γ (k−1) ) and p post (γ|M, X (k) ). The conditional p post (X|M, γ (k−1) ) is a high dimensional distribution composed of a Gaussian likelihood and an 2,p prior, where our main interest here is p = 1. It was demonstrated in (Lucka 2012) that single component Gibbs sampling (SC Gibbs) is an efficient MCMC technique to sample such distributions. For the specific 2,p priors used here, slice sampling can be used to perform the sub-steps in SC Gibbs sampling, namely the sampling of the one-dimensional single-component conditional densities. The resulting Slice-Within-Gibbs sampler was examined in (Lucka 2016). For completeness, the details of the implementation are given in Appendix B. The conditional p post (γ|M, X (k) ) factorizes over groups i: Algorithm 2: Block Gibbs Sampling scheme For the case of α = dt/p + 1, which is our main interest due to its connection to MM revealed in the previous section, Eq. (23) reduces to: which can be sampled with a simple accept-reject algorithm as described in Appendix B. The complete procedure is described in Algorithm 2. Therein, K 0 refers to the burn-in size, i.e., the initial samples that are discarded, K to the sample size of the blocked Gibbs sampler and K SC , K SS to the sample sizes of the SC Gibbs and the slice sampler that carry out the sampling in the sub-steps.

Results
We now examine the benefits of our re-interpretation of the MM algorithm described in Section 2.1 as a specific way to compute a full-MAP estimate for a specific HBM as described in Sections 2.2 and 2.3. In particular, we investigate how using MCMC sampling of the posterior distribution as described in Section 2.4 can help getting better initializations for the optimization algorithm. We first present results for a simulated MEG dataset and then for two experimental M/EEG datasets.

Simulation study
We generated a realistic simulation based on a free-orientation (d = 3) source model with n = 7498 cortical locations and m = 306 MEG sensors. Two of these locations were selected to be active, one in each hemisphere. One of the sources had a deep ventral location in the inferior occipital gyrus (Fig. 1-c), and the second one had a more superficial location in the motor cortex ( Fig. 1-a). Their corresponding waveforms are shown in Fig. 1-b. When passed to the solvers, they are cropped between 40 to 180 ms to keep only the two peaks. This leads to t = 43 time samples. The aim of the simulation is to answer two separate questions. First, we want to know whether we are able to find better source estimates using MCMC-derived initializations than with the uniformly initialized MM Algorithm 1. For this, we first run the MM Algorithm 1 using a uniform initialization, i.e., w 2 F is the smallest regularization value for which no source is found as active using an 2,1 regularization (Ndiaye et al. 2015, Strohmeier et al. 2016). As described above, this corresponds to computing a full-MAP estimate for the HBM with p = 1, α = dt + 1, β = 4/λ 2 using the alternation scheme initialized with γ (0) i = 1/λ, ∀i = 1 ∈ [n]. Then, we sampled the corresponding posterior distribution given in Eq. (12) using Algorithm 2 with K 0 = 300, K = 900, K SC = K SS = 1. From each γ (k) of the K = 900 obtained γ samples, we construct an initialization W (0) for the MM Algorithm 1 by setting w  5)) obtained this way. The vertical black bar shows the value of the objective function of the uniformly initialized MM solver and we can see that some initialization indeed lead to source estimates with a lower objective value. Fig. 2-a and Fig. 2-b show the locations of the estimated sources resulting from uniform and best MCMC-based initialization. For the artificial source in Fig. 2-a, both results find the exact location, so they are superposed. For the deeper source in Fig. 2-b, neither result finds the exact position, but the MCMC-based initialization is closer. This means that the result did not only improve from an optimization point of view, but also judged by the quality criteria of the given application. Finding the correct support in a sparse under-determined regression problem like (1) is inherently of combinatorial complexity. In our two approaches, this is reflected in the non-convexity of the objective function (2) and the multi-modality of the joint posterior distribution (12), respectively. The second question we want to investigate is whether the methods we developed here can reveal or quantify some of the ambiguity and uncertainty of this sparse support identification problem. Traditional uncertainty quantification (UQ) measures such as variance estimates of X or γ fail to do so as they cannot capture the multi-modality of the posterior in a satisfactory way. In addition, no sample X (k) is exactly sparse: as the posterior distribution is a continuous probability density, the probability of event {X (k) [i] = 0} is zero, which means that the whole support of X (k) is active with probability 1. Even a thresholded average of the support of X (k) will only reveal the average probability of a location being part of the support. In source analysis, it is arguably more interesting which networks of sources from different brain areas have most likely produced a given data set, a question left open by these measures. Here, we propose to tackle this question in a different way. Our procedure of initializing an MM iteration with a sample from the posterior distribution yields different local minima of Eq. (2), i.e., approximate solutions to Eq. (1) that fulfill our a priori knowledge of a sparse support, but it also yields the relative frequencies with which these minima are found by the MM algorithm. If we assume that the division of R q×t into attractors of the MM algorithm roughly overlaps with the division of R q×t into modes of the marginalized posterior over X within the HBM framework, this relative frequency corresponds to the relative volume of the local minima of Eq. (2). The latter is a better measure to compare different local minima than their depth (a local minimum that is deep but thin corresponds to an unstable source estimate). While a mathematically more profound and detailed analysis of this heuristic is left for future work, we examine here if this approach reacts to changes in the measurement design in the way we would expect. To do so, we switch from using all 306 MEG sensors to using only 204 gradiometers or each other gradiometer (102 sensors). By reducing the number of sensors we increase the under-determinedness of the problem and the intuition is that it should lead to more variability among the plausible sparse solutions. The graphical analysis presented and described in Fig. 3 and Fig. 4 confirms this. A first observation is the superficial source in the premotor cortex was correctly identified as part of the support of every local minima when using the full 306 MEG sensors. It was however sometimes miss-localized when reducing the number of sensors (Fig. 3). A second observation is that the spatial spread of these miss-localizations is smaller for this superficial source than it is for the deep source. This deep source in the ventral cortex is more difficult to find even with all sensors. Indeed, none of the 100 best initialization perfectly localized the deep simulated source. In general, we can clearly see how the ambiguity increases when decreasing the number of sensors, and how the distribution of source networks gets more fuzzy. However, our analysis also provides useful local measures of these phenomena.

Experimental MEG data
We now repeat our analysis with two experimental open datasets. The first one is a recording of auditory evoked fields (MNE sample dataset (Gramfort et al. 2013)). The second one contains visual evoked fields (visual condition of MNE sample dataset) for all 306 MEG 204 gradiometers 102 gradiometers Figure 3: Source network analysis for simulated data: for a clearer presentation, the set of 900 initializations was thinned to the 100 that gave the lowest objective function (Eq. (2)). The first row of sub-figures displays the support of these best local minima in the following way: each position in the circle represents a source location that was part of the support of at least one minima for one sensor configuration. The black bar attached to each position corresponds to the relative frequency with which this source location appeared as part of the support. Two positions are connected by a line if they were simultaneously part of the support and the color of this line corresponds to the relative frequency with which this happened. Note that the background of the circle is white, but densely covered by purple lines indicating rare connections. The positions are placed left or right, depending on which hemisphere they belong to. For symmetry, for each active source location, its counterpart on the other hemisphere was included in the graphic as well. In addition, the positions are grouped and colored based on a parcellation of the brain into anatomical regions (taken from an atlas). The second row of sub figures shows these regions in the brain and the simulated sources.
which source localization is a more difficult task due to the proximity between neural sources. The true nature of the underlying source network is also less clear for this second dataset. Figure 5 shows the equivalent to Fig. 2 for both datasets. Again, we see that lower objective function value can be obtained using MCMC-based initializations. The auditory sample dataset is commonly assumed to be generated by two bilateral focal sources around the auditory cortices in the superior temporal gyrus of the temporal lobe. Due to the superficial nature of these sources and their large distance, estimation of their position is regarded as a relatively simple task. Indeed, the histogram shows that using MCMC-based initializations does not help a lot to reduce the objective function compared to a uniformly initialized MM solution. In the case of the visual dataset, where several closed-by sources are active, the difference is however quite drastic. The majority of the MCMC-based initializations lead to lower values of the objective function. Looking at the source distribution plots on the brain for both datasets, one can also observe more complex source configurations for the visual data. Next, we repeat the graphical source network analysis from Fig. 3 Figure 4: The support of the MM results based upon 900 MCMC-based initializations were extracted to build an uncertainty map. The relative frequencies with which each source location was part of the support was computed and plotted on the brain surface together with the two simulated sources (green dots). Each column corresponds to the results for each of the three sensor setups examined. Less the number of sensors and/or more the source is deep, more uncertainty in the brain map. Figure 6 shows the results for the auditory dataset and three sensor configurations: all 364 EEG + MEG sensors, all 306 MEG sensors or each other sensor resulting in 182 EEG + MEG sensors. One can see how adding EEG to MEG sensors reduces the ambiguity of the regression problem. The plots show less but more prominent modes, i.e., the posterior mass is concentrated on fewer stable source configurations. We also see that the locations of the most prominent modes shift This is consistent with results of other studies on EEG-MEG combination (Molins et al. 2008, Lucka 2014, Aydin et al. 2014 as EEG is sensitive to some sources that MEG is almost blind to, e.g., sources with a strong radial component. If we subsample the EEG+MEG sensors by only using every other location, the ambiguity and spatial spread of the recovered support increases. One can see that there is more activity in the dark green label, which corresponds to a brain area commonly not associated with auditory responses.The connections between source locations show that none of the modes found really stands out, i.e., is found much more often compared to the others. Most of the connections do not occur more than 200 times within the 900 samples, so they are part of the purple background of low frequency connections in the plots. Fig. 7 shows the same results for the more complex visual dataset. Compared to the auditory dataset, we see that even with all sensors, the ambiguity of the regression problem seems to be a lot higher compared to the auditory dataset: we see that the posterior mass is distributed among many more source configurations. For the other two sensor configurations, we see similar effects as in the auditory data set. Nevertheless, it can be noticed that the large majority of identified sources with all MCMC initializations are on the right hemisphere. This is consistent with the known functional organisation of the visual cortex. Indeed, in this experimental condition the subject was presented with checker board flashes on the left visual hemifield which is known to primarily project onto the right hemisphere of the cortex.

Discussion
Scientific literatures relying either on frequentist or on Bayesian statistical inference often coexist in many fields ranging from machine learning, inverse problems, signal processing to computational biology. In this work, we started from an underdetermined, ill-conditioned MMV / multi-task regression problem and examined two seemingly unrelated approaches -MM as an optimization technique for tackling non-convex optimization problems arising in frequentist regression, and HBM as a Bayesian prior modeling framework. We showed that one obtains the same algorithms, all 364 EEG+MEG all 306 MEG 182 MEG+EEG Figure 6: Source network analysis for auditory data. The figures are constructed in the same way as described in Fig. 3 Figure 7: Source network analysis for visual data. The figures are constructed in the same way as described in Fig. 3 except that all 900 MCMC initializations are displayed. and therefore the same solutions, when considering some specific choices of models, parameters and inference strategies. In particular the parallel was established between the 2,1/2 -norm regularized regression by MM and the full-MAP estimation for 2,1 hierarchical priors with specific Gamma hyper-priors. We further showed that this conceptual parallel can be exploited to improve the MM solution by providing wellinformed algorithmic initializations. For this, we first constructed a multi-layered Gibbs sampler for the joint posterior density of our HBM. Each sample is then used to initialize the MM step done with a state-of-the-art convex solver using block coordinate descent techniques and acceleration strategies based on active sets. The sampler used has also an efficient sub-sampler for 2,1 priors at its core. Despite the multi-modality of the posterior, the MCMC scheme is able to jump rapidely between the different attractors of the MM scheme. Indeed, using each sample as an initialization to the MM computation, one ends up in many different local minima (cf. Fig 5, Fig 7). Therefore, this procedure allows us to reveal and explore different plausible source configurations in more details. Based on this observation, we showcased how one can use the chain of local minima found by MCMC-initialized MM to analyze the variability of the different sparse solutions. Note that this is different from traditional and generic Bayesian uncertainty quantification techniques that use for example covariance estimates or credible sets derived from posterior samples (Szabó et al. 2015). It is also different from methods developed specifically for parametric M/EEG source localization based on dipole fitting (Fuchs et al. 2004, Darvas et al. 2005. These latter approaches cannot easily be transferred to sparse, non-parametric approaches. Using our developed techniques on simulations and actual data, one could observe that uncertainty in M/EEG is location specific and also source configuration specific. This is of course well-known by experts in this field, but here we provide a computational approach to visualize it and quantify it. This is an important incentive to develop such automated, data-dependent methods to quantify uncertainties in the context of M/EEG source imaging. In more conventional imaging methods such as computer tomography (CT) or magnetic resonance imaging (MRI), the signal originates from weak tissue interaction with strong external fields and the forward operator G depends almost exclusively on the physical properties of the scanner. In this situation, uncertainty is usually distributed in a smooth, well-known way over the image domain. Artifacts as well as real anatomical features are also easy to distinguish for a trained radiologist. The situation for M/EEG is very different. The weak signals originate from endogenous activity, and they are very dependent on dataset specific factors such as source orientation, location and attenuation which all depend on the geometry of the head of the analyzed subject. That is also why the forward matrix G needs to be constructed for each individual patient, after fixing the electrical properties of the head issues, which if wrong, increases the uncertainties. When considering real data, the source to recover is often poorly understood, especially when it comes to pathological brain activity such as ictal or inter-ictal epileptic activity. In such a situation, providing a single source configuration as a result, together with an ad-hoc uncertainty quantification based on previous studies or acquired expertise, might not be an optimal use of the M/EEG data. Instead, providing multiple hypotheses together, along with a quantification of their uncertainty, can be more useful. Indeed for applications such as pre-surgical epilepsy diagnosis, where M/EEG recordings are one of several diagnostic modalities, each candidate source configuration can provide some evidence for or against a diagnostic hypothesis that could lead to a surgery decision. We therefore believe that extending the first steps we took here towards developing a consistent framework for interpreting and quantifying the multitude of potential results of sparse M/EEG source reconstruction approaches can have a significant impact on clinical settings.
Algorithm 4: Accept-Reject Algorithm for Hyperparameter Update input : c 0, β > 0 Setx = √ βc. Set logp = −c/x −x/β. Setx = βc/x +x. Set log G x x = log β −x/β. Set log G x<x = logp + logx. Set log G tot = log G x x + log (1 + exp (G x<x − G x x )). while true do Draw u, v, w uniform from (0, 1). if log v + log G tot < log G x x then Set W = log w −x/β. Propose x = −βW : if β log(u) < c/W then return x (acceptance) else Propose x = wx: if log u + logp < −c/x − x/β then return x (acceptance) wherep = max x p(x) is the maximal probability attained atx = arg max x p(x) = √ βc andx = βc/x +x is the solution to exp (−x/β) =p (cf. Figure B1). Sampling from (B.2) is then straight-forward using v, w ∼ U [0,1] : If one computes then v < G x x /(G x x + G x<x ) determines that we are in the tail, x >x, where we can use a simple inverse cumulative distribution method to draw a proposal from g(x) using w. If v determines that we are in x x, then x = wx is the proposal. For numerical precision, we only compute logarithms of probabilities and use that for a > 0, b 0: log (a + b) = log a + log (1 + exp (b − a)) . (B.4) The whole sampling scheme is shown in Algorithm 4. We found the scheme to be efficient enough for all of our computations, i.e., the chosen g(x) is close enough to p(x) to result in an accepted sample after a few trails. If this would become a problem, it would be easy to adaptively improve the dominating density. Figure B1: Sketch of the quantities used in the accept-reject sampling Algorithm 4.