Information Assisted Dictionary Learning for fMRI Data Analysis

In this paper, the task-related fMRI problem is treated in its matrix factorization form, focusing on the Dictionary Learning (DL) approach. The proposed method allows the incorporation of a priori knowledge associated with both the experimental design and available brain atlases. Moreover, it can cope efficiently with uncertainties in the modeling of the hemodynamic response function. In addition, the method bypasses one of the major drawbacks of DL methods, namely, the selection of the sparsity-related regularization parameters. Under the proposed formulation, the associated regularization parameters bear a direct relation to the number of activated voxels in each source's spatial map. This natural interpretation facilitates the fine-tuning of the related parameters and allows for exploiting external information from brain atlases. The proposed method is evaluated against several other popular techniques, including the classical General Linear Model (GLM). The obtained performance gains are quantitatively demonstrated via a novel, realistic synthetic fMRI dataset, as well as real data from a challenging experimental design.


Introduction
In order to perform several actions/tasks, the human brain relies on the simultaneous activation of many Functional Brain Networks (FBNs), which are engaged in proper interaction to execute the tasks effectively. Such networks, potentially distributed over the whole brain, are defined as segregated regions exhibiting high functional connectivity. The latter is quantified via the underlying correlations among the associated activation/deactivation time patterns, which are referred to as time courses [1]. Examples of brain networks are the visual, somatosensory, neuronal default mode, dorsal attention, and executive control networks. Functional Magnetic Resonance Imaging (fMRI) is the dominant data acquisition technique for the detection and study of FBNs [2]. fMRI measures the Blood Oxygenation Level-Dependent (BOLD) contrast [3], which constitutes the evoked hemodynamic response of the brain to the corresponding neuronal activity. This process can be modeled as a convolution between the actual neuronal activation and a person-dependent impulse response function, called the Hemodynamic Response Function (HRF). fMRI captures 3D images with a typical resolution of 64 × 64 × 32 voxels per image, whereas a sequence of 200 to 300 images is acquired per session (usually, one image every one or two seconds). The obtained fMRI measurements, associated with each voxel, comprise a mixture of the time courses corresponding to those FBNs which are active at the specific voxel. Moreover, beyond the brain-induced sources, additional machine-induced interfering sources are also present and contribute to the measured mixture. The ultimate goal of fMRI analysis is to detect the set of activated voxels, referred to as the spatial map, in which each brain source of interest manifests itself, thereby revealing the corresponding FBN.
In the case of block- or event-related experimental designs, i.e., when the subject is presented with a fixed number of pre-selected stimuli, which are referred to as conditions, the shape of the time courses that correspond to the conditions of the experimental design is estimated as the convolution of the conditions with the canonical Hemodynamic Response Function (cHRF) [4]. Hereafter, such time courses are referred to as task-related time courses.
A prominent approach to fMRI data analysis is via Blind Source Separation (BSS), which is usually performed via appropriate matrix factorization schemes [5]. Independent Component Analysis (ICA) [6], [7], [8] and Dictionary Learning (DL) are the most popular paths. A drawback of ICA is the independence assumption, which can be violated in fMRI, especially in cases of high spatial overlap [9], [10], [11]. Unlike the independence hypothesis, DL relies on sparsity, which is a reasonable assumption concerning brain activity [12], [13], [14].
However, DL approaches are not without shortcomings. The tuning of the associated regularization parameters is not easy in practice and is usually performed via cross-validation techniques; however, in real experiments, this is not possible due to the lack of ground truth data [5,15,16,17,18]. Therefore, the only way to proceed with the parameters' fine-tuning is via visual inspection of the results, a process that requires the active judgment of the user; this can be inconsistent and susceptible to errors that may hamper the adoption of DL approaches in practice.
Alternatively, another candidate BSS family for fMRI data analysis is the Non-negative Matrix Factorization (NMF) approach [19,20,21,22]. Unlike the previous techniques, the NMF methods impose a non-negativity constraint on the matrix factorization. Nevertheless, the non-negativity constraint may not be valid in practice [23]. In the fMRI framework, the aim of the non-negativity constraint conceptually consists of eliminating the negative contribution of the BOLD signal response, leaving only the positive activations [19], which potentially weakens the performance of the NMF methods. Furthermore, NMF algorithms often require the tuning of several regularization parameters, sharing the same limitations as standard DL techniques.
Conventional analysis of fMRI data relies on approaches built around the General Linear Model (GLM), which assumes the prior availability of the task-related time courses [24]. This approach suffers from a critical limitation: it assumes that the HRF is known and fixed, whereas, in reality, the HRF varies significantly across subjects [25], [26]. In contrast, BSS methods do not need to make any assumption regarding the HRF. Moreover, they inherently model interfering artifacts, such as machine-induced artifacts, uncorrected head-motion residuals or other unmodeled physiological signals that may obscure the brain activity of interest. In addition, they can discover other brain-induced sources beyond the task-related ones.
Despite their advantages, BSS methods share a major drawback compared to GLM: when two or more task-related sources manifest themselves in highly overlapping brain regions, ICA, to a larger extent, and DL, to a smaller one, can fail to discriminate them. This situation is typical in most experimental designs of interest, in which the different experimental conditions can activate the same major functional brain network, e.g., the auditory, the visual, the somatomotor, etc. In contrast, GLM-based methods offer the option of studying the contrast among different conditions, aiming at detecting brain areas that are activated due to a specific task.
In an attempt to overcome the aforementioned fundamental drawbacks of the BSS methods against GLM, efforts have been made to develop new approaches [27,28]. In the ICA framework, the most relevant is to impose task-related information. Collectively, such methods are referred to as constrained ICA [29,30,31,32,33]. Although constrained ICA methods often lead to enhanced performance compared with their fully blind counterparts [34], they suffer from a critical limitation: the embedded constraint, e.g., the imposed task-related time courses, must not violate the independence assumption. This requirement poses stringent constraints either on the total number of allowable time sequences, e.g., [35], or on the nature of the imposed time courses, which need to be independent of each other [36,37,32,33]. Both restrictions heavily limit the application of constrained ICA to fMRI, since the most common case is to have experimental designs that comprise more than two BOLD sequences. Moreover, these often correspond to significantly overlapping regions and, therefore, to highly statistically dependent sources. Furthermore, in contrast to many unconstrained ICA algorithms, which require a reduced number of relatively easy-to-tune parameters, all the constrained ICA algorithms require extensive regularization parameter fine-tuning [33], following cross-validation arguments. Even the most recent constrained ICA technique, referred to as CSTICA [33], involves three regularization parameters and, as is pointed out by the authors, the algorithm still needs further improvement to "enable these parameters not to be determined by the experiments". Beyond constrained ICA, there are also NMF algorithms that allow the incorporation of external information [21,22]; yet, they suffer from similar drawbacks that limit their applicability in practice.
Recently, a DL method called Supervised Dictionary Learning (SDL) [38] was introduced, which allows the incorporation of external information from the task-related time courses in a rationale similar to that of GLM. As a result, SDL benefits greatly in the case of highly overlapping spatial maps and leads to performance similar to that of GLM. However, SDL inherits two major drawbacks from GLM: a) it builds upon the cHRF, which is fixed and inevitably different from the true one, and b) it adopts a regularized formulation of DL, which carries the difficulties associated with the tuning of the related parameter. In Section 4.4, Table II, we provide a thorough comparison among all competitive approaches and their characteristics, which are of interest in the fMRI case.
In this paper, a novel DL formulation, referred to as Information Assisted Dictionary Learning (IADL), is proposed, which, among other merits, alleviates the two aforementioned critical disadvantages of SDL as well as those related to constrained ICA approaches. More specifically:
• A new semi-blind DL approach is proposed that incorporates HRF-related information. However, this information is provided in a relaxed way that allows flexibility around the cHRF and, implicitly, it can accommodate related inaccuracies and variations.
• A new sparsity constraint is adopted, which incorporates sparsity in a systematic way and admits a physical interpretation related to the relative size of the areas activated by the underlying FBNs. This information is readily available in brain atlases and naturally complies with the FBNs' segregated nature.
• The proposed sparsity constraint also offers the flexibility of dealing with dense sources, that is, sources that are not sparse. Indeed, in real fMRI, some of the sources, which are usually related to machine artifacts, can be dense.
• A highly realistic synthetic dataset is constructed, allowing us to conduct a thorough performance evaluation of the new method against state-of-the-art ICA- and DL-based approaches.

Notation
A lower-case letter, x, denotes a scalar; a bold capital letter, X, denotes a matrix; and a bold lower-case letter, x, denotes a vector, with its i-th component denoted as x_i. The i-th row and the i-th column of a matrix X ∈ R^{M×N} are represented as x^i ∈ R^{1×N} and x_i ∈ R^{M×1}, respectively. Moreover, x_{ij} denotes the element located at row i and column j of the matrix X.
Novel DL constraints tailored to task-related fMRI

Preliminaries on DL-based fMRI analysis
The data collected during an fMRI experiment form a two-dimensional data matrix as follows: each one of the acquired 3D images is unfolded and stored into a vector, x = [x_1, x_2, …, x_N] ∈ R^{1×N}, where N is the total number of voxels per image. Such vectors correspond to the sequence of, say, T successively obtained images and they are concatenated as rows to form the data matrix X ∈ R^{T×N}. Note that, in practice, prior to the formation of the data matrix X, several standardized pre-processing steps are conducted to account for many detrimental effects related to the fMRI image acquisition process, such as slice timing correction, head-motion realignment, normalization, etc. From a mathematical point of view, the source separation problem can be described as a matrix factorization task of the data matrix, i.e.,

X ≈ D S,     (1)

where, following the dictionary learning jargon, D ∈ R^{T×K} is the dictionary matrix, whose columns represent different time courses, S ∈ R^{K×N} is the coefficient matrix, whose rows are the spatial maps associated with the corresponding time courses, and K is the number of sources.
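The formation of the data matrix can be illustrated with a few lines of NumPy (an illustrative sketch of ours, with toy dimensions):

```python
import numpy as np

# Hypothetical toy dimensions: a 4 x 4 x 2 volume acquired at T = 5 time
# points; real acquisitions are closer to 64 x 64 x 32 voxels, T = 200-300.
T, nx, ny, nz = 5, 4, 4, 2
rng = np.random.default_rng(0)
volumes = rng.standard_normal((T, nx, ny, nz))   # the 4D fMRI sequence

# Unfold each 3D image into a row vector of N = nx*ny*nz voxels and stack
# the T rows to form the data matrix X in R^{T x N}.
N = nx * ny * nz
X = volumes.reshape(T, N)
print(X.shape)  # (5, 32)
```

The unfolding is lossless: reshaping X back to (T, nx, ny, nz) recovers the original volumes, which is what allows the estimated spatial maps (rows of S) to be folded back into 3D brain images for inspection.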
In general, such a matrix factorization can be obtained via the solution of a constrained optimization task:

min_{D, S} ||X − D S||²_F,  subject to  D ∈ D, S ∈ L,     (2)

where D, L are two sets of admissible matrices, defined by an adopted set of appropriately imposed constraints, and ||·||_F denotes the Frobenius norm of a matrix. The concept of sparsity refers to discrete signals that involve a sufficiently large number of zero values. The typical way to quantify sparsity is via the ℓ0-norm: given an arbitrary vector x ∈ R^N, the ℓ0-norm (which, strictly speaking, is not a norm in the strict mathematical sense [5]) is defined as the number of non-zero components of the vector. In DL, the coefficient matrix is constrained to be sparse. In standard DL methods, the sparsity constraints are usually implemented in two ways: either each column of S is separately constrained to be sparse, e.g., ||s_i||_0 ≤ γ_i, where γ_i is the maximum number of non-zero values of the i-th column of S, or the full coefficient matrix, S, is constrained to involve, at most, γ non-zeros [16,38,39].
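The two standard sparsity conventions can be made concrete with a short NumPy sketch that evaluates the objective of Eq. (2) and checks both the column-wise and the whole-matrix ℓ0 budgets (all sizes and budgets here are illustrative values of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, N = 30, 4, 50
D = rng.standard_normal((T, K))          # dictionary: K time courses
S = rng.standard_normal((K, N))
S[np.abs(S) < 1.0] = 0.0                 # sparsify the coefficient matrix
X = D @ S                                # noiseless data for illustration

# Objective of Eq. (2): squared Frobenius distance between X and D S.
obj = np.linalg.norm(X - D @ S, 'fro') ** 2

# Column-wise budget, ||s_i||_0 <= gamma_i (a common gamma for all i here):
col_ok = bool((np.count_nonzero(S, axis=0) <= 3).all())
# Whole-matrix budget, ||S||_0 <= gamma:
tot_ok = np.count_nonzero(S) <= K * N
print(obj, col_ok, tot_ok)
```

Since X is built exactly as D S, the objective is zero; real data never allow this, which is precisely why the factorization is posed as a constrained minimization.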
The proposed method introduces novel constraints on the spatial maps (i.e., on the coefficient matrix, S) and on the time courses (i.e., on the dictionary, D), whose design serves the specific needs of task-related fMRI data analysis. These constraints are discussed next.

Information bearing sparsity constraints on the spatial maps
In the fMRI framework, sparsity is a natural assumption to model the segregated nature of the spatial maps of the FBNs. In other words, each row, say s^i, of the coefficient matrix, S, should have non-zero values only at entries that correspond to voxels activated by the corresponding time course d_i. The smaller the area which a specific FBN occupies in the brain, the sparser the corresponding row of the coefficient matrix should be. At this point, it is worth recalling that, so far, all the DL methods applied in fMRI do not impose sparsity row-wise. However, imposing sparsity column-wise assumes that only a few sources are active in each voxel. Although this is generally true, this piece of information is voxel-dependent and it is hard to find a suitable regularization parameter that simultaneously works optimally for all the thousands of columns. On the other hand, imposing sparsity on the full coefficient matrix may be easier to handle; yet, it is a general constraint that fails to exploit relevant information regarding the spread of each FBN. In this paper, to the best of our knowledge, the DL framework is extended for the first time in order to allow sparsity promotion along the rows of the coefficient matrix. Looking at Eq. (2), sparsity in the rows of the coefficient matrix can be imposed using the following admissible set:

L_0 = {S ∈ R^{K×N} : ||s^i||_0 ≤ φ_i, i = 1, 2, …, K},     (3)

where φ_i is a user-defined parameter, which denotes the maximum number of non-zero elements of the i-th row of S. In the fMRI case, the parameter φ_i bears a clear physical interpretation: it corresponds to the total number of voxels that are active due to the i-th source, i.e., to the number of voxels that the corresponding FBN occupies, an estimate of which can be obtained from brain atlases.
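A minimal sketch of how the row-wise ℓ0 constraint of Eq. (3) can be enforced, keeping the φ_i largest-magnitude entries of each row; the helper name and the toy values are ours, not part of the paper's implementation:

```python
import numpy as np

def project_rows_l0(S, phi):
    """Enforce ||s^i||_0 <= phi_i by keeping, in each row i of S, only the
    phi[i] largest-magnitude entries (illustrative sketch; in practice the
    phi_i would come from atlas-based estimates of each FBN's extent)."""
    S_proj = np.zeros_like(S)
    for i, p in enumerate(phi):
        if p > 0:
            idx = np.argsort(np.abs(S[i]))[-p:]
            S_proj[i, idx] = S[i, idx]
    return S_proj

S = np.array([[0.2, -1.5, 0.0, 3.0],
              [1.0,  0.1, -0.2, 0.05]])
phi = [1, 2]   # e.g., source 1 occupies far fewer voxels than source 2
print(project_rows_l0(S, phi))
```

The first row retains only its largest entry (3.0), the second its two largest (1.0 and -0.2); everything else is zeroed, mimicking a spatial map restricted to its most active voxels.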
It is well known that the ℓ0-norm constraint results in an NP-hard optimization task, and it is usually relaxed to its closest convex relative, i.e., the ℓ1-norm [40]. Then, the corresponding constrained set becomes:

L_1 = {S ∈ R^{K×N} : ||s^i||_1 ≤ λ_i, i = 1, 2, …, K},     (4)

where the λ_i are new user-defined parameters that implicitly control the sparsity of the rows of the coefficient matrix. In contrast to the parameters φ_i, the new parameters λ_i are not directly related to the sparsity level, rendering them hard to tune in practice, unless cross-validation is an option, which in fMRI is not the case.
An additional novelty of IADL, which allows us to bypass the previous drawback, is the application across rows of a weighted version of the ℓ1-norm. In particular, given an arbitrary vector x ∈ R^N, the weighted ℓ1-norm is defined as:

||x||_{w,1} = Σ_{i=1}^{N} w_i |x_i|,     (5)

where w is a real positive vector with entries given by

w_i = 1 / (|x_i| + ε), i = 1, 2, …, N,     (6)

and ε ∈ R_+ is a real positive number, introduced in order to avoid division by zero, providing enhanced numerical stability [41,42,5]; it can be set directly to a small value, e.g., 10^{-6}. Accordingly, the row-sparsity constraint now becomes:

L_w = {S ∈ R^{K×N} : ||s^i||_{w^i,1} ≤ φ_i, i = 1, 2, …, K},     (7)

where w^i is the vector of weights that corresponds to the vector s^i, computed as in Eq. (6). The key point is that, similarly to the ℓ0 case, the constraint is imposed via the sparsity level φ_i and, consequently, φ_i can now be used as an upper bound. It has been theoretically substantiated that the weighted ℓ1-norm is bounded by the corresponding ℓ0-norm [42]. As will be further discussed in Section 3, this facilitates the transfer of knowledge related to brain structure and functional connectivity into the optimization task. Moreover, as shown in Appendix F.3, given w, L_w remains convex.
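The weighted ℓ1-norm of Eq. (5) with the weights of Eq. (6) is straightforward to compute; the short sketch below (function names ours) also illustrates numerically why the weighted norm stays below the ℓ0-norm, so that φ_i remains a meaningful upper bound:

```python
import numpy as np

EPS = 1e-6  # small constant avoiding division by zero, as in Eq. (6)

def weights(x, eps=EPS):
    """w_i = 1 / (|x_i| + eps), Eq. (6)."""
    return 1.0 / (np.abs(x) + eps)

def weighted_l1(x, w):
    """Weighted l1-norm, Eq. (5): sum_i w_i |x_i|."""
    return float(np.sum(w * np.abs(x)))

x = np.array([0.0, 2.0, -0.5, 0.0, 1.0])
val = weighted_l1(x, weights(x))
# Each non-zero entry contributes |x_i| / (|x_i| + eps) < 1 and each zero
# contributes 0, so the weighted l1-norm stays just below ||x||_0 = 3;
# this is why the sparsity level phi_i remains a valid upper bound [42].
print(val)
```

Note that, unlike the plain ℓ1-norm, the value is essentially insensitive to the magnitudes of the non-zero entries: scaling x changes ||x||_1 proportionally but leaves the weighted norm near the non-zero count.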
Hereafter, an equivalent but conceptually easier to handle sparsity measure that is independent of the length of the vector, known as the sparsity percentage, will be used interchangeably with the sparsity level. The sparsity percentage expresses the proportion of zeros within a vector x and is given by

SP = 100 × (1 − φ/N) %,     (8)

where φ = ||x||_0 and N is the total number of elements.
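In code, Eq. (8) is a one-liner (hypothetical helper for illustration):

```python
import numpy as np

def sparsity_percentage(x):
    """SP = 100 * (1 - ||x||_0 / N), Eq. (8): the proportion of zeros
    in x, independent of the vector's length."""
    phi = np.count_nonzero(x)
    return 100.0 * (1.0 - phi / x.size)

x = np.array([0.0, 2.0, 0.0, 0.0, 1.0])
print(sparsity_percentage(x))  # 60.0: three of the five entries are zero
```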

Task-related dictionary soft constraints
The enhanced discriminative power of the GLM over the BSS methods stems from the fact that, in GLM, the task-related time courses are explicitly provided to the model via the estimated BOLD sequences [2]. Such information is left unexploited in the BSS framework.
Hence, it seems reasonable to incorporate this information in the BSS methods, leading naturally to a semi-blind formulation.
As has already been stated in the Introduction, in contrast to ICA techniques, DL-based methods can easily incorporate, in principle, any constraint on the time courses, since sparsity is not affected by such assumptions. This fact has been exploited in SDL by splitting the dictionary into two parts:

D = [∆, D_F],     (9)

where the first part, ∆ ∈ R^{T×M}, comprises fixed columns, which are set equal to the imposed task-related time courses, δ_1, δ_2, …, δ_M. The second part, D_F, is left to vary and is learned during the DL optimization phase. Nevertheless, SDL inherits the same drawback associated with the GLM: the constrained atoms of the fixed dictionary (the columns of the matrix ∆) lead to improvement only if the imposed task-related time courses are sufficiently accurate. Otherwise, the task-related time courses will be mis-modelled and their contribution can introduce detrimental effects, leading to inaccurate results.
In this paper, we relax the strong equality requirement of SDL to a looser, similarity-promoting norm constraint based on a distance measure. Then, if part of the a priori information is inaccurate, e.g., the assumed HRF differs from the true one, the method can efficiently adjust the constrained atoms, since they are not forced to remain fixed and equal to the preselected time courses. Moreover, the proposed modeling also accounts for multiple factors that potentially alter the functional shape of the task-related time courses across subjects and brain regions, such as vascular differences, partial volume imaging, brain activations [26], hematocrit concentration [43], lipid ingestion [44] and even nonlinear effects due to short interstimulus intervals [45].
Mathematically, the starting point lies in splitting the dictionary into two parts:

D = [D_C, D_F],     (10)

where, in contrast to SDL, the part D_C ∈ R^{T×M} has columns which are constrained to be similar, rather than equal, to the imposed task-related time courses. Then, we can define a new convex set of admissible dictionaries:

D_δ = {D ∈ R^{T×K} : ||d_i − δ_i||_2 ≤ c_δ, i = 1, …, M, and ||d_i||_2 ≤ c_d, i = M+1, …, K},     (11)

where ||·||_2 denotes the Euclidean norm, d_i is the i-th column of the dictionary D and δ_i is the i-th a priori selected task-related time course. The constant c_δ is a user-defined parameter which controls the degree of similarity between the constrained atoms and the imposed time courses; essentially, this reflects our confidence in how accurate the cHRF is for the subject under consideration. In particular, as it is further explained in Appendix A.3, c_δ accounts for the natural variability that the HRFs among subjects are expected to have, giving rise to consistent strategies for its tuning. Moreover, the free atoms have a bounded norm controlled by c_d, another user-defined parameter introduced to avoid ill-conditioning phenomena, which can be fixed and set equal to 1 [46].
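A sketch of the Euclidean projection onto the set D_δ of Eq. (11), under the observation that each atom can be projected independently onto its own ball (function names and toy sizes are ours):

```python
import numpy as np

def project_ball(v, center, radius):
    """Euclidean projection of v onto the ball {u : ||u - center||_2 <= radius}."""
    diff = v - center
    dist = np.linalg.norm(diff)
    if dist <= radius:
        return v
    return center + radius * diff / dist

def project_dictionary(D, Delta, c_delta, c_d=1.0):
    """Projection onto the admissible set D_delta of Eq. (11): the first M
    atoms must stay within c_delta of the imposed time courses delta_i,
    while the remaining (free) atoms are norm-bounded by c_d."""
    D = D.copy()
    M = Delta.shape[1]
    zero = np.zeros(D.shape[0])
    for i in range(D.shape[1]):
        if i < M:
            D[:, i] = project_ball(D[:, i], Delta[:, i], c_delta)
        else:
            D[:, i] = project_ball(D[:, i], zero, c_d)
    return D

# e.g., pulling drifted atoms back into the admissible set:
rng = np.random.default_rng(2)
T_len, K, M = 50, 3, 2
Delta = rng.standard_normal((T_len, M))
D_proj = project_dictionary(rng.standard_normal((T_len, K)), Delta, c_delta=0.2)
```

Setting c_δ = 0 would recover the hard equality of SDL, while a large c_δ effectively leaves the constrained atoms free; the interesting regime is in between, matching the expected HRF variability.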

The IADL algorithm
In this section, we present an implementation of IADL that solves Eq. (2), incorporating the two proposed sets of constraints, i.e., D = D_δ and L = L_w.
The simultaneous minimization with respect to D and S is challenging due to both the non-convexity of the task in Eq. (2) and the potentially large size of the data matrix. The size of the latter is restrictive for optimization frameworks that are computationally demanding. For the above reasons, we adopt the Block Majorized Minimization (BMM) rationale, which provides a powerful framework for the solution of such types of optimization tasks, e.g., [47]. Essentially, the BMM scheme simplifies the optimization task by adopting a two-step alternating minimization, under certain assumptions [48]. The proposed DL approach is algorithmically described next, and the full mathematical derivation and convergence proof are provided in Appendix F.
Put succinctly, the proposed DL algorithm follows the standard scheme of classical DL methods, which iteratively alternate between a sparse coding step and a dictionary update step. Concerning the sparse coding step, the corresponding recovery mechanism is a soft thresholding operator, similar to the one that corresponds to the standard ℓ1-norm constraint, as explained in [42].
In Algorithm 1, we present the pseudo-code for solving Eq. (2), given the number of sources, K, an arbitrary set of estimates, S^[0], D^[0], M estimates of the task-related time courses, δ_1, δ_2, …, δ_M, grouped by columns in the matrix ∆, the sparsity for each row, φ = [φ_1, φ_2, …, φ_K]^T, and the number of iterations, Iter. Observe that the free parameters of the algorithm that need to be tuned are: a) the number of sources, K, b) the maximum sparsity per row, φ_i, c) the radius parameter c_δ, and d) the parameters c_d and ε involved in Eq. (11) and Eq. (6), respectively. The last two can be directly fixed to 1 and a small value, e.g., 10^{-6}, respectively. All the rest of the parameters can be straightforwardly tuned based on physical arguments, which can be easily drawn from the fMRI study at hand. The algorithm is insensitive to overestimating K, which renders this parameter easily tunable (see discussion in Appendix A.1 and Appendix E.3). Also, the maximum sparsity per row is readily obtained from published brain atlases (see Appendix A.2 and Appendix B) and c_δ is easily tuned considering the task-related time courses (see Appendix A.3); the latter results from established HRF models and their expected variability. Furthermore, we provide a Matlab implementation for IADL-based analysis, which is available in the IADL repository: https://github.com/MorCTI/IADL. The implementation comprises automatic initialization and parameter self-tuning for c_δ. Accordingly, its default parameter setup can be adopted out-of-the-box with any task-related fMRI dataset. At the same time, a more thoughtful specification of the parameters, e.g., incorporating the expected sparsity level of the task-related FBNs at hand, can improve results further.
The sparse coding step relies on a projection onto the weighted ℓ1-norm ball with weights w^i and radius φ_i; this projection operator is derived in closed form in [42].
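To convey the overall flow of the alternating scheme, the toy Python rendition below performs projected-gradient updates for the two steps. It is not the paper's BMM implementation: in particular, it deliberately replaces the closed-form weighted-ℓ1-ball projection of [42] with plain row-wise hard thresholding, and all function names and sizes are ours:

```python
import numpy as np

def keep_topk_rows(S, phi):
    """Row-wise hard thresholding: keep the phi[i] largest-magnitude entries
    of each row (simplified stand-in for the weighted-l1-ball projection)."""
    out = np.zeros_like(S)
    for i, p in enumerate(phi):
        if p > 0:
            idx = np.argsort(np.abs(S[i]))[-p:]
            out[i, idx] = S[i, idx]
    return out

def iadl_sketch(X, D, S, Delta, phi, c_delta, c_d=1.0, n_iter=50):
    """Toy alternation: a sparse coding step (gradient step on the
    Frobenius objective, then row projection) followed by a dictionary
    update step (gradient step, then projection onto D_delta)."""
    M = Delta.shape[1]
    for _ in range(n_iter):
        lip = np.linalg.norm(D.T @ D, 2) + 1e-12       # Lipschitz constant
        S = keep_topk_rows(S - (D.T @ (D @ S - X)) / lip, phi)
        lip = np.linalg.norm(S @ S.T, 2) + 1e-12
        D = D - ((D @ S - X) @ S.T) / lip
        for i in range(D.shape[1]):                    # project each atom
            center = Delta[:, i] if i < M else np.zeros(D.shape[0])
            radius = c_delta if i < M else c_d
            diff = D[:, i] - center
            nrm = np.linalg.norm(diff)
            if nrm > radius:
                D[:, i] = center + radius * diff / nrm
    return D, S

# Tiny demo (hypothetical sizes: T=20 scans, K=4 sources, M=2 assisted):
rng = np.random.default_rng(3)
T_len, K, M, N = 20, 4, 2, 30
Delta = rng.standard_normal((T_len, M))
free = rng.standard_normal((T_len, K - M))
free /= np.linalg.norm(free, axis=0)
X = np.column_stack([Delta, free]) @ keep_topk_rows(
    rng.standard_normal((K, N)), [5] * K)
D0 = np.column_stack([Delta, rng.standard_normal((T_len, K - M))])
D0[:, M:] /= np.linalg.norm(D0[:, M:], axis=0)
D_est, S_est = iadl_sketch(X, D0, np.zeros((K, N)), Delta, [5] * K, c_delta=0.5)
```

After the run, the constrained atoms of D_est remain within c_δ of the imposed time courses, every row of S_est respects its sparsity budget, and the fit residual has decreased from its starting value.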

Performance results based on synthetic data
In Appendix C, a novel synthetic dataset is presented. This highly realistic set emulates demanding experimental tasks, where substantial spatial map overlap exists. This allows us to effectively evaluate the performance of the proposed DL method in comparison with state-of-the-art blind and semi-blind approaches under a more realistic setting. The brain-like sources comprising the dataset are, for convenience, depicted in Figure 7 in Appendix C. The adopted performance measure, r, is associated with Pearson's correlation coefficient between the estimated and the true sources and is described in detail in Appendix D.
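For intuition only, a simplified stand-in for such a correlation-based measure can be sketched as follows; the actual measure r of Appendix D involves a proper source-matching procedure, whereas the hypothetical helper below merely reports, for each true time course, the best absolute Pearson correlation achieved by any estimate:

```python
import numpy as np

def best_match_correlation(true_courses, est_courses):
    """For each true time course (column), report the highest absolute
    Pearson correlation achieved by any estimated course. Simplified
    stand-in, not the measure r of Appendix D."""
    def corr(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float(np.abs(np.mean(a * b)))
    return [max(corr(t, e) for e in est_courses.T) for t in true_courses.T]

t = np.linspace(0, 1, 100)
true_courses = np.column_stack([np.sin(8 * t), np.cos(5 * t)])
est_courses = np.column_stack([1.7 * np.cos(5 * t) + 0.3,   # scaled/shifted
                               np.sin(8 * t)])
print(best_match_correlation(true_courses, est_courses))  # both close to 1.0
```

Because Pearson correlation is invariant to scaling and offsets, a recovered time course that differs from the truth only by an affine transformation still scores near 1, which is the desired behavior for a source-separation metric.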
The aim of our first performance study is twofold: first, to study the effectiveness of the proposed approach in dealing with HRF mis-modeling and, second, to evaluate the algorithm's decomposition performance with respect to sets of sources of interest. As benchmarks, the following competitive algorithms are considered: a) McICA, a constrained ICA algorithm [35] that allows a source to be assisted using the task-related time courses, b) SDL, as well as an Online DL algorithm (ODL) [15], which is included in the SPAMS toolbox, and c) two ICA algorithms, namely Infomax [49], a widely used ICA algorithm within the fMRI community, and JADE [50], which we used as the initialization point for all the involved DL algorithms.
In order to emulate the HRF variability, we generated six different "subjects", through six different, yet realistic, synthetic HRFs. The selected HRFs are depicted in Figure 1. In particular, six different datasets have been built, one for each subject, with the only difference among them being the fact that the brain-induced time courses have been generated using the HRF of the corresponding subject. Moreover, the task-related time courses are chosen to be the sources 1, 11 and 14 (see Figure 7), since they correspond to realistic scenarios that are often encountered in practice: source 1 is easy to identify, since it barely overlaps spatially with other sources and it corresponds to a block-event experimental design. Sources 11 and 14 are more challenging; they exhibit notable overlap and emulate an event-related task (see Figure 7). Consequently, we generate the imposed task-related time courses δ_1, δ_2 and δ_3 by convolving the experimental conditions related to these sources with the cHRF, according to the standard procedure that is followed in GLM/SPM-based analysis. Concerning the parametrization of the algorithms, K was overestimated by 25%, i.e., K = 25 rather than 20. This is typical in realistic scenarios, since it is not possible to know the exact number of sources. All the benchmark methods require an estimate of the number of sources; thus, the same value was provided to all the algorithms. Moreover, as discussed in Section 3, IADL requires setting up three extra quantities: a) the maximum sparsity for each of the task-related sources, b) the set of maximum sparsity values that the rest of the sources can take, and c) the parameter c_δ. For (a), the imposed sparsity percentages for the task-related sources 1, 11 and 14 are 95%, 90% and 94%, respectively, which correspond to a slight overestimation compared to the true sparsity values.
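The construction of the imposed task-related time courses, i.e., the convolution of each condition's boxcar with the cHRF sampled at the scan times, can be sketched as follows (the double-gamma parameters are the usual SPM-style defaults; the helper names, onsets and durations are hypothetical):

```python
import numpy as np
from math import gamma

def canonical_hrf(dt=0.5, duration=30.0):
    """SPM-style double-gamma canonical HRF sampled every dt seconds:
    a peak around 5 s minus a weaker undershoot around 15 s."""
    t = np.arange(0.0, duration, dt)
    peak = t ** 5 * np.exp(-t) / gamma(6)
    undershoot = t ** 15 * np.exp(-t) / gamma(16)
    return peak - undershoot / 6.0

def task_time_course(onsets, dur, n_scans, tr=2.0, dt=0.5):
    """Boxcar of one experimental condition convolved with the cHRF and
    sampled at the scan times (hypothetical helper, not the paper's code)."""
    n = int(n_scans * tr / dt)
    box = np.zeros(n)
    for onset in onsets:
        box[int(onset / dt):int((onset + dur) / dt)] = 1.0
    full = np.convolve(box, canonical_hrf(dt))[:n]
    return full[::int(tr / dt)]

# A block design with three 20 s epochs over 80 scans (TR = 2 s):
d1 = task_time_course(onsets=[10, 60, 110], dur=20, n_scans=80)
print(d1.shape)  # (80,)
```

Emulating a subject-specific HRF, as done for the six synthetic "subjects", would amount to replacing `canonical_hrf` with a perturbed variant while keeping the same conditions.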
Sparsity set-up information related to this experiment is also listed in Table 1. In particular, the true sparsities of the task-related sources are shown in the first three rows of Table 1.a, and the corresponding imposed sparsities are shown in the corresponding rows of the first column of Table 1.b. The task-related sources, both the true and the imposed ones, are sorted with decreasing sparsity percentage. In Section 4.2, it will be shown that the proposed scheme is largely insensitive to the provided sparsity values. Finally, for point (c) above, we set c_δ = 0.2, which is large enough to accommodate the variations among HRFs between the different synthetic datasets. This value has been calculated following the algorithmic procedure described in Appendix A.3. SDL and ODL tune the sparsity constraint via a single regularization parameter, λ. In contrast to IADL, λ is not directly related to a certain sparsity level and its optimum value is case-specific. Moreover, it does not bear a concise physical interpretation that could serve as a guide for its tuning. Also, the range of values of the optimum λ can vary significantly from case to case: it can take a very small value, e.g., 0.01, or a relatively large one, e.g., 10. In the synthetic dataset case, since the ground truth, i.e., the correct decomposition, is fully known, λ can be optimized through cross-validation. However, such a luxury is never available in practice, where real fMRI datasets need to be analysed. In this study, we evaluate SDL using λ values that lead to the best performance according to certain criteria: the first criterion was to optimize SDL with respect to the mean performance of the time courses of all the sources, leading to λ_1 = 0.02.
The second criterion was to optimize with respect to the mean value of the full-source performance for the assisted sources only, leading to λ_2 = 2.8. In both cases, we conducted the λ optimization for the subject that corresponds to the cHRF. Observe the large variation between the obtained λ values, which is indicative of the sensitivity of the SDL approach to λ tuning. On the other hand, the optimum value for the fully blind ODL algorithm was found to be λ = 1.6. Moreover, SDL and ODL are initialized from ICA, similarly to IADL. Such an initialization leads to better performance compared to the original version presented in [38]. These observations agree with the results reported in [51], where the authors use ICA as the initialization point for their proposed DL algorithm to enhance its performance.
McICA requires fine-tuning a set of four regularization parameters. We also observed that the optimal selection of these parameters heavily depends on the particular synthetic subject, similarly to the SDL algorithm. Accordingly, we manually optimized these parameters via cross-validation, aiming to achieve the best average performance over all the subjects. Figure 2.I and Figure 2.II show the performance results with respect to the full sources and the time courses, respectively. The horizontal axis indicates the six synthetic subjects that correspond to different HRFs. Both figures comprise three sub-graphs, each of which considers the performance with respect to a different set of sources: (a) the task-related sources 1, 11 and 14, (b) the brain-like sources (1, 2, …, 15), and (c) the whole dataset (1, 2, …, 20), which includes both brain-like sources and artifacts. For completeness, Figure 9 in Appendix E shows the individual performance of the studied methods for each assisted source and depicts some of the obtained time courses and their corresponding spatial maps.
Let us first focus on the two information-assisted DL algorithms, SDL and IADL, whose performance is indicated with green and dark blue curves, respectively. They are both assisted with the task-related time courses that correspond to the cHRF. In the SDL case, the solid and the dashed curves correspond to regularization parameter tuning equal to λ_1 and λ_2, respectively. Recall that these two values lead to the best performance for the canonical subject. From Figure 2.II.a, it is already observed that IADL manages to cope well with the subject variability, even in cases where the discrepancy between the canonical and the subject HRF is large, e.g., subject E (see Figure 1). On the contrary, in Figure 2.I.a, SDL copes well only in the canonical case, for which the algorithm has been explicitly optimized. In Figure 2.II.a, SDL attains superior performance only in the canonical case (where the exact task-related time courses have been used) as well as for subject A who, as can be seen in Figure 1, has an HRF similar to the canonical one. Focusing on the SDL performance for subjects B to E, it is apparent that the strategy of fixing the assisted time courses can lead to high performance deterioration according to both performance measures.
In the cases where the brain-like and all the sources are considered (middle and right-most sub-figures), the proposed approach still outperforms SDL. Note that the time courses estimated by IADL are overall better than those of SDL, even in the case of the canonical subject, in which SDL is fully optimized by exploiting knowledge of the ground truth. In comparison to the fully blind methods, i.e., ODL, JADE and Infomax, the approaches assisted with task-related time courses perform clearly better. Among the blind methods, ODL works better than the ICA-based approaches; however, in practice, it is very hard, if possible at all, to optimize its λ parameter as was done here with the synthetic dataset.
The yellow curves in Figure 2 depict the performance of McICA. This particular constrained ICA algorithm provides estimates only of the assisted sources [35]; hence, the related results correspond only to the assisted brain-like sources. First, we observe that the McICA algorithm performs better than JADE and Infomax. On the other hand, the proposed IADL algorithm outperforms McICA for all the studied synthetic subjects.

IADL robustness against sparsity parameter mistuning.
In this section, the tolerance of the proposed approach to the choice of the maximum sparsity parameters, φ_i, is investigated. As was apparent in the previous experiment with synthetic data, it is convenient to treat separately the sparsity constraints of the task-related time courses, which need to be explicitly set on a one-to-one basis. The sparsity constraints of the rest of the sources only need a rough estimate of the sparsity percentage, as described in Appendix A.2. This convention is followed in Table 1.b, where three such setups, denoted as θ1, θ2 and θ3, are listed. The first one is the closest, overall, to the true sparsities. In θ2, the sparsities of the non-task-related sources, i.e., the 4th up to the 25th, have been grossly assigned in a simplified way: 5 sources with sparsity percentage 90%, 5 sources with 80%, etc. In θ3, the sparsity percentages of the task-related time courses have been largely relaxed by fixing the sparsity percentage value to 85% for all three of them (i.e., the 1st up to the 3rd). This figure roughly corresponds to the smallest sparsity percentage that can be found in all the FBNs and atlases that we have checked (a list of FBNs' sparsity percentages corresponding to several published brain atlases can be found in Appendix B). Figure 3.I and Figure 3.II show the performance for these three sparsity setups. The first figure illustrates the performance results with respect to the full sources, whereas the second one shows the performance results with respect to the time courses only. Besides the sparsity levels, the setup of the experiment is the same as in the previous one; also, for reference, the performance curves of JADE and SDL are copied here. It can be readily observed that the proposed approach is remarkably robust to the sparsity specifications.
Indeed, in the time-course-only measure, there are no detrimental effects, whereas in the full-source case, θ3 led to a minor performance degradation in the estimates of the task-related sources. This result is highly welcome and ensures that the requirement to explicitly set the sparsity for the task-related time courses is not an obstacle when using the proposed algorithm. In fact, if there is extra information concerning the maximum expected sparsity level of the task-related time courses, then this can be used, since it can only help. Otherwise, the sparsity percentages for the task-related time courses can all be set to a safely small value, e.g., 85%, and this will still be beneficial to the algorithm, leading to reliable results.

Comparison between IADL and GLM
For completeness, we have also performed a comparison between the proposed IADL and the standard GLM approach using the SPM12 toolbox, where the design matrix comprises the three task-related time courses. For this study, we observed that both SPM and IADL recover the assisted sources 1 and 3 correctly. However, for the assisted source 2, the result of SPM is significantly inferior to that of IADL. Full details of this experiment can be found in Appendix E.2.

Comparison between IADL and several alternatives for the analysis of fMRI data
In order to clearly position IADL among all the matrix factorization-based methods that have been used in fMRI, with respect to all their features and analysis capabilities, we present a comprehensive comparison in Table 2. Among those different alternatives, observe that IADL complies with all the studied criteria apart from criterion F, namely, the ability to explicitly specify the places/voxels within the brain where an FBN will appear through user-defined masks. However, note that IADL, with relatively mild modifications in the spatial map constraint, L_w, can also comply with this criterion. Such a development is beyond the scope of this work and will be presented elsewhere.

Table 2: Comparison between several alternatives for fMRI data analysis.

Algorithms: IADL, ICA [6], [7], [8], SPM [4], McICA [35], CSTICA [32], [33], CTICA [29], [30], ADL [17], CSICA [31], T-NMF [22], ODL [15], SDL [38], NMF [20], S-NMF [21], Sparse-SPM [28], [11].

Criteria:
A - Can impose task-related time information in a strict way.
B - Can impose task-related time information in a relaxed/assisted way.
C - Provides blind estimation of brain-induced sources and other unmodeled physiological noise or artifacts.
D - Neither cross-validation nor experimental visual inspection is needed for parameter fine-tuning.
E - Free parameters are easy to tune and admit a physical interpretation within the fMRI context.
F - Allows the inclusion of spatial information from the sources of interest through user-defined masks.
G - Allows the inclusion of spatial information for all the sources in a relaxed/assisted way.
H - Can explicitly deal with dense sources.
I - An implementation of the code is freely available.

Real fMRI data analysis
To test the capabilities of the proposed method in a realistic scenario, we study the motor task-fMRI experiment from the WU-Minn Human Connectome Project [52]. This experiment follows a standard block paradigm, where a visual cue asks the participants to either tap their left/right fingers, squeeze their left/right toes or move their tongue. Each movement block lasts 12 seconds and is preceded by a 3-second visual cue. In addition, there are 3 extra fixation blocks of 15 seconds each, as detailed in the Human Connectome Project protocols.
The reason for selecting this specific dataset is twofold: first, the FBNs related to this experimental design are well studied [18], [53], [54], [55], which facilitates the evaluation of the results by inspection. Second, this dataset is particularly challenging: the FBNs of interest exhibit significant asymmetries in their intensity [54], and the spatial maps exhibit high overlap, particularly within the cerebellar cortex [55].
For this study, we selected the first 15 participants (10 females, age range 26-35 years) that performed the studied motor task from the Q1 release of the Human Connectome Project repository; the acquisition parameters are summarized in the corresponding imaging protocols. On top of the standard preprocessing pipeline already conducted on the original dataset [56], [53], we also smoothed each volume with a 4-mm FWHM Gaussian kernel.
For comparison, in addition to IADL, the analysis of the real data is also conducted with SDL, ICA and GLM. Thus, for this experiment, we defined 6 task-related time courses, i.e., one per condition (visual cue, right hand, left hand, right foot, left foot and tongue), as prior knowledge. Each condition was modeled as a succession of blocks with duration equal to its presentation time (see protocols) and then convolved with the cHRF. This is the standard procedure followed in GLM-based analysis as well [2].
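The construction just described (a boxcar per condition convolved with the cHRF) can be sketched as follows. The two-gamma HRF parameters below are common defaults and only an illustrative stand-in for the exact cHRF used in the paper:

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr, duration=32.0):
    """Simplified double-gamma canonical HRF sampled at the repetition time.
    Peak near 6 s, undershoot near 16 s, 1/6 undershoot ratio (common defaults)."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.max()

def task_time_course(onsets, block_dur, n_scans, tr):
    """Boxcar for one condition convolved with the canonical HRF."""
    boxcar = np.zeros(n_scans)
    for onset in onsets:
        start = int(round(onset / tr))
        stop = min(n_scans, start + int(round(block_dur / tr)))
        boxcar[start:stop] = 1.0
    return np.convolve(boxcar, canonical_hrf(tr))[:n_scans]

# Toy example: a single 12 s block starting at 15 s, TR = 1 s, 100 scans.
tc = task_time_course([15.0], 12.0, 100, 1.0)
```

One such regressor is built per condition and the resulting vectors form the assisted columns of the dictionary (or the design matrix, in the GLM case).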
For the GLM analysis, we used SPM12 and followed the same standard procedure as described in [53]. For each subject, we computed a linear contrast to estimate significant activity for each movement type versus the baseline and versus all the other movements, and an extra linear contrast for the visual cue. Then, a second-level analysis was performed to assess significant activity among the 15 participants, for each condition.

Figure 4: Significant active voxels (z > 1.97) for the group analysis. Each row shows the most representative positions for each specific task: visual, left/right hand, left/right feet and tongue.
For the matrix factorization methods, in all cases, the total number of sources was set equal to 20, which is a value often used in ICA analyses (for further details, see the discussion in Appendix A.1). Both semi-blind methods, IADL and SDL, impose the same task-related time courses used in the SPM analysis; that is, each information-assisted algorithm contains 6 different assisted sources, one per condition.
For ICA, we used the software toolbox GIFT, which implements multiple ICA algorithms in the context of fMRI data analysis. The algorithms Fast-ICA, Infomax, Erica and ERBM were tested and all of them produced qualitatively similar results. To save space, only the results of the ERBM algorithm [57] are shown, since by visual inspection it appeared to lead to somewhat better performance compared to the rest. With respect to IADL, the sparsity percentage per source is shown in Table 1.c, where the first 6 values correspond to the task-related time courses. Although all FBNs are known to be very sparse, the visual FBN is one of the densest [58], [59] (see also Table S-I). Therefore, the sparsity percentage that corresponds to the visual cue was set to 90% and all the rest were set to 95%. As discussed in the previous section, IADL is robust to sparsity overestimation; therefore, not much difference would be expected if all values were set equal to 90%. For the rest of the sources (7th to 20th values in Table 1.c), values of gradually diminishing sparsity percentage were used, and c_δ was computed to be equal to 4.6, following the same tuning approach as in Section 4.1, which is also described in Appendix A.3.
We performed both single-subject analysis, where the dataset of each subject is decomposed separately, and group analysis, where the datasets of all 15 subjects are concatenated along time [60]. In essence, group analysis implies that all subjects share a similar common spatial map. In the IADL and SDL cases, the task-related time courses of all subjects are concatenated as well, in order to construct the overall task-related time courses. Moreover, the similarity parameter c_δ is recalculated to the value 68 using the same procedure described in Appendix A.3, in order to account for the longer, concatenated task-related time courses.
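The temporal concatenation underlying the group analysis can be sketched in a few lines; the dimensions below are toy values, not those of the actual dataset:

```python
import numpy as np

# Hypothetical toy dimensions: 3 subjects, 50 scans each, 200 voxels,
# and 2 task-related time courses per subject.
rng = np.random.default_rng(0)
subject_data = [rng.standard_normal((50, 200)) for _ in range(3)]
subject_tasks = [rng.standard_normal((50, 2)) for _ in range(3)]

# Concatenate each subject's (time x voxel) data matrix along time; the
# spatial maps are assumed common, so the group matrix is (3 * 50) x 200.
X_group = np.concatenate(subject_data, axis=0)

# The assisted task-related time courses are concatenated the same way,
# forming the overall (3 * 50) x 2 task regressors.
D_tasks_group = np.concatenate(subject_tasks, axis=0)
```

Since the concatenated time courses are longer, distance-based parameters such as c_δ must be recomputed on the concatenated regressors, as noted above.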
At the single-subject level, all the matrix factorization methods produce similar results for all the participants. Figure 17 in Appendix E.5 depicts the results of one randomly selected subject for ICA, SDL and IADL. However, for SPM, some of the subjects did not exhibit significant activity within the expected area of interest, which agrees with previous analyses [53].
The spatial maps that correspond to the experimental conditions for the group analysis are depicted in Figure 4. All the spatial maps from the matrix factorization methods were computed via the pseudoinverse approach, namely S̃ = D†X, following the discussion in Appendix A.5. This approach, besides being the standard in ICA-based analysis, led to enhanced performance for SDL. For the latter algorithm, the regularization parameter, λ, was optimized by inspection. In fact, due to the absence of physically consistent guidance for its tuning, we ran SDL for a large set of λ values, namely λ = {0.01, 0.02, 0.05, 0.1, 0.5, 1, 2.8, 5, 10, 50} (similarly to the range proposed in the SDL paper [38]). It turned out, by visual inspection, that λ = 10 produced spatial maps closest to the ones expected to be detected in the specific experimental design. Note that λ was optimized in the single-subject case and the same value was used in the group analysis, too. Finally, the spatial maps from SPM are the results obtained from the second-level analysis.
From the resulting spatial maps, it is apparent that all the methods detected significant activity within the visual cortex associated with the visual cue. However, ICA has difficulty detecting the rest of the components, since only the right foot and the left hand appear partially recovered. Although the group results for ICA are somewhat better than those of the single-subject analysis (see Figure 17 in Appendix E), ICA still fails to recover the FBNs related to most of the motion activities. These results are expected since, as discussed before, the sources related to the motor task are more difficult to separate compared with the visual area, which appears correctly recovered by all the studied methods.
In contrast to ICA/ERBM, the rest of the approaches (SPM, IADL and SDL) are able to properly separate the different tasks. However, it is observed that, in the single-subject case, SDL exhibits more spurious activity around the lateral ventricles compared to IADL, which provides a more reliable spatial map. Similarly, although SPM presents significant activity within the region of interest, the obtained significant activations are smaller than the expected activation areas and the results exhibit a large number of spurious false positives. Moreover, in the group analysis, SDL failed to recover the right hand. In contrast, IADL successfully unmixes all the expected FBNs and exhibits an enhanced performance. Note that for IADL the same sparsity percentage values, defined in Table 1.c, were used for both the single-subject and the group analyses.

Figure 5 depicts, as an example, the estimated time courses corresponding to the visual area and the left hand. In particular, the red curve corresponds to the task-related time course, which is incorporated as knowledge in the case of IADL and is fixed in the case of SPM and SDL. The green curve is the fully blind estimate obtained by ICA/ERBM. It is clear that ICA produces a relatively reliable estimate of the visual time course. This is in agreement with its performance in revealing the corresponding spatial maps (Figure 4.a). Still, there are some spurious peaks, which are associated with fixation (dark dotted curve). Furthermore, in the left-hand case, ICA fails to recover one of the two block events and excessive volatility is also observed. On the contrary, in both conditions, IADL gently adjusts the task undershoots.
Finally, to further illustrate the potential of IADL, we performed a closer comparison with the standard SPM. Figure 6 displays the significant activation clusters from the group analysis of the 15 subjects. For both approaches, the group maps are displayed between a lower threshold of z = 2.32 (p < 0.01, uncorrected) and an upper threshold of z = 5 (Bonferroni-corrected at p < 0.066). We selected the same range of z-scores and the same coordinates as in Figure 7 of [53], to facilitate the comparison of the results.
From a general perspective, as previously mentioned, the significant clusters from SPM, within each region of interest, are smaller compared with those from Figure 7 in [53] and from IADL. Specifically, SPM seems to have problems unmixing the areas related to the hands and the right foot within the motor areas. Moreover, SPM fails to recover most of the associated areas within the cerebellum, which are particularly challenging [55], for each movement type. In particular, SPM only exhibits significant activation clusters within the cerebellum for the tongue and the left foot. In contrast, the results of IADL are more consistent; we observe that the main activation clusters of IADL are located precisely as suggested by previous literature [53], [54], even for the cerebellar areas [55] (see also Appendix E.6).

Conclusions
In this paper, we present a new Dictionary Learning method that incorporates external information in a natural way via two novel convex constraints: a) a sparsity constraint based on the weighted ℓ1-norm, which allows setting row-wise sparsity constraints that naturally encapsulate the sparsity of the spatial maps, and b) a similarity constraint over the dictionary, which integrates external a priori information that is available from the experimental task.
The proposed sparsity constraint constitutes a natural alternative to the standard ℓ1-norm regularization, allowing the incorporation of external sparsity-related information that is available in brain atlases, and bypasses the problem of selecting the regularization parameters via cross-validation arguments that have no practical meaning when real data are involved.
Furthermore, the incorporation of the task-related time courses from the experimental task enhances the decomposition performance for the corresponding sources. Moreover, the newly proposed constraints exhibit more tolerance and robustness against mis-modeling, compared to alternative approaches.
Extensive simulation results, over realistic synthetic and real fMRI datasets, have verified the advantages and the enhanced performance obtained by the newly proposed method.

A Parameter setup, initialization guidelines and tips
In this section, we provide setup guidelines for running the proposed algorithm. The free parameters that need to be tuned are: a) the number of sources, K, b) the maximum sparsity per row, φ_i, and c) the radius parameter c_δ. Concerning the parameters c_d and ε, both are set to values that satisfy straightforward stability arguments, as described in the main text. Put succinctly, we set c_d equal to 1 to avoid ill-conditioning [46] and ε to a small value, e.g., 10^-6, to avoid division by zero and to enhance numerical stability [5], [41], [42]. The choice of their specific values is not critical and the algorithm is insensitive to it.
Unlike alternative approaches, in our case, all the parameters of the proposed algorithm can be tuned based on physical arguments with respect to the fMRI study at hand. Specifically:

A.1 How to choose the number of sources K
The imposed number of sources, K, should ideally be equal to the true number of sources, which sums up the number of active FBNs, the number of machine-induced sources and any physiological (non-brain-induced) sources, e.g., cardiac activity, movement residuals, etc. It is certainly expected that the true number of sources varies to some degree from experiment to experiment and from subject to subject. However, the proposed algorithm is insensitive to overestimating K, as illustrated in Appendix E.3. Therefore, it is practically easy to set K to an approximate value, which grossly reflects the number of sources being present. Studies concerning the number of sources present in fMRI data, e.g., [61], [62], [63], have shown that the total number of sources is roughly 20 to 40. Therefore, and according to the results in Appendix E.3, the range from 20 to 40 constitutes an adequate and realistic interval, within which the proposed algorithm produces similar outcomes, and one can safely set the number of sources to any value in that interval. Note, however, that a larger number of sources requires more computation time, since it increases the complexity of the decomposition.
Besides, there are specialized approaches that one could use to estimate the intrinsic dimensionality of the fMRI data [61] directly; for example, autoregressive models [61], Principal Component Analysis (PCA) [64] or the Minimum Description Length (MDL) criterion [65].
A.2 How to choose the maximum sparsity per source, φ_i, based on brain atlases

According to the constraint in Eq. (11) in the main text (see also Eq. (19)), the dictionary is split into M columns, which correspond to the assisted time courses, and K − M columns, which correspond to the rest of the sources. Accordingly, the first M rows of S, i.e., the spatial maps, correspond to the assisted sources and the rest of the rows are related to any other source present in the data matrix. Therefore, the sparsity specification is handled differently for the two sets of rows. For the rows corresponding to the assisted sources, the maximum sparsity levels φ_i, i = 1, . . . , M, need to be explicitly set for each one of them. To this end, we resort to information available via existing brain atlases, from which the sparsity level of the FBNs of interest can easily be extracted, as described in Appendix B.
However, it should be emphasized that the algorithm does not require the exact sparsity value; an upper bound suffices. In other words, one only needs an estimate of the maximum expected sparsity level. In addition, as demonstrated in the simulations section (see Section 4 of the main text), the method is very robust to overestimates. This gives the user great flexibility: imagine the worst-case scenario, that is, setting the sparsity percentage of the assisted sources while no prior knowledge concerning the corresponding FBNs is available. In such cases, one can safely give all the assisted sources a sparsity percentage of around, say, 85%, which is close to the mean sparsity percentage of the primary anatomical division of the brain (see Table 4) and roughly corresponds to the densest brain-induced source that we observed in the studied brain atlases. Of course, if one can provide a tighter value, closer to the correct one, then this can only be beneficial to the source decomposition task.
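As a rough illustration of how a sparsity percentage translates into a per-row voxel budget, the sketch below hard-thresholds a spatial-map row to its allowed number of active voxels. Note that this non-convex projection is only a didactic stand-in: the method's actual constraint is the convex weighted ℓ1-norm ball.

```python
import numpy as np

def keep_top_k(row, sparsity_pct):
    """Illustrative hard-thresholding: keep only the (1 - sparsity) fraction
    of largest-magnitude entries of a spatial-map row.  The paper's actual
    constraint is a convex weighted l1-ball, not this projection."""
    n = row.size
    k = max(1, int(round((1.0 - sparsity_pct) * n)))  # allowed active voxels
    out = np.zeros_like(row)
    idx = np.argsort(np.abs(row))[-k:]
    out[idx] = row[idx]
    return out

# A 1000-voxel row at 95% sparsity keeps at most 50 active voxels.
rng = np.random.default_rng(1)
row = rng.standard_normal(1000)
sparse_row = keep_top_k(row, 0.95)
```

This makes the interpretation of φ_i concrete: a sparsity percentage is simply an upper bound on the number of voxels a source may activate.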
On the other hand, the sparsity levels of the rest of the sources need not be tuned one by one. The parameters φ_i, i = M + 1, . . . , K, can be assigned a series of values that roughly represent the sparsity levels expected to be found in the dataset. Still, the method is exceptionally robust to this parametrization and the ordering of the values does not have any effect on the result; the algorithm has the power to automatically distribute the estimated sources to the rows of S that they best fit. This behavior is guaranteed by proper initialization, as described later in Appendix A.4. As a rule of thumb, we suggest providing a gradient of ordered sparsity values with sparsity percentages from 99% down to 80%, in order to model the expected brain-like sources and movement artifacts. Also, it is good practice to include a few denser sparsity levels, e.g., 70% down to 0%, to model machine-induced artifacts, unmodeled movement corrections and potential noise-like residuals.
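The rule of thumb above can be sketched as follows; the helper and its default values are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np

def default_sparsity_levels(K, M, task_sparsity=0.85):
    """Illustrative sparsity-percentage setup: the M assisted sources get a
    safe task sparsity (e.g. 85%), the remaining rows get a gradient from
    99% down to 80%, plus a few denser levels at the end for machine-induced
    artifacts and noise-like residuals."""
    phi = np.full(M, task_sparsity)
    n_rest = K - M
    n_dense = min(3, n_rest)                        # a few dense rows
    gradient = np.linspace(0.99, 0.80, n_rest - n_dense)
    dense = np.linspace(0.70, 0.0, n_dense)
    return np.concatenate([phi, gradient, dense])

# Toy example matching the real-data setup: K = 20 sources, M = 6 assisted.
phi = default_sparsity_levels(K=20, M=6)
```

Since the ordering of the non-assisted values is immaterial, only the overall spread of sparsity levels matters here.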
A.3 How to select the radius parameter c_δ

By definition, the parameter c_δ constrains the energy of the residual between the task-related time course, δ, and the subject-specific true time course, d. In practice, these time courses are estimated via linear convolution with the corresponding HRF, that is,

δ = h_c ∗ u and d = h_s ∗ u,

where u is the specific neuronal activation pattern that theoretically matches the experimental task and h_c, h_s are the canonical HRF (cHRF) and the unknown subject-specific HRF, respectively. Note that, in general, the neuronal activation pattern and the experimental one might not be exactly the same: any mis-modeled activation pattern will increase the difference between the real and the assumed time course. Therefore, under the convolutional model, the parameter c_δ reflects our confidence on how accurate the cHRF is for the subject under consideration, as well as how reliable the imposed experimental condition is, which theoretically matches the corresponding activation pattern of the subject. Setting c_δ = 0, no flexibility is allowed at all and the information regarding the task-induced time courses is equivalent to that provided in the GLM/SPM framework. Such a strategy is also acceptable and the whole analysis can still benefit from the rest of the sources that will be estimated, in the same way as SDL [38] does. However, there are good reasons to allow the task-related time courses to drift from the imposed cHRF-based ones: a) the subject's HRF will indeed diverge from the cHRF, and b) the convolution of the HRF with the activation patterns is a linear operation that is unable to account for natural nonlinear effects, produced when short interstimulus intervals are considered [45]. Therefore, the use of a c_δ moderately larger than zero can only be of benefit. Note, however, that large values of c_δ will cancel the benefit of exploiting the a priori imposed information.
Indeed, the extreme case c_δ = ∞ corresponds to the columns of D not being constrained at all.
In order to obtain a meaningful value, assuming that the specific neuronal activation pattern matches the experimental task, we use the result in Eq. (12) to obtain an adequate estimate of c_δ by explicitly measuring the distance between the task-related time courses generated by the cHRF and by a significantly different, yet realistic, HRF. This ensures enough flexibility to accommodate the expected natural variability of the HRF. In general, when two or more task-related time courses are involved, a different estimate of the similarity parameter per task-related time course is obtained. All these estimates provide information about how strict/relaxed the similarity constraint needs to be for each specific task-related time course. Therefore, a good candidate for c_δ is the mean value of all the obtained estimates, since this value reflects the expected mean variability among all the imposed task-related time courses. However, note that other alternatives may also be used; for example, one may set the similarity parameter, c_δ, equal to the minimum (maximum) estimate among the task-related time courses for a more conservative (relaxed) approach.
In this paper, based on the previous rationale, we estimate the similarity parameter, c_δ, for the different studied fMRI experiments as follows: first, given all the conditions, we determine their corresponding task-related time courses, δ, using the cHRF, as well as an extra task-related time course, δ*, generated using HRF 5 from Figure 1, which constitutes a significantly different, yet realistic, HRF compared with the cHRF. Then, according to Eq. (12), the difference ‖δ − δ*‖2 provides an estimate of the similarity parameter for each particular task. Finally, we set c_δ to the mean value obtained among the different tasks. For completeness, we provide a Matlab function that automatically performs this procedure given a particular set of experimental tasks. This function is available in the AIDL repository and is used to set the default value of this parameter within the main algorithm.
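The procedure above can be sketched as follows; the alternative HRF parameters below are hypothetical stand-ins for HRF 5 of Figure 1, and the two-gamma shapes are simplified versions of the actual HRFs:

```python
import numpy as np
from scipy.stats import gamma

def hrf(t, peak, undershoot, ratio):
    """Simplified two-gamma HRF; parameters are illustrative only."""
    h = gamma.pdf(t, peak) - gamma.pdf(t, undershoot) / ratio
    return h / np.abs(h).max()

def estimate_c_delta(boxcars, tr=1.0):
    """Mean l2 distance between each condition's time course generated with
    the canonical HRF and with a markedly different (but realistic)
    alternative HRF, following the rationale of Eq. (12)."""
    t = np.arange(0, 32, tr)
    h_canonical = hrf(t, peak=6, undershoot=16, ratio=6)
    h_alt = hrf(t, peak=8, undershoot=14, ratio=3)   # stand-in for "HRF 5"
    dists = []
    for u in boxcars:
        delta = np.convolve(u, h_canonical)[:len(u)]
        delta_alt = np.convolve(u, h_alt)[:len(u)]
        dists.append(np.linalg.norm(delta - delta_alt))
    return float(np.mean(dists))

# Two toy conditions with 12-scan blocks at different onsets.
u1 = np.zeros(100); u1[10:22] = 1.0
u2 = np.zeros(100); u2[50:62] = 1.0
c_delta = estimate_c_delta([u1, u2])
```

Taking the minimum or maximum distance instead of the mean yields the more conservative or more relaxed variants mentioned above.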

A.4 Dictionary Initialization
In general, the matrix factorization problem, as described in Eq. (2), admits an infinite number of potential solutions and also constitutes a non-convex optimization task. Consequently, the solution that will be obtained via the proposed algorithm, as well as by any BSS counterpart, depends on the initialization of S[0] and D[0]. As a result, a careful initialization is more likely to lead to a better outcome than initializing from a random guess of S[0] and D[0]. For this reason, a systematic initialization procedure comprising four steps is proposed here:

1. An initial estimate of S and D, denoted as S̃ and D̃, is obtained via any ICA algorithm. A known issue with ICA methods is that the same FBN, from the same independent component, might be split into several time-course/spatial-map vectors [66], [63]. To address this issue, the split sources are merged back following the merging process described in [16].
2. The columns of D̃ that are most correlated with the M task-related time courses are detected. Then, the columns of D̃ and the corresponding rows of S̃ are rearranged in order to bring the M most correlated columns to the far left of D̃ and the corresponding rows to the top of S̃. Such a row-column rearrangement is allowed since the product DS is invariant to such permutations. Then, the first M columns of D̃ are replaced with the imposed task-related time courses. This step brings the initial D̃ into the form described in Eq. (11).
3. The initialization is further refined by running the proposed algorithm for a few iterations, but imposing sparsity on the full coefficient matrix, i.e., using the alternative constraint set L_W described in Eq. (21), rather than imposing sparsity row-by-row. The reason for including this extra step is twofold: first, the ICA-based estimate is gently improved by using the imposed information concerning both the overall sparsity and the imposed task-related time courses. Second, the resulting coefficient matrix S̃ becomes sparse and in agreement with the imposed sparsity specifications. Indeed, this was not the case before the execution of this step, since the coefficient matrix produced by ICA is dense. In particular, using the pseudocode described in Algorithm 1 in the main text, the for loop in lines 6-10 is replaced accordingly.

Although the user is advised to proceed with the initialization method described above, the algorithm is theoretically guaranteed to converge to a local minimum irrespective of the initial values of S[0] and D[0]. Note, also, that using ICA as the initialization point has a Bayesian flavor: the imposed sparsity constraint that acts component-wise (see L_w) works similarly to an implicit independence assumption over a prior. Of course, independence over the prior does not necessarily imply independence over the posterior. In this way, initializing our algorithm from ICA constitutes an effective synergy between ICA and DL; using ICA only as an initialization allows the DL algorithm to go beyond the independence assumption, while preserving the contribution of ICA as a good reference point. Similarly, other approaches have tried to combine the benefits of sparsity and independence. For example, in [30], a hybrid method called sparse-ICA tries to integrate both approaches into a single algorithm.
However, the independence assumption, as well as the lack of consistent ways to tune the sparsity constraint, limits the power of these approaches in the fMRI case.
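Step 2 of the initialization (the matching, permutation and replacement) can be sketched as follows; the ICA estimate itself (step 1) and the merging and refinement steps are assumed to be carried out elsewhere:

```python
import numpy as np

def impose_task_courses(D0, S0, task_tcs):
    """Sketch of initialization step 2: given an initial ICA estimate
    (D0: T x K time courses, S0: K x N spatial maps), permute the K
    components so that those most correlated with the M task-related time
    courses come first, then overwrite those columns with the imposed
    courses.  The product D @ S is invariant to the permutation."""
    K = D0.shape[1]
    M = task_tcs.shape[1]
    order, used = [], set()
    for m in range(M):                            # greedy one-to-one matching
        corr = np.abs(np.corrcoef(task_tcs[:, m], D0.T)[0, 1:])
        for j in used:
            corr[j] = -1.0                        # exclude already matched
        j = int(np.argmax(corr)); used.add(j); order.append(j)
    order += [j for j in range(K) if j not in used]
    D, S = D0[:, order].copy(), S0[order, :].copy()
    D[:, :M] = task_tcs                           # the form of Eq. (11)
    return D, S

# Toy example: component 3 of D0 is a noisy version of the task course.
rng = np.random.default_rng(3)
D0 = rng.standard_normal((60, 5))
S0 = rng.standard_normal((5, 200))
task = D0[:, 3:4] + 0.1 * rng.standard_normal((60, 1))
D, S = impose_task_courses(D0, S0, task)
```

In the toy run, the matched component is moved to the front and its time course is replaced by the imposed task regressor, while its spatial map is preserved.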

A.5 Detection of the FBNs in Real datasets
The standard practice when working with BSS techniques for fMRI analysis is the detection and representation of the obtained sources/FBNs on the brain. In the fully blind case, the time courses of the dictionary hopefully produce reasonable estimates of their corresponding hemodynamic responses, including those induced by the FBNs of interest. Similarly, the semi-blind methods assume that the imposed task-related time course provides an adequate representation of the real brain activity. An alternative way to determine the spatial distribution of each source is to use the obtained time courses to directly estimate the coefficient matrix, say S̃, using the pseudoinverse of the dictionary, that is, S̃ = D†X, where X is the data matrix. In practice, this is a well-established approach, particularly when working with ICA methods, which directly provide an estimate of the pseudoinverse of the dictionary, usually referred to as the "unmixing matrix", which is later used to determine the coefficient matrix from the data. Under this approach, each row of the obtained coefficient matrix represents the "weights" that specify the contribution of each time course to its corresponding spatial map. Then, the values of each spatial map can be mapped back to the brain volume space, representing its spatial distribution on the brain. Nevertheless, the obtained coefficient matrix, S̃, potentially turns out dense, i.e., most of its values are non-zero, since noise and small residuals from the decomposition may also appear, interfering with the location of the main patterns. Consequently, to obtain interpretable results, the spatial maps are "sparsified" using significance tests, such as z-scores, which provide statistical support to infer the significant activation patterns that correspond to each source [67], [19], [68], [69], [38].
This modus operandi is equally valid for DL methods, where one can use the pseudoinverse to estimate the coefficient matrix and then apply statistical tests to infer the corresponding significant activation patterns.
On the other hand, sparsity-based approaches offer an alternative way to represent the spatial distribution of the brain activity. In general, sparsity-based approaches, such as DL, simultaneously produce estimates of the dictionary and its corresponding sparse representation in the coefficient matrix, S. In theory, proper sparsity constraints naturally force to zero those values of each spatial map that are unrelated to the specific source; on the contrary, the non-zero values represent the regions with important activity. Consequently, for each spatial map, the representation of the values s_i > 0 also provides meaningful information concerning the spatial distribution of the i-th detected FBN.
In this paper, we followed the first, standard procedure to represent the main activation patterns using statistical tests, since it provides a fair comparison between the studied methods. Thus, the real fMRI results presented in Section 4.4 of the main text were obtained as follows: first, using the pseudoinverse of the dictionary, we estimated the corresponding coefficient matrix. Then, we scaled each spatial map to z-values; voxels that exhibited values greater than z = 1.97 were considered significantly active [67]. Finally, the z-statistical scores of the detected blobs were superimposed on the corresponding anatomical slices for visualization.
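The two-step procedure just described (pseudoinverse of the dictionary, then z-scoring of each spatial map) can be sketched as follows. This is a minimal illustration under our own naming, not the exact pipeline used in the paper; the per-row z-scoring convention is our assumption:

```python
import numpy as np

def spatial_maps_from_dictionary(X, D, z_thresh=1.97):
    """Estimate spatial maps via the pseudoinverse of the dictionary and
    keep the voxels whose z-scores exceed the threshold (z = 1.97, i.e.,
    p < 0.05, as in the text).

    X: (T, N) data matrix (time points x voxels).
    D: (T, K) dictionary, one time course per column.
    """
    S = np.linalg.pinv(D) @ X                      # S = D^+ X, generally dense
    # Scale each spatial map (row of S) to z-values.
    Z = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    active = Z > z_thresh                          # significantly active voxels
    return S, Z, active
```

In the noiseless case the pseudoinverse recovers the coefficient matrix exactly; with real data, the z-thresholding step is what removes the dense residuals discussed above.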
For completeness, we also explored the alternative way to display the spatial distribution of the obtained FBNs, using the sparse coefficient matrix directly. In this case, we only studied the two sparsity-based methods, SDL and IADL. The comparison between the obtained sparse spatial maps is presented in Appendix E.4 as additional results.

B Brain Atlas Sparsity Information
In general, brain atlases contain information regarding the anatomical structure of the brain and the spatial distribution of some FBNs. In our context, this information can be used directly to infer the sparsity percentage of a particular FBN, that is, to estimate the proportion of voxels that belong to that FBN, which can then serve as extra a priori information for tuning the associated sparsity parameters.
In practice, we distinguish two types of atlases: functional connectivity atlases and anatomical/structural atlases. The functional connectivity-based atlases contain information about the spatial distribution of different FBNs, regardless of their anatomical proximity or location. Examples of this type of atlas are [58] and [59], which contain some standard FBNs, such as the Auditory, Default Mode and Language networks. On the other hand, the anatomically-based atlases use several anatomical/structural arguments to define different brain partitions. Although, a priori, anatomical atlases may not provide information concerning the sparsity percentage of a specific FBN, the corresponding value can easily be estimated from the sparsity percentages of the principal regions of interest occupied by that FBN in the anatomical atlas. For example, assume one particular FBN distributed over N distinct anatomical regions of an anatomical brain atlas, with sparsity percentages θ_1, θ_2, ..., θ_N. Then, the sparsity percentage for that FBN, θ_FBN, is given by:

θ_FBN = Σ_{i=1}^{N} θ_i.    (13)

Note that this equation holds if and only if the anatomical areas are non-overlapping, which is the case for any anatomical brain atlas. Consequently, both types of atlases are suitable for obtaining estimates of the sparsity percentage of a specific FBN. To illustrate the reliability and consistency of this procedure, we present the sparsity percentages of some common FBNs, using two different brain atlases: a) the Automatic Anatomical Labeling (AAL) atlas, which is a well-established anatomically-based brain atlas that contains a total of 116 regions of interest, and b) the Multi-Subject Dictionary Learning (MSDL) atlas [58], which is a functional connectivity-based atlas that includes some common FBNs obtained from a resting-state fMRI study. Table 3 shows the sparsity percentage with respect to the brain for some FBNs in both atlases.
In this case, for the MSDL atlas, the FBNs were already defined, so the sparsity percentage was measured directly. With respect to the AAL atlas, we measured the total sparsity of the principal regions of interest occupied by the specific FBN of interest using Eq. (13). Comparing the sparsity percentages for some common FBNs (see Table 3), observe that the obtained values are similar in both atlases, even though each atlas was independently defined following a different approach. Hence, the sparsity percentage is largely independent of the atlas, and any brain atlas potentially contains exploitable information. For completeness, Table 4 also lists the sparsity of some of the principal anatomical structures of the human brain from the AAL atlas.
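Eq. (13) amounts to counting, per region, the fraction of brain voxels the region occupies and summing over the non-overlapping regions that make up the FBN. A minimal sketch, where the function name and the label-array representation of the atlas are our own assumptions:

```python
import numpy as np

def fbn_sparsity_percentage(atlas_labels, region_ids):
    """Sparsity percentage of an FBN as the sum of the sparsity percentages
    theta_i of the non-overlapping atlas regions it occupies (Eq. (13)).

    atlas_labels: 1-D integer array, one region label per brain voxel.
    region_ids:   labels of the atlas regions that make up the FBN.
    """
    atlas_labels = np.asarray(atlas_labels)
    n_brain = atlas_labels.size
    # theta_i = (# voxels in region i) / (# brain voxels)
    thetas = [np.count_nonzero(atlas_labels == r) / n_brain for r in region_ids]
    return 100.0 * sum(thetas)  # theta_FBN, as a percentage of the brain
```

Because anatomical regions do not overlap, the same routine works for both atlas types: for a functional atlas each FBN is a single "region", while for an anatomical atlas one passes the list of regions occupied by the FBN.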

C Realistic Synthetic Dataset
In this section, a new, highly realistic synthetic dataset is developed using as a basis SimTB, an open MATLAB toolbox for the design of fMRI-like data, which has been widely used in several studies, such as [70], [71] and [72]. This dataset allows us to effectively evaluate the performance of the proposed DL method in comparison with state-of-the-art blind and semi-blind approaches.
The new dataset resembles demanding experimental designs, where the sources exhibit highly different relative energies and the brain areas that correspond to different conditions overlap. This renders them hardly detectable by conventional BSS methods. Moreover, the new dataset is examined under realistic noise levels and statistics, and it also comprises a relatively large number of components/sources that exhibit a wide range of sparsity characteristics. In order to achieve the above, and also to study the effect of HRF mis-modeling, certain interventions in the SimTB data generation procedure were required.
SimTB implements a simple spatiotemporally separable model characterized by three fundamental features: a) the spatial maps are smooth (Gaussian-shaped), b) the sources barely overlap, and c) a fixed HRF is considered. Although these features simplify the study of synthetic experiments, they are unrealistic; the overlap between the standard sources of SimTB only affects small areas with relatively low intensity, whereas, in real fMRI, some sources may exhibit high overlap [73]. Moreover, the HRF naturally varies among subjects [25], [26].
For these reasons, we modified the default mode of SimTB to obtain a more realistic synthetic fMRI-like dataset. First, we translated some of the sources in space to increase their relative overlap. Second, instead of smooth Gaussian-shaped spatial maps, which show maximum activity at only one specific peak/voxel, we preferred to have more than one neighboring voxel activated with the same intensity. We achieved this by starting from the standard SimTB spatial maps and reducing the most highly activated voxels, within a small neighborhood, to the same relative energy.
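This second modification can be illustrated by clipping the peak of a smooth map so that several neighboring voxels share the maximal intensity. This is only a sketch; the clipping fraction and the function name are our assumptions, not SimTB code:

```python
import numpy as np

def flatten_peak(spatial_map, frac=0.9):
    """Clip a smooth (Gaussian-shaped) spatial map so that every voxel above
    frac * max is set to the same value, producing a small plateau of equally
    activated neighboring voxels instead of a single-voxel peak."""
    return np.minimum(spatial_map, frac * spatial_map.max())
```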
Regarding the HRF, many different parametric models have been built to capture its shape, the most popular being the two-gamma distribution model [74], [4], [26]. A certain set of parameter values produces the canonical HRF (cHRF), which represents a parametric approximation of real HRFs [74]. The cHRF is the default model in many toolboxes, such as SimTB [70], and it is also used in the GLM/SPM analysis for the computation, via convolution, of the task-related time courses. Since one of the scopes of this study is to analyze the effects of HRF mis-modeling, we account for the natural variability of the HRF by emulating six different synthetic subjects, each one having its own specific HRF. Then, we generate one dataset per subject, since all the brain-induced sources are HRF-dependent.
The distinct subject-specific HRFs were generated using the two-gamma distribution model and, in order to ensure that they are realistic enough, we followed the procedure described next. We fed the free parameters of the two-gamma distribution model with values drawn from a uniform distribution centered on the parameter values that correspond to the cHRF. The support of this distribution was experimentally tuned to a level that led to HRFs which visually resemble the natural variability of HRFs. For better inspection, a total of 100 HRFs was generated (gray lines in Figure 1) and qualitatively compared with natural HRFs estimated from real fMRI studies, e.g., Figure 3 and Figure 4 of [26]. It is clear that the 100 simulated HRFs are remarkably similar to true ones. For the simulation studies presented in Section 4 of the main text, five randomly generated HRFs, as well as the cHRF, were used (depicted with colored curves in Figure 1). For simplicity, hereafter we define the "canonical" subject as the one that uses the cHRF, and we introduce the letters A, B, C, D and E to refer to the subjects that correspond to the HRFs 1, 2, 3, 4 and 5, respectively.
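The generation of subject-specific HRFs can be sketched with the two-gamma model below. The canonical parameter values (peak and undershoot shapes 6 and 16, unit scales, undershoot ratio 1/6) follow the common SPM-style convention, and the ±15% uniform support is purely our assumption, since the paper tuned the support experimentally:

```python
import numpy as np
from math import gamma

def two_gamma_hrf(t, a1=6.0, a2=16.0, b1=1.0, b2=1.0, c=6.0):
    """Two-gamma HRF model: a gamma-shaped peak minus a scaled gamma-shaped
    undershoot. The defaults give (up to scaling) a canonical-HRF-like shape."""
    peak = t ** (a1 - 1) * b1 ** a1 * np.exp(-b1 * t) / gamma(a1)
    undershoot = t ** (a2 - 1) * b2 ** a2 * np.exp(-b2 * t) / gamma(a2)
    return peak - undershoot / c

def subject_hrf(t, rng, spread=0.15):
    """Draw subject-specific HRF parameters from uniform distributions
    centered on the canonical values (the width `spread` is hypothetical)."""
    a1 = rng.uniform(6.0 * (1 - spread), 6.0 * (1 + spread))
    a2 = rng.uniform(16.0 * (1 - spread), 16.0 * (1 + spread))
    return two_gamma_hrf(t, a1=a1, a2=a2)
```

Convolving a subject's binary task design with its own `subject_hrf` while the analysis assumes the canonical shape reproduces the HRF mis-modeling scenario studied in the paper.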
The subject-specific datasets comprise 15 brain-like sources. Each source is defined by its spatial map, which is represented by a single brain slice of size 100 × 100 voxels, and a time course that comprises 300 time instances with a difference of two seconds between acquisition times (TR = 2 s). Figure 7 shows the brain-like sources for the canonical subject, which include all the modifications discussed above. The rest of the subjects correspond to the same spatial maps, but their time courses differ, depending on their respective HRF. Moreover, in Table 5, the anatomical correspondence as well as the sparsity level of each source are also listed.
Besides the brain-induced sources, real fMRI data also contain sources that are either scanner-induced or related to biological processes unrelated to brain activity, such as heart beating, breathing, movements, etc. All these phenomena are collectively referred to as artifacts. In the new dataset, we have also included realistic fMRI-like artifact sources, in particular the sources 3, 4, 5, 7 and 8 of the well-known dataset [75]. Specific information concerning each artifact is also provided in Table 5. Note that the majority of the brain-like sources are sparse, whereas most of the artifacts are dense. Finally, the subject-specific datasets are corrupted by Rician noise with SNR = 0 dB. Both the noise distribution and energy are realistic for the fMRI data case [76], [77].
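Rician noise arises as the magnitude of a signal corrupted by complex Gaussian noise. A sketch of how such noise could be added at a prescribed SNR follows; the SNR convention (signal power over total noise power, in dB) and the function name are our assumptions:

```python
import numpy as np

def add_rician_noise(X, snr_db=0.0, rng=None):
    """Corrupt X with Rician noise: the magnitude of (X + complex Gaussian
    noise), i.e., sqrt((X + n_r)^2 + n_i^2), scaled to the requested SNR."""
    rng = np.random.default_rng() if rng is None else rng
    noise_power = np.mean(X ** 2) / 10.0 ** (snr_db / 10.0)
    sigma = np.sqrt(noise_power / 2.0)            # per real/imaginary channel
    n_r = rng.normal(0.0, sigma, X.shape)
    n_i = rng.normal(0.0, sigma, X.shape)
    return np.sqrt((X + n_r) ** 2 + n_i ** 2)
```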

D Pearson's Correlation Coefficient-Based Performance Measure
In this direction, we are interested in measuring the discrepancy between the true and the estimated sources. Since each source expresses itself via a time course/spatial map pair, (d_i ∈ R^{T×1}, s^i ∈ R^{1×N}), a way to jointly evaluate the decomposition performance with respect to both is to work with the matrix F_i = d_i s^i ∈ R^{T×N}, which represents the full i-th source as it is expressed in all voxels along time. Ideally, F_i should be equal to F̄_j = d̄_j s̄^j ∈ R^{T×N}, where d̄_j and s̄^j denote the time course and the spatial map of the corresponding true source. Then, to measure the quality of the performance for the j-th true source, we adopted:

r_j = (ρ(F_i, F̄_j))^2,    (14)

where ρ(A, B), with A, B ∈ R^{M×N}, is the Pearson's correlation coefficient for matrices, given by:

ρ(A, B) = (1 / (M N σ_A σ_B)) Σ_{m=1}^{M} Σ_{n=1}^{N} (a_{mn} − μ_A)(b_{mn} − μ_B),

where μ_A and σ_A are the mean and the standard deviation of the entries of A, respectively, and μ_B, σ_B are the mean and the standard deviation of the entries of B.
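The matrix Pearson coefficient ρ(A, B) is simply the ordinary Pearson correlation computed over all MN entries. A minimal sketch (the function name is ours):

```python
import numpy as np

def matrix_pearson(A, B):
    """Pearson's correlation coefficient between two equally sized matrices,
    computed over all entries (flattened)."""
    a, b = np.ravel(A), np.ravel(B)
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
```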
Observe that the measure in Eq. (14) assumes that the i-th unmixed source corresponds to the j-th true one. Nevertheless, in practice, the correspondence between the true sources and the unmixed sources is unknown (since the solution of the matrix decomposition is, in general, only unique up to a permutation). Therefore, to measure the quality of the performance, it is necessary to find the correspondence of each one of the K̄ true sources with one of the K estimated sources (where, in general, K̄ ≠ K). The procedure followed is described next: 1. Define the matrix C ∈ R^{K̄×K}, with c_ij = (ρ(F̄_i, F_j))^2. Observe that the larger an entry c_ij is, the more likely it becomes that the i-th true source and the j-th estimated source are a match.
2. Detect the assisted sources: this is straightforward, since we know that these are the first M of the estimated sources. Denote by p = [p_1, ..., p_M] the vector with the indices of the true sources associated with each one of the first M assisted sources (in our specific dataset, p = [1, 11, 14]). Then, the quality of the decomposition for the assisted sources, according to Eq. (14), is given by r_{p_i} = c_{p_i,i}, with i = 1, ..., M.
3. Remove the contribution of the assisted sources from the matrix C by setting to zero their associated rows and columns, that is, c^{p_i} = 0 and c_i = 0, for i = 1, ..., M.
4. Detect the non-assisted sources: this can be efficiently done with the aid of the matrix C. First, we find the maximum entry of the matrix C, say c_ij. This entry reveals that the best match of the i-th true source is the j-th estimated source; consequently, r_i = c_ij. Then, in order to exclude the detected source, we set to zero its contribution to the rest of the matrix, that is, c^i = 0 and c_j = 0. This process is repeated for all the remaining true sources, i.e., until C = O.
After these steps, the result is a vector r, where each component, r_i, reflects the quality of the decomposition associated with the i-th source. Then, this information can be used to quantify the performance in different ways. For example, we can take the mean of the values associated with the assisted sources only, or we can take the mean over all the values of r to obtain a general overview of the performance of the decomposition as a whole. Apart from the full-source performance measure, the decomposition performance with respect to the time courses only, i.e., with respect to the estimated dictionary D, is also of special interest. Indeed, as we discussed in Section A.5, the spatial distributions of the FBNs are commonly determined by applying a statistical analysis to the spatial maps, which are in turn commonly estimated via the pseudoinverse of the dictionary. Therefore, to determine the quality of the decomposition for the time courses, we adopted the alternative performance measure:

r̄_j = (ρ(d_i, d̄_j))^2,

where now d_i is the i-th obtained time course and d̄_j is its corresponding true time course. In order to compute the values for all the time courses, we followed the same procedure as described above, but with the matrix C defined via c_ij = (ρ(d̄_i, d_j))^2 instead. Similarly, the result is a vector r̄, where each value, r̄_i, reflects the quality of the decomposition of the time course of the i-th source.
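Steps 1-4 above can be sketched as a greedy matching on the squared-correlation matrix C. This is an illustrative implementation under our own naming, with the assisted sources assumed, as in the text, to be the first M estimates:

```python
import numpy as np

def match_sources(F_true, F_est, assisted_idx=()):
    """Greedy matching of estimated to true sources (steps 1-4).

    F_true, F_est: lists of full-source matrices F = d s (time x voxels).
    assisted_idx:  0-based true-source indices of the first M (assisted)
                   estimates, e.g. (0, 10, 13) for p = [1, 11, 14].
    Returns r, where r[i] is the squared correlation achieved by true source i.
    """
    def rho2(A, B):
        a, b = np.ravel(A), np.ravel(B)
        r = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
        return r ** 2

    # Step 1: squared correlations between all (true, estimated) pairs.
    C = np.array([[rho2(Ft, Fe) for Fe in F_est] for Ft in F_true])
    r = np.zeros(len(F_true))
    # Steps 2-3: assisted sources are, by construction, the first M estimates.
    for j, p in enumerate(assisted_idx):
        r[p] = C[p, j]
        C[p, :] = 0.0
        C[:, j] = 0.0
    # Step 4: repeatedly take the best remaining (true, estimated) pair.
    while C.max() > 0.0:
        i, j = np.unravel_index(np.argmax(C), C.shape)
        r[i] = C[i, j]
        C[i, :] = 0.0
        C[:, j] = 0.0
    return r
```

Replacing the full-source matrices by the time courses d_i yields the dictionary-only measure r̄ with the same matching logic.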

E.1 Individual results from the synthetic fMRI analysis
In Section 4.1 of the main text, we present the performance results based on the synthetic dataset, where Figure 2 summarizes the comparison of the different studied methods. To further investigate the synthetic results shown in Figure 2, we depict the individual performance comparison results for the assisted sources, including the time courses and spatial maps obtained by some of the different methods. Figure 9 shows the individual results for each assisted source. The left graph includes the main comparison using the introduced performance measure among sources, whereas the right part displays one specific noise realization of one particular subject, in this case, subject B.
Observe that all the methods are capable of separating the assisted source 1 correctly, as expected, since we selected this assisted source as an "easy" source example. On the other hand, the assisted sources 2 and 3 (which represent the sources 11 and 14 of the synthetic dataset, Figure 7) explicitly require assistance to be unmixed properly. Moreover, for the SDL method, we observe the same issues as in Figure 3 of the main text: the SDL method only works if the HRF of the subject is close enough to the cHRF. On the contrary, IADL successfully unmixes all the assisted sources and exhibits a more stable performance across the synthetic subjects.

E.2 GLM analysis for the synthetic dataset
For the study of the GLM on the synthetic dataset, we performed a comparison between the proposed IADL and the standard GLM approach, using the software toolbox SPM12. The design matrix of SPM includes the three task-related time courses that we used for IADL and the rest of the information-assisted algorithms, as described in Section 4.2 of the main text. Figure 10 shows the normalized parameter maps for the assisted sources corresponding to (a) the canonical subject and (b) subject E. Furthermore, for completeness, Figure 11 depicts the corresponding statistical parametric maps for both studied approaches with z > 1.97 (p < 0.05, uncorrected).
In both figures, we observe that SPM and IADL recover the assisted sources 1 and 3 successfully, similarly to the other studied algorithms, since these two synthetic sources are relatively easy to identify. However, for the assisted source 2, the result of SPM is significantly inferior to that of IADL. Specifically, the result of SPM contains considerable residuals from other overlapping sources, which partially remain as false positives in Figure 11, whereas the source of interest appears considerably deteriorated. On the other hand, IADL exhibits no residuals from other sources, and the source of interest is recovered correctly.

E.3 Performance evaluation as a function of the number of sources
In order to evaluate the tolerance of the proposed algorithm to the choice of the number of sources, K, we studied the performance of IADL for different numbers of sources on both the synthetic and the real dataset.

Synthetic Data Analysis
Concerning the synthetic dataset, we studied the performance of IADL for different numbers of sources, K = 5, 10, 15, ..., 40, keeping the same three assisted sources with the same parameters, as described in Section 4.1 of the main text. For each number of sources, we implemented a different set of sparsity percentages for the free part of the dictionary, following the principal guidelines described in Appendix A.2. Table 6.a contains the particular values used for each number of sources. Note that the sparsity percentage for the assisted sources is the same in all the studied cases. Furthermore, the sparsity percentages for K = 25 correspond to the parameter θ_2 from Table II.b in the main text. Figure 12 shows the performance comparison obtained for each specific assisted source as a function of the number of sources. Similarly, Figure 13 depicts the mean performance of the three assisted sources, the brain-like sources and all the sources (including artifacts), respectively. In both figures, the red line depicts the results of the canonical subject and the blue line the results of subject E from the synthetic dataset. Observe that the obtained results agree with the expected behavior of standard DL techniques: the algorithm exhibits problems in unmixing the different components if the number of sources is smaller than the correct one, whereas the performance of IADL remains more or less stable for both subjects if the number of sources is overestimated.
Furthermore, we observe that IADL is particularly robust against the overestimation of K; comparing the results for 35 and 40 with those for 20 (the correct number), we observe that all the sources are present in all the decompositions, while most of the extra sources in the overestimated cases appear filled with random noise.

Real Data Analysis
In this case, we studied the same task-related fMRI experiment used in the main text, implementing different numbers of sources, K = 10, 15, 20, 25, ..., 40 and 60.
For this analysis, we randomly selected a single subject from the studied dataset. The rest of the parameters, including the imposed task-related time courses for the six assisted sources, were the same as described in Section 4.5 of the main text. As in the synthetic experiment, we implemented a different set of sparsity parameters for the free part of the dictionary, following the guidelines described in Appendix A.2. Table 5.b depicts the particular selection of the sparsity percentages used in this experiment. The parameter for K = 20 corresponds to the parameter θ from Table II.c, which was used for the analysis of the real data in the main text. The absence of a ground truth in real data hinders a direct performance comparison, in contrast to the synthetic data. Therefore, for the real data, we analyzed the consistency of the obtained results by searching for common components among all the studied numbers of components and visually inspecting their corresponding spatial maps. First of all, for K = 10 and K = 15, the performance of the algorithm deteriorates compared with the results observed in the main text (K = 20). However, for K between 20 and 40, we observe that all the obtained results present between 17 and 20 common components. Among these common components, we observed the six different spatial maps that correspond to the six assisted sources, while the rest belong to extra common sources observed within the free part of the dictionary. For K = 60, we only observed 14 common components, and these results were slightly worse compared with the rest of the analysis. Figure 14 depicts the results of the assisted sources for K = 20, K = 40 and K = 60. Observe that IADL provides similar results independently of the number of sources. Furthermore, Figure 15 shows three common components observed in the free part of the dictionary for different numbers of sources.
Consequently, from this study we conclude that the performance of the IADL algorithm is consistently robust to overestimating the number of sources, which agrees with the expected behavior of standard DL methods.

E.4 Spatial maps from the sparse representation
Sparsity-based BSS methods simultaneously provide an estimate of the dictionary and its sparse representation in the coefficient matrix. As we discussed in Appendix A.5, an alternative way to display the spatial distribution of the detected FBNs of interest is to use the sparse spatial maps from the coefficient matrix.
Therefore, to study the reliability of the sparse spatial maps, we depict the results of the group analysis of the motor task-fMRI performed in Section 4 of the main text, using only the obtained coefficient matrix. Figure 16 depicts the spatial maps of the different studied conditions. The first two columns show the sparse spatial maps of SDL and IADL from the group analysis, where we only display the values greater than zero (s_i > 0). This is because, theoretically, the sparsity constraint pushes to zero the values with null contribution to that source. Then, we normalized the maximum intensity to 1 for visualization. The third column shows the same result as presented in Figure 5.b, where the spatial maps were obtained using z-scores. We included this extra column to facilitate a visual comparison.
Observe that, although SDL and IADL are both sparsity-based approaches, the results obtained are entirely different: first, SDL presents an overwhelming ratio of false positives, particularly for the hands and foot tasks, which makes it impossible to infer any interpretable result. On the contrary, IADL presents much clearer results, even comparable to those obtained using z-statistical tests.

Figure 16: Comparison between the obtained sparse spatial maps for the group analysis. The first and second columns show the positive values (s_i > 0) of the spatial maps of SDL and IADL, respectively. The third column shows, as a reference, the same results depicted in Figure 5 of the main text, which contains the results of the group analysis.

E.6 Extra results from the comparison between IADL and SPM
Figure 6 of the main text shows a comparison between IADL and SPM. That figure is displayed between a lower and a higher threshold. We specifically selected this particular interval to help visualize what type of activation would be present at a specific threshold, in the same way as described in [53].
Therefore, to further evaluate the performance of both approaches, as well as the presence of potential false positives, Figure 18 shows the significantly active voxels after applying the Bonferroni correction, which is considered a conservative correction. Observe that the majority of the false positives detected in Figure 6 disappeared after the correction, as expected. Interestingly, SPM still displays some significant voxels within the expected region of interest after this correction, whereas IADL still produces a good visualization of the correct region of interest. Figure 19 displays the significantly active voxels after the Bonferroni correction rendered over the cortex. Observe that SPM still retains some small clusters of voxels within the regions of interest, precisely over the left hemisphere, whereas IADL consistently depicts the main activation areas expected for each movement type [54,55].

Figure 19: Rendered significantly active voxels, z > 5.00 (Bonferroni-corrected), from the group parametric maps for SPM and IADL.

F Mathematical Derivation of the Proposed Algorithm
In this section, we provide the full mathematical derivation of the proposed Information Assisted Dictionary Learning (IADL) algorithm.

F.1 Notation
A lower case letter, x, denotes a scalar, a bold capital letter, X, denotes a matrix, and a bold lower case letter, x, denotes a vector, with its i-th component denoted as x_i. The i-th row and the i-th column of a matrix, X ∈ R^{M×N}, are represented as x^i ∈ R^{1×N} and x_i ∈ R^{M×1}, respectively. Moreover, x_ij denotes the element located at row i and column j of the matrix X. The vector 1_N denotes a column vector of size N having all its components equal to one, and I_N is the identity matrix of size N × N. The notation A = diag(x) stands for the diagonal matrix having a_ii = x_i. Finally, given an arbitrary vector, x ∈ R^N, the notation x > a stands for x_i > a, ∀ i = 1, 2, ..., N.

F.2 Optimization Task
In this paper, we focus on the matrix factorization problem

X ≈ DS,

where D ∈ R^{T×K} is the dictionary and S ∈ R^{K×N} is the coefficient matrix, as presented in the main text. In general, this matrix factorization problem can be cast as the following constrained optimization task:

min_{D,S} ||X − DS||_F^2, s.t. D ∈ D, S ∈ L,    (18)

where D, L are two sets of admissible constraints. According to the nature of the studied problem, we set D = D_δ and L = L_w, where these two convex sets (see convexity proofs in Appendix F.3) are defined as:

D_δ = {D ∈ R^{T×K} : ||d_i − δ_i||_2^2 ≤ c_δ, i = 1, ..., M, and ||d_i||_2^2 ≤ c_d, i = M+1, ..., K},

where ||·||_2 denotes the Euclidean norm, d_i is the i-th column of the dictionary D, δ_i is the i-th a priori selected task-related time course, and c_δ, c_d are two positive user-defined constants.
L_w = {S ∈ R^{K×N} : ||s^i||_{1,w^i} ≤ φ_i, i = 1, 2, ..., K},

where ||·||_{1,w} denotes the weighted ℓ1-norm, w^i is the vector of weights that corresponds to the row s^i, and φ_i is the expected number of non-zeros of the i-th row of the coefficient matrix.
As an alternative to the constrained set above, we also introduce a new convex set (see proof in Appendix F.3), where the weighted ℓ1-norm is used to impose sparsity over the full coefficient matrix:

L_W = {S ∈ R^{K×N} : ||S||_{1,W} ≤ Φ},

where Φ is the maximum number of non-zero elements of the full coefficient matrix, S, and W ∈ R^{K×N} is the matrix of the associated weights, with w_ij = 1/(|s_ij| + ε) (the same definition introduced in Eq. (6) of the main text). The notation ||·||_{1,W} stands for a generalization of the weighted ℓ1-norm to matrices, which is compactly written as:

||S||_{1,W} = Σ_{i=1}^{K} Σ_{j=1}^{N} w_ij |s_ij|,

where |S| returns the absolute value of each element of S. At this point, observe that, if we have a good guess about the values φ_i, it is possible to set the sparsity level of L_W as Φ = Σ_{i=1}^{K} φ_i, forcing both constrained sets to lead to the same overall sparsity of S.
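In code, the weighted ℓ1-norm for matrices and the reweighting rule w_ij = 1/(|s_ij| + ε) are one-liners; a sketch with hypothetical names:

```python
import numpy as np

def weighted_l1(S, W):
    """||S||_{1,W} = sum_ij w_ij |s_ij| (weighted l1-norm for matrices)."""
    return float(np.sum(W * np.abs(S)))

def reweight(S, eps=1e-3):
    """Weights w_ij = 1 / (|s_ij| + eps), as in Eq. (6) of the main text;
    small entries receive large weights and are pushed harder towards zero."""
    return 1.0 / (np.abs(S) + eps)
```

Note that with W equal to the all-ones matrix, `weighted_l1` reduces to the ordinary entrywise ℓ1-norm, so the weighted ball strictly generalizes the standard one.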
In general, the simultaneous optimization with respect to both D and S in Eq. (18) is challenging, due to both the non-convexity of the task and the potentially large size of the data matrix. For the above reasons, we adopted the Block Majorization Minimization (BMM) rationale, which provides a powerful framework for the solution of such optimization tasks [48,47]. The main strategy of the BMM consists of replacing the actual cost function at each step by an approximating function, called a surrogate function, which majorizes the former and simultaneously facilitates the optimization process.
For example, starting from an arbitrary set of estimates D^[0] and S^[0], at the t-th iteration the BMM for the studied optimization task comprises the following two steps:

S^[t+1] = argmin_{S ∈ L} ψ_S(S, S^[t]),    (23)
D^[t+1] = argmin_{D ∈ D} ψ_D(D, D^[t]),    (24)

where ψ_S is the so-called surrogate function of the first step, given D fixed to its current estimate D^[t], and ψ_D is the corresponding surrogate function of the second step, given S fixed to its current estimate S^[t+1]. In practice, finding an appropriate surrogate function (which may not be unique) is a complicated task and also problem dependent. Furthermore, the surrogate function has to satisfy specific constraints [48] to guarantee that the proposed two-step iterative minimization converges to an optimal value of the original optimization task, as we discuss in detail below in Appendix F.4.
After studying several candidates for the surrogate function [47], we finally employed the second-order Taylor expansion for matrices of the original cost functions to obtain the corresponding surrogate functions, motivated by two reasons: a) the Taylor expansion guarantees that all the conditions on the surrogate functions are satisfied by construction [47], and b) the Taylor expansion simplifies the optimization process, avoiding computationally expensive operations, e.g., matrix inversions, which are notably restrictive in this particular case due to the natural size of the fMRI data.
Step I - Coefficient Update. According to Eq. (23), the update stage at the t-th iteration is formulated as:

S^[t+1] = argmin_{S ∈ L} ψ_S(S, S^[t]).    (25)

Regarding the surrogate function, ψ_S, with the dictionary fixed to its current estimate, D = D^[t], we use the matrix form of the second-order Taylor expansion (see Appendix D in [78]) to derive the following surrogate function:

ψ_S(S, S^[t]) = ||X − DS^[t]||_F^2 + 2 tr{(S − S^[t])^T D^T (DS^[t] − X)} + c_S ||S − S^[t]||_F^2,

where c_S ≥ ||D^T D|| is a constant and ||·|| stands for the spectral norm. Observe that this surrogate function is convex and majorizes its corresponding loss function by construction.
At this point, the BMM does not impose any specific optimization process. We thus employ a Lagrangian relaxation to solve the current minimization task. Introducing the Lagrange multipliers, the optimization becomes:

min_S ψ_S(S, S^[t]) + P_γ(S),    (26)

where P_γ is a penalty term that depends on the weighted ℓ1-norm constraints and is defined as:

P_γ(S) = Σ_{i=1}^{K} γ_i (||s^i||_{1,w^i} − φ_i),    (27)

where the parameters γ_i ≥ 0, for i = 1, 2, ..., K, are the Lagrange multipliers, K is the number of considered spatial maps (rows of S), and φ_i is the expected maximum sparsity, that is, the maximum expected number of active voxels of the i-th spatial map.
From optimization theory, the Lagrange multipliers at the optimal point satisfy the corresponding KKT conditions, i.e.,

γ_i (||s^i||_{1,w^i} − φ_i) = 0, γ_i ≥ 0, ||s^i||_{1,w^i} ≤ φ_i,    (28)

with i = 1, 2, ..., K. Now, observe that the optimization task in Eq. (26) involves a convex loss function. Hence, given the surrogate function at the t-th iteration, a global minimum exists.
In order to determine the global minimum, note that, although the function is convex and we know that such a point exists, the loss function is not differentiable. Nevertheless, we can obtain the minimum at a point with zero subgradient. That is, a minimizer of the function exists if and only if the zero (matrix) belongs to the subdifferential set, which is given by: 0 ∈ ∂[ψ_S(S, S^[t]) + P_γ(S)].
Observe that the first term is fully differentiable, so its subdifferential set reduces to a singleton given by the standard derivative: where $A$ is a matrix defined as: In contrast, the second term is not differentiable, and its corresponding subdifferential set is: However, the imposed Lagrange multipliers determine this set. Therefore, the task becomes to find the matrix $S$ in the subdifferential set that simultaneously satisfies $0 \in 2c_S(S - A) + \partial P_\gamma(S)$ and the corresponding KKT conditions of Eq. (28), imposed by the Lagrange multipliers. Thus, following standard arguments and after some algebraic manipulations, it can be shown that the solution of this task is given by: with $i = 1, 2, \ldots, K$, where $K$ is the total number of sources, $\phi_i$ is the expected maximum number of active voxels of the $i$-th source, $s_i$ and $a_i$ are the $i$-th rows of the matrices $S$ and $A$, respectively, $w_i$ is the associated vector of weights, and $P_{B_1[w_i, \phi_i]}$ is the projection operator onto the weighted $\ell_1$-norm ball, $B_1[w_i, \phi_i] = \left\{x \in \mathbb{R}^N \,\middle|\, \|x\|_{1,w_i} \leq \phi_i\right\}$, of weights $w_i$ and radius $\phi_i$. Finally, with respect to the Lagrange multipliers $\gamma_i$, the values are given by: with $i = 1, 2, \ldots, K$.
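For illustration, the projection onto the weighted $\ell_1$-ball admits a per-coordinate soft-thresholding characterization, with the threshold set by a one-dimensional search for the multiplier. The following Python sketch (not the authors' Matlab code; bisection is one possible strategy for finding $\gamma$) implements $P_{B_1[w,\phi]}$:

```python
import numpy as np

def project_weighted_l1_ball(a, w, phi, tol=1e-10):
    """Euclidean projection of a onto {x : sum_j w_j |x_j| <= phi}, w_j > 0.

    The KKT conditions give the closed-form per-coordinate solution
        x_j = sign(a_j) * max(|a_j| - gamma * w_j, 0),
    where gamma >= 0 is found here by bisection so the constraint is active.
    """
    a = np.asarray(a, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.sum(w * np.abs(a)) <= phi:        # already inside the ball
        return a.copy()
    lo, hi = 0.0, np.max(np.abs(a) / w)     # at gamma = hi the thresholded x is 0
    while hi - lo > tol:
        gamma = 0.5 * (lo + hi)
        x = np.sign(a) * np.maximum(np.abs(a) - gamma * w, 0.0)
        if np.sum(w * np.abs(x)) > phi:
            lo = gamma
        else:
            hi = gamma
    return np.sign(a) * np.maximum(np.abs(a) - hi * w, 0.0)

a = np.array([3.0, -2.0, 0.5])
w = np.ones(3)
x = project_weighted_l1_ball(a, w, phi=2.0)
print(np.sum(np.abs(x)))   # ~2.0: the projection lands on the ball's boundary
```

With unit weights this reduces to the standard $\ell_1$-ball projection; the radius $\phi_i$ is exactly the knob that controls how many voxels of the $i$-th spatial map survive.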
Step II - Dictionary Update. In the second step of the alternating minimization (see Eq. (24)), the optimization task at the $t$-th step is given by: Regarding the surrogate function $\psi_D$, with the coefficient matrix fixed to its current estimate $S = S^{[t]}$, we again use the second-order Taylor expansion (see Appendix D in [78]) to derive the following surrogate function: where $c_D = \|SS^\top\|$ and $\|\cdot\|$ stands for the spectral norm. Observe that the surrogate function is convex and majorizes its corresponding loss function by construction.
Again, the BMM allows freedom in choosing the procedure for solving the current optimization task. For simplicity, we again employ a Lagrangian relaxation, introducing a new penalty term into the original problem: where $P_\gamma$ is defined according to the constraints over the dictionary: where $\gamma_i$, $i = 1, 2, \ldots, K$, are the respective $K$ Lagrange multipliers, $d_i$ is the $i$-th column of the matrix $D$, $\delta_i$ is the $i$-th task-related time course, and $c_\delta$ and $c_d$ are two user-defined parameters that control the similarity and the norm constraints, respectively. From optimization theory, the Lagrange multipliers at the optimal point satisfy their associated KKT conditions: For simplicity, we can rewrite the penalty term in Eq. (37) in a more compact form by introducing the following matrix notation. Let us define the mask matrix $M \in \mathbb{R}^{M \times K}$, a rectangular matrix ($M < K$) with zero elements everywhere except for ones along its principal diagonal, i.e., Using this mask matrix and the properties of the trace, we can write the term $P_\gamma$ compactly as follows: where $\Gamma$ is a diagonal matrix, $\Gamma = \mathrm{diag}\{\gamma_1, \gamma_2, \ldots, \gamma_K\}$, and $C$ is another diagonal matrix given by $C = \mathrm{diag}\{c_\delta \mathbf{1}_M, c_d \mathbf{1}_{K-M}\}$. Therefore, the optimization problem is given by: Note that the current problem involves a convex loss function. Hence, with the coefficient matrix fixed to its current estimate at the $t$-th iteration, a global minimum exists.
In this case, since the loss function is fully differentiable, we can find the minimum as the point with zero (matrix) gradient, i.e.: The derivative of the first term is given by: and the derivative of the second term is given by: Now, solving with respect to $D$, we obtain: where $B$ is a matrix defined as: and the Lagrange multipliers in $\Gamma$ are obtained so as to satisfy the corresponding KKT conditions. Thus, following standard arguments and after some algebraic manipulation, it can be shown that the solution of this task is given by: with $i = 1, 2, \ldots, K$. Finally, the corresponding Lagrange multipliers are given by: For completeness, Algorithm 1 in the main text summarizes the main steps of the proposed method in pseudocode. Furthermore, a Matlab implementation of the proposed algorithm is available in the AIDL GitHub repository.
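A numerical sketch of this update, in Python: the gradient step $B$ follows the surrogate minimizer with $c_D = \|SS^\top\|$, while the column-wise correction assumes, purely for illustration, constraints of the form $\|d_i - \delta_i\|_2 \le c_\delta$ for the $M$ task-related columns and $\|d_i\|_2 \le c_d$ for the rest; the exact constraint set and the role of the multipliers in $\Gamma$ are given by the equations omitted from this extract.

```python
import numpy as np

# Sketch of one dictionary-update step (Step II). ASSUMPTION: the set
# D_delta bounds the distance of the first M_task columns to the imposed
# task-related time courses delta_i (radius c_delta) and the norm of the
# remaining columns (radius c_d); the paper defines the exact constraints.

rng = np.random.default_rng(1)
T, K, N, M_task = 30, 5, 200, 2             # toy sizes; M_task < K
X = rng.standard_normal((T, N))
S = rng.standard_normal((K, N))
D_t = rng.standard_normal((T, K))
Delta = rng.standard_normal((T, M_task))    # imposed task-related time courses
c_delta, c_d = 0.5, 1.0

c_D = np.linalg.norm(S @ S.T, 2)            # spectral norm of S S^T
B = D_t - (D_t @ S - X) @ S.T / c_D         # minimizer of the quadratic surrogate

D_new = B.copy()
for i in range(K):
    if i < M_task:                          # keep the column close to delta_i
        r = D_new[:, i] - Delta[:, i]
        nrm = np.linalg.norm(r)
        if nrm > c_delta:
            D_new[:, i] = Delta[:, i] + r * (c_delta / nrm)
    else:                                   # bound the column norm
        nrm = np.linalg.norm(D_new[:, i])
        if nrm > c_d:
            D_new[:, i] *= c_d / nrm
```

Because the surrogate is $c_D\|D - B\|_F^2$ plus a constant and the Frobenius norm separates over columns, the constrained minimizer is obtained column by column, mirroring the per-source solution stated above.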

F.3 Convexity Proofs of the Constrained Sets
In this section, we present the convexity proofs of the proposed constrained sets. Constrained set $\mathcal{D}_\delta$: Given $M$ different imposed task-related time courses $\delta_1, \delta_2, \ldots, \delta_M$, and the specific constants $c_\delta$ and $c_d$, the proposed set is convex. Proof 1. By definition, $\mathcal{D}_\delta$ is convex if, for any $X, Y \in \mathcal{D}_\delta$, the matrix $\lambda X + (1 - \lambda) Y \in \mathcal{D}_\delta$ for all $\lambda \in [0, 1]$.
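The body of Proof 1 appears to be missing from this extract. Under the assumption (consistent with the constraints described in Step II) that $\mathcal{D}_\delta$ collects dictionaries whose first $M$ columns lie within distance $c_\delta$ of the corresponding $\delta_i$ and whose remaining columns have norm at most $c_d$, the argument could read:

```latex
Let $X, Y \in \mathcal{D}_\delta$, $\lambda \in [0, 1]$, and $Z = \lambda X + (1 - \lambda) Y$,
so that the $i$th column of $Z$ is $z_i = \lambda x_i + (1 - \lambda) y_i$.
For $i = 1, \ldots, M$, by the triangle inequality,
\[
  \| z_i - \delta_i \|_2
  = \| \lambda (x_i - \delta_i) + (1 - \lambda)(y_i - \delta_i) \|_2
  \le \lambda \| x_i - \delta_i \|_2 + (1 - \lambda) \| y_i - \delta_i \|_2
  \le c_\delta ,
\]
and, for $i = M + 1, \ldots, K$,
\[
  \| z_i \|_2 \le \lambda \| x_i \|_2 + (1 - \lambda) \| y_i \|_2 \le c_d .
\]
Hence $Z \in \mathcal{D}_\delta$ for every $\lambda \in [0, 1]$, and the set is convex.
```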
Constrained set $L_w$: Given a specific matrix of weights $W$ and a vector with the expected number of non-zero values per row, $\phi = [\phi_1, \phi_2, \ldots, \phi_K]$, the proposed set is convex. Proof 2. By definition, $L_w$ is convex if, for any $X, Y \in L_w$, the matrix $\lambda X + (1 - \lambda) Y \in L_w$ for all $\lambda \in [0, 1]$. Let $X, Y \in L_w$, let $\lambda \in [0, 1]$ be a real parameter, and set $Z = \lambda X + (1 - \lambda) Y$. Thus, the $i$-th row of $Z$ is given by $z_i = \lambda x_i + (1 - \lambda) y_i$, with $i = 1, 2, \ldots, K$. Question: Is $Z \in L_w$?
Then, for this case, by the triangle inequality and the absolute homogeneity of the weighted $\ell_1$-norm, $\|z_i\|_{1,w_i} = \|\lambda x_i + (1 - \lambda) y_i\|_{1,w_i} \leq \lambda \|x_i\|_{1,w_i} + (1 - \lambda) \|y_i\|_{1,w_i} \leq \lambda \phi_i + (1 - \lambda) \phi_i = \phi_i$, for every $i$. We obtain that $Z \in L_w$ for any value of $\lambda \in [0, 1]$; consequently, by definition, $L_w$ is convex.
Constrained set $L_W$: Given the matrix of weights $W$ and the expected number of non-zeros of the full coefficient matrix, $\Phi$, the set $L_W$ is convex. Proof 3. By definition, $L_W$ is convex if, for any $X, Y \in L_W$, the matrix $\lambda X + (1 - \lambda) Y \in L_W$ for all $\lambda \in [0, 1]$. Let $X, Y \in L_W$, let $\lambda \in [0, 1]$ be a real parameter, and set $Z = \lambda X + (1 - \lambda) Y$. Question: Is $Z \in L_W$? By definition, $Z \in L_W \Leftrightarrow \|Z\|_{1,W} \leq \Phi$. By the triangle inequality, $\|Z\|_{1,W} \leq \lambda \|X\|_{1,W} + (1 - \lambda) \|Y\|_{1,W} \leq \lambda \Phi + (1 - \lambda) \Phi = \Phi$, hence $\|Z\|_{1,W} \leq \Phi$.
We then obtain that $Z \in L_W$ for any value of $\lambda \in [0, 1]$; consequently, by definition, $L_W$ is convex.

F.4 Convergence Analysis
According to the specification of the Block Majorized Minimization (BMM) algorithm, the convergence of the proposed algorithm to a local minimum of the main optimization task (see Eq. (18)) is guaranteed if the following conditions are satisfied [48]: 1. The objective function of the original problem is regular.
2. Each surrogate function is quasi-convex, defined over a convex set and majorizes its corresponding cost function.

3. Each minimization block has a unique solution.
Condition 1. By definition, the main loss function is regular, since $\|X - DS\|_F^2$ is infinitely differentiable and single-valued.
Condition 2. For each block, we obtained the surrogate function via a second-order Taylor expansion. As described above, one of the reasons for selecting this kind of surrogate function is that the second-order Taylor expansion is convex and majorizes its corresponding loss function by construction. Besides, the imposed constraint sets for each block, $\mathcal{D}_\delta$ and $L_w$ (and $L_W$), are convex, as proved in the previous section. Condition 3. For each block, the surrogate function is strictly convex and defined over a convex set. Therefore, each block has a unique solution.
Consequently, the proposed algorithm satisfies the convergence conditions of the BMM framework, which guarantees that the proposed optimization algorithm converges monotonically to a stationary point of the main optimization problem [48], [47].
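The monotone-descent behavior implied by these conditions can be illustrated numerically. The toy Python experiment below (random data; plain $\ell_1$ balls per row of $S$ and unit-norm dictionary columns serve as simplified stand-ins for the paper's $L_w$ and $\mathcal{D}_\delta$ sets) alternates the two BMM steps and checks that $\|X - DS\|_F^2$ never increases:

```python
import numpy as np

# Each alternating step minimizes a convex quadratic majorizer over a convex
# set containing the current iterate, so f(D, S) = ||X - D S||_F^2 satisfies
# f(new) <= psi(new) <= psi(old) = f(old): the cost sequence is non-increasing.

def proj_l1(a, phi, tol=1e-10):
    """Project a vector onto the l1 ball of radius phi (bisection on gamma)."""
    if np.sum(np.abs(a)) <= phi:
        return a.copy()
    lo, hi = 0.0, np.max(np.abs(a))
    while hi - lo > tol:
        g = 0.5 * (lo + hi)
        if np.sum(np.maximum(np.abs(a) - g, 0.0)) > phi:
            lo = g
        else:
            hi = g
    return np.sign(a) * np.maximum(np.abs(a) - hi, 0.0)

rng = np.random.default_rng(2)
M, K, N, phi = 30, 5, 200, 10.0
X = rng.standard_normal((M, N))
D = rng.standard_normal((M, K))
D /= np.linalg.norm(D, axis=0)                 # feasible start: unit columns
S = np.zeros((K, N))                           # feasible start: inside l1 balls

costs = []
for _ in range(20):
    # Step I: majorize in S, then project each row onto its l1 ball.
    c_S = np.linalg.norm(D.T @ D, 2)
    A = S - D.T @ (D @ S - X) / c_S
    S = np.vstack([proj_l1(A[i], phi) for i in range(K)])
    # Step II: majorize in D, then project each column onto the unit ball.
    c_D = np.linalg.norm(S @ S.T, 2)
    B = D - (D @ S - X) @ S.T / c_D
    D = B / np.maximum(np.linalg.norm(B, axis=0), 1.0)
    costs.append(np.linalg.norm(X - D @ S, 'fro') ** 2)

assert all(c1 <= c0 + 1e-6 for c0, c1 in zip(costs, costs[1:]))
```

Dropping either projection's feasibility (e.g., starting from an infeasible iterate) voids the chain of inequalities, which is why each block must be initialized inside its constraint set.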