Graphical Models of Functional MRI Data for Assessing Brain Connectivity

1.1 Brain connectivity and fMRI Modern neuroimaging technologies have allowed researchers to non-invasively observe indirect markers of brain activity in vivo (Fig. 1). This has resulted in a rapid growth of studies trying to ascertain what brain loci are associated with certain cognitive, sensory and motor tasks. In particular, the recent development of functional magnetic resonance imaging (fMRI) has allowed researchers to non-invasively investigate brain activity at excellent spatial resolution and relatively good temporal resolution. While probing aspects of brain function is typically under the domain of neuroscientists, fMRI work is inherently interdisciplinary: it involves MR physicists who determine MRI sequences sensitive to small changes in the brain, neuroscientists who design the behavioural experiments and interpret the observations, statisticians to assess significance of changes, and increasingly, people with signal processing expertise to derive more and more information from the time series extracted. Analysis of fMRI data sets represents a special challenge for traditional statistical methods that were originally designed for a large number of samples of low-dimensional data points. The number of “voxels” (ie. representing a specific locus in the brain) to be analyzed are large (≈ 105), yet the number of time points (≈ 102) is relatively small. Most early fMRI analysis methods were designed to ascertain the regions where brain functions are localized by performing voxel-wise analysis. Even when simple tasks are performed in the MRI scanner, widespread activation can be observed in the brain with fMRI. These and other studies suggest that the brain is active at multiple spatial and time scales supporting both segregated and distributed information processing (Bassett & Bullmore, 2006). In fact, the advent of non-invasive functional neuroimaging has re-ignited a centuries-old debate about whether or not cognitive and motor tasks are encoded in discrete loci or are more diffusely and fluidly represented, the latter emphasizing the importance of assessing brain connectivity (Catani & ffytche, 2005). While connectivity appears to be of critical importance for understanding and assessment of brain function, it can be difficult to define in a rigorous sense with current technologies that can only probe brain activity at certain spatial and temporal scales (see Fig. 1). Conventionally, brain connectivity can be studied at three levels: anatomical, functional, and effective connectivity (see Fig. 2). Anatomical connectivity refers to actual physical connections 19

between brain structures. It can be determined with the help of rich anatomical studies that have been developed over decades, or more recently, using MR techniques such as Diffusion Tensor Imaging (DTI). Functional connectivity is defined as the significant mutual information between the time series found at distinct loci in the brain. However this raises several problems. If two regions have similarities between their respective time series, is this because one region influences the other, or there is a third region affecting both (Figs. 4 and 5)? Thus the term effective connectivity has been used to imply the causal influence that activity in one brain region exerts over the activity of another. The importance of assessing brain effective connectivity is also related to the fact that brain connectivity impairments are associated with many neuropsychiatric diseases such as depression (Schlösser et al., 2008), schizophrenia (Schlösser et al., 2008), Alzheimer's (Supekar et al., 2008) and Parkinson's disease (Palmer et al., 2009).

Graphical models for brain effective connectivity
Many methods for inferring connectivity from the four-dimensional fMRI data (three spatial dimensions and one temporal dimension) have been suggested. Proposed methods include correlation thresholding (Cao & Worsley, 1999), linear decomposition (Calhoun et al., 2001;McKeown, 2000), structural equation models (SEM) (Bollen, 1989), multi-variate auto-regression (Valdes-Sosa et al., 2005), dynamic causal models (Friston et al., 2003), Bayesian networks (Li et al., 2008;Zheng & Rajapakse, 2006), wavelet analysis (Bullmore et al., 2004), and clustering (Heller et al., 2006). Conventionally, brain connectivity is studied at three levels: (a) anatomical, (b) functional, and (c) effective connectivity. Anatomical connectivity is actual physical connections between brain structures. Functional connectivity is defined as the significant mutual information between the time series found at distinct loci in the brain. Effective connectivity has been used to imply the causal influence that activity in one brain region exerts over the activity of another. Correlation thresholding (Cao & Worsley, 1999) directly examines the correlation between the activities of brain regions. If the correlation is so strong that it is extremely unlikely based on chance, then the two regions are considered connected, though not necessarily directly. Linear decomposition approaches, e.g. principal component analysis and independent component analysis (ICA) (Calhoun et al., 2001;McKeown, 2000), assume that observed brain activities are a combination of underlying psychological processes that spatially recruit different brain regions or temporally have unrelated behaviours. Regions involved in the same psychological process as revealed by the decomposition is considered as connected, though not necessarily directly. Both correlation thresholding and linear decomposition are designed for discovering functional connectivity, and neither can distinguish whether two regions interact directly or indirectly through a third region (Kaminski, 2005). Though correlation thresholding and linear decomposition are generally not considered as graphical model, actually both can be related to graphical models (Roweis & Ghahramani, 1999). Unlike correlation thresholding and linear decomposition whose results can be visualized as brain images at the voxel level, structure equation models 1 (Bollen, 1989), dynamic causal models (Friston et al., 2003), multivariate auto-regression (Valdes-Sosa et al., 2005), and Bayesian networks (Zheng & Rajapakse, 2006), are another category of methods that normally work at the level of regions, and whose results can be visualized as graphs where nodes usually represent brain regions and edges represent connections. The brain regions are typically defined anatomically, and some automatic or manual segmentation of brain Markov Random Field Bayesian Network Chain Graph Model Latent Layer Time Slices Fig. 3. Examples of the structures of classical graphical models. The structure of a Markov random field is an undirected graph. The joint probability is decomposed as the product of clique potential functions Φ c (x c ) where c is a clique in the graph and x c is the variables associated with the nodes in c. The structure of a Bayesian network is a directed acyclic graph. The joint probability is decomposed as the product of node conditional probabilities ) can be further decomposed as clique potential functions Φ c (x c ) where c is a clique in the moral graph derived from the chain component τ. Dynamic causal models (Friston et al., 2003) can be regarded as non-linear Bayesian networks with an observed layer and a latent layer. Multi-variate auto-regression (Valdes-Sosa et al., 2005) can be regarded as linear Bayesian networks with many time slices and directed edges from slices at time t − 1, t − 2, ··· pointing to the slice at time t.
structures is required to act as nodes in the model. According to the interaction relationships specified by the graph, the joint probability of node random variables can be decomposed as the product of many local potential functions or local conditional probabilities, as shown in Fig. 3. A node variable usually depends on its neighbor variables and/or parent variables. For example, in Bayesian networks, the activity of a region A is usually modeled as a stochastic function of the activities of its "parent" regions, as in Eq. (1) where X A is the activity of region A and pa i [A]s are the parent nodes of A in the graph. The graph structure of the model is not just for visualization, but encodes conditional-independence relationships among the activities of brain regions. A network structure can be translated to a set of conditional-independence relationships according to the Markov properties and vice versa, with certain assumptions, a set of conditional-independence relationships can also be encoded by a network structure (Lauritzen, 1996).

Pair-wise and conditional correlation
Graphical models are suitable for modelling brain connectivity, not only because their structures can be easily visualized as a network, but more importantly, their fundamental feature, namely conditional independence, is a key concept for differentiating effective connectivity from functional connectivity. When two brain regions show similar activation patterns, they can be somehow connected with several underlying possibilities, as illustrated in Fig. 4: Fig. 4. When two brain regions show similar activation patterns, they can be connected with different underlying possibilities: (1) they directly reciprocally communicate with each other; (2) one region directly exerts the other; (3) they indirectly reciprocally communicate with each other via other brain regions; (4) one indirectly exerts the other via other regions; (5) they both are driven other regions; (6) they communicate with a combination of (1)-(5).
1. they directly reciprocally communicate with each other; 2. one region directly exerts the other; 3. they indirectly reciprocally communicate with each other via other brain regions; 4. one indirectly exerts the other via other regions; 5. they both are driven other regions; 6. they communicate by a combination of the above possibilities.
Pair-wise correlation can only tell that two regions is probably connected, but cannot distinguish among the above possibilities. To distinguish between direct and indirect connections, conditional independence must be considered. The example in Fig. 5 clearly explains this motivation. The two signals A and B show strong pair-wise correlation, but if we consider a third signal C, then the residuals of A and B after C is extracted from them hardly show any correlation. In this example, A and B are conditionally independent if given C, and maybe both are driven by C, as illustrated in the indirect common-stimuli case in Fig. 4. It must be noted that conditional independence alone without temporal information is not enough to determine causal relationships, ie. the direction of connections. To infer the direction, criteria considering temporal information, such as Granger causality (Granger, Aug., 1969), can be employed.

Challenges in modeling brain connectivity
Biomedical research explores the highly complex and diverse realm of living organisms and often incorporates clinical needs such as diagnosis and treatment design. Analysis of biomedical data typically emphasizes such features of reliability, interpretability and generality of reported results. For example, when brain connections are reported, it is important to control or assess error rates in the claimed discoveries, addressing questions such as "how many among the reported connections are actually true connections?" and "how many true connections can be detected?" Additionally, the ultimate goal of a biomedical experiment is usually a population inference applicable to a group of people, such as patients with a particular disease. However, subjects classified to the same experimental group according to the factor of interest can still be highly diverse with respect to other factors, such as gender, age, or race. Even repetitive experiments with the same subject can still be affected by various physical or psychological factors, such as drowsiness or stress. It is therefore important to integrate the information from separate experiments to make inference on the target topic, and to keep a balance between commonality and diversity. Finally, as a multidisciplinary field, end users of connectivity analysis reports are often biomedical researchers or clinicians who focus on the biological implication of the results and the effects of medication. Therefore, it is undesirable to simply generate a vast network of potential connections or just report abstract statistical scores, without providing an intuitive interpretation. Rather, clinicians prefer interpretable, informative and human-understandable results, for example, which brain regions play the central role in conducting a functional task, or which connections are normalized by a pharmacological manipulation. These considerations have implications for interpretation and feature extraction from graphical models. As a response to the above common challenges in biomedical research (ie. reliability, generality and interpretability), in the following sections, we will focus on three topics: error control in learning brain connectivity, group analysis taking into account the enhanced inter-subject variability typically seen in patient populations, and brain network analysis. Finally, for completeness, we also briefly overview several popular software packages suitable for assessing fMRI brain connectivity in the Appendix.

Error control in structure learning
In real world applications, especially in modelling brain connectivity, graphical models are not only a tool for operations such as classification or prediction, but more often than not, it is the network structure of the model itself which is of particular interest. Thus a desirable graphical model of fMRI data should not only statistically fit the overall data well, but also accurately reflect the internal brain connectivity structure. Structure-learning algorithms must therefore control or assess the error rate of the connections/edges detected by them.

Criteria for error control
There are two basic types of statistical errors: type I errors, ie. falsely claiming connections when they actually do not exist; and type II errors, ie. failure in detecting connections that truly exist. Since real data are not free from noise, limited samples may appear to support the existence of a connection when it does not exist, or vice versa. It is therefore impossible to absolutely prevent the two types of errors simultaneously, but rather keep a balance between them. This can be done by, for example, minimizing a loss function associated with the two types of errors according to Bayesian decision theory. There are several criteria available for error-rate control (see Table 2). Generally there is no single criteria that is universally superior if the research scenario is not specified. Selecting the error rate is largely not an abstract question "which error rate is superior over others?", but a practical question "which error rate is the researchers' concern?". One error-rate criterion may be favored in one scenario while another may be right in a different scenario, for example: • We are diagnosing a serious disease whose treatment has serious potential side effects. Due to the risk of the treatment, we hope that less than 0.01% of healthy people will be falsely diagnosed as affected by the disease. In this case, the type I error rate should be controlled under 0.01%.
• We are diagnosing a disease with high mortality, e.g. a type of cancer. Because failure in detecting the disease will have catastrophic consequences, we hope that 95% of subjects with the disease will be correctly detected. In this case, the type II error rate should be controlled under 5%.
• In a pilot study, we are selecting candidate genes for a genetic research on Parkinson's disease. Because of limited funding, we can only study a limited number of genes, so when selecting candidate genes in the pilot study, we hope that 95% of the selections are  (Benjamini & Yekutieli, 2001) FDR E(FP/R 2 )* Positive False Discovery Rate (Storey, 2002) pFDR Table 2. Criteria for multiple hypothesis testing. Here E(x) means the expected value of x, and P(A) means the probability of event A. Please refer to Table 1 for related notations. * If R 2 = 0, FP/R 2 is defined to be 0.
truly associated with the disease. In this case, the FDR will be chosen as the error rate of interest and should be controlled under 5%.
• We are selecting electronic components to make a device. Any error in any component will cause the device to run out of order. To guarantee the device functions well with a probability higher than 99%, the family-wise error rate should be controlled under 1%.
Since the scenario favoring the false discovery rate (FDR) (Benjamini & Yekutieli, 2001;Storey, 2002) is common in exploratory research, the FDR has become an important and widely used criterion in many fields, such as in inferring brain connectivity. Simply controlling the type I and type II error rates at specified levels does not necessarily keep the FDR sufficiently low, especially in the case of large and sparse networks. For example, suppose a network includes 40 nodes where each interact in average with 3 other nodes, i.e. there are 60 edges in the network. Then an algorithm with the realized type I error rate = 5% and the realized power = 90% (i.e. the realized type II error rate = 10%) will recover a network with 60×90% = 54 correct connections and [40 × (40 − 1)/2 − 60] × 5% = 36 false connections, which means that 36/(36 + 54)=40% of the claimed connections do not exist in the true network.

Structure-learning methods with error controlled
Score-based search methods (Heckerman et al., 1995) look for a suitable network structure by optimizing a certain criterion of goodness-of-fit, such as the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), or the Bayesian Dirichlet likelihood equivalent metric (BDE) (Heckerman et al., 1995)), with a random walk (e.g. simulated annealing) or a greedy walk (e.g. hill-climbing). However, scores do not explicitly reflect the error rate of edges, and the sample sizes in real world applications are usually not enough to guarantee asymptotic performance. Both classical and Bayesian approaches are available for controlling errors during network learning (Listgarten & Heckerman, 2007). Classical approaches are based on the Markov property of graphical models, and treat error control as a problem of multiple testing. Since a graphical model is a graphical encoding of conditional-independence relationships, the non-adjacency between two nodes is tested by inspecting their conditional independence given other nodes. Conditional-independence relationships among node variables are tested one by one in a certain order, and p-values about the existence of each edge are estimated.
Error control procedures, such as Bonferroni correction for the family-wise error rate, or the Benjamini-Hochberg procedure for the FDR, or without-correction for the type-I error rate, are applied to the p-values to set the cut-off threshold of accepting or rejecting the existence of edges.
Recently, a series of papers have addressed the problem using the classical approach. Listgarten and Heckman (Listgarten & Heckerman, 2007) proposed a permutation method to estimate the number of spurious connections in a graph learned from data. The basic idea is to repetitively apply a structure learning algorithm to data simulated from the null hypotheses with permutation. In general, this method will work with any structure learning method, but permutation may make the already time-consuming structure learning problem even more computationally expensive, limiting its practical usage. Kalisch and Bühlmann (Kalisch & Bühlmann, 2007) in 2007 proved that for Gaussian Bayesian networks, by adaptively decreasing the type I error rate, as the sample size approaches infinity, the PC algorithm (Spirtes et al., 2001) can, without errors, recover the equivalence class of the underlying sparse directed acyclic graphs, even if the number of nodes grows exponentially as the sample size does. Tsamardinos and Brown (Tsamardinos & Brown, July, 2008) in 2008 applied the FDR-procedure separately to edges related to each node. Li and Wang (Li & Wang, 2009) in 2009 applied FDR-control procedures globally to all connections of interest, and proved that with mild conditions, their method is able to asymptotically control the FDR of the "claimed" edges. They showed by empirical experiments that in the cases of moderate sample size (about several hundred), the method is still able to control the FDR under the user-specified level. Bayesian approaches control errors by inferring the posterior probability of edges given the data. If G is the learned graph and G i is the true graph, then the spurious edges in G are those of G \ G i , ie. the sub-graph of G after edges in G i are removed. In this case, the realized FDR is |G \ G i |/|G| where |•|denotes the number of edges in a graph, and the realized type-I error rate in this case is |G \ G i |/|G full \ G i | where G full is the fully connected graph. Since Bayesian inference assigns a probability to each possible model, the error rate of G given data D should be integrated over all possible G i according to their posterior possibilities (Listgarten & Heckerman, 2007). Therefore we have: where P(G i |D) is the probability of a model structure G i given data D. Similarly the posterior type-I error rate is: As in many other Bayesian procedures, the most difficult part of the inference is not the formulation, but rather the calculation, and especially the integration. Because the number of possible graphs increases super-exponentially as the number of nodes increases (Steinsky, 2003), it is impractical to enumerate all the possibilities and sum them up. For certain prior distributions, given the order of nodes, Friedman and Koller (Friedman & Koller, 2003) in 2003 derived a formula that can calculate the exact posterior probability of a structure feature with the computational complexity bounded by O(N D in +1 ), where N is the number of nodes and D in is the upper bound of node in-degrees. Considering similar prior distributions, but without the restriction on the order of nodes, Koivisto and Sood (Koivisto & Sood, 2004) in 2004 developed a fast exact Bayesian inference algorithm based on dynamic programming that is able to compute the exact posterior probability of a sub-network with the computational complexity bounded by approximately O(N2 N ). In practice, this algorithm runs fairly fast when the number of nodes is less than 25. For networks with more than about 30 vertices, the authors suggested setting more restrictions or combining with inexact techniques. For general situations, the posterior probability of a structure feature can be estimated with Markov chain Monte Carlo (MCMC) methods (Madigan et al., 1995). As a versatile implementation of Bayesian inference, the MCMC method can estimate the posterior probability given any prior probability distribution. However, MCMC usually requires intensive computation and the results may depend on the initial state. In Listgarten and Heckman's simulation (2007) (Listgarten & Heckerman, 2007), the error-control curves of the Bayesian approach was smoother and more favorable than those of the classical approach, as show in Fig. 6-A and Fig. 6-B. However, it was also pointed out that in practice the expected FDR of interest usually is very small, within a narrow range near 0, and that the classical approach showed reasonable performance in this range. (The axes of Fig. 6-A and Fig. 6-B are marked with the positive predictive value (PPV) instead of the FDR. The relationship between PPV and FDR is PPV = 1 − FDR, as in Table 2.) In Li and Wang's simulation (2009) (Li & Wang, 2009), to control the FDR at the conventional level of 5%, their classical approaches, the PC fdr algorithm and its heuristic modification, the PC fdr* algorithm, controlled the FDR satisfactorily around the expected level 5%, as shown in Fig. 6-C. For inferring brain connectivity, since brain regions are not just algebraically isolated variables, but rather located in a three-dimension space with complex geometric structure, it may be important in the future to exploit such geometric information for improving error control.

Group analysis
Biomedical experiments are usually conducted to verify or discover knowledge about a population characterized by health or certain disease state. However, subjects classified to the same group can still be highly diverse with respect to factors such as gender, age, or race. With careful experiment design, the effect of these confounding factors can be reduced, but inter-subject variability still plays an important role and remains a challenge. Even studies on a single subject may still face challenges related to variability. For example, EEG recordings conducted at different times from the same subject can be affected by the subject's physical or psychological state, such as drowsiness or stress. Thus in this paper, the term "group analysis" is not restricted to the analysis of a group of people, but generalized to the inference by integrating the information distributed in separate experiments and affected by cross-experiment variability.

Commonality and diversity at different levels
Two basic concepts in group analysis are commonality and diversity. For example, all doctors learn professional knowledge related to medicine, but a doctor could be a pediatrician, a surgeon or a physician. Each one has their own speciality and this is the diversity among them. Commonality and diversity usually co-exist, and are revealed at different levels, depending on the perspective and scale we study the problem.  (Listgarten & Heckerman, 2007). Their x-axes and y-axes are the expected and realized positive predictive values (PPV) respectively. (The relationship between PPV and FDR is PPV = 1 -FDR.) The curves of the Bayesian approach was smoother and more favorable than those of the classical approach. When the expected PPV is high, or equivalently the expected FDR is low, the classical approach performed reasonably well. Sub-figure C is the result in (Li & Wang, 2009) to control the FDR at the conventional level of 5% with the classical approach. The x-axis is the strength of the conditional-dependence relationships among node variables, and the y-axis is the realized FDR. The PC fdr algorithm controlled the FDR under 5%, and its heuristic modification, the PC fdr* algorithm, controlled the FDR satisfactorily around 5%. For details of the two simulation studies, please refer to (Listgarten & Heckerman, 2007) and (Li & Wang, 2009). Fig. 7. Three broad categories of group-analysis methods. The "virtual typical" approach constructs a typical subject to represent the whole group, usually by pooling or averaging data of subjects. The "common-structure" approach imposes the same network structure to the model of every subject, and usually uses mixed-effect models to handle the parameter variability among subjects. The "individual-model" approach allows each individual subject to have its own model, and integrates the individual models with a group-level model.
Since graphical models combine network structures and probability descriptions, the group analysis needs also accommodate commonality and diversity with both model structures and probability parameters. A review of the literature shows that current group-analysis methods based on graphical models can be classified into three broad categories (see Fig. 7), as discussed as follows (Li et al., 2008). First, we could ignore subject diversity, and assume that the brains of all the subjects are structured and function in a similar way, as if there is a virtual typical subject able to satisfactorily represent the whole group. This can be called the "virtual typical subject" approach. In this approach, the model for every subject has the same structure, and the Fig. 8. Product of two Gaussian distributions. Thin lines are the contour of two bivariate Gaussian distributions, and the bold lines are the contour of their product. If the parameter likelihood for the data of two subjects is the two bivariate Gaussian distributions, then the parameter likelihood for the pooled data is the product distribution. In sub-figures B and C, the center of the product distribution is located off the x-axis, while neither of the two bivariate Gaussian distributions is centred off the x-axis. This is an undesirable and misleading phenomenon for data pooling. PPC is the abbreviation for partial correlation coefficient; m x is the mean of x, and m y is that of y. This figure is from (Kasess et al., 2010).
same parameters as well. The "virtual-typical" subject is usually constructed by pooling or averaging the group data, and then one model is learned from the data. Technically this degrades group analysis to learning a model for a single subject, for which both classical (Heckerman et al., 1995) and Bayesian (Neumann & Lohmann, 2003) approaches have been developed. When the group is homogeneous or inter-subject variability follows certain regular distributions, this approach could increase detection sensitivity, because pooling can build a relatively large data set, and averaging can enhance the signal-to-noise ratio (Kasess et al., 2010). However, when the group becomes more heterogeneous, this approach could lead to undesirable and misleading results (Kasess et al., 2010). Fig. 8 shows that by pooling data together, the group estimation of connection parameters could be located far way from the center of individual estimations. The other extreme is that we assume subjects can be completely different from each other -the "individual-model approach". In this approach, the model of each subject can be completely different. This approach is related to the concept of functional degeneracy, ie. "the ability of elements that are structurally different to perform the same function or yield the same output" (Edelman & Gally, 2001), or more plainly "there are multiple ways of completing the same task" . Because subjects in the same group are considered to share similarity, their individual models must be linked together in a certain way, usually by a second-level group model built over the individual models. The most straight-forward implementation of the "individual-model approach" is to directly input separately learned individual models as subject features into a second-level analysis. For example, structural features of the network of individual models can be selected with classification and cross-validation procedures at the group level, as applied in (Li et al., 2008). A theoretically elegant approach is to build a group-level model to describe the diversity distribution in a group, and the group-level and the individual-level models form a big model over the group data. Usually this big integrated model should be learned from the batch of group data, which would require an intensive computation. The Bayesian group model proposed by Stefan, etc. (Stephan et al., 2009) provides a rigorous theoretical background and is able to break the model learning into two separate stages, the individual and group stages.
This property allow the group to be updated incrementally without re-learning individual models when new data are added. Niculescu-Mizil and Caruana proposed a heuristic method to link individual models (Niculescu-Mizil & Caruana, 2007). They allow network structures to be different across subjects, and punish excessive diversity among the structures with a tunable parameter. A trade-off between the two extremes is assuming that subjects' brains are structured similarly, but function with considerable difference. This can be referred as the "common-structure" approach (Mechelli et al., 2002). In this approach, the model of every subject share the same network structure, but the parameters can be different from subject to subject. When the common model structure is specified, this approach focuses on dealing with the model parameters across subjects. The standard method is the mixed-effect model (Mumford & Nichols, 2006) as follows. Consider an experiment where there are n subjects and for each subject, indexed by k, a regression parameter β k modeling the relationship between a response variable Y k and an explanatory variables X k . The first-level model for each individual subject is where k is the within-subject randomness following Gaussian distribution N(0, V k ). The second-level model for group parameters is where β g is the group-level parameter, X g is the group design matrix and g is the cross-subject randomness following Gaussian distribution N(0, V g ). The combination of Eqs. (4) and (5) is called a mixed model, because both the within-subject and the cross-subject randomness are considered in the model. A notable issue in mixed-effect models is that the cross-subject variance V g could be negative in maximum likelihood estimation. To avoid this undesirable and counter-intuition phenomenon, the random-effect variance is usually enforced to be positive. The mixed-effect model in practice usually is solved with the summary-statistics approach that reformulates the model as Eqs. (6) and (7): whereβ k is the least-square-estimation of the parameter for each subject, and * g = g +β − β following N(0, V * g ). Given V * g (the variance of * g ), β g can be solved from the estimation of β k s, unnecessarily from β k s. The summary-statistics approach decomposes a complicated model into two relatively easy stages, and retains the estimation for each single subject even when new subjects are added into the analysis. The summary-statistics approach assumes that V * g is known, but in practice it should be estimated from the data. The estimation, including both its value and its degree of freedom, is challenging and has attracted much research attention. Methods such as Restricted Maximum Likelihood (Harville, 1977), Smoothing with fixed-effect model (Worsley et al., 2002) , and Markov Chain Monte Carlo (Woolrich et al., 2004) have been developed. Though the aforementioned methods deal with group commonality and diversity with various techniques, most take or can be considered to be in a two-level framework: a lower level of models for each individual subject, and a group level integrating individual models and describing inter inter-subject commonality and diversity. The group level could enforce strong commonality, like the "virtual-typical" approach, or model group diversity probabilistically, like the "individual-model" approach in (Stephan et al., 2009). Models able to technically decouple the computation of the individual and the group levels are favoured.

Desirable features of graph analysis methods
To set clear goals for the future development of more advanced group-analysis methods, we suggest three highly desirable features: being modular, being incrementally updatable, and being scalable. Being modular means that a group-analysis method is not only designed for a particular type of single-subject model, but versatile and applicable to different types of single-subject models. For example, both Bayesian networks and structural equation models are applicable at the subject level, so the group-analysis method should not be restricted to only one of them, but should be able to work with both of them, though not necessarily with a mixture of them. If the group-level model just needs inputs such as the likelihood of individual models, then it is free from the specific format of the individual models. If a group-analysis model can be a module of itself, then it will be able to handle multi-level hierarchical group structures. Being incrementally updatable means that group-inference results can be summarized as summary statistics and used for further analysis involving newly collected data. This feature is very useful in research practice because experimental data are usually collected incrementally. For example, after a study on eighty subjects half a year ago, twenty more subjects might be recruited. In this case, it may require cumbersome computation to analyze the entire data of one hundred subjects. However, if the group inference is incrementally updatable, it may need much less computation to include the additional twenty subjects. Being scalable means that a group-analysis method can handle fast growing diversity among subjects. Because modern exploratory research usually involves investigation of a large number of candidate models, scalability has become a highly desirable feature for group analysis. For example, if the connectivity between ten brain regions is studied with Bayesian networks, then a group-analysis method should be able to handle the diversity of about 3.1 × 10 17 (Steinsky, 2003) possible network structures.

Network analysis
Modelling is only the first step to investigate a system, following which humanunderstandable information should be further extracted from models to provide insightful understanding. For example, as final readers of a report on brain connectivity, neurologists might be interested in questions such as "which brain regions play the central role in conducting a functional task?", "in what patterns are cognitive functions segregated and integrated among brain regions?" or "how does this brain connectivity network react to the presence of a disease?" Simply reporting a vast and plain network without any highlights does not answer these questions. Graphical models notably have visualized network structures, so it is natural to analyze their structures as an important post-processing in their applications in brain connectivity, as discussed in (Bullmore & Sporns, 2009;Stam & Reijneveld, 2007). The history of graph theory can be traced back to nearly three hundred years ago, marked by preeminent Swiss mathematician Euler's paper on the Seven Bridges of Königsberg. Its application to real-world complex networks was boosted at the end of last century by a series of discoveries on the architecture of world-wide-web, social networks, cellular networks, etc. These systems, despite their tremendous variety, share certain common properties, such as the "small world" (Watts & Strogatz, 1998), the "scale free" (Barabasi & Albert, 1999) and the "self-similarity" (Song et al., 2005) properties. These properties might hint how these networks evolve and grow, and are also related to their functions and interactions with the environment (Watts & Strogatz, 1998). Some well-know properties such as the "scale-free" or "self-similarity" properties are more suitable for large networks, than for networks of moderate size (with dozens of nodes), because their statistics need large scale observations. However, the number of time points of an fMRI scan is relatively small, approximately several hundred, and cannot support reliably discovery of large scale networks. Therefore, in this section, we focus on network analysis suitable for brain connectivity networks of moderate size.

Network measures
Graphs can be studied at different levels from basically nodes and edges, to paths, or more intricately, sub-graphs. According to the object of interest, network measures can be broadly classified into two categories: (1) local measures that focus on local objects in the network, for example, a node, an edge, or a sub-graph, and (2) global measures that feature the pattern of the overall architecture. Local measures usually, yet not necessarily, put a local object in the global view. For example, the importance of a node could be defined as the proportion of communication in the whole network that must go through it. Vice versa, global measures are usually built on local features. For example, the "scale-free" property is about the distribution of node degrees. Fig. 9 illustrates those network measures listed below. Most network measures are ultimately linked to fundamental concepts such as node degree and path length.
• Centrality and local contribution to network communication. A local object, for instance a node or an edge, that plays an important role in network communication is considered to be central in the network. The centrality of a node can be measured by its relay of the communication between other nodes. For example, betweenness centrality (Freeman, 1977) is based on the number of shortest paths between other nodes passing through a node. It can also be assessed by the geodesic distance to other nodes, as closeness centrality (Beauchamp, 1965) does, or by deleting a node and then comparing the connectivity loss of the "impaired" network, as Shapley ratings (Kötter et al., 2007) do. Similar ideas can be applied to define the centrality of an edge or a sub-graph. Some measures are not as intuitive as the aforementioned ones: for instance, eigenvector centrality (Bonacich, 1972) and sub-graph centrality (Estrada & Rodrguez-Velzquez, 2005) are also implemented for the same concept.
• Modularity and brain function organization. It is believed that various cognitive functions are localized in different brain regions, and that these distributed functions are integrated together for complicated information processing. Such a perspective on brain function organization naturally leads to a network structure where some nodes are densely clustered and form function modules. The "small-world" property, at the global level, features systems that are highly locally clustered, like regular lattices, but still have small geodesic diameter, like random graphs (Watts & Strogatz, 1998). Modules can be detected by hierarchical clustering algorithms that groups node from the most linked pairs to the least pairs, or by community detection algorithms (Girvan & Newman, 2002) that draw module boundaries by breaking unimportant edges one by one.
• Perturbation and compensation mechanisms. It is believed that as a dynamic system, the brain will respond to impairment such as that induced by disease, by recruiting other neural resources to compensate the partially disabled function. This compensation mechanism is an important hypothesis of neuro-rehabilitation, and also related to many neurological diseases. Network analysis for the compensation mechanism can take a perturbation-and-recovery approach: deleting a connection or deactivating a node, and then searching for the most efficient changes that are needed to restore the impaired connectivity. Such mechanism could be developing a new connection other than the deleted one, or increasing the functionality of the most central node of the "lesioned" network (Kötter et al., 2007).
• Motif and connectivity pattern. Inter-connected nodes are building blocks of a big network, and the connectivity patterns among neighboring nodes characterize how information is processed at the local level. It has been found that in real-world networks certain patterns of inter-connections occur much more frequently than in random networks, and these "signature" local patterns are called network "motif" (Milo et al., 2002;Sporns & Kötter, 2004). Another network measure related to the "small-world" property is clustering coefficient, which is a function of the counts of pattern C and pattern D in Fig. 9-(c). Fig. 10. An example of the discriminability of different centrality measures. With the betweenness, the closeness and the eigenvector centrality, all the nodes are identical, while with the sub-graph centrality, the nodes are distinguishable, with score 45.696 for blue nodes, and 45.651 for green nodes. This example is from (Estrada & Rodrguez-Velzquez, 2005).

Discriminability of network measures
When using network measures to quantitatively reveal the differences with respect to a certain network concept, we naturally expect that the measure could sensitively detect even subtle differences. As various calculation methods could be proposed to quantify the same network concept, this raises the concern of the discriminability of these calculation methods. For example, to measure the centrality of a node, there are options such as the betweenness (Freeman, 1977), the closeness (Beauchamp, 1965), the sub-graph (Estrada & Rodrguez-Velzquez, 2005), and the eigenvector centrality (Bonacich, 1972). As Fig. 10 shows, certain difference can only be detected by some measures. This situation motivates the future development of theoretical criteria for rigorously comparing network measures and guiding their design, besides just evaluating them empirically. Theoretically, it is possible to use just a real number to uniquely represent a binary graph, achieving perfect discriminability. For instance, the adjacency matrix can be lined straight as the binary coding of a rational number. Though this simple mapping may not be meaningfully related to any network concept, it at least shows that with a single index all graphs can be distinguished. For network measures of a graph, an available criterion is the mutual information between the measure and the "isomorphic" class of graphs (Corneil & Gotlieb, 1970). Two graphs G 1 and G 2 are isomorphic if and only if there is a one-to-one mapping f between the nodes of G 1 and G 2 such that for every adjacent node pair a and b in G 1 , their mirrors in G 2 , ie. f (a) and f (b), are also adjacent, and vice versa. Similarly, for network measures about a node in a graph, their discriminability can quantified with the mutual information between the measure and the "isomorphic" class of the nodes. Two nodes a and b in a graph G are isomorphic if and only if there a permutation p of the nodes of G such that p(a)=b and p(b)=a and that for every adjacent node pair c and d (which can be a or b), p(c) and p(d) are also adjacent. It is of great theoretic and practical importance to further pursue criteria for rigorously comparing the discriminability of network measures.

Concluding remarks
In this article, we reviewed the application of graphical models for inferring brain connectivity from fMRI data. We have described and provided signal processing solutions for the challenges raised in this highly interdisciplinary and innovating research field related to model reliability, generality and interpretability. The importance of error control during brain network structure learning has been increasingly recognized, with a series of papers being published since 2005. These papers proposed solutions from the perspective of both classical and Bayesian statistics, and provided some theoretical conclusions. Because brain regions are not just algebraically isolated variables, but rather located in a three-dimension space with complex geometric structure, a desirable future direction is to exploit this geometric information for improving the error control. Group analysis is a frequently encountered requirement in biomedical research. Graphical models introduce inter-subject diversity at both the parameter level and the structure level. Most existing methods can be considered to take a two-level framework: a lower level of models for each individual subject, and a group level integrating individual models and describing inter-subject commonality and diversity. Being modular, incrementally updatable, and scalable is highly desirable, yet not well implemented features for current group analysis. Network analysis is an important post-processing for extracting interpretable and human-understandable information from graphical models. Network concepts such as centrality, modularity, connection patterns, the "small-world", "scale-free" property have been actively explored in the analysis of brain connectivity. As various calculation methods could be proposed to quantify the same network concept, it is of great theoretic and practical importance to further pursue criteria for rigorously comparing the discriminability of network measures.

Software and databases
The interest on modeling brain connectivity using fMRI has been experiencing an increasing, important growth in the signal processing community during the last decade. One of the factors of this success is the availability of public-available software and databases. As a reference for interested readers, here we provide an overview of several widely used computer programs related to fMRI brain connectivity analysis. This list is by no means complete.
• Statistical Parametric Mapping (SPM): Developed by the Wellcome Trust Centre for Neuroimaging. Website: http://www.fil.ion.ucl.ac.uk/spm Brief description: The SPM software package, probably the most popular one, has been designed for the analysis of brain imaging data sequences. The sequences can be a series of images from different cohorts, or time-series from the same subject. The current release is designed for the analysis of fMRI, PET, SPECT, EEG and MEG.
• LONI Software: Developed by the Laboratory of Neuro Imaging at the University of California, Los Angeles. Website: http://www.loni.ucla.edu/Software Brief description: The popular LONI Software is a comprehensive library for neuroimaging analysis, including pipelines for automated processing, web-based applications, tools for image processing and visualization, etc.
• FMRIB Software Library (FSL): Developed mainly by the FMRIB Analysis Group at the University of Oxford. Website: http://www.fmrib.ox.ac.uk/fsl Brief description: FSL is a comprehensive library of analysis tools for fMRI, MRI and DTI brain imaging data.
• MRIcro: Developed by Professor Chris Rorden's group at the University of South Carolina. Website: http://www.sph.sc.edu/comd/rorden Brief description: MRIcro allows efficient viewing and exporting of brain images. It can create Analyze format headers for exporting brain images to other platforms, such as SPM.
In addition, it allows neuropsychologists to identify regions of interest (ROIs).
• FreeSurfer: Developed by the Athinoula A. Martinos Center for Biomedical Imaging. Website: http://surfer.nmr.mgh.harvard.edu Brief description: FreeSurfer is a set of automated tools for reconstruction of the brain cortical surface from structural MRI data, and overlay of functional MRI data onto the reconstructed surface.
• Brain Connectivity Toolbox (BCT): Developed mainly by the Computational Cognitive Neuroscience Laboratory at Indiana University. Website: http://www.brain-connectivity-toolbox.net Brief description: This toolbox provides an access to a large selection of complex network measures in Matlab. Such measures aim to characterize brain connectivity by neuro-biologically meaningful statistics, and are used in the description of structural and functional connectivity datasets.
There are normally fMRI datasets associated with the above software. Here we also briefly mention a few publicly-available fMRI databases. Details related with the experiment, design and data content are available in the associated website links.
• The rate of technological progress is encouraging increasingly sophisticated lines of enquiry in cognitive neuroscience and shows no sign of slowing down in the foreseeable future. Nevertheless, it is unlikely that even the strongest advocates of the cognitive neuroscience approach would maintain that advances in cognitive theory have kept in step with methods-based developments. There are several candidate reasons for the failure of neuroimaging studies to convincingly resolve many of the most important theoretical debates in the literature. For example, a significant proportion of published functional magnetic resonance imaging (fMRI) studies are not well grounded in cognitive theory, and this represents a step away from the traditional approach in experimental psychology of methodically and systematically building on (or chipping away at) existing theoretical models using tried and tested methods. Unless the experimental study design is set up within a clearly defined theoretical framework, any inferences that are drawn are unlikely to be accepted as anything other than speculative. A second, more fundamental issue is whether neuroimaging data alone can address how cognitive functions operate (far more interesting to the cognitive scientist than establishing the neuroanatomical coordinates of a given function -the where question).