Journal of Neuroscience Methods

Background: In the graphical modelling of brain data, we are interested in estimating connectivity between various regions of interest, and evaluating statistical signiﬁcance in order to derive a network model. This process involves aggregating results across frequency ranges and several patients, in order to obtain an overall result that can serve to construct a graph. New method: In this paper, we propose a method based on p -value combiners, which have never been used in applications to EEG data analysis. This new method is split into two aspects: frequency-wide tests and group-wide tests. The ﬁrst step can be effectively adjusted to control for false detection rate. Results: This two-step protocol is applied to EEG data collected from distinct groups of mental health patients, in order to draw graphical models for each group and highlight structural connectivity differences. Using the method proposed, we show that it is possible to reliably achieve this while effectively controlling for false connections detection. Comparison with existing method(s): Conventionally, the Holm’s Stepdown procedure is used for this type of problem, as it is robust to type I errors. However, it is known to be conservative and prone to false negatives. Furthermore, unlike the proposed methods, it does not directly output a decision rule on whether to accept or reject a statement. Conclusions: The proposed methodology offers signiﬁcant improvements over the stepdown procedure in terms of error rate and false negative rate across the network models, as well as in term of applicability. © 2016 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
The understanding of connectivity in large-dimensional time series has been a topic of central importance in neurology, and more precisely in neurological imaging.The interest in these techniques is widespread across imaging techniques (EEG, Medkour et al., 2010;fMRI, Marrelec et al., 2006) and experimental works of various types (learning experiments, Fiecas and Ombao, 2014;motor skills, Mima et al., 2000;resting-state, Salvador et al., 2005).
One of the most important features of neurological data analysis is functional connectivity between parts of the brain: how do various regions of interest interact?For this purpose, graphical modelling of time series is an ideal tool.
A well-known contribution to this field is the frequency-domain approach exposed in Dahlhaus (2000).In this methodology, the data is transformed into the frequency domain, where its codependency structure is analysed via the partial coherences.The partial coherence measures the connection between two series after the removal of the linear effects of the remaining series.It is a function of frequency and can be used to reflect connectivity across any frequency range ˝ ⊆ [0, f N ].It is derived from S −1 (f), the inverse of the spectral matrix S(f).Its use in network analysis and related methodologies has become very widespread in the literature (Makhtar et al., 2014;Mima et al., 2000;Pohja et al., 2002).
Once the partial coherences have been estimated, an important question arises concerning the statistical significance of the results.A zero-valued estimate would indicate that no connection is detectable between two variables at a given frequency.The standard approach is to test for the following at all frequencies f in range: H 0 : partial coherence = 0 vs H 1 : partial coherence > 0.
In some cases, multiple measurements of the data are available, hence allowing the estimation of a bootstrapped distribution (Fiecas and Ombao, 2011).When this is not possible, the use of analytical tools is required.A popular alternative involves the Partial Mutual Information (PMI), which integrates the estimates at all frequencies within a band into one variable for each partial coherence, and is then compared to a threshold (Salvador et al., 2005).However, its distribution and statistical properties remain largely unknown, and it is often unclear how the threshold value should be set for a given band in [0, f N ].
In this paper, we consider an alternative protocol that follows the method detailed in Medkour et al. (2009).Under conventional pre-processing and estimation methods for the spectral matrix S(f), its distribution and the distribution of its partial coherence are known (Goodman, 1963), and then, under H 0 , the partial coherence estimates can be modelled as Beta-distributed random variables.
In the frequency domain approach, results are derived at each frequency f, for all frequencies in range f ∈ ˝.However, this is many steps away from an overall graphical model.Once the partial coherences have been measured and tested for significance at a suitable level ˛, how can the results be aggregated across frequencies and subjects to deliver one graph?
This can be regarded as a multiple hypothesis testing problem, otherwise known as a conjunction analysis problem in Neurology (Friston et al., 1999).A traditional approach to this is the Holm's Stepdown procedure (Lehmann and Romano, 2005;Holm, 1979), where the null hypothesis H 0 is tested for at each frequency ordered by p-values p (1) ≤ p (2) ≤• • •.Every time H 0 is rejected, the procedure moves on to the next frequency, and only stops at the frequency L where H 0 is finally accepted.This method is robust to Type I errors, as it is designed to control the family-wise error rate (FWER) below a desired level ˛ (Lehmann and Romano, 2005).However, the stepdown is prone to false negatives.Furthermore, translating the resulting L into a decision is ambiguous.L may sometimes be very small, especially compared to the number of frequencies it is computed for.For instance, if L = 1 for some pairs of variables, should the connections between these pairs still be included in the graphical model?No substantial research has been carried out to answer this beyond some case-specific solutions (Medkour et al., 2010).
Multiple hypothesis testing is not limited to the use of the Stepdown procedure.In other applications, the use of p-value combiners is very prevalent.In Genomics, the Westfall-Young min-p procedure (Westfall and Young, 1993) proves to be very popular, as it is robust to Type I errors and can also handle correlated data by estimating the joint H 0 distribution through resampling.Other well-known combiners rely on Bayesian inference, such as Efron's empirical Bayes method (Efron, 2003), where prior probabilities are assigned on the proportion of null and non-null statements and the false discovery rate is evaluated empirically.
In the context of spectral domain analysis, closed-form analytical methods tend to be preferred, due to the large computational cost associated with performing calculations at each frequency.In this category of methods, The Fisher (Fisher, 1932) and Simes (Simes, 1986) combiners constitute popular examples, that are widely used in applications of Computational Statistics, such as Genetic Epidemiology (Sungho et al., 2009) and Biostatistics (Chen et al., 2014).They deliver a single scalar that can then be tested on well-defined distributions, in order to ascertain the significance of an overall proposition.The use of these p-value combiners has been relatively rare in graphical modelling of neurological data thus far.Conventionally, they require that the set of multiple tests are independent, which is almost never the case with frequency domain data.However it is possible to generalise their use for this specific application.
In this paper, we review various p-values combiners and assess their suitability for graphical modelling of EEG data compared to the Stepdown procedure.After reviewing some background results in frequency domain analysis and multiple hypothesis testing, we propose a two-step procedure to carry out graphical modelling on EEG data.
• We demonstrate how the classical p-value combiners can be used on a subset of the frequency range that only includes uncorrelated data, and evidence their performance on simulated data.• Test combiners can also be used to ascertain the significance of a graphical model for a sample population.Using EEG measurements from three distinct groups of mental health patients, we demonstrate how we can aggregate the results of each patient in all groups in order to obtain a group-wide coloured graphs, which show the intensity of connections in each group.• When combining results across a frequency range, each connection between a pair of channel is evaluated independently.In doing so, it is important to control for the detection of false positives when constructing individual graphs.We show how this can be managed using a false edge detection adjustment, for both low-dimensional and larger dimensional data.

Background -graphical modelling
Let {X t } be a 2nd order stationary vector time series, {X t } ∈ R p , t ∈ {0, . .., T − 1}, with an associated spectral matrix S(f ) ∈ C p×p .Many estimation procedure exist for S(f), here we choose the multitaper spectral estimate for its good statistical and analytical properties (Percival and Walden, 1993).It starts with a set of K orthogonal tapers, satisfying the following property: There are many types orthogonal tapers {h t,k } that are regularly used in the literature on spectral estimation, we choose here the Sine tapers, for their ease of implementation (Walden et al., 1995), defined for all k ≤ K and t ≤ T − 1: Using the multiple tapers {h t,k }, we can define the following Fourier transforms J k (f) for k < K on the data X t : which in turn can be used to create an estimate for the spectral matrix S(f), called the multitaper estimate: To ensure the invertibility of the matrix S(f), we require that the number of tapers exceed the dimensions of the data, i.e.K > p.
The matrix S(f) is instrumental in evaluating the dependency structure between each pair of the p channels.Doing so, we aim is to build a graph for the data set {X t }: E represents the set of edges that connects channels to each other.Because we are interested in forming an undirected, conditional correlation graph, we define the set of edges E as follow: The term "X i ⊥ X j |X (\ij) " defines conditional independence between channel i and j -in other words, no direct connections between these two channels when excluding the p − 2 others.This can be reformulated using the partial coherence (Dahlhaus, 2000;Medkour et al., 2009): Hence, we bring our attention to the partial coherences for the construction of the graph G.Because it relies on the unknown quantity S −1 (f), we use estimates for the partial coherences: where S −1 (f ) is a suitable estimator for S −1 (f), derived from Ŝ(mt) K (f ).In order to construct the set of edges E in the graph G, we need to test for the significance of the partial coherence estimates at every pair (i, j): For each pair of channels i and j, we derive a decision on whether to accept/reject H 0 at level ˛, using the protocol in Section 3.This can be translated into a graphical model as follow: Ĝ = (V, Ê), (2.4) We then have an estimate Ĝ of the true graph G for the data {X t }.
This testing procedure is repeated for all n pairs = ( p 2 ) = p(p − 1)/2 pairs (i, j), i < j ≤ p, until the set of edges Ê is fully established.Each edge is tested independently, as they are defined as the conditional dependence of different pairs of channels while discounting the others.The risk of detecting a false edge increases with n pair .
Therefore, the probability of including an edge in Ê erroneously should be controlled, as covered in Section 5. We propose two methods to achieve this, on that is strict and the other more permissive, which will be appropriate depending on the dimensions of the data.The same applies for population-wide testing.Consider a set of k ≤ N data samples {X k t }, collected from N G distinct groups of subjects, with N g individuals in each group.We may be interested in estimating a group-wide graphical model G g = (V, E g ) for each group g ≤ N G , starting with individual graphical models for each i ≤ N g : ∀i ≤ N g , Ĝi = (V, Êi ).
The individual graphs Ĝi need to be combined for each group in order to derive a group-wide estimate Ĝg for the true graph G g of the population g.
This echoes the questions covered by multi-subject conjunction analysis (Friston et al., 1999;Heller et al., 2007), where in a group of subjects we need to ascertain the proportion u of individuals who exhibit a trait -in our case a connection (i, j) ∃i / = j ≤ p.While in general multiple hypothesis testing, we would test for u > 0, in conjunction analysis we want to test for a range of u ∈ [0, 1]: where k is the true proportion of individuals in the group g exhibiting the trait of interest.The hypothesis test of Eq. (2.5) can be repeated for a range of u's, and the results of these combined in a "coloured" graph, where a colour is associated with each value of u.This is covered in Section 6.

Hypothesis tests for graphical modelling
Eq. ( 2.3) can be evaluated using multiple tests H 0 (f)/H 1 (f) at each frequency f, and subsequently aggregated: across discretised frequency f ≤ N f .Under H 0 (f), when using the multitaper estimate of Eq. (2.1) with K > p, the estimate ˆ 2 ij|(\ij) (f ) of Eq. (2.2) is known to have a Beta(1, K − p + 1) distribution (Medkour et al., 2009): An important consideration is that the data at each frequency is not always independent of other frequencies, depending on the bandwidth of the spectral estimate.
Definition 3.1.The bandwidth of a spectral matrix estimate is the spectral window covered by said estimator, and is analogous to its resolution.In our case, using a multitaper estimate with Sine tapers, we have (Walden et al., 1995): The bandwidth of the data in applications is defined by f : = f l+1 − f l , for any 1 ≤ l < N f , N f representing the number of discretised frequencies in [0, f N ].If B < f, the data across frequencies is deemed independent, otherwise we consider that there exists dependency between frequencies.(Percival and Walden, 1993).

Remark 1. Data at any frequencies
The nature of the correlation structure of the spectral data across various frequencies is not well established, but it can be assumed that it is strictly non-negative: Remark 2. Spectral estimates reliant on smoothing/tapering are virtually never free of inter-frequency correlation, as the smoothing process incurred by the smoothing/tapering widens the bandwidth B. In spite of this, this class of spectral estimators is still preferred for their other statistical and inversion properties.
With every hypothesis test procedure, we pay close attention to the levels of the tests ˛.It represent the upper limit we are willing to tolerate on the probability of making a false detection, or in other words, the probability of making a type I error: Across a set -or a "family" -of multiple and/or simultaneous tests, it is equally important to consider the family-wise error rate (FWER); Definition 3.2 (Family-wise error rate (FWER), Lehmann and Romano, 2005).The family-wise error rate, in the context of multiple tests H 1 0 /H 1 1 , . .., H 0 /H 1 , is the probability of making one or more Type I errors across all the tests: where |V| represents the size of the set V.
Hence, FWER control is an extension of type I error control to the context of multiple hypothesis testing.Thus, for a given level ˛, we require that: An important related concept is the false discovery rate (FDR).
Definition 3.3 (False discovery rate (FDR), Lehmann and Romano, 2005).The FDR is the expected rate of Type I errors in a multiple hypothesis testing: with equality holding if H 0 is true, and otherwise the inequality is strict.Thus, any form of FDR control is more liberal than FWER control, in the sense that it will permit more rejections (Lehmann and Romano, 2005).
Equally, the Bonferroni inequality is a central result in the understanding of FWER/FDR control.Definition 3.4 (Bonferroni Inequality, Lehmann and Romano, 2005).For a set of hypothesis tests with p-values p 1 ,. .., p , we have: The implication of this result is that the FWER can be controlled while testing for H 0 : H i 0 true ∀i ≤ vs H 1 : H i 1 true ∃i ≤ at significance level ˛, if we test for each of the individual test at level ˛/ .While this result is very reliable, it can be too strict when becomes large, and induce large type II error rates.Throughout the rest of the paper, we rely on this result: Lemma 3.5 (Dickhaus, 2014, Thm 2.1).For multiple hypothesis tests H i 0 vs H i 1 with p-value p i where the test statistics have a continuous distribution, the p-values are uniformly distributed under the general null hypothesis H 0 : Having established the nature of the set of hypothesis tests we want to run, we now review the various protocols available to adequately perform the multiple hypothesis tests in Section 4.

p-Value combiners
Multiple hypothesis tests can be handled with various p-value combiners.Here we present three approaches: the Holm's Stepdown procedure, the Sime's approach and the Fisher combiner.The first can handle correlated tests and returns a scalar L representing the number of tests where the statement was significant, and is also a methodology that is widely used in multiple hypothesis testing applied to neurological data.The other two return a single p-value, but only work for independent tests.

Holm's
Stepdown procedure (Holm, 1979) Multiple hypothesis testing, especially in the case of correlated test, can be approached with the Stepdown procedure.In the context of Eq. ( 2.3), we have a set of ϒ = N f hypothesis tests to perform on the data, with decisions H i 0 , H i 1 for each i ≤ N f , an associated test statistics T i and p-value p i .Denote the ordered test statistics and p-values as follow: and their corresponding hypotheses as In the Stepdown approach, we test sequentially for a subset of the multiple hypothesis: The Holm's Stepdown procedure can be used on the hypothesis test of Eq. ( 2.3) and on the conjunction analysis tests of Eq. (2.5) in the following way: In either form, the Stepdown procedure delivers a scalar L: When applied to frequency-wide data, this shows the number of discretised frequencies f ≤ N f for which the estimated partial coherences were significantly greater than zero.This protocol is uniformly more powerful than just applying a Bonferroni correction based on Definition 3.4, as it is more capable of detecting cases where H 0 is false (Holm, 1979;Lehmann and Romano, 2005).For the purpose of frequency-wide testing, we use thresholding and reject H 0 in Eq. (2.3) if L > 0.
When applied to subject conjunction, this shows the number of subjects in a group for whom a trait was deemed significant.
For the following methods, we only work across uncorrelated frequencies, which are B Hz apart (Eq.(3.3)).Hence, we have M = N f B/ f eligible data entries and ϒ = M multiple tests.

Sime's modified Bonferroni approach
Sime's method builds directly on the Bonferroni Inequality, and works for independent tests.Although it has been shown to apply to positively correlated tests as well, we will not recourse to this extension here, as it isn't clear whether partial coherences estimates between frequencies are non-negatively correlated.
Lemma 4.1 (Simes, 1986).Let p (1) ≤ p (2) ≤ • • • ≤ p (M) be the ordered p-values associated with M multiple tests, and all p (j) follow the U(0, 1) distribution under H 0 .Set a test procedure that decides to reject H 0 when there are some j ≤ M such that: This test procedure has a FWER = ˛ when the multiple tests are independent.
Lemma 4.1 is a consequence of the Bonferroni inequality, as shown in Simes (1986).
From Lemma 4.1 we can reformulate the general null hypothesis H 0 as follow: Based on this, we can introduce a p-value combiner built on Lemma 4.1: H 0 is then rejected in Eq. ( 2.3) if p Simes < ˛ in Eq. (4.2).

Fisher p-values combiner
The Fisher combiner (Fisher, 1932) is by far the most popular choice of p-value combiners for independent tests, as it has a welldefined formulation and distribution under H 0 : , where C represents the critical value of the 2 2M distribution at level ˛.

Network-wide FDR control
In the previous sections, we have covered various techniques to combine multiple tests across a set of discretised frequencies in order to deliver a decision rule for the whole frequency range ˝.This is done for each pair of channels (i, j), i < j ≤ p. Doing so, we also need to control for false positives across pairs of connections when evaluating individual graphs.
The above procedure is repeated for all n pairs pairs (i, j), i < j ≤ p -with n pairs = p(p − 1)/2 -until the set of edges Ê in Eq. (2.4) is fully established.Each edge is tested independently, as they are defined as the conditional dependence of different pairs of channels while discounting the others.This in itself defines a new family of tests, and as such the risk of detecting a false edge increases with n pairs : This directly relates to the Bonferroni inequality (Definition 3.4).
Hence, an appropriate course of action when testing for the n pairs edges in the graph Ĝ of Eq. (2.4) is to set the significance level at ˛* = ˛/n pairs instead of just ˛ in order to minimise the probability of detecting false edges.For each pair of variables (i, j), denote their associated p-values p (i,j) output from the combining of results across frequencies.We then have the following rule: (i, j) / ∈ Ê for all pairs s.a.p (i,j) > ˛ * . (5.2) This correction becomes very strong as the dimensions p grow large.For instance, with p = 50 dimensions, this would lead to a corrected level ˛* = 4 .10−5 from ˛ = 0.05, rendering the detection of truly-existent edges almost impossible.An alternative is to allow some relaxation in the correction required on the level ˛*, without losing out too excessively on the FDR control.
In the context of conjunction analysis, the Sime's combiner can be used in the hypothesis test of Eq. (2.5).Building on a protocol detailed in Benjamini and Yekutieli (2001) and in (Heller et al., 2007), we propose the following methodology: . This is in fact also known as the Benjamini and Hochberg step-up procedure.
Proof.This topic and associated proofs are extensively covered by Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001).ᮀ Remark 4. The Benjamini and Hochberg step-up procedure is comparable to the Holm's Stepdown procedure of Section 4.1, but aims to control the FDR while the Stepdown procedure performs a strict FWER control.
Therefore, the procedure of Eq. ( 5.2) is a form of FWER control, whereas the procedure of Eq. (5.4) is a type of FDR control.Based on Remark 3, we expect the second scheme to be less conservative than the first.

Multi-subject conjunction/group-wide graphical modelling
Thus far, the various testing protocols on the discretized frequencies and connections of channels have been performed for each individual in the study, and some attention has been paid on FDR control for the purpose of evaluating individual graph models.When dealing with groups of subjects, the next stage is to aggregate these results across all subjects in order to obtain a unique group-wide graph.
Consider a collection of k ≤ N time series data {X k t }, split into N G distinct groups We are interested in estimating a graphical model G g = (V, E g ) for each group g ≤ N G .In each of these groups, there are N g individuals -such as g N g = N G -for which we estimated a graphical model from the p-values associated with each edge p i (1) , . .., p i (n pairs ) : for some p-value combiner function f(•).The p-values p i (1) , . .., p i (n pairs ) need to be aggregated across all individuals i ≤ N g in each group in order to derive a group-wide graphical model estimate Ĝg .We do so via some function F(•) that takes the p-values of all the individuals i ≤ N g in the group g as inputs.
We are interested in assessing whether Ĝg is the true graph G g of the group g.In doing so, we are concerned with controlling for false edge detection between any two channels j and k: We denote the proportion of individuals in group g for whom the connection (j, k) is deemed significant as follow: Note that PI jk is not equal to the proportion of graph estimates Ĝi that exhibit the connection (j, k): The quantity PI jk can be computed using: • The Stepdown procedure (Eq.(4.1)),where the output L represents the number of individuals in group g for which the connection (j, k) is significant.The quantity PI jk would then be equal to PI jk = L Ng .• The Benjamini & Hochberg procedure (Eq.(5.4)),where for each connection (j, k), in the group g, denote the p-value p i associated with individual i, and the ordered p-values p (1) ≤ • • • ≤ p (Ng) and set then PI jk = u * Ng .
Remark 5.The Stepdown and Benjamini & Hochberg procedures are effectively a form of group-wide FDR control, equivalent to the procedures proposed in Section 5 for the making of individual graphs.
We propose to construct the graph Ĝg based on the individual p-values p i , i ∈ g, for each edge, using the subjects proportions PI jk .For any connection (j, k), given a threshold level Â ∈ (0, 1), In this context, the estimate Ĝg is in fact a function of Â, i.e.Ĝg = Ĝg (Â).Here, Â represents the minimum required proportion of subjects that the channels connections need to be significant for, in order to include them in a graphical model estimate Ĝg (Â).This procedure can be repeated for any value of Â ∈ (0, 1].Thus, for each group g, estimates Ĝg (Â) can be devised for a range of Âs in and overlaid to form a general graphical model estimate: This creates a "coloured" graph.
Definition 6.1 (Coloured Graphs).A coloured graph is a graphical model Ĝ constructed from several sub-graphs Ĝ(Â), derived from thresholding the variables {PI jk } for a set of several Then, each Â ∈ can be associated with a colour -or a "line-style" -and each edge (j, k) inside of Ĝ is associated with the colour of the largest Â such that (j, k) ∈ Ĝ(Â).
For each connection in the graph and a range of thresholds Âs, an edge is included in a graphical model if it is significant for any threshold Â ∈ : Remark 6.For any two thresholds Â, Â in [0, 1] such that Â > Â and any pair of variables (j, k), if (j, k) ∈ Ĝg (Â) then it follows that (j, k) ∈ Ĝg (Â ).
In the context of building a graphical model with various Â ∈ , a colour is associated with each Â and matched to an edge in Ĝg if it belong to Ĝg (Â).
This colouring system shows how prevalent connections are in the graphical models for each group in terms of subjects proportion.This type of representation is highly appropriate for neurological data analysis, as it allows to create intensity maps to represent the strength of neural activities in the brain, in the style of Heller et al. (2007Heller et al. ( , p. 1184)).
Alternatively, researchers can decide what choice is appropriate for a fixed value Â = Â, and estimate a graphical model using it on its own: Ĝg = Ĝg ( Â).This choice will depend on the nature of the data being analysed and the level of inference required.
Using these methods, we can ensure that for a group g the probability p jk,g is bounded above by ˛, for all connections (j, k).

Simulation results
We apply three p-value combiners to several VAR(1) data models, in order to assess their ability to retrieve the true network structure of each model: • Holm's Stepdown approach (Eq.(4.1)), applied as a dependent and independent scheme.In the dependent case, results from all frequencies are combined, whereas in the independent scheme only independent frequencies are considered (see details in Section 3), • Simes combiner (Eq.(4.2)), • Fisher combiner (Eq.(4.3)).
We look at two groups of models, that differ in their dimensions: Group 1: Low to moderate dimensions, p = 10, Group 2: Moderate to high dimensions, p = 35.
Both group consist of five VAR(1) models, all expressed in the following way: with t ∈ [0, T − 1], T = 3072, Á t ∼ i.i.d N p (0 p×1 , I p ), t = 1, hence f N = 0.5Hz, generating N f = 3073 discretised frequencies on the range [0, 0.5] Hz.Each VAR(1) model is stationary, and produced using the semi-random VAR(1) model generator detailed in Appendix A. The true spectral matrices S(f) for each model is found as described in Appendix B. In each group, the various five models differ in their Degree of Connectivity (DoC), which represents how much the variables of a multivariate system are connected to one another.
Definition 7.1 (Degree of Connectivity (DoC)).For a multivariate time series {X t } ∈ R N×p , and a graph G = (V, E) fitted to the data {X t }, we define the DoC to be the proportion of existing edges in the set of all possible connections: For instance, a set of perfectly independent p-dimensional variables {X t } ∈ R N×p would return a graph G with no edges, and a DoC of 0%, whereas a perfectly connected graph G would have a DoC of 100%.
The DoCs of the models under consideration are displayed in Table 1.
For each model in each group, we create m = 500 data-copies {X m t } t .For each one of them, we compute the multitaper spectral estimate Ŝ(f ) (Eq. (2.1)) with K = 1.2 × p -resulting in a bandwidth of B = 0.005 Hz for group 1, and B = 0.015 Hz for group 2. We subsequently apply a diagonal up-lift with a factor of Á = 10 −6 in order to stabilise inversion.The estimated partial coherences are then calculated from the inverted spectral matrix estimate Ŝ−1 (f ) at each discretized frequency f, and tested according to the protocol in Eq. (3.2).We then aggregate the results for each partial coherence using each p-value combiner: Fisher, Simes, and Holms.The combiners are then transformed into critical values C, the p-value of each combiner under their H 0 distribution.This then produces an adjacency matrix: the case of the Holm's Stepdown procedure, we set C = L and For consistency with the rest of the literature on the subject of hypothesis testing, we set ˛ at ˛ = 0.05.
Using the adjacency matrix produced by each combiner, we can bring our attention to the average error rate (AER), which is measured in terms of Hamming distance between the estimated graph Ĝ with adjacency matrix A(C) and the true graph G with adjacency matrix A (Banks and Carley, 1994;Medkour et al., 2010): where we have #(i, j) = p 2 = p(p − 1)/2 possible connections in the graph.Alternatively, we also look at the Average False Positive (AFP) rate and the Average False Negative (AFN) rate, which are defined as the AER over the set of non-existent and existing connections respectively: where we have: Remark 7. The AFP rate in Eq. ( 7.3) is analogous to the FDR of Definition 3.3, whereas the AFN rate is a measure for type II errors.
Results are averaged over all m copies and displayed in terms of the evolution of the AER (Eq.(7.2)), AFP rate and AFN rate (Eq.(7.3)) for each p-value combiner, as a function of DoC, in Fig. 1 for group 1, and in Fig. 2 for group 2. For all models, we evaluate edge detection using three approaches: with regular levels ˛ = 0.05, with Bonferroni-adjusted levels ˛* = (0.05/n pairs ) of Eq. (5.2), and with the Benjamini & Hochberg adjusted levels ˛*(k) = ˛ × k/n pairs of Eq. (5.4).

Main results
We can see for group 1 in Fig. 1 that, when using the level ˛ = 0.05 on the edges, the overall Average False Positive rate is not actually bounded above by 0.05 for any p-value combiner, even with the more conservative dependent Stepdown procedure.Indeed, the dependent Stepdown scheme consistently achieves lower AFP rates than its counterparts.The stepdown protocol being designed to perform a strict FWER control, this is expected.On the other hand, the Fisher combiner is known to exaggerate evidence against the null hypothesis in multiple tests, which is manifested here with higher AFP rates and lower AFN rates across most models.
In Fig. 1, the adjustments on the significance level ˛ proposed in Section 5 manage to reduce the AFP rate of all combiners.The reduction in AFP rate is most significant with the Bonferroni adjustment of Eq. (5.2), ˛ = (0.05/n pairs ).This comes at the cost of higher AER and AFN rates for all combiners.The same observations can be made for the level adjustment of Eq. ( 5.4), ˛(k) = 0.05 × k/n pairs , but the impact on the AER and AFN rates is much more moderate.This is expectable, as this type of level adjustment is known to be less strict than a Bonferroni correction.However the reduction in AFP rate it achieves is limited, and for low dimensions such as p = 10 it is only marginally better to using no correction at all.
The benefits of using the Benjamini & Hochberg (B&H) level correction of Eq. ( 5.4) are more obvious on larger dimensional data, for which the results are displayed in Fig. 2. We can see that the more conventional Bonferroni adjustment of Eq. ( 5.2) is far too strict, leading to all p-value combiners to return AFN rates ranging from 45% to 99%.The B&H level correction manages to reduce the AFN rate considerably for most combiners, even down to 20% for the Fisher combiner for the models with the higher DoCs.While this is a significant improvement over the conventional Bonferroni adjustment, this still represents a substantial level of false negatives in the estimated graphical models, especially compared to the results obtained on lower dimensional data in Fig. 1.This calls for further investigations of false positives control in graphical models of large dimensions.
In Figs. 1 and 2, the Fisher combiner stands apart from its counterpart in terms of all divergence criteria.While its AFP rate can be higher compared to other combiners, it manages an overall AER that is lower in most models, but especially so for highly connected VAR(1) models.More importantly, it is more robust to type II errors than its counterparts.This makes it an highly eligible p-value combiner for higher dimensional data, such as in Fig. 2, where none of the other methodology were able to return AFN rates below 60%.Based on its low AER and despite its higher AFP rate, the Fisher combiners seems to be the most appropriate p-value combiner for aggregating partial coherence results over frequencies and detecting edges across networks.
Looking at the different ways of implementing the Stepdown procedure, we can see that the regime using all frequencies has lower AFP rates compared to the regime using independent frequencies only, in all models.Additionally, the dependent stepdown scheme returns very similar AERs and AFN rates to the independent scheme.Therefore the dependent scheme outperforms the independent scheme, as it can achieve the same level of accuracy but with greater robustness to false positives.This also suggests that combiners that can handle correlated data points are better than their independent counterparts.This calls for the development of a Fisher combiner that is capable of handling dependent frequencies.As its performance in terms of AER and AFN rates is currently superior to those of other combiners, it could be improved even further when adapted to work across all data points, and help reduce its relatively high AFP rate.Some initial work in this direction has been made, notably in Kost and McDermott (2002), Chen et al. (2014) and Li et al. (2014), where a generalisation of the conventional form proposed, using Satterthwaite approximations.The analytical validity of such methods is yet to be established, and would form a worthwhile objective for future research work.

Effect of group heterogeneity
The simulation study depicted in the previous section is repeated here for "families" of time series, which all share the same underlying parametric model but have different noise processes.Recall the definition of the true spectral matrices S(f), detailed in We create n ind = 10 perturbation matrices 1 , . .., 10 , and associate them to one of the n ind families derived from the 10dimensional VAR(1) model # 2, detailed in Table 1.In the same manner as in Section 7.1, m = 500 copies are created for each of the n ind families of the VAR(1) model.The results for each partial coherence test (Eq.(3.2)) are then aggregated using the Fisher combiner of Section 4.3, as it produced better results in terms of AER in the previous section.We then evaluate the ability of the Stepdown procedure and of the Benjamini & Hochberg (B&H) procedure (Section 6) to retrieve the true connectivity network after combining p-values from all 10 families, with ˛ = 0.05.Results are displayed in Fig. 3 in terms of AER (Eq.(7.2)), AFP rate and AFN rate (Eq.(7.3)) as a function of the threshold Â ∈ [0, 1], depicted in Eq. (6.2).
From Fig. 3, we can see that the AFP rate drops and the AFN rate rises as a function of Â.This is expected, as a larger Â is synonymous with stricter requirements in terms of edge detection across all subjects.Beyond Â = 0.5, we can see that the AFP rate drops to zero, meaning that no false edge detection occurs.This suggests that no additional FDR correction is needed on group-wide graphs produced with the Stepdown or B&H conjunction protocols, for any Â > 0.5.
In both the Stepdown and B&H procedure, an optimum can be achieved in terms of AER for thresholds Â ∈ [0.3, 0.6].This understandably represents a balance between a low AFP rate and an AFN rate that is not excessively high, and also gives a reasonable set of choices for Âs for the making of coloured graphs in Section 8.2.

Application to data set
In this section the results exposed in Section 6 are applied to the data set detailed below, in Section 8.1.

Data set
We focus our attention on the data set discussed in Medkour et al. (2010): EEGs are taken from 34 mental health patients (all male, aged between 20 and 60 years), in resting-state and with eyes closed, from the Serbsky Institute in Moscow.Of these patients, 15 were diagnosed with positive syndrome schizophrenia, 19 were diagnosed with negative syndrome Schizophrenia.These two conditions are quite distinct, notably in terms of their respective clinical symptoms.Hence, a motivation for this data collection is to understand whether these conditions can also be reliably identified using neurological imaging.Notably, we are interested in establishing whether it is possible to discriminate between positive and negative schizophrenia with graphical modelling of brain functionality, using the method described in Sections 3.4-6 on EEG recordings.The implication of such findings would be that various mental health conditions impact the organisation of the brain in specific and distinct ways.This could help at the diagnosis stage, as many mental health conditions share a large number of clinical symptoms and can be hard to demarcate.
In addition, a control group of 24 healthy subjects of similar age to the patients is introduced to the study.We denote the positive syndrome group as "Positive", the negative syndrome group as "Negative" and the control group as "Control".
All subjects gave written informed consent before taking part in this investigation.Ethical approval came from the local Moscow Ethics Committee and in compliance with national legislation and the Declaration of Helsinki.
Previous research has found consistent differences in brain activity of schizophrenia patients compared to controls at low frequencies (Di Lorenzo et al., 2015;Itoh et al., 2011), especially in unmedicated patients (Mientus et al., 2002).Thus the Delta band, defined over the frequency range [0.5, 4] Hz, is of high scientific interest in the context of this study.
Additionally, there exists strong neural oscillations, known as alpha rhythm, that create a dominant spectral line at around 10 Hz (Dawson and Fischer, 1994) which can interfere with the results.The alpha rhythm -otherwise known as alpha wave -evolves in the range of [8,12] Hz and is caused by the electrical activity of the thalamic pacemaker cells in the occipital lobes (a.k.a in channels O1 and O2).Depending on the spectral window used in the estimation method, there can be transfers of power into other bands, including the delta band.It is important to get rid of this interference, as it could cause noticeable side-lobe leakage and significant estimation errors when analysing causal activity.Therefore we focus on the Delta band by means of a Low-Pass filter, as detailed below.
The EEGs in this data have been recorded once from distinct patients, under resting conditions and with eyes closed, meaning they could not have been any interactions between subjects during the experiment.On this basis, we expect to see no correlation between any of the recordings, and assume that they are independent of each other.
The EEGs are recorded for 30 sec. at a sampling rate of 100 Hz ( t = 0.01 s, Nyquist frequency f N = 1 2 t = 50 Hz) from p = 10 scalp sites, named according to the 10-20 system, F3, F4, C3, C4, T3, T4, P3, P4, O1 and O2, and referenced to linked ears with a bandpass filter of 0.5-45 Hz.The signals are then put thought a 6 Hz Low-Pass Butterworth filter to remove leakage from the alpha rhythm.The signal is then downsampled by 5, in order to bring the Nyquist frequency to f N = 10 Hz, so as to eliminate the zero-valued regions of the spectrum.The effects of this pre-processing are illustrated on the power spectrum of one channel from one patient in Fig. 4. The Low-Pass filter adequately manages to remove the alpha rhythm spike at around 10 Hz, while the downsampling adjusts the spectrum over the zero-valued ranges.We can also see that most of the power in the pre-processed spectrum is constrained to the Delta band, in [0.5, 4] Hz.Hence, we focus our analysis in this frequency band specifically.
The resulting signal is a multivariate time series {X j,t } with dimension p = 10 where j = 1, . .., 10 represents the re-labelling of scalp sites: The data of each patient can be viewed as a 2nd order stationary vector time series {X t } ∈ R p , t ∈ {0, . .., T − 1}, as the EEG are recorded under resting-state and eyes-closed conditions.
For the data of each patient, the multitaper spectral estimate Ŝ(f ) (Eq. (2.1)) with K = 12 is computed, with a bandwidth B = 0.4221 Hz (Eq.(3.3)).A diagonal up-lift with a factor of Á = 10 −6 is applied in order to stabilise inversion.
The estimated partial coherences are then calculated from the inverted spectral matrix estimate Ŝ−1 (f ) at each discretized frequency f.Some examples are displayed in Fig. 5, where we can see that some estimates can be quite large on average across frequency band of interest.These estimates at each frequency need to be tested according to the protocol in Eq. (3.2) to ascertain significance.We then aggregate the results for each partial coherence using the Fisher p-value combiner of Section 4.3.Based on results of Section 7.2, we opt to use no additional FDR control, and keep the level ˛ at ˛ = 0.05.This produces one p-value for each patient in each group, which we wish to aggregate as detailed below.
Results are displayed in Figs. 6 and 7.This protocol is applied to all the patients in all three groups, "Positive", "Negative" and "Control".Using these choices of thresholds Â ∈ , we can see the edges that are significant across 33%, 50% and 66% of patients in each group, respectively.For each group, a coloured graph is produced using the thresholds in .The estimated graphs can be compared with one another at each threshold Â ∈ using the Hamming distance detailed in Eq. (7.1).
Additionally, the independence between subjects in each group can be exploited to perform a bootstrap on the subjects time series.Denote N B = 5000 to be the number of bootstraps copies.For each of the three groups g ≤ N G and each bootstrap copy n b ≤ N b , we perform the following: 1. Sample the time series of N g subjects in group g, selected at random with replacement from the set {1, .., N g }, which we denote as {b n b 1 , . .., b n b Ng }.

Compute a group-specific graphs
Ĝg n b from {b n b 1 , . .., b n b Ng } across a wider range of thresholds Â ∈ [0, 1], using the method detailed in Sections 3.4-6.
The results are averaged across all bootstraps N B and summarised in terms of expected Hamming distances between groups as a function of Â in Fig. 8. Performing this bootstrapping procedure is an effective way to assess whether there is a true difference in connectivity between groups, and what is the sensitivity of the Stepdown and B&H procedures to the choice of Â.
In Figs. 6 and 7, we can see clear distinctions between each group of patients, for all values of Â ∈ .This holds across both combining methods, which return similar profiles for each group.Across both conjunction techniques, there are visible differences between the graphical models of the "Negative" and those of the two other groups.The group graph for the negative syndrome patient is significantly more connected in comparison with the other two groups.While the "Positive" group also differs form the controls, it does less so.The distance is greatest between the groups "Negative" and "Control" for any value of Â ≥ 0.15, suggesting that negative syndrome schizophrenia impacts the functional connectivity of the brain in a way that is more profound than the positive syndrome type.This is consistent with clinical observations made in patients: negative syndrome schizophrenia is considered to be a more severe form of the illness, as patients suffering from it see their ability to function and live independently greatly diminished, and have worse prognosis.Typical symptoms include reductions in speech, emotional withdrawal and apathy (Peralta and Cuesta, 1994).Positive syndrome patients on the other hand tend to experience delusions and/or hallucinations for instance, but are more likely to function better than their negative syndrome counterparts.
Fig. 8 shows the evolution of the expected hamming distance between all three groups as a function of Â ∈ (0, 1).We can see that none of the groups have any hamming distance to each other for very high values of Â, suggesting that such thresholds return empty   Coloured graphical models for the groups "Positive", "Negative" and "Control" of Section 8.1, computed with the thresholds Â ∈ (Dotted edges: Â = 0.33, Dashed: Â = 0.5, solid: Â = 0.66).Top: Stepdown procedure, as detailed in Eq. (4.1); Bottom: B&H procedure, as in Eq. (6.1).graphs for each groups.Divergences between groups of patients become more apparent as the threshold Â is lowered, showing maximal distance between graphs for values of Â in [0.1, 0.3].This observation is consistent across both protocols, suggesting that they can return agreeable results when applied to the same data set.
The value of Â directly impacts the Hamming distance between groups, and an interesting endeavour would be to select values of Â Fig. 9. Bootstrapped distributions of the mean Hamming distances between subject specific-graphs.Left: inter-group distances between the "Positive" and "Control" groups (Grey bars), and between the "Negative" and "Control" groups (White bars).Right: intra-group distances for the "Positive", "Control" and "Negative" group, from left to right.[13.7, 13.85] [16.13, 16.19] [14.12, 14.22] [16.7, 17] [16.9, 17.1] [14.8, 14.93] such that the resulting graphs are most distinct from each other.For the purpose of analysing the data of Section 8.1, this may not be the most relevant approach; we are more concerned with retrieving the true functional connectivity graphs of each mental health patients group, as opposed to maximising the distance hamming distances between these graphs.Groups of subjects can also be compared using subject-specific graphs, and Hamming distances measured between individuals within and between groups.Using a bootstrap on the subjects time series, we can compute the means and empirical distributions for the intra-groups and inter-groups subject-specific Hamming distances, in Fig. 9 and Table 2.While the subject-to-subject distance between groups is strictly non-negative for any two groups, the expectation and variability of intra group distances are also nonnegligibly large for all groups, rendering any form of comparison non-viable.On the other hand, the results of Table 2 highlights the substantial heterogeneity of subjects within each group.We can see more distances between the graphs within the "Negative" group compared to the two other groups, for instance.This is reflected in Figs. 6 and 7, where there are more edges of various colours in the graphical models of the "Negative" group.This indicates stronger heterogeneity in this group of patient.This can be explained by the diverse nature of Schizophrenia: clinically, the heterogeneity of this condition is widely recognised, with at least five sub-types being documented across three more general syndromes -"Positive", "Negative" and "Disorganised" (Heinrichs and Awad, 1993).
Remark 9.While no FDR control is needed for the derivation of the coloured graphs of Figs. 6 and 7, it is necessary to use a FDR adjustment for the analysis of the subject-specific graphs examined in Fig. 9 and Table 2.In this case, due to the low dimensionality of the data (p = 10 dimensions) we opt for the regular Bonferroni correction of Eq. (5.2).
Using conjunction analysis on coloured graphical models, significant differences can be established between the various groups of subjects in the data set of Section 8.1.This supports the idea that EEG data can be used for the purpose of identifying various conditions.It remains to be seen whether or not, and how, this type of analysis can scale to neurological recordings of larger dimensions.The samples used in this study were collected from a p = 10 dimensional EEG set, and the results derived from it were obtained using methodologies suitable for low-dimensional data.It would be worthwhile to apply those graphical modelling methodologies to larger data -collected from larger EEG recordings for instanceto see whether similar results can be found.

Discussion
In this paper, the use of various p-value combiners have been demonstrated for the purpose of estimating graphical models for time series analysed in the frequency domain.This process is applied to EEG data collected from groups of psychiatric patients with distinct conditions, and split into two aspects.
First, estimates of pair-wise connectivity -the partial coherences -are tested for their significance at all frequencies, and p-values are aggregated across a suitable set of frequencies in the range of interest.In this paper the Fisher combiner was identified as the preferred choice in terms of average error rate, despite its high false positive rate and even when compared with the Stepdown procedure.
For any of the p-value combiners reviewed in this paper, the false positive rate can be effectively controlled across subjectspecific networks, using a Bonferroni-type adjustment on the level ˛ from Section 5, even on combiners with high False Positive Rates such as the Fisher combiner.This works well on data of relatively low dimensions, but quickly becomes too restrictive with higher dimensional data.An alternative based on Benjamini & Hochberg's protocol has been proposed, and shown to improve on the Bonferroni type adjustment in terms of False Negative Rate on higher dimensional data.However the improvement it achieves is moderate if the data becomes very high dimensional.
At this stage, we have results for each patient in the dataset, but need to derive group-wide graphs.As a next step, subject-specific results are aggregated to produce group-wide results using techniques borrowed from conjunction analysis.Two thresholding methods are put forward to perform this step, which produce results that converge as the threshold is raised and efficiently remove the need for FDR control beforehand.These are applied to an EEG data set of unmedicated schizophrenia patients, separated into groups of positive/negative syndrome types and healthy controls.Distinct group-wide graphical models are produced, in which the functional connectivity differences between positive/negative syndrome schizophrenia patients and healthy individuals are highlighted.Various levels of Â can be used to produce coloured graphs, showing how prevalent each connection is in each group.The coloured graphical model system is an efficient way of representing neurological data collected on groups of subjects, as it manages to show the level of heterogeneity present in each group of data.On the other hand, it does not give any indication as to which threshold level Â is optimal.It would be of interest to further research methods to aggregate results subjects-wide.
The data set investigated in this paper was of relatively low dimensions, with data recorded once from ten channels only.It would be highly interesting to apply the methods proposed in this paper -notably those of Section 6 on combining group-wide results -to higher dimensional data, such as those issued from larger EEGs, MEGs or fMRI.
This paper advocates the use of p-value combiners for graphical modelling in the frequency domain, and also raises questions that merit further research.Empirical evidence from simulation studies show that combiners that are able to deal with dependent data clearly outperform their counterparts using independent data in terms of false positive rate, but currently dependent versions of the Fisher p-value combiner do not exist in a form that is analytically satisfactory.The topic of developing combiners for dependent data has been covered in applications to Bio-statistics (Kost and McDermott, 2002;Chen et al., 2014;Li et al., 2014), but most of these approaches are analytically limited.Hence, there still exist a real need for p-value combiners that can effectively operate on dependent data, and especially on spectral data.

Fig. 4 .
Fig. 4. Power spectrum of channel T3 recorded in patient # 1 of the "Positive" set, estimated as described in Section 8.1, at various stages of the pre-processing protocol detailed in the same section.The dashed lines in the right-most plot represent the Delta band.Note the spike around 10 Hz in the left-most panel, corresponding to the alpha rhythm.

Fig. 8 .
Fig. 8. Bootstrapped average of the Hamming distance computed for group 'P' and 'C' (dotted line) and 'N' and 'C' (solid line) as a function of threshold Â, after the significant proportions {PI jk } were computed by a Stepdown procedure (left) and B&H protocol (right).

Table 1
DoCs (Definition 7.1) of the ten VAR(1) models, split into groups 1 and 2, used in the simulation study.

Table 2
Bootstrapped means, standard deviations and 95% Confidence intervals of the Hamming distances, for each group in the EEG study.