Estimating a network from multiple noisy realizations

Complex interactions between entities are often represented as edges in a network. In practice, the network is often constructed from noisy measurements and inevitably contains some errors. In this paper we consider the problem of estimating a network from multiple noisy observations where edges of the original network are recorded with both false positives and false negatives. This problem is motivated by neuroimaging applications where brain networks of a group of patients with a particular brain condition could be viewed as noisy versions of an unobserved true network corresponding to the disease. The key to optimally leveraging these multiple observations is to take advantage of network structure, and here we focus on the case where the true network contains communities. Communities are common in real networks in general and in particular are believed to be presented in brain networks. Under a community structure assumption on the truth, we derive an efficient method to estimate the noise levels and the original network, with theoretical guarantees on the convergence of our estimates. We show on synthetic networks that the performance of our method is close to an oracle method using the true parameter values, and apply our method to fMRI brain data, demonstrating that it constructs stable and plausible estimates of the population network.


Introduction
Networks provide a natural way to model many complex systems, and network data are increasingly common in many application fields. Statistical network analysis to date has to a large extent focused on the case of observing a single network, without noise, and analyzing the observed network in order to learn something about its structure, for example, find communities. The problem of community detection in particular, in a single noiseless network, is very well studied and understood by now (see [12,10,1] for reviews of this topic and [2,8,21,11] for some of the many important recent developments). A lot of effort in this field has focused on the analysis of exchangeable networks, where any permutation of nodes results in the same distribution of the edges [5,16,9,29].
In this paper, our focus is on applications where multiple noisy realizations rather than a single network are available, much like an i.i.d. sample in classical multivariate analysis, except our observations are networks rather than vectors. The particular application that motivated this work is neuroimaging, where a network of connections in the brain is constructed separately for each subject, and there is a sample of subjects available, e.g., people suffering from a mental illness. Nodes in this context correspond to locations or regions of interest in the brain, and connections between nodes are measured in various ways depending on the technology used. Here we focus on data from resting state fMRI brain imaging [33,34], where time series of blood oxygen levels are recorded at multiple voxels in the brain while the subjects "rest" in the fMRI machine (see more details on the data in Section 4.2). Inferring connections between nodes from this type of data invariably involves a lot of preprocessing (registration, background subtraction, normalization, etc.), and is typically measured by computing Pearson correlations between the processed time 1 series for each pair of nodes, although arguments have also been made for using partial correlations and more generally Markov random fields [25,26].
However the connections between nodes are computed, they are then frequently thresholded in order to obtain a connectivity matrix with binary entries, from which various network summaries such as the average degree and the clustering coefficient can be computed and averaged over the sample to characterize the population [6,4,14,17]. However, these one-number summaries necessarily result in loss of information, and one may want to learn more about the prototypical brain network for a population of patients beyond onenumber summaries, for instance, finding regions of the brain consisting of similar voxels in terms of functional connectivity and comparing them to healthy controls, or else comparing levels of functional connectivity within known anatomical regions. A natural question to ask then is how to estimate a population network adjacency matrix A (an n × n matrix where A ij = 1 if there is an edge between node i and node j, and A ij = 0 otherwise) from a sample of noisy observations, with noise resulting from both preprocessing and natural individual variations. In other words, we postulate the question of how to compute the "mean" from a sample of N independent noisy realizations A (1) , A (2) , ..., A (N ) of an unknown underlying adjacency matrix A, while respecting and ideally taking advantage of the network structure of the problem instead of simply averaging the observed matrices. We next introduce basic notation to focus the discussion. Since the underlying true A is binary and so are the observations, the noise in each entry of A can only be present in the form of false positive and false negative edges. We assume that the entries of A above the diagonal are generated independently of each other (an assumption that certainly simplifies reality but enables analysis that has been found to give useful practical results in much of previous literature on networks), and that the noise is independent of A. Let P be the n × n symmetric matrix of false positive probabilities, and Q the n × n symmetric matrix of false negative probabilities. That is, for each 1 ≤ m ≤ N and i < j, if A ij = 1 then A ji , and diagonal entries of A (m) are set to zero, though the latter is not important. In other words, true edges A ij = 1 are randomly removed with probabilities Q ij while non-edges A ij = 0 are randomly replaced with false edges with probabilities P ij . For identifiability, we assume that all entries of P and Q are less than 1/2.
In principle, each entry A ij of the underlying true network A can be estimated separately from the corresponding entries A (m) ij , 1 ≤ m ≤ N . However, this naive approach does not take advantage of any potential structure in A, and we know that real networks typically have a lot of structure, so we can expect to gain efficiency by estimating the entries of A jointly. More specifically, we assume that the structure in A takes the form of communities, frequently encountered in many real-world networks in general and in brain networks in particular [30]. We will model this structure in A through one of the most commonly used network community models, the stochastic block model (SBM) [13]. The SBM is a simple and easily tractable model which can also serve as a basic building block in approximating a much larger family of network models, much in the same way that a step-wise constant function can be used to approximate any smooth function [29]. Making this assumption about A allows us to borrow information among edges while retaining the flexibility to fit a wide range of network data.
The SBM assumes that the network is generated by first drawing a vector of node labels c ∈ {1, ..., K} n from a multinomial distribution with parameter π = {π 1 , . . . , π K }. The number of communities K is often assumed to be known, or can be estimated by using one of several methods now available [7,36,20]. Edges between pairs of nodes i, j are then drawn independently with probability P (A ij = 1) = B c i c j , where B is a K × K matrix of within and between communities edge probabilities. Following the literature, we condition on c and treat c as a fixed unknown vector from this point on. Community detection under the SBM has been studied intensively in the last decade and many methods are available by now, e.g. [28,5,2,19,21], and many others.
We make a further assumption that the expectation W = E A of A and the noise probability matrices P and Q share the same block structure. That is, if c i = c i ′ and c j = c j ′ then P ij = P i ′ j ′ and Q ij = Q i ′ j ′ . In other words, edges between nodes with the same patterns of connectivity are subject to the same noise levels, or for the SBM one can also think of this assumption as the probability of making an error about an edge being a function of the probability of that edge existing. In many biological contexts, it is plausible to assume that the probability of a false negative is higher when the probability of an edge is small, it is harder to detect, and conversely for an edge with high probability, the probability of a false negative might be low.
The main contribution of this work is an algorithm to estimate the true unobserved "population" adjacency matrix A by taking advantage of the community structure in both the network and the noise. The algorithm works by first estimating the community structure of A from an initial naive estimate, using an existing method such as spectral clustering or pseudo-likelihood [2]. Then the estimated community structure is used in an EM-type algorithm to update the estimate of A and the parameters of interest. Results in Section 4 show that our method performs well on both simulated data and functional connectomics brain data [33,34]. The method is computationally very efficient because we can leverage existing fast algorithms for community detection in the first stage and the EM algorithm in the second stage only involves simple updates which converge quickly. More complicated models of the relationship between the network and the noise are certainly possible and are left to future work, but even with this simple model we demonstrate conclusively that "network-aware" analyses of samples of networks, as opposed to "massively univariate" analyses that vectorize the adjacency matrices and ignore their network structure, are needed to take full advantage of the network nature of the data.
The problem we consider in this paper shares some similarity with the problem of estimating the edge probability matrix from independent network observations A (1) , ..., A (N ) studied in [32,35]. Assuming that A (1) , ..., A (N ) are identically distributed and E A (1) is of low rank, the authors of [32] estimate E A (1) by a low rank approximation H of N −1 N m=1 A (m) . In [35], the authors model the entrywise logit of E A (m) as the sum of a baseline matrix Z and an individual-specific matrix D m and propose a spectral method to estimate them. Note that in our setting, Since entries of P and Q are less than 1/2, in principle one can threshold entries of H or the estimate of Z at 1/2 to obtain an estimate of A. However, these are not good estimates because (i) E A (1) is not a low-rank matrix, (ii) they are not designed specifically for estimating a binary matrix, and (iii) estimates of P and Q are required for a noise-dependent threshold. Therefore the problem of estimating a binary network must be treated differently, and it is the main focus of this work.

Optimal estimates and the role of noise
We start by deriving two estimates of A when parameters W, P, Q are known: a maximum likelihood estimate and an estimate based on likelihood ratio tests. These are not practical estimators, but since they are optimal in terms of estimation error and test power, it is instructive to understand their behavior as a function of noise level. We will also use these estimators as oracle benchmarks for comparisons, and as steps in deriving the EM algorithm presented in the next section.
When W, P, Q are known and the only unknown is the underlying matrix A, treated as fixed, we can estimate each entry A ij independently, since the only source of randomness is independent noise. To simplify notation, we fix a pair (i, j) of nodes and denote a = A ij , 2.1. Maximum likelihood estimation. The likelihood of a given the data a 1 , ..., a N is Up to a constant, we can write the log-likelihood as Since a can only take on values of 0 or 1, the estimate will be determined by the sign of the multiplier of a in (2.1). Therefore the maximum likelihood estimator of a is To understand how the optimal estimate a * depends on the noise, consider the estimation error of a * , which has the form P(a * = a) = w P(s < µ|a = 1) + (1 − w) P(s ≥ µ|a = 0). (2. 3) The probabilities are binomial: conditional on a = 1, s is Binomial(N, 1 − q), and conditional on a = 0, s is Binomial(N, p). Since the threshold t depends on p, q, and w as well, the dependence of the error on all these parameters is somewhat complicated, but straightforward to compute. Figure 1 shows the error P(a * = a) as a function of p and q. When p or q increases and all other parameters are fixed, the estimation error increases. This observation is confirmed by the following lemma (the proof is given in Appendix A).
Lemma 2.1 (The role of noise). The estimation error P(a * = a) defined by (2.3) is an increasing function of p and q.

2.2.
Likelihood ratio tests. An alternative approach for estimating a when all parameters are known is to perform a test. Unlike maximum likelihood estimation, testing allows us to explicitly control the false discovery rate, which is often important in practice. Consider the null hypothesis a = 0 and the alternative hypothesis a = 1. Under the null, s = N m=1 a m follows Binomial(N, p); under the alternative, s follows Binomial(N, 1 − q). For a given confidence level α, let T α be a likelihood ratio test with critical value k α ∈ N that accepts the null if s < k α and accepts the alternative if s > k α ; when s = k α , it accepts the alternative with a certain probability adjusted to achieve the level α. The power γ α of the test T α is then also a function of α. Since P {T α rejects the null} = αP {a = 0} + γ α P {a = 1} = α(1 − w) + γ α w, P {T α falsely rejects the null} = αP {a = 0} = α(1 − w), the false discovery rate ξ α of T α can be computed by We state a property of ξ α that we will use to control the false discovery rate (the proof is given in Appendix A).
Lemma 2.2 (Monotonicity of false discovery rate). Consider the likelihood ratio test T α and the false discovery rate ξ α of T α defined by (2.4). If p, q ≤ 1/2 then ξ α is an increasing function of α.
It is easy to see that ξ α takes values from zero to 1 − w (γ α tends to one as α tends to one). Therefore for a fixed ξ ∈ (0, 1 − w), by the monotonicity of ξ α , we can estimate a by performing T α with α being the unique solution of equation (2.4). Figure 2 shows the power γ α of T α as a function of p and q when N = 10, w = 0.2 and the false discovery rate is fixed at ξ α = 0.05. We can see that γ α decreases when either p or q increases and the other is fixed. Also, γ α is close to one when both p, q are small and γ α is close to zero when both p, q are large.
In Section 3 we estimate unknown parameters via the EM algorithm and use the estimates as plugins for the unknown parameters to perform likelihood ratio tests.

The estimation algorithm
We now return to the situation of unknown parameters, as will be the case in practice, and derive an algorithm to estimate all of A, W , P and Q. We initialize the algorithm with an estimate of the block structure shared by W , P and Q, obtained from the matrix S = N m=1 A (m) . Then we apply the EM algorithm to estimate submatrices of A, W , P and Q associated with the blocks determined in the previous step.
3.1. Estimating the block structure. LetÂ = (Â ij ) be the n × n matrix with entrieŝ A ij = 1(S ij ≥ N/2). Under the assumption that entries of P and Q are at most 1/2, without which the problem becomes unidentifiable, the matrixÂ is a consistent estimate of A. We can estimate the block structure of W by applying a community detection algorithm, such as spectral clustering [2,15,22] or the pseudo-likelihood method [2], to the initial estimateÂ. We will assume that the number of communities K is known, as is usually done in network literature, or alternatively K can be first estimated by one of several methods available [7,36,20]. Having estimated the block structure, we condition on the node labels and treat them as known, which means that the entries of W , P , and Q are constant within each estimated block. With this assumption in mind, we derive the EM algorithm to recover the sub-matrix of A corresponding to each estimated block.
3.2. The EM algorithm when node labels are known. In this section, we derive an estimate of A using the EM algorithm assuming that the vector of node labels c is known. To get the final estimate of A, we will replace c with an estimate from Section 3.1.
Recall that A is generated from a block model with K communities. Therefore W = E A is a symmetric matrix with K 2 blocks (determined by c), with equal entries within each block. To focus on one such block, we fix k, l ∈ {1, ..., K} with k = l (the case k = l is treated similarly) and consider the (k, l) block according to c: By assumption of shared block structure, restrictions of W , P and Q to J are matrices of constant entries, values of which we denote by w, p, and q, respectively. Thus, W ij = w, P ij = p and Q ij = q for all (i, j) ∈ J. The likelihood of A J and A (m) J -restrictions of A and A (m) to J -takes the form Adding up the log-likelihoods of the independent A (m) J , 1 ≤ m ≤ N , and grouping the terms with (i, j) ∈ I r , we obtain Hereafter, we use |R| to denote the cardinality of a set R. Taking the conditional expectation of the log-likelihood given the data, The M-step involves finding estimates of w, p, q that maximizeL. They are unique because it is easy to see thatL is concave in w, p, q. The partial derivative ofL with respect to w has the form Setting the derivative to zero yields an estimateŵ of w: Similarly, the estimates of p and q take the form Since τ r 's are unknown, we initialize by majority voteτ r = 1(r ≥ N/2). The Bayes rule gives Therefore onceŵ,p andq are computed, we can updateτ r bŷ The EM steps are then iterated until convergence.
3.3. The complete EM algorithm with unknown labels. In Section 3.2 we assume that c is known and derive the EM algorithm for estimating A. Since c is unknown in practice, we first compute its estimateĉ usingÂ as described in Section 3.1. We then repeat the following steps until convergence: (i) treatĉ as the ground truth and estimate A by applying the EM algorithm described in Section 3.2 for each block, (ii) updateĉ using the new estimate of A.
Recall that S is the sum of observations A (m) , 1 ≤ m ≤ N , andÂ is the matrix with entriesÂ ij = 1(S ij ≥ N/2). We initialize an estimateĉ of community labels c by applying an existing clustering algorithm onÂ. Although we can choose any consistent clustering algorithm, for concreteness, we will use spectral clustering. Similar to Section 3.2, we fix k, l ∈ {1, ..., K} and consider the (k, l) block according toĉ: Within blockĴ, we estimate entries of W , P and Q byŵ,p andq, respectively, and compute them as follows. DenoteÎ r = {(i, j) ∈Ĵ : S ij = r}. Initializeτ r = 1(r ≥ N/2) and repeat T times: (1) Computeŵ,p, andq byŵ (2) Using current estimatesŵ,p, andq, update the posteriorτ r (3) Return to step (1) unless the parameter estimates have converged. (4) Update theĴ block ofÂ byÂ ij = 1{τ r ≥ 1/2}. (5) Update the label estimateĉ by applying spectral clustering on currentÂ. In practice, we only need a few updates ofĉ to obtain a reasonable result. The EM updates in steps (1) − (3) also converge fast given a good estimate of the community label. For all simulations in Section 4, we set T = 2 and the number of EM iterations to be 20.

3.4.
A theoretical guarantee of convergence. We focus our theoretical investigation of convergence properties on the case T = 1. Before stating the result, we need to introduce further notation. Recall thatĉ is the estimate of the label assignment c output by spectral clustering. Following [15], we measure the error betweenĉ and c by where the minimum is over allc obtained from c by permuting labels of c.
where h −1 is the inverse of h and the graph of h −1 is shown in Figure 8 in Appendix B.
A simple analysis shows that h is an increasing function, Denote by w, p and q the common values of entries of W , P and Q on J, respectively. Further, let θ = (w, p, q) T and θ t be the estimate of θ after repeating t times steps (1)- (3). Assume that θ ∈ R δ and where C is a sufficiently large constant. Then with probability at least 1 − exp(−r), The error bound in Theorem 3.1 depends critically on δ (note that φ(x) tends to zero as x → 0). As δ becomes smaller, i.e. (w, p, q) T gets closer to the boundary of the set (0, 1) × (0, 1/2) 2 , the problem of estimating parameters becomes harder, and therefore a larger sample size is required.
The error bound consists of three terms. The first term goes to zero exponentially fast as t → ∞ when N is sufficiently large. The second term depends on the error γ(ĉ, c) of estimating communities, which is essentially proportional to the inverse of the expected node degree of A when the community signal is sufficiently strong. This can be easily shown using existing results on community detection (see e.g. [23]), and we do not develop this further in this paper. The last term is a statistical error of order O(K min{ √ N /n, 1/n + exp(−N )}) if all communities are of similar sizes. The consistency of estimating the original network A follows easily from the consistency of parameter estimation. The proof of Theorem 3.1 is given in the Appendix B.

Numerical results
In this section we empirically compare performance of several estimators of A: the "naive" majority-vote estimateÂ described in Section 3.1 (MV), which estimates each entry of A separately; the EM estimate we proposed in Section 3.3 (EM); and, in simulations, the oracle estimate described in Section 2.1 which uses known parameter values of W , P , and Q (OP). To control the false positive rate, we also consider variants EM [T] and OP[T] of EM and OP. Assuming that parameters are known, OP[T] estimates A by performing the likelihood ratio test on each entry of A, as discussed in Section 2.2. EM[T] first estimates parameters by EM and then plugs them in for true parameters to perform the likelihood ratio test. We set the false discovery rate to be 0.05 for both EM[T] and OP [T].
As discussed in the Introduction, one can obtain an estimate of A by thresholding at 1/2 entries of the low-rank estimate of E A (1) proposed by [32]. However, it is not a good estimate of A and in fact produces very large errors, on a different scale from all other methods; therefore we omit it from comparisons in order to be able to plot all the other errors together at an appropriate scale.
For our main algorithm (described in Section 3.3), we set the number of outer loops to T = 2 and the number of EM iterations to 20. This means the algorithm first estimates the community structure usingÂ as the input. Once node labels are computed, it estimates all parameters of the model, including the posteriorτ , by running 20 iterations of the inner loop. The posteriorτ is then thresholded to obtain an estimate of the original network A. This estimate is used in the second run of the outer loop to update the node labels and subsequently re-estimate all parameters and A.
We first test the methods on synthetic networks and then apply them to brain fMRI data, the motivating example discussed in the Introduction. To initialize EM, we use regularized spectral clustering [2,15,22] to estimate the community labels. LetÂ reg = A + 0.5n −1 11 T , D = diag(Â reg 1) and L = D −1/2Â reg D −1/2 . We first compute the K eigenvectors of L that correspond to its K largest eigenvalues. We then apply the K-means algorithm on row vectors of the n × K matrix obtained by stacking the K eigenvectors together to find the community labels. The K-means algorithm is implemented via Matlab function kmeans and is run with 20 iterations.
The performance of all estimators is measured by the false discovery rate (FDR) and the true positive rate (TPR). For an estimateÂ of A, FDR and TPR are defined as For each method, we also report the overlap 1 − γ(ĉ, c) between community assignmentsĉ and c, where γ(ĉ, c) is defined by (3.4) andĉ is computed by applying regularized spectral clustering on the estimate of A produced by that method. Note that γ(ĉ, c) can be greater than one; in that case we set the overlap to zero. Finally, we report the errors in estimating the false positive, false negative and edge probabilities of EM and MV (the corresponding errors of EM[T] are very similar to that of EM and therefore omitted). For EM, we measure the errors by directly computing the ratios of Frobenius norms: For MV, we first estimate edge probabilities in each block specified byĉ by the average number of non-zero entries ofÂ in that block and then compute the Frobenius norm errors defined above for W . To estimate P and Q from MV, for each pair of nodes (i, j), ifÂ ij = 0 then we estimate P ij byP ij = S ij /N ; ifÂ ij = 1, we estimate Q ij byQ ij = 1 − S ij /N . We measure the errors of estimating P and Q by the Frobenius norm ratios computed separately over the zero and non-zero enries ofÂ: We first test the performance of the estimates on a simple example of a sample of networks with shared community structure. We generate the adjacency matrix A from a SBM with n = 300 nodes and K = 3 communities of 100 nodes each. We parameterize the 3 × 3 matrix B of within and between communities edge probabilities of this SBM as The parameter ρ w controls the overall expected node degree of the model while β w specifies the ratio of between-community probability of edge to within-community probability of edge. The smaller β w , the easier the problem of community detection; conversely, the larger ρ w indicates more observed edges and therefore an easier community detection problem. Note, however, that the difficulty of the community detection problem does not directly translate into the difficulty of estimating the underlying true A, which is also influenced by P and Q.
We similarly parameterize the 3 × 3 noise matrices P and Q of within-and betweencommunities false positive and false negative probabilities as Thus, P ij = P c i c j and Q ij = Q c i c j for 1 ≤ i, j ≤ n. The overall numbers of false positive and false negative edges are controlled by parameters ρ p and ρ q , respectively. The relative prevalence of false positives and false negatives between communities compared to within communities is controlled by parameters β p and β q , respectively. Thus, if A ≡ 0 (a network with no edges) then the average degree of a noisy realization of A is ρ p (1 + 2β p )(n − 1); if A = 11 T − diag(1) (a fully connected network), then the average degree of a noisy realization of A is (n − 1) − ρ q (1 + 2β q )(n − 1).
In order to focus attention on the relative performance of various methods dealing with a noisy sample of networks, we will use the true number of communities in simulations, K = 3. When the number of communities is not known, it can be estimated fromÂ in the first stage by several methods [7,36,20], which have been shown to provide accurate results when K is relatively small compared to n. Alternatively, one could use a larger K and interpret the stochastic block model fit as a histogram approximation to the network rather than the true model, as was argued in [29].
The performance of all methods -majority vote (MV), our proposal (EM, EM[T]) and oracle parameters (OP, OP[T]) -is shown in Figures 3, 4 and 5. In all cases, n = 300, K = 3, community sizes are equal, ρ w = 0.15, ρ q = 0.2, ρ p = 0.25, the target FDR is set to 0.05, and all results are averaged over 100 replications. To see the effect of structured versus unstructured noise, we consider three different settings where we fix two of the parameters β w , β p , β q and let the third one vary. In Figure 3 the out-in ratios β p = β q = 1, meaning that P and Q do not have any community structure and all entries of A are equally likely to be flipped to the opposite. When β w is not too close to 0 or 1, community labels and parameters of SBM are accurately estimated, EM performs similarly to the oracle and has a much smaller FDR (essentially equal to the target of 0.05) than MV. In contrast, when β w is close to 0 or 1, EM does not estimate all parameters of SBM accurately; however, it still provides a reasonable estimate of A. When likelihood ratio tests are used, both EM[T] and OP[T] output estimates with stable FDR close to the target 0.05, although the FDR of EM[T] is slightly larger due to the errors from parameter estimation. In most cases, MV has large FDR and TPR, which indicates that it estimates A has many more edges than it really does. Compared to EM and EM[T], MV also has larger errors in estimating false positive and false negative probabilities. All methods perform fairly similarly in recovering communities, with MV being the least accurate and OP[T] the most accurate. Figures 4 and 5 show the effect of false positive and false negative edges when one of parameters β p , β q is set to 1 and the other varies. Again, EM and OP perform similarly when β p , β q are not too large and community labels can be accurately estimated. EM also has much smaller FDR than MV in all settings. Both EM[T] and OP[T] have stable FDRs, close to the target of 0.05, but at the expense of lower TPR as β p or β q increases.
Overall, as one would expect, all methods perform better as the sample size N increases, β w , β p and β q decrease, and the community structure becomes stronger. EM and OP perform very similarly and provide better FDR than MV in all settings, especially when N is small. EM[T] and OP[T] also perform well in controlling the FDR. These empirical results show the importance of leveraging the block structure for estimating the original network A.

Brain networks.
In this section we evaluate the performance of our proposed EM method on functional brain networks [33,34]. The data are obtained from resting state fMRI images, where blood oxygenation levels at different locations in the brain are recorded over time; these time series of oxygen levels are then preprocessed and used to compute a Pearson correlation between each pair of locations. Finally, the correlations are thresholded to construct a binary network matrix.
The dataset we analyze here includes 81 subjects, 39 suffering from schizophrenia and 42 healthy controls (see [33,34] for details on the data). The resulting correlation matrices are 264 × 264, corresponding to 264 regions of interest (ROIs) in the brain. For a given value of the threshold ν ∈ (0, 1), we construct a brain network A (m) for subject m from its correlation matrix C (m) by setting A ij = 0 otherwise. We view each network A (m) as a noisy observation of an underlying true biological network A, which differs between schizophrenics and controls. Note that the number of edges in A (m) depends on ν. In practice, there is no consensus on how to choose ν, therefore it is desirable to have a method that is accurate and stable over a large range of ν values.
Since the number of communities K is unknown, we first estimate it from the majority vote matrixÂ using a spectral method for estimating the number of communities based  on counting the negative eigenvalues of the graph Bethe-Hessian [20]. As expected, the estimated number of communities depends on the threshold value; please see Figure 6. However, there is a stable range of ν roughly between 0.2 and 0.4, and the estimated K over that range is close to 14, the number of functional regions suggested independently by [30]. To facilitate comparison with this known functional parcelation, we fit our EM-based method with K = 14 in the subsequent analysis. Figure 6 shows several global summary statistics of the estimates of A for a range of ν. Global network summary statistics have been a popular tool in the study of brain networks [31] and can be used to predict disease status, but here our focus is on how the network estimation method affects the population estimates of these summary statistics.  In particular, since there is no consensus on choosing ν, a stable estimate over a range of values of ν is desirable. Overall, the plots in Figure 6 show that the EM method produces much more stable estimates over a wider range of ν. For all statistics, the left column shows schizophrenics and the right column healthy controls. The first row shows estimated population average degree for the EM and MV methods, along with the range (from minimum to maximum value) and sample median of individual's average network degrees. Two other summary statistics shown in the second and third rows, global efficiency (the average inverse shortest path length, viewed as a measure of network functional integration) and transitivity (a normalized average fraction of triangles around  an individual node, viewed as a measure of network functional segregation that reflects the presence of communities), are also more stable over a wider range of ν for the EM method. The fourth row shows the strength of the networks estimated by EM and MV, measured by modularity optimized via spectral clustering [27]. Finally, the fifth row presents the estimated number of communities based on counting the negative eigenvalues of the graph Bethe-Hessian [20]. For almost all summaries, the estimates obtained by EM are closer to the median values obtained from the individual networks, suggesting the EM produces a more accurate population estimate, or at least one that is more representative of the sample. The exception to this general pattern occurs only at the very low values of ν, where the network is probably too dense to be informative anyway. Figure 7 shows sagittal views of the underlying networks estimated by EM and MV for the threshold parameter ν = 0.5. We use ν = 0.5 since higher ν produces sparser networks that are easier to visualize, and the network statistics are still fairly stable in that range. The plots are drawn by the brain network visualization tool BrainNet Viewer [37].

Discussion
We proposed a novel way to estimate an underlying "population" network from its multiple noisy realizations, leveraging the underlying community structure. In contrast to most of previous work (with the notable exceptions of [32,35]), our algorithm does not vectorize the network or reduce it to global summaries; the procedure is designed specifically for network data, and thus tends to outperform methods that do not respect the underlying network structure. While we focused on the stochastic block model as the underlying network structure, because of its simple form and its role as an approximation to any exchangeable network model, this assumption is not essential. An extension to the degree-corrected stochastic block model is left as future work, and we believe in practice the algorithm will work well for any network with a community structure. On the other hand, the assumption of independent noise is important and unlikely to be relaxed.
The assumption of false positive and false negative probability matrices being piecewise constant is also important, as it allows us to significantly reduce the number of parameters and estimate them using the shared information within each block, but clearly many other ways to impose sharing information are possible, perhaps through a general low rank formulation. We leave exploring such a formulation for future work.

Acknowledgements
This work was partially supported by NSF grant DMS-1521551 and ONR grant N000141612910 to E. Levina. We thank our collaborators in Stephan Taylor's lab in Psychiatry at the University of Michigan for providing a processed version of the data.

Appendix A. Estimation error
We first prove Lemma 2.1, which formalizes the intuitive fact that the estimation error is an increasing function of noise levels. Recall that for a fixed pair of nodes (i, j), s = Proof of Lemma 2.1. Denote by f = f (w, p, q) the estimation error (2.3), that is f (w, p, q) = P(a * = a) = wP(s < µ|a = 1) + (1 − w)(1 − P(s < µ|a = 0)).
We show that there exists a finite set M ⊆ [0, 1/2] such that the partial derivative ∂f /∂q is positive for all q ∈ [0, 1/2] \ M and f is continuous for all q. This clearly implies that f is an increasing function of q; the proof that f is increasing in p is similar.
Let M be the set of points q ∈ [0, 1/2] such that µ = µ(w, p, q) is an integer; this set is finite because µ is a smooth function of q. Fix q 0 ∈ M and choose an integer k so that k < µ(w, p, q 0 ) < k + 1. For any q sufficiently close to q 0 , the event s < µ is the same as s ≤ k. Since s ∼ Binomial(N, 1 − q) given a = 1 and s ∼ Binomial(n, p) given a = 0, we have P(s ≤ k|a = 1) = (n − k) N k Let us now fix q 0 ∈ M and choose an integer k such that µ(w, p, q 0 ) = k. We consider four possible cases based on the local behavior of µ near q 0 : µ reaches local maximum at q 0 , µ reaches local minimum at q 0 , µ is increasing, and µ is decreasing.
If µ reaches local maximum at q 0 then for any q sufficiently close to q 0 , the event s < µ(w, p, q) is the same as s ≤ k − 1. Using (A.1) and (A. 2) we obtain that f is continuous at q 0 . Similarly, f is continuous at q 0 if µ reaches local minimum at q 0 .
If µ is increasing near q 0 then for any q sufficiently close to q 0 , the event s < µ(w, p, q) is the same as s ≤ k − 1 if q ≤ q 0 and s ≤ k if q > q 0 . Therefore the jump of f at q 0 is h = wP(s = k|a = 1) − (1 − w)P(s = k|a = 0).
Using s|a = 1 ∼ Binomial(N, 1 − q 0 ) and s|a = 0 ∼ Binomial (N, p), a simple calculation shows that k = µ(w, p, q 0 ) is equivalent to h = 0, which implies the continuity of f at q 0 . Similarly, f is continuous at q 0 if µ is decreasing near q 0 , and the proof is complete.
Proof of Lemma 2.2. Recall that T α rejects the null hypothesis if s > k α ; when s = k α it rejects the null with some probability η α adjusted to achieve confidence level α. Since s follows Binomial (N, p) if a = 0 and Binomial(N, 1 − q) if a = 1, the confidence level and power of T α satisfy Solving for η α from the first equation and plugging it into the second equation, we obtain Note that k α is a piecewise constant function of α. On every interval of α where k α is constant, the coefficient of 1/α in the above representation of γ α /α is positive because for every m ≥ k α +1 by the assumption p, q ≤ 1/2. This implies that γ α /α is decreasing on every such interval, and in turn on the whole interval (0, 1] because γ α /α is a continuous function of α. Since the false positive rate ξ α is a decreasing function of γ α /α by (2.4), the claim of Lemma 2.2 follows.

Appendix B. Convergence of the EM algorithm
In this section we prove Theorem 3.1, establishing the convergence of our algorithm described in Section 3.3. The proof consists of two steps: showing the convergence of population-level updates (Section B.1) and bounding the error between population-level updates and sample-level updates (Section B.2).
B.1.1. Preliminaries. We first briefly recall the population-level updates of our algorithm and set up additional notation. To simplify notation, let us fix a pair of nodes (i, j) and Let f θ be the joint likelihood of s and a. Assume that f θ belongs to a parametric family {f θ ′ |θ ′ := (w ′ , p ′ , q ′ ) T ∈ Θ}, with Θ to be specified. For each θ ′ ∈ Θ, the joint likelihood f θ ′ (s, a) of s and a has the form Summing over a, we obtain the marginal likelihood f θ ′ (s) = X θ ′ (s) + Y θ ′ (s) of s. For each θ ′′ ∈ Θ, the conditional expectation of log f θ ′′ (s, a) given s takes the form The population-level update of a current parameter estimate θ ′ is computed by To computeM (θ ′ ), let us denote by F, G and L the following functions: where X θ ′ and Y θ ′ are defined in (B.2). Setting partial derivatives of T θ ′′ |θ ′ to zero, we find that the components of M (θ ′ ), which we denote by M (w ′ ), M (p ′ ) and M (q ′ ), respectively, can be computed by For p, q ∈ [0, 1/2] and ε ∈ [0, 1), define a neighborhood of (p, q) by where h −1 is the inverse function of h (see Figure 8) Lemma B.1 (Contraction of population-level updates). Let δ ∈ (0, 1/4), ε ∈ (0, 1) and M (θ ′ ) be the population-level update of θ ′ defined by (B.6). Assume that θ, θ ′ ∈ R δ , with the set R δ defined by (3.8), and (p ′ , q ′ ) ∈ U ε (p, q). Then where ϕ is defined by (3.7).
Proof. The technique used for proving this lemma closely follows [3]. For t ∈ [0, 1], let θ t = (w t , p t , q t ) T = θ + t(θ ′ − θ) and define g(t, s) = F θt (s), where s satisfies (B.1) and F θt (s) is defined in (B.5). Then M (w ′ ) − w = E(g(1, s) − g(0, s)) because M (w) = w. By the mean value theorem, for each s there exists t s ∈ [0, 1] such that To compute the partial derivative of g, note that A simple calculation shows that Since s ≤ N and θ ′ , θ ∈ R δ , it is easy to see that ψ t ≤ 3N/δ 2 . Therefore Let s 1 ∼ Binomial(N, p) and s 2 ∼ Binomial(N, 1 − q) be binomial random variables. Since s is a mixture of s 1 and s 2 with weights 1 − w and w, respectively, we have where Φ 1 and Φ 2 denote the corresponding expressions under the expectation. We now use concentration of s 1 and s 2 to bound E Φ 1 and E Φ 2 . Note that if and only if s 1 ≤ H(p t , q t )N . Therefore if s 1 ≃ pN is sufficiently smaller than H(p t , q t )N , then exp(− Z(s 1 ), η(θ t ) ) grows exponentially in N . This implies that Φ 1 is of order exp(−N ) and so is E Φ 1 .
Lemma B.2 (Tail bound for binomial distribution). If s ∼ Binomial(N, p) then for any ε ≥ 0, Proof. This is a direct consequence of Hoeffding's inequality.
B.2.1. Preliminaries. We now turn to the sample-level updates. Letĉ be an estimate of the label assignment c. Recall from (3.4) that the discrepancy betweenĉ and c is measured by where the minimum is over allc obtained from c by permuting the labels. Without loss of generality, assume that the minimum is achieved atc = c. We will focus on a single block (out of K 2 blocks) determined by c andĉ. By definition of γ(c,ĉ), we have To compare sample-level and population-level updates, for any pair of nodes (i, j) ∈ J, denote (as in Section B.1) Also, let s be a mixture of Binomial distributions defined by (B.1). Recall the populationlevel update M (θ ′ ) and its components M (w ′ ), M (p ′ ) and M (q ′ ) in (B.4) and (B.6). In the finite sample, instead of taking the expectation of T θ ′′ |θ ′ , we compute the average of T θ ′′ |θ ′ over all entries within the blockĴ . The sample-level update is then the maximizer of this average:M To computeM (θ ′ ), denotê where F θ ′ , G θ ′ and L θ ′ are defined in (B.5). Then similar to (B.6), the components of M (θ ′ ), which we denote byM (w ′ ),M (p ′ ) andM (q ′ ), can be computed bŷ Concentrations of sample-level updates. We first prove uniform bounds forÊF  1). Then for any r ≥ 0 the following hold with probability at least 1 − e −r : where the supremum is taken over R δ defined in (3.8).
Using the fact that |F θ ′ (S ij )| ≤ 1 and the definition of γ 2 (ĉ, c), we have We now focus on bounding Ψ − E F θ ′ (s). Condition on J, S ij are i.i.d. copies of a mixture of binomial distributions wBinomial(N, 1 − q) + (1 − w)Binomial(N, p). Let λ > 0 be a positive scalar and ε ij be independent symmetric Bernoulli random variables, also independent of S ij . By symmetrization (see e.g. [18, Theorem 2.1]), we have Since t → 1/(1 + e t ) − 1/2 is Lipschitz with constant one and η(θ ′ ) ≤ 2 log(1/δ − 1) because θ ′ ∈ R δ , using (B.18) and Talagrand's comparision theorem (see e.g. [24,Theorem 4.12]), we obtain Since ε ij Z T ij κ are sub-Gaussian random variables with sub-Gaussian norm at most Using Markov inequality, we conclude that with probability at least 1 − e −r the following holds: Therefore using (B.17) and a triangle inequality, we obtain that with probability at least 1 − e −r : It remains to boundÊG θ ′ , which is similar toÊF θ ′ except that G θ ′ contains an additional factor s. Proceeding the proof in the same way as forÊF θ ′ , and bound s by N when necessary, we obtain that with probability at least 1 − e −r , the following holds The proof is complete.
The following lemma provides alternative bounds to the bounds in Lemma B.3 when N is large. Note that the upper bounds of Lemma B.3 contain the factor N/|J|; they become uninformative when N is larger than |J|. This is an artifact of our proof as we use Talagrand's comparison theorem. Lemma B.4 shows that when N is large, we can directly compareM (θ ′ ) and the true parameter θ and effectively remove the factor √ N .