Construction and comparison of gene co-expression networks shows complex plant immune responses

Gene co-expression networks (GCNs) are graphic representations that depict the coordinated transcription of genes in response to certain stimuli. GCNs provide functional annotations of genes whose function is unknown and are further used in studies of translational functional genomics among species. In this work, a methodology for the reconstruction and comparison of GCNs is presented. This approach was applied using gene expression data that were obtained from immunity experiments in Arabidopsis thaliana, rice, soybean, tomato and cassava. After the evaluation of diverse similarity metrics for the GCN reconstruction, we recommended the mutual information coefficient measurement and a clustering coefficient-based method for similarity threshold selection. To compare GCNs, we proposed a multivariate approach based on the Principal Component Analysis (PCA). Branches of plant immunity that were exemplified by each experiment were analyzed in conjunction with the PCA results, suggesting both the robustness and the dynamic nature of the cellular responses. The dynamic of molecular plant responses produced networks with different characteristics that are differentiable using our methodology. The comparison of GCNs from plant pathosystems, showed that in response to similar pathogens plants could activate conserved signaling pathways. The results confirmed that the closeness of GCNs projected on the principal component space is an indicative of similarity among GCNs. This also can be used to understand global patterns of events triggered during plant immune responses.


Absolute value of the Pearson Correlation Coefficient (APCC)
The Pearson Correlation Coefficient (PCC) is defined by (Soranzo et al. 2007): , ∑ ∑ ∑ . 1 In equation S.1, ρ(X i ,X j ) denotes the PCC. x i (l) denotes the expression values of gen i across l=1,…,p samples and denotes the mean of expression.
In this work, the absolute value of the PCC was used as similarity measure (Zhang and Horvath 2005): , , . 2

Non-linear Correlation coefficient based on Mutual Information (NCMI)
In order to calculate the Mutual information (MI), an estimation of entropy for discrete random variables is required (Daub et al. 2004). Equation S.3 describes MI in terms of entropy and equation S.4 describes the entropy of a random variable in terms of a probability measure: . 4 In equation S.4, I(X i ,X j ) denotes the MI between variables. H(X i ) denotes the entropy of a random variable X i and p(x i ) its probability measure. k i denotes the discrete random variable's bin or subinterval. B i denotes the bin index vector denoting the bin's number where p(x ki ) is estimated.
In this work, the random variables were categorized in n b bins using the Global Equal Width algorithm (Meyer et al. 2008), where n b was assessed by the Sturges' rule: n b = 1+log 2 (p) (Scott and Sain 2005). To estimate the entropy, the estimator proposed by Hausser (2006) was selected. MI was calculated using the R (R Development Core Team 2011) minet library (Meyer et al. 2008).

Normalized Mean Residue Similarity (NMRS)
This metric is defined by (Mahanta et al. 2012): In equation S.6, x i (l) denotes the expression values of gen i across l=1,…,p samples. x i and σ i denotes the mean and the variance of expression.

The method for threshold similarity selection
The Arabidopsis curve in Fig. 2 indicated that the methodology proposed by Elo et al. (2007) is not suitable for networks where the expected clustering coefficient of the random network is higher than the clustering coefficient based on the constructed network . An adaptation of the method used for the threshold similarity selection was proposed (see equation 4).

The algorithm to simulate GCNs
The steps to evaluate the accuracy of equation 4 for similarity threshold selection were the following: i) The M-GCN for Arabidopsis was constructed using the threshold given by equation 4. The coefficient of variation of node degree , the clustering coefficient and the degree distribution ~ from Arabidopsis M-GCN were obtained.

ii)
Three groups of 100 SNs were created using R (R Development Core Team 2011) igraph library (Csardi and Nepusz 2006). Each group of SNs had the following properties: a. Group I: Same , and degree distribution of the Arabidopsis M-GCN. b. Group II: Lower than Arabidopsis M-GCN, same and degree distribution. c. Group III: Higher than Arabidopsis M-GCN, same and degree distribution.
Note that the number of nodes (n) and number of edges (n E ) of SNs were not fixed.

iii)
A similarity matrix (S nxn ) was created for every SN. First, a total of n 2 similarities (s i,j ) following a normal distribution were assessed forming a similarity array. Then, the array was sorted in descending order. The first n E higher similarities were randomly assigned to pairs of nodes connected in the SN. Consequently, a theoretical threshold * was found when all the connected pairs of genes had a similarity assigned. In other words, the threshold * is the n E -th element of the ordered similarity array. Remaining similarities lower than * were assigned randomly to non-connected pairs. iv) Using equation 4 a calculated threshold * was obtained. The suitability of our method was verified comparing graphically * vs. * in each group. The absolute difference η was calculated for each group (equation S7). * , * 7 In equation S7, is the absolute difference between the theoretical thresholds * and the calculated thresholds * across the simulated networks.

Evaluation of the method for threshold similarity selection
The adapted method exhibited extraordinary precision for the similarity threshold selection (see Online resource 8: Fig. S4). For all three of the SN groups, the observed threshold * was close to the theoretical threshold * . The performance is acceptable even for those SNs with higher or lower than in the Arabidopsis M-GCN.
Especially for the low (group 2), the SN properties are comparable to those of the M-GCNs from rice, soybean, tomato and cassava. Those for the M-GCNs from rice, soybean, tomato and cassava were relatively close to the for the SNs from group 2 (see Online resource 3: Table S2). Because the SNs in group 2 had the lowest and the lowest associated absolute error η 1.43 , we inferred that our methodology performs better for scale-free networks, such as those of rice, soybean, tomato and cassava.
Although the absolute error increased with a higher (group 1 and 3), the performance of this method was not affected. For that reason, the methodology proposed here could be used for networks with different topologies, not only those that are similar to the Arabidopsis M-GCN.
Considering these results, the absolute value of the differences between and is also suitable for the threshold selection. We applied this method (equation 4) for the entire threshold selections performed in this work.

Characterization of GCNs
The definition and equations of eight variables that were used to characterize GCNs are stated as follows:

i. Clustering coefficient
The clustering coefficient is the same referenced in methods section. It was calculated by equation 1.

ii. Network density
The network density is defined as the mean off-diagonal adjacency. It is a measure of cohesiveness (Horvath and Dong 2008): ∑ ∑ , 1 . 8 In equation S.7, n denotes the number of nodes and , denotes the adjacency between nodes i and j.

iii. Centralization
The centralization is a measure of how central is the most central node in relation to the centrality of all nodes. Consequently, centralization is a score based on nodes centrality (Freeman 1979). Here we used the betweenness centrality measure, defined by the number of shortest paths in the graph that pass through the node, divided by the total number of shortest paths (Brandes 2001). The shortest path between two nodes is a path that interconnects both nodes with the minimum number of edges. The betweenness centrality B k for a node l is (Freeman 1979): In equation S.8, , , denotes the number of shortest paths between nodes i and j that pass through node l. , denotes the number of shortest paths between nodes i and j.
And the GCN's centralization C B was calculated with (Freeman 1979): In equation S.9, denotes the maximum centrality of a node in the GCN. Here C B is normalized by the network's size.

iv. Heterogeneity
The network heterogeneity is also known as coefficient of variation of node degree. It measures the variance of connectivity across the nodes (Horvath and Dong 2008): In equation S.10, denotes the number of neighbors of gen i or node degree.

v -vi. Assortativity coefficients using different genomic data
The assortativity coefficient is a measure of how much the nodes link to other with similar characteristics (Newman 2002). Nodes are characterized based on either categorical or continuous variables. Thus, nodes share categories or values from the variables. The assortativity coefficient for categorical variables is defined by (Newman 2002): In equation S.11, denotes the fraction of edges connecting nodes with categories c and d; ∑ and ∑ . The assortativity coefficient assumes values from -1 (dissortative network) to 1 (assortative network).
Here we calculated two assortativity coefficients using the following categorical variables, one per assortativity coefficient:  Gene ontology (GO) annotations: They were downloaded from Phytozome (http://www.phytozome.net) and Gramene (http://www.gramene.org) using the Biomart database system. Genes were characterized by one or various GO IDs.
 PFAM annotations: They were obtained from the same public databases as GO annotations. Genes were characterized by one or various PFAM ids.

vii. Tolerance to attacks
We followed a common methodology to evaluate the robustness of GCNs to perturbations, here referred as tolerance to attacks, from a topological perspective (Albert and Barabasi 2001). We started from the original GCN and those nodes with the highest degree were removed sequentially, simulating an attack. To evaluate the tolerance, the average path length ( ) was measured as a function of the fraction of nodes removed (f) from the GCN. The tolerance to attacks was defined as the critical fraction (f c ) at which is maximum, just before that GCN breaks into isolated clusters (Albert and Barabasi 2001

viii. Correlation between node degree and presence of immunity domains
We attempted to find the dependence between node degree and presence of typical domains found in the immunity proteins. Initially, a reference dataset of genes encoding proteins involved in defense was assembled for each plant. The canonical immune protein domains (WRKY, TIR, NBS, LRR, kinase and LysM) were downloaded from PFAM (http://pfam.sanger.ac.uk/). HMMsearch was used to find these domains in the proteome of each plant (Finn et al. 2011). Those genes encoding proteins with unique domains TIR, NBS or LysM, and also genes encoding multiple different immune domains were included in the dataset. For Arabidopsis, this dataset was complemented with immunity related genes from a protein-protein interaction network (Mukhtar et al. 2011). Afterwards, the presence or absence of immunity domains was obtained through the binary variable b: The correlation between degree of nodes and b was calculated with the Kendall's tau non-parametric correlation (Kendall and Gobbons 1990).

Supplementary analysis of S-GCNs in the PCs space
The correlation circles of both planes show the correlations among variables ( Fig. 4c and d).
Heterogeneity has a negative correlation with other topological variables, which could be explained by the fact that high heterogeneity is a property of less dense, non-clustered and noncentralized GCNs that have a topology comparable to that of random networks.
The positions of S-GCNs in the planes (PC1-PC2, PC1-PC3) can be explained by the contribution of each variable to the PCs. In quadrant II of the PC1-PC2 plane (Fig. 4a), we found S-GCNs that were influenced by high clustering coefficients, centralization and density. In quadrant III, the separation of S-GCNs was not strong, but there were networks with high assortativity coefficients from GO and PFAM.
The networks that were projected in quadrants I and VI have opposite properties from those networks in quadrants II and III (Fig. 4a). In quadrant IV, we found less clustered, dense and centralized networks that also had high heterogeneity. Quadrant I grouped a small number of differentiated networks that have negative assortativity coefficients and high values of heterogeneity.
In relation to the PC1-PC3 plane (Fig. 4b), the S-GCNs with high tolerance to attacks and immunity degree dependence were placed in quadrants I and II. The networks in quadrants III and IV have opposing properties. From this analysis, we conclude that the S-GCN differentiation by assortativity coefficients is not as good as the differentiation by other variables.