DCGL v2.0: An R Package for Unveiling Differential Regulation from Differential Co-expression

Motivation Differential co-expression analysis (DCEA) has emerged in recent years as a novel, systematic investigation into gene expression data. While most DCEA studies or tools focus on the co-expression relationships among genes, some are developing a potentially more promising research domain, differential regulation analysis (DRA). In our previously proposed R package DCGL v1.0, we provided functions to facilitate basic differential co-expression analyses; however, the output from DCGL v1.0 could not be translated into differential regulation mechanisms in a straightforward manner. Results To advance from DCEA to DRA, we upgraded the DCGL package from v1.0 to v2.0. A new module named “Differential Regulation Analysis” (DRA) was designed, which consists of three major functions: DRsort, DRplot, and DRrank. DRsort selects differentially regulated genes (DRGs) and differentially regulated links (DRLs) according to the transcription factor (TF)-to-target information. DRrank prioritizes the TFs in terms of their potential relevance to the phenotype of interest. DRplot graphically visualizes differentially co-expressed links (DCLs) and/or TF-to-target links in a network context. In addition to these new modules, we streamlined the codes from v1.0. The evaluation results proved that our differential regulation analysis is able to capture the regulators relevant to the biological subject. Conclusions With ample functions to facilitate differential regulation analysis, DCGL v2.0 was upgraded from a DCEA tool to a DRA tool, which may unveil the underlying differential regulation from the observed differential co-expression. DCGL v2.0 can be applied to a wide range of gene expression data in order to systematically identify novel regulators that have not yet been documented as critical. Availability DCGL v2.0 package is available at http://cran.r-project.org/web/packages/DCGL/index.html or at our project home page http://lifecenter.sgst.cn/main/en/dcgl.jsp.


Introduction
This document gives instructions on how to use the functions of DCGL 2.0 which is an advanced and upgraded version of DCGL 1.0 . DCGL 2.0 contains four modules which are Gene filtration module, Link filtration module, differential co-expression analysis (DCEA) module and differential regulation analysis (DRA) module.
In Gene filtration module, there are expressionBasedfilter and varianceBasedfilter functions to filter genes on expression microarray data. rLinkfilter, percentLinkfilter and qLinkfilter functions were wrapped in Link filtration module to filter gene co-expression links in co-expression networks. DCp, DCe, WGCNA, LRC and ASC functions were implemented in DCEA module for extracting differentially coexpressed genes (DCGs) and differentially coexpressed links (DCLs). These above functions have been accomplished into DCGL 1.0 .
In DCGL 2.0 , we attached to DCEA module a new function, DCsum, to determine a final set of DCGs and DCLs which come from multiple DCEA methods. Most importantly, we produced DRA module which contains DRsort, DRplot and DRrank for differential regulation analysis. DRsort identifies differentially regulated genes (DRGs) and differentially regulated links (DRLs) from DCsum-outputted DCGs and DCLs based on TF-to-target knowledge. DRplot visualizes DRLs and DRLs-related TF-to-target links. Function of prioritizing regulators in terms of their potential relevance to the biological phenotype was designed in DRrank. Figure 1 shows the overall design of DCGL 2.0 .
The major input of DCGL 2.0 are two expression data matrices from two contrastive conditions, where the rows and columns correspond to genes and microarrays respectively. TF-to-target regulation knowledge, which was wrapped in the package, is another required input dataset.
The DCGL 2.0 package employs R library igraph, limma, org.Hs.eg.db, which must be installed in advance.

Getting started
Prior to using DCGL 2.0 , users should download the installation file of DCGL 2.0 to their local computer, and install DCGL 2.0 as a package of their R computing environment. For Linux users, they should type 'R CMD INSTALL DCGL 2.0.tar.gz' in the shell (suppose the installation file 'DCGL 2.0.tar.gz' is in the current working directory); for windows users, they should go to the R menu 'Packages' and click the 'Install package(s) from local zip files' and then locate the local file 'DCGL 2.0.zip'. If the package is installed successfully, a file folder named 'DCGL' should appear beneath the folder 'library' in the R installation directory. To load the DCGL 2.0 package, type library(DCGL).
3 Methods DCGL 2.0 provides the pre-existing facilities for gene filtering, link filtering and D-CGs/DCLs identification of DCGL 1.0 , as well as newly added functions for DCGs/DCLs summarization, DRGs/DRLs identification, networks visualization, and regulators ranking.

Gene filtration
If there are too many genes in the expression dataset, one can filter out some genes using the expressionBasedfilter or varianceBasedfilter or both of them. expression-Basedfilter filters out a half genes that have their Between-Experiment Mean Expression Signal (BEMES) lower than the median BEMES of all genes (Prieto and etal.,2008). vari-anceBasedfilter is an approximate test of the hypothesis that gene has the same variance as the median variance (Simon and Lam,2006). The variance of the log-values for each gene is compared to the median of all the variances.The quantity quantity = (n − 1) * var i /var m for each gene is compared to a percentile of a chi-square distribution (with a degree of freedom of n − 1, n being the number of arrays) to filter out those genes not significantly more variable than the median gene.

Link filtration
For all DCEA methods but WGCNA, a link filtering step is necessary to build up two gene co-expression networks for the two contrastive conditions. The two gene co-expression networks have identical linking structures but different edge weights (co-expression values). The input to link filtering methods always includes two separate gene expression matrices for the two conditions, and the output mainly comprises two data vectors, each coming from a half of the symmetrical gene-versus-gene co-expression matrices. One can imagine that, in the intermediate co-expression matrices, retained links have non-zero values while discarded links are denoted with zero values.
Three stand-alone functions are implemented for link filtering, which are the correlation value threshold (rLinkfilter), the correlation-value fraction based link filtering (per-centLinkfilter) and the q-value based link filtering (qLinkfilter). However, these link filtering functions are seldom called as independent functions; instead, they are wrapped in the DCEA functions DCp, DCe, ASC and LRC, and can be tuned with the 'link.method' and 'cutoff' parameters.

Filtering gene links according to the correlation threshold
As an argument to the 'link.method' parameter, rLinkfilter is abbreviated to 'rth'. Each gene link is associated with two correlation values (one out of condition A and the other out of condition B); if either of the two correlation values is greater than the given correlation threshold ('cutoff'), the gene link is retained.

Filtering gene links according to the max correlation value
As an argument to the 'link.method' parameter, percentLinkfilter is abbreviated to 'percent'. Each gene link is associated with two correlation values (one out of condition A and the other out of condition B) and thus a vector of 'maximum absolute values' for all correlation value pairs is decided. Then these 'maximum absolute values' are sorted in decreasing order. At last, a fraction ('cutoff') of gene pairs with the highest max correlation values will be retained.

Filtering gene links according to the q-values of correlation values
As an argument to the 'link.method' parameter, qLinkfilter is abbreviated to 'qth'. For each of the two experimental conditions, the co-expression values are associated with the corresponding p-values (student T-test of the zero nature of a Pearson Correlation Coefficient (PCC)), and these p-values are sorted and transformed to q-values (false discovery rates). In this way, each gene link is associated with a pair of q-value, and those links with at least one q-value lower than the threshold ('cutoff') are retained.

Differential co-expression analysis
DCEA module contains five DCEA methods. DCp and DCe etal.,2011)(Liu and proposed by us, and WGCNA, ASC, and LRC were proposed by other inventors. All the methods are aimed to extract DCGs/DCLs through analysing the changes of the connections. All methods must be preceded by a link filtering step, which can be tuned with the 'link.method' and 'cutoff' parameters. After the link filtering, co-expression pairs with rth/percent/qth of co-expression values in either of two conditions higher/higher/lower than the cutoff are retained.

DCp for identifying DCGs
DCp works on the filtered set of gene co-expression value pairs, where each pair is made up with two co-expression values calculated under two different conditions separately. The subset of co-expression value pairs associated with a particular gene, in two groups for the two conditions separately, can be written as two vectors X and Y (n is co-expression neighbors for a gene).
Then a length-normalized Euclidean distance is used for measuring differential co-expression (dC) of this gene.
To evaluate whether a gene has significant dC, we perform a permutation test, in which we randomly permute the disease and normal conditions of the samples, calculate new PCCs, filter gene pairs based on the new PCCs, and calculate new dC statistics. The sample permutation is repeated N times, and a large number of permutation dC statistics form an empirical null distribution. The p-value for each gene can then be estimated.

DCe for identifying DCGs and DCLs
DCe is based on the 'Limit Fold Change' (LFC) model, a robust statistical method originally proposed for selecting differentially expressed genes(DEGs) from microarray data (Mutch and etal.,2002).
First, the correlation pairs are divided into three parts according to the pairing of signs of co-expression values and the multitude of co-expression values: pairs with same signs (N 1 ), pairs with different signs (N 2 ) and pairs with differently-signed high co-expression values (N 3 ). The "high co-expression values" are deemed based on the same correlation value threshold as in the qLinkfilter function. The first two parts are processed with the 'LFC' model separately to yield two subsets of DCLs (K 1 ,K 2 ), while the third part (N 3 ) adds to the set of DCLs directly. So a total of K=N 3 +K 1 +K 2 DCLs are determined from a total of N gene links. For a gene (g i ), the total number of links (n i ) and DCLs in particular(k i ) associated with it are counted, and the Binomial Probability model is used to estimate the significance of the gene being a DCG. (Fuller and etal.,2007;van Nas and etal.,2009), ASC (Choi and etal.,2005) and LRC (Reverter and etal.,2005) are other methods for measuring genes' differential co-expression. For more details please consult (Yu and etal.,2011;Liu and etal.,2010) (i.e. DCGL 1.0 ).

DCsum for summarizing DCGs and DCLs
DCsum, short for differentially co-expression summarization, summarizes 1) a set of DCGs, which is an intersection of DCp-derived DCGs (selected with a q value cutoff or a given percentage of dC) and DCe-derived DCGs (selected with a q value cutoff), 2) a set of DCLs, which is sifted from DCe-derived DCLs that are connected to at least one DCG determined by the first step. As a result, DCsum combines results from two different co-expression analysis methods.

DRsort for sorting out DRGs and DRLs
DRsort, the first function of DRA module, is aimed to sift DCGs and DCLs according to regulation knowledge (i.e. TF-to-target) which will be introduced in the section of 'Dataset'.
If a DCG is a TF, it is intuitively speculated that its related differential co-expression may be attributed to the change of its regulation relationships with its targets. So this type of DCGs are termed differential regulation genes (DRGs). Besides if the upstream TFs of a DCG is identified, that DCG is possibly a differentially regulated target of an implicated regulator, and so such DCGs are also kept in the set of DRGs.
If a DCL happens to be a TF-to-target relation, we highlight this DCL because it is the direct attribution to differential regulation. This type of DCLs are termed 'TF2target DCL'. On the other hand, if there are one or more common TFs regulating the two genes of a DCL, we also give priority to this DCL because the change in the expression correlation of the two genes could be attributed to the disruption of their co-regulation by the common TFs. This type of DCLs are termed 'TF bridged DCLs'. TF2target DCLs and TF bridged DCLs, therefore, together form the set of differentially regulated links(DRLs).

DRplot for visualizing differential co-expression and regulatory relationships
We built a function DRplot to display combined information of DCGs/DCLs, DRGs/DRLs and TF-to-target. DRplot generates DRL-centered networks. Due to the definite of DRL, TFs, TFs' regulation links and DCGs were involved to form two heterogeneous networks which are 1): TF2target DCL-centered network ( Figure 2) and 2): TF bridged DCLcentered network ( Figure 3). In both networks, we rely on different node shapes to differentiate TFs and non-TFs (square for TFs, circle for non-TFs), different node colors to categorize genes (pink for DCGs, blue for non-DCGs, gray for TFs which are not tested in expression microarray data and therefore cannot be determined as DCGs or not), and different edge types to express different relations of gene pairs (solid for DCLs, dashed for non-DCLs; edges with arrow indicate TF-to-target relations).
In addition, DRplot allows user to delimit a sub-network around a predefined set of genes of interest ( Figure 4 as an example of TF bridged DCL-centered sub-network). DRLs in TF2target DCL-centered sub-network were extracted from whole TF2target DCLs when interested gene(s) was/were either gene of a TF2target DCL. In TF bridged DCLcentered sub-network, DRLs were kept when predefined gene(s) was/were either gene of a TF bridged DCL or the common TF. Meanwhile corresponding regulation links which regulated by common TF were also extracted.

DRrank for ranking regulators
DRrank is implemented for ranking potential TFs in terms of their relevance to the phenotypic change or biophysical process of interest. It contains three methods: RIF , TED, and TDD. The latter two methods were proposed by us firstly in this package.
TED, short for 'Target Enrichment Density', employs Binomial Probability model to quantify the enrichment of a TF's targets in the DCG set, and as such to evaluate which regulators are more likely to be subject-relevant or even causal. Suppose we sift K DCGs from expression profile which contains N genes (there, K and N must have available expression data and were covered by TF2target library). If T F i has T i targets in regulation knowledge, there should be T i * K/N DCGs appeared in T F i targets list randomly. Actually, it is found that T I DCGs are included in T F i 's targets list. The larger T I than T i * K/N is, the more targets of T F i enriched, the more likely T F i is a relevant or causative regulator.
Following is TED formula.
Taking the simplified scenario of 13 genes and 23 links in Figure 4 as an example, suppose this expression profile (GSE17967, downloaded from GEO) tested 12632 genes, and 1052 DCGs identified after DCEA. If EGR1 has 4 targets in TF-to-target knowledge, EGR1 should have 4 * 1052/12632 DCG targets by chance, but the real number is 3. So we take TED formula to calculate TED(EGR1)=−log 2 sum 4 .34351. TDD, short for 'Targets' DCL Density', uses Clustering Coefficient to quantify the density of DCLs among a regulator's targets, and so to judge the importance of a TF. Suppose that T F i has n targets, and that there are k DCLs among these targets. A larger k means more DCLs are bridged by the common T F i . We intuitively assume that, if a TF bridged more TF bridged DCL it is of more importance (even if the regulator is not a DCG). Based on this hypothesis, we employ Clustering Coefficient formula to calculate TDD as follow: Again, same example like in TED (Figure 4), EGR1 has 3 DCLs among 4 targets, TDD(Egr-1)=2*3/4(4-1)=0.5. Of note even though no expression data is available for a TF, its TED and TDD could still be calculated only if the expression level of its targets are measured.
RIF method, short for 'Regulator Impact Factor', simultaneously integrates three sources of information: (i) the extent of differential expression; (ii) the abundance of differentially expressed genes, and (iii) differential co-expression between TF and its differentially expressed target genes to assess which TFs are consistently most differentially co-expressed with the highly abundant and highly differentially expressed genes Hudson and etal.,2009).
where n de means the number of DEGs, e1 (e2) means the expression value of DEG j in condition 1 (condition 2), r1 ij (r2 ij ) means the correlation of T F i and DEG j in condition 1 (condition 2). To evaluate the statistical significance of scores which derived by our novel TED and TDD methods, we performed a permutation test, in which we randomly constructed the number of TF aŕs targets-sized pseudo targets for each TF, calculated the new TED scores and TDD scores. This target permutation was repeated many times (Repeat times can be decided by user via a parameter, Permutation_Times, the default value is 0. If Permuta-tion_Times equal to 0, it indicate that there is no permutation.), and a large number of permutation-generation TED scores and TDD scores formed an empirical null distribution respectively. The p-value and FDR of TED or TDD for each TF can then be estimated.

Dataset
DCGL 2.0 includes five datasets: exprs, tf, tf2target, exprs_design and intgenelist. exprs, contains 1000 genes and 63 samples, is a sub-dataset from a real microarray data (GSE17967) from GEO (http://www.ncbi.nlm.nih.gov/geo/). exprs_design, required by DRrank, elucidates the experiment design of the exprs. tf and tf2target, obtained through processing relevant data (TFbsConFactors.txt and TFbsConsSites.txt) from UCSC hg18, contain 215 human Transcription Factors (TFs) and 214607 TF-to-target relationships. First, two files, TFbsConFactors.txt and TFbsConsSites.txt, were downloaded from UCSC hg18 (http://genome.ucsc.edu/). TFbsConsSites gives predicted chromosomal coordinates of TF binding sites(TFBSs) on human, mouse and rat genes, while TFbsConfactors.txt links the internal TF accessions to SWISS-PROT IDs. Then, these SWISS-PROT IDs were further converted to NCBI gene IDs via BioMart (http://www.ebi.ac.uk/biomart/), and NCBI aŕs homologene.data file was used to find the human homologs of mouse and rat TFs, enabling us to compile an enlarged set of human TF-TFBS relationships. After that, we downloaded gene coordinate information (refGene.txt file), which specifies the chromosomal locations of 18620 human genes. The promoter region of each gene [from 1 kb upstream of the transcription start site (TSS) to 0.5 kb downstream of the TSS] was scanned for the TFBSs identified in the above TF-TFBS relationships. If an occurrence of a certain TFBS was found, the corresponding TF was linked with that gene. In this way, we developed a set of TF-to-target regulatory relationships. In addition, we retrieved TF target information from another source, the TRED database (http://rulai.cshl.edu/TRED/), which collects mammalian cis-and trans-regulatory elements, accompanied by experimental evidence. Finally, tf2target (TF-to-target) included 214607 binary tuples involving 215 human TFs and 16863 targets (Tu and etal.,2009). intgenelist data is a sample set of user-interested genes, and is required by DRplot to plot sub-networks. exprs was designed to study gene expression in cirrhotic tissues with (N=16) and without (N=47) HCC. So we firstly divide exprs into two parts corresponding to condition 1 (exprs.1) and condition 2 (exprs.2) respectively. q.method = c("BH", "holm", "hochberg", "hommel", "bonferroni", "BY", "fdr") Link filter methods(rLinkfilter, percentLinkfilter and qLinkfilter) are wrapped in DCp with available parameter 'link.method'. Correlation coefficient methods are also given a option by 'r.method'. So is 'q.method' for adjusting p value methods.

Gene filtration
Parameter 'N.type' is used for choosing the permutation type. If 'N.type' is set to 'pooled', that means pooling all the dC together to form a null distribution and estimate corresponding statistical significance (p-value) against null statistics. If 'N.type' is set to 'gene by gene', that means calculating p-value of a gene only against this gene's null distribution of dC.
The 'DCp.res' ia a matrix of all genes with 'dC' column, 'link' column (degree in coexpression networks), 'p.value' column and 'q.value' column. If we set N=0, no permutation has been done, and in this case the 'p.value' and 'q.value' are <NA>.

DCsum: Summarizing DCGs and DCLs
We implemented DCsum to summarize DCGs and DCLs from 'DCp.res' and 'DCe.res'. DRGs, DRLs, DCG2TF, TF_bridged_DCL, DCGs and DCLs, six components comprise 'DRsort.res'. 'Upstream TFofDCG' and 'DCGisTF' columns were added to the list of DRsort.res$DRGs to display the differential regulation genes and differential regulated genes. 'common.TF' and 'internal.TF' columns were added to the list of DRsort.res$DRLs to identify two type of differential regulated links. Lists of DRsort.res$DCGs and DRsort.res$DCLs contain all the genes and links came out from DCsum, and were annotated regulation information whenever available. And more details were displayed in DRsort.res$DCG2TF and DRsort.res$TF_bridged_DCL for the ease of follow-up investigation.
If 'type' is set to 'TF2target DCL' or 'TF bridged DCL', DRplot only plots the chosen network. If 'type' is set to 'both', two networks will be plotted. However, total information of DCGs/DCLs and DRGs/DRLs are not always needed. DRplot gives 'intgenelist' parameter which represents a group of interested gene symbols for user to delimit a sub-network.

DRrank: Ranking regulators
DRrank implements three approaches to form a potential rank to show which regulators are more relevant to a phenotypic change or biophysical process in these conditions of expression profiles.
> data(tf) > data(tf2target) > data(exprs_design) > DRrank.res <-DRrank(exprs, exprs.   Figure 4: Visualization of TF bridged DCL-centered sub-network delimited by predefined gene list. The entire GSE17967 was used as sample dataset, the predefined gene was 'A2M'. Nodes represent genes and edges represent DCLs or TF-to-target (see symbol illustration). 6 List of abbreviations used DEA: differential expression analysis DCEA: differential co-expression analysis DCG: differentially co-expressed gene DCL: differentially co-expressed link DRA: differential regulation analysis DRG: differentially regulated gene DRL: differentially regulated link LRC: Log Ratio of Connectivity ASC: Average Specific Connectivity WGCNA: Weighted Gene Co-expression Network DCp: Differential Co-expression profile DCe: Differential Co-expression enrichment GSCA: Gene Set Co-expression Analysis RIF: Regulatory Impact Factor TED: Targets Enrichment Density TDD: Targets aŕ DCL Density