SMITE: an R/Bioconductor package that identifies network modules by integrating genomic and epigenomic information

Background The molecular assays that test gene expression, transcriptional, and epigenetic regulation are increasingly diverse and numerous. The information generated by each type of assay individually gives an insight into the state of the cells tested. What should be possible is to add the information derived from separate, complementary assays to gain higher-confidence insights into cellular states. At present, the analysis of multi-dimensional, massive genome-wide data requires an initial pruning step to create manageable subsets of observations that are then used for integration, which decreases the sizes of the intersecting data sets and the potential for biological insights. Our Significance-based Modules Integrating the Transcriptome and Epigenome (SMITE) approach was developed to integrate transcriptional and epigenetic regulatory data without a loss of resolution. Results SMITE combines p-values by accounting for the correlation between non-independent values within data sets, allowing genes and gene modules in an interaction network to be assigned significance values. The contribution of each type of genomic data can be weighted, permitting integration of individually under-powered data sets, increasing the overall ability to detect effects within modules of genes. We apply SMITE to a complex genomic data set including the epigenomic and transcriptomic effects of Toxoplasma gondii infection on human host cells and demonstrate that SMITE is able to identify novel subnetworks of dysregulated genes. Additionally, we show that SMITE outperforms Functional Epigenetic Modules (FEM), the current paradigm of using the spin-glass algorithm to integrate gene expression and epigenetic data. Conclusions SMITE represents a flexible, scalable tool that allows integration of transcriptional and epigenetic regulatory data from genome-wide assays to boost confidence in finding gene modules reflecting altered cellular states. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1477-3) contains supplementary material, which is available to authorized users.


P-value combination methods
Of the available methods used in meta-analyses to combine p-values, the most common methods are implemented in SMITE. Each method has strengths and weaknesses and may be useful depending on the type of combined effect that is most interesting to the user. For an explanation of the implemented methods, we consider combining K p-values, p 1…k . The generalized form for combining K p-values from independent experiments in a meta-analysis is: where wi represents weights and H is a transformation of p-values [2].
(1) Stouffer's method [3] first applies the inverse standard normal CDF transformation of each p i such that: and then calculates a combined statistic as: Stouffer's method is convenient because one can easily include weights for the individual components w i…k [4] such that: (4) (2) Fisher's method [5] is another straightforward way to combine p-values where: Fisher's method is optimal when trying to assess the joint significance between nodes in an interaction network, which is generally a sum between the node scores. There is some evidence that Fisher's method can lose power when there are a few large p-values compared to the rest of the p-values [6].
(3) Sidak's Adjustment [7] is equivalent to taking the most significant effect within a region, which is a common practice in genomics research, but it includes an additional penalty for the number of p-values considered such that: (6) (4) The binomial method is an intuitive approach to combining p-values that relies on finding the probability under a Binomial distribution of finding significant p-values given a series of tests. By defining a threshold, α, and finding the total number of p-values less than or equal to alpha, we calculate the probability under a binomial distribution of finding the result or a more extreme result such that:

Cholesky decomposition for correlated p-values
The existence of the Cholesky decomposition for a correlation matrix is known so that given a symmetric, positive, and definite correlation matrix, $3 , there is an upper triangular matrix with positive diagonal entries, C ij , so that $3 =C ij T C ij . In the literature, one major use of this decomposition has been to correlate random variables that are independent. One example comes from Hoyland, Kaut, and Wallace [8], where they state that given , an n-dimensional N (0, 1) random variable with a correlation structure indicating mutual independence, then the = C is an n-dimensional N (0, 1) random variable with correlation matrix that depends on C and is not longer mutually independent. By rearranging this equation it is apparent that mutually independent random variables may be achieved by finding the product of inverse of the Cholesky decomposition of a correlation matrix and a N(0,1) random variables that are assumed to be dependent, +) = . Thus, as the inverse standard normal CDF transformation +) 1 − . /89 0 = $3: results in N(0,1) distributed random variables, when considering correlated p-values we can use $3 +) $3: = $3: * to find a mutually independent random variables. This use of the Cholesky decomposition has been previously demonstrated [6].

Notation used to explain SMITE algorithm
For the theory behind SMITE, we use {ijk} to denote each p-value associated with an interval and gene, e.g. one p-value in a specific gene's promoter. We use {ij} to denote each interval associated with a particular gene, e.g. a single combined p-value representing all p-values within a specific gene's promoter. We use {.j} to denote each interval as it associates with all genes, e.g. a single weight associated with all promoters. We use {i} to denote each gene. Therefore, we define genes as G i for i in 1,2…I and genomic intervals as R ij for j in 1,2…J related to G i (e.g. a specific gene's promoter and body). Within each R ij , we find the N overlapping p-values, p ijk for k in 1…N ij . Weights w. j are defined for each R. j ,

Program runtime and requirements
We benchmarked the running time for SMITE on a Windows machine with 4 GB of RAM at approximately 45 minutes, while the running time on a high performance computing Linux cluster was approximately 30 minutes, depending on the number and resolution of loaded modifications. SMITE is a pipeline with multiple steps, so the runtime is not necessarily reflective of the actual analysis time, which would involve more user interaction with the data. Memory intensive processes like the spin-glass algorithm may fail unless the system has enough free RAM, which in our tests required roughly 1.6 GB. On memory-poor systems, we have found that this sometimes requires saving the working data set object, freeing up memory on the system and within R, and loading the object again before running network algorithms in SMITE.

Infection of human foreskin fibroblasts (HFF) with Toxoplasma gondii (T. gondii)
To benchmark SMITE, we obtained a large multifaceted genomics dataset from a controlled experiment studying the genomic effects on human foreskin fibroblasts (HFF) following infection by T. gondii. This dataset is part of a separate manuscript in preparation, and will be made available as a public resource after manuscript submission. The HFFs were obtained from ATCC CRL-1634 Hs27 LOT:4012886. After being received from ATCC, they were labeled as P16 (Passage 16) and a lab stock was created that was labeled P17. All experiments were done using P17-P20, and HFFs were discarded after P20. HFFs were grown in Dulbecco's modified Eagle medium (DMEM; Gibco) supplemented with 10% fetal bovine serum (FBS; HyClone), 100 U/mL penicillin (Gibco), 100 µg/mL streptomycin (Gibco) and 2 mM L-glutamine (HyClone) and were maintained at 37°C with 5% CO 2 . T. gondii type I tachyzoites (RH) were repeatedly passaged with HFFs until the host infections were synchronized. Following synchronization of the infection, the tachyzoites were released by passing the infected cells through a 25 gauge needle three times and centrifuging at 3000 rpm for 8 minutes. Next, 75 cm 2 flasks containing confluent HFFs were infected using a multiplicity of infection (MOI) of 3. After 24 hours, the proportion of infected host cells per flask was calculated by identification of parasite rosettes adjacent to the nuclei of the infected host cells using a light microscope, and quantifying the proportion of cells showing this pattern. When flasks were found to have at least 80% infected HFFs, they were harvested by scraping. An uninfected flask containing cells to be used as controls was also harvested in parallel. The harvested cells were centrifuged at 1,300 rpm for 5 minutes. Cells were harvested such that both RNA and DNA could be extracted from the same flask for each biological replicate, and three replicates of uninfected and infected HFFs were harvested in total. We expect that these intracellular parasites could alter multiple host cellular pathways, especially pathways related to infection, inflammation, metabolism and host cell cycle [9,10].
Genomic DNA was extracted from cells using a protocol developed by the Einstein Epigenomics Facility. The cells were incubated in 10 ml of a DNA extraction buffer (10 mM Tris-HCl (Fisher), 0.1 M EDTA (Sigma-Aldrich), 0.5% SDS (Sigma-Aldrich), and 10 μl of 20 mg/mL RNaseA (NEB)) at 37°C for one hour, then incubated with 50 μl Proteinase K (Life Technologies) at 50°C overnight. Next, 10 ml of saturated phenol (Fisher) was added to the DNA extraction buffer and mixed slowly at room temperature for 15 minutes then centrifuged at 3000 rpm for 10 minutes at room temperature. The supernatant was transferred to a new 50 ml falcon tube, and this process was repeated twice more with saturated phenol. The process was repeated three additional times substituting the phenol with chloroform (Sigma). Subsequently, the sample was pipetted into a dialysis bag (Fisher) and put sequentially into three 500 ml baths of 0.2x salinesodium citrate buffer (Fisher). Finally, the dialysis bags were placed on PEG crystals (Sigma) allowing the water to be removed by osmosis. The DNA was collected from the dialysis bag and stored at 4°C for further analysis (see HELP-tagging and HELP-GT). Gene expression (directional RNA-seq), DNA methylation (HELP-tagging [11]), and DNA hydroxymethylation (HELP-GT [12]) profiles were generated from the three uninfected and three T. gondii-infected HFF samples, using additional information about cis-regulatory element locations from ChIP-seq annotations of histone modifications of IMR90 human fibroblasts. Through simultaneous alignment of RNA-seq reads to a combined hg19-Toxoplasma genome, we find that the relative proportions of parasite and host were similar between replicates ( Figure S1). Although the RH strain of T. gondii was used in our experiments, we chose to align the reads to the ME49 v9.0 Toxoplasma reference genome, as it was more completely annotated at the time of alignment, was shown to be 97.6% identical to the RH strain [13], and is the most common strain infecting humans. Figure S1: For each HFF + T. gondii replicate we aligned the RNA-seq reads to a composite hg19-Toxoplasma genome. We assess the proportion of reads that aligned to each genome separately to demonstrate consistency between and within samples.

HELP-tagging and HELP-GT
The HELP-tagging assay was developed by our group [11] and is a high-throughput approach to assay DNA methylation genome-wide. Samples were treated with the restriction enzymes HpaII or MspI, both of which recognize a CCGG motif but have a differential ability to digest DNA depending on the presence of 5-methylcystosine (5-mc). After preparing and sequencing libraries from digested DNA, the relative number of sequencing tags between the HpaII and MspI channels indicated the relative DNA methylation at a specific locus. Using HELP-tagging, we were able to assay DNA methylation at approximately 2 million loci. We used a modification of HELP-tagging, HELP-GT, to assay genome-wide DNA hydroxymethylation [12]. HELPtagging does not discriminate between methylated and hydroxymethylated CpG dinucleotides, but the use of the bacteriophage T4 beta-glucosyltransferase (BGT) can catalyze the addition of a glucose moiety to the hydroxyl group of 5-hydroxymethlcytosines (5-hmC), which interferes with the ability of MspI to digest DNA at these CpG dinucleotides. A third comparison after BGT treatment followed by MspI treatment allows for the detection of 5-hmC levels within each sample. Both HELP-tagging and HELP-GT were performed on the control and infected T. gondii samples. The resulting significance is obtained from two-sided T-testing with 4 degrees of freedom.

RNA-seq
Directional RNA-seq was performed on the uninfected and T. gondii-infected samples. Total RNA was extracted from each of the samples using a Trizol protocol. Cells were pelleted and resuspended in ice-cold PBS (Fisher). Following this, 1 ml of the Trizol reagent (Life Technologies) was added, with 5 minutes of incubation at room temperature and centrifugation to remove cell debris. The supernatant was transferred to new tube and 0.2 ml of chloroform (Sigma) was added and the tube was centrifuged. The aqueous phase was transferred and added to 0.5 ml of isopropyl alcohol. The samples were incubated at room temperature for 10 minutes and centrifuged. The RNA pellet was washed with 75% ethanol, air-dried, and resuspended in nuclease-free water. The extracted RNA was then depleted for ribosomal RNA using the Ribo-zero kit (Epicentre Biotechnologies). After sequencing the mRNA-enriched library, HTSeq was used to calculate the total counts at each human gene, and the Bioconductor package DESeq was used to compare relative expression between uninfected and T. gondii infected samples using negative binomial testing. For the purposes of this study, we examined only the human host gene expression and epigenome, querying for changes induced in the host by T. gondii infection.

Defining cis-regulatory elements in human fibroblasts
The histone modification profiles of a human fibroblastic cell type, IMR-90, assayed by the ENCODE project [14] were obtained. We used the H3K4me1, H3K27ac, and H3K27me3 modifications to define cis-regulatory elements in this cell type. We processed the data using an adaptation of an imaging signal processing algorithm to define the locations of chromatin constituents with minimal data transformation [15,16]. We defined genomic context relative to the gene transcription start site (TSS) by the criteria defined in Table S1.

Incorporating a priori hypotheses into SMITE analysis
SMITE incorporates three different points within its pipeline where the user can alter the final outcome dramatically depending on a priori information. First, users can include a priori information about how modifications within a specific genomic context should be weighted. Distance from the TSS is a popular method to weight epigenetic effects, given that epigenetic states may reflect transcription factor (TF) binding, and TFs are often assumed to maintain a relationship with a gene within a specific distance from the gene's TSS [17]. Alternatively, as is the case in other network analysis tools like Epimods [18], a researcher may only focus on the most significant p-value within a particular genomic context, although SMITE allows the use of the Sidak correction that accounts for multiple comparisons. Second, users can control a priori information about the relationship between transcription and a modification given a particular genomic context. Though the relationship between transcription and epigenetic modifications can be complicated, some relationships have been studied extensively and are generally found to be true like DNA methylation in gene promoters being associated with transcriptional silencing [19], whereas DNA methylation in a gene body is associated with increased expression [20][21][22]. By giving the user has the option to define a relationship for each modification-genomic context pairing, SMITE results in functional modules enriched for more easily interpretable gene effects. Third, when scoring nodes, a linear combination of weights can be provided to address a priori research goals defining the relative importance of a component score toward the final scores and functional modules (e.g., functional modules enriched for relating transcription and enhancer modifications).  Table S2. To score the T. gondii HFF dataset, the relationship between each modification and gene expression can be indicated in a genomic context-specific manner. Here, DNA methylation at gene promoters will have an inverse relationship with expression, and DNA methylation at gene bodies will have a positive relationship with gene expression. All other effects will be maximized regardless of their direction. In addition, the weighting vectors indicate that the modules we detect in the SMITE-Full model will be driven by gene expression, DNA methylation and DNA hydroxymethylation at enhancers, and they were chosen so that the sum of the weights would be 1. The modules that we detect in the SMITE-Reduced model will be driven by gene expression and promoter DNA methylation. Figure S2: To explore the effect of using altered weights, w ijk , we requested that no weights be used and compared it to weighting by distance. From the overall scores derived from pvalues and the R 2 =0.99 it is apparent that the overall scores are very concordant and the effect of the weight choice is minimal. Figure S3: To explore the effect of using altered weights, w .j , we fixed all weights except for one which was varied between 0.01, a value that down weights the component severely, and 0.4, a value that makes the component the dominant one in the model. Then for each iteration, we requested the highest scoring genes (genes that were above the random null score distribution) and showed their scores. Finally, we show for each iteration the proportion of high scoring genes under the original model (weights shown in red) that remain. For each component, there is a subset of genes that remains high scoring despite the choice in weight, and a subset of genes that emerges as more weight is placed on the component.   Figure S6: Because a KS-test between SMITE module genes (roughly 500) and SMITE scored genes (roughly 23,000) will generally be significant when compared with a KS-test between FEM module genes (roughly 200) and FEM scored genes (roughly 6,000), we instead used a sampling approach to determine the KS-test significance. For FEM, the reduced SMITE model (SMITE-R), and the full SMITE model (SMITE-F), we computed 10,000 random samples of equal size to the identified modules and plot the significance. On average, FEM module genes are not statistically different when compared to all FEM genes; whereas, SMITE-R and SMITE-F achieve a greater statistical significance. Figure S7: We found the number of HpaII sites that were assayed for DNA methylation and DNA hydroxymethylation, and thus the number of p-values, associated with each gene, and we looked for the distribution of the number of p-values associated with high scoring genes for all genes, SMITE-F genes, SMITE-R genes, and FEM genes. We find that the SMITE-F model outperforms the FEM model in terms of bias toward genes associated with more p-values, having a statistically significant different distribution (KS-test p-value <10 -12 ). #fill in modification data Toxo_annotation <-annotateModification(Toxo_annotation, methylation, weight_by_method="Stouffer", weight_by=c(promoter="distance", body="distance", active_enhancers="distance", poised_enhancers="distance"), verbose=TRUE, mod_corr=TRUE)

Appendix 2. R code for analyzing T. gondii HFF dataset with FEM
#load FEM and dependencies library("igraph") library("marray") library("corrplot") library("graph") library("AnnotationDBI") library(FEM) #load a converting package because FEM requires Entrez IDs sym2eg<-AnnotationDbi::as.list(org.Hs.eg.db::org.Hs.egSYMBOL2EG) #Using a combination of bedTools and R we assembled for DNA methylation: #1) average of all effects within 200 bp from a gene TSS #2) if no effects were found, the average of effects over the first exon #3) if no effects were found, taking the average over 1,500 bp around the TSS #Read methylation data into R and merge it to make statM statM.1<-read.